The Theory of Constraints (Or Why Bottlenecks Matter)

When I read about how to make programs that I write run faster, the most common advice is “find bottlenecks and eliminate them.” That sounds good on its face, but why is it true? That’s a question that I’ve thought about for a long time. Why not ignore the bottlenecks and make the normal parts of my code super fast. Shouldn’t that have the same overall effect?

For the past week or so I’ve been reading a business book that I found through a Hacker News thread called The Goal. Written by physicist turned business  “guru” Eliyahu Goldratt, the book, (as far as I have read) tells the story of a manager named Alex Rogo who turns to an old college professor for help saving his factory from being shut down.

The factory has an excess of inventory and is constantly behind schedule even after the installation of various machines that automate much of its work. Listening to his issues, Rogo’s professor shares with him what he calls “The Theory of Constraints.” Applying the lessons learned from this theory helps Rogo to save his factory.

So what is the theory of constraints and how the hell does it relate to programming? The theory of constraints, stated simply, says that, “a chain is no stronger than its weakest link.”  In the book, Rogo realizes that his factory isn’t running fast because even though robots do much of the work, there are small bottlenecks in his system that greatly impact the performance of his plant.

Understanding why these bottlenecks have such a huge impact on factory performance requires an understanding of two terms Goldratt dubs dependent events, and statistical fluctuations. Dependent events are events that must take place before other ones can take place. Put simply, they form a chain of events where Event A must occur before Event B can occur and so on.  Any program is really a series of dependent events. For example, a game loop or even the Mongrel server boot script is a series of dependent events. As for statistical fluctuations, they are just that: statistical fluctuations in each event’s ability to produce a result at a specified capacity.  So let’s say Event A is a process by which a worker screws a bolt into a car. At this worker’s capacity (or limit) is to screw 10 bolts a minute. But workers are inconsistent; sometimes he screws 6 bolts in a minute, sometimes 3. The key question is: what affect do these statistical inconsistencies have on the rest of the chain? My first thought was that statistical fluctuations shouldn’t be a big deal. I mean they do average out after all. 

Well, with a little help from a trusty Python script I wrote we can take a look. Let’s imagine we have a pin factory (sorry Adam Smith you’ve got competition). In the pin factory there are 5 stations each with one worker.  The stations are set up as a series of dependent events i.e. the worker at Station 1 must complete his job on each pin before the pin can be passed to Station 2. The worker at Station 2 must complete her job on each pin before it can be passed to Station 3. And so on. Each station has the capacity to produce 6 pins every hour.  This means that at maximum each station can produce 6 pins in an hour. However, because each station is run by a human who doesn’t make things at a constant rate it may produce anywhere between 1 pin and 6 pins in a given hour.

Assume for simplicity that no one station works at the same time as any other station i.e. Station 1 works for an hour, passes off its work to Station 2 and stops working. How many pins can we expect to be produced by the last station after 5 hours (1 cycle)?

We know that a station can produce anywhere between 1 and 6 pins in a given hour. This means that the station will produce an average of 3.5 pins per hour. Let’s check it with a Python script I wrote to replicate the workings of the factory.

Note: a station’s capacity is the most number of pins it can produce at a maximum. However if Station B has a capacity of 6 pins, but Station A has only produced 1 pin to give to Station B to work on, then Station B can only produce 1 pin.

 

>>>

Station working currently is: Station 1

Capacity of this station for this hour: 5 pins

Pins at Station 1: 5

Pins at Station 2: 0

Pins at Station 3: 0

Pins at Station 4: 0

Pins at Station 5: 0

 

Station working currently is: Station 2

Capacity of this station for this hour: 3 pins

Pins at Station 1: 2

Pins at Station 2: 3

Pins at Station 3: 0

Pins at Station 4: 0

Pins at Station 5: 0

 

Station working currently is: Station 3

Capacity of this station for this hour: 1 pins

Pins at Station 1: 2

Pins at Station 2: 2

Pins at Station 3: 1

Pins at Station 4: 0

Pins at Station 5: 0

 

Station working currently is: Station 4

Capacity of this station for this hour: 2 pins

Pins at Station 1: 2

Pins at Station 2: 2

Pins at Station 3: 0

Pins at Station 4: 1

Pins at Station 5: 0

 

Station working currently is: Station 5

Capacity of this station for this hour: 5 pins

Pins at Station 1: 2

Pins at Station 2: 2

Pins at Station 3: 0

Pins at Station 4: 0

Pins at Station 5: 1

 

Station 5 is now outputting products.

Capacity of this station for this hour: 5 pins

Pins at Station 1: 2

Pins at Station 2: 2

Pins at Station 3: 0

Pins at Station 4: 0

Pins at Station 5: 0

 

Pins still in the system: 4.

Pins produced: 1.

We can see that the total output after 1 cycle of the system is 1 pin. How can this happen? Station 1 starts by producing 5 pins. The next hour however, Station 2’s capacity is only 3 pins. Therefore 2 pins remain at Station 1 and 3 pins at Station 2. After that Station 3 only has a capacity for 1 pin. This one pin is passed from Station 3 to the last station. Thus the product of these series of operations is 1 pin, while 4 pins remain in the system because of capacity problems.

Well, you might say, that was only one cycle. If you do enough cycles it should average out to 3.5 pins per turn (1 cycle = 5 turns). Let’s look at what we get if we run 1 million cycles:

Cycles: 1000000

Pins still in the system: 3982.

Pins produced: 2993707.

As we can see we get 2,993,707 pins produced over 5,000,000 turns (we convert cycles to turns here). This means we are producing pins as a system at a rate of .59 pins per turn – well below the average we decided we ought to be producing them at.

The important point to see is that when you are dealing with dependent events, small statistical fluctuations matter a great deal in the overall efficiency of the system. And because each system is bound to have a weak link or a bottleneck, the best way to attack the efficiency of the system as a whole is to concentrate resources on that bottleneck. 

So how does this relate to programming? Well think about a game loop. A simple game loop might look something like this:

            while (user hasn’t quit)

                        check for user input

                        run AI

                        draw graphics

                        play sounds

            end  – (got this from Wikipedia)

If “run AI” runs slow, it doesn’t make any sense to rip out something that runs at normal speed like “draw graphics” and replace it with a blazing fast version of itself. This is because a game loop is made up of a series of dependent events so even if “draw graphics” is blazing fast, the total efficiency of the system is limited by its weakest link – “run AI.”  So next time you’re looking for what to optimize, go for the bottlenecks. Why? Because the Theory of Constraints says so.

You should follow me on Twitter.

Disclaimer: I’m not a mathemetician nor am I a business school student so not everything in here may be 100% correct. If you see an error let me know and I will fix it, and I apologize in advance for any that I have made.

 

 


23 Jun 2011, 5:42am | 1 comment

  • Jesse Beyroutey

    great post dude — you might actually enjoy studying optimization and ops management. interesting to note that dependencies between events may make the cause of bottlenecks less obvious. especially when you have tasks running in parallel, and asynchronously. a task that looks like it’s taking really long might be being driven because of the latency or output of something else

 

Never miss a new post



Subscribe!