Thursday, July 20, 2006

Codeline Flow, Availability and Throughput

There has been an interesting discussion on codeline build+commit contention on the XP yahoogroup initiated by Jay Flowers' post about a proposed build contention equation ...

The basic problem is that there have been some commit contention issues: someone is ready to commit their changes, but someone else is already in the process of committing changes and is still building/testing the result to guarantee that they didn't break the codeline. So the issue isn't that they are trying to merge changes to the codeline at the same time; the issue is that there is overlap in the time-window it takes to merge+build+test (the overall integration process for "accepting" a change to the codeline).

Jay is being very agile to the extent that he wants to promote and sustain "flow" on the codeline (see my previous blog-entry on the 5 Cs of Agile SCM Codelines). He is looking at the [average] number of change-packages committed in a day, and taking into account build+test time, as well as some preparation and buffer time. Here the "buffer time" is to help reduce contention. It makes me think of the "buffer" in the Drum-Buffer-Rope strategy of critical-chain project management (CCPM) and theory-of-constraints (TOC).
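
Jay's actual equation is in his post, so I won't try to reproduce it here, but a crude back-of-the-envelope sketch of the same idea might look something like the following. All of the numbers, and the queueing assumption (roughly random arrivals against a single shared integration "window"), are mine rather than Jay's:

    # Back-of-the-envelope codeline contention sketch (all values assumed):
    # commits arrive at a single shared "integration window" that is held
    # for the whole merge+build+test (plus buffer) duration of each commit.

    commits_per_day = 8.0       # avg change-packages committed per day (assumed)
    hours_per_day = 8.0
    integrate_hours = 0.4       # merge + build + test window per commit (assumed)
    buffer_hours = 0.1          # prep/buffer time per commit (assumed)

    arrival_rate = commits_per_day / hours_per_day   # commits per hour
    window = integrate_hours + buffer_hours          # hours the codeline is "busy" per commit

    utilization = arrival_rate * window              # fraction of time the codeline is busy
    print("codeline utilization: %.0f%%" % (utilization * 100))

    # With roughly random (Poisson-ish) arrivals, an M/M/1-style estimate of
    # the average wait before a developer can start their own commit window:
    if utilization < 1.0:
        avg_wait_hours = (utilization * window) / (1.0 - utilization)
        print("average wait to commit: %.0f minutes" % (avg_wait_hours * 60))
    else:
        print("commits arrive faster than they can be integrated -- the queue only grows")

The interesting part (and, I suspect, the point of Jay's buffer time and the Drum-Buffer-Rope analogy) is the last bit: as utilization creeps toward 100%, the wait to commit blows up nonlinearly, so you either have to shrink the merge+build+test window or throttle the commit rate.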

Several interesting concepts were mentioned that seem to be closely related (and useful):
If we regard a codeline as a production system, its availability to the team is a critical resource. If the codeline is unavailable, it represents a "network outage" and a critical block/bottleneck in the flow of value through the system. This relates to the above as follows:
  • Throughput of the codeline is the [average] number of change "transactions" per unit of time. In this case we'll use hours or days. So the number of change-tasks committed per day or per hour is the throughput (note that the "value" associated with each change is not part of the equation, just the rate at which changes flow through the system).

  • Process Batch-size is the set of changes made for a single change-task to "commit," and ...
  • Transfer Batch-size would be the number of change-tasks we allow to be queued-up (submitted) prior to merging+building+testing the result. In this case, Jay is targeting one change-task per commit (which is basically attempting single-piece flow).

  • Processing-time is the average duration of a development-task from the time it begins up until it is ready-to-commit. And ...
  • Transfer-time is the time it takes to transfer (merge) and then verify (build+test) the result.

  • Takt time in this case would regard the developers as the "customers" and would be (if I understand it correctly) the [average] number of changes the team can complete during a given day/hour if they didn't have to wait around for someone else's changes to be committed.

  • System outage would occur if the codeline/build is broken. It could also be unavailable for other reasons, such as if the corresponding network, hardware, or version-control tool were "down", but for now let's just assume that outages are due to failure of the codeline to build and/or pass its tests (we can call these "breakages" rather than "outages" :-)

  • MTTR (Mean-time to repair) is the average time to fix codeline "breakage," and ...
  • MTBF (Mean-time between failures) is the average time between "breakages" of the codeline.
Note that if full builds (rather than incremental builds) are used for verifying commits, then build-time is independent of the number of changes. Also note that it might be useful to capture the [average] number of people blocked by a "breakage," as well as the number of people recruited (and the total effort expended) to fix it. That will help us determine the severity (cost) of the breakage, and whether we're better off having the whole team try to fix it, just one person (ostensibly the person who broke it), or somewhere in between (maybe just the set of folks who are blocked).
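
To make the service-model concrete, here is another tiny sketch (again, all the sample numbers are made up) that turns MTBF/MTTR into availability, an approximate cost-per-breakage, and an effective throughput:

    # Rough service-model numbers for a codeline (all sample values assumed).

    mtbf_hours = 40.0    # mean time between codeline "breakages" (assumed)
    mttr_hours = 2.0     # mean time to repair a breakage (assumed)
    devs_blocked = 4     # avg number of people blocked per breakage (assumed)
    nominal_commits_per_day = 8.0   # throughput when the codeline is healthy (assumed)

    availability = mtbf_hours / (mtbf_hours + mttr_hours)
    print("codeline availability: %.1f%%" % (availability * 100))

    # Cost of a breakage in person-hours, assuming the blocked developers are
    # idle (or badly context-switched) for the full repair window:
    print("approx. cost per breakage: %.1f person-hours" % (mttr_hours * devs_blocked))

    # Effective throughput: nominal commits/day scaled by availability.
    print("effective throughput: %.1f commits/day" % (availability * nominal_commits_per_day))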

Anyway, it's an interesting service-model of codeline availability and reliability for optimizing the throughput of codeline changes and maximizing collaborative "flow."

Has anyone ever captured these kinds of measures and calculations before? How did you decide the desired commit-frequency and how did you minimize build+test times? Did you resort to using incremental builds or testing?

I think that giving developers a repeatable way of doing a private development build in their workspace, even if it's only incremental building+testing, gives them a safe way to fail early+fast prior to committing their changes, while sustaining flow.
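
As a sketch of what I mean (the build and test commands below are just placeholders for whatever your project actually uses), a repeatable private build can be as simple as a little wrapper script that every developer runs the same way:

    #!/usr/bin/env python
    # Minimal sketch of a repeatable "private development build" run in the
    # developer's own workspace before committing. The build/test commands
    # are placeholders -- substitute your project's real ones.
    import subprocess
    import sys

    STEPS = [
        ("incremental build", ["make"]),          # placeholder: your build command
        ("fast test suite",   ["make", "test"]),  # placeholder: your (possibly incremental) tests
    ]

    def private_build():
        for name, cmd in STEPS:
            print("--- %s: %s" % (name, " ".join(cmd)))
            if subprocess.call(cmd) != 0:
                # Fail early+fast, before the change ever reaches the codeline.
                print("private build FAILED at step: %s" % name)
                return 1
        print("private build passed -- safe(r) to commit")
        return 0

    if __name__ == "__main__":
        sys.exit(private_build())

The point isn't the particular commands; it's that the private build is the same set of steps the commit/integration build will run, so "it worked in my workspace" actually means something.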

I don't particularly care for the practice "build-breaker fixes build-breakage." At the very least I think everybody who is blocked should probably try to help (unless the number of people blocked is more than the recommended size of a single team), and I'm sure the person who broke the build probably feels bad enough for causing the blockage (maybe even more so if multiple people help fix it). I think the build-breaker should certainly be a primary contributor to fixing the build, and may be most familiar with the offending code, but they may need some help too, since they might not be as familiar with why/how the breakage happened in the first place, given that it slipped past them (unless of course it was "the stupid stuff" - which I suppose happens quite a bit :-)

So is anyone out there measuring the serviceability, availability, and reliability of their codelines? Are any of you using these other concepts to help balance the load on the codeline and maximize its throughput? I think that some of the more recent build-automation tools (BuildForge, Maven, MS Build + TeamSystem, ParaBuild, etc.) on the market these days could help capture this kind of data fairly unobtrusively (except for maybe MTTR, and the number of people blocked and the people+effort needed to effect the repair).

2 comments:

Anonymous said...

Brad,

I realize that I am responding to a post that is five months old, but better late than never :)

I think the idea of blocking the team from submitting changes until the codeline is fixed, or any kind of serialization based on some metrics, doesn't scale well for bigger teams and/or heavily loaded code lines. This is mainly because those wanting to submit their changes would have to wait until others are done. The delay grows proportionally to the number of changes and is generally a function of the size of the team.

Instead, our Parabuild follows what I see as an optimistic integration model, or a "faithful commit model" for lack of a better term. A member of the team builds and runs tests clean locally, then syncs to the last state of the code line known to be clean (Parabuild provides this information) and repeats the clean build and test. After ensuring that everything builds and tests cleanly, she submits her changes. Note that this approach is not concerned with the state of the code line head at all.

The key assumption here is that if the head of the code line is broken, someone is taking care of it. The integration build ensures that new changes integrate on a continuous basis. If by the next build cycle there is more than one change (including yours), they will be built as a group.

As you can see, this way no one is waiting and commits are made as soon as changes have been confirmed clean locally. This is a very efficient and, I believe, the fastest approach to managing the code base.

Regards,

Slava Imeshev

Brad Appleton said...

Hi Slava! I think you may have misunderstood what I wrote above. I didn't suggest "blocking" the team until the build is fixed, I suggested that everyone who is "blocked" by the broken build should perhaps contribute to fixing the build.

I'm aware of several approaches that attempt to continue "submitting" changes even when the build is broken - these are different from the debates about synchronous -vs- asynchronous integration.