People dunked on this tweet, saying, in essence, "This isn't 100% correct - you shouldn't pay attention." But that misses the point. The value of any model is that it's simpler than reality so that you can gain insight. Here are the insights I have gained from this model.
Fred Brooks first put forth the idea that adding people to a late project makes it later, and stated that the pairwise communication was the real killer. Note that he was only talking about adding people to a late project - more on this later. But first, a digression (or two)!
Around the time I started QLDB, I read about the Universal Scalability Law via the always-informative @mbrooker (https://brooker.co.za/blog/201...). This law extends Amdahl's Law to explain why adding more processors to a task can make the task take longer.
As a refresher, Amdahl's Law says that if you use N processors, the throughput relative to a single processor is N/(1 + α(N - 1)), where 0 <= α <= 1. Amdahl's Law is usually stated in terms of latency, but I'm going to use the throughput formulation, as I find it more useful.
Throughput is more useful when analyzing distributed systems with many servers. I am more interested in how continuously-arriving tasks are handled by the service as a whole as opposed to analyzing the latency of a single task with serial and parallel portions.
The USL extends Amdahl's Law with a second coefficient: N/(1 + α(N - 1) + βN(N - 1)). It's this beta coefficient that models the negative returns to adding additional processors. The intuition behind the parameters is as follows.
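To make the shape of the curve concrete, here's a minimal sketch. The α and β values are illustrative, not measured; note that with β = 0 the USL reduces to plain Amdahl's Law.

```python
def throughput(n, alpha, beta=0.0):
    """Throughput relative to one processor under the USL.

    With beta = 0 this is exactly Amdahl's Law (throughput form).
    """
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

# Illustrative parameters: 5% serial fraction, small coherency cost.
for n in (1, 8, 32, 64, 128):
    amdahl = throughput(n, alpha=0.05)
    usl = throughput(n, alpha=0.05, beta=0.001)
    print(f"N={n:3d}  Amdahl={amdahl:6.2f}  USL={usl:6.2f}")
```

Under Amdahl's Law, throughput only flattens out (approaching 1/α); with β > 0, the USL curve peaks and then actually declines — the "more processors make it slower" regime.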
Alpha is the fraction of the workload that is single-threaded. Gunther (who created the USL) calls it "contention." So if every request has to go through a single-threaded authorization process, and that takes 5% of total request time, α=.05.
Beta is Gunther's contribution to the model. He calls it "coherency." It acknowledges that some things are worse than a bottleneck. One example is cache coherency slowdown in multi-core systems.
In a distributed system, this term comes from gathering consensus. It's why nobody implements naïve Paxos, where any node can propose a new state. If multiple nodes make simultaneous proposals, you need extra communication to determine the winner.
For high-throughput distributed consensus, you first elect a leader as the sole proposer, and pay the expensive coherency cost only when a new leader needs to be elected. In the equation, more nodes mean more potential for conflicting proposals, i.e. a higher beta.
What does all of this have to do with the Mythical Man Month? The "pairwise communication" part of "adding people to a late project makes it later" is beta. Gunther himself saw this parallel and wrote about it here: http://perfdynamics.blogspot.c...
When analyzing team throughput, alpha represents the time spent in one-to-many communication (e.g. team meetings where just one person is speaking). This time is a fixed tax on each additional person's contribution to the project. Beta is the time spent on pairwise coherency.
An obvious example of pairwise coherency is standup. If every person speaks, the time goes up linearly with the number of people, and since every person is in standup, the total people-minutes consumed by standup goes up as N².
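The standup arithmetic, sketched out (the two minutes per speaker is an assumed figure, just for illustration):

```python
def standup_cost(n, minutes_per_speaker=2):
    """Total people-minutes consumed by a standup of n people.

    Each person speaks, so the meeting length grows linearly with n;
    everyone sits through the whole thing, so total cost grows as n^2.
    """
    meeting_length = n * minutes_per_speaker
    return n * meeting_length

for n in (4, 8, 16):
    print(f"{n:2d} people -> {standup_cost(n)} people-minutes")
```

Doubling the team quadruples the people-minutes burned — which is exactly why capping team size caps this cost.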
Wait, you say. Is it βN² or βN(N - 1)? This gets at the point I started with. It doesn't matter. I'm not trying to claim that by plugging in α, β, and N you can precisely compute how long a project will take if you add three new people. That's not the value of the model.
Now, for computer systems, the model can be fit statistically to predict future throughput. Unfortunately, humans are way more complicated. But that doesn't mean the model doesn't have value! I claim that by studying this model we can extract insight in order to guide action.
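For the systems case, the fitting really is straightforward. Here's a minimal sketch using a brute-force least-squares grid search — the measurements are made-up numbers for illustration, and in practice you'd use a proper nonlinear regression:

```python
def usl(n, alpha, beta):
    """Relative throughput at n processors under the USL."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

# Hypothetical (N, relative throughput) measurements of a service.
data = [(1, 1.0), (4, 3.4), (8, 5.4), (16, 6.8), (32, 6.4)]

def sse(params):
    """Sum of squared errors of the model against the measurements."""
    return sum((usl(n, *params) - t) ** 2 for n, t in data)

# Grid search: alpha in [0, 0.1] step 0.001, beta in [0, 0.01] step 0.0001.
best = min(
    ((a / 1000, b / 10000) for a in range(101) for b in range(101)),
    key=sse,
)
print(f"alpha={best[0]:.3f}, beta={best[1]:.4f}")
```

Because throughput actually drops between N=16 and N=32, no Amdahl-only fit (β = 0) can match the data — the fit is forced to find a positive β, which is the signature of coherency cost.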
When I started at Amazon in 1998, there were 60 people in tech. Not 60 SDEs. 60 people total, including DBAs, SAs, TPMs, managers, etc. We were divided into two teams of roughly 30 people. Each team met every other week and all 60 met together in the alternate weeks.
This setup didn't last. As we grew, it became impractical for us all to meet together, so we split into more and more units. But those team meetings aren't the story here - they're alpha. The real story is beta - and how it creeps in where you might not expect.
Part of the reason for Amazon's incredible success across a staggering array of ventures is our focus on pushing autonomy down as far as possible. Jeff said from the start, "I don't want to make communication more efficient - I want there to be less communication!"
So we focused on creating small teams with clear business goals. Those teams of 30 were really subdivided into smaller teams. For example, when I started, there was a "search" team with four people and a "personalization and community" team with three.
As we grew, these teams continued to split and specialize. So the three people on personalization and community became six and then split into two teams - one for personalization and one for community - allowing them to each grow to six.
By having two teams of six instead of a team of twelve, standup costs were capped, but that's not the main source of beta, it turns out. Just as in the USL applied to services, the main source of beta is coherency, or coming to consensus.
It's straightforward to take six people and split them into two disjoint teams. You just pick from the 20 possible combinations! 😉 But that doesn't magically partition either the software or the knowledge in their heads. Here's where beta really gets you.
Some of the software is now shared between the two teams. These teams have different goals and priorities. Maybe one team wants to extend some functionality to be more flexible, but that would mean the other team has to adjust how they are using the shared software.
Now the teams have to spend time deciding the best way to make the changes, how valuable they are, etc. They would have needed similar discussions as a single team, but it's harder to achieve consensus between groups with different priorities.
This is one reason that I insisted early on that our sub-teams share as little code as possible. Some of the engineers thought it was dumb that each of them was writing similar services to manage, say, the EC2 instances they needed.
"Duplicated effort" is anathema to most developers. But for a new service, when you are figuring things out and need to move quickly, the cost of consensus can be crushing. It's better to have "duplicate" efforts and join them later if you figure out they really are the same.
Obviously, you do really need to have discussions about strategy, or software design, or whatever. Gathering different perspectives is important. Letting many people have a voice is important. You just need to be conscious of the cost and pay it when it's worth it.
It's so easy in these discussions to become attached to being "right," and to continue to argue your position even if the other position is just different, or if it's unknowable which one will prove to be right. This is beta - remember that it has N² impact on productivity.
Sometimes it is important to continue to argue. We make decisions every day that have lasting ramifications. But often it's really hard to know how the decisions will play out. What you do know is that time spent coming to consensus is time nobody is producing software.
For more-senior people, especially, it can be hard to let things go when you think the path being proposed isn't the best. But try to ask yourself whether paying the cost of consensus delivers more than letting the less-senior person try it their way. You might be surprised!
Amazon has grown its tech community by an average of over 30% per year for over 20 years, so I've had a lot of opportunity to observe this process. And my observation is that it's hard for everybody to adjust to growth, regardless of their position on the team.