Thursday, 4 June 2015

Story Points + Estimation = Confusion

When teams are implementing agile for the first time, a lot of people struggle with the use of story points (sometimes called effort points) to estimate tasks.  I've heard the following sorts of questions come up many times:
  • Why are we estimating complexity?
  • Why aren't we estimating time?
  • Does complexity = time?
  • Why are we using a Fibonacci sequence?
  • Isn't this needlessly complicated?
They're all good questions, and the fact that so many people ask them indicates that there is some serious confusion about what's going on.  In this post I'll try to clarify things a bit.  Go get a cup of coffee first though, because this post is going to be a long one.


Got your coffee?  Right, let's crack on.

Overview


Story points are units of measurement that are used to measure the complexity of a task.  The valid values that can be used are normally in a pseudo-Fibonacci sequence, e.g. 0, 0.5, 1, 2, 3, 5, 8, 13, 25.  The complexity of a task is measured during sprint planning using these story points.  An initial small high-priority task is selected from the release backlog by the team and given a story points number of 2.  This task acts as the basis for comparison with the next task on the backlog.  

The highest-priority task on the backlog is then discussed and, once everyone understands what has to be done, every member of the team decides what the complexity of the task is, relative to the first task that has 2 story points of complexity.  This decision is made silently by each individual without conferring, and then at an agreed point (normally the Scrum master saying "Is everyone ready?", or words to that effect), every member of the team displays their estimate at the same time.  Usually this is done by using planning poker cards that everyone holds up at the same time.  If everyone holds up the same card then the task is given that number of story points and added to the sprint backlog, and the team moves on to the next task on the release backlog.  If team members have held up cards with different estimates on, the team discusses the reasons for this, and then estimates again.  This process continues until either all team members agree, or there is a large majority consensus.  
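
As a rough illustration, one round of that process can be sketched in Python. This is only a sketch under stated assumptions: the function name `poker_round`, the team member names and the 80% threshold for a "large majority" are all made up for the example, since the threshold is something each team decides for itself.

```python
from collections import Counter


def poker_round(estimates, majority=0.8):
    """Return the agreed story points for one round of planning poker.

    estimates: dict mapping each team member to the card they held up.
    majority:  ASSUMED share of the team that counts as a "large majority";
               teams pick their own threshold.
    Returns None when the team needs to discuss and estimate again.
    """
    if not estimates:
        raise ValueError("at least one estimate is required")
    counts = Counter(estimates.values())
    card, votes = counts.most_common(1)[0]  # most frequently shown card
    if votes / len(estimates) >= majority:
        return card
    return None


# Hypothetical team: the first round is split, so they discuss and go again.
round_one = {"Ana": 3, "Ben": 5, "Cho": 3, "Dev": 8, "Eve": 3}
round_two = {"Ana": 3, "Ben": 3, "Cho": 3, "Dev": 3, "Eve": 3}

print(poker_round(round_one))  # None, so discuss and re-estimate
print(poker_round(round_two))  # 3
```

The function only ever sees the cards that were held up, which mirrors the rule that everyone reveals their estimate at the same time.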

Tasks are pulled off the release backlog in priority order and discussed and estimated (using the tasks that have already been estimated as the basis for comparison) until the sprint backlog is full.  Previous sprints are used to provide an average number of story points that get completed in a sprint.  Once this average number is reached the sprint backlog is considered full.
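
The filling of the sprint backlog can be sketched in a few lines of Python. The item names, point values and the `fill_sprint` function are hypothetical, and the rule of stopping at the first item that would push the total past the average velocity is an assumption for the example; some teams will prefer to stop a little earlier or later.

```python
def fill_sprint(release_backlog, average_velocity):
    """Pull estimated items off the release backlog in priority order.

    release_backlog:  list of (item name, story points), highest priority first.
    average_velocity: average story points completed in previous sprints.
    Stops at the first item that would take the total over the average
    (an ASSUMED rule for this sketch).
    """
    sprint_backlog = []
    committed = 0
    for name, points in release_backlog:
        if committed + points > average_velocity:
            break  # the sprint backlog is considered full
        sprint_backlog.append(name)
        committed += points
    return sprint_backlog, committed


# Hypothetical release backlog, already estimated by the team.
release_backlog = [
    ("Add column", 2),
    ("Export report", 5),
    ("Login audit", 8),
    ("Search filters", 13),
    ("Tooltip text", 1),
]

items, points = fill_sprint(release_backlog, average_velocity=20)
print(items, points)  # ['Add column', 'Export report', 'Login audit'] 15
```

Note that the sketch stops rather than skipping ahead to the 1-point item, because the backlog is worked through in priority order.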


Why are we estimating complexity instead of time?


The most obvious and most asked question about story points: We know how many people we have, we know how many work hours are in the sprint, why don't we just estimate the time each task will take instead and keep going until the sprint is filled?

In short, it's because of the Planning Fallacy and the Dunning-Kruger effect.  Lots can be said about this fallacy and effect, but the important point is that humans are really bad at estimating how long it will take them to do something. This is for all manner of reasons and holds regardless of whether the estimator is an expert or not.  However - and this is probably the key to understanding why story points are used - humans are very good at comparing 2 things with each other, a process known as relative estimation.  For example, if I ask you to estimate how tall George Clooney is, then you might guess 5 foot 10 inches.  That estimate might be right or wrong (I have no idea).  But if I show you a picture of George Clooney and Brad Pitt together and ask you who is taller, you'll say Brad Pitt.  That's relative estimation, a.k.a. comparing 2 things, and it's something you're good at (see the next section for a more detailed explanation of relative estimation).

Aside from the Planning Fallacy and the Dunning-Kruger effect, time comes with its own social, political and company implications.  A stakeholder will ask "Why will it take that long?", when the answer, to you as the expert who does this for a living, is obvious: Because it will.  But that's not a very useful answer.  Likewise, some people have a tendency to give the answer that they think the stakeholder wants to hear: 


"When can this be done by?" 
"Er....tomorrow?"  

That doesn't happen with story points because a measure of complexity cannot be directly equated to a measure of time.  Over a number of sprints the velocity of each individual sprint, assuming a settled team, should get closer and closer to the average velocity of all the sprints.  If your average velocity is 30 story points, and your sprint length is 2 weeks, you can equate 15 story points of complexity with one week's work by the team. But "equates" is not the same thing as "equals".  This is an average, and it is entirely possible that on average 20 points get done in the first week of the sprint and only 10 points in the second week (or vice versa).  This could be due to any number of factors.  The best you can do, whilst still being accurate, is to say that this specific team can achieve, on average, 30 story points of work in a 2 week sprint. That's your velocity.

Abstracting complexity away from time helps the team to be a "black box" - work items are added to the release backlog, and every sprint will slice off the top x number of story points.  That is as detailed as stakeholders need to know.  They might want to know more, but that's their problem, not the team's problem.  Remember, the onus is on the team to be self-organising and responsible.  That means that the stakeholders have to trust that the team will achieve their commitment in each sprint, all things being equal, and the team has to take responsibility for achieving that commitment. How the team achieves it is up to them, not up to the stakeholders.

(Occasionally a crisis will occur and you'll need to provide a patch to resolve a system down or some other equally serious problem.  In this kind of circumstance you will probably need to give a time line for delivery, and, Agile theory aside, welcome to the real world.  But this shouldn't happen very often, and the Agile theory should become your "real world" day in, day out.  Agile is a way of working, not a luxury that you abandon when the unexpected occurs.)

What are we estimating the complexity against?


There is no objective yardstick against which story point complexity is measured, so a task that has a complexity of 2 story points for one team might have a complexity of 8 story points for another team.  This is entirely different to units of measurement that do have an objective yardstick.  For example:

A metre is the length of the path travelled by light in vacuum during a time interval of 1/299,792,458 of a second.

Either something is a metre or it isn't, depending on whether it meets that objective definition of what a metre is.  A task can be measured accurately as 2 story points, or 5 story points, or 8 story points by 3 different teams. A metre is an absolute measurement.  A story point is a relative measurement.

However, even for relative measurement you still need something against which you can measure an item.  Otherwise the measurement of the item is not relative to anything.  When estimating tasks for a sprint backlog, a bootstrapping process is used to provide the relative measurement.  A single item (A) is chosen, given a story point number, and used as the measurement against which the next item (B) on the sprint backlog is measured.  A and B are then used as measures of complexity to measure item C against, and so on.  So how do you choose item A, and how complex should it be?

The first item chosen should be one of the least complex tasks on the backlog that is in a priority position that means it will be done in this sprint.  It shouldn't be the smallest possible task, because that doesn't give you any leeway if you come up against a task that turns out to be even less complex.  But it should be simple enough that the team can agree on a complexity of 2 story points for it.  Why 2? Because that should be simple enough for there to be a decent amount of certainty about what is entailed and how difficult that will be, but it also leaves you 0.5 and 1 on your pseudo-Fibonacci sequence if you find tasks which are less complex.

As an example, you might pick a work item (let's call it D) that requires you to add a column to a database table, add a text field to a form, and link the column to the field.  If the Product Owner has done their job correctly, the acceptance criteria should already include information about field type, length, nullable or not nullable, and so on for the database field, and the location where the field should be added to the software along with the name, max characters, alpha, numeric, or alphanumeric, etc.  The team should have groomed item D and made sure all of this information was available ready for planning. Because this work is not complex, item D should have a low number of story points, but it is possible to think of an item that is even less complex - say, adding a static value to a database field - for which you'll need the lowest number of story points.  Therefore give item D 2 story points, add it to the sprint backlog and start estimating the highest priority item on the release backlog.  Item D can be used as a marker against which the complexity of this next item can be relatively measured.

So isn't complexity just another way of saying how long it will take?


No.  There is a loose correlation between complexity and time, in the sense that the Large Hadron Collider took a lot longer to build than a microscope does, but that doesn't have to be so.  The fastest 147 ever recorded in professional snooker took Ronnie O'Sullivan 5 minutes 20 seconds, whereas the slowest took Cliff Thorburn over 30 minutes.  The outcome was the same (a maximum break), and to you or me the complexity would be the same (very, very high).  If the task was less complex for O'Sullivan, that was because he is the greatest snooker player to ever pick up a cue, whilst Thorburn, though a world champion in his own right, is not.  The time in which O'Sullivan compiled his 147 was a function of his ability, experience and working style (O'Sullivan's nickname is "The Rocket", whereas Thorburn's nickname was "The Grinder").  The time was not a function of the complexity of achieving a 147.  This is why you can't compare the velocity of 2 different teams, or even the same team that's had one or more personnel changes.  Different skills, experience levels, personalities and working styles combine in a team to generate a unique speed of working that can't and shouldn't be judged against the speed of working of another team.  This is especially true when the teams are not working on exactly the same product and problems.  Complexity is a measure of how complex a work item is.  Time is a measurement of...well, time.  They are not the same thing.

Why do we use a pseudo-Fibonacci sequence?


There are 2 obvious alternative scales - the actual Fibonacci sequence or a sequence of consecutive numbers like 1....20 - but there are problems with both of them. 

The actual Fibonacci sequence is 0, 1, 1, 2, 3, 5, 8, 13 and so on.  The second "1" is not required when measuring complexity, so it's taken out of the possible values and 0.5 is used instead.

With a sequential list, such as 1 to 20, the difference in complexity between a 5 and a 6 is something you can argue about at length, whereas the difference in complexity between a 5 and an 8 is much easier to see.  At the end of sprint planning there should be an obvious difference of complexity between those items with 5 story points and those items with 8 story points.  That won't be the case if you use a sequential list, where the difference between a 5 and a 6 is too fine-grained to judge at this stage of the process.

A couple of notes about this:

  • 0 can be used for non-productive tasks that need to be recorded.
  • 0.5 is not always used; some teams find it useful, others don't. 
  • Some teams use a task that has 2 story points as their baseline first task, some teams use a task that has 1 story point. It really depends on how often you need to use 0.5 and 1 as to whether you use 1 or 2 as your baseline first task.
  • If an estimate is agreed to be 13 or higher, some teams take this as a sign that the task has not been sufficiently broken down to the level of granularity that is required for a task. The task needs to be groomed into smaller chunks before it can be estimated.

Inspect and adapt depending on whether your team finds it useful to have all of these values as an option.

How do we use story points to calculate velocity?


The temptation is to say "Our sprint velocity is x hours of work.", or "Our sprint velocity is y work items.".  Waterfall uses time and number of items left as important markers in the production process, and so it's tempting to keep using them when transitioning to Agile. 

Don't do this.

Measuring how much time is left is like measuring how much fluid is left to drain out of a vessel.  It's a meaningless measurement, unless you know the rate of flow.  A litre of water could flow out of a vessel in seconds, or in days, depending on whether it's gushing or dripping.  Likewise, measuring the number of items left is like measuring how much wood a carpenter has left whilst he's making something.  Unless you know how quickly the carpenter can use that wood and what he's got to do to it to shape it appropriately, knowing how much he's got left won't tell you when he will finish, it will just tell you that he's not finished yet. 

Story points abstract away the type of work that's being done, and just leave the complexity.  This helps to remove prejudices and assumptions about how long different types of task take to complete.  For example, most bugs might take, on average, 4 hours to investigate, fix, test and document, whereas most enhancements might take on average 8 hours.  But "most" is not the same as "all".  Anyone involved in software development who hasn't seen what superficially appeared to be a simple bug grow into a behemoth that took several people several days or more to find a viable solution for, hasn't been in software development for that long.  The process of grooming will (hopefully) identify most bugs that might fall under this category, and as such they can be given an appropriate complexity, rather than people making the assumption that a bug = 4 hours and estimating it as such.

Story points provide a more accurate measure of how much work can actually get done in a sprint.  The more sprints that are done, the more accurate the average measure of "Total number of story points done in those sprints" / "number of sprints" becomes.
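
To make the arithmetic concrete, here is a small Python sketch that recalculates the average velocity (total story points completed divided by the number of sprints) after each sprint. The sprint totals and the `average_velocity` function are invented for the example; they are not figures from a real team.

```python
def average_velocity(points_per_sprint):
    """Average velocity = total story points completed / number of sprints."""
    if not points_per_sprint:
        raise ValueError("need at least one completed sprint")
    return sum(points_per_sprint) / len(points_per_sprint)


# Hypothetical story points completed in six consecutive sprints.
completed_points = [24, 33, 29, 31, 30, 33]

# Recalculate the average after each sprint to see how it settles.
for sprint in range(1, len(completed_points) + 1):
    so_far = completed_points[:sprint]
    print(f"After sprint {sprint}: {average_velocity(so_far):.1f}")

print(average_velocity(completed_points))  # 30.0
```

With only one or two sprints behind you, a single unusual sprint moves the average a lot; with more sprints in the history, each new sprint moves it less.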

Why can't we confer when we decide on the complexity?


A big part of the power of agile is the self-organising team.  The team should, at its most fluent, operate almost as a gestalt entity, many minds combined to form something which is greater than the sum of its parts. Admittedly that's more of a theoretical goal than a practical option, but you shouldn't ever forget that the more harmonious and collective the team is, the more efficient and effective you will be.  With that in mind, reality dictates that in any group of individuals there will be people who are more vocal than others, people who are more persuasive than others, and people who are more influential than others.

There is nothing wrong with this, but these people shouldn't be allowed to skew the decision of others, except through the use of relevant facts.  "The joint knowledge of many diverse individuals can outperform experts in estimation and decisions-making problems", and if an influential person says "This item has a complexity of 5", then some of the group may agree with them simply because they don't want to disagree.  But many minds provide many viewpoints, skills and insights, and it's important that these are all heard, otherwise the whole point of team planning is missed.  The multiple inputs are the engine that drives accurate estimation, and conferring on the complexity score can prevent that.

It's important to understand that this lack of conferring is as much to do with forcing people who don't want to disagree to give an opinion, as it is to do with making sure that one or two people don't dominate the decision. Both sides need to move from dominant or submissive attitudes to assertive attitudes instead.  But people being people, you will likely only get so far down that path before people's hard limits are reached.  The talker will always prefer to talk, the silent will always prefer to stay silent, and the estimation process will suffer as a result.

Therefore agile best practice is not to confer on the specific story point number.  The complexity can be discussed in relation to previously estimated tasks - "I think this is more complicated than item A because of x, y and z." - but when it comes to people holding up their planning poker cards, they should make their own choice. Your Scrum Master should be aware of this and facilitate the discussion accordingly.

How do we work out how many story points we can fit into the first sprint?


Good question - if you don't know the average number of story points that the team has completed in previous sprints (your velocity), how do you know how many story points to commit to in the first sprint?  As discussed above, story points are a measure of relative complexity.  It is a fair assumption that your team won't be entirely made up of fresh-from-graduating staff with no real world experience, so the team should decide together how many story points they are willing to commit to in the first sprint.  This decision should be based on the team's collective experience of development, testing and documentation (or whatever skills are used in the team).  It is important that this decision is not made before planning, because you will be limiting yourself needlessly.  Always remember the "Inspect and adapt" maxim; inspect the sprint backlog as you go through the planning meeting and adapt your decision about how many story points you're willing to commit to as you go. 

Give your team a chance to work out their velocity by slightly under-committing in the first sprint, because you can always add more work in part-way through the sprint if you haven't added enough at the initial planning.  Adding work into a sprint is always preferable to overcommitting and having to decide what to take out. The first sprint should be a calibration sprint and I wouldn't recommend planning a release at the end of it if you can avoid it. You are stepping into the unknown and you don't need that additional pressure.  Obviously this advice is contingent on how supportive your internal stakeholders and management are, but if they don't understand why you'd rather not commit to a release at the end of the first sprint, they either don't understand Agile, or they're unrealistic. 

Once you've completed your first sprint you can discuss the velocity at the retrospective, and work out if your story point estimation was a fair reflection of each work item's complexity.  Take the results of these discussions into the planning session for sprint 2 and use them to determine an achievable story point number for the sprint.



Hopefully that will clear up some of the confusion you might feel about story point estimating.  There is more to say about the difference between estimating and comparing, but I'll leave that for another post.  If you've got any questions about story points or estimating, feel free to drop them in the comments and I'll do my best to answer them.