It is often the case that we just sample the response times of a few transactions rather than metering all of them. When sampling, how do you know you’ve sampled enough to get an average response time that is representative of all the transactions?
If you make some change to the system, and the average response time falls from 10 seconds to 0.2 seconds, it doesn’t take a rocket scientist to know that is a real improvement. However, if the before and after numbers are reasonably close, it’s not as clear that that change was an improvement. We could have just gotten lucky in our sampling. So, how can we know anything without all the data?
Think about a bowl of jellybeans for a minute. Imagine you blindly and randomly select and eat two jellybeans from that bowl. You find one is orange and one is strawberry. You could at this point state that the bowl contains 50% orange and 50% strawberry jellybeans, but you wouldn’t be too confident about it. If the next ten randomly selected jellybeans confirmed the 50/50 ratio then your confidence would grow. However, to be absolutely certain of this ratio, you’d have to eat all the jellybeans in the bowl.
The same is true for any sampled data. The more sampled transactions you have, the more confident you are of your result. To be absolutely sure, you have to measure every transaction. But, how many samples is enough so you can be reasonably sure? For that we are going to have to use statistics. Please don’t panic. We are going to use a couple of simple Excel functions to do the math. Let’s work through an example.
Suppose you are comparing 10 samples of response time data before and 10 samples after an upgrade to see if things are better or worse. Before the upgrade the average response time of 10 transactions was 4.5 seconds and after it was 4.1 seconds. To be sure a small difference is a real difference, you need to calculate the confidence interval. This is a four-step process:
■ Download/copy the individual samples into a column of an Excel spreadsheet. For this example there are ten of them, starting at cell A1 and going through A10.
■ Use the AVERAGE function to find the average value (arithmetic mean) of all the samples. This function takes one argument, which is a range of cells containing the response times. For this example AVERAGE(A1:A10) equals 4.5.
■ Use the STDEV function to find the standard deviation of all of the samples. This function takes one argument, which is a range of cells containing the response times. For this example, that is STDEV(A1:A10).
■ Use the CONFIDENCE.NORM function to find the confidence interval. This function takes three arguments:
- Alpha – This is a number between zero and one that tells the function how confident we want to be. The confidence level equals one minus the Alpha. In other words, an Alpha of 0.05 asks for a 95 percent confidence level, which is what we want here.
- StandardDeviation – The value returned by the STDEV function in step 3.
- Size – This is the count of individual test results in our sample. In this example the count is 10.
The CONFIDENCE.NORM function returns a number: 0.51. This tells us that we can be 95% confident that the average response time of all transactions during the studied interval before the upgrade (not just the ones we sampled) is 4.50 seconds ± 0.51 seconds. In other words, we are 95% confident the average pre-upgrade response time is between 3.99 and 5.01 seconds.
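For readers who would rather script this than use a spreadsheet, the same four steps can be sketched in Python. This is a minimal sketch, not the spreadsheet's own code. The ten sample values are made up for illustration (chosen so they land near the 4.50 ± 0.51 result above), and the helper name confidence_norm is an assumption that mirrors what Excel's CONFIDENCE.NORM computes: the normal critical value for the chosen Alpha, times the standard deviation, divided by the square root of the sample size.

```python
import math
import statistics


def confidence_norm(alpha, standard_dev, size):
    """Half-width of a normal-based confidence interval for a mean.

    Hypothetical helper meant to mirror Excel's CONFIDENCE.NORM.
    """
    # Two-sided critical value, e.g. about 1.96 when alpha is 0.05.
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    return z * standard_dev / math.sqrt(size)


# Step 1: the sampled response times in seconds (made-up illustration data).
samples = [3.2, 5.2, 4.9, 3.8, 5.7, 4.2, 4.4, 5.4, 3.6, 4.6]

# Step 2: arithmetic mean, like AVERAGE(A1:A10).
mean = statistics.mean(samples)

# Step 3: sample standard deviation, like STDEV(A1:A10).
standard_dev = statistics.stdev(samples)

# Step 4: the plus-or-minus value, like CONFIDENCE.NORM(0.05, stdev, 10).
margin = confidence_norm(0.05, standard_dev, len(samples))

print(f"{mean:.2f} s ± {margin:.2f} s")
print(f"95% confident the true average is between "
      f"{mean - margin:.2f} and {mean + margin:.2f} seconds")
```

Swapping in your own measurements for the samples list is the only change needed to reuse the sketch.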
Now, let’s say we calculated the confidence interval for the after-the-upgrade data, and the calculations showed we are 95% confident that the actual average response time of all transactions during the studied interval (not just the ones we sampled) is 4.10 seconds ± 0.49 seconds.
So what does this all mean? If the confidence intervals overlap, there is no statistically significant improvement. Here the two ranges (3.99 to 5.01 seconds before, 3.61 to 4.59 seconds after) clearly overlap and, even though the after-the-upgrade response time numbers look better, statistics can offer no guarantee of any real improvement. The upgrade might have helped, but you can’t prove it with the data you have to the level of confidence (95%) you want.
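The overlap test itself is simple enough to write down. The sketch below uses the before and after numbers from this example; the function name intervals_overlap is a hypothetical helper, not part of any library.

```python
def intervals_overlap(mean_a, margin_a, mean_b, margin_b):
    """Return True if the two confidence intervals share any values."""
    low_a, high_a = mean_a - margin_a, mean_a + margin_a
    low_b, high_b = mean_b - margin_b, mean_b + margin_b
    # Two ranges overlap when each one starts before the other ends.
    return low_a <= high_b and low_b <= high_a


# Before the upgrade: 4.50 ± 0.51 seconds. After: 4.10 ± 0.49 seconds.
overlap = intervals_overlap(4.50, 0.51, 4.10, 0.49)
print(overlap)  # True, so no improvement is shown at this confidence level
```

A result of False would mean the two ranges are fully separated, which is the case where the rule above lets you call the change a real difference.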
This is the same calculation pollsters do when they randomly call ~1000 people and, from that small sample, predict how the nation will vote. When these polls are talked about, they rarely quote the Alpha or the confidence interval. If they did, the lead story of some future newscast might be:
The latest polls are 95% confident that candidate X is polling at 53% and candidate Y is at 48%. The margin of error is ± 5 points so there is no statistically measurable difference and thus we really have no idea who is winning.
Now you might want to be absolutely 100% sure you are seeing an improvement. Statistics can’t help you here because, to be 100% confident, you need to have response time data from ALL the transactions, not just a sample of them. If you have 100% of the data, you don’t need statistics because you have 100% of the data. For most cases, a confidence level of 95% or 98% will do nicely.
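To see why 100% is out of reach, it can help to watch the plus-or-minus value grow as the requested confidence level rises. The sketch below is illustrative only: the standard deviation of 0.82 seconds is an assumed value (roughly what a ± 0.51 result with 10 samples implies), and margin_of_error is a hypothetical helper using the same normal-distribution formula as CONFIDENCE.NORM.

```python
import math
import statistics


def margin_of_error(confidence_level, standard_dev, size):
    """Plus-or-minus value for a mean at the given confidence level."""
    alpha = 1 - confidence_level
    # inv_cdf raises statistics.StatisticsError when asked for exactly 1.0,
    # which is what a 100% confidence level (alpha of zero) would request.
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    return z * standard_dev / math.sqrt(size)


ASSUMED_SD = 0.82   # seconds; assumed value for illustration
SAMPLE_SIZE = 10    # same sample count as the example

margins = {
    level: margin_of_error(level, ASSUMED_SD, SAMPLE_SIZE)
    for level in (0.95, 0.98, 0.999)
}

for level, margin in margins.items():
    print(f"{level:.1%} confident: ± {margin:.2f} s")
```

Each step closer to 100% widens the range, and asking for exactly 100% has no finite answer, which is why a level such as 95% or 98% is the practical choice.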
Bob Wescott is the author of “The Every Computer Performance Book”.