Is It Fixed? Statistical Test for the Debugging Question

December, 1997

You ran your program, and it crashed horribly. While you were still gazing despondently at your screen, Fred wandered past, on his way to or from the soda machine, and said that you really should include the "/iefbr14" option on your link line --- nobody knows what it does, but everybody has the feeling that it sometimes cures strange problems.

So you re-link with the mysterious option and run the resulting program, and it runs flawlessly. Can you conclude that "/iefbr14" fixed the problem? How sure do you want to be? Will this software be flying a jumbo jet full of passengers tomorrow? Or will you be demonstrating it to your boss's boss this afternoon? Or is this just one small part of an operating system that's so full of bugs nobody will notice one more?

Being the cautious type, you run the program several more times, and the dreaded crash does not recur. Now how confident are you?

Statistical reasoning can help a lot with this quandary. It can help distinguish between these two hypotheses:

Hypothesis H0:
"/iefbr14" did not fix anything. There is still an error in your program, but the error only occurs with some probability p < 1. That the error hasn't occurred since you re-linked with "/iefbr14" just happened by chance.
Hypothesis H1:
Re-linking with "/iefbr14" fixed something. At the very least, it reduced the probability of crashing.

If you've run the re-linked program a huge number of times without another crash, you'll find Hypothesis H0 pretty implausible, because it requires an unlikely event to have happened: a crash probable enough to occur the very first time you ran the program would have to stay away through the whole succeeding series of trials. Statisticians traditionally reject a hypothesis that requires an event with odds of 1 in 20 or worse to have happened (such a result is termed "statistically significant"), and reject with greater confidence ("statistically highly significant") a hypothesis that requires an event with odds of 1 in 100 or worse.

Suppose the original program was run just once, and crashed, and the re-linked program was run 7 times, and never crashed. If the probability of crashing, p, was constant during these 8 tests, then the probability of getting the observed lopsided distribution of crashes --- one before the change, and none after --- is

p * (1-p)^7.

As your intuition may have told you, the value of p for which this distribution is most probable is 1/8. For that value of p, the probability of getting the observed distribution is 4.9%, slightly less than 1 in 20; for any other value of p it is smaller still. Accordingly, in the traditional language of statistical reasoning, we say that we can reject hypothesis H0 with 95.1% confidence.

(This conclusion is based on the assumption that p is constant over all the trials. If it is possible that the probability of a crash depends on time of day, processor loading, or anything else that would differ consistently between the "before" and "after" tests, this assumption is invalid.)
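The 1/8 and 4.9% figures can be checked numerically. The following is only a rough sketch: it scans a grid of candidate values of p rather than solving for the maximum, and the function name and grid spacing are illustrative choices, not part of the original analysis.

```python
# Numerical check of the example: one crash, then seven clean runs.
# Under H0, with a constant crash probability p, the chance of exactly
# that outcome is p * (1 - p)**7.

def outcome_probability(p, crashes_before=1, clean_runs_after=7):
    """Probability of the observed outcome if the crash probability is p."""
    return p ** crashes_before * (1 - p) ** clean_runs_after

# Scan p = 0.0001, 0.0002, ..., 0.9999 and keep the value that makes
# the observed outcome most probable (the best case for H0).
grid = [i / 10000 for i in range(1, 10000)]
best_p = max(grid, key=outcome_probability)
best_probability = outcome_probability(best_p)

print(f"most favorable p for H0: {best_p:.4f}")
print(f"probability at that p:   {best_probability:.4f}")
```

The scan peaks at p = 0.125 with a probability of about 0.0491, which agrees with the 4.9% quoted above.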

What if we want more confidence? How many consecutive successful trials do we have to run to achieve 99% confidence? Unfortunately, lots: 37.
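To see where 37 comes from, assume the same reasoning as in the 7-trial example: with one crash followed by M clean runs, the value of p most favorable to H0 is 1/(M+1). The symbol M is introduced here only for illustration. Under that assumption, the bound on the probability of the observed outcome works out as follows:

```latex
\[
\max_{p}\; p\,(1-p)^{M} \;=\; \frac{1}{M+1}\left(\frac{M}{M+1}\right)^{M}
\]
\[
M = 36:\quad \frac{1}{37}\left(\frac{36}{37}\right)^{36} \approx 0.01008,
\qquad
M = 37:\quad \frac{1}{38}\left(\frac{37}{38}\right)^{37} \approx 0.00981
\]
```

So 36 clean runs fall just short of the 1-in-100 threshold, and the 37th crosses it.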

The number of trials required can be reduced enormously by one simple expedient: running more trials before applying the fix. If the original program was run N times, and crashed every time, before you re-linked it, then the number of successful trials of the re-linked program required to achieve a given level of confidence is given in the following table:

Successful trials needed for...

  N    95% confidence    99% confidence
  1          7                 37
  2          3                  7
  3          2                  4
  4          2                  3
  5          2                  3
  6          2                  3
  7          1                  2

For example, if the original program was run 4 times, and crashed every time, and the re-linked program was run twice, and never crashed, you can be "95% confident" that re-linking improved things. If the re-linked program is run a third time without crashing, your confidence increases to 99%.
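The table can be regenerated with a few lines of code. This is a sketch under one assumption: that the criterion generalizes the single-crash example, so that for N crashes followed by M clean runs the probability p^N * (1-p)^M is evaluated at its most favorable value, p = N/(N+M), and compared with 5% or 1%. The function names are made up for this illustration.

```python
def max_null_probability(n_crashes, m_clean):
    """Largest probability H0 can give to n crashes followed by m clean runs.

    Assumes a constant crash probability p; p**n * (1 - p)**m is largest
    at p = n / (n + m).
    """
    p = n_crashes / (n_crashes + m_clean)
    return p ** n_crashes * (1 - p) ** m_clean


def clean_runs_needed(n_crashes, significance):
    """Smallest number of clean runs that pushes the bound below `significance`."""
    m = 1
    while max_null_probability(n_crashes, m) >= significance:
        m += 1
    return m


print(" N   95%   99%")
for n in range(1, 8):
    print(f"{n:2d}  {clean_runs_needed(n, 0.05):4d}  {clean_runs_needed(n, 0.01):4d}")
```

Under this criterion the sketch asks for two clean runs at N = 5 and N = 6 for the 95% level, because a single clean run still leaves probabilities of about 6.7% and 5.7% respectively. For N = 4 it gives 2 and 3, matching the example above.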

A given level of confidence is much harder to achieve if the original program sometimes does not crash, or if the re-linked program sometimes crashes. In either of these cases, the above table does not apply. The applicable numbers are only slightly more difficult to compute, but since they cannot be compactly tabulated for any generally useful case, we won't attempt to present them here.

This analysis is not limited to debugging software. Here are some other situations in which it can be applied.


