An exercise in analysing the causes of software defects

The obvious…

Software has bugs. Despite what some developers would like to believe, the vast majority of bugs are created through the conscious (or unconscious) actions of the person writing the code. As programmers, we can continue to write software the way we did last week, last month or last year, or we can ask ourselves some questions:

  • What kinds of bugs are being introduced?
  • How do these bugs get introduced?
  • How can I prevent these bugs being created?

In other words, how can I become a better programmer? Without some data, it is difficult to answer these questions.

Background

Way back in the early 1990s I was working for a small software development company, which sold a complete solution for manufacturing and distribution companies. The development team was quite small (around 8-10 people) but everyone was pretty smart. The software was reasonably large, so there wasn’t a lot of overlap in the areas that people worked in. Back in those days, just about every hardware vendor had their own flavour of Unix, and every customer chose a different vendor. The software was continually being enhanced, so there was plenty of opportunity for people to create new and interesting bugs.

Despite being a small company with a lot of work on our plate, we took time to develop common libraries and tools, and we were willing to set aside time to improve the way we did things.

At some point I ended up with some QA responsibilities in addition to my development work. One of the things we had done early on was to create an administrative / work tracking system that used our core libraries (thus eating our own dog food). This tool was used to manage all our development activities, including the logging and management of defects (in a parallel universe, we realised that this tool should be commercialised as well, and maybe we could have saved the company, which is now long gone…).

Anyway, by this time we had quite a collection of bugs in the system, and some quick-and-dirty analysis revealed that we were spending around 50% of our time fixing bugs instead of adding new features. I decided that we needed to know more about where these bugs were coming from, and what we could do about it. So I started on a project to associate “cause codes” with the bugs in our database. Maybe once we had some data collected, we could get some ideas on how to get our bug load down.

Project initiation

Before you even start on a project like this, you need to appreciate that there is a technical aspect and a social aspect. The technical aspect was defining the data I wanted, maintaining it in our bug tracking system and adding some analysis capability. The social aspect was twofold: how do I get the rest of the team to sign up to maintaining the data, and how do I get them to take on board any action items arising from the data analysis? Let’s start with the technical aspect (although the two are linked)…

I decided that I wanted to add 2 new fields to the bug table in our tracking application:

  1. A short cause code that categorised the cause of the defect in some logical way
  2. A text field to allow the developer to add some explanatory information.

Only the cause code was used for data analysis, so I’ll concentrate on that.
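For concreteness, the change amounts to something like the sketch below. This is a minimal illustration only; the original system was an in-house tool, not Python, and the field names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BugRecord:
    """One row in the bug table (only the fields relevant here)."""
    bug_id: int
    severity: str                       # e.g. "low", "medium", "high"
    cause_code: str = "UK"              # mnemonic; new bugs start as UK (UnKnown)
    cause_notes: Optional[str] = None   # free-text explanation from the developer
```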

Choosing the breakdown of bug causes into categories is not trivial. Firstly, you need to make sure that there is good coverage of all the common causes of defects in your organisation (the last thing you want is to accumulate lots of “none of the above” causes). Secondly, the cause codes need to suggest some form of remedial action. For example, a cause code of “code standard violation” could feed into the code review process, or maybe into a tool to automatically detect some common cases. On the other hand, a cause code of “testing failure” does not suggest any useful actions that could be taken (apart from telling the testers to “work harder” or something 🙂 ). Finally, you want to make it as easy as possible for the developers to choose the right code, so use mnemonics, UI drop-downs, mouse-over text, cheat-sheets etc.

Despite having worked at this company for something like 6 years, coming up with a good list wasn’t easy. In the end I spent a lot of time reading through lots of bug reports and associated fixes and trying out various candidates for cause codes until I got down to a list of around 30. That might sound like a lot, but they were broken down into the phase where the bug was introduced (specification, development, testing, installation and “other”). The list of defect codes I chose is down at the bottom of this post.
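To give a flavour of the final list (reproduced in full in the appendix), here is how a handful of the codes might be represented in a script, grouped by phase. The phase assignments shown are my illustration of the breakdown, not a definitive mapping.

```python
# A sample of the cause codes from the appendix, keyed by mnemonic.
# The phase grouping here is illustrative.
CAUSE_CODES = {
    # Specification phase
    "DE": ("Design Error",         "specification"),
    "DO": ("Design Omission",      "specification"),
    # Development phase
    "CE": ("Coding Error",         "development"),
    "BC": ("Boundary Condition",   "development"),
    "TU": ("Incorrect Tool Use",   "development"),
    # Testing phase
    "IT": ("Incomplete Testing",   "testing"),
    # Installation phase
    "IN": ("Installation Problem", "installation"),
    # Everything else
    "MI": ("Miscellaneous",        "other"),
}

def phase_of(cause_code: str) -> str:
    """Return the phase in which a bug with this cause code was introduced."""
    return CAUSE_CODES[cause_code][1]
```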

Having come up with a cause code list, I needed to modify our bug tracking application to maintain and report on them. This was actually pretty easy, as the application was based on our own in-house tools.

Now came the hard part – the social aspect of creating and maintaining the data. This exercise isn’t going to get very far if the developers don’t want to help. To get them on board,

  • I sold them the benefits – spending less time on fixing the same bugs again and again
  • I emphasised that this was a non-judgmental process – it wasn’t important who created the bug, but rather how we could stop it happening again.
  • I requested that if people weren’t sure which cause to assign, they could easily refer it to me to decide – that helps to prevent people from entering garbage values.

To get the ball rolling I decided to go over the previous 6-9 months of bugs and assign cause codes to them all. This had several benefits. Firstly, it gave me a good feel for how the system worked; I tweaked a few things to make the process easier. Secondly, it gave us a baseline of data to begin analysis with. That meant we could get some results out right from the start. Finally, it also meant that none of the other developers could complain about the effort on their part 🙂 .

In the end, the developers were happy to help maintain the data. I didn’t need management backing to push the process forward.

Results

What kind of information can you get from this kind of data?

Firstly, just reviewing all the bugs in each category gives you a real insight into common factors. It can be a revelation to see how many bugs arise from essentially the same underlying cause.

Grouping cause codes by the phase in which they were introduced also gives you some higher-level ideas about where some of your problems lie.

Cross-referencing with bug severity gives you another dimension to the data. You can use this to help decide which bug causes you should attack first.
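None of this needs sophisticated tooling. Here is a sketch of the sort of tallying involved, reusing the hypothetical BugRecord from earlier:

```python
from collections import Counter

def tally_causes(bugs):
    """Count bugs per cause code, and per (cause code, severity) pair."""
    by_cause = Counter(b.cause_code for b in bugs)
    by_cause_and_severity = Counter((b.cause_code, b.severity) for b in bugs)
    return by_cause, by_cause_and_severity

# Usage: rank the causes, then see how many high-severity bugs each produces.
# by_cause, cross = tally_causes(all_bugs)
# for code, total in by_cause.most_common():
#     print(f"{code}: {total} bugs, {cross[(code, 'high')]} high-severity")
```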

If you choose your cause codes to be prescriptive – that is, there is some kind of action you can take – then you immediately have some opportunities to address some of the causes. Some of the actions we took based on our analysis included:

  • augmenting our automated test tools to detect code patterns which we knew led to incorrect behaviour (see the sketch after this list),
  • adding some extra sanity checking to our tool libraries to highlight known areas of misuse,
  • adding better practices to our code standards document and review processes.
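To illustrate the first of those, a known-bad-pattern detector can start out as little more than a list of regular expressions run over the source files. The patterns below are invented for illustration; ours were specific to our own libraries and coding standards.

```python
import re
import sys

# Hypothetical patterns known to lead to incorrect behaviour.
BAD_PATTERNS = [
    (re.compile(r"\bstrcpy\s*\("), "strcpy without a bounds check"),
    (re.compile(r"==\s*NULL\s*\)\s*;"), "stray semicolon after a NULL test"),
]

def check_file(path: str) -> int:
    """Print a diagnostic for each known-bad pattern found; return hit count."""
    hits = 0
    with open(path, encoding="utf-8", errors="replace") as src:
        for lineno, line in enumerate(src, start=1):
            for pattern, message in BAD_PATTERNS:
                if pattern.search(line):
                    print(f"{path}:{lineno}: {message}")
                    hits += 1
    return hits

if __name__ == "__main__":
    total = sum(check_file(path) for path in sys.argv[1:])
    sys.exit(1 if total else 0)
```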

Of course, some bug causes just don’t have an easy solution. Maybe they are just too hard to predict, or perhaps the effort to prevent them is on the wrong side of the cost/benefit ratio.

In conclusion

If you want to become a better programmer, then you need to “close the loop” between fixing bugs and preventing them being added in the first place.

In the end, there were some bug causes that we were able to almost entirely eliminate. As developers we also acquired some priceless knowledge about how we introduced bugs into the software.

Appendix: Cause Codes

These are the cause codes I came up with. Some of them are pretty specific to the software we were writing, the processes we followed and the environments we deployed against.

You’d probably want to change them for your situation, but you can get the sense of how we broke things down.

Each entry gives the mnemonic, the cause code name, and a description.

DE – Design Error: The program followed the specification, but the specification was incorrect. This would apply even if there was no specification.
DO – Design Omission: The specification (and hence the program) did not describe what should be done in the situation where the bug arose.
DM – Design Modification: The specification has been enhanced to cover a new area. This differs from a suggested modification, often because we want to get the change out into the field before a problem is discovered by a client. Typically these bugs are raised internally.
BA – Bad Algorithm: The programmer selected an inefficient algorithm, or one which did not perform correctly under all inputs.
NS – Non Standard Behaviour: The program did not adhere to accepted standards of behaviour. E.g. scroller columns were not correctly aligned, or screens did not redraw properly.
NC – Non Standard Code: The program did not adhere to accepted programming standards.
DI – Design Ignored: The program does not conform to the specification.
IP – Incorrect Procedure: The programmer did not follow the correct procedures (e.g. omitted a file from the task list, or copied a .q file into the demo area).
TU – Incorrect Tool Use: A library function was incorrectly used.
BC – Boundary Condition: The program did not cope with some boundary condition in the data (e.g. a zero-quantity stock record).
CE – Coding Error: The programmer obviously meant one thing but typed another.
TD – Tool Defect: The problem was caused by a bug in a standard routine.
RB – RPT Bug: A report did not function correctly. Perhaps it was not being sent the correct data, or was not displaying it properly.
CD – Code Decay: The program used to work, but now no longer does, and nothing obvious was changed to break it. Perhaps a library change caused it.
IT – Incomplete Testing: The bug should have been found during routine testing. Perhaps the programmer did not know enough about the area to thoroughly exercise the program or interpret the output correctly.
OD – Odd Data: The program failed under input which could not reasonably have been expected by the programmer.
IN – INstallation Problem: The program was not properly installed.
NP – Non Portable Code: The program used routines or techniques which were not portable. This includes both operating system and database version problems.
MI – MIscellaneous: Some bugs will not fall into any existing category. If you use this code, please be sure to add details to the notes for the bug.
UR – UnResolved: The actual cause of the bug was never determined. Perhaps the program was rewritten instead, or the bug was closed because of a lack of information.
DP – Data Problem: The bug was caused by invalid data or a database corruption.
CP – Client Problem: The client was responsible for causing the program to fail. E.g. they bulk-loaded invalid data, or sent a SIGKILL to a program.
MP – Machine Problem: The bug was caused by a machine-related problem. E.g. a compiler bug, or running out of system resources.
II – Insufficient Information: There is insufficient information to determine the cause. This applies to historical bugs, or ones which were taken by someone no longer at the company.
NB – Not a Bug: The report turned out not to be a programming-related bug at all. E.g. a mistake in some help text, a bug report for standard behaviour, or a report which was closed because it couldn’t be reproduced.
DU – DUplicate Bug: The bug report duplicates another. This includes cases where the bug has already been fixed.
UK – UnKnown: The cause has not yet been determined. This is a transitory code; no bug will end up with this cause recorded. All new bugs start out with this cause code.
NA – Not Applicable: The bug report was an enhancement request.