Invited Address
Pacific Northwest Software Quality Conference
Portland, Oregon
October 17, 2000
Acknowledgment
Much of the material in this paper was presented or developed by the participants of the Software Test Managers Roundtable (STMR) and the Los Altos Workshop on Software Testing (LAWST).
At PNSQC a year ago, I spoke with you about problems inherent in software measurement, and described a framework for developing and evaluating measures (Kaner, 1999). Similar approaches have been laid out by other authors on software measurement (Kitchenham, Pfleeger, & Fenton, 1995; Zuse, 1997) and in other fields, such as physics (see, e.g., Sydenham, Hancock & Thorn, 1989) and psychometrics (see, e.g., Allen & Yen, 1979). My point of departure from the traditional software measurement literature is in the extent to which I ask (a) whether commonly used measures are valid, (b) how we can tell, and (c) what kinds of side effects are we likely to encounter if we use a measure? Even though many authors mention these issues, I don't think they've been explored in enough depth and I don't think that many of us are thoughtfully enough facing the risks associated with using a poor software measure (Austin, 1996; Hoffman, 2000).
Developing a valid, useful measure is not easy. It takes time, combined with theoretical and empirical work. In Kaner (1999), I provided detailed examples from the history of psychophysics to illustrate the development of a field's measures over a 100-year period.
The most common question that has come to me in response to that paper (Kaner, 1999) is what measures I think are valid and useful. I don't have a simple answer to that, but I can report on some work in progress regarding measurement of the extent of testing of a product. This paper is a progress report, based primarily on the work of colleagues who manage software test groups or consult to test managers. It describes some of the data they collect and how they report it to others. Some of the material in this paper will be immediately useful to some readers, but the point of the paper is not to present finished work.
I'm still at the stage of collecting and sifting ideas, looking for common threads and themes, trying to get a better idea of the question that we intend when we ask how to measure how much testing has been done, and to understand the possible dimensions of an answer to such a question. There's an enormous amount of detail in this paper and you might get lost. Here's the one thing that I would most like you to take away from the paper:
To evaluate any proposed measure (or metric), I propose that we ask the following ten questions. For additional details, see Kaner (1999):
The phrase "directness of measurement" is often bandied about in the literature. If "directness of measurement" means anything, it must imply something about the monotonicity (preferably linearity?) of the relationship between attribute and instrument as considered in both #9 and #10. The phrases "hard measure" and "soft measure" are also bandied about a lot. Let me suggest that a measure is "hard" to the extent that we can describe and experimentally investigate a mechanism that underlies the relationship between an attribute and its measure and that accurately predicts the rate of change in one variable as a function of the other.
When someone asks us for a report on the amount of testing that we've completed, what do they mean? The question is ambiguous. At an early point in a class he teaches, James Bach has students (experienced testers) do some testing of a simple program. At the end of the exercise, James asks them (on a scale from 0 to 100, where 100 means completely tested) how much testing they've done, or how much they would get done if they tested for another 8 hours. Different students give strikingly different answers based on essentially the same data. They also justify their answers quite differently. We found the same broad variation at STMR (a collection of experienced test managers and test management consultants). If someone asks you how much of the testing you (or your group) has done, and you say 30%, you might be basing your answer on any combination of the following dimensions (or on some others that I've missed):
These questions overlap with the measurement framework analysis but, in discussions, they seem to stimulate different answers and different ideas. Therefore, for now, I'm keeping them as an independently useful list.
The material that follows lists and organizes some of the ideas and examples that we (see the Acknowledgment, above) collected or developed over the last year. I have filtered out many suggestions but the lists that remain are still very broad. My intent is to show a range of thinking, to provide you with a collection of ideas from many sources, not (yet) to recommend that you use a particular measure or combination of them. The ordering and grouping of ideas here are for convenience. The material could be reorganized in several other ways, and probably in some better ways. Suggestions are welcome. Let me stress that I do not endorse or recommend the listed measures. I think that some of them are likely to cause more harm than good. My objective is to list ideas that reasonable people in the field have found useful, even if reasonable people disagree over their value. If you are considering using one, or using a combination of them, I suggest that you evaluate the proposed measure using the measurement framework described above. The sections below do not make a good enough connection with the software measurement literature. Several sources (for example, Zuse, 1997, and Ross Collard's software testing course notes) provide additional measures or details. As I suggested at the start, this paper reports work in progress. The next report will provide details that this one lacks.
"Coverage" is sometimes interpreted in terms of a specific measure, usually statement coverage or branch coverage. You achieve 100% statement coverage if you execute every statement (such as every line) in the program. You achieve 100% branch coverage if you execute every statement and take every branch from one statement to another. (So, if there were four ways to reach a given statement, you would try all four.) I'll call this type of coverage (coverage based on statements, branches, perhaps also logical conditions) code coverage.
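The difference between the two measures can be shown with a very small sketch. The function `apply_discount` and its tests are hypothetical, invented only for illustration. The first test alone executes every statement; the second is needed to take the branch that skips the body of the `if`.

```python
def apply_discount(price, is_member):
    """Hypothetical function under test: members get 10 off."""
    discount = 0
    if is_member:
        discount = 10
    return price - discount

# This one test reaches 100% statement coverage: every line above runs.
assert apply_discount(100, True) == 90

# Branch coverage also requires the path where the `if` body is skipped.
assert apply_discount(100, False) == 100
```

A coverage tool would report the first test alone as 100% of statements but only half of the branches out of the `if`.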
Code coverage is a tidy measure: it is easy to count, unambiguous, and easy to explain. Unfortunately, this measure carries risks (Marick, 1999). It is easy (and not uncommon) to write a set of relatively weak tests that hit all of the statements and conditions but don't necessarily hit them very hard. Additionally, code coverage is incomplete. Some examples of its incompleteness:
Testing the lines of code that are there does not necessarily reveal the problems arising from the code that is not there. Marick (2000) summarizes data from cases in which 22% to 54% of the errors found were faults of omission.
You can achieve 100% code coverage while missing errors that would have been found by a simple data flow analysis. (Richard Bender provides a clear and simple example of this in his excellent course on Requirements Based Testing.)
Code coverage doesn't address interrupts (there is an implicit branch from every statement in the program to the interrupt handler and back, but because it is implicit, wired into the processor rather than written into the program directly, it just doesn't show up in a test of every visible line of code) or other multi-tasking issues.
Table-driven programming is puzzling for code coverage because much of the work done is in the table entries, which are neither lines of code nor branches.
User interface errors, device incompatibilities, and other interactions with the environment are likely to be under-considered in a test suite based on code coverage.
Errors that take time to make themselves visible, such as wild pointers or stack corruption or memory leaks, might not yield visible failures until the same sub-path has been executed many times. Repeatedly hitting the same lines of code and the same branches doesn't add any extent-of-testing credit to the code coverage measure.
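The first item in the list above, faults of omission, is easy to illustrate. In this hypothetical sketch (the function `average` is invented for the example), a single test executes every statement and branch that exists, yet the missing handling of an empty list goes unnoticed until someone tests for the code that is not there.

```python
def average(values):
    # Fault of omission: nothing here handles an empty list.
    return sum(values) / len(values)

# This single test gives 100% statement and branch coverage of average().
assert average([2, 4, 6]) == 4

# The omitted case only shows up when we deliberately test for it.
try:
    average([])
    omitted_case_handled = True
except ZeroDivisionError:
    omitted_case_handled = False

print("empty list handled:", omitted_case_handled)
```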
Sometimes, the most important coverage measure has nothing to do with code coverage. For example, I worked on a product that had to print well. This was an essential benefit of the product. We selected 80 printers for detailed compatibility testing. We tracked the percentage of those 80 printers that the program could pass. This was a coverage measure, but it had nothing to do with lines of code. We might pass through exactly the same code (in our program) when testing two printers but fail with only one of them (because of problems in their driver or firmware).
You have a coverage measure if you can imagine any kind of testing that can be done, and a way to calculate what percent of that kind of testing you've done. Similarly, you have a coverage measure if you can calculate what percentage of testing you've done of some aspect of the program or its environment. As with the 80 printers, you might artificially restrict the population of tests (we decided that testing 80 printers was good enough under the circumstances) and compute the percentage of that population that you have run.
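In its simplest form, this kind of coverage is just a proportion over a population you have chosen. The following is a minimal sketch; the helper `coverage_percent`, the printer names, and the count of 20 printers tested so far are assumptions made up for the example.

```python
def coverage_percent(population, tested):
    """Percent of a chosen test population that has been run."""
    population = set(population)
    if not population:
        raise ValueError("population must not be empty")
    # Only count tested items that belong to the agreed population.
    return 100.0 * len(population & set(tested)) / len(population)

# The restricted population: 80 printers selected for compatibility testing.
printers = [f"printer-{n}" for n in range(1, 81)]
# Hypothetical status: the first 20 have been tested so far.
tested_so_far = printers[:20]

printer_coverage = coverage_percent(printers, tested_so_far)
print(printer_coverage)  # 25.0
```

The same function works for any of the populations listed below, provided you can enumerate the population and record which members have been tested.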
Here are some examples of coverage measures. Some of this list comes from Kaner (1995), which provides some additional context and discussion.
Compatibility
Compatibility with every previous version of the program.
Ability to read every type of data available in every readable input file format. If a file format is subject to subtle variations (e.g. CGM) or has several sub-types (e.g. TIFF) or versions (e.g. dBASE), test each one.
Write every type of data to every available output file format. Again, beware of subtle variations in file formats: if you're writing a CGM file, full coverage would require you to test the readability of your program's output by every one of the main programs that read CGM files.
Every typeface supplied with the product. Check all characters in all sizes and styles. If your program adds typefaces to a collection of fonts that are available to several other programs, check compatibility with the other programs (nonstandard typefaces will crash some programs).
Every type of typeface compatible with the program. For example, you might test the program with (many different) TrueType and Postscript typefaces, and fixed-sized bitmap fonts.
Every piece of clip art in the product. Test each with this program. Test each with other programs that should be able to read this type of art.
Every sound / animation provided with the product. Play them all under different device (e.g. sound) drivers / devices. Check compatibility with other programs that should be able to play this clip-content.
Every supplied (or constructible) script to drive other machines / software (e.g. macros) / BBSs and information services (communications scripts).
All commands available in a supplied communications protocol.
Recognized characteristics. For example, every speaker's voice characteristics (for voice recognition software) or writer's handwriting characteristics (handwriting recognition software) or every typeface (OCR software).
Every type of keyboard and keyboard driver.
Every type of pointing device and driver at every resolution level and ballistic setting.
Every output feature with every sound card and associated drivers.
Every output feature with every type of printer and associated drivers at every resolution level.
Every output feature with every type of video card and associated drivers at every resolution level.
Every output feature with every type of terminal and associated protocols.
Every output feature with every type of video monitor and monitor-specific drivers at every resolution level.
Every color shade displayed or printed to every color output device (video card / monitor / printer / etc.) and associated drivers at every resolution level. And check the conversion to grey scale or black and white.
Every color shade readable or scannable from each type of color input device at every resolution level.
Every possible feature interaction between video card type and resolution, pointing device type and resolution, printer type and resolution, and memory level. This may seem excessively complex, but I've seen crash bugs that occur only under the pairing of specific printer and video drivers at a high resolution setting. Other crashes required pairing of a specific mouse and printer driver, pairing of mouse and video driver, and a combination of mouse driver plus video driver plus ballistic setting.
Every type of CD-ROM drive, connected to every type of port (serial / parallel / SCSI) and associated drivers.
Every type of writable disk drive / port / associated driver. Don't forget the fun you can have with removable drives or disks.
Compatibility with every type of disk compression software. Check error handling for every type of disk error, such as full disk.
Every voltage level from analog input devices.
Every voltage level to analog output devices.
Every type of modem and associated drivers.
Every FAX command (send and receive operations) for every type of FAX card under every protocol and driver.
Every type of connection of the computer to the telephone line (direct, via PBX, etc.; digital vs. analog connection and signalling); test every phone control command under every telephone control driver.
Tolerance of every type of telephone line noise and regional variation (including variations that are out of spec) in telephone signaling (intensity, frequency, timing, other characteristics of ring / busy / etc. tones).
Every variation in telephone dialing plans.
Every possible keyboard combination. Sometimes you'll find trap doors that the programmer used as hotkeys to call up debugging tools; these hotkeys may crash a debuggerless program. Other times, you'll discover an Easter Egg (an undocumented, probably unauthorized, and possibly embarrassing feature). The broader coverage measure is every possible keyboard combination at every error message and every data entry point. You'll often find different bugs when checking different keys in response to different error messages.
Recovery from every potential type of equipment failure. Full coverage includes each type of equipment, each driver, and each error state. For example, test the program's ability to recover from full disk errors on writable disks. Include floppies, hard drives, cartridge drives, optical drives, etc. Include the various connections to the drive, such as IDE, SCSI, MFM, parallel port, and serial connections, because these will probably involve different drivers.
Function equivalence. For each mathematical function, check the output against a known good implementation of the function in a different program. Complete coverage involves equivalence testing of all testable functions across all possible input values.
Zero handling. For each mathematical function, test when every input value, intermediate variable, or output variable is zero or near-zero. Look for severe rounding errors or divide-by-zero errors.
Accuracy of every report.
Look at the correctness of every value, the formatting of every page, and the correctness of the selection of records used in each report.
Accuracy of every message.
Accuracy of every screen.
Accuracy of every word and illustration in the manual.
Accuracy of every fact or statement in every data file provided with the product.
Accuracy of every word and illustration in the on-line help.
Every jump, search term, or other means of navigation through the on-line help.
Check for every type of virus / worm that could ship with the program.
Every possible kind of security violation of the program, or of the system while using the program.
Check for copyright permissions for every statement, picture, sound clip, or other creation provided with the program.
Every string.
Check the program's ability to display and use this string if it is modified by changing the length, using high or low ASCII characters, different capitalization rules, etc.
Every date, number and measure in the program.
Hardware and drivers, operating system versions, and memory-resident programs that are popular in other countries.
Every input format, import format, output format, or export format that would be commonly used in programs that are popular in other countries.
Cross-cultural appraisal of the meaning and propriety of every string and graphic shipped with the program.
Verification of the program against every program requirement and published specification. (How well are we checking these requirements? Are we just hitting line items or getting to the essence of them?)
Verify against every business objective associated with the program (these may or may not be listed in the requirements).
Verification against every regulation (IRS, SEC, FDA, etc.) that applies to the data or procedures of the program.
Automation coverage: Percent of code (lines, branches, paths, etc.) covered by the pool of automated tests for this product.
Scenario (or soap opera) coverage: Percent of code that is covered by the set of scenario tests developed for this program.
Coverage associated with the set of planned tests.
Percent that the planned tests cover non-dead code (code that can be reached by a customer)
Extent to which tests of the important lines, branches, paths, conditions (etc.; important in the eyes of the tester) cover the total population of lines, branches, etc.
Extent to which the planned tests cover the important lines, branches, paths, conditions, etc.
Inspection coverage
How many of the types of users have been covered or simulated.
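Some of the populations in the list above, such as the interaction of video card, pointing device, printer, and memory level, grow multiplicatively. The sketch below shows how the size of such a population and the coverage of it might be computed. The device names, the counts, and the 12 configurations marked as tested are all invented for illustration.

```python
from itertools import product

# Hypothetical device lists; real lists would be far longer.
video_cards = ["video-A", "video-B", "video-C"]
pointing_devices = ["mouse-A", "mouse-B"]
printers = ["printer-A", "printer-B", "printer-C", "printer-D"]
memory_levels = ["low", "high"]

# Every combination of one item from each list is one configuration.
all_configs = list(product(video_cards, pointing_devices, printers, memory_levels))

# Hypothetical status: 12 configurations tested so far.
tested_configs = set(all_configs[:12])

interaction_coverage = 100.0 * len(tested_configs) / len(all_configs)
print(len(all_configs), interaction_coverage)  # 48 25.0
```

Even these tiny lists yield 48 configurations, which is why a group usually has to restrict the population before a percentage becomes meaningful.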
Agreement-based measures start from an agreement about what testing will and will not be done. The agreement might be recorded in a formal test plan or in a memo cosigned by different stakeholders. Or it might simply be a work list that the test group has settled on. The list might be fully detailed, spelling out every test case. Or it might list more general areas of work and describe the depth of testing (and the time budget) for each one. The essence of agreement-based measures is progress against a plan.
All of these measures are proportions: amount of work done divided by the amount of work planned. We can convert these to effort reports easily enough just by reporting the amount of work done.
These can be turned into agreement-based measures if we have an expected level of effort for comparison to the actual level completed.
If you've run several projects, you can compare today's project with historical results. When someone claims that the project has reached "beta", you might compare current status with the state of the other products when they reached the beta milestone. For example, some groups add features until the last minute. At those companies, "all code complete" is not a criterion for alpha or beta milestones. But you might discover that previous projects were 90% code complete 10 weeks before their ship date. Many of the measures flagged as based on risk, effort, or agreement can be used as project-history based if you have measured those variables previously, under circumstances that allow for meaningful comparison, and have the data available for comparison.
As I am using the term, the risk measures focus on risk remaining in the project. Imagine this question: If we were to release the product today, without further testing, what problems would we anticipate? Similarly, if we were to release the product on its intended release date, what problems would we anticipate?
Obstacles are risks to the testing project (or to the development project as a whole). These are the things that make it hard to do the testing or fixing well. This is not intended as a list of everything that can go wrong on a project. Instead, it is a list of common problems that make the testing less efficient.
Turnover of development staff (programmers, testers, writers, etc.)
Number of marketing VPs per release. (Less flippantly, what is the rate of turnover among executives who influence the design of the product?).
Layoffs of testing (or other development) staff.
Number of organizational changes over the life of the project.
Number of people who influence product release, and level of consensus about the product among them.
Appropriateness of the tester to programmer ratio. (Note: I've seen successful ratios ranging from 1:5 through 5:1. It depends on the balance of work split between the groups and the extent to which the programmers are able to get most of the code right the first time.)
Number of testers who speak English (if you do your work in English).
Number of testers who speak the same language as the programmers.
How many bugs are found by isolated (such as remote or offsite) testers compared to testers who are co-located with or in good communication with the programmers?
Number of tests blocked by defects.
List of defects blocking tests.
Defect fix percentage (if low).
Slow defect fix rate compared to find rate.
Average age of open bugs.
Average time from initial report of defect until fix verification.
Number of times a bug is reopened (if high).
Number of promotions and demotions of defect priority
Number of failed attempts to get builds to pass smoke tests.
Number of changes to (specifications or) requirements during testing.
Percentage of tests changed by modified requirements.
Time lost to development issues (such as lack of specifications or features not yet coded).
Time required to test a typical emergency fix.
Percentage of time spent on testing emergency fixes.
Percentage of time spent providing technical support for pre-release users (such as beta testers).
How many billable hours per week (for a consulting firm) or the equivalent task-focused hours (for in-house work) are required of the testers and how does this influence or interfere with their work?
Ability of test environment team (or information systems support team) to build the system to be tested as specified by the programming team.
Time lost to environment issues (such as difficulties obtaining test equipment or configuring test systems, defects in the operating system, device drivers, file system or other 3rd party, system software).
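Several of the obstacle indicators above, such as the average age of open bugs and the defect fix percentage, are simple calculations over bug-tracking data. Here is a minimal sketch. The record layout, the dates, and the reporting date are hypothetical; a real tracking system would supply its own fields.

```python
from datetime import date

# Hypothetical bug records: (date reported, date fixed or None if still open).
bugs = [
    (date(2000, 9, 1), date(2000, 9, 8)),
    (date(2000, 9, 4), None),
    (date(2000, 9, 18), None),
    (date(2000, 9, 20), date(2000, 9, 22)),
]
today = date(2000, 10, 2)  # assumed reporting date

# Average age, in days, of the bugs that are still open.
open_ages = [(today - reported).days for reported, fixed in bugs if fixed is None]
average_open_age = sum(open_ages) / len(open_ages)

# Defect fix percentage: fixed bugs as a share of all reported bugs.
fixed_count = sum(1 for _, fixed in bugs if fixed is not None)
fix_percentage = 100.0 * fixed_count / len(bugs)

print(average_open_age, fix_percentage)  # 21.0 50.0
```

As with the other measures in this paper, the numbers say nothing by themselves about why bugs are staying open; they only flag where to look.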
These help you assess the testing effort. How hard are the testers working, how well are they doing it, what could they improve? A high number for one of these measures might be good for one group and bad for another. For example, in a company that relies on an outside test lab to design a specialized set of tests for a very technical area, we'd expect a high bug find rate from third party test cases. In a company that thinks of its testing as more self-contained, a high rate from third parties is a warning flag.
Number of crises involving test tools.
Number of defects related to test environment.
Number of bugs found by boundary or negative tests vs. feature tests.
Number of faults found in localized versions compared to base code.
Number of faults found in inspected vs. uninspected areas.
Number of defects found by inspection of the testing artifacts.
Number of defects discovered during test design (rather than in later testing).
Number of defects found by 3rd party test cases.
Number of defects found by 3rd party test group vs. your group.
Number of defects found by developers.
Number of defects found by developers after unit testing.
Backlog indicators. For example, how many unverified bug fixes are there (perhaps as a ratio of new bug fixes submitted) or what is the mean time to verify a fix?
How many phone calls were generated during beta.
Number of surprise bugs (bugs you didn't know about) reported by beta testers. (The content of these calls indicates holes in testing or, possibly, weaknesses in the risk analysis that allowed a particular bug to be deferred.)
After the product is released,
Number of "surprisingly serious" defects (deferred problems that have generated more calls or angrier calls than anticipated).
Angry letters to the CEO. (Did testers mis-estimate severity?)
Published criticism of the product. (Did testers mis-estimate visibility?)
Rate of customer complaints for technical support? (Did testers mis-estimate customer impact?)
Number of direct vs. indirect bug finds (were you looking for that bug or did you stumble on it as a side effect of some other test?)
Number of irreproducible bugs (perhaps as a percentage of total bugs found).
Number of noise bugs (issues that did not reflect software errors).
Number of duplicate bugs being reported.
Rumors of off-the-record defects (defects discovered but not formally reported or tracked). The discoveries might be by programmers or by testers who are choosing not to enter bugs into the tracking system, a common problem in companies that pay special attention to bug counts.
Number of test cases that can be automated.
Cyclomatic complexity of automated test scripts.
Size (e.g. lines of code) of test automation code. Appraisal of the code's maintainability and modularity.
Existence of requirements analyses, requirements documents, specifications, test plans and other software engineering processes and artifacts generated for the software test automation effort.
Percentage of time spent by testers writing defect reports.
Percentage of time spent by testers searching for bugs, writing bug reports, or doing focused test planning on this project. (Sometimes, you're getting a lot less testing on a project than you think. Your staff may be so overcommitted that they have almost no time to focus and make real progress on anything. Other times, the deallocation of resources is intentional. For example, one company has an aggressive quality control group who publish bug curves and comparisons across projects. To deal with the political hassles posed by this external group, the software development team sends the testers to the movies whenever the open bug counts get too high or the fix rates get too low.)
Percentage of bugs found by planned vs. exploratory vs. unplanned methods, compared to the percentage that you intended.
What test techniques were applied compared to the population of techniques that you think are applicable compared to the techniques the testers actually know.
Comparative effectiveness: what problems were found by one method of testing or one source vs. another.
Comparative effectiveness over time: compare the effectiveness of different test methods over the life cycle of the product. We might expect simple function tests to yield more bugs early in testing and complex scenario tests to be more effective later.
Complexity of causes over time: is there a trend that bugs found later have more complex causes (for example, require a more complex set of conditions) than bugs found earlier? Are there particular patterns of causes that should be tested for earlier?
Differential bug rates across predictors: defect rates tracked across different predictors. For example, we might predict that applications with high McCabe-complexity numbers would have more bugs. Or we might predict that applications that were heavily changed or that were rated by programmers as more fragile, etc., would have more bugs.
Delta between the planned and actual test effort.
Ability of testers to articulate the test strategy.
Ability of the programmers and other developers to articulate the test strategy.
Approval of the test effort by an experienced, respected tester.
Estimate the prospective effectiveness of the planned tests. (How good do you think these are?)
Estimate the effectiveness of the tests run to date (subjective evaluation by testers, other developers).
Estimate confidence that the right quality product will ship on time (subjective evaluation by testers, other developers).
So what has the test group accomplished? Most of the bug count metrics belong here.
The following examples are based on ideas presented at LAWST and STMR, but I've made some changes to simplify the description, or because additional ideas came up at the meeting (or since) that seem to me to extend the report in useful ways. Occasionally, the difference between my version and the original occurs because (oops) I misunderstood the original presentation.
This description is based on a presentation by Jim Bampos and an ensuing discussion. This illustrates an approach that is primarily focused on agreement-based measures.
For each feature, list different types of testing that would be appropriate (and that you intend to do). Different features might involve different test types. Then, for each feature / testing type pair, determine when the feature will be ready for that type of testing, when you plan to do that testing, how much you plan to do, and what you've achieved.
A spreadsheet that tracked this might look like:
| Feature   | Test Type           | Ready    | Primary Testing (week of . . .) | Time Budget | Time Spent | Notes |
| Feature 1 | Basic functionality | 12/1/00  | 12/1/00                         | _ day       |            |       |
|           | Domain              | 12/1/00  | 12/1/00                         | _ day       |            |       |
|           | Load / stress       | 1/5/01   | 1/15/01                         | 1.5 days    |            |       |
|           | Scenario            | 12/20/00 | 1/20/01                         | 4 days      |            |       |
The chart shows "time spent" but not "how well tested." Supplementing the chart is a list of deliverables and these are reviewed for quality and coverage. Examples of primary deliverables are:
This description is based on a presentation by Elisabeth Hendrickson and an ensuing discussion. It primarily focuses on agreement-based measures. For each component list the appropriate types of testing. (Examples: functionality, install/uninstall, load/stress, import/export, engine, API, performance, customization.) These vary for different components. Add columns for the tester assigned to do the testing, the estimated total time and elapsed time to date, total tests, number passed, failed or blocked and corresponding percentages, and the projected testing time for this build. As Elisabeth uses it, this chart covers testing for a single build, which usually lasts a couple weeks. Later builds get separate charts. In addition to these detail rows, a summary is made by tester and by component.
| Component | Test Type | Tester | Total Tests Planned / Created | Tests Passed / Failed / Blocked | Time Budget | Time Spent | Projected for Next Build | Notes |
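One row of a component map like this, together with the "corresponding percentages" mentioned above, might be represented as follows. This is only a sketch: the class `ComponentRow`, its field names, and the sample numbers are assumptions made for the example, not part of the original chart.

```python
from dataclasses import dataclass

@dataclass
class ComponentRow:
    """One detail row of a component map (field names are illustrative)."""
    component: str
    test_type: str
    tester: str
    total_tests: int
    passed: int
    failed: int
    blocked: int

    def percentages(self):
        """Percent of planned tests passed, failed, blocked, or not yet run."""
        if self.total_tests <= 0:
            raise ValueError("total_tests must be positive")
        not_run = self.total_tests - self.passed - self.failed - self.blocked
        counts = {"passed": self.passed, "failed": self.failed,
                  "blocked": self.blocked, "not run": not_run}
        return {name: 100.0 * count / self.total_tests
                for name, count in counts.items()}

# Hypothetical row for a single build.
row = ComponentRow("Component A", "load/stress", "Tester 1",
                   total_tests=40, passed=20, failed=6, blocked=4)
summary = row.percentages()
print(summary)
```

Summaries by tester or by component can then be built by grouping rows and adding up the counts before computing percentages.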
Gary Halstead and I used a chart like this to manage a fairly complex testing project. This was one of the successful projects that led to my report in Kaner (1996) rather than one of the failures that led to my suggesting in that report that this approach doesn't always work. For our purposes, think of the output from testing project planning as a chart that spans many pages.
This chart might run as many as
50 pages (one row per sub-area task). No one will want to review it at a status
meeting, but it is a useful data collection worksheet. There are various ways
to figure out, each week, how much got done on each task. This administrative
work is not inexpensive. However, it can pay for itself quickly by providing
relatively early notice that some tasks are not being done or are out of control.
A summary of the chart is useful and well received in status reports.
For each area, you can determine from the worksheets the total amount of time
budgeted for the area (just add up the times from the individual tasks), how
much time has actually been spent, and what percentage of work is actually getting
done. (Use a weighted average: a 4 week task should have 4 times as much
effect on the calculation of total percent complete as a 1 week task.) The figure
below illustrates the layout of the summary chart. The chart shows every area.
The areas are labeled I, II, and III in the figure. You might prefer meaningful
names. For each area, there are three bars. The first shows total time budgeted,
the second shows time spent on this area, the third shows percentage of this
area's work that has been completed. Once you get used to the chart, you
can tell at a glance whether the rate of getting work done is higher or lower
than your rate of spending your budgeted time.
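The weighted average is easy to get wrong by averaging the task percentages directly. The following is a minimal sketch of the calculation, assuming each worksheet row records weeks budgeted, weeks spent, and an estimated fraction complete. The area labels follow the figure, but the task data are hypothetical, and every area is assumed to have a nonzero budget.

```python
# Hypothetical worksheet rows per area:
# (weeks budgeted, weeks spent, fraction complete)
WORKSHEET = {
    "I": [(4.0, 3.0, 0.50), (1.0, 1.0, 1.00)],
    "II": [(2.0, 0.5, 0.25), (2.0, 1.5, 0.75)],
}

def summarize_area(tasks):
    """Return the summary figures for one area."""
    budget = sum(weeks for weeks, _, _ in tasks)
    spent = sum(used for _, used, _ in tasks)
    # Weight each task's completion by its budget, so a 4-week task
    # counts 4 times as much as a 1-week task.
    earned = sum(weeks * done for weeks, _, done in tasks)
    return {
        "budget_weeks": budget,
        "spent_weeks": spent,
        "pct_budget_spent": round(100 * spent / budget, 1),
        "pct_complete": round(100 * earned / budget, 1),
    }

summary = {area: summarize_area(tasks) for area, tasks in WORKSHEET.items()}
for area, bars in summary.items():
    print(area, bars)
```

In this sample, area I has spent 80% of its budget but completed only 60% of its work, which is the kind of gap the summary chart is meant to expose. An unweighted average of the two task percentages would report 75% complete and hide much of that gap.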
Some test groups submit a report on the status of each testing project every week. I think this is a useful practice.
The memo can include different types of information. Borrowing an analogy from newspapers, I put the bug counts (the software equivalent of sports statistics, the most-read part of local newspapers) back several pages. People will fish through the report to find them.
The front page covers issues that might need or benefit from management attention, such as lists of deliverables due (and when they are due) from other groups, decisions needed, bugs that are blocking testing, and identification of other unexpected obstacles, risks, or problems. I might also brag about significant staff accomplishments.
The second page includes some kind of chart that shows progress against plan. Any of the charts described so far (the feature map, the component map, the summary status chart) would be useful on the second page.
The third page features bug counts. The fourth page typically lists the recently deferred bugs, which we'll talk about at the weekly status meeting.
Across the four pages (or four sections), you can fit information about effort, obstacles, risk, agreement-based status, bug counts, and anything else you want to bring to the project team's or management's attention.
When you assign someone to work on a project, you might not get much of their time for that project. They have other things to do. Further, even if they were spending 100% of their time on that project, they might only be running tests a few hours per week. The rest of their time might be spent on test planning, meetings, writing bug reports, and so on. The following chart is a way of capturing how your staff spend their time. Rather than using percentages, you might prefer to work in hours. This might help you discover that the only people who are getting to the testing tasks are doing them in their overtime. (Yes, they are spending 20% of their time on task, but that 20% is 10 hours of a 50-hour week.) Effort-based reporting is useful when people are trying to add new projects or new tasks to your group's workload, or when they ask you to change your focus. You show the chart and ask which things to cut back on, in order to make room for the new stuff.
Task | 0-10% | 10%-20% | 20%-30% | 30%-40% | 40%-60%
Coordination | 8 | 2 | 0 | 2 | 0
Status Reporting | 12 | 0 | 0 | 0 | 0
Setup Environment | 4 | 3 | 3 | 0 | 1
Scoping | 5 | 5 | 2 | 0 | 0
Ad Hoc | 7 | 2 | 1 | 2 | 0
Design/Doc | 1 | 3 | 5 | 1 | 2
Execution | 1 | 3 | 5 | 3 | 0
Inspections | 7 | 5 | 0 | 0 | 0
Maintenance | 9 | 3 | 0 | 0 | 0
Review | 10 | 2 | 0 | 0 | 0
Bug Advocacy | 5 | 5 | 2 | 0 | 0
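Reporting hours alongside percentages makes the overtime point concrete. The sketch below assumes a hypothetical weekly time log for one tester, in hours per task category. The category names come from the chart, but the hours are invented for illustration.

```python
# Hypothetical weekly log for one tester, in hours per task category.
LOG = {
    "Coordination": 10,
    "Status Reporting": 5,
    "Design/Doc": 15,
    "Bug Advocacy": 10,
    "Execution": 10,
}

def effort_report(log):
    """Map each task to (hours, whole-number percent of the week's total)."""
    total = sum(log.values())
    return {task: (hours, round(100 * hours / total)) for task, hours in log.items()}

report = effort_report(LOG)
week_total = sum(LOG.values())
print("Hours logged this week:", week_total)
for task, (hours, percent) in report.items():
    print(f"{task}: {hours} h ({percent}%)")
```

With this sample log, test execution is 20% of the tester's time, but the total shows that this is 10 hours of a 50-hour week, which a percentage-only chart would not reveal.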
Jim Kandler manages testing in an FDA-regulated context. He tracks/reports the progress of the testing effort using a set of charts similar to the following four:
Test | Requirement 1 | Requirement 2 | Requirement 3 | Requirement 4 | Requirement 5 | Etc.
Test 1 | * | * ||||
Test 2 | * | * | * |||
Test 3 | * |||||
Totals | 1 | 1 | 3 | 0 | 1 |
The chart shows several things:
Notice that if we don't write down a requirement, we won't know from this chart what the level of coverage against that requirement is. This is less of an issue in an FDA-regulated shop because all requirements (claims) must be listed, and there must be tests that trace back to them.
For each test case, this chart tracks whether it is written, has been reviewed, has been used, and what bugs have been reported on the basis of it. If the test is blocked (you can't complete it because of a bug), the chart shows when the blocking bug was discovered.
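The totals row of a traceability matrix is a count, for each listed requirement, of the tests that trace to it. Here is a minimal sketch of that count. The requirement and test identifiers are hypothetical and are not a reconstruction of Jim's chart.

```python
# Hypothetical requirement list and test-to-requirement mapping.
REQUIREMENTS = ["R1", "R2", "R3", "R4"]
TRACE = {
    "T1": {"R1", "R2"},
    "T2": {"R2"},
    "T3": {"R2", "R4"},
}

def coverage_totals(requirements, trace):
    """Count, for each listed requirement, the tests that trace to it."""
    return {
        req: sum(req in covered for covered in trace.values())
        for req in requirements
    }

totals = coverage_totals(REQUIREMENTS, TRACE)
untested = [req for req, count in totals.items() if count == 0]
print(totals)
print("No tests trace to:", untested)
```

The count only covers requirements that appear in the list. A requirement that was never written down produces no column and no zero, so it cannot be flagged as untested.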
Many companies operate with a relatively small number of complex regression tests (such as 500). A chart that shows the status of each test is manageable in size and can be quite informative.
Test Case ID | Written | Reviewed | Executed | Blocked | Software Change Request
1 | 1 | 1 | | |
2 | 1 | 1 | 1 | | 118
3 | 1 | 1 | 1 | 10/15/00 | 113
4 | 1 | | | |
5 | | | | |
6 | | | | |
7 | | | | |
Totals 7 | 4 | 3 | 2 | |
100% | 57% | 43% | 29% | |
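The totals and percentages at the bottom of the status chart are simple counts over the per-test flags. As a sketch, assume each test case carries a set of the stages it has completed. The stage names and the sample flags below are illustrative, chosen so that the counts match the chart's totals row.

```python
STAGES = ("written", "reviewed", "executed")

# Illustrative status flags for seven test cases.
CASES = {
    1: {"written", "reviewed"},
    2: {"written", "reviewed", "executed"},
    3: {"written", "reviewed", "executed"},
    4: {"written"},
    5: set(),
    6: set(),
    7: set(),
}

def stage_totals(cases):
    """Map each stage to (count, whole-number percent of all planned tests)."""
    planned = len(cases)
    report = {}
    for stage in STAGES:
        count = sum(stage in done for done in cases.values())
        report[stage] = (count, round(100 * count / planned))
    return report

status = stage_totals(CASES)
for stage, (count, percent) in status.items():
    print(f"{stage}: {count} of {len(CASES)} ({percent}%)")
```

The percentages use the number of planned tests as the denominator, so they show progress against the plan rather than against the tests written so far.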
For each tester, this shows how many hours the person spent testing that week, how many bugs (issues) she found and therefore how many hours per issue. The numbers are rolled up across testers for a summary report to management that shows the hours and issues of the group as a whole.
Week of | Hours | Issues | Hr/Issue
20 Aug | 19.5 | 19 | 1.03
27 Aug | 40.0 | 35 |
3 Sept | 62 | 37 |
10 Sept | 15 | 10 |
A chart like this helps the manager discover that individual staff members have conflicting commitments (not enough hours available for testing) or that they are taking remarkably short or long times per bug. Trends (over a long series of weeks) in availability and in time per failure can be useful for highlighting problems with individuals, the product, or the project staffing plan. As a manager, I would not share the individuals' chart with anyone outside of my group. There are too many risks of side effects. Even the group totals invite side effects, but the risk might be more manageable.
Bug handling varies enough across companies that I'll skip the chart. For each bug, it is interesting to know when it was opened, how long it's been open, who has to deal with it next, and its current resolution status (under investigation, fixed, returned for more info, deferred, etc.). Plenty of other information might be interesting. If you put too much in one chart, no one will understand it.
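The Hr/Issue column is hours divided by issues. The sketch below computes it for the four weeks of hours and issues shown in the chart and adds a total across those weeks. The function name and the handling of a week with no issues are assumptions.

```python
# (week, hours spent testing, issues reported), as listed in the chart.
WEEKS = [
    ("20 Aug", 19.5, 19),
    ("27 Aug", 40.0, 35),
    ("3 Sept", 62.0, 37),
    ("10 Sept", 15.0, 10),
]

def hours_per_issue(hours, issues):
    """Hours per issue to two decimals, or None for a week with no issues."""
    return round(hours / issues, 2) if issues else None

per_week = {week: hours_per_issue(hours, issues) for week, hours, issues in WEEKS}
total_hours = sum(hours for _, hours, _ in WEEKS)
total_issues = sum(issues for _, _, issues in WEEKS)
overall = hours_per_issue(total_hours, total_issues)

for week, rate in per_week.items():
    print(week, rate)
print("All weeks:", total_hours, "hours,", total_issues, "issues,", overall, "hr/issue")
```

The overall figure divides total hours by total issues rather than averaging the weekly rates, so that a short week does not count as much as a long one.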
Jim shared some lessons learned with these four charts (traceability, test case status, effort, and bug statistics) that are generally applicable for project status reporting. In summary, he said:
These lessons apply well in highly regulated industries. In mass-market software, some of them will be less applicable.
James Bach has spoken several times about his project status dashboard. See, for example, Bach (1999), which includes the following picture as an example. This is a simple tool, but a very useful one.
The dashboard is created on a large whiteboard, typically in the main project conference room. Different groups vary the columns and the headings. Bach prefers to keep the design as simple and uncluttered as possible. Here's the variation that I prefer (which breaks one of James' columns in two).
A manager can read the testing project status at a glance from this board. If there are lots of frowny faces and red rows, the project has trouble. If planned and achieved coverage don't match, the product needs more testing. Update the dashboard frequently (every day or few days) and it becomes a focal point for project status discussions. Bach (1999) provides more information about the dashboard in practice. His live presentation is worth attending.
Many test groups focus their progress reports on bug counts. When you ask how much testing they've done, they say, "322 bugs' worth." Or they speak in terms of a product coverage measure, like code coverage. There are many other ways to answer the questions:
This paper has provided a survey of suggested answers, a framework for evaluating them, and samples of presentation approaches used by several successful test managers and consultants. This paper doesn't provide what we might think that we really need: a small set of measures that are known to be useful and valid, and a standardized reporting style that everyone will anticipate and learn to understand. It takes time to develop and validate a good set of metrics. I'm not there yet. I don't think the field is there yet. I don't expect us to be there next year or the year after, but we'll make progress. Unless we have a breakthrough, progress is incremental.
Amland (1999), "Risk Based Testing and Metrics" 16th International Conference on Testing Computer Software.
Austin, R.D. (1996) Measuring and Managing Performance in Organizations.
Bach, James (1999) "A Low Tech Testing Dashboard", STAR Conference East, Available at www.satisfice.com/presentations/dashboard.pdf.
Bach, Jonathan. (2000) "Measuring Ad Hoc Testing", STAR Conference West.
Beizer, B. (1990) Software Testing Techniques (2nd Ed.)
Brown, S. (Ed.) (1991) The Product Liability Handbook: Prevention, Risk, Consequence, and Forensics of Product Failure.
Cornett, S., Code Coverage Analysis, www.bullseye.com/coverage.html
Dhillon, B.S. (1986) Human Reliability With Human Factors.
Fenton, N. & Pfleeger S. (1997) Software Metrics: A Rigorous & Practical Approach.
Glass, R.L. (1992) Building Quality Software.
Grady, R.B. (1992) Practical Software Metrics for Project Management and Process Improvement.
Grady, R.B. & D.L. Caswell, (1987) Software Metrics: Establishing a Company-Wide Program.
Hoffman, D. (2000) "The darker side of metrics", Pacific Northwest Software Quality Conference.
Johnson, M.A. (1996) Effective and Appropriate Use of Controlled Experimentation in Software Development Research, Master's Thesis (Computer Science), Portland State University.
Jones, C. (1991) Applied Software Measurement, p. 238-341.
Kaner, C. (1995) "Software negligence and testing coverage", Software QA Quarterly, Volume 2, #2, 18. Available at www.kaner.com.
Kaner, C. (1996) "Negotiating testing resources: A collaborative approach", Ninth International Software Quality Week Conference. Available at www.kaner.com.
Kaner, C. (1997) "The impossibility of complete testing", Software QA, Volume 4, #4, 28. Available at www.kaner.com.
Kaner, C. (1999) "Yes, but what are we measuring?", Pacific Northwest Software Quality Conference, available from the author at kaner@kaner.com.
Kaner, C., J. Falk, & H.Q. Nguyen (1993, reprinted 1999) Testing Computer Software (2nd. Ed.)
Kitchenham, Pfleeger, & Fenton (1995) "Towards a framework for software measurement and validation." IEEE Transactions on Software Engineering, vol. 21, December, 929.
Marick, B. (1995) The Craft of Software Testing.
Marick, B. (1999), "How to Misuse Code Coverage", 16th International Conference on Testing Computer Software. Available at www.testing.com/writings.html.
Marick, B. (2000), "Faults of Omission", Software Testing & Quality Engineering, January issue. Available at www.testing.com/writings.html.
Myers, G. (1979) The Art of Software Testing.
Schneidewind, N. (1994) "Methodology for Validating Software Metrics," Encyclopedia of Software Engineering (Marciniak, ed.) see also IEEE 1061, Standard for a Software Quality Metrics Methodology.
Weinberg, G.M. (1993) Quality Software Management, Volume 2, First-Order Measurement.
Weinberg, G.M. & E.L. Schulman (1974) "Goals and performance in computer programming," Human Factors, 16(1), 70-77.
Zuse, H. (1997) A Framework of Software Measurement.