Measurement of the Extent of Testing

Invited Address

Pacific Northwest Software Quality Conference

Portland, Oregon

October 17, 2000

 

Acknowledgment

Much of the material in this paper was presented or developed by the participants of the Software Test Managers Roundtable (STMR) and the Los Altos Workshop on Software Testing (LAWST).

  • STMR 1 (October 3, November 1, 1999) focused on the question, How to deal with too many projects and not enough staff? Participants included Jim Bampos, Sue Bartlett, Jennifer Brock, David Gelperin, Payson Hall, George Hamblen, Mark Harding, Elisabeth Hendrickson, Kathy Iberle, Herb Isenberg, Jim Kandler, Cem Kaner, Brian Lawrence, Fran McKain, Steve Tolman and Jim Williams.
  • STMR 2 (April 30, May 1, 2000) focused on the topic, Measuring the extent of testing. Participants included James Bach, Jim Bampos, Bernie Berger, Jennifer Brock, Dorothy Graham, George Hamblen, Kathy Iberle, Jim Kandler, Cem Kaner, Brian Lawrence, Fran McKain, and Steve Tolman.
  • LAWST 8 (December 4-5, 1999) focused on Measurement. Participants included Chris Agruss, James Bach, Jaya Carl, Rochelle Grober, Payson Hall, Elisabeth Hendrickson, Doug Hoffman, III, Bob Johnson, Mark Johnson, Cem Kaner, Brian Lawrence, Brian Marick, Hung Nguyen, Bret Pettichord, Melora Svoboda, and Scott Vernon.
Facilities and other support for STMR were provided by Software Quality Engineering, which hosts these meetings in conjunction with the STAR conferences. Facilities for LAWST were provided by the University of California (Extension) Santa Cruz.

Background

At PNSQC a year ago, I spoke with you about problems inherent in software measurement, and described a framework for developing and evaluating measures (Kaner, 1999). Similar approaches have been laid out by other authors on software measurement (Kitchenham, Pfleeger, & Fenton, 1995; Zuse, 1997) and in other fields, such as physics (see, e.g., Sydenham, Hancock & Thorn, 1989) and psychometrics (see, e.g., Allen & Yen, 1979). My point of departure from the traditional software measurement literature is in the extent to which I ask (a) whether commonly used measures are valid, (b) how we can tell, and (c) what kinds of side effects we are likely to encounter if we use a measure. Even though many authors mention these issues, I don’t think they’ve been explored in enough depth and I don’t think that many of us are thinking carefully enough about the risks associated with using a poor software measure (Austin, 1996; Hoffman, 2000).

Developing a valid, useful measure is not easy. It takes time, combined with theoretical and empirical work. In Kaner (1999), I provided detailed examples from the history of psychophysics to illustrate the development of a field’s measures over a 100-year period.

The most common question that has come to me in response to that paper (Kaner, 1999) is what measures I think are valid and useful. I don’t have a simple answer to that, but I can report on some work in progress regarding measurement of the extent of testing of a product. This paper is a progress report, based primarily on the work of colleagues who manage software test groups or consult to test managers. It describes some of the data they collect and how they report it to others. Some of the material in this paper will be immediately useful to some readers, but the point of the paper is not to present finished work.
I’m still at the stage of collecting and sifting ideas, looking for common threads and themes, trying to get a better idea of the question that we intend when we ask how to measure how much testing has been done, and to understand the possible dimensions of an answer to such a question. There’s an enormous amount of detail in this paper and you might get lost. Here’s the one thing that I would most like you to take away from the paper:

Bug count metrics reflect only a small part of the work and progress of the testing group. Many alternatives look more closely at what has to be done and what has been done. These will often be more useful and less prone to side effects than bug count metrics.

The Measurement Framework

To evaluate any proposed measure (or metric), I propose that we ask the following ten questions. For additional details, see Kaner (1999):

  1. What is the purpose of this measure? Some measures are used on a confidential basis between friends. The goal is to spot trends, and perhaps to follow those up with additional investigation, coaching, or exploration of new techniques. A measure used only for this purpose can be useful and safe even if it is relatively weak and indirect. Higher standards must apply as the measurements become more public or as consequences (rewards or punishments) become attached to them.
  2. What is the scope of this measure? Circumstances differ across groups, projects, managers, companies, countries. The wider the range of situations and people you want to cover with the method, the wider the range of issues that can invalidate or be impacted by the measure.
  3. What attribute are we trying to measure? If you only have a fuzzy idea of what you are trying to measure, your measure will probably bear only a fuzzy relationship to whatever you had in mind.
  4. What is the natural scale of the attribute? We might measure a table’s length in inches, but what units should we use for extent of testing?
  5. What is the natural variability of the attribute? If you measure two supposedly identical tables, their lengths are probably slightly different. Similarly, your weight varies a little bit from day to day. What are the inherent sources of variation of "extent of testing"?
  6. What instrument are we using to measure the attribute and what reading do we take from the instrument? You might measure length with a ruler and come up with a reading (a measurement) of 6 inches.
  7. What is the natural scale of the instrument? Fenton & Pfleeger (1997) discuss this in detail.
  8. What is the natural variability of the readings? This is normally studied in terms of "measurement error."
  9. What is the relationship of the attribute to the instrument? What mechanism causes an increase in the reading as a function of an increase in the attribute? If we increase the attribute by 20%, what will show up in the next measurement? Will we see a 20% increase? Any increase?
  10. What are the natural and foreseeable side effects of using this instrument? If we change our circumstances or behavior in order to improve the measured result, what impact are we going to have on the attribute? Will a 20% increase in our measurement imply a 20% improvement in the underlying attribute? Can we change our behavior in a way that optimizes the measured result but without improving the underlying attribute at all? What else will we affect when we do what we do to raise the measured result? Austin (1996) explores the dangers of measurement–the unintended side effects that result–in detail, across industries. Hoffman (2000) describes several specific side effects that he has seen during his consultations to software companies.

The phrase "directness of measurement" is often bandied about in the literature. If "directness of measurement" means anything, it must imply something about the monotonicity (preferably linearity?) of the relationship between attribute and instrument as considered in both #9 and #10. The phrases "hard measure" and "soft measure" are also bandied about a lot. Let me suggest that a measure is "hard" to the extent that we can describe and experimentally investigate a mechanism that underlies the relationship between an attribute and its measure and that accurately predicts the rate of change in one variable as a function of the other.

Exploring the Question

When someone asks us for a report on the amount of testing that we’ve completed, what do they mean? The question is ambiguous. At an early point in a class he teaches, James Bach has students (experienced testers) do some testing of a simple program. At the end of the exercise, James asks them (on a scale from 0 to 100, where 100 means completely tested) how much testing they’ve done, or how much they would get done if they tested for another 8 hours. Different students give strikingly different answers based on essentially the same data. They also justify their answers quite differently. We found the same broad variation at STMR (a collection of experienced test managers and test management consultants). If someone asks you how much of the testing you (or your group) has done, and you say 30%, you might be basing your answer on any combination of the following dimensions (or on some others that I’ve missed):

Another Look at the Question

These questions overlap with the measurement framework analysis but in discussions, they seem to stimulate different answers and different ideas. Therefore, for now, I’m keeping them as an independently useful list.

The Rest of this Paper

The material that follows lists and organizes some of the ideas and examples that we (see the Acknowledgment, above) collected or developed over the last year. I have filtered out many suggestions but the lists that remain are still very broad. My intent is to show a range of thinking, to provide you with a collection of ideas from many sources, not (yet) to recommend that you use a particular measure or combination of them.

The ordering and grouping of ideas here are for convenience. The material could be reorganized in several other ways, and probably in some better ways. Suggestions are welcome.

Let me stress that I do not endorse or recommend the listed measures. I think that some of them are likely to cause more harm than good. My objective is to list ideas that reasonable people in the field have found useful, even if reasonable people disagree over their value. If you are considering using one, or using a combination of them, I suggest that you evaluate the proposed measure using the measurement framework described above.

The sections below do not make a good enough connection with the software measurement literature. Several sources (for example, Zuse, 1997, and Ross Collard’s software testing course notes) provide additional measures or details. As I suggested at the start, this paper reports work in progress. The next report will provide details that this one lacks.

Coverage-Based Measures

"Coverage" is sometimes interpreted in terms of a specific measure, usually statement coverage or branch coverage. You achieve 100% statement coverage if you execute every statement (such as, every line) in the program. You achieve 100% branch coverage if you execute every statement and take every branch from one statement to another. (So, if there were four ways to reach a given statement, you would try all four.) I’ll call this type of coverage (coverage based on statements, branches, perhaps also logical conditions), code coverage.
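The difference between statement and branch coverage can be made concrete with a small sketch. The function `shipping_fee` and the helper `branch_coverage` are hypothetical, invented only for this illustration. In practice a coverage tool would do the counting.

```python
def shipping_fee(is_member):
    """Hypothetical function with a single decision and no else-branch."""
    fee = 5.0
    if is_member:      # one decision with two outcomes: True and False
        fee = 0.0
    return fee


def branch_coverage(test_inputs):
    """Percent of the two decision outcomes exercised by the given inputs."""
    outcomes_seen = {bool(value) for value in test_inputs}
    return 100.0 * len(outcomes_seen) / 2


# A single test with is_member=True executes every statement in shipping_fee,
# so statement coverage is 100%, yet only one of the two branch outcomes runs.
assert shipping_fee(True) == 0.0
print(branch_coverage([True]))         # 50.0

# Adding the non-member case exercises the other outcome as well.
assert shipping_fee(False) == 5.0
print(branch_coverage([True, False]))  # 100.0
```

A test suite can therefore report complete statement coverage while leaving a branch outcome untested.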

Code coverage is a tidy measure–it is easy to count, unambiguous, and easy to explain. Unfortunately, this measure carries risks (Marick, 1999). It is easy (and not uncommon) to write a set of relatively weak tests that hit all of the statements and conditions but don’t necessarily hit them very hard. Additionally, code coverage is incomplete. For examples of incompleteness:

Sometimes, the most important coverage measure has nothing to do with code coverage. For example, I worked on a product that had to print well. This was an essential benefit of the product. We selected 80 printers for detailed compatibility testing. We tracked the percentage of those 80 printers that the program could pass. This was a coverage measure, but it had nothing to do with lines of code. We might pass through exactly the same code (in our program) when testing two printers but fail with only one of them (because of problems in their driver or firmware).

You have a coverage measure if you can imagine any kind of testing that can be done, and a way to calculate what percent of that kind of testing you’ve done. Similarly, you have a coverage measure if you can calculate what percentage of testing you’ve done of some aspect of the program or its environment. As with the 80 printers, you might artificially restrict the population of tests (we decided that testing 80 printers was good enough under the circumstances) and compute the percentage of that population that you have run.
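As a minimal sketch of this idea, the calculation below treats coverage as the share of a deliberately restricted population that has been tested. The function name `coverage_percent`, the printer labels, and the figure of 60 printers tested are all assumptions made for illustration. Only the population of 80 printers comes from the example above.

```python
def coverage_percent(items_tested, population):
    """Percent of a chosen population of tests that has been run."""
    if not population:
        raise ValueError("population must not be empty")
    # Count only items that belong to the agreed population.
    tested = set(items_tested) & set(population)
    return 100.0 * len(tested) / len(population)


# The restricted population: 80 printers selected for compatibility testing.
printers = [f"printer-{n}" for n in range(1, 81)]

# Hypothetical progress: 60 of the 80 have been tested so far.
tested_so_far = printers[:60]

print(coverage_percent(tested_so_far, printers))  # 75.0
```

The same function works for any population you can enumerate, such as features, configurations, or requirements. The measure is only as meaningful as the choice of population.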

Here are some examples of coverage measures. Some of this list comes from Kaner (1995), which provides some additional context and discussion.

Operation of every function / feature / data handling operation under:

Compatibility

Computation

Information Content

Threats, Legal Risks

Usability tests of:

Localizability / localization tests:

Verifications

Coverage of specific types of tests:

Inspection coverage

  1. How much code has been inspected.
  2. How many of the requirements have been inspected or reviewed.
  3. How many of the design documents have been inspected.
  4. How many of the unit tests have been inspected.
  5. How many of the black box tests have been inspected.
  6. How many of the automated tests have been inspected.
  7. How many test artifacts have been reviewed by developers.

User-focused testing

Agreement-Based Measures

Agreement-based measures start from an agreement about what testing will and will not be done. The agreement might be recorded in a formal test plan or in a memo cosigned by different stakeholders. Or it might simply be a work list that the test group has settled on. The list might be fully detailed, spelling out every test case. Or it might list more general areas of work and describe the depth of testing (and the time budget) for each one. The essence of agreement-based measures is progress against a plan.

All of these measures are proportions: amount of work done divided by the amount of work planned. We can convert these to effort reports easily enough just by reporting the amount of work done.

Effort-Based Measures

These can be turned into agreement-based measures if we have an expected level of effort for comparison to the actual level completed.

Project-History Based Measures

If you’ve run several projects, you can compare today’s project with historical results. When someone claims that the project has reached "beta", you might compare current status with the state of the other products when they reached the beta milestone. For example, some groups add features until the last minute. At those companies, "all code complete" is not a criterion for alpha or beta milestones. But you might discover that previous projects were 90% code complete 10 weeks before their ship date. Many of the measures flagged as based on risk, effort, or agreement can be used as project-history based if you have measured those variables previously, under circumstances that allow for meaningful comparison, and have the data available for comparison.

Risk Based Measures

As I am using the term, the risk measures focus on risk remaining in the project. Imagine this question: If we were to release the product today, without further testing, what problems would we anticipate? Similarly, if we were to release the product on its intended release date, what problems would we anticipate?

Obstacle Reports

Obstacles are risks to the testing project (or to the development project as a whole). These are the things that make it hard to do the testing or fixing well. This is not intended as a list of everything that can go wrong on a project. Instead, it is a list of common problems that make the testing less efficient.

Evaluation-of-Testing Based Measures

These help you assess the testing effort. How hard are the testers working, how well are they doing it, what could they improve? A high number for one of these measures might be good for one group and bad for another. For example, in a company that relies on an outside test lab to design a specialized set of tests for a very technical area, we’d expect a high bug find rate from third party test cases. In a company that thinks of its testing as more self-contained, a high rate from third parties is a warning flag.

Results Reports

So what has the test group accomplished? Most of the bug count metrics belong here.

Progress Reporting Examples

The following examples are based on ideas presented at LAWST and STMR, but I’ve made some changes to simplify the description, or because additional ideas came up at the meeting (or since) that seem to me to extend the report in useful ways. Occasionally, the difference between my version and the original occurs because (oops) I misunderstood the original presentation.

Feature Map

This description is based on a presentation by Jim Bampos and an ensuing discussion. This illustrates an approach that is primarily focused on agreement-based measures.

For each feature, list different types of testing that would be appropriate (and that you intend to do). Different features might involve different test types. Then, for each feature / testing type pair, determine when the feature will be ready for that type of testing, when you plan to do that testing, how much you plan to do, and what you’ve achieved.

A spreadsheet that tracked this might look like:

| Feature | Test Type | Ready | Primary Testing (week of . . .) | Time Budget | Time Spent | Notes |
|---|---|---|---|---|---|---|
| Feature 1 | Basic functionality | 12/1/00 | 12/1/00 | _ day | | |
| | Domain | 12/1/00 | 12/1/00 | _ day | | |
| | Load / stress | 1/5/01 | 1/15/01 | 1.5 days | | |
| | Scenario | 12/20/00 | 1/20/01 | 4 days | | |

   

The chart shows "time spent" but not "how well tested." Supplementing the chart is a list of deliverables and these are reviewed for quality and coverage. Examples of primary deliverables are:

Component Map

This description is based on a presentation by Elisabeth Hendrickson and an ensuing discussion. It primarily focuses on agreement-based measures. For each component list the appropriate types of testing. (Examples: functionality, install/uninstall, load/stress, import/export, engine, API, performance, customization.) These vary for different components. Add columns for the tester assigned to do the testing, the estimated total time and elapsed time to date, total tests, number passed, failed or blocked and corresponding percentages, and the projected testing time for this build. As Elisabeth uses it, this chart covers testing for a single build, which usually lasts a couple weeks. Later builds get separate charts. In addition to these detail rows, a summary is made by tester and by component.

| Component | Test Type | Tester | Total Tests Planned / Created | Tests Passed / Failed / Blocked | Time Budget | Time Spent | Projected for Next Build | Notes |
|---|---|---|---|---|---|---|---|---|
| | | | | | | | | |

                 
                 

 

Area / Feature Summary Status Chart

Gary Halstead and I used a chart like this to manage a fairly complex testing project. This was one of the successful projects that led to my report in Kaner (1996) rather than one of the failures that led to my suggesting in that report that this approach doesn’t always work. For our purposes, think of the output from testing project planning as a chart that spans many pages.

This chart might run as many as 50 pages (one row per sub-area task). No one will want to review it at a status meeting, but it is a useful data collection worksheet. There are various ways to figure out, each week, how much got done on each task. This administrative work is not inexpensive. However, it can pay for itself quickly by providing relatively early notice that some tasks are not being done or are out of control. A summary of the chart is useful and well received in status reports. For each area, you can determine from the worksheets the total amount of time budgeted for the area (just add up the times from the individual tasks), how much time has actually been spent, and what percentage of work is actually getting done. (Use a weighted average–a 4 week task should have 4 times as much effect on the calculation of total percent complete as a 1 week task.) The figure below illustrates the layout of the summary chart. The chart shows every area. The areas are labeled I, II, and III in the figure. You might prefer meaningful names. For each area, there are three bars. The first shows total time budgeted, the second shows time spent on this area, the third shows percentage of this area’s work that has been completed. Once you get used to the chart, you can tell at a glance whether the rate of getting work done is higher or lower than your rate of spending your budgeted time.
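The weighted average described above can be sketched as follows. The function name `area_summary` and the two sample tasks are hypothetical. The only rule taken from the text is that each task counts in proportion to its time budget.

```python
def area_summary(tasks):
    """Summarize one area.

    tasks: list of (budget_weeks, spent_weeks, fraction_complete) tuples,
    one per sub-area task on the worksheet.
    Returns (total_budget, total_spent, percent_complete, percent_of_budget_spent).
    """
    total_budget = sum(budget for budget, _, _ in tasks)
    total_spent = sum(spent for _, spent, _ in tasks)
    # Weight each task's completion by its budget, so a 4-week task
    # counts 4 times as much as a 1-week task.
    weighted_done = sum(budget * done for budget, _, done in tasks)
    percent_complete = 100.0 * weighted_done / total_budget
    percent_spent = 100.0 * total_spent / total_budget
    return total_budget, total_spent, percent_complete, percent_spent


# Hypothetical area: a 4-week task that is half done after 3 weeks,
# and a 1-week task that is finished after 1 week.
area_one = [(4, 3, 0.50), (1, 1, 1.00)]

budget, spent, pct_complete, pct_spent = area_summary(area_one)
print(budget, spent, pct_complete, pct_spent)  # 5 4 60.0 80.0
```

In this made-up case an unweighted average of the two tasks would report 75% complete. The weighted figure is 60%, against 80% of the budget already spent, which is the kind of early warning the summary chart is meant to give.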

Project Status Memo

Some test groups submit a report on the status of each testing project every week. I think this is a useful practice.

The memo can include different types of information. Borrowing an analogy from newspapers, I put the bug counts (the software equivalent of sports statistics, the most-read part of local newspapers) back several pages. People will fish through the report to find them.

The front page covers issues that might need or benefit from management attention, such as lists of deliverables due (and when they are due) from other groups, decisions needed, bugs that are blocking testing, and identification of other unexpected obstacles, risks, or problems. I might also brag about significant staff accomplishments.

The second page includes some kind of chart that shows progress against plan. Any of the charts described so far (the feature map, the component map, the summary status chart) would be useful on the second page.

The third page features bug counts. The fourth page typically lists the recently deferred bugs, which we’ll talk about at the weekly status meeting.

Across the four pages (or four sections), you can fit information about effort, obstacles, risk, agreement-based status, bug counts, and anything else you want to bring to the project team’s or management’s attention.

Effort Report

When you assign someone to work on a project, you might not get much of their time for that project. They have other things to do. Further, even if they were spending 100% of their time on that project, they might only be running tests a few hours per week. The rest of their time might be spent on test planning, meetings, writing bug reports, and so on. The following chart is a way of capturing how your staff spend their time. Rather than using percentages, you might prefer to work in hours. This might help you discover that the only people who are getting to the testing tasks are doing them in their overtime. (Yes, they are spending 20% of their time on task, but that 20% is 10 hours of a 50 hour week.) Effort-based reporting is useful when people are trying to add new projects or new tasks to your group’s workload, or when they ask you to change your focus. You show the chart and ask which things to cut back on, in order to make room for the new stuff.  
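The point about percentages versus hours can be illustrated with a short sketch. The task names are borrowed from the chart in this section, but the hours, the 40-hour standard week, and the function name `effort_breakdown` are assumptions made for illustration.

```python
STANDARD_WEEK_HOURS = 40  # assumption: hours beyond this count as overtime


def effort_breakdown(hours_by_task):
    """Return {task: (hours, percent_of_total)} for one person's week."""
    total = sum(hours_by_task.values())
    if total == 0:
        return {}
    return {
        task: (hours, 100.0 * hours / total)
        for task, hours in hours_by_task.items()
    }


# Hypothetical 50-hour week for one tester.
week = {"Coordination": 15, "Status Reporting": 5, "Design/Doc": 20, "Execution": 10}

for task, (hours, percent) in effort_breakdown(week).items():
    print(f"{task:<18}{hours:>3} h {percent:5.1f}%")

overtime = max(0, sum(week.values()) - STANDARD_WEEK_HOURS)
print("Overtime hours:", overtime)  # 10
```

Here test execution is 20% of the week, but that 20% is 10 hours of a 50-hour week, the same amount as the overtime. Reporting in hours makes that visible where a percentage alone would not.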

 

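| Task | 0-10% | 10%-20% | 20%-30% | 30%-40% | 40%-60% |
|---|---|---|---|---|---|
| Coordination | 8 | 2 | 0 | 2 | 0 |
| Status Reporting | 12 | 0 | 0 | 0 | 0 |
| Setup Environment | 4 | 3 | 3 | 0 | 1 |
| Scoping | 5 | 5 | 2 | 0 | 0 |
| Ad Hoc | 7 | 2 | 1 | 2 | 0 |
| Design/Doc | 1 | 3 | 5 | 1 | 2 |
| Execution | 1 | 3 | 5 | 3 | 0 |
| Inspections | 7 | 5 | 0 | 0 | 0 |
| Maintenance | 9 | 3 | 0 | 0 | 0 |
| Review | 10 | 2 | 0 | 0 | 0 |
| Bug Advocacy | 5 | 5 | 2 | 0 | 0 |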

 

A Set of Charts for Requirements-Driven Testing

Jim Kandler manages testing in an FDA-regulated context. He tracks/reports the progress of the testing effort using a set of charts similar to the following four:

Traceability Matrix

In the traceability matrix, you list specification items or requirements items across the top (one per column). Each row is a test case. A cell in the matrix pairs a test case with a requirement. Check it off if that test case tests that requirement.

 

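| | Requirement 1 | Requirement 2 | Requirement 3 | Requirement 4 | Requirement 5 | Etc. |
|---|---|---|---|---|---|---|
| Test 1 | * | | * | | | |
| Test 2 | | * | * | | * | |
| Test 3 | | | * | | | |
| Totals | 1 | 1 | 3 | 0 | 1 | |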

 

  The chart shows several things:

Notice that if we don’t write down a requirement, we won’t know from this chart what is the level of coverage against that requirement. This is less of an issue in an FDA-regulated shop because all requirements (claims) must be listed, and there must be tests that trace back to them.
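The totals row of a traceability matrix can be computed mechanically from the test-to-requirement pairings. The sketch below uses the pairings from the sample matrix in this section. The variable names are illustrative, not part of any particular tool.

```python
requirements = [f"Requirement {n}" for n in range(1, 6)]

# Each test case maps to the set of requirements it tests
# (the checked cells in the sample matrix).
traces = {
    "Test 1": {"Requirement 1", "Requirement 3"},
    "Test 2": {"Requirement 2", "Requirement 3", "Requirement 5"},
    "Test 3": {"Requirement 3"},
}

# Totals row: how many test cases trace to each listed requirement.
totals = {
    req: sum(req in covered for covered in traces.values())
    for req in requirements
}

# Listed requirements with no test tracing back to them.
untested = [req for req, count in totals.items() if count == 0]

print(totals)
print("Untested:", untested)  # Untested: ['Requirement 4']
```

The calculation can only flag gaps among the requirements that appear in the list. A requirement that was never written down does not show up as untested.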

Test Case Status Chart

For each test case, this chart tracks whether it is written, has been reviewed, has been used, and what bugs have been reported on the basis of it. If the test is blocked (you can’t complete it because of a bug), the chart shows when the blocking bug was discovered.

Many companies operate with a relatively small number of complex regression tests (such as 500). A chart that shows the status of each test is manageable in size and can be quite informative.
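The totals and percentages at the bottom of such a chart follow directly from the per-test rows. A minimal sketch, using the seven rows of the sample chart in this section as data (the function name `status_totals` and the data layout are assumptions):

```python
# Test case ID -> milestones reached, transcribed from the sample chart.
cases = {
    1: {"written", "reviewed"},
    2: {"written", "reviewed", "executed"},
    3: {"written", "reviewed", "executed"},
    4: {"written"},
    5: set(),
    6: set(),
    7: set(),
}


def status_totals(cases):
    """Return {milestone: (count, percent_of_all_test_cases)}."""
    total = len(cases)
    if total == 0:
        return {}
    result = {}
    for milestone in ("written", "reviewed", "executed"):
        count = sum(milestone in reached for reached in cases.values())
        result[milestone] = (count, 100.0 * count / total)
    return result


for milestone, (count, percent) in status_totals(cases).items():
    print(f"{milestone:<9}{count:>2} of {len(cases)} {percent:5.1f}%")
```

For the sample data this gives 4, 3, and 2 of the 7 test cases, or 57.1%, 42.9%, and 28.6%. The chart reports the same figures as whole percentages.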

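| Test Case ID | Written | Reviewed | Executed | Blocked | Software Change Request |
|---|---|---|---|---|---|
| 1 | 1 | 1 | | | |
| 2 | 1 | 1 | 1 | | 118 |
| 3 | 1 | 1 | 1 | 10/15/00 | 113 |
| 4 | 1 | | | | |
| 5 | | | | | |
| 6 | | | | | |
| 7 | | | | | |
| Totals 7 | 4 | 3 | 2 | | |
| 100% | 57% | 42% | 29% | | |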

   

 

Testing Effort Chart

For each tester, this shows how many ho