Guide to Reliability of Electrical/Electronic Equipment and Products--Software (part 1)



1. INTRODUCTION

As stated at the outset, I am a hardware person. However, to paint a complete picture of reliability, I think it is important to mention some of the issues involved in developing reliable software. The point is that for microprocessor-based products, hardware and software are inextricably interrelated and codependent in fielding a reliable product to the customer base.

In today's systems, the majority of issues/problems that crop up are attributed to software rather than hardware or the interaction between hardware and software. For complex systems like high-end servers, oftentimes software fixes are made to address hardware problems because software changes can be made more rapidly than hardware changes or redesign. There appears to be a larger gap between customer expectations and satisfaction as relates to software than there is for hardware. Common software shortfalls from a system perspective include reliability, responsiveness to solving anomalies, ease of ownership, and quality of new versions.

An effective software process must be predictable; cost estimates and schedule commitments must be met with reasonable consistency; and the resulting products (software) should meet users' functional and quality expectations. The software process is the set of tools, methods, and practices that are used to produce a software product. The objectives of software process management are to produce products according to plan while simultaneously improving an organization's capability to produce better products. During software development a creative tension exists between productivity (in terms of number of lines of code written) and quality. The basic principles are those of statistical control, which is based on measurement. Unless the attributes of a process can be measured and expressed numerically, one cannot begin to bring about improvement.

But (as with hardware) one cannot use numbers arbitrarily to control things. The numbers must accurately represent the process being controlled, and they must be sufficiently well defined and verified to provide a reliable basis for action.

While process measurements are essential for improvement, planning and preparation are required; otherwise the results will be disappointing.

At the inception of a software project, a commitment must be made to quality as reflected in a quality plan that is produced during the initial project planning cycle. The plan is documented, reviewed, tracked, and compared to prior actual experience. The steps involved in developing and implementing a quality plan include

1. Senior management establishes aggressive and explicit numerical quality goals. Without numerical measures, the quality effort will be just another motivational program with little lasting impact.

2. The quality measures used are objective, requiring a minimum of human judgment.

3. These measures are precisely defined and documented so computer programs can be written to gather and process them.

4. A quality plan is produced at the beginning of each project. This plan commits to specific numerical targets, and it is updated at every significant project change and milestone.

5. These plans are reviewed for compliance with management quality goals. Where noncompliance is found, re-planning or exception approval is required.

6. Quality performance is tracked and publicized. When performance falls short of the plan, corrective action is required.

7. Since no single measure can adequately represent a complex product, the quality measures are treated as indicators of overall performance. These indicators are validated whenever possible through early user involvement as well as simulated and/or actual operational testing.
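Because the measures are meant to be defined precisely enough for programs to gather and process them, the plan-versus-actual comparison in steps 4 through 6 lends itself to a small script. The following is only a minimal sketch: the class name, measure names, targets, and measured values are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityTarget:
    """One numerical goal from a project quality plan (illustrative only)."""
    name: str
    target: float
    higher_is_better: bool = False


def find_shortfalls(targets, actuals):
    """Return the names of measures whose actual value misses the planned target."""
    shortfalls = []
    for t in targets:
        actual = actuals[t.name]
        met = actual >= t.target if t.higher_is_better else actual <= t.target
        if not met:
            shortfalls.append(t.name)
    return shortfalls


# Hypothetical plan and measurements, not taken from any real project.
plan = [
    QualityTarget("defects_per_kloc", 2.0),
    QualityTarget("availability_percent", 99.5, higher_is_better=True),
]
measured = {"defects_per_kloc": 3.1, "availability_percent": 99.7}

print(find_shortfalls(plan, measured))  # ['defects_per_kloc']
```

Any measure returned by `find_shortfalls` would be a candidate for the corrective action called for in step 6.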

2. HARDWARE/SOFTWARE DEVELOPMENT COMPARISON

There are both similarities and differences between the processes used to develop hardware solutions and those used to create software. Several comparisons are presented to make the point.

First, a typical hardware design team for a high-end server might have 20 designers, whereas the software project to support the server hardware design might involve 100 software engineers.

Second, in the hardware world there is a cost to pay for interconnects. Therefore, the goal is to minimize the number of interconnects. With software code, on the other hand, there is no penalty or associated cost for connections (go to statements). The lack of such a penalty for the use of connections can add complexity through the generation of "spaghetti code" and thus many opportunities for error. So an enforced coding policy is required that limits the use of go to statements to minimize defects.

Also like hardware design, software development can be segmented or divided into manageable parts. Each software developer writes what is called a unit of code. All of the units of code written for a project are combined and integrated to become the system software. It is easier to test and debug a unit of software code than it is to test and debug the entire system.

Hardware designers will often copy a mundane workhorse portion of a circuit and embed it in the new circuit design. (Designers like to spend their time designing with the latest microprocessor, memory, DSP, or whatever, rather than designing circuitry they consider to be mundane and not challenging.) This approach often causes interface and timing problems that are not found until printed wiring assembly (PWA) or system test. Software designers who copy a previously written unit of code and insert it in a new software development project without any forethought could pose similarly dangerous results.

Another significant difference between hardware design and software code development is that a unit of software may contain a bug and still function (i.e., it works), but it does the wrong thing. However, if a hardware circuit contains a defect (the equivalent of a software bug), it generally will not function.

As a result of these similarities and differences, hardware developers and their software counterparts can learn from each other as to what methods and processes work best to build a robust and reliable product.

3. SOFTWARE AVAILABILITY

For most development projects the customer is concerned with the system's (equipment's) ability to perform the intended function whenever needed; this is called availability. Such measures are particularly important for system and communication programs that are fundamental to overall system operation. These generally include the control program, database manager, job scheduler, user interface, communication control, network manager, and the input/output system.

The key to including any program in this list is whether its failure will bring down the critical applications. If so, its availability must be considered.

Availability cannot be measured directly but must be calculated from such probabilistic measures as the mean time between failure (MTBF) and the mean time required to repair and restore the system to full operation (MTTR). Assuming the system is required to be continuously available, availability is the percent of total time that the system is available for use:

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \times 100$$

Availability is a useful measure of the operational quality of some systems. Unfortunately, it is very difficult to project prior to operational testing.
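As a worked illustration of the availability calculation, the sketch below computes the percentage from MTBF and MTTR. The function name and the sample figures (a failure every 500 hours, 2 hours to restore) are assumptions made for the example, not data from any real system.

```python
def availability_percent(mtbf_hours: float, mttr_hours: float) -> float:
    """Percent of total time a continuously required system is available."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return 100.0 * mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures for illustration only.
print(round(availability_percent(500.0, 2.0), 2))  # 99.6
```

Note how sensitive the result is to repair time: with the same MTBF, cutting MTTR in half removes roughly half of the unavailable time.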

4. SOFTWARE QUALITY

4.1 Software Quality Estimate

An estimate of software quality includes the following steps:

1. Specify the new development project quality goals [normally included in the marketing requirements list (MRL)], which represent customer needs.

2. Document and review lessons learned. Review recent projects completed by the organization to identify the ones most similar to the proposed product. Where the data warrant, this is done for each product element or unit of code. Often a design team disbands after a project is complete (software code is written) and moves on to the next project. The same thing happens with hardware designs as well. This prevents iterative learning from taking place.

3. Examine available quantitative quality data for these projects to establish a basis for the quality estimate.

4. Determine the significant product and process differences and estimate their potential effects.

5. Based on these historical data and the planned process changes, project the anticipated quality for the new product development process.

6. Compare this projection with the goals stated in the MRL. Highlight differences and identify needed process improvements to overcome the differences and meet the goals; this leads to the creation of a project quality profile.

7. Produce a development plan that specifies the process to be used to achieve this quality profile.

In making a software quality estimate it is important to remember that every project is different. While all estimates should be based on historical experience, good estimating also requires an intuitive understanding of the special characteristics of the product involved. Some examples of the factors to consider are:

What is the anticipated rate of customer installation for this type of product? A high installation rate generally causes a sharp early peak in defect rate with a rapid subsequent decline. Typically, programs install most rapidly when they require minimal conversion and when they do not affect overall system operation. Compiler and utility programs are common examples of rapidly installed products.

What is the product release history? A subsequent release may install quickly if it corrects serious deficiencies in the prior release. This, of course, requires that the earliest experience with the new version is positive. If not, it may get a bad reputation and be poorly accepted.

What is the distribution plan? Will the product be shipped to all buyers immediately; will initial availability be limited; or is there to be a preliminary trial period? Is the service system established? Regardless of the product quality, will the users be motivated and able to submit defect reports? If not, the defect data will not be sufficiently reliable to validate the development process.

To make an accurate quality estimate it is essential to have a quality model. One of the special challenges of software engineering is that it is an intellectual process that produces software programs that do not obey the laws of nature. The models needed for estimation purposes must reflect the way people actually write programs, and thus don't lend themselves to mathematical formulas. It is these unique, nonuniform characteristics that make the software engineering process manageable. Some of these characteristics are

Program module quality will vary, with a relatively few modules containing the bulk of the errors.

The remaining modules will likely contain a few randomly distributed defects that must be individually found and removed.

The distribution of defect types will also be highly skewed, with a relatively few types covering a large proportion of the defects.

Since programming changes are highly error-prone, all changes should be viewed as potential sources of defect injection.

While this characterization does not qualify as a model in any formal sense, it does provide focus and a framework for quality planning.


FIGURE 1 Typical cause-effect diagram.

=============

TABLE 1 Pareto List of Software Defects

User interface interaction

1. User needs additional data fields.

2. Existing data need to be organized/presented differently.

3. Edits on data values are too restrictive.

4. Edits on data values are too loose.

5. Inadequate system controls or audit trails.

6. Unclear instructions or responses.

7. New function or different processing required.

Programming defect

1. Data incorrectly or inconsistently defined.

2. Initialization problems.

3. Database processing incorrect.

4. Screen processing incorrect.

5. Incorrect language instruction.

6. Incorrect parameter passing.

7. Unanticipated error condition.

8. Operating system file handling incorrect.

9. Incorrect program control flow.

10. Incorrect processing logic or algorithm.

11. Processing requirement overlooked or not defined.

12. Changes required to conform to standards.

Operating environment

1. Terminal differences.

2. Printer differences.

3. Different versions of systems software.

4. Incorrect job control language.

5. Incorrect account structure or capabilities.

6. Unforeseen local system requirements.

7. Prototyping language problem.

=============

4.2 Software Defects and Statistical Process Control

There is a strong parallel between hardware and software in the following areas:

Defects.

Defect detection. For an IC one cannot find all of the defects (bugs). Stuck-at fault (SAF) coverage ranges from 85 to 98%. However, delay, bridging, open, and other defects cannot be found with SAF, so 100% test coverage is impossible to obtain. Software bug coverage is 50%.

Root cause analysis and resolution.

Defect prevention.

Importance of peer design reviews. Code reviews should be conducted, preferably without the author being present. If the author is present, he or she cannot speak. The reviewers will try to see if they can understand the author's thought process in constructing the software. This can be an eye-opening educational experience for the software developer.

Need for testing (automated testing, unit testing).

Use of statistical methodology (plan-do-check-act).

Need for software quality programs with clear metrics.

Quality systems.

Statistical process control (SPC) has been used extensively in hardware development and manufacturing. One SPC tool is the fishbone diagram (also called an Ishikawa or a cause-effect diagram), which helps one explore the reasons, or causes, for a particular problem or effect. Figure 1 shows a fishbone diagram for register allocation defects in a compiler. This diagram enables one to graphically identify all the potential causes of a problem and the relationship with the effect, but it does not illustrate the magnitude of a particular cause's effect on the problem.

Pareto diagrams complement cause-effect diagrams by illustrating which causes have the greatest effect on the problem. This information is then used to determine where one's problem-solving efforts should be directed. Table 1 is a Pareto distribution of software module defect densities or defect types. The defects are ranked from most prevalent to least. Normally, a frequency of occurrence, expressed either numerically or in percentage, is listed for each defect to show which defects are responsible for most of the problems. Table 2 is a Pareto listing of software errors both numerically and as a percentage of the total. By focusing process improvement efforts on the most prevalent defects, significant quality improvements can be achieved.

The real value of SPC is to effectively define areas for software quality improvement and the resultant actions that must be taken.

================

TABLE 2 Pareto List of Software Errors

Error category | Frequency of occurrence | Percent
Incomplete/erroneous specification | 349 | 28
Intentional deviation from specification | 145 | 12
Violation of programming standards | 118 | 10
Erroneous data accessing | 120 | 10
Erroneous decision logic or sequencing | 139 | 12
Erroneous arithmetic computations | 113 | 9
Invalid timing | 44 | 4
Improper handling of interrupts | 46 | 4
Wrong constants and data values | 41 | 3
Inaccurate documentation | 96 | 8
Total | 202 | 100

================
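A Pareto ranking of this kind is simple to compute once defect counts are recorded by category. The sketch below uses the ten category frequencies listed in Table 2; the function name is illustrative, and the total and percentages are recomputed from those ten rows rather than copied from the table.

```python
# Category frequencies as listed in Table 2.
error_counts = {
    "Incomplete/erroneous specification": 349,
    "Intentional deviation from specification": 145,
    "Violation of programming standards": 118,
    "Erroneous data accessing": 120,
    "Erroneous decision logic or sequencing": 139,
    "Erroneous arithmetic computations": 113,
    "Invalid timing": 44,
    "Improper handling of interrupts": 46,
    "Wrong constants and data values": 41,
    "Inaccurate documentation": 96,
}


def pareto(counts):
    """Rank categories by frequency; attach percent and cumulative percent."""
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for category, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        cumulative += count
        rows.append((category, count, 100.0 * count / total, 100.0 * cumulative / total))
    return rows


for category, count, pct, cum_pct in pareto(error_counts):
    print(f"{category:<42} {count:>4} {pct:5.1f}% {cum_pct:6.1f}%")
```

The cumulative column shows where to concentrate effort: with these counts, the three most frequent categories account for a little over half of all recorded errors.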

4.3 Software Quality Measures

The specific quality measures must be selected by each organization based on the data available or easily gathered, but the prime emphasis should be on those that are likely to cause customer problems. Program interrupt rate is one useful measure of quality that is important to customers. Many organizations restrict the definition of valid defects to those that require code changes. If clear criteria can be established, however, documentation changes also should be included.

Both software defect and test data should be used as quality indicators.

Defect data should be used since it is all that most software development organizations can obtain before system shipment. Then, too, if defects are not measured, the software professionals will not take any other measures very seriously. They know from experience that the program has defects that must be identified and fixed before the product can be shipped. Everything else will take second priority.

It is important to perform software testing. This should begin as the software is being developed, i.e., unit testing. It is also worthwhile to do some early testing in either a simulated or real operational environment prior to final delivery. Even with the most thorough plans and a highly capable development team, the operational environment always seems to present some unexpected problems. These tests also provide a means to validate the earlier installation and operational plans and tests, make early availability measurements, and debug the installation, operation, and support procedures.

Once the defect types have been established, normalization for program size is generally required. Defects per 1000 lines of source code is generally the simplest and most practical measure for most organizations. This measure, however, requires that the line-of-code definition be established. The cumulative number of defects received each month is then plotted and used as the basis for corrective action.
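The normalization and the cumulative monthly count can be sketched in a few lines. The monthly defect counts and the 50,000-line program size below are hypothetical, and the function name is illustrative; the organization's own line-of-code definition determines what is actually counted.

```python
from itertools import accumulate


def defects_per_kloc(defects: int, source_lines: int) -> float:
    """Defect count normalized to 1000 lines of source code."""
    if source_lines <= 0:
        raise ValueError("source_lines must be positive")
    return 1000.0 * defects / source_lines


# Hypothetical defect reports received in each of five months.
monthly_defects = [12, 9, 7, 4, 3]
cumulative = list(accumulate(monthly_defects))

print(cumulative)                                # [12, 21, 28, 32, 35]
print(defects_per_kloc(cumulative[-1], 50_000))  # 0.7
```

Plotting `cumulative` against the month gives the curve described above; a curve that keeps climbing steeply rather than flattening is a signal that corrective action is needed.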

The next issue is determining what specific defects should be measured and over what period of time they should be measured. This again depends on the quality program objectives. Quality measures are needed during development, test, and customer use. The development measures provide a timely indicator of software code performance; the test measures then provide an early validation; and the customer-use data complete the quality evaluation. With this full spectrum of data it is possible to calibrate the effectiveness of development and test at finding and fixing defects. This requires long-term product tracking during customer use and some means to identify each defect with its point of introduction. Errors can then be separated by release, and those caused by maintenance activity can be distinguished.

When such long-term tracking is done, it is possible to evaluate many software process activities. By tracking the inspection and test history of the complete product, for example, it is possible to see how effective each of these actions was at finding and removing the product defects. This evaluation can be especially relevant at the module level, where it provides an objective way to compare task effectiveness.
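One way to express this evaluation numerically is to compute, for each inspection or test step, the share of the defects still present that the step removed. The sketch below is a simplification under stated assumptions: every defect is eventually found, no new defects are injected after the first step, and the final entry holds the defects that escaped to customer use. The phase names and counts are hypothetical.

```python
def removal_effectiveness(found_by_phase):
    """Percent of remaining defects removed by each phase.

    found_by_phase is an ordered list of (phase, defects_found); the last
    entry is treated as the defects that escaped to customer use.
    """
    remaining = sum(count for _, count in found_by_phase)
    effectiveness = {}
    for phase, count in found_by_phase[:-1]:
        effectiveness[phase] = 100.0 * count / remaining if remaining else 0.0
        remaining -= count
    return effectiveness


# Hypothetical inspection and test history for one module.
history = [
    ("design inspection", 40),
    ("code inspection", 30),
    ("unit test", 15),
    ("system test", 10),
    ("customer use", 5),
]

for phase, pct in removal_effectiveness(history).items():
    print(f"{phase:<18} {pct:5.1f}%")
```

Comparing these percentages across modules is one objective basis for the task-effectiveness comparison mentioned above.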

4.4 Defect Prevention

Defect prevention begins with (1) a clear understanding of how software defects (bugs) occur and (2) a clear management commitment to quality. Investigation of the factors that result in high-quality software being written revealed an interesting finding: the quality of the written software code is strongly tied to the number of interruptions that the software developer experiences while writing the code, such as someone stopping by for a chat. The interruptions cause the developer to lose the mental construct of the code, which is continually being updated as the code is written. On the other hand, software code quality was found to be independent of such things as the developer's previous experience, the most recent project the developer was involved with, the level of education, and the like. So as much as possible, an interruption-free and noise-free environment needs to be provided for software developers (use white noise generators to block out all noise sources).

The management commitment must be explicit, and all members of the organization must know that quality comes first. Management must direct code "walk-throughs" and reviews. Until management delays or redirects a project to meet the established quality goals and thus ensures that the software development process has the right focus, people will not really believe it. Even then, the point must be reemphasized and the software engineers urged to propose quality improvement actions, even at the potential cost of schedule delays. In spite of what the schedule says, when quality problems are fixed early, both time and resources are saved. When management really believes this and continually conveys and reinforces the point to the software developers, the right quality attitudes will finally develop. It is important to identify defects and ways to prevent them from occurring. As with hardware, the cost of finding and repairing defects increases exponentially the later they are found in the process.

Preventing defects is generally less expensive than finding and repairing them, even early in the process.

Finding and fixing errors accounts for much of the cost of software development and maintenance. When one includes the costs of inspections, testing, and rework, as much as half or more of the typical development bill is spent in detecting and removing errors. What is more, the process of fixing defects is even more error prone than original software creation. Thus with a low-quality process, the error rate spiral will continue to escalate.

Hewlett-Packard found that more than a third of their software errors were due to poor understanding of interface requirements. By establishing an extensive prototyping and design review program, the number of defects found after release was sharply reduced.

A development project at another company used defect prevention methods to achieve a 50% reduction in defects found during development and a 78% reduction in errors shipped. This is a factor of 2 improvement in injected errors, and a 4-to-1 improvement in shipped quality.
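The conversion from a percent reduction to an "N-to-1" improvement factor is simple arithmetic, sketched here with an illustrative function name. A 50% reduction is exactly a factor of 2; a 78% reduction works out to about 4.5-to-1, which the text rounds to 4-to-1.

```python
def improvement_factor(reduction_percent: float) -> float:
    """Convert a percent reduction in defects to an N-to-1 improvement factor."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction_percent must be in the range [0, 100)")
    return 100.0 / (100.0 - reduction_percent)


print(improvement_factor(50))            # 2.0
print(round(improvement_factor(78), 2))  # 4.55
```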

Finding and identifying defects is necessary but not sufficient. The most important reason for instituting defect prevention is to provide a continuing focus for process improvement. Unless some mechanism drives process change, it will not happen in an orderly or consistent way. A defect prevention mindset focuses on those process areas that are the greatest source of trouble, whether methods, technology, procedures, or training.

The fundamental objective of software defect prevention is to make sure that errors, once identified and addressed, do not occur again. Defect prevention cannot be done by one or two people, and it cannot be done sporadically. Everyone must participate. As with any other skill, it takes time to learn defect prevention well, but if everyone on the project participates, it can transform an organization.

Most software developers spend much of their working lives reacting to defects. They know that each individual defect can be fixed, but that its near twin will happen again and again and again. To prevent these endless repetitions, we need to understand what causes these errors and take a conscious action to prevent them. We must then obtain data on what we do, analyze it, and act on what it tells us. This is called the Deming or Shewhart cycle:

1. Defect reporting. This includes sufficient information to categorize each defect and determine its cause.

2. Cause analysis. The causes of the most prevalent defects are determined.

3. Action plan development. Action teams are established to devise preventions for the most prevalent problems.

4. Action implementation. Once the actions have been determined, they must be implemented. This generally involves all parts of the organization in a concerted improvement effort.

5. Performance tracking. Performance data are gathered, and all action items are tracked to completion.

6. Starting over. Do it all again; this time focus on the most prevalent of the remaining defects.

As an example let's take cause analysis and delve into how this should be done. Cause analysis should be conducted as early as possible after a defect is found. For a given product, some useful guidelines on holding these sessions are:

1. Shortly after the time that all product modules have completed detailed design, cause analysis sessions should be held for all problems identified during the design inspections/reviews.

2. Shortly after the last module has completed each test phase, cause analysis meetings should be held for all problems found during that phase.

3. To be of most value to the later modules, cause analysis meetings should be held for all problems found for the first few modules as soon as possible after they have completed each development inspection or test phase.

4. Cause analysis reviews should be held on a product after a reasonable number of user problems have been found after customer release. Such reviews should be held at least annually after release, even if the defect rate is relatively low. These reviews should be continued until no significant number of new defects is reported.

5. The cause analysis meetings should be held often enough to permit completion in several hours. Longer meetings will be less effective and harder to schedule. Short cause analysis meetings are generally most productive.

The objective of the cause analysis meeting is to determine the following:

1. What caused each of the defects found to date?

2. What are the major cause categories?

3. What steps are recommended for preventing these errors in the future?

4. What priorities are suggested for accomplishing these actions?

Defect identification and improvement have been discussed, but the real solution is to learn from the past and apply it to present software development projects to prevent defects in the first place. The principles of software defect prevention are:

1. The programmers must evaluate their own errors. Not only are they the best people to do so, but they are most interested and will learn the most from the process.

2. Feedback is an essential part of defect prevention. People cannot consistently improve what they are doing if there is not timely reinforcement of their actions.

3. There is no single cure-all that will solve all the problems. Improvement of the software process requires that error causes be removed one at a time. Since there are at least as many error causes as there are error types, this is clearly a long-term job. The initiation of many small improvements, however, will generally achieve far more than any one-shot breakthrough.

4. Process improvement must be an integral part of the process. As the volume of process change grows, as much effort and discipline should be invested in defect prevention as is used on defect detection and repair. This requires that the process is architected and designed, inspections and tests are conducted, baselines are established, problem reports written, and all changes tracked and controlled.

5. Process improvement takes time to learn. When dealing with human frailties, we must proceed slowly. A focus on process improvement is healthy, but it must also recognize the programmers' need for a reasonably stable and familiar working environment. This requires a properly paced and managed program. By maintaining a consistent, long-term focus on process improvement, disruption can be avoided and steady progress will likely be achieved.

4.5 The SEI Capability Maturity Model

Because of the dominance and importance of software to a product's success, there is an increased focus on software reliability. Reliability is raised by improving how software is developed, that is, by improving the software development organization. The following list provides the steps that lead to an improved software development organization:

1. Hire someone with previous software development project management experience to lead the software development process. If you've never done a software development project or led a development project, "you don't know what you don't know."

2. Understand the current status of the development process.

3. Develop a vision of the desired process.

4. Establish a list of required process improvement actions in order of priority.

5. Produce a plan to accomplish the required actions.

6. Commit the resources to execute the plan.

7. Start over at step 2.

The effective development of software is limited by several factors, such as an ill-defined process, inconsistent implementation, and poor process management, and hinges on experienced, competent software developers (who have a track record). The Software Engineering Institute (SEI) of Carnegie Mellon University devised a model to assess and grade a company's software development process, to address these factors, and to drive continuous improvement to the next level of sophistication or maturity. The SEI Capability Maturity Model (CMM) structure addresses the seven steps by characterizing the software development process into five maturity levels to facilitate the writing of high-reliability software program code. These levels are

Level 1: Initial. Until the process is under statistical control, no orderly progress in process improvement is possible.

Level 2: Repeatable. The organization has achieved a stable process with a repeatable level of statistical control by initiating rigorous project management of commitments, costs, schedules, and changes.

Level 3: Defined. The organization has defined the process. This helps ensure consistent implementation and provides a basis for a better understanding of the process.

Level 4: Managed. The organization has initiated comprehensive process measurements and analyses beyond those of cost and schedule performance.

Level 5: Optimizing. The organization now has a foundation for continuing improvement and optimization of the process.


FIGURE 2 The five levels of the SEI capability maturity model.

The optimizing process helps people to be effective in several ways:

It helps managers understand where help is needed and how best to provide the people with the support they require.

It lets the software developers communicate in concise, quantitative terms.

It provides the framework for the software developers to understand their work performance and to see how to improve it.

A graphic depiction of the SEI Capability Maturity Model is presented in Figure 2. Notice that as an organization moves from Level 1 (Initial) to Level 5 (Optimizing) the risk decreases and the productivity, quality, and reliability increase. Surveys have shown that most companies that are audited per SEI criteria score between 1.0 and 2.5. There are very few 3s and 4s and perhaps only one 5 in the world.

These levels have been selected because they:

Reasonably represent the actual historical phases of evolutionary improvement of real software organizations

Represent a measure of improvement that is reasonable to achieve from the prior level

Suggest interim improvement goals and progress measures

Easily identify a set of immediate improvement priorities once an organization's status in this framework is known

While there are many other elements to these maturity level transitions, the primary objective is to achieve a controlled and measured process as the foundation for continuing improvement.

The process maturity structure is used in conjunction with an assessment methodology and a management system to help an organization identify its specific maturity status and to establish a structure for implementing the priority improvement actions, respectively. Once its position in this maturity structure is defined, the organization can concentrate on those items that will help it advance to the next level. Currently, the majority of organizations assessed with the SEI methodology are at CMM Level 1, indicating that much work needs to be done to improve software development.

Level 1: The Initial Process

The initial process level could properly be described as ad hoc, and it is often even chaotic. At this level the organization typically operates without formalized procedures, cost estimates, or project plans. Tools are neither well integrated with the process nor uniformly applied. Change control is lax, and there is little senior management exposure or understanding of the problems and issues. Since many problems are deferred or even forgotten, software installation and maintenance often present serious problems.

While organizations at this level may have formal procedures for planning and tracking their work, there is no management mechanism to ensure that they are used. The best test is to observe how such an organization behaves in a crisis. If it abandons established procedures and essentially reverts to coding and testing, it is likely to be at the initial process level. After all, if the techniques and methods are appropriate, then they should be used in a crisis; if they are not appropriate in a crisis, they should not be used at all. One reason why organizations behave in this fashion is that they have not experienced the benefits of a mature process and thus do not understand the consequences of their chaotic behavior. Because many effective software actions (such as design and code inspections or test data analysis) do not appear to directly support shipping the product, they seem expendable.

In software, coding and testing seem like progress, but they are often only wheel spinning. While they must be done, there is always the danger of going in the wrong direction. Without a sound plan and a thoughtful analysis of the problems, there is no way to know.
