Enumerated and Analytical Statistics (Part 1)
Dr. Deming introduced the distinction between what he called the enumerated and analytical use of data in his 1950 book "Some Theory of Sampling," and he returned to it throughout his career. According to him, statistical data should provide a rational basis for action, and he repeatedly emphasized the consequences of failing to distinguish between the two kinds of study. Essentially, enumerated studies are concerned with counting and categorizing data. For example, how many software builds failed? Analytical studies, by contrast, aim to determine cause and effect from data (i.e., to build theories explaining why the data is the way it is). An example is comparing two methods for a process, where there are three possible actions: adopt the new method, keep the original method, or continue the experiment and gather more data. Deming called this cycle Plan-Do-Study-Act (PDSA). In Deming's simple formulation, enumeration asks how many, while analysis asks why.
An organization implementing a DevOps methodology might decide to track how many times per day it deploys software. This metric can help determine whether an organization or team is improving. However, the enumerated approach is far less effective if the team needs to discover why it is improving. Enumerated studies typically depend on the frequency of a metric. Likewise, IT governance policies are generally designed to monitor the frequency of passed and failed risk controls. These risks are classified this way because the quality of a risk management policy typically depends on validations, thresholds, and actions taken or not taken. In an enumerated approach, the quality of a risk scorecard is judged by how many security controls pass or fail. Turning our attention to an analytical approach may help us better understand why we have risk.
Most of this kind of discussion falls under the topic of Operations Management. Operations Management emerged as a discipline in the early 20th century, driven by the Second Industrial Revolution and, above all, Henry Ford's invention of mass production. As a science, it was heavily influenced by Frederick Winslow Taylor's 1911 publication "The Principles of Scientific Management."
Walter Shewhart introduced the concept of a control chart in 1924 at Bell Labs, making statistics a primary tool within Operations Management for the first time. With the help of Dr. Deming, Shewhart published "Statistical Method from the Viewpoint of Quality Control" in 1939, explaining the idea of Statistical Process Control (SPC). In contrast to an exact science like mass production, Shewhart's control chart was grounded in probability. The idea of a control chart is essentially to use statistical tools like the mean, mode, range, and standard deviation to visualize causes of variation, on the premise that all processes vary. Shewhart defined two types of variation, assignable cause and chance cause; Deming later called these special and common cause variation. One of the simplest methods for distinguishing special from common cause variation is to statistically define an upper and lower control limit around the average of time-ordered data points. Data points that fall outside the control limits are considered special cause variation. In common practice, several more complicated rules are typically used as well. Figure 1 is an example of a control chart showing the average software deploys per week for 25 weeks, where all the data points fall within the limits and are considered common cause variation, i.e., a process that is in control.
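To make the arithmetic concrete, here is a minimal sketch of how the center line and three-sigma control limits for a chart like Figure 1 can be computed. The weekly deploy counts below are made up for illustration, not the data behind Figure 1:

```python
import statistics

# Hypothetical weekly deploy counts for 25 weeks (illustrative data only)
deploys_per_week = [42, 45, 39, 44, 41, 46, 43, 40, 45, 44,
                    38, 42, 47, 41, 43, 45, 40, 44, 42, 46,
                    39, 43, 41, 45, 42]

center = statistics.mean(deploys_per_week)        # the center line
sigma = statistics.stdev(deploys_per_week)        # sample standard deviation

# Shewhart's classic three-sigma limits
ucl = center + 3 * sigma  # upper control limit
lcl = center - 3 * sigma  # lower control limit

print(f"center = {center:.1f}, UCL = {ucl:.1f}, LCL = {lcl:.1f}")
```

Note that production SPC tools often estimate sigma from the moving range of an individuals (XmR) chart rather than the raw sample standard deviation, but the three-sigma idea is the same.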
SPC and control charts are at the core of what Deming refers to as analytical statistics. Shewhart's SPC concepts have proven successful for a century, from toasters to cars to nuclear power plants. However, we rarely find them used in IT. Although "Ops" in DevOps and DevSecOps stands for operations, IT has very little Operations Management (i.e., analytical statistics) in practice.
Let's look at a specific example from a DevOps automated governance perspective. Scanning container images for vulnerabilities is a typical modern governance risk control. An enumerated graph, a frequency distribution of 25 weeks of failed container scans, can be seen in Figure 2. Most weeks have fewer than 30 failures, with 68% falling between 24 and 30. Although this graph is interesting, it doesn't really help us understand how to improve the process.
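The enumerated view is just a tally. A minimal sketch, again using made-up weekly failure counts rather than the Figure 2 data, groups the 25 weekly totals into buckets and asks only "how many?":

```python
from collections import Counter

# Hypothetical failed container scans per week for 25 weeks (illustrative only)
failed_scans = [26, 28, 25, 29, 27, 24, 30, 26, 28, 27,
                25, 29, 26, 28, 27, 31, 24, 33, 35, 36,
                38, 40, 41, 43, 45]

# Enumerated view: bucket the weekly counts into ranges of five
buckets = Counter((count // 5) * 5 for count in failed_scans)
for low in sorted(buckets):
    print(f"{low}-{low + 4}: {'#' * buckets[low]}")
```

Notice that bucketing discards the time order of the weeks entirely. That lost ordering is exactly what the analytical view in Figure 3 exploits.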
Looking at the same 25 weeks through a control chart in Figure 3, i.e., an analytical approach, we can see much more. An important point here is that Figure 2 and Figure 3 are based on exactly the same data, yet Figure 3 provides far more information. First, we see a summary at the bottom of the chart giving the center line (the mean). We also see upper and lower control limits (the UCL and LCL). In a control chart, standard deviation can be used to calculate the UCL and LCL. Control charts use sigma to represent standard deviation: one sigma is one standard deviation from the mean, two sigmas are two, and three sigmas are three. In SPC parlance, points between the UCL and LCL are generally considered common cause variation, and points outside them are regarded as special cause variation. Figure 3 shows five red dots depicting the five weeks where the number of failed container scans exceeded the control limits. We can investigate further to try to understand why these occurrences happened.

However, the real power of control charts comes from the more complicated rules mentioned earlier. These rules are heuristic and mathematical patterns that can reveal more detailed opportunities for improvement. Over the years, different industries and statisticians have created several sets of rules, but most of the basic rules are similar. A common rule is n points in a row increasing or decreasing. We see in Figure 3 that starting at week 18, there are eight increasing points. This pattern indicates another form of special cause variation.

One of the critical tenets of analytical statistics is that the statistics are not designed to reveal the problem; they enable a subject matter expert (SME) to figure out where to find it. In this example, the SME notices a trend of increasing container scan vulnerabilities starting in week 18. After further investigation, the SME discovered that a new development team had been added to the original team and had not been trained to use the proper container image registries. Instead of internal registries, they used public registries that had not been scanned for vulnerabilities. After correcting the issue, the next 25-week control chart should show mostly common cause variation if the problem has been fixed.
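To make the two kinds of signals concrete, here is a minimal sketch (not any particular SPC library; rule sets vary by industry) of the two checks described above: points outside the three-sigma limits, and n points in a row trending in one direction:

```python
import statistics

def outside_limits(points, k=3):
    """Flag indexes of points beyond k-sigma control limits
    (one form of special cause variation)."""
    mean = statistics.mean(points)
    sigma = statistics.stdev(points)
    ucl, lcl = mean + k * sigma, mean - k * sigma
    return [i for i, p in enumerate(points) if p > ucl or p < lcl]

def runs(points, n=8):
    """Flag the starting index of any run of n points in a row
    steadily increasing or decreasing (a common trend rule)."""
    starts = []
    for i in range(len(points) - n + 1):
        window = points[i:i + n]
        diffs = [b - a for a, b in zip(window, window[1:])]
        if all(d > 0 for d in diffs) or all(d < 0 for d in diffs):
            starts.append(i)
    return starts
```

Applied to the weekly failure counts behind Figure 3, outside_limits would flag the five red-dot weeks, and runs(counts, n=8) would flag the run beginning at index 17 (week 18). The functions only point at a week; it takes an SME to explain it.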
In Part 2 of Enumerated versus Analytical Statistics, I will go into more detail on how control charts can be used as a Kaizen tool based on Deming's System of Profound Knowledge.