THE HISTOGRAM DISTRIBUTION

The Histogram Distribution is a frequently encountered continuous probability distribution, commonly used to summarize either a real data sample or the results of a computer simulation. It is essentially a bar graph in which both the widths and the heights of the individual bars may vary. Formally, a histogram distribution with n bars (intervals) is described by a set of interval endpoints $\{x_0, x_1, \ldots, x_n\}$ and a set of interval probabilities $\{p_1, \ldots, p_n\}$ that sum to one. The density function is constant on each interval, with height $h_i$ chosen so that the area over the interval is $(x_i - x_{i-1}) h_i = p_i$. The CDF is then a monotonically increasing, piecewise linear curve that starts at 0 on the left and ends at 1 on the right. Graphs for the example distribution described below are shown here to illustrate the general shape of histogram distributions.
 

[Figure: PDF and CDF graphs for the example histogram distribution]
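The definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function names `density_heights` and `cdf_breakpoints` are made up for this example, and the data are the endpoints and interval probabilities of the project-duration example worked later in this section.

```python
from itertools import accumulate

# Endpoints x_0..x_n and interval probabilities p_1..p_n
# (taken from the project-duration example in this section).
endpoints = [40, 44, 48, 52, 56, 60]
probs = [0.11, 0.23, 0.37, 0.26, 0.03]

def density_heights(endpoints, probs):
    """Bar height h_i chosen so that (x_i - x_{i-1}) * h_i = p_i."""
    return [p / (b - a) for a, b, p in zip(endpoints, endpoints[1:], probs)]

def cdf_breakpoints(endpoints, probs):
    """(x, cumulative probability) pairs; the CDF is linear between them."""
    cumulative = [0.0] + list(accumulate(probs))
    return list(zip(endpoints, cumulative))

heights = density_heights(endpoints, probs)
breakpoints = cdf_breakpoints(endpoints, probs)

for x, c in breakpoints:
    print(f"x = {x:2d}   CDF = {c:.2f}")
```

The heights are what a PDF plot would show as bar tops, and joining the breakpoints with straight lines gives the CDF.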
 
Computation of the various statistics associated with a histogram distribution is best described by means of an example.

EXAMPLE HISTOGRAM COMPUTATION

Suppose a historical data set, or a Monte Carlo simulation, of the duration of a project yields a histogram distribution as given in the following table:
 

Interval    Frequency
40-44       22
44-48       46
48-52       74
52-56       52
56-60       6
 

a) What is the mode for this distribution? (Hint: Use the mid-point of the most frequently occurring interval.)

b) Compute the cumulative probabilities for the right end points of each interval, and plot the cumulative distribution function for the distribution.

c) By linear interpolation, compute the median (50%) and the lower (25%) and upper (75%) quartiles for the distribution. How much time should be allowed to ensure a 95% chance of completing the project on time? What is the probability that the project will take more than 55 days to complete?

d) Using the mid-point as a representative value for each interval, use the discrete distribution formulas to compute the mean, variance, and standard deviation for the simulated project duration distribution.

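It turns out that the discrete mean value formula gives the correct mean for the continuous histogram distribution when the interval midpoint $m_i = (x_{i-1} + x_i)/2$ is used as a "representative value" for each interval, thus

$$\mu = \sum_{i=1}^{n} p_i \, m_i = \sum_{i=1}^{n} p_i \, \frac{x_{i-1} + x_i}{2}$$

But the true variance of the continuous distribution is given by a somewhat more complex expression, namely

$$\sigma^2 = \sum_{i=1}^{n} p_i \, \frac{x_{i-1}^2 + x_{i-1} x_i + x_i^2}{3} - \mu^2$$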

Obtain the true variance and standard deviation of the histogram distribution for the given data using this exact formula and compare with the approximate variance and standard deviation obtained with the discrete formula applied to the interval midpoints. What is the percentage error in the standard deviation resulting from using the midpoint approximation instead of the exact continuous formulas?
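As a side note, one way to see why the midpoint mean is exact while the midpoint variance is not is to look at a single interval on which the distribution is uniform. The symbols $m_i$ (interval midpoint) and $w_i = x_i - x_{i-1}$ (interval width) are notation introduced here for illustration. For a uniform distribution on $[x_{i-1}, x_i]$,

$$E[X] = \frac{x_{i-1} + x_i}{2} = m_i, \qquad E[X^2] = \frac{x_{i-1}^2 + x_{i-1} x_i + x_i^2}{3} = m_i^2 + \frac{w_i^2}{12}$$

Weighting each interval by $p_i$ and subtracting $\mu^2$ then gives

$$\sigma^2 = \left( \sum_{i=1}^{n} p_i \, m_i^2 - \mu^2 \right) + \sum_{i=1}^{n} p_i \, \frac{w_i^2}{12}$$

so the midpoint formula understates the variance by the last term, which reduces to $w^2/12$ when all intervals share the same width $w$.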

LINEAR INTERPOLATION FORMULAS

In solving this problem, we will make use of some standard formulas for linear interpolation, which in our context can be stated as follows. If $(x_k, p_k)$ and $(x_{k+1}, p_{k+1})$ are two consecutive breakpoints on the CDF (so that $p_k$ here denotes the cumulative probability at $x_k$), and if $p$ lies between $p_k$ and $p_{k+1}$, then the linear interpolation formula is

$$x = x_k + (x_{k+1} - x_k) \, \frac{p - p_k}{p_{k+1} - p_k}$$

where $(x_{k+1} - x_k)$ is the width of the interval containing $x$ and $(p - p_k)/(p_{k+1} - p_k)$ is the fraction of the way through the interval indicated by the location of $p$ with respect to the probability end-points $p_k$ and $p_{k+1}$. Conversely, given an $x$ value between $x_k$ and $x_{k+1}$, the corresponding $p$ value is

$$p = p_k + (p_{k+1} - p_k) \, \frac{x - x_k}{x_{k+1} - x_k}$$

with analogous interpretations of the various terms.
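The two interpolation directions can be written as a pair of small Python functions. This is only a sketch: the names `interpolate_x` and `interpolate_p` are hypothetical, and the sample breakpoints (10, 0.2) and (20, 0.6) are made-up values chosen to show that the two formulas are inverses of each other on a single CDF segment.

```python
def interpolate_x(xk, xk1, pk, pk1, p):
    """x at cumulative probability p on the segment (xk, pk)-(xk1, pk1)."""
    fraction = (p - pk) / (pk1 - pk)      # how far p lies between pk and pk1
    return xk + (xk1 - xk) * fraction     # same fraction of the interval width

def interpolate_p(xk, xk1, pk, pk1, x):
    """Cumulative probability at x on the segment (xk, pk)-(xk1, pk1)."""
    fraction = (x - xk) / (xk1 - xk)      # how far x lies between xk and xk1
    return pk + (pk1 - pk) * fraction

# Made-up segment: p = 0.5 is three quarters of the way from 0.2 to 0.6,
# so x lands three quarters of the way from 10 to 20.
x = interpolate_x(10.0, 20.0, 0.2, 0.6, 0.5)
p_back = interpolate_p(10.0, 20.0, 0.2, 0.6, x)
print(x, p_back)
```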

HISTOGRAM DISTRIBUTION SOLUTIONS

Part a). Since the 48-52 interval occurs most frequently (74 out of 200 trials), we take its mid-point, or 50, as the mode for the distribution.

Part b). By adding three more columns to the given data table, one for cumulative frequency and two for the relative frequency measures, we obtain the cumulative probability values to associate with the right endpoint of each interval.
 

Interval    Frequency    Cum Frequency    Relative Frequency    Rel Cum Freq
40-44       22           22               .11                   .11
44-48       46           68               .23                   .34
48-52       74           142              .37                   .71
52-56       52           194              .26                   .97
56-60       6            200              .03                   1.00
 

Setting the cumulative probability to 0 at 40, we get the plot of the CDF shown earlier. Notice that straight-line interpolation between the interval end-points is used to get the complete CDF from the values in our histogram table. This corresponds to integrating the piecewise-constant PDF for this distribution, whose histogram chart is the PDF graph shown earlier.
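The extra columns of the table can be reproduced with a short Python sketch. The variable names are illustrative only; the frequencies are those given in the problem statement.

```python
from itertools import accumulate

intervals = [(40, 44), (44, 48), (48, 52), (52, 56), (56, 60)]
frequencies = [22, 46, 74, 52, 6]

total = sum(frequencies)                       # 200 trials in all
cum_freq = list(accumulate(frequencies))       # running totals
rel_freq = [f / total for f in frequencies]    # interval probabilities
rel_cum_freq = [c / total for c in cum_freq]   # CDF at right end points

print("Interval  Freq  CumFreq  RelFreq  RelCumFreq")
for (lo, hi), f, c, r, rc in zip(intervals, frequencies, cum_freq,
                                 rel_freq, rel_cum_freq):
    print(f"{lo}-{hi}     {f:4d}  {c:7d}  {r:7.2f}  {rc:10.2f}")
```

Prepending the point (40, 0) to the pairs of right end point and relative cumulative frequency gives the breakpoints of the piecewise linear CDF.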

Part c). The linear interpolation for the median would proceed as follows. First note that it would lie in the interval from 48 to 52, since the cumulative probability is less than .5 at 48 (namely .34) and greater than .5 at 52 (namely .71). The fraction of the way through the interval is given by the ratio (50-34)/(71-34) or .432432..., so the median for the distribution is at 48 + .432432(52-48) = 49.7297. Similarly, the 25% fractile will fall between 44 and 48, the fraction being given by (25-11)/(34-11) or .60869565. Thus the 25% fractile is at 44 + .60869565(48-44) or 46.4348. The 75% fractile will fall between 52 and 56, the fraction being given by (75-71)/(97-71) or .153846. Thus the 75% fractile is at 52 + .153846(56-52) or 52.6154. To achieve a 95% confidence level we would still be in the 52-56 interval, but closer to the 56 end. This time the fraction would be (95-71)/(97-71) or .923077. Thus the 95% confidence level would be at 52 + .923077(56-52) or 55.6923.

Now working the formula in the other direction, if we ask about a time of 55 days we are 3/4 or .75 of the way through the interval from 52 to 56, so the corresponding cumulative probability would be .71 + .75(.97-.71) or .905. The question asks for the probability of taking MORE than 55 days (a right tail question), so the requested probability would be 1 - .905 or .095.
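The Part c arithmetic can be checked with a small Python sketch that locates the right CDF segment and then interpolates. The function names `quantile` and `cdf` are chosen here for illustration; the breakpoints are the end points and relative cumulative frequencies from the Part b table, with 0 added at 40.

```python
from bisect import bisect_left, bisect_right

xs = [40, 44, 48, 52, 56, 60]                   # interval end points
cum = [0.0, 0.11, 0.34, 0.71, 0.97, 1.00]       # CDF values at those points

def quantile(p):
    """Duration x with CDF(x) = p, for 0 <= p <= 1."""
    k = max(bisect_left(cum, p) - 1, 0)         # segment with cum[k] < p <= cum[k+1]
    fraction = (p - cum[k]) / (cum[k + 1] - cum[k])
    return xs[k] + fraction * (xs[k + 1] - xs[k])

def cdf(x):
    """Cumulative probability at duration x."""
    if x <= xs[0]:
        return 0.0
    if x >= xs[-1]:
        return 1.0
    k = bisect_right(xs, x) - 1                 # segment with xs[k] <= x < xs[k+1]
    fraction = (x - xs[k]) / (xs[k + 1] - xs[k])
    return cum[k] + fraction * (cum[k + 1] - cum[k])

for p in (0.25, 0.50, 0.75, 0.95):
    print(f"{p:.0%} fractile: {quantile(p):.4f}")
print(f"P(duration > 55) = {1 - cdf(55):.3f}")
```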

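Part d). The mid-points of the intervals are 42, 46, 50, 54, and 58 respectively, so the mean value computation would be

$$\mu = (.11)(42) + (.23)(46) + (.37)(50) + (.26)(54) + (.03)(58) = 49.48$$

The discrete variance can be computed from the defining equation or from the alternate form in terms of the 2nd moment. The defining formula would be evaluated as

$$\sigma^2 = (.11)(42 - 49.48)^2 + (.23)(46 - 49.48)^2 + (.37)(50 - 49.48)^2 + (.26)(54 - 49.48)^2 + (.03)(58 - 49.48)^2 = 16.5296$$

Alternatively, using the formula with the second moment we have

$$\sigma^2 = (.11)(42)^2 + (.23)(46)^2 + (.37)(50)^2 + (.26)(54)^2 + (.03)(58)^2 - (49.48)^2 = 2464.80 - 2448.2704 = 16.5296$$

The result is the same, but the arithmetic is a little simpler in the alternate formula since we only have to deal with the fractional mean value one time. The standard deviation is then obtained by a simple square root calculation, or

$$\sigma = \sqrt{16.5296} = 4.06566$$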
 

By concentrating all the probability for each interval at its midpoint, instead of allowing a uniform distribution across each interval, we have altered the variance (and standard deviation) slightly, since the range of the midpoints only extends from 42 to 58 whereas the true histogram distribution actually extends from 40 to 60. To get the exact variance and standard deviation of the continuous histogram distribution, we have to use the second moment formula, which is easily accomplished by adding another column for the individual terms in the sum.

The exact second moment for the given data is 2466.1333, so the true histogram variance is given by

$$\sigma^2 = 2466.1333 - (49.48)^2 = 17.8629$$

which leads to a standard deviation of 4.22645. The "discretization error" in this case therefore gives a midpoint standard deviation (4.06566) that is about 4% less than the true standard deviation. So a possible heuristic when using the midpoint discretization in a situation like this one would be to compute the standard deviation using midpoints, and then increase the result by 4%.
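The whole of Part d can be checked with a short Python sketch. The variable names are illustrative, and the exact second moment of each interval is taken to be that of a uniform distribution on the interval, $(a^2 + ab + b^2)/3$ for end points $a$ and $b$, which is the assumption behind the continuous histogram model.

```python
from math import sqrt

endpoints = [40, 44, 48, 52, 56, 60]
probs = [0.11, 0.23, 0.37, 0.26, 0.03]
intervals = list(zip(endpoints, endpoints[1:]))

# Midpoint (discrete) approximation
midpoints = [(a + b) / 2 for a, b in intervals]
mean = sum(p * m for p, m in zip(probs, midpoints))
var_def = sum(p * (m - mean) ** 2 for p, m in zip(probs, midpoints))
var_alt = sum(p * m * m for p, m in zip(probs, midpoints)) - mean ** 2
sd_mid = sqrt(var_def)

# Exact continuous result: uniform second moment on each interval
exact_m2 = sum(p * (a * a + a * b + b * b) / 3
               for p, (a, b) in zip(probs, intervals))
var_exact = exact_m2 - mean ** 2
sd_exact = sqrt(var_exact)

# Shortfall of the midpoint standard deviation relative to the exact one
pct_error = 100 * (sd_exact - sd_mid) / sd_exact

print(f"mean                 = {mean:.2f}")
print(f"midpoint variance    = {var_def:.4f} (alternate form: {var_alt:.4f})")
print(f"midpoint std dev     = {sd_mid:.5f}")
print(f"exact variance       = {var_exact:.4f}")
print(f"exact std dev        = {sd_exact:.5f}")
print(f"std dev shortfall, % = {pct_error:.1f}")
```

Because every interval here has width 4, the gap between the two variances is the same constant that a uniform distribution of width 4 contributes, 16/12.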