One frequently encountered continuous probability distribution, the
Histogram Distribution, is often seen either as a means of summarizing
a real data sample, or as a means of summarizing the results of a computer
simulation. This is essentially a bar graph where the individual bars may
be of differing widths, and the heights of the bars may vary as well. Formally,
we can describe a histogram distribution with $n$ bars or intervals by means
of a set of interval endpoints $\{x_0, x_1, \ldots, x_n\}$
and a set of interval probabilities $\{p_1, \ldots, p_n\}$ such
that the interval probabilities sum to one. The density function for the
distribution is then a constant function on each interval, with height
$h_i$ such that the area over the interval is given by
$(x_i - x_{i-1})\,h_i = p_i$. The CDF for
the histogram is then given by a monotonically increasing piecewise linear
curve which starts on the left at 0 and ends on the right at 1. Graphs
for the example distribution described below are shown here to illustrate
the general shape characteristics of histogram distributions.
[Figures: PDF (histogram) and CDF of the example distribution]
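The definition above translates directly into a few lines of code. The following is a minimal Python sketch, not part of the original notes; the function name `histogram_heights_and_cdf` and the two-bar toy data are hypothetical and chosen only for illustration.

```python
from itertools import accumulate

def histogram_heights_and_cdf(edges, probs):
    """Return (heights, cdf_values) for a histogram distribution.

    edges: interval endpoints x_0 < x_1 < ... < x_n
    probs: interval probabilities p_1, ..., p_n (should sum to one)
    """
    if len(edges) != len(probs) + 1:
        raise ValueError("need exactly one more endpoint than probabilities")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("interval probabilities must sum to one")
    # Height h_i is chosen so that (x_i - x_{i-1}) * h_i = p_i.
    heights = [p / (b - a) for p, a, b in zip(probs, edges, edges[1:])]
    # CDF value at each endpoint: 0 at x_0, then the running sum of the p_i.
    cdf_values = [0.0] + list(accumulate(probs))
    return heights, cdf_values

# Hypothetical toy data: two bars of different widths.
heights, cdf_values = histogram_heights_and_cdf([0.0, 1.0, 3.0], [0.5, 0.5])
print(heights)      # [0.5, 0.25]
print(cdf_values)   # [0.0, 0.5, 1.0]
```

The wider bar carries the same probability as the narrow one, so its height is halved; the CDF values at the endpoints are the breakpoints of the piecewise linear curve.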
EXAMPLE HISTOGRAM COMPUTATION
Suppose a historical data set, or a Monte Carlo simulation, of the duration
of a project yields a histogram distribution as given in the following
table:
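| From (days) | To (days) | Frequency |
|---|---|---|
| 40 | 44 | 22 |
| 44 | 48 | 46 |
| 48 | 52 | 74 |
| 52 | 56 | 52 |
| 56 | 60 | 6 |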
a) What is the mode for this distribution? (Hint: Use the mid-point of the most frequently occurring interval.)
b) Compute the cumulative probabilities for the right end points of each interval, and plot the cumulative distribution function for the distribution.
c) By linear interpolation, compute the median (50%) and the lower (25%) and upper (75%) quartiles for the distribution. How much time should be allowed to ensure a 95% chance of completing the project on time? What is the probability that the project will take more than 55 days to complete?
d) Using the mid-point
as a representative value for each interval, use the discrete distribution
formulas to compute the mean, variance and the standard deviation for the
simulated project duration distribution.
It turns out that the mean value from the discrete mean value formula is correct for the continuous histogram distribution when the interval midpoint is used as a "representative value" for each interval, thus
$$\mu = \sum_{i=1}^{n} p_i \, \frac{x_{i-1} + x_i}{2}$$
But the true variance of the continuous distribution is given by a somewhat more complex expression, namely
$$\sigma^2 = \sum_{i=1}^{n} p_i \, \frac{x_{i-1}^2 + x_{i-1} x_i + x_i^2}{3} - \mu^2$$
Obtain the true variance and standard deviation of the histogram distribution for the given data using this exact formula and compare with the approximate variance and standard deviation obtained with the discrete formula applied to the interval midpoints. What is the percentage error in the standard deviation resulting from using the midpoint approximation instead of the exact continuous formulas?
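For readers who want to see where the exact variance expression comes from, here is a short derivation sketch. It assumes only the definitions given earlier (constant height $h_i = p_i / (x_i - x_{i-1})$ on each interval); the symbol $E[X^2]$ for the second moment is introduced here for convenience.

$$E[X^2] = \sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} x^2 \, h_i \, dx = \sum_{i=1}^{n} h_i \, \frac{x_i^3 - x_{i-1}^3}{3} = \sum_{i=1}^{n} p_i \, \frac{x_{i-1}^2 + x_{i-1} x_i + x_i^2}{3}$$

The last step uses the factorization $x_i^3 - x_{i-1}^3 = (x_i - x_{i-1})(x_{i-1}^2 + x_{i-1} x_i + x_i^2)$ together with $(x_i - x_{i-1})\,h_i = p_i$. The variance then follows from $\sigma^2 = E[X^2] - \mu^2$.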
LINEAR INTERPOLATION FORMULAS
In solving this problem, we will make use of some standard formulas for linear interpolation, which in our context can be stated as follows. If $(x_k, p_k)$ and $(x_{k+1}, p_{k+1})$ are two consecutive breakpoints on your CDF (so $p_k$ here denotes the cumulative probability at $x_k$, not an interval probability), and if $p$ is intermediate between $p_k$ and $p_{k+1}$, then the linear interpolation formula is
$$x = x_k + (x_{k+1} - x_k) \, \frac{p - p_k}{p_{k+1} - p_k}$$
where $x_{k+1} - x_k$
is the width of the interval containing $x$ and
$\dfrac{p - p_k}{p_{k+1} - p_k}$ is
the fraction of the way through the interval indicated by the location
of $p$ with respect to the probability end-points $p_k$ and $p_{k+1}$.
Conversely, if given an $x$ value between $x_k$ and $x_{k+1}$,
then the corresponding $p$ value would be
$$p = p_k + (p_{k+1} - p_k) \, \frac{x - x_k}{x_{k+1} - x_k}$$
with analogous interpretations of the various terms.
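The two interpolation directions can be sketched as a pair of small Python functions. The function names and the sample breakpoints (10, 0.2) and (20, 0.6) are hypothetical, chosen only to show the round trip from a probability to an x value and back.

```python
def interpolate_x(p, xk, pk, xk1, pk1):
    """x value at which the CDF reaches p, for pk <= p <= pk1."""
    fraction = (p - pk) / (pk1 - pk)      # fraction of the way through the interval
    return xk + fraction * (xk1 - xk)     # left endpoint plus that share of the width

def interpolate_p(x, xk, pk, xk1, pk1):
    """Cumulative probability at x, for xk <= x <= xk1."""
    fraction = (x - xk) / (xk1 - xk)
    return pk + fraction * (pk1 - pk)

# Hypothetical breakpoints (10, 0.2) and (20, 0.6), for illustration only.
x_mid = interpolate_x(0.4, 10.0, 0.2, 20.0, 0.6)
p_back = interpolate_p(x_mid, 10.0, 0.2, 20.0, 0.6)
print(x_mid, p_back)   # approximately 15.0 and 0.4
```

Because both functions use the same fraction-of-the-interval idea, applying one after the other returns the starting value up to floating-point rounding.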
HISTOGRAM DISTRIBUTION SOLUTIONS
Part a). Since the 48-52 interval occurs most frequently (74 out of 200 trials) we take its mid-point, or 50, as the mode for the distribution.
Part b). By adding three more columns to the given data table, one for
cumulative frequency and two for the relative frequency measures, we obtain
the cumulative probability values to associate with the right endpoint
of each interval.
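| From (days) | To (days) | Frequency | Cumulative frequency | Relative frequency | Cumulative relative frequency |
|---|---|---|---|---|---|
| 40 | 44 | 22 | 22 | .11 | .11 |
| 44 | 48 | 46 | 68 | .23 | .34 |
| 48 | 52 | 74 | 142 | .37 | .71 |
| 52 | 56 | 52 | 194 | .26 | .97 |
| 56 | 60 | 6 | 200 | .03 | 1.00 |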
Setting the cumulative probability to 0 at 40, we get the CDF plot shown earlier. Notice that straight-line interpolations between the interval end-points are used to get the complete CDF from the values in our histogram table. This corresponds to integrating the piecewise-constant PDF of the histogram: a constant density on each interval integrates to a straight-line segment of the CDF.
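The Part b) bookkeeping can be sketched in Python as follows. Only the count of 74 out of 200 is quoted directly in the text; the other counts used here are an assumption, inferred from the cumulative probabilities .11, .34, .71 and .97 used in Part c). The variable names are illustrative.

```python
from itertools import accumulate

edges = [40, 44, 48, 52, 56, 60]    # interval endpoints, in days
freqs = [22, 46, 74, 52, 6]         # counts per interval (inferred, see note above)

total = sum(freqs)                                # 200 trials
cum_freqs = list(accumulate(freqs))               # cumulative frequency at each right endpoint
rel_freqs = [f / total for f in freqs]            # interval probabilities p_i
cum_probs = [c / total for c in cum_freqs]        # CDF value at each right endpoint

for right, cf, rf, cp in zip(edges[1:], cum_freqs, rel_freqs, cum_probs):
    print(f"{right:>3} {cf:>4} {rf:.2f} {cp:.2f}")
```

Prepending the point (40, 0) to the pairs of right endpoints and cumulative probabilities gives the full list of CDF breakpoints.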
Part c). The linear interpolation for the median would proceed as follows. First note that it would lie in the interval from 48 to 52 since the cumulative probability is less than .5 at 48 (namely .34) and greater than .5 at 52 (namely .71). The fraction of the way through the interval is given by the ratio (50-34)/(71-34) or .432432... so the median for the distribution is at 48 + .432432(52-48) = 49.7297. Similarly, the 25% fractile will fall between 44 and 48, the fraction being given by (25-11)/(34-11) or .60869565. Thus the 25% fractile is at 44 + .60869565(48-44) or 46.4348. The 75% fractile will fall between 52 and 56, the fraction being given by (75-71)/(97-71) or .153846. Thus the 75% fractile is at 52 + .153846(56-52) or 52.6154. To achieve a 95% chance of completing on time we would still be in the 52-56 interval but closer to the 56 end. This time the fraction would be (95-71)/(97-71) or .923077. Thus the 95% level would be at 52 + .923077(56-52) or 55.6923 days.
Now working the formula in the other direction, if we ask about a time of 55 days we are 3/4 or .75 of the way through the interval from 52 to 56, so the corresponding cumulative probability would be .71 + .75(.97-.71) or .905. The question asks for the probability of taking MORE than 55 days (a right tail question), so the requested probability would be 1 - .905 or .095.
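As a cross-check on the Part c) arithmetic, here is a minimal Python sketch that locates the right interval automatically and then interpolates. The breakpoints are the ones used in the text; the function names `quantile` and `cdf` are hypothetical, and the standard-library `bisect` module is used only to find the bracketing interval.

```python
from bisect import bisect_left, bisect_right

# CDF breakpoints: endpoints (days) and cumulative probabilities from Part b).
EDGES = [40.0, 44.0, 48.0, 52.0, 56.0, 60.0]
CUM = [0.0, 0.11, 0.34, 0.71, 0.97, 1.0]

def quantile(p):
    """Duration x with CDF(x) = p, by linear interpolation (0 < p <= 1)."""
    k = bisect_left(CUM, p) - 1                      # CUM[k] < p <= CUM[k + 1]
    frac = (p - CUM[k]) / (CUM[k + 1] - CUM[k])
    return EDGES[k] + frac * (EDGES[k + 1] - EDGES[k])

def cdf(x):
    """Cumulative probability at x, for 40 <= x <= 60."""
    k = min(bisect_right(EDGES, x) - 1, len(EDGES) - 2)
    frac = (x - EDGES[k]) / (EDGES[k + 1] - EDGES[k])
    return CUM[k] + frac * (CUM[k + 1] - CUM[k])

print(round(quantile(0.50), 4))   # 49.7297 (median)
print(round(quantile(0.25), 4))   # 46.4348 (lower quartile)
print(round(quantile(0.75), 4))   # 52.6154 (upper quartile)
print(round(quantile(0.95), 4))   # 55.6923 (95% level)
print(round(1 - cdf(55.0), 4))    # 0.095  (probability of more than 55 days)
```

The sketch does not guard against arguments outside the stated ranges; a production version would need to.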
Part d). The mid-points of the intervals are 42, 46, 50, 54, and 58 respectively, so the mean value computation would be
$$\mu = .11(42) + .23(46) + .37(50) + .26(54) + .03(58) = 49.48$$
The discrete variance formula can be computed from the defining equation
or from the alternate form in terms of the 2nd moment. The defining formula
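would be evaluated as
$$\sigma^2 = .11(42-49.48)^2 + .23(46-49.48)^2 + .37(50-49.48)^2 + .26(54-49.48)^2 + .03(58-49.48)^2 = 16.5296$$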
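Alternatively, using the formula with the second moment we have
$$\sigma^2 = .11(42)^2 + .23(46)^2 + .37(50)^2 + .26(54)^2 + .03(58)^2 - (49.48)^2 = 2464.80 - 2448.2704 = 16.5296$$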
The result is the same, but the arithmetic is a little simpler in the
alternate formula since we only have to deal with the fractional mean value
one time. The standard deviation is then obtained by a simple square root
calculation, or
$$\sigma = \sqrt{16.5296} \approx 4.0657$$
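The midpoint calculations of Part d) can be sketched in Python as below. The interval probabilities are the differences of the cumulative probabilities used in Part c); the variable names are illustrative only.

```python
from math import sqrt

midpoints = [42, 46, 50, 54, 58]
probs = [0.11, 0.23, 0.37, 0.26, 0.03]   # interval probabilities p_i

# Discrete mean with each midpoint as the representative value.
mean = sum(p * m for p, m in zip(probs, midpoints))

# Variance from the defining formula: sum of p_i * (m_i - mean)^2.
var_defining = sum(p * (m - mean) ** 2 for p, m in zip(probs, midpoints))

# Variance from the alternate form: second moment minus the squared mean.
second_moment = sum(p * m * m for p, m in zip(probs, midpoints))
var_alternate = second_moment - mean ** 2

std = sqrt(var_defining)

print(round(mean, 2), round(var_defining, 4), round(var_alternate, 4), round(std, 4))
# 49.48 16.5296 16.5296 4.0657
```

Both variance forms agree to within floating-point rounding, which is why the printed values are rounded.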
By concentrating all the probability for each interval at its midpoint,
instead of allowing a uniform distribution across each interval, we have
altered its variance (and standard deviation) slightly since the range
of the midpoints only extends from 42 to 58 whereas the true histogram
distribution actually extends from 40 to 60. To get the exact variance
and standard deviation of the continuous histogram distribution, we have
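to use the second moment formula
$$E[X^2] = \sum_{i=1}^{n} p_i \, \frac{x_{i-1}^2 + x_{i-1} x_i + x_i^2}{3}$$
which
is easily accomplished by adding another column for the individual terms
in the sum.

| From (days) | To (days) | $p_i$ | $(x_{i-1}^2 + x_{i-1} x_i + x_i^2)/3$ | Term in sum |
|---|---|---|---|---|
| 40 | 44 | .11 | 1765.3333 | 194.1867 |
| 44 | 48 | .23 | 2117.3333 | 486.9867 |
| 48 | 52 | .37 | 2501.3333 | 925.4933 |
| 52 | 56 | .26 | 2917.3333 | 758.5067 |
| 56 | 60 | .03 | 3365.3333 | 100.9600 |
| Sum | | 1.00 | | 2466.1333 |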
Thus the true histogram variance is given by
$$\sigma^2 = 2466.1333 - (49.48)^2 = 2466.1333 - 2448.2704 = 17.8629$$
which leads to a standard deviation of 4.22645. Thus the "discretization
error" in this case leads to a standard deviation which is about 4% less
than the true standard deviation. So a possible heuristic when using
the midpoint discretization in a situation like this one would be to compute
the standard deviation using midpoints, and then increase the result by
4%.
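The exact calculation and the comparison with the midpoint approximation can be sketched in Python as follows. The variable names are illustrative. The `spread` term is an addition not stated in the text: it relies on the standard fact that a uniform distribution of width $w$ has variance $w^2/12$, so the two variances should differ by the sum of $p_i w_i^2 / 12$.

```python
from math import sqrt

edges = [40, 44, 48, 52, 56, 60]
probs = [0.11, 0.23, 0.37, 0.26, 0.03]
intervals = list(zip(edges, edges[1:]))          # (x_{i-1}, x_i) pairs

mean = sum(p * (a + b) / 2 for p, (a, b) in zip(probs, intervals))

# Exact second moment of the continuous histogram distribution.
second_moment = sum(p * (a * a + a * b + b * b) / 3 for p, (a, b) in zip(probs, intervals))
var_exact = second_moment - mean ** 2
std_exact = sqrt(var_exact)

# Midpoint (discrete) approximation for comparison.
var_mid = sum(p * ((a + b) / 2 - mean) ** 2 for p, (a, b) in zip(probs, intervals))
std_mid = sqrt(var_mid)

# Within-interval spread that the midpoint approximation leaves out.
spread = sum(p * (b - a) ** 2 / 12 for p, (a, b) in zip(probs, intervals))

pct_error = 100 * (std_exact - std_mid) / std_exact
print(round(var_exact, 4), round(std_exact, 4), round(pct_error, 1))
# 17.8629 4.2265 3.8
```

With equal interval widths of 4 days the omitted spread is 16/12, or about 1.3333, which accounts for the whole gap between 16.5296 and 17.8629. This suggests the size of the correction depends on the interval widths relative to the overall spread, so the 4% figure should be read as specific to this data set.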