The purpose of descriptive statistics is to present a mass of data in a more understandable form. We may summarize the data in numbers as (a) some form of average, or in some cases a proportion, (b) some measure of variability or spread, and (c) quantities such as quartiles or percentiles, which divide the data so that certain percentages of the data are above or below these marks. Furthermore, we may choose to describe the data by various graphical displays or by the bar graphs called histograms, which show the distribution of data among various intervals of the varying quantity.
Various “averages” are used to indicate a central value of a set of data. Some of these are referred to as means.
(a) Arithmetic Mean
Of these “averages,” the most common and familiar is the arithmetic mean, defined by
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
where $x_1, x_2, \dots, x_n$ are the $n$ observations.
(b) Other Means
The geometric mean, logarithmic mean, and harmonic mean are all important in some areas of engineering. The geometric mean is defined as the nth root of the product of n observations:
$$\text{Geometric mean} = \sqrt[n]{x_1 x_2 \cdots x_n}$$
The logarithmic mean of two numbers is given by the difference between the two numbers, divided by the difference of their natural logarithms. It is used particularly in heat transfer and mass transfer.
$$\text{Logarithmic mean} = \frac{x_1 - x_2}{\ln x_1 - \ln x_2}$$
The harmonic mean involves inverses—i.e., one divided by each of the quantities. The harmonic mean is the inverse of the arithmetic mean of all the inverses.
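The four means described above can be compared on a small data set. The following is a minimal sketch using Python's standard `statistics` module; the data values and the helper name `logarithmic_mean` are illustrative assumptions, and the logarithmic mean is computed with the usual definition (difference of the two numbers divided by the difference of their natural logarithms).

```python
import math
import statistics


def logarithmic_mean(a, b):
    """Logarithmic mean of two positive numbers: (a - b) / (ln a - ln b)."""
    if a <= 0 or b <= 0:
        raise ValueError("both numbers must be positive")
    if a == b:
        # The formula is 0/0 here; its limiting value is the number itself.
        return float(a)
    return (a - b) / (math.log(a) - math.log(b))


data = [2.0, 4.0, 8.0]  # illustrative values, not taken from the text

arithmetic = statistics.fmean(data)           # (2 + 4 + 8) / 3
geometric = statistics.geometric_mean(data)   # cube root of 2 * 4 * 8
harmonic = statistics.harmonic_mean(data)     # 3 / (1/2 + 1/4 + 1/8)
log_mean = logarithmic_mean(2.0, 8.0)         # 6 / ln 4

print(f"arithmetic  = {arithmetic:.4f}")
print(f"geometric   = {geometric:.4f}")
print(f"harmonic    = {harmonic:.4f}")
print(f"logarithmic = {log_mean:.4f}")
```

For positive data that are not all equal, the harmonic mean is smaller than the geometric mean, which in turn is smaller than the arithmetic mean.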
Another representative quantity, quite different from a mean, is the median. If all the items with which we are concerned are sorted in order of increasing magnitude (size), from the smallest to the largest, then the median is the middle item. Consider the five items: 12, 13, 21, 27, 31. Then 21 is the median. If the number of items is even, the median is given by the arithmetic mean of the two middle items. Consider the six items: 12, 13, 21, 27, 31, 33.
The median is (21 + 27) / 2 = 24. One desirable property of the median is that it is not much affected by outliers.
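The sorting-and-middle-item procedure above can be sketched in a few lines of Python. The function name `median` and the outlier value 3100 are illustrative assumptions; the two lists are the five-item and six-item examples from the text.

```python
import statistics


def median(items):
    """Middle item of the sorted data, or the mean of the two middle items."""
    ordered = sorted(items)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_items = [12, 13, 21, 27, 31]
even_items = [12, 13, 21, 27, 31, 33]

print(median(odd_items))    # 21
print(median(even_items))   # 24.0

# Replace the largest item with an outlier: the median does not move,
# while the arithmetic mean is dragged far upward.
with_outlier = [12, 13, 21, 27, 3100]
print(median(with_outlier), statistics.fmean(with_outlier))
```

The standard library's `statistics.median` follows the same rule and could be used instead of the hand-written function.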
If the frequency varies from one item to another, the mode is the value which appears most frequently. In the case of continuous variables the frequency depends upon how many digits are quoted, so the mode is more usefully considered as the midpoint of the class with the largest frequency.
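Both readings of the mode can be illustrated with a short sketch. The data, the class width of 0.1, and the helper name `modal_class_midpoint` are assumptions chosen for illustration; if two classes tie for the largest frequency, this sketch simply reports the first one encountered.

```python
from collections import Counter
import statistics

# Discrete data: the mode is the value that appears most often.
counts = [3, 5, 5, 2, 5, 3, 1]
print(statistics.multimode(counts))  # [5]


def modal_class_midpoint(values, lower, width):
    """Midpoint of the class [lower + k*width, lower + (k+1)*width) with most values."""
    class_counts = Counter(int((v - lower) // width) for v in values)
    k, _frequency = class_counts.most_common(1)[0]
    return lower + (k + 0.5) * width


# Continuous data: every reading is distinct, so group into classes first.
readings = [1.02, 1.07, 1.11, 1.13, 1.18, 1.19, 1.26, 1.31]
modal_midpoint = modal_class_midpoint(readings, lower=1.0, width=0.1)
print(round(modal_midpoint, 2))
```

Here four of the eight readings fall in the class from 1.1 to 1.2, so the mode is reported as the midpoint of that class.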
Variability or Spread of the Data
(a) Sample Range
One simple measure of variability is the sample range, the difference between the smallest item and the largest item in each sample. For small samples all of the same size, the sample range is a useful quantity. However, it is not a good indicator if the sample size varies, because the sample range tends to increase with increasing sample size. Its other major drawback is that it depends on only two items in each sample, the smallest and the largest, so it does not make use of all the data.
This disadvantage becomes more serious as the sample size increases. Because of its simplicity, the sample range is used frequently in quality control when the sample size is constant; simplicity is particularly desirable in this case so that people do not need much education to apply the test.
(b) Interquartile Range
The interquartile range is the difference between the upper quartile and the lower quartile. It is used fairly frequently as a measure of variability, particularly in the Box Plot. It is used less than some alternatives because it is not related to any of the important theoretical distributions.
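The sample range and the interquartile range can be computed as follows. This is a sketch only: several conventions exist for locating quartiles in a small sample, and the one used here is whatever `statistics.quantiles` applies by default, which is an assumption rather than the method of the text. The extra item 95 is an illustrative outlier.

```python
import statistics


def sample_range(sample):
    """Difference between the largest and smallest items."""
    return max(sample) - min(sample)


def interquartile_range(sample):
    """Upper quartile minus lower quartile (default quantile convention)."""
    lower_q, _median, upper_q = statistics.quantiles(sample, n=4)
    return upper_q - lower_q


sample = [12, 13, 21, 27, 31, 33]
print(sample_range(sample), interquartile_range(sample))

# One extreme item changes the range drastically, because the range
# depends only on the smallest and largest items.
larger_sample = sample + [95]
print(sample_range(larger_sample), interquartile_range(larger_sample))
```

Adding the single outlier moves the range from 21 to 83, while the interquartile range changes only slightly.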
(c) Mean Deviation from the Mean
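The mean deviation from the mean, defined as
$$\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right),$$
where $\bar{x}$ is the arithmetic mean of the $n$ items, is useless as a measure of variability because it is always zero: the positive and negative deviations cancel exactly.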
(d) Mean Absolute Deviation from the Mean
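However, the mean absolute deviation from the mean, defined as
$$\frac{1}{n}\sum_{i=1}^{n}\left|x_i - \bar{x}\right|,$$
is zero only when all the items are equal, because taking absolute values prevents positive and negative deviations from canceling.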
Its disadvantage is that it is not simply related to the parameters of theoretical distributions.
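The contrast between the two quantities is easy to check numerically. In this sketch the function names are illustrative, and the data are the five items used earlier for the median.

```python
import statistics


def mean_deviation(data):
    """Mean of the signed deviations from the arithmetic mean."""
    mean = statistics.fmean(data)
    return sum(x - mean for x in data) / len(data)


def mean_absolute_deviation(data):
    """Mean of the absolute deviations from the arithmetic mean."""
    mean = statistics.fmean(data)
    return sum(abs(x - mean) for x in data) / len(data)


data = [12, 13, 21, 27, 31]  # arithmetic mean is 20.8

signed = mean_deviation(data)             # zero, up to rounding error
absolute = mean_absolute_deviation(data)  # (8.8 + 7.8 + 0.2 + 6.2 + 10.2) / 5

print(f"mean deviation          = {signed:.6f}")
print(f"mean absolute deviation = {absolute:.2f}")
```

The signed deviations sum to zero for any data set, so only the absolute version carries information about spread.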
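(e) Variance
Variance is defined as
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2$$
where $\mu$ is the mean of the population and $N$ is the number of items in it.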
It is the mean of the squares of the deviations of each measurement from the mean of the population. Since the square of any real number, positive or negative, is never negative, the variance is never negative; it is zero only when every measurement equals the mean.
(f) Standard Deviation
The standard deviation is extremely important. It is defined as the square root of the variance:
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}$$
Thus, it has the same units as the original data and is representative of the deviations from the mean.
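A short sketch shows the variance and standard deviation computed directly from their definitions and checked against Python's `statistics` module. The data and function names are illustrative. The last line uses the sample versions, which divide by n - 1 rather than n; that divisor is the usual convention for a sample and is an assumption here, since the text above defines only the population form.

```python
import math
import statistics


def population_variance(data):
    """Mean of the squared deviations from the mean (data = whole population)."""
    mu = statistics.fmean(data)
    return sum((x - mu) ** 2 for x in data) / len(data)


def population_std_dev(data):
    """Square root of the population variance; same units as the data."""
    return math.sqrt(population_variance(data))


data = [12, 13, 21, 27, 31]

var = population_variance(data)
sd = population_std_dev(data)
print(f"variance = {var:.2f}, standard deviation = {sd:.3f}")

# Library equivalents of the population formulas.
print(statistics.pvariance(data), statistics.pstdev(data))

# Sample versions (divisor n - 1), somewhat larger for the same data.
print(statistics.variance(data), statistics.stdev(data))
```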
(g) Coefficient of Variation
A dimensionless quantity, the coefficient of variation is the ratio between the standard deviation and the mean for the same set of data, expressed as a percentage. This can be either (σ / μ) or (s / x̄), whichever is appropriate, multiplied by 100%.
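Because the coefficient of variation is dimensionless, it should not change when the data are expressed in different units. The sketch below checks this; the function name, the `population` flag, and the length data are illustrative assumptions.

```python
import statistics


def coefficient_of_variation(data, population=False):
    """Standard deviation divided by the mean, as a percentage."""
    mean = statistics.fmean(data)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    # sigma / mu for a whole population, s / x-bar for a sample
    sd = statistics.pstdev(data) if population else statistics.stdev(data)
    return 100.0 * sd / mean


lengths_mm = [12.0, 13.0, 21.0, 27.0, 31.0]
lengths_m = [x / 1000 for x in lengths_mm]  # same lengths in metres

cv_mm = coefficient_of_variation(lengths_mm)
cv_m = coefficient_of_variation(lengths_m)
print(f"{cv_mm:.2f}%  {cv_m:.2f}%")  # identical apart from rounding error
```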
Quartiles, Deciles, Percentiles, and Quantiles
Quartiles, deciles, and percentiles divide a frequency distribution into a number of parts containing equal frequencies. The items are first put into order of increasing magnitude.
- Quartiles divide the range of values into four parts, each containing one quarter of the values. If an item comes exactly on a dividing line, half of it is counted in the group above and half is counted below.
- Deciles divide into ten parts, each containing one tenth of the total frequency.
- Percentiles divide into a hundred parts, each containing one hundredth of the total frequency.
- Quantiles divide a frequency distribution into parts containing stated proportions of the distribution.
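The dividing points described in the list above can be obtained with `statistics.quantiles`, which returns the n - 1 cut points that split sorted data into n parts of equal frequency. This is a sketch under assumptions: the data (the integers 1 to 99) are illustrative, and the function's default interpolation convention may differ slightly from a hand calculation on small samples.

```python
import statistics

data = list(range(1, 100))  # illustrative data: the integers 1 to 99

quartiles = statistics.quantiles(data, n=4)      # 3 cut points, 4 parts
deciles = statistics.quantiles(data, n=10)       # 9 cut points, 10 parts
percentiles = statistics.quantiles(data, n=100)  # 99 cut points, 100 parts

print(quartiles)    # [25.0, 50.0, 75.0]
print(deciles[:3])  # first three deciles
print(len(percentiles))

# The second quartile, fifth decile, and fiftieth percentile all
# coincide with the median.
print(quartiles[1], deciles[4], percentiles[49], statistics.median(data))
```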