**Techniques of Sampling and Combining Variables in Statistics**

**Sampling**

**A sample is a group of objects or readings taken from a population for counting or measurement.** From the observations of the sample, we infer properties of the population. For example, the sample mean, x̄, is an unbiased estimate of the population mean, μ, and the sample variance, s², is an unbiased estimate of the corresponding **population variance, σ²**.

The sample must be **random**, meaning that all possible samples of a particular size must have equal probabilities of being chosen from the population. This prevents bias in the **sampling process**. If the effects of one or more factors are being investigated but other factors (not of direct interest) may interfere, sampling must be done carefully to avoid bias from the interfering factors. The **effects of the interfering factors** can be minimized (and usually made negligible) by randomizing with great care both the choice of the parts of the sample that receive different treatments and the order in which items are taken for sampling and analysis.
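Random selection and randomized ordering can both be done with Python's standard library; a minimal sketch, assuming a hypothetical population of 100 numbered items:

```python
import random

population = list(range(1, 101))  # hypothetical population of 100 numbered items

random.seed(42)  # fixed seed so the draw is reproducible
# random.sample gives every size-10 subset an equal probability of selection
sample = random.sample(population, k=10)

# randomize the order in which the sampled items are treated and analysed
random.shuffle(sample)
print(sample)
```

The shuffle step addresses the second point in the text: even after an unbiased draw, the order of treatment and analysis should itself be randomized.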

**Linear Combination of Independent Variables**

Say we have two independent variables, X and Y. Then a linear combination consists of the sum of a constant multiplied by one variable and another constant multiplied by the other variable. Algebraically, this becomes W = aX + bY, where W is the combined variable and a and b are constants.

The mean of a linear combination is exactly what we would expect: **μW = aμX + bμY**, that is, mean(W) = a mean(X) + b mean(Y).

If we multiply a variable by a constant, the variance increases by a factor of the constant squared: variance(aX) = a² variance(X). This is consistent with the fact that variance has units of the square of the variable. Variances must increase when two variables are combined: there can be no cancellation because **variabilities accumulate**.

**Variance** is always a positive quantity, so a variance multiplied by the square of a constant is positive. Thus, the following relation for the combination of two independent variables is reasonable:

σW² = a²σX² + b²σY²

More than two independent variables can be combined in the same way.
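The relation σW² = a²σX² + b²σY² can be checked numerically; a small simulation sketch, where the uniform populations and the constants a and b are chosen only for illustration:

```python
import random

random.seed(1)
n = 200_000
a, b = 3.0, -2.0  # arbitrary constants for the combination W = aX + bY

# two independent variables: X uniform on [0, 1], Y uniform on [0, 2]
xs = [random.uniform(0, 1) for _ in range(n)]
ys = [random.uniform(0, 2) for _ in range(n)]

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

var_w = variance([a * x + b * y for x, y in zip(xs, ys)])
predicted = a**2 * variance(xs) + b**2 * variance(ys)
# var_w and predicted agree closely; note b is negative, yet the
# variances still add -- variabilities accumulate, they never cancel
```

Note that b is negative here: the combined variance is still larger than either individual term, illustrating that there is no cancellation of variability.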

If the independent variables X and Y are simply added together, the constants a and b are both equal to one, so the individual variances are added:

σW² = σX² + σY²

If the variable W is the sum of n independent variables X, each of which has the same probability distribution and so the same variance σX², then

σW² = nσX²

**Variance of Sample Means**

The variance of the sample mean is an indication of the reliability of the sample mean as an estimate of the population mean.

Say the sample consists of n independent observations. If each observation is multiplied by 1/n, the sum of the products is the mean of the observations, X̄. That is,

X̄ = (1/n)X1 + (1/n)X2 + … + (1/n)Xn

Then the variance of X̄ is

σX̄² = (1/n)²σ1² + (1/n)²σ2² + … + (1/n)²σn²

But the variables all come from the same distribution with variance σ², so

σX̄² = n(1/n)²σ² = σ²/n
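The σ²/n behaviour can be verified by simulation; a minimal sketch, assuming a uniform population on [0, 1] (for which σ² = 1/12) and a sample size of 10:

```python
import random

random.seed(0)
n = 10            # sample size
trials = 50_000   # number of sample means to generate

# population: uniform on [0, 1], so the population variance is 1/12
means = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]

grand_mean = sum(means) / trials
var_of_means = sum((m - grand_mean) ** 2 for m in means) / trials
# var_of_means comes out close to (1/12)/n = sigma^2 / n
```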

**Example:**

A population consists of one 2, one 5, and one 9. Samples of size 2 are chosen randomly from this population with replacement. What is the distribution of the sample means?

**Answer:** The original population has a mean of 16/3 = 5.3333 and a variance of (2² + 5² + 9² − 16²/3) / 3 = 8.2222, so a standard deviation of 2.8674. Its probability distribution assigns probability 1/3 to each of the values 2, 5, and 9.
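These figures, and the σ²/n rule for the sample means, can be checked by enumerating all nine equally likely ordered samples of size 2; a short sketch using the example's population:

```python
from itertools import product

population = [2, 5, 9]
k = len(population)

mu = sum(population) / k                                    # 5.3333
sigma2 = sum((x - mu) ** 2 for x in population) / k         # 8.2222

# all 9 equally likely ordered samples of size 2, drawn with replacement
sample_means = [(x + y) / 2 for x, y in product(population, repeat=2)]

mean_of_means = sum(sample_means) / len(sample_means)
var_of_means = sum((m - mean_of_means) ** 2
                   for m in sample_means) / len(sample_means)

# mean_of_means equals mu, and var_of_means equals sigma2 / 2,
# confirming that the variance of sample means is sigma^2 / n
```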

**Shape of Distribution of Sample Means: Central Limit Theorem**

**Central Limit Theorem:** If random and independent samples are taken from any practical population with mean μ and variance σ², then as the sample size n increases, the distribution of sample means approaches a normal distribution.

The sampling distribution will have **mean μ** and **variance σ²/n**. How large does the sample size have to be before the distribution of sample means becomes approximately normal?

That depends on the** shape of the original distribution**. If the original population was normally distributed, means of samples of any size at all will be normally distributed (and sums and differences of normally distributed variables will also be normally distributed). If the original distribution was not normal, the means of samples of size two or larger will come closer to a normal distribution. Sample means of samples taken from almost all distributions encountered in practice will be normally distributed with negligible error if the sample size is at least 30. Almost the only exceptions will be samples taken from populations containing distant outliers.
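The approach to normality can be seen in a simulation sketch; the exponential population below is an assumption chosen for illustration because it is strongly skewed (skewness 2), yet means of samples of size 30 are already much more symmetric:

```python
import random

random.seed(7)
n, trials = 30, 20_000

# strongly skewed population: exponential with mean 1 and variance 1
means = [sum(random.expovariate(1.0) for _ in range(n)) / n
         for _ in range(trials)]

grand = sum(means) / trials
var_m = sum((m - grand) ** 2 for m in means) / trials

# grand is close to the population mean 1, and var_m close to 1/30,
# as the Central Limit Theorem predicts for mean mu and variance sigma^2/n

# a rough symmetry check: the skewness of the sample means is far
# smaller than the population's skewness of 2
m3 = sum((m - grand) ** 3 for m in means) / trials
skew = m3 / var_m ** 1.5
```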

The **Central Limit Theorem** also gives us some indication of which **sets of measurements** are likely to be closely approximated by normal distributions. If variation is caused by several small, independent, random sources of variation of similar size, the measurements are likely to be close to a **normal distribution**.

Of course, if one variable affects the **probability distribution** of the result in a form of conditional probability, so that the probability distribution changes as the variable changes, we cannot expect the result to be distributed normally. If the most important single factor is far from being normally distributed, the resulting distribution may not be close to normal. If there are only a few sources of variation, the resulting measurements are not likely to follow a distribution close to normal.