A box plot is a graph that shows how our data is spread out. (Data distribution simply means the spread of the data.) It uses just five important numbers to give you a quick look at the data, and it is good at pointing out any unusual values. Throughout this blog, we will keep it simple and talk about what a box plot is, why it is helpful, and what it can tell you about your data.
What is a Box plot in Data Visualization?
A box plot, also called a box-and-whiskers plot, is a way to see how data is spread out, find unusual values, and determine whether the data leans more to one side. Let’s break it down.
- A box plot gives you a quick look at where most of the data lies. It does this using the five-number summary, which contains the minimum, the maximum, the median, and two more values that mark the edges of where most of the data falls. We shall learn about them in detail below.
- The unusual values are called outliers. These are the really low or really high numbers in the data: values that are simply not like the rest of your data.
- When we mention the skewness of the data, we are checking whether the data is balanced around its middle or leans more to one side.
So, now that we are clear on the basic terms (outliers, data distribution, and skewness), let’s go ahead and see what a five-number summary is.
Five-Number Summary of a Box Plot
A box plot is built from five values that together describe how the numbers in the data are spread out. This is known as the five-number summary, and it comprises the Minimum, Quartile 1, Median, Quartile 3, and Maximum. Let’s discuss each of them, along with the related terms.
- Median: Also known as Quartile 2 or Q2, this is the middle value of the dataset. If we lay our data out on a straight line, the middle point is marked as the median on the box plot. For example, if our data is the numbers from 0 to 10, then 5 is the median, or Quartile 2.
- Quartile 1 and Quartile 3: Now let’s take each side of the median and divide it in half again. This gives us one value to the right of the median and one to the left. These are called quartiles. The value to the right of the median is Quartile 3 (Q3), also referred to as the 75th percentile, meaning that 75 percent of the data lies below that point. The value to the left of the median is Quartile 1 (Q1), also known as the 25th percentile, meaning that 25 percent of the data lies below that point. In the 0 to 10 example, Q1 is 2.5 and Q3 is 7.5.
- Minimum and Maximum: Sometimes labelled Quartile 0 and Quartile 4, these mark the lower and upper limits of the box plot. In this blog they are calculated limits, not simply the smallest and largest values in the data. They are important because any data point that falls outside them is treated as an outlier. To find the minimum and maximum, we first find the distance between Q1 and Q3, called the interquartile range or IQR. The formulas are given below.
- Lower Whisker and Upper Whisker: The lower whisker covers the data that is higher than the minimum and lower than Quartile 1. The upper whisker covers the data that is lower than the maximum and higher than Quartile 3.
- Outliers: These are values that are either higher than the maximum or lower than the minimum, that is, values which are very different from the rest of the data. In the 0 to 10 example, if we add the number 20, it will be treated as an outlier, since it is much larger than our normal spread of data. Similarly, the number -10 is not part of our normal spread of data, so it will also be treated as an outlier.
- Formula to Find the IQR: The interquartile range is the range between Quartile 1 and Quartile 3. We get it by subtracting Q1 from Q3. Below is the formula: IQR = Q3 - Q1
- Formula to Find Minimum and Maximum: Finding the minimum and maximum is essential for pinpointing outliers. Any data beyond these two limits is considered an outlier. In the formula below, any data that falls more than 1.5 times the IQR below Q1, or more than 1.5 times the IQR above Q3, is considered an outlier. Below is the formula to find the minimum and maximum, with 1.5 as a constant: Minimum = Q1 - 1.5 × IQR and Maximum = Q3 + 1.5 × IQR
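The two formulas can be tried out on the 0 to 10 example with a short Python sketch. The function name `outlier_limits` is made up for illustration, and the choice of the "inclusive" quartile method from Python's `statistics` module is an assumption; other quartile conventions can give slightly different values for Q1 and Q3.

```python
from statistics import quantiles


def outlier_limits(values, k=1.5):
    """Return the (minimum, maximum) limits: Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: the "inclusive" method, which gives Q1 = 2.5 and Q3 = 7.5
    # for the numbers 0 to 10, matching the example in the text.
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


numbers = list(range(11))                # 0, 1, ..., 10
minimum, maximum = outlier_limits(numbers)
print(minimum, maximum)                  # -5.0 15.0

# Check a few candidate values against the limits.
for candidate in (-10, 5, 20):
    is_outlier = candidate < minimum or candidate > maximum
    print(candidate, "outlier" if is_outlier else "within limits")
```

Here the IQR is 7.5 - 2.5 = 5, so the limits are -5 and 15, and both -10 and 20 land outside them.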
Why Do We Use a Box Plot?
- Box plots are one of the easiest ways to find outliers. We stress outliers because these extreme values can produce skewed distributions (data that is not equally spread), which can heavily impact test results such as hypothesis testing (a statistical test that helps you figure out whether what you think is happening is really true or just a coincidence).
- Box plots enable us to see the spread of the data points.
- They help us see whether the dataset is symmetrical or skewed.
- They help us compare several distributions side by side in a single chart.
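To put a rough number on the "symmetrical or skewed" check, one option is the quartile (Bowley) skewness, which compares how far Q1 and Q3 sit from the median. This measure is not part of the blog's own method and is shown only as an illustrative sketch; the function name and the `right_leaning` values are made up for the example.

```python
from statistics import quantiles


def quartile_skewness(values):
    """Quartile (Bowley) skewness: (Q3 + Q1 - 2*median) / (Q3 - Q1)."""
    # Assumption: "inclusive" quartiles. Undefined when Q3 == Q1 (IQR of 0).
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (q3 + q1 - 2 * q2) / (q3 - q1)


symmetric = list(range(11))                  # 0 to 10, evenly spread
right_leaning = [1, 2, 2, 3, 3, 4, 10, 20]   # made-up values with a long right tail

print(quartile_skewness(symmetric))          # 0.0, the median sits mid-box
print(quartile_skewness(right_leaning) > 0)  # True, the median sits nearer Q1
```

A value near 0 suggests a balanced box, a positive value suggests the data leans toward higher values, and a negative value suggests it leans toward lower values.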
Example Of A Box Plot
Imagine we have the ages of the employees of Company XYZ. A box plot shows the whole range of ages, which helps us understand the age distribution and catch any employees with unusually high or low ages compared to the others. In this case, an outlier would be an employee whose age is very far above or below the ages of the rest.
Let’s start with the ages we have here: 45, 27, 24, 26, 35, 38, 61, 65, 70, 55, 59, 66, 29, 52, 21. The box plot will visualize the distribution of these ages and show whether any employee’s age is strikingly high or low compared with the rest of the group.
Step 1: Gather the data. Our given data is 45, 27, 24, 26, 35, 38, 61, 65, 70, 55, 59, 66, 29, 52, 21
Step 2: Arrange the data in ascending order. The arranged data is 21, 24, 26, 27, 29, 35, 38, 45, 52, 55, 59, 61, 65, 66, 70
Step 3: Find the median. To find the median we divide the data into 2 equal halves. There are 15 values, so the 8th value, 45, is our median.
Step 4: Determine Q1 and Q3. To calculate Q1 and Q3, we split the left and right sides of the median into two equal parts once again. Here we count the median as part of each side, so each side has 8 values. On the left-hand side (21 to 45), 27 and 29 are in the middle. On the right-hand side (45 to 70), 59 and 61 are in the middle.
Since each side has two middle values, we take Q1 and Q3 as their averages: Q1 = (27 + 29) / 2 = 28 and Q3 = (59 + 61) / 2 = 60
From this, we can note that Q1 is 28, while Q3 is 60
Step 5: Calculate the IQR. Using the formula from earlier: IQR = Q3 - Q1 = 60 - 28 = 32
Step 6: Identify the Minimum and Maximum. Using the formula from earlier: Minimum = Q1 - 1.5 × IQR = 28 - 48 = -20 and Maximum = Q3 + 1.5 × IQR = 60 + 48 = 108
Step 7: Check for outliers. Here the minimum is -20 and the maximum is 108. None of the employees’ ages is above 108, and an age below -20 is impossible anyway. Hence our data has no outliers.
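The seven steps above can be sketched in Python as follows. This is a minimal illustration rather than a general-purpose routine: the variable names are made up, and the way the two halves are sliced assumes an odd number of values with the median counted in both halves, as in the manual example.

```python
from statistics import median

# Step 1: gather the data
ages = [45, 27, 24, 26, 35, 38, 61, 65, 70, 55, 59, 66, 29, 52, 21]

# Step 2: arrange in ascending order
ordered = sorted(ages)

# Step 3: find the median
q2 = median(ordered)                  # 45

# Step 4: split into halves that both include the median
# (this slicing assumes an odd number of values)
mid = len(ordered) // 2
q1 = median(ordered[: mid + 1])       # (27 + 29) / 2 = 28.0
q3 = median(ordered[mid:])            # (59 + 61) / 2 = 60.0

# Step 5: interquartile range
iqr = q3 - q1                         # 32.0

# Step 6: minimum and maximum limits
minimum = q1 - 1.5 * iqr              # -20.0
maximum = q3 + 1.5 * iqr              # 108.0

# Step 7: anything outside the limits is an outlier
outliers = [age for age in ordered if age < minimum or age > maximum]
print(q1, q2, q3, iqr, minimum, maximum, outliers)
```

The `outliers` list comes back empty, which agrees with the manual result.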
Conclusion
In this blog, we hope you have come to understand what a box plot is, why it matters, what its components are, and how to work one out by hand. We also discussed how you can use formulas to point out an outlier in the data. Hopefully box plots no longer seem as complicated as they did before. If you want to learn more about these techniques, please check out our Data Science Course.
FAQs
Do we have an alternative plot for a box plot?
Yes, a violin plot can be considered an alternative to the box plot, as it gives us similar insights and also shows the shape of the distribution.
Are the mean and median always the same in a box plot?
No. The mean is the average and the median is the midpoint of the data, so they are not always the same. It always depends on the data.
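As a quick check, here is a small sketch using the employee ages from the example above and Python's `statistics` module:

```python
from statistics import mean, median

ages = [45, 27, 24, 26, 35, 38, 61, 65, 70, 55, 59, 66, 29, 52, 21]

average = mean(ages)      # sum of the ages (673) divided by 15
midpoint = median(ages)   # the middle value once the ages are sorted

print(round(average, 2))  # 44.87
print(midpoint)           # 45
```

The two values are close here but not equal, and the gap tends to grow as the data becomes more skewed.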
What is 1.5 in the minimum and maximum formula? Can it be any number?
1.5 is a constant that sets how far from the box a value can be before it counts as an outlier: any data more than 1.5 times the IQR below Q1 or above Q3 is considered an outlier. We can use another number, but 1.5 is the widely used convention.
What are the upper fence and lower fence in a box plot?
Upper fence, lower fence, upper bound, and lower bound are just other names for the maximum and minimum in the box plot as used in this blog. Sometimes they are also loosely called the upper whisker and lower whisker, although strictly the whiskers are the lines drawn from the box out toward these limits.
What is the middle line in the box plot?
The middle line represents the median of the data.