A box plot is a graph that shows how our data is spread out. The spread of data is called the data distribution. A box plot uses just five important numbers to give you a quick look at the data, and it’s good at pointing out any unusual values. Throughout this blog, we’ll keep it simple and talk about what a box plot is, why it’s helpful, and what it can tell you about your data.
What is a Box plot in Data Visualization?
A box plot, or a box-and-whiskers plot, is a way to see how data is spread out, find unusual values, and figure out if the data leans more to one side. Let’s break it down.
- A box plot gives you a quick look at where most of the data lies. It does this using the five-number summary, which contains the minimum, the maximum, the median, and two more values that mark the edges of where most of the data falls. We shall learn about them in detail below.
- Now, about those unusual values, we call them outliers. They’re the really small or really big numbers that stand out in the data. They can also be values that are different from the rest of the data and do not fit.
- And when we talk about the skewness of the data, we’re checking if the information is balanced or if it leans more to one side.
Now that we are clear on the basic terms (outliers, data distribution, and skewness), let’s go ahead and see what a five-number summary is.
Five-Number Summary of a Box Plot
A box plot is built from 5 values that show how the numbers are spread out in the data, called the five-number summary. The five-number summary includes the Minimum, Quartile 1, Median, Quartile 3, and Maximum. Let’s discuss each of them in detail.
- Median: Also referred to as Quartile 2 or Q2, this is the median of the dataset. If we lay our data out on a straight line, the mid-point of the data is marked as the median, or Quartile 2, on the box plot. In our example diagram the numbers run from 0 to 10, so 5 is the median or Quartile 2.
- Quartile 1 and Quartile 3: Now let’s divide each side of the median in half. This gives us one value to the left of the median and one to the right, and these are called quartiles. The value to the right of the median is Quartile 3 (Q3), also referred to as the 75th percentile, which means that 75 percent of the data lies below that point. The value to the left of the median is Quartile 1 (Q1), also referred to as the 25th percentile, which means that 25 percent of the data lies below that point. In our diagram, 2.5 is Q1 and 7.5 is Q3.
- Minimum and Maximum: These are also a very important part of the box plot, because any data point that falls beyond them is treated as unfit data, referred to as an outlier. Note that in a box plot they are not simply the smallest and largest values in the data (which are sometimes called Quartile 0 and Quartile 4) but limits calculated from Q1 and Q3. To find the minimum and maximum, we first find the distance between Q1 and Q3, called the Interquartile Range or IQR. We shall learn more about this in the next section.
- Lower Whisker and Upper Whisker: The lower whisker represents the data that is greater than the Minimum and less than Quartile 1. The upper whisker represents the data that is less than the Maximum and greater than Quartile 3.
- Outliers: These are values that are either greater than the maximum or less than the minimum. Outliers are values that differ from the rest of our data. In the diagram, if we add, let’s say, the number 20, it will be considered an outlier because it is far larger than our normal spread of data. Likewise, the number -10 is not part of our normal spread of data and will also be considered an outlier. We shall go through the calculations and see how to spot an outlier in the later part of this blog.
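As a quick check on the 0 to 10 example, here is a minimal Python sketch that computes the three quartiles. It assumes Python 3.8 or newer and uses the standard library’s `statistics.quantiles` with `method="inclusive"`. That is only one of several quartile conventions, chosen here because it reproduces the values in this blog; other tools may give slightly different quartiles.

```python
import statistics

# The numbers 0 to 10 from the example
data = list(range(11))

# n=4 splits the data into four parts and returns the three cut points.
# method="inclusive" is an assumption: it is the convention that
# reproduces the quartiles quoted in this blog.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

print("Q1:", q1)
print("Median (Q2):", median)
print("Q3:", q3)
```

This should report 2.5, 5 and 7.5, the same Q1, median and Q3 described in the bullets.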
Formulas Used In a Box Plot
- Formula to Find the IQR: The interquartile range (IQR) is the distance between Quartile 1 and Quartile 3, and it covers the middle 50 percent of the data. We get it by subtracting Q1 from Q3:
  IQR = Q3 - Q1
- Formula to Find Minimum and Maximum: Finding the minimum and maximum is important for pinpointing outliers. Any data point that falls beyond these two limits is considered an outlier, that is, any value more than 1.5 × IQR below Q1 or more than 1.5 × IQR above Q3. With 1.5 as the constant, the formulas are:
  Minimum = Q1 - 1.5 × IQR
  Maximum = Q3 + 1.5 × IQR
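The two formulas can be sketched as a small Python helper. The function name `outlier_limits` and the parameter `k` are made up for this illustration, and the Q1 and Q3 values are the ones from the 0 to 10 example earlier in the blog. Treat it as a sketch of the rule, not a reference implementation.

```python
def outlier_limits(q1, q3, k=1.5):
    """Return the (minimum, maximum) limits of a box plot.

    outlier_limits is a hypothetical helper; k is the 1.5 constant.
    """
    iqr = q3 - q1                      # IQR = Q3 - Q1
    return q1 - k * iqr, q3 + k * iqr  # limits on either side of the box

# Q1 = 2.5 and Q3 = 7.5 for the numbers 0 to 10
minimum, maximum = outlier_limits(2.5, 7.5)
print("Minimum:", minimum, "Maximum:", maximum)

# 20 and -10 are the two extra values mentioned in the Outliers bullet
for value in (20, -10, 8):
    is_outlier = value < minimum or value > maximum
    print(value, "is an outlier" if is_outlier else "is not an outlier")
```

With these inputs the limits come out as -5 and 15, so 20 and -10 are flagged while 8 is not.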
Why do we use a Box Plot
- Box plots are one of the easiest ways to find outliers. We stress outliers because these extreme values can skew a distribution (leave the data unequally spread), which can heavily affect the results of statistical tests such as hypothesis testing (a statistical test that helps you figure out whether what you think is happening is really true or just a coincidence).
- Box plots help us to understand how the data points are distributed.
- They help us to understand if our data is symmetrical or has any skewness.
- They help us to compare multiple distributions at a time.
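One rough way to put a number on the symmetry point is to compare how far Q1 and Q3 sit from the median. The sketch below uses the quartile skewness coefficient (often credited to Bowley), which is brought in here purely for illustration and is not something this blog defines. The helper name `quartile_skew` and the `skewed` list are made up for the example, and the quartile convention (`method="inclusive"`) is an assumption.

```python
import statistics

def quartile_skew(data):
    """Quartile (Bowley) skewness: 0 means the median sits mid-box."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (q3 + q1 - 2 * median) / (q3 - q1)

symmetric = list(range(11))            # 0 to 10, evenly spread
skewed = [1, 2, 2, 3, 3, 4, 5, 8, 13]  # made-up data with a long right tail

print("symmetric:", quartile_skew(symmetric))
print("skewed:", quartile_skew(skewed))
```

A result of 0 means the median sits in the middle of the box, while a positive result means the upper part of the box is longer, which hints at a longer tail on the right. Running the same helper on several datasets is also a simple way to compare distributions side by side.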
Example Of A Box Plot
Imagine we’re exploring the ages of employees at Company XYZ. To understand the age distribution and identify any unusual cases, we can use a box plot. In this scenario, outliers would be employees with ages significantly higher or lower than the average age of the workforce.
Now, let’s take a look at the ages we have: 45, 27, 24, 26, 35, 38, 61, 65, 70, 55, 59, 66, 29, 52, 21. The box plot will help us visualize the spread of these ages and spot if there are any employees whose age stands out as exceptionally high or low compared to the rest of the group.
Step 1: Collect the data. Our given data is 45, 27, 24, 26, 35, 38, 61, 65, 70, 55, 59, 66, 29, 52, 21
Step 2: Arrange the data in ascending order. Below is the arranged data:
21, 24, 26, 27, 29, 35, 38, 45, 52, 55, 59, 61, 65, 66, 70
Step 3: Find the median. The median divides the sorted data into 2 equal halves.
We have 15 values, so the median is the 8th value, which is 45.
Step 4: Find Q1 and Q3. To find Q1 and Q3, we divide the data to the left and right of the median into equal halves again, counting the median itself in both halves. The left half is then 21, 24, 26, 27, 29, 35, 38, 45, where 27 and 29 are in the middle. The right half is 45, 52, 55, 59, 61, 65, 66, 70, where 59 and 61 are in the middle.
Since each half has two middle values, we take their average: Q1 = (27 + 29) / 2 and Q3 = (59 + 61) / 2
So our Q1 is 28 and our Q3 is 60.
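Steps 2 to 4 can be reproduced by hand in Python. This is a minimal sketch, assuming the median is counted in both halves when splitting the data (the convention that yields Q1 = 28 and Q3 = 60). The helper name `middle` is made up for this example.

```python
ages = [45, 27, 24, 26, 35, 38, 61, 65, 70, 55, 59, 66, 29, 52, 21]
ordered = sorted(ages)  # Step 2: ascending order

def middle(values):
    """Middle of an already sorted list (average of two if even length)."""
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2

median = middle(ordered)  # Step 3

# Step 4. Assumption: the median is included in both halves.
# This split is written for an odd number of values, as in our data.
half = len(ordered) // 2
lower = ordered[: half + 1]  # 21 ... 45
upper = ordered[half:]       # 45 ... 70

q1 = middle(lower)
q3 = middle(upper)
print("Median:", median, "Q1:", q1, "Q3:", q3)
```

Other quartile conventions leave the median out of the halves, which would give slightly different Q1 and Q3 for the same ages.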
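Step 5: Find the IQR. Using the formula IQR = Q3 - Q1, we get IQR = 60 - 28 = 32
Step 6: Find the minimum and maximum. Using the formulas Minimum = Q1 - 1.5 × IQR and Maximum = Q3 + 1.5 × IQR, we get Minimum = 28 - 1.5 × 32 = -20 and Maximum = 60 + 1.5 × 32 = 108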
Step 7: Find the outliers.
Here the minimum is -20 and the maximum is 108. No employee is older than 108, and no age can be below -20, so we can conclude that our data does not have any outliers.
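The whole example can be double-checked with a few lines of Python. This is a sketch under one assumption: the quartiles come from the standard library’s `statistics.quantiles` (Python 3.8+) with `method="inclusive"`, which matches the Q1 and Q3 found by hand in this example.

```python
import statistics

ages = [45, 27, 24, 26, 35, 38, 61, 65, 70, 55, 59, 66, 29, 52, 21]

# Steps 2 to 4: sort the data, then get Q1, the median and Q3
ordered = sorted(ages)
q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")

# Step 5: interquartile range
iqr = q3 - q1

# Step 6: minimum and maximum limits, with 1.5 as the constant
minimum = q1 - 1.5 * iqr
maximum = q3 + 1.5 * iqr

# Step 7: anything outside the limits is an outlier
outliers = [age for age in ordered if age < minimum or age > maximum]

print("Q1:", q1, "Median:", median, "Q3:", q3)
print("IQR:", iqr)
print("Minimum:", minimum, "Maximum:", maximum)
print("Outliers:", outliers)
```

The outlier list comes back empty, which agrees with the conclusion of the manual calculation.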
In this blog, we hope you have gained a good understanding of what a box plot is, why it is important, and what its components are, along with a manual example. We also saw how to use formulas to spot outliers in the data. We hope box plots don’t seem as complicated as they used to. If you have any doubts about this article, leave a comment below, and if you liked it, do explore our courses at Intellipaat.
Do we have an alternative plot for a box plot?
Yes, a violin plot can be considered an alternative to the box plot. It gives similar insights and also shows the shape of the distribution.
Are mean and median always the same in the box plot?
Mean and median are not always the same, as the mean is the average and the median is the midpoint of the data. Whether they match always depends on the data.
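For instance, here is a small Python check on the employee ages from the example in this blog, using the standard library’s `statistics` module. The variable names are only illustrative.

```python
import statistics

ages = [45, 27, 24, 26, 35, 38, 61, 65, 70, 55, 59, 66, 29, 52, 21]

mean_age = statistics.mean(ages)      # the average of all ages
median_age = statistics.median(ages)  # the middle value once sorted

print("Mean:", round(mean_age, 2))
print("Median:", median_age)
```

For this data the mean is slightly below the median of 45, so the two are close but not equal.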
What is 1.5 in the minimum and maximum formula? Can it be any number?
1.5 is a constant: any data point that lies more than 1.5 times the IQR below Q1 or above Q3 is considered an outlier. We can use another number, but 1.5 is the widely accepted convention.
What are the upper fence and lower fence in a box plot?
Upper fence and lower fence, or upper bound and lower bound, are just other names given to the maximum and minimum in the box plot. They are sometimes loosely referred to as the upper whisker and lower whisker as well, since the whiskers extend toward these limits.
What is the middle line in the box plot?
The middle line represents the median of the data.