
I have a pandas dataframe with a few columns.

Now I know that certain rows are outliers based on a certain column value.

For instance, the column 'Vol' has values that are all around 12xx, except for one value of 4000 (an outlier).

I would like to exclude the rows where 'Vol' takes a value like this. Essentially, I need to filter the data frame so that it keeps only the rows where the value of a certain column is within, say, 3 standard deviations of the mean.

What is an elegant way to achieve this?

1 Answer


If you want to remove all rows that have an outlier in at least one column, refer to the following code:

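import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.randn(100, 3))

# keep only the rows where every column is within 3 standard deviations of its mean
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]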

The code above works as follows:

  • For each column, it first computes the Z-score of each value in the column, relative to the column mean and standard deviation.

  • Then it takes the absolute value of the Z-score, because the direction does not matter, only whether the value is below the threshold.

  • all(axis=1) ensures that for each row, all columns satisfy the constraint.

  • Finally, the result of this condition is used to index the dataframe.
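The question itself asks about filtering on a single column rather than on all of them. A minimal sketch of that case follows, using pandas only. The data and the 'Price' column are made up for illustration. Note that pandas' std() uses the sample standard deviation (ddof=1), while scipy's stats.zscore defaults to ddof=0, so the two can differ slightly on small data sets.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 'Vol' values around 12xx plus one outlier of 4000.
vol = [1200 + 5 * i for i in range(20)] + [4000]
df = pd.DataFrame({"Vol": vol, "Price": np.arange(21)})

# Z-score of the 'Vol' column only (pandas std() uses ddof=1 by default).
z = (df["Vol"] - df["Vol"].mean()) / df["Vol"].std()

# Keep the rows whose 'Vol' is within 3 standard deviations of the mean.
filtered = df[z.abs() < 3]
```

Keep in mind that the outlier itself inflates the mean and the standard deviation, so with only a handful of rows a single extreme value may never reach a Z-score of 3 and would not be filtered out.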
