Detect and exclude outliers in Pandas data frame

Question

1 Answer

vinita · Answer 1 · 2019-09-10T09:55:30+0000

If you want to remove every row that has an outlier in at least one column, you can use the following code:

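import numpy as np
import pandas as pd
from scipy import stats
df = pd.DataFrame(np.random.randn(100, 3))
# Keep only the rows where every column's |Z-score| is below 3
df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]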

The above code means that:

First, for each column, it computes the Z-score of every value relative to that column's mean and standard deviation.
Then it takes the absolute value of the Z-score, because the direction of the deviation does not matter, only whether its magnitude is below the threshold (3 here).
all(axis=1) ensures that for each row, all columns satisfy the constraint.
Finally, the resulting boolean mask is used to index the DataFrame, so only the rows that pass are kept.
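
To make the steps above concrete, here is a minimal sketch that splits the one-liner into its parts. The data is made up for illustration (two columns counting 0..19, with one planted outlier in column "b"), and the names z, within, mask and filtered are just illustrative. The Z-scores are computed with plain pandas instead of scipy.stats.zscore; ddof=0 is used so that the result matches scipy's default.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two columns of 0..19, with one planted outlier in "b".
b = np.arange(20.0)
b[-1] = 1000.0
df = pd.DataFrame({"a": np.arange(20.0), "b": b})

# Step 1: per-column Z-scores (population std, ddof=0, like scipy.stats.zscore).
z = (df - df.mean()) / df.std(ddof=0)

# Step 2: absolute value, compared against the threshold of 3.
within = z.abs() < 3

# Step 3: a row passes only if every column is within the threshold.
mask = within.all(axis=1)

# Step 4: boolean indexing keeps the passing rows.
filtered = df[mask]

print(len(df), len(filtered))  # 20 19
```

Only the last row (index 19) is dropped, because its value in "b" has a Z-score of about 4.4. Note that in a very small sample no value can reach a Z-score of 3: with n rows the largest possible value is (n - 1) / sqrt(n), so the threshold of 3 only starts to filter anything from about 11 rows upward.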