Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Data Science by (18.4k points)

Below is my dataset and want to drop the column named neighbourhood that has less than 10 phone connections. Each row corresponds to the house in neigbourhood A,B or C. The Phone connection is a binary variable, where 1 means that the house has the connection.

House| Neighbourhood | Phone connection

#1            A                   0

#2            B                   0

#3            B                   1

#4            A                   1

#5            C                   1

#6            C                   0

How can aggregate this data on the neighbourhood level, while filtering only a neighbourhoods with more than 10 connections?

1 Answer

0 votes
by (36.8k points)

Use the GroupBy.transform with sum for new column filled by aggregated values, compare for greater by Series.gt and filer by boolean indexing:

df[df.groupby('Neighbourhood')['Phone connection'].transform('sum').gt(10)]

Sample: with greater like 3:

print (df)

   House Neighbourhood Phone connection

0 #1 A 1

1 #2 B 0

2 #3 B 1

3 #4 A 1

4 #5 C 1

5 #6 C 0

6 #7 A 0

7 #8 B 1

8 #9 B 1

9 #10 A 1

10 #11 A 1

11 #12 B 0

12 #13 B 1

13 #14 A 1

14 #15 C 1

15 #16 C 1

df = df[df.groupby('Neighbourhood')['Phone connection'].transform('sum').gt(3)]

print (df)

   House Neighbourhood Phone connection

0 #1 A 1

1 #2 B 0

2 #3 B 1

3 #4 A 1

6 #7 A 0

7 #8 B 1

8 #9 B 1

9 #10 A 1

10 #11 A 1

11 #12 B 0

12 #13 B 1

13 #14 A 1

If you want to know more about the Data Science then do check out the following Data Science which will help you in understanding Data Science from scratch 

Browse Categories

...