Intellipaat Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Data Science by (17.6k points)

I have Dataset which has many columns and I want create a new column based on two columns on that dataset.

   train_data[['CtpJobId', 'SegmentId']]

     CtpJobId     SegmentId

0   qa1-9epx-dk1    347772

1   qa1-9epx-dv1    347774

2   qa1-9epx-dv1    347777

3   qa1-9epx-dv1    347780

4   qa1-9epx-dv1    347783

5   qa1-9epx-dv1    347786

6   qa1-9epx-dv1    347789

7   qa1-9epx-dv1    347792

8   qa1-9epx-e01    347794

9   qa1-9epx-eb2    347795

10  qa1-9epx-ez1    347796

11  qa1-9epx-f32    347797

12  qa1-9epx-fi1    347798

Now I want create a new column called numberOfSegment such as if same jobId has multiple segmentId then aggregate that segmentId and insert that sum up to that new column.

      CtpJobId        SegmentId    numberOfSegment

0   qa1-9epx-dk1    347772             1

1   qa1-9epx-dv1    347774             7

2   qa1-9epx-dv1    347777             7

3   qa1-9epx-dv1    347780             7

4   qa1-9epx-dv1    347783             7

5   qa1-9epx-dv1    347786             7

6   qa1-9epx-dv1    347789             7

7   qa1-9epx-dv1    347792             7

8   qa1-9epx-e01    347794             1

9   qa1-9epx-eb2    347795             1

10  qa1-9epx-ez1    347796             1

11  qa1-9epx-f32    347797             1

I did in one way but it gives wrong

train_data['NumberOfSegment'] = train_data.groupby('CtpJobId')['SegmentId'].sum()

 train_data[['CtpJobId','NumberOfSegment']]

   CtpJobId    NumberOfSegment

0   qa1-9epx-dk1    NaN

1   qa1-9epx-dv1    NaN

2   qa1-9epx-dv1    NaN

3   qa1-9epx-dv1    NaN

4   qa1-9epx-dv1    NaN

5   qa1-9epx-dv1    NaN

6   qa1-9epx-dv1    NaN

7   qa1-9epx-dv1    NaN

8   qa1-9epx-e01    NaN

9   qa1-9epx-eb2    NaN

10  qa1-9epx-ez1    NaN

Can Anyone please help me? Thanks in advance

1 Answer

0 votes
by (41.4k points)

Here, you need to use transform with 'count' as depicted in the code below:

train_data['NumberOfSegment'] = train_data.groupby('CtpJobId')['SegmentId'].transform('count')

And then the output will be:

 CtpJobId         SegmentId NumberOfSegment

0   qa1-9epx-dk1     347772     1

1   qa1-9epx-dv1     347774     7

2   qa1-9epx-dv1     347777     7

3   qa1-9epx-dv1     347780     7

4   qa1-9epx-dv1     347783     7

5   qa1-9epx-dv1     347786     7

6   qa1-9epx-dv1      347789    7

7   qa1-9epx-dv1      347792    7

8   qa1-9epx-e01      347794    1

9   qa1-9epx-eb2      347795    1

10  qa1-9epx-ez1     347796     1

11  qa1-9epx-f32      347797    1

12  qa1-9epx-fi1       347798 1

If you want to learn Pandas then visit this Python Pandas Tutorial.

...