
I have a DataFrame with a column containing labels for each row (in addition to some relevant data for each row). I have a dictionary with keys equal to the possible labels and values equal to 2-tuples of information related to that label. I'd like to tack two new columns onto my frame, one for each part of the 2-tuple corresponding to the label for each row.

Here is the setup:

import pandas as pd
import numpy as np

np.random.seed(1)
n = 10
labels = list('abcdef')
colors = ['red', 'green', 'blue']
sizes = ['small', 'medium', 'large']
labeldict = {c: (np.random.choice(colors), np.random.choice(sizes)) for c in labels}
df = pd.DataFrame({'label': np.random.choice(labels, n),
                   'somedata': np.random.randn(n)})

I can get what I want by running:

df['color'], df['size'] = zip(*df['label'].map(labeldict))
print(df)

  label  somedata  color    size
0     b  0.196643    red  medium
1     c -1.545214  green   small
2     a -0.088104  green   small
3     c  0.852239  green   small
4     b  0.677234    red  medium
5     c -0.106878  green   small
6     a  0.725274  green   small
7     d  0.934889    red  medium
8     a  1.118297  green   small
9     c  0.055613  green   small

But how can I do this without manually typing out the two columns on the left side of the assignment? That is, how can I create multiple new columns on the fly? For example, if labeldict held 10-tuples instead of 2-tuples, this would be a real pain as currently written. Here are a couple of things that don't work:

# set up attrlist for later use
attrlist = ['color', 'size']

# non-working idea 1)
df[attrlist] = zip(*df['label'].map(labeldict))

# non-working idea 2)
df.loc[:, attrlist] = zip(*df['label'].map(labeldict))

This does work, but seems like a hack:

for a in attrlist:
    df[a] = 0
df[attrlist] = zip(*df['label'].map(labeldict))

Better solutions?

1 Answer


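You can use merge instead: turn labeldict into its own DataFrame indexed by label, then merge it with df on the label column. Note that merge returns a new DataFrame rather than modifying df in place: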

>>> ld = pd.DataFrame(labeldict).T
>>> ld.columns = ['color', 'size']
>>> ld.index.name = 'label'
>>> df.merge(ld.reset_index(), on='label')

  label  somedata  color    size
0     b  1.462108    red  medium
1     c -2.060141  green   small
2     c  1.133769  green   small
3     c  0.042214  green   small
4     e -0.322417    red  medium
5     e -1.099891    red  medium
6     e -0.877858    red  medium
7     e  0.582815    red  medium
8     f -0.384054    red   large
9     d -0.172428    red  medium
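
If you would rather keep the rows in their original order and drive everything from attrlist, here is a minimal sketch of two related variants. It reuses the setup from the question; the names attrs, lookup, result and result2 are made up for this illustration and are not part of the original answer. The first variant expands the mapped tuples into a small frame and joins it on the index; the second builds the lookup table with DataFrame.from_dict and joins it on the label column. Both should scale to longer tuples as long as attrlist has one name per tuple position.

```python
import numpy as np
import pandas as pd

# Same setup as in the question
np.random.seed(1)
n = 10
labels = list('abcdef')
colors = ['red', 'green', 'blue']
sizes = ['small', 'medium', 'large']
labeldict = {c: (np.random.choice(colors), np.random.choice(sizes)) for c in labels}
df = pd.DataFrame({'label': np.random.choice(labels, n),
                   'somedata': np.random.randn(n)})

attrlist = ['color', 'size']  # one column name per tuple position

# Variant 1: map each label to its tuple, then expand the tuples into
# a frame that shares df's index (attrs is a hypothetical helper name).
attrs = pd.DataFrame(df['label'].map(labeldict).tolist(),
                     columns=attrlist, index=df.index)
result = df.join(attrs)  # index-aligned join keeps df's row order

# Variant 2: build a lookup table keyed by label, then join it
# against the 'label' column of df (lookup is a hypothetical name).
lookup = pd.DataFrame.from_dict(labeldict, orient='index', columns=attrlist)
result2 = df.join(lookup, on='label')

print(result)
```

Neither variant modifies df; assign the result back (df = df.join(attrs)) if you want the new columns on the original frame.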
