Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Data Science by (17.6k points)

I'm wanting to take the following style of dataframe into a dictionary.

Input:

>>>import pandas as pd

>>>df = pd.read_csv('file.csv')

>>>print(df)

   Market  Rep  Name  Date  Amount

0  A1      B1   C1    D1    1

1  A1      B1   C1    D1    2 

2  A1      B1   C1    D2    3

3  A1      B1   C1    D2    4

4  A1      B1   C2    D1    5

5  A1      B1   C2    D1    6

6  A1      B1   C2    D2    7

7  A1      B1   C2    D2    8

8  A1      B2   C3    D1    9

9  A1      B2   C3    D1    10

10 A1      B2   C3    D2    11

11 A1      B2   C3    D2    12

12 A2      B3   C4    D1    13

13 A2      B3   C4    D1    14

Desired Output:

>>> print(associated_data)

{'A1': {'B1': {'C1': {'D1':[1 + 2],

                     {'D2':[3 + 4]},

               'C2': {'D1':[5 + 6],

                      'D2':[7 + 8]}}

       {'B2': {'C3': {'D1':[9 + 10],

                      'D2':[11 + 12]}}},

 'A2': {'B3': {'C4': {'D1':[13 + 14]}}}}

This might not be the best way to go about organizing and sorting the data, so I'm open to suggesting.

I've tried a method I was hoping would work by going through a ton of for loops like this:

 

# Main function

for market in df['Market'].unique():

    market_data = self.df.loc[self.df['Market'] == market]

    associated_reps = market_data['Rep'].unique()

    # Repeat

    for rep in associated_reps:

        rep_data = market_data.loc[market_data['Rep'] == rep]

        associated_names = rep_data['Name'].unique()

        # Repeat

        for name in associated_names:

            name_data = rep_data.loc[rep_data['Name'] == name]

            associated_dates = name_data['Date'].unique()

            # Repeat

            for date in associated_dates:

                date_data = name_data.loc[name_data['Date'] == date]

                associated_amount = sum(date_data['Amount'].tolist())

                # Attempted solution code (total fail)

                breakdown[market][rep][name][date] = associated_amount

This does appropriately separate all of the data, and at the end attempts to put it all together. I was hoping that you could make a super nested dict like this, but it failed completely (as it turns out that's just not how dicts work unfortunately lmao).

How can you produce the desired output to produce the same result (maybe with shorter sorting code as well)?

Thanks!

1 Answer

0 votes
by (41.4k points)

You have to first set the indices for all "categories" in your data.Then, aggregate on the index to remove duplicate indices and after that generate output dict.

import print

import numpy as np

def make_dict(ind_vals, d, v):

  """Accumulate index entries as keys in a dict."""

  p = d

  for ix in ind_vals[:-1]:

    # Replace with collection.OrderedDict if necessary.

    p = p.setdefault(ix, {})

  # Set the actual value of interest.

  p[ind_vals[-1]] = v

# Set indices correctly.

df = df.set_index(['Market', 'Rep', 'Name', 'Date'])

# Group values so we don't have duplicate indices

df = df.groupby(level=df.index.names).apply(np.sum)

dct = {}  # Replace with collection.OrderedDict if necessary.

for idx, val in df.iterrows():

  make_dict(idx, dct, val.Amount)

pprint.pprint(dct)

# {'A1': {'B1': {'C1': {'D1': 3, 'D2': 7}, 'C2': {'D1': 11, 'D2': 15}},

#         'B2': {'C3': {'D1': 19, 'D2': 23}}},

#  'A2': {'B3': {'C4': {'D1': 27}}}}

Related questions

Browse Categories

...