0 votes
1 view
in Data Science by (48.7k points)

I am working with CSV files where several of the columns have a simple json object (several key value pairs) while other columns are normal. Here is an example:

name,dob,stats

john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"

dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"

bob roberts,3/3/1982,"{""eye_color"": ""green"", ""height"": 180, ""weight"": 94}"

After using df = pandas.read_csv('file.csv'), what's the most efficient way to parse and split the stats column into additional columns?

After about an hour, the only thing I could come up with was:

import json

stdf = df['stats'].apply(json.loads)

stlst = list(stdf)

stjson = json.dumps(stlst)

df.join(pandas.read_json(stjson))

This seems like I'm doing it wrong, and it's quite a bit of work considering I'll need to do this on three columns regularly.

*Edit: desired output is the dataframe object below. added following lines of code to get there in my (crappy) way:

image

1 Answer

0 votes
by (105k points)

Applying the json.load is a great idea, but from there you can just simply directly convert it to dataframe columns instead of writing/loading it again:

stdf = df['stats'].apply(json.loads)

pd.DataFrame(stdf.tolist()) # or stdf.apply(pd.Series)

or you can also do this in one step:

df.join(df['stats'].apply(json.loads).apply(pd.Series))

If you wish to learn more about Pandas visit this Pandas Tutorial.

Welcome to Intellipaat Community. Get your technical queries answered by top developers !


Categories

...