Back

Explore Courses Blog Tutorials Interview Questions
0 votes
2 views
in Data Science by (50.2k points)

I am working with CSV files where several of the columns have a simple json object (several key value pairs) while other columns are normal. Here is an example:

name,dob,stats

john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"

dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"

bob roberts,3/3/1982,"{""eye_color"": ""green"", ""height"": 180, ""weight"": 94}"

After using df = pandas.read_csv('file.csv'), what's the most efficient way to parse and split the stats column into additional columns?

After about an hour, the only thing I could come up with was:

import json

stdf = df['stats'].apply(json.loads)

stlst = list(stdf)

stjson = json.dumps(stlst)

df.join(pandas.read_json(stjson))

This seems like I'm doing it wrong, and it's quite a bit of work considering I'll need to do this on three columns regularly.

*Edit: desired output is the dataframe object below. added following lines of code to get there in my (crappy) way:

image

1 Answer

0 votes
by (108k points)

Applying the json.load is a great idea, but from there you can just simply directly convert it to dataframe columns instead of writing/loading it again:

stdf = df['stats'].apply(json.loads)

pd.DataFrame(stdf.tolist()) # or stdf.apply(pd.Series)

or you can also do this in one step:

df.join(df['stats'].apply(json.loads).apply(pd.Series))

If you wish to learn more about Pandas visit this Pandas Tutorial.

Browse Categories

...