I am new to data science and am currently practicing to improve my skills. I used a data set from Kaggle, planned how to present the data, and came across a problem.

What I am trying to achieve is to insert data into different data frames using a for loop. I have seen an example of this and used a dictionary to save the data frames, but the data in each data frame is overwritten.

I have a list of data frames:

continents_list = [african_countries, asian_countries, european_countries, north_american_countries,
                   south_american_countries, oceanian_countries]

This is an example of my data frame from one of the continents:

     Continent  Country Name      Country Code  2010  2011  2012  2013  2014
7    Oceania    Australia         AUS           11.4  11.4  11.7  12.2  13.1
63   Oceania    Fiji              FJI           20.1  20.1  20.2  19.6  18.6
149  Oceania    New Zealand       NZL           17.0  17.2  17.7  15.8  14.6
157  Oceania    Papua New Guinea  PNG            5.4   5.3   5.4   5.5   5.4
174  Oceania    Solomon Islands   SLB            9.1   8.9   9.3   9.4   9.5

I first selected the whole row for the country with the highest rate in a given year:

def select_highest_rate(continent, year):
    highest_rate_idx = continent[year].idxmax()
    return continent.loc[highest_rate_idx]
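For illustration, here is a minimal sketch of what this function returns when run on the Oceania sample above. The data frame is rebuilt by hand from the table, which is an assumption about how the original was loaded. Note that the result is a single row returned as a Series, not a data frame.

```python
import pandas as pd

# Oceania rows copied from the sample table above (rebuilt by hand for this sketch)
oceanian_countries = pd.DataFrame(
    {
        "Continent": ["Oceania"] * 5,
        "Country Name": ["Australia", "Fiji", "New Zealand",
                         "Papua New Guinea", "Solomon Islands"],
        "Country Code": ["AUS", "FJI", "NZL", "PNG", "SLB"],
        "2010": [11.4, 20.1, 17.0, 5.4, 9.1],
        "2011": [11.4, 20.1, 17.2, 5.3, 8.9],
        "2012": [11.7, 20.2, 17.7, 5.4, 9.3],
        "2013": [12.2, 19.6, 15.8, 5.5, 9.4],
        "2014": [13.1, 18.6, 14.6, 5.4, 9.5],
    },
    index=[7, 63, 149, 157, 174],
)


def select_highest_rate(continent, year):
    # idxmax gives the index label of the largest value in the year column
    highest_rate_idx = continent[year].idxmax()
    # .loc with a single label returns that row as a Series
    return continent.loc[highest_rate_idx]


top_2010 = select_highest_rate(oceanian_countries, "2010")
print(type(top_2010).__name__)   # Series
print(top_2010["Country Name"])  # Fiji
```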

Then I created a for loop that builds a separate data frame for each year. Each one should contain every continent with its corresponding country and rate for that year:

def show_highest_countries(continents_list):
    df_highest_countries = {}
    years_list = ['2010','2011','2012','2013','2014']
    for continent in continents_list:
        for year in years_list:
            highest_country = select_highest_rate(continent, year)
            highest_countries = highest_country[['Continent','Country Name',year]]
            df_highest_countries[year] = pd.DataFrame(highest_countries)
    return df_highest_countries

Here is what it returns: a separate data frame for each year, but each one contains only the last continent.

Question: How do I save all the data (all continents) in the same data frame? Is it not possible with dictionaries?

1 Answer

Here, you are overwriting the dictionary entry for each year on every pass through the outer loop, so only the last continent's dataframe remains for the years 2010-2014:

df_highest_countries[year] = pd.DataFrame(highest_countries)

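Instead, you can make the dictionary key unique by including the continent name along with the year, and then concatenate everything into one final dataframe. Note that continent in your loop is a dataframe, so the name has to come from the selected row. Transposing the selected row also gives a one-row dataframe that concatenates cleanly:

df_highest_countries[highest_country['Continent'] + str(year)] = highest_countries.to_frame().T

finaldf = pd.concat(df_highest_countries, join='outer').reset_index(drop=True)

Because each piece carries only its own year column, the outer join leaves NaN in the other year columns.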
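If you would rather keep one dataframe per year, a dictionary still works as long as you append to each year's entry instead of replacing it. The following is only a sketch of one possible fix: the years_list parameter, the shortened two-year sample, and the second frame with "Continent X", "Country A" and "Country B" are made up for illustration and are not part of your data set.

```python
import pandas as pd

# Two Oceania rows from the question's sample, limited to two years for brevity
oceanian_countries = pd.DataFrame({
    "Continent": ["Oceania", "Oceania"],
    "Country Name": ["Australia", "Fiji"],
    "Country Code": ["AUS", "FJI"],
    "2010": [11.4, 20.1],
    "2011": [11.4, 20.1],
})

# HYPOTHETICAL second continent: names and values are invented for this sketch
other_countries = pd.DataFrame({
    "Continent": ["Continent X", "Continent X"],
    "Country Name": ["Country A", "Country B"],
    "Country Code": ["AAA", "BBB"],
    "2010": [3.0, 7.0],
    "2011": [9.0, 4.0],
})


def show_highest_countries(continents_list, years_list):
    # One list per year; rows are appended, so nothing gets overwritten
    rows_by_year = {year: [] for year in years_list}
    for continent in continents_list:
        for year in years_list:
            top = continent.loc[continent[year].idxmax()]
            rows_by_year[year].append(top[["Continent", "Country Name", year]])
    # A list of Series becomes a dataframe with one row per continent
    return {year: pd.DataFrame(rows).reset_index(drop=True)
            for year, rows in rows_by_year.items()}


result = show_highest_countries([oceanian_countries, other_countries],
                                ["2010", "2011"])
print(result["2010"])
```

Each value in result is then a dataframe for one year with one row per continent.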

Alternatively, you can avoid the nested for loops altogether by concatenating all the continents at the beginning.

After that, melt the data into long format for a groupby aggregation, and keep only the records with the maximum value for each year and continent.

Then, if needed, you can pivot back to year columns with pivot_table:

df = pd.concat(continents_list)

# melt for year values in columns
df = pd.melt(df, id_vars=['Continent', 'Country Name', 'Country Code'], var_name='Year')

# aggregate highest value and merge back to original set
df = (df.groupby(['Continent', 'Year'])['value'].max().reset_index()
        .merge(df, on=['Continent', 'Year', 'value']))

# pivot back to years columns
pvt = df.pivot_table(index=['Continent', 'Country Name', 'Country Code'],
                     columns='Year', values='value').reset_index()
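To see the melt and groupby steps on something concrete, here is a small self-contained sketch. It uses two Oceania rows from the question and a second frame whose names and values ("Continent X", "Country A", "Country B") are invented purely for illustration. The intermediate variable names are also my own.

```python
import pandas as pd

# Two Oceania rows from the question's sample, limited to two years for brevity
oceanian_countries = pd.DataFrame({
    "Continent": ["Oceania", "Oceania"],
    "Country Name": ["Australia", "Fiji"],
    "Country Code": ["AUS", "FJI"],
    "2010": [11.4, 20.1],
    "2011": [11.4, 20.1],
})

# HYPOTHETICAL second continent: names and values are invented for this sketch
other_countries = pd.DataFrame({
    "Continent": ["Continent X", "Continent X"],
    "Country Name": ["Country A", "Country B"],
    "Country Code": ["AAA", "BBB"],
    "2010": [3.0, 7.0],
    "2011": [9.0, 4.0],
})

df = pd.concat([oceanian_countries, other_countries], ignore_index=True)

# Long format: one row per (country, year) with the rate in the 'value' column
long_df = df.melt(id_vars=["Continent", "Country Name", "Country Code"],
                  var_name="Year")

# Highest rate per continent and year
max_per_group = (long_df.groupby(["Continent", "Year"])["value"]
                 .max().reset_index())

# Merge back to recover which country holds each maximum
highest = max_per_group.merge(long_df, on=["Continent", "Year", "value"])
print(highest[["Continent", "Year", "Country Name", "value"]])

# Optional: back to one column per year
pvt = highest.pivot_table(index=["Continent", "Country Name", "Country Code"],
                          columns="Year", values="value").reset_index()
```

One thing to be aware of: if two countries in a continent tie for the highest rate in a year, the merge keeps both rows.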
