0 votes
1 view
in BI by (15.1k points)

I have a CSV file that I want to read with Pandas library in Python.

In this table when we encounter a new item (e.g. items Nr. 1393 or 1654 in the example below) we first have a 4 column row metadata and after that several 100 column rows as real data associated to that item. Then it happens again for the next item and so on.

The table is like this:

1,1393,0,0

1,1393,1,22,55,63,...

1,1393,5,32,43,163,...

.

.

.

2,1654,0,0

2,1654,8,95,96,142,...

2,1654,21,31,364,9,...

.

.

.

So the problems are:

  1. Some rows have different sizes than others
  2. We do not have headers and can not create it as the first row has 4 entries and second one 100 entries
  3. My CSV file is huge (about 10G)!

Any suggestion which helps me to organize my data in Pandas or any other Python library is highly appreciated.

PS: BTW, anybody knows how to manage it in Tableau? 

1 Answer

0 votes
by (41.2k points)

When data doesn't fit an existing pandas reader, you can create your own generator and populate the dataframe with from_records. Lacking details on how these various items should be related, I wrote an example that just adds the latest metadata to the front of each row.

import pandas as pd

def my_data_generator(fp):

    metadata = []

    for line in fp:

        data = line.strip().split(',')

        if len(data) == 4:

            metadata = data

        elif not metadata:

            raise ValueError("csv file did not start with metadata")

        elif data:

            yield metadata + data

df = pd.DataFrame.from_records(my_data_generator(open('somefile.csv')))

print(df)

Welcome to Intellipaat Community. Get your technical queries answered by top developers !


Categories

...