Create a pandas DataFrame from generator?

Question

asked Oct 5, 2019 in Data Science by sourav (17.6k points)

I've created a tuple generator that extracts information from a file, keeping only the records of interest and converting each one to a tuple that the generator yields.

I tried to create a DataFrame from it:

import pandas as pd
df = pd.DataFrame.from_records(tuple_generator, columns = tuple_fields_name_list)

but it throws an error:

...
C:\Anaconda\envs\py33\lib\site-packages\pandas\core\frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
   1046                 values.append(row)
   1047                 i += 1
-> 1048                 if i >= nrows:
   1049                     break
   1050

TypeError: unorderable types: int() >= NoneType()

I got it to work by consuming the generator into a list, but that uses twice the memory:

df = pd.DataFrame.from_records(list(tuple_generator), columns = tuple_fields_name_list)

The files I want to load are big, and memory consumption matters. On my last attempt, my computer spent two hours trying to grow virtual memory :(

The question: does anyone know a way to create a DataFrame from a record generator directly, without first converting it to a list?

Note: I'm using Python 3.3 and pandas 0.12 with Anaconda on Windows.

Update:

Reading the file is not the problem; my tuple generator does that well. It scans a compressed text file of intermixed records line by line, converts only the wanted data to the correct types, and yields the fields as tuples. Some numbers: it scans 2111412 records in a 130MB gzip file, about 6.5GB uncompressed, in about a minute and with little memory used.

Pandas 0.12 does not accept generators. The dev version does, but it puts the whole generator into a list and then converts that list to a frame. It's not efficient, but it's something pandas has to deal with internally. Meanwhile, I have to think about buying some more memory.

1 Answer

Shlok Pandey · Answer 1 · 2019-10-05T08:50:42+0000

You cannot create a DataFrame from a generator with version 0.12 of pandas. You can update to the development version (get it from GitHub and compile it, which is a little painful on Windows, but I would prefer this option).

Or, since you said you are filtering the lines, you can filter them first, write them to a file, and then load them using read_csv or something similar.
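The write-then-load route could look roughly like the sketch below. The helper name records_to_frame and its arguments are made up for illustration; it only assumes the generator yields plain tuples whose fields survive a round trip through CSV text. Note that read_csv infers the column types again from the text, so types produced by the generator are not carried over automatically.

```python
import csv
import os
import tempfile

import pandas as pd


def records_to_frame(records, columns):
    """Stream records to a temporary CSV file, then parse it with read_csv.

    `records` is any iterable of tuples (for example a generator) and
    `columns` is the list of field names. Both names are illustrative.
    """
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", newline="", delete=False
    ) as tmp:
        writer = csv.writer(tmp)
        writer.writerow(columns)
        # writerows pulls rows from the generator one at a time,
        # so the full set of tuples is never held in memory.
        writer.writerows(records)
        path = tmp.name
    try:
        return pd.read_csv(path)
    finally:
        # Remove the temporary file whether or not parsing succeeded.
        os.remove(path)
```

Usage would be something like df = records_to_frame(tuple_generator, tuple_fields_name_list). The cost is extra disk I/O and a second parse of the data.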

If you want to get really complicated, you can create a file-like object that returns the lines:

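def gen():
    # Stand-in for the real generator: yields CSV text one line at a time.
    lines = [
        'col1,col2\n',
        'foo,bar\n',
        'foo,baz\n',
        'bar,baz\n'
    ]
    for line in lines:
        yield line

class Reader(object):
    """Minimal file-like wrapper around a generator of lines."""
    def __init__(self, g):
        self.g = g

    def read(self, n=0):
        # Ignore the requested size and hand back the next line;
        # an empty string signals end of file.
        try:
            return next(self.g)
        except StopIteration:
            return ''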

And then use read_csv:

>>> pd.read_csv(Reader(gen()))
  col1 col2
0  foo  bar
1  foo  baz
2  bar  baz
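Another option, not part of the answer above and offered only as a sketch, is to consume the generator in fixed-size chunks, so that only one chunk of Python tuples exists at a time. The function name frame_from_chunks and the chunk_size default are assumptions for illustration. It relies on from_records accepting a list of tuples, which is the form the question already reports as working.

```python
from itertools import islice

import pandas as pd


def frame_from_chunks(records, columns, chunk_size=100_000):
    """Build a DataFrame from an iterator of tuples, one chunk at a time.

    The name and the chunk_size default are illustrative, not a pandas API.
    """
    records = iter(records)
    frames = []
    while True:
        # Pull at most chunk_size tuples; an empty list means exhaustion.
        chunk = list(islice(records, chunk_size))
        if not chunk:
            break
        frames.append(pd.DataFrame.from_records(chunk, columns=columns))
    if not frames:
        # Generator produced nothing: return an empty frame with the columns.
        return pd.DataFrame(columns=columns)
    # Stitch the per-chunk frames together with a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

This may lower peak memory because the per-chunk frames are usually more compact than the equivalent lists of tuples, but the final concat still needs room for a copy of the data, so it is not a guaranteed fix for very large inputs.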
