in Data Science by (17.6k points)

I've created a tuple generator that extracts information from a file, keeping only the records of interest and converting each one to a tuple that the generator yields.

I tried to create a DataFrame from it:

import pandas as pd

df = pd.DataFrame.from_records(tuple_generator, columns = tuple_fields_name_list)

but it throws an error:

... 

C:\Anaconda\envs\py33\lib\site-packages\pandas\core\frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
   1046                 values.append(row)
   1047                 i += 1
-> 1048                 if i >= nrows:
   1049                     break
   1050 

TypeError: unorderable types: int() >= NoneType()

I got it to work by consuming the generator into a list, but that uses twice the memory:

df = pd.DataFrame.from_records(list(tuple_generator), columns = tuple_fields_name_list)

The files I want to load are big, and memory consumption matters. On my last attempt, my computer spent two hours trying to grow virtual memory :(

The question: does anyone know a way to create a DataFrame directly from a record generator, without first converting it to a list?

Note: I'm using Python 3.3 and pandas 0.12 with Anaconda on Windows.

Update:

Reading the file is not the problem; my tuple generator does that well. It scans a compressed text file of intermixed records line by line, converts only the wanted data to the correct types, and yields the fields as tuples. Some numbers: it scans 2111412 records in a 130MB gzip file (about 6.5GB uncompressed) in about a minute, using little memory.

Pandas 0.12 does not accept generators. The dev version does, but it puts the whole generator into a list and then converts that list to a frame. That is not efficient, but it is something pandas has to deal with internally. Meanwhile, I'll have to think about buying more memory.

1 Answer

0 votes
by (41.4k points)

You cannot create a DataFrame from a generator with pandas 0.12. You can either update to the development version (get it from GitHub and compile it, which is a little painful on Windows, but it is the option I would prefer).

Or, since you said you are filtering the lines, you can filter them first, write them to a file, and then load that file with read_csv or a similar reader.

If you want to get really elaborate, you can create a file-like object that returns the lines:
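As a rough sketch of the write-then-read route: the generator is streamed to a temporary CSV file one row at a time, so the full list of tuples never sits in memory, and read_csv then parses the file. The generator body, the field names, and the temporary file are hypothetical stand-ins for the asker's tuple_generator and tuple_fields_name_list, not code from the question.

```python
import csv
import os
import tempfile

import pandas as pd


def tuple_generator():
    # Hypothetical stand-in for the asker's generator of filtered records.
    for i in range(5):
        yield (i, "name%d" % i, i * 1.5)


# Hypothetical field names; the real list comes from the asker's code.
tuple_fields_name_list = ["id", "name", "score"]

# Stream the records to disk; only one row is held in memory at a time.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", newline="", delete=False
) as tmp:
    writer = csv.writer(tmp)
    writer.writerow(tuple_fields_name_list)  # header row
    writer.writerows(tuple_generator())      # consumes the generator lazily
    csv_path = tmp.name

try:
    # Let pandas parse the file and infer the column types.
    df = pd.read_csv(csv_path)
finally:
    os.remove(csv_path)  # clean up the temporary file

print(df)
```

The cost is one extra pass over the data on disk, which may be acceptable when memory rather than time is the bottleneck.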

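def gen():
    # Stand-in for a real generator: yields CSV text one line at a time.
    lines = [
        'col1,col2\n',
        'foo,bar\n',
        'foo,baz\n',
        'bar,baz\n'
    ]
    for line in lines:
        yield line

class Reader(object):
    """Minimal file-like wrapper around a generator of text lines."""

    def __init__(self, g):
        self.g = g

    def __iter__(self):
        # Assumption: newer pandas versions also expect file-like objects to be iterable.
        return self.g

    def read(self, n=0):
        # Ignore the requested size and return one line per call;
        # an empty string signals end of file.
        try:
            return next(self.g)
        except StopIteration:
            return ''

Then pass it to read_csv: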

>>> pd.read_csv(Reader(gen()))
  col1 col2
0  foo  bar
1  foo  baz
2  bar  baz
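If the records contain only fixed-size fields (numbers, for instance), another possibility, not part of the answer above, is to let NumPy consume the generator into a structured array and build the frame from that. This is only a sketch: the generator and the field names and types below are hypothetical, and string fields would need a fixed-width dtype.

```python
import numpy as np
import pandas as pd


def tuple_generator():
    # Hypothetical stand-in for a generator of (id, value) records.
    for i in range(5):
        yield (i, i * 0.5)


# Hypothetical record layout: one int64 field and one float64 field.
record_dtype = np.dtype([("id", "i8"), ("value", "f8")])

# np.fromiter pulls tuples from the generator one at a time and stores
# them in a compact structured array instead of a list of Python tuples.
arr = np.fromiter(tuple_generator(), dtype=record_dtype)

# The structured array's field names become the DataFrame columns.
df = pd.DataFrame.from_records(arr)

print(df)
```

If the number of records is known in advance, passing it as the count argument of np.fromiter lets NumPy allocate the array once instead of growing it.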
