Pandas dataframe read_csv on bad data

Question

asked Sep 17, 2019 in Data Science by ashely (50.2k points)

I want to read in a very large CSV (too large to open and edit easily in Excel), but somewhere around the 100,000th row there is a row with one extra column, which causes the program to crash. The row is malformed, so I need a way to ignore the extra column. There are around 50 columns, so hardcoding the headers and using names or usecols isn't preferable. I may also run into this issue in other CSVs and want a generic solution. Unfortunately, I couldn't find anything in read_csv. The code is as simple as this:

import pandas as pd

def loadCSV(filePath):
    # index_col=False keeps pandas from using the first column as the index
    dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000)
    datakeys = dataframe.keys()
    return dataframe, datakeys

1 Answer

vinita · Answer 1 · 2019-09-17T06:20:39+0000

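You can either skip the erroneous rows by passing on_bad_lines='skip' (pandas 1.3 and later; older versions used error_bad_lines=False, which was deprecated in 1.3 and removed in 2.0), or read a single row first to get the correct column names and then re-read the file using only those columns, e.g.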

cols = pd.read_csv(file, nrows=1).columns

df = pd.read_csv(file, usecols=cols)

This will then ignore the extra column, because only the columns named in the header are parsed.
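
To see how the two options differ, here is a minimal sketch on a small in-memory CSV. The sample data and the variable names (SAMPLE, skipped, trimmed) are made up for illustration, and it assumes pandas 1.3 or later with the default C parser. Skipping drops the malformed row entirely, while the usecols approach should keep the row and discard only its surplus field.

```python
import io

import pandas as pd

# Hypothetical sample: the second data row has one field too many.
SAMPLE = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# Option 1: drop malformed rows entirely (needs pandas >= 1.3).
skipped = pd.read_csv(io.StringIO(SAMPLE), on_bad_lines="skip")

# Option 2: read the header once, then restrict parsing to those columns
# so the surplus field in the malformed row is discarded.
cols = pd.read_csv(io.StringIO(SAMPLE), nrows=1).columns
trimmed = pd.read_csv(io.StringIO(SAMPLE), usecols=cols)

print(len(skipped), "rows after skipping bad lines")
print(len(trimmed), "rows when restricting to the header columns")
```

Which one is preferable depends on whether the rest of the malformed row can be trusted: if the extra field means the values are shifted, skipping the row is probably the safer choice.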
