0 votes
in Data Science by (50.2k points)

I want to read in a very large CSV (too large to open and edit easily in Excel), but somewhere around the 100,000th row there is a row with one extra column, which causes the program to crash. The row is malformed, so I need a way to ignore the extra column. There are around 50 columns, so hardcoding the headers and using names or usecols isn't preferable. I may also run into this issue in other CSVs and want a generic solution. Unfortunately, I couldn't find anything in read_csv. The code is as simple as this:

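import pandas as pd

def loadCSV(filePath):
    # Read the first 1000 rows; index_col=False stops pandas from using the first column as the index
    dataframe = pd.read_csv(filePath, index_col=False, encoding='iso-8859-1', nrows=1000)
    # keys() returns the column labels of the DataFrame
    datakeys = dataframe.keys()
    return dataframe, datakeys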

1 Answer

0 votes
by (108k points)

You can either pass on_bad_lines='skip' (error_bad_lines=False in pandas versions before 1.3, where the newer parameter does not exist) to skip malformed rows, or read a single row to get the expected column names and then re-read the file using only those columns, e.g.

cols = pd.read_csv(file, nrows=1).columns

df = pd.read_csv(file, usecols=cols)

This ignores the extra field in the malformed row instead of raising an error.
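To see how the two options differ, here is a minimal sketch run on a small in-memory CSV. The sample data and both function names are made up for illustration, and it assumes pandas 1.3 or later (for on_bad_lines) with the default C parser; behaviour may differ with other versions or engines.

```python
import io

import pandas as pd

# Hypothetical sample: the fourth line has one field too many.
SAMPLE = "a,b,c\n1,2,3\n4,5,6\n7,8,9,10\n11,12,13\n"


def read_skipping_bad_rows(text):
    """Drop any row that has more fields than the header."""
    # on_bad_lines="skip" replaces the older error_bad_lines=False.
    return pd.read_csv(io.StringIO(text), on_bad_lines="skip")


def read_trimming_extra_fields(text):
    """Keep every row, but only the columns named in the header."""
    # First pass: read a single row just to learn the header names.
    cols = pd.read_csv(io.StringIO(text), nrows=1).columns.tolist()
    # Second pass: restrict parsing to those columns.
    return pd.read_csv(io.StringIO(text), usecols=cols)


skipped = read_skipping_bad_rows(SAMPLE)
trimmed = read_trimming_extra_fields(SAMPLE)
print(len(skipped), len(trimmed))
```

The first option discards the malformed row entirely, while the second keeps it and drops only the surplus field, so which one to use depends on whether the rest of that row can be trusted.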

