in Python by (45.3k points)

I read all the CSV files in a folder with pandas, concatenated them into one large DataFrame (all files have the same structure), saved the result as a single CSV, and then tried to read that CSV back. The error occurs on the final read. The code and the error are below.

import pandas as pd
import numpy as np
import glob

path =r'somePath' # use your path
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []

for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    list_.append(df)

store = pd.concat(list_)
store.to_csv("C:\work\DATA\Raw_data\\store.csv", sep=',', index= False)
store1 = pd.read_csv("C:\work\DATA\Raw_data\\store.csv", sep=',')

Error:

CParserError                              Traceback (most recent call last)
<ipython-input-48-2983d97ccca6> in <module>()
----> 1 store1 = pd.read_csv("C:\work\DATA\Raw_data\\store.csv", sep=',')

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    472                     skip_blank_lines=skip_blank_lines)
    473 
--> 474         return _read(filepath_or_buffer, kwds)
    475 
    476     parser_f.__name__ = name

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    258         return parser
    259 
--> 260     return parser.read()
    261 
    262 _parser_defaults = {

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    719                 raise ValueError('skip_footer not supported for iteration')
    720 
--> 721         ret = self._engine.read(nrows)
    722 
    723         if self.options.get('as_recarray'):

C:\Users\armsharm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1168 
   1169         try:
-> 1170             data = self._reader.read(nrows)
   1171         except StopIteration:
   1172             if nrows is None:

pandas\parser.pyx in pandas.parser.TextReader.read (pandas\parser.c:7544)()
pandas\parser.pyx in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7784)()
pandas\parser.pyx in pandas.parser.TextReader._read_rows (pandas\parser.c:8401)()
pandas\parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:8275)()
pandas\parser.pyx in pandas.parser.raise_parser_error (pandas\parser.c:20691)()

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

I also tried reading the file with the csv module's reader:

import csv

with open("C:\work\DATA\Raw_data\\store.csv", 'rb') as f:
    reader = csv.reader(f)
    l = list(reader)

Error:

Error                                     Traceback (most recent call last)
<ipython-input-36-9249469f31a6> in <module>()
      1 with open('C:\work\DATA\Raw_data\\store.csv', 'rb') as f:
      2     reader = csv.reader(f)
----> 3     l = list(reader)

Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

2 Answers

0 votes
by (16.8k points)

The cause was stray carriage returns ("\r") in the data, which pandas treated as line terminators, just as it treats "\n". I'm posting this here because it is probably a common reason for this error.
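To check whether a file actually contains such stray carriage returns, one option is to count the "\r" bytes that are not part of a "\r\n" pair. This is only a minimal sketch using the standard library; the helper name count_bare_carriage_returns and the sample bytes are made up for illustration and are not from the original post.

```python
import re


def count_bare_carriage_returns(raw: bytes) -> int:
    # Count CR bytes that are NOT immediately followed by LF,
    # i.e. carriage returns that are not part of a Windows CRLF line ending.
    return len(re.findall(rb"\r(?!\n)", raw))


# Illustrative data: one stray CR inside a field, one ordinary CRLF line ending.
sample = b"a,b\n1,foo\rbar\n2,baz\r\n"
print(count_bare_carriage_returns(sample))
```

For a real file, read it in binary mode (open(path, 'rb')) and pass the result of f.read() to the helper. A non-zero count suggests that the carriage-return explanation applies.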

The solution I found was to pass lineterminator='\n' to read_csv, like this:

df_clean = pd.read_csv('test_error.csv', lineterminator='\n')
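The effect can be reproduced on a small in-memory example. The sketch below assumes pandas' default C parser and uses made-up data with a single stray "\r" inside a field; it compares the default behaviour with lineterminator='\n'.

```python
import io

import pandas as pd

# Illustrative CSV text: the first data row has a stray carriage return in column b.
raw = "a,b\n1,foo\rbar\n2,baz\n"

# Default parsing: the stray "\r" is treated as a line break, so the row is split.
default_df = pd.read_csv(io.StringIO(raw))

# With an explicit terminator, only "\n" ends a row and the "\r" stays in the field.
fixed_df = pd.read_csv(io.StringIO(raw), lineterminator="\n")

print(len(default_df), len(fixed_df))
```

One caveat: if the file uses Windows "\r\n" line endings, setting lineterminator='\n' may leave a trailing "\r" on the last column of each row, so it can be worth stripping that afterwards.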

0 votes
by (1.3k points)

Code:

import pandas as pd
import glob

path = r'Path' # replace with your path
allFiles = glob.glob(path + "/*.csv")
new_list = []

for file in allFiles:
    df = pd.read_csv(file, index_col=None, header=0)
    new_list.append(df)

store = pd.concat(new_list)
store.to_csv(r"C:\work\DATA\Raw_data\store.csv", sep=',', index=False)
store1 = pd.read_csv(r"C:\work\DATA\Raw_data\store.csv", sep=',')

The updated code above reads and merges the CSV files, writes the combined file, and then reads it back. Compared with the code in the question, the output paths are written as raw strings, and the unused numpy import and empty DataFrame are dropped.
