0 votes
2 views
in Python by (1.6k points)

I have a large text file (~7 GB) and I am looking for the fastest way to read it. I have been reading about several approaches, such as reading it chunk by chunk, to speed up the process.

For example, effbot suggests:

# File: readline-example-3.py

file = open("sample.txt")
while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        pass  # do something

which reportedly processes 96,900 lines of text per second. Other authors suggest using islice():

from itertools import islice

with open(...) as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        # process next_n_lines

list(islice(f, n)) will return a list of the next n lines of the file f. Using this inside a loop will give you the file in chunks of n lines.
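
For reference, a complete runnable version of that islice pattern might look like the sketch below; the file name, the chunk size n, and the process_chunk function are placeholders, not part of the original snippet.

from itertools import islice

def process_chunk(lines):
    pass  # placeholder: replace with your real per-chunk work

n = 100000  # lines per chunk; tune to your workload

with open("large_file.txt") as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        process_chunk(next_n_lines)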

2 Answers

0 votes
by (25.1k points)

Just open the file in a with block to avoid having to close it. Then, iterate over each line in the file object in a for loop and process those lines. e.g.:

with open("file.txt") as f:

    for line in f:

        process_lines(line)
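
Here process_lines stands for whatever work you do per line. As a minimal illustration (the file name and filter string are made up), the same pattern can count matching lines without ever holding the whole file in memory:

matches = 0
with open("file.txt") as f:
    for line in f:
        if "ERROR" in line:  # hypothetical filter
            matches += 1
print(matches)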


0 votes
by (3.1k points)

For files that are hundreds of megabytes or several gigabytes in size, performance matters: you want to minimize memory use while maximizing reading speed. Here are a few ways to read big files efficiently in Python:

1. Buffered Reading:

Python's built-in open() function already reads files efficiently without loading the entire file into memory at once. Here is how it can be done using a with statement (which ensures the file is properly closed):

# Efficient line-by-line reading
with open('large_file.txt', 'r', buffering=2**20) as f:  # 1 MB buffer size
    for line in f:
        process_line(line)

2. mmap (Memory Mapping):

The mmap module lets you map a file directly into memory and treat it like an array of bytes. For very large files this can be faster than regular file I/O.

import mmap

# Open the file and create a read-only memory map of it
with open('large_file.txt', 'rb') as f:
    mmapped_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Example of iterating over the memory-mapped file (byte by byte)
    for i in range(len(mmapped_file)):
        process_byte(mmapped_file[i])

    mmapped_file.close()
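
Iterating byte by byte in pure Python is slow, though. If the file is line-oriented, a sketch like the following reads whole lines from the memory map instead (process_line and the file name are placeholders); note that the file is opened in binary mode, so the lines come back as bytes:

import mmap

with open('large_file.txt', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # readline() returns the bytes up to and including b'\n', or b'' at EOF
        for line in iter(mm.readline, b''):
            process_line(line)  # placeholder for your own handler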

3. Using iter() with read() in Chunks:

You can also iterate over the file in fixed-size chunks, using iter() with a sentinel so that each iteration reads that much from the file. For large files this can be more efficient than reading line by line.

# Read in 1 MB pieces
chunk_size = 1024 * 1024  # 1 MB
with open('large_file.txt', 'r') as f:
    for chunk in iter(lambda: f.read(chunk_size), ''):
        process_chunk(chunk)
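
If the data does not need to be decoded as text, the same pattern works in binary mode with functools.partial in place of the lambda, which skips the decoding overhead; the file name and process_chunk are placeholders here:

from functools import partial

chunk_size = 1024 * 1024  # 1 MB per read
with open('large_file.txt', 'rb') as f:
    # read() returns b'' at end of file, which stops the iterator
    for chunk in iter(partial(f.read, chunk_size), b''):
        process_chunk(chunk)  # placeholder for your own handler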

4. Using fileinput Module for Line-by-Line Processing:

If you have to process several large files in a single run, Python's fileinput module provides a line-by-line reader over multiple files. It is useful for logs and for merging multiple files into a single stream.

import fileinput

for line in fileinput.input(files=('large_file1.txt', 'large_file2.txt')):
    process_line(line)
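
When merging several files this way, fileinput can also tell you which file each line came from, e.g. via fileinput.filename() and fileinput.isfirstline(); a small sketch (file names and process_line are placeholders):

import fileinput

with fileinput.input(files=('large_file1.txt', 'large_file2.txt')) as f:
    for line in f:
        if fileinput.isfirstline():
            print("reading", fileinput.filename())  # announce each new file
        process_line(line)  # placeholder for your own handler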

5. Using csv.reader for Structured Files:

When the file is a CSV or similarly structured text file, csv.reader lets you parse it efficiently, one row at a time:

import csv

with open('large_file.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        process_row(row)
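
If the CSV has a header row, csv.DictReader from the same module yields each row as a dict keyed by column name while still reading one row at a time; the column name in the comment below is made up:

import csv

with open('large_file.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)  # uses the first row as field names
    for row in reader:
        process_row(row)  # row is a dict, e.g. row['id']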
