For files that are hundreds of megabytes to several gigabytes in size, performance matters: you want to minimize memory use while keeping read speed high. Here are a few ways to read big files efficiently in Python:
1. Buffered Reading
Python's built-in open() function lets you iterate over a file line by line without loading the whole file into memory at once. Use it inside a with statement so the file is closed properly:
# Efficient line-by-line reading with a 1 MB buffer
with open('large_file.txt', 'r', buffering=2**20) as f:
    for line in f:
        process_line(line)
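As a rough sketch of what the processing step might look like (the counters here are purely illustrative, not part of any API), you can aggregate statistics line by line so memory use stays constant regardless of file size:

# Sketch: count lines and characters without keeping more than one line in memory
line_count = 0
char_count = 0
with open('large_file.txt', 'r', buffering=2**20) as f:
    for line in f:
        line_count += 1
        char_count += len(line)
print(line_count, 'lines,', char_count, 'characters')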
2. mmap (Memory Mapping)
The mmap module maps a file directly into memory so you can treat it like a byte array. For very large files this can be faster than regular file I/O, particularly for random access.
import mmap

# Open the file in binary mode and create a read-only memory map of it
with open('large_file.txt', 'rb') as f:
    mmapped_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Example: iterate over the memory-mapped file byte by byte
    # (indexing a mmap returns an int in Python 3)
    for i in range(len(mmapped_file)):
        process_byte(mmapped_file[i])
    mmapped_file.close()
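If you want line-oriented access instead of raw bytes, a memory map can also be read with its readline() method. A minimal sketch (note that a mmap yields bytes, so process_line would receive bytes objects here):

import mmap

# Sketch: iterate a memory-mapped file line by line
with open('large_file.txt', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for line in iter(mm.readline, b''):
            process_line(line)  # bytes, not str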
3. Using iter() with read() in Chunks
You can read the file in fixed-size chunks by combining iter() with read(). For large files this can be more efficient than reading line by line, especially when lines are very long or line boundaries don't matter.
# Read in 1 MB chunks
chunk_size = 1024 * 1024  # 1 MB
with open('large_file.txt', 'r') as f:
    for chunk in iter(lambda: f.read(chunk_size), ''):
        process_chunk(chunk)
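The same pattern is often wrapped in a small generator so the chunking logic is reusable; read_in_chunks below is a hypothetical helper, and it opens the file in binary mode so each chunk is an exact number of bytes:

def read_in_chunks(path, chunk_size=1024 * 1024):
    # Sketch: yield successive fixed-size chunks of a file
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

for chunk in read_in_chunks('large_file.txt'):
    process_chunk(chunk)  # chunk is bytes here, since the file is opened in 'rb'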
4. Using the fileinput Module for Line-by-Line Processing
If you need to process several large files in one run, Python's fileinput module reads them line by line as a single stream. This is handy for log processing or for merging multiple files into one pass.
import fileinput

for line in fileinput.input(files=('large_file1.txt', 'large_file2.txt')):
    process_line(line)
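fileinput also keeps track of which file and line you are currently on, which helps when merging logs. A small sketch (the print format is only illustrative):

import fileinput

# Sketch: announce each new file as the merged stream switches to it
with fileinput.input(files=('large_file1.txt', 'large_file2.txt')) as stream:
    for line in stream:
        if fileinput.isfirstline():
            print('--- reading', fileinput.filename())
        process_line(line)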
5. Using csv.reader for Structured Files
If the file is CSV or similarly structured text, csv.reader lets you parse it efficiently, row by row:
import csv

# newline='' is recommended by the csv docs to avoid newline translation issues
with open('large_file.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        process_row(row)
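If the CSV has a header row, csv.DictReader streams each row as a dict keyed by column name; the 'amount' column below is purely a made-up example:

import csv

# Sketch: stream a CSV with a header and sum one (hypothetical) numeric column
total = 0.0
with open('large_file.csv', 'r', newline='') as f:
    for row in csv.DictReader(f):
        total += float(row['amount'])  # 'amount' is a hypothetical column name
print('total:', total)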