Python Read Big File Example

Reading a file in Python is simple: the file object’s read() and readlines() methods return its content in a single call. Both have pitfalls with large files, though, and this article explains how to use them correctly.

1. Python file object’s read() and readlines() methods.

Tutorials on reading and writing files in Python usually present read() and readlines() side by side, so code like the following is very common.

with open(file_path, 'rb') as f:
    for line in f.readlines():
        print(line)

with open(file_path, 'rb') as f:
    print(f.read())

This works without problems for small files, but with a large file both approaches load the entire content into memory at once, which can easily exhaust memory and raise MemoryError.

1.1 read([size]).

The read([size]) method reads up to size bytes (or characters, if the file is opened in text mode) from the current position in the file. If you omit size, it reads to the end of the file. Everything read is returned as a single object: bytes in binary mode, str in text mode.
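
To illustrate how size bounds each call, here is a minimal sketch that writes a small temporary file and reads it back in pieces. The file content and the variable names (demo_path, first, second, rest, at_eof) are made up for the demo.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    demo_path = tmp.name

with open(demo_path, "rb") as f:
    first = f.read(4)    # at most 4 bytes from the current position
    second = f.read(4)   # the next 4 bytes
    rest = f.read()      # no size: everything up to the end of the file
    at_eof = f.read(4)   # at the end of the file, read() returns an empty bytes object

print(first, second, rest, at_eof)  # b'0123' b'4567' b'89' b''

os.remove(demo_path)
```

An empty result is the usual signal that the end of the file has been reached, which is what chunked read loops test for.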

1.2 readlines().

Despite its name, readlines() does not read lazily: it reads every remaining line and returns them all in a list, one string per line. The whole file therefore ends up in memory, and a MemoryError may occur with a large file.
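
A small sketch makes the list-building behaviour visible. The three-line file and the names demo_path and all_lines are assumptions made only for this demo.

```python
import os
import tempfile

# Hypothetical three-line file, created only for this demo.
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("alpha\nbeta\ngamma\n")
    demo_path = tmp.name

with open(demo_path, "r") as f:
    all_lines = f.readlines()  # the whole file, as one list of strings

print(type(all_lines).__name__, len(all_lines))  # list 3
print(all_lines[0] == "alpha\n")  # True: the line ending is kept

os.remove(demo_path)
```

With three lines this is harmless, but the list grows with the file, so for a multi-gigabyte log every line would be held in memory at the same time.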

2. How To Correctly Use read and readlines.

Running the code above in a production system is risky, because its memory use grows with the size of the file. So let’s see how to read files safely.

2.1 Read a binary file.

If the file is binary, the recommended way is to pass a buffer size to read() so that each call returns a bounded number of bytes. A larger buffer means fewer read calls and is usually faster up to a point, at the cost of more memory per chunk.

with open(file_path, 'rb') as f:
    while True:
        buf = f.read(1024)
        if not buf:
            break
        ...
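
As one possible way to fill in the placeholder in the loop above, the sketch below hashes a file chunk by chunk. The function name sha256_of_file and the 64 KB chunk size are arbitrary choices for illustration, not something the loop above prescribes. It uses the two-argument form of iter(), which keeps calling f.read(chunk_size) until it returns an empty bytes object.

```python
import hashlib
import os
import tempfile
from functools import partial


def sha256_of_file(path, chunk_size=64 * 1024):
    """Hash a file without loading it all into memory (chunk_size is arbitrary)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter(callable, sentinel) calls f.read(chunk_size) until it returns b"".
        for chunk in iter(partial(f.read, chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file of one million bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1_000_000)
    demo_path = tmp.name

chunked = sha256_of_file(demo_path)
os.remove(demo_path)

# The chunked digest matches hashing the same bytes in one go.
print(chunked == hashlib.sha256(b"x" * 1_000_000).hexdigest())  # True
```

Only one chunk is alive at a time, so memory use stays near chunk_size regardless of the file size.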

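2.2 Read a text file.

If it is a text file, you can call readline() in a loop or iterate over the file object directly; either way you read one line at a time. This is slower than reading everything in a single call, but only the current line needs to be held in memory.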

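with open(file_path, 'r') as f:  # text mode, since this is a text file
    while True:
        line = f.readline()
        if not line:  # readline() returns '' at the end of the file
            break
        print(line, end='')  # line already ends with a newline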
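
Iterating over the file object directly is the shorter form of the same loop. Here is a minimal sketch; the helper name count_matching_lines and the sample log lines are hypothetical.

```python
import os
import tempfile


def count_matching_lines(path, needle):
    """Count the lines containing needle, reading one line at a time."""
    count = 0
    with open(path, "r") as f:
        for line in f:  # the file object yields one line per iteration
            if needle in line:
                count += 1
    return count


# Hypothetical sample log, created only for this demo.
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("INFO start\nERROR disk full\nINFO retry\nERROR timeout\n")
    demo_path = tmp.name

error_count = count_matching_lines(demo_path, "ERROR")
os.remove(demo_path)
print(error_count)  # 2
```

The for loop stops by itself at the end of the file, so no explicit check for an empty line is needed.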

3. Question & Answer.

3.1 How to write Python code to lazily read a big file.

  1. My log file is 6 GB, and when I read it with the Python file object’s readlines function, my program hangs. My idea is to read the log data piece by piece, like a lazy load: read one piece, write it to another file, then read the next piece. This avoids loading the whole file into memory at once. But I do not know how to implement it. Can anyone help?
  2. You can use Python’s yield keyword to write a generator function that reads the file lazily, as below.
    def read_file_in_chunks(file_object, chunk_size=3072):
        '''
        This is the lazy function (a generator); it reads one piece of at most chunk_size at a time.
        '''
        while True:
            # Read at most chunk_size of data.
            data = file_object.read(chunk_size)

            # An empty result means the end of the file has been reached.
            if not data:
                # Break the loop.
                break

            yield data


    # Open the big log data file.
    with open('big_log_file.dat') as f:

        # Consume the lazy generator one piece at a time.
        for piece in read_file_in_chunks(f):

            # process_data is a placeholder for your own handling, such as writing the piece to another file.
            process_data(piece)
  3. You can also use Python’s fileinput module to iterate over the file’s lines, reading one line into memory at a time. This avoids loading the whole file into memory when the file is too big.
    # First import the python fileinput module.
    >>> import fileinput
    >>>
    # Iterate over the lines of the data file.
    >>> for line in fileinput.input("C:\\WorkSpace\\Downloads\\log.dat"):
    ...     # Print each text line; it already ends with a newline.
    ...     print(line, end='')
    ...
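
Since the question asks for each piece to be written to another file, here is one way the generator from the second answer could be wired up for that. This is a sketch under assumptions: copy_in_chunks is a hypothetical helper, the files are opened in binary mode so the pieces are copied byte for byte, and a throwaway file of 10,000 bytes stands in for the 6 GB log.

```python
import os
import shutil
import tempfile


def read_file_in_chunks(file_object, chunk_size=3072):
    """Yield successive pieces of at most chunk_size bytes."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


def copy_in_chunks(src_path, dst_path, chunk_size=3072):
    """Copy src_path to dst_path one piece at a time; return the piece count."""
    pieces = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for piece in read_file_in_chunks(src, chunk_size):
            dst.write(piece)
            pieces += 1
    return pieces


# Demo: a throwaway file of 10,000 bytes stands in for the big log file.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "big_log_file.dat")
dst_path = os.path.join(tmp_dir, "copy.dat")
with open(src_path, "wb") as f:
    f.write(b"a" * 10_000)

piece_count = copy_in_chunks(src_path, dst_path)
copied_size = os.path.getsize(dst_path)
shutil.rmtree(tmp_dir)  # clean up the demo files

print(piece_count, copied_size)  # 4 10000
```

For a plain copy with no per-piece processing, the standard library's shutil.copyfileobj does the same chunked loop; the generator is mainly useful when each piece needs to be transformed or filtered on the way.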
