How To Process Large Excel File In Python

This article tells you how to use the python vaex library to process a large Excel file.

1. How To Process Excel File In Python.

  1. First of all, let’s look at the simplest case. Let’s not consider the problem of performance. Then we can use the python xlrd module to open and load an excel table like below.
    def read_table_by_xlrd():
        # import python xlrd module.
        import xlrd
        # open an excel file.
        workbook = xlrd.open_workbook(r'data.xls')
        # get all worksheet names in the workbook.
        sheet_name = workbook.sheet_names()
        print ('All sheets in the file data.xls are: {}'.format(sheet_name))
        # get the first worksheet.
        sheet = workbook.sheet_by_index(0)
        # print cell(0, 1) value.
        print ('The cell value of row index 0 and col index 1 is: {}'.format(sheet.cell_value(0, 1)))
        # print the first row value.
        print ('The elements of row index 0 are: {}'.format(sheet.row_values(0)))
        # print the column1 length.
        print ('The length of col index 1 are: {}'.format(len(sheet.col_values(1))))
    if __name__ == '__main__':
  2. We have successfully loaded a table in XLS format into the memory in python, and we can analyze these data. If you need to modify the data, you can use the python module openpyxl.
  3. There is another very common and powerful library in python that can be used to process excel table data, that is panda.
  4. Here we use IPython to show how to use panda to process excel table data as below.
    $ ipython
    Python 3.7.7 (default, May  6 2020, 04:59:01) 
    Type 'copyright', 'credits' or 'license' for more information
    IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.
    # import python pandas module.
    In [1]: import pandas as pd
    In [2]: !ls -l
    -rw-r--r-- 1 dechin dechin 372736  3月 27 21:31 data.xls
    -rw-r--r-- 1 dechin dechin    563  3月 27 21:42
    In [3]: data = pd.read_excel('data.xls', 'Sheet1') # Read excel file with pandas read_excel method.
    In [4]: data.to_csv('data.csv', encoding='utf-8') # Convert to csv format file.
    In [7]: !ls -l
    -rw-r--r-- 1 dechin dechin 221872  3月 27 21:52 data.csv
    -rw-r--r-- 1 dechin dechin 372736  3月 27 21:31 data.xls
    -rw-r--r-- 1 dechin dechin    563  3月 27 21:42
    In [8]: !head -n 10 data.csv # Read the first 10 lines of the csv file.
  5. In iPython, we can not only execute the python command, but also we can execute some system commands by adding a ! before the command ( for example !ls -l ), which is very convenient.
  6. However, whether using xlrd or pandas, we will face the same problem: we need to load all the data into memory for processing.
  7. Our personal computer only has 8GB-16GB of memory. Even with a relatively large 64GB of memory, we can only process files with a memory size of less than 64GB in the memory, which is far from enough for big data scenarios.
  8. Therefore, the python vaex library introduced in the next chapter is a good solution.
  9. In addition, the command to view the local memory and usage under Linux is vmstat, you can use it to monitor your local memory, if you use macOS, you can use the command vm_stat.

2. How To Install & Use Python vaex Library.

2.1 Overview.

  1. The python vaex library provides a memory-mapped data processing solution, we don’t need to load the entire data file into the memory for processing, we can directly operate on the hard disk storage.
  2. The size of the file we can handle is no longer limited by the size of the memory. As long as the disk storage space allows, we can all process files of this size.

2.2 Install.

  1. Open a terminal and run the command pip install vaex to install it.
  2. After the word successfully installed appears, it means that we have successfully installed it and we can start to use it.

2.3 Performance Comparison.

  1. In order to show the advantages of using vaex, here we directly use xlrd and vaex to open the same excel file in IPython, so as to compare the opening time of the two.
    $ ipython
    Python 3.7.7 (default, May  6 2020, 04:59:01) 
    Type 'copyright', 'credits' or 'license' for more information
    IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.
    In [1]: import xlrd
    In [2]: import vaex
    In [3]: %timeit xlrd.open_workbook(r'data.xls')
    46.4 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    In [4]: %timeit'data.csv')
    4.95 ms ± 48.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    In [7]: %timeit'data.hdf5')
    1.34 ms ± 1.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  2. We found from the results that to open the same file, it takes nearly 50ms to use xlrd, while vaex only takes 1ms at least.

2.4 Data Format Conversion.

  1. In the source code of the previous section, we used a file that was not mentioned: data.hdf5, which is actually converted from data.csv.
  2. In this section, we mainly introduce how to convert the data format to adapt to the format that vaex can open and recognize.
  3. The first solution is to use pandas to directly convert csv format files to hdf5 format, the operation is similar to converting xls format files to csv format.
  4. First, we need to install python tables library with the command pip install tables.

    $ pip install tables
    Collecting tables
      Downloading tables-3.6.1-cp37-cp37m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (4.3 MB)
         |████████████████████████████████| 4.3 MB 1.2 MB/s 
    Collecting numexpr>=2.6.2
      Downloading numexpr-2.7.3-cp37-cp37m-macosx_10_9_x86_64.whl (101 kB)
         |████████████████████████████████| 101 kB 683 kB/s 
    Requirement already satisfied: numpy>=1.9.3 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from tables) (1.20.1)
    Installing collected packages: numexpr, tables
    Successfully installed numexpr-2.7.3 tables-3.6.1
  5. Open a terminal and run the below python source code in .
    $ python
    Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21) 
    [Clang 6.0 (clang-600.0.57)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pandas as pd
    >>> data = pd.read_csv('/Users/employee_info.csv')
    >>> data.to_hdf('/Users/data.hdf5','data',mode='w', format='table')
  6. After the operation is completed, an hdf5 file is generated in the related directory.
  7. Now we can use the method to open it and print the result as below.
    >>> import vaex
    >>> df ='/Users/data.hdf5')
    >>> df
      #  table
      0  (0, [16000], [b'jerry', b'2010-01-01'])
      1  (1, [6000], [b'tom', b'2011-08-19'])
      2  (2, [13000], [b'kevin', b'2009-02-08'])
      3  (3, [5000], [b'richard', b'2012-03-19'])
      4  (4, [28000], [b'jackie', b'2015-06-08'])
      5  (5, [36000], [b'steven', b'2008-02-01'])
      6  (6, [8000], [b'jack', b'2006-09-19'])
      7  (7, [19000], [b'gary', b'2018-01-16'])
      8  (8, [16600], [b'john', b'2017-10-01'])
  8. We can also directly use the python vaex module ‘s from_csv() method and export_hdf5() method for data format conversion.
    >>> import vaex
    >>> df = vaex.from_csv('/Users/employee_info.csv')
    >>> df.export_hdf5('/Users/data.hdf5')
    >>> df ='/Users/data.hdf5')
    >>> df
      #  Name       Hire Date       Salary
      0  'jerry'    '2010-01-01'     16000
      1  'tom'      '2011-08-19'      6000
      2  'kevin'    '2009-02-08'     13000
      3  'richard'  '2012-03-19'      5000
      4  'jackie'   '2015-06-08'     28000
      5  'steven'   '2008-02-01'     36000
      6  'jack'     '2006-09-19'      8000
      7  'gary'     '2018-01-16'     19000
      8  'john'     '2017-10-01'     16600

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.