What Is Pandas DataFrame With Examples

DataFrame is one of the core data structures in pandas and the one you will use most often for data analysis. Once you are comfortable with DataFrame, you have the basic skills needed to start learning data analysis.

1. What Is Pandas DataFrame Structure.

DataFrame is a tabular data structure with both row labels and column labels. Data is represented in rows and columns, where each column represents an attribute of the entity and each row represents the data of an entity.

It is also called a heterogeneous data table. Heterogeneous means that the columns in the table can have different data types, such as string, integer, or floating-point.
Each column of a DataFrame is a Series, and a single row selected from a DataFrame is also returned as a Series whose index is the column labels. You can therefore think of a DataFrame as a set of Series that share the same row labels, which is why DataFrame is said to have evolved from Series. The DataFrame structure is similar to a table in Excel.
Like Series, a DataFrame has its own row label index. By default it is an “implicit index” that counts up from 0, with one row label for each row of data in the DataFrame. Of course, you can also set the row labels yourself with an “explicit index”.

Each data value in a DataFrame can be modified, and rows and columns can be added or deleted. A DataFrame has two label axes, the row labels and the column labels, and it supports arithmetic operations on both rows and columns.
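
The properties listed above can be sketched in a few lines. The column names and values here are made up purely for illustration and are not part of the examples that follow.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the properties above.
df = pd.DataFrame({'price': [10.0, 20.0], 'qty': [3, 4]}, index=['x', 'y'])

df.loc['x', 'price'] = 12.5             # modify a single value by row and column label
df['total'] = df['price'] * df['qty']   # arithmetic between two columns adds a new column
df = df.drop(columns='qty')             # delete a column

print(df)
print(df.shape)  # (number of rows, number of columns)
```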

2. How To Create Pandas DataFrame Object.

The syntax format for creating a DataFrame object is as follows.

# import the pandas library
import pandas as pd

# call the DataFrame() constructor to create a pandas DataFrame object.
pd.DataFrame( data, index, columns, dtype, copy)

data  : The input data. It can be an ndarray, Series, list, dict, scalar, or another DataFrame object.

index : Row label, if no index value is passed, the default row label is np.arange(n), and n represents the number of data elements.

columns : Column label, if the columns value is not passed, the default column label is np.arange(n).

dtype : Dtype represents the data type of each column.

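copy : Whether to copy the input data. False means the data is not copied when a copy can be avoided. Older pandas versions default to False; newer versions default to None and decide based on the type of the input data.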
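
As a minimal sketch of how these parameters fit together, the snippet below builds a DataFrame from a NumPy array. The array contents and the labels `r1`, `r2`, `a`, `b`, `c` are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 integer array; the labels below are illustrative only.
arr = np.arange(6).reshape(2, 3)

# index sets the row labels, columns sets the column labels,
# and dtype converts every column to float.
df = pd.DataFrame(arr, index=['r1', 'r2'], columns=['a', 'b', 'c'], dtype=float)

print(df)
print(df.dtypes)
```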

Create an empty DataFrame object.

>>> import pandas as pd # import the pandas module.
>>>
>>> df = pd.DataFrame() # create an empty DataFrame object.
>>>
>>> print(df) # print out the empty DataFrame object.
Empty DataFrame
Columns: []
Index: []

Create a DataFrame with an array object.

>>> import pandas as pd
>>>
>>> data = ['python', 100, 'javascript', 'java', 199] # create a one-dimensional list with mixed value types.
>>>
>>> df = pd.DataFrame(data) # create a DataFrame object based on the above array.
>>>
>>> print(df) # print out the above DataFrame object.
            0
0      python
1         100
2  javascript
3        java
4         199
>>>

Create a DataFrame object using a nested list object.

>>> import pandas as pd
>>>
>>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]] # define the 2 dimension array to save employee data ( name, title, salary ).
>>>
>>> columns_array = ['Name','Title','Salary']
>>>
>>> df = pd.DataFrame(data,columns = columns_array) # create the DataFrame object with the above data, and set the column labels with columns_array.
>>>
>>> print(df)
    Name      Title  Salary
0    Tom  Developer   10000
1    Bob         QA   12000
2  Jerry    Manager   13000
>>>

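Specify the data type of the numeric elements as float. The string columns cannot be cast to float, so the pandas version used here prints a FutureWarning and leaves them unchanged. As the warning states, later versions may raise an error instead.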

>>> import pandas as pd
>>>
>>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]] # define the 2 dimension array to save employee data ( name, title, salary ).
>>>
>>> columns_array = ['Name','Title','Salary']
>>>
>>> df = pd.DataFrame(data,columns = columns_array, dtype = float) # create the DataFrame object with the above data, set the column labels with columns_array, and specify float as the data type.
sys:1: FutureWarning: Could not cast to float64, falling back to object. This behavior is deprecated. In a future version, when a dtype is passed to 'DataFrame', either all columns will be cast to that dtype, or a TypeError will be raised
>>>
>>> print(df)
    Name      Title   Salary
0    Tom  Developer  10000.0
1    Bob         QA  12000.0
2  Jerry    Manager  13000.0
>>>
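
One way to avoid the warning is to convert only the numeric column after the DataFrame has been created. This is a sketch that reuses the employee data from above; it is an alternative approach, not the one the original example takes.

```python
import pandas as pd

data = [['Tom', 'Developer', 10000], ['Bob', 'QA', 12000], ['Jerry', 'Manager', 13000]]
df = pd.DataFrame(data, columns=['Name', 'Title', 'Salary'])

# Convert only the numeric column; the string columns are left untouched.
df['Salary'] = df['Salary'].astype(float)

print(df.dtypes)
```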

Create a DataFrame from a Python dictionary. Every value list in the dictionary must have the same length. If an index is passed, its length must equal the length of the value lists. If no index is passed, the index defaults to range(n), where n is the length of the value lists.

>>> import pandas as pd
>>>
>>> data = {'Name':['Tom', 'Jerry', 'Bill', 'Bob'], 'Title':['Developer', 'QA', 'PM', 'Manager'], 'Salary':[12800,13400,12900,14200]} # define a python dictionary object, the key is the DataFrame column & the list value is the column data.
>>>
>>> df = pd.DataFrame(data) # create a DataFrame object based on the above python dictionary object.
>>>
>>> print(df) # print out the DataFrame object.
    Name      Title  Salary
0    Tom  Developer   12800
1  Jerry         QA   13400
2   Bill         PM   12900
3    Bob    Manager   14200
>>>
>>># The above example uses the default row label, which is generated by the function range(n). It generates 0,1,2,3 and corresponds to each element value in the list.
>>>
>>># We can also add custom row labels to the above example like below.
>>>
>>> import pandas as pd
>>>
>>> data = {'Name':['Tom', 'Jerry', 'Bill', 'Bob'], 'Title':['Developer', 'QA', 'PM', 'Manager'], 'Salary':[12800,13400,12900,14200]}
>>>
>>> index = ['employee_1', 'employee_2', 'employee_3', 'employee_4'] # define a python list of row labels.
>>>
>>> df = pd.DataFrame(data, index = index) # use the above list array as the DataFrame row label.
>>>
>>> print(df)
             Name      Title  Salary
employee_1    Tom  Developer   12800
employee_2  Jerry         QA   13400
employee_3   Bill         PM   12900
employee_4    Bob    Manager   14200

A list of dictionaries can be passed as input data to the DataFrame constructor. Each dictionary becomes one row, and by default the dictionary keys are used as the column names.

>>> import pandas as pd
>>>
>>> data_list_dict = [{'python': 90, 'java': 88},{'python': 99, 'java': 68, 'javascript': 100}]
>>>
>>> df = pd.DataFrame(data_list_dict)
>>>
>>> print(df) # the first row does not contain the 'javascript' column, so the 'javascript' column's value is NaN.
   python  java  javascript
0      90    88         NaN
1      99    68       100.0
>>>

Create a DataFrame object from a list of dictionaries and provide the row and column labels.

>>> import pandas as pd
>>>
>>> data_list_dict = [{'python': 90, 'java': 88},{'python': 99, 'java': 68, 'javascript': 100}]
>>>
>>> index_label = ['Tom', 'Jerry'] # data_list_dict contains only 2 rows of data, so the row index label list must also have 2 elements; otherwise pandas throws a ValueError such as: Shape of passed values is (2, 3), indices imply (3, 3).
>>>
>>> column_label = ['python', 'java', 'javascript']
>>>
>>> df = pd.DataFrame(data_list_dict, index = index_label, columns = column_label)
>>>
>>> print(df) # the first row does not contain the 'javascript' column, so the 'javascript' column's value is NaN.
        python  java  javascript
Tom        90    88         NaN
Jerry      99    68       100.0
>>>

You can also pass a dictionary of Series objects to create a DataFrame object. The row index of the result is the union of the indexes of all the Series objects in the dictionary, and a value that is missing for a row label is filled with NaN.

>>> import pandas as pd
>>>
>>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'], index=['name_row_1', 'name_row_2', 'row_3', 'row_4'])
>>>
>>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'], index=['row_a', 'row_2', 'row_c', 'row_d'])
>>>
>>> series_3 = pd.Series([12800,13400,12900,14200], index=['row_', 'row_2', 'row_c', 'row_d'])
>>>
>>> data = {'Name':series_1, 'Title':series_2, 'Salary':series_3 }
>>>
>>> df = pd.DataFrame(data)
>>>
>>> print(df)
             Name      Title   Salary
name_row_1    Tom        NaN      NaN
name_row_2  Jerry        NaN      NaN
row_          NaN        NaN  12800.0
row_2         NaN         QA  13400.0
row_3        Bill        NaN      NaN
row_4         Bob        NaN      NaN
row_a         NaN  Developer      NaN
row_c         NaN         PM  12900.0
row_d         NaN    Manager  14200.0
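
The index union is easier to see with a smaller case. The sketch below uses two hypothetical Series, `s1` and `s2`, that share only the label `b`.

```python
import pandas as pd

# Two hypothetical Series whose indexes overlap only on the label 'b'.
s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([3, 4], index=['b', 'c'])

df = pd.DataFrame({'x': s1, 'y': s2})

# The row index is the union of both indexes; missing cells become NaN.
print(df)
```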

2. How To Query, Add, Delete Pandas DataFrame Data.

2.1 Manipulate DataFrame Object With Column Index Label.

Query column data: You can use column index labels to easily query DataFrame object column data. Below is the example.

>>> import pandas as pd
>>>
>>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'])
>>>
>>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'])
>>>
>>> series_3 = pd.Series([12800,13400,12900,14200])
>>>
>>> data = {'Name':series_1, 'Title':series_2, 'Salary':series_3 }
>>>
>>> print(data)
{'Name': 0      Tom
1    Jerry
2     Bill
3      Bob
dtype: object, 'Title': 0    Developer
1           QA
2           PM
3      Manager
dtype: object, 'Salary': 0    12800
1    13400
2    12900
3    14200
dtype: int64}
>>>
>>>
>>> df = pd.DataFrame(data)
>>>
>>> print(df)
    Name      Title  Salary
0    Tom  Developer   12800
1  Jerry         QA   13400
2   Bill         PM   12900
3    Bob    Manager   14200
>>>
>>> print(df['Name']) # print out the 'Name' column.
0      Tom
1    Jerry
2     Bill
3      Bob
Name: Name, dtype: object
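
A single label returns a Series, while a list of labels returns a DataFrame. The following sketch, which reuses the same employee data, shows the difference; the variable name `subset` is chosen here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'Jerry', 'Bill', 'Bob'],
    'Title': ['Developer', 'QA', 'PM', 'Manager'],
    'Salary': [12800, 13400, 12900, 14200],
})

name_column = df['Name']        # single label: the result is a Series
subset = df[['Name', 'Salary']]  # list of labels: the result is a DataFrame

print(type(name_column).__name__)
print(subset)
```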

Add column data: You can add new data columns by assigning to a column index label or by calling the DataFrame object’s insert() function. Below is the example.

>>> import pandas as pd
>>>
>>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'])
>>>
>>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'])
>>>
>>> data = {'Name':series_1}
>>>
>>> df = pd.DataFrame(data) # create the DataFrame object.
>>>
>>> print(df)
    Name
0    Tom
1  Jerry
2   Bill
3    Bob
>>>
>>> df['Title'] = series_2 # add the Title column.
>>>
>>> print(df)
    Name      Title
0    Tom  Developer
1  Jerry         QA
2   Bill         PM
3    Bob    Manager
>>>
>>> df['Name - Title'] = df['Name'] + ' - ' +  df['Title'] # add a new column based on the Name & Title column.
>>>
>>> print(df)
    Name      Title     Name - Title
0    Tom  Developer  Tom - Developer
1  Jerry         QA       Jerry - QA
2   Bill         PM        Bill - PM
3    Bob    Manager    Bob - Manager
>>>
>>> series_3 = pd.Series([12800,13400,12900,14200]) # define the third column Series object.
>>>
>>> df.insert(2, column = 'Salary', value = series_3) # insert the third column into the DataFrame object.
>>>
>>> print(df)
    Name      Title  Salary     Name - Title
0    Tom  Developer   12800  Tom - Developer
1  Jerry         QA   13400       Jerry - QA
2   Bill         PM   12900        Bill - PM
3    Bob    Manager   14200    Bob - Manager
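
Both techniques above modify the DataFrame in place. If the original should stay unchanged, the assign() method returns a new DataFrame with the extra column. This is a small sketch; the `Bonus` column and its formula are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jerry'], 'Salary': [12800, 13400]})

# assign() returns a new DataFrame; df itself is not modified.
# The Bonus column is a made-up example: one tenth of the salary.
df_with_bonus = df.assign(Bonus=df['Salary'] // 10)

print(df_with_bonus)
print(df.columns.tolist())
```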

Delete column data: It is easy to use the python del command or the DataFrame object’s pop() function to delete the DataFrame object’s data columns. Below is the example.

>>> import pandas as pd
>>>
>>> series_1 = pd.Series(['Tom', 'Jerry', 'Bill', 'Bob'])
>>>
>>> series_2 = pd.Series(['Developer', 'QA', 'PM', 'Manager'])
>>>
>>> series_3 = pd.Series([12800,13400,12900,14200])
>>>
>>> data = {'Name':series_1, 'Title':series_2, 'Salary':series_3 }
>>>
>>> df = pd.DataFrame(data) # create the DataFrame object.
>>>
>>> print(df)
    Name      Title  Salary
0    Tom  Developer   12800
1  Jerry         QA   13400
2   Bill         PM   12900
3    Bob    Manager   14200
>>>
>>> del df['Salary'] # delete the Salary column with the del command.
>>>
>>> print(df) # we can see that the Salary column has been removed from the original DataFrame object.
    Name      Title
0    Tom  Developer
1  Jerry         QA
2   Bill         PM
3    Bob    Manager
>>>
>>>
>>> df.pop('Title') # delete the Title column with the pop() function.
0    Developer
1           QA
2           PM
3      Manager
Name: Title, dtype: object
>>>
>>> print(df) # the Title column has been removed also.
    Name
0    Tom
1  Jerry
2   Bill
3    Bob
>>>
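
Both del and pop() change the DataFrame in place. As an alternative sketch, drop() with the columns argument returns a new DataFrame and leaves the original intact; the variable name `trimmed` is chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'Jerry', 'Bill', 'Bob'],
    'Title': ['Developer', 'QA', 'PM', 'Manager'],
    'Salary': [12800, 13400, 12900, 14200],
})

# drop() returns a new DataFrame without the listed columns.
trimmed = df.drop(columns=['Title', 'Salary'])

print(trimmed.columns.tolist())
print(df.columns.tolist())  # the original still has all three columns
```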

2.2 Manipulate DataFrame Object With Row Index Label.

Query row data: Pass a row label to the DataFrame object’s loc attribute, or an integer row position to its iloc attribute, to query that row’s data. Below is the example.

>>> import pandas as pd
>>> 
>>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]]
>>> 
>>> columns_array = ['Name','Title','Salary']
>>> 
>>> row_index_label_arr = ['a', 'b', 'c']
>>> 
>>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
>>> 
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> 
>>> print(df.loc['a']) # call the DataFrame object's loc attribute to get one row by row index label.
Name            Tom
Title     Developer
Salary        10000
Name: a, dtype: object
>>> 
>>> print(df.iloc[2]) # call the DataFrame object's iloc attribute to get one row by row index number.
Name        Jerry
Title     Manager
Salary      13000
Name: c, dtype: object
>>>

You can also use slicing to select multiple rows at the same time. Below is the example.

>>> print(df[1:3]) # slice by position: return the rows at positions 1 and 2 (the stop position 3 is excluded).
    Name    Title  Salary
b    Bob       QA   12000
c  Jerry  Manager   13000
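
Slices also work with loc and iloc, with one difference worth noting: a label slice includes both endpoints, while a positional slice excludes the stop position. The sketch below also shows row selection with a Boolean condition; the threshold of 11000 is an arbitrary value picked for illustration.

```python
import pandas as pd

data = [['Tom', 'Developer', 10000], ['Bob', 'QA', 12000], ['Jerry', 'Manager', 13000]]
df = pd.DataFrame(data, columns=['Name', 'Title', 'Salary'], index=['a', 'b', 'c'])

by_label = df.loc['b':'c']             # label slice: both 'b' and 'c' are included
by_position = df.iloc[1:3]             # positional slice: position 3 is excluded
high_paid = df[df['Salary'] > 11000]   # Boolean mask: keep rows where the condition holds

print(by_label)
print(high_paid)
```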

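Add row data: Using the append() function, you can add another DataFrame object’s rows to the end of the current DataFrame object. Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the example below only runs on older versions; on newer versions, use pd.concat() instead. Below is the example.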

>>> import pandas as pd
>>> 
>>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000]]
>>> 
>>> columns_array = ['Name','Title','Salary']
>>> 
>>> df = pd.DataFrame(data,columns = columns_array) # create the first DataFrame object.
>>> 
>>> print(df)
  Name      Title  Salary
0  Tom  Developer   10000
1  Bob         QA   12000
>>> 
>>> data1 = [['Jerry', 'Manager', 13000]] 
>>> 
>>> df1 = pd.DataFrame(data1,columns = columns_array) # create the second DataFrame object.
>>> 
>>> print(df1)
    Name    Title  Salary
0  Jerry  Manager   13000
>>> 
>>> df2 = df.append(df1) # append df1 to the end of df; the result is stored in df2 and keeps the original row labels.
>>>
>>> print(df2)
    Name      Title  Salary
0    Tom  Developer   10000
1    Bob         QA   12000
0  Jerry    Manager   13000
>>>
>>> df = df.append(df1, ignore_index = True, sort = True) # ignore_index = True discards the original row labels and creates a new index starting from 0; sort = True sorts the column labels.
>>> 
>>> print(df)
    Name  Salary      Title
0    Tom   10000  Developer
1    Bob   12000         QA
2  Jerry   13000    Manager
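
On pandas 2.0 and later, where append() is no longer available, the same result can be obtained with pd.concat(). The following is a minimal sketch using the same data; the variable name `combined` is chosen here for illustration.

```python
import pandas as pd

columns_array = ['Name', 'Title', 'Salary']
df = pd.DataFrame([['Tom', 'Developer', 10000], ['Bob', 'QA', 12000]], columns=columns_array)
df1 = pd.DataFrame([['Jerry', 'Manager', 13000]], columns=columns_array)

# concat() takes a list of DataFrame objects and stacks their rows.
# ignore_index=True builds a new row index starting from 0.
combined = pd.concat([df, df1], ignore_index=True)

print(combined)
```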

Delete row data: You can pass a row index label to the DataFrame object’s drop() method to delete a row of data. If several rows share that index label, they are all deleted together. Below is the example.

>>> print(df) # print out the original DataFrame object.
    Name  Salary      Title
0    Tom   10000  Developer
1    Bob   12000         QA
2  Jerry   13000    Manager
>>> 
>>> df.drop(0)  # drop the row whose index label is 0; drop() returns a new DataFrame object.
    Name  Salary    Title
1    Bob   12000       QA
2  Jerry   13000  Manager
>>> 
>>> print(df) # the original DataFrame object is not changed.
    Name  Salary      Title
0    Tom   10000  Developer
1    Bob   12000         QA
2  Jerry   13000    Manager
>>> 
>>> df.drop(0, inplace = True) # pass inplace = True when invoking the drop() method to modify the original DataFrame object.
>>> 
>>> print(df)
    Name  Salary    Title
1    Bob   12000       QA
2  Jerry   13000  Manager
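
The duplicate-label behavior mentioned above can be checked with a small sketch. The index `[0, 1, 0]` is constructed deliberately so that two rows share the label 0.

```python
import pandas as pd

# Two rows deliberately share the row label 0.
df = pd.DataFrame({'Name': ['Tom', 'Bob', 'Jerry']}, index=[0, 1, 0])

# drop(0) removes every row whose label is 0, not only the first one.
result = df.drop(0)

print(result)
```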

3. DataFrame Attributes & Methods.

T (Transpose): This attribute will return the transpose of a DataFrame object, that is, exchange the DataFrame object’s rows and columns.

>>> import pandas as pd
>>> 
>>> data = [['Tom', 'Developer', 10000],['Bob', 'QA', 12000],['Jerry', 'Manager', 13000]]
>>> 
>>> columns_array = ['Name','Title','Salary']
>>> 
>>> row_index_label_arr = ['a', 'b', 'c']
>>> 
>>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
>>> 
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> print(df.T) # exchange the DataFrame object's rows and columns
                a      b        c
Name          Tom    Bob    Jerry
Title   Developer     QA  Manager
Salary      10000  12000    13000
>>>

dtypes: Returns the data type of each column.

>>> print(df.dtypes)
Name      object
Title     object
Salary     int64
dtype: object

axes: Returns a list of row labels and column labels.

>>> print(df.axes)
[Index(['a', 'b', 'c'], dtype='object'), Index(['Name', 'Title', 'Salary'], dtype='object')]

empty: Returns a Boolean value that indicates whether the DataFrame object is empty. True means the object contains no data.

>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> print(df.empty)
False
>>> 
>>> df.drop('a', inplace=True)
>>> 
>>> df.drop('b', inplace=True)
>>> 
>>> df.drop('c', inplace=True)
>>> 
>>> print(df)
Empty DataFrame
Columns: [Name, Title, Salary]
Index: []
>>> 
>>> print(df.empty) # now the DataFrame object's empty attribute returns True.
True

ndim: Returns the number of dimensions of the data object. A DataFrame is a two-dimensional data structure, so the value is 2.
```
>>> print(df.ndim)
2
```
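size: Returns the number of elements in the DataFrame object, that is, the number of rows multiplied by the number of columns. It is 0 here because every row was dropped in the empty example above.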
```
>>> print(df.size)
0
```

shape: Returns a tuple (a, b) representing the DataFrame dimensions, where a is the number of rows and b is the number of columns.

>>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
>>> 
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> print(df.shape)
(3, 3)

values: Returns the data in the DataFrame object as a two-dimensional NumPy array.

>>> df = pd.DataFrame(data,columns = columns_array, index = row_index_label_arr)
>>> 
>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> print(df.values)
[['Tom' 'Developer' 10000]
 ['Bob' 'QA' 12000]
 ['Jerry' 'Manager' 13000]]
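
The to_numpy() method returns the same kind of array as the values attribute. The sketch below uses it to confirm the type and shape of the result; because the columns mix strings and integers, the array holds generic Python objects.

```python
import numpy as np
import pandas as pd

data = [['Tom', 'Developer', 10000], ['Bob', 'QA', 12000], ['Jerry', 'Manager', 13000]]
df = pd.DataFrame(data, columns=['Name', 'Title', 'Salary'], index=['a', 'b', 'c'])

arr = df.to_numpy()  # two-dimensional NumPy array; row and column labels are not included

print(isinstance(arr, np.ndarray))
print(arr.shape)
```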

head(n): Returns the first n rows of data. If n is not provided, the first 5 rows are returned.

tail(n): Returns the last n rows of data. The default value of n is 5.

>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> print(df.head(1))
  Name      Title  Salary
a  Tom  Developer   10000
>>> 
>>> print(df.tail(1))
    Name    Title  Salary
c  Jerry  Manager   13000
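
The default of 5 rows only becomes visible with a larger DataFrame. The following sketch uses a hypothetical single-column DataFrame with 8 rows.

```python
import pandas as pd

# Hypothetical DataFrame with 8 rows, numbered 0 to 7.
df = pd.DataFrame({'n': range(8)})

first_rows = df.head()    # no argument: the first 5 rows
last_rows = df.tail(2)    # explicit argument: the last 2 rows

print(len(first_rows))
print(last_rows['n'].tolist())
```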

shift(): Move rows or columns. It provides a periods parameter that represents moving steps on a specific axis. Below is the syntax format of the shift() function.

DataFrame.shift(periods=1, freq=None, axis=0, fill_value=<no_default>)

1. periods :  The type is int, which indicates the moving steps. It can be positive or negative. The default value is 1.

2. freq : Date offset. The default value is None. It applies only to time series data. The value can be an offset string that follows the pandas time rules, such as "D" for one day.

3. axis : If it is 0 or "index", it will move up and down. If it is 1 or "columns", it will move left and right.

4. fill_value : The value used to fill the empty positions that the shift introduces. If it is not provided, those positions are filled with a missing value such as NaN.

Below are examples of the shift() method.

>>> print(df)
    Name      Title  Salary
a    Tom  Developer   10000
b    Bob         QA   12000
c  Jerry    Manager   13000
>>> 
>>> df.shift(axis=0, periods=1) # every row moves down one position, so the first row becomes NaN and the last row's data is dropped.
  Name      Title   Salary
a  NaN        NaN      NaN
b  Tom  Developer  10000.0
c  Bob         QA  12000.0
>>> 
>>> df1 = df.shift(axis=1, periods=1) # every column moves one position to the right, so the first column becomes NaN.
>>> 
>>> df1
  Name  Title     Salary
a  NaN    Tom  Developer
b  NaN    Bob         QA
c  NaN  Jerry    Manager
>>> df1 = df.shift(axis=1, periods=1, fill_value='') # use empty string to replace the NaN value.
>>> 
>>> df1
  Name  Title     Salary
a         Tom  Developer
b         Bob         QA
c       Jerry    Manager
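
A common use of shift() is comparing each row with the one before it. The sketch below computes the difference between consecutive salaries; the column names `Previous` and `Change` are hypothetical and chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'Salary': [10000, 12000, 13000]}, index=['a', 'b', 'c'])

# Shift the column down by one row so each row sees the previous row's value.
df['Previous'] = df['Salary'].shift(1)

# The first row has no previous value, so its change is NaN.
df['Change'] = df['Salary'] - df['Previous']

print(df)
```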