How To Run Descriptive Statistics On Python Pandas DataFrame Object

This article briefly introduces the statistical functions commonly used in pandas and shows examples of applying them to DataFrame objects.

1. Python Pandas Statistics Function List.

Below is a list of commonly used pandas statistical functions.

abs(): Get absolute value.
corr(): Calculate the correlation coefficient between two Series or between the columns of a DataFrame. The value ranges from -1 to 1; the closer its absolute value is to 1, the stronger the correlation.
count(): Count the non-null values.

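cumprod(): Calculate the cumulative product. With axis=0 it accumulates down the rows of each column; with axis=1 it accumulates across the columns of each row.
cumsum(): Calculate the cumulative sum. With axis=0 it accumulates down the rows of each column; with axis=1 it accumulates across the columns of each row.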
max(): Get the maximum value.
mean(): Get the mean value.

median(): Get the median value.
min(): Get the minimum value.
prod(): Get the product of all values.
std(): Get the standard deviation value.

sum(): Calculate the sum of the values.
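
The short sketch below tries several of the functions listed above on a small numeric DataFrame. The column names a and b and their values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical numeric data, used only for illustration.
df = pd.DataFrame({'a': [1, -2, 3], 'b': [2, -4, 6]})

abs_df = df.abs()                  # element-wise absolute values
corr_ab = df['a'].corr(df['b'])    # Pearson correlation between two columns
counts = df.count()                # number of non-null values per column
cum_sum = df.cumsum()              # running total down each column (axis=0)
cum_prod = df.cumprod()            # running product down each column (axis=0)
col_prod = df.prod()               # product of all values in each column
col_median = df.median()           # median of each column

print(cum_sum)
print(cum_prod)
```

Because column b is exactly twice column a, the correlation between them comes out as 1 (up to floating-point rounding).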

2. Perform Aggregate Calculation Operation On DataFrame Object.

From the perspective of descriptive statistics, we can aggregate the data in a pandas DataFrame with methods such as sum() and mean().
For DataFrame objects, the axis parameter of these aggregation methods controls the direction of the calculation; if it is omitted, it defaults to 0.
The axis parameter can be passed in two ways.

To aggregate down the rows (one result per column), pass axis=0 or axis="index".
To aggregate across the columns (one result per row), pass axis=1 or axis="columns".
In other words, axis=0 calculates in the vertical direction, while axis=1 calculates in the horizontal direction.
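
As a minimal sketch of the two directions, the example below sums a small all-numeric DataFrame along each axis. The scores DataFrame and its column names are hypothetical and are not part of the examples in section 3.

```python
import pandas as pd

# Hypothetical all-numeric data, used only to illustrate the axis parameter.
scores = pd.DataFrame({'math': [90, 80], 'physics': [70, 60]})

# axis=0 (or axis='index'): walk down the rows, one total per column.
col_totals = scores.sum(axis=0)

# axis=1 (or axis='columns'): walk across the columns, one total per row.
row_totals = scores.sum(axis=1)

print(col_totals)
print(row_totals)
```

Here col_totals is indexed by the column names (math, physics), while row_totals is indexed by the row labels (0, 1).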

3. DataFrame Object Aggregate Calculation Operation Examples.

3.1 The Example Base DataFrame Structure Value.

The code below creates the base DataFrame object that the rest of the examples in this section use.

import pandas as pd
    
def run_statistics_function():
    
    # create the name column data.
    name_series = pd.Series(['Tom', 'Jerry', 'Mike'])
    
    # create the salary column data.
    salary_series = pd.Series([10000, 8000, 12000])
    
    # create the data dictionary object.
    account_dict = {'Name':name_series, 'Salary':salary_series}
    
    # create the DataFrame object based on the above python dictionary object.
    df = pd.DataFrame(account_dict)
    
    # print out the DataFrame object.
    print(df)
    
    # return the DataFrame object.
    return df

if __name__ == '__main__':
    
    run_statistics_function()

========================================================================
When you run the example source code above, you get the following DataFrame output.

    Name  Salary
0    Tom   10000
1  Jerry    8000
2   Mike   12000

3.2 describe().

The describe() method returns a summary of statistics for the DataFrame's columns. By default, it covers only the numeric columns when the DataFrame contains mixed data types.

import pandas as pd
    
def run_statistics_function():
    ......
    # return the DataFrame object.
    return df

if __name__ == '__main__':
    
    df = run_statistics_function()
    
    print(df.describe())

========================================================================
Below is the above code execution result.

    Name  Salary
0    Tom   10000
1  Jerry    8000
2   Mike   12000
        Salary
count      3.0
mean   10000.0
std     2000.0
min     8000.0
25%     9000.0
50%    10000.0
75%    11000.0
max    12000.0

With the include parameter of the describe() method, we can choose whether the summary covers string (object) columns or numeric columns.

    print(df.describe(include=['object']))
==========================================================
Below is the example execution output.

         Name
count       3
unique      3
top     Jerry
freq        1
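
The include parameter also accepts other selectors. The sketch below rebuilds the same Name and Salary data inline, rather than through run_statistics_function(), so that it runs on its own. It shows two further options: a NumPy type to keep only numeric columns, and the string 'all' to summarize every column.

```python
import numpy as np
import pandas as pd

# Same data as the base DataFrame in section 3.1, rebuilt inline.
df = pd.DataFrame({'Name': ['Tom', 'Jerry', 'Mike'],
                   'Salary': [10000, 8000, 12000]})

# Keep only numeric columns in the summary.
numeric_summary = df.describe(include=[np.number])

# Summarize every column; statistics that do not apply to a column show as NaN.
full_summary = df.describe(include='all')

print(numeric_summary)
print(full_summary)
```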

3.3 mean().

Calculate the mean (average) value of each numeric column.

import pandas as pd
    
def run_statistics_function():
    
    # create the name column data.
    name_series = pd.Series(['Tom', 'Jerry', 'Mike'])
    
    ......
    
    # print out the DataFrame object.
    print(df)
    
    # return the DataFrame object.
    return df

if __name__ == '__main__':
    
    df = run_statistics_function()
    
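    # numeric_only=True skips the Name column; pandas 2.0+ raises a TypeError without it.
    print('\r\n\r\n****** df.mean() ******\r\n', df.mean(numeric_only=True))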

=======================================================================

Below is the above source code execution result.

    Name  Salary
0    Tom   10000
1  Jerry    8000
2   Mike   12000

****** df.mean() ******

 Salary    10000.0
dtype: float64

3.4 std().

Calculate the standard deviation of each numeric column.

import pandas as pd
    
def run_statistics_function():
    
    # create the name column data.
    name_series = pd.Series(['Tom', 'Jerry', 'Mike'])
    
    ......
    
    # print out the DataFrame object.
    print(df)
    
    # return the DataFrame object.
    return df

if __name__ == '__main__':
    
    df = run_statistics_function()
    
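    # numeric_only=True skips the Name column; pandas 2.0+ raises a TypeError without it.
    print('\r\n\r\n****** df.std() ******\r\n', df.std(numeric_only=True))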

==========================================================================

Below is the output of the example source code above.

    Name  Salary
0    Tom   10000
1  Jerry    8000
2   Mike   12000

****** df.std() ******

 Salary    2000.0
dtype: float64
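
The value 2000.0 is the sample standard deviation, because std() in pandas uses ddof=1 by default and divides by n - 1. The sketch below, which uses only the Salary values from the example, compares it with the population standard deviation obtained by passing ddof=0.

```python
import pandas as pd

# Salary values from the base DataFrame.
salary = pd.Series([10000, 8000, 12000])

# Default ddof=1: sample standard deviation, divides by n - 1.
sample_std = salary.std()

# ddof=0: population standard deviation, divides by n.
population_std = salary.std(ddof=0)

print(sample_std)       # 2000.0
print(population_std)   # roughly 1632.99
```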

3.5 sum().

Calculate the sum of the values along the given axis.

import pandas as pd
    
def run_statistics_function():
    
    # create the name column data.
    name_series = pd.Series(['Tom', 'Jerry', 'Mike'])
    
   ......
    
    # print out the DataFrame object.
    print(df)
    
    # return the DataFrame object.
    return df

if __name__ == '__main__':
    
    df = run_statistics_function()
    
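    # axis=0 sums each column; the string values in Name are concatenated.
    print('\r\n\r\n****** df.sum(axis=0) ******\r\n', df.sum(axis=0))
    
    # axis=1 sums each row; numeric_only=True avoids adding a string to a number.
    print('\r\n\r\n****** df.sum(axis=1) ******\r\n', df.sum(axis=1, numeric_only=True))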

===============================================================================

Below is the example output.

    Name  Salary
0    Tom   10000
1  Jerry    8000
2   Mike   12000

****** df.sum(axis=0) ******

 Name      TomJerryMike
Salary           30000
dtype: object

****** df.sum(axis=1) ******

 0    10000
1     8000
2    12000
dtype: int64
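
Concatenating the Name strings is rarely what you want from a column total. The sketch below rebuilds the same data inline and shows two ways to restrict the sum to numeric columns: the numeric_only parameter and select_dtypes().

```python
import pandas as pd

# Same data as the base DataFrame in section 3.1, rebuilt inline.
df = pd.DataFrame({'Name': ['Tom', 'Jerry', 'Mike'],
                   'Salary': [10000, 8000, 12000]})

# Option 1: let sum() ignore the non-numeric columns.
numeric_total = df.sum(numeric_only=True)

# Option 2: select the numeric columns first, then sum them.
selected_total = df.select_dtypes(include='number').sum()

print(numeric_total)
print(selected_total)
```

Both results contain only the Salary entry, with a total of 30000.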