This article briefly introduces the statistical functions most commonly used in pandas and shows examples of applying them to DataFrame objects.
1. Python Pandas Statistics Function List.
- Below is a list of commonly used pandas statistical functions.
- abs(): Get absolute value.
- corr(): Calculate the correlation coefficient between Series or columns. The value ranges from -1 to 1; the larger the absolute value, the stronger the correlation.
- count(): Count the number of non-null values.
- cumprod(): Calculate the cumulative product. With axis=0 the product accumulates down the rows of each column; with axis=1 it accumulates across the columns of each row.
- cumsum(): Calculate the cumulative sum. With axis=0 the sum accumulates down the rows of each column; with axis=1 it accumulates across the columns of each row.
- max(): Get the maximum value.
- mean(): Get the mean value.
- median(): Get the median value.
- min(): Get the minimum value.
- prod(): Get the product of all values.
- std(): Get the standard deviation (the sample standard deviation by default).
- sum(): Calculate the sum of the values.
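As a quick illustration of several functions from the list above, here is a minimal sketch applied to a small made-up DataFrame. The column names a and b and their values are assumptions chosen only for this example.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"a": [1, -2, 3], "b": [2, -4, 6]})

abs_df = df.abs()                # element-wise absolute values
non_null = df.count()            # number of non-null values per column
running_sum = df.cumsum()        # cumulative sum down each column (axis=0)
running_prod = df.cumprod()      # cumulative product down each column (axis=0)
col_median = df.median()         # median of each column
col_prod = df.prod()             # product of all values in each column
corr_ab = df["a"].corr(df["b"])  # Pearson correlation between two Series

print(abs_df)
print(running_sum)
print(running_prod)
print(col_prod)
print(corr_ab)
```

Because column b is exactly twice column a in this sample, the correlation coefficient comes out as 1 (up to floating-point rounding).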
2. Perform Aggregate Calculation Operation On DataFrame Object.
- From the perspective of descriptive statistics, we can perform aggregation calculation and other operations on the pandas DataFrame structure, such as running the sum() and mean() methods.
- For DataFrame objects, the aggregate methods accept an axis parameter that selects the direction of the calculation; it defaults to axis=0.
- The axis parameter can be passed in two ways: as an integer or as a string.
- To aggregate down the rows and get one result per column, pass axis=0 or "index".
- To aggregate across the columns and get one result per row, pass axis=1 or "columns".
- In other words, axis=0 means to calculate in the vertical direction, while axis=1 means to calculate in the horizontal direction.
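The two axis directions can be sketched as follows. The scores DataFrame, its column names, and its values are hypothetical and contain only numbers, so that both directions can be summed.

```python
import pandas as pd

# Hypothetical numeric-only data, used only for illustration.
scores = pd.DataFrame(
    {"math": [90, 80], "art": [70, 60]},
    index=["Tom", "Jerry"],
)

# axis=0 (or "index"): calculate vertically, one result per column.
per_column = scores.sum(axis=0)
per_column_named = scores.sum(axis="index")

# axis=1 (or "columns"): calculate horizontally, one result per row.
per_row = scores.sum(axis=1)
per_row_named = scores.sum(axis="columns")

print(per_column)  # math 170, art 130
print(per_row)     # Tom 160, Jerry 140
```

The integer and string forms are interchangeable, so per_column equals per_column_named and per_row equals per_row_named.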
3. DataFrame Object Aggregate Calculation Operation Examples.
3.1 The Example Base DataFrame.
- Now let’s create a DataFrame object and use it to demonstrate the contents of this example.
- Below is the basic DataFrame object data that will be used in this example.
import pandas as pd

def run_statistics_function():
    # create the name column data.
    name_series = pd.Series(['Tom', 'Jerry', 'Mike'])
    # create the salary column data.
    salary_series = pd.Series([10000, 8000, 12000])
    # create the data dictionary object.
    account_dict = {'Name': name_series, 'Salary': salary_series}
    # create the DataFrame object based on the above python dictionary object.
    df = pd.DataFrame(account_dict)
    # print out the DataFrame object.
    print(df)
    # return the DataFrame object.
    return df

if __name__ == '__main__':
    run_statistics_function()
========================================================================
When you run the above example source code, you will get the below DataFrame data output.
    Name  Salary
0    Tom   10000
1  Jerry    8000
2   Mike   12000
3.2 describe().
- The describe() function displays summary statistics for the DataFrame columns; by default it covers only the numeric columns.
import pandas as pd

def run_statistics_function():
    # ...... (same DataFrame setup as in section 3.1)
    # return the DataFrame object.
    return df

if __name__ == '__main__':
    df = run_statistics_function()
    print(df.describe())
========================================================================
Below is the above code execution result.
    Name  Salary
0    Tom   10000
1  Jerry    8000
2   Mike   12000
        Salary
count      3.0
mean   10000.0
std     2000.0
min     8000.0
25%     9000.0
50%    10000.0
75%    11000.0
max    12000.0
- Using the include parameter of the describe() method, we can restrict the summary to string (object) columns or to numeric columns.
print(df.describe(include=['object']))
==========================================================
Below is the example execution output.
         Name
count       3
unique      3
top     Jerry
freq        1
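The include parameter accepts other selectors as well. The sketch below rebuilds the example DataFrame from section 3.1 inline and tries two of them, include=['number'] and include='all'; the variable names are chosen only for this illustration.

```python
import pandas as pd

# The same data as the example DataFrame in section 3.1.
df = pd.DataFrame({
    "Name": ["Tom", "Jerry", "Mike"],
    "Salary": [10000, 8000, 12000],
})

# Summarize only the numeric columns.
numeric_summary = df.describe(include=["number"])

# Summarize every column; statistics that do not apply to a column are NaN.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

With include='all', the result contains the string statistics (unique, top, freq) and the numeric statistics (mean, std, quartiles) as rows of one table.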
3.3 mean().
- Calculate the mean (average) value of each numeric column.
import pandas as pd

def run_statistics_function():
    # create the name column data.
    name_series = pd.Series(['Tom', 'Jerry', 'Mike'])
    # ...... (same as in section 3.1)
    # print out the DataFrame object.
    print(df)
    # return the DataFrame object.
    return df

if __name__ == '__main__':
    df = run_statistics_function()
    # numeric_only=True skips the non-numeric Name column; pandas 2.0+ raises a TypeError without it.
    print('\r\n\r\n****** df.mean() ******\r\n', df.mean(numeric_only=True))
=======================================================================
Below is the above source code execution result.
    Name  Salary
0    Tom   10000
1  Jerry    8000
2   Mike   12000

****** df.mean() ******
Salary    10000.0
dtype: float64
3.4 std().
- Calculate the standard deviation of each numeric column.
import pandas as pd

def run_statistics_function():
    # create the name column data.
    name_series = pd.Series(['Tom', 'Jerry', 'Mike'])
    # ...... (same as in section 3.1)
    # print out the DataFrame object.
    print(df)
    # return the DataFrame object.
    return df

if __name__ == '__main__':
    df = run_statistics_function()
    # numeric_only=True skips the non-numeric Name column; pandas 2.0+ raises a TypeError without it.
    print('\r\n\r\n****** df.std() ******\r\n', df.std(numeric_only=True))
==========================================================================
The above example source code output.
    Name  Salary
0    Tom   10000
1  Jerry    8000
2   Mike   12000

****** df.std() ******
Salary    2000.0
dtype: float64
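The value 2000.0 is the sample standard deviation, because std() divides by n - 1 by default (ddof=1). As a small sketch using the same salary values, passing ddof=0 gives the population standard deviation instead; the variable names are chosen only for this illustration.

```python
import pandas as pd

# The salary values from the example DataFrame.
salary = pd.Series([10000, 8000, 12000])

# Default: sample standard deviation (divides by n - 1).
sample_std = salary.std()

# ddof=0: population standard deviation (divides by n).
population_std = salary.std(ddof=0)

print(sample_std)      # 2000.0
print(population_std)  # about 1632.99
```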
3.5 sum().
- Calculate the sum of the values along the given axis.
import pandas as pd

def run_statistics_function():
    # create the name column data.
    name_series = pd.Series(['Tom', 'Jerry', 'Mike'])
    # ...... (same as in section 3.1)
    # print out the DataFrame object.
    print(df)
    # return the DataFrame object.
    return df

if __name__ == '__main__':
    df = run_statistics_function()
    print('\r\n\r\n****** df.sum(axis=0) ******\r\n', df.sum(axis=0))
    # numeric_only=True is needed for the row-wise sum, which cannot add a Name string to a Salary number.
    print('\r\n\r\n****** df.sum(axis=1) ******\r\n', df.sum(axis=1, numeric_only=True))
===============================================================================
Below is the example output.
    Name  Salary
0    Tom   10000
1  Jerry    8000
2   Mike   12000

****** df.sum(axis=0) ******
Name      TomJerryMike
Salary           30000
dtype: object

****** df.sum(axis=1) ******
0    10000
1     8000
2    12000
dtype: int64
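In the column-wise result above, the Name strings are concatenated into TomJerryMike, which is rarely what you want. One way to avoid this, sketched below with the same data as section 3.1, is to sum a single column or to pass numeric_only=True; the variable names are chosen only for this illustration.

```python
import pandas as pd

# The same data as the example DataFrame in section 3.1.
df = pd.DataFrame({
    "Name": ["Tom", "Jerry", "Mike"],
    "Salary": [10000, 8000, 12000],
})

# Sum one numeric column directly.
salary_total = df["Salary"].sum()

# Sum all numeric columns and skip the string column.
numeric_totals = df.sum(numeric_only=True)

print(salary_total)    # 30000
print(numeric_totals)  # Salary 30000
```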