I. Introduction to Pandas
1. Introduction
Pandas is an open-source Python data analysis library that provides high-performance, easy-to-use data structures and data analysis tools. The name Pandas comes from “Panel Data,” and it is primarily used for processing structured data, such as tabular data and time-series data. Built on top of the NumPy library, Pandas inherits NumPy’s high-performance array computation capabilities and adds support for complex data operations. It provides two main data structures: Series (one-dimensional arrays) and DataFrame (two-dimensional tables), making data processing more efficient and convenient.
Pandas was initially created by Wes McKinney in 2008 to meet the needs of financial data analysis. Over time, it has become an indispensable part of the Python data analysis ecosystem, widely used in data science, machine learning, finance, statistics, and many other fields.
2. Main Features
(I) Powerful data structures
- DataFrame : This is one of the core data structures in Pandas. It’s a two-dimensional tabular data structure, similar to a table in Excel. Each column of a DataFrame can store different types of data (such as numbers, strings, etc.), and the data can be accessed through column names and row indices. For example, we can create a DataFrame containing employee information (name, age, department, etc.), and retrieve all employee names by the column name “name,” or retrieve all information for a specific employee by the row index.
- Series : A Series is a one-dimensional array, similar to a NumPy array, but it carries an index. The index labels can be of any hashable type, such as integers or strings. Series are very convenient for processing single-column data. For example, when operating on a set of time series data (such as stock prices), the price at a specific point in time can be quickly accessed through the date index.
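The two structures described above can be sketched in a few lines; the employee names and prices here are made up purely for illustration:

```python
import pandas as pd

# A DataFrame holding employee records (hypothetical data)
employees = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [28, 35, 31],
    'department': ['Sales', 'IT', 'Sales']
})
print(employees['name'])    # access a column by name
print(employees.loc[1])     # access a row by index label

# A Series with a date index, e.g. daily closing prices
prices = pd.Series(
    [10.2, 10.5, 10.1],
    index=pd.to_datetime(['2024-01-02', '2024-01-03', '2024-01-04'])
)
print(prices['2024-01-03'])  # fast lookup by date
```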
(II) Flexible data manipulation
- Data cleaning : Pandas provides rich data cleaning functions that can easily handle missing values (such as filling and deleting missing values), duplicate data (such as deleting duplicate rows), outliers, etc. For example, for a DataFrame containing missing values, we can use the fillna() method to fill missing values, or use the dropna() method to delete rows containing missing values.
- Data filtering : Data can be filtered using various methods such as Boolean indexes, label indexes, and position indexes. For example, we can use the Boolean expression `df[df['age'] > 30]` to filter out employee data whose age is greater than 30.
- Data merging : Pandas supports various data merging operations, such as `concat()` (for joining multiple data structures), `merge()` (similar to the SQL `JOIN` operation, which merges two DataFrames based on a specified key), and `join()` (merging based on an index). These features make it easy to integrate multiple data sources. For example, merging the employee information table and the employee performance table based on employee ID.
- Data grouping and aggregation : The `groupby()` method can be used to group data, and then aggregate functions (such as summation, average, count, etc.) can be applied to each group. For example, sales data can be grouped by region, and then the total sales for each region can be calculated.
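The region example in the last bullet might look like this (the sales figures are invented for illustration):

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South'],
    'sales': [100, 80, 150, 120]
})
# Group by region, then sum the sales within each group
totals = sales.groupby('region')['sales'].sum()
print(totals)
```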
(III) Time series analysis support
Pandas offers robust support for time series data. It can directly parse date and time strings to generate time series indexes. For example, `pd.date_range(start='2024-01-01', end='2024-12-31', freq='D')` generates a daily time series index from January 1, 2024 to December 31, 2024. Furthermore, Pandas supports operations such as time frequency conversion (e.g., converting daily data to monthly data) and time window calculations (e.g., calculating moving averages), making it extremely useful in fields involving time series data analysis, such as financial data analysis and meteorological data analysis.
(IV) Compatibility with other libraries
Pandas works seamlessly with other data analysis and scientific computing libraries in Python. For example, it can perform efficient array operations with NumPy, data visualization with Matplotlib, and training and predicting machine learning models with Scikit-learn. This excellent compatibility allows Pandas to be integrated into complete data analysis workflows.
3. Application Scenarios
(I) Financial Sector
In financial data analysis, Pandas can be used to process stock price data, trading data, and more. For example, by reading time-series stock price data, indicators such as stock returns and moving averages can be calculated. Return data from multiple stocks can also be merged to analyze correlations between them. Furthermore, Pandas can be used for risk management, such as identifying abnormal trading behavior by analyzing trading data.
(II) Data Science and Machine Learning
In data science projects, Pandas is an essential tool for data preprocessing. Data scientists can use Pandas to load data (e.g., from CSV files, databases, etc.), clean data (handling missing values, outliers, etc.), and explore data (calculating descriptive statistics, plotting data distribution, etc.). In machine learning, Pandas can divide datasets into training and testing sets, and can also perform operations such as standardization and normalization on feature data, preparing it for training machine learning models.
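The split-and-normalize step described above can be sketched as follows; the column names and the 80/20 ratio are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({'feature': [1.0, 2.0, 3.0, 4.0, 5.0],
                   'label': [0, 1, 0, 1, 0]})

# 80/20 train/test split by random sampling
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

# Min-max normalization of a feature column to [0, 1]
col = df['feature']
df['feature_norm'] = (col - col.min()) / (col.max() - col.min())
print(df)
```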
(III) Business Data Analysis
Business analysts can use Pandas to analyze sales data, customer data, and more. For example, by analyzing time series of sales data, seasonal patterns in sales can be discovered; by analyzing customer data, customers can be segmented (e.g., using the groupby() method to group customers by spending amount) and characteristics of each customer group can be calculated (such as average spending amount, purchase frequency, etc.). These analytical results can help businesses develop marketing strategies and optimize product offerings.
(IV) Biomedical Research
In the biomedical field, Pandas can be used to process experimental data, such as gene expression data and clinical trial data. Researchers can use Pandas to format data, screen specific samples or genes, and perform statistical analyses, such as calculating differences in gene expression.
II. Common Uses of the Pandas Library
1. Data Reading
1.1 Reading CSV files
Pandas can load a CSV file and make it available for programmatic manipulation in Python. At its core, Pandas uses the DataFrame object type, essentially a table of values with a label for each row and column. The read_csv() method is used to load CSV files:
import pandas as pd
# Read CSV file
df = pd.read_csv('ReportData.csv', encoding='utf-8')
# Display the top 5 rows of data
print(df.head())
1.2 Reading Excel files
- The pd.read_excel() method can be used to read Excel files (.xls or .xlsx format).
import pandas as pd
# Read Excel file
# Note: read_excel does not take an encoding parameter;
# reading .xlsx files requires an engine such as openpyxl
df = pd.read_excel('ReportData.xlsx')
# Display the top 5 rows of data
print(df.head())
1.3 Reading Database Data
- You can use the pd.read_sql() method to read data from the database.
import pandas as pd
import sqlite3
# Connect to SQLite database
conn = sqlite3.connect('economics_data.db')
# Read data from database tables
df = pd.read_sql('SELECT * FROM table_name', conn)
1.4 Reading JSON Data
- The pd.read_json() method can be used to read JSON files.
import pandas as pd
df = pd.read_json("ReportData.json", orient="records")
Parameters:
- orient : Specifies the layout of the JSON. Common values include:
- 'records': the JSON is a list of records, where each record is a dictionary of {column: value}.
- 'index': the JSON is a dictionary of {index: {column: value}}.
- 'columns': the JSON is a dictionary of {column: {index: value}}; this is the default when reading a DataFrame.
- 'values': the JSON is a two-dimensional array, with each row representing a record.
- 'split': the JSON is a dictionary with 'index', 'columns', and 'data' keys.
import pandas as pd
# Example JSON data
json_data = {
"index": ["row1", "row2"],
"columns": ["col1", "col2"],
"data": [[1, 2], [3, 4]]
}
# Read with orient='split' (the JSON string is wrapped in StringIO,
# since recent pandas versions deprecate passing literal JSON strings directly)
import json
from io import StringIO
df = pd.read_json(StringIO(json.dumps(json_data)), orient='split')
print(df)
2. Writing Data
2.1 Writing to a CSV file
- The DataFrame.to_csv() method can be used to save a DataFrame as a CSV file.
import pandas as pd
# create a DataFrame
data = {
'name': ['Alice', 'Bob'],
'age': [25, 30]
}
df = pd.DataFrame(data)
# Write data to CSV file
# index=False: do not save the row index
df.to_csv('output.csv', index=False)
Parameter description:
- index: Whether to save the row index, the default is True.
- sep: Specifies the separator, which defaults to a comma (,).
2.2 Writing to an Excel file
- The DataFrame.to_excel() method can be used to save a DataFrame as an Excel file.
import pandas as pd
# Create a DataFrame
data = {
'name': ['Alice', 'Bob', 'Gary', 'Lily'],
'age': [25, 30, 24, 26],
'Grade': [97, 76, 61, 88]
}
df = pd.DataFrame(data)
# Write to Excel file
df.to_excel('exam_result.xlsx', sheet_name='Sheet1', index=False)
Parameter description:
- sheet_name: Specifies the name of the worksheet to be saved.
- index: Whether to save the row index.
2.3 Writing to the database
- The DataFrame.to_sql() method can be used to save a DataFrame to a database.
import pandas as pd
from sqlalchemy import create_engine
# Create a DataFrame
data = {
'name': ['Alice', 'Bob', 'Gary', 'Lily'],
'age': [25, 30, 24, 26],
'gender': ['F', 'M', 'M', 'F'],
'Grade': [97, 76, 61, 88]
}
df = pd.DataFrame(data)
# Create a database connection for SQLAlchemy
conn = create_engine('mysql+pymysql://root:123456@localhost:3306/exam_db?charset=utf8')
# Write to database
df.to_sql('exam_result', conn, if_exists='replace', index=False)
Parameter description:
- The first parameter is the name of the target database table.
- `conn` is a database connection object, and it must be a SQLAlchemy database connection.
- `if_exists`: Specifies the behavior if the table already exists. Possible values are 'fail' (raise an error), 'replace' (replace the table), and 'append' (append data).
- `index`: Whether to save the row index as a column in the database.
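The MySQL connection string above assumes a running server and valid credentials; for a quick, self-contained experiment, the same to_sql() call works against an in-memory SQLite engine (assuming SQLAlchemy is installed):

```python
import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})

# An in-memory SQLite database needs no server
engine = create_engine('sqlite:///:memory:')
df.to_sql('exam_result', engine, if_exists='replace', index=False)

# Read the table back to confirm the round trip
result = pd.read_sql('SELECT * FROM exam_result', engine)
print(result)
```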
3. Data Cleaning
3.1 Handling Missing Values
(1) Check for missing values
- The isnull() or isna() methods can be used to check for empty values in the data.
# Return a Boolean DataFrame in which missing values are True
print(df.isnull())
# Count the number of empty values in each column
print(df.isnull().sum())
(2) Fill missing values
- The fillna() method can be used to fill in missing values.
# Fill all empty values with 0
df.fillna(value=0, inplace=True)
# Fill a column's missing values with its mean
# (assignment is preferred over inplace=True on a column, which may not
# update the parent DataFrame in recent pandas versions)
df['column1'] = df['column1'].fillna(df['column1'].mean())
(3) Remove missing values
- The dropna() method can be used to delete rows or columns containing empty values.
# Delete rows containing empty values
df.dropna(inplace=True)
# Delete columns containing empty values
df.dropna(axis=1, inplace=True)
3.2 Handling Duplicate Data
(1) Check for duplicate data
- The duplicated() method can be used to check for duplicate rows.
# Return a Boolean Series in which duplicate rows are True
print(df.duplicated())
# Count the number of duplicate rows
print(df.duplicated().sum())
(2) Delete duplicate data
- The drop_duplicates() method can be used to remove duplicate rows.
# Delete duplicate rows
df.drop_duplicates(inplace=True)
3.3 Data Type Conversion
- The astype() method can be used to convert the data type of a column to a specified type.
# Convert columns to integer type
df['column1'] = df['column1'].astype('int')
# Convert columns to floating-point type
df['column1'] = df['column1'].astype('float')
# Convert columns to string type
df['column1'] = df['column1'].astype('str')
3.4 Renaming Columns
- The rename() method can be used to rename column names.
- Usage: df.rename(columns={'old column name 1': 'new column name 1', 'old column name 2': 'new column name 2'}, inplace=True)
# stock_data is assumed to be a list of row records (e.g., fetched from a market-data API)
stock_data_frame = pd.DataFrame(stock_data).rename(
columns={
0: 'SECURITY_CODE',
1: 'SECURITY_NAME',
2: 'LATEST_PRICE',
3: 'PRICES_CHANGE_RATIO',
4: 'PRICES_CHANGE',
5: 'TRADING_COUNT',
6: 'TRADING_VALUE',
7: 'AMPLITUDE',
8: 'TURN_OVER_RATIO',
9: 'PE_RATIO',
10: 'TRADE_RATIO',
11: 'MAX_PRICE',
12: 'MIN_PRICE',
13: 'OPEN_PRICE',
14: 'PRE_CLOSE_PRICE',
15: 'TOTAL_MARKET_VALUE',
16: 'CIRCULATION_MARKET_VALUE',
17: 'PB_RATIO',
18: 'PE_RATIO_TTM',
19: 'UPWARD_RATE',
20: 'MONTH_CHANGE_RATIO',
21: 'YEAR_CHANGE_RATIO',
22: 'INFLOW_FUNDS'
}
)
Example: we have a CSV file of national economic data that contains missing values and duplicate rows; it needs cleaning before further analysis.
import pandas as pd
# Read data
data = pd.read_csv('ReportData.csv', encoding='utf-8')
# Display the top five rows of data
print(data.head())
# Handle missing values (the two options below are alternatives; pick one)
# Option 1: drop rows that contain missing values
data = data.dropna()
# Option 2: fill missing values with 0
# data = data.fillna(0)
# Remove duplicate values
data = data.drop_duplicates()
# Display the top 5 processed data items
print(data.head())
4. Data Viewing and Statistics
4.1 Examining the Data Structure
(1) View the top few rows of the DataFrame
- The head() method can be used to view the top few rows of a DataFrame; by default, the top 5 rows are displayed.
# View the first 5 rows
print(df.head())
# View the first 10 rows
print(df.head(10))
(2) View the last few rows of the DataFrame
- The tail() method can be used to view the last few rows of a DataFrame; by default, it displays the last 5 rows.
# View the last 5 rows
print(df.tail())
# View the last 10 rows
print(df.tail(10))
(3) View the number of rows and columns (shape) of the DataFrame
- The shape property can be used to get the shape of a DataFrame (number of rows and columns).
# Output the number of rows and columns
print(df.shape)
(4) View the column names of the DataFrame
- The column names of a DataFrame can be obtained using the columns attribute.
# Output column names
print(df.columns)
(5) Check the data type of the DataFrame
- You can use the dtypes property to view the data type of each column.
# Output the data type of each column
print(df.dtypes)
4.2 Data Statistics
(1) Descriptive statistics
- The describe() method can be used to obtain descriptive statistics of a DataFrame, including mean, standard deviation, minimum, quartiles, etc.
# Only numerical columns are counted by default
print(df.describe())
# Perform statistics on all columns
print(df.describe(include='all'))
(2) Calculation of specific statistics
- You can use methods such as mean(), median(), std(), sum(), and count() to calculate specific statistics.
- mean(): computes the mean
- median(): computes the median
- std(): computes the standard deviation
- sum(): computes the sum
- count(): counts the number of non-missing values
# Calculate the mean of a certain column
df['column1'].mean()
# Calculate the total sum of a column
df['column1'].sum()
(3) Calculate the correlation coefficient
- The corr() method can be used to calculate the correlation coefficient between numerical columns in a DataFrame.
# Output the correlation coefficient matrix
# (in recent pandas versions, pass numeric_only=True if non-numeric columns are present)
print(df.corr())
- Example:
import pandas as pd
import numpy as np
# create a demo DataFrame
data = {
'A': np.random.randn(100), # Randomly generate 100 numbers of standard normal distributions
'B': np.random.rand(100), # Randomly generate 100 numbers uniformly distributed in the [0,1) interval
'C': np.random.randint(0, 100, 100), # Randomly generate 100 integers in [0, 100)
'D': np.random.randn(100) # Randomly generate 100 numbers of standard normal distributions
}
df = pd.DataFrame(data)
# Calculate the correlation coefficient of all numerical columns in the DataFrame
correlation_matrix = df.corr()
print("Correlation Matrix:")
print(correlation_matrix)
(4) Calculate the data distribution
- The value_counts() method can be used to count the number of times each value appears in a column.
# Count the occurrence frequency of each value in a certain column
print(df['column1'].value_counts())
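A quick self-contained check of value_counts() on invented data:

```python
import pandas as pd

s = pd.Series(['red', 'blue', 'red', 'green', 'red'])
# Occurrences of each value, sorted by frequency
counts = s.value_counts()
print(counts)
```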
5. Data Filtering
5.1 Filtering by Boolean Expression
- Pandas supports filtering DataFrames using Boolean expressions.
# Filter a column for rows greater than 10
filtered_df = df[df['column1'] > 10]
# Use multiple criteria for filtering
filtered_df = df[(df['column1'] > 10) & (df['column2'] < 20)]
- Example: filter stocks that satisfy several criteria at once. Note that in this data source the percentage, price, and ratio fields appear to be stored scaled by 100, so a 5% turnover rate is stored as 500 and a 30-yuan price as 3000.
# Stock-screening conditions:
# - Volume ratio > 1
# - Turnover rate between 5% and 10%
# - Amplitude between 3% and 10%
# - Circulating market value between 5 billion and 20 billion yuan
# - Dynamic PE ratio > 0 and <= 40
# - Latest price <= 30 yuan
# - Fund inflow on the day > 5 million
filtered = stock_data_frame[
    # Volume ratio > 1
    (stock_data_frame['TRADE_RATIO'] > 100) &
    # Turnover rate between 5% and 10%
    (stock_data_frame['TURN_OVER_RATIO'] >= 500) &
    (stock_data_frame['TURN_OVER_RATIO'] <= 1000) &
    # Amplitude between 3% and 10%
    (stock_data_frame['AMPLITUDE'] >= 300) &
    (stock_data_frame['AMPLITUDE'] <= 1000) &
    # Circulating market value between 5 billion and 20 billion yuan
    (stock_data_frame['CIRCULATION_MARKET_VALUE'] >= 5000000000) &
    (stock_data_frame['CIRCULATION_MARKET_VALUE'] <= 20000000000) &
    # Dynamic PE ratio > 0 and <= 40
    (stock_data_frame['PE_RATIO'] > 0) &
    (stock_data_frame['PE_RATIO'] <= 4000) &
    # Latest price <= 30 yuan
    (stock_data_frame['LATEST_PRICE'] <= 3000) &
    # Fund inflow on the day > 5 million
    (stock_data_frame['INFLOW_FUNDS'] > 5000000)
]
5.2 Filtering by Index
- In Pandas, you can use the loc and iloc indexers to filter by index.
- loc: label-based indexing; selects rows and columns by their labels.
- iloc: position-based indexing; selects rows and columns by integer position.
# Using loc
# Select rows 2 to 5 by label, with the column names specified
selected_df = df.loc[df.index[1:5], ['column1', 'column2']]
# Using iloc
# Select rows 2 to 5 by position, columns 1 and 2
selected_df = df.iloc[1:5, [0, 1]]
5.3 Filtering by Column Name
- In Pandas, you can select columns directly by their names.
# Select Single Column
selected_column = df['column1']
# Select multiple columns
selected_columns = df[['column1', 'column2']]
6. Data Conversion and Processing
6.1 Data Sorting
(1) Sort by column value
- The sort_values() method can be used to sort a DataFrame by column values.
# Sort in ascending order by a certain column
df.sort_values(by='column1', ascending=True, inplace=True)
# Sort by Multiple Columns
df.sort_values(by=['column1', 'column2'], ascending=[True, False], inplace=True)
(2) Sort by index
- The sort_index() method can be used to sort a DataFrame by its index.
# Sort by index
df.sort_index(inplace=True)
6.2 Data Grouping and Aggregation
(1) Data grouping
- In Pandas, the groupby() method can be used to group data.
# Group by a certain column
grouped = df.groupby('column1')
# Group by Multiple Columns
grouped = df.groupby(['column1', 'column2'])
(2) Data aggregation
- In Pandas, aggregate functions (such as sum(), mean(), count(), etc.) are used to perform calculations on grouped data.
# Sum up the grouped data
result = grouped.sum()
# Calculate the mean of the grouped data
result = grouped.mean()
# Apply different aggregation functions to different columns
result = grouped.agg({'column1': 'sum', 'column2': 'mean'})
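Putting grouping and aggregation together in one runnable sketch (the department data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Sales', 'IT', 'Sales', 'IT'],
    'salary': [5000, 7000, 6000, 8000],
    'age': [28, 35, 31, 40]
})
# Apply a different aggregate function to each column
result = df.groupby('department').agg({'salary': 'sum', 'age': 'mean'})
print(result)
```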
6.3 Pivot Tables
(1) Data Reshaping
- In Pandas, the pivot() function is used to convert long format data to wide format.
import pandas as pd
import numpy as np
# Create sample data
data = {
'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
'var': ['A', 'B', 'A', 'B'],
'values': [10, 20, 30, 40]
}
df = pd.DataFrame(data)
# Reshaping with pivot
pivot_df = df.pivot(index='date', columns='var', values='values')
print("Pivot Result:\n", pivot_df)
Parameter description:
- index: Specifies the column to be used as the row index.
- columns: Specifies the columns to be used as column names.
- values: Specifies the column to fill with values
(2) Data Pivot
- Pivot tables can be created using the pivot_table() method in Pandas.
pivot_table = df.pivot_table(index='column1', columns='column2', values='column3', aggfunc='sum')
# Create data with duplicate values
data = {
'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'var': ['A', 'B', 'A', 'B'],
'values': [10, 20, 30, 40]
}
df = pd.DataFrame(data)
# Aggregate using pivot table
pivot_table_df = pd.pivot_table(df, values='values', index='date', columns='var', aggfunc='mean')
print("\nPivot Table Result:\n", pivot_table_df)
Parameter description:
- aggfunc: Specifies the aggregate function; defaults to 'mean'.
- It can handle duplicate values and aggregate combinations of the same indexes and columns.
7. Data Merging and Joining
7.1 Data Merging
- In Pandas, the merge() method can be used to merge two DataFrames by a specified key, similar to the JOIN operation in SQL.
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})
# Inner join
merged_df = pd.merge(df1, df2, on='key', how='inner')
# Outer join
merged_df = pd.merge(df1, df2, on='key', how='outer')
# Left join
merged_df = pd.merge(df1, df2, on='key', how='left')
# Right join
merged_df = pd.merge(df1, df2, on='key', how='right')
7.2 Data Concatenation
- In Pandas, the concat() method can be used to concatenate multiple DataFrames by rows or by columns.
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# Concatenate by rows
concatenated_df = pd.concat([df1, df2], axis=0)
# Concatenate by columns
concatenated_df = pd.concat([df1, df2], axis=1)
8. Time Series Data Processing
8.1 Creating Time Series
- You can use pd.date_range() to create a time series index.
# Daily time series
date_range = pd.date_range(start='2024-01-01', end='2024-12-31', freq='D')
# Create a time series DataFrame
df = pd.DataFrame({'value': range(len(date_range))}, index=date_range)
8.2 Time-Frequency Conversion
- The resample() method can be used to perform frequency transformation on time series data.
# Convert daily data to monthly data and sum within each month
# ('M' means month-end; pandas 2.2+ prefers the alias 'ME')
monthly_df = df.resample('M').sum()
8.3 Time Window Calculation
- The rolling() method can be used to perform window calculations (such as moving averages) on time series data.
# Calculate the 7-day moving average
df['moving_avg'] = df['value'].rolling(window=7).mean()
When working with stock quotes, the rolling() method can be used to calculate moving averages (MA).
# Calculate moving average
stock_data['ma5'] = stock_data['close'].rolling(5).mean()
stock_data['ma10'] = stock_data['close'].rolling(10).mean()
stock_data['ma20'] = stock_data['close'].rolling(20).mean()
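A self-contained version of the moving-average calculation, using synthetic closing prices in place of real quotes:

```python
import pandas as pd

# Synthetic daily closing prices (illustrative only)
stock_data = pd.DataFrame({'close': [10.0, 10.5, 10.2, 10.8, 11.0, 10.9, 11.2]})

# 5-day moving average; the first 4 rows are NaN until the window fills
stock_data['ma5'] = stock_data['close'].rolling(5).mean()
print(stock_data)
```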