Table of Contents
- Introduction
- Prerequisites
- Installation
- Importing Required Libraries
- Loading and Preparing Time Series Data
- Exploratory Data Analysis
- Stationarity
- Decomposition
- Autocorrelation and Partial Autocorrelation
- Modeling
- Making Predictions
- Model Evaluation
- Conclusion
Introduction
In this tutorial, we will explore time series analysis using Python and the Statsmodels library. Time series analysis is a statistical technique for analyzing and forecasting data points collected over time. It is widely used in various domains, including finance, economics, weather forecasting, and more.
By the end of the tutorial, you will learn how to:
- Load and prepare time series data
- Perform exploratory data analysis
- Test for stationarity
- Decompose time series into trend, seasonal, and residual components
- Analyze autocorrelation and partial autocorrelation
- Build time series models using AR, MA, and ARIMA models
- Make predictions using the fitted models
- Evaluate the performance of the models
Let’s get started!
Prerequisites
To follow along with this tutorial, you should have a basic understanding of the Python programming language and fundamental concepts of statistics. Familiarity with the pandas library will also be helpful for data manipulation tasks.
Installation
Before we begin, let’s ensure that we have the necessary libraries installed. Open your terminal and run the following command to install the required libraries:
```bash
pip install pandas statsmodels matplotlib
```
The above command will install the pandas, statsmodels, and matplotlib libraries, which are essential for time series analysis. If you are using Jupyter Notebook, you can run the command directly in a code cell by prefixing it with an exclamation mark (!pip install ...).
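To confirm the installation succeeded, you can print the installed versions; a quick sanity check (the exact version numbers will vary by environment):

```python
import pandas
import statsmodels
import matplotlib

# Print the installed versions; exact numbers will vary by environment
print('pandas:', pandas.__version__)
print('statsmodels:', statsmodels.__version__)
print('matplotlib:', matplotlib.__version__)
```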
Importing Required Libraries
Once we have installed the necessary libraries, we can import them into our Python script or Jupyter Notebook. Open your favorite text editor or Jupyter Notebook, and let’s start by importing the required libraries:
```python
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
```
We have imported pandas as pd, statsmodels.api as sm, and matplotlib.pyplot as plt. These libraries will be used for data manipulation, statistical modeling, and visualization, respectively.
Loading and Preparing Time Series Data
To perform time series analysis, we need a dataset that contains time-stamped observations. In this tutorial, we will use a sample dataset available in the Statsmodels library called “macrodata.” The macrodata dataset contains various macroeconomic variables collected over time.
We can load the macrodata dataset using the following code:
```python
macro_data = sm.datasets.macrodata.load_pandas().data
```
The macrodata dataset will be loaded into a Pandas DataFrame called “macro_data.”
Next, we need to prepare our time series data before proceeding with the analysis. Typically, a time series dataset includes a date or timestamp column and a corresponding value column. In our case, the timestamp information is split across the “year” and “quarter” columns of the macro_data DataFrame.
To ensure that pandas recognizes our data as a time series, we will convert the “year” and “quarter” columns into a proper datetime index. Here’s how you can do it:
```python
macro_data['date'] = pd.date_range(start='1959Q1', periods=len(macro_data), freq='Q')
macro_data.set_index('date', inplace=True)
```
In the code above, we create a new “date” column using the pd.date_range() function, specifying the start date as ‘1959Q1’, the number of periods as the number of rows in the macro_data DataFrame, and the frequency as ‘Q’ (quarterly). Finally, we set the “date” column as the index of the macro_data DataFrame using the set_index() method.
Exploratory Data Analysis
Once we have prepared our time series data, it’s a good practice to perform exploratory data analysis (EDA) to gain insights into the dataset. EDA involves visualizing and summarizing the data to understand its properties and patterns.
Let’s start by plotting the time series data using Matplotlib:
```python
plt.figure(figsize=(10, 6))
plt.plot(macro_data.index, macro_data['infl'], label='Inflation')
plt.xlabel('Year')
plt.ylabel('Inflation')
plt.title('Inflation Over Time')
plt.legend()
plt.show()
```
In the code above, we create a figure with a size of 10 (width) by 6 (height) using plt.figure(). Then, we plot the inflation data by passing the index and the ‘infl’ column of the macro_data DataFrame to plt.plot(). We add labels and a title to the plot using plt.xlabel(), plt.ylabel(), and plt.title(). Finally, we display the plot using plt.show().
Running the code above will generate a line plot showing the inflation over time.
Stationarity
Stationarity is an essential concept in time series analysis. A stationary time series is one whose statistical properties, such as mean and variance, remain constant over time. Stationarity allows us to model the time series data more accurately.
We can check for stationarity in our data using the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is that the series contains a unit root (i.e., is non-stationary). Here’s how you can perform the ADF test using Statsmodels:
```python
adf_result = sm.tsa.stattools.adfuller(macro_data['infl'])
print('ADF statistic:', adf_result[0])
print('p-value:', adf_result[1])
print('Critical values:', adf_result[4])
```
In the code above, we pass the inflation data to the sm.tsa.stattools.adfuller() function, which performs the ADF test. The function returns a tuple containing the ADF statistic, the p-value, the critical values, and other information.
By printing the ADF statistic, p-value, and critical values, we can assess whether our data is stationary or not. If the p-value is less than a significance level (e.g., 0.05), we can reject the null hypothesis of non-stationarity and conclude that our data is stationary.
Decomposition
Time series data can often exhibit a combination of various patterns, including trends, seasonality, and residual noise. Decomposition helps us separate these individual components to better understand the underlying patterns.
We can decompose our time series data using the sm.tsa.seasonal_decompose() function. Here’s an example of how to decompose the inflation data into trend, seasonal, and residual components:
```python
decomposition = sm.tsa.seasonal_decompose(macro_data['infl'], model='additive')
```
In the code above, we pass the inflation data to the sm.tsa.seasonal_decompose() function, specifying the model as ‘additive’. The function returns a DecomposeResult object containing the trend, seasonal, and residual components.
Once decomposed, we can visualize the individual components using the following code:

```python
plt.figure(figsize=(10, 8))
plt.subplot(4, 1, 1)
plt.plot(macro_data['infl'], label='Original')
plt.ylabel('Inflation')
plt.legend()
plt.subplot(4, 1, 2)
plt.plot(decomposition.trend, label='Trend')
plt.ylabel('Trend')
plt.legend()
plt.subplot(4, 1, 3)
plt.plot(decomposition.seasonal, label='Seasonal')
plt.ylabel('Seasonal')
plt.legend()
plt.subplot(4, 1, 4)
plt.plot(decomposition.resid, label='Residual')
plt.xlabel('Year')
plt.ylabel('Residual')
plt.legend()
plt.tight_layout()
plt.show()
```