I want us to now move on to learning about the main tool we will use in time series forecasting: the Statsmodels
library. Statsmodels can be thoughts of as follow:
A python module that provides classes and function for the estimation of many different statistical models, as well as conducting statistical tests, and statistical data exploration.
Keep in mind that we won't really be doing any forecasting yet. Rather, we will be familiarizing with the statsmodels library and some of the statistical tests that you can be performing on time series data.
Let's look at some basic properties of time series data. To begin, we have trends. Time series can have trends, as seen below:
import numpy as np
import pandas as pd
import boto3
import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks
import plotly.plotly as py
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
from IPython.core.display import HTML
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl', offline=True)
from plotly import tools
x = np.arange(0,50,0.01)
y_stationary = np.sin(x)
y_upward = x*0.1 + np.sin(x)
y_downward = np.sin(x) - x*0.1
trace1 = go.Scatter(
x=x,
y=y_stationary
)
trace2 = go.Scatter(
x=x,
y=y_upward,
xaxis='x2',
yaxis='y2'
)
trace3 = go.Scatter(
x=x,
y=y_downward,
xaxis='x3',
yaxis='y3'
)
data = [trace1, trace2, trace3]
layout = go.Layout(
showlegend=False
)
fig = tools.make_subplots(
rows=1,
cols=3,
subplot_titles=("Stationary", "Upward", "Downward"),
print_grid=False
)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)
fig['layout']['yaxis1'].update(range=[-3, 3])
fig['layout'].update(
showlegend=False,
height=300
)
plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))
Above, we have can see upward, stationary, and downward trends. Time series will always exhibit one of the above trends. Additionally, time series can exhibit seasonality, a repeating trend:
s3 = boto3.client('s3')
bucket = "intuitiveml-data-sets"
key = "monthly_milk_production.csv"
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Date", parse_dates=True)
clipped_df = df["1962-01-01":"1968-01-01"]
trace1 = go.Scatter(
x=clipped_df.index,
y=clipped_df["Production"].values,
marker = dict(
size = 6,
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=800,
height=400,
title="Seasonality and Upward Trend",
xaxis=dict(title="Date")
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))
We can clearly see in the plot above that their is a seasonality trend associated with the data. At around the 3rd month of each year we observe a peak. It also looks as though this trend repeats every cycle. Hence, overall it looks like the volume of search results is going down.
Finally, we also have cyclical components. Cyclical components are trends that have no set repetition. Here is does look like there are trends, however, it does not look like it occurs on a regular cycle.
bucket = "intuitiveml-data-sets"
key = "starbucks.csv"
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Date", parse_dates=True)
trace1 = go.Scatter(
x=df.index,
y=df["Close"].values,
marker = dict(
size = 6,
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=800,
height=400,
title="Cyclical and Upward Trend",
xaxis=dict(title="Date")
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))
Now that we have a basic understanding of the different properties of time series data, we can look at how they may be separated from one another. To start, we have the Hodrick-Prescott filter which separates a time-series, $y_t$ into a trend component, $\tau_t$, and a cyclical component, $c_t$:
$$y_t = \tau_t + c_t$$The components are determined via minimizing the following quadratic loss function, where $\lambda$ is a smoothing parameter:
$$ min_{\tau_t} = \overbrace{\sum_{t=1}^T c_t^2}^\text{cyclical component} + \lambda \overbrace{\sum_{t=1}^T \Big[ (\tau_t - \tau_{t-1}) - (\tau_{t-1} - \tau_{t-2})\Big]^2}^\text{trend component} $$So, essentially, by using the above equation the hodrick prescott filter is able to separate out the above two components. Note, $\lambda$ handles the variation in the growth rate of the trend component. There are some good default values to use for $\lambda$:
Data Frequency | $\lambda$ |
---|---|
Quarterly | 1600 |
Annual | 6.25 |
Monthly | 129,600 |
Now let's take a look at how we can implement this filter via stats models. To start, we are going to look at data related to GDP per year:
key = "macrodata.csv"
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col=0, parse_dates=True)
df.head()
trace1 = go.Scatter(
x=df.index,
y=df["realgdp"].values,
marker = dict(
size = 6,
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=800,
height=400,
title="Real GDP per. Year",
yaxis=dict(title="GDP"),
xaxis=dict(title="Year")
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
So we can see that in general GDP tends to increase over time, with a noticable dip due to the recession in 2008. So, what we are going to try and do is use statsmodels
to get the trend. By using the Hodrick-Prescott filter we should be able to separate out the time series into a trend and cylical component.
from statsmodels.tsa.filters.hp_filter import hpfilter
gdp_cycle, gdp_trend = hpfilter(df["realgdp"], lamb=1600)
trace1 = go.Scatter(
x=df.index,
y=df["realgdp"].values,
marker = dict(
size = 6,
color = 'green',
),
name="Real GDP"
)
trace2 = go.Scatter(
x=gdp_trend.index,
y=gdp_trend,
marker = dict(
size = 6,
color = '#FFBF00',
),
name="trend"
)
trace3 = go.Scatter(
x=gdp_cycle.index,
y=gdp_cycle,
marker = dict(
size = 6,
color = 'blue',
),
name="cycle"
)
data = [trace1, trace2, trace3]
layout = go.Layout(
showlegend=True,
width=800,
height=400,
title="Real GDP per. Year",
yaxis=dict(title="GDP"),
xaxis=dict(title="Year")
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
We can see above the different components as well as the real GDP. If you zoom in on the period from 2005 onward you can see that the real gdp is no longer following the general trend of the past few years (where we were in a bull market), but rather the effects of the Great Recession hit in 2008.
We are now going to discuss ETS models, with ETS standing for Error-Trend-Seasonality. ETS models actually encompass quite a wide variety of models, namely:
To start we will discuss ETS decomposition. Statsmodels provides a seasonal decomposition tool that we can use to separate out the different components of a time series. We already saw a brief introduction to this when discussing the Hodrick-Prescott Filter. ETS models will take each of those terms for smoothing, and may add them, multiply them, or even just leave some of them out. Based off these key factors we can try to create a model to fit our data. Our goal for now is to get an idea of what error, trend, and seasonality are. Visualizing data based off of it's ETS is a good way to get an understanding of the general behavior of the time series.
We are going to look at a classic data set known as the Airline Passengers data set, which looks at the number of passengers per month (in thousands) flying, from 1949-1969.
Before we get started, I do want to note the following; we will apply an additive model when it seems that the trend is more linear and the seasonality and trend components seem to be constant over time (e.g. every year we add 10,000 passengers). On the hand we will use a multiplicative model when we are increasing (or decreasing) at a non-linear rate (e.g. each year we double the number of passengers).
bucket = "intuitiveml-data-sets"
key = "airline_passengers.csv"
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Month", parse_dates=True)
display(df.head())
One thing to note is that we cannot actually have missing data when performing ETS decomposition, so we will actually want to drop null values:
df = df.dropna()
Now, let's start by looking at a plot of our raw data:
trace1 = go.Scatter(
x=df.index,
y=df["Thousands of Passengers"].values,
marker = dict(
size = 6,
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=800,
height=400,
title="Thousands of Passengers Flying per Month",
yaxis=dict(title="Thousands of Passengers"),
xaxis=dict(title="Month")
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
Now, looking at this data we get an inclination that the data is increasing at a slightly higher rate than just linear, so we will want to use a multiplicative model.
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(df["Thousands of Passengers"], model="multiplicative")
We can quickly take a look at what this result
object looks like on it's own:
result
We can then have a look at which individual components are available:
display(list(filter(lambda x: x[0] != "_" ,dir(result))))
For example, let's grab the seasonal component:
display(result.seasonal.head())
trace1 = go.Scatter(
x=result.observed.index,
y=result.observed
)
trace2 = go.Scatter(
x=result.trend.index,
y=result.trend,
xaxis='x2',
yaxis='y2'
)
trace3 = go.Scatter(
x=result.seasonal.index,
y=result.seasonal,
xaxis='x3',
yaxis='y3'
)
trace4 = go.Scatter(
x=result.resid.index,
y=result.resid,
xaxis='x4',
yaxis='y4'
)
data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
showlegend=False
)
fig = tools.make_subplots(
rows=4,
cols=1,
print_grid=False
)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 2, 1)
fig.append_trace(trace3, 3, 1)
fig.append_trace(trace4, 4, 1)
fig['layout'].update(
showlegend=False,
height=600,
yaxis=dict(title="Observed"),
yaxis2=dict(title="Trend"),
yaxis3=dict(title="Seasonal"),
yaxis4=dict(title="Residual"),
xaxis4=dict(title="Month")
)
plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))
And with that we have taken a raw data set and nicely decomposed it into it's trend and seasonal terms! One small note, in this context residual means error!
Previously we have seen how calculating a simple moving average can allow us to create a simple model that describes some trend level behavior of a time series (as was the case with the thousands of passengers data set above). In theory we could attempt to use these simple moving averages to build a generalized model for the real world time series we are analyzing. This isn't necessarily a means of forcasting, but rather a we are just trying to describe the general behavior of the time series.
What we are going to do now is expand off the idea of a simple moving average (SMA) by utilizing an exponentially weighted moving average (EWMA). What exactly are the shortcomings of an SMA that require us to utilize the more intricate EWMA?
What would be great is if we could have more recent data be weighted more than older data. We do this by implementing a EWMA instead of SMA. And EWMA will allow us to reduce the lag effect from the SMA and it will put more weight on valeus that occurred more recently (by applying more weight to the more recent values). The amount of weight applied to the most recent values will depend on the actual parameters used in the EWMA and the number of periods given a window size.
Let's now apply this to our airline passenger's data set. To start, let's look at how to create a simple moving average once again:
df["6-month-SMA"] = df["Thousands of Passengers"].rolling(window=6).mean() # 6 month window
df["12-month-SMA"] = df["Thousands of Passengers"].rolling(window=12).mean() # year window
trace1 = go.Scatter(
x=df.index,
y=df["Thousands of Passengers"].values,
opacity=0.3,
marker = dict(
size = 6,
color="grey"
),
name="Thousands of Passengers"
)
trace2 = go.Scatter(
x=df.index,
y=df["6-month-SMA"].values,
marker = dict(
size = 6,
),
name="6 Month SMA"
)
trace3 = go.Scatter(
x=df.index,
y=df["12-month-SMA"].values,
marker = dict(
size = 6,
),
name="12 Month SMA"
)
data = [trace1, trace2, trace3]
layout = go.Layout(
showlegend=True,
width=800,
height=400,
title="Thousands of Passengers Flying per Month with SMA",
yaxis=dict(title="Thousands of Passengers"),
xaxis=dict(title="Month")
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
Now let's take a look at how to implement an exponential moving average. Keep in mind it's main benefits:
It can be implemented as follows:
df["EWMA-12"] = df["Thousands of Passengers"].ewm(span=12).mean()
trace1 = go.Scatter(
x=df.index,
y=df["Thousands of Passengers"].values,
opacity=0.3,
marker = dict(
size = 6,
color="grey"
),
name="Thousands of Passengers"
)
trace2 = go.Scatter(
x=df.index,
y=df["EWMA-12"].values,
marker = dict(
size = 6,
color="green"
),
name="12 Month EWMA"
)
data = [trace1, trace2]
layout = go.Layout(
showlegend=True,
width=800,
height=400,
title="Thousands of Passengers Flying per Month with 12 Month EWMA",
yaxis=dict(title="Thousands of Passengers"),
xaxis=dict(title="Month")
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
Above, we can see that the behavior at the beginning is different than the behavior at the end. You may notice that the seasonality trend is a lot more clear towards the end points, than towards the beginning. That is because we have weighted the points closer to the present time more heavily than the historical values.
We are going to take a minute to dig into something that may seem straight forward: How to calculate an average. The first thought we all have when being asked to do this is: Why not just add all of the sample data points, and then divide by the number of data points, resulting in the sample mean:
$$\bar{X}_N = \frac{1}{N}\sum_{n=1}^NX_n$$But now let's suppose that you have a large amount of data-so much so that all of your X's cannot fit into memory at the same time. Is it still possible to calculate the sample mean? Yes it is! We can read in one data point at a time, and then delete each data point after we've looked at it. It is shown below that the current sample mean can actually be expressed in terms of the previous sample mean and the current data point.
$$\bar{X}_N = \frac{1}{N}\sum_{n=1}^NX_n = \frac{1}{N}\Big((N-1)\bar{X}_{N-1} + X_N \Big) = (1 - \frac{1}{N})\bar{X}_{N-1}+\frac{1}{N}X_N$$We can then express this using simpler symbols. We can call $Y$ our output, and we can use $t$ to represent the current time step:
$$Y_t = (1 - \frac{1}{t})Y_{t-1} + \frac{1}{t}X_t$$Great, so we have solved our problem of how to calculate the sample mean when we can't fit all of the data into memory, but we can see that there is this $\frac{1}{t}$ term. This says that as $t$ grows larger, the current sample has less and less of an effect on the total mean. Clearly this makes sense, because as $t$ grows that means the total number of $X$'s we've seen has grown. We also decrease the influence of the previous $Y$ by $1 - \frac{1}{t}$. This means that each new $Y$ is just part of the old $Y$, plus part of the newest $X$. But in the end, it balances out to give us exactly the sample mean of $X$.
For convenience we can call this $\frac{1}{t}$ term $\alpha_t$. What if we were to say that we did not want $\alpha$ to be $\frac{1}{t}$? What if we said that we wanted each data point to matter equally at the time that we see it, so that we can set alpha to be a constant? Of course, $\alpha$ needs to be less than 1 so that we don't end up negating the previous mean.
$$0 < \alpha_t = constant < 1 $$$$Y_t = (1 - \alpha)Y_{t-1} + \alpha X_t$$So what does this give us?
This gives us what is called the exponentially weighted moving average! We can see why it is called exponential when we express it in terms of only $X$'s.
$$Y_t = (1 - \alpha)^tY_0 + \alpha \sum_{\tau = 0}^{t - 1}(1- \alpha)^\tau X(t- \tau)$$If the equation above is not clear, the expansion below should clear up where everything is coming from and why this is called exponential. Let's say we are looking at $Y_{100}$:
$$Y_{100} = (1-\alpha)^{100}Y_0 + \alpha * X_{100} + \alpha * (1 - \alpha)^1*X_{99} + \alpha * (1 - \alpha)^2 * X_{98}+ ...$$We can see the exponential term starts to accumulate along the $(1 - \alpha)$! Now, does this still give us the mean, aka the expected value of $X$? Well, if you take the expected value of everything, we can see that we arrive at the expected value of $X$:
$$(1 - \alpha)E[y(t-1)] + \alpha E[X(t)] = (1-\alpha)E(X) + \alpha E(X) = E(X)$$We do arrive at the expected value of $X$, so we can see that the math does checkout! Of course, this is assuming that the distribution of $X$ does not change over time. Note that if you have come from a signal processing background, you may recognize this as a low-pass filter. Another way to think about this is that you are saying current values matter more, and past values matter less in an exponentially decreasing way. So, if $X$ is not stationary (meaning it's distribution changes over time), then this is actually a better way to estimate the mean (average) then weighting all data points equally over all time.
Now, we can see above that we need to set the $\alpha$ term specifically. We do this in pandas by setting the adjust
parameter to the ewm
method. If we set adjust=True
we ensure that the most recent observations are weighted more heavily compared to the older observations. If on the other hand we set adjust=False
, we end up with:
df["EWMA-6-no-adjust"] = df["Thousands of Passengers"].ewm(span=6, adjust=False).mean()
df["EWMA-6-adjust"] = df["Thousands of Passengers"].ewm(span=6, adjust=True).mean()
trace1 = go.Scatter(
x=df.index,
y=df["Thousands of Passengers"].values,
opacity=0.3,
marker = dict(
size = 6,
color="grey",
),
name="Thousands of Passengers"
)
trace2 = go.Scatter(
x=df.index,
y=df["EWMA-6-no-adjust"].values,
marker = dict(
size = 6,
),
name="EWMA-6-no-adjust"
)
trace3 = go.Scatter(
x=df.index,
y=df["EWMA-6-adjust"].values,
marker = dict(
size = 6,
),
name="EWMA-6-adjust"
)
data = [trace1, trace2, trace3]
layout = go.Layout(
showlegend=True,
width=800,
height=400,
title="Thousands of Passengers Flying per Month with 12 Month EWMA",
yaxis=dict(title="Thousands of Passengers"),
xaxis=dict(title="Month")
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)