If you have been following along with my posts, you may have noticed that one thing I haven't spent much time on is time series data and the subsequent forecasting. I have dealt with sequences (both via Recurrent Neural Networks and Markov Models), but given the vast amount of time series data that you can encounter in industry, this post is long overdue.
Before digging into the theory, I think it is most beneficial to start from the top and work our way down. Namely, I'd like to get everyone exposed to the actual interaction with time series data through the pandas library, and then we will move on to specific forecasting techniques and dive into the mathematics. The reason for this is that, in my experience, getting time series data into the correct format and manipulating it as needed is a bit more challenging than traditional ML preprocessing (especially if you have never worked with it before). With that said, let's get started!
from datetime import datetime
# Date information
year = 2020
month = 1
day = 2
# Time information
hour = 13
mins = 30
sec = 15
date = datetime(year, month, day, hour, mins, sec)
print(date)
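As a quick aside, the resulting datetime object exposes each component as an attribute and can be formatted back into a string with strftime (a small illustrative sketch using the date created above):
print(date.year, date.month, date.day)  # 2020 1 2
print(date.strftime("%Y-%m-%d %H:%M:%S"))  # 2020-01-02 13:30:15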
Now, while Python does have the built-in ability to handle datetimes, numpy is more efficient when it comes to handling dates. The numpy data type for datetimes is datetime64, documented here. It will have a different type compared to Python's built-in datetime:
import numpy as np
np_date = np.array(["2020-03-15"], dtype="datetime64")
print(f"Pythons datetime type: {type(date)}\n", "Numpy datetime type: ", np_date.dtype)
We can take this one step further and actually create numpy date ranges. Using np.arange() we can create a date range easily as follows:
display(np.arange("2018-06-01", "2018-06-23", 7, dtype="datetime64[D]")) # Where the dtype specifies our step type
Pandas handles datetimes in a way that is built on top of numpy. The API to create a date range is shown below:
import pandas as pd
display(pd.date_range("2020-01-01", periods=7, freq="D")) # String code D stands for days
The return value is a pandas DatetimeIndex, which is a specialized pandas index built for datetimes. In other words, the entries are not just plain strings; rather, pandas is aware that they are datetime64 objects. Pandas is able to handle a variety of string codes, however, we will stick with the standard year-month-day format.
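To illustrate the variety of frequency string codes that date_range accepts, here are a few common aliases (purely an aside; the rest of the post sticks with daily data):
display(pd.date_range("2020-01-01", periods=5, freq="H"))  # hourly
display(pd.date_range("2020-01-01", periods=5, freq="W"))  # weekly (week ending Sunday)
display(pd.date_range("2020-01-01", periods=5, freq="M"))  # month end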
A nice helper method that pandas offers is the to_datetime method:
display(pd.to_datetime(["1/2/2018"]))
Which again returns a DatetimeIndex. An interesting thing to note is that if we don't pass in a list above, we receive a Timestamp object instead:
display(pd.to_datetime("1/2/2018"))
Another common bit of preprocessing that you will almost certainly come across is receiving datetimes in a format that differs from the expected/default datetime format. For instance, imagine that you are being sent a series of datetimes from an external database and the format is:
"2018--1--2"
Well, thankfully pandas' to_datetime has a format keyword argument that can be used as follows:
display(pd.to_datetime(["2018--1--2"], format="%Y--%m--%d"))
display(pd.to_datetime(["2018/-1/-2"], format="%Y/-%m/-%d"))
Finally, we can create a pandas dataframe with a date time index:
idx = pd.date_range("2020-01-01", periods=3, freq="D")
cols = ["A", "B"]
df = pd.DataFrame(np.random.randn(3, 2), index=idx, columns=cols)
display(df)
And we can see that our index is indeed comprised of datetime objects:
display(df.index)
And also perform operations such as:
display(df.index.max())
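Because the index is a DatetimeIndex, we can also select rows directly with date strings (partial string indexing), for example:
display(df.loc["2020-01-01":"2020-01-02"])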
Now that we have an idea of how to handle basic time series object creation in native Python, numpy, and pandas, we can start performing more specific time series manipulation. To start, we can look at time resampling. This operates similarly to a groupby operation, except that we aggregate based on some sort of time frequency.
For example, we could take daily data and resample it into monthly data (by taking the average, or some other sort of aggregation). Let's look into this further with a real data set, a CSV of Starbucks stock prices by date. We will read in our CSV with a date index that is a datetime and not a string:
import boto3
s3 = boto3.client('s3')
bucket = "intuitiveml-data-sets"
key = "starbucks.csv"
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Date", parse_dates=True)
display(df.head())
Seeing that our index is in fact a datetime index:
display(df.index)
We can now perform a basic resampling as follows (the rule A stands for annual, i.e. year-end, frequency; the full list of rules is found here):
# daily ---> yearly
df_resampled = df["Close"].resample(rule="A").mean()
display(df_resampled)
A very cool feature is that we can even implement our own resampling functions (if mean, max, min, etc. do not provide the necessary functionality):
def first_day(entry):
    # Return the first value in each resampled group (guarding against empty groups)
    if len(entry): return entry[0]
df_resampled = df["Close"].resample(rule="A").apply(first_day)
display(df_resampled)
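As an aside, for this particular aggregation pandas already provides a built-in equivalent, so the custom function above is mainly illustrative:
display(df["Close"].resample(rule="A").first())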
We can of course combine this resampling with some basic plotting. Below, we can see the average closing price per year:
import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
from IPython.core.display import HTML
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl', offline=True)
trace1 = go.Bar(
x=df_resampled.index,
y=df_resampled.values,
marker = dict(
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=500,
height=400,
title="Yearly Mean Closing Price for Starbucks",
xaxis=dict(title="Date"),
yaxis=dict(title="Mean Closing Price")
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
We can of course perform the same sort of resampling at a monthly frequency as well:
df_resampled = df["Close"].resample(rule="M").max()
display(df_resampled.head())
trace1 = go.Bar(
x=df_resampled.index,
y=df_resampled.values,
marker = dict(
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=500,
height=400,
title="Yearly Mean Closing Price for Starbucks",
xaxis=dict(title="Date"),
yaxis=dict(title="Mean Closing Price")
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
Sometimes when working with time series data, you may need to shift it all up or down along the time series index. Pandas has built-in methods that can easily accomplish this. Recall the head of our Starbucks df:
display(df.head())
If we shift our rows by a single row we end up with the following:
display(df.shift(1).head())
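And, similarly, we can shift in the opposite direction by passing a negative period:
display(df.shift(-1).head())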
We can also shift based on frequency codes. For instance, we can shift the index forward by one month (note that with freq="M" each date is rolled forward to a month end):
display(df.shift(periods=1, freq="M").head())
Let's now take a minute to go over rolling and expanding windows on our time series data with pandas. A common transformation when working with time series data is to create new series based off of a rolling mean: we divide the data into windows of time and then calculate an aggregate for each moving window. In this way we will have calculated a simple moving average.
Recall that our closing price data looks like:
trace1 = go.Scatter(
x=df.index,
y=df.Close.values,
marker = dict(
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=650,
height=350,
title="Closing Price for Starbucks",
xaxis=dict(title="Date"),
yaxis=dict(title="Closing Price")
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
What we are going to do is add in a rolling mean. A rolling mean simply creates a little window, say 7 days, looks at each section of 7 days, and performs some sort of aggregate function on it. In this case it will be a mean, or average. So for each 7 day stretch we take the mean and then keep rolling that 7 day window along.
trace1 = go.Scatter(
x=df.index,
y=df.Close.rolling(window=7).mean().values,
marker = dict(
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=650,
height=350,
title="Rolling Average",
xaxis=dict(title="Date"),
yaxis=dict(title="Closing Price")
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
To be sure that this is clear, look at the first 10 rows of our dataframe:
display(df["Close"].head(10))
Now, the rolling averages are found as follows:
for i in range(0, 4):
    print(f"Window {i+1}, the average of rows {i}:{i+7} ->", df["Close"][i:i+7].mean())
Which we can see is equivalent to the rolling average found via pandas:
display(
df["Close"].rolling(
window=7
).mean()[6:10]
)
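As an aside, the first six entries of the rolling result are NaN because a full 7-day window is not yet available; pandas' min_periods argument can relax that requirement if desired:
display(df["Close"].rolling(window=7, min_periods=1).mean().head(10))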
Let's now overlay our original closing price with the rolling average:
df_rolling_window = df["Close"].rolling(
window=7
).mean()
trace1 = go.Scatter(
x = df_rolling_window.index,
y = df_rolling_window.values,
mode="lines",
marker = dict(
size = 6,
color = 'orange',
),
name="Rolling Mean, Window = 7 days"
)
trace2 = go.Scatter(
x = df["Close"].index,
y = df["Close"].values,
mode="lines",
marker = dict(
size = 6,
color = 'blue',
),
name="Original"
)
data = [trace2, trace1]
layout=go.Layout(
title="Rolling Average (7 day window) vs. No transformation Starbucks Closing Price",
width=950,
height=500,
xaxis=dict(title="Date"),
yaxis=dict(title='Closing Price'),
legend=dict(x=0.05, y=1)
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
We can of course increase our window size, and we will subsequently see more smoothing:
df_rolling_window = df["Close"].rolling(
window=30
).mean()
trace1 = go.Scatter(
x = df_rolling_window.index,
y = df_rolling_window.values,
mode="lines",
marker = dict(
size = 6,
color = 'orange',
),
name="Rolling Mean, Window = 30 days"
)
trace2 = go.Scatter(
x = df["Close"].index,
y = df["Close"].values,
mode="lines",
marker = dict(
size = 6,
color = 'blue',
),
name="Original"
)
data = [trace2, trace1]
layout=go.Layout(
title="Rolling Average (30 day window) vs. No transformation Starbucks Closing Price",
width=950,
height=500,
xaxis=dict(title="Date"),
yaxis=dict(title='Closing Price'),
legend=dict(x=0.05, y=1)
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
As we continue increasing the window size, we can see a more general trend. Now, in addition to rolling windows we can also work with expanding windows. For instance, what if we wanted to take into account everything from the start of the time series up to each point in time (i.e., a cumulative average)? This would work as follows:
df_expanding = df["Close"].expanding().mean()
trace1 = go.Scatter(
x=df_expanding.index,
y=df_expanding.values,
marker = dict(
color = 'green',
),
)
data = [trace1]
layout = go.Layout(
showlegend=False,
width=650,
height=350,
title="Expanding Closing Price Average",
xaxis=dict(title="Date"),
yaxis=dict(title="Closing Price")
)
fig = go.Figure(data=data, layout=layout)
# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))
We can see that this curve is much smoother than the raw closing price and reacts slowly to recent changes, as an expanding mean generally will, since each point averages over the entire history up to that date.
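To make the mechanics concrete, the expanding mean at each row is simply the cumulative sum of closing prices divided by the number of observations seen so far (a quick sanity-check sketch using the Starbucks df from above):
manual_expanding = df["Close"].cumsum() / np.arange(1, len(df) + 1)
display(manual_expanding.head())
display(df["Close"].expanding().mean().head())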
Now, one thing that is incredibly common when dealing with real world data is for there to be missing and null values. For instance, let's look at a simple data frame with all of the values present at each time interval:
idx = pd.date_range("2020-01-01", periods=10, freq="D")
cols = ["A"]
df = pd.DataFrame(np.random.randn(10, 1), index=idx, columns=cols)
display(df)
Clearly, all of our values are present. Now, in traditional contexts if we were missing a value or an entry, it may show up as:
df.loc["2020-01-04"] = np.nan
display(df)
And we can identify missing values graphically via a heatmap:
plt.figure(figsize=(5,4))
sns.set(font_scale=1)
sns.heatmap(df.isnull(),yticklabels=True, cbar=False, cmap='viridis')
plt.show()
However, in the context of time series data the entry often is simply not present at all; i.e. there is no index value for that specific date:
df = df.drop(pd.to_datetime("2020-01-04"))
display(df)
plt.figure(figsize=(5,4))
sns.set(font_scale=1)
sns.heatmap(df.isnull(),yticklabels=True, cbar=False, cmap='viridis')
plt.show()
Often when performing an analysis such as this you will want to have an idea of which days are missing! So, that raises the question: how can we create the missing indices and set their values to null? There are two main ways of going about this, and we will explore each.
The first way that we can go about solving this issue is by applying a custom reindex function. Essentially, we will use pandas' apply method, pass in our dataframe, create the full set of indices that we want, reindex, and fill missing values with NaN:
def reindex_by_date(df):
    # Build the complete daily range spanning the frame's index
    dates = pd.date_range(df.index.min(), df.index.max())
    # Reindex onto that range; dates that were absent now appear with NaN values
    return df.reindex(dates).fillna(np.nan)
df_reindexed = df.apply(reindex_by_date)
display(df_reindexed)
And we have the data frame that we were looking for! Dates that were not originally present are now showing up with nan values, allowing us to use a heatmap to identify missing values!
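As an aside, for a simple frame with a plain DatetimeIndex like this one, pandas' asfreq can accomplish the same thing in a single call:
display(df.asfreq("D"))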
Sometimes you may have time series data specific to multiple entities; for instance, you may have several different companies, for each of which you have several years' worth of time series price data:
key = "starbucks.csv"
obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Date", parse_dates=True, usecols=["Date", "Close"])
companies = ["starbucks", "tim hortons", "dunkin donuts"]
dfs = [df.assign(company=company) for company in companies]
df = pd.concat(dfs)
display(df.head())
Visually we can see that we are now missing the 3rd and 4th of January (as well as 48 other dates) for Starbucks. We are also missing these values for the other companies!
display(df[df.company == "tim hortons"].head())
Now, without any modification if we were to try our previous reindexing technique, we would receive the following error:
ValueError: ('cannot reindex from a duplicate axis', 'occurred at index Close')
This is because we have the same index values (dates) repeated across different companies. Pandas cannot reindex over a duplicated axis, and has specifically designed a MultiIndex for this exact scenario.
df_multi = df.set_index(
["company", df.index]
)
display(df_multi.head())
We now have both a company and a Date level in our index, and we can update our reindex_by_date function to handle this:
def reindex_multi_by_date(df):
    # Complete daily range spanning the Date level of the MultiIndex
    dates = pd.date_range(
        df.index.get_level_values(1).min(),
        df.index.get_level_values(1).max()
    )
    # Cartesian product of every company with every date
    new_idx = pd.MultiIndex.from_product(
        [df.index.levels[0], dates],
        names=["company", "Date"]
    )
    return df.reindex(new_idx).fillna(np.nan)
df_multi = reindex_multi_by_date(df_multi)
display(df_multi.head())
display(df_multi.iloc[
df_multi.index.get_level_values("Date") == pd.to_datetime("2018-12-02")
])
Which leaves us with exactly what we want: NaN values for missing rows:
pivot = df_multi.reset_index().pivot(columns="company", index="Date", values="Close")
plt.figure(figsize=(10,6))
sns.set(font_scale=1)
sns.heatmap(pivot.isnull(), cbar=True, cmap='viridis')
plt.show()
print("Number of days without a close value: ", pivot.isnull().sum().sum())
Now, there actually is a way that we can use our original reindex_by_date function without needing a MultiIndex (and it may very well be far less hassle!). We can simply use a groupby before we apply the reindex function:
def reindex_by_date(df):
    # Complete daily range for this company's group
    dates = pd.date_range(df.index.min(), df.index.max())
    df = df.reindex(dates)
    # Newly added rows have a NaN company, so backfill the company name
    df = df.assign(company=df.company.fillna(method="backfill"))
    return df
df_reindexed = df.groupby("company").apply(reindex_by_date).reset_index(0, drop=True)
display(df_reindexed.head())
df_reindexed.index.name = "Date"
And we will arrive at the same result, 1362 missing entries:
print("Number of days without a close value: ", df_reindexed.Close.isnull().sum())
Again, we can see these visually with a heatmap:
pivot = df_reindexed.reset_index().pivot(columns="company", index="Date", values="Close")
plt.figure(figsize=(10,6))
sns.set(font_scale=1)
sns.heatmap(pivot.isnull(), cbar=True, cmap='viridis')
plt.show()