Nathaniel Dake Blog

1. Time Series in Pandas

If you have been following along with my posts you may have realized that something I hadn't spent a lot of time dealing with was time series and subsequent forecasting. I have dealt with sequences (both via Recurrent Neural Networks and Markov Models), but given vast amount of time series data that you can encounter in industry, this post is long over due.

Before digging into the theory, I think that it may be the most beneficial to start from the top and work our way down. Namely, I'd like to get every one exposed to the actual interaction with time series data through the pandas library, and then we will move on to specific forecasting techniques and dive into the mathematics. The reason for this is because in my experience, getting the time series data into the correct format and manipulating it as needed is a bit more challenging than the traditional ML preprocessing (especially if you have never worked with it before). With that said, let's get started!

1. Datetimes in Python, Numpy and Pandas

Built right in to native python is the ability to create a datetime object:

In [43]:
from datetime import datetime

# Date information
year = 2020
month = 1
day = 2

# Time information
hour = 13
mins = 30
sec = 15

date = datetime(year, month, day, hour, mins, sec)
print(date)
2020-01-02 13:30:15

Now, while python does have the built in ability to handle date times, numpy is more efficient when it comes to handling date's. The numpy data type for date times datetime64, here. It will be have a different type compared to python's built in datetime:

In [44]:
import numpy as np

np_date = np.array(["2020-03-15"], dtype="datetime64")


print(f"Pythons datetime type: {type(date)}\n", "Numpy datetime type: ", np_date.dtype)
Pythons datetime type: <class 'datetime.datetime'>
 Numpy datetime type:  datetime64[D]

We can take this one step further and actually create numpy date ranges. By using np.arange() we can create a date range easily as follows:

In [45]:
display(np.arange("2018-06-01", "2018-06-23", 7, dtype="datetime64[D]")) # Where the dtype specifies our step type
array(['2018-06-01', '2018-06-08', '2018-06-15', '2018-06-22'],
      dtype='datetime64[D]')

Pandas handles datetimes in a way that is built in top of numpy. The API to create a date range is shown below:

In [46]:
import pandas as pd

display(pd.date_range("2020-01-01", periods=7, freq="D")) # String code D stands for days
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07'],
              dtype='datetime64[ns]', freq='D')

The return value is a pandas DatetimeIndex which is a specialized pandas index built for datetimes. In other words, it is not just normal string codes, rather it is aware that they are datetime64 objects. Pandas is able to handle a variety of string codes, however, we will stick with the standard year-month-day.

A nice helper method that pandas offers is the to_datetime method:

In [47]:
display(pd.to_datetime(["1/2/2018"]))
DatetimeIndex(['2018-01-02'], dtype='datetime64[ns]', freq=None)

Which again returns a DatetimeIndex. An interesting thing to note is that if we don't pass in a list above, we receive a Timestamp object instead:

In [48]:
display(pd.to_datetime("1/2/2018"))
Timestamp('2018-01-02 00:00:00')

Another common bit of preprocessing that you will most certainly come across is receiving date times in a format that is not expected/the default date time format. For instance, imagine that you are being sent a series of date times from an external database and the format is:

"2018--1--2"

Well, thankfully pandas to_datetime has a format key word argument that can be used as follows:

In [49]:
display(pd.to_datetime(["2018--1--2"], format="%Y--%m--%d"))
display(pd.to_datetime(["2018/-1/-2"], format="%Y/-%m/-%d"))
DatetimeIndex(['2018-01-02'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2018-01-02'], dtype='datetime64[ns]', freq=None)

Finally, we can create a pandas dataframe with a date time index:

In [50]:
idx = pd.date_range("2020-01-01", periods=3, freq="D")
cols = ["A", "B"]
df = pd.DataFrame(np.random.randn(3, 2), index=idx, columns=cols)

display(df)
A B
2020-01-01 -0.145376 -0.648200
2020-01-02 0.891391 0.138016
2020-01-03 0.927283 -0.277852

And we can see that our index is indeed comprised of datetime objects:

In [51]:
display(df.index)
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')

And also perform operations such as:

In [52]:
display(df.index.max())
Timestamp('2020-01-03 00:00:00')

1.1 Time Resampling

Now that we have an idea of how to deal with basic time series object creation in native python, numpy, and pandas, we can start perform more specific time series manipulation. To start, we can look at time resampling. This operates similar to a groupby operation, except we end up aggregating based off of some sort of time frequency.

For example, we could take daily data and resample it into monthly data (by taking the average, or some, or some other sort of aggregation). Let's look into this further with a real data set, a csv related to starbucks stock prices by date. We will read in our csv with a date index that is a datetime and not a string:

In [53]:
import boto3 
s3 = boto3.client('s3')

bucket = "intuitiveml-data-sets"
key = "starbucks.csv"

obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Date", parse_dates=True)

display(df.head())
Close Volume
Date
2015-01-02 38.0061 6906098
2015-01-05 37.2781 11623796
2015-01-06 36.9748 7664340
2015-01-07 37.8848 9732554
2015-01-08 38.4961 13170548

Seeing that our index is in fact a date time:

In [54]:
display(df.index)
DatetimeIndex(['2015-01-02', '2015-01-05', '2015-01-06', '2015-01-07',
               '2015-01-08', '2015-01-09', '2015-01-12', '2015-01-13',
               '2015-01-14', '2015-01-15',
               ...
               '2018-12-17', '2018-12-18', '2018-12-19', '2018-12-20',
               '2018-12-21', '2018-12-24', '2018-12-26', '2018-12-27',
               '2018-12-28', '2018-12-31'],
              dtype='datetime64[ns]', name='Date', length=1006, freq=None)

We can now perform a basic resampling as follows (the rule A is found here):

In [55]:
# daily ---> yearly
df_resampled = df["Close"].resample(rule="A").mean()
display(df_resampled)
Date
2015-12-31    50.078100
2016-12-31    53.891732
2017-12-31    55.457310
2018-12-31    56.870005
Freq: A-DEC, Name: Close, dtype: float64

A very cool feature is that we can even implement our own resampling functions (if mean, max, min, etc do not provide the necessary functionality):

In [56]:
def first_day(entry):
    if len(entry): return entry[0]
In [57]:
df_resampled = df["Close"].resample(rule="A").apply(first_day)
display(df_resampled)
Date
2015-12-31    38.0061
2016-12-31    55.0780
2017-12-31    53.1100
2018-12-31    56.3243
Freq: A-DEC, Name: Close, dtype: float64

We can of course combine this resampling with some basic plotting. Below, we can see the average closing price per year:

In [93]:
import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks
import plotly.plotly as py
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
from IPython.core.display import HTML

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")

cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl', offline=True)
In [59]:
trace1 = go.Bar(
    x=df_resampled.index,
    y=df_resampled.values,
    marker = dict(
        color = 'green',
    ),
)

data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=500,
    height=400,
    title="Yearly Mean Closing Price for Starbucks",
    xaxis=dict(title="Date"),
    yaxis=dict(title="Mean Closing Price")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

We can of course perform the same sort of resampling at a monthly frequency as well:

In [60]:
df_resampled = df["Close"].resample(rule="M").max()
display(df_resampled.head())
Date
2015-01-31    41.5575
2015-02-28    44.2853
2015-03-31    45.8614
2015-04-30    48.5616
2015-05-31    48.8289
Freq: M, Name: Close, dtype: float64
In [61]:
trace1 = go.Bar(
    x=df_resampled.index,
    y=df_resampled.values,
    marker = dict(
        color = 'green',
    ),
)

data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=500,
    height=400,
    title="Yearly Mean Closing Price for Starbucks",
    xaxis=dict(title="Date"),
    yaxis=dict(title="Mean Closing Price")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

1.2 Time Shifting

Sometimes when working with time series data, you may need to shift it all up or down along the time series index. Pandas has built in methods that can easily accomplish this. Recall the head of our starbucks df

In [62]:
display(df.head())
Close Volume
Date
2015-01-02 38.0061 6906098
2015-01-05 37.2781 11623796
2015-01-06 36.9748 7664340
2015-01-07 37.8848 9732554
2015-01-08 38.4961 13170548

If we shift our rows by a single row we end up with the following:

In [63]:
display(df.shift(1).head())
Close Volume
Date
2015-01-02 NaN NaN
2015-01-05 38.0061 6906098.0
2015-01-06 37.2781 11623796.0
2015-01-07 36.9748 7664340.0
2015-01-08 37.8848 9732554.0

We can also shift based on frequency codes. For instance, we can shift everything forward one month:

In [64]:
display(df.shift(periods=1, freq="M").head())
Close Volume
Date
2015-01-31 38.0061 6906098
2015-01-31 37.2781 11623796
2015-01-31 36.9748 7664340
2015-01-31 37.8848 9732554
2015-01-31 38.4961 13170548

1.3 Rolling and Expanding

Let's now take a minute to go over rolling and expanding our time series data with pandas. The basic premise is that a common process when working with time series data is to create data based off of a rolling mean. So, what we can do is divide the data into windows of time, then calculate an aggregate for each moving window. In this way we will have calculated a simple moving average.

Recall that our closing price data looks like:

In [65]:
trace1 = go.Scatter(
    x=df.index,
    y=df.Close.values,
    marker = dict(
        color = 'green',
    ),
)

data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=650,
    height=350,
    title="Closing Price for Starbucks",
    xaxis=dict(title="Date"),
    yaxis=dict(title="Closing Price")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

What we are going to do is add in a rolling mean. A rolling mean simply create a little window, say 7 days, and it looks at each section of 7 days and performs some sort of aggregate function on it. In this case it will be a mean, or average. So every 7 days will will take the mean and keep rolling it along with that 7 day window.

In [66]:
trace1 = go.Scatter(
    x=df.index,
    y=df.Close.rolling(window=7).mean().values,
    marker = dict(
        color = 'green',
    ),
)

data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=650,
    height=350,
    title="Rolling Average",
    xaxis=dict(title="Date"),
    yaxis=dict(title="Closing Price")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

To be sure that this is clear, look at the first 10 rows of our dataframe:

In [67]:
display(df["Close"].head(10))
Date
2015-01-02    38.0061
2015-01-05    37.2781
2015-01-06    36.9748
2015-01-07    37.8848
2015-01-08    38.4961
2015-01-09    37.2361
2015-01-12    37.4415
2015-01-13    37.7401
2015-01-14    37.5301
2015-01-15    37.1381
Name: Close, dtype: float64

Now, the rolling averages are found as follows:

In [68]:
for i in range(0, 4):
    print(f"Window {i+1}, the average of rows {i}:{i+7} ->", df["Close"][i:i+7].mean())
Window 1, the average of rows 0:7 -> 37.61678571428571
Window 2, the average of rows 1:8 -> 37.57878571428571
Window 3, the average of rows 2:9 -> 37.61478571428571
Window 4, the average of rows 3:10 -> 37.63811428571428

Which we can see is equivalent to the rolling average found via pandas:

In [69]:
display(
    df["Close"].rolling(
        window=7
    ).mean()[6:10]
)
Date
2015-01-12    37.616786
2015-01-13    37.578786
2015-01-14    37.614786
2015-01-15    37.638114
Name: Close, dtype: float64

Let's now overlay our original closing price with the rolling average:

In [70]:
df_rolling_window = df["Close"].rolling(
    window=7
).mean()

trace1 = go.Scatter(
    x = df_rolling_window.index,
    y = df_rolling_window.values,
    mode="lines",
    marker = dict(
        size = 6,
        color = 'orange',
    ),
    name="Rolling Mean, Window = 7 days"
)

trace2 = go.Scatter(
    x = df["Close"].index,
    y = df["Close"].values,
    mode="lines",
    marker = dict(
        size = 6,
        color = 'blue',
    ),
    name="Original"
)
data = [trace2, trace1]

layout=go.Layout(
    title="Rolling Average (7 day window) vs. No transformation Starbucks Closing Price",
    width=950,
    height=500,
    xaxis=dict(title="Date"),
    yaxis=dict(title='Closing Price'),
    legend=dict(x=0.05, y=1)
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

We can of course increase our window size, and we will subsequently see more smoothing:

In [71]:
df_rolling_window = df["Close"].rolling(
    window=30
).mean()

trace1 = go.Scatter(
    x = df_rolling_window.index,
    y = df_rolling_window.values,
    mode="lines",
    marker = dict(
        size = 6,
        color = 'orange',
    ),
    name="Rolling Mean, Window = 30 days"
)

trace2 = go.Scatter(
    x = df["Close"].index,
    y = df["Close"].values,
    mode="lines",
    marker = dict(
        size = 6,
        color = 'blue',
    ),
    name="Original"
)
data = [trace2, trace1]

layout=go.Layout(
    title="Rolling Average (30 day window) vs. No transformation Starbucks Closing Price",
    width=950,
    height=500,
    xaxis=dict(title="Date"),
    yaxis=dict(title='Closing Price'),
    legend=dict(x=0.05, y=1)
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

As we continue increasing the window size, we can see that we are viewing a more general trend. Now, in addition to rolling windows we can also work with expanding. For instance, what if we wanted to take into account everything from the start of the time series up to each point in time (aka a cumulative average). This would work as follows:

In [72]:
df_expanding = df["Close"].expanding().mean()

trace1 = go.Scatter(
    x=df_expanding.index,
    y=df_expanding.values,
    marker = dict(
        color = 'green',
    ),
)

data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=650,
    height=350,
    title="Expanding Closing Price Average",
    xaxis=dict(title="Date"),
    yaxis=dict(title="Closing Price")
)

fig = go.Figure(data=data, layout=layout)

# plotly.offline.iplot(fig)
html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
display(HTML(html_fig))

We can see that this curve is clearly logarithmic, as an expanding mean generally will be.

1.4 Missing Dates

Now, one thing that is incredibly common when dealing with real world data is for there to be missing and null values. For instance, let's look at a simple data frame with all of the values present at each time interval:

In [275]:
idx = pd.date_range("2020-01-01", periods=10, freq="D")
cols = ["A"]
df = pd.DataFrame(np.random.randn(10, 1), index=idx, columns=cols)

display(df)
A
2020-01-01 -0.098591
2020-01-02 0.890374
2020-01-03 0.608196
2020-01-04 -0.180755
2020-01-05 -1.028363
2020-01-06 0.552435
2020-01-07 0.231417
2020-01-08 -0.223659
2020-01-09 -0.059113
2020-01-10 0.151531

Clearly, all of our values are present. Now, in traditional contexts if we were missing a value or an entry, it may show up as:

In [276]:
df.loc["2020-01-04"] = np.nan
display(df)
A
2020-01-01 -0.098591
2020-01-02 0.890374
2020-01-03 0.608196
2020-01-04 NaN
2020-01-05 -1.028363
2020-01-06 0.552435
2020-01-07 0.231417
2020-01-08 -0.223659
2020-01-09 -0.059113
2020-01-10 0.151531

And we can identify missing values graphically via a heatmap:

In [277]:
plt.figure(figsize=(5,4))
sns.set(font_scale=1)
sns.heatmap(df.isnull(),yticklabels=True, cbar=False, cmap='viridis')
plt.show()

However, in the context of time series data the entry often is simply not present at all; i.e. there is no index value for that specific date:

In [278]:
df = df.drop(pd.to_datetime("2020-01-04"))
display(df)
A
2020-01-01 -0.098591
2020-01-02 0.890374
2020-01-03 0.608196
2020-01-05 -1.028363
2020-01-06 0.552435
2020-01-07 0.231417
2020-01-08 -0.223659
2020-01-09 -0.059113
2020-01-10 0.151531
In [279]:
plt.figure(figsize=(5,4))
sns.set(font_scale=1)
sns.heatmap(df.isnull(),yticklabels=True, cbar=False, cmap='viridis')
plt.show()

Often when performing an analysis such as this you will want to have an idea of what day is missing! So, that begs the question, how can we create the missing indices and set their values to null? There are two main ways of going about this, and we will explore each.

1.4.1 Apply a reindex function

The first way that we can go about solving this issue is by applying a custom reindex function. Essentially, we will use pandas apply method, pass in our dataframe, create the set of indices that we want, reindex and fill missing values with nan:

In [280]:
def reindex_by_date(df):
    dates = pd.date_range(df.index.min(), df.index.max())
    return df.reindex(dates).fillna(np.nan)

df_reindexed = df.apply(reindex_by_date)
display(df_reindexed)
A
2020-01-01 -0.098591
2020-01-02 0.890374
2020-01-03 0.608196
2020-01-04 NaN
2020-01-05 -1.028363
2020-01-06 0.552435
2020-01-07 0.231417
2020-01-08 -0.223659
2020-01-09 -0.059113
2020-01-10 0.151531

And we have the data frame that we were looking for! Dates that were not originally present are now showing up with nan values, allowing us to use a heatmap to identify missing values!

1.4.2 Create a multiindex

Sometimes you may have time series data specific to multiple entities; for instance, you may have different companies, each of which you have several years worth of time series price data:

In [347]:
key = "starbucks.csv"

obj = s3.get_object(Bucket=bucket, Key=key)
df = pd.read_csv(obj['Body'], index_col="Date", parse_dates=True, usecols=["Date", "Close"])

companies = ["starbucks", "tim hortons", "dunkin donuts"]

dfs = [df.assign(company=company) for company in companies]
df = pd.concat(dfs)

display(df.head())
Close company
Date
2015-01-02 38.0061 starbucks
2015-01-05 37.2781 starbucks
2015-01-06 36.9748 starbucks
2015-01-07 37.8848 starbucks
2015-01-08 38.4961 starbucks

Visually we can see that we are now missing the 3rd and 4th of January (as well as 48 other dates) for Starbucks. We are also missing these values for the other companies!

In [348]:
display(df[df.company == "tim hortons"].head())
Close company
Date
2015-01-02 38.0061 tim hortons
2015-01-05 37.2781 tim hortons
2015-01-06 36.9748 tim hortons
2015-01-07 37.8848 tim hortons
2015-01-08 38.4961 tim hortons

Now, without any modification if we were to try our previous reindexing technique, we would receive the following error:

ValueError: ('cannot reindex from a duplicate axis', 'occurred at index Close')

This is because we have the same indice values (dates) for different companies. Pandas is not ready to handle this internally, and has specifically designed a MultiIndex for this exact scenario.

In [349]:
df_multi = df.set_index(
    ["company", df.index]
)

display(df_multi.head())
Close
company Date
starbucks 2015-01-02 38.0061
2015-01-05 37.2781
2015-01-06 36.9748
2015-01-07 37.8848
2015-01-08 38.4961

We now have a company and Date index, from which we can update our reindex_by_date function to handle:

In [350]:
def reindex_multi_by_date(df):
    dates = pd.date_range(
        df.index.get_level_values(1).min(),
        df.index.get_level_values(1).max()
    )
    new_idx = pd.MultiIndex.from_product(
        [df.index.levels[0], dates], 
        names=["company", "Date"]
    )
    return df.reindex(new_idx).fillna(np.nan)

df_multi = reindex_multi_by_date(df_multi)
display(df_multi.head())
Close
company Date
dunkin donuts 2015-01-02 38.0061
2015-01-03 NaN
2015-01-04 NaN
2015-01-05 37.2781
2015-01-06 36.9748
In [369]:
display(df_multi.iloc[
    df_multi.index.get_level_values("Date") == pd.to_datetime("2018-12-02")
])
Close
company Date
dunkin donuts 2018-12-02 NaN
starbucks 2018-12-02 NaN
tim hortons 2018-12-02 NaN
In [ ]:
 

Which leaves us with exactly what we want; NaN values for missing rows:

In [352]:
pivot = df_multi.reset_index().pivot(columns="company", index="Date", values="Close")

plt.figure(figsize=(10,6))
sns.set(font_scale=1)
sns.heatmap(pivot.isnull(), cbar=True, cmap='viridis')
plt.show()
In [353]:
print("Number of days without a close value: ", pivot.isnull().sum().sum())
Number of days without a close value:  1362

1.4.3 Groupby and apply reindex

Now, there actually is a way that we can use our original reindex_by_date function without needing a MultiIndex (and it may very well be far less hassle!). We can simply use a groupby before we apply the reindex function:

In [412]:
def reindex_by_date(df):
    dates = pd.date_range(df.index.min(), df.index.max())
    df = df.reindex(dates)
    df = df.assign(company=df.company.fillna(method="backfill"))
    return df
In [414]:
df_reindexed = df.groupby("company").apply(reindex_by_date).reset_index(0, drop=True)
display(df_reindexed.head())
Close company
2015-01-02 38.0061 dunkin donuts
2015-01-03 NaN dunkin donuts
2015-01-04 NaN dunkin donuts
2015-01-05 37.2781 dunkin donuts
2015-01-06 36.9748 dunkin donuts
In [418]:
df_reindexed.index.name = "Date"

And we will arrive at the same result, 1362 missing entries:

In [424]:
print("Number of days without a close value: ", df_reindexed.Close.isnull().sum())
Number of days without a close value:  1362

Again, we can see these visually with a heatmap:

In [423]:
pivot = df_reindexed.reset_index().pivot(columns="company", index="Date", values="Close")

plt.figure(figsize=(10,6))
sns.set(font_scale=1)
sns.heatmap(pivot.isnull(), cbar=True, cmap='viridis')
plt.show()

© 2018 Nathaniel Dake