Nathaniel Dake Blog

5. Customer Segmentation

The vast majority of my articles are written with the intention of highlighting some mathematical or computer science concept and tying it to a real-world example. In other words, we are operating as follows:

$$\text{Determine concept to teach} \longrightarrow \text{Find problem to demonstrate concept}$$

For instance, in the case of Bayesian Classifiers I decided that I wanted to cover that particular concept and all of its inner workings, and I then found problems (that I encountered working as a data scientist) in order to demonstrate how a Bayes Classifier would work in the real world.

However, this is not how things work in the real world! As Taleb might put it, that workflow tries to fit the messy real world into a tidy Platonic framework, when it may not really be that way. In reality, a data scientist is often given a vague problem that needs solving by making use of data. If they are lucky, they will be given a standalone data set (i.e. a csv), but often they may not even have that, leaving them to wrangle the necessary data from databases, APIs, and other sources.

I want to take some time to write an article that demonstrates how to handle being confronted with a data set and simply being tasked to "explore" it and find something useful (known as data mining). This post will specifically be based on the exploration of the online retail data set.

5.1 Background and Context

Assume that you are a data scientist for an online retail store. Management comes to you saying that they have a set of customer transaction data covering roughly one year. They want you to simply explore it and see if there is anything interesting; in other words, are there patterns in the data that could be relevant to marketing, trends that could be useful for customer prediction, and so on? We are starting with a very blank canvas; where do we begin?

There is a "general" exploration framework that is generally followed by most data scientists. It almost always starts with data preparation and basic exploration. Data preparation consists of gathering relevant data into a single location. For example, if you are an energy utility and you have customers that have billing and energy usage data held in different databases, this phase may consist of querying the different databases and exporting the desired tables (potentially for specific time intervals) to a specific format (csv). Once the relevant data is in a desirable format, the next phase consists of basic exploration. What exactly is meant by that? Well, that consists of things such as:

  • Determine size and shape of data set
  • Inspect variables of interest
  • Gather metrics about columns
  • Potentially perform different groupings
  • Feature generation
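
To make those steps concrete, here is a minimal pandas sketch (the file name and column names are hypothetical placeholders, not the data set used below):

import pandas

df = pandas.read_csv("prepared_data.csv")                 # hypothetical prepared csv
print(df.shape)                                           # size and shape of the data set
print(df.dtypes)                                          # variables of interest and their types
print(df.describe(include="all"))                         # basic metrics about each column
print(df.groupby("some_category").size())                 # an example grouping (hypothetical column)
df = df.assign(revenue=df.unit_price * df.quantity)       # an example generated feature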

For this post we are starting at the tail end of the data preparation phase (thankfully a csv has already been created for us). For the analysis we will be using the Python scientific computing stack: pandas, numpy, scipy, matplotlib, etc. So, without further ado, let's begin!

5.2 Data Preparation

To start, we can load our necessary libraries, and then load our data set:

In [1]:
import numpy as np
import pandas
import seaborn as sns
import random
import matplotlib.pyplot as plt
from matplotlib import rc, animation
from IPython.core.display import display, HTML
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import LinearLocator, FormatStrFormatter

from _plotly_future_ import v4_subplots
import cufflinks
import plotly.plotly as py
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
import plotly.figure_factory as ff

from util import get_csv_from_s3, get_obj_s3

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

plotly.offline.init_notebook_mode(connected=True)
sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")
In [2]:
df_obj = get_obj_s3("data_customer_segmentation.csv")
df = pandas.read_csv(
    df_obj, 
    encoding="ISO-8859-1",
    dtype={
        "CustomerID": str,
        "InvoiceID": str,
    }
)

A good rule of thumb when just getting started is to simply utilize head() to get a glimpse of the data set:

In [3]:
display(df.head())
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850 United Kingdom

And then determine the shape of the data set, and get an idea of the different columns present (including their data types and number of missing values):

In [4]:
display(f"Dataframe Shape: {df.shape}")
'Dataframe Shape: (541909, 8)'
In [5]:
def df_info(df):
    """Summarize each column's dtype along with its null count and null percentage."""
    dtypes = pandas.DataFrame(df.dtypes).T.rename(index={0: "Column Type"})
    null_count = pandas.DataFrame(df.isnull().sum()).T.rename(index={0: "# Null values"})
    null_percent = pandas.DataFrame(
        (df.isnull().sum() * 100) / df.shape[0]
    ).T.rename(index={0: "% Null values"})

    return dtypes.append([null_count, null_percent])

display(df_info(df))
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
Column Type object object object int64 object float64 object object
# Null values 0 0 1454 0 0 0 135080 0
% Null values 0 0 0.268311 0 0 0 24.9267 0

We immediately get an idea of the variables available to us for analysis, as well as where we have gaps (null values). Right from the get-go we see that ~25% of transactions do not have an associated CustomerID. Based on the data that is readily available, there is no sensible way to impute these values. Because the vast majority of inferences we will try to make tie back to a particular customer, these rows can be dropped:

In [6]:
df = df.dropna(axis=0, subset=["CustomerID"])

display(f"Dataframe shape: {df.shape}")
display(df_info(df))
'Dataframe shape: (406829, 8)'
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
Column Type object object object int64 object float64 object object
# Null values 0 0 0 0 0 0 0 0
% Null values 0 0 0 0 0 0 0 0

We also will want to be sure to check for duplicate entries:

In [7]:
display(f"Number of duplicate entries: {df.duplicated().sum()}")
'Number of duplicate entries: 5225'

And drop them:

In [8]:
df = df.drop_duplicates()

We have now accounted for null values and duplicate entries, meaning we are ready for basic data exploration.

5.3 Data Exploration

Our dataframe contains 8 different features, defined as:

  • InvoiceNo: Invoice number. Nominal, a 6-digit integer number uniquely assigned to each transaction. If this code starts with the letter "C", it indicates a cancellation (a quick check of this appears after the list).
  • StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
  • Description: Product (item) name. Nominal.
  • Quantity: The quantities of each product (item) per transaction. Numeric.
  • InvoiceDate: Invoice Date and time. Numeric, the day and time when each transaction was generated.
  • UnitPrice: Unit price. Numeric, Product price per unit in sterling.
  • CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
  • Country: Country name. Nominal, the name of the country where each customer resides.
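
As a quick sanity check on that cancellation convention (a minimal sketch, assuming df as loaded above), we can flag rows whose invoice number starts with "C" and see how many there are; section 5.3.3 digs into these in detail:

# Flag rows belonging to cancellation invoices (InvoiceNo prefixed with "C")
is_cancellation = df.InvoiceNo.str.startswith("C")
display(is_cancellation.sum())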

5.3.1 Geographic Exploration

Let's start by exploring the countries from which the orders were made. It would be useful to know how many different countries orders have originated from.

In [9]:
unique_countries = df.Country.unique()

display(f"Number of countries from which transactions occurred: {len(unique_countries)}")
'Number of countries from which transactions occurred: 37'

We may also want to know the number of orders per country. To do this, we need to account for the fact that an order can have many items in it; this means that a single order may contain multiple rows. We can see this below:

In [10]:
display(df[df.InvoiceNo == "536365"])
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 12/1/2010 8:26 2.55 17850 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 12/1/2010 8:26 3.39 17850 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 12/1/2010 8:26 2.75 17850 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 12/1/2010 8:26 3.39 17850 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 12/1/2010 8:26 3.39 17850 United Kingdom
5 536365 22752 SET 7 BABUSHKA NESTING BOXES 2 12/1/2010 8:26 7.65 17850 United Kingdom
6 536365 21730 GLASS STAR FROSTED T-LIGHT HOLDER 6 12/1/2010 8:26 4.25 17850 United Kingdom

We can see above that a single order, InvoiceNo = 536365, contained 7 different items, and hence spans 7 different rows in our dataframe. However, we just want to count the number of orders per country, so those rows should collapse to a single value per order.

In [11]:
countries = df[
    ["InvoiceNo", "Country"]
].groupby(
    ["InvoiceNo", "Country"]
).count().reset_index()["Country"].value_counts()

We can visualize these results nicely with a choropleth map:

In [12]:
data = dict(
    type="choropleth",
    locations=countries.index,
    locationmode="country names", 
    z=countries,
    text=countries.index, 
    colorbar={"title": "Order #"},
    colorscale=[
        [0, 'rgb(224,255,255)'],
        [0.01, 'rgb(166,206,227)'], 
        [0.02, 'rgb(31,120,180)'],
        [0.03, 'rgb(178,223,138)'], 
        [0.05, 'rgb(51,160,44)'],
        [0.10, 'rgb(251,154,153)'], 
        [0.20, 'rgb(255,255,0)'],
        [1, 'rgb(227,26,28)']
    ],    
    reversescale=False
)

layout = dict(
    title="Number of orders per country",
    width=800,
    height=500,
)

choromap = go.Figure(data = [data], layout = layout)
iplot(choromap, validate=False)

We can see that the data is heavily dominated by orders placed from the UK.
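
Since the choropleth above is interactive, it also helps to glance at the raw counts directly (a small check using the countries series computed above):

# Top 10 countries by number of orders
display(countries.head(10))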

5.3.2 Number of products purchased distributions

Another thing we may want to explore is the number of customers, products, and transactions.

In [13]:
temp = pandas.DataFrame(
    [{
        "Products": df.StockCode.nunique(),
        "Transactions": df.InvoiceNo.nunique(),
        "Customers": df.CustomerID.nunique()
    }],
    index=["Quantity"]
)

display(temp)
Customers Products Transactions
Quantity 4372 3684 22190

We can see that our data set consists of 4372 customers who have bought 3684 different products over the course of 22,190 total transactions. Let's try to get an idea of the distribution of number of products purchased per transaction:

In [14]:
temp = df.groupby(
    "InvoiceNo"
).count()[
    ["StockCode"]
].rename(
    columns={"StockCode": "Number unique products in transaction"}
)

display(temp.head())
Number unique products in transaction
InvoiceNo
536365 7
536366 2
536367 12
536368 4
536369 1

Which can be visualized via the histogram below:

In [15]:
trace1 = go.Histogram(
    x=temp["Number unique products in transaction"],
    nbinsx=200,
    name="Number unique products in transaction",
    marker_color='blue',
)

data = [trace1]

layout = go.Layout(
    width=750,
    height=450,
    title="Distribution of unique products purchased per transaction",
    xaxis=dict(title="Number of unique products purchased"),
    yaxis=dict(title="Number of Orders"),
    barmode='stack'
)

fig = go.Figure(data=data, layout=layout)
fig.update_traces(opacity=0.75)

plotly.offline.iplot(fig)

And we can display a few metrics related to this data:

In [16]:
display(temp.describe())
Number unique products in transaction
count 22190.000000
mean 18.098423
std 23.363888
min 1.000000
25% 3.000000
50% 12.000000
75% 24.000000
max 542.000000

This is all useful, and certainly interesting; for instance, we see that the number of unique products purchased follows a long-tail distribution, and that certain orders contain over 500 items! However, right now these purchases are not tied to specific customers. Let's update our groupby operations to remedy that. In order to implement a proper fix we will need to:

  • Include CustomerID in our groupby so that it does not get aggregated when we perform the count()
  • Perform a second groupby in order to find the mean Number unique products in transaction per customer (since we will again be collapsing our data when trying to achieve customer-level granularity)

To start let's look at the first groupby:

In [17]:
temp1 = df.groupby(
    ["InvoiceNo", "CustomerID"]
).count()[
    ["StockCode"]
].rename(
    columns={"StockCode": "Number unique products in transaction"}
).reset_index()

display(temp1.sort_values("Number unique products in transaction", ascending=False).head())
InvoiceNo CustomerID Number unique products in transaction
16242 576339 14096 542
17524 579196 14096 533
18165 580727 14096 529
17128 578270 14096 442
15081 573576 14096 435

We can see that right now CustomerID = 14096 has many large transactions. We want to find the mean number of unique products per transaction, over all transactions, per customer. We can achieve this with another groupby, this time on CustomerID. Additionally, we want to use the agg function in order to apply different aggregation functions (mean and count) to specific columns:

In [18]:
temp2 = temp1.groupby("CustomerID").agg(
    {
        "Number unique products in transaction": "mean",
        "InvoiceNo": "count"
    }
).rename(
    columns={
        "Number unique products in transaction": 
        "Average number unique products in transaction",
        "InvoiceNo": "Number of transactions"
    }
).reset_index()

display(
    temp2.sort_values("Average number unique products in transaction", ascending=False).head(10)
)
CustomerID Average number unique products in transaction Number of transactions
27 12378 219.000000 1
282 12688 171.000000 1
4178 18022 164.000000 1
3293 16800 156.000000 1
2306 15464 155.000000 1
1952 15004 154.000000 1
1300 14096 150.823529 34
2949 16340 141.000000 1
889 13521 139.333333 3
1954 15006 132.000000 1

We now have the Average number unique products in transaction and Number of transactions for individual customers. A question that may come up: are these two features correlated? We can get a visual gut check with the scatterplot below:

In [19]:
trace1 = go.Scatter(
    x=temp2["Average number unique products in transaction"],
    y=temp2["Number of transactions"],
    marker = dict(
        color = 'green',
    ),
    mode="markers"
)

data = [trace1]

layout = go.Layout(
    showlegend=False,
    width=650,
    height=350,
    title="Average number unique products in transaction vs. Number of Transactions",
    xaxis=dict(title="Average number unique products in transaction"),
    yaxis=dict(title="Number of transactions")
)

fig = go.Figure(data=data, layout=layout)

plotly.offline.iplot(fig)
# html_fig = plotly.io.to_html(fig, include_plotlyjs=True)
# display(HTML(html_fig))
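
Before drawing conclusions from the scatterplot alone, a rough numeric check (a minimal sketch, assuming temp2 from above) is to compute a rank correlation between the two features:

# Spearman rank correlation between average basket breadth and number of transactions
display(
    temp2[["Average number unique products in transaction", "Number of transactions"]]
    .corr(method="spearman")
)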

There looks to be a negative correlation, but that is to be expected given the nature of the dataset (it is mentioned that many of the customers are wholesale customers). We can look at the distribution of the average number of unique products per transaction to see how it compares to the per-transaction distribution we plotted earlier:

In [20]:
trace1 = go.Histogram(
    x=temp2["Average number unique products in transaction"],
    nbinsx=150,
    name="Number unique products in transaction",
    marker_color='blue',
)

data = [trace1]

layout = go.Layout(
    width=750,
    height=450,
    title="Distribution of Average number of unique products purchased per transaction",
    xaxis=dict(title="Average number unique products in transaction"),
    yaxis=dict(title="Number of Customers"),
    barmode='stack'
)

fig = go.Figure(data=data, layout=layout)
fig.update_traces(opacity=0.75)

plotly.offline.iplot(fig)

Again we have what looks to be a long tail distribution. We can also look at the distribution of number of orders per customer:

In [21]:
trace1 = go.Histogram(
    x=temp2["Number of transactions"],
    nbinsx=150,
    name="Number of transactions per customer",
    marker_color='blue',
)

data = [trace1]

layout = go.Layout(
    width=750,
    height=450,
    title="Distribution of Number of transactions per customer",
    xaxis=dict(title="Number of transactions per customer"),
    yaxis=dict(title="Number of Customers"),
    barmode='stack'
)

fig = go.Figure(data=data, layout=layout)
fig.update_traces(opacity=0.75)

plotly.offline.iplot(fig)

5.3.3 Cancelled Orders

Something interesting did come up during this exploration: the presence of cancelled orders. For example, let's take a peek at customer 14096 again:

In [22]:
display(temp1[temp1.CustomerID == "14096"][-5:])
InvoiceNo CustomerID Number unique products in transaction
21769 C575606 14096 1
21832 C576338 14096 1
21969 C578269 14096 1
22000 C579195 14096 1
22120 C580726 14096 1

The question arises, what should we do with these cancelled orders? Well, first let's assess the severity of the issue. To start, we can see how many transactions in total were cancelled:

In [23]:
number_products_per_transaction = df.groupby(
    ["CustomerID", "InvoiceNo"]
).count()[
    ["StockCode"]
].rename(
    columns={"StockCode": "Number unique products in transaction"}
).reset_index()

temp = number_products_per_transaction.assign(
    cancelled=(number_products_per_transaction.InvoiceNo.str[0] == "C")
)

display(temp.head())
CustomerID InvoiceNo Number unique products in transaction cancelled
0 12346 541431 1 False
1 12346 C541433 1 True
2 12347 537626 31 False
3 12347 542237 29 False
4 12347 549222 24 False
In [24]:
n1 = temp.shape[0]
n2 = temp.cancelled.sum()

percentage_orders_cancelled = n2 / n1 * 100

print(f"Percentage of orders cancelled: {percentage_orders_cancelled}%")
Percentage of orders cancelled: 16.466876971608833%

Over 16% of orders end up cancelled! This is a very large number! Let's take a look at the start of the dataframe:

In [25]:
display(df.sort_values('CustomerID')[:5])
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
61619 541431 23166 MEDIUM CERAMIC TOP STORAGE JAR 74215 1/18/2011 10:01 1.04 12346 United Kingdom
61624 C541433 23166 MEDIUM CERAMIC TOP STORAGE JAR -74215 1/18/2011 10:17 1.04 12346 United Kingdom
286623 562032 22375 AIRLINE BAG VINTAGE JET SET BROWN 4 8/2/2011 8:48 4.25 12347 Iceland
72260 542237 84991 60 TEATIME FAIRY CAKE CASES 24 1/26/2011 14:30 0.55 12347 Iceland
14943 537626 22772 PINK DRAWER KNOB ACRYLIC EDWARDIAN 12 12/7/2010 14:57 1.25 12347 Iceland

Interesting: it looks like when an order is cancelled, there is another transaction in the dataframe, mostly identical except for the Quantity (negated) and the InvoiceDate. We should figure out if this holds for all entries. We can check as follows:

In [26]:
df_cancelled = df[(df.InvoiceNo.str[0] == "C") & (df.StockCode != "D")]
df_not_cancelled = df[df.InvoiceNo.str[0] != "C"]
In [27]:
num_with_counterpart = 0
for idx, row in df_cancelled.iterrows():
    # Does this cancellation have an exact counterpart: same customer, same product,
    # and a quantity that exactly negates a non-cancelled purchase?
    equivalence_condition = (
        (row.CustomerID == df_not_cancelled.CustomerID) & 
        (row.StockCode == df_not_cancelled.StockCode) &
        (row.Quantity == -df_not_cancelled.Quantity)
    )
    
    if equivalence_condition.sum() > 0:
        num_with_counterpart += 1
        
print(num_with_counterpart)
3246

Interesting: this condition clearly does not hold for all cases. We have a total of 8795 cancellation entries, but only 3246 of them perfectly cancel a prior purchase. For example, there are most certainly cases where a customer bought, say, 6 of a product, and then returned 3, meaning there is a transaction with an InvoiceNo that has C prepended at the start. In this case, to correctly remedy the situation we must subtract the quantity in the cancellation from the nearest purchase of the same item. The logic is a little bit convoluted, but it can be done! We will first write a function that gathers some suitable test groups for us; this will then allow us to write a function that actually handles the cancellations.

In [28]:
def find_test_groups(num_groups=1000):
    temp = df.assign(
        cancelled=(df.InvoiceNo.str[0] == "C"),
        InvoiceDate=pandas.to_datetime(df.InvoiceDate)
    )

    grouped = temp.sort_values(
        'InvoiceDate'
    ).groupby(
        ['CustomerID', 'StockCode']
    )

    sampled_df_i = random.sample(list(grouped.groups.keys()), num_groups)

    df_list = list(map(
        lambda df_i: grouped.get_group(df_i), sampled_df_i
    ))
    
    df_list_select = list(filter(
        lambda df_i: ((df_i.shape[0] > 5) and (df_i.cancelled.sum() > 0)), 
        df_list
    ))
    
    return df_list_select
In [29]:
my_group = find_test_groups()[0]

Below we can see an example group (based on CustomerID and StockCode) that contains a cancellation. We can use it to verify that the function we write to handle all cancellations in the data set behaves correctly. First the group itself, and then its shape:

In [58]:
display(my_group)
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country cancelled
241447 558232 22169 FAMILY ALBUM WHITE PICTURE FRAME 1 2011-06-27 14:30:00 8.5 17841 United Kingdom False
252124 559068 22169 FAMILY ALBUM WHITE PICTURE FRAME 1 2011-07-06 10:41:00 8.5 17841 United Kingdom False
264712 560115 22169 FAMILY ALBUM WHITE PICTURE FRAME 1 2011-07-15 10:05:00 8.5 17841 United Kingdom False
315549 C564730 22169 FAMILY ALBUM WHITE PICTURE FRAME -1 2011-08-28 12:47:00 8.5 17841 United Kingdom True
402356 571498 22169 FAMILY ALBUM WHITE PICTURE FRAME 1 2011-10-17 15:06:00 8.5 17841 United Kingdom False
424194 573245 22169 FAMILY ALBUM WHITE PICTURE FRAME 1 2011-10-28 12:08:00 8.5 17841 United Kingdom False
531442 580956 22169 FAMILY ALBUM WHITE PICTURE FRAME 2 2011-12-06 14:23:00 8.5 17841 United Kingdom False
In [61]:
display(my_group.shape)
(7, 9)

So, that means we expect our function to take in my_group and return a group that is (5, 9):

In [32]:
def handle_perform_shift(my_group):
    
    while np.any(my_group.Quantity < 0):
        my_group = my_group.assign(
            Quantity=perform_shift(my_group)
        )
        
    return my_group[my_group.Quantity > 0]


def perform_shift(my_group):
    
    # conditions:
    # - current value is positive, next value is negative
    # - current value is positive, next value is positive or NaN
    # - current value is negative, next value is positive or Nan
    # - current value is negative, next value is negative
    # - current value is 0
    conditions_list = [
        (my_group.Quantity > 0) & (my_group.Quantity.shift(-1) < 0),
        (my_group.Quantity > 0) & ((my_group.Quantity.shift(-1) >= 0) | (my_group.Quantity.shift(-1).isnull())),
        (my_group.Quantity < 0) & ((my_group.Quantity.shift(-1) > 0) | (my_group.Quantity.shift(-1).isnull())),
        (my_group.Quantity < 0) & (my_group.Quantity.shift(-1) < 0),
        (my_group.Quantity == 0)
    ]

    choices_list = [
        my_group.Quantity + my_group.Quantity.shift(-1),
        my_group.Quantity,
        0,
        my_group.Quantity.shift(-1),
        0
    ]

    return np.select(conditions_list, choices_list)
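
Before running this on my_group, a tiny synthetic example (a sketch; the data below is made up, not taken from the dataset) makes the shift logic concrete: a purchase of 6 followed by a cancellation of 3 should net out to a single purchase of 3, while an unrelated later purchase of 4 is left alone:

# Toy group: bought 6, then cancelled 3 of them, then bought 4 more
toy_group = pandas.DataFrame({"Quantity": [6, -3, 4]})

display(handle_perform_shift(toy_group))
# Expected: two rows remain, with Quantity 3.0 and 4.0; the cancellation row is dropped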
In [33]:
test_remove_cancellations = handle_perform_shift(my_group)
In [60]:
display(test_remove_cancellations)
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country cancelled
241447 558232 22169 FAMILY ALBUM WHITE PICTURE FRAME 1.0 2011-06-27 14:30:00 8.5 17841 United Kingdom False
252124 559068 22169 FAMILY ALBUM WHITE PICTURE FRAME 1.0 2011-07-06 10:41:00 8.5 17841 United Kingdom False
402356 571498 22169 FAMILY ALBUM WHITE PICTURE FRAME 1.0 2011-10-17 15:06:00 8.5 17841 United Kingdom False
424194 573245 22169 FAMILY ALBUM WHITE PICTURE FRAME 1.0 2011-10-28 12:08:00 8.5 17841 United Kingdom False
531442 580956 22169 FAMILY ALBUM WHITE PICTURE FRAME 2.0 2011-12-06 14:23:00 8.5 17841 United Kingdom False
In [62]:
display(test_remove_cancellations.shape)
(5, 9)

Our function seems to effectively remove cancellations! We can now apply it to the entire dataframe.

In [39]:
%%time

temp = df.assign(
    cancelled=(df.InvoiceNo.str[0] == "C"),
    InvoiceDate=pandas.to_datetime(df.InvoiceDate)
)

grouped = temp.sort_values(
    'InvoiceDate'
).groupby(
    ['CustomerID', 'StockCode']
)
    
df_no_cancellations = grouped.apply(handle_perform_shift).reset_index(drop=True)
CPU times: user 6min 37s, sys: 4 s, total: 6min 41s
Wall time: 6min 42s
In [40]:
df[df.Quantity > 0].shape
Out[40]:
(392732, 8)
In [41]:
df_no_cancellations[df_no_cancellations.Quantity < 0].shape
Out[41]:
(0, 9)
In [42]:
df_no_cancellations.shape
Out[42]:
(389929, 9)

This looks good! It would also be a natural place for optimization; the apply could be performed in parallel with multiprocessing or dask, but I will leave that for another post.

5.3.4 StockCode Exploration

Now, let's take a peek at the StockCodes. We have seen that some values of StockCode deal with discounts; we can check to see what other types of codes there are:

In [66]:
special_codes = df_no_cancellations[
    df_no_cancellations.StockCode.str.contains('^[a-zA-Z]+', regex=True)
].StockCode.unique()

display(special_codes)
array(['POST', 'M', 'C2', 'BANK CHARGES', 'PADS', 'DOT'], dtype=object)
In [74]:
for code in special_codes:
    val = df_no_cancellations[df_no_cancellations.StockCode == code]["Description"].unique()[0]
    display("{:<15} -> {:<30}".format(code, val))
'POST            -> POSTAGE                       '
'M               -> Manual                        '
'C2              -> CARRIAGE                      '
'BANK CHARGES    -> Bank Charges                  '
'PADS            -> PADS TO MATCH ALL CUSHIONS    '
'DOT             -> DOTCOM POSTAGE                '
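
Most of these are service-type entries (postage, manual adjustments, bank charges) rather than products. If we later want a product-only view, a simple filter would work (a sketch, reusing the same regex as above):

# Keep only rows whose StockCode does not start with a letter, i.e. actual products
df_products_only = df_no_cancellations[
    ~df_no_cancellations.StockCode.str.contains('^[a-zA-Z]+', regex=True)
]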

5.3.5 Total Cart Price

We can see that there are other types of special charges (such as bank charges). We may come back to these, but for now let's move on to total item price. This is based on the UnitPrice of the item times its Quantity:

In [78]:
df = df_no_cancellations.assign(
    item_price=df_no_cancellations.UnitPrice * df_no_cancellations.Quantity
)

display(df.head())
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country cancelled item_price
0 549222 16008 SMALL FOLDING SCISSOR(POINTED EDGE) 24.0 2011-04-07 10:43:00 0.25 12347 Iceland False 6.0
1 556201 17021 NAMASTE SWAGAT INCENSE 36.0 2011-06-09 13:01:00 0.30 12347 Iceland False 10.8
2 549222 20665 RED RETROSPOT PURSE 6.0 2011-04-07 10:43:00 2.95 12347 Iceland False 17.7
3 542237 20719 WOODLAND CHARLOTTE BAG 10.0 2011-01-26 14:30:00 0.85 12347 Iceland False 8.5
4 562032 20719 WOODLAND CHARLOTTE BAG 10.0 2011-08-02 08:48:00 0.85 12347 Iceland False 8.5

From here we can find the total order price:

In [82]:
df = df.assign(
    total_price=df.groupby("InvoiceNo")['item_price'].transform('sum')
)
In [114]:
df.head()
Out[114]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country cancelled item_price total_price
0 549222 16008 SMALL FOLDING SCISSOR(POINTED EDGE) 24.0 2011-04-07 10:43:00 0.25 12347 Iceland False 6.0 636.25
1 556201 17021 NAMASTE SWAGAT INCENSE 36.0 2011-06-09 13:01:00 0.30 12347 Iceland False 10.8 382.52
2 549222 20665 RED RETROSPOT PURSE 6.0 2011-04-07 10:43:00 2.95 12347 Iceland False 17.7 636.25
3 542237 20719 WOODLAND CHARLOTTE BAG 10.0 2011-01-26 14:30:00 0.85 12347 Iceland False 8.5 475.39
4 562032 20719 WOODLAND CHARLOTTE BAG 10.0 2011-08-02 08:48:00 0.85 12347 Iceland False 8.5 584.91

Which has the following distribution (showing clipped and full data set distributions below):

In [113]:
from plotly import tools

# Total price per order (total_price is constant within an invoice, so take one value per invoice)
order_totals = df.groupby("InvoiceNo")["total_price"].first()

trace1 = go.Histogram(
    x=order_totals,
    nbinsx=100,
    name="Full",
    marker_color='blue',
)

trace2 = go.Histogram(
    x=order_totals[order_totals < 5000],
    nbinsx=100,
    name="Clipped",
    marker_color='green',
)

fig = tools.make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Full Dataset", "Clipped at 5000"),
    print_grid=False
)

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)

fig.update_yaxes(title_text="Number Orders", row=1, col=1)
fig.update_xaxes(title_text="Total Order Price", row=1, col=1)
fig.update_xaxes(title_text="Total Order Price", row=1, col=2)

fig.layout.update(title_text="Distribution of Total Order Price")

plotly.offline.iplot(fig)
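
As a numeric complement to the histograms (a small sketch, assuming df with total_price as constructed above), we can also summarize the per-order totals directly:

# Per-order totals: one value per invoice, then summary statistics
display(df.groupby("InvoiceNo")["total_price"].first().describe())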

© 2018 Nathaniel Dake