The vast majority of my articles are written with the intention of highlighting some mathematical or computer science concept and tying it into a real-world example. In other words, we are operating as follows:
$$\text{Determine concept to teach} \longrightarrow \text{Find problem to demonstrate concept}$$
For instance, in the case of Bayesian Classifiers I decided that I wanted to cover that particular concept and all of its inner workings, and I then found problems (that I encountered working as a data scientist) in order to demonstrate how a Bayes Classifier would work in the real world.
However, this is not how things work in the real world! As Nassim Taleb points out, we have a tendency to force the messy real world into a clean, Platonic framework, even when it does not actually fit that mold. In reality, a data scientist is often given a vague problem that needs solving by making use of data. If lucky, they will be given a standalone data set (i.e. a csv), but often they may not even have that, leaving them to wrangle the necessary data from databases, APIs, and other sources.
I want to take some time to write an article that demonstrates how to handle being confronted with a data set and simply being tasked to "explore" it and find something useful (known as data mining). This post will specifically be based on the exploration of the online retail data set.
Assume that you are a data scientist for an online retail store. Management comes to you saying that they have a set of customer transaction data covering roughly one year. They want you to simply explore it and see if there is anything interesting; in other words, are there patterns in the data that could be relevant to marketing, trends that could be useful in customer prediction, and so on? We are starting with a very blank canvas; where do we begin?
There is a "general" exploration framework that is generally followed by most data scientists. It almost always starts with data preparation and basic exploration. Data preparation consists of gathering relevant data into a single location. For example, if you are an energy utility and you have customers that have billing and energy usage data held in different databases, this phase may consist of querying the different databases and exporting the desired tables (potentially for specific time intervals) to a specific format (csv). Once the relevant data is in a desirable format, the next phase consists of basic exploration. What exactly is meant by that? Well, that consists of things such as:
For this post we are starting at the tail end of the data preparation phase (thankfully a csv has already been created for us). For the exploration we will be using the Python scientific computing stack: pandas, numpy, scipy, matplotlib, etc. So, without further ado, let's begin!
To start, we can load our necessary libraries, and then load our data set:
import numpy as np
import pandas
import seaborn as sns
import random
import matplotlib.pyplot as plt
from matplotlib import rc, animation
from IPython.core.display import display, HTML
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import LinearLocator, FormatStrFormatter
from _plotly_future_ import v4_subplots
import cufflinks
import plotly.plotly as py
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
import plotly.figure_factory as ff
from util import get_csv_from_s3, get_obj_s3
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plotly.offline.init_notebook_mode(connected=True)
sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")
df_obj = get_obj_s3("data_customer_segmentation.csv")
df = pandas.read_csv(
    df_obj,
    encoding="ISO-8859-1",
    dtype={
        "CustomerID": str,
        "InvoiceNo": str,
    }
)
A good rule of thumb when just getting started is to simply utilize head() to get a glimpse of the data set:
display(df.head())
And then determine the shape of the data set, and get an idea of the different columns present (including their data types and number of missing values):
display(f"Dataframe Shape: {df.shape}")
def df_info(df):
    df_info = pandas.DataFrame(df.dtypes).T.rename(index={0: "Column Type"})
    df_null_count = pandas.DataFrame(df.isnull().sum()).T.rename(index={0: "# Null values"})
    df_null_percent = pandas.DataFrame(
        (df.isnull().sum() * 100) / df.shape[0]
    ).T.rename(index={0: "% Null values"})
    return df_info.append([df_null_count, df_null_percent])
display(df_info(df))
We immediately get an idea of the variables available to us for analysis, as well as where we may have particular gaps (null values). Right from the get-go we see that ~25% of transactions do not have an associated CustomerID. Based on the data that is readily available, it is impossible to impute these values. Because the vast majority of inferences that we will try to make will tie back to a particular customer, these rows can be dropped:
df = df.dropna(axis=0, subset=["CustomerID"])
display(f"Dataframe shape: {df.shape}")
display(df_info(df))
We will also want to check for duplicate entries:
display(f"Number of duplicate entries: {df.duplicated().sum()}")
And drop them:
df = df.drop_duplicates()
We have now accounted for null values and duplicate entries, meaning we are ready for basic data exploration.
Our dataframe contains 8 different features, defined as:
InvoiceNo: Invoice number. Nominal, a 6-digit integer number uniquely assigned to each transaction. If this code starts with a letter, it indicates a cancellation.
StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
Description: Product (item) name. Nominal.
Quantity: The quantities of each product (item) per transaction. Numeric.
InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
UnitPrice: Unit price. Numeric, product price per unit in sterling.
CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
Country: Country name. Nominal, the name of the country where each customer resides.
Let's start by exploring the countries from which the orders were made. It would be useful to know how many different countries orders have originated from.
unique_countries = df[["Country"]].groupby("Country").nunique()
display(f"Number of countries from which transactions occurred: {len(unique_countries)}")
We may also want to know the number of orders per country. To do this, we need to account for the fact that an order can have many items in it; this means that a single order may contain multiple rows. We can see this below:
display(df[df.InvoiceNo == "536365"])
We can see above that a single order, InvoiceNo = 536365, contained 7 different items, and hence 7 different rows in our dataframe. However, we just want to count the number of orders per country, so those rows should be grouped into a single value.
countries = df[
    ["InvoiceNo", "Country"]
].groupby(
    ["InvoiceNo", "Country"]
).count().reset_index()["Country"].value_counts()
We can visualize these results nicely with a choropleth map:
data = dict(
    type="choropleth",
    locations=countries.index,
    locationmode="country names",
    z=countries,
    text=countries.index,
    colorbar={"title": "Order #"},
    colorscale=[
        [0, 'rgb(224,255,255)'],
        [0.01, 'rgb(166,206,227)'],
        [0.02, 'rgb(31,120,180)'],
        [0.03, 'rgb(178,223,138)'],
        [0.05, 'rgb(51,160,44)'],
        [0.10, 'rgb(251,154,153)'],
        [0.20, 'rgb(255,255,0)'],
        [1, 'rgb(227,26,28)']
    ],
    reversescale=False
)
layout = dict(
    title="Number of orders per country",
    width=800,
    height=500,
)
choromap = go.Figure(data=[data], layout=layout)
iplot(choromap, validate=False)
We can see that the data is heavily dominated by orders placed from the UK.
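To put a rough number on that, we can compute the share of orders placed from the UK; this is a quick sanity check, assuming the label used in the Country column is exactly "United Kingdom":
# Share of orders placed from the UK, using the per-country order counts above
uk_share = countries.get("United Kingdom", 0) / countries.sum() * 100
display(f"Share of orders from the UK: {uk_share:.1f}%")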
Another thing that we may want to explore further is the number of customers and the number of products.
temp = pandas.DataFrame(
    [{
        "Products": df.StockCode.nunique(),
        "Transactions": df.InvoiceNo.nunique(),
        "Customers": df.CustomerID.nunique()
    }],
    index=["Quantity"]
)
display(temp)
We can see that our data set consists of 4372 customers who have bought 3684 different products over the course of 22,190 total transactions. Let's try to get an idea of the distribution of the number of products purchased per transaction:
temp = df.groupby(
    "InvoiceNo"
).count()[
    ["StockCode"]
].rename(
    columns={"StockCode": "Number unique products in transaction"}
)
display(temp.head())
Which can be visualized via the histogram below:
trace1 = go.Histogram(
    x=temp["Number unique products in transaction"],
    nbinsx=200,
    name="Number unique products in transaction",
    marker_color='blue',
)
data = [trace1]
layout = go.Layout(
    width=750,
    height=450,
    title="Distribution of unique products purchased per transaction",
    xaxis=dict(title="Number of unique products purchased"),
    yaxis=dict(title="Number of Orders"),
    barmode='stack'
)
fig = go.Figure(data=data, layout=layout)
fig.update_traces(opacity=0.75)
plotly.offline.iplot(fig)
And we can display a few metrics related to this data:
display(temp.describe())
This is all definitely useful data, and certainly interesting; for instance, we see that the number of unique products purchased follows a long-tail distribution, and that certain orders have over 500 items! However, right now these purchases are not tied to specific customers. Let's update our groupbys to remedy that. In order to implement a proper fix we will need to:
1. Include CustomerID in our groupby so that it does not get aggregated when we perform the count()
2. Take the mean of Number unique products in transaction per customer (since we will again be collapsing our data when trying to achieve customer-level granularity)
To start, let's look at the first groupby:
temp1 = df.groupby(
    ["InvoiceNo", "CustomerID"]
).count()[
    ["StockCode"]
].rename(
    columns={"StockCode": "Number unique products in transaction"}
).reset_index()
display(temp1.sort_values("Number unique products in transaction", ascending=False).head())
We can see that right now CustomerID = 14096 has many large transactions. We want to find the mean number of unique products in a transaction, over all transactions, per customer. We can achieve this with another groupby, this time on CustomerID. Additionally, we want to use the agg function in order to apply different aggregation functions (mean and count) to specific columns:
temp2 = temp1.groupby("CustomerID").agg(
    {
        "Number unique products in transaction": "mean",
        "InvoiceNo": "count"
    }
).rename(
    columns={
        "Number unique products in transaction":
            "Average number unique products in transaction",
        "InvoiceNo": "Number of transactions"
    }
).reset_index()
display(
    temp2.sort_values("Average number unique products in transaction", ascending=False).head(10)
)
We now have the Average number unique products in transaction and Number of transactions for individual customers. A question that may come up: are these two features correlated? We can get a visual gut check with the scatterplot below:
trace1 = go.Scatter(
    x=temp2["Average number unique products in transaction"],
    y=temp2["Number of transactions"],
    marker=dict(
        color='green',
    ),
    mode="markers"
)
data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=650,
    height=350,
    title="Average number unique products in transaction vs. Number of Transactions",
    xaxis=dict(title="Average number unique products in transaction"),
    yaxis=dict(title="Number of transactions")
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
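As a quick numeric companion to the scatterplot, we can also compute a rank correlation between the two features (a small check using spearmanr from scipy.stats, which is part of the scipy stack mentioned earlier):
from scipy.stats import spearmanr

# Spearman rank correlation between average basket size and number of transactions
rho, p_value = spearmanr(
    temp2["Average number unique products in transaction"],
    temp2["Number of transactions"]
)
display(f"Spearman correlation: {rho:.3f} (p-value: {p_value:.3g})")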
There looks to be a negative correlation, but that is to be expected given the nature of the dataset (it is mentioned that many of the customers are wholesale customers). We can look at the distribution of the average number of unique products in a transaction to see how it compares to the distribution of the number of unique products in a transaction:
trace1 = go.Histogram(
    x=temp2["Average number unique products in transaction"],
    nbinsx=150,
    name="Number unique products in transaction",
    marker_color='blue',
)
data = [trace1]
layout = go.Layout(
    width=750,
    height=450,
    title="Distribution of Average number of unique products purchased per transaction",
    xaxis=dict(title="Average number unique products in transaction"),
    yaxis=dict(title="Number of Customers"),
    barmode='stack'
)
fig = go.Figure(data=data, layout=layout)
fig.update_traces(opacity=0.75)
plotly.offline.iplot(fig)
Again we have what looks to be a long tail distribution. We can also look at the distribution of number of orders per customer:
trace1 = go.Histogram(
    x=temp2["Number of transactions"],
    nbinsx=150,
    name="Number of transactions per customer",
    marker_color='blue',
)
data = [trace1]
layout = go.Layout(
    width=750,
    height=450,
    title="Distribution of Number of transactions per customer",
    xaxis=dict(title="Number of transactions per customer"),
    yaxis=dict(title="Number of Customers"),
    barmode='stack'
)
fig = go.Figure(data=data, layout=layout)
fig.update_traces(opacity=0.75)
plotly.offline.iplot(fig)
Something interesting did come up during this exploration: the presence of cancelled orders. For example, let's take a peek at customer 14096 again:
display(temp1[temp1.CustomerID == "14096"][-5:])
The question arises, what should we do with these cancelled orders? Well, first let's assess the severity of the issue. To start, we can see how many transactions in total were cancelled:
number_products_per_transaction = df.groupby(
    ["CustomerID", "InvoiceNo"]
).count()[
    ["StockCode"]
].rename(
    columns={"StockCode": "Number unique products in transaction"}
).reset_index()
temp = number_products_per_transaction.assign(
    cancelled=(number_products_per_transaction.InvoiceNo.str[0] == "C")
)
display(temp.head())
n1 = temp.shape[0]
n2 = temp.cancelled.sum()
percentage_orders_cancelled = n2 / n1 * 100
print(f"Percentage of orders cancelled: {percentage_orders_cancelled}%")
Over 16% of orders end up cancelled! This is a very large number! Let's take a look at the start of the dataframe:
display(df.sort_values('CustomerID')[:5])
Interesting: it looks like when an order is cancelled, we get another transaction in the dataframe that is mostly identical, except for the Quantity (which is negated) and the InvoiceDate. We should figure out if this is true for all entries. We can do this via the following:
# Cancellations (excluding discounts, StockCode "D") and regular transactions
df_cancelled = df[(df.InvoiceNo.str[0] == "C") & (df.StockCode != "D")]
df_not_cancelled = df[df.InvoiceNo.str[0] != "C"]

num_with_counterpart = 0
for idx, row in df_cancelled.iterrows():
    # A counterpart is a non-cancelled row for the same customer and product
    # whose quantity is the exact negation of the cancelled quantity
    equivalence_condition = (
        (row.CustomerID == df_not_cancelled.CustomerID) &
        (row.StockCode == df_not_cancelled.StockCode) &
        (row.Quantity == -df_not_cancelled.Quantity)
    )
    if equivalence_condition.sum() > 0:
        num_with_counterpart += 1
print(num_with_counterpart)
Interesting: this condition clearly does not hold for all cases. We have a total of 8795 cancelled orders, but only 3224 of them perfectly cancel a prior order. For example, there are most certainly cases where a customer bought, say, 6 of a product, and then returned 3, meaning there is a transaction with an InvoiceNo that has a C prepended at the start. In this case, to correctly remedy the situation we must subtract the quantity in the cancellation from the nearest purchase of the same item. The logic is a little bit convoluted, but it can be done! We will first write a function that gathers some suitable test groups for us; this will then allow us to write a function that actually handles the cancellations.
def find_test_groups(num_groups=1000):
    temp = df.assign(
        cancelled=(df.InvoiceNo.str[0] == "C"),
        InvoiceDate=pandas.to_datetime(df.InvoiceDate)
    )
    grouped = temp.sort_values(
        'InvoiceDate'
    ).groupby(
        ['CustomerID', 'StockCode']
    )
    # Randomly sample groups, then keep only those with more than 5 rows
    # and at least one cancellation
    sampled_df_i = random.sample(list(grouped.groups.keys()), num_groups)
    df_list = list(map(
        lambda df_i: grouped.get_group(df_i), sampled_df_i
    ))
    df_list_select = list(filter(
        lambda df_i: ((df_i.shape[0] > 5) and (df_i.cancelled.sum() > 0)),
        df_list
    ))
    return df_list_select
my_group = find_test_groups()[0]
Below we can see an example group (based on CustomerID and StockCode) that has a cancelled value. We can check to make sure that the function we write to handle all cancelled values in the data set can handle this sample group. my_group currently has the shape:
display(my_group)
display(my_group.shape)
So, that means we expect our function to take in my_group and return a group that is (5, 9):
def handle_perform_shift(my_group):
    while np.any(my_group.Quantity < 0):
        my_group = my_group.assign(
            Quantity=perform_shift(my_group)
        )
    return my_group[my_group.Quantity > 0]

def perform_shift(my_group):
    # conditions:
    # - current value is positive, next value is negative
    # - current value is positive, next value is positive or NaN
    # - current value is negative, next value is positive or NaN
    # - current value is negative, next value is negative
    # - current value is 0
    conditions_list = [
        (my_group.Quantity > 0) & (my_group.Quantity.shift(-1) < 0),
        (my_group.Quantity > 0) & ((my_group.Quantity.shift(-1) >= 0) | (my_group.Quantity.shift(-1).isnull())),
        (my_group.Quantity < 0) & ((my_group.Quantity.shift(-1) > 0) | (my_group.Quantity.shift(-1).isnull())),
        (my_group.Quantity < 0) & (my_group.Quantity.shift(-1) < 0),
        (my_group.Quantity == 0)
    ]
    choices_list = [
        my_group.Quantity + my_group.Quantity.shift(-1),
        my_group.Quantity,
        0,
        my_group.Quantity.shift(-1),
        0
    ]
    return np.select(conditions_list, choices_list)
test_remove_cancellations = handle_perform_shift(my_group)
display(test_remove_cancellations)
display(test_remove_cancellations.shape)
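Since find_test_groups samples at random, it can also help to sanity check the function on a tiny hand-made group (the rows below are made up purely for illustration): a purchase of 6 units followed by a cancellation of 3 should collapse down to a single row with a quantity of 3.
# Hypothetical two-row group: a purchase of 6 followed by a cancellation of 3
toy_group = pandas.DataFrame({
    "InvoiceNo": ["100001", "C100002"],
    "CustomerID": ["99999", "99999"],
    "StockCode": ["85123A", "85123A"],
    "Quantity": [6, -3],
})
display(handle_perform_shift(toy_group))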
Our function seems to effectively remove cancellations! We can now apply it to the entire dataframe.
%%time
temp = df.assign(
    cancelled=(df.InvoiceNo.str[0] == "C"),
    InvoiceDate=pandas.to_datetime(df.InvoiceDate)
)
grouped = temp.sort_values(
    'InvoiceDate'
).groupby(
    ['CustomerID', 'StockCode']
)
df_no_cancellations = grouped.apply(handle_perform_shift).reset_index(drop=True)

display(df[df.Quantity > 0].shape)
display(df_no_cancellations[df_no_cancellations.Quantity < 0].shape)
display(df_no_cancellations.shape)
This looks great! Now this would be a great place for optimization; the apply can be performed in parallel with multiprocessing or dask, but I will leave that for another post.
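For the curious, here is a rough sketch of what the multiprocessing route might look like; it is not benchmarked or part of the analysis above, and it assumes a fork-based start method so that worker processes can see the grouped object and handle_perform_shift defined earlier as globals:
from multiprocessing import Pool

def process_keys(keys):
    # Apply the cancellation-handling function to a subset of the groups;
    # relies on the global `grouped` and `handle_perform_shift` defined above
    return pandas.concat([handle_perform_shift(grouped.get_group(k)) for k in keys])

def parallel_handle_cancellations(n_workers=4):
    keys = list(grouped.groups.keys())
    # Split the group keys into roughly equal chunks, one per worker
    chunks = [keys[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        results = pool.map(process_keys, chunks)
    return pandas.concat(results).reset_index(drop=True)

# e.g. df_no_cancellations = parallel_handle_cancellations()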
Now, let's take a peek at the StockCode values. We have seen that some values of StockCode deal with discounts; we can check to see what other types of codes there are:
special_codes = df_no_cancellations[
    df_no_cancellations.StockCode.str.contains('^[a-zA-Z]+', regex=True)
].StockCode.unique()
display(special_codes)

for code in special_codes:
    val = df_no_cancellations[df_no_cancellations.StockCode == code]["Description"].unique()[0]
    display("{:<15} -> {:<30}".format(code, val))
We can see that there are other types of strange charges (such as specific bank charges). We may come back to this, but for now let's move on to total item price. This is based on the UnitPrice of the item, times its Quantity:
df = df_no_cancellations.assign(
    item_price=df_no_cancellations.UnitPrice * df_no_cancellations.Quantity
)
display(df.head())
From here we can find the total order price:
df = df.assign(
    total_price=df.groupby("InvoiceNo")['item_price'].transform('sum')
)
df.head()
Which has the following distribution (showing clipped and full data set distributions below):
from plotly import tools

trace1 = go.Histogram(
    x=df.total_price.unique(),
    nbinsx=100,
    name="Full",
    marker_color='blue',
)
trace2 = go.Histogram(
    x=df.total_price[df.total_price < 5000].unique(),
    nbinsx=100,
    name="Clipped",
    marker_color='green',
)
fig = tools.make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Full Dataset", "Clipped at 5000"),
    print_grid=False
)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.update_yaxes(title_text="Number Orders", row=1, col=1)
fig.update_xaxes(title_text="Total Order Price", row=1, col=1)
fig.update_xaxes(title_text="Total Order Price", row=1, col=2)
fig.layout.update(title_text="Distribution of Total Order Price")
plotly.offline.iplot(fig)