The vast majority of my articles are written with the intention of highlighting some mathematical or computer science concept and tying it into a real-world example. In other words, we are operating as follows:
$$\text{Determine concept to teach} \longrightarrow \text{Find problem to demonstrate concept}$$
For instance, in the case of Bayesian Classifiers I decided that I wanted to cover that particular concept and all of its inner workings, and I then found problems (that I encountered working as a data scientist) in order to demonstrate how a Bayes Classifier would work in the real world.
However, this is not how things work in the real world! As Nassim Taleb points out, we have a tendency to force the messy real world into a clean, Platonic framework, even when it does not actually fit that mold. In reality, a data scientist is often given a vague problem that needs solving by making use of data. If lucky, they will be given a standalone data set (i.e. a csv), but often they may not even have that, leaving them to wrangle the necessary data from databases, APIs, and other sources.
I want to take some time to write an article that demonstrates how to handle being confronted with a data set and simply being tasked to "explore" it and find something useful (known as data mining). This post will specifically be based on the exploration of the online retail data set.
Assume that you are a data scientist for an online retail store. Management comes to you saying that they have a set of customer transaction data covering roughly one year. They want you to simply explore it and see if there is anything interesting; in other words, are there patterns in the data that could be relevant to marketing, trends that could be useful in customer prediction, and so on? We are starting with a very blank canvas; where do we begin?
There is a "general" exploration framework that is generally followed by most data scientists. It almost always starts with data preparation and basic exploration. Data preparation consists of gathering relevant data into a single location. For example, if you are an energy utility and you have customers that have billing and energy usage data held in different databases, this phase may consist of querying the different databases and exporting the desired tables (potentially for specific time intervals) to a specific format (csv). Once the relevant data is in a desirable format, the next phase consists of basic exploration. What exactly is meant by that? Well, that consists of things such as:
For this post we are starting at the tail end of the data preparation phase (thankfully a csv has already been created for us). For the exploration we will be using the Python scientific computing stack: pandas, numpy, scipy, matplotlib, etc. So, without further ado, let's begin!
To start, we can load our necessary libraries, and then load our data set:
import numpy as np
import pandas
import seaborn as sns
import random
import matplotlib.pyplot as plt
from matplotlib import rc, animation
from IPython.core.display import display, HTML
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import LinearLocator, FormatStrFormatter
from _plotly_future_ import v4_subplots
import cufflinks
import plotly.plotly as py
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
import plotly.figure_factory as ff
from util import get_csv_from_s3, get_obj_s3
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plotly.offline.init_notebook_mode(connected=True)
sns.set(style="white", palette="husl")
sns.set_context("talk")
sns.set_style("ticks")
df_obj = get_obj_s3("data_customer_segmentation.csv")
df = pandas.read_csv(
    df_obj,
    encoding="ISO-8859-1",
    dtype={
        "CustomerID": str,
        "InvoiceNo": str,
    }
)
A good rule of thumb when just getting started is to simply utilize head() to get a glimpse of the data set:
display(df.head())
And then determine the shape of the data set, and get an idea of the different columns present (including their data types and number of missing values):
display(f"Dataframe Shape: {df.shape}")
def df_info(df):
    df_info = pandas.DataFrame(df.dtypes).T.rename(index={0: "Column Type"})
    df_null_count = pandas.DataFrame(df.isnull().sum()).T.rename(index={0: "# Null values"})
    df_null_percent = pandas.DataFrame(
        (df.isnull().sum() * 100) / df.shape[0]
    ).T.rename(index={0: "% Null values"})
    return df_info.append([df_null_count, df_null_percent])
display(df_info(df))
We immediately get an idea of the variables available to us for analysis, as well as where we may have particular gaps (null values). Right from the get-go we see that ~25% of transactions do not have an associated CustomerID. Based on the data that is readily available, it is impossible to impute these values. Because the vast majority of inferences that we will try to make will tie back to a particular customer, these rows can be dropped:
df = df.dropna(axis=0, subset=["CustomerID"])
display(f"Dataframe shape: {df.shape}")
display(df_info(df))
We will also want to check for duplicate entries:
display(f"Number of duplicate entries: {df.duplicated().sum()}")
And drop them:
df = df.drop_duplicates()
We have now accounted for null values and duplicate entries, meaning we are ready for basic data exploration.
Our dataframe contains 8 different features, defined as:
InvoiceNo: Invoice number. Nominal, a 6-digit integer number uniquely assigned to each transaction. If this code starts with a letter, it indicates a cancellation.
StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
Description: Product (item) name. Nominal.
Quantity: The quantities of each product (item) per transaction. Numeric.
InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
UnitPrice: Unit price. Numeric, product price per unit in sterling.
CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
Country: Country name. Nominal, the name of the country where each customer resides.
Let's start by exploring the countries from which the orders were made. It would be useful to know how many different countries orders have originated from.
unique_countries = df[["Country"]].groupby("Country").nunique()
display(f"Number of countries from which transactions occurred: {len(unique_countries)}")
We may also want to know the number of orders per country. To do this, we need to account for the fact that an order can have many items in it; this means that a single order may contain multiple rows. We can see this below:
display(df[df.InvoiceNo == "536365"])
We can see above that a single order, InvoiceNo = 536365, contained 7 different items, and hence 7 different rows in our dataframe. However, we just want to count the number of orders per country, so those rows should be grouped into a single value.
countries = df[
    ["InvoiceNo", "Country"]
].groupby(
    ["InvoiceNo", "Country"]
).count().reset_index()["Country"].value_counts()
We can visualize these results nicely with a choropleth map:
data = dict(
    type="choropleth",
    locations=countries.index,
    locationmode="country names",
    z=countries,
    text=countries.index,
    colorbar={"title": "Order #"},
    colorscale=[
        [0, 'rgb(224,255,255)'],
        [0.01, 'rgb(166,206,227)'],
        [0.02, 'rgb(31,120,180)'],
        [0.03, 'rgb(178,223,138)'],
        [0.05, 'rgb(51,160,44)'],
        [0.10, 'rgb(251,154,153)'],
        [0.20, 'rgb(255,255,0)'],
        [1, 'rgb(227,26,28)']
    ],
    reversescale=False
)
layout = dict(
    title="Number of orders per country",
    width=800,
    height=500,
)
choromap = go.Figure(data=[data], layout=layout)
iplot(choromap, validate=False)
We can see that the data is heavily dominated by orders placed from the UK.
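To put a rough number on that, we can compute the share of orders placed from the UK; this is a quick sanity check, assuming the label used in the Country column is exactly "United Kingdom":
# Share of orders placed from the UK, using the per-country order counts above
uk_share = countries.get("United Kingdom", 0) / countries.sum() * 100
display(f"Share of orders from the UK: {uk_share:.1f}%")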
Another thing that we may want to explore further is the number of customers and the number of products.
temp = pandas.DataFrame(
    [{
        "Products": df.StockCode.nunique(),
        "Transactions": df.InvoiceNo.nunique(),
        "Customers": df.CustomerID.nunique()
    }],
    index=["Quantity"]
)
display(temp)
We can see that our data set consists of 4372 customers who have bought 3684 different products over the course of 22,190 total transactions. Let's try to get an idea of the distribution of the number of products purchased per transaction:
temp = df.groupby(
    "InvoiceNo"
).count()[
    ["StockCode"]
].rename(
    columns={"StockCode": "Number unique products in transaction"}
)
display(temp.head())
Which can be visualized via the histogram below:
trace1 = go.Histogram(
    x=temp["Number unique products in transaction"],
    nbinsx=200,
    name="Number unique products in transaction",
    marker_color='blue',
)
data = [trace1]
layout = go.Layout(
    width=750,
    height=450,
    title="Distribution of unique products purchased per transaction",
    xaxis=dict(title="Number of unique products purchased"),
    yaxis=dict(title="Number of Orders"),
    barmode='stack'
)
fig = go.Figure(data=data, layout=layout)
fig.update_traces(opacity=0.75)
plotly.offline.iplot(fig)
And we can display a few metrics related to this data:
display(temp.describe())
This is all definitely useful data, and certainly interesting; for instance, we see that the number of unique products purchased follows a long-tail distribution, and that certain orders have over 500 items! However, right now these purchases are not tied to specific customers. Let's update our groupbys to remedy that. In order to implement a proper fix we will need to:
1. Include CustomerID in our groupby so that it does not get aggregated when we perform the count()
2. Take the mean of Number unique products in transaction per customer (since we will again be collapsing our data when trying to achieve customer-level granularity)
To start, let's look at the first groupby:
temp1 = df.groupby(
    ["InvoiceNo", "CustomerID"]
).count()[
    ["StockCode"]
].rename(
    columns={"StockCode": "Number unique products in transaction"}
).reset_index()
display(temp1.sort_values("Number unique products in transaction", ascending=False).head())
We can see that right now CustomerID = 14096 has many large transactions. We want to find the mean number of unique products in a transaction, over all transactions, per customer. We can achieve this with another groupby, this time on CustomerID. Additionally, we want to use the agg function in order to apply different aggregation functions (mean and count) to specific columns:
temp2 = temp1.groupby("CustomerID").agg(
    {
        "Number unique products in transaction": "mean",
        "InvoiceNo": "count"
    }
).rename(
    columns={
        "Number unique products in transaction":
            "Average number unique products in transaction",
        "InvoiceNo": "Number of transactions"
    }
).reset_index()
display(
    temp2.sort_values("Average number unique products in transaction", ascending=False).head(10)
)
We now have the Average number unique products in transaction and Number of transactions for individual customers. A question that may come up: are these two features correlated? We can get a visual gut check with the scatterplot below:
trace1 = go.Scatter(
    x=temp2["Average number unique products in transaction"],
    y=temp2["Number of transactions"],
    marker=dict(
        color='green',
    ),
    mode="markers"
)
data = [trace1]
layout = go.Layout(
    showlegend=False,
    width=650,
    height=350,
    title="Average number unique products in transaction vs. Number of Transactions",
    xaxis=dict(title="Average number unique products in transaction"),
    yaxis=dict(title="Number of transactions")
)
fig = go.Figure(data=data, layout=layout)
plotly.offline.iplot(fig)
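As a quick numeric companion to the scatterplot, we can also compute a rank correlation between the two features (a small check using spearmanr from scipy.stats, which is part of the scipy stack mentioned earlier):
from scipy.stats import spearmanr

# Spearman rank correlation between average basket size and number of transactions
rho, p_value = spearmanr(
    temp2["Average number unique products in transaction"],
    temp2["Number of transactions"]
)
display(f"Spearman correlation: {rho:.3f} (p-value: {p_value:.3g})")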
There looks to be a negative correlation, but that is to be expected given the nature of the dataset (it is mentioned that many of the customers are wholesale customers). We can look at the distribution of the average number of unique products in a transaction to see how it compares to the distribution of the number of unique products in a transaction:
trace1 = go.Histogram(
    x=temp2["Average number unique products in transaction"],
    nbinsx=150,
    name="Number unique products in transaction",
    marker_color='blue',
)
data = [trace1]
layout = go.Layout(
    width=750,
    height=450,
    title="Distribution of Average number of unique products purchased per transaction",
    xaxis=dict(title="Average number unique products in transaction"),
    yaxis=dict(title="Number of Customers"),
    barmode='stack'
)
fig = go.Figure(data=data, layout=layout)
fig.update_traces(opacity=0.75)
plotly.offline.iplot(fig)
Again we have what looks to be a long tail distribution. We can also look at the distribution of number of orders per customer:
trace1 = go.Histogram(
    x=temp2["Number of transactions"],
    nbinsx=150,
    name="Number of transactions per customer",
    marker_color='blue',
)
data = [trace1]
layout = go.Layout(
    width=750,
    height=450,
    title="Distribution of Number of transactions per customer",
    xaxis=dict(title="Number of transactions per customer"),
    yaxis=dict(title="Number of Customers"),
    barmode='stack'
)
fig = go.Figure(data=data, layout=layout)
fig.update_traces(opacity=0.75)
plotly.offline.iplot(fig)
Something interesting did come up during this exploration: the presence of cancelled orders. For example, let's take a peek at customer 14096 again:
display(temp1[temp1.CustomerID == "14096"][-5:])
The question arises, what should we do with these cancelled orders? Well, first let's assess the severity of the issue. To start, we can see how many transactions in total were cancelled:
number_products_per_transaction = df.groupby(
    ["CustomerID", "InvoiceNo"]
).count()[
    ["StockCode"]
].rename(
    columns={"StockCode": "Number unique products in transaction"}
).reset_index()
temp = number_products_per_transaction.assign(
    cancelled=(number_products_per_transaction.InvoiceNo.str[0] == "C")
)
display(temp.head())
n1 = temp.shape[0]
n2 = temp.cancelled.sum()
percentage_orders_cancelled = n2 / n1 * 100
print(f"Percentage of orders cancelled: {percentage_orders_cancelled}%")
Over 16% of orders end up cancelled! This is a very large number! Let's take a look at the start of the dataframe:
display(df.sort_values('CustomerID')[:5])
Interesting: it looks like when an order is cancelled, we get another transaction in the dataframe that is mostly identical, except for the Quantity (which is negated) and the InvoiceDate. We should figure out if this is true for all entries. We can do this via the following:
# Cancellations (excluding discounts, StockCode "D") and regular transactions
df_cancelled = df[(df.InvoiceNo.str[0] == "C") & (df.StockCode != "D")]
df_not_cancelled = df[df.InvoiceNo.str[0] != "C"]

num_with_counterpart = 0
for idx, row in df_cancelled.iterrows():
    # A counterpart is a non-cancelled row for the same customer and product
    # whose quantity is the exact negation of the cancelled quantity
    equivalence_condition = (
        (row.CustomerID == df_not_cancelled.CustomerID) &
        (row.StockCode == df_not_cancelled.StockCode) &
        (row.Quantity == -df_not_cancelled.Quantity)
    )
    if equivalence_condition.sum() > 0:
        num_with_counterpart += 1
print(num_with_counterpart)
Interesting: this condition clearly does not hold for all cases. We have a total of 8795 cancelled orders, but only 3224 of them perfectly cancel a prior order. For example, there are most certainly cases where a customer bought, say, 6 of a product, and then returned 3, meaning there is a transaction with an InvoiceNo that has a C prepended at the start. In this case, to correctly remedy the situation we must subtract the quantity in the cancellation from the nearest purchase of the same item. The logic is a little bit convoluted, but it can be done! We will first write a function that gathers some suitable test groups for us; this will then allow us to write a function that actually handles the cancellations.
def find_test_groups(num_groups=1000):
    temp = df.assign(
        cancelled=(df.InvoiceNo.str[0] == "C"),
        InvoiceDate=pandas.to_datetime(df.InvoiceDate)
    )
    grouped = temp.sort_values(
        'InvoiceDate'
    ).groupby(
        ['CustomerID', 'StockCode']
    )
    # Randomly sample groups, then keep only those with more than 5 rows
    # and at least one cancellation
    sampled_df_i = random.sample(list(grouped.groups.keys()), num_groups)
    df_list = list(map(
        lambda df_i: grouped.get_group(df_i), sampled_df_i
    ))
    df_list_select = list(filter(
        lambda df_i: ((df_i.shape[0] > 5) and (df_i.cancelled.sum() > 0)),
        df_list
    ))
    return df_list_select
my_group = find_test_groups()[0]
Below we can see an example group (based on CustomerID and StockCode) that has a cancelled value. We can check to make sure that the function we write to handle all cancelled values in the data set can handle this sample group. my_group currently has the shape:
display(my_group)
display(my_group.shape)
So, that means we expect our function to take in my_group and return a group that is (5, 9):
def handle_perform_shift(my_group):
    while np.any(my_group.Quantity < 0):
        my_group = my_group.assign(
            Quantity=perform_shift(my_group)
        )
    return my_group[my_group.Quantity > 0]

def perform_shift(my_group):
    # conditions:
    # - current value is positive, next value is negative
    # - current value is positive, next value is positive or NaN
    # - current value is negative, next value is positive or NaN
    # - current value is negative, next value is negative
    # - current value is 0
    conditions_list = [
        (my_group.Quantity > 0) & (my_group.Quantity.shift(-1) < 0),
        (my_group.Quantity > 0) & ((my_group.Quantity.shift(-1) >= 0) | (my_group.Quantity.shift(-1).isnull())),
        (my_group.Quantity < 0) & ((my_group.Quantity.shift(-1) > 0) | (my_group.Quantity.shift(-1).isnull())),
        (my_group.Quantity < 0) & (my_group.Quantity.shift(-1) < 0),
        (my_group.Quantity == 0)
    ]
    choices_list = [
        my_group.Quantity + my_group.Quantity.shift(-1),
        my_group.Quantity,
        0,
        my_group.Quantity.shift(-1),
        0
    ]
    return np.select(conditions_list, choices_list)
test_remove_cancellations = handle_perform_shift(my_group)
display(test_remove_cancellations)
display(test_remove_cancellations.shape)
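Since find_test_groups samples at random, it can also help to sanity check the function on a tiny hand-made group (the rows below are made up purely for illustration): a purchase of 6 units followed by a cancellation of 3 should collapse down to a single row with a quantity of 3.
# Hypothetical two-row group: a purchase of 6 followed by a cancellation of 3
toy_group = pandas.DataFrame({
    "InvoiceNo": ["100001", "C100002"],
    "CustomerID": ["99999", "99999"],
    "StockCode": ["85123A", "85123A"],
    "Quantity": [6, -3],
})
display(handle_perform_shift(toy_group))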
Our function seems to effectively remove cancellations! We can now apply it to the entire dataframe.
%%time
temp = df.assign(
    cancelled=(df.InvoiceNo.str[0] == "C"),
    InvoiceDate=pandas.to_datetime(df.InvoiceDate)
)
grouped = temp.sort_values(
    'InvoiceDate'
).groupby(
    ['CustomerID', 'StockCode']
)
df_no_cancellations = grouped.apply(handle_perform_shift).reset_index(drop=True)

display(df[df.Quantity > 0].shape)
display(df_no_cancellations[df_no_cancellations.Quantity < 0].shape)
display(df_no_cancellations.shape)
This looks great! Now this would be a great place for optimization; the apply can be performed in parallel with multiprocessing or dask, but I will leave that for another post.
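For the curious, here is a rough sketch of what the multiprocessing route might look like; it is not benchmarked or part of the analysis above, and it assumes a fork-based start method so that worker processes can see the grouped object and handle_perform_shift defined earlier as globals:
from multiprocessing import Pool

def process_keys(keys):
    # Apply the cancellation-handling function to a subset of the groups;
    # relies on the global `grouped` and `handle_perform_shift` defined above
    return pandas.concat([handle_perform_shift(grouped.get_group(k)) for k in keys])

def parallel_handle_cancellations(n_workers=4):
    keys = list(grouped.groups.keys())
    # Split the group keys into roughly equal chunks, one per worker
    chunks = [keys[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        results = pool.map(process_keys, chunks)
    return pandas.concat(results).reset_index(drop=True)

# e.g. df_no_cancellations = parallel_handle_cancellations()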
Now, let's take a peek at the StockCode values. We have seen that some values of StockCode deal with discounts; we can check to see what other types of codes there are:
special_codes = df_no_cancellations[
    df_no_cancellations.StockCode.str.contains('^[a-zA-Z]+', regex=True)
].StockCode.unique()
display(special_codes)

for code in special_codes:
    val = df_no_cancellations[df_no_cancellations.StockCode == code]["Description"].unique()[0]
    display("{:<15} -> {:<30}".format(code, val))
We can see that there are other types of strange charges (such as specific bank charges). We may come back to this, but for now let's move on to total item price. This is based on the UnitPrice of the item, times its Quantity:
df = df_no_cancellations.assign(
    item_price=df_no_cancellations.UnitPrice * df_no_cancellations.Quantity
)
display(df.head())
From here we can find the total order price:
df = df.assign(
    total_price=df.groupby("InvoiceNo")['item_price'].transform('sum')
)
df.head()
Which has the following distribution (showing clipped and full data set distributions below):
from plotly import tools

trace1 = go.Histogram(
    x=df.total_price.unique(),
    nbinsx=100,
    name="Full",
    marker_color='blue',
)
trace2 = go.Histogram(
    x=df.total_price[df.total_price < 5000].unique(),
    nbinsx=100,
    name="Clipped",
    marker_color='green',
)
fig = tools.make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Full Dataset", "Clipped at 5000"),
    print_grid=False
)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.update_yaxes(title_text="Number Orders", row=1, col=1)
fig.update_xaxes(title_text="Total Order Price", row=1, col=1)
fig.update_xaxes(title_text="Total Order Price", row=1, col=2)
fig.layout.update(title_text="Distribution of Total Order Price")
plotly.offline.iplot(fig)