Analyze channel migration using python/pandas

Anyone who sells products through multiple channels wants to understand how well each channel performs. Especially if you have preferred channels (that convert better or at a lower CAC) you want to understand how users migrate from channel to channel and if certain channels have more "stickyness" than others.

The full repository, including sample-data is available on github :]

Recently an executive went into my office and asked me "Jonas, do you know if customers who first bought from our newsletter keep buying from the newsletter? Or do they more or less randomly buy from different channels wherever something attracts their attention?" Well.. I didn't know but I promised to put something together and report by tomorrow lunchtime.

This is what I did:

  1. First I pulled together a dataset of sales along with the info on which channel the sale was initiated. The data looks like this:
In [4]: data.columns

Out[4]: Index(['Booking date', 'Quantity', 'Customer ID', 'Product ID', 'Channel',
       'Landingpage', 'Channel_Name'],
In [5]: data.head()

Out [4]:
 	Booking date 	Quantity 	Customer ID 	Product ID 	Channel 	Landingpage 	Channel_Name
0 	01.01.2016 	2 	1155 	204 	4 	8 	Store
1 	01.01.2016 	2 	3367 	204 	4 	8 	Store
2 	01.01.2016 	2 	4825 	497 	1 	1 	Affiliate A
3 	01.01.2016 	2 	95 	157 	4 	8 	Store
4 	01.01.2018 	2 	6212 	138 	0 	0 	CRM

The first task at hand was to understand which channels any given customer used on the first, second, ..nth booking. So I needed to mark the bookings with their respective sequence number. In python that's two lines of code:

# sort the data so we can efficiently compare line n with line n+1
data.sort_values(by=['Customer ID', 'Booking date'], axis=0, ascending=True, inplace=True)

# Is this the customers 1st second, third, ... booking?
data['Booking Sequence'] = data.groupby(['Customer ID']).cumcount()+1;

Next I needed to understand from which channels customers were moving into what other channels. Every booking already contained the info on which channel it was made. I decided to add on which channel the previous bookings was made. This would allow me to do my "from" - "to" analysis.

A first approach to do this was to iterate through all rows and write into every row the channel-value from the previous row (Given that the previous row was from the same customer).

## unperformant solution, but describes what we want to do: On which channel was the customer aquired last time?
#for i in range(1, len(data)):
#    if (groupdata.loc[i, 'Customer ID'] == groupdata.loc[i-1, 'Customer ID']):
#        #print(str(data.loc[i, 'Customer ID']) + '==' + str(data.loc[i-1, 'Customer ID']))
#        data.loc[i, 'Previous Channel_Name'] = data.loc[i-1, 'Channel_Name']
#        data.loc[i, 'Previous Channel'] = data.loc[i-1, 'Channel']
#        data.loc[i, 'Previous Product ID'] = data.loc[i-1, 'Product ID']
#        data.loc[i, 'Previous Landingpage'] = data.loc[i-1, 'Landingpage']

However this is very inefficient and took ages to complete. (Actually I didn't wait 'till the end). Whenever a loop makes problems, I try to find a vectorized solution which solves the problem in most cases, thanks to python/numpy's premium-guts. Basically I just split out the columns in question, shifted their index by one and joined them back again to the original dataframe. This completed in seconds:

# performant solution: shift the index by 1 and join again to original df
#Customer ID 	Product ID 	Channel 	Landingpage 	Channel_Name 	Booking Sequence
data_shifted = data[['Customer ID', 'Product ID', 'Channel', 
                     'Channel_Name', 'Landingpage', 'Booking Sequence', 'Booking date']].copy(deep=True)
data_shifted.index = data_shifted.index+1
data = data.join(data_shifted, how='left', rsuffix='_shifted')

Add a little bit of cleanup and shining and data preparation done! Final step: Analysis and presentation of results. As we hired tableau to do this job for us, I built a tableau-dashboard:

Channel Migration