Introductory time series analysis on Shopee E-commerce activity

Daryl
6 min read · Jun 16, 2020


Shopee’s Corporate Headquarters in Singapore; Credits: Vulcan Post

For those of you not from this part of the world, Shopee is really big here.

Calling it South East Asia’s Amazon or Alibaba wouldn’t be too far off.

Nap Pods: Can you really call yourself a tech firm if you don’t have an abundance of these?

Every year, in its quest for the best coders and data scientists in the region, Shopee organises hackathons.

They have taken on different names and iterations over the past few years.

This year, it’s the same deal.

It’s my first time participating, so I can’t speak for other years. But this year, Shopee has chosen to run it remotely, due to the worldwide lockdown stemming from the Coronavirus outbreak.

On weekdays, there are webinars like these.

And on each of the weekends, we have the competitions.

I won’t be discussing the competition metrics here, but I’ll be going down a slightly different track.

Given a smattering of data, how can we group and observe it to determine buying trends?

The dataset is here, provided by Shopee themselves.

And off we go!

The IDE used here is Jupyter Lab.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv('order_brush_order.csv')
df.head()

A peek at the dataset: a mishmash of E-commerce orders.

Timestamped, though not in chronological order.

df.shape
< (222750, 4) >

222,750 rows, 4 columns.

In the raw dataset, the timestamp is nothing but a plain string. We need to explicitly tell the program to parse those strings into a proper datetime format, via the Pandas function to_datetime().

Additionally, the default numeric index (0, 1, 2, 3, …) doesn’t serve much use here. So we replace it, using the timestamp as our index, via the Pandas function set_index().

df['event_time'] = pd.to_datetime(df['event_time'])
df = df.set_index('event_time')
df.head()
df.index.min()
< Timestamp('2019-12-27 00:00:00') >    # Earliest: 27 Dec, 0000 hrs

df.index.max()
< Timestamp('2019-12-31 23:59:56') >    # Latest: 31 Dec, 2359 hrs

df.index

Once again, we see the various date and time stamps.

As a first step, to make things a little less messy, we sort the data in chronological order using the Pandas function sort_values().

df = df.sort_values(by='event_time')
df.index

As expected, the data is now sorted in chronological order, with the earliest and latest orders clearly visible.

Next, we want to analyse buying activity.

That means the number of orders placed, and how they vary with each unit of time.

There is no way to do so using the raw dataset, as all it provides are the unique order identification serial numbers.

We know that each serial number corresponds to one unique order, but again the computer needs to be explicitly programmed to treat it as such. So we add a dummy column where each order counts as one.

# Add in a dummy variable
df['quantity'] = df.apply(lambda row: float(1), axis=1)
df.head()
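As an aside, the row-wise apply above works, but broadcasting a scalar does the same job in one vectorised step, which is much faster on 222,750 rows. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame standing in for the Shopee orders (hypothetical values)
df_toy = pd.DataFrame({'orderid': [101, 102, 103]})

# Assigning a scalar broadcasts it to every row, same result as the apply
df_toy['quantity'] = 1.0

print(df_toy['quantity'].tolist())  # [1.0, 1.0, 1.0]
```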

Next, we need to group the orders by each unit of time.

There are no hard and fast rules here, but I have chosen to group them by the hour, using the Pandas function resample().

The resample() function provides many different time windows for sorting. From years, quarters, months, days, minutes, right down to the nanosecond.

Here the hour is denoted by “H”.

df_hourly = df.resample('H').sum()
df_hourly.head()

Now that we have what we need, the identification numbers are no longer useful, and can be discarded, via the Pandas function: drop().

df_hourly=df_hourly.drop(['orderid','shopid','userid'], axis=1)
df_hourly.head()
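As an equivalent shortcut, selecting the quantity column before resampling means the ID columns never get summed in the first place, so there is nothing to drop afterwards. A sketch on a hypothetical toy frame:

```python
import pandas as pd

# Three toy orders: two in the midnight hour, one in the 1 a.m. hour
idx = pd.to_datetime(['2019-12-27 00:10', '2019-12-27 00:50', '2019-12-27 01:20'])
df_toy = pd.DataFrame({'orderid': [1, 2, 3], 'quantity': [1.0, 1.0, 1.0]}, index=idx)

# Resampling a single column sidesteps the drop() step entirely
hourly = df_toy['quantity'].resample('H').sum()
print(hourly.tolist())  # [2.0, 1.0]
```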

We have our indexing by the hour, and we have the hourly orders. The next logical step is to visualise the data.

df_hourly.plot(grid=True, figsize=(18,5))
plt.ylabel('Order Volume')
plt.xlabel('Hour')
plt.title('Order Volume by Hour')
plt.show()

Just some preliminary observations and the possible human behaviours behind them.

  1. Order volume nosedives in the wee hours of the morning.

Which makes sense, as most shoppers aren’t night owls and will be going to bed.

2. Order volume has a mini peak around mid-day and decreases gradually.

Shoppers may place orders during their lunch break, and then place their phones aside as they get back to work.

3. Order volume reaches its maximum in the evening.

The day has ended, and shoppers are engaging in retail therapy after knocking off work.

To add some rigour to the analysis, let’s introduce a concept from Statistics.

The Moving Average.

The moving average is used in many places. A few include:

  1. Technical Analysis

2. Energy Arbitrage

When the electricity industry decides how to price utility plans.
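To illustrate the concept on a hypothetical toy series: a 3-point moving average replaces each value with the mean of itself and the two values before it, so the first two entries have nothing to average over.

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0])

# 3-point moving average: two NaNs, then (2+4+6)/3 = 4.0, (4+6+8)/3 = 6.0
ma = s.rolling(window=3).mean()
print(ma.tolist())  # [nan, nan, 4.0, 6.0]
```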

We shall calculate the moving average across different time scales, via the Pandas pattern .rolling(window=...).mean(), and append them to our original data.

# 3, 6, 24 hour moving averages
df_hourly["3hr Moving Average"] = df_hourly['quantity'].rolling(window=3).mean()
df_hourly["6hr Moving Average"] = df_hourly['quantity'].rolling(window=6).mean()
df_hourly["24hr Moving Average"] = df_hourly['quantity'].rolling(window=24).mean()
df_hourly.head()

In doing so, note that the first few rows are blank.

This is because there are not yet enough values in the window to calculate the average.
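If those leading blanks are unwanted, rolling() also accepts a min_periods argument that averages over however many values are available so far. A sketch on a hypothetical toy series:

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0])

# min_periods=1 fills the warm-up rows with partial averages instead of NaN:
# 2/1 = 2.0, (2+4)/2 = 3.0, (2+4+6)/3 = 4.0
ma = s.rolling(window=3, min_periods=1).mean()
print(ma.tolist())  # [2.0, 3.0, 4.0]
```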

Lastly, we plot these moving averages against the original hourly volume as a comparison.

plt.figure(figsize=(18,5))
plt.grid(True)
plt.ylabel('Order Volume')
plt.xlabel('Hour')
plt.title('Order Volume by Hour')
plt.plot(df_hourly['quantity'],
label='Hourly Volume',
color='blue')
plt.plot(df_hourly['3hr Moving Average'],
label='3hr Moving Average',
color='red')
plt.plot(df_hourly['6hr Moving Average'],
label='6hr Moving Average',
color='pink')
plt.plot(df_hourly['24hr Moving Average'],
label='24hr Moving Average',
color='green')
plt.legend()
plt.show()

We see the maximum order volume increase toward 31st December, as evidenced by the blue peaks.

Naturally, the 24-hour moving average inches up as well, from about 1,750 to above 2,000.

It then falls back down on New Year’s Eve.

Perhaps we have shopaholics repeating to themselves one of the oldest lies in the world:

“New Year, New Me. I promise to rein in my spending.”

But we all know how that works out every year, don’t we?

That shall be all for now and see you next time!

Written by Daryl

Graduated with a Physics degree, I write about physics, coding and quantitative finance.