Demand Forecasting Using XGB Regression, ARIMA And SARIMAX

Problem Statement

Given 5 years of store-item sales data, the task is to forecast 3 months of sales for 50 different items at 10 different stores.

What's the best way to deal with seasonality? Should stores be modeled separately, or can you pool them together? Does deep learning work better than ARIMA?


Dataset description

The dataset is provided with 4 columns:


date: the date of sale
store: the store number
item: the item number
sales: sales made on that particular day

Approach: Here I will start with a basic ARIMA model. The steps below outline the solution:

1. Loading and handling time series in Pandas
2. How to check the stationarity of a time series
3. How to make a time series stationary
4. Forecasting a time series



Import Libraries And Related Dependencies

#import dependencies
import warnings 
warnings.filterwarnings('ignore')

#Basic packages
import os
import pandas as pd
import numpy as np
import datetime

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Time Series
from statsmodels.tsa.arima.model import ARIMA  # new-style ARIMA; statsmodels.tsa.arima_model was removed in statsmodels 0.13
from statsmodels.tsa.statespace.sarimax import SARIMAX
from pandas.plotting import autocorrelation_plot
from statsmodels.tsa.stattools import adfuller, acf, pacf, arma_order_select_ic
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs
from sklearn.model_selection import train_test_split, TimeSeriesSplit
import xgboost as xgb
from sklearn.metrics import mean_absolute_error

color = sns.color_palette()
sns.set_style('darkgrid')



Load Dataset

train_df = pd.read_csv("train.csv")
train_df['date'] = pd.to_datetime(train_df['date'], format="%Y-%m-%d")
train_df.head()
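A side note: pandas can also parse the date column while reading the CSV, which makes the separate to_datetime call unnecessary. A minimal equivalent using the standard parse_dates argument of read_csv:

# Equivalent load that parses the date column during read_csv
train_df = pd.read_csv("train.csv", parse_dates=["date"])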








Check for Missing Values

# Here we can see that there are no missing values in our dataframe
train_df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 913000 entries, 0 to 912999
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   date    913000 non-null  datetime64[ns]
 1   store   913000 non-null  int64
 2   item    913000 non-null  int64
 3   sales   913000 non-null  int64
dtypes: datetime64[ns](1), int64(3)
memory usage: 27.9 MB
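info() already shows that every column has 913000 non-null entries; a direct per-column count of missing values gives the same answer at a glance:

# Explicit count of missing values per column (all zeros here)
train_df.isnull().sum()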



Convert the date column to day, month, year, and dayofweek

# Expand dataframe with more useful columns
# convert date and time object to day month year and dayofweek
def expand_df(df):
    data = df.copy()
    
    data['day'] = data.date.dt.day
    data['month'] = data.date.dt.month
    data['year'] = data.date.dt.year
    data['dayofweek'] = data.date.dt.dayofweek
    return data

train_df = expand_df(train_df)
display(train_df)

Output













Combining Train and Test Data

train = pd.read_csv("train.csv") 
test = pd.read_csv("test.csv") 

# Combine the train and test data
data_combine = pd.concat([train,test])

print("size of data_combine",data_combine.shape)

data_combine['date'] = pd.to_datetime(data_combine['date'])  # format is inferred automatically

data_combine['month'] = data_combine['date'].dt.month
data_combine['weekday'] = data_combine['date'].dt.dayofweek
data_combine['year'] = data_combine['date'].dt.year
# .dt.weekofyear was removed in pandas 2.0; isocalendar().week is the replacement
data_combine['week_of_year'] = data_combine['date'].dt.isocalendar().week.astype(int)

data_combine['date_order'] = (data_combine['date'] - datetime.datetime(2013, 1, 1)).dt.days

# 7-day moving average of sales for each (item, store) pair
data_combine['sale_moving_average_7days'] = data_combine.groupby(["item", "store"])['sales'].transform(lambda x: x.rolling(window=7, min_periods=1).mean())

# Shift features by 90 (or 10) days so each row only uses information that will
# actually be available when forecasting roughly 3 months ahead
data_combine['sale_moving_average_7days_shifted-90'] = data_combine.groupby(["item", "store"])['sale_moving_average_7days'].transform(lambda x: x.shift(90))

data_combine['store_item_shifted-90'] = data_combine.groupby(["item", "store"])['sales'].transform(lambda x: x.shift(90))

data_combine['store_item_shifted-10'] = data_combine.groupby(["item", "store"])['sales'].transform(lambda x: x.shift(10))

Output

size of data_combine (958000, 5)
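The rolling mean and the shifted columns above are lag features: for each (item, store) pair, a row only sees sales information from at least 90 (or 10) days earlier, which is what will actually be available at forecast time. A scaled-down toy illustration (synthetic series, with shift(3) standing in for shift(90)):

# Toy illustration (assumption: tiny synthetic series; shift(3) stands in for shift(90))
s = pd.Series([10, 12, 9, 11, 14, 13, 15, 16, 12, 18])
ma7 = s.rolling(window=7, min_periods=1).mean()   # 7-day moving average
lag3 = s.shift(3)                                 # value from 3 steps earlier
print(pd.DataFrame({"sales": s, "ma7": ma7, "lag3": lag3}))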



Splitting the dataset

# Candidate columns: drop identifiers and the unshifted 7-day moving average
col = [i for i in data_combine.columns if i not in ['date', 'id', 'sale_moving_average_7days']]

# Rows with sales come from train.csv; rows without sales come from test.csv
train_new = data_combine.loc[~data_combine.sales.isna()]
print("new train", train_new.shape)
test_new = data_combine.loc[data_combine.sales.isna()]
print("new test", test_new.shape)

# Drop rows where the lag features are still NaN (the first 90 days of each series)
train_new = train_new[col].dropna()
y_target = train_new.sales

# Final feature list also excludes the target itself
col = [i for i in data_combine.columns if i not in ['date', 'id', 'sales', 'sale_moving_average_7days']]
X_train, X_test, y_train, y_test = train_test_split(train_new[col], train_new.sales,
                                                    test_size=0.15, random_state=42)

Output:

new train (913000, 14)
new test (45000, 14)
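Note that train_test_split with a random_state shuffles rows, so validation rows can come from earlier dates than some training rows. For a time-ordered validation, the already-imported TimeSeriesSplit can be used instead; a minimal sketch (assuming train_new is sorted chronologically by date_order):

# Sketch: chronological cross-validation with TimeSeriesSplit (each validation
# fold comes strictly after its training fold in time)
train_sorted = train_new.sort_values('date_order')
tscv = TimeSeriesSplit(n_splits=5)
for fold, (tr_idx, val_idx) in enumerate(tscv.split(train_sorted)):
    X_tr, y_tr = train_sorted.iloc[tr_idx][col], train_sorted.iloc[tr_idx]['sales']
    X_val, y_val = train_sorted.iloc[val_idx][col], train_sorted.iloc[val_idx]['sales']
    print(f"fold {fold}: train={len(tr_idx)}, val={len(val_idx)}")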


grand_avg = train_df.sales.mean()
print(f"The grand average of sales in this dataset is {grand_avg:.4f}")

Output

The grand average of sales in this dataset is 52.2503
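As a quick sanity check before the pivot-table EDA below, we can see how far each store's average deviates from this grand average:

# Each store's mean sales relative to the grand average (a preview of the EDA below)
store_avg = train_df.groupby('store')['sales'].mean()
print((store_avg / grand_avg).round(3))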



train_df = train_df.set_index('date')
train_df.head()

Output:










train_df.tail()

Output










Exploratory Data Analysis (EDA)

To view the data from different angles, we use pivot tables.



#Sales by year
agg_year_item = pd.pivot_table(train_df, index='year', columns='item',
                               values='sales', aggfunc=np.mean).values
agg_year_store = pd.pivot_table(train_df, index='year', columns='store',
                                values='sales', aggfunc=np.mean).values

plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.plot(agg_year_item / agg_year_item.mean(0)[np.newaxis])
plt.title("Items")
plt.xlabel("Year")
plt.ylabel("Relative Sales")
plt.subplot(122)
plt.plot(agg_year_store / agg_year_store.mean(0)[np.newaxis])
plt.title("Stores")
plt.xlabel("Year")
plt.ylabel("Relative Sales")
plt.show()


Output











We can see that items and stores have grown similarly over time.
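The division by .mean(0)[np.newaxis] in the plots above rescales each column (item or store) by its own overall mean, so series with very different absolute sales volumes can be compared on a single relative scale. A toy illustration with made-up numbers:

# Toy illustration (synthetic numbers): dividing each column by its mean puts
# items with different volumes on the same "relative sales" scale
agg = np.array([[10., 100.],
                [12., 120.],
                [14., 140.]])           # rows = years, columns = two items
print(agg / agg.mean(0)[np.newaxis])    # identical relative growth for both columns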


# Now let's check the sales w.r.t. the month

agg_month_item = pd.pivot_table(train_df, index='month', columns='item',
                               values='sales', aggfunc=np.mean).values
agg_month_store = pd.pivot_table(train_df, index='month', columns='store',
                                values='sales', aggfunc=np.mean).values

plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.plot(agg_month_item / agg_month_item.mean(0)[np.newaxis])
plt.title("Items")
plt.xlabel("Month")
plt.ylabel("Relative Sales")
plt.subplot(122)
plt.plot(agg_month_store / agg_month_store.mean(0)[np.newaxis])
plt.title("Stores")
plt.xlabel("Month")
plt.ylabel("Relative Sales")
plt.show()












Sales by month also show a similar pattern for both items and stores.


# Sales by day of week

agg_weekly_item = pd.pivot_table(train_df, index='dayofweek', columns='item',
                               values='sales', aggfunc=np.mean).values
agg_weekly_store = pd.pivot_table(train_df, index='dayofweek', columns='store',
                                values='sales', aggfunc=np.mean).values

plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.plot(agg_weekly_item / agg_weekly_item.mean(0)[np.newaxis])
plt.title("Items")
plt.xlabel("Days of week")
plt.ylabel("Relative Sales")
plt.subplot(122)
plt.plot(agg_weekly_store / agg_weekly_store.mean(0)[np.newaxis])
plt.title("Stores")
plt.xlabel("Days of week")
plt.ylabel("Relative Sales")
plt.show()












Sales by day of week also show a similar pattern for both items and stores.

# Now let's check the item-store relationship

agg_store_item = pd.pivot_table(train_df, index='store', columns='item',
                                values='sales', aggfunc=np.mean).values

plt.figure(figsize=(14, 5))
plt.subplot(121)
plt.plot(agg_store_item / agg_store_item.mean(0)[np.newaxis])
plt.title("Items")
plt.xlabel("Store")
plt.ylabel("Relative Sales")
plt.subplot(122)
plt.plot(agg_store_item.T / agg_store_item.T.mean(0)[np.newaxis])
plt.title("Stores")
plt.xlabel("Item")
plt.ylabel("Relative Sales")
plt.show()










Now that we have examined the relationships between the different variables, it's time to decompose the time series.


sns.lineplot(x="date",y="sales",legend="full",data=train_df)

Output












Here we can see that the sales plotted against date form a time series with visible trend and seasonal structure.


Decompose the time series

Time series data can exhibit a variety of patterns, and it is often helpful to split a time series into several components, each representing an underlying pattern category. When we decompose a time series into components, we usually combine the trend and cycle into a single trend-cycle component (sometimes called the trend for simplicity). Thus we think of a time series comprising three components: a trend-cycle component, a seasonal component, and a remainder component (containing anything else in the time series).

The Seasonal component: A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week. Seasonality is always of a fixed and known frequency. A time series can contain multiple superimposed seasonal periods.

The Trend component: A trend exists when there is a long-term increase or decrease in the data. It does not have to be linear. Sometimes a trend is referred to as “changing direction” when it might go from an increasing trend to a decreasing trend.

The Cyclical component: The cyclical component represents phenomena that happen across seasonal periods. Cyclical patterns do not have a fixed period like seasonal patterns do. The cyclical component is hard to isolate and it's often 'left alone' by combining it with the trend component.

The Noise component: The noise or random component is what remains behind when you separate out seasonality and trend from the time series. Noise is the effect of factors that you do not know, or which you cannot measure. It is the effect of the known unknowns, or the unknown unknowns.
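In symbols, an additive decomposition models the observed series as y(t) = T(t) + S(t) + R(t), i.e. trend-cycle plus seasonal plus remainder, while a multiplicative decomposition uses y(t) = T(t) * S(t) * R(t). The additive form is the one passed to seasonal_decompose below.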



# Let's decompose a smaller slice of the data: item 1 at store 1.

train_item1 = train_df[train_df['item'] == 1]
train_final = train_item1[train_item1['store'] == 1]

from statsmodels.tsa.seasonal import seasonal_decompose
# 'period' replaces the deprecated 'freq' argument in recent statsmodels;
# 365 captures the yearly seasonality of daily data
result = seasonal_decompose(train_final['sales'], model='additive', period=365)

fig = result.plot()
fig.set_size_inches(14, 12)
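The approach outlined at the start also calls for checking stationarity. A minimal sketch using the adfuller test imported above, applied to the same item 1 / store 1 series (a p-value below 0.05 would suggest the series is stationary):

# Augmented Dickey-Fuller test on the item 1 / store 1 sales series
adf_stat, p_value, used_lag, n_obs, crit_values, _ = adfuller(train_final['sales'])
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}")
print("Critical values:", crit_values)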