Demand Forecasting Using XGB Regression, ARIMA And SARIMAX

Problem Statement

Given 5 years of store-item sales data, the task is to forecast 3 months of sales for 50 different items at 10 different stores.

What's the best way to deal with seasonality? Should stores be modeled separately, or can you pool them together? Does deep learning work better than ARIMA?


Dataset description

The dataset is provided with 4 columns:


date: the date of sale
store: the store number
item: the item number
sales: sales made on that particular day

Approach: Here I will start with a basic ARIMA model. The steps below outline the solution:

1. Loading and handling time series in Pandas
2. How to check the stationarity of a time series
3. How to make a time series stationary
4. Forecasting a time series



Import Libraries And Related Dependencies

#import dependencies
import warnings 
warnings.filterwarnings('ignore')

#Basic packages
import os
import pandas as pd
import numpy as np
import datetime

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Time Series
from statsmodels.tsa.arima.model import ARIMA  # new-style ARIMA; statsmodels.tsa.arima_model was removed in statsmodels 0.13
from statsmodels.tsa.statespace.sarimax import SARIMAX
from pandas.plotting import autocorrelation_plot
from statsmodels.tsa.stattools import adfuller, acf, pacf, arma_order_select_ic
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs
from sklearn.model_selection import train_test_split, TimeSeriesSplit
import xgboost as xgb
from sklearn.metrics import mean_absolute_error

color = sns.color_palette()
sns.set_style('darkgrid')



Load Dataset

train_df = pd.read_csv("train.csv")
train_df['date'] = pd.to_datetime(train_df['date'], format="%Y-%m-%d")
train_df.head()
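A side note: pandas can also parse the date column while reading the CSV, which makes the separate to_datetime call unnecessary. A minimal equivalent using the standard parse_dates argument of read_csv:

# Equivalent load that parses the date column during read_csv
train_df = pd.read_csv("train.csv", parse_dates=["date"])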








Check for Missing Values

# Here we can see that there are no missing values in our dataframe
train_df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 913000 entries, 0 to 912999
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   date    913000 non-null  datetime64[ns]
 1   store   913000 non-null  int64
 2   item    913000 non-null  int64
 3   sales   913000 non-null  int64
dtypes: datetime64[ns](1), int64(3)
memory usage: 27.9 MB
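info() already shows that every column has 913000 non-null entries; a direct per-column count of missing values gives the same answer at a glance:

# Explicit count of missing values per column (all zeros here)
train_df.isnull().sum()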



Convert the date column to day, month, year, and dayofweek

# Expand dataframe with more useful columns
# convert date and time object to day month year and dayofweek
def expand_df(df):
    data = df.copy()
    
    data['day'] = data.date.dt.day
    data['month'] = data.date.dt.month
    data['year'] = data.date.dt.year
    data['dayofweek'] = data.date.dt.dayofweek
    return data

train_df = expand_df(train_df)
display(train_df)

Output













Combining Train and Test Data

train = pd.read_csv("train.csv") 
test = pd.read_csv("test.csv") 

# Combine the train and test data
data_combine = pd.concat([train,test])

print("size of data_combine",data_combine.shape)

data_combine['date'] = pd.to_datetime(data_combine['date'])  # format is inferred automatically

data_combine['month'] = data_combine['date'].dt.month
data_combine['weekday'] = data_combine['date'].dt.dayofweek
data_combine['year'] = data_combine['date'].dt.year
# .dt.weekofyear was removed in pandas 2.0; isocalendar().week is the replacement
data_combine['week_of_year'] = data_combine['date'].dt.isocalendar().week.astype(int)

data_combine['date_order'] = (data_combine['date'] - datetime.datetime(2013, 1, 1)).dt.days

# 7-day moving average of sales for each (item, store) pair
data_combine['sale_moving_average_7days'] = data_combine.groupby(["item", "store"])['sales'].transform(lambda x: x.rolling(window=7, min_periods=1).mean())

# Shift features by 90 (or 10) days so each row only uses information that will
# actually be available when forecasting roughly 3 months ahead
data_combine['sale_moving_average_7days_shifted-90'] = data_combine.groupby(["item", "store"])['sale_moving_average_7days'].transform(lambda x: x.shift(90))

data_combine['store_item_shifted-90'] = data_combine.groupby(["item", "store"])['sales'].transform(lambda x: x.shift(90))

data_combine['store_item_shifted-10'] = data_combine.groupby(["item", "store"])['sales'].transform(lambda x: x.shift(10))

Output

size of data_combine (958000, 5)
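The rolling mean and the shifted columns above are lag features: for each (item, store) pair, a row only sees sales information from at least 90 (or 10) days earlier, which is what will actually be available at forecast time. A scaled-down toy illustration (synthetic series, with shift(3) standing in for shift(90)):

# Toy illustration (assumption: tiny synthetic series; shift(3) stands in for shift(90))
s = pd.Series([10, 12, 9, 11, 14, 13, 15, 16, 12, 18])
ma7 = s.rolling(window=7, min_periods=1).mean()   # 7-day moving average
lag3 = s.shift(3)                                 # value from 3 steps earlier
print(pd.DataFrame({"sales": s, "ma7": ma7, "lag3": lag3}))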



Splitting the dataset

# Candidate columns: drop identifiers and the unshifted 7-day moving average
col = [i for i in data_combine.columns if i not in ['date', 'id', 'sale_moving_average_7days']]

# Rows with sales come from train.csv; rows without sales come from test.csv
train_new = data_combine.loc[~data_combine.sales.isna()]
print("new train", train_new.shape)
test_new = data_combine.loc[data_combine.sales.isna()]
print("new test", test_new.shape)

# Drop rows where the lag features are still NaN (the first 90 days of each series)
train_new = train_new[col].dropna()
y_target = train_new.sales

# Final feature list also excludes the target itself
col = [i for i in data_combine.columns if i not in ['date', 'id', 'sales', 'sale_moving_average_7days']]
X_train, X_test, y_train, y_test = train_test_split(train_new[col], train_new.sales,
                                                    test_size=0.15, random_state=42)

Output:

new train (913000, 14)
new test (45000, 14)
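Note that train_test_split with a random_state shuffles rows, so validation rows can come from earlier dates than some training rows. For a time-ordered validation, the already-imported TimeSeriesSplit can be used instead; a minimal sketch (assuming train_new is sorted chronologically by date_order):

# Sketch: chronological cross-validation with TimeSeriesSplit (each validation
# fold comes strictly after its training fold in time)
train_sorted = train_new.sort_values('date_order')
tscv = TimeSeriesSplit(n_splits=5)
for fold, (tr_idx, val_idx) in enumerate(tscv.split(train_sorted)):
    X_tr, y_tr = train_sorted.iloc[tr_idx][col], train_sorted.iloc[tr_idx]['sales']
    X_val, y_val = train_sorted.iloc[val_idx][col], train_sorted.iloc[val_idx]['sales']
    print(f"fold {fold}: train={len(tr_idx)}, val={len(val_idx)}")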


grand_avg = train_df.sales.mean()
print(f"The grand average of sales in this dataset is {grand_avg:.4f}")

Output

The grand average of sales in this dataset is 52.2503
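As a quick sanity check before the pivot-table EDA below, we can see how far each store's average deviates from this grand average:

# Each store's mean sales relative to the grand average (a preview of the EDA below)
store_avg = train_df.groupby('store')['sales'].mean()
print((store_avg / grand_avg).round(3))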



train_df = train_df.set_index('date')
train_df.head()

Output:










train_df.tail()

Output










Exploratory Data Analysis (EDA)

To view the data from different angles, we use pivot tables.



#Sales by year
agg_year_item = pd.pivot_table(train_df, index='year', columns='item',
                               values='sales', aggfunc=np.mean).values
agg_year_store = pd.pivot_table(train_df, index='year', columns='store',
                                values='sales', aggfunc=np.mean).values

plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.plot(agg_year_item / agg_year_item.mean(0)[np.newaxis])
plt.title("Items")
plt.xlabel("Year")
plt.ylabel("Relative Sales")
plt.subplot(122)
plt.plot(agg_year_store / agg_year_store.mean(0)[np.newaxis])
plt.title("Stores")
plt.xlabel("Year")
plt.ylabel("Relative Sales")
plt.show()


Output











We can see that items and stores have grown similarly over time.
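The division by .mean(0)[np.newaxis] in the plots above rescales each column (item or store) by its own overall mean, so series with very different absolute sales volumes can be compared on a single relative scale. A toy illustration with made-up numbers:

# Toy illustration (synthetic numbers): dividing each column by its mean puts
# items with different volumes on the same "relative sales" scale
agg = np.array([[10., 100.],
                [12., 120.],
                [14., 140.]])           # rows = years, columns = two items
print(agg / agg.mean(0)[np.newaxis])    # identical relative growth for both columns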


# Now let's check the sales w.r.t. the month

agg_month_item = pd.pivot_table(train_df, index='month', columns='item',
                               values='sales', aggfunc=np.mean).values
agg_month_store = pd.pivot_table(train_df, index='month', columns='store',
                                values='sales', aggfunc=np.mean).values

plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.plot(agg_month_item / agg_month_item.mean(0)[np.newaxis])
plt.title("Items")
plt.xlabel("Month")
plt.ylabel("Relative Sales")
plt.subplot(122)
plt.plot(agg_month_store / agg_month_store.mean(0)[np.newaxis])
plt.title("Stores")
plt.xlabel("Month")
plt.ylabel("Relative Sales")
plt.show()












Sales by month also show a similar pattern for both items and stores.


# Sales by day of week

agg_weekly_item = pd.pivot_table(train_df, index='dayofweek', columns='item',
                               values='sales', aggfunc=np.mean).values
agg_weekly_store = pd.pivot_table(train_df, index='dayofweek', columns='store',
                                values='sales', aggfunc=np.mean).values

plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.plot(agg_weekly_item / agg_weekly_item.mean(0)[np.newaxis])
plt.title("Items")
plt.xlabel("Days of week")
plt.ylabel("Relative Sales")
plt.subplot(122)
plt.plot(agg_weekly_store / agg_weekly_store.mean(0)[np.newaxis])
plt.title("Stores")
plt.xlabel("Days of week")
plt.ylabel("Relative Sales")
plt.show()












Sales by day of week also show a similar pattern for both items and stores.

# Now let's check the item-store relationship

agg_store_item = pd.pivot_table(train_df, index='store', columns='item',
                                values='sales', aggfunc=np.mean).values

plt.figure(figsize=(14, 5))
plt.subplot(121)
plt.plot(agg_store_item / agg_store_item.mean(0)[np.newaxis])
plt.title("Items")
plt.xlabel("Store")
plt.ylabel("Relative Sales")
plt.subplot(122)
plt.plot(agg_store_item.T / agg_store_item.T.mean(0)[np.newaxis])
plt.title("Stores")
plt.xlabel("Item")
plt.ylabel("Relative Sales")
plt.show()










Now that we have examined the relationships between the different variables, it's time to decompose the time series.


sns.lineplot(x="date",y="sales",legend="full",data=train_df)

Output












Here we can see that the sales plotted against date form a time series with visible trend and seasonal structure.


Decompose the time series

Time series data can exhibit a variety of patterns, and it is often helpful to split a time series into several components, each representing an underlying pattern category. When we decompose a time series into components, we usually combine the trend and cycle into a single trend-cycle component (sometimes called the trend for simplicity). Thus we think of a time series comprising three components: a trend-cycle component, a seasonal component, and a remainder component (containing anything else in the time series).

The Seasonal component: A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week. Seasonality is always of a fixed and known frequency. A time series can contain multiple superimposed seasonal periods.

The Trend component: A trend exists when there is a long-term increase or decrease in the data. It does not have to be linear. Sometimes a trend is referred to as “changing direction” when it might go from an increasing trend to a decreasing trend.

The Cyclical component: The cyclical component represents phenomena that happen across seasonal periods. Cyclical patterns do not have a fixed period like seasonal patterns do. The cyclical component is hard to isolate and it's often 'left alone' by combining it with the trend component.

The Noise component: The noise or random component is what remains behind when you separate out seasonality and trend from the time series. Noise is the effect of factors that you do not know, or which you cannot measure. It is the effect of the known unknowns, or the unknown unknowns.
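In symbols, an additive decomposition models the observed series as y(t) = T(t) + S(t) + R(t), i.e. trend-cycle plus seasonal plus remainder, while a multiplicative decomposition uses y(t) = T(t) * S(t) * R(t). The additive form is the one passed to seasonal_decompose below.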



# Let's decompose a smaller slice of the data: item 1 at store 1.

train_item1 = train_df[train_df['item'] == 1]
train_final = train_item1[train_item1['store'] == 1]

from statsmodels.tsa.seasonal import seasonal_decompose
# 'period' replaces the deprecated 'freq' argument in recent statsmodels;
# 365 captures the yearly seasonality of daily data
result = seasonal_decompose(train_final['sales'], model='additive', period=365)

fig = result.plot()
fig.set_size_inches(14, 12)
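The approach outlined at the start also calls for checking stationarity. A minimal sketch using the adfuller test imported above, applied to the same item 1 / store 1 series (a p-value below 0.05 would suggest the series is stationary):

# Augmented Dickey-Fuller test on the item 1 / store 1 sales series
adf_stat, p_value, used_lag, n_obs, crit_values, _ = adfuller(train_final['sales'])
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}")
print("Critical values:", crit_values)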