Problem Statement

This tutorial takes a practical and coding-focused approach. We'll learn gradient boosting by applying it to a real-world dataset from the Rossmann Store Sales competition on Kaggle:

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality.

With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied. You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

View and download the data here: https://www.kaggle.com/c/rossmann-store-sales/data

Download Data

import os
import opendatasets as od
import pandas as pd
pd.set_option("display.max_columns", 120)
pd.set_option("display.max_rows", 120)

od.download('https://www.kaggle.com/c/rossmann-store-sales')

Skipping, found downloaded files in "./rossmann-store-sales" (use force=True to force download)

data_dir = os.listdir('rossmann-store-sales/')

data_dir

['sample_submission.csv', 'store.csv', 'test.csv', 'train.csv']

ross_df = pd.read_csv('./rossmann-store-sales/train.csv', low_memory=False)
test_df = pd.read_csv('./rossmann-store-sales/test.csv')
store_df = pd.read_csv('./rossmann-store-sales/store.csv')
submission_df = pd.read_csv('./rossmann-store-sales/sample_submission.csv')

ross_df

test_df

store_df

merged_df = ross_df.merge(store_df, how='left', on='Store')
merged_test_df = test_df.merge(store_df, how='left', on='Store')

merged_df

Exploratory Data Analysis

Preprocessing and Feature Engineering

merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1017209 entries, 0 to 1017208
Data columns (total 18 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   Store                      1017209 non-null  int64  
 1   DayOfWeek                  1017209 non-null  int64  
 2   Date                       1017209 non-null  object 
 3   Sales                      1017209 non-null  int64  
 4   Customers                  1017209 non-null  int64  
 5   Open                       1017209 non-null  int64  
 6   Promo                      1017209 non-null  int64  
 7   StateHoliday               1017209 non-null  object 
 8   SchoolHoliday              1017209 non-null  int64  
 9   StoreType                  1017209 non-null  object 
 10  Assortment                 1017209 non-null  object 
 11  CompetitionDistance        1014567 non-null  float64
 12  CompetitionOpenSinceMonth  693861 non-null   float64
 13  CompetitionOpenSinceYear   693861 non-null   float64
 14  Promo2                     1017209 non-null  int64  
 15  Promo2SinceWeek            509178 non-null   float64
 16  Promo2SinceYear            509178 non-null   float64
 17  PromoInterval              509178 non-null   object 
dtypes: float64(5), int64(8), object(5)
memory usage: 147.5+ MB

Date

def split_date(df):
    df['Date'] = pd.to_datetime(df['Date'])
    df['Year'] = df.Date.dt.year
    df['Month'] = df.Date.dt.month
    df['Day'] = df.Date.dt.day
    df['WeekofYear'] = df.Date.dt.isocalendar().week

split_date(merged_df)
split_date(merged_test_df)

merged_df

Store Open/Closed

merged_df[merged_df.Open ==0].Sales.value_counts()

0    172817
Name: Sales, dtype: int64

Instead of trying to model this relationship, it would be better to hard-code it in our predictions, and remove the rows where the store is closed. We won't remove any rows from the test set, since we need to make predictions for every row.

merged_df = merged_df[merged_df.Open == 1].copy()

Competition

merged_df.sample(10)

we can use the columns CompetitionOpenSince[Month/Year] columns from store_df to compute the number of months for which a competitor has been open near the store.

def comp_months(df):
    df['CompetitionOpen'] = 12 * (df.Year - df.CompetitionOpenSinceYear) + (df.Month - df.CompetitionOpenSinceMonth)
    df['CompetitionOpen'] = df['CompetitionOpen'].map(lambda x: 0 if x < 0 else x).fillna(0)

comp_months(merged_df)
comp_months(merged_test_df)

merged_df.head(10)

merged_df[['Date', 'CompetitionDistance', 'CompetitionOpenSinceYear', 'CompetitionOpenSinceMonth', 'CompetitionOpen']]

Additional Promotion

merged_df

add some additional columns to indicate how long a store has been running Promo2 and whether a new round of Promo2 starts in the current month.

def check_promo_month(row):
    month2str = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun',              
                 7:'Jul', 8:'Aug', 9:'Sept', 10:'Oct', 11:'Nov', 12:'Dec'}
    
    try:
        months = (row['PromoInterval'] or '').split(',')
        if row['Promo2Open'] and month2str[row['Month']] in months:
            return -1
        else:
            return 0
        
    except Exception:
        return 0
    
    
def promo_cols(df):
    # month since Promo2 was open
    df['Promo2Open'] = 12 * (df.Year-df.Promo2SinceYear) + (df.WeekofYear - df.Promo2SinceWeek) * 7/30.5
    df['Promo2Open'] = df['Promo2Open'] .map(lambda x: 0 if x < 0 else x).fillna(0) * df['Promo2']
    
    #wether a new round of promotions was started in the current month
    df['IsPromo2Month'] = df.apply(check_promo_month, axis=1) * df['Promo2']

promo_cols(merged_df)
promo_cols(merged_test_df)

merged_df

merged_df[['Date', 'Promo2', 'Promo2SinceYear', 'Promo2SinceWeek', 'PromoInterval', 'Promo2Open', 'IsPromo2Month']].sample(20)

merged_df

Input and Target Columns

merged_df.columns

Index(['Store', 'DayOfWeek', 'Date', 'Sales', 'Customers', 'Open', 'Promo',
       'StateHoliday', 'SchoolHoliday', 'StoreType', 'Assortment',
       'CompetitionDistance', 'CompetitionOpenSinceMonth',
       'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek',
       'Promo2SinceYear', 'PromoInterval', 'Year', 'Month', 'Day',
       'WeekofYear', 'CompetitionOpen', 'Promo2Open', 'IsPromo2Month'],
      dtype='object')

input_cols = ['Store', 'DayOfWeek', 'Promo', 'StateHoliday', 'SchoolHoliday', 
              'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpen', 
              'Day', 'Month', 'Year', 'WeekofYear',  'Promo2', 
              'Promo2Open', 'IsPromo2Month']
target_col = 'Sales'

inputs = merged_df[input_cols].copy()
targets = merged_df[target_col].copy()

test_inputs = merged_test_df[input_cols].copy()

inputs

numeric_cols = ['Store', 'Promo', 'SchoolHoliday', 'CompetitionDistance',
                'CompetitionOpen', 'Promo2', 'Promo2Open', 'IsPromo2Month',
                'Day', 'Month', 'Year', 'WeekofYear']

categorical_cols = ['DayOfWeek', 'StateHoliday', 'StoreType', 'Assortment']

Impute missing Numrical Data

inputs[numeric_cols].isna().sum()

Store                     0
Promo                     0
SchoolHoliday             0
CompetitionDistance    2186
CompetitionOpen           0
Promo2                    0
Promo2Open                0
IsPromo2Month             0
Day                       0
Month                     0
Year                      0
WeekofYear                0
dtype: int64

test_inputs[numeric_cols].isna().sum()

Store                   0
Promo                   0
SchoolHoliday           0
CompetitionDistance    96
CompetitionOpen         0
Promo2                  0
Promo2Open              0
IsPromo2Month           0
Day                     0
Month                   0
Year                    0
WeekofYear              0
dtype: int64

Seems like competition distance is the only missing value, and we can simply fill it with the highest value (to indicate that competition is very far away).

max_distance = inputs.CompetitionDistance.max()
max_distance

75860.0

inputs['CompetitionDistance'].fillna(max_distance*2, inplace=True)
test_inputs['CompetitionDistance'].fillna(max_distance*2, inplace=True)

Scale Numeric Value

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler().fit(inputs[numeric_cols])

inputs[numeric_cols] = scaler.transform(inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

inputs[numeric_cols].head(10)

Encode Categorical Columns

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse=False, handle_unknown='ignore').fit(inputs[categorical_cols])
encoded_cols = list(encoder.get_feature_names(categorical_cols))

inputs[encoded_cols] = encoder.transform(inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])

encoded_cols

['DayOfWeek_1',
 'DayOfWeek_2',
 'DayOfWeek_3',
 'DayOfWeek_4',
 'DayOfWeek_5',
 'DayOfWeek_6',
 'DayOfWeek_7',
 'StateHoliday_0',
 'StateHoliday_a',
 'StateHoliday_b',
 'StateHoliday_c',
 'StoreType_a',
 'StoreType_b',
 'StoreType_c',
 'StoreType_d',
 'Assortment_a',
 'Assortment_b',
 'Assortment_c']

X = inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]

X.head(10)

checkout_otherswork

Gradient Boosting

The average value of the target column and uses as an initial prediction every input.
The residuals (difference) of the predictions with the targets are computed.
A decision tree of limited depth is trained to predict just the residuals for each input.
Predictions from the decision tree are scaled using a parameter called the learning rate (this prevents overfitting)
Scaled predictions fro the tree are added to the previous predictions to obtain the new and improved predictions.
Steps 2 to 5 are repeated to create new decision trees, each of which is trained to predict just the residuals from the previous prediction.

The term "gradient" refers to the fact that each decision tree is trained with the purpose of reducing the loss from the previous iteration (similar to gradient descent). The term "boosting" refers the general technique of training new models to improve the results of an existing model.

For a mathematical explanation of gradient boosting, check out the following resources:

Here's a visual representation of gradient boosting:

Gradient Boosting Machines: GBMs

Traning

from xgboost import XGBRegressor

?XGBRegressor

model = XGBRegressor(random_state=42, n_jobs=-1, n_estimators=20, max_depth=4)

lets train model using model.fit

%%time
model.fit(X, targets)

CPU times: user 42.7 s, sys: 981 ms, total: 43.7 s
Wall time: 4.16 s

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=4,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=20, n_jobs=-1, num_parallel_tree=1, random_state=42,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

Prediction

preds = model.predict(X)

preds

array([ 8127.9404,  7606.919 ,  8525.857 , ...,  6412.8247,  9460.068 ,
       10302.145 ], dtype=float32)

Evaluation

from sklearn.metrics import mean_squared_error

def rmse(a, b):
    return mean_squared_error(a, b, squared=False)

rmse(preds, targets)

2377.752008804669

merged_df.Sales.min(), merged_df.Sales.max()

(0, 41551)

import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(merged_df.Sales.sample(10000))

(array([3.640e+02, 4.017e+03, 3.639e+03, 1.357e+03, 3.930e+02, 1.420e+02,
        5.900e+01, 2.300e+01, 4.000e+00, 2.000e+00]),
 array([    0. ,  3002.6,  6005.2,  9007.8, 12010.4, 15013. , 18015.6,
        21018.2, 24020.8, 27023.4, 30026. ]),
 <BarContainer object of 10 artists>)

Visualization

from xgboost import plot_tree
from matplotlib.pylab import rcParams

rcParams['figure.figsize'] = 60,60

plot_tree(model, rankdir='LR', num_trees=0);

plot_tree(model, rankdir='LR', num_trees=1);

plot_tree(model, rankdir='LR', num_trees=19);

trees = model.get_booster().get_dump()

len(trees)

20

print(trees[0])

0:[Promo<0.5] yes=1,no=2,missing=1
	1:[StoreType_b<0.5] yes=3,no=4,missing=3
		3:[Assortment_a<0.5] yes=7,no=8,missing=7
			7:[CompetitionDistance<0.00220830599] yes=15,no=16,missing=15
				15:leaf=2309.51147
				16:leaf=1823.30444
			8:[WeekofYear<0.911764741] yes=17,no=18,missing=17
				17:leaf=1619.43994
				18:leaf=2002.44897
		4:[CompetitionDistance<0.00800922886] yes=9,no=10,missing=9
			9:[CompetitionDistance<0.00672379695] yes=19,no=20,missing=19
				19:leaf=2740.44067
				20:leaf=5576.85889
			10:[DayOfWeek_7<0.5] yes=21,no=22,missing=21
				21:leaf=1898.36487
				22:leaf=2961.08765
	2:[DayOfWeek_1<0.5] yes=5,no=6,missing=5
		5:[Month<0.954545498] yes=11,no=12,missing=11
			11:[StoreType_b<0.5] yes=23,no=24,missing=23
				23:leaf=2295.30566
				24:leaf=3294.27759
			12:[Day<0.333333343] yes=25,no=26,missing=25
				25:leaf=2754.58521
				26:leaf=3246.39014
		6:[Month<0.954545498] yes=13,no=14,missing=13
			13:[CompetitionDistance<0.00135135138] yes=27,no=28,missing=27
				27:leaf=3347.80688
				28:leaf=2839.39551
			14:[Day<0.25] yes=29,no=30,missing=29
				29:leaf=3400.54419
				30:leaf=4059.85938

Feature Importance

importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance':model.feature_importances_
}).sort_values('importance', ascending=False)

importance_df.head(10)

import seaborn as sns
plt.figure(figsize=(10, 6))
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature');

K Fold Cross Validation

kfold

from sklearn.model_selection import KFold

def train_and_evaluate(X_train, train_targets, X_val, val_targets, **params):
    model = XGBRegressor(random_state=42, n_jobs=-1, **params)
    model.fit(X_train, train_targets)
    train_rmse = rmse(model.predict(X_train), train_targets)
    val_rmse = rmse(model.predict(X_val), val_targets)
    return model, train_rmse, val_rmse

kfold = KFold(n_splits=5)

models = []

for train_idxs, val_idxs in kfold.split(X):
    X_train, train_targets = X.iloc[train_idxs], targets.iloc[train_idxs]
    X_val, val_targets = X.iloc[val_idxs], targets.iloc[val_idxs]
    model, train_rmse, val_rmse = train_and_evaluate(X_train, 
                                                     train_targets, 
                                                     X_val, 
                                                     val_targets, 
                                                     max_depth=4, 
                                                     n_estimators=20)
    models.append(model)
    print('Train RMSE: {}, Validation RMSE: {}'.format(train_rmse, val_rmse))

Train RMSE: 2352.216448531526, Validation RMSE: 2424.6228916973314
Train RMSE: 2406.709513789309, Validation RMSE: 2451.9646038059277
Train RMSE: 2365.7354745443067, Validation RMSE: 2336.984157073758
Train RMSE: 2366.4732092777763, Validation RMSE: 2460.8995475901697
Train RMSE: 2379.3752997474626, Validation RMSE: 2440.665320626728

import numpy as np

def predict_avg(models, inputs):
    return np.mean([model.predict(inputs) for model in models], axis=0)

preds = predict_avg(models, X)

preds

array([8021.374 , 7577.715 , 8747.863 , ..., 7615.0303, 7924.784 ,
       9600.297 ], dtype=float32)

Hyperparameter Tuning and Regularization

Just like other machine learning models, there are several hyperparameters we can to adjust the capacity of model and reduce overfitting.

hpt

Check out the following resources to learn more about hyperparameter supported by XGBoost:

model

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=4,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=20, n_jobs=-1, num_parallel_tree=1, random_state=42,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

def test_params_kfold(n_splits, **params):
    train_rmses, val_rmses, models = [], [], []
    kfold = KFold(n_splits)
    for train_idxs, val_idxs in kfold.split(X):
        X_train, train_targets = X.iloc[train_idxs], targets.iloc[train_idxs]
        X_val, val_targets = X.iloc[val_idxs], targets.iloc[val_idxs]
        model, train_rmse, val_rmse = train_and_evaluate(X_train, train_targets, X_val, val_targets, **params)
        models.append(model)
        train_rmses.append(train_rmse)
        val_rmses.append(val_rmse)
    print('Train RMSE: {}, Validation RMSE: {}'.format(np.mean(train_rmses), np.mean(val_rmses)))
    return models

from sklearn.model_selection import train_test_split

X_train, X_val, train_targets, val_targets = train_test_split(X, targets, test_size=0.1)

def test_params(**params):
    model = XGBRegressor(n_jobs=-1, random_state=42, **params)
    model.fit(X_train, train_targets)
    train_rmse = rmse(model.predict(X_train), train_targets)
    val_rmse = rmse(model.predict(X_val), val_targets)
    print('Train RMSE: {}, Validation RMSE: {}'.format(train_rmse, val_rmse))

n_estimators

test_params(n_estimators=10)

Train RMSE: 2362.76195480308, Validation RMSE: 2348.702085349391

test_params(n_estimators=30)

Train RMSE: 1848.646131664556, Validation RMSE: 1839.8374140100673

test_params(n_estimators=100)

Train RMSE: 1189.0907835721493, Validation RMSE: 1182.191673835398

test_params(n_estimators=240)

Train RMSE: 900.8147134992857, Validation RMSE: 909.3936493978623

max_depth

test_params(max_depth=2)

Train RMSE: 2337.546637614515, Validation RMSE: 2315.858142292135

test_params(max_depth=5)

Train RMSE: 1420.072022973741, Validation RMSE: 1411.0540606302004

test_params(max_depth=10)

Train RMSE: 687.2499006884119, Validation RMSE: 775.8096745259544

learning_rate

test_params(n_estimators=50, learning_rate=0.01)

Train RMSE: 5044.48850172585, Validation RMSE: 5036.89440315394

test_params(n_estimators=50, learning_rate=0.1)

Train RMSE: 2200.172003242853, Validation RMSE: 2183.102936809057

test_params(n_estimators=50, learning_rate=0.3)

Train RMSE: 1541.0745464792515, Validation RMSE: 1529.8196486972695

test_params(n_estimators=50, learning_rate=0.9)

Train RMSE: 1121.9325695263058, Validation RMSE: 1123.541189011288

Booster

test_params(booster='gblinear')

Train RMSE: 2730.688924077907, Validation RMSE: 2707.009329528743

Putting it Together and Making Predictions

model = XGBRegressor(n_jobs=-1, random_state=42, n_estimators=1000, 
                     learning_rate=0.2, max_depth=10, subsample=0.9, 
                     colsample_bytree=0.7)

%%time
model.fit(X, targets)

CPU times: user 1h 6min 23s, sys: 13.4 s, total: 1h 6min 36s
Wall time: 5min 49s

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.7, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.2, max_delta_step=0, max_depth=10,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=1000, n_jobs=-1, num_parallel_tree=1, random_state=42,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.9,
             tree_method='exact', validate_parameters=1, verbosity=None)

test_preds = model.predict(X_test)

submission_df['Sales']  = test_preds

test_df.Open.isna().sum()

11

submission_df['Sales'] = submission_df['Sales'] * test_df.Open.fillna(1.)

submission_df

submission_df.to_csv('submission.csv', index=None)

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	StateHoliday	SchoolHoliday
0	1	5	2015-07-31	5263	555	1	1	0	1
1	2	5	2015-07-31	6064	625	1	1	0	1
2	3	5	2015-07-31	8314	821	1	1	0	1
3	4	5	2015-07-31	13995	1498	1	1	0	1
4	5	5	2015-07-31	4822	559	1	1	0	1
...	...	...	...	...	...	...	...	...	...
1017204	1111	2	2013-01-01	0	0	0	0	a	1
1017205	1112	2	2013-01-01	0	0	0	0	a	1
1017206	1113	2	2013-01-01	0	0	0	0	a	1
1017207	1114	2	2013-01-01	0	0	0	0	a	1
1017208	1115	2	2013-01-01	0	0	0	0	a	1

	Store	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval
0	1	c	a	1270.0	9.0	2008.0	0	NaN	NaN	NaN
1	2	a	a	570.0	11.0	2007.0	1	13.0	2010.0	Jan,Apr,Jul,Oct
2	3	a	a	14130.0	12.0	2006.0	1	14.0	2011.0	Jan,Apr,Jul,Oct
3	4	c	c	620.0	9.0	2009.0	0	NaN	NaN	NaN
4	5	a	a	29910.0	4.0	2015.0	0	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...
1110	1111	a	a	1900.0	6.0	2014.0	1	31.0	2013.0	Jan,Apr,Jul,Oct
1111	1112	c	c	1880.0	4.0	2006.0	0	NaN	NaN	NaN
1112	1113	a	c	9260.0	NaN	NaN	0	NaN	NaN	NaN
1113	1114	a	c	870.0	NaN	NaN	0	NaN	NaN	NaN
1114	1115	d	c	5350.0	NaN	NaN	1	22.0	2012.0	Mar,Jun,Sept,Dec

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	StateHoliday	SchoolHoliday	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval
0	1	5	2015-07-31	5263	555	1	1	0	1	c	a	1270.0	9.0	2008.0	0	NaN	NaN	NaN
1	2	5	2015-07-31	6064	625	1	1	0	1	a	a	570.0	11.0	2007.0	1	13.0	2010.0	Jan,Apr,Jul,Oct
2	3	5	2015-07-31	8314	821	1	1	0	1	a	a	14130.0	12.0	2006.0	1	14.0	2011.0	Jan,Apr,Jul,Oct
3	4	5	2015-07-31	13995	1498	1	1	0	1	c	c	620.0	9.0	2009.0	0	NaN	NaN	NaN
4	5	5	2015-07-31	4822	559	1	1	0	1	a	a	29910.0	4.0	2015.0	0	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1017204	1111	2	2013-01-01	0	0	0	0	a	1	a	a	1900.0	6.0	2014.0	1	31.0	2013.0	Jan,Apr,Jul,Oct
1017205	1112	2	2013-01-01	0	0	0	0	a	1	c	c	1880.0	4.0	2006.0	0	NaN	NaN	NaN
1017206	1113	2	2013-01-01	0	0	0	0	a	1	a	c	9260.0	NaN	NaN	0	NaN	NaN	NaN
1017207	1114	2	2013-01-01	0	0	0	0	a	1	a	c	870.0	NaN	NaN	0	NaN	NaN	NaN
1017208	1115	2	2013-01-01	0	0	0	0	a	1	d	c	5350.0	NaN	NaN	1	22.0	2012.0	Mar,Jun,Sept,Dec

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	StateHoliday	SchoolHoliday	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval	Year	Month	Day	WeekofYear
0	1	5	2015-07-31	5263	555	1	1	0	1	c	a	1270.0	9.0	2008.0	0	NaN	NaN	NaN	2015	7	31	31
1	2	5	2015-07-31	6064	625	1	1	0	1	a	a	570.0	11.0	2007.0	1	13.0	2010.0	Jan,Apr,Jul,Oct	2015	7	31	31
2	3	5	2015-07-31	8314	821	1	1	0	1	a	a	14130.0	12.0	2006.0	1	14.0	2011.0	Jan,Apr,Jul,Oct	2015	7	31	31
3	4	5	2015-07-31	13995	1498	1	1	0	1	c	c	620.0	9.0	2009.0	0	NaN	NaN	NaN	2015	7	31	31
4	5	5	2015-07-31	4822	559	1	1	0	1	a	a	29910.0	4.0	2015.0	0	NaN	NaN	NaN	2015	7	31	31
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1017204	1111	2	2013-01-01	0	0	0	0	a	1	a	a	1900.0	6.0	2014.0	1	31.0	2013.0	Jan,Apr,Jul,Oct	2013	1	1	1
1017205	1112	2	2013-01-01	0	0	0	0	a	1	c	c	1880.0	4.0	2006.0	0	NaN	NaN	NaN	2013	1	1	1
1017206	1113	2	2013-01-01	0	0	0	0	a	1	a	c	9260.0	NaN	NaN	0	NaN	NaN	NaN	2013	1	1	1
1017207	1114	2	2013-01-01	0	0	0	0	a	1	a	c	870.0	NaN	NaN	0	NaN	NaN	NaN	2013	1	1	1
1017208	1115	2	2013-01-01	0	0	0	0	a	1	d	c	5350.0	NaN	NaN	1	22.0	2012.0	Mar,Jun,Sept,Dec	2013	1	1	1

	Store	DayOfWeek	Date	Sales	Customers	Open	Promo	SchoolHoliday	StoreType	Assortment	CompetitionDistance	CompetitionOpenSinceMonth	CompetitionOpenSinceYear	Promo2	Promo2SinceWeek	Promo2SinceYear	PromoInterval	Year	Month	Day	WeekofYear
81229	950	3	2015-05-20	6709	619	1	1	0	a	a	8460.0	11.0	1994.0	0	NaN	NaN	NaN	2015	5	20	21
3394	50	2	2015-07-28	5115	390	1	1	1	d	a	6260.0	11.0	2009.0	0	NaN	NaN	NaN	2015	7	28	31
157691	477	4	2015-03-12	5154	394	1	0	0	d	a	770.0	7.0	2010.0	1	35.0	2010.0	Jan,Apr,Jul,Oct	2015	3	12	11
1013035	286	5	2013-01-04	4505	442	1	0	1	a	a	1460.0	4.0	2015.0	0	NaN	NaN	NaN	2013	1	4	1
366973	757	4	2014-08-14	4553	465	1	0	1	a	c	3450.0	NaN	NaN	0	NaN	NaN	NaN	2014	8	14	33
40863	724	4	2015-06-25	5821	635	1	0	0	d	c	5900.0	NaN	NaN	0	NaN	NaN	NaN	2015	6	25	26
416853	629	1	2014-06-23	7655	637	1	0	0	d	a	510.0	7.0	2014.0	1	23.0	2015.0	Mar,Jun,Sept,Dec	2014	6	23	26
213001	37	3	2015-01-21	4175	535	1	0	0	c	a	4230.0	12.0	2014.0	0	NaN	NaN	NaN	2015	1	21	4
228007	548	4	2015-01-08	4576	359	1	1	1	d	c	3760.0	2.0	2009.0	0	NaN	NaN	NaN	2015	1	8	2
59667	573	1	2015-06-08	3584	310	1	0	0	a	a	1190.0	11.0	2012.0	1	36.0	2013.0	Jan,Apr,Jul,Oct	2015	6	8	24

	Id	Store	DayOfWeek	Date	Open	Promo	StateHoliday	SchoolHoliday
0	1	1	4	2015-09-17	1.0	1	0	0
1	2	3	4	2015-09-17	1.0	1	0	0
2	3	7	4	2015-09-17	1.0	1	0	0
3	4	8	4	2015-09-17	1.0	1	0	0
4	5	9	4	2015-09-17	1.0	1	0	0
...	...	...	...	...	...	...	...	...
41083	41084	1111	6	2015-08-01	1.0	0	0	0
41084	41085	1112	6	2015-08-01	1.0	0	0	0
41085	41086	1113	6	2015-08-01	1.0	0	0	0
41086	41087	1114	6	2015-08-01	1.0	0	0	0
41087	41088	1115	6	2015-08-01	1.0	0	0	1

	Date	Promo2	Promo2SinceYear	Promo2SinceWeek	PromoInterval	Promo2Open
778123	2013-08-03	0	NaN	NaN	NaN	0.000000
485713	2014-04-22	0	NaN	NaN	NaN	0.000000
212896	2015-01-22	1	2009.0	45.0	Feb,May,Aug,Nov	62.590164
998575	2013-01-17	1	2014.0	10.0	Mar,Jun,Sept,Dec	0.000000
408579	2014-06-30	0	NaN	NaN	NaN	0.000000
891338	2013-04-23	0	NaN	NaN	NaN	0.000000
884400	2013-04-30	0	NaN	NaN	NaN	0.000000
892016	2013-04-23	1	2011.0	22.0	Feb,May,Aug,Nov	22.852459
645306	2013-11-30	1	2009.0	37.0	Jan,Apr,Jul,Oct	50.524590
429670	2014-06-11	1	2009.0	37.0	Jan,Apr,Jul,Oct	57.016393
989989	2013-01-25	1	2013.0	5.0	Feb,May,Aug,Nov	0.000000
110965	2015-04-23	0	NaN	NaN	NaN	0.000000
723857	2013-09-21	0	NaN	NaN	NaN	0.000000
445329	2014-05-28	1	2014.0	40.0	Jan,Apr,Jul,Oct	0.000000
1298	2015-07-30	0	NaN	NaN	NaN	0.000000
575426	2014-02-01	1	2015.0	23.0	Mar,Jun,Sept,Dec	0.000000
232178	2015-01-04	0	NaN	NaN	NaN	0.000000
196856	2015-02-05	1	2011.0	9.0	Jan,Apr,Jul,Oct	47.311475
926871	2013-03-23	1	2014.0	40.0	Jan,Apr,Jul,Oct	0.000000
950895	2013-03-01	0	NaN	NaN	NaN	0.000000

	Store	DayOfWeek	Promo	StateHoliday	SchoolHoliday	StoreType	Assortment	CompetitionDistance	CompetitionOpen	Day	Month	Year	WeekofYear	Promo2	Promo2Open	IsPromo2Month
0	1	5	1	0	1	c	a	1270.0	82.0	31	7	2015	31	0	0.000000	0
1	2	5	1	0	1	a	a	570.0	92.0	31	7	2015	31	1	64.131148	-1
2	3	5	1	0	1	a	a	14130.0	103.0	31	7	2015	31	1	51.901639	-1
3	4	5	1	0	1	c	c	620.0	70.0	31	7	2015	31	0	0.000000	0
4	5	5	1	0	1	a	a	29910.0	3.0	31	7	2015	31	0	0.000000	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1016776	682	2	0	a	1	b	a	150.0	76.0	1	1	2013	1	0	0.000000	0
1016827	733	2	0	a	1	b	b	860.0	159.0	1	1	2013	1	0	0.000000	0
1016863	769	2	0	a	1	b	b	840.0	0.0	1	1	2013	1	1	1.213115	-1
1017042	948	2	0	a	1	b	b	1430.0	0.0	1	1	2013	1	0	0.000000	0
1017190	1097	2	0	a	1	b	b	720.0	130.0	1	1	2013	1	0	0.000000	0

	Store	Promo	SchoolHoliday	CompetitionDistance	CompetitionOpen	Promo2	Promo2Open	IsPromo2Month	Day	Month	Year	WeekofYear
0	0.000000	1.0	1.0	0.008240	0.059163	0.0	0.000000	1.0	1.0	0.545455	1.0	0.588235
1	0.000898	1.0	1.0	0.003626	0.066378	1.0	0.890710	0.0	1.0	0.545455	1.0	0.588235
2	0.001795	1.0	1.0	0.093013	0.074315	1.0	0.720856	0.0	1.0	0.545455	1.0	0.588235
3	0.002693	1.0	1.0	0.003955	0.050505	0.0	0.000000	1.0	1.0	0.545455	1.0	0.588235
4	0.003591	1.0	1.0	0.197034	0.002165	0.0	0.000000	1.0	1.0	0.545455	1.0	0.588235
5	0.004488	1.0	1.0	0.001912	0.013709	0.0	0.000000	1.0	1.0	0.545455	1.0	0.588235
6	0.005386	1.0	1.0	0.158075	0.019481	0.0	0.000000	1.0	1.0	0.545455	1.0	0.588235
7	0.006284	1.0	1.0	0.049440	0.006494	0.0	0.000000	1.0	1.0	0.545455	1.0	0.588235
8	0.007181	1.0	1.0	0.013250	0.129149	0.0	0.000000	1.0	1.0	0.545455	1.0	0.588235
9	0.008079	1.0	1.0	0.020699	0.050505	0.0	0.000000	1.0	1.0	0.545455	1.0	0.588235

	feature	importance
1	Promo	0.317473
24	StoreType_b	0.086472
12	DayOfWeek_1	0.082269
5	Promo2	0.063986
3	CompetitionDistance	0.045053
29	Assortment_c	0.040226
27	Assortment_a	0.038759
9	Month	0.038493
0	Store	0.038119
8	Day	0.033209

	Id	Sales
0	1	3978.170898
1	2	7715.094727
2	3	8728.141602
3	4	7502.937012
4	5	7036.492676
...	...	...
41083	41084	2987.847168
41084	41085	6813.018066
41085	41086	6863.016602
41086	41087	23361.136719
41087	41088	6743.966797