Store Item Demand Kaggle
Store item Demand Forecasting Challenge
%matplotlib inline
%reload_ext autoreload
%autoreload 2
Import fastai libraries to use columnar data in DNN
from fastai.structured import *
from fastai.column_data import *
np.set_printoptions(threshold=50, edgeitems=20)
PATH='/home/adrianrdzv/Documentos/fastai/fastai/data/forecasting/'
Feature Space:
- train: Los datos de entranamiento proporcionados por Kaggle
- submmission: ejemplo de salida
- test: los datos de salida que generaremos
Analyze data previous to data cleansing
table_names = ['train', 'test']
tables = [pd.read_csv(f'{PATH}{fname}.csv', low_memory=False) for fname in table_names]
from IPython.display import HTML
for t in tables: display(t.head())
date | store | item | sales | |
---|---|---|---|---|
0 | 2013-01-01 | 1 | 1 | 13 |
1 | 2013-01-02 | 1 | 1 | 11 |
2 | 2013-01-03 | 1 | 1 | 14 |
3 | 2013-01-04 | 1 | 1 | 13 |
4 | 2013-01-05 | 1 | 1 | 10 |
id | date | store | item | |
---|---|---|---|---|
0 | 0 | 2018-01-01 | 1 | 1 |
1 | 1 | 2018-01-02 | 1 | 1 |
2 | 2 | 2018-01-03 | 1 | 1 |
3 | 3 | 2018-01-04 | 1 | 1 |
4 | 4 | 2018-01-05 | 1 | 1 |
for t in tables: display(DataFrameSummary(t).summary())
date | store | item | sales | |
---|---|---|---|---|
count | NaN | 913000 | 913000 | 913000 |
mean | NaN | 5.5 | 25.5 | 52.2503 |
std | NaN | 2.87228 | 14.4309 | 28.8011 |
min | NaN | 1 | 1 | 0 |
25% | NaN | 3 | 13 | 30 |
50% | NaN | 5.5 | 25.5 | 47 |
75% | NaN | 8 | 38 | 70 |
max | NaN | 10 | 50 | 231 |
counts | 913000 | 913000 | 913000 | 913000 |
uniques | 1826 | 10 | 50 | 213 |
missing | 0 | 0 | 0 | 0 |
missing_perc | 0% | 0% | 0% | 0% |
types | categorical | numeric | numeric | numeric |
id | date | store | item | |
---|---|---|---|---|
count | 45000 | NaN | 45000 | 45000 |
mean | 22499.5 | NaN | 5.5 | 25.5 |
std | 12990.5 | NaN | 2.87231 | 14.431 |
min | 0 | NaN | 1 | 1 |
25% | 11249.8 | NaN | 3 | 13 |
50% | 22499.5 | NaN | 5.5 | 25.5 |
75% | 33749.2 | NaN | 8 | 38 |
max | 44999 | NaN | 10 | 50 |
counts | 45000 | 45000 | 45000 | 45000 |
uniques | 45000 | 90 | 10 | 50 |
missing | 0 | 0 | 0 | 0 |
missing_perc | 0% | 0% | 0% | 0% |
types | numeric | categorical | numeric | numeric |
Data Cleaning / Feature Engineering
Adjusting the data to be use in the DNN
train,test = tables
len(train),len(test)
(913000, 45000)
La siguiente función “add_datepart” nos proporciona una gran variedad de variables temporales que pueden capturar comportamientos de estacioanlidad y demas caracteristicas temporales en los datos
The next function “add_datepart” will give us a variety of temporary variables who captures seasonality behaviours and stuff related to time series
add_datepart(train, "date", drop=False)
add_datepart(test, "date", drop=False)
train.head()
date | store | item | sales | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-01-01 | 1 | 1 | 13 | 2013 | 1 | 1 | 1 | 1 | 1 | False | True | False | True | False | True | 1356998400 |
1 | 2013-01-02 | 1 | 1 | 11 | 2013 | 1 | 1 | 2 | 2 | 2 | False | False | False | False | False | False | 1357084800 |
2 | 2013-01-03 | 1 | 1 | 14 | 2013 | 1 | 1 | 3 | 3 | 3 | False | False | False | False | False | False | 1357171200 |
3 | 2013-01-04 | 1 | 1 | 13 | 2013 | 1 | 1 | 4 | 4 | 4 | False | False | False | False | False | False | 1357257600 |
4 | 2013-01-05 | 1 | 1 | 10 | 2013 | 1 | 1 | 5 | 5 | 5 | False | False | False | False | False | False | 1357344000 |
test.head()
id | date | store | item | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2018-01-01 | 1 | 1 | 2018 | 1 | 1 | 1 | 0 | 1 | False | True | False | True | False | True | 1514764800 |
1 | 1 | 2018-01-02 | 1 | 1 | 2018 | 1 | 1 | 2 | 1 | 2 | False | False | False | False | False | False | 1514851200 |
2 | 2 | 2018-01-03 | 1 | 1 | 2018 | 1 | 1 | 3 | 2 | 3 | False | False | False | False | False | False | 1514937600 |
3 | 3 | 2018-01-04 | 1 | 1 | 2018 | 1 | 1 | 4 | 3 | 4 | False | False | False | False | False | False | 1515024000 |
4 | 4 | 2018-01-05 | 1 | 1 | 2018 | 1 | 1 | 5 | 4 | 5 | False | False | False | False | False | False | 1515110400 |
columns = ["date"]
df = train[columns]
df = df.set_index("date")
df.reset_index(inplace=True)
df.head()
date | |
---|---|
0 | 2013-01-01 |
1 | 2013-01-02 |
2 | 2013-01-03 |
3 | 2013-01-04 |
4 | 2013-01-05 |
test = test.set_index("date")
test.reset_index(inplace=True)
test.head()
date | id | store | item | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2018-01-01 | 0 | 1 | 1 | 2018 | 1 | 1 | 1 | 0 | 1 | False | True | False | True | False | True | 1514764800 |
1 | 2018-01-02 | 1 | 1 | 1 | 2018 | 1 | 1 | 2 | 1 | 2 | False | False | False | False | False | False | 1514851200 |
2 | 2018-01-03 | 2 | 1 | 1 | 2018 | 1 | 1 | 3 | 2 | 3 | False | False | False | False | False | False | 1514937600 |
3 | 2018-01-04 | 3 | 1 | 1 | 2018 | 1 | 1 | 4 | 3 | 4 | False | False | False | False | False | False | 1515024000 |
4 | 2018-01-05 | 4 | 1 | 1 | 2018 | 1 | 1 | 5 | 4 | 5 | False | False | False | False | False | False | 1515110400 |
Para poder hacer uso de las funciones de la red neuronal debemos identificar que variables trataremos como categoricas y cuales como continuas, para este caso todas las trataremos como categoricas
#cat_vars = ['store', 'item', 'Dayofweek', 'Year', 'Month', 'Day']
cat_vars = ['store','item', 'Year', 'Month', 'Week', 'Day','Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start','Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']
contin_vars = []
n = len(test); n
45000
dep = 'sales'
test[dep]=0
join_test=test.copy() # Pendiente eliminar
Debemos modificar explicitamente el dataframe para que cada variable categorica se exprese como tal, dado que asi espera las funciones
We need to modify explictly the dataframe in order to every categorical variable be tagged as categorical, and the same for continuos
for v in cat_vars: test[v] = test[v].astype('category').cat.as_ordered()
for v in cat_vars: train[v] = train[v].astype('category').cat.as_ordered()
for v in contin_vars:
train[v] = train[v].astype('float32')
test[v] = test[v].astype('float32')
We are going to use the full dataset to train our model
samp_size = len(train)
joined_samp = train.set_index("date")
#joined_samp = joined_samp.set_index("date")
joined_test = test.set_index("date")
We can now process our data…
joined_samp.head(10)
store | item | sales | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||||||||
2013-01-01 | 1 | 1 | 13 | 2013 | 1 | 1 | 1 | 1 | 1 | False | True | False | True | False | True | 1356998400 |
2013-01-02 | 1 | 1 | 11 | 2013 | 1 | 1 | 2 | 2 | 2 | False | False | False | False | False | False | 1357084800 |
2013-01-03 | 1 | 1 | 14 | 2013 | 1 | 1 | 3 | 3 | 3 | False | False | False | False | False | False | 1357171200 |
2013-01-04 | 1 | 1 | 13 | 2013 | 1 | 1 | 4 | 4 | 4 | False | False | False | False | False | False | 1357257600 |
2013-01-05 | 1 | 1 | 10 | 2013 | 1 | 1 | 5 | 5 | 5 | False | False | False | False | False | False | 1357344000 |
2013-01-06 | 1 | 1 | 12 | 2013 | 1 | 1 | 6 | 6 | 6 | False | False | False | False | False | False | 1357430400 |
2013-01-07 | 1 | 1 | 10 | 2013 | 1 | 2 | 7 | 0 | 7 | False | False | False | False | False | False | 1357516800 |
2013-01-08 | 1 | 1 | 9 | 2013 | 1 | 2 | 8 | 1 | 8 | False | False | False | False | False | False | 1357603200 |
2013-01-09 | 1 | 1 | 12 | 2013 | 1 | 2 | 9 | 2 | 9 | False | False | False | False | False | False | 1357689600 |
2013-01-10 | 1 | 1 | 9 | 2013 | 1 | 2 | 10 | 3 | 10 | False | False | False | False | False | False | 1357776000 |
joined_test
id | store | item | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed | sales | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||||||
2018-01-01 | 0 | 1 | 1 | 2018 | 1 | 1 | 1 | 0 | 1 | False | True | False | True | False | True | 1514764800 | 0 |
2018-01-02 | 1 | 1 | 1 | 2018 | 1 | 1 | 2 | 1 | 2 | False | False | False | False | False | False | 1514851200 | 0 |
2018-01-03 | 2 | 1 | 1 | 2018 | 1 | 1 | 3 | 2 | 3 | False | False | False | False | False | False | 1514937600 | 0 |
2018-01-04 | 3 | 1 | 1 | 2018 | 1 | 1 | 4 | 3 | 4 | False | False | False | False | False | False | 1515024000 | 0 |
2018-01-05 | 4 | 1 | 1 | 2018 | 1 | 1 | 5 | 4 | 5 | False | False | False | False | False | False | 1515110400 | 0 |
2018-01-06 | 5 | 1 | 1 | 2018 | 1 | 1 | 6 | 5 | 6 | False | False | False | False | False | False | 1515196800 | 0 |
2018-01-07 | 6 | 1 | 1 | 2018 | 1 | 1 | 7 | 6 | 7 | False | False | False | False | False | False | 1515283200 | 0 |
2018-01-08 | 7 | 1 | 1 | 2018 | 1 | 2 | 8 | 0 | 8 | False | False | False | False | False | False | 1515369600 | 0 |
2018-01-09 | 8 | 1 | 1 | 2018 | 1 | 2 | 9 | 1 | 9 | False | False | False | False | False | False | 1515456000 | 0 |
2018-01-10 | 9 | 1 | 1 | 2018 | 1 | 2 | 10 | 2 | 10 | False | False | False | False | False | False | 1515542400 | 0 |
2018-01-11 | 10 | 1 | 1 | 2018 | 1 | 2 | 11 | 3 | 11 | False | False | False | False | False | False | 1515628800 | 0 |
2018-01-12 | 11 | 1 | 1 | 2018 | 1 | 2 | 12 | 4 | 12 | False | False | False | False | False | False | 1515715200 | 0 |
2018-01-13 | 12 | 1 | 1 | 2018 | 1 | 2 | 13 | 5 | 13 | False | False | False | False | False | False | 1515801600 | 0 |
2018-01-14 | 13 | 1 | 1 | 2018 | 1 | 2 | 14 | 6 | 14 | False | False | False | False | False | False | 1515888000 | 0 |
2018-01-15 | 14 | 1 | 1 | 2018 | 1 | 3 | 15 | 0 | 15 | False | False | False | False | False | False | 1515974400 | 0 |
2018-01-16 | 15 | 1 | 1 | 2018 | 1 | 3 | 16 | 1 | 16 | False | False | False | False | False | False | 1516060800 | 0 |
2018-01-17 | 16 | 1 | 1 | 2018 | 1 | 3 | 17 | 2 | 17 | False | False | False | False | False | False | 1516147200 | 0 |
2018-01-18 | 17 | 1 | 1 | 2018 | 1 | 3 | 18 | 3 | 18 | False | False | False | False | False | False | 1516233600 | 0 |
2018-01-19 | 18 | 1 | 1 | 2018 | 1 | 3 | 19 | 4 | 19 | False | False | False | False | False | False | 1516320000 | 0 |
2018-01-20 | 19 | 1 | 1 | 2018 | 1 | 3 | 20 | 5 | 20 | False | False | False | False | False | False | 1516406400 | 0 |
2018-01-21 | 20 | 1 | 1 | 2018 | 1 | 3 | 21 | 6 | 21 | False | False | False | False | False | False | 1516492800 | 0 |
2018-01-22 | 21 | 1 | 1 | 2018 | 1 | 4 | 22 | 0 | 22 | False | False | False | False | False | False | 1516579200 | 0 |
2018-01-23 | 22 | 1 | 1 | 2018 | 1 | 4 | 23 | 1 | 23 | False | False | False | False | False | False | 1516665600 | 0 |
2018-01-24 | 23 | 1 | 1 | 2018 | 1 | 4 | 24 | 2 | 24 | False | False | False | False | False | False | 1516752000 | 0 |
2018-01-25 | 24 | 1 | 1 | 2018 | 1 | 4 | 25 | 3 | 25 | False | False | False | False | False | False | 1516838400 | 0 |
2018-01-26 | 25 | 1 | 1 | 2018 | 1 | 4 | 26 | 4 | 26 | False | False | False | False | False | False | 1516924800 | 0 |
2018-01-27 | 26 | 1 | 1 | 2018 | 1 | 4 | 27 | 5 | 27 | False | False | False | False | False | False | 1517011200 | 0 |
2018-01-28 | 27 | 1 | 1 | 2018 | 1 | 4 | 28 | 6 | 28 | False | False | False | False | False | False | 1517097600 | 0 |
2018-01-29 | 28 | 1 | 1 | 2018 | 1 | 5 | 29 | 0 | 29 | False | False | False | False | False | False | 1517184000 | 0 |
2018-01-30 | 29 | 1 | 1 | 2018 | 1 | 5 | 30 | 1 | 30 | False | False | False | False | False | False | 1517270400 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2018-03-02 | 44970 | 10 | 50 | 2018 | 3 | 9 | 2 | 4 | 61 | False | False | False | False | False | False | 1519948800 | 0 |
2018-03-03 | 44971 | 10 | 50 | 2018 | 3 | 9 | 3 | 5 | 62 | False | False | False | False | False | False | 1520035200 | 0 |
2018-03-04 | 44972 | 10 | 50 | 2018 | 3 | 9 | 4 | 6 | 63 | False | False | False | False | False | False | 1520121600 | 0 |
2018-03-05 | 44973 | 10 | 50 | 2018 | 3 | 10 | 5 | 0 | 64 | False | False | False | False | False | False | 1520208000 | 0 |
2018-03-06 | 44974 | 10 | 50 | 2018 | 3 | 10 | 6 | 1 | 65 | False | False | False | False | False | False | 1520294400 | 0 |
2018-03-07 | 44975 | 10 | 50 | 2018 | 3 | 10 | 7 | 2 | 66 | False | False | False | False | False | False | 1520380800 | 0 |
2018-03-08 | 44976 | 10 | 50 | 2018 | 3 | 10 | 8 | 3 | 67 | False | False | False | False | False | False | 1520467200 | 0 |
2018-03-09 | 44977 | 10 | 50 | 2018 | 3 | 10 | 9 | 4 | 68 | False | False | False | False | False | False | 1520553600 | 0 |
2018-03-10 | 44978 | 10 | 50 | 2018 | 3 | 10 | 10 | 5 | 69 | False | False | False | False | False | False | 1520640000 | 0 |
2018-03-11 | 44979 | 10 | 50 | 2018 | 3 | 10 | 11 | 6 | 70 | False | False | False | False | False | False | 1520726400 | 0 |
2018-03-12 | 44980 | 10 | 50 | 2018 | 3 | 11 | 12 | 0 | 71 | False | False | False | False | False | False | 1520812800 | 0 |
2018-03-13 | 44981 | 10 | 50 | 2018 | 3 | 11 | 13 | 1 | 72 | False | False | False | False | False | False | 1520899200 | 0 |
2018-03-14 | 44982 | 10 | 50 | 2018 | 3 | 11 | 14 | 2 | 73 | False | False | False | False | False | False | 1520985600 | 0 |
2018-03-15 | 44983 | 10 | 50 | 2018 | 3 | 11 | 15 | 3 | 74 | False | False | False | False | False | False | 1521072000 | 0 |
2018-03-16 | 44984 | 10 | 50 | 2018 | 3 | 11 | 16 | 4 | 75 | False | False | False | False | False | False | 1521158400 | 0 |
2018-03-17 | 44985 | 10 | 50 | 2018 | 3 | 11 | 17 | 5 | 76 | False | False | False | False | False | False | 1521244800 | 0 |
2018-03-18 | 44986 | 10 | 50 | 2018 | 3 | 11 | 18 | 6 | 77 | False | False | False | False | False | False | 1521331200 | 0 |
2018-03-19 | 44987 | 10 | 50 | 2018 | 3 | 12 | 19 | 0 | 78 | False | False | False | False | False | False | 1521417600 | 0 |
2018-03-20 | 44988 | 10 | 50 | 2018 | 3 | 12 | 20 | 1 | 79 | False | False | False | False | False | False | 1521504000 | 0 |
2018-03-21 | 44989 | 10 | 50 | 2018 | 3 | 12 | 21 | 2 | 80 | False | False | False | False | False | False | 1521590400 | 0 |
2018-03-22 | 44990 | 10 | 50 | 2018 | 3 | 12 | 22 | 3 | 81 | False | False | False | False | False | False | 1521676800 | 0 |
2018-03-23 | 44991 | 10 | 50 | 2018 | 3 | 12 | 23 | 4 | 82 | False | False | False | False | False | False | 1521763200 | 0 |
2018-03-24 | 44992 | 10 | 50 | 2018 | 3 | 12 | 24 | 5 | 83 | False | False | False | False | False | False | 1521849600 | 0 |
2018-03-25 | 44993 | 10 | 50 | 2018 | 3 | 12 | 25 | 6 | 84 | False | False | False | False | False | False | 1521936000 | 0 |
2018-03-26 | 44994 | 10 | 50 | 2018 | 3 | 13 | 26 | 0 | 85 | False | False | False | False | False | False | 1522022400 | 0 |
2018-03-27 | 44995 | 10 | 50 | 2018 | 3 | 13 | 27 | 1 | 86 | False | False | False | False | False | False | 1522108800 | 0 |
2018-03-28 | 44996 | 10 | 50 | 2018 | 3 | 13 | 28 | 2 | 87 | False | False | False | False | False | False | 1522195200 | 0 |
2018-03-29 | 44997 | 10 | 50 | 2018 | 3 | 13 | 29 | 3 | 88 | False | False | False | False | False | False | 1522281600 | 0 |
2018-03-30 | 44998 | 10 | 50 | 2018 | 3 | 13 | 30 | 4 | 89 | False | False | False | False | False | False | 1522368000 | 0 |
2018-03-31 | 44999 | 10 | 50 | 2018 | 3 | 13 | 31 | 5 | 90 | True | False | True | False | False | False | 1522454400 | 0 |
45000 rows × 17 columns
The next function process the dataframe, to convert the data to the final representation to be used in the Neural network, and separates the response variable from the predictors.
df, y, nas, mapper = proc_df(joined_samp, 'sales', do_scale=True)
yl = np.log(y)
/home/adrianrdzv/anaconda3/envs/fastai/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: divide by zero encountered in log
#df_test, _, nas, mapper = proc_df(test, 'sales', do_scale=True, skip_flds=['id'], mapper=mapper, na_dict=nas)
df_test, _, nas, mapper = proc_df(joined_test, 'sales', do_scale=True, skip_flds=['id'],mapper=mapper, na_dict=nas)
As we can see in the next chunks of code the data is now all continuos
df.head()
store | item | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||||
2013-01-01 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | -1.731103 |
2013-01-02 | 1 | 1 | 1 | 1 | 1 | 2 | 3 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | -1.729205 |
2013-01-03 | 1 | 1 | 1 | 1 | 1 | 3 | 4 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | -1.727308 |
2013-01-04 | 1 | 1 | 1 | 1 | 1 | 4 | 5 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | -1.725411 |
2013-01-05 | 1 | 1 | 1 | 1 | 1 | 5 | 6 | 5 | 1 | 1 | 1 | 1 | 1 | 1 | -1.723514 |
df_test.head()
store | item | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||||
2018-01-01 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 1.733000 |
2018-01-02 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1.734897 |
2018-01-03 | 1 | 1 | 1 | 1 | 1 | 3 | 3 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1.736794 |
2018-01-04 | 1 | 1 | 1 | 1 | 1 | 4 | 4 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1.738691 |
2018-01-05 | 1 | 1 | 1 | 1 | 1 | 5 | 5 | 5 | 1 | 1 | 1 | 1 | 1 | 1 | 1.740588 |
The validation set to be used, will be accord to the expected test set, so we will use the last three months as a validation set
val_idx = np.flatnonzero(
(df.index<=datetime.datetime(2017,12,31)) & (df.index>=datetime.datetime(2017,10,1)))
len(df.iloc[val_idx])
46000
val_idx
array([ 1734, 1735, 1736, 1737, 1738, 1739, 1740, 1741, 1742, 1743, 1744, 1745,
1746, 1747, 1748, 1749, 1750, 1751, 1752, 1753, ..., 912980, 912981, 912982, 912983,
912984, 912985, 912986, 912987, 912988, 912989, 912990, 912991, 912992, 912993, 912994, 912995,
912996, 912997, 912998, 912999])
#prueba de datos de validacion en rango correcto
df.iloc[912984]
store 10.000000
item 50.000000
Year 5.000000
Month 12.000000
Week 50.000000
Day 16.000000
Dayofweek 6.000000
Dayofyear 350.000000
Is_month_end 1.000000
Is_month_start 1.000000
Is_quarter_end 1.000000
Is_quarter_start 1.000000
Is_year_end 1.000000
Is_year_start 1.000000
Elapsed 1.702646
Name: 2017-12-16 00:00:00, dtype: float64
DL
We are going to create a new metric SMAPE https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error
#Eliminar la metrica de rossman (NA's)
def inv_y(a): return np.exp(a)
def exp_rmspe(y_pred, targ):
targ = inv_y(targ)
pct_var = (targ - inv_y(y_pred))/targ
return math.sqrt((pct_var**2).mean())
max_log_y = np.max(yl)
y_range = (0, max_log_y*1.2)
max_y = np.max(y)
y_range=(0,max_y*1.2)
##F_t = y_pred A_t = targ
def SMAPE(y_pred,targ):
return (np.abs(y_pred-targ)/((np.fabs(y_pred)+np.fabs(targ))/2)).mean()
#return (math.fabs(y_pred-targ)/((math.fabs(y_pred)+math.fabs(targ))/2))
We can create a ModelData object directly from out data frame.
#md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128,
# test_df=df_test)
#md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=128)
#md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y.astype(np.float32), cat_flds=cat_vars, bs=128)
#md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y.astype(np.float32), cat_flds=cat_vars, bs=128,test_df=df_test)
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y.astype(np.float32), cat_flds=cat_vars, bs=256,test_df=df_test)
cat_vars
['store',
'item',
'Year',
'Month',
'Week',
'Day',
'Dayofweek',
'Dayofyear',
'Is_month_end',
'Is_month_start',
'Is_quarter_end',
'Is_quarter_start',
'Is_year_end',
'Is_year_start']
Some categorical variables have a lot more levels than others. Store, in particular, has over a thousand!
cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]
cat_sz
[('store', 11),
('item', 51),
('Year', 6),
('Month', 13),
('Week', 54),
('Day', 32),
('Dayofweek', 8),
('Dayofyear', 367),
('Is_month_end', 3),
('Is_month_start', 3),
('Is_quarter_end', 3),
('Is_quarter_start', 3),
('Is_year_end', 3),
('Is_year_start', 3)]
We use the cardinality of each variable (that is, its number of unique values) to decide how large to make its embeddings. Each level will be associated with a vector with length defined as below.
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]
emb_szs
[(11, 6),
(51, 26),
(6, 3),
(13, 7),
(54, 27),
(32, 16),
(8, 4),
(367, 50),
(3, 2),
(3, 2),
(3, 2),
(3, 2),
(3, 2),
(3, 2)]
y
array([13, 11, 14, 13, 10, 12, 10, 9, 12, 9, 9, 7, 10, 12, 5, 7, 16, 7, 18, 15, ..., 67, 67, 72, 72,
52, 86, 53, 54, 51, 63, 75, 70, 76, 51, 41, 63, 59, 74, 62, 82])
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
0.04, 1, [2000,800], [0.001,0.01], y_range=y_range)
lr = 1e-3
/home/adrianrdzv/Documentos/fastai/fastai/courses/dl1/fastai/column_data.py:101: UserWarning: nn.init.kaiming_normal is now deprecated in favor of nn.init.kaiming_normal_.
for o in self.lins: kaiming_normal(o.weight.data)
/home/adrianrdzv/Documentos/fastai/fastai/courses/dl1/fastai/column_data.py:103: UserWarning: nn.init.kaiming_normal is now deprecated in favor of nn.init.kaiming_normal_.
kaiming_normal(self.outp.weight.data)
m.lr_find()
HBox(children=(IntProgress(value=0, description='Epoch', max=1), HTML(value='')))
61%|██████ | 2053/3387 [02:50<01:50, 12.08it/s, loss=435]
m.sched.plot()
With the fit function and passing the metric SMAPE we could see how our model is working in the validation set, and seeing this we could then generate our prediction for the test range
m.fit(lr, 3, metrics=[SMAPE])
HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))
1%| | 31/3387 [00:02<04:33, 12.28it/s, loss=3.51e+03]
epoch trn_loss val_loss SMAPE
0 58.461612 62.077307 0.12835
1 57.60579 61.226561 0.128534
2 55.84363 60.653265 0.127744
[60.65326502791695, 0.12774361452849015]
#Calar mañana con este en kaggle
#m.fit(lr, 2, metrics=[exp_rmspe,SMAPE], cycle_len=3)
HBox(children=(IntProgress(value=0, description='Epoch', max=6), HTML(value='')))
epoch trn_loss val_loss exp_rmspe SMAPE
0 58.428253 60.687587 nan 0.125747
1 55.664219 58.901838 nan 0.123808
2 53.808453 56.613611 nan 0.122914
3 56.226725 56.70169 nan 0.122738
4 53.166681 56.525577 nan 0.122837
5 53.49262 56.550598 nan 0.122943
[56.550598181683085, nan, 0.1229434128092683]
#m.load_cycle()
#m.load
We could calculate the metric again to validate our model before being use to predict in the test dataset
x,y=m.predict_with_targs()
SMAPE(x,y)
0.12774362
Using the predict function with True parameter we predict in the test dataset, so we can save our results
pred_test=m.predict(True)
len(pred_test)
pred_test
array([[12.4353 ],
[15.99211],
[15.5996 ],
[16.17989],
[17.26732],
[18.59948],
[19.75835],
[13.07455],
[15.55011],
[15.71422],
[16.48752],
[17.36207],
[18.38673],
[19.4954 ],
[13.12858],
[16.1863 ],
[15.82948],
[16.28306],
[17.44085],
[18.91442],
...,
[64.77862],
[75.88193],
[76.84612],
[80.93729],
[85.14046],
[90.38946],
[96.48158],
[65.06002],
[76.08002],
[76.95865],
[80.9269 ],
[84.26495],
[89.71136],
[97.39009],
[64.57227],
[76.35717],
[75.9097 ],
[82.60927],
[86.11958],
[91.05175]], dtype=float32)
joined_test['sales'] = pred_test
joined_test
id | store | item | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed | sales | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||||||
2018-01-01 | 0 | 1 | 1 | 2018 | 1 | 1 | 1 | 0 | 1 | False | True | False | True | False | True | 1514764800 | 12.435302 |
2018-01-02 | 1 | 1 | 1 | 2018 | 1 | 1 | 2 | 1 | 2 | False | False | False | False | False | False | 1514851200 | 15.992112 |
2018-01-03 | 2 | 1 | 1 | 2018 | 1 | 1 | 3 | 2 | 3 | False | False | False | False | False | False | 1514937600 | 15.599597 |
2018-01-04 | 3 | 1 | 1 | 2018 | 1 | 1 | 4 | 3 | 4 | False | False | False | False | False | False | 1515024000 | 16.179892 |
2018-01-05 | 4 | 1 | 1 | 2018 | 1 | 1 | 5 | 4 | 5 | False | False | False | False | False | False | 1515110400 | 17.267321 |
2018-01-06 | 5 | 1 | 1 | 2018 | 1 | 1 | 6 | 5 | 6 | False | False | False | False | False | False | 1515196800 | 18.599476 |
2018-01-07 | 6 | 1 | 1 | 2018 | 1 | 1 | 7 | 6 | 7 | False | False | False | False | False | False | 1515283200 | 19.758350 |
2018-01-08 | 7 | 1 | 1 | 2018 | 1 | 2 | 8 | 0 | 8 | False | False | False | False | False | False | 1515369600 | 13.074553 |
2018-01-09 | 8 | 1 | 1 | 2018 | 1 | 2 | 9 | 1 | 9 | False | False | False | False | False | False | 1515456000 | 15.550114 |
2018-01-10 | 9 | 1 | 1 | 2018 | 1 | 2 | 10 | 2 | 10 | False | False | False | False | False | False | 1515542400 | 15.714221 |
2018-01-11 | 10 | 1 | 1 | 2018 | 1 | 2 | 11 | 3 | 11 | False | False | False | False | False | False | 1515628800 | 16.487524 |
2018-01-12 | 11 | 1 | 1 | 2018 | 1 | 2 | 12 | 4 | 12 | False | False | False | False | False | False | 1515715200 | 17.362072 |
2018-01-13 | 12 | 1 | 1 | 2018 | 1 | 2 | 13 | 5 | 13 | False | False | False | False | False | False | 1515801600 | 18.386734 |
2018-01-14 | 13 | 1 | 1 | 2018 | 1 | 2 | 14 | 6 | 14 | False | False | False | False | False | False | 1515888000 | 19.495398 |
2018-01-15 | 14 | 1 | 1 | 2018 | 1 | 3 | 15 | 0 | 15 | False | False | False | False | False | False | 1515974400 | 13.128580 |
2018-01-16 | 15 | 1 | 1 | 2018 | 1 | 3 | 16 | 1 | 16 | False | False | False | False | False | False | 1516060800 | 16.186298 |
2018-01-17 | 16 | 1 | 1 | 2018 | 1 | 3 | 17 | 2 | 17 | False | False | False | False | False | False | 1516147200 | 15.829476 |
2018-01-18 | 17 | 1 | 1 | 2018 | 1 | 3 | 18 | 3 | 18 | False | False | False | False | False | False | 1516233600 | 16.283060 |
2018-01-19 | 18 | 1 | 1 | 2018 | 1 | 3 | 19 | 4 | 19 | False | False | False | False | False | False | 1516320000 | 17.440847 |
2018-01-20 | 19 | 1 | 1 | 2018 | 1 | 3 | 20 | 5 | 20 | False | False | False | False | False | False | 1516406400 | 18.914425 |
2018-01-21 | 20 | 1 | 1 | 2018 | 1 | 3 | 21 | 6 | 21 | False | False | False | False | False | False | 1516492800 | 19.885489 |
2018-01-22 | 21 | 1 | 1 | 2018 | 1 | 4 | 22 | 0 | 22 | False | False | False | False | False | False | 1516579200 | 13.303565 |
2018-01-23 | 22 | 1 | 1 | 2018 | 1 | 4 | 23 | 1 | 23 | False | False | False | False | False | False | 1516665600 | 16.030056 |
2018-01-24 | 23 | 1 | 1 | 2018 | 1 | 4 | 24 | 2 | 24 | False | False | False | False | False | False | 1516752000 | 16.096254 |
2018-01-25 | 24 | 1 | 1 | 2018 | 1 | 4 | 25 | 3 | 25 | False | False | False | False | False | False | 1516838400 | 16.395996 |
2018-01-26 | 25 | 1 | 1 | 2018 | 1 | 4 | 26 | 4 | 26 | False | False | False | False | False | False | 1516924800 | 17.516056 |
2018-01-27 | 26 | 1 | 1 | 2018 | 1 | 4 | 27 | 5 | 27 | False | False | False | False | False | False | 1517011200 | 18.376472 |
2018-01-28 | 27 | 1 | 1 | 2018 | 1 | 4 | 28 | 6 | 28 | False | False | False | False | False | False | 1517097600 | 19.416647 |
2018-01-29 | 28 | 1 | 1 | 2018 | 1 | 5 | 29 | 0 | 29 | False | False | False | False | False | False | 1517184000 | 13.390797 |
2018-01-30 | 29 | 1 | 1 | 2018 | 1 | 5 | 30 | 1 | 30 | False | False | False | False | False | False | 1517270400 | 15.469314 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2018-03-02 | 44970 | 10 | 50 | 2018 | 3 | 9 | 2 | 4 | 61 | False | False | False | False | False | False | 1519948800 | 84.911552 |
2018-03-03 | 44971 | 10 | 50 | 2018 | 3 | 9 | 3 | 5 | 62 | False | False | False | False | False | False | 1520035200 | 90.113197 |
2018-03-04 | 44972 | 10 | 50 | 2018 | 3 | 9 | 4 | 6 | 63 | False | False | False | False | False | False | 1520121600 | 94.746307 |
2018-03-05 | 44973 | 10 | 50 | 2018 | 3 | 10 | 5 | 0 | 64 | False | False | False | False | False | False | 1520208000 | 65.446701 |
2018-03-06 | 44974 | 10 | 50 | 2018 | 3 | 10 | 6 | 1 | 65 | False | False | False | False | False | False | 1520294400 | 76.037209 |
2018-03-07 | 44975 | 10 | 50 | 2018 | 3 | 10 | 7 | 2 | 66 | False | False | False | False | False | False | 1520380800 | 75.994850 |
2018-03-08 | 44976 | 10 | 50 | 2018 | 3 | 10 | 8 | 3 | 67 | False | False | False | False | False | False | 1520467200 | 80.185951 |
2018-03-09 | 44977 | 10 | 50 | 2018 | 3 | 10 | 9 | 4 | 68 | False | False | False | False | False | False | 1520553600 | 84.384804 |
2018-03-10 | 44978 | 10 | 50 | 2018 | 3 | 10 | 10 | 5 | 69 | False | False | False | False | False | False | 1520640000 | 91.049042 |
2018-03-11 | 44979 | 10 | 50 | 2018 | 3 | 10 | 11 | 6 | 70 | False | False | False | False | False | False | 1520726400 | 96.319534 |
2018-03-12 | 44980 | 10 | 50 | 2018 | 3 | 11 | 12 | 0 | 71 | False | False | False | False | False | False | 1520812800 | 64.778618 |
2018-03-13 | 44981 | 10 | 50 | 2018 | 3 | 11 | 13 | 1 | 72 | False | False | False | False | False | False | 1520899200 | 75.881927 |
2018-03-14 | 44982 | 10 | 50 | 2018 | 3 | 11 | 14 | 2 | 73 | False | False | False | False | False | False | 1520985600 | 76.846123 |
2018-03-15 | 44983 | 10 | 50 | 2018 | 3 | 11 | 15 | 3 | 74 | False | False | False | False | False | False | 1521072000 | 80.937286 |
2018-03-16 | 44984 | 10 | 50 | 2018 | 3 | 11 | 16 | 4 | 75 | False | False | False | False | False | False | 1521158400 | 85.140465 |
2018-03-17 | 44985 | 10 | 50 | 2018 | 3 | 11 | 17 | 5 | 76 | False | False | False | False | False | False | 1521244800 | 90.389458 |
2018-03-18 | 44986 | 10 | 50 | 2018 | 3 | 11 | 18 | 6 | 77 | False | False | False | False | False | False | 1521331200 | 96.481575 |
2018-03-19 | 44987 | 10 | 50 | 2018 | 3 | 12 | 19 | 0 | 78 | False | False | False | False | False | False | 1521417600 | 65.060020 |
2018-03-20 | 44988 | 10 | 50 | 2018 | 3 | 12 | 20 | 1 | 79 | False | False | False | False | False | False | 1521504000 | 76.080025 |
2018-03-21 | 44989 | 10 | 50 | 2018 | 3 | 12 | 21 | 2 | 80 | False | False | False | False | False | False | 1521590400 | 76.958649 |
2018-03-22 | 44990 | 10 | 50 | 2018 | 3 | 12 | 22 | 3 | 81 | False | False | False | False | False | False | 1521676800 | 80.926903 |
2018-03-23 | 44991 | 10 | 50 | 2018 | 3 | 12 | 23 | 4 | 82 | False | False | False | False | False | False | 1521763200 | 84.264954 |
2018-03-24 | 44992 | 10 | 50 | 2018 | 3 | 12 | 24 | 5 | 83 | False | False | False | False | False | False | 1521849600 | 89.711365 |
2018-03-25 | 44993 | 10 | 50 | 2018 | 3 | 12 | 25 | 6 | 84 | False | False | False | False | False | False | 1521936000 | 97.390091 |
2018-03-26 | 44994 | 10 | 50 | 2018 | 3 | 13 | 26 | 0 | 85 | False | False | False | False | False | False | 1522022400 | 64.572266 |
2018-03-27 | 44995 | 10 | 50 | 2018 | 3 | 13 | 27 | 1 | 86 | False | False | False | False | False | False | 1522108800 | 76.357170 |
2018-03-28 | 44996 | 10 | 50 | 2018 | 3 | 13 | 28 | 2 | 87 | False | False | False | False | False | False | 1522195200 | 75.909698 |
2018-03-29 | 44997 | 10 | 50 | 2018 | 3 | 13 | 29 | 3 | 88 | False | False | False | False | False | False | 1522281600 | 82.609268 |
2018-03-30 | 44998 | 10 | 50 | 2018 | 3 | 13 | 30 | 4 | 89 | False | False | False | False | False | False | 1522368000 | 86.119576 |
2018-03-31 | 44999 | 10 | 50 | 2018 | 3 | 13 | 31 | 5 | 90 | True | False | True | False | False | False | 1522454400 | 91.051750 |
45000 rows × 17 columns
We can save our results in the format specified, with the next chunk of code
csv_fn=f'{PATH}tmp/submission_4agosto.csv'
joined_test[['id','sales']].to_csv(csv_fn, index=False)