A common problem in retail markets and stores is predicting how much a given customer will spend over a specific time span in the future. The main application of this is to obtain a clear estimate of the monetary value of each customer in the near future.
Several challenges make this harder than a standard prediction problem:
- It is a time series problem, so all the difficulties of predicting a quantity with non-stationary characteristics are present (e.g. validation strategy, seasonality, and so on).
- There are outliers in the data that depend on the time period used for training the model; for example, around the end of the year we can expect spikes in some users' spending.
- Most customers in the data have visited the store only once, so for the majority of users we have no data on individual behavior or preferences; rather, we only have their collective behavior patterns.
- The heavy-tailed nature of the target makes choosing a good performance metric tricky: metrics like MSE are very sensitive to outliers.
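To make the last point concrete, here is a small sketch (in Python for illustration; the toy numbers are invented) showing how a single heavy-tail observation dominates MSE while more robust metrics such as MAE or RMSLE degrade far more gracefully:

```python
import numpy as np

# Toy spend values: the last customer is a heavy-tail outlier.
y_true = np.array([10.0, 12.0, 9.0, 11.0, 500.0])
y_pred = np.array([11.0, 11.0, 10.0, 10.0, 50.0])

# MSE is dominated by the squared error on the single outlier.
mse = np.mean((y_true - y_pred) ** 2)

# MAE grows only linearly with the outlier's error.
mae = np.mean(np.abs(y_true - y_pred))

# RMSLE compresses large values via log1p before squaring.
rmsle = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))
```

Here the outlier contributes over 99% of the MSE, which is why a log-scale or absolute-error metric is often preferred for spend prediction.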
How to hack this problem
- Dataset: The competition dataset comprises customer-related features such as purchase histories, locations, number of visits, and time spent per visit and per page.
- Validation Scheme: The data was split into multiple time spans so we could faithfully simulate the conditions under which the problem must be solved.
- Feature Engineering: Based on 10 time spans, the main features calculated were:
- maximum, average, and minimum purchase amount of each customer in a given time span,
- maximum, average, and minimum number of visits of each customer in a given time span,
- maximum, average, and minimum time spent by each customer in a given time span,
- number of unique visits per time span for each customer, and
- trend of purchases across the time spans (the slope of a linear model fitted to them).
- Models: Modeling was done in R; the main models were LightGBM (dart and gbtree boosters), GLMNET, and an MLP built with Keras.
- Ensembling: A linear blend of the above-mentioned models.
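The per-time-span aggregates and the trend feature can be sketched as follows. This is a minimal Python illustration (the original pipeline was in R), and the column names and toy values are assumptions, not the competition's actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical per-span customer log (toy data).
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "time_span":   [0, 1, 2, 0, 2],      # index of the time span
    "purchase":    [20.0, 35.0, 50.0, 5.0, 8.0],
    "visits":      [2, 3, 4, 1, 1],
})

# Min / mean / max of purchases and visits per customer across spans.
agg = df.groupby("customer_id").agg(
    purchase_min=("purchase", "min"),
    purchase_mean=("purchase", "mean"),
    purchase_max=("purchase", "max"),
    visits_min=("visits", "min"),
    visits_mean=("visits", "mean"),
    visits_max=("visits", "max"),
)

# Trend: slope of a degree-1 polynomial fit of purchase amount
# against the time-span index, per customer.
agg["purchase_trend"] = df.groupby("customer_id").apply(
    lambda g: np.polyfit(g["time_span"], g["purchase"], 1)[0]
)
```

Each row of `agg` is then one customer's feature vector, ready to feed to the downstream models.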
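A linear blend of model outputs can be fitted on a held-out time span. The sketch below (Python for illustration; the prediction arrays are invented placeholders, not real model outputs) finds least-squares blend weights over the base models' validation predictions:

```python
import numpy as np

# Hypothetical validation-span predictions from three base models.
pred_lgbm = np.array([12.0, 0.0, 30.0, 5.0])
pred_glm  = np.array([10.0, 1.0, 28.0, 6.0])
pred_mlp  = np.array([14.0, 0.5, 33.0, 4.0])
y_val     = np.array([11.0, 0.0, 31.0, 5.0])

# Stack predictions as columns and solve for blend weights.
P = np.column_stack([pred_lgbm, pred_glm, pred_mlp])
w, *_ = np.linalg.lstsq(P, y_val, rcond=None)

# Blended prediction is a weighted sum of the base models.
blended = P @ w
```

Because each single model (weight vector with a 1 in one slot) is a feasible blend, the least-squares blend can never do worse on the fitting span than the best individual model; a convex variant would additionally constrain the weights to be non-negative and sum to one.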
If you have any questions, please contact us at firstname.lastname@example.org.