Machine learning
Maxim, 2017-09-23 23:54:51

Kaggle Titanic contest: test accuracy 0.87, leaderboard 0.75. Why?

I'm training a model in R with caret. I split the data into training and test sets (80/20) and trained with 10-fold cross-validation repeated 5 times, getting a cross-validation accuracy of about 0.85 with a standard deviation of about 0.02. I then applied the model to the held-out test set and got an accuracy of 0.8701, 95% CI (0.8114, 0.9158). How can both cross-validation and the test set tell me that in the worst case I'll get accuracy around 0.80, yet when I submit the solution I score 0.75? The same thing happens with three models: Random Forest, CatBoost and XGBoost. Does this mean the training sample and the competition test sample come from different populations? If so, what's the point of the competition?
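For reference, the evaluation setup described above can be sketched as follows. This is a hypothetical Python/scikit-learn equivalent of the R/caret workflow (the dataset here is a synthetic stand-in, not the real Titanic data, and all parameter values are illustrative assumptions):

```python
# Hypothetical sketch of the described setup: 80/20 split,
# 10-fold CV repeated 5 times on the training part, then one
# evaluation on the held-out 20%. Synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (RepeatedStratifiedKFold,
                                     cross_val_score, train_test_split)

# Titanic-sized synthetic dataset (891 rows), purely illustrative
X, y = make_classification(n_samples=891, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)

# 5 repeats x 10 folds = 50 accuracy scores
scores = cross_val_score(model, X_tr, y_tr, cv=cv, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# single evaluation on the held-out test split
model.fit(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
print(f"Held-out test accuracy: {test_acc:.3f}")
```

Note that both numbers estimate performance on data drawn the same way as the training set; neither guarantees the same accuracy on a leaderboard test set you never touched.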


1 answer
⚡ Kotobotov ⚡, 2017-09-24
@angrySCV

Heh, dude, if the training and test sets were the same, you could just submit the labels from the training set as your answers and get a 100% score without any modeling.
The point of the competition is to learn to build a model that works in the GENERAL case, on any data examples.
P.S.
You fit your model to your own test data. If the results on other test data are much worse, that only means you overfitted to your test split, and the model works worse in the general case.
