Train, Test and Validation sets? What are they?
Today, I’m going to talk about something we hear quite a lot about in machine learning. Train, test and validation splits are very important concepts in the model-building process. If, like me, you like food, then you are in luck, because I’m going to explain them using cakes.
The perfect cake
A bakery owner wants to get the “Itoro” star award. To get this award, they need to bake one type of cake which they will serve to members of the public. The public need to eat the cake and vote that it is simply the best thing they have ever tasted and then I will issue the exclusive “Itoro” star award.
The baker recruits 100 members of the public to help with testing. She asks 20 of them to wait in the main restaurant and takes the other 80 to the back. We will call the 20 in the main restaurant the “Test” group. At the back of the restaurant, she splits the 80 further: 60 are asked to explore recipes and come up with what they think is the best one, while the remaining 20 will have the fun of tasting the cakes. We will call the 60 who come up with recipes the “Train” group, and the remaining 20 the “Validation” group.
The training group gets exploring and comes up with a recipe; the baker bakes the cake and serves it to the validation group. The validation group eat the cake and give it a score out of 100. The baker is happy with the score, so she serves the same cake to the test group in the restaurant, who also score it. At this point, she thinks she’s ready for the general public to test her cake.
However, later that day, she wonders how much she can trust the judgement of her training group and her validation group. After all, some of them had lost their sense of taste after having COVID. What if everyone without a sense of taste had ended up in her validation group?
5-fold cross-validation
So the next day, she gets the 100 back in the restaurant. She keeps the same 80/20 split between the front and the back of the restaurant. At the back, though, she does something different: she mixes up the 60/20 split so that the groups are different from the previous day, and asks them to get to work. Lo and behold, they produce a different recipe. She bakes it and gives it to the new validation group, and the score is different from the previous day. What a revelation! She repeats this process once a day for the rest of the week and realizes she has 5 different recipes and 5 very different scores. Now she can pick the best cake out of the 5 and serve it to the test group.
The test group should give her a good indication of what the general public think of her cake. Because they weren’t in the back of the restaurant, they are not influenced by the conversations, smells and activity of the baking process.
The reality
In the modelling process, 5-fold cross-validation works slightly differently. The 80 people would be split into 5 groups of 16 and remain in those groups throughout the week. Let’s refer to those groups as A, B, C, D and E.
On the first day of her experiment, the baker would use group A as the validation group, and everyone else would be in the training group. On the second day, she would choose group B as the validation group, group C on the third day, and so on. This means that everyone gets exactly one opportunity to be in the validation group.
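If you like to see things in code, here is a minimal sketch of that rotation using scikit-learn’s KFold. The 80 rows below simply stand in for the 80 people at the back of the restaurant; they are purely illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

# Pretend these 80 rows are the 80 people at the back of the restaurant.
X = np.arange(80).reshape(-1, 1)

# shuffle=True mixes people up once, then the 5 folds stay fixed for the "week".
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for day, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # On each "day", 64 people train and 16 validate, and every person
    # ends up in the validation fold exactly once across the 5 days.
    print(f"Day {day}: train size = {len(train_idx)}, validation size = {len(val_idx)}")
```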
Python libraries like sklearn make the first split easy: sklearn.model_selection.train_test_split separates your train set from your test set. Cross-validation, meanwhile, is defined during the modelling process, when you set up and score the classifier.
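Here is a rough end-to-end sketch of that flow. The built-in breast cancer dataset and the scaled logistic regression model are just placeholders for whatever data and classifier you are actually working with.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: a built-in binary classification dataset.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% as the "test" group waiting in the main restaurant.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Cross-validation happens on the remaining 80% only,
# so the test set stays untouched until the very end.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_train, y_train, cv=5)
print("5-fold validation scores:", scores)

# Only once you're happy with the cross-validated scores do you
# refit on the whole train set and score the held-out test set.
model.fit(X_train, y_train)
print("Test score:", model.score(X_test, y_test))
```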
Cross-validation is also a good way to compare candidate models and find the best hyperparameters, because every candidate is scored on folds it was not trained on.
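As a final sketch, this is roughly how a hyperparameter search with cross-validation looks using scikit-learn’s GridSearchCV. The pipeline, the SVC model and the parameter grid below are illustrative choices, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Each candidate setting in the grid is scored with 5-fold
# cross-validation on the train set; the test set plays no part.
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated score:", search.best_score_)
print("Held-out test score:", search.score(X_test, y_test))
```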
The next time data scientists are talking about cross-validation, you can join in the conversation. If you have enjoyed this post, read more of my posts here.