sklearn StratifiedKFold

We’ve already learned about sklearn KFold. KFold splits a dataset into several folds. If the elements in the dataset are associated with a label/class, this raises a problem: the ratio of labels/classes in each fold may differ from the ratio in the whole dataset. If your dataset is sorted by label and you do not shuffle it before splitting, you are in the worst situation. The training set will lack some classes (which all end up in the test set), so your model cannot fit well to instances of those classes, and the evaluation will look worse because you test the model against exactly those missing classes. It would be better to keep the ratio of every class in the test set and the training set the same as in the whole dataset. Unfortunately, KFold cannot do that, because it does not consider the labels when splitting.

kfold.split(data)

 
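To make the problem concrete, here is a small sketch (the toy arrays data and labels below are my own made-up example): a label-sorted dataset split with plain KFold and no shuffling.

import numpy as np
from sklearn.model_selection import KFold

# 9 instances of class 0 followed by 3 instances of class 1 -- sorted by label
data = np.arange(12).reshape(12, 1)
labels = np.array([0] * 9 + [1] * 3)

kfold = KFold(n_splits=3)  # no shuffling
for train_index, test_index in kfold.split(data):
    # The first two test folds contain only class 0, and for the last fold
    # the training set contains only class 0 while its test set is mostly class 1.
    print("test labels:", labels[test_index])

None of the printed test folds comes close to the 3:1 class ratio of the whole dataset.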

StratifiedKFold is a variation of KFold. The only difference between StratifiedKFold and KFold is that the folds StratifiedKFold makes are stratified folds. A stratified fold maintains the same ratio of every label as in the whole dataset, so the resulting test set and training set are both stratified. I’m not clear on the details of how StratifiedKFold is implemented, but we can imagine an implementation like this: the whole set is divided into strata according to class, each stratum consisting of instances belonging to the same class, and the ratio of each class is calculated. The folds are then formed by taking units from every stratum according to its ratio. This process needs the label/class, so you must provide the labels in the StratifiedKFold split method:

skfold.split(X, y)

 
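As a rough sketch (again with the same made-up toy data as above), splitting with StratifiedKFold keeps the 3:1 class ratio in every fold:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(12).reshape(12, 1)
y = np.array([0] * 9 + [1] * 3)   # overall class ratio is 3:1

skfold = StratifiedKFold(n_splits=3)
for train_index, test_index in skfold.split(X, y):
    # Every test fold now holds 3 instances of class 0 and 1 instance of class 1,
    # matching the 3:1 ratio of the whole dataset.
    print("test labels:", y[test_index])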
