Cross-Validation Summary and Comparison¶
Overview¶
Method | stratified | shuffle | n_splits | test_size | train_size |
---|---|---|---|---|---|
KFold | - | optional (shuffle=True) | Y | - | - |
StratifiedKFold | Y | optional (shuffle=True) | Y | - | - |
ShuffleSplit | - | Y | Y | Y | Y |
StratifiedShuffleSplit | Y | Y | Y | Y | Y |
train_test_split | optional (stratify=None by default) | Y (shuffle=True by default) | - | Y | Y |
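Most columns in the table correspond to constructor or function parameters, so one rough way to cross-check it is to inspect the signatures. The sketch below assumes a reasonably recent scikit-learn; `COLUMNS` and `accepted_params` are hypothetical names used only for illustration. Note that a splitter can shuffle or stratify without exposing a parameter for it (ShuffleSplit always shuffles, StratifiedKFold always stratifies), so a missing parameter does not mean a "-" in the table.

```python
import inspect

from sklearn.model_selection import (
    KFold,
    ShuffleSplit,
    StratifiedKFold,
    StratifiedShuffleSplit,
    train_test_split,
)

# Parameter names that correspond to the table columns (illustrative choice).
COLUMNS = ("stratify", "shuffle", "n_splits", "test_size", "train_size")


def accepted_params(obj):
    """Return which of COLUMNS appear in the signature of a splitter class or function."""
    names = inspect.signature(obj).parameters
    return [col for col in COLUMNS if col in names]


for obj in (KFold, StratifiedKFold, ShuffleSplit, StratifiedShuffleSplit, train_test_split):
    print(obj.__name__, accepted_params(obj))
```

Reading the printed lists next to the table shows which behaviours are configurable and which are built into the splitter.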
Create Toy Dataset¶
Indices [0, 1, 2] belong to class "A".
# toy X, y (class 'A': 30%)
X = list(range(10))
y = list('A' * 3 + 'B' * 7)
print("X:", X)
print("y:", y)
X: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y: ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B']
# helper function to print the train/test indices of every fold:
def print_index_for_folds(cv, X, y=None):
    # split() accepts y=None, so a single loop covers plain and stratified splitters
    for train_ix, test_ix in cv.split(X, y):
        print("Train_ix:", train_ix, "Test_ix:", test_ix)
from sklearn.model_selection import KFold, StratifiedKFold, ShuffleSplit, StratifiedShuffleSplit, train_test_split
KFold¶
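# KFold without shuffling: the test folds are consecutive blocks of indices.
# random_state has no effect when shuffle=False (recent scikit-learn versions
# raise an error if it is set), so it is omitted here.
kf = KFold(n_splits=3)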
print_index_for_folds(kf, X)
Train_ix: [4 5 6 7 8 9] Test_ix: [0 1 2 3]
Train_ix: [0 1 2 3 7 8 9] Test_ix: [4 5 6]
Train_ix: [0 1 2 3 4 5 6] Test_ix: [7 8 9]
# KFold(shuffle = True)
kf = KFold(n_splits = 3, shuffle = True, random_state = 1)
print_index_for_folds(kf, X)
Train_ix: [0 1 3 5 7 8] Test_ix: [2 4 6 9]
Train_ix: [2 4 5 6 7 8 9] Test_ix: [0 1 3]
Train_ix: [0 1 2 3 4 6 9] Test_ix: [5 7 8]
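One property visible in both outputs above is that the KFold test folds are disjoint and together cover every index exactly once, with or without shuffling. A minimal sketch to check this on the same 10-sample toy X; `count_test_appearances` is a hypothetical helper name.

```python
from collections import Counter

from sklearn.model_selection import KFold

X = list(range(10))


def count_test_appearances(shuffle):
    """Count how often each index appears in a test fold (hypothetical helper)."""
    # random_state is only passed when shuffling; it has no effect otherwise
    kf = KFold(n_splits=3, shuffle=shuffle, random_state=1 if shuffle else None)
    return Counter(int(i) for _, test_ix in kf.split(X) for i in test_ix)


for shuffle in (False, True):
    counts = count_test_appearances(shuffle)
    covered_once = set(counts) == set(X) and set(counts.values()) == {1}
    print("shuffle =", shuffle, "| every index tested exactly once:", covered_once)
```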
StratifiedKFold¶
# notice that the 30% weight of class 'A' in y (i.e., indices 0, 1, 2)
# is approximately preserved in both the test and train folds through stratification
# (random_state is omitted because it has no effect when shuffle=False)
s_kf = StratifiedKFold(n_splits=3)
print_index_for_folds(s_kf, X, y)
Train_ix: [1 2 6 7 8 9] Test_ix: [0 3 4 5]
Train_ix: [0 2 3 4 5 8 9] Test_ix: [1 6 7]
Train_ix: [0 1 3 4 5 6 7] Test_ix: [2 8 9]
s_kf = StratifiedKFold(n_splits=3, shuffle=True, random_state= 1)
print_index_for_folds(s_kf, X, y)
Train_ix: [1 2 3 6 7 8] Test_ix: [0 4 5 9]
Train_ix: [0 1 4 5 6 8 9] Test_ix: [2 3 7]
Train_ix: [0 2 3 4 5 7 9] Test_ix: [1 6 8]
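To make the effect of stratification concrete, the sketch below counts how many class 'A' samples land in each test fold for KFold and for StratifiedKFold on the toy data. The helper name `class_a_per_test_fold` is hypothetical. Since the three 'A' samples sit at indices 0, 1, 2, unshuffled KFold should put all of them in the first test fold, while StratifiedKFold should spread them one per fold.

```python
from sklearn.model_selection import KFold, StratifiedKFold

X = list(range(10))
y = list("A" * 3 + "B" * 7)


def class_a_per_test_fold(cv):
    """Count class 'A' samples in each test fold produced by cv (hypothetical helper)."""
    return [sum(y[i] == "A" for i in test_ix) for _, test_ix in cv.split(X, y)]


print("KFold:          ", class_a_per_test_fold(KFold(n_splits=3)))
print("StratifiedKFold:", class_a_per_test_fold(StratifiedKFold(n_splits=3)))
```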
ShuffleSplit¶
ss = ShuffleSplit(n_splits=3, test_size=0.3, random_state=1)
print_index_for_folds(ss, X)
Train_ix: [4 0 3 1 7 8 5] Test_ix: [2 9 6]
Train_ix: [0 8 4 2 1 6 7] Test_ix: [9 5 3]
Train_ix: [9 0 6 1 7 4 2] Test_ix: [8 3 5]
ss = ShuffleSplit(n_splits=3, test_size=0.3, train_size = 0.6, random_state=1)
print_index_for_folds(ss, X)
Train_ix: [4 0 3 1 7 8] Test_ix: [2 9 6]
Train_ix: [0 8 4 2 1 6] Test_ix: [9 5 3]
Train_ix: [9 0 6 1 7 4] Test_ix: [8 3 5]
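The two outputs above differ only in the size of the training set: when train_size is omitted it defaults to the complement of test_size, and when both are given and sum to less than 1, some samples are used in neither set. A small sketch of that bookkeeping, assuming the same toy X; `split_sizes` is a hypothetical helper.

```python
from sklearn.model_selection import ShuffleSplit

X = list(range(10))


def split_sizes(**size_kwargs):
    """Return (n_train, n_test, n_unused) for the first split (hypothetical helper)."""
    ss = ShuffleSplit(n_splits=1, random_state=1, **size_kwargs)
    train_ix, test_ix = next(ss.split(X))
    return len(train_ix), len(test_ix), len(X) - len(train_ix) - len(test_ix)


# train_size omitted: the training set is the complement of the test set
print(split_sizes(test_size=0.3))
# both sizes given: one of the ten samples is left out of this split
print(split_sizes(test_size=0.3, train_size=0.6))
```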
StratifiedShuffleSplit¶
StratifiedShuffleSplit combines the stratification of StratifiedKFold with the random resampling of ShuffleSplit.
# notice that the distribution of indices 0, 1, 2 preserves the 30% class weight in y
s_ss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, train_size=0.6, random_state=1)
print_index_for_folds(s_ss, X, y)
Train_ix: [8 5 9 2 6 0] Test_ix: [3 1 4]
Train_ix: [8 3 1 0 4 6] Test_ix: [9 2 7]
Train_ix: [5 6 7 3 0 1] Test_ix: [2 8 9]
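Both halves of that combination can be checked in one pass. The sketch below is an illustration on the toy data, not part of the original notebook: it draws five splits, so with three test samples per split at least one of the ten indices must be drawn into a test set more than once (the ShuffleSplit behaviour), while each test set should still hold exactly one class 'A' sample (the stratified behaviour).

```python
from collections import Counter

from sklearn.model_selection import StratifiedShuffleSplit

X = list(range(10))
y = list("A" * 3 + "B" * 7)

s_ss = StratifiedShuffleSplit(n_splits=5, test_size=0.3, train_size=0.6, random_state=1)

a_in_test = []    # number of class 'A' samples in each test set
hits = Counter()  # how often each index is drawn into a test set
for train_ix, test_ix in s_ss.split(X, y):
    a_in_test.append(sum(y[i] == "A" for i in test_ix))
    hits.update(int(i) for i in test_ix)

print("class 'A' per test set:", a_in_test)
print("indices tested more than once:", sorted(i for i, n in hits.items() if n > 1))
```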
train_test_split¶
A convenient one-line wrapper for a single train/test split. It (1) shuffles by default, (2) does not stratify by default, and (3) returns one split instead of iterating over several.
X_train, X_test = train_test_split(X, random_state = 1)
print("Train:", X_train, "Test:", X_test)
Train: [4, 0, 3, 1, 7, 8, 5] Test: [2, 9, 6]
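The split printed here matches the first ShuffleSplit fold shown earlier for the same seed, which suggests the two are closely related. The sketch below compares them directly. It relies on what appears to be an implementation detail of the scikit-learn versions I am aware of (unstratified train_test_split delegating to ShuffleSplit), so treat the equality as an observation rather than a documented guarantee. Integer sizes are used to keep the counts explicit.

```python
from sklearn.model_selection import ShuffleSplit, train_test_split

X = list(range(10))

# One-line split: 3 test samples, the remaining 7 for training.
X_train, X_test = train_test_split(X, test_size=3, random_state=1)

# Same seed and sizes, spelled out as the first (and only) split of a ShuffleSplit.
ss = ShuffleSplit(n_splits=1, test_size=3, random_state=1)
train_ix, test_ix = next(ss.split(X))
manual_train = [X[i] for i in train_ix]
manual_test = [X[i] for i in test_ix]

print("train_test_split:", X_train, X_test)
print("ShuffleSplit:    ", manual_train, manual_test)
```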
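# without stratify, the split ignores the class labels; with this seed the
# test set ends up with no class 'A' sample (see the output below)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 9)
print("X_train:", X_train, "X_test:", X_test)
print("y_train:", y_train, "y_test:", y_test)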
X_train: [2, 1, 9, 3, 0, 6, 5] X_test: [8, 4, 7]
y_train: ['A', 'A', 'B', 'B', 'A', 'B', 'B'] y_test: ['B', 'B', 'B']
# stratify = y keeps the class proportions roughly equal in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 9)
print("X_train:", X_train, "X_test:", X_test)
print("y_train:", y_train, "y_test:", y_test)
X_train: [6, 3, 7, 1, 4, 5, 0] X_test: [9, 8, 2]
y_train: ['B', 'B', 'B', 'A', 'B', 'B', 'A'] y_test: ['B', 'B', 'A']
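The two runs above use a single seed, so it is fair to ask how often the unstratified split misses class 'A' entirely. The sketch below repeats both variants over a range of seeds and counts the test sets without any 'A' sample. The helper `count_tests_without_a` and the choice of 200 seeds are assumptions made for illustration.

```python
from sklearn.model_selection import train_test_split

X = list(range(10))
y = list("A" * 3 + "B" * 7)


def count_tests_without_a(stratify, n_seeds=200):
    """Count seeds whose test set has no class 'A' sample (hypothetical helper)."""
    missing = 0
    for seed in range(n_seeds):
        _, _, _, y_test = train_test_split(X, y, stratify=stratify, random_state=seed)
        missing += "A" not in y_test
    return missing


print("plain:     ", count_tests_without_a(stratify=None))
print("stratified:", count_tests_without_a(stratify=y))
```

On a dataset this small and imbalanced the plain count should be well above zero, while the stratified count should stay at zero because every stratified test set receives one 'A' sample.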