How to split data into training and testing in Python
In statistics and machine learning, data is commonly split into two subsets: training data and testing data. The training set contains known outputs, and the model learns from it so that it can generalize to other data later on. The test set is held back so that we can check the model's predictions on data it has not seen during training.
Let's see how this is done:
First we need version 0.23.1 of scikit-learn, or sklearn. We will be using the model_selection package, and the function train_test_split().
python -m pip install -U "scikit-learn==0.23.1"
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=4)
Given two sequences, like x and y here, train_test_split() performs the split and returns four sequences (in this case NumPy arrays) in this order:
x_train: The training part of the first sequence (x)
x_test: The test part of the first sequence (x)
y_train: The training part of the second sequence (y)
y_test: The test part of the second sequence (y)
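To make the four return values concrete, here is a minimal sketch that repeats the split and inspects the shapes of the results. The random_state=0 argument is an addition for this illustration only (it is not part of the original call) and makes the shuffle repeatable; the index arithmetic in the last step relies on how x was built with np.arange.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# random_state=0 is an assumption added here so the split is repeatable
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=4, random_state=0
)

# 12 samples in total: 4 go to the test set, the other 8 to the training set
print(x_train.shape, x_test.shape)  # (8, 2) (4, 2)
print(y_train.shape, y_test.shape)  # (8,) (4,)

# Row i of x is [2*i + 1, 2*i + 2], so the original row index can be
# recovered from the first column; x and y rows stay paired after the split
train_idx = (x_train[:, 0] - 1) // 2
print(np.array_equal(y[train_idx], y_train))  # True
```

The last check shows that the samples are shuffled together: each row of x_train still lines up with its label in y_train.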
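We can specify the desired size of the training and test sets with the train_size and test_size parameters. By default, 25 percent of the samples are assigned to the test set. In this example we set test_size=4, so 4 of the 12 samples go to the test set and the remaining 8 go to the training set.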
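As a rough illustration of the sizing options, the sketch below runs the same split three ways: with the default, with test_size given as a fraction, and with test_size given as an absolute number of samples. The random_state=0 argument and the helper variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Default: 25 percent of the 12 samples are held out for testing
_, x_test_default, _, _ = train_test_split(x, y, random_state=0)

# A float between 0 and 1 is read as a fraction of the dataset
_, x_test_frac, _, _ = train_test_split(x, y, test_size=0.5, random_state=0)

# An integer is read as an absolute number of test samples
_, x_test_abs, _, _ = train_test_split(x, y, test_size=4, random_state=0)

print(len(x_test_default), len(x_test_frac), len(x_test_abs))  # 3 6 4
```

Passing the same random_state on every call keeps the shuffling repeatable, which helps when comparing results between runs.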