How to split data into training and testing in Python

In statistics and machine learning, data is commonly split into two subsets: training data and testing data. The training set includes the known outputs, and the model learns from it so that it can generalize to other data later on. The test set is held back so that we can check the model's predictions on data it did not see during training.
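To make the idea concrete, here is a minimal hand-rolled sketch of a hold-out split using only NumPy. The variable names, the toy data, and the seed are illustrative assumptions, and this is not meant to reflect how any particular library implements the split.

```python
import numpy as np

# Hypothetical toy data: 12 samples with 2 features each, plus one label per sample.
features = np.arange(1, 25).reshape(12, 2)
labels = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Shuffle the row indices so the split does not depend on the original order.
rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for repeatability
shuffled = rng.permutation(len(features))

# Hold out the last 4 shuffled rows for testing; the rest are used for training.
n_test = 4
train_idx, test_idx = shuffled[:-n_test], shuffled[-n_test:]

# Index features and labels with the same indices so each row keeps its label.
features_train, labels_train = features[train_idx], labels[train_idx]
features_test, labels_test = features[test_idx], labels[test_idx]

print(features_train.shape, features_test.shape)  # (8, 2) (4, 2)
```

The key point is that the same shuffled indices are applied to both arrays, so every sample stays paired with its label and no sample ends up in both subsets.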

Let's see how this is done with scikit-learn:

First, install scikit-learn (sklearn). This example uses version 0.23.1. We will use the train_test_split() function from sklearn.model_selection.

python -m pip install -U "scikit-learn==0.23.1"

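import numpy as np
from sklearn.model_selection import train_test_split

# 12 samples with 2 features each, and one label per sample
x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Hold out 4 of the 12 samples for testing; the other 8 are used for training
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=4)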

Given two sequences, such as x and y here, train_test_split() performs the split and returns four sequences (in this case NumPy arrays) in this order:

x_train: The training part of the first sequence (x)
x_test: The test part of the first sequence (x)
y_train: The training part of the second sequence (y)
y_test: The test part of the second sequence (y)
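A quick way to confirm what comes back is to print the shapes of the four arrays. The sketch below repeats the example data so it runs on its own. The final check relies on a property of this toy data only (row i of x is [2i+1, 2i+2]) to recover each test row's original position.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=4)

# 8 samples go to training and 4 to testing, for both x and y.
print(x_train.shape, x_test.shape)  # (8, 2) (4, 2)
print(y_train.shape, y_test.shape)  # (8,) (4,)

# Rows of x and entries of y stay paired after the split.
# For this toy data only, the original row index follows from the first column.
original_idx = (x_test[:, 0] - 1) // 2
print(np.array_equal(y[original_idx], y_test))  # True
```

Which rows land in the test set changes from run to run, because the data is shuffled before it is split, but the shapes and the pairing do not.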

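We can specify the desired size of the training and test sets with the train_size and test_size arguments. By default, 25 percent of the samples are assigned to the test set. An integer test_size is read as an absolute number of samples, so test_size=4 in this example puts 4 of the 12 samples in the test set and leaves 8 for training.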
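The following sketch compares the default split with a fractional test_size on the same example data. It also uses random_state, a train_test_split() argument the text above does not cover, to show how a split can be made repeatable. The value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# No test_size given: the default of 25 percent applies, i.e. 3 of 12 samples.
_, x_test_default, _, _ = train_test_split(x, y)
print(len(x_test_default))  # 3

# A float test_size is read as a fraction of the data set.
_, x_test_half, _, _ = train_test_split(x, y, test_size=0.5)
print(len(x_test_half))  # 6

# Fixing random_state makes the shuffle, and so the split, repeatable.
first = train_test_split(x, y, test_size=4, random_state=42)
second = train_test_split(x, y, test_size=4, random_state=42)
print(all(np.array_equal(a, b) for a, b in zip(first, second)))  # True
```

Without random_state, each call shuffles differently, which is worth keeping in mind when comparing results across runs.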
