In this article, we are going to see how to split a dataset into training and test sets using the R programming language.
The sample() method in base R draws a random sample of a specified size from the elements of its input, either with or without replacement. The input is treated as a vector, so to split a matrix or a data frame, the usual approach is to sample its row indices and then use those indices to select the rows.
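The method has the following signature in R: sample(x, size, replace = FALSE, prob = NULL). Here x is the vector to sample from, size is the number of items to draw, replace indicates whether sampling is done with replacement, and prob is an optional vector of probability weights for the elements of x.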
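As a quick illustration of the two sampling modes, here is a minimal sketch. The vector x and the sizes used are arbitrary example values, not taken from the original article.

```r
set.seed(42)  # fix the random seed so the draws are reproducible

x <- 1:10  # example vector to sample from

# Draw 4 distinct elements (sampling without replacement is the default)
without_repl <- sample(x, size = 4)

# Draw 15 elements with replacement; values may repeat,
# so size is allowed to exceed length(x)
with_repl <- sample(x, size = 15, replace = TRUE)

print(without_repl)
print(with_repl)
```

Without replacement, size cannot be larger than the number of elements in x; R raises an error in that case.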
The following code snippet illustrates the procedure where first the dataset matrix is created.
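The original listing is not reproduced here, so the following is a minimal sketch of one way to do this split. The matrix contents, the variable names (data_mat, train_idx, train_set, test_set) and the 70/30 ratio are assumptions chosen for illustration.

```r
set.seed(123)  # make the random split reproducible

# Hypothetical example data: a 10 x 3 matrix holding the values 1..30
data_mat <- matrix(1:30, nrow = 10, ncol = 3)

# Sample 70% of the row indices (not the matrix itself) for the training set
n <- nrow(data_mat)
train_idx <- sample(seq_len(n), size = round(0.7 * n))

# Rows with the sampled indices form the training set;
# negative indexing keeps all remaining rows as the test set
train_set <- data_mat[train_idx, , drop = FALSE]
test_set  <- data_mat[-train_idx, , drop = FALSE]

print(dim(train_set))  # 7 rows, 3 columns
print(dim(test_set))   # 3 rows, 3 columns
```

Sampling row indices rather than calling sample() on the matrix directly keeps each row intact, and drop = FALSE ensures the result stays a matrix even if only one row is selected.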
The dplyr package in R is used for data manipulation. It can be installed using the following command, after which it is loaded into the R session with library(dplyr):
install.packages("dplyr")
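A data frame is first created using the data.frame method in R. The sample_frac() method of the dplyr package is then applied to it using the pipe operator (%>%). This method selects a random sample containing a specified fraction of the rows of the input dataset, which makes it suitable for creating the training dataset. Its basic syntax is sample_frac(tbl, size = 1, replace = FALSE), where tbl is the data frame, size is the fraction of rows to select (for example, 0.7 for 70%), and replace indicates whether rows are sampled with replacement. In recent versions of dplyr, sample_frac() is superseded by slice_sample(prop = ...), although it still works.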
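A minimal sketch of this step is shown below. The data frame contents and the names data_frame and training_dataset are hypothetical, and the 70% fraction is an assumed example value.

```r
library(dplyr)

set.seed(1)  # make the random selection reproducible

# Hypothetical example data; the id column uniquely identifies each row
data_frame <- data.frame(
  id    = 1:10,
  value = letters[1:10]
)

# Randomly select 70% of the rows as the training dataset
training_dataset <- data_frame %>% sample_frac(0.7)

print(training_dataset)
```

With 10 rows and a fraction of 0.7, the training dataset contains 7 rows, returned in random order.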
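To create the testing dataset, the anti_join() method of this package can be used. It returns the rows of the first data frame that have no match in the data frame passed as the second argument, so the resulting training and testing datasets are disjoint. Because rows are matched on column values, the data should contain a column that uniquely identifies each row (such as an id column); otherwise, duplicates of a sampled row would also be left out of the testing dataset. Its basic syntax is anti_join(x, y, by = NULL), where x is the main dataset, y is the dataset whose rows should be excluded, and by names the columns used for matching.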
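The full split can be sketched as follows. As before, the data and the names data_frame, training_dataset and testing_dataset are hypothetical, and the example assumes an id column that is unique per row.

```r
library(dplyr)

set.seed(1)  # make the random split reproducible

# Hypothetical example data with a unique id column
data_frame <- data.frame(
  id    = 1:10,
  value = letters[1:10]
)

# Training dataset: a random 70% of the rows
training_dataset <- data_frame %>% sample_frac(0.7)

# Testing dataset: every row of data_frame whose id is not in training_dataset
testing_dataset <- anti_join(data_frame, training_dataset, by = "id")

print(testing_dataset)
```

Passing by = "id" makes the matching column explicit; if by is omitted, dplyr matches on all columns the two data frames have in common and prints a message saying which ones it used.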