In this chapter, we will discuss k-NN (k-Nearest Neighbor) learning in Scikit-Learn.
k-NN (k-Nearest Neighbor), one of the simplest machine learning algorithms, is non-parametric and lazy in nature. Non-parametric means that there is no assumption about the underlying data distribution, i.e. the model structure is determined from the dataset. Lazy or instance-based learning means that no explicit model is built during a training phase; the whole training data is simply stored and then used in the testing (prediction) phase.
The k-NN algorithm consists of the following two steps −
Step 1
In this step, it computes and stores the k nearest neighbors for each sample in the training set.
Step 2
In this step, for an unlabeled sample, it retrieves the k nearest neighbors from the dataset. Then, among these k nearest neighbors, it predicts the class through voting (the class with the majority of votes wins).
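To make these two steps concrete, here is a minimal plain-NumPy sketch. The toy points, labels and the helper name knn_predict are made up for illustration; scikit-learn's own implementation is used in the rest of this chapter −

import numpy as np
from collections import Counter

# Toy training data − 2-D points with class labels 0 or 1 (illustrative values)
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 5.0], [7.0, 7.0]])
y_train = np.array([0, 0, 0, 1, 1])

def knn_predict(query, k = 3):
    # Step 1 − compute the Euclidean distance from the query to every training sample
    distances = np.linalg.norm(X_train - query, axis = 1)
    # Step 2 − take the k nearest neighbors and predict the class by majority vote
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

print(knn_predict(np.array([2.0, 2.5])))   # 0 − the query lies near the first cluster
print(knn_predict(np.array([6.5, 6.0])))   # 1 − the query lies near the second cluster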
The sklearn.neighbors module, which implements the k-nearest neighbors algorithm, provides the functionality for unsupervised as well as supervised neighbors-based learning methods.
The unsupervised nearest neighbors implement different algorithms (BallTree, KDTree or Brute Force) to find the nearest neighbor(s) for each sample. This unsupervised version is basically only step 1, discussed above, and it is the foundation of many algorithms (KNN and K-means being the famous ones) which require a neighbor search. In simple words, it is an unsupervised learner for implementing neighbor searches.
On the other hand, the supervised neighbors-based learning is used for classification as well as regression.
Unsupervised KNN Learning
As discussed, there exist many algorithms like KNN and K-Means that require nearest neighbor searches. That is why Scikit-learn decided to implement the neighbor search part as its own “learner”. The reason behind making neighbor search a separate learner is that computing all pairwise distances to find a nearest neighbor is obviously not very efficient. Let’s see the module used by Sklearn to implement unsupervised nearest neighbor learning along with an example.
Scikit-learn module
sklearn.neighbors.NearestNeighbors is the module used to implement unsupervised nearest neighbor learning. It uses specific nearest neighbor algorithms named BallTree, KDTree or Brute Force. In other words, it acts as a uniform interface to these three algorithms.
Parameters
The following table lists the parameters used by the NearestNeighbors module; a short sketch using some of them follows the table −
Sr.No | Parameter & Description |
---|---|
1 | n_neighbors − int, optional. The number of neighbors to get. The default value is 5. |
2 | radius − float, optional. It limits the distance within which neighbors are returned. The default value is 1.0. |
3 | algorithm − {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional. The algorithm (BallTree, KDTree or brute force) you want to use to compute the nearest neighbors. If you provide 'auto', it will attempt to decide the most appropriate algorithm based on the values passed to the fit method. |
4 | leaf_size − int, optional. It can affect the speed of construction and query as well as the memory required to store the tree. It is passed to BallTree or KDTree. Although the optimal value depends on the nature of the problem, its default value is 30. |
5 | metric − string or callable. The metric to use for distance computation between points. We can pass it as a string or as a callable function. In the case of a callable function, the metric is called on each pair of rows and the resulting value is recorded; this is less efficient than passing the metric name as a string. We can choose a metric from scikit-learn or scipy.spatial.distance. Valid values are as follows − Scikit-learn: ['cosine', 'manhattan', 'euclidean', 'l1', 'l2', 'cityblock']; scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', 'dice', 'hamming', 'jaccard', 'correlation', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'sokalmichener', 'sokalsneath', 'seuclidean', 'sqeuclidean', 'yule']. The default metric is 'minkowski'. |
6 | p − integer, optional. The parameter for the Minkowski metric. The default value is 2, which is equivalent to using the Euclidean distance (l2). |
7 | metric_params − dict, optional. Additional keyword arguments for the metric function. The default value is None. |
8 | n_jobs − int or None, optional. The number of parallel jobs to run for the neighbor search. The default value is None. |
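As a quick illustration of how several of these parameters fit together, the sketch below constructs a non-default estimator; the specific values are arbitrary and chosen only for demonstration −

from sklearn.neighbors import NearestNeighbors

# KD-tree search, a larger leaf size, Manhattan distance (Minkowski with p = 1),
# and all available CPU cores used for neighbor queries
nn = NearestNeighbors(
    n_neighbors = 5,
    radius = 1.5,
    algorithm = 'kd_tree',
    leaf_size = 40,
    metric = 'minkowski',
    p = 1,
    n_jobs = -1
)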
Implementation Example
The example below will find the nearest neighbors within a set of data points by using the sklearn.neighbors.NearestNeighbors module.
First, we need to import the required module and packages −
from sklearn.neighbors import NearestNeighbors
import numpy as np
Now, after importing the packages, define the set of data within which we want to find the nearest neighbors −
Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4],[4, 5]])
Next, apply the unsupervised learning algorithm, as follows −
nrst_neigh = NearestNeighbors(n_neighbors = 3, algorithm = 'ball_tree')
Next, fit the model with input data set.
nrst_neigh.fit(Input_data)
Now, find the k-neighbors of the data set. It will return the indices of and distances to the neighbors of each point.
distances, indices = nrst_neigh.kneighbors(Input_data)
indices
Output
array([[0, 1, 3],
       [1, 2, 0],
       [2, 1, 0],
       [3, 4, 0],
       [4, 5, 3],
       [5, 6, 4],
       [6, 5, 4]], dtype=int64)

distances
Output
array([[0.        , 1.41421356, 2.23606798],
       [0.        , 1.41421356, 1.41421356],
       [0.        , 1.41421356, 2.82842712],
       [0.        , 1.41421356, 2.23606798],
       [0.        , 1.41421356, 1.41421356],
       [0.        , 1.41421356, 1.41421356],
       [0.        , 1.41421356, 2.82842712]])
The above output shows that the nearest neighbor of each point is the point itself, i.e. at distance zero. This is because the query set matches the training set.
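By contrast, if we query points that are not part of Input_data, the nearest distances are no longer zero. A small sketch, with two made-up query points −

# Query points that do not belong to the training set
query = np.array([[0, 0], [2.5, 3.5]])
q_distances, q_indices = nrst_neigh.kneighbors(query)
print(q_indices)     # indices of the 3 nearest training points for each query
print(q_distances)   # the smallest distance for each query is now greater than zero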
Example
We can also show a connection between neighboring points by producing a sparse graph as follows −
nrst_neigh.kneighbors_graph(Input_data).toarray()
Output
array([[1., 1., 0., 1., 0., 0., 0.],
       [1., 1., 1., 0., 0., 0., 0.],
       [1., 1., 1., 0., 0., 0., 0.],
       [1., 0., 0., 1., 1., 0., 0.],
       [0., 0., 0., 1., 1., 1., 0.],
       [0., 0., 0., 0., 1., 1., 1.],
       [0., 0., 0., 0., 1., 1., 1.]])
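The graph above stores 0/1 connectivity flags. If the actual distances are wanted instead, kneighbors_graph also accepts mode='distance', as in the sketch below −

# Nonzero entries now hold the distances to the selected neighbors
dist_graph = nrst_neigh.kneighbors_graph(Input_data, mode = 'distance')
print(dist_graph.toarray())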
Once we fit the unsupervised NearestNeighbors model, the data will be stored in a data structure based on the value set for the argument ‘algorithm’. After that, we can use this unsupervised learner’s kneighbors() method in a model which requires neighbor searches.
Complete working/executable program
from sklearn.neighbors import NearestNeighbors
import numpy as np

Input_data = np.array([[-1, 1], [-2, 2], [-3, 3], [1, 2], [2, 3], [3, 4], [4, 5]])
nrst_neigh = NearestNeighbors(n_neighbors = 3, algorithm = 'ball_tree')
nrst_neigh.fit(Input_data)
distances, indices = nrst_neigh.kneighbors(Input_data)
print(indices)
print(distances)
print(nrst_neigh.kneighbors_graph(Input_data).toarray())
Supervised KNN Learning
The supervised neighbors-based learning is used for the following −
- Classification, for the data with discrete labels
- Regression, for the data with continuous labels.
Nearest Neighbor Classifier
We can understand neighbors-based classification with the help of the following two characteristics −
- It is computed from a simple majority vote of the nearest neighbors of each point.
- It simply stores instances of the training data, that’s why it is a type of non-generalizing learning.
Scikit-learn modules
Following are the two different types of nearest neighbor classifiers used by scikit-learn; a brief usage sketch of both follows the table −
S.No. | Classifiers & Description |
---|---|
1. | KNeighborsClassifier − The K in the name of this classifier represents the k nearest neighbors, where k is an integer value specified by the user. Hence, as the name suggests, this classifier implements learning based on the k nearest neighbors. The choice of the value of k is dependent on the data. |
2. | RadiusNeighborsClassifier − The Radius in the name of this classifier represents the nearest neighbors within a specified radius r, where r is a floating-point value specified by the user. Hence, as the name suggests, this classifier implements learning based on the number of neighbors within a fixed radius r of each training point. |
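The classifiers themselves are not demonstrated later in this chapter, so here is a brief sketch of both on the Iris data. The train/test split is random, so the printed accuracies will vary from run to run, and the n_neighbors and radius values are only illustrative −

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size = 0.20)

# KNeighborsClassifier − majority vote among the 5 nearest training points
knn_clf = KNeighborsClassifier(n_neighbors = 5)
knn_clf.fit(X_train, y_train)
print("KNeighborsClassifier accuracy:", knn_clf.score(X_test, y_test))

# RadiusNeighborsClassifier − vote among all training points within radius r;
# the radius is kept large enough that every test point has at least one
# neighbor, because an isolated query point would otherwise raise an error
rad_clf = RadiusNeighborsClassifier(radius = 2.0)
rad_clf.fit(X_train, y_train)
print("RadiusNeighborsClassifier accuracy:", rad_clf.score(X_test, y_test))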
Nearest Neighbor Regressor
It is used in cases where the data labels are continuous in nature. The assigned data label is computed on the basis of the mean of the labels of its nearest neighbors.
Following are the two different types of nearest neighbor regressors used by scikit-learn −
KNeighborsRegressor
The K in the name of this regressor represents the k nearest neighbors, where k is an integer value specified by the user. Hence, as the name suggests, this regressor implements learning based on the k nearest neighbors. The choice of the value of k is dependent on data. Let’s understand it more with the help of an implementation example.
Implementation Example
In this example, we will be implementing KNN regression on the Iris Flower data set by using scikit-learn’s KNeighborsRegressor.
First, import the iris dataset as follows −
from sklearn.datasets import load_iris
iris = load_iris()
Now, we need to split the data into training and testing data. We will be using the Sklearn train_test_split function to split the data into the ratio of 80% (training data) and 20% (testing data) −
X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
Next, we will be doing data scaling with the help of Sklearn preprocessing module as follows −
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Next, import the KNeighborsRegressor class from Sklearn and provide the value of neighbors as follows.
Example
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors = 8)
knnr.fit(X_train, y_train)
Output
KNeighborsRegressor(algorithm = 'auto', leaf_size = 30, metric = 'minkowski',
                    metric_params = None, n_jobs = None, n_neighbors = 8, p = 2,
                    weights = 'uniform')
Example
Now, we can find the MSE (Mean Squared Error) on the test data as follows −
print ("The MSE is:",format(np.power(y-knnr.predict(X),4).mean()))
Output
The MSE is: ... (a small value close to zero; the exact number varies from run to run because train_test_split performs a random split)
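Equivalently, the same quantity can be computed with scikit-learn’s built-in metric helper −

from sklearn.metrics import mean_squared_error
print("The MSE is:", mean_squared_error(y_test, knnr.predict(X_test)))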
Example
Now, build a small 3-nearest-neighbor regressor on a toy data set and use it to predict the value at a query point. The three nearest neighbors of 2.5 are the points 1, 2 and 3 with targets 0, 1 and 1, so the prediction is their mean, 2/3 −
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors = 3)
knnr.fit(X, y)
print(knnr.predict([[2.5]]))
Output
[0.66666667]
Complete working/executable program
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, :4]
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors = 8)
knnr.fit(X_train, y_train)
print("The MSE is:", np.power(y_test - knnr.predict(X_test), 2).mean())

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
knnr = KNeighborsRegressor(n_neighbors = 3)
knnr.fit(X, y)
print(knnr.predict([[2.5]]))
RadiusNeighborsRegressor
The Radius in the name of this regressor represents the nearest neighbors within a specified radius r, where r is a floating-point value specified by the user. Hence, as the name suggests, this regressor implements learning based on the number of neighbors within a fixed radius r of each training point. Let’s understand it more with the help of an implementation example −
Implementation Example
In this example, we will be implementing KNN regression on the Iris Flower data set by using scikit-learn’s RadiusNeighborsRegressor −
First, import the iris dataset as follows −
from sklearn.datasets import load_iris
iris = load_iris()
Now, we need to split the data into training and testing data. We will be using the Sklearn train_test_split function to split the data into the ratio of 80% (training data) and 20% (testing data) −
X = iris.data[:, :4]
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
Next, we will be doing data scaling with the help of Sklearn preprocessing module as follows −
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Next, import the RadiusNeighborsRegressor class from Sklearn and provide the value of radius as follows −
import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius = 1)
knnr_r.fit(X_train, y_train)
Example
Now, we can find the MSE (Mean Squared Error) on the test data as follows −
print ("The MSE is:",format(np.power(y-knnr_r.predict(X),4).mean()))
Output
The MSE is: ... (the exact value varies from run to run because train_test_split performs a random split)
Example
Now, build a radius-based regressor on a toy data set and use it to predict the value at a query point. The training points within radius 1 of 2.5 are 2 and 3, both with target 1, so the prediction is 1.0 −
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius = 1)
knnr_r.fit(X, y)
print(knnr_r.predict([[2.5]]))
Output
[1.]
Complete working/executable program
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, :4]
y = iris.target

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

import numpy as np
from sklearn.neighbors import RadiusNeighborsRegressor
knnr_r = RadiusNeighborsRegressor(radius = 1)
knnr_r.fit(X_train, y_train)
print("The MSE is:", np.power(y_test - knnr_r.predict(X_test), 2).mean())

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
knnr_r = RadiusNeighborsRegressor(radius = 1)
knnr_r.fit(X, y)
print(knnr_r.predict([[2.5]]))