How to choose the best value of n_neighbors?

How to choose the best value of n_neighbors?

Maybe have a read here:

Hi ThomasLoock,
the problem with your link is that it is not free to read.
I know it's easy to hack around the paywall, but that's not really legal.

Hi Masszo,

To compute the best n_neighbors value in KNN you can use several methods (as discussed here).
Since you are in module 1, a very intuitive method is to fit models over a range of n_neighbors values, compute their scores or errors, and choose the value that gives you the best score or the smallest error, as shown in this video.
Later in the MOOC you'll see how to do hyperparameter tuning. The methods shown in that section allow you to compute the best parameters easily.
As an example:

import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# create a knn model
knn = KNeighborsClassifier()

# create a dictionary of all values we want to test for n_neighbors
param_grid = {'n_neighbors': np.arange(1, 25)}

# use grid search to test all values for n_neighbors
knn_gscv = GridSearchCV(knn, param_grid, cv=5)

# fit model to data
knn_gscv.fit(data, target)

# check top performing n_neighbors value
knn_gscv.best_params_
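
After fitting you can also inspect the best cross-validated score, and since GridSearchCV refits the best model on the full data by default (refit=True), the fitted search object can be used directly for prediction:

# best mean cross-validated score found during the search
knn_gscv.best_score_

# the search object was refitted with the best parameters,
# so it can be used directly to predict
knn_gscv.predict(data)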

This is the correct answer. The tuning of the parameters is explained later in the MOOC, in module 3.

Thanks for answering and sharing links. I was able to go through all of them. It seems they are all based on the principle of choosing a range of values of k, then running and testing (either with train_test_split or cross-validation) to get the best value of k within this range. But do we have any guarantee that the best value of k is in the chosen range?

When you draw the graph of score vs n_neighbors or error vs n_neighbors, the shape of the curve should help tell you whether you are in the correct range.
You can find here other examples of manual and automatic tuning of the n_neighbors parameter.
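
For example, here is a minimal sketch of that manual approach, assuming data and target are defined as in the code above: compute the cross-validated score for each candidate value and plot the curve. If the best score sits at the edge of the range, widen the range and try again.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

n_neighbors_values = np.arange(1, 25)
mean_scores = []
for k in n_neighbors_values:
    # 5-fold cross-validated accuracy for this value of n_neighbors
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), data, target, cv=5
    )
    mean_scores.append(scores.mean())

plt.plot(n_neighbors_values, mean_scores)
plt.xlabel("n_neighbors")
plt.ylabel("mean cross-validated accuracy")
plt.show()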

Hi echidne,
I found the link just using Google; it is freely available to the public and there is nothing illegal about it.
I don't know what makes you think there is any hacking involved here? Please explain.
The only restriction towardsdatascience.com or medium.com put on their content is that you
can only read a limited number of stories (3 or 4) per month.
So either you have already reached this limit, or maybe something else is blocking your access.
Nevertheless, the story on "How to find the optimal value of K in KNN" is well presented and explained.

Hi ThomasLoock,
Access to towardsdatascience.com or medium.com is limited, so they are not free. If you want full access you have to pay.
And I never suggested there was hacking involved here. It's well known in IT circles that articles on these sites can be accessed freely with a very simple hack, but I never said you were using it. I just wanted to say that I could not reveal the hack and thereby allow everyone who might be interested, even those who have already reached the monthly limit, to read that article.
Stop trying to find hidden meanings in my sentences.

The only way to know the “best value of k” is to have access to an unlimited amount of labeled test data and measure the performance on it. But even the best value of k will not always give you a good enough model.

The precise value of k in k-NN does not matter much in practice. Picking a good value of k is often enough. If you want to improve the predictive quality of a k-NN model, you’d better focus your efforts on:

  • engineering more informative features based on your expert knowledge of the problem at hand. Examples of feature engineering strategies are presented later in the MOOC, for instance in the module on linear models;
  • choosing a distance metric that better reflects the important pairwise similarities between examples. This potentially includes manually rescaling the features to put more emphasis on the ones you think are the most predictive (see the sketch after this list);
  • including new labeled examples in your training set if you think it will help improve the cross-validation accuracy. This can be achieved by studying the learning curves, as explained in the next module.
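
As an illustration of the second point, here is a minimal sketch (again assuming data and target variables as above) that rescales the features and switches the distance metric; the metric parameter of KNeighborsClassifier accepts, for instance, "manhattan" instead of the default "minkowski" with p=2 (euclidean):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# rescaling puts all features on a comparable scale, so that no single
# feature dominates the distance computation
model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="manhattan"),
)
scores = cross_val_score(model, data, target, cv=5)
print(scores.mean())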

As is often the case in machine learning, the quality of the training set (both in terms of features and samples) trumps the choice of the model and its parameters.


Thanks! This becomes clearer in the next sections.