KNN in a nutshell for a lazy reader

Hi all, sorry for picking your brains; I'm being too lazy to read up on k-NN in depth.
My understanding is that k-NN does not compute the high-dimensional distances from a given point to the whole dataset, since that would be very expensive. Does the algorithm instead somehow pre-select a subset of the training data and compute distances only on that subset? Thanks for your input!

The traditional k-NN ("brute force") method does compute the distance to every training point for each prediction. This makes the model very slow at prediction time.
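To make the brute-force behaviour concrete, here is a minimal sketch in plain Python (the function and variable names are mine, not from any library): every prediction scans the entire training set.

```python
import math

def knn_predict(train, labels, query, k=3):
    # Brute force: compute the distance from the query to EVERY training point.
    dists = [(math.dist(query, point), label) for point, label in zip(train, labels)]
    dists.sort(key=lambda pair: pair[0])
    top_k = [label for _, label in dists[:k]]
    # Majority vote among the k nearest neighbours.
    return max(set(top_k), key=top_k.count)

train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train, labels, [0.5, 0.5]))  # → a
```

Note that all the work happens at query time: "training" is just storing the data, which is why k-NN is often called a lazy learner.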

For low-dimensional data (e.g. fewer than ~30 dimensions) it's possible to do better using a KD-tree or a Ball Tree data structure, and both are implemented in scikit-learn. The KD-tree and Ball Tree methods still return exact results (the same results as the brute-force method). However, KD-tree and Ball Tree become slower than brute force as the dimensionality grows.
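For example, scikit-learn exposes both structures directly (this sketch assumes scikit-learn and NumPy are installed; the data here is synthetic). The tree is built once and can then answer many queries without scanning the whole dataset:

```python
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.default_rng(0)
X = rng.random((1000, 3))  # low-dimensional data: 1000 points, 3 features

# Build each index once; queries then traverse the tree instead of
# computing all 1000 distances.
kd = KDTree(X)
ball = BallTree(X)

query = X[:1]  # query with the first point itself
kd_dist, kd_idx = kd.query(query, k=5)
ball_dist, ball_idx = ball.query(query, k=5)

# Both are exact methods, so they agree with each other
# (and with brute force).
print(np.array_equal(kd_idx, ball_idx))  # → True
```

In the estimator API the same choice is made via `KNeighborsClassifier(algorithm="kd_tree")`, `"ball_tree"`, or `"brute"`; the default `"auto"` picks a method based on the data.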

For higher-dimensional data, it's also possible to use approximate nearest-neighbour methods that are faster (but not implemented in scikit-learn). For instance, you can have a look at the following libraries:
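To illustrate the idea behind many of these approximate methods, here is a minimal random-hyperplane locality-sensitive hashing (LSH) sketch in plain Python (all names are mine; real libraries are far more sophisticated). Points are hashed into buckets so that a query only scans one bucket instead of the whole dataset, at the cost of possibly missing the true nearest neighbour:

```python
import random

def hash_point(point, hyperplanes):
    # One bit per hyperplane: which side of the plane the point falls on.
    # Nearby points tend to fall on the same side of most planes.
    return tuple(
        sum(p * h for p, h in zip(point, plane)) >= 0
        for plane in hyperplanes
    )

random.seed(0)
dim, n_planes = 5, 4
hyperplanes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

data = [[random.random() for _ in range(dim)] for _ in range(200)]
buckets = {}
for point in data:
    buckets.setdefault(hash_point(point, hyperplanes), []).append(point)

query = data[0]
# Only the query's bucket is searched, not all 200 points.
candidates = buckets[hash_point(query, hyperplanes)]
print(len(candidates), "candidates out of", len(data))
```

Real libraries combine many such hash tables (or use graphs/trees) to trade a small loss of accuracy for much faster queries in high dimensions.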
