KNN Classifier
  • by Team Handson
  • July 14, 2022

EUCLIDEAN DISTANCE–

  • Consider points in two dimensions. Each point in two dimensions can be represented by a vector of dimension two, and the Euclidean distance between two such points is the length of the straight line joining them.
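Concretely, the Euclidean distance between two points is the square root of the sum of squared coordinate differences. Below is a minimal sketch (the function name and the sample points are only illustrative):

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Two points in two dimensions, each represented as a vector of dimension two.
a = (1.0, 2.0)
b = (4.0, 6.0)
print(euclidean_distance(a, b))  # 5.0
```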

OTHER DISTANCE METRICS

There are other distance metrics such as the Mahalanobis distance and the Bhattacharyya distance. These are used for advanced statistical pattern recognition tasks.
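As an illustration (a hedged sketch; the sample data below are purely made up), the Mahalanobis distance can be computed with SciPy by supplying the inverse covariance matrix of the data:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Illustrative data: rows are samples, columns are features.
X = np.array([[2.0, 2.0], [2.0, 5.0], [6.0, 8.0], [7.0, 3.0], [4.0, 7.0]])
VI = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance matrix

# Mahalanobis distance between two points, accounting for feature correlation.
print(mahalanobis(X[0], X[2], VI))
```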

 

  • K-NN Algorithm:
  1. All the training samples/points are available beforehand.
  2. When a new test sample arrives, calculate its distance from all the training points.
  3. Choose the K nearest neighbors based on the calculated distances. Usually, K is a small positive odd integer supplied by the user.
  4. Assign the class label of the test sample by majority vote, i.e., if most of those K nearest neighbors belong to one particular class c, then assign class c to the test sample (a code sketch of these steps follows below).
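Taken together, the four steps can be sketched in a few lines of Python (a minimal illustration using NumPy; the function name `knn_predict` and the toy data are ours, not from any particular library):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=5):
    """Predict the class of x_test by majority vote among its k nearest
    training points (Euclidean distance)."""
    # Step 2: distance from the test sample to every training point.
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # Step 3: indices of the k nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote over the neighbors' class labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Step 1: all training samples are available beforehand (toy data).
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],   # class 0
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])  # class 1
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([2.0, 2.0]), k=3))  # -> 0
```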
  • Characteristics of K-NN Classifier:

It does not build a model from the training patterns in advance. Rather, when a test instance arrives, it runs the algorithm to predict the class of that particular instance. Hence, there is no learning in advance.

- Hence, the k-NN classifier is also known as a Lazy Learner.

K-NN Classifier: Decision Boundary–

  • The boundary consists of the points that are equidistant from the nearest points of Class-1 and Class-2.
  • Construct lines between closest pairs of points in different classes.
  • Draw their perpendicular bisectors, ending each bisector at the intersections.
  • Note that locally the boundary is linear.
  • Hence the decision boundary is a piecewise-linear curve.

The same procedure is followed to find the decision boundary for multi-class classification.
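One way to see this piecewise-linear structure is to color a dense grid of points by the class the classifier assigns to each grid point (a hedged sketch assuming scikit-learn and matplotlib are available; the toy data are purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Toy two-class data in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 1.0, (30, 2)),
               rng.normal([6, 6], 1.0, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# Predict the class of every point on a dense grid; the color change
# traces out the (piecewise-linear) decision boundary.
xx, yy = np.meshgrid(np.linspace(-1, 9, 300), np.linspace(-1, 9, 300))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.title("1-NN decision boundary")
plt.show()
```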

K-NN: Choosing the value of K–

Increasing K simplifies (smooths) the decision boundary, because majority voting places less emphasis on individual points.

  • However, increasing K also increases the computational cost.
  • Hence, choosing K is a trade-off between how simple a decision boundary we want and how much computational cost we can afford.
  • Usually K = 5, 7, 9, or 11 works well for most practical problems.
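In practice, a common way to make this choice (a sketch assuming scikit-learn is available; the dataset and the candidate values of K are illustrative) is to evaluate several odd values of K on held-out samples and keep the one with the lowest misclassification error:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Try a few odd values of K and keep the one with the fewest
# misclassifications on the held-out samples.
for k in (1, 3, 5, 7, 9, 11):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error = 1.0 - clf.score(X_test, y_test)
    print(f"K={k:2d}  misclassification error={error:.3f}")
```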

Merits:

  • K-NN Classifier often works very well for practical problems.
  • It is very easy to implement, as there is no complex learning algorithm involved.
  • Robust to noisy data, since larger values of K average out occasional mislabeled points.

Demerits:

  • Choosing the value of K may not be straightforward. Often the same training samples are used with different values of K, and the most suitable K is chosen as the one giving the minimum misclassification error on test samples.
  • Doesn’t work well for categorical attributes.
  • Can encounter problems with sparse training data (i.e., data points located far away from each other).
  • Can encounter problems in very high-dimensional spaces:
    • Most points are near the corners.
    • Most points are at the edge of the space.

- This problem is known as the Curse of Dimensionality and affects many other machine learning algorithms.
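A quick way to see the effect (a minimal sketch with synthetic data; nothing here is specific to any K-NN library) is to observe that, for uniformly random points, the nearest and the farthest neighbor of a query point become almost equally distant as the dimension grows, so "nearest" carries less and less information:

```python
import numpy as np

rng = np.random.default_rng(0)

# For random points in the unit hypercube, compare the nearest and
# farthest distance from one query point as the dimension grows.
for dim in (2, 10, 100, 1000):
    points = rng.random((500, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    print(f"dim={dim:4d}  nearest/farthest distance ratio = {d.min() / d.max():.3f}")
```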