Step-by-Step Guide to the KNN Algorithm: Understanding the Fundamentals and Applications

Utsav Desai
9 min read · Mar 27, 2023


KNN (K-Nearest Neighbors)

The K-Nearest Neighbors (KNN) algorithm is a popular supervised learning algorithm used for classification and regression tasks.

The KNN algorithm is based on the principle that data points that are close to each other in feature space are likely to belong to the same class or have similar outputs. For classification, the algorithm assigns a new data point to the class that is most common among its k nearest neighbors. The value of k is a hyperparameter that must be chosen before the model is used.

Here are the main steps involved in the KNN algorithm:

  1. Calculate the distance between the new data point and all the training data points based on a distance metric such as Euclidean distance or Manhattan distance.
  2. Select the k nearest neighbors based on the smallest distances.
  3. Assign the new data point to the class that is most common among its k nearest neighbors for classification tasks, or predict the output as the mean of the outputs of its k nearest neighbors for regression tasks.
  4. Evaluate the performance of the model using a suitable metric such as accuracy or mean squared error.
  5. Tune the hyperparameters of the model such as the value of k using a validation set or cross-validation.

The KNN algorithm is simple to understand and implement and can work well for small datasets. However, it can become computationally expensive for large datasets and high-dimensional feature spaces.
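To make steps 1–3 concrete, here is a minimal from-scratch sketch of a KNN classifier in Python. The function and variable names are my own illustration, not from any particular library:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest neighbors."""
    # Step 1: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the labels of those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]
```

For regression, the last line would instead return the mean of `y_train[nearest]`.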

Example Of KNN

Let me walk you through a worked example of the KNN algorithm, with the calculations shown step by step.

Suppose we have a dataset of 10 observations with two features, height and weight, and a binary target variable, gender. We want to predict the gender of a new observation with a height of 175 cm and weight of 75 kg using the KNN algorithm.

Here is the training dataset:

To predict the gender of a new observation with a height of 175 cm and weight of 75 kg using the KNN algorithm, we need to follow the below steps:

1. Calculate the Euclidean distance between the new observation and all the training observations.

Distance between (175, 75) and (170, 70) = sqrt((175-170)^2 + (75-70)^2) = 7.07
Distance between (175, 75) and (172, 65) = sqrt((175-172)^2 + (75-65)^2) = 10.44
Distance between (175, 75) and (174, 68) = sqrt((175-174)^2 + (75-68)^2) = 7.07
Distance between (175, 75) and (175, 71) = sqrt((175-175)^2 + (75-71)^2) = 4.00
Distance between (175, 75) and (177, 73) = sqrt((175-177)^2 + (75-73)^2) = 2.83
Distance between (175, 75) and (180, 75) = sqrt((175-180)^2 + (75-75)^2) = 5.00
Distance between (175, 75) and (182, 80) = sqrt((175-182)^2 + (75-80)^2) = 8.60
Distance between (175, 75) and (160, 50) = sqrt((175-160)^2 + (75-50)^2) = 29.15
Distance between (175, 75) and (165, 55) = sqrt((175-165)^2 + (75-55)^2) = 22.36
Distance between (175, 75) and (168, 60) = sqrt((175-168)^2 + (75-60)^2) = 16.55
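These distances are easy to verify with NumPy; the snippet below is just a sketch that recomputes the ten values above:

```python
import numpy as np

new_point = np.array([175, 75])
train_points = np.array([
    [170, 70], [172, 65], [174, 68], [175, 71], [177, 73],
    [180, 75], [182, 80], [160, 50], [165, 55], [168, 60],
])

# Euclidean distance from the new point to each training point
distances = np.linalg.norm(train_points - new_point, axis=1)
print(np.round(distances, 2))
# [ 7.07 10.44  7.07  4.    2.83  5.    8.6  29.15 22.36 16.55]
```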

2. Select the k nearest neighbors based on the smallest distances. Let’s assume k = 3.

The 3 nearest neighbors of (175, 75) are:
- (177, 73) with a distance of 2.83
- (175, 71) with a distance of 4.00
- (180, 75) with a distance of 5.00

3. Assign the new observation to the class that is most common among its k nearest neighbors. In this case, the majority class is male, so we predict the gender of the new observation as male.

So, according to the KNN algorithm, the gender of a new observation with a height of 175 cm and weight of 75 kg is male.
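The same prediction can be reproduced with scikit-learn. Since the original table of gender labels is not reproduced above, the labels below are illustrative placeholders, chosen only so that the three nearest neighbors are male, as in the worked example:

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[170, 70], [172, 65], [174, 68], [175, 71], [177, 73],
           [180, 75], [182, 80], [160, 50], [165, 55], [168, 60]]
# Illustrative labels (the original gender column is not shown in this post);
# they are chosen so the three nearest neighbors are male, as in the example.
y_train = ['male', 'female', 'female', 'male', 'male',
           'male', 'male', 'female', 'female', 'female']

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(model.predict([[175, 75]]))  # ['male']
```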

Where to use KNN

The KNN (K-Nearest Neighbors) algorithm is a simple yet powerful algorithm that can be used for a variety of machine-learning tasks. Here are some common applications of KNN:

  1. Classification: KNN can be used for classification tasks, where the goal is to predict the class label of a new data point based on its similarity to other data points in the dataset. For example, KNN can be used for image classification, where the algorithm uses the pixel values of an image to determine which class the image belongs to.
  2. Regression: KNN can also be used for regression tasks, where the goal is to predict a continuous value for a new data point based on the values of other data points in the dataset. For example, KNN can be used for predicting housing prices based on the features of the houses in the dataset.
  3. Anomaly detection: KNN can be used for anomaly detection tasks, where the goal is to identify data points that are significantly different from other data points in the dataset. For example, KNN can be used for detecting fraudulent credit card transactions based on the spending patterns of other cardholders.
  4. Recommender systems: KNN can be used for building recommender systems, where the goal is to recommend items to users based on their similarity to other users in the dataset. For example, KNN can be used for recommending movies to users based on their movie ratings and the ratings of other users in the dataset.

Overall, KNN is a versatile algorithm that can be used for a wide range of machine-learning tasks. However, it may not be the best choice for very large datasets or high-dimensional data due to its computational complexity and sensitivity to feature scaling.
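For the regression use case described above, a library implementation is usually the starting point. Here is a minimal sketch with scikit-learn's KNeighborsRegressor on made-up housing data (the feature values and prices are purely illustrative):

```python
from sklearn.neighbors import KNeighborsRegressor

# Illustrative data: [area in m^2, number of bedrooms] -> price
X_train = [[50, 1], [70, 2], [80, 2], [100, 3], [120, 3], [150, 4]]
y_train = [110_000, 150_000, 165_000, 210_000, 240_000, 300_000]

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)

# Prediction is the mean of the 3 nearest neighbors' prices
print(reg.predict([[90, 2]]))  # 175000.0
```

In practice the features would be scaled first, since area in square meters would otherwise dominate the distance over the bedroom count.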

Two Important Concepts Associated With The KNN

Non-parametric and lazy learning are two important concepts associated with the KNN (K-Nearest Neighbors) algorithm. Here’s a brief overview of what these terms mean and how they apply to KNN:

1. Non-parametric: Non-parametric algorithms are those that do not make assumptions about the underlying distribution of the data. They are flexible and can handle both linear and nonlinear data distributions. In the case of the KNN algorithm, it does not assume any specific distribution of the data and instead uses the training data to make predictions.

Example: Suppose you have a dataset of customer purchase history, and you want to predict if a new customer is likely to purchase a product. A non-parametric algorithm like KNN would consider all the existing customer data to make the prediction, without making any assumptions about the distribution of the data.

2. Lazy learning: Lazy learning is an approach to machine learning where the model is not trained in advance, but instead stores the entire training dataset in memory. When a new data point is provided, the model searches the training dataset for similar data points and uses them to make a prediction. KNN is a lazy learning algorithm because it does not have a training phase, but instead stores the entire dataset in memory and uses it to make predictions on new data points.

Example: In the case of the KNN algorithm, imagine you have a dataset of car features, including mileage, horsepower, and weight, and you want to predict the price of a new car. The KNN algorithm would store the entire dataset of car features in memory and when a new car is introduced, it would search the training dataset for similar cars based on the features and use them to predict the price of the new car.
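The lazy-learning behavior is easy to see with scikit-learn: `fit` essentially just stores the training data, and all the distance work happens at prediction time. The car data below is purely illustrative:

```python
from sklearn.neighbors import KNeighborsRegressor

# Illustrative car data: [mileage (thousand km), horsepower, weight (kg)] -> price
X_train = [[30, 110, 1200], [60, 90, 1100], [10, 150, 1400],
           [80, 75, 1000], [20, 130, 1300]]
y_train = [15_000, 9_000, 22_000, 6_000, 18_000]

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X_train, y_train)              # "training" only stores the data
print(reg.predict([[25, 120, 1250]]))  # distances are computed now, at query time
```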

Overall, the non-parametric and lazy learning nature of the KNN algorithm makes it a flexible and powerful tool for many machine learning tasks, especially when dealing with non-linear data distributions. However, it also has some limitations, such as computational complexity and sensitivity to feature scaling, which should be taken into consideration when using the algorithm.

Pros

The KNN (K-Nearest Neighbors) algorithm has several advantages that make it a popular choice for many machine learning tasks. Here are some of the advantages of the KNN algorithm:

  1. Simple and easy to understand: The KNN algorithm is simple and easy to understand. It does not require any assumptions about the underlying data distribution, and it can be easily implemented using any programming language.
  2. Non-parametric: The KNN algorithm is non-parametric, which means that it does not make any assumptions about the underlying data distribution. This makes it suitable for both linear and nonlinear data distributions.
  3. No training phase: The KNN algorithm does not require a training phase. The training data is simply stored in memory, and new data points can be classified or predicted on demand.
  4. Flexible: The KNN algorithm can be used for both classification and regression tasks. It also handles multiclass problems naturally, since the majority vote works for any number of classes.
  5. Robust to noisy data: With a reasonably large k, the KNN algorithm is fairly robust to noisy data, because each prediction is a vote or average over several neighbors, which smooths out the influence of individual outliers.
  6. Minimal feature engineering: The KNN algorithm needs little feature engineering beyond putting the inputs on comparable scales; numerical features can be used directly, and categorical features can be included after suitable encoding.
  7. High accuracy: The KNN algorithm can achieve high accuracy, especially when the value of k is chosen carefully. It is often used as a benchmark algorithm for many machine-learning tasks.

Overall, the KNN algorithm is a powerful and versatile algorithm that can be used for many machine-learning tasks. It is easy to understand and implement, and it can achieve high accuracy with proper parameter tuning.

Cons

The KNN (K-Nearest Neighbors) algorithm has several advantages, but it also has some limitations and drawbacks. Here are some of the disadvantages of the KNN algorithm:

  1. Computationally expensive: The KNN algorithm is computationally expensive, especially when dealing with large datasets. For each new data point, the algorithm has to compute the distance to all other data points in the dataset, which can be time-consuming.
  2. Memory-intensive: The KNN algorithm is memory-intensive because it has to store all the training data in memory. This can be a problem for large datasets, especially when working with high-dimensional data.
  3. Sensitivity to feature scaling: The KNN algorithm is sensitive to feature scaling. If the features have different scales, then the algorithm may give more weight to features with larger scales. This can be a problem when dealing with high-dimensional data or when the features have different units of measurement.
  4. Sensitive to the choice of k (an optimal value must be found): The KNN algorithm is sensitive to the choice of k, the number of nearest neighbors to consider. If k is too small, the algorithm may be too sensitive to noise in the data. If k is too large, it may fail to capture the local structure of the data.
  5. Prediction time can be slow: Since the KNN algorithm has no training phase and requires distance calculations for each new point, the prediction time can be slow for large datasets.
  6. Curse of dimensionality: As the number of features increases, the KNN algorithm becomes less effective because distances between data points become increasingly similar, so the nearest neighbors are barely closer than everything else. This is known as the curse of dimensionality, and it makes it difficult to find meaningful nearest neighbors.

Overall, the KNN algorithm is a simple and effective algorithm that can be used for many machine-learning tasks. However, it has some limitations that should be taken into account when using it for real-world applications.
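The two most actionable of these issues, sensitivity to feature scaling and the choice of k, are commonly addressed by standardizing the features and selecting k with cross-validation. Here is a minimal sketch with scikit-learn; the dataset and parameter grid are just examples:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardize the features so no single feature dominates the distance,
# then search for a good value of k with 5-fold cross-validation.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
grid = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7, 9, 11]}, cv=5)
grid.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
```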


Utsav Desai

Utsav Desai is a technology enthusiast with an interest in DevOps, App Development, and Web Development.