The k-nearest neighbors (kNN) algorithm is a simple tool that can be used for a number of real-world problems in finance, healthcare, recommendation systems, and much more. This blog post covers what kNN is, how it works, and how to implement it in machine learning projects.
What is the k-nearest neighbors classifier?

The k-nearest neighbors classifier (kNN) is a non-parametric supervised machine learning algorithm. It's distance-based: it classifies objects based on the classes of their nearest neighbors. kNN is most often used for classification, but it can be applied to regression problems as well.

What is a supervised machine learning model? In a supervised model, learning is guided by labels in the training set. For a better understanding of how it works, check out our detailed explanation of supervised learning principles.

Non-parametric means that no parameters are fine-tuned during the training step of the model. Although k can be considered an algorithm parameter in some sense, it's actually a hyperparameter: it's selected manually and remains fixed at both training and inference time.

The k-nearest neighbors algorithm is also non-linear. In contrast to simpler models like linear regression, it works well with data in which the relationship between the independent variable (x) and the dependent variable (y) is not a straight line.

What is k in k-nearest neighbors?

The parameter k in kNN is the number of labeled points (neighbors) considered when classifying a new point. Our task is to calculate the distances to the labeled points and identify which classes are most common among the k closest ones.

How does the kNN classification algorithm work?

The main idea behind k-nearest neighbors is as follows. Given a point whose class we do not know, we find the points in our feature space that are closest to it: these are its k nearest neighbors. Since similar things occupy similar places in feature space, the point very likely belongs to the same class as its neighbors. On that basis, we can classify a new point as belonging to one class or another.
The kNN algorithm step-by-step

1. Provide a training set. A training set is a set of labeled data that's used for training a model. You can create the training set by manually labeling data, or use the labeled databases available in public resources, like this one. A good practice is to use around 75% of the available data for training and the remaining 25% as the testing set.

2. Find the k nearest neighbors. Finding the k nearest neighbors of a record means identifying the records that are most similar to it in terms of common features. This step is also known as similarity search or distance calculation.

3. Classify the point. For classification problems, the algorithm assigns a class label based on a majority vote: the most frequent label among the neighbors is applied. For regression problems, the prediction is the average of the values of the k nearest neighbors.

After the above steps are completed, the model provides its results. Note that for classification problems we use discrete values, so the output is descriptive, for example "likes bright colors". For regression problems, continuous values are used, so the output is a floating-point number. We can evaluate the accuracy of the results by checking how closely the model's predictions match the known classes in the testing set.

How to determine the k value in the k-neighbors classifier?

The optimal k value helps you achieve the maximum accuracy of the model, but finding it is always challenging. The simplest solution is to try out several k values and pick the one that gives the best results on the testing set: split the data, evaluate the model for each candidate k, and keep the k with the highest testing accuracy.
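This trial-and-error selection can be sketched in plain Python. The dataset, the `knn_predict` helper, and the candidate k values below are illustrative assumptions, not a fixed API:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy labeled data: ((x, y), class)
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
test = [((0.5, 0.5), "A"), ((5.5, 5.5), "B"), ((1, 1), "A")]

# Try several k values and keep the one with the best testing accuracy
best_k, best_acc = None, -1.0
for k in (1, 3, 5):
    acc = sum(knn_predict(train, q, k) == y for q, y in test) / len(test)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)
```

On real data you would sweep a wider range of k and use a held-out test set rather than three hand-picked points.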
Picking k in more complicated cases

In the case of anomalies, or when a large number of neighbors closely surround the unknown point, we need solutions for calculating k that are both time- and cost-efficient. The main limitations are computational: the more objects there are in the dataset, the higher the cost of selecting an appropriate k. Moreover, with big datasets we usually work in a multiparameter optimization setup: the goal is not only to maximize model quality by choosing the right k in meta-training, but also to achieve faster model inference. The bigger k is, the more inference time is needed, which can be a problem with large datasets. Below are some advanced methods for selecting k that suit these cases.

1. Square root method. The optimal k value can be estimated as the square root of the total number of samples in the training dataset. Use an error plot or accuracy plot to confirm the most favorable k value. kNN performs well with multi-label classes, but in structures with many outliers this method can fail, and you'll need to use other methods.

2. Cross-validation method (elbow method). Begin with k=1, then perform cross-validation (5- to 10-fold; these figures are common practice, as they provide a good balance between computational effort and statistical validity) and evaluate the accuracy. Keep repeating the same steps for increasing k until you get consistent results. As k goes up, the error usually decreases, then stabilizes, and then grows again. The optimal k lies at the beginning of the stable zone.

How is the k-distance calculated?

K-distance is the distance between data points and a given query point. To calculate it, we have to pick a distance metric. Some of the most popular metrics are explained below.

Euclidean distance

The Euclidean distance between two points is the length of the straight line segment connecting them.
This most common distance metric is applied to real-valued vectors.

Manhattan distance

The Manhattan distance between two points is the sum of the absolute differences between the x and y coordinates of each point. It measures the distance covered by summing the lengths of all the intervals needed to get from one location to another in a city, which is why it's also known as the taxicab distance.

Minkowski distance

The Minkowski distance generalizes the Euclidean and Manhattan distances. It adds a parameter called "order" that allows different distance measures to be calculated. Minkowski distance indicates a distance between two points in a normed vector space.

Hamming distance

The Hamming distance is used to compare two binary vectors (also called data strings or bitstrings). To calculate it, data first has to be translated into a binary system.

K-nearest neighbors with Python

Let's look at how the classifier works in practice.

First steps

First, we need to generate data on which we'll run the classifier:
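A minimal data generator using only the standard library; the cluster centers, spread, and function name are arbitrary choices for illustration:

```python
import random

def generate_data(n_per_class, n_classes, spread=0.6):
    """Generate labeled 2D points: each class is a Gaussian blob around a random center."""
    random.seed(42)  # fixed seed so the example is reproducible
    data = []
    for cls in range(n_classes):
        center = (random.uniform(0, 10), random.uniform(0, 10))
        for _ in range(n_per_class):
            point = (random.gauss(center[0], spread), random.gauss(center[1], spread))
            data.append((point, cls))
    return data

data = generate_data(n_per_class=40, n_classes=3)
print(len(data))  # 120 labeled points, classes 0..2
```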
To plot the generated data, you can run the following code:
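One possible plotting sketch with matplotlib, assuming the data is a list of ((x, y), label) pairs as produced above; the colors and filename are arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def plot_data(data, filename="knn_data.png"):
    """Scatter-plot ((x, y), label) pairs, one color per class."""
    colors = ["red", "green", "blue", "orange", "purple"]
    for (x, y), label in data:
        plt.scatter(x, y, color=colors[label % len(colors)], s=15)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.savefig(filename)

plot_data([((0, 0), 0), ((1, 1), 1), ((2, 0), 2)])
```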
The result is a scatter plot with one color per class.

How to split training and testing samples

Now we have a labeled set of objects, each of which is assigned to a class. We need to split this set into two parts: a training set and a test set. The following code is used for this:
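A splitting sketch following the 75/25 convention mentioned earlier; the function name and fixed seed are illustrative assumptions:

```python
import random

def train_test_split(data, train_fraction=0.75):
    """Shuffle the data and split it into training and testing parts."""
    shuffled = data[:]
    random.seed(0)  # fixed seed so the example is reproducible
    random.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = [((i, i), i % 3) for i in range(100)]  # placeholder labeled points
train, test = train_test_split(data)
print(len(train), len(test))  # 75 25
```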
Classifier implementation

Now that we have a training sample, we can implement the classifier algorithm:
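A compact implementation sketch using Euclidean distance and a majority vote; `classify_knn` and the toy data are illustrative names, not the original post's code:

```python
import math
from collections import Counter

def classify_knn(train, query, k=3):
    """Predict the class of `query` from `train`, a list of ((x, y), label) pairs.

    Sorts training points by Euclidean distance to the query, then takes
    a majority vote among the k nearest labels.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((9, 9), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(classify_knn(train, (1, 1), k=3))  # "A": all three nearest neighbors are class A
```

Sorting the entire training set is O(n log n) per query; for large datasets, structures like k-d trees are the usual optimization.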
Examples of execution

Now we can evaluate how well our classifier works. To do this, we generate data, split it into a training set and a test set, classify the objects in the test set, and compare the actual classes in the test set to the classes resulting from the classification:
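An end-to-end sketch tying the steps together; the cluster centers, seed, and helper names are illustrative assumptions:

```python
import math
import random
from collections import Counter

def generate_data(n_per_class, centers, spread=0.6):
    """Gaussian blob per class around the given centers."""
    data = []
    for cls, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            data.append(((random.gauss(cx, spread), random.gauss(cy, spread)), cls))
    return data

def classify_knn(train, query, k=3):
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

random.seed(7)  # reproducible example
data = generate_data(40, centers=[(0, 0), (5, 5), (10, 0)])
random.shuffle(data)
split = int(len(data) * 0.75)
train, test = data[:split], data[split:]

# Fraction of test points whose predicted class matches the true class
accuracy = sum(classify_knn(train, q) == y for q, y in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

With well-separated blobs like these, the accuracy should be close to 1.0; overlapping clusters would lower it.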
Now we can show the classifier's work graphically. In the pictures below we used 3 classes with 40 elements each, and the value of k for the algorithm was set to three. The following code can be used to create these graphics:
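One way to visualize the decision regions, assuming ((x, y), label) training data; this sketch classifies a grid of background points and colors each by its predicted class (grid resolution, colors, and filename are arbitrary choices):

```python
import math
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from collections import Counter

def classify_knn(train, query, k=3):
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def plot_decision_regions(train, k=3, steps=40, filename="knn_regions.png"):
    """Color a grid of points by predicted class, then overlay the training points."""
    xs = [p[0] for p, _ in train]
    ys = [p[1] for p, _ in train]
    region_colors = ["#ffcccc", "#ccffcc", "#ccccff"]
    point_colors = ["red", "green", "blue"]
    for i in range(steps):
        for j in range(steps):
            x = min(xs) + (max(xs) - min(xs)) * i / (steps - 1)
            y = min(ys) + (max(ys) - min(ys)) * j / (steps - 1)
            cls = classify_knn(train, (x, y), k)
            plt.scatter(x, y, color=region_colors[cls % 3], s=6)
    for (x, y), cls in train:
        plt.scatter(x, y, color=point_colors[cls % 3], s=20)
    plt.savefig(filename)

train = [((0, 0), 0), ((1, 0), 0), ((4, 4), 1), ((5, 4), 1), ((0, 5), 2), ((1, 5), 2)]
plot_decision_regions(train)
```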
Pros and cons of kNN

The main advantage of kNN is that it's fast to develop and simple to interpret: there is essentially no training phase, the algorithm makes no assumptions about the underlying data distribution, and it handles both classification and regression, including multi-class problems.
However, the algorithm has its weaknesses too: prediction is slow on large datasets because distances to all training points must be computed, accuracy suffers from irrelevant features and from features on different scales, and performance degrades in high-dimensional spaces.
Practical applications

K-nearest neighbors method for car manufacturing

An automaker has designed prototypes of a new truck and sedan. To determine their chances of success, the company has to find out which current vehicles on the market are most similar to the prototypes. Their competitors are their "nearest neighbors." To identify them, the car manufacturer needs to input data such as price, horsepower, engine size, wheelbase, curb weight, fuel tank capacity, etc., and compare the existing models. The kNN algorithm classifies complicated multi-featured prototypes according to their closeness to similar competitors' products.

kNN in e-commerce

K-nearest neighbors is an excellent solution for cold-starting an online store recommendation system, although as the dataset grows, more advanced techniques are usually needed. The algorithm can select the items that specific customers would like, or predict their actions based on customer behavior data. For example, kNN can quickly tell whether or not a new visitor is likely to carry out a transaction.

kNN application for education

Another kNN application is classifying groups of students based on their behavior and class attendance. With the help of k-nearest neighbors analysis, it is possible to identify students who are likely to drop out or fail early. These insights allow educators and course managers to take timely measures to motivate such students and help them master the material.

Takeaways

The kNN algorithm is popular, effective, and relatively easy to implement and interpret. It can efficiently solve classification and regression tasks for complicated multivariate cases. Despite the emergence of many advanced alternatives, it maintains its position among the best ML classifiers.

What happens if the k value is large in kNN?

The value of k in the kNN algorithm is related to the error rate of the model. A small value of k can lead to overfitting, while a large value of k can lead to underfitting.
Overfitting means that the model performs well on the training data but poorly on new data.
Is a higher k better in kNN?

The only parameter that adjusts the complexity of kNN is the number of neighbors k. The larger k is, the smoother the classification boundary.
Why is k important in kNN?

The k value in the kNN algorithm defines how many neighbors are checked to determine the classification of a specific query point. For example, if k=1, the instance is assigned to the same class as its single nearest neighbor.
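A tiny self-contained illustration of how k changes the outcome; the data is a contrived toy example with one outlier next to the query:

```python
import math
from collections import Counter

def classify_knn(train, query, k):
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# One "B" outlier sits right next to the query; the surrounding points are "A".
train = [((1, 1), "B"), ((2, 2), "A"), ((2, 0), "A"), ((0, 2), "A"), ((0, 0), "A")]
query = (1.1, 1.1)

print(classify_knn(train, query, k=1))  # "B": the single nearest neighbor wins
print(classify_knn(train, query, k=5))  # "A": the majority of neighbors wins
```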
Does increasing k in kNN increase accuracy?

Not necessarily. Accuracy often improves at first as k grows and noise is averaged out, but beyond some point the accuracy decreases as the model starts to underfit.