
In this tutorial, I’ll show you how to use the kNN (k-nearest neighbors) algorithm in R. kNN is considered supervised machine learning because we already know what the result of the algorithm should be (i.e., the data is pre-labeled).

So, as usual, let’s load the packages.

We will use ggplot2 for visualization, dplyr for general data manipulation, gridExtra for arranging plots, and class for kNN. Next, I’ll create a theme for the visualizations.
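Something along these lines should do (the theme object name and its styling choices are mine):

```r
# Packages used throughout this tutorial
library(ggplot2)    # visualization
library(dplyr)      # general data manipulation
library(gridExtra)  # arranging plots side by side
library(class)      # knn()

# A reusable ggplot theme; the styling here is arbitrary
my_theme <- theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "bottom")
```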

Finally, let’s load the data.
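A minimal sketch; the file name and the column names (painter, jewels, blue) are assumptions for illustration:

```r
# Read the paintings data; file name and column names are assumed here
paintings <- read.csv("paintings.csv", stringsAsFactors = TRUE)
```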

I have created a bogus dataset of 100 paintings. We will use knn() to predict the painter of a test set. We will not convert the paintings data to a tibble, as knn() does not play well with tibbles.

Let’s take a look at the data.
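For example:

```r
# Structure, first rows, and class balance (column names as assumed above)
str(paintings)
head(paintings)
table(paintings$painter)
```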

The data is a made-up dataset comparing 100 paintings from 4 artists (DaVinci, Munch, Picasso, and Van Gogh). As predictors, I used the number of jewels that appear in a painting and the shade of blue on a scale of 1 to 10, where a higher number indicates a darker shade of blue.

Before we dive into the data, I’d like to discuss kNN first. So, what is kNN?
In short, kNN can be summarized by the sentence: if X looks and tastes more like Y than like Z, then X belongs to the same group as Y, using Euclidean distance to quantify "looks and tastes."

Let’s look at them in a simplified example.
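Here is a sketch of such a toy set in R. The jewel and blue values are read off the distance calculations further down; apart from paintings C and D, which are identified as Picasso's below, the painter labels are my own illustrative guesses.

```r
# Eight labelled paintings (a-h); painter labels beyond c and d are guesses
simple <- data.frame(
  id      = c("a", "b", "c", "d", "e", "f", "g", "h"),
  jewels  = c(1, 3, 7, 8, 9, 8, 1, 2),
  blue    = c(2, 1, 9, 7, 3, 2, 7, 8),
  painter = c("DaVinci", "DaVinci", "Picasso", "Picasso",
              "VanGogh", "VanGogh", "Munch", "Munch")
)

# The mystery painting Z: ten jewels, blue shade of eight
z <- data.frame(jewels = 10, blue = 8)
```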

Next, let’s visualize.
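A sketch of the plot, using the toy objects defined above:

```r
# Labelled paintings colored by painter; painting Z marked with an X
ggplot(simple, aes(x = jewels, y = blue, color = painter)) +
  geom_point(size = 4) +
  geom_text(aes(label = id), vjust = -1, show.legend = FALSE) +
  geom_point(data = z, aes(x = jewels, y = blue),
             inherit.aes = FALSE, shape = 4, size = 5) +
  labs(x = "Number of jewels", y = "Shade of blue (1-10)") +
  my_theme
```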

Well, it seems like Picasso and Van Gogh liked to put jewels in their paintings, but Picasso used darker blues while Van Gogh used lighter ones. Easy question: if I had a painting Z with ten jewels and a blue shade of eight, can you guess who the painter probably is? It is likely Picasso.

For the kNN algorithm, here is the Euclidean distance equation:

$$dist(x,y) = \sqrt{\sum_{i=1}^{n} (x_i-y_i)^2}$$

where \(n\) is the number of predictors (two in our case), not to be confused with the k that counts neighbors.

In this case, we can calculate the distance from painting Z to each of the other paintings as follows:

\(dist(z,a) = \sqrt{(10-1)^2+(8-2)^2}= 10.8\)
\(dist(z,b) = \sqrt{(10-3)^2+(8-1)^2}= 9.9\)
\(dist(z,c) = \sqrt{(10-7)^2+(8-9)^2}= 3.2\)
\(dist(z,d) = \sqrt{(10-8)^2+(8-7)^2}= 2.2\)
\(dist(z,e) = \sqrt{(10-9)^2+(8-3)^2}= 5.1\)
\(dist(z,f) = \sqrt{(10-8)^2+(8-2)^2}= 6.3\)
\(dist(z,g) = \sqrt{(10-1)^2+(8-7)^2}= 9.1\)
\(dist(z,h) = \sqrt{(10-2)^2+(8-8)^2}= 8.0\)

With k = 1, the winner is the shortest distance, 2.2, which belongs to painting D by Picasso. Therefore, painting Z is predicted to be a Picasso.

But that is just 1-NN. In knn(), we can specify k, the number of neighbors to consider.
However, the idea is not that different. Suppose we set k to 3: the algorithm looks at the three nearest neighbors, at distances 2.2, 3.2, and 5.1, and counts the votes. Since two of the three votes belong to Picasso (2.2 and 3.2), painting Z is again classified as a Picasso.
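In code, a 3-NN call on the toy set looks roughly like this (object names from the sketch above):

```r
# knn() wants the numeric predictors, the point(s) to classify,
# the training labels as a factor, and k
knn(train = simple[, c("jewels", "blue")],
    test  = z,
    cl    = factor(simple$painter),
    k     = 3)
#> should print "Picasso", matching the hand count above
```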

Now it is time to dive into the painting dataset.

Let’s start by visualizing.
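Something like this, with the column names assumed earlier:

```r
# All 100 paintings: jewels vs. shade of blue, colored by painter
ggplot(paintings, aes(x = jewels, y = blue, color = painter)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(x = "Number of jewels", y = "Shade of blue (1-10)") +
  my_theme
```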

DaVinci seems to have a distinct pattern. Picasso, on the contrary, has some similarities with both Munch and Van Gogh. When the distinction is clear, kNN should be relatively straightforward. So, I’d guess that anything falling somewhere between Munch and Picasso would be more prone to incorrect prediction.

Next, we will split the data into train and test sets, along with some other necessary prep.
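One way to do it, with a 70/30 split matching the numbers later in the post; the object names and the seed are mine. knn() takes the predictors and the class labels as separate objects, so we prepare both:

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Randomly pick 70 paintings for training; the remaining 30 are the test set
train_idx <- sample(nrow(paintings), 70)

train_set    <- paintings[train_idx,  c("jewels", "blue")]
test_set     <- paintings[-train_idx, c("jewels", "blue")]
train_labels <- paintings$painter[train_idx]
test_labels  <- paintings$painter[-train_idx]
```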

Before we create a model, let’s take a look at the data that we just separated into train and test sets and visualize them side by side.
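A sketch using gridExtra to put the two plots next to each other:

```r
p_train <- ggplot(paintings[train_idx, ],
                  aes(x = jewels, y = blue, color = painter)) +
  geom_point(size = 3) + ggtitle("Train set") + my_theme

p_test <- ggplot(paintings[-train_idx, ],
                 aes(x = jewels, y = blue, color = painter)) +
  geom_point(size = 3) + ggtitle("Test set") + my_theme

grid.arrange(p_train, p_test, ncol = 2)
```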

Just looking at the train and test sets, can you guess which paintings are likely and unlikely to be predicted incorrectly? Keep that in mind… let’s run the model. For demonstration purposes (and because it is easier for me), I’ll set k equal to 1.
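Roughly:

```r
# 1-NN: each test painting takes the painter of its single nearest neighbor
prediction_k1 <- knn(train = train_set,
                     test  = test_set,
                     cl    = train_labels,
                     k     = 1)
```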

Let’s see the result.
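A confusion table and the error rate will do:

```r
# Predicted painters vs. the true painters of the 30 test paintings
table(predicted = prediction_k1, actual = test_labels)

# Share of misclassified paintings
mean(prediction_k1 != test_labels)
```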

So, it seems like we missed around seven paintings out of 30, or around 23%.
Since we kept the correct painters in the test_labels vector, let’s combine everything and visualize which paintings the model got wrong.
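A sketch of that step (the predicted and correct column names are mine):

```r
# Attach the predictions to the test paintings and flag the misses
results_k1 <- paintings[-train_idx, ] %>%
  mutate(predicted = prediction_k1,
         correct   = predicted == painter)

# Shape shows whether each test painting was classified correctly
ggplot(results_k1, aes(x = jewels, y = blue, color = painter, shape = correct)) +
  geom_point(size = 3) +
  labs(x = "Number of jewels", y = "Shade of blue (1-10)") +
  my_theme
```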

Good, now we know what paintings the model got wrong.

Aha, but that is not so clear. We can do better than that. Let’s create a chart that puts the data points from the train set and the incorrectly predicted data points from the test set together.
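Something like this, with the train set in the background and the misses drawn larger, colored by the (wrong) predicted painter:

```r
wrong_k1 <- filter(results_k1, !correct)

ggplot(paintings[train_idx, ], aes(x = jewels, y = blue, color = painter)) +
  geom_point(size = 2, alpha = 0.5) +                       # train set
  geom_point(data = wrong_k1,
             aes(x = jewels, y = blue, color = predicted),  # misses
             size = 5) +
  labs(x = "Number of jewels", y = "Shade of blue (1-10)", color = "Painter") +
  my_theme
```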

Now we can clearly see why they are wrong. For example, the giant red dot is classified as painted by DaVinci, but in fact it was painted by Picasso. Why is that? Recall that with k = 1 the algorithm takes the painter of the smallest Euclidean distance, and in this case there are a couple of train-set dots close to the painting in question.

The algorithm then picks painting D as the winner, whose painter is DaVinci. Consequently, CW’s painter is predicted to be “DaVinci.” Now, choosing k is somewhat more art than science. There is a tradeoff between a higher and a lower k: a higher k decreases the variance and neutralizes the impact of noisy data, but a k that is too high will wash out small but significant patterns.

Suppose we set k equal to the number of observations; in that case, the class with the highest representation in the train set will always win, ignoring the “nearest neighbors” entirely.

On the other hand, if we set k to 1, the nearest neighbor to a given data point will always win.

Textbooks often suggest the square root of the number of observations. In this example, that is \(\sqrt{70} \approx 8\). Let’s try that.
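Same call as before, only with a larger k:

```r
# kNN with k = 8 (roughly the square root of the 70 training observations)
prediction_k8 <- knn(train = train_set,
                     test  = test_set,
                     cl    = train_labels,
                     k     = 8)
```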

Let’s visualize the result.
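Reusing the same approach as for k = 1:

```r
results_k8 <- paintings[-train_idx, ] %>%
  mutate(predicted = prediction_k8,
         correct   = predicted == painter)

table(predicted = prediction_k8, actual = test_labels)

# Train set in the background, this round's misses on top
ggplot(paintings[train_idx, ], aes(x = jewels, y = blue, color = painter)) +
  geom_point(size = 2, alpha = 0.5) +
  geom_point(data = filter(results_k8, !correct),
             aes(x = jewels, y = blue, color = predicted), size = 5) +
  labs(x = "Number of jewels", y = "Shade of blue (1-10)") +
  my_theme
```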

This time we missed only 6 paintings. The painting from the previous example was correctly classified this time. 🙂

Come to think of it… what if there is a tie? I mean, was the painting correctly classified thanks to a unanimous vote? Let’s see.
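knn() can report the share of neighbors that voted for the winning class via prob = TRUE, which is one way to spot close or tied votes:

```r
# Re-run with prob = TRUE to get the winning vote share per test painting
prediction_k8_prob <- knn(train = train_set, test = test_set,
                          cl = train_labels, k = 8, prob = TRUE)

# Proportion of the 8 neighbors that voted for the predicted painter;
# a share of 0.5 or lower means no painter got a clear majority
vote_share <- attr(prediction_k8_prob, "prob")
table(vote_share)
```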

Uh oh, there are ties. According to the documentation (link 1 and page 178 in link 2), R will break the tie at random.

The kNN algorithm from the class package is an excellent starting point, but there are many versions and iterations of the algorithm. For example, instead of using a plain voting system or a random data point to break ties, some authors add weights to the calculation.

TL;DR… knn() is an effective algorithm and very easy to deploy and tune.