KNN Algorithm Using Scikit-Learn – Classifying Iris Species (Tutorial)

The KNN algorithm (K-Nearest Neighbors) is a supervised learning algorithm used for classification and predictive modeling.

The KNN algorithm learns from example data and is able to classify new data based on feature similarity.

The primary function of a supervised learning algorithm is to learn from labeled data points and their features so that it can classify unlabeled data.

Think of labeled data as training data: examples that already come with the correct answer or classification. This is the data that you feed into the model for training.

Then you feed the test data to the model to test and evaluate its predictions.


The purpose of this article is to get your hands dirty with the KNN algorithm by building an actual machine learning model: one that classifies different species of iris.

How Does KNN Algorithm Work?

The KNN algorithm uses the concept of similarity. It assumes that similar objects or data points exist close to each other.

Take a look at the diagram below:

[Image: scatter of blue (cat) and red (dog) data points, with unlabeled green points]

Notice how similar data points are close to each other. Let’s say the blue points are cats and the red points are dogs. Let’s assume that this data is plotted based on features such as weight and height.

As a result, points with similar features are clustered close to each other.

The green data points represent animals that we still need to classify: are they cats or dogs? In this case, you only have information on their weights and heights.

Based on their features, we are able to associate them with either the neighboring blue points or the red points.

For example, the green point on the left is close to the red points. This data point is most likely a dog.

Making that prediction is where the KNN algorithm comes into play.

KNN calculates the distance between a new data point and every single data point in your training set.

It represents the data points as mathematical vectors, which it uses to compute the distances.

One of the most popular distance measuring metrics is the Euclidean distance.

Euclidean distance works by calculating the straight-line distance between two data points in a plane.

For instance, if we want to calculate the Euclidean distance between two data points p and q with n features each, we can use the following formula:

d(p, q) = √((q₁ − p₁)² + (q₂ − p₂)² + … + (qₙ − pₙ)²)
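To make this concrete, here is a quick sketch of how you might compute that distance with NumPy (the points here are made-up values):

import numpy as np

# Two made-up data points, e.g., (weight, height) measurements
p = np.array([4.0, 2.5])
q = np.array([6.0, 1.5])

# Square root of the sum of squared differences
print(np.sqrt(np.sum((q - p) ** 2)))   # 2.2360679...

# np.linalg.norm computes the same distance
print(np.linalg.norm(q - p))           # 2.2360679...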

Once the algorithm calculates every single distance, it selects the K nearest data points.

What do I mean by K nearest?

K is the number of nearest neighbors: the labeled data points closest to the new data point that we are trying to classify. Most importantly, the value of K is a positive integer.

In other words, if we have K=4, then the algorithm looks for the four nearest data points around the unlabeled point. 

Let’s assume that the value of K for the unknown green data point is 3. Thus, we will draw a circle around the green data point that encloses its three nearest neighbors.

[Image: circle drawn around the green data point, enclosing its three nearest neighbors]

Of those 3 neighboring data points, 2 are red and 1 is blue. In other words, we have two nearest points that belong to the red class and one that belongs to the blue class.

Since the majority of the neighbors inside the circle are red, our algorithm will predict that the new data point belongs to the red class.

Of course, real data is rarely this simple. The diagram above uses imaginary data points, but hopefully you get the idea of how the algorithm uses distances and the K value to make predictions.

It is crucial to pick the optimal value for K to balance bias and variance. Thus, a K value should be neither too small nor too large.

And the best way to find the optimal K value is to test your model with different K values.

To sum up, here’s how we implement the KNN algorithm (a minimal code sketch follows the list):

  1. Load and store the data.
  2. Calculate the distance from x (new data point) to all other data points.
  3. Sort all the distances from your data in ascending order.
  4. Initialize the K value for the nearest data points.
  5. Make a prediction based on the majority of data points with the same label within the K value.
  6. Evaluate your machine learning model.
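To tie these steps together, here is a minimal from-scratch sketch of the algorithm in NumPy. Keep in mind this is only an illustration with hypothetical names (knn_predict and the toy data below); for the actual model, we will rely on Scikit-Learn’s ready-made implementation.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: distance from x_new to every training point
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Steps 3 and 4: indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # Step 5: majority vote among the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example: two cats (label 0) and two dogs (label 1)
X = np.array([[4.0, 25.0], [5.0, 24.0], [30.0, 60.0], [28.0, 55.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([29.0, 58.0])))  # 1, i.e., a dog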

Before we jump into the new section, I am assuming you are already familiar with the following concepts:

  • Python
  • NumPy
  • Pandas
  • Matplotlib
  • Supervised Machine Learning
  • Scikit-Learn (Optional)

Related: NumPy Tutorial for Beginners – Arrays

Classifying Iris Species

Moving on to the fun part, where we will build an actual machine learning model.

For this tutorial, we want to create a machine learning model that will allow botanists to classify different species of iris flowers.

Our model will predict the type of flower by learning from its features: the lengths and widths of the petals and the sepals.

By learning from these measurements, our model will predict whether the flower is a setosa, versicolor, or virginica. 

Since we are training the model using labeled data for which we already know the answer, this is a supervised machine learning problem.

When we feed our model unclassified iris samples, it will predict their species based on what it learned from the training data.

In other words, this is also a classification problem, as it involves grouping data into different classes.

For problems like this, we refer to the possible outputs (the different iris species) as classes.

Now open your Jupyter notebook and type in the following imports:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Part 1 – Data Preparation

In this part, we will prepare and analyze the data. Our data is the iris dataset. An advantage of using Scikit-Learn is that it already includes the iris dataset for us.

We just have to load it by calling the load_iris() function. So type the following:

from sklearn.datasets import load_iris
iris_dataset = load_iris()
print("Keys of iris_dataset:\n", iris_dataset.keys())

Output:

Keys of iris_dataset:
 dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

The load_iris() function returns a Bunch object. This type is similar to a dictionary, as it contains keys and values. The output above shows its keys.
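Since a Bunch behaves like a dictionary, you can access its values by key; as a convenience, it also supports attribute-style access:

# Both expressions return the same NumPy array
print(iris_dataset['data'] is iris_dataset.data)  # True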

First, the data key is a NumPy array that contains measurements of sepal length, sepal width, petal length, and petal width for 150 different flowers. 

Type the following to learn more:

print("Data Type:", type(iris_dataset['data']))

Output:

Data Type: <class 'numpy.ndarray'>

Each row in this NumPy array corresponds to a flower. On the other hand, the columns correspond to their measurements.

Let’s take a look at the shape of the array to understand our data better.

print("Shape of Data:", iris_dataset['data'].shape)

Output:

Shape of Data: (150, 4)

The shape confirms that our data has 150 rows, and each row has 4 columns.

When it comes to machine learning, the items or rows are known as samples, and the columns are known as features.

We can take a look into the features of the first ten samples from our data set. Type the following:

print("First 10 Samples and Their Features:\n", iris_dataset['data'][:10])

Output:

First 10 Samples and Their Features:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]

Now let’s move on to the next key, which is target.

The target key is a NumPy array that holds the iris species encoded as integers from 0 to 2.

And most importantly, this key is a one-dimensional array. We can find out more about the target through the following lines of code:

print("Type of Target:", type(iris_dataset['target']))
print("Shape of Target:", iris_dataset['target'].shape)
print(iris_dataset['target'])

Output:

Type of Target: <class 'numpy.ndarray'>
Shape of Target: (150,)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

As the outputs tell us, this is a one-dimensional NumPy array holding 150 labels encoded as integers from 0 to 2.

0 means setosa, 1 means versicolor, and 2 means virginica.

Next, we have target_names. This key holds an array of strings: the names of the flower species.

Let’s take a look:

print("Target names:", iris_dataset['target_names'])

Output:

Target names: ['setosa' 'versicolor' 'virginica']
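Because the target labels are just integer indices into this array, you can translate them back to species names with NumPy fancy indexing:

# Map the first five encoded labels to their species names
print(iris_dataset['target_names'][iris_dataset['target'][:5]])
# ['setosa' 'setosa' 'setosa' 'setosa' 'setosa']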

Next, we have the key DESCR. This key provides us with a description of the iris dataset.

Let’s take a look:

print(iris_dataset['DESCR'])

Output:

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

You can also print a specific part of DESCR instead of the complete description:

print(iris_dataset['DESCR'][:500] + "\n...")

Output:

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                

The key feature_names is a list of strings that provides us with the names of available features.

Go ahead and type the following:

print("Feature Names:", iris_dataset['feature_names'])

Output:

Feature Names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

As you can see, we get a list of all the features along with their units.

For now, these are the crucial keys that should get us familiar with the data. You can go ahead and learn more by performing various operations on the dataset before we move on to the training and testing phase.
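For example, one quick check you could run is counting how many samples belong to each class with np.bincount:

# Count samples per class: the iris dataset is perfectly balanced
print(np.bincount(iris_dataset['target']))  # [50 50 50]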

Part 2 – Training and Testing Data

In this part, we will split our data into two sets, the training set and the testing set.

The training set is the data we use to train our model. The KNN algorithm will first learn from the training set how to distinguish the different iris species.

During the training phase, we don’t expose our model to the testing dataset. When training is over, we use the testing dataset to measure the model’s success.

In other words, the testing dataset is what we use to assess how well our model works.

One of the benefits of using Scikit-Learn is that it comes with a prebuilt train_test_split() function. By default, this function puts 75% of the rows of the dataset into a training set and the remaining 25% into a testing set.

So type the following code and then I will explain what’s going on.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

The first line of code imports the train_test_split() function from the module sklearn.model_selection.

Next, we are using the train_test_split() function to randomize and split our dataset into the following variables:

  • X – The inputs or features that you feed into your model. This value corresponds to a two-dimensional array (matrix), which is why we capitalize the X.
  • y – The expected outcomes or labels. It corresponds to a one-dimensional array (vector).
  • X_train – The training dataset.
  • y_train – The labels that correspond to the training dataset (X_train).
  • X_test – The testing dataset.
  • y_test – The labels that correspond to the testing dataset (X_test).

The parameters that we passed are the data itself and the target. I like to think of this function as train_test_split(X, y), where X corresponds to the features and y to the labels.

If you remember, our data holds a two-dimensional array of features for each sample, and the target contains a one-dimensional array of labels encoded as integers from 0 to 2.

One more parameter that we are using is random_state, which we set to 0. This ensures that the function splits the data the same way every time we run train_test_split().

If you don’t assign a fixed value such as 0, 1, or 42, then every time you run the code, your training and testing datasets will contain different sets of values.
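As a side note, train_test_split() also accepts optional parameters such as test_size and stratify. The snippet below is just an illustrative alternative (with hypothetical variable names) that makes the 75/25 split explicit and keeps the class proportions equal in both sets; the rest of the tutorial sticks with the default call above:

# Explicit 75/25 split that preserves the class proportions
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    iris_dataset['data'], iris_dataset['target'],
    test_size=0.25, stratify=iris_dataset['target'], random_state=0)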

Let’s see what we have so far by printing out their shapes. First, the training dataset:

print("X_train Shape:", X_train.shape)
print("y_train Shape:", y_train.shape)

Output:

X_train Shape: (112, 4)
y_train Shape: (112,)

Now the test dataset:

print("X_test Shape:", X_test.shape)
print("y_test Shape:", y_test.shape)

Output:

X_test Shape: (38, 4)
y_test Shape: (38,)

We see that X_train contains 75% of all the rows. On the other hand, X_test holds 25%.

Part 3 – Data Visualization

It is always a good idea to visualize your data for further inspection before creating a model.

At this point, we will visualize our data to see what we have. And one of the best ways to visualize our data is through scatter plots.

Scatter plots are diagrams where we represent our data using dots, with one feature on the x-axis and another on the y-axis.

One of the purposes of a scatter plot is to observe the relationship between variables.

To create a scatter plot, we will first create a dataframe from the X_train data using Pandas. After that, we label the columns with the strings we saw in feature_names.

Type the following:

df = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
df.head()

Output:

[Image: the first five rows of the dataframe, as returned by df.head()]

Now it is time to turn the dataframe into a scatter plot. We will pass our dataframe to the plotting.scatter_matrix() function and color the data points based on the labels from y_train.

Type the following:

pd.plotting.scatter_matrix(df, c=y_train, figsize=(12, 12), marker='o', s=20, alpha=0.8)
plt.show()

Output:

[Image: scatter matrix of the X_train data, colored by species]

From the code, you can see that we have passed several other parameters to modify our chart as we like. For example, figsize lets us determine the size of the diagrams or figures.

You can learn more about the parameters available to the plotting.scatter_matrix() function in the official Pandas documentation.

Here’s a challenge for you. Go ahead and try to visualize the X_test data using the same procedure.

See if you can figure it out. You can use the code snippet above as an example.

However, if not, then don’t worry about it. Here’s how you do it:

df = pd.DataFrame(X_test, columns=iris_dataset.feature_names)
pd.plotting.scatter_matrix(df, c=y_test, figsize=(12, 12), marker='o', s=20, alpha=0.8)
plt.show()

Output:

[Image: scatter matrix of the X_test data, colored by species]
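As an aside, if the full scatter matrix feels like too much, you can also plot a single pair of features with plain Matplotlib. Here is a small sketch (the column indices assume the feature order we saw in feature_names):

# Columns 0 and 1 are sepal length and sepal width
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, marker='o', s=20)
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
plt.show()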

Part 4 – Creating the Model

Building the model using Scikit-Learn is not as complicated as it sounds. As you may already know, it comes with many classification algorithms. The one that we will import is the KNN algorithm, or K-nearest neighbors classifier.

First, we have to import the KNeighborsClassifier class from the neighbors module. After that, instantiate an object of that class.

Then we will set the value of K to 1 using the parameter n_neighbors=1.

And here’s the code for all of that:

from sklearn.neighbors import KNeighborsClassifier
knnObject = KNeighborsClassifier(n_neighbors=1)

The next step is to fit our training data (X_train, y_train) to the knnObject using the fit() method.

Think of fitting the data as training. The fit() method takes in our training dataset and trains the model on it so that it can make predictions.

So, type the following:

knnObject.fit(X_train, y_train)

Output:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

For now, you don’t have to worry about the parameters we got as an output. All we did was fit the model using our training dataset so that we can make predictions.

Part 5 – Making Predictions

Imagine you are a machine learning engineer for a company. A client of yours reached out to you to verify an iris species they found in the wild.

Read: How to Become a Machine Learning Engineer

They only gave us the following information:

  • Sepal length: 40 cm
  • Sepal width: 10 cm
  • Petal length: 5 cm
  • Petal width: 2 cm

Based on these features, we have to make a prediction.

So, our first step is to create a two-dimensional NumPy array and check its shape:

newIris = np.array([[40, 10, 5, 2]])
print("newIris Shape:", newIris.shape)

Output:

newIris Shape: (1, 4)

The reason we have created a two-dimensional array is that Scikit-Learn expects two-dimensional arrays as input.
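By the way, if you already had the measurements in a one-dimensional array, NumPy’s reshape(1, -1) would turn it into the single-row, two-dimensional shape that Scikit-Learn expects (measurements here is just an illustrative name):

measurements = np.array([40, 10, 5, 2])  # shape: (4,)
newIris = measurements.reshape(1, -1)    # shape: (1, 4)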

Now we are going to make the prediction by calling the predict() method on knnObject:

prediction = knnObject.predict(newIris)
print("Prediction Value:", prediction)
print("Predicted Target Name:",
       iris_dataset['target_names'][prediction])

Output:

Prediction Value: [2]
Predicted Target Name: ['virginica']

The iris our client found belongs to class 2, which corresponds to the species virginica.

But before we confirm our result to the client, we have to make sure that our model predicted the correct species. And that’s where model evaluation comes into play.

Part 6 – Model Evaluation

For model evaluation, we have to use the test set. We don’t use that data to build the model; we use it only to test it.

To clarify, this process involves predicting the species of each iris in the test dataset using X_test, and then comparing those predictions against the test data’s labels, y_test.

First, we will make the predictions using knnObject.predict():

testSetPredictions = knnObject.predict(X_test)
print("Test Set Predictions:", testSetPredictions)

Output:

Test Set Predictions: [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]

Then call the score method to measure the model’s accuracy:

accuracy = round(knnObject.score(X_test, y_test), 2)
print("The Test Set Accuracy is:", accuracy)

Output:

The Test Set Accuracy is: 0.97

In short, our model correctly predicted the species of the irises in the test dataset 97% of the time.
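You can double-check this number yourself: comparing the predictions element-wise against y_test and taking the mean of the matches gives the same accuracy.

# Fraction of test samples where the prediction matches the true label
print(np.mean(testSetPredictions == y_test))  # 0.9736..., about 97%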

There are ways to tune our model to improve its accuracy and performance, but we will not go deep into that right now.

To sum up, 97% accuracy makes for a trustworthy model in a scenario like this. However, depending on your requirements, it may not be enough, and that’s where model tuning comes into play.
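As a starting point for that tuning, one simple approach is to retrain the model with several K values and compare the test accuracies (this loop is just a sketch; a more robust approach would use cross-validation):

# Try K values from 1 to 10 and print the test accuracy for each
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, round(knn.score(X_test, y_test), 2))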

Conclusion

In conclusion, we have created a simple iris classification model using the KNN algorithm. Not only that, but you have also learned how the KNN algorithm works and how to implement it using Scikit-Learn. 

Additionally, some of the aspects that we discussed require prior knowledge of specific tools and concepts, so you may need to brush up on those skills.

Also, focus on learning how to read the official Scikit-Learn documentation. It is a great help.


Found this tutorial useful? Still having problems understanding the KNN algorithm? What other algorithms do you think we could use to classify iris species? Comment below.
