K-Nearest Neighbor as a Classification Algorithm: A Use Case — #70daysofMLStudy with Data Science Nigeria
In this article, we will explore the use of the K-nearest neighbors (KNN) algorithm for classification. The implementation is in the Python programming language, and the dataset used is the ‘Breast Cancer Dataset’ from the UCI Machine Learning Repository. This article will also focus on code execution using the Jupyter Notebook. A link to the GitHub repository can be found at the end of the article.
GETTING THE REQUIRED DATA
The breast cancer dataset can be downloaded from the UCI Machine Learning Repository. The site can be a little confusing for first-timers/beginners, so I will briefly explain how the data can be downloaded and read into a Pandas DataFrame.
STEPS:
1. Visit the site using this link: https://archive.ics.uci.edu/ml/index.php . The homepage opens up as shown below, with links to the latest news, newest data uploads, most popular datasets, et cetera.
The dataset we seek is the ‘Breast Cancer Wisconsin (Original)’ dataset. Listed on the homepage is a similar dataset but with different features (the ‘Breast Cancer Wisconsin (Diagnostic)’ dataset). In order to gain access to the required data, click on ‘View All Datasets’ (highlighted in the image above).
2. A new page appears showing links to numerous datasets. Scroll down the page and find the link to the required dataset as shown below:
Notice how the site also shows a brief description of each dataset: what machine learning tasks it can be used for, the types of variables present, the year collated, et cetera.
3. On clicking the link, the Breast Cancer Wisconsin dataset page opens up, also displaying some vital information about the data (see the image below).
4. Click on ‘Data Folder’. This downloads the dataset directly to your computer. You will notice that the downloaded file has an unusual extension, ‘.data’. This is no cause for concern, as the file can still be read using Pandas, although in a slightly different manner than usual.
Now, let’s move on to reading the dataset into a Pandas DataFrame.
Reading/Importing the Dataset
Required Python Libraries:
- Pandas — for data analysis/manipulation
- NumPy — for linear algebra
- Scikit-learn (sklearn) — for modeling
STEPS:
1. Import the required libraries, as in the sketch below.
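A minimal sketch of the imports used throughout this article; the scikit-learn pieces are imported here for convenience, although they are only needed later in the modeling section:

```python
import pandas as pd  # data analysis/manipulation
import numpy as np   # linear algebra/arrays

# Used later, in the modeling section
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
```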
2. Read/import the dataset. As noted earlier, the downloaded dataset has a different file extension from the regular “.csv” or “.xlsx” most people are used to. To read this dataset correctly, do the following:
i. Go to the folder where the file was saved on downloading.
ii. Right-click on the file, then select ‘copy link address’, as seen below:
iii. In the Jupyter notebook, pass the copied link to Pandas using the following syntax, then preview the data with the Pandas .head() method.
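A sketch of this step; the file name below is a placeholder for the copied link/path, and header=None tells Pandas that the file contains no header row:

```python
# Placeholder: replace with the copied link/path to the downloaded file
data_path = "breast-cancer-wisconsin.data"

# The '.data' file is plain comma-separated text, so read_csv handles it;
# header=None because the file has no column names
df = pd.read_csv(data_path, header=None)

# Preview the first five rows
df.head()
```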
Notice that the data read in has no column names or headers, which makes it difficult to tell what each column represents.
Fortunately, a description of the data is available on the site. On the page where the data was downloaded earlier, scroll down and find the description of each column.
Now we have a clearer picture of the data. Great!
It is a good idea to replace the column names with the original descriptions, as in the image above, or with a better representation, for ease of recognition.
This can be done using the following syntax in Pandas.
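A sketch of the renaming step, using the column names listed in the dataset description on the UCI page:

```python
# Column names as listed in the dataset description on the UCI page
df.columns = [
    "id number", "Clump Thickness", "Uniformity of Cell Size",
    "Uniformity of Cell Shape", "Marginal Adhesion",
    "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin",
    "Normal Nucleoli", "Mitoses", "Class",
]
df.head()
```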
I think this looks better. Don’t you?
Let’s Proceed.
DATA MANIPULATION/PREPROCESSING
For this use case, there really isn’t much cleaning to be done. However, there are a few things we should look at, such as checking for missing values. Earlier, on the data download page, we were informed that the dataset does have missing values.
Let’s verify that with Pandas using the following syntax:
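As a sketch, one quick check counts the null entries per column:

```python
# Count missing (null) entries in each column
df.isnull().sum()
```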
Something seems off. The code output says there are no missing entries, as indicated above. However, the data source says otherwise. We need to look into the data carefully to verify this. One quick and easy way is to use the Pandas .unique() or .value_counts() method on each column. These output the unique entries, or counts of unique entries, in a column of the dataset.
Before doing this, I recommend dropping the ‘id number’ column. For this use case, it plays no role in the algorithm and would only add noise to the model.
The Pandas .drop() method can be used for this purpose. To drop a column, the column name has to be passed in, alongside other arguments such as inplace=True, which drops the column within the same DataFrame rather than returning a copy, and axis=1, which means dropping column-wise rather than row-wise (axis='columns' can also be written simply as 1).
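A sketch of the drop, assuming the id column was named ‘id number’ during the renaming step:

```python
# Drop the 'id number' column in place, column-wise (axis=1)
df.drop("id number", axis=1, inplace=True)
```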
Next, we pick one column after the other and apply the .unique() method for more details on what each column contains.
Or we could ease this task by creating a loop to iterate through each column and print the unique values alongside the column name.
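For instance, a simple loop like this sketch does the job:

```python
# Print each column's name alongside its unique entries
for col in df.columns:
    print(col, df[col].unique())
```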
Can you spot the fishy column?
That’s right! The ‘Bare Nuclei’ column. There’s a question mark present, which would pose a problem during modeling. Also, the column seems to be encoded as strings (the ‘object’ data type).
Since the focus of this article is not data wrangling or preprocessing techniques, the entry will simply be replaced with a placeholder value of -999. For better intuition on how missing values or wrong data entries are treated, please refer to the links below.
Since the entry is not exactly a missing value but has been represented with a question mark, the .fillna() method would not be appropriate here. Rather, the .replace() method is the better fit.
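A sketch of this step; the -999 placeholder and the cast to int are as described above:

```python
# The '?' entries are string placeholders, not NaNs, so .replace() is used
df.replace("?", -999, inplace=True)

# Cast the column from 'object' to 'int' so the model receives numbers only
df["Bare Nuclei"] = df["Bare Nuclei"].astype(int)
```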
Notice I also changed the data type of the column from ‘object’ to ‘int’. The reason for this step is to avoid errors during modeling, as machine learning algorithms technically accept only numerical data types, although a few algorithms (such as CatBoost) work well with object data types too.
Now we are ready for modeling.
TRAINING THE ALGORITHM
First, we have to split the data into a train and a test set. Before that, ‘X’ and ‘y’ have to be defined, with ‘X’ representing the features and ‘y’ representing the target column, i.e., the column to be predicted.
This can be done in two ways, using either the Pandas or NumPy library, as shown below:
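A sketch of both options; ‘Class’ is the target column named during the renaming step:

```python
# Option 1 - Pandas: drop the target to get the features, select it as y
X = df.drop("Class", axis=1)
y = df["Class"]

# Option 2 - NumPy: the same split, converted to arrays
X = np.array(df.drop("Class", axis=1))
y = np.array(df["Class"])

print(X.shape, y.shape)  # X is two-dimensional, y is one-dimensional
```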
Then we split as follows, using a test size of 20%:
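As a sketch (random_state is an added assumption here, for reproducibility):

```python
# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # random_state: assumed, for reproducibility
)
```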
Finally, we train the KNN classifier. The .fit() method is used to fit the training data points to the classifier:
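A minimal sketch, using scikit-learn’s default of 5 neighbors:

```python
# Instantiate the KNN classifier (n_neighbors defaults to 5) and fit it
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
```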
The algorithm is further tested on the test set created earlier, and the accuracy is calculated.
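A sketch of the evaluation; .score() returns the mean accuracy on the test set:

```python
# Mean accuracy of the classifier on the held-out test set
accuracy = clf.score(X_test, y_test)
print(accuracy)
```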
That is a pretty good accuracy. However, we could try some hyperparameter tuning by increasing or decreasing the value of k, or by using other distance metrics (such as Manhattan or Euclidean).
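For example, a quick sketch of such tuning (the candidate values of k and the metrics below are illustrative choices, not the article’s original settings):

```python
# Try a few values of k with both distance metrics and compare test accuracy
for metric in ("euclidean", "manhattan"):
    for k in (3, 5, 7, 9):
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
        knn.fit(X_train, y_train)
        print(metric, k, knn.score(X_test, y_test))
```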
Great!
Now that we have a trained classifier, the next step is to test the algorithm with some unknown data points; not the test set, but some random data points. We can do that by creating an array using NumPy and then reshaping it. The reason for reshaping is that ‘X’ is a two-dimensional (2D) array, as noticed when the shape was printed earlier, whereas ‘y’ is one-dimensional; the classifier therefore expects any new input to be 2D as well, with one row per sample. Reshaping with .reshape(1, -1) turns a single sample into exactly that shape.
Take note that the new data point, ‘example_measure’, has to be equal in length to the number of features present in X; i.e., if X has 9 features (excluding the target), just as in this case, then the new data point should also contain 9 elements.
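A sketch of this step; the nine feature values below are hypothetical, chosen to look like a malignant sample:

```python
# Hypothetical sample with 9 feature values (one per feature in X)
example_measure = np.array([10, 10, 10, 8, 6, 10, 8, 10, 1])

# Reshape to (1, 9): one row (the sample) with nine columns (the features),
# since the classifier expects a 2D array
example_measure = example_measure.reshape(1, -1)

print(clf.predict(example_measure))
```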
The ‘4’ above indicates a prediction of ‘malignant’, following the class encoding in the dataset description (2 = benign, 4 = malignant).
CONCLUSION
The focus of this article was a simple classification problem using the K-Nearest Neighbors algorithm as a practical use case. I have written a separate article on understanding the intuition behind the algorithm. Please find the link below.
Also find below the links to some very useful resources on the KNN algorithm, as well as the GitHub repository link for the code used in this article.
Thank you for reading!
LINK TO ARTICLE ON KNN INTUITION
REFERENCES AND RESOURCES
SOCIAL MEDIA PROFILE
LinkedIn: https://www.linkedin.com/in/aminah-mardiyyah-rufa-i
Twitter: @diyyah92