K-Nearest Neighbours (KNN) in Python



Problem Statement: Predict whether or not a passenger survived the sinking of the Titanic






Variables: PassengerID, Survived, Pclass, Name, Sex, Age, Fare


We are going to use two variables, i.e. Pclass and Sex of the Titanic passengers, to predict whether they survived or not.


Independent Variables : Pclass, Sex

Dependent Variable : Survived


# Importing the libraries

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd


# Importing the dataset

dataset = pd.read_csv('titanic.csv')







# Separating the independent and dependent variables

X = dataset.iloc[:, [2, 4]].values

y = dataset.iloc[:, 1].values










# Encoding categorical data

from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import OneHotEncoder

# One-hot encode "Sex" (column 1 of X); "Pclass" is passed through as the last column

ct = ColumnTransformer([('sex', OneHotEncoder(), [1])], remainder = 'passthrough', sparse_threshold = 0)

X = np.array(ct.fit_transform(X), dtype = float)



We have encoded the variable "Sex" in X, which had two categorical values, i.e. male and female. Hence, we get two separate columns for it. The third column of the encoded array holds the variable "Pclass".







# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
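As a rough check on what test_size = 0.25 means, the sketch below computes the split sizes by hand. It assumes the file has 891 rows, which is the size of the commonly used Kaggle Titanic training file; the row count of titanic.csv is not stated here, so treat that number as an assumption. For a fractional test_size, scikit-learn rounds the test-set size up.

```python
import math

n_samples = 891      # ASSUMPTION: row count of titanic.csv, not stated in the text
test_size = 0.25     # same fraction as passed to train_test_split

n_test = math.ceil(test_size * n_samples)   # test-set size is rounded up
n_train = n_samples - n_test                # the remaining rows go to training

print(n_train, n_test)
```

Under that assumption the test set would contain 223 rows, which agrees with the number of instances in y_test used in the confusion matrix discussion.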









# Feature Scaling

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

X_test = sc.transform(X_test)


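We apply feature scaling because K-NN classifies a passenger by the distance to its neighbours: without scaling, a feature with a larger numeric range would dominate the distance calculation. Note that the scaler is fitted on the training set only and then applied unchanged to the test set.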
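The following is a small sketch of the standardisation that StandardScaler performs, written with the standard library only. The Pclass values are invented for illustration, and `standardize` is a hypothetical helper name. The key point is that the mean and standard deviation come from the training data and are reused for the test data.

```python
from statistics import fmean, pstdev

# Hypothetical Pclass values from a few training rows
pclass_train = [1.0, 2.0, 3.0, 3.0, 3.0]

mean = fmean(pclass_train)
std = pstdev(pclass_train)  # population standard deviation, as StandardScaler uses

def standardize(values):
    """Scale values with the mean and std learned from the training data."""
    return [(v - mean) / std for v in values]

scaled_train = standardize(pclass_train)
scaled_test = standardize([1.0, 3.0])  # test data reuses the training statistics

print(scaled_train)
print(scaled_test)
```

After scaling, the training column has mean 0 and standard deviation 1, so no single feature dominates the distance purely because of its units.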





# Fitting K-NN to the Training set

from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)

classifier.fit(X_train, y_train)
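To illustrate what the parameters above mean, here is a minimal sketch of the K-NN idea in plain Python: measure the Euclidean distance (Minkowski with p = 2) from a query point to every training point, take the 5 nearest, and return the majority label. The function name `knn_predict` and the sample points are hypothetical, and this is not how scikit-learn implements the classifier internally; in particular, vote ties are not handled specially in this sketch.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest
    training points, using Euclidean distance (Minkowski with p = 2)."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])          # nearest first
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical, already-scaled training points and survival labels
train_points = [(-1.0, -1.0), (-0.9, -1.2), (-1.1, -0.8),
                (1.0, 1.0), (0.9, 1.1), (1.2, 0.8), (1.1, 1.2)]
train_labels = [0, 0, 0, 1, 1, 1, 1]

prediction = knn_predict(train_points, train_labels, query=(0.8, 0.9), k=5)
print(prediction)
```

In this toy data, four of the five nearest neighbours of the query carry the label 1, so the vote returns 1.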


# Predicting the Test set results

y_pred = classifier.predict(X_test)



# Making the Confusion Matrix

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

The confusion matrix shows how well our model predicts. In other words, we assess how well our K-NN model has learned from the training set to make accurate predictions on the test set.






Here, the main diagonal (138 and 40) holds the correct predictions, and the off-diagonal entries (44 and 1) hold the incorrect predictions.

So, 138 + 40 = 178 of the 223 instances in y_test were predicted correctly.

Hence, our model achieved an accuracy of 178 / 223 ≈ 79.8%.
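The accuracy arithmetic can be reproduced in a few lines. The four counts are the ones quoted above; which off-diagonal cell holds 44 and which holds 1 is an assumption in this sketch, but the accuracy does not depend on that placement.

```python
# Confusion matrix counts quoted in the text. Which off-diagonal cell holds
# 44 and which holds 1 is an ASSUMPTION; accuracy does not depend on it.
cm = [[138, 1],
      [44, 40]]

correct = cm[0][0] + cm[1][1]              # sum of the main diagonal
total = sum(sum(row) for row in cm)        # all test instances
accuracy = correct / total

print(correct, total, round(accuracy * 100, 1))
```

Calling scikit-learn's accuracy_score(y_test, y_pred) on the actual predictions should give the same figure.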