Data Preprocessing for Machine Learning

July 06, 2018

Machine learning is the hottest thing of this decade. Everybody wants to get on the bandwagon and start deploying machine learning models in their businesses. At the heart of this intricate process is data. Your machine learning tools are only as good as the quality of your data, and sophisticated algorithms will not make up for poor data. Just as precious stones found while digging go through several cleaning steps, data also needs to go through a few steps before it is ready for further use.


In this article I will try to break down the exercise of data preprocessing, in other words the rituals programmers usually follow before data is ready to be used for machine learning models, into 6 simple steps.

Step 1: Import Libraries

The first step is usually importing the libraries that the program will need. A library is essentially a collection of modules that can be called and used. A lot of things in the programming world do not need to be written out explicitly every time they are required; there are functions for them, which can simply be invoked. pandas is one of the most popular Python libraries for data science. Here’s a snippet of me importing the pandas library and assigning it the shortcut “pd”.


import pandas as pd

Step 2: Import the Dataset

A lot of datasets come in CSV format. We first need to locate the directory of the CSV file (it is more convenient to keep the dataset in the same directory as your program) and then read it using a function called read_csv, which can be found in the pandas library.

import pandas as pd
dataset = pd.read_csv('Medium.csv')

After inspecting our dataset carefully, we are going to create a matrix of features from our dataset (X) and a dependent vector (Y) with their respective observations. To read the columns, we will use iloc from pandas (integer-position-based selection), which takes two selectors: [row selection, column selection].

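X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values

A lone : selects everything, so the code above selects all the rows. For the columns of X we have :-1, which means all the columns except the last one. For Y we have -1, which means only the last column (assuming the dependent variable is stored there). The pandas documentation covers the usage of iloc in more detail.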
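To make the slicing concrete, here is a minimal sketch on a tiny made-up DataFrame. The column names and values are assumptions for illustration only and are not taken from Medium.csv.

```python
import pandas as pd

# Hypothetical toy dataset; the column names and values are invented.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

# All rows, every column except the last one: the matrix of features.
X_toy = toy.iloc[:, :-1].values
# All rows, only the last column: the dependent vector.
Y_toy = toy.iloc[:, -1].values

print(X_toy.shape)  # (3, 3)
print(Y_toy.shape)  # (3,)
```

The features come back as a 2-D array with one row per observation, while the dependent vector is 1-D.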

Step 3: Taking care of Missing Data in Dataset

Sometimes you may find that some data are missing from the dataset. We need to be equipped to handle the problem when we come across it. Obviously you could remove the entire row of data, but what if you are unknowingly removing crucial information? Of course we would not want to do that. One of the most common ways to handle the problem is to take the mean of all the values in the same column and use it to replace the missing data.

The library that we are going to use for the task is the preprocessing module of scikit-learn. It contains a class called Imputer (available in scikit-learn versions before 0.22) which will help us take care of the missing data.

from sklearn.preprocessing import Imputer

Often the next step, as you will also see later in the article, is to create an object of the class so that we can call the methods of that class. We will call our object imputer. The Imputer class can take a few parameters:

i. missing_values: we can give it either an integer or “NaN” to tell it how missing values are marked.
ii. strategy: we want the average, so we will set it to mean. We can also set it to median or most_frequent (for the mode) as necessary.
iii. axis: we can assign it either 0 or 1; 0 imputes along columns and 1 imputes along rows.

imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)

Now we will fit the imputer object to our data. Fitting is basically training: here, the imputer learns the mean of each selected column from our data.

imputer = imputer.fit(X[:, 1:3])

The code above will fit the imputer object to our matrix of features X. Since we used :, it selects all rows, and 1:3 selects the second and the third columns. Why? Because in Python indexing starts from 0, so 1 means the second column, and the upper bound is excluded. If we wanted to include the fourth column as well, we would have written 1:4.

Now we will just replace the missing values with the mean of the column using the transform method.


X[:, 1:3] = imputer.transform(X[:, 1:3])
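On newer scikit-learn releases (0.22 and later) the Imputer class is no longer available. A rough equivalent, sketched here under the assumption that missing values are stored as NaN, uses SimpleImputer from sklearn.impute. It has no axis parameter and always imputes column by column. The array below is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric feature matrix with one missing value
# in each of columns 1 and 2.
X_demo = np.array([
    [1.0, 44.0, 72000.0],
    [2.0, np.nan, 48000.0],
    [3.0, 30.0, np.nan],
])

# Replace each NaN with the mean of its column.
imputer_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
X_demo[:, 1:3] = imputer_demo.fit_transform(X_demo[:, 1:3])

print(X_demo)
```

The missing entry in column 1 becomes 37.0 (the mean of 44 and 30) and the missing entry in column 2 becomes 60000.0 (the mean of 72000 and 48000).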

Step 4: Encoding categorical data

Sometimes our data is in qualitative form, that is, we have text as our data, and the text falls into categories. It is harder for machines to understand and process text than numbers, since the models are based on mathematical equations and calculations. Therefore, we have to encode the categorical data.

This is an example of categorical data. In the first column, the data is in text form. We can see that there are five categories — Very, Somewhat, Not very, Not at all, Not sure — and hence the name categorical data.

To do this, we will again import from the scikit-learn library that we used previously. There’s a class in the library called LabelEncoder which we will use for the task.

from sklearn.preprocessing import LabelEncoder

As I have mentioned before, the next step is usually to create an object of that class. We will call our object labelencoder_X.

labelencoder_X = LabelEncoder()

For our task, the LabelEncoder class has a method called fit_transform, which is what we will use. Once again, just as before, we will index X with two selectors: a row selection and a column selection.

X[:,0] = labelencoder_X.fit_transform(X[:,0])

The above code will select all the rows (because of :) of the first column (because of 0), fit the LabelEncoder to it and transform the values. The values will then immediately be encoded to 0, 1, 2, 3… accordingly.

The text has been replaced by numbers as we wanted. But if there are more than two categories, we may have created a new problem along the way. As we keep assigning different integers to different categories, it may create confusion. If one category is assigned 0 and another category is assigned 2, and since 2 is greater than 0, are we trying to imply that the category assigned 2 is greater? Of course not! So this strategy might as well defeat its own purpose.
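As a small illustration, the sketch below encodes the five example categories on their own (the list of answers is made up). LabelEncoder assigns integers by the sorted order of the labels, not by their meaning.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical survey answers using the five categories from the example.
answers = ["Very", "Somewhat", "Not very", "Not at all", "Not sure", "Very"]

encoder = LabelEncoder()
codes = encoder.fit_transform(answers)

# classes_ holds the labels in sorted order; each label's position is its code.
print(list(encoder.classes_))
# ['Not at all', 'Not sure', 'Not very', 'Somewhat', 'Very']
print(list(codes))
# [4, 3, 2, 0, 1, 4]
```

Note that "Not at all" gets 0 and "Not sure" gets 1 purely because of alphabetical order.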

So instead of having one column with n number of categories, we will use n number of columns with only 1s and 0s to represent whether the category occurs or not.

Dummy Encoding

To accomplish the task, we will import yet another class, called OneHotEncoder.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

Next we will create an object of that class, as usual, and assign it to onehotencoder. OneHotEncoder takes an important parameter called categorical_features (available in scikit-learn versions before 0.22), which takes the indices of the columns that hold categories.

onehotencoder = OneHotEncoder(categorical_features =[0])

The code above will select the first column to OneHotEncode the categories.

Just as we used fit_transform for LabelEncoder, we will use it for OneHotEncoder as well but also have to additionally include toarray().

X = onehotencoder.fit_transform(X).toarray()

If you check your dataset now, all your categories will have been encoded to 0s and 1s.
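On newer scikit-learn releases (0.22 and later) the categorical_features parameter is no longer available. One common replacement, sketched here on a made-up array, is to wrap OneHotEncoder in a ColumnTransformer that names the column to encode and passes the rest through. In these versions OneHotEncoder accepts strings directly, so the LabelEncoder step is not needed for the features.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature matrix: a categorical column followed by a numeric one.
X_demo = np.array([
    ["France", 44.0],
    ["Spain", 27.0],
    ["France", 30.0],
], dtype=object)

# One-hot encode column 0 and keep the remaining columns unchanged.
# sparse_threshold=0 asks for a dense array, much like toarray().
ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = np.asarray(ct.fit_transform(X_demo), dtype=float)

print(X_encoded)
```

The first two output columns are the France and Spain indicators (categories are sorted), followed by the untouched numeric column.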

Step 5: Splitting the Dataset into Training set and Test Set

Now we need to split our dataset into two sets: a training set and a test set. We will train our machine learning models on the training set, i.e. the models will try to learn any correlations in the training set, and then we will test the models on the test set to check how accurately they can predict. A general rule of thumb is to allocate 80% of the dataset to the training set and the remaining 20% to the test set. For this task, we will import train_test_split from the model_selection module of scikit-learn.

from sklearn.model_selection import train_test_split

Now to build our training and test sets, we will create 4 sets: X_train (the training part of the matrix of features), X_test (the test part of the matrix of features), Y_train (the training part of the dependent variable, associated with the X_train set and therefore sharing its indices) and Y_test (the test part of the dependent variable, associated with the X_test set and therefore sharing its indices). We will assign to them the result of train_test_split, which takes the arrays (X and Y) and test_size as parameters. A test_size of 0.5, meaning 50%, would split the dataset in half, and 0.25 would mean 25%. Since a common choice is to allocate 20% of the dataset to the test set, it is usually set to 0.2.

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2)
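The sketch below runs the same split on a small made-up array so the resulting shapes are easy to check. The random_state argument is an addition that is not in the snippet above; it only makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 observations with 2 features each.
# Row i is [2*i, 2*i + 1] and its target is i, so alignment is easy to verify.
X_demo = np.arange(20).reshape(10, 2)
Y_demo = np.arange(10)

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
print(Y_tr.shape, Y_te.shape)  # (8,) (2,)

# Each feature row still lines up with its own target after shuffling.
print((X_tr[:, 0] // 2 == Y_tr).all())  # True
```

With 10 observations and test_size=0.2, 8 rows go to training and 2 to testing, and X and Y are shuffled together.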

Step 6: Feature Scaling

The final step of data preprocessing is to apply the very important feature scaling. But what is it? It is a method used to standardize the range of the independent variables, or features, of the data.

A lot of machine learning models are based on Euclidean distance, which for two points (x1, y1) and (x2, y2) is the square root of (x2-x1) squared plus (y2-y1) squared. If the values in one column (x) are much higher than the values in another column (y), (x2-x1) squared will give a far greater value than (y2-y1) squared. So clearly, one squared difference dominates the other. In the machine learning equations, the smaller squared difference will be treated almost as if it does not exist. We do not want that to happen. That is why it is necessary to transform all our variables onto the same scale.

There are several ways of scaling the data. One of them is standardization, which subtracts the column mean from each value and divides the result by the column's standard deviation. For every observation of the selected column, our program will apply this formula and fit the value to a common scale.

To accomplish the job, we will import the class StandardScaler from the scikit-learn preprocessing module and, as usual, create an object of that class.
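A quick numeric sketch shows how one column can dominate the distance. The two observations below (a salary and an age) are made-up values chosen only to show the effect.

```python
import math

# Two hypothetical observations as (salary, age); the numbers are invented.
p1 = (48000.0, 27.0)
p2 = (72000.0, 44.0)

salary_sq = (p2[0] - p1[0]) ** 2  # squared difference on the large-valued column
age_sq = (p2[1] - p1[1]) ** 2     # squared difference on the small-valued column

# Euclidean distance: square root of the sum of squared differences.
distance = math.sqrt(salary_sq + age_sq)

print(salary_sq)  # 576000000.0
print(age_sq)     # 289.0
print(distance)
```

The distance comes out at roughly 24000, almost exactly the salary difference alone; the 17-year age gap barely registers.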

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()

Now we will fit and transform our X_train set. It is important to note that we fit the StandardScaler object on the training set only and then transform it, whereas the test set is simply transformed using the scaler that was already fitted on the training set. That puts all the data on the same standardized scale.

X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
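The following minimal sketch, on made-up single-column data, shows what fitting on the training set and only transforming the test set means in practice: the test values are scaled with the mean and standard deviation learned from the training values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature training and test sets.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_)         # [2.]
print(train_scaled.mean())  # approximately 0
print(train_scaled.std())   # approximately 1
print(test_scaled)          # (4 - 2) / std of the training column
```

The scaled training column has mean 0 and standard deviation 1, while the test value lands outside that range because it was not part of the fit.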

These are the 6 general steps of preprocessing data before using it for machine learning. Depending on the condition of your dataset, you may or may not have to go through all of these steps.

© Amitabha Dey. All rights reserved.