# End-to-End Machine Learning Project Tutorial — Part 1

## Covering every step from Data collection to Model Deployment

The perennial question I come across regarding Data Science:

What is the best way to master Data Science? What will get me hired?

My answer remains constant: there is no substitute for working on portfolio-worthy projects. Even after clearing the TensorFlow Developer Certificate Exam, I’d say that no certificate or course can prove your competency the way projects can — projects that showcase your research, programming skills, mathematical background, and more.

In my post on how to build an effective Data Science portfolio, I shared many project ideas and other tips for preparing a kickass portfolio. This post is dedicated to one of those ideas: end-to-end data science/ML projects.

# Agenda

This tutorial is intended to walk you through all the major steps involved in completing an End-to-End Machine Learning project. For this project, I’ve chosen a supervised learning regression problem.

Major topics covered:

- **Pre-requisites and Resources**
- **Data Collection and Problem Statement**
- **Exploratory Data Analysis with Pandas and NumPy**
- **Data Preparation using Sklearn**
- **Selecting and Training a few Machine Learning Models**
- **Cross-Validation and Hyperparameter Tuning using Sklearn**
- **Deploying the Final Trained Model on Heroku via a Flask App**

Let’s start building…

**Pre-requisites and Resources**

This project and tutorial expect familiarity with Machine Learning algorithms, Python environment setup, and common ML terminologies. Here are a few resources to get you started:

- Read the first 2–3 chapters of The Hundred-Page Machine Learning Book: http://themlbook.com/wiki/doku.php
- List of Tasks for almost every Machine Learning Project — keep referring to this list while working on this (or any other) ML project.
- You need a Python Environment set up — a virtual environment dedicated to this project.
- Familiarity with Jupyter Notebook.

That’s it, make sure you have an understanding of these concepts and tools and you’re ready to go!

**Data Collection and Problem Statement**

The first step is to get your hands on the data. If you already have access to data (as in most product-based companies), then the first step is to define the problem you want to solve. We don’t have the data yet, so we are going to collect it first.

We are using the Auto MPG dataset from the UCI Machine Learning Repository (linked below).

The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.

Once you have downloaded the data, move it to your project directory, activate your virtualenv, and start the local Jupyter server.

- You can also download the data into your project directly from the notebook using `wget`:

```
!wget "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
```

- The next step is to load this `.data` file into a pandas dataframe. Make sure you have pandas and the other general-purpose libraries installed, then import them:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

- Read and load the file into a dataframe using the `read_csv()` method:

- Looking at a few rows of the dataframe and reading the description of each attribute from the website helps you define the problem statement.
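The loading step above can be sketched as follows. Since the notebook reads the downloaded `auto-mpg.data` file, here I parse a couple of sample rows from a string so the snippet runs on its own; the column names come from the dataset description on the UCI page:

```python
import io
import pandas as pd

# Column names, as listed in the dataset description on the UCI page
cols = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
        'Acceleration', 'Model Year', 'Origin']

# Two sample rows in the raw file's format (the car name follows a tab);
# with the real file, replace io.StringIO(sample) with './auto-mpg.data'
sample = ('18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
          '25.0   4   98.00      ?          2046.      19.0   71  1\t"ford pinto"\n')

data = pd.read_csv(io.StringIO(sample), names=cols,
                   na_values='?',   # '?' marks missing values in this file
                   comment='\t',    # drop the tab-separated car name
                   sep=' ', skipinitialspace=True)
print(data.head())
```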

**Problem Statement —** The data contains the MPG (Miles per Gallon) variable, which is continuous and tells us about the fuel efficiency of vehicles from the 70s and 80s.

Our aim is to predict the MPG value for a vehicle, given the other attributes of that vehicle.

**Exploratory Data Analysis with Pandas and NumPy**

For this rather simple dataset, the exploration is broken down into a series of steps:

**1. Check the data types of the columns**

```python
## checking the data info
data.info()
```

**2. Check for null values**

```python
## checking for all the null values
data.isnull().sum()
```

The horsepower column has 6 missing values. We’ll have to study the column a bit more.

**3. Check for outliers in the Horsepower column**

```python
## summary statistics of quantitative variables
data.describe()

## looking at the Horsepower box plot
sns.boxplot(x=data['Horsepower'])
```

Since there are a few outliers, we can use the median of the column to impute the missing values, using the pandas `median()` method:

```python
## imputing the missing values with the median
median = data['Horsepower'].median()
data['Horsepower'] = data['Horsepower'].fillna(median)
data.info()
```

**4. Look at the category distribution in the categorical columns**

```python
## category distribution
data["Cylinders"].value_counts() / len(data)
data['Origin'].value_counts()
```

The two categorical columns are Cylinders and Origin, which have only a few categories of values each. Looking at the distribution of values among these categories tells us how the data is distributed.

**5. Plot for correlation**

```python
## pair plots to get an intuition of potential correlations
sns.pairplot(data[["MPG", "Cylinders", "Displacement", "Weight", "Horsepower"]], diag_kind="kde")
```

The pair plot gives you a brief overview of how each variable behaves with respect to every other variable.

For example, the MPG column (our target variable) is negatively correlated with the Displacement, Weight, and Horsepower features.

**6. Set aside the test data set**

This is one of the first things we should do as we want to test our final model on unseen/unbiased data.

There are many ways to split the data into training and testing sets, but we want our test set to represent the overall population, not just a few specific categories. Thus, instead of using the simple and common `train_test_split()` method from sklearn, we use **stratified sampling**.

Stratified Sampling — We create homogeneous subgroups called strata from the overall population and sample the right number of instances from each stratum to ensure that the test set is representative of the overall population.

In task 4, we saw how the data is distributed over each category of the Cylinders column. We use the Cylinders column to create the strata:

```python
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["Cylinders"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
```

Checking the Cylinders category distribution in the training set:

```python
## checking for cylinder category distribution in training set
strat_train_set['Cylinders'].value_counts() / len(strat_train_set)
```

And in the testing set:

```python
strat_test_set["Cylinders"].value_counts() / len(strat_test_set)
```

You can compare these results with the output of `train_test_split()` to find out which one produces better splits.

**7. Checking the Origin Column**

The Origin column contains information about the origin of the vehicle and has discrete values that look like country codes.

To make it more explicit (and to add a little complication), I converted these numbers to strings:

```python
## converting integer classes to countries in Origin column
train_set['Origin'] = train_set['Origin'].map({1: 'India', 2: 'USA', 3: 'Germany'})
train_set.sample(10)
```

We’ll have to preprocess this categorical column by one-hot encoding these values:

```python
## one-hot encoding
train_set = pd.get_dummies(train_set, prefix='', prefix_sep='')
train_set.head()
```

**8. Testing for new variables — Analyze the correlation of each variable with the target variable**

```python
## testing new variables by checking their correlation w.r.t. MPG
data['displacement_on_power'] = data['Displacement'] / data['Horsepower']
data['weight_on_cylinder'] = data['Weight'] / data['Cylinders']
data['acceleration_on_power'] = data['Acceleration'] / data['Horsepower']
data['acceleration_on_cyl'] = data['Acceleration'] / data['Cylinders']

corr_matrix = data.corr()
corr_matrix['MPG'].sort_values(ascending=False)
```

We found that acceleration_on_power and acceleration_on_cyl are two new variables that turned out to be more positively correlated with MPG than the original variables.

This brings us to the end of the exploratory analysis. We are ready to proceed to the next step: preparing the data for our Machine Learning models.

# Data Preparation using Sklearn

One of the important aspects of Data Preparation is that we have to keep automating our steps in the form of functions and classes so that it is easier for us to integrate the methods and pipelines into the main product.

Here are the major tasks to prepare the data and encapsulate functionalities:

**1. Preprocessing the Categorical Attribute — Converting the Origin column**

```python
## one-hot encoding the categorical values
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
data_cat_1hot = cat_encoder.fit_transform(data_cat)
data_cat_1hot            # returns a sparse matrix
data_cat_1hot.toarray()[:5]
```

**2. Data Cleaning — Imputer**

We’ll be using the SimpleImputer class from the impute module of the Sklearn library:

```python
## handling missing values
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
imputer.fit(num_data)
```

**3. Attribute Addition — Adding custom transformation**

To make changes to datasets and create new variables, sklearn offers the BaseEstimator class; by extending it, we can define our own class to develop new features.

We have created a class to add 2 new features as found in the EDA step above:

- acc_on_power — Acceleration divided by Horsepower
- acc_on_cyl — Acceleration divided by the number of Cylinders
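Here is a sketch of such a class, following sklearn's `BaseEstimator`/`TransformerMixin` pattern. The class name and the column indices (which assume the numeric column order Cylinders, Displacement, Horsepower, Weight, Acceleration) are illustrative:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# column indices in the numeric array; order assumed here is
# Cylinders, Displacement, Horsepower, Weight, Acceleration
acc_ix, hpower_ix, cyl_ix = 4, 2, 0

class CustomAttrAdder(BaseEstimator, TransformerMixin):
    def __init__(self, acc_on_power=True):
        self.acc_on_power = acc_on_power  # toggle the second derived feature
    def fit(self, X, y=None):
        return self  # nothing to learn from the data
    def transform(self, X):
        acc_on_cyl = X[:, acc_ix] / X[:, cyl_ix]
        if self.acc_on_power:
            acc_on_power = X[:, acc_ix] / X[:, hpower_ix]
            return np.c_[X, acc_on_power, acc_on_cyl]
        return np.c_[X, acc_on_cyl]
```

Because it implements `fit` and `transform`, this class can be dropped into a sklearn `Pipeline` like any built-in transformer.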

**4. Setting up Data Transformation Pipeline for numerical and categorical attributes**

As I said, we want to automate as much as possible. Sklearn offers a great number of classes and methods to develop such automated pipelines of data transformations.

The major transformations are to be performed on numerical columns, so let’s create the numerical pipeline using the `Pipeline` class:
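Here’s a sketch of that pipeline. To keep it self-contained, a `FunctionTransformer` stands in for the custom attribute class from the previous step, and the column order assumed is Cylinders, Displacement, Horsepower, Weight, Acceleration:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def add_custom_attrs(X):
    # acc_on_power = Acceleration / Horsepower, acc_on_cyl = Acceleration / Cylinders
    return np.c_[X, X[:, 4] / X[:, 2], X[:, 4] / X[:, 0]]

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),           # fill missing values
    ('attrs_adder', FunctionTransformer(add_custom_attrs)),  # add derived features
    ('std_scaler', StandardScaler()),                        # scale every attribute
])
```

In the project, the `attrs_adder` step is the custom attribute class built earlier rather than this stand-in function.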

In the above code snippet, we have cascaded a set of transformations:

- Imputing missing values — using the SimpleImputer class discussed above.
- Custom attribute addition — using the custom attribute class defined above.
- Standard scaling of each attribute — it is always good practice to scale the values before feeding them to the ML model, using the `StandardScaler` class.

## Combined Pipeline for both Numerical and Categorical columns

We have the numerical transformation ready. The only categorical column we have is Origin, for which we need to one-hot encode the values.

Here’s how we can use the `ColumnTransformer` class to capture both of these tasks in one go.

To the instance, provide the numerical pipeline object created from the function defined above, and then call the `OneHotEncoder()` class to process the Origin column.
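A sketch of this combined step: to keep the snippet self-contained, the numeric pipeline here has only the imputer and scaler (in the project the custom attribute adder sits between them), and the function name follows the one used later in this post:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def pipeline_transformer(data):
    # numeric columns (Origin holds strings by this point, so it is excluded)
    num_attrs = list(data.select_dtypes(include=['float64', 'int64']).columns)
    num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        # the custom attribute adder from the earlier step would slot in here
        ('std_scaler', StandardScaler()),
    ])
    full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attrs),      # numeric pipeline on numeric columns
        ("cat", OneHotEncoder(), ["Origin"]),  # one-hot encode the Origin column
    ])
    return full_pipeline.fit_transform(data)
```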

## Final Automation

With these classes and functions defined, we now have to integrate them into a single flow which is going to be simply 2 function calls.

1. Preprocessing the Origin column to convert integers to country names:

```python
## preprocess the Origin column in data
def preprocess_origin_cols(df):
    df["Origin"] = df["Origin"].map({1: "India", 2: "USA", 3: "Germany"})
    return df
```

2. Calling the final `pipeline_transformer` function defined above:

```python
## from raw data to processed data in 2 steps
preprocessed_df = preprocess_origin_cols(data)
prepared_data = pipeline_transformer(preprocessed_df)
prepared_data
```

Voila, your data is ready to use in just 2 steps!

The next step is to start training ML models.

**Selecting and Training Machine Learning Models**

Since this is a regression problem, I chose to train the following models:

- **Linear Regression**
- **Decision Tree Regressor**
- **Random Forest Regressor**
- **SVM Regressor**

I’ll explain the flow for Linear Regression and then you can follow the same for all the others.

It’s a simple **4-step process:**

- Create an instance of the model class.
- Train the model using the `fit()` method.
- Make predictions by first passing the data through the pipeline transformer.
- Evaluate the model using Root Mean Squared Error (a typical performance metric for regression problems).

```python
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(prepared_data, data_labels)

## testing the predictions with a sample of the data
sample_data = data.iloc[:5]
sample_labels = data_labels.iloc[:5]
sample_data_prepared = pipeline_transformer(sample_data)
print("Prediction of samples: ", lin_reg.predict(sample_data_prepared))
```

**Evaluating the model:**

```python
from sklearn.metrics import mean_squared_error

mpg_predictions = lin_reg.predict(prepared_data)
lin_mse = mean_squared_error(data_labels, mpg_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
```

**RMSE for Linear regression: 2.95904**

**Cross-Validation and Hyperparameter Tuning using Sklearn**

Now, if you do the same for the Decision Tree, you’ll see that you get an RMSE of 0.0, which is not possible: there is no “perfect” Machine Learning model (we haven’t gotten there yet).

**Problem:** We are evaluating our model on the same data we trained it on. And we can’t touch the test data until we finalize the best model that is ready to go into production.

**Solution: Cross-Validation**

Scikit-Learn’s K-fold cross-validation feature randomly splits the training set into `K` distinct subsets called folds, then it trains and evaluates the model K times, picking a different fold for evaluation every time and training on the other K-1 folds.

The result is an array containing the K evaluation scores. Here’s how I did it for 10 folds:

The scoring method returns negative values to denote errors, so we have to negate the scores explicitly before taking the square root.

For the Decision Tree, this produces a list of 10 RMSE scores, and we take their average.
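The scoring described above can be sketched as follows; I use small synthetic stand-ins for `prepared_data` and `data_labels` so the snippet runs on its own:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-ins; in the project these are prepared_data and data_labels
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 2.0, 0.5, -1.0, 0.3]) + rng.normal(scale=0.1, size=100)

tree_reg = DecisionTreeRegressor(random_state=42)
scores = cross_val_score(tree_reg, X, y,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)   # scores are negative errors: negate first
print("Scores:", tree_rmse_scores)
print("Mean:", tree_rmse_scores.mean())
```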

## Fine-Tuning Hyperparameters

After testing all the models, you’ll find that RandomForestRegressor performed the best, but it still needs to be fine-tuned.

A model is like a radio with a lot of knobs to tune. You can either tune all these knobs manually or provide a range of values/combinations that you want to test.

We use GridSearchCV to find out the best combination of hyperparameters for the RandomForest model:

GridSearchCV requires you to pass a parameter grid, which is a Python dictionary with parameter names as keys mapped to the lists of values you want to test for that parameter.

We can pass the model, scoring method and cross-validation folds to it.

Train the model and it returns the best parameters and results for each combination of parameters:
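A sketch of the grid search, again with synthetic stand-in data so it runs on its own; the parameter-grid values are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# synthetic stand-ins; in the project these are prepared_data and data_labels
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 2.0, 0.5, -1.0, 0.3]) + rng.normal(scale=0.1, size=100)

# parameter names as keys, mapped to the lists of values to try
param_grid = [
    {'n_estimators': [10, 30], 'max_features': [2, 4]},
    {'bootstrap': [False], 'n_estimators': [10], 'max_features': [2, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid,
                           scoring='neg_mean_squared_error',
                           cv=5, return_train_score=True)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

After fitting, `grid_search.best_params_` holds the winning combination and `grid_search.cv_results_` holds the score for every combination tried.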

## Check Feature Importance

We can also check the feature importances by listing the features and zipping them with the best estimator’s `feature_importances_` attribute, as follows:

```python
# feature importances
feature_importances = grid_search.best_estimator_.feature_importances_
```
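The zipping step can be sketched like this, with stand-in values for the importances and attribute names (the real list would include every numeric column, the derived attributes, and the one-hot encoded Origin categories):

```python
# stand-in values; in the project, feature_importances comes from
# grid_search.best_estimator_.feature_importances_
feature_importances = [0.15, 0.05, 0.50, 0.30]
attrs = ['Cylinders', 'Displacement', 'acc_on_power', 'acc_on_cyl']

# pair each attribute with its importance and sort, highest first
ranked = sorted(zip(feature_importances, attrs), reverse=True)
print(ranked)
```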

We see that acc_on_power, the derived feature, has turned out to be the most important one.

You might want to keep iterating a few times before finalizing the best configuration.

The model is now ready with the best configuration.

# Evaluate the Entire System

It’s time to evaluate the entire system:
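The final check runs the held-out features through the same preprocessing, predicts with the best estimator, and computes RMSE. I use a synthetic train/test split here so the sketch runs on its own; in the project, the test features come from `strat_test_set` via `pipeline_transformer`, and the model is `grid_search.best_estimator_`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# synthetic stand-in data and split; see the note above for the real inputs
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.0, 2.0, 0.5, -1.0, 0.3]) + rng.normal(scale=0.1, size=120)
X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

final_model = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# evaluate the finalized model exactly once on held-out data
final_predictions = final_model.predict(X_test)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print("Test RMSE:", final_rmse)
```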

If you want to look at my complete project, here is the GitHub repository:

With that, you have your final model ready to go into production. For deployment, we save our model into a file using the `pickle` module and develop a **Flask** web service to be deployed on **Heroku**, which is covered in Part 2 of this blog.
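Saving the model with `pickle` looks like this; a LinearRegression instance stands in for the tuned model so the sketch runs on its own, and the file name is illustrative:

```python
import pickle
from sklearn.linear_model import LinearRegression

final_model = LinearRegression()  # stand-in for the tuned RandomForest model

# serialize the trained model to disk
with open("model.bin", "wb") as f_out:
    pickle.dump(final_model, f_out)

# later, in the Flask app, load it back before serving predictions
with open("model.bin", "rb") as f_in:
    loaded_model = pickle.load(f_in)
```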


# Data Science with Harshit — My YouTube Channel

But if you don’t want to wait, here is the complete tutorial series (playlist) on my YouTube channel, where you can follow along while working on this project.

With this channel, I plan to roll out a couple of series covering the entire data science space. Here is why you should be subscribing to the channel:

- These series will cover quality tutorials on each of the most-requested topics and subtopics, like Python fundamentals for Data Science.
- Explained Mathematics and derivations of why we do what we do in ML and Deep Learning.
- Podcasts with Data Scientists and Engineers at Google, Microsoft, Amazon, etc, and CEOs of big data-driven companies.
- Projects and instructions to implement the topics learned so far, plus new certifications, bootcamps, and resources to crack those certifications, like the **TensorFlow Developer Certificate Exam by Google**.