End-to-End Machine Learning Project Tutorial — Part 1

Covering every step from data collection to model deployment


Pre-requisites and Resources

Data Collection and Problem Statement

!wget "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Exploratory Data Analysis with Pandas and NumPy

##checking the data info
data.info()
##checking for all the null values
data.isnull().sum()
##summary statistics of quantitative variables
data.describe()
##looking at horsepower box plot
sns.boxplot(x=data['Horsepower'])
##imputing the values with median
median = data['Horsepower'].median()
data['Horsepower'] = data['Horsepower'].fillna(median)
##category distribution
data["Cylinders"].value_counts() / len(data)
##pairplots to get an intuition of potential correlations
sns.pairplot(data[["MPG", "Cylinders", "Displacement", "Weight", "Horsepower"]], diag_kind="kde")
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["Cylinders"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]
##checking for cylinder category distribution in training set
strat_train_set['Cylinders'].value_counts() / len(strat_train_set)
strat_test_set["Cylinders"].value_counts() / len(strat_test_set)
##converting integer classes to countries in Origin column
train_set['Origin'] = train_set['Origin'].map({1: 'India', 2: 'USA', 3 : 'Germany'})
##one hot encoding
train_set = pd.get_dummies(train_set, prefix='', prefix_sep='')
## testing new variables by checking their correlation w.r.t. MPG
data['displacement_on_power'] = data['Displacement'] / data['Horsepower']
data['weight_on_cylinder'] = data['Weight'] / data['Cylinders']
data['acceleration_on_power'] = data['Acceleration'] / data['Horsepower']
data['acceleration_on_cyl'] = data['Acceleration'] / data['Cylinders']
corr_matrix = data.corr()
corr_matrix['MPG'].sort_values(ascending=False)

Data Preparation using Sklearn

##onehotencoding the categorical values
from sklearn.preprocessing import OneHotEncoder
data_cat = data[["Origin"]]  # the categorical column
cat_encoder = OneHotEncoder()
data_cat_1hot = cat_encoder.fit_transform(data_cat)
data_cat_1hot  # returns a sparse matrix
##handling missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

Combined Pipeline for both Numerical and Categorical columns
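The later steps call a `pipeline_transformer` function that is not shown in this excerpt. A minimal sketch of what it could look like, assuming median imputation plus standard scaling for the numeric attributes and one-hot encoding for the already-mapped `Origin` strings:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def pipeline_transformer(df):
    # numeric columns: median imputation followed by scaling;
    # categorical Origin column: one-hot encoding
    num_attrs = [c for c in df.columns if c != "Origin"]
    num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attrs),
        ("cat", OneHotEncoder(), ["Origin"]),
    ])
    return full_pipeline.fit_transform(df)
```

In production you would fit the pipeline once on the training data and reuse it with `transform`; fitting inside the function is a simplification for the tutorial flow.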

Final Automation

##preprocess the Origin column in data
def preprocess_origin_cols(df):
    df["Origin"] = df["Origin"].map({1: "India", 2: "USA", 3: "Germany"})
    return df
##from raw data to processed data in 2 steps
preprocessed_df = preprocess_origin_cols(data)
prepared_data = pipeline_transformer(preprocessed_df)

Selecting and Training Machine Learning Models

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(prepared_data, data_labels)
##testing the predictions with the first 5 rows
sample_data = data.iloc[:5]
sample_labels = data_labels.iloc[:5]
sample_data_prepared = pipeline_transformer(sample_data)
print("Prediction of samples: ", lin_reg.predict(sample_data_prepared))
from sklearn.metrics import mean_squared_error

mpg_predictions = lin_reg.predict(prepared_data)
lin_mse = mean_squared_error(data_labels, mpg_predictions)
lin_rmse = np.sqrt(lin_mse)

Cross-Validation and Hyperparameter Tuning using Sklearn
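A typical cross-validation step at this point scores a model over several folds rather than trusting one training-set error. A sketch with `cross_val_score` (the random `X`, `y` below are stand-ins; in the tutorial you would pass `prepared_data` and `data_labels`):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# stand-ins for prepared_data / data_labels from the steps above
rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = X @ np.array([1.0, 2.0, 3.0])

# sklearn's scoring convention is "higher is better", so MSE is
# reported as a negative number; negate it before taking the sqrt
tree_reg = DecisionTreeRegressor(random_state=42)
scores = cross_val_score(tree_reg, X, y,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
print("Mean RMSE:", tree_rmse_scores.mean())
```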

Fine-Tuning Hyperparameters
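Since a later snippet reads `grid_search.best_estimator_`, the tuning step presumably used `GridSearchCV`. One possible shape, with an illustrative parameter grid over a random forest (again `X`, `y` are stand-ins for `prepared_data` and `data_labels`):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# stand-in training data; substitute prepared_data and data_labels
rng = np.random.default_rng(42)
X = rng.random((100, 4))
y = X @ np.array([1.0, 2.0, 3.0, 4.0])

# illustrative grid; widen it for a real search
param_grid = {
    "n_estimators": [10, 30],
    "max_features": [2, 4],
}
forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid,
                           scoring="neg_mean_squared_error",
                           cv=3, return_train_score=True)
grid_search.fit(X, y)
print(grid_search.best_params_)
```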

Check Feature Importance

# feature importances
feature_importances = grid_search.best_estimator_.feature_importances_

Evaluate the Entire System
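The final check scores the tuned model once on the held-out test set. A sketch of that evaluation, using synthetic stand-ins where the tutorial would use `grid_search.best_estimator_` and the prepared `strat_test_set`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# stand-ins for the prepared features and MPG labels
rng = np.random.default_rng(42)
X = rng.random((200, 4))
y = X @ np.array([1.0, 2.0, 3.0, 4.0])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# in the tutorial this model would be grid_search.best_estimator_
final_model = RandomForestRegressor(random_state=42).fit(X_train, y_train)
final_predictions = final_model.predict(X_test)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print("Test RMSE:", final_rmse)
```

The test set is touched exactly once, here at the very end, so this RMSE is an unbiased estimate of how the system will perform on unseen data.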

Data Science with Harshit — My YouTube Channel

Web & Data Science Instructional Designer | YouTuber | Writer https://www.youtube.com/c/DataSciencewithHarshit
