Machine Learning Pipeline
The initial process in any machine learning implementation is to understand the data, interpret the hidden information, and visualize and engineer the features that will be used by the machine learning model
A few things to consider:
– What questions do you want to answer or prove true/false?
– What kind of data do you have? Numeric, categorical, text, image? How are you going to treat them?
– Do you have any missing values, wrong formats, etc.?
– How is the data spread? Do you have any outliers? How are you going to deal with them?
– Which features are important? Can we add or remove features to get more from the data?
Data Wrangling
– Understanding the data
– Getting basic summary statistics
– Handling missing values
– Handling outliers
– Typecasting and transformation
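A minimal sketch of these wrangling steps with pandas, assuming a small hypothetical dataframe with age, salary, and signup_date columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 29],
    "salary": [50000, 64000, 58000, np.nan, 1_000_000],  # last value is an outlier
    "signup_date": ["2021-01-05", "2021-02-10", "2021-02-11", "2021-03-01", "2021-03-02"],
})

# Basic summary statistics and missing-value counts
print(df.describe())
print(df.isna().sum())

# Handle missing values: impute numeric columns with the median
df["age"] = df["age"].fillna(df["age"].median())
df["salary"] = df["salary"].fillna(df["salary"].median())

# Handle outliers: cap values outside 1.5 * IQR
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df["salary"] = df["salary"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Typecasting and transformation
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["age"] = df["age"].astype(int)
```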
Data Visualization
– Univariate Analysis: histogram, distribution (distplot, boxplot, violin)
– Multivariate Analysis: scatter plot, pair plot, etc.
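A minimal sketch of these plots with seaborn on synthetic data (note that seaborn's older distplot has been superseded by histplot in recent versions):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(35, 10, 200),
    "income": rng.normal(60_000, 15_000, 200),
    "group": rng.choice(["A", "B"], 200),
})

# Univariate analysis
sns.histplot(df["age"], kde=True)            # histogram + distribution
plt.show()
sns.boxplot(x="group", y="income", data=df)  # boxplot per group
plt.show()
sns.violinplot(x="group", y="income", data=df)
plt.show()

# Multivariate analysis
sns.scatterplot(x="age", y="income", hue="group", data=df)
plt.show()
sns.pairplot(df, hue="group")                # pairwise relationships
plt.show()
```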
Feature Engineering
Why Feature Engineering?
– Better representation of data
– Better performing models
– Essential for model building and evaluation
– More flexibility on data types
– Emphasis on the business and domain
Types of data for feature engineering range from numerical and categorical to text, temporal, and image data
Numerical data
Can be used in raw form
Rounding
Counts, e.g., the number of times a song has been listened to by each user
Binarization, e.g., instead of a count we can use a ‘0’ or ‘1’ value to state whether a song has been listened to by a user (for a recommender system)
Binning, e.g., categorize users based on some age groups
Interaction or combination, for example by using polynomial features
Transformation, e.g., log transform, polynomial transform
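A minimal sketch of these numeric transformations, assuming hypothetical listen_count and age columns:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({
    "listen_count": [0, 3, 12, 0, 57],   # hypothetical play counts per user
    "age": [15, 22, 38, 47, 64],
})

# Rounding (to the nearest 10)
df["listen_count_rounded"] = df["listen_count"].round(-1)

# Binarization: has the user listened to the song at all?
df["listened"] = (df["listen_count"] > 0).astype(int)

# Binning: categorize users into age groups
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 120],
                         labels=["teen", "young adult", "adult", "senior"])

# Log transform (log1p handles zero counts gracefully)
df["listen_count_log"] = np.log1p(df["listen_count"])

# Interaction / polynomial features on the raw numeric columns
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[["listen_count", "age"]])
```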
Categorical data
Transform the data into a nominal feature, e.g., for movie genre you can have {0: ‘action’, 1: ‘thriller’, 2: ‘drama’, 3: ‘horror’, 4: ‘comedy’, 5: ‘family’, 6: ‘other’}
Transform into an ordinal value, e.g., similar to the above, but there is an order in which the category or genre is introduced in the data
Encoding:
– Use dummy encoding
– Transform a categorical feature with m distinct labels into m-1 binary features
Consider a dataframe with a categorical country column containing values such as USA and Canada. After the dummy encoding scheme, each value becomes its own binary column (is_USA, is_Canada, ...)
If we drop the first column (is_USA) or the last (is_Canada), will it destroy the dataset?
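A minimal sketch of dummy encoding with pandas on a hypothetical country column; dropping one of the m columns loses no information, because the dropped category is implied whenever all remaining columns are 0:

```python
import pandas as pd

df = pd.DataFrame({"country": ["USA", "Canada", "Mexico", "USA"]})

# m distinct labels -> m binary columns (one-hot encoding)
one_hot = pd.get_dummies(df["country"], prefix="is")

# Dummy encoding: drop one column to keep only m-1 binary features.
# A row with all zeros corresponds to the dropped category.
dummies = pd.get_dummies(df["country"], prefix="is", drop_first=True)
print(one_hot)
print(dummies)
```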
Feature Scaling
Using the features’ raw values might make models biased towards features with very high magnitudes
– Outliers will skew the algorithm
– Affects machine learning algorithms that use the magnitude of features, e.g., regression
Scikit-learn’s preprocessing module provides three commonly used feature scalers: StandardScaler, MinMaxScaler, and RobustScaler
The standard scaler (aka Z-score scaling) removes the mean and scales the variance to 1
The MinMax scaler scales each feature to the range [0, 1] using its minimum and maximum values
– Be careful with outliers when using the MinMax scaler
The robust scaler uses statistical measures such as the median and percentiles to scale the data
– IQR (inter-quartile range) is the difference between the 75th percentile and the 25th percentile
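A minimal sketch comparing the three scalers on a single feature that contains an outlier:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

print(StandardScaler().fit_transform(X))   # zero mean, unit variance
print(MinMaxScaler().fit_transform(X))     # squeezed into [0, 1]; the outlier dominates
print(RobustScaler().fit_transform(X))     # centered on the median, scaled by IQR
```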
Filter Methods
– Based on metrics like correlation or the features’ own values, and does not depend on the results of any model
– Popular methods are threshold-based and statistical methods
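A minimal sketch of a simple filter method: rank features by their absolute correlation with the target and keep those above a (hypothetical) cut-off, without training any model:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "f4"])
y = 2 * X["f1"] - X["f3"] + rng.normal(scale=0.5, size=100)

# Absolute correlation of each feature with the target
corr = X.apply(lambda col: col.corr(y)).abs()

# Keep features whose correlation exceeds the chosen threshold
selected = corr[corr > 0.3].index.tolist()
print(corr)
print(selected)
```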
Wrapper Methods
– Use a recursive approach to build multiple models on feature subsets and select the best subset. RFE (Recursive Feature Elimination) from sklearn.feature_selection is one such example
– Popular methods are ANOVA and chi-square tests
– Utilize a regressor/classifier and cross-validation
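A minimal sketch of a wrapper method using RFE from sklearn.feature_selection with a linear regression estimator on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Recursively fit the estimator and drop the weakest feature each round
rfe = RFE(estimator=LinearRegression(), n_features_to_select=4)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected feature subset
print(rfe.ranking_)   # rank 1 = selected
```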
Embedded Methods
– Use machine learning algorithms like random forests, decision trees and ensemble methods to rank and score features
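A minimal sketch of an embedded method, using a random forest's built-in feature importances to rank and score features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importance scores come out of the fitted model itself
for i, score in enumerate(forest.feature_importances_):
    print(f"feature {i}: {score:.3f}")
```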
Threshold Methods
– You can analyze the features’ variance
– Features with low variance, i.e., mostly constant across all observations, can be removed
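A minimal sketch of a threshold method using scikit-learn's VarianceThreshold to drop a constant feature:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0, 1.0, 2.1],
    [0, 1.2, 0.4],
    [0, 0.9, 3.3],
    [0, 1.1, 1.8],
])  # the first column is constant across all observations

selector = VarianceThreshold(threshold=0.0)  # remove zero-variance features
X_reduced = selector.fit_transform(X)

print(selector.get_support())  # [False  True  True]
print(X_reduced.shape)         # (4, 2)
```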
Dimensionality Reduction
Dealing with a lot of features can lead to issues like model overfitting, overly complex models, and many more problems that all roll up into what is called the curse of dimensionality
Dimensionality reduction is the process of reducing the total number of features in our feature set using strategies like feature selection or feature extraction.
A very popular technique for dimensionality reduction is called PCA (principal component analysis)
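A minimal sketch of PCA with scikit-learn, scaling the features first because PCA is sensitive to feature magnitudes:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 original features onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
```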
REGRESSION
We build regression models to explain and to predict phenomena
Regression Model
– Continuous (Multiple Regression)
• Linear and non-linear
– Discrete (Logistic Regression)
• Binary and multinomial
Regression analysis attempts to explain the influence that input (independent) variables have on the outcome (dependent) variable
Linear Multiple Regression
Models the relationship between some input variables and a continuous outcome variable
– Assumption is that the relationship is linear
– Transformations can be used to achieve a linear relationship (recall the transformations we applied to the dataframe earlier)
Multiple Linear Regression Use Case
Real estate example
– Predict residential home prices
• Possible inputs – living area, #bathrooms, #bedrooms, lot size, property taxes
Demand forecasting example
– Restaurant predicts quantity of food needed
• Possible inputs – weather, day of week, etc.
Medical example
– Analyze effect of proposed radiation treatment
• Possible inputs – radiation treatment duration, frequency
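A minimal sketch of a multiple linear regression on synthetic data, using hypothetical real-estate-style inputs like those listed above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300
living_area = rng.uniform(50, 300, n)      # hypothetical inputs
bedrooms = rng.integers(1, 6, n)
lot_size = rng.uniform(100, 1000, n)
price = 2000 * living_area + 15000 * bedrooms + 50 * lot_size + rng.normal(0, 20000, n)

X = np.column_stack([living_area, bedrooms, lot_size])
X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.coef_, model.intercept_)            # influence of each input on price
print(r2_score(y_test, model.predict(X_test)))  # fit on held-out data
```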
Comments