MACHINE LEARNING OVERVIEW
AI, ML, and DL
What is Artificial Intelligence?
What is Machine Learning?
What is Deep Learning?
The term "artificial intelligence" was first introduced in 1956 at a conference where researchers wanted to digitize how the human brain works
AI is the science and engineering of making computers behave in ways that, until recently, we thought required human intelligence ~ Andrew Moore
AI is a moving target, defined by capabilities that humans possess but machines do not, e.g., emotion. AI encompasses technological advances in different fields such as Machine Learning, Human-Computer Interaction, etc.
Examples of AI: Deep Blue and, to some extent, Google Home, Siri, and Alexa
Machine Learning and Deep Learning
Machine learning is the study of computer algorithms that allow computer programs to automatically improve through experience ~ Tom Mitchell
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E ~ Tom Mitchell
The goal of ML is never to make “perfect” guesses, because ML deals in domains where there is no such thing. The goal is to make guesses that are good enough to be useful.
Deep learning is a class of machine learning algorithms inspired by the structure of the human brain.
Deep learning algorithms use complex multi-layered neural networks, where the level of abstraction increases gradually by non-linear transformations of input data.
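A minimal sketch of what "multi-layered, non-linear transformations" can look like, assuming NumPy is available. The ReLU activation, the layer sizes, and the random (untrained) weights are all assumptions made for illustration; the slides do not specify any of them.

```python
import numpy as np

def relu(z):
    """Non-linear activation: negative values become zero."""
    return np.maximum(0.0, z)

def forward(x, layers):
    """Pass x through a list of (W, b) layers, with ReLU between layers."""
    a = x
    for i, (W, b) in enumerate(layers):
        z = a @ W + b  # linear transformation
        # apply the non-linearity on every layer except the last one
        a = relu(z) if i < len(layers) - 1 else z
    return a

rng = np.random.default_rng(0)
# Hypothetical network: 4 inputs -> 8 hidden -> 8 hidden -> 1 output
layers = [
    (rng.normal(size=(4, 8)), np.zeros(8)),
    (rng.normal(size=(8, 8)), np.zeros(8)),
    (rng.normal(size=(8, 1)), np.zeros(1)),
]
x = rng.normal(size=(5, 4))  # 5 samples with 4 features each
out = forward(x, layers)
print(out.shape)
```

Each hidden layer re-represents the output of the previous one, which is where the gradually increasing level of abstraction comes from. Training the weights is not shown here.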
How can a machine learn?
The samples need to be representative
The samples can include numbers, images, text, etc
Features are the important pieces of data that work as the key to solving the task
They tell the machine/program what to pay attention to
The same task can be solved using different algorithms
The accuracy or speed of getting results can be different
If the dataset quality is high and the features are chosen well, an ML-powered system can outperform humans at a given task
Machine Learning Problem
Many real-world problems are complex. Inventing specialized algorithms to solve them perfectly every time is not practical
Some examples:
How can we predict future traffic patterns at an intersection?
Is it cancer?
What is the market value of this house five years from now?
Which of these candidates is the best fit for the job?
Which of these people could be my best friend/partner?
Will a certain person like this movie or not?
How can I slice the banana to make a perfect peanut butter banana sandwich? (https://www.ethanrosenthal.com/2020/08/25/optimalpeanut-butter-and-banana-sandwiches/)
Machine Learning Algorithm
Generally categorized as supervised, unsupervised, or reinforcement learning, based on whether they are trained with human supervision and whether the training data is labeled or not
Whether or not they can learn incrementally on the fly (online versus batch learning)
Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)
In supervised learning, the training data fed to the algorithm includes the desired solutions, called labels
It models the relationship between the target prediction output and the input features, such that we can predict the output values for new data based on those relationships learned from past data
The goal is to develop a finely tuned predictor function h(x) (sometimes called the “hypothesis”) so that, given input data x about a certain domain (e.g., the square footage of a house), it predicts some value of interest h(x) (e.g., the market price of the house).
Two major categories are regression and classification
Some of the most important supervised learning algorithms include:
Support Vector Machines (SVMs)
Decision Trees and Random Forests
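As a rough illustration of a predictor function h(x) for a regression task, the sketch below fits a straight line to labeled examples with ordinary least squares. The square-footage and price figures are made-up values chosen to be exactly linear, and the choice of a linear model is an assumption; none of this comes from the slides.

```python
# Hypothetical labeled training data: square footage -> market price
square_feet = [1000.0, 1500.0, 2000.0, 2500.0]
prices = [150000.0, 200000.0, 250000.0, 300000.0]

def fit_line(xs, ys):
    """Closed-form least squares for a line y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(square_feet, prices)

def h(x):
    """The learned hypothesis: predicts a price from square footage."""
    return slope * x + intercept

print(h(1800.0))  # prediction for a house not in the training data
```

The labels (prices) are what make this supervised: the fit is driven by the gap between predicted and known outputs.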
If it quacks like a duck, waddles like a duck, and swims like a duck, then…
It can be a mallard, which is a species of duck
If it has a flat beak to catch worms and has webbed feet…
It does not need to be a duck. It can be a platypus
If it walks on four legs and has a long nose…
We need more information. It can be an elephant, but it can also be some species of mouse
In unsupervised learning, the training data is unlabeled
Unsupervised machine learning is typically tasked with finding relationships and correlations within data.
– Used mostly for pattern detection and descriptive modeling
Some of the most important unsupervised learning algorithms include:
– Visualization and dimensionality reduction
– Association rule learning
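To make "finding structure in unlabeled data" concrete, here is a small sketch of k-means clustering in plain Python. k-means is used as an assumption for illustration (it is not named in the list above), and the points and starting centroids are hypothetical.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, n_iter=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    for _ in range(n_iter):
        labels = [
            min(range(len(centroids)), key=lambda k: dist2(p, centroids[k]))
            for p in points
        ]
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        centroids = new_centroids
    return labels, centroids

# Unlabeled data: no target values, only inputs
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

No labels are supplied at any point; the grouping emerges only from distances between the samples.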
Supervised vs Unsupervised
Instance-based and Model-Based Learning
Instance-based learning: the system generalizes to new cases based on a similarity measure to known cases.
Model-based learning: the system detects patterns in the training data, builds a predictive model, and uses that model to generalize to new cases.
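The contrast between instance-based and model-based learning can be sketched on the same toy data. The nearest-neighbor rule and the one-parameter line through the origin are both assumptions chosen for brevity, and the data values are hypothetical.

```python
# Known cases (hypothetical): inputs and their known outputs
known_x = [1.0, 2.0, 3.0, 4.0]
known_y = [2.0, 4.0, 6.0, 8.0]

def predict_instance_based(x_new):
    """Instance-based: return the output of the most similar known case
    (similarity here is plain absolute distance)."""
    nearest = min(range(len(known_x)), key=lambda i: abs(known_x[i] - x_new))
    return known_y[nearest]

def fit_model(xs, ys):
    """Model-based: summarize the data as one parameter w in y = w * x
    (least squares for a line through the origin)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

w = fit_model(known_x, known_y)

def predict_model_based(x_new):
    """Use the learned parameter; the training data is no longer needed."""
    return w * x_new

x_new = 2.6
nn_prediction = predict_instance_based(x_new)
model_prediction = predict_model_based(x_new)
print(nn_prediction, model_prediction)
```

The instance-based predictor must keep all known cases around at prediction time, while the model-based one keeps only `w`.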
In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data (offline learning).
First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned.
When new data comes, you need to train a new version of the system from scratch on the full dataset (not just the new data, but also the old data), then stop the old system and replace it with the new one.
In online learning, the system is trained incrementally by feeding it data instances sequentially, individually or in mini-batches.
The system can learn about new data on the fly, as it arrives
– Great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously
– Also good for limited computing resources, and huge datasets that cannot fit in one machine’s main memory (this is called out-of-core learning)
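A minimal sketch of online learning, assuming a one-parameter model y = w * x updated by stochastic gradient descent. The simulated stream, the learning rate, and the true relation y = 3x are hypothetical choices made only for illustration.

```python
def data_stream(n):
    """Simulate instances arriving one at a time; the true relation is y = 3x."""
    for i in range(n):
        x = 1.0 if i % 2 == 0 else 2.0
        yield x, 3.0 * x

w = 0.0              # model parameter, updated as data arrives
learning_rate = 0.1  # how strongly each new instance moves the model

for x, y in data_stream(50):
    error = y - w * x
    w += learning_rate * error * x  # one gradient step on this instance
    # the instance can now be discarded; nothing is stored

print(round(w, 3))
```

Because each instance is used once and then dropped, the same loop works for datasets that do not fit in memory, which is the out-of-core case mentioned above.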
Main Challenges of Machine Learning
Machine learning involves selecting a learning algorithm and training it on some data.
“Bad data” and “bad algorithm” are the two things that can go wrong
Insufficient quantity of training data – it takes a lot of data for most machine learning algorithms to work properly
Non-representative training data - In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to.
Poor-quality data – outliers, missing values, etc.
Irrelevant features - A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves:
– Feature selection: selecting the most useful features to train on among existing features.
– Feature extraction: combining existing features to produce a more useful one.
– Feature generation: creating new features by gathering new data.
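Feature selection and feature extraction can be sketched on a few rows of housing data. The column names (`listing_id`, `total_rooms`, `households`, `price`) and the values are hypothetical, as is the judgment that `listing_id` carries no useful signal.

```python
# Hypothetical raw rows; listing_id is an arbitrary identifier
houses = [
    {"listing_id": 101, "total_rooms": 6, "households": 2, "price": 300},
    {"listing_id": 102, "total_rooms": 12, "households": 3, "price": 410},
    {"listing_id": 103, "total_rooms": 5, "households": 1, "price": 520},
]

def engineer(row):
    """Build the feature set used for training from one raw row."""
    return {
        # Feature extraction: combine two existing features into one
        "rooms_per_household": row["total_rooms"] / row["households"],
        # Feature selection: keep the target, drop the irrelevant listing_id
        "price": row["price"],
    }

features = [engineer(row) for row in houses]
print(features)
```

Feature generation, the third item, would mean collecting additional data (for example a new column) and is not shown.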
Overfitting the Training Data - the model performs well on the training data, but it does not generalize well.
Underfitting the Training Data - occurs when your model is too simple to learn the underlying structure of the data
– The main options to fix this problem are:
• Selecting a more powerful model, with more parameters
• Feeding better features to the learning algorithm (feature engineering)
• Reducing the constraints on the model
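One way to see underfitting and overfitting side by side is to fit polynomials of different degrees to the same small dataset. This is a sketch assuming NumPy; the data (a quadratic plus a fixed alternating offset standing in for noise) and the degrees tried are hypothetical.

```python
import numpy as np

# Hypothetical training data: quadratic trend plus a fixed alternating offset
x_train = np.arange(6, dtype=float)
noise = np.array([0.5, -0.5, 0.5, -0.5, 0.5, -0.5])
y_train = x_train ** 2 + noise

# Validation points lie between the training points, with no offset
x_val = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
y_val = x_val ** 2

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 2, 5):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(f"degree {degree}: train MSE {results[degree][0]:.4f}, "
          f"validation MSE {results[degree][1]:.4f}")
```

Degree 1 is too simple for a quadratic trend, so its training error stays high (underfitting). Degree 5 has enough parameters to pass through all six training points, so a near-zero training error says little about how it behaves between them; comparing the two error columns is how that gap shows up.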
Model Testing and Validation
Helps to determine how well the model generalizes to new cases
Achieved by splitting the data into a training set and a validation/test set.
Evaluating the model on the test set helps to assess how well it will perform on new instances of data
Cross-validation is used to evaluate several models
– The training set is split into complementary subsets, and each model is trained against a different combination of these subsets and validated against the remaining parts.
– The selected model is then trained on the full training set, and the generalization error is measured on the test set.
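The splitting step of cross-validation can be sketched in plain Python. The helper `k_fold_indices`, the target values, and the stand-in "model" (which just predicts the mean of its training targets) are hypothetical; a held-out test set would be kept separate and is not shown.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # spread any remainder over the first folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def mean_model_mse(train_y, val_y):
    """Stand-in model: predict the training mean, score with MSE."""
    prediction = sum(train_y) / len(train_y)
    return sum((y - prediction) ** 2 for y in val_y) / len(val_y)

# Hypothetical training-set targets
targets = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 6.0, 8.0, 7.0, 9.0]

scores = []
for train_idx, val_idx in k_fold_indices(len(targets), k=3):
    train_y = [targets[i] for i in train_idx]
    val_y = [targets[i] for i in val_idx]
    scores.append(mean_model_mse(train_y, val_y))

cv_score = sum(scores) / len(scores)  # average validation error across folds
print(scores, cv_score)
```

Every sample is used for validation exactly once, so the averaged score uses all of the training data without ever scoring a model on points it was fitted to.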
Machine Learning Steps
The main steps of a machine learning project include:
1. Look at the big picture (problem definition).
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.
Machine Learning Pipeline
Essential Python Libraries for Data Science
Python data ecosystem libraries commonly used in data science include:
Matplotlib (and its cousins)
IPython and Jupyter
NumPy: Short for Numerical Python
Provides the data structures, algorithms, and library glue needed for numerical computing in Python
Acts as a container for data to be passed between algorithms and libraries.
NumPy contains, among other things:
– A fast and efficient multidimensional array object, ndarray
– Functions for performing element-wise computations with arrays or mathematical operations between arrays
– Tools for reading and writing array-based datasets to disk
– Linear algebra operations, Fourier transform, and random number generation
– A mature C API to enable Python extensions and native C or C++ code to access NumPy’s data structures and computational facilities
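A short sketch of a few of the capabilities listed above, using small hypothetical arrays. An in-memory buffer stands in for a file on disk so the example leaves nothing behind.

```python
import io
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])      # a 2-D ndarray
b = np.array([[10.0, 20.0], [30.0, 40.0]])

elementwise = a * b          # element-wise product, no Python loop
matmul = a @ b               # matrix product (linear algebra)
det = np.linalg.det(a)       # determinant from numpy.linalg

# Write the array in NumPy's binary format and read it back
buffer = io.BytesIO()
np.save(buffer, a)
buffer.seek(0)
restored = np.load(buffer)

print(a.shape, a.dtype)
print(elementwise)
print(matmul)
```

Passing a filename such as a `.npy` path to `np.save` and `np.load` works the same way as the buffer does here.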
pandas provides high-level data structures and functions designed to make working with structured or tabular data fast, easy, and expressive
It blends the high-performance, array-computing ideas of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases
It provides sophisticated indexing functionality to make it easy to reshape, slice and dice, perform aggregations, and select subsets of data.
Two of the most important data structures of pandas are:
– DataFrame - a tabular, column-oriented data structure with both row and column labels
– Series - a one-dimensional labeled array object
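A brief sketch of the two structures in use, assuming pandas is installed. The column names and values are hypothetical.

```python
import pandas as pd

# DataFrame: tabular data with labeled rows and columns
df = pd.DataFrame({
    "city": ["A", "B", "A", "B"],
    "price": [100, 200, 300, 400],
})

prices = df["price"]                               # one column is a Series
expensive = df[df["price"] > 150]                  # select a subset of rows
mean_by_city = df.groupby("city")["price"].mean()  # aggregate per group

print(type(prices).__name__)
print(expensive)
print(mean_by_city)
```

The boolean filter and the `groupby` aggregation are small instances of the slicing and aggregation features described above.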
Matplotlib and Other Data Visualization Libraries
Matplotlib is the most popular Python library for producing publication-quality plots and other two-dimensional data visualizations.
Matplotlib is the foundation for many other Python visualization libraries. It serves as an excellent base, enabling coders to “wrap” other tools over it.
Seaborn may be able to support some more complex visualization approaches but it still requires matplotlib knowledge to fine-tune things.
Bokeh is a robust tool for setting up your own visualization server but may be overkill for simple scenarios.
Geoplotlib will get the job done if you need to visualize geographic data.
Ggplot shows a lot of promise but still has a lot of growing up to do.
Plot.ly generates the most interactive graphs, which can be saved offline to create vivid web-based visualizations.
IPython and Jupyter
IPython is designed from the ground up to maximize your productivity in both interactive computing and software development.
Component of the much broader Jupyter open source project
Designed to accelerate the writing, testing, and debugging of Python code.
Jupyter Notebook is an interactive web-based code “notebook” offering support for dozens of programming languages.
SciPy is a collection of packages addressing a number of different standard problem domains in scientific computing, such as:
scipy.linalg (linear algebra routines)
scipy.optimize (function optimizers)
scipy.sparse (sparse matrices and sparse linear system solvers)
scipy.special (a wrapper around Fortran SPECFUN library, implementing many math functions), and