Important Basic Topics of Python & Machine Learning | Overview

MACHINE LEARNING OVERVIEW

AI, ML, and DL

  • What is Artificial Intelligence?

  • What is Machine Learning?

  • What is Deep Learning?


Artificial intelligence

  • The term was first introduced in 1956 at a conference where researchers wanted to digitize how the human brain works

  • “AI is the science and engineering of making computers behave in ways that, until recently, we thought required human intelligence” ~ Andrew Moore

  • AI is a moving target, defined by capabilities that humans possess but machines do not, e.g., emotion

  • AI encompasses technological advances in different fields such as Machine Learning, Human-Computer Interaction, etc.

  • Examples of AI: Deep Blue and, to some extent, Google Home, Siri, and Alexa


Machine Learning and Deep Learning

  • Machine learning is the study of computer algorithms that allow computer programs to automatically improve through experience ~ Tom Mitchell

  • A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E ~ Tom Mitchell

  • The goal of ML is never to make “perfect” guesses, because ML deals in domains where there is no such thing. The goal is to make guesses that are good enough to be useful.

Deep Learning

  • It is a class of machine learning algorithms inspired by the structure of the human brain.

  • Deep learning algorithms use complex multi-layered neural networks, where the level of abstraction increases gradually by non-linear transformations of input data.


How can a machine learn?

Dataset

  • The samples need to be representative

  • The samples can include numbers, images, text, etc


Features

  • Important pieces of data that work as the key to solving the task

  • Tell the machine/program what to pay attention to


Algorithm

  • The same task can be solved using different algorithms

  • The accuracy or speed of getting results can be different


If the dataset quality is high and the features are chosen well, an ML-powered system can outperform humans at a given task


Machine Learning Problem

Many real-world problems are complex. Inventing specialized algorithms to solve them perfectly every time is not practical


Some examples:

  • How can we predict future traffic patterns at an intersection?

  • Is it cancer?

  • What is the market value of this house five years from now?

  • Which of these candidates is the best fit for the job?

  • Which of these people can be my best friend/partner?

  • Will a certain person like this movie or not?

  • How can I slice the banana to make a perfect peanut butter banana sandwich? (https://www.ethanrosenthal.com/2020/08/25/optimalpeanut-butter-and-banana-sandwiches/)


Machine Learning Algorithm

  • Generally divided into supervised, unsupervised, and reinforcement learning, based on whether they are trained with human supervision and whether the training data is labeled

  • Also categorized by whether or not they can learn incrementally on the fly (online versus batch learning)

  • And by whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)


Supervised Learning

  • The training data fed to the algorithm includes the desired solutions, called labels

  • It models the relationship between the target prediction output and the input features, such that we can predict the output values for new data based on those relationships learned from past data

  • The goal is to develop a finely tuned predictor function h(x) (sometimes called the “hypothesis”) so that, given input data x about a certain domain (e.g., the square footage of a house), it will predict some value of interest h(x) (e.g., the market price of the house).

  • Two major categories are regression and classification
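
As a rough illustration of the predictor function h(x) described above, the sketch below fits a one-feature regression h(x) = theta0 + theta1 * x by ordinary least squares. The function names (fit_linear, h) and the square-footage and price numbers are made up for this example; they are not taken from any real dataset.

```python
# Minimal sketch: learn h(x) = theta0 + theta1 * x from labeled examples
# using the closed-form least-squares solution for a single feature.

def fit_linear(xs, ys):
    """Return (theta0, theta1) that minimize the squared error on (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    theta1 = cov_xy / var_x
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


def h(x, theta0, theta1):
    """The learned predictor (the "hypothesis")."""
    return theta0 + theta1 * x


# Hypothetical training set: square footage (feature) and price in thousands (label)
square_feet = [1000, 1500, 2000, 2500]
prices = [200, 300, 400, 500]

theta0, theta1 = fit_linear(square_feet, prices)
predicted = h(1800, theta0, theta1)  # prediction for an unseen house
print(predicted)
```

Because the output is a continuous value, this is a regression task; predicting a category (e.g., "cheap" versus "expensive") would be classification.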


Some of the most important supervised learning algorithms include:

  • k-Nearest Neighbors

  • Linear Regression

  • Logistic Regression

  • Support Vector Machines (SVMs)

  • Decision Trees and Random Forests

  • Neural networks

If it quacks like a duck, waddles like a duck, and swims like a duck, then…

  • It can be a mallard, which is a species of duck

If it has a flat beak to catch worms and has webbed feet…

  • It does not need to be a duck. It can be a platypus

If it walks on four legs and has a long nose…

  • We need more information. It could be an elephant, but it could also be a member of the mouse family


Unsupervised Learning

  • In unsupervised learning, the training data is unlabeled

  • An unsupervised learning algorithm is typically tasked with finding relationships and correlations within the data.

– Used mostly for pattern detection and descriptive modeling

  • Some of the most important unsupervised learning algorithms include:

– Clustering

– Visualization and dimensionality reduction

– Association rule learning
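
To make clustering concrete, here is a small k-means sketch that groups unlabeled 2-D points. It is only one possible clustering approach; the function name kmeans, the sample points, and the choice to initialize centroids from the first k points (for reproducibility) are assumptions of this example.

```python
# Minimal k-means sketch: group unlabeled points around k centroids.

def kmeans(points, k, iterations=10):
    # Assumption: use the first k points as initial centroids (deterministic)
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, clusters


# Hypothetical unlabeled data: two visually separate groups
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 10.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

No labels are given to the algorithm; the two groups emerge only from the distances between points.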


Supervised vs Unsupervised



Instance-based and Model-Based Learning


Instance-Based Learning

  • System generalizes to new cases based on a similarity measure to known cases.
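
A k-nearest-neighbors classifier is a common example of instance-based learning: it stores the known cases and classifies a new case by similarity to them. The sketch below assumes squared Euclidean distance as the similarity measure; knn_predict and the sample data are hypothetical names and values chosen for illustration.

```python
from collections import Counter

# Minimal k-nearest-neighbors sketch: no model is built; prediction is a
# majority vote among the k stored cases most similar to the query.

def knn_predict(known_cases, query, k=3):
    """known_cases: list of (features, label). Returns the majority label."""
    by_distance = sorted(
        known_cases,
        key=lambda case: sum((a - b) ** 2 for a, b in zip(case[0], query)),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labeled cases
known_cases = [
    ((1.0, 1.0), "small"), ((1.2, 0.8), "small"), ((0.9, 1.1), "small"),
    ((5.0, 5.0), "large"), ((5.5, 4.5), "large"), ((4.8, 5.2), "large"),
]

label_a = knn_predict(known_cases, (1.1, 1.0))
label_b = knn_predict(known_cases, (5.1, 4.9))
print(label_a, label_b)
```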


Batch Learning

  • The system is incapable of learning incrementally: it must be trained offline, using all the available data.

  • First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned.

  • When new data arrives, you need to train a new version of the system from scratch on the full dataset (not just the new data, but also the old data), then stop the old system and replace it with the new one.

Online Learning

  • The system is trained incrementally by feeding it data instances sequentially, individually or in mini-batches.

  • The system can learn about new data on the fly, as it arrives

– Great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously

– Also good for limited computing resources, and huge datasets that cannot fit in one machine’s main memory (this is called out-of-core learning)
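
The incremental training described above can be sketched with a tiny linear model that takes one gradient step per mini-batch and then discards the batch. The class OnlineLinearModel, its partial_fit method, the data_stream generator, and the learning rate are all assumptions made for this example; the stream produces synthetic, noiseless samples of y = 2x + 1.

```python
import random

# Minimal online-learning sketch: the model never sees the full dataset,
# only a sequence of mini-batches that arrive one at a time.

class OnlineLinearModel:
    """h(x) = w * x + b, updated incrementally."""

    def __init__(self, learning_rate=0.1):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def partial_fit(self, batch):
        """One gradient step on the mean squared error of a mini-batch."""
        n = len(batch)
        grad_w = sum(2 * (self.predict(x) - y) * x for x, y in batch) / n
        grad_b = sum(2 * (self.predict(x) - y) for x, y in batch) / n
        self.w -= self.learning_rate * grad_w
        self.b -= self.learning_rate * grad_b


def data_stream(n_batches, batch_size=4, seed=0):
    """Hypothetical data source yielding mini-batches of (x, y) pairs."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(batch_size)]
        yield [(x, 2.0 * x + 1.0) for x in xs]


model = OnlineLinearModel(learning_rate=0.1)
for batch in data_stream(n_batches=1000):
    model.partial_fit(batch)  # learn from the batch, then let it go

print(round(model.w, 3), round(model.b, 3))
```

Only one mini-batch is held in memory at a time, which is the same idea that makes out-of-core learning possible. The learned w and b should approach 2 and 1 as more batches arrive.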



Main Challenges of Machine Learning

  • Machine learning involves selecting a learning algorithm and training it on some data.

  • “Bad data” and a “bad algorithm” are the two things that can go wrong


Data Challenges

  • Insufficient quantity of training data – it takes a lot of data for most machine learning algorithms to work properly

  • Non-representative training data - In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to.

  • Poor-quality data – outliers, missing values, etc.

  • Irrelevant features - A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves:

– Feature selection: selecting the most useful features to train on among existing features.

– Feature extraction: combining existing features to produce a more useful one

– Feature generation: Creating new features by gathering new data.
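
The sketch below illustrates feature extraction and a very simple form of feature selection on made-up housing records. Ranking candidates by absolute Pearson correlation with the target is just one possible selection heuristic, assumed here for illustration; the field names and values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical records; "price" is the target we want to predict
rows = [
    {"total_rooms": 600, "households": 100, "house_number": 7, "price": 300},
    {"total_rooms": 450, "households": 150, "house_number": 3, "price": 150},
    {"total_rooms": 800, "households": 100, "house_number": 4, "price": 400},
    {"total_rooms": 500, "households": 250, "house_number": 5, "price": 100},
]

# Feature extraction: combine two existing features into a more useful one
for row in rows:
    row["rooms_per_household"] = row["total_rooms"] / row["households"]

# Feature selection: score each candidate by |correlation| with the target
target = [row["price"] for row in rows]
candidates = ["total_rooms", "households", "house_number", "rooms_per_household"]
scores = {
    name: abs(pearson([row[name] for row in rows], target))
    for name in candidates
}
best_feature = max(scores, key=scores.get)
print(best_feature, scores)
```

In this toy data the extracted feature tracks the price more closely than either of the raw features it was built from, while the house number carries little signal.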


Bad Algorithms

  • Overfitting the Training Data - the model performs well on the training data, but it does not generalize well.

  • Underfitting the Training Data - occurs when your model is too simple to learn the underlying structure of the data

– The main options to fix this problem are:

• Selecting a more powerful model, with more parameters

• Feeding better features to the learning algorithm (feature engineering)

• Reducing the constraints on the model
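
A small sketch of underfitting and of the first fix listed above (a more powerful model with more parameters): a constant model cannot capture a linear trend, while a two-parameter line can. The data is synthetic (y = 2x + 1) and the variable names are chosen only for this example.

```python
# Underfitting sketch: compare a one-parameter model with a two-parameter
# model on data that has a clear linear structure.

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # synthetic data generated from y = 2x + 1


def mse(predict):
    """Mean squared error of a predictor on the training data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Model A (too simple): always predict the mean of y
mse_constant = mse(lambda x: mean_y)

# Model B (more parameters): least-squares line
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
mse_line = mse(lambda x: intercept + slope * x)

print(mse_constant, mse_line)
```

The constant model has a large error even on its own training data, which is the typical sign of underfitting. A low training error alone does not rule out overfitting, so performance on held-out data still has to be checked.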


Model Testing and Validation

  • Helps to determine how well the model generalizes to new cases

  • Achieved by splitting the data into a training set and a validation/test set.

  • Evaluating the model on the test set helps to assess how well it will perform on new instances of data

  • Cross-validation is used to evaluate several models

– The training set is split into complementary subsets, and each model is trained against a different combination of these subsets and validated against the remaining parts.

– The selected model is then trained on the full training set, and the generalization error is measured on the test set.
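
The splitting into complementary subsets can be sketched as a k-fold index generator. The helper name k_fold_indices is hypothetical, and the sketch assumes the samples are already shuffled; in practice the data is usually shuffled before the folds are cut.

```python
# Minimal k-fold sketch: each fold serves once as the validation set while
# the remaining folds together form the training set.

def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validate:", validation)
```

Each candidate model would be trained on the train indices and scored on the validation indices of every fold, and the scores averaged to compare models.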


Machine Learning Steps

The main steps of a machine learning project include:

1. Look at the big picture (problem definition).

2. Get the data.

3. Discover and visualize the data to gain insights.

4. Prepare the data for Machine Learning algorithms.

5. Select a model and train it.

6. Fine-tune your model.

7. Present your solution.

8. Launch, monitor, and maintain your system.


Machine Learning Pipeline











Essential Python Libraries for Data Science

Python data ecosystem libraries commonly used in data science include:

  • NumPy

  • Pandas

  • Matplotlib (and its cousins)

  • IPython and Jupyter

  • SciPy

  • Scikit-learn

  • Statsmodels

  • Keras

  • TensorFlow


NumPy: Short for Numerical Python

  • Provides the data structures, algorithms, and library glue needed for numerical computing in Python

  • Acts as a container for data to be passed between algorithms and libraries.

  • NumPy contains, among other things:

– A fast and efficient multidimensional array object ndarray

– Functions for performing element-wise computations with arrays or mathematical operations between arrays

– Tools for reading and writing array-based datasets to disk

– Linear algebra operations, Fourier transform, and random number generation

– A mature C API to enable Python extensions and native C or C++ code to access NumPy’s data structures and computational facilities
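
A short sketch of several of the NumPy features listed above, assuming NumPy is installed. The array values are arbitrary, and an in-memory buffer stands in for a file on disk so the example leaves nothing behind.

```python
import io

import numpy as np

# ndarray: a fast multidimensional array
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[10.0, 20.0], [30.0, 40.0]])

# Element-wise computations and operations between arrays
elementwise_sum = a + b
scaled = a * 2

# Linear algebra: matrix product and determinant
matrix_product = a @ b
determinant = np.linalg.det(a)

# Random number generation with a seeded generator
rng = np.random.default_rng(seed=0)
samples = rng.normal(size=3)

# Reading and writing arrays (a file path would work the same way)
buffer = io.BytesIO()
np.save(buffer, a)
buffer.seek(0)
restored = np.load(buffer)

print(elementwise_sum)
print(matrix_product)
print(determinant)
```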


Pandas

  • Provides high-level data structures and functions designed to make working with structured or tabular data fast, easy, and expressive

  • Blends the high-performance, array-computing ideas of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases

  • Provides sophisticated indexing functionality to make it easy to reshape, slice and dice, perform aggregations, and select subsets of data.

  • Two of the most important data structures of pandas are:

– DataFrame - a tabular, column-oriented data structure with both row and column labels

– Series - a one-dimensional labeled array object
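
A brief sketch of the two data structures, assuming pandas is installed. The column names, index labels, and values are invented for illustration.

```python
import pandas as pd

# Series: a one-dimensional labeled array
prices = pd.Series([200, 300, 400], index=["a", "b", "c"], name="price")

# DataFrame: a table with row labels (index) and column labels
houses = pd.DataFrame(
    {
        "city": ["Austin", "Austin", "Boston", "Boston"],
        "square_feet": [1000, 1500, 1200, 2000],
        "price": [200, 300, 360, 600],
    },
    index=["h1", "h2", "h3", "h4"],
)

# Selecting a subset of rows with a boolean condition
large = houses[houses["square_feet"] > 1100]

# Aggregation: mean price per city
mean_price_by_city = houses.groupby("city")["price"].mean()

# Label-based indexing of a single cell
one_cell = houses.loc["h3", "price"]

print(large)
print(mean_price_by_city)
```

Each column of a DataFrame is itself a Series, so the two structures share most of their indexing behavior.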



Matplotlib and Other Data Visualization Libraries

  • Matplotlib is the most popular Python library for producing publication-quality plots and other two-dimensional data visualizations.

  • Matplotlib is like the mother of all Python visualization libraries. It serves as an excellent base, enabling coders to “wrap” other tools over it.

  • Seaborn may be able to support some more complex visualization approaches but it still requires matplotlib knowledge to fine-tune things.

  • Bokeh is a robust tool for setting up your own visualization server but may be a bit of overkill for simple scenarios.

  • Geoplotlib will get the job done if you need to visualize geographic data.

  • Ggplot shows a lot of promise but still has a lot of growing up to do.

  • Plotly (Plot.ly) generates highly interactive graphs, which can be saved offline to create vivid web-based visualizations.


IPython and Jupyter

  • IPython is designed from the ground up to maximize your productivity in both interactive computing and software development.

  • Component of the much broader Jupyter open source project

  • Designed to accelerate the writing, testing, and debugging of Python code.

  • Jupyter Notebook is an interactive web-based code “notebook” offering support for dozens of programming languages.


SciPy

Collection of packages addressing a number of different standard problem domains in scientific computing, such as:

  • scipy.linalg (linear algebra routines)

  • scipy.optimize (function optimizers)

  • scipy.sparse (sparse matrices and sparse linear system solvers)

  • scipy.special (a wrapper around Fortran SPECFUN library, implementing many math functions), and