Realcode4you is the right choice if you are looking to hire the best machine learning engineers, experts and professionals. We provide top-rated services across all machine learning and data science topics. Below are some of the important topics in which you can get help.
Get Help In Fraud/Anomaly Detection
The concept of fraud includes a criminal and a victim.
It can be encountered in many different areas with various methods.
There are mainly two broad categories of fraud: traditional and digital.
Further, digital fraud activities are themselves quite diverse.
Some types of fraud are:
Internet Fraud
Mail Fraud
Debit and Credit Card Fraud
Promotion Fraud
Application Fraud
Benefits of Fraud Detection
Losses can be prevented by detecting fraud attempts in real-time.
We can prioritize risk situations and respond to critical situations early.
By reducing manual reviews, we can reduce the workload of our fraud team, enable them to focus on more critical cases, and increase work efficiency.
400 billion dollar loss due to fraud?
The card industry is expected to lose 400 billion dollars this decade due to card fraud.
How will we do fraud detection?
In the last 10 years, fraud detection with the help of machine learning has become quite popular. This is because machine learning increases efficiency when deployed in place of teams finding fraud manually.
Steps in Fraud Detection
Importing company transaction datasets
Preprocessing the data
Data Visualization
Model Building
Testing the Model
Deployment
Importing standard libraries
Import the standard Python libraries (a minimal import sketch follows the list below):
Numpy: used by ML algorithms to perform matrix multiplications
Pandas: used for data handling
Matplotlib: standard Python library for graph plotting
Seaborn: advanced plotting library with more aesthetic features
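A minimal import sketch, assuming the standard aliases. Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns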
Importing Dataset
Import the company transactions dataset
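A minimal sketch of loading the data with pandas; the file name transactions.csv is only an illustrative placeholder for your own dataset. Code:
# load the company transactions dataset into a dataframe
df = pd.read_csv('transactions.csv')
# quick look at the first few rows
df.head()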
Data Preprocessing: Introduction
Data Preprocessing is the process of making data suitable for use while training a machine learning model.
Why use data preprocessing
The dataset initially provided for training might not be in a ready-to-use state; for example, it might not be formatted properly, or it may contain missing or null values.
Using a properly processed dataset for training will not only make your life easier but also increase the efficiency and accuracy of the model.
Features
A dataset can be viewed as a collection of data objects, which are often also called records, points, vectors, patterns, events, cases, samples, observations, or entities.
Data objects are described by a number of features that capture the basic characteristics of an object, such as the mass of a physical object or the time at which an event occurred.
Features are often called variables, characteristics, fields, attributes, or dimensions.
Categorical Features
Features whose values are taken from a defined set of values. For instance:
monthNames = [ "January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December" ]
Numerical Features
Features whose values are continuous or integer-valued. They are represented by numbers and possess most of the properties of numbers.
For instance, the number of transactions done each year.
Cleaning the feature names in df
Remove any extra leading or trailing white spaces from column names
Code:
df = df.rename(columns=lambda x: x.strip())
Cleaning the feature names in df
Replace space with underscore if there is any in the column names
Code:
df.columns = df.columns.str.replace(' ','_')
Convert any uppercase column names to lowercase
Code:
df.columns = df.columns.str.lower()
Handling Missing Values(HMV)
We will now see different methods to handle missing values
Checking the total number of missing values. Code:
num_of_nan = df.isnull().sum().sum()
Let's see the number and percentage of missing values per column. Code:
mask_total = df.isnull().sum().sort_values(ascending=False)
number = mask_total[mask_total > 0]
mask_percent = df.isnull().mean().sort_values(ascending=False)
percent = mask_percent[mask_percent > 0]
missing_data = pd.concat([number, percent], axis=1, keys=['Number_of_NaN', 'Percent_of_NaN'])
print(f'Number and Percentage of NaN:\n {missing_data}')
Let's see all the columns that have more missing values than a threshold percentage, say, all the columns with more than 70% missing values. Code:
missing_percent = 0.7  # threshold: 70% missing
mask_percent = df.isnull().mean()
series = mask_percent[mask_percent > missing_percent]
columns = series.index.to_list()
print(columns)
What if we want to drop the columns with many NaN values, say, all columns with more than 70% missing values? Code:
# 'columns' holds the list of columns with more than 70% missing values, computed above
df.drop(columns=columns, inplace=True)
print(columns)
1. Deleting Rows (HMV)
Missing values can be handled by deleting the rows or columns having null values.
If a column has more than half of its rows as null, the entire column can be dropped.
Rows that have one or more column values as null can also be dropped.
Deleting Rows (HMV): Pros
Training after the removal of all missing values definitely creates a robust model.
Deleting Rows (HMV): Cons
Loss of a lot of information.
Works poorly if the percentage of missing values is excessive in comparison to the complete dataset.
Deleting Rows (HMV): Code
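A minimal sketch with pandas dropna; the 50% column threshold is only illustrative.
# keep only columns that have at least half of their rows non-null
df = df.dropna(axis=1, thresh=len(df) // 2)
# drop any remaining rows that contain one or more null values
df = df.dropna(axis=0)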
2. Imputing Missing Values(IMV)
Columns in the dataset that have numeric continuous values can have their missing entries replaced with the mean, median, or mode of the remaining values in the column.
This method can prevent the loss of data compared to the earlier method. A code sketch follows the pros and cons below.
Imputing Missing Values(IMV): Pros
Prevents the data loss that results from deleting rows or columns.
Works well with a small dataset and is easy to implement.
Imputing Missing Values(IMV): Cons
Works only with numerical continuous variables.
Can cause data leakage
Does not factor in the covariance between features.
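Imputing Missing Values(IMV): Code
A minimal sketch of mean imputation with pandas; the column name amount is only an illustrative placeholder for a numeric column in df.
# replace missing values in a numeric column with the column mean
df['amount'] = df['amount'].fillna(df['amount'].mean())
# the median or mode can be substituted in the same way, e.g.
# df['amount'] = df['amount'].fillna(df['amount'].median())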
3. Categorical IMV
When the missing values are in categorical columns (string or numerical), they can be replaced with the most frequent category.
If the number of missing values is very large, they can be replaced with a new category instead. A code sketch follows the pros and cons below.
Categorical IMV: Pros
Prevents the data loss that results from deleting rows or columns.
Works well with a small dataset and is easy to implement.
Negates the loss of data by adding a unique category
Categorical IMV: Cons
Works only with categorical variables.
Adds new features to the model while encoding, which may result in poor performance.
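Categorical IMV: Code
A minimal sketch; the column name card_type is only an illustrative placeholder for a categorical column in df.
# option 1: replace missing values with the most frequent category
df['card_type'] = df['card_type'].fillna(df['card_type'].mode()[0])
# option 2: when many values are missing, replace them with a new category instead
# df['card_type'] = df['card_type'].fillna('Missing')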
4. Missing Value Independent Algorithms
Some ML algorithms are robust to missing values in the dataset.
The k-NN algorithm can ignore a column from a distance measure when a value is missing.
Naive Bayes can also support missing values when making a prediction. These algorithms can be used when the dataset contains null or missing values.
Missing Value Independent Algorithms: Pros
No need to handle missing values in each column, as these algorithms handle them efficiently.
Missing Value Independent Algorithms: Cons
The scikit-learn implementations of these algorithms do not support missing values.
Which way to choose to handle missing values
There is no fixed rule for handling missing values in a particular manner; choose the method that yields a robust model with the best performance.
One can use different methods on different features, depending on what the data is about.
Handling Inconsistent Values
We know that data can contain inconsistent values. Most probably we have already faced this issue at some point.
For instance, the ‘Address’ field contains the ‘Phone number’. It may be due to human error or maybe the information was misread while being scanned from a handwritten form.
It also happens that there is a date variable/feature, but its data type is object/string.
So we will need to convert it to the appropriate format.
Handling Date type Format
Date here is of type object
Converting object type to datetime64
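A minimal sketch of the conversion with pandas, assuming feature_name holds the name of the date column (as in the snippets below). Code:
# convert the date column from object/string to datetime64
df[feature_name] = pd.to_datetime(df[feature_name])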
Adding a new feature year, extracted from date feature
The code can be written as:
#adding year column to dataframe
df['year'] = df[feature_name].dt.year.astype(int)
Adding a new feature month, extracted from date feature
The code can be written as:
#adding month column to dataframe
df['month'] = df[feature_name].dt.month.astype(int)
Adding a new feature day, extracted from date feature
The code can be written as:
#adding day column to dataframe
df['day'] = df[feature_name].dt.day.astype(int)
The newly added columns are shown here.
Handling Duplicate Values
A dataset may include data objects which are duplicates of one another.
It may happen, for example, when the same person submits a form more than once.
The term deduplication is often used to refer to the process of dealing with duplicates.
The following code checks the number of duplicates:
#number of duplicate rows
len(df[df.duplicated()])
The following code removes the duplicates:
df.drop_duplicates(inplace=True)
Feature aggregation
Feature aggregation is a technique to extract features from data by combining multiple features from different (usually similar) datasets.
The goal of feature aggregation is to discover data-driven relations between the original features, which might be hard to discover otherwise.
Every aggregated feature can be seen as a "meta" feature (or "higher-level" feature) that summarizes many other lower-level features.
Feature aggregations are performed so that the aggregated values put the data in a better perspective.
Think of transactional data: suppose we have day-to-day transactions of a product, recorded as the daily sales of that product at various store locations over the year.
Aggregating the transactions into single store-wide monthly or yearly transactions will help us reduce the hundreds or potentially thousands of transactions that occur daily at a specific store, thereby reducing the number of data objects.
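A minimal sketch of such an aggregation with pandas, assuming hypothetical store, date and sales columns (date already converted to datetime64). Code:
# derive a month period from the transaction date
df['month'] = df['date'].dt.to_period('M')
# aggregate daily transactions into store-wide monthly sales
monthly_sales = df.groupby(['store', 'month'])['sales'].sum().reset_index()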
Benefits of Feature Aggregation
This results in reduced memory consumption and processing time.
Aggregations provide us with a high-level view of the data, as the behaviour of groups or aggregates is more stable than that of individual data objects.
It can help machine learning algorithms to give better accuracy
Feature Sampling
Sampling is a very common method for selecting a subset of the dataset that we are analyzing.
In most cases, working with the complete dataset can turn out to be very expensive considering the memory and time constraints.
Using a sampling algorithm can help us reduce the size of the dataset to a point where we can use a better, but more expensive, machine learning algorithm.
The key principle here is that the sampling should be done in such a manner that the sample generated should have approximately the same properties as the original dataset
Types of Feature Sampling
There are mainly 2 types:
Sampling without replacement
Sampling with replacement
Sampling without Replacement
Sampling without Replacement : As each item is selected, it is removed from the set of all the objects that form the total dataset.
Sampling with Replacement
Sampling with Replacement : Items are not removed from the total dataset after getting selected.
This means they can get selected more than once.
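A minimal sketch of both strategies using the pandas sample method; the 10% fraction and random_state are only illustrative. Code:
# sampling without replacement: each row can be selected at most once
sample_without = df.sample(frac=0.1, replace=False, random_state=42)
# sampling with replacement: the same row may be selected more than once
sample_with = df.sample(frac=0.1, replace=True, random_state=42)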
Get Help In Dimensionality Reduction
Most real world datasets have a large number of features.
For example, in a fraud detection problem we might have to deal with a lot of features, also called dimensions.
As the name suggests, dimensionality reduction aims to reduce the number of features - but not simply by selecting a sample of features from the feature-set.
The curse of dimensionality
This refers to the phenomenon that data analysis tasks generally become significantly harder as the dimensionality of the data increases.
As the dimensionality increases, the space occupied by the data grows, making the data more and more sparse, which is difficult to model and visualize.
What dimensionality reduction essentially does is map the dataset to a lower-dimensional space, which may well be a space with few enough dimensions to visualize directly, say 2D.
The basic objective of techniques which are used for this purpose is to reduce the dimensionality of a dataset by creating new features which are a combination of the old features.
In other words, the higher-dimensional feature-space is mapped to a lower-dimensional feature-space. Principal Component Analysis and Singular Value Decomposition are two widely accepted techniques.
Benefits of Dimensionality Reduction
Data Analysis algorithms work better if the dimensionality of the dataset is lower. This is mainly because irrelevant features and noise have now been eliminated.
The models which are built on top of lower-dimensional data are more understandable and explainable.
The data may now also get easier to visualize!
Features can always be taken in pairs or triplets for visualization purposes, which makes more sense if the feature-set is not that big
PCA
One of the most popular methods to do dimensionality reduction is PCA.
Principal Component Analysis is basically a statistical procedure to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables.
Why to use PCA
It is basically a procedure that does not depend on a target variable and reduces the attribute space from a large number of variables to a smaller number of factors.
PCA is basically a dimension reduction process but there is no guarantee that the dimension is interpretable.
The main task in PCA is to select a subset of variables from a larger set, based on which original variables have the highest correlation with the principal components.
Applications of PCA
It is used to find inter-relation between variables in the data.
It is used to interpret and visualize data.
As the number of variables decreases, further analysis becomes simpler.
It’s often used to visualize genetic distance and relatedness between populations.
Working of PCA
PCA basically searches for a linear combination of variables that extracts the maximum variance from the variables.
Once this component is found, PCA removes that variance and searches for another linear combination that explains the maximum proportion of the remaining variance, which leads to orthogonal factors.
In this method, we analyze the total variance.
Explained Variance Ratio
But what is explained variance ratio PCA?
The explained variance ratio is the percentage of variance that is attributed to each of the selected components.
Training PCA on DataFrame
Code:
from sklearn.decomposition import PCA
# fit PCA on the (numeric) dataframe
pca = PCA()
pca.fit(df)
# share of the total variance explained by each principal component
print("pca.explained_variance_ratio_before_scaling", pca.explained_variance_ratio_)
Feature Encoding
Feature encoding is basically performing transformations on the data such that it can be easily accepted as input for machine learning algorithms while still retaining its original meaning.
As we all know, better encoding leads to a better model, and most algorithms cannot handle categorical variables unless they are converted into numerical values.
Let's first see the types of categorical features.
Broadly there are 3 types:
Binary
Ordinal
Nominal
Binary Feature
Examples:
Yes/No
True/False
Ordinal Features
Examples:
low, medium, high
cold, hot, lava hot
Nominal Features
Examples:
cat, dog, tiger
pizza, burger, coke
Feature Encoding: Example Dataset
We will be using the dataset below to explain feature encoding.
Mapping Binary Features
Code:
# map 'T'/'F' and 'Y'/'N' to 1/0
df['bin_1'] = df['bin_1'].apply(lambda x: 1 if x == 'T' else (0 if x == 'F' else None))
df['bin_2'] = df['bin_2'].apply(lambda x: 1 if x == 'Y' else (0 if x == 'N' else None))
# plot the counts of the encoded values
sns.countplot(df['bin_1'])
sns.countplot(df['bin_2'])
Get Help In Label Encoding
The label encoding algorithm is quite simple and it considers an order for encoding.
Hence it can be used for encoding ordinal data.
LabelEncoder is available in the scikit-learn library. Code:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['ord_2'] = le.fit_transform(df['ord_2'])
sns.countplot(df['ord_2'])
One-Hot Encoding
To overcome the disadvantage of label encoding, which assumes a hierarchy in the column that can be misleading for the nominal features present in the data, we can use the one-hot encoding strategy.
It is done in the following 2 steps:
Splitting of categories into different columns.
Putting ‘0’ for the other columns and ‘1’ as an indicator for the appropriate column.
Code:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc = enc.fit_transform(df[['nom_0']]).toarray()
encoded_colm = pd.DataFrame(enc)
df = pd.concat([df, encoded_colm], axis=1)
df = df.drop(['nom_0'], axis=1)
df.head(10)
Frequency Encoding
We can also encode considering the frequency distribution. This method can be effective at times for nominal features.
Code:
# grouping by frequency
fq = df.groupby('nom_0').size()/len(df)
# mapping values to dataframe
df.loc[:, "{}_freq_encode".format('nom_0')] = df['nom_0'].map(fq)
# drop original column.
df = df.drop(['nom_0'], axis=1)
fq.plot.bar(stacked=True)
df.head(10)
Ordinal Encoding
We can use the OrdinalEncoder class provided in scikit-learn to encode ordinal features.
It ensures that the ordinal nature of the variables is sustained.
Code:
from sklearn.preprocessing import OrdinalEncoder
ord1 = OrdinalEncoder()
# fitting the encoder on the column (expects a 2D array)
ord1.fit(df[['ord_2']])
# transforming the column after fitting
df['ord_2'] = ord1.transform(df[['ord_2']])
Binary Encoding
Initially, categories are encoded as integers and then converted into binary code; the digits of that binary string are then placed into separate columns.
Code:
from category_encoders import BinaryEncoder
encoder = BinaryEncoder(cols =['ord_2'])
# transforming the column after fitting
newdata = encoder.fit_transform(df['ord_2'])
# concatenating dataframe
df = pd.concat([df, newdata], axis = 1)
# dropping old column
df = df.drop(['ord_2'], axis = 1)
Get Help In Feature Scaling
Feature scaling in machine learning is one of the most critical steps during the pre-processing of data before creating a machine learning model.
Scaling can make a difference between a weak machine learning model and a better one.
The most common techniques of feature scaling are Normalization and Standardization.
Normalization is used when we want to bound our values between two numbers, typically, between [0,1] or [-1,1].
Standardization, on the other hand, transforms the data to have zero mean and a variance of 1, making the data unitless.
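A minimal sketch of both techniques with scikit-learn, assuming df contains only numeric columns at this point. Code:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# normalization: rescale each feature to the [0, 1] range
df_normalized = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
# standardization: zero mean and unit variance for each feature
df_standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)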
Why Feature Scaling
A machine learning algorithm just sees numbers. If there is a vast difference in range, say a few features ranging in the thousands and a few ranging in the tens, it makes the underlying assumption that higher-ranging numbers have superiority of some sort.
So these larger numbers start playing a more decisive role while training the model.
Some algorithms, like neural networks trained with gradient descent, converge much faster with feature scaling than without it.
Where to use Feature Scaling
KNN
K-means
Principal Component Analysis
Gradient Descent
KNN: Feature Scaling
K-nearest neighbors (KNN) with a Euclidean distance measure is sensitive to magnitudes; features should therefore be scaled so that all of them weigh in equally.
K-Means: Feature Scaling
K-Means also uses the Euclidean distance measure, so feature scaling matters here as well.
PCA: Feature Scaling
PCA tries to capture the directions of maximum variance; since variance is higher for high-magnitude features, unscaled data skews PCA towards those features.
Gradient Descent: Feature Scaling
We can speed up gradient descent by scaling because θ descends quickly on small ranges and slowly on large ranges, and oscillates inefficiently down to the optimum when the variables are very uneven.
Where not to use: Feature Scaling
We do not need to use feature scaling with the algorithms below:
Random Forest
CART
Gradient Boosted Decision trees
We also provide other advanced-level machine learning and data science help. If you have any project related to machine learning or data science, send your requirement details to the mail id below:
realcode4you@gmail.com