
What is Feature Selection and Dimensionality Reduction in Machine Learning? | Realcode4you

Definition

  • A process that chooses an optimal subset of features according to an objective function

Objectives

  • To reduce dimensionality and remove noise

  • To improve mining performance:

        • Speed of learning

        • Predictive accuracy

        • Simplicity and comprehensibility of mined results

Feature selection and dimensionality reduction:

  1. Improve performance (speed, predictive power, simplicity of the model).

  2. Visualize the data for model selection.

  3. Reduce dimensionality and remove noise.

Feature selection is the process of selecting an optimal subset of features according to a certain criterion.


Other reasons for performing FS may include:

  • removing irrelevant data and noise.

  • increasing the accuracy of learned models.

  • reducing the complexity of the resulting model description, improving the understanding of the data and the model.


Dimensionality reduction is also an efficient approach to downsizing data, and it enables visualization: projecting high-dimensional data onto 2D or 3D (see the PCA sketch below).
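
As a minimal illustration of the visualization use case, the sketch below projects the four iris measurements onto their first two principal components with base R's prcomp (the data set and plotting choices are ours, for illustration only):

# Project the 4-dimensional iris measurements onto their first two
# principal components for visual inspection.
data(iris)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

summary(pca)  # proportion of variance captured by each component

# 2D projection, coloured by species
plot(pca$x[, 1], pca$x[, 2],
     col = as.integer(iris$Species),
     xlab = "PC1", ylab = "PC2",
     main = "Iris projected onto the first two principal components")
legend("topright", legend = levels(iris$Species), col = 1:3, pch = 1)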


Applications of Dimensionality Reduction

  • Customer relationship management

  • Text mining

  • Image retrieval

  • Handwritten digit recognition

  • Intrusion detection


How It Works

  • Searching for the best subset of features.

  • Criteria for evaluating different subsets.


Different Aspects of Search

Search starting points

  • Empty set

  • Full set

  • Random point


Search directions

  • Sequential forward selection (sketched in code after this list)

  • Sequential backward elimination

  • Bidirectional generation

  • Random generation
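
To make the search concrete, here is a minimal base-R sketch of sequential forward selection: start from the empty set and greedily add the feature that most improves a chosen criterion. The criterion used here, leave-one-out accuracy of an LDA classifier on iris, is an arbitrary choice for illustration:

library(MASS)  # for lda()

x <- iris[, 1:4]
y <- iris$Species

# Evaluation criterion: leave-one-out accuracy of LDA on a feature subset
loo_accuracy <- function(features) {
  fit <- lda(x[, features, drop = FALSE], grouping = y, CV = TRUE)
  mean(fit$class == y)
}

selected   <- character(0)   # search starting point: empty set
remaining  <- names(x)
best_score <- 0

while (length(remaining) > 0) {
  scores <- sapply(remaining, function(f) loo_accuracy(c(selected, f)))
  if (max(scores) <= best_score) break   # stop when no candidate improves the score
  best <- names(which.max(scores))
  selected   <- c(selected, best)
  remaining  <- setdiff(remaining, best)
  best_score <- max(scores)
  cat("added", best, "-> LOO accuracy", round(best_score, 3), "\n")
}
selected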


Other Types of High-Dimensional Data

Face images are a classic example: each pixel is a feature, so even small images yield thousands of dimensions.



Models of Feature Selection

Filter model

  • Separating feature selection from classifier learning

  • Relying on general characteristics and statistics of the data, such as correlation, distance, dependence, or consistency (see the correlation-filter sketch below)
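
One common filter criterion of this kind is inter-feature correlation: drop predictors that are highly correlated with other predictors, without ever consulting a classifier. A minimal sketch, assuming caret's findCorrelation helper and an arbitrary 0.75 cutoff:

library(caret)

data(iris)
x <- iris[, 1:4]

corr_matrix <- cor(x)
high_corr <- findCorrelation(corr_matrix, cutoff = 0.75)  # columns flagged for removal

# Guard against the empty case: x[, -integer(0)] would drop everything
x_filtered <- if (length(high_corr) > 0) x[, -high_corr, drop = FALSE] else x
names(x_filtered)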


Wrapper model

  • Relying on a predetermined classification algorithm

  • Using predictive accuracy as a goodness measure

  • High accuracy, but computationally expensive



Filter algorithms

Example: a filter algorithm based on an entropy measure such as information gain.
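
A minimal hand-rolled version of such a filter is sketched below: it computes the information gain of each (discretized) iris feature with respect to the class, then ranks the features. The bin count of 4 is an arbitrary choice:

# Shannon entropy of a discrete vector
entropy <- function(v) {
  p <- table(v) / length(v)
  p <- p[p > 0]              # drop empty levels so log2(0) never occurs
  -sum(p * log2(p))
}

# Information gain of a numeric feature (discretized into bins) w.r.t. the class:
# IG = H(class) - H(class | feature)
info_gain <- function(feature, class, bins = 4) {
  f <- cut(feature, breaks = bins)
  h_cond <- sum(sapply(levels(f), function(lev) {
    idx <- f == lev
    if (!any(idx)) return(0)
    mean(idx) * entropy(class[idx])   # weighted conditional entropy
  }))
  entropy(class) - h_cond
}

gains <- sapply(iris[, 1:4], info_gain, class = iris$Species)
sort(gains, decreasing = TRUE)  # rank features by information gain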


Wrapper algorithms

Example: a wrapper algorithm based on clustering or classification accuracy.
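
One readily available wrapper of this kind in R is recursive feature elimination from the caret package: a backward-elimination wrapper that scores candidate subset sizes by the cross-validated accuracy of a chosen model. A minimal sketch, using a random forest as the wrapped classifier (our choice for illustration):

library(caret)
library(randomForest)

set.seed(42)
# rfFuncs wraps a random forest as the model whose accuracy judges each subset
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
result <- rfe(iris[, 1:4], iris$Species,
              sizes = 1:4,          # candidate subset sizes to evaluate
              rfeControl = ctrl)
print(result)
predictors(result)  # names of the selected features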


Wrapper-based methods are advantageous because they tend to give better performance, since they use the target classifier itself inside the feature selection algorithm; however, they are computationally expensive.


Filter methods, by contrast, are less accurate but faster to compute.



(Figures: pipelines of the filter approach and of the wrapper approach.)




Drawbacks of Feature Selection in Some Cases

  • The resulting subsets of many FS models are strongly dependent on the training set size.

  • The selected features can be so interdependent that the removal of any one of them seriously affects learning performance.

  • A backward removal strategy is very slow when working with large-scale data sets.

  • In some cases, the FS outcome is still left with a relatively large number of relevant features.


Example of Feature Selection in R: Wrapper Approach

  • In this example we will use the Boruta package.

  • Boruta is an FS algorithm that works as a wrapper around random forest.

  • Random forests are an ensemble learning method for classification, regression and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees (see the sketch below).
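
Before running Boruta, it may help to see the importance measure it relies on in isolation. A minimal sketch fitting a plain random forest on iris with the randomForest package and inspecting its built-in importance scores:

library(randomForest)

set.seed(111)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 500)
print(rf)           # OOB error estimate and confusion matrix

importance(rf)      # mean decrease in accuracy and in Gini impurity per feature
varImpPlot(rf)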


How does the Boruta algorithm work?

  • Firstly, it adds randomness to the given data set by creating shuffled copies of all features (called shadow features).

  • Then, it trains a random forest classifier on the extended data set and applies a feature importance measure.

  • At every iteration, it checks whether a real feature has a higher importance than the best of its shadow features and constantly removes features which are unimportant.

  • Finally, the algorithm stops either when all features get confirmed or rejected, or when it reaches a specified limit of random forest runs.


Application of the Boruta Algorithm and Random Forest in R

Required libraries:

library(Boruta)        # wrapper feature selection around random forest
library(mlbench)       # benchmark data sets
library(caret)         # training and evaluation utilities
library(randomForest)  # the underlying classifier
library(reprtree)      # plotting representative trees (installed from GitHub)

Code Implementation

set.seed(111)  # for reproducibility
# Run Boruta on iris: Species is the class, all other columns are candidate features
boruta <- Boruta(Species ~ ., data = iris, doTrace = 2, maxRuns = 500)
print(boruta)  # summary of confirmed / tentative / rejected features
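
After the run, the Boruta package provides helpers for inspecting and finalizing the result. A short follow-up, assuming the boruta object created above:

# Resolve any features still marked Tentative with a simple heuristic
# (warns and returns the object unchanged if there are none)
boruta_final <- TentativeRoughFix(boruta)

getSelectedAttributes(boruta_final, withTentative = FALSE)  # confirmed features

attStats(boruta_final)   # per-feature importance statistics and final decisions

# Importance history: confirmed features in green, rejected in red, shadows in blue
plot(boruta, las = 2, cex.axis = 0.7)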



To get help with feature selection related assignments and projects, you can contact us. The Realcode4you team of machine learning experts and professionals will complete your homework or projects as per the given instructions, within your time frame and without any plagiarism issues.


Send your project details to:


realcode4you@gmail.com


