A process that chooses an optimal subset of features according to an objective function
To reduce dimensionality and remove noise
To improve mining performance
Speed of learning
Simplicity and comprehensibility of mined results
Feature Selection and dimensionality reduction:
Improve performance (speed, predictive power, simplicity of the model).
Visualize the data for model selection.
Reduce dimensionality and remove noise.
Feature selection is a process of selecting an optimal subset of features according to a certain criterion.
Other reasons for performing FS may include:
removing irrelevant data and noise.
increasing accuracy of learned models.
reducing the complexity of the resulting model description, improving the understanding of the data and the model.
Dimensionality reduction is an efficient approach to downsizing data
Visualization: projection of high-dimensional data onto 2D or 3D
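For instance, such a projection can be obtained with principal component analysis (PCA). Below is a minimal sketch using base R's prcomp on the built-in iris data (the data set and plot settings are our illustrative choices):

# Project the 4-D iris measurements onto the first two principal components
pca <- prcomp(iris[, 1:4], scale. = TRUE)  # standardize, then rotate
plot(pca$x[, 1:2], col = iris$Species,
     xlab = "PC1", ylab = "PC2",
     main = "iris projected onto 2D by PCA")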
Applications of Dimensionality Reduction
Customer relationship management
Handwritten digit recognition
How it works
Searching for the best subset of features.
Criteria for evaluating different subsets
Different Aspects of Search
Search starting points
Sequential forward selection (sketched in R after this list)
Sequential backward elimination
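To make the search concrete, here is a minimal sketch of sequential forward selection in R: it greedily adds the feature that most improves a wrapper criterion, here random forest out-of-bag (OOB) accuracy on iris. The data set, criterion, and stopping rule are illustrative assumptions, not a fixed recipe:

library(randomForest)

set.seed(42)
features <- setdiff(names(iris), "Species")

# Wrapper criterion: OOB accuracy of a random forest on the given features
oob_accuracy <- function(feats) {
  rf <- randomForest(iris[, feats, drop = FALSE], iris$Species)
  1 - tail(rf$err.rate[, "OOB"], 1)
}

selected <- character(0)
best_score <- 0
repeat {
  candidates <- setdiff(features, selected)
  if (length(candidates) == 0) break
  scores <- sapply(candidates, function(f) oob_accuracy(c(selected, f)))
  if (max(scores) <= best_score) break  # stop when no candidate improves the score
  best_score <- max(scores)
  selected <- c(selected, names(which.max(scores)))
}
selected  # the greedily chosen subset

Sequential backward elimination works in reverse: start from all features and repeatedly drop the one whose removal hurts the criterion least.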
Models of Feature Selection
Filter model:
Separates feature selection from classifier learning
Relies on general characteristics and statistics of the data (correlation, distance, dependence, consistency)
Example: a filter algorithm based on an entropy measure or information gain
Wrapper model:
Relies on a predetermined classification algorithm
Uses predictive accuracy as the goodness measure
High accuracy, but computationally expensive
Example: a wrapper algorithm based on clustering or classification accuracy
Wrapper-based methods tend to give better performance because they evaluate feature subsets with the target classifier itself, but for the same reason they are computationally expensive.
Filter methods are generally less accurate but much faster to compute.
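As a contrast to the wrapper example later in this post, here is a minimal filter-style sketch using the FSelector package (one option among many): features are ranked by information gain and the top k are kept, without training any classifier. The package choice and the k = 2 cutoff are our assumptions:

library(FSelector)

weights <- information.gain(Species ~ ., data = iris)  # information gain per feature
print(weights)
top2 <- cutoff.k(weights, k = 2)  # names of the two highest-ranked features
print(top2)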
Drawbacks of Feature Selection in Some Cases
The subsets produced by many FS models depend strongly on the size of the training set.
When features are interdependent, removing any one of them can seriously affect learning performance.
A backward elimination strategy is very slow when working with large-scale data sets.
In some cases, the FS outcome will still be left with a relatively large number of relevant features.
Example of feature selection in R: wrapper approach
In this example we will use the Boruta package.
Boruta is a feature selection algorithm that works as a wrapper around random forest.
Random forests are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the individual trees' classes (for classification) or the mean of their predictions (for regression).
How does the Boruta algorithm work?
Firstly, it adds randomness to the given data set by creating shuffled copies of all features (called shadow features).
Then, it trains a random forest classifier on the extended data set and applies a feature importance measure.
At every iteration, it checks whether a real feature has a higher importance than the best of its shadow features and progressively removes features deemed unimportant.
Finally, the algorithm stops either when all features have been confirmed or rejected, or when it reaches a specified limit of random forest runs.
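The shadow-feature idea is easy to sketch on its own. The following is a conceptual illustration only (not Boruta's actual internals): each real column is copied and shuffled, so the shadows keep the same marginal distribution but lose any relationship to the target:

set.seed(1)
X <- iris[, 1:4]
shadows <- as.data.frame(lapply(X, sample))  # permute each column independently
names(shadows) <- paste0("shadow_", names(X))
extended <- cbind(X, shadows)  # real features plus their shadow copies
head(extended)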
Application of Boruta algorithm and Random forest in R
Required libraries:
library(Boruta)
library(mlbench)
library(caret)
library(randomForest)
library(reprtree)
set.seed(111)  # for reproducibility
boruta <- Boruta(Species ~ ., data = iris, doTrace = 2, maxRuns = 500)
print(boruta)  # reports each feature as Confirmed, Tentative, or Rejected
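After the run, the result can be inspected and used for modelling. This follow-up sketch relies on Boruta's own helpers (plot, TentativeRoughFix, getSelectedAttributes, attStats); refitting a random forest on the confirmed features is our illustrative addition:

plot(boruta, las = 2, cex.axis = 0.7)  # importance boxplots, shadow features included

boruta_final <- TentativeRoughFix(boruta)  # resolve any features still marked Tentative
selected <- getSelectedAttributes(boruta_final, withTentative = FALSE)
print(selected)  # names of the confirmed features
attStats(boruta_final)  # per-feature importance statistics and decisions

# Illustrative: refit a random forest on the confirmed features only
rf <- randomForest(Species ~ ., data = iris[, c(selected, "Species")])
print(rf)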
To get help with Feature Selection related assignments and projects, you can contact us. The Realcode4you team of machine learning experts and professionals can complete your homework or projects as per the given instructions, within your time frame, and without any plagiarism issues.
Send your project details at: