
EasyVisa Data Analysis Project Help | Machine Learning Data Analysis Assignment Help

Problem Statement

The increasing number of applicants every year calls for a Machine Learning based solution that can help in shortlisting the candidates having higher chances of VISA approval. OFLC has hired the firm EasyVisa for data-driven solutions. You as a data scientist at EasyVisa have to analyze the data provided and, with the help of a classification model:

  • Facilitate the process of visa approvals.

  • Recommend a suitable profile for the applicants for whom the visa should be certified or denied based on the drivers that significantly influence the case status.

Data Analysis

  • The data contains attributes of both the employee and the employer. The data dictionary includes case_id, continent, education_of_employee, has_job_experience, and several other fields.

  • The dataset has 12 features.


Data Processing

  • Data pre-processing: raw data often contains noise, missing values, and inconsistent formats, so it cannot be fed directly to machine learning algorithms.

  • Pre-processing cleans the data and makes it suitable for an ML model, which improves the model's efficiency and accuracy.

  • It is a crucial step in Machine Learning because it improves the quality of the data.

  • Check for null values and, if any exist, fill them.
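
The null check described above could look roughly like the sketch below. The toy DataFrame and its values are made up for illustration, and filling with the median and mode is only one common choice, not necessarily what was used in this project.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are invented for illustration).
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", None, "Asia"],
    "prevailing_wage": [50000.0, np.nan, 72000.0, 61000.0],
})

# Count missing values in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# Assumed strategy: fill numeric gaps with the median, categorical gaps with the mode.
filled = toy.copy()
filled["prevailing_wage"] = filled["prevailing_wage"].fillna(filled["prevailing_wage"].median())
filled["continent"] = filled["continent"].fillna(filled["continent"].mode()[0])
print(filled)
```

If a column has many missing values, dropping it or adding a separate "missing" category may be a better option than filling.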


Load Dataset

This is the initial step: load the dataset before starting the data analysis.

import pandas as pd

# Read data
visa = pd.read_csv('/content/drive/My Drive/EasyVisa.csv')

# Copy the data to another variable to avoid changing the original data
data = visa.copy()
data.head()

Checking Duplicate Values

The code below finds duplicate rows in the dataset.


# checking for duplicate values  
data.duplicated() 

The duplicated() function returns a boolean Series with one entry per row: “False” if the row has not appeared before and “True” if it repeats an earlier row.
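
Because duplicated() gives one True/False per row, it is usually easier to count the True values than to read the whole Series. A minimal sketch on made-up rows (the case_id values here are hypothetical):

```python
import pandas as pd

# Toy rows; the third row repeats the second one exactly.
toy = pd.DataFrame({
    "case_id": ["EZYV01", "EZYV02", "EZYV02"],
    "continent": ["Asia", "Europe", "Europe"],
})

dup_mask = toy.duplicated()       # boolean Series, one entry per row
n_duplicates = dup_mask.sum()     # number of rows that repeat an earlier row
deduped = toy.drop_duplicates()   # keeps the first occurrence of each row

print(n_duplicates)
print(deduped)
```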


Exploratory Data Analysis

In this step we visualize the features of the dataset.

  • Visualizations make it easier to understand the features and their relationships with one another.

  • The analysis is divided into two categories:

  • Univariate Analysis

  • Multivariate Analysis


Observations on number of employees

  • The graph shows that most records fall between 0 and 100k employees.

  • The box plot shows many outliers, which may need to be removed or treated to get better results.
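
Box plots usually flag outliers with the 1.5 × IQR rule, so the same rule can be used to find them in code. The sketch below uses a small invented Series rather than the real column, and removing the flagged rows is only one possible treatment.

```python
import pandas as pd

# Invented values; 500 is far away from the rest.
values = pd.Series([10, 12, 14, 15, 16, 18, 20, 500])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are what a box plot draws as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
trimmed = values[(values >= lower) & (values <= upper)]
print(outliers)
```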



Observations on prevailing wage

  • The distribution peaks at the low end and then declines as prevailing_wage increases.

  • It also has many outliers, which are visible in the box plot.



Observations on continent

  • Asia accounts for the largest number of applications compared with the other continents.

  • ‘Oceania’ accounts for the smallest number.



Observations on education of employee

  • Applicants with a bachelor's degree make up 42.2% of applications, which is the largest share.

  • Applicants with a doctorate make up only 8.6%, which is low compared with the other levels.

  • So the recommendation is to focus on increasing applications at the doctorate level.
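
Percentages like the ones above can be computed with value_counts(normalize=True). This is a sketch on a tiny invented sample, so the numbers it prints are not the project's figures.

```python
import pandas as pd

# Invented sample of education levels.
education = pd.Series(
    ["Bachelor's", "Master's", "Bachelor's", "Doctorate", "High School"],
    name="education_of_employee",
)

# Share of each category, as a percentage rounded to one decimal place.
share = education.value_counts(normalize=True).mul(100).round(1)
print(share)
```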



Observations on job experience

  • The plot above displays the number of applicants with and without job experience.


Observations on job training

  • The graph shows that only 2,955 records require job training, while a much larger number, 22,535, do not.

  • So, I recommend increasing the share of cases with job training relative to those without it.


Observations on region of employment

  • The Northeast region has more employment than the other regions (7,195).

  • The Island region has the lowest employment.

  • I recommend focusing on employment in the Island region to address this.



Observations on unit of wage

  • The most common unit of wage is yearly and the least common is monthly.

  • See the graph for more detail.


Correlation Plot

  • Each of the three variables is perfectly correlated with itself, which appears as the blue diagonal.

  • Some of the off-diagonal values are negative, and values close to zero indicate a weak correlation between those variables.
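
A correlation matrix like the one plotted can be computed with DataFrame.corr(). The sketch below uses three hypothetical numeric columns (a, b, c) with invented values; on the real data you would pass the numeric columns of the dataset instead.

```python
import pandas as pd

# Hypothetical numeric columns with invented values.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # moves exactly with a
    "c": [5, 3, 4, 1, 2],    # tends to fall as a rises
})

# Pearson correlation between every pair of columns.
corr = toy.corr()
print(corr.round(2))
```

The diagonal is always 1 because every column is compared with itself, so only the off-diagonal cells say anything about relationships between features.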


Those with higher education may want to travel abroad for a well-paid job. Let's find out if education has any impact on visa certification

  • Denied cases are fewest at the higher education levels and most frequent at the high school level.

  • We need to investigate why the high school level has the most denied cases.
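
One way to compare denial rates across education levels is a row-normalized cross-tabulation. In this sketch the rows are invented and the column name case_status is an assumption about how the target is stored in the dataset.

```python
import pandas as pd

# Invented rows; "case_status" is an assumed name for the target column.
toy = pd.DataFrame({
    "education_of_employee": ["High School", "High School", "Doctorate", "Doctorate", "Doctorate"],
    "case_status": ["Denied", "Denied", "Certified", "Certified", "Denied"],
})

# Each row sums to 1, so the cells are the certified/denied share per education level.
rates = pd.crosstab(
    toy["education_of_employee"],
    toy["case_status"],
    normalize="index",
)
print(rates.round(2))
```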



Different regions have different requirements of talent having diverse educational backgrounds. Let's analyze it further

  • The heat map shows that the Island region has very low counts across all education levels.

  • We need to focus on the educational background of applicants in the Island region to address this.


Let's have a look at the percentage of visa certifications across each region

  • The Island region has the lowest percentage of certified visas and the Midwest has the highest.

  • So we need to focus on the Island region and the other regions with low certification rates.



The US government has established a prevailing wage to protect local talent and foreign workers. Let's analyze the data and see if the visa status changes with the prevailing wage

  • When the outliers are removed, the graph increases for both certified and denied cases.



Data Preparation for modeling

  • We want to predict which visa will be certified.

  • Before we proceed to build a model, we'll have to encode categorical features.

  • We'll split the data into train and test to be able to evaluate the model that we build on the train data
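
The encoding and train/test split described above might be sketched as follows. The rows are invented, the column name case_status is an assumption, and one-hot encoding with a 70/30 stratified split is just one reasonable setup, not necessarily the one used here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented rows; "case_status" is an assumed name for the target column.
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", "Africa", "Asia",
                  "Europe", "Asia", "Africa", "Europe", "Asia"],
    "has_job_experience": ["Y", "N", "Y", "N", "Y", "Y", "N", "N", "Y", "N"],
    "prevailing_wage": [52000.0, 31000.0, 78000.0, 45000.0, 66000.0,
                        59000.0, 38000.0, 41000.0, 83000.0, 47000.0],
    "case_status": ["Certified", "Denied", "Certified", "Denied", "Certified",
                    "Certified", "Denied", "Denied", "Certified", "Denied"],
})

# Encode the target as 1 (Certified) / 0 (Denied).
y = (toy["case_status"] == "Certified").astype(int)

# One-hot encode the categorical features; drop_first avoids a redundant column per feature.
X = pd.get_dummies(toy.drop(columns="case_status"), drop_first=True)

# Hold out 30% of the rows for testing, keeping the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```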

The following section uses a decision tree model.



Decision Tree Classifier


Scores on the training data:

Precision: the precision score is 1

Recall: the recall score is also 1

F1-Score: the f1-score is 1

Accuracy: the accuracy of this decision tree model is 1
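
For reference, all four scores come from the counts in a confusion matrix. The helper below is a hypothetical illustration with invented counts; it is not the code that produced the numbers in this post, and it does not guard against division by zero.

```python
def classification_scores(tp, fp, fn, tn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)            # of the predicted positives, how many were right
    recall = tp / (tp + fn)               # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    accuracy = (tp + tn) / (tp + fp + fn + tn)           # share of all predictions that were right
    return precision, recall, f1, accuracy


# Invented counts for illustration.
precision, recall, f1, accuracy = classification_scores(tp=40, fp=10, fn=10, tn=40)
print(precision, recall, f1, accuracy)
```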



Scores on the test data:

Precision: the precision score is 0.50

Recall: the recall score is 0.49

F1-Score: the f1-score is 0.50

Accuracy: the accuracy of this decision tree model is 0.66
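
Perfect training scores next to much lower test scores usually mean the tree has memorized the training data (overfitting). The sketch below shows how that gap can be inspected on synthetic data from make_classification. It does not use the EasyVisa dataset, and limiting max_depth is just one common way to reduce overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data used only for illustration.
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# An unrestricted tree keeps splitting until every training row is classified correctly.
full_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# A depth-limited tree cannot memorize the training set as easily.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)

for name, model in [("full tree", full_tree), ("max_depth=3", shallow_tree)]:
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name}: train accuracy={train_acc:.2f}, test accuracy={test_acc:.2f}")
```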



Bagging Classifier

Bagging is an ensemble machine learning approach that combines the outputs of many learners to improve performance.


These algorithms work by drawing subsets of the training set and training a separate model on each subset.
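
A minimal bagging sketch with scikit-learn is shown below. It runs on synthetic data rather than the EasyVisa dataset, and the settings (50 estimators, the default decision-tree base learner) are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Each of the 50 base learners (decision trees by default) is trained on its own
# bootstrap sample of the training set, and their predictions are combined.
bagging = BaggingClassifier(n_estimators=50, random_state=1)
bagging.fit(X_train, y_train)

train_acc = bagging.score(X_train, y_train)
test_acc = bagging.score(X_test, y_test)
print(f"train accuracy={train_acc:.2f}, test accuracy={test_acc:.2f}")
```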


Confusion Matrix:


Model Performance on training Set:

Accuracy: 0.98

Recall: 0.98

Precision: 0.99

F1-score: 0.98


Model Performance on test Set:

Accuracy: 0.70

Recall: 0.77

Precision: 0.77

F1-score: 0.77





Random Forest

Random forest is a supervised machine learning algorithm that is widely used in classification and regression problems. It builds decision trees on different samples and takes their majority vote for classification and their average for regression.
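
A random forest can be fitted in much the same way as the other models. The sketch below uses synthetic data and assumed settings (100 trees), so its scores say nothing about the EasyVisa results reported in this section.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Each tree sees a bootstrap sample of the rows and a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

train_acc = forest.score(X_train, y_train)
test_acc = forest.score(X_test, y_test)
print(f"train accuracy={train_acc:.2f}, test accuracy={test_acc:.2f}")

# Feature importances sum to 1 and hint at which inputs drive the predictions.
importances = forest.feature_importances_
print(importances.round(3))
```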


Model Performance on training Set:

Accuracy: 1.0

Recall: 1.0

Precision: 1.0

F1-score: 1.0


Model Performance on test Set:

Accuracy: 0.71

Recall: 0.83

Precision: 0.76

F1-score: 0.7




Here you can also get data analysis assignment help, project help, and homework help. For any help related to Python, Java, R, or other programming languages, send your project requirement details to:


realcode4you@gmail.com
