
2020 Stack Overflow Annual Developer Survey Using Python Machine Learning



In this coursework, you are asked to analyse the "2020 Stack Overflow Annual Developer Survey" dataset, which is available at https://insights.stackoverflow.com/survey. The data were collected by Stack Overflow, a question and answer site for professional and enthusiast programmers. In 2020, there were 65,000 responses from over 180 countries and dependent territories. The survey examines all aspects of programmers’ experience, from career satisfaction to opinions on open source software. This information could be useful to you as software developers. For an overview of the 2020 Stack Overflow Developer Survey, you may refer to https://insights.stackoverflow.com/survey/2020.


The following tasks are required in the coursework:

(1) To understand the data set. You must implement this task by exploratory data analysis through Python programming.


(2) To describe characteristics of low-income developers and high-income developers.

You must implement this task by cluster analysis through Python programming.


Note: Low income is defined as a salary up to and including the median of all developers’ annual salaries, while high income is defined as a salary above that median. Income is measured by the column ConvertedComp, which represents salary converted to annual USD.


(3) To build machine learning models for predicting whether a developer is high-income based on the survey data. You must implement this task by classification through Python programming.
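As an illustration of the income definition in the note above, the following minimal sketch derives a binary target from ConvertedComp. The function name `label_income`, the column name `HighIncome` and the toy salaries are assumptions made for this example and are not part of the survey data. Rows without a salary are simply dropped here, which is only one possible way to handle them.

```python
import pandas as pd


def label_income(df: pd.DataFrame, col: str = "ConvertedComp") -> pd.DataFrame:
    """Drop rows without a salary and add a binary HighIncome column."""
    out = df.dropna(subset=[col]).copy()
    median = out[col].median()
    # Low income: salary <= median (0); high income: salary > median (1).
    out["HighIncome"] = (out[col] > median).astype(int)
    return out


# Toy salaries, invented for illustration only.
toy = pd.DataFrame({"ConvertedComp": [20000, 50000, None, 90000, 50000, 120000]})
labelled = label_income(toy)
print(labelled)
```

The same column can serve both to split the data into the two subsets for cluster analysis and as the target variable for classification.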


Requirement:

  • Introduction

  • Data Understanding and Exploratory Data Analysis

  • Cluster Analysis

  • Machine Learning Methods and their Implementation

  • Evaluation of Machine Learning Models

  • Discussions and Conclusions

A report of 20-30 pages is recommended. The report must not exceed 30 pages in total (excluding title page, contents page, references, and appendices), with the main text set in Calibri at size 11 or 12. A penalty of a single grade will be incurred if you exceed the 30-page limit. You may put extra information in appendices, which are not counted towards the 30-page limit.


You are asked to write the report using the report template provided at the end of this specification. It is recommended to cite and list references using the Harvard referencing style (see https://www.ntu.ac.uk/m/library/referencing-made-easy). However, other (author, year) styles such as APA are also accepted.


By the submission deadline, you are expected to submit both your report (in MS Word or PDF format) and the Python source code (in plain text format) to NOW Dropbox under the ‘report’ folder.


Your report will be assessed according to the assessment criteria provided in Section II.

The remainder of this specification provides you with detailed requirements for each area of content – you should read it very carefully.



REPORT TEMPLATE

Introduction


Data Understanding, Data Pre-processing, Exploratory Data Analysis

  • Describe the Survey Data Set (Stack Overflow Developer Survey, 2020)

  • Briefly describe the data attributes (attribute name, description and data type) using descriptive statistics and exploratory data analysis (Larose and Larose, 2015). Note: it is not necessary to include all results (e.g., tables and figures) in the main text. Select only the most important results for the main text and leave the others in the appendix.

  • Describe the characteristics of the data set, such as (though not limited to) the number of instances, possibly duplicate or conflicting instances, missing values, or erroneous values, outliers.

  • If any duplicate or conflicting instances, missing values, erroneous values or outliers exist, describe the process of cleaning these data.

  • Conduct exploratory data analysis on the data set, for example (though not limited to): identify outliers using a histogram, box plot or scatter plot; visualise the percentage of classes using a pie chart; explore the relation between features and the target variable using a crosstab and stacked bar plot, and so on.
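The checks listed above could be sketched as follows. This is a minimal example on an invented toy table, not the survey itself: the helper names `summarise` and `iqr_outliers`, the `HighIncome` label and the 1.5 × IQR rule are assumptions chosen for illustration, and the Employment column is used only as an example feature.

```python
import pandas as pd


def summarise(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count and missing percentage per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
    })


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Toy stand-in for the survey data (values invented for illustration).
toy = pd.DataFrame({
    "ConvertedComp": [30000, 40000, 50000, 60000, 2000000, None],
    "Employment": ["Full-time", "Full-time", "Part-time",
                   "Full-time", "Full-time", "Part-time"],
})

summary = summarise(toy)
n_duplicates = int(toy.duplicated().sum())
outlier_mask = iqr_outliers(toy["ConvertedComp"])

# Drop missing salaries and flagged outliers, then relate a feature to the target.
clean = toy[toy["ConvertedComp"].notna() & ~outlier_mask]
high = (clean["ConvertedComp"] > clean["ConvertedComp"].median()).astype(int)
table = pd.crosstab(clean["Employment"], high.rename("HighIncome"), normalize="index")

print(summary, f"duplicates: {n_duplicates}", table, sep="\n\n")
```

With matplotlib installed, `table.plot(kind="bar", stacked=True)` would draw the row-normalised crosstab as a stacked bar plot. Whether outliers should be removed, capped or kept is a modelling decision that the report should justify.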


Cluster Analysis

  • Split the data set into two subsets, labelled low-income (= up to the median of all developers’ annual salaries) and high-income (= more than the median of all developers’ annual salaries). Note: income is given in the column ConvertedComp, which represents salary converted to annual USD. Perform cluster analysis on these two subsets separately using clustering methods such as k-Means and hierarchical clustering.

  • If applicable to a machine learning method, describe the process of data transformation and normalisation used in that method.

  • Implement cluster analysis on each subset. Describe parameter setting, initialisation, stopping criterion and discuss your choice of cluster number.

  • Describe characteristics of low-income developer clusters and high-income developer clusters found in cluster analysis.
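One way to approach the clustering steps above is sketched below, assuming k-Means with standardised numeric features and the silhouette score as the criterion for the number of clusters. The synthetic blobs stand in for one income subset, and `choose_k` is a hypothetical helper. Other criteria (for example the elbow method) and other algorithms (for example hierarchical clustering) are equally valid choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def choose_k(X, k_values=range(2, 7), seed=0):
    """Fit k-Means for each k and return the k with the best silhouette score."""
    scores = {}
    for k in k_values:
        # n_init restarts guard against poor initialisation; max_iter and tol
        # are the stopping criteria (shown here with scikit-learn's defaults).
        km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    max_iter=300, tol=1e-4, random_state=seed)
        scores[k] = silhouette_score(X, km.fit_predict(X))
    return max(scores, key=scores.get), scores


# Synthetic stand-in for one income subset (three well-separated groups).
X_raw, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                      cluster_std=1.0, random_state=0)

# Standardise so that no single feature dominates the Euclidean distance.
X = StandardScaler().fit_transform(X_raw)

best_k, scores = choose_k(X)
model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
print("best k:", best_k)
print("cluster sizes:", sorted((model.labels_ == c).sum() for c in range(best_k)))
```

On the real subsets, the cluster characteristics could then be described by grouping the original (unscaled) features by `model.labels_` and comparing the group summaries.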


Machine Learning for Classification and its Implementation

  • Describe the workflow of machine learning for classification using a flow-chart(s).

  • State and describe the classification methods that are used in your coursework. At least three classifiers should be chosen for the classification problem. The methods may be chosen from those taught in this module, such as k-Nearest Neighbour, Decision Trees, Logistic Regression and Artificial Neural Networks. You may also choose methods that are not taught in this module.

  • Describe parameter setting, data transformation and normalisation in the methods that you have chosen for the task.

  • If applicable to a machine learning method, describe the process of data transformation and normalisation used in that method.

  • Build and implement machine learning models and tune the hyper-parameters of these models for good performance. You may implement these models using scikit-learn (sklearn) modules. You may also use any other Python libraries that are not taught in this module.

  • Implement ensemble learning by combining your classifiers. Describe the ensemble method(s) that you are using.
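The workflow described in this section might look like the following sketch. It assumes three base classifiers (k-Nearest Neighbour, Logistic Regression and a Decision Tree), a small grid search over the kNN neighbourhood size, and soft voting as the ensemble method. The data are synthetic, and the specific parameter grids and settings are illustrative assumptions rather than recommended values. Categorical survey columns would need encoding before a pipeline like this could be applied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded survey features and the HighIncome target.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_redundant=0, class_sep=1.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Distance- and gradient-based models get a scaler; the tree does not need one.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
tree = DecisionTreeClassifier(max_depth=5, random_state=0)

# Hyper-parameter tuning with 5-fold cross-validation on the training set only.
search = GridSearchCV(knn, {"kneighborsclassifier__n_neighbors": [3, 5, 11, 21]},
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Soft voting averages the predicted class probabilities of the base models.
ensemble = VotingClassifier(
    estimators=[("knn", search.best_estimator_), ("logreg", logreg), ("tree", tree)],
    voting="soft",
)
ensemble.fit(X_train, y_train)
ensemble_accuracy = ensemble.score(X_test, y_test)

print("best kNN params:", search.best_params_)
print("ensemble test accuracy:", round(ensemble_accuracy, 3))
```

Keeping the scaler inside each pipeline means it is refitted on the training folds during cross-validation, which avoids leaking information from the validation folds.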


Evaluation of Machine Learning Models

  • Evaluate and compare the performance of the machine learning models (both base and ensemble models). You should use at least one of the following performance metrics (as appropriate): accuracy, confusion matrix, recall and precision, or ROC curve.

  • Explain your results. Generate tables to list the results or figures to visualize the results.

  • Review the performance of the different models (base and ensemble models). Critically review which model performed best and which hyper-parameter settings were most effective, and provide the necessary explanations.
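The metrics named above can be computed with scikit-learn as in the sketch below. The `evaluate` helper and the synthetic data are assumptions for illustration. Any fitted classifier that provides `predict` and `predict_proba` could be passed in, so the same function could be applied to each base model and to the ensemble to build a comparison table.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Return common binary-classification metrics for a fitted model."""
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
        # Rows are true classes, columns are predicted classes.
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }


# Synthetic stand-in for the prepared survey data.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           n_redundant=0, class_sep=1.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
results = evaluate(model, X_test, y_test)
for name, value in results.items():
    print(name, value, sep=": ")
```

Because the median split produces roughly balanced classes, accuracy is a reasonable headline metric here, but precision, recall and the confusion matrix show which kind of error each model makes.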


Discussions and Conclusions

  • Summarise your task and your findings in the data analysis on this survey data set.

  • Describe the insights that you have gained from the module “machine learning for data analytics”.

  • Explain whether, and how well, the module has developed your understanding of machine learning for data analytics.




