Data Analysis Preprocessing Using DonorsChoose Dataset

DonorsChoose receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it is approved to be posted on the website. Next year, DonorsChoose expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:

  • How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible

  • How to increase the consistency of project vetting across different volunteers to improve the experience for teachers

  • How to focus volunteer time on the applications that need the most assistance

The goal of the competition is to predict whether or not a project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose can then use this information to identify projects most likely to need further review before approval.

About the DonorsChoose Data Set

The train.csv data set provided by DonorsChoose contains the following features:


Additionally, the resources.csv data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project and has four columns: id, description, quantity, and price.

Note: Many projects require multiple resources. The id value corresponds to a project_id in train.csv, so you can use it as a key to retrieve all resources needed for a project.
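As a minimal sketch of that key lookup, the snippet below filters the resource rows for one project and then aggregates cost per project before joining back onto the project table. The two small DataFrames are invented stand-ins for train.csv and resources.csv, and the names cost, total_cost, and total_quantity are illustrative assumptions rather than columns from the original data.

```python
import pandas as pd

# Toy stand-ins for train.csv and resources.csv (values invented for illustration).
projects = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "project_is_approved": [1, 0, 1],
})
resources = pd.DataFrame({
    "id": ["p1", "p1", "p2"],
    "description": ["pencils", "notebooks", "tablet"],
    "quantity": [10, 5, 1],
    "price": [0.5, 2.0, 150.0],
})

# Retrieve every resource row that belongs to a single project.
p1_resources = resources[resources["id"] == "p1"]

# Aggregate per project: total cost (price * quantity) and total item count.
# "cost", "total_cost" and "total_quantity" are hypothetical column names.
resources = resources.assign(cost=resources["price"] * resources["quantity"])
per_project = resources.groupby("id", as_index=False).agg(
    total_cost=("cost", "sum"),
    total_quantity=("quantity", "sum"),
)

# Left join keeps projects that have no resource rows (their totals become NaN).
merged = projects.merge(per_project, on="id", how="left")
print(merged)
```

A left join is used here so that no project row is dropped; whether missing totals should then be filled with 0 depends on how the features are used later.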

The data set contains the following label (the value you will attempt to predict): project_is_approved, which indicates whether the project proposal was approved.

Import Necessary Packages

%matplotlib inline
import warnings

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

import re
# Tutorial about Python regular expressions:

from nltk.corpus import stopwords
import pickle

from tqdm import tqdm
import os

1. Reading Data

# Load only the first 5000 projects to keep the walkthrough fast
project_data = pd.read_csv('train_data.csv', nrows=5000)
resource_data = pd.read_csv('resources.csv')
print("Number of data points in train data", project_data.shape)
print('-' * 50)
print("The attributes of data :", project_data.columns.values)


Number of data points in train data (5000, 17)
--------------------------------------------------
The attributes of data : ['Unnamed: 0' 'id' 'teacher_id' 'teacher_prefix' 'school_state' 'project_submitted_datetime' 'project_grade_category' 'project_subject_categories' 'project_subject_subcategories' 'project_title' 'project_essay_1' 'project_essay_2' 'project_essay_3' 'project_essay_4' 'project_resource_summary' 'teacher_number_of_previously_posted_projects' 'project_is_approved']

print("Number of data points in resource data", resource_data.shape)
print(resource_data.columns.values)


Number of data points in resource data (1541272, 4)
['id' 'description' 'quantity' 'price']

2. Preprocessing Categorical Features: project_grade_category



Grades PreK-2 2002
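Counts like the one above are the kind of output value_counts() produces for a categorical column. The sketch below shows that call on an invented toy series and then one possible cleaning step, which is an assumption and not confirmed by the source: replacing spaces and hyphens with underscores and lowercasing, so each category becomes a single token that a vectorizer such as CountVectorizer will not split.

```python
import pandas as pd

# Toy stand-in for project_data['project_grade_category'] (values invented).
grades = pd.Series(
    ["Grades PreK-2", "Grades 3-5", "Grades PreK-2", "Grades 6-8", "Grades 9-12"],
    name="project_grade_category",
)

# Frequency of each category.
counts = grades.value_counts()
print(counts)

# Assumed cleaning step: collapse each label into one lowercase token.
cleaned = (
    grades.str.replace(" ", "_", regex=False)
    .str.replace("-", "_", regex=False)
    .str.lower()
)
print(cleaned.unique())
```

On the real data the same two steps would be applied to project_data['project_grade_category'] directly.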