DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website.
Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:
How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible
How to increase the consistency of project vetting across different volunteers to improve the experience for teachers
How to focus volunteer time on the applications that need the most assistance
The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.
About the DonorsChoose Data Set
The train.csv data set provided by DonorsChoose contains the following features:
...
Additionally, the resources.csv data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:
Note: Many projects require multiple resources. The id value in resources.csv corresponds to the id column in train.csv, so you can use it as a key to retrieve all resources needed for a project:
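A minimal sketch of that lookup and of per-project aggregation, using a toy stand-in for resources.csv (the ids and prices here are hypothetical, not from the actual data):

```python
import pandas as pd

# toy stand-in for resources.csv
resources = pd.DataFrame({
    'id': ['p1', 'p1', 'p2'],
    'description': ['pencils', 'books', 'tablet'],
    'quantity': [10, 5, 1],
    'price': [3.50, 40.00, 250.00],
})

# all resources requested by one project
print(resources[resources['id'] == 'p1'])

# total price and quantity per project (the same groupby used later in this tutorial)
totals = resources.groupby('id').agg({'price': 'sum', 'quantity': 'sum'}).reset_index()
print(totals)
```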
The data set contains the following label (the value you will attempt to predict):
Import Necessary Packages
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
from nltk.corpus import stopwords
import pickle
from tqdm import tqdm
import os
1. Reading Data
project_data = pd.read_csv('train_data.csv', nrows=5000)
resource_data = pd.read_csv('resources.csv')
print("Number of data points in train data", project_data.shape)
print('-'*50)
print("The attributes of data :", project_data.columns.values)
output:
Number of data points in train data (5000, 17)
--------------------------------------------------
The attributes of data : ['Unnamed: 0' 'id' 'teacher_id' 'teacher_prefix' 'school_state'
 'project_submitted_datetime' 'project_grade_category' 'project_subject_categories'
 'project_subject_subcategories' 'project_title' 'project_essay_1' 'project_essay_2'
 'project_essay_3' 'project_essay_4' 'project_resource_summary'
 'teacher_number_of_previously_posted_projects' 'project_is_approved']
print("Number of data points in train data", resource_data.shape)
print(resource_data.columns.values)
resource_data.head(2)
output:
Number of data points in train data (1541272, 4) ['id' 'description' 'quantity' 'price']
2. Preprocessing Categorical Features: project_grade_category
project_data['project_grade_category'].value_counts()
output:
Grades PreK-2 2002
Grades 3-5 1729
Grades 6-8 785
Grades 9-12 484
Name: project_grade_category, dtype: int64
We need to remove the spaces, replace '-' with '_', and convert all letters to lowercase.
# https://stackoverflow.com/questions/36383821/pandas-dataframe-apply-function-to-column-strings-based-on-other-column-value
project_data['project_grade_category'] = project_data['project_grade_category'].str.replace(' ','_')
project_data['project_grade_category'] = project_data['project_grade_category'].str.replace('-','_')
project_data['project_grade_category'] = project_data['project_grade_category'].str.lower()
project_data['project_grade_category'].value_counts()
output:
grades_prek_2 2002
grades_3_5 1729
grades_6_8 785
grades_9_12 484
Name: project_grade_category, dtype: int64
3. Preprocessing Categorical Features: project_subject_categories
project_data['project_subject_categories'].value_counts()
output:
Literacy & Language 1067
Math & Science 795
Literacy & Language, Math & Science 679
Health & Sports 509
Music & The Arts 233
Literacy & Language, Special Needs 207
Applied Learning 164
Special Needs 162
Math & Science, Literacy & Language 101
Applied Learning, Literacy & Language 97
Applied Learning, Special Needs 80
Math & Science, Special Needs 80
Literacy & Language, Music & The Arts 79
Math & Science, Music & The Arts 76
History & Civics, Literacy & Language 65
History & Civics 63
...
...
project_data['project_subject_categories'] = project_data['project_subject_categories'].str.replace(' The ','')
project_data['project_subject_categories'] = project_data['project_subject_categories'].str.replace(' ','')
project_data['project_subject_categories'] = project_data['project_subject_categories'].str.replace('&','_')
project_data['project_subject_categories'] = project_data['project_subject_categories'].str.replace(',','_')
project_data['project_subject_categories'] = project_data['project_subject_categories'].str.lower()
project_data['project_subject_categories'].value_counts()
output:
literacy_language 1067
math_science 795
literacy_language_math_science 679
health_sports 509
music_arts 233
literacy_language_specialneeds 207
appliedlearning 164
specialneeds 162
math_science_literacy_language 101
appliedlearning_literacy_language 97
...
...
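The replacement chain above can be factored into a small reusable helper, since the same normalization is applied again to the subcategories below (the name clean_category is ours, not from the data set):

```python
def clean_category(text):
    """Normalize a subject-category string: drop ' The ', remove spaces,
    turn '&' and ',' into '_', and lowercase — same as the chain above."""
    return (text.replace(' The ', '')
                .replace(' ', '')
                .replace('&', '_')
                .replace(',', '_')
                .lower())

print(clean_category('Music & The Arts'))                    # music_arts
print(clean_category('Literacy & Language, Math & Science'))  # literacy_language_math_science
```

It can then be applied column-wise with `project_data['project_subject_categories'].apply(clean_category)`.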
4. Preprocessing Categorical Features: teacher_prefix
project_data['teacher_prefix'].value_counts()
output:
Mrs. 2560
Ms. 1845
Mr. 495
Teacher 100
Name: teacher_prefix, dtype: int64
# check whether there are any NaN values
print(project_data['teacher_prefix'].isnull().values.any())
print("number of nan values",project_data['teacher_prefix'].isnull().values.sum())
output:
False
number of nan values 0
This 5,000-row sample has no missing values; in the full data set the number of missing prefixes is very small, so we can replace any NaN with 'Mrs.', the most common prefix.
project_data['teacher_prefix']=project_data['teacher_prefix'].fillna('Mrs.')
project_data['teacher_prefix'].value_counts()
output:
Mrs. 2560
Ms. 1845
Mr. 495
Teacher 100
Name: teacher_prefix, dtype: int64
Remove the '.' and convert all characters to lowercase.
project_data['teacher_prefix'] = project_data['teacher_prefix'].str.replace('.', '', regex=False)  # '.' is a regex wildcard, so disable regex to remove only the literal dot
project_data['teacher_prefix'] = project_data['teacher_prefix'].str.lower()
project_data['teacher_prefix'].value_counts()
output:
mrs 2560
ms 1845
mr 495
teacher 100
Name: teacher_prefix, dtype: int64
5. Preprocessing Categorical Features: project_subject_subcategories
project_data['project_subject_subcategories'].value_counts()
output:
Literacy                              449
Literacy, Mathematics                 368
Literature & Writing, Mathematics     293
Literacy, Literature & Writing        234
Mathematics                           232
Literature & Writing                  216
Health & Wellness                     179
Special Needs                         162
Applied Sciences, Mathematics         156
...
...
We apply the same process used for project_subject_categories.
project_data['project_subject_subcategories'] = project_data['project_subject_subcategories'].str.replace(' The ','')
project_data['project_subject_subcategories'] = project_data['project_subject_subcategories'].str.replace(' ','')
project_data['project_subject_subcategories'] = project_data['project_subject_subcategories'].str.replace('&','_')
project_data['project_subject_subcategories'] = project_data['project_subject_subcategories'].str.replace(',','_')
project_data['project_subject_subcategories'] = project_data['project_subject_subcategories'].str.lower()
project_data['project_subject_subcategories'].value_counts()
output:
literacy                           449
literacy_mathematics               368
literature_writing_mathematics     293
literacy_literature_writing        234
mathematics                        232
literature_writing                 216
health_wellness                    179
specialneeds                       162
appliedsciences_mathematics        156
...
...
6. Preprocessing Categorical Features: school_state
project_data['school_state'].value_counts()
output:
CA 707
TX 352
NY 342
FL 261
NC 246
SC 191
IL 184
GA 164
PA 151
MI 151
OH 122
...
...
Convert all state codes to lowercase.
project_data['school_state'] = project_data['school_state'].str.lower()
project_data['school_state'].value_counts()
output:
ca 707
tx 352
ny 342
fl 261
nc 246
sc 191
il 184
ga 164
mi 151
pa 151
oh 122
...
...
7. Preprocessing Text Features: project_title
# https://stackoverflow.com/a/47091490/4084039
import re
def decontracted(phrase):
# specific
phrase = re.sub(r"won't", "will not", phrase)
phrase = re.sub(r"can\'t", "can not", phrase)
# general
phrase = re.sub(r"n\'t", " not", phrase)
phrase = re.sub(r"\'re", " are", phrase)
phrase = re.sub(r"\'s", " is", phrase)
phrase = re.sub(r"\'d", " would", phrase)
phrase = re.sub(r"\'ll", " will", phrase)
phrase = re.sub(r"\'t", " not", phrase)
phrase = re.sub(r"\'ve", " have", phrase)
phrase = re.sub(r"\'m", " am", phrase)
return phrase
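One caveat worth knowing: the `\'s` rule also rewrites possessives (e.g. "teacher's" becomes "teacher is"), which is usually an acceptable trade-off for bag-of-words features. A quick spot check, reproducing the function above:

```python
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

# "won't" is expanded correctly, but the possessive "teacher's" is mangled
print(decontracted("The teacher's class won't wait"))  # The teacher is class will not wait
```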
# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'
# note: this list shadows the `stopwords` module imported from nltk above
stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
'won', "won't", 'wouldn', "wouldn't"]
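Because 'no', 'nor', and 'not' are deliberately left out of this list, negations survive the filtering, which matters when the text carries sentiment. A sketch using only a small subset of the list above:

```python
# small slice of the stop-word list above, for illustration
stop_subset = {'this', 'is', 'a', 'the', 'of'}

sentence = "this is not a good book"
filtered = ' '.join(w for w in sentence.split() if w not in stop_subset)
print(filtered)  # not good book
```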
project_data['project_title'].head(5)
print("printing some random reviews")
print(9, project_data['project_title'].values[9])
print(34, project_data['project_title'].values[34])
print(147, project_data['project_title'].values[147])
output:
printing some random reviews
9 Just For the Love of Reading--\r\nPure Pleasure
34 \"Have A Ball!!!\"
147 Who needs a Chromebook?\r\nWE DO!!
# Combining all the above preprocessing steps
from tqdm import tqdm
def preprocess_text(text_data):
preprocessed_text = []
# tqdm is for printing the status bar
for sentance in tqdm(text_data):
sent = decontracted(sentance)
sent = sent.replace('\\r', ' ')
sent = sent.replace('\\n', ' ')
sent = sent.replace('\\"', ' ')
sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
# https://gist.github.com/sebleier/554280
sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
preprocessed_text.append(sent.lower().strip())
return preprocessed_text
preprocessed_titles = preprocess_text(project_data['project_title'].values)
print("printing some random reviews")
print(9, preprocessed_titles[9])
print(34, preprocessed_titles[34])
print(147, preprocessed_titles[147])
output:
printing some random reviews
9 love reading pure pleasure
34 ball
147 needs chromebook
8. Preprocessing Text Features: essay
# merge the four essay columns into a single text column:
project_data["essay"] = project_data["project_essay_1"].map(str) +\
project_data["project_essay_2"].map(str) + \
project_data["project_essay_3"].map(str) + \
project_data["project_essay_4"].map(str)
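Note that `map(str)` turns a missing essay (NaN) into the literal string 'nan', which is why the sample output below ends in "nannan". A hedged alternative (column names match the data set; the toy row is ours) is to fill missing essays with empty strings before joining:

```python
import pandas as pd
import numpy as np

# toy single-row stand-in: essays 3 and 4 are missing, as in many real rows
df = pd.DataFrame({
    'project_essay_1': ['Our class loves to read.'],
    'project_essay_2': ['We need books.'],
    'project_essay_3': [np.nan],
    'project_essay_4': [np.nan],
})

essay_cols = ['project_essay_1', 'project_essay_2', 'project_essay_3', 'project_essay_4']
# fill NaN with '' so missing essays do not become the string 'nan'
df['essay'] = df[essay_cols].fillna('').apply(' '.join, axis=1).str.strip()
print(df['essay'].iloc[0])  # Our class loves to read. We need books.
```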
print("printing some random essay")
print(9, project_data['essay'].values[9])
print('-'*50)
print(34, project_data['essay'].values[34])
print('-'*50)
print(147, project_data['essay'].values[147])
output:
printing some random essay
9 Over 95% of my students are on free or reduced lunch. I have a few who are homeless, but despite that, they come to school with an eagerness to learn. My students are inquisitive eager learners who embrace the challenge of not having great books and other resources every day. Many of them are not afforded the opportunity to engage with these big colorful pages of a book on a regular basis at home and they don't travel to the public library. \r\nIt is my duty as a teacher to do all I can to provide each student an opportunity to succeed in every aspect of life. \r\nReading is Fundamental! My students will read these books over and over again while boosting their comprehension skills. These books will be used for read alouds, partner reading and for Independent reading. \r\nThey will engage in reading to build their \"Love for Reading\" by reading for pure enjoyment. They will be introduced to some new authors as well as some old favorites. I want my students to be ready for the 21st Century and know the pleasure of holding a good hard back book in hand. There's nothing like a good book to read! \r\nMy students will soar in Reading, and more because of your consideration and generous funding contribution. This will help build stamina and prepare for 3rd grade. Thank you so much for reading our proposal!nannan
--------------------------------------------------
...
...
preprocessed_essays = preprocess_text(project_data['essay'].values)
print("printing some random essay")
print(9, preprocessed_essays[9])
print('-'*50)
print(34, preprocessed_essays[34])
print('-'*50)
print(147, preprocessed_essays[147])
output:
printing some random essay
9 95 students free reduced lunch homeless despite come school eagerness learn students inquisitive eager learners embrace challenge not great books resources every day many not afforded opportunity engage big colorful pages book regular basis home not travel public library duty teacher provide student opportunity succeed every aspect life reading fundamental students read books boosting comprehension skills books used read alouds partner reading independent reading engage reading build love reading reading pure enjoyment introduced new authors well old favorites want students ready 21st century know pleasure holding good hard back book hand nothing like good book read students soar reading consideration generous funding contribution help build stamina prepare 3rd grade thank much reading proposal nannan
...
...
9. Preprocessing Numerical Features: price
# https://stackoverflow.com/questions/22407798/how-to-reset-a-dataframes-indexes-for-all-groups-in-one-step
price_data = resource_data.groupby('id').agg({'price':'sum', 'quantity':'sum'}).reset_index()
price_data.head(2)
# left-join the aggregated price data onto project_data on 'id':
project_data = pd.merge(project_data, price_data, on='id', how='left')
project_data['price'].head()
output:
0 154.60
1 299.00
2 516.85
3 232.90
4 67.98
Name: price, dtype: float64
9.1 Applying StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(project_data['price'].values.reshape(-1, 1))
project_data['std_price']=scaler.transform(project_data['price'].values.reshape(-1, 1) )
project_data['std_price'].head()
output:
0 -0.393708
1 -0.010053
2 0.568751
3 -0.185673
4 -0.623847
Name: std_price, dtype: float64
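StandardScaler computes z = (x − mean) / std, so the standardized column has mean 0 and standard deviation 1. A quick numpy check of that formula (the five prices here are just the sample values shown above, not the full column, so the z-scores differ from the output):

```python
import numpy as np

prices = np.array([154.60, 299.00, 516.85, 232.90, 67.98])
z = (prices - prices.mean()) / prices.std()  # population std, as sklearn uses
print(z.round(3))
```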
9.2 Applying MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(project_data['price'].values.reshape(-1, 1))
project_data['nrm_price']=scaler.transform(project_data['price'].values.reshape(-1, 1))
project_data['nrm_price'].head()
output:
0 0.015320
1 0.029763
2 0.051554
3 0.023152
4 0.006656
Name: nrm_price, dtype: float64
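MinMaxScaler maps x to (x − min) / (max − min), squeezing the fitted data into [0, 1]. One practical caution: once a train/test split exists, the scaler should be fit on the training portion only to avoid leakage; test values can then land outside [0, 1]. A sketch with hypothetical price arrays:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_prices = np.array([[10.0], [50.0], [100.0]])
test_prices = np.array([[75.0], [120.0]])   # 120 exceeds the training max

scaler = MinMaxScaler()
scaler.fit(train_prices)                    # fit on train only
scaled_test = scaler.transform(test_prices).ravel()
print(scaled_test)                          # second value is > 1
```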
For more details, you can send your requirements to:
realcode4you@gmail.com