
Data Analysis Preprocessing Using DonorsChoose Dataset

DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website. Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:

  • How to scale current manual processes and resources to screen 500,000 projects so that they can be posted as quickly and as efficiently as possible

  • How to increase the consistency of project vetting across different volunteers to improve the experience for teachers

  • How to focus volunteer time on the applications that need the most assistance

The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.


About the DonorsChoose Data Set

The train.csv data set provided by DonorsChoose contains the following features:


...


Additionally, the resources.csv data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:








Note: Many projects require multiple resources. The id value corresponds to the id column of a project in the training data, so you can use it as a key to retrieve all resources needed for a project.
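A minimal sketch of that lookup, assuming a toy table with the same four columns as resources.csv. The rows and the helper name resources_for_project are made up for illustration and are not part of the data set:

```python
import pandas as pd

# Toy stand-in for resources.csv; the rows are invented for illustration.
resources = pd.DataFrame({
    "id": ["p001", "p001", "p002"],
    "description": ["Pencils", "Notebooks", "Chromebook"],
    "quantity": [10, 5, 1],
    "price": [1.50, 2.00, 250.00],
})

def resources_for_project(resources, project_id):
    """Return every resource row whose id matches the given project (hypothetical helper)."""
    return resources[resources["id"] == project_id]

p001_items = resources_for_project(resources, "p001")
print(p001_items)
```

A boolean mask like this is fine for a single project. For all projects at once, a groupby on id is usually the better fit.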

The data set contains the following label (the value you will attempt to predict): project_is_approved, which indicates whether the project proposal was approved.




Import Necessary Packages

%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer


import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/

from nltk.corpus import stopwords
import pickle

from tqdm import tqdm
import os


1. Reading Data

project_data = pd.read_csv('train_data.csv', nrows=5000)
resource_data = pd.read_csv('resources.csv')
print("Number of data points in train data", project_data.shape)
print('-'*50)
print("The attributes of data :", project_data.columns.values)

output:

Number of data points in train data (5000, 17)
--------------------------------------------------
The attributes of data : ['Unnamed: 0' 'id' 'teacher_id' 'teacher_prefix' 'school_state'
 'project_submitted_datetime' 'project_grade_category'
 'project_subject_categories' 'project_subject_subcategories'
 'project_title' 'project_essay_1' 'project_essay_2' 'project_essay_3'
 'project_essay_4' 'project_resource_summary'
 'teacher_number_of_previously_posted_projects' 'project_is_approved']


print("Number of data points in resource data", resource_data.shape)
print(resource_data.columns.values)
resource_data.head(2)

output:

Number of data points in resource data (1541272, 4)
['id' 'description' 'quantity' 'price']






2. Preprocessing Categorical Features: project_grade_category


project_data['project_grade_category'].value_counts()

output:

Grades PreK-2    2002
Grades 3-5       1729
Grades 6-8        785
Grades 9-12       484
Name: project_grade_category, dtype: int64


We need to replace the spaces and the '-' with '_' and convert all the letters to lowercase, so that each category becomes a single clean token.


# https://stackoverflow.com/questions/36383821/pandas-dataframe-apply-function-to-column-strings-based-on-other-column-value
project_data['project_grade_category'] = project_data['project_grade_category'].str.replace(' ','_')
project_data['project_grade_category'] = project_data['project_grade_category'].str.replace('-','_')
project_data['project_grade_category'] = project_data['project_grade_category'].str.lower()
project_data['project_grade_category'].value_counts()

output:

grades_prek_2    2002
grades_3_5       1729
grades_6_8        785
grades_9_12       484
Name: project_grade_category, dtype: int64
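The three steps can also be written as one chained expression. This is a small sketch on a hand-made Series (the variable names grades and cleaned_grades are only for illustration). Passing regex=False is an extra precaution so that each pattern is matched as a literal string on any recent pandas version:

```python
import pandas as pd

# The four grade categories that appear in the output above.
grades = pd.Series(["Grades PreK-2", "Grades 3-5", "Grades 6-8", "Grades 9-12"])

# regex=False makes each pattern a literal string, not a regular expression.
cleaned_grades = (
    grades.str.replace(" ", "_", regex=False)
          .str.replace("-", "_", regex=False)
          .str.lower()
)
print(cleaned_grades.tolist())
```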



3. Preprocessing Categorical Features: project_subject_categories


project_data['project_subject_categories'].value_counts()

output:

Literacy & Language                           1067
Math & Science                                 795
Literacy & Language, Math & Science            679
Health & Sports                                509
Music & The Arts                               233
Literacy & Language, Special Needs             207
Applied Learning                               164
Special Needs                                  162
Math & Science, Literacy & Language            101
Applied Learning, Literacy & Language           97
Applied Learning, Special Needs                 80
Math & Science, Special Needs                   80
Literacy & Language, Music & The Arts           79
Math & Science, Music & The Arts                76
History & Civics, Literacy & Language           65
History & Civics                                63
...
...

project_data['project_subject_categories'] = project_data['project_subject_categories'].str.replace(' The ','')
project_data['project_subject_categories'] = project_data['project_subject_categories'].str.replace(' ','')
project_data['project_subject_categories'] = project_data['project_subject_categories'].str.replace('&','_')
project_data['project_subject_categories'] = project_data['project_subject_categories'].str.replace(',','_')
project_data['project_subject_categories'] = project_data['project_subject_categories'].str.lower()
project_data['project_subject_categories'].value_counts()

output:

literacy_language                       1067
math_science                             795
literacy_language_math_science           679
health_sports                            509
music_arts                               233
literacy_language_specialneeds           207
appliedlearning                          164
specialneeds                             162
math_science_literacy_language           101
appliedlearning_literacy_language         97
...
...
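Since the same five steps are needed again for the subcategories, they could be wrapped in a small function. This is only a sketch: the function name clean_subject_column is hypothetical, and the two sample strings are taken from the value counts above.

```python
import pandas as pd

def clean_subject_column(column):
    """Apply the five cleaning steps to a Series of subject strings (hypothetical helper)."""
    return (
        column.str.replace(" The ", "", regex=False)   # "Music & The Arts" -> "Music &Arts"
              .str.replace(" ", "", regex=False)       # drop all remaining spaces
              .str.replace("&", "_", regex=False)      # join the two halves of a category
              .str.replace(",", "_", regex=False)      # join multiple categories
              .str.lower()
    )

sample = pd.Series(["Music & The Arts", "Literacy & Language, Special Needs"])
cleaned_subjects = clean_subject_column(sample)
print(cleaned_subjects.tolist())
```

Note that '&' and ',' both become '_', so a project with two categories ends up as one combined token rather than two separate labels.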

4. Preprocessing Categorical Features: teacher_prefix

project_data['teacher_prefix'].value_counts()

output:

Mrs.       2560
Ms.        1845
Mr.         495
Teacher     100
Name: teacher_prefix, dtype: int64

# check whether the column contains any NaN values
print(project_data['teacher_prefix'].isnull().values.any())
print("number of nan values",project_data['teacher_prefix'].isnull().values.sum())

output:

False
number of nan values 0


This 5,000-row sample has no missing values. When only a few values are missing, we can replace them with Mrs., since most of the projects are submitted by teachers with that prefix.


project_data['teacher_prefix']=project_data['teacher_prefix'].fillna('Mrs.')
project_data['teacher_prefix'].value_counts()

output:

Mrs.       2560
Ms.        1845
Mr.         495
Teacher     100
Name: teacher_prefix, dtype: int64
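Instead of hard-coding 'Mrs.', the most frequent prefix can be computed from the data. A minimal sketch on a made-up Series with one missing value (the names prefixes, most_common and filled are illustrative):

```python
import numpy as np
import pandas as pd

# Toy prefixes with one missing value, invented for illustration.
prefixes = pd.Series(["Mrs.", "Ms.", np.nan, "Mrs.", "Mr."])

# mode() ignores NaN by default and returns the most frequent value(s).
most_common = prefixes.mode()[0]
filled = prefixes.fillna(most_common)
print(most_common, filled.isnull().sum())
```

This keeps the fill value correct even if the sample changes and another prefix becomes the most common one.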

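Remove the '.' and convert all the characters to lowercase.


# regex=False treats '.' as a literal period rather than a regex wildcard
project_data['teacher_prefix'] = project_data['teacher_prefix'].str.replace('.', '', regex=False)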
project_data['teacher_prefix'] = project_data['teacher_prefix'].str.lower()
project_data['teacher_prefix'].value_counts()

output:

mrs        2560
ms         1845
mr          495
teacher     100
Name: teacher_prefix, dtype: int64
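One thing to watch when removing the period: in a regular expression '.' is a wildcard that matches any character, and whether pandas str.replace treats a pattern as a regex has depended on the pandas version and the regex argument. The sketch below uses only Python's built-in re module to show the difference between the two interpretations:

```python
import re

prefix = "Mrs."

# As a regular expression, "." matches every character, so nothing is left.
as_regex = re.sub(".", "", prefix)

# As a literal string, only the period itself is removed.
as_literal = prefix.replace(".", "")

# Escaping the pattern gives the literal behaviour with the regex engine.
escaped = re.sub(re.escape("."), "", prefix)

print(repr(as_regex), repr(as_literal), repr(escaped))
```

Asking for literal matching explicitly avoids depending on the default.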



5. Preprocessing Categorical Features: project_subject_subcategories


project_data['project_subject_subcategories'].value_counts()

output:

Literacy                             449
Literacy, Mathematics                368
Literature & Writing, Mathematics    293
Literacy, Literature & Writing       234
Mathematics                          232
Literature & Writing                 216
Health & Wellness                    179
Special Needs                        162
Applied Sciences, Mathematics        156
...
...


We apply the same process that we used for project_subject_categories.



project_data['project_subject_subcategories'] = project_data['project_subject_subcategories'].str.replace(' The ','')
project_data['project_subject_subcategories'] = project_data['project_subject_subcategories'].str.replace(' ','')
project_data['project_subject_subcategories'] = project_data['project_subject_subcategories'].str.replace('&','_')
project_data['project_subject_subcategories'] = project_data['project_subject_subcategories'].str.replace(',','_')
project_data['project_subject_subcategories'] = project_data['project_subject_subcategories'].str.lower()
project_data['project_subject_subcategories'].value_counts()

output:

literacy                          449
literacy_mathematics              368
literature_writing_mathematics    293
literacy_literature_writing       234
mathematics                       232
literature_writing                216
health_wellness                   179
specialneeds                      162
appliedsciences_mathematics       156
...
...



6. Preprocessing Categorical Features: school_state

project_data['school_state'].value_counts()

output:

CA    707
TX    352
NY    342
FL    261
NC    246
SC    191
IL    184
GA    164
PA    151
MI    151
OH    122
...
...

Convert all the state codes to lowercase.


project_data['school_state'] = project_data['school_state'].str.lower()
project_data['school_state'].value_counts()

output:

ca    707
tx    352
ny    342
fl    261
nc    246
sc    191
il    184
ga    164
mi    151
pa    151
oh    122
...
...

7. Preprocessing Text Features: project_title


# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase
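A quick usage sketch of the function above. To keep the example runnable on its own, it restates decontracted in compact form with the same patterns in the same order. The list name CONTRACTIONS and the sample sentences are made up for illustration:

```python
import re

# Compact restatement of decontracted() so this example runs on its own.
CONTRACTIONS = [
    (r"won't", "will not"), (r"can't", "can not"),       # specific cases first
    (r"n't", " not"), (r"'re", " are"), (r"'s", " is"),  # then the general suffixes
    (r"'d", " would"), (r"'ll", " will"), (r"'t", " not"),
    (r"'ve", " have"), (r"'m", " am"),
]

def decontracted(phrase):
    for pattern, replacement in CONTRACTIONS:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

expanded = decontracted("We can't wait, they're ready and it won't fail")
possessive = decontracted("the teacher's desk")
print(expanded)
print(possessive)
```

The second call shows a limitation worth knowing: the "'s" rule cannot tell a contraction from a possessive, so "teacher's" also becomes "teacher is". Because "is" is a stop word and is removed later, the effect on the final text is small.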

# https://gist.github.com/sebleier/554280
# we are removing the words from the stop words list: 'no', 'nor', 'not'

stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"]
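Two practical notes on this list. First, the name stopwords reuses the name imported earlier from nltk.corpus, so after this cell the name refers to the plain Python list. Second, checking membership in a list scans it item by item, so for long essays it may be worth converting the list to a set once. A small sketch, using a short subset of the list and a hypothetical name stopword_set:

```python
# Short subset of the stop word list above, just to keep the sketch small.
stopwords = ['i', 'me', 'my', 'the', 'a', 'of', 'for', 'just', 'don', "don't"]

# A set gives fast membership checks compared with scanning a list.
stopword_set = set(stopwords)

title = "Just For the Love of Reading"
kept = [word for word in title.split() if word.lower() not in stopword_set]
print(kept)

# 'not' was deliberately left out of the list, so negations survive filtering.
print('not' in stopword_set)
```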

project_data['project_title'].head(5)

output:







print("printing some random titles")
print(9, project_data['project_title'].values[9])
print(34, project_data['project_title'].values[34])
print(147, project_data['project_title'].values[147])

output:

printing some random titles
9 Just For the Love of Reading--\r\nPure Pleasure
34 \"Have A Ball!!!\"
147 Who needs a Chromebook?\r\nWE DO!!



# Combine all the preprocessing steps above into one function
from tqdm import tqdm
def preprocess_text(text_data):
    preprocessed_text = []
    # tqdm prints a progress bar while iterating
    for sentence in tqdm(text_data):
        # expand contractions such as "can't" -> "can not"
        sent = decontracted(sentence)
        # replace the literal \r, \n and \" sequences found in the raw text
        sent = sent.replace('\\r', ' ')
        sent = sent.replace('\\n', ' ')
        sent = sent.replace('\\"', ' ')
        # replace every run of non-alphanumeric characters with a single space
        sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
        # drop stop words: https://gist.github.com/sebleier/554280
        sent = ' '.join(e for e in sent.split() if e.lower() not in stopwords)
        preprocessed_text.append(sent.lower().strip())
    return preprocessed_text

preprocessed_titles = preprocess_text(project_data['project_title'].values)
print("printing some random titles")
print(9, preprocessed_titles[9])
print(34, preprocessed_titles[34])
print(147, preprocessed_titles[147])

output:

printing some random titles
9 love reading pure pleasure
34 ball
147 needs chromebook
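The order of the steps inside preprocess_text matters. Assuming, as the printed titles suggest, that the raw text contains the two-character sequences backslash-r and backslash-n rather than real line breaks, running the regex first would strip only the backslashes and leave stray 'r' and 'n' letters behind. A small sketch with title 147 (the variable names are illustrative):

```python
import re

# Literal backslashes, as they appear in the printed title above.
raw_title = "Who needs a Chromebook?\\r\\nWE DO!!"

# Regex first: the backslashes become spaces, but "r" and "n" stay in the text.
regex_first = re.sub('[^A-Za-z0-9]+', ' ', raw_title)

# Literal replacements first, then the regex: no stray letters remain.
replaced_first = re.sub(
    '[^A-Za-z0-9]+', ' ',
    raw_title.replace('\\r', ' ').replace('\\n', ' ')
)

print(repr(regex_first))
print(repr(replaced_first))
```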



8. Preprocessing Text Features: essay


# concatenate the four essay columns into a single text column
project_data["essay"] = project_data["project_essay_1"].map(str) +\
                        project_data["project_essay_2"].map(str) + \
                        project_data["project_essay_3"].map(str) + \
                        project_data["project_essay_4"].map(str)
print("printing some random essay")
print(9, project_data['essay'].values[9])
print('-'*50)
print(34, project_data['essay'].values[34])
print('-'*50)
print(147, project_data['essay'].values[147])

output:

printing some random essay
9 Over 95% of my students are on free or reduced lunch.  I have a few who are homeless, but despite that, they come to school with an eagerness to learn.  My students are inquisitive eager learners who  embrace the challenge of not having great books and other resources  every day.  Many of them are not afforded the opportunity to engage with these big colorful pages of a book on a regular basis at home and they don't travel to the public library.  \r\nIt is my duty as a teacher to do all I can to provide each student an opportunity to succeed in every aspect of life. \r\nReading is Fundamental! My students will read these books over and over again while boosting their comprehension skills. These books will be used for read alouds, partner reading and for Independent reading. \r\nThey will engage in reading to build their \"Love for Reading\" by reading for pure enjoyment. They will be introduced to some new authors as well as some old favorites. I want my students to be ready for the 21st Century and know the pleasure of holding a good hard back book in hand. There's nothing like a good book to read!  \r\nMy students will soar in Reading, and more because of your consideration and generous funding contribution. This will help build stamina and prepare for 3rd grade. Thank you so much for reading our proposal!nannan
--------------------------------------------------
...
...

preprocessed_essays = preprocess_text(project_data['essay'].values)
print("printing some random essay")
print(9, preprocessed_essays[9])
print('-'*50)
print(34, preprocessed_essays[34])
print('-'*50)
print(147, preprocessed_essays[147])

output:

printing some random essay
9 95 students free reduced lunch homeless despite come school eagerness learn students inquisitive eager learners embrace challenge not great books resources every day many not afforded opportunity engage big colorful pages book regular basis home not travel public library duty teacher provide student opportunity succeed every aspect life reading fundamental students read books boosting comprehension skills books used read alouds partner reading independent reading engage reading build love reading reading pure enjoyment introduced new authors well old favorites want students ready 21st century know pleasure holding good hard back book hand nothing like good book read students soar reading consideration generous funding contribution help build stamina prepare 3rd grade thank much reading proposal nannan

...

...
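The trailing "nannan" in essay 9 comes from the concatenation step: map(str) turns a missing project_essay_3 or project_essay_4 into the text "nan", and the pieces are joined without a separator. One possible way to avoid both effects is sketched below on a made-up two-row frame. This is an alternative to the code above, not what produced the outputs shown:

```python
import numpy as np
import pandas as pd

# Toy essays with missing third and fourth parts, invented for illustration.
essays = pd.DataFrame({
    "project_essay_1": ["My students love books.", "We need tablets."],
    "project_essay_2": ["Please help us.", "Thank you."],
    "project_essay_3": [np.nan, "Extra detail."],
    "project_essay_4": [np.nan, np.nan],
})
essay_cols = ["project_essay_1", "project_essay_2", "project_essay_3", "project_essay_4"]

# Keep only real strings, so missing essays add nothing instead of "nan",
# and join the parts with a space so words do not run together.
essays["essay"] = essays[essay_cols].apply(
    lambda row: " ".join(part for part in row if isinstance(part, str)), axis=1
)
print(essays["essay"].tolist())
```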



8. Preprocessing Numerical Values: price


# https://stackoverflow.com/questions/22407798/how-to-reset-a-dataframes-indexes-for-all-groups-in-one-step
price_data = resource_data.groupby('id').agg({'price':'sum', 'quantity':'sum'}).reset_index()
price_data.head(2)

output:






# join two dataframes in python: 
project_data = pd.merge(project_data, price_data, on='id', how='left')
project_data['price'].head()

output:

0    154.60
1    299.00
2    516.85
3    232.90
4     67.98
Name: price, dtype: float64
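To see what the aggregation and the left join do, here is a sketch on toy frames (all rows are invented). It also shows a hypothetical alternative, total_cost, which multiplies price by quantity before summing. That is only meaningful if price is a per-unit price, which is an assumption here:

```python
import pandas as pd

# Toy stand-ins for resource_data and project_data.
resource_data = pd.DataFrame({
    "id": ["p001", "p001", "p002"],
    "quantity": [2, 1, 3],
    "price": [10.0, 5.0, 4.0],
})
project_data = pd.DataFrame({"id": ["p001", "p002", "p003"]})

# Same aggregation as above: sum of prices and sum of quantities per project.
price_data = resource_data.groupby("id").agg({"price": "sum", "quantity": "sum"}).reset_index()

# Hypothetical alternative: total cost = sum(price * quantity) per project.
total_cost = (
    resource_data.assign(cost=resource_data["price"] * resource_data["quantity"])
                 .groupby("id")["cost"].sum()
                 .reset_index()
)

# how='left' keeps every project; projects without resources get NaN.
merged = (
    project_data.merge(price_data, on="id", how="left")
                .merge(total_cost, on="id", how="left")
)
print(merged)
```

Project p003 has no rows in the resource table, so its price, quantity and cost are NaN after the left join. It is worth checking for such rows before scaling.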

8.1 Applying StandardScaler

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(project_data['price'].values.reshape(-1, 1))
project_data['std_price']=scaler.transform(project_data['price'].values.reshape(-1, 1) )
project_data['std_price'].head()

output:

0   -0.393708
1   -0.010053
2    0.568751
3   -0.185673
4   -0.623847
Name: std_price, dtype: float64
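StandardScaler subtracts the mean and divides by the standard deviation (the population standard deviation, with ddof=0). The sketch below applies that formula by hand with NumPy to the five prices shown earlier. The numbers will not match the output above, because there the scaler was fitted on all 5,000 rows and not just these five:

```python
import numpy as np

# The first five prices from the merged data.
prices = np.array([154.60, 299.00, 516.85, 232.90, 67.98])

# z = (x - mean) / std, where np.std uses ddof=0 by default.
standardized = (prices - prices.mean()) / prices.std()
print(standardized)
```

After this transformation the values have mean 0 and standard deviation 1, and their ordering is unchanged.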


8.2 Applying MinMaxScaler

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(project_data['price'].values.reshape(-1, 1))
project_data['nrm_price']=scaler.transform(project_data['price'].values.reshape(-1, 1))
project_data['nrm_price'].head()

output:

0    0.015320
1    0.029763
2    0.051554
3    0.023152
4    0.006656
Name: nrm_price, dtype: float64
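With its default settings, MinMaxScaler maps the smallest value to 0 and the largest to 1. A hand-written sketch of the same formula on the five prices shown earlier (again, the results differ from the output above because the scaler there was fitted on the whole sample):

```python
import numpy as np

# The first five prices from the merged data.
prices = np.array([154.60, 299.00, 516.85, 232.90, 67.98])

# x' = (x - min) / (max - min)
normalized = (prices - prices.min()) / (prices.max() - prices.min())
print(normalized)
```

In the real output a price of 154.60 maps to about 0.015, which means the largest price in the sample is far above these values. Min-max scaling is sensitive to such extremes, since a few very expensive projects squeeze the rest close to 0.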




For more details, you can send your requirements to:


realcode4you@gmail.com