
Text Processing Using NLP | Sample Practice Set | Realcode4you

Dataset

Link to the my_json.data used.

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import nltk
import os

# os.chdir needs a directory, not the path of the JSON file itself
os.chdir('/content/drive/MyDrive/my_data/my_data/')

my_data_df = pd.read_json('my_data.json')
working_data = my_data_df.copy()

working_data['cls_body'] = working_data['body']

working_data.head(5)

output:


## BEGIN YOUR ASSIGNMENT HERE

# T1: print the body of the 5th document in the list. You should see a recipe as the output.
# Indexing starts at 0, so the 5th document is at index 4
# This 5th document contains URLs

doc = working_data.at[4, 'body']
print(doc)

Output:

“mini” 😂 [Link to recipe. ](https://bakerinretrograde.wordpress.com/2020/12/20/brownie-trifle/)

Chocolate Diplomat Cream Layer

1 cup milk

1/2 tbsp vanilla

3 egg yolks

1/3 cup sugar

1/8 cup cornstarch

1/2 tbsp unsalted butter

1/4 cup chocolate chips

1 and 1/2 cups heavy whipping cream

Instructions

Add the milk to a medium saucepan, place over medium heat and bring to a boil.

In a bowl, whisk the egg yolks and sugar until light and fluffy. Add the cornstarch and whisk until smooth.

Whisk the hot milk gradually into the egg mixture until incorporated.

Pour the mixture back into the saucepan. Cook over medium-high heat, whisking constantly, until thickened. You know it is done when you can run a wooden spoon along the bottom of the saucepan, and it leaves a clean line.

Remove from the heat and stir in the chocolate chips and butter until melted. Let cool slightly. 

Cover with plastic wrap, lightly pressing the plastic against the surface to prevent a skin from forming.

Chill about an hour.

Once chilled, whip up the heavy cream until thick and you can overturn the bowl without any cream falling out. Remember not to over-whip.

Fold the whipped into the chocolate pastry cream.

Brownie Layer

2 cups semisweet chopped chocolate or chocolate chips

1/2 cup butter

1/2 cup light brown sugar

4 large eggs

2 teaspoons vanilla extract

1/4 teaspoon salt

1 cup all purpose flour

Instructions

Preheat oven to 325°F. Line a 9×9 baking dish with parchment, and grease with oil. Set aside.

In a medium saucepan, cook butter over medium heat until melted and browned, stirring constantly. It should take about 4 minutes. You will see it change color and aroma with brown bits forming at the bottom. Immediately take it off the heat and pour over the chocolate chips.

Let the mixture sit to allow the chocolate chips to melt and then stir until no more chunks of chocolate remain. Whisk in the brown sugar.

Next add the eggs and whisk until evenly combined. Next mix in the vanilla until smooth. Stir in the salt and flour until they’re evenly incorporated. Then pour the batter into the prepared pan.

Bake the brownies for 30 minutes, or until a toothpick inserted into the center comes out clean.

Let the brownies cool completely and cut or crumble into bite-sized pieces.

Whipped Cream Layer

1 and 1/2 cups heavy whipping cream

Instructions

Whip up the heavy cream until thick and you can overturn the bowl without any cream falling out. Remember not to over-whip.

Assembly

In each glass, place alternating layers of brownie bits, chocolate diplomat cream, and whipped cream. Garnish with shaved chocolate on top. [Link to recipe. ](https://bakerinretrograde.wordpress.com/2020/12/20/brownie-trifle/)

Chocolate Diplomat Cream Layer

1 cup milk

1/2 tbsp vanilla

3 egg yolks

1/3 cup sugar

1/8 cup cornstarch

1/2 tbsp unsalted butter

1/4 cup chocolate chips

1 and 1/2 cups heavy whipping cream

Instructions

Add the milk to a medium saucepan, place over medium heat and bring to a boil.

In a bowl, whisk the egg yolks and sugar until light and fluffy. Add the cornstarch and whisk until smooth.

Whisk the hot milk gradually into the egg mixture until incorporated.

Pour the mixture back into the saucepan. Cook over medium-high heat, whisking constantly, until thickened. You know it is done when you can run a wooden spoon along the bottom of the saucepan, and it leaves a clean line.

Remove from the heat and stir in the chocolate chips and butter until melted. Let cool slightly. 

Cover with plastic wrap, lightly pressing the plastic against the surface to prevent a skin from forming.

Chill about an hour.

Once chilled, whip up the heavy cream until thick and you can overturn the bowl without any cream falling out. Remember not to over-whip.

Fold the whipped into the chocolate pastry cream.

Brownie Layer

2 cups semisweet chopped chocolate or chocolate chips

1/2 cup butter

1/2 cup light brown sugar

4 large eggs

2 teaspoons vanilla extract

1/4 teaspoon salt

1 cup all purpose flour

Instructions

Preheat oven to 325°F. Line a 9×9 baking dish with parchment, and grease with oil. Set aside.

In a medium saucepan, cook butter over medium heat until melted and browned, stirring constantly. It should take about 4 minutes. You will see it change color and aroma with brown bits forming at the bottom. Immediately take it off the heat and pour over the chocolate chips.

Let the mixture sit to allow the chocolate chips to melt and then stir until no more chunks of chocolate remain. Whisk in the brown sugar.

Next add the eggs and whisk until evenly combined. Next mix in the vanilla until smooth. Stir in the salt and flour until they’re evenly incorporated. Then pour the batter into the prepared pan.

Bake the brownies for 30 minutes, or until a toothpick inserted into the center comes out clean.

Let the brownies cool completely and cut or crumble into bite-sized pieces.

Whipped Cream Layer

1 and 1/2 cups heavy whipping cream

Instructions

Whip up the heavy cream until thick and you can overturn the bowl without any cream falling out. Remember not to over-whip.

Assembly

In each glass, place alternating layers of brownie bits, chocolate diplomat cream, and whipped cream. Garnish with shaved chocolate on top. Where’s the meat layer? This looks delicious, nice job op! The mugs are so cute! Where are they from?

# T2: using regex substitution, remove text enclosed in square brackets or parentheses.
  # HINT: the pattern to search for is '[\(\[].*?[\)\]]'

import re

no_urls = []
# Series.iteritems() was removed in pandas 2.0; Series.items() is the replacement
for i, row in working_data['cls_body'].items():
  no_punkt_body = re.sub(r'[\(\[].*?[\)\]]', ' ', row)
  no_urls.append(no_punkt_body)

working_data['cls_body'] = no_urls

doc = working_data.at[4, 'cls_body']
print(doc)

Output:

“mini” 😂   

Chocolate Diplomat Cream Layer

1 cup milk

1/2 tbsp vanilla

3 egg yolks

1/3 cup sugar

1/8 cup cornstarch

1/2 tbsp unsalted butter

1/4 cup chocolate chips

1 and 1/2 cups heavy whipping cream

Instructions

Add the milk to a medium saucepan, place over medium heat and bring to a boil.

In a bowl, whisk the egg yolks and sugar until light and fluffy. Add the cornstarch and whisk until smooth.

Whisk the hot milk gradually into the egg mixture until incorporated.

Pour the mixture back into the saucepan. Cook over medium-high heat, whisking constantly, until thickened. You know it is done when you can run a wooden spoon along the bottom of the saucepan, and it leaves a clean line.

Remove from the heat and stir in the chocolate chips and butter until melted. Let cool slightly. 

Cover with plastic wrap, lightly pressing the plastic against the surface to prevent a skin from forming.

Chill about an hour.

Once chilled, whip up the heavy cream until thick and you can overturn the bowl without any cream falling out. Remember not to over-whip.

Fold the whipped into the chocolate pastry cream.

Brownie Layer

2 cups semisweet chopped chocolate or chocolate chips

1/2 cup butter

1/2 cup light brown sugar

4 large eggs

2 teaspoons vanilla extract

1/4 teaspoon salt

1 cup all purpose flour

Instructions

Preheat oven to 325°F. Line a 9×9 baking dish with parchment, and grease with oil. Set aside.

In a medium saucepan, cook butter over medium heat until melted and browned, stirring constantly. It should take about 4 minutes. You will see it change color and aroma with brown bits forming at the bottom. Immediately take it off the heat and pour over the chocolate chips.

Let the mixture sit to allow the chocolate chips to melt and then stir until no more chunks of chocolate remain. Whisk in the brown sugar.

Next add the eggs and whisk until evenly combined. Next mix in the vanilla until smooth. Stir in the salt and flour until they’re evenly incorporated. Then pour the batter into the prepared pan.

Bake the brownies for 30 minutes, or until a toothpick inserted into the center comes out clean.

Let the brownies cool completely and cut or crumble into bite-sized pieces.

Whipped Cream Layer

1 and 1/2 cups heavy whipping cream

Instructions

Whip up the heavy cream until thick and you can overturn the bowl without any cream falling out. Remember not to over-whip.

Assembly

In each glass, place alternating layers of brownie bits, chocolate diplomat cream, and whipped cream. Garnish with shaved chocolate on top.   

Chocolate Diplomat Cream Layer

1 cup milk

1/2 tbsp vanilla

3 egg yolks

1/3 cup sugar

1/8 cup cornstarch

1/2 tbsp unsalted butter

1/4 cup chocolate chips

1 and 1/2 cups heavy whipping cream

Instructions

Add the milk to a medium saucepan, place over medium heat and bring to a boil.

In a bowl, whisk the egg yolks and sugar until light and fluffy. Add the cornstarch and whisk until smooth.

Whisk the hot milk gradually into the egg mixture until incorporated.

Pour the mixture back into the saucepan. Cook over medium-high heat, whisking constantly, until thickened. You know it is done when you can run a wooden spoon along the bottom of the saucepan, and it leaves a clean line.

Remove from the heat and stir in the chocolate chips and butter until melted. Let cool slightly. 

Cover with plastic wrap, lightly pressing the plastic against the surface to prevent a skin from forming.

Chill about an hour.

Once chilled, whip up the heavy cream until thick and you can overturn the bowl without any cream falling out. Remember not to over-whip.

Fold the whipped into the chocolate pastry cream.

Brownie Layer

2 cups semisweet chopped chocolate or chocolate chips

1/2 cup butter

1/2 cup light brown sugar

4 large eggs

2 teaspoons vanilla extract

1/4 teaspoon salt

1 cup all purpose flour

Instructions

Preheat oven to 325°F. Line a 9×9 baking dish with parchment, and grease with oil. Set aside.

In a medium saucepan, cook butter over medium heat until melted and browned, stirring constantly. It should take about 4 minutes. You will see it change color and aroma with brown bits forming at the bottom. Immediately take it off the heat and pour over the chocolate chips.

Let the mixture sit to allow the chocolate chips to melt and then stir until no more chunks of chocolate remain. Whisk in the brown sugar.

Next add the eggs and whisk until evenly combined. Next mix in the vanilla until smooth. Stir in the salt and flour until they’re evenly incorporated. Then pour the batter into the prepared pan.

Bake the brownies for 30 minutes, or until a toothpick inserted into the center comes out clean.

Let the brownies cool completely and cut or crumble into bite-sized pieces.

Whipped Cream Layer

1 and 1/2 cups heavy whipping cream

Instructions

Whip up the heavy cream until thick and you can overturn the bowl without any cream falling out. Remember not to over-whip.

Assembly

In each glass, place alternating layers of brownie bits, chocolate diplomat cream, and whipped cream. Garnish with shaved chocolate on top. Where’s the meat layer? This looks delicious, nice job op! The mugs are so cute! Where are they from?
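
To see what the T2 pattern does in isolation, here is a minimal sketch on a single made-up line. The sample string and the helper name `strip_bracketed` are assumptions for illustration and are not part of the assignment; only the pattern itself comes from the hint.

```python
import re

# Pattern from the T2 hint: an opening "(" or "[", then as few characters
# as possible, then the first closing ")" or "]".
BRACKETED = re.compile(r'[\(\[].*?[\)\]]')

def strip_bracketed(text):
    """Replace every bracketed or parenthesised span with a single space."""
    return BRACKETED.sub(' ', text)

# Hypothetical sample line, shaped like the Markdown link in document 5.
sample = 'Try this [Link to recipe.](https://example.com/brownie-trifle) today'
cleaned = strip_bracketed(sample)
print(repr(cleaned))
```

Both halves of the Markdown link disappear, which is why the URL is gone from the output above. Note that the same pattern also deletes ordinary parenthetical remarks such as "(about 4 minutes)", so it removes more than links.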

# T3: perform sentence tokenization on the string object in each row.
from nltk import sent_tokenize
# newer NLTK releases may ask for the 'punkt_tab' resource instead
nltk.download('punkt')

for i, row in working_data['cls_body'].items():
  sents = sent_tokenize(row)
  print(sents)

Output:


# T4: perform a regex substitution that removes all punctuation, leaving only words and whitespace in each sentence.
    # HINT: the regex pattern is '[^\s\w]'

from nltk import sent_tokenize
nltk.download('punkt')

sentencized_row = []
for i, row in working_data['cls_body'].items():
  sents = sent_tokenize(row)

  no_punkt_sents = []
  for sent in sents:
    # raw string so that \s and \w reach the regex engine unchanged
    sent = re.sub(r'[^\s\w]', '', sent)
    no_punkt_sents.append(sent)

  sentencized_row.append(no_punkt_sents)

working_data['cls_body'] = sentencized_row
doc = working_data.at[4, 'cls_body']
doc[:15]

Output:

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[' mini    \n\nChocolate Diplomat Cream Layer\n\n1 cup milk\n\n12 tbsp vanilla\n\n3 egg yolks\n\n13 cup sugar\n\n18 cup cornstarch\n\n12 tbsp unsalted butter\n\n14 cup chocolate chips\n\n1 and 12 cups heavy whipping cream\n\nInstructions\n\nAdd the milk to a medium saucepan place over medium heat and bring to a boil',
 'In a bowl whisk the egg yolks and sugar until light and fluffy',
 'Add the cornstarch and whisk until smooth',
 'Whisk the hot milk gradually into the egg mixture until incorporated',
 'Pour the mixture back into the saucepan',
 'Cook over mediumhigh heat whisking constantly until thickened',
 'You know it is done when you can run a wooden spoon along the bottom of the saucepan and it leaves a clean line',
 'Remove from the heat and stir in the chocolate chips and butter until melted',
 'Let cool slightly',
 'Cover with plastic wrap lightly pressing the plastic against the surface to prevent a skin from forming',
 'Chill about an hour',
 'Once chilled whip up the heavy cream until thick and you can overturn the bowl without any cream falling out',
 'Remember not to overwhip',
 'Fold the whipped into the chocolate pastry cream',
 'Brownie Layer\n\n2 cups semisweet chopped chocolate or chocolate chips\n\n12 cup butter\n\n12 cup light brown sugar\n\n4 large eggs\n\n2 teaspoons vanilla extract\n\n14 teaspoon salt\n\n1 cup all purpose flour\n\nInstructions\n\nPreheat oven to 325F']
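
The output above shows a side effect of deleting punctuation outright: "1/2" becomes "12" and "medium-high" becomes "mediumhigh". The sketch below reproduces that on two short strings and contrasts it with a variant that substitutes a space instead. The helper names and the space-substituting variant are assumptions added for illustration, not part of the assignment.

```python
import re

# Pattern from the T4 hint: any character that is neither whitespace
# nor a word character (letter, digit, underscore).
NON_WORD = re.compile(r'[^\s\w]')

def strip_punctuation(sentence):
    """Delete punctuation, as the T4 cell does."""
    return NON_WORD.sub('', sentence)

def space_punctuation(sentence):
    """Alternative: swap punctuation for a space, then collapse whitespace."""
    return ' '.join(NON_WORD.sub(' ', sentence).split())

examples = ['1/2 tbsp vanilla', 'Cook over medium-high heat, whisking constantly.']
for sentence in examples:
    print(strip_punctuation(sentence), '|', space_punctuation(sentence))
```

Which variant is preferable depends on the downstream model: deleting keeps hyphenated words as one token, while substituting a space keeps quantities such as 1 and 2 apart.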

working_data.head(5)

output:



from nltk import word_tokenize
nltk.download('punkt')
# T5: for each sentence in each document, word tokenize the sentence.
# HINT: follow the patterns you saw in the previous cell.
# HINT: save the list of tokenized documents back to the working_data['cls_body'] column
# Join each document's sentences back into one string, then split it into word tokens
buf = [' '.join(i) for i in working_data['cls_body']]
working_data['cls_body'] = buf
working_data['cls_body'] = working_data['cls_body'].apply(word_tokenize)
working_data.head()

Output:


# T6: use the WordNetLemmatizer to create lemmas for each tokenized word.
# HINT: pos = nltk.pos_tag(sentence)
# HINT: lemmatize.lemmatize(word, get_wordnet_pos(pos))
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    # No POS tag is passed, so every token is lemmatized as a noun (the default)
    return [lemmatizer.lemmatize(w) for w in text]

# apply() returns a new Series, so assign it back to the column
working_data['cls_body'] = working_data['cls_body'].apply(lemmatize_text)
working_data.head()

Output:
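
The T6 hints mention a `get_wordnet_pos` helper that the cell never defines. A minimal sketch of one possible version follows; the mapping and both function names are assumptions, written so the block runs without downloading any NLTK data.

```python
# WordNet's part-of-speech codes are single characters; they are written
# out here so the sketch runs without any NLTK resources installed.
WORDNET_POS = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS code.

    Unknown tags fall back to 'n' (noun), which is also the default
    used by WordNetLemmatizer.lemmatize.
    """
    return WORDNET_POS.get(treebank_tag[:1], 'n')

def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs such as those returned by nltk.pos_tag."""
    return [lemmatizer.lemmatize(word, get_wordnet_pos(tag))
            for word, tag in tagged_tokens]

for tag in ['VBG', 'JJ', 'NNS', 'RB', 'CD']:
    print(tag, '->', get_wordnet_pos(tag))
```

With NLTK and its data installed, the helper could be used as `lemmatize_tagged(nltk.pos_tag(tokens), WordNetLemmatizer())`, so that verbs and adjectives are no longer lemmatized as nouns.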


# T7: for each document and each sentence remove any stopwords found in the stop_words list.
# HINT: turn stopwords into a python list called stop_words. Then loop through all the documents/sentences/words and remove any word in that list.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
# Load the stopword list once; a set makes each membership test fast
stop_words = set(stopwords.words())
# Keep only the first 20 documents before filtering so the new column matches the row count
working_data = working_data.iloc[:20, :].copy()
lemz = []
for text_tokens in working_data['cls_body']:
  lemmrs = [word for word in text_tokens if word not in stop_words]
  lemz.append(lemmrs)
working_data['cls_body'] = lemz
working_data.head()

# save your working_data dataframe to a pickle
import pickle
working_data.to_pickle('processed_data.pkl')

# T8: print a frequency distribution of the vocabulary (how often is each word observed)
from nltk import ngrams, FreqDist
# HINT: for each word in sentence for each sentence in document append each word into a list.
# HINT: corpus is just a list of words in each document (no sentences)
lists = working_data['cls_body']
words = []
for wordList in lists:
    words += wordList
# Build the distribution from the whole corpus, not just the last document
fdist = FreqDist(words)
# T9: print the 50 most common words in the distribution along with their frequency.
print(fdist.most_common(50))
# T10: print the number of documents in the corpus
print('length of corpus ', working_data.shape[0])
# T11: print the size of the vocabulary of the corpus
# The vocabulary is the set of distinct words, i.e. the number of keys in fdist
print('size of vocabulary ', len(fdist))
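
The difference between the number of tokens and the size of the vocabulary is easy to check on a toy corpus. The sketch below uses `collections.Counter`, which NLTK's `FreqDist` builds on; the two sample documents are made up for illustration.

```python
from collections import Counter

# Hypothetical tokenized documents standing in for working_data['cls_body'].
docs = [['whisk', 'milk', 'sugar', 'milk'],
        ['melt', 'butter', 'sugar']]

# Flatten every document into one corpus-wide list of tokens.
corpus = [word for doc in docs for word in doc]
freq = Counter(corpus)

print(freq.most_common(2))        # most frequent words first
print('documents :', len(docs))
print('tokens    :', len(corpus))
print('vocabulary:', len(freq))   # number of distinct words
```

Counting over the flattened corpus gives corpus-wide frequencies; counting over a single document would only describe that document.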

Feature Engineering

# T12 - T18
processed_data = working_data.copy()
# T12: for each row in processed_data, append each word from the document into a flat list (we want to remove sentences).
# HINT: for each sentence in doc for each word in sentence append the word to a new list called doc. Then append that list to a list called docs.
docs = []
for row in processed_data['cls_body']:
  # After T5 each row is already a flat list of tokens, so this loop just copies it
  doc = []
  for i in row:
    doc.append(i)
  docs.append(doc)

processed_data['cls_body'] = docs
X = processed_data['cls_body'].tolist()
# T13: print the first 15 words from the 5th document in X
# The 5th document is at index 4
X[4][:15]

#T14
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
def dummy(doc):
  # Documents are already tokenized, so pass them through unchanged
  return doc
# T14: create the bigram model using CountVectorizer. Apply the model to X
# ngram_range=(1, 2) keeps unigrams as well as bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2),
        tokenizer=dummy,
        preprocessor=dummy)
gramarray = vectorizer.fit_transform(X).toarray()
# save the data to the modeling_data/ folder in a file called ngrams.txt
os.makedirs('modeling_data', exist_ok=True)
# np.savetxt writes every value; str(gramarray) would write only a truncated summary
np.savetxt('modeling_data/ngrams.txt', gramarray, fmt='%d')
# print the shape of the ngrams array (# rows, # columns) format
gramarray.shape
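
To make the effect of `ngram_range=(1, 2)` concrete, here is a pure-Python sketch that lists the unigram and bigram features for one tokenized document. The function `word_ngrams` is a hypothetical helper written for this illustration; it is not how scikit-learn is implemented internally.

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

doc = ['whisk', 'egg', 'yolks']   # hypothetical tokenized document
features = word_ngrams(doc)
print(features)
```

Each distinct string becomes one column of the count matrix, which is why the number of columns grows quickly once bigrams are included.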
import requests, zipfile, io

url = 'http://nlp.stanford.edu/data/glove.6B.zip'
r = requests.get(url, allow_redirects=True)

z = zipfile.ZipFile(io.BytesIO(r.content))

# exist_ok=True makes this a no-op when the folder already exists
os.makedirs('glove_vectors/', exist_ok=True)

z.extractall('glove_vectors/')

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models.keyedvectors import KeyedVectors
glove2word2vec(glove_input_file="glove_vectors/glove.6B.50d.txt", word2vec_output_file="glove_vectors/gensim_glove_vectors.txt")

glove_model = KeyedVectors.load_word2vec_format("glove_vectors/gensim_glove_vectors.txt", binary=False)
glove_model['recipe']
glove_model.most_similar(positive='recipe')
glove_model.most_similar(positive=["king", "woman"], negative=["man"], topn=6)
test_doc = X[1]
test_doc[:15]
word_vectors = []
for word in test_doc:
  try:
    word_vec = glove_model[word]
  except KeyError:
    # skip words that are not in the GloVe vocabulary
    continue
  word_vectors.append(word_vec)
word_vectors = np.asarray(word_vectors)
word_vectors[:15]
flattened_vector = np.average(word_vectors, axis=0)
flattened_vector
glove_model.most_similar(positive=[flattened_vector])

def create_doc_vecs(docs, dim=50):
  columns = ['vec_' + str(j) for j in range(dim)]
  doc_vecs = []
  for i, doc in enumerate(docs):
    word_vecs = []
    for word in doc:
      try:
        word_vec = glove_model[word]
      except KeyError:
        # skip words that are not in the GloVe vocabulary
        continue
      word_vecs.append(word_vec)
    if word_vecs:
      # document vector = element-wise mean of its word vectors
      flattened_vector = np.average(np.asarray(word_vecs), axis=0)
    else:
      # no known words in this document: fill the row with NaN
      flattened_vector = np.full(dim, np.nan)
    df = pd.DataFrame([flattened_vector], index=[i], columns=columns)
    doc_vecs.append(df)
  return pd.concat(doc_vecs)
doc_vecs = create_doc_vecs(X)
doc_vecs.head(5)
np.savetxt('glove_vecs.txt', doc_vecs.values)
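
The averaging step can be checked by hand on a tiny example. The sketch below uses plain Python lists and a made-up 2-dimensional embedding table in place of the GloVe model; the name `average_vector` and the toy values are assumptions for illustration only.

```python
def average_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return None if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Hypothetical 2-dimensional embeddings; the GloVe vectors used here have 50.
toy_embeddings = {'milk': [1.0, 0.0], 'sugar': [0.0, 1.0], 'butter': [1.0, 1.0]}

# 'qwerty' is not in the table, so it is skipped rather than counted as zero.
doc_vec = average_vector(['milk', 'sugar', 'qwerty'], toy_embeddings, dim=2)
print(doc_vec)
```

Averaging gives every document a vector of the same length regardless of how many words it has, at the cost of discarding word order.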

In the cell below, use the Doc2Vec model from gensim to train your own document-level feature vectors. Follow the documentation below, focusing on its "Training the Model" section.



from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim
# T15: for each document in X create TaggedDocument and store inside of a list
# HINT: Doc2Vec takes as input a list of TaggedDocument objects.
taggedlis = []
for i, docx in enumerate(X):
  # tags must be a list; a bare string such as 'tag' is iterated character by character
  inj = TaggedDocument(words=docx, tags=[i])
  taggedlis.append(inj)
doc_2_vec=Doc2Vec(taggedlis)
# print the shape of the vector learned for the first document
doc_2_vec[0].shape
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=2)
model.build_vocab(taggedlis)
model.train(taggedlis, total_examples=model.corpus_count, epochs=model.epochs)
taggedlis[0]
## T16: using Doc2Vec create a feature representation for each document
doc_2_vec[0]
# T17: apply the model to X and save the data to the modeling_data/ folder in a file called doc_2_vec.txt
## HINT: Doc2Vec returns an object with a docvecs.vector_docs attribute that holds our data as an array.
nx = []
for vec in X:
  vector = model.infer_vector(vec)
  nx.append(vector)
os.makedirs('modeling_data', exist_ok=True)
np.savetxt('modeling_data/doc_2_vec.txt', np.array(nx))
# T18: print the shape of the doc_2_vec array (# rows, # columns) format
np.array(nx).shape

Classification

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
# T19: you will use sklearn to create a text classification model
# HINT: below is just an example, you will need to swap out X for each of the feature representations above.
X = np.array(nx) #glove_data.drop(['title', 'flair', 'body', 'cls_body'], axis=1).values
y = np.where(processed_data['flair'] == 'Recipe', 1, 0).ravel()
xtr,xts,ytr,yts = train_test_split(X,y,test_size=0.2)
# HINT: for each vector representation fit multiple models.
# HINT: select the best parameters for each model using grid search
# HINT: use the Pipeline object to streamline your code. (https://scikit-learn.org/stable/modules/compose.html#pipeline)
# newton-cg and lbfgs only support the l2 penalty (or none), so l1 is searched with liblinear (an added choice)
parameters = [{'solver': ['newton-cg', 'lbfgs'],
               'penalty': ['l2'],
               'C': [0.001, 0.01]},
              {'solver': ['liblinear'],
               'penalty': ['l1', 'l2'],
               'C': [0.001, 0.01]}]
logreg = LogisticRegression()
grid_search = GridSearchCV(estimator = logreg,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 3,
                           verbose=0)
grid_search.fit(xtr, ytr)
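
GridSearchCV accepts either one dictionary or a list of dictionaries for `param_grid`, and the two are enumerated differently: a list is treated as several independent grids, while a single dictionary is expanded into every combination of its values. The sketch below imitates that enumeration in plain Python. `expand_grid` is a hypothetical helper written for this illustration; scikit-learn's own utility for this is `ParameterGrid`.

```python
from itertools import product

def expand_grid(param_grid):
    """List every parameter combination a grid search would try.

    A single dict is treated as one grid; a list of dicts as several
    independent grids whose combinations are simply concatenated.
    """
    grids = [param_grid] if isinstance(param_grid, dict) else param_grid
    combos = []
    for grid in grids:
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            combos.append(dict(zip(keys, values)))
    return combos

# Three separate one-parameter grids: each candidate varies a single setting.
separate = expand_grid([{'solver': ['newton-cg', 'lbfgs']},
                        {'penalty': ['l1', 'l2']},
                        {'C': [0.001, 0.01]}])
# One combined grid: every solver is paired with every penalty and every C.
combined = expand_grid({'solver': ['newton-cg', 'lbfgs'],
                        'penalty': ['l1', 'l2'],
                        'C': [0.001, 0.01]})
print(len(separate), len(combined))
```

With separate grids, parameters that a dictionary does not mention keep the estimator's defaults, so a candidate such as `{'penalty': 'l1'}` is fitted with the default solver. Grouping parameters into one dictionary per compatible solver family avoids combinations the solver cannot handle.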



