NLP (Natural Language Processing)
In this blog we will check our understanding of the concepts learned in NLP.
To summarize, NLP (Natural Language Processing) is:
Computer manipulation of natural languages.
A set of methods and algorithms for making natural language accessible to computers.
The image below summarizes the basic steps involved in any NLP task:
There are 5 exercises in total, plus an optional exercise. To answer some of the exercises (3, 4 and 5) you will need to write code from scratch in the code cells containing:
# write your code here
Before starting make sure you are using the GPU.
!nvidia-smi
Tokenization
We will use the TensorFlow Keras Tokenizer to tokenize our text. As per the TensorFlow documentation: “This class allows to vectorize a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, based on tf-idf.” There are many methods that we can use, but below we will use these two to fit the tokenizer to our text data and convert a given text to tokens:
fit_on_texts: Updates the internal vocabulary based on a list of texts. This method creates the vocabulary index based on word frequency. So if you give it something like "The cat sat on the mat.", it will create a dictionary “word_index” in which every word gets a unique integer value. 0 is reserved for padding, and a lower integer means a more frequent word.
texts_to_sequences: Transforms each text in texts to a sequence of integers. It takes each word in the text and replaces it with its corresponding integer value from the word_index dictionary.
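To make the two methods concrete, here is a minimal pure-Python sketch of the same idea. It is not the Keras implementation: the helper names `tokenize`, `fit_word_index` and `to_sequences` are hypothetical, and the regular expression is only a simplified stand-in for the Tokenizer's default filtering. It builds a frequency-ranked index starting at 1 and skips words that were never seen during fitting.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters, digits and apostrophes
    # (a simplification of the Keras Tokenizer's default filtering).
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_word_index(texts):
    # Count every word, then rank by descending frequency.
    # Ranks start at 1 because index 0 is reserved for padding.
    counts = Counter(word for text in texts for word in tokenize(text))
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index):
    # Replace each known word with its integer; unknown words are skipped.
    return [[word_index[w] for w in tokenize(t) if w in word_index]
            for t in texts]

word_index = fit_word_index(["The cat sat on the mat."])
sequences = to_sequences(["The dog sat on the mat"], word_index)
print("word_index :", word_index)
print("sequences  :", sequences)
```

In this toy run "the" occurs twice and therefore gets the lowest index, while "dog" was never part of the fitted text and has no entry in the index.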
from tensorflow.keras.preprocessing.text import Tokenizer
t = Tokenizer()
fit_text = ["In science, you can say things that seem crazy, but in the long run, they can turn out to be right"]
t.fit_on_texts(fit_text)
test_text1 = "I would like to take a right turn"
test_text2 = "That man is crazy"
sequences = t.texts_to_sequences([test_text1, test_text2])
print('sequences : ',sequences,'\n')
print('word_index : ',t.word_index)
Exercise 1: In the code above we tokenize two sentences:
"I would like to take a right turn"
"That man is crazy"
a. What is the tokenized version of these sentences?
b. The first sentence has 8 words and the second has 4 words, yet their tokenized versions contain 3 and 2 integers respectively. Why is that?
Answer 1: Write Your Answer Here
Embeddings
Exercise 2: In class we learned about embeddings; let us explore them a little more. Go to the Embeddings Projector site, play around a bit, and answer the following questions:
For the word 'fantastic', list the five nearest neighbours when using the Word2Vec 10K embedding.
Repeat the exercise after changing the embeddings to Word2Vec All.
Reflect on the result. How do you think the word 'fantastic' is related to its five nearest neighbours?
Answer 2: Write Your answer Here
Word Similarities
Let us now train a Word2Vec model on the text8 dataset.
!mkdir data
import gensim.downloader as api
from gensim.models import Word2Vec
info = api.info("text8")
assert(len(info) > 0)
dataset = api.load("text8") # download and load text 8 dataset
model = Word2Vec(dataset) # we create an embedding using Word2vec model for this data
model.save("data/text8-word2vec.bin")
Load the saved model as KeyedVector to save space.
from gensim.models import KeyedVectors
model = KeyedVectors.load("data/text8-word2vec.bin") # Help in saving memory by shedding the internal data structures necessary for training
word_vectors = model.wv ## Gives the word vectors
Helper function to print
def print_most_similar(word_conf_pairs, k):
    # Print the first k (word, similarity) pairs, one per line
    for i, (word, conf) in enumerate(word_conf_pairs):
        print("{:.3f} {:s}".format(conf, word))
        if i >= k-1:
            break
    if k < len(word_conf_pairs):
        print("...")
print_most_similar(word_vectors.most_similar('king'),5)
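`most_similar` ranks words by the cosine similarity between their vectors. The sketch below shows that ranking on a handful of made-up 3-dimensional vectors. The vectors and the `toy_most_similar` helper are illustrative assumptions, not values or functions from the text8 model.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors, chosen only so that "king" and "queen" point the same way
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def toy_most_similar(word, vectors, topn=2):
    # Score every other word against the query and sort best-first
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

neighbours = toy_most_similar("king", toy_vectors)
for word, score in neighbours:
    print("{:.3f} {:s}".format(score, word))
```

The returned list has the same (word, score) shape that `print_most_similar` expects, so the helper above could print it directly.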
Exercise 3: In class we learned how to use Word2Vec embeddings in Gensim. Using the Word2Vec model trained on the ‘text8’ dataset, give the five words most similar to the word ‘tree’.
## Write your code here
Answer 3: Write your Answer Here
Word Arithmetics
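Word arithmetic adds and subtracts embedding vectors and then looks for the word whose vector lies closest to the result. The toy 2-dimensional vectors below are invented for illustration (one axis loosely standing for "royalty", the other for "gender"), and the `analogy` helper is hypothetical. With a trained Gensim model the usual route is `most_similar(positive=[...], negative=[...])`, which performs a comparable computation on normalized vectors.

```python
import math

# Invented vectors: first axis ~ "royalty", second axis ~ "gender"
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(positive, negative, vectors):
    # Add the "positive" vectors and subtract the "negative" ones
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the input words, then pick the closest remaining word
    used = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in used]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

result = analogy(positive=["woman", "king"], negative=["man"], vectors=toy_vectors)
print("woman + king - man =", result)
```

The input words are excluded from the candidates because the nearest vector to the result is often one of the inputs themselves.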
Exercise 4: With the Word2Vec model trained on the text8 dataset, calculate the following:
woman + king - man = ?
chair + table - work = ?
Queens - queen + person = ?
## Write your code here
Answer 4: Write Your answer Here
Spam Classifier
Some helper code:
importing required modules
defining helper functions
building the model
The code cells below are hidden, that is, by default you cannot see the code in them. Remember to run these cells anyway. You can view the code by double-clicking a cell.
#@title
# The modules needed to run the code
import argparse # To read commandline argument and parse it
import gensim.downloader as api
import numpy as np
import os # For file and directory handling
import shutil # For file and directory handling
import tensorflow as tf
from sklearn.metrics import accuracy_score, confusion_matrix #For measuring performance
# Some parameters
DATA_DIR = "data" # Data directory to save embedding
EMBEDDING_NUMPY_FILE = os.path.join(DATA_DIR, "E.npy") # Numpy file containing word embeddings
DATASET_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip" # Dataset URL from where data is downloaded
EMBEDDING_MODEL = "glove-wiki-gigaword-300" # The gensim embedding model we will use
EMBEDDING_DIM = 300 # The embedding dimensions
NUM_CLASSES = 2 # The number of classes in output-- Spam or Ham
BATCH_SIZE = 128 # The batch size
NUM_EPOCHS = 3 # number of epochs for which model is to be trained
# data distribution is 4827 ham and 747 spam (total 5574), which
# works out to approx 87% ham and 13% spam, so we take reciprocals
# and this works out to being each spam (1) item as being approximately
# 8 times as important as each ham (0) message.
CLASS_WEIGHTS = { 0: 1, 1: 8 } # To take care of imbalance in classes
tf.random.set_seed(42) # Set the seed for random number generation to be able to reproduce results.
# Data downloading and data Processing
def download_and_read(url):
    """
    The function downloads the data from the given url and splits it into texts and labels.
    Uses the tf.keras.utils.get_file() function to download the data from the url --> the function
    downloads the data from the given url, extracts it from the zip file and places it in the folder "datasets"
    with the name specified in the first argument.
        tf.keras.utils.get_file(
            fname, origin, untar=False, md5_hash=None, file_hash=None,
            cache_subdir='datasets', hash_algorithm='auto', extract=False,
            archive_format='auto', cache_dir=None)
    Arguments:
        url: The url link of the dataset in zip format
    Returns:
        Two lists containing texts and respective labels
    """
    local_file = url.split('/')[-1] # take the file name (last string after '/') from the url
    p = tf.keras.utils.get_file(local_file, url,
                                extract=True, cache_dir=".") # download the data from the url to the folder datasets with the name given in local_file
    labels, texts = [], []
    local_file = os.path.join("datasets", "SMSSpamCollection") # path of the file from which to read data: datasets/SMSSpamCollection
    with open(local_file, "r") as fin:
        for line in fin:
            label, text = line.strip().split('\t') # The label and text are on one line, separated by a tab.
            labels.append(1 if label == "spam" else 0)
            texts.append(text)
    return texts, labels
def build_embedding_matrix(sequences, word2idx, embedding_dim,
                           embedding_file):
    """
    The function reads the dict word2idx (word --> number) and builds a matrix holding the
    corresponding word vector for each word, as defined by the embedding model.
    Arguments:
        sequences: not needed, not used-- kept only for backward compatibility with the TF1 book
        word2idx: Dictionary containing words in the text and their respective idx as given by tokenizer.
        embedding_dim: The number of units for the embedding layer
        embedding_file: The data file in which embeddings will be stored for future use.
    """
    if os.path.exists(embedding_file): # If the embedding file already exists, just load it into memory
        E = np.load(embedding_file)
    else: # Else create the embedding file using the model specified in EMBEDDING_MODEL
        vocab_size = len(word2idx) # The vocabulary size is the number of unique words in the text
        E = np.zeros((vocab_size, embedding_dim)) # Creates a variable to store embeddings
        word_vectors = api.load(EMBEDDING_MODEL) # Get the embeddings from Gensim
        for word, idx in word2idx.items():
            try:
                E[idx] = word_vectors.word_vec(word) # Look up the word vector for each word and store it in row idx
            except KeyError: # word not in embedding, its row stays all zeros
                pass
            # except IndexError: # UNKs are mapped to seq over VOCAB_SIZE as well as 1
            #     pass
        np.save(embedding_file, E) # The embeddings are saved in a file for future reference
    return E
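The `CLASS_WEIGHTS` value set in the parameters above follows from the class counts quoted in the comments (4827 ham, 747 spam). The short check below repeats that arithmetic using only those two numbers. The variable names are illustrative and not part of the original code.

```python
# Class counts as quoted in the parameter comments
ham_count, spam_count = 4827, 747
total = ham_count + spam_count

ham_fraction = ham_count / total
spam_fraction = spam_count / total

# Taking the reciprocal of each class fraction gives a raw weight:
# the rarer the class, the larger its weight.
ham_weight = 1 / ham_fraction
spam_weight = 1 / spam_fraction

print("total messages: {:d}".format(total))
print("ham : fraction {:.3f}, raw weight {:.2f}".format(ham_fraction, ham_weight))
print("spam: fraction {:.3f}, raw weight {:.2f}".format(spam_fraction, spam_weight))
```

The raw spam weight comes out between 7 and 8, and the ham weight just above 1, which is where the approximate setting `{ 0: 1, 1: 8 }` comes from.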
#@title
class SpamClassifierModel(tf.keras.Model): # The model is built using the model API of Keras with tf.keras.Model as the parent class.
    # The class inherits the training and prediction methods of the parent class.
    def __init__(self, vocab_sz, embed_sz, input_length,
                 num_filters, kernel_sz, output_sz,
                 run_mode, embedding_weights,
                 **kwargs):
        super(SpamClassifierModel, self).__init__(**kwargs)
        if run_mode == "scratch": # Choose the embedding layer: scratch means the weights will be trained from scratch
            self.embedding = tf.keras.layers.Embedding(vocab_sz,
                                                       embed_sz,
                                                       input_length=input_length,
                                                       trainable=True)
        elif run_mode == "vectorizer": # Vectorizer means we use the pre-trained weights --> Transfer Learning
            self.embedding = tf.keras.layers.Embedding(vocab_sz,
                                                       embed_sz,
                                                       input_length=input_length,
                                                       weights=[embedding_weights],
                                                       trainable=False)
        else: # This is the fine-tuning mode: we use pre-trained weights for the embedding layer and fine-tune them.
            self.embedding = tf.keras.layers.Embedding(vocab_sz,
                                                       embed_sz,
                                                       input_length=input_length,
                                                       weights=[embedding_weights],
                                                       trainable=True)
        self.dropout = tf.keras.layers.SpatialDropout1D(0.2) # Add a dropout layer to avoid overfitting.
        self.conv = tf.keras.layers.Conv1D(filters=num_filters, # Define the 1D convolutional layer
                                           kernel_size=kernel_sz,
                                           activation="relu")
        self.pool = tf.keras.layers.GlobalMaxPooling1D() # The pooling layer
        self.dense = tf.keras.layers.Dense(output_sz,
                                           activation="softmax") # The last classifying layer is a fully connected Dense layer

    def call(self, x): # This function performs the forward pass in the model.
        x = self.embedding(x)
        x = self.dropout(x)
        x = self.conv(x)
        x = self.pool(x)
        x = self.dense(x)
        return x
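To follow what `call` does to a batch, it can help to trace the tensor shapes layer by layer. The sketch below does this with plain arithmetic, assuming the Keras defaults used in the model above (Conv1D with `padding="valid"` and a stride of 1). The `trace_shapes` helper and the example sequence length of 50 are hypothetical.

```python
def trace_shapes(batch, seq_len, embed_sz, num_filters, kernel_sz, output_sz):
    # Each entry is (layer name, output shape of that layer)
    shapes = [("input", (batch, seq_len))]
    # Embedding turns every integer into a vector of size embed_sz
    shapes.append(("embedding", (batch, seq_len, embed_sz)))
    # Dropout never changes the shape
    shapes.append(("dropout", (batch, seq_len, embed_sz)))
    # A "valid" convolution with stride 1 shortens the sequence by kernel_sz - 1
    conv_len = seq_len - kernel_sz + 1
    shapes.append(("conv", (batch, conv_len, num_filters)))
    # Global max pooling keeps one value per filter
    shapes.append(("pool", (batch, num_filters)))
    # The dense softmax layer outputs one probability per class
    shapes.append(("dense", (batch, output_sz)))
    return shapes

shapes = trace_shapes(batch=128, seq_len=50, embed_sz=300,
                      num_filters=256, kernel_sz=3, output_sz=2)
for name, shape in shapes:
    print("{:10s} {}".format(name, shape))
```

Because the pooling step collapses the sequence axis, the dense layer sees the same input size regardless of how long the padded messages are.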
#@title
# The code below requires a folder to be created
!mkdir -p data
## Now we will use the functions and model defined above --> ideally they should be done in a separate file-- main.py
# read data
texts, labels = download_and_read(DATASET_URL)
# tokenize and pad text so that each text is of same size
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(texts)
text_sequences = tokenizer.texts_to_sequences(texts)
text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences)
num_records = len(text_sequences)
max_seqlen = len(text_sequences[0])
#print("{:d} sentences, max length: {:d}".format(num_records, max_seqlen))
# labels --> convert labels to categorical labels (one hot encoded)
cat_labels = tf.keras.utils.to_categorical(labels, num_classes=NUM_CLASSES)
# vocabulary --> Create word mapping and its inverse
word2idx = tokenizer.word_index
idx2word = {v:k for k, v in word2idx.items()}
word2idx["PAD"] = 0
idx2word[0] = "PAD"
vocab_size = len(word2idx)
#print("vocab size: {:d}".format(vocab_size))
# load the dataset as tensors, split it into test, train and validation set
dataset = tf.data.Dataset.from_tensor_slices((text_sequences, cat_labels))
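dataset = dataset.shuffle(10000, reshuffle_each_iteration=False) # shuffle once only, so the test, validation and train splits do not overlap between passes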
test_size = num_records // 4
val_size = (num_records - test_size) // 10
test_dataset = dataset.take(test_size)
val_dataset = dataset.skip(test_size).take(val_size)
train_dataset = dataset.skip(test_size + val_size)
test_dataset = test_dataset.batch(BATCH_SIZE, drop_remainder=True)
val_dataset = val_dataset.batch(BATCH_SIZE, drop_remainder=True)
train_dataset = train_dataset.batch(BATCH_SIZE, drop_remainder=True)
# Build the embedding
E = build_embedding_matrix(text_sequences, word2idx, EMBEDDING_DIM,
EMBEDDING_NUMPY_FILE)
#print("Embedding matrix:", E.shape)
#Since we are not passing the mode by command line in this file we need to give a value to run_mode
run_mode = 'scratch'
# Now we use the SpamClassifierModel class to create a model
conv_num_filters = 256
conv_kernel_size = 3
model = SpamClassifierModel(
vocab_size, EMBEDDING_DIM, max_seqlen,
conv_num_filters, conv_kernel_size, NUM_CLASSES,
run_mode, E)
model.build(input_shape=(None, max_seqlen))
model.summary()
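The split logic above is easy to check by hand. Assuming the 5574 records mentioned in the parameter comments, this sketch repeats the integer arithmetic and shows how many full batches each split keeps when `drop_remainder=True`. The variable names mirror the code above, and the record count is an assumption rather than a value read from the data.

```python
num_records = 5574   # assumed from the class-distribution comment
batch_size = 128     # BATCH_SIZE in the parameters above

# Same integer arithmetic as the split in the notebook
test_size = num_records // 4
val_size = (num_records - test_size) // 10
train_size = num_records - test_size - val_size

splits = {"test": test_size, "val": val_size, "train": train_size}
full_batches = {name: size // batch_size for name, size in splits.items()}

for name, size in splits.items():
    kept = full_batches[name] * batch_size
    print("{:5s} {:5d} records -> {:2d} full batches ({:d} records kept)".format(
        name, size, full_batches[name], kept))
```

Records that do not fill a complete batch are dropped, so each split ends up slightly smaller than its nominal size.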
Compile and train the model
# Define compile and train
model.compile(optimizer="adam", loss="categorical_crossentropy",
metrics=["accuracy"])
# Now we train the model
model.fit(train_dataset, epochs=NUM_EPOCHS,
validation_data=val_dataset,
class_weight=CLASS_WEIGHTS)
And now we evaluate the model on the test dataset.
# Lastly we evaluate the trained model against the test set
labels, predictions = [], []
for Xtest, Ytest in test_dataset:
    Ytest_ = model.predict_on_batch(Xtest) # for each test batch, predict the class probabilities
    ytest = np.argmax(Ytest, axis=1) # Get the label with the highest probability from the actual test output
    ytest_ = np.argmax(Ytest_, axis=1) # Get the label with the highest probability from the predicted test output
    labels.extend(ytest.tolist()) # add the true labels to the list
    predictions.extend(ytest_.tolist()) # add the predicted labels to the list
print("test accuracy: {:.3f}".format(accuracy_score(labels, predictions))) # Calculate accuracy score
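Accuracy alone does not say which kind of mistake a classifier makes. With spam as the positive class (label 1), a false positive is a ham message flagged as spam, and a false negative is a spam message that slips through. The sketch below counts both on a small made-up pair of label lists. The `count_errors` helper and the toy lists are illustrative only. `confusion_matrix` from scikit-learn, already imported above, reports the same four counts.

```python
def count_errors(true_labels, predicted_labels):
    # Tally the four outcomes, treating label 1 (spam) as the positive class
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 1:
            counts["tp"] += 1      # spam correctly flagged
        elif truth == 0 and pred == 1:
            counts["fp"] += 1      # ham wrongly flagged as spam
        elif truth == 1 and pred == 0:
            counts["fn"] += 1      # spam that was missed
        else:
            counts["tn"] += 1      # ham correctly passed
    return counts

# Made-up labels, not results from the trained model
toy_labels      = [0, 0, 0, 1, 1, 0, 1, 0]
toy_predictions = [0, 1, 0, 1, 0, 0, 1, 0]

counts = count_errors(toy_labels, toy_predictions)
print(counts)
```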
Exercise 5: In the spam classifier, what are the false positives and false negatives on the test dataset? What does this tell you about the trained model?
# Write your code here
Answer 5: Write Your answer Here
To get the complete solution to the above exercises, you can contact us. If you have any other NLP-related project assignments, you can share them with us as well.
The Realcode4you.com expert team provides complete support for your project. Here you get plagiarism-free work and quality code at a reasonable price.
Send your NLP project requirement details to:
realcode4you@gmail.com