Intro To NLP In Python: Types, Tokens And Unix Commands | NLP Sample Paper

realcode4you
Oct 4, 2021

How to do this problem set:

Use Python version 3. We strongly suggest installing Python from the Anaconda Individual Edition software package.
Download large_movie_review_dataset.zip and lotr_script.txt. We will use these two datasets in this homework.
Most of these questions require writing Python code or Unix commands and computing results, while the remainder have textual answers. To complete this assignment, you will need to fill out the supporting files, hw1.py and hw1.sh.
For all of the textual answers, replace the placeholder text ("Answer in one or two sentences here.") with your answer.
This assignment is designed so that you can run all cells in a few minutes of computation time. If it is taking longer than that, you probably have a mistake in your code.

# Run this cell! It sets some things up for you.

# Standard-library module used below to build file paths.
import os

import matplotlib.pyplot as plt

# This code imports your work from hw1.py
from hw1 import *

# This code makes plots appear inline in this document rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (5, 4) # set default size of plots

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

# download the IMDB large movie review corpus to a file location on your computer

PATH_TO_DATA = 'large_movie_review_dataset'  # set this variable to point to the location of the IMDB corpus on your computer
POS_LABEL = 'pos'
NEG_LABEL = 'neg'
TRAIN_DIR = os.path.join(PATH_TO_DATA, "train")
TEST_DIR = os.path.join(PATH_TO_DATA, "test")

for label in [POS_LABEL, NEG_LABEL]:
    if len(os.listdir(TRAIN_DIR + "/" + label)) == 12500:
        print("Great! You have 12500 {} reviews in {}".format(label, TRAIN_DIR + "/" + label))
    else:
        print("Oh no! Something is wrong. Check your code which loads the reviews")

# Actually reading the data you are working with is an important part of NLP! Let's look at one of these reviews

# Use a context manager so the file is closed after reading.
with open(TRAIN_DIR + "/neg/3740_2.txt") as f:
    print(f.read())

Types and tokens

One major part of any NLP project is word tokenization. Word tokenization is the task of segmenting text into individual words, called tokens. In this assignment, we will use simple whitespace tokenization. You will have a chance to improve this for extra credit at the end of the assignment. Take a look at the tokenize_doc function in hw1.py. You should not modify tokenize_doc, but make sure you understand what it is doing.

# We have provided a tokenize_doc function in hw1.py. Here is a short demo of how it works

d1 = "This SAMPLE doc has   words tHat  repeat repeat"
bow = tokenize_doc(d1)

assert bow['this'] == 1
assert bow['sample'] == 1
assert bow['doc'] == 1
assert bow['has'] == 1
assert bow['words'] == 1
assert bow['that'] == 1
assert bow['repeat'] == 2

bow2 = tokenize_doc("Computer science is both practical and abstract.")
for b in bow2:
    print(b)
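
For readers without hw1.py at hand, the behavior shown in the demo above can be approximated with a few lines. This is only a sketch: whitespace_bow is a hypothetical stand-in, not the provided tokenize_doc, and it assumes (as the assertions above suggest) that tokens are lowercased and split on runs of whitespace.

```python
from collections import Counter

def whitespace_bow(doc):
    """Hypothetical stand-in for tokenize_doc: lowercase, split on whitespace, count."""
    return Counter(doc.lower().split())

demo_bow = whitespace_bow("This SAMPLE doc has   words tHat  repeat repeat")
print(demo_bow["repeat"])  # 2

# Punctuation is not separated from words: the last token keeps its period.
print(sorted(whitespace_bow("practical and abstract.")))  # ['abstract.', 'and', 'practical']
```

The second call shows the main weakness of whitespace tokenization: "abstract." and "abstract" would be counted as different word types.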

Task 1:

Now we are going to count the word types and word tokens in the corpus. In the cell below, use the word_counts dictionary variable to store the count of each word in the corpus. Use the tokenize_doc function to break documents into tokens.

word_counts keeps track of how many times a word type appears across the corpus. For instance, word_counts["dog"] should store the number 990 -- the count of how many times the word dog appears in the corpus.
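
One possible way to accumulate corpus-wide counts is sketched below on a toy corpus. The names count_corpus, toy_tokenize, toy_docs and toy_counts are hypothetical and chosen for illustration; in the assignment you would read the review files and call the provided tokenize_doc instead.

```python
from collections import Counter

def count_corpus(docs, tokenize):
    """Add each document's token counts into one corpus-wide counter."""
    totals = Counter()
    for doc in docs:
        # Counter.update adds counts from a mapping instead of replacing them.
        totals.update(tokenize(doc))
    return totals

def toy_tokenize(doc):
    """Hypothetical tokenizer: lowercase whitespace tokenization."""
    return Counter(doc.lower().split())

# Toy documents, not taken from the IMDB corpus.
toy_docs = ["the dog barked", "The dog slept", "a cat slept"]
toy_counts = count_corpus(toy_docs, toy_tokenize)
print(toy_counts["dog"])  # 2
```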

Task 2:

Fill out the functions n_word_types and n_word_tokens in hw1.py. These functions return the total number of word types and tokens in the corpus. Important: the autoreload "magic" that you set up earlier in the assignment should automatically reload functions as you make changes and save. If you run into trouble, you can always restart the notebook and clear any .pyc files.
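
The distinction between the two quantities can be made concrete on a small dictionary. A word type is a distinct vocabulary entry, while a token is a single occurrence of a word. The sketch below uses hypothetical names (example_counts, count_types, count_tokens) and toy numbers; it is not the signature used in hw1.py.

```python
# Toy counts, not taken from the IMDB corpus.
example_counts = {"the": 2, "dog": 2, "barked": 1, "slept": 2, "a": 1, "cat": 1}

def count_types(counts):
    """Number of distinct words, i.e. the number of dictionary keys."""
    return len(counts)

def count_tokens(counts):
    """Total number of word occurrences, i.e. the sum of all counts."""
    return sum(counts.values())

print(count_types(example_counts))   # 6
print(count_tokens(example_counts))  # 9
```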

Task 3:

Using the word_counts dictionary you just created, make a new list of (word, count) pairs called sorted_list, where the tuples are sorted by count in descending order. Then print the first 30 values from sorted_list.
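
Sorting dictionary items by count is a common pattern; a minimal sketch on toy data follows. The names small_counts and small_sorted are hypothetical, and the numbers are made up for illustration.

```python
# Toy counts, not taken from the IMDB corpus.
small_counts = {"dog": 3, "the": 5, "cat": 1, "slept": 2}

# items() yields (word, count) pairs; sort on the count, largest first.
small_sorted = sorted(small_counts.items(), key=lambda pair: pair[1], reverse=True)

print(small_sorted[:2])  # [('the', 5), ('dog', 3)]
```

Slicing the sorted list, as in the last line, is how the first 30 entries would be selected on the full corpus.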

Unix Text Processing

In this part, you will practice extracting and processing information from text with Unix commands. Download lotr_script.txt from the course website to a file location on your computer. This text file contains the movie script of The Fellowship of the Ring (2001). The script comes from a larger corpus of movie scripts, the ScriptBase-J corpus.

First, let's open and examine lotr_script.txt.

Task 4:

Describe the structure of this script. How are roles, scene directions, and dialogue organized?

Task 5:

Use Unix commands to print the name of each character with dialogue in the script, one name per line. This script's text isn't perfect, so expect a few additional names.

Implement this in hw1.sh. Then, copy your implementation and its resulting output into the following two cells.

Task 6:

Now, let's extract and analyze the dialogue of this script using Unix commands.

First, extract all lines of dialogue in this script. Then, normalize and tokenize this text such that all alphabetic characters are converted to lowercase and words are sequences of alphabetic characters. Finally, print the top 20 most frequent word types and their corresponding counts.

Hint: Ignore parentheticals. These contain short stage directions.

Implement this in hw1.sh. Then, copy your implementation and its resulting output into the following two cells.
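
Before writing the Unix pipeline, it can help to prototype the normalization rule in Python and check it on a short string. The sketch below is a Python analogue of the normalize, tokenize and count steps, not the Unix solution the task asks for; top_word_types and toy_dialogue are hypothetical names, and the sample text is made up rather than taken from the script.

```python
import re
from collections import Counter

def top_word_types(text, k):
    """Lowercase the text, keep maximal runs of a-z as words, return the k most frequent."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(k)

# Made-up sample text, not a line from lotr_script.txt.
toy_dialogue = "Hello there! Don't go. Hello again, there."

for word, count in top_word_types(toy_dialogue, 2):
    print(count, word)

# Under this rule an apostrophe splits a word: "Don't" becomes "don" and "t".
print(re.findall(r"[a-z]+", "Don't".lower()))  # ['don', 't']
```

Treating every non-alphabetic character as a separator is simple but lossy, so contractions and hyphenated words are broken into pieces.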

Task 7:

If we instead tokenized *all* text in the script, how might the results from Task 6 change? Are there specific word types that might become more frequent?
