
Building a Conversational Chatbot in Machine Learning




Aim of the Project

The aim of the project is to build an intelligent conversational chatbot, Riki, that can understand complex queries from the user and respond intelligently.


Background

R-Intelligence Inc., an AI startup, has partnered with bluedit.io, an online chat and discussion website. bluedit.io has an average of over 5 million active customers across the globe and more than 100,000 active chat rooms. Due to the increased traffic, they are looking to improve the user experience with a chatbot moderator that engages users in meaningful conversation and keeps them updated on trending topics, simply by chatting with Riki, the chatbot. The Artificial Intelligence-powered chat experience provides customers with easy access to information and a host of options.


Business Requirement

R-Intelligence Inc. has invested in Python, PySpark, and TensorFlow. Using the emerging technologies of Artificial Intelligence, Machine Learning, and Natural Language Processing, Riki the chatbot should make the whole conversation feel as realistic as talking to an actual human.


The chatbot should understand that users have different intents and make it extremely simple to act on them by presenting users with the options and recommendations that best suit their needs.


Suggested Approach

R-Intelligence Inc. adopted a pure Natural Language Processing approach, in which Seq2Seq models (encoder and decoder) are used as the state-of-the-art technique for end-to-end text generation in a conversational bot.



Tasks to be performed

  • Download the GloVe model available at https://nlp.stanford.edu/projects/glove/. Specification: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download): glove.twitter.27B.zip

  • Load the GloVe word embeddings into a dictionary where the key is a unique word token and the value is a d-dimensional vector

  • Data preparation: filter the conversations to a maximum word length and convert the dialogue pairs into input texts and target texts. Add start and end tokens to mark the beginning and end of each sentence.


Create two dictionaries:

  • target_word2id

  • target_id2word

and save them to disk in NumPy file format (a sketch covering the GloVe loading and dictionary steps follows below).
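
As a rough sketch of these steps, the loading and dictionary-building code might look like the following. The file name, the choice of 100d vectors, and the target_texts variable are illustrative assumptions, not fixed by the project:

```python
import numpy as np

# Load GloVe vectors into a dict: word -> d-dimensional numpy vector.
# Assumes glove.twitter.27B.zip has been unzipped; the 100d file is used here.
def load_glove(path='glove.twitter.27B.100d.txt'):
    word2em = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word2em[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return word2em

word2em = load_glove()

# Build the target-word lookups from the filtered target texts.
# target_texts is assumed to hold sentences already wrapped in start/end tokens.
target_counter = {}
for sentence in target_texts:
    for token in sentence.lower().split():
        target_counter[token] = target_counter.get(token, 0) + 1

target_word2id = {w: i + 1 for i, w in enumerate(sorted(target_counter))}
target_word2id['UNK'] = 0
target_id2word = {i: w for w, i in target_word2id.items()}

# Persist both lookups as NumPy files for reuse at inference time.
np.save('target_word2id.npy', target_word2id)
np.save('target_id2word.npy', target_id2word)
```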


• Prepare the input data with embeddings. The input data is a list of lists:

  • The first list is a list of sentences

  • Each sentence is a list of words

• Generate training data per batch

• Define the model architecture and perform the following steps (see the sketch after this list):

  • Step 1: Use an LSTM encoder to encode the input words in the form (encoder outputs, encoder hidden state, encoder context).

  • Step 2: Use an LSTM decoder to encode the target words in the form (decoder outputs, decoder hidden state, decoder context), using the encoder hidden state and encoder context (which represent the input memory) as its initial state.

  • Step 3: Use a dense layer to predict the next token from the vocabulary, given the decoder output generated in Step 2.

  • Step 4: Use loss='categorical_crossentropy' and optimizer='rmsprop'

• Generate the model summary

• Finally, generate the predictions
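
A minimal Keras sketch of Steps 1–4 is given below, assuming GloVe-embedded inputs of dimension 100 and the target_word2id dictionary built earlier; HIDDEN_UNITS is an illustrative choice, not a prescribed value:

```python
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

HIDDEN_UNITS = 256          # illustrative; tune as needed
GLOVE_EMBEDDING_SIZE = 100  # must match the GloVe file used
num_decoder_tokens = len(target_word2id)  # target vocabulary size

# Step 1: LSTM encoder over GloVe-embedded input words.
encoder_inputs = Input(shape=(None, GLOVE_EMBEDDING_SIZE), name='encoder_inputs')
encoder_lstm = LSTM(HIDDEN_UNITS, return_state=True, name='encoder_lstm')
encoder_outputs, encoder_state_h, encoder_state_c = encoder_lstm(encoder_inputs)
encoder_states = [encoder_state_h, encoder_state_c]  # the "input memory"

# Step 2: LSTM decoder initialised with the encoder's hidden state and context.
decoder_inputs = Input(shape=(None, GLOVE_EMBEDDING_SIZE), name='decoder_inputs')
decoder_lstm = LSTM(HIDDEN_UNITS, return_state=True, return_sequences=True,
                    name='decoder_lstm')
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)

# Step 3: dense softmax layer predicts the next token over the target vocabulary.
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_outputs)

# Step 4: compile with categorical cross-entropy and RMSprop.
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model.summary()
```

Initialising the decoder with [encoder_state_h, encoder_state_c] is what passes the input memory from the encoder to the decoder. For the final prediction step, the same trained layers are typically rewired into separate encoder and decoder inference models that emit one token at a time.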


Dataset Description

Dataset: Cornell Movie-Dialogs Corpus


Brief Description

This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts:


➢ 220,579 conversational exchanges between 10,292 pairs of movie characters

➢ involves 9,035 characters from 617 movies

➢ in total 304,713 utterances

➢ movie metadata included:

  • genres

  • release year

  • IMDB rating

  • number of IMDB votes

➢ character metadata included:

  • gender (for 3,774 characters)

  • position on movie credits (3,321 characters)


File Description

In all files the field separator is " +++$+++ "


➢ movie_titles_metadata.txt

Contains information about each movie title


fields:

  • movieID

  • movie title

  • movie year

  • IMDB rating

  • no. of IMDB votes

  • genres in the format ['genre1','genre2',…,'genreN']


➢ movie_characters_metadata.txt

Contains information about each movie character


fields:

  • characterID

  • character name

  • movieID

  • movie title

  • gender ("?" for unlabeled cases)

  • position in credits ("?" for unlabeled cases)


➢ movie_lines.txt

Contains the actual text of each utterance


fields:

  • lineID

  • characterID (who uttered this phrase)

  • movieID

  • character name

  • text of the utterance


➢ raw_script_urls.txt

Contains the URLs from which the raw sources were retrieved
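
To illustrate the " +++$+++ " separator described above, here is a minimal sketch that parses movie_lines.txt into a lineID-to-text dictionary (the corpus files are ISO-8859-1 encoded; the file is assumed to sit in the working directory):

```python
# Map each lineID to its utterance text using the corpus field separator.
id2line = {}
with open('movie_lines.txt', encoding='iso-8859-1') as f:
    for line in f:
        fields = line.rstrip('\n').split(' +++$+++ ')
        if len(fields) == 5:
            line_id, character_id, movie_id, character_name, text = fields
            id2line[line_id] = text
```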


How to Start with the Project?

1. Log in to Google Colab and load the notebook into the environment. Go to Runtime and choose “Change runtime type”. For faster training, choose GPU as the hardware accelerator and click SAVE.
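
To confirm that the GPU runtime is active, a quick check like this can be run in the first cell:

```python
import tensorflow as tf

# Should list at least one GPU device if the accelerator was applied.
print(tf.config.list_physical_devices('GPU'))
```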


2. Open the ‘Chatbot.ipynb’ notebook and start filling in the code.


3. Import all the necessary Python packages: NumPy and Pandas for numerical processing, data import, and preprocessing; scikit-learn for splitting datasets; and Keras/TensorFlow for deep learning model creation, training, testing, and inference. A sample import cell is shown below.
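
As a rough guide, the import cell might look like the following; the exact set of modules depends on how the notebook is structured:

```python
import numpy as np
import pandas as pd

# Train/test splitting
from sklearn.model_selection import train_test_split

# Deep learning model creation, training, and inference
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
```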


4. From here you can take over the project and start building the conversational chatbot.


If you need a solution to this problem, send your request to realcode4you@gmail.com and get unique code from our experts at an affordable price.