Simple Encoder and Decoder
Task 1: Simple Encoder and Decoder. Implement a simple Encoder-Decoder model.
Download the Italian to English translation dataset from here
You will find an ita.txt file in that ZIP; read that data using Python and preprocess it only in the way shown below:
You have to implement a simple Encoder and Decoder architecture
Use the BLEU score as the metric to evaluate your model. You can use any loss function you need.
You have to use TensorBoard to plot the graph, scores, and histograms of gradients.
a. Check the reference notebook
b. Resource 2
Downloading required files
Let's download the data, which has Italian sentences along with their English translations.
!wget http://www.manythings.org/anki/ita-eng.zip
!unzip ita-eng.zip
output:
Let's download the GloVe vectors (vectors for English words). Note that GloVe comes in 50d, 100d and 300d versions; you can choose any one of them based on your computing power.
__In our assignment we will be passing English text to the decoder, so we will be using these vectors in the decoder embedding layer.__
!wget https://www.dropbox.com/s/ddkmtqz01jc024u/glove.6B.100d.txt
--2020-08-29 17:23:15-- https://www.dropbox.com/s/ddkmtqz01jc024u/glove.6B.100d.txt
Resolving www.dropbox.com (www.dropbox.com)... 162.125.65.1, 2620:100:6021:1::a27d:4101
Connecting to www.dropbox.com (www.dropbox.com)|162.125.65.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/ddkmtqz01jc024u/glove.6B.100d.txt [following]
--2020-08-29 17:23:15-- https://www.dropbox.com/s/raw/ddkmtqz01jc024u/glove.6B.100d.txt
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uca40c7c23141e02ee0b88baa137.dl.dropboxusercontent.com/cd/0/inline/A-Zh8w3W9V509FRBs2Q6z_vIl5GAHk325hebibhDv3NCe7pjEvoG-xT_xgCzmCOenx8fE2dMHRArTYUpSihBcBOjI51uki4e2K5g35Epb2UQKxc7DmKiP140HbwpUBfvSBM/file# [following]
--2020-08-29 17:23:16-- https://uca40c7c23141e02ee0b88baa137.dl.dropboxusercontent.com/cd/0/inline/A-Zh8w3W9V509FRBs2Q6z_vIl5GAHk325hebibhDv3NCe7pjEvoG-xT_xgCzmCOenx8fE2dMHRArTYUpSihBcBOjI51uki4e2K5g35Epb2UQKxc7DmKiP140HbwpUBfvSBM/file
Resolving uca40c7c23141e02ee0b88baa137.dl.dropboxusercontent.com (uca40c7c23141e02ee0b88baa137.dl.dropboxusercontent.com)... 162.125.65.15, 2620:100:6021:15::a27d:410f
Connecting to uca40c7c23141e02ee0b88baa137.dl.dropboxusercontent.com (uca40c7c23141e02ee0b88baa137.dl.dropboxusercontent.com)|162.125.65.15|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 347116733 (331M) [text/plain]
Saving to: ‘glove.6B.100d.txt’
glove.6B.100d.txt 100%[===================>] 331.04M 22.1MB/s in 15s
2020-08-29 17:23:32 (21.7 MB/s) - ‘glove.6B.100d.txt’ saved [347116733/347116733]
Loading data
If you observe the data file, each field is separated by a tab '\t'.
Import Necessary Packages
import matplotlib.pyplot as plt
%matplotlib inline
# import seaborn as sns
import pandas as pd
import re
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
with open('ita.txt', 'r', encoding="utf8") as f:
    eng = []
    ita = []
    for i in f.readlines():
        eng.append(i.split("\t")[0])
        ita.append(i.split("\t")[1])

data = pd.DataFrame(data=list(zip(eng, ita)), columns=['english', 'italian'])
print(data.shape)
data.head()
output:
(341554, 2)
def decontractions(phrase):
    """decontractions takes text and converts contractions into their natural form.
    ref: https://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python/47091490#47091490"""
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"won\’t", "will not", phrase)
    phrase = re.sub(r"can\’t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    phrase = re.sub(r"n\’t", " not", phrase)
    phrase = re.sub(r"\’re", " are", phrase)
    phrase = re.sub(r"\’s", " is", phrase)
    phrase = re.sub(r"\’d", " would", phrase)
    phrase = re.sub(r"\’ll", " will", phrase)
    phrase = re.sub(r"\’t", " not", phrase)
    phrase = re.sub(r"\’ve", " have", phrase)
    phrase = re.sub(r"\’m", " am", phrase)
    return phrase
def preprocess(text):
    # convert all the text into lower case letters
    # use this function to remove the contractions: https://gist.github.com/anandborad/d410a49a493b56dace4f814ab5325bbd
    # remove all the special characters except space ' '
    text = text.lower()
    text = decontractions(text)
    text = re.sub('[^A-Za-z0-9 ]+', '', text)
    return text
def preprocess_ita(text):
    # convert all the text into lower case letters
    # remove the words between brackets ()
    # remove these characters: {'$', ')', '?', '"', '’', '.', '°', '!', ';', '/', "'", '€', '%', ':', ',', '('}
    # replace these special characters with a space: '\u200b', '\xa0', '-', '/'
    # we found these characters after observing the data points; feel free to explore more and see if you can find others
    # you are free to do more preprocessing
    # note that the model will learn better with better preprocessed data
    text = text.lower()
    text = decontractions(text)
    text = re.sub('[$)\?"’.°!;\'€%:,(/]', '', text)
    text = re.sub('\u200b', ' ', text)
    text = re.sub('\xa0', ' ', text)
    text = re.sub('-', ' ', text)
    return text
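As a quick sanity check, here is what the two cleaners do to a couple of made-up sentences (the example strings below are purely illustrative, not taken from the dataset):

print(preprocess("I can't go, it's too late!"))        # -> i can not go it is too late
print(preprocess_ita("Non c'è più tempo (davvero)."))  # -> non cè più tempo davvero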
data['english'] = data['english'].apply(preprocess)
data['italian'] = data['italian'].apply(preprocess_ita)
data.head()
output:
ita_lengths = data['italian'].str.split().apply(len)
eng_lengths = data['english'].str.split().apply(len)

for i in range(0, 101, 10):
    print(i, np.percentile(ita_lengths, i))
for i in range(90, 101):
    print(i, np.percentile(ita_lengths, i))
for i in [99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.8, 99.9, 100]:
    print(i, np.percentile(ita_lengths, i))
output:
0 1.0
10 3.0
20 4.0
30 4.0
40 5.0
50 5.0
60 6.0
70 6.0
80 7.0
90 8.0
100 92.0
90 8.0
91 8.0
92 8.0
93 9.0
94 9.0
95 9.0
96 9.0
97 10.0
98 11.0
99 12.0
100 92.0
99.1 12.0
99.2 12.0
99.3 12.0
99.4 13.0
99.5 13.0
99.6 14.0
99.7 15.0
99.8 16.0
99.9 20.0
100 92.0
for i in range(0, 101, 10):
    print(i, np.percentile(eng_lengths, i))
for i in range(90, 101):
    print(i, np.percentile(eng_lengths, i))
for i in [99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.8, 99.9, 100]:
    print(i, np.percentile(eng_lengths, i))
If you observe the values, 99.9% of the data points have a length below 20, so we keep only the sentences with fewer than 20 words.
In order to do teacher forcing while training seq2seq models, let's create two new columns: one with a <start> token at the beginning of the sentence and the other with an <end> token at the end of the sequence. For example, 'tom is here' becomes english_inp '<start> tom is here' and english_out 'tom is here <end>'.
data['italian_len'] = data['italian'].str.split().apply(len)
data = data[data['italian_len'] < 20]
data['english_len'] = data['english'].str.split().apply(len)
data = data[data['english_len'] < 20]
data['english_inp'] = '<start> ' + data['english'].astype(str)
data['english_out'] = data['english'].astype(str) + ' <end>'
data = data.drop(['english','italian_len','english_len'], axis=1)
# only for the first sentence we add an <end> token (done below, after the split) so that the tokenizer learns <end>
data.head()
output:
data.sample(10)
output:
Getting train and test
from sklearn.model_selection import train_test_split
train, validation = train_test_split(data, test_size=0.2)
print(train.shape, validation.shape)
# for one sentence we will be adding an <end> token so that the tokenizer learns the word <end>
# with this we can use only one tokenizer for both the decoder input and the decoder output
train.at[train.index[0], 'english_inp'] = str(train.iloc[0]['english_inp']) + ' <end>'
train.at[train.index[0], 'english_out'] = str(train.iloc[0]['english_out']) + ' <end>'
output:
(272932, 3) (68234, 3)
train.head()
output:
validation.head()
output:
ita_lengths = train['italian'].str.split().apply(len)
eng_lengths = train['english_inp'].str.split().apply(len)
import seaborn as sns
sns.kdeplot(ita_lengths)
plt.show()
sns.kdeplot(eng_lengths)
plt.show()
output:
Creating Tokenizer on the train data and learning vocabulary
Note that we fit the tokenizer only on the train data. Also check the filters for English: we need to remove the symbols < and > from the default filter list so that the <start> and <end> tokens are preserved.
tknizer_ita = Tokenizer()
tknizer_ita.fit_on_texts(train['italian'].values)
tknizer_eng = Tokenizer(filters='!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n')
tknizer_eng.fit_on_texts(train['english_inp'].values)
vocab_size_eng=len(tknizer_eng.word_index.keys())
print(vocab_size_eng)
vocab_size_ita=len(tknizer_ita.word_index.keys())
print(vocab_size_ita)
output:
12817
26118
tknizer_eng.word_index['<start>'], tknizer_eng.word_index['<end>']
output:
(1, 10104)
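A quick way to confirm that the modified filter keeps the special tokens is to tokenize a small made-up sentence and check that the first and last ids match the <start>/<end> indices printed above:

print(tknizer_eng.texts_to_sequences(['<start> tom is here <end>']))  # first id should be 1, last id 10104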
def grader_1(data):
    shape_value = data.shape == (340044, 3)

    tknizer = Tokenizer(char_level=True)
    tknizer.fit_on_texts(data['italian'].values)
    ita_chars = tknizer.word_index.keys()
    diff_chars_ita = set(ita_chars) - set([' ', 't', 'a', 'o', 'r', 'e', 's', 'i', 'n', 'l', 'c', 'm', 'u', 'd', 'p', 'v', 'h', 'g', 'b', 'f', 'è', 'q', 'z', 'ò', 'à', 'y', 'é', 'ì', 'ù', 'k', 'w', '0', 'j', '1', '3', '2', 'x', '9', '5', '8', '4', '6', '7', 'á', 'ñ', 'ê', 'ü', 'ō', 'î', 'ö', 'ú', 'º'])

    tknizer = Tokenizer(char_level=True)
    tknizer.fit_on_texts(data['english_inp'].values)
    eng_chars = tknizer.word_index.keys()
    diff_chars_eng = set(eng_chars) - set(['<', '>', ' ', 'e', 'o', 't', 'i', 'a', 'n', 's', 'h', 'r', 'l', 'd', 'm', 'y', 'u', 'w', 'g', 'c', 'p', 'f', 'b', 'k', 'v', 'j', 'x', 'z', 'q', '0', '1', '3', '2', '9', '5', '8', '6', '4', '7'])

    unique_char_value = (len(diff_chars_eng) == 0) and (len(diff_chars_ita) == 0)
    return unique_char_value and shape_value

grader_1(data)
Creating embeddings for english sentences
embeddings_index = dict()
f = open('glove.6B.100d.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

embedding_matrix = np.zeros((vocab_size_eng+1, 100))
for word, i in tknizer_eng.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
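Optionally, you can check how much of the English vocabulary is actually covered by the pretrained vectors (words that are missing simply keep the all-zeros row):

found = sum(1 for word in tknizer_eng.word_index if word in embeddings_index)
print('{} / {} English words have a GloVe vector'.format(found, vocab_size_eng))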
Implement custom encoder decoder
class Encoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embedding_dim, input_length, enc_units):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.input_length = input_length
        self.enc_units = enc_units
        self.lstm_output = 0
        self.lstm_state_h = 0
        self.lstm_state_c = 0

    def build(self, input_shape):
        self.embedding = Embedding(input_dim=self.vocab_size, output_dim=self.embedding_dim, input_length=self.input_length,
                                   mask_zero=True, name="embedding_layer_encoder")
        self.lstm = LSTM(self.enc_units, return_state=True, return_sequences=True, name="Encoder_LSTM")

    def call(self, input_sentances, training=True):
        input_embedd = self.embedding(input_sentances)
        self.lstm_output, self.lstm_state_h, self.lstm_state_c = self.lstm(input_embedd)
        return self.lstm_output, self.lstm_state_h, self.lstm_state_c

    def get_states(self):
        return self.lstm_state_h, self.lstm_state_c
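Before wiring the full model, a quick shape check on a dummy batch (2 sentences of 20 token ids each) helps confirm the Encoder returns what we expect:

enc = Encoder(vocab_size=vocab_size_ita+1, embedding_dim=50, input_length=20, enc_units=256)
enc_out, enc_h, enc_c = enc(tf.ones((2, 20), dtype=tf.int32))
print(enc_out.shape, enc_h.shape, enc_c.shape)   # expected: (2, 20, 256) (2, 256) (2, 256)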
class Decoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embedding_dim, input_length, dec_units):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = 100
        self.dec_units = dec_units
        self.input_length = input_length
        # we are using the pretrained embedding_matrix and not training the embedding layer
        self.embedding = Embedding(input_dim=self.vocab_size, output_dim=self.embedding_dim, input_length=self.input_length,
                                   mask_zero=True, name="embedding_layer_decoder", weights=[embedding_matrix], trainable=False)
        self.lstm = LSTM(self.dec_units, return_sequences=True, return_state=True, name="Decoder_LSTM")

    def call(self, target_sentances, state_h, state_c):
        # the decoder LSTM is initialized with the encoder's final states
        target_embedd = self.embedding(target_sentances)
        lstm_output, _, _ = self.lstm(target_embedd, initial_state=[state_h, state_c])
        return lstm_output
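Similarly, the Decoder can be smoke-tested by feeding it dummy English token ids together with the encoder states from the check above:

dec = Decoder(vocab_size=vocab_size_eng+1, embedding_dim=100, input_length=20, dec_units=256)
dec_out = dec(tf.ones((2, 20), dtype=tf.int32), enc_h, enc_c)
print(dec_out.shape)   # expected: (2, 20, 256)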
Creating data pipeline
class Dataset:
    def __init__(self, data, tknizer_ita, tknizer_eng, max_len):
        self.encoder_inps = data['italian'].values
        self.decoder_inps = data['english_inp'].values
        self.decoder_outs = data['english_out'].values
        self.tknizer_eng = tknizer_eng
        self.tknizer_ita = tknizer_ita
        self.max_len = max_len

    def __getitem__(self, i):
        self.encoder_seq = self.tknizer_ita.texts_to_sequences([self.encoder_inps[i]])  # need to pass a list of values
        self.decoder_inp_seq = self.tknizer_eng.texts_to_sequences([self.decoder_inps[i]])
        self.decoder_out_seq = self.tknizer_eng.texts_to_sequences([self.decoder_outs[i]])

        self.encoder_seq = pad_sequences(self.encoder_seq, maxlen=self.max_len, dtype='int32', padding='post')
        self.decoder_inp_seq = pad_sequences(self.decoder_inp_seq, maxlen=self.max_len, dtype='int32', padding='post')
        self.decoder_out_seq = pad_sequences(self.decoder_out_seq, maxlen=self.max_len, dtype='int32', padding='post')
        return self.encoder_seq, self.decoder_inp_seq, self.decoder_out_seq

    def __len__(self):  # your model.fit_gen requires this function
        return len(self.encoder_inps)
class Dataloder(tf.keras.utils.Sequence):
    def __init__(self, dataset, batch_size=1):
        self.dataset = dataset
        self.batch_size = batch_size
        self.indexes = np.arange(len(self.dataset.encoder_inps))

    def __getitem__(self, i):
        start = i * self.batch_size
        stop = (i + 1) * self.batch_size
        data = []
        for j in range(start, stop):
            data.append(self.dataset[j])
        batch = [np.squeeze(np.stack(samples, axis=1), axis=0) for samples in zip(*data)]
        # we are creating data like ([italian, english_inp], english_out); these are already converted into sequences
        return tuple([[batch[0], batch[1]], batch[2]])

    def __len__(self):  # your model.fit_gen requires this function
        return len(self.indexes) // self.batch_size

    def on_epoch_end(self):
        self.indexes = np.random.permutation(self.indexes)
train_dataset = Dataset(train, tknizer_ita, tknizer_eng, 20)
test_dataset = Dataset(validation, tknizer_ita, tknizer_eng, 20)
train_dataloader = Dataloder(train_dataset, batch_size=1024)
test_dataloader = Dataloder(test_dataset, batch_size=1024)
print(train_dataloader[0][0][0].shape, train_dataloader[0][0][1].shape, train_dataloader[0][1].shape)
output:
(1024, 20) (1024, 20) (1024, 20)
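For a single example (before batching), Dataset.__getitem__ returns (1, 20) arrays, because texts_to_sequences receives a one-element list. A quick peek:

enc_seq, dec_inp_seq, dec_out_seq = train_dataset[0]
print(enc_seq.shape, dec_inp_seq.shape, dec_out_seq.shape)   # expected: (1, 20) (1, 20) (1, 20)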
# this is the same model we have given in the other reference notebook
class MyModel(Model):
    def __init__(self, encoder_inputs_length, decoder_inputs_length, output_vocab_size):
        super().__init__()  # https://stackoverflow.com/a/27134600/4084039
        self.encoder = Encoder(vocab_size=vocab_size_ita+1, embedding_dim=50, input_length=encoder_inputs_length, enc_units=256)
        self.decoder = Decoder(vocab_size=vocab_size_eng+1, embedding_dim=100, input_length=decoder_inputs_length, dec_units=256)
        self.dense = Dense(output_vocab_size, activation='softmax')

    def call(self, data):
        input, output = data[0], data[1]
        encoder_output, encoder_h, encoder_c = self.encoder(input)
        decoder_output = self.decoder(output, encoder_h, encoder_c)
        output = self.dense(decoder_output)
        return output
Model training
# output_vocab_size is vocab_size_eng+1 so that the padding index 0 and every word id get an output unit
model = MyModel(encoder_inputs_length=20, decoder_inputs_length=20, output_vocab_size=vocab_size_eng+1)
optimizer = tf.keras.optimizers.Adam()
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')

train_steps = train.shape[0]//1024
valid_steps = validation.shape[0]//1024
model.fit_generator(train_dataloader, steps_per_epoch=train_steps, epochs=50, validation_data=test_dataloader, validation_steps=valid_steps)
model.summary()
output:
WARNING:tensorflow:From <ipython-input-25-40a53cceada8>:6: Model.fit_generator (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version. Instructions for updating: Please use Model.fit, which supports generators.
Epoch 1/50
266/266 [==============================] - 158s 595ms/step - loss: 1.8152 - val_loss: 1.6080
Epoch 2/50
266/266 [==============================] - 159s 597ms/step - loss: 1.5121 - val_loss: 1.4234
Epoch 3/50
266/266 [==============================] - 160s 601ms/step - loss: 1.3524 - val_loss: 1.2738
Epoch 4/50
266/266 [==============================] - 160s 602ms/step - loss: 1.2180 - val_loss: 1.1595
Epoch 5/50
266/266 [==============================] - 160s 602ms/step - loss: 1.1174 - val_loss: 1.0640
Epoch 6/50
266/266 [==============================] - 160s 602ms/step - loss: 1.0243 - val_loss: 0.9736
Epoch 7/50
266/266 [==============================] - 160s 603ms/step - loss: 0.9383 - val_loss: 0.8900
...
...
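The task also asks for TensorBoard plots of the graph, scores and gradient histograms. A minimal sketch, assuming you rerun training with the non-deprecated model.fit and a log directory of your choice ('logs' below): the built-in callback writes the graph, the loss curves and weight histograms; histograms of the gradients themselves would need custom tf.summary writes inside a training loop or callback.

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir='logs', histogram_freq=1, write_graph=True)
model.fit(train_dataloader, steps_per_epoch=train_steps, epochs=50,
          validation_data=test_dataloader, validation_steps=valid_steps,
          callbacks=[tensorboard_callback])
# then launch TensorBoard, e.g. in a notebook: %load_ext tensorboard  followed by  %tensorboard --logdir logs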
# Create an object of your custom model.
# Compile and train your model on the dot scoring function.
# Visualize a few sentences randomly from the test data.
# Predict on 1000 random sentences from the test data and calculate the average BLEU score of these sentences (a sketch follows the sample example below).
# https://www.nltk.org/_modules/nltk/translate/bleu_score.html
# Sample example
import nltk.translate.bleu_score as bleu
reference = ['i am groot'.split(),]   # the original
translation = 'it is ship'.split()    # translated using the model
print('BLEU score: {}'.format(bleu.sentence_bleu(reference, translation)))
output:
BLEU score: 0
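Building on the sample above, here is a minimal greedy-decoding sketch plus an average BLEU over a random sample of validation sentences. The helper name predict_sentence is our own; it only reuses the layers, tokenizers and max length (20) defined earlier, and it re-runs the decoder over the growing prefix at every step, which is slow but keeps the Decoder interface unchanged.

def predict_sentence(ita_sentence, max_len=20):
    # encode the (already preprocessed) Italian sentence
    seq = tknizer_ita.texts_to_sequences([ita_sentence])
    seq = pad_sequences(seq, maxlen=max_len, dtype='int32', padding='post')
    _, state_h, state_c = model.encoder(seq)

    # greedy decode: start from <start> and keep appending the argmax word
    dec_ids = [tknizer_eng.word_index['<start>']]
    predicted = []
    for _ in range(max_len):
        dec_out = model.decoder(np.array([dec_ids]), state_h, state_c)
        probs = model.dense(dec_out)                      # shape (1, t, vocab)
        next_id = int(np.argmax(probs[0, -1, :]))
        next_word = tknizer_eng.index_word.get(next_id, '<end>')
        if next_word == '<end>':
            break
        predicted.append(next_word)
        dec_ids.append(next_id)
    return ' '.join(predicted)

# average BLEU over 1000 random validation sentences (with smoothing to avoid zero scores)
smooth = bleu.SmoothingFunction().method1
sample_rows = validation.sample(1000)
scores = []
for _, row in sample_rows.iterrows():
    prediction = predict_sentence(row['italian']).split()
    reference = [row['english_out'].replace('<end>', '').split()]
    scores.append(bleu.sentence_bleu(reference, prediction, smoothing_function=smooth))
print('average BLEU on the sample:', np.mean(scores))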
model.save_weights('model.h5')