top of page

Predict Image Captions Using Combination CNN and LSTM Models - Sample Paper

Problem Statement

Image captioning is an application of one to many RNN’s and CNN’s. For a given input image model predicts the caption based on the vocabulary of train data. Using Flickr8K data, predict image captions using combination CNN and LSTM models. Also, tune the algorithm and explore the learnings that have been done by the Inception model training.

Dataset Description

Dataset: Flickr8k_Dataset Flickr8k_Dataset: It contains a total of 8092 images in JPEG format with different shapes and sizes. Out of which 6000 are used for training, 1000 for tests and 1000 for development.

Flickr8k_text : Contains text files describing train_set ,test_set. Flickr8k.token.txt contains 5 captions for each image i.e. total 40460 captions.

Use this command to download the data:

:!wget --header="Host:" --header="User-Agent: Mozilla/5.0  (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)  Chrome/81.0.4044.92 Safari/537.36" --header="Accept:  text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8 ,application/signed-exchange;v=b3;q=0.9" --header="Accept-Language: en-US,en;q=0.9" - -header="Referer:" " SLJZv1uMI3QI3LNGmsRhl%2FsvlzAwskgkHXG2hiRQ9KwdeD5YiaJ6Ys3dTBIw7f3Kgwt5XJIb84DXrnc0vpS aqAXWt1Uj71TtSqqJkLUz9GE7ARraDkuo613ey5aYSbPuREeQhSTt426vyd%2FBZXJRvTQr1mjTjY0GXcT8CF sW0XGBw7%2FVeuMWqpeFLebeCHe7OxKBBaCyjLI%2B9cEdGi0tTTXAJZ%2BTJyZTw0AlVpSXQj7PT6gPcDDXx BTpHt50debULge9KjFTKMudpsJU%2F2U%2BkNIVPrDnP62C5ijsLXw2fcuCnYMylFYhK7U8eZgQg%3D%3D&re" -c -O  '' And extract the data using: !unzip 

Tasks to be performed:

.ipynb file 1

Question 1.1: Extract Image features from Flickr Data by using Inception V3 model with weights as Imagenet

Hint: new_input = model.input
hidden_layer = model.layers[-2].output
model_new = Model(new_input, hidden_layer)

Encode Function Hint: 
image = preprocess(image)
 temp_enc = model_new.predict(image)
 temp_enc = np.reshape(temp_enc, temp_enc.shape[1])

Problem Statement 2

Machine Translation is one of the widely used use-cases for the sequence to sequence learning models. They have helped bridge the gap between languages by building models that can comprehend and translate from one language to another. Using existing English-Hindi Translation data, build an encoderdecoder model to predict Hindi sentences from a given English sentence.

Dataset Description

Dataset contains two major columns - english_sentence & hindi_sentence,which are given translations. There are many sources available for this data - but you have to use "ted" for this assignment. Use this command to download the data:

!wget --header="Host:" --header="User-Agent: Mozilla/5.0  (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)  Chrome/81.0.4044.138 Safari/537.36" --header="Accept:  text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8 ,application/signed-exchange;v=b3;q=0.9" --header="Accept-Language: en-US,en;q=0.9" - -header="Referer:" " V1hBlxa901ghZ8bKmasWC9msw%2BHmyAooohSK4f0y2hVLU9pmhnEq7%2FZ3ncWZJTjWd4NK0Hhygpk43fkAv pvNhTQcAiExtojT%2FRfXrRR6ZR%2FzEyqH1nh1ywqcLnTqJRwjqzV0PbCgnmNIczOO533FVgkZJwZk59kNwJ uIOU98NIA1zhSxd0q%2ByTGDATwNFNmYalISRyCCFlUtsyjP%2Fk8zUHeQ2gU5lFGLyQeE7584D1uzD6klWV6 Ng%2BhJzIWaq3laUZNOuoD7Sm4dxy2t4Ip%2Fc%2BPLIC5ZdZHU%2F9I8LGsbfTJUcdujy72%2BN1hBo2Ts3r w%3D%3D&" -c -O '' 

Tasks to be performed:

.ipynb file 2

Question 2.1: Build a Encoder & Decoder Model using Keras with Following properties:


- Input Layer (that takes the input processed data in notebook)

- Embedding Layer with num_encoder_tokens, latent_dim (Given in Notebook)

- LSTM Layer - Combination of Embedded & LSTM Layer:

- Hint

encoder_outputs, state_h, state_c = encoder_lstm(enc_emb) 
# We discard `encoder_outputs` and only keep the states. 
encoder_states = [state_h, state_c]


- Input Layer

- Embedding Layer with num_decoder_tokens, latent_dim

- LSTM Decoder Layer

- Hint:

dec_emb = dec_emb_layer(decoder_inputs) 
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)

- Combine both layer

- Hint:

decoder_outputs, _, _ = decoder_lstm(dec_emb,  initial_state=encoder_states) decoder_dense = Dense(num_decoder_tokens, activation='softmax') decoder_outputs = decoder_dense(decoder_outputs)

Combine Encoder & Decoder


Model([encoder_inputs, decoder_inputs], decoder_outputs)

Thanks For Visit Here!

If you need solution of above problem or need any help related to Python Machine Learning, R Programming, MATLAB Machine Learning then contact Us and get instant help with an affordable price.

Send your requirement Details at:


bottom of page