
Designing, Implementing and Querying a NoSQL Database Using Hadoop MapReduce and DASK | Realcode4you

This Task relates to the following Learning Outcomes:

  • Apply techniques for storing large volumes of data.

  • Apply Map-reduce techniques to a number of problems that involve Big Data.

Dataset

Twitter serves many objects as JSON, including Tweets and Users. These objects all encapsulate core attributes that describe the object. Each Tweet has an author, a message, a unique ID, a timestamp of when it was posted, and sometimes geo metadata shared by the user. Each User has a Twitter name, an ID, a follower count, and most often an account bio. With each Tweet, Twitter generates 'entity' objects: arrays of common Tweet contents such as hashtags, mentions, media, and links. If there are links, the JSON payload can also provide metadata such as the fully unwound URL and the webpage’s title and description.


So, in addition to the text content itself, a Tweet can have over 140 attributes associated with it. The following example JSON illustrates the structure of these objects and some of their attributes:


{
  "created_at": "Thu Apr 06 15:24:15 +0000 2017",
  "id_str": "850006245121695744",
  "text": "1\/ Today we\u2019re sharing our vision for the future of the Twitter API platform!\nhttps:\/\/t.co\/XweGngmxlP",
  "user": {
    "id": 2244994945,
    "name": "Twitter Dev",
    "screen_name": "TwitterDev",
    "location": "Internet",
    "url": "https:\/\/dev.twitter.com\/",
    "description": "Your official source for Twitter Platform news, updates & events. Need technical help? Visit https:\/\/twittercommunity.com\/ \u2328\ufe0f #TapIntoTwitter"
  },
  "place": {},
  "entities": {
    "hashtags": [],
    "urls": [
      {
        "url": "https:\/\/t.co\/XweGngmxlP",
        "unwound": {
          "url": "https:\/\/cards.twitter.com\/cards\/18ce53wgo4h\/3xo1c",
          "title": "Building the Future of the Twitter API Platform"
        }
      }
    ],
    "user_mentions": []
  }
}


Task 0:

  • Data Cleaning and Cleansing: To prepare the 10k Tweet dataset (available on iLearn) in the above schema format, you will have to perform multiple curation tasks; a minimal loading-and-cleansing sketch follows below.
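The sketch assumes tweets.zip contains newline-delimited JSON and that a local MongoDB instance is running; the database and collection names (twitter, tweets) are illustrative, not prescribed by the brief.

import json
import zipfile
from pymongo import MongoClient

# Fields retained from each raw record, per the schema shown above.
KEEP = ("created_at", "id_str", "text", "user", "place", "entities")

collection = MongoClient("mongodb://localhost:27017")["twitter"]["tweets"]

with zipfile.ZipFile("tweets.zip") as zf:
    for name in zf.namelist():
        with zf.open(name) as fh:
            for line in fh:
                if not line.strip():
                    continue  # skip blank lines
                try:
                    raw = json.loads(line)
                except json.JSONDecodeError:
                    continue  # drop malformed records during cleansing
                doc = {k: raw.get(k) for k in KEEP}
                if doc["id_str"]:  # require a usable primary key
                    collection.insert_one(doc)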


Task 1:

  • Dataset: 10,000 Tweets (“tweets.zip” on iLearn)

  • Tasks: In Python, create a program that retrieves the Tweet Dataset from MongoDB. For each tweet in the dataset:

1. Extract keywords from its text, store them as a new name/value pair in comma-separated value (CSV) format, and update the original tweet in the MongoDB database with the new pair.

2. Extract named entities (of type Person, Organization, and Location) from its text, store them as a new name/value pair in CSV format, and update the original tweet in the MongoDB database with the new pair.

3. Extract the topic from its text, store it as a new name/value pair in CSV format, and update the original tweet in the MongoDB database with the new pair.

4. Extract the sentiment from its text, store it as a new name/value pair in CSV format, and update the original tweet in the MongoDB database with the new pair.


Note: you will need to create short documentation in which you briefly describe your implementation. You can use existing packages to extract the keywords, named entities, topic, and sentiment from the text of each Tweet; a minimal sketch of the update loop follows below.
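As one illustration, the sketch covers steps 2 and 4 with spaCy and TextBlob; steps 1 and 3 follow the same read-extract-update pattern with a different extractor (e.g. RAKE for keywords, an LDA model for topics). The field names named_entities and sentiment, and the choice of packages, are assumptions.

import spacy
from textblob import TextBlob
from pymongo import MongoClient

nlp = spacy.load("en_core_web_sm")
collection = MongoClient("mongodb://localhost:27017")["twitter"]["tweets"]

# Map spaCy labels to the three requested entity types.
WANTED = {"PERSON": "Person", "ORG": "Organization", "GPE": "Location"}

for tweet in collection.find({}, {"text": 1}):
    text = tweet.get("text") or ""
    # Named entities of the requested types, stored in CSV form.
    entities = ",".join(e.text for e in nlp(text).ents if e.label_ in WANTED)
    # Coarse sentiment label from TextBlob's polarity in [-1, 1].
    polarity = TextBlob(text).sentiment.polarity
    sentiment = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
    collection.update_one(
        {"_id": tweet["_id"]},
        {"$set": {"named_entities": entities, "sentiment": sentiment}},
    )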


Task 2:

  • Dataset: 10,000 Tweets (“tweets.zip” on iLearn)

  • MapReduce: Create a program that calculates the frequency of each word occurring in the text of the tweets. Create short documentation in which you briefly describe your implementation (a minimal word-count sketch follows this list):

o What to write in the mapper(s)? Include a flowchart and pseudocode.

o What to write in the reducer(s)? Include a flowchart and pseudocode.
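A minimal word-count sketch using the mrjob library (one possible harness; Hadoop Streaming with separate mapper.py and reducer.py scripts works equally well), assuming one tweet JSON object per input line; the tokenization regex is an assumption.

import json
import re
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        # Parse one tweet per line; skip malformed records.
        try:
            text = json.loads(line).get("text", "")
        except json.JSONDecodeError:
            return
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            yield word, 1  # emit (word, 1) for every token

    def combiner(self, word, counts):
        yield word, sum(counts)  # local pre-aggregation

    def reducer(self, word, counts):
        yield word, sum(counts)  # total frequency per word

if __name__ == "__main__":
    WordCount.run()

Run locally with python wordcount.py tweets.json, or against a cluster with the -r hadoop runner.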


Task 3:

  • Dataset: 10,000 Tweets (“tweets.zip” on iLearn)

  • MapReduce: Create a program that determines the total number of tweets for a given list of cities in Australia. Create short documentation in which you briefly describe your implementation (a minimal sketch follows this list):

o What to write in the mapper(s)? Include a flowchart and pseudocode.

o What to write in the reducer(s)? Include a flowchart and pseudocode.
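A minimal mrjob sketch, assuming the city can be read from the tweet's place.full_name field (e.g. "Sydney, New South Wales"); the hard-coded city list is illustrative and would normally be supplied as configuration.

import json
from mrjob.job import MRJob

CITIES = {"Sydney", "Melbourne", "Brisbane", "Perth", "Adelaide"}

class TweetsPerCity(MRJob):
    def mapper(self, _, line):
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            return
        place = tweet.get("place") or {}
        # Assumed layout: "City, State" in place.full_name.
        city = place.get("full_name", "").split(",")[0].strip()
        if city in CITIES:
            yield city, 1

    def reducer(self, city, counts):
        yield city, sum(counts)  # total tweets per city

if __name__ == "__main__":
    TweetsPerCity.run()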


Task 4:

  • Dataset: 10,000 Tweets (“tweets.zip” on iLearn)

  • MapReduce: Create a program that uses MapReduce to sort the 10k dataset of tweets by their tweet ID. You will first need to import the tweet dataset and write a MapReduce program that performs a key-value swap to sort the tweets by ID; the program is then run to generate the sorted output. Note that merge sort is the default sorting behavior of MapReduce's shuffle phase. Create short documentation in which you briefly describe your implementation (a key-swap sketch is given after this list):

o What to write in the mapper(s)? Include a flowchart and pseudocode.

o What to write in the reducer(s)? Include a flowchart and pseudocode.

  • You are required to assess the performance of the MapReduce program and compare it with a program that uses merge sort to sort the tweets by ID without MapReduce (a baseline sketch also follows below). By completing this assignment, you will gain a better understanding of how MapReduce can be used to process and analyze large datasets in a distributed computing environment. Create short documentation in which you briefly describe your implementation.
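A minimal sketch of the key-value swap using the mrjob library (one possible harness), assuming one tweet JSON object per input line. The mapper promotes the tweet ID to the key so that the shuffle's built-in merge sort orders the records, and the reducer is a pass-through; zero-padding the ID is an assumption made so the shuffle's string comparison agrees with numeric order.

import json
from mrjob.job import MRJob

class SortByTweetId(MRJob):
    def mapper(self, _, line):
        # Parse one tweet per line; skip malformed records.
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            return
        # Key-value swap: the ID becomes the key. Zero-pad so the
        # shuffle's lexicographic sort matches numeric order.
        yield "%025d" % int(tweet["id_str"]), line.strip()

    def reducer(self, tweet_id, lines):
        # Keys arrive already merge-sorted by the framework.
        for line in lines:
            yield tweet_id, line

if __name__ == "__main__":
    SortByTweetId.run()

For the non-MapReduce baseline, a plain single-machine merge sort timed with time.perf_counter can be set against the job's wall-clock runtime; the sketch below assumes the dataset has been unzipped to a newline-delimited tweets.json (the file name is an assumption).

import json
import time

def merge_sort(items):
    # Classic top-down merge sort on (id, record) pairs.
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] <= right[j][0]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

with open("tweets.json") as fh:
    pairs = [(int(json.loads(l)["id_str"]), l) for l in fh if l.strip()]

start = time.perf_counter()
ordered = merge_sort(pairs)
print(f"sorted {len(ordered)} tweets in {time.perf_counter() - start:.3f}s")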


Task 5:

  • Dataset: 10,000 Tweets (“tweets.zip” on iLearn)

  • Develop a DASK program that implements the TF-IDF algorithm for each keyword extracted from the text of each tweet in a large Tweets dataset. You are required to extract keywords from the text of each tweet (you can reuse the output from Task 1) and calculate the TF-IDF score for each keyword (a sketch follows this list).

  • The output should include the top N keywords with their corresponding TF-IDF scores. The value of N should be configurable.

  • You are required to provide a report documenting your approach, the design of your program, and the results obtained. The report should also include a discussion on the limitations of your approach and any future improvements that can be made.
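A minimal dask.bag sketch, assuming the keyword field written in Task 1 has been exported as newline-delimited JSON files under tweets_enriched/; the path, the keywords field name, and the unsmoothed IDF are all assumptions.

import json
import math
from collections import Counter
import dask.bag as db

N = 20  # top-N keywords to report; configurable

bag = db.read_text("tweets_enriched/*.json").map(json.loads)

# (doc_id, [keywords]) pairs, dropping tweets with no keywords.
docs = (bag.map(lambda t: (t["id_str"],
                           [k for k in (t.get("keywords") or "").split(",") if k]))
           .filter(lambda p: p[1])
           .persist())

n_docs = docs.count().compute()

# Document frequency: in how many tweets each keyword appears.
df = dict(docs.map(lambda p: set(p[1])).flatten().frequencies().compute())

def tfidf(pair):
    _, kws = pair
    total = len(kws)
    return [(kw, (c / total) * math.log(n_docs / df[kw]))
            for kw, c in Counter(kws).items()]

# Score every (keyword, tweet) pair and keep the global top N.
top = docs.map(tfidf).flatten().topk(N, key=lambda kv: kv[1]).compute()
for kw, score in top:
    print(f"{kw}\t{score:.4f}")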


To get the solution to this sample paper, you can send your request at:

