Twitter Data Analysis Using Map Reduce | Sample Paper

realcode4you
Dec 4, 2021

Datasets

For this assignment, you will use Twitter datasets that include a collection of tweets and a graph describing the follower/following structure of the network. We have preprocessed and selected a subset of the dataset to be used in this assignment. The data is available in Vocareum, under the data/ folder.

File formats:

There are two files used in this assignment: tweets.tsv and edges.csv. Here is a description of the format of both files:

tweets.tsv: This is a tab-separated file containing 1,000,000 lines. Each line represents a tweet and has the form:

UserID\tTweetID\tTweet\tCreatedAt

There is no header in the file.

If you are interested in the full version of this dataset, it is the Twitter CIKM 2010 dataset (https://archive.org/details/twitter_cikm_2010), a collection of scraped public Twitter updates used to study the geolocation data related to tweeting.

edges.csv: This is a comma-separated file containing 1,000,000 lines. Each line represents a directed edge in the graph, that is, a pair of user IDs:

UserID,FollowingUserID

This indicates that UserID follows FollowingUserID, that is, there is a directed edge from UserID to FollowingUserID. The IDs are integers and should be treated as such (e.g., when comparing values), but do not assume that users are sequentially numbered; there may be gaps in the sequence of IDs. There is no header in the file.
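The two line formats above can be read with a few lines of Python. The following is a minimal sketch, not part of the original assignment: the function names parse_tweet and parse_edge are hypothetical, and the tweet fields are left as strings because the handout only states that the IDs in edges.csv are integers.

```python
def parse_tweet(line):
    """Split one tweets.tsv line into (user_id, tweet_id, text, created_at).

    The IDs are kept as strings here; that is an assumption of this sketch.
    """
    # Take the two leading fields from the left and the timestamp from the
    # right, so a stray tab inside the tweet text does not shift the fields.
    user_id, tweet_id, rest = line.rstrip("\r\n").split("\t", 2)
    text, created_at = rest.rsplit("\t", 1)
    return user_id, tweet_id, text, created_at


def parse_edge(line):
    """Split one edges.csv line into (user_id, following_user_id) as integers."""
    user_id, following_id = line.strip().split(",")
    return int(user_id), int(following_id)
```

Converting the edge IDs to int matters when comparing them: as integers 9 is smaller than 10, whereas the strings "9" and "10" compare the other way round.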

1.1 Hashtags

Objective: Parse the Twitter data (tweets.tsv) to (1) extract hashtags (i.e., anything starting with “#” followed by a sequence of alphanumeric characters), (2) convert them to lower case, (3) count the occurrences of each hashtag, and (4) return the top 10 hashtags (i.e., those with the largest number of occurrences).

(a) (Code) Write a MapReduce approach in Python to accomplish this task. Make sure to include a mapper function, a reducer function, and an execute function as we discussed in class (see lecture 22.1). Your code must be submitted in Vocareum.

(b) Report the wall clock runtime (in seconds) of your program when applied to the file tweets.tsv. Use the function time() from the Python time package.

(c) Report the top ten hashtags with their counts, listed in decreasing order of counts.

(d) (Code) Write a command line program to accomplish the same task, using e.g., grep, tr, sed, awk, sort. This must be submitted in Vocareum.

(e) Run your command line program in a shell script and report the wall clock runtime (in seconds) (with e.g., the unix command time).

(f) Discuss how the runtimes compare between the two approaches.
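As a starting point for part (a), here is a minimal in-memory sketch of the mapper/reducer/execute pattern. The framework from lecture 22.1 is not shown in this post, so the function signatures below are assumptions, as is the helper top_n; ties are broken alphabetically, which the assignment does not specify.

```python
import re
import time
from collections import defaultdict

# "#" followed by one or more alphanumeric characters, as in the objective.
HASHTAG_RE = re.compile(r"#[A-Za-z0-9]+")


def mapper(line):
    """Emit (hashtag, 1) for every hashtag in one tweets.tsv line."""
    fields = line.rstrip("\n").split("\t")
    text = fields[2] if len(fields) >= 3 else ""
    return [(tag.lower(), 1) for tag in HASHTAG_RE.findall(text)]


def reducer(key, values):
    """Sum the counts collected for one hashtag."""
    return key, sum(values)


def execute(lines, mapper, reducer):
    """Map every line, group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return [reducer(key, values) for key, values in groups.items()]


def top_n(counts, n=10):
    """Sort by decreasing count (ties alphabetically) and keep the first n."""
    return sorted(counts, key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    start = time.time()
    # The data/ folder is the location named in the assignment text.
    with open("data/tweets.tsv", encoding="utf-8", errors="replace") as handle:
        result = top_n(execute(handle, mapper, reducer))
    elapsed = time.time() - start
    for tag, count in result:
        print(tag, count)
    print(f"wall clock: {elapsed:.2f} s")
```

The timer starts before the file is opened, so the reported wall clock time covers loading as well as the map and reduce steps.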

1.2 User mentions

Objective: Parse the Twitter data (tweets.tsv) to (1) extract usernames (i.e., anything starting with “@” followed by a sequence of alphanumeric characters), (2) convert them to lower case, (3) count the occurrences of each username, and (4) return the top 10 (i.e., those with the largest number of occurrences).

(b) Report the wall clock runtime (in seconds) of your program when applied to the file tweets.tsv. The total runtime must include all steps, from loading the data to the final output. You can use the function time() from the Python time package.

(c) Report the top ten usernames with their counts, listed in decreasing order of counts.

(d) (Code) Write a command line program to accomplish the same task, using e.g., grep, tr, sed, awk, sort. This must be submitted in Vocareum.

(e) Run your command line program in a shell script and report the wall clock runtime (in seconds) (with e.g., the unix command time).

(f) Discuss how the runtimes compare between the two approaches.
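The requirement that the runtime include every step, from loading the data to the final output, can be illustrated with a small sketch. It is one possible variant rather than the expected solution: here the mapper pre-aggregates each line into a Counter and the reducer merges them, and the names run_timed and MENTION_RE are hypothetical.

```python
import re
import time
from collections import Counter

# "@" followed by alphanumeric characters, per the assignment's definition
# (underscores are therefore not treated as part of a username here).
MENTION_RE = re.compile(r"@[A-Za-z0-9]+")


def mapper(line):
    """Return a Counter of lower-cased mentions found in one tweets.tsv line."""
    fields = line.rstrip("\n").split("\t")
    text = fields[2] if len(fields) >= 3 else ""
    return Counter(m.lower() for m in MENTION_RE.findall(text))


def reducer(partial_counts):
    """Merge the per-line Counters into one global Counter."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def run_timed(path, n=10):
    """Time the whole pipeline: loading, map, reduce, and top-n selection."""
    start = time.time()
    with open(path, encoding="utf-8", errors="replace") as handle:
        totals = reducer(mapper(line) for line in handle)
    top = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    return top, time.time() - start
```

Calling run_timed("data/tweets.tsv") would return the top ten mentions together with the elapsed seconds; the path assumes the data/ folder mentioned in the assignment.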

2 Finding Reciprocal Followers

Objective: Process the follower network data (edges.csv) to determine reciprocal following relationships, i.e., pairs of users that mutually follow each other.

(b) Report the wall clock runtime of your program when applied to the file edges.csv. Use the function time() from the Python time package.

(c) Output the results in a text file to use in the next question. This will just be a subset of the original edges.csv file. Report the difference in size between the two versions of the graph with respect to number of unique nodes, and total number of edges.

(d) (Code) Write a command line program to accomplish the same task, using e.g., awk, sort, join. Make sure to include your program in the Vocareum submission.

(e) Run your command line program in a shell script and report the wall clock runtime with e.g., the unix command time.

(f) Discuss how the runtimes compare between the two approaches.
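One way to phrase the reciprocal-follower search as MapReduce is to key every directed edge by its unordered pair of IDs and keep the pair only if both directions show up. The sketch below assumes that formulation and an in-memory execute function; neither is taken from the lecture material, and graph_size is a hypothetical helper for the size comparison in part (c).

```python
from collections import defaultdict


def mapper(line):
    """Key each directed edge by its unordered pair; the value keeps the direction."""
    u, v = (int(x) for x in line.strip().split(","))
    return [((min(u, v), max(u, v)), (u, v))]


def reducer(pair, directed_edges):
    """Emit both directed edges when the pair was seen in both directions."""
    directions = set(directed_edges)  # a set also absorbs duplicated input lines
    u, v = pair
    if u != v and (u, v) in directions and (v, u) in directions:
        return [(u, v), (v, u)]
    return []


def execute(lines, mapper, reducer):
    """Map every line, group by key, and reduce the groups in sorted key order."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def graph_size(edges):
    """Return (number of unique nodes, number of edges) for a list of (u, v)."""
    nodes = {node for edge in edges for node in edge}
    return len(nodes), len(edges)
```

Because the IDs are converted to int before min and max are applied, the pair key follows numeric order rather than string order. Writing each returned (u, v) as a "u,v" line yields a file in the same format as edges.csv.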

3 Finding Friends of Friends

Objective: Use the symmetric follower graph you computed in the previous question. For each pair of friends (i.e., pair of users who mutually follow each other), find the number of friends they have in common, that is, for each pair of friends (A, B), count the number of friends of A who are also friends of B.

(a) (Code) Write a MapReduce approach in Python to accomplish this task. Make sure to include a mapper function, a reducer function, and an execute function as we discussed in class (see lecture 22.1). Include your code in Vocareum. Note: you will probably need two map/reduce functions to accomplish this task: one to identify the friends of each user, and another to find the friends they have in common.

(b) Report the top ten pairs of friends with the largest number of friends in common, together with that number. Within each pair, order the users by ID, that is, for a pair (u, v), u must be smaller than v. If two or more pairs are tied on the count, order them by their user ID pairs: if the pairs (245, 1023) and (125, 340) have the same number of common friends, (125, 340) should come before (245, 1023).
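The two-stage idea in the note to part (a) can be sketched as follows. This is an illustrative in-memory version under assumed function signatures, not the course's reference solution: stage 1 builds the friend set of each user, and stage 2 sends each friend set to every pair the user belongs to, so the reducer can intersect the two sets.

```python
from collections import defaultdict


def execute(records, mapper, reducer):
    """Tiny in-memory MapReduce driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in groups.items()]


# Stage 1: collect the friends of each user from the reciprocal edges.
def friends_mapper(edge):
    u, v = edge
    return [(u, v), (v, u)]


def friends_reducer(user, friends):
    return user, frozenset(friends) - {user}


# Stage 2: for each pair of friends, intersect their two friend sets.
def common_mapper(record):
    user, friends = record
    return [((min(user, f), max(user, f)), friends) for f in friends]


def common_reducer(pair, friend_sets):
    if len(friend_sets) != 2:  # only happens if the input is not symmetric
        return pair, 0
    return pair, len(friend_sets[0] & friend_sets[1])


def top_pairs(edges, n=10):
    """Run both stages, then sort by decreasing count and increasing ID pair."""
    friend_lists = execute(edges, friends_mapper, friends_reducer)
    counts = execute(friend_lists, common_mapper, common_reducer)
    return sorted(counts, key=lambda item: (-item[1], item[0]))[:n]
```

The sort key (-count, pair) gives the tie-breaking order described in part (b), since tuples of integers compare element by element. Shipping whole friend sets in stage 2 is simple but memory-hungry for users with many friends, which may matter on the full graph.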

Hint To Start The Code

1 Project 3 - Code

For this project, you have to provide both Python code for the MapReduce version and command line programs for some of the questions.

All the Python code must be submitted by filling the appropriate cells in this notebook. Make sure to include all the necessary functions (reading the data, mapper, reducer, executer, anything else you do with the output of mapreduce). Remember to include the code used to time the execution. You must include all the code/steps you used to generate the answers in your report.

The command line code must be added to the appropriate shell script (either q1_1.sh, q1_2.sh, or q2.sh). For convenience, we included cells in this notebook to display the content of those files and execute them. But your answer must be added to those files, not to this notebook. You must include all the code/steps you used to generate the answers in your report. If you want to include any descriptions, assumptions, clarifications or any additional information for the TAs, you can add them as separate cells. If you want, you can change the cell type to Markdown (Cell > Cell Type > Markdown) to add formatting.

2 Q1 - Finding Trends

2.1 Q1.1 - Hashtags

2.1.1 Python code

Add your Python MapReduce code in this section. Make sure to include all necessary functions (i.e. mapper, reducer, executer) as seen in the lecture/labs. Make sure you are timing the execution of your code as well.

### YOUR CODE HERE ###

2.1.2 Command line version

Add your answer to the file q1_1.sh. The next two cells display the content of the file and execute it with bash (and time the execution with time).

# Show contents of the file 
!cat q1_1.sh

import subprocess 
result = subprocess.run(["/usr/bin/time", "-p", "bash", "q1_1.sh"],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(f"Return code: {result.returncode}") 
print("STDOUT:") 
print(result.stdout.decode()) 
print("STDERR:") 
print(result.stderr.decode())
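With the -p flag, time writes its report to standard error as three lines beginning with real, user, and sys, so the wall clock figure appears in the STDERR output of the cell above. A small helper can pull the numbers out; the name parse_time_p is hypothetical and the sketch assumes a period as the decimal separator.

```python
import re

# Matches the three report lines written by `time -p`, e.g. "real 1.25".
TIME_LINE_RE = re.compile(r"^(real|user|sys)\s+([0-9]+(?:\.[0-9]+)?)\s*$")


def parse_time_p(stderr_text):
    """Extract the real/user/sys seconds from `time -p` output.

    Lines that do not look like a timing report (for example messages
    printed by the script itself) are ignored.
    """
    timings = {}
    for line in stderr_text.splitlines():
        match = TIME_LINE_RE.match(line.strip())
        if match:
            timings[match.group(1)] = float(match.group(2))
    return timings
```

For example, parse_time_p(result.stderr.decode()).get("real") would give the wall clock seconds of the shell script run above, or None if no report was found.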

2.2 Q1.2 - Usernames

2.2.1 Python code

### YOUR CODE HERE ###

2.2.2 Command line version

Add your answer to the file q1_2.sh. The next two cells display the content of the file and execute it with bash (and time the execution with time).

# Show contents of the file 
!cat q1_2.sh

import subprocess 
result = subprocess.run(["/usr/bin/time", "-p", "bash", "q1_2.sh"],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(f"Return code: {result.returncode}") 
print("STDOUT:") 
print(result.stdout.decode()) 
print("STDERR:") 
print(result.stderr.decode())

3 Q2 - Finding Reciprocal Followers

3.0.1 Python code

### YOUR CODE HERE ###

3.0.2 Command line version

Add your answer to the file q2.sh. The next two cells display the content of the file and execute it with bash (and time the execution with time).

# Show contents of the file 
!cat q2.sh

import subprocess
result = subprocess.run(["/usr/bin/time", "-p", "bash", "q2.sh"],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(f"Return code: {result.returncode}") 
print("STDOUT:") 
print(result.stdout.decode()) 
print("STDERR:") 
print(result.stderr.decode())

4 Q3 - Finding Friends of Friends

4.0.1 Python code

Add your Python MapReduce code in this section. Make sure to include all necessary functions (i.e. mapper, reducer, executer) for each MapReduce operation.

### YOUR CODE HERE ###
