
Make a Clustering-Based Model Using a News Article Dataset | Sample Practice Set

Requirement

You are provided with a dataset, “NewsArticles.json”, containing unlabeled news articles on mixed topics: business, entertainment, politics, sports, and technology.

You are required to make a clustering-based model.

1. Perform K-Means clustering on the above dataset and find the Sum of Squared Errors (SSE).

2. Use PCA to reduce the dataset to about 100 dimensions, then perform K-Means clustering on the reduced dataset and find the Sum of Squared Errors (SSE).

3. Find the cluster containing the highest number of articles (before PCA).

4. Report that highest count (before PCA).

5. Find the cluster containing the highest number of articles (after PCA).

6. Report that highest count (after PCA).

7. Extract the top 50 words from each cluster in both cases and print the last word (the 50th) from the cluster you judge to contain entertainment news (before PCA).

8. Extract the top 50 words from each cluster in both cases and print the last word (the 50th) from the third cluster (after PCA).

Hint: In both cases above, use 5 clusters and compute the within-cluster Sum of Squared Errors.

***NOTE: 1. Do not use any NLP concepts here for any kind of cleansing or preprocessing.

2. Write the code only in the solution() function and do not pass any arguments to the function. For the predefined stub, refer to stub.py.***

Final Output Sample:

result=[150.20,90.23, 1, 34, 2, 130, 'musical','china']

#NOTE: The answers map to the questions as follows:

#1 ---> Answer 3 (Eg: 1st Cluster)

#34 ---> Answer 4 (Eg: Count of elements in Cluster 1)

#2 ---> Answer 5 (Eg: 2nd Cluster)

#130 ---> Answer 6 (Eg: Count of elements in Cluster 2)

Perform the above operations and write your output to a file named output.csv, located at output/output.csv.

output.csv should contain the answer to each question on consecutive rows.
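One way to produce that file is sketched below, using only the Python standard library. The helper name write_answers is hypothetical, and the values written are simply the sample result shown above, not real answers. Each answer goes on its own row, in question order.

```python
import csv
from pathlib import Path


def write_answers(result, path="output/output.csv"):
    """Write one answer per row, creating the output directory if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="") as fh:
        writer = csv.writer(fh)
        for answer in result:
            writer.writerow([answer])
    return out


# Placeholder values copied from the sample output; replace with computed answers.
sample = [150.20, 90.23, 1, 34, 2, 130, "musical", "china"]
written = write_answers(sample)
```

Inside solution(), the same call could be made with the computed result list once all eight answers are available.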