Predicting online success from text data
### Course: Text Analysis
### Assignment: "Predicting online success from text data"
### Data Source: https://www.pnas.org/doi/10.1073/pnas.2026045118
### Please see the ReadMe file for variable descriptions: https://osf.io/rym7z/
# check and set working directory
getwd()
setwd("")
# set random seed to make results replicable
set.seed(42)
# load libraries
library(dplyr)
library(stargazer)
library(car)
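library(anytime)
library(tm) # provides VCorpus() and DocumentTermMatrix(), used when applying the text classifier below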
# load data
d <- read.csv("kickstarter_data_OSF.csv", stringsAsFactors = F) # second dataset: "news_data_OSF_simplified.csv"
#load("kickstarter_data_OSF_simplified.Rda") # alternative option to import the data
# explore data
head(d, 3)
View(d[1:10,])
length(unique(d$country)) # check number of countries
sort(table(d$category), decreasing = T) # check distribution of categories
# transform variables
d <- d %>% dplyr::rename(text = blurb) # rename text column
summary(d$WC)
d$WC_mc <- as.numeric(scale(d$WC, scale = F)) # mean-center word count (WC)
summary(d$WC_mc) # check mean-centering
d$WC_mc_sq <- d$WC_mc**2 # square WC
cor(d$WC, d$WC^2) # check correlation without mean-centering
d$campaign_success <- ifelse(d$state == "failed", 0, 1)
d$usd_pledged_ln <- log(d$usd_pledged + 1)
d$goal_ln <- log(d$goal + 1)
d$backers_count_ln <- log(d$backers_count+1)
d$start_year <- format(as.Date(anytime(d$unix_startdate), format="%d/%m/%Y"),"%Y")
# create additional text-based variables
d$i <- grepl("\\bi\\b|\\bme\\b|\\bmyself\\b|\\bmy\\b", tolower(d$text)) # helpful resource for regex: https://cran.r-project.org/web/packages/stringr/vignettes/regular-expressions.html
d$innovative <- grepl("innovative", tolower(d$text))
d$please <- grepl("please", tolower(d$text))
round(prop.table(table(d$i)), 3) # explore new variable
d %>% group_by(i) %>% summarize(usd_pledged_mean = mean(usd_pledged)) # explore model-free evidence
d %>%
filter(i == T) %>% dplyr::select(text) %>% sample_n(5) # inspect five random texts containing a first-person pronoun
# visualize data
plot(table(d$start_year), main = "Number of Projects over Time") # plot distribution of years
par(mfrow = c(1,2))
hist(d$usd_pledged_ln, main = "Amount Pledged (in USD, ln)")
hist(d$goal_ln, main = "Funding Goal (in USD, ln)")
round(cor(d$usd_pledged_ln, d$goal_ln), 3) # check correlation
hist(d$WC, main = "Word Count")
hist(d$WC_mc, main = "Word Count (mean-centered)")
# sample data (optional)
if(TRUE){d <- d %>% sample_n(10000)}
# apply text classifier to full data, important: run "SentimentML_Script.R" before running this
corpus2 <- VCorpus(VectorSource(d$text))
dtm2 <- DocumentTermMatrix(corpus2, control = list(dictionary = Terms(dtm_unigram),
weighting = function(x) weightBin(x)))
dtm2 # inspect document-term-matrix (number of columns should be identical to the document-term-matrix based on which the classifier is trained)
d$sentiment <- predict(model_dt, as.data.frame(as.matrix(dtm2)), type = "class") # create new column based on predictions from classifier
table(d$sentiment) # important: this is a weak sentiment measure and just for illustration purposes
# analyze data
m <- list()
m[[1]] <- glm(campaign_success ~ WC_mc + WC_mc_sq + i + please + innovative + number + goal_ln + date_difference + country + category, data = d, family = "binomial") # logistic regression (for binary data)
m[[2]] <- lm(usd_pledged_ln ~ WC_mc + WC_mc_sq + i + please + innovative + number + goal_ln + date_difference + country + category, data = d) # linear regression (for continuous data)
m[[3]] <- update(m[[2]], "usd_pledged ~ .") # change to non-ln transformed usd_pledged to compare model fit
m[[4]] <- update(m[[2]], "backers_count_ln ~ .")
m[[5]] <- glm(backers_count ~ WC_mc + WC_mc_sq + i + please + innovative + number + goal_ln + date_difference + country + category, data = d, family = "poisson") # poisson regression (for count data)
summary(m[[1]])
vif(m[[1]]) # check vif values
# report results
stargazer(m,
title = "Regression Results",
omit = c("country", "category"),
no.space = F,
initial.zero = F,
notes.align = "l",
notes = "",
star.cutoffs = c(.05, .01, .001),
add.lines=list(c('Country Fixed Effects', rep('Yes', length(m))),
c('Category Fixed Effects', rep('Yes', length(m)))),
omit.stat = "aic",
type = "text")
# plot curves in relevant value range
par(mfrow = c(1,3))
START = quantile(d$WC_mc, probs = .05, na.rm = T) # define 90% value range for WC from START to END
START
END = quantile(d$WC_mc, probs = .95, na.rm = T)
END
# plot campaign success
b1 = coef(m[[1]])["WC_mc"]
b2 = coef(m[[1]])["WC_mc_sq"]
c = coef(m[[1]])["(Intercept)"]
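curve(b1 * x + b2 * x^2 + c, from = START, to = END, # linear predictor of the logit model, i.e., log-odds with all other predictors at zero/reference level
ylab="Campaign Success (log-odds)", xlab = "Word Count (mean-centered)")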
# plot usd pledged (ln)
b1 = coef(m[[2]])["WC_mc"]
b2 = coef(m[[2]])["WC_mc_sq"]
c = coef(m[[2]])["(Intercept)"]
curve(b1 * x + b2 * x^2 + c, from = START, to = END,
ylab="USD Pledged (ln)", xlab = "Word Count (mean-centered)")
# plot backers count
b1 = coef(m[[5]])["WC_mc"]
b2 = coef(m[[5]])["WC_mc_sq"]
c = coef(m[[5]])["(Intercept)"]
curve(b1 * x + b2 * x^2 + c, from = START, to = END, # linear predictor of the poisson model, i.e., log of the expected count
ylab="Backers Count (ln)", xlab = "Word Count (mean-centered)")
# compute optimum
-b1/(2*b2) # optimum of mean-centered WC (for backers count)
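-b1/(2*b2) + mean(d$WC - d$WC_mc) # optimum of original WC (for backers count); adds back the original centering constant, which can differ from mean(d$WC) after subsampling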
# THE END
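As an aside on the optimum computed at the end of the script: for a fitted curve b1 * x + b2 * x^2 + c, setting the derivative b1 + 2 * b2 * x to zero gives the turning point x* = -b1 / (2 * b2), which is a maximum when b2 is negative. The sketch below checks this on simulated data, not the Kickstarter data; all variable names and coefficient values are illustrative assumptions. It also stores the centering constant once, so that the back-transformation to the original scale does not depend on any later subsampling.

```r
# Simulated illustration (hypothetical data, not the Kickstarter file)
set.seed(42)
wc <- round(runif(500, min = 5, max = 35))   # hypothetical word counts
wc_mean <- mean(wc)                          # store the centering constant once
wc_mc <- wc - wc_mean                        # mean-centered word count

# Assumed inverted-U relationship with true optimum at 0.10 / (2 * 0.01) = 5
y <- 2 + 0.10 * wc_mc - 0.01 * wc_mc^2 + rnorm(500, sd = 0.5)

fit <- lm(y ~ wc_mc + I(wc_mc^2))
b1 <- coef(fit)["wc_mc"]
b2 <- coef(fit)["I(wc_mc^2)"]

# Turning point of the fitted parabola
opt_mc <- unname(-b1 / (2 * b2))   # optimum on the mean-centered scale
opt_wc <- opt_mc + wc_mean         # optimum on the original word-count scale
opt_mc
opt_wc
```

For the logit and Poisson models in the script, the same formula locates the turning point of the linear predictor; because the link functions are monotonic, the predicted probability or count peaks at the same word count.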
Using the above code as a reference, complete the task below (requirement details).
CONTEXT
Imagine you are a digital analyst at a management consultancy working on a new industry report to impress prospective clients. Firms and their marketing departments are increasingly recognizing the value of unstructured data (videos, images, text) and are eager to learn what insights can be generated from them. You are responsible for the workstream “Predicting online success from text data”. Your main objectives are to (a) showcase creatively how different NLP methods can be used to extract meaningful textual features, and (b) explore which ones are associated with online success. For this purpose, you have been given access to a large-scale real-world dataset, which you can draw on to derive your data-driven recommendations.
DATA
The dataset contains 160,007 campaigns from Kickstarter, a global crowdfunding platform. Your main column of interest is the text column, which contains a brief project description (e.g., "This project is designed to help protect the environment by using Eco-friendly product packaging."). Do these unstructured project descriptions contain any valuable signals that can be used to predict funding success?
SUBMISSION
You have to prepare and submit a PDF presentation of max. 8 content slides (not including the title page) to tell your story and answer the questions. Note: Structure your presentation based on the pyramid principle. Start with a slide that contains an “Executive Summary” and provide more and more detail as you go. This ensures that those with limited time to read your report still get the main message. Please also submit your R and/or Python code.
QUESTIONS
Your main overarching question is: Which text features explain funding success on Kickstarter? Please consider the following guiding questions when addressing it. Note: Creative insights (e.g., insightful, novel text features) are part of the grading criteria. Be sure to get to know the data well and generate compelling, data-driven insights from them that differentiate your industry report from the competition. You are encouraged to pursue your own ideas in addition to the questions listed below, as you would need to do if this were your real job.
Data exploration:
1. What proportion of projects managed to raise the desired amount of money?
2. What is the median funding goal? What is the median of funds raised?
3. How many different categories are included in the dataset? How do they differ in terms of funding objectives?
4. How many words does the average text contain? What are the 10 most frequent words?
5. How can you visualize the overall distribution of words in the data in a compelling way?
6. Does topic modeling using methods such as supervised or unsupervised LDA generate any insights? (Optional, as we did not cover this in class.)
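For the word-count and word-frequency questions, a minimal base-R sketch might look as follows. The four texts are made-up stand-ins for the text column, and the tokenization rule (split on anything that is not a letter or apostrophe) is an assumption; a dedicated text package may handle edge cases better.

```r
# Toy texts standing in for d$text (hypothetical examples, not real data)
texts <- c("A new board game for the whole family",
           "Help us record a new album",
           "An eco-friendly bottle for the outdoors",
           "A documentary film about a family farm")

# Average number of whitespace-separated words per text
words_per_text <- lengths(strsplit(trimws(texts), "\\s+"))
mean(words_per_text)

# Lowercase, split on anything that is not a letter or apostrophe, drop empties
tokens <- unlist(strsplit(tolower(texts), "[^a-z']+"))
tokens <- tokens[tokens != ""]

# Frequency table sorted from most to least frequent; show the top 10
word_freq <- sort(table(tokens), decreasing = TRUE)
head(word_freq, 10)
```

In this toy example the most frequent token is the article "a", which illustrates why removing stop words before ranking usually gives a more informative top 10.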
Variable generation:
7. Based on your data exploration, please create at least one dictionary (i.e., custom word list) that you expect to be associated with campaign success. How well does your dictionary perform? Critically reflect on possible misclassifications and suggest improvement ideas.
8. Create additional textual features based on different off-the-shelf lexicons (e.g., the Evaluative Lexicon 2.0, VADER, AFINN, etc.). How well do these perform?
9. Train your own machine learning-based classifier based on a random set of annotated training data (about 100-200 examples per class are recommended). Apply this model to the Kickstarter data. (If you cannot process all 160,007 observations during inference due to computational constraints, you may also apply it only to a smaller subset, e.g., around 10,000 observations, and report those results).
10. Apply an off-the-shelf language model from Hugging Face (e.g., SiEBERT for sentiment analysis).
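For the dictionary question, one possible starting point mirrors the grepl() approach used in the reference script. The word list, the toy texts, and the outcome values below are illustrative assumptions only, not a validated dictionary or real data.

```r
# Hypothetical mini data set; in the reference script these would be
# d$text and d$campaign_success
toy <- data.frame(
  text = c("Please support our eco-friendly packaging",
           "A sustainable backpack made from recycled bottles",
           "My first comic book",
           "Green energy for remote villages",
           "A card game about greenhouse gases"),
  campaign_success = c(1, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Custom word list (an illustrative assumption)
eco_words <- c("eco", "sustainable", "recycled", "green")

# \\b anchors each term at word boundaries, so "green" does not match "greenhouse"
pattern <- paste0("\\b(", paste(eco_words, collapse = "|"), ")\\b")
toy$eco <- grepl(pattern, tolower(toy$text))

# Model-free check: success rate with and without a dictionary hit
aggregate(campaign_success ~ eco, data = toy, FUN = mean)
```

Reading a random sample of matched and unmatched texts, as the reference script does for the first-person pronoun variable, is a simple way to spot misclassifications such as terms used in an unrelated sense.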
Findings:
11. Include your text-based variables in different regression models. What do you learn?
12. Do you find any non-linear effects (quadratic, interactions)? Please visualize them.
13. Discuss your findings based on related marketing literature (e.g., Berger & Milkman 2012).
14. What are the top recommendations you want to emphasize in your industry report? What insight is most striking to you?
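For the question on non-linear effects, a quadratic term from a logistic regression can also be visualized on the probability scale rather than as log-odds. The sketch below uses simulated data with assumed coefficients; with the real data, the other predictors in the model would need to be held at chosen values when building the prediction grid.

```r
# Simulated illustration (hypothetical data and coefficients)
set.seed(42)
n <- 2000
wc_mc <- rnorm(n, sd = 5)                                # centered word count
p_true <- plogis(-0.5 + 0.08 * wc_mc - 0.01 * wc_mc^2)   # assumed inverted U
success <- rbinom(n, size = 1, prob = p_true)            # binary outcome

fit <- glm(success ~ wc_mc + I(wc_mc^2), family = "binomial")

# Prediction grid over the central 90% of the observed values
grid <- data.frame(wc_mc = seq(quantile(wc_mc, .05), quantile(wc_mc, .95),
                               length.out = 100))

# type = "response" returns predicted probabilities instead of log-odds
grid$p_hat <- predict(fit, newdata = grid, type = "response")

plot(grid$wc_mc, grid$p_hat, type = "l",
     xlab = "Word Count (mean-centered)", ylab = "Predicted P(success)")
```

Plotting probabilities tends to be easier to explain to a non-technical audience than plotting the linear predictor, although the location of the peak is the same on both scales.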