
Regression with GBM In R | R Studio Assignment and Project Help | Realcode4you

In this case study we'll be predicting bike-sharing counts based on weather and other factors.


Read Dataset

bs=read.csv("bike_sharing_hours.csv",stringsAsFactors = F) 

# CreateDummies converts a categorical column into 0/1 dummy columns.
# Categories with frequency <= freq_cutoff are ignored, and the least
# frequent remaining category is dropped to act as the base level.
CreateDummies=function(data,var,freq_cutoff=0){
	t=table(data[,var])
	t=t[t>freq_cutoff]        # keep only categories above the frequency cutoff
	t=sort(t)
	categories=names(t)[-1]   # drop the least frequent category as the base level

	for( cat in categories){
		name=paste(var,cat,sep="_")
		# clean up the new column name: drop or replace special characters
		name=gsub(" ","",name)
		name=gsub("-","_",name)
		name=gsub("\\?","Q",name)
		name=gsub("<","LT_",name)
		name=gsub("\\+","",name)
		name=gsub("\\/","_",name)
		name=gsub(">","GT_",name)
		name=gsub("=","EQ_",name)
		name=gsub(",","",name)
		data[,name]=as.numeric(data[,var]==cat)   # 0/1 indicator for this category
	}

	data[,var]=NULL           # remove the original categorical column
	return(data)
}
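To see what the helper produces, here is a small illustrative run on a made-up data frame (the toy data and the city column are hypothetical, not part of this case study):

d=data.frame(city=c("pune","pune","delhi","delhi","goa"),x=1:5,stringsAsFactors=F)
CreateDummies(d,"city")

##   x city_delhi city_pune
## 1 1          0         1
## 2 2          0         1
## 3 3          1         0
## 4 4          1         0
## 5 5          0         0

goa, the least frequent city, becomes the base level and gets no dummy column of its own.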

Here cnt is simply the sum of casual and registered. Since cnt is our response, we drop the date column (dteday) along with those two columns.

library(dplyr) 
bs=bs %>% select(-dteday,-casual,-registered) 

Next, we create dummy variables for the appropriate columns.


char_cols=c("season","mnth","hr","holiday","weekday","workingday","weathersit") 
# create dummies for each of these columns; categories with 500 or fewer
# rows do not get their own dummy column
for(col in char_cols){ 
	bs=CreateDummies(bs,col,500) 
} 
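After this loop all predictors should be numeric. A quick optional sanity check (not part of the original write-up) could be:

table(sapply(bs,class))   # every column should now be integer/numeric
sum(is.na(bs))            # count of missing values, if any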

You'll see that we follow the same parameter-tuning procedure here as we did with randomForest; only the names and the typical values of the parameters to be tried are different.


library(gbm) 
library(cvTools) 
param=list(interaction.depth=c(1:7), 
	n.trees=c(50,100,200,500,700), 
	shrinkage=c(.1,.01,.001), 
	n.minobsinnode=c(1,2,5,10)) 

Here interaction.depth controls how weak our learners are. A value of 1 means the weak learners are decision trees with a single split (stumps); very high values here will lead to overfitting.


n.trees means the same thing as ntree in randomForest, and n.minobsinnode is similar to nodesize in randomForest. shrinkage is the fraction we talked about earlier; the values to try should ideally be less than 1. Very small values generally require correspondingly high values of n.trees.
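As a rough illustration of the shrinkage/n.trees interplay (a quick sketch, not part of the tuning procedure; the parameter values below are arbitrary), fitting two models with the same number of trees but different shrinkage shows that the smaller shrinkage has learned much less after the same 200 trees:

fit_fast=gbm(cnt~.,data=bs,distribution="gaussian",n.trees=200,shrinkage=0.1,interaction.depth=3)
fit_slow=gbm(cnt~.,data=bs,distribution="gaussian",n.trees=200,shrinkage=0.001,interaction.depth=3)
# training RMSE after 200 trees: typically much lower for the larger shrinkage
sqrt(mean((bs$cnt-predict(fit_fast,bs,n.trees=200))^2))
sqrt(mean((bs$cnt-predict(fit_slow,bs,n.trees=200))^2))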



# subset_paras: randomly samples n combinations from the full parameter grid
subset_paras=function(full_list_para,n=10){ 
	all_comb=expand.grid(full_list_para) 
	s=sample(1:nrow(all_comb),n) 
	subset_para=all_comb[s,] 
	return(subset_para) 
} 
num_trials=10 
my_params=subset_paras(param,num_trials) 
# Note: A good value for num_trials is around 10-20% of the total number of
# possible combinations. It doesn't always have to be 10.
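# (Illustrative aside, not in the original: the full grid above has
#  7*5*3*4 = 420 combinations, so 10-20% would be roughly 40-80 trials;
#  num_trials=10 just keeps this run quick.)
# nrow(expand.grid(param))   # 420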
myerror=9999999
for(i in 1:num_trials){
	print(paste0('starting iteration:',i))   # keep track of progress
	params=my_params[i,]

	k=cvTuning(gbm,cnt~.,
		data =bs,
		tuning =params,
		args = list(distribution="gaussian"),
		folds = cvFolds(nrow(bs), K=10, type = "random"),
		seed =2,
		predictArgs = list(n.trees=params$n.trees)
		)
	score.this=k$cv[,2]

	if(score.this<myerror){
		print(params)        # current best parameter combination
		myerror=score.this
		print(myerror)       # current best CV error
		best_params=params
	}
	print('DONE')
}

myerror

## [1] 52.29379


This is a tentative measure of performance: the lowest cross-validated error found during tuning.


best_params

##   interaction.depth n.trees shrinkage n.minobsinnode
## 1                 6     500       0.1             10


This is the best combination of parameter values as per the CV errors. Let's build our final model using these values.


bs.gbm.final=gbm(cnt~.,data=bs,
	n.trees = best_params$n.trees,
	n.minobsinnode = best_params$n.minobsinnode,
	shrinkage = best_params$shrinkage,
	interaction.depth = best_params$interaction.depth,
	distribution = "gaussian")

# bs_test is the test data, prepared with the same column drops and dummy creation as bs
test.pred=predict(bs.gbm.final,newdata=bs_test,n.trees = best_params$n.trees)
write.csv(test.pred,"mysubmission.csv",row.names = F) 
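As an optional follow-up (not part of the original submission step), gbm's summary method reports the relative influence of each predictor in the final model:

summary(bs.gbm.final,n.trees=best_params$n.trees,plotit=FALSE)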