
# Regression with GBM In R | R Studio Assignment and Project Help | Realcode4you

In this case study we’ll be predicting bike sharing numbers based on weather and other factors.

Read Dataset

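`bs=read.csv("bike_sharing_hours.csv",stringsAsFactors = F)`

We’ll also need a helper that turns a categorical column into 0/1 dummy columns.

```
# Replace column `var` with 0/1 indicator columns.
# Levels that occur freq_cutoff times or fewer get no indicator, and the
# least frequent remaining level is dropped so it acts as the baseline.
CreateDummies=function(data,var,freq_cutoff=0){
  t=table(data[,var])
  t=t[t>freq_cutoff]
  t=sort(t)
  categories=names(t)[-1]

  for( cat in categories){
    # clean up the level so the new column name is a valid identifier
    name=paste(var,cat,sep="_")
    name=gsub(" ","",name)
    name=gsub("-","_",name)
    name=gsub("\\?","Q",name)
    name=gsub("<","LT_",name)
    name=gsub("\\+","",name)
    name=gsub("\\/","_",name)
    name=gsub(">","GT_",name)
    name=gsub("=","EQ_",name)
    name=gsub(",","",name)
    data[,name]=as.numeric(data[,var]==cat)
  }

  data[,var]=NULL
  return(data)
}
```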
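To see what the helper does, here is a small sketch that walks through the same steps by hand on a made-up data frame (the `toy` data and the cutoff of 1 are illustrative assumptions, not part of the case study). It should give the same result as calling `CreateDummies(toy,"weathersit",1)`.

```r
# Toy data: one categorical column with levels of differing frequency
toy <- data.frame(weathersit = c(1, 1, 1, 2, 2, 3),
                  cnt = c(10, 12, 9, 7, 6, 2))

t <- table(toy$weathersit)   # counts per level: 1 -> 3, 2 -> 2, 3 -> 1
t <- t[t > 1]                # level 3 occurs only once, so it gets no dummy
t <- sort(t)                 # ascending by count: level 2, then level 1
categories <- names(t)[-1]   # drop the least frequent survivor (2) as baseline

for (lvl in categories) {
  toy[, paste("weathersit", lvl, sep = "_")] <- as.numeric(toy$weathersit == lvl)
}
toy$weathersit <- NULL       # the original column is no longer needed

print(toy)
```

Only `weathersit_1` is created here: rows with level 2 (the baseline) and level 3 (too rare) both end up with 0 in that column.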

Here cnt is simply the sum of casual and registered. Since cnt is our response, we’ll drop those two columns along with the date column, dteday.

```
library(dplyr)
bs=bs %>% select(-dteday,-casual,-registered)
```
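Because cnt is just casual plus registered, keeping those two columns would let a model reconstruct the response exactly. A minimal sketch of that check on a toy data frame follows. The values and the `temp` column are illustrative, and base R is used instead of dplyr so the snippet runs on its own.

```r
# Toy rows using the same column names as the bike sharing data (values made up)
toy <- data.frame(dteday = c("2011-01-01", "2011-01-01", "2011-01-01"),
                  temp = c(0.24, 0.22, 0.22),
                  casual = c(3, 8, 5),
                  registered = c(13, 32, 27),
                  cnt = c(16, 40, 32))

# cnt is fully determined by the two columns we are about to drop
leak_check <- all(toy$cnt == toy$casual + toy$registered)

# Base R equivalent of select(-dteday, -casual, -registered)
toy <- toy[, !(names(toy) %in% c("dteday", "casual", "registered"))]
remaining <- names(toy)
```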

Next we’ll make dummy variables for the categorical columns.

```
char_cols=c("season","mnth","hr","holiday","weekday","workingday","weathersit")
for(col in char_cols){
  bs=CreateDummies(bs,col,500)
}
```

You’ll see that we follow the same procedure for parameter tuning as with randomForest; only the names and the usual values of the parameters to be tried are different.

```
library(gbm)
library(cvTools)
```
```
param=list(interaction.depth=c(1:7),
           n.trees=c(50,100,200,500,700),
           shrinkage=c(.1,.01,.001),
           n.minobsinnode=c(1,2,5,10))
```

Here interaction.depth controls how weak our learner is. A value of 1 means the weak learners are decision trees with a single split. Very high values here will lead to overfitting.

n.trees means the same thing as ntree in randomForest, and n.minobsinnode is similar to nodesize in randomForest. shrinkage is the fraction that we talked about earlier; values to try here should be less than 1. Very small values generally need a high value of n.trees to reach the same fit.
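The trade-off between shrinkage and n.trees can be seen in a hand-rolled sketch of boosting with single-split trees on simulated data. This is not how the gbm package is implemented; the helper names `fit_stump` and `boost` and the simulated data are assumptions made purely for illustration.

```r
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.1)

# Fit a single-split regression tree (a "stump") to the residuals by
# trying every split point and keeping the one with the lowest squared error
fit_stump <- function(x, res) {
  best <- list(sse = Inf)
  for (s in sort(unique(x))[-1]) {
    left <- res[x < s]
    right <- res[x >= s]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best$sse) {
      best <- list(sse = sse, split = s,
                   left_mean = mean(left), right_mean = mean(right))
    }
  }
  best
}

# Squared-error boosting: each stump is fitted to the current residuals and
# only a `shrinkage` fraction of its prediction is added to the model
boost <- function(x, y, n.trees, shrinkage) {
  pred <- rep(mean(y), length(y))
  for (m in seq_len(n.trees)) {
    st <- fit_stump(x, y - pred)
    pred <- pred + shrinkage * ifelse(x < st$split, st$left_mean, st$right_mean)
  }
  sqrt(mean((y - pred)^2))   # training RMSE after n.trees steps
}

rmse_fast <- boost(x, y, n.trees = 50, shrinkage = 0.1)
rmse_slow <- boost(x, y, n.trees = 50, shrinkage = 0.001)
print(c(shrinkage_0.1 = rmse_fast, shrinkage_0.001 = rmse_slow))
```

With the same 50 trees, the run with shrinkage 0.001 has barely moved away from the overall mean, which is why small shrinkage values are usually paired with many more trees.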

```
# draw n random rows from the full grid of parameter combinations
subset_paras=function(full_list_para,n=10){
  all_comb=expand.grid(full_list_para)
  s=sample(1:nrow(all_comb),n)
  subset_para=all_comb[s,]
  return(subset_para)
}
```
```
num_trials=10
my_params=subset_paras(param,num_trials)
# Note: a good value for num_trials is around 10-20% of the total number of
# possible combinations. It doesn't always have to be 10.
myerror=9999999
```
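To put the 10-20% guideline in numbers, the short sketch below counts the combinations in the grid defined earlier. The variable names `n_total`, `range_trials` and `picked` are introduced here only for illustration.

```r
param <- list(interaction.depth = 1:7,
              n.trees = c(50, 100, 200, 500, 700),
              shrinkage = c(.1, .01, .001),
              n.minobsinnode = c(1, 2, 5, 10))

all_comb <- expand.grid(param)
n_total <- nrow(all_comb)                  # 7 * 5 * 3 * 4 = 420 combinations
range_trials <- n_total * c(10, 20) / 100  # 10% and 20% of the grid: 42 and 84

# Random search only ever evaluates a sample of these rows
set.seed(2)
picked <- all_comb[sample(seq_len(n_total), 10), ]
print(n_total)
print(range_trials)
```

So the 10 trials used here cover only about 2% of the grid; following the guideline would mean roughly 42 to 84 trials, at a correspondingly higher running time.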
```
for(i in 1:num_trials){
  # progress message; comment out to silence
  print(paste0('starting iteration:',i))
  params=my_params[i,]

  k=cvTuning(gbm,cnt~.,
             data =bs,
             tuning =params,
             args = list(distribution="gaussian"),
             folds = cvFolds(nrow(bs), K=10, type = "random"),
             seed =2,
             predictArgs = list(n.trees=params$n.trees)
             )
  score.this=k$cv[,2]

  # keep the combination with the lowest cross-validated error so far
  if(score.this<myerror){
    print(params)
    myerror=score.this
    print(myerror)
    best_params=params
  }
  print('DONE')
}

myerror
## [1] 52.29379
```

This is a tentative measure of performance: the cross-validated error of the best combination tried.
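Since the loop above does not pass a cost argument to cvTuning, the reported error should be on the scale of cvTools' default cost, which to my knowledge is the root mean squared prediction error. A minimal base R sketch of that quantity, with a hypothetical helper name and made-up numbers:

```r
# Root mean squared prediction error, written out by hand
rmspe_sketch <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

actual <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 37)
err <- rmspe_sketch(actual, predicted)   # errors are -2, 2, -3, 3
print(err)
```

Under that assumption, a value of about 52 means the predicted hourly counts are off by roughly 52 rentals in the root-mean-square sense.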

```
best_params
##   interaction.depth n.trees shrinkage n.minobsinnode
## 1                 6     500       0.1             10
```

This is the best combination of parameter values as per the CV errors. Let’s build our final model using these values.

```
bs.gbm.final=gbm(cnt~.,data=bs,
                 n.trees = best_params$n.trees,
                 n.minobsinnode = best_params$n.minobsinnode,
                 shrinkage = best_params$shrinkage,
                 interaction.depth = best_params$interaction.depth,
                 distribution = "gaussian")

# bs_test is not created in this walkthrough; it is assumed to be a test set
# prepared with the same steps as bs
test.pred=predict(bs.gbm.final,newdata=bs_test,n.trees = best_params$n.trees)
write.csv(test.pred,"mysubmission.csv",row.names = F)
```