*Deep Learning vs. Random Forest*

June 1, 2015

There are various binary classifiers, such as logistic regression, deep learning, random forest, and gradient boosting trees. In this post, I compare the performance of these methods.

The data set that I use is provided by a company that develops an adaptive learning platform for mathematics education. On its web- and app-based platform, students repeatedly watch videos and comics that explain mathematical concepts, and solve test problems. The output variable indicates whether a student solves a problem correctly (1) or not (0). I use two types of explanatory variables (289 input variables in total). The first type covers the intrinsic characteristics of a problem, such as its chapter and whether it belongs to a review or a lesson. The second type captures a student's learning behavior, such as the number of prior exposures to the same problem and the elapsed time spent watching the video. The data set contains about 10M observations.

R provides various state-of-the-art machine learning packages, so even a novice can easily apply machine learning techniques. The majority of these packages are implemented in lower-level languages such as C, Java, or Fortran, and some of them also provide parallel computing features.

Libraries

I use four libraries.

library(gbm)

library(randomForest)

library(h2o)

library(pROC)

Random Forest

Random forest is a bagging method for classification that constructs a multitude of decision trees and combines their outputs. It is well suited to parallel computing and alleviates the overfitting problem of a single decision tree.

You can easily build a binary classifier based on random forest with the following commands in R.

rf = randomForest(result ~ ., data = train, importance = TRUE, na.action = na.omit,
                  proximity = FALSE, ntree = 100, nodesize = 5)

rf_pred_train = predict(rf, type = "prob")

rf_pred_test = predict(rf, newdata = test, type = "prob")

Gradient Boosting Trees

GBM (gradient boosting trees) is a boosting method that combines decision trees fitted sequentially to modified versions of the data. In each iteration, we find the tree (parameters) that minimizes the loss function given the prior trees. This optimal tree is close to the negative gradient of the loss function with respect to the current model f.
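
For the Bernoulli loss used below, that negative gradient works out to the simple residual y - p, where p is the current predicted probability. The short base-R check below (my own sketch, not part of the gbm package) verifies this numerically with a finite difference:

```r
# Bernoulli deviance for a single observation, as a function of the logit f
bernoulli_loss = function(y, f) -(y * f - log(1 + exp(f)))

y = 1; f = 0.3
p = 1 / (1 + exp(-f))  # current predicted probability

# finite-difference approximation of dL/df
eps = 1e-6
num_grad = (bernoulli_loss(y, f + eps) - bernoulli_loss(y, f - eps)) / (2 * eps)

# the negative gradient should equal the residual y - p
all.equal(-num_grad, y - p, tolerance = 1e-4)  # TRUE
```

So each boosting iteration effectively fits the next tree to these residuals.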

You can run a GBM using the following commands in R. It requires more detailed settings than random forest.

gbm1 = gbm(result ~ ., # formula
           data = train, # dataset
           distribution = "bernoulli", # see the help for other choices
           n.trees = 1000, # number of trees
           shrinkage = 0.05, # shrinkage or learning rate,
           # 0.001 to 0.1 usually work
           interaction.depth = 3, # 1: additive model, 2: two-way interactions, etc.
           bag.fraction = 0.5, # subsampling fraction, 0.5 is probably best
           train.fraction = 0.5, # fraction of data for training,
           # first train.fraction*N used for training
           n.minobsinnode = 10, # minimum total weight needed in each node
           cv.folds = 0, # no cross-validation
           keep.data = TRUE, # keep a copy of the dataset with the object
           verbose = FALSE, # don't print out progress
           n.cores = 1
           )

best.iter = gbm.perf(gbm1, method = "test") # best iteration on the held-out fraction

gbm_pred_train = predict(gbm1, train, best.iter, type = "response")

gbm_pred_test = predict(gbm1, test, best.iter, type = "response")

Deep Learning

A feedforward neural network, the quintessential deep learning model, consists of many connected functions that together constitute a classifier or regressor. The basic unit of the model, the neuron, is loosely inspired by biological neurons. The model is called feedforward because information flows forward through the functions with no feedback connections, and it is called a network because it is typically represented by composing many different functions together.
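
To make "composing functions" concrete, here is a toy forward pass for a two-layer network in base R. This is purely illustrative (the weight matrices W1 and W2 are random made-up values; h2o handles all of this internally):

```r
# Elementwise activation functions
relu = function(z) pmax(z, 0)
sigmoid = function(z) 1 / (1 + exp(-z))

set.seed(1)
x  = matrix(rnorm(4), nrow = 1)   # one observation with 4 features
W1 = matrix(rnorm(4 * 3), 4, 3)   # input layer -> hidden layer (3 units)
W2 = matrix(rnorm(3 * 1), 3, 1)   # hidden layer -> output

h = relu(x %*% W1)                # hidden activations
p = sigmoid(h %*% W2)             # predicted probability, a value in (0, 1)
```

The classifier is literally the composition sigmoid(relu(x W1) W2): information flows one way from input to output.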

To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or an approximation of it) at the current point. SGD (stochastic gradient descent) is a gradient descent method for minimizing an objective function that is written as a sum of differentiable functions; the true gradient is approximated by the gradient at a single example. Computing the gradient against more than one training example (a "mini-batch") at each step can perform significantly better than pure stochastic gradient descent, because the code can use vectorized libraries rather than computing each step separately, and it can also yield smoother convergence. Finding all the optimal parameters simultaneously would require tedious calculation, but backpropagation enables a modular approach to the optimization.
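
The mini-batch update rule can be sketched in a few lines of base R. This toy example (simulated data and made-up hyperparameters, nothing like h2o's actual implementation) fits a logistic regression by ascending the average log-likelihood gradient over batches of 32 examples:

```r
set.seed(42)
n = 2000
x = cbind(1, rnorm(n))                        # intercept + one feature
true_beta = c(-0.5, 2)
y = rbinom(n, 1, 1 / (1 + exp(-x %*% true_beta)))

beta = c(0, 0); lr = 0.1; batch = 32
for (epoch in 1:20) {
  idx = sample(n)                             # shuffle each epoch
  for (start in seq(1, n, by = batch)) {
    b = idx[start:min(start + batch - 1, n)]  # indices of this mini-batch
    p = 1 / (1 + exp(-x[b, , drop = FALSE] %*% beta))
    g = t(x[b, , drop = FALSE]) %*% (y[b] - p) / length(b)  # average gradient
    beta = beta + lr * as.vector(g)           # ascent on the log-likelihood
  }
}
round(beta, 1)  # should land near c(-0.5, 2)
```

Each batch's gradient is a noisy but cheap estimate of the full gradient, which is exactly the trade-off described above.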

The following code is used to run deep learning with h2o.

h2o.init() # start a local H2O cluster first

dl = h2o.deeplearning(x = x,
                      y = y,
                      training_frame = train,
                      validation_frame = test,
                      distribution = "multinomial",
                      activation = "MaxoutWithDropout",
                      hidden = c(200, 200, 200),
                      input_dropout_ratio = 0.2,
                      hidden_dropout_ratios = c(0.5, 0.5, 0.5),
                      l1 = 1e-5,
                      epochs = 50)

dl_pred_train <- h2o.predict(dl, newdata = train)

dl_pred_test <- h2o.predict(dl, newdata = test)

Bayesian IRT (Item Response Theory)

The IRT model, widely applied to computer-based tests such as the TOEFL, GRE, and GMAT, evaluates students' achievement. Based on a logistic functional form, it estimates each student's ability and each problem's difficulty, and predicts the probability of a correct answer by weighing ability against difficulty.
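
The core of the simplest (one-parameter, or Rasch) version is a single logistic function of the gap between ability theta and difficulty b; this one-liner is a generic textbook form, not my actual model code:

```r
# P(correct answer) for a student of ability theta on an item of difficulty b
irt_prob = function(theta, b) 1 / (1 + exp(-(theta - b)))

irt_prob(theta = 1.0, b = 1.0)  # ability equals difficulty -> 0.5
irt_prob(theta = 2.0, b = 0.0)  # strong student, easy item -> ~0.88
```

When ability exceeds difficulty the probability rises above one half, and vice versa.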

I build a hierarchical Bayesian IRT model and estimate the student- and problem-level parameters using Bayesian MCMC (the Metropolis-Hastings algorithm).
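
To give a flavor of the Metropolis-Hastings step, here is a toy random-walk sampler for a single student's ability, assuming the item difficulties are known and a standard normal prior on ability. The data, proposal scale, and prior are all made up for illustration; the full hierarchical model samples the item parameters as well:

```r
set.seed(7)
b = c(-1, 0, 0.5, 1, 2)    # known item difficulties (toy values)
y = c(1, 1, 1, 0, 0)       # observed responses for one student

# log posterior: Bernoulli likelihood under the IRT curve + N(0, 1) prior
log_post = function(theta) {
  p = 1 / (1 + exp(-(theta - b)))
  sum(dbinom(y, 1, p, log = TRUE)) + dnorm(theta, 0, 1, log = TRUE)
}

theta = 0; draws = numeric(5000)
for (i in 1:5000) {
  prop = theta + rnorm(1, 0, 0.5)   # random-walk proposal
  # accept with probability min(1, posterior ratio)
  if (log(runif(1)) < log_post(prop) - log_post(theta)) theta = prop
  draws[i] = theta
}
mean(draws[-(1:1000)])  # posterior mean ability after burn-in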

I wrote the code to estimate the Bayesian IRT model from scratch. If you have further interest, please let me know.

ROC

The ROC (receiver operating characteristic) curve illustrates the performance of a binary classifier, so it is widely used to compare classifiers. The following commands plot the ROC curves of the four models: Bayesian IRT, deep learning, random forest, and gradient boosting trees. I plot two sets of ROC curves, one for the training data set and one for the test data set.

bayes_train_roc = plot.roc(train$result, round(logistic_pred_train, 2), col = "#008600")

rf_train_roc = plot.roc(train$result, rf_pred_train[, 2], col = "#1c61b6", add = TRUE) # column 2: P(class 1)

gbm_train_roc = plot.roc(train$result, gbm_pred_train, col = "red", add = TRUE)

dl_train_roc = plot.roc(train$result, dl_pred_train, col = "purple", add = TRUE)

legend("bottomright",
       legend = c("Bayesian IRT", "Random Forest", "Gradient Boosting Trees", "Deep Learning"),
       col = c("#008600", "#1c61b6", "red", "purple"), lwd = 2)

Bayesian logistic regression shows the best performance on the training data set. The machine learning methods have far larger sets of parameters, so they impose stricter regularization rules to avoid overfitting, and on the test data set the Bayesian logistic regression under-performs them. The tree-based methods (random forest and gradient boosting trees) outperform deep learning, but the differences in accuracy are below 0.3 percentage points. We can conclude that the machine-learning-based binary classifiers perform better than the statistical model because of their strict regularization, while the differences in accuracy among the machine learning classifiers themselves are hard to detect in our data set.

ROC (training data set)

ROC (test data set)
