There are various binary classifiers, such as logistic regression, deep learning, random forests, and gradient boosting trees. In this post I compare the performance of these methods.
The data set I use is provided by a company that develops an adaptive learning platform for mathematics education. On its web- and app-based platform, students repeatedly watch videos and comics that explain mathematical concepts and then solve test problems. The output variable indicates whether a student solves a problem correctly (1) or not (0). I use two types of explanatory variables (289 input variables in total). The first type captures the intrinsic characteristics of a problem, such as its chapter and whether it belongs to a review or a lesson. The second type represents a student's learning behaviors, such as the number of prior exposures to the same problem and the time spent watching the video. I have about 10 million observations.
R provides various state-of-the-art machine learning packages, so even a novice can easily apply machine learning techniques. The cores of most machine learning packages are written in lower-level languages such as C, Java, and Fortran, and some of them also support parallel computing.
I use four libraries.
Random forest is a bagging method for classification: it constructs a multitude of decision trees and combines the outputs of the individual trees. It is well suited to parallel computing and mitigates the overfitting problem of a single decision tree.
You can easily build a binary classifier based on random forest with the following commands in R.
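A minimal sketch using the randomForest package. The real data set is not available here, so `train` and `test` below are tiny synthetic stand-ins (the actual data has 289 inputs and roughly 10 million rows); parameter values are illustrative, not the settings used in the post.

```r
library(randomForest)

# Synthetic stand-ins for the real train/test data frames
set.seed(1)
n <- 200
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$result <- as.factor(ifelse(train$x1 + rnorm(n) > 0, 1, 0))
test <- data.frame(x1 = rnorm(n), x2 = rnorm(n))

rf1 <- randomForest(result ~ .,       # classification needs a factor response
                    data = train,
                    ntree = 200,      # number of trees in the forest
                    mtry = 1,         # variables tried at each split
                    importance = TRUE)  # keep variable importance measures

rf_pred_test <- predict(rf1, test, type = "prob")[, 2]  # P(result = 1)
```

The forest's trees are grown independently on bootstrap samples, which is why the method parallelizes naturally.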
GBM (gradient boosting machine) is a boosting method that combines decision trees fitted sequentially to modified versions of the data. In each iteration, we fit the tree that minimizes the loss function given the trees built so far; this new tree approximates the negative gradient of the loss function with respect to the current model f.
You can run a GBM with the following commands in R. It requires more detailed settings than random forest.
gbm1 = gbm(result ~ .,               # formula
           data = train,             # dataset
           distribution = "bernoulli",  # see the help for other choices
           n.trees = 1000,           # number of trees
           shrinkage = 0.05,         # shrinkage or learning rate;
                                     # 0.001 to 0.1 usually work
           interaction.depth = 3,    # 1: additive model, 2: two-way interactions, etc.
           bag.fraction = 0.5,       # subsampling fraction, 0.5 is probably best
           train.fraction = 0.5,     # fraction of data for training;
                                     # first train.fraction*N used for training
           n.minobsinnode = 10,      # minimum total weight needed in each node
           cv.folds = 0,             # no cross-validation (e.g., 3 for 3-fold CV)
           keep.data = TRUE,         # keep a copy of the dataset with the object
           verbose = FALSE,          # don't print out progress
           n.cores = 1)
best.iter = gbm.perf(gbm1, method = "test")  # optimal number of trees on the held-out fraction
gbm_pred_train = predict(gbm1, train, n.trees = best.iter)
gbm_pred_test = predict(gbm1, test, n.trees = best.iter)
The feedforward neural network is the quintessential deep learning model: it composes a set of connected functions into a classifier or regressor. The basic unit of the model, the artificial neuron, is loosely inspired by biological neurons. The model is called feedforward because information flows forward through the functions with no feedback connections, and it is called a network because it is typically represented by composing many different functions together.
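The "composed functions" idea can be made concrete with a base-R sketch of a two-layer feedforward network for binary classification. The weights below are random placeholders, not trained values, and the layer sizes are illustrative.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Two composed functions: hidden layer, then output layer
forward <- function(x, W1, b1, W2, b2) {
  h <- sigmoid(W1 %*% x + b1)   # hidden layer: first function in the chain
  sigmoid(W2 %*% h + b2)        # output layer: probability of class 1
}

set.seed(1)
W1 <- matrix(rnorm(3 * 2), 3, 2); b1 <- rnorm(3)   # 2 inputs -> 3 hidden units
W2 <- matrix(rnorm(1 * 3), 1, 3); b2 <- rnorm(1)   # 3 hidden units -> 1 output
p <- forward(c(0.5, -1.2), W1, b1, W2, b2)         # a single prediction in (0, 1)
```

Training replaces the random weights with values that minimize a loss, which is where the optimization methods below come in.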
To find a local minimum of a function with gradient descent, one takes steps proportional to the negative of the gradient (or an approximation of it) at the current point. SGD (stochastic gradient descent) is a gradient descent method for minimizing an objective function that is written as a sum of differentiable functions: the true gradient is approximated by the gradient at a single example. Computing the gradient over a small set of training examples (a "mini-batch") at each step can perform significantly better than pure stochastic gradient descent, because the code can exploit vectorized libraries rather than computing each example separately; it may also yield smoother convergence. Simultaneously finding all optimal parameters of a network would require tedious calculation, but backpropagation makes the optimization modular by propagating gradients layer by layer.
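The mini-batch update can be sketched in base R with logistic regression, whose gradient has a simple closed form. The data, learning rate, and batch size here are illustrative assumptions.

```r
set.seed(42)
n <- 1000
X <- cbind(1, rnorm(n))                       # intercept + one feature
true_beta <- c(-0.5, 2)
y <- rbinom(n, 1, 1 / (1 + exp(-X %*% true_beta)))

beta <- c(0, 0); lr <- 0.1; batch <- 32
for (epoch in 1:50) {
  idx <- sample(n)                            # shuffle examples each epoch
  for (start in seq(1, n, by = batch)) {
    b <- idx[start:min(start + batch - 1, n)]
    p <- 1 / (1 + exp(-X[b, , drop = FALSE] %*% beta))
    # mini-batch approximation of the true gradient
    grad <- t(X[b, , drop = FALSE]) %*% (p - y[b]) / length(b)
    beta <- beta - lr * grad                  # step against the gradient
  }
}
```

After training, `beta` approaches the true coefficients (-0.5, 2); averaging the gradient over 32 examples gives a less noisy step than a single example would.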
The IRT (item response theory) model, widely applied to computer-based tests such as the TOEFL, GRE, and GMAT, evaluates students' achievement. Based on a logistic functional form, it estimates each student's ability and each problem's difficulty, and predicts the probability of a correct answer by weighing the ability against the difficulty.
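In its simplest (Rasch-type) form, the item response function is a logistic curve in the gap between ability theta and difficulty b:

```r
# P(correct) rises as ability exceeds difficulty, falls as difficulty exceeds ability
p_correct <- function(theta, b) 1 / (1 + exp(-(theta - b)))

p_correct(0, 0)   # ability equals difficulty: probability 0.5
p_correct(2, 0)   # strong student, average item: high probability
p_correct(0, 2)   # average student, hard item: low probability
```

The author's model is richer than this one-parameter form, but the weighing of ability against difficulty works the same way.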
I build a hierarchical Bayesian IRT model and estimate the student- and problem-level parameters with a Bayesian MCMC (Metropolis-Hastings) algorithm.
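To illustrate the flavor of the sampler, here is a minimal random-walk Metropolis-Hastings update for a single student's ability, holding item difficulties fixed. The data, the N(0, 1) prior, and the proposal scale are all illustrative assumptions; this is not the author's full hierarchical model.

```r
set.seed(7)
b <- c(-1, 0, 1)                # fixed item difficulties (assumed known here)
y <- c(1, 1, 0)                 # observed correct (1) / incorrect (0) responses

# Log posterior of ability theta: Bernoulli likelihood + N(0, 1) prior
log_post <- function(theta) {
  p <- 1 / (1 + exp(-(theta - b)))
  sum(dbinom(y, 1, p, log = TRUE)) + dnorm(theta, 0, 1, log = TRUE)
}

n_iter <- 5000
theta <- numeric(n_iter); theta[1] <- 0
for (t in 2:n_iter) {
  prop <- theta[t - 1] + rnorm(1, 0, 0.5)        # random-walk proposal
  log_alpha <- log_post(prop) - log_post(theta[t - 1])
  theta[t] <- if (log(runif(1)) < log_alpha) prop else theta[t - 1]
}
post_mean <- mean(theta[-(1:1000)])              # posterior mean after burn-in
```

The full model alternates analogous updates over all student- and problem-level parameters.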
I wrote the code to estimate the Bayesian IRT model from scratch. If you have further interest, please let me know.
The ROC (receiver operating characteristic) curve illustrates the performance of a binary classifier, so it is widely used to compare classifiers. The following commands plot the ROC curves of the four binary classifiers: Bayesian logistic regression, deep learning, random forest, and gradient boosting trees. I plot two sets of ROC curves, one for the training data set and one for the test data set.
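Since the original plotting command is not shown, here is a base-R sketch of building one ROC curve from predicted probabilities; `labels` and `scores` are synthetic stand-ins for one classifier's test-set predictions.

```r
set.seed(3)
labels <- rbinom(500, 1, 0.5)
scores <- labels + rnorm(500)        # noisy scores: an imperfect classifier

# Sweep the threshold from high to low by sorting on the score
ord <- order(scores, decreasing = TRUE)
tpr <- cumsum(labels[ord] == 1) / sum(labels == 1)   # true positive rate
fpr <- cumsum(labels[ord] == 0) / sum(labels == 0)   # false positive rate
plot(fpr, tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate", main = "ROC curve")
abline(0, 1, lty = 2)                # chance line

auc <- sum(diff(c(0, fpr)) * tpr)    # area under the curve (step approximation)
```

Repeating this for each classifier's scores on the same axes gives the comparison plots described below; the `ROCR` and `pROC` packages wrap the same computation.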
Bayesian logistic regression shows the best performance on the training data set. The machine learning methods usually have a larger set of parameters, so they impose stricter regularization to avoid overfitting. Accordingly, Bayesian logistic regression underperforms the machine learning methods on the test data set. The tree-based methods (random forest and gradient boosting trees) outperform deep learning, but the differences in accuracy are below 0.3 percentage points. We can therefore conclude that the machine-learning-based classifiers generalize better than the statistical model because of their stricter regularization, while the machine learning classifiers themselves are hardly distinguishable from one another in terms of accuracy on our data set.