By Esperanza López Aguilera
As part of Kx25, the international kdb+ user conference held on May 18th, a series of seven JupyterQ notebooks was released and is now available on https://code.kx.com/q/ml/. Each notebook demonstrates how to implement a different machine learning technique in kdb+, primarily using embedPy, to solve machine learning problems ranging from feature extraction to fitting and testing a model. These notebooks act as a foundation for our users, allowing them to manipulate the code and get access to the exciting world of machine learning within Kx. (For more about the Kx machine learning team, please watch Andrew Wilson’s presentation at Kx25 on the Kx YouTube channel.)
Decision trees were introduced in the previous blog in this series, where they performed well on the particular problem covered. However, when more complex datasets come into play, decision trees can suffer from overfitting or instability. To get around these drawbacks, random forests are used instead: they keep the advantages of decision trees and add some important new benefits.
The random forest algorithm is an ensemble method commonly used for both classification and regression problems. It combines multiple decision trees and outputs an average prediction. Because it can be considered a collection of decision trees (a forest), it offers the same advantages as an individual tree: it can manage a mix of continuous, discrete and categorical variables; it requires neither data normalization nor pre-processing; it is not complicated to interpret; and it automatically performs feature selection and detects interactions between variables. In addition, random forests solve some of the issues presented by decision trees: they reduce variance and overfitting and provide more accurate and stable predictions. This is all achieved by making use of two different techniques: bagging (or bootstrap aggregation) and boosting.
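To make the bagging idea concrete, here is a minimal sketch in plain Python. It is not the notebook's code: the helper names (`fit_stump`, `bagged_stumps`, `predict`) and the toy data are assumptions made for illustration, and each "tree" is reduced to a single threshold on one feature. Every model is fitted on a bootstrap sample drawn with replacement, and the models are then combined by majority vote. A real random forest also randomizes the features considered at each split, which this sketch leaves out.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a one-feature threshold classifier: predict 1 when x >= threshold."""
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in points}):
        errors = sum((x >= threshold) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagged_stumps(points, n_models, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump(rng.choices(points, k=len(points))) for _ in range(n_models)]

def predict(thresholds, x):
    """Combine the individual stumps by majority vote."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: larger x tends to mean label 1, with one noisy point at x=2.
data = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0),
        (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]

forest = bagged_stumps(data, n_models=25)
print("thresholds chosen by the stumps:", sorted(forest))
print("votes for x=1 and x=8:", predict(forest, 1), predict(forest, 8))
```

Because each stump sees a slightly different sample, the thresholds vary from model to model; voting across them smooths out the influence of individual noisy points.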
On the other hand, boosting is increasingly used to turn weak learners into strong learners. Several base machine learning models (weak learners) are trained and combined into a single model afterwards (strong learner). In order to do that, an iterative process is followed:
- A simple model assigning the same importance to all the observations is fitted.
- Another model is trained giving higher importance to instances that were misclassified by the previous base model.
- The previous step is repeated until the desired number of models or accuracy is achieved.
In the case of random forests, the weak learners are decision trees that are combined to create a better predictive model by using average or weighted average in regression problems or by voting in classification problems. This technique also results in reduced bias and variance.
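The iterative process described above can be sketched with an AdaBoost-style loop. This is only one common way of realizing boosting, chosen here for illustration; it is not necessarily the scheme used by the libraries in the notebook, and the function names and toy data are hypothetical. The weak learners are one-feature threshold rules, labels are +1 and -1, and the final prediction is a weighted vote.

```python
import math

def fit_weighted_stump(xs, ys, weights):
    """Return (weighted_error, threshold, sign) of the best one-feature stump.

    The stump predicts `sign` when x >= threshold and `-sign` otherwise.
    """
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x >= threshold else -sign for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    return best

def boost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n  # first model: every observation has equal importance
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, sign = fit_weighted_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # this stump's say in the vote
        ensemble.append((alpha, threshold, sign))
        # Next model: misclassified points gain weight, correct ones lose weight.
        preds = [sign if x >= threshold else -sign for x in xs]
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data that no single threshold can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = boost(xs, ys, n_rounds=3)
correct = sum(predict(ensemble, x) == y for x, y in zip(xs, ys))
print(f"{correct} of {len(xs)} training points classified correctly")
```

On this toy data a single stump classifies 7 of the 10 points correctly, while the weighted vote of three stumps classifies all 10, which is the sense in which weak learners are turned into a strong learner.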
This last notebook in the series aims once again to serve as an example of how to merge the best of both q and Python via embedPy. It allows us to load the data into kdb+, where it can be pre-processed and explored easily using simple q functions or queries. Once the data has been prepared, it can be fed to a machine learning algorithm imported from any Python library. In this example, we work with scikit-learn and XGBoost to build a random forest. Finally, we make predictions with the fitted model and explore the results in different ways, such as plots created with matplotlib.
Firstly, we load the Santander Customer Satisfaction dataset, obtained from Kaggle as a csv file, into q. The dataset consists of 370 features of 76,020 customers along with a binary target variable that indicates whether a customer was satisfied with the service offered by the bank. The data exploration stage shows that 96.04% of the clients were satisfied while 3.96% were not, which means that a classifier always predicting that a client was satisfied would achieve roughly 96% accuracy on this dataset. However, such a classifier is not useful, so we would like to train a random forest that can identify when a customer was not satisfied based on the 370 provided features.
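The weakness of accuracy on imbalanced data can be checked with a few lines of Python. The labels below are hypothetical (10,000 made-up customers with the same 96.04% / 3.96% balance, encoded here as 1 = satisfied and 0 = not satisfied); they are not the Kaggle data.

```python
# Hypothetical labels mirroring the class balance quoted above
# (1 = satisfied, 0 = not satisfied); this is not the Kaggle dataset.
n_customers = 10_000
labels = [1] * 9_604 + [0] * 396

# A "classifier" that always predicts that the customer was satisfied.
always_satisfied = [1] * n_customers

accuracy = sum(p == y for p, y in zip(always_satisfied, labels)) / n_customers

# Share of genuinely unsatisfied customers that the classifier identifies.
unsatisfied_found = sum(p == 0 and y == 0 for p, y in zip(always_satisfied, labels))
recall_unsatisfied = unsatisfied_found / labels.count(0)

print(f"accuracy: {accuracy:.4f}")                                   # 0.9604
print(f"recall on unsatisfied customers: {recall_unsatisfied:.4f}")  # 0.0000
```

The accuracy looks excellent, yet not a single unsatisfied customer is detected, which is why additional measures such as log loss and the area under the ROC curve are reported later on.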
When training a model, it is important to measure its performance on a different dataset. We therefore first split the data into training and test datasets using the traintestsplit function defined in functions.q. We keep 70% of the data to train the model, while the remaining 30% is used to measure its performance afterwards. We also create the function results to facilitate the process of training the model, obtaining predictions and showing results. It takes as arguments the random forest classifier you want to train, its name, its arguments and the number of trees to grow in the random forest. It does not return any result, but it displays the logarithmic loss (log loss), accuracy and area under the ROC curve obtained by the trained model. These measures are computed and shown for both the training and test datasets because comparing them allows us to tell whether a model genuinely learned from the training data or just memorized it.
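As a rough illustration of what the split and two of the reported measures involve, here is a plain-Python sketch. The function names `split_train_test`, `log_loss` and `accuracy` are hypothetical stand-ins and do not reproduce traintestsplit or results from the notebook; the seed and example values are likewise assumptions.

```python
import math
import random

def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and return (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels given predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    return sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred)) / len(y_true)

rows = list(range(100))
train, test = split_train_test(rows)
print(len(train), len(test))  # 70 30

y = [1, 0, 1, 1]
p = [0.9, 0.2, 0.6, 0.4]      # hypothetical predicted probabilities of class 1
print(round(log_loss(y, p), 4), accuracy(y, p))  # 0.4389 0.75
```

Log loss penalizes confident wrong probabilities heavily, so it complements accuracy, which only looks at the thresholded prediction.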
In a first attempt to build an appropriate classifier, we import into q the RandomForestClassifier from the Python module sklearn.ensemble, which allows us to build a random forest using bagging. Five different forests populated by 1, 5, 10, 50 and 100 trees are created and tested. While they provide very high accuracy on the training dataset, they perform poorly on the test dataset, which suggests we are overfitting the data. Since overfitting is usually handled by reducing the degrees of freedom of the model, we train the same random forests again with the same numbers of trees but a smaller maximum depth. This constraint has the expected result: the accuracies on the training and test datasets become comparable.
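The experiment can be approximated directly in Python with scikit-learn. This is a sketch under stated assumptions rather than the notebook's code: the data is synthetic (generated with `make_classification`, not the Santander set), the `evaluate` helper is hypothetical, and the depth limit of 3 is an arbitrary choice because the article does not state the value used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (an assumption; not the Santander set).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], flip_y=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

def evaluate(n_trees, max_depth):
    """Fit a forest and return log loss, accuracy and AUC on train and test."""
    model = RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth,
                                   random_state=0)
    model.fit(X_train, y_train)
    scores = {}
    for name, X_part, y_part in (("train", X_train, y_train),
                                 ("test", X_test, y_test)):
        proba = model.predict_proba(X_part)[:, 1]
        scores[name] = {
            "log_loss": log_loss(y_part, proba),
            "accuracy": accuracy_score(y_part, model.predict(X_part)),
            "auc": roc_auc_score(y_part, proba),
        }
    return scores

results = {}
for max_depth in (None, 3):  # None lets trees grow fully; 3 is an arbitrary cap
    for n_trees in (1, 5, 10, 50, 100):
        results[(n_trees, max_depth)] = evaluate(n_trees, max_depth)
        train_acc = results[(n_trees, max_depth)]["train"]["accuracy"]
        test_acc = results[(n_trees, max_depth)]["test"]["accuracy"]
        print(f"trees={n_trees:3d} max_depth={max_depth}: "
              f"train acc={train_acc:.3f} test acc={test_acc:.3f}")
```

Comparing the gap between training and test scores for the unconstrained and the depth-limited forests shows the effect described above: limiting depth trades some training accuracy for closer agreement between the two datasets.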
Furthermore, as mentioned above, results can also be examined graphically by taking advantage of embedPy. Several functions are provided in graphics.q to create plots that are useful in different situations. Among these functions, we choose displayROCcurve to better visualize our results.
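For readers unfamiliar with how a ROC curve is built, the following plain-Python sketch computes its points and the area under it. It does not reproduce displayROCcurve from graphics.q; the function names and the six example scores are hypothetical, and the plotting step is omitted.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, sweeping the
    decision threshold from the highest score to the lowest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels and predicted scores for six customers.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
print(curve)
print(round(auc(curve), 4))  # 0.8889
```

An area of 1 means every positive example is ranked above every negative one, while 0.5 corresponds to random ranking; here 8 of the 9 positive-negative pairs are ordered correctly, giving 8/9.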
Finally, we try to improve the results by employing a random forest algorithm that brings together bagging and boosting. XGBClassifier is imported from the XGBoost library for this purpose, given that it combines both techniques and provides more control over bagging than the previously used algorithm. The function results and the ROC curve are used again to test this new classifier: log loss is much lower, and both accuracy and the area under the ROC curve increase considerably. Consequently, performance can be considered better.
If you would like to further investigate the uses of embedPy and machine learning algorithms in Kx, check out the ML 06 Random Forests notebook on GitHub (github.com/kxsystems/mlnotebooks). You can set up your machine learning environment with Anaconda, which integrates with your Python installation, or you can build your own by downloading kdb+, embedPy and JupyterQ. You can find the installation steps on code.kx.com/q/ml/setup/.
Other articles in this JupyterQ series of blogs by the Kx Machine Learning Team:
Natural Language Processing in kdb+ by Fionnuala Carr
Neural Networks in kdb+ by Esperanza López Aguilera
Dimensionality Reduction in kdb+ by Conor McCarthy
Classification Using k-Nearest Neighbors in kdb+ by Fionnuala Carr
Feature Engineering in kdb+ by Fionnuala Carr
Decision Trees in kdb+ by Conor McCarthy
Niklas Donges. The Random Forest Algorithm. Available from: https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd
Leo Breiman and Adele Cutler. Random Forests. Available from: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Zachary Jones and Fridolin Linder. Exploratory Data Analysis using Random Forest. Available from: http://zmjones.com/static/papers/rfss_manuscript.pdf
Sunil Ray. Quick Introduction to Boosting Algorithms in Machine Learning. Available from: https://www.analyticsvidhya.com/blog/2015/11/quick-introduction-boosting-algorithms-machine-learning/
Leo Breiman. Random Forests. Available from: https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf