By Diane O’Donoghue
The Kx machine learning team has an ongoing project of periodically releasing useful machine learning libraries and notebooks for kdb+. These libraries and notebooks act as a foundation for our users, allowing them to use the ideas presented, and the code provided, to access the exciting world of machine learning with Kx.
This blog gives an overview of the machine learning toolkit (ML-Toolkit), covering the functionality available and giving insight into workflow design. The blog also highlights a refactoring of the machine learning notebooks provided on the Kx GitHub.
As with all libraries released from the Kx machine learning team, the ML-Toolkit and its constituent sections are available as open-source software under the Apache 2.0 license, and are supported for our clients.
The Kx machine learning team was formed in 2017 with the goal of providing the wider Kx community with user-friendly procedures and interfaces for the application of machine learning. The key milestones in this development to date have been:
- The releases of the embedPy and JupyterQ interfaces in late 2017
- The release of the Natural Language Processing (NLP) library in mid 2018
- The development of the ML-Toolkit over the past year
The ML-Toolkit provides kdb+/q users with libraries for data preprocessing, feature extraction and model evaluation. In conjunction with embedPy, this allows machine learning methods to be applied seamlessly to kdb+ data.
The embedPy interface is part of Fusion for kdb+ and allows the application of Python functions to q data within a kdb+ process. The reduction in data movement and the ability to leverage algorithms developed by the Python community makes this a powerful tool for machine learning. A full outline of its functionality can be found here.
Combining embedPy with the ML-Toolkit allows users to implement machine learning pipelines as close to the data as possible, while leveraging the best aspects of both q and Python. Fast/scalable data manipulation in q allows for optimized pre and post processing, while Python exposes a vast universe of ML algorithms.
Below is an outline of the sections found at present within the toolkit. Note, however, that the toolkit is continuously being updated with the addition of new sections, along with improved performance capabilities and extensions to the existing functionality.
Documentation of the full library, with example implementations of each of the procedures, can be found here.
Sections within the Toolkit
The ML-Toolkit is currently divided into three distinct sections:
- Utility functions
- FRESH (FeatuRE Extraction on the basis of Scalable Hypothesis tests)
- Cross-validation and grid search procedures
The initialization procedure allows for the entire toolkit, or specific sections, to be loaded into an environment, depending on the task.
The utilities section contains subsections for preprocessing, metrics and general utilities, each of which is explained in more detail below.
- Preprocessing – These functions convert raw data into a clean dataset suitable for passing to a machine learning algorithm. They are essential, as they remove or manipulate the data so that a machine learning model can be run on it. Cleaning the data is achieved using techniques such as:
- Removing columns of zero variance
- Filling null/infinite values
- Encoding categorical data
- Scaling of values
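The toolkit implements these steps in q; purely as an illustrative sketch (with hypothetical column names, not the toolkit's API), the same four cleaning steps can be expressed in Python with pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: a constant column, a null value and a categorical column
df = pd.DataFrame({
    "const": [1.0, 1.0, 1.0, 1.0],     # zero variance: dropped
    "x":     [1.0, np.nan, 3.0, 5.0],  # null: filled with the column mean
    "cat":   ["a", "b", "a", "b"],     # categorical: one-hot encoded
})

df = df.loc[:, df.nunique() > 1]           # remove columns of zero variance
df["x"] = df["x"].fillna(df["x"].mean())   # fill null values
df = pd.get_dummies(df, columns=["cat"])   # encode categorical data
df["x"] = (df["x"] - df["x"].min()) / (df["x"].max() - df["x"].min())  # min-max scaling
```

After these steps every column is numeric, complete and on a comparable scale, which is the form most models expect.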
- Metrics – Functions contained in this section are used to evaluate the performance of a trained machine learning model. The scoring functions contained in the toolkit cover both regression and classification tasks and include the most commonly used metrics:
- Accuracy scores
- Precision and recall scores
- Correlation/Covariance matrices
- F1 and F-beta scores
- Root mean square error
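The toolkit provides q implementations of these metrics; purely to illustrate what they compute, here are minimal numpy versions of accuracy, precision, recall, F1 and root mean square error on toy data:

```python
import numpy as np

# Toy classification labels: true values and a model's predictions
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])

accuracy = np.mean(y_true == y_pred)            # fraction of correct predictions
tp = np.sum((y_pred == 1) & (y_true == 1))      # true positives
precision = tp / np.sum(y_pred == 1)            # correct among predicted positives
recall = tp / np.sum(y_true == 1)               # correct among actual positives
f1 = 2 * precision * recall / (precision + recall)

# Toy regression values for root mean square error
r_true = np.array([1.0, 2.0, 3.0])
r_pred = np.array([1.5, 2.0, 2.5])
rmse = np.sqrt(np.mean((r_true - r_pred) ** 2))
```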
- Utils – This subsection contains functions that are more general purpose in nature. The following are a subset of those available:
- Transforming a pandas DataFrame to a q table (and vice versa)
- Splitting data into train and test sets
- Obtaining unique combinations of columns in a table/matrix
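As a sketch of the train/test-split idea (a hypothetical numpy version, not the toolkit's q implementation): shuffle the row indices once, then cut them into two disjoint portions so that evaluation data never overlaps training data.

```python
import numpy as np

def train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle the row indices, then cut them into train and test portions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(len(X) * (1 - test_size))
    return X[idx[:cut]], X[idx[cut:]], y[idx[:cut]], y[idx[cut:]]

X = np.arange(20).reshape(10, 2)   # ten samples, two features
y = np.arange(10)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
```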
These utilities form the core of most workflows, ensuring that models and algorithms perform optimally and that the results can be evaluated using a variety of suitable metrics.
FRESH (FeatuRE Extraction on the basis of Scalable Hypothesis tests)
Feature extraction plays a central role in the development of machine learning models. Within the toolkit techniques such as one-hot encoding and the extraction of sub categories of time from timestamps can be used to derive information that may be useful to a model.
When dealing with time series data, the extraction of pertinent features is particularly important. In manufacturing, for example, there is often an associated failure rate in the production of hardware components. The application of feature extraction to these time series can allow a mapping between the time series as a whole and the failure/success of the process. This mapping allows machine learning algorithms to be applied with ease in the hope that the features extracted have encoded the information relevant to the categorization task at hand.
The FRESH algorithm is an automated method to complete this with a set of extensible functions to compute the mean, max, kurtosis, Fourier components etc. of each individual time series based on a unique identifying characteristic.
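To illustrate the idea only (the toolkit's q implementation computes a far larger, extensible feature set, including Fourier components), here is a pandas sketch on hypothetical sensor data that derives one row of aggregate features per series identifier:

```python
import pandas as pd

# Hypothetical readings from two sensors, keyed by an id column
raw = pd.DataFrame({
    "id":  ["a"] * 5 + ["b"] * 5,
    "val": [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 2.0, 8.0, 2.0, 2.0],
})

# Collapse each time series into a single feature vector: one row per id
features = raw.groupby("id")["val"].agg(["mean", "max", "min", "std"])
features["kurt"] = raw.groupby("id")["val"].apply(pd.Series.kurt)
```

Each row of `features` can now serve as the input to a classifier predicting, say, failure/success of the process that produced that series.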
While the extraction of features is important, this can also bloat the dataset with information that is unlikely to be useful to a machine learning model and which negatively affects performance. To counteract this, feature significance testing based on the distributions of the features and the targets, has also been implemented to reduce the features to those most likely to provide useful information to the model.
A full outline of both the feature extraction and significance capabilities of FRESH can be found here.
Cross-Validation and Grid Search Procedures
Cross-validation is a method of measuring the performance of a machine learning model on unseen (validation) data. It is commonly used to ensure that a model is not overfitting and will generalize to new data. Grid search is a valuable tool for finding the hyperparameters that best tune a model. The algorithms in this section examine how stable a model is when the volume of data, or the specific subset of data used for validation, is altered.
A number of cross-validation procedures have been implemented in the toolkit, including:
- Stratified K-Fold Cross-Validation
- Shuffled K-Fold Cross-Validation
- Sequentially Split K-Fold Cross-Validation
- Roll-Forward Time-Series Cross-Validation
- Chain-Forward Time-Series Cross-Validation
- Monte-Carlo Cross-Validation
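The q implementations of these splits live in the toolkit; as a minimal numpy sketch of the basic sequential k-fold idea underlying them:

```python
import numpy as np

def kfold_indices(n, k):
    """Cut n sequential row indices into k folds; each fold acts once as
    the validation set while the remaining folds form the training set."""
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

splits = list(kfold_indices(10, 5))
```

The stratified, shuffled and Monte-Carlo variants differ only in how the indices are assigned to folds, not in this train/validate rotation.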
In particular, the roll-forward and chain-forward cross-validation methods are suited to time-series datasets, ensuring that a strict temporal order is kept so that future observations are not included when building a forecast model, a problem known as leakage.
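A hypothetical numpy sketch of the two schemes (not the toolkit's exact fold definitions): in both, validation indices always lie strictly after the training indices, which is what prevents leakage.

```python
import numpy as np

def roll_forward(n, k):
    """Roll-forward splits: train on one chunk, validate on the next,
    so validation data is always strictly later than training data."""
    chunks = np.array_split(np.arange(n), k + 1)
    for i in range(k):
        yield chunks[i], chunks[i + 1]

def chain_forward(n, k):
    """Chain-forward splits: the training window grows to include all
    earlier chunks; validation is still the next chunk."""
    chunks = np.array_split(np.arange(n), k + 1)
    for i in range(k):
        yield np.concatenate(chunks[: i + 1]), chunks[i + 1]

rolls = list(roll_forward(12, 3))
chains = list(chain_forward(12, 3))
```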
A grid search procedure has been implemented for each of the methods mentioned above, allowing users to optimize the hyper-parameters for a given algorithm.
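As a self-contained illustration of the idea (all names here are hypothetical; the toolkit's grid-search functions wrap arbitrary models), a numpy grid search that scores each candidate ridge-regression penalty by its average k-fold validation error and keeps the best one:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression coefficients."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def grid_search(X, y, lambdas, k=5):
    """Score each hyperparameter value by mean validation MSE over
    k sequential folds; return the best-scoring value."""
    folds = np.array_split(np.arange(len(X)), k)
    def cv_mse(lam):
        errs = []
        for i in range(k):
            tr = np.concatenate([f for j, f in enumerate(folds) if j != i])
            w = ridge_fit(X[tr], y[tr], lam)
            errs.append(np.mean((X[folds[i]] @ w - y[folds[i]]) ** 2))
        return np.mean(errs)
    return min(lambdas, key=cv_mse)

# Synthetic data with a known linear signal plus small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
best = grid_search(X, y, [0.01, 0.1, 1.0, 10.0])
```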
Standardized multiprocess distribution framework
Both the FRESH and cross-validation libraries support multiprocessing through the distribution of jobs to worker processes. This framework can be used to distribute both Python and q code. A full outline of this functionality can be found here; as such, the following is only a cursory overview.
A q process is first initialized with multiple worker processes, a centralized port and the machine learning library:
$ q ml/ml.q -s -4 -p 5000
The workers can then be initialized with the desired library (FRESH, for example) using the framework's initialization function.
Within the toolkit, work involving the FRESH and cross-validation functions will automatically be distributed to the workers (peached) if this format is followed.
Machine Learning Demonstration Notebooks
Throughout the development of embedPy, the NLP library and all the sections above, a number of Jupyter notebooks have been released to highlight functionality and example use-cases.
Previously, these have been split across the various repositories on the Kx Systems GitHub. To coincide with the release of this blog, these notebooks have been refactored to include the latest toolkit functionality and have been centralized to an mlnotebooks repository.
These notebooks cover both embedPy and ML-toolkit functionality, and provide template workflows for machine learning in the following areas:
- Decision Trees
- Random Forests
- Neural Networks
- Dimensionality Reduction
- Feature Engineering
- Cross Validation
- Natural Language Processing
- K-Nearest Neighbors
If you would like to investigate the uses of the machine learning toolkit further, check out the ML-Toolkit on the Kx GitHub for the complete list of available functions. To set up your machine learning environment, you can use Anaconda to integrate with your Python installation, or build your own by downloading kdb+, embedPy and JupyterQ. You can find the installation steps here.