By Diane O’Donoghue
The KX machine learning team has an ongoing project of periodically releasing useful machine learning libraries and notebooks for kdb+. These libraries and notebooks act as a foundation for our users, allowing them to use the ideas presented and the code provided to access the exciting world of machine learning with KX.
This release, which is the first in a series of planned releases in 2019, provides updates to the functionality of the FRESH (Feature Extraction based on Scalable Hypothesis tests) algorithm, together with a number of new accuracy metrics, preprocessing functions and utilities. In conjunction with the code changes, the namespace structure of the toolkit has been modified to streamline the code and improve user experience.
The toolkit is available in its entirety on the KX GitHub here, with supporting documentation on code.kx.com.
As with all the libraries released from the KX machine learning team, the ML-Toolkit and its constituent sections are available as open source, Apache 2 software, and are supported for our clients.
Background
The Machine Learning Toolkit (ML-Toolkit) contains general use functions for preprocessing data and scoring the results from machine learning algorithms. These can be used alongside the FRESH algorithm to allow users to easily perform machine learning tasks on structured time-series data.
Since the initial release of the ML-Toolkit, numerous functions have been added or updated in order to improve performance, add functionality and allow machine learning tasks to be performed on a broader range of datasets.
The changes that have been made are outlined briefly below. Full documentation of the expected behavior of the functions is available at https://code.kx.com/v2/ml/
Technical Description
Utilities
The utilities section of the toolkit has now been split into three distinct sections – preprocessing, metrics and utils. This structure allows for future expansion of the toolkit into a wider variety of sections and for the individual loading of specific sections of the utilities.
The major change within this section is the removal of the `.ml.util` namespace. All functions within utilities are now contained in the `.ml` namespace, removing the ambiguity that arose between true utility functions and the remaining toolkit functionality.
The primary additions to the toolkit have been made within the preprocessing and metrics sections. As only aesthetic changes to outputs were made within the utils script, such modifications are not outlined here.
Preprocessing
Additional functions have been added to the toolkit to preprocess data and deal with the inability of machine learning models to handle specific data types, namely categorical and date/time types.
The forms of encoding created to handle such data types are as follows:
- Frequency encoding
- Lexicographical encoding
- Time-split encoding
Preprocessing categorical features via frequency or lexicographical encoding can produce a marked improvement in performance over one-hot encoding methods.
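To make the first two encodings concrete, here is a minimal sketch in plain Python rather than q. It is a conceptual illustration only and not the toolkit's implementation: the function names are hypothetical, and treating "frequency" as the proportion of rows in which a category occurs is an assumption.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each category with the fraction of rows in which it occurs.

    Assumption: 'frequency' means proportion of rows, not a raw count.
    """
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


def lexicographic_encode(values):
    """Replace each category with its rank among the sorted distinct categories."""
    ranks = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [ranks[v] for v in values]


colours = ["red", "blue", "red", "green", "red", "blue"]
freq = frequency_encode(colours)      # one float per row
lexi = lexicographic_encode(colours)  # one integer per row
```

Both encodings produce a single numeric column regardless of how many distinct categories exist, whereas one-hot encoding adds one column per category.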
Time-series data can play an important role in the outcome of certain models. Splitting a time or datetime column into its constituent parts (such as the day of the week, month or season) during preprocessing gives a machine learning model additional information from which to learn patterns within the data. For example, it may reveal that peak demand for a product is always at the weekend. The new function `.ml.timesplit` separates time and datetime columns into their constituent parts in this way.
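The idea behind time-split encoding can be sketched in plain Python as follows. This is not a call to `.ml.timesplit` and does not claim to reproduce its output columns: the function name, the field names and the northern-hemisphere meteorological seasons are all assumptions made for illustration.

```python
from datetime import datetime

# Assumption: meteorological seasons, northern hemisphere (Dec-Feb is winter).
SEASONS = ["winter", "spring", "summer", "autumn"]


def time_split(ts):
    """Break a datetime into numeric and categorical parts a model can use."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "day_of_week": ts.weekday(),        # Monday is 0, Sunday is 6
        "is_weekend": ts.weekday() >= 5,
        "quarter": (ts.month - 1) // 3 + 1,
        "season": SEASONS[(ts.month % 12) // 3],
        "hour": ts.hour,
        "minute": ts.minute,
        "second": ts.second,
    }


# 9 March 2019 fell on a Saturday.
row = time_split(datetime(2019, 3, 9, 14, 30, 0))
```

A derived flag such as `is_weekend` is what would let a model pick up the weekend demand pattern mentioned above, which is hard to learn from a raw timestamp.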
Metrics
Given the variety of scenarios which may arise, an extensive set of scoring metrics for testing the results of regression and classification models has been supplied within the toolkit. In addition to those available within the initial toolkit release, functions have been added for the computation of F1 score, R2 score, Matthews correlation coefficient and root mean squared error, among others.
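For readers less familiar with these metrics, the textbook definitions can be sketched in plain Python. These are standard formulas written for illustration, with hypothetical function names; they are not the toolkit's q functions, and the sample labels are made up.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(y_true, y_pred, positive=1):
    """Correlation between true and predicted binary labels, in [-1, 1]."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


labels = [1, 0, 1, 1, 0, 0, 1, 0]
preds = [1, 0, 0, 1, 0, 1, 1, 0]
f1 = f1_score(labels, preds)
mcc = matthews_corrcoef(labels, preds)

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
err = rmse(actual, predicted)
r2 = r2_score(actual, predicted)
```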
Given that a variety of new classification metrics are now present, a number of these have been wrapped together to create a table known as a classification report to display the performance of a model in predicting the correct class.
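As a rough picture of what such a report contains, the sketch below builds a per-class summary in plain Python. The choice of columns (precision, recall, F1 and support) follows common practice and is an assumption, as is the function name; the layout of the toolkit's own report may differ.

```python
def classification_report(y_true, y_pred):
    """Return {class: {precision, recall, f1, support}} for each class seen."""
    report = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[cls] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true instances of this class
        }
    return report


y_true = ["cat", "dog", "cat", "cat", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "cat", "dog", "cat"]
report = classification_report(y_true, y_pred)

for cls, row in report.items():
    print(cls, row)
```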
FRESH
For a detailed explanation of how the FRESH algorithm operates, both in regard to feature extraction and selection, please read the relevant blog here. The following shows how the feature extraction and selection procedures have been updated since the last release.
Feature Extraction
Feature extraction is performed by the function `.ml.fresh.createfeatures`, which takes four parameters.
The inputs to the first three parameters are the same as those in the initial release. The major modification to the function is in the fourth parameter.
Previously, this parameter took in a dictionary of the functions to be applied to the dataset, with support only provided for functions that took the data from individual IDs within columns as input. In the new release, `.ml.fresh.createfeatures` allows both single and multi-parameter functions to be applied during feature extraction. This can be done by passing the function a table (defined as default by `.ml.fresh.params`) as the fourth argument.
Functions to be applied to the data are determined by the ‘valid’ column of this table. As such, table updates can be used to limit the functions that are applied or the hyperparameters passed to multi-input functions.
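The role of the ‘valid’ column can be illustrated with a small Python analogy. This is only a sketch of the idea: apart from `valid`, every field name, the helper `count_above` and the function `create_features` are hypothetical, and none of this reflects the actual layout of `.ml.fresh.params`.

```python
import statistics


def count_above(xs, threshold):
    """Multi-parameter feature: how many values exceed a threshold."""
    return sum(x > threshold for x in xs)


# Hypothetical parameter table: one row per function, one grid of
# hyperparameter settings per row, and a flag that switches the row on or off.
params = [
    {"name": "mean", "func": statistics.mean, "param_grid": [{}], "valid": True},
    {"name": "max", "func": max, "param_grid": [{}], "valid": True},
    {"name": "median", "func": statistics.median, "param_grid": [{}], "valid": False},
    {"name": "count_above", "func": count_above,
     "param_grid": [{"threshold": 1.0}, {"threshold": 3.0}], "valid": True},
]


def create_features(series_by_id, params):
    """Apply every valid function, once per hyperparameter setting, to each ID."""
    features = {}
    for sid, xs in series_by_id.items():
        row = {}
        for p in params:
            if not p["valid"]:
                continue  # rows switched off are skipped entirely
            for kwargs in p["param_grid"]:
                suffix = "".join(f"_{k}_{v}" for k, v in kwargs.items())
                row[p["name"] + suffix] = p["func"](xs, **kwargs)
        features[sid] = row
    return features


series_by_id = {"a": [1.0, 2.0, 3.0, 4.0], "b": [0.5, 0.5, 5.0]}
features = create_features(series_by_id, params)

# Updating the table limits what is computed on the next run.
for p in params:
    if p["name"] == "max":
        p["valid"] = False
reduced = create_features(series_by_id, params)
```

Keeping the function list and hyperparameters in a table means the set of extracted features can be changed with ordinary table updates rather than by editing the extraction code.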
Feature Significance
Once feature extraction has been performed on the data, feature significance can be used to select a subset of features that are deemed to be statistically significant. In the previous release, feature significance testing was restricted to the Benjamini-Hochberg-Yekutieli procedure. A reformatting of this function has allowed further methods to be introduced, the options for which are as follows:
- Benjamini-Hochberg-Yekutieli (BHY) procedure – Takes a p-value and determines whether the feature in question meets a False Discovery Rate (FDR) level defined by the user as a float.
- K-significant features – Returns the k features with the lowest p-values.
- Percentile significant features – Returns the features whose p-values fall within the top p percentile.
Any one of these methods can be selected when performing feature significance testing.
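As a conceptual illustration of the three selection rules, the following plain-Python sketch operates on a dictionary of per-feature p-values. It is not the toolkit's q code: the function names and sample p-values are made up, the BHY rule is the standard step-up form with the harmonic-sum correction, and reading "top p percentile" as "the given percentage of features with the lowest p-values" is an assumption.

```python
import math


def bhy_significant(pvalues, fdr):
    """Benjamini-Hochberg-Yekutieli step-up procedure.

    Sort p-values ascending, find the largest rank i with
    p_(i) <= i * fdr / (m * c(m)), where c(m) = sum of 1/j for j = 1..m,
    and keep every feature up to that rank.
    """
    m = len(pvalues)
    c_m = sum(1 / j for j in range(1, m + 1))
    ranked = sorted(pvalues.items(), key=lambda kv: kv[1])
    cutoff = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank * fdr / (m * c_m):
            cutoff = rank
    return [name for name, _ in ranked[:cutoff]]


def k_significant(pvalues, k):
    """Keep the k features with the lowest p-values."""
    return sorted(pvalues, key=pvalues.get)[:k]


def percentile_significant(pvalues, percent):
    """Keep the given percentage of features with the lowest p-values.

    Assumption: 'top p percentile' means the best `percent` percent of features.
    """
    keep = math.ceil(len(pvalues) * percent / 100)
    return sorted(pvalues, key=pvalues.get)[:keep]


pvalues = {"f1": 0.001, "f2": 0.008, "f3": 0.039, "f4": 0.041, "f5": 0.6}

by_bhy = bhy_significant(pvalues, fdr=0.05)
by_k = k_significant(pvalues, k=3)
by_percentile = percentile_significant(pvalues, percent=40)
```

The BHY rule adapts the number of selected features to the evidence in the data, while the other two always return a fixed number or fixed share of features, which can be convenient when a downstream model needs a predictable input width.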
If you would like to further investigate the use of any of the functions contained in the ML-Toolkit, check out the files on our GitHub here and visit https://code.kx.com/v2/ml/toolkit/ to find documentation and the complete list of the functions available within the ML-Toolkit. Example implementations of a wide range of functionality are also available here.
For steps regarding the setup of your machine learning environment, see the installation guide available at https://code.kx.com/v2/ml/
Please do not hesitate to contact ai@kx.com if you have any suggestions or queries.