By Diane O’Donoghue
The Kx machine learning team has an ongoing project of periodically releasing useful machine learning libraries and notebooks for kdb+. These libraries and notebooks act as a foundation for our users, allowing them to use the ideas presented and the code provided to access the exciting world of machine learning with Kx.
This release, the first in a series planned for 2019, provides updates to the functionality of the FRESH (Feature Extraction based on Scalable Hypothesis tests) algorithm and adds a number of accuracy metrics, preprocessing functions and utilities. Alongside the code changes, the namespace structure of the toolkit has been modified to streamline the code and improve the user experience.
As with all libraries released by the Kx machine learning team, the ML-Toolkit and its constituent sections are available as open source, Apache 2 software, and are supported for our clients.
The Machine Learning Toolkit (ML-Toolkit) contains general use functions for preprocessing data and scoring the results from machine learning algorithms. These can be used alongside the FRESH algorithm to allow users to easily perform machine learning tasks on structured time-series data.
Since the initial release of the ML-Toolkit, numerous functions have been added or updated in order to improve performance, add functionality and allow machine learning tasks to be performed on a broader range of datasets.
The changes that have been made are outlined briefly below. Full documentation of the expected behavior of the functions is available at https://code.kx.com/v2/ml/
The utilities section of the toolkit has now been split into three distinct sections – preprocessing, metrics and utils. This structure allows for future expansion of the toolkit into a wider variety of sections and for the individual loading of specific sections of the utilities.
The major change within this section is the removal of the `.ml.util` namespace. All functions within utilities are now contained in the `.ml` namespace to remove ambiguity which arose between true utility functions and remaining toolkit functionality.
The primary additions to the toolkit have been made within the preprocessing and metrics sections. As only aesthetic changes to outputs were made within the utils script, such modifications are not outlined here.
Additional functions have been added to the toolkit to preprocess data and deal with the inability of machine learning models to handle specific data types, namely categorical and date/time types.
The forms of encoding created to handle such behavior are as follows:
- Frequency encoding
- Lexicographical encoding
- Time-split encoding
Preprocessing categorical features with frequency or lexicographical encoding can produce a marked improvement in performance over one-hot encoding methods.
Time-series data can play an important role in the outcome of certain models. Splitting time-series columns into their constituent parts (such as the day of the week, month or season) during preprocessing extracts additional information from which a machine learning model can learn patterns in the data. For example, a model may find that peak demand for a product always falls at the weekend. The following shows how the new function `.ml.timesplit` is used to separate time and datetime columns into their constituent parts.
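To make the first two encodings concrete, the following is a minimal Python sketch of the underlying ideas. It is not the toolkit's q implementation: the function names `freq_encode` and `lexi_encode` and the sample column are hypothetical, and the sketch assumes frequency encoding means replacing a category with its relative frequency, and lexicographical encoding means replacing it with its position among the sorted distinct categories.

```python
from collections import Counter


def freq_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]


def lexi_encode(values):
    """Replace each category with its rank among the sorted distinct categories."""
    order = {v: i for i, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]


# Hypothetical categorical column
colours = ["red", "blue", "red", "green", "red", "blue"]
print(freq_encode(colours))
print(lexi_encode(colours))
```

Both encodings produce a single numeric column regardless of the number of categories, whereas one-hot encoding adds one column per category.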
q)2#timetab:([]`timestamp$2000.01.01+til 5;5?0u;5?10;5?10)
x                             x1    x2 x3
-----------------------------------------
2000.01.01D00:00:00.000000000 21:51 7  6
2000.01.02D00:00:00.000000000 02:55 5  7
q).ml.timesplit[timetab;::] /default behaviour encode all time/date cols
x2 x3 x_dow x_year x_mm x_dd x_qtr x_wd x_hh x_uu x_ss x1_hh x1_uu
------------------------------------------------------------------
7  6  0     2000   1    1    1     0    0    0    0    21    51
5  7  1     2000   1    2    1     0    0    0    0    2     55
q).ml.timesplit[timetab;`x1]
x                             x2 x3 x1_hh x1_uu
-----------------------------------------------
2000.01.01D00:00:00.000000000 6  8  21    51
2000.01.02D00:00:00.000000000 6  1  02    55
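The decomposition itself can be sketched in a few lines of Python. This is an illustration of the idea only, not the code behind `.ml.timesplit`: the function `time_split` is hypothetical, the keys merely echo the column suffixes seen above, and the day-of-week numbering follows Python's convention (Monday is 0), which differs from the numbering in the q output.

```python
from datetime import datetime


def time_split(ts):
    """Break one timestamp into calendar and clock components."""
    return {
        "dow": ts.weekday(),              # Python convention: Monday=0 ... Sunday=6
        "year": ts.year,
        "mm": ts.month,
        "dd": ts.day,
        "qtr": (ts.month - 1) // 3 + 1,   # quarter of the year, 1 to 4
        "wd": int(ts.weekday() < 5),      # 1 on Monday-Friday, 0 at the weekend
        "hh": ts.hour,
        "uu": ts.minute,
        "ss": ts.second,
    }


parts = time_split(datetime(2000, 1, 1, 21, 51, 0))
print(parts)
```

Each component becomes its own numeric feature, so a model can pick up, for instance, a weekend effect directly from the `wd` flag.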
Given the variety of scenarios which may arise, an extensive set of scoring metrics for testing the results of regression and classification models has been supplied within the toolkit. In addition to those available within the initial toolkit release, functions have been added for the computation of F1 score, R2 score, Matthews correlation coefficient and root mean squared error, among others.
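For readers unfamiliar with these metrics, the sketch below gives their standard textbook definitions in plain Python. The function names are hypothetical and are not the toolkit's; the Matthews correlation coefficient is shown for binary labels 0 and 1 only, and returning 0.0 when its denominator vanishes is an assumption of this sketch.

```python
import math


def rmse(pred, true):
    """Root mean squared error between predictions and true values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def r2_score(pred, true):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_t = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for p, t in zip(pred, true))
    ss_tot = sum((t - mean_t) ** 2 for t in true)
    return 1 - ss_res / ss_tot


def matthews_corr(pred, true):
    """Matthews correlation coefficient for binary labels (0/1)."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, true))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, true))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, true))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, true))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # assumed fallback


print(rmse([2.5, 0.0, 2.0, 8.0], [3.0, -0.5, 2.0, 7.0]))
print(r2_score([2.5, 0.0, 2.0, 8.0], [3.0, -0.5, 2.0, 7.0]))
print(matthews_corr([1, 0, 1, 0], [1, 0, 0, 0]))
```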
Given that a variety of new classification metrics are now present, a number of these have been wrapped together to create a table known as a classification report to display the performance of a model in predicting the correct class.
q)xr:1000?2 /vector of predicted labels
q)yr:1000?2 /vector of true labels
q).ml.classreport[xr;yr]
class    | precision recall    f1_score  support
---------| -------------------------------------
0        | 0.5171717 0.4885496 0.5024534 524
1        | 0.4693069 0.4978992 0.4831804 476
avg/total| 0.4932393 0.4932244 0.4928169 1000
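The per-class figures in such a report follow from counts of true positives, false positives and false negatives. The Python sketch below shows one way these could be computed; `class_report` is a hypothetical helper rather than the toolkit function, the labels are made up, and scoring an undefined ratio as 0.0 is an assumption of the sketch.

```python
def class_report(pred, true):
    """Per-class precision, recall, F1 score and support."""
    report = {}
    for c in sorted(set(true) | set(pred)):
        tp = sum(p == c and t == c for p, t in zip(pred, true))
        fp = sum(p == c and t != c for p, t in zip(pred, true))
        fn = sum(p != c and t == c for p, t in zip(pred, true))
        precision = tp / (tp + fp) if tp + fp else 0.0  # assumed fallback
        recall = tp / (tp + fn) if tp + fn else 0.0     # assumed fallback
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall,
                     "f1_score": f1, "support": tp + fn}
    return report


# Hypothetical predicted and true labels
pred = [0, 1, 1, 0, 1, 0, 1, 1]
true = [0, 1, 0, 0, 1, 1, 1, 0]
report = class_report(pred, true)
for label, row in report.items():
    print(label, row)
```

Support is the number of true instances of each class, which is why the two support values in the report above sum to the length of the label vectors.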
For a detailed explanation of how the FRESH algorithm operates, both in regard to feature extraction and selection, please read the relevant blog here. The following shows how the feature extraction and selection procedures have been updated since the last release.
Feature extraction is performed by the function `.ml.fresh.createfeatures`, which takes four parameters.
The inputs to the first three parameters are the same as those in the initial release. The major modification to the function is in the fourth parameter.
Previously, this parameter took a dictionary of the functions to be applied to the dataset, with support provided only for functions that took the data from individual IDs within columns as input. In the new release, `.ml.fresh.createfeatures` allows both single and multi-parameter functions to be applied during feature extraction. This is done by passing the function a table (by default, `.ml.fresh.params`) as the fourth argument.
The structure of this table is outlined below.
q)show ptab:.ml.fresh.params /example of the hyperparam dict
f              | pnum pnames pvals                 valid
---------------|------------------------------------------------------
absenergy      | 0    ()     ()                    1
abssumchange   | 0    ()     ()                    1
count          | 0    ()     ()                    1
autocorr       | 1    ,`lag  ,0 1 2 3 4 5 6 7 8 9  1
binnedentropy  | 1    ,`lag  ,2 5 10               1
c3             | 1    ,`lag  ,1 2 3                1
..
The functions to be applied to the data are determined by the `valid` column. Updating the table therefore makes it possible to limit both the functions that are applied and the hyperparameters passed to multi-input functions.
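As a rough illustration of how a parameter table of this shape can drive feature extraction, consider the Python sketch below. Everything in it is hypothetical: the two feature functions are simplified stand-ins, the `PARAMS` rows only mimic the columns shown above, and the feature naming scheme is an assumption rather than the toolkit's behavior.

```python
from itertools import product


def abs_energy(x):
    """Sum of squared values (a zero-parameter feature)."""
    return sum(v * v for v in x)


def autocorr(x, lag):
    """Lag-k autocorrelation; 0.0 when undefined (a one-parameter feature)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / (n - lag)
    return cov / var


# Hypothetical stand-in for the parameter table: one row per function
PARAMS = [
    {"f": "absenergy", "func": abs_energy, "pnames": [], "pvals": [], "valid": True},
    {"f": "autocorr", "func": autocorr, "pnames": ["lag"], "pvals": [[0, 1, 2]], "valid": True},
]


def create_features(series, params):
    """Apply every valid function once per combination of its parameter values."""
    feats = {}
    for row in params:
        if not row["valid"]:
            continue
        # product() of no iterables yields one empty tuple, so
        # zero-parameter functions are applied exactly once
        for combo in product(*row["pvals"]):
            kwargs = dict(zip(row["pnames"], combo))
            suffix = "".join(f"_{k}_{v}" for k, v in kwargs.items())
            feats[row["f"] + suffix] = row["func"](series, **kwargs)
    return feats


series = [1.0, 2.0, 3.0, 4.0]
feats = create_features(series, PARAMS)
print(feats)

# Restrict extraction to zero-parameter functions by updating the valid flag
only_simple = [dict(row, valid=(len(row["pnames"]) == 0)) for row in PARAMS]
print(create_features(series, only_simple))
```

Keeping the function list, parameter grid and on/off flag in one table means the set of extracted features can be changed with an ordinary table update rather than a code change.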
Once feature extraction has been performed on the data, feature significance can be used to select a subset of features that are deemed to be statistically significant. In the previous release, feature significance testing could only be performed using the Benjamini-Hochberg-Yekutieli procedure. The function has been reformatted to allow further methods to be introduced, the options for which are as follows:
- Benjamini-Hochberg-Yekutieli (BHY) procedure: takes the p-value of each feature and determines whether the feature in question meets a False Discovery Rate (FDR) level defined by the user as a float.
- K-significant features: returns the k best features, that is, those with the lowest p-values.
- Percentile significant features: returns significant features based on the p-value being within the top p percentile.
Below is an example of how these methods are applied:
/Set the target vector of predicted values
q)targets:value exec avg col2+.001*col2 by date from tab
q)t:value cfeats
/return features that have a FDR of 0.05
q)show benj:.ml.fresh.significantfeatures[t;targets;.ml.fresh.benjhoch .05]
`col2_mean`col2_sumval`col2_fftcoeff_maxcoeff_10_coeff_0_real`col2_fftcoeff_m ..
q)count benj
31
/return the 30 best significant features
q)ksig:.ml.fresh.significantfeatures[t;targets;.ml.fresh.ksigfeat 30]
`col2_mean`col2_sumval`col2_fftcoeff_maxcoeff_10_coeff_0_real`col2_fftcoeff_m ..
q)count ksig
30
/return the features within the top 0.45 percentile
q)perc:.ml.fresh.significantfeatures[t;targets;.ml.fresh.percentile .45]
`col1_absenergy`col1_abssumchange`col1_countabovemean`col1_firstmax`col1_firs ..
q)count perc
193
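The three selection rules can be sketched in Python given a mapping from feature name to p-value. This is a conceptual illustration and not the toolkit's code: the function names and p-values are hypothetical, the BHY step uses the standard Benjamini-Yekutieli step-up rule, and the percentile rule is assumed here to keep the fraction p of features with the lowest p-values.

```python
def bhy_significant(pvals, fdr):
    """Benjamini-Yekutieli step-up rule at the given false discovery rate."""
    m = len(pvals)
    c_m = sum(1.0 / j for j in range(1, m + 1))   # harmonic correction term
    ranked = sorted(pvals.items(), key=lambda kv: kv[1])
    cutoff = 0
    for i, (_, p) in enumerate(ranked, start=1):
        if p <= i * fdr / (m * c_m):
            cutoff = i                             # largest rank passing the test
    return [name for name, _ in ranked[:cutoff]]


def k_significant(pvals, k):
    """The k features with the lowest p-values."""
    ranked = sorted(pvals.items(), key=lambda kv: kv[1])
    return [name for name, _ in ranked[:k]]


def percentile_significant(pvals, p):
    """Assumed rule: keep the fraction p of features with the lowest p-values."""
    return k_significant(pvals, int(len(pvals) * p))


# Hypothetical p-values for five features
pvals = {"a": 0.001, "b": 0.008, "c": 0.039, "d": 0.041, "e": 0.6}
print(bhy_significant(pvals, 0.05))
print(k_significant(pvals, 3))
print(percentile_significant(pvals, 0.4))
```

The BHY rule adapts the number of selected features to the evidence in the p-values, while the other two rules return a number of features fixed in advance by the user.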
If you would like to further investigate the use of any of the functions contained in the ML-Toolkit, check out the files on our GitHub here and visit https://code.kx.com/v2/ml/toolkit/ to find documentation and the complete list of the functions available within the ML-Toolkit. Example implementations of a wide range of functionality are also available here.
For steps on setting up your machine learning environment, see the installation guide available at https://code.kx.com/v2/ml/