Automated Machine Learning in kdb+

26 May 2020

By Conor McCarthy

The Kx machine learning team has an ongoing project of periodically releasing useful machine learning libraries and notebooks for kdb+. These libraries and notebooks act as a foundation for our users, allowing them to use the ideas presented and the code provided to access the exciting world of machine learning with Kx.

This blog gives an overview of the newest release from the team, namely an automated machine learning framework (automl) based on the Kx open source machine learning libraries, kdb+ core technology and a number of Python open source libraries accessed via embedPy.

The kdb+ automl framework is available in its entirety on the Kx GitHub with supporting documentation on the automl page of the Kx Developers’ site.

As with all libraries released by the Kx machine learning team, the automl framework and its constituent sections are available as open-source software under the Apache 2 license, and are supported for our clients on a best-effort basis.

Introduction

The automl framework described here is built largely on tools provided within the machine learning toolkit and various open-source libraries available for Python. The purpose of this framework is to provide users with the ability to automate the process of applying machine learning techniques to real-world problems. The framework is designed to allow individuals or organizations without expertise in machine learning to follow a well-defined and general machine learning workflow. It also gives those who understand such workflows the flexibility to make extensive modifications to the framework to suit their use case.

For both the seasoned machine learning engineer and the novice, the workflow steps are the same:

  1. Data preprocessing
  2. Feature extraction and feature selection
  3. Model selection
  4. Hyperparameter tuning
  5. Report, model and configuration persistence

Each of these steps is described in depth in the associated documentation, with particular attention given both to the default behavior of the framework and to how a user can customize it to suit their use case.

In this first release, the framework can be used to solve both regression and classification problems. These can be either problems suited to the FRESH algorithm or non-time-series (defined here as “normal”) machine learning tasks, which have one target associated with each row of the dataset.

Future releases will expand on these capabilities to include more specific time-series functionality, natural language processing capabilities and wider system capabilities.
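To make the difference between the two problem types concrete, the following is a minimal Python sketch of the kind of aggregation a FRESH-style workflow relies on. It is not the framework's q implementation: the function name `aggregate_features` and the three summary statistics are assumptions chosen for illustration, whereas the real feature set comes from the machine learning toolkit. The point is only that many raw rows sharing an identifier collapse into one feature row, so one target is needed per identifier rather than per raw row.

```python
from collections import defaultdict
from statistics import mean


def aggregate_features(rows, id_col, value_cols):
    """Collapse many rows per identifier into one feature row per identifier."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[id_col]].append(row)

    features = {}
    for key in sorted(groups):
        feats = {}
        for col in value_cols:
            vals = [r[col] for r in groups[key]]
            # Three illustrative summary features per column (hypothetical choice)
            feats[col + "_mean"] = mean(vals)
            feats[col + "_min"] = min(vals)
            feats[col + "_max"] = max(vals)
        features[key] = feats
    return features


# Six raw rows spread over two dates give two feature rows,
# so a target vector of length two would be required.
rows = [
    {"date": "2020.01.01", "x1": 1.0, "x2": 10.0},
    {"date": "2020.01.01", "x1": 3.0, "x2": 20.0},
    {"date": "2020.01.01", "x1": 5.0, "x2": 30.0},
    {"date": "2020.01.02", "x1": 2.0, "x2": 40.0},
    {"date": "2020.01.02", "x1": 4.0, "x2": 50.0},
    {"date": "2020.01.02", "x1": 6.0, "x2": 60.0},
]
features = aggregate_features(rows, "date", ["x1", "x2"])
```

The example later in this post has the same shape: a table of 20,000 rows covering 50 distinct dates, paired with a target vector of length 50. A “normal” problem skips the aggregation and expects one target per row.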

The remainder of this blog highlights how this system can be used to solve a regression problem suitable for the FRESH algorithm. It is split into three sections:

  1. Default automated framework
  2. Advanced parameter updates to framework
  3. Model deployment

Default Framework

A full explanation of the default system behavior is given in the user guide. For brevity, the following shows an example workflow being run, followed by particular points of note about it.

// Define the example table for FRESH
q)N:20000
q)fresh_tab:([]"d"$asc N?50;desc N?1f;N?100f;N?1f;asc N?0b)
// Define the associated target data
q)tgt:asc 50?1f
// Define the problem type
q)prob_typ:`fresh
// Define the target type
q)tgt_typ:`reg
// Run the framework with default settings (::)
// displaying output
q).automl.run[fresh_tab;tgt;prob_typ;tgt_typ;::]

The following is a breakdown of information for each of the relevant columns in the dataset 

  | count unique mean      std       min          max       type   
--| ---------------------------------------------------------------
x1| 20000 20000  0.501004  0.2879844 8.079223e-07 0.9999848 numeric
x2| 20000 20000  49.78456  28.95649  0.003722939  99.97618  numeric
x3| 20000 20000  0.4983016 0.2900219 0.0001210535 0.9999987 numeric
x | 20000 50     ::        ::        ::           ::        time   
x4| 20000 2      ::        ::        ::           ::        boolean

Data preprocessing complete, starting feature creation
Feature creation and significance testing complete
Starting initial model selection - allow ample time for large datasets
Total features being passed to the models = 216

Scores for all models, using .ml.mse
GradientBoostingRegressor| 0.0009062366
RandomForestRegressor    | 0.001109588
AdaBoostRegressor        | 0.001404073
Lasso                    | 0.001477753
LinearRegression         | 0.004706263
KNeighborsRegressor      | 0.01119728
MLPRegressor             | 3223.835
RegKeras                 | 115417.8

Best scoring model = GradientBoostingRegressor
Score for validation predictions using best model = 0.01904934

Feature impact calculated for features associated with GradientBoostingRegressor model
Plots saved in /outputs/2020.01.16/run_10.14.36.554/images/

Continuing to grid-search and final model fitting on holdout set

Best model fitting now complete - final score on test set = 0.02795512

Saving down procedure report to /outputs/2020.01.16/run_10.14.36.554/report/
Saving down GradientBoostingRegressor model to /outputs/2020.01.16/run_10.14.36.554/models/
Saving down model parameters to /outputs/2020.01.16/run_10.14.36.554/config/

 

The following points about this workflow are worth noting:

  1. Problem types relating to the application of the FRESH algorithm apply all feature extraction functions within the keyed table .ml.fresh.params.
  2. The feature significance test applied is based on the function .ml.fresh.featuresignificance. It compares each column of the extracted features with the target to find an appropriate subset of features for model fitting.
  3. The best model is chosen by 5-fold cross validation on the training set. An exhaustive grid search over hyperparameters defined in a flat file is then applied to that model.
  4. The workflow saves down the following:
    1. The best model, fit on the training set with the optimized parameters derived from the grid search.
    2. A report document highlighting the complete workflow, images relevant to the run, all scores achieved throughout the run and the hyperparameters used.
    3. A configuration/meta file which defines the steps taken and the information needed to rerun a workflow or to run it on new data.
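The cross validation and grid search described in point 3 can be pictured with a small Python sketch. This is a conceptual illustration under stated assumptions, not the framework's code: the one-parameter toy model, the helper names and the grid of candidate values are all hypothetical, and the real framework reads its hyperparameter grids from a flat file and runs the search in q. The sketch scores every candidate value by 5-fold cross validation and keeps the one with the lowest mean squared error.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    bounds = [i * n // k for i in range(k + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(k)]


def fit_slope(xs, ys, lam):
    """Toy model y = slope * x with a ridge-style penalty lam (hypothetical)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(pred, true):
    """Mean squared error between predictions and true values."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def cv_score(xs, ys, lam, k=5):
    """Mean validation MSE of the toy model over k folds."""
    scores = []
    for fold in kfold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        slope = fit_slope([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(mse([slope * xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(scores) / k


def grid_search(xs, ys, grid, k=5):
    """Score every candidate exhaustively and return the best one with all scores."""
    scored = {lam: cv_score(xs, ys, lam, k) for lam in grid}
    return min(scored, key=scored.get), scored


xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x for x in xs]  # noiseless toy target, so no penalty is the best choice
best_lam, scores = grid_search(xs, ys, [0.0, 1.0, 10.0])
```

Because the search is exhaustive, its cost grows with the product of the number of candidate values for each hyperparameter, which is one reason the framework tunes only the best model rather than all of them.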

Advanced Parameter Modifications

While the default behavior of this system may be suitable in a large number of use cases, many users may need to modify the framework extensively to fit with current systems or to support more in-depth prototyping. To this end, the following features of the pipeline can be modified by a user:

Parameters:
  aggcols     Aggregation columns for FRESH
  funcs       Functions to be applied for feature extraction
  gs          Grid search function and associated no. of folds/percentage
  hld         Size of holdout set on which the final model is tested
  saveopt     Saving options outlining what is to be saved to disk from a run
  scf         Scoring functions for classification/regression tasks
  seed        Random seed to be used
  sigfeats    Feature significance procedure to be applied to the data
  sz          Size of test set for train-test split function
  tts         Train-test split function to be applied
  xv          Cross validation function and associated no. of folds/percentage

Each of these parameters is discussed in depth in the documentation. They can be passed to the function .automl.run as its final parameter, either as a kdb+ dictionary or via file-based input. For brevity, a number of kdb+ dictionary input examples are outlined below, with file input covered in the documentation.
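Conceptually, passing a dictionary as the final parameter overrides a subset of default settings while leaving the rest untouched. The Python sketch below shows one way such a merge could work. The parameter names are those listed above, but everything else is an assumption: the `"<default>"` strings are placeholders rather than the framework's real default values, `apply_overrides` is a hypothetical helper, and rejecting unknown keys is only one possible behavior, not a documented feature of .automl.run.

```python
PARAM_NAMES = ["aggcols", "funcs", "gs", "hld", "saveopt", "scf",
               "seed", "sigfeats", "sz", "tts", "xv"]

# Placeholder values only: these are NOT the framework's actual defaults
DEFAULTS = {name: "<default>" for name in PARAM_NAMES}


def apply_overrides(defaults, overrides):
    """Return a copy of defaults with the user-supplied overrides applied."""
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        # Hypothetical safeguard: fail loudly on misspelled parameter names
        raise KeyError("unrecognised parameters: " + ", ".join(unknown))
    merged = dict(defaults)  # copy so the defaults themselves are never mutated
    merged.update(overrides)
    return merged


# Override two settings and keep the remaining nine at their defaults
params = apply_overrides(DEFAULTS, {"saveopt": 0, "seed": 42})
```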

  1. The following modifications are made to a default workflow for a ‘normal’ machine learning problem:
    1. Do not save any images or the summary report
    2. Change the metric used to score models to root mean squared error
    3. Update the feature extraction functions to be applied in a ‘normal’ feature extraction procedure to include a truncated singular value decomposition
// Key of the updated dictionary
q)mod_key:`saveopt`scf`funcs
// Save option value
q)saveopt:1
// Scoring function to be applied
q)scf:enlist[`reg]!enlist`.ml.rmse
// Feature extraction functions
q)funcs:`.automl.prep.i.truncsvd
// Create the appropriate dictionary
q)dict:mod_key!(saveopt;scf;funcs)
// Apply these to a user defined table and target
q).automl.run[tab;tgt;`normal;`reg;dict]
  2. Produce a workflow which acts on a FRESH type problem with the following modifications:
    1. Do not save any information to disk
    2. Update the grid search and cross validation procedures to perform a 5-fold chain-forward procedure
    3. Use a random seed of 42
// Key of the updated dictionary
q)mods:`saveopt`gs`xv`seed
// Save option value
q)saveopt:0
// Grid search procedure
q)gsearch:(`.ml.gs.tschain;5)
// Cross validation procedure
q)xvproc:(`.ml.xv.tschain;5)
// Random seed
q)seed:42
// Create an appropriate dictionary
q)dict:mods!(saveopt;gsearch;xvproc;seed)
// Apply the updates to a FRESH workflow
q).automl.run[fresh_tab;tgt;`fresh;`reg;dict]
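The chain-forward procedure requested in the FRESH example, and the root mean squared error metric used in the ‘normal’ example, can both be illustrated with a short Python sketch. This shows one common formulation of chain-forward validation, in which the training window grows and always precedes the validation fold. The exact fold layout used by .ml.xv.tschain may differ, and the helper names and the naive mean predictor are assumptions made for illustration.

```python
from math import sqrt


def chain_forward_splits(n, k):
    """Return (train, validation) index lists where training always precedes validation."""
    bounds = [i * n // k for i in range(k + 1)]
    return [
        (list(range(0, bounds[i])), list(range(bounds[i], bounds[i + 1])))
        for i in range(1, k)
    ]


def rmse(pred, true):
    """Root mean squared error between predictions and true values."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


# Ten time-ordered observations cut into five folds give four train/validation pairs
splits = chain_forward_splits(10, 5)

# Score a naive "mean of the training targets" predictor on each pair
ys = [float(i) for i in range(10)]
scores = []
for train, val in splits:
    guess = sum(ys[i] for i in train) / len(train)
    scores.append(rmse([guess] * len(val), [ys[i] for i in val]))
```

Unlike a shuffled k-fold split, no validation point is ever earlier than a training point, which avoids leaking future information into the model when the rows are ordered in time.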

Model Deployment

Once a single run of the system has been completed, a user is likely to want to deploy their model so that it can be used on live or new data. To this end, the function .automl.new, described in the documentation, is provided. It takes as input the new data to which a model/workflow is to be applied, together with the date and time of the original run. It is used as follows:

// Date on which the run to be applied was completed
q)start_date:2020.02.08
// Time at which the run was initiated
q)start_time:11.21.47.763
// Example table here taken from first FRESH example
q)new_tab:5000#fresh_tab
q).automl.new[new_tab;start_date;start_time]
0.01299933 0.05512824 0.05547163 0.07714543 0.08226989 …

For this to be valid a user should ensure that the schema of the new table is consistent with that of the table from which the model was derived.
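A schema check of this kind can be sketched in a few lines of Python. This is a hypothetical pre-flight check, not something the post says .automl.new performs: the helper names are invented for illustration, and tables are modeled as plain dictionaries of column lists. In q itself, comparing the output of `meta` for the two tables would serve a similar purpose.

```python
def schema_of(table):
    """Map each column name to the Python type name of its first value."""
    return {col: type(vals[0]).__name__ for col, vals in table.items()}


def schema_mismatches(reference, candidate):
    """List human-readable differences between two column-to-type mappings."""
    problems = []
    for col, typ in reference.items():
        if col not in candidate:
            problems.append("missing column: " + col)
        elif candidate[col] != typ:
            problems.append("type changed for " + col + ": " + typ + " -> " + candidate[col])
    for col in candidate:
        if col not in reference:
            problems.append("unexpected column: " + col)
    return problems


# Column-oriented toy tables: the training table and two candidate new tables
train = {"x": ["2020.01.01"], "x1": [0.5], "x4": [True]}
good = {"x": ["2020.02.08"], "x1": [0.1], "x4": [False]}
bad = {"x": ["2020.02.08"], "x1": [1]}

ok_report = schema_mismatches(schema_of(train), schema_of(good))
bad_report = schema_mismatches(schema_of(train), schema_of(bad))
```

An empty report means the new table has the same columns and types as the training table. Any entry in the report is a reason to stop before applying the saved model.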

Conclusion

The framework described above provides users with the ability to formalize the process of machine learning framework development and opens up machine learning capabilities more freely to the wider kdb+ community. This beta release is the first of a number of such releases relating to the Kx automated machine learning repository, with expanded functionality and iterative improvements to workflows and user experience to be made in the coming months.

 

If you would like to investigate the uses of the automated machine learning framework further, check out the Kx GitHub to find the complete source code and the functionality available within the automl workflow. To set up your machine learning environment, you can either use Anaconda to integrate it into your Python installation, or build your own, which consists of downloading kdb+, embedPy and JupyterQ. You can find the installation steps here.

Please do not hesitate to contact us if you have any suggestions or queries.

 
