By Conor McCarthy
The Kx machine learning team has an ongoing project of periodically releasing useful machine learning libraries and notebooks for kdb+. These libraries and notebooks act as a foundation for our users, allowing them to use the ideas presented and the code provided to access the exciting world of machine learning with Kx.
This blog gives an overview of the newest release from the team, namely an automated machine learning framework (automl) based on the Kx open source machine learning libraries, kdb+ core technology and a number of Python open source libraries accessed via embedPy.
As with all libraries released from the Kx machine learning team, the automl framework and its constituent sections are available as open source, Apache 2 software, and are supported for our clients on a best effort basis.
The automl framework described here is built largely on tools provided within the machine learning toolkit and various open-source libraries available for Python. The purpose of this framework is to provide users with the ability to automate the process of applying machine learning techniques to real-world problems. The framework is designed to let individuals or organizations without machine learning expertise follow a well-defined, general machine learning workflow, while giving those who understand such workflows the flexibility to modify the framework extensively to suit their use case.
For both the seasoned machine learning engineer and the novice, the workflow steps are the same:
- Data preprocessing
- Feature extraction and feature selection
- Model selection
- Hyperparameter tuning
- Report, model and configuration persistence
Each of these steps is outlined in depth within the associated documentation here, with particular attention given both to the default behavior of the framework and to how a user can customize the framework to suit their use case.
In this first release, the framework can be used to solve both regression and classification problems, related to either the FRESH algorithm or non-time-series (defined here as “normal”) machine learning tasks which have one target associated with each row of the dataset.
Future releases will expand on these capabilities to include more specific time-series functionality, natural language processing capabilities and wider system capabilities.
The remainder of this blog highlights how this system can be used to solve a regression problem suitable for the FRESH algorithm. It is split into three sections:
- Default automated framework
- Advanced parameter updates to framework
- Model deployment
A full explanation of the default system behavior is given in the user guide here. For brevity, the following shows an example workflow being run, followed by particular points of note about the workflow.
    // Define the example table for FRESH
    q)N:20000
    q)fresh_tab:([]"d"$asc N?50;desc N?1f;N?100f;N?1f;asc N?0b)
    // Define the associated target data
    q)tgt:asc 50?1f
    // Define the problem type
    q)prob_typ:`fresh
    // Define the target type
    q)tgt_typ:`reg
    // Run the framework with default settings (::)
    // displaying output
    q).automl.run[fresh_tab;tgt;prob_typ;tgt_typ;::]

    The following is a breakdown of information for each of the relevant columns in the dataset

      | count unique mean      std       min          max       type
    --| ---------------------------------------------------------------
    x1| 20000 20000  0.501004  0.2879844 8.079223e-07 0.9999848 numeric
    x2| 20000 20000  49.78456  28.95649  0.003722939  99.97618  numeric
    x3| 20000 20000  0.4983016 0.2900219 0.0001210535 0.9999987 numeric
    x | 20000 50     ::        ::        ::           ::        time
    x4| 20000 2      ::        ::        ::           ::        boolean

    Data preprocessing complete, starting feature creation

    Feature creation and significance testing complete
    Starting initial model selection - allow ample time for large datasets

    Total features being passed to the models = 216

    Scores for all models, using .ml.mse
    GradientBoostingRegressor| 0.0009062366
    RandomForestRegressor    | 0.001109588
    AdaBoostRegressor        | 0.001404073
    Lasso                    | 0.001477753
    LinearRegression         | 0.004706263
    KNeighborsRegressor      | 0.01119728
    MLPRegressor             | 3223.835
    RegKeras                 | 115417.8

    Best scoring model = GradientBoostingRegressor
    Score for validation predictions using best model = 0.01904934

    Feature impact calculated for features associated with GradientBoostingRegressor model
    Plots saved in /outputs/2020.01.16/run_10.14.36.554/images/

    Continuing to grid-search and final model fitting on holdout set

    Best model fitting now complete - final score on test set = 0.02795512

    Saving down procedure report to /outputs/2020.01.16/run_10.14.36.554/report/
    Saving down GradientBoostingRegressor model to /outputs/2020.01.16/run_10.14.36.554/models/
    Saving down model parameters to /outputs/2020.01.16/run_10.14.36.554/config/
The following are some points of note within this workflow:
- Problem types relating to the application of the FRESH algorithm apply all feature extraction functions within the keyed table .ml.fresh.params.
- The feature significance test applied is based on the function .ml.fresh.featuresignificance and compares each column within the extracted features with the target to find an appropriate subsection of features for model fitting.
- An exhaustive grid search over flat-file defined hyperparameters is applied to the best model found from a 5-fold cross validation on the training set.
- The workflow saves down the following:
  - The best model fit on the training set with the optimized grid search derived parameters
  - A report document highlighting the complete workflow, relevant images to the run, all scores achieved throughout the run and the hyperparameters used
  - A configuration/meta file which defines the steps taken and relevant information needed in order to rerun a workflow or run on new data
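To make the cross-validation and grid-search steps more concrete, the following is a minimal sketch in plain Python rather than q. It is not the automl implementation: the helper names (`kfold_indices`, `fit_predict`, `grid_search`), the toy "shrunken mean" model and the `shrink` hyperparameter are all hypothetical and exist only for illustration. It shows the general idea of scoring every hyperparameter combination with k-fold cross validation and keeping the combination with the lowest mean squared error.

```python
from itertools import product
from statistics import mean


def kfold_indices(n, k):
    """Split range(n) into k contiguous folds; yield (train, test) index lists."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return mean([(a - b) ** 2 for a, b in zip(y_true, y_pred)])


def fit_predict(y_train, n_test, shrink):
    """Hypothetical toy model: predict the training mean scaled by `shrink`."""
    return [shrink * mean(y_train)] * n_test


def grid_search(y, grid, k=5):
    """Score every hyperparameter combination with k-fold CV; return the best."""
    names = sorted(grid)
    results = {}
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, test in kfold_indices(len(y), k):
            preds = fit_predict([y[i] for i in train], len(test), **params)
            fold_scores.append(mse([y[i] for i in test], preds))
        results[values] = mean(fold_scores)
    best = min(results, key=results.get)  # lower MSE is better
    return dict(zip(names, best)), results[best]


# Toy target: a constant series, so an unshrunk mean predicts it exactly
y = [2.0] * 20
best_params, best_score = grid_search(y, {"shrink": [0.5, 1.0, 1.5]})
print(best_params, best_score)
```

An exhaustive search of this kind grows multiplicatively with the number of values per hyperparameter, which is one reason to restrict it to the single best model from the initial selection step.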
Advanced Parameter Modifications
While the default behavior of this system may be suitable for a large number of use cases, many users will need to modify the framework extensively to fit with existing systems or to support more in-depth prototyping. To this end, the following features of the pipeline can be modified by a user.
Parameters:
    aggcols   Aggregation columns for FRESH
    funcs     Functions to be applied for feature extraction
    gs        Grid search function and associated no. of folds/percentage
    hld       Size of holdout set on which the final model is tested
    saveopt   Saving options outlining what is to be saved to disk from a run
    scf       Scoring functions for classification/regression tasks
    seed      Random seed to be used
    sigfeats  Feature significance procedure to be applied to the data
    sz        Size of test set for train-test split function
    tts       Train-test split function to be applied
    xv        Cross validation function and associated no. of folds/percentage
Each of these parameters is discussed in depth here. They can be passed to the function .automl.run as its final parameter, either as a kdb+ dictionary or via file-based input as explained here. For brevity, a number of kdb+ dictionary input examples are shown below; file input is covered in the documentation.
- The following modifications will be made to a default workflow for a ‘normal’ machine learning problem:
  - Do not save any images or the summary report
  - Change the metric used to score models to root mean squared error
  - Update the feature extraction functions to be applied in a ‘normal’ feature extraction procedure to include a truncated singular value decomposition
    // Key of the updated dictionary
    q)mod_key:`saveopt`scf`funcs
    // Save option value
    q)saveopt:1
    // Scoring function to be applied
    q)scf:enlist[`reg]!enlist`.ml.rmse
    // Feature extraction functions
    q)funcs:`.automl.prep.i.truncsvd
    // Create the appropriate dictionary
    q)dict:mod_key!(saveopt;scf;funcs)
    // Apply these to a user defined table and target
    q).automl.run[tab;tgt;`normal;`reg;dict]
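As a brief illustration of what switching the scoring metric means, the sketch below (plain Python, not q, and not the toolkit's own code) computes the mean squared error and the root mean squared error for the same predictions. The function names and sample values are hypothetical. The point is only that RMSE is the square root of MSE, so it is expressed in the same units as the target.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical targets and predictions; only the last prediction is off (by 4)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 8.0]
print(mse(y_true, y_pred))   # 4.0
print(rmse(y_true, y_pred))  # 2.0
```

Both metrics rank models identically, since the square root is monotonic; the choice mainly affects how the reported scores are read.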
- Produce a workflow which acts on a FRESH-type problem with the following modifications:
  - Do not save any information to disk
  - Update the grid search and cross validation procedures to perform a 5-fold chain-forward procedure
  - Use a random seed of 42
    // Key of the updated dictionary
    q)mods:`saveopt`gs`xv`seed
    // Save option value
    q)saveopt:0
    // Grid search procedure
    q)gsearch:(`.ml.gs.tschain;5)
    // Cross validation procedure
    q)xvproc:(`.ml.xv.tschain;5)
    // Random seed
    q)seed:42
    // Create an appropriate dictionary
    q)dict:mods!(saveopt;gsearch;xvproc;seed)
    // Apply the updates to a FRESH workflow
    q).automl.run[fresh_tab;tgt;`fresh;`reg;dict]
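For readers unfamiliar with chain-forward validation, the following plain-Python sketch shows one common reading of the idea. It assumes the rows are in time order and are cut into k contiguous chunks, with each split training on all chunks so far and validating on the next one. The function name `chain_forward_splits` is hypothetical, and the exact splitting performed by `.ml.xv.tschain` may differ in detail.

```python
def chain_forward_splits(n, k):
    """Yield (train, test) index lists for chain-forward validation.

    The n time-ordered rows are cut into k contiguous chunks. Split i trains
    on chunks 0..i-1 and validates on chunk i, so no future row is ever used
    to predict the past. This yields k-1 splits.
    """
    if k < 2 or n < k:
        raise ValueError("need k >= 2 and at least k rows")
    bounds = [i * n // k for i in range(k + 1)]  # chunk boundaries
    for i in range(1, k):
        train = list(range(0, bounds[i]))
        test = list(range(bounds[i], bounds[i + 1]))
        yield train, test


# Ten ordered rows split into five chunks gives four train/validate pairs
for train, test in chain_forward_splits(10, 5):
    print(len(train), test)
```

Unlike shuffled k-fold splitting, every validation chunk here lies strictly after its training rows, which is why schemes of this kind suit time-series problems such as those targeted by FRESH.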
Once a single run of the system has been completed, a user is likely to want to deploy their model so that it can be used on live or new data. To this end, the function .automl.new, documented here, is provided. It takes as input the new data to which the model/workflow is to be applied, together with the date and time of the original run. It is called as follows:
    // Date on which the run to be applied was completed
    q)start_date:2020.02.08
    // Time at which the run was initiated
    q)start_time:22.214.171.1243
    // Example table here taken from first FRESH example
    q)new_tab:5000#fresh_tab
    q).automl.new[5000#fresh_tab;start_date;start_time]
    0.01299933 0.05512824 0.05547163 0.07714543 0.08226989 …
For this to be valid a user should ensure that the schema of the new table is consistent with that of the table from which the model was derived.
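One way to think about this consistency requirement is as a comparison of column names and types before any prediction is attempted. The sketch below is plain Python rather than q and is purely illustrative: `schema_mismatches` is a hypothetical helper, not part of automl, and the schemas are written by hand as column-to-type-character mappings modelled on the example table above.

```python
def schema_mismatches(expected, actual):
    """Return human-readable differences between two {column: type} mappings."""
    problems = []
    for col, typ in expected.items():
        if col not in actual:
            problems.append(f"missing column: {col}")
        elif actual[col] != typ:
            problems.append(
                f"type mismatch for {col}: expected {typ}, got {actual[col]}"
            )
    for col in actual:
        if col not in expected:
            problems.append(f"unexpected column: {col}")
    return problems


# Hypothetical schemas: date, three floats and a boolean in the training table
training_schema = {"x": "d", "x1": "f", "x2": "f", "x3": "f", "x4": "b"}
new_schema = {"x": "d", "x1": "f", "x2": "f", "x3": "j"}

for problem in schema_mismatches(training_schema, new_schema):
    print(problem)
```

An empty result would indicate that the new table matches the training table on both column names and types.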
The framework described above provides users with the ability to formalize the process of machine learning framework development and opens up machine learning capabilities more freely to the wider kdb+ community. This beta release is the first of a number of such releases relating to the Kx automated machine learning repository, with expanded functionality and iterative improvements to workflows and user experience to follow in the coming months.
If you would like to further investigate the uses of the automated machine learning framework, check out the Kx GitHub to find the complete source code and available functionality within the automl workflow. You can use Anaconda to set up your machine learning environment within your Python installation, or you can build your own environment by downloading kdb+, embedPy and JupyterQ. You can find the installation steps here.