Machine Learning in kdb+: k-NN classification and pattern recognition

31 Aug 2017
By Emanuele Melis

As a powerful array-processing technology, kdb+ can be used to great effect in machine learning algorithms. This latest Kx whitepaper covers k-Nearest Neighbor (k-NN) classification in kdb+, a non-parametric statistical method commonly used for pattern recognition.

k-NN assumes that data points lie in a metric space and are represented as n-dimensional vectors, between which distances can be computed. This makes it one of the easiest machine learning algorithms to implement, but it can be impractical in some industry settings because of the computational cost of (1) calculating distance metrics, (2) feature extraction, and (3) classification.
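To make the idea concrete, here is a minimal Python sketch of k-NN classification. It is not the whitepaper's q implementation: the function names (`euclidean`, `manhattan`, `knn_predict`), the `(label, vector)` layout, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train, query, k=3, dist=euclidean):
    """Return the majority label among the k training points nearest to query.

    train is a list of (label, feature_vector) pairs (an assumed layout).
    """
    # Computing a distance to every training point is the expensive step.
    nearest = sorted(train, key=lambda row: dist(row[1], query))[:k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D data, invented for illustration only.
train = [
    ("a", (0.0, 0.0)), ("a", (0.0, 1.0)), ("a", (1.0, 0.0)),
    ("b", (5.0, 5.0)), ("b", (5.0, 6.0)), ("b", (6.0, 5.0)),
]
print(knn_predict(train, (0.5, 0.5)))
print(knn_predict(train, (5.5, 5.5), dist=manhattan))
```

Every prediction scans the whole training set, which is why the cost of the distance calculation dominates as the data grows.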

The paper further examines implementation strategies in kdb+ and the performance of a k-NN classifier used to predict digits in a dataset of handwritten samples, each normalized into an array of 8 (x,y) coordinate pairs. The training set, loaded into kdb+ as mappings from a label to an array of features, was represented as a table keyed on the label, and the distance metric was calculated by applying distance functions to it. A validation set was used to measure the prediction accuracy of the classifier, leveraging q-sql syntax.
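The validation step can be sketched in Python as follows. This is an illustrative assumption rather than the paper's q-sql code: the helper names (`squared_distance`, `nearest_label`, `accuracy`) are hypothetical, the classifier is simplified to a single nearest neighbour, and the sample values are invented. Each sample pairs a digit label with 16 numbers standing for 8 (x,y) pairs.

```python
def squared_distance(a, b):
    """Squared Euclidean distance; the square root is not needed for ranking."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_label(train, sample):
    """1-NN: label of the single closest training sample."""
    return min(train, key=lambda row: squared_distance(row[1], sample))[0]


def accuracy(train, validation):
    """Fraction of validation samples whose predicted label matches the true one."""
    hits = sum(nearest_label(train, feats) == label for label, feats in validation)
    return hits / len(validation)


# (digit label, 16 features = 8 (x, y) pairs). All values are invented.
train = [
    (0, [10, 90, 50, 100, 90, 90, 100, 50, 90, 10, 50, 0, 10, 10, 0, 50]),
    (1, [50, 100, 50, 85, 50, 70, 50, 55, 50, 40, 50, 25, 50, 10, 50, 0]),
]
validation = [
    (0, [12, 88, 48, 100, 92, 91, 100, 52, 88, 12, 50, 0, 8, 10, 0, 48]),
    (1, [52, 100, 50, 86, 49, 70, 51, 55, 50, 41, 50, 24, 50, 10, 48, 0]),
]
print(accuracy(train, validation))
```

Keeping the validation samples separate from the training samples means the accuracy figure reflects predictions on data the classifier has not seen.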

Implementing the k-NN classifier in kdb+ brings the benefits of a high-performance array-processing language with an easy-to-read q-sql syntax, allowing a performant and elegant implementation without external libraries.

The code used in this whitepaper is available on the Kx GitHub.

Emanuele Melis is an expert kdb+/q software engineer currently based in Glasgow, Scotland.
