Dimensionality Reduction in kdb+

14 Jun 2018

By Conor McCarthy

As part of Kx25, a series of seven JupyterQ notebooks was released and is now available on https://code.kx.com/q/ml/. Each notebook demonstrates how to implement a different machine learning technique in kdb+, primarily using embedPy, to solve a range of machine learning problems, from feature extraction to fitting and testing a model. These notebooks act as a foundation for our users, allowing them to manipulate the code and get access to the exciting world of machine learning within Kx.

The notebook described in this third blog in the series* outlines the implementation of dimensionality reduction in kdb+ and focuses primarily on one aspect of this, namely feature extraction. (For more about the Kx machine learning team, please watch Andrew Wilson’s presentation at Kx25 on the Kx YouTube channel.)

Background

Dimensionality reduction methods have been the focus of much interest within the statistics and machine learning communities for a range of applications, and they have a long history as methods for data pre-processing. Dimensionality reduction is the mapping of data to a lower-dimensional space such that uninformative variance in the data is discarded. By doing this we hope to retain only the data that is meaningful to our machine learning problem. In addition, by finding a lower-dimensional representation of a dataset we hope to improve the efficiency and accuracy of machine learning models.

The process of dimensionality reduction can be broken into two categories: feature selection and feature extraction. Feature selection is the process of finding a meaningful subset of the features in the dataset. Feature extraction is the process of creating new features from the original dataset, which may help to interpret the data or facilitate subsequent machine learning. A number of techniques are available; however, this work focuses on two popular and successful techniques, principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE).
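To make the distinction concrete, the following minimal Python sketch (not taken from the notebook) contrasts the two approaches on a small synthetic matrix. The function names select_top_variance and extract_principal are hypothetical, and ranking columns by variance is only one simple selection criterion, chosen here for illustration.

```python
import numpy as np


def select_top_variance(X, k):
    """Feature selection: keep the k original columns with the highest variance."""
    variances = X.var(axis=0)
    # Indices of the k highest-variance columns, kept in original column order
    keep = np.sort(np.argsort(variances)[-k:])
    return X[:, keep], keep


def extract_principal(X, k):
    """Feature extraction: build k new features as linear combinations of all columns."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 5 features with very different spreads
X = rng.normal(size=(200, 5)) * np.array([5.0, 0.1, 3.0, 0.2, 1.0])

X_sel, kept = select_top_variance(X, 2)   # a subset of the original columns
X_ext = extract_principal(X, 2)           # two newly constructed features
print(kept)
print(X_sel.shape, X_ext.shape)
```

The selected features are untouched copies of original columns, so they remain directly interpretable, whereas each extracted feature mixes all of the original columns.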

Technical Description

This notebook demonstrates the use of embedPy to import the Python machine learning libraries Keras and scikit-learn, which are used, respectively, to import and to analyze data from the Modified National Institute of Standards and Technology (MNIST) database. This is a large collection of images of handwritten digits, derived from data originally collected by the National Institute of Standards and Technology, an agency of the US government. The dimensionality reduction notebook also provides a useful demonstration of the benefits that JupyterQ provides with regard to data visualization and markdown.

Following the initial import of the dataset from Keras, we split the data into its training and test sets: xtrain and xtest contain 8-bit grayscale images of 28×28 pixels, while ytrain and ytest contain the labels associated with those images (1st image = 5; 2nd image = 4 etc.).
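Estimators such as those in sklearn expect a two-dimensional array with one row per sample, so each 28×28 image is typically flattened into a vector of 784 values before it is analyzed. The Python sketch below illustrates that layout using random stand-in arrays with the same per-image shape and dtype as the MNIST images; it does not download the real data, and the scaling to the range [0, 1] is a common convention assumed here rather than a step confirmed by the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the MNIST training images. In the real dataset,
# keras.datasets.mnist.load_data() returns the 60000 training images
# as a uint8 array of shape (60000, 28, 28); here we fake 1000 of them.
xtrain = rng.integers(0, 256, size=(1000, 28, 28), dtype=np.uint8)
ytrain = rng.integers(0, 10, size=1000)  # stand-in digit labels 0-9

# Flatten each 28x28 image into a 784-long row and scale pixels to [0, 1]
xflat = xtrain.reshape(len(xtrain), -1).astype(np.float64) / 255.0

print(xflat.shape)  # (1000, 784)
```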

Using embedPy, we can leverage sklearn's PCA to perform dimensionality reduction by mapping the dataset to a lower-dimensional space using an orthogonal linear transformation. This mapping is chosen such that the variance of the lower-dimensional data is maximized. Conceptually, PCA computes the covariance matrix of the data, performs an eigendecomposition of that matrix and keeps the eigenvectors associated with the largest eigenvalues; projecting the data onto these eigenvectors gives the reduced representation.
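The covariance-based procedure described above can be sketched in a few lines of NumPy. This is an illustrative implementation on synthetic data, not the notebook's code; the function name pca_eig is hypothetical.

```python
import numpy as np


def pca_eig(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    # 1. Centre the data so the covariance is computed about the mean
    Xc = X - X.mean(axis=0)
    # 2. Covariance matrix of the features (features x features)
    cov = np.cov(Xc, rowvar=False)
    # 3. Eigendecomposition; eigh is intended for symmetric matrices
    #    and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Reorder to descending and keep the leading components
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return Xc @ components, eigvals[order], components


rng = np.random.default_rng(42)
# Synthetic correlated data: 500 samples, 10 features driven by 3 latent factors
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

Z, variances, W = pca_eig(X, 2)
print(Z.shape)
print(variances)
```

The variance of each projected column equals the corresponding eigenvalue, which is why keeping the largest eigenvalues maximizes the retained variance. Note that sklearn's PCA computes the same projection through a singular value decomposition rather than an explicit covariance matrix.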

This is now our reduced dataset, which, although smaller, retains the majority of the variance of the original higher-dimensional data. The PCA output is then plotted in a set of scatter plots to show the distribution of the reduced data in the new space defined by the principal components. With the reduced data labeled, it is possible to see a weak relationship between an image's position in the reduced space and its numerical label; for example, images labeled 1 appear to be clustered tightly together. However, the grouping is less obvious for many of the other labels, and as such, this method is unlikely to provide accurate classifications of the test data.
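One way to check how much variance a reduced dataset retains is to look at the cumulative share of variance explained by the leading components. The sketch below does this in NumPy on synthetic data; the function name explained_variance_ratio and the 95% threshold are assumptions made for illustration, not values taken from the notebook.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)
    # eigvalsh returns ascending eigenvalues; reverse to get descending order
    eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return eigvals / eigvals.sum()


rng = np.random.default_rng(1)
# Synthetic data: 20 features, but most of the variance lives in 3 directions
latent = rng.normal(size=(400, 3))
X = latent @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(400, 20))

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)
# Smallest number of components retaining at least 95% of the variance
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_keep, cumulative[:n_keep])
```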

Given the limited ability of linear dimensionality reduction to differentiate between the characters, a nonlinear method, t-SNE, is tested. The t-SNE algorithm works by constructing two probability distributions over pairs of points. The first is defined over pairs in the high-dimensional dataset, such that similar pairs have a high probability and dissimilar pairs a low probability. The second is defined in the same spirit over the corresponding pairs of points in the low-dimensional map. The Kullback-Leibler divergence between the two distributions is then minimized with respect to the locations of the points on the map. This nonlinear method is often used to visualize high-dimensional data, including the representations learned by artificial neural networks.
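The cost that t-SNE minimizes can be illustrated with a short NumPy sketch. It is deliberately simplified: a single fixed Gaussian bandwidth (sigma) is assumed for all points, whereas t-SNE tunes a bandwidth per point to match a target perplexity, and the sketch only evaluates the cost for two candidate maps rather than optimizing it. All function names are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(X):
    """Squared Euclidean distance between every pair of rows in X."""
    sq = np.sum(X ** 2, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.maximum(d, 0.0)  # clip tiny negatives from round-off


def high_dim_affinities(X, sigma=1.0):
    """Gaussian similarities P over pairs of high-dimensional points.

    Simplification: one fixed sigma for all points, whereas t-SNE
    tunes a per-point bandwidth to match a target perplexity.
    """
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()


def low_dim_affinities(Y):
    """Student-t (one degree of freedom) similarities Q over map points."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the cost minimised over the map coordinates."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))


rng = np.random.default_rng(0)
# Two well-separated clusters in 10 dimensions
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 10)),
               rng.normal(3.0, 0.3, size=(20, 10))])
P = high_dim_affinities(X)

# A map that keeps the clusters apart versus one that scatters points randomly
good_map = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                      rng.normal(5.0, 0.3, size=(20, 2))])
random_map = rng.normal(0.0, 3.0, size=(40, 2))

kl_good = kl_divergence(P, low_dim_affinities(good_map))
kl_random = kl_divergence(P, low_dim_affinities(random_map))
print(kl_good, kl_random)
```

A map that preserves the neighborhood structure of the original data yields a lower divergence than a random one, and it is this quantity that the algorithm drives down by moving the map points.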

As with PCA, the t-SNE analysis is run through embedPy. Once it has completed, visualization of the labeled distribution shows the effectiveness of this algorithm: while a number of outliers can be seen, the clustering of similar labels suggests that this reduced dataset will allow for better classification of the handwritten images than the PCA output.

If you would like to further investigate the uses of embedPy and machine learning algorithms in Kx, see the ML 02 Dimensionality Reduction notebook on GitHub (https://github.com/KxSystems/mlnotebooks), where several functions commonly employed in machine learning problems are also provided, together with some functions to create several interesting graphics.

You can use Anaconda to set up your machine learning environment within your Python installation, or you can build your own by downloading kdb+, embedPy and JupyterQ. You can find the installation steps on code.kx.com/q/ml/setup/

Don’t hesitate to contact us if you have any suggestions or queries.

*Other articles in this JupyterQ series of blogs by the Kx Machine Learning Team:

Natural Language Processing in kdb+ by Fionnuala Carr

Neural Networks in kdb+ by Esperanza López Aguilera

Classification Using k-Nearest Neighbors in kdb+ by Fionnuala Carr

Feature Engineering in kdb+ by Fionnuala Carr

Decision Trees in kdb+ by Conor McCarthy

Random Forests in kdb+ by Esperanza López Aguilera

 

 
