Dimensionality Reduction in kdb+

14 Jun 2018

By Conor McCarthy

As part of Kx25, a series of seven JupyterQ notebooks was released and is now available on https://code.kx.com/q/ml/. Each notebook demonstrates how to implement a different machine learning technique in kdb+, primarily using embedPy, to solve a range of machine learning problems, from feature extraction to fitting and testing a model. These notebooks act as a foundation for our users, allowing them to manipulate the code and get access to the exciting world of machine learning within Kx.

The notebook described in this third blog in the series* outlines the implementation of dimensionality reduction in kdb+ and focuses primarily on one aspect of this, namely, feature extraction. (For more about the Kx machine learning team, please watch Andrew Wilson’s presentation at Kx25 on the Kx YouTube channel.)

Background

Dimensionality reduction methods have been the focus of much interest within the statistics and machine learning communities for a range of applications, and have a long history as data pre-processing methods. Dimensionality reduction is the mapping of data to a lower-dimensional space such that uninformative variance in the data is discarded. By doing this we hope to retain only the data that is meaningful to our machine learning problem. In addition, by finding a lower-dimensional representation of a dataset we hope to improve the efficiency and accuracy of machine learning models.

The process of dimensionality reduction can be broken into two categories: feature selection and feature extraction. Feature selection is the process of finding a meaningful subset of all the features in the dataset. Feature extraction is the process of creating new features from the original dataset which may help to interpret the data or facilitate subsequent machine learning. A number of techniques are available; however, this work focuses on two popular and successful techniques, principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE).

Technical Description

This notebook demonstrates the use of embedPy to import the Python machine learning libraries Keras and scikit-learn, which are used, respectively, to import and to analyze data from the Modified National Institute of Standards and Technology (MNIST) database. MNIST is a large collection of handwritten digits, built from samples originally collected by the National Institute of Standards and Technology, an agency of the US government. The dimensionality reduction notebook also provides a useful demonstration of the benefits which JupyterQ provides with regard to data visualization and markdown.

Following the initial import of the dataset from Keras, we split the data into training and test sets: xtrain and xtest are 8-bit grayscale images of 28×28 pixels, while ytrain and ytest are the labels associated with those images (1st image = 5, 2nd image = 4, etc.).
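To make the data layout concrete, here is a minimal Python sketch of arrays with the same shape and type as those described above. It is an illustration only: the images are random stand-ins rather than real MNIST digits, the helper name flatten_images is hypothetical, and the flattening and scaling step is an assumption about typical pre-processing rather than something taken from the notebook.

```python
import numpy as np

# Stand-in for the MNIST arrays: random 8-bit images with the same layout.
# (Assumption: the real notebook loads the data through Keras instead.)
rng = np.random.default_rng(0)
xtrain = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)
ytrain = rng.integers(0, 10, size=100)  # one digit label (0-9) per image


def flatten_images(images):
    """Reshape (n, 28, 28) uint8 images into (n, 784) floats scaled to [0, 1]."""
    n = images.shape[0]
    return images.reshape(n, -1).astype(np.float64) / 255.0


flat = flatten_images(xtrain)
print(xtrain.shape, xtrain.dtype)  # layout of the raw images
print(flat.shape, flat.dtype)      # one row of 784 pixel values per image
```

Each image becomes a single row of 784 features, which is the form a dimensionality reduction method would typically take as input.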

Using embedPy, we can call the PCA implementation in sklearn to perform dimensionality reduction by mapping the dataset to a lower-dimensional space using an orthogonal linear transformation. This mapping is chosen such that the variance of the lower-dimensional data is maximized. Conceptually, PCA computes the covariance matrix of the data, performs an eigendecomposition, and keeps the eigenvectors associated with the largest eigenvalues.
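The steps above can be sketched directly in Python with NumPy. This is a minimal illustration of the covariance and eigendecomposition route described in the text, not the sklearn implementation and not the notebook's code; the function name pca_eig and the toy data are assumptions made for the example.

```python
import numpy as np


def pca_eig(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components.

    Returns the reduced data and the fraction of total variance that each
    retained component explains.
    """
    X_centered = X - X.mean(axis=0)                 # PCA works on centered data
    cov = np.cov(X_centered, rowvar=False)          # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # indices of the largest eigenvalues
    components = eigvecs[:, order]                  # orthonormal directions of max variance
    explained_ratio = eigvals[order] / eigvals.sum()
    return X_centered @ components, explained_ratio


# Toy data (hypothetical): 200 points in 3-D that vary almost entirely
# along a single direction, plus a little noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t, 0.5 * t]) + 0.01 * rng.normal(size=(200, 3))

reduced, ratio = pca_eig(X, 2)
print(reduced.shape)  # (200, 2)
print(ratio)          # the first component carries nearly all of the variance
```

The explained-variance ratio gives a quick check on how much of the original variance the reduced dataset retains.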

The result is our reduced dataset, which, although smaller, retains the majority of the variance of the original higher-dimensional data. The PCA output is then shown in a set of scatter plots to illustrate the distribution of the reduced data in the space spanned by the principal components. With the reduced data labeled, it is possible to see some grouping of the images by their numerical labels; for example, those labeled with the value 1 appear to be clustered tightly together. The grouping is less obvious for many of the other labels, however, and as such this method is unlikely to provide accurate classifications of the test data.

Given that linear dimensionality reduction does not clearly differentiate between the digits, a nonlinear method, t-SNE, is tested. The t-SNE algorithm works by constructing two probability distributions over pairs of points: the first, over pairs in the high-dimensional dataset, assigns high probabilities to similar pairs and low probabilities to dissimilar pairs; the second is defined in the same way over the corresponding points in the low-dimensional map. The Kullback-Leibler divergence between the two distributions is then minimized with respect to the locations of the points on the map. This nonlinear method is often used to visualize high-dimensional data, including the representations learned by artificial neural networks.
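The objective that t-SNE minimizes can be illustrated with a short NumPy sketch. It is a simplification under stated assumptions: a single fixed Gaussian bandwidth (sigma) is used for all points, whereas t-SNE tunes a bandwidth per point from a perplexity setting, and no optimization is performed; the sketch only evaluates the Kullback-Leibler divergence for two candidate maps. All function names and the toy data are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(X):
    """Squared Euclidean distance between every pair of rows in X."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)


def high_dim_affinities(X, sigma=1.0):
    """Gaussian pair probabilities in the original space (fixed sigma: a simplification)."""
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)   # a point is not its own neighbor
    return P / P.sum()


def low_dim_affinities(Y):
    """Student-t (one degree of freedom) pair probabilities in the map."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()


def kl_divergence(P, Q):
    """KL(P || Q), summed over pairs where P is non-zero."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))


# Toy data (hypothetical): two tight clusters of 10 points each in 5-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(10, 5)),
               rng.normal(3.0, 0.1, size=(10, 5))])
P = high_dim_affinities(X)

good_map = X[:, :2]  # keeps the two clusters apart in 2-D
# Interleave the rows so each high-dimensional cluster is split across the map.
perm = np.concatenate([np.arange(0, 20, 2), np.arange(1, 20, 2)])
bad_map = good_map[perm]

kl_good = kl_divergence(P, low_dim_affinities(good_map))
kl_bad = kl_divergence(P, low_dim_affinities(bad_map))
print(kl_good < kl_bad)  # True: the faithful map has the lower divergence
```

A map that keeps similar points together scores a lower divergence than one that scatters them, which is the property the t-SNE optimization exploits when it moves points around the map.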

As with PCA, the t-SNE algorithm is run through embedPy. Visualizing the labeled output shows the effectiveness of this algorithm: although a number of outliers can be seen, the clustering of similar labels suggests that this representation will allow for better classification of the handwritten images than the PCA output.

If you would like to further investigate the uses of embedPy and machine learning algorithms in Kx, check out the ML 02 Dimensionality Reduction notebook on GitHub (https://github.com/KxSystems/mlnotebooks), where several functions commonly employed in machine learning problems are also provided, together with some functions to create several interesting graphics.

You can use Anaconda to set up your machine learning environment within your Python installation, or you can build your own by downloading kdb+, embedPy and JupyterQ. You can find the installation steps at code.kx.com/q/ml/setup/.

Don’t hesitate to contact us if you have any suggestions or queries.

*Other articles in this JupyterQ series of blogs by the Kx Machine Learning Team:

Natural Language Processing in kdb+ by Fionnuala Carr

Neural Networks in kdb+ by Esperanza López Aguilera

Classification Using k-Nearest Neighbors in kdb+ by Fionnuala Carr

Feature Engineering in kdb+ by Fionnuala Carr

Decision Trees in kdb+ by Conor McCarthy

Random Forests in kdb+ by Esperanza López Aguilera

 


 
