By Esperanza López Aguilera
As part of Kx25, the international kdb+ user conference held May 18th, a series of seven JupyterQ notebooks was released and is now available on https://code.kx.com/q/ml/. Each notebook demonstrates how to implement a different machine learning technique in kdb+, primarily using embedPy, to solve a range of machine learning problems, from feature extraction to fitting and testing a model. These notebooks act as a foundation for our users, allowing them to manipulate the code and get access to the exciting world of machine learning within Kx. (For more about the Kx machine learning team, please watch Andrew Wilson’s presentation at Kx25 on the Kx YouTube channel.)
Several academic examples exist to demonstrate the power of machine learning algorithms to solve different real-world problems. Among these is the very well-known problem of image recognition, which has been explored and resolved in some of the released notebooks using neural networks, an approach that has proven to be very efficient.
Neural networks have become the focus of an intensified research effort within the machine learning community. A neural network can be considered to be an interconnected assembly of simple processing elements, units or nodes, whose functionality is loosely based on the animal neuron. Dr. Robert Hecht-Nielsen, the inventor of one of the first neuro-computers, defined an Artificial Neural Network (ANN or NN) as “a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs.” ANNs are thus loosely modeled on biological neural structures in the brain and try in some way to imitate their behavior. The processing ability of the network is stored in the inter-unit connection strengths, or weights, which are obtained by a process of adaptation to, or learning from, a set of training patterns.
ANNs are usually structured in layers populated by nodes; these layers are generally separated into three distinct types: the input layer, the hidden layers and the output layer. Data is given to the NN via the input layer, which is connected to the first hidden layer by weighted connections that define linear combinations of the data. Hidden layers are the primary defining characteristic of the NN; a network may contain a single hidden layer or many, and each layer can hold hundreds or thousands of nodes. These layers are composed of nodes containing activation functions, which try to imitate the activation of a biological neuron upon receiving a stimulus. The nodes of successive hidden layers are likewise linked by weighted connections. Finally, there is an output layer from which the final answer is extracted.
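The layered structure described above can be sketched in a few lines of plain Python. This is only an illustration, not the notebook's code: the weights are made up, ReLU is used as the hidden activation, and the softmax on the output layer is an assumption made so that the outputs can be read as class probabilities.

```python
import math

def relu(values):
    """Rectified linear unit: negative inputs become 0, positive inputs pass through."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1 (assumed output activation)."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """Weighted connections: each node receives a linear combination of the inputs."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> one hidden layer with activation -> output layer."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    return softmax(dense(hidden, output_w, output_b))

# Toy network: 3 inputs -> 2 hidden nodes -> 2 output classes (illustrative weights)
hidden_w = [[0.5, -1.0, 0.25], [1.0, 1.0, -0.5]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0], [-1.0, 1.0]]
output_b = [0.0, 0.0]

probs = forward([1.0, 2.0, 3.0], hidden_w, hidden_b, output_w, output_b)
print(probs)
```

Training a network amounts to adjusting the values in hidden_w, hidden_b, output_w and output_b so that the output probabilities match the known labels of the training patterns.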
Neural networks have many applications in classification, time-series analysis and optimization problems. Among the applications worth mentioning are credit card fraud detection, disease diagnosis and prediction, petroleum exploration, stock market prediction, weather forecasting, speech recognition, and image recognition and classification.
This notebook, like all those in this series, aims to show how to approach a machine learning problem using kdb+. It deals with the manipulation, preprocessing and inspection of the dataset, followed by the construction and training of the NN, and finally illustrates the predictive aspects of the model and an interpretation of the results. To do this, embedPy is used, as it allows us to make the best use of both Python and q in tandem. Data can be managed and explored rapidly and easily using kdb+ functions and queries, while we gain access to the wide range of optimized machine learning algorithms provided by Python modules and libraries such as Scikit-learn and Keras.
In this example, we introduce the problem of handwritten digit recognition by examining the Modified National Institute of Standards and Technology (MNIST) database, a very well-known and widely studied database. This dataset will be studied further in next week's blog, titled ‘Dimensionality reduction in Kx.’ The MNIST (handwritten digits) dataset is publicly available and can be loaded into q using the Keras Python module, which is imported using embedPy. The dataset consists of four byte-type datasets. The training and test images (defined in our case as xtrain and xtest) contain the handwritten digits as matrices in which each of the 28×28 pixels is represented as a value. The corresponding labels (defined as ytrain and ytest) hold the actual digit, between 0 and 9, shown in each image, which allows us to determine whether our model is accurately identifying the digits.
The data is loaded as q data, which allows us to manipulate and explore it quickly. From here we can easily cast the data to floats, get the shape of the datasets using lambdas defined in q, or determine whether the classes of the dataset are balanced by employing simple qSQL queries.
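The three inspection steps mentioned above (casting to floats, getting the shape, checking class balance) are done in q in the notebook. As a rough Python analogue of the same logic, the sketch below uses a tiny made-up stand-in for xtrain and ytrain; the helper names shape and to_floats are hypothetical and are not taken from the notebook.

```python
from collections import Counter

def shape(nested):
    """Return the dimensions of a rectangular nested list, e.g. (images, rows, cols)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return tuple(dims)

def to_floats(images):
    """Cast every pixel value to a float."""
    return [[[float(p) for p in row] for row in img] for img in images]

# Made-up stand-in for the real data: four 2x2 "images" with byte-valued pixels
xtrain = [
    [[0, 255], [128, 0]],
    [[12, 0], [0, 200]],
    [[0, 0], [255, 255]],
    [[90, 90], [0, 10]],
]
ytrain = [3, 7, 3, 1]

xtrain = to_floats(xtrain)
print(shape(xtrain))                      # dimensions of the image set

# Class balance: how many examples of each digit are present
class_counts = dict(sorted(Counter(ytrain).items()))
print(class_counts)
```

On the real MNIST training set the same checks would report 28×28 images and roughly similar counts for each of the ten digits.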
Moreover, when dealing with images it is often useful to give some visual output to the user to help them better understand the dataset; to do this, we use embedPy to import the matplotlib Python module, which allows such images to be displayed.
Different machine learning algorithms require different data formats, so it is important to know the appropriate way to give the data to the model. Neural networks perform better in classification problems when the classes are given as one-hot vectors. The function onehot has been defined in func.q for this purpose and is used in this example. Furthermore, although NNs accept matrices as input data, we reshape each image into a one-dimensional vector to simplify the structure of the network and make this explanatory example easier to follow. We use embedPy and Keras modules to build a neural network with one hidden layer populated by 512 nodes that use the rectified linear unit (ReLU) activation function. The output layer consists of 10 nodes that output the probability of belonging to each class (0 to 9).
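To make the two preprocessing steps concrete, here is a minimal Python sketch. The onehot function below is a hypothetical stand-in for the one defined in func.q (whose source is not shown here), and flatten is likewise an illustrative helper. The parameter count at the end assumes fully connected layers with one bias per node, which is an assumption about the network rather than something stated above.

```python
def onehot(labels, n_classes=10):
    """Hypothetical stand-in for onehot in func.q: one vector per label,
    with 1.0 at the label's index and 0.0 elsewhere."""
    return [[1.0 if c == label else 0.0 for c in range(n_classes)]
            for label in labels]

def flatten(image):
    """Reshape a 2D image (list of rows) into a one-dimensional vector."""
    return [pixel for row in image for pixel in row]

# A digit label becomes a 10-element vector
encoded = onehot([3, 0])
print(encoded[0])

# A 28x28 image becomes a 784-element vector
blank_image = [[0.0] * 28 for _ in range(28)]
vector = flatten(blank_image)
print(len(vector))

# Weights plus biases for 784 inputs -> 512 hidden nodes -> 10 outputs,
# assuming dense layers with a bias term on every node
n_params = (784 * 512 + 512) + (512 * 10 + 10)
print(n_params)
```

Flattening discards the two-dimensional layout of the pixels, which is acceptable for a simple explanatory network like this one.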
Once the structure of the NN has been defined, the model is fitted, and the computational power of GPUs is leveraged using TensorFlow to speed up the training process. After fitting the model to the training data, its accuracy can be assessed with the test data. The class of each test image is predicted using the predict function of the Keras model and the result is saved as a q list. This once again lets us work easily with the data and analyze the performance of the classifier in different ways. qSQL queries are applied to measure the performance by class, which identifies the most recognizable digits and those most commonly misclassified. The confusion matrix is also computed using qSQL queries to find out which classes are mixed up.
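The notebook performs this analysis with qSQL; the Python sketch below only mirrors the underlying logic on a made-up three-class example. It assumes that the predicted class is the one with the highest output probability, and the helper names and toy probabilities are illustrative rather than taken from the notebook.

```python
def argmax(values):
    """Index of the largest value, i.e. the most probable class."""
    return max(range(len(values)), key=lambda i: values[i])

def confusion_matrix(actual, predicted, n_classes):
    """matrix[a][p] counts the instances of actual class a predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def accuracy_by_class(matrix):
    """Fraction of each actual class that was predicted correctly."""
    result = {}
    for cls, row in enumerate(matrix):
        total = sum(row)
        if total:
            result[cls] = row[cls] / total
    return result

# Made-up model outputs for five test images over three classes
probabilities = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],  # actual class is 0, so this one is misclassified
    [0.1, 0.1, 0.8],
]
ytest = [0, 1, 2, 0, 2]

predicted = [argmax(p) for p in probabilities]
matrix = confusion_matrix(ytest, predicted, n_classes=3)
accuracy = sum(a == p for a, p in zip(ytest, predicted)) / len(ytest)
by_class = accuracy_by_class(matrix)

print(matrix)
print(accuracy, by_class)
```

Off-diagonal entries of the matrix show which pairs of classes are confused, while the diagonal holds the correctly classified counts.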
These results are usually better understood when they are represented graphically. Matplotlib provides a large set of functions that produce many different types of graphical representation, and we take advantage of this to produce three illustrative plots of the confusion matrix. The first is a 3D plot with the actual digits on the X axis, the predicted labels on the Y axis, and on the Z axis the count of instances that have value X but were predicted as Y. The second is a heatmap, a common representation of matrices that maps colours to numbers: the more intense the colour of a cell in the plane, the higher the number in the corresponding item of the matrix. The third is a histogram, which allows us to see how the different digits were misclassified. Finally, we visualize some of the images that were misclassified by the NN to try to detect why the model got them wrong.
If you would like to further investigate the uses of embedPy and machine learning algorithms in Kx, check out the ML 01 Neural Networks notebook on GitHub (github.com/kxsystems/mlnotebooks), where several functions commonly employed in machine learning problems are provided, together with some functions for creating several interesting graphics. To set up your machine learning environment, you can use Anaconda to integrate with your Python installation, or you can build your own, which consists of downloading kdb+, embedPy and JupyterQ. You can find the installation steps on code.kx.com/q/ml/setup/.
Other articles in this JupyterQ series of blogs by the Kx Machine Learning Team:
Natural Language Processing in kdb+ by Fionnuala Carr
Dimensionality Reduction in kdb+ by Conor McCarthy
Classification Using k-Nearest Neighbors in kdb+ by Fionnuala Carr
Feature Engineering in kdb+ by Fionnuala Carr
Decision Trees in kdb+ by Conor McCarthy
Random Forests in kdb+ by Esperanza López Aguilera
Michael Nielsen. Neural Network and Deep learning. Available from:
University of Wisconsin-Madison. Department of Computer Sciences. Available from:
Eric Roberts’ Sophomore College. The Intellectual Excitement of Computer Science. Available from:
Jayesh Bapu Ahire. Real-World Applications of Artificial Neural Networks. Available from:
Bishop CM. Neural Networks for Pattern Recognition. Cambridge University Press. 1995. Available from: http://cs.du.edu/~mitchell/mario_books/Neural_Networks_for_Pattern_Recognition_-_Christopher_Bishop.pdf
Gurney K. An Introduction to Neural Networks. UCL Press. 1997. Available from: https://www.inf.ed.ac.uk/teaching/courses/nlu/assets/reading/Gurney_et_al.pdf
You may also want to read the Kx Technical Whitepaper: An Introduction to Neural Networks in kdb+