KxCon2016 Puzzle Challenge

25 May 2016

By Nick Psaris. Inspired by Andrew Ng’s Machine Learning Coursera Class.

KxCon2016 was a success, especially for the brave programmers who took on the KxCon2016 programming challenge. Try your hand and we will post the solutions next week.

The KxCon2016 programming challenge was chosen because it can be quickly implemented inefficiently and then considerably optimized. This is not a toy problem – the resulting function is used to load datasets for machine learning. Finally, to make the problem more interesting, an existing q operator has been extended (“reshape extended to >2 dimensions…”) in kdb+ 3.4t that can make your solution even shorter.



A popular application of machine learning is character recognition. If we assume a handwritten digit can be digitized into a vector of pixels, logistic regression (among many other techniques) can be used to assign a weight to each pixel. These learned weights can then be combined with a new image to make a prediction of which digit it represents.
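As a rough illustration of how learned weights are combined with an image, here is a minimal q sketch. The names sigmoid, predict and W, and the toy weights, are assumptions made for this example and do not come from the challenge. It assumes one row of weights per class and picks the class with the highest logistic score.

```q
/ logistic (sigmoid) function: maps any real score into the range 0..1
sigmoid:{1%1+exp neg x}

/ W: one row of pixel weights per class (hypothetical values)
/ x: a new image flattened into a float vector of pixels
/ returns the index of the class with the highest score
predict:{[W;x] {x?max x} sigmoid sum each W*\:x}

W:(1 0 0f;0 1 0f;0 0 1f)   / toy weights: three classes, three pixels
predict[W;0.1 0.9 0.2]
```

With real data, W would have ten rows (one per digit) and as many columns as there are pixels in an image.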

The MNIST database holds a collection of handwritten digits that have been normalized for use in testing machine learning and pattern recognition techniques.


Figure 1: The first image in the MNIST training file representing the number 5.

To process these images of handwritten digits, we must load the data from files stored in the custom MNIST binary format. Your challenge is to write a function to read this data and return the resulting n-dimensional array. Lucky for you, this format has been well documented on the MNIST site.

The site specifies the exact dimension and numerical type of each dataset. This would allow you to write a custom loader for each file. The file format, however, is self-describing. You are required, therefore, to write a general loader that works with datasets of all dimensions and types. While you are waiting for the dataset to download, you can begin testing your implementation against the unit tests below.
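To see what self-describing means in practice, the sketch below decodes only the header and stops well short of a full loader. It assumes the layout documented on the MNIST site: two zero bytes, a type-code byte, a dimension-count byte, then one 4-byte big-endian size per dimension. The name idxhdr is hypothetical.

```q
/ hypothetical helper, not a solution: decode only the IDX header
/ x is a byte vector; returns a dictionary of type code and dimension sizes
idxhdr:{[x]
 n:"j"$x 3;                             / 4th byte: number of dimensions
 d:{0x0 sv x} each 4 cut 4_(4+4*n)#x;   / each size is a 4-byte big-endian integer
 `code`dims!(x 2;d)}                    / 3rd byte: element type code

idxhdr 0x00000803000000020000000200000002000102030405060708
```

Everything after the header is the data itself, whose element type and shape are fully determined by the two values returned here.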



Your function will be applied to the MNIST training dataset. To make the function more flexible, it should accept a byte vector instead of a file name. The function can then also be applied to the unit tests below to confirm proper behavior. To be accepted, your function named ldidx should produce the following results (signed and unsigned bytes should both be returned as type “x”). NOTE: ignore any extra trailing bytes.

Figure 2: The last image in the MNIST training file representing the number 8.

q)ldidx 0x0000080100000000
`byte$()
q)ldidx 0x000008010000000100
,0x00
q)0N!ldidx 0x0000080200000002000000020001020304;
(0x0001;0x0203)
q)0N!ldidx 0x00000803000000020000000200000002000102030405060708;
((0x0001;0x0203);(0x0405;0x0607))
q)ldidx 0x00000b010000000200010002
1 2h
q)ldidx 0x00000c01000000020000000100000002
1 2i
q)ldidx 0x00000d01000000023f80000040000000
1 2e
q)ldidx 0x00000e01000000023ff00000000000004000000000000000
1 2f
q)md5 raze over string X:ldidx b:read1 `$"train-images-idx3-ubyte"
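The note about trailing bytes implies that the header alone determines how many data bytes belong to the array. A small sketch of that arithmetic follows. The names width and nbytes are hypothetical, and the element widths come from the type codes listed in the MNIST format description (0x08 and 0x09 are bytes, 0x0b short, 0x0c int, 0x0d float, 0x0e double).

```q
/ bytes per element for each IDX type code
width:0x08090b0c0d0e!1 1 2 4 4 8

/ number of data bytes implied by a type code and the dimension sizes
nbytes:{[code;dims] width[code]*prd dims}

nbytes[0x08;2 2]   / the 2x2 byte test above needs 4 data bytes; the 5th is ignored
nbytes[0x0e;1#2]   / two doubles need 16 data bytes
```

Any bytes beyond this count can be dropped before the data is cast and reshaped.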


Email your function to as soon as it produces valid results. Email it again when you’ve optimized the code. No external user-defined functions or data structures can be used. Only the first and last submission by an individual will be accepted for the competition. All submissions must be made prior to 00:00 EST on 22 May 2016. The 32 bit free version of q available on 20 May 2016 will be used to test each submission.


One point will be awarded for each of the following categories.

  1. Fastest valid submission measured in milliseconds elapsed – q)\t:10 ldidx b
  2. Smallest valid submission measured in allocated bytes – q)\ts ldidx b
  3. Shortest valid submission measured in bytes – q)count first get ldidx

In case of a tie, the submitter who provided the first valid submission (irrespective of performance) will win.

UPDATE: The solution is here.

