By Nick Psaris. Inspired by Andrew Ng’s Machine Learning Coursera Class.
KxCon2016 was a success, especially for the brave programmers who took on the KxCon2016 programming challenge. Try your hand and we will post the solutions next week.
The KxCon2016 programming challenge was chosen because it can be quickly implemented inefficiently and then considerably optimized. This is not a toy problem – the resulting function is used to load datasets for machine learning. Finally, to make the problem more interesting, an existing q operator has been extended (“reshape extended to >2 dimensions…”) in kdb+ 3.4t that can make your solution even shorter.
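As a rough illustration of the reshape operator mentioned above, the sketch below puts a list of dimensions on the left of `#`. The three-dimensional case assumes kdb+ 3.4 or later, and the values are made up for illustration.

```q
/ two dimensions: 2 rows of 3
m:2 3#til 6                 / (0 1 2;3 4 5)
/ three dimensions (assumes kdb+ 3.4+): 2 blocks, each 2 rows of 3
a:2 2 3#til 12
a~(2 3#til 6;2 3#6+til 6)   / 1b when the extended reshape is available
```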
A popular application of machine learning is character recognition. If we assume a handwritten digit can be digitized into a vector of pixels, logistic regression (among many other techniques) can be used to assign a weight to each pixel. These learned weights can then be combined with a new image to make a prediction of which digit it represents.
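To make the idea of combining weights with pixels concrete, here is a minimal sketch of a logistic regression prediction for a single image. The names `sigmoid`, `w` and `x` and all the numbers are hypothetical. A real MNIST model would have one weight per pixel (784 for a 28 x 28 image) and one set of weights per digit.

```q
/ logistic (sigmoid) function: maps any real number into the range 0 to 1
sigmoid:{1%1+exp neg x}
/ hypothetical learned weights and pixel intensities for a 3-pixel image
w:0.5 -0.25 0.1
x:1 0 1f
/ weighted sum of the pixels, squashed into a probability
p:sigmoid sum w*x
```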
The MNIST database holds a collection of handwritten digits that have been normalized for use in testing machine learning and pattern recognition techniques.
Figure 1: The first image in the MNIST training file representing the number 5.
To process these images of handwritten digits, we must load the data from files stored in the custom MNIST binary format. Your challenge is to write a function to read this data and return the resulting n-dimensional array. Lucky for you, this format has been well documented on the MNIST site.
The site specifies the exact dimension and numerical type of each dataset. This would allow you to write a custom loader for each file. The file format, however, is self-describing. You are required, therefore, to write a general loader that works with datasets of all dimensions and types. While you are waiting for the dataset to download, you can begin testing your implementation against the unit tests below.
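As documented on the MNIST site, the self-description lives in a short header. The first two bytes are zero, the third byte is a type code (0x08 unsigned byte, 0x09 signed byte, 0x0B short, 0x0C int, 0x0D float, 0x0E double), and the fourth byte is the number of dimensions. Each dimension size follows as a 4-byte big-endian integer, and the data comes after that with the last index varying fastest. The sketch below decodes only the header and is not a solution to the challenge. The names `h`, `typ`, `nd` and `dims` are hypothetical, and the header bytes are constructed by hand rather than read from a file.

```q
/ hypothetical header for a 60000 x 28 x 28 array of unsigned bytes
h:0x00000803,0x0000ea60,0x0000001c,0x0000001c
typ:h 2                          / type code: 0x08 = unsigned byte
nd:"j"$h 3                       / number of dimensions: 3
dims:0x0 sv/:4 cut 4_(4+4*nd)#h  / big-endian 4-byte sizes: 60000 28 28i
```

Here `0x0 sv` converts each group of four bytes into an int, which is one of several ways to decode big-endian values in q.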
Your function will be applied to the MNIST training dataset. To make the function more flexible, it should accept a byte vector instead of a file name. The function can then be applied to unit tests to confirm proper behavior. To be accepted, your function, named ldidx, should produce the following results (signed and unsigned bytes should both be returned as type "x"). NOTE: ignore any extra trailing bytes.
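Because the function takes a byte vector, small inputs can be built by hand. The sketch below is not one of the official unit tests. It is a hypothetical input (the names `b`, `r` and `b2` are made up) showing the kind of vector a loader could be tried against, together with the array those bytes describe.

```q
/ hypothetical input: magic number (type 0x08, 2 dimensions), sizes 2 and 3, six data bytes
b:0x00000802,0x00000002,0x00000003,0x010203040506
/ the 2 x 3 byte array these bytes describe
r:2 3#0x010203040506        / (0x010203;0x040506)
/ same input with one extra trailing byte that a loader should ignore
b2:b,0xff
```

Under the requirements stated above, a conforming ldidx would be expected to return `r` for both `b` and `b2`.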
Figure 2: The last image in the MNIST training file representing the number 8.
q)md5 raze over string X:ldidx b:read1 `$"train-images-idx3-ubyte"
Email your function as soon as it produces valid results. Email it again when you’ve optimized the code. No external user-defined functions or data structures can be used. Only the first and last submission by an individual will be accepted for the competition. All submissions must be made prior to 00:00 EST on 22 May 2016. The 32-bit free version of q available on 20 May 2016 will be used to test each submission.
One point will be awarded for each of the following categories.
- Fastest valid submission measured in milliseconds elapsed – q)\t:10 ldidx b
- Smallest valid submission measured in allocated bytes – q)\ts ldidx b
- Shortest valid submission measured in bytes – q)count first get ldidx
In case of a tie, the submitter who provided the first valid submission (irrespective of performance) will win.
UPDATE: The solution is here.