
KxCon2016 Puzzle Challenge

25 May 2016

By Nick Psaris. Inspired by Andrew Ng’s Machine Learning Coursera Class.

KxCon2016 was a success, especially for the brave programmers who took on the KxCon2016 programming challenge. Try your hand and we will post the solutions next week.

The KxCon2016 programming challenge was chosen because it can be quickly implemented inefficiently and then considerably optimized. This is not a toy problem – the resulting function is used to load datasets for machine learning. Finally, to make the problem more interesting, kdb+ 3.4t extends an existing q operator (“reshape extended to >2 dimensions…”), which can make your solution even shorter.
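A minimal illustration of the extended reshape (kdb+ 3.4t or later; 0N! forces a flat display):

q)0N!2 2 2#til 8;
((0 1;2 3);(4 5;6 7))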

MACHINE LEARNING

Background

A popular application of machine learning is character recognition. If we assume a handwritten digit can be digitized into a vector of pixels, logistic regression (among many other techniques) can be used to assign a weight to each pixel. These learned weights can then be combined with a new image to make a prediction of which digit it represents.
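As a rough sketch of the prediction step (hypothetical names: W is a 10×784 float matrix of learned weights, one row per digit; x is a flattened 784-pixel image as floats), a one-vs-all logistic regression picks the digit whose score is highest:

sigmoid:{1%1+exp neg x}                      / logistic function
predict:{[W;x]first idesc sigmoid W mmu x}   / most probable digit, 0-9

sigmoid maps the ten raw scores onto probabilities; since it is monotonic, the predicted digit is simply the row of W with the largest weighted sum.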

The MNIST database holds a collection of handwritten digits that have been normalized for use in testing machine learning and pattern recognition techniques.

Challenge

Figure 1: The first image in the MNIST training file representing the number 5.

To process these images of handwritten digits, we must load the data from files stored in the custom MNIST binary format. Your challenge is to write a function to read this data and return the resulting n-dimensional array. Lucky for you, this format has been well documented on the MNIST site.

The site specifies the exact dimension and numerical type of each dataset. This would allow you to write a custom loader for each file. The file format, however, is self-describing. You are required, therefore, to write a general loader that works with datasets of all dimensions and types. While you are waiting for the dataset to download, you can begin testing your implementation against the unit tests below.
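Concretely, the IDX layout documented there (and exercised by the unit tests below) is:

bytes 0-1: always 0x0000
byte 2:    type code (0x08 unsigned byte, 0x09 signed byte, 0x0B short, 0x0C int, 0x0D float, 0x0E double)
byte 3:    number of dimensions n
then:      n dimension sizes as 4-byte big-endian integers, followed by the data itself, most significant byte first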

RULES

Interface

Your function will be applied to the MNIST training dataset. To make the function more flexible, it should accept a byte vector instead of a file name. The function can then be applied to the unit tests below to confirm proper behavior. To be accepted, your function, named ldidx, must produce the following results (signed and unsigned bytes should both be returned as type “x”). NOTE: ignore any extra trailing bytes.

Figure 2: The last image in the MNIST training file representing the number 8.

q)ldidx 0x0000080100000000
`byte$()
q)ldidx 0x000008010000000100
,0x00
q)0N!ldidx 0x0000080200000002000000020001020304;
(0x0001;0x0203)
q)0N!ldidx 0x00000803000000020000000200000002000102030405060708;
((0x0001;0x0203);(0x0405;0x0607))
q)ldidx 0x00000b010000000200010002
1 2h
q)ldidx 0x00000c01000000020000000100000002
1 2i
q)ldidx 0x00000d01000000023f80000040000000
1 2e
q)ldidx 0x00000e01000000023ff00000000000004000000000000000
1 2f
q)md5 raze over string X:ldidx b:read1 `$"train-images-idx3-ubyte"
0x6a5cde79f049959f93df34292c599c1b
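
To give a feel for the problem, here is one deliberately naive sketch that passes the unit tests above (it is not the reference solution). It assumes a little-endian host and a kdb+ version whose 1: fixed-width binary load accepts a byte vector, and it nests the result with cut so it also runs where # reshape is limited to two dimensions:

ldidx:{[b]
 w:0x08090b0c0d0e!1 1 2 4 4 8;         / element width per IDX type code
 c:0x08090b0c0d0e!"xxhief";            / q type char per IDX type code
 t:b 2;r:`int$b 3;                     / type code and number of dimensions
 d:0x0 sv'4 cut b 4+til 4*r;           / big-endian dimension sizes
 v:(prd[d]*w t)#(4+4*r)_b;             / data bytes; trailing extras ignored
 a:$[t in 0x0809;v;                    / bytes pass through as type "x"
  first(enlist c t;enlist w t)1:raze reverse each(w t)cut v];
 a{y cut x}/reverse 1_d}               / nest the flat vector into d dims

On kdb+ 3.4t the nesting step can be replaced by the extended d#a reshape mentioned above; most of the remaining headroom is in eliminating the per-element reverse each.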

Submission

Email your function as soon as it produces valid results. Email it again when you have optimized the code. No external user-defined functions or data structures can be used. Only the first and last submission by an individual will be accepted for the competition. All submissions must be made prior to 00:00 EST on 22 May 2016. The 32-bit free version of q available on 20 May 2016 will be used to test each submission.

Scoring

One point will be awarded for each of the following categories.

  1. Fastest valid submission measured in milliseconds elapsed – q)\t:10 ldidx b
  2. Smallest valid submission measured in allocated bytes – q)\ts ldidx b
  3. Shortest valid submission measured in bytes – q)count first get ldidx

In case of a tie, the submitter who provided the first valid submission (irrespective of performance) will win.

UPDATE: The solution is here.
