Arthur Whitney
q is a general purpose programming language and an RDBMS: http://kx.com/q/d/kdb+.htm
q has atoms, lists, dicts, tables and functions. It includes file, socket and webserver subsystems as well as a superset of sql with timeseries extensions. It is fast and expressive.
[kdb+ is actually a platform for several languages. The bootstrap language is a compact form of q called k.]
Kdb+ installs in . The executable is . Scripts are .
NB: there is no notation for singleton lists -- they must be built, e.g.
Dictionaries are maps of lists to lists. A table is a list of similar dicts(records).
A keyed table is a dict whose key and value are both tables.
To add records to a table:
If t is a keyed table then this does upserts -- update existing & append new. is a restricted(no update) case of "," which returns the new record handles. Given,
Column names have to line up. Amend and append promote to enum but otherwise types have to match.
In q we fully normalize - store least amount of data. i.e. joins are generally faster than reading denormalized data. Joins are functional (many-to-one) or dysfunctional. We can ignore the dysfunctional -- which leaves:
The important joins are left joins -- usually with fkeys, e.g. all 50 joins in the tpcd queries are fkey left joins. Left coincides with inner with complete info. Left joins subdivide into fkey and adhoc. The right argument is always keyed. In kdb+, fkey joins use "."'s, e.g.
Adhoc joins use , e.g.
or and to get just the column.
Fkey joins are 10 to 100 million per second (ram/cache). Adhoc joins are 4 to 40 million per second (ram/cache).
in sql:
or
in q:
Given a table of changes
and a table of cusips and dates
how do we find the sym for a cusip on a certain date? In sql the trick is to store the next date by cusip(with self joins) then:
In q we don't need nextdate - simply:
See http://kx.com/q/taq/adj.q for applying corporate actions using asof joins. If the map isn't keyed, e.g.
then we must use . The map is often: or and it is treated as if these fields were key, e.g.
brings in quotes at the trade times and
brings in quotes as of 1500 milliseconds before the trade. See http://kx.com/q/taq/taq.q for examples. In the two column case the first column should be either `p# or `g#. Realtime data is grouped(`g#) by sym and sorted by time. Historical data is `p# by sym for better disk access.
In aj[`sym`time;t;q] the last key must be`time (or other datetime) and q must be sorted by time within the equivalence classes of the other keys, e.g.
q should have `g#sym(rdb) or a `p#sym(hdb). This way the asof join gets a small set of possible records immediately and then does a binary search within those - often several million times faster than without a `g# or `p#.
t doesn't matter. aj is like lj - the second table needs the fast lookup.
is a generalization of
the result has the union of the columns filled with nulls where necessary.
Month, date, datetime, minute, second and time are types. Datetime, date and week units are days.
Week is rolled to mondays, i.e. . Day of week is . 0 is saturday. The rest are int fragments, e.g. . Quarters can be handled as .
Functions are atomic (atom from atom), aggregate (atom from list), uniform (list from list) or other.
Functions execute left OF right, e.g. write instead of . Uniform functions preserve length. They include the atomic functions, integrals and derivatives . Q has no ambivalence so we must use .
Lambdas are compiled
N.B. statements must be separated with ";"'s.
Functions can have queries and queries can have functions, e.g. given
then
will calculate how many times per day the trade price was higher than the midpoint for sym s.
Statements in scripts must be outdented, e.g.
/ will comment out the rest of the line. \ will exit the script. Sections of script can be commented out with matching singleton / and \.
Develop code in an editor and run databases from a console with \l(load), cut and paste. When there is an error the following are displayed:
If inside a function parameters, locals and globals can be inspected and then:
File handles are of the form `:PATH. Kdb+ files can be read and written just like any other variable, e.g.
Non kdb+ files are data(bytes) or text(lines).
NB: hclose and exit flush pending messages. to chase use: h""
Databases can be spread across a directory. A directory can have a number of tables and scripts (*.?). The scripts are loaded after all the tables are loaded. Functions should be put in etc.. Tables can be single files or spread across a directory. Keyed indicative tables must be single files. Big fact tables can be splayed across a directory.
There is one partition variable and it is a virtual column in all the partition tables. There can be other columns with different names. New partitions can be added -- then send reset"\\l ."
E and F also recognize "+dd dd/ddd"
D recognizes:
Syms(S) trim, blank skips and * is character vector. Reverse types and widths to load reverse endian data.
Tables can be written as a single file or spread across a directory, e.g.
Tables splayed across a directory must be fully enumerated(no varchar) and not keyed.
Datetime txt and csv are written as:
sym constants need to be distinguished from sym references, e.g.
Operators take functions and produce derived functions.
We can watch open/close with
Connect and make remote function calls, e.g. (OS is s32|l32|w32)
No arguments is a blocking read, e.g.
The language is q and location is `. no matter what the server console is doing.
q calls c, e.g. in QHOME/OS (see c/k.h for generators and accessors)
c calls q (just like the clients) with handle 0. Also
Bottom line: vector operations are up to 100 times faster per element than scalar operations. Q scalars are similar to java Objects(Integer etc.) and .net boxed elements. Given 400MHZ 64bit bus and 1.6GHZ cpu - about .5 of fastest 2004 cpus - q operation overhead(incl. fn call,each, object create) is about 50 ns. Vector operations are between .5ns(boolean in cache) and 500ns(exp,random access) per element. So a time dominating iterative inner loop may be a candidate for writing in c.
One technique for maintaining constant(or log) time insert and select is to store nested data, e.g.
We look for 10,000 to 100,000 inserts and selects per second.
Suppose f[s;t] calculates a link from table s to table t, e.g. previous quote, next quote, .. Then we can do:
or we can precalculate the link:
So having the link is generally better if it will be used more than once. A link can give us the performance(or better) of denormalization with minimal extra storage.
The byte overheads use n(number of elements) and u(number of uniques). These are used by lookups, e.g.
`u is for unique lists. `p and `g are for lists with a lot of repetition. `s#table marks the table to use binary search and marks first column sorted. `s#dict(incl. keyed tables) marks the dict to behave as step function. Vector and single-column table lookups run at about 200ns per lookup. 2 column lookups when the first column has an attribute are about 2us per lookup. Put the most distinguishing column first. The attribute flags are descriptive -- not prescriptive. Amends and appends preserve flags if and only if the attribute is preserved, e.g.