Kdb+taq loads the NYSE TAQ data (25 billion+ trades and quotes) into kdb+. The resulting multi-terabyte kdb+ database can be queried and analyzed at about 1 million ticks per second per CPU. The master, trade and quote data comes from daily FTP files and/or monthly DVDs:
The DVD data goes back to 1993 and (as of late 2003) is growing at 50 million records per day.
http://www.nyse.com/content/articles/1056810884848.html
The kdb+ storage factor is about 1. Get twice as much disk to allow for RAID 5, staging and scratch space, e.g. (2004):
Install kdb+. Put taq.k in q/.
Put the FTP files (taqtrade.. taqquote.. taqmaster..) or the DVD files (*.tab *.bin *.idx) in SRC.
See http://www.nyse.com/pdfs/userguid.pdf
Kdb+ loads, indexes and stores about 1 GB per minute per CPU. taq.k will load all files it finds in the SRC directory. If you want kdb+ to run in parallel:
There will be N slaves, each reading its own drive array, so there is no contention. Two disk arrays per CPU is about right (e.g. 2 CPUs and 4 arrays above). Days are allocated round-robin across the slaves, and multi-day queries run in parallel.
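The round-robin allocation of days can be illustrated with a small sketch. It is only an illustration of the idea, not the scheme taq.k actually implements: the function name `allocate_round_robin`, the slave numbering from 0, and the sample dates are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta


def allocate_round_robin(days, n_slaves):
    """Assign each day to a slave index, cycling 0..n_slaves-1 in date order.

    Hypothetical helper: the real allocation done by taq.k is not shown here.
    """
    if n_slaves < 1:
        raise ValueError("n_slaves must be at least 1")
    allocation = defaultdict(list)
    for i, day in enumerate(sorted(days)):
        allocation[i % n_slaves].append(day)
    return dict(allocation)


# Assumed example: 8 consecutive calendar days spread over 4 slaves.
days = [date(2003, 9, 1) + timedelta(days=k) for k in range(8)]
allocation = allocate_round_robin(days, 4)
for slave, assigned in sorted(allocation.items()):
    print(slave, [d.isoformat() for d in assigned])
```

Under this scheme any run of N consecutive days lands on N different slaves, which is why a multi-day query can be served by all drive arrays at once.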
These databases can run forever. The loads (taq.k) can send reset messages.
The data is indexed by symbol, so a read costs one disk seek per day*sym*field. This yields about one million prices per second per CPU (at 10,000 ticks per sym per day). In general,
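As a back-of-the-envelope check on the figures above, the sketch below derives the seek rate they imply and uses it to estimate the cost of an uncached read. It assumes reads are dominated by seeks (one per day*sym*field) and ignores caching and transfer time. The function names and the sample query sizes are hypothetical.

```python
def implied_seeks_per_second(prices_per_second, ticks_per_sym_per_day):
    """Seeks per second if every seek returns one day*sym*field vector."""
    return prices_per_second / ticks_per_sym_per_day


def implied_seek_ms(prices_per_second, ticks_per_sym_per_day):
    """Milliseconds per seek implied by the same two figures."""
    return 1000 * ticks_per_sym_per_day / prices_per_second


def estimated_read_ms(n_days, n_syms, n_fields, seek_ms):
    """Rough uncached read cost: one seek per day*sym*field (assumed model)."""
    return n_days * n_syms * n_fields * seek_ms


# Figures quoted in the text: 1,000,000 prices/s at 10,000 ticks per sym per day.
rate = implied_seeks_per_second(1_000_000, 10_000)
seek_ms = implied_seek_ms(1_000_000, 10_000)

# Assumed queries: one sym and two fields, over 250 days versus 5 days.
wide = estimated_read_ms(250, 1, 2, seek_ms)
narrow = estimated_read_ms(5, 1, 2, seek_ms)
print(rate, seek_ms, wide, narrow)
```

The quoted throughput corresponds to roughly 100 seeks per second, or about 10 ms per seek. Under this model the cost scales with the product of days, syms and fields touched.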
Data cached in memory is much faster. Restrict dates, syms and fields as much as possible -- read as little as possible, e.g.
is faster than
Move as little data as possible -- calculate in the server, e.g. from Java:
is much faster than pulling the data and calculating locally:
To retrieve data for a sym.exchange identifier, e.g. `AA.N
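A client may want to separate such an identifier into its two parts before building a query. The sketch below is a client-side illustration only. It assumes the exchange code is whatever follows the last dot, and the helper name `split_sym_exchange` is hypothetical. How the database itself resolves sym.exchange is not shown here.

```python
def split_sym_exchange(identifier):
    """Split 'AA.N' into ('AA', 'N').

    A leading backtick (q symbol syntax) is dropped. Splitting on the LAST dot
    is an assumption, chosen so that a sym containing a dot keeps its prefix.
    """
    text = identifier.lstrip("`")
    sym, dot, exchange = text.rpartition(".")
    if not dot or not sym or not exchange:
        raise ValueError(f"expected sym.exchange, got {identifier!r}")
    return sym, exchange


sym, exchange = split_sym_exchange("`AA.N")
print(sym, exchange)
```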
See http://kx.com/q/c for Java, .NET and C clients.
See http://kx.com/q/taq/adj.q for symbol, split and dividend adjustments.
Load DVDs every month and/or load FTP files every night.
ftp://test.nysedata.com