by Nuša Žnuderl

‘Surprisingly, the New England Patriots won the 2017 Super Bowl after a 28-3 deficit.’ The evaluative adverb ‘surprisingly’ adds meaning to the sentence, expressing that the outcome of the 2017 Super Bowl was remarkable. ‘Nothing special!’ a proficient user of the English language might say. However, consider an unskilled user of the English language who wants to express the same idea but does not know how to use the adverb ‘surprisingly’. They might say: ‘The New England Patriots won the 2017 Super Bowl after a 28-3 deficit. Therefore, this outcome of the Super Bowl was unexpected.’ While they would still successfully convey the message, their form of speech might be classified as long-winded and clumsy.

As in the English language, adverbs in q augment operations, allowing them to be applied across lists. They make code shorter, clearer and almost always more efficient than the alternative loopy modus operandi – all of which are qualities that differentiate code written by proficient q users from the rest.

The theory of adverbs is well covered in traditional q/kdb+ manuals – for example, chapters 1 and 6 of Jeff Borror’s Q for Mortals and Kx’s developer brief Efficient Use of Adverbs by Conor Slattery – yet in my experience the average q user often struggles to use adverbs in their code. In this note I will attempt to take a step towards bridging this gap by demonstrating the application of adverbs in solutions to a handful of problems q users regularly come across in real life.

**We have a table with column id of type string. We can retrieve records with ids starting with “abc” like this: ’select from tab where id like “abc*”’, but how can we retrieve data for ids starting with “abc” or “def”?**

We construct our where clause to apply the verb ‘like’ to the list of our id stems using the adverb ‘each-right’ ( /: ), and collapse the result with the aggregate ‘any’.
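A minimal sketch of this idiom (the table and its contents are illustrative):

```q
/ illustrative table with a string id column
tab:([] id:("abc1";"defX";"ghi2";"abc9"); qty:10 20 30 40)

/ each-right (/:) applies 'like' to id once per pattern, giving one
/ boolean vector per pattern; 'any' collapses them into a single vector
select from tab where any id like/: ("abc*";"def*")
```

The result contains the rows whose ids start with either “abc” or “def”.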

*(Note, all code examples in this blog are also available on GitHub here.)*

**We are investigating data in a table and would like to see only the rows exhibiting changes in a set of fields that we are interested in.**

We can apply the uniform function ‘differ’ (which uses the adverb ‘each-prior’ ( ': ) to check the match between two consecutive items in a list) to each column of interest using the adverb ‘each-right’ ( /: ).

If we are interested in rows where all columns of interest changed simultaneously, we apply to the result the aggregate ‘all.’ If instead we are interested in retrieving rows where any columns of interest changed, we replace ‘all’ with ‘any.’

First, let’s apply this on a small, in-memory table.
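A sketch on an illustrative table, assuming the columns of interest are c1 and c2:

```q
t:([] id:`a`a`a`a; c1:1 1 2 2; c2:10 10 10 20)

/ each-right (@/:) applies 'differ' to every column of interest;
/ 'all' keeps rows where every column changed simultaneously,
/ 'any' keeps rows where at least one column changed
select from t where all differ @/: (c1;c2)
select from t where any differ @/: (c1;c2)
```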

However, what if instead of an in-memory table we are working with data stored in a date partitioned HDB, and we are trying to examine a daily time series of changes in a set of fields from this HDB that is too large to read into memory for a single date, never mind when looking across a date range? Such an examination can inform an efficient schema design where we separate fields which change together into separate tables.

We can approach this problem by reasoning as follows:

*Since we do not have enough memory to select all columns of interest even for a single date, we will compare a subset of columns of interest across 2 consecutive days at a time.*

To do this we write a function getDiffs which takes as inputs the HDB table name ‘tab’ (e.g. `ref), a symbol list of field names ‘fns’ we can read from the HDB in one go (e.g. `col1`col2), the dataset key ‘fk’ (e.g. `id) for which we want to track changes across days, the date ‘d’ for which we wish to run the comparison (e.g. 2017.02.10), and an in-memory table ‘seed’ consisting of fields `date`id`col1`col2 for the date prior to the comparison date d.

The function selects fields `date,fk,fns from the HDB for date d to a local table t. It then union-joins this table t with the seed, and orders it in ascending order of fk, such that if the same key exists on both dates, the records are adjacent. It then performs the “any-differ-each-right” algorithm we saw above and outputs the result.
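The original function is not reproduced here, so the following is a simplified sketch under the signature described above; the functional-select details are assumptions, and the seed parameter is placed before the date so the function composes with ‘over’ in the next step:

```q
/ sketch of getDiffs
getDiffs:{[tab;fns;fk;seed;d]
  c:`date,fk,fns;
  t:?[tab;enlist(=;`date;d);0b;c!c];   / read key and fields for date d only
  t:fk xasc seed uj t;                 / same key on both dates => adjacent rows
  t where any differ @/: t fns         / the "any-differ-each-right" filter
  }
```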

*We now need to apply this function recursively across the date range of interest. We can do this by applying adverb over ( / ) to getDiffs, and in addition to constant parameters ‘tab’, ‘fns’ and ‘fk’, pass to it the table ‘seed’ which is the starting point for our comparison, and a list of dates.*

This application of the adverb ‘over’ achieves recursion: on the first run the function getDiffs is applied with ‘seed’ and the first date in our list; on the second iteration getDiffs runs using the table output from the first iteration as its first parameter and the second date in our list as the second parameter; and so on until it has been applied to all dates in the list. We wrap this logic in a function getPartDiffs:
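A sketch of this wrapper, assuming getDiffs takes the seed and date as its final two parameters:

```q
/ over (/) feeds each iteration's output back in as the seed for the next date
getPartDiffs:{[tab;fns;fk;seed;dates]
  getDiffs[tab;fns;fk]/[seed;dates]
  }
```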

*In the next step we apply getPartDiffs to all fields of interest. We could split the full list of fields into manageable chunks we can read into memory, and apply getPartDiffs to every chunk using the ‘each’ function, which is defined using the adverb each-both as {x'y}. This would get differences in the first cluster of fields across the entire date range, then proceed to the next cluster, and so on in sequence until it runs through all clusters. However, we can optimise execution time if we run our session with multiple slaves (the -s command line parameter) and use the function parallel-each (peach) instead of each. peach is defined using the adverb each-parallel as {x':y}, and will simultaneously kick off the execution of as many field clusters as there are slaves available. We express this logic in the function getFullDiffs, where ‘fnr’ is the number of fields per cluster:*
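A sketch of getFullDiffs along these lines; the per-cluster seed construction inside the helper is an assumption:

```q
getFullDiffs:{[tab;fns;fk;fnr;dates]
  clusters:fnr cut fns;                / split the fields into chunks of fnr
  run:{[tab;fk;dates;fns]
    c:`date,fk,fns;
    / the seed is the data for the first date in the range
    seed:?[tab;enlist(=;`date;first dates);0b;c!c];
    getPartDiffs[tab;fns;fk;seed;1_dates]};
  / peach farms one cluster out to each available slave
  res:run[tab;fk;dates] peach clusters;
  (uj/) (`date,fk) xkey/: res          / key on date and fk, union-join (over)
  }
```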

The output of getPartDiffs invoked with peach is a list of tables, all of which we key on date and fk, and union-join into a single table which represents cumulative differences in a set of fields across a date range.

*For completeness, we may want to join to the output of getFullDiffs any ids which were present on our initial date but did not exhibit any differences during the time period we examined, and are therefore not present in the output of getFullDiffs. Also, we may want to wrap the procedure into a single main function like this:*

Sample invocation of main is:
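A possible shape for main and its invocation; the parameter order, field names and date range below are illustrative assumptions:

```q
main:{[tab;fns;fk;fnr;dates]
  res:getFullDiffs[tab;fns;fk;fnr;dates];
  / ids present on the initial date but absent from res (never changed);
  / uj on the keyed tables lets the computed differences take precedence
  c:`date,fk;
  init:?[tab;enlist(=;`date;first dates);0b;c!c];
  (c xkey init) uj res
  }

/ sample invocation: fields in clusters of 3, across 28 days
main[`ref;`col1`col2`col3;`id;3;2017.02.01+til 28]
```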

I ran this procedure across 28 days for 9 fields in a heavily repetitive HDB (stored on DAS) that contains approximately 2.5 million records a day. The session ran with 6 slaves. I chunked the 9 fields of interest into 3 clusters of 3 fields each. Execution completes in approximately 11 seconds, with peak memory consumption of 1.5GB.

**Our data is fragmented across multiple tables. We frequently need to access data across tables, but we do not always need all fields from every table. We know the fields that link different tables, but we may not remember which field resides in which table. We would like to create a generic function which looks for the fields of interest in the correct tables and returns the result in one table.**

This logic is wrapped into the function getData. It uses several adverbs to minimize the amount of data operations while remaining dynamic with respect to input fields:

*‘each’ is used in multiple places to modify a monadic function (cols, where, count) for application on every item in a list rather than the list as a whole;*

*‘each-right’ ( /: ) is used to modify the dyadic function ‘in’ such that it searches for the list of fields f in every list of columns in the dictionary d;*

*‘each-both’ ( ' ) is used to modify the select lambda by applying it pairwise to the corresponding lists of tables and fields;*

*‘over’ ( / ) modifies the dyadic ‘uj’ such that it applies cumulatively to the list of tables.*

Sample call:
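A sketch of getData following the bullets above; the search scope (all tables in the root namespace) and the sample field names are assumptions:

```q
getData:{[f]
  ts:tables[];
  d:ts!cols each ts;            / 'each': column list of every table
  p:f in/: d;                   / 'each-right': look for f in every column list
  ts:where any each p;          / keep tables holding at least one field
  fs:f @/: where each p ts;     / the matching fields per table
  r:{?[x;();0b;y!y]}'[ts;fs];   / 'each-both': select just those fields
  (uj/) r                       / 'over': union-join into a single table
  }

/ sample call (field names illustrative)
getData `id`price`qty
```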

**We have a daily time series of index constituents, their volumes and prices. The index constituency may change day over day. We calculate the daily volume weighted price of the index using the built-in function ‘wavg’. The time series is naturally jumpy, but we also notice a few outliers.**

Upon inspection we see that the outliers are caused by heavy index constituents which appear on a given day, and then completely disappear from the index. These are bad data points which we would like to exclude from the volume weighted average calculation. In particular, we define a rule that a constituent will only contribute to the wavg calculation if it is present in the index on 2 consecutive days.

To demonstrate the approach to solving this problem, I will use the following example, where wavg on 2016.09.15 is an outlier we want to further investigate.
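An illustrative dataset with this shape (all values invented for the example): constituent `f carries a heavy weight and appears only on 2016.09.15, producing the outlier.

```q
t:([] date:2016.09.12 2016.09.13 2016.09.13 2016.09.14 2016.09.14
      2016.09.15 2016.09.15 2016.09.15 2016.09.16 2016.09.16;
    sym:`a`a`b`a`b`a`b`f`a`b;
    v:100 110 200 120 190 130 180 900 140 170;
    p:10 10.2 20 10.4 19.8 10.6 19.6 99 10.8 19.4)

/ daily volume weighted price; 2016.09.15 stands out
select vwap:v wavg p by date from t
```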

We use xgroup to collapse the table to one row per date. This table contains 3 nested columns: for each date, the column sym contains a list of constituents, v contains a list of volumes and p contains a list of prices. Sym f on 2016.09.15 is the outlier we want to remove.
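Assuming the constituent table from the text sits in a variable t:

```q
g:`date xgroup t    / one row per date; sym, v and p are now nested lists
```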

In this collapsed table we can compare the list of constituents with that of the previous day and retain only common constituents, using ‘inter’ with the each-prior adverb ( ': ) on the nested column sym:
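A sketch, seeding each-prior with an explicit empty left argument so the first date compares against nothing (g is the collapsed table from above):

```q
/ each-prior ('): each day's sym list inter'd with the previous day's;
/ the empty seed leaves the first day with no valid syms
update validSym:() inter': sym from `g
```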

We observe that we successfully removed the outlier `f on 2016.09.15 – but it is worth noting that the price of using this method is that we lost some valid data points. For example, we now cannot calculate the average for the first day as there is no in-sample data to compare it to even though all data points may be valid. Also, sym `b enters the dataset on 2016.09.13 and is not an outlier, but we will only begin counting it as a valid sym on 2016.09.14.

We extend this approach one step further to retrieve the positions of validSym in the original list of syms. For this we use the ‘each-both’ ( ' ) adverb with ‘find’ ( ? ).
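Continuing on the collapsed table g:

```q
/ each-both ('): for each date, find (?) the positions of validSym
/ within that day's sym list
update pos:sym?'validSym from `g
```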

Now that we know the position of the data points we want to include in the calculation of wavg, we retrieve the lists of volumes and prices using apply ( @ ) with the adverb each-both ( ' ). We then use each-both once more in the application of wavg.
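Putting both steps together on g:

```q
/ apply (@) with each-both picks the valid volumes and prices per date,
/ and each-both on wavg computes the cleaned daily figure
update vwap2:(v@'pos) wavg' p@'pos from `g
```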

Finally, in order to demonstrate another useful application of each-both, let’s display the wavg calculation before (res1) and after (res2) the removal of the outlier side-by-side:
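Join with each-both ( ,' ) pastes two tables together record by record. Assuming res1 and res2 hold the before and after series (the values below are invented for illustration):

```q
res1:([] date:2016.09.14 2016.09.15 2016.09.16; vwapRaw:15.2 48.1 15.1)
res2:([] vwapClean:15.2 15.3 15.1)
res1,'res2    / each-both on join (,) aligns the two series side by side
```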

**Continuing from the previous example, we would like to further filter out the noise of price fluctuations and observe monthly trends in our price series, i.e. we want to examine the average, minimum, maximum, and standard deviation for a rolling window of 21 business days.**

The application of these functions to a rolling window is so frequent that they have their own built-in functions in q: mavg, mmin, mmax and mdev. They take as their first parameter the length of the moving window and a numeric list to apply the function to as their second parameter.
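For example, over a 21-day window one would write 21 mavg price; on a short illustrative list:

```q
x:10 12 11 13 30 12 11f
3 mavg x    / moving average over a 3-item window
3 mmin x    / moving minimum
3 mmax x    / moving maximum
3 mdev x    / moving standard deviation
```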

We can generalize the concept of applying a function to a moving window for an arbitrary function f. One approach is to use the adverb scan ( \ ) to build a moving list by adding one item each time while dropping the oldest value. We can then apply an arbitrary function f on every list using the adverb each.
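One such generalization is the well-known mwin idiom (the name follows the text; this sketch is one common formulation):

```q
/ scan (\) grows the window: drop the oldest item, append the next;
/ seeding with w nulls keeps every window w items long.
/ each then applies f to every window
mwin:{[f;w;x] f each {1_x,y}\[w#0n;x]}

mwin[avg;3;10 12 11 13 30 12 11f]    / comparable to 3 mavg
```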

We could also pass mwin a custom function f. For example, we may want to attach a higher weight to more recent data points in each moving window as defined in function wa:
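A possible wa, assuming linearly increasing weights (the original weighting scheme is not shown):

```q
/ weights 1..n: the most recent point in each window counts most
wa:{(1+til count x) wavg x}

mwin[wa;21;p]    / p: the daily price series from the text (name assumed)
```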

To sum up, using adverbs instead of looping constructs makes q code shorter and clearer, and the long-term benefit is vastly improved performance: by doing things the “q way,” you let the interpreter operate on whole lists at once.

*Nuša Žnuderl is an expert kdb+/q programmer and Kx consultant currently based in New York City. Code snippets were created by Jack Kiernan, a Kx Data Scientist, also currently based in New York. You can find the code shown in this article on the Kx Systems’ GitHub, in kxblog.contributions here. A video of Nuša presenting this paper at the Kx Community NYC Meetup is available here.*