Key takeaways
- KDB-X Python (PyKX) allows q developers to expose existing q applications to Python without rewriting performance-critical logic.
- The most effective migration approach is usually a hybrid model, keeping data-intensive analytics in q while moving user-facing orchestration and integration layers to Python.
- PyKX contexts can automatically expose entire q codebases to Python when applications are cleanly organized into self-contained namespaces.
- AI coding tools can accelerate Python adoption for q developers, but understanding how q and Python differ in areas such as mutation, data types, and text handling remains important.
- By treating Python and q as complementary technologies rather than competing languages, teams can extend existing kdb+ investments while making applications accessible to a broader developer audience.
The scenario we consider
The starting point is an existing q program. You want to make some or all of its functionality available from Python — perhaps to reach a wider audience of callers, perhaps to sit behind Python tooling, perhaps simply because that is where the rest of your stack lives. The assumption is that you are a q programmer who knows little about Python. You have two decisions before you write any code: what moves to Python vs. what stays in q, and how you generate the top-level Python that calls the remaining q. Making those decisions well is what we address.
The goal of this presentation
My aim is to show how to bridge the two worlds of q and Python through KDB-X Python, told from the perspective of a q programmer who is not a Python expert. Along the way we offer recommendations on the steps you can take and flag the pitfalls that cost me time, so that you can avoid them. By the end you should have a clear picture of the hybrid Python-q program to produce, and a sense of which parts of the job are mechanical and which require judgment.
KDB-X Python overview
KDB-X Python gives Python programs access to q functionality and kdb data. Its orientation is deliberately *Python-first*: it is built for Python developers who may know little or no q. It exposes a large slice of q, including qSQL select, update, and the full range of column operations, through ordinary Python calls. We approach the bridge from the opposite direction, a perspective that colors our choices. Note that “Python-first” does not mean “Python-only”; the bridge carries traffic in both directions.
Getting set up
Get a KDB-X license if you do not already have one. It is free, and the setup is fast and easy. Install pykx in *licensed* mode specifically, because that is what lets you run q code locally inside Python — and run Python from q. You can convert to unlicensed mode later if that is what your eventual runtime environment requires. With that in place, get a current Python — I pulled 3.13 directly from python.org — and choose an IDE; I used VS Code because it is free and widely used. Install both the KX and the Python extensions. Before writing any of your own code, run the tests in the pykx documentation to verify access from Python to q and from q to Python.
Use an AI code generator
Even if you know Python, the AI likely knows it better than you do; the only Python I had ever written was “Hello, World.” On a colleague’s recommendation I used Claude. Install the recent KX plug-ins for Claude Code. Claude was invaluable for exposing my app functionality to Python: the upper code layer was ported in a few days from a standing start. One caveat: with the KX plug-ins, Claude’s pykx knowledge is up to date, but it still writes q code that doesn’t compile.
KDB-X Python contexts
Before you start porting by hand, check whether you can avoid much of the work. pykx has a context interface that can pull a q application into Python wholesale, provided the application is cleanly partitioned. The requirement is that your code lives entirely in namespaces, with each context self-contained. That is, the code may call q’s built-in functions and its own namespaced functions, but none of its functions is defined in the root namespace and it calls nothing else that lives in root. In that case pykx can load the entire application into Python automatically, with your user-defined routines callable on par with the q built-ins and no function aliasing required.
I could not take this path because my application had some core functions directly in root. Refactoring was possible, but reorganizing a working q application solely to suit the Python port was not a trade I was willing to make. So as I converted the code to Python, I paid the (relatively small) price of aliasing functions individually. If you are starting fresh, or your codebase is already tidy, take the automatic path; if not, read on.
The Iceberg Model: Render unto Python that which is Python’s, and unto q that which is q’s
Picture the desired target as an iceberg.
The visible tip is Python: top-level functions, their arguments, and any simple data used directly in Python should be native Python types. You can push the waterline down as far as appropriate to expose the desired functionality in Python. Stop when you reach core q operations — which stay in q.
The submerged base belongs to q: heavy vector number crunching remains in q, as do tables and keyed tables that are created and maintained in q, because qSQL is more powerful than pandas. In our toy example, we have a heavy vector calculation sq and a large trades table.
Do the code conversion incrementally, top-down
Start at a single entry point on the q side, ideally a simple function; while you are learning, wrap it in something simpler if that helps. Alias each user-defined q function it calls with kx.q['funcname']. Now rewrite the body line by line: first in pykx q data types, then shifting to Python types and Python operations. Lists and dictionaries are the key here: they cross the boundary without friction. In my project, the data I used at the top level naturally settled into Python lists and dicts. Then recurse: descend into each aliased function and repeat the process as far down the stack as you need to go.
We demonstrate this approach with our toy Iceberg example. The first code page shows one end of the spectrum: doing everything in q from Python. The second code page shows the other end of the spectrum: doing everything in Python down to the boundary calls to q.
Step 1: Everything stays in q, called from Python. Each q expression is handed to kx.q() as a string.
Step 2: Everything moves to Python with pykx, down to the boundary. kx.q['sq'] is used for the function call, the trades table is fetched into Python, and the select is Column-based; only the q primitives remain.
A working version of our toy example is in the following Jupyter notebook.
Recommendations and caveats
The following observations are based on my own experience porting the top level of a relatively sophisticated q application.
Be careful what you ask for: A cautionary tale
The quality of the AI’s code and analysis depends directly on how carefully you frame the request. Early on I asked Claude to port a q function to Python. After some back and forth the result ran its test fine. A few steps later it failed the moment I iterated the function. Claude pointed out that “my” Python code was mutating the parameters it had been passed. A few more rounds produced code that iterated correctly. The lesson: I should have described how the function would be *used*, not just what it should return.
Python mutation
The underlying trap of the cautionary tale is worth understanding directly. Python passes object references by value: mutating a passed object in place is visible to the caller, while merely rebinding the name is not. This means objects passed as arguments can be changed out from under their owner, even when you don’t intend it, simply because you modified them in the body of a function. Since q does not pass arguments by reference, this may be terra incognita for q programmers.
TL;DR: If you plan to use an argument as part of something else, make sure you have a copy.
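A minimal sketch of the trap in plain Python, with no pykx involved. The function names `add_tag_in_place` and `add_tag_copy` and the sample dictionary are hypothetical, chosen only to illustrate the difference between mutating an argument and building a new value from it.

```python
def add_tag_in_place(strm, tag):
    # Mutates the list inside the caller's dict: the caller sees the change.
    strm['tag'].append(tag)
    return strm


def add_tag_copy(strm, tag):
    # Builds a new dict and a new list: the caller's objects are untouched.
    return {**strm, 'tag': strm['tag'] + [tag]}


original = {'inp': ['a', 'b'], 'tag': []}

safe = add_tag_copy(original, 'w_1')
tags_after_copy = list(original['tag'])       # caller's list is still empty

add_tag_in_place(original, 'w_1')
tags_after_mutation = list(original['tag'])   # caller's list has changed
```

Note that `{**strm}` is only a shallow copy: nested lists are still shared with the original unless they are rebuilt as well, which is why the sketch creates a new list for `tag` rather than appending to the old one.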
Use C style; avoid O-O
Object-oriented constructs have no q analogue — I have never once pined for inheritance in q. An O-O Python design will steadily diverge from q, making it painful to reconcile future enhancements. Aim for C-style code instead. It will have a q accent; the more sophisticated your q, the stronger that accent will be.
Strings vs symbols
Python strings are sequences of Unicode code points, which can lead to surprises in q, where a single UTF-8 character may span several bytes. pykx converts a Python string to a q symbol by default when it crosses the boundary, which is the correct call. A q string, meanwhile, corresponds to a Python bytes object, so you must convert explicitly when a q function expects a char vector: b"abc" or kx.CharVector("abc"). As a side note, since q’s string functionality is comparatively weak, consider rewriting string-heavy logic on the Python side rather than shuttling it back to q. For the full set of text-conversion rules and helpers, see KX’s guide.
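The length mismatch can be seen in plain Python, without pykx. This is a small sketch, using a Thai word chosen only as an example; the assumption is that the q side holds the text as UTF-8 bytes, so the byte count is what a `count` on the corresponding char vector would report.

```python
word = "เด็ก"                       # a Thai word: 4 Unicode code points
as_bytes = word.encode("utf-8")     # the UTF-8 bytes a char vector would hold

n_code_points = len(word)           # Python counts code points
n_bytes = len(as_bytes)             # each of these Thai code points takes 3 bytes

round_trip = as_bytes.decode("utf-8")   # decoding recovers the original str
```

Any index or slice computed on one side of the boundary therefore cannot be reused on the other side without converting between code-point and byte offsets.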
User-defined column operations
Most q operators are built into pykx for column manipulation, using an OO syntax that maps onto the equivalent q composition. Bonus points: if you look closely at the pykx query syntax, it is q query functional form in O-O drag. Your *own* function is the exception — you cannot simply apply it to a column. Check the pykx documentation for defining and registering column functions; the shape is a small wrapper plus a registration call:
```python
def my_calc(column):
    # calc: your own q function, defined elsewhere
    return column.call(calc)

kx.register.column_function('myCalc', my_calc, overwrite=True)
```
Registration is static, so use the `overwrite` flag when you need dynamic redefinition.
Index or key not found
In q, an out-of-bounds list index or a nonexistent dictionary key hands you an appropriate null. In Python the same access raises an exception (IndexError or KeyError), and your q program may depend on that null without your realizing it. Guard dictionary lookups with .get() rather than direct indexing. That said, leaning on the null is bad practice to begin with, so it’s better to fix this in q. As my father used to say: you’re cruisin’ for a bruisin’.
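A minimal sketch of the guards in plain Python. The helper `at` is hypothetical (it is not part of Python or pykx), and the sample data is invented for illustration.

```python
def at(seq, i, default=None):
    # Hypothetical helper: index a list, returning `default` instead of raising
    # IndexError. Negative indices are treated as out of bounds here, rather
    # than wrapping around as plain Python indexing would.
    return seq[i] if 0 <= i < len(seq) else default


prices = {'AAPL': 190.5}
toks = ['a', 'b']

missing_price = prices.get('MSFT')          # None instead of KeyError
defaulted_price = prices.get('MSFT', 0.0)   # explicit default value
missing_tok = at(toks, 5)                   # None instead of IndexError
present_tok = at(toks, 1)
```

Choosing `None` as the default is itself a design decision: unlike a typed q null, it will not flow silently through arithmetic, so code that used to depend on the null tends to fail early and visibly.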
Projections and higher-order functions
Python has projections, but they cannot be implicit; you must make them explicit with functools.partial, and it is clunky: partial(my_func, arg). Interestingly, you *can* project an aliased q function by omitting arguments, which softens the blow. Python also has higher-order functions, but they are less powerful than q’s, so point-free constructions will be problematic and you must write them out as explicit lambdas. Expect your elegant q one-liners to turn ugly.
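The following sketch shows what explicit projection and composition look like in plain Python. The functions `scale` and `compose` are illustrative names, not library functions; the q equivalents in the comments assume a definition such as `scale:{[factor;xs] factor*xs}`.

```python
from functools import partial, reduce


def scale(factor, xs):
    return [factor * x for x in xs]


# Explicit projection: fixes the first argument. In q this would be scale[2;].
double = partial(scale, 2)


def compose(*funs):
    # Left-to-right composition: compose(f, g)(x) == g(f(x)).
    return lambda x: reduce(lambda acc, f: f(acc), funs, x)


double_then_sum = compose(double, sum)
result = double_then_sum([1, 2, 3])
```

`partial` fixes leading positional arguments only; to fix a later argument you need a keyword argument or a lambda, which is part of why the Python version feels heavier than a q projection.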
The result
What I ended up with is a Python-first hybrid, with about 10% of overall code above the waterline. The top-level routines are entirely Python, converted iteratively down to the layer that does the core q computation and retrieval. Data in the upper layers lives in Python lists and dictionaries — the exception being any fields that are tables constructed and manipulated on the q side. The latter are handled through pykx with its version of qSQL, which remains more powerful than pandas DataFrames. The Python reads like C with a q accent, and there is no OO in sight.
The following is the entire top-level q code from my app and the corresponding Python port. Look for the red/blue colored sections at the end of each code block to start at the top-level entry points procTh and rdThaiTr.
### The top-level q — `tw.q`
```q
// ===================================================================
// tw.q -- WORD layer (.tw namespace)
// Word segmentation, the pAll dispatch, caching, entry points
// (.tw.procTh / .tw.rdThaiTr). Absolute dotted names in ROOT context.
// The single up-call into the syllable layer is .tw.procWord -> .ts.matchXSylls.
// Load LAST.
// ===================================================================
////// word level processing
.tw.initialWord:{
    // iterate to find the shortest (first) word at beginning of x
    env:([toks:x; isStem:1b; initWord:(); potWordToks:()]);
    r1:{[env]
        {"break";}[];
        potWordToks:env[`potWordToks],1#env `toks;
        $[isStemToks potWordToks;
            [
                env[`potWordToks]:potWordToks;
                env[`toks]:1_env `toks;
                if[isWordToks potWordToks;
                    ///word:raze string potWordToks;
                    ///env[`span]:word;
                    env[`initWord]:potWordToks;
                    ];
                env
            ];
            @[env; `isStem; :; 0b]
        ]
        }/[{x[`isStem]&not[count x`initWord]&count x`toks}; env];
    r1[`initWord]}
.tw.xxAhead:{[n; toks] (0=count toks)|(toks[n] in PUNCT)|not isThaiTok toks[n]}
.tw.xxAheadCache:{[n; toks] (0=count toks)|(toks[n] in PUNCTNOMAI)|not isThaiTok toks[n]}
.tw.initialName:{
    if[count .tw.initialWord x; :`$()];
    // iterate to find the shortest (first) word at beginning of x
    env:([toks:x; isFollowed:0b; initName:(); potNameToks:()]);
    r1:{[env]
        /{break;}[];
        //show env;
        potNameToks:env[`potNameToks],1#env `toks;
        $[count[.tw.initialWord[1_env[`toks]]]|.tw.xxAhead[1; env[`toks]];
            @[env; `isFollowed`initName; :; (1b; potNameToks)];
            @[env; `toks`potNameToks; :; (1_env `toks; potNameToks)]
        ]
        }/[{not[x[`isFollowed]]&count x`toks}; env];
    r1[`initName]}
.tw.pwName:{[strm]
    /{break;}[];
    if[(isEmpty strm)|isSuccess[pNonThai[mkTagTh`nonthai] strm]|isSuccess Num strm;:Fail[naked strm; "no input"]];
    name:.tw.initialName strm[`inp];
    /{break;}[];
    resStrm:take[count name][mkTagTh `nm] strm;
    resStrm[`succ]:0<count name;
    resRec:([index:-1; typ:`unrec; toks:name; text:toks2text name]);
    resStrm[`aux]:resStrm[`aux] uj enlist resRec;
    resStrm[`back]:(); / barrier: a non-word unit stops a rollback
    retainInp resStrm}
.tw.pwThaiNum:{[strm]
    /{break;}[];
    if[(isEmpty strm)|isSuccess pNonThai[mkTagTh`nonthai] strm; :Fail[naked strm; "no Thai input"]];
    resStrm:(many1 Num) strm;
    resRec:([index:-2; typ:`num; toks:resStrm[`out]; text:toks2text resStrm[`out]]);
    resStrm[`aux]:resStrm[`aux] uj enlist resRec;
    resStrm[`back]:(); / barrier: a non-word unit stops a rollback
    retainInp resStrm}
.tw.pwThaiUnk:{[strm]
    /{break;}[];
    if[(isEmpty strm)|isSuccess pNonThai[mkTagTh`nonthai] strm; :Fail[naked strm; "no input"]];
    resStrm: pThai[mkTagTh`unkthai] strm;
    resRec:([index:-99; typ:`unkthai; toks:resStrm[`out]; text:toks2text resStrm[`out]]);
    resStrm[`aux]:resStrm[`aux] uj enlist resRec;
    resStrm[`back]:(); / barrier: a non-word unit stops a rollback
    retainInp resStrm}
.tw.pwNonThai:{[strm]
    /{break;}[];
    if[(isEmpty strm)|isSuccess pThai[mkTagTh`thai] strm; :Fail[naked strm; "no input"]];
    resStrm:(many1 pNonThai[mkTagTh`nonth]) strm;
    resRec:([index:-3; typ:`nonthai; toks:resStrm[`out]; text:toks2text resStrm[`out]]);
    resStrm[`aux]:resStrm[`aux] uj enlist resRec;
    resStrm[`back]:(); / barrier: a non-word unit stops a rollback
    retainInp resStrm}
.tw.pwPaiyanyai:{[strm]
    resStrm:Paiyanyai strm;
    /{break;}[];
    resRec:([index:-5; typ:`paiyanyai; toks:resStrm[`out]; text:toks2text resStrm[`out]; ipa:"..."]);
    resStrm[`aux]:resStrm[`aux] uj enlist resRec;
    resStrm[`back]:(); / barrier: a non-word unit stops a rollback
    retainInp resStrm}
.tw.pwBlank:{[strm]
    /{break;}[];
    resStrm:(many1 Blank) strm;
    /{break;}[];
    resRec:([index:-4; typ:`blank; toks:1#`; text:toks2text resStrm[`out]]);
    resStrm[`aux]:resStrm[`aux] uj enlist resRec;
    resStrm[`back]:(); / barrier: a non-word unit stops a rollback
    retainInp resStrm}
.tw.initenv:{[isCache; strm] ([xxAhead:$[isCache; .tw.xxAheadCache; .tw.xxAhead]; strm; isStem:1b; initWord:""; potWordToks:(); span:(); candStrm:Fail[stream"";""]])};
.tw.fndWord:{[isCache; strm]
    r1:{[env]
        / {"break";}[];
        potWordToksPrev:env[`potWordToks];
        potWordToks:env[`potWordToks],1#env[`strm][`inp];
        ///$[isTksStem provWordTks;
        $[isStemToks potWordToks;
            [
                env[`potWordToks]:potWordToks;
                env[ `strm;`inp]:1_env[`strm][`inp];
                /{break;}[];
                // not blocked after next gulp
                ct1:count[.tw.initialWord env[`strm; `inp]];
                canCapture:ct1|env[`xxAhead][0; env[`strm][`inp]];
                /canContinue2:0N!count[initialWord ct1 _ env[`strm; `inp]]|xxAhead[ct1; env[`strm][`inp]];
                ///if[isWordToks[potWordToks]&count[initialWord env[`strm; `inp]]|xxAhead[0; env[`strm][`inp]];
                /{breaktestcap;}[];
                if[isWordToks[potWordToks]&canCapture;
                    /{breakcaptword;}[];
                    word:raze string potWordToks;
                    env[`span]:word;
                    env[`prevWord]:env[`candStrm];
                    potWordTags:`$("w_",string[lkpIndexStr word],"_"),/:string potWordToks;
                    env[`candStrm]:@[env[`strm]; `succ`out`tag; :;(1b; potWordToks; potWordTags)];
                    ];
                env
            ];
            @[env; `isStem; :; 0b]
        ]
        }/[{x[`isStem]&not[x[`xxAhead][0; x[`strm][`inp]]]&count x[`strm][`inp]}; .tw.initenv[isCache; strm]];
    r1 }
.tw.procMaiyamok:{[r1; strm; rawWord]
    mymk:(many1 Maiyamok) r1`strm;
    wrdStrm0:$[
        isSuccess[mymk]&isWordToks r1`potWordToks; / there is a base word in dict
        [
            /// wordToks:r1[`potWordToks],MAIYAMOK;
            ///wordTags:`$("w_",string[lkpIndexToks wordToks],"_"),/:string wordToks;
            ///fullWord:instStreamX[1b; wordToks; `$(); wordTags; ""; ()];
            fullWord:@[mymk; `out`tag; {y,x}; rawWord`out`tag];
            newIndex:lkpIndexToks fullWord[`out];
            fullWord[`tag]:(-1_fullWord[`tag]),`$"w_",string newIndex;
            fullWord
        ];
        isSuccess[mymk]; / there is no base word in dict; take from dipa
        [ /append fake word to end of THAIDICT
            appendIndex:1+exec max index from THAIDICT;
            /{breakMymk;}[];
            repIpa:" " vs string lkpIPAToks strm`inp;
            ipa:`$" " sv (count[repIpa] div 2)#repIpa;
            maiRow:([word:`$raze string r1`potWordToks; index:appendIndex; en:`; ipa; syllCount:1; exception:`]);
            `THAIDICT upsert maiRow; / working dict: needed by the next-line lkpIndexToks
            `MaiAppends upsert maiRow; / durable capture: persisted -> MaiAppends.dat -> THAIDICTFULL
            toks:r1[`potWordToks],mymk`out;
            newIndex:lkpIndexToks toks;
            fullWord:@[strmEmpty; `out; :; toks];
            fullWord[`tag]:count[toks]#`$"w_",string newIndex;
            fullWord
        ];
        rawWord
        ];
    wrdStrm0}
.tw.calcTr:{[trtgt; ipa] {[trtgt; ipa] r1:.xtr.cross[`ipa; trtgt; stream ipa]; raze r1`ictgt`vdiatgt`fctgt}[trtgt;] each ipa};
.tw.procWord:{[strm; wrdStrm; prevWord]
    r2:$[count wrdStrm[`out];
        [
            wrdRec:lkpByIndex "I"$first 1_"_" vs string last wrdStrm`tag;
            /{breakwordrec;}[];
            (wordSym; dipaSym; dsyllCount; dexception):wrdRec `word`ipa`syllCount`exception;
            syllStream:stream string wordSym; / put 0N! here to see word as it is processed
            dipaRem:" " vs string dipaSym;;
            // process syll
            env:([strm:syllStream; dipaRem; drec:wrdRec; cclass:`; txsyll:()]);
            /{break;}[];
            ///match:mtchXSyll/[{count[x[`strm][`inp]]}; env];
            /{breakobeformatch;}[];
            match:.ts.matchXSylls/[{count[x`dipaRem]&not `Maiyamok~x`name}; env];
            /{breakaftermatch;}[];
            ///txsyll:update typ:`syll, tr:calcTr[trtgt;] ipa from match`txsyll;
            /txsyll:{[trtgt;xsyll] update typ:`syll, tr:raze .xtr.cross[`ipa; trtgt; stream xsyll `ipa][`ictgt`vdiatgt`fctgt] from xsyll}[trtgt;] each match`txsyll;
            txsyll:update typ:`syll from match`txsyll;
            txsyll1:$[`Maiyamok~match`name; txsyll,update typ:`maiy from txsyll; txsyll];
            /{breakpostmatch;}[];
            ext1:update index:wrdRec[`index], word:wrdRec[`word], en:wrdRec[`en] from txsyll1;
            ext2: wrdStrm[`aux] uj ext1;
            wrdStrm[`aux]:`word`index`typ`toks`ipa`en xcols ext2;
            pushPrior[wrdStrm; prevWord]
        ];
        Fail[strm; ""]
        ];
    /{breakendprocword;}[];
    r2}
.tw.procWordCache:{[strm; wrdStrm]
    r2:$[count wrdStrm[`out];
        [
            ix:"I"$first 1_"_" vs string last wrdStrm`tag;
            /{breakwordcache;}[];
            wrdStrm[`aux]:wrdStrm[`aux] uj select from SYLLSFULL where index=ix;
            wrdStrm
        ];
        Fail[strm; ""]
        ];
    /{breakendprocword;}[];
    r2}
// rollOrFail — the one rollback DECISION point, factored out of pwWord1 so the
// PyKX port can call it at the seam (PyKX_Port_Plan.md Gap 3f). Given the current
// stream and the just-built word-stream wrdStrm0:
// word found -> use wrdStrm0 (no rollback)
// else back stack -> roll back one committed boundary (peek-with-reattach),
// logging the reverted-to word to ROLLHITS
// else -> Fail (no rollback target)
// All `back` internals and ROLLHITS telemetry stay below the waterline; the port
// just calls .tw.rollOrFail[strm; wrdStrm0] and carries the result opaquely.
.tw.rollOrFail:{[strm; wrdStrm0]
    if[(0=count wrdStrm0[`out]) & 0<count strm`back; ROLLHITS,:enlist peekBack[strm;1]`out];
    $[count wrdStrm0[`out]; wrdStrm0;
        count strm`back; @[peekBack[strm;1]; `back; :; -1 _ strm`back];
        Fail[naked strm; "no rollback target"]] };
.tw.pwWord1:{[isCache; strm]
    // iterate to find the longest word at beginning of x
    /{breakpwWord}[];
    if[(isEmpty strm)|isSuccess[pNonThai[mkTagTh`nonthai] strm]|isSuccess[Num strm]; :Fail[naked strm; "no input"]];
    r1:.tw.fndWord[isCache; strm];
    (prevWord;rawWord):r1[`prevWord`candStrm];
    /{breakMai;}[];
    wrdStrm0:$[isCache; rawWord; .tw.procMaiyamok[r1; strm; rawWord]];
    /{breakr2;}[];
    // if we have hit a wall roll back one step (decision + telemetry now live in .tw.rollOrFail)
    wrdStrm:.tw.rollOrFail[strm; wrdStrm0];
    r2:$[isCache; .tw.procWordCache[strm; wrdStrm]; .tw.procWord[strm; wrdStrm; prevWord]];
    /{breakendword}[];
    retainInp r2};
////// main entry points
.tw.diddleEn:{[en] (`$(" " sv) sublist[3;]@";" vs) each string en }
.tw.pAll:{[isCache] choice (.tw.pwWord1[isCache;]; .tw.pwName; .tw.pwThaiNum; .tw.pwPaiyanyai; .tw.pwThaiUnk; .tw.pwBlank; .tw.pwNonThai)}
.tw.procTh:{[trtgt; strng:(),]
    r0:(many .tw.pAll[0b]) stream strng;
    r:update typ:`syll, tr:.tw.calcTr[trtgt;] ipa from r0`aux;
    r}
.tw.rdThaiTr:{[trtgt; strng] txsyll:.tw.procTh[trtgt; strng]; select index, toks, `$ipa, `$compipa, `$tr, en:.tw.diddleEn en, name, isPass, isMatch, descr from txsyll}
.tw.rd:.tw.rdThai:.tw.rdThaiTr[`ar;]
```
### The Python port — `hello.py`
```python
# hello.py manual test harness
# See hello_usage.md for example commands and how to run this script.
from functools import partial, reduce
from pathlib import Path
import pykx as kx
import pandas as pd
REPO_ROOT = Path(__file__).resolve().parent
Q_DIR = REPO_ROOT
def qpath(filename):
    return str(Q_DIR / filename)
def apply(arg, fun):
    return fun(arg)
def compose(*funs):
    return partial(reduce, apply, funs)
# ThaiReader.q self-loads its deps (ParserComb, ReferenceTables, xtr, ThaiDict at L9-12);
# add only the two layers it doesn't: ts, tw. Load order owned by ThaiReader.q.
kx.q.system.load(qpath('ThaiReader.q'))
kx.q.system.load(qpath('ts.q'))
kx.q.system.load(qpath('tw.q'))
THAIDICT = kx.q('THAIDICT')
# pykx.CharVector('qwerty')
toQString = partial(kx.toq, ktype=kx.CharVector)
str2toks = compose (toQString, kx.q['str2toks'])
# stream = compose (toQString, kx.q['stream'])
def qstrm_to_py(rQ):
    # A q stream object -> Python dict, keeping `aux` and `back` as NATIVE q values
    # (mirrors how `aux` was always handled). They cross the waterline opaquely;
    # only q reads/writes their internals. Without the native `back`, the rollback
    # silently no-ops. Used wherever a q function hands a stream back to Python.
    rP = rQ.py()
    rP['aux'] = rQ['aux']
    rP['back'] = rQ['back']
    return rP
def stream(strng):
    return qstrm_to_py(kx.q['stream'](kx.CharVector(strng)))
#isEmpty = kx.q['isEmpty']
def isEmpty(strm):
    return len(strm['inp']) == 0
#isSuccess = kx.q['isSuccess']
def isSuccess(strm):
    s = strm['succ']
    return s.py() if hasattr(s, 'py') else bool(s)
Fail = kx.q['Fail']
naked = kx.q['naked']
mkTag = kx.q['mkTagTh']
#choice = kx.q['choice']
many = kx.q['many']
many1 = kx.q['many1']
toks2text = kx.q['toks2text']
#retainInp = kx.q['retainInp']
def retainInp(strm):
    return {**strm, 'out': [], 'tag': [], 'err': ""}
strmEmpty = kx.q['strmEmpty']
isStemToks = kx.q['isStemToks']
isWordToks = kx.q['isWordToks']
lkpIndexStr = kx.q['lkpIndexStr']
lkpIndexToks = kx.q['lkpIndexToks']
lkpIPAToks = kx.q['lkpIPAToks']
lkpByIndex = kx.q['lkpByIndex']
PUNCT = kx.q['PUNCT'].py()
# PUNCTNOMAI = PUNCT minus the ๆ (MAIYAMOK) token, so ๆ doesn't break the cache
# lookahead (xxAheadCache). Bind q's value directly (ThaiReader.q: PUNCT except "ๆ");
# the old `kx.q['PUNCT']` made it identical to PUNCT, defeating xxAheadCache.
PUNCTNOMAI = kx.q['PUNCTNOMAI'].py()
isThaiTok = kx.q['isThaiTok']
pThai = kx.q['pThai']
pNonThai = kx.q['pNonThai']
Maiyamok = kx.q['Maiyamok']
Paiyanyai = kx.q['Paiyanyai']
Blank = kx.q['Blank']
Num = kx.q['Num']
# xxAhead = kx.q['xxAhead']
def xxAhead(n, toks):
    """
    Lookahead boundary test: is position `n` in `toks` a word boundary?
    True when any of:
    - there's no token at position n (n is at or past the end), or
    - the token at n is punctuation (in PUNCT), or
    - the token at n is not a Thai token.
    PUNCT and isThaiTok are resolved at module scope.
    """
    return (
        n >= len(toks)
        or toks[n] in PUNCT
        or not isThaiTok(toks[n])
    )
def xxAheadCache(n, toks):
    # PUNCTNOMAI = PUNCT minus the ๆ token, so ๆ no longer breaks the search
    return (n >= len(toks)
            or toks[n] in PUNCTNOMAI
            or not isThaiTok(toks[n]))
xtr_cross = kx.q['.xtr.cross']
matchXSylls = kx.q['.ts.matchXSylls']
calcTr = kx.q['.tw.calcTr']
def choicePy(*parsers):
    def parser(strm):
        for p in parsers:
            r = p(strm)
            if isSuccess(r):
                return r
        return Fail(strm, 'no alternative matched')
    return parser
def manyPy(p):
    def parser(strm):
        if isEmpty(strm):
            return Fail(strm, 'empty input')
        cur, outs, tags = strm, [], []
        while True:
            r = p({**cur})  # <-- pass a shallow copy
            if not isSuccess(r):
                break
            if r.get('inp') == cur.get('inp'):
                break
            outs += r.get('out', [])
            tags += r.get('tag', [])
            cur = r
        return {**cur, 'succ': True, 'out': outs, 'tag': tags}
    return parser
def many1Py(p):
    inner = manyPy(p)
    def parser(strm):
        r = inner(strm)
        return r if r['out'] else Fail(strm, 'many1: no match')
    return parser
def initial_word(x):
    env = {
        'toks': list(x),       # remaining tokens to consume
        'isStem': True,        # current prefix is still a valid stem
        'initWord': [],        # result: first complete word found
        'potWordToks': [],     # accumulated candidate prefix
    }
    # q: r1: {...}/[{cond}; env] -- "do while cond" converge
    while env['isStem'] and not env['initWord'] and env['toks']:
        # q: potWordToks: env[`potWordToks], 1#env`toks
        pot = env['potWordToks'] + env['toks'][:1]
        if isStemToks(pot):
            env['potWordToks'] = pot
            env['toks'] = env['toks'][1:]  # q: 1_env`toks
            if isWordToks(pot):
                env['initWord'] = pot  # loop exits next iter
        else:
            env['isStem'] = False  # q: @[env;`isStem;:;0b]
    return env['initWord']
def init_env(isCache, strm):
    return {
        'xxAhead': xxAheadCache if isCache else xxAhead,
        'strm': {**strm},  # COPY (de-mutation fix)
        'isStem': True,
        'initWord': "",
        'potWordToks': [],
        'span': [],
        'candStrm': Fail(stream(""), ""),
    }
def fnd_word(isCache, strm):
    env = init_env(isCache, strm)
    xxa = env['xxAhead']  # the cache or non-cache lookahead
    while (env['isStem']
           and not xxa(0, env['strm']['inp'])
           and len(env['strm']['inp']) > 0):
        pot = env['potWordToks'] + env['strm']['inp'][:1]
        if isStemToks(pot):
            env['potWordToks'] = pot
            env['strm']['inp'] = env['strm']['inp'][1:]
            ct1 = len(initial_word(env['strm']['inp']))
            canCapture = ct1 or xxa(0, env['strm']['inp'])
            if isWordToks(pot) and canCapture:
                word = "".join(str(t) for t in pot)
                env['span'] = word
                env['prevWord'] = env.get('candStrm')  # q .tw.fndWord: env[`prevWord]:env[`candStrm]
                prefix = f"w_{lkpIndexStr(toQString(word))}_"
                pot_word_tags = [prefix + str(t) for t in pot]
                env['candStrm'] = {**env['strm'], 'succ': True,
                                   'out': pot, 'tag': pot_word_tags}
        else:
            env['isStem'] = False
    return env
def proc_maiyamok(
    r1,
    strm,
    raw_word
):
    """
    Handle a trailing Maiyamok (repeat-mark) after a candidate word.
    Three cases mirror the q $[...] cascade:
    1. Parser succeeds AND the prior tokens already form a dictionary
       word. Splice rawWord's out/tag in front of mymk's, then rewrite
       the *last* tag to point at the combined word's index.
    2. Parser succeeds but the prior tokens are NOT yet a dictionary
       word. Fabricate a new THAIDICT entry (IPA = first half of
       lkpIPAToks's output, on the assumption the second half is the
       maiyamok repetition), upsert it, then build the combined
       stream from a fresh strmEmpty template.
    3. Parser fails. Pass rawWord through unchanged.
    """
    # q: mymk: (many1 Maiyamok) r1`strm
    mymk = many1(Maiyamok)(r1['strm']).py()
    if isSuccess(mymk) and isWordToks(r1['potWordToks']):
        # ----- case 1: base word already in dict ---------------------------
        # q: full_word = @[mymk; `out`tag; {y,x}; rawWord`out`tag]
        # {y,x} swaps args, so each field becomes rawWord_field ++ mymk_field
        full_word = dict(mymk)
        full_word['out'] = raw_word['out'] + mymk['out']
        full_word['tag'] = raw_word['tag'] + mymk['tag']
        new_index = lkpIndexToks(full_word['out'])
        # drop final tag, append one pointing at the new combined index
        full_word['tag'] = full_word['tag'][:-1] + [f"w_{new_index}"]
        wrd_strm0 = full_word
    elif isSuccess(mymk):
        # ----- case 2: no base word in dict; synthesise one, capture durably ---
        # q tw.q:158-165 -- build the fake dict row (ipa = first half of the repeated
        # IPA) and upsert into BOTH the working dict (THAIDICT) and the durable log
        # (MaiAppends -> MaiAppends.dat -> THAIDICTFULL). Done in q against the NAMED
        # globals so the very-next lkpIndexToks (which reads q's THAIDICT) sees the new
        # word, and the capture persists -- a Python-side snapshot upsert would do
        # neither. This is Gap 4 (the prior Python only touched THAIDICT, in-snapshot).
        pot = kx.SymbolVector(r1['potWordToks'])
        inp = kx.SymbolVector(strm['inp'])
        kx.q('''{[pot;inp]
            appendIndex:1+exec max index from THAIDICT;
            repIpa:" " vs string lkpIPAToks inp;
            ipa:`$" " sv (count[repIpa] div 2)#repIpa;
            maiRow:([] word:enlist `$raze string pot; index:enlist appendIndex;
                en:enlist `; ipa:enlist ipa; syllCount:enlist 1; exception:enlist `);
            `THAIDICT upsert maiRow;
            `MaiAppends upsert maiRow; }''', pot, inp)
        toks = r1['potWordToks'] + mymk['out']
        new_index = lkpIndexToks(toks)
        full_word = strmEmpty.py()  # q: @[strmEmpty; `out; :; toks]
        full_word['out'] = toks
        full_word['tag'] = [f"w_{new_index}"] * len(toks)  # q: count[toks]#`$"w_",string newIndex
        wrd_strm0 = full_word
    else:
        # ----- case 3: parser failed; pass through -------------------------
        wrd_strm0 = raw_word
    return wrd_strm0
#r1 = fnd_word(stream("เด็กๆ"))
#prevWord, rawWord = r1['prevWord'], r1['candStrm']
#wrdStrm0 = proc_maiyamok(r1, strm, rawWord)
def match_XSylls(env):
    """
    Read fields from env, derive dipaDet via xtr_cross, package into a
    native q Dictionary, and hand it to getSyll. Read-only on env.
    """
    strm = env['strm']
    dipaRem = env['dipaRem']
    dipa = dipaRem[0]  # q: first dipaRem
    # q: dipaDet: .xtr.cross[`ipa;`ipa;] 0N!stream dipa (0N! = debug print)
    s = stream(dipa)
    # print(s)
    dipaDet = xtr_cross('ipa', 'ipa', s)  # your Python wrapper
    dexception = env['drec']['exception']
    # q: getSyll([env; strm; cclass:env`cclass; dipa; dipaDet; dipaRem; dexception])
    arg = kx.Dictionary({
        'env': env,
        'strm': strm,
        'cclass': env['cclass'],
        'dipa': dipa,
        'dipaDet': dipaDet,
        'dipaRem': dipaRem,
        'dexception': dexception,
    })
    return kx.q('.ts.getSyll', arg)
def proc_word(
    strm,
    wrdStrm,
    prevWord,
):
    """
    Process one word-stream: look up its dictionary record by parsing
    the index out of the last tag, segment into syllables, transliterate
    each via .xtr.cross, optionally duplicate syllables tagged `maiy`
    when the matched name is Maiyamok, then merge the rows into
    wrdStrm['aux'].
    If wrdStrm has no output tokens, returns Fail(strm, '') instead.
    """
    if 0 < len(wrdStrm['out']):
        # q: wrdRec: lkpByIndex "I"$first 1_"_" vs string last wrdStrm`tag
        # parse "w_<idx>_<...>" from the last tag, pull <idx>, look it up
        last_tag = str(wrdStrm['tag'][-1])
        idx = int(last_tag.split('_')[1])
        wrdRec = lkpByIndex(idx)
        # q: (wordSym;dipaSym;dsyllCount;dexception): wrdRec`word`ipa`syllCount`exception
        # dsyllCount and dexception are destructured in the q but never read;
        # kept here for parity
        wordSym, dipaSym, dsyllCount, dexception = (
            wrdRec['word'],
            wrdRec['ipa'],
            wrdRec['syllCount'],
            wrdRec['exception'],
        )
        # q: syllStream: stream string 0N!wordSym (0N! is debug print)
        # print(wordSym)
        syllStream = stream(str(wordSym))
        # q: dipaRem: " " vs string dipaSym
        dipaRem = str(dipaSym).split(' ')
        # q: env: ([strm:syllStream; dipaRem; drec:wrdRec; cclass:`; txsyll:()])
        env = {
            'strm': syllStream,
            'dipaRem': kx.q.string(dipaRem),
            'drec': wrdRec,
            'cclass': kx.toq(''),
            'name': '',
            'txsyll': [],
        }
        # q: match: matchXSylls/[{count[x`dipaRem] & not `Maiyamok~x`name}; env]
        # iterate matchXSylls while dipaRem is non-empty AND name != Maiyamok
        match = kx.toq(env)
        while 0 < len(match['dipaRem']) and match.get('name') != 'Maiyamok':
            match = match_XSylls(match)
        # q: calcTr: {[trtgt;ipa] r1:.xtr.cross[`ipa;trtgt;stream ipa]; raze r1`ictgt`vdiatgt`fctgt}
        # closes over trtgt; xtr_cross + stream are injected at the outer scope
        # q: txsyll: update typ:`syll, tr:calcTr[trtgt;] each ipa from match`txsyll
        # columnar update on the txsyll table: set typ broadcast, tr per-row
        # q: txsyll:update typ:`syll from match`txsyll -- table op stays in q (data,
        # not algorithm), avoiding a pandas round-trip on a pykx Table.
        txsyll = kx.q('{update typ:`syll from x}', match['txsyll'])
        # q: txsyll1:$[`Maiyamok~match`name; txsyll, update typ:`maiy from txsyll; txsyll]
        # a Maiyamok match duplicates the syllable rows, retagged `maiy (q row-concat).
        if match.get('name') == 'Maiyamok':
            txsyll1 = kx.q('{x, update typ:`maiy from x}', txsyll)
        else:
            txsyll1 = txsyll
        # ext1:update index:wrdRec[`index], word:wrdRec[`word], prevWord:enlist prevWord, en:wrdRec[`en] from txsyll1;
        # ext2: wrdStrm[`aux] uj ext1;
        # wrdStrm[`aux]:`word`index`typ`toks`ipa`tr`en xcols ext2;
        # q .tw.procWord:198 -- ext1:update index:..., word:..., en:... from txsyll1
        # (no prevWord column; the current q dropped it -- prevWord is only used for
        # pushPrior now, not as a table column).
        ext1 = txsyll1.update(
            columns=kx.Column('index', data=wrdRec['index'])
            & kx.Column('word', data=[wrdRec['word']])
            & kx.Column('en', data=[wrdRec['en']])
        )
        ext2 = kx.q.uj(wrdStrm['aux'], ext1)
        # wrdStrm['aux'] = kx.q.xcols(kx.SymbolVector(['word', 'index', 'typ', 'toks', 'ipa', 'en']), ext2)
        # r2 = wrdStrm
        new_aux = kx.q.xcols(
            kx.SymbolVector(['word', 'index', 'typ', 'toks', 'ipa', 'en']), ext2)
        committed = {**wrdStrm, 'aux': new_aux}  # fresh stream; argument untouched
        # q: pushPrior[wrdStrm; prevWord] -- commit this boundary onto `back` so a
        # later mis-segmentation can roll back to it. The primitive (incl. the
        # 99h=type pw guard and depth-2 sublist) stays in q; we carry the result
        # back across the waterline opaquely.
        r2 = qstrm_to_py(kx.q['pushPrior'](committed, prevWord))
    else:
        # q: Fail[strm; ""]
        r2 = Fail(strm, '')
    return r2
def proc_word_cache(strm, wrdStrm, prevWord):
    """
    Cache-backed word processor: look the word's syllables up in SYLLSFULL
    by index and union-join into aux. Non-mutating. prevWord unused (kept
    for call-site parity with proc_word).
    """
    if len(wrdStrm['out']) > 0:
        # q: ix:"I"$first 1_"_" vs string last wrdStrm`tag
        ix = int(str(wrdStrm['tag'][-1]).split('_')[1])
        rows = kx.q('{[ix] select from SYLLSFULL where index=ix}', ix)
        new_aux = kx.q.uj(wrdStrm['aux'], rows)
        r2 = {**wrdStrm, 'aux': new_aux}  # fresh stream
    else:
        r2 = Fail(strm, "")
    return r2
# pwWord = kx.q['pwWord']
def pwWord(strm):
    if (isEmpty(strm)
            or isSuccess(pNonThai(mkTag('nonthai'))(strm))
            or isSuccess(Num(strm))):
        return Fail(naked(strm), "no input")
    r1 = fnd_word(False, strm)
    prevWord, rawWord = r1.get('prevWord'), r1.get('candStrm')
    wrdStrm0 = proc_maiyamok(r1, strm, rawWord)
    if len(wrdStrm0['out']) > 0:
        wrdStrm = wrdStrm0
    else:
        wrdStrm = strm['aux'].exec(kx.Column('prevWord').last())
    r2 = proc_word(strm, wrdStrm, prevWord)
    return retainInp(r2)
def pwWord1(is_cache, strm):
    if (isEmpty(strm)
            or isSuccess(pNonThai(mkTag('nonthai'))(strm))
            or isSuccess(Num(strm))):
        return Fail(naked(strm), "no input")
    r1 = fnd_word(is_cache, strm)
    prevWord, rawWord = r1.get('prevWord'), r1.get('candStrm')
    wrdStrm0 = rawWord if is_cache else proc_maiyamok(r1, strm, rawWord)
    # The rollback DECISION lives in q (.tw.rollOrFail): word found -> use wrdStrm0;
    # else roll back one committed boundary (and log to ROLLHITS); else Fail. The
    # peek-with-reattach, the `back` internals and the telemetry all stay below the
    # waterline -- Python just calls it and carries the result opaquely.
    wrdStrm = qstrm_to_py(kx.q['.tw.rollOrFail'](strm, wrdStrm0))
    if is_cache:
        r2 = proc_word_cache(strm, wrdStrm, prevWord)
    else:
        r2 = proc_word(strm, wrdStrm, prevWord)
    return retainInp(r2)
def pwWordCache(strm):
    if (isEmpty(strm)
            or isSuccess(pNonThai(mkTag('nonthai'))(strm))
            or isSuccess(Num(strm))):
        return Fail(naked(strm), "no input")
    r1 = fnd_word(True, strm)
    prevWord, rawWord = r1.get('prevWord'), r1.get('candStrm')
    wrdStrm0 = rawWord  # procMaiyamok eliminated
    if wrdStrm0 is not None and len(wrdStrm0['out']) > 0:
        wrdStrm = wrdStrm0
    else:
        wrdStrm = strm['aux'].exec(kx.Column('prevWord').last())
    r2 = proc_word_cache(strm, wrdStrm, prevWord)
    return retainInp(r2)
pwName = kx.q['.tw.pwName']
# pwNonThai = kx.q['pwNonThai']
# q source, kept for reference:
#   if[(isEmpty strm)|isSuccess pThai[mkTag`thai] strm; :Fail[naked strm; "no input"]]
#   resStrm:(many1 pNonThai[mkTag`nonth]) strm;
#   resRec:([index:-3; typ:`nonthai; toks:resStrm[`out]; text:toks2text resStrm[`out]]);
#   resStrm[`aux]:resStrm[`aux] uj enlist resRec;
#   retainInp resStrm}
def pwNonThai(strm):
    if isEmpty(strm) or isSuccess(pThai(mkTag('nonthai'))(strm)):
        return Fail(naked(strm), toQString("found Thai input"))
    resStrm = qstrm_to_py(many1(pNonThai(mkTag('nonth')))(strm))  # keep aux/back native
    # q: resRec:([index:-3; typ:`nonthai; toks:resStrm`out; text:toks2text resStrm`out])
    # explicit q types so the 1-row record matches the oracle (typ/text symbols,
    # toks a symbol-vector cell); a plain dict mis-typed these and collided on uj.
    out = kx.SymbolVector(resStrm['out'])
    resRec = {'index': -3, 'typ': kx.SymbolAtom('nonthai'),
              'toks': out, 'text': kx.q('toks2text', out)}
    resStrm['aux'] = kx.q.uj(resStrm['aux'], kx.q.enlist(resRec))
    resStrm['back'] = kx.List([])  # barrier: a non-word unit stops a rollback
    return retainInp(resStrm)
def pwThaiNum(strm):
    if isEmpty(strm) or isSuccess(pNonThai(mkTag('nonthai'))(strm)):
        return Fail(naked(strm), toQString("no Thai input"))
    resStrm = qstrm_to_py(many1(Num)(strm))  # keep aux/back native
    # q: resRec:([index:-2; typ:`num; toks:resStrm`out; text:toks2text resStrm`out])
    out = kx.SymbolVector(resStrm['out'])
    resRec = {'index': -2, 'typ': kx.SymbolAtom('num'),
              'toks': out, 'text': kx.q('toks2text', out)}
    resStrm['aux'] = kx.q.uj(resStrm['aux'], kx.q.enlist(resRec))
    resStrm['back'] = kx.List([])  # barrier: a non-word unit stops a rollback
    return retainInp(resStrm)
# pwPaiyanyai = kx.q['pwPaiyanyai']
# q source, kept for reference:
#   resStrm:Paiyanyai strm;
#   /{break;}[];
#   resRec:([index:-5; typ:`paiyanyai; toks:resStrm[`out]; text:toks2text resStrm[`out]; ipa:"..."]);
#   resStrm[`aux]:resStrm[`aux] uj enlist resRec;
#   retainInp resStrm}
def pwPaiyanyai(strm):
    if isEmpty(strm) or isSuccess(pNonThai(mkTag('nonthai'))(strm)):
        return Fail(naked(strm), toQString("no acceptable input"))
    resStrm = qstrm_to_py(Paiyanyai(strm))
    # q: resRec:([index:-5; typ:`paiyanyai; toks:resStrm`out; text:toks2text resStrm`out; ipa:"..."])
    out = kx.SymbolVector(resStrm['out'])
    resRec = {'index': -5, 'typ': kx.SymbolAtom('paiyanyai'),
              'toks': out, 'text': kx.q('toks2text', out), 'ipa': toQString("...")}
    resStrm['aux'] = kx.q.uj(resStrm['aux'], kx.q.enlist(resRec))
    resStrm['back'] = kx.List([])  # barrier: a non-word unit stops a rollback
    return retainInp(resStrm)
pwThaiUnk = kx.q['.tw.pwThaiUnk']
# pwBlank = kx.q['pwBlank']
def pwBlank(strm):
    if isEmpty(strm) or isSuccess(pThai(mkTag('nonthai'))(strm)):
        return Fail(naked(strm), toQString("no acceptable input"))
    resStrm = qstrm_to_py(many1(Blank)(strm))  # keep aux/back native
    # q: resRec:([index:-4; typ:`blank; toks:1#`; text:toks2text resStrm`out])
    # toks is a single null symbol (1#`), NOT the matched blanks.
    out = kx.SymbolVector(resStrm['out'])
    resRec = {'index': -4, 'typ': kx.SymbolAtom('blank'),
              'toks': kx.q('1#`'), 'text': kx.q('toks2text', out)}
    resStrm['aux'] = kx.q.uj(resStrm['aux'], kx.q.enlist(resRec))
    resStrm['back'] = kx.List([])  # barrier: a non-word unit stops a rollback
    return retainInp(resStrm)
#def pAll(trtgt):
# r = kx.q['choice']([kx.q['pwWord'](trtgt), kx.q['pwName'], kx.q['pwThaiNum'], kx.q['pwPaiyanyai'], kx.q['pwBlank'], kx.q['pwNonThai']]);
# return r
pAll = choicePy(pwWord, pwThaiNum, pwPaiyanyai, pwBlank, pwNonThai)
def pAll_py(is_cache):  # the enclosing closure IS required
    return choicePy(partial(pwWord1, is_cache), pwThaiNum, pwPaiyanyai, pwBlank, pwNonThai)
def procTh(trtgt, strng):
    # q_strng = kx.toq(strng, ktype=kx.CharVector)
    # strm = stream (q_strng)
    strm = stream(strng)
    r1 = manyPy(pAll_py(False))(strm)
    txsyll = r1['aux']
    calc = calcTr(trtgt)
    def calc_tr(column):
        return column.call(calc)
    kx.register.column_function('calcTr', calc_tr, overwrite=True)
    # q .tw.procTh: update typ:`syll, tr:.tw.calcTr[trtgt;] ipa from r0`aux
    # tr via the registered column function; typ flattened to `syll for ALL rows in q
    # (overrides parser-set nonthai/maiy/blank -- a constant-symbol Column is awkward).
    txsyll1 = txsyll.update(columns=kx.Column('ipa').calcTr().name('tr'))
    txsyll1 = kx.q('{update typ:`syll from x}', txsyll1)
    return txsyll1
def diddle_en(column):
    return column.call('.tw.diddleEn')
kx.register.column_function('diddleEn', diddle_en)
def as_sym(column):
    return column.call('{`$x}')
kx.register.column_function('asSym', as_sym)
def rdThaiTr(trtgt, strng):
    txsyll = procTh(trtgt, strng)
    r = txsyll.select(columns=(
        kx.Column('word') & kx.Column('index') & kx.Column('toks') &
        kx.Column('ipa').asSym() & kx.Column('compipa').asSym() & kx.Column('tr').asSym() &
        kx.Column('en').diddleEn() & kx.Column('name') & kx.Column('isPass') &
        kx.Column('isMatch') & kx.Column('descr')))
    return r
```
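
The listing assembles its top-level parser from `choicePy`, `manyPy` and `partial`, whose definitions are not shown here. As a rough illustration of that shape, the sketch below builds the same kind of pipeline in plain Python with no q dependency. Everything in it (`stream`, `fail`, `choice`, `many`, `letter_run`, `digit_run`, `p_all`) is a hypothetical stand-in written for this example, not the code behind `choicePy` or `manyPy`, and the zero-or-more behaviour of `many` is an assumption.

```python
from functools import partial

def stream(text):
    """Hypothetical stream: remaining input plus the tokens emitted so far."""
    return {"ok": True, "inp": text, "out": []}

def fail(strm, msg):
    """Failure result that leaves the input position unchanged."""
    return {"ok": False, "inp": strm["inp"], "out": strm["out"], "msg": msg}

def choice(*parsers):
    """Try each parser in order; return the first success, else the last failure."""
    def run(strm):
        result = fail(strm, "no alternatives")
        for parser in parsers:
            result = parser(strm)
            if result["ok"]:
                return result
        return result
    return run

def many(parser):
    """Apply parser until it fails or stops consuming input (zero or more times)."""
    def run(strm):
        current = strm
        while True:
            nxt = parser(current)
            if not nxt["ok"] or len(nxt["inp"]) == len(current["inp"]):
                return current
            current = nxt
    return run

def _run_of(pred, transform, strm, msg):
    """Consume the longest leading run of characters satisfying pred."""
    inp = strm["inp"]
    n = 0
    while n < len(inp) and pred(inp[n]):
        n += 1
    if n == 0:
        return fail(strm, msg)
    return {"ok": True, "inp": inp[n:], "out": strm["out"] + [transform(inp[:n])]}

def digit_run(strm):
    return _run_of(str.isdigit, lambda tok: tok, strm, "no digits")

def letter_run(upper, strm):
    # `upper` plays the role of is_cache: a flag bound ahead of time with partial
    return _run_of(str.isalpha, str.upper if upper else (lambda tok: tok), strm, "no letters")

def p_all(upper):
    # The flag is fixed once here, so every alternative takes just a stream.
    return choice(partial(letter_run, upper), digit_run)

result = many(p_all(True))(stream("ab12cd"))
print(result["out"], repr(result["inp"]))
```

The point of `p_all` is the same as `pAll_py` in the listing: a combinator such as `choice` expects parsers of one argument, so any extra flag has to be bound before the parser is handed over.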


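Both `proc_word` and `proc_word_cache` translate the q expression `"I"$first 1_"_" vs string last wrdStrm`tag` into `int(tag.split('_')[1])`. One difference is worth keeping in mind: a q cast of unparseable text such as `"I"$"abc"` yields a null integer, whereas Python's `int` raises `ValueError`. The helper below is a hypothetical sketch (the name `tag_index` and the choice of `None` as the null value are assumptions for illustration) that keeps the failure quiet, closer to the q behaviour.

```python
def tag_index(tag):
    """Return the integer between the first and second underscore of a tag
    shaped like "w_<idx>_<...>", or None when it is missing or not numeric."""
    parts = str(tag).split("_")
    if len(parts) < 2:
        return None  # no underscore at all
    try:
        return int(parts[1])
    except ValueError:
        return None  # stands in for q's null int

print(tag_index("w_42_x"), tag_index("w_abc"), tag_index("w"))
```

Whether a silent `None` or a raised exception is preferable depends on the caller; the listing as written lets the exception propagate.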

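The `{**wrdStrm, 'aux': new_aux}` idiom in the listing is what keeps the Python functions non-mutating. In q, amending a parameter inside a function leaves the caller's value unchanged; in Python, assigning into a dict argument changes the caller's object. The sketch below uses plain dicts and lists as stand-ins for the stream and its `aux` table (the name `with_aux` and the sample data are made up for this example) to show both the benefit and the limit of the idiom.

```python
def with_aux(strm, new_aux):
    """Return a new stream dict with 'aux' replaced; the argument is not modified."""
    return {**strm, "aux": new_aux}

original = {"inp": "abc", "out": ["a"], "aux": [{"index": 1}]}
updated = with_aux(original, original["aux"] + [{"index": 2}])

# The caller's stream still has its one-row aux.
print(len(original["aux"]), len(updated["aux"]))

# The copy is shallow: values that were not replaced are shared, not duplicated.
print(updated["out"] is original["out"])
```

Because the copy is shallow, an in-place change to a shared value (for example appending to `updated["out"]`) would still be visible through `original`. Replacing whole entries, as the listing does with `aux`, avoids that.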
