Democratizing analytics: Natural language querying with kdb+

Author

Michael Ryaboy

Developer Advocate

Key Takeaways

  1. kdb+ is engineered for ultra-low latency and high-throughput environments.
  2. Combining kdb+ with natural language capability democratizes complex query creation.
  3. Automating visualizations can be achieved via a simple function or with tools such as pandas-ai.

In finance, billion-row time series datasets are common, which is why financial institutions have long trusted kdb+, the world’s fastest time series database for mission-critical workloads. Unlike traditional solutions, which move data to external tools for processing, kdb+ executes queries directly using its vectorized language, q. This minimizes latency and avoids the overhead of data movement, but can require careful planning and specialist skills when crafting complex queries.
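
To make "executing directly in q" concrete, here is a minimal PyKX sketch. The table and column names below are illustrative, not the dataset used later in this blog: the point is that the aggregation runs inside q, and only the small result crosses back into Python.

Python
import pykx as kx

# Build a tiny illustrative table in q (these names are made up)
kx.q['quotes'] = kx.Table(data={
    'sym': ['AAPL', 'AAPL', 'TSLA'],
    'px': [189.1, 189.4, 251.2],
})

# The vectorized aggregation runs inside q, not in Python
print(kx.q('select avgPx: avg px by sym from quotes'))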

In this blog, I will demonstrate how to democratize this process, combining kdb+ and PyKX (KX’s Python interface for kdb+) with natural language understanding to accelerate time to value.

I’ll show you how to:

  • Import millions of rows into a kdb+ table
  • Generate queries via an LLM
  • Convert returned data frames to pandas
  • Create a plot function

If you’d like to follow along, you can do so via my Colab notebook.

Let’s begin.

Data ingestion

To begin, we will install PyKX (KX’s Python interface for kdb+) under its free personal license. Then, for simplicity and convenience, we will ingest a sample stock dataset containing 3 million minute-level time series records.

Python
import pandas as pd
import pykx as kx

# Load a sample CSV and push it into q as 'trades'
pandas_tbl = pd.read_csv(
    'https://huggingface.co/datasets/opensporks/stocks/resolve/main/stocks.csv',
    parse_dates=['timestamp']
)
pandas_tbl = pandas_tbl.rename(columns={"close": "closePrice"})
tbl = kx.toq(pandas_tbl)         # DataFrame → q Table
kx.q['trades'] = tbl             # assign to the q global 'trades'

Let’s examine the results:

Python
# Get the table's column datatypes and keep them in a variable
table_desc = tbl.dtypes
print(table_desc)

# Get the earliest and latest timestamps
# (min/max rather than first/last, so this works even if the table isn't time-sorted)
min_date = kx.q('min trades`timestamp').pd()
max_date = kx.q('max trades`timestamp').pd()

print(f"\nMinimum timestamp: {min_date}")
print(f"Maximum timestamp: {max_date}")

[Screenshot: table datatypes output]

As the output shows, the table contains ticker symbols, floats, and timestamps.

Natural language translation

Next, we’ll create a translator that converts plain English questions into qSQL queries. I’m using this method because LLMs are not yet fully proficient in writing q, kdb+’s native query language. However, because qSQL is broadly ANSI-compliant, we can bridge the gap by prompting our LLM with the subtle differences.

Python
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = 'Create a valid sql query based on this table schema and natural language query'

class TranslateResponse(BaseModel):
    qsql: str

def translate_to_qsql(nl_query: str, table: str, schema: str) -> str:
    user = (
        f"Table: {table}\n"
        f"Schema: {schema}\n"
        f"Question: {nl_query}"
    )
    resp = client.beta.chat.completions.parse(
        model="o3",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",   "content": user},
        ],
        response_format=TranslateResponse,
    )
    return resp.choices[0].message.parsed.qsql
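
As a quick sanity check, you can call the translator directly. The question below is just an example, and the returned query will vary from run to run.

Python
# Illustrative call; the exact query returned will differ between runs
schema_text = str(tbl.dtypes)
print(translate_to_qsql("average closePrice for each sym", "trades", schema_text))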

In the full notebook, the system prompt contains detailed instructions about qSQL syntax and best practices, which helps ensure valid and efficient queries for kdb+. This is very much a work in progress, so I welcome any feedback or suggestions from community members.
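
For illustration, that guidance might look something like the sketch below. These rules are my paraphrase and assumptions, not the notebook’s exact prompt.

Python
# An illustrative sketch of richer prompt guidance; the notebook's actual
# prompt differs, and these rules are assumptions rather than a spec.
SYSTEM_PROMPT = (
    "Create a valid SQL query for kdb+ based on this table schema and "
    "natural language question.\n"
    "- Return a single SELECT statement with no comments or DDL.\n"
    "- Use only the columns listed in the schema.\n"
    "- Quote string and symbol literals with single quotes, e.g. sym = 'AAPL'.\n"
    "- Prefer aggregating in the query rather than returning raw rows."
)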

Automated visualization

When queries are processed and returned, we’ll want AI to generate the appropriate visualization based on their associated data structures. In this instance, I have created my own visualization function; however, you might also consider using pandas-ai, which is a more robust and feature-rich solution.
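
If you go the pandas-ai route instead, usage might look roughly like this. This is a sketch based on its v2-era SmartDataframe interface; check the current docs, as the API changes between releases.

Python
from pandasai import SmartDataframe
from pandasai.llm import OpenAI as PandasAILLM

# df is any pandas DataFrame; the api_token value is a placeholder
sdf = SmartDataframe(df, config={"llm": PandasAILLM(api_token="YOUR_KEY")})
sdf.chat("Plot the average closePrice per sym as a bar chart")

Here is my own function: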

Python
import json
import plotly.express as px

def visualize_df_with_plotly(df, model="gpt-4o", sample_size=20, temperature=0):
    """
    Given a pandas DataFrame df and an OpenAI-compatible client,
    infers a schema and sample, asks the model to generate Plotly Express code,
    and then executes and displays the resulting chart.
    """
    # 1. Build schema and sample, converting timestamps to strings
    sample_df = df.head(sample_size).copy()
    for col, dtype in sample_df.dtypes.items():
        if "datetime" in str(dtype).lower() or "timestamp" in str(dtype).lower():
            sample_df[col] = sample_df[col].astype(str)
    sample = sample_df.to_dict(orient="list")
    schema_auto = {col: str(dtype) for col, dtype in df.dtypes.items()}

    # 2. Prepare prompts
    system = (
        "You're an expert Python data-visualizer. "
        "Given a DataFrame schema and sample data, "
        "write self-contained Plotly Express code that builds a figure "
        "from the DataFrame and assigns it to a variable named fig, "
        "but do not call fig.show(); I will handle displaying. "
        "You may compute and print supporting aggregations, "
        "as long as the figure is created. "
        "GIVE ME ONLY CODE AND NOTHING ELSE. "
        "Do not rewrite the values in the df; assume it exists globally as df. "
        "Think about what the user wants in the query and provide it. "
        "Never hardcode values."
    )
    user = (
        f"Schema: {json.dumps(schema_auto)}\n\n"
        f"Sample data: {json.dumps(sample, default=str)}"
    )
    # 3. Call the chat completion API
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user",   "content": user}
        ],
        temperature=temperature,
    )
    # 4. Extract the code from the response, stripping markdown fences if present
    chart_code = resp.choices[0].message.content
    if chart_code.startswith("```"):
        chart_code = chart_code.split("\n", 1)[1].rsplit("\n```", 1)[0]
    print(chart_code)
    # 5. Execute the generated code with df and px in scope
    local_vars = {"df": df, "px": px}
    exec(chart_code, {}, local_vars)
    # 6. Display the figure
    fig = local_vars.get("fig")
    if fig is not None:
        fig.show()
    else:
        raise RuntimeError("The generated code did not produce a fig object.")
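
To try the function on its own, you can feed it any DataFrame; the toy frame below is made up purely for illustration. Note that the generated code runs via exec, so in a production setting you would want to review or sandbox it first.

Python
import pandas as pd

# Made-up data purely to exercise the function
demo_df = pd.DataFrame({
    'sym': ['AAPL', 'TSLA', 'MSFT'],
    'avgPrice': [189.2, 250.4, 402.7],
})
visualize_df_with_plotly(demo_df)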

Plotting

Finally, we will combine everything into a powerful new function named query_and_plot, which will enable plot outputs to be generated from a natural language query.

Python
def query_and_plot(nl_query: str, table: str = "trades") -> None:
    """
    Given a natural-language question about the kdb+ table,
    translates it to qSQL, runs it, and renders an appropriate chart.
    """
    # --- translate to qSQL ---
    # derive the schema from the table in q
    q_tbl = kx.q[table]
    schema = str(q_tbl.dtypes)
    print(schema)

    # call the existing translator (assumes translate_to_qsql is defined globally)
    qsql = translate_to_qsql(nl_query, table, schema)
    print("🔧 Generated qSQL:\n", qsql)
    # --- execute the query via PyKX's SQL interface ---
    result = kx.q.sql(qsql)
    df = result.pd()
    print("Resulting df after AI query")
    display(df.head())
    print("\n── Resulting df description ──")
    display(df.describe(include="all"))
    visualize_df_with_plotly(df)

Real-world testing

Let’s now test the solution with some real-world examples. Remember, LLM outputs can be inconsistent, so your results may differ.

Example 1: Cumulative returns

To begin, I will ask the agent to chart the cumulative returns for both Tesla (TSLA) and Apple (AAPL).

Python
query_and_plot("Chart the cumulative returns for TSLA and AAPL")

[Chart: cumulative returns for TSLA and AAPL]

What this generates:

  • A q-SQL query that calculates cumulative returns
  • A multi-line chart showing performance over time

It also automatically handles date parsing and percentage calculations.

Example 2: Statistical analysis

Next, I will ask the agent to calculate the average price of each stock in our dataset.

Python
query_and_plot("what is the average price for each stock?"

[Chart: statistical analysis of average prices]

We can also ask which date had the highest closing price for each.

Python
query_and_plot("Which day had the highest closing price for each stock?")

[Chart: average price by sym]

What these queries generate:

  • Grouping queries that aggregate by stock symbol
  • An appropriate bar chart or table visualization
  • Clear labelling and formatting

By combining kdb+’s performance with AI’s natural language capabilities, we not only create a powerful tool for time series analysis but also open complex query building to anyone who can communicate in plain English. Both serve to enhance decision-making and accelerate new possibilities for innovation and growth.

If you enjoyed this blog, please check out my others on kx.com and connect with me on Slack. You can also find all of the code used in this blog in my Colab notebook.
