How to build an advanced PDF parsing solution for RAG

From documents to insights: Advanced PDF parsing for RAG

Key Takeaways

  1. Learn how to build RAG pipelines for complex files
  2. Learn how to transform complex datatypes into markdown
  3. Learn how to ingest into KDB.AI for efficient retrieval

Retrieval augmented generation (RAG) has long been used to connect data of interest to large language models (LLMs), enabling question-answering and insights based on specific datasets. A common challenge, however, is that important semi-structured data is often stored in complex file types such as PDFs, meaning developers must investigate solutions that can extract this information cleanly and efficiently.

In this blog, I will demonstrate how developers can overcome these challenges and build an advanced PDF parsing solution with KDB.AI and LlamaParse.

LlamaParse architecture

LlamaParse enables the creation of retrieval systems for complex documents. It does so by extracting data from documents and transforming it into easily ingestible formats such as markdown or text. Once transformed, data can be embedded and loaded into a RAG pipeline.

  • Supported file types: PDF, .pptx, .docx, .rtf, .pages, .epub, and more
  • Transformed output types: Markdown, text
  • Extraction capabilities: Text, tables, images, graphs, comic books, mathematical equations
  • Customized parsing instructions: Since LlamaParse is LLM-enabled, you can pass it instructions as if you were prompting an LLM. These can describe the document, define the output, or preprocess the content with sentiment analysis, language translation, or summarization
  • JSON mode: This mode outputs the complete structure of the document, extracts images with size and location metadata, and extracts tables in JSON format for easy analysis. It is ideal for custom RAG applications in which document structure and metadata are used to maximize informational value and to cite where retrieved nodes originate (a minimal sketch follows this list)
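
To make JSON mode concrete, here is a minimal, hypothetical sketch. It assumes the LLAMA_CLOUD_API_KEY from Step 2 is set, the file path is a placeholder, and the exact page fields may vary between LlamaParse versions:

Python
from llama_parse import LlamaParse

parser = LlamaParse()

# get_json_result returns one dictionary per input file; each contains a 'pages'
# list holding the text, items, and image metadata extracted from that page
json_results = parser.get_json_result("./example.pdf")  # hypothetical path
pages = json_results[0]["pages"]
print(pages[0]["text"][:200])

# Download any images referenced in the JSON output to a local folder
images = parser.get_images(json_results, download_path="./parsed_images")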

Markdown captures the inherent structure of the document by identifying elements such as titles, headers, subsections, tables, and images. This may seem trivial, but because markdown makes these elements explicit, we can easily split a document into smaller chunks using specialized parsers such as LlamaIndex's MarkdownElementNodeParser, which we will use in Step 7.

Guided walkthrough

In the following steps, we will build and test a simple RAG pipeline that ingests PDF files with LlamaParse. If you would like to follow along, sign up for our free KDB.AI cloud edition. You can also explore the code on GitHub or Colab.

Step 1: Install and import libraries:

We will begin by installing and importing libraries from LlamaIndex, pandas, OpenAI, and KDB.AI.

Python
!pip install llama-index
!pip install llama-index-core
!pip install llama-index-embeddings-openai
!pip install llama-parse
!pip install llama-index-vector-stores-kdbai
!pip install pandas
!pip install llama-index-postprocessor-cohere-rerank
!pip install kdbai_client

from llama_parse import LlamaParse
from llama_index.core import Settings
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.kdbai import KDBAIVectorStore
from getpass import getpass
import kdbai_client as kdbai

Step 2: Set up API keys for LlamaCloud and OpenAI:

Next, we will configure our API keys.

For help obtaining your own API keys, please refer to the LlamaCloud and OpenAI documentation.

Python
# llama-parse is async-first, running the async code in a notebook requires the use of nest_asyncio
import nest_asyncio
nest_asyncio.apply()

import os
# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-"

# Using OpenAI API for embeddings/llms
os.environ["OPENAI_API_KEY"] = "sk-"

Step 3: Set up KDB.AI

Now, we will set up and configure KDB.AI, a multi-modal vector database that enables scalable, real-time AI applications with advanced capabilities such as search, personalization, and RAG. It integrates temporal and semantic relevance into workflows, helping developers support high-performance, time-based, multi-modal data queries for enterprise workloads.

In the code below, we will connect to the default database, create a schema, define the index, and then create a table.

Python
# Set up the KDB.AI endpoint and API key

KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)


# Connect to KDB.AI

session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

# Connect to the 'default' database
db = session.database("default")


# The schema contains two metadata columns (document_id, text) and one embeddings column

schema = [
        dict(name="document_id", type="str"),
        dict(name="text", type="str"),
        dict(name="embeddings", type="float32s"),
    ]

# indexFlat: define the index name, type, the column to apply the index to (embeddings),
# and params, which include the search metric (Euclidean distance) and the embedding dims

indexFlat = {
        "name": "flat",
        "type": "flat",
        "column": "embeddings",
        "params": {'dims': 1536, 'metric': 'L2'},
    }

KDBAI_TABLE_NAME = "LlamaParse_Table"

# First ensure the table does not already exist

try:
    db.table(KDBAI_TABLE_NAME).drop()
except kdbai.KDBAIException:
    pass


# Create the table

table = db.create_table(KDBAI_TABLE_NAME, schema, indexes=[indexFlat])

Step 4: Download a PDF:

Next, we will download and import a sample PDF file, in this instance ‘LLM In-Context Recall is Prompt Dependent’ by Daniel Machlab and Rick Battle.

Python
!wget 'https://arxiv.org/pdf/2404.08865' -O './LLM_recall.pdf'

Step 5: Set up LlamaParse, LlamaIndex, & embedding model:

From here, we can define the embedding and generation models, register them with LlamaIndex's Settings, and specify the path to our PDF.

Python
EMBEDDING_MODEL  = "text-embedding-3-small"
GENERATION_MODEL = "gpt-4o"

llm = OpenAI(model=GENERATION_MODEL)
embed_model = OpenAIEmbedding(model=EMBEDDING_MODEL)

Settings.llm = llm
Settings.embed_model = embed_model

pdf_file_name = './LLM_recall.pdf'

Step 6: Create custom parsing instructions:

Next, we will define custom parsing instructions for our document. Because LlamaParse is LLM-enabled, these read like a prompt: they describe the document and state how questions about it should be answered.

Python
parsing_instructions = '''The document titled "LLM In-Context Recall is Prompt Dependent" is an academic preprint from April 2024, authored by Daniel Machlab and Rick Battle from the VMware NLP Lab. It explores the in-context recall capabilities of Large Language Models (LLMs) using a method called "needle-in-a-haystack," where a specific factoid is embedded in a block of unrelated text. The study investigates how the recall performance of various LLMs is influenced by the content of prompts and the biases in their training data. The research involves testing multiple LLMs with varying context window sizes to assess their ability to recall information accurately when prompted differently. The paper includes detailed methodologies, results from numerous tests, discussions on the impact of prompt variations and training data, and conclusions on improving LLM utility in practical applications. It contains many tables. Answer questions using the information in this article and be precise.'''

Step 7: Run LlamaParse and extract text & tables from markdown:

Next, we will parse the document, retrieve nodes (text) and objects (tables), and insert the markdown into the text of each table object before creating an index in KDB.AI and verifying the insertion with a query.

Python
documents = LlamaParse(result_type="markdown", parsing_instructions=parsing_instructions).load_data(pdf_file_name)
print(documents[0].text[:1000])

# Parse the documents using MarkdownElementNodeParser
node_parser = MarkdownElementNodeParser(llm=llm, num_workers=8)

# Retrieve nodes (text) and objects (table)
nodes = node_parser.get_nodes_from_documents(documents)

base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

# insert the table markdown into the text of each table object
for i in range(len(objects)):
  objects[i].text = objects[i].obj.text[:]

vector_store = KDBAIVectorStore(table)
storage_context = StorageContext.from_defaults(vector_store=vector_store)


# Create the index; this inserts base_nodes and objects into KDB.AI
recursive_index = VectorStoreIndex(
    nodes= base_nodes + objects, storage_context=storage_context
)

# Query KDB.AI to ensure the nodes were inserted
table.query()

Step 8: Helper functions to complete the RAG pipeline:

We will now define the following helper functions:

  • embed_query: Takes a user query and embeds it using OpenAI’s ‘text-embedding-3-small’
  • retrieve_data: Takes the query, calls the embed_query function to get the query embedding, then executes retrieval on KDB.AI to retrieve the most relevant nodes
  • RAG: Takes in the query, calls the retrieve_data function, and then passes the retrieved data to OpenAI’s GPT-4o LLM
Python
from openai import OpenAI
client = OpenAI()

def embed_query(query):
    query_embedding = client.embeddings.create(
            input=query,
            model="text-embedding-3-small"
        )
    return query_embedding.data[0].embedding

def retrieve_data(query):
    query_embedding = embed_query(query)
    # Search the 'flat' index for the 5 nearest neighbours; the filter excludes rows with a specific document_id
    results = table.search(
        vectors={'flat': [query_embedding]},
        n=5,
        filter=[('<>', 'document_id', '4a9551df-5dec-4410-90bb-43d17d722918')],
    )
    retrieved_data_for_RAG = []
    for index, row in results[0].iterrows():
        retrieved_data_for_RAG.append(row['text'])
    return retrieved_data_for_RAG

def RAG(query):
  question = "You will answer this question based on the provided reference material: " + query
  messages = "Here is the provided context: " + "\n"
  results = retrieve_data(query)
  if results:
    for data in results:
      messages += data + "\n"
  response = client.chat.completions.create(
      model="gpt-4o",
      messages=[
          {"role": "system", "content": question},
          {
          "role": "user",
          "content": [
              {"type": "text", "text": messages},
          ],
          }
      ],
      max_tokens=300,
  )
  content = response.choices[0].message.content
  return content
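
As an alternative to these hand-rolled helpers, the index built in Step 7 can also be queried through LlamaIndex itself, optionally reranking results with the Cohere reranker installed in Step 1. A minimal sketch, assuming a COHERE_API_KEY environment variable is set (the exact retriever arguments may differ slightly between library versions):

Python
import os
from llama_index.postprocessor.cohere_rerank import CohereRerank

# Rerank the retrieved nodes before they are passed to the LLM
cohere_rerank = CohereRerank(api_key=os.environ["COHERE_API_KEY"], top_n=5)

# Build a query engine over the index created in Step 7; vector_store_kwargs
# tells the KDB.AI vector store which index ('flat') to search
query_engine = recursive_index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[cohere_rerank],
    vector_store_kwargs={"index": "flat"},
)

response = query_engine.query(
    "Describe the needle-in-a-haystack method using only the provided information."
)
print(str(response))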

Test the solution

With the pipeline built, we can test how well the LLM answers questions using content retrieved from the ingested PDF.

Python
print(RAG("describe the needle in a haystack method only using the provided information"))

>>>The needle-in-a-haystack method involves embedding a factoid (referred to as the “needle”) within a block of filler text (referred to as the “haystack”). The model is then tasked with retrieving this embedded factoid. The recall performance of the model is evaluated across various haystack lengths and with different placements of the needle to identify patterns in performance. This method demonstrates that an LLM’s ability to recall information is influenced not only by the content of the prompt but also by potential biases in its training data. Adjustments to the model’s architecture, training strategy, or fine-tuning can enhance its recall performance, providing insights into LLM behavior for more effective applications.

Python
print(RAG("list the AI models that are evaluated with needle-in-a-haystack testing?"))

>>>Llama 2 13B, Llama 2 70B, GPT-4 Turbo, GPT-3.5 Turbo 1106, GPT-3.5 Turbo 0125, Mistral v0.1, Mistral v0.2, WizardLM, and Mixtral are the LLMs evaluated with needle-in-a-haystack testing. (Taken from a table within the PDF document)

Model Name                    Context Window Size
Llama 2 13B Chat              4,096 Tokens
Llama 2 70B Chat              4,096 Tokens
WizardLM 70B                  4,096 Tokens
GPT-3.5-Turbo-1106            16,385 Tokens
GPT-3.5-Turbo-0125            16,385 Tokens
Mistral 7B Instruct v0.1      32,768 Tokens
Mistral 7B Instruct v0.2      32,768 Tokens
Mixtral 8x7B Instruct v0.1    32,768 Tokens
GPT-4 Turbo 0125              128,000 Tokens

Python
print(RAG("what is the best thing to do in San Francisco?"))

>>>The best thing to do in San Francisco is to eat a sandwich and sit in Dolores Park on a sunny day. (Taken from a table within the PDF document)

Test Name         | Factoid                                                                                                             | Question
PistachioAI       | PistachioAI received a patent before its Series A                                                                   | What did PistachioAI receive before its Series A?
San Francisco     | The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.                     | What is the best thing to do in San Francisco?
Thornfield Hollow | The best thing to do in Thornfield Hollow is eat a sandwich and sit in Harmony Glen Nature Preserve on a sunny day. | What is the best thing to do in Thornfield Hollow?

In this blog, we explored building a retrieval-augmented generation pipeline for a complex PDF document. We used LlamaParse to transform the PDF into markdown format, extracted both text and tables, and then ingested the content into KDB.AI for retrieval.

If you enjoyed this blog, why not check out my others or try some of the other sample notebooks in the KDB.AI learning hub?
