Sourangshu Pal

Building Your First RAG Pipeline

Sourangshu Pal — Tue, 09 Dec 2025 18:30:00 GMT

Retrieval-Augmented Generation (RAG) is one of the most practical ways to make a large language model useful on your own data. The core idea is simple: before you ask the LLM a question, retrieve relevant documents from your own store and inject them into the prompt. The model is then instructed to answer from the retrieved context rather than from memorized training data.

Why RAG and Not Fine-Tuning?

Fine-tuning is expensive, slow, and requires retraining whenever your data changes. RAG lets you update the knowledge base (your vector store) without touching the model. For most enterprise use cases — internal docs, customer support, code search — RAG wins on cost and iteration speed.

The Four-Stage Pipeline

A minimal RAG system has four stages:

Source Documents → Chunker → Embedder → Vector Store
                                              ↑
                                         Query Time:
                                         Query → Embed → Retrieve → LLM → Answer

Stage 1: Chunking

Long documents don’t fit in a single embedding or a single prompt. Split them into overlapping chunks so context is preserved across boundaries.

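# Newer LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # measured in characters by default, not tokens
    chunk_overlap=50,    # characters shared between neighboring chunks
    separators=["\n\n", "\n", ". ", " "],  # tried in order, coarsest first
)
chunks = splitter.split_text(document_text)  # document_text: the raw document string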

Tip: Use chunk_overlap=50 as a baseline. For technical docs with dense references, go higher (100–150).
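To make the effect of overlap concrete, here is a minimal sketch of a fixed-size sliding-window chunker in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter also respects separators); the function name chunk_with_overlap is hypothetical and sizes are in characters.

```python
def chunk_with_overlap(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 10  # 100 characters
chunks = chunk_with_overlap(sample, chunk_size=40, overlap=10)
print(len(chunks))                          # 3
print(chunks[0][-10:] == chunks[1][:10])    # True: neighbors share 10 characters
```

A sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap, which is why denser documents benefit from larger values.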

Stage 2: Embedding

Convert each chunk into a dense vector. OpenAI’s text-embedding-3-large, which returns 3072-dimensional vectors by default, is a strong general-purpose choice for retrieval.

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def embed(texts: list[str]) -> list[list[float]]:
    response = await client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
        dimensions=3072,
    )
    return [item.embedding for item in response.data]

Stage 3: Storing in a Vector Database

Qdrant is a fast, production-ready vector database with first-class Python support.

from qdrant_client import AsyncQdrantClient, models

# Named `qdrant` so it does not shadow the OpenAI `client` used by embed()
qdrant = AsyncQdrantClient(url="http://localhost:6333")

# The "my_docs" collection must already exist with a named vector "dense"
await qdrant.upsert(
    collection_name="my_docs",
    points=[
        models.PointStruct(
            id=i,
            vector={"dense": embedding},
            payload={"text": chunk, "source": filename},
        )
        for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
    ],
)

Stage 4: Retrieval and Generation

At query time, embed the question, retrieve the top-k chunks, build a prompt, and call the LLM.

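from openai import AsyncOpenAI
from qdrant_client import AsyncQdrantClient

# Clients used below; embed() is the function from Stage 2
openai_client = AsyncOpenAI()
qdrant = AsyncQdrantClient(url="http://localhost:6333")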

async def answer(question: str, top_k: int = 5) -> str:
    q_vec = (await embed([question]))[0]

    results = await qdrant.query_points(
        collection_name="my_docs",
        query=q_vec,
        using="dense",
        limit=top_k,
        with_payload=True,
    )

    context = "\n\n".join(r.payload["text"] for r in results.points)

    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context. If unsure, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content
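The prompt-assembly step is worth isolating so it can be logged and tested without calling any API. The sketch below is one possible shape, not the only one: build_prompt is a hypothetical helper, and numbering the chunks is an added assumption that makes it easier to see which chunk an answer drew on.

```python
def build_prompt(question: str, chunks: list[str]) -> list[dict[str, str]]:
    """Assemble chat messages from retrieved chunks, failing loudly if none exist."""
    if not chunks:
        # An empty context is a common cause of ungrounded answers
        raise ValueError("no context retrieved; refusing to build an ungrounded prompt")
    # Number each chunk so the prompt log shows exactly what was injected
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return [
        {"role": "system", "content": "Answer using only the provided context. If unsure, say so."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]


messages = build_prompt(
    "What port does the service use?",
    ["The service listens on port 6333.", "Logs are written to stdout."],
)
print(messages[1]["content"])
```

Printing or logging the user message before the LLM call is the quickest way to confirm that the context actually reached the prompt.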

Common Failure Modes

Problem	Cause	Fix
Low recall	Chunks too large	Reduce chunk size to 256–512 tokens
Hallucinations	Context not injected correctly	Log the prompt, verify context is present
Slow ingest	Embedding one-at-a-time	Batch embed 100 chunks per API call
Stale answers	Vector store not updated	Build an incremental update pipeline
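The batching fix from the table can be sketched with a small generator. The helper name batched_chunks is hypothetical; each batch it yields could be passed as the texts argument of the embed function from Stage 2.

```python
from collections.abc import Iterator


def batched_chunks(items: list[str], batch_size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


all_chunks = [f"chunk {i}" for i in range(250)]
sizes = [len(batch) for batch in batched_chunks(all_chunks)]
print(sizes)  # [100, 100, 50]
```

With 250 chunks this makes three API calls instead of 250.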

What’s Next

Once you have a working pipeline, the next steps are:

Hybrid search — combine dense vectors with BM25 keyword search for better recall on technical terms
Reranking — use a cross-encoder to reorder retrieved chunks before passing to the LLM
Evaluation — measure answer quality with RAGAS (faithfulness, context_recall, answer_relevancy)

RAG is not magic. A well-chunked, well-indexed knowledge base paired with a tight prompt is what makes the difference between a demo and a production system.

LLM Fundamentals Every Engineer Should Know

Sourangshu Pal — Thu, 04 Dec 2025 18:30:00 GMT

Building on top of large language models is mostly an API integration problem. But when things break — and they will — you need to know what’s actually happening inside. This post covers the concepts that matter most for engineers working with LLMs in production.

Tokens, Not Words

LLMs don’t see words. They see tokens — subword units produced by a byte-pair encoding (BPE) tokenizer. The rule of thumb: 1 token ≈ 0.75 English words, or ~4 characters.

This matters for three reasons:

Cost: pricing is per-token, not per-word
Limits: context windows are measured in tokens
Surprises: "unhelpful" might be 1 token while "xyyzz4928" might be 5

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("The transformer architecture scales remarkably well.")
print(len(tokens))  # number of tokens; the exact count depends on the encoding

Always count tokens before sending to the API — a request that exceeds the context window throws a hard error.

Context Windows

The context window is the total number of tokens the model can “see” at once — input plus output combined. GPT-4o supports 128k tokens. This is large enough that most applications fit comfortably, but naive approaches still hit limits:

System prompt: ~500 tokens
Retrieved context (RAG): 5 chunks × 512 tokens = 2,560 tokens
Conversation history: grows without bound if you don’t truncate

The practical rule: reserve at least 4k tokens for the model’s output. Everything else is input budget.
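A minimal sketch of that budgeting rule follows, assuming the rough 4-characters-per-token estimate rather than a real tokenizer. The function names and the trimming strategy (drop the oldest messages first) are illustrative assumptions, not a prescribed method.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token, rounded up."""
    return -(-len(text) // 4)


def trim_history(
    history: list[str],
    system_tokens: int,
    context_tokens: int,
    context_window: int = 128_000,
    output_reserve: int = 4_000,
) -> list[str]:
    """Keep the most recent messages that fit in the remaining input budget."""
    budget = context_window - output_reserve - system_tokens - context_tokens
    kept: list[str] = []
    for message in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break  # everything older than this is dropped
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 400, "b" * 400, "c" * 400]  # about 100 tokens each
# A deliberately tiny window so the trimming is visible
kept = trim_history(history, system_tokens=500, context_tokens=2_560, context_window=7_260)
print(len(kept))  # 2
```

For exact counts, swap estimate_tokens for a tokenizer-based count; the budgeting logic stays the same.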

Temperature and Sampling

Temperature controls how “creative” the model is. Technically, it scales the logit distribution before softmax — higher values flatten the distribution (more randomness), lower values sharpen it (more deterministic).

Temperature	Use Case
`0.0`	Fact extraction, classification, structured output
`0.3–0.5`	Summarization, Q&A systems
`0.7–1.0`	Creative writing, brainstorming
`>1.0`	Rarely useful in production

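For RAG and data pipelines, default to temperature=0.0. You want reproducible, factual answers, not creative ones. Note that 0.0 makes output far more repeatable but does not guarantee identical results on every call.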
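The scaling described above can be written as p_i = exp(z_i / T) / sum_j exp(z_j / T), where z are the logits and T is the temperature. The sketch below implements that formula in plain Python on made-up logits. A temperature of exactly 0 cannot be plugged into the formula; in the limit it becomes a plain argmax.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T -> 0 approaches argmax")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative values, not from a real model
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 2.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

The low-temperature distribution puts more mass on the top logit, and the high-temperature one spreads it out, which is the "sharpen" versus "flatten" behavior in numbers.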

System Prompts Are Load-Bearing

The system prompt shapes everything. A weak system prompt is the most common reason a “good model” gives bad results. Key principles:

Be explicit about what the model should and should not do
State the output format, not just the task
Include examples (few-shot) for complex or unusual tasks
Test it: prompt injection is real, and adversarial users will try to override it

SYSTEM = """You are a technical support assistant.
Answer only questions about our product.
If the question is off-topic, say: "I can only answer questions about Product X."
Always respond in plain English, no markdown.
"""

Hallucinations Are a Probability Problem

LLMs are not search engines. They predict the next probable token. When the model doesn’t “know” something, it produces plausible-sounding output — which is a hallucination.

Mitigation strategies ranked by effectiveness:

RAG — ground answers in retrieved documents, include the source
Self-consistency — sample the same question 3× at temperature=0.7, return the majority answer
Structured output — force the model to output JSON with a "confidence" field; set a threshold
Verification prompts — ask the model “Is the answer above supported by the context?” as a second call

None of these eliminate hallucinations. They reduce them.
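As an illustration of the self-consistency strategy, the sketch below takes any callable that returns one sampled answer and returns the majority answer with its vote share. The sample callable is a stand-in for a real LLM call at temperature=0.7, and the lowercase/strip normalization is a simplifying assumption that only suits short answers.

```python
from collections import Counter
from collections.abc import Callable


def self_consistent_answer(
    sample: Callable[[str], str], question: str, n: int = 3
) -> tuple[str, float]:
    """Sample n answers and return the most common one with its share of votes."""
    # Normalize so trivially different spellings count as the same answer
    answers = [sample(question).strip().lower() for _ in range(n)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


# Fake sampler standing in for a model call; it cycles through canned answers
canned = iter(["Paris", "paris ", "Lyon"])
answer, agreement = self_consistent_answer(lambda q: next(canned), "Capital of France?")
print(answer, round(agreement, 2))  # paris 0.67
```

A low agreement value is itself a useful signal: it can be used to route the question to a fallback instead of returning a shaky answer.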

The Real Cost of Long Context

128k token context windows are impressive, but “lost in the middle” is a real phenomenon: LLMs attend more strongly to content at the beginning and end of the context window. Content buried in the middle gets underweighted.

Practical consequences:

Put the most important instructions at the top of the system prompt
In RAG, put the highest-ranked chunk first in the context block
Don’t assume adding more context always improves quality — sometimes it degrades it

Structured Outputs

Modern OpenAI models support JSON mode and structured outputs via response schemas. Use these whenever you need machine-readable responses.

from pydantic import BaseModel
from openai import AsyncOpenAI

class AnalysisResult(BaseModel):
    sentiment: str
    confidence: float
    summary: str

client = AsyncOpenAI()

response = await client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Analyze: 'The product is excellent but shipping was slow.'"}],
    response_format=AnalysisResult,
)
result = response.choices[0].message.parsed  # typed AnalysisResult object

This is more reliable than asking the model to “respond in JSON” and then parsing the output yourself.

Prompt Caching

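If your system prompt is large and repeated across many calls, OpenAI’s prompt caching reduces latency and cost. Caching applies automatically to prompts of 1,024 tokens or more, and for GPT-4o cached input tokens are billed at a 50% discount (the discount differs by model, so check current pricing). The cache matches on the exact prefix, so keep your system prompt stable and put dynamic content at the end.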
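The prefix rule is easy to check offline. The sketch below compares how much text two requests share when dynamic content goes last versus first. The flattening into a single string is a hypothetical stand-in for how a request is serialized, so treat the numbers as illustrative only.

```python
import os

SYSTEM_PROMPT = (
    "You are a technical support assistant. "
    "Answer only questions about our product."
)


def build_messages(question: str) -> list[dict[str, str]]:
    """Stable content first, per-request content last."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]


def build_messages_unstable(question: str, timestamp: str) -> list[dict[str, str]]:
    """Anti-pattern: a changing value at the start breaks the shared prefix."""
    return [
        {"role": "system", "content": f"Request time: {timestamp}\n{SYSTEM_PROMPT}"},
        {"role": "user", "content": question},
    ]


def shared_prefix_chars(a: list[dict[str, str]], b: list[dict[str, str]]) -> int:
    """Length of the common leading text of two flattened message lists."""
    flat = ["\n".join(f"{m['role']}: {m['content']}" for m in msgs) for msgs in (a, b)]
    return len(os.path.commonprefix(flat))


q1, q2 = "How do I reset my password?", "Where is my invoice?"
stable = shared_prefix_chars(build_messages(q1), build_messages(q2))
unstable = shared_prefix_chars(
    build_messages_unstable(q1, "10:00:01"), build_messages_unstable(q2, "10:00:02")
)
print(stable, unstable)
```

The stable layout shares the whole system prompt between requests, while the timestamped layout diverges within the first few dozen characters.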

Understanding these fundamentals turns debugging from guesswork into diagnosis. When an LLM application fails, the root cause is almost always one of: token budget exceeded, temperature too high for a deterministic task, system prompt underspecified, or context not properly injected.

Vector Databases Explained

Sourangshu Pal — Thu, 27 Nov 2025 18:30:00 GMT

Every RAG system, semantic search engine, and recommendation feature built on embeddings needs a vector database. But “vector database” is a term that gets overloaded. This post explains what’s actually happening, where the complexity lives, and how to configure Qdrant — the best open-source option — for production use.

What Problem Do They Solve?

An embedding model turns text (or images, audio, etc.) into a dense float vector — say, 1536 or 3072 numbers. Semantic similarity maps to geometric proximity: similar meanings → nearby vectors in the embedding space.

The search problem is: given a query vector, find the k most similar vectors in a collection of millions or billions. Exact nearest neighbor search is O(n): you compute the cosine distance between your query and every stored vector. At 10M vectors with 1536 dimensions, that’s about 15 billion multiply-adds, or roughly 30 billion floating-point operations, per query. Too slow for interactive use.

Vector databases solve this with Approximate Nearest Neighbor (ANN) algorithms. They trade a small accuracy loss for orders-of-magnitude speed gains.
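For reference, this is what the exact O(n) baseline looks like in plain Python: score every stored vector against the query and keep the best k. It is a sketch for intuition (and a handy ground truth when measuring ANN recall on a small sample), and it assumes no vector is all zeros.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return indices of the k most similar vectors by scanning all of them."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(exact_top_k([1.0, 0.0], vectors, k=2))  # [0, 2]
```

Every query touches every vector, so the cost grows linearly with the collection size; that is the work ANN indexes avoid.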

How HNSW Works

The dominant ANN algorithm today is Hierarchical Navigable Small World (HNSW). The key intuitions:

Graph structure: Vectors are nodes. Similar vectors are connected by edges.
Hierarchical layers: Multiple layers of graphs, coarser at the top, denser at the bottom. Entry is at the top layer.
Greedy search: Start at the fixed entry point in the top layer, greedily move to whichever neighbor is closest to the query until no neighbor is closer, then descend to the next layer and repeat.

The result: search cost grows roughly as O(log n). At 10M vectors, you’re exploring hundreds of candidates instead of millions.
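The greedy step can be sketched on a single layer of a toy graph. This is a simplification for intuition, not an HNSW implementation: real HNSW repeats the walk per layer, using each layer's result as the entry point for the next, and keeps a list of candidates on the bottom layer rather than a single node. The graph, the vectors, and the use of Euclidean distance here are all made up for the example.

```python
import math


def greedy_search(
    vectors: dict[int, list[float]],
    neighbors: dict[int, list[int]],
    query: list[float],
    entry: int,
) -> int:
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    current_dist = math.dist(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for node in neighbors[current]:
            d = math.dist(vectors[node], query)
            if d < best_dist:
                best, best_dist = node, d
        if best == current:
            return current  # no neighbor is closer: a local minimum
        current, current_dist = best, best_dist


# Six points on a line, each linked to its immediate neighbors
vectors = {i: [float(i)] for i in range(6)}
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
print(greedy_search(vectors, neighbors, query=[3.2], entry=0))  # 3
```

On this chain the walk visits every node along the way; the upper layers of HNSW exist to provide long-range links so the walk can skip most of the graph.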

The two build-time knobs that matter for HNSW (search-time width is a separate per-query setting, hnsw_ef in Qdrant):

Parameter	Effect	Default	When to Increase
`m`	Edges per node	16	Higher recall on sparse data
`ef_construct`	Build-time search width	100	Better index quality, slower build

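from qdrant_client import AsyncQdrantClient
from qdrant_client.models import VectorParams, Distance, HnswConfigDiff

client = AsyncQdrantClient(url="http://localhost:6333")

await client.create_collection(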
    collection_name="my_docs",
    vectors_config={
        "dense": VectorParams(
            size=3072,
            distance=Distance.COSINE,
            hnsw_config=HnswConfigDiff(
                m=16,           # standard
                ef_construct=100,  # increase to 200 for better recall
            ),
        )
    },
)

Dense vs. Sparse vs. Hybrid Search

Dense Vectors

Generated by neural embedding models. Capture semantic meaning. Struggle with exact keyword matching and rare terms (e.g., product codes, proper nouns not in training data).

Sparse Vectors (BM25 / SPLADE)

The classic keyword search approach (BM25), or its learned variant (SPLADE). Each dimension corresponds to a vocabulary term. Excellent at exact matching; BM25 in particular is terrible at synonyms and paraphrase.

Hybrid: Best of Both

Qdrant supports hybrid search natively using Reciprocal Rank Fusion (RRF) — run both retrievals, merge the ranked lists, take the top-k from the merged result.

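from qdrant_client.models import Prefetch, FusionQuery, Fusion, SparseVector

# Assumes the collection also defines a sparse vector named "sparse"
# (via sparse_vectors_config at creation time) and that dense_vector,
# sparse_indices, and sparse_values come from your encoders.
results = await client.query_points(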
    collection_name="my_docs",
    prefetch=[
        Prefetch(query=dense_vector, using="dense", limit=20),
        Prefetch(query=SparseVector(indices=sparse_indices, values=sparse_values),
                 using="sparse", limit=20),
    ],
    query=FusionQuery(fusion=Fusion.RRF),
    limit=5,
    with_payload=True,
)

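For many production RAG systems, hybrid search improves recall over dense-only retrieval, especially on domain-specific or technical content. Treat the often-quoted 10–20% gain as a rule of thumb, not a guarantee, and measure it on your own queries.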
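Reciprocal Rank Fusion itself is only a few lines. The sketch below scores each document as the sum of 1 / (k + rank) over every ranked list it appears in. The constant k=60 is the value commonly cited for RRF and is an assumption here; Qdrant's internal constant may differ, so this illustrates the idea rather than reproducing its exact scores.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_k: int = 5
) -> list[str]:
    """Merge ranked lists of document ids using summed reciprocal ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute the most
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)[:top_k]


dense_hits = ["a", "b", "c"]   # ids ranked by the dense retriever
sparse_hits = ["c", "a", "d"]  # ids ranked by the sparse retriever
print(reciprocal_rank_fusion([dense_hits, sparse_hits], top_k=3))  # ['a', 'c', 'b']
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale, which is the main appeal of RRF over weighted score sums.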

Payload Filtering

A vector database isn’t just for retrieval — you also need to filter by metadata. Qdrant stores arbitrary JSON payloads alongside vectors and can filter at query time without a post-processing step.

from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

results = await client.query_points(
    collection_name="my_docs",
    query=query_vector,
    using="dense",
    query_filter=Filter(
        must=[
            # Both conditions must hold for a point to be returned
            FieldCondition(key="source", match=MatchValue(value="technical_manual")),
            FieldCondition(key="year", range=Range(gte=2024)),
        ]
    ),
    limit=5,
)

Important: Create payload indexes for fields you filter on. Without an index, Qdrant scans the full payload for every candidate — that erases the ANN speed advantage.

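# Index every field used in a filter; the schema must match the payload type
await client.create_payload_index(
    collection_name="my_docs",
    field_name="source",
    field_schema="keyword",
)
await client.create_payload_index(
    collection_name="my_docs", field_name="year", field_schema="integer",
)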

Choosing a Vector Database

DB	Best For	Notes
Qdrant	Production open-source	Rust core, excellent Python SDK, hybrid search built-in
Pinecone	Managed cloud, fast setup	Expensive at scale, less control
pgvector	Already on PostgreSQL	HNSW support since 0.5.0, but slower than dedicated DBs
Weaviate	GraphQL API, multi-modal	Heavier operationally
Chroma	Local dev and prototyping	Not production-tested at scale

For most teams building RAG: Qdrant on Docker locally, Qdrant Cloud for production. The client API is identical; you change the URL and add an API key.

Production Checklist

Index built with ef_construct >= 100; raise to 200 if recall is below expectations
Payload indexes created for all filter fields
Collection backed by persistent storage (not in-memory)
Async client (AsyncQdrantClient) for all FastAPI or async paths
Batch upsert — don’t insert individual vectors in a loop
Collection snapshots scheduled for backup

A vector database is infrastructure. Get it right once, and it runs quietly for years. Get it wrong — missing indexes, sync client in async code, no backups — and you’ll feel it in production.