// Discipline 03

LLMs & RAG that answer from your knowledge.

Retrieval pipelines over private knowledge, evaluation harnesses, prompt + context engineering. We build LLM systems that are measurably useful, not just functional.

Discuss a project → · See details ↓

// query → retrieve → ground → answer

01 user query → retriever (k=12 · hybrid)

↓

02 rerank → top-5 (cross-encoder)

↓

03 context assembly (3,840 tok)

↓

04 generation w/ citations (claude · gpt · llama)

↓

05 faithfulness check (passed · 0.94)

“Yes, see policy §4.2 [3]. The 30-day window applies only to commercial accounts [5].”

/ 3.1

Retrieval-Augmented Generation

Grounded answers from private documents, databases, and knowledge bases — without re-training.

// rag_pipeline · hybrid · pgvector + bm25

query → embed(ada-002) → vec search (k=20)

→ bm25 sparse → union → rerank (top-5)

retrieved 5 chunks · context: 3,840 tok

faithfulness: 0.94 · relevance: 0.91

“Per §4.2, the 30-day window applies to commercial accounts. [3][5]”
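The union-then-rerank step in the trace above can be sketched roughly as follows. This is a minimal illustration, not the production pipeline: the toy corpus, `toy_dense`, `toy_sparse`, and `overlap` are hypothetical stand-ins for a pgvector query, a BM25 index, and a cross-encoder, and only the `k=20` and top-5 defaults come from the trace.

```python
from typing import Callable

# Toy corpus; chunk ids and texts are invented for illustration.
CORPUS = {
    "c1": "refund policy 30-day window commercial accounts",
    "c2": "shipping times for international orders",
    "c3": "refund requests require a receipt",
    "c4": "commercial accounts billing cycle",
}


def overlap(query: str, chunk_id: str) -> float:
    """Count query terms present in the chunk (stand-in for a reranker score)."""
    terms = set(query.lower().split())
    return float(len(terms & set(CORPUS[chunk_id].split())))


def toy_dense(query: str, k: int) -> list[str]:
    """Stand-in for a vector search; returns a fixed ranking."""
    return ["c1", "c4"][:k]


def toy_sparse(query: str, k: int) -> list[str]:
    """Stand-in for BM25: rank chunks by term overlap, dropping zero scores."""
    scored = [(overlap(query, cid), cid) for cid in CORPUS]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [cid for score, cid in ranked if score > 0][:k]


def hybrid_retrieve(
    query: str,
    dense_search: Callable[[str, int], list[str]],
    sparse_search: Callable[[str, int], list[str]],
    rerank_score: Callable[[str, str], float],
    k: int = 20,
    top_n: int = 5,
) -> list[str]:
    """Union dense and sparse candidates, then keep the top_n by rerank score."""
    dense_hits = dense_search(query, k)
    sparse_hits = sparse_search(query, k)
    # dict.fromkeys de-duplicates while preserving first-seen order.
    candidates = list(dict.fromkeys(dense_hits + sparse_hits))
    ranked = sorted(candidates, key=lambda cid: rerank_score(query, cid), reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    query = "refund window commercial accounts"
    print(hybrid_retrieve(query, toy_dense, toy_sparse, overlap, top_n=2))
```

Passing the searchers and the scorer in as callables keeps the union and rerank logic independent of any particular vector store or reranking model.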

// Details

Vector search (pgvector, Pinecone, Weaviate, Qdrant)
Chunking strategy and embedding selection
Hybrid retrieval (dense + sparse BM25)
Contextual compression and re-ranking

// Output formats

REST API · Python SDK · Docker

/ 3.2

LLM Evaluation & Benchmarking

You can't improve what you can't measure. We build evaluation suites before we build the system.

// ragas_eval · system_v1 · 500 questions

faithfulness ....... 0.94 ↑ +0.08 from v0

answer_relevancy ....... 0.91 ↑ +0.06

context_recall ....... 0.87 · needs improvement

context_precision ....... 0.89

action: improve chunking · tune rerank threshold
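A regression check over scores like the ones above might look something like this. It is a sketch under stated assumptions: the `floor=0.88` and `max_drop=0.02` thresholds and the `compare_runs` helper are hypothetical, and the v0 baselines are simply back-computed from the deltas shown in the report (0.94 − 0.08 and 0.91 − 0.06).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricResult:
    name: str
    score: float
    delta: Optional[float]  # change vs. baseline; None when no baseline exists
    passed: bool


def compare_runs(
    current: dict[str, float],
    baseline: dict[str, float],
    floor: float = 0.88,     # hypothetical minimum acceptable score
    max_drop: float = 0.02,  # hypothetical tolerated regression vs. baseline
) -> list[MetricResult]:
    """Flag metrics that fall below the floor or regress past max_drop."""
    results = []
    for name, score in current.items():
        delta = round(score - baseline[name], 2) if name in baseline else None
        regressed = delta is not None and delta < -max_drop
        results.append(MetricResult(name, score, delta, score >= floor and not regressed))
    return results


if __name__ == "__main__":
    v1 = {
        "faithfulness": 0.94,
        "answer_relevancy": 0.91,
        "context_recall": 0.87,
        "context_precision": 0.89,
    }
    v0 = {"faithfulness": 0.86, "answer_relevancy": 0.85}
    for result in compare_runs(v1, v0):
        status = "ok" if result.passed else "needs improvement"
        print(f"{result.name:<20} {result.score:.2f}  {status}")
```

Run in CI against a fixed question set, a gate of this kind turns "needs improvement" from a note in a report into a failing check.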

// Details

Groundedness, relevance, faithfulness metrics
RAGAS / custom evaluation harnesses
Regression benchmarks across model versions
Human evaluation integration

// Output formats

Eval report · JSON · Dashboard

/ 3.3

Prompt & Context Engineering

Systematic prompt development, few-shot curation, context window optimization.

// prompt_template · structured · json_mode

SYSTEM: You are a policy analyst. Answer using only

the provided context. Say “unclear” if unsure.

CONTEXT: {{retrieved_chunks}}

USER: {{user_question}}

OUTPUT: {"answer":…, "citations":[…], "confidence":…}
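One way to fill a template like this and check the model's reply is sketched below. The helpers `render_prompt` and `parse_output` are hypothetical, and the sketch assumes citations are 1-based indices into the retrieved chunks; the template above does not specify the type of `confidence`, so it is only checked for presence.

```python
import json
import re

TEMPLATE = (
    "SYSTEM: You are a policy analyst. Answer using only "
    'the provided context. Say "unclear" if unsure.\n'
    "CONTEXT: {{retrieved_chunks}}\n"
    "USER: {{user_question}}"
)


def render_prompt(template: str, chunks: list[str], question: str) -> str:
    """Fill {{placeholders}} in a single pass so inserted text is never re-expanded."""
    values = {
        "retrieved_chunks": "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1)),
        "user_question": question,
    }
    return re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], template)


def parse_output(raw: str, n_chunks: int) -> dict:
    """Parse the model reply and reject missing keys or out-of-range citations."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"answer", "citations", "confidence"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # Assumption: each citation is a 1-based index into the retrieved chunks.
    for cite in data["citations"]:
        if not isinstance(cite, int) or not 1 <= cite <= n_chunks:
            raise ValueError(f"citation out of range: {cite!r}")
    return data


if __name__ == "__main__":
    chunks = ["Refunds are covered in §4.2.", "Commercial accounts have a 30-day window."]
    print(render_prompt(TEMPLATE, chunks, "Does the 30-day window apply to me?"))
```

Validating citations against the number of retrieved chunks catches one common failure early: an answer that cites a source the model was never shown.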

// Details

Structured prompt templates
Chain-of-thought, structured output (JSON mode)
Prompt regression testing
Context window management strategies
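As one example of a context window management strategy, a greedy packer can fill a fixed token budget with the highest-ranked chunks. This is only a sketch: `assemble_context` is a hypothetical helper, the whitespace word count is a crude stand-in for a model-specific tokenizer, and the 3,840 default simply mirrors the context size shown in the pipeline example.

```python
from typing import Callable, Optional


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (assumption, not a real tokenizer)."""
    return len(text.split())


def assemble_context(
    ranked_chunks: list[str],
    budget_tokens: int = 3840,
    count_tokens: Optional[Callable[[str], int]] = None,
) -> tuple[list[str], int]:
    """Greedily keep chunks in rank order, skipping any that would exceed the budget."""
    count = count_tokens or approx_tokens
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count(chunk)
        if used + cost > budget_tokens:
            continue  # too large for the remaining space; try the next chunk
        selected.append(chunk)
        used += cost
    return selected, used


if __name__ == "__main__":
    chunks = ["a b c", "d e f g", "h i"]
    print(assemble_context(chunks, budget_tokens=5))
```

Skipping an oversized chunk instead of stopping at it lets smaller, lower-ranked chunks use the remaining space; stopping at the first miss is the stricter alternative when rank order matters more than coverage.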

// Output formats

YAML prompts · LangChain · LlamaIndex

/ 3.4

Tool-Use & Function Calling

LLMs connected to APIs, databases, and tools — with proper fallback, error handling, and observability.

// tool_trace · search_agent · 3 calls

[1] call: search_docs(“refund policy 2026”) → 5 chunks

[2] call: lookup_clause(“§4.2”) → 2025-01-01

[3] call: validate_output(schema) → pass

final: grounded · citations: [3,5] · tokens: 842
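A trace like the one above could be produced by a small wrapper around each tool call. The sketch below is illustrative only: `call_tool`, the registry layout, and the trace fields are hypothetical, and a production system would more likely emit spans to a tracing backend than append to a list.

```python
import time
from typing import Any, Callable


def call_tool(
    name: str,
    registry: dict[str, Callable[..., Any]],
    args: dict[str, Any],
    trace: list[dict[str, Any]],
    retries: int = 1,
    fallback: Any = None,
) -> Any:
    """Run a registered tool, record every attempt in trace, and fall back on failure."""
    for attempt in range(retries + 1):
        started = time.perf_counter()
        entry: dict[str, Any] = {"tool": name, "attempt": attempt}
        try:
            result = registry[name](**args)  # an unknown tool raises KeyError here
            entry["ok"] = True
            return result
        except Exception as exc:  # broad on purpose: any tool error is recorded
            entry["ok"] = False
            entry["error"] = repr(exc)
        finally:
            entry["ms"] = round((time.perf_counter() - started) * 1000, 2)
            trace.append(entry)
    return fallback


if __name__ == "__main__":
    def search_docs(query: str) -> list[str]:
        """Hypothetical tool; a real one would query the retrieval pipeline."""
        return [f"chunk about {query}"]

    trace: list[dict[str, Any]] = []
    print(call_tool("search_docs", {"search_docs": search_docs}, {"query": "refund policy"}, trace))
    print(trace)
```

Recording failed attempts alongside successful ones means the trace shows why a fallback was used, not only that it was.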

// Details

OpenAI function calling / tool_use
Multi-step reasoning with tool selection
Output parsing and validation
Observability with tracing

// Output formats

OpenAI API · Anthropic API · Open-source

// Work with us

Ready to ship? Let's scope it together.

Whether it's labeled data, a fine-tuned model, a RAG pipeline, or an agent running in production, bring us the brief. We'll scope it, price it, and tell you honestly if we're the right team, within 48 hours and with no commitment.

Book a call → · View all services →