Two years ago, integrating a large language model into a product meant research papers and custom GPU clusters. Today, it means an API call. The hard problems have shifted from "how do I access an LLM" to "how do I build reliably on top of one" — and that's where most teams get into trouble.
Choosing Your LLM: The Decision Matrix
The market now has clear tiers:
- Reasoning-heavy tasks (code generation, complex analysis, multi-step planning): GPT-4o or Claude 3.5 Sonnet.
- High-volume, latency-sensitive tasks: GPT-4o Mini, Claude Haiku, or Gemini Flash.
- On-premise or privacy-sensitive workloads: Llama 3.1 70B or Mistral Large, self-hosted.
Don't over-index on benchmark scores: test on your actual data with your actual prompts.
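The tiers above can be captured as a small routing table. A sketch, with illustrative tier names and model identifiers (check each provider's current catalogue before hard-coding any of these):

```javascript
// Hypothetical tier → candidate-model map; identifiers are illustrative.
const MODEL_TIERS = {
  reasoning:  ['gpt-4o', 'claude-3-5-sonnet'],
  highVolume: ['gpt-4o-mini', 'claude-haiku', 'gemini-flash'],
  onPremise:  ['llama-3.1-70b', 'mistral-large'],
};

// Pick the nth-preferred model for a tier, clamping to the list length
// so a fallback index never runs off the end.
function pickModel(tier, preferred = 0) {
  const candidates = MODEL_TIERS[tier];
  if (!candidates) throw new Error(`Unknown tier: ${tier}`);
  return candidates[Math.min(preferred, candidates.length - 1)];
}
```

Keeping the mapping in one place makes it cheap to re-run your evaluation suite when a provider ships a new model and swap an entry.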
Prompt Engineering That Actually Works
```javascript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function analyseDocument(text) {
  const response = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 1024,
    // A tight system prompt pins down role, output format, and tone.
    system: `You are a specialist document analyst.
Always respond in valid JSON.
Be concise and factual.`,
    messages: [
      {
        role: 'user',
        // Numbered extraction targets give the model a clear output contract.
        content: `Analyse this document and extract:
1. Key entities (people, organisations, dates)
2. Main topics
3. Sentiment

Document: ${text}`
      }
    ]
  });
  // Note: JSON.parse throws if the model drifts from valid JSON; see the
  // validation patterns under "Handling Reliability and Hallucinations".
  return JSON.parse(response.content[0].text);
}
```
The 5 Architecture Patterns
- <strong>Direct completion.</strong> Simplest — user input → LLM → output. Good for single-shot tasks like summarisation or classification.
- <strong>RAG (Retrieval-Augmented Generation).</strong> Embed your documents, retrieve relevant chunks, include them in the prompt. Dramatically reduces hallucinations for knowledge-intensive tasks.
- <strong>Agents with tool use.</strong> Give the LLM access to functions (search, calculator, database) and let it decide when to call them. Powerful but non-deterministic.
- <strong>Multi-step chains.</strong> Break complex tasks into sequential LLM calls. More predictable than agents, better for structured pipelines.
- <strong>Fine-tuned models.</strong> For narrow, high-volume tasks, fine-tuning a smaller model on your data can be 10× cheaper and more consistent than prompting a large model.
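The retrieval step at the heart of the RAG pattern can be sketched as a toy in-memory version. A real system would call an embedding API and a vector store; the ranking logic (cosine similarity over chunk embeddings) is the same. The corpus and its three-dimensional "embeddings" below are invented for illustration:

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na  += a[i] * a[i];
    nb  += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Score every chunk against the query embedding, return the top k.
function retrieveTopK(queryVec, chunks, k = 2) {
  return chunks
    .map(c => ({ ...c, score: cosine(queryVec, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy corpus with pretend embeddings (real ones have hundreds of dimensions).
const chunks = [
  { text: 'Refund policy: 30 days.',  vector: [0.9, 0.1, 0.0] },
  { text: 'Shipping takes 3-5 days.', vector: [0.1, 0.9, 0.0] },
  { text: 'Support hours: 9-5 GMT.',  vector: [0.0, 0.2, 0.9] },
];

const top = retrieveTopK([0.85, 0.15, 0.05], chunks, 1);
// top[0].text is what you would splice into the prompt as context.
```

The retrieved chunk is then included in the prompt alongside the user's question, grounding the answer in your documents rather than the model's training data.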
Managing Costs at Scale
LLM costs are token costs, and tokens compound fast. Key optimisations: (1) Implement prompt caching — Anthropic and OpenAI both cache repeated system prompts, cutting costs by up to 90% for structured workflows. (2) Use the smallest model that meets your quality bar. (3) Cache LLM responses for identical inputs. (4) Set hard token limits and monitor usage per user.
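Optimisation (3), caching responses for identical inputs, can be sketched as a small in-memory wrapper. A production version would use Redis or similar with a TTL; `callLLM` here is a stand-in for whatever client call you actually make:

```javascript
import { createHash } from 'node:crypto';

// In-memory response cache keyed by a hash of model + prompt.
const cache = new Map();

async function cachedCompletion(model, prompt, callLLM) {
  const key = createHash('sha256')
    .update(model + '\u0000' + prompt)
    .digest('hex');
  if (cache.has(key)) return cache.get(key);   // cache hit: zero token cost
  const result = await callLLM(model, prompt); // cache miss: pay for tokens
  cache.set(key, result);
  return result;
}
```

Note this only helps for byte-identical inputs; it complements, rather than replaces, provider-side prompt caching of repeated system prompts.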
A naive implementation of an LLM-powered feature can cost £0.30 per user session. With caching, model selection, and prompt optimisation, the same feature can cost £0.003. That's a 100× difference — and it's the difference between a viable product and a loss-making one.
Handling Reliability and Hallucinations
LLMs are probabilistic, not deterministic. For any business-critical output, implement: output validation (JSON schema validation, regex checks), human-in-the-loop review for high-stakes decisions, confidence scoring where the LLM is asked to rate its own answer, and fallback paths when the LLM returns something unexpected. Never pipe raw LLM output directly to a database or user-facing UI without sanitisation.
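A minimal version of the validation step, matching the JSON shape requested in the earlier document-analysis prompt (the allowed sentiment values are an assumption for illustration):

```javascript
// Sentiment labels we accept; anything else is treated as a schema violation.
const ALLOWED_SENTIMENTS = new Set(['positive', 'neutral', 'negative']);

// Parse and shape-check raw LLM output instead of trusting it.
// Returns { ok: true, value } or { ok: false, error } for a fallback path.
function validateAnalysis(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, error: 'not valid JSON' };
  }
  if (!Array.isArray(parsed.entities) ||
      !Array.isArray(parsed.topics) ||
      !ALLOWED_SENTIMENTS.has(parsed.sentiment)) {
    return { ok: false, error: 'schema mismatch' };
  }
  return { ok: true, value: parsed };
}
```

On `{ ok: false }` the caller can retry with a corrective prompt, fall back to a default, or escalate to a human, rather than letting malformed output reach a database or UI.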
Got a project in mind?
I work directly with founders and CTOs to build reliable, scalable software. Let's have a conversation about your goals.
Get a Quote