title: How to Actually Use LLMs at Work: A Practitioner's Playbook description: A technical playbook for SME operators on model selection, tool invocation, context management, and hallucination discipline — closing the gap between surface-level and competent LLM usage.

How to Actually Use LLMs at Work: A Practitioner's Playbook

Category: Education

TL;DR: Most LLM usage stays at the surface — free-tier, no tool calls, outputs treated as finished answers. The gap between that and competent usage is model-task matching, tool invocation discipline, and a hallucination verification habit. Close those three gaps and productivity gains become structural rather than incidental.

Most people using AI tools at work are running them at maybe 20% of their potential. The model isn't the bottleneck. The operator is: wrong tier, wrong model for the task, no tools enabled, no verification step. This playbook covers what actually matters.

Architecture overview

Before any workflow decisions, the mental model has to be correct.

An LLM call is a stateless token-stream computation over a fixed context window. The model has no memory between calls unless you provide it — either through an injected memory layer (ChatGPT's persistent Memory, Claude Projects with a pre-loaded system prompt) or explicit prompt construction. Pre-training encodes world knowledge up to a cutoff date; post-training (RLHF, Constitutional AI, DPO) shapes instruction-following behaviour and the safety envelope. These are distinct. A model with stale pre-training data can still be excellent at reasoning tasks; a model with poor post-training will hallucinate confidently regardless of context quality.

Context window sizes: GPT-4o at 128k tokens, Claude 3.7 Sonnet at 200k, Gemini 1.5 Pro at 1M. Those upper bounds sound large until you're ingesting a 300-page contract, a code repository, and a standing system prompt simultaneously — at which point you're hitting lost-in-the-middle degradation well before the token cap, not after it.

Tool calls are separate inference steps routed through the model's function-calling layer. The model emits a structured tool-call object; the client (or the platform sandbox) executes it and returns the result. ChatGPT's Code Interpreter, Claude's tool use, Gemini's function calling — all the same pattern. This matters operationally because tool latency compounds: a four-step tool chain at 3 seconds per step is a 12-second round-trip before the model begins generating its response. Agentic pipelines that feel fast in demos often feel slow at scale.

Implementation considerations

Five decisions that determine 80% of outcomes:

Tier selection. Use the free tier for anything non-load-bearing — exploration, personal drafts, throwaway queries. For anything that touches client output, business decisions, or code that ships, pay for one Tier 2 subscription. ChatGPT Plus or Claude Pro at roughly £18–20/month each gets you leading models, full tool access, and usable rate limits. Running both is not twice as good; it's context fragmentation and a doubled cost with no clear benefit. Pick one as your primary. Use the other as a capability fallback for tasks where it demonstrably outperforms.

Model-task matching. There are two meaningful production tiers: default (GPT-4o, Claude 3.7 Sonnet, Gemini 1.5 Pro) and thinking/reasoning (o1, o3, Claude 3.7 with extended thinking enabled, DeepSeek-R1). Default models handle drafting, summarisation, extraction, and most coding tasks well, and they're fast. Thinking models are for multi-step reasoning where intermediate steps need to hold — formal logic, legal analysis, complex architectural decisions, anything with a chain of dependencies where an early error cascades. They cost 2–5× more per token and run 3–8× slower. Applying o1 to email drafts is expensive waste. Applying GPT-4o to formal contract interpretation produces confident nonsense. Match the model to the cognitive demand of the task, not to your general comfort with a particular interface.

Tool invocation. Web search turns the model from a knowledge-cutoff prisoner into a live research assistant. File upload plus Code Interpreter turns it into a data analyst. Deep Research (OpenAI's 15–30 minute multi-step pipeline, available in Gemini Advanced as well) runs sequential web searches and synthesises a sourced report — genuinely useful for regulatory research, competitive landscape analysis, or any domain where your internal knowledge is sparse. These tools are not cosmetic. The productivity delta between a bare prompt and a well-tooled session is an order of magnitude on research tasks. If you're not enabling tools, you're not getting the capability you're paying for.

Context management. Projects (ChatGPT, Claude) persist a system prompt and document context across sessions. Use them. Set up one project per client domain, one per internal function. Write a proper system prompt for each — what the context is, what format outputs should follow, what the model should not do. Inject it once. Custom Instructions in ChatGPT apply globally across all conversations; Projects are scoped. Don't conflate them. NotebookLM is worth a separate mention for document-heavy workflows: it grounds responses exclusively in the sources you provide and surfaces citations, which eliminates one class of hallucination for document review and knowledge retrieval tasks.

Output verification. The model is a probabilistic token predictor. It will hallucinate facts, fabricate citations, mis-recall dates, and confabulate statistics — and will do all of this with high confidence and clean prose. Treat any load-bearing claim as unverified until you've traced it to a primary source. A regulation reference, a competitor pricing figure, a contract clause interpretation — verify these, every time. The model is excellent at drafting structure and prose; it is not a reliable oracle for specific facts.

Trade-offs

Prompt engineering vs fine-tuning

For SME use cases, fine-tuning is rarely the right answer. It requires labelled training data (hundreds to thousands of high-quality examples), a training pipeline, and ongoing maintenance every time the base model updates. The capability gap between a well-engineered prompt-plus-RAG system and a fine-tuned model is small for most tasks; the operational gap is substantial.

The correct sequencing: invest in prompting first. System prompt, few-shot examples, explicit output format, chain-of-thought elicitation. If you're still hitting consistent failure on well-defined tasks after that, and "failure" means reproducible degradation on representative inputs, then fine-tuning is worth evaluating. Most SME teams never get there.

RAG vs long context

Gemini 1.5 Pro's 1M-token context looks like it makes retrieval-augmented generation obsolete. It doesn't — for three reasons. First, cost: 1M tokens at full inference is expensive per query. Second, lost-in-the-middle degradation means accuracy on content buried in the middle of a long context degrades even within the window (see the detailed treatment here). Third, freshness: a RAG pipeline over pgvector or Qdrant can be updated continuously; a context dump is a snapshot. For document-heavy workflows, a lightweight retrieval layer still outperforms naive context-stuffing once your corpus exceeds roughly 50k tokens of genuinely relevant content. Below that threshold, long context is simpler and probably fine.

Failure modes and mitigations

Hallucination on specific facts. Ground the model — inject the primary source directly into context, or use a RAG-retrieved chunk that contains the claim. Ask the model to cite its source. If it can't point to something in context, the claim is suspect.

Context rot in long sessions. Long threads accumulate irrelevant prior turns that dilute attention on what matters. Fix: start a new session for a new task. Don't persist context across unrelated topics in a single thread.

Model over-confidence on ambiguous instructions. The model will complete the task as it interprets it rather than asking for clarification. Add an explicit directive to your system prompt: "If the instruction is ambiguous, ask before proceeding." It works.

Tool chain failures. Code Interpreter sandboxes are stateless — a kernel restart drops all variables. Files uploaded in one turn may not persist across sessions depending on the platform. Test tool chains on representative inputs before relying on them in any client-facing workflow.

Instruction following decay in agentic chains. Adherence degrades over pipeline depth. A 10-step chain running at 95% per step delivers roughly 60% end-to-end success. For anything high-stakes, break long workflows into checkpointed stages with a human review between them.

Cost and operational profile

Approximate figures at current API rates (May 2026, GPT-4o):

| Pattern | Tokens/task | Approx. cost | Latency | |---|---|---|---| | Simple Q&A | ~2k | <£0.01 | 2–5s | | Document summary (20 pages) | ~15k | ~£0.05 | 8–15s | | Deep Research report | ~200k+ | £0.50–2.00 | 15–30 min | | Code Interpreter (data analysis) | ~30k + compute | ~£0.10–0.30 | 30–90s |

For subscription-tier usage, costs are flat at ~£18–20/month per user. The ROI calculus is straightforward: if a practitioner saves 30 minutes of research and drafting daily, the subscription pays for itself on the first morning of the month.

API costs become a material engineering concern at scale. A 100-user deployment running 50 API calls/day at 5k tokens each is ~25M tokens/day — roughly £75/day at GPT-4o rates. At that volume, model selection, prompt caching (Anthropic's prefix caching cuts repeated system-prompt costs by up to 90%), and async batching are worth engineering time.

Recommended approach for SME context

One paid subscription. Claude Pro or ChatGPT Plus — pick the one you'll actually open every morning. Consistency compounds faster than capability-chasing.

Set up Projects on day one. One per client, one per internal domain. Write a proper system prompt for each: context, constraints, output format, and what not to do. Invest an hour upfront; it saves hours per week.

Enable web search for any task involving current information or research. Enable Code Interpreter for any task involving data, regardless of whether you think you need the Python sandbox — you probably do.

Apply thinking models only when reasoning depth is the genuine bottleneck. For most tasks — drafting, extraction, summarisation, routine analysis — default models are sufficient and significantly faster.

Build one verification habit and keep it: after any session that produces a claim you'll act on, spend two minutes checking the load-bearing facts against a primary source.

Everything beyond this — fine-tuning, multi-agent orchestration, custom retrieval pipelines — is a later-stage decision. Valid, but only once this foundation runs reliably.


Source: Andrej Karpathy, "How I use LLMs" (YouTube, 2.4M views, 2025) — watch here. Supporting references: OpenAI Deep Research documentation; Anthropic Claude 3.7 Sonnet feature notes; Google NotebookLM.