writeup

How I engineer with AI (not just prompt it)

I build the system around the model, not just the prompts: agents that do real work, retrieval that stays grounded, evals that gate quality, fine-tuning when it pays, and the token economics to run all of it in production.

Solo: agents, RAG, evals, fine-tuning, cost

LLM agents, RAG, evals / LLM-as-judge, QLoRA fine-tuning, MCP, token FinOps

TL;DR

I engineer with AI end to end: I build the system that produces the answers (agents, retrieval, evals, fine-tuning) and I operate it in production (orchestration, governance, cost).
Every claim here is measured somewhere in my work, not asserted: κ 0.81 for the judge, ~90% of cloud accuracy at $0, a matcher that stops dead at a $1.00 cap.
That full stack, build plus operate, is exactly what teams are desperate to hire in 2026 and rarely find in one person.

The gap

Most people prompt the model. Few engineer with it.

A good prompt one day, a hallucinated refactor the next, and a bill nobody is watching. Inside a real org that does not hold. The rare engineer builds the system around the model and keeps it accountable: measured, bounded, and cheap enough to run. I work at that intersection.

I build the system

Not prompts in a notebook. Systems with parts that are tested and measured.

Agents & RAG. An agentic job matcher with a bounded tool-use loop, run-to-run memory, and an MCP server; retrieval that stays grounded and cited. Untrusted text is spotlighted as data, never instructions, and the loop stops the instant it trips a hard cap (24 steps, 600k tokens, $1.00). (hiring-radar, daily-news)
Evals. An LLM-as-judge calibrated against human labels (Cohen's κ 0.81, a confusion matrix, bias checks) that fails the PR when a prompt quietly gets worse. The judge is a measured instrument, not a black box. (rubric)
Fine-tuning. QLoRA on a small local model, measured on a labelled set: 72% to 80% on SROIE, ~90% of cloud accuracy at $0, without leaving the laptop. (shoebox)
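To make the judge calibration concrete, here is a minimal sketch of how agreement between human labels and judge labels can be scored with Cohen's κ. It is an illustration under stated assumptions, not the code in rubric: the function names, the example labels, and the 0.8 threshold are all hypothetical.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def judge_is_calibrated(human, judge, min_kappa=0.8):
    """Hypothetical gate: treat the judge as usable only above a kappa floor."""
    return cohens_kappa(human, judge) >= min_kappa


# Made-up labels for illustration only.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
kappa = cohens_kappa(human_labels, judge_labels)
```

κ corrects raw agreement for the agreement two raters would reach by chance, which is why it is a stricter number than plain accuracy against the human labels.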

I operate it

Anyone can spin up an agent. Almost nobody runs a fleet of them in a team without the bill exploding.

Orchestration, not demos. Real multi-agent systems doing autonomous work: one coordinates a fleet over an offline-first product, another automates GitHub issues end to end.
Governance at team scale. I configure Claude for a whole org (permissions, standards, agents other engineers inherit) and shape behaviour structurally, building system prompts in Cowork from the base, not firing off stray prompts.
The token economy. FinOps for LLMs: the right model per task (Haiku where it is enough, Opus where it is needed), caching, context budgets, hard cost caps. The difference between a pretty prototype and something a company can run.
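A hard cap is easiest to see in code. The sketch below shows one way a bounded loop can enforce step, token, and dollar limits, using the cap values quoted for the matcher (24 steps, 600k tokens, $1.00) as defaults. Everything else is an assumption for illustration: the `Budget` class, `run_loop`, and the shape of `step_fn` are hypothetical, and this version checks the budget between steps, so a single step can still overshoot a cap before the loop halts.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """Running totals plus the limits that end the loop (hypothetical helper)."""
    max_steps: int = 24
    max_tokens: int = 600_000
    max_cost_usd: float = 1.00
    steps: int = 0
    tokens: int = 0
    cost_usd: float = 0.0

    def charge(self, tokens, cost_usd):
        """Record one completed step and what it consumed."""
        self.steps += 1
        self.tokens += tokens
        self.cost_usd += cost_usd

    def tripped(self):
        """Return the name of the first cap reached, or None if within budget."""
        if self.steps >= self.max_steps:
            return "steps"
        if self.tokens >= self.max_tokens:
            return "tokens"
        if self.cost_usd >= self.max_cost_usd:
            return "cost"
        return None


def run_loop(step_fn, budget):
    """Call step_fn until it reports done or any cap is reached.

    step_fn is assumed to return (done, tokens_used, cost_usd) for one step.
    """
    while True:
        reason = budget.tripped()
        if reason is not None:
            return f"stopped: {reason} cap"
        done, tokens, cost = step_fn()
        budget.charge(tokens, cost)
        if done:
            return "done"
```

Keeping the three limits in one object means the loop has a single place to ask whether it may continue, and the reason it stopped can be logged alongside the totals.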

Programming and AI are the same move for me: I do not write the feature, I build the machine that produces features; I do not prompt the model, I engineer and operate the system that produces the answers.

the throughline

Always on the frontier

I did not stop at what I learned a year ago. I take the courses and stay at the edge: ultracode, Opus 4.8 in the editor the day it landed. The same self-taught engine that taught me to code keeps me current in a field that moves every week, and I run it on my own life too (routines that sweep my inbox, workflows for LinkedIn).

Why it matters

Solid engineering plus AI system-building plus operational discipline is exactly what teams cannot hire for right now. Being able to build a team's AI features and govern its tooling, with the economics under control, is close to a superpower in today's market. For the AI Engineer and founding-engineer roles I am after, this is the pillar, not a bullet.

Where to see it

The proofs are public and measured across my own repos: the agentic matcher (hiring-radar), the eval gate (rubric), the fine-tuned local model (shoebox), plus a versioned PROMPT_HISTORY.md and an AGENTS.md house-style doc that grounds every agent.

See the case studies | My GitHub