AI Engineer turning data into decisions — building RAG pipelines, LLM systems, and enterprise data platforms that scale.
I'm a Senior AI/ML Engineer based in London with 8+ years of experience designing and delivering production AI systems across insurance, media, and fintech. I don't just prototype — I own the full lifecycle: architecture, infrastructure, deployment, and monitoring.
At NFU Mutual, I built a natural-language analytics interface used by 200+ analysts, a RAG pipeline that cut research time from 30 minutes to under a minute, and a real-time ML recommendation engine for personalised cross-sell campaigns. At Sky UK, I scaled their primary conversational AI across help and sales journeys with production guardrails — prompt injection filtering, PII redaction, and RAG grounding.
I'm currently building FinRAG Eval, an open-source evaluation framework for financial RAG systems, as my contribution to the community and a signal of depth for what comes next.
From Hyperledger deployments at Accenture to production LLM systems at top-tier UK enterprises — I've grown through each generation of the stack.
Delivered production AI systems for NFU Mutual and Sky UK via Zensar, and built enterprise cloud platforms at Accenture serving 270+ global clients.
Building FinRAG Eval — a production-grade RAG evaluation framework with hallucination detection, citation accuracy, and automated dataset construction from SEC filings.
A production-grade RAG evaluation framework built for financial documents. Uses claim-level decomposition with a local NLI model (DeBERTa-v3), fuzzy citation matching, and a Streamlit dashboard for visual inspection. Adapter-based design means you can swap one module to benchmark any RAG system. Automated dataset construction from real SEC 10-K/10-Q filings with LLM-generated QA candidates and human-in-the-loop review.
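To illustrate the fuzzy citation matching idea, here is a minimal sketch using only the Python standard library. It is not FinRAG Eval's actual code: the function names, the sliding word-window strategy, and the 0.85 threshold are all assumptions chosen for the example.

```python
import string
from difflib import SequenceMatcher


def _tokens(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip punctuation from word edges."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w]


def citation_score(quote: str, source: str) -> float:
    """Best similarity between the quote and any same-length word window of the source."""
    q, s = _tokens(quote), _tokens(source)
    if not q or not s:
        return 0.0
    width = min(len(q), len(s))
    target = " ".join(q)
    best = 0.0
    for start in range(len(s) - width + 1):
        window = " ".join(s[start:start + width])
        ratio = SequenceMatcher(None, target, window, autojunk=False).ratio()
        best = max(best, ratio)
    return best


def is_supported(quote: str, source: str, threshold: float = 0.85) -> bool:
    """Hypothetical decision rule: the citation counts if the best window clears the threshold."""
    return citation_score(quote, source) >= threshold


passage = "In fiscal 2023, revenue increased 12% year over year, driven by services."
print(is_supported("Revenue increased 12% year over year", passage))
print(is_supported("Net losses widened due to litigation costs", passage))
```

A character-level ratio tolerates small wording and punctuation differences, but it has no notion of meaning, which is why a pipeline like the one described pairs it with an NLI model for the claim-level check.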
AI deal intelligence platform. Document intelligence pipeline combining Azure OpenAI assumption extraction with regex fallback, plus sentence-transformers + pgvector for semantic retrieval. Multi-tenant FastAPI backend (JWT, RBAC, 19 Postgres tables), 3-queue Celery worker pipeline, and a safe AST-based formula engine with topological dependency resolution.
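A safe AST-based formula engine with topological dependency resolution could look roughly like the sketch below. It is an illustration under assumptions, not the platform's implementation: the function names, the whitelist of operators, and the sample formulas are invented for the example. Formulas are parsed with Python's ast module, only arithmetic nodes are evaluated, and graphlib orders formulas so each runs after the ones it depends on.

```python
import ast
import operator
from graphlib import TopologicalSorter

# Whitelisted operators; anything else (calls, attributes, subscripts) is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def _names(tree: ast.AST) -> set[str]:
    """Collect every variable name referenced in a parsed formula."""
    return {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}


def _eval(node: ast.AST, env: dict) -> float:
    """Evaluate a whitelisted arithmetic expression against env."""
    if isinstance(node, ast.Expression):
        return _eval(node.body, env)
    if isinstance(node, ast.Constant):
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
    elif isinstance(node, ast.Name):
        return env[node.id]
    elif isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval(node.left, env), _eval(node.right, env))
    elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval(node.operand, env))
    raise ValueError(f"unsupported expression: {type(node).__name__}")


def evaluate(formulas: dict[str, str], inputs: dict[str, float]) -> dict[str, float]:
    """Evaluate formulas in dependency order; raises graphlib.CycleError on circular references."""
    trees = {name: ast.parse(src, mode="eval") for name, src in formulas.items()}
    # Map each formula to the other formulas it reads (its predecessors).
    graph = {name: {n for n in _names(tree) if n in trees} for name, tree in trees.items()}
    env = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        env[name] = _eval(trees[name], env)
    return env


result = evaluate(
    {"margin": "(revenue - cost) / revenue", "revenue": "units * price", "cost": "units * unit_cost"},
    {"units": 100, "price": 5.0, "unit_cost": 3.0},
)
print(result["margin"])
```

Walking the tree against a whitelist, rather than calling eval, means a formula such as a function call is rejected before anything executes.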
Fine-tuned Mistral-7B on a custom SEC EDGAR pipeline — API ingestion, HTML parsing, section extraction, company-aware splits — for financial document summarisation. MLflow experiment tracking with config-driven reproducibility throughout.
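One common reading of "company-aware splits" is that every filing from a given company lands in the same split, so the model is never evaluated on a company it saw during training. The sketch below shows that idea with a deterministic hash. The cik field name, the split fractions, and the seed are assumptions for illustration and may differ from the real pipeline.

```python
import hashlib


def assign_split(company_id: str, val_frac: float = 0.1, test_frac: float = 0.1,
                 seed: str = "v1") -> str:
    """Map a company to train/val/test deterministically from a hash of its ID."""
    digest = hashlib.sha256(f"{seed}:{company_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"


def split_documents(docs: list[dict], key: str = "cik") -> dict[str, list[dict]]:
    """Group documents by split; all documents sharing docs[i][key] stay together."""
    splits: dict[str, list[dict]] = {"train": [], "val": [], "test": []}
    for doc in docs:
        splits[assign_split(doc[key])].append(doc)
    return splits


# Hypothetical records: two filings per company.
docs = [{"cik": cik, "form": form}
        for cik in ("0000320193", "0000789019", "0001318605")
        for form in ("10-K", "10-Q")]
for name, members in split_documents(docs).items():
    print(name, len(members))
```

Hashing the company ID rather than shuffling rows keeps the assignment stable when new filings are ingested later, which helps with the config-driven reproducibility mentioned above.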
Scaled Sky's primary conversational AI across help and sales journeys on web and mobile. Architected multi-tenant LLM serving with production guardrails — prompt injection filtering, PII redaction, intent safety classification, and retrieval-augmented knowledge grounding.
Uses a lock graph with DFS cycle detection, built from Python's ast module and Go's go/ast package, to detect deadlocks, races, and goroutine leaks across 12 hazard types.
Jupyter notebooks covering core ML topics — from fundamentals to applied experiments.
Predictive modelling for marine insurance risk using classical ML approaches.
RAG evaluation is deceptively hard. String matching doesn't capture semantic accuracy, and LLM-as-judge alone is expensive and inconsistent. Here's how I combined claim-level decomposition, local NLI models, and fuzzy citation matching to build an evaluation pipeline that actually reflects what users care about.
More posts in the pipeline — covering QLoRA fine-tuning, Snowflake data quality patterns, and enterprise RAG in production.
I'm looking for Staff / Senior AI Engineer roles at ambitious companies — FAANG, AI-first startups, or strong mid-tier tech. I bring 8+ years of proven delivery, not just prototypes.
Based in London, open to remote and relocation. Happy to talk about RAG, LLM systems, data platforms, or anything in between.