Anjali Subramaniam
Summary
Senior LLM engineer with six years of ML + two years on production LLM systems. Owns the inference request router for the Llama-3.1-70B-Instruct fleet at a Series C AI company (4M req/day, 4×H100 nodes). Cut p99 ms/token from 84ms to 28ms via vLLM continuous batching + speculative decoding. Two merged PRs to vllm-project/vllm; NeurIPS 2024 workshop coauthor.
Skills
Experience
- Cut p99 ms/token on the Llama-3.1-70B inference path from 84ms to 28ms by migrating from HF Transformers serving to vLLM with continuous batching + paged-attention; speculative decoding with a Llama-3.1-8B draft model accounted for the last 12ms.
- Reduced cost-per-1k-tokens on the inference fleet from $0.42 to $0.18 through H100 → H200 migration (1.4× throughput) + continuous batching tuning + prefix caching for long-system-prompt workloads.
- Built the team's eval harness on lm-evaluation-harness + custom rubric tasks (n=480 prompts, 4-judge ensemble with calibration); held every fine-tuned release to at least 6pp above the base model on the internal benchmark across 4 quarterly releases.
- Fine-tuned Llama-3.1-8B on a 38k-row internal support-ticket dataset (LoRA r=16, alpha=32, 3 epochs); the resulting model beat GPT-4o-mini on the internal eval by 14pp at 1/12 the per-token cost.
- Shipped a prompt-injection detection layer (Llama-3.1-8B guard + heuristic prefilter); detected 98.4% of injection attempts with 0.4% FP rate on benign traffic.
- Built the team's RAG pipeline (pgvector + sentence-transformers + hybrid retriever); raised Recall@5 from 62% to 87% by combining BM25 and dense retrieval with an LLM reranker.
- Migrated the eval pipeline from notebooks to a CI-gated harness; every model release now runs against 480 prompts in 38 min, and deploys are blocked unless regression stays under 2pp.
- Mentored 2 junior ML engineers transitioning from classical ML into LLM engineering; both shipped sole-owner production LLM systems within 6 months.
- Built the merchant-fraud detection model (XGBoost on 38M historical transactions); precision at recall=0.80 rose from 0.62 to 0.84 on the post-launch holdout.
Open Source & Publications
Two merged PRs to vLLM: one extended the continuous-batching scheduler for prefix caching across long-prefix workloads; the other closed a memory leak in the paged-attention kernel under high tensor parallelism. NeurIPS 2024 workshop paper (coauthor): 'Speculative decoding with mixture-of-experts draft models.' Open-sourced llama-3.1-8b-quill-support on Hugging Face (4,800 downloads).
Education