Pieter van der Westhuizen
Senior SRE · SLO + Incident Command · OpenTelemetry / Prometheus
Summary
Senior SRE with seven years across two payments + platform companies. Owns the 99.95% availability SLO on the merchant-settlement service ($1.2B annualised). Led IC on 11 incidents through 2024 (3 Sev-1); MTTM on bank-rails incidents fell 47→14 min. Two merged PRs to OpenTelemetry Collector; SREcon 2024 speaker.
Skills
SRE practices
SLO + SLI engineering · Multi-window multi-burn-rate alerting · Blameless postmortems · Incident command · Chaos engineering / GameDay
Observability
OpenTelemetry · Prometheus + Thanos · Grafana + Jsonnet · Tempo (tracing) · Loki (logs)
Platform
Kubernetes · Terraform · AWS (EKS, RDS, S3, Lambda) · Go (control plane) · PagerDuty + Rundeck
Experience
Senior Site Reliability Engineer
Quill · Remote (Amsterdam)
Apr 2022—Present
Series B fintech, 38k merchants, $1.2B annualised. Own the merchant-settlement service SLO + on-call rotation.
- Restructured alerting on the bank-rails service from threshold-based to multi-window multi-burn-rate SLO alerts; alert volume fell 76% with no missed SLO-breaching incidents over 9 months.
- Led IC on 11 incidents through 2024 (3 Sev-1, 8 Sev-2); authored the team's blameless-postmortem template now adopted across 4 service teams. MTTM on bank-rails incidents fell 47→14 minutes.
- Automated the bank-rails on-call toil via a self-healing runbook (Terraform + Lambda + PagerDuty Rundeck); on-call eng-hours per rotation fell from 14 to 3 across 4 quarters.
- Built the capacity-planning automation (Python + Prometheus historic data + statistical forecast); quarterly forecast accuracy went from ±18% to ±4% across the production-services fleet.
- Owned the GameDay program through 2024 — ran 14 chaos-engineering exercises across 3 service teams; surfaced 22 latent reliability bugs (8 Sev-1 candidates).
Site Reliability Engineer
Adyen · Amsterdam, NL
Aug 2019—Mar 2022
- Migrated 12 services from StatsD + ELK to OpenTelemetry + Tempo + Grafana; cut MTTI on cross-service traces from 22 minutes to under 4 minutes.
- Shipped a Prometheus + Thanos federation rebuild across 4 production clusters; p95 query latency on global dashboards fell from 8.2s to 480ms.
- Authored 11 blameless postmortems through the period; 38 corrective actions shipped within the 60-day window (89% completion rate).
Production Engineer
Booking.com · Amsterdam, NL
Jul 2017—Jul 2019
- Reduced PagerDuty alert volume by 84% across 6 services by migrating threshold-based alerts to SLO-based alerts and consolidating redundant signals.
Open Source & Speaking
open-telemetry/opentelemetry-collector
Contributor (2 merged PRs)
One PR closed a metric-cardinality leak under high-volume scrape configs; the other extended the OTLP exporter for self-monitoring. Also spoke at SREcon EMEA 2024: 'Multi-burn-rate alerting in practice' (40-min talk).
Go · OpenTelemetry
Education
MSc in Computer Science (Distributed Systems)
University of Amsterdam
Sep 2014—Jun 2017