
The Gap Between a Working Model and a Reliable Service

Sprout AI  ·  2022 - 2023  ·  London, UK


Sprout AI builds machine learning tools for insurance companies — OCR, automated claim handlers, processing pipelines that work through an incoming claim without a human touching every step. My role was on the backend: building and maintaining the APIs that served the ML models.

The engineering around the model

The data scientists built the models. My job was to take a trained PyTorch model and turn it into a production service that clients could call reliably. That boundary is interesting, because it's where research-flavored work meets the reliability requirements of a real system.
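To make that handoff concrete, here is a minimal sketch of the first step on the serving side: loading a trained artifact and freezing it for inference rather than training. The model class, module path, and file name are hypothetical, not Sprout's actual code.

```python
import torch

from my_project.models import ClaimClassifier  # hypothetical model class


def load_model(weights_path: str = "artifacts/claim_classifier.pt") -> torch.nn.Module:
    """Load trained weights once at startup and freeze the model for inference."""
    model = ClaimClassifier()
    state_dict = torch.load(weights_path, map_location="cpu")
    model.load_state_dict(state_dict)
    model.eval()  # disable dropout / batch-norm updates
    return model


@torch.no_grad()
def predict(model: torch.nn.Module, features: torch.Tensor) -> torch.Tensor:
    """Run a single inference pass without building the autograd graph."""
    return model(features)
```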

In practice: FastAPI endpoints, caching to avoid redundant inference, testing that actually exercises the model path end-to-end, and deployment on AWS using KServe. KServe handles a lot of the serving infrastructure, but the failure modes are subtle — model warm-up latency, memory pressure under concurrent requests, state across restarts — and most of the interesting work was understanding those before they became incidents.
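A hedged sketch of the caching and warm-up pattern, not the actual Sprout service: a FastAPI endpoint that hashes the request payload, checks Redis before running inference, and does one dummy pass at startup so the first real request doesn't pay the warm-up cost. The endpoint name, request schema, Redis address, and dummy input shape are all illustrative; `load_model` and `predict` are from the loading sketch above.

```python
import hashlib
import json

import redis
import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
cache = redis.Redis(host="localhost", port=6379)  # illustrative address
model = load_model()  # from the loading sketch above


class ClaimRequest(BaseModel):
    features: list[float]  # hypothetical request schema


@app.on_event("startup")
def warm_up() -> None:
    # One dummy inference at startup so lazy initialisation (weights on device,
    # kernel setup) doesn't land on the first client request.
    predict(model, torch.zeros(1, 16))  # illustrative input shape


@app.post("/predict")
def predict_claim(req: ClaimRequest) -> dict:
    # Key the cache on the exact payload so identical claims skip inference.
    key = "pred:" + hashlib.sha256(json.dumps(req.features).encode()).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    scores = predict(model, torch.tensor([req.features]))
    result = {"scores": scores.squeeze(0).tolist()}
    cache.setex(key, 3600, json.dumps(result))  # cache for an hour
    return result
```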

Monitoring matters more for ML services than for conventional APIs, because the failures are often silent. A model returning confident but wrong predictions is harder to catch than a service returning 500s. Getting observability right was something I spent real time on.
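The post doesn't describe the actual monitoring stack, but one common way to surface "confident but wrong" behaviour is to export prediction-level metrics alongside the usual latency and error counters. A sketch using prometheus_client, with hypothetical metric names and thresholds:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; the point is to track model behaviour,
# not just HTTP status codes.
PREDICTION_CONFIDENCE = Histogram(
    "claim_prediction_confidence",
    "Top-class confidence of each prediction",
    buckets=[0.1, 0.25, 0.5, 0.75, 0.9, 0.99],
)
LOW_CONFIDENCE_PREDICTIONS = Counter(
    "claim_low_confidence_total",
    "Predictions below the review threshold",
)


def record_prediction(confidence: float, threshold: float = 0.5) -> None:
    """Record per-prediction metrics so drift shows up on a dashboard,
    not in a client complaint."""
    PREDICTION_CONFIDENCE.observe(confidence)
    if confidence < threshold:
        LOW_CONFIDENCE_PREDICTIONS.inc()


if __name__ == "__main__":
    # Expose /metrics for Prometheus to scrape (port is illustrative).
    start_http_server(9100)
```

A shift in the confidence histogram is often the first visible sign that an upstream data change has quietly degraded a model, well before anyone notices wrong outputs downstream.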

What I took from it

The main lesson: ML systems in production are infrastructure, not applications. They need the same rigour around failure modes that you'd apply to a database. Treating them as "just code that calls a model" is how you get paged at 3am.

Working at a small startup also meant there was no SRE team. If something broke, you figured it out. I've come to value that kind of full-stack ownership — it changes how carefully you think about what you ship.

Python · PyTorch · FastAPI · AWS · KServe · Redis