Machine Learning & MLOps

From model training to production deployment — and the systems to keep models performing.

What “shipped to production” actually means

In MLOps, “deployed” is just the start. A production model needs:

  • Reproducible training — same data + same code → same weights. DVC, MLflow, or hashed datasets in object storage.
  • Versioned inference — every served prediction tied to a model version, so when something goes wrong you can trace back.
  • Eval gates — before any model is promoted, it has to beat the current production model on a frozen holdout. This is non-negotiable.
  • Online/offline parity — the features used at inference are computed the same way as at training. Feast or a custom feature store closes this gap.
  • Monitoring — input drift, prediction drift, latency, error rate, and business KPI all watched. Most teams have one or two of these; few have all five.
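The eval-gate bullet above can be sketched as a small promotion check in CI. A minimal sketch — the metric names, thresholds, and function name here are illustrative, not any particular platform's API:

```python
def passes_eval_gate(candidate: dict, production: dict,
                     higher_is_better=("auc",),
                     lower_is_better=("p95_latency_ms",),
                     min_improvement: float = 0.0) -> bool:
    """Promote only if the candidate beats production on every gated metric,
    all scored on the same frozen holdout."""
    beats = all(candidate[m] >= production[m] + min_improvement
                for m in higher_is_better)
    holds = all(candidate[m] <= production[m] for m in lower_is_better)
    return beats and holds

# Hypothetical holdout scores:
prod = {"auc": 0.82, "p95_latency_ms": 120}
cand = {"auc": 0.84, "p95_latency_ms": 110}
print(passes_eval_gate(cand, prod, min_improvement=0.01))  # True
```

The point isn't the arithmetic; it's that the gate runs in CI, against a holdout that never changes between candidates, so "beats production" means the same thing every time.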

When this fits

You have data-science capacity but no one paid to keep models running reliably in production. Or you’re scaling from one model to ten and the manual deployment dance has stopped scaling.

Questions about Machine Learning & MLOps

"Production-ready" here means: reproducible training (someone else can rerun and get the same numbers), versioned inputs (the training data is captured, not just the model weights), CI-gated deployments (a model can't be promoted without passing eval thresholds), monitored inference (you know when latency or accuracy degrades), and a retraining path (when drift happens, the loop closes itself).

For most teams, managed wins on TCO once you account for the platform team you'd otherwise need. We've shipped on all three. The decision usually comes down to your cloud, your data residency requirements, and how much custom orchestration you need.

Adjacent. LLM apps care more about prompt versioning, retrieval quality, and eval datasets than about training pipelines. Traditional ML cares about training reproducibility, feature engineering, and drift. We do both, but treat them as separate workstreams because the tooling differs.

Multiple layers. (1) Input drift: distribution checks on features (KS test, PSI). (2) Output drift: the prediction distribution over time. (3) Performance drift: when labels arrive, comparing predicted vs. actual. (4) Business KPI drift: when the model's downstream metric trends the wrong way. We wire alerts on whichever combination gives you signal first.
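The input-drift layer is simple enough to inline. A sketch of PSI (Population Stability Index), assuming NumPy and decile bins built from the training-time reference sample:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference (training) sample
    and a live sample of one feature."""
    # Bin edges from the reference deciles; open the ends so live values
    # outside the training range still land in a bin.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    eps = 1e-6  # avoid log(0) and division by zero in empty bins
    e_pct, a_pct = e_pct + eps, a_pct + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

A common rule of thumb: PSI below 0.1 is stable, 0.1–0.25 is a moderate shift worth watching, and above 0.25 is usually alert-worthy — though the thresholds that matter are the ones tuned to your own features.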

Ready to talk about Machine Learning & MLOps?

Tell us about your project. We respond within 24 hours.

[email protected]