Engineering · London or the Gulf - Travel to customer sites · Full-time

Machine Learning Infrastructure Engineer

About 1001

1001 builds AI-powered operational intelligence for the world's most complex, data-heavy environments. We turn fragmented data into a live, unified model of operations and use it to drive better decisions and solve high-stakes problems. Our work sits inside government and large enterprises, in environments defined by critical operations and messy, real-world data.

Our engagements start with forward-deployed teams embedded in the customer environment. They work on real data, build quickly, and iterate until the system proves itself, then scale it across the organization.

The company is backed by Lux Capital, General Catalyst, civ, Hanabi, Sanabil, and 9Yards, with angels including Chris Re, Amjad Masad, Karim Atiyeh, Kareem Amin, and Russell Kaplan.

Working at 1001

We take on high-stakes problems in environments where mistakes carry real consequences. That demands an uncompromising bar, real speed, and systems that hold up under live operations. The people who thrive set that bar for themselves and keep raising it. They own outcomes end to end, bring rigor to everything, and lift everyone around them.

About the role

Every model 1001 ships runs on the platform you build. As a Machine Learning Infrastructure Engineer, you own the serving, deployment, and observability layer underneath our models across multiple enterprise deployments at once. Your work is what makes a trained model production-ready: reliable, secure, and affordable to run.

This is a hands-on role for someone who has run machine learning in production. The hard parts show up at scale and across customers: multi-tenant serving, model artifacts and registries that stay consistent, deployment security that holds up in government and enterprise environments, and inference that stays fast and affordable across CPU and GPU workloads.

You will build shared platform components that product teams can use on their own, so each deployment is faster and more reliable than the last.

What you'll work on

  • Build and operate machine learning serving infrastructure across multiple enterprise and government deployments.
  • Own the deployment pipeline end to end: model registry, artifacts, datasets, deployment security, and multi-tenant serving.
  • Stand up monitoring, logging, and observability so model performance is visible and problems surface early.
  • Optimize inference latency, throughput, and cost across CPU and GPU workloads.
  • Maintain shared platform components so product teams can ship models without waiting on you.

Requirements

  • 4 or more years in infrastructure or platform engineering, with a meaningful share of it in machine learning or MLOps.
  • A track record of running machine learning in production, not only building it.
  • Hands-on with a model serving stack such as Triton, TorchServe, Ray Serve, vLLM, BentoML, or KServe.
  • Strong with Kubernetes, containers, CI/CD, and infrastructure-as-code with Terraform.
  • Production cloud experience on AWS, GCP, or Azure.
  • Comfortable with a monitoring and observability stack such as Prometheus, Grafana, or OpenTelemetry.
  • Solid in Python and TypeScript.

Nice to have

  • GPU optimization experience.
  • Distributed training experience.
  • Large-model serving experience.

Ready to apply? We’d love to hear from you.
