Katalyze AI is a fast-growing AI-driven biotech platform company on a mission to make life-saving drugs accessible and affordable for everyone. Our AI Agents help pharmaceutical and biotech companies increase production efficiency, reduce costs, and minimize waste. We're a team of humble, fast-moving, and curious craftspeople working at the intersection of science and AI.
We're looking for a Site Reliability Engineer to ensure Katalyze AI's platform is reliable, scalable, and secure as we grow with enterprise customers. You'll build and maintain the infrastructure and practices that keep our systems running smoothly — and help us move fast without breaking things.
Define and maintain SLOs, SLIs, and error budgets for critical platform services
Build and operate CI/CD pipelines, monitoring, alerting, and incident response systems
Design and manage cloud infrastructure (AWS/GCP/Azure) using infrastructure-as-code (Terraform, Pulumi)
Implement observability tooling (logging, tracing, metrics) across the platform
Partner with engineering to embed reliability practices into the development lifecycle
Lead incident response and post-mortems; drive systemic improvements
Support security and compliance requirements for enterprise customer deployments
Build automation to reduce toil and improve operational efficiency
4+ years of SRE, DevOps, or platform engineering experience
Strong experience with Kubernetes, Docker, and container orchestration
Proficiency with cloud platforms (AWS preferred) and infrastructure-as-code
Experience with observability tools (Datadog, Grafana, Prometheus, or similar)
Understanding of security best practices and awareness of enterprise compliance requirements (SOC 2, HIPAA)
Experience with Python or Go for automation and scripting
Startup experience preferred — you're comfortable building from scratch