Katalyze AI is a fast-growing AI-driven biotech platform company on a mission to make life-saving drugs accessible and affordable for everyone. Our AI Agents help pharmaceutical and biotech companies increase production efficiency, reduce costs, and minimize waste. We're a team of humble, fast-moving, and curious craftspeople working at the intersection of science and AI.
We are looking for a Platform / Infrastructure Engineer to build and scale reliable, secure, multi-tenant infrastructure for our AI-powered pharmaceutical platforms. You will be responsible for the production-grade backbone of our services, ensuring that our systems meet strict SOC2/HIPAA compliance standards while serving global pharmaceutical leaders. From implementing OpenTelemetry to automating zero-downtime deployments with Terraform, you will own the reliability and security of our entire cloud ecosystem.
Responsibilities:
Observability & Monitoring: Establish a production-grade observability stack (OpenTelemetry, distributed tracing) that achieves a mean time to detection (MTTD) of under 5 minutes for incidents.
Infrastructure as Code (IaC): Implement automated provisioning and CI/CD pipelines (Terraform, GitHub Actions) for multi-tenant SaaS environments.
Reliability Engineering: Own platform reliability metrics, targeting a 99.9% uptime SLA and p95 API latency under 500 ms.
Security & Compliance: Implement and maintain SOC2/HIPAA controls, managing secrets (AWS Secrets Manager), IAM policies, and audit logging.
Cost Optimization: Drive architectural improvements and right-sizing strategies to optimize AWS expenditure.
Documentation: Create postmortems, runbooks, and architectural decision records (ADRs) to ensure team autonomy and operational excellence.
Required:
AWS Mastery: 5+ years managing production infrastructure (ECS/Fargate, RDS, S3, CloudFront, VPC networking).
Terraform Expertise: Deep experience with IaC patterns for multi-environment deployments (Dev/Staging/Prod).
Containerization: Battle-tested experience managing Docker/ECS with a focus on auto-scaling and health checks.
Incident Response: Real-world experience in on-call rotations and resolving live production outages.
CI/CD & Automation: Strong experience implementing pipelines for monorepo applications (Nx experience is a plus).
Security Mindset: Practical knowledge of least-privilege IAM, network isolation, and secrets management.
Overlap: Ability to work 4-6 hours of overlap with US East Coast (EST/EDT) business hours.
Nice to Have:
Snowflake administration (role management, query optimization).
Python scripting for infra-automation.
Experience with Kafka, Redis, or BullMQ queue infrastructure.
Familiarity with dbt pipeline orchestration (Airflow/MWAA).
Tech Stack:
Cloud: AWS (ECS, RDS, S3, Lambda, CloudFront).
Infrastructure: Terraform, Docker, GitHub Actions.
Data: Snowflake, PostgreSQL, Redis, BullMQ.
Observability: OpenTelemetry, CloudWatch, Datadog.
Frameworks: Nx Monorepo (Next.js, Fastify, Django).