SWE - Platform / Infrastructure Engineer

Full Time - Hybrid Toronto (CA)

About Katalyze AI

Katalyze AI is a fast-growing AI-driven biotech platform company on a mission to make life-saving drugs accessible and affordable for everyone. Our AI Agents help pharmaceutical and biotech companies increase production efficiency, reduce costs, and minimize waste. We're a team of humble, fast-moving, and curious craftspeople working at the intersection of science and AI.

About the Role

We are looking for a Platform / Infrastructure Engineer to build and scale reliable, secure, multi-tenant infrastructure for our AI-powered pharmaceutical platforms. You will be responsible for the production-grade backbone of our services, ensuring that our systems meet strict SOC2/HIPAA compliance standards while serving global pharmaceutical leaders. From implementing OpenTelemetry to automating zero-downtime deployments with Terraform, you will own the reliability and security of our entire cloud ecosystem.

Key Responsibilities

  • Observability & Monitoring: Establish a production-grade observability stack (OpenTelemetry, distributed tracing) to achieve a <5 min mean-time-to-detection for incidents.

  • Infrastructure as Code (IaC): Implement automated provisioning and CI/CD pipelines (Terraform, GitHub Actions) for multi-tenant SaaS environments.

  • Reliability Engineering: Own platform metrics, aiming for a 99.9% uptime SLA and <500ms p95 API latency.

  • Security & Compliance: Implement and maintain SOC2/HIPAA controls, managing secrets (AWS Secrets Manager), IAM policies, and audit logging.

  • Cost Optimization: Drive architectural improvements and right-sizing strategies to optimize AWS expenditure.

  • Documentation: Create postmortems, runbooks, and architectural decision records (ADRs) to ensure team autonomy and operational excellence.

Qualifications

Required:

  • AWS Mastery: 5+ years managing production infrastructure (ECS/Fargate, RDS, S3, CloudFront, VPC networking).

  • Terraform Expertise: Deep experience with IaC patterns for multi-environment deployments (Dev/Staging/Prod).

  • Containerization: Battle-tested experience managing Docker/ECS with a focus on auto-scaling and health checks.

  • Incident Response: Real-world experience in on-call rotations and resolving live production outages.

  • CI/CD & Automation: Strong experience implementing pipelines for monorepo applications (Nx experience is a plus).

  • Security Mindset: Practical knowledge of least-privilege IAM, network isolation, and secrets management.

  • Overlap: Ability to maintain 4-6 hours of overlap with US East Coast (EST/EDT) business hours.

Nice to Have:

  • Snowflake administration (role management, query optimization).

  • Python scripting for infra-automation.

  • Experience with Kafka, Redis, or BullMQ queue infrastructure.

  • Familiarity with dbt pipeline orchestration (Airflow/MWAA).

Tech Stack

  • Cloud: AWS (ECS, RDS, S3, Lambda, CloudFront).

  • Infrastructure: Terraform, Docker, GitHub Actions.

  • Data: Snowflake, PostgreSQL, Redis, BullMQ.

  • Observability: OpenTelemetry, CloudWatch, Datadog.

  • Frameworks: Nx Monorepo (Next.js, Fastify, Django).
