Staff Data Engineer

Full Time - Hybrid - Toronto (CA) / San Francisco (USA)

About Katalyze AI

Katalyze AI is a fast-growing AI-driven biotech platform company on a mission to make life-saving drugs accessible and affordable for everyone. Our AI Agents help pharmaceutical and biotech companies increase production efficiency, reduce costs, and minimize waste. We're a team of humble, fast-moving, and curious craftspeople working at the intersection of science and AI.

About the Role

We're looking for a Staff or Senior Data Engineer to own the data infrastructure that powers Katalyze AI's platform. You'll make architecture decisions, design integrations with customer data systems, and build the streaming pipelines that give our AI models and agents access to clean, reliable, real-time data — setting the patterns the team builds on as we scale.

What You'll Do

  • Own the data infrastructure architecture — define standards, patterns, and tooling decisions for the data layer

  • Design and build data integration pipelines connecting customer systems (MES, LIMS, ERP, and process historians) to the Katalyze AI platform

  • Develop and operate real-time and batch data streaming infrastructure (Kafka, Kinesis, or similar) at scale

  • Build and maintain ETL/ELT pipelines for structured and unstructured scientific data

  • Establish data quality, reliability, and observability frameworks across all pipelines

  • Collaborate with ML and Data Science teams to deliver clean, well-structured data for model training and inference

  • Design data schemas and storage solutions (data lakes, warehouses) optimized for AI/ML workloads

  • Work directly with customer IT teams during deployments to establish secure data connections and meet compliance requirements

  • Set technical direction for the data engineering function as the team grows

What We're Looking For

  • 7+ years of data engineering experience, with a track record of owning systems — not just building within them

  • Demonstrated experience making architecture decisions: choosing tools, designing schemas, defining standards

  • Deep expertise in data streaming (Kafka, Kinesis, Flink, or Spark Streaming), with systems you've designed and operated in production at scale

  • Strong proficiency in building data integrations with external enterprise systems (REST APIs, OPC-UA, proprietary connectors)

  • Experience with cloud data platforms (AWS Glue, Databricks, Snowflake, or similar)

  • Proficiency in Python and SQL; experience with dbt or similar transformation tooling

  • Strong data quality and observability instincts — you've built frameworks, not just used existing tools

  • Background in industrial data systems (OSIsoft PI, Ignition, MES/LIMS integrations) is a strong plus

  • Comfortable communicating technical tradeoffs to non-technical stakeholders
