Katalyze AI is a fast-growing AI-driven biotech platform company on a mission to make life-saving drugs accessible and affordable for everyone. Our AI Agents help pharmaceutical and biotech companies increase production efficiency, reduce costs, and minimize waste. We're a team of humble, fast-moving, and curious craftspeople working at the intersection of science and AI.
We're looking for a Staff or Senior Data Engineer to own the data infrastructure that powers Katalyze AI's platform. You'll make architecture decisions, design integrations with customer data systems, and build the streaming pipelines that give our AI models and agents access to clean, reliable, real-time data — setting the patterns the team builds on as we scale.
What you'll do:
Own the data infrastructure architecture: define the standards, patterns, and tooling choices for the data layer
Design and build data integration pipelines connecting customer systems (MES, LIMS, ERP, historians) to the Katalyze AI platform
Develop and operate real-time streaming and batch data infrastructure (Kafka, Kinesis, or similar) at scale
Build and maintain ETL/ELT pipelines for structured and unstructured scientific data
Establish data quality, reliability, and observability frameworks across all pipelines
Collaborate with ML and Data Science teams to deliver clean, well-structured data for model training and inference
Design data schemas and storage solutions (data lakes, warehouses) optimized for AI/ML workloads
Work directly with customer IT teams during deployments to establish secure data connections and meet compliance requirements
Set technical direction for the data engineering function as the team grows
What we're looking for:
7+ years of data engineering experience, with a track record of owning systems rather than just building within them
Demonstrated experience making architecture decisions: choosing tools, designing schemas, defining standards
Deep expertise in data streaming (Kafka, Kinesis, Flink, or Spark Streaming), with systems you've designed and operated in production at scale
Strong proficiency in building data integrations with external enterprise systems (REST APIs, OPC-UA, proprietary connectors)
Experience with cloud data platforms (AWS Glue, Databricks, Snowflake, or similar)
Proficiency in Python and SQL; experience with dbt or similar transformation tooling
Strong data quality and observability instincts — you've built frameworks, not just used existing tools
Background in industrial data systems (OSIsoft PI, Ignition, MES/LIMS integrations) is a strong plus
Comfortable communicating technical tradeoffs to non-technical stakeholders