Principal Data Scientist (Founding)

Full Time - Hybrid - Toronto (CA) - San Francisco (USA)

About Katalyze AI

Katalyze AI is a fast-growing AI-driven biotech platform company on a mission to make life-saving drugs accessible and affordable for everyone. Our AI Agents help pharmaceutical and biotech companies increase production efficiency, reduce costs, and minimize waste. We're a team of humble, fast-moving, and curious craftspeople working at the intersection of science and AI.

About the Role

We're looking for a Principal Data Scientist to join Katalyze AI and work at the intersection of applied statistics, machine learning, and biotechnology. You'll independently analyze complex scientific and process data, build interpretable predictive models, and translate findings into actionable recommendations for enterprise customers in biopharma and advanced manufacturing.

This is a high-ownership, customer-facing role. You'll work directly with scientists and engineers at our accounts rather than just handing off reports internally.

What You'll Do

  • Build predictive and diagnostic models on scientific and industrial data (time series, multivariate sensor data, spectral data, batch records)

  • Select and apply the right modelling technique for each problem — gradient-boosted trees, Gaussian processes, neural networks, classical statistical models — with clear reasoning for your choices

  • Apply signal processing and time series methods (Fourier transforms, wavelet analysis, autocorrelation, decomposition, forecasting) to real-world sensor and process data

  • Design rigorous model evaluation frameworks: cross-validation strategies for time-series data, SHAP-based interpretability, uncertainty quantification, and statistical significance testing

  • Build interpretable ML pipelines that surface drivers of variability in ways that satisfy audit and documentation requirements

  • Design analytics dashboards that communicate complex statistical findings to manufacturing scientists, quality teams, and supply chain managers

  • Work closely with the Deployment Strategist to configure and deliver data science components for customer deployments

  • Partner directly with enterprise customers to understand their data challenges, deviation patterns, and quality systems

  • Apply LLM-based approaches where appropriate to automate insight generation and multi-step analytical workflows

What We're Looking For

  • 6+ years of applied data science experience with a strong foundation in statistics and machine learning

  • Deep understanding of how model families work — linear/logistic regression, tree-based models (XGBoost, LightGBM, CatBoost), SVMs, neural networks, transformers — and when to use each

  • Strong time series expertise: Fourier analysis, wavelet transforms, autocorrelation, stationarity, decomposition, and forecasting (ARIMA, Prophet, and deep learning approaches)

  • Rigorous model evaluation skills: proper train/test design for time-series data, overfitting detection, SHAP and interpretability methods, uncertainty quantification

  • Experience with Gaussian processes, Bayesian methods, or uncertainty quantification

  • Strong Python skills: scikit-learn, pandas, numpy, statsmodels, PyTorch or TensorFlow

  • Experience with enterprise data infrastructure — SQL, data warehouses, cloud platforms (Snowflake, Databricks, Redshift)

  • Strong communication skills — able to explain statistical findings clearly to scientists, engineers, and business stakeholders

  • Experience with scientific or industrial data (sensor streams, spectral data, batch records, LIMS/MES outputs) is a strong plus

  • PhD or Master's in Data Science, Statistics, Chemical Engineering, or related field preferred

  • Experience with LLMs or agentic systems is a plus, not a core requirement

Tech Stack

  • ML & Analytics: Python, scikit-learn, XGBoost, LightGBM, PyTorch, statsmodels

  • Interpretability: SHAP, LIME, custom visualization pipelines

  • Data Processing: pandas, polars, SQL, dbt

  • Visualization: Plotly, Streamlit, custom dashboards

  • Infrastructure: AWS, Docker, PostgreSQL

  • LLM / Agents: Claude/GPT APIs, LangChain (where applicable)

Location

Toronto or San Francisco (hybrid)
