Katalyze AI is a fast-growing AI-driven biotech platform company on a mission to make life-saving drugs accessible and affordable for everyone. Our AI Agents help pharmaceutical and biotech companies increase production efficiency, reduce costs, and minimize waste. We're a team of humble, fast-moving, and curious craftspeople working at the intersection of science and AI.
We're looking for a Principal Data Scientist to join Katalyze AI and work at the intersection of applied statistics, machine learning, and biotechnology. You'll independently analyze complex scientific and process data, build interpretable predictive models, and translate findings into actionable recommendations for enterprise customers in biopharma and advanced manufacturing.
This is a high-ownership, customer-facing role. You'll work directly with scientists and engineers at our customer accounts rather than only handing off reports internally.
Build predictive and diagnostic models on scientific and industrial data (time series, multivariate sensor data, spectral data, batch records)
Select and apply the right modelling technique for each problem — gradient-boosted trees, Gaussian processes, neural networks, classical statistical models — with clear reasoning for your choices
Apply signal processing and time series methods (Fourier transforms, wavelet analysis, autocorrelation, decomposition, forecasting) to real-world sensor and process data
Design rigorous model evaluation frameworks: cross-validation strategies for time-series data, SHAP-based interpretability, uncertainty quantification, and statistical significance testing
Build interpretable ML pipelines that surface drivers of variability in ways that satisfy audit and documentation requirements
Design analytics dashboards that communicate complex statistical findings to manufacturing scientists, quality teams, and supply chain managers
Work closely with the Deployment Strategist to configure and deliver data science components for customer deployments
Partner directly with enterprise customers to understand their data challenges, deviation patterns, and quality systems
Apply LLM-based approaches where appropriate to automate insight generation and multi-step analytical workflows
6+ years of applied data science experience with a strong foundation in statistics and machine learning
Deep understanding of how model families work — linear/logistic regression, tree-based models (XGBoost, LightGBM, CatBoost), SVMs, neural networks, transformers — and when to use each
Strong time series expertise: Fourier analysis, wavelet transforms, autocorrelation, stationarity, decomposition, and forecasting (ARIMA, Prophet, and deep learning approaches)
Rigorous model evaluation skills: proper train/test design for time-series data, overfitting detection, SHAP and interpretability methods, uncertainty quantification
Experience with Gaussian processes, Bayesian methods, or uncertainty quantification
Strong Python skills: scikit-learn, pandas, numpy, statsmodels, PyTorch or TensorFlow
Experience with enterprise data infrastructure — SQL, data warehouses, cloud platforms (Snowflake, Databricks, Redshift)
Strong communication skills — able to explain statistical findings clearly to scientists, engineers, and business stakeholders
Experience with scientific or industrial data (sensor streams, spectral data, batch records, LIMS/MES outputs) is a strong plus
PhD or Master's in Data Science, Statistics, Chemical Engineering, or related field preferred
Experience with LLMs or agentic systems is a plus, not a core requirement
ML & Analytics: Python, scikit-learn, XGBoost, LightGBM, PyTorch, statsmodels
Interpretability: SHAP, LIME, custom visualization pipelines
Data Processing: pandas, polars, SQL, dbt
Visualization: Plotly, Streamlit, custom dashboards
Infrastructure: AWS, Docker, PostgreSQL
LLM / Agents: Claude/GPT APIs, LangChain (where applicable)
Toronto or San Francisco (hybrid)