Kelly W. Zhang

kelly.zhang@imperial.ac.uk

CV (Feb 2025)

I am an Assistant Professor at Imperial College London in the Mathematics Department (statistics section). I am also a faculty member in I-X, an interdisciplinary AI initiative at Imperial. My research interests lie at the intersection of adaptive experimentation, reinforcement learning, and statistical inference.

I was previously a Postdoctoral Fellow at Columbia Business School in the Decision, Risk, and Operations group, working with Daniel Russo and Hongseok Namkoong. I completed my Ph.D. in computer science at Harvard University in the Statistical Reinforcement Learning Lab, where I was advised by Susan Murphy and Lucas Janson. I was supported by an NSF Graduate Fellowship during my Ph.D. and was selected to be a Siebel Scholar in 2023.

Even earlier, I worked on natural language processing and deep learning with Sasha Rush, Sam Bowman, and Yann LeCun. I also previously interned at Apple’s HealthAI team in Seattle, Facebook AI Research in New York, and at eBay New York on the homepage recommendations team.

Selected Papers

Bandit/RL Algorithms
Contextual Thompson Sampling via Generation of Missing Data

Kelly W Zhang, Tiffany (Tianhui) Cai, Hongseok Namkoong, and Daniel Russo

Under submission, 2025

We introduce a framework for Thompson sampling contextual bandit algorithms, in which the algorithm’s ability to quantify uncertainty and make decisions depends on the quality of a generative model that is learned offline. Instead of viewing uncertainty in the environment as arising from unobservable latent parameters, our algorithm treats uncertainty as stemming from missing, but potentially observable, future outcomes. If these future outcomes were all observed, one could simply make decisions using an "oracle" policy fit on the complete dataset. Inspired by this conceptualization, at each decision-time, our algorithm uses a generative model to probabilistically impute missing future outcomes, fits a policy using the imputed complete dataset, and uses that policy to select the next action. We formally show that this algorithm is a generative formulation of Thompson Sampling and prove a state-of-the-art regret bound for it. Notably, our regret bound i) depends on the probabilistic generative model only through the quality of its offline prediction loss, and ii) applies to any method of fitting the "oracle" policy, which easily allows one to adapt Thompson sampling to decision-making settings with fairness and/or resource constraints.
@article{psarContext2025, title = {Contextual Thompson Sampling via Generation of Missing Data}, author = {Zhang, Kelly W and Cai, Tiffany (Tianhui) and Namkoong, Hongseok and Russo, Daniel}, journal = {Under submission}, year = {2025}, }
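The imputation idea in the abstract above can be illustrated on a much simpler problem. The following is a minimal sketch, not the authors' implementation: it assumes a non-contextual Bernoulli bandit, and it stands in a hand-coded Polya-urn predictive rule (a Beta(1, 1) prior) for the generative model that the paper learns offline. The names `impute_arm_mean`, `select_action`, `run`, and the `horizon` parameter are hypothetical and chosen only for illustration.

```python
import random


def impute_arm_mean(successes, failures, horizon, rng):
    """Generate one arm's missing outcomes one at a time and return the
    mean of the completed (observed + imputed) dataset of size `horizon`.

    Assumption: a Polya-urn predictive rule with a Beta(1, 1) prior plays
    the role of the learned generative model. Requires horizon >= 1.
    """
    s, f = successes, failures
    n_missing = max(horizon - (successes + failures), 0)
    for _ in range(n_missing):
        # Predictive probability that the next outcome is 1 given the
        # observed and already-imputed outcomes.
        p_next = (s + 1) / (s + f + 2)
        if rng.random() < p_next:
            s += 1
        else:
            f += 1
    return s / (s + f)


def select_action(counts, horizon, rng):
    """Impute a complete dataset for every arm, then apply the 'oracle'
    policy for this toy problem: pick the arm with the highest mean."""
    imputed = [impute_arm_mean(s, f, horizon, rng) for s, f in counts]
    return max(range(len(counts)), key=lambda a: imputed[a])


def run(true_means, n_rounds, horizon, seed=0):
    """Simulate the bandit and return how often each arm was pulled."""
    rng = random.Random(seed)
    counts = [[0, 0] for _ in true_means]  # [successes, failures] per arm
    pulls = [0] * len(true_means)
    for _ in range(n_rounds):
        a = select_action(counts, horizon, rng)
        reward = rng.random() < true_means[a]
        counts[a][0 if reward else 1] += 1
        pulls[a] += 1
    return pulls
```

For example, `run([0.1, 0.9], n_rounds=200, horizon=100)` returns per-arm pull counts. In this toy setting, randomness in the imputed outcomes drives exploration in the way posterior sampling would in standard Thompson sampling.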
Bandit/RL Algorithms
Impatient Bandits: Optimizing for the Long-Term Without Delay

Kelly W Zhang, Thomas Baldwin-McDonald, Kamil Ciosek, Lucas Maystre, and Daniel Russo

Under submission, 2025

Increasingly, recommender systems are tasked with improving users’ long-term satisfaction. In this context, we study a content exploration task, which we formalize as a bandit problem with delayed rewards. There is an apparent trade-off in choosing the learning signal: waiting for the full reward to become available might take several weeks, slowing the rate of learning, whereas using short-term proxy rewards reflects the actual long-term goal only imperfectly. First, we develop a predictive model of delayed rewards that incorporates all information obtained to date. Rewards as well as shorter-term surrogate outcomes are combined through a Bayesian filter to obtain a probabilistic belief. Second, we devise a bandit algorithm that quickly learns to identify content aligned with long-term success using this new predictive model. We prove a regret bound for our algorithm that depends on the Value of Progressive Feedback, an information theoretic metric that captures the quality of short-term leading indicators that are observed prior to the long-term reward. We apply our approach to a podcast recommendation problem, where we seek to recommend shows that users engage with repeatedly over two months. We empirically validate that our approach significantly outperforms methods that optimize for short-term proxies or rely solely on delayed rewards, as demonstrated by an A/B test in a recommendation system that serves hundreds of millions of users.
@article{impatientBandits2025, title = {Impatient Bandits: Optimizing for the Long-Term Without Delay}, author = {Zhang, Kelly W and Baldwin-McDonald, Thomas and Ciosek, Kamil and Maystre, Lucas and Russo, Daniel}, journal = {Under submission}, year = {2025}, }
Bandit/RL Algorithms
Active Exploration via Autoregressive Generation of Missing Data

Tiffany (Tianhui) Cai, Hongseok Namkoong, Daniel Russo, and Kelly W Zhang

Working paper; Selected for presentation at the Econometric Society Interdisciplinary Frontiers: Economics and AI+ML conference, 2024

We pose uncertainty quantification and exploration in online decision-making as a problem of training and generation from an autoregressive sequence model, an area experiencing rapid innovation. Our approach rests on viewing uncertainty as arising from missing future outcomes that would be revealed through appropriate action choices, rather than from unobservable latent parameters of the environment. This reformulation aligns naturally with modern machine learning capabilities: we can i) train generative models through next-outcome prediction rather than fit explicit priors, ii) assess uncertainty through autoregressive generation rather than parameter sampling, and iii) adapt to new information through in-context learning rather than explicit posterior updating. To showcase these ideas, we formulate a challenging meta-bandit problem where effective performance requires leveraging unstructured prior information (like text features) while exploring judiciously to resolve key remaining uncertainties. We validate our approach through both theory and experiments. Our theory establishes a reduction, showing success at offline next-outcome prediction translates to reliable online uncertainty quantification and decision-making, even with strategically collected data. Semi-synthetic experiments show our insights bear out in a news-article recommendation task, where article text can be leveraged to minimize exploration.
@article{psar2024, title = {Active Exploration via Autoregressive Generation of Missing Data}, author = {Cai, Tiffany (Tianhui) and Namkoong, Hongseok and Russo, Daniel and Zhang, Kelly W}, journal = {Working paper; Selected for presentation at the Econometric Society Interdisciplinary Frontiers: Economics and AI+ML conference}, year = {2024}, }
Statistical Inference
Statistical Inference with M-Estimators on Adaptively Collected Data

Kelly W Zhang, Lucas Janson, and Susan Murphy

Advances in Neural Information Processing Systems (NeurIPS), 2021

Bandit algorithms are increasingly used in real-world sequential decision-making problems. Associated with this is an increased desire to be able to use the resulting datasets to answer scientific questions like: Did one type of ad lead to more purchases? In which contexts is a mobile health intervention effective? However, classical statistical approaches fail to provide valid confidence intervals when used with data collected with bandit algorithms. Alternative methods have recently been developed for simple models (e.g., comparison of means). Yet there is a lack of general methods for conducting statistical inference using more complex models on data collected with (contextual) bandit algorithms; for example, current methods cannot be used for valid inference on parameters in a logistic regression model for a binary reward. In this work, we develop theory justifying the use of M-estimators – which includes estimators based on empirical risk minimization as well as maximum likelihood – on data collected with adaptive algorithms, including (contextual) bandit algorithms. Specifically, we show that M-estimators, modified with particular adaptive weights, can be used to construct asymptotically valid confidence regions for a variety of inferential targets.
@article{zhang2021mestimator, author = {Zhang, Kelly W and Janson, Lucas and Murphy, Susan}, journal = {Advances in Neural Information Processing Systems (NeurIPS)}, editor = {Ranzato, M. and Beygelzimer, A. and Dauphin, Y. and Liang, P.S. and Vaughan, J. Wortman}, pages = {7460--7471}, title = {Statistical Inference with M-Estimators on Adaptively Collected Data}, volume = {34}, year = {2021}, }
Statistical Inference
Inference for Batched Bandits

Kelly W Zhang, Lucas Janson, and Susan Murphy

Advances in Neural Information Processing Systems (NeurIPS), 2020

As bandit algorithms are increasingly utilized in scientific studies and industrial applications, there is an associated increasing need for reliable inference methods based on the resulting adaptively-collected data. In this work, we develop methods for inference on data collected in batches using a bandit algorithm. We first prove that the ordinary least squares estimator (OLS), which is asymptotically normal on independently sampled data, is not asymptotically normal on data collected using standard bandit algorithms when there is no unique optimal arm. This asymptotic non-normality result implies that the naive assumption that the OLS estimator is approximately normal can lead to Type-1 error inflation and confidence intervals with below-nominal coverage probabilities. Second, we introduce the Batched OLS estimator (BOLS) that we prove is (1) asymptotically normal on data collected from both multi-arm and contextual bandits and (2) robust to non-stationarity in the baseline reward.
@article{zhang2020inference, author = {Zhang, Kelly W and Janson, Lucas and Murphy, Susan}, journal = {Advances in Neural Information Processing Systems (NeurIPS)}, editor = {Larochelle, H. and Ranzato, M. and Hadsell, R. and Balcan, M.F. and Lin, H.}, pages = {9818--9829}, title = {Inference for Batched Bandits}, volume = {33}, year = {2020}, }
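
The batch-wise normalization behind a batched OLS test can be sketched for the two-arm case. This is a rough illustration under stated assumptions rather than the paper's estimator: it assumes a known noise standard deviation `sigma`, a null hypothesis of zero difference between the arms, and that every batch contains pulls of both arms. The function name `bols_statistic` and the data layout are hypothetical.

```python
import math


def bols_statistic(batches, sigma=1.0):
    """Combine per-batch standardized differences in arm means.

    `batches` is a list of batches; each batch is a list of (arm, reward)
    pairs with arm in {0, 1}. Assumes a known noise standard deviation
    `sigma` and tests the null hypothesis that both arms have equal means.
    """
    z_scores = []
    for batch in batches:
        r0 = [r for a, r in batch if a == 0]
        r1 = [r for a, r in batch if a == 1]
        if not r0 or not r1:
            raise ValueError("each batch must contain pulls of both arms")
        # OLS estimate of the arm difference using this batch only.
        diff = sum(r1) / len(r1) - sum(r0) / len(r0)
        # Normalize by the batch's own arm counts, so the adaptive
        # allocation within the batch is accounted for.
        scale = math.sqrt(len(r0) * len(r1) / (len(r0) + len(r1)))
        z_scores.append(scale * diff / sigma)
    # Sum of T standardized terms, rescaled by sqrt(T).
    return sum(z_scores) / math.sqrt(len(z_scores))


# Usage with two hand-made batches (illustrative data only).
example_batch = [(0, 0.0), (0, 0.0), (1, 1.0), (1, 1.0)]
statistic = bols_statistic([example_batch, example_batch])
```

Under these assumptions the statistic would be compared against standard normal quantiles. Standardizing within each batch before combining is what distinguishes this from pooling all batches into a single OLS fit.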

News

Nov 26, 2025	I will attend INFORMS 2025. Yongyi Guo and I will organize a session on “Applications of Statistical Reinforcement Learning.”
Jun 11, 2025	I will attend RLDM and have an oral presentation on our work on exploration via generation of missing data (see here and here)!
May 19, 2025	I will speak at Amazon Berlin as a part of a workshop with the StatML CDT!
Aug 14, 2024	I am co-organizing a session at IMS-Bernoulli on the “Frontiers of Adaptive Experimentation.” The speakers will be Dean Foster, Maria Dimakopoulou, Koulik Khamaru, and myself!
Aug 09, 2024	I am co-organizing a workshop on Deployable RL: From Research to Practice at RLC! Please come by!
Aug 08, 2024	I will be speaking at JSM in the session on Statistical Challenges and New Directions for Adaptive Experimentation organized by Aaditya Ramdas! I will also chair the session on New Methods in Causal Inference and Reinforcement Learning for Personalized Decision-Making!
Aug 06, 2023	I am speaking at the session on Integrating Algorithms and Analysis for Adaptively Randomized Experiments at JSM in Toronto. The session is organized by John Langford, Sofia Villar, Aaditya Ramdas, Joseph Jay Williams, and Tong Li.
Jul 28, 2023	I was interviewed as a part of the Harvard Women in Statistics and Data Science Series!
Apr 28, 2023	I defended my thesis on Statistical Inference for Adaptive Experimentation. My slides are here. My thesis committee members were Susan Murphy, Lucas Janson, Milind Tambe, and Jonas Oddur Jonasson.
Jun 28, 2022	I organized an invited session at the 2022 Institute of Mathematical Statistics Annual Meeting on “Inference Methods for Adaptively Collected Data.” The speakers will be Nathan Kallus, Koulik Khamaru, Evan Munro, and myself! Joseph Jay Williams and Nina Deliu will chair the session.