Research
2024
- [Statistical Inference] Replicable Bandits for Digital Health Interventions. Kelly W. Zhang, Nowell Closser, Anna L. Trella, and Susan A. Murphy. Preprint, 2024.
Adaptive treatment assignment algorithms, such as bandit and reinforcement learning algorithms, are increasingly used in digital health intervention clinical trials. Causal inference and related data analyses are critical for evaluating digital health interventions, deciding how to refine the intervention, and deciding whether to roll out the intervention more broadly. However, the replicability of these analyses has received relatively little attention. This work investigates the replicability of statistical analyses from trials deploying adaptive treatment assignment algorithms. We demonstrate that many standard statistical estimators can be inconsistent and fail to be replicable across repetitions of the clinical trial, even as the sample size grows large. We show that this non-replicability is intimately related to properties of the adaptive algorithm itself. We introduce a formal definition of a "replicable bandit algorithm" and prove that under such algorithms, a wide variety of common statistical analyses are guaranteed to be consistent. We present both theoretical results and simulation studies based on a mobile health oral health self-care intervention. Our findings underscore the importance of designing adaptive algorithms with replicability in mind, especially for settings like digital health where deployment decisions rely heavily on replicated evidence. We conclude by discussing open questions on the connections between algorithm design, statistical inference, and experimental replicability.
- [Bandit/RL Algorithms] Posterior Sampling via Autoregressive Generation. Kelly W. Zhang*, Tiffany (Tianhui) Cai*, Hongseok Namkoong, and Daniel Russo. Preprint, 2024.
Real-world decision-making requires grappling with a perpetual lack of data as environments change; intelligent agents must comprehend uncertainty and actively gather information to resolve it. We propose a new framework for learning bandit algorithms from massive historical data, which we demonstrate in a cold-start recommendation problem. First, we use historical data to pretrain an autoregressive model to predict a sequence of repeated feedback/rewards (e.g., responses to news articles shown to different users over time). In learning to make accurate predictions, the model implicitly learns an informed prior based on rich action features (e.g., article headlines) and how to sharpen beliefs as more rewards are gathered (e.g., clicks as each article is recommended). At decision-time, we autoregressively sample (impute) an imagined sequence of rewards for each action, and choose the action with the largest average imputed reward. Far from a heuristic, our approach is an implementation of Thompson sampling (with a learned prior), a prominent active exploration algorithm. We prove our pretraining loss directly controls online decision-making performance, and we demonstrate our framework on a news recommendation task where we integrate end-to-end fine-tuning of a pretrained language model to process news article headline text to improve performance.
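As a rough illustration of the decision rule described above (not the paper's implementation), the sketch below uses a Beta-Bernoulli posterior predictive as a stand-in for the pretrained autoregressive model: for each action, a future reward sequence is imputed one step at a time, and the action with the largest average imputed reward is chosen.

```python
# Sketch of decision-time posterior sampling via autoregressive generation.
# A Beta-Bernoulli posterior predictive stands in for the pretrained
# autoregressive sequence model (an assumption for this illustration).
import numpy as np

rng = np.random.default_rng(0)

def autoregressive_sample(observed, horizon, a0=1.0, b0=1.0):
    """Impute a future reward sequence one step at a time: each imputed reward
    is drawn from the posterior predictive given the observed rewards plus the
    rewards imputed so far."""
    seq = list(observed)
    imputed = []
    for _ in range(horizon):
        p = (a0 + sum(seq)) / (a0 + b0 + len(seq))  # predictive P(reward = 1)
        r = int(rng.binomial(1, p))
        seq.append(r)
        imputed.append(r)
    return imputed

def choose_action(history, horizon=200):
    """history maps each action to the rewards observed for it so far; pick the
    action whose imagined future sequence has the largest average reward."""
    scores = {a: np.mean(autoregressive_sample(obs, horizon))
              for a, obs in history.items()}
    return max(scores, key=scores.get)

history = {"article_A": [1, 0, 1], "article_B": [0, 0], "article_C": []}
print(choose_action(history))
```

For exchangeable Bernoulli rewards, averaging a long imputed sequence approximates a posterior draw of the action's success probability, which is why this kind of imputation scheme implements Thompson sampling.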
- [Bandit/RL Algorithms] Monitoring Fidelity of Online Reinforcement Learning Algorithms in Clinical Trials. Anna L. Trella, Kelly W. Zhang, Inbal Nahum-Shani, Vivek Shetty, Iris Yan, Finale Doshi-Velez, and Susan A. Murphy. Preprint, 2024.
Online reinforcement learning (RL) algorithms offer great potential for personalizing treatment for participants in clinical trials. However, deploying an online, autonomous algorithm in the high-stakes healthcare setting makes quality control and data quality especially difficult to achieve. This paper proposes algorithm fidelity as a critical requirement for deploying online RL algorithms in clinical trials. It emphasizes the responsibility of the algorithm to (1) safeguard participants and (2) preserve the scientific utility of the data for post-trial analyses. We also present a framework for pre-deployment planning and real-time monitoring to help algorithm developers and clinical researchers ensure algorithm fidelity. To illustrate our framework’s practical application, we present real-world examples from the Oralytics clinical trial. Since Spring 2023, this trial successfully deployed an autonomous, online RL algorithm to personalize behavioral interventions for participants at risk for dental disease.
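For flavor only, here is a hypothetical real-time fidelity check of the kind such a framework calls for; the thresholds, fallback, and alert mechanism are illustrative assumptions, not the monitoring system used in Oralytics.

```python
# Hypothetical real-time fidelity check: thresholds, fallback, and the alert
# mechanism are illustrative, not the Oralytics monitoring system.
def fidelity_check(action_prob, p_min=0.1, p_max=0.9, fallback_prob=0.5):
    """Verify that the algorithm's action-selection probability satisfies a
    pre-specified clipping constraint; fall back to a safe default otherwise."""
    if not (p_min <= action_prob <= p_max):
        print(f"ALERT: probability {action_prob:.3f} outside [{p_min}, {p_max}]; "
              f"using fallback {fallback_prob}")
        return fallback_prob
    return action_prob

print(fidelity_check(0.97))  # triggers the alert and returns 0.5
```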
- [Bandit/RL Algorithms] The Fallacy of Minimizing Local Regret in the Sequential Task Setting. Ziping Xu, Kelly W. Zhang, and Susan Murphy. Preprint, 2024.
In the realm of Reinforcement Learning (RL), online RL is often conceptualized as an optimization problem, where an algorithm interacts with an unknown environment to minimize cumulative regret. In a stationary setting, strong theoretical guarantees, like a sublinear (√T) regret bound, can be obtained, which typically implies the convergence to an optimal policy and the cessation of exploration. However, these theoretical setups often oversimplify the complexities encountered in real-world RL implementations, where tasks arrive sequentially with substantial changes between tasks and the algorithm may not be allowed to adaptively learn within certain tasks. We study changes beyond the outcome distributions, encompassing changes in the reward designs (mappings from outcomes to rewards) and the permissible policy spaces. Our results reveal the fallacy of myopically minimizing regret within each task: obtaining optimal regret rates in the early tasks may lead to worse rates in the subsequent ones, even when the outcome distributions stay the same. To realize the optimal cumulative regret bound across all the tasks, the algorithm has to over-explore in the earlier tasks. This theoretical insight is practically significant, suggesting that due to unanticipated changes (e.g., rapid technological development or human-in-the-loop involvement) between tasks, the algorithm needs to explore more than it would in the usual stationary setting within each task. This implication resonates with the common practice of using clipped policies in mobile health clinical trials and maintaining a fixed rate of ϵ-greedy exploration in robotic learning.
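To make the closing remark concrete, here is a minimal sketch of action-probability clipping; the clip levels are illustrative, not taken from the paper.

```python
# Illustrative action-probability clipping; clip levels are not from the paper.
import numpy as np

def clipped_action_prob(p_algorithm, p_min=0.2, p_max=0.8):
    """Keep the treatment probability inside [p_min, p_max] so exploration
    never ceases, even when the algorithm is confident."""
    return float(np.clip(p_algorithm, p_min, p_max))

print(clipped_action_prob(0.99))  # 0.8: the algorithm still explores 20% of the time
```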
- [Bandit/RL Algorithms] Did we personalize? Assessing personalization by an online reinforcement learning algorithm using resampling. Susobhan Ghosh*, Raphael Kim*, Prasidh Chhabria, Raaz Dwivedi, Predrag Klasnja, Peng Liao, Kelly W. Zhang, and Susan Murphy. Machine Learning (Special Issue on Reinforcement Learning for Real Life), 2024.
There is a growing interest in using reinforcement learning (RL) to personalize sequences of treatments in digital health to support users in adopting healthier behaviors. Such sequential decision-making problems involve decisions about when to treat and how to treat based on the user’s context (e.g., prior activity level, location, etc.). Online RL is a promising data-driven approach for this problem as it learns based on each user’s historical responses and uses that knowledge to personalize these decisions. However, to decide whether the RL algorithm should be included in an “optimized” intervention for real-world deployment, we must assess the data evidence indicating that the RL algorithm is actually personalizing the treatments to its users. Due to the stochasticity in the RL algorithm, one may get a false impression that it is learning in certain states and using this learning to provide specific treatments. We use a working definition of personalization and introduce a resampling-based methodology for investigating whether the personalization exhibited by the RL algorithm is an artifact of the RL algorithm stochasticity. We illustrate our methodology with a case study by analyzing the data from a physical activity clinical trial called HeartSteps, which included the use of an online RL algorithm. We demonstrate how our approach enhances data-driven truth-in-advertising of algorithm personalization both across all users as well as within specific users in the study.
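A heavily simplified stand-in for the resampling idea is sketched below; the statistic, null construction, and toy data are illustrative assumptions, not the paper's procedure. The point is to re-draw actions from the algorithm's logged randomization probabilities and see how large an apparent "personalization" effect arises from the algorithm's stochasticity alone.

```python
# Heavily simplified resampling illustration (statistic, null, and data are
# toy assumptions, not the paper's methodology).
import numpy as np

rng = np.random.default_rng(1)

def personalization_stat(actions, states):
    """Difference in treatment rates between two user states."""
    a, s = np.asarray(actions), np.asarray(states)
    return a[s == 1].mean() - a[s == 0].mean()

def resample_null(action_probs, states, n_resamples=2000):
    """Re-draw actions from the logged action-selection probabilities to see how
    large the statistic gets from the algorithm's randomization alone."""
    draws = rng.binomial(1, action_probs, size=(n_resamples, len(action_probs)))
    return np.array([personalization_stat(d, states) for d in draws])

probs = np.array([0.5, 0.6, 0.4, 0.7, 0.5, 0.3, 0.6, 0.5])  # logged probabilities
actions = np.array([1, 1, 0, 1, 0, 0, 1, 1])                # realized actions
states = np.array([1, 1, 0, 1, 0, 0, 1, 0])                 # binary user states

observed = personalization_stat(actions, states)
null = resample_null(probs, states)
print(observed, np.mean(np.abs(null) >= abs(observed)))     # statistic, p-value
```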
- [Clinical Trials] A mobile health intervention for emerging adults with regular cannabis use: A micro-randomized pilot trial design protocol. Lara N. Coughlin, Maya Campbell, Tiffany Wheeler, Chavez Rodriguez, Autumn Florimbio, Susobhan Ghosh, Yongyi Guo, Pei-Yao Hung, Kelly W. Zhang, Lauren Zimmerman, and 4 more authors. Preprint, 2024.
- [Clinical Trials] Optimizing an adaptive digital oral health intervention for promoting oral self-care behaviors: Micro-randomized trial protocol. Inbal Nahum-Shani, Zara M. Greer, Anna L. Trella, Kelly W. Zhang, Stephanie M. Carpenter, Dennis Ruenger, David Elashoff, Susan A. Murphy, and Vivek Shetty. Contemporary Clinical Trials, 2024.
Dental disease continues to be one of the most prevalent chronic diseases in the United States. Although oral self-care behaviors (OSCB), involving systematic twice-a-day tooth brushing, can prevent dental disease, this basic behavior is not sufficiently practiced. Recent advances in digital technology offer tremendous potential for promoting OSCB by delivering Just-In-Time Adaptive Interventions (JITAIs): interventions that leverage dynamic information about the person’s state and context to effectively prompt them to engage in a desired behavior in real-time, real-world settings. However, limited research attention has been given to systematically investigating how to best prompt individuals to engage in OSCB in daily life, and under what conditions prompting would be most beneficial. This paper describes the protocol for a Micro-Randomized Trial (MRT) to inform the development of a JITAI for promoting ideal OSCB, namely, brushing twice daily, for two minutes each time, in all four dental quadrants (i.e., 2x2x4). Sensors within an electric toothbrush (eBrush) will be used to track OSCB and a matching mobile app (Oralytics) will deliver on-demand feedback and educational information. The MRT will micro-randomize participants twice daily (morning and evening) to either (a) a prompt (push notification) containing one of several theoretically grounded engagement strategies or (b) no prompt. The goal is to investigate whether, what type of, and under what conditions prompting increases engagement in ideal OSCB. The results will build the empirical foundation necessary to develop an optimized JITAI that will be evaluated relative to a suitable control in a future randomized controlled trial.
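For illustration only, a toy sketch of one micro-randomization as described above; the randomization probability and strategy labels are hypothetical, not the trial's scheme.

```python
# Toy micro-randomization for one decision time; probability and strategy
# labels are hypothetical, not the trial's randomization scheme.
import random

rng = random.Random(0)
STRATEGIES = ["reciprocity", "curiosity", "framing"]  # hypothetical labels

def micro_randomize(p_prompt=0.5):
    """Randomize one decision time to a prompt (with an engagement strategy)
    or to no prompt."""
    if rng.random() < p_prompt:
        return {"prompt": True, "strategy": rng.choice(STRATEGIES)}
    return {"prompt": False, "strategy": None}

print([micro_randomize() for _ in ("morning", "evening")])  # two decision times per day
```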
2023
- [Statistical Inference] Statistical Inference for Adaptive Experimentation. Kelly W. Zhang. Thesis, 2023.
Online reinforcement learning (RL) algorithms are a very promising tool for personalizing decision-making for digital interventions, e.g., in mobile health, online education, and public policy. Online RL algorithms are increasingly being used in these applications since they are able to use previously collected data to continually learn and improve future decision-making. After deploying online RL algorithms though, it is critical to be able to answer scientific questions like: Did one type of teaching strategy lead to better student outcomes? In which contexts is a digital health intervention effective? The answers to these questions inform decisions about whether to roll out or how to improve a given intervention. Constructing confidence intervals for treatment effects using normal approximations is a natural approach to address these questions. However, classical statistical inference approaches for i.i.d. data fail to provide valid confidence intervals on data collected with online RL algorithms. Since online RL algorithms use previously collected data to inform future treatment decisions, they induce dependence in the collected data. This induced dependence can cause standard statistical inference approaches for i.i.d. data to be invalid on this data type. This thesis provides an understanding of the reasons behind the failure of classical methods in these settings. Moreover, we introduce a variety of alternative statistical inference approaches that are applicable to data collected by online RL algorithms.
- [Bandit/RL Algorithms] Reward design for an online reinforcement learning algorithm supporting oral self-care. Anna L. Trella, Kelly W. Zhang, Inbal Nahum-Shani, Vivek Shetty, Finale Doshi-Velez, and Susan A. Murphy. AAAI Conference on Artificial Intelligence, 2023.
Dental disease is one of the most common chronic diseases despite being largely preventable. However, professional advice on optimal oral hygiene practices is often forgotten or abandoned by patients. Therefore, patients may benefit from timely and personalized encouragement to engage in oral self-care behaviors. In this paper, we develop an online reinforcement learning (RL) algorithm for use in optimizing the delivery of mobile-based prompts to encourage oral hygiene behaviors. One of the main challenges in developing such an algorithm is ensuring that the algorithm considers the impact of the current action on the effectiveness of future actions (i.e., delayed effects), especially when the algorithm has been made simple in order to run stably and autonomously in a constrained, real-world setting (i.e., highly noisy, sparse data). We address this challenge by designing a quality reward which maximizes the desired health outcome (i.e., high-quality brushing) while minimizing user burden. We also highlight a procedure for optimizing the hyperparameters of the reward by building a simulation environment test bed and evaluating candidates using the test bed. The RL algorithm discussed in this paper will be deployed in Oralytics, an oral self-care app that provides behavioral strategies to boost patient engagement in oral hygiene practices.
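A minimal, hypothetical sketch of a quality reward that trades off the health outcome against user burden is given below; the functional form and penalty weight are illustrative, not the tuned values from the paper.

```python
# Hypothetical quality reward: the functional form and penalty weight are
# illustrative, not the values tuned via the simulation test bed in the paper.
def quality_reward(brushing_quality, prompt_sent, recent_prompt_count,
                   burden_weight=0.5):
    """Reward = observed brushing quality minus a penalty for prompting a user
    who has already received many recent prompts (a proxy for burden)."""
    return brushing_quality - burden_weight * prompt_sent * recent_prompt_count

print(quality_reward(brushing_quality=120, prompt_sent=1, recent_prompt_count=3))
```

Hyperparameters playing the role of burden_weight here are the kind of quantity the abstract describes tuning by evaluating candidates in a simulation environment test bed.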
- [Statistical Inference] Statistical Inference after Adaptive Sampling for Longitudinal Data. Kelly W. Zhang, Lucas Janson, and Susan A. Murphy. Preprint, 2023.
Online reinforcement learning and other adaptive sampling algorithms are increasingly used in digital intervention experiments to optimize treatment delivery for users over time. In this work, we focus on longitudinal user data collected by a large class of adaptive sampling algorithms that are designed to optimize treatment decisions online using accruing data from multiple users. Combining or "pooling" data across users allows adaptive sampling algorithms to potentially learn faster. However, by pooling, these algorithms induce dependence between the sampled user data trajectories; we show that this can cause standard variance estimators for i.i.d. data to underestimate the true variance of common estimators on this data type. We develop novel methods to perform a variety of statistical analyses on such adaptively sampled data via Z-estimation. Specifically, we introduce the adaptive sandwich variance estimator, a corrected sandwich estimator that leads to consistent variance estimates under adaptive sampling. Additionally, to prove our results we develop novel theoretical tools for empirical processes on non-i.i.d., adaptively sampled longitudinal data which may be of independent interest. This work is motivated by our efforts in designing experiments in which online reinforcement learning algorithms optimize treatment decisions, yet statistical inference is essential for conducting analyses after experiments conclude.
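For context, the sketch below computes the classical (i.i.d.) sandwich variance estimator for a least-squares Z-estimator; the paper's adaptive sandwich estimator corrects this construction for the dependence induced by pooling, and that correction is not reproduced here.

```python
# Classical i.i.d. sandwich variance estimator for the least-squares
# Z-estimator, shown for reference; the paper's adaptive correction is not shown.
import numpy as np

def sandwich_variance(X, y, theta_hat):
    """bread^{-1} @ meat @ bread^{-T} / n for the estimating function
    psi_i(theta) = x_i * (y_i - x_i' theta)."""
    n = X.shape[0]
    resid = y - X @ theta_hat
    bread = X.T @ X / n                       # minus the derivative of the mean estimating function
    meat = (X * resid[:, None]).T @ (X * resid[:, None]) / n
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv.T / n

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=200)
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(sandwich_variance(X, y, theta_hat))
```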
2022
- [Bandit/RL Algorithms] Designing Reinforcement Learning Algorithms for Digital Interventions: Pre-Implementation Guidelines. Anna L. Trella, Kelly W. Zhang, Inbal Nahum-Shani, Vivek Shetty, Finale Doshi-Velez, and Susan A. Murphy. Algorithms (Special Issue: Algorithms in Decision Support Systems); preliminary version at RLDM 2022 (Multi-disciplinary Conference on RL and Decision Making), selected for an oral presentation, 2022.
Online reinforcement learning (RL) algorithms are increasingly used to personalize digital interventions in the fields of mobile health and online education. Common challenges in designing and testing an RL algorithm in these settings include ensuring the RL algorithm can learn and run stably under real-time constraints, and accounting for the complexity of the environment, e.g., a lack of accurate mechanistic models for the user dynamics. To guide how one can tackle these challenges, we extend the PCS (predictability, computability, stability) framework, a data science framework that incorporates best practices from machine learning and statistics in supervised learning, to the design of RL algorithms for the digital interventions setting. Furthermore, we provide guidelines on how to design simulation environments, a crucial tool for evaluating RL candidate algorithms using the PCS framework. We show how we used the PCS framework to design an RL algorithm for Oralytics, a mobile health study aiming to improve users’ tooth-brushing behaviors through the personalized delivery of intervention messages. Oralytics will go into the field in late 2022.
2021
- [Statistical Inference] Statistical Inference with M-Estimators on Adaptively Collected Data. Kelly W. Zhang, Lucas Janson, and Susan Murphy. Advances in Neural Information Processing Systems (NeurIPS), 2021.
Bandit algorithms are increasingly used in real-world sequential decision-making problems. Associated with this is an increased desire to be able to use the resulting datasets to answer scientific questions like: Did one type of ad lead to more purchases? In which contexts is a mobile health intervention effective? However, classical statistical approaches fail to provide valid confidence intervals when used with data collected with bandit algorithms. Alternative methods have recently been developed for simple models (e.g., comparison of means). Yet there is a lack of general methods for conducting statistical inference using more complex models on data collected with (contextual) bandit algorithms; for example, current methods cannot be used for valid inference on parameters in a logistic regression model for a binary reward. In this work, we develop theory justifying the use of M-estimators – which includes estimators based on empirical risk minimization as well as maximum likelihood – on data collected with adaptive algorithms, including (contextual) bandit algorithms. Specifically, we show that M-estimators, modified with particular adaptive weights, can be used to construct asymptotically valid confidence regions for a variety of inferential targets.
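A minimal sketch of a weighted M-estimation criterion on adaptively collected data follows; the square-root stabilized weights shown are one form used in this line of work, and the model, data, and weight choice here are illustrative rather than the paper's exact construction (which also specifies how to form confidence regions).

```python
# Sketch of a weighted M-estimation criterion on bandit data. The square-root
# stabilized weights are one form used in this line of work; the model, data,
# and weights here are illustrative, not the paper's exact construction.
import numpy as np
from scipy.optimize import minimize

def weighted_logistic_loss(theta, X, A, y, pi_logged, pi_stable=0.5):
    """Weighted negative log-likelihood for a logistic reward model, with each
    time point weighted by sqrt(pi_stable / pi_logged(A_t))."""
    w = np.sqrt(pi_stable / pi_logged)
    logits = X @ theta[:-1] + A * theta[-1]
    nll = np.logaddexp(0.0, logits) - y * logits   # log(1 + e^logit) - y * logit
    return np.mean(w * nll)

rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
pi_logged = rng.uniform(0.2, 0.8, size=n)          # bandit's logged action probabilities
A = rng.binomial(1, pi_logged)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.2 * X[:, 1] + 0.8 * A))))
print(minimize(weighted_logistic_loss, np.zeros(3), args=(X, A, y, pi_logged)).x)
```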
2020
- [Statistical Inference] Inference for Batched Bandits. Kelly W. Zhang, Lucas Janson, and Susan Murphy. Advances in Neural Information Processing Systems (NeurIPS), 2020.
As bandit algorithms are increasingly utilized in scientific studies and industrial applications, there is an associated increasing need for reliable inference methods based on the resulting adaptively-collected data. In this work, we develop methods for inference on data collected in batches using a bandit algorithm. We first prove that the ordinary least squares estimator (OLS), which is asymptotically normal on independently sampled data, is not asymptotically normal on data collected using standard bandit algorithms when there is no unique optimal arm. This asymptotic non-normality result implies that the naive assumption that the OLS estimator is approximately normal can lead to Type-1 error inflation and confidence intervals with below-nominal coverage probabilities. Second, we introduce the Batched OLS estimator (BOLS) that we prove is (1) asymptotically normal on data collected from both multi-arm and contextual bandits and (2) robust to non-stationarity in the baseline reward.
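Below is a simplified two-arm illustration of the batched idea: estimate the arm difference separately within each batch, studentize each batch estimate, and combine across batches. The exact BOLS statistic and its guarantees are in the paper; the simulated bandit-like batches here are purely illustrative.

```python
# Simplified two-arm illustration of the batched idea: per-batch studentized
# arm differences are combined across batches. See the paper for the exact
# BOLS statistic and its asymptotic normality guarantees.
import numpy as np

def batched_arm_difference(batches):
    """batches: list of (actions, rewards) pairs, one per batch. Returns the
    combined standardized statistic across batches."""
    zs = []
    for a, r in batches:
        r1, r0 = r[a == 1], r[a == 0]
        diff = r1.mean() - r0.mean()
        se = np.sqrt(r1.var(ddof=1) / len(r1) + r0.var(ddof=1) / len(r0))
        zs.append(diff / se)                  # per-batch studentized estimate
    return np.sum(zs) / np.sqrt(len(zs))      # combined across batches

rng = np.random.default_rng(4)
batches, p = [], 0.5
for _ in range(5):                            # 5 batches with bandit-like updates
    a = rng.binomial(1, p, size=100)
    r = rng.normal(0.3 * a, 1.0)
    batches.append((a, r))
    p = float(np.clip(0.5 + 0.4 * np.sign(r[a == 1].mean() - r[a == 0].mean()), 0.2, 0.8))

print(batched_arm_difference(batches))
```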
- [Bandit/RL Algorithms] A Bayesian approach to learning bandit structure in Markov decision processes. Kelly W. Zhang, Omer Gottesman, and Finale Doshi-Velez. Challenges of Real-World Reinforcement Learning 2020 (NeurIPS Workshop), 2020.
In the reinforcement learning literature, there are many algorithms developed for either Contextual Bandit (CB) or Markov Decision Processes (MDP) environments. However, when deploying reinforcement learning algorithms in the real world, even with domain expertise, it is often difficult to know whether it is appropriate to treat a sequential decision making problem as a CB or an MDP. In other words, do actions affect future states, or only the immediate rewards? Making the wrong assumption regarding the nature of the environment can lead to inefficient learning, or even prevent the algorithm from ever learning an optimal policy, even with infinite data. In this work, we develop an online algorithm that uses a Bayesian hypothesis testing approach to learn the nature of the environment. Our algorithm allows practitioners to incorporate prior knowledge about whether the environment is that of a CB or an MDP, and effectively interpolate between classical CB and MDP-based algorithms to mitigate the effects of misspecifying the environment. We perform simulations and demonstrate that in CB settings our algorithm achieves lower regret than MDP-based algorithms, while in non-bandit MDP settings our algorithm is able to learn the optimal policy, often achieving comparable regret to MDP-based algorithms.
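A minimal, hypothetical sketch of the kind of Bayesian comparison described: weigh the evidence that next states depend on actions (MDP-like) against the evidence that they do not (bandit-like) using Dirichlet-multinomial marginal likelihoods over discrete transition counts. The details below are illustrative, not the paper's algorithm.

```python
# Hypothetical Bayesian comparison of "bandit" vs "MDP" structure via
# Dirichlet-multinomial marginal likelihoods over discrete transition counts;
# details are illustrative, not the paper's algorithm.
import numpy as np
from scipy.special import gammaln

def log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of categorical observations with the given
    counts under a symmetric Dirichlet(alpha) prior."""
    counts = np.asarray(counts, dtype=float)
    k = counts.size
    return (gammaln(k * alpha) - gammaln(k * alpha + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

def log_bayes_factor(trans_counts):
    """trans_counts[a, s, s2] = number of observed s -> s2 transitions under
    action a. MDP model: one transition distribution per (a, s).
    Bandit model: one per s, shared across actions."""
    n_actions, n_states, _ = trans_counts.shape
    mdp = sum(log_marginal(trans_counts[a, s])
              for a in range(n_actions) for s in range(n_states))
    bandit = sum(log_marginal(trans_counts[:, s].sum(axis=0))
                 for s in range(n_states))
    return mdp - bandit                        # > 0 favors the MDP hypothesis

counts = np.array([[[30, 5], [4, 20]],         # transitions under action 0
                   [[6, 29], [21, 3]]])        # transitions under action 1
print(log_bayes_factor(counts))
```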
2018
- [NLP] Language modeling teaches you more syntax than translation does: Lessons learned through auxiliary task analysis. Kelly W. Zhang and Samuel R. Bowman. BlackboxNLP 2018 (EMNLP Workshop), 2018.
Recent work using auxiliary prediction task classifiers to investigate the properties of LSTM representations has begun to shed light on why pretrained representations, like ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017), are so beneficial for neural language understanding models. However, we do not yet have a clear understanding of how the choice of pretraining objective affects the type of linguistic information that models learn. With this in mind, we compare four objectives—language modeling, translation, skip-thought, and autoencoding—on their ability to induce syntactic and part-of-speech information. We make a fair comparison between the tasks by holding constant the quantity and genre of the training data, as well as the LSTM architecture. We find that representations from language models consistently perform best on our syntactic auxiliary prediction tasks, even when trained on relatively small amounts of data. These results suggest that language modeling may be the best data-rich pretraining task for transfer learning applications requiring syntactic information. We also find that the representations from randomly-initialized, frozen LSTMs perform strikingly well on our syntactic auxiliary tasks, but this effect disappears when the amount of training data for the auxiliary tasks is reduced.
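A minimal sketch of the auxiliary-task (probing) setup: representations from a frozen encoder are fed to a lightweight classifier trained to predict a syntactic label. The random frozen "encoder" and the toy data below are only stand-ins for the pretrained LSTM representations and annotated corpora studied in the paper.

```python
# Sketch of the probing setup with a frozen, randomly initialized encoder as a
# stand-in for pretrained LSTM representations; data and labels are toy.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)

vocab = {"the": 0, "dog": 1, "cat": 2, "runs": 3, "sleeps": 4}
pos = {"the": 0, "dog": 1, "cat": 1, "runs": 2, "sleeps": 2}   # DET / NOUN / VERB

embed = rng.normal(size=(len(vocab), 32))   # frozen embeddings
W = rng.normal(size=(32, 32))               # frozen "encoder" layer

def represent(word):
    """Frozen token representation; no parameters here are ever updated."""
    return np.tanh(embed[vocab[word]] @ W)

tokens = ["the", "dog", "runs", "the", "cat", "sleeps"] * 20
X = np.stack([represent(w) for w in tokens])
y = np.array([pos[w] for w in tokens])

probe = LogisticRegression(max_iter=1000).fit(X, y)   # only the probe is trained
print(probe.score(X, y))
```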
- [NLP] Adversarially Regularized Autoencoders. Junbo (Jake) Zhao*, Yoon Kim*, Kelly Zhang, Alexander Rush, and Yann LeCun. International Conference on Machine Learning (ICML), 2018.
Deep latent variable models, trained using variational autoencoders or generative adversarial networks, are now a key technique for representation learning of continuous structures. However, applying similar methods to discrete structures, such as text sequences or discretized images, has proven to be more challenging. In this work, we propose a more flexible method for training deep latent variable models of discrete structures. Our approach is based on the recently proposed Wasserstein Autoencoder (WAE) which formalizes adversarial autoencoders as an optimal transport problem. We first extend this framework to model discrete sequences, and then further explore different learned priors targeting a controllable representation. Unlike many other latent variable generative models for text, this adversarially regularized autoencoder (ARAE) allows us to generate fluent textual outputs as well as perform manipulations in the latent space to induce change in the output space. Finally, we show that the latent representation can be trained to perform unaligned textual style transfer, giving improvements both in automatic measures and human evaluation.