Review of 'Q- and A-learning Methods for Estimating Optimal Dynamic Treatment Regimes'

Phillip J. Schulte, Anastasios A. Tsiatis, Eric B. Laber, and Marie Davidian

By Joowon Lee in Causal Inference

October 24, 2022


Overview

In this project, we review the paper Q- and A-learning Methods for Estimating Optimal Dynamic Treatment Regimes by Phillip J. Schulte et al. (2014) [paper], focusing on Q-learning methods.

In clinical practice, clinicians make a series of decisions based on an individual patient’s baseline and evolving characteristics. The goal of the paper is to make optimal sequential treatment decisions that achieve the ‘best’ outcome for an individual patient. Such a list of sequential decision rules is formalized as a dynamic treatment regime. Throughout this post, we investigate two main methods, Q-learning and A-learning, for estimating the optimal dynamic treatment regime from data from a clinical trial or observational study.

Let us first introduce the following notations:

  • \(k\) : stages or pre-specified ordered decision points \(k = 1, \cdots, K\),
  • \(\Omega\) : a superpopulation of patients with a patient \(\omega \in \Omega\),
  • \(Y\) : a final outcome of interest (assuming that large values are preferred),
  • \(X_{k}\) : covariate information immediately prior to decision \(k\) with value \(x_{k} \in \mathcal{X}_{k}\),
  • \(A_{k}\) : treatment at stage \(k\) with value \(a_{k} \in \mathcal{A}_{k}\) (a finite set of possible options),
  • \(\overline{a_{k}} = (a_{1}, \cdots, a_{k})\) : a possible treatment history through decision \(k\) taking values in \(\overline{\mathcal{A}_{k}} = \mathcal{A}_{1} \times \cdots \times \mathcal{A}_{k}\).

Then we define the potential outcomes \[\begin{align*} W^{*} &= \left\{ X^{*}_{2}(a_{1}), X^{*}_{3}\big(\overline{a_{2}}\big), \cdots, X^{*}_{K}\big(\overline{a_{K-1}}\big), Y^{*}\big(\overline{a_{K}}\big) \,\, \text{ for all } \, \overline{a_{K}} \in \overline{\mathcal{A}_{K}} \right\} \end{align*}\]

Here, \(X^{*}_{k}\big(\overline{a_{k-1}}\big)(\omega)\) denotes the value of covariate information that would arise between decisions \(k-1\) and \(k\) for a patient \(\omega \in \Omega\) who has received treatment history \(\overline{a_{k-1}}\), taking value \(x_{k}\) in a set \(\mathcal{X}_{k}\). Similarly, \(Y^{*}\big(\overline{a_{K}}\big)\) is the hypothetical outcome that would result for a patient \(\omega\) who has received the full set of \(K\) treatments in \(\overline{a_{K}}\). For simplicity, we write \(\overline{X}^{*}_{k}\big(\overline{a_{k-1}}\big)=\{X_{1}, X^{*}_{2}(a_{1}), \cdots, X^{*}_{k}\big(\overline{a_{k-1}}\big)\}, k = 1, \cdots, K\).


Dynamic Treatment Regime

A dynamic treatment regime \(\mathbf{d} = (d_{1}, \cdots, d_{K})\) is a set of rules that forms an algorithm for treating an individual patient over time (it is called ‘dynamic’ because treatment is determined based on a patient’s previous history). For instance, a rule might take the form “give the therapy if the current CD4 count is below 250,” as in the simulation study later. At the \(k\)th stage,

  • the \(k\)th rule \(d_{k}\) takes
    • input : \(\overline{x_{k}}, \overline{a_{k-1}}\) (the patient’s realized covariate history through decision \(k\) and treatment history through decision \(k-1\)),
    • output : \(a_{k} \in \Psi_{k}(\overline{x_{k}}, \overline{a_{k-1}}) \subset \mathcal{A}_{k}\) where \(\Psi_{k}(\overline{x_{k}}, \overline{a_{k-1}})\) is a pre-specified set of possible treatment options for a patient with (\(\overline{x_{k}}, \overline{a_{k-1}}\)).

Since \(d_{k}\) need only map a subset \(\Gamma_{k}\) of \(\overline{\mathcal{X}_{k}} \times \overline{\mathcal{A}_{k-1}}\) (the histories that can actually arise) to \(\mathcal{A}_{k}\), let us define this subset recursively as

\[\begin{align*} \Gamma_{k} &= \{ (\overline{x_{k}}, \overline{a_{k-1}}) \in \overline{\mathcal{X}_{k}} \times \overline{\mathcal{A}_{k-1}} \,\, \text{ satisfying } \\ & \quad \quad \text{(i) } \, a_{j} \in \Psi_{j}\big( \overline{x_{j}}, \overline{a_{j-1}} \big), \,\, j = 1, \cdots, k-1, \\ & \quad \quad \text{(ii) } \, \mathbb{P}(\overline{X}^{*}_{k}\big(\overline{a_{k-1}} \big)=\overline{x_{k}})>0\}, \end{align*}\]

\(k=1, \cdots, K\), determined by \(\Psi=(\Psi_{1}, \cdots, \Psi_{K})\). Define the class \(\mathcal{D}\) of dynamic treatment regimes to be the set of all \(\mathbf{d}=(d_{1}, \cdots, d_{K})\) where \(d_{k}: \Gamma_{k} \rightarrow \mathcal{A}_{k}, k=1, \cdots, K\).

Then, for any \(\mathbf{d} \in \mathcal{D}\), define the potential outcomes associated with \(\mathbf{d}\) as \[\begin{align*} \left\{ X^{*}_{2}(d_{1}), \cdots, X^{*}_{k}\big(\overline{d_{k-1}}\big), \cdots, X^{*}_{K}\big(\overline{d_{K-1}}\big), Y^{*}\big(\mathbf{d}\big) \,\, \right\} \end{align*}\]

Then how can we find an optimal regime (the ‘best’ treatment for a patient) with these definitions?

The expected outcome in the population if all patients who have a baseline covariate \(X_{1} = x_{1}\) were to follow regime \(\mathbf{d}\) is \(\mathbb{E}[Y^{*}\big(\mathbf{d}\big)\ | X_{1} = x_{1}]\). Then an optimal regime, \(\mathbf{d}^{opt} \in \mathcal{D}\), should satisfy

\[\begin{align*} & \mathbb{E}[Y^{*}\big(\mathbf{d}\big)\ | X_{1} = x_{1}] \quad \le \quad \mathbb{E}[Y^{*}\big(\mathbf{d}^{opt}\big)\ | X_{1} = x_{1}], \,\, \text{ for all } \mathbf{d} \in \mathcal{D} \text{ and all } x_{1} \in \mathcal{X}_{1} \\ \Rightarrow \,\, & \mathbb{E}[Y^{*}\big(\mathbf{d}\big)] \qquad \qquad \quad \, \le \quad \mathbb{E}[Y^{*}\big(\mathbf{d}^{opt}\big)], \qquad \quad \quad \,\, \text{ for all } \mathbf{d} \in \mathcal{D} \end{align*}\]

The first inequality implies the second: since it holds for every fixed \(x_{1} \in \mathcal{X}_{1}\), taking the expectation over the distribution of \(X_{1}\) on both sides preserves it.


Model Setup and Assumptions

The problem is that the potential outcomes of an individual patient under all regimes \(\mathbf{d} \in \mathcal{D}\) cannot be observed. Therefore, the goal is to estimate \(\mathbf{d}^{opt}\) based on the observed data.

Consider a study with a random sample of \(n\) patients from \(\Omega\). Assume independent and identically distributed time-ordered random variables \((X_{1i}, A_{1i}, \cdots, X_{Ki}, A_{Ki}, Y_{i}), i = 1, \cdots, n\) on \(\Omega\). We use the potential outcomes framework to estimate an optimal regime from the observed data under the following assumptions:

  • Consistency assumption : \(X_{k} = X^{*}_{k}\big(\overline{A_{k-1}}\big), k=2, \cdots, K\) and \(Y=Y^{*}\big(\overline{A_{K}}\big)\),
  • Stable unit treatment value assumption (SUTVA, well-defined treatments) : a patient’s covariates and outcome are not affected by the treatments assigned to other patients,
  • Sequential ignorability : \(A_{k} \perp W^{*} \, | \, \{\overline{X_{k}}, \overline{A_{k-1}} \}, k=1, \cdots, K\).

Identification

In this section, we are interested in identifying the expected potential outcome \(\mathbb{E}[Y^{*}(\overline{a_{K}})]\), that is, expressing this causal quantity as a statistical estimand, a functional of the observed-data distribution.

We first consider a fixed treatment \((a_{1}, a_{2}) \in \{0, 1\}^{2}\) with two time points \((K = 2)\). We have time-ordered variables \((X_{1}, A_{1}, X_{2}, A_{2}, Y)\), whose dependencies can be described by the following directed acyclic graph (DAG):

Figure 1: Directed acyclic graph for \(K = 2\)

The lemma below shows that the expected potential outcome \(\mathbb{E}[Y^{*}(\overline{a_{2}})]=\mathbb{E}[Y^{*}(a_{1}, a_{2})]\) of treatment \((a_{1}, a_{2})\) can be identified as a statistical estimand, a function of conditional expectations.


Lemma 1 Fix a treatment \((a_{1},a_{2})\in \{0,1\}^{2}\). Under the assumptions of consistency, SUTVA, and sequential ignorability mentioned in the section Model setup and assumptions, we have \[\begin{align*} &\mathbb{E}[Y^{*}(a_1, a_2)] \\ & \quad = \mathbb{E}_{X_{1}} \bigg[ \mathbb{E}_{X_{2}} \bigg[ \mathbb{E}_{Y}[Y | X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = a_{2}] \bigg| X_{1}, A_{1} = a_{1} \bigg] \bigg] \\ & \quad = \int_{x_{1}} \int_{x_{2}} \mathbb{E}[Y | X_{1} = x_{1}, A_{1} = a_{1}, X_{2} = x_{2}, A_{2} = a_{2}] \,\, \mathbb{P}\left(X_{2}=x_{2}|X_{1}=x_{1}, A_{1} = a_{1} \right) \mathbb{P}\left(X_{1}=x_{1}\right)dx_{2}dx_{1}. \end{align*}\]

Proof. \({}\) To simplify the subscripts below, write \(X^{*}_{2} := X^{*}_{2}(a_{1})\) and \(Y^{*} := Y^{*}(a_1, a_2)\). Then \[\begin{align*} &\mathbb{E}[Y^{*}(a_1, a_2)] \\ &\quad =\mathbb{E}_{X_{1}} \bigg[ \mathbb{E}_{X^{*}_{2},Y^{*}}[Y^{*}(a_1, a_2) | X_{1}]\bigg] \\ &\quad \overset{(a)}{=}\mathbb{E}_{X_{1}} \bigg[ \mathbb{E}_{X^{*}_{2},Y^{*}}[Y^{*}(a_1, a_2) | X_{1}, A_{1} = a_{1}]\bigg] \\ &\quad \overset{(b)}{=}\mathbb{E}_{X_{1}} \bigg[ \mathbb{E}_{X^{*}_{2}} \bigg[ \mathbb{E}_{Y^{*}}[Y^{*}(a_1, a_2) | X_{1}, A_{1} = a_{1}, X^{*}_{2}(a_{1})] \bigg| X_{1}, A_{1} = a_{1}\bigg] \bigg] \\ &\quad \overset{(c)}{=}\mathbb{E}_{X_{1}} \bigg[ \mathbb{E}_{X^{*}_{2}} \bigg[ \mathbb{E}_{Y^{*}}[Y^{*}(a_1, a_2) | X_{1}, A_{1} = a_{1}, X^{*}_{2}(a_{1}), A_{2} = a_{2}] \bigg| X_{1}, A_{1} = a_{1}\bigg] \bigg] \\ &\quad \overset{(d)}{=}\mathbb{E}_{X_{1}} \bigg[ \mathbb{E}_{X_{2}} \bigg[ \mathbb{E}_{Y}[Y| X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = a_{2}] \bigg| X_{1}, A_{1} = a_{1}\bigg] \bigg] \\ &\quad = \int_{x_{1}} \int_{x_{2}} \mathbb{E}[Y | X_{1} = x_{1}, A_{1} = a_{1}, X_{2} = x_{2}, A_{2} = a_{2}] \mathbb{P}\left(X_{2} = x_{2} | X_{1} = x_{1}, A_{1} = a_{1}\right) \mathbb{P}(X_{1} = x_{1}) dx_{2} dx_{1}, \\ \end{align*}\] where (a) and (c) use sequential ignorability, (b) uses iterated expectation, and (d) uses consistency. \[\begin{align*} \hspace{15cm} \square \end{align*}\]
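To make Lemma 1 concrete, here is a minimal g-computation (plug-in) sketch in R for \(K = 2\): fit the inner conditional mean, then integrate it out stage by stage. The data frame df with columns X1, A1, X2, A2, Y and the linear working models are illustrative assumptions of ours, not part of the paper.

# Plug-in (g-computation) sketch of E[Y*(a1, a2)] under Lemma 1
g_computation <- function(df, a1, a2) {
  # Innermost conditional mean: E[Y | X1, A1, X2, A2], a linear working model
  fit_y <- lm(Y ~ X1 + A1 + X2 + A2, data = df)
  # Evaluate at the fixed treatments (a1, a2), keeping observed covariates
  m2 <- predict(fit_y, newdata = transform(df, A1 = a1, A2 = a2))
  # Integrate over X2 given (X1, A1 = a1) via a second regression
  fit_m <- lm(m2 ~ X1 + A1, data = transform(df, m2 = m2))
  m1 <- predict(fit_m, newdata = transform(df, A1 = a1))
  # Average over the marginal distribution of X1
  mean(m1)
}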

Now consider the general case with \(K\) time points. First, define the following conditional expectation, denoted \(\mu_{\overline{a_{K}}} (\overline{x_{K}})\) for brevity:

\[\begin{align*} \mu_{\overline{a_{K}}} (\overline{x_{K}}) := \mathbb{E}[Y | \overline{X_{K}} = \overline{x_{K}}, \overline{A_{K}}= \overline{a_{K}}]. \end{align*}\]

In the general case we have time-ordered variables \((X_{1}, A_{1}, \cdots, X_{K}, A_{K}, Y)\), with all dependencies between variables respecting the temporal order shown in Figure 1. We then have the following identification lemma for \(K\) time points.


Lemma 2 Fix a treatment \(\overline{a_{K}}=(a_{1},\cdots,a_{K})\in \{0,1\}^{K}\). Under the assumptions of consistency, SUTVA, and sequential ignorability, we have \[\begin{align*} &\mathbb{E}[Y^{*}(\overline{a_{K}})] \\ &\quad = \mathbb{E}_{X_{1}} \bigg[ \mathbb{E}_{X_{2}} \bigg[ \cdots \bigg[ \mathbb{E}_{X_{K}}\left[ \mathbb{E}\left[Y \bigg| \begin{matrix} \overline{X_{K}} \\ \overline{A_{K}} = \overline{a_{K}} \end{matrix}\right] \bigg| \begin{matrix} \overline{X_{K-1}} \\ \overline{A_{K-1}} = \overline{a_{K-1}} \end{matrix} \right] \bigg| \begin{matrix} \overline{X_{K-2}} \\ \overline{A_{K-2}} = \overline{a_{K-2}} \end{matrix} \bigg] \cdots \bigg| \begin{matrix} X_{1} \\ A_{1} = a_{1} \end{matrix}\bigg] \bigg] \end{align*}\]


Proof. \({}\)

The idea of the proof is to use induction on the number of unobserved treatments. Let us define the following quantities: \[\begin{align*} \mathcal{F}_{k} &: \sigma \text{-algebra generated by the observed history } (\overline{X_{k}}, \overline{A_{k}}) \text{ up to decision } k , \\ \overline{Z_{[k+1,K]}} &= (Z_{k+1}, Z_{k+2}, \cdots, Z_{K}). \end{align*}\] We prove the following stronger claim:

For \(k \in \{0, 1, \cdots, K \}\), \[\begin{align*} &\mathbb{E}\left[ Y^{*}(\overline{A_{k}}, \overline{a_{[k+1,K]}}) \bigg|\mathcal{F}_{k}\right] = \mathbb{E}_{X_{k+1}}\left[ \mathbb{E}_{X_{k+2}} \left[ \cdots \mathbb{E}\left[Y \bigg| \begin{matrix} \overline{X_{[k+1, K]}} \\ \overline{A_{[k+1, K]}} = \overline{a_{[k+1,K]}} \\ \mathcal{F}_{k} \end{matrix}\right] \cdots \bigg| \begin{matrix} \overline{X_{k+1}} \\ \overline{A_{k+1}} = \overline{a_{k+1}} \\ \mathcal{F}_{k} \end{matrix} \right] \bigg| \mathcal{F}_{k} \right] \end{align*}\] Then the statement follows by taking \(k=0\). We use induction on \(K-k\), the number of unobserved treatments.

First, let \(k = K\). Since we have all information up to \(K\) and there is no counterfactual outcome left (by consistency, \(Y^{*}(\overline{A_{K}}) = Y\)), \[\begin{align*} & \mathbb{E}[Y^{*}(A_{1}, \cdots, A_{K})|\mathcal{F}_{K}] = \mathbb{E}[Y|\overline{X_{K}}, \overline{A_{K}}, \mathcal{F}_{K}]. \\ \end{align*}\]

Next, assume the induction hypothesis holds for \(k\). We wish to show it holds for \(k-1\). \[\begin{align*} & \mathbb{E}[Y^{*}(A_{1}, \cdots, A_{k-1}, a_{k}, \cdots, a_{K}) | \mathcal{F}_{k-1}] \\ & \quad \overset{(a)}{=} \mathbb{E}_{X_{k}} \left[ \mathbb{E} \left[ Y^{*}(A_{1}, \cdots, A_{k-1}, a_{k}, \cdots, a_{K}) \bigg| X_{k}, \mathcal{F}_{k-1} \right] \bigg| \mathcal{F}_{k-1} \right] \\ & \quad \overset{(b)}{=} \mathbb{E}_{X_{k}} \left[ \mathbb{E} \left[ Y^{*}(A_{1}, \cdots, A_{k-1}, a_{k}, \cdots, a_{K}) \bigg| X_{k}, \overline{A_{k}} = \overline{a_{k}}, \mathcal{F}_{k-1} \right] \bigg| \mathcal{F}_{k-1} \right] \\ & \quad = \mathbb{E}_{X_{k}} \left[ \mathbb{E} \left[ Y^{*}(A_{1}, \cdots, A_{k-1}, A_{k}, a_{k+1}, \cdots, a_{K}) \bigg| X_{k}, \overline{A_{k}} = \overline{a_{k}}, \mathcal{F}_{k-1} \right] \bigg| \mathcal{F}_{k-1} \right] \\ & \quad \overset{(c)}{=} \mathbb{E}_{X_{k}} \left[ \mathbb{E} \left[ Y^{*}(A_{1}, \cdots, A_{k}, a_{k+1}, \cdots, a_{K}) \bigg| \mathcal{F}_{k} \right] \bigg| \mathcal{F}_{k-1} \right] \\ & \quad \overset{(d)}{=} \mathbb{E}_{X_{k}} \left[ \mathbb{E}_{X_{k+1}}\left[ \cdots \mathbb{E}\left[Y \bigg| \begin{matrix} \overline{X_{[k+1, K]}} \\ \overline{A_{[k+1, K]}} = \overline{a_{[k+1,K]}} \\ \mathcal{F}_{k} \end{matrix}\right] \cdots \bigg| \begin{matrix} \overline{X_{k+1}} \\ \overline{A_{k+1}} = \overline{a_{k+1}} \\ \mathcal{F}_{k} \end{matrix} \right] \bigg| \mathcal{F}_{k-1} \right] \end{align*}\] where (a) uses iterated expectation, (b) uses sequential ignorability, (c) uses the fact that \(\mathcal{F}_{k}\) is generated by \((\overline{X_{k}}, \overline{A_{k}})\), so that conditioning on \(\{X_{k}, \overline{A_{k}} = \overline{a_{k}}, \mathcal{F}_{k-1}\}\) is equivalent to conditioning on \(\mathcal{F}_{k}\) on the event \(\overline{A_{k}} = \overline{a_{k}}\), and (d) is the assumed induction hypothesis. This establishes the claim for \(k-1\), completing the induction. \[\begin{align*} \hspace{12.5cm} \square \end{align*}\]


Optimal Treatment Regimes

Q- and A-learning are two approaches to estimating \(\mathbf{d}^{opt}\) satisfying \(\mathbb{E}[Y^{*}\big(\mathbf{d}\big)] \le \mathbb{E}[Y^{*}\big(\mathbf{d}^{opt}\big)]\) for all \(\mathbf{d} \in \mathcal{D}\); both involve recursive algorithms. In this post, we focus on the Q-learning method.

In this section, we formulate \(\mathbf{d}^{opt}\) in terms of the potential outcomes and then show how \(\mathbf{d}^{opt}\) may be expressed in terms of the observed data under the assumptions described in the section Model Setup and Assumptions.

At the \(K\)th decision point, for any \((\overline{x_{K}}, \overline{a_{K-1}}) \in \Gamma_{K}\), define \[\begin{align*} d^{opt}_{K}(\overline{x_{K}}, \overline{a_{K-1}}) &= \argmax_{a_{K} \in \Psi_{K}(\overline{x_{K}}, \overline{a_{K-1}})} \mathbb{E}[Y^{*}(\overline{a_{K-1}}, a_{K}) | \overline{X}_{K}^{*}(\overline{a_{K-1}}) = \overline{x_{K}}], \\ V_{K}^{}(\overline{x_{K}}, \overline{a_{K-1}}) &= \max_{a_{K} \in \Psi_{K}(\overline{x_{K}}, \overline{a_{K-1}})} \mathbb{E}[Y^{*}(\overline{a_{K-1}}, a_{K}) | \overline{X}_{K}^{*}(\overline{a_{K-1}}) = \overline{x_{K}}]. \end{align*}\]

Here, \(d^{opt}_{K}\) is the treatment that maximizes the expected potential final outcome given the prior potential information, and \(V_{K}\) is the achieved maximum.

For \(k = K-1, \cdots, 1\), and any \((\overline{x_{k}}, \overline{a_{k-1}}) \in \Gamma_{k}\), let \[\begin{align*} d^{opt}_{k}(\overline{x_{k}}, \overline{a_{k-1}}) &= \argmax_{a_{k} \in \Psi_{k}(\overline{x_{k}}, \overline{a_{k-1}})} \mathbb{E} \left[ V_{k+1}^{} \left(\underbrace{\overline{x_{k}}, X^{*}_{k+1}(\overline{a_{k-1}},a_{k})}_{=\overline{X}_{k+1}^{*} (\overline{a_{k}})}, \underbrace{\overline{a_{k-1}}, a_{k}}_{=\overline{a_{k}}} \right) \bigg| \overline{X}_{k}^{*}(\overline{a_{k-1}}) = \overline{x_{k}} \right], \\ V_{k}^{}(\overline{x_{k}}, \overline{a_{k-1}}) &= \max_{a_{k} \in \Psi_{k}(\overline{x_{k}}, \overline{a_{k-1}})} \mathbb{E} \left[ V_{k+1}^{} \left(\underbrace{\overline{x_{k}}, X^{*}_{k+1}(\overline{a_{k-1}},a_{k})}_{ = \overline{X}_{k+1}^{*} (\overline{a_{k}})}, \underbrace{\overline{a_{k-1}}, a_{k}}_{=\overline{a_{k}}} \right) \bigg| \overline{X}_{k}^{*}(\overline{a_{k-1}}) = \overline{x_{k}} \right]. \end{align*}\]

Similarly, \(d^{opt}_{k}\) is the \(k\)th treatment that maximizes the expected outcome that would be achieved if the subsequent optimal rules already defined were followed from decision \(k+1\) onward. This shows how \(\mathbf{d}^{opt}=(d^{opt}_{1}, d^{opt}_{2}, \cdots, d^{opt}_{K})\) is determined via backward induction, also known as dynamic programming. It can be shown that \(\mathbf{d}^{opt}\) defined by these steps satisfies the condition \(\mathbb{E}[Y^{*}\big(\mathbf{d}\big)] \le \mathbb{E}[Y^{*}\big(\mathbf{d}^{opt}\big)]\) for all \(\mathbf{d} \in \mathcal{D}\).


Estimation of Optimal Treatment Regimes

If an optimal regime is to be identifiable, we must be able to express \(\mathbf{d}^{opt}\) in terms of the distribution of the observed i.i.d. data \((X_{1i}, A_{1i}, \cdots, X_{Ki}, A_{Ki}, Y_{i}), i = 1, \cdots, n\).

Then at the \(K\) th decision point, define \[\begin{align*} Q_{K}(\overline{x_{K}}, \overline{a_{K}}) &= \mathbb{E} [Y|\overline{X_{K}}=\overline{x_{K}}, \overline{A_{K}}=\overline{a_{K}}], \\ d_{K}^{opt}(\overline{x_{K}}, \overline{a_{K-1}}) &= \argmax_{a_{K} \in \Psi_{K}(\overline{x_{K}}, \overline{a_{K-1}})} Q_{K}(\overline{x_{K}}, \overline{a_{K-1}}, a_{K}), \\ V_{K}(\overline{x_{K}}, \overline{a_{K-1}}) &= \max_{a_{K} \in \Psi_{K}(\overline{x_{K}}, \overline{a_{K-1}})}Q_{K}(\overline{x_{K}}, \overline{a_{K-1}}, a_{K}). \end{align*}\] and for \(k = K-1, \cdots, 1,\) \[\begin{align*} Q_{k}(\overline{x_{k}}, \overline{a_{k}}) &= \mathbb{E} [V_{k+1}(\overline{x_{k}}, X_{k+1}, \overline{a_{k}}) | \overline{X_{k}}=\overline{x_{k}}, \overline{A_{k}}=\overline{a_{k}}], \\ d_{k}^{opt}(\overline{x_{k}}, \overline{a_{k-1}}) &= \argmax_{a_{k} \in \Psi_{k}(\overline{x_{k}}, \overline{a_{k-1}})} Q_{k}(\overline{x_{k}}, \overline{a_{k-1}}, a_{k}), \\ V_{k}(\overline{x_{k}}, \overline{a_{k-1}}) &= \max_{a_{k} \in \Psi_{k}(\overline{x_{k}}, \overline{a_{k-1}})} Q_{k}(\overline{x_{k}}, \overline{a_{k-1}}, a_{k}). \\ \end{align*}\]

Here, the \(Q_{k}(\overline{x_{k}}, \overline{a_{k}})\) are referred to as “Q-functions,” viewed as measuring the quality associated with giving treatment \(a_{k}\) at decision \(k\) given the history up to that decision and then following the optimal regime thereafter. The \(V_{k}(\overline{x_{k}}, \overline{a_{k-1}})\) are known as “value functions,” viewed as the value of a patient’s history \((\overline{x_{k}}, \overline{a_{k-1}})\) given that optimal decisions are made from decision \(k\) onward.

In order to estimate an optimal regime, the treatment options in \(\Psi\) must be represented in the data, which implies for \((\overline{x_{k}}, \overline{a_{k-1}}) \in \Gamma_{k}, \text{ and }a_{k} \in \Psi_{k}(\overline{x_{k}}, \overline{a_{k-1}}),\) \[\begin{align*} \mathbb{P}[ A_{k} = a_{k} | \overline{X_{k}} = \overline{x_{k}}, \overline{A_{k-1}} = \overline{a_{k-1}}] > 0, \end{align*}\] for all \(k = 1, \cdots, K\).
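In practice, one can check this positivity condition by inspecting fitted propensity scores. Below is a minimal R sketch for \(K = 2\), assuming logistic propensity models and a data frame df with columns X1, A1, X2, A2 (both assumptions ours, for illustration only):

# Positivity check: fitted propensities should be well inside (0, 1)
ps1 <- fitted(glm(A1 ~ X1, family = binomial, data = df))
ps2 <- fitted(glm(A2 ~ X1 + A1 + X2, family = binomial, data = df))
summary(ps1); summary(ps2)  # values near 0 or 1 flag near-violations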

Then, estimation of \(\mathbf{d}^{opt}\) may be accomplished via direct modeling and fitting of the Q-functions. Specifically, for \(k = K, K-1, \cdots, 1\), assume models \(Q_{k}(\overline{x_{k}}, \overline{a_{k}}; \xi_{k})\) depending on a finite-dimensional parameter \(\xi_{k}\). The models may be linear or nonlinear in \(\xi_{k}\), and estimators \(\widehat{\xi_{k}}\) can be obtained in a backward iteration for \(k = K, K-1, \cdots, 1\) by solving suitable estimating equations including ordinary least squares (OLS) or weighted least squares (WLS) as follows.

\[\begin{align*} \sum_{i=1}^n \frac{\partial Q_k\left(\overline{X_{k, i}}, \overline{A_{k,i}}; \xi_{k} \right)}{\partial \xi_{k}} \left[ \underbrace{\Sigma_{k}\left(\overline{X_{k, i}}, \overline{A_{k,i}} \right) }_{=: \text{Working Variance}} \right]^{-1} \left\{ \underbrace{\widetilde{V}_{(k+1), i}}_{=Y_{i} \text{ for } k=K} - Q_{k}\left(\overline{X_{k, i}}, \overline{A_{k,i}}; \xi_{k}\right) \right\} =0. \end{align*}\]
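For example, if we take the working variance \(\Sigma_{k} \equiv 1\) and a model linear in \(\xi_{k}\), say \(Q_{k}(\overline{x_{k}}, \overline{a_{k}}; \xi_{k}) = Z_{k}^{T} \xi_{k}\) for a design vector \(Z_{k}\) built from \((\overline{x_{k}}, \overline{a_{k}})\) (the notation \(Z_{k}\) is ours, for illustration), then \(\partial Q_{k}/\partial \xi_{k} = Z_{k}\) and the estimating equation reduces to the OLS normal equations: \[\begin{align*} \sum_{i=1}^{n} Z_{k,i} \left\{ \widetilde{V}_{(k+1), i} - Z_{k,i}^{T} \xi_{k} \right\} = 0 \quad \Longrightarrow \quad \widehat{\xi_{k}} = \left( \sum_{i=1}^{n} Z_{k,i} Z_{k,i}^{T} \right)^{-1} \sum_{i=1}^{n} Z_{k,i} \widetilde{V}_{(k+1), i}. \end{align*}\]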

  1. Solve the above estimating equations for \(k = K\) in \(\xi_{K}\) to obtain \(\widehat{\xi_{K}}\).

    • \(d_{K}^{opt}(\overline{x_{K}}, \overline{a_{K-1}}; \widehat{\xi_{K}}) = \argmax_{a_{K}} Q_{K}(\overline{x_{K}}, \overline{a_{K-1}}, a_{K}; \widehat{\xi_{K}})\) : an estimator for the optimal treatment choice at decision \(K\) for a patient with past history \(\overline{X_{K}}= \overline{x_{K}}, \overline{A_{K-1}}= \overline{a_{K-1}}\).
  2. Solve the estimating equation for \(k = K-1\) in \(\xi_{K-1}\) to obtain \(\widehat{\xi_{K-1}}\).

    • With \(\widehat{\xi_{K}}\), one would form \[\begin{align*} \widetilde{V}_{K,i} = \max_{a_{K} \in \Psi_{K}}Q_{K} \left(\overline{X_{K,i}}, \overline{A_{(K-1),i}}, a_{K}; \widehat{\xi_{K}} \right) \text{ for each } i. \end{align*}\]
    • \(d_{K-1}^{opt}(\overline{x_{K-1}}, \overline{a_{K-2}}; \widehat{\xi_{K-1}}) = \argmax_{a_{K-1}} Q_{K-1}(\overline{x_{K-1}}, \overline{a_{K-1}}; \widehat{\xi_{K-1}})\): an estimator for the optimal treatment choice at decision \(K-1\) for a patient with past history \(\overline{X_{K-1}}= \overline{x_{K-1}}, \overline{A_{K-2}}= \overline{a_{K-2}}\), assuming that a patient will take the optimal treatment at decision \(K\).
  3. Continue this process in the backward manner for \(k= K-2, \cdots, 1\).

Then we may summarize the estimated optimal regime as \(\widehat{\mathbf{d}}^{opt} = (\widehat{d}^{opt}_{1}, \widehat{d}^{opt}_{2}, \cdots, \widehat{d}^{opt}_{K})\), where \[\begin{align*} & \widehat{d}^{opt}_{1}(x_{1}) = d_{1}^{opt}(x_{1}; \widehat{\xi_{1}}), \\ & \widehat{d}^{opt}_{k}(\overline{x_{k}}, \overline{a_{k-1}}) = d_{k}^{opt}(\overline{x_{k}}, \overline{a_{k-1}}; \widehat{\xi_{k}}), \quad k = 2, \cdots, K. \end{align*}\]

Since we have presented the estimating equations in weighted least squares form, at the \(K\)th decision with responses \(Y_{i}\) it is known that an asymptotically efficient estimator of \(\xi_{K}\) is obtained when the working variance equals the true conditional variance, \(\Sigma_{K}(\overline{x_{K}}, \overline{a_{K}}) = Var(Y|\overline{X_{K}}= \overline{x_{K}}, \overline{A_{K}} = \overline{a_{K}})\). For \(k < K\), with responses \(\widetilde{V}_{(k+1),i}\), this theory may no longer apply. However, since deriving the optimal working variance is considerably complicated, it is standard to fit the models via OLS or WLS anyway.

\(\star\) Estimation for \(K = 2\) :

Now we illustrate this approach for \(K = 2\), where at each decision there are two possible treatment options coded as 0 and 1. Let us first define the histories at decisions \(1\) and \(2\) as \(\mathcal{H}_{1}:= (1, x_{1}^{T})^{T}\) and \(\mathcal{H}_{2}:= (1, x_{1}^{T}, a_{1}, x_{2}^{T})^{T}\), respectively. Here, we assume linear models for the Q-functions as follows.

\[\begin{align*} & Q_{1}(x_{1}, a_{1}; \boldsymbol{\xi}_{1}) \hspace{1cm} := \mathcal{H}_{1}^{T} \boldsymbol{\beta}_{1} + a_{1} (\mathcal{H}_{1}^{T} \boldsymbol{\gamma}_{1}) \\ & Q_{2}(x_{1}, a_{1}, x_{2}, a_{2}; \boldsymbol{\xi}_{2}) := \mathcal{H}_{2}^{T} \boldsymbol{\beta}_{2} + a_{2} (\mathcal{H}_{2}^{T} \boldsymbol{\gamma}_{2}) \end{align*}\]

where \(\boldsymbol{\xi}_{k} = (\boldsymbol{\beta}_{k}, \boldsymbol{\gamma}_{k})^{T}\) for \(k = 1, 2\).

By definition,

\[\begin{align*} V_{2} (x_{1}, a_{1}, x_{2}; \boldsymbol{\xi}_{2}) &= \max_{a_{2}} Q_{2} (x_{1}, a_{1}, x_{2}, a_{2}; \boldsymbol{\xi}_{2}) \\ &= \max_{a_{2}} \left[ \mathcal{H}_{2}^{T} \boldsymbol{\beta}_{2} + a_{2} (\mathcal{H}_{2}^{T} \boldsymbol{\gamma}_{2}) \right] \\ &= \mathcal{H}_{2}^{T} \boldsymbol{\beta}_{2} + \mathcal{H}_{2}^{T} \boldsymbol{\gamma}_{2} \,\, \mathbf{1}\left(\mathcal{H}_{2}^{T} \boldsymbol{\gamma}_{2} > 0 \right), \\ V_{1} (x_{1}; \boldsymbol{\xi}_{1}) &= \max_{a_{1}} Q_{1} (x_{1}, a_{1}; \boldsymbol{\xi}_{1}) \\ &= \max_{a_{1}} \left[ \mathcal{H}_{1}^{T} \boldsymbol{\beta}_{1} + a_{1} (\mathcal{H}_{1}^{T} \boldsymbol{\gamma}_{1}) \right] \\ &= \mathcal{H}_{1}^{T} \boldsymbol{\beta}_{1} + \mathcal{H}_{1}^{T} \boldsymbol{\gamma}_{1} \,\, \mathbf{1}\left(\mathcal{H}_{1}^{T} \boldsymbol{\gamma}_{1} > 0 \right). \end{align*}\]

which implies \(d_{1}^{opt}(x_{1}; \boldsymbol{\xi}_{1}) = \mathbf{1}\left(\mathcal{H}_{1}^{T} \boldsymbol{\gamma}_{1} > 0 \right), \,\, d_{2}^{opt}(x_{1}, a_{1}, x_{2}; \boldsymbol{\xi}_{2}) = \mathbf{1}\left(\mathcal{H}_{2}^{T} \boldsymbol{\gamma}_{2} > 0 \right)\).

In this example, we follow the steps below, based on ordinary least squares, to estimate the optimal dynamic treatment rule (see the R sketch after these steps):

  1. Estimate \(\boldsymbol{\xi}_{2}= ({\boldsymbol{\beta}}_{2}, {\boldsymbol{\gamma}}_{2}).\) \[\begin{align*} (\widehat{\boldsymbol{\beta}}_{2}, \widehat{\boldsymbol{\gamma}}_{2}) &= \argmin_{\beta_{2}, \gamma_{2}} \frac{1}{n} \sum_{i=1}^{n} \left(Y_{i} - Q_{2}(X_{1,i}, A_{1,i}, X_{2,i}, A_{2,i}; \beta_{2}, \gamma_{2}) \right)^{2} \\ &= \argmin_{\beta_{2}, \gamma_{2}} \frac{1}{n} \sum_{i=1}^{n} \left(Y_{i} - \mathcal{H}_{2,i}^{T} \boldsymbol{\beta}_{2} - A_{2,i} \mathcal{H}_{2,i}^{T} \boldsymbol{\gamma}_{2} \right)^{2} \\ \end{align*}\]

  2. Calculate \(\widetilde{V}_{2,i}\) for \(i=1,\cdots,n.\) \[\begin{align*} \widetilde{V}_{2,i} &= \max_{a_{2}} Q_{2}(X_{1,i}, A_{1,i}, X_{2,i}, a_{2}; \widehat{\boldsymbol{\beta}}_{2}, \widehat{\boldsymbol{\gamma}}_{2}) \\ &= \mathcal{H}_{2,i}^{T} \widehat{\boldsymbol{\beta}}_{2} + \mathcal{H}_{2,i}^{T} \widehat{\boldsymbol{\gamma}}_{2} \,\, \mathbf{1}\left(\mathcal{H}_{2,i}^{T} \widehat{\boldsymbol{\gamma}}_{2} > 0 \right). \end{align*}\]

  3. Estimate \(\boldsymbol{\xi}_{1}= ({\boldsymbol{\beta}}_{1}, {\boldsymbol{\gamma}}_{1}).\) \[\begin{align*} (\widehat{\boldsymbol{\beta}}_{1}, \widehat{\boldsymbol{\gamma}}_{1}) &= \argmin_{\beta_{1}, \gamma_{1}} \frac{1}{n} \sum_{i=1}^{n} \left( \widetilde{V}_{2,i} - Q_{1}(X_{1,i}, A_{1,i}; \beta_{1}, \gamma_{1}) \right)^{2} \\ &= \argmin_{\beta_{1}, \gamma_{1}} \frac{1}{n} \sum_{i=1}^{n} \left(\widetilde{V}_{2,i} - \mathcal{H}_{1,i}^{T} \boldsymbol{\beta}_{1} - A_{1,i} \mathcal{H}_{1,i}^{T} \boldsymbol{\gamma}_{1} \right)^{2} \\ \end{align*}\]

  4. Estimate the optimal regime \(\widehat{\mathbf{d}}^{opt}=(\widehat{d}_{1}^{opt}, \widehat{d}_{2}^{opt}).\) \[\begin{align*} & \widehat{d}_{1}^{opt}(x_{1}; \widehat{\boldsymbol{\xi}}_{1}) \hspace{1cm} = \mathbf{1}\left(\mathcal{H}_{1}^{T} \widehat{\boldsymbol{\gamma}}_{1} > 0 \right), \\ & \widehat{d}_{2}^{opt}(x_{1}, a_{1}, x_{2}; \widehat{\boldsymbol{\xi}}_{2}) = \mathbf{1}\left(\mathcal{H}_{2}^{T} \widehat{\boldsymbol{\gamma}}_{2} > 0 \right). \end{align*}\]
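These four steps translate directly into two lm() fits. Below is a minimal R sketch under the linear working models above; the data frame df with columns X1, A1, X2, A2, Y and the function name qlearn_k2 are our own illustrative choices, not from the paper:

qlearn_k2 <- function(df) {
  # Step 1: stage-2 OLS fit of Q2 = H2'beta2 + a2 * (H2'gamma2), H2 = (1, X1, A1, X2)
  fit2 <- lm(Y ~ (X1 + A1 + X2) * A2, data = df)
  H2 <- cbind(1, df$X1, df$A1, df$X2)
  gamma2 <- coef(fit2)[c("A2", "X1:A2", "A1:A2", "X2:A2")]
  contrast2 <- as.numeric(H2 %*% gamma2)               # H2'gamma2
  # Step 2: pseudo-outcome = max over a2 of the fitted Q2
  beta2 <- coef(fit2)[c("(Intercept)", "X1", "A1", "X2")]
  V2 <- as.numeric(H2 %*% beta2) + pmax(contrast2, 0)
  # Step 3: stage-1 OLS fit of Q1 = H1'beta1 + a1 * (H1'gamma1), H1 = (1, X1)
  fit1 <- lm(V2 ~ X1 * A1, data = df)
  gamma1 <- coef(fit1)[c("A1", "X1:A1")]
  contrast1 <- as.numeric(cbind(1, df$X1) %*% gamma1)  # H1'gamma1
  # Step 4: estimated optimal rules are indicators of positive contrasts
  list(d1 = as.integer(contrast1 > 0), d2 = as.integer(contrast2 > 0))
}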


Estimation of Potential Outcome

In this section, consider a fixed treatment \((a_{1}, a_{2}) \in \{0,1\}^{2}\) with two time points \((K = 2)\). We are interested in estimating the expected potential outcome \(\mathbb{E}[Y^{*}(a_{1}, a_{2})]\) of treatment \((a_{1}, a_{2})\) using the weighted observed outcome \(W \mathbf{1}(A_{1} = a_{1}, A_{2} = a_{2})Y\), which is nonzero only when the treatment actually received \((A_{1}, A_{2})\) agrees with the desired treatment \((a_{1}, a_{2})\). We have the following lemma, based on inverse probability weighting.


Lemma 3 Fix a treatment \((a_{1},a_{2})\in \{0,1\}^{2}\). Under the assumptions of consistency, SUTVA, and sequential ignorability mentioned in the section Model setup and assumptions, we have \[\begin{align*} &\mathbb{E}[Y^{*}(a_{1}, a_{2})] \\ &= \mathbb{E}_{(X_{1}, A_{1}, X_{2}, A_{2}, Y)} \left[ \frac{Y \mathbf{1}(A_{1} = a_{1})}{\mathbb{P}(A_{1} = a_{1}|X_{1})}\frac{\mathbf{1}(A_{2} = a_{2})}{\mathbb{P}(A_{2} = a_{2} | X_{1}, A_{1} = a_{1}, X_{2})}\right] \\ &= \mathbb{E} \left[ W \mathbf{1}(A_1 = a_1, A_2 = a_2) Y\right] \end{align*}\] where \(W = \mathbb{P}(A_{1} = a_{1}|X_{1})^{-1} \, \mathbb{P}(A_{2} = a_{2} | X_{1}, A_{1} = a_{1}, X_{2})^{-1}\).

Proof. \({}\) Using iterated expectations, consistency, and sequential ignorability, we have \[\begin{align*} & \mathbb{E}_{(X_{1}, A_{1}, X_{2}, A_{2}, Y)} \left[ \frac{Y \mathbf{1}(A_{1} = a_{1})}{\mathbb{P}(A_{1} = a_{1}|X_{1})}\frac{\mathbf{1}(A_{2} = a_{2})}{\mathbb{P}(A_{2} = a_{2} | X_{1}, A_{1} = a_{1}, X_{2})}\right] \\ & \qquad \overset{(a)}{=} \mathbb{E}_{(X_{1}, A_{1}, X_{2}, A_{2}, Y^{*})} \left[ \frac{Y^{*}(a_{1}, a_{2})}{\mathbb{P}(A_{1} = a_{1}|X_{1})}\frac{\mathbf{1}(A_{1} = a_{1}) \mathbf{1}(A_{2} = a_{2})}{\mathbb{P}(A_{2} = a_{2} | X_{1}, A_{1} = a_{1}, X_{2})}\right] \\ & \qquad = \mathbb{E}_{(X_{1}, A_{1}, X_{2})} \left[ \mathbb{E}_{(A_{2}, Y^{*})} \left[ \frac{Y^{*}(a_{1}, a_{2})}{\mathbb{P}(A_{1} = a_{1}|X_{1})}\frac{\mathbf{1}(A_{1} = a_{1}) \mathbf{1}(A_{2} = a_{2})}{\mathbb{P}(A_{2} = a_{2} | X_{1}, A_{1} = a_{1}, X_{2})} \bigg| X_{1}, A_{1}, X_{2} \right] \right] \\ & \qquad = \mathbb{E}_{(X_{1}, A_{1}, X_{2})} \left[ \frac{\mathbf{1}(A_{1} = a_{1})}{\mathbb{P}(A_{1} = a_{1}|X_{1}) \mathbb{P}(A_{2} = a_{2} | X_{1}, A_{1} = a_{1}, X_{2})} \,\, \mathbb{E}_{(A_{2}, Y^{*})} \left[ Y^{*}(a_{1}, a_{2})\mathbf{1}(A_{2} = a_{2}) \bigg| X_{1}, A_{1}, X_{2} \right] \right] \\ & \qquad \overset{(b)}{=} \mathbb{E}_{(X_{1}, A_{1}, X_{2})} \left[ \frac{\mathbf{1}(A_{1} = a_{1})}{\mathbb{P}(A_{1} = a_{1}|X_{1}) \mathbb{P}(A_{2} = a_{2} | X_{1}, A_{1} = a_{1}, X_{2})} \,\, \mathbb{E}_{Y^{*}} \left[ Y^{*}(a_{1}, a_{2}) \bigg| X_{1}, A_{1}, X_{2} \right] \mathbb{E}_{A_{2}} \left[\mathbf{1}(A_{2} = a_{2}) \bigg| X_{1}, A_{1} = a_{1}, X_{2} \right] \right] \\ & \qquad = \mathbb{E}_{(X_{1}, A_{1}, X_{2})} \left[ \frac{\mathbf{1}(A_{1} = a_{1})}{\mathbb{P}(A_{1} = a_{1}|X_{1}) \mathbb{P}(A_{2} = a_{2} | X_{1}, A_{1} = a_{1}, X_{2})} \,\, \mathbb{E}_{Y^{*}} \left[ Y^{*}(a_{1}, a_{2}) \bigg| X_{1}, A_{1}, X_{2} \right] \mathbb{P} \left(A_{2} = a_{2} \bigg| X_{1}, A_{1} = a_{1}, X_{2} \right) \right] \\ & \qquad = \mathbb{E}_{(X_{1}, A_{1}, X_{2})} \left[ \frac{\mathbf{1}(A_{1} = a_{1})}{\mathbb{P}(A_{1} = a_{1}|X_{1})} \,\, \mathbb{E}_{Y^{*}} \left[ Y^{*}(a_{1}, a_{2}) \bigg| X_{1}, A_{1}, X_{2} \right] \right] \\ & \qquad = \mathbb{E}_{(X_{1}, A_{1}, X_{2}, Y^{*})} \left[ \frac{Y^{*}(a_{1}, a_{2}) \mathbf{1}(A_{1} = a_{1})}{\mathbb{P}(A_{1} = a_{1}|X_{1})} \right] \\ & \qquad = \mathbb{E}_{X_{1}} \left[ \mathbb{E}_{(A_{1}, X_{2}, Y^{*})} \left[ \frac{Y^{*}(a_{1}, a_{2})\mathbf{1}(A_{1} = a_{1})}{\mathbb{P}(A_{1} = a_{1}|X_{1})} \bigg| X_{1} \right] \right] \\ & \qquad = \mathbb{E}_{X_{1}} \left[ \frac{1}{\mathbb{P}(A_{1} = a_{1}|X_{1})} \,\, \mathbb{E}_{(A_{1}, X_{2}, Y^{*})} \left[ Y^{*}(a_{1}, a_{2})\mathbf{1}(A_{1} = a_{1}) \bigg| X_{1} \right] \right] \\ & \qquad \overset{(c)}{=} \mathbb{E}_{X_{1}} \left[ \frac{1}{P(A_{1} = a_{1}|X_{1})} \,\, \mathbb{E}_{(X_{2}, Y^{*})} \left[ Y^{*}(a_{1}, a_{2}) \bigg| X_{1} \right] \mathbb{E}_{A_{1}} \left[\mathbf{1}(A_{1} = a_{1}) \bigg| X_{1} \right] \right] \\ & \qquad = \mathbb{E}_{X_{1}} \left[ \frac{1}{\mathbb{P}(A_{1} = a_{1}|X_{1})} \,\, \mathbb{E}_{(X_{2}, Y^{*})} \left[ Y^{*}(a_{1}, a_{2}) \bigg| X_{1} \right] \mathbb{P} \left(A_{1} = a_{1} \bigg| X_{1} \right) \right] \\ & \qquad = \mathbb{E}_{X_{1}} \left[ \mathbb{E}_{(X_{2}, Y^{*})} \left[ Y^{*}(a_{1}, a_{2}) \bigg| X_{1} \right] \right] \\ & \qquad = \mathbb{E}_{(X_{1}, X_{2}, Y^{*})} \bigg[Y^{*}(a_{1}, a_{2}) \bigg] \\ & \qquad = \mathbb{E} [Y^{*}(a_{1}, a_{2})] \end{align*}\] where \((b)\) and \((c)\) use the sequential ignorability assumptions \(Y^{*}(a_{1}, a_{2}) \perp A_{2} 
\, | \, (X_{1}, A_{1}, X_{2})\), \(Y^{*}(a_{1}, a_{2}) \perp A_{1} | X_{1}\), respectively, and \((a)\) uses the fact \(Y \mathbf{1}(A_{1} = a_{1}) \mathbf{1}(A_{2} = a_{2}) = Y(A_{1}, A_{2}) \mathbf{1}(A_{1} = a_{1}) \mathbf{1}(A_{2} = a_{2}) = Y^{*}(a_{1}, a_{2}) \mathbf{1}(A_{1} = a_{1}) \mathbf{1}(A_{2} = a_{2})\) using consistency. \[\begin{align*} \hspace{15cm} \square \end{align*}\]
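Lemma 3 suggests a simple plug-in IPW estimator of \(\mathbb{E}[Y^{*}(a_{1}, a_{2})]\): estimate the two propensities, form the weights, and average the weighted outcomes. A minimal R sketch, assuming logistic propensity models and a data frame df with columns X1, A1, X2, A2, Y (both assumptions ours, mirroring the simulation code below):

ipw_mean <- function(df, a1, a2) {
  # P(A1 = 1 | X1) and P(A2 = 1 | X1, A1, X2), both assumed logistic
  ps1 <- fitted(glm(A1 ~ X1, family = binomial, data = df))
  ps2 <- fitted(glm(A2 ~ X1 + A1 + X2, family = binomial, data = df))
  p1 <- ifelse(a1 == 1, ps1, 1 - ps1)  # P(A1 = a1 | X1)
  p2 <- ifelse(a2 == 1, ps2, 1 - ps2)  # P(A2 = a2 | X1, A1, X2)
  # Weighted outcome is nonzero only when (A1, A2) agrees with (a1, a2)
  mean(df$Y * (df$A1 == a1) * (df$A2 == a2) / (p1 * p2))
}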


Simulation Studies

For the simulation, the data-generating process mimics a study in which HIV-infected patients are randomized to receive antiretroviral therapy (coded as 1) or not (coded as 0) at baseline and at six months. The randomization probabilities depend on the baseline and six-month CD4 counts. The detailed generating process is as follows.

  • \(X_{1} \sim N(450, 150^2)\) : baseline CD4 count
  • \(A_{1} | X_{1} = x_{1} \sim \text{Ber}\left[\text{expit}(-0.006 x_{1} + 2)\right]\) : treatment \(A_{1}\)
  • \(X_{2} | X_{1} = x_{1}, A_{1} = a_{1} \sim N(1.25 x_{1}, 60^2)\) : six month CD4 count
  • \(\small{A_{2} | X_{1} = x_{1}, A_{1} = a_{1}, X_{2} = x_{2} \sim \text{Ber}\left[\text{expit}(-0.004 x_{2} + 0.8)\right]}\) : treatment \(A_{2}\)
  • \(Y = Y^{opt} - \mu_{1}(X_{1}, A_{1}) - \mu_{2}(X_{1}, A_{1}, X_{2}, A_{2})\) : CD4 count at one year
    • \(Y^{opt} | X_{1} = x_{1}, A_{1} = a_{1}, X_{2} = x_{2}, A_{2} = a_{2} \sim N(400 + 1.6 x_{1}, 60^2)\)
    • \[\begin{align*} \mu_{1}(X_{1}, A_{1}) &= (\gamma_{10} + \gamma_{11} X_{1}) \left[ \mathbf{1}(\gamma_{10} + \gamma_{11} X_{1} > 0) - A_{1} \right] \\ &\overset{(set)}{=} (250 - X_{1}) \left[ \mathbf{1}(250 - X_{1} > 0) - A_{1} \right] \end{align*}\]
    • \[\begin{align*} \mu_{2}(X_{1}, A_{1}, X_{2}, A_{2}) &= (\gamma_{20} + \gamma_{21} X_{2}) \left[ \mathbf{1}((\gamma_{20} + \gamma_{21} X_{2}) > 0) - A_{2} \right] \\ &\overset{(set)}{=} (720 -2X_{2}) \left[\mathbf{1}(720 -2 X_{2} > 0) - A_{2} \right] \end{align*}\]

where \(\text{expit}(x) = e^{x}/(e^{x}+1)\), and \(\mu_{1}(X_{1}, A_{1}), \mu_{2}(X_{1}, A_{1}, X_{2}, A_{2})\) are the true advantage functions, i.e., the loss in response from the treatment actually received relative to the optimal treatment at each decision. The true optimal treatment regime is therefore \(d_{1}^{opt}(x_{1}) = \mathbf{1}(250 - x_{1} > 0)\) and \(d_{2}^{opt}(x_{1}, a_{1}, x_{2}) = \mathbf{1}(720 -2 x_{2} > 0)\).

For Q-learning, we assumed working models \[\begin{align*} & Q_{1}(x_{1}, a_{1}; \boldsymbol{\xi}_{1}) \hspace{1cm} := \beta_{10} + \beta_{11} x_{1} + a_{1} (\gamma_{10} + \gamma_{11} x_{1}) \\ & Q_{2}(x_{1}, a_{1}, x_{2}, a_{2}; \boldsymbol{\xi}_{2}) := \beta_{20} + \beta_{21} x_{1} + \beta_{22} a_{1} + \beta_{23} x_{1} a_{1} + \beta_{24} x_{2} + a_{2} (\gamma_{20} + \gamma_{21} x_{2}) \end{align*}\]

With \(n=1000\) patients per dataset and \(1000\) simulated datasets (Monte Carlo standard deviations in parentheses), we have the following results:

\[\begin{align*} \begin{aligned} &\begin{array}{cccc} \text { Time } & \text{ True parameter } & \text { Estimated parameter } & \text{ Predicted ITR Accuracy } \\ \hline 1 & \gamma_{10} = 250 & 155.49 (21.76) & 0.957 (0.012) \\ 1 & \gamma_{11} = -1.0 & -0.78 (0.05) &- \\ 2 & \gamma_{20} = 720 & 506.50 (48.78) &0.957 (0.014) \\ 2 & \gamma_{21} = -2.0 & -1.58 (0.09) &- \\ \hline \end{array} \end{aligned} \end{align*}\]

The full R code for the simulation study:
expit <- function(x){ exp(x)/(1+exp(x)) }

simulated_dtr <- function(iter = 3){
  
  n = 1000
  result = list()
  result[[1]] = array(0, dim = c(2, 2, iter))
  result[[2]] = array(0, dim = c(2, 2, iter))
  result_cutoff = data.frame(cutoff_X1 = 0, cutoff_X2 = 0,
                             slope_X1 = 0, slope_X2 = 0)
  result_outcome = data.frame(mean_Y_est_itr = 0, mean_Y_itr = 0)

  for (i in 1:iter){
    set.seed(i)
    
    # Generate dataset
    X1 = rnorm(n, mean = 450, sd = 150) 
    A1 = rbinom(n, size = 1, prob = expit(-0.006 * X1 + 2)) 
    X2 = rnorm(n, mean = 1.25 * X1, sd = 60)
    A2 = rbinom(n, size = 1, prob = expit(-0.004*X2 + 0.8)) 
    Y_opt = rnorm(n, mean = 400 + 1.6 * X1, sd = 60) 
    mu_1 = (250 - X1) * ( ifelse(250 - X1 > 0, 1, 0) - A1 ) 
    mu_2 = (720 - 2 * X2) * ( ifelse(720 - 2 * X2 > 0, 1, 0) - A2 ) 
    Y = Y_opt - mu_1 - mu_2
    
    # True optimal ITR
    d1_opt = ifelse(250 - X1 > 0, 1, 0) 
    d2_opt = ifelse(720 - 2 * X2 > 0, 1, 0) 
    
    # Estimate the DTR by Q-learning (two backward OLS fits)
    # Stage-2 fit: coefficient order is (Intercept), X1, A1, X2, A2, X1:A1, A2:X2,
    # so indices c(5, 7) extract (gamma_20, gamma_21), the A2 contrast
    xi_2 = lm(Y ~ X1 + A1 + X1*A1 + X2 + A2 + A2*X2)
    pred_A2 = as.numeric(ifelse(cbind(1, X2) %*% coef(xi_2)[c(5,7)]>0, 1, 0))
    # Pseudo-outcome: fitted Q2 evaluated at the estimated optimal A2
    v_1 = as.numeric(cbind(1, X1, A1, X2, X1*A1) %*% coef(xi_2)[c(1:4, 6)] 
                     + pred_A2 * cbind(1, X2) %*% coef(xi_2)[c(5,7)])
    # Stage-1 fit: indices 3:4 extract (gamma_10, gamma_11), the A1 contrast
    xi_1 = lm(v_1 ~ X1 + A1 + A1*X1)
    pred_A1 = as.numeric(ifelse(cbind(1, X1) %*% coef(xi_1)[3:4]>0, 1, 0))
    
    # Estimate potential outcome
    
    ## propensity score model
    ps1_model = glm(A1 ~ X1, family = binomial(link = "logit"))
    ps1 = predict(ps1_model, type = "response")
    ps2_model = glm(A2 ~ X1 + A1 + X2, family = binomial(link = "logit"))
    ps2 = predict(ps2_model, type = "response")

    ## IPW
    weight = rep(0, n)
    weight[A1 == 1 & A2 == 1] = 
      ps1[A1 == 1 & A2 == 1] * ps2[A1 == 1 & A2 == 1]
    weight[A1 == 1 & A2 == 0] = 
      ps1[A1 == 1 & A2 == 0] * (1-ps2[A1 == 1 & A2 == 0])
    weight[A1 == 0 & A2 == 1] = (1-ps1[A1 == 0 & A2 == 1]) * 
      ps2[A1 == 0 & A2 == 1]
    weight[A1 == 0 & A2 == 0] = (1-ps1[A1 == 0 & A2 == 0]) * 
      (1-ps2[A1 == 0 & A2 == 0])
    weight = 1/weight

    msm = lm(Y ~ A1 + A2, weights = weight)

    ## Potential outcome
    mean_outcome_est_itr = 
      mean(predict(msm, data.frame(A1 = pred_A1, A2 = pred_A2)))
    mean_outcome_itr = 
      mean(predict(msm, data.frame(A1 = d1_opt, A2 = d2_opt)))
    
    # Result
    result_cutoff = rbind(result_cutoff, 
                          c(coef(xi_1)[3], coef(xi_2)[5],
                            coef(xi_1)[4], coef(xi_2)[7]))
    result_outcome = rbind(result_outcome, 
                          c(mean_outcome_est_itr, mean_outcome_itr))
    result[[1]][,,i] = table(d1_opt, pred_A1)                    
    result[[2]][,,i] = table(d2_opt, pred_A2)
  }
  result_cutoff = result_cutoff[-1,];
  result_outcome = result_outcome[-1,]
  result[[3]] = result_cutoff
  result[[4]] = result_outcome
  return(result)
}

sim_result = simulated_dtr(1000)
accuracy <- apply(sim_result[[1]], 3, function(x) {(x[1,1]+x[2,2])/1000})
accuracy2 <- apply(sim_result[[2]], 3, function(x) {(x[1,1]+x[2,2])/1000})
mean(accuracy); sd(accuracy)
mean(accuracy2); sd(accuracy2)
apply(sim_result[[3]], 2, mean)
apply(sim_result[[3]], 2, sd)
apply(sim_result[[4]], 2, mean)
apply(sim_result[[4]], 2, sd)

Based on the estimated optimal regime, the first therapy is given to patients with baseline CD4 count below \(\widehat{\gamma}_{10}/|\widehat{\gamma}_{11}| = 155.49/0.78 \approx 199\), while the true optimal CD4 threshold is \(250\). Similarly, at the second decision, therapy is given to patients with six-month CD4 count below \(506.50/1.58 \approx 321\), while the true optimal CD4 threshold is \(720/2 = 360\).

Although Q-learning yields poor estimates of the parameters, the estimated optimal regime attains about 96% accuracy at each decision. Also, the estimated potential outcome under the true optimal ITR from the marginal structural model is \(\widehat{\mathbb{E}}[Y^{*}(\mathbf{d}^{opt})] = 1002.53\) (standard deviation 22.83), and under the predicted ITR it is \(\widehat{\mathbb{E}}[Y^{*}(\widehat{\mathbf{d}}^{opt})] = 1028.47\) (standard deviation 26.97), a far smaller discrepancy than that of the parameter estimates.

The reason the estimated parameters do not recover the true optimal CD4 thresholds is that the Q-functions are misspecified: the linear models \(\mathcal{H}_{k}^{T}\beta_{k}\), \(k=1, 2\), are poor approximations to the complex forms of the true Q-functions. This is one motivation for ‘advantage learning (A-learning)’, which does not require the entire Q-functions to be correctly specified in order to estimate the optimal regime. We will discuss this topic further.


Sensitivity Analysis

In the previous section, the estimation approach for the optimal treatment rule \(\mathbf{d}^{opt}\) was based on the following assumptions:

  • Consistency assumption : \(X_{k} = X^{*}_{k}\big(\overline{A_{k-1}}\big), k=2, \cdots, K\) and \(Y=Y^{*}\big(\overline{A_{K}}\big)\),
  • Stable unit treatment value assumption (SUTVA, well-defined treatments) : a patient’s covariates and outcome are not affected by the treatments assigned to other patients,
  • Sequential ignorability : \(A_{k} \perp W^{*} \, | \, \{\overline{X_{k}}, \overline{A_{k-1}} \}, k=1, \cdots, K\).

Based on these assumptions, we were able to identify and estimate the causal effect, which eventually allows us to find the optimal treatment rule.

In this section, we consider the situation where the sequential ignorability assumption is violated due to the existence of unmeasured confounders \(U_{k}\), \(k = 1, \cdots, K\). Then the expected potential outcome \(\mathbb{E}[Y^{*}(\overline{a_{K}})]\) cannot be identified by a statistical estimand. The alternative approach is to find the possible range of values of the causal estimand in the presence of \(U_{k}\) using Manski-type bounds.

Before we start, let us first define the average treatment effect (ATE). Unlike in a single-stage problem, there is no single definition of the average treatment effect with time-varying treatments because there are \(2^{K}\) possible combinations of treatment strategies. For static treatments, the number of possible contrasts is \({2^{K} \choose 2}\); for example, there are \(6\) contrasts for \(K = 2\), such as \(\mathbb{E}[Y^{*}(1, 1) - Y^{*}(0, 0)]\) and \(\mathbb{E}[Y^{*}(1, 1) - Y^{*}(1, 0)]\). Under a dynamic treatment rule, the definition of the ATE is even more complicated since treatment assignment at time \(k\) depends on the patient's evolving characteristics. For the following derivation, we will use the regret function introduced by Murphy (2003). The regret \(R_{k}\) at each time point \(k = 1, \cdots, K\) is defined as the expected difference in outcome had the optimal treatment \(d^{opt}_{k}\) been taken at \(k\) instead of \(a_{k}\), among those who received treatment \(\overline{a_{k}}\) through time \(k\), with the optimal regime followed from \(k+1\) onward. Then under \(K = 2\), the regret functions \(R_{1}\) and \(R_{2}\) are defined as:

\[\begin{align*} R_{1}(x_{1}, a_{1}) := & \mathbb{E}\left[ Y^{*}(d^{opt}_{1}(x_{1}), d_{2}^{opt}(x_{1}, d^{opt}_{1}(x_{1}), X^{*}_{2}(d^{opt}_{1}))) - Y^{*}(a_{1}, d_{2}^{opt}(x_{1}, a_{1}, X^{*}_{2}(a_{1}))) \, \bigg| \, X_{1} = x_{1} \right], \\ R_{2}(\overline{x_{2}}, a_{1}, a_{2}) := & \mathbb{E}\left[Y^{*}(a_{1}, d_{2}^{opt}(x_{1}, a_{1}, x_{2})) - Y^{*}(a_{1}, a_{2}) \, \bigg| \, X_{1} = x_{1}, A_{1} = a_{1}, X_{2} = x_{2} \right] \\ \end{align*}\]

where \(d^{opt}_{1} := d^{opt}_{1}(x_{1})\) and \(d_{2}^{opt} := d_{2}^{opt}(x_{1}, a_{1}, X^{*}_{2}(a_{1}))\). If the regret \(R_{k}\) is greater than \(0\), we regret the choice of treatment; if \(R_{k}=0\) (no regret), the choice was ideal. In particular, \(d^{opt}_{k}\) satisfies \(R_{k}(\overline{X_{k}}, \overline{a_{k-1}}, d^{opt}_{k})=0\) for \(k = 1, 2\).

Here, we consider a fixed treatment \((a_{1}, a_{2}) \in \{0, 1\}^{2}\) with two time points \(K = 2\) and a bounded outcome \(Y \in [Y_{L}, Y_{U}]\). The goal is to find the possible range of values of \(R_{k}\) in the presence of unmeasured confounders \(U_{k}\). We then have the following modified assumptions:

\[\begin{align*} & 1. \quad Y^{*}(a_{1}, a_{2}) \perp A_{1} \,\, | \,\, \{X_{1}, U_{1}\}, \quad Y^{*}(a_{1}, a_{2}) \perp A_{2} \,\, | \,\, \{X_{1}, U_{1}, A_{1}, X_{2}, U_{2} \}, \\ & 2. \quad 0 < \mathbb{P}(A_{1} = a_{1} | X_{1} = x_{1}, U_{1} = u_{1}) <1, \\ & \hspace{0.7cm} 0 < \mathbb{P}(A_{2} = a_{2} | X_{1} = x_{1}, U_{1} = u_{1}, A_{1} = a_{1}, X_{2} = x_{2}, U_{2} = u_{2}) <1. \\ \end{align*}\]

Regret at second stage

\[\begin{align*} & R_{2}(\overline{X_{2}}, a_{1}, a_{2}) \\ & \quad = \mathbb{E} \bigg[ Y^{*}(a_{1}, \underbrace{d_{2}^{opt} (X_{1}, a_{1}, X_{2})}_{ =: \, d_{2}^{opt}}) - Y^{*}(a_{1}, a_{2}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2} \bigg] \\ & \quad \overset{(a)}{=} \mathbb{1}(d_{2}^{opt} \neq a_{2}) \,\, \mathbb{E} \left[ Y^{*}(a_{1}, 1-a_{2}) - Y^{*}(a_{1}, a_{2}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2} \right] \\ & \quad = \mathbb{1}(d_{2}^{opt} \neq a_{2}) \,\, \left\{ \mathbb{E} \left[ Y^{*}(a_{1}, 1-a_{2}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2} \right] - \mathbb{E} \left[ Y^{*}(a_{1}, a_{2}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2} \right] \right\} \\ & \quad = \mathbb{1}(d_{2}^{opt} \neq a_{2}) \,\, \bigg\{ \mathbb{E} \left[ Y^{*}(a_{1}, 1-a_{2}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = a_{2} \right] \mathbb{P} \left(A_{2} = a_{2} \, | \, X_{1}, A_{1} = a_{1}, X_{2} \right) \\ & \hspace{3.2cm} + \mathbb{E} \left[ Y^{*}(a_{1}, 1-a_{2}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = 1-a_{2} \right] \mathbb{P} \left(A_{2} = 1-a_{2} \, | \, X_{1}, A_{1} = a_{1}, X_{2}\right) \\ & \hspace{3.2cm} - \mathbb{E} \left[ Y^{*}(a_{1}, a_{2}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = a_{2} \right] \mathbb{P} \left(A_{2} = a_{2} \, | \, X_{1}, A_{1} = a_{1}, X_{2}\right) \\ & \hspace{3.2cm} - \mathbb{E} \left[ Y^{*}(a_{1}, a_{2}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = 1 - a_{2} \right] \mathbb{P} \left(A_{2} = 1-a_{2} \, | \, X_{1}, A_{1} = a_{1}, X_{2} \right) \bigg\} \\ & \quad \overset{(b)}{=} \mathbb{1}(d_{2}^{opt} \neq a_{2}) \,\, \bigg\{ \mathbb{E} \left[ Y^{*}(a_{1}, 1-a_{2}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = a_{2} \right] \mathbb{P} \left(A_{2} = a_{2} \, | \, X_{1}, A_{1} = a_{1}, X_{2} \right) \\ & \hspace{3.2cm} + \mathbb{E} \left[ Y \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = 1-a_{2} \right] \mathbb{P} \left(A_{2} = 1-a_{2} \, | \, X_{1}, A_{1} = a_{1}, X_{2}\right) \\ & \hspace{3.2cm} - \mathbb{E} \left[ Y \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = a_{2} \right] \mathbb{P} \left(A_{2} = a_{2} \, | \, X_{1}, A_{1} = a_{1}, X_{2}\right) \\ & \hspace{3.2cm} - \mathbb{E} \left[ Y^{*}(a_{1}, a_{2}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = 1 - a_{2} \right] \mathbb{P} \left(A_{2} = 1-a_{2} \, | \, X_{1}, A_{1} = a_{1}, X_{2} \right) \bigg\} \\ & \quad = \mathbb{1}(d_{2}^{opt} \neq a_{2}) \,\, \bigg\{ C + \mathbb{E} \left[ Y^{*}(a_{1}, 1-a_{2}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = a_{2} \right] \mathbb{P} \left(A_{2} = a_{2} \, | \, X_{1}, A_{1} = a_{1}, X_{2} \right) \\ & \hspace{3.2cm} - \mathbb{E} \left[ Y^{*}(a_{1}, a_{2}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = 1 - a_{2} \right] \mathbb{P} \left(A_{2} = 1-a_{2} \, | \, X_{1}, A_{1} = a_{1}, X_{2} \right) \bigg\}. \\ & \hspace{1cm} \text{ where } \,\, C := \, \mathbb{E} \left[ Y \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = 1-a_{2} \right] \mathbb{P} \left(A_{2} = 1-a_{2} \, | \, X_{1}, A_{1} = a_{1}, X_{2}\right) \\ & \hspace{2.8cm} - \mathbb{E} \left[ Y \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = a_{2} \right] \mathbb{P} \left(A_{2} = a_{2} \, | \, X_{1}, A_{1} = a_{1}, X_{2}\right). \\ \end{align*}\]

Here, \((a)\) uses the fact that \(R_{2}=0\) if \(d_{2}^{opt}=a_{2}\), so that only the case \(d_{2}^{opt} \neq a_{2}\) (where \(1-a_{2}\) is optimal) contributes, and \((b)\) uses the consistency assumption. Let \(p_{2} := \mathbb{P} \left(A_{2} = a_{2} \, | \, X_{1}, A_{1} = a_{1}, X_{2} \right)\). \(R_{2}\) attains its maximum when \(\mathbb{E} \left[ Y^{*}(a_{1}, 1-a_{2}) \, \big| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = a_{2} \right] = Y_{U}\) and \(\mathbb{E} \left[ Y^{*}(a_{1}, a_{2}) \, \big| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = 1 - a_{2} \right] = Y_{L}\), and its minimum in the opposite case (truncated below at \(0\), since a regret is nonnegative), so

\[\begin{align*} & R_{2} \, \in \, \left[ \max(0 \,, \, C + \, p_{2} \times Y_{L} - (1- p_{2}) \times Y_{U}) \, , \, C + \, p_{2} \times Y_{U} - (1- p_{2}) \times Y_{L} \right] \\ \Rightarrow \quad & R_{2} \, \in \, \left[ \max(0 \,, \, C - Y_{U} + (Y_{L} + Y_{U}) \times p_{2}) \, , \, C - Y_{L} + (Y_{L} + Y_{U}) \times p_{2} \right]. \\ \end{align*}\]

Therefore, \(\left[ \max(0 \,, \, C - Y_{U} + (Y_{L} + Y_{U}) \times p_{2}) \, , \, C - Y_{L} + (Y_{L} + Y_{U}) \times p_{2} \right]\) is the possible range of the regret at time \(2\) in the presence of an unmeasured confounder.
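Once \(C\) and \(p_{2}\) have been estimated, this bound is easy to compute. A minimal R sketch, treating all arguments as already-estimated scalars (a hypothetical helper of ours, for illustration):

regret2_bounds <- function(C, p2, Y_L, Y_U) {
  # Manski-type bounds on the second-stage regret R_2 for Y in [Y_L, Y_U];
  # the lower bound is truncated at 0 since a regret is nonnegative
  lower <- max(0, C - Y_U + (Y_L + Y_U) * p2)
  upper <- C - Y_L + (Y_L + Y_U) * p2
  c(lower = lower, upper = upper)
}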

Regret at first stage

\[\begin{align*} & R_{1}(X_{1}, a_{1}) \\ & \quad = \mathbb{E} \bigg[ Y^{*}( \underbrace{d^{opt}_{1}(X_{1})}_{=:d^{opt}_{1}}, d_{2}^{opt}(X_{1}, d^{opt}_{1}, X^{*}_{2}(d^{opt}_{1}))) - Y^{*}(a_{1}, d_{2}^{opt}(X_{1}, a_{1}, X^{*}_{2}(a_{1}))) \, \bigg| \, X_{1} \bigg] \\ & \quad \overset{(a)}{=} \mathbb{1}(d_{1}^{opt} \neq a_{1}) \,\, \mathbb{E} \left[ Y^{*}( 1-a_{1}, \underbrace{d_{2}^{opt}(X_{1}, 1-a_{1}, X^{*}_{2}(1-a_{1}))}_{=: d^{opt}_{2}(1-a_{1})}) - Y^{*}(a_{1}, \underbrace{d_{2}^{opt}(X_{1}, a_{1}, X^{*}_{2}(a_{1}))}_{=: d^{opt}_{2}(a_{1})}) \, \bigg| \, X_{1} \right] \\ & \quad = \mathbb{1}(d_{1}^{opt} \neq a_{1}) \,\, \left\{ \mathbb{E} \bigg[ Y^{*}( 1-a_{1}, d^{opt}_{2}(1-a_{1})) \, \bigg| \, X_{1} \bigg] - \mathbb{E} \bigg[ Y^{*}(a_{1}, d^{opt}_{2}(a_{1})) \, \bigg| \, X_{1} \bigg] \right\} \\ & \quad = \mathbb{1}(d_{1}^{opt} \neq a_{1}) \,\, \bigg\{ \mathbb{E} \left[ Y^{*}(1-a_{1}, d^{opt}_{2}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1} \right] \mathbb{P} \left(A_{1} = a_{1} \, | \, X_{1} \right) \\ & \hspace{3.2cm} + \mathbb{E} \left[ Y^{*}(1-a_{1}, d^{opt}_{2}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = 1-a_{1} \right] \mathbb{P} \left(A_{1} = 1-a_{1} \, | \, X_{1} \right) \\ & \hspace{3.2cm} - \mathbb{E} \left[ Y^{*}(a_{1}, d^{opt}_{2}(a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1} \right] \mathbb{P} \left(A_{1} = a_{1} \, | \, X_{1} \right) \\ & \hspace{3.2cm} - \mathbb{E} \left[ Y^{*}(a_{1}, d^{opt}_{2}(a_{1})) \, \bigg| \, X_{1}, A_{1} = 1 - a_{1} \right] \mathbb{P} \left(A_{1} = 1 - a_{1} \, | \, X_{1} \right) \bigg\} \\ & \quad = \mathbb{1}(d_{1}^{opt} \neq a_{1}) \,\, \bigg\{ \mathbb{E}_{X_{2}} \left[ \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2} \right] \, \bigg| \, X_{1}, A_{1} = a_{1} \right] \mathbb{P} \left(A_{1} = a_{1} \, | \, X_{1} \right) \\ & \hspace{3.2cm} + \mathbb{E}_{X_{2}} \left[ \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = 1 - a_{1}, X_{2} \right] \, \bigg| \, X_{1}, A_{1} = 1 - a_{1} \right] \mathbb{P} \left(A_{1} = 1 - a_{1} \, | \, X_{1} \right) \\ & \hspace{3.2cm} - \mathbb{E}_{X_{2}} \left[ \mathbb{E}_{Y} \left[ Y^{*}(a_{1}, d_{2}^{opt}(a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2} \right] \, \bigg| \, X_{1}, A_{1} = a_{1} \right] \mathbb{P} \left(A_{1} = a_{1} \, | \, X_{1} \right) \\ & \hspace{3.2cm} - \mathbb{E}_{X_{2}} \left[ \mathbb{E}_{Y} \left[ Y^{*}(a_{1}, d_{2}^{opt}(a_{1})) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2} \right] \, \bigg| \, X_{1}, A_{1} = 1-a_{1} \right] \mathbb{P} \left(A_{1} = 1-a_{1} \, | \, X_{1} \right) \bigg\}. \\ \end{align*}\]

where \((a)\) uses the fact that \(R_{1}=0\) if \(d_{1}^{opt}=a_{1}\), so that only the case \(d_{1}^{opt} \neq a_{1}\) (where \(1-a_{1}\) is optimal) contributes. Now we expand each inner conditional expectation further, conditioning on \(A_{2}\):

\[\begin{align*} & \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2} \right] \\ & \quad = \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = d_{2}^{opt}(a_{1}) \right] \underbrace{\mathbb{P} \left(A_{2} = d_{2}^{opt}(a_{1}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2} \right)}_{=: p_{1,a_{1}}} \\ & \quad + \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = 1-d_{2}^{opt}(a_{1}) \right] \mathbb{P} \left(A_{2} = 1- d_{2}^{opt}(a_{1}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2} \right) \\ & \quad = \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = d_{2}^{opt}(a_{1}) \right] p_{1,a_{1}} \\ & \quad + \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = 1-d_{2}^{opt}(a_{1}) \right] (1-p_{1,a_{1}}), \\ & \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2} \right] \\ & \quad = \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2}, A_{2} = d_{2}^{opt}(1-a_{1}) \right] \mathbb{P} \left(A_{2} = d_{2}^{opt}(1-a_{1}) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2} \right) \\ & \quad + \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2}, A_{2} = 1-d_{2}^{opt}(1-a_{1}) \right] \mathbb{P} \left(A_{2} = 1- d_{2}^{opt}(1-a_{1}) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2} \right) \\ & \quad = \mathbb{E}_{Y} \left[ Y \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2}, A_{2} = d_{2}^{opt}(1-a_{1}) \right] \underbrace{\mathbb{P} \left(A_{2} = d_{2}^{opt}(1-a_{1}) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2} \right)}_{=: p_{1,1-a_{1}}} \\ & \quad + \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2}, A_{2} = 1-d_{2}^{opt}(1-a_{1}) \right] \mathbb{P} \left(A_{2} = 1- d_{2}^{opt}(1-a_{1}) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2} \right) \\ & \quad = \mathbb{E}_{Y} \left[ Y \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2}, A_{2} = d_{2}^{opt}(1-a_{1}) \right] p_{1,1-a_{1}} \\ & \quad + \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2}, A_{2} = 1-d_{2}^{opt}(1-a_{1}) \right] (1-p_{1,1-a_{1}}), \\ & \mathbb{E}_{Y} \left[ Y^{*}(a_{1}, d_{2}^{opt}(a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2} \right] \\ & \quad = \mathbb{E}_{Y} \left[ Y^{*}(a_{1}, d_{2}^{opt}(a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = d_{2}^{opt}(a_{1}) \right] \mathbb{P} \left(A_{2} = d_{2}^{opt}(a_{1}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2} \right) \\ & \quad + \mathbb{E}_{Y} \left[ Y^{*}(a_{1}, d_{2}^{opt}(a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = 1- d_{2}^{opt}(a_{1}) \right] \mathbb{P} \left(A_{2} = 1- d_{2}^{opt}(a_{1}) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2} \right) \\ & \quad = \mathbb{E}_{Y} \left[ Y \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = d_{2}^{opt}(a_{1}) \right] p_{1,a_{1}} \\ & \quad + \mathbb{E}_{Y} \left[ Y^{*}(a_{1}, d_{2}^{opt}(a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = 1- d_{2}^{opt}(a_{1}) \right] (1-p_{1,a_{1}}), \\ & \mathbb{E}_{Y} \left[ Y^{*}(a_{1}, d_{2}^{opt}(a_{1})) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2} \right] \\ & \quad = \mathbb{E}_{Y} \left[ Y^{*}(a_{1}, d_{2}^{opt}(a_{1})) \, \bigg| \, X_{1}, A_{1} = 1- a_{1}, X_{2}, A_{2} = d_{2}^{opt}(1-a_{1}) \right] p_{1,1-a_{1}} \\ & \quad + \mathbb{E}_{Y} \left[ Y^{*}(a_{1}, d_{2}^{opt}(a_{1})) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2}, A_{2} = 1- d_{2}^{opt}(1-a_{1}) \right] (1-p_{1,1-a_{1}}). \end{align*}\]

Finally, with \(p_{1} := \mathbb{P} \left(A_{1} = a_{1} \, | \, X_{1} \right)\), we have

\[\begin{align*} & R_{1}(X_{1}, a_{1}) \\ & \quad = \mathbb{1}(d_{1}^{opt} \neq a_{1}) \,\, \times \\ & \hspace{1.2cm} \bigg\{ (1-p_{1}) \cdot \mathbb{E} \left[ p_{1,1-a_{1}} \cdot \mathbb{E} \left[ Y \bigg| X_{1}, A_{1} = 1-a_{1}, X_{2}, A_{2} = d_{2}^{opt}(1-a_{1}) \right] \bigg| X_{1}, A_{1} = 1-a_{1} \right] \\ & \hspace{2.0cm} - p_{1} \cdot \mathbb{E} \left[ p_{1,a_{1}} \cdot \mathbb{E} \left[ Y \bigg| X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = d_{2}^{opt}(a_{1}) \right] \bigg| X_{1}, A_{1} = a_{1} \right] \\ & \hspace{2cm} + p_{1} \cdot \mathbb{E} \left[ p_{1,a_{1}} \cdot \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = d_{2}^{opt}(a_{1}) \right] \bigg| X_{1}, A_{1} = a_{1} \right] \\ & \hspace{2cm} + p_{1} \cdot \mathbb{E} \left[ (1-p_{1,a_{1}}) \cdot \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = 1-d_{2}^{opt}(a_{1}) \right] \bigg| X_{1}, A_{1} = a_{1} \right] \\ & \hspace{2cm} - p_{1} \cdot \mathbb{E} \left[ (1-p_{1,a_{1}}) \cdot \mathbb{E}_{Y} \left[ Y^{*}(a_{1}, d_{2}^{opt}(a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = 1- d_{2}^{opt}(a_{1}) \right] \bigg| X_{1}, A_{1} = a_{1} \right] \\ & \hspace{1cm} + (1-p_{1}) \cdot \mathbb{E} \left[ (1-p_{1,1-a_{1}}) \cdot \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2}, A_{2} = 1-d_{2}^{opt}(1-a_{1}) \right] \bigg| X_{1}, A_{1} = 1-a_{1} \right] \\ & \hspace{1cm} - (1-p_{1}) \cdot \mathbb{E} \left[ p_{1,1-a_{1}} \cdot \mathbb{E}_{Y} \left[ Y^{*}(a_{1}, d_{2}^{opt}(a_{1})) \, \bigg| \, X_{1}, A_{1} = 1- a_{1}, X_{2}, A_{2} = d_{2}^{opt}(1-a_{1}) \right] \bigg| X_{1}, A_{1} = 1-a_{1} \right] \\ & \hspace{1cm} - (1-p_{1}) \cdot \mathbb{E} \left[ (1-p_{1,1-a_{1}}) \cdot \mathbb{E}_{Y} \left[ Y^{*}(a_{1}, d_{2}^{opt}(a_{1})) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2}, A_{2} = 1- d_{2}^{opt}(1-a_{1}) \right] \bigg| X_{1}, A_{1} = 1-a_{1} \right] \bigg\}. 
\\ & \quad = \mathbb{1}(d_{1}^{opt} \neq a_{1}) \,\, \times \\ & \hspace{1.2cm} \bigg\{ D + p_{1} \cdot \mathbb{E} \left[ p_{1,a_{1}} \cdot \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = d_{2}^{opt}(a_{1}) \right] \bigg| X_{1}, A_{1} = a_{1} \right] \\ & \hspace{2cm} + p_{1} \cdot \mathbb{E} \left[ (1-p_{1,a_{1}}) \cdot \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = 1-d_{2}^{opt}(a_{1}) \right] \bigg| X_{1}, A_{1} = a_{1} \right] \\ & \hspace{2cm} - p_{1} \cdot \mathbb{E} \left[ (1-p_{1,a_{1}}) \cdot \mathbb{E}_{Y} \left[ Y^{*}(a_{1}, d_{2}^{opt}(a_{1})) \, \bigg| \, X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = 1- d_{2}^{opt}(a_{1}) \right] \bigg| X_{1}, A_{1} = a_{1} \right] \\ & \hspace{1cm} + (1-p_{1}) \cdot \mathbb{E} \left[ (1-p_{1,1-a_{1}}) \cdot \mathbb{E}_{Y} \left[ Y^{*}(1-a_{1}, d_{2}^{opt}(1-a_{1})) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2}, A_{2} = 1-d_{2}^{opt}(1-a_{1}) \right] \bigg| X_{1}, A_{1} = 1-a_{1} \right] \\ & \hspace{1cm} - (1-p_{1}) \cdot \mathbb{E} \left[ p_{1,1-a_{1}} \cdot \mathbb{E}_{Y} \left[ Y^{*}(a_{1}, d_{2}^{opt}(a_{1})) \, \bigg| \, X_{1}, A_{1} = 1- a_{1}, X_{2}, A_{2} = d_{2}^{opt}(1-a_{1}) \right] \bigg| X_{1}, A_{1} = 1-a_{1} \right] \\ & \hspace{1cm} - (1-p_{1}) \cdot \mathbb{E} \left[ (1-p_{1,1-a_{1}}) \cdot \mathbb{E}_{Y} \left[ Y^{*}(a_{1}, d_{2}^{opt}(a_{1})) \, \bigg| \, X_{1}, A_{1} = 1-a_{1}, X_{2}, A_{2} = 1- d_{2}^{opt}(1-a_{1}) \right] \bigg| X_{1}, A_{1} = 1-a_{1} \right] \bigg\}. \\ \end{align*}\]

where \(D := (1-p_{1}) \cdot \mathbb{E} \left[ p_{1,1-a_{1}} \cdot \mathbb{E} \left[ Y \bigg| X_{1}, A_{1} = 1-a_{1}, X_{2}, A_{2} = d_{2}^{opt}(1-a_{1}) \right] \bigg| X_{1}, A_{1} = 1-a_{1} \right] - p_{1} \cdot \mathbb{E} \left[ p_{1,a_{1}} \cdot \mathbb{E} \left[ Y \bigg| X_{1}, A_{1} = a_{1}, X_{2}, A_{2} = d_{2}^{opt}(a_{1}) \right] \bigg| X_{1}, A_{1} = a_{1} \right]\).

Therefore, \(R_{1} \in \left[ \max(0 \,, \, D + (Y_{L} - Y_{U}) + p_{1} \cdot p_{1, a_{1}} \cdot Y_{U} - (1-p_{1}) \cdot p_{1, 1-a_{1}} \cdot Y_{L} ) \, , \, D + (Y_{U} - Y_{L}) + p_{1} \cdot p_{1, a_{1}} \cdot Y_{L} - (1-p_{1}) \cdot p_{1, 1-a_{1}} \cdot Y_{U} \right]\) is the possible range of the regret at time \(1\) in the presence of an unmeasured confounder.
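Analogously, a minimal R sketch of the first-stage bound, again treating \(D\), \(p_{1}\), \(p_{1,a_{1}}\) (p1_a1), and \(p_{1,1-a_{1}}\) (p1_na1) as already-estimated scalars (a hypothetical helper of ours):

regret1_bounds <- function(D, p1, p1_a1, p1_na1, Y_L, Y_U) {
  # Manski-type bounds on the first-stage regret R_1 for Y in [Y_L, Y_U]
  lower <- max(0, D + (Y_L - Y_U) + p1 * p1_a1 * Y_U - (1 - p1) * p1_na1 * Y_L)
  upper <- D + (Y_U - Y_L) + p1 * p1_a1 * Y_L - (1 - p1) * p1_na1 * Y_U
  c(lower = lower, upper = upper)
}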


Summary

In summary, we reviewed the paper Q- and A-learning Methods for Estimating Optimal Dynamic Treatment Regimes, focusing on Q-learning methods. The goal of the paper is to make a series of optimal treatment decisions, formalized as a dynamic treatment regime, that achieves the best outcome for an individual patient. To estimate the optimal regime, Q-learning uses a recursive algorithm: we first choose the final-stage treatment to maximize the expected potential final outcome, and then choose the treatment at each earlier stage assuming the subsequent optimal rules are followed, proceeding by backward induction up to the first stage. Q-learning is built on “Q-functions,” which measure the ‘quality’ associated with a treatment given the history, and estimation of the optimal treatment rule is accomplished via direct modeling of the Q-functions, for example by ordinary least squares.

Throughout the project, we identified the expected potential outcome as a statistical estimand, a function of conditional expectations, and estimated it via fitted Q-functions and inverse probability weighting. In the simulation study, the estimated treatment rule achieved about 96% prediction accuracy, yet the Q-learning approach gave poor estimates of the true parameters that drive the decision rules because the Q-functions were misspecified; the performance of Q-learning is thus sensitive to model specification.

One possible way to overcome the model specification issue is a more robust, nonparametric approach that avoids explicit outcome-model assumptions. More specifically, we may use distance measures between the empirical distribution of the covariates and the inverse-probability-weighted empirical distribution of the covariates within certain treatment groups; without further model assumptions, we can find a weighting function that directly minimizes the distance between them.

This project is expected to contribute to understanding dynamic treatment regimes in the Q-learning framework. Anyone interested in how to identify and estimate causal estimands in longitudinal settings may find this helpful.

