1 Introduction
Reinforcement learning has proved to be a powerful approach for online control of complicated systems [Bertsekas, 1995, Sutton and Barto, 2018]. Given an unknown transition system with unknown rewards, we aim to learn to control the system onthefly by exploring available actions and receiving realtime feedback. Learning to control efficiently requires the algorithm to actively explore the problem space and dynamically update the control policy.
A major challenge with effective exploration is how to generalize past experiences to unseen states. Extensive research has focused on reinforcement learning with parametric models, for examples linear quadratic control
[Dean et al., 2018], linear model for value function approximation [Parr et al., 2008], and state aggregation model [Singh et al., 1995]. While these parametric models would substantially reduce the complexity or regret of reinforcement learning, their practical performances are at risk of model misspecification.In this paper, we focus on finitehorizon deterministic control systems without any parametric model. We suppose that the control system is endowed with a metric dist that characterizes the proximity between state and action pairs. We assume that the transition and reward functions are continuous with respect to dist, i.e., states that are close to each other have similar values. Proximity measures of the stateactions have been extensively studied in the literature (see e.g. [Ferns et al., 2004, Ortner, 2007, Castro and Precup, 2010, Ferns et al., 2012a, Ferns et al., 2012b, Tang and van Breugel, 2016] and reference therein).
Under this very general assumption, we develop a surprisingly simple upperconfidence reinforcement learning algorithm. It adaptively updates the control policy through episodes of online learning. The algorithm keeps track of an experience buffer, as is common in practical deep reinforcement learning methods [Mnih et al., 2013]. After each episode, the algorithm recomputes the Qfunctions by using the updated experience buffer through a function approximation oracle. The function approximation oracle is required to find an upperconfidence Qfunction that fits the known data and optimistically estimate the value of unseen states and actions. We show that the oracle can be achieved using a nearest neighbor construction. The optimism nature of the algorithm would encourage exploration of unseen states and actions. We show that for arbitrary metric stateaction space, the algorithm achieves the sublinear regret
where is the diameter of the stateaction space, is some smoothness parameter, and is the doubling dimension of the stateaction space with respect to the metric. This regret is “sublinear” in the number of episodes played. Therefore the average number of mistakes decreases as more experiences are collected. When the stateactionspace is a smooth and compact manifold, its intrinsic dimension can be substantially smaller than the observed ambient dimension. We use an informationtheoretical approach to show that this regret is optimal.
The algorithm we propose is surprisingly general and easy to implement. It uses a function approximation oracle to find “optimistic” functions, which are later used to control the system in the next episode. By picking suitable function approximators, we can adapt our method and analysis to more structured classes of control systems. As an example, we show that the method can be adapted to the setting where the value functions are linear combinations of features. In this setting, we show the method achieves a stateofart regret upper bound , where
is the dimension of feature space. This regret is also known to be optimal. We believe our method can be adapted to work with a broader family of function approximators, including both the classical spline methods and deep neural networks. Understanding the regret for learning to control using these function classes is for future research.
2 Related Literatures
Complexity and regret for reinforcement learning on stochastic systems received significant attention. A basic setting is the Markov decision process (MDP), where the transition law at a given state
and actionis according to some probability distribution
. In the case of finitestateaction MDP without any structure knowledge, efficient reinforcement learning methods typically achieve regret that scale as , where is the number of discrete states and is the number of discrete actions, is number of time steps (see for examples [Jaksch et al., 2010, Agrawal and Jia, 2017, Azar et al., 2017, Osband and Van Roy, 2016]). The work of [Jaksch et al., 2010] provided a lower bound on the regret of for horizon MDP and also regret bounds for weakly communicating infinitehorizon average reward MDP. The number of sample transitions needed to learn an approximate policy has been considered by [Strehl et al., 2006, Lattimore and Hutter, 2014a, Lattimore and Hutter, 2014b, Dann and Brunskill, 2015, Szita and Szepesvári, 2010, Kakade et al., 2003]. The optimal sample complexity for finding an optimal policy is [Sidford et al., 2018].In the regime of continuousstate MDP, the complexity and regret of online reinforcement learning has been explored under structured assumptions. [Lattimore et al., 2013] studies the complexity when the true transition system belongs to a finite or compact hypothesis class, and shows sample policy that depends polynomially on the cadinality or covering number of the model class. Ortner and Ryabko [Ortner and Ryabko, 2012] develops a modelbased algorithm with a regret bound for an algorithm that applies to Lipschitz MDP problems with continuous state spaces and Holdercontinuous transition kernels in terms of the total variation divergence. Pazis and Parr [Pazis and Parr, 2013] considers MDP with continuous state spaces, under the assumption that Qfunctions are Lipschitzcontinuous and establishes the sample complexity bound that involves an approximate covering number. Ok et al. [Ok et al., 2018] studied structured MDP with a finite state space and a finite action space, including the special case of Liptschiz MDP, and provides various regret upper bounds.
Unfortunately, existing results for Lipchitz MDP do not apply to deterministic control systems with continuity in a metric space. In the preceding works on Lipschitz MDP, the transition kernel is assumed to be continuous in the sense that . However, this assumption almost never holds for deterministic systems. Due to the deterministic nature, the transition density is always a Dirac measure, which is discontinuous in the norm. In fact, we often have when regardless of how close and are.
In contrast to the vast literatures on reinforcement learning for MDP, online learning for deterministic control has been studied by few. Note that deterministic transition is far more common in applications like robotics and selfdriving cars. A closely related and significant result is [Wen and Van Roy, 2017], which studies the complexity and regret for online learning in episodic deterministic systems. Under the assumption that the optimal function belongs to a hypothesis class, it provides an optimistic constraint propagation method that achieves optimal regret. This result applies to many important special cases including the case of finitely many states, the cases of linear models and state aggregation and beyond. This result of [Wen and Van Roy, 2017] points out a significant observation that the complexity of learning to control depends on complexity of the functional class where the functions reside in. However, the algorithm provided by [Wen and Van Roy, 2017] is rather abstract (which is due to the generality of the method). In comparison to [Wen and Van Roy, 2017], our paper focuses on the setting where the only structural knowledge is the continuity with respect to a metric. In such a setting, [Wen and Van Roy, 2017] would imply an infinite regret as the Eulerdimension can be infinity in this case. We achieve a sublinear regret w.r.t. the number of episodes played. We show that it is optimal in this setting, and our algorithm is based on an upperconfidence function approximator which is easier to implement and generalize.
Upon finishing this paper, we are aware several recent papers working on similar settings independently. For instance, [Song and Sun, 2019] and [Zhu and Dunson, 2019] study the metric space reinforcement learning problem under stochastic reward and transition. Although we obtain a similar regret bound, our contribution is focused on the more general function approximators that capture a variate of settings, in which metric space is a special case. Another group [Wang and Du, 2019] is performing a comprehensive study on the linear value function setting. Our function approximators can be potentially combined with their results to obtain a better algorithm in the linear setting.
3 Problem Formulation
We review the basics of Markov decision problems and the notion of regret.
3.1 Deterministic MDP
Consider a deterministic finitehorizon Markov Decision Process (MDP) , where is an arbitrary set of states, is an arbitrary set of actions, is a deterministic transition function, is the horizon, and is a reward function. A policy is a function ( denotes the set of integers ). The optimal policy maximizes the cumulative reward in time steps, from any fixed initial state :
Given a policy , the value function is defined recursively as follows.
and
An optimal policy satisfies
In particular, the optimal value function satisfies the following Bellman equation
and
We also define the functions, , as,
where . We further denote for .
3.2 Episodic Reinforcement Learning and Regret
We focus on the online episodic reinforcement learning problem, in which the learning agent does not know or to begin with. The agent repetitively controls the system for episodes of time, where each episode starts from some initial state that does not depend on the history. We denote the total number of episodes played by the agent as .
Suppose that the learning agent is an algorithm (possibly randomized). It can observe all the state transitions and rewards generated by the system and adaptively pick the next action. We define its regret of this algorithm as
where the action is generated by the algorithm at time based on the entire past history, and is taken over the randomness of the algorithm . In words, the regret of measures the difference between the total rewards collected by the algorithm and that by the optimal policy after episodes.
4 The Basic Case of Finitely Many States and Actions
We provide Algorithm 1 for the case where the state space is a finite set of size and the action space is a finite set of size , without assuming any structural knowledge. Note although [Wen and Van Roy, 2017] has provided a regretoptimal algorithm for this setting, we provide a simpler algorithm based on upperconfidence bounds. Despite of the simplicity of this setting, we include the result to illustrate our idea, which might be of independent interest.
Algorithm 1 always maintains an upper bound of the optimal value function using the past experiences. The algorithm uses the value upper bound to plan the future actions. After each episode, the value upper bound is improved based on the newly obtained data. Since the exploration is based on the upper bound of the value function, it always encourages the exploration of unexplored actions. The value is improved in such a way that once a regret is paid, the algorithm is always able to gain some new information such that the same regret will not be paid again. The guarantee of the algorithm is presented in the following theorem.
Theorem 1.
After episodes, the above algorithm obtains a regret bound
The proof of the algorithm is presented in the appendix. Whenever a stateaction is visited for the first time, an instant regret is paid and the algorithm “gains confidence” by setting the confidence bound to zero. This can happen at most times. Theorem 1 matches the regret upper and lower bound in the finite case, which was proved in [Wen and Van Roy, 2017]. Note that our setting is slightly different from that of [Wen and Van Roy, 2017] (they assumed to be timedependent). For completeness, we include a rigorous regret lower bound proof for our setting. See Theorem 8 in the appendix.
5 Policy Exploration In Metric Space
Now we consider the more general case where the stateaction space
is arbitrarily large. For example, the state in a video game can be a rawpixel image, and the state of a robotic system can be a vector of positions, velocities and acceleration. In these problems the state space can be considered a smooth manifold in a highdimensional ambient space.
5.1 Metric and Continuity
The major challenge with reinforcement learning is to generalize past experiences to unseen states. For the sake of generality, we only assume that a proper notion of distance between states is given, which suggests that states that are closer to each other have similar values.
Suppose we have a metric^{1}^{1}1In fact, our analysis does not require the condition . Hence the metric space can be further relaxed to pseudometric space. over the stateaction space , i.e., , and dist satisfies the triangle inequality.
Assumption 1 (MDP in metric space with Lipschitz continuity).
Let the optimal actionvalue function be . Then there exist constants such that and , ,
(1) 
and
(2) 
We further denote for convenience.
5.2 Optimistic Function Approximation
To handle the curse of dimensionality of general state space, we will use a function approximator for computing optimistic Qfunction from experiences. The function approximator needs to satisfy the following conditions.
Assumption 2 (Function Approximation Oracle).
Let be a function. Let be a set of keyvalue pairs generated by function . Let be a parameter. Then there exists a function approximator, FuncApprox, which, on given , outputs a function that satisfies

is Lipschitz continuous;

;

^{2}^{2}2This condition can be further relaxed to , for some error parameter . In this case, the regret bound is linearly depending on ..
One way to achieve the conditions required by the function approximator is to use the nearest neighbor approach, given by
(3) 
where the distance regularization will overestimate the value at an unseen point using its near neighbors.
Lemma 2.
Proof.
More generally speaking, one can construct the function approximator by solving a regression problem. For example, suppose that is integrable with respect to a measure over the stateaction space. Then we can find it using
or for some arbitrarily small ,
where is the set of Lipschitz functions with infinity norm bounded by . This formulation is compatible with a broader family of function classes, where can be replaced by a parametric family that is sufficient to express the unknown
, including spline interpolation and deep neural networks.
5.3 RegretOptimal Algorithm in Metric Space
Next we provide a regretoptimal algorithm for the metric MDP. The algorithm does not need additional structural assumption other than the Lipschitz continuity of and . It is a combination of the UCBtype algorithm with a nearestneighbor search. It measures the confidence by coupling the Lipschitz constant with the distance of a newly observed state to its nearest observed state. The algorithm is formally presented in Algorithm 2.
The algorithm keeps an experience buffer that grows as new sample transitions are observed. It optimistically explores the policy space in online training using upperestimate of Qvalues. These Qvalues are computed recursively by using the function approximator according to the dynamic programming principle.
In particular, if the function approximation oracle is given by the nearest neighbor construction (3), Step 11 and Step 12 of the algorithm take the form of
(4) 
for all . In this case, the nearestneighbor function approximator will prioritize exploring ’s that are farther away from the seen ones. Note that the function approximators defined in (5.3) satisfy Assumption 2 (see Lemma 2).
Algorithm 2 provides a general framework for reinforcement learning with a function approximator in deterministic control systems. It can be adapted to work with a broad class of function approximators.
6 Regret Analysis
In this section, we prove the main results of this paper.
6.1 Main Results
For a metric space , we denote the net, , as a set such that
If is compact, we denote as the minimum size of an net for . We also denote a similar concept, the packing, , as a set such that
If is compact, we denote as the maximum size of an packing for . In general, and are of the same order. For a normed space (the metric is induced by a norm), we have .
Next we show that the regret till reaching optimality is upper bounded by a constant that is proportional to the size of the net. We will show later that the regret is lower bounded by a constant proportional to the size of the packing.
Theorem 3 (Regret till optimality).
Suppose we have an episodic deterministic MDP that satisfies Assumption 1. Let be a stateaction space with diameter , and be parameters specified in Assumption 1. Suppose we use the continuous function approximator defined in (5.3). Suppose the stateaction space admits an cover for any . Then after steps, Algorithm 2 obtains a regret bound
where .
Suppose is the doubling dimension of the stateaction space . The doubling dimension of a metric space is the smallest positive integer, , such that every ball can be covered by balls of half the radius. Then we can show the following regret bound.
Theorem 4 (Optimal Regret for Metric Space).
The regret bound is sublinear with in the number of steps and linear with respect to the smoothness constant and diameter .
About Doubling Dimension:
The regret depends on the doubling dimension . It is the intrinsic dimension of  often very small even though the observed state space has high dimensions. For example, the rawpixel images in a video games often belong to a smooth manifold and has small intrinsic dimension. Our Algorithm 2 uses the nearestneighbor function approximation. It can be thought of as learning the manifold state space at the same time when solving the dynamic program. It does not need any parametric model or feature map to capture the small intrinsic dimension.
6.2 Proofs of the Main Theorems
To prove the above theorems, we need several core lemmas. The following lemma shows that the approximated Qfunction (in Algorithm 2) is always an upper bound of the optimal Qfunction. Note that this lemma works for all function approximators that satisfies Assumption 2.
Lemma 5 (Optimism).
Proof.
By the properties of FuncApprox, we have
We prove the result by induction: when , we have
Suppose the relation holds for , then, we have
Thus we have
This completes the proof. ∎
Given a finite set , for any , we define the nearest neighbor operator NN and function as followings,
The next lemma shows that Algorithm 2 with function approximator (5.3) does not incur too much perstep error.
Proof.
Let
By definition of , we have
∎
Proof of Theorem 3.
From Lemma 4, we have
Denote the policy at episode as . We can rewrite the the regret as
Next we consider . We have
Moreover, we immediately have
Therefore,
We consider an net, , that covers . We now connect each to its nearest neighbor in . Denote
At episode , if for some and some there is , we call has been visited. Thus if , we can upper bound
On the other hand, if has not been visited, we upper bound the regret of the entire episode by . However, such case can only happen at most times as for the next episode, will become visited. Therefore,
as desired.
∎
We are now ready to prove Theorem 4.
Comments
There are no comments yet.