Persistency of Excitation, Nonlinear Function
Approximation, and Stochastic Contraction Analysis for
Learning in Model Reference Adaptive Control
A THESIS
SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL
OF THE UNIVERSITY OF MINNESOTA
BY
Tyler Lekang
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Andrew Lamperski, Advisor
August, 2023
© Tyler Lekang 2023
ALL RIGHTS RESERVED
Acknowledgements
First and foremost, I am grateful to my advisor and mathematics guru Prof. Andrew
Lamperski. None of this would be possible without your wisdom and encouragement.
We’ve come a long way, from studying robotics and bandits, to learning theory in
adaptive control. Through it all, Andy has always had confidence in my abilities and
was an inspiration in perseverance. I am grateful for the support provided through the
grant NSF CMMI-2122856.
There are many people in the Biomedical Engineering Department who played an
important role in my early PhD career. The IGERT Fellowship, headed at the time
by Prof. Bin He, was a tremendous source of opportunity, affording me the chance
to attend conferences, present results, collaborate, as well as receive funding. Andy
and I had projects with Prof. Tay Netoff and Dr. Logan Grado, among others, which
inspired our earlier work on Dueling Bandits. I have made many wonderful friends in
the department, such as tailgating and river floating extraordinaire John Basile and
his adventurous lab colleagues Gerardo Rodriguez and Inderbir Sondh. I also had the
privledge of befriending the biomdedical power couple, Drs. Jose´ Valdez and Rachel
Niu. There are many others, too numerous to list here.
I’m grateful to have known many very intelligent lab colleagues. Drs. Jianjun Yuan,
Bolei Di, Venkat Ram Subramanian have graduated, while Yuping Zheng is bright eyed
and bushy tailed, fast on her way to that milestone.
i
I was fortunate to have three internships while a PhD student: at 3M under David
Redinger (thanks to Jianjun!), at Honeywell under Hal Voges, and at Seagate under
Allan Luk. Thanks to Dr. Vivek Nagaraj for helping me get the Honeywell internship.
Last but certainly not least, I want to thank my family, without which, there is no
way I could have continued on the PhD path. My parents Theresa Lekang and Lane
& Karen Brettingen, as well as Jim & Sue Reid. My siblings, cousins, and extended
family as well. But most importantly, my wonderful, dear wife, Larissa Lekang, for her
constant foundation of encouragement, motivation, companionship, and love. We look
forward to the next chapter in our lives, with the arrival of our baby girl and all the joy
she will bring.
ii
Dedication
To my wonderful family in all its forms and branches, my terrific friends new and old,
and most of all, to my lovely wife Larissa, our furry companion Jake, and our sweet
baby girl Linnea who fills our lives with so much joy.
iii
Abstract
Machine learning has achieved unprecedented levels of success recently, in the areas
of language processing and modeling, image and video classification and generation, and
recommendation and dynamic pricing systems. The control of dynamic systems has also
benefited from these advancements in learning, particularly in the areas of reinforcement
learning for tasks such as robotic navigation and control of nuclear fusion processes.
We wish to study learning in another area where it can naturally be applied: adaptive
control systems. These systems must estimate and identify uncertainties in the plant in
order to apply their adaptive control laws. We study the areas of stochastic contraction
and convex projection, persistency of excitation, and function approximation, with an
eye towards this application.
The first part of the thesis is motivated by the problem of quantitatively bounding
the convergence of adaptive control methods for stochastic systems to a stationary dis-
tribution. Such bounds are useful for analyzing statistics of trajectories and determining
appropriate step sizes for simulations. To this end, we extend a methodology from (un-
constrained) stochastic differential equations (SDEs) which provides contractions in a
specially chosen Wasserstein distance. This theory focuses on unconstrained SDEs with
fairly restrictive assumptions on the drift terms. Typical adaptive control schemes place
constraints on the learned parameters and their update rules violate the drift conditions.
To this end, we extend the contraction theory to the case of constrained systems repre-
sented by reflected stochastic differential equations and generalize the allowable drifts.
We show how the general theory can be used to derive quantitative contraction bounds
on a nonlinear stochastic adaptive regulation problem.
iv
The second part of the thesis defines geometric criteria which are then used to estab-
lish sufficient conditions for persistency of excitation with vector functions constructed
from single hidden-layer neural networks with step or ReLU activation functions. We
show that these conditions hold when employing reference system tracking, as is com-
monly done in adaptive control. We demonstrate the results numerically on a system
with linearly parameterized activations of this type and show that the parameter esti-
mates converge to the true values with the sufficient conditions met.
The third part of the thesis studies function approximation. Classical results in
neural network approximation theory show how arbitrary continuous functions can be
approximated by networks with a single hidden layer, under mild assumptions on the
activation function. However, the classical theory does not give a constructive means to
generate the network parameters that achieve a desired accuracy. Recent results have
demonstrated that for specialized activation functions, such as ReLUs, high accuracy
can be achieved via linear combinations of randomly initialized activations. These re-
cent works utilize specialized integral representations of target functions that depend
on the specific activation functions used. This paper defines mollified integral repre-
sentations, which provide a means to form integral representations of target functions
using activations for which no direct integral representation is currently known. The
new construction enables approximation guarantees for randomly initialized networks
using any activation for which there exists an established base approximation which
may not be constructive. We extend the results to the supremum norm and show how
this enables application to an extended, approximate version of (linear) model reference
adaptive control.
v
Contents
Acknowledgements i
Dedication iii
Abstract iv
List of Figures x
1 Introduction 1
1.1 Why Learning in Model Reference Adaptive Control? . . . . . . . . . . 1
1.2 Challenges and State of the Art . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Stochastic Contraction and Convex Projection . . . . . . . . . . 3
1.2.2 Persistency of Excitation . . . . . . . . . . . . . . . . . . . . . . 4
1.2.3 Function Approximation . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Stochastic Contraction and Convex Projection 10
2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.1 General Inner Product and Norm . . . . . . . . . . . . . . . . . . 14
vi
2.1.2 The Convex Projection Operation . . . . . . . . . . . . . . . . . 14
2.1.3 The Normal Cone of a Closed Convex Set . . . . . . . . . . . . . 14
2.1.4 Relationship Between Convex Projection and Normal Cone . . . 15
2.1.5 Projection Coordinates . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.6 Reflection Processes . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.7 Partial Projection Using Partial Identity P Matrix . . . . . . . . 18
2.2 Contraction Analysis for Reflected Stochastic
Differential Equations over Closed Convex Domains . . . . . . . . . . . 19
2.2.1 Problem Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.2 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.3 Background on Wasserstein Distances . . . . . . . . . . . . . . . 21
2.2.4 Main Contraction Result . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Application to Linear MRAC . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Lyapunov-Based Parameter Update Rule . . . . . . . . . . . . . 27
2.3.3 Satisfying Assumptions on the Lyapunov Equation . . . . . . . . 30
2.3.4 Satisfying the One-Sided Growth Condition . . . . . . . . . . . . 31
2.4 Numerical Simulation of Linear MRAC Application . . . . . . . . . . . . 34
3 Persistency of Excitation with Step and ReLU Functions 42
3.1 Persistency of Excitation With Basis Functions of a Continuous State
Trajectory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Nonlinear, Positive Semidefinite Activation Functions . . . . . . . . . . 46
3.2.1 Positive Semidefinite Activation Geometry . . . . . . . . . . . . . 47
3.3 Sufficient Conditions for Persistency of Excitation with (Scaled) Step and
ReLU Activations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3.1 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
vii
3.3.2 Proof Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.3 Extension to Other Activation Functions . . . . . . . . . . . . . 52
3.4 Application to Linear MRAC . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.4.2 Persistency of Excitation . . . . . . . . . . . . . . . . . . . . . . 55
3.4.3 Numerical Simulation . . . . . . . . . . . . . . . . . . . . . . . . 56
4 Function Approximation 63
4.1 Function Approximation with Activation Functions . . . . . . . . . . . . 65
4.1.1 Base and Random Approximations . . . . . . . . . . . . . . . . . 66
4.2 Existing Results for Base Approximations . . . . . . . . . . . . . . . . . 66
4.2.1 Convex Combinations of Step and ReLU Activations . . . . . . . 66
4.3 Mollified Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.1 Functions with Integral Representation . . . . . . . . . . . . . . 69
4.3.2 Mollified Integral Representation . . . . . . . . . . . . . . . . . . 70
4.3.3 Main Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4 Random Approximations to Mollified Approximations . . . . . . . . . . 74
4.4.1 Expected Value of Random Approximations to Functions with
Integral Representations . . . . . . . . . . . . . . . . . . . . . . . 74
4.4.2 Setting Up McDiarmid’s Inequality . . . . . . . . . . . . . . . . . 75
4.4.3 Using McDiarmid’s Inequality . . . . . . . . . . . . . . . . . . . . 76
4.5 Random Approximations to Base Approximations . . . . . . . . . . . . 77
4.6 Extending the Error Bound to L8 for Lipschitz Functions on Convex
Compact Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.7 Application to Linear MRAC . . . . . . . . . . . . . . . . . . . . . . . . 80
4.7.1 Example on Euclidean Ball . . . . . . . . . . . . . . . . . . . . . 82
viii
5 Conclusion and Future Directions 85
References 87
Appendix A. 97
A.1 Proof of Theorem 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
A.2 Proof of Corollary 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
A.3 Useful Properties of Invertible Matrix G . . . . . . . . . . . . . . . . . . 109
Appendix B. 110
B.1 Proofs of Theorems 3.1 and 3.2 . . . . . . . . . . . . . . . . . . . . . . . 110
B.1.1 Proof Methodology and Setup . . . . . . . . . . . . . . . . . . . . 110
B.1.2 Proof of Theorem 3.1 . . . . . . . . . . . . . . . . . . . . . . . . 112
B.1.3 Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . . 113
B.1.4 Extension to Other Activation Functions . . . . . . . . . . . . . 116
B.2 Global Uniform Asymptotic Stability of (3.23) . . . . . . . . . . . . . . 116
Appendix C. 119
C.1 Useful Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
C.2 Proof of Theorem 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
C.3 Proof of Theorem 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
C.4 Proof of Theorem 4.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
C.5 Proof of Lemma 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
C.6 Proof of Lemma 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
C.7 Proof of Theorem 4.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
C.8 Proof of Theorem 4.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
C.9 Proof of Lemma 4.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
ix
List of Figures
2.1 Numerical simulation of similar LTI system and Ornstein-Uhlenbeck pro-
cess from the same initial conditions. . . . . . . . . . . . . . . . . . . . . 12
2.2 Contractivity of the law Pt from initial deterministic P0 (deltas) to the
Gaussian stationary distribution pi. . . . . . . . . . . . . . . . . . . . . . 12
2.3 Reflected Brownian Motion. The reflection process ψt constrains
2-dimensional Brownian Motion to remain within a box set X Ă R2
containing the origin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Convex Projection and Normal Cone. Here we see the relation-
ship between convex projection of exterior points (colored dots to black
dots) and vectors in the normal cone at those projected points (colored
outwards pointing vectors). . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 States x
p1q
t (blue) and x
p2q
t (orange), for the three simulation scenarios.
Coupling time τ is the vertical red line. . . . . . . . . . . . . . . . . . . 38
2.6 Euclidean norm of states }xp1qt }2 (magenta) and }xp2qt }2 (cyan), for the
three simulation scenarios. Coupling time τ is the vertical red line. . . . 39
2.7 Phase plots of the 2D polygons used to constrain each row of pKt (left side)
and pΩt (right side), with blue for pΘp1qt and orange for pΘp2qt . Cyan dots
show true parameter values. Red and Green dots show final parameter
estimates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
x
2.8 Euclidean norm of parameter error } pΘp1qt ´Θ}2 (magenta) and } pΘp2qt ´Θ}2
(cyan), for the three simulation scenarios. Coupling time τ is the vertical
red line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1 State space x P R2 showing N “ 4 hyper planes WJx ` b “ 04. There
are 11 feasible activation regions Aj (color dots) and 5 infeasible regions
(no dots). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2 Phase plot of xt (cyan), tracking into x
r
t (magenta) driven by r
p1q
t , and
clearly crossing all N hyperplanes at nondegenerate borders along the
limit cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3 Phase plot of xt (cyan), tracking into x
r
t (magenta) driven by r
p2q
t , but
does not cross all N hyperplanes along the limit cycle. . . . . . . . . . . 61
3.4 Norm of state tracking error }et}2 converges to zero in both cases, while
the parameter estimation error }pΘt ´ Θ}2 converges to zero for the case
with r
p1q
t and but not for the case with r
p2q
t . . . . . . . . . . . . . . . . . 62
4.1 The mollified delta δˆλ for different λ values. In each case,
ş1{λ
´1{λ δˆλpγqdγ “ 1. 71
4.2 The base approximation pfN is transformed into the mollified approxima-
tion pfλ,N by replacing Dirac deltas δ with the mollified deltas δˆλ. . . . . 72
C.1 The intersection KX Bpx´ hd rL¯ in green, for possible convex geometrical
features on the boundary of K where the center px can lie. . . . . . . . . 144
C.2 The largest ball which always lower bounds Sd, centered at d with radius
d sinpϑq. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
xi
Chapter 1
Introduction
1.1 Why Learning in Model Reference Adaptive Control?
Machine learning is a wide-ranging umbrella topic that encompasses many different
methods and models, which can be used to bestow onto a machine or system the ability
to complete various tasks without specific experience being hard coded or otherwise
already possessed by the system [50]. In particular, deep neural network based ap-
proaches have achieved unprecedented levels of performance and utility in areas such
as language processing [58], medical computer vision [21], and reinforcement learning
[16]. Bandit model [13, 43] approaches are another method with various applications in
clinical trials, recommendation systems, and dynamic pricing [10].
Adaptive control lies at the intersection of machine learning methods and the control
of dynamics systems [23], and has a rich history in the controls literature [41, 26, 28, 35].
It has wide applications in areas such as robotics [46], aerospace systems [41], and elec-
tromechanical systems [5]. The typical approach utilizes Lyapunov-based design to
update parameters while guaranteeing stability. In particular, the setup of (linear)
model reference adaptive control (MRAC) [41, 74, 71] lends itself well to focused study
1
2of the learning components of the overall adaptive mechanisms, with the goal of enhanc-
ing the already well understood aspects of state error tracking stability. Improvement
in learning within this application corresponds to improvement in the ability of the
adaptive system to estimate uncertain parameters, identify the system, and cancel out
detrimental effects of additive nonlinearities inherent in the plant.
1.2 Challenges and State of the Art
The fundamental challenges that both machine learning and adaptive control methods
attempt to address ultimately revolve around learning (approximating) an uncertain
function. Many recent state of the art models employ deep neural networks, often with
particular structure (such as transformers [76] or convolutional layers [37]). At their
heart, these deep models are highly-parameterized approximators of complex, nonlinear
functions of interest that govern the intrinsic nature of various tasks.
An important aspect of addressing this challenge in the context of MRAC adaptive
control is the concept of persistency of excitation. We show that the adaptive control law
of the linear MRAC setup in [41] has parameter estimation error which can be formed as
a linear time-varying system, and thus will be globally uniformly asymptotically stable
if persistency of excitation is met. This is a well known and defined condition [2], but
relatively few methods exist that provide constructive recipes for achieving it.
These challenges can be addressed in both a deterministic and stochastic context.
In [44], we took the first step within the stochastic context. There we also addressed
the challenge of projection, in which the parameter estimates of the MRAC system
are desired to stay within some compact convex set, without disrupting the state error
tracking stability. Currently existing methods for this [41, 28, 26, 35] tend to rely on
specialized operators for specific classes of convex sets. In [45], we switched to the
deterministic setting and provided novel sufficient conditions for achieving persistency
3of excitation in the context of the linear MRAC setup in [41] where the nonlinear
vector function is a single hidden-layer neural network with known parameterization
but uncertain coefficients.
1.2.1 Stochastic Contraction and Convex Projection
In recent years, there has been a drive to connect adaptive control methods with tech-
niques from reinforcement learning [78, 33, 79, 8]. In parallel, methods from reinforce-
ment learning have seen works giving precise optimality guaranatees [68, 72, 24]. These
works rely on precise convergence bounds that are fairly straightforward for linear sys-
tems, but substantially more complex in stochastic nonlinear systems.
Convergence of stochastic nonlinear systems is a vast area with numerous approaches,
e.g. [54, 62, 6, 9]. In order to derive convergence guarantees analogous to those available
in linear systems, explicit quantitative bounds are required. In [44], our desire was to
derive quantitative convergence guarantees for stochastic adaptive control methods. To
this end, we build upon methodologies at the intersection of stochastic differential equa-
tions and optimal transport [20, 18]. However, the existing methods in this area are too
restrictive to be applied directly to common adaptive control schemes. In particular,
these focus unconstrained processes with strong Lipschitz-like conditions on the drift
term. However, in adaptive control, the parameters are typically constrained and their
update rules often contain quadratic terms that violate the drift conditions.
Our primary contribution to stochastic convergence theory is an extension of the
methodology from [20] to constrained processes with less restrictive drift conditions. We
derive an explicit exponential contraction bound under a specially constructed Wasser-
stein distance. The result implies exponential convergence to a unique stationary dis-
tribution under a variety of measures, including total variation distance and Euclidean
Wasserstein distances. We then show how this result can be used to prove exponential
4convergence in a stochastic adaptive regulation problem. Additionally, we show how a
projection method based on reflected stochastic differential equations can be used to
constrain the parameters to an arbitrary closed convex set. This provides a flexible
alternative to handling constraints, which contrasts with more specialized projection
operators commonly employed in adaptive control [41, 26, 35].
1.2.2 Persistency of Excitation
Persistency of excitation is a fundamental concept employed within contexts and appli-
cations related to parameter learning, such as system identification and adaptive control.
It is often discussed, or at least mentioned, in adaptive control textbooks, such as in
[71, 74, 41]. It was proven in [55, 2] that persistency of excitation is necessary and suf-
ficient for the global uniform asymptotic stability of certain linear time-varying (LTV)
systems, which can be shown to hold for the parameter estimation error dynamics that
stem from the adaptive parameter update laws in linear MRAC.
There have been many works since [55, 2] which utilize an assumption of persistency
of excitation in order to achieve results in parameter learning. In [1], a simple parameter
learning scheme can be employed for a general class of nonlinear systems which have
some kind of working (nonlinear) feedback controller. An integral condition similar to
persistency of excitation is assumed to be satisfiable. This inspired the work in [69],
which assumes a similar integral condition in order to identify, and then provide MRAC
control for, an unknown MIMO LTI system. And in [14, 41] we see additional examples
of MRAC control which assume persistency of excitation in order to achieve parameter
convergence, while in [3], persistently exciting assumptions are made in reinforcement
learning applications. In [80], a sufficient condition for windows of observed behavior
of an LTI system (in discrete time) to span the space of possible windows, is for a
component signal (like the input) to be persistently exciting. In [56], conditions for
5neural networks excitation are given to guarantee bounds on the function estimate error.
Lastly, in [60] it is proven that for a general class of nonlinear systems which are feedback
linearizable (see [53, 35]), global uniform asymptotic stability can be achieved for linearly
paramatrized vector functions meeting relaxed persistency of excitation conditions.
However, in these and other works which utilize persistency of excitation assump-
tions, there is often no explicit sufficient conditions provided for how to ensure that
persistency of excitation is satisfied. On the other hand, there have been some works
which do provide these sufficient conditions. A classic result is that the state of an LTI
system satisfies persistence of excitation if the (stationary) input to that system contains
sufficient frequency content (“sufficient richness”, see [11]). In [52, 36] this is extended
to certain linear time-varying systems, while in [47] frequency arguments for sufficient
conditions for excitation are then extended to nonlinear systems in parametric-strict-
feedback form and in [34] to the context of adaptive dynamic programming, in which
optimal control value functions are approximated using polynomial basis functions. In
[59], a rank condition is proven sufficient and necessary for the state of time-invariant
systems to be persistently exciting, and in [70], strong Lyapunov functions are provided
that are equivalent to the persistency of excitation condition. Finally, and closest in
spirit of our work in [45], in [38] a sufficient condition, based on geometric criteria, is
given for satisfying persistency of excitation with vector functions composed of radial
basis functions.
1.2.3 Function Approximation
A key challenge in function approximation work is that many theorems describe the ex-
istence of approximations that are linear combinations of some activation function (such
as a sigmoid), but are not constructive. In [64], many of these results are reviewed, in-
cluding classic results on the universal approximation properties of continuous sigmoidal
6activations [15, 25, 22], as well as another classic result by [7] for arbitrary sigmoidal
activations, which was expanded to arbitrary hinge functions by [12].
Random approximation provides a key way to avoid the non-constructiveness by
randomly sampling the internal parameters of the activations from a known set, if the
target function of interest has an integral representation. However, determining these
representations for general classes of target functions and activations is a nontrivial task.
[29] gives a constructive method, but only for target and activation functions in L1. In
[31] and [61], they propose constructive methods for a class of target functions with
unit step and ReLU activations respectively. In [27], functions are approximated using
trigonometric polynomial ridge functions, which can then be shown in expectation to
be equivalent to randomly initialized ReLU activations.
There are several interesting and important results in the literature having to do with
neural networks with random (typically Gaussian) parameter initialization, sometimes
as a consequence of using randomly initialized gradient descent for training the network.
A classic result by [57] shows that the output of a single hidden-layer network with
Gaussian randomly initialized parameters goes to a Gaussian Process as the width goes
to infinity. Similar results are then achieved in [42] for deep fully-connected networks
as all hidden layer widths go to infinity. Also for deep fully-connected networks, in [30]
the authors define the Neural Tangent Kernel and propose that its limit, as the hidden
layer widths go to infinity, can be used to study the timestep evolution and dynamics
of the parameters, and the corresponding network output function, in gradient descent.
In [66], the authors show that single hidden-layer networks cannot achieve the same
rates of increase in a measure of curvature produced by the network output, as deep
networks can, with parameters Gaussian randomly initialized and bounded activation
functions. In [4], the authors show that single hidden-layer networks of a sufficient
width can use Gaussian randomly initialized gradient descent on values of a target
7function, and achieve guaranteed generalization to the entire function. And in [17], the
authors show that deep fully-connected networks, where each hidden layer width meets
a sufficient size, can be trained with Gaussian randomly initialized gradient descent and
be guaranteed to reach the global minimum at a linear rate.
1.3 Contributions
We summarize our main contributions as follows:
• We develop a novel extension of the framework in [18, 20] for establishing ex-
ponential stochastic contraction using specialized Wasserstein distances, which is
detailed in Section 2.2. The main technical result is Theorem 2.1, which also pro-
vides for Corollary 2.1. We analyze a modified, stochastic version of linear MRAC
to show that the theory can be applied in Section 2.3, and perform numerical
simulation.
• We develop novel sufficient conditions for persistency of excitation for (nonlinear)
vector functions of continuous, bounded state trajectories when they are composed
of activation functions (eg. step and ReLU) together with affine transformations
of the state space, which is detailed in Sections 3.1 and 3.2. The main technical
results are Theorems 3.1 and 3.2. We analyze a simplified version of linear MRAC
to show that the theory can be applied in Section 3.4, and perform numerical
simulation.
• We develop a novel method for bridging convex combinations of activation func-
tions (eg. step and ReLU) with unknown parameters to approximations using
randomly initialized parameters by using a mollified integral representation, which
is detailed in Section 4.3. By combining existing results discussed in Section 4.2
8with the general setup in Section 4.4, we obtain the main technical result Theo-
rem 4.5. We extend this result to the supremum norm in Section 4.6, and analyze
linear MRAC to show that the theory can be applied in Section 4.7.
1.4 Thesis Outline
• Chapter 2 focuses on Stochastic Contraction and Convex Projection.
• Chapter 3 focuses on Persistency of Excitation.
• Chapter 4 focuses on Function Approximation.
• Chapter 5 provides conclusions and future directions for these lines of work.
1.5 Notation
Here, we will describe notation that is consistent throughout the thesis. Any deviations
or additional notation will be specified within individual chapters.
We use R,C,N to denote the real, complex, and natural numbers. I denotes the
indicator function, while E denotes the expected value. We use B, C,K,X to denote
bounded, compact, compact convex, and closed convex subsets of Rn. The integer set
t1, . . . , ku is denoted by rks. We use coH to denote the closure of the convex hull of a
set H. Bxprq Ă Rn denotes the radius r Euclidean ball centered at x P Rn.
For any n,m P N, we interpret x, y P Rn as length n column vectors and their
transposes xJ, yJ as length n row vectors. }x}2 “
?
xJx will always denote the standard
Euclidean vector norm and inner product, with xJy the dot product of two equal length
vectors. 0n, 1n denote the length n vector of zeros and ones respectively. The notation
that A is a n ˆ m matrix assumes real entries, thus A P Rnˆm and its transpose
AJ P Rmˆn. If A is a square matrix its trace is denoted by trpAq. In denotes the nˆ n
9identity matrix. A ă B (ĺ) denotes that the matrix B ´ A is positive (semi)definite.
And so, aIn ĺ A ĺ bIn denotes that eigpAq P ra, bs.
Index subscripts on vectors and matrices denote the row index, for example AJi is
the ith row of AJ. The Frobenius matrix norm is denoted }A}F . We denote the product
of matrix A P Rnˆm with vector x P Rm as Ax, which is a length n (column) vector
where the ith (row) element is the inner product pAxqi. If A P Rnˆm and x P Rn, then
the matrix product AJx results in a length m vector where we denote the ith element
pAJxqi equivalently as the inner product AJi x.
We denote time indices using subscripts, for all (scalar, vector, or matrix valued)
functions of time. For example, xt or At, instead of xptq or Aptq. Random variables are
denoted in bold, eg. x, and so stochastic processes also have a time subscript, eg. xt.
The ith row of a time-varying vector or matrix is denoted with t, i subscript.
For a function f : Rn Ñ R, an affine shift qf is defined as qfpxq “ fpxq ´ aJx´ b for
some a P Rn and b P R, typically with a “ ∇fp0q and b “ fp0q.
Chapter 2
Stochastic Contraction and
Convex Projection
In this chapter, we will analyze what happens to the general setup for linear MRAC
from Chapter 9 of [41] when two important modifications are applied:
1. stochastic terms are added to the dynamics, which thus makes the system itself
stochastic, such that at every time t ě 0 the state xt is a stochastic process with
a probability distribution (law) Pt that evolves with time, and
2. reflection terms are added to the dynamics, which constrain the state of the system
to be within any closed convex set X Ă Rn.
First, consider the standard LTI control system which is described by the linear
differential equation
9xt “ Axt `B ut (2.1)
where xt P Rn is the system state, A is an n ˆ n matrix, ut P R` is the control input
and B is an nˆ ` matrix. We know from classic results in LTI controls theory that the
10
11
solution to this equation is the state trajectory
xt “ eAtx0 `
ż t
0
eApt´sqBusds
where x0 P Rn is the initial state.
And so, if we use an input ut which is stochastic, then the solution (state trajec-
tory) xt will also necessarily be stochastic. Let us then input an n-dimension Brownian
motion wt. This is a Wiener process satisfying that w0 “ 0n, its increments are multi-
variate zero-mean Gaussian pwt`u ´wtq „ Np0n, uInq at all t ě 0, for any u ě t, and
independent of past values ws for all 0 ď s ă t, and its sample paths are continuous in
time. We express this using a differential form of (2.1) as
dxt “ Axtdt`G dwt (2.2)
where G is an nˆn matrix. This form is used because the time-derivative of Brownian
Motion dwtdt does not exist anywhere. (And yet, wt is continuous!)
It is no surprise then, that the solution to (2.2) is the stochastic process xt, also
known as an Ornstein–Uhlenbeck process, which satisfies
xt “ eAtx0 `
ż t
0
eApt´sqGdws ùñ Erxts “ eAtErx0s .
If the eigenvalues of A are not zero, then a stationary distribution exists for this process
which is pi “ Np0n,Σq, where the covariance matrix satisfies the Lyapunov equation
AΣ`ΣA “ ´GGJ. If we also require G to be invertible, then GGJ is positive-definite.
Therefore in both cases, if A is Hurwitz then we have:
1. the LTI system (2.1) is exponentially stable, with xt Ñ 0n exponentially fast from
initial condition x0, and
2. the Ornstein-Uhlenbeck process (2.2) is exponentially stochastically stable, with
the law xt „ Pt contracting to the stationary pi exponentially fast from the initial
P0 and thus Erxts Ñ 0n exponentially fast.
12
We illustrate these concepts with the following figures. Figure 2.1 shows numerical
simulation of both (2.1) and (2.2) using the same A “ ´In and B “ G “ In with
` “ n “ 10 and x0 “ x0 linearly spaced between r´10, 10s, over times t P r0, 10s sec.
Figure 2.2 illustrates contractivity of the law xt „ Pt over time from deterministic P0.
0 2 4 6 8 10
−10.0
−7.5
−5.0
−2.5
0.0
2.5
5.0
7.5
10.0
Figure 2.1: Numerical simulation of similar LTI system and Ornstein-Uhlenbeck process
from the same initial conditions.
Figure 2.2: Contractivity of the law Pt from initial deterministic P0 (deltas) to the
Gaussian stationary distribution pi.
13
Next, we consider additive reflection terms, which can be either deterministic ψt or
stochastic processes ψt, and constrain the solutions xt to (2.1) and xt to (2.2) within a
closed convex subset X Ă Rn. For example, consider the Brownian motion dwt in n “ 2
dimensions by itself, together with the additive reflection term ψt which constrains to
X Ă R2 being a box containing the origin. The differential form is integrated to give
the solution %t “
şt
0 dwt `ψt, and a numerical simulation is shown in Figure 2.3.
Figure 2.3: Reflected Brownian Motion. The reflection process ψt constrains 2-
dimensional Brownian Motion to remain within a box set X Ă R2 containing the origin.
This chapter is adapted from the published work [44] and we summarize the orga-
nization of the sections, as follows:
1. Section 2.1 provides background information for understanding the main results,
2. Section 2.2 sets up and provides the main result for exponential contraction of the
law Pt in a specially constructed Wasserstein distance,
3. Section 2.3 explains how the application to linear MRAC meets the requirements
of the results from the previous section, and
4. Section 2.4 provides a numerical simulation of this application.
14
2.1 Background
This section gives a basic overview of important concepts like convex projection and
reflection processes, and provides important definitions that will be used throughout
the chapter (and the associated appendix) related to these concepts.
2.1.1 General Inner Product and Norm
We use x¨, ¨y and } ¨ } without subscript to denote a general inner product and norm
for x P Rn having the quadratic form }x}2 “ xx, xy “ xJPx, using some symmetric,
positive-definite nˆ n matrix P . The Euclidean (P “ In) inner product and norm are
denoted } ¨ }22 “ xx, xy2 “ xJx, and all other norms are explicitly mentioned in context.
2.1.2 The Convex Projection Operation
The convex projection of any exterior point y P RnzX onto the set X is defined as
ΠX pyq :“ arg min
xPX }x´ y}
2 . (2.3)
In words, the convex projection is the closest point on the boundary of X to the
exterior point y, with distance in the sense of the general norm. Simple geometries,
such as Euclidean balls and boxes, have analytic formulas for the convex projection.
More generally, for any closed convex set, the convex projection can be obtained via
constrained linear optimization methods.
2.1.3 The Normal Cone of a Closed Convex Set
Let X be a closed convex set in Rn. Then at any y P Rn, the normal cone of X is the
subset of vectors in Rn defined as
NX pyq :“
$’’&’’%
H if y R X
tv P Rn | xx´ y, vy ď 0 @x P X u if y P X
, (2.4)
15
where xx´ y, vy “ px´ yqJPv is the general inner product.
In words, the normal cone at some y P X is the set of all vectors in Rn satisfying
that the general inner products with all vectors pointing from that y to all x P X are
non-positive. It is defined to be empty at all points outside the set X , and at all interior
points of X the only such vector satisfying this criteria is the zero vector 0n.
However, at boundary points of X the normal cone is nontrivial. For points on
curved or flat boundary surfaces, it contains all non-negative scalings of the vector
in the orthogonal direction (with respect to P ) to the boundary at that point. And
at vertex points, it contains all non-negative scalings of all such orthogonal directions
(again, with respect to P ), which forms a cone.
2.1.4 Relationship Between Convex Projection and Normal Cone
By inspecting (2.3) and (2.4), we can see the close relationship between the convex
projection onto X and the normal cone of X . That is, if we denote x˚ “ ΠX pyq as the
projection onto X for some exterior point y P RnzX , then it must hold that there are
v P NX px˚q extending outward from x˚ which pass through the point y. See Fig. 2.4
for an example with X Ă R2 as a compact polygon and P “ I2 (so that orthogonal
directions to flat boundary surfaces are at right angles).
16
Figure 2.4: Convex Projection and Normal Cone. Here we see the relationship
between convex projection of exterior points (colored dots to black dots) and vectors in
the normal cone at those projected points (colored outwards pointing vectors).
2.1.5 Projection Coordinates
When P ‰ In the convex projection (2.3) may not project the exterior point y to the
nearest boundary point x P X , in the sense of Euclidean distance. This can be mitigated
by using different coordinates for the projection. For all v, x, y P Rn, let qv, qx, qy be the
linear transformations according to the matrix P´ 12 (eg. qx “ P´ 12x) and similarly let
the transformed set be qX “ P´ 12X “ tP´ 12x | x P X u. Then for any fixed y P RnzX
and corresponding qy, we have
minqxP qX }qx´ qy}2 “ minqxP qX pqx´ qyqJP pqx´ qyq “ minpxPX ppx´ yqJP´ 12PP´ 12 ppx´ yq “ minpxPX }px´ y}22
where px “ P 12 qx P P 12 qX “ X and the matrices P 12 and P´ 12 are also symmetric, positive-
definite and can always be constructed from the real, positive singular values of P by
using their (inverse) square roots. This gives the Euclidean minimizer as px˚ “ P 12 qx˚.
17
2.1.6 Reflection Processes
Here we will describe some basic concepts about reflection processes ψt, which are used
to constrain the solution of a reflected (stochastic) differential equation within a set by
reflecting it at the boundary. For details, see Appendix F of [39], as well as [75] and
[73].
Let X Ă Rn be some closed convex set and let % : r0,8q Ñ Rn be a piecewise-
continuous function of time with %0 P X . Then the piecewise-continuous functions of
time x : r0,8q Ñ Rn and ψ : r0,8q Ñ Rn are said to solve the Skorokhod problem if
the following hold for all T ą 0:
1. xt “ %t ` ψt, with x0 “ %0 P X , remains within X for all t P r0, T q
2. ψt has the form ψt “ ´
şt
0 vsdµpsq, where vt P NX pxtq such that }vt}2 P t0, 1u, and
the non-negative measure µ is finite µpr0, tqq ă 8, for all t P r0, T q.
Note that NX pxtq “ t0nu if xt is in the interior of X . And so, ψt only changes value at
times when xt is at the boundary of X .
Similarly, if %t is a solution to an unconstrained stochastic differential equation,
then ψt “ ´
şt
0 vsdµpsq, with vt and µptq meeting the conditions above, is a stochastic
reflection process with bounded variation, enforcing that xt “ %t`ψt, with x0 “ %0 P X ,
remains within X for all t ě 0.
Thus, xt is a solution to the reflected stochastic differential equation (RSDE)
dxt “ d%t ´ vtdµptq .
Importantly, it therefore follows from the definition of the normal cone (2.4) and µptq
being a non-negative measure, that the following hold for all t ě 0 and y P X :
´xxt ´ y,vtdµptqy “ ´pxt ´ yqJP vtdµptq ď 0
xy ´ xt,vtdµptqy “ py ´ xtqJP vtdµptq ď 0 .
18
These facts will be used in multiple places throughout the chapter (and the associated
appendix), to show that the reflection processes have negligible effect on Lyapunov and
martingale analysis.
Finally, we note that the solution of the above RSDE, obtained by integrating the
differential form as
xt “ x0 `
ż t
0
d%t `ψt ,
can be simulated numerically (see [75, 73] for details) via a projected Euler method with
the convex projection ΠX . Thus, the effect of ψt on %t is the continuous time limit of
the convex projection operation ΠX back onto X .
2.1.7 Partial Projection Using Partial Identity P Matrix
Let us define that xt, y P Rn`m with xJt “ rxJt,n xJt,ms and yJ “ ryJn yJms, and let it be
that we only desire to project the xt,m part of the overall solution xt, leaving the xt,n
part unconstrained. In such a case, we can define the closed convex set X :“ Rn ˆ Xm
using the closed convex set Xm Ă Rm, and
P “
»–Pn 0
0 Pm
fifl and vt dµptq “
»– 0n
vmt dµ
mptq
fifl
where Pn and Pm are any symmetric, positive-definite matrices with respective n ˆ n
and m ˆm dimension, and vmt P NXmpxt,mq is defined with respect to Pm. Thus, the
general inner product is xxt, yy “ xJt Py “ xJt,nPnyn ` xJt,mPmym, which gives
´xxt ´ y,vtdµptqy “ 0´ pxt,m ´ ymqJPm vmt dµmptq ď 0
xy ´ xt,vtdµptqy “ 0` pym ´ xt,mqJPm vmt dµmptq ď 0
for all xt, y P X “ Rn ˆ Xm. We use this type of implementation for the application
presented in Section 2.3, with Xm “ K as a compact convex set and Pm “ Im which
avoids any need for projection coordinate transformations.
19
2.2 Contraction Analysis for Reflected Stochastic
Differential Equations over Closed Convex Domains
This section gives a general contraction result for reflected stochastic differential equa-
tions (RSDEs) over closed convex domains. The results in this section build upon the
contraction theory of [19], with substantial novel work required to generalize the con-
ditions in that work. This was required in order for the adaptive control application
studied in Section 2.3 to be able to satisfy those generalized conditions.
We consider RSDEs in order to handle the constraints that can be applied to pa-
rameter update rules in adaptive control applications, while in [19] the authors focus
on unconstrained SDEs.
2.2.1 Problem Setup
Let X be a closed convex subset of Rn. We will examine contractivity properties of
continuous-time Markov processes with continuous sample paths xt P X , called diffusion
processes, which are solutions (trajectories) of reflected SDEs having the form
dxt “ Hpxtqdt`Gdwt ´ vtdµptq , (2.5)
where H : Rn Ñ Rn is a time-invariant vector function (the drift coefficient), G is a
constant, invertible nˆn matrix (the diffusion coefficient), wt is an n-dimensional Brow-
nian motion (Wiener process), and ψt “ ´
şt
0 vsdµpsq is a bounded variation reflection
process enforcing that xt P X for all t ě 0 from initial condition x0 P X .
When H is Lipschitz, it can be shown that ψt is the unique bounded variation
reflection process such that vt P NX pxtq with }vt}2 P t0, 1u, and µ is a random measure
satisfying µpr0, tsq ă 8, for all t ě 0. Then the solution (trajectory) satisfies xt P X
for all t ě 0. See [75, 49]. The Lipschitz condition can be relaxed to H being locally
Lipschitz, provided that the process is nonexplosive. See Section 2.4 of [63].
20
Note that instead of the usual 9xt “ dxtdt notation often seen in controls theory with
deterministic differential equations, (2.5) instead uses the differential notation of SDEs.
With this notation, we directly integrate over time to obtain the solution (trajectory)
xt “ x0 `
ż t
0
Hpxsqds`G
ż t
0
dws ´ψt .
Such solutions can be simulated numerically via a projected Euler method with time
step dt and independent Brownian increments wt`dt ´wt as
xt`dt « ΠX
`
xt ` dt Hpxtq `Gpwt`dt ´wtq
˘
.
2.2.2 Assumptions
The first requirement is that H satisfy a one-sided growth condition. We assume that
there is a function κ : p0,8q Ñ r0,8q satisfying ş10 rκprqdr ă 8 and a non-negative
number α, such that for all x, y P X , with r :“ }x´ y}, the following bound holds:
xx´ y,Hpxq ´Hpyqy ď κprqr2 ` α rp}x} ` }y}q . (2.6)
This one-sided growth condition generalizes the one-sided Lipschitz condition from [19],
which corresponds to the special case with α “ 0. The extra terms are required to
handle the application to adaptive control in Section 2.3.
Let A denote the generator associated with the (reflected) diffusion process xt.
Specifically, for any function g : X Ñ R, the generator is given by
pAgqpxq “ lim
hÓ0 h
´1pErgpxhq|x0 “ xs ´ gpxqq .
We then assume that there is a twice continuously differentiable Lyapunov function
V : X Ñ r0,8q and positive numbers λ and C such that for all x P X it holds that
pAVqpxq ď C ´ λVpxq . (2.7)
21
Also, we assume that Vpxq increases monotonically with }x}. Specifically, there is a
strictly monotonically increasing function φ such that Vpxq “ φp}x}q. We will further
assume that V grows at least linearly with }x} (ie, φ grows at least linearly).
Let R1 be the diameter of the set tpx, yq P X | Vpxq`Vpyq ď 4C{λu. Linear growth
implies that R1 is finite. By construction, if }x´ y} ą R1, then
pAVqpxq ` pAVqpyq ď ´pλ{2qpVpxq ` Vpyqq . (2.8)
Then, let M be a positive number such that M ě R1 and for all x with }x} ě M the
following bound holds:
Vpxq ě max
"
2
λ
pα}x} ` Cq, 4C
λ
p2}x} ` 1q
*
. (2.9)
Finally, let R2 be the diameter of the set tpx, yq P X | }x} ď M and }y} ď Mu. Note
then that R2 ď 2M by the triangle inequality.
In Section 2.3 we take Vpxq “ }x}2 ` 1 (meaning φpdq “ d2 ` 1), and so the growth
assumptions on the Lyapunov function V are automatically satisfied.
2.2.3 Background on Wasserstein Distances
Our main theory describes convergence in a Wasserstein distance. To state the result,
some basic concepts from optimal transport are required. See [77] for a general reference.
Let P and Q be probability measures over X with respect to the standard Borel
sigma algebra. Then the joint probability measure Γ over X ˆ X is called a coupling
of P and Q if its marginals are respectively P and Q. In other words, for any Borel set
S, we have ΓpS ˆ X q “ PpSq and ΓpX ˆ Sq “ QpSq. Let CpP,Qq denote the set of all
couplings of P and Q.
If ρ : RnˆRn Ñ r0,8q is a metric, then the q-Wasserstein distance W qρ , with respect
22
to metric ρ, between measures P and Q is defined as
W qρ pP,Qq :“
ˆ
inf
ΓPCpP,Qq
ż
XˆX
ρpx, yqqdΓpx, yq
˙1{q
“
´
inf
ΓPCpP,Qq
Epx,yq„Γ
“
ρpx,yqq‰¯1{q .
(2.10)
For simple notation, we follow the convention that Wρ :“W 1ρ is used for general ρ,
and W q :“W q}¨}2 is used for ρ being the Euclidean norm.
2.2.4 Main Contraction Result
The main idea in [19] is to construct a metric ρ for which contraction in Wρ of the
law Pt (ie, the time evolving probability measure) of the trajectory xt can be tractably
analyzed, and then use the result to examine the contraction in more standard measures
such as W q and the total variation distance. We follow a similar approach, but the
metric must be modified to account for the more general growth condition (2.6).
Our metric will have the form
ρpx, yq “ `fp}x´ y}q ` γVpxq ` γVpyq `
maxtVpxq, φpMqu `maxtVpyq, φpMqu˘ Ipx ‰ yq . (2.11)
Recall that φ is a function such that Vpxq “ φp}x}q and I is the indicator function.
We define the following positive constants:
ε :“ λmaxpP qσmaxpGq2 and γ :“ ξε
4C
, (2.12)
where λmaxpP q is the largest eigenvalue of the matrix P (which defines the general inner
product and norm), σmaxpGq “ max}v}2“1 }Gv}2 is the largest singular value of G (with
respect to the Euclidean norm), and ξ is defined below.
23
The function f : r0,8q Ñ r0,8q is defined via the following chain of definitions:
hprq “ 1
ε
ˆ
1
2
ż r
0
sκpsqds` αMr
˙
(2.13a)
ϕprq “ e´hprq (2.13b)
Φprq “
ż r
0
ϕpsqds (2.13c)
ξ´1 “
ż R1
0
ϕpsq´1ds (2.13d)
β´1 “
ż R2
0
pΦpsq{ϕpsqqds (2.13e)
gprq “ 1´ ξ
4
ż mintr,R1u
0
ϕpsq´1ds´ β
4
ż mintr,R2u
0
pΦpsq{ϕpsqqds (2.13f)
fprq “
ż mintr,R2u
0
ϕpsqgpsqds. (2.13g)
We can now state the main contraction result, as follows.
Theorem 2.1. The function ρpx, yq defined in (2.11) is a metric over x, y P X . Let xt
and yt be two solutions to (2.5) with initial conditions x0 ‰ y0 P X and respective laws
Pt and Qt. If the initial laws satisfy
ş
X VpxqdP0pxq ă 8 and
ş
X VpxqdQ0pxq ă 8, then
the laws contract exponentially in Wρ distance as
WρpPt,Qtq ď e´atWρpP0,Q0q , (2.14)
where a “ 12 mintλ, ξε, βεu.
Proof. The proof is given in Appendix A.1.
The following corollary then establishes contraction to a stationary distribution pi (which
satisfies that if QT “ pi, then Qt “ pi for all t ě T ) in total variation distance and
standard q-Wasserstein distances, analogous to Corollary 2.1 and Remark 2.3 of [19].
Corollary 2.1. For (2.5), there exists a unique stationary distribution pi satisfyingş
X Vpxqdpipxq ă 8 . Let xt be a solution to (2.5) with law Pt such that the initial law
24
satisfies
ş
X VpxqdP0pxq ă 8 . If Vpxq ě 1 for all x P X , then Pt contracts exponentially
to pi in total variation distance as
}Pt ´ pi}TV ď γ´1e´atWρpP0, piq . (2.15)
Now let q P r1,8q satisfy 1p ` 1q “ 1 with the corresponding p ą 1. If Vpxq ě }x}q for
all x P X , then Pt contracts exponentially to pi in standard W q as
W qpPt, piq ď 21{p γ´1{q
`
e´atWρpP0, piq
˘1{q
. (2.16)
Proof. The proof is given in Appendix A.2.
2.3 Application to Linear MRAC
In this section, we analyze a stochastic variation of a linear MRAC application from
Chapter 9 of [41]. We will implement adaptive regulation of the plant state towards the
equilibrium point of a reference linear system, in spite of uncertainties in the plant’s
system matrix and its parameterization of an additive nonlinear vector function. We
also use convex projection, as detailed in Section 2.1, on the parameter update rules so
that the parameters remain within a constraint set.
The plant will also be under the influence of an additive (scaled) Brownian motion
disturbance, and thus our objective will be to show that such applications of linear
MRAC meet the requirements for the contraction results detailed in Section 2.2
2.3.1 Setup
We largely follow the general linear MRAC setup introduced in Chapter 9 of [41],with
some important modications. Notably, we introduce stochastic terms into the plant and
parameter update dynamics, and we will be tracking the fixed model reference state
xrt “ 0n, as opposed to a reference model with dynamic state behavior.
25
The plant has the form (in differential notation)
dxt “
“
Axt `B
`
ut ` ΩJΨpxtq
˘‰
dt`Gxdwxt , (2.17)
where A is an unknown n ˆ n state matrix for the plant state xt P Rn, B is a known
n ˆ ` input matrix for the input ut P R`, Ω is an unknown L ˆ ` matrix that linearly
parameterizes a vector of known nonlinear basis functions Ψ1, . . . ,ΨL : Rn Ñ R of the
plant state, and Gx is an invertible n ˆ n matrix scaling an n-dimensional Brownian
motion wxt . We omit the input scaling matrix Λ, which typically gives an overall input
matrix of BΛ, for simplicity.
The linear model reference system is then given by
9xrt “ Arxrt `Brt , (2.18)
where Ar is a known Hurwitz nˆn reference state matrix for the reference state xrt P Rn
and rt P R` is a bounded reference input. And so, we assume there exists an (unknown)
nˆ ` matrix of feedback control gains K satisfying the matching condition
A`BKJ “ Ar . (2.19)
Since the reference input rt will not deviate from zero, there is no need for a specific
reference input matrix Br with corresponding matching condition and feedforward gains.
With the matching condition satisfied, the plant (2.17) can be rewritten as
dxt “
“
Arxt `B
`
ut ´KJxt ` ΩJΨpxtq
˘‰
dt`Gxdwxt , (2.20)
Remark 2.1. The condition (2.19) is feasible if some Hurwitz Ar can be reached from
pA,Bq by state feedback. Towards this end, it is often practical to solve for an LQR
feedback gain matrix KLQR and then check that Ar “ A`BKJLQR is Hurwitz.
The methods derived in Chapter 9 of [41] aim to drive the norm of the nominal state
tracking error et “ xt ´ xrt to zero asymptotically as limtÑ8 }et}2 “ 0, by designing
26
parameter update rules based on Lyapunov analysis in combination with Barbalat’s
lemma. Here, xt is the nominal plant state, which is (2.17) without any added stochastic
term Gxdw
x
t . In this application, we will set x
r
0 “ 0n and rt “ 0` for all t ě 0, and so
the reference state will always remain at the origin. Thus, it holds that the stochastic
state tracking error et :“ xt´ 0n “ xt, and so our goal in this section is to regulate the
(stochastic) plant state to the origin, in spite of the plant uncertainties and the (scaled)
Brownian motion disturbance.
If K and Ω were known, we could then simply apply the (stochastic) feedback control
law of ut “ KJxt ´ ΩJΨpxtq and obtain a stable Ornstein–Uhlenbeck process
dxt “ Arxtdt`Gxdwxt .
But since we do not know K or Ω, we must instead use ut “ pKJtxt´ pΩJt Ψpxtq, wherepKt and pΩt are stochastic parameter estimates that require dynamic update rules. To
simplify notation, we define the following:
Θ :“
»– K
´Ω
fifl , pΘt :“
»– pKt
´pΩt
fifl , Φpxtq “
»– xt
Ψpxtq
fifl . (2.21)
Here, Θ and pΘt are N ˆ ` matrices and we have the vector function Φ : Rn Ñ RN (note
this is unrelated to the Φprq in (2.13)), where we set N :“ n ` L. Then the matched
plant (2.20) with feedback control law ut “ pΘJt Φpxtq gives the state dynamics as
dxt “
“
Arxt `Bp pΘt ´ΘqJΦpxtq‰dt`Gx dwxt . (2.22)
We assume that the vector function Φ satisfies a Lipschitz condition
}Φpxq ´Φpyq}2 ď L}x´ y}2, (2.23)
for all x, y P X and some constant scalar L ą 0.
Now we define a vectorization of the matrices Θ and pΘt. Set m :“ N` and let
S : Rm Ñ RNˆ` be the matrix shaping function defined by Spvqi,j “ vpi´1q``j for
27
all i, j P rN s ˆ r`s. Then S is an invertible linear function, such that we have the
vectorizations: θ “ S´1pΘq and pθt “ S´1p pΘtq.
Let K be a compact convex subset of Rm that contains the true values θ “ S´1pΘq.
We assume that such a K is known, with diameter D :“ diamtKu.
Then we let X “ Rn ˆ K be the closed convex subset of Rn`m containing the
combined state zJt “ rxJt pθJt s P X , and assume that pθt has an update rule satisfying an
RSDE of the form
dpθt “ Rpztq dt`Gθ dwθt ´ vθt dµθptq , (2.24)
where R : X Ñ Rm, wθt P Rm is an m-dimensional Brownian motion with invertible
scaling matrix Gθ P Rmˆm, and ψθt “ ´
şt
0 v
θ
s dµ
θpsq is a bounded variation reflection
process enforcing that pθt P K for all t ě 0 from initial condition pθ0 P K.
And so, the combination of (2.22) and (2.24) defines a special case of the RSDE
(2.5). Indeed, for every point zJ “ rxJ ϑJs P X , the corresponding NX pzq contains
vectors of the form vJ “ r0Jn pvθqJs, where vθ P NKpϑq and NKpϑq is defined with
respect to the Euclidean inner product.
Remark 2.2. The parameter update rule (2.24) differs from typical methods of adaptive
control in how it forces pθt to remain in the constraint set K. This method uses reflection
processes (RSDEs), which can be approximated in discrete time by convex projections.
In contrast, the parameters of adaptive control laws are commonly constrained using
specialized projection operators that are designed for specific classes of convex sets (see
examples in [41, 26, 28, 35]).
2.3.2 Lyapunov-Based Parameter Update Rule
We will follow a Lyapunov-based procedure for designing the parameter update rule
(2.24) similar to the method in Chapter 9 of [41], but incorporating the stochastic
28
terms.
First, we construct the Lyapunov function. We fix some nˆ n symmetric, positive
definite matrix Qx, and since Ar is Hurwitz, then there is a unique n ˆ n symmetric,
positive definite matrix Px satisfying the Lyapunov equation
AJr Px ` PxAr “ ´Qx . (2.25)
For any point zJ “ rxJ ϑJs P X “ Rn ˆK, we thus define the Lyapunov function as
Vpzq “ xJPxx` pϑ´ θqJpϑ´ θq ` 1 , (2.26)
recalling that θ is a vectorized Θ, containing the true parameter values. We define the
corresponding norm over Rn`m as }z}2 “ zJPz “ xJPxx ` ϑJϑ, where our overall P
matrix is the block diagonal
P “
»–Px 0
0 Im
fifl . (2.27)
The setup in [41] uses the Lyapunov function
Vpzq “ xJPxx` tr
`
Spϑ´ θqJΓSpϑ´ θq˘ ,
where Γ is an N ˆN symmetric, positive-definite adaptation rates matrix. Noting that
aJb “ trpSpaqJSpbqq “ trpAJBq holds for all a, b P Rm, then (2.26) satisfies this with
Γ “ IN such that tr
`
Spϑ´ θqJIN Spϑ´ θq
˘ “ pϑ´ θqJpϑ´ θq.
Theorem 2.1 requires that the Lyapunov function be a monotonic function of a norm.
This could be attained using the affine coordinate transformation zˇJ “ rxJ pϑ ´ θqJs
and then setting Vˇpzˇq “ }zˇ}2 ` 1 “ Vpzq. However, in the analysis below we will work
in the original coordinates zJ “ rxJ ϑJs.
29
We now apply Itoˆ’s formula to the Lyapunov function (2.26), which gives:
dVpztq “ ∇xVJdxt ` 1
2
dxJt p∇2x Vq dxt `∇θVJdpθt ` 12 dpθJt p∇2θ Vq dpθt
“ 2xJt PJx dxt ` dxJt Px dxt ` 2 ppθt ´ θqJdpθt ` dpθJt dpθt
“ xJt
`
AJr Px ` PxAr
˘
xt dt` 2xJt PxBp pΘt ´ΘqJΦpxtq dt` 2xJt PxGx dwxt
` dxJt Px dxt ` 2 tr
`p pΘt ´ΘqJd pΘt˘` dpθJt dpθt
“ ´xJt Qx xt dt` 2 tr
`p pΘt ´ΘqJpΦpxtqxJt PxB dt` d pΘtq˘
` trpGJxPxGxq dt` dpθJt dpθt ` 2xJt PxGx dwxt . (2.28)
The third equality uses (2.22), that aJb “ trpSpaqJSpbqq “ trpAJBq for any a, b P Rm,
and that the symmetry of Px gives
2xJt PJx dxt “ xJt Px dxt ` dxJt Px xt .
The fourth equality uses (2.25), the trace identity aJb “ trpbaJq as
2xJt PxBp pΘt ´ΘqJΦpxtq dt “ 2tr`p pΘt ´ΘqJΦpxtqxJt PxB dt˘ ,
and that the Itoˆ rules dt2 “ 0, dt dwxt “ 0n, and dwxt pdwxt qJ “ dt In thus give
dxJt Px dxt “ pdwxt qJGJxPxGxdwxt “ trpGJxPxGxq dt .
By examining the second term of (2.28), we see that ΦpxtqxJt PxB dt within the trace
operation is canceled if R in (2.24) is defined as
Rpztq “ S´1
`´ΦpxtqxJt PxB˘ , (2.29)
where S´1 vectorizes the matrix in the parentheses by stacking its rows. This gives the
design for the stochastic parameter update rule of pθt, and so also pΘt “ Sppθtq.
30
2.3.3 Satisfying Assumptions on the Lyapunov Equation
Now we will continue analyzing dVpztq to ensure that the Lyapunov function V as
defined in (2.26) satisfies the required assumptions in Section 2.2.2.
Using the R defined in (2.29) within the parameter update rule dpθt (2.24) and then
plugging this into dVpztq (2.28) by using d pΘt “ Spdpθtq, gives
dVpztq “ ´xJt Qxxt dt` 2 tr
´
p pΘt ´ΘqJ´SpGθ dwθt q ´ S`vθt dµθptq˘¯¯
` trpGJxPxGxqdt` trpGJθ Gθq dt` 2xJt PxGx dwxt
“ `´ xJt Qxxt ` trpGJxPxGxq ` trpGJθ Gθq ˘ dt
` 2xJt PxGx dwxt ` 2 ppθt ´ θqJGθ dwθt ´ 2 ppθt ´ θqJvθt dµθptq , (2.30)
where again we used aJb “ trpSpaJqSpbqq and the Itoˆ rules dt2 “ 0, dt dwθt “ 0m, and
dwθt pdwθt qJ “ dt Im in the second equality.
Now since vθt P NKppθtq and µθ is a non-negative measure, the definition of the normal
cone of K, with respect to the lower block identity Im in (2.27), therefore implies
´ppθt ´ θqJvθt dµθptq ď 0 .
And so, we have
dVpztq ď
`´ xJt Qxxt ` trpGJxPxGxq ` trpGJθ Gθq ˘ dt
` 2xJt PxGx dwxt ` 2 ppθt ´ θqJGθ dwθt ,
giving that the generator of the Lyapunov function Vpztq satisfies
AVpzq ď `´ xJQxx` trpGJxPxGxq ` trpGJθ Gθq ˘ . (2.31)
To ensure that this meets the required decrease assumption (2.7), we use a standard
quadratic form bound, followed by the diameter condition on K as pϑ´θqJpϑ´θq ď D2.
31
Thus, from (2.26) we have:
xJQxx ě λminpQxq
λmaxpPxqx
JPxx
ě λminpQxq
λmaxpPxqVpzq ´
λminpQxq
λmaxpPxqpD
2 ` 1q .
Here λminpQxq is the minimum eigenvalue of Qx and λmaxpPxq is the maximum eigen-
value of Px. Plugging this bound into (2.31) shows V satisfies (2.7) with
C “ trpGJxPxGxq ` trpGJθ Gθq ` λminpQxqλmaxpPxqpD
2 ` 1q
λ “ λminpQxq
λmaxpPxq .
The constants R1, M , and R2 are then defined relative to this C and λ as in Section 2.2.2,
along with the inequalities (2.8) and (2.9) for the α to be defined in the next section.
All assumptions on the Lyapunov equation have been satisfied.
2.3.4 Satisfying the One-Sided Growth Condition
The final task needed to apply Theorem 2.1 to this application is to ensure that the
one-sided growth condition (2.6) holds. Note that the combination of (2.22), (2.24),
and (2.29) leads to a special case of (2.5) with the following definitions:
Hpztq “
»–Axt `B ` pΘt ´ΘqJΦpxtq
S´1
`´ΦpxtqxJt PxB˘
fifl , G “
»–Gx 0
0 Gθ
fifl
dwt “
»–dwxt
dwθt
fifl , and vt dµptq “
»– 0
vθt dµ
θptq
fifl . (2.32)
Let zJ “ rxJ, ϑJs and rzJ “ rrxJ, rϑJs be two points in X , and let us denote Ξ “
Spϑq and rΞ “ Sprϑq for visual distinction from the true parameters Θ “ Spθq. Direct
32
calculation using the general inner product xz, rz y “ xJPxrx` ϑJrϑ then gives:
xz ´ rz,Hpzq ´Hprzqy “C
»–x´ rx
ϑ´ rϑ
fifl ,
»–Apx´ rxq `B `Ξ´ΘqJΦpxq ´B `rΞ´ΘqJΦprxq
S´1
`´ΦpxqxJPxB˘´ S´1`´ΦprxqrxJPxB˘
fiflG
“ px´ rxqJPxA px´ rxq ` px´ rxqJPxB pΞ´ΘqJΦpxq ´ px´ rxqJPxB prΞ´ΘqJΦprxq
` tr`pΞ´ rΞqJpΦpxqxJ ´ΦprxqrxJ˘PxB˘
“ ´1
2
px´ rxqJQx px´ rxq ` tr`pΞ´ΘqJΦpxqpx´ rxqJPxB˘´ tr`prΞ´ΘqJΦprxqpx´ rxqJPxB˘
` tr`pΞ´ rΞqJp´ΦpxqxJ `ΦprxqrxJ˘PxB˘
“ ´1
2
px´ rxqJQx px´ rxq
`(((((((
(((
tr
`
ΞJΦpxqxJPxB
˘´ tr`ΞJΦpxqrxJPxB˘´ tr`ΘJΦpxqxJPxB˘` tr`ΘJΦpxqrxJPxB˘
´ tr`rΞJΦprxqxJPxB˘`((((((((((tr`rΞJΦprxqrxJPxB˘` tr`ΘJΦprxqxJPxB˘´ tr`ΘJΦprxqrxJPxB˘
´(((((((
(((
tr
`
ΞJΦpxqxJPxB
˘` tr`ΞJΦprxqrxJPxB˘` tr`rΞJΦpxqxJPxB˘´((((((((((tr`rΞJΦprxqrxJPxB˘
ď ´tr`ΞJΦpxqrxJPxB˘´ tr`ΘJΦpxqxJPxB˘` tr`ΘJΦpxqrxJPxB˘´ tr`ΘJΦprxqrxJPxB˘
´ tr`rΞJΦprxqxJPxB˘` tr`ΘJΦprxqxJPxB˘` tr`ΞJΦprxqrxJPxB˘` tr`rΞJΦpxqxJPxB˘
“ ´xJPxB prΞ´ΘqJpΦpxq ´Φprxqq ` rxJPxB pΞ´ΘqJpΦpxq ´Φprxqq , (2.33)
where we drop the non-positive ´12px´ rxqJQx px´ rxq term for inequality and use the
trace identity aJb “ trpb aJq in many places. Setting
y :“ PxB pΞ´ΘqJpΦpxq ´Φprxqq
ry :“ PxB prΞ´ΘqJpΦpxq ´Φprxqq ,
33
note that we then have
}y ´ ry}2 “ }PxBpΞ´ rΞqJpΦpxq ´Φprxqq}2
ď }PxB}2}Ξ´ rΞ}2L}x´ rx}2
ď }PxB}2}Ξ´ rΞ}FL}x´ rx}2
ď }PxB}2DL}x´ rx}2 .
Here, } ¨ }2 applied to matrices is the induced martix 2-norm. Then the first inequality
uses sub-multiplicativity followed by the Lipschitz assumption on Φ. Next we note
that the induced matrix 2-norm is bounded above by the Frobenius norm, and that
}Ξ ´ rΞ}F “ }ϑ ´ rϑ}2. So, the final inequality follows from the diameter bound on K.
An analogous derivation shows that
}ry}2 ď }PxB}2}rΞ´Θ}FL}x´ rx}2
ď }PxB}2DL}x´ rx}2 .
Recalling that }z}2 “ xJPxx ` ϑJϑ, we can then apply a quadratic form bound of
}x}2 ď 1?
λminpPxq}z}. Therefore, we achieve the following overall upper bound:
xz´rz,Hpzq ´Hprzqy ď ´xJry ` rxJy “ prx´ xqJry ` rxJpy ´ ryq
ď }rx´ x}2}ry}2 ` }rx}2}y ´ ry}2
ď }PxB}2DL
`}x´ rx}22 ` }rx}2}x´ rx}2˘
ď }PxB}2DL
λminpPxq
`}z ´ rz}2 ` }rz}}z ´ rz}˘
ď }PxB}2DL
λminpPxq
`
r2 ` rp}z} ` }rz}q˘ .
This satisfies the one-sided growth requirement (2.6) with
κprq “ α “ }PxB}2DL
λminpPxq .
34
The results are summarized in the following theorem:
Theorem 2.2. The combined system dynamics and parameter update rule described by
(2.22), (2.24), and (2.29) leads to a special case of (2.5), as defined in (2.32). The
control law ut “ pΘJt Ψpxtq with parameter estimates, as defined in (2.21), meets all
requirements of Section 2.2, such that the law Pt of any solution xt from initial condition
x0 P X , contracts to a unique stationary distribution pi. Contraction is exponential with
respect to total variation distance and standard W qpPt, piq Wasserstein distance, with
convergence rates described by Theorem 2.1 and Corollary 2.1.
2.4 Numerical Simulation of Linear MRAC Application
In this section, we will perform a numerical simulation of the application in Section 2.3,
using the example from Section 11.5 of [41]. The linear system is an LTI model for the
lateral-directional dynamics of a small passenger aircraft in cruise configuration. The
state xt “ rxt,1 xt,2 xt,3 xt,4sJ represents the bank angle, sideslip angle, roll rate, and
yaw rate, while the control input ut “ rut,1 ut,2sJ represents the rudder and aileron
deflection angles (all units are radians and radians/sec). The nominal plant is given as
9xt “ Axt `B
`
ut ` ΩJΨpxtq
˘
,
with
A “
»———————–
0 0 1 0
0.0487 ´0.0829 0 ´1
0 ´4.546 ´1.699 0.1717
0 3.382 ´0.0654 ´0.0893
fiffiffiffiffiffiffiffifl
, B “
»———————–
0 0
0 0.0116
27.276 0.5758
0.3952 ´1.362
fiffiffiffiffiffiffiffifl
,
ΩJ “
»–2A3,2 1A3,3 1A3,4
2A4,2 1A4,3 1A4,4
fifl , and Ψpxq “ rx2 x3 x4sJ .
35
Thus, we have n “ 4, ` “ 2, and L “ 3.
We can see that the input ut mainly affects the last two states (roll and yaw rates).
And so, the unknown linearly parameterized ΩJΨpxtq represents an uncertainty about
the critical system model parameters of A for those two states, which we will also assume
are unknown. (Note: in [41] it appears the A entries used for the parameterization are
incorrect, relative to the stated goal of the modeling uncertainties.)
The given A has eigenvalues of λpAq “ t´0.0464 ˘ j1.8784, ´1.7797, 0.0014u, so
we use the LQR method with QLQR “ diagpr0 0 0.1 5sq and RLQR “ I` to obtain
KJ “
»–´0.1118 ´0.1911 ´0.2756 0.1011
´0.084 0.1267 ´0.0414 2.1049
fifl ,
which gives a Hurwitz
Ar “ A`BKJ “
»———————–
0 0 1 0
0.0477 ´0.0814 ´0.0005 ´0.9756
´3.0986 ´9.6867 ´9.2396 4.1402
0.0702 3.1339 ´0.1179 ´2.9162
fiffiffiffiffiffiffiffifl
with eigenvalues of λpArq “ t´8.835, ´0.3319, ´1.5352˘ j1.0673u.
Setting Qx “ diagpr1 1 10 10sq and solving the Lyapunov equation with Ar gives
Px “
»———————–
3.2517 0.2803 0.1649 ´0.0356
0.2803 8.1912 ´0.4639 ´1.3807
0.1649 ´0.4639 0.5565 0.2
´0.0356 ´1.3807 0.2 2.4605
fiffiffiffiffiffiffiffifl
which is positive definite with eigenvalues of λpPxq “ t8.5513, 3.251, 0.5085, 2.149u.
We use stochastic parameter estimates pθt “ S´1p pΘtq, with m “ 14, in place of
the unknown K and Ω, and Φ, as defined in (2.21) with N “ 7, in the control law
ut “ pΘJt Φpxtq as in (2.22). This gives the stochastic plant as
dxt “
“
Arxt `Bp pΘt ´ΘqJΦpxtq‰dt`Gx dwxt ,
36
where we set Gx “ gx I4 for some positive scalar gx ą 0. We similarly use (2.24) and
(2.29) to get the stochastic parameter update rule as
dpθt “ S´1`´ΦpxtqxJt PxB˘ dt`Gθ dwθt ´ vθt dµθptq ,
where we set Gθ to be block diagonal of GK “ gK I8 and GΩ “ gΩ I6, for some positive
scalars gK , gΩ ą 0.
The compact convex constraint set K Ă R14 is defined in the following manner. Each
row of pΘt (column of pΘJt ) will be constrained within a convex 2-dimensional polygon.
Each of the n “ 4 rows corresponding to parameters of K will use the same polygon,
and each of the L “ 3 rows corresponding to parameters of Ω will use the same polygon.
See Figure 2.7 for a visualization of how the parameter estimates are constrained within
these polygons.
With the combined state zJt “ rxJt pθJt s P X “ R4 ˆ K, and using the definitions
of H, G, dwt, and vtdµptq in (2.32), we have the overall adaptive regulation system
described by the RSDE
dzt “ Hpztqdt`Gdwt ´ vtdµptq , (2.34)
which meets all requirements of Theorem 2.2. Note here that G has the overall form
G “
»————–
gx I4 0 0
0 gK I8 0
0 0 gΩ I6
fiffiffiffiffifl .
We perform three simulations, each simulating two solutions z
p1q
t and z
p2q
t to (2.34)
from the initial conditions: x
p1q
0 “ r0.1 ¨ ¨ ¨ 0.1sJ, xp2q0 “ r´0.15 ¨ ¨ ¨ ´ 0.15sJ, andpθp1q0 “ pθp2q0 “ 0m, thus satisfying zp1q0 ‰ zp2q0 P X . The three scenarios are: i) gx “ gK “
gΩ “ 0, which gives the nominal system behavior with no stochastic terms, ii) gx “
gK “ gΩ “ 0.05, which gives the stochastic system behavior with a moderate amount
37
of Brownian disturbance, and iii) gx “ gK “ 0.01 and gΩ “ 1, which places higher
Brownian disturbance on the pΩt stochastic parameter estimates in order to highlight
the convex projection to and constraint within K.
All simulations are performed using a projected Euler method over t P r0, 30s sec,
with timesteps of dt “ 0.001 sec and a convex projection operation ΠK. This aligns
with the discussion at the end of Section 2.2.1. At each timestep, an independent
Brownian increment wt`dt ´wt is sampled from N
`
0n`m, dt In`m
˘
. The solution z
p2q
t
uses Brownian motion d rwt “ pIn`m ´ 2δtδJt qdwt, which induces a reflection coupling
between the two solutions. (See Appendix A.1 for details.) The coupling time τ between
the solutions is the first time step when the norm difference is practically close to zero,
such that }zp1qτ ´ zp2qτ } ď ε « 0.
Figure 2.5 shows a plot of the states x
p1q
t and x
p2q
t over time. We see in the first
scenario (top plot) the nominal behavior with no stochastic terms. The states decay
(mostly) smoothly down to the origin. The second scenario (middle plot) shows similar
behavior, but with stochastic noise added to the trajectories. Coupling occurs at τ «
17sec. The third scenario (last plot) shows similar behavior until around t “ 10 sec,
when a larger disturbance occurs, after which the states again decay down to the origin.
We see similar results in Figure 2.6, which plots the Euclidean norms }xp1qt }2 and }xp2qt }2
of the states over time, and is being driven to zero in this application.
Figure 2.7 shows the phase plot of the polygons used to constrain each row of pKp1qt ,pKp2qt (left side) and pΩp1qt , pΩp2qt (right side). Blue dots show true values K and Ω. Red and
Green dots show final parameter estimates. We see in all scenarios that the constraint
is enforced for the pKt heavily on the left side of the set. The constraint for pΩt is mostly
not needed for the first two scenarios but in the third it is enforced at all boundaries.
Figure 2.8 shows the Euclidean norms } pΘp1qt ´Θ}2 and } pΘp2qt ´Θ}2 of the parameter error
from the true values over time, which is not being driven to zero in this application.
38
0 5 10 15 20 25 30
−3
−2
−1
0
1
2
0 5 10 15 20 25 30
−3
−2
−1
0
1
2
0 5 10 15 20 25 30
−2
−1
0
1
2
Figure 2.5: States x
p1q
t (blue) and x
p2q
t (orange), for the three simulation scenarios.
Coupling time τ is the vertical red line.
39
0 5 10 15 20 25 30
0.0
0.5
1.0
1.5
2.0
2.5
3.0
0 5 10 15 20 25 30
0.0
0.5
1.0
1.5
2.0
2.5
3.0
0 5 10 15 20 25 30
0.0
0.5
1.0
1.5
2.0
2.5
Figure 2.6: Euclidean norm of states }xp1qt }2 (magenta) and }xp2qt }2 (cyan), for the three
simulation scenarios. Coupling time τ is the vertical red line.
40
−0.4 −0.2 0.0 0.2 0.4
K1
−0.5
0.0
0.5
1.0
1.5
2.0
2.5
K 2
−10 −8 −6 −4 −2 0
Ω1
0
2
4
6
Ω 2
−0.4 −0.2 0.0 0.2 0.4
K1
−0.5
0.0
0.5
1.0
1.5
2.0
2.5
K 2
−10 −8 −6 −4 −2 0
Ω1
0
2
4
6
Ω 2
−0.4 −0.2 0.0 0.2 0.4
K1
−0.5
0.0
0.5
1.0
1.5
2.0
2.5
K 2
−10 −8 −6 −4 −2 0
Ω1
0
2
4
6
Ω 2
Figure 2.7: Phase plots of the 2D polygons used to constrain each row of pKt (left
side) and pΩt (right side), with blue for pΘp1qt and orange for pΘp2qt . Cyan dots show true
parameter values. Red and Green dots show final parameter estimates.
41
0 5 10 15 20 25 30
11.45
11.50
11.55
11.60
11.65
0 5 10 15 20 25 30
11.25
11.30
11.35
11.40
11.45
11.50
11.55
11.60
11.65
0 5 10 15 20 25 30
9
10
11
12
13
14
Figure 2.8: Euclidean norm of parameter error } pΘp1qt ´Θ}2 (magenta) and } pΘp2qt ´Θ}2
(cyan), for the three simulation scenarios. Coupling time τ is the vertical red line.
Chapter 3
Persistency of Excitation with
Step and ReLU Functions
In this chapter, we introduce novel sufficient conditions for achieving persistency of
excitation with vector functions constructed from single hidden-layer neural networks
that use step or ReLU activation functions. We apply these conditions to the general
setup for linear MRAC from Chapter 9 of [41], showing how that setup can meet the
conditions and therefore learn the true parameterization values.
Persistency of excitation is a fundamental concept employed within many contexts
and applications related to parameter learning, such as system identification and adap-
tive control. It is often discussed, or at least mentioned, in adaptive control textbooks,
such as in [71, 74, 41].
It was proven in [55, 2] that persistency of excitation is necessary and sufficient for
the global uniform asymptotic stability of the linear time-varying (LTV) system
9rθt “ ´Vt V Jt rθt , (3.1)
where rθt P RN is the system state, and V : r0,8q Ñ RN is a regulated vector function of
42
43
time. Here, regulated means that the left and right limits of Vt exist at all times t ě 0,
though they may not be equal at a finite number of discontinuities on a compact subset
C Ă Rn.
Let us denote Vt “ rVt,1 ¨ ¨ ¨ Vt,N sJ, with each component a scalar regulated function
of time, mapping to R. The N ˆN matrixż τ`T
τ
Vt V
J
t dt (3.2)
defines a Gramian, since the i, j-th entry
şτ`T
τ Vt,i Vt,j dt defines an inner product of
functions Vt,1, . . . , Vt,N over any positive time interval rτ, τ ` T s with τ ě 0 and T ą 0.
Gramian matrices are always positive semidefinite, which can be shown using the bi-
linearity of inner products.
Persistency of excitation is the requirement that the Gramian matrix (3.2) be strictly
positive definite with eigenvalues in some positive, bounded interval rα1, α2s, over all
shifts τ ě 0 of the sliding time window rτ, τ ` T s for some window length T ą 0.
Formally, persistency of excitation requires the existence of constants α1, α2, T ą 0
such that
α1IN ĺ
ż τ`T
τ
Vt V
J
t dt ĺ α2IN (3.3)
holds for all τ ě 0. An equivalent scalar requirement is that
α1}v}22 ď
ż τ`T
τ
pvJVtq2 dt ď α2}v}22 (3.4)
must hold for all v P RN and τ ě 0, since vJVtV Jt v “ pvJVtqpvJVtqJ “ pvJVtq2.
Note that we can partition the time window r0, T s for any T ą 0 using any positive
integer D P N length increasing sequence of times 0 “ T0 ă T1 ă ¨ ¨ ¨ ă TD´1 ă TD “ T ,
and thus obtain by linearity of integration thatż τ`T
τ
Vt V
J
t dt “
Dÿ
d“1
ż τ`Td
τ`Td´1
Vt V
J
t dt .
44
Note then that by the same reasoning as with (3.2), each integral in the sum is also a
Gramian matrix and thus positive semidefinite. Therefore, for any particular τ shift of
the window, the overall sum will be positive definite if at least one term, corresponding
to the partition rτ ` Td´1, τ ` Tds for some d P rDs, is positive definite. This follows
since vJpA1 ` A2 ` . . . qv “ vJA1v ` vJA2v ` ¨ ¨ ¨ ą 0 for all v P RNzt0Nu, if at least
one of A1, A2, . . . is positive definite and all others are positive semidefinite. And so, if
(3.3) holds for some T ą 0, then it must also hold for all rT ą T sinceż τ` rT
τ
Vt V
J
t dt “
ż τ`T
τ
Vt V
J
t dt`
ż τ` rT
τ`T
Vt V
J
t dt .
This chapter is adapted from the published work [45] and we summarize the orga-
nization of the sections, as follows:
1. Section 3.1 explains P.E. concepts in the specific setting of basis functions and
state trajectories, which will be used throughout the chapter,
2. Section 3.2 defines the activations used in the basis functions, and associated
geometry they induce on the state space, which are crucial to the P.E. conditions,
3. Section 3.3 gives the main results, which are novel sufficient conditions for P.E.
when using step and ReLU activations, and
4. Section 3.4 analyzes the linear MRAC setup, explains how it can meet the P.E
conditions, and provides a numerical simulation of this application.
3.1 Persistency of Excitation With Basis Functions of a
Continuous State Trajectory
In this chapter, we will focus on vector functions Vt having the form
Vt “ Ψpxtq “ rg1pxtq ¨ ¨ ¨ gN pxtqsJ ,
45
where nonlinear basis functions g1, . . . , gN : Rn Ñ R are continuous almost everywhere
and x : r0,8q Ñ Rn is a continuous-time trajectory representing the state of a system.
And so, we will analyze sufficient conditions for persistency of excitation defined by the
existence of constants α1, α2, T ą 0 such that
α1IN ĺ
ż τ`T
τ
ΨpxtqΨpxtqJ dt ĺ α2IN (3.5)
and the equivalent scalar requirement, for all v P RN , that
α1}v}22 ď
ż τ`T
τ
pvJΨpxtqq2 dt ď α2}v}22 (3.6)
hold for all τ ě 0. Recall that here, the integral in (3.5) is a (time-varying) Grammian
matrix and the inequality represents that its (nonnegative) eigenvalues must be in some
strictly positive, bounded interval rα1, α2s.
Remark 3.1. If the continuous trajectory xt stays within some compact set C Ă Rn
for all t ě 0, and if g1, . . . , gN are bounded on C, then a finite upper eigenvalue bound
α2 will exist. The difficulty of establishing the persistency of excitation condition (3.5)
for a given Ψ and bounded trajectory xt lies with proving that there exists some window
length T ą 0 which gives a strictly positive lower eigenvalue bound α1, over all τ ě 0
shifts of the window rτ, τ ` T s.
We now highlight an important connection between the persistency of excitation con-
dition (3.5) and linear independence of the continuous almost everywhere basis functions
g1pxq, . . . , gN pxq, whereby there exists no nonzero vector v P RNzt0Nu such that
vJΨpxq “
Nÿ
i“1
vigipxq “ 0 for almost all x P Rn .
In other words, only the zero vector v “ 0N allows this to hold.
Lemma 2 of [40] shows that g1, . . . , gN are linearly independent if and only if there
exist points xp1q, . . . , xpNq P Rn in the state space such that all g1, . . . , gN are continuous
46
at all xp1q, . . . , xpNq points and the matrix»————–
Ψpxp1qqJ
...
ΨpxpNqqJ
fiffiffiffiffifl “
»————–
g1pxp1qq . . . gN pxp1qq
...
...
g1pxpNqq . . . gN pxpNqq
fiffiffiffiffifl
is invertible. Then, Proposition 1 of [40] shows that if the continuous trajectory xt over
some time window rτ, τ ` T s satisfies the following two conditions: 1) there exist times
t1, . . . , tN in the window such that }xti ´ xpiq}2 ă  for all i P rN s and some positive
scalar  ą 0, and 2) the time derivative is bounded as } 9xt}2 ă c for some positive scalar
c ą 0, then there exists a positive scalar α1 such that
α1IN ĺ
ż τ`T
τ
ΨpxtqΨpxtqJ dt .
Therefore, this shows that if the continuous trajectory xt stays within the compact set
C P Rn with a bounded time derivative, and within every τ ě 0 shift of a time window
of length T it gets sufficiently close to points xp1q, . . . , xpNq P Rn that ensure the linear
independence of g1, . . . , gN , which are bounded on C, then the persistency of excitation
condition (3.5) is guaranteed to hold.
And so, in this chapter we will show how to find such points xp1q, . . . , xpNq P Rn
that ensure the linear independence of g1, . . . , gN , when they are composed of specific
nonlinear, positive semidefinite activation functions together with unique affine trans-
formations of the state space Rn.
3.2 Nonlinear, Positive Semidefinite Activation Functions
Let Ψ : Rn Ñ RN be a vector function defined as
Ψpxq “
»————–
g1pxq
...
gN pxq
fiffiffiffiffifl “
»————–
σ1pwJ1x` b1q
...
σN pwJNx` bN q
fiffiffiffiffifl , (3.7)
47
where g1, . . . , gN : Rn Ñ R are composed of nonlinear, positive semidefinite activa-
tion functions σ1, . . . , σN : R Ñ R together with an affine transformations wJ1x `
b1, . . . , w
J
Nx ` bN : Rn Ñ R. We allow wi P Rnzt0nu and bi P R to be arbitrary for
all i P rN s, except that we assume each wJi x ` bi “ 0 hyperplane in Rn is unique and
has dimension n ´ 1. Let us define the n ˆ N matrix W “ rw1 ¨ ¨ ¨ wN s and vector
b “ rb1 ¨ ¨ ¨ bN sJ P RN . For any S Ă rN s, let WJS denote the submatrix of WJ with
rows given by wJi for i P S. Define bS similarly for b.
Note that this is equivalent to Ψpxq being the outputs of a single hidden neural
network layer with N neurons fully connected to the input x P Rn, having nonlinear
activations, and being initialized with weights and biases defining unique hyperplanes.
Hence why we refer to the σi as activation functions.
We then define positive semidefinite to mean that each activation i P rN s satisfies
σipyq “ 0 for all y ď 0 and σipyq ą 0 for all y ą 0. In this chapter, we will focus on:
σcSpyq “
$’&’%
0 if y ď 0
c if y ą 0
(scaled step) (3.8)
σRpyq “
$’&’%
0 if y ď 0
y if y ą 0
(ReLU) , (3.9)
where c ą 0 is an arbitrary positive scalar.
3.2.1 Positive Semidefinite Activation Geometry
We define the following subsets for all activations i P rN s:
X`i “ tx P Rn | wJi x` bi ą 0u (active)
Xi˝ “ tx P Rn | wJi x` bi ď 0u (zero) .
Each is a half-space of Rn formed by the hyperplane wJi x ` bi “ 0. These are then
used to define the activation regions Aj , with indices corresponding to a binary string
48
indicating which active (binary 1) and zero (binary 0) half-spaces are in the intersection:
A0 “ XN˝ X XN˝´1 X . . . X X2˝ X X1˝
A1 “ XN˝ X XN˝´1 X . . . X X2˝ X X`1
...
A2N´2 “ X`N X X`N´1 X . . . X X`2 X X1˝
A2N´1 “ X`N X X`N´1 X . . . X X`2 X X`1 .
These partition Rn into, at most, 2N convex polytopes. It is likely that some of the
regions will be infeasible (Aj “ H). For example, in the n “ 1 case, there are always
only N ` 1 feasible activation regions, since there will be N unique (by assumption)
points partitioning the R line. In higher dimensions, more feasible regions are possible.
For all j P t0, . . . , 2N ´ 1u, we define the active set Sj as
Sj “ ti P rN s | wJi x` bi ą 0 @x P Aju .
This captures which of the N activations are active in a particular activation region.
For example with N ě 3, if A4 feasible it contains only a nonempty subset of X`3 which
in turn gives S4 “ t3u.
For any two activation regions j, k P t0, . . . , 2N ´ 1u with a nonempty intersection
Aj XAk ‰ H, their intersection defines a border between the two regions. We define a
nondegenerate border to mean dimpAj XAkq “ n´ 1. In this case, only one activation
is different (active to zero or vice-versa) between those regions. This is because the
borders between activation regions must be (subsets of) the hyperplanes that define
the half-spaces Xi, and the intersection of more than one unique hyperplane must have
dimension less than n´ 1. Thus, a nondegenerate border must be a subset of (or equal
to) a single hyperplane, meaning only one Xi half-space can flip from active to zero or
vice-versa.
49
We then define a degenerate border to mean dimpAj X Akq ă n ´ 1. In this case,
the border is (a subset of) the lower dimensional intersection of two or more unique
hyperplanes. Thus, multiple activations are different between the regions.
In Figure 3.1, we visualize an example withN “ 4 unique, 1-dimensional hyperplanes
partitioning the R2 state space. Nondegenerate borders are 1-dimensional line segments,
while degenerate borders are 0-dimensional points where the hyperplanes intersect. This
same example will be used for the numerical simulation presented in Section 3.4.
Now consider a continuous trajectory xt P Rn visiting a sequence of activation regions
over the time window rτ, τ ` T s, for some τ ě 0 and T ą 0. Assume xt only crosses
nondegenerate borders and that the number of regions visited is L ě 2. Let us denote
A1, . . . ,AL as the activation region indices of the visited sequence and I1, . . . ,IL´1 as
the sequence of activation indices corresponding to the hyperplanes crossed in order to
visit that sequence of activation regions. For example, if the sequence of visited regions
is A4,A12,A8 over the time window, then we have A1 “ 4,A2 “ 12,A3 “ 8 and
I1 “ 4,I2 “ 3 as the respective index sequences with L “ 3. In this case, activation
I1 “ 4 becomes active and then activation I2 “ 3 is turned off (becomes zero).
For all s P rLs, we define time window subsets Ts Ă rτ, τ ` T s as
Ts “ tt P rτ, τ ` T s | xt P AAsu . (3.10)
Since we assume xt crosses only nondegenerate borders when visiting the activation
regions during the time window, we have by definition that
SAs`1 “
$’’&’’%
SAs
Ť tIsu if AAs`1 Ď X`Is
SAs z tIsu if AAs`1 Ď XI˝s
(3.11)
holds over all s P rL ´ 1s. In words, this simply states that when xt crosses the
nondegenerate border from region AAs to region AAs`1 in the sequence, it must be
true that the active set for AAs`1 is the active set for AAs plus activation Is if it
50
becomes active, or minus activation Is if it turns off (becomes zero). Using the previous
example, going from activation region A12 to region A8 requires crossing the hyperplane
for activation I2 “ 3, which turns that activation off (becomes zero). Thus, it must be
true that S8 “ S12zt3u.
3.3 Sufficient Conditions for Persistency of Excitation with
(Scaled) Step and ReLU Activations
In this section we present our main theoretical results, giving novel sufficient conditions
for persistency of excitation with Ψ composed of the (scaled) step or ReLU activations.
3.3.1 Main Results
First, we present the theorem for (scaled) step functions.
Theorem 3.1. Let state trajectory xt be continuous and stay within some compact
set C Ă Rn for all t ě 0, and let Ψpxtq “ rσcSpwJ1xt ` b1q ¨ ¨ ¨ σcSpwJNxt ` bN qsJ be
composed of (scaled) step functions (3.8) with positive scalars c “ rc1 ¨ ¨ ¨ cN sJ together
with N unique affine transformations of Rn according to w1, . . . , wN P Rnzt0nu and
b1, . . . , bN P R. If xt over t ě 0 is such that there exists a window length T ˚ ą 0
whereby the sequence s P rLs of activation regions visited during any shift τ ě 0 of the
time window rτ, τ ` T ˚s always satisfies the following two conditions:
1. all i P rN s hyperplanes wJi xt ` bi “ 0 are crossed
2. only nondegenerate borders are crossed,
then Ψpxtq satisfies the persistency of excitation conditions (3.5) and (3.6).
Proof. Given in Appendix B.1.
Next, we present the theorem for ReLU functions. Here, WJAs denotes the |SAs | ˆn
51
submatrix of WJ with rows corresponding to the active set SAs , for each s P rLs of any
particular time window rτ, τ ` T ˚s.
Theorem 3.2. Let state trajectory xt be continuous and stay within some compact set
C Ă Rn for all t ě 0, and let Ψpxtq “ rσRpwJ1 xt` b1q ¨ ¨ ¨ σRpwJNxt` bN qs be composed
of ReLU functions (3.9) together with N unique affine transformations of Rn according
to w1, . . . , wN P Rnzt0nu and b1, . . . , bN P R. If xt over t ě 0 is such that there exists
a window length T ˚ ą 0 whereby the sequence s P rLs of activation regions visited
during any shift τ ě 0 of the time window rτ, τ `T ˚s always satisfies the following three
conditions:
1. all i P rN s hyperplanes wJi xt ` bi “ 0 are crossed
2. only nondegenerate borders are crossed
3. for each s P rLs, there are times t1, tˆ1, . . . , tM , tˆM P Ts, for any M ě n, such that
the nˆM matrix “pxt1 ´ xtˆ1q ¨ ¨ ¨ pxtM ´ xtˆM q‰ satisfies
rankpWJAsq “ rank
´
WJAs
“pxt1 ´ xtˆ1q ¨ ¨ ¨ pxtM ´ xtˆM q‰¯ ,
then Ψpxtq satisfies the persistency of excitation conditions (3.5) and (3.6).
Proof. Given in Appendix B.1.
Note that a sufficient condition for property 3. is that
“pxt1 ´ xtˆ1q ¨ ¨ ¨ pxtM ´ xtˆM q‰
has rank n. This can be achieved, for example, if xt is the state trajectory of a system
that satisfies a suitable (local) controllability property and can therefore reach points
within each activation region s P rLs that are not confined to some lower dimensional
subset of Rn, during all shifts τ ě 0 of the time window rτ, τ ` T ˚s.
It is possible that the sample paths of the adaptive stochastic process xt from Chap-
ter 2 could be shown to satisfy this condition with probability 1, due to the spatial
“jittering” (but still continuous) induced by the additive Brownian motion.
52
3.3.2 Proof Sketch
Both proofs rely on the same overall contradiction method. That is, we assume there
exists a nonzero vector v P RNzt0Nu such that
vJ
ˆż τ`T
τ
ΨpxtqΨpxtqJdt
˙
v “ 0 (3.12)
holds for all shifts τ ě 0 and all window lengths T ą 0.
We proceed by showing that if the state trajectory xt meets certain requirements
over any shift τ ě 0 of the window rτ, τ ` T ˚s, for some window length T ˚ ą 0, then
in fact (3.12) can only hold if v “ 0N . This is a contradiction and thus proves that
the LHS integral must be a strictly positive definite matrix, uniformly over all windows
rτ, τ ` T ˚s for all τ ě 0.
3.3.3 Extension to Other Activation Functions
In Appendix B.1 we discuss how these methods could be applied similarly to other
nonlinear, positive semidefinite activation functions σ, such that σpyq “ 0 for all y ď 0
and σpyq ą 0 for all y ą 0.
3.4 Application to Linear MRAC
In this section, we provide a numerical simulation of the theoretical results presented
in Section 3.3, using a linear MRAC application which is a variation on the setup from
Chapter 9 of [41]. We will implement adaptive control in order to drive the plant state
to track the state of a reference linear system, in spite of uncertainties in the plant’s
parameterization of an additive nonlinear vector function. The plant and reference
systems have n “ 2 states, allowing convenient visualization of the hyperplanes and
state space.
53
3.4.1 Setup
We largely follow the general linear MRAC setup introduced in [41], with some simpli-
fications. The plant is given by
9xt “ Axt `B
`
ut `ΘJΨpxtq
˘
, (3.13)
where A is a known nˆ n state matrix for the plant state xt P Rn, B is a known nˆ `
input matrix for the input ut P R`, and Θ is an unknown N ˆ ` matrix which linearly
parameterizes the known vector function Ψ : Rn Ñ RN defined by (3.7). We assume
pA,Bq is controllable. The general setup also includes an unknown diagonal scaling
matrix Λ, such that the overall input matrix is BΛ. We have omitted this for simplicity.
The control input ut will be designed in order to force the plant states to track the
states of a controllable, linear reference system given by
9xrt “ Ar xrt `Br rt , (3.14)
where Ar is a known Hurwitz nˆn reference state matrix for the reference state xrt P Rn,
Br is a known nˆ ` reference input matrix, and rt P R` is a bounded reference input.
We thus assume there exists an n ˆ ` matrix of feedback gains Kx and an ` ˆ `
matrix of feedforward gains Kr satisfying the matching conditions
A`BKJx “ Ar
BKJr “ Br . (3.15)
The general setup has A and Λ as unknown, and thus Kx and Kr need to be estimated.
Here, we assume that Kx and Kr can be directly calculated from known A and B, and
used directly in the control law.
Remark 3.2. The first condition of (3.15) is feasible if some Hurwitz Ar can be reached
from pA,Bq by state feedback. Towards this end, it is often practical to solve for an LQR
feedback gain matrix KLQR and then check that Ar “ A`BKJLQR is Hurwitz.
54
Next, we introduce parameter estimates pΘt, which will be dynamically updated to
estimate true parameter values Θ. Thus, by applying to the plant (3.24) the feedback
control law ut “ KJx xt `KJr rt ´ pΘJtΨpxtq, the plant dynamics become
9xt “ Ar xt `Br rt ´B ppΘt ´ΘqJΨpxtq . (3.16)
This in turn gives the dynamics of the state tracking error et “ xt ´ xrt as
9et “ 9xt ´ 9xrt “ Ar et ´B ppΘt ´ΘqJΨpxtq . (3.17)
with xr0 set to x0, so that e0 “ 0n.
As shown in Chapter 9 of [41], these state tracking error dynamics are globally
uniformly asymptotically stable, such that limtÑ8 }et}2 “ 0, if the parameter estimates
are dynamically updated as
9pΘt “ Γ Ψpxtq eJt PxB . (3.18)
This is shown by using Barbalat’s lemma along with analyzing the Lyapunov function
V “ eJt Pxet ` tr
`ppΘt ´ΘqJΓ´1ppΘt ´Θq˘ (3.19)
such that the update rule (3.18) results in
9V ď ´eJt Qxet ď 0 (3.20)
for all values of et and pΘt´Θ. Here, Px is the unique symmetric, positive-definite nˆn
matrix that solves the algebraic Lyapunov equation
PxAr `AJr Px “ ´Qx (3.21)
for some symmetric, positive-definite nˆ n matrix Qx, and adaptation rates Γ is some
symmetric, positive-definite N ˆN matrix, where we denote G :“ }Γ}F .
55
Remark 3.3. In the case where Kx and Kr are also being estimated by Kˆx,t and Kˆr,t
respectively, the true values can be appended to Θ, the estimates can be appended topΘt, and an overall vector function Φpxt, rtq which combines xt, rt, and Ψpxtq, can be
formed, similar to (2.21). However, such Φ then do not strictly meet the definition of
(3.7), where each component function is a positive, semidefinite activation, and are thus
beyond the scope of the results in this chapter.
3.4.2 Persistency of Excitation
The dynamic update rule (3.18) only guarantees asymptotic convergence of the state
tracking error et to zero. We now show that the parameter estimation error, which we
will denote as rΘt “ pΘt ´Θ, also goes to zero if Ψpxtq is persistently exciting.
Since Ar is Hurwitz, it is guaranteed invertible. And so, from (3.17) we have
eJt PxB “ 9eJt A´1r JPxB `ΨpxtqJrΘtBJA´1r JPxB
and then combined with (3.18) we get
9rΘt “ 9pΘt “ Γ Ψpxtq eJt PxB
“ Γ ΨpxtqΨpxtqJrΘtBJA´1r JPxB ` Γ Ψpxtq 9eJt A´1r JPxB
“ ´12Γ ΨpxtqΨpxtqJrΘtBJA´1r JQxA´1r B ` Et , (3.22)
where where the last inequality uses (3.21) and the invertibility of Ar to get that
A´1r
J
Px “ ´12A´1r JQxA´1r .
In order to apply the results of this chapter, we will assume that Ψpxtq is bounded
with xt being contained in a compact set C Ă Rn. Also, from (3.20) we have that et
and rΘt are bounded for all t ě 0. And so, from (3.16), (3.17), and (3.18) we have
respectively that 9xt, 9et, and
9pΘt are bounded for all t ě 0. This in turn means that
:et “ Ar 9et ´B 9pΘJt Ψpxtq ´BrΘJt ddxΨpxtq 9xt
56
is bounded for all t ě 0, since the Jacobian
d
dxΨpxtq “
»————–
w1,1σ
1
1pwJ1xt ` b1q ¨ ¨ ¨ w1,nσ11pwJ1xt ` b1q
...
. . .
...
wN,1σ
1
N pwJNxt ` bN q ¨ ¨ ¨ wN,nσ1N pwJNxt ` bN q
fiffiffiffiffifl
is composed of scaled derivatives σ1pyq of the (scaled) step (3.8) and ReLU activations
(3.9), which are globally bounded almost everywhere over y P R. This allows us to use
Barbalat’s lemma again (with respect to 9et) to conclude that limtÑ8 } 9et}2 “ 0. And
so, we can conclude that the Et term in (3.22) is bounded for all t ě 0 and that it
asymptotically goes to limtÑ8 }Et}F “ 0.
Let us now restrict to the case of ` “ 1. Since Qx is positive-definite we have the
positive scalar c “ 12BJA´1r JQxA´1r B, and obtain from (3.22) that the dynamics of the
parameter estimation error vector rΘt P RN asymptotically tend to the LTV system
9rΘt “ ´Γc ΨpxtqΨpxtqJrΘt , (3.23)
where Γc “ cΓ is symmetric, positive-definite with }Γc}F “ cG.
Therefore, if Ψpxtq is persistently exciting by condition (3.5), then the parameter
estimation error dynamics (3.23) are globally uniformly asymptotically stable such that
we have limtÑ8 }rΘt}2 “ 0. We show explicitly in Appendix B.2 how this follows from
Lemmas 1 and 2 and Theorem 1 of [2].
3.4.3 Numerical Simulation
We perform a numerical simulation in the xt P R2 state space. Euler integration with
a timestep of dt “ 0.001 seconds is used to obtain all numerical solutions to differential
equations.
57
We use a basic controllable canonical form for the plant
9xt “
»– 0 1
´a1 2a2
fifl»–xt,1
xt,2
fifl`
»–0
β
fifl`ut `ΘJΨpxtq˘ , (3.24)
where a1, a2, β ą 0 are positive scalars, ut P R is the ` “ 1 dimensional input, and Θ is a
fixed vector of nonzero scalars which linearly parameterizes the known nonlinear vector
function Ψpxtq “ rσRpwJ1 xt` b1q ¨ ¨ ¨ σRpwJ4 xt` b4qsJ with ReLU activation functions,
meeting the definition (3.7). This plant model is an n “ 2 example of a general class of
single-input systems with dynamics characterized by an nth order nonlinear ordinary
differential equation. See section 9.5 in [41] and section 4.1 in [35] for examples of
physical systems in this form using various nonlinear functions.
The reference system is a unity-gain damped harmonic oscillator with a bounded
rt P R reference input
9xrt “
»– 0 1
´ω20 ´2ξω0
fifl»–xrt,1
xrt,2
fifl`
»– 0
ω20
fifl rt , (3.25)
where ω0, ξ ą 0 are the natural frequency and damping ratio. This gives the plant and
reference eigenvalues as λ “ a2 ˘
a
a22 ´ a1 and λr “ p´ξ ˘
a
ξ2 ´ 1 qω0, thus A is
unstable with oscillations if a1 ą a22 and Ar is always Hurwitz, and without oscillations
if ξ ě 1. Thus, we can always directly calculate the required feedback and feedforward
gains that satisfy the matching conditions (3.15) as
KJx “
„
´ω
2
0 ´ a1
β
´ 2ξω0 ` 2a2
β

and KJr “
„
ω20
β

.
For the plant system, we use a1 “ 2, a2 “ 0.5, β “ 0.75. This results in an unstable,
oscillatory A matrix with λ “ 0.5 ˘ j?1.75 . The reference system uses ω0 “ 2 and
ξ “ 1, ensuring that Ar is Hurwitz and nonoscillatory with λr “ ´2 . The matching
conditions are thus satisfied with KJx “ r´2.667 ´ 6.667s and KJr “ r5.333s. For the
58
parameter estimate dynamic updates (3.18) we use
Γ “
»———————–
5 0 0 0
0 1 0 0
0 0 5 0
0 0 0 2
fiffiffiffiffiffiffiffifl
Qx “
»–1 0
0 10
fifl Px “
»–5.625 0.125
0.125 1.281
fifl ,
and for the linearly parameterized nonlinear vector function ΘJΨpxtq we use
Θ “
»———————–
´1.2
2.7
0.8
´3.2
fiffiffiffiffiffiffiffifl
WJ “
»———————–
2 1
1 ´2
1.5 ´0.5
0.5 2
fiffiffiffiffiffiffiffifl
b “
»———————–
1
2
2.5
3
fiffiffiffiffiffiffiffifl
.
In Figure 3.1 we plot points to highlight each activation region, partitioning the state
space R2.
Defining parameter estimates pΘt “ rθˆt,1 ¨ ¨ ¨ θˆt,4sJ and using the known Kx and
Kr, we apply the feedback control law ut “ KJx xt `KJr rt ´ pΘJtΨpxtq to the plant and
update the parameter estimates according to the update law (3.18). We use the following
bounded reference inputs to drive the reference system as two different scenarios:
r
p1q
t “ 10 sinp0.5tq rp2qt “ 40`
2ÿ
k“1
10 sinp0.25k tq .
The resulting plant and reference state trajectories xt and x
r
t are plotted in state
space along with the hyperplanes WJx`b “ 04, in Figure 3.2 and Figure 3.3 respectively
for the two scenarios. We see clearly that the limit cycles in both cases stay within a
compact set C P R2, and so the ReLU activations within Ψ are bounded on this C. Note
that the trajectories never maintain a linear path, which is sufficient to meet property
(iii) of Theorem 3.2. In the first case, the state trajectories cross all N “ 4 hyperplanes
at nondegenerate borders persistently. In the second case, the limit cycle only crosses
two of the hyperplanes persistently.
59
The norm of the state tracking error }et}2 and the parameter estimation error
}pΘt ´ Θ}2 for both cases are plotted over simulation time in Fig. 3.4. In both cases,
we see the state tracking error converging to zero. For the case with r
p1q
t , we see that
the parameter estimation error converges to zero as the parameter estimators converge
to the true values, while for the case with r
p2q
t the error does not converge and is well
above zero.
−10.0 −7.5 −5.0 −2.5 0.0 2.5 5.0 7.5 10.0
xt, 1
−10.0
−7.5
−5.0
−2.5
0.0
2.5
5.0
7.5
10.0
x t,
2
0 00001 00012 00103 00114 01005 01016 01107 01118 10009 100110 101011 101112 110013 110114 111015 1111
X1
X2
X3
X4
Figure 3.1: State space x P R2 showing N “ 4 hyper planes WJx ` b “ 04. There are
11 feasible activation regions Aj (color dots) and 5 infeasible regions (no dots).
60
−10.0 −7.5 −5.0 −2.5 0.0 2.5 5.0 7.5 10.0
xt, 1
−6
−4
−2
0
2
4
6
x t,
2
Figure 3.2: Phase plot of xt (cyan), tracking into x
r
t (magenta) driven by r
p1q
t , and
clearly crossing all N hyperplanes at nondegenerate borders along the limit cycle.
61
−10 0 10 20 30 40 50 60
xt, 1
−15
−10
−5
0
5
10
15
20
25
30
x t,
2
Figure 3.3: Phase plot of xt (cyan), tracking into x
r
t (magenta) driven by r
p2q
t , but does
not cross all N hyperplanes along the limit cycle.
62
0.00
0.25
0.50
0.75
1.00
1.25
1.50
1.75
2.00
||et||(1)2
||et||(2)2
0 20 40 60 80 100 120 140
time t
0
1
2
3
4
5
6
7
||Θ̂t−Θ||(1)2
||Θ̂t−Θ||(2)2
Figure 3.4: Norm of state tracking error }et}2 converges to zero in both cases, while the
parameter estimation error }pΘt ´ Θ}2 converges to zero for the case with rp1qt and but
not for the case with r
p2q
t .
Chapter 4
Function Approximation
In this chapter, we consider the approximation of target functions f : Rn Ñ R that
are continuous on a bounded set B Ă Rn containing the origin, which we denote as
f P CpBq. We will generally use the term function approximation to refer to (the
existence of) a finite linear combination of basis functions, such that some norm of the
difference between a target function and its approximation, over x P B, is bounded.
Formally, for a given target function f P CpBq, positive integer N P N, and positive
scalar εN ą 0 (which may depend on N) there exists some vector θ P RN and nonlinear,
continuous almost everywhere basis functions B1, . . . ,BN : Rn Ñ R such that the
approximation pfpxq :“ Nÿ
i“1
θiBipxq (4.1)
satisfies ›››fpxq ´ pfpxq›››
B
ď εN . (4.2)
Here, we denote the the L2pB, µ¯Bq norm
}fpxq}B :“
ˆż
B
|fpxq|2 µ¯Bpdxq
˙ 1
2
with µ¯B as the uniform probability measure over B such that
ş
B µ¯Bpdxq “ 1.
63
64
In particular, the focus of this chapter will be on parameterized basis functions
B : Rn Ñ R, with the parameters denoted as points γ in some parameter set Υ Ď Rp.
In this case, we denote pfN as an approximation of the form (4.1) using the parameterized
basis Bpx ;γ1q, . . . ,Bpx ;γN q for some fixed parameters γ1, . . . ,γN P Υ and the vector
θ P RN as pfN pxq :“ Nÿ
i“1
θiBpx ;γiq . (4.3)
On the other hand, if we randomly sample the parameters γ1, . . . ,γR according to
some probability density PΥ over Υ to form the basis Bpx ;γ1q, . . . ,Bpx ;γRq, for some
integer R P N, we have the randomly parameterized approximation pFR. Here again, we
mean that for some target function f P CpBq and positive scalar εR ą 0, which may
depend on R, there exists (in high probability) some vector ϑ P RR such that
pFRpx ;γ1, . . . ,γRq :“ Rÿ
j“1
ϑj Bpx ;γjq (4.4)
satisfies ›››fpxq ´ pFRpx ;γ1, . . . ,γRq›››B ď εR .
We summarize the organization of the chapter’s sections, as follows:
1. Section 4.1 provides setup for function approximation with activation functions,
2. Section 4.2 discusses important existing results for base approximations which are
convex combinations of step and ReLU activation functions,
3. Section 4.3 gives a novel result on mollified integral representations to base ap-
proximations, resulting in mollified approximations,
4. Sections 4.4 and 4.5 provide a general setup for random approximations and a
main result on their approximation error,
5. Sections 4.6 and 4.7 extend the results to the supremum norm and show how this
can be applied to an extension of the linear MRAC setup.
65
4.1 Function Approximation with Activation Functions
For the rest of the chapter, we will assume that the parameterized basis function B is
a composition of some nonlinear, continuous almost everywhere function σ : R Ñ R
together with an affine transformation of Rn into R according to the parameter γ P Υ.
We define each point γ :“ rwJ bsJ “ rw1 . . . wn bsJ with w P Rnzt0nu as the weight
vector and b P R as the bias term. We assume the parameter set Υ is a bounded subset
of Rnzt0nu ˆ R.
And so, we set Bpx ;γiq “ σpwJi x ` biq “ σpγJiXq, where X :“ rx1 ¨ ¨ ¨ xn 1sJ
and we denote γi “ rwJi bisJ “ rwi,1 . . . wi,n bisJ, for all i P rN s. We assume each
wJi x` bi “ 0 hyperplane in Rn is unique and has dimension n´1. The same holds with
randomly sampled γj P Υ and the corresponding Bpx ;γjq “ σpγJjXq, for all j P rRs.
Note that this is equivalent to a feedforward neural network with a single hidden
layer of N (or R) neurons having respective outputs σpwJi x` biq “ σpγJiXq. Hence why
we refer to σ as an activation function, as the output of each neuron is the activation
of the weighted inputs (plus bias term) to the neuron.
We will assume that the activation function σ has limited growth as follows.
Assumption 4.1. The nonlinear, continuous almost everywhere activation function
σ : RÑ R is bounded (growth) over all feasible inputs, such that either of
1. |σpuq ´ σpvq| ď Dσ
2. |σpuq ´ σpvq| ď Lσ |u´ v|
hold for all u, v P I :“ tγJX | pγ, Xq P Υˆ B ˆ t1uu, with positive scalar Dσ or Lσ.
If we shift the activation to the origin as qσpyq :“ σpyq ´ σp0q, this maintains the
limited growth assumption in either case. The (scaled) step function σcS as in (3.8) and
the ReLU function σR as in (3.9) are examples meeting this assumption, with Dσ “ c
and Lσ “ 1 respectively.
66
4.1.1 Base and Random Approximations
We will use the terms base approximation and random approximation to some target
function f P CpBq to respectively refer to the pfN in (4.3) and the pFR in (4.4) where the
parameterized basis function is Bpx ;γq “ σpγJXq and uses some activation function σ
satisfying Assumption 4.1.
And so, the base approximation is defined for some vector θ P RN and fixed param-
eters γ1, . . . ,γN P Υ, for any integer N P N, as
pfN pxq “ Nÿ
i“1
θi σpγJiXq , (4.5)
and the random approximation is defined for some vector ϑ P RR and randomly sampled
parameters γ1, . . . ,γR P Υ, for any integer R P N, as
pFRpx ;γ1, . . . ,γRq “ Rÿ
j“1
ϑj σpγJjXq . (4.6)
Note that we use integer R and index j for random approximations to distinguish them
from base approximations, but otherwise there is no significance.
4.2 Existing Results for Base Approximations
In this section, we discuss some relevant existing results for base approximations in the
L2 norm when using step and ReLU activations.
4.2.1 Convex Combinations of Step and ReLU Activations
Classic results from [7] and [12] prove the existence of convex linear combinations of
step and ReLU activations that approximate functions in specific classes arbitrarily well,
with an approximation error proportional to 1?
N
. Importantly, the convexity property
allows the length (number of terms) N to be arbitrarily large while maintaining the
67
same overall bound on the coefficient size
řN
i“1 |θi|. As we will show in later sections,
our results do not depend on the length N of the base approximation, but rather only
on the coefficient size bound.
First, we define these function classes. Note that B Ď B0pρq, an origin-centered ball
of sufficiently large radius ρ, and let us define |ξ|B :“ supxPB
ˇˇ
ξJx
ˇˇ ď ρ }ξ}2 for any
ξ P Rn. Let there be a function g P CpBq, a complex measure rGpdξq “ eiφpξqGpdξq (for
some φ : Rn Ñ R and variation G “ | rG|), and a constant C ą 0 such that:
1. qgpxq “ gpxq ´ gp0nq “ ż
Rn
`
eiξ
Jx ´ 1˘ rGpdξq (4.7)
2. Cp1qg :“
ż
Rn
|ξ|B Gpdξq ď
ż
Rn
ρ }ξ}2 Gpdξq ď C (4.8)
hold for all x P Rn. Then, such a function g belongs to the class F p1qB,C .
Similarly, if the function g P CpBq is differentiable at the origin and there is a
complex measure rGpdξq and constant C ą 0 satisfying:
1. qgpxq “ gpxq ´∇gp0nqJx´ gp0nq “ ż
Rn
`
eiξ
Jx ´ iξJx´ 1˘ rGpdξq (4.9)
2. Cp2qg :“
ż
Rn
|ξ|2B Gpdξq ď
ż
Rn
ρ2 }ξ}22 Gpdξq ď C (4.10)
for all x P Rn, then g belongs to the class F p2qB,C .
Furthermore, any function f P CpBq that is equal to g on B (equivalently, qf is equal
to qg on B) is also in the respective function class. This means, if we are interested in a
particular f , we can define g “ f on B and then outside of B we may define g in any
manner that helps to satisfy the above requirements.
An example of a complex measure meeting these requirements is rGpdξq “ rgpξqdξ,
where rgpξq is the Fourier transform of the function g, such that the variation Gpdξq “
|rgpξq|dξ is able to meet the second requirement for some C ą 0.
In Appendix C.1, we show for both class definitions that the second conditions
establish the integrability of the respective first conditions.
68
We now state the existing results for step and ReLU activation functions.
Theorem 4.1 (Barron[7] Theorem 1). For any constant C ą 0 and bounded set B Ă Rn
containing the origin, there exists a base approximation pfN of the form (4.5) using
the step activation function σS with parameters satisfying |w|B “ 1 and |b| ď 1, for
any integer N ě 1, to every function f P F p1qB,C such that the approximation errorrfN pxq “ qfpxq ´ pfN pxq to the affine shifted qfpxq “ fpxq ´ fp0nq satisfies the L2pB, µ¯Bq
norm bound ››› rfN›››B “
dż
B
ˇˇˇ rfN pxqˇˇˇ2 µ¯Bpdxqq ď 2C?
N
(4.11)
with coefficient size bounded as
řN
i“1 |θi| ď 2C.
Proof. The proof is given in Appendix C.2.
Theorem 4.2 (Breiman[12] Theorem 3). For any constant C ą 0 and bounded set
B Ă Rn containing the origin, there exists a base approximation pfN of the form (4.5)
using the ReLU activation function σR with parameters satisfying |w|B “ 1 and |b| ď 1,
for any integer N ě 1, to every function f P F p2qB,C such that the approximation errorrfN pxq “ qfpxq´ pfN pxq to the affine shifted qfpxq “ fpxq´∇fp0nqJx´ fp0nq satisfies the
L2pB, µ¯Bq norm bound
››› rfN›››B “
dż
B
ˇˇˇ rfN pxqˇˇˇ2 µ¯Bpdxqq ď
b
16
3 C?
N
(4.12)
with coefficient size bounded as
řN
i“1 |θi| ď 2C.
Proof. The proof is given in Appendix C.3.
Remark 4.1. The definitions of the function class are with respect to some constant C
which then shows up directly in the results for this section. Those results do not directly
depend on the input dimension n of the functions. This may lend an appearance of
avoiding the curse of dimensionality. However, we can see from the definitions of C
p1q
g
69
and C
p2q
g that both are n-dimensional integrals, and in general nothing is assumed about
the complex measure rG which allows them to avoid this dimensionality. Thus, for any
fixed C the size of the function class will decrease as the dimension increases.
4.3 Mollified Approximations
In this section, we define the mollified approximation pfλ,N using a novel method of
forming a mollified integral representation for a base approximation pfN to a target
function f P CpBq, and bound their difference in the L2pB, µ¯Bq norm.
4.3.1 Functions with Integral Representation
If a function f : Rn Ñ R has an integral representation over the parameters set Υ with
some activation function σ : RÑ R, then
fpxq “
ż
Υ
gpγqσpγJXqdγ (4.13)
holds with some bounded, continuous coefficient function g : Υ Ñ R. This is a version of
(4.5) that is continuous in the parameters γi over Υ. Indeed, if we define γ1, . . . ,γN to
be a sequence of points that uniformly grids Υ with volume dN between any neighboring
points, then as N Ñ8 (and dN Ñ 0) we have
lim
NÑ8
Nÿ
i“1
gpγiqσpγJiXqdN “
ż
Υ
gpγqσpγJXqdγ “ fpxq .
This implies that the (infinite) sum of the absolute values of the coefficients is finite
lim
NÑ8
Nÿ
i“1
|gpγiq|dN “
ż
Υ
|gpγq|dγ ă 8
since g is bounded and the integral is over the bounded set Υ .
However, finding such integral representations (4.13) for general classes of functions f
and activation functions σ is a nontrivial task. If both are L1 (absolutely integrable over
70
Rn and R respectively), then Theorem 1 in [29] shows that the integral representation
can always be constructed with g related to the Fourier transforms of f and σ. But
this is a limiting restriction, since none of the typical activation functions like sigmoids,
steps, ReLU’s, and their variants meet this requirement. Even so, in [22] this was used
to prove the universal approximation properties of sigmoid activations.
4.3.2 Mollified Integral Representation
In this section, we will propose a novel method of forming an integral representation for
any function f P CpBq, so long as a base approximation pfN of the form (4.5) exists and
the bounded parameter set Υ Ă Rn`1 and a bound SN on the coefficient size řNi“1 |θi|
are known.
For any fixed parameters γ1, . . . ,γN P Υ, this approximation pfN can be defined as
pfN pxq :“ Nÿ
i“1
θi σpγJiXq “
ż
Υ
Nÿ
i“1
θi δ
n`1pγ´ γiq σpγJXq dγ , (4.14)
where δn`1 is the pn ` 1q-dimensional Dirac delta which satisfies for any γi P Υ thatş
Υ δ
n`1pγ´ γiqσpγJXq dγ “ σpγJiXq .
Thus, the RHS of (4.14) is like an integral representation (4.13) of pfN with a coef-
ficient mapping gpγq “ řNi“1 θi δn`1pγ´ γiq. However, this mapping is not continuous
and bounded as required, and the Dirac delta is not even a function in the regular sense.
Indeed, if the overall coefficients of a random approximation pFR were set as ϑj “ gpγjq
for each randomly sampled γ1, . . . ,γR P Υ, then in this case they almost surely would
all be zero.
We propose the Mollified integral representation to solve this issue, where the Dirac
delta in (4.14) is replaced with the mollified delta function δˆn`1λ . Recalling that γ “
71
rw1 ¨ ¨ ¨ wn b s, this is defined as the pn` 1q-dimensional bump function
δˆn`1λ pγq :“
$’’’’’’&’’’’’’%
pηλqn`1 exp
ˆ ´1
1´ λ2w21
˙
¨ ¨ ¨ exp
ˆ ´1
1´ λ2w2n
˙
exp
ˆ ´1
1´ λ2b2
˙
γ P p´ 1λ , 1λqn`1
0 otherwise
(4.15)
with η´1 “ ş1´1 exp´ ´11´y2¯dy « 0.444 giving ş δˆn`1λ pγ´ γiq dγ “ ş δn`1pγ´ γiq dγ “ 1
for any n ě 1 and γi P Rn`1.
We term the positive scalar λ ą 0 as the mollification factor. This controls the height
and width of the mollified delta δˆn`1λ . Since the height scales as λn`1 and the integral
remains constant, this requires the width to shrink. Figure 4.1 plots the 1-dimensional
case for different λ values.
−1.00 −0.75 −0.50 −0.25 0.00 0.25 0.50 0.75 1.00
γ
0.0
0.5
1.0
1.5
2.0
2.5
3.0
λ = 1
λ = 2
λ = 4
Figure 4.1: The mollified delta δˆλ for different λ values. In each case,
ş1{λ
´1{λ δˆλpγqdγ “ 1.
The support of δˆn`1λ pγ ´ γiq is the pn ` 1q-dimensional open cube set p´ 1λ , 1λqn`1
centered about any γi P Υ. And so, if γi is sufficiently close to (or on) the boundary of
Υ, then it no longer holds that the entire support of δˆn`1λ pγ´γiq is within Υ. Thus, we
define the λ-expanded parameters set Υλ such that the support is always fully contained,
72
up to and include the boundary of Υ. For example, if Υ is rectangular product of
intervals rα1, β1sˆ¨ ¨ ¨ˆrαn, βns, then each interval is expanded as rαi´ 1λ , βi` 1λ s to obtain
Υλ. With this λ-expanded parameters set, we always maintain that
ş
Υλ
δˆn`1λ pγ´γiqdγ “
1 for all λ ą 0, n ě 1, and any γi P Υ.
Since Υ Ă Υλ for all λ ą 0, then (4.14) is equivalently
pfN pxq “ ż
Υλ
Nÿ
i“1
θi δ
n`1pγ´ γiq σpγJXq dγ . (4.16)
And so, by replacing the Dirac delta in (4.16) with the mollified delta δˆn`1λ , for any
choice of λ ą 0, we obtain the corresponding mollified approximation
pfλ,N pxq :“ ż
Υλ
Nÿ
i“1
θi δˆ
n`1
λ pγ´ γiq σpγJXq dγ (4.17)
using a mollified integral representation. This now gives the required continuous, bounded
coefficient mapping for any fixed γ1, . . . ,γN P Υ as
gλpγq :“
Nÿ
i“1
θi δˆ
n`1
λ pγ´ γiq . (4.18)
We illustrate the conceptual transformation for the 1-dimensional case in Figure 4.2.
Figure 4.2: The base approximation pfN is transformed into the mollified approximationpfλ,N by replacing Dirac deltas δ with the mollified deltas δˆλ.
73
4.3.3 Main Result
Let us define Epxq :“ } rx1 ¨ ¨ ¨ xn 1s }2 “ }X}2 as the Euclidean norm mapping from
Rn ˆ t1u to R, and then define
EB :“ }Epxq}B (4.19)
as the L2pB, µ¯Bq norm of this mapping over the Euclidean distances of x P B. We can
now state the main result, which bounds the difference between a base approximation
and its corresponding mollified approximation, for a chosen mollifcation factor λ, in the
L2pB, µ¯Bq norm.
Theorem 4.3. Let there exist a base approximation pfN of the form (4.5) using any
activation function σ satisfying Assumption 4.1, for any N P N, with some unknown
parameters γ1, . . . ,γN P Υ and coefficients θ P RN , but having a known bound on the
coefficient size of
řN
i“1 |θi| ď SN and the parameters set Υ known. Then, the mollified
approximation pfλ,N over the λ-expanded parameters set Υλ in (4.17) satisfies either of:
iq
››› pfN ´ pfλ,N›››B ď SN Dσ (4.20)
iiq
››› pfN ´ pfλ,N›››B ď SN Lσ
?
n` 1
λ
EB . (4.21)
Proof. The proof is given in Appendix C.4.
Remark 4.2. The results of Theorem 4.3 show that if the activation function σ is
Lipschitz over its feasible inputs, then as λÑ8 we have pfλ,N pxq Ñ pfN pxq in the sense
of the L2pB, µ¯Bq norm.
74
4.4 Random Approximations to Mollified Approximations
In this section, we focus on random approximations pFR of the form (4.6) and we prove
that they exist (in high probability) such that››› pfλ,N pxq ´ pFRpx ;γ1, . . . ,γRq›››B ď εR (4.22)
holds for a positive scalar εR ą 0 that is proportional to 1?R . Here, pfλ,N pxq is a mollified
approximation as described in Section 4.3.2.
4.4.1 Expected Value of Random Approximations to Functions with
Integral Representations
An important property of functions that have integral representations with some σ is
that random approximations pFR of the form (4.6) exist which are exactly equal to the
function in expected value over the randomly drawn parameters.
And so, the following lemma builds upon the mollified integral representation of the
mollified approximation pfλ,N to show that there always exists random approximations
with randomly sampled parameters from the λ-expanded set Υλ satisfying this property.
Lemma 4.1. Let the target function f P CpBq satisfy Theorem 4.3 and thus have the
mollified approximation pfλ,N as in (4.13) with activation σ for the selected mollification
factor λ ą 0, which has the integral representation (4.16) over the λ-expanded param-
eters set Υλ. If the random parameters γ1, . . . ,γR are sampled iid according to any
density Pλ which is nonzero over Υλ, then there exists a random approximation pFR of
the form (4.6) using the activation σ such that
E
γj„Pλ
” pFRpx ;γ1, . . . ,γRqı “ pfλ,N pxq . (4.23)
Proof. The proof is given in Appendix C.5.
75
4.4.2 Setting Up McDiarmid’s Inequality
For the remainder of the section, we closely follow the strategy in Lemma 4 of [67], which
uses McDiarmid’s inequality to obtain the final result. We start by defining function
hB : ΥRλ Ñ R, which captures the norm difference between the random approximation
and its expected value over the iid draws of pγ1, . . . ,γRq P ΥRλ , as
hBpγ1, . . . ,γRq :“
›››› pFRpx ;γ1, . . . ,γRq ´ Eγj„Pλ
” pFRpx ;γ1, . . . ,γRqı››››
B
. (4.24)
These are precisely the kinds of functions which McDiarmid’s inequality bounds (see
Appendix C.1 for a basic definition of the inequality). In order to use this inequality,
we must prove two properties of this function: 1) the absolute value of the difference in
h when any one component γj is different, is bounded, and 2) the expected value of h
itself (over the random parameters) is bounded.
Let us first make the following definitions. With δˆn`1λ as in (4.15) for any λ ą 0, we
define
gλpγq “
Nÿ
i“1
θi δˆ
n`1
λ pγ´ γiq ùñ maxγPΥλ |gλpγq| ď
Nÿ
i“1
|θi|
ˆ
ηλ
e
˙n`1
“: gλmax . (4.25)
For the corresponding Υλ, we define the diameter DΥλ :“ supt}γ ´ φ}2 | γ, φ P Υλu
and largest point γλmax :“ supt}γ}2 | γ P Υλu, and for the selected Pλ we define the
minimum Pλmin :“ inftPλpγq | γ P Υλu.
We prove the two required properties with the following lemma.
Lemma 4.2. Let hB be defined as in (4.24) for a random approximation pFR of the form
(4.6) which satisfies the result of Lemma 4.1 to some mollified approximation pfλ,N from
Theorem 4.3, both using activation function σ which satisfies Assumption 4.1. Let the
random parameters γ1, . . . ,γR be sampled iid according to any nonzero density Pλ over
the λ-expanded set Υλ, which has diameter DΥλ and largest point γλmax, and let EB be
as in (4.19). Then for all pγ1, . . . ,γj , . . . ,γRq, pγ1, . . . , rγj , . . . ,γRq P ΥRλ differing only
76
at coordinate j, either of the following hold for all j P rRs:
ˇˇ
hBpγ1, . . . ,γj , . . . ,γRq ´ hBpγ1, . . . , rγj , . . . ,γRqˇˇ ď
iq Dσ gλmax
PλminR
ˆ
3` 2 |σp0q|
Dσ
˙
(4.26)
iiq Lσ gλmax
PλminR
ˆ
pDΥλ ` 2γλmaxqEB ` 2 |σp0q|Lσ
˙
. (4.27)
And the expected value of hB satisfies either of:
E
γj„Pλ
”
hBpγ1, . . . ,γRq
ı
ď
iq Dσ gλmax
Pλmin
?
R
ˆ
1` |σp0q|
Dσ
˙
(4.28)
iiq Lσ gλmax
Pλmin
?
R
ˆ
γλmaxEB ` |σp0q|L
˙
. (4.29)
Proof. The proof is given in Appendix C.6.
4.4.3 Using McDiarmid’s Inequality
With the two required properties of hB proven, we can now use McDiarmid’s inequality
to to achieve the bound (4.22), given by the following theorem.
Theorem 4.4. Let hB be defined as in (4.24) and satisfy the results of Lemma 4.2,
for some mollified approximation pfλ,N from Theorem 4.3, using activation function σ
which satisfies Assumption 4.1. Let the random parameters γ1, . . . ,γR be sampled iid
according to any nonzero density Pλ over the λ-expanded set Υλ, which has diameter
DΥλ and largest point γλmax, and let EB be as in (4.19). Then for any ν P p0, 1q, with
probability greater than 1 ´ ν there exists a random approximation pFRpx ;γ1, . . . ,γRq
of the form (4.6) with |ϑj | ď gλmaxPλminR for all j P rRs satisfying the result of Lemma 4.1,
77
such that either of the following hold:
} pfλ,N pxq ´ pFRpx ;γ1, . . . ,γRq} ď
iq Dσ gλmax
Pλmin
?
R
KDσpνq
iiq Lσ gλmax
Pλmin
?
R
KLσpνq .
The coefficients KDσ and KLσ are defined in terms of the selected ν as:
KDσpνq :“ 1` |σp0q|Dσ `
ˆ
3` 2 |σp0q|
Dσ
˙d
1
2
log
ˆ
1
ν
˙
KLσpνq :“ γλmaxEB ` |σp0q|Lσ `
ˆ
pDΥλ ` 2γλmaxqEB ` 2 |σp0q|Lσ
˙d
1
2
log
ˆ
1
ν
˙
.
Proof. The proof is given in Appendix C.7.
Remark 4.3. The results of Theorem 4.4 show that for any fixed mollification factor
λ ą 0 and ν P p0, 1q, with probability greater than 1´ ν there exists an pFR with bounded
coefficients such that as RÑ8 then pFRpx ;γ1, . . . ,γRq Ñ pfλ,N pxq for all x P B.
4.5 Random Approximations to Base Approximations
We now state and prove an overall result which ties together the results of the previous
sections to provide an overall bound on the approximation error in L2pB, µ¯Bq norm,
between a base approximation pfN of the form (4.5) to a target function f P CpBq and
a random approximation pFRpx ;γ1, . . . ,γRq of the form (4.6).
Theorem 4.5. Let there exist a base approximation pfN of the form (4.5), for some
integer N P N, to a target function f P CpBq, with unknown parameters γ1, . . . ,γN
in a known bounded parameter set Υ Ă Rn`1, unknown coefficients θ P RN with a
known bound on the coefficient size
řN
i“1 |θi| ď SN , and using an activation function σ
satisfying Assumption 4.1. Assume that
››› qfpxq ´ pfN pxq›››B ď εN holds for some εN ą 0
78
which may depend on N , for an affine shift qfpxq “ f´aJx´bfp0q with some a P Rn and
b P R. The mollified approximation pfλ,N as in (4.17) satisfies the results of Theorem 4.3
for any mollification factor λ ą 0. Let the random parameters γ1, . . . ,γR be sampled
iid according to any nonzero density Pλ over the λ-expanded set Υλ, which has diameter
DΥλ and farthest point γλmax, and let EB be as in (4.19). Then for any ν P p0, 1q, with
probability greater than 1´ν there exists a random approximation pFRpx ;γ1, . . . ,γRq as
in (4.6) with |ϑj | ď gλmaxPλminR for all j P rRs that satisfies Theorem 4.4 such that either of
the following hold:››› qfpxq ´ pFRpx ;γ1, . . . ,γRq›››B ď
iq εN ` SN Dσ
ˆ
1 ` λ
n`1
?
R
rKDσpνq˙
iiq εN ` SNLσ
ˆ?
n` 1
λ
EB ` λ
n`1
?
R
rKLσpνq˙ .
The coefficients rKLσ are defined in terms of the selected ν as:
rKDσpνq :“ ηn`1en`1Pλmin KDσpνq and rKLσpνq :“ η
n`1
en`1Pλmin
KLσpνq
with KDσ and KLσ as defined in Theorem 4.4.
Proof. The proof is given in Appendix C.8.
Remark 4.4. The results of Theorem 4.5 show that the base approximations discussed in
Section 4.2 are particularly useful in this context, because they are convex combinations
of activation functions. This means we can allow the number of terms to grow arbitrarily
large N Ñ 8, since N does not appear directly in the results. This results in εN Ñ 0,
while maintaining the same bound SN for any N .
79
4.6 Extending the Error Bound to L8 for Lipschitz Func-
tions on Convex Compact Sets
In this section, we show how the L2 norm bounds of the previous sections can be
extended to L8 (supremum) norm bounds when the target function f P CpKq is dif-
ferentiable at the origin and Lipschitz on a convex compact set K Ă Rn including the
origin, and the base pfN or random pFR approximation to this function uses a Lipschitz
activation function σ. We assume that K is full dimension, with nonzero Lebesgue mea-
sure (volume) V “ µpKq ą 0 and diameter D ą 0. It also must hold that K Ď B0pρq for
a sufficiently large radius 0 ă ρ ď D.
We then have for any such approximation pf “ řNi“1 θi σpγJiXq, thatˇˇˇˇ
ˇ Nÿ
i“1
θi σpγJiXq ´
Nÿ
i“1
θi σpγJi Y q
ˇˇˇˇ
ˇ ď Nÿ
i“1
ˇˇ
θi σpγJiXq ´ θi σpγJi Y q
ˇˇ
“
Nÿ
i“1
|θi|
ˇˇ
σpγJiXq ´ σpγJi Y q
ˇˇ
ď
Nÿ
i“1
|θi|Lσ
ˇˇ
γJiX ´ γJi Y
ˇˇ “ Nÿ
i“1
|θi|Lσ
ˇˇ
wJi px´ yq
ˇˇ
ď
Nÿ
i“1
|θi|}wi}2 }x´ y}2
holds for all x, y P Rn, recalling that X :“ rx1 ¨ ¨ ¨ xn 1sJ and Y :“ ry1 ¨ ¨ ¨ yn 1sJ, since
|σpuq ´ σpvq| ď Lσ |u´ v| for any u, v P R by the Lipschitz condition of Assumption 4.1.
Thus, such approximations are pL-Lipschitz with constant
pL “ Lσ Nÿ
i“1
|θi|}wi}2 . (4.30)
Let the affine shift to function f be
qfpxq :“ fpxq ´∇fp0nqJx´ fp0nq . (4.31)
80
Thus, if f is L-Lipschitz, thenˇˇˇ qfpxq ´ qfpyqˇˇˇ “ ˇˇfpxq ´∇fp0nqJx´ fp0nq ´ fpyq `∇fp0nqJy ` fp0nqˇˇ
“ ˇˇfpxq ´ fpyq ´∇fp0nqJpx´ yqˇˇ ď |fpxq ´ fpyq| ` ˇˇ∇fp0nqJpx´ yqˇˇ
ď L }x´ y}2 ` }∇fp0nq}2 }x´ y}2
holds for all x, y P Rn. And so, qf is also Lipschitz, with constant qL “ L` }∇fp0nq}2.
Lemma 4.3. Let f : Rn Ñ R be L-Lipschitz on a convex compact set K Ă Rn containing
the origin and differentiable at the origin. Let there also be an approximation pf : Rn Ñ R
which is pL-Lipschitz on K and such that the approximation error rfpxq “ qfpxq ´ pfpxq,
with qf as in (4.31), satisfies the L2pK, µ¯Kq norm bound››› rf ›››
K
“
dż
K
ˇˇˇ rfpxqˇˇˇ2 µ¯Kpdxq ď εN
where the bound εN ą 0 is a finite constant and µ¯Kpdxq is the uniform probability
measure on K such that şK µ¯Kpdxq “ 1.
Assume K is full dimension with diameter D ą 0, and so it holds that K Ď B0pρq
for a sufficiently large radius 0 ă ρ ď D.
Then, the approximation error also satisfies the L8pKq norm bound
sup
xPK
ˇˇˇ rfpxqˇˇˇ ď 2 ˆ rDpr `?r2 `D2q
˙´n
n`2 ´´
L` }∇fp0nq}2 ` pL¯n ε 2N¯ 1n`2 (4.32)
such that r is the largest radius ball within K centered at its centroid.
Proof. The proof is given in Appendix C.9.
4.7 Application to Linear MRAC
In Chapter 12 of [41] an extension to the general linear MRAC setup is introduced which
allows for nonlinearities f : Rn Ñ R` as fpxq “ ΘJΨpxq ` f pxq such that the plant is
81
given by
9xt “ Axt `B
`
ut ` fpxqq “ Axt `B
`
ut `ΘJΨpxtq ` f pxq
˘
, (4.33)
where A is a known n ˆ n state matrix for the plant state xt P Rn, B is a known
n ˆ ` input matrix for the input ut P R`, and Θ is an unknown N ˆ ` matrix which
linearly parameterizes the known vector function Ψ : Rn Ñ RN . We assume pA,Bq
is controllable. The general setup also includes an unknown diagonal scaling matrix
Λ, such that the overall input matrix is BΛ, and assumes that A is unknown. For
simplicity, we assume A is known and omit Λ.
Again it is assumed that there exists an n ˆ ` matrix of feedback gains Kx and an
`ˆ ` matrix of feedforward gains Kr satisfying the matching conditions
A`BKJx “ Ar
BKJr “ Br
to a controllable, linear reference model
9xrt “ Ar xrt `Br rt ,
where Ar is a known Hurwitz nˆn reference state matrix for the reference state xrt P Rn,
Br is a known n ˆ ` reference input matrix, and rt P R` is a bounded reference input.
Here again, we assume that Kx and Kr can be directly calculated from known A and
B, and used directly in the control law.
It is required that the nonlinearity satisfy the bound
}f pxq}2 ď ε¯ (4.34)
for some constant ε¯ ą 0 and for all x P B0prq with some radius r ą 0. Then, the
adaptive control law
ut “ KJx xt ´ pΘJtΨpxtq ` p1´ µpxqqKJr rt ` µpxqrupxtq
82
is shown to stabilize the state tracking error et “ xt ´ xrt down to a compact set about
the origin, where the N ˆ ` matrix of parameter estimates pΘt is dynamically updated
with the usual law
9pΘt “ Γ Ψpxtq eJt PxB
(see Section 3.4 for further details). The scalar function µ : Rn Ñ r0, 1s transitions the
control law from tracking the reference input to simply returning the plant state xt to
the bounded region B0prq within which (4.34) is valid, using an appropriately definedrupxtq based on assumptions about the growth of f pxq outside of B0prq. (See Chapter 12
of [41] for details.)
And so, if the vector function Ψpxq is constructed as in (3.7) with an activation
function σ satisfying the Lipschitz condition of Assumption 4.1 with constant Lσ, then
the above setup is valid for any nonlinearity f “ rf1 ¨ ¨ ¨ f`s where we can show that
sup
xPK
ˇˇ
fipxq ´ΘJi Ψpxq
ˇˇ ď ε¯i
holds for some constants ε¯1, . . . , ε¯` ą 0. Here, f1, . . . , f` : Rn Ñ R and each ΘJi P RN is
the corresponding row of the true parameters ΘJ.
4.7.1 Example on Euclidean Ball
Here we will show how, for example, the results of Theorem 4.2 can be used for the
above pΘJt,iΨpxq approximations, which are equivalent to pfN of the form (4.5) when
Ψpxq is defined as in (3.7) with σ1, . . . , σN “ σR ReLU activation functions. These
satisfy the Lipschitz condition of Assumption 4.1 with Lσ “ 1. Then from (4.30), the
approximations are Lipschitz with constant
pL “ Nÿ
k“1
|pΘJi qk|}wk}2
for all i P r`s.
83
Let K be a Euclidean ball of radius ρ centered at the origin (thus giving the diameter
as D “ 2ρ). In this case, we have r “ ρ as the largest radius ball centered at the centroid,
which is the origin, contained in K. And so, from Lemma 4.3 we getˆ
r
Dpr `?r2 `D2q
˙´n
n`2 “
˜
ρ
2ρpρ`aρ2 ` p2ρq2q
¸´n
n`2
“
ˆ
ρ´1
2p1`?5q
˙´n
n`2
.
Furthermore, for all w P Rn we have
|w|K :“ sup
xPK
ˇˇ
wJx
ˇˇ “ ρ }w}2 ,
since this is always maximized by aligning x in the same direction as w and then
supxPK }x}2 “ ρ, and so (4.10) is now equivalently
Cg “
ż
Rn
ρ2 }w}22 Gpdwq ď C . (4.35)
And so, if there is a function g : Rn Ñ R with a Fourier transform rg such that the
variation Gpdwq “ |rgpwq|dw satisfies
C}w} :“
ż
Rn
}w}22 Gpdwq ă 8 ,
then we can set C “ Cg “ ρ2C}w}. An example of such a function is
gpxq “
nź
i“1
βi sincpαixiq
for any α1, βi P R, where sincpuq :“ sinpuqu for u P R is the unnormalized sinc function.
It has a Fourier transform of
rgpwq “ ż
Rn
nź
i“1
sincpαixiq e´iwJxdx
“
ż 8
´8
¨ ¨ ¨
ż 8
´8
β1sincpα1x1q ¨ ¨ ¨βnsincpαnxnq e´iw1x1 ¨ ¨ ¨ e´iwnxndx1 ¨ ¨ ¨ dxn
“ β1
ż 8
´8
sincpα1x1q e´iw1x1dx1 ¨ ¨ ¨ βn
ż 8
´8
sincpαnxnq e´iwnxndxn
“
nź
i“1
βi
pi
|αi|Π
ˆ
wi
2αi
˙
“
$’’&’’%
śn
i“1 βi pi|αi| w1 P r´|α1|, |α1|s, . . . , wn P r´|αn|, |αn|s
0 o.w.
,
84
which is real. Thus, with Gpdwq “ |rgpwq|dw we have
C}w} “
ż
Rn
}w}22 Gpdwq “
ż
Rn
pw21 ` ¨ ¨ ¨ ` w2nq |rgpwq|dw
“ pin
nź
i“1
|βi|
|αi|
ż |α1|
´|α1|
¨ ¨ ¨
ż |αn|
´|αn|
pw21 ` ¨ ¨ ¨ ` w2nq dw1 ¨ ¨ ¨ dwn
“ pin
nź
i“1
|βi|
|αi|
¨˝
nÿ
j“1
2 |αj |3
3
ź
k‰j
2|αk|‚˛ .
Using
sinc1puq “
$’’&’’%
cospuq´sincpuq
u u ‰ 0
0 u “ 0
for u P R and the chain rule BgBxi “ BgBui duidxi with ui “ αixi, the gradient of g is then
∇gpxq “
„
β1α1
cospα1x1q ´ sincpα1x1q
α1x1
¨ ¨ ¨ βnαn cospαnxnq ´ sincpαnxnq
αnxn
J
∇gp0nq “ r0 ¨ ¨ ¨ 0sJ .
And since |sinc1puq| ď 12 holds for all u P R, and g is continuously differentiable, we can
use the Lipschitz constant
}∇gpxq}2 ď
››››„β1α12 ¨ ¨ ¨ βnαn2
››››
2
“: L .
In Theorem 4.2 it is specified that pfN exists with parameters satisfying |wi|K “ ρ }w}2 “
1 for all i P rN s, and so we can bound its Lipschitz constant as
pL “ Nÿ
k“1
|pΘJi qk| }wk}2 ď
2C
ρ
“ 2ρC}w} .
Thus, for any fpxq that is equal to this gpxq on x P B0pρq for any ρ ą 0, the approxi-
mation error bound in Theorem 4.2 can be stated as
sup
xPK
ˇˇˇ rfpxqˇˇˇ ď 2ˆ ρ´1
2p1`?5q
˙´n
n`2
˜ˆ››››„β1α12 ¨ ¨ ¨ βnαn2
››››
2
` 2ρC}w}
n˙ 64
3 ρ
2C2}w}
N
¸ 1
n`2
.
Chapter 5
Conclusion and Future Directions
We have studied the learning components of the linear MRAC setup in [41], from both
a stochastic and deterministic viewpoint. We extended the framework in [18, 20] for ex-
ponential stochastic contraction in specialized Wasserstein distances, while also showing
that a flexible convex projection operation can be achieved through the use of reflection
processes. We develop novel sufficient conditions for persistency of excitation for vec-
tor functions of continuous, bounded state trajectories with step and ReLU activation
functions. And we developed a novel method for bridging approximations using activa-
tion functions with unknown parameters to approximations using randomly initialized
parameters, with the mollified integral representation. We showed that this could be
extended to the supremum norm for use with an extended, approximate version of the
linear MRAC setup.
Two immediate extensions of the work on stochastic contraction would be to: 1)
allow arbitrary bounded reference trajectories other than the zero vector (ie. adaptive
regulation of the plant state to the origin) to also achieve contraction, to a distribution
which is not strictly stationary (the mean would be time-varying, at least) but can be
tractably analyzed in perhaps in terms of local times of the state space (or perhaps
85
86
periodic, depending on the reference input), and 2) allow for the fully general adaptive
parameter update law, with leading coefficient matrices other than identity. Two other
possible ideas are: understand if it is possible to remove the Brownian motion from
the parameter update law completely, since it does not strictly make sense that it
would be needed there instead of just on the plant, and determine if the Fokker-Planck
(Kolmogorov forward) equation could be solved for these systems.
For extending the work on persistency of excitation, the follow ideas come to mind.
Is it be possible to incorporate estimation of the feedback and feedforward gains con-
currently with the parameterization of the nonlinarity? Extending the geometrical in-
terpretation to other kinds of adaptive systems would be a good idea. Particularly
since [2] and [60] provide other, similar types of conditions for more complex systems.
It would be interesting to understand what can be said about a completely general
(control affine) system which is linearized at an equilibrium point, such that it is not
necessarily possible for the control input to cancel out the higher-order terms function.
Extending the work on function approximation has two immediate paths: 1) is it
possible to do “approximate persistency of excitation on the extended, approximate
version of the linear MRAC setup, and 2) extending the work of [29] towards the deter-
mination of integral representations for more general classes of functions. In particular,
using linear combinations of step functions, which cannot be trained using backpropa-
gation in the modern paradigm of training deep neural networks. Yet in the 1D case,
we can easily show that the step function is superior at uniformly approximating func-
tions than the ReLU function. In fact, the step can handle discontinuous (regulated)
functions, while the ReLU can only deal with continuous functions.
References
[1] Veronica Adetola and Martin Guay. Finite-time parameter estimation in adaptive
control of nonlinear systems. IEEE Transactions on Automatic Control, 53(3):
807–811, 2008.
[2] Brian Anderson. Exponential stability of linear equations arising in adaptive iden-
tification. IEEE Transactions on Automatic Control, 22(1):83–88, 1977.
[3] Anuradha M Annaswamy, Anubhav Guha, Yingnan Cui, Joseph E Gaudio, and
Jose´ M Moreu. Online algorithms and policies using adaptive and machine learning
approaches. arXiv preprint arXiv:2105.06577, 2021.
[4] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained
analysis of optimization and generalization for overparameterized two-layer neu-
ral networks. In International Conference on Machine Learning, pages 322–332.
PMLR, 2019.
[5] Alessandro Astolfi, Dimitrios Karagiannis, and Romeo Ortega. Nonlinear and adap-
tive control with applications. Springer Science & Business Media, 2007.
[6] Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geometry of
Markov diffusion operators, volume 348. Springer Science & Business Media, 2013.
87
88
[7] Andrew R Barron. Universal approximation bounds for superpositions of a sig-
moidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
[8] Nicholas M Boffi, Stephen Tu, and Jean-Jacques E Slotine. Regret bounds for
adaptive nonlinear control. arXiv preprint arXiv:2011.13101, 2020.
[9] Franc¸ois Bolley, Ivan Gentil, and Arnaud Guillin. Convergence to equilibrium in
wasserstein distance for fokker–planck equations. Journal of Functional Analysis,
263(8):2430–2457, 2012.
[10] Djallel Bouneffouf, Irina Rish, and Charu Aggarwal. Survey on applications of
multi-armed and contextual bandits. In 2020 IEEE Congress on Evolutionary
Computation (CEC), pages 1–8. IEEE, 2020.
[11] Stephen Boyd and Sosale Shankara Sastry. Necessary and sufficient conditions for
parameter convergence in adaptive control. Automatica, 22(6):629–639, 1986.
[12] Leo Breiman. Hinging hyperplanes for regression, classification, and function ap-
proximation. IEEE Transactions on Information Theory, 39(3):999–1013, 1993.
[13] Se´bastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and
nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine
Learning, 5(1):1–122, 2012.
[14] Girish Chowdhary, Tansel Yucelen, Maximillian Mu¨hlegg, and Eric N Johnson.
Concurrent learning adaptive control of linear systems with exponentially conver-
gent bounds. International Journal of Adaptive Control and Signal Processing, 27
(4):280–301, 2013.
[15] George Cybenko. Approximation by superpositions of a sigmoidal function. Math-
ematics of control, signals and systems, 2(4):303–314, 1989.
89
[16] Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey,
Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego
de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforce-
ment learning. Nature, 602(7897):414–419, 2022.
[17] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient de-
scent finds global minima of deep neural networks. In International Conference on
Machine Learning, pages 1675–1685, 2019.
[18] Andreas Eberle. Reflection couplings and contraction rates for diffusions. Proba-
bility theory and related fields, 166(3-4):851–886, 2016.
[19] Andreas Eberle, Arnaud Guillin, and Raphael Zimmer. Quantitative harris-type
theorems for diffusions and mckean–vlasov processes. Transactions of the American
Mathematical Society, 371(10):7135–7173, 2019.
[20] Andreas Eberle, Arnaud Guillin, Raphael Zimmer, et al. Couplings and quanti-
tative contraction rates for langevin dynamics. The Annals of Probability, 47(4):
1982–2010, 2019.
[21] Andre Esteva, Katherine Chou, Serena Yeung, Nikhil Naik, Ali Madani, Ali Mot-
taghi, Yun Liu, Eric Topol, Jeff Dean, and Richard Socher. Deep learning-enabled
medical computer vision. NPJ digital medicine, 4(1):5, 2021.
[22] Ken-Ichi Funahashi. On the approximate realization of continuous mappings by
neural networks. Neural networks, 2(3):183–192, 1989.
[23] Joseph E Gaudio, Travis E Gibson, Anuradha M Annaswamy, Michael A Bolender,
and Eugene Lavretsky. Connections between adaptive control and optimization in
machine learning. In 2019 IEEE 58th Conference on Decision and Control (CDC),
pages 4563–4568. IEEE, 2019.
90
[24] Elad Hazan, Sham Kakade, and Karan Singh. The nonstochastic control problem.
In Algorithmic Learning Theory, pages 408–421. PMLR, 2020.
[25] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward
networks are universal approximators. Neural networks, 2(5):359–366, 1989.
[26] Naira Hovakimyan and Chengyu Cao. L1 Adaptive Control Theory: Guaranteed
Robustness with Fast Adaptation. SIAM, 2010.
[27] Daniel Hsu, Clayton Sanford, Rocco A Servedio, and Emmanouil-Vasileios
Vlatakis-Gkaragkounis. On the approximation power of two-layer networks of ran-
dom relus. In Conference on Learning Theory, pages 2423–2461. PMLR, 2021.
[28] Petros Ioannou and Baris¸ Fidan. Adaptive control tutorial. SIAM, 2006.
[29] Bunpei Irie and Sei Miyake. Capabilities of three-layered perceptrons. In ICNN,
pages 641–648, 1988.
[30] Arthur Jacot, Franck Gabriel, and Cle´ment Hongler. Neural tangent kernel: Con-
vergence and generalization in neural networks. arXiv preprint arXiv:1806.07572,
2018.
[31] Paul C Kainen, Veˇra Ku˘rkova´, and Andrew Vogt. Integral combinations of heavi-
sides. Mathematische Nachrichten, 283(6):854–878, 2010.
[32] Olav Kallenberg. Foundations of Modern Probability. Springer, third edition edi-
tion, 2021.
[33] Rushikesh Kamalapurkar, Patrick Walters, Joel Rosenfeld, and Warren Dixon. Re-
inforcement learning for optimal feedback control. Springer, 2018.
91
[34] Philipp Karg, Florian Ko¨pf, Christian Alexander Braun, and So¨ren Hohmann.
Excitation for adaptive optimal control of nonlinear systems in differential games.
IEEE Transactions on Automatic Control, 2022.
[35] Hassan K Khalil. High-gain observers in nonlinear feedback control. SIAM, 2017.
[36] Gerhard Kreisselmeier and Gudrun Rietze-Augst. Richness and excitation on an
interval-with application to continuous-time adaptive control. IEEE transactions
on automatic control, 35(2):165–171, 1990.
[37] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. Communications of the ACM, 60(6):
84–90, 2017.
[38] AJ Kurdila, Francis J Narcowich, and Joseph D Ward. Persistency of excitation in
identification using radial basis function approximants. SIAM journal on control
and optimization, 33(2):625–642, 1995.
[39] Andrew Lamperski. Projected stochastic gradient langevin algorithms for con-
strained sampling and non-convex learning. In Conference on Learning Theory,
pages 2891–2937. PMLR, 2021.
[40] Andrew Lamperski. Neural network independence properties with applications to
adaptive control. In IEEE Conference on Decision and Control (CDC), 2022.
[41] Eugene Lavretsky and Kevin A Wise. Robust and Adaptive Control: with Aerospace
Applications. Springer, 2013.
[42] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pen-
nington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes.
arXiv preprint arXiv:1711.00165, 2017.
92
[43] Tyler Lekang and Andrew Lamperski. Simple algorithms for dueling bandits. arXiv
preprint arXiv:1906.07611, 2019.
[44] Tyler Lekang and Andrew Lamperski. Wasserstein contraction bounds on closed
convex domains with applications to stochastic adaptive control. In IEEE Confer-
ence on Decision and Control (CDC), 2021.
[45] Tyler Lekang and Andrew Lamperski. Sufficient conditions for persistency of ex-
citation with step and relu activation functions. In IEEE Conference on Decision
and Control (CDC), 2022.
[46] FL Lewis, A Yesildirak, and Suresh Jagannathan. Neural network control of robot
manipulators and nonlinear systems, 1998.
[47] Jung-Shan Lin and Ioannis Kanellakopoulos. Nonlinearities enhance parameter
convergence in strict feedback systems. IEEE Transactions on Automatic Control,
44(1):89–94, 1999.
[48] Torgny Lindvall and LCG Rogers. Coupling of multidimensional diffusions by
reflection. The Annals of Probability, pages 860–872, 1986.
[49] Pierre-Louis Lions and Alain-Sol Sznitman. Stochastic differential equations with
reflecting boundary conditions. Communications on Pure and Applied Mathemat-
ics, 37(4):511–537, 1984.
[50] Batta Mahesh. Machine learning algorithms-a review. International Journal of
Science and Research (IJSR).[Internet], 9:381–386, 2020.
[51] Yuly Makovoz. Random approximants and neural networks. Journal of Approxi-
mation Theory, 85(1):98–109, 1996.
93
[52] Iven MY Mareels and Michel Gevers. Persistency of excitation criteria for linear,
multivariable, time-varying systems. Mathematics of Control, Signals and Systems,
1(3):203–226, 1988.
[53] Riccardo Marino and Patrizio Tomei. Global adaptive output-feedback control of
nonlinear systems. i. linear parameterization. IEEE Transactions on Automatic
Control, 38(1):17–32, 1993.
[54] Sean P Meyn and Richard L Tweedie. Stability of markovian processes iii: Foster-
lyapunov criteria for continuous-time processes. Advances in Applied Probability,
pages 518–548, 1993.
[55] AP Morgan and KS Narendra. On the uniform asymptotic stability of certain linear
nonautonomous differential equations. SIAM Journal on Control and Optimization,
15(1):5–24, 1977.
[56] Kamil Nar and S Shankar Sastry. Persistency of excitation for robustness of neural
networks. arXiv preprint arXiv:1911.01043, 2019.
[57] Radford M Neal. Priors for infinite networks (Technical Report CRG-TR-94-1).
1994.
[58] Daniel W Otter, Julian R Medina, and Jugal K Kalita. A survey of the usages
of deep learning for natural language processing. IEEE transactions on neural
networks and learning systems, 32(2):604–624, 2020.
[59] Alberto Padoan, Giordano Scarciotti, and Alessandro Astolfi. A geometric charac-
terization of the persistence of excitation condition for the solutions of autonomous
systems. IEEE Transactions on Automatic Control, 62(11):5666–5677, 2017.
[60] Elena Panteley, Antonio Loria, and Andrew Teel. Relaxed persistency of excitation
94
for uniform asymptotic stability. IEEE Transactions on Automatic Control, 46(12):
1874–1886, 2001.
[61] Armenak Petrosyan, Anton Dereventsov, and Clayton G Webster. Neural network
integral representations with the relu activation function. In Mathematical and
Scientific Machine Learning, pages 128–143. PMLR, 2020.
[62] Quang-Cuong Pham, Nicolas Tabareau, and Jean-Jacques Slotine. A contraction
theory approach to stochastic incremental stability. IEEE Transactions on Auto-
matic Control, 54(4):816–820, 2009.
[63] Andrey Pilipenko. An introduction to stochastic differential equations with reflec-
tion, volume 1. Universita¨tsverlag Potsdam, 2014.
[64] Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta
numerica, 8:143–195, 1999.
[65] Gilles Pisier. Remarques sur un re´sultat non publie´ de B. Maurey. Se´minaire
Analyse fonctionnelle, pages 1–12, 1981.
[66] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya
Ganguli. Exponential expressivity in deep neural networks through transient chaos.
Advances in neural information processing systems, 29, 2016.
[67] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: replacing
minimization with randomization in learning. In Advances in Neural Information
Processing Systems, volume 21, 2008.
[68] Benjamin Recht. A tour of reinforcement learning: The view from continuous
control. Annual Review of Control, Robotics, and Autonomous Systems, 2:253–
279, 2019.
95
[69] Sayan Basu Roy, Shubhendu Bhasin, and Indra Narayan Kar. Combined mrac for
unknown mimo lti systems with parameter convergence. IEEE Transactions on
Automatic Control, 63(1):283–290, 2017.
[70] Juan G Rueda-Escobedo and Jaime A Moreno. Strong lyapunov functions for two
classical problems in adaptive control. Automatica, 124:109250, 2021.
[71] Shankar Sastry and Marc Bodson. Adaptive control: stability, convergence and
robustness. Prentice Hall, Inc., 1989.
[72] Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-
stochastic control. In Conference on Learning Theory, pages 3320–3436. PMLR,
2020.
[73] Leszek S lomin´ski. Euler’s approximations of solutions of sdes with reflecting bound-
ary. Stochastic processes and their applications, 94(2):317–337, 2001.
[74] Jean-Jacques E Slotine, Weiping Li, et al. Applied nonlinear control, volume 199.
Prentice hall Englewood Cliffs, NJ, 1991.
[75] Hiroshi Tanaka. Stochastic differential equations with reflecting boundary condition
in convex regions. Hiroshima Mathematical Journal, 9(1):163–177, 1979.
[76] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez,  Lukasz Kaiser, and Illia Polosukhin. Attention is all you need.
Advances in neural information processing systems, 30, 2017.
[77] Ce´dric Villani. Optimal transport: old and new, volume 338. Springer Science &
Business Media, 2008.
[78] Draguna Vrabie, Kyriakos G Vamvoudakis, and Frank L Lewis. Optimal adaptive
96
control and differential games by reinforcement learning principles, volume 2. IET,
2013.
[79] Tyler Westenbroek, Eric Mazumdar, David Fridovich-Keil, Valmik Prabhu, Claire J
Tomlin, and S Shankar Sastry. Adaptive control for linearizable systems using on-
policy reinforcement learning. In 2020 59th IEEE Conference on Decision and
Control (CDC), pages 118–125. IEEE, 2020.
[80] Jan C Willems, Paolo Rapisarda, Ivan Markovsky, and Bart LM De Moor. A note
on persistency of excitation. Systems & Control Letters, 54(4):325–329, 2005.
Appendix A
A.1 Proof of Theorem 2.1
Theorem. The function ρpx, yq defined in (2.11) is a metric over x, y P X . Let xt
and yt be two solutions to (2.5) with initial conditions x0 ‰ y0 P X and respective laws
Pt and Qt. If the initial laws satisfy
ş
X VpxqdP0pxq ă 8 and
ş
X VpxqdQ0pxq ă 8, then
the laws contract exponentially in Wρ distance as
WρpPt,Qtq ď e´atWρpP0,Q0q , (2.14)
where a “ 12 mintλ, ξε, βεu.
Proof. Many of the arguments here are similar to those in the proof of Theorem 2.1
in [19]. We give terse explanations of these and then highlight the differences.
The main idea is to define a reflection coupling between xt and yt and use it to show
that eatρpxt,ytq, with a defined as
a “ 1
2
mintλ, ξε, βεu (A.1)
97
98
and ρ defined in (2.11), is supermartingale. This then gives the main result as:
Epxt,ytq„Γt˚ rρpxt,ytqs ď Epxt,ytq„Γtrρpxt,ytqs ď e´atEpx0,y0q„Γ˚0 rρpx0,y0qs
“WρpPt,Qtq “ e´atWρpP0,Q0q .
The left inequality is by the infimum definition of Wρ from (2.10), since at any time
t ą 0 the optimal coupling Γt˚ gives the Wasserstein distance for the laws Pt,Qt. The
right inequality is then by the supermartingale property, with moving the eat factor to
the right side, which holds for any couplings Γt P CpPt,Qtq of the laws at time t ą 0
and Γ0 P CpP0,Q0q of the initial laws. Thus, using the optimal Γ0˚ gives the final result.
Note that this optimal Γ0˚ always exists by Theorem 4.1 of [77].
Showing That ρ is a Metric
First, note from the definition of ρ in (2.11) that
ρpx, yq´fp}x´y}q “ rγV pxq`γV pyq`maxtV pxq, φpMqu`maxtV pyq, φpMqusIpx ‰ yq
is a metric, since V pxq ě 0 for all x P X and φpMq ą 0. In particular, it is a weighted
version of the Hamming metric.
Since the sum of metrics is again a metric, we must show that fp}x´y}q is a metric.
This holds provided that f is concave, with f 1p0q ą 0 and fp0q “ 0. (See [18, 19] for
details.)
To this end, it suffices to show that f is monotonically increasing from fp0q “ 0,
and that fpa` bq ď fpaq ` fpbq holds for any a, b ě 0. Working through the definitions
(2.13), we get for all r ě 0 that:
fprq “
ż mintr,R2u
0
ϕpsqgpsqds
f 1prq “ ϕprqgprqIpr ă R2q
f2prq “ ´h1prqf 1prq ´ ξ
4
Ipr ă R1q ´ β
4
ΦprqIpr ă R2q . (A.2)
99
Here, hprq is strictly increasing over r ě 0, and thus so are ϕprq and Φprq, respectively
from hp0q “ 0, ϕp0q “ 1, and Φp0q “ 0. Also, gprq ě 12 for all r ě 0 by construction, as ξ
and β are positive constants, with gp0q “ 1. Thus, we immediately have that fp0q “ 0,
f 1p0q “ 1, fprq is strictly increasing over r P r0, R2q, and then is constant for r ě R2.
This satisfies the first two requirements.
We also see from (A.2) that f2prq ă 0 over r P r0, R2q, and then zero for r ě R2.
This gives that f 1prq is monotonically decreasing for all r ě 0. Therefore, the last
required property (triangle inequality) is satisfied, since we have
fpa` bq “ fpaq `
ż a`b
a
f 1psqds ď fpaq `
ż b
0
f 1psqds “ fpaq ` fpbq .
Gradient and Hessian of General Norms with Quadratic Form
Throughout this proof, } ¨ } denotes a general norm having the quadratic form }z}2 “
xz, zy “ zJPz for all z P Rn, which is valid for any symmetric, positive-definite n ˆ n
matrix P . The Euclidian norm }z}2 “ xz, zy2 “ zJz satisfies this with P “ In. Then,
for any z P Rnzt0nu and where we define u :“ z}z} with }z} ‰ 0, the gradient and Hessian
of }z} are given by:
∇}z} “ ∇`p}z}2q 12 ˘ “ 12`p}z}2q´ 12 ˘∇}z}2 “ 12}z}´1 2Pz “ Pu
∇2}z} “ }z}´1`P ´∇}z}p∇}z}qJ˘ “ }z}´1P `In ´ uuJP ˘ . (A.3)
Reflection Coupling
The key approach from [19] is to define a reflection coupling (see [48]) between the solu-
tions xt and yt and then use it to upper bound EΓtreatρpxt,ytqs “ eatEΓtrρpxt,ytqs by a
local martingale with zero mean. Then by the definition of Wasserstein distance (2.10),
we have WρpPt,Qtq ď EΓtrρpxt,ytqs for any joint measure pxt,ytq „ Γt P CpPt,Qtq that
is a coupling of the laws of xt and yt at time t ě 0.
100
To define the reflection coupling, let us first make the following definitions:
ut :“ xt ´ yt}xt ´ yt} and δt :“
G´1pxt ´ ytq
}G´1pxt ´ ytq}2 .
This gives that uJt Put “ 1 and δJt δt “ 1. The reflection coupling is then given by the
following RSDE definitions corresponding to (2.5):
dxt “ Hpxtqdt`Gdwt ´ vxt dµxptq
dyt “ Hpytqdt`G
`
In ´ 2δtδJt Ipt ă τ q
˘
dwt ´ vyt dµyptq . (A.4)
Here, τ :“ inftt P r0,8q | xt “ ytu is the coupling time and the unique bounded
variation reflection processes ´ şt0 vxsdµxpsq and ´ şt0 vysdµypsq respectively ensure both
xt and yt remain within X for all t ě 0, from initial conditions x0 ‰ y0 P X . And so,
we have that xt “ yt for all t ě τ .
Note that the above definition for dyt is indeed of the form given in (2.5), because
d rwt “ `In´2δtδJt ˘dwt is a valid definition of a Brownian motion. This follows because`
In ´ 2δtδJt
˘
is an orthogonal matrix, since we have
`
In ´ 2δtδJt
˘`
In ´ 2δtδJt
˘J “`
In ´ 2δtδJt
˘J`
In ´ 2δtδJt
˘ “ In ´ 2δtδJt ´ 2δtδJt ` 4δtδJt δtδJt “ In.
Remark A.1. “Reflection coupling” overlaps with an unrelated meaning in our setup:
“reflecting” the solutions xt and yt at the boundary of X to remain within the closed
convex set, versus “reflecting” the Brownian motion with In ´ 2δtδJt .
The Supermartingale Property
Now we will show that eatρpxt,ytq is supermartingale. This is an exercise in non-smooth
Itoˆ calculus for continuous semimartingales. A good reference is Chapter 29 of [32].
In the discussion below, mt will denote a local martingale with zero mean. We use
this notation in several places to denote different processes. The specific value of mt
will not matter, since it will have zero mean.
101
Let zt :“ xt´yt and rt :“ }zt}. We will evaluate drt via Itoˆ’s formula when rt ą 0,
meaning that t ă τ . The method is similar to the proof of Theorem 2.1 in [19], but
our setup contains extra terms for the reflection processes, ´vxt dµxptq and ´vyt dµyptq.
From (A.4), we have that the process zt is a solution to the RSDE:
dzt “
`
Hpxtq ´Hpytq
˘
dt` 2GδtδJt dwt ´ vxt dµxptq ` vyt dµyptq t ă τ
zt “ 0n t ě τ .
With rt ą 0, we use (A.3) to apply Itoˆ’s formula for continuous semimartingales as:
drt “ B} ¨ }Bt dt` p∇rtq
Jdzt ` 1
2
pdztqJ∇2rt dzt
“ 0` xut, dzty ` 1
2}zt} pdztq
JP pIn ´ utuJt P qdzt
“ xut,Hpxtq ´Hpytqydt` 2xut, GδtδJt dwty ` xut,vyt dµyptq ´ vxt dµxptqy
` 2}zt}tr
`
P pIn ´ utuJt P qGδtδJt GpGδtδJt GqJ
˘
dt . (A.5)
Recall that here we are using the general inner product xu, vy “ uJPv. The first term
being zero just represents that the definition of rt in terms of the norm }¨} is fixed with
respect to time. We use the trace identity aJb “ trpbaJq and the Itoˆ rules dt2 “ 0,
dtdwt “ 0n, and dwtpdwtqJ “ dtIn to obtain the final term.
A significantly simpler expression can be obtained which upper bounds (A.5). First,
recalling that δt “ zt}G´1zt}2 , we note that the quantity inside the trace is a zero matrix:
P pIn ´ utuJt P qGδtδJt GpGδtδJt GqJ “ P
“
Gδtδ
J
t ´ utuJtP GδtδJt
‰
GpGδtδJt GqJ
“ P
”
zt}G´1zt}2δ
J
t ´ zt
zJt P
}zt}2
zt}G´1zt}2δ
J
t
ı
GpGδtδJt GqJ “ P 0nˆnGpGδtδJt GqJ “ 0nˆn .
Next, recall that vxt P NX pxtq and vyt P NX pytq, where NX is the normal cone of the
closed, convex set X . It then follows from the definition of NX that xut,vyt y ď 0 and
´xut,vxt y ď 0. (We elaborate on this in Section 2.1.) Since µx and µy are non-negative
measures, we therefore can bound those terms above by 0.
102
Thus, eliminating the terms explained above gives an upper bound as:
drt ď xut,Hpxtq ´Hpytqydt` 2xut, GδtδJt dwty
“ xut,Hpxtq ´Hpytqydt` 2 zJt}zt}P G 
G´1zt}G´1zt}2δ
J
t dwt
“ xut,Hpxtq ´Hpytqydt` 2 }zt}}G´1zt}2dbt , (A.6)
where bt “
şt
0 δ
J
s dws is a 1-dimensional Brownian motion. Note again that we will use
mt going forward to denote various local martingales that have zero mean.
Next we will bound dfprtq, using the f defind by (2.13). This part is similar to
the analysis from [19]. The main distinction is that [19] examines the special case of
P “ G “ In, while the general case leads to more complex formulas.
We extend f to the entire real line by setting fprq “ r for r ď 0. Note that f is
a non-smooth concave function which is left-differentiable everywhere. Denote the left-
derivative by f 1´ prq. Let Lrt denote the right-continuous local time of rt. Then Theorem
29.5 of [32] states that
fprtq ´ fpr0q “
ż t
0
f 1´ prsqdrs ` 1
2
ż 8
´8
Lrtdf
1´ prq , (A.7)
where the integral on the right is a Stieltjes integral. (Specifically, it is an integral with
respect to the signed measure ν defined by νpra, bqq “ f 1´ pbq´f 1´ paq for a ă b.) Theorem
29.5 of [32] also states that for any measurable function Fprq, we haveż t
0
Fprsqdrrss “
ż 8
´8
FprqLrtdr (A.8)
almost surely. Here, rrst is the quadratic variation. Since the xut, dµy type terms in
(A.5) are bounded variation, they have zero quadratic variation, and so we can use the
RHS of (A.6) to calculate rrst. This gives:
rrst “
ż t
0
ˆ
2
}zt}
}G´1zt}2
˙2
ds “ 4
ż t
0
zJsPzs
zJs pG´1qJG´1zsds .
103
SinceG is invertible, thenGGJ is positive definite and thus so is pGGJq´1 “ pG´1qJG´1.
Then the ratio of quadratic forms can be bounded for all zt P X as
λminpP q
λmaxppG´1qJG´1q ď
zJt Pzt
zJt pG´1qJG´1zt
ď λmaxpP q
λminppG´1qJG´1q
where λminpAq and λmaxpAq are respectively the smallest and largest eigenvalues of the
square matrix A. It then holds that 1
λmaxppG´1qJG´1q “ λminpGGJq “ σminpGq2, where
σminpGq :“ min}v}2“1 }Gv}2 since G is full rank, and vice versa with σmaxpGq :“ max}v}2“1 }Gv}2
(see Appendix A.3). And so, we can bound the quadratic variation as
ς t :“ 4λminpP qσminpGq2 t ď rrst ď 4λmaxpP qσmaxpGq2 t “: ζ t .
Note that f is twice continuously differentiable except at r P tR1, R2u. Then (A.8)
implies that rt spends zero time at tR1, R2u almost surely, since:ż t
0
Iprs P tR1, R2uqds ď ς´1
ż t
0
Iprs P tR1, R2uqdrrss
“ ς´1
ż 8
´8
Ipr P tR1, R2uqLrtdr
“ 0 . (A.9)
The last equality follows because Lrt is right continuous in r. More generally, this
argument shows that rt spends zero time in any finite collection of points.
The local time is non-negative, and the measure defined by νpra, bqq “ f 1´ pbq´f 1´ paq
is non-positive. Thus, we get the following inequalities:ż 8
´8
Lrtdf
1´ prq ď
ż 8
´8
Ipr R tR1, R2uqLrtdf 1´ prq
“
ż 8
´8
Ipr R tR1, R2uqf2prqLrtdr
“
ż t
0
Iprs R tR1, R2uqf2prsqdrrss
ď ζ
ż t
0
f2prsqds. (A.10)
104
The last inequality follows from the bounds on the quadratic variation, non-negativity
of f2, and the fact that rs spends zero time in tR1, R2u almost surely.
The preceding argument then implies that for almost all t P r0, τ q, the following
holds almost surely:
dfprtq ď
”
f 1prtqxut,Hpxtq ´Hpytqy ` ζ
2
f2prtq
ı
dt` dmt
ď
”
f 1prtq
`
κprqr ` αp}xt} ` }yt}q
˘` ζ
2
f2prtq
ı
dt` dmt
ď
”
f 1prtqαp}xt} ` }yt} ´ 2Mq ´ ξζ
8
Ipr ă R1q ´ βζ
8
fprqIpr ă R2q
ı
dt` dmt .
(A.11)
Here, the first inequality combines (A.6), (A.7), and (A.10), the second inequality uses
the one-sided growth bound (2.6), and the third inequality uses (A.2) combined with
the following facts that hold by construction:
fprq ď Φprq and h1prq “ 2
ζ
prκprq ` 2αMq “ 1
2ε
prκprq ` 2αMq .
The fact that fprq ď Φprq can be deduced from the definitions of Φ and f in (2.13),
since gprq P r1{2, 1s. We have thus bounded dfprtq as in (A.11).
We can directly use the Lyapunov assumption (2.7) to bound dVpxtq and dVpytq
almost surely as:
dVpxtq ď pC ´ λVpxtqqdt` dmt
dVpytq ď pC ´ λVpytqqdt` dmt . (A.12)
Now we will bound dpmaxtVpxtq, φpMquq. Recall that Vpxq “ φp}x}q, and that φpdq
is strictly monotonically increasing over d ě 0. We will follow a similar strategy as used
to bound dfprtq. For compact notation, set Vt “ Vpxtq, let LˆVt be the right-continuous
local time of Vt, and let F paq “ maxta, φpMqu, so that F pVpxtqq “ maxtVpxtq, φpMqu.
This gives that F paq is convex and smooth over a P r0,8qztφpMqu. Furthermore, since
105
(A.9) implies that rt spends zero time at M , then similarly we have that Vt spends zero
time at φpMq.
Note that the measure defined by νˆpra, bqq “ F 1´ pbq´F 1´ paq is a Dirac delta centered
at φpMq. Thus, the calculation analogous to (A.7) gives
F pVtq ´ F pV0q “
ż t
0
F 1´ pVsqdVs ` 1
2
Lˆ
φpMq
t ,
where the local time Lˆ
φpMq
t only changes at the times when Vt “ φpMq. (See again [32]
for details). Indeed, this local time is defined by
Lˆ
φpMq
t “ |Vt ´ φpMq| ´ |V0 ´ φpMq| ´
ż t
0
pIpVs ą φpMqq ´ IpVs ď φpMqqqdVs .
Thus, the local time remains unchanged on intervals in which Vt ą φpMq or Vt ă φpMq.
Then, since the amount of time Vt spends at φpMq is zero almost surely, we have that
for almost all t ă τ :
dF pVtq “ IpVt ą φpMqqdVt
“ Ip}xt} ąMqdVpxtq
ď Ip}xt} ąMqpC ´ λVpxtqq ` dmt
ď ´Ip}xt} ąMq
`
α}xt} ` λ2Vpxtq
˘` dmt . (A.13)
The first inequality is due to the bound (A.12), while the second inequality follows
from the assumption described in (2.9). And so, we have thus bounded dF pVtq “
dpmaxtVpxtq, φpMquq as in (A.13).
Using the definition of ρ in (2.11), and recalling that rt :“ }xt´yt} and xt ‰ yt for
all t ă τ , we then have
dρpxt,ytqq “ dfprtq ` γ
`
dVpxtq ` dVpytq
˘
` dpmaxtVpxtq, φpMquq ` dpmaxtVpytq, φpMquq .
106
And so, (A.11), (A.12), and (A.13) can be combined, along with the product rule, to
give for almost all t ă τ :
dpeatρpxt,ytqq “ aeatρpxt,ytqdt` eatdρpxt,ytq
ď eat
„
afprtq ` aγ
`
Vpxtq ` Vpytq
˘` a`maxtVpxtq, φpMqu `maxtVpytq, φpMqu˘
` f 1prtqαp}xt} ` }yt} ´ 2Mq ´ ξζ
8
Iprt ă R1q ´ βζ
8
fprtqIprt ă R2q
` γ`2C ´ λVpxtq ´ λVpytq˘´ Ip}xt} ąMq`α}xt} ` λ2Vpxtq˘
´ Ip}yt} ąMq
`
α}yt} ` λ2Vpytq
˘
dt` dmt .
We must show that the terms inside the brackets always sum to a non-positive value,
which then gives dpeatρpxt,ytqq ď dmt as desired.
For rt ă R2, the afprtq term is negated by βζ8 fprtq, since a ď βζ8 “ βε2 by definition.
On the other hand, for rt ě R2 we will show that afprtq is negated by aγ
`
Vpxtq `
Vpytq
˘` γ`2C ´ λVpxtq ´ λVpytq˘ “ γ`2C ` pa´ λqVpxtq ` pa´ λqVpytq˘, as follows:
γ
`
2C ` pa´ λqVpxtq ` pa´ λqVpytq
˘ ď γ`2C ´ λ2Vpxtq ´ λ2Vpytq˘
ď γ`2C ´ λ2Vpxtq˘ ď ´γ4C}xt} ď ´ξε2 rt ď ´art ď ´afprtq .
The first inequality follows from a ď λ2 . The second is from the assumption, without
loss of generality, that }xt} ě }yt} and thus Vpxtq ě Vpytq, and then dropping the
non-positive ´λ2Vpytq term. The third is from the assumption described in (2.9), such
that ´λ2Vpxtq ď ´4C}xt} ´ 2C. The fourth follows from the definition of γ “ ξε4C and
from the triangle inequality giving that 2 maxt}xt}, }yt}u “ 2}xt} ě rt “ }xt´yt}, thus
´}xt} ď ´rt2 . Then we have from the definition of a that a ď ξε2 . Lastly, since we have
shown that fprq is concave from fp0q “ 0, it holds that ´rt ď ´fprtq.
Next consider the case that f 1prtqαp}xt} ` }yt} ´ 2Mq ą 0. Then, since both
f 1prtq ě 0 and α ě 0, at least one of }xt} ą M or }yt} ą M must hold. If }xt} ą M ,
107
then since f 1prtq ď 1, the corresponding term is negated by
Ip}xt} ąMqppa´ λ{2qVpxtq ´ α}xt}q ď ´α}xt} ď ´f 1prtqα}xt} .
A similar negation occurs if }yt} ąM .
Finally, we consider the term γ2C “ ξε2 “ ξζ8 if rt ă R2, since previously it was
only part of a negation if rt ě R2. In the case that rt ă R1, this term is negated by
´ ξζ8 Iprt ă R1q. If rt ě R1, the definition of R1 “ diamtpx, yq P X | Vpxq ` Vpyq ď 4Cλ u
along with a ď λ2 gives that it is negated by
2C ` pa´ λqVpxtq ` pa´ λqVpytq ď 2C ´ λ2 pVpxtq ` Vpytqq ď 0 .
We have therefore shown that dpeatρpxt,ytqq ď dmt, and thus proven that the process
eatρpxt,ytq is supermartingale.
Localization
The last step in the proof is ruling out the possibility that Ereatρpxt,ytqs “ 8. This is
achieved in the proof of Theorem 2.1 in [19] by a localization argument using stopping
times τk :“ inftt | rt ď 1k or maxt}xt}, }yt}u ě ku, with τk Ò τ almost surely by the
nonexplosiveness of solutions to (2.5), which in turn is guaranteed by assumption (2.7).
The argument works without change in this setting.
A.2 Proof of Corollary 2.1
Corollary. For (2.5), there exists a unique stationary distribution pi satisfyingş
X Vpxqdpipxq ă 8 . Let xt be a solution to (2.5) with law Pt such that the initial law
satisfies
ş
X VpxqdP0pxq ă 8 . If Vpxq ě 1 for all x P X , then Pt contracts exponentially
to pi in total variation distance as
}Pt ´ pi}TV ď γ´1e´atWρpP0, piq . (2.15)
108
Now let q P r1,8q satisfy 1p ` 1q “ 1 with the corresponding p ą 1. If Vpxq ě }x}q for
all x P X , then Pt contracts exponentially to pi in standard W q as
W qpPt, piq ď 21{p γ´1{q
`
e´atWρpP0, piq
˘1{q
. (2.16)
Proof. The proof of (2.15) follows the proof of Corollary 2.1 in [19]. There, the initial
equality below is given (with citation), for any two probability measures P,Q over X ,
as ż
X
Vpxq |P´Q|pdxq “ inf
ΓPCpP,Qq
ż
XˆX
`
Vpxq ` Vpyq˘Ipx ‰ yqdΓpx, yq
ď
ż
XˆX
`
Vpxq ` Vpyq˘Ipx ‰ yq dΓ˚px, yq
ď γ´1WρpP,Qq , (A.14)
where Γ˚ is the coupling measure in WρpP,Qq. The last inequality follows since it holds
that
`
Vpxq`Vpyq˘Ipx ‰ yq ď γ´1ρpx, yq, with ρ defined by (2.11), for all px, yq P X ˆX .
Then from the definition of total variation distance, we have that
}P´Q}TV “ 1
2
ż
X
|P´Q|pdxq ď
ż
X
Vpxq |P´Q|pdxq (A.15)
holds if Vpxq ě 1 for all x P X . Therefore, (2.15) follows from combining (2.14) with
(A.14) and (A.15) where we use P “ Pt and Q “ pi.
The proof of (2.16) follows a similar argument, as is also the case with Remark 2.3
in [19]. There it is given, for any two probability measures P,Q over X , that
W qpP,Qq ď 21{p
ˆż
X
}x}q|P´Q|pdxq
˙1{q
(A.16)
holds for any q P r1,8q satisfying 1p ` 1q “ 1 with the corresponding p ą 1. Thus, if
Vpxq ě }x}q for all x P X , we have (by combining with (A.14)) thatż
X
}x}q|P´Q|pdxq ď
ż
X
Vpxq |P´Q|pdxq ď γ´1WρpP,Qq . (A.17)
109
And so again, the result is obtained by combining (2.14) with (A.16) and (A.17) where
we use P “ Pt and Q “ pi.
A.3 Useful Properties of Invertible Matrix G
Let G be an nˆ n square matrix. Then GGJ is symmetric and thus for any x P Rn it
holds that
xJGGJxJ “ pGJxqJpGJxq “
nÿ
i“1
pGJxq2i ě 0 .
If G is invertible, then so is GGJ as pGGJq´1 “ pG´1qJG´1. This means GGJ has no
zero eigenvalues and thus is positive definite. Let λ1 ě ¨ ¨ ¨ ě λn ą 0 be the eigenvalues
of GGJ and let us define λminpGGJq :“ λn and λmaxpGGJq :“ λ1.
Since GGJ is positive definite, then so is its inverse pGGJq´1 “ pG´1qJG´1 with
eigenvalues of 1λn ě ¨ ¨ ¨ ě 1λ1 ą 0. We can similarly define λminppG´1qJG´1q :“ 1λ1 and
λmaxppG´1qJG´1q :“ 1λn .
Therefore, we have
1
λmaxppG´1qJG´1q “ λn “ λminpGG
Jq “ σminpGq2
and
1
λminppG´1qJG´1q “ λ1 “ λmaxpGG
Jq “ σmaxpGq2
where σminpGq “ min}v}2“1 }Gv}2 (since G is full rank) and σmaxpGq “ max}v}2“1 }Gv}2 :“
}G}2 are respectively the minimum and maximum singular value of G with respect to
Euclidean norms.
Appendix B
B.1 Proofs of Theorems 3.1 and 3.2
Here we will prove the main results provided in Section 3.3.
B.1.1 Proof Methodology and Setup
The proofs for both theorems rely on the same contradiction method and share a com-
mon setup. First, we will assume that the state trajectory xt is continuous and stays
within some compact set C Ă Rn for all t ě 0, and that the component activations
σ1pwJ1 x` b1q, . . . , σN pwJNx` bN q of Ψ are then bounded on that C.
Then we will assume that for any shift τ ě 0 of the time window rτ, τ ` T s of any
length T ą 0, there exists a nonzero vector v P RNzt0Nu such that
vJ
ˆż τ`T
τ
ΨpxtqΨpxtqJdt
˙
v “ 0 . (B.1)
Finally, we will show that if the state trajectory xt meets certain requirements within
any shift τ ě 0 of the window rτ, τ ` T ˚s, for some window length T ˚ ą 0, then in
fact (B.1) can only hold if v “ 0N . This is a contradiction and thus proves that the
time-varying matrix ż τ`T˚
τ
ΨpxtqΨpxtqJdt (B.2)
110
111
must be strictly positive definite, with bounded eigenvalues, uniformly over all windows
rτ, τ ` T ˚s. Then by the Min-max principle, the persistency of excitation conditions
(3.5) and (3.6) are satisfied with setting α1 and α2 respectively as the smallest and
largest eigenvalues of (B.2) over all shifts τ ě 0 of the window rτ, τ ` T ˚s.
And so, let there be such a continuous trajectory xt staying within a compact set
C Ă Rn for all t ě 0. Within any particular time window rτ, τ `T s, the trajectory visits
L activation regions AA1 , . . . ,AAL , with the corresponding index sequences A1, . . . ,AL
and I1, . . . ,IL´1 defined as in Section 3.2.1.
For any time window rτ, τ ` T s and over the corresponding sequence s P rLs of
visited activation regions, recall the definition of the subsets Ts from (3.10):
Ts “ tt P rτ, τ ` T s | xt P AAsu .
Clearly the union of these subsets gives back the entire window as
Lď
s“1
Ts “ rτ, τ ` T s ,
meaning (B.1) is equivalent toż τ`T
τ
pvJΨpxtqq2dt “
Lÿ
s“1
ż
Ts
pvJΨpxtqq2dt “ 0 (B.3)
holding for all time windows rτ, τ ` T s and corresponding Ts subsets.
Since each integral in the sum is over t P Ts, and by definition we know that xt is
only within AAs during that subset of the time window, then by definition only the
activation indices i P SAs of Ψpxtq will be nonzero, which gives
vJΨpxtq “
Nÿ
i“1
vi Ψpxtqi “
ÿ
SAs
vi Ψpxtqi :“ vJAs ΨpxtqAs .
Here we denote vAs and ΨpxtqAs respectively as the indices of v and Ψpxtq according
to the active set i P SAs , for each s P rLs. This means (B.3) is equivalent to
Lÿ
s“1
ż
Ts
pvJAsΨpxtqAsq2dt “ 0 (B.4)
112
holding for all time windows rτ, τ `T s and corresponding Ts subsets. And since each of
these integrals is always nonnegative, (B.4) is equivalent to
vJAsΨpxtqAs “ 0 @t P Ts,@s P rLs (B.5)
holding for all time windows rτ, τ ` T s and corresponding Ts subsets.
B.1.2 Proof of Theorem 3.1
Theorem. Let state trajectory xt be continuous and stay within some compact
set C Ă Rn for all t ě 0, and let Ψpxtq “ rσcSpwJ1xt ` b1q ¨ ¨ ¨ σcSpwJNxt ` bN qsJ be
composed of (scaled) step functions (3.8) with positive scalars c “ rc1 ¨ ¨ ¨ cN sJ together
with N unique affine transformations of Rn according to w1, . . . , wN P Rnzt0nu and
b1, . . . , bN P R. If xt over t ě 0 is such that there exists a window length T ˚ ą 0
whereby the sequence s P rLs of activation regions visited during any shift τ ě 0 of the
time window rτ, τ ` T ˚s always satisfies the following two conditions:
1. all i P rN s hyperplanes wJi xt ` bi “ 0 are crossed
2. only nondegenerate borders are crossed,
then Ψpxtq satisfies the persistency of excitation conditions (3.5) and (3.6).
Proof. In this case, (B.5) is equivalent to
vJAsΨpxtqAs “ vJAscAs “ 0 @t P Ts,@s P rLs (B.6)
holding for all time windows rτ, τ ` T s and corresponding Ts subsets. Here we denote
cAs as the indices of c according to the active set i P SAs , for all s P rLs. This gives a
fixed, nonzero |SAs | length vector.
Note we have assumed that xt over t ě 0 is such that there exists T ˚ ą 0 whereby all
N hyperplanes are crossed and only at nondegenerate borders, within any time window
113
rτ, τ ` T ˚s. Thus, we can combine (3.11) and (B.6) to get that
0 “ vJAs`1cAs`1 @t P Ts`1 (B.7)
“
$’’&’’%
vJAscAs ` vIscIs if AAs`1 Ď X`Is
vJAscAs ´ vIscIs if AAs`1 Ď XI˝s
must hold over all s P rL ´ 1s with vJA1cA1 “ 0 for all t P T1, for all time windows
rτ, τ `T ˚s and corresponding Ts subsets. However, we will now show by induction that
this can only hold if v “ 0N .
Starting at s “ 1: it must hold that vJA1cA1 “ 0 for all t P T1, and since v and
c are fixed it must continue to hold for all t P T2, during which it must hold that
vJA2cA2 “ 0. And so, these can both hold only if the additional term is vI1cI1 “ 0,
and since cI1 ‰ 0 it must be that vI1 “ 0. We can then inductively make the same
argument over the remaining s “ 2, 3, . . . ,L ´ 1 to obtain vI1 “ ¨ ¨ ¨ “ vIL´1 “ 0. And
since we have assumed all N hyperplanes were crossed at nondegenerate borders over
the L´ 1 hyperplane crossings, we therefore must have that vi “ 0 for all i P rN s. This
establishes the contradiction from the original assumption of a nonzero v.
Since we assume that all N hyperplanes are crossed and only at nondegenerate
borders during any time window rτ, τ ` T ˚s, this contradiction must hold for all such
windows. Therefore, (B.2) must be strictly positive definite for all τ ě 0 with this T ˚
window length. And since the (scaled) step is bounded on any compact set C Ă Rn, we
have proven persistency of excitation holds for any window length of at least T ˚.
B.1.3 Proof of Theorem 3.2
Theorem. Let state trajectory xt be continuous and stay within some compact set
C Ă Rn for all t ě 0, and let Ψpxtq “ rσRpwJ1 xt` b1q ¨ ¨ ¨ σRpwJNxt` bN qs be composed
of ReLU functions (3.9) together with N unique affine transformations of Rn according
114
to w1, . . . , wN P Rnzt0nu and b1, . . . , bN P R. If xt over t ě 0 is such that there exists
a window length T ˚ ą 0 whereby the sequence s P rLs of activation regions visited
during any shift τ ě 0 of the time window rτ, τ `T ˚s always satisfies the following three
conditions:
1. all i P rN s hyperplanes wJi xt ` bi “ 0 are crossed
2. only nondegenerate borders are crossed
3. for each s P rLs, there are times t1, tˆ1, . . . , tM , tˆM P Ts, for any M ě n, such that
the nˆM matrix “pxt1 ´ xtˆ1q ¨ ¨ ¨ pxtM ´ xtˆM q‰ satisfies
rankpWJAsq “ rank
´
WJAs
“pxt1 ´ xtˆ1q ¨ ¨ ¨ pxtM ´ xtˆM q‰¯ ,
then Ψpxtq satisfies the persistency of excitation conditions (3.5) and (3.6).
Proof. In this case, (B.5) is equivalent to
vJAsΨpxtqAs “ vJAspWJAsxt ` bAsq “ 0 @t P Ts,@s P rLs (B.8)
holding for all time windows rτ, τ ` T s and corresponding Ts subsets. Here we denote
WJ as the N ˆ n matrix with wJi as the ith row and b as the length N vector with bi
as the ith entry, for all i P rN s, and thus the As subscripts are the indices according to
the active set i P SAs , for each s P rLs.
Let t1, tˆ1, . . . , tM , tˆM P Ts be the times from condition (iii). Then for i “ 1, . . . ,M
it must hold by (B.8) that
vJAspWJAsxti ` bAsq ´ vJAspWJAsxtˆi ` bAsq “ 0 “ vJAsWJAspxti ´ xtˆiq . (B.9)
In particular, vAs must be in the nullspace of»————–
pxt1 ´ xtˆ1qJ
...
pxtM ´ xtˆM qJ
fiffiffiffiffiflWAs “: xWAs .
115
Condition (iii) requires that xWAs and WAs have the same rank, and so the rank-nullity
theorem implies that their nullspaces must have the same dimension. Now, since the
nullspace of WˆAs is a subspace of the nullspace of WAs , the nullspaces must actually be
the same. As a result, (B.9) implies that vJAsW
J
As
“ 0 and so vJAsbAs “ 0 as well.
We have just shown that with condition (iii) satisfied, (B.8) is equivalent to
vJAsrWJAs bAss “ 0Jn`1 @t P Ts,@s P rLs (B.10)
holding for all time windows rτ, τ ` T s and corresponding Ts subsets.
Note we have assumed that xt over t ě 0 is such that there exists T ˚ ą 0 whereby all
N hyperplanes are crossed and only at nondegenerate borders, within any time window
rτ, τ ` T ˚s. Thus, we can combine (3.11) and (B.10) to get that
0Jn`1 “ vJAs`1rWJAs`1 bAs`1s @t P Ts`1 (B.11)
“
$’’&’’%
vJAsrWJAs bAss ` vIsrwJIs bIss if AAs`1 Ď X`Is
vJAsrWJAs bAss ´ vIsrwJIs bIss if AAs`1 Ď XI˝s
must hold over all s P rL ´ 1s with vJA1rWJA1 bA1s “ 0Jn`1 for all t P T1, for all time
windows rτ, τ ` T ˚s and corresponding Ts subsets. However, we will now show by
induction that this can only hold if v “ 0N .
Starting at s “ 1: it must hold that vJA1rWJA1 bA1s “ 0Jn`1 for all t P T1, and since
v, W , and b are fixed it must continue to hold for all t P T2, during which it must hold
that vJ2 rWJA2 bA2s “ 0Jn`1. And so, these can both only hold if the additional term is
vI1rwJI1 bI1s “ 0Jn`1, and since rwJI1 bI1s is nonzero it must be that vI1 “ 0. We can then
inductively make the same argument over the remaining s “ 2, 3, . . . ,L ´ 1 to obtain
vI1 “ ¨ ¨ ¨ “ vIL´1 “ 0. And since we have assumed all N hyperplanes were crossed at
nondegenerate borders over the L´1 hyperplane crossings, we therefore must have that
vi “ 0 for all i P rN s. This establishes the contradiction from the original assumption
of a nonzero v.
116
Since we assume that all N hyperplanes are crossed and only at nondegenerate
borders during any time window rτ, τ ` T ˚s, this contradiction must hold for all such
windows. Therefore, (B.2) must be strictly positive definite for all τ ě 0 with this T ˚
window length. And since the ReLU is bounded on any compact set C Ă Rn, we have
proven persistency of excitation holds for any window length of at least T ˚.
B.1.4 Extension to Other Activation Functions
The reasoning and methods described in this appendix section should also be able to be
applied similarly to any nonlinear, positive semidefinite activation function for which it
is possible to show the following:
1. (if needed) describe a sufficient condition for the trajectory xt such that we only
need to consider fixed properties (constants, vectors, matrices, etc.) contributing
to the value of vJAsΨpxtqAs within any activation region s P rLs, for example as
was done going from (B.8) to (B.10), and
2. formulate geometric conditions like (B.7) from (B.6) and (B.11) from (B.10), which
then allow the main induction steps to proceed.
B.2 Global Uniform Asymptotic Stability of (3.23)
Here we will show how the linear time-varying (LTV) system
9yt “ Γ ΨpxtqΨpxtqJyt (B.12)
is proven globally uniformly asymptotically stable if the vector function Ψ : Rn Ñ RN
satisfies the persistency of excitation condition (3.5). State trajectories y : r0,8q Ñ RN
and x : r0,8q Ñ Rn are assumed to be continuous (in time), and xt is assumed to
stay within some compact set C P Rn for all t ě 0. The vector function is defined
117
Ψpxq :“ rg1pxq ¨ ¨ ¨ gN pxqsJ, where the continuous almost everywhere, nonlinear scalar
functions g1, . . . , gN : Rn Ñ R are bounded on C, and Γ is some symmetric, positive-
definite N ˆN matrix.
Firstly, Lemma 1 of [2] says that if F : r0,8q Ñ RNˆN , H : r0,8q Ñ RN , and
Kr0,8q Ñ RN are regulated matrix/vector functions of time (meaning one-sided limits
exist for all t P r0,8q), and there exists positive scalars α3, T ą 0 such thatż τ`T
τ
}Kt}22 dt ď α3
holds for all τ ě 0, then rFt, Hts is uniformly completely observable (UCO) if and only
if rFt `KtHJt , Hts is UCO.
Next, Lemma 2 of [2] says that the global uniform asymptotic stability of
9yt “ Ft yt
is equivalent to: i) the existence of a symmetric, differentiable matrix function of time
P : r0,8q Ñ Rnˆn satisfying, for some positive scalars β1, β2 ą 0, that
β1IN ĺ Pt ĺ β2IN and ´ 9Pt “ Pt Ft ` FJt Pt `HtHJt
both hold for all t ě 0, and ii) that rFt, Hts is UCO.
Finally, a slight modification of Theorem 1 of [2], in order to allow for Γ matrices
other than the identity IN , says that the global uniform asymptotic stability of (B.12)
holds if Ψpxtq satisfies the persistency of excitation condition (3.5).
Proof. Suppose that Ψpxtq satisfies the persistency of excitation condition (3.5) for some
positive scalars α1, α2, T ą 0. This means, by definition, that r0NˆN ,Ψpxtqs is UCO.
Now set Kt “ ´Γ Ψpxtq and note that (3.5) givesż τ`T
τ
}Kt}22 dt ď
ż τ`T
τ
}Γ}2F }Ψpxtq}2F dt “ G2
ż τ`T
τ
tr
“
ΨpxtqΨpxtqJ
‰
dt ď G2α2N ,
thus satisfying Lemma 1 and so r´Γ ΨpxtqΨpxtqJ,Ψpxtqs is UCO.
118
Therefore, we have that Lemma 2 is satisfied with Pt “ 12Γ´1, Ft “ ´Γ ΨpxtqΨpxtqJ,
and Ht “ Ψpxtq. This proves (B.12) is globally uniformly asymptotically stable.
Remark B.1. Lemma 1 and Theorem 1 of [2] actually state exponential stability, rather
than global asymptotic uniform stability. However, for linear systems these types of
stability are equivalent. See Section 1.5.2 of [71] for details.
Remark B.2. Lemma 1 and 2 and Theorem 1 of [2] in fact hold with Vt “ Ψpxtq,
Kt, and Ht being regulated N ˆ ` matrix functions of time for any ` ě 1, using the
induced Euclidean matrix norm when ` ą 1. We specified their results for ` “ 1 and the
Euclidean vector norm, in order to match the application in Section 3.4.
Appendix C
C.1 Useful Relations
Bounding Absolute Value of F p1qB,C and F p2qB,C Integrands
First, we note that
ˇˇ
re
“
Zeib
‰ ˇˇ ď |Zeib| “ |Z| holds for all Z P C and b P R. Thus, we
bound the absolute value of the integrand in (4.7) for any ξ, x P Rn and b P R as
ˇˇ
cospξJx` bq ´ cospbqˇˇ ď ˇˇpeiξJx ´ 1qeib ˇˇ “ ˇˇeiξJx ´ 1ˇˇ “ 2ˇˇ sinpξJx2 qˇˇ
ď 2ˇˇξJx2 ˇˇ “ ˇˇξJxˇˇ , (C.1)
which gives | cospξJxq ´ 1| ď |ξJx| for b “ 0.
By then also noting | sinp0q ´ 0| “ 0 and ddy psinpyq ´ yq “ cospyq ´ 1 for y P R, we
can show that
| sinpyq ´ y| “
ˇˇˇˇż y
0
cospzq ´ 1 dz
ˇˇˇˇ
ď
ż y
0
| cospzq ´ 1| dz ď
ż y
0
|z|dz “ 12 |y|2 .
119
120
Combining these results, we bound the absolute value of the integrand in (4.9) as
| cospξJx` bq ` ξJx sinpbq ´ cospbq| ď |eiξJx ´ iξJx´ 1|
“
b
pcospξJxq ´ 1q2 ` psinpξJxq ´ ξJxq2
ď | cospξJxq ´ 1| ` | sinpξJxq ´ ξJx|
“ 2| sinpξJx2 q|2 ` | sinpξJxq ´ ξJx|
ď 2|ξJx2 |2 ` 12 |ξJx|2
“ |ξJx|2 . (C.2)
Integrability of F p1qB,C and F p2qB,C Classes
We note here that the integrability of (4.7) and (4.9) are given by using (C.1),(4.8) and
(C.2),(4.10) respectively, along with the definition of |ξ|B “ supxPB
ˇˇ
ξJx
ˇˇ
, asż
Rn
|eiξJx ´ 1|Gpdξq ď
ż
Rn
|ξJx|Gpdξq ď
ż
Rn
|ξ|B Gpdξq ď Cp1qgż
Rn
|eiξJx ´ iξJx´ 1|Gpdξq ď
ż
Rn
|ξJx|2Gpdξq ď
ż
Rn
|ξ|2B Gpdξq ď Cp2qg .
Reverse Triangle Inequality
Let F be some function space with f, g P F . Let } ¨ } be any norm on F . Then we have:
}f} “ }f ´ g ` g} ď }f ´ g} ` }g}
}g} “ }g ´ f ` f} ď }g ´ f} ` }f} “ }f ´ g} ` }f} ,
due to the triangle inequality. Therefore, it holds that
ˇˇ }f} ´ }g} ˇˇ ď }f ´ g} . (C.3)
121
Norm Difference for Functions with Different Coefficients
Let F be some function space with f, g P F . Let } ¨ } be any norm on F , and a, b P R.
Then we have:
}af ´ bg} “ }apf ´ gq ´∆g} ď |a| }f ´ g} ` |∆| }g} (C.4)
where ∆ “ b´ a, by using the triangle inequality.
Bounding Expected Variation of Sample Average Function from its Mean
Let f ,f1, . . . ,fN be drawn iid from a function space F and define the sample average
function f¯N “ 1N
řN
i“1 fi. With the L2pB, µq norm (for any valid measure µ) denoted
as }¨}B, we have
E
››f¯N ´ Ef¯N››2B “ E żB `f¯N ´ Ef¯N˘2 µpdxq
paq“
ż
B
´
E
“
f¯ 2N
‰´ 2E “f¯N ‰E “f¯N ‰` E “f¯N ‰2¯µpdxq
pbq“
ż
B
´
E
“
f¯ 2N
‰´ E “f¯N ‰2¯µpdxq “ ż
B
var
`
f¯N
˘
µpdxq
pcq“
ż
B
1
N
varpfqµpdxq “ 1
N
ż
B
`
E
“
f2
‰´ Erf s2˘µpdxq
“ 1
N
´
E }f}2B ´ }Erf s}2B
¯ pdqď 1
N
E }f}2B . (C.5)
Here (a) is by using Fubini’s theorem to bring the expectation inside the integral, the
linearity of expectation, and the outer expectation then being redundant for the last
term, (b) is from combining the middle and last terms, (c) is by the facts of variance
of averaged iid random variables (covariance terms are zero, while means and variances
are equal), and (d) is because the second term is nonpositive and therefore it can be
dropped with inequality.
122
McDiarmid’s Inequality
Let v1, . . . ,vd, for any d P N, be independent random variables in the sets V1, . . . ,Vd
and let the function h : V1ˆ¨ ¨ ¨ˆVd Ñ R satisfy for all i P rds and some positive scalars
k1, . . . , kd ą 0, that
|hpv1, . . . , vi, . . . , vdq ´ hpv1, . . . , rvi, . . . , vdq| ď ki
holds for all points pv1, . . . , vi, . . . , vdq, pv1, . . . , rvi, . . . , vdq P V1,ˆ ¨ ¨ ¨ ˆ Vd differing only
at vi, rvi P Vi. Then, for all t ě 0, it holds that
P
”
hpv1, . . . ,vdq ´ Erhpv1, . . . ,vdqs ě t
ı
ď exp
˜
´2 t2řd
i“1 k2i
¸
. (C.6)
Lemma on Approximating Functions of Convex Combinations
The following lemma is credited to Maurey in [65].
Lemma C.1. If qf is in the closure of the convex hull of a bounded set of functions F
in a Hilbert space H, such that }f}H ď b holds for all f P F , then for every N ě 1,
there exists an f¯N in the convex hull of N points in F such that››› qf ´ f¯N›››2H ď c¯N
holds for any c¯ ą b2 ´
››› qf ›››2
H
.
Proof. Define f‹ as a finite but arbitrarily large convex combination of m points f‹i P F ,
meaning it has the form f‹ “ řmi“1 γif‹i with γi ě 0, řmi“1 γi “ 1, and such that››› qf ´ f‹›››
H
ď δ?
N
for some δ ą 0. Randomly sample functions f ,f1, . . . ,fN iid from the
set tf‹i , . . . , f‹mu according to Prf “ f‹i s “ γi. Thus, we have Erf¯N s “ Er 1N
řN
i“1 fis “
Erf s “ f‹, and then using (C.5) gives E ››f¯N ´ f‹››2H “ 1N pE }f}2H ´ }f‹}2Hq ď 1N pb2 ´
123
}f‹}2Hq. Since the expectation is bounded this way, there must be a realization f¯N
that also satisfies this bound. Then by the triangle inequality, we have
››› qf ´ f¯N›››H ď››› qf ´ f‹›››
H
`››f¯N ´ f‹››H “ 1?N pδ`bb2 ´ }f‹}2Hq. And since f‹ can get arbitrarily close
to qf by letting mÑ8, and thus δ Ñ 0 and }f‹}H Ñ ››› qf ›››H, this proves the result.
In our setting, we use the }¨}B norm for the L2pBq space. We note that Theorem 1
of [51] gives the same Op 1?
N
q result, but with a superior constant. The proof there uses
the same method as above, but breaks F into N subsets of diameter , and then uses
that to bound the variance terms. This gives the bound constant in terms of an inverse
covering number of F (denoted N pFq), which goes to zero as N Ñ 8. However, this
requires quantifying N pFq, which may be nontrivial.
C.2 Proof of Theorem 4.1
Theorem. For any constant C ą 0 and bounded set B Ă Rn containing the origin,
there exists a base approximation pfN of the form (4.5) using the step activation function
σS with parameters satisfying |w|B “ 1 and |b| ď 1, for any integer N ě 1, to every
function f P F p1qB,C such that the approximation error rfN pxq “ qfpxq´ pfN pxq to the affine
shifted qfpxq “ fpxq ´ fp0nq satisfies the L2pB, µ¯Bq norm bound
››› rfN›››B “
dż
B
ˇˇˇ rfN pxqˇˇˇ2 µ¯Bpdxqq ď 2C?
N
(4.11)
with coefficient size bounded as
řN
i“1 |θi| ď 2C.
Proof. We closely follow the proof methods in [7]. Recalling that gpxq “ fpxq, and thus
qgpxq “ qfpxq, on B, then taking the real part of both sides of (4.7) gives for all x P B
124
that
rer qgpxqs “ re” qfpxqı “ re„ż
Rn
peiξJx ´ 1q eiφpξqGpdξq

qfpxq “ ż
Rn
´
cos
`
ξJx` φpξq˘´ cos `φpξq˘¯Gpdξq
“
ż
Ω
hpx, ξqΛpdξq
where Ω “ Rnzt0nu and equality is maintained because the integrand at ξ “ 0n is zero.
Here, we defined
hpx, ξq :“ C
p1q
g
|ξ|B
´
cos
`
ξJx` φpξq˘´ cos `φpξq˘¯
Λpdξq :“ |ξ|B
C
p1q
g
Gpdξq
with C
p1q
g as in (4.8), and so Λ is a probability measure over Ω such that
ş
Ω Λpdξq “ 1.
Then by (C.1) and recalling the definition of |ξ|B “ supxPB
ˇˇ
ξJx
ˇˇ
, it holds for all x P B
and ξ P Ω that
|hpx, ξq| ď C
p1q
g
|ξ|B
|ξJx| ď Cp1qg ď C .
With ξ,ξ1, . . . ,ξN P Ω sampled iid according to Λ, we have Er 1N
řN
i“1 hpx,ξiqs “ qfpxq.
Thus, by (C.5) we have
E
››››› qfpxq ´ 1N
Nÿ
i“1
hpx,ξiq
›››››
2
B
ď 1
N
E }hpx,ξq}2B ď
1
N
ż
B
C2 µ¯Bpdxq “ C
2
N
.
Since this goes to zero as N Ñ8, we have that for every f P F p1qB,C , it holds over x P B
that qf P coHcosp1qC,B , where
Hcosp1qC,B :“
"
α
|ξ|B
`
cospξJx` bq ´ cospbq˘ ˇˇˇ ξ P Ω, |α| ď C, b P R*
with closure in the sense of the L2pB, µ¯Bq norm.
125
Next, for any ξ P Ω we set y “ 1|ξ|B ξ
Jx, which gives y P Y Ď r´1, 1s over x P B.
Therefore, functions in Hcosp1qC,B correspond to smooth univariate functions of y having
the form
hpyq “ α|ξ|B
`
cosp|ξ|B y ` bq ´ cospbq
˘
for any ξ P Ω, |α| ď C, and b P R, and thus having derivative
h1pyq “ ´α sinp|ξ|B y ` bq .
These hpyq functions are uniformly continuous and bounded on y P r´1, 1s with hp0q “ 0
and |hpyq| ď C, and so they can be uniformly approximated on this interval by piecewise
constant functions. Setting a uniform partition of r0, 1s for any positive integer P as
∆ :“ 1P ă 2∆ ă ¨ ¨ ¨ ă pP ´ 1q∆ ă P∆ “ 1, we define
h2P pyq :“
ÿ
dPt´1,1u
˜
hp0qσSpdyq `
P´1ÿ
k“1
`
hpdk∆q ` hpdpk ´ 1q∆q˘σSpdy ´ k∆q
¸
,
recalling that σSpyq “ Iry ą 0s is the step activation function. This definition means
h2P is a continuous, length 2P sequence of constant lines on y P r´1, 1s between each
pair of points
 
hpdpk ´ 1q∆q, hpdk∆q(, over k P rP s and d P t´1, 1u.
Thus, as P Ñ 8 (and ∆ Ñ 0) we get h2P Ñ h uniformly on y P r´1, 1s, and the
absolute values of the coefficients are
|hp0q| “ 0 and |hpdk∆q ´ hpdpk ´ 1q∆q| .
By definition, the sum of the absolute values of the coefficients is then bounded by the
total variation of h1pyq over y P r´1, 1s. And since |h1pyq| “ |α|| ´ sinp|ξ|B y ` bq| ď C,
we have for any P ě 1 that
ÿ
dPt´1,1u
˜
|hp0q| `
P´1ÿ
i“1
|hpdk∆q ` hpdpk ´ 1q∆q|
¸
ď
ż 1
´1
|h1pyq|dy ď 2C .
126
This proves that for every f P F p1qB,C , it holds over x P B that qfpxq P coHSC,B, where
HSC,B :“
 
ασSpξJx` bq
ˇˇ |α| ď 2C, |ξ|B “ 1, |b| ď 1(
with closure in the supremum norm over x P B, and thus also in the sense of the
L2pB, µ¯Bq norm. Recalling that we had defined y “ 1|ξ|B ξ
Jx, which gave y P Y Ď r´1, 1s
for any ξ P Ω, then ξJx P W Ď r´ |ξ|B , |ξ|Bs and thus why |ξ|B “ 1 is specified in the
definition of HSC,B.
To bound the L2pB, µ¯Bq norm for all functions in this set, note that the largest
norms belong to the functions that correspond to hpyq “ p˘2CqσSp˘y` 1q, which over
y P r´1, 1s are equivalent to p˘2Cq. Thus, for all h P HRC,B we have
}h}2B “
ż
B
pασSpξJx` bqq2 µ¯Bpdxq ď p˘2Cq2
ż 1
´1
12 µ¯r´1,1spdyq “ p2Cq2 ,
where µ¯r´1,1spdyq “ 12dy is the uniform probability measure over y P r´1, 1s such thatş1
´1 µ¯r´1,1spdyq “
ş1
´1
1
2dy “ 1.
Using Lemma C.1 with c¯ “ b2 “ p2Cq2 (which is justified, since we can only have
} qf }2B “ 0 if qfpxq “ 0 for all x P B) then allows us to conclude that there must exist pfN
in the convex hull of N points in HSC,B satisfying } qf ´ pfN}B ď 2C`N´ 12 ˘ .
This means the approximation has the form pfN pxq “ řNi“1 ui αiσSpwJi x ` biq such
that
řN
i“1 ui “ 1 with u1, . . . , uN ě 0. Therefore, we can bound the coefficient size asřN
i“1 |θi| “
řN
i“1 ui|αi| ď
řN
i“1 uip2Cq “ 2C
řN
i“1 ui “ 2C for any N P N.
C.3 Proof of Theorem 4.2
Theorem. For any constant C ą 0 and bounded set B Ă Rn containing the
origin, there exists a base approximation pfN of the form (4.5) using the ReLU activation
function σR with parameters satisfying |w|B “ 1 and |b| ď 1, for any integer N ě 1, to
every function f P F p2qB,C such that the approximation error rfN pxq “ qfpxq´ pfN pxq to the
127
affine shifted qfpxq “ fpxq ´∇fp0nqJx´ fp0nq satisfies the L2pB, µ¯Bq norm bound››› rfN›››B “
dż
B
ˇˇˇ rfN pxqˇˇˇ2 µ¯Bpdxqq ď
b
16
3 C?
N
(4.12)
with coefficient size bounded as
řN
i“1 |θi| ď 2C.
Proof. We closely follow the proof methods in [7] and [12]. Recalling that gpxq “ fpxq,
and thus qgpxq “ qfpxq, on B, then taking the real part of both sides of (4.9) gives for all
x P B that
rer qgpxqs “ re” qfpxqı “ re„ż
Rn
peiξJx ´ iξJx´ 1q eiφpξqGpdξq

qfpxq “ ż
Rn
´
cos
`
ξJx` φpξq˘` ξJx sin `φpξq˘´ cos `φpξq˘¯Gpdξq
“
ż
Ω
hpx, ξqΛpdξq
where Ω “ Rnzt0nu and equality is maintained because the integrand at ξ “ 0n is zero.
Here, we defined
hpx, ξq :“ C
p2q
g
|ξ|2B
´
cos
`
ξJx` φpξq˘` ξJx sin `φpξq˘´ cos `φpξq˘¯
Λpdξq :“ |ξ|
2
B
C
p2q
g
Gpdξq
with C
p2q
g as in (4.10), and so Λ is a probability measure over Ω such that
ş
Ω Λpdξq “ 1.
Then by (C.2) and recalling the definition of |ξ|B “ supxPB
ˇˇ
ξJx
ˇˇ
, it holds for all x P B
and ξ P Ω that
|hpx, ξq| ď C
p2q
g
|ξ|2B
|ξJx|2 ď Cp2qg ď C .
With ξ,ξ1, . . . ,ξN P Ω sampled iid according to Λ, we have Er 1N
řN
i“1 hpx,ξiqs “ qfpxq.
Thus, by (C.5) we have
E
››››› qfpxq ´ 1N
Nÿ
i“1
hpx,ξiq
›››››
2
B
ď 1
N
E }hpx,ξq}2B ď
1
N
ż
B
C2 µ¯Bpdxq “ C
2
N
.
128
Since this goes to zero as N Ñ8, we have that for every f P F p2qB,C , it holds over x P B
that qf P coHcosp2qC,B , where
Hcosp2qC,B :“
#
α
|ξ|2B
`
cospξJx` bq ` ξJx sinpbq ´ cospbq˘ ˇˇˇ ξ P Ω, |α| ď C, b P R+
with closure in the sense of the L2pB, µ¯Bq norm.
Next, for any ξ P Ω we set y “ 1|ξ|B ξ
Jx, which gives y P Y Ď r´1, 1s over x P B.
Therefore, functions in Hcosp2qC,B correspond to smooth univariate functions of y having
the form
hpyq “ α|ξ|2B
`
cosp|ξ|B y ` bq ` |ξ|B y sinpbq ´ cospbq
˘
for any ξ P Ω, |α| ď C, and b P R, and thus having derivatives
h1pyq “ α|ξ|B
`´ sinp|ξ|B y ` bq ` sinpbq˘ and h2pyq “ ´α cosp|ξ|B y ` bq .
These hpyq functions are uniformly continuous and bounded on y P r´1, 1s with hp0q “ 0
and |hpyq| ď C, and so they can be uniformly approximated on this interval by piecewise
linear functions. Setting a uniform partition of r0, 1s for any positive integer P as
∆ :“ 1P ă 2∆ ă ¨ ¨ ¨ ă pP ´ 1q∆ ă P∆ “ 1, we define
h2P pyq :“
ÿ
dPt´1,1u
ˆ
hpd∆q ´ hp0q
∆
σRpdyq
`
P´1ÿ
k“1
hpdpk ` 1q∆q ´ 2hpdk∆q ` hpdpk ´ 1q∆q
∆
σRpdy ´ k∆q
¸
,
recalling that σRpyq “ maxt0, yu is the ReLU activation function. This definition means
h2P is a continuous, length 2P sequence of secant lines on y P r´1, 1s between each pair
of points
 
hpdpk ´ 1q∆q, hpdk∆q(, over k P rP s and d P t´1, 1u.
Thus, as P Ñ 8 (and ∆ Ñ 0) we get h2P Ñ h uniformly on y P r´1, 1s, and the
129
absolute values of the coefficients go toˇˇˇˇ
hpd∆q ´ hp0q
∆
ˇˇˇˇ
Ñ |h1p0q| “ 0ˇˇˇˇ
hpdpk ` 1q∆q ´ 2hpdk∆q ` hpdpk ´ 1q∆q
∆
ˇˇˇˇ
Ñ |h1pdk∆q ´ h1pdpk ´ 1q∆q| .
By definition, the sum of the absolute values of the coefficients is then bounded by the
total variation of h2pyq over y P r´1, 1s. And since |h2pyq| “ |α|| ´ cosp|ξ|B y ` bq| ď C,
we have for any P ě 1 thatÿ
dPt´1,1u
˜ˇˇˇˇ
hpd∆q ´ hp0q
∆
ˇˇˇˇ
`
P´1ÿ
i“1
ˇˇˇˇ
hpdpk ` 1q∆q ´ 2hpdk∆q ` hpdpk ´ 1q∆q
∆
ˇˇˇˇ¸
ď
ż 1
´1
|h2pyq|dy ď 2C .
This proves that for every f P F p2qB,C , it holds over x P B that qfpxq P coHRC,B, where
HRC,B :“
 
ασRpξJx` bq
ˇˇ |α| ď 2C, |ξ|B “ 1, |b| ď 1(
with closure in the supremum norm over x P B, and thus also in the sense of the
L2pB, µ¯Bq norm. Recalling that we had defined y “ 1|ξ|B ξ
Jx, which gave y P Y Ď r´1, 1s
for any ξ P Ω, then ξJx P W Ď r´ |ξ|B , |ξ|Bs and thus why |ξ|B “ 1 is specified in the
definition of HRC,B.
To bound the L2pB, µ¯Bq norm for all functions in this set, note that the largest
norms belong to the functions that correspond to hpyq “ p˘2CqσRp˘y` 1q, which over
y P r´1, 1s are equivalent to p˘2Cqp˘y ` 1q. Thus, for all h P HRC,B we have
}h}2B “
ż
B
pασRpξJx` bqq2 µ¯Bpdxq ď p˘2Cq2
ż 1
´1
p˘y ` 1q2 µ¯r´1,1spdyq “ 43p2Cq2 ,
where µ¯r´1,1spdyq “ 12dy is the uniform probability measure over y P r´1, 1s such thatş1
´1 µ¯r´1,1spdyq “
ş1
´1
1
2dy “ 1.
Using Lemma C.1 with c¯ “ b2 “ 43p2Cq2 (which is justified, since we can only have
} qf }2B “ 0 if qfpxq “ 0 for all x P B) then allows us to conclude that there must exist pfN
in the convex hull of N points in HRC,B satisfying } qf ´ pfN}B ď b163 C `N´ 12 ˘ .
130
This means the approximation has the form pfN pxq “ řNi“1 ui αiσRpwJi x ` biq such
that
řN
i“1 ui “ 1 with u1, . . . , uN ě 0. Therefore, we can bound the coefficient size asřN
i“1 |θi| “
řN
i“1 ui|αi| ď
řN
i“1 uip2Cq “ 2C
řN
i“1 ui “ 2C .
C.4 Proof of Theorem 4.3
Theorem. Let there exist a base approximation pfN of the form (4.5) using any
activation function σ satisfying Assumption 4.1, for any N P N, with some unknown
parameters γ1, . . . ,γN P Υ and coefficients θ P RN , but having a known bound on the
coefficient size of
řN
i“1 |θi| ď SN and the parameters set Υ known. Then, the mollified
approximation pfλ,N over the λ-expanded parameters set Υλ in (4.17) satisfies either of:
iq
››› pfN ´ pfλ,N›››B ď SN Dσ (4.20)
iiq
››› pfN ´ pfλ,N›››B ď SN Lσ
?
n` 1
λ
EB . (4.21)
Proof. First, we observe that (4.16) is equivalently
pfN pxq “ Nÿ
i“1
θi σpγJiXq
ż
Υλ
δˆn`1λ pγ´ γiq dγ “
Nÿ
i“1
θi
ż
Υλ
δˆn`1λ pγ´ γiqσpγJiXqdγ ,
since
ş
Υλ
δˆn`1λ pγ ´ γiqdγ “ 1 holds for all λ ą 0, n ě 1, and γi P Υ, and σpγJiXq is a
constant with respect to dγ.
Then we define the sets Υi,λ as the supports of each δˆ
n`1
λ pγ´γiq, which is the open
cube p´ 1λ , 1λqn`1 centered at the corresponding γi P Υ, and note that these are always
131
contained within the λ-expanded Υλ. Therefore, for any x P B we haveˇˇ pfN pxq ´ pfλ,N pxqˇˇ “
ˇˇˇˇ
ˇ Nÿ
i“1
θi
ż
Υλ
δˆn`1λ pγ´ γiq
´
σpγJiXq ´ σpγJXq
¯
dγ
ˇˇˇˇ
ˇ
paqď
Nÿ
i“1
|θi|
ˇˇˇˇż
Υλ
δˆn`1λ pγ´ γiq
´
σpγJiXq ´ σpγJXq
¯
dγ
ˇˇˇˇ
pbqď
Nÿ
i“1
|θi|
ż
Υλ
δˆn`1λ pγ´ γiq
ˇˇ
σpγJiXq ´ σpγJXq
ˇˇ
dγ
pcqď
Nÿ
i“1
|θi|
ż
Υλ
δˆn`1λ pγ´ γiq sup
φPΥi,λ
ˇˇ
σpγJiXq ´ σpφJXq
ˇˇ
dγ
pdq“
Nÿ
i“1
|θi| sup
φPΥi,λ
ˇˇ
σpγJiXq ´ σpφJXq
ˇˇ ż
Υλ
δˆn`1λ pγ´ γiqdγ
peq“ SN sup
φPΥi,λ
ˇˇ
σpγJiXq ´ σpφJXq
ˇˇ
.
Here (a) is by the triangle inequality, (b) is because
ˇˇş
fpxq dxˇˇ ď ş |fpxq|dx and δˆn`1λ
is positive, (c) is because the integrand is only nonzero on γ P Υi,λ by the support of
δˆn`1λ , (d) is because the sup is a constant and can thus be pulled out of the integral,
and (e) is by the definition of the mollified delta such that
ş
Υλ
δˆn`1λ pγ ´ γiqdγ “ 1 for
all i P rN s and the (assumed known) bound řNi“1 |θi| ď SN .
If σ satisfies the overall boundedness assumption with constant Dσ, then we haveˇˇ pfN pxq ´ pfλ,N pxqˇˇ “ SN Dσ
and thus ››› pfN pxq ´ pfλ,N pxq›››B ď
dż
B
|SN Dσ|2 µ¯Bpdxq “ SN Dσ
where SN and Dσ are positive constants.
Otherwise, σ satisfies the Lipschitz assumption with constant Lσ and so we haveˇˇ pfN pxq ´ pfλ,N pxqˇˇ ď SN Lσ sup
φPΥi,λ
ˇˇpγi ´ φqJX ˇˇ
pfqď SN Lσ
?
n` 1
λ
}X}2 (C.7)
132
where (f) uses the Cauchy-Schwarz inequality, combined with the fact that }γi ´ φ}2 ď
?
n`1
λ for all φ P Υi,λ at any γi P Υλ. And thus
››› pfN pxq ´ pfλ,N pxq›››B ď
dż
B
ˇˇˇˇ
SN Lσ
?
n` 1
λ
}X}2
ˇˇˇˇ2
µ¯Bpdxq “ SN Lσ
?
n` 1
λ
EB
where SN , Lσ,
?
n` 1, and λ are all positive constants.
C.5 Proof of Lemma 4.1
Lemma. Let the target function f P CpBq satisfy Theorem 4.3 and thus have the
mollified approximation pfλ,N as in (4.13) with activation σ for the selected mollification
factor λ ą 0, which has the integral representation (4.16) over the λ-expanded param-
eters set Υλ. If the random parameters γ1, . . . ,γR are sampled iid according to any
density Pλ which is nonzero over Υλ, then there exists a random approximation pFR of
the form (4.6) using the activation σ such that
E
γj„Pλ
” pFRpx ;γ1, . . . ,γRqı “ pfλ,N pxq . (4.23)
Proof. The mollified approximation pfλ,N has the integral approximation (4.17), which
can be equivalently written as
pfλ,N pxq :“ ż
Υλ
gλpγiq σpγJzq dγ
where coefficient function gλ is defined in (4.18). Then with Pλ being the known nonzero
density over Υλ used for sampling the parameters, we can equivalently write this as
pfλ,N pxq “ ż
Υλ
gλpγq
Pλpγq σpγ
JzqPλpγqdγ “ E
γ„Pλ
„
gλpγq
Pλpγq σpγ
Jzq

. (C.8)
133
For any sampled parametervalues γ1, . . . ,γR from Υλ, let us denote rfR˚ as the random
approximation that has coefficient values ϑ˚ P RR of
ϑj˚ “ gλpγjqPλpγjqR “
řN
i“1 θi δˆ
n`1
λ pγj ´ γiq
PλpγjqR @j P rRs . (C.9)
Let us denote dPλpγjq :“ Pλpγjqdγj for all j P rRs, so that in the below integrals
γ,γ1, . . . ,γR all represent separate variables of integration. Then with the iid assump-
tion on all γj „ Pλ, for this rfR˚ we have
E
γj„Pλ
” rfR˚px ;γ1, . . . ,γRqı “ E
γj„Pλ
«
Rÿ
j“1
ϑj˚ σpγJj zq
ff
“
ż
Υλ
¨ ¨ ¨
ż
Υλ
Rÿ
j“1
gλpγjq
PλpγjqR σpγ
J
j zq dPλpγ1q ¨ ¨ ¨ dPλpγRq
paq“
ż
Υλ
¨ ¨ ¨
ż
Υλ
gλpγ1q
Pλpγ1qR σpγ
J
1 zq dPλpγ1q ¨ ¨ ¨ dPλpγRq ` ¨ ¨ ¨
`
ż
Υλ
¨ ¨ ¨
ż
Υλ
gλpγRq
PλpγRqR σpγ
J
RzqdPλpγ1q ¨ ¨ ¨ dPλpγRq
pbq“
ż
Υλ
gλpγ1q
Pλpγ1qR σpγ
J
1 zqdPλpγ1q
`
1R´1q ` ¨ ¨ ¨
`
ż
Υλ
gλpγRq
PλpγRqR σpγ
J
Rzq dPλpγRq
`
1R´1q
pcq“ R
ˆ
1
R
ż
Υλ
gλpγq
Pλpγq σpγ
JzqdPλpγq
˙
pdq“ pfλ,N pxq .
Here, (a) is by the linearity of integration, (b) is each γj only belongs to the dPλpγjq
integral, while the remaining R ´ 1 integrals in each term are şΥλ dPλpγjq “ 1, (c) is
because there are R copies of the same integral, and (d) is (C.8).
C.6 Proof of Lemma 4.2
Lemma. Let hB be defined as in (4.24) for a random approximation pFR of the form
(4.6) which satisfies the result of Lemma 4.1 to some mollified approximation pfλ,N from
Theorem 4.3, both using activation function σ which satisfies Assumption 4.1. Let the
134
random parameters γ1, . . . ,γR be sampled iid according to any nonzero density Pλ over
the λ-expanded set Υλ, which has diameter DΥλ and largest point γλmax, and let EB be
as in (4.19). Then for all pγ1, . . . ,γj , . . . ,γRq, pγ1, . . . , rγj , . . . ,γRq P ΥRλ differing only
at coordinate j, either of the following hold for all j P rRs:
ˇˇ
hBpγ1, . . . ,γj , . . . ,γRq ´ hBpγ1, . . . , rγj , . . . ,γRqˇˇ ď
iq Dσ gλmax
PλminR
ˆ
3` 2 |σp0q|
Dσ
˙
(4.26)
iiq Lσ gλmax
PλminR
ˆ
pDΥλ ` 2γλmaxqEB ` 2 |σp0q|Lσ
˙
. (4.27)
And the expected value of hB satisfies either of:
E
γj„Pλ
”
hBpγ1, . . . ,γRq
ı
ď
iq Dσ gλmax
Pλmin
?
R
ˆ
1` |σp0q|
Dσ
˙
(4.28)
iiq Lσ gλmax
Pλmin
?
R
ˆ
γλmaxEB ` |σp0q|L
˙
. (4.29)
Proof. Since pFR is assumed to satisfy Lemma 4.1, then by (C.9) it holds for all j P rRs
that
|ϑj | ď
ˇˇˇˇ
gλpγjq
PλpγjqR
ˇˇˇˇ
ď gλmax
PλminR
, (C.10)
where gλmax is defined in (4.25) and Pλmin :“ mintPλpγq | γ P Υλu.
From the definition of hB in (4.24), and for all points differing only at coordinate j
135
as pγ1, . . . ,γj , . . . ,γRq, pγ1, . . . , rγj , . . . ,γRq P ΥRλ , we have for all j P rRs thatˇˇ
hBpγ1, . . . ,γj , . . . ,γRq ´ hBpγ1, . . . , rγj , . . . ,γRqˇˇ
“
ˇˇˇ ››› pfλ,N pxq ´ pFRpx ;γ1, . . . ,γj , . . . ,γRq›››B ´ ››› pfλ,N pxq ´ pFRpx ;γ1, . . . , rγj , . . . ,γRq›››B ˇˇˇ
paqď ›› pfλ,N pxq ´ pFRpx ;γ1, . . . ,γj , . . . ,γRq ´ pfλ,N pxq ` pFRpx ;γ1, . . . , rγj , . . . ,γRq››B
pbq“
›››› gλprγjqPλprγjqR σprγJj Xq ´ gλpγjqPλpγjqR σpγJjXq
››››
B
pcqď
ˇˇˇˇ
gλprγjq
PλprγjqR
ˇˇˇˇ ››σprγJj Xq ´ σpγJjXq››B `
ˇˇˇˇ
gλpγjq
PλpγjqR ´
gλprγjq
PλprγjqR
ˇˇˇˇ ››σpγJjXq››B
pdqď gλmax
PλminR
ˆ››qσprγJj Xq ´ qσpγJjXq››B ` 2 ››qσpγJjXq››B ` 2 |σp0q|˙
Here, (a) is by the reverse triangle inequality (C.3), (b) is by the definition of pFR such
that only ϑj differs at γj ‰ rγj , (c) is by the inequality (C.4), (d) is by noting that the θi
of gλ can be negative (thus the 2), the absolute value bound (C.10), and the definition
of the shifted activation qσpuq “ σpuq ´ σp0q for all u P R.
If σ satisfies the overall boundedness condition of Assumption 4.1 with constant Dσ,
note that |qσpuq´ qσpvq| ď Dσ holds for all u, v P I and thus |qσpuq´ qσp0q| “ |qσpuq| ď Dσ
also holds for all u P I. And so we have
ˇˇ
hBpγ1, . . . ,γj , . . . ,γRq ´ hBpγ1, . . . , rγj , . . . ,γRqˇˇ ď Dσgλmax
PλminR
ˆ
3` 2 |σp0q|
Dσ
˙
.
Otherwise, σ satisfies the Lipschitz condition of Assumption 4.1 with constant Lσ
and so |qσpuq ´ qσpvq| ď Lσ|u´ v| holds for all u, v P I and thus |qσpuq ´ qσp0q| “ |qσpuq| ď
Lσ|u| also holds for all u P I. And so we have
ˇˇ
hBpγ1, . . . ,γj , . . . ,γRq´hBpγ1, . . . , rγj , . . . ,γRqˇˇ
ď Lσgλmax
PλminR
ˆ››prγj ´ γjqJX››B ` 2 ››γJjX››B ` 2 |σp0q|Lσ
˙
peqď Lσgλmax
PλminR
ˆ
pDΥλ ` 2γλmaxqEB ` 2 |σp0q|Lσ
˙
136
where (e) is by Cauchy-Schwarz and the definitions of DΥλ, γλmax, and EB. This proves
the first results (4.26) and (4.27).
Next, recall from Lemma 4.1 that in order for E
γj„Pλ
” pFRpx ;γ1, . . . ,γRqı “ pfλ,N pxq
to hold, it must be that
pFRpx ;γ1, . . . ,γRq “ 1
R
Rÿ
j“1
gλpγjq
Pλpγjqσpγ
J
j zq .
We define a class of functions f : Rn Ñ R using the activation function σ : RÑ R, over
the values of γ P Υλ as
F :“
!
fpx ;γq “ gλpγq
Pλpγq σpγ
J
jXq
ˇˇˇ
γ P Υλ
)
.
Thus, it holds that pFRpx ;γ1, . . . ,γRq “ 1
R
Rÿ
j“1
fpx ;γjq
with γ1, . . . ,γR sampled iid according to Pλ, and so the inequality (C.5) applies.
And so, ith hB as defined in (4.24) we have
E
γj„Pλ
“
hBpγ1, . . . ,γRq2
‰ “ E
γj„Pλ
››› pfλ,N pxq ´ pFRpx ;γ1, . . . ,γRq›››2B
“ E
γj„Pλ
›››› pFRpx ;γ1, . . . ,γRq ´ Eγj„Pλ
” pFRpx ;γ1, . . . ,γRqı››››2
B
paqď 1
R
E
γ„Pλ
›››› gλpγqPλpγqσpγJjXq
››››2
B
ď g
2
λmax
P 2λminR
E
γ„Pλ
››qσpγJjXq ` σp0q››2B . (C.11)
Here, (a) is the inequality (C.5).
Note that
b
Eγ„Pλ }fpx ;γq}2B defines a norm over such randomly parameterized
137
functions. Using the square root of (C.11) then gives
E
γj„Pλ
rhBpγ1, . . . ,γRqs “ E
γj„Pλ
”a
hBpγ1, . . . ,γRq2
ı
paqď
c
E
γj„Pλ
rhBpγ1, . . . ,γRq2 s
ď gλmax
Pλmin
?
R
d
E
γ„Pλ
›››qσpγJjXq ` σp0q›››2B
pbqď gλmax
Pλmin
?
R
˜d
E
γ„Pλ
›››qσpγJjXq›››2B ` |σp0q|
¸
.
Here, (a) is Jensen’s inequality and (b) is the triangle inequality.
If σ satisfies the overall boundedness condition of Assumption 4.1 with constant Dσ,
E
γj„Pλ
rhBpγ1, . . . ,γRqs ď Dσ gλmax
Pλmin
?
R
ˆ
1` |σp0q|
Dσ
˙
then holds. Otherwise, σ satisfies the Lipschitz condition with constant Lσ,
E
γj„Pλ
rhBpγ1, . . . ,γRqs ď gλmax
Pλmin
?
R
˜
Lσ
d
E
γ„Pλ
›››γJjX›››2B ` |σp0q|
¸
pcqď Lσ gλmax
Pλmin
?
R
ˆ
γλmaxEB ` |σp0q|Lσ
˙
then holds, where (c) is Cauchy-Schwarz and the definitions of γλmax and EB. This
proves the second results (4.28) and (4.29).
C.7 Proof of Theorem 4.4
Theorem. Let hB be defined as in (4.24) and satisfy the results of Lemma 4.2,
for some mollified approximation pfλ,N from Theorem 4.3, using activation function σ
which satisfies Assumption 4.1. Let the random parameters γ1, . . . ,γR be sampled iid
according to any nonzero density Pλ over the λ-expanded set Υλ, which has diameter
DΥλ and largest point γλmax, and let EB be as in (4.19). Then for any ν P p0, 1q, with
138
probability greater than 1 ´ ν there exists a random approximation pFRpx ;γ1, . . . ,γRq
of the form (4.6) with |ϑj | ď gλmaxPλminR for all j P rRs satisfying the result of Lemma 4.1,
such that either of the following hold:
} pfλ,N pxq ´ pFRpx ;γ1, . . . ,γRq} ď
iq Dσ gλmax
Pλmin
?
R
KDσpνq
iiq Lσ gλmax
Pλmin
?
R
KLσpνq .
The coefficients KDσ and KLσ are defined in terms of the selected ν as:
KDσpνq :“ 1` |σp0q|Dσ `
ˆ
3` 2 |σp0q|
Dσ
˙d
1
2
log
ˆ
1
ν
˙
KLσpνq :“ γλmaxEB ` |σp0q|Lσ `
ˆ
pDΥλ ` 2γλmaxqEB ` 2 |σp0q|Lσ
˙d
1
2
log
ˆ
1
ν
˙
.
Proof. As in the proof of Lemma 4.2, we have by (C.10) that the coefficients of pFR are
bounded for all j P rRs as
|ϑj | ď gλmax
PλminR
,
where gλmax is defined in (4.25) and Pλmin :“ mintPλpγq | γ P Υλu.
With hB as defined in (4.24), applying McDiarmid’s inequality (C.6) gives
P
” ˇˇˇ
hBpγ1, . . . ,γRq ´ E
“
hBpγ1, . . . ,γRq
‰ˇˇˇ ě tı ď exp˜ ´2 t2řR
j“1 k2j
¸
.
In Lemma 4.2, we proved the kj bounds for all j P rRs are
kj “ gλmax
PλminR
A
and the expected value bound is
E
“
hBpγ1, . . . ,γRq
‰ ď gλmax
Pλmin
?
R
B
139
for coefficients A and B that depend on properties of the activation function σ and the
sets B and Υλ. Therefore, we have for any t ą 0, that
P
« ˇˇˇˇ
ˇhBpγ1, . . . ,γRq ´ gλmaxPλmin?R B
ˇˇˇˇ
ě t
ff
ď P
” ˇˇ
hBpγ1, . . . ,γRq ´ E
“
hBpγ1, . . . ,γRq
‰ˇˇ ě tı
ď exp
¨˚
˝ ´2 t2
R
´
gλmax
PλminR
A
¯2 ‹˛‚ “ expˆ´2P 2λminR t2g2λmaxA2
˙
. (C.12)
Setting the RHS of (C.12) equal to ν P p0, 1q and solving for t then gives
log
ˆ
1
ν
˙
“ 2P
2
λminR t
2
g2λmaxA
2
ùñ t “ gλmaxA
Pλmin
?
R
d
1
2
log
ˆ
1
ν
˙
. (C.13)
Combining (C.12) and (C.13) gives the result, for any ν P p0, 1q, as the complement of
P
«
hBpγ1, . . . ,γRq ě gλmax
Pλmin
?
R
B ` gλmaxA
Pλmin
?
R
d
1
2
log
ˆ
1
ν
˙ ff
“ P
«
hBpγ1, . . . ,γRq ě gλmax
Pλmin
?
R
˜
B `A
d
1
2
log
ˆ
1
ν
˙ ¸ff
iq “ P
«
hBpγ1, . . . ,γRq ě Dσ gλmax
Pλmin
?
R
KDσpνq
ff
ď ν
iiq “ P
«
hBpγ1, . . . ,γRq ě Lσ gλmax
Pλmin
?
R
KLσpνq
ff
ď ν
where iq and iiq correspond to which boundedness property of Assumption 4.1 the
activation function σ satisfies.
And so, we have the coefficients KDσ and KLσ defined in terms of the selected ν,
using either (4.26),(4.28) or (4.27),(4.29) as:
KDσpνq :“ 1` |σp0q|Dσ `
ˆ
3` 2 |σp0q|
Dσ
˙d
1
2
log
ˆ
1
ν
˙
KLσpνq :“ γλmaxEB ` |σp0q|Lσ `
ˆ
pDΥλ ` 2γλmaxqEB ` 2 |σp0q|Lσ
˙d
1
2
log
ˆ
1
ν
˙
.
140
C.8 Proof of Theorem 4.5
Theorem. Let there exist a base approximation pfN of the form (4.5), for some
integer N P N, to a target function f P CpBq, with unknown parameters γ1, . . . ,γN
in a known bounded parameter set Υ Ă Rn`1, unknown coefficients θ P RN with a
known bound on the coefficient size
řN
i“1 |θi| ď SN , and using an activation function σ
satisfying Assumption 4.1. Assume that
››› qfpxq ´ pfN pxq›››B ď εN holds for some εN ą 0
which may depend on N , for an affine shift qfpxq “ f´aJx´bfp0q with some a P Rn and
b P R. The mollified approximation pfλ,N as in (4.17) satisfies the results of Theorem 4.3
for any mollification factor λ ą 0. Let the random parameters γ1, . . . ,γR be sampled
iid according to any nonzero density Pλ over the λ-expanded set Υλ, which has diameter
DΥλ and farthest point γλmax, and let EB be as in (4.19). Then for any ν P p0, 1q, with
probability greater than 1´ν there exists a random approximation pFRpx ;γ1, . . . ,γRq as
in (4.6) with |ϑj | ď gλmaxPλminR for all j P rRs that satisfies Theorem 4.4 such that either of
the following hold:››› qfpxq ´ pFRpx ;γ1, . . . ,γRq›››B ď
iq εN ` SN Dσ
ˆ
1 ` λ
n`1
?
R
rKDσpνq˙
iiq εN ` SNLσ
ˆ?
n` 1
λ
EB ` λ
n`1
?
R
rKLσpνq˙ .
The coefficients rKLσ are defined in terms of the selected ν as:
rKDσpνq :“ ηn`1en`1Pλmin KDσpνq and rKLσpνq :“ η
n`1
en`1Pλmin
KLσpνq
with KDσ and KLσ as defined in Theorem 4.4.
Proof. As in the proof of Lemma 4.2, we have by (C.10) that the coefficients of pFR are
bounded for all j P rRs as
|ϑj | ď gλmax
PλminR
,
141
where gλmax is defined in (4.25) and Pλmin :“ mintPλpγq | γ P Υλu.
Applying the triangle inequality, we then have››› qfpxq ´ pFRpx ;γ1, . . . ,γRq›››B ď ››› qfpxq ´ pfN pxq›››B ` ››› pfN pxq ´ pfλ,N pxq›››B
`
››› pfλ,N pxq ´ pFRpx ;γ1, . . . ,γRq›››B .
The first term is assumed bounded by εN for the base approximation pfN , which are
discussed in Section 4.2. The second term is bounded by Theorem 4.3 and is discussed
in Section 4.3.2, and the third term is bounded by Theorem 4.4 and is discussed in
Section 4.4.
We pull out front the activation function constant Dσ or Lσ and the base approx-
imation coefficient size bound
řN
i“1 |θi| ď SN , since they are common to the last two
terms. This follows for the last term because
gλmax
PλminR
“
Nÿ
i“1
|θi|
ˆ
ηλ
e
˙n`1 1
PλminR
ď SN λ
n`1
R
ηn`1
en`1 Pλmin
,
and so the remaining η
n`1
en`1Pλmin is added to the existing coefficient in terms of ν.
C.9 Proof of Lemma 4.3
Lemma. Let f : Rn Ñ R be L-Lipschitz on a convex compact set K Ă Rn containing
the origin and differentiable at the origin. Let there also be an approximation pf : Rn Ñ R
which is pL-Lipschitz on K and such that the approximation error rfpxq “ qfpxq ´ pfpxq,
with qf as in (4.31), satisfies the L2pK, µ¯Kq norm bound
››› rf ›››
K
“
dż
K
ˇˇˇ rfpxqˇˇˇ2 µ¯Kpdxq ď εN
where the bound εN ą 0 is a finite constant and µ¯Kpdxq is the uniform probability
measure on K such that şK µ¯Kpdxq “ 1.
142
Assume K is full dimension with diameter D ą 0, and so it holds that K Ď B0pρq
for a sufficiently large radius 0 ă ρ ď D.
Then, the approximation error also satisfies the L8pKq norm bound
sup
xPK
ˇˇˇ rfpxqˇˇˇ ď 2 ˆ rDpr `?r2 `D2q
˙´n
n`2 ´´
L` }∇fp0nq}2 ` pL¯n ε 2N¯ 1n`2 (4.32)
such that r is the largest radius ball within K centered at its centroid.
Proof. The Lipschitz constants of qf and pf are respectively qL “ L ` }∇fp0nq}2 and pL.
Then, the approximation error function rf “ qf ´ pf is also Lipschitz, satisfyingˇˇˇ rfpxq ´ rfpyqˇˇˇ “ ˇˇˇ qfpxq ´ pfpxq ´ qfpyq ` pfpyqˇˇˇ ď ˇˇˇ qfpxq ´ qfpyqˇˇˇ` ˇˇˇ pfpxq ´ pfpyqˇˇˇ
ď p qL` pL q }x´ y}2
for all x, y P K. And so we define rL :“ qL` pL as the Lipschitz constant.
We note here that
µ
`
Bxpρq
˘ “ ρn pi n2
Γpn2 ` 1q
is the Lebesgue measure (volume) of the n-dimensional ball of radius ρ centered at any
x P Rn. Let us define for any h ě 0 the set
Ωh :“
!
x P K
ˇˇˇ ˇˇ rfpxq| ě h)
and note that the following therefore holds:
ε2N ě
ż
K
| rfpxq|2 µ¯Kpdxq ě ż
Ωh
| rfpxq|2 µ¯Kpdxq ě ż
Ωh
h2 µ¯Kpdxq “ h2µ¯KpΩhq
ùñ µ¯KpΩhq ď ε
2
N
h2
.
Now, let there be a px P K satisfying | rfppxq| “ h for some finite h ą εN , and let us define
the set
Sd :“
"
x P K
ˇˇˇ
}px´ x}2 ď h
d rL
*
“ KXBpx
ˆ
h
d rL
˙
143
for any d ě mint1, hrLDu, which respectively satisfies }px´ x}2 ď hrL and }px´ x}2 ď D.
We then get that
| rfpxq| “ | rfppxq ´ rfppxq ` rfpxq| ě | rfppxq| ´ | rfppxq ´ rfpxq|
ě | rfppxq| ´ rL }px´ x}2 ě h´ rL h
d rL “ d´ 1d h
for all x P Sd, by using the reverse triangle inequality.
If px is such that Sd is entirely within the interior of K, then the intersection is the
entire ball and we thus have
µ¯KpSdq “ µ¯K
ˆ
Bpx
ˆ
h
d rL
˙˙
“ 1V
ˆ
h
d rL
˙n pi n2
Γpn2 ` 1q
.
As px moves toward a flat surface boundary of K, then only half of the ball remains in
the intersection, which is the limit in the n “ 1 case. But in higher dimensions n ě 2,
the compact convex geometry of K means reductions of more than half are possible.
Figure C.1 uses an n “ 2 example to illustrate visually of how px being located on
a convex curved surface or convex vertex of intersecting lines gives further reductions
compared to the flat surface.
144
Kpx
Bpxp hd rLq
Sd ě Kpx
Bpxp hd rLq
Sd ě
Kpx
Bpxp hd rLq
Sd
Figure C.1: The intersection KX Bpx´ hd rL¯ in green, for possible convex geometrical
features on the boundary of K where the center px can lie.
And so, there must exist at least one minimizing boundary location satisfying
µ¯KpSdq‹ :“ min
xPBK µ¯KpSdq “
PK
V
ˆ
h
d rL
˙n pi n2
Γpn2 ` 1q
for some 0 ă PK ă 12 , since V ą 0 and the boundary of K must contain convex curved
surfaces and/or vertices.
The exact portion PK of the ball contained within µ¯KpSdq‹ may not be practical
to calculate for a given K, but we can always lower bound it in the following manner
(based on methods in [39] Proof of Lemma 16, Case 1).
Denote cK as the centroid of K and let r be the radius of the largest ball centered
at cK contained within K. Then define ` ě r as the distance from the centroid to the
closest minimizing boundary location x‹, and align a coordinate system at cK such that
145
x‹ is located at position
“´ ` 0 ¨ ¨ ¨ 0‰J. Then x P K satisfying
´` ď x1 ď 0 and
gffe nÿ
i“2
x2i ď r `
r
`
x1
defines a conic section which is entirely within K. If we then define
ϑ :“ arctan
´ r
`
¯
and d ` d sinpϑq “ h
d rL ,
this allows a ball of radius d sinpϑq to be inscribed within the conic section at location“´ `` d 0 ¨ ¨ ¨ 0‰J.
Using the above definitions, we have that this radius satisfies
d sinpϑq “
h
d rL
1` sinpϑq sinpϑq “
h
d rL
1` r?
r2``2
r?
r2 ` `2 “
r
r `?r2 ` `2
h
d rL “: %K hd rL
and since the centroid is within the interior of K, the factor % must satisfy
r
r `?r2 `D2 ă %K ă
r
r `?r2 ` r2 “
1
1`?2 .
Figure C.2 illustrates an example of this method in n “ 2, with the inscribed ball
outlined in red.
x‹
Bx‹p h
d rLq Sd r
`
ϑ
cK
K
d
Figure C.2: The largest ball which always lower bounds Sd, centered at d with radius
d sinpϑq.
Thus, as px moves toward such a minimizing boundary point of K, we get the lower
bound
µ¯KpSdq ě µ¯KpSdq‹ ě %
n
K
V
ˆ
h
d rL
n˙ pi
n
2
Γpn2 ` 1q
“: ρK,nV
ˆ
h
d rL
n˙
.
146
On the other hand, there may be x P K more than h
d rL away from any px where it still
holds that | rfpxq| ě d´1d h. Thus,
Sd Ď Ω d´1
d
h ùñ µ¯KpSdq ď µ¯K
´
Ω d´1
d
h
¯
ď d
2ε2N
pd´ 1q2h2 .
Combining these relations gives that
ρK,n
V
ˆ
h
d rL
˙n
ď µ¯KpSdq ď d
2ε2N
pd´ 1q2h2 ùñ h ď
d
pd´ 1q 2n`2
˜
V rLnε2N
ρK,n
¸ 1
n`2
must hold for any d ě 1. Then since
d
dd
d
pd´ 1q 2n`2
“ npd´ 1q ´ 2
pn` 2qpd´ 1qn`4n`2
“ 0 ùñ d “ n` 2
n
,
we have the smallest upper bound on h as
h ď
n`2
n
2
n
1
n`2
˜
V rLnε2N
ρK,n
¸ 1
n`2
ď 2 pρK,nq ´1n`2
´
V rLnε2N¯ 1n`2 .
By definition, we have
pρK,nq ´1n`2 “
˜
%nK
pi
n
2
Γpn2 ` 1q
¸ ´1
n`2
“
ˆ
r
r `?r2 ` `2
˙´n
n`2
˜
pi
n
2
Γpn2 ` 1q
¸ ´1
n`2
and since K Ď B0pρq for sufficiently large ρ, then
V ď ρn pi
n
2
Γpn2 ` 1q
.
Thus,
pρK,nq ´1n`2 V 1n`2 ď
ˆ
r
r `?r2 `D2
˙´n
n`2
D nn`2 .