Persistency of Excitation, Nonlinear Function Approximation, and Stochastic Contraction Analysis for Learning in Model Reference Adaptive Control A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY Tyler Lekang IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY Andrew Lamperski, Advisor August, 2023 © Tyler Lekang 2023 ALL RIGHTS RESERVED Acknowledgements First and foremost, I am grateful to my advisor and mathematics guru Prof. Andrew Lamperski. None of this would be possible without your wisdom and encouragement. We’ve come a long way, from studying robotics and bandits, to learning theory in adaptive control. Through it all, Andy has always had confidence in my abilities and was an inspiration in perseverance. I am grateful for the support provided through the grant NSF CMMI-2122856. There are many people in the Biomedical Engineering Department who played an important role in my early PhD career. The IGERT Fellowship, headed at the time by Prof. Bin He, was a tremendous source of opportunity, affording me the chance to attend conferences, present results, collaborate, as well as receive funding. Andy and I had projects with Prof. Tay Netoff and Dr. Logan Grado, among others, which inspired our earlier work on Dueling Bandits. I have made many wonderful friends in the department, such as tailgating and river floating extraordinaire John Basile and his adventurous lab colleagues Gerardo Rodriguez and Inderbir Sondh. I also had the privledge of befriending the biomdedical power couple, Drs. Jose´ Valdez and Rachel Niu. There are many others, too numerous to list here. I’m grateful to have known many very intelligent lab colleagues. Drs. Jianjun Yuan, Bolei Di, Venkat Ram Subramanian have graduated, while Yuping Zheng is bright eyed and bushy tailed, fast on her way to that milestone. i I was fortunate to have three internships while a PhD student: at 3M under David Redinger (thanks to Jianjun!), at Honeywell under Hal Voges, and at Seagate under Allan Luk. Thanks to Dr. Vivek Nagaraj for helping me get the Honeywell internship. Last but certainly not least, I want to thank my family, without which, there is no way I could have continued on the PhD path. My parents Theresa Lekang and Lane & Karen Brettingen, as well as Jim & Sue Reid. My siblings, cousins, and extended family as well. But most importantly, my wonderful, dear wife, Larissa Lekang, for her constant foundation of encouragement, motivation, companionship, and love. We look forward to the next chapter in our lives, with the arrival of our baby girl and all the joy she will bring. ii Dedication To my wonderful family in all its forms and branches, my terrific friends new and old, and most of all, to my lovely wife Larissa, our furry companion Jake, and our sweet baby girl Linnea who fills our lives with so much joy. iii Abstract Machine learning has achieved unprecedented levels of success recently, in the areas of language processing and modeling, image and video classification and generation, and recommendation and dynamic pricing systems. The control of dynamic systems has also benefited from these advancements in learning, particularly in the areas of reinforcement learning for tasks such as robotic navigation and control of nuclear fusion processes. We wish to study learning in another area where it can naturally be applied: adaptive control systems. These systems must estimate and identify uncertainties in the plant in order to apply their adaptive control laws. We study the areas of stochastic contraction and convex projection, persistency of excitation, and function approximation, with an eye towards this application. The first part of the thesis is motivated by the problem of quantitatively bounding the convergence of adaptive control methods for stochastic systems to a stationary dis- tribution. Such bounds are useful for analyzing statistics of trajectories and determining appropriate step sizes for simulations. To this end, we extend a methodology from (un- constrained) stochastic differential equations (SDEs) which provides contractions in a specially chosen Wasserstein distance. This theory focuses on unconstrained SDEs with fairly restrictive assumptions on the drift terms. Typical adaptive control schemes place constraints on the learned parameters and their update rules violate the drift conditions. To this end, we extend the contraction theory to the case of constrained systems repre- sented by reflected stochastic differential equations and generalize the allowable drifts. We show how the general theory can be used to derive quantitative contraction bounds on a nonlinear stochastic adaptive regulation problem. iv The second part of the thesis defines geometric criteria which are then used to estab- lish sufficient conditions for persistency of excitation with vector functions constructed from single hidden-layer neural networks with step or ReLU activation functions. We show that these conditions hold when employing reference system tracking, as is com- monly done in adaptive control. We demonstrate the results numerically on a system with linearly parameterized activations of this type and show that the parameter esti- mates converge to the true values with the sufficient conditions met. The third part of the thesis studies function approximation. Classical results in neural network approximation theory show how arbitrary continuous functions can be approximated by networks with a single hidden layer, under mild assumptions on the activation function. However, the classical theory does not give a constructive means to generate the network parameters that achieve a desired accuracy. Recent results have demonstrated that for specialized activation functions, such as ReLUs, high accuracy can be achieved via linear combinations of randomly initialized activations. These re- cent works utilize specialized integral representations of target functions that depend on the specific activation functions used. This paper defines mollified integral repre- sentations, which provide a means to form integral representations of target functions using activations for which no direct integral representation is currently known. The new construction enables approximation guarantees for randomly initialized networks using any activation for which there exists an established base approximation which may not be constructive. We extend the results to the supremum norm and show how this enables application to an extended, approximate version of (linear) model reference adaptive control. v Contents Acknowledgements i Dedication iii Abstract iv List of Figures x 1 Introduction 1 1.1 Why Learning in Model Reference Adaptive Control? . . . . . . . . . . 1 1.2 Challenges and State of the Art . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.1 Stochastic Contraction and Convex Projection . . . . . . . . . . 3 1.2.2 Persistency of Excitation . . . . . . . . . . . . . . . . . . . . . . 4 1.2.3 Function Approximation . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.5 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2 Stochastic Contraction and Convex Projection 10 2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1.1 General Inner Product and Norm . . . . . . . . . . . . . . . . . . 14 vi 2.1.2 The Convex Projection Operation . . . . . . . . . . . . . . . . . 14 2.1.3 The Normal Cone of a Closed Convex Set . . . . . . . . . . . . . 14 2.1.4 Relationship Between Convex Projection and Normal Cone . . . 15 2.1.5 Projection Coordinates . . . . . . . . . . . . . . . . . . . . . . . 16 2.1.6 Reflection Processes . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.1.7 Partial Projection Using Partial Identity P Matrix . . . . . . . . 18 2.2 Contraction Analysis for Reflected Stochastic Differential Equations over Closed Convex Domains . . . . . . . . . . . 19 2.2.1 Problem Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2.2 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.2.3 Background on Wasserstein Distances . . . . . . . . . . . . . . . 21 2.2.4 Main Contraction Result . . . . . . . . . . . . . . . . . . . . . . 22 2.3 Application to Linear MRAC . . . . . . . . . . . . . . . . . . . . . . . . 24 2.3.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.3.2 Lyapunov-Based Parameter Update Rule . . . . . . . . . . . . . 27 2.3.3 Satisfying Assumptions on the Lyapunov Equation . . . . . . . . 30 2.3.4 Satisfying the One-Sided Growth Condition . . . . . . . . . . . . 31 2.4 Numerical Simulation of Linear MRAC Application . . . . . . . . . . . . 34 3 Persistency of Excitation with Step and ReLU Functions 42 3.1 Persistency of Excitation With Basis Functions of a Continuous State Trajectory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.2 Nonlinear, Positive Semidefinite Activation Functions . . . . . . . . . . 46 3.2.1 Positive Semidefinite Activation Geometry . . . . . . . . . . . . . 47 3.3 Sufficient Conditions for Persistency of Excitation with (Scaled) Step and ReLU Activations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.3.1 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 vii 3.3.2 Proof Sketch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.3.3 Extension to Other Activation Functions . . . . . . . . . . . . . 52 3.4 Application to Linear MRAC . . . . . . . . . . . . . . . . . . . . . . . . 52 3.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.4.2 Persistency of Excitation . . . . . . . . . . . . . . . . . . . . . . 55 3.4.3 Numerical Simulation . . . . . . . . . . . . . . . . . . . . . . . . 56 4 Function Approximation 63 4.1 Function Approximation with Activation Functions . . . . . . . . . . . . 65 4.1.1 Base and Random Approximations . . . . . . . . . . . . . . . . . 66 4.2 Existing Results for Base Approximations . . . . . . . . . . . . . . . . . 66 4.2.1 Convex Combinations of Step and ReLU Activations . . . . . . . 66 4.3 Mollified Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . 69 4.3.1 Functions with Integral Representation . . . . . . . . . . . . . . 69 4.3.2 Mollified Integral Representation . . . . . . . . . . . . . . . . . . 70 4.3.3 Main Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.4 Random Approximations to Mollified Approximations . . . . . . . . . . 74 4.4.1 Expected Value of Random Approximations to Functions with Integral Representations . . . . . . . . . . . . . . . . . . . . . . . 74 4.4.2 Setting Up McDiarmid’s Inequality . . . . . . . . . . . . . . . . . 75 4.4.3 Using McDiarmid’s Inequality . . . . . . . . . . . . . . . . . . . . 76 4.5 Random Approximations to Base Approximations . . . . . . . . . . . . 77 4.6 Extending the Error Bound to L8 for Lipschitz Functions on Convex Compact Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.7 Application to Linear MRAC . . . . . . . . . . . . . . . . . . . . . . . . 80 4.7.1 Example on Euclidean Ball . . . . . . . . . . . . . . . . . . . . . 82 viii 5 Conclusion and Future Directions 85 References 87 Appendix A. 97 A.1 Proof of Theorem 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 A.2 Proof of Corollary 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 A.3 Useful Properties of Invertible Matrix G . . . . . . . . . . . . . . . . . . 109 Appendix B. 110 B.1 Proofs of Theorems 3.1 and 3.2 . . . . . . . . . . . . . . . . . . . . . . . 110 B.1.1 Proof Methodology and Setup . . . . . . . . . . . . . . . . . . . . 110 B.1.2 Proof of Theorem 3.1 . . . . . . . . . . . . . . . . . . . . . . . . 112 B.1.3 Proof of Theorem 3.2 . . . . . . . . . . . . . . . . . . . . . . . . 113 B.1.4 Extension to Other Activation Functions . . . . . . . . . . . . . 116 B.2 Global Uniform Asymptotic Stability of (3.23) . . . . . . . . . . . . . . 116 Appendix C. 119 C.1 Useful Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 C.2 Proof of Theorem 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 C.3 Proof of Theorem 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 C.4 Proof of Theorem 4.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 C.5 Proof of Lemma 4.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 C.6 Proof of Lemma 4.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 C.7 Proof of Theorem 4.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 C.8 Proof of Theorem 4.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 C.9 Proof of Lemma 4.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 ix List of Figures 2.1 Numerical simulation of similar LTI system and Ornstein-Uhlenbeck pro- cess from the same initial conditions. . . . . . . . . . . . . . . . . . . . . 12 2.2 Contractivity of the law Pt from initial deterministic P0 (deltas) to the Gaussian stationary distribution pi. . . . . . . . . . . . . . . . . . . . . . 12 2.3 Reflected Brownian Motion. The reflection process ψt constrains 2-dimensional Brownian Motion to remain within a box set X Ă R2 containing the origin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4 Convex Projection and Normal Cone. Here we see the relation- ship between convex projection of exterior points (colored dots to black dots) and vectors in the normal cone at those projected points (colored outwards pointing vectors). . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.5 States x p1q t (blue) and x p2q t (orange), for the three simulation scenarios. Coupling time τ is the vertical red line. . . . . . . . . . . . . . . . . . . 38 2.6 Euclidean norm of states }xp1qt }2 (magenta) and }xp2qt }2 (cyan), for the three simulation scenarios. Coupling time τ is the vertical red line. . . . 39 2.7 Phase plots of the 2D polygons used to constrain each row of pKt (left side) and pΩt (right side), with blue for pΘp1qt and orange for pΘp2qt . Cyan dots show true parameter values. Red and Green dots show final parameter estimates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 x 2.8 Euclidean norm of parameter error } pΘp1qt ´Θ}2 (magenta) and } pΘp2qt ´Θ}2 (cyan), for the three simulation scenarios. Coupling time τ is the vertical red line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.1 State space x P R2 showing N “ 4 hyper planes WJx ` b “ 04. There are 11 feasible activation regions Aj (color dots) and 5 infeasible regions (no dots). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.2 Phase plot of xt (cyan), tracking into x r t (magenta) driven by r p1q t , and clearly crossing all N hyperplanes at nondegenerate borders along the limit cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.3 Phase plot of xt (cyan), tracking into x r t (magenta) driven by r p2q t , but does not cross all N hyperplanes along the limit cycle. . . . . . . . . . . 61 3.4 Norm of state tracking error }et}2 converges to zero in both cases, while the parameter estimation error }pΘt ´ Θ}2 converges to zero for the case with r p1q t and but not for the case with r p2q t . . . . . . . . . . . . . . . . . 62 4.1 The mollified delta δˆλ for different λ values. In each case, ş1{λ ´1{λ δˆλpγqdγ “ 1. 71 4.2 The base approximation pfN is transformed into the mollified approxima- tion pfλ,N by replacing Dirac deltas δ with the mollified deltas δˆλ. . . . . 72 C.1 The intersection KX Bpx´ hd rL¯ in green, for possible convex geometrical features on the boundary of K where the center px can lie. . . . . . . . . 144 C.2 The largest ball which always lower bounds Sd, centered at d with radius d sinpϑq. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 xi Chapter 1 Introduction 1.1 Why Learning in Model Reference Adaptive Control? Machine learning is a wide-ranging umbrella topic that encompasses many different methods and models, which can be used to bestow onto a machine or system the ability to complete various tasks without specific experience being hard coded or otherwise already possessed by the system [50]. In particular, deep neural network based ap- proaches have achieved unprecedented levels of performance and utility in areas such as language processing [58], medical computer vision [21], and reinforcement learning [16]. Bandit model [13, 43] approaches are another method with various applications in clinical trials, recommendation systems, and dynamic pricing [10]. Adaptive control lies at the intersection of machine learning methods and the control of dynamics systems [23], and has a rich history in the controls literature [41, 26, 28, 35]. It has wide applications in areas such as robotics [46], aerospace systems [41], and elec- tromechanical systems [5]. The typical approach utilizes Lyapunov-based design to update parameters while guaranteeing stability. In particular, the setup of (linear) model reference adaptive control (MRAC) [41, 74, 71] lends itself well to focused study 1 2of the learning components of the overall adaptive mechanisms, with the goal of enhanc- ing the already well understood aspects of state error tracking stability. Improvement in learning within this application corresponds to improvement in the ability of the adaptive system to estimate uncertain parameters, identify the system, and cancel out detrimental effects of additive nonlinearities inherent in the plant. 1.2 Challenges and State of the Art The fundamental challenges that both machine learning and adaptive control methods attempt to address ultimately revolve around learning (approximating) an uncertain function. Many recent state of the art models employ deep neural networks, often with particular structure (such as transformers [76] or convolutional layers [37]). At their heart, these deep models are highly-parameterized approximators of complex, nonlinear functions of interest that govern the intrinsic nature of various tasks. An important aspect of addressing this challenge in the context of MRAC adaptive control is the concept of persistency of excitation. We show that the adaptive control law of the linear MRAC setup in [41] has parameter estimation error which can be formed as a linear time-varying system, and thus will be globally uniformly asymptotically stable if persistency of excitation is met. This is a well known and defined condition [2], but relatively few methods exist that provide constructive recipes for achieving it. These challenges can be addressed in both a deterministic and stochastic context. In [44], we took the first step within the stochastic context. There we also addressed the challenge of projection, in which the parameter estimates of the MRAC system are desired to stay within some compact convex set, without disrupting the state error tracking stability. Currently existing methods for this [41, 28, 26, 35] tend to rely on specialized operators for specific classes of convex sets. In [45], we switched to the deterministic setting and provided novel sufficient conditions for achieving persistency 3of excitation in the context of the linear MRAC setup in [41] where the nonlinear vector function is a single hidden-layer neural network with known parameterization but uncertain coefficients. 1.2.1 Stochastic Contraction and Convex Projection In recent years, there has been a drive to connect adaptive control methods with tech- niques from reinforcement learning [78, 33, 79, 8]. In parallel, methods from reinforce- ment learning have seen works giving precise optimality guaranatees [68, 72, 24]. These works rely on precise convergence bounds that are fairly straightforward for linear sys- tems, but substantially more complex in stochastic nonlinear systems. Convergence of stochastic nonlinear systems is a vast area with numerous approaches, e.g. [54, 62, 6, 9]. In order to derive convergence guarantees analogous to those available in linear systems, explicit quantitative bounds are required. In [44], our desire was to derive quantitative convergence guarantees for stochastic adaptive control methods. To this end, we build upon methodologies at the intersection of stochastic differential equa- tions and optimal transport [20, 18]. However, the existing methods in this area are too restrictive to be applied directly to common adaptive control schemes. In particular, these focus unconstrained processes with strong Lipschitz-like conditions on the drift term. However, in adaptive control, the parameters are typically constrained and their update rules often contain quadratic terms that violate the drift conditions. Our primary contribution to stochastic convergence theory is an extension of the methodology from [20] to constrained processes with less restrictive drift conditions. We derive an explicit exponential contraction bound under a specially constructed Wasser- stein distance. The result implies exponential convergence to a unique stationary dis- tribution under a variety of measures, including total variation distance and Euclidean Wasserstein distances. We then show how this result can be used to prove exponential 4convergence in a stochastic adaptive regulation problem. Additionally, we show how a projection method based on reflected stochastic differential equations can be used to constrain the parameters to an arbitrary closed convex set. This provides a flexible alternative to handling constraints, which contrasts with more specialized projection operators commonly employed in adaptive control [41, 26, 35]. 1.2.2 Persistency of Excitation Persistency of excitation is a fundamental concept employed within contexts and appli- cations related to parameter learning, such as system identification and adaptive control. It is often discussed, or at least mentioned, in adaptive control textbooks, such as in [71, 74, 41]. It was proven in [55, 2] that persistency of excitation is necessary and suf- ficient for the global uniform asymptotic stability of certain linear time-varying (LTV) systems, which can be shown to hold for the parameter estimation error dynamics that stem from the adaptive parameter update laws in linear MRAC. There have been many works since [55, 2] which utilize an assumption of persistency of excitation in order to achieve results in parameter learning. In [1], a simple parameter learning scheme can be employed for a general class of nonlinear systems which have some kind of working (nonlinear) feedback controller. An integral condition similar to persistency of excitation is assumed to be satisfiable. This inspired the work in [69], which assumes a similar integral condition in order to identify, and then provide MRAC control for, an unknown MIMO LTI system. And in [14, 41] we see additional examples of MRAC control which assume persistency of excitation in order to achieve parameter convergence, while in [3], persistently exciting assumptions are made in reinforcement learning applications. In [80], a sufficient condition for windows of observed behavior of an LTI system (in discrete time) to span the space of possible windows, is for a component signal (like the input) to be persistently exciting. In [56], conditions for 5neural networks excitation are given to guarantee bounds on the function estimate error. Lastly, in [60] it is proven that for a general class of nonlinear systems which are feedback linearizable (see [53, 35]), global uniform asymptotic stability can be achieved for linearly paramatrized vector functions meeting relaxed persistency of excitation conditions. However, in these and other works which utilize persistency of excitation assump- tions, there is often no explicit sufficient conditions provided for how to ensure that persistency of excitation is satisfied. On the other hand, there have been some works which do provide these sufficient conditions. A classic result is that the state of an LTI system satisfies persistence of excitation if the (stationary) input to that system contains sufficient frequency content (“sufficient richness”, see [11]). In [52, 36] this is extended to certain linear time-varying systems, while in [47] frequency arguments for sufficient conditions for excitation are then extended to nonlinear systems in parametric-strict- feedback form and in [34] to the context of adaptive dynamic programming, in which optimal control value functions are approximated using polynomial basis functions. In [59], a rank condition is proven sufficient and necessary for the state of time-invariant systems to be persistently exciting, and in [70], strong Lyapunov functions are provided that are equivalent to the persistency of excitation condition. Finally, and closest in spirit of our work in [45], in [38] a sufficient condition, based on geometric criteria, is given for satisfying persistency of excitation with vector functions composed of radial basis functions. 1.2.3 Function Approximation A key challenge in function approximation work is that many theorems describe the ex- istence of approximations that are linear combinations of some activation function (such as a sigmoid), but are not constructive. In [64], many of these results are reviewed, in- cluding classic results on the universal approximation properties of continuous sigmoidal 6activations [15, 25, 22], as well as another classic result by [7] for arbitrary sigmoidal activations, which was expanded to arbitrary hinge functions by [12]. Random approximation provides a key way to avoid the non-constructiveness by randomly sampling the internal parameters of the activations from a known set, if the target function of interest has an integral representation. However, determining these representations for general classes of target functions and activations is a nontrivial task. [29] gives a constructive method, but only for target and activation functions in L1. In [31] and [61], they propose constructive methods for a class of target functions with unit step and ReLU activations respectively. In [27], functions are approximated using trigonometric polynomial ridge functions, which can then be shown in expectation to be equivalent to randomly initialized ReLU activations. There are several interesting and important results in the literature having to do with neural networks with random (typically Gaussian) parameter initialization, sometimes as a consequence of using randomly initialized gradient descent for training the network. A classic result by [57] shows that the output of a single hidden-layer network with Gaussian randomly initialized parameters goes to a Gaussian Process as the width goes to infinity. Similar results are then achieved in [42] for deep fully-connected networks as all hidden layer widths go to infinity. Also for deep fully-connected networks, in [30] the authors define the Neural Tangent Kernel and propose that its limit, as the hidden layer widths go to infinity, can be used to study the timestep evolution and dynamics of the parameters, and the corresponding network output function, in gradient descent. In [66], the authors show that single hidden-layer networks cannot achieve the same rates of increase in a measure of curvature produced by the network output, as deep networks can, with parameters Gaussian randomly initialized and bounded activation functions. In [4], the authors show that single hidden-layer networks of a sufficient width can use Gaussian randomly initialized gradient descent on values of a target 7function, and achieve guaranteed generalization to the entire function. And in [17], the authors show that deep fully-connected networks, where each hidden layer width meets a sufficient size, can be trained with Gaussian randomly initialized gradient descent and be guaranteed to reach the global minimum at a linear rate. 1.3 Contributions We summarize our main contributions as follows: • We develop a novel extension of the framework in [18, 20] for establishing ex- ponential stochastic contraction using specialized Wasserstein distances, which is detailed in Section 2.2. The main technical result is Theorem 2.1, which also pro- vides for Corollary 2.1. We analyze a modified, stochastic version of linear MRAC to show that the theory can be applied in Section 2.3, and perform numerical simulation. • We develop novel sufficient conditions for persistency of excitation for (nonlinear) vector functions of continuous, bounded state trajectories when they are composed of activation functions (eg. step and ReLU) together with affine transformations of the state space, which is detailed in Sections 3.1 and 3.2. The main technical results are Theorems 3.1 and 3.2. We analyze a simplified version of linear MRAC to show that the theory can be applied in Section 3.4, and perform numerical simulation. • We develop a novel method for bridging convex combinations of activation func- tions (eg. step and ReLU) with unknown parameters to approximations using randomly initialized parameters by using a mollified integral representation, which is detailed in Section 4.3. By combining existing results discussed in Section 4.2 8with the general setup in Section 4.4, we obtain the main technical result Theo- rem 4.5. We extend this result to the supremum norm in Section 4.6, and analyze linear MRAC to show that the theory can be applied in Section 4.7. 1.4 Thesis Outline • Chapter 2 focuses on Stochastic Contraction and Convex Projection. • Chapter 3 focuses on Persistency of Excitation. • Chapter 4 focuses on Function Approximation. • Chapter 5 provides conclusions and future directions for these lines of work. 1.5 Notation Here, we will describe notation that is consistent throughout the thesis. Any deviations or additional notation will be specified within individual chapters. We use R,C,N to denote the real, complex, and natural numbers. I denotes the indicator function, while E denotes the expected value. We use B, C,K,X to denote bounded, compact, compact convex, and closed convex subsets of Rn. The integer set t1, . . . , ku is denoted by rks. We use coH to denote the closure of the convex hull of a set H. Bxprq Ă Rn denotes the radius r Euclidean ball centered at x P Rn. For any n,m P N, we interpret x, y P Rn as length n column vectors and their transposes xJ, yJ as length n row vectors. }x}2 “ ? xJx will always denote the standard Euclidean vector norm and inner product, with xJy the dot product of two equal length vectors. 0n, 1n denote the length n vector of zeros and ones respectively. The notation that A is a n ˆ m matrix assumes real entries, thus A P Rnˆm and its transpose AJ P Rmˆn. If A is a square matrix its trace is denoted by trpAq. In denotes the nˆ n 9identity matrix. A ă B (ĺ) denotes that the matrix B ´ A is positive (semi)definite. And so, aIn ĺ A ĺ bIn denotes that eigpAq P ra, bs. Index subscripts on vectors and matrices denote the row index, for example AJi is the ith row of AJ. The Frobenius matrix norm is denoted }A}F . We denote the product of matrix A P Rnˆm with vector x P Rm as Ax, which is a length n (column) vector where the ith (row) element is the inner product pAxqi. If A P Rnˆm and x P Rn, then the matrix product AJx results in a length m vector where we denote the ith element pAJxqi equivalently as the inner product AJi x. We denote time indices using subscripts, for all (scalar, vector, or matrix valued) functions of time. For example, xt or At, instead of xptq or Aptq. Random variables are denoted in bold, eg. x, and so stochastic processes also have a time subscript, eg. xt. The ith row of a time-varying vector or matrix is denoted with t, i subscript. For a function f : Rn Ñ R, an affine shift qf is defined as qfpxq “ fpxq ´ aJx´ b for some a P Rn and b P R, typically with a “ ∇fp0q and b “ fp0q. Chapter 2 Stochastic Contraction and Convex Projection In this chapter, we will analyze what happens to the general setup for linear MRAC from Chapter 9 of [41] when two important modifications are applied: 1. stochastic terms are added to the dynamics, which thus makes the system itself stochastic, such that at every time t ě 0 the state xt is a stochastic process with a probability distribution (law) Pt that evolves with time, and 2. reflection terms are added to the dynamics, which constrain the state of the system to be within any closed convex set X Ă Rn. First, consider the standard LTI control system which is described by the linear differential equation 9xt “ Axt `B ut (2.1) where xt P Rn is the system state, A is an n ˆ n matrix, ut P R` is the control input and B is an nˆ ` matrix. We know from classic results in LTI controls theory that the 10 11 solution to this equation is the state trajectory xt “ eAtx0 ` ż t 0 eApt´sqBusds where x0 P Rn is the initial state. And so, if we use an input ut which is stochastic, then the solution (state trajec- tory) xt will also necessarily be stochastic. Let us then input an n-dimension Brownian motion wt. This is a Wiener process satisfying that w0 “ 0n, its increments are multi- variate zero-mean Gaussian pwt`u ´wtq „ Np0n, uInq at all t ě 0, for any u ě t, and independent of past values ws for all 0 ď s ă t, and its sample paths are continuous in time. We express this using a differential form of (2.1) as dxt “ Axtdt`G dwt (2.2) where G is an nˆn matrix. This form is used because the time-derivative of Brownian Motion dwtdt does not exist anywhere. (And yet, wt is continuous!) It is no surprise then, that the solution to (2.2) is the stochastic process xt, also known as an Ornstein–Uhlenbeck process, which satisfies xt “ eAtx0 ` ż t 0 eApt´sqGdws ùñ Erxts “ eAtErx0s . If the eigenvalues of A are not zero, then a stationary distribution exists for this process which is pi “ Np0n,Σq, where the covariance matrix satisfies the Lyapunov equation AΣ`ΣA “ ´GGJ. If we also require G to be invertible, then GGJ is positive-definite. Therefore in both cases, if A is Hurwitz then we have: 1. the LTI system (2.1) is exponentially stable, with xt Ñ 0n exponentially fast from initial condition x0, and 2. the Ornstein-Uhlenbeck process (2.2) is exponentially stochastically stable, with the law xt „ Pt contracting to the stationary pi exponentially fast from the initial P0 and thus Erxts Ñ 0n exponentially fast. 12 We illustrate these concepts with the following figures. Figure 2.1 shows numerical simulation of both (2.1) and (2.2) using the same A “ ´In and B “ G “ In with ` “ n “ 10 and x0 “ x0 linearly spaced between r´10, 10s, over times t P r0, 10s sec. Figure 2.2 illustrates contractivity of the law xt „ Pt over time from deterministic P0. 0 2 4 6 8 10 −10.0 −7.5 −5.0 −2.5 0.0 2.5 5.0 7.5 10.0 Figure 2.1: Numerical simulation of similar LTI system and Ornstein-Uhlenbeck process from the same initial conditions. Figure 2.2: Contractivity of the law Pt from initial deterministic P0 (deltas) to the Gaussian stationary distribution pi. 13 Next, we consider additive reflection terms, which can be either deterministic ψt or stochastic processes ψt, and constrain the solutions xt to (2.1) and xt to (2.2) within a closed convex subset X Ă Rn. For example, consider the Brownian motion dwt in n “ 2 dimensions by itself, together with the additive reflection term ψt which constrains to X Ă R2 being a box containing the origin. The differential form is integrated to give the solution %t “ şt 0 dwt `ψt, and a numerical simulation is shown in Figure 2.3. Figure 2.3: Reflected Brownian Motion. The reflection process ψt constrains 2- dimensional Brownian Motion to remain within a box set X Ă R2 containing the origin. This chapter is adapted from the published work [44] and we summarize the orga- nization of the sections, as follows: 1. Section 2.1 provides background information for understanding the main results, 2. Section 2.2 sets up and provides the main result for exponential contraction of the law Pt in a specially constructed Wasserstein distance, 3. Section 2.3 explains how the application to linear MRAC meets the requirements of the results from the previous section, and 4. Section 2.4 provides a numerical simulation of this application. 14 2.1 Background This section gives a basic overview of important concepts like convex projection and reflection processes, and provides important definitions that will be used throughout the chapter (and the associated appendix) related to these concepts. 2.1.1 General Inner Product and Norm We use x¨, ¨y and } ¨ } without subscript to denote a general inner product and norm for x P Rn having the quadratic form }x}2 “ xx, xy “ xJPx, using some symmetric, positive-definite nˆ n matrix P . The Euclidean (P “ In) inner product and norm are denoted } ¨ }22 “ xx, xy2 “ xJx, and all other norms are explicitly mentioned in context. 2.1.2 The Convex Projection Operation The convex projection of any exterior point y P RnzX onto the set X is defined as ΠX pyq :“ arg min xPX }x´ y} 2 . (2.3) In words, the convex projection is the closest point on the boundary of X to the exterior point y, with distance in the sense of the general norm. Simple geometries, such as Euclidean balls and boxes, have analytic formulas for the convex projection. More generally, for any closed convex set, the convex projection can be obtained via constrained linear optimization methods. 2.1.3 The Normal Cone of a Closed Convex Set Let X be a closed convex set in Rn. Then at any y P Rn, the normal cone of X is the subset of vectors in Rn defined as NX pyq :“ $’’&’’% H if y R X tv P Rn | xx´ y, vy ď 0 @x P X u if y P X , (2.4) 15 where xx´ y, vy “ px´ yqJPv is the general inner product. In words, the normal cone at some y P X is the set of all vectors in Rn satisfying that the general inner products with all vectors pointing from that y to all x P X are non-positive. It is defined to be empty at all points outside the set X , and at all interior points of X the only such vector satisfying this criteria is the zero vector 0n. However, at boundary points of X the normal cone is nontrivial. For points on curved or flat boundary surfaces, it contains all non-negative scalings of the vector in the orthogonal direction (with respect to P ) to the boundary at that point. And at vertex points, it contains all non-negative scalings of all such orthogonal directions (again, with respect to P ), which forms a cone. 2.1.4 Relationship Between Convex Projection and Normal Cone By inspecting (2.3) and (2.4), we can see the close relationship between the convex projection onto X and the normal cone of X . That is, if we denote x˚ “ ΠX pyq as the projection onto X for some exterior point y P RnzX , then it must hold that there are v P NX px˚q extending outward from x˚ which pass through the point y. See Fig. 2.4 for an example with X Ă R2 as a compact polygon and P “ I2 (so that orthogonal directions to flat boundary surfaces are at right angles). 16 Figure 2.4: Convex Projection and Normal Cone. Here we see the relationship between convex projection of exterior points (colored dots to black dots) and vectors in the normal cone at those projected points (colored outwards pointing vectors). 2.1.5 Projection Coordinates When P ‰ In the convex projection (2.3) may not project the exterior point y to the nearest boundary point x P X , in the sense of Euclidean distance. This can be mitigated by using different coordinates for the projection. For all v, x, y P Rn, let qv, qx, qy be the linear transformations according to the matrix P´ 12 (eg. qx “ P´ 12x) and similarly let the transformed set be qX “ P´ 12X “ tP´ 12x | x P X u. Then for any fixed y P RnzX and corresponding qy, we have minqxP qX }qx´ qy}2 “ minqxP qX pqx´ qyqJP pqx´ qyq “ minpxPX ppx´ yqJP´ 12PP´ 12 ppx´ yq “ minpxPX }px´ y}22 where px “ P 12 qx P P 12 qX “ X and the matrices P 12 and P´ 12 are also symmetric, positive- definite and can always be constructed from the real, positive singular values of P by using their (inverse) square roots. This gives the Euclidean minimizer as px˚ “ P 12 qx˚. 17 2.1.6 Reflection Processes Here we will describe some basic concepts about reflection processes ψt, which are used to constrain the solution of a reflected (stochastic) differential equation within a set by reflecting it at the boundary. For details, see Appendix F of [39], as well as [75] and [73]. Let X Ă Rn be some closed convex set and let % : r0,8q Ñ Rn be a piecewise- continuous function of time with %0 P X . Then the piecewise-continuous functions of time x : r0,8q Ñ Rn and ψ : r0,8q Ñ Rn are said to solve the Skorokhod problem if the following hold for all T ą 0: 1. xt “ %t ` ψt, with x0 “ %0 P X , remains within X for all t P r0, T q 2. ψt has the form ψt “ ´ şt 0 vsdµpsq, where vt P NX pxtq such that }vt}2 P t0, 1u, and the non-negative measure µ is finite µpr0, tqq ă 8, for all t P r0, T q. Note that NX pxtq “ t0nu if xt is in the interior of X . And so, ψt only changes value at times when xt is at the boundary of X . Similarly, if %t is a solution to an unconstrained stochastic differential equation, then ψt “ ´ şt 0 vsdµpsq, with vt and µptq meeting the conditions above, is a stochastic reflection process with bounded variation, enforcing that xt “ %t`ψt, with x0 “ %0 P X , remains within X for all t ě 0. Thus, xt is a solution to the reflected stochastic differential equation (RSDE) dxt “ d%t ´ vtdµptq . Importantly, it therefore follows from the definition of the normal cone (2.4) and µptq being a non-negative measure, that the following hold for all t ě 0 and y P X : ´xxt ´ y,vtdµptqy “ ´pxt ´ yqJP vtdµptq ď 0 xy ´ xt,vtdµptqy “ py ´ xtqJP vtdµptq ď 0 . 18 These facts will be used in multiple places throughout the chapter (and the associated appendix), to show that the reflection processes have negligible effect on Lyapunov and martingale analysis. Finally, we note that the solution of the above RSDE, obtained by integrating the differential form as xt “ x0 ` ż t 0 d%t `ψt , can be simulated numerically (see [75, 73] for details) via a projected Euler method with the convex projection ΠX . Thus, the effect of ψt on %t is the continuous time limit of the convex projection operation ΠX back onto X . 2.1.7 Partial Projection Using Partial Identity P Matrix Let us define that xt, y P Rn`m with xJt “ rxJt,n xJt,ms and yJ “ ryJn yJms, and let it be that we only desire to project the xt,m part of the overall solution xt, leaving the xt,n part unconstrained. In such a case, we can define the closed convex set X :“ Rn ˆ Xm using the closed convex set Xm Ă Rm, and P “ »–Pn 0 0 Pm fifl and vt dµptq “ »– 0n vmt dµ mptq fifl where Pn and Pm are any symmetric, positive-definite matrices with respective n ˆ n and m ˆm dimension, and vmt P NXmpxt,mq is defined with respect to Pm. Thus, the general inner product is xxt, yy “ xJt Py “ xJt,nPnyn ` xJt,mPmym, which gives ´xxt ´ y,vtdµptqy “ 0´ pxt,m ´ ymqJPm vmt dµmptq ď 0 xy ´ xt,vtdµptqy “ 0` pym ´ xt,mqJPm vmt dµmptq ď 0 for all xt, y P X “ Rn ˆ Xm. We use this type of implementation for the application presented in Section 2.3, with Xm “ K as a compact convex set and Pm “ Im which avoids any need for projection coordinate transformations. 19 2.2 Contraction Analysis for Reflected Stochastic Differential Equations over Closed Convex Domains This section gives a general contraction result for reflected stochastic differential equa- tions (RSDEs) over closed convex domains. The results in this section build upon the contraction theory of [19], with substantial novel work required to generalize the con- ditions in that work. This was required in order for the adaptive control application studied in Section 2.3 to be able to satisfy those generalized conditions. We consider RSDEs in order to handle the constraints that can be applied to pa- rameter update rules in adaptive control applications, while in [19] the authors focus on unconstrained SDEs. 2.2.1 Problem Setup Let X be a closed convex subset of Rn. We will examine contractivity properties of continuous-time Markov processes with continuous sample paths xt P X , called diffusion processes, which are solutions (trajectories) of reflected SDEs having the form dxt “ Hpxtqdt`Gdwt ´ vtdµptq , (2.5) where H : Rn Ñ Rn is a time-invariant vector function (the drift coefficient), G is a constant, invertible nˆn matrix (the diffusion coefficient), wt is an n-dimensional Brow- nian motion (Wiener process), and ψt “ ´ şt 0 vsdµpsq is a bounded variation reflection process enforcing that xt P X for all t ě 0 from initial condition x0 P X . When H is Lipschitz, it can be shown that ψt is the unique bounded variation reflection process such that vt P NX pxtq with }vt}2 P t0, 1u, and µ is a random measure satisfying µpr0, tsq ă 8, for all t ě 0. Then the solution (trajectory) satisfies xt P X for all t ě 0. See [75, 49]. The Lipschitz condition can be relaxed to H being locally Lipschitz, provided that the process is nonexplosive. See Section 2.4 of [63]. 20 Note that instead of the usual 9xt “ dxtdt notation often seen in controls theory with deterministic differential equations, (2.5) instead uses the differential notation of SDEs. With this notation, we directly integrate over time to obtain the solution (trajectory) xt “ x0 ` ż t 0 Hpxsqds`G ż t 0 dws ´ψt . Such solutions can be simulated numerically via a projected Euler method with time step dt and independent Brownian increments wt`dt ´wt as xt`dt « ΠX ` xt ` dt Hpxtq `Gpwt`dt ´wtq ˘ . 2.2.2 Assumptions The first requirement is that H satisfy a one-sided growth condition. We assume that there is a function κ : p0,8q Ñ r0,8q satisfying ş10 rκprqdr ă 8 and a non-negative number α, such that for all x, y P X , with r :“ }x´ y}, the following bound holds: xx´ y,Hpxq ´Hpyqy ď κprqr2 ` α rp}x} ` }y}q . (2.6) This one-sided growth condition generalizes the one-sided Lipschitz condition from [19], which corresponds to the special case with α “ 0. The extra terms are required to handle the application to adaptive control in Section 2.3. Let A denote the generator associated with the (reflected) diffusion process xt. Specifically, for any function g : X Ñ R, the generator is given by pAgqpxq “ lim hÓ0 h ´1pErgpxhq|x0 “ xs ´ gpxqq . We then assume that there is a twice continuously differentiable Lyapunov function V : X Ñ r0,8q and positive numbers λ and C such that for all x P X it holds that pAVqpxq ď C ´ λVpxq . (2.7) 21 Also, we assume that Vpxq increases monotonically with }x}. Specifically, there is a strictly monotonically increasing function φ such that Vpxq “ φp}x}q. We will further assume that V grows at least linearly with }x} (ie, φ grows at least linearly). Let R1 be the diameter of the set tpx, yq P X | Vpxq`Vpyq ď 4C{λu. Linear growth implies that R1 is finite. By construction, if }x´ y} ą R1, then pAVqpxq ` pAVqpyq ď ´pλ{2qpVpxq ` Vpyqq . (2.8) Then, let M be a positive number such that M ě R1 and for all x with }x} ě M the following bound holds: Vpxq ě max " 2 λ pα}x} ` Cq, 4C λ p2}x} ` 1q * . (2.9) Finally, let R2 be the diameter of the set tpx, yq P X | }x} ď M and }y} ď Mu. Note then that R2 ď 2M by the triangle inequality. In Section 2.3 we take Vpxq “ }x}2 ` 1 (meaning φpdq “ d2 ` 1), and so the growth assumptions on the Lyapunov function V are automatically satisfied. 2.2.3 Background on Wasserstein Distances Our main theory describes convergence in a Wasserstein distance. To state the result, some basic concepts from optimal transport are required. See [77] for a general reference. Let P and Q be probability measures over X with respect to the standard Borel sigma algebra. Then the joint probability measure Γ over X ˆ X is called a coupling of P and Q if its marginals are respectively P and Q. In other words, for any Borel set S, we have ΓpS ˆ X q “ PpSq and ΓpX ˆ Sq “ QpSq. Let CpP,Qq denote the set of all couplings of P and Q. If ρ : RnˆRn Ñ r0,8q is a metric, then the q-Wasserstein distance W qρ , with respect 22 to metric ρ, between measures P and Q is defined as W qρ pP,Qq :“ ˆ inf ΓPCpP,Qq ż XˆX ρpx, yqqdΓpx, yq ˙1{q “ ´ inf ΓPCpP,Qq Epx,yq„Γ “ ρpx,yqq‰¯1{q . (2.10) For simple notation, we follow the convention that Wρ :“W 1ρ is used for general ρ, and W q :“W q}¨}2 is used for ρ being the Euclidean norm. 2.2.4 Main Contraction Result The main idea in [19] is to construct a metric ρ for which contraction in Wρ of the law Pt (ie, the time evolving probability measure) of the trajectory xt can be tractably analyzed, and then use the result to examine the contraction in more standard measures such as W q and the total variation distance. We follow a similar approach, but the metric must be modified to account for the more general growth condition (2.6). Our metric will have the form ρpx, yq “ `fp}x´ y}q ` γVpxq ` γVpyq ` maxtVpxq, φpMqu `maxtVpyq, φpMqu˘ Ipx ‰ yq . (2.11) Recall that φ is a function such that Vpxq “ φp}x}q and I is the indicator function. We define the following positive constants: ε :“ λmaxpP qσmaxpGq2 and γ :“ ξε 4C , (2.12) where λmaxpP q is the largest eigenvalue of the matrix P (which defines the general inner product and norm), σmaxpGq “ max}v}2“1 }Gv}2 is the largest singular value of G (with respect to the Euclidean norm), and ξ is defined below. 23 The function f : r0,8q Ñ r0,8q is defined via the following chain of definitions: hprq “ 1 ε ˆ 1 2 ż r 0 sκpsqds` αMr ˙ (2.13a) ϕprq “ e´hprq (2.13b) Φprq “ ż r 0 ϕpsqds (2.13c) ξ´1 “ ż R1 0 ϕpsq´1ds (2.13d) β´1 “ ż R2 0 pΦpsq{ϕpsqqds (2.13e) gprq “ 1´ ξ 4 ż mintr,R1u 0 ϕpsq´1ds´ β 4 ż mintr,R2u 0 pΦpsq{ϕpsqqds (2.13f) fprq “ ż mintr,R2u 0 ϕpsqgpsqds. (2.13g) We can now state the main contraction result, as follows. Theorem 2.1. The function ρpx, yq defined in (2.11) is a metric over x, y P X . Let xt and yt be two solutions to (2.5) with initial conditions x0 ‰ y0 P X and respective laws Pt and Qt. If the initial laws satisfy ş X VpxqdP0pxq ă 8 and ş X VpxqdQ0pxq ă 8, then the laws contract exponentially in Wρ distance as WρpPt,Qtq ď e´atWρpP0,Q0q , (2.14) where a “ 12 mintλ, ξε, βεu. Proof. The proof is given in Appendix A.1. The following corollary then establishes contraction to a stationary distribution pi (which satisfies that if QT “ pi, then Qt “ pi for all t ě T ) in total variation distance and standard q-Wasserstein distances, analogous to Corollary 2.1 and Remark 2.3 of [19]. Corollary 2.1. For (2.5), there exists a unique stationary distribution pi satisfyingş X Vpxqdpipxq ă 8 . Let xt be a solution to (2.5) with law Pt such that the initial law 24 satisfies ş X VpxqdP0pxq ă 8 . If Vpxq ě 1 for all x P X , then Pt contracts exponentially to pi in total variation distance as }Pt ´ pi}TV ď γ´1e´atWρpP0, piq . (2.15) Now let q P r1,8q satisfy 1p ` 1q “ 1 with the corresponding p ą 1. If Vpxq ě }x}q for all x P X , then Pt contracts exponentially to pi in standard W q as W qpPt, piq ď 21{p γ´1{q ` e´atWρpP0, piq ˘1{q . (2.16) Proof. The proof is given in Appendix A.2. 2.3 Application to Linear MRAC In this section, we analyze a stochastic variation of a linear MRAC application from Chapter 9 of [41]. We will implement adaptive regulation of the plant state towards the equilibrium point of a reference linear system, in spite of uncertainties in the plant’s system matrix and its parameterization of an additive nonlinear vector function. We also use convex projection, as detailed in Section 2.1, on the parameter update rules so that the parameters remain within a constraint set. The plant will also be under the influence of an additive (scaled) Brownian motion disturbance, and thus our objective will be to show that such applications of linear MRAC meet the requirements for the contraction results detailed in Section 2.2 2.3.1 Setup We largely follow the general linear MRAC setup introduced in Chapter 9 of [41],with some important modications. Notably, we introduce stochastic terms into the plant and parameter update dynamics, and we will be tracking the fixed model reference state xrt “ 0n, as opposed to a reference model with dynamic state behavior. 25 The plant has the form (in differential notation) dxt “ “ Axt `B ` ut ` ΩJΨpxtq ˘‰ dt`Gxdwxt , (2.17) where A is an unknown n ˆ n state matrix for the plant state xt P Rn, B is a known n ˆ ` input matrix for the input ut P R`, Ω is an unknown L ˆ ` matrix that linearly parameterizes a vector of known nonlinear basis functions Ψ1, . . . ,ΨL : Rn Ñ R of the plant state, and Gx is an invertible n ˆ n matrix scaling an n-dimensional Brownian motion wxt . We omit the input scaling matrix Λ, which typically gives an overall input matrix of BΛ, for simplicity. The linear model reference system is then given by 9xrt “ Arxrt `Brt , (2.18) where Ar is a known Hurwitz nˆn reference state matrix for the reference state xrt P Rn and rt P R` is a bounded reference input. And so, we assume there exists an (unknown) nˆ ` matrix of feedback control gains K satisfying the matching condition A`BKJ “ Ar . (2.19) Since the reference input rt will not deviate from zero, there is no need for a specific reference input matrix Br with corresponding matching condition and feedforward gains. With the matching condition satisfied, the plant (2.17) can be rewritten as dxt “ “ Arxt `B ` ut ´KJxt ` ΩJΨpxtq ˘‰ dt`Gxdwxt , (2.20) Remark 2.1. The condition (2.19) is feasible if some Hurwitz Ar can be reached from pA,Bq by state feedback. Towards this end, it is often practical to solve for an LQR feedback gain matrix KLQR and then check that Ar “ A`BKJLQR is Hurwitz. The methods derived in Chapter 9 of [41] aim to drive the norm of the nominal state tracking error et “ xt ´ xrt to zero asymptotically as limtÑ8 }et}2 “ 0, by designing 26 parameter update rules based on Lyapunov analysis in combination with Barbalat’s lemma. Here, xt is the nominal plant state, which is (2.17) without any added stochastic term Gxdw x t . In this application, we will set x r 0 “ 0n and rt “ 0` for all t ě 0, and so the reference state will always remain at the origin. Thus, it holds that the stochastic state tracking error et :“ xt´ 0n “ xt, and so our goal in this section is to regulate the (stochastic) plant state to the origin, in spite of the plant uncertainties and the (scaled) Brownian motion disturbance. If K and Ω were known, we could then simply apply the (stochastic) feedback control law of ut “ KJxt ´ ΩJΨpxtq and obtain a stable Ornstein–Uhlenbeck process dxt “ Arxtdt`Gxdwxt . But since we do not know K or Ω, we must instead use ut “ pKJtxt´ pΩJt Ψpxtq, wherepKt and pΩt are stochastic parameter estimates that require dynamic update rules. To simplify notation, we define the following: Θ :“ »– K ´Ω fifl , pΘt :“ »– pKt ´pΩt fifl , Φpxtq “ »– xt Ψpxtq fifl . (2.21) Here, Θ and pΘt are N ˆ ` matrices and we have the vector function Φ : Rn Ñ RN (note this is unrelated to the Φprq in (2.13)), where we set N :“ n ` L. Then the matched plant (2.20) with feedback control law ut “ pΘJt Φpxtq gives the state dynamics as dxt “ “ Arxt `Bp pΘt ´ΘqJΦpxtq‰dt`Gx dwxt . (2.22) We assume that the vector function Φ satisfies a Lipschitz condition }Φpxq ´Φpyq}2 ď L}x´ y}2, (2.23) for all x, y P X and some constant scalar L ą 0. Now we define a vectorization of the matrices Θ and pΘt. Set m :“ N` and let S : Rm Ñ RNˆ` be the matrix shaping function defined by Spvqi,j “ vpi´1q``j for 27 all i, j P rN s ˆ r`s. Then S is an invertible linear function, such that we have the vectorizations: θ “ S´1pΘq and pθt “ S´1p pΘtq. Let K be a compact convex subset of Rm that contains the true values θ “ S´1pΘq. We assume that such a K is known, with diameter D :“ diamtKu. Then we let X “ Rn ˆ K be the closed convex subset of Rn`m containing the combined state zJt “ rxJt pθJt s P X , and assume that pθt has an update rule satisfying an RSDE of the form dpθt “ Rpztq dt`Gθ dwθt ´ vθt dµθptq , (2.24) where R : X Ñ Rm, wθt P Rm is an m-dimensional Brownian motion with invertible scaling matrix Gθ P Rmˆm, and ψθt “ ´ şt 0 v θ s dµ θpsq is a bounded variation reflection process enforcing that pθt P K for all t ě 0 from initial condition pθ0 P K. And so, the combination of (2.22) and (2.24) defines a special case of the RSDE (2.5). Indeed, for every point zJ “ rxJ ϑJs P X , the corresponding NX pzq contains vectors of the form vJ “ r0Jn pvθqJs, where vθ P NKpϑq and NKpϑq is defined with respect to the Euclidean inner product. Remark 2.2. The parameter update rule (2.24) differs from typical methods of adaptive control in how it forces pθt to remain in the constraint set K. This method uses reflection processes (RSDEs), which can be approximated in discrete time by convex projections. In contrast, the parameters of adaptive control laws are commonly constrained using specialized projection operators that are designed for specific classes of convex sets (see examples in [41, 26, 28, 35]). 2.3.2 Lyapunov-Based Parameter Update Rule We will follow a Lyapunov-based procedure for designing the parameter update rule (2.24) similar to the method in Chapter 9 of [41], but incorporating the stochastic 28 terms. First, we construct the Lyapunov function. We fix some nˆ n symmetric, positive definite matrix Qx, and since Ar is Hurwitz, then there is a unique n ˆ n symmetric, positive definite matrix Px satisfying the Lyapunov equation AJr Px ` PxAr “ ´Qx . (2.25) For any point zJ “ rxJ ϑJs P X “ Rn ˆK, we thus define the Lyapunov function as Vpzq “ xJPxx` pϑ´ θqJpϑ´ θq ` 1 , (2.26) recalling that θ is a vectorized Θ, containing the true parameter values. We define the corresponding norm over Rn`m as }z}2 “ zJPz “ xJPxx ` ϑJϑ, where our overall P matrix is the block diagonal P “ »–Px 0 0 Im fifl . (2.27) The setup in [41] uses the Lyapunov function Vpzq “ xJPxx` tr ` Spϑ´ θqJΓSpϑ´ θq˘ , where Γ is an N ˆN symmetric, positive-definite adaptation rates matrix. Noting that aJb “ trpSpaqJSpbqq “ trpAJBq holds for all a, b P Rm, then (2.26) satisfies this with Γ “ IN such that tr ` Spϑ´ θqJIN Spϑ´ θq ˘ “ pϑ´ θqJpϑ´ θq. Theorem 2.1 requires that the Lyapunov function be a monotonic function of a norm. This could be attained using the affine coordinate transformation zˇJ “ rxJ pϑ ´ θqJs and then setting Vˇpzˇq “ }zˇ}2 ` 1 “ Vpzq. However, in the analysis below we will work in the original coordinates zJ “ rxJ ϑJs. 29 We now apply Itoˆ’s formula to the Lyapunov function (2.26), which gives: dVpztq “ ∇xVJdxt ` 1 2 dxJt p∇2x Vq dxt `∇θVJdpθt ` 12 dpθJt p∇2θ Vq dpθt “ 2xJt PJx dxt ` dxJt Px dxt ` 2 ppθt ´ θqJdpθt ` dpθJt dpθt “ xJt ` AJr Px ` PxAr ˘ xt dt` 2xJt PxBp pΘt ´ΘqJΦpxtq dt` 2xJt PxGx dwxt ` dxJt Px dxt ` 2 tr `p pΘt ´ΘqJd pΘt˘` dpθJt dpθt “ ´xJt Qx xt dt` 2 tr `p pΘt ´ΘqJpΦpxtqxJt PxB dt` d pΘtq˘ ` trpGJxPxGxq dt` dpθJt dpθt ` 2xJt PxGx dwxt . (2.28) The third equality uses (2.22), that aJb “ trpSpaqJSpbqq “ trpAJBq for any a, b P Rm, and that the symmetry of Px gives 2xJt PJx dxt “ xJt Px dxt ` dxJt Px xt . The fourth equality uses (2.25), the trace identity aJb “ trpbaJq as 2xJt PxBp pΘt ´ΘqJΦpxtq dt “ 2tr`p pΘt ´ΘqJΦpxtqxJt PxB dt˘ , and that the Itoˆ rules dt2 “ 0, dt dwxt “ 0n, and dwxt pdwxt qJ “ dt In thus give dxJt Px dxt “ pdwxt qJGJxPxGxdwxt “ trpGJxPxGxq dt . By examining the second term of (2.28), we see that ΦpxtqxJt PxB dt within the trace operation is canceled if R in (2.24) is defined as Rpztq “ S´1 `´ΦpxtqxJt PxB˘ , (2.29) where S´1 vectorizes the matrix in the parentheses by stacking its rows. This gives the design for the stochastic parameter update rule of pθt, and so also pΘt “ Sppθtq. 30 2.3.3 Satisfying Assumptions on the Lyapunov Equation Now we will continue analyzing dVpztq to ensure that the Lyapunov function V as defined in (2.26) satisfies the required assumptions in Section 2.2.2. Using the R defined in (2.29) within the parameter update rule dpθt (2.24) and then plugging this into dVpztq (2.28) by using d pΘt “ Spdpθtq, gives dVpztq “ ´xJt Qxxt dt` 2 tr ´ p pΘt ´ΘqJ´SpGθ dwθt q ´ S`vθt dµθptq˘¯¯ ` trpGJxPxGxqdt` trpGJθ Gθq dt` 2xJt PxGx dwxt “ `´ xJt Qxxt ` trpGJxPxGxq ` trpGJθ Gθq ˘ dt ` 2xJt PxGx dwxt ` 2 ppθt ´ θqJGθ dwθt ´ 2 ppθt ´ θqJvθt dµθptq , (2.30) where again we used aJb “ trpSpaJqSpbqq and the Itoˆ rules dt2 “ 0, dt dwθt “ 0m, and dwθt pdwθt qJ “ dt Im in the second equality. Now since vθt P NKppθtq and µθ is a non-negative measure, the definition of the normal cone of K, with respect to the lower block identity Im in (2.27), therefore implies ´ppθt ´ θqJvθt dµθptq ď 0 . And so, we have dVpztq ď `´ xJt Qxxt ` trpGJxPxGxq ` trpGJθ Gθq ˘ dt ` 2xJt PxGx dwxt ` 2 ppθt ´ θqJGθ dwθt , giving that the generator of the Lyapunov function Vpztq satisfies AVpzq ď `´ xJQxx` trpGJxPxGxq ` trpGJθ Gθq ˘ . (2.31) To ensure that this meets the required decrease assumption (2.7), we use a standard quadratic form bound, followed by the diameter condition on K as pϑ´θqJpϑ´θq ď D2. 31 Thus, from (2.26) we have: xJQxx ě λminpQxq λmaxpPxqx JPxx ě λminpQxq λmaxpPxqVpzq ´ λminpQxq λmaxpPxqpD 2 ` 1q . Here λminpQxq is the minimum eigenvalue of Qx and λmaxpPxq is the maximum eigen- value of Px. Plugging this bound into (2.31) shows V satisfies (2.7) with C “ trpGJxPxGxq ` trpGJθ Gθq ` λminpQxqλmaxpPxqpD 2 ` 1q λ “ λminpQxq λmaxpPxq . The constants R1, M , and R2 are then defined relative to this C and λ as in Section 2.2.2, along with the inequalities (2.8) and (2.9) for the α to be defined in the next section. All assumptions on the Lyapunov equation have been satisfied. 2.3.4 Satisfying the One-Sided Growth Condition The final task needed to apply Theorem 2.1 to this application is to ensure that the one-sided growth condition (2.6) holds. Note that the combination of (2.22), (2.24), and (2.29) leads to a special case of (2.5) with the following definitions: Hpztq “ »–Axt `B ` pΘt ´ΘqJΦpxtq S´1 `´ΦpxtqxJt PxB˘ fifl , G “ »–Gx 0 0 Gθ fifl dwt “ »–dwxt dwθt fifl , and vt dµptq “ »– 0 vθt dµ θptq fifl . (2.32) Let zJ “ rxJ, ϑJs and rzJ “ rrxJ, rϑJs be two points in X , and let us denote Ξ “ Spϑq and rΞ “ Sprϑq for visual distinction from the true parameters Θ “ Spθq. Direct 32 calculation using the general inner product xz, rz y “ xJPxrx` ϑJrϑ then gives: xz ´ rz,Hpzq ´Hprzqy “C »–x´ rx ϑ´ rϑ fifl , »–Apx´ rxq `B `Ξ´ΘqJΦpxq ´B `rΞ´ΘqJΦprxq S´1 `´ΦpxqxJPxB˘´ S´1`´ΦprxqrxJPxB˘ fiflG “ px´ rxqJPxA px´ rxq ` px´ rxqJPxB pΞ´ΘqJΦpxq ´ px´ rxqJPxB prΞ´ΘqJΦprxq ` tr`pΞ´ rΞqJpΦpxqxJ ´ΦprxqrxJ˘PxB˘ “ ´1 2 px´ rxqJQx px´ rxq ` tr`pΞ´ΘqJΦpxqpx´ rxqJPxB˘´ tr`prΞ´ΘqJΦprxqpx´ rxqJPxB˘ ` tr`pΞ´ rΞqJp´ΦpxqxJ `ΦprxqrxJ˘PxB˘ “ ´1 2 px´ rxqJQx px´ rxq `((((((( ((( tr ` ΞJΦpxqxJPxB ˘´ tr`ΞJΦpxqrxJPxB˘´ tr`ΘJΦpxqxJPxB˘` tr`ΘJΦpxqrxJPxB˘ ´ tr`rΞJΦprxqxJPxB˘`((((((((((tr`rΞJΦprxqrxJPxB˘` tr`ΘJΦprxqxJPxB˘´ tr`ΘJΦprxqrxJPxB˘ ´((((((( ((( tr ` ΞJΦpxqxJPxB ˘` tr`ΞJΦprxqrxJPxB˘` tr`rΞJΦpxqxJPxB˘´((((((((((tr`rΞJΦprxqrxJPxB˘ ď ´tr`ΞJΦpxqrxJPxB˘´ tr`ΘJΦpxqxJPxB˘` tr`ΘJΦpxqrxJPxB˘´ tr`ΘJΦprxqrxJPxB˘ ´ tr`rΞJΦprxqxJPxB˘` tr`ΘJΦprxqxJPxB˘` tr`ΞJΦprxqrxJPxB˘` tr`rΞJΦpxqxJPxB˘ “ ´xJPxB prΞ´ΘqJpΦpxq ´Φprxqq ` rxJPxB pΞ´ΘqJpΦpxq ´Φprxqq , (2.33) where we drop the non-positive ´12px´ rxqJQx px´ rxq term for inequality and use the trace identity aJb “ trpb aJq in many places. Setting y :“ PxB pΞ´ΘqJpΦpxq ´Φprxqq ry :“ PxB prΞ´ΘqJpΦpxq ´Φprxqq , 33 note that we then have }y ´ ry}2 “ }PxBpΞ´ rΞqJpΦpxq ´Φprxqq}2 ď }PxB}2}Ξ´ rΞ}2L}x´ rx}2 ď }PxB}2}Ξ´ rΞ}FL}x´ rx}2 ď }PxB}2DL}x´ rx}2 . Here, } ¨ }2 applied to matrices is the induced martix 2-norm. Then the first inequality uses sub-multiplicativity followed by the Lipschitz assumption on Φ. Next we note that the induced matrix 2-norm is bounded above by the Frobenius norm, and that }Ξ ´ rΞ}F “ }ϑ ´ rϑ}2. So, the final inequality follows from the diameter bound on K. An analogous derivation shows that }ry}2 ď }PxB}2}rΞ´Θ}FL}x´ rx}2 ď }PxB}2DL}x´ rx}2 . Recalling that }z}2 “ xJPxx ` ϑJϑ, we can then apply a quadratic form bound of }x}2 ď 1? λminpPxq}z}. Therefore, we achieve the following overall upper bound: xz´rz,Hpzq ´Hprzqy ď ´xJry ` rxJy “ prx´ xqJry ` rxJpy ´ ryq ď }rx´ x}2}ry}2 ` }rx}2}y ´ ry}2 ď }PxB}2DL `}x´ rx}22 ` }rx}2}x´ rx}2˘ ď }PxB}2DL λminpPxq `}z ´ rz}2 ` }rz}}z ´ rz}˘ ď }PxB}2DL λminpPxq ` r2 ` rp}z} ` }rz}q˘ . This satisfies the one-sided growth requirement (2.6) with κprq “ α “ }PxB}2DL λminpPxq . 34 The results are summarized in the following theorem: Theorem 2.2. The combined system dynamics and parameter update rule described by (2.22), (2.24), and (2.29) leads to a special case of (2.5), as defined in (2.32). The control law ut “ pΘJt Ψpxtq with parameter estimates, as defined in (2.21), meets all requirements of Section 2.2, such that the law Pt of any solution xt from initial condition x0 P X , contracts to a unique stationary distribution pi. Contraction is exponential with respect to total variation distance and standard W qpPt, piq Wasserstein distance, with convergence rates described by Theorem 2.1 and Corollary 2.1. 2.4 Numerical Simulation of Linear MRAC Application In this section, we will perform a numerical simulation of the application in Section 2.3, using the example from Section 11.5 of [41]. The linear system is an LTI model for the lateral-directional dynamics of a small passenger aircraft in cruise configuration. The state xt “ rxt,1 xt,2 xt,3 xt,4sJ represents the bank angle, sideslip angle, roll rate, and yaw rate, while the control input ut “ rut,1 ut,2sJ represents the rudder and aileron deflection angles (all units are radians and radians/sec). The nominal plant is given as 9xt “ Axt `B ` ut ` ΩJΨpxtq ˘ , with A “ »———————– 0 0 1 0 0.0487 ´0.0829 0 ´1 0 ´4.546 ´1.699 0.1717 0 3.382 ´0.0654 ´0.0893 fiffiffiffiffiffiffiffifl , B “ »———————– 0 0 0 0.0116 27.276 0.5758 0.3952 ´1.362 fiffiffiffiffiffiffiffifl , ΩJ “ »–2A3,2 1A3,3 1A3,4 2A4,2 1A4,3 1A4,4 fifl , and Ψpxq “ rx2 x3 x4sJ . 35 Thus, we have n “ 4, ` “ 2, and L “ 3. We can see that the input ut mainly affects the last two states (roll and yaw rates). And so, the unknown linearly parameterized ΩJΨpxtq represents an uncertainty about the critical system model parameters of A for those two states, which we will also assume are unknown. (Note: in [41] it appears the A entries used for the parameterization are incorrect, relative to the stated goal of the modeling uncertainties.) The given A has eigenvalues of λpAq “ t´0.0464 ˘ j1.8784, ´1.7797, 0.0014u, so we use the LQR method with QLQR “ diagpr0 0 0.1 5sq and RLQR “ I` to obtain KJ “ »–´0.1118 ´0.1911 ´0.2756 0.1011 ´0.084 0.1267 ´0.0414 2.1049 fifl , which gives a Hurwitz Ar “ A`BKJ “ »———————– 0 0 1 0 0.0477 ´0.0814 ´0.0005 ´0.9756 ´3.0986 ´9.6867 ´9.2396 4.1402 0.0702 3.1339 ´0.1179 ´2.9162 fiffiffiffiffiffiffiffifl with eigenvalues of λpArq “ t´8.835, ´0.3319, ´1.5352˘ j1.0673u. Setting Qx “ diagpr1 1 10 10sq and solving the Lyapunov equation with Ar gives Px “ »———————– 3.2517 0.2803 0.1649 ´0.0356 0.2803 8.1912 ´0.4639 ´1.3807 0.1649 ´0.4639 0.5565 0.2 ´0.0356 ´1.3807 0.2 2.4605 fiffiffiffiffiffiffiffifl which is positive definite with eigenvalues of λpPxq “ t8.5513, 3.251, 0.5085, 2.149u. We use stochastic parameter estimates pθt “ S´1p pΘtq, with m “ 14, in place of the unknown K and Ω, and Φ, as defined in (2.21) with N “ 7, in the control law ut “ pΘJt Φpxtq as in (2.22). This gives the stochastic plant as dxt “ “ Arxt `Bp pΘt ´ΘqJΦpxtq‰dt`Gx dwxt , 36 where we set Gx “ gx I4 for some positive scalar gx ą 0. We similarly use (2.24) and (2.29) to get the stochastic parameter update rule as dpθt “ S´1`´ΦpxtqxJt PxB˘ dt`Gθ dwθt ´ vθt dµθptq , where we set Gθ to be block diagonal of GK “ gK I8 and GΩ “ gΩ I6, for some positive scalars gK , gΩ ą 0. The compact convex constraint set K Ă R14 is defined in the following manner. Each row of pΘt (column of pΘJt ) will be constrained within a convex 2-dimensional polygon. Each of the n “ 4 rows corresponding to parameters of K will use the same polygon, and each of the L “ 3 rows corresponding to parameters of Ω will use the same polygon. See Figure 2.7 for a visualization of how the parameter estimates are constrained within these polygons. With the combined state zJt “ rxJt pθJt s P X “ R4 ˆ K, and using the definitions of H, G, dwt, and vtdµptq in (2.32), we have the overall adaptive regulation system described by the RSDE dzt “ Hpztqdt`Gdwt ´ vtdµptq , (2.34) which meets all requirements of Theorem 2.2. Note here that G has the overall form G “ »————– gx I4 0 0 0 gK I8 0 0 0 gΩ I6 fiffiffiffiffifl . We perform three simulations, each simulating two solutions z p1q t and z p2q t to (2.34) from the initial conditions: x p1q 0 “ r0.1 ¨ ¨ ¨ 0.1sJ, xp2q0 “ r´0.15 ¨ ¨ ¨ ´ 0.15sJ, andpθp1q0 “ pθp2q0 “ 0m, thus satisfying zp1q0 ‰ zp2q0 P X . The three scenarios are: i) gx “ gK “ gΩ “ 0, which gives the nominal system behavior with no stochastic terms, ii) gx “ gK “ gΩ “ 0.05, which gives the stochastic system behavior with a moderate amount 37 of Brownian disturbance, and iii) gx “ gK “ 0.01 and gΩ “ 1, which places higher Brownian disturbance on the pΩt stochastic parameter estimates in order to highlight the convex projection to and constraint within K. All simulations are performed using a projected Euler method over t P r0, 30s sec, with timesteps of dt “ 0.001 sec and a convex projection operation ΠK. This aligns with the discussion at the end of Section 2.2.1. At each timestep, an independent Brownian increment wt`dt ´wt is sampled from N ` 0n`m, dt In`m ˘ . The solution z p2q t uses Brownian motion d rwt “ pIn`m ´ 2δtδJt qdwt, which induces a reflection coupling between the two solutions. (See Appendix A.1 for details.) The coupling time τ between the solutions is the first time step when the norm difference is practically close to zero, such that }zp1qτ ´ zp2qτ } ď ε « 0. Figure 2.5 shows a plot of the states x p1q t and x p2q t over time. We see in the first scenario (top plot) the nominal behavior with no stochastic terms. The states decay (mostly) smoothly down to the origin. The second scenario (middle plot) shows similar behavior, but with stochastic noise added to the trajectories. Coupling occurs at τ « 17sec. The third scenario (last plot) shows similar behavior until around t “ 10 sec, when a larger disturbance occurs, after which the states again decay down to the origin. We see similar results in Figure 2.6, which plots the Euclidean norms }xp1qt }2 and }xp2qt }2 of the states over time, and is being driven to zero in this application. Figure 2.7 shows the phase plot of the polygons used to constrain each row of pKp1qt ,pKp2qt (left side) and pΩp1qt , pΩp2qt (right side). Blue dots show true values K and Ω. Red and Green dots show final parameter estimates. We see in all scenarios that the constraint is enforced for the pKt heavily on the left side of the set. The constraint for pΩt is mostly not needed for the first two scenarios but in the third it is enforced at all boundaries. Figure 2.8 shows the Euclidean norms } pΘp1qt ´Θ}2 and } pΘp2qt ´Θ}2 of the parameter error from the true values over time, which is not being driven to zero in this application. 38 0 5 10 15 20 25 30 −3 −2 −1 0 1 2 0 5 10 15 20 25 30 −3 −2 −1 0 1 2 0 5 10 15 20 25 30 −2 −1 0 1 2 Figure 2.5: States x p1q t (blue) and x p2q t (orange), for the three simulation scenarios. Coupling time τ is the vertical red line. 39 0 5 10 15 20 25 30 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0 5 10 15 20 25 30 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0 5 10 15 20 25 30 0.0 0.5 1.0 1.5 2.0 2.5 Figure 2.6: Euclidean norm of states }xp1qt }2 (magenta) and }xp2qt }2 (cyan), for the three simulation scenarios. Coupling time τ is the vertical red line. 40 −0.4 −0.2 0.0 0.2 0.4 K1 −0.5 0.0 0.5 1.0 1.5 2.0 2.5 K 2 −10 −8 −6 −4 −2 0 Ω1 0 2 4 6 Ω 2 −0.4 −0.2 0.0 0.2 0.4 K1 −0.5 0.0 0.5 1.0 1.5 2.0 2.5 K 2 −10 −8 −6 −4 −2 0 Ω1 0 2 4 6 Ω 2 −0.4 −0.2 0.0 0.2 0.4 K1 −0.5 0.0 0.5 1.0 1.5 2.0 2.5 K 2 −10 −8 −6 −4 −2 0 Ω1 0 2 4 6 Ω 2 Figure 2.7: Phase plots of the 2D polygons used to constrain each row of pKt (left side) and pΩt (right side), with blue for pΘp1qt and orange for pΘp2qt . Cyan dots show true parameter values. Red and Green dots show final parameter estimates. 41 0 5 10 15 20 25 30 11.45 11.50 11.55 11.60 11.65 0 5 10 15 20 25 30 11.25 11.30 11.35 11.40 11.45 11.50 11.55 11.60 11.65 0 5 10 15 20 25 30 9 10 11 12 13 14 Figure 2.8: Euclidean norm of parameter error } pΘp1qt ´Θ}2 (magenta) and } pΘp2qt ´Θ}2 (cyan), for the three simulation scenarios. Coupling time τ is the vertical red line. Chapter 3 Persistency of Excitation with Step and ReLU Functions In this chapter, we introduce novel sufficient conditions for achieving persistency of excitation with vector functions constructed from single hidden-layer neural networks that use step or ReLU activation functions. We apply these conditions to the general setup for linear MRAC from Chapter 9 of [41], showing how that setup can meet the conditions and therefore learn the true parameterization values. Persistency of excitation is a fundamental concept employed within many contexts and applications related to parameter learning, such as system identification and adap- tive control. It is often discussed, or at least mentioned, in adaptive control textbooks, such as in [71, 74, 41]. It was proven in [55, 2] that persistency of excitation is necessary and sufficient for the global uniform asymptotic stability of the linear time-varying (LTV) system 9rθt “ ´Vt V Jt rθt , (3.1) where rθt P RN is the system state, and V : r0,8q Ñ RN is a regulated vector function of 42 43 time. Here, regulated means that the left and right limits of Vt exist at all times t ě 0, though they may not be equal at a finite number of discontinuities on a compact subset C Ă Rn. Let us denote Vt “ rVt,1 ¨ ¨ ¨ Vt,N sJ, with each component a scalar regulated function of time, mapping to R. The N ˆN matrixż τ`T τ Vt V J t dt (3.2) defines a Gramian, since the i, j-th entry şτ`T τ Vt,i Vt,j dt defines an inner product of functions Vt,1, . . . , Vt,N over any positive time interval rτ, τ ` T s with τ ě 0 and T ą 0. Gramian matrices are always positive semidefinite, which can be shown using the bi- linearity of inner products. Persistency of excitation is the requirement that the Gramian matrix (3.2) be strictly positive definite with eigenvalues in some positive, bounded interval rα1, α2s, over all shifts τ ě 0 of the sliding time window rτ, τ ` T s for some window length T ą 0. Formally, persistency of excitation requires the existence of constants α1, α2, T ą 0 such that α1IN ĺ ż τ`T τ Vt V J t dt ĺ α2IN (3.3) holds for all τ ě 0. An equivalent scalar requirement is that α1}v}22 ď ż τ`T τ pvJVtq2 dt ď α2}v}22 (3.4) must hold for all v P RN and τ ě 0, since vJVtV Jt v “ pvJVtqpvJVtqJ “ pvJVtq2. Note that we can partition the time window r0, T s for any T ą 0 using any positive integer D P N length increasing sequence of times 0 “ T0 ă T1 ă ¨ ¨ ¨ ă TD´1 ă TD “ T , and thus obtain by linearity of integration thatż τ`T τ Vt V J t dt “ Dÿ d“1 ż τ`Td τ`Td´1 Vt V J t dt . 44 Note then that by the same reasoning as with (3.2), each integral in the sum is also a Gramian matrix and thus positive semidefinite. Therefore, for any particular τ shift of the window, the overall sum will be positive definite if at least one term, corresponding to the partition rτ ` Td´1, τ ` Tds for some d P rDs, is positive definite. This follows since vJpA1 ` A2 ` . . . qv “ vJA1v ` vJA2v ` ¨ ¨ ¨ ą 0 for all v P RNzt0Nu, if at least one of A1, A2, . . . is positive definite and all others are positive semidefinite. And so, if (3.3) holds for some T ą 0, then it must also hold for all rT ą T sinceż τ` rT τ Vt V J t dt “ ż τ`T τ Vt V J t dt` ż τ` rT τ`T Vt V J t dt . This chapter is adapted from the published work [45] and we summarize the orga- nization of the sections, as follows: 1. Section 3.1 explains P.E. concepts in the specific setting of basis functions and state trajectories, which will be used throughout the chapter, 2. Section 3.2 defines the activations used in the basis functions, and associated geometry they induce on the state space, which are crucial to the P.E. conditions, 3. Section 3.3 gives the main results, which are novel sufficient conditions for P.E. when using step and ReLU activations, and 4. Section 3.4 analyzes the linear MRAC setup, explains how it can meet the P.E conditions, and provides a numerical simulation of this application. 3.1 Persistency of Excitation With Basis Functions of a Continuous State Trajectory In this chapter, we will focus on vector functions Vt having the form Vt “ Ψpxtq “ rg1pxtq ¨ ¨ ¨ gN pxtqsJ , 45 where nonlinear basis functions g1, . . . , gN : Rn Ñ R are continuous almost everywhere and x : r0,8q Ñ Rn is a continuous-time trajectory representing the state of a system. And so, we will analyze sufficient conditions for persistency of excitation defined by the existence of constants α1, α2, T ą 0 such that α1IN ĺ ż τ`T τ ΨpxtqΨpxtqJ dt ĺ α2IN (3.5) and the equivalent scalar requirement, for all v P RN , that α1}v}22 ď ż τ`T τ pvJΨpxtqq2 dt ď α2}v}22 (3.6) hold for all τ ě 0. Recall that here, the integral in (3.5) is a (time-varying) Grammian matrix and the inequality represents that its (nonnegative) eigenvalues must be in some strictly positive, bounded interval rα1, α2s. Remark 3.1. If the continuous trajectory xt stays within some compact set C Ă Rn for all t ě 0, and if g1, . . . , gN are bounded on C, then a finite upper eigenvalue bound α2 will exist. The difficulty of establishing the persistency of excitation condition (3.5) for a given Ψ and bounded trajectory xt lies with proving that there exists some window length T ą 0 which gives a strictly positive lower eigenvalue bound α1, over all τ ě 0 shifts of the window rτ, τ ` T s. We now highlight an important connection between the persistency of excitation con- dition (3.5) and linear independence of the continuous almost everywhere basis functions g1pxq, . . . , gN pxq, whereby there exists no nonzero vector v P RNzt0Nu such that vJΨpxq “ Nÿ i“1 vigipxq “ 0 for almost all x P Rn . In other words, only the zero vector v “ 0N allows this to hold. Lemma 2 of [40] shows that g1, . . . , gN are linearly independent if and only if there exist points xp1q, . . . , xpNq P Rn in the state space such that all g1, . . . , gN are continuous 46 at all xp1q, . . . , xpNq points and the matrix»————– Ψpxp1qqJ ... ΨpxpNqqJ fiffiffiffiffifl “ »————– g1pxp1qq . . . gN pxp1qq ... ... g1pxpNqq . . . gN pxpNqq fiffiffiffiffifl is invertible. Then, Proposition 1 of [40] shows that if the continuous trajectory xt over some time window rτ, τ ` T s satisfies the following two conditions: 1) there exist times t1, . . . , tN in the window such that }xti ´ xpiq}2 ă  for all i P rN s and some positive scalar  ą 0, and 2) the time derivative is bounded as } 9xt}2 ă c for some positive scalar c ą 0, then there exists a positive scalar α1 such that α1IN ĺ ż τ`T τ ΨpxtqΨpxtqJ dt . Therefore, this shows that if the continuous trajectory xt stays within the compact set C P Rn with a bounded time derivative, and within every τ ě 0 shift of a time window of length T it gets sufficiently close to points xp1q, . . . , xpNq P Rn that ensure the linear independence of g1, . . . , gN , which are bounded on C, then the persistency of excitation condition (3.5) is guaranteed to hold. And so, in this chapter we will show how to find such points xp1q, . . . , xpNq P Rn that ensure the linear independence of g1, . . . , gN , when they are composed of specific nonlinear, positive semidefinite activation functions together with unique affine trans- formations of the state space Rn. 3.2 Nonlinear, Positive Semidefinite Activation Functions Let Ψ : Rn Ñ RN be a vector function defined as Ψpxq “ »————– g1pxq ... gN pxq fiffiffiffiffifl “ »————– σ1pwJ1x` b1q ... σN pwJNx` bN q fiffiffiffiffifl , (3.7) 47 where g1, . . . , gN : Rn Ñ R are composed of nonlinear, positive semidefinite activa- tion functions σ1, . . . , σN : R Ñ R together with an affine transformations wJ1x ` b1, . . . , w J Nx ` bN : Rn Ñ R. We allow wi P Rnzt0nu and bi P R to be arbitrary for all i P rN s, except that we assume each wJi x ` bi “ 0 hyperplane in Rn is unique and has dimension n ´ 1. Let us define the n ˆ N matrix W “ rw1 ¨ ¨ ¨ wN s and vector b “ rb1 ¨ ¨ ¨ bN sJ P RN . For any S Ă rN s, let WJS denote the submatrix of WJ with rows given by wJi for i P S. Define bS similarly for b. Note that this is equivalent to Ψpxq being the outputs of a single hidden neural network layer with N neurons fully connected to the input x P Rn, having nonlinear activations, and being initialized with weights and biases defining unique hyperplanes. Hence why we refer to the σi as activation functions. We then define positive semidefinite to mean that each activation i P rN s satisfies σipyq “ 0 for all y ď 0 and σipyq ą 0 for all y ą 0. In this chapter, we will focus on: σcSpyq “ $’&’% 0 if y ď 0 c if y ą 0 (scaled step) (3.8) σRpyq “ $’&’% 0 if y ď 0 y if y ą 0 (ReLU) , (3.9) where c ą 0 is an arbitrary positive scalar. 3.2.1 Positive Semidefinite Activation Geometry We define the following subsets for all activations i P rN s: X`i “ tx P Rn | wJi x` bi ą 0u (active) Xi˝ “ tx P Rn | wJi x` bi ď 0u (zero) . Each is a half-space of Rn formed by the hyperplane wJi x ` bi “ 0. These are then used to define the activation regions Aj , with indices corresponding to a binary string 48 indicating which active (binary 1) and zero (binary 0) half-spaces are in the intersection: A0 “ XN˝ X XN˝´1 X . . . X X2˝ X X1˝ A1 “ XN˝ X XN˝´1 X . . . X X2˝ X X`1 ... A2N´2 “ X`N X X`N´1 X . . . X X`2 X X1˝ A2N´1 “ X`N X X`N´1 X . . . X X`2 X X`1 . These partition Rn into, at most, 2N convex polytopes. It is likely that some of the regions will be infeasible (Aj “ H). For example, in the n “ 1 case, there are always only N ` 1 feasible activation regions, since there will be N unique (by assumption) points partitioning the R line. In higher dimensions, more feasible regions are possible. For all j P t0, . . . , 2N ´ 1u, we define the active set Sj as Sj “ ti P rN s | wJi x` bi ą 0 @x P Aju . This captures which of the N activations are active in a particular activation region. For example with N ě 3, if A4 feasible it contains only a nonempty subset of X`3 which in turn gives S4 “ t3u. For any two activation regions j, k P t0, . . . , 2N ´ 1u with a nonempty intersection Aj XAk ‰ H, their intersection defines a border between the two regions. We define a nondegenerate border to mean dimpAj XAkq “ n´ 1. In this case, only one activation is different (active to zero or vice-versa) between those regions. This is because the borders between activation regions must be (subsets of) the hyperplanes that define the half-spaces Xi, and the intersection of more than one unique hyperplane must have dimension less than n´ 1. Thus, a nondegenerate border must be a subset of (or equal to) a single hyperplane, meaning only one Xi half-space can flip from active to zero or vice-versa. 49 We then define a degenerate border to mean dimpAj X Akq ă n ´ 1. In this case, the border is (a subset of) the lower dimensional intersection of two or more unique hyperplanes. Thus, multiple activations are different between the regions. In Figure 3.1, we visualize an example withN “ 4 unique, 1-dimensional hyperplanes partitioning the R2 state space. Nondegenerate borders are 1-dimensional line segments, while degenerate borders are 0-dimensional points where the hyperplanes intersect. This same example will be used for the numerical simulation presented in Section 3.4. Now consider a continuous trajectory xt P Rn visiting a sequence of activation regions over the time window rτ, τ ` T s, for some τ ě 0 and T ą 0. Assume xt only crosses nondegenerate borders and that the number of regions visited is L ě 2. Let us denote A1, . . . ,AL as the activation region indices of the visited sequence and I1, . . . ,IL´1 as the sequence of activation indices corresponding to the hyperplanes crossed in order to visit that sequence of activation regions. For example, if the sequence of visited regions is A4,A12,A8 over the time window, then we have A1 “ 4,A2 “ 12,A3 “ 8 and I1 “ 4,I2 “ 3 as the respective index sequences with L “ 3. In this case, activation I1 “ 4 becomes active and then activation I2 “ 3 is turned off (becomes zero). For all s P rLs, we define time window subsets Ts Ă rτ, τ ` T s as Ts “ tt P rτ, τ ` T s | xt P AAsu . (3.10) Since we assume xt crosses only nondegenerate borders when visiting the activation regions during the time window, we have by definition that SAs`1 “ $’’&’’% SAs Ť tIsu if AAs`1 Ď X`Is SAs z tIsu if AAs`1 Ď XI˝s (3.11) holds over all s P rL ´ 1s. In words, this simply states that when xt crosses the nondegenerate border from region AAs to region AAs`1 in the sequence, it must be true that the active set for AAs`1 is the active set for AAs plus activation Is if it 50 becomes active, or minus activation Is if it turns off (becomes zero). Using the previous example, going from activation region A12 to region A8 requires crossing the hyperplane for activation I2 “ 3, which turns that activation off (becomes zero). Thus, it must be true that S8 “ S12zt3u. 3.3 Sufficient Conditions for Persistency of Excitation with (Scaled) Step and ReLU Activations In this section we present our main theoretical results, giving novel sufficient conditions for persistency of excitation with Ψ composed of the (scaled) step or ReLU activations. 3.3.1 Main Results First, we present the theorem for (scaled) step functions. Theorem 3.1. Let state trajectory xt be continuous and stay within some compact set C Ă Rn for all t ě 0, and let Ψpxtq “ rσcSpwJ1xt ` b1q ¨ ¨ ¨ σcSpwJNxt ` bN qsJ be composed of (scaled) step functions (3.8) with positive scalars c “ rc1 ¨ ¨ ¨ cN sJ together with N unique affine transformations of Rn according to w1, . . . , wN P Rnzt0nu and b1, . . . , bN P R. If xt over t ě 0 is such that there exists a window length T ˚ ą 0 whereby the sequence s P rLs of activation regions visited during any shift τ ě 0 of the time window rτ, τ ` T ˚s always satisfies the following two conditions: 1. all i P rN s hyperplanes wJi xt ` bi “ 0 are crossed 2. only nondegenerate borders are crossed, then Ψpxtq satisfies the persistency of excitation conditions (3.5) and (3.6). Proof. Given in Appendix B.1. Next, we present the theorem for ReLU functions. Here, WJAs denotes the |SAs | ˆn 51 submatrix of WJ with rows corresponding to the active set SAs , for each s P rLs of any particular time window rτ, τ ` T ˚s. Theorem 3.2. Let state trajectory xt be continuous and stay within some compact set C Ă Rn for all t ě 0, and let Ψpxtq “ rσRpwJ1 xt` b1q ¨ ¨ ¨ σRpwJNxt` bN qs be composed of ReLU functions (3.9) together with N unique affine transformations of Rn according to w1, . . . , wN P Rnzt0nu and b1, . . . , bN P R. If xt over t ě 0 is such that there exists a window length T ˚ ą 0 whereby the sequence s P rLs of activation regions visited during any shift τ ě 0 of the time window rτ, τ `T ˚s always satisfies the following three conditions: 1. all i P rN s hyperplanes wJi xt ` bi “ 0 are crossed 2. only nondegenerate borders are crossed 3. for each s P rLs, there are times t1, tˆ1, . . . , tM , tˆM P Ts, for any M ě n, such that the nˆM matrix “pxt1 ´ xtˆ1q ¨ ¨ ¨ pxtM ´ xtˆM q‰ satisfies rankpWJAsq “ rank ´ WJAs “pxt1 ´ xtˆ1q ¨ ¨ ¨ pxtM ´ xtˆM q‰¯ , then Ψpxtq satisfies the persistency of excitation conditions (3.5) and (3.6). Proof. Given in Appendix B.1. Note that a sufficient condition for property 3. is that “pxt1 ´ xtˆ1q ¨ ¨ ¨ pxtM ´ xtˆM q‰ has rank n. This can be achieved, for example, if xt is the state trajectory of a system that satisfies a suitable (local) controllability property and can therefore reach points within each activation region s P rLs that are not confined to some lower dimensional subset of Rn, during all shifts τ ě 0 of the time window rτ, τ ` T ˚s. It is possible that the sample paths of the adaptive stochastic process xt from Chap- ter 2 could be shown to satisfy this condition with probability 1, due to the spatial “jittering” (but still continuous) induced by the additive Brownian motion. 52 3.3.2 Proof Sketch Both proofs rely on the same overall contradiction method. That is, we assume there exists a nonzero vector v P RNzt0Nu such that vJ ˆż τ`T τ ΨpxtqΨpxtqJdt ˙ v “ 0 (3.12) holds for all shifts τ ě 0 and all window lengths T ą 0. We proceed by showing that if the state trajectory xt meets certain requirements over any shift τ ě 0 of the window rτ, τ ` T ˚s, for some window length T ˚ ą 0, then in fact (3.12) can only hold if v “ 0N . This is a contradiction and thus proves that the LHS integral must be a strictly positive definite matrix, uniformly over all windows rτ, τ ` T ˚s for all τ ě 0. 3.3.3 Extension to Other Activation Functions In Appendix B.1 we discuss how these methods could be applied similarly to other nonlinear, positive semidefinite activation functions σ, such that σpyq “ 0 for all y ď 0 and σpyq ą 0 for all y ą 0. 3.4 Application to Linear MRAC In this section, we provide a numerical simulation of the theoretical results presented in Section 3.3, using a linear MRAC application which is a variation on the setup from Chapter 9 of [41]. We will implement adaptive control in order to drive the plant state to track the state of a reference linear system, in spite of uncertainties in the plant’s parameterization of an additive nonlinear vector function. The plant and reference systems have n “ 2 states, allowing convenient visualization of the hyperplanes and state space. 53 3.4.1 Setup We largely follow the general linear MRAC setup introduced in [41], with some simpli- fications. The plant is given by 9xt “ Axt `B ` ut `ΘJΨpxtq ˘ , (3.13) where A is a known nˆ n state matrix for the plant state xt P Rn, B is a known nˆ ` input matrix for the input ut P R`, and Θ is an unknown N ˆ ` matrix which linearly parameterizes the known vector function Ψ : Rn Ñ RN defined by (3.7). We assume pA,Bq is controllable. The general setup also includes an unknown diagonal scaling matrix Λ, such that the overall input matrix is BΛ. We have omitted this for simplicity. The control input ut will be designed in order to force the plant states to track the states of a controllable, linear reference system given by 9xrt “ Ar xrt `Br rt , (3.14) where Ar is a known Hurwitz nˆn reference state matrix for the reference state xrt P Rn, Br is a known nˆ ` reference input matrix, and rt P R` is a bounded reference input. We thus assume there exists an n ˆ ` matrix of feedback gains Kx and an ` ˆ ` matrix of feedforward gains Kr satisfying the matching conditions A`BKJx “ Ar BKJr “ Br . (3.15) The general setup has A and Λ as unknown, and thus Kx and Kr need to be estimated. Here, we assume that Kx and Kr can be directly calculated from known A and B, and used directly in the control law. Remark 3.2. The first condition of (3.15) is feasible if some Hurwitz Ar can be reached from pA,Bq by state feedback. Towards this end, it is often practical to solve for an LQR feedback gain matrix KLQR and then check that Ar “ A`BKJLQR is Hurwitz. 54 Next, we introduce parameter estimates pΘt, which will be dynamically updated to estimate true parameter values Θ. Thus, by applying to the plant (3.24) the feedback control law ut “ KJx xt `KJr rt ´ pΘJtΨpxtq, the plant dynamics become 9xt “ Ar xt `Br rt ´B ppΘt ´ΘqJΨpxtq . (3.16) This in turn gives the dynamics of the state tracking error et “ xt ´ xrt as 9et “ 9xt ´ 9xrt “ Ar et ´B ppΘt ´ΘqJΨpxtq . (3.17) with xr0 set to x0, so that e0 “ 0n. As shown in Chapter 9 of [41], these state tracking error dynamics are globally uniformly asymptotically stable, such that limtÑ8 }et}2 “ 0, if the parameter estimates are dynamically updated as 9pΘt “ Γ Ψpxtq eJt PxB . (3.18) This is shown by using Barbalat’s lemma along with analyzing the Lyapunov function V “ eJt Pxet ` tr `ppΘt ´ΘqJΓ´1ppΘt ´Θq˘ (3.19) such that the update rule (3.18) results in 9V ď ´eJt Qxet ď 0 (3.20) for all values of et and pΘt´Θ. Here, Px is the unique symmetric, positive-definite nˆn matrix that solves the algebraic Lyapunov equation PxAr `AJr Px “ ´Qx (3.21) for some symmetric, positive-definite nˆ n matrix Qx, and adaptation rates Γ is some symmetric, positive-definite N ˆN matrix, where we denote G :“ }Γ}F . 55 Remark 3.3. In the case where Kx and Kr are also being estimated by Kˆx,t and Kˆr,t respectively, the true values can be appended to Θ, the estimates can be appended topΘt, and an overall vector function Φpxt, rtq which combines xt, rt, and Ψpxtq, can be formed, similar to (2.21). However, such Φ then do not strictly meet the definition of (3.7), where each component function is a positive, semidefinite activation, and are thus beyond the scope of the results in this chapter. 3.4.2 Persistency of Excitation The dynamic update rule (3.18) only guarantees asymptotic convergence of the state tracking error et to zero. We now show that the parameter estimation error, which we will denote as rΘt “ pΘt ´Θ, also goes to zero if Ψpxtq is persistently exciting. Since Ar is Hurwitz, it is guaranteed invertible. And so, from (3.17) we have eJt PxB “ 9eJt A´1r JPxB `ΨpxtqJrΘtBJA´1r JPxB and then combined with (3.18) we get 9rΘt “ 9pΘt “ Γ Ψpxtq eJt PxB “ Γ ΨpxtqΨpxtqJrΘtBJA´1r JPxB ` Γ Ψpxtq 9eJt A´1r JPxB “ ´12Γ ΨpxtqΨpxtqJrΘtBJA´1r JQxA´1r B ` Et , (3.22) where where the last inequality uses (3.21) and the invertibility of Ar to get that A´1r J Px “ ´12A´1r JQxA´1r . In order to apply the results of this chapter, we will assume that Ψpxtq is bounded with xt being contained in a compact set C Ă Rn. Also, from (3.20) we have that et and rΘt are bounded for all t ě 0. And so, from (3.16), (3.17), and (3.18) we have respectively that 9xt, 9et, and 9pΘt are bounded for all t ě 0. This in turn means that :et “ Ar 9et ´B 9pΘJt Ψpxtq ´BrΘJt ddxΨpxtq 9xt 56 is bounded for all t ě 0, since the Jacobian d dxΨpxtq “ »————– w1,1σ 1 1pwJ1xt ` b1q ¨ ¨ ¨ w1,nσ11pwJ1xt ` b1q ... . . . ... wN,1σ 1 N pwJNxt ` bN q ¨ ¨ ¨ wN,nσ1N pwJNxt ` bN q fiffiffiffiffifl is composed of scaled derivatives σ1pyq of the (scaled) step (3.8) and ReLU activations (3.9), which are globally bounded almost everywhere over y P R. This allows us to use Barbalat’s lemma again (with respect to 9et) to conclude that limtÑ8 } 9et}2 “ 0. And so, we can conclude that the Et term in (3.22) is bounded for all t ě 0 and that it asymptotically goes to limtÑ8 }Et}F “ 0. Let us now restrict to the case of ` “ 1. Since Qx is positive-definite we have the positive scalar c “ 12BJA´1r JQxA´1r B, and obtain from (3.22) that the dynamics of the parameter estimation error vector rΘt P RN asymptotically tend to the LTV system 9rΘt “ ´Γc ΨpxtqΨpxtqJrΘt , (3.23) where Γc “ cΓ is symmetric, positive-definite with }Γc}F “ cG. Therefore, if Ψpxtq is persistently exciting by condition (3.5), then the parameter estimation error dynamics (3.23) are globally uniformly asymptotically stable such that we have limtÑ8 }rΘt}2 “ 0. We show explicitly in Appendix B.2 how this follows from Lemmas 1 and 2 and Theorem 1 of [2]. 3.4.3 Numerical Simulation We perform a numerical simulation in the xt P R2 state space. Euler integration with a timestep of dt “ 0.001 seconds is used to obtain all numerical solutions to differential equations. 57 We use a basic controllable canonical form for the plant 9xt “ »– 0 1 ´a1 2a2 fifl»–xt,1 xt,2 fifl` »–0 β fifl`ut `ΘJΨpxtq˘ , (3.24) where a1, a2, β ą 0 are positive scalars, ut P R is the ` “ 1 dimensional input, and Θ is a fixed vector of nonzero scalars which linearly parameterizes the known nonlinear vector function Ψpxtq “ rσRpwJ1 xt` b1q ¨ ¨ ¨ σRpwJ4 xt` b4qsJ with ReLU activation functions, meeting the definition (3.7). This plant model is an n “ 2 example of a general class of single-input systems with dynamics characterized by an nth order nonlinear ordinary differential equation. See section 9.5 in [41] and section 4.1 in [35] for examples of physical systems in this form using various nonlinear functions. The reference system is a unity-gain damped harmonic oscillator with a bounded rt P R reference input 9xrt “ »– 0 1 ´ω20 ´2ξω0 fifl»–xrt,1 xrt,2 fifl` »– 0 ω20 fifl rt , (3.25) where ω0, ξ ą 0 are the natural frequency and damping ratio. This gives the plant and reference eigenvalues as λ “ a2 ˘ a a22 ´ a1 and λr “ p´ξ ˘ a ξ2 ´ 1 qω0, thus A is unstable with oscillations if a1 ą a22 and Ar is always Hurwitz, and without oscillations if ξ ě 1. Thus, we can always directly calculate the required feedback and feedforward gains that satisfy the matching conditions (3.15) as KJx “ „ ´ω 2 0 ´ a1 β ´ 2ξω0 ` 2a2 β  and KJr “ „ ω20 β  . For the plant system, we use a1 “ 2, a2 “ 0.5, β “ 0.75. This results in an unstable, oscillatory A matrix with λ “ 0.5 ˘ j?1.75 . The reference system uses ω0 “ 2 and ξ “ 1, ensuring that Ar is Hurwitz and nonoscillatory with λr “ ´2 . The matching conditions are thus satisfied with KJx “ r´2.667 ´ 6.667s and KJr “ r5.333s. For the 58 parameter estimate dynamic updates (3.18) we use Γ “ »———————– 5 0 0 0 0 1 0 0 0 0 5 0 0 0 0 2 fiffiffiffiffiffiffiffifl Qx “ »–1 0 0 10 fifl Px “ »–5.625 0.125 0.125 1.281 fifl , and for the linearly parameterized nonlinear vector function ΘJΨpxtq we use Θ “ »———————– ´1.2 2.7 0.8 ´3.2 fiffiffiffiffiffiffiffifl WJ “ »———————– 2 1 1 ´2 1.5 ´0.5 0.5 2 fiffiffiffiffiffiffiffifl b “ »———————– 1 2 2.5 3 fiffiffiffiffiffiffiffifl . In Figure 3.1 we plot points to highlight each activation region, partitioning the state space R2. Defining parameter estimates pΘt “ rθˆt,1 ¨ ¨ ¨ θˆt,4sJ and using the known Kx and Kr, we apply the feedback control law ut “ KJx xt `KJr rt ´ pΘJtΨpxtq to the plant and update the parameter estimates according to the update law (3.18). We use the following bounded reference inputs to drive the reference system as two different scenarios: r p1q t “ 10 sinp0.5tq rp2qt “ 40` 2ÿ k“1 10 sinp0.25k tq . The resulting plant and reference state trajectories xt and x r t are plotted in state space along with the hyperplanes WJx`b “ 04, in Figure 3.2 and Figure 3.3 respectively for the two scenarios. We see clearly that the limit cycles in both cases stay within a compact set C P R2, and so the ReLU activations within Ψ are bounded on this C. Note that the trajectories never maintain a linear path, which is sufficient to meet property (iii) of Theorem 3.2. In the first case, the state trajectories cross all N “ 4 hyperplanes at nondegenerate borders persistently. In the second case, the limit cycle only crosses two of the hyperplanes persistently. 59 The norm of the state tracking error }et}2 and the parameter estimation error }pΘt ´ Θ}2 for both cases are plotted over simulation time in Fig. 3.4. In both cases, we see the state tracking error converging to zero. For the case with r p1q t , we see that the parameter estimation error converges to zero as the parameter estimators converge to the true values, while for the case with r p2q t the error does not converge and is well above zero. −10.0 −7.5 −5.0 −2.5 0.0 2.5 5.0 7.5 10.0 xt, 1 −10.0 −7.5 −5.0 −2.5 0.0 2.5 5.0 7.5 10.0 x t, 2 0 00001 00012 00103 00114 01005 01016 01107 01118 10009 100110 101011 101112 110013 110114 111015 1111 X1 X2 X3 X4 Figure 3.1: State space x P R2 showing N “ 4 hyper planes WJx ` b “ 04. There are 11 feasible activation regions Aj (color dots) and 5 infeasible regions (no dots). 60 −10.0 −7.5 −5.0 −2.5 0.0 2.5 5.0 7.5 10.0 xt, 1 −6 −4 −2 0 2 4 6 x t, 2 Figure 3.2: Phase plot of xt (cyan), tracking into x r t (magenta) driven by r p1q t , and clearly crossing all N hyperplanes at nondegenerate borders along the limit cycle. 61 −10 0 10 20 30 40 50 60 xt, 1 −15 −10 −5 0 5 10 15 20 25 30 x t, 2 Figure 3.3: Phase plot of xt (cyan), tracking into x r t (magenta) driven by r p2q t , but does not cross all N hyperplanes along the limit cycle. 62 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 ||et||(1)2 ||et||(2)2 0 20 40 60 80 100 120 140 time t 0 1 2 3 4 5 6 7 ||Θ̂t−Θ||(1)2 ||Θ̂t−Θ||(2)2 Figure 3.4: Norm of state tracking error }et}2 converges to zero in both cases, while the parameter estimation error }pΘt ´ Θ}2 converges to zero for the case with rp1qt and but not for the case with r p2q t . Chapter 4 Function Approximation In this chapter, we consider the approximation of target functions f : Rn Ñ R that are continuous on a bounded set B Ă Rn containing the origin, which we denote as f P CpBq. We will generally use the term function approximation to refer to (the existence of) a finite linear combination of basis functions, such that some norm of the difference between a target function and its approximation, over x P B, is bounded. Formally, for a given target function f P CpBq, positive integer N P N, and positive scalar εN ą 0 (which may depend on N) there exists some vector θ P RN and nonlinear, continuous almost everywhere basis functions B1, . . . ,BN : Rn Ñ R such that the approximation pfpxq :“ Nÿ i“1 θiBipxq (4.1) satisfies ›››fpxq ´ pfpxq››› B ď εN . (4.2) Here, we denote the the L2pB, µ¯Bq norm }fpxq}B :“ ˆż B |fpxq|2 µ¯Bpdxq ˙ 1 2 with µ¯B as the uniform probability measure over B such that ş B µ¯Bpdxq “ 1. 63 64 In particular, the focus of this chapter will be on parameterized basis functions B : Rn Ñ R, with the parameters denoted as points γ in some parameter set Υ Ď Rp. In this case, we denote pfN as an approximation of the form (4.1) using the parameterized basis Bpx ;γ1q, . . . ,Bpx ;γN q for some fixed parameters γ1, . . . ,γN P Υ and the vector θ P RN as pfN pxq :“ Nÿ i“1 θiBpx ;γiq . (4.3) On the other hand, if we randomly sample the parameters γ1, . . . ,γR according to some probability density PΥ over Υ to form the basis Bpx ;γ1q, . . . ,Bpx ;γRq, for some integer R P N, we have the randomly parameterized approximation pFR. Here again, we mean that for some target function f P CpBq and positive scalar εR ą 0, which may depend on R, there exists (in high probability) some vector ϑ P RR such that pFRpx ;γ1, . . . ,γRq :“ Rÿ j“1 ϑj Bpx ;γjq (4.4) satisfies ›››fpxq ´ pFRpx ;γ1, . . . ,γRq›››B ď εR . We summarize the organization of the chapter’s sections, as follows: 1. Section 4.1 provides setup for function approximation with activation functions, 2. Section 4.2 discusses important existing results for base approximations which are convex combinations of step and ReLU activation functions, 3. Section 4.3 gives a novel result on mollified integral representations to base ap- proximations, resulting in mollified approximations, 4. Sections 4.4 and 4.5 provide a general setup for random approximations and a main result on their approximation error, 5. Sections 4.6 and 4.7 extend the results to the supremum norm and show how this can be applied to an extension of the linear MRAC setup. 65 4.1 Function Approximation with Activation Functions For the rest of the chapter, we will assume that the parameterized basis function B is a composition of some nonlinear, continuous almost everywhere function σ : R Ñ R together with an affine transformation of Rn into R according to the parameter γ P Υ. We define each point γ :“ rwJ bsJ “ rw1 . . . wn bsJ with w P Rnzt0nu as the weight vector and b P R as the bias term. We assume the parameter set Υ is a bounded subset of Rnzt0nu ˆ R. And so, we set Bpx ;γiq “ σpwJi x ` biq “ σpγJiXq, where X :“ rx1 ¨ ¨ ¨ xn 1sJ and we denote γi “ rwJi bisJ “ rwi,1 . . . wi,n bisJ, for all i P rN s. We assume each wJi x` bi “ 0 hyperplane in Rn is unique and has dimension n´1. The same holds with randomly sampled γj P Υ and the corresponding Bpx ;γjq “ σpγJjXq, for all j P rRs. Note that this is equivalent to a feedforward neural network with a single hidden layer of N (or R) neurons having respective outputs σpwJi x` biq “ σpγJiXq. Hence why we refer to σ as an activation function, as the output of each neuron is the activation of the weighted inputs (plus bias term) to the neuron. We will assume that the activation function σ has limited growth as follows. Assumption 4.1. The nonlinear, continuous almost everywhere activation function σ : RÑ R is bounded (growth) over all feasible inputs, such that either of 1. |σpuq ´ σpvq| ď Dσ 2. |σpuq ´ σpvq| ď Lσ |u´ v| hold for all u, v P I :“ tγJX | pγ, Xq P Υˆ B ˆ t1uu, with positive scalar Dσ or Lσ. If we shift the activation to the origin as qσpyq :“ σpyq ´ σp0q, this maintains the limited growth assumption in either case. The (scaled) step function σcS as in (3.8) and the ReLU function σR as in (3.9) are examples meeting this assumption, with Dσ “ c and Lσ “ 1 respectively. 66 4.1.1 Base and Random Approximations We will use the terms base approximation and random approximation to some target function f P CpBq to respectively refer to the pfN in (4.3) and the pFR in (4.4) where the parameterized basis function is Bpx ;γq “ σpγJXq and uses some activation function σ satisfying Assumption 4.1. And so, the base approximation is defined for some vector θ P RN and fixed param- eters γ1, . . . ,γN P Υ, for any integer N P N, as pfN pxq “ Nÿ i“1 θi σpγJiXq , (4.5) and the random approximation is defined for some vector ϑ P RR and randomly sampled parameters γ1, . . . ,γR P Υ, for any integer R P N, as pFRpx ;γ1, . . . ,γRq “ Rÿ j“1 ϑj σpγJjXq . (4.6) Note that we use integer R and index j for random approximations to distinguish them from base approximations, but otherwise there is no significance. 4.2 Existing Results for Base Approximations In this section, we discuss some relevant existing results for base approximations in the L2 norm when using step and ReLU activations. 4.2.1 Convex Combinations of Step and ReLU Activations Classic results from [7] and [12] prove the existence of convex linear combinations of step and ReLU activations that approximate functions in specific classes arbitrarily well, with an approximation error proportional to 1? N . Importantly, the convexity property allows the length (number of terms) N to be arbitrarily large while maintaining the 67 same overall bound on the coefficient size řN i“1 |θi|. As we will show in later sections, our results do not depend on the length N of the base approximation, but rather only on the coefficient size bound. First, we define these function classes. Note that B Ď B0pρq, an origin-centered ball of sufficiently large radius ρ, and let us define |ξ|B :“ supxPB ˇˇ ξJx ˇˇ ď ρ }ξ}2 for any ξ P Rn. Let there be a function g P CpBq, a complex measure rGpdξq “ eiφpξqGpdξq (for some φ : Rn Ñ R and variation G “ | rG|), and a constant C ą 0 such that: 1. qgpxq “ gpxq ´ gp0nq “ ż Rn ` eiξ Jx ´ 1˘ rGpdξq (4.7) 2. Cp1qg :“ ż Rn |ξ|B Gpdξq ď ż Rn ρ }ξ}2 Gpdξq ď C (4.8) hold for all x P Rn. Then, such a function g belongs to the class F p1qB,C . Similarly, if the function g P CpBq is differentiable at the origin and there is a complex measure rGpdξq and constant C ą 0 satisfying: 1. qgpxq “ gpxq ´∇gp0nqJx´ gp0nq “ ż Rn ` eiξ Jx ´ iξJx´ 1˘ rGpdξq (4.9) 2. Cp2qg :“ ż Rn |ξ|2B Gpdξq ď ż Rn ρ2 }ξ}22 Gpdξq ď C (4.10) for all x P Rn, then g belongs to the class F p2qB,C . Furthermore, any function f P CpBq that is equal to g on B (equivalently, qf is equal to qg on B) is also in the respective function class. This means, if we are interested in a particular f , we can define g “ f on B and then outside of B we may define g in any manner that helps to satisfy the above requirements. An example of a complex measure meeting these requirements is rGpdξq “ rgpξqdξ, where rgpξq is the Fourier transform of the function g, such that the variation Gpdξq “ |rgpξq|dξ is able to meet the second requirement for some C ą 0. In Appendix C.1, we show for both class definitions that the second conditions establish the integrability of the respective first conditions. 68 We now state the existing results for step and ReLU activation functions. Theorem 4.1 (Barron[7] Theorem 1). For any constant C ą 0 and bounded set B Ă Rn containing the origin, there exists a base approximation pfN of the form (4.5) using the step activation function σS with parameters satisfying |w|B “ 1 and |b| ď 1, for any integer N ě 1, to every function f P F p1qB,C such that the approximation errorrfN pxq “ qfpxq ´ pfN pxq to the affine shifted qfpxq “ fpxq ´ fp0nq satisfies the L2pB, µ¯Bq norm bound ››› rfN›››B “ dż B ˇˇˇ rfN pxqˇˇˇ2 µ¯Bpdxqq ď 2C? N (4.11) with coefficient size bounded as řN i“1 |θi| ď 2C. Proof. The proof is given in Appendix C.2. Theorem 4.2 (Breiman[12] Theorem 3). For any constant C ą 0 and bounded set B Ă Rn containing the origin, there exists a base approximation pfN of the form (4.5) using the ReLU activation function σR with parameters satisfying |w|B “ 1 and |b| ď 1, for any integer N ě 1, to every function f P F p2qB,C such that the approximation errorrfN pxq “ qfpxq´ pfN pxq to the affine shifted qfpxq “ fpxq´∇fp0nqJx´ fp0nq satisfies the L2pB, µ¯Bq norm bound ››› rfN›››B “ dż B ˇˇˇ rfN pxqˇˇˇ2 µ¯Bpdxqq ď b 16 3 C? N (4.12) with coefficient size bounded as řN i“1 |θi| ď 2C. Proof. The proof is given in Appendix C.3. Remark 4.1. The definitions of the function class are with respect to some constant C which then shows up directly in the results for this section. Those results do not directly depend on the input dimension n of the functions. This may lend an appearance of avoiding the curse of dimensionality. However, we can see from the definitions of C p1q g 69 and C p2q g that both are n-dimensional integrals, and in general nothing is assumed about the complex measure rG which allows them to avoid this dimensionality. Thus, for any fixed C the size of the function class will decrease as the dimension increases. 4.3 Mollified Approximations In this section, we define the mollified approximation pfλ,N using a novel method of forming a mollified integral representation for a base approximation pfN to a target function f P CpBq, and bound their difference in the L2pB, µ¯Bq norm. 4.3.1 Functions with Integral Representation If a function f : Rn Ñ R has an integral representation over the parameters set Υ with some activation function σ : RÑ R, then fpxq “ ż Υ gpγqσpγJXqdγ (4.13) holds with some bounded, continuous coefficient function g : Υ Ñ R. This is a version of (4.5) that is continuous in the parameters γi over Υ. Indeed, if we define γ1, . . . ,γN to be a sequence of points that uniformly grids Υ with volume dN between any neighboring points, then as N Ñ8 (and dN Ñ 0) we have lim NÑ8 Nÿ i“1 gpγiqσpγJiXqdN “ ż Υ gpγqσpγJXqdγ “ fpxq . This implies that the (infinite) sum of the absolute values of the coefficients is finite lim NÑ8 Nÿ i“1 |gpγiq|dN “ ż Υ |gpγq|dγ ă 8 since g is bounded and the integral is over the bounded set Υ . However, finding such integral representations (4.13) for general classes of functions f and activation functions σ is a nontrivial task. If both are L1 (absolutely integrable over 70 Rn and R respectively), then Theorem 1 in [29] shows that the integral representation can always be constructed with g related to the Fourier transforms of f and σ. But this is a limiting restriction, since none of the typical activation functions like sigmoids, steps, ReLU’s, and their variants meet this requirement. Even so, in [22] this was used to prove the universal approximation properties of sigmoid activations. 4.3.2 Mollified Integral Representation In this section, we will propose a novel method of forming an integral representation for any function f P CpBq, so long as a base approximation pfN of the form (4.5) exists and the bounded parameter set Υ Ă Rn`1 and a bound SN on the coefficient size řNi“1 |θi| are known. For any fixed parameters γ1, . . . ,γN P Υ, this approximation pfN can be defined as pfN pxq :“ Nÿ i“1 θi σpγJiXq “ ż Υ Nÿ i“1 θi δ n`1pγ´ γiq σpγJXq dγ , (4.14) where δn`1 is the pn ` 1q-dimensional Dirac delta which satisfies for any γi P Υ thatş Υ δ n`1pγ´ γiqσpγJXq dγ “ σpγJiXq . Thus, the RHS of (4.14) is like an integral representation (4.13) of pfN with a coef- ficient mapping gpγq “ řNi“1 θi δn`1pγ´ γiq. However, this mapping is not continuous and bounded as required, and the Dirac delta is not even a function in the regular sense. Indeed, if the overall coefficients of a random approximation pFR were set as ϑj “ gpγjq for each randomly sampled γ1, . . . ,γR P Υ, then in this case they almost surely would all be zero. We propose the Mollified integral representation to solve this issue, where the Dirac delta in (4.14) is replaced with the mollified delta function δˆn`1λ . Recalling that γ “ 71 rw1 ¨ ¨ ¨ wn b s, this is defined as the pn` 1q-dimensional bump function δˆn`1λ pγq :“ $’’’’’’&’’’’’’% pηλqn`1 exp ˆ ´1 1´ λ2w21 ˙ ¨ ¨ ¨ exp ˆ ´1 1´ λ2w2n ˙ exp ˆ ´1 1´ λ2b2 ˙ γ P p´ 1λ , 1λqn`1 0 otherwise (4.15) with η´1 “ ş1´1 exp´ ´11´y2¯dy « 0.444 giving ş δˆn`1λ pγ´ γiq dγ “ ş δn`1pγ´ γiq dγ “ 1 for any n ě 1 and γi P Rn`1. We term the positive scalar λ ą 0 as the mollification factor. This controls the height and width of the mollified delta δˆn`1λ . Since the height scales as λn`1 and the integral remains constant, this requires the width to shrink. Figure 4.1 plots the 1-dimensional case for different λ values. −1.00 −0.75 −0.50 −0.25 0.00 0.25 0.50 0.75 1.00 γ 0.0 0.5 1.0 1.5 2.0 2.5 3.0 λ = 1 λ = 2 λ = 4 Figure 4.1: The mollified delta δˆλ for different λ values. In each case, ş1{λ ´1{λ δˆλpγqdγ “ 1. The support of δˆn`1λ pγ ´ γiq is the pn ` 1q-dimensional open cube set p´ 1λ , 1λqn`1 centered about any γi P Υ. And so, if γi is sufficiently close to (or on) the boundary of Υ, then it no longer holds that the entire support of δˆn`1λ pγ´γiq is within Υ. Thus, we define the λ-expanded parameters set Υλ such that the support is always fully contained, 72 up to and include the boundary of Υ. For example, if Υ is rectangular product of intervals rα1, β1sˆ¨ ¨ ¨ˆrαn, βns, then each interval is expanded as rαi´ 1λ , βi` 1λ s to obtain Υλ. With this λ-expanded parameters set, we always maintain that ş Υλ δˆn`1λ pγ´γiqdγ “ 1 for all λ ą 0, n ě 1, and any γi P Υ. Since Υ Ă Υλ for all λ ą 0, then (4.14) is equivalently pfN pxq “ ż Υλ Nÿ i“1 θi δ n`1pγ´ γiq σpγJXq dγ . (4.16) And so, by replacing the Dirac delta in (4.16) with the mollified delta δˆn`1λ , for any choice of λ ą 0, we obtain the corresponding mollified approximation pfλ,N pxq :“ ż Υλ Nÿ i“1 θi δˆ n`1 λ pγ´ γiq σpγJXq dγ (4.17) using a mollified integral representation. This now gives the required continuous, bounded coefficient mapping for any fixed γ1, . . . ,γN P Υ as gλpγq :“ Nÿ i“1 θi δˆ n`1 λ pγ´ γiq . (4.18) We illustrate the conceptual transformation for the 1-dimensional case in Figure 4.2. Figure 4.2: The base approximation pfN is transformed into the mollified approximationpfλ,N by replacing Dirac deltas δ with the mollified deltas δˆλ. 73 4.3.3 Main Result Let us define Epxq :“ } rx1 ¨ ¨ ¨ xn 1s }2 “ }X}2 as the Euclidean norm mapping from Rn ˆ t1u to R, and then define EB :“ }Epxq}B (4.19) as the L2pB, µ¯Bq norm of this mapping over the Euclidean distances of x P B. We can now state the main result, which bounds the difference between a base approximation and its corresponding mollified approximation, for a chosen mollifcation factor λ, in the L2pB, µ¯Bq norm. Theorem 4.3. Let there exist a base approximation pfN of the form (4.5) using any activation function σ satisfying Assumption 4.1, for any N P N, with some unknown parameters γ1, . . . ,γN P Υ and coefficients θ P RN , but having a known bound on the coefficient size of řN i“1 |θi| ď SN and the parameters set Υ known. Then, the mollified approximation pfλ,N over the λ-expanded parameters set Υλ in (4.17) satisfies either of: iq ››› pfN ´ pfλ,N›››B ď SN Dσ (4.20) iiq ››› pfN ´ pfλ,N›››B ď SN Lσ ? n` 1 λ EB . (4.21) Proof. The proof is given in Appendix C.4. Remark 4.2. The results of Theorem 4.3 show that if the activation function σ is Lipschitz over its feasible inputs, then as λÑ8 we have pfλ,N pxq Ñ pfN pxq in the sense of the L2pB, µ¯Bq norm. 74 4.4 Random Approximations to Mollified Approximations In this section, we focus on random approximations pFR of the form (4.6) and we prove that they exist (in high probability) such that››› pfλ,N pxq ´ pFRpx ;γ1, . . . ,γRq›››B ď εR (4.22) holds for a positive scalar εR ą 0 that is proportional to 1?R . Here, pfλ,N pxq is a mollified approximation as described in Section 4.3.2. 4.4.1 Expected Value of Random Approximations to Functions with Integral Representations An important property of functions that have integral representations with some σ is that random approximations pFR of the form (4.6) exist which are exactly equal to the function in expected value over the randomly drawn parameters. And so, the following lemma builds upon the mollified integral representation of the mollified approximation pfλ,N to show that there always exists random approximations with randomly sampled parameters from the λ-expanded set Υλ satisfying this property. Lemma 4.1. Let the target function f P CpBq satisfy Theorem 4.3 and thus have the mollified approximation pfλ,N as in (4.13) with activation σ for the selected mollification factor λ ą 0, which has the integral representation (4.16) over the λ-expanded param- eters set Υλ. If the random parameters γ1, . . . ,γR are sampled iid according to any density Pλ which is nonzero over Υλ, then there exists a random approximation pFR of the form (4.6) using the activation σ such that E γj„Pλ ” pFRpx ;γ1, . . . ,γRqı “ pfλ,N pxq . (4.23) Proof. The proof is given in Appendix C.5. 75 4.4.2 Setting Up McDiarmid’s Inequality For the remainder of the section, we closely follow the strategy in Lemma 4 of [67], which uses McDiarmid’s inequality to obtain the final result. We start by defining function hB : ΥRλ Ñ R, which captures the norm difference between the random approximation and its expected value over the iid draws of pγ1, . . . ,γRq P ΥRλ , as hBpγ1, . . . ,γRq :“ ›››› pFRpx ;γ1, . . . ,γRq ´ Eγj„Pλ ” pFRpx ;γ1, . . . ,γRqı›››› B . (4.24) These are precisely the kinds of functions which McDiarmid’s inequality bounds (see Appendix C.1 for a basic definition of the inequality). In order to use this inequality, we must prove two properties of this function: 1) the absolute value of the difference in h when any one component γj is different, is bounded, and 2) the expected value of h itself (over the random parameters) is bounded. Let us first make the following definitions. With δˆn`1λ as in (4.15) for any λ ą 0, we define gλpγq “ Nÿ i“1 θi δˆ n`1 λ pγ´ γiq ùñ maxγPΥλ |gλpγq| ď Nÿ i“1 |θi| ˆ ηλ e ˙n`1 “: gλmax . (4.25) For the corresponding Υλ, we define the diameter DΥλ :“ supt}γ ´ φ}2 | γ, φ P Υλu and largest point γλmax :“ supt}γ}2 | γ P Υλu, and for the selected Pλ we define the minimum Pλmin :“ inftPλpγq | γ P Υλu. We prove the two required properties with the following lemma. Lemma 4.2. Let hB be defined as in (4.24) for a random approximation pFR of the form (4.6) which satisfies the result of Lemma 4.1 to some mollified approximation pfλ,N from Theorem 4.3, both using activation function σ which satisfies Assumption 4.1. Let the random parameters γ1, . . . ,γR be sampled iid according to any nonzero density Pλ over the λ-expanded set Υλ, which has diameter DΥλ and largest point γλmax, and let EB be as in (4.19). Then for all pγ1, . . . ,γj , . . . ,γRq, pγ1, . . . , rγj , . . . ,γRq P ΥRλ differing only 76 at coordinate j, either of the following hold for all j P rRs: ˇˇ hBpγ1, . . . ,γj , . . . ,γRq ´ hBpγ1, . . . , rγj , . . . ,γRqˇˇ ď iq Dσ gλmax PλminR ˆ 3` 2 |σp0q| Dσ ˙ (4.26) iiq Lσ gλmax PλminR ˆ pDΥλ ` 2γλmaxqEB ` 2 |σp0q|Lσ ˙ . (4.27) And the expected value of hB satisfies either of: E γj„Pλ ” hBpγ1, . . . ,γRq ı ď iq Dσ gλmax Pλmin ? R ˆ 1` |σp0q| Dσ ˙ (4.28) iiq Lσ gλmax Pλmin ? R ˆ γλmaxEB ` |σp0q|L ˙ . (4.29) Proof. The proof is given in Appendix C.6. 4.4.3 Using McDiarmid’s Inequality With the two required properties of hB proven, we can now use McDiarmid’s inequality to to achieve the bound (4.22), given by the following theorem. Theorem 4.4. Let hB be defined as in (4.24) and satisfy the results of Lemma 4.2, for some mollified approximation pfλ,N from Theorem 4.3, using activation function σ which satisfies Assumption 4.1. Let the random parameters γ1, . . . ,γR be sampled iid according to any nonzero density Pλ over the λ-expanded set Υλ, which has diameter DΥλ and largest point γλmax, and let EB be as in (4.19). Then for any ν P p0, 1q, with probability greater than 1 ´ ν there exists a random approximation pFRpx ;γ1, . . . ,γRq of the form (4.6) with |ϑj | ď gλmaxPλminR for all j P rRs satisfying the result of Lemma 4.1, 77 such that either of the following hold: } pfλ,N pxq ´ pFRpx ;γ1, . . . ,γRq} ď iq Dσ gλmax Pλmin ? R KDσpνq iiq Lσ gλmax Pλmin ? R KLσpνq . The coefficients KDσ and KLσ are defined in terms of the selected ν as: KDσpνq :“ 1` |σp0q|Dσ ` ˆ 3` 2 |σp0q| Dσ ˙d 1 2 log ˆ 1 ν ˙ KLσpνq :“ γλmaxEB ` |σp0q|Lσ ` ˆ pDΥλ ` 2γλmaxqEB ` 2 |σp0q|Lσ ˙d 1 2 log ˆ 1 ν ˙ . Proof. The proof is given in Appendix C.7. Remark 4.3. The results of Theorem 4.4 show that for any fixed mollification factor λ ą 0 and ν P p0, 1q, with probability greater than 1´ ν there exists an pFR with bounded coefficients such that as RÑ8 then pFRpx ;γ1, . . . ,γRq Ñ pfλ,N pxq for all x P B. 4.5 Random Approximations to Base Approximations We now state and prove an overall result which ties together the results of the previous sections to provide an overall bound on the approximation error in L2pB, µ¯Bq norm, between a base approximation pfN of the form (4.5) to a target function f P CpBq and a random approximation pFRpx ;γ1, . . . ,γRq of the form (4.6). Theorem 4.5. Let there exist a base approximation pfN of the form (4.5), for some integer N P N, to a target function f P CpBq, with unknown parameters γ1, . . . ,γN in a known bounded parameter set Υ Ă Rn`1, unknown coefficients θ P RN with a known bound on the coefficient size řN i“1 |θi| ď SN , and using an activation function σ satisfying Assumption 4.1. Assume that ››› qfpxq ´ pfN pxq›››B ď εN holds for some εN ą 0 78 which may depend on N , for an affine shift qfpxq “ f´aJx´bfp0q with some a P Rn and b P R. The mollified approximation pfλ,N as in (4.17) satisfies the results of Theorem 4.3 for any mollification factor λ ą 0. Let the random parameters γ1, . . . ,γR be sampled iid according to any nonzero density Pλ over the λ-expanded set Υλ, which has diameter DΥλ and farthest point γλmax, and let EB be as in (4.19). Then for any ν P p0, 1q, with probability greater than 1´ν there exists a random approximation pFRpx ;γ1, . . . ,γRq as in (4.6) with |ϑj | ď gλmaxPλminR for all j P rRs that satisfies Theorem 4.4 such that either of the following hold:››› qfpxq ´ pFRpx ;γ1, . . . ,γRq›››B ď iq εN ` SN Dσ ˆ 1 ` λ n`1 ? R rKDσpνq˙ iiq εN ` SNLσ ˆ? n` 1 λ EB ` λ n`1 ? R rKLσpνq˙ . The coefficients rKLσ are defined in terms of the selected ν as: rKDσpνq :“ ηn`1en`1Pλmin KDσpνq and rKLσpνq :“ η n`1 en`1Pλmin KLσpνq with KDσ and KLσ as defined in Theorem 4.4. Proof. The proof is given in Appendix C.8. Remark 4.4. The results of Theorem 4.5 show that the base approximations discussed in Section 4.2 are particularly useful in this context, because they are convex combinations of activation functions. This means we can allow the number of terms to grow arbitrarily large N Ñ 8, since N does not appear directly in the results. This results in εN Ñ 0, while maintaining the same bound SN for any N . 79 4.6 Extending the Error Bound to L8 for Lipschitz Func- tions on Convex Compact Sets In this section, we show how the L2 norm bounds of the previous sections can be extended to L8 (supremum) norm bounds when the target function f P CpKq is dif- ferentiable at the origin and Lipschitz on a convex compact set K Ă Rn including the origin, and the base pfN or random pFR approximation to this function uses a Lipschitz activation function σ. We assume that K is full dimension, with nonzero Lebesgue mea- sure (volume) V “ µpKq ą 0 and diameter D ą 0. It also must hold that K Ď B0pρq for a sufficiently large radius 0 ă ρ ď D. We then have for any such approximation pf “ řNi“1 θi σpγJiXq, thatˇˇˇˇ ˇ Nÿ i“1 θi σpγJiXq ´ Nÿ i“1 θi σpγJi Y q ˇˇˇˇ ˇ ď Nÿ i“1 ˇˇ θi σpγJiXq ´ θi σpγJi Y q ˇˇ “ Nÿ i“1 |θi| ˇˇ σpγJiXq ´ σpγJi Y q ˇˇ ď Nÿ i“1 |θi|Lσ ˇˇ γJiX ´ γJi Y ˇˇ “ Nÿ i“1 |θi|Lσ ˇˇ wJi px´ yq ˇˇ ď Nÿ i“1 |θi|}wi}2 }x´ y}2 holds for all x, y P Rn, recalling that X :“ rx1 ¨ ¨ ¨ xn 1sJ and Y :“ ry1 ¨ ¨ ¨ yn 1sJ, since |σpuq ´ σpvq| ď Lσ |u´ v| for any u, v P R by the Lipschitz condition of Assumption 4.1. Thus, such approximations are pL-Lipschitz with constant pL “ Lσ Nÿ i“1 |θi|}wi}2 . (4.30) Let the affine shift to function f be qfpxq :“ fpxq ´∇fp0nqJx´ fp0nq . (4.31) 80 Thus, if f is L-Lipschitz, thenˇˇˇ qfpxq ´ qfpyqˇˇˇ “ ˇˇfpxq ´∇fp0nqJx´ fp0nq ´ fpyq `∇fp0nqJy ` fp0nqˇˇ “ ˇˇfpxq ´ fpyq ´∇fp0nqJpx´ yqˇˇ ď |fpxq ´ fpyq| ` ˇˇ∇fp0nqJpx´ yqˇˇ ď L }x´ y}2 ` }∇fp0nq}2 }x´ y}2 holds for all x, y P Rn. And so, qf is also Lipschitz, with constant qL “ L` }∇fp0nq}2. Lemma 4.3. Let f : Rn Ñ R be L-Lipschitz on a convex compact set K Ă Rn containing the origin and differentiable at the origin. Let there also be an approximation pf : Rn Ñ R which is pL-Lipschitz on K and such that the approximation error rfpxq “ qfpxq ´ pfpxq, with qf as in (4.31), satisfies the L2pK, µ¯Kq norm bound››› rf ››› K “ dż K ˇˇˇ rfpxqˇˇˇ2 µ¯Kpdxq ď εN where the bound εN ą 0 is a finite constant and µ¯Kpdxq is the uniform probability measure on K such that şK µ¯Kpdxq “ 1. Assume K is full dimension with diameter D ą 0, and so it holds that K Ď B0pρq for a sufficiently large radius 0 ă ρ ď D. Then, the approximation error also satisfies the L8pKq norm bound sup xPK ˇˇˇ rfpxqˇˇˇ ď 2 ˆ rDpr `?r2 `D2q ˙´n n`2 ´´ L` }∇fp0nq}2 ` pL¯n ε 2N¯ 1n`2 (4.32) such that r is the largest radius ball within K centered at its centroid. Proof. The proof is given in Appendix C.9. 4.7 Application to Linear MRAC In Chapter 12 of [41] an extension to the general linear MRAC setup is introduced which allows for nonlinearities f : Rn Ñ R` as fpxq “ ΘJΨpxq ` f pxq such that the plant is 81 given by 9xt “ Axt `B ` ut ` fpxqq “ Axt `B ` ut `ΘJΨpxtq ` f pxq ˘ , (4.33) where A is a known n ˆ n state matrix for the plant state xt P Rn, B is a known n ˆ ` input matrix for the input ut P R`, and Θ is an unknown N ˆ ` matrix which linearly parameterizes the known vector function Ψ : Rn Ñ RN . We assume pA,Bq is controllable. The general setup also includes an unknown diagonal scaling matrix Λ, such that the overall input matrix is BΛ, and assumes that A is unknown. For simplicity, we assume A is known and omit Λ. Again it is assumed that there exists an n ˆ ` matrix of feedback gains Kx and an `ˆ ` matrix of feedforward gains Kr satisfying the matching conditions A`BKJx “ Ar BKJr “ Br to a controllable, linear reference model 9xrt “ Ar xrt `Br rt , where Ar is a known Hurwitz nˆn reference state matrix for the reference state xrt P Rn, Br is a known n ˆ ` reference input matrix, and rt P R` is a bounded reference input. Here again, we assume that Kx and Kr can be directly calculated from known A and B, and used directly in the control law. It is required that the nonlinearity satisfy the bound }f pxq}2 ď ε¯ (4.34) for some constant ε¯ ą 0 and for all x P B0prq with some radius r ą 0. Then, the adaptive control law ut “ KJx xt ´ pΘJtΨpxtq ` p1´ µpxqqKJr rt ` µpxqrupxtq 82 is shown to stabilize the state tracking error et “ xt ´ xrt down to a compact set about the origin, where the N ˆ ` matrix of parameter estimates pΘt is dynamically updated with the usual law 9pΘt “ Γ Ψpxtq eJt PxB (see Section 3.4 for further details). The scalar function µ : Rn Ñ r0, 1s transitions the control law from tracking the reference input to simply returning the plant state xt to the bounded region B0prq within which (4.34) is valid, using an appropriately definedrupxtq based on assumptions about the growth of f pxq outside of B0prq. (See Chapter 12 of [41] for details.) And so, if the vector function Ψpxq is constructed as in (3.7) with an activation function σ satisfying the Lipschitz condition of Assumption 4.1 with constant Lσ, then the above setup is valid for any nonlinearity f “ rf1 ¨ ¨ ¨ f`s where we can show that sup xPK ˇˇ fipxq ´ΘJi Ψpxq ˇˇ ď ε¯i holds for some constants ε¯1, . . . , ε¯` ą 0. Here, f1, . . . , f` : Rn Ñ R and each ΘJi P RN is the corresponding row of the true parameters ΘJ. 4.7.1 Example on Euclidean Ball Here we will show how, for example, the results of Theorem 4.2 can be used for the above pΘJt,iΨpxq approximations, which are equivalent to pfN of the form (4.5) when Ψpxq is defined as in (3.7) with σ1, . . . , σN “ σR ReLU activation functions. These satisfy the Lipschitz condition of Assumption 4.1 with Lσ “ 1. Then from (4.30), the approximations are Lipschitz with constant pL “ Nÿ k“1 |pΘJi qk|}wk}2 for all i P r`s. 83 Let K be a Euclidean ball of radius ρ centered at the origin (thus giving the diameter as D “ 2ρ). In this case, we have r “ ρ as the largest radius ball centered at the centroid, which is the origin, contained in K. And so, from Lemma 4.3 we getˆ r Dpr `?r2 `D2q ˙´n n`2 “ ˜ ρ 2ρpρ`aρ2 ` p2ρq2q ¸´n n`2 “ ˆ ρ´1 2p1`?5q ˙´n n`2 . Furthermore, for all w P Rn we have |w|K :“ sup xPK ˇˇ wJx ˇˇ “ ρ }w}2 , since this is always maximized by aligning x in the same direction as w and then supxPK }x}2 “ ρ, and so (4.10) is now equivalently Cg “ ż Rn ρ2 }w}22 Gpdwq ď C . (4.35) And so, if there is a function g : Rn Ñ R with a Fourier transform rg such that the variation Gpdwq “ |rgpwq|dw satisfies C}w} :“ ż Rn }w}22 Gpdwq ă 8 , then we can set C “ Cg “ ρ2C}w}. An example of such a function is gpxq “ nź i“1 βi sincpαixiq for any α1, βi P R, where sincpuq :“ sinpuqu for u P R is the unnormalized sinc function. It has a Fourier transform of rgpwq “ ż Rn nź i“1 sincpαixiq e´iwJxdx “ ż 8 ´8 ¨ ¨ ¨ ż 8 ´8 β1sincpα1x1q ¨ ¨ ¨βnsincpαnxnq e´iw1x1 ¨ ¨ ¨ e´iwnxndx1 ¨ ¨ ¨ dxn “ β1 ż 8 ´8 sincpα1x1q e´iw1x1dx1 ¨ ¨ ¨ βn ż 8 ´8 sincpαnxnq e´iwnxndxn “ nź i“1 βi pi |αi|Π ˆ wi 2αi ˙ “ $’’&’’% śn i“1 βi pi|αi| w1 P r´|α1|, |α1|s, . . . , wn P r´|αn|, |αn|s 0 o.w. , 84 which is real. Thus, with Gpdwq “ |rgpwq|dw we have C}w} “ ż Rn }w}22 Gpdwq “ ż Rn pw21 ` ¨ ¨ ¨ ` w2nq |rgpwq|dw “ pin nź i“1 |βi| |αi| ż |α1| ´|α1| ¨ ¨ ¨ ż |αn| ´|αn| pw21 ` ¨ ¨ ¨ ` w2nq dw1 ¨ ¨ ¨ dwn “ pin nź i“1 |βi| |αi| ¨˝ nÿ j“1 2 |αj |3 3 ź k‰j 2|αk|‚˛ . Using sinc1puq “ $’’&’’% cospuq´sincpuq u u ‰ 0 0 u “ 0 for u P R and the chain rule BgBxi “ BgBui duidxi with ui “ αixi, the gradient of g is then ∇gpxq “ „ β1α1 cospα1x1q ´ sincpα1x1q α1x1 ¨ ¨ ¨ βnαn cospαnxnq ´ sincpαnxnq αnxn J ∇gp0nq “ r0 ¨ ¨ ¨ 0sJ . And since |sinc1puq| ď 12 holds for all u P R, and g is continuously differentiable, we can use the Lipschitz constant }∇gpxq}2 ď ››››„β1α12 ¨ ¨ ¨ βnαn2 ›››› 2 “: L . In Theorem 4.2 it is specified that pfN exists with parameters satisfying |wi|K “ ρ }w}2 “ 1 for all i P rN s, and so we can bound its Lipschitz constant as pL “ Nÿ k“1 |pΘJi qk| }wk}2 ď 2C ρ “ 2ρC}w} . Thus, for any fpxq that is equal to this gpxq on x P B0pρq for any ρ ą 0, the approxi- mation error bound in Theorem 4.2 can be stated as sup xPK ˇˇˇ rfpxqˇˇˇ ď 2ˆ ρ´1 2p1`?5q ˙´n n`2 ˜ˆ››››„β1α12 ¨ ¨ ¨ βnαn2 ›››› 2 ` 2ρC}w} n˙ 64 3 ρ 2C2}w} N ¸ 1 n`2 . Chapter 5 Conclusion and Future Directions We have studied the learning components of the linear MRAC setup in [41], from both a stochastic and deterministic viewpoint. We extended the framework in [18, 20] for ex- ponential stochastic contraction in specialized Wasserstein distances, while also showing that a flexible convex projection operation can be achieved through the use of reflection processes. We develop novel sufficient conditions for persistency of excitation for vec- tor functions of continuous, bounded state trajectories with step and ReLU activation functions. And we developed a novel method for bridging approximations using activa- tion functions with unknown parameters to approximations using randomly initialized parameters, with the mollified integral representation. We showed that this could be extended to the supremum norm for use with an extended, approximate version of the linear MRAC setup. Two immediate extensions of the work on stochastic contraction would be to: 1) allow arbitrary bounded reference trajectories other than the zero vector (ie. adaptive regulation of the plant state to the origin) to also achieve contraction, to a distribution which is not strictly stationary (the mean would be time-varying, at least) but can be tractably analyzed in perhaps in terms of local times of the state space (or perhaps 85 86 periodic, depending on the reference input), and 2) allow for the fully general adaptive parameter update law, with leading coefficient matrices other than identity. Two other possible ideas are: understand if it is possible to remove the Brownian motion from the parameter update law completely, since it does not strictly make sense that it would be needed there instead of just on the plant, and determine if the Fokker-Planck (Kolmogorov forward) equation could be solved for these systems. For extending the work on persistency of excitation, the follow ideas come to mind. Is it be possible to incorporate estimation of the feedback and feedforward gains con- currently with the parameterization of the nonlinarity? Extending the geometrical in- terpretation to other kinds of adaptive systems would be a good idea. Particularly since [2] and [60] provide other, similar types of conditions for more complex systems. It would be interesting to understand what can be said about a completely general (control affine) system which is linearized at an equilibrium point, such that it is not necessarily possible for the control input to cancel out the higher-order terms function. Extending the work on function approximation has two immediate paths: 1) is it possible to do “approximate persistency of excitation on the extended, approximate version of the linear MRAC setup, and 2) extending the work of [29] towards the deter- mination of integral representations for more general classes of functions. In particular, using linear combinations of step functions, which cannot be trained using backpropa- gation in the modern paradigm of training deep neural networks. Yet in the 1D case, we can easily show that the step function is superior at uniformly approximating func- tions than the ReLU function. In fact, the step can handle discontinuous (regulated) functions, while the ReLU can only deal with continuous functions. References [1] Veronica Adetola and Martin Guay. Finite-time parameter estimation in adaptive control of nonlinear systems. IEEE Transactions on Automatic Control, 53(3): 807–811, 2008. [2] Brian Anderson. Exponential stability of linear equations arising in adaptive iden- tification. IEEE Transactions on Automatic Control, 22(1):83–88, 1977. [3] Anuradha M Annaswamy, Anubhav Guha, Yingnan Cui, Joseph E Gaudio, and Jose´ M Moreu. Online algorithms and policies using adaptive and machine learning approaches. arXiv preprint arXiv:2105.06577, 2021. [4] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neu- ral networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019. [5] Alessandro Astolfi, Dimitrios Karagiannis, and Romeo Ortega. Nonlinear and adap- tive control with applications. Springer Science & Business Media, 2007. [6] Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and geometry of Markov diffusion operators, volume 348. Springer Science & Business Media, 2013. 87 88 [7] Andrew R Barron. Universal approximation bounds for superpositions of a sig- moidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993. [8] Nicholas M Boffi, Stephen Tu, and Jean-Jacques E Slotine. Regret bounds for adaptive nonlinear control. arXiv preprint arXiv:2011.13101, 2020. [9] Franc¸ois Bolley, Ivan Gentil, and Arnaud Guillin. Convergence to equilibrium in wasserstein distance for fokker–planck equations. Journal of Functional Analysis, 263(8):2430–2457, 2012. [10] Djallel Bouneffouf, Irina Rish, and Charu Aggarwal. Survey on applications of multi-armed and contextual bandits. In 2020 IEEE Congress on Evolutionary Computation (CEC), pages 1–8. IEEE, 2020. [11] Stephen Boyd and Sosale Shankara Sastry. Necessary and sufficient conditions for parameter convergence in adaptive control. Automatica, 22(6):629–639, 1986. [12] Leo Breiman. Hinging hyperplanes for regression, classification, and function ap- proximation. IEEE Transactions on Information Theory, 39(3):999–1013, 1993. [13] Se´bastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012. [14] Girish Chowdhary, Tansel Yucelen, Maximillian Mu¨hlegg, and Eric N Johnson. Concurrent learning adaptive control of linear systems with exponentially conver- gent bounds. International Journal of Adaptive Control and Signal Processing, 27 (4):280–301, 2013. [15] George Cybenko. Approximation by superpositions of a sigmoidal function. Math- ematics of control, signals and systems, 2(4):303–314, 1989. 89 [16] Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforce- ment learning. Nature, 602(7897):414–419, 2022. [17] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient de- scent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685, 2019. [18] Andreas Eberle. Reflection couplings and contraction rates for diffusions. Proba- bility theory and related fields, 166(3-4):851–886, 2016. [19] Andreas Eberle, Arnaud Guillin, and Raphael Zimmer. Quantitative harris-type theorems for diffusions and mckean–vlasov processes. Transactions of the American Mathematical Society, 371(10):7135–7173, 2019. [20] Andreas Eberle, Arnaud Guillin, Raphael Zimmer, et al. Couplings and quanti- tative contraction rates for langevin dynamics. The Annals of Probability, 47(4): 1982–2010, 2019. [21] Andre Esteva, Katherine Chou, Serena Yeung, Nikhil Naik, Ali Madani, Ali Mot- taghi, Yun Liu, Eric Topol, Jeff Dean, and Richard Socher. Deep learning-enabled medical computer vision. NPJ digital medicine, 4(1):5, 2021. [22] Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3):183–192, 1989. [23] Joseph E Gaudio, Travis E Gibson, Anuradha M Annaswamy, Michael A Bolender, and Eugene Lavretsky. Connections between adaptive control and optimization in machine learning. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 4563–4568. IEEE, 2019. 90 [24] Elad Hazan, Sham Kakade, and Karan Singh. The nonstochastic control problem. In Algorithmic Learning Theory, pages 408–421. PMLR, 2020. [25] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989. [26] Naira Hovakimyan and Chengyu Cao. L1 Adaptive Control Theory: Guaranteed Robustness with Fast Adaptation. SIAM, 2010. [27] Daniel Hsu, Clayton Sanford, Rocco A Servedio, and Emmanouil-Vasileios Vlatakis-Gkaragkounis. On the approximation power of two-layer networks of ran- dom relus. In Conference on Learning Theory, pages 2423–2461. PMLR, 2021. [28] Petros Ioannou and Baris¸ Fidan. Adaptive control tutorial. SIAM, 2006. [29] Bunpei Irie and Sei Miyake. Capabilities of three-layered perceptrons. In ICNN, pages 641–648, 1988. [30] Arthur Jacot, Franck Gabriel, and Cle´ment Hongler. Neural tangent kernel: Con- vergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018. [31] Paul C Kainen, Veˇra Ku˘rkova´, and Andrew Vogt. Integral combinations of heavi- sides. Mathematische Nachrichten, 283(6):854–878, 2010. [32] Olav Kallenberg. Foundations of Modern Probability. Springer, third edition edi- tion, 2021. [33] Rushikesh Kamalapurkar, Patrick Walters, Joel Rosenfeld, and Warren Dixon. Re- inforcement learning for optimal feedback control. Springer, 2018. 91 [34] Philipp Karg, Florian Ko¨pf, Christian Alexander Braun, and So¨ren Hohmann. Excitation for adaptive optimal control of nonlinear systems in differential games. IEEE Transactions on Automatic Control, 2022. [35] Hassan K Khalil. High-gain observers in nonlinear feedback control. SIAM, 2017. [36] Gerhard Kreisselmeier and Gudrun Rietze-Augst. Richness and excitation on an interval-with application to continuous-time adaptive control. IEEE transactions on automatic control, 35(2):165–171, 1990. [37] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6): 84–90, 2017. [38] AJ Kurdila, Francis J Narcowich, and Joseph D Ward. Persistency of excitation in identification using radial basis function approximants. SIAM journal on control and optimization, 33(2):625–642, 1995. [39] Andrew Lamperski. Projected stochastic gradient langevin algorithms for con- strained sampling and non-convex learning. In Conference on Learning Theory, pages 2891–2937. PMLR, 2021. [40] Andrew Lamperski. Neural network independence properties with applications to adaptive control. In IEEE Conference on Decision and Control (CDC), 2022. [41] Eugene Lavretsky and Kevin A Wise. Robust and Adaptive Control: with Aerospace Applications. Springer, 2013. [42] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pen- nington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165, 2017. 92 [43] Tyler Lekang and Andrew Lamperski. Simple algorithms for dueling bandits. arXiv preprint arXiv:1906.07611, 2019. [44] Tyler Lekang and Andrew Lamperski. Wasserstein contraction bounds on closed convex domains with applications to stochastic adaptive control. In IEEE Confer- ence on Decision and Control (CDC), 2021. [45] Tyler Lekang and Andrew Lamperski. Sufficient conditions for persistency of ex- citation with step and relu activation functions. In IEEE Conference on Decision and Control (CDC), 2022. [46] FL Lewis, A Yesildirak, and Suresh Jagannathan. Neural network control of robot manipulators and nonlinear systems, 1998. [47] Jung-Shan Lin and Ioannis Kanellakopoulos. Nonlinearities enhance parameter convergence in strict feedback systems. IEEE Transactions on Automatic Control, 44(1):89–94, 1999. [48] Torgny Lindvall and LCG Rogers. Coupling of multidimensional diffusions by reflection. The Annals of Probability, pages 860–872, 1986. [49] Pierre-Louis Lions and Alain-Sol Sznitman. Stochastic differential equations with reflecting boundary conditions. Communications on Pure and Applied Mathemat- ics, 37(4):511–537, 1984. [50] Batta Mahesh. Machine learning algorithms-a review. International Journal of Science and Research (IJSR).[Internet], 9:381–386, 2020. [51] Yuly Makovoz. Random approximants and neural networks. Journal of Approxi- mation Theory, 85(1):98–109, 1996. 93 [52] Iven MY Mareels and Michel Gevers. Persistency of excitation criteria for linear, multivariable, time-varying systems. Mathematics of Control, Signals and Systems, 1(3):203–226, 1988. [53] Riccardo Marino and Patrizio Tomei. Global adaptive output-feedback control of nonlinear systems. i. linear parameterization. IEEE Transactions on Automatic Control, 38(1):17–32, 1993. [54] Sean P Meyn and Richard L Tweedie. Stability of markovian processes iii: Foster- lyapunov criteria for continuous-time processes. Advances in Applied Probability, pages 518–548, 1993. [55] AP Morgan and KS Narendra. On the uniform asymptotic stability of certain linear nonautonomous differential equations. SIAM Journal on Control and Optimization, 15(1):5–24, 1977. [56] Kamil Nar and S Shankar Sastry. Persistency of excitation for robustness of neural networks. arXiv preprint arXiv:1911.01043, 2019. [57] Radford M Neal. Priors for infinite networks (Technical Report CRG-TR-94-1). 1994. [58] Daniel W Otter, Julian R Medina, and Jugal K Kalita. A survey of the usages of deep learning for natural language processing. IEEE transactions on neural networks and learning systems, 32(2):604–624, 2020. [59] Alberto Padoan, Giordano Scarciotti, and Alessandro Astolfi. A geometric charac- terization of the persistence of excitation condition for the solutions of autonomous systems. IEEE Transactions on Automatic Control, 62(11):5666–5677, 2017. [60] Elena Panteley, Antonio Loria, and Andrew Teel. Relaxed persistency of excitation 94 for uniform asymptotic stability. IEEE Transactions on Automatic Control, 46(12): 1874–1886, 2001. [61] Armenak Petrosyan, Anton Dereventsov, and Clayton G Webster. Neural network integral representations with the relu activation function. In Mathematical and Scientific Machine Learning, pages 128–143. PMLR, 2020. [62] Quang-Cuong Pham, Nicolas Tabareau, and Jean-Jacques Slotine. A contraction theory approach to stochastic incremental stability. IEEE Transactions on Auto- matic Control, 54(4):816–820, 2009. [63] Andrey Pilipenko. An introduction to stochastic differential equations with reflec- tion, volume 1. Universita¨tsverlag Potsdam, 2014. [64] Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta numerica, 8:143–195, 1999. [65] Gilles Pisier. Remarques sur un re´sultat non publie´ de B. Maurey. Se´minaire Analyse fonctionnelle, pages 1–12, 1981. [66] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. Advances in neural information processing systems, 29, 2016. [67] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, volume 21, 2008. [68] Benjamin Recht. A tour of reinforcement learning: The view from continuous control. Annual Review of Control, Robotics, and Autonomous Systems, 2:253– 279, 2019. 95 [69] Sayan Basu Roy, Shubhendu Bhasin, and Indra Narayan Kar. Combined mrac for unknown mimo lti systems with parameter convergence. IEEE Transactions on Automatic Control, 63(1):283–290, 2017. [70] Juan G Rueda-Escobedo and Jaime A Moreno. Strong lyapunov functions for two classical problems in adaptive control. Automatica, 124:109250, 2021. [71] Shankar Sastry and Marc Bodson. Adaptive control: stability, convergence and robustness. Prentice Hall, Inc., 1989. [72] Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non- stochastic control. In Conference on Learning Theory, pages 3320–3436. PMLR, 2020. [73] Leszek S lomin´ski. Euler’s approximations of solutions of sdes with reflecting bound- ary. Stochastic processes and their applications, 94(2):317–337, 2001. [74] Jean-Jacques E Slotine, Weiping Li, et al. Applied nonlinear control, volume 199. Prentice hall Englewood Cliffs, NJ, 1991. [75] Hiroshi Tanaka. Stochastic differential equations with reflecting boundary condition in convex regions. Hiroshima Mathematical Journal, 9(1):163–177, 1979. [76] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. [77] Ce´dric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. [78] Draguna Vrabie, Kyriakos G Vamvoudakis, and Frank L Lewis. Optimal adaptive 96 control and differential games by reinforcement learning principles, volume 2. IET, 2013. [79] Tyler Westenbroek, Eric Mazumdar, David Fridovich-Keil, Valmik Prabhu, Claire J Tomlin, and S Shankar Sastry. Adaptive control for linearizable systems using on- policy reinforcement learning. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 118–125. IEEE, 2020. [80] Jan C Willems, Paolo Rapisarda, Ivan Markovsky, and Bart LM De Moor. A note on persistency of excitation. Systems & Control Letters, 54(4):325–329, 2005. Appendix A A.1 Proof of Theorem 2.1 Theorem. The function ρpx, yq defined in (2.11) is a metric over x, y P X . Let xt and yt be two solutions to (2.5) with initial conditions x0 ‰ y0 P X and respective laws Pt and Qt. If the initial laws satisfy ş X VpxqdP0pxq ă 8 and ş X VpxqdQ0pxq ă 8, then the laws contract exponentially in Wρ distance as WρpPt,Qtq ď e´atWρpP0,Q0q , (2.14) where a “ 12 mintλ, ξε, βεu. Proof. Many of the arguments here are similar to those in the proof of Theorem 2.1 in [19]. We give terse explanations of these and then highlight the differences. The main idea is to define a reflection coupling between xt and yt and use it to show that eatρpxt,ytq, with a defined as a “ 1 2 mintλ, ξε, βεu (A.1) 97 98 and ρ defined in (2.11), is supermartingale. This then gives the main result as: Epxt,ytq„Γt˚ rρpxt,ytqs ď Epxt,ytq„Γtrρpxt,ytqs ď e´atEpx0,y0q„Γ˚0 rρpx0,y0qs “WρpPt,Qtq “ e´atWρpP0,Q0q . The left inequality is by the infimum definition of Wρ from (2.10), since at any time t ą 0 the optimal coupling Γt˚ gives the Wasserstein distance for the laws Pt,Qt. The right inequality is then by the supermartingale property, with moving the eat factor to the right side, which holds for any couplings Γt P CpPt,Qtq of the laws at time t ą 0 and Γ0 P CpP0,Q0q of the initial laws. Thus, using the optimal Γ0˚ gives the final result. Note that this optimal Γ0˚ always exists by Theorem 4.1 of [77]. Showing That ρ is a Metric First, note from the definition of ρ in (2.11) that ρpx, yq´fp}x´y}q “ rγV pxq`γV pyq`maxtV pxq, φpMqu`maxtV pyq, φpMqusIpx ‰ yq is a metric, since V pxq ě 0 for all x P X and φpMq ą 0. In particular, it is a weighted version of the Hamming metric. Since the sum of metrics is again a metric, we must show that fp}x´y}q is a metric. This holds provided that f is concave, with f 1p0q ą 0 and fp0q “ 0. (See [18, 19] for details.) To this end, it suffices to show that f is monotonically increasing from fp0q “ 0, and that fpa` bq ď fpaq ` fpbq holds for any a, b ě 0. Working through the definitions (2.13), we get for all r ě 0 that: fprq “ ż mintr,R2u 0 ϕpsqgpsqds f 1prq “ ϕprqgprqIpr ă R2q f2prq “ ´h1prqf 1prq ´ ξ 4 Ipr ă R1q ´ β 4 ΦprqIpr ă R2q . (A.2) 99 Here, hprq is strictly increasing over r ě 0, and thus so are ϕprq and Φprq, respectively from hp0q “ 0, ϕp0q “ 1, and Φp0q “ 0. Also, gprq ě 12 for all r ě 0 by construction, as ξ and β are positive constants, with gp0q “ 1. Thus, we immediately have that fp0q “ 0, f 1p0q “ 1, fprq is strictly increasing over r P r0, R2q, and then is constant for r ě R2. This satisfies the first two requirements. We also see from (A.2) that f2prq ă 0 over r P r0, R2q, and then zero for r ě R2. This gives that f 1prq is monotonically decreasing for all r ě 0. Therefore, the last required property (triangle inequality) is satisfied, since we have fpa` bq “ fpaq ` ż a`b a f 1psqds ď fpaq ` ż b 0 f 1psqds “ fpaq ` fpbq . Gradient and Hessian of General Norms with Quadratic Form Throughout this proof, } ¨ } denotes a general norm having the quadratic form }z}2 “ xz, zy “ zJPz for all z P Rn, which is valid for any symmetric, positive-definite n ˆ n matrix P . The Euclidian norm }z}2 “ xz, zy2 “ zJz satisfies this with P “ In. Then, for any z P Rnzt0nu and where we define u :“ z}z} with }z} ‰ 0, the gradient and Hessian of }z} are given by: ∇}z} “ ∇`p}z}2q 12 ˘ “ 12`p}z}2q´ 12 ˘∇}z}2 “ 12}z}´1 2Pz “ Pu ∇2}z} “ }z}´1`P ´∇}z}p∇}z}qJ˘ “ }z}´1P `In ´ uuJP ˘ . (A.3) Reflection Coupling The key approach from [19] is to define a reflection coupling (see [48]) between the solu- tions xt and yt and then use it to upper bound EΓtreatρpxt,ytqs “ eatEΓtrρpxt,ytqs by a local martingale with zero mean. Then by the definition of Wasserstein distance (2.10), we have WρpPt,Qtq ď EΓtrρpxt,ytqs for any joint measure pxt,ytq „ Γt P CpPt,Qtq that is a coupling of the laws of xt and yt at time t ě 0. 100 To define the reflection coupling, let us first make the following definitions: ut :“ xt ´ yt}xt ´ yt} and δt :“ G´1pxt ´ ytq }G´1pxt ´ ytq}2 . This gives that uJt Put “ 1 and δJt δt “ 1. The reflection coupling is then given by the following RSDE definitions corresponding to (2.5): dxt “ Hpxtqdt`Gdwt ´ vxt dµxptq dyt “ Hpytqdt`G ` In ´ 2δtδJt Ipt ă τ q ˘ dwt ´ vyt dµyptq . (A.4) Here, τ :“ inftt P r0,8q | xt “ ytu is the coupling time and the unique bounded variation reflection processes ´ şt0 vxsdµxpsq and ´ şt0 vysdµypsq respectively ensure both xt and yt remain within X for all t ě 0, from initial conditions x0 ‰ y0 P X . And so, we have that xt “ yt for all t ě τ . Note that the above definition for dyt is indeed of the form given in (2.5), because d rwt “ `In´2δtδJt ˘dwt is a valid definition of a Brownian motion. This follows because` In ´ 2δtδJt ˘ is an orthogonal matrix, since we have ` In ´ 2δtδJt ˘` In ´ 2δtδJt ˘J “` In ´ 2δtδJt ˘J` In ´ 2δtδJt ˘ “ In ´ 2δtδJt ´ 2δtδJt ` 4δtδJt δtδJt “ In. Remark A.1. “Reflection coupling” overlaps with an unrelated meaning in our setup: “reflecting” the solutions xt and yt at the boundary of X to remain within the closed convex set, versus “reflecting” the Brownian motion with In ´ 2δtδJt . The Supermartingale Property Now we will show that eatρpxt,ytq is supermartingale. This is an exercise in non-smooth Itoˆ calculus for continuous semimartingales. A good reference is Chapter 29 of [32]. In the discussion below, mt will denote a local martingale with zero mean. We use this notation in several places to denote different processes. The specific value of mt will not matter, since it will have zero mean. 101 Let zt :“ xt´yt and rt :“ }zt}. We will evaluate drt via Itoˆ’s formula when rt ą 0, meaning that t ă τ . The method is similar to the proof of Theorem 2.1 in [19], but our setup contains extra terms for the reflection processes, ´vxt dµxptq and ´vyt dµyptq. From (A.4), we have that the process zt is a solution to the RSDE: dzt “ ` Hpxtq ´Hpytq ˘ dt` 2GδtδJt dwt ´ vxt dµxptq ` vyt dµyptq t ă τ zt “ 0n t ě τ . With rt ą 0, we use (A.3) to apply Itoˆ’s formula for continuous semimartingales as: drt “ B} ¨ }Bt dt` p∇rtq Jdzt ` 1 2 pdztqJ∇2rt dzt “ 0` xut, dzty ` 1 2}zt} pdztq JP pIn ´ utuJt P qdzt “ xut,Hpxtq ´Hpytqydt` 2xut, GδtδJt dwty ` xut,vyt dµyptq ´ vxt dµxptqy ` 2}zt}tr ` P pIn ´ utuJt P qGδtδJt GpGδtδJt GqJ ˘ dt . (A.5) Recall that here we are using the general inner product xu, vy “ uJPv. The first term being zero just represents that the definition of rt in terms of the norm }¨} is fixed with respect to time. We use the trace identity aJb “ trpbaJq and the Itoˆ rules dt2 “ 0, dtdwt “ 0n, and dwtpdwtqJ “ dtIn to obtain the final term. A significantly simpler expression can be obtained which upper bounds (A.5). First, recalling that δt “ zt}G´1zt}2 , we note that the quantity inside the trace is a zero matrix: P pIn ´ utuJt P qGδtδJt GpGδtδJt GqJ “ P “ Gδtδ J t ´ utuJtP GδtδJt ‰ GpGδtδJt GqJ “ P ” zt}G´1zt}2δ J t ´ zt zJt P }zt}2 zt}G´1zt}2δ J t ı GpGδtδJt GqJ “ P 0nˆnGpGδtδJt GqJ “ 0nˆn . Next, recall that vxt P NX pxtq and vyt P NX pytq, where NX is the normal cone of the closed, convex set X . It then follows from the definition of NX that xut,vyt y ď 0 and ´xut,vxt y ď 0. (We elaborate on this in Section 2.1.) Since µx and µy are non-negative measures, we therefore can bound those terms above by 0. 102 Thus, eliminating the terms explained above gives an upper bound as: drt ď xut,Hpxtq ´Hpytqydt` 2xut, GδtδJt dwty “ xut,Hpxtq ´Hpytqydt` 2 zJt}zt}PG  G´1zt}G´1zt}2δ J t dwt “ xut,Hpxtq ´Hpytqydt` 2 }zt}}G´1zt}2dbt , (A.6) where bt “ şt 0 δ J s dws is a 1-dimensional Brownian motion. Note again that we will use mt going forward to denote various local martingales that have zero mean. Next we will bound dfprtq, using the f defind by (2.13). This part is similar to the analysis from [19]. The main distinction is that [19] examines the special case of P “ G “ In, while the general case leads to more complex formulas. We extend f to the entire real line by setting fprq “ r for r ď 0. Note that f is a non-smooth concave function which is left-differentiable everywhere. Denote the left- derivative by f 1´ prq. Let Lrt denote the right-continuous local time of rt. Then Theorem 29.5 of [32] states that fprtq ´ fpr0q “ ż t 0 f 1´ prsqdrs ` 1 2 ż 8 ´8 Lrtdf 1´ prq , (A.7) where the integral on the right is a Stieltjes integral. (Specifically, it is an integral with respect to the signed measure ν defined by νpra, bqq “ f 1´ pbq´f 1´ paq for a ă b.) Theorem 29.5 of [32] also states that for any measurable function Fprq, we haveż t 0 Fprsqdrrss “ ż 8 ´8 FprqLrtdr (A.8) almost surely. Here, rrst is the quadratic variation. Since the xut, dµy type terms in (A.5) are bounded variation, they have zero quadratic variation, and so we can use the RHS of (A.6) to calculate rrst. This gives: rrst “ ż t 0 ˆ 2 }zt} }G´1zt}2 ˙2 ds “ 4 ż t 0 zJsPzs zJs pG´1qJG´1zsds . 103 SinceG is invertible, thenGGJ is positive definite and thus so is pGGJq´1 “ pG´1qJG´1. Then the ratio of quadratic forms can be bounded for all zt P X as λminpP q λmaxppG´1qJG´1q ď zJt Pzt zJt pG´1qJG´1zt ď λmaxpP q λminppG´1qJG´1q where λminpAq and λmaxpAq are respectively the smallest and largest eigenvalues of the square matrix A. It then holds that 1 λmaxppG´1qJG´1q “ λminpGGJq “ σminpGq2, where σminpGq :“ min}v}2“1 }Gv}2 since G is full rank, and vice versa with σmaxpGq :“ max}v}2“1 }Gv}2 (see Appendix A.3). And so, we can bound the quadratic variation as ς t :“ 4λminpP qσminpGq2 t ď rrst ď 4λmaxpP qσmaxpGq2 t “: ζ t . Note that f is twice continuously differentiable except at r P tR1, R2u. Then (A.8) implies that rt spends zero time at tR1, R2u almost surely, since:ż t 0 Iprs P tR1, R2uqds ď ς´1 ż t 0 Iprs P tR1, R2uqdrrss “ ς´1 ż 8 ´8 Ipr P tR1, R2uqLrtdr “ 0 . (A.9) The last equality follows because Lrt is right continuous in r. More generally, this argument shows that rt spends zero time in any finite collection of points. The local time is non-negative, and the measure defined by νpra, bqq “ f 1´ pbq´f 1´ paq is non-positive. Thus, we get the following inequalities:ż 8 ´8 Lrtdf 1´ prq ď ż 8 ´8 Ipr R tR1, R2uqLrtdf 1´ prq “ ż 8 ´8 Ipr R tR1, R2uqf2prqLrtdr “ ż t 0 Iprs R tR1, R2uqf2prsqdrrss ď ζ ż t 0 f2prsqds. (A.10) 104 The last inequality follows from the bounds on the quadratic variation, non-negativity of f2, and the fact that rs spends zero time in tR1, R2u almost surely. The preceding argument then implies that for almost all t P r0, τ q, the following holds almost surely: dfprtq ď ” f 1prtqxut,Hpxtq ´Hpytqy ` ζ 2 f2prtq ı dt` dmt ď ” f 1prtq ` κprqr ` αp}xt} ` }yt}q ˘` ζ 2 f2prtq ı dt` dmt ď ” f 1prtqαp}xt} ` }yt} ´ 2Mq ´ ξζ 8 Ipr ă R1q ´ βζ 8 fprqIpr ă R2q ı dt` dmt . (A.11) Here, the first inequality combines (A.6), (A.7), and (A.10), the second inequality uses the one-sided growth bound (2.6), and the third inequality uses (A.2) combined with the following facts that hold by construction: fprq ď Φprq and h1prq “ 2 ζ prκprq ` 2αMq “ 1 2ε prκprq ` 2αMq . The fact that fprq ď Φprq can be deduced from the definitions of Φ and f in (2.13), since gprq P r1{2, 1s. We have thus bounded dfprtq as in (A.11). We can directly use the Lyapunov assumption (2.7) to bound dVpxtq and dVpytq almost surely as: dVpxtq ď pC ´ λVpxtqqdt` dmt dVpytq ď pC ´ λVpytqqdt` dmt . (A.12) Now we will bound dpmaxtVpxtq, φpMquq. Recall that Vpxq “ φp}x}q, and that φpdq is strictly monotonically increasing over d ě 0. We will follow a similar strategy as used to bound dfprtq. For compact notation, set Vt “ Vpxtq, let LˆVt be the right-continuous local time of Vt, and let F paq “ maxta, φpMqu, so that F pVpxtqq “ maxtVpxtq, φpMqu. This gives that F paq is convex and smooth over a P r0,8qztφpMqu. Furthermore, since 105 (A.9) implies that rt spends zero time at M , then similarly we have that Vt spends zero time at φpMq. Note that the measure defined by νˆpra, bqq “ F 1´ pbq´F 1´ paq is a Dirac delta centered at φpMq. Thus, the calculation analogous to (A.7) gives F pVtq ´ F pV0q “ ż t 0 F 1´ pVsqdVs ` 1 2 Lˆ φpMq t , where the local time Lˆ φpMq t only changes at the times when Vt “ φpMq. (See again [32] for details). Indeed, this local time is defined by Lˆ φpMq t “ |Vt ´ φpMq| ´ |V0 ´ φpMq| ´ ż t 0 pIpVs ą φpMqq ´ IpVs ď φpMqqqdVs . Thus, the local time remains unchanged on intervals in which Vt ą φpMq or Vt ă φpMq. Then, since the amount of time Vt spends at φpMq is zero almost surely, we have that for almost all t ă τ : dF pVtq “ IpVt ą φpMqqdVt “ Ip}xt} ąMqdVpxtq ď Ip}xt} ąMqpC ´ λVpxtqq ` dmt ď ´Ip}xt} ąMq ` α}xt} ` λ2Vpxtq ˘` dmt . (A.13) The first inequality is due to the bound (A.12), while the second inequality follows from the assumption described in (2.9). And so, we have thus bounded dF pVtq “ dpmaxtVpxtq, φpMquq as in (A.13). Using the definition of ρ in (2.11), and recalling that rt :“ }xt´yt} and xt ‰ yt for all t ă τ , we then have dρpxt,ytqq “ dfprtq ` γ ` dVpxtq ` dVpytq ˘ ` dpmaxtVpxtq, φpMquq ` dpmaxtVpytq, φpMquq . 106 And so, (A.11), (A.12), and (A.13) can be combined, along with the product rule, to give for almost all t ă τ : dpeatρpxt,ytqq “ aeatρpxt,ytqdt` eatdρpxt,ytq ď eat „ afprtq ` aγ ` Vpxtq ` Vpytq ˘` a`maxtVpxtq, φpMqu `maxtVpytq, φpMqu˘ ` f 1prtqαp}xt} ` }yt} ´ 2Mq ´ ξζ 8 Iprt ă R1q ´ βζ 8 fprtqIprt ă R2q ` γ`2C ´ λVpxtq ´ λVpytq˘´ Ip}xt} ąMq`α}xt} ` λ2Vpxtq˘ ´ Ip}yt} ąMq ` α}yt} ` λ2Vpytq ˘ dt` dmt . We must show that the terms inside the brackets always sum to a non-positive value, which then gives dpeatρpxt,ytqq ď dmt as desired. For rt ă R2, the afprtq term is negated by βζ8 fprtq, since a ď βζ8 “ βε2 by definition. On the other hand, for rt ě R2 we will show that afprtq is negated by aγ ` Vpxtq ` Vpytq ˘` γ`2C ´ λVpxtq ´ λVpytq˘ “ γ`2C ` pa´ λqVpxtq ` pa´ λqVpytq˘, as follows: γ ` 2C ` pa´ λqVpxtq ` pa´ λqVpytq ˘ ď γ`2C ´ λ2Vpxtq ´ λ2Vpytq˘ ď γ`2C ´ λ2Vpxtq˘ ď ´γ4C}xt} ď ´ξε2 rt ď ´art ď ´afprtq . The first inequality follows from a ď λ2 . The second is from the assumption, without loss of generality, that }xt} ě }yt} and thus Vpxtq ě Vpytq, and then dropping the non-positive ´λ2Vpytq term. The third is from the assumption described in (2.9), such that ´λ2Vpxtq ď ´4C}xt} ´ 2C. The fourth follows from the definition of γ “ ξε4C and from the triangle inequality giving that 2 maxt}xt}, }yt}u “ 2}xt} ě rt “ }xt´yt}, thus ´}xt} ď ´rt2 . Then we have from the definition of a that a ď ξε2 . Lastly, since we have shown that fprq is concave from fp0q “ 0, it holds that ´rt ď ´fprtq. Next consider the case that f 1prtqαp}xt} ` }yt} ´ 2Mq ą 0. Then, since both f 1prtq ě 0 and α ě 0, at least one of }xt} ą M or }yt} ą M must hold. If }xt} ą M , 107 then since f 1prtq ď 1, the corresponding term is negated by Ip}xt} ąMqppa´ λ{2qVpxtq ´ α}xt}q ď ´α}xt} ď ´f 1prtqα}xt} . A similar negation occurs if }yt} ąM . Finally, we consider the term γ2C “ ξε2 “ ξζ8 if rt ă R2, since previously it was only part of a negation if rt ě R2. In the case that rt ă R1, this term is negated by ´ ξζ8 Iprt ă R1q. If rt ě R1, the definition of R1 “ diamtpx, yq P X | Vpxq ` Vpyq ď 4Cλ u along with a ď λ2 gives that it is negated by 2C ` pa´ λqVpxtq ` pa´ λqVpytq ď 2C ´ λ2 pVpxtq ` Vpytqq ď 0 . We have therefore shown that dpeatρpxt,ytqq ď dmt, and thus proven that the process eatρpxt,ytq is supermartingale. Localization The last step in the proof is ruling out the possibility that Ereatρpxt,ytqs “ 8. This is achieved in the proof of Theorem 2.1 in [19] by a localization argument using stopping times τk :“ inftt | rt ď 1k or maxt}xt}, }yt}u ě ku, with τk Ò τ almost surely by the nonexplosiveness of solutions to (2.5), which in turn is guaranteed by assumption (2.7). The argument works without change in this setting. A.2 Proof of Corollary 2.1 Corollary. For (2.5), there exists a unique stationary distribution pi satisfyingş X Vpxqdpipxq ă 8 . Let xt be a solution to (2.5) with law Pt such that the initial law satisfies ş X VpxqdP0pxq ă 8 . If Vpxq ě 1 for all x P X , then Pt contracts exponentially to pi in total variation distance as }Pt ´ pi}TV ď γ´1e´atWρpP0, piq . (2.15) 108 Now let q P r1,8q satisfy 1p ` 1q “ 1 with the corresponding p ą 1. If Vpxq ě }x}q for all x P X , then Pt contracts exponentially to pi in standard W q as W qpPt, piq ď 21{p γ´1{q ` e´atWρpP0, piq ˘1{q . (2.16) Proof. The proof of (2.15) follows the proof of Corollary 2.1 in [19]. There, the initial equality below is given (with citation), for any two probability measures P,Q over X , as ż X Vpxq |P´Q|pdxq “ inf ΓPCpP,Qq ż XˆX ` Vpxq ` Vpyq˘Ipx ‰ yqdΓpx, yq ď ż XˆX ` Vpxq ` Vpyq˘Ipx ‰ yq dΓ˚px, yq ď γ´1WρpP,Qq , (A.14) where Γ˚ is the coupling measure in WρpP,Qq. The last inequality follows since it holds that ` Vpxq`Vpyq˘Ipx ‰ yq ď γ´1ρpx, yq, with ρ defined by (2.11), for all px, yq P X ˆX . Then from the definition of total variation distance, we have that }P´Q}TV “ 1 2 ż X |P´Q|pdxq ď ż X Vpxq |P´Q|pdxq (A.15) holds if Vpxq ě 1 for all x P X . Therefore, (2.15) follows from combining (2.14) with (A.14) and (A.15) where we use P “ Pt and Q “ pi. The proof of (2.16) follows a similar argument, as is also the case with Remark 2.3 in [19]. There it is given, for any two probability measures P,Q over X , that W qpP,Qq ď 21{p ˆż X }x}q|P´Q|pdxq ˙1{q (A.16) holds for any q P r1,8q satisfying 1p ` 1q “ 1 with the corresponding p ą 1. Thus, if Vpxq ě }x}q for all x P X , we have (by combining with (A.14)) thatż X }x}q|P´Q|pdxq ď ż X Vpxq |P´Q|pdxq ď γ´1WρpP,Qq . (A.17) 109 And so again, the result is obtained by combining (2.14) with (A.16) and (A.17) where we use P “ Pt and Q “ pi. A.3 Useful Properties of Invertible Matrix G Let G be an nˆ n square matrix. Then GGJ is symmetric and thus for any x P Rn it holds that xJGGJxJ “ pGJxqJpGJxq “ nÿ i“1 pGJxq2i ě 0 . If G is invertible, then so is GGJ as pGGJq´1 “ pG´1qJG´1. This means GGJ has no zero eigenvalues and thus is positive definite. Let λ1 ě ¨ ¨ ¨ ě λn ą 0 be the eigenvalues of GGJ and let us define λminpGGJq :“ λn and λmaxpGGJq :“ λ1. Since GGJ is positive definite, then so is its inverse pGGJq´1 “ pG´1qJG´1 with eigenvalues of 1λn ě ¨ ¨ ¨ ě 1λ1 ą 0. We can similarly define λminppG´1qJG´1q :“ 1λ1 and λmaxppG´1qJG´1q :“ 1λn . Therefore, we have 1 λmaxppG´1qJG´1q “ λn “ λminpGG Jq “ σminpGq2 and 1 λminppG´1qJG´1q “ λ1 “ λmaxpGG Jq “ σmaxpGq2 where σminpGq “ min}v}2“1 }Gv}2 (since G is full rank) and σmaxpGq “ max}v}2“1 }Gv}2 :“ }G}2 are respectively the minimum and maximum singular value of G with respect to Euclidean norms. Appendix B B.1 Proofs of Theorems 3.1 and 3.2 Here we will prove the main results provided in Section 3.3. B.1.1 Proof Methodology and Setup The proofs for both theorems rely on the same contradiction method and share a com- mon setup. First, we will assume that the state trajectory xt is continuous and stays within some compact set C Ă Rn for all t ě 0, and that the component activations σ1pwJ1 x` b1q, . . . , σN pwJNx` bN q of Ψ are then bounded on that C. Then we will assume that for any shift τ ě 0 of the time window rτ, τ ` T s of any length T ą 0, there exists a nonzero vector v P RNzt0Nu such that vJ ˆż τ`T τ ΨpxtqΨpxtqJdt ˙ v “ 0 . (B.1) Finally, we will show that if the state trajectory xt meets certain requirements within any shift τ ě 0 of the window rτ, τ ` T ˚s, for some window length T ˚ ą 0, then in fact (B.1) can only hold if v “ 0N . This is a contradiction and thus proves that the time-varying matrix ż τ`T˚ τ ΨpxtqΨpxtqJdt (B.2) 110 111 must be strictly positive definite, with bounded eigenvalues, uniformly over all windows rτ, τ ` T ˚s. Then by the Min-max principle, the persistency of excitation conditions (3.5) and (3.6) are satisfied with setting α1 and α2 respectively as the smallest and largest eigenvalues of (B.2) over all shifts τ ě 0 of the window rτ, τ ` T ˚s. And so, let there be such a continuous trajectory xt staying within a compact set C Ă Rn for all t ě 0. Within any particular time window rτ, τ `T s, the trajectory visits L activation regions AA1 , . . . ,AAL , with the corresponding index sequences A1, . . . ,AL and I1, . . . ,IL´1 defined as in Section 3.2.1. For any time window rτ, τ ` T s and over the corresponding sequence s P rLs of visited activation regions, recall the definition of the subsets Ts from (3.10): Ts “ tt P rτ, τ ` T s | xt P AAsu . Clearly the union of these subsets gives back the entire window as Lď s“1 Ts “ rτ, τ ` T s , meaning (B.1) is equivalent toż τ`T τ pvJΨpxtqq2dt “ Lÿ s“1 ż Ts pvJΨpxtqq2dt “ 0 (B.3) holding for all time windows rτ, τ ` T s and corresponding Ts subsets. Since each integral in the sum is over t P Ts, and by definition we know that xt is only within AAs during that subset of the time window, then by definition only the activation indices i P SAs of Ψpxtq will be nonzero, which gives vJΨpxtq “ Nÿ i“1 vi Ψpxtqi “ ÿ SAs vi Ψpxtqi :“ vJAs ΨpxtqAs . Here we denote vAs and ΨpxtqAs respectively as the indices of v and Ψpxtq according to the active set i P SAs , for each s P rLs. This means (B.3) is equivalent to Lÿ s“1 ż Ts pvJAsΨpxtqAsq2dt “ 0 (B.4) 112 holding for all time windows rτ, τ `T s and corresponding Ts subsets. And since each of these integrals is always nonnegative, (B.4) is equivalent to vJAsΨpxtqAs “ 0 @t P Ts,@s P rLs (B.5) holding for all time windows rτ, τ ` T s and corresponding Ts subsets. B.1.2 Proof of Theorem 3.1 Theorem. Let state trajectory xt be continuous and stay within some compact set C Ă Rn for all t ě 0, and let Ψpxtq “ rσcSpwJ1xt ` b1q ¨ ¨ ¨ σcSpwJNxt ` bN qsJ be composed of (scaled) step functions (3.8) with positive scalars c “ rc1 ¨ ¨ ¨ cN sJ together with N unique affine transformations of Rn according to w1, . . . , wN P Rnzt0nu and b1, . . . , bN P R. If xt over t ě 0 is such that there exists a window length T ˚ ą 0 whereby the sequence s P rLs of activation regions visited during any shift τ ě 0 of the time window rτ, τ ` T ˚s always satisfies the following two conditions: 1. all i P rN s hyperplanes wJi xt ` bi “ 0 are crossed 2. only nondegenerate borders are crossed, then Ψpxtq satisfies the persistency of excitation conditions (3.5) and (3.6). Proof. In this case, (B.5) is equivalent to vJAsΨpxtqAs “ vJAscAs “ 0 @t P Ts,@s P rLs (B.6) holding for all time windows rτ, τ ` T s and corresponding Ts subsets. Here we denote cAs as the indices of c according to the active set i P SAs , for all s P rLs. This gives a fixed, nonzero |SAs | length vector. Note we have assumed that xt over t ě 0 is such that there exists T ˚ ą 0 whereby all N hyperplanes are crossed and only at nondegenerate borders, within any time window 113 rτ, τ ` T ˚s. Thus, we can combine (3.11) and (B.6) to get that 0 “ vJAs`1cAs`1 @t P Ts`1 (B.7) “ $’’&’’% vJAscAs ` vIscIs if AAs`1 Ď X`Is vJAscAs ´ vIscIs if AAs`1 Ď XI˝s must hold over all s P rL ´ 1s with vJA1cA1 “ 0 for all t P T1, for all time windows rτ, τ `T ˚s and corresponding Ts subsets. However, we will now show by induction that this can only hold if v “ 0N . Starting at s “ 1: it must hold that vJA1cA1 “ 0 for all t P T1, and since v and c are fixed it must continue to hold for all t P T2, during which it must hold that vJA2cA2 “ 0. And so, these can both hold only if the additional term is vI1cI1 “ 0, and since cI1 ‰ 0 it must be that vI1 “ 0. We can then inductively make the same argument over the remaining s “ 2, 3, . . . ,L ´ 1 to obtain vI1 “ ¨ ¨ ¨ “ vIL´1 “ 0. And since we have assumed all N hyperplanes were crossed at nondegenerate borders over the L´ 1 hyperplane crossings, we therefore must have that vi “ 0 for all i P rN s. This establishes the contradiction from the original assumption of a nonzero v. Since we assume that all N hyperplanes are crossed and only at nondegenerate borders during any time window rτ, τ ` T ˚s, this contradiction must hold for all such windows. Therefore, (B.2) must be strictly positive definite for all τ ě 0 with this T ˚ window length. And since the (scaled) step is bounded on any compact set C Ă Rn, we have proven persistency of excitation holds for any window length of at least T ˚. B.1.3 Proof of Theorem 3.2 Theorem. Let state trajectory xt be continuous and stay within some compact set C Ă Rn for all t ě 0, and let Ψpxtq “ rσRpwJ1 xt` b1q ¨ ¨ ¨ σRpwJNxt` bN qs be composed of ReLU functions (3.9) together with N unique affine transformations of Rn according 114 to w1, . . . , wN P Rnzt0nu and b1, . . . , bN P R. If xt over t ě 0 is such that there exists a window length T ˚ ą 0 whereby the sequence s P rLs of activation regions visited during any shift τ ě 0 of the time window rτ, τ `T ˚s always satisfies the following three conditions: 1. all i P rN s hyperplanes wJi xt ` bi “ 0 are crossed 2. only nondegenerate borders are crossed 3. for each s P rLs, there are times t1, tˆ1, . . . , tM , tˆM P Ts, for any M ě n, such that the nˆM matrix “pxt1 ´ xtˆ1q ¨ ¨ ¨ pxtM ´ xtˆM q‰ satisfies rankpWJAsq “ rank ´ WJAs “pxt1 ´ xtˆ1q ¨ ¨ ¨ pxtM ´ xtˆM q‰¯ , then Ψpxtq satisfies the persistency of excitation conditions (3.5) and (3.6). Proof. In this case, (B.5) is equivalent to vJAsΨpxtqAs “ vJAspWJAsxt ` bAsq “ 0 @t P Ts,@s P rLs (B.8) holding for all time windows rτ, τ ` T s and corresponding Ts subsets. Here we denote WJ as the N ˆ n matrix with wJi as the ith row and b as the length N vector with bi as the ith entry, for all i P rN s, and thus the As subscripts are the indices according to the active set i P SAs , for each s P rLs. Let t1, tˆ1, . . . , tM , tˆM P Ts be the times from condition (iii). Then for i “ 1, . . . ,M it must hold by (B.8) that vJAspWJAsxti ` bAsq ´ vJAspWJAsxtˆi ` bAsq “ 0 “ vJAsWJAspxti ´ xtˆiq . (B.9) In particular, vAs must be in the nullspace of»————– pxt1 ´ xtˆ1qJ ... pxtM ´ xtˆM qJ fiffiffiffiffiflWAs “: xWAs . 115 Condition (iii) requires that xWAs and WAs have the same rank, and so the rank-nullity theorem implies that their nullspaces must have the same dimension. Now, since the nullspace of WˆAs is a subspace of the nullspace of WAs , the nullspaces must actually be the same. As a result, (B.9) implies that vJAsW J As “ 0 and so vJAsbAs “ 0 as well. We have just shown that with condition (iii) satisfied, (B.8) is equivalent to vJAsrWJAs bAss “ 0Jn`1 @t P Ts,@s P rLs (B.10) holding for all time windows rτ, τ ` T s and corresponding Ts subsets. Note we have assumed that xt over t ě 0 is such that there exists T ˚ ą 0 whereby all N hyperplanes are crossed and only at nondegenerate borders, within any time window rτ, τ ` T ˚s. Thus, we can combine (3.11) and (B.10) to get that 0Jn`1 “ vJAs`1rWJAs`1 bAs`1s @t P Ts`1 (B.11) “ $’’&’’% vJAsrWJAs bAss ` vIsrwJIs bIss if AAs`1 Ď X`Is vJAsrWJAs bAss ´ vIsrwJIs bIss if AAs`1 Ď XI˝s must hold over all s P rL ´ 1s with vJA1rWJA1 bA1s “ 0Jn`1 for all t P T1, for all time windows rτ, τ ` T ˚s and corresponding Ts subsets. However, we will now show by induction that this can only hold if v “ 0N . Starting at s “ 1: it must hold that vJA1rWJA1 bA1s “ 0Jn`1 for all t P T1, and since v, W , and b are fixed it must continue to hold for all t P T2, during which it must hold that vJ2 rWJA2 bA2s “ 0Jn`1. And so, these can both only hold if the additional term is vI1rwJI1 bI1s “ 0Jn`1, and since rwJI1 bI1s is nonzero it must be that vI1 “ 0. We can then inductively make the same argument over the remaining s “ 2, 3, . . . ,L ´ 1 to obtain vI1 “ ¨ ¨ ¨ “ vIL´1 “ 0. And since we have assumed all N hyperplanes were crossed at nondegenerate borders over the L´1 hyperplane crossings, we therefore must have that vi “ 0 for all i P rN s. This establishes the contradiction from the original assumption of a nonzero v. 116 Since we assume that all N hyperplanes are crossed and only at nondegenerate borders during any time window rτ, τ ` T ˚s, this contradiction must hold for all such windows. Therefore, (B.2) must be strictly positive definite for all τ ě 0 with this T ˚ window length. And since the ReLU is bounded on any compact set C Ă Rn, we have proven persistency of excitation holds for any window length of at least T ˚. B.1.4 Extension to Other Activation Functions The reasoning and methods described in this appendix section should also be able to be applied similarly to any nonlinear, positive semidefinite activation function for which it is possible to show the following: 1. (if needed) describe a sufficient condition for the trajectory xt such that we only need to consider fixed properties (constants, vectors, matrices, etc.) contributing to the value of vJAsΨpxtqAs within any activation region s P rLs, for example as was done going from (B.8) to (B.10), and 2. formulate geometric conditions like (B.7) from (B.6) and (B.11) from (B.10), which then allow the main induction steps to proceed. B.2 Global Uniform Asymptotic Stability of (3.23) Here we will show how the linear time-varying (LTV) system 9yt “ Γ ΨpxtqΨpxtqJyt (B.12) is proven globally uniformly asymptotically stable if the vector function Ψ : Rn Ñ RN satisfies the persistency of excitation condition (3.5). State trajectories y : r0,8q Ñ RN and x : r0,8q Ñ Rn are assumed to be continuous (in time), and xt is assumed to stay within some compact set C P Rn for all t ě 0. The vector function is defined 117 Ψpxq :“ rg1pxq ¨ ¨ ¨ gN pxqsJ, where the continuous almost everywhere, nonlinear scalar functions g1, . . . , gN : Rn Ñ R are bounded on C, and Γ is some symmetric, positive- definite N ˆN matrix. Firstly, Lemma 1 of [2] says that if F : r0,8q Ñ RNˆN , H : r0,8q Ñ RN , and Kr0,8q Ñ RN are regulated matrix/vector functions of time (meaning one-sided limits exist for all t P r0,8q), and there exists positive scalars α3, T ą 0 such thatż τ`T τ }Kt}22 dt ď α3 holds for all τ ě 0, then rFt, Hts is uniformly completely observable (UCO) if and only if rFt `KtHJt , Hts is UCO. Next, Lemma 2 of [2] says that the global uniform asymptotic stability of 9yt “ Ft yt is equivalent to: i) the existence of a symmetric, differentiable matrix function of time P : r0,8q Ñ Rnˆn satisfying, for some positive scalars β1, β2 ą 0, that β1IN ĺ Pt ĺ β2IN and ´ 9Pt “ Pt Ft ` FJt Pt `HtHJt both hold for all t ě 0, and ii) that rFt, Hts is UCO. Finally, a slight modification of Theorem 1 of [2], in order to allow for Γ matrices other than the identity IN , says that the global uniform asymptotic stability of (B.12) holds if Ψpxtq satisfies the persistency of excitation condition (3.5). Proof. Suppose that Ψpxtq satisfies the persistency of excitation condition (3.5) for some positive scalars α1, α2, T ą 0. This means, by definition, that r0NˆN ,Ψpxtqs is UCO. Now set Kt “ ´Γ Ψpxtq and note that (3.5) givesż τ`T τ }Kt}22 dt ď ż τ`T τ }Γ}2F }Ψpxtq}2F dt “ G2 ż τ`T τ tr “ ΨpxtqΨpxtqJ ‰ dt ď G2α2N , thus satisfying Lemma 1 and so r´Γ ΨpxtqΨpxtqJ,Ψpxtqs is UCO. 118 Therefore, we have that Lemma 2 is satisfied with Pt “ 12Γ´1, Ft “ ´Γ ΨpxtqΨpxtqJ, and Ht “ Ψpxtq. This proves (B.12) is globally uniformly asymptotically stable. Remark B.1. Lemma 1 and Theorem 1 of [2] actually state exponential stability, rather than global asymptotic uniform stability. However, for linear systems these types of stability are equivalent. See Section 1.5.2 of [71] for details. Remark B.2. Lemma 1 and 2 and Theorem 1 of [2] in fact hold with Vt “ Ψpxtq, Kt, and Ht being regulated N ˆ ` matrix functions of time for any ` ě 1, using the induced Euclidean matrix norm when ` ą 1. We specified their results for ` “ 1 and the Euclidean vector norm, in order to match the application in Section 3.4. Appendix C C.1 Useful Relations Bounding Absolute Value of F p1qB,C and F p2qB,C Integrands First, we note that ˇˇ re “ Zeib ‰ ˇˇ ď |Zeib| “ |Z| holds for all Z P C and b P R. Thus, we bound the absolute value of the integrand in (4.7) for any ξ, x P Rn and b P R as ˇˇ cospξJx` bq ´ cospbqˇˇ ď ˇˇpeiξJx ´ 1qeib ˇˇ “ ˇˇeiξJx ´ 1ˇˇ “ 2ˇˇ sinpξJx2 qˇˇ ď 2ˇˇξJx2 ˇˇ “ ˇˇξJxˇˇ , (C.1) which gives | cospξJxq ´ 1| ď |ξJx| for b “ 0. By then also noting | sinp0q ´ 0| “ 0 and ddy psinpyq ´ yq “ cospyq ´ 1 for y P R, we can show that | sinpyq ´ y| “ ˇˇˇˇż y 0 cospzq ´ 1 dz ˇˇˇˇ ď ż y 0 | cospzq ´ 1| dz ď ż y 0 |z|dz “ 12 |y|2 . 119 120 Combining these results, we bound the absolute value of the integrand in (4.9) as | cospξJx` bq ` ξJx sinpbq ´ cospbq| ď |eiξJx ´ iξJx´ 1| “ b pcospξJxq ´ 1q2 ` psinpξJxq ´ ξJxq2 ď | cospξJxq ´ 1| ` | sinpξJxq ´ ξJx| “ 2| sinpξJx2 q|2 ` | sinpξJxq ´ ξJx| ď 2|ξJx2 |2 ` 12 |ξJx|2 “ |ξJx|2 . (C.2) Integrability of F p1qB,C and F p2qB,C Classes We note here that the integrability of (4.7) and (4.9) are given by using (C.1),(4.8) and (C.2),(4.10) respectively, along with the definition of |ξ|B “ supxPB ˇˇ ξJx ˇˇ , asż Rn |eiξJx ´ 1|Gpdξq ď ż Rn |ξJx|Gpdξq ď ż Rn |ξ|B Gpdξq ď Cp1qgż Rn |eiξJx ´ iξJx´ 1|Gpdξq ď ż Rn |ξJx|2Gpdξq ď ż Rn |ξ|2B Gpdξq ď Cp2qg . Reverse Triangle Inequality Let F be some function space with f, g P F . Let } ¨ } be any norm on F . Then we have: }f} “ }f ´ g ` g} ď }f ´ g} ` }g} }g} “ }g ´ f ` f} ď }g ´ f} ` }f} “ }f ´ g} ` }f} , due to the triangle inequality. Therefore, it holds that ˇˇ }f} ´ }g} ˇˇ ď }f ´ g} . (C.3) 121 Norm Difference for Functions with Different Coefficients Let F be some function space with f, g P F . Let } ¨ } be any norm on F , and a, b P R. Then we have: }af ´ bg} “ }apf ´ gq ´∆g} ď |a| }f ´ g} ` |∆| }g} (C.4) where ∆ “ b´ a, by using the triangle inequality. Bounding Expected Variation of Sample Average Function from its Mean Let f ,f1, . . . ,fN be drawn iid from a function space F and define the sample average function f¯N “ 1N řN i“1 fi. With the L2pB, µq norm (for any valid measure µ) denoted as }¨}B, we have E ››f¯N ´ Ef¯N››2B “ E żB `f¯N ´ Ef¯N˘2 µpdxq paq“ ż B ´ E “ f¯ 2N ‰´ 2E “f¯N ‰E “f¯N ‰` E “f¯N ‰2¯µpdxq pbq“ ż B ´ E “ f¯ 2N ‰´ E “f¯N ‰2¯µpdxq “ ż B var ` f¯N ˘ µpdxq pcq“ ż B 1 N varpfqµpdxq “ 1 N ż B ` E “ f2 ‰´ Erf s2˘µpdxq “ 1 N ´ E }f}2B ´ }Erf s}2B ¯ pdqď 1 N E }f}2B . (C.5) Here (a) is by using Fubini’s theorem to bring the expectation inside the integral, the linearity of expectation, and the outer expectation then being redundant for the last term, (b) is from combining the middle and last terms, (c) is by the facts of variance of averaged iid random variables (covariance terms are zero, while means and variances are equal), and (d) is because the second term is nonpositive and therefore it can be dropped with inequality. 122 McDiarmid’s Inequality Let v1, . . . ,vd, for any d P N, be independent random variables in the sets V1, . . . ,Vd and let the function h : V1ˆ¨ ¨ ¨ˆVd Ñ R satisfy for all i P rds and some positive scalars k1, . . . , kd ą 0, that |hpv1, . . . , vi, . . . , vdq ´ hpv1, . . . , rvi, . . . , vdq| ď ki holds for all points pv1, . . . , vi, . . . , vdq, pv1, . . . , rvi, . . . , vdq P V1,ˆ ¨ ¨ ¨ ˆ Vd differing only at vi, rvi P Vi. Then, for all t ě 0, it holds that P ” hpv1, . . . ,vdq ´ Erhpv1, . . . ,vdqs ě t ı ď exp ˜ ´2 t2řd i“1 k2i ¸ . (C.6) Lemma on Approximating Functions of Convex Combinations The following lemma is credited to Maurey in [65]. Lemma C.1. If qf is in the closure of the convex hull of a bounded set of functions F in a Hilbert space H, such that }f}H ď b holds for all f P F , then for every N ě 1, there exists an f¯N in the convex hull of N points in F such that››› qf ´ f¯N›››2H ď c¯N holds for any c¯ ą b2 ´ ››› qf ›››2 H . Proof. Define f‹ as a finite but arbitrarily large convex combination of m points f‹i P F , meaning it has the form f‹ “ řmi“1 γif‹i with γi ě 0, řmi“1 γi “ 1, and such that››› qf ´ f‹››› H ď δ? N for some δ ą 0. Randomly sample functions f ,f1, . . . ,fN iid from the set tf‹i , . . . , f‹mu according to Prf “ f‹i s “ γi. Thus, we have Erf¯N s “ Er 1N řN i“1 fis “ Erf s “ f‹, and then using (C.5) gives E ››f¯N ´ f‹››2H “ 1N pE }f}2H ´ }f‹}2Hq ď 1N pb2 ´ 123 }f‹}2Hq. Since the expectation is bounded this way, there must be a realization f¯N that also satisfies this bound. Then by the triangle inequality, we have ››› qf ´ f¯N›››H ď››› qf ´ f‹››› H `››f¯N ´ f‹››H “ 1?N pδ`bb2 ´ }f‹}2Hq. And since f‹ can get arbitrarily close to qf by letting mÑ8, and thus δ Ñ 0 and }f‹}H Ñ ››› qf ›››H, this proves the result. In our setting, we use the }¨}B norm for the L2pBq space. We note that Theorem 1 of [51] gives the same Op 1? N q result, but with a superior constant. The proof there uses the same method as above, but breaks F into N subsets of diameter , and then uses that to bound the variance terms. This gives the bound constant in terms of an inverse covering number of F (denoted N pFq), which goes to zero as N Ñ 8. However, this requires quantifying N pFq, which may be nontrivial. C.2 Proof of Theorem 4.1 Theorem. For any constant C ą 0 and bounded set B Ă Rn containing the origin, there exists a base approximation pfN of the form (4.5) using the step activation function σS with parameters satisfying |w|B “ 1 and |b| ď 1, for any integer N ě 1, to every function f P F p1qB,C such that the approximation error rfN pxq “ qfpxq´ pfN pxq to the affine shifted qfpxq “ fpxq ´ fp0nq satisfies the L2pB, µ¯Bq norm bound ››› rfN›››B “ dż B ˇˇˇ rfN pxqˇˇˇ2 µ¯Bpdxqq ď 2C? N (4.11) with coefficient size bounded as řN i“1 |θi| ď 2C. Proof. We closely follow the proof methods in [7]. Recalling that gpxq “ fpxq, and thus qgpxq “ qfpxq, on B, then taking the real part of both sides of (4.7) gives for all x P B 124 that rer qgpxqs “ re” qfpxqı “ re„ż Rn peiξJx ´ 1q eiφpξqGpdξq  qfpxq “ ż Rn ´ cos ` ξJx` φpξq˘´ cos `φpξq˘¯Gpdξq “ ż Ω hpx, ξqΛpdξq where Ω “ Rnzt0nu and equality is maintained because the integrand at ξ “ 0n is zero. Here, we defined hpx, ξq :“ C p1q g |ξ|B ´ cos ` ξJx` φpξq˘´ cos `φpξq˘¯ Λpdξq :“ |ξ|B C p1q g Gpdξq with C p1q g as in (4.8), and so Λ is a probability measure over Ω such that ş Ω Λpdξq “ 1. Then by (C.1) and recalling the definition of |ξ|B “ supxPB ˇˇ ξJx ˇˇ , it holds for all x P B and ξ P Ω that |hpx, ξq| ď C p1q g |ξ|B |ξJx| ď Cp1qg ď C . With ξ,ξ1, . . . ,ξN P Ω sampled iid according to Λ, we have Er 1N řN i“1 hpx,ξiqs “ qfpxq. Thus, by (C.5) we have E ››››› qfpxq ´ 1N Nÿ i“1 hpx,ξiq ››››› 2 B ď 1 N E }hpx,ξq}2B ď 1 N ż B C2 µ¯Bpdxq “ C 2 N . Since this goes to zero as N Ñ8, we have that for every f P F p1qB,C , it holds over x P B that qf P coHcosp1qC,B , where Hcosp1qC,B :“ " α |ξ|B ` cospξJx` bq ´ cospbq˘ ˇˇˇ ξ P Ω, |α| ď C, b P R* with closure in the sense of the L2pB, µ¯Bq norm. 125 Next, for any ξ P Ω we set y “ 1|ξ|B ξ Jx, which gives y P Y Ď r´1, 1s over x P B. Therefore, functions in Hcosp1qC,B correspond to smooth univariate functions of y having the form hpyq “ α|ξ|B ` cosp|ξ|B y ` bq ´ cospbq ˘ for any ξ P Ω, |α| ď C, and b P R, and thus having derivative h1pyq “ ´α sinp|ξ|B y ` bq . These hpyq functions are uniformly continuous and bounded on y P r´1, 1s with hp0q “ 0 and |hpyq| ď C, and so they can be uniformly approximated on this interval by piecewise constant functions. Setting a uniform partition of r0, 1s for any positive integer P as ∆ :“ 1P ă 2∆ ă ¨ ¨ ¨ ă pP ´ 1q∆ ă P∆ “ 1, we define h2P pyq :“ ÿ dPt´1,1u ˜ hp0qσSpdyq ` P´1ÿ k“1 ` hpdk∆q ` hpdpk ´ 1q∆q˘σSpdy ´ k∆q ¸ , recalling that σSpyq “ Iry ą 0s is the step activation function. This definition means h2P is a continuous, length 2P sequence of constant lines on y P r´1, 1s between each pair of points hpdpk ´ 1q∆q, hpdk∆q(, over k P rP s and d P t´1, 1u. Thus, as P Ñ 8 (and ∆ Ñ 0) we get h2P Ñ h uniformly on y P r´1, 1s, and the absolute values of the coefficients are |hp0q| “ 0 and |hpdk∆q ´ hpdpk ´ 1q∆q| . By definition, the sum of the absolute values of the coefficients is then bounded by the total variation of h1pyq over y P r´1, 1s. And since |h1pyq| “ |α|| ´ sinp|ξ|B y ` bq| ď C, we have for any P ě 1 that ÿ dPt´1,1u ˜ |hp0q| ` P´1ÿ i“1 |hpdk∆q ` hpdpk ´ 1q∆q| ¸ ď ż 1 ´1 |h1pyq|dy ď 2C . 126 This proves that for every f P F p1qB,C , it holds over x P B that qfpxq P coHSC,B, where HSC,B :“ ασSpξJx` bq ˇˇ |α| ď 2C, |ξ|B “ 1, |b| ď 1( with closure in the supremum norm over x P B, and thus also in the sense of the L2pB, µ¯Bq norm. Recalling that we had defined y “ 1|ξ|B ξ Jx, which gave y P Y Ď r´1, 1s for any ξ P Ω, then ξJx P W Ď r´ |ξ|B , |ξ|Bs and thus why |ξ|B “ 1 is specified in the definition of HSC,B. To bound the L2pB, µ¯Bq norm for all functions in this set, note that the largest norms belong to the functions that correspond to hpyq “ p˘2CqσSp˘y` 1q, which over y P r´1, 1s are equivalent to p˘2Cq. Thus, for all h P HRC,B we have }h}2B “ ż B pασSpξJx` bqq2 µ¯Bpdxq ď p˘2Cq2 ż 1 ´1 12 µ¯r´1,1spdyq “ p2Cq2 , where µ¯r´1,1spdyq “ 12dy is the uniform probability measure over y P r´1, 1s such thatş1 ´1 µ¯r´1,1spdyq “ ş1 ´1 1 2dy “ 1. Using Lemma C.1 with c¯ “ b2 “ p2Cq2 (which is justified, since we can only have } qf }2B “ 0 if qfpxq “ 0 for all x P B) then allows us to conclude that there must exist pfN in the convex hull of N points in HSC,B satisfying } qf ´ pfN}B ď 2C`N´ 12 ˘ . This means the approximation has the form pfN pxq “ řNi“1 ui αiσSpwJi x ` biq such that řN i“1 ui “ 1 with u1, . . . , uN ě 0. Therefore, we can bound the coefficient size asřN i“1 |θi| “ řN i“1 ui|αi| ď řN i“1 uip2Cq “ 2C řN i“1 ui “ 2C for any N P N. C.3 Proof of Theorem 4.2 Theorem. For any constant C ą 0 and bounded set B Ă Rn containing the origin, there exists a base approximation pfN of the form (4.5) using the ReLU activation function σR with parameters satisfying |w|B “ 1 and |b| ď 1, for any integer N ě 1, to every function f P F p2qB,C such that the approximation error rfN pxq “ qfpxq´ pfN pxq to the 127 affine shifted qfpxq “ fpxq ´∇fp0nqJx´ fp0nq satisfies the L2pB, µ¯Bq norm bound››› rfN›››B “ dż B ˇˇˇ rfN pxqˇˇˇ2 µ¯Bpdxqq ď b 16 3 C? N (4.12) with coefficient size bounded as řN i“1 |θi| ď 2C. Proof. We closely follow the proof methods in [7] and [12]. Recalling that gpxq “ fpxq, and thus qgpxq “ qfpxq, on B, then taking the real part of both sides of (4.9) gives for all x P B that rer qgpxqs “ re” qfpxqı “ re„ż Rn peiξJx ´ iξJx´ 1q eiφpξqGpdξq  qfpxq “ ż Rn ´ cos ` ξJx` φpξq˘` ξJx sin `φpξq˘´ cos `φpξq˘¯Gpdξq “ ż Ω hpx, ξqΛpdξq where Ω “ Rnzt0nu and equality is maintained because the integrand at ξ “ 0n is zero. Here, we defined hpx, ξq :“ C p2q g |ξ|2B ´ cos ` ξJx` φpξq˘` ξJx sin `φpξq˘´ cos `φpξq˘¯ Λpdξq :“ |ξ| 2 B C p2q g Gpdξq with C p2q g as in (4.10), and so Λ is a probability measure over Ω such that ş Ω Λpdξq “ 1. Then by (C.2) and recalling the definition of |ξ|B “ supxPB ˇˇ ξJx ˇˇ , it holds for all x P B and ξ P Ω that |hpx, ξq| ď C p2q g |ξ|2B |ξJx|2 ď Cp2qg ď C . With ξ,ξ1, . . . ,ξN P Ω sampled iid according to Λ, we have Er 1N řN i“1 hpx,ξiqs “ qfpxq. Thus, by (C.5) we have E ››››› qfpxq ´ 1N Nÿ i“1 hpx,ξiq ››››› 2 B ď 1 N E }hpx,ξq}2B ď 1 N ż B C2 µ¯Bpdxq “ C 2 N . 128 Since this goes to zero as N Ñ8, we have that for every f P F p2qB,C , it holds over x P B that qf P coHcosp2qC,B , where Hcosp2qC,B :“ # α |ξ|2B ` cospξJx` bq ` ξJx sinpbq ´ cospbq˘ ˇˇˇ ξ P Ω, |α| ď C, b P R+ with closure in the sense of the L2pB, µ¯Bq norm. Next, for any ξ P Ω we set y “ 1|ξ|B ξ Jx, which gives y P Y Ď r´1, 1s over x P B. Therefore, functions in Hcosp2qC,B correspond to smooth univariate functions of y having the form hpyq “ α|ξ|2B ` cosp|ξ|B y ` bq ` |ξ|B y sinpbq ´ cospbq ˘ for any ξ P Ω, |α| ď C, and b P R, and thus having derivatives h1pyq “ α|ξ|B `´ sinp|ξ|B y ` bq ` sinpbq˘ and h2pyq “ ´α cosp|ξ|B y ` bq . These hpyq functions are uniformly continuous and bounded on y P r´1, 1s with hp0q “ 0 and |hpyq| ď C, and so they can be uniformly approximated on this interval by piecewise linear functions. Setting a uniform partition of r0, 1s for any positive integer P as ∆ :“ 1P ă 2∆ ă ¨ ¨ ¨ ă pP ´ 1q∆ ă P∆ “ 1, we define h2P pyq :“ ÿ dPt´1,1u ˆ hpd∆q ´ hp0q ∆ σRpdyq ` P´1ÿ k“1 hpdpk ` 1q∆q ´ 2hpdk∆q ` hpdpk ´ 1q∆q ∆ σRpdy ´ k∆q ¸ , recalling that σRpyq “ maxt0, yu is the ReLU activation function. This definition means h2P is a continuous, length 2P sequence of secant lines on y P r´1, 1s between each pair of points hpdpk ´ 1q∆q, hpdk∆q(, over k P rP s and d P t´1, 1u. Thus, as P Ñ 8 (and ∆ Ñ 0) we get h2P Ñ h uniformly on y P r´1, 1s, and the 129 absolute values of the coefficients go toˇˇˇˇ hpd∆q ´ hp0q ∆ ˇˇˇˇ Ñ |h1p0q| “ 0ˇˇˇˇ hpdpk ` 1q∆q ´ 2hpdk∆q ` hpdpk ´ 1q∆q ∆ ˇˇˇˇ Ñ |h1pdk∆q ´ h1pdpk ´ 1q∆q| . By definition, the sum of the absolute values of the coefficients is then bounded by the total variation of h2pyq over y P r´1, 1s. And since |h2pyq| “ |α|| ´ cosp|ξ|B y ` bq| ď C, we have for any P ě 1 thatÿ dPt´1,1u ˜ˇˇˇˇ hpd∆q ´ hp0q ∆ ˇˇˇˇ ` P´1ÿ i“1 ˇˇˇˇ hpdpk ` 1q∆q ´ 2hpdk∆q ` hpdpk ´ 1q∆q ∆ ˇˇˇˇ¸ ď ż 1 ´1 |h2pyq|dy ď 2C . This proves that for every f P F p2qB,C , it holds over x P B that qfpxq P coHRC,B, where HRC,B :“ ασRpξJx` bq ˇˇ |α| ď 2C, |ξ|B “ 1, |b| ď 1( with closure in the supremum norm over x P B, and thus also in the sense of the L2pB, µ¯Bq norm. Recalling that we had defined y “ 1|ξ|B ξ Jx, which gave y P Y Ď r´1, 1s for any ξ P Ω, then ξJx P W Ď r´ |ξ|B , |ξ|Bs and thus why |ξ|B “ 1 is specified in the definition of HRC,B. To bound the L2pB, µ¯Bq norm for all functions in this set, note that the largest norms belong to the functions that correspond to hpyq “ p˘2CqσRp˘y` 1q, which over y P r´1, 1s are equivalent to p˘2Cqp˘y ` 1q. Thus, for all h P HRC,B we have }h}2B “ ż B pασRpξJx` bqq2 µ¯Bpdxq ď p˘2Cq2 ż 1 ´1 p˘y ` 1q2 µ¯r´1,1spdyq “ 43p2Cq2 , where µ¯r´1,1spdyq “ 12dy is the uniform probability measure over y P r´1, 1s such thatş1 ´1 µ¯r´1,1spdyq “ ş1 ´1 1 2dy “ 1. Using Lemma C.1 with c¯ “ b2 “ 43p2Cq2 (which is justified, since we can only have } qf }2B “ 0 if qfpxq “ 0 for all x P B) then allows us to conclude that there must exist pfN in the convex hull of N points in HRC,B satisfying } qf ´ pfN}B ď b163 C `N´ 12 ˘ . 130 This means the approximation has the form pfN pxq “ řNi“1 ui αiσRpwJi x ` biq such that řN i“1 ui “ 1 with u1, . . . , uN ě 0. Therefore, we can bound the coefficient size asřN i“1 |θi| “ řN i“1 ui|αi| ď řN i“1 uip2Cq “ 2C řN i“1 ui “ 2C . C.4 Proof of Theorem 4.3 Theorem. Let there exist a base approximation pfN of the form (4.5) using any activation function σ satisfying Assumption 4.1, for any N P N, with some unknown parameters γ1, . . . ,γN P Υ and coefficients θ P RN , but having a known bound on the coefficient size of řN i“1 |θi| ď SN and the parameters set Υ known. Then, the mollified approximation pfλ,N over the λ-expanded parameters set Υλ in (4.17) satisfies either of: iq ››› pfN ´ pfλ,N›››B ď SN Dσ (4.20) iiq ››› pfN ´ pfλ,N›››B ď SN Lσ ? n` 1 λ EB . (4.21) Proof. First, we observe that (4.16) is equivalently pfN pxq “ Nÿ i“1 θi σpγJiXq ż Υλ δˆn`1λ pγ´ γiq dγ “ Nÿ i“1 θi ż Υλ δˆn`1λ pγ´ γiqσpγJiXqdγ , since ş Υλ δˆn`1λ pγ ´ γiqdγ “ 1 holds for all λ ą 0, n ě 1, and γi P Υ, and σpγJiXq is a constant with respect to dγ. Then we define the sets Υi,λ as the supports of each δˆ n`1 λ pγ´γiq, which is the open cube p´ 1λ , 1λqn`1 centered at the corresponding γi P Υ, and note that these are always 131 contained within the λ-expanded Υλ. Therefore, for any x P B we haveˇˇ pfN pxq ´ pfλ,N pxqˇˇ “ ˇˇˇˇ ˇ Nÿ i“1 θi ż Υλ δˆn`1λ pγ´ γiq ´ σpγJiXq ´ σpγJXq ¯ dγ ˇˇˇˇ ˇ paqď Nÿ i“1 |θi| ˇˇˇˇż Υλ δˆn`1λ pγ´ γiq ´ σpγJiXq ´ σpγJXq ¯ dγ ˇˇˇˇ pbqď Nÿ i“1 |θi| ż Υλ δˆn`1λ pγ´ γiq ˇˇ σpγJiXq ´ σpγJXq ˇˇ dγ pcqď Nÿ i“1 |θi| ż Υλ δˆn`1λ pγ´ γiq sup φPΥi,λ ˇˇ σpγJiXq ´ σpφJXq ˇˇ dγ pdq“ Nÿ i“1 |θi| sup φPΥi,λ ˇˇ σpγJiXq ´ σpφJXq ˇˇ ż Υλ δˆn`1λ pγ´ γiqdγ peq“ SN sup φPΥi,λ ˇˇ σpγJiXq ´ σpφJXq ˇˇ . Here (a) is by the triangle inequality, (b) is because ˇˇş fpxq dxˇˇ ď ş |fpxq|dx and δˆn`1λ is positive, (c) is because the integrand is only nonzero on γ P Υi,λ by the support of δˆn`1λ , (d) is because the sup is a constant and can thus be pulled out of the integral, and (e) is by the definition of the mollified delta such that ş Υλ δˆn`1λ pγ ´ γiqdγ “ 1 for all i P rN s and the (assumed known) bound řNi“1 |θi| ď SN . If σ satisfies the overall boundedness assumption with constant Dσ, then we haveˇˇ pfN pxq ´ pfλ,N pxqˇˇ “ SN Dσ and thus ››› pfN pxq ´ pfλ,N pxq›››B ď dż B |SN Dσ|2 µ¯Bpdxq “ SN Dσ where SN and Dσ are positive constants. Otherwise, σ satisfies the Lipschitz assumption with constant Lσ and so we haveˇˇ pfN pxq ´ pfλ,N pxqˇˇ ď SN Lσ sup φPΥi,λ ˇˇpγi ´ φqJX ˇˇ pfqď SN Lσ ? n` 1 λ }X}2 (C.7) 132 where (f) uses the Cauchy-Schwarz inequality, combined with the fact that }γi ´ φ}2 ď ? n`1 λ for all φ P Υi,λ at any γi P Υλ. And thus ››› pfN pxq ´ pfλ,N pxq›››B ď dż B ˇˇˇˇ SN Lσ ? n` 1 λ }X}2 ˇˇˇˇ2 µ¯Bpdxq “ SN Lσ ? n` 1 λ EB where SN , Lσ, ? n` 1, and λ are all positive constants. C.5 Proof of Lemma 4.1 Lemma. Let the target function f P CpBq satisfy Theorem 4.3 and thus have the mollified approximation pfλ,N as in (4.13) with activation σ for the selected mollification factor λ ą 0, which has the integral representation (4.16) over the λ-expanded param- eters set Υλ. If the random parameters γ1, . . . ,γR are sampled iid according to any density Pλ which is nonzero over Υλ, then there exists a random approximation pFR of the form (4.6) using the activation σ such that E γj„Pλ ” pFRpx ;γ1, . . . ,γRqı “ pfλ,N pxq . (4.23) Proof. The mollified approximation pfλ,N has the integral approximation (4.17), which can be equivalently written as pfλ,N pxq :“ ż Υλ gλpγiq σpγJzq dγ where coefficient function gλ is defined in (4.18). Then with Pλ being the known nonzero density over Υλ used for sampling the parameters, we can equivalently write this as pfλ,N pxq “ ż Υλ gλpγq Pλpγq σpγ JzqPλpγqdγ “ E γ„Pλ „ gλpγq Pλpγq σpγ Jzq  . (C.8) 133 For any sampled parametervalues γ1, . . . ,γR from Υλ, let us denote rfR˚ as the random approximation that has coefficient values ϑ˚ P RR of ϑj˚ “ gλpγjqPλpγjqR “ řN i“1 θi δˆ n`1 λ pγj ´ γiq PλpγjqR @j P rRs . (C.9) Let us denote dPλpγjq :“ Pλpγjqdγj for all j P rRs, so that in the below integrals γ,γ1, . . . ,γR all represent separate variables of integration. Then with the iid assump- tion on all γj „ Pλ, for this rfR˚ we have E γj„Pλ ” rfR˚px ;γ1, . . . ,γRqı “ E γj„Pλ « Rÿ j“1 ϑj˚ σpγJj zq ff “ ż Υλ ¨ ¨ ¨ ż Υλ Rÿ j“1 gλpγjq PλpγjqR σpγ J j zq dPλpγ1q ¨ ¨ ¨ dPλpγRq paq“ ż Υλ ¨ ¨ ¨ ż Υλ gλpγ1q Pλpγ1qR σpγ J 1 zq dPλpγ1q ¨ ¨ ¨ dPλpγRq ` ¨ ¨ ¨ ` ż Υλ ¨ ¨ ¨ ż Υλ gλpγRq PλpγRqR σpγ J RzqdPλpγ1q ¨ ¨ ¨ dPλpγRq pbq“ ż Υλ gλpγ1q Pλpγ1qR σpγ J 1 zqdPλpγ1q ` 1R´1q ` ¨ ¨ ¨ ` ż Υλ gλpγRq PλpγRqR σpγ J Rzq dPλpγRq ` 1R´1q pcq“ R ˆ 1 R ż Υλ gλpγq Pλpγq σpγ JzqdPλpγq ˙ pdq“ pfλ,N pxq . Here, (a) is by the linearity of integration, (b) is each γj only belongs to the dPλpγjq integral, while the remaining R ´ 1 integrals in each term are şΥλ dPλpγjq “ 1, (c) is because there are R copies of the same integral, and (d) is (C.8). C.6 Proof of Lemma 4.2 Lemma. Let hB be defined as in (4.24) for a random approximation pFR of the form (4.6) which satisfies the result of Lemma 4.1 to some mollified approximation pfλ,N from Theorem 4.3, both using activation function σ which satisfies Assumption 4.1. Let the 134 random parameters γ1, . . . ,γR be sampled iid according to any nonzero density Pλ over the λ-expanded set Υλ, which has diameter DΥλ and largest point γλmax, and let EB be as in (4.19). Then for all pγ1, . . . ,γj , . . . ,γRq, pγ1, . . . , rγj , . . . ,γRq P ΥRλ differing only at coordinate j, either of the following hold for all j P rRs: ˇˇ hBpγ1, . . . ,γj , . . . ,γRq ´ hBpγ1, . . . , rγj , . . . ,γRqˇˇ ď iq Dσ gλmax PλminR ˆ 3` 2 |σp0q| Dσ ˙ (4.26) iiq Lσ gλmax PλminR ˆ pDΥλ ` 2γλmaxqEB ` 2 |σp0q|Lσ ˙ . (4.27) And the expected value of hB satisfies either of: E γj„Pλ ” hBpγ1, . . . ,γRq ı ď iq Dσ gλmax Pλmin ? R ˆ 1` |σp0q| Dσ ˙ (4.28) iiq Lσ gλmax Pλmin ? R ˆ γλmaxEB ` |σp0q|L ˙ . (4.29) Proof. Since pFR is assumed to satisfy Lemma 4.1, then by (C.9) it holds for all j P rRs that |ϑj | ď ˇˇˇˇ gλpγjq PλpγjqR ˇˇˇˇ ď gλmax PλminR , (C.10) where gλmax is defined in (4.25) and Pλmin :“ mintPλpγq | γ P Υλu. From the definition of hB in (4.24), and for all points differing only at coordinate j 135 as pγ1, . . . ,γj , . . . ,γRq, pγ1, . . . , rγj , . . . ,γRq P ΥRλ , we have for all j P rRs thatˇˇ hBpγ1, . . . ,γj , . . . ,γRq ´ hBpγ1, . . . , rγj , . . . ,γRqˇˇ “ ˇˇˇ ››› pfλ,N pxq ´ pFRpx ;γ1, . . . ,γj , . . . ,γRq›››B ´ ››› pfλ,N pxq ´ pFRpx ;γ1, . . . , rγj , . . . ,γRq›››B ˇˇˇ paqď ›› pfλ,N pxq ´ pFRpx ;γ1, . . . ,γj , . . . ,γRq ´ pfλ,N pxq ` pFRpx ;γ1, . . . , rγj , . . . ,γRq››B pbq“ ›››› gλprγjqPλprγjqR σprγJj Xq ´ gλpγjqPλpγjqR σpγJjXq ›››› B pcqď ˇˇˇˇ gλprγjq PλprγjqR ˇˇˇˇ ››σprγJj Xq ´ σpγJjXq››B ` ˇˇˇˇ gλpγjq PλpγjqR ´ gλprγjq PλprγjqR ˇˇˇˇ ››σpγJjXq››B pdqď gλmax PλminR ˆ››qσprγJj Xq ´ qσpγJjXq››B ` 2 ››qσpγJjXq››B ` 2 |σp0q|˙ Here, (a) is by the reverse triangle inequality (C.3), (b) is by the definition of pFR such that only ϑj differs at γj ‰ rγj , (c) is by the inequality (C.4), (d) is by noting that the θi of gλ can be negative (thus the 2), the absolute value bound (C.10), and the definition of the shifted activation qσpuq “ σpuq ´ σp0q for all u P R. If σ satisfies the overall boundedness condition of Assumption 4.1 with constant Dσ, note that |qσpuq´ qσpvq| ď Dσ holds for all u, v P I and thus |qσpuq´ qσp0q| “ |qσpuq| ď Dσ also holds for all u P I. And so we have ˇˇ hBpγ1, . . . ,γj , . . . ,γRq ´ hBpγ1, . . . , rγj , . . . ,γRqˇˇ ď Dσgλmax PλminR ˆ 3` 2 |σp0q| Dσ ˙ . Otherwise, σ satisfies the Lipschitz condition of Assumption 4.1 with constant Lσ and so |qσpuq ´ qσpvq| ď Lσ|u´ v| holds for all u, v P I and thus |qσpuq ´ qσp0q| “ |qσpuq| ď Lσ|u| also holds for all u P I. And so we have ˇˇ hBpγ1, . . . ,γj , . . . ,γRq´hBpγ1, . . . , rγj , . . . ,γRqˇˇ ď Lσgλmax PλminR ˆ››prγj ´ γjqJX››B ` 2 ››γJjX››B ` 2 |σp0q|Lσ ˙ peqď Lσgλmax PλminR ˆ pDΥλ ` 2γλmaxqEB ` 2 |σp0q|Lσ ˙ 136 where (e) is by Cauchy-Schwarz and the definitions of DΥλ, γλmax, and EB. This proves the first results (4.26) and (4.27). Next, recall from Lemma 4.1 that in order for E γj„Pλ ” pFRpx ;γ1, . . . ,γRqı “ pfλ,N pxq to hold, it must be that pFRpx ;γ1, . . . ,γRq “ 1 R Rÿ j“1 gλpγjq Pλpγjqσpγ J j zq . We define a class of functions f : Rn Ñ R using the activation function σ : RÑ R, over the values of γ P Υλ as F :“ ! fpx ;γq “ gλpγq Pλpγq σpγ J jXq ˇˇˇ γ P Υλ ) . Thus, it holds that pFRpx ;γ1, . . . ,γRq “ 1 R Rÿ j“1 fpx ;γjq with γ1, . . . ,γR sampled iid according to Pλ, and so the inequality (C.5) applies. And so, ith hB as defined in (4.24) we have E γj„Pλ “ hBpγ1, . . . ,γRq2 ‰ “ E γj„Pλ ››› pfλ,N pxq ´ pFRpx ;γ1, . . . ,γRq›››2B “ E γj„Pλ ›››› pFRpx ;γ1, . . . ,γRq ´ Eγj„Pλ ” pFRpx ;γ1, . . . ,γRqı››››2 B paqď 1 R E γ„Pλ ›››› gλpγqPλpγqσpγJjXq ››››2 B ď g 2 λmax P 2λminR E γ„Pλ ››qσpγJjXq ` σp0q››2B . (C.11) Here, (a) is the inequality (C.5). Note that b Eγ„Pλ }fpx ;γq}2B defines a norm over such randomly parameterized 137 functions. Using the square root of (C.11) then gives E γj„Pλ rhBpγ1, . . . ,γRqs “ E γj„Pλ ”a hBpγ1, . . . ,γRq2 ı paqď c E γj„Pλ rhBpγ1, . . . ,γRq2 s ď gλmax Pλmin ? R d E γ„Pλ ›››qσpγJjXq ` σp0q›››2B pbqď gλmax Pλmin ? R ˜d E γ„Pλ ›››qσpγJjXq›››2B ` |σp0q| ¸ . Here, (a) is Jensen’s inequality and (b) is the triangle inequality. If σ satisfies the overall boundedness condition of Assumption 4.1 with constant Dσ, E γj„Pλ rhBpγ1, . . . ,γRqs ď Dσ gλmax Pλmin ? R ˆ 1` |σp0q| Dσ ˙ then holds. Otherwise, σ satisfies the Lipschitz condition with constant Lσ, E γj„Pλ rhBpγ1, . . . ,γRqs ď gλmax Pλmin ? R ˜ Lσ d E γ„Pλ ›››γJjX›››2B ` |σp0q| ¸ pcqď Lσ gλmax Pλmin ? R ˆ γλmaxEB ` |σp0q|Lσ ˙ then holds, where (c) is Cauchy-Schwarz and the definitions of γλmax and EB. This proves the second results (4.28) and (4.29). C.7 Proof of Theorem 4.4 Theorem. Let hB be defined as in (4.24) and satisfy the results of Lemma 4.2, for some mollified approximation pfλ,N from Theorem 4.3, using activation function σ which satisfies Assumption 4.1. Let the random parameters γ1, . . . ,γR be sampled iid according to any nonzero density Pλ over the λ-expanded set Υλ, which has diameter DΥλ and largest point γλmax, and let EB be as in (4.19). Then for any ν P p0, 1q, with 138 probability greater than 1 ´ ν there exists a random approximation pFRpx ;γ1, . . . ,γRq of the form (4.6) with |ϑj | ď gλmaxPλminR for all j P rRs satisfying the result of Lemma 4.1, such that either of the following hold: } pfλ,N pxq ´ pFRpx ;γ1, . . . ,γRq} ď iq Dσ gλmax Pλmin ? R KDσpνq iiq Lσ gλmax Pλmin ? R KLσpνq . The coefficients KDσ and KLσ are defined in terms of the selected ν as: KDσpνq :“ 1` |σp0q|Dσ ` ˆ 3` 2 |σp0q| Dσ ˙d 1 2 log ˆ 1 ν ˙ KLσpνq :“ γλmaxEB ` |σp0q|Lσ ` ˆ pDΥλ ` 2γλmaxqEB ` 2 |σp0q|Lσ ˙d 1 2 log ˆ 1 ν ˙ . Proof. As in the proof of Lemma 4.2, we have by (C.10) that the coefficients of pFR are bounded for all j P rRs as |ϑj | ď gλmax PλminR , where gλmax is defined in (4.25) and Pλmin :“ mintPλpγq | γ P Υλu. With hB as defined in (4.24), applying McDiarmid’s inequality (C.6) gives P ” ˇˇˇ hBpγ1, . . . ,γRq ´ E “ hBpγ1, . . . ,γRq ‰ˇˇˇ ě tı ď exp˜ ´2 t2řR j“1 k2j ¸ . In Lemma 4.2, we proved the kj bounds for all j P rRs are kj “ gλmax PλminR A and the expected value bound is E “ hBpγ1, . . . ,γRq ‰ ď gλmax Pλmin ? R B 139 for coefficients A and B that depend on properties of the activation function σ and the sets B and Υλ. Therefore, we have for any t ą 0, that P « ˇˇˇˇ ˇhBpγ1, . . . ,γRq ´ gλmaxPλmin?R B ˇˇˇˇ ě t ff ď P ” ˇˇ hBpγ1, . . . ,γRq ´ E “ hBpγ1, . . . ,γRq ‰ˇˇ ě tı ď exp ¨˚ ˝ ´2 t2 R ´ gλmax PλminR A ¯2 ‹˛‚ “ expˆ´2P 2λminR t2g2λmaxA2 ˙ . (C.12) Setting the RHS of (C.12) equal to ν P p0, 1q and solving for t then gives log ˆ 1 ν ˙ “ 2P 2 λminR t 2 g2λmaxA 2 ùñ t “ gλmaxA Pλmin ? R d 1 2 log ˆ 1 ν ˙ . (C.13) Combining (C.12) and (C.13) gives the result, for any ν P p0, 1q, as the complement of P « hBpγ1, . . . ,γRq ě gλmax Pλmin ? R B ` gλmaxA Pλmin ? R d 1 2 log ˆ 1 ν ˙ ff “ P « hBpγ1, . . . ,γRq ě gλmax Pλmin ? R ˜ B `A d 1 2 log ˆ 1 ν ˙ ¸ff iq “ P « hBpγ1, . . . ,γRq ě Dσ gλmax Pλmin ? R KDσpνq ff ď ν iiq “ P « hBpγ1, . . . ,γRq ě Lσ gλmax Pλmin ? R KLσpνq ff ď ν where iq and iiq correspond to which boundedness property of Assumption 4.1 the activation function σ satisfies. And so, we have the coefficients KDσ and KLσ defined in terms of the selected ν, using either (4.26),(4.28) or (4.27),(4.29) as: KDσpνq :“ 1` |σp0q|Dσ ` ˆ 3` 2 |σp0q| Dσ ˙d 1 2 log ˆ 1 ν ˙ KLσpνq :“ γλmaxEB ` |σp0q|Lσ ` ˆ pDΥλ ` 2γλmaxqEB ` 2 |σp0q|Lσ ˙d 1 2 log ˆ 1 ν ˙ . 140 C.8 Proof of Theorem 4.5 Theorem. Let there exist a base approximation pfN of the form (4.5), for some integer N P N, to a target function f P CpBq, with unknown parameters γ1, . . . ,γN in a known bounded parameter set Υ Ă Rn`1, unknown coefficients θ P RN with a known bound on the coefficient size řN i“1 |θi| ď SN , and using an activation function σ satisfying Assumption 4.1. Assume that ››› qfpxq ´ pfN pxq›››B ď εN holds for some εN ą 0 which may depend on N , for an affine shift qfpxq “ f´aJx´bfp0q with some a P Rn and b P R. The mollified approximation pfλ,N as in (4.17) satisfies the results of Theorem 4.3 for any mollification factor λ ą 0. Let the random parameters γ1, . . . ,γR be sampled iid according to any nonzero density Pλ over the λ-expanded set Υλ, which has diameter DΥλ and farthest point γλmax, and let EB be as in (4.19). Then for any ν P p0, 1q, with probability greater than 1´ν there exists a random approximation pFRpx ;γ1, . . . ,γRq as in (4.6) with |ϑj | ď gλmaxPλminR for all j P rRs that satisfies Theorem 4.4 such that either of the following hold:››› qfpxq ´ pFRpx ;γ1, . . . ,γRq›››B ď iq εN ` SN Dσ ˆ 1 ` λ n`1 ? R rKDσpνq˙ iiq εN ` SNLσ ˆ? n` 1 λ EB ` λ n`1 ? R rKLσpνq˙ . The coefficients rKLσ are defined in terms of the selected ν as: rKDσpνq :“ ηn`1en`1Pλmin KDσpνq and rKLσpνq :“ η n`1 en`1Pλmin KLσpνq with KDσ and KLσ as defined in Theorem 4.4. Proof. As in the proof of Lemma 4.2, we have by (C.10) that the coefficients of pFR are bounded for all j P rRs as |ϑj | ď gλmax PλminR , 141 where gλmax is defined in (4.25) and Pλmin :“ mintPλpγq | γ P Υλu. Applying the triangle inequality, we then have››› qfpxq ´ pFRpx ;γ1, . . . ,γRq›››B ď ››› qfpxq ´ pfN pxq›››B ` ››› pfN pxq ´ pfλ,N pxq›››B ` ››› pfλ,N pxq ´ pFRpx ;γ1, . . . ,γRq›››B . The first term is assumed bounded by εN for the base approximation pfN , which are discussed in Section 4.2. The second term is bounded by Theorem 4.3 and is discussed in Section 4.3.2, and the third term is bounded by Theorem 4.4 and is discussed in Section 4.4. We pull out front the activation function constant Dσ or Lσ and the base approx- imation coefficient size bound řN i“1 |θi| ď SN , since they are common to the last two terms. This follows for the last term because gλmax PλminR “ Nÿ i“1 |θi| ˆ ηλ e ˙n`1 1 PλminR ď SN λ n`1 R ηn`1 en`1 Pλmin , and so the remaining η n`1 en`1Pλmin is added to the existing coefficient in terms of ν. C.9 Proof of Lemma 4.3 Lemma. Let f : Rn Ñ R be L-Lipschitz on a convex compact set K Ă Rn containing the origin and differentiable at the origin. Let there also be an approximation pf : Rn Ñ R which is pL-Lipschitz on K and such that the approximation error rfpxq “ qfpxq ´ pfpxq, with qf as in (4.31), satisfies the L2pK, µ¯Kq norm bound ››› rf ››› K “ dż K ˇˇˇ rfpxqˇˇˇ2 µ¯Kpdxq ď εN where the bound εN ą 0 is a finite constant and µ¯Kpdxq is the uniform probability measure on K such that şK µ¯Kpdxq “ 1. 142 Assume K is full dimension with diameter D ą 0, and so it holds that K Ď B0pρq for a sufficiently large radius 0 ă ρ ď D. Then, the approximation error also satisfies the L8pKq norm bound sup xPK ˇˇˇ rfpxqˇˇˇ ď 2 ˆ rDpr `?r2 `D2q ˙´n n`2 ´´ L` }∇fp0nq}2 ` pL¯n ε 2N¯ 1n`2 (4.32) such that r is the largest radius ball within K centered at its centroid. Proof. The Lipschitz constants of qf and pf are respectively qL “ L ` }∇fp0nq}2 and pL. Then, the approximation error function rf “ qf ´ pf is also Lipschitz, satisfyingˇˇˇ rfpxq ´ rfpyqˇˇˇ “ ˇˇˇ qfpxq ´ pfpxq ´ qfpyq ` pfpyqˇˇˇ ď ˇˇˇ qfpxq ´ qfpyqˇˇˇ` ˇˇˇ pfpxq ´ pfpyqˇˇˇ ď p qL` pL q }x´ y}2 for all x, y P K. And so we define rL :“ qL` pL as the Lipschitz constant. We note here that µ ` Bxpρq ˘ “ ρn pi n2 Γpn2 ` 1q is the Lebesgue measure (volume) of the n-dimensional ball of radius ρ centered at any x P Rn. Let us define for any h ě 0 the set Ωh :“ ! x P K ˇˇˇ ˇˇ rfpxq| ě h) and note that the following therefore holds: ε2N ě ż K | rfpxq|2 µ¯Kpdxq ě ż Ωh | rfpxq|2 µ¯Kpdxq ě ż Ωh h2 µ¯Kpdxq “ h2µ¯KpΩhq ùñ µ¯KpΩhq ď ε 2 N h2 . Now, let there be a px P K satisfying | rfppxq| “ h for some finite h ą εN , and let us define the set Sd :“ " x P K ˇˇˇ }px´ x}2 ď h d rL * “ KXBpx ˆ h d rL ˙ 143 for any d ě mint1, hrLDu, which respectively satisfies }px´ x}2 ď hrL and }px´ x}2 ď D. We then get that | rfpxq| “ | rfppxq ´ rfppxq ` rfpxq| ě | rfppxq| ´ | rfppxq ´ rfpxq| ě | rfppxq| ´ rL }px´ x}2 ě h´ rL h d rL “ d´ 1d h for all x P Sd, by using the reverse triangle inequality. If px is such that Sd is entirely within the interior of K, then the intersection is the entire ball and we thus have µ¯KpSdq “ µ¯K ˆ Bpx ˆ h d rL ˙˙ “ 1V ˆ h d rL ˙n pi n2 Γpn2 ` 1q . As px moves toward a flat surface boundary of K, then only half of the ball remains in the intersection, which is the limit in the n “ 1 case. But in higher dimensions n ě 2, the compact convex geometry of K means reductions of more than half are possible. Figure C.1 uses an n “ 2 example to illustrate visually of how px being located on a convex curved surface or convex vertex of intersecting lines gives further reductions compared to the flat surface. 144 Kpx Bpxp hd rLq Sd ě Kpx Bpxp hd rLq Sd ě Kpx Bpxp hd rLq Sd Figure C.1: The intersection KX Bpx´ hd rL¯ in green, for possible convex geometrical features on the boundary of K where the center px can lie. And so, there must exist at least one minimizing boundary location satisfying µ¯KpSdq‹ :“ min xPBK µ¯KpSdq “ PK V ˆ h d rL ˙n pi n2 Γpn2 ` 1q for some 0 ă PK ă 12 , since V ą 0 and the boundary of K must contain convex curved surfaces and/or vertices. The exact portion PK of the ball contained within µ¯KpSdq‹ may not be practical to calculate for a given K, but we can always lower bound it in the following manner (based on methods in [39] Proof of Lemma 16, Case 1). Denote cK as the centroid of K and let r be the radius of the largest ball centered at cK contained within K. Then define ` ě r as the distance from the centroid to the closest minimizing boundary location x‹, and align a coordinate system at cK such that 145 x‹ is located at position “´ ` 0 ¨ ¨ ¨ 0‰J. Then x P K satisfying ´` ď x1 ď 0 and gffe nÿ i“2 x2i ď r ` r ` x1 defines a conic section which is entirely within K. If we then define ϑ :“ arctan ´ r ` ¯ and d ` d sinpϑq “ h d rL , this allows a ball of radius d sinpϑq to be inscribed within the conic section at location“´ `` d 0 ¨ ¨ ¨ 0‰J. Using the above definitions, we have that this radius satisfies d sinpϑq “ h d rL 1` sinpϑq sinpϑq “ h d rL 1` r? r2``2 r? r2 ` `2 “ r r `?r2 ` `2 h d rL “: %K hd rL and since the centroid is within the interior of K, the factor % must satisfy r r `?r2 `D2 ă %K ă r r `?r2 ` r2 “ 1 1`?2 . Figure C.2 illustrates an example of this method in n “ 2, with the inscribed ball outlined in red. x‹ Bx‹p h d rLq Sd r ` ϑ cK K d Figure C.2: The largest ball which always lower bounds Sd, centered at d with radius d sinpϑq. Thus, as px moves toward such a minimizing boundary point of K, we get the lower bound µ¯KpSdq ě µ¯KpSdq‹ ě % n K V ˆ h d rL n˙ pi n 2 Γpn2 ` 1q “: ρK,nV ˆ h d rL n˙ . 146 On the other hand, there may be x P K more than h d rL away from any px where it still holds that | rfpxq| ě d´1d h. Thus, Sd Ď Ω d´1 d h ùñ µ¯KpSdq ď µ¯K ´ Ω d´1 d h ¯ ď d 2ε2N pd´ 1q2h2 . Combining these relations gives that ρK,n V ˆ h d rL ˙n ď µ¯KpSdq ď d 2ε2N pd´ 1q2h2 ùñ h ď d pd´ 1q 2n`2 ˜ V rLnε2N ρK,n ¸ 1 n`2 must hold for any d ě 1. Then since d dd d pd´ 1q 2n`2 “ npd´ 1q ´ 2 pn` 2qpd´ 1qn`4n`2 “ 0 ùñ d “ n` 2 n , we have the smallest upper bound on h as h ď n`2 n 2 n 1 n`2 ˜ V rLnε2N ρK,n ¸ 1 n`2 ď 2 pρK,nq ´1n`2 ´ V rLnε2N¯ 1n`2 . By definition, we have pρK,nq ´1n`2 “ ˜ %nK pi n 2 Γpn2 ` 1q ¸ ´1 n`2 “ ˆ r r `?r2 ` `2 ˙´n n`2 ˜ pi n 2 Γpn2 ` 1q ¸ ´1 n`2 and since K Ď B0pρq for sufficiently large ρ, then V ď ρn pi n 2 Γpn2 ` 1q . Thus, pρK,nq ´1n`2 V 1n`2 ď ˆ r r `?r2 `D2 ˙´n n`2 D nn`2 .