Author: Rosenthal J.S.  

Tags: mathematics, probability theory

ISBN: 978-981-270-370-5

Year: 2006

A FIRST LOOK AT RIGOROUS PROBABILITY THEORY
Second Edition


To my wonderful wife, Margaret Fulford: always supportive, always encouraging.
Preface to the First Edition.

This text grew out of my lecturing the Graduate Probability sequence STA 2111F / 2211S at the University of Toronto over a period of several years. During this time, it became clear to me that there are a large number of graduate students from a variety of departments (mathematics, statistics, economics, management, finance, computer science, engineering, etc.) who require a working knowledge of rigorous probability, but whose mathematical background may be insufficient to dive straight into advanced texts on the subject. This text is intended to answer that need. It provides an introduction to rigorous (i.e., mathematically precise) probability theory using measure theory. At the same time, I have tried to make it brief and to the point, and as accessible as possible. In particular, probabilistic language and perspective are used throughout, with necessary measure theory introduced only as needed.

I have tried to strike an appropriate balance between rigorously covering the subject and avoiding unnecessary detail. The text provides mathematically complete proofs of all of the essential introductory results of probability theory and measure theory. However, more advanced and specialised areas are ignored entirely or only briefly hinted at. For example, the text includes a complete proof of the classical Central Limit Theorem, including the necessary Continuity Theorem for characteristic functions; however, the Lindeberg Central Limit Theorem and Martingale Central Limit Theorem are only briefly sketched and are not proved. Similarly, all necessary facts from measure theory are proved before they are used, but more abstract and advanced measure theory results are not included. Furthermore, the measure theory is almost always discussed purely in terms of probability, as opposed to being treated as a separate subject which must be mastered before probability theory can be studied.

I hesitated to bring these notes to publication. There are many other books available which treat probability theory with measure theory, and some of them are excellent. For a partial list see Subsection B.3. (Indeed, the book by Billingsley was the textbook from which I taught before I started writing these notes. While much has changed since then, the knowledgeable reader will still notice Billingsley's influence in the treatment of many topics herein. The Billingsley book remains one of the best sources for a complete, advanced, and technically precise treatment of probability theory with measure theory.) In terms of content, therefore, the current text adds very little indeed to what has already been written. It was only the reaction of certain students, who found the subject easier to learn from my notes than from longer, more advanced, and more all-inclusive books, that convinced me to go ahead and publish. The reader is urged to consult
other books for further study and additional detail.

There are also many books available (see Subsection B.2) which treat probability theory at the undergraduate, less rigorous level, without the use of general measure theory. Such texts provide intuitive notions of probabilities, random variables, etc., but without mathematical precision. In this text it will generally be assumed, for purposes of intuition, that the student has at least a passing familiarity with probability theory at this level. Indeed, Section 1 of the text attempts to link such intuition with the mathematical precision to come. However, mathematically speaking we will not require many results from undergraduate-level probability theory.

Structure. The first six sections of this book could be considered to form a "core" of essential material. After learning them, the student will have a precise mathematical understanding of probabilities and $\sigma$-algebras; random variables, distributions, and expected values; and inequalities and laws of large numbers. Sections 7 and 8 then diverge into the theory of gambling games and Markov chain theory. Section 9 provides a bridge to the more advanced topics of Sections 10 through 14, including weak convergence, characteristic functions, the Central Limit Theorem, Lebesgue Decomposition, conditioning, and martingales. The final section, Section 15, provides a wide-ranging and somewhat less rigorous introduction to the subject of general stochastic processes. It leads up to diffusions, Ito's Lemma, and finally a brief look at the famous Black-Scholes equation from mathematical finance. It is hoped that this final section will inspire readers to learn more about various aspects of stochastic processes.

Appendix A contains basic facts from elementary mathematics. This appendix can be used for review and to gauge the book's level. In addition, the text makes frequent reference to Appendix A, especially in the earlier sections, to ease the transition to the required mathematical level for the subject. It is hoped that readers can use familiar topics from Appendix A as a springboard to less familiar topics in the text. Finally, Appendix B lists a variety of references, for background and for further reading.

Exercises. The text contains a number of exercises. Those very closely related to textual material are inserted at the appropriate place. Additional exercises are found at the end of each section, in a separate subsection. I have tried to make the exercises thought-provoking without being too difficult. Hints are provided where appropriate. Rather than always asking for computations or proofs, the exercises sometimes ask for explanations and/or examples, to hopefully clarify the subject matter in the student's mind.

Prerequisites. As a prerequisite to reading this text, the student should
have a solid background in basic undergraduate-level real analysis (not including measure theory). In particular, the mathematical background summarised in Appendix A should be very familiar. If it is not, then books such as those in Subsection B.1 should be studied first. It is also helpful, but not essential, to have seen some undergraduate-level probability theory at the level of the books in Subsection B.2.

Further reading. For further reading beyond this text, the reader should examine the similar but more advanced books of Subsection B.3. To learn additional topics, the reader should consult the books on pure measure theory of Subsection B.4, and/or the advanced books on stochastic processes of Subsection B.5, and/or the books on mathematical finance of Subsection B.6. I would be content to learn only that this text has inspired students to look at more advanced treatments of the subject.

Acknowledgements. I would like to thank several colleagues for encouraging me in this direction, in particular Mike Evans, Andrey Feuerverger, Keith Knight, Omiros Papaspiliopoulos, Jeremy Quastel, Nancy Reid, and Gareth Roberts. Most importantly, I would like to thank the many students who have studied these topics with me; their questions, insights, and difficulties have been my main source of inspiration.

Jeffrey S. Rosenthal
Toronto, Canada, 2000
jeff@math.toronto.edu
http://probability.ca/jeff/

Second Printing (2003). For the second printing, a number of minor errors have been corrected. Thanks to Tom Baird, Meng Du, Avery Fullerton, Longhai Li, Hadas Moshonov, Nataliya Portman, and Idan Regev for helping to find them.

Third Printing (2005). A few more minor errors were corrected, with thanks to Samuel Hikspoors, Bin Li, Mahdi Lotfinezhad, Ben Reason, Jay Sheldon, and Zemei Yang.
Preface to the Second Edition.

I am pleased to have the opportunity to publish a second edition of this book. The book's basic structure and content are unchanged; in particular, the emphasis on establishing probability theory's rigorous mathematical foundations, while minimising technicalities as much as possible, remains paramount. However, having taught from this book for several years, I have made considerable revisions and improvements. For example:

• Many small additional topics have been added, and existing topics expanded. As a result, the second edition is over forty pages longer than the first.

• Many new exercises have been added, and some of the existing exercises have been improved or "cleaned up". There are now about 275 exercises in total (as compared with 150 in the first edition), ranging in difficulty from quite easy to fairly challenging, many with hints provided.

• Further details and explanations have been added in steps of proofs which previously caused confusion.

• Several of the longer proofs are now broken up into a number of lemmas, to more easily keep track of the different steps involved, and to allow for the possibility of skipping the most technical bits while retaining the proof's overall structure.

• A few proofs, which are required for mathematical completeness but which require advanced mathematics background and/or add little understanding, are now marked as "optional".

• Various interesting, but technical and inessential, results are presented as remarks or footnotes, to add information and context without interrupting the text's flow.

• The Extension Theorem now allows the original set function to be defined on a semialgebra rather than an algebra, thus simplifying its application and increasing understanding.

• Many minor edits and rewrites were made throughout the book to improve the clarity, accuracy, and readability.

I thank Ying Oi Chiew and Lai Fun Kwong of World Scientific for facilitating this edition, and thank Richard Dudley, Eung Jun Lee, Neal Madras, Peter Rosenthal, Hermann Thorisson, and Balint Virag for helpful comments. Also, I again thank the many students who have studied and discussed these topics with me over many years.

Jeffrey S. Rosenthal
Toronto, Canada, 2006

Second Printing (2007). A few very minor corrections were made, with thanks to Joe Blitzstein and Emil Zeuthen.
Contents

Preface to the First Edition
Preface to the Second Edition

1 The need for measure theory
1.1 Various kinds of random variables
1.2 The uniform distribution and non-measurable sets
1.3 Exercises
1.4 Section summary

2 Probability triples
2.1 Basic definition
2.2 Constructing probability triples
2.3 The Extension Theorem
2.4 Constructing the Uniform[0,1] distribution
2.5 Extensions of the Extension Theorem
2.6 Coin tossing and other measures
2.7 Exercises
2.8 Section summary

3 Further probabilistic foundations
3.1 Random variables
3.2 Independence
3.3 Continuity of probabilities
3.4 Limit events
3.5 Tail fields
3.6 Exercises
3.7 Section summary

4 Expected values
4.1 Simple random variables
4.2 General non-negative random variables
4.3 Arbitrary random variables
4.4 The integration connection
4.5 Exercises
4.6 Section summary

5 Inequalities and convergence
5.1 Various inequalities
5.2 Convergence of random variables
5.3 Laws of large numbers
5.4 Eliminating the moment conditions
5.5 Exercises
5.6 Section summary

6 Distributions of random variables
6.1 Change of variable theorem
6.2 Examples of distributions
6.3 Exercises
6.4 Section summary

7 Stochastic processes and gambling games
7.1 A first existence theorem
7.2 Gambling and gambler's ruin
7.3 Gambling policies
7.4 Exercises
7.5 Section summary

8 Discrete Markov chains
8.1 A Markov chain existence theorem
8.2 Transience, recurrence, and irreducibility
8.3 Stationary distributions and convergence
8.4 Existence of stationary distributions
8.5 Exercises
8.6 Section summary

9 More probability theorems
9.1 Limit theorems
9.2 Differentiation of expectation
9.3 Moment generating functions and large deviations
9.4 Fubini's Theorem and convolution
9.5 Exercises
9.6 Section summary

10 Weak convergence
10.1 Equivalences of weak convergence
10.2 Connections to other convergence
10.3 Exercises
10.4 Section summary

11 Characteristic functions
11.1 The continuity theorem
11.2 The Central Limit Theorem
11.3 Generalisations of the Central Limit Theorem
11.4 Method of moments
11.5 Exercises
11.6 Section summary

12 Decomposition of probability laws
12.1 Lebesgue and Hahn decompositions
12.2 Decomposition with general measures
12.3 Exercises
12.4 Section summary

13 Conditional probability and expectation
13.1 Conditioning on a random variable
13.2 Conditioning on a sub-σ-algebra
13.3 Conditional variance
13.4 Exercises
13.5 Section summary

14 Martingales
14.1 Stopping times
14.2 Martingale convergence
14.3 Maximal inequality
14.4 Exercises
14.5 Section summary

15 General stochastic processes
15.1 Kolmogorov Existence Theorem
15.2 Markov chains on general state spaces
15.3 Continuous-time Markov processes
15.4 Brownian motion as a limit
15.5 Existence of Brownian motion
15.6 Diffusions and stochastic integrals
15.7 Ito's Lemma
15.8 The Black-Scholes equation
15.9 Section summary

A Mathematical Background
A.1 Sets and functions
A.2 Countable sets
A.3 Epsilons and Limits
A.4 Infimums and supremums
A.5 Equivalence relations

B Bibliography
B.1 Background in real analysis
B.2 Undergraduate-level probability
B.3 Graduate-level probability
B.4 Pure measure theory
B.5 Stochastic processes
B.6 Mathematical finance

Index
1. The need for measure theory.

This introductory section is directed primarily to those readers who have some familiarity with undergraduate-level probability theory, and who may be unclear as to why it is necessary to introduce measure theory and other mathematical difficulties in order to study probability theory in a rigorous manner. We attempt to illustrate the limitations of undergraduate-level probability theory in two ways: the restrictions on the kinds of random variables it allows, and the question of what sets can have probabilities defined on them.

1.1. Various kinds of random variables.

The reader familiar with undergraduate-level probability will be comfortable with a statement like, "Let X be a random variable which has the Poisson(5) distribution." The reader will know that this means that X takes as its value a "random" non-negative integer, such that the integer $k \ge 0$ is chosen with probability $P(X = k) = e^{-5} 5^k / k!$. The expected value of, say, $X^2$, can then be computed as $E(X^2) = \sum_{k=0}^{\infty} k^2 e^{-5} 5^k / k!$. X is an example of a discrete random variable.

Similarly, the reader will be familiar with a statement like, "Let Y be a random variable which has the Normal(0,1) distribution." This means that the probability that Y lies between two real numbers $a < b$ is given by the integral $P(a \le Y \le b) = \int_a^b \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \, dy$. (On the other hand, $P(Y = y) = 0$ for any particular real number y.) The expected value of, say, $Y^2$, can then be computed as $E(Y^2) = \int_{-\infty}^{\infty} y^2 \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \, dy$. Y is an example of an absolutely continuous random variable.

But now suppose we introduce a new random variable Z, as follows. We let X and Y be as above, and then flip an (independent) fair coin. If the coin comes up heads we set Z = X, while if it comes up tails we set Z = Y. In symbols, $P(Z = X) = P(Z = Y) = 1/2$. Then what sort of random variable is Z? It is not discrete, since it can take on an uncountable number of different values. But it is not absolutely continuous, since for certain values z (specifically, when z is a non-negative integer) we have $P(Z = z) > 0$. So how can we study the random variable Z? How could we compute, say, the expected value of $Z^2$?

The correct response to this question, of course, is that the division of random variables into discrete versus absolutely continuous is artificial. Instead, measure theory allows us to give a common definition of expected value, which applies equally well to discrete random variables (like X above), to continuous random variables (like Y above), to combinations of them (like Z above), and to other kinds of random variables not yet imagined. These issues are considered in Sections 4, 6, and 12.
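The mixture Z can be made concrete by simulation. The following is a minimal sketch of ours (not from the text; it assumes the numpy library), estimating $E(Z)$ and $E(Z^2)$ by Monte Carlo and comparing with the values that the measure-theoretic definition of expectation will eventually justify: $E(Z^2) = \frac{1}{2}E(X^2) + \frac{1}{2}E(Y^2) = \frac{1}{2}(5 + 25) + \frac{1}{2}(1) = 15.5$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

# Flip a fair coin for each sample: heads -> Z = X ~ Poisson(5),
# tails -> Z = Y ~ Normal(0,1).
heads = rng.integers(0, 2, size=n).astype(bool)
x = rng.poisson(5, size=n)
y = rng.normal(0, 1, size=n)
z = np.where(heads, x, y)

# E(X^2) = Var(X) + (E X)^2 = 5 + 25 = 30, and E(Y^2) = 1,
# so E(Z^2) = (30 + 1)/2 = 15.5; similarly E(Z) = (5 + 0)/2 = 2.5.
print(z.mean(), (z * z).mean())  # approx 2.5 and 15.5
```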
1.2. The uniform distribution and non-measurable sets.

In undergraduate-level probability, continuous random variables are often studied in detail. However, a closer examination suggests that perhaps such random variables are not completely understood after all. To take the simplest case, suppose that X is a random variable which has the uniform distribution on the unit interval [0,1]. In symbols, X ~ Uniform[0,1]. What precisely does this mean?

Well, certainly this means that $P(0 \le X \le 1) = 1$. It also means that $P(0 \le X \le 1/2) = 1/2$, that $P(3/4 \le X \le 7/8) = 1/8$, etc., and in general that $P(a \le X \le b) = b - a$ whenever $0 \le a \le b \le 1$, with the same formula holding if $\le$ is replaced by $<$. We can write this as

$$P([a,b]) = P((a,b]) = P([a,b)) = P((a,b)) = b - a, \qquad 0 \le a \le b \le 1. \qquad (1.2.1)$$

In words, the probability that X lies in any interval contained in [0,1] is simply the length of the interval. (We include in this the degenerate case when $a = b$, so that $P(\{a\}) = 0$ for the singleton set $\{a\}$; in words, the probability that X is equal to any particular number a is zero.)

Similarly, this means that

$$P(1/4 \le X \le 1/2 \text{ or } 2/3 \le X \le 5/6) = P(1/4 \le X \le 1/2) + P(2/3 \le X \le 5/6) = 1/4 + 1/6 = 5/12,$$

and in general that if A and B are disjoint subsets of [0,1] (for example, if A = [1/4, 1/2] and B = [2/3, 5/6]), then

$$P(A \cup B) = P(A) + P(B). \qquad (1.2.2)$$

Equation (1.2.2) is called finite additivity. Indeed, to allow for countable operations (such as limits, which are extremely important in probability theory), we would like to extend (1.2.2) to the case of a countably infinite number of disjoint subsets: if $A_1, A_2, A_3, \ldots$ are disjoint subsets of [0,1], then

$$P(A_1 \cup A_2 \cup A_3 \cup \cdots) = P(A_1) + P(A_2) + P(A_3) + \cdots. \qquad (1.2.3)$$

Equation (1.2.3) is called countable additivity.

Note that we do not extend equation (1.2.3) to uncountable additivity. Indeed, if we did, then we would expect that $P([0,1]) = \sum_{x \in [0,1]} P(\{x\})$,
which is clearly false since the left-hand side equals 1 while the right-hand side equals 0. (There is no contradiction to (1.2.3) since the interval [0,1] is not countable.) It is for this reason that we restrict attention to countable operations. (For a review of countable and uncountable sets, see Subsection A.2. Also, recall that for non-negative uncountable collections $\{r_\alpha\}_{\alpha \in I}$, the sum $\sum_{\alpha \in I} r_\alpha$ is defined to be the supremum of $\sum_{\alpha \in J} r_\alpha$ over finite $J \subseteq I$.)

Similarly, to reflect the fact that X is "uniform" on the interval [0,1], the probability that X lies in some subset should be unaffected by "shifting" (with wrap-around) the subset by a fixed amount. That is, if for each subset $A \subseteq [0,1]$ we define the r-shift of A by

$$A \oplus r = \{a + r : a \in A,\ a + r \le 1\} \cup \{a + r - 1 : a \in A,\ a + r > 1\}, \qquad (1.2.4)$$

then we should have

$$P(A \oplus r) = P(A), \qquad 0 \le r \le 1. \qquad (1.2.5)$$

So far so good. But now suppose we ask, what is the probability that X is rational? What is the probability that $X^n$ is rational for some positive integer n? What is the probability that X is algebraic, i.e. the solution to some polynomial equation with integer coefficients? Can we compute these things? More fundamentally, are all probabilities such as these necessarily even defined? That is, does P(A) (i.e., the probability that X lies in the subset A) even make sense for every possible subset $A \subseteq [0,1]$?

It turns out that the answer to this last question is no, as the following proposition shows. The proof requires equivalence relations, but can be skipped if desired since the result is not used elsewhere in this book.

Proposition 1.2.6. There does not exist a definition of P(A), defined for all subsets $A \subseteq [0,1]$, satisfying (1.2.1) and (1.2.3) and (1.2.5).

Proof (optional). Suppose, to the contrary, that P(A) could be so defined for each subset $A \subseteq [0,1]$. We will derive a contradiction to this.

Define an equivalence relation (see Subsection A.5) on [0,1] by: $x \sim y$ if and only if the difference $y - x$ is rational. This relation partitions the interval [0,1] into a disjoint union of equivalence classes. Let H be a subset of [0,1] consisting of precisely one element from each equivalence class (such H must exist by the Axiom of Choice; see Subsection A.2). For definiteness, assume that $0 \notin H$ (say, if $0 \in H$, then replace it by 1/2).

Now, since H contains an element of each equivalence class, we see that each point in (0,1] is contained in the union $\bigcup_{r \in [0,1),\ r \text{ rational}} (H \oplus r)$ of shifts of H. Furthermore, since H contains just one point from each equivalence class, we see that these sets $H \oplus r$, for rational $r \in [0,1)$, are all disjoint.
But then, by countable additivity (1.2.3), we have

$$P((0,1]) = \sum_{r \in [0,1),\ r \text{ rational}} P(H \oplus r).$$

Shift-invariance (1.2.5) implies that $P(H \oplus r) = P(H)$, whence

$$1 = P((0,1]) = \sum_{r \in [0,1),\ r \text{ rational}} P(H).$$

This leads to the desired contradiction: a countably infinite sum of the same non-negative quantity repeated can only equal 0 (if $P(H) = 0$) or $\infty$ (if $P(H) > 0$), but it can never equal 1. ∎

This proposition says that if we want our probabilities to satisfy reasonable properties, then we cannot define them for all possible subsets of [0,1]. Rather, we must restrict their definition to certain "measurable" sets. This is the motivation for the next section.

Remark. The existence of problematic sets like H above turns out to be equivalent to the Axiom of Choice. In particular, we can never define such sets explicitly; they arise only implicitly via the Axiom of Choice as in the above proof. (In fact, assuming the Continuum Hypothesis, Proposition 1.2.6 continues to hold if we require only (1.2.3) and that $0 < P([0,1]) < \infty$ and $P(\{x\}) = 0$ for all x; see e.g. Billingsley (1995, p. 46).)

1.3. Exercises.

Exercise 1.3.1. Suppose that $\Omega = \{1,2\}$, with $P(\emptyset) = 0$ and $P(\{1,2\}) = 1$. Suppose $P(\{1\}) = 1/3$. Prove that P is countably additive if and only if $P(\{2\}) = 2/3$.

Exercise 1.3.2. Suppose $\Omega = \{1,2,3\}$ and $\mathcal{F}$ is the collection of all subsets of $\Omega$. Find (with proof) necessary and sufficient conditions on the real numbers x, y, and z, such that there exists a countably additive probability measure P on $\mathcal{F}$, with $x = P(\{1,2\})$, $y = P(\{2,3\})$, and $z = P(\{1,3\})$.

Exercise 1.3.3. Suppose that $\Omega = \mathbf{N}$ is the set of positive integers, and P is defined for all $A \subseteq \Omega$ by P(A) = 0 if A is finite, and P(A) = 1 if A is infinite. Is P finitely additive?

Exercise 1.3.4. Suppose that $\Omega = \mathbf{N}$, and P is defined for all $A \subseteq \Omega$ by $P(A) = |A|$ if A is finite (where |A| is the number of elements in
the subset A), and $P(A) = \infty$ if A is infinite. This P is of course not a probability measure (in fact it is counting measure), however we can still ask the following. (By convention, $\infty + \infty = \infty$.)
(a) Is P finitely additive?
(b) Is P countably additive?

Exercise 1.3.5.
(a) In what step of the proof of Proposition 1.2.6 was (1.2.1) used?
(b) Give an example of a countably additive set function P, defined on all subsets of [0,1], which satisfies (1.2.3) and (1.2.5), but not (1.2.1).

1.4. Section summary.

In this section, we have discussed why measure theory is necessary to develop a mathematically rigorous theory of probability. We have discussed basic properties of probability measures, such as additivity. We have considered the possibility of random variables which are neither absolutely continuous nor discrete, and therefore do not fit easily into undergraduate-level understanding of probability. Finally, we have proved that, for the uniform distribution on [0,1], it will not be possible to define a probability on every single subset.
2. Probability triples.

In this section we consider probability triples and how to construct them. In light of the previous section, we see that to study probability theory properly it will be necessary to keep track of which subsets A have a probability P(A) defined for them.

2.1. Basic definition.

We define a probability triple or (probability) measure space or probability space to be a triple $(\Omega, \mathcal{F}, P)$, where:

• the sample space $\Omega$ is any non-empty set (e.g. $\Omega = [0,1]$ for the uniform distribution considered above);

• the $\sigma$-algebra (read "sigma-algebra") or $\sigma$-field (read "sigma-field") $\mathcal{F}$ is a collection of subsets of $\Omega$, containing $\Omega$ itself and the empty set $\emptyset$, and closed under the formation of complements and countable unions and countable intersections (e.g. for the uniform distribution considered above, $\mathcal{F}$ would certainly contain all the intervals [a,b], but would contain many more subsets besides);

• the probability measure P is a mapping from $\mathcal{F}$ to [0,1], with $P(\emptyset) = 0$ and $P(\Omega) = 1$, such that P is countably additive as in (1.2.3).

This definition will be in constant use throughout the text. Furthermore it contains a number of subtle points. Thus, we pause to make a few additional observations.

The $\sigma$-algebra $\mathcal{F}$ is the collection of all events or measurable sets. These are the subsets $A \subseteq \Omega$ for which P(A) is well-defined. We know from Proposition 1.2.6 that in general $\mathcal{F}$ might not contain all subsets of $\Omega$, though we still expect it to contain most of the subsets that come up naturally. (For the definitions of complements, unions, intersections, etc., see Subsection A.1.)

To say that $\mathcal{F}$ is closed under the formation of complements and countable unions and countable intersections means, more precisely, that:
(i) for any subset $A \subseteq \Omega$, if $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$;
(ii) for any countable (or finite) collection of subsets $A_1, A_2, A_3, \ldots \subseteq \Omega$, if $A_i \in \mathcal{F}$ for each i, then the union $A_1 \cup A_2 \cup A_3 \cup \cdots \in \mathcal{F}$;
(iii) for any countable (or finite) collection of subsets $A_1, A_2, A_3, \ldots \subseteq \Omega$, if $A_i \in \mathcal{F}$ for each i, then the intersection $A_1 \cap A_2 \cap A_3 \cap \cdots \in \mathcal{F}$.

As for countable additivity, the reason we require $\mathcal{F}$ to be closed under countable operations is to allow for taking limits, etc., when studying probability theory. Also as for additivity, we cannot extend the definition to require that $\mathcal{F}$ be closed under uncountable unions; in that case, for the example of Subsection 1.2 above, $\mathcal{F}$ would contain every subset A, since every subset can be written as $A = \bigcup_{x \in A} \{x\}$ and since the singleton sets $\{x\}$ are all in $\mathcal{F}$.
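To make the closure conditions concrete, here is a small sketch of ours (a toy four-point space, not an example from the text). For a finite collection of subsets, checking complements and pairwise unions is enough, since all countable set operations then reduce to finite ones:

```python
omega = frozenset({1, 2, 3, 4})

# A candidate collection of events (small enough to also check by hand):
F = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), omega}

def is_sigma_algebra(F, omega):
    # For a finite collection, closure under complements and pairwise
    # unions already gives closure under all countable set operations.
    has_basics = omega in F and frozenset() in F
    closed_complement = all(omega - A in F for A in F)
    closed_union = all(A | B in F for A in F for B in F)
    return has_basics and closed_complement and closed_union

print(is_sigma_algebra(F, omega))  # True

# A probability measure P on F: P(empty) = 0, P(omega) = 1, and
# additivity on disjoint events (0.25 and 0.75 are exact in binary):
P = {frozenset(): 0.0, frozenset({1, 2}): 0.25, frozenset({3, 4}): 0.75, omega: 1.0}
assert P[frozenset({1, 2}) | frozenset({3, 4})] == P[frozenset({1, 2})] + P[frozenset({3, 4})]
```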
There is some redundancy in the definition above. For example, it follows from de Morgan's Laws (see Subsection A.1) that if $\mathcal{F}$ is closed under complement and countable unions, then it is automatically closed under countable intersections. Similarly, it follows from countable additivity that we must have $P(\emptyset) = 0$, and that (once we know that $P(\Omega) = 1$ and $P(A) \ge 0$ for all $A \in \mathcal{F}$) we must have $P(A) \le 1$.

More generally, from additivity we have $P(A) + P(A^c) = P(A \cup A^c) = P(\Omega) = 1$, whence

$$P(A^c) = 1 - P(A), \qquad (2.1.1)$$

a fact that will be used often. Similarly, if $A \subseteq B$, then since $B = A \,\dot\cup\, (B \setminus A)$ (where $\dot\cup$ means disjoint union), we have that $P(B) = P(A) + P(B \setminus A) \ge P(A)$, i.e.

$$P(A) \le P(B) \quad \text{whenever } A \subseteq B, \qquad (2.1.2)$$

which is the monotonicity property of probability measures.

Also, if $A, B \in \mathcal{F}$, then

$$P(A \cup B) = P[(A \setminus B) \,\dot\cup\, (B \setminus A) \,\dot\cup\, (A \cap B)]
= P(A \setminus B) + P(B \setminus A) + P(A \cap B)
= P(A) - P(A \cap B) + P(B) - P(A \cap B) + P(A \cap B)
= P(A) + P(B) - P(A \cap B),$$

the principle of inclusion-exclusion. For a generalisation see Exercise 4.5.7.

Finally, for any sequence $A_1, A_2, \ldots \in \mathcal{F}$ (whether disjoint or not), we have by countable additivity and monotonicity that

$$P(A_1 \cup A_2 \cup A_3 \cup \cdots) = P[A_1 \,\dot\cup\, (A_2 \setminus A_1) \,\dot\cup\, (A_3 \setminus A_2 \setminus A_1) \,\dot\cup\, \cdots]
= P(A_1) + P(A_2 \setminus A_1) + P(A_3 \setminus A_2 \setminus A_1) + \cdots
\le P(A_1) + P(A_2) + P(A_3) + \cdots,$$

which is the countable subadditivity property of probability measures.

2.2. Constructing probability triples.

We clarify the definition of Subsection 2.1 with a simple example. Let us again consider the Poisson(5) distribution considered in Subsection 1.1. In this case, the sample space $\Omega$ would consist of all the non-negative integers:
$\Omega = \{0, 1, 2, \ldots\}$. Also, the $\sigma$-algebra $\mathcal{F}$ would consist of all subsets of $\Omega$. Finally, the probability measure P would be defined, for any $A \in \mathcal{F}$, by

$$P(A) = \sum_{k \in A} e^{-5} 5^k / k!.$$

It is straightforward to check that $\mathcal{F}$ is indeed a $\sigma$-algebra (it contains all subsets of $\Omega$, so it's closed under any set operations), and that P is a probability measure defined on $\mathcal{F}$ (the additivity following since if A and B are disjoint, then $\sum_{k \in A \cup B}$ is the same as $\sum_{k \in A} + \sum_{k \in B}$).

So in the case of Poisson(5), we see that it is entirely straightforward to construct an appropriate probability triple. The construction is similarly straightforward for any discrete probability space, i.e. any space for which the sample space $\Omega$ is finite or countable. We record this as follows.

Theorem 2.2.1. Let $\Omega$ be a finite or countable non-empty set. Let $p : \Omega \to [0,1]$ be any function satisfying $\sum_{\omega \in \Omega} p(\omega) = 1$. Then there is a valid probability triple $(\Omega, \mathcal{F}, P)$, where $\mathcal{F}$ is the collection of all subsets of $\Omega$, and for $A \in \mathcal{F}$, $P(A) = \sum_{\omega \in A} p(\omega)$.

Example 2.2.2. Let $\Omega$ be any finite non-empty set, $\mathcal{F}$ be the collection of all subsets of $\Omega$, and $P(A) = |A| / |\Omega|$ for all $A \in \mathcal{F}$ (where |A| is the cardinality of the set A). Then $(\Omega, \mathcal{F}, P)$ is a valid probability triple, called the uniform distribution on $\Omega$, written Uniform($\Omega$).

However, if the sample space is not countable, then the situation is considerably more complex, as seen in Subsection 1.2. How can we formally define a probability triple $(\Omega, \mathcal{F}, P)$ which corresponds to, say, the Uniform[0,1] distribution? It seems clear that we should choose $\Omega = [0,1]$. But what about $\mathcal{F}$? We know from Proposition 1.2.6 that $\mathcal{F}$ cannot contain all subsets of $\Omega$, but it should certainly contain all the intervals [a,b], [a,b), etc. That is, we must have $\mathcal{F} \supseteq \mathcal{J}$, where

$$\mathcal{J} = \{\text{all intervals contained in } [0,1]\}$$

and where "intervals" is understood to include all the open / closed / half-open / singleton / empty intervals.

Exercise 2.2.3. Prove that the above collection $\mathcal{J}$ is a semialgebra of subsets of $\Omega$, meaning that it contains $\emptyset$ and $\Omega$, it is closed under finite intersection, and the complement of any element of $\mathcal{J}$ is equal to a finite disjoint union of elements of $\mathcal{J}$.
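Before continuing with the uncountable case, here is a quick sketch of ours of the discrete construction of Theorem 2.2.1, for the Poisson(5) example (exact infinite sums are replaced by floating-point partial sums):

```python
from math import exp, factorial

def p(k):
    # Poisson(5) probability mass at k.
    return exp(-5) * 5**k / factorial(k)

def P(A):
    # Theorem 2.2.1: P(A) = sum of p(k) over k in A,
    # for any subset A of Omega = {0, 1, 2, ...}.
    return sum(p(k) for k in A)

A = {0, 1, 2}
B = {5, 6}
# Finite additivity on disjoint sets, up to floating-point rounding:
print(abs(P(A | B) - (P(A) + P(B))) < 1e-12)   # True
# P(Omega) = 1, approximated here by a long partial sum:
print(abs(P(range(200)) - 1.0) < 1e-12)        # True
```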
Since $\mathcal{J}$ is only a semialgebra, how can we create a $\sigma$-algebra? As a first try, we might consider

$$\mathcal{B}_0 = \{\text{all finite unions of elements of } \mathcal{J}\}. \qquad (2.2.4)$$

(After all, by additivity we already know how to define P on $\mathcal{B}_0$.) However, $\mathcal{B}_0$ is not a $\sigma$-algebra:

Exercise 2.2.5.
(a) Prove that $\mathcal{B}_0$ is an algebra (or, field) of subsets of $\Omega$, meaning that it contains $\Omega$ and $\emptyset$, and is closed under the formation of complements and of finite unions and intersections.
(b) Prove that $\mathcal{B}_0$ is not a $\sigma$-algebra.

As a second try, we might consider

$$\mathcal{B}_1 = \{\text{all finite or countable unions of elements of } \mathcal{J}\}. \qquad (2.2.6)$$

Unfortunately, $\mathcal{B}_1$ is still not a $\sigma$-algebra (Exercise 2.4.7). Thus, the construction of $\mathcal{F}$, and of P, presents serious challenges. To deal with them, we next prove a very general theorem about constructing probability triples.

2.3. The Extension Theorem.

The following theorem is of fundamental importance in constructing complicated probability triples. Recall the definition of semialgebra from Exercise 2.2.3.

Theorem 2.3.1. (The Extension Theorem.) Let $\mathcal{J}$ be a semialgebra of subsets of $\Omega$. Let $P : \mathcal{J} \to [0,1]$ with $P(\emptyset) = 0$ and $P(\Omega) = 1$, satisfying the finite superadditivity property that

$$P\Big(\bigcup_{i=1}^k A_i\Big) \ge \sum_{i=1}^k P(A_i) \quad \text{whenever } A_1, \ldots, A_k \in \mathcal{J},\ \bigcup_{i=1}^k A_i \in \mathcal{J}, \text{ and the } \{A_i\} \text{ are disjoint}, \qquad (2.3.2)$$

and also the countable monotonicity property that

$$P(A) \le \sum_n P(A_n) \quad \text{for } A, A_1, A_2, \ldots \in \mathcal{J} \text{ with } A \subseteq \bigcup_n A_n. \qquad (2.3.3)$$

Then there is a $\sigma$-algebra $\mathcal{M} \supseteq \mathcal{J}$, and a countably additive probability measure $P^*$ on $\mathcal{M}$, such that $P^*(A) = P(A)$ for all $A \in \mathcal{J}$. (That is,
$(\Omega, \mathcal{M}, P^*)$ is a valid probability triple, which agrees with our previous probabilities on $\mathcal{J}$.)

Remark. Of course, the conclusions of Theorem 2.3.1 imply that (2.3.2) must actually hold with equality. However, (2.3.2) need only be verified as an inequality to apply Theorem 2.3.1.

Theorem 2.3.1 provides precisely what we need: a way to construct complicated probability triples on a full $\sigma$-algebra, using only probabilities defined on the much simpler subsets (e.g., intervals) in $\mathcal{J}$. However, it is not clear how to even start proving this theorem. Indeed, how could we begin to define P(A) for all A in a $\sigma$-algebra? The key is given by outer measure $P^*$, defined by

$$P^*(A) = \inf_{\substack{A_1, A_2, \ldots \in \mathcal{J} \\ A \subseteq \bigcup_i A_i}} \sum_i P(A_i), \qquad A \subseteq \Omega. \qquad (2.3.4)$$

That is, we define $P^*(A)$, for any subset $A \subseteq \Omega$, to be the infimum of sums of $P(A_i)$, where $\{A_i\}$ is any countable collection of elements of the original semialgebra $\mathcal{J}$ whose union contains A. In other words, we use the values of P(A) for $A \in \mathcal{J}$ to help us define $P^*(A)$ for any $A \subseteq \Omega$.

Of course, we know that $P^*$ will not necessarily be a proper probability measure for all $A \subseteq \Omega$; for example, this is not possible for Uniform[0,1] by Proposition 1.2.6. However, it is still useful that $P^*(A)$ is at least defined for all $A \subseteq \Omega$. We shall eventually show that $P^*$ is indeed a probability measure on some $\sigma$-algebra $\mathcal{M}$, and that $P^*$ is an extension of P.

To continue, we note a few simple properties of $P^*$. Firstly, we clearly have $P^*(\emptyset) = 0$; indeed, we can simply take $A_i = \emptyset$ for each i in the definition (2.3.4). Secondly, $P^*$ is clearly monotone; indeed, if $A \subseteq B$ then the infimum (2.3.4) for $P^*(A)$ includes all choices of $\{A_i\}$ which work for $P^*(B)$ plus many more besides, so that $P^*(A) \le P^*(B)$. We also have:

Lemma 2.3.5. $P^*$ is an extension of P, i.e. $P^*(A) = P(A)$ for all $A \in \mathcal{J}$.

Proof. Let $A \in \mathcal{J}$. It follows from (2.3.3) that $P^*(A) \ge P(A)$. On the other hand, choosing $A_1 = A$ and $A_i = \emptyset$ for $i > 1$ in the definition (2.3.4) shows by (A.4.1) that $P^*(A) \le P(A)$. ∎

Lemma 2.3.6. $P^*$ is countably subadditive, i.e.

$$P^*\Big(\bigcup_{n=1}^\infty B_n\Big) \le \sum_{n=1}^\infty P^*(B_n) \quad \text{for any } B_1, B_2, \ldots \subseteq \Omega.$$
Proof. Let $B_1, B_2, \ldots \subseteq \Omega$. From the definition (2.3.4), we see that for any $\epsilon > 0$, we can find (cf. Proposition A.4.2), for each $n \in \mathbf{N}$, a collection $\{C_{nk}\}_{k=1}^\infty$ with $C_{nk} \in \mathcal{J}$, such that $B_n \subseteq \bigcup_k C_{nk}$ and $\sum_k P(C_{nk}) \le P^*(B_n) + \epsilon 2^{-n}$. But then the overall collection $\{C_{nk}\}_{n,k=1}^\infty$ contains $\bigcup_{n=1}^\infty B_n$. It follows that

$$P^*\Big(\bigcup_{n=1}^\infty B_n\Big) \le \sum_{n,k} P(C_{nk}) \le \sum_n P^*(B_n) + \epsilon.$$

Since this is true for any $\epsilon > 0$, we must have (cf. Proposition A.3.1) that $P^*\big(\bigcup_{n=1}^\infty B_n\big) \le \sum_n P^*(B_n)$, as claimed. ∎

We now set

$$\mathcal{M} = \{A \subseteq \Omega : P^*(A \cap E) + P^*(A^c \cap E) = P^*(E) \ \ \forall E \subseteq \Omega\}. \qquad (2.3.7)$$

That is, $\mathcal{M}$ is the set of all subsets A with the property that $P^*$ is additive on the union of $A \cap E$ with $A^c \cap E$, for all subsets E. Note that by subadditivity we always have $P^*(A \cap E) + P^*(A^c \cap E) \ge P^*(E)$, so (2.3.7) is equivalent to

$$\mathcal{M} = \{A \subseteq \Omega : P^*(A \cap E) + P^*(A^c \cap E) \le P^*(E) \ \ \forall E \subseteq \Omega\}, \qquad (2.3.8)$$

which is sometimes helpful. Furthermore, $P^*$ is countably additive on $\mathcal{M}$:

Lemma 2.3.9. If $A_1, A_2, \ldots \in \mathcal{M}$ are disjoint, then $P^*(\bigcup_n A_n) = \sum_n P^*(A_n)$.

Proof. If $A_1$ and $A_2$ are disjoint, with $A_1 \in \mathcal{M}$, then

$$P^*(A_1 \cup A_2) = P^*\big(A_1 \cap (A_1 \cup A_2)\big) + P^*\big(A_1^c \cap (A_1 \cup A_2)\big) \quad \text{since } A_1 \in \mathcal{M}$$
$$= P^*(A_1) + P^*(A_2) \quad \text{since } A_1, A_2 \text{ disjoint}.$$

Hence, by induction, the lemma holds for any finite collection of $A_i$. Then, with countably many disjoint $A_i \in \mathcal{M}$, we see that for any $m \in \mathbf{N}$,

$$\sum_{n \le m} P^*(A_n) = P^*\Big(\bigcup_{n \le m} A_n\Big) \le P^*\Big(\bigcup_n A_n\Big),$$

where the inequality follows from monotonicity. Since this is true for any $m \in \mathbf{N}$, we have (cf. Proposition A.3.6) that $\sum_n P^*(A_n) \le P^*(\bigcup_n A_n)$. On the other hand, by subadditivity we have $\sum_n P^*(A_n) \ge P^*(\bigcup_n A_n)$. Hence, the lemma holds for countably many $A_i$ as well. ∎

The plan now is to show that $\mathcal{M}$ is a $\sigma$-algebra which contains $\mathcal{J}$. We break up the proof into a number of lemmas.
Lemma 2.3.10. $\mathcal{M}$ is an algebra, i.e. $\Omega \in \mathcal{M}$ and $\mathcal{M}$ is closed under complement and finite intersection (and hence also finite union).

Proof. It is immediate from (2.3.7) that $\mathcal{M}$ contains $\Omega$ and is closed under complement. For the statement about finite intersections, suppose $A, B \in \mathcal{M}$. Then, for any $E \subseteq \Omega$, using subadditivity,

$$P^*((A \cap B) \cap E) + P^*((A \cap B)^c \cap E)$$
$$= P^*(A \cap B \cap E) + P^*\big((A^c \cap B \cap E) \cup (A \cap B^c \cap E) \cup (A^c \cap B^c \cap E)\big)$$
$$\le P^*(A \cap B \cap E) + P^*(A^c \cap B \cap E) + P^*(A \cap B^c \cap E) + P^*(A^c \cap B^c \cap E)$$
$$= P^*(B \cap E) + P^*(B^c \cap E) \quad \text{since } A \in \mathcal{M}$$
$$= P^*(E) \quad \text{since } B \in \mathcal{M}.$$

Hence, by (2.3.8), $A \cap B \in \mathcal{M}$. ∎

Lemma 2.3.11. Let $A_1, A_2, \ldots \in \mathcal{M}$ be disjoint. For each $m \in \mathbf{N}$, let $B_m = \bigcup_{n \le m} A_n$. Then for all $m \in \mathbf{N}$, and for all $E \subseteq \Omega$, we have

$$P^*(E \cap B_m) = \sum_{n \le m} P^*(E \cap A_n). \qquad (2.3.12)$$

Proof. We use induction on m. Indeed, the statement is trivially true when m = 1. Assuming it true for some particular value of m, and noting that $B_m \cap B_{m+1} = B_m$ and $B_m^c \cap B_{m+1} = A_{m+1}$, we have (noting that $B_m \in \mathcal{M}$ by Lemma 2.3.10) that

$$P^*(E \cap B_{m+1}) = P^*(B_m \cap E \cap B_{m+1}) + P^*(B_m^c \cap E \cap B_{m+1}) \quad \text{since } B_m \in \mathcal{M}$$
$$= P^*(E \cap B_m) + P^*(E \cap A_{m+1})$$
$$= \sum_{n \le m+1} P^*(E \cap A_n) \quad \text{by the induction hypothesis},$$

thus completing the induction proof. ∎

Lemma 2.3.13. Let $A_1, A_2, \ldots \in \mathcal{M}$ be disjoint. Then $\bigcup_n A_n \in \mathcal{M}$.

Proof. For each $m \in \mathbf{N}$, let $B_m = \bigcup_{n \le m} A_n$. Then for any $m \in \mathbf{N}$ and any $E \subseteq \Omega$,

$$P^*(E) = P^*(E \cap B_m) + P^*(E \cap B_m^c) \quad \text{since } B_m \in \mathcal{M}$$
$$= \sum_{n \le m} P^*(E \cap A_n) + P^*(E \cap B_m^c) \quad \text{by (2.3.12)}$$
$$\ge \sum_{n \le m} P^*(E \cap A_n) + P^*\Big(E \cap \big(\bigcup_n A_n\big)^c\Big),$$
where the inequality follows by monotonicity since $(\bigcup_n A_n)^c \subseteq B_m^c$. This is true for any $m \in \mathbf{N}$, so it implies (cf. Proposition A.3.6) that

$$P^*(E) \ge \sum_n P^*(E \cap A_n) + P^*\Big(E \cap \big(\bigcup_n A_n\big)^c\Big)$$
$$\ge P^*\Big(E \cap \big(\bigcup_n A_n\big)\Big) + P^*\Big(E \cap \big(\bigcup_n A_n\big)^c\Big),$$

where the final inequality follows by subadditivity. Hence, by (2.3.8) we have $\bigcup_n A_n \in \mathcal{M}$. ∎

Lemma 2.3.14. $\mathcal{M}$ is a $\sigma$-algebra.

Proof. In light of Lemma 2.3.10, it suffices to show that $\bigcup_n A_n \in \mathcal{M}$ whenever $A_1, A_2, \ldots \in \mathcal{M}$. Let $D_1 = A_1$, and $D_i = A_i \cap A_1^c \cap \cdots \cap A_{i-1}^c$ for $i \ge 2$. Then the $\{D_i\}$ are disjoint, with $\bigcup_i D_i = \bigcup_i A_i$, and with $D_i \in \mathcal{M}$ by Lemma 2.3.10. Hence, by Lemma 2.3.13, $\bigcup_i D_i \in \mathcal{M}$, i.e. $\bigcup_i A_i \in \mathcal{M}$. ∎

Lemma 2.3.15. $\mathcal{J} \subseteq \mathcal{M}$.

Proof. Let $A \in \mathcal{J}$. Then since $\mathcal{J}$ is a semialgebra, we can write $A^c = J_1 \cup \ldots \cup J_k$ for some disjoint $J_1, \ldots, J_k \in \mathcal{J}$. Also, for any $E \subseteq \Omega$ and $\epsilon > 0$, by the definition (2.3.4) we can find (cf. Proposition A.4.2) $A_1, A_2, \ldots \in \mathcal{J}$ with $E \subseteq \bigcup_n A_n$ and $\sum_n P(A_n) \le P^*(E) + \epsilon$. Then

$$P^*(E \cap A) + P^*(E \cap A^c)$$
$$\le P^*\Big(\big(\bigcup_n A_n\big) \cap A\Big) + P^*\Big(\big(\bigcup_n A_n\big) \cap A^c\Big) \quad \text{by monotonicity}$$
$$= P^*\Big(\bigcup_n (A_n \cap A)\Big) + P^*\Big(\bigcup_n \bigcup_{i=1}^k (A_n \cap J_i)\Big)$$
$$\le \sum_n P^*(A_n \cap A) + \sum_n \sum_{i=1}^k P^*(A_n \cap J_i) \quad \text{by subadditivity}$$
$$= \sum_n P(A_n \cap A) + \sum_n \sum_{i=1}^k P(A_n \cap J_i) \quad \text{since } P^* = P \text{ on } \mathcal{J}$$
$$= \sum_n \Big( P(A_n \cap A) + \sum_{i=1}^k P(A_n \cap J_i) \Big)$$
$$\le \sum_n P(A_n) \quad \text{by (2.3.2)}$$
$$\le P^*(E) + \epsilon \quad \text{by assumption}.$$

This is true for any $\epsilon > 0$, hence (cf. Proposition A.3.1) we have $P^*(E \cap A) + P^*(E \cap A^c) \le P^*(E)$, for any $E \subseteq \Omega$. Hence, from (2.3.8), we have $A \in \mathcal{M}$. This holds for any $A \in \mathcal{J}$, hence $\mathcal{J} \subseteq \mathcal{M}$. ∎

With all those lemmas behind us, we are now, finally, able to complete the proof of the Extension Theorem.
Proof of Theorem 2.3.1. Lemmas 2.3.5, 2.3.9, 2.3.14, and 2.3.15 together show that $\mathcal{M}$ is a $\sigma$-algebra containing $\mathcal{J}$, that $P^*$ is a probability measure on $\mathcal{M}$, and that $P^*$ is an extension of P. ∎

Exercise 2.3.16. Prove that the extension $(\Omega, \mathcal{M}, P^*)$ constructed in the proof of Theorem 2.3.1 must be complete, meaning that if $A \in \mathcal{M}$ with $P^*(A) = 0$, and if $B \subseteq A$, then $B \in \mathcal{M}$. (It then follows from monotonicity that $P^*(B) = 0$.)

2.4. Constructing the Uniform[0,1] distribution.

Theorem 2.3.1 allows us to automatically construct valid probability triples which take particular values on particular sets. We now use this to construct the Uniform[0,1] distribution.

We begin by letting $\Omega = [0,1]$, and again setting

$$\mathcal{J} = \{\text{all intervals contained in } [0,1]\}, \qquad (2.4.1)$$

where again "intervals" is understood to include all the open, closed, half-open, and singleton intervals contained in [0,1], and also the empty set $\emptyset$. Then $\mathcal{J}$ is a semialgebra by Exercise 2.2.3. For $I \in \mathcal{J}$, we let P(I) be the length of I. Thus $P(\emptyset) = 0$ and $P(\Omega) = 1$. We now proceed to verify (2.3.2) and (2.3.3).

Proposition 2.4.2. The above definition of $\mathcal{J}$ and P satisfies (2.3.2), with equality.

Proof. Let $I_1, \ldots, I_k$ be disjoint intervals contained in [0,1], whose union is some interval $I_0$. For $0 \le j \le k$, write $a_j$ for the left end-point of $I_j$, and $b_j$ for the right end-point of $I_j$. The assumptions imply that, by re-ordering, we can ensure that

$$a_0 = a_1 \le b_1 = a_2 \le b_2 = a_3 \le \ldots \le b_k = b_0.$$

Then

$$\sum_j P(I_j) = \sum_j (b_j - a_j) = b_k - a_1 = b_0 - a_0 = P(I_0). \quad ∎$$

The verification of (2.3.3) for this $\mathcal{J}$ and P is a bit more involved:

Exercise 2.4.3.
(a) Prove that if $I_1, I_2, \ldots, I_n$ is a finite collection of intervals, and if $\bigcup_{j=1}^n I_j \supseteq I$ for some interval I, then $\sum_{j=1}^n P(I_j) \ge P(I)$. [Hint: Imitate the proof of Proposition 2.4.2.]
(b) Prove that if $I_1, I_2, \ldots$ is a countable collection of open intervals, and if $\bigcup_{j=1}^\infty I_j \supseteq I$ for some closed interval I, then $\sum_{j=1}^\infty P(I_j) \ge P(I)$. [Hint:
You may use the Heine-Borel Theorem, which says that if a collection of open intervals contains a closed interval, then some finite sub-collection of the open intervals also contains the closed interval.]
(c) Verify (2.3.3), i.e. prove that if $I_1, I_2, \ldots$ is any countable collection of intervals, and if $\bigcup_{j=1}^\infty I_j \supseteq I$ for any interval I, then $\sum_{j=1}^\infty P(I_j) \ge P(I)$. [Hint: Extend the interval $I_j$ by $\epsilon 2^{-j}$ at each end, and decrease I by $\epsilon$ at each end, while making $I_j$ open and I closed. Then use part (b).]

In light of Proposition 2.4.2 and Exercise 2.4.3, we can apply Theorem 2.3.1 to conclude the following:

Theorem 2.4.4. There exists a probability triple $(\Omega, \mathcal{M}, P^*)$ such that $\Omega = [0,1]$, $\mathcal{M}$ contains all intervals in [0,1], and for any interval $I \subseteq [0,1]$, $P^*(I)$ is the length of I.

This probability triple is called either the uniform distribution on [0,1], or Lebesgue measure on [0,1]. Depending on the context, we sometimes write the probability measure $P^*$ as P or as $\lambda$.

Remark. Let $\mathcal{B} = \sigma(\mathcal{J})$ be the $\sigma$-algebra generated by $\mathcal{J}$, i.e. the smallest $\sigma$-algebra containing $\mathcal{J}$. (The collection $\mathcal{B}$ is called the Borel $\sigma$-algebra of subsets of [0,1], and the elements of $\mathcal{B}$ are called Borel sets.) Clearly, we must have $\mathcal{M} \supseteq \mathcal{B}$. In this case, it can be shown that $\mathcal{M}$ is in fact much bigger than $\mathcal{B}$; it even has larger cardinality. Furthermore, it turns out that Lebesgue measure restricted to $\mathcal{B}$ is not complete, though on $\mathcal{M}$ it is (by Exercise 2.3.16). In addition to the Borel subsets of [0,1], we shall also have occasion to refer to the Borel $\sigma$-algebra of subsets of $\mathbf{R}$, defined to be the smallest $\sigma$-algebra of subsets of $\mathbf{R}$ which includes all intervals.

Exercise 2.4.5. Let $\mathcal{A} = \{(-\infty, x] : x \in \mathbf{R}\}$. Prove that $\sigma(\mathcal{A}) = \mathcal{B}$, i.e. that the smallest $\sigma$-algebra of subsets of $\mathbf{R}$ which contains $\mathcal{A}$ is equal to the Borel $\sigma$-algebra of subsets of $\mathbf{R}$. [Hint: Does $\sigma(\mathcal{A})$ include all intervals?]

Writing $\lambda$ for Lebesgue measure on [0,1], we know that $\lambda(\{x\}) = 0$ for any singleton set $\{x\}$. It follows by countable additivity that $\lambda(A) = 0$ for any set A which is countable. This includes (cf. Subsection A.2) the rational numbers, the integer roots of the rational numbers, the algebraic numbers, etc. That is, if X is uniformly distributed on [0,1], then P(X is rational) = 0, and $P(X^n \text{ is rational for some } n \in \mathbf{N}) = 0$, and P(X is algebraic) = 0, and so on.
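The fact that countable sets are $\lambda$-null can also be seen directly from the definition (2.3.4) of outer measure; the following covering argument is standard, and is spelled out here for concreteness (it is our addition, not part of the original text at this point). Given a countable set $A = \{x_1, x_2, x_3, \ldots\} \subseteq [0,1]$ and any $\epsilon > 0$, cover each $x_n$ by an interval $I_n$ of length $\epsilon 2^{-n}$ centred at $x_n$. Then $A \subseteq \bigcup_n I_n$, so

$$\lambda(A) \;\le\; \sum_{n=1}^{\infty} P(I_n) \;=\; \sum_{n=1}^{\infty} \epsilon\, 2^{-n} \;=\; \epsilon.$$

Since $\epsilon > 0$ was arbitrary, $\lambda(A) = 0$.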
There also exist uncountable sets which have Lebesgue measure 0. The simplest example is the Cantor set K, defined as follows (see Figure 2.4.6).

[Figure 2.4.6: Constructing the Cantor set K, by repeatedly removing open middle thirds from [0,1].]

We begin with the interval [0,1]. We then remove the open interval consisting of the middle third, (1/3, 2/3). We then remove the open middle thirds of each of the two remaining pieces, i.e. we remove (1/9, 2/9) and (7/9, 8/9). We then remove the four open middle thirds (1/27, 2/27), (7/27, 8/27), (19/27, 20/27), and (25/27, 26/27) of the remaining pieces. We continue inductively, at the nth stage removing the $2^{n-1}$ middle thirds of all remaining sub-intervals, each of length $1/3^n$. The Cantor set K is defined to be everything that is left over, after we have removed all these middle thirds.

Now, the complement of the Cantor set has Lebesgue measure given by the geometric series

$$\lambda(K^c) = 1/3 + 2(1/9) + 4(1/27) + \ldots = \sum_{n=1}^\infty 2^{n-1}/3^n = \frac{1}{3} \sum_{n=1}^\infty (2/3)^{n-1} = \frac{1/3}{1 - 2/3} = 1.$$

Hence, by (2.1.1), $\lambda(K) = 1 - 1 = 0$. On the other hand, K is uncountable. Indeed, for each point $x \in K$, let $d_n(x) = 0$ or 1 depending on whether, at the nth stage of the construction of K, x was to the left or the right of the nearest open interval removed. Then define the function $f : K \to [0,1]$ by $f(x) = \sum_{n=1}^\infty d_n(x)\, 2^{-n}$. It is easily checked that $f(K) = [0,1]$, i.e. that f maps K onto [0,1]. Since [0,1] is uncountable, this means that K must also be uncountable.

Remark. The Cantor set is also equal to the set of all numbers in [0,1] which have a base-3 expansion that does not contain the digit 1. That is, $K = \{\sum_{n=1}^\infty c_n 3^{-n} : \text{each } c_n \in \{0,2\}\}$.

Exercise 2.4.7.
(a) Prove that $K, K^c \in \mathcal{B}$, where $\mathcal{B}$ are the Borel subsets of [0,1].
(b) Prove that $K, K^c \in \mathcal{M}$, where $\mathcal{M}$ is the $\sigma$-algebra of Theorem 2.4.4.
(c) Prove that $K^c \in \mathcal{B}_1$, where $\mathcal{B}_1$ is defined by (2.2.6).
(d) Prove that $K \notin \mathcal{B}_1$.
(e) Prove that $\mathcal{B}_1$ is not a $\sigma$-algebra.
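A brief computational aside of ours (not from the text, and with a caveat: the ternary digit test below is only approximate for floating-point inputs):

```python
def removed_length(n):
    # Total length removed from [0,1] after n stages:
    # 2**(k-1) open intervals of length 3**-k at stage k.
    return sum(2**(k - 1) / 3**k for k in range(1, n + 1))

print(removed_length(50))  # approaches 1, consistent with lambda(K) = 0

def in_cantor(x, digits=30):
    # Approximate membership test via the base-3 characterization:
    # inspect only the first `digits` ternary digits of x.
    for _ in range(digits):
        x *= 3
        d = int(x)
        if d == 1:
            return False
        x -= d
    return True

print(in_cantor(0.25))  # True:  1/4 = 0.020202..._3 is in K
print(in_cantor(0.5))   # False: 1/2 = 0.111..._3
```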
On the other hand, from Proposition 1.2.6 we know that:

Proposition 2.4.8. For the probability triple $(\Omega, \mathcal{M}, P^*)$ of Theorem 2.4.4 corresponding to Lebesgue measure on [0,1], there exists at least one subset $H \subseteq \Omega$ with $H \notin \mathcal{M}$.

2.5. Extensions of the Extension Theorem.

The Extension Theorem (Theorem 2.3.1) will be our main tool for proving the existence of complicated probability triples. While (2.3.2) is generally easy to verify, (2.3.3) can be more challenging. Thus, we present some alternative formulations here.

Corollary 2.5.1. Let $\mathcal{J}$ be a semialgebra of subsets of $\Omega$. Let $P : \mathcal{J} \to [0,1]$ with $P(\emptyset) = 0$ and $P(\Omega) = 1$, satisfying (2.3.2), and the "monotonicity on $\mathcal{J}$" property that

$$P(A) \le P(B) \quad \text{whenever } A, B \in \mathcal{J} \text{ with } A \subseteq B, \qquad (2.5.2)$$

and also the "countable subadditivity on $\mathcal{J}$" property that

$$P\Big(\bigcup_n B_n\Big) \le \sum_n P(B_n) \quad \text{for } B_1, B_2, \ldots \in \mathcal{J} \text{ with } \bigcup_n B_n \in \mathcal{J}. \qquad (2.5.3)$$

Then there is a $\sigma$-algebra $\mathcal{M} \supseteq \mathcal{J}$, and a countably additive probability measure $P^*$ on $\mathcal{M}$, such that $P^*(A) = P(A)$ for all $A \in \mathcal{J}$.

Proof. In light of Theorem 2.3.1, we need only verify (2.3.3). To that end, let $A, A_1, A_2, \ldots \in \mathcal{J}$ with $A \subseteq \bigcup_n A_n$. Set $B_n = A \cap A_n$. Then since $A \subseteq \bigcup_n A_n$, we have $A = \bigcup_n (A \cap A_n) = \bigcup_n B_n$, whence (2.5.3) and (2.5.2) give that

$$P(A) = P\Big(\bigcup_n B_n\Big) \le \sum_n P(B_n) \le \sum_n P(A_n). \quad ∎$$

Another version assumes countable additivity of P on $\mathcal{J}$:

Corollary 2.5.4. Let $\mathcal{J}$ be a semialgebra of subsets of $\Omega$. Let $P : \mathcal{J} \to [0,1]$ with $P(\Omega) = 1$, satisfying the countable additivity property that

$$P\Big(\bigcup_n D_n\Big) = \sum_n P(D_n) \quad \text{for } D_1, D_2, \ldots \in \mathcal{J} \text{ disjoint with } \bigcup_n D_n \in \mathcal{J}. \qquad (2.5.5)$$

Then there is a $\sigma$-algebra $\mathcal{M} \supseteq \mathcal{J}$, and a countably additive probability measure $P^*$ on $\mathcal{M}$, such that $P^*(A) = P(A)$ for all $A \in \mathcal{J}$.
Proof. Note that (2.5.5) immediately implies (2.3.2) (with equality), and $P(\emptyset) = 0$. Hence, in light of Corollary 2.5.1, we need only verify (2.5.2) and (2.5.3).

For (2.5.2), let $A, B \in \mathcal{J}$ with $A \subseteq B$. Since $\mathcal{J}$ is a semialgebra, we can write $A^c = J_1 \cup \ldots \cup J_k$, for some disjoint $J_1, \ldots, J_k \in \mathcal{J}$. Then, using (2.5.5),

$$P(B) = P(A) + P(B \cap J_1) + \ldots + P(B \cap J_k) \ge P(A).$$

For (2.5.3), let $B_1, B_2, \ldots \in \mathcal{J}$ with $\bigcup_n B_n \in \mathcal{J}$. Set $D_1 = B_1$, and $D_n = B_n \cap B_1^c \cap \ldots \cap B_{n-1}^c$ for $n \ge 2$. Then the $\{D_n\}$ are disjoint, with $\bigcup_n D_n = \bigcup_n B_n$. Furthermore, since $\mathcal{J}$ is a semialgebra, each $D_n$ can be written as a finite disjoint union of elements of $\mathcal{J}$, say $D_n = J_{n1} \cup \ldots \cup J_{n k_n}$. It then follows from (2.5.5) that

$$P\Big(\bigcup_n B_n\Big) = P\Big(\bigcup_n D_n\Big) = P\Big(\bigcup_n \bigcup_{i=1}^{k_n} J_{ni}\Big) = \sum_n \sum_{i=1}^{k_n} P(J_{ni}).$$

On the other hand,

$$B_n = \bigcup_{m \le n} \bigcup_{i=1}^{k_m} (J_{mi} \cap B_n)$$

and the union is disjoint, with $J_{ni} \subseteq B_n$, so

$$P(B_n) = \sum_{m \le n} \sum_{i=1}^{k_m} P(J_{mi} \cap B_n) \ge \sum_{i=1}^{k_n} P(J_{ni} \cap B_n) = \sum_{i=1}^{k_n} P(J_{ni}),$$

and hence

$$\sum_n P(B_n) \ge \sum_n \sum_{i=1}^{k_n} P(J_{ni}) = P\Big(\bigcup_n B_n\Big). \quad ∎$$

Exercise 2.5.6. Suppose P satisfies (2.5.5) for finite collections $\{D_n\}$. Suppose further that, whenever $A_1, A_2, \ldots \in \mathcal{J}$ are such that $A_{n+1} \subseteq A_n$ and $\bigcap_{n=1}^\infty A_n = \emptyset$, we have $\lim_{n \to \infty} P(A_n) = 0$. Prove that P also satisfies (2.5.5) for countable collections $\{D_n\}$. [Hint: Set $A_n = (\bigcup_{i=1}^\infty D_i) \setminus (\bigcup_{i=1}^n D_i)$.]

The extension of Theorem 2.3.1 also has a uniqueness property:

Proposition 2.5.7. Let $\mathcal{J}$, P, $P^*$, and $\mathcal{M}$ be as in Theorem 2.3.1 (or as in Corollary 2.5.1 or 2.5.4). Let $\mathcal{F}$ be any $\sigma$-algebra with $\mathcal{J} \subseteq \mathcal{F} \subseteq \mathcal{M}$ (e.g.
$\mathcal{F} = \mathcal{M}$, or $\mathcal{F} = \sigma(\mathcal{J})$). Let Q be any probability measure on $\mathcal{F}$, such that $Q(A) = P(A)$ for all $A \in \mathcal{J}$. Then $Q(A) = P^*(A)$ for all $A \in \mathcal{F}$.

Proof. For $A \in \mathcal{F}$, we compute

$$P^*(A) = \inf_{\substack{A_1, A_2, \ldots \in \mathcal{J} \\ A \subseteq \bigcup_i A_i}} \sum_i P(A_i) \quad \text{from (2.3.4)}$$
$$= \inf_{\substack{A_1, A_2, \ldots \in \mathcal{J} \\ A \subseteq \bigcup_i A_i}} \sum_i Q(A_i) \quad \text{since } Q = P \text{ on } \mathcal{J}$$
$$\ge \inf_{\substack{A_1, A_2, \ldots \in \mathcal{J} \\ A \subseteq \bigcup_i A_i}} Q\Big(\bigcup_i A_i\Big) \quad \text{by countable subadditivity}$$
$$\ge \inf_{\substack{A_1, A_2, \ldots \in \mathcal{J} \\ A \subseteq \bigcup_i A_i}} Q(A) \quad \text{by monotonicity}$$
$$= Q(A),$$

i.e. $P^*(A) \ge Q(A)$. Similarly, $P^*(A^c) \ge Q(A^c)$, and then (2.1.1) implies $1 - P^*(A) \ge 1 - Q(A)$, i.e. $P^*(A) \le Q(A)$. Hence, $P^*(A) = Q(A)$. ∎

Proposition 2.5.7 immediately implies the following:

Proposition 2.5.8. Let $\mathcal{J}$ be a semialgebra of subsets of $\Omega$, and let $\mathcal{F} = \sigma(\mathcal{J})$ be the generated $\sigma$-algebra. Let P and Q be two probability distributions defined on $\mathcal{F}$. Suppose that $P(A) = Q(A)$ for all $A \in \mathcal{J}$. Then P = Q, i.e. $P(A) = Q(A)$ for all $A \in \mathcal{F}$.

Proof. Since P and Q are probability measures, they both satisfy (2.3.2) and (2.3.3). Hence, by Proposition 2.5.7, each of P and Q is equal to the $P^*$ of Theorem 2.3.1. ∎

One useful special case of Proposition 2.5.8 is:

Corollary 2.5.9. Let P and Q be two probability distributions defined on the collection $\mathcal{B}$ of Borel subsets of $\mathbf{R}$. Suppose $P((-\infty, x]) = Q((-\infty, x])$ for all $x \in \mathbf{R}$. Then $P(A) = Q(A)$ for all $A \in \mathcal{B}$.

Proof. Since $P((y, \infty)) = 1 - P((-\infty, y])$, and $P((x, y]) = 1 - P((-\infty, x]) - P((y, \infty))$, and similarly for Q, it follows that P and Q agree on

$$\mathcal{J} = \{(-\infty, x] : x \in \mathbf{R}\} \cup \{(y, \infty) : y \in \mathbf{R}\} \cup \{(x, y] : x, y \in \mathbf{R}\} \cup \{\emptyset, \mathbf{R}\}. \qquad (2.5.10)$$

But $\mathcal{J}$ is a semialgebra (Exercise 2.7.10), and it follows from Exercise 2.4.5 that $\sigma(\mathcal{J}) = \mathcal{B}$. Hence, the result follows from Proposition 2.5.8. ∎
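As a quick illustration of ours (not from the text), Corollary 2.5.9 says that a Borel probability measure on $\mathbf{R}$ is completely determined by its cumulative distribution function $F(x) = P((-\infty, x])$. For instance, for the Exponential(1) distribution, the single formula $F(x) = 1 - e^{-x}$ for $x \ge 0$ (and $F(x) = 0$ for $x < 0$) already forces

$$P((x, y]) = F(y) - F(x) = e^{-x} - e^{-y}, \qquad 0 \le x < y,$$

and, by Corollary 2.5.9, pins down $P(A)$ for every Borel set A, even though no explicit formula for general A is available.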
2.6. Coin tossing and other measures.

Now that we have Theorem 2.3.1 to help us, we can easily construct other probability triples as well. For example, of frequent mention in probability theory is (independent, fair) coin tossing. To model the flipping of n coins, we can simply take

$$\Omega = \{(r_1, r_2, \ldots, r_n) : r_i = 0 \text{ or } 1\}$$

(where 0 stands for tails and 1 stands for heads), let $\mathcal{F} = 2^\Omega$ be the collection of all subsets of $\Omega$, and define P by $P(A) = |A| / 2^n$ for $A \in \mathcal{F}$. This is another example of a discrete probability space; and we know from Theorem 2.2.1 that these spaces present no difficulties.

But suppose now that we wish to model the flipping of a (countably) infinite number of coins. In this case we can let

$$\Omega = \{(r_1, r_2, r_3, \ldots) : r_i = 0 \text{ or } 1\}$$

be the collection of all binary sequences. But what about $\mathcal{F}$ and P? Well, for each $n \in \mathbf{N}$ and each $a_1, a_2, \ldots, a_n \in \{0,1\}$, let us define subsets $A_{a_1 a_2 \ldots a_n} \subseteq \Omega$ by

$$A_{a_1 a_2 \ldots a_n} = \{(r_1, r_2, \ldots) \in \Omega : r_i = a_i \text{ for } 1 \le i \le n\}.$$

(Thus, $A_0$ is the event that the first coin comes up tails; $A_{11}$ is the event that the first two coins both come up heads; and $A_{101}$ is the event that the first and third coins are heads while the second coin is tails.) Then we clearly want $P(A_{a_1 a_2 \ldots a_n}) = 1/2^n$ for each set $A_{a_1 a_2 \ldots a_n}$. Hence, if we set

$$\mathcal{J} = \{A_{a_1 a_2 \ldots a_n} : n \in \mathbf{N},\ a_1, a_2, \ldots, a_n \in \{0,1\}\} \cup \{\emptyset, \Omega\},$$

then we already know how to define P(A) for each $A \in \mathcal{J}$. To apply the Extension Theorem (in this case, Corollary 2.5.4), we need to verify that certain conditions are satisfied.

Exercise 2.6.1.
(a) Verify that the above $\mathcal{J}$ is a semialgebra.
(b) Verify that the above $\mathcal{J}$ and P satisfy (2.5.5) for finite collections $\{D_n\}$. [Hint: For a finite collection $\{D_n\} \subseteq \mathcal{J}$, there is $k \in \mathbf{N}$ such that the results of only coins 1 through k are specified by any $D_n$. Partition $\Omega$ into the corresponding $2^k$ subsets.]

Verifying (2.5.5) for countable collections unfortunately requires a bit of topology; the proof of this next lemma may be skipped.

Lemma 2.6.2. The above $\mathcal{J}$ and P (for infinite coin tossing) satisfy (2.5.5).
Proof (optional). In light of Exercises 2.6.1 and 2.5.6, it suffices to show that for $A_1, A_2, \ldots \in \mathcal{J}$ with $A_{n+1} \subseteq A_n$ and $\bigcap_{n=1}^\infty A_n = \emptyset$, we have $\lim_{n \to \infty} P(A_n) = 0$.

Give $\{0,1\}$ the discrete topology, and give $\Omega = \{0,1\} \times \{0,1\} \times \ldots$ the corresponding product topology. Then $\Omega$ is a product of compact sets $\{0,1\}$, and hence is itself compact by Tychonov's Theorem. Furthermore each element of $\mathcal{J}$ is a closed subset of $\Omega$, since its complement is open in the product topology. Hence, each $A_n$ is a closed subset of a compact space, and is therefore compact. The finite intersection property of compact sets then implies that there is $N \in \mathbf{N}$ with $A_n = \emptyset$ for all $n \ge N$. In particular, $P(A_n) \to 0$. ∎

Now that these conditions have been verified, it follows from Corollary 2.5.4 that the probabilities for the special sets $A_{a_1 a_2 \ldots a_n} \in \mathcal{J}$ can automatically be extended to a $\sigma$-algebra $\mathcal{M}$ containing $\mathcal{J}$. (Once again, this $\sigma$-algebra will be quite complicated, and we will never understand it completely. But it is still essential mathematically that we know it exists.) This will be our probability triple for infinite fair coin tossing.

As a sample calculation, let $H_n = \{(r_1, r_2, \ldots) \in \Omega : r_n = 1\}$ be the event that the nth coin comes up heads. We certainly would hope that $H_n \in \mathcal{M}$, with $P(H_n) = 1/2$. Happily, this is indeed the case. We note that

$$H_n = \bigcup_{r_1, r_2, \ldots, r_{n-1} \in \{0,1\}} A_{r_1 r_2 \ldots r_{n-1} 1},$$

the union being disjoint. Hence, since $\mathcal{M}$ is closed under countable (including finite) unions, we have $H_n \in \mathcal{M}$. Then, by countable additivity,

$$P(H_n) = \sum_{r_1, r_2, \ldots, r_{n-1} \in \{0,1\}} P(A_{r_1 r_2 \ldots r_{n-1} 1}) = \sum_{r_1, r_2, \ldots, r_{n-1} \in \{0,1\}} 1/2^n = 2^{n-1}/2^n = 1/2.$$

Remark. In fact, if we identify an element $x \in [0,1]$ with its binary expansion $(r_1, r_2, \ldots)$, i.e. so that $x = \sum_{k=1}^\infty r_k / 2^k$, then we see that infinite fair coin tossing may be viewed as being "essentially" the same thing as Lebesgue measure on [0,1].
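The Remark above can be explored numerically. The following sketch of ours truncates the infinite coin sequence at 53 coins (roughly the resolution of a double-precision float), so it is only an approximation of the infinite construction:

```python
import random

random.seed(1)

def flips(k=53):
    # One realization of the first k fair coins (r_1, ..., r_k).
    return [random.getrandbits(1) for _ in range(k)]

samples = [flips() for _ in range(10**5)]

# P(H_3), the third coin coming up heads, should be 1/2:
print(sum(r[2] for r in samples) / len(samples))  # approx 0.5

# Identifying (r_1, r_2, ...) with x = sum r_i / 2^i maps coin tossing
# (approximately, after truncation) to a Uniform[0,1] point:
xs = [sum(r / 2**(i + 1) for i, r in enumerate(seq)) for seq in samples]
print(sum(0.25 <= x <= 0.5 for x in xs) / len(xs))  # approx 0.25 = length
```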
Next, given any two probability triples $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$, we can define their product measure P on the Cartesian product set $\Omega_1 \times \Omega_2 = \{(\omega_1, \omega_2);\ \omega_i \in \Omega_i\ (i = 1,2)\}$. We set
$$\mathcal{J} = \{A \times B;\ A \in \mathcal{F}_1,\ B \in \mathcal{F}_2\}, \qquad (2.6.3)$$
and define $P(A \times B) = P_1(A)\, P_2(B)$ for $A \times B \in \mathcal{J}$. (The elements of $\mathcal{J}$ are called measurable rectangles.)

Exercise 2.6.4. Verify that the above $\mathcal{J}$ is a semialgebra, and that $\emptyset, \Omega \in \mathcal{J}$ with $P(\emptyset) = 0$ and $P(\Omega) = 1$.

We will show later (Exercise 4.5.15) that these $\mathcal{J}$ and P satisfy (2.5.5). Hence by Corollary 2.5.4, we can extend P to a $\sigma$-algebra containing $\mathcal{J}$. The resulting probability triple is called the product measure of $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$.

An important special case of product measure is Lebesgue measure in higher dimensions. For example, in dimension two, we define 2-dimensional Lebesgue measure on $[0,1] \times [0,1]$ to be the product of Lebesgue measure on $[0,1]$ with itself. This is a probability measure on $\Omega = [0,1] \times [0,1]$ with the property that
$$P([a,b] \times [c,d]) = (b-a)(d-c), \qquad 0 \le a \le b \le 1,\ 0 \le c \le d \le 1.$$
It is thus a measure of area in two dimensions. More generally, we can inductively define $d$-dimensional Lebesgue measure on $[0,1]^d$ for any $d \in \mathbf{N}$, by taking the product of Lebesgue measure on $[0,1]$ with $(d-1)$-dimensional Lebesgue measure on $[0,1]^{d-1}$. When $d = 3$, Lebesgue measure on $[0,1] \times [0,1] \times [0,1]$ is a measure of volume.
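As a quick illustration (a sketch of my own, not from the text), two-dimensional Lebesgue measure can be estimated by Monte Carlo: sampling the product measure just means drawing each coordinate independently and uniformly, and the fraction of points landing in a rectangle approximates $(b-a)(d-c)$.

```python
import random

random.seed(1)
a, b, c, d = 0.2, 0.7, 0.1, 0.5      # the rectangle [a,b] x [c,d]
trials = 200_000

# Sampling product measure: each coordinate is an independent Uniform[0,1].
inside = sum(1 for _ in range(trials)
             if a <= random.random() <= b and c <= random.random() <= d)

print(inside / trials)               # approximately (b-a)*(d-c) = 0.2
```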
2.7. Exercises.

Exercise 2.7.1. Let $\Omega = \{1,2,3,4\}$. Determine whether or not each of the following is a $\sigma$-algebra.
(a) $\mathcal{F}_1 = \{\emptyset, \{1,2\}, \{3,4\}, \{1,2,3,4\}\}$.
(b) $\mathcal{F}_2 = \{\emptyset, \{3\}, \{4\}, \{1,2\}, \{3,4\}, \{1,2,3\}, \{1,2,4\}, \{1,2,3,4\}\}$.
(c) $\mathcal{F}_3 = \{\emptyset, \{1,2\}, \{1,3\}, \{1,4\}, \{2,3\}, \{2,4\}, \{3,4\}, \{1,2,3,4\}\}$.

Exercise 2.7.2. Let $\Omega = \{1,2,3,4\}$, and let $\mathcal{J} = \{\{1\}, \{2\}\}$. Describe explicitly the $\sigma$-algebra $\sigma(\mathcal{J})$ generated by $\mathcal{J}$.

Exercise 2.7.3. Suppose $\mathcal{F}$ is a collection of subsets of $\Omega$, such that $\Omega \in \mathcal{F}$.
(a) Suppose that whenever $A, B \in \mathcal{F}$, then also $A \setminus B = A \cap B^c \in \mathcal{F}$. Prove that $\mathcal{F}$ is an algebra.
(b) Suppose $\mathcal{F}$ is a semialgebra. Prove that $\mathcal{F}$ is an algebra.
(c) Suppose that $\mathcal{F}$ is closed under complement, and also closed under finite disjoint unions (i.e. whenever $A, B \in \mathcal{F}$ are disjoint, then $A \cup B \in \mathcal{F}$). Give a counter-example to show that $\mathcal{F}$ might not be an algebra.

Exercise 2.7.4. Let $\mathcal{F}_1, \mathcal{F}_2, \ldots$ be a sequence of collections of subsets of $\Omega$, such that $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$ for each $n$.
(a) Suppose that each $\mathcal{F}_i$ is an algebra. Prove that $\bigcup_{i=1}^\infty \mathcal{F}_i$ is also an algebra.
(b) Suppose that each $\mathcal{F}_i$ is a $\sigma$-algebra. Show (by counter-example) that $\bigcup_{i=1}^\infty \mathcal{F}_i$ might not be a $\sigma$-algebra.

Exercise 2.7.5. Suppose that $\Omega = \mathbf{N}$ is the set of positive integers, and $\mathcal{F}$ is the set of all subsets $A$ such that either $A$ or $A^c$ is finite, and P is defined by $P(A) = 0$ if $A$ is finite, and $P(A) = 1$ if $A^c$ is finite.
(a) Is $\mathcal{F}$ an algebra?
(b) Is $\mathcal{F}$ a $\sigma$-algebra?
(c) Is P finitely additive?
(d) Is P countably additive on $\mathcal{F}$, meaning that if $A_1, A_2, \ldots \in \mathcal{F}$ are disjoint, and if it happens that $\bigcup_n A_n \in \mathcal{F}$, then $P(\bigcup_n A_n) = \sum_n P(A_n)$?

Exercise 2.7.6. Suppose that $\Omega = [0,1]$ is the unit interval, and $\mathcal{F}$ is the set of all subsets $A$ such that either $A$ or $A^c$ is finite, and P is defined by $P(A) = 0$ if $A$ is finite, and $P(A) = 1$ if $A^c$ is finite.
(a) Is $\mathcal{F}$ an algebra?
(b) Is $\mathcal{F}$ a $\sigma$-algebra?
(c) Is P finitely additive?
(d) Is P countably additive on $\mathcal{F}$ (as in the previous exercise)?

Exercise 2.7.7. Suppose that $\Omega = [0,1]$ is the unit interval, and $\mathcal{F}$ is the set of all subsets $A$ such that either $A$ or $A^c$ is countable (i.e., finite or countably infinite), and P is defined by $P(A) = 0$ if $A$ is countable, and $P(A) = 1$ if $A^c$ is countable.
(a) Is $\mathcal{F}$ an algebra?
(b) Is $\mathcal{F}$ a $\sigma$-algebra?
(c) Is P finitely additive?
(d) Is P countably additive on $\mathcal{F}$?

Exercise 2.7.8. For the example of Exercise 2.7.7, is P uncountably additive (cf. page 2)?

Exercise 2.7.9. Let $\mathcal{F}$ be a $\sigma$-algebra, and write $|\mathcal{F}|$ for the total number of subsets in $\mathcal{F}$. Prove that if $|\mathcal{F}| < \infty$ (i.e., if $\mathcal{F}$ consists of just a finite number of subsets), then $|\mathcal{F}| = 2^m$ for some $m \in \mathbf{N}$. [Hint: Consider those non-empty subsets in $\mathcal{F}$ which do not contain any other non-empty subset in $\mathcal{F}$. How can all subsets in $\mathcal{F}$ be "built up" from these particular subsets?]
Exercise 2.7.10. Prove that the collection $\mathcal{J}$ of (2.5.10) is a semialgebra.

Exercise 2.7.11. Let $\Omega = [0,1]$. Let $\mathcal{J}'$ be the set of all half-open intervals of the form $(a,b]$, for $0 \le a < b \le 1$, together with the sets $\emptyset$, $\Omega$, and $\{0\}$.
(a) Prove that $\mathcal{J}'$ is a semialgebra.
(b) Prove that $\sigma(\mathcal{J}') = \mathcal{B}$, i.e. that the $\sigma$-algebra generated by this $\mathcal{J}'$ is equal to the $\sigma$-algebra generated by the $\mathcal{J}$ of (2.4.1).
(c) Let $\mathcal{B}_0'$ be the collection of all finite disjoint unions of elements of $\mathcal{J}'$. Prove that $\mathcal{B}_0'$ is an algebra. Is $\mathcal{B}_0'$ the same as the algebra $\mathcal{B}_0$ defined in (2.4.2)?
[Remark: Some treatments of Lebesgue measure use $\mathcal{J}'$ instead of $\mathcal{J}$.]

Exercise 2.7.12. Let $K$ be the Cantor set as defined in Subsection 2.4. Let $D_n = K \oplus \frac{1}{n}$, where $K \oplus \frac{1}{n}$ is defined as in (1.2.4). Let $B = \bigcup_{n=1}^\infty D_n$.
(a) Draw a rough sketch of $D_3$.
(b) What is $\lambda(D_3)$?
(c) Draw a rough sketch of $B$.
(d) What is $\lambda(B)$?

Exercise 2.7.13. Give an example of a sample space $\Omega$, a semialgebra $\mathcal{J}$, and a non-negative function $P : \mathcal{J} \to \mathbf{R}$ with $P(\emptyset) = 0$ and $P(\Omega) = 1$, such that (2.5.5) is not satisfied.

Exercise 2.7.14. Let $\Omega = \{1,2,3,4\}$, with $\mathcal{F}$ the collection of all subsets of $\Omega$. Let P and Q be two probability measures on $\mathcal{F}$, such that $P\{1\} = P\{2\} = P\{3\} = P\{4\} = 1/4$, and $Q\{2\} = Q\{4\} = 1/2$, extended to $\mathcal{F}$ by linearity. Finally, let $\mathcal{J} = \{\emptyset, \Omega, \{1,2\}, \{2,3\}, \{3,4\}, \{1,4\}\}$.
(a) Prove that $P(A) = Q(A)$ for all $A \in \mathcal{J}$.
(b) Prove that there is $A \in \sigma(\mathcal{J})$ with $P(A) \ne Q(A)$.
(c) Why does this not contradict Proposition 2.5.8?

Exercise 2.7.15. Let $(\Omega, \mathcal{M}, \lambda)$ be Lebesgue measure on the interval $[0,1]$. Let $\Omega' = \{(x,y) \in \mathbf{R}^2;\ 0 \le x \le 1,\ 0 \le y \le 1\}$. Let $\mathcal{F}$ be the collection of all subsets of $\Omega'$ of the form
$$\{(x,y) \in \mathbf{R}^2;\ x \in A,\ 0 \le y \le 1\}$$
for some $A \in \mathcal{M}$. Finally, define a probability P on $\mathcal{F}$ by
$$P\big(\{(x,y) \in \mathbf{R}^2;\ x \in A,\ 0 \le y \le 1\}\big) = \lambda(A).$$
(a) Prove that $(\Omega', \mathcal{F}, P)$ is a probability triple.
(b) Let $P^*$ be the outer measure corresponding to P and $\mathcal{F}$. Define the subset $S \subseteq \Omega'$ by
$$S = \{(x,y) \in \mathbf{R}^2;\ 0 \le x \le 1,\ y = 1/2\}.$$
(Note that $S \notin \mathcal{F}$.) Prove that $P^*(S) = 1$ and $P^*(S^c) = 1$.

Exercise 2.7.16. (a) Where in the proof of Theorem 2.3.1 was assumption (2.3.3) used?
(b) How would the conclusion of Theorem 2.3.1 be modified if assumption (2.3.3) were dropped (but all other assumptions remained the same)?

Exercise 2.7.17. Let $\Omega = \{1,2\}$, and let $\mathcal{J}$ be the collection of all subsets of $\Omega$, with $P(\emptyset) = 0$, $P(\Omega) = 1$, and $P\{1\} = P\{2\} = 1/3$.
(a) Verify that all assumptions of Theorem 2.3.1 other than (2.3.3) are satisfied.
(b) Verify that assumption (2.3.3) is not satisfied.
(c) Describe precisely the $\mathcal{M}$ and $P^*$ that would result in this example from the modified version of Theorem 2.3.1 in Exercise 2.7.16(b).

Exercise 2.7.18. Let $\Omega = \{1,2\}$, $\mathcal{J} = \{\emptyset, \Omega, \{1\}\}$, $P(\emptyset) = 0$, $P(\Omega) = 1$, and $P(\{1\}) = 1/3$.
(a) Can Theorem 2.3.1, Corollary 2.5.1, or Corollary 2.5.4 be applied in this case? Why or why not?
(b) Can this P be extended to a valid probability measure? Explain.

Exercise 2.7.19. Let $\Omega$ be a finite non-empty set, and let $\mathcal{J}$ consist of all singletons in $\Omega$, together with $\emptyset$ and $\Omega$. Let $p : \Omega \to [0,1]$ with $\sum_{\omega \in \Omega} p(\omega) = 1$, and define $P(\emptyset) = 0$, $P(\Omega) = 1$, and $P(\{\omega\}) = p(\omega)$ for all $\omega \in \Omega$.
(a) Prove that $\mathcal{J}$ is a semialgebra.
(b) Prove that (2.3.2) and (2.3.3) are satisfied.
(c) Describe precisely the $\mathcal{M}$ and $P^*$ that result from applying Theorem 2.3.1.
(d) Are these $\mathcal{M}$ and $P^*$ the same as those described in Theorem 2.2.1?

Exercise 2.7.20. Let P and Q be two probability measures defined on the same sample space $\Omega$ and $\sigma$-algebra $\mathcal{F}$.
(a) Suppose that $P(A) = Q(A)$ for all $A \in \mathcal{F}$ with $P(A) \le \frac12$. Prove that $P = Q$, i.e. that $P(A) = Q(A)$ for all $A \in \mathcal{F}$.
(b) Give an example where $P(A) = Q(A)$ for all $A \in \mathcal{F}$ with $P(A) < \frac12$, but such that $P \ne Q$, i.e. that $P(A) \ne Q(A)$ for some $A \in \mathcal{F}$.

Exercise 2.7.21. Let $\lambda$ be Lebesgue measure in dimension two, i.e. Lebesgue measure on $[0,1] \times [0,1]$. Let $A$ be the triangle $\{(x,y) \in [0,1] \times [0,1];\ y < x\}$. Prove that $A$ is measurable with respect to $\lambda$, and compute $\lambda(A)$.

Exercise 2.7.22. Let $(\Omega_1, \mathcal{F}_1, P_1)$ be Lebesgue measure on $[0,1]$. Consider a second probability triple, $(\Omega_2, \mathcal{F}_2, P_2)$, defined as follows: $\Omega_2 = \{1,2\}$, $\mathcal{F}_2$ consists of all subsets of $\Omega_2$, and $P_2$ is defined by $P_2\{1\} = \frac13$, $P_2\{2\} = \frac23$, and additivity. Let $(\Omega, \mathcal{F}, P)$ be the product measure of $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$.
(a) Express each of $\Omega$, $\mathcal{F}$, and P as explicitly as possible.
(b) Find a set $A \in \mathcal{F}$ such that $P(A) = \frac34$.

2.8. Section summary.

The section gave a formal definition of a probability triple $(\Omega, \mathcal{F}, P)$, consisting of a sample space $\Omega$, a $\sigma$-algebra $\mathcal{F}$, and a probability measure P, and derived certain basic properties of them. It then considered the question of how to construct such probability triples. Discrete spaces (with countable $\Omega$) were straightforward, but other spaces were more challenging. The key tool was the Extension Theorem, which said that once a probability measure has been constructed on a semialgebra, it can then automatically be extended to a $\sigma$-algebra. The Extension Theorem allowed us to construct Lebesgue measure on $[0,1]$, and to consider some of its basic properties. It also allowed us to construct other probability triples such as infinite coin tossing, product measures, and multi-dimensional Lebesgue measure.
3. Further probabilistic foundations.

Now that we understand probability triples well, we discuss some additional essential ingredients of probability theory. Throughout Section 3 (and, indeed, throughout most of this text and most of probability theory in general), we shall assume that there is an underlying probability triple $(\Omega, \mathcal{F}, P)$ with respect to which all further probability objects are defined. This assumption shall be so universal that we will often not even mention it.

3.1. Random variables.

If we think of a sample space $\Omega$ as the set of all possible random outcomes of some experiment, then a random variable assigns a numerical value to each of these outcomes. More formally, we have

Definition 3.1.1. Given a probability triple $(\Omega, \mathcal{F}, P)$, a random variable is a function $X$ from $\Omega$ to the real numbers $\mathbf{R}$, such that
$$\{\omega \in \Omega;\ X(\omega) \le x\} \in \mathcal{F}, \qquad x \in \mathbf{R}. \qquad (3.1.2)$$

Equation (3.1.2) is a technical requirement, and states that the function $X$ must be measurable. It can also be written as $\{X \le x\} \in \mathcal{F}$, or $X^{-1}((-\infty, x]) \in \mathcal{F}$, for all $x \in \mathbf{R}$. Since complements and unions and intersections are preserved under inverse images (see Subsection A.1), it follows from Exercise 2.4.5 that equation (3.1.2) is equivalent to saying that $X^{-1}(B) \in \mathcal{F}$ for every Borel set $B$. That is, the set $X^{-1}(B)$, also written $\{X \in B\}$, is indeed an event. So, for any Borel set $B$, it makes sense to talk about $P(X \in B)$, the probability that $X$ lies in $B$.

Example 3.1.3. Suppose that $(\Omega, \mathcal{F}, P)$ is Lebesgue measure on $[0,1]$. Then we might define some random variables $X$, $Y$, and $Z$ by $X(\omega) = \omega$, $Y(\omega) = 2\omega$, and $Z(\omega) = 3\omega + 4$. We then have, for example, that $Y = 2X$, and $Z = 3X + 4 = \frac32 Y + 4$. Also,
$$P(Y \le 1/3) = P\{\omega;\ Y(\omega) \le 1/3\} = P\{\omega;\ 2\omega \le 1/3\} = P([0, 1/6]) = 1/6.$$

Exercise 3.1.4. For Example 3.1.3, compute $P(Z > a)$ and $P(X \le a \text{ and } Y \le b)$ as functions of $a, b \in \mathbf{R}$.
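The computation in Example 3.1.3 is easy to confirm by simulation. Here is a quick sketch (my own, not from the text) estimating $P(Y \le 1/3)$ by drawing $\omega$ uniformly from $[0,1]$:

```python
import random

random.seed(2)
trials = 100_000

# Omega = [0,1] with Lebesgue measure; Y(omega) = 2*omega as in Example 3.1.3.
count = 0
for _ in range(trials):
    omega = random.random()
    if 2 * omega <= 1/3:
        count += 1

print(count / trials)    # approximately P([0, 1/6]) = 1/6 ~ 0.1667
```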
Now, not all functions from $\Omega$ to $\mathbf{R}$ are random variables. For example, let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$, and let $H \subseteq \Omega$ be the non-measurable set of Proposition 2.4.8. Define $X : \Omega \to \mathbf{R}$ by $X = \mathbf{1}_{H^c}$, so $X(\omega) = 0$ for $\omega \in H$, and $X(\omega) = 1$ for $\omega \notin H$. Then $\{\omega \in \Omega : X(\omega) \le 1/2\} = H \notin \mathcal{F}$, so $X$ is not a random variable.

On the other hand, the following proposition shows that condition (3.1.2) is preserved under usual arithmetic and limits. In practice, this means that if functions from $\Omega$ to $\mathbf{R}$ are constructed in "usual" ways, then (3.1.2) will be satisfied, so the functions will indeed be random variables.

Proposition 3.1.5.
(i) If $X = \mathbf{1}_A$ is the indicator of some event $A \in \mathcal{F}$, then $X$ is a random variable.
(ii) If $X$ and $Y$ are random variables and $c \in \mathbf{R}$, then $X + c$, $cX$, $X^2$, $X + Y$, and $XY$ are all random variables.
(iii) If $Z_1, Z_2, \ldots$ are random variables such that $\lim_{n\to\infty} Z_n(\omega)$ exists for each $\omega \in \Omega$, and $Z(\omega) = \lim_{n\to\infty} Z_n(\omega)$, then $Z$ is also a random variable.

Proof. (i) If $X = \mathbf{1}_A$ for $A \in \mathcal{F}$, then $X^{-1}(B)$ must be one of $A$, $A^c$, $\emptyset$, or $\Omega$, so $X^{-1}(B) \in \mathcal{F}$.
(ii) The first two of these assertions are immediate. The third follows since for $y \ge 0$, $\{X^2 \le y\} = \{X \in [-\sqrt{y}, \sqrt{y}]\} \in \mathcal{F}$. For the fourth, note (by finding a rational number $r \in (X, x - Y)$) that
$$\{X + Y < x\} = \bigcup_{r \text{ rational}} \big(\{X < r\} \cap \{Y < x - r\}\big) \in \mathcal{F}.$$
The fifth assertion then follows since $XY = \frac12\big[(X+Y)^2 - X^2 - Y^2\big]$.
(iii) For $x \in \mathbf{R}$,
$$\{Z \le x\} = \bigcap_{m=1}^\infty \bigcup_{n=1}^\infty \bigcap_{k=n}^\infty \big\{Z_k \le x + \tfrac{1}{m}\big\}. \qquad (3.1.6)$$
But $Z_k$ is a random variable, so $\{Z_k \le x + \frac1m\} \in \mathcal{F}$. Then, since $\mathcal{F}$ is a $\sigma$-algebra, we must have $\{Z \le x\} \in \mathcal{F}$. ∎

Exercise 3.1.7. Prove (3.1.6). [Hint: remember the definition of $Z(\omega) = \lim_{n\to\infty} Z_n(\omega)$, cf. Subsection A.3.]

Suppose now that $X$ is a random variable, and $f : \mathbf{R} \to \mathbf{R}$ is a function from $\mathbf{R}$ to $\mathbf{R}$ which is Borel-measurable, meaning that $f^{-1}(A) \in \mathcal{B}$ for any $A \in \mathcal{B}$ (where $\mathcal{B}$ is the collection of Borel sets of $\mathbf{R}$). (Equivalently, $f$ is a random variable corresponding to $\Omega = \mathbf{R}$ and $\mathcal{F} = \mathcal{B}$.) We can define a new random variable $f(X)$, the composition of $X$ with $f$, by $f(X)(\omega) = f(X(\omega))$ for each $\omega \in \Omega$. Then (3.1.2) is satisfied since for $B \in \mathcal{B}$, $\{f(X) \in B\} = \{X \in f^{-1}(B)\} \in \mathcal{F}$.
Proposition 3.1.8. If $f$ is a continuous function, or a piecewise-continuous function, then $f$ is Borel-measurable.

Proof. A basic result of point-set topology says that if $f$ is continuous, then $f^{-1}(O)$ is an open subset of $\mathbf{R}$ whenever $O$ is. In particular, $f^{-1}((x, \infty))$ is open, so $f^{-1}((x, \infty)) \in \mathcal{B}$, so $f^{-1}((-\infty, x]) \in \mathcal{B}$. If $f$ is piecewise-continuous, then we can write $f = f_1 \mathbf{1}_{I_1} + f_2 \mathbf{1}_{I_2} + \ldots + f_n \mathbf{1}_{I_n}$, where the $f_j$ are continuous and the $\{I_j\}$ are disjoint intervals. It follows from the above and Proposition 3.1.5 that $f$ is Borel-measurable. ∎

For example, if $f(x) = x^k$ for $k \in \mathbf{N}$, then $f$ is Borel-measurable. Hence, if $X$ is a random variable, then so is $X^k$ for all $k \in \mathbf{N}$.

Remark 3.1.9. In probability theory, the underlying probability triple $(\Omega, \mathcal{F}, P)$ is usually complete (cf. Exercise 2.3.16; for example this is always true for discrete probability spaces, or for those such as Lebesgue measure constructed using the Extension Theorem). In that case, if $X$ is a random variable, and $Y : \Omega \to \mathbf{R}$ is such that $P(X = Y) = 1$, then $Y$ must also be a random variable.

Remark 3.1.10. In Definition 3.1.1, we assume that $X$ is a real-valued random variable, i.e. that it maps $\Omega$ into the set of real numbers equipped with the Borel $\sigma$-algebra. More generally, one could consider a random variable which mapped $\Omega$ to an arbitrary second measurable space, i.e. to some second non-empty set $\Omega'$ with its own collection $\mathcal{F}'$ of measurable subsets. We would then have $X : \Omega \to \Omega'$, with condition (3.1.2) replaced by the condition that $X^{-1}(A') \in \mathcal{F}$ whenever $A' \in \mathcal{F}'$.

3.2. Independence.

Informally, events or random variables are independent if they do not affect each other's probabilities. Thus, two events $A$ and $B$ are independent if $P(A \cap B) = P(A)P(B)$. Intuitively, the probabilistic proportion of the event $B$ which also includes $A$ (i.e., $P(A \cap B)/P(B)$) is equal to the overall probability of $A$ (i.e., $P(A)$); the definition uses products to avoid division by zero.

Three events $A$, $B$, and $C$ are said to be independent if all of the following equations are satisfied: $P(A \cap B) = P(A)P(B)$; $P(A \cap C) = P(A)P(C)$; $P(B \cap C) = P(B)P(C)$; and $P(A \cap B \cap C) = P(A)P(B)P(C)$. It is not sufficient (see Exercise 3.6.3) to check just the final one, or just the first three, of these equations.
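For instance (a quick sketch of my own, using the finite coin-tossing space of Subsection 2.6), with three fair coins one can enumerate all $2^3$ outcomes and verify the product rule directly for the events "coin 1 is heads" and "coin 2 is heads":

```python
from itertools import product

# All 8 outcomes of three fair coins, each with probability 1/8.
outcomes = list(product((0, 1), repeat=3))

def prob(event):
    """Probability of an event (a set of outcomes) under the uniform measure."""
    return len(event) / len(outcomes)

A = {w for w in outcomes if w[0] == 1}    # first coin heads
B = {w for w in outcomes if w[1] == 1}    # second coin heads

print(prob(A & B), prob(A) * prob(B))     # both 0.25, so A and B are independent
```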
More generally, a possibly-infinite collection $\{A_\alpha\}_{\alpha \in I}$ of events is said to be independent if for each $j \in \mathbf{N}$ and each distinct finite choice $\alpha_1, \alpha_2, \ldots, \alpha_j \in I$, we have
$$P(A_{\alpha_1} \cap A_{\alpha_2} \cap \ldots \cap A_{\alpha_j}) = P(A_{\alpha_1})\, P(A_{\alpha_2}) \cdots P(A_{\alpha_j}). \qquad (3.2.1)$$

Exercise 3.2.2. Suppose (3.2.1) is satisfied.
(a) Show that (3.2.1) is still satisfied if $A_{\alpha_1}$ is replaced by $A_{\alpha_1}^c$.
(b) Show that (3.2.1) is still satisfied if each $A_{\alpha_i}$ is replaced by the corresponding $A_{\alpha_i}^c$.
(c) Prove that if $\{A_\alpha\}_{\alpha \in I}$ is independent, then so is $\{A_\alpha^c\}_{\alpha \in I}$.

We shall on occasion also talk about independence of collections of events. Collections of events $\{\mathcal{A}_\alpha;\ \alpha \in I\}$ are independent if for all $j \in \mathbf{N}$, for all distinct $\alpha_1, \ldots, \alpha_j \in I$, and for all $A_1 \in \mathcal{A}_{\alpha_1}, \ldots, A_j \in \mathcal{A}_{\alpha_j}$, equation (3.2.1) holds.

We shall also talk about independence of random variables. Random variables $X$ and $Y$ are independent if for all Borel sets $S_1$ and $S_2$, the events $X^{-1}(S_1)$ and $Y^{-1}(S_2)$ are independent, i.e.
$$P(X \in S_1,\ Y \in S_2) = P(X \in S_1)\, P(Y \in S_2).$$
More generally, a collection $\{X_\alpha;\ \alpha \in I\}$ of random variables is independent if for all $j \in \mathbf{N}$, for all distinct $\alpha_1, \ldots, \alpha_j \in I$, and for all Borel sets $S_1, \ldots, S_j$, we have
$$P(X_{\alpha_1} \in S_1,\ X_{\alpha_2} \in S_2,\ \ldots,\ X_{\alpha_j} \in S_j) = P(X_{\alpha_1} \in S_1)\, P(X_{\alpha_2} \in S_2) \cdots P(X_{\alpha_j} \in S_j).$$

Independence is preserved under deterministic transformations:

Proposition 3.2.3. Let $X$ and $Y$ be independent random variables. Let $f, g : \mathbf{R} \to \mathbf{R}$ be Borel-measurable functions. Then the random variables $f(X)$ and $g(Y)$ are independent.

Proof. For Borel $S_1, S_2 \subseteq \mathbf{R}$, we compute that
$$P(f(X) \in S_1,\ g(Y) \in S_2) = P\big(X \in f^{-1}(S_1),\ Y \in g^{-1}(S_2)\big) = P\big(X \in f^{-1}(S_1)\big)\, P\big(Y \in g^{-1}(S_2)\big) = P(f(X) \in S_1)\, P(g(Y) \in S_2). \quad ∎$$

We also have the following.

Proposition 3.2.4. Let $X$ and $Y$ be two random variables, defined jointly on some probability triple $(\Omega, \mathcal{F}, P)$. Then $X$ and $Y$ are independent if and only if $P(X \le x,\ Y \le y) = P(X \le x)\, P(Y \le y)$ for all $x, y \in \mathbf{R}$.
Proof. The "only if" part is immediate from the definition. For the "if" part, fix $x \in \mathbf{R}$ with $P(X \le x) > 0$, and define the measure Q on the Borel subsets of $\mathbf{R}$ by $Q(S) = P(X \le x,\ Y \in S)/P(X \le x)$. Then by assumption, $Q((-\infty, y]) = P(Y \le y)$ for all $y \in \mathbf{R}$. It follows from Corollary 2.5.9 that $Q(S) = P(Y \in S)$ for all Borel $S \subseteq \mathbf{R}$, i.e. that
$$P(X \le x,\ Y \in S) = P(X \le x)\, P(Y \in S).$$
Then, for fixed Borel $S \subseteq \mathbf{R}$ with $P(Y \in S) > 0$, let $R(T) = P(X \in T,\ Y \in S)/P(Y \in S)$. By the above, it follows that $R((-\infty, x]) = P(X \le x)$ for each $x \in \mathbf{R}$. It then follows from Corollary 2.5.9 that $R(T) = P(X \in T)$ for all Borel $T \subseteq \mathbf{R}$, i.e. that $P(X \in T,\ Y \in S) = P(X \in T)\, P(Y \in S)$ for all Borel $S, T \subseteq \mathbf{R}$. Hence, $X$ and $Y$ are independent. ∎

Independence will come up often in this text, and its significance will become more clear as we proceed.

3.3. Continuity of probabilities.

Given a probability triple $(\Omega, \mathcal{F}, P)$, and events $A, A_1, A_2, \ldots \in \mathcal{F}$, we write $\{A_n\} \nearrow A$ to mean that $A_1 \subseteq A_2 \subseteq A_3 \subseteq \ldots$, and $\bigcup_n A_n = A$. In words, the events $A_n$ increase to $A$. Similarly, we write $\{A_n\} \searrow A$ to mean that $\{A_n^c\} \nearrow A^c$, or equivalently that $A_1 \supseteq A_2 \supseteq A_3 \supseteq \ldots$, and $\bigcap_n A_n = A$. In words, the events $A_n$ decrease to $A$. We then have

Proposition 3.3.1. (Continuity of probabilities.) If $\{A_n\} \nearrow A$ or $\{A_n\} \searrow A$, then $\lim_{n\to\infty} P(A_n) = P(A)$.

Proof. Suppose $\{A_n\} \nearrow A$. Let $B_n = A_n \cap A_{n-1}^c$ (with $A_0 = \emptyset$). Then the $\{B_n\}$ are disjoint, with $\bigcup_n B_n = \bigcup_n A_n = A$. Hence,
$$P(A) = P\Big(\bigcup_m B_m\Big) = \sum_{m=1}^\infty P(B_m) = \lim_{n\to\infty} \sum_{m=1}^n P(B_m) = \lim_{n\to\infty} P\Big(\bigcup_{m \le n} B_m\Big) = \lim_{n\to\infty} P(A_n)$$
(where the last equality is the only time we use that the $\{A_m\}$ are a nested sequence). If instead $\{A_n\} \searrow A$, then $\{A_n^c\} \nearrow A^c$, so
$$P(A) = 1 - P(A^c) = 1 - \lim_n P(A_n^c) = \lim_n \big(1 - P(A_n^c)\big) = \lim_n P(A_n). \quad ∎$$
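A tiny numerical illustration (mine, not the book's): on $[0,1]$ with Lebesgue measure, the nested events $A_n = [0, \frac12 - \frac1n]$ increase to $[0, \frac12)$, and their estimated probabilities climb toward $\frac12$.

```python
import random

random.seed(3)

def estimate(n, trials=100_000):
    """Monte Carlo estimate of P(A_n), A_n = [0, 1/2 - 1/n], under Lebesgue measure."""
    return sum(random.random() <= 0.5 - 1.0 / n for _ in range(trials)) / trials

for n in (2, 10, 100):
    print(n, estimate(n))    # increases toward P([0, 1/2)) = 0.5
```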
If the $\{A_n\}$ are not nested, then we may not have $\lim_n P(A_n) = P(A)$. For example, suppose that $A_n = \Omega$ for $n$ odd, but $A_n = \emptyset$ for $n$ even. Then $P(A_n)$ alternates between 0 and 1, so that $\lim_n P(A_n)$ does not exist.

3.4. Limit events.

Given events $A_1, A_2, \ldots \in \mathcal{F}$, we define
$$\limsup_n A_n = \{A_n \text{ i.o.}\} = \bigcap_{n=1}^\infty \bigcup_{k=n}^\infty A_k$$
and
$$\liminf_n A_n = \{A_n \text{ a.a.}\} = \bigcup_{n=1}^\infty \bigcap_{k=n}^\infty A_k.$$

The event $\limsup_n A_n$ is referred to as "$A_n$ infinitely often"; it stands for those $\omega \in \Omega$ which are in infinitely many of the $A_n$. Intuitively, it is the event that infinitely many of the events $A_n$ occur. Similarly, the event $\liminf_n A_n$ is referred to as "$A_n$ almost always"; intuitively, it is the event that all but a finite number of the events $A_n$ occur. Since $\mathcal{F}$ is a $\sigma$-algebra, we see that $\limsup_n A_n \in \mathcal{F}$ and $\liminf_n A_n \in \mathcal{F}$. Also, by de Morgan's laws, $(\limsup_n A_n)^c = \liminf_n (A_n^c)$, so $P(A_n \text{ i.o.}) = 1 - P(A_n^c \text{ a.a.})$.

For example, suppose $(\Omega, \mathcal{F}, P)$ is infinite fair coin tossing, and $H_n$ is the event that the $n$th coin is heads. Then $\limsup_n H_n$ is the event that there are infinitely many heads. Also, $\liminf_n H_n$ is the event that all but a finite number of the coins were heads, i.e. that there were only finitely many tails.

Proposition 3.4.1. We always have
$$P\big(\liminf_n A_n\big) \le \liminf_n P(A_n) \le \limsup_n P(A_n) \le P\big(\limsup_n A_n\big).$$

Proof. The middle inequality holds by definition, and the last inequality follows similarly to the first, so we prove only the first inequality. We note that as $n \to \infty$, the events $\big\{\bigcap_{k=n}^\infty A_k\big\}$ are increasing (cf. Subsection 3.3) to $\liminf_n A_n$. Hence, by continuity of probabilities,
$$P\big(\liminf_n A_n\big) = P\Big(\bigcup_{n=1}^\infty \bigcap_{k=n}^\infty A_k\Big) = \lim_{n\to\infty} P\Big(\bigcap_{k=n}^\infty A_k\Big)$$
$$= \liminf_{n\to\infty} P\Big(\bigcap_{k=n}^\infty A_k\Big) \le \liminf_{n\to\infty} P(A_n),$$
where the final equality follows by definition (if a limit exists, then it is equal to the liminf), and the final inequality follows from monotonicity (2.1.2). ∎

For example, again considering infinite fair coin tossing, with $H_n$ the event that the $n$th coin is heads, Proposition 3.4.1 says that $P(H_n \text{ i.o.}) \ge \frac12$, which is interesting but vague. To improve this result, we require a more powerful theorem.

Theorem 3.4.2. (The Borel-Cantelli Lemma.) Let $A_1, A_2, \ldots \in \mathcal{F}$.
(i) If $\sum_n P(A_n) < \infty$, then $P(\limsup_n A_n) = 0$.
(ii) If $\sum_n P(A_n) = \infty$, and $\{A_n\}$ are independent, then $P(\limsup_n A_n) = 1$.

Proof. For (i), we note that for any $m \in \mathbf{N}$, we have by countable subadditivity that
$$P\big(\limsup_n A_n\big) \le P\Big(\bigcup_{k=m}^\infty A_k\Big) \le \sum_{k=m}^\infty P(A_k),$$
which goes to 0 as $m \to \infty$ if the sum is convergent.

For (ii), since $(\limsup_n A_n)^c = \bigcup_{n=1}^\infty \bigcap_{k=n}^\infty A_k^c$, it suffices (by countable subadditivity) to show that $P\big(\bigcap_{k=n}^\infty A_k^c\big) = 0$ for each $n \in \mathbf{N}$. Well, for $n, m \in \mathbf{N}$, we have by independence and Exercise 3.2.2 (and since $1 - x \le e^{-x}$ for any real number $x$) that
$$P\Big(\bigcap_{k=n}^\infty A_k^c\Big) \le P\Big(\bigcap_{k=n}^m A_k^c\Big) = \prod_{k=n}^m \big(1 - P(A_k)\big) \le \prod_{k=n}^m e^{-P(A_k)} = e^{-\sum_{k=n}^m P(A_k)},$$
which goes to 0 as $m \to \infty$ if the sum is divergent. ∎

This theorem is striking since it asserts that if $\{A_n\}$ are independent, then $P(\limsup_n A_n)$ is always either 0 or 1; it is never $\frac12$ or $\frac13$ or any other value. In the next section we shall see that this statement is true even more generally.

We note that the independence assumption for part (ii) of Theorem 3.4.2 cannot simply be omitted. For example, consider infinite fair coin tossing, and let $A_1 = A_2 = A_3 = \ldots = \{r_1 = 1\}$, i.e. let all the events be the event that the first coin comes up heads. Then the $\{A_n\}$ are clearly not independent. And, we clearly have $P(\limsup_n A_n) = P(r_1 = 1) = \frac12$.
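The dichotomy in Theorem 3.4.2 is easy to see numerically. Here is a small simulation sketch (my own, not from the text): independent events $A_n$ with $P(A_n) = 1/n$ (divergent sum) keep occurring, while with $P(A_n) = 1/n^2$ (convergent sum) only finitely many occur.

```python
import random

random.seed(4)

def count_occurrences(p_fn, n_max):
    """Simulate independent events A_n with P(A_n) = p_fn(n); count how many occur."""
    return sum(random.random() < p_fn(n) for n in range(1, n_max + 1))

N = 100_000
print(count_occurrences(lambda n: 1.0 / n, N))      # divergent sum: many hits, ~ log N
print(count_occurrences(lambda n: 1.0 / n**2, N))   # convergent sum: only a few hits
```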
Theorem 3.4.2 provides very precise information about $P(\limsup_n A_n)$ in many cases. Consider again infinite fair coin tossing, with $H_n$ the event that the $n$th coin is heads. This theorem shows that $P(H_n \text{ i.o.}) = 1$, i.e. there is probability 1 that an infinite sequence of coins will contain infinitely many heads. Furthermore, $P(H_n \text{ a.a.}) = 1 - P(H_n^c \text{ i.o.}) = 1 - 1 = 0$, so the infinite sequence will never contain all but finitely many heads.

Similarly, we have that
$$P\big(H_{2^n+1} \cap H_{2^n+2} \cap \ldots \cap H_{2^n+[\log_2 n]} \text{ i.o.}\big) = 1,$$
since the events $\{H_{2^n+1} \cap H_{2^n+2} \cap \ldots \cap H_{2^n+[\log_2 n]}\}$ are seen to be independent for different values of $n$, and since their probabilities are approximately $1/n$, which sums to infinity. On the other hand,
$$P\big(H_{2^n+1} \cap H_{2^n+2} \cap \ldots \cap H_{2^n+[2\log_2 n]} \text{ i.o.}\big) = 0,$$
since in this case the probabilities are approximately $1/n^2$, which have finite sum.

An event like $\{B_n \text{ i.o.}\}$, where $B_n = H_n \cap H_{n+1}$, is more difficult. In this case $\sum_n P(B_n) = \sum_n (1/4) = \infty$. However, the $\{B_n\}$ are not independent, since $B_n$ and $B_{n+1}$ both involve the same event $H_{n+1}$ (i.e., the $(n+1)$st coin). Hence, Theorem 3.4.2 does not immediately apply. On the other hand, by considering the subsequence $n = 2k$ of indices, we see that $\{B_{2k}\}_{k=1}^\infty$ are independent, and $\sum_{k=1}^\infty P(B_{2k}) = \sum_{k=1}^\infty (1/4) = \infty$. Hence, $P(B_{2k} \text{ i.o.}) = 1$, so that $P(B_n \text{ i.o.}) = 1$ also.

For a similar but more complicated example, let $B_n = H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{n+[\log_2 \log_2 n]}$. Again, $\sum_n P(B_n) = \infty$, but the $\{B_n\}$ are not independent. But by considering the subsequence $n = 2^k$ of indices, we compute that $\{B_{2^k}\}$ are independent, and $\sum_k P(B_{2^k}) = \infty$. Hence, $P(B_{2^k} \text{ i.o.}) = 1$, so that $P(B_n \text{ i.o.}) = 1$ also.
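The flavour of these run-length results can be checked by simulation (a sketch of mine, not the text's): head runs of length about $\log_2 n$ keep recurring, while much longer runs do not. Below we simply track the longest run of heads among the first $N$ tosses, which grows roughly like $\log_2 N$.

```python
import math
import random

random.seed(5)

def longest_head_run(n_tosses):
    best = run = 0
    for _ in range(n_tosses):
        run = run + 1 if random.random() < 0.5 else 0
        best = max(best, run)
    return best

for N in (1_000, 10_000, 100_000):
    print(N, longest_head_run(N), round(math.log2(N), 1))   # longest run ~ log2(N)
```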
3.5. Tail fields.

Given a sequence of events $A_1, A_2, \ldots$, we define their tail field by
$$\tau = \bigcap_{n=1}^\infty \sigma(A_n, A_{n+1}, A_{n+2}, \ldots).$$

In words, an event $A \in \tau$ must have the property that for any $n$, it depends only on the events $A_n, A_{n+1}, \ldots$; in particular, it does not care about any finite number of the events $A_n$.

One might think that very few events could possibly be in the tail field, but in fact it sometimes contains many events. For example, if we are considering infinite fair coin tossing (Subsection 2.6), and $H_n$ is the event that the $n$th coin comes up heads, then $\tau$ includes the event $\limsup_n H_n$ that we obtain infinitely many heads; the event $\liminf_n H_n$ that we obtain only finitely many tails; the event $\limsup_n H_{2^n}$ that we obtain infinitely many heads on tosses $2, 4, 8, \ldots$; the event $\{\lim_{n\to\infty} \frac1n \sum_{i=1}^n r_i \le \frac12\}$ that the limiting fraction of heads is $\le \frac12$; the event $\{r_n = r_{n+1} = r_{n+2} \text{ i.o.}\}$ that we infinitely often obtain the same result on three consecutive coin flips; etc. So we see that $\tau$ contains many interesting events. A surprising theorem is

Theorem 3.5.1. (Kolmogorov Zero-One Law.) If events $A_1, A_2, \ldots$ are independent, with tail field $\tau$, and if $A \in \tau$, then $P(A) = 0$ or 1.

To prove this theorem, we need a technical result about independence.

Lemma 3.5.2. Let $B, B_1, B_2, \ldots$ be independent. Then $\{B\}$ and $\sigma(B_1, B_2, \ldots)$ are independent classes, i.e. if $S \in \sigma(B_1, B_2, \ldots)$, then $P(S \cap B) = P(S)\, P(B)$.

Proof. Assume that $P(B) > 0$; otherwise the statement is trivial. Let $\mathcal{J}$ be the collection of all sets of the form $D_{i_1} \cap D_{i_2} \cap \ldots \cap D_{i_n}$, where $n \in \mathbf{N}$ and where $D_{i_j}$ is either $B_{i_j}$ or $B_{i_j}^c$, together with $\emptyset$ and $\Omega$. Then for $A \in \mathcal{J}$, we have by independence that $P(A) = P(B \cap A)/P(B)$. Now define a new probability measure Q on $\sigma(B_1, B_2, \ldots)$ by $Q(S) = P(B \cap S)/P(B)$, for $S \in \sigma(B_1, B_2, \ldots)$. Then $Q(\emptyset) = 0$, $Q(\Omega) = 1$, and Q is countably additive since P is, so Q is indeed a probability measure. Furthermore, Q and P agree on $\mathcal{J}$. Hence, by Proposition 2.5.8, Q and P agree on $\sigma(\mathcal{J}) = \sigma(B_1, B_2, \ldots)$. That is, $P(S) = Q(S) = P(B \cap S)/P(B)$ for all $S \in \sigma(B_1, B_2, \ldots)$, as required. ∎

Applying this lemma twice, we obtain:

Corollary 3.5.3. Let $A_1, A_2, \ldots, B_1, B_2, \ldots$ be independent. Then if $S_1 \in \sigma(A_1, A_2, \ldots)$, then $S_1, B_1, B_2, \ldots$ are independent. Furthermore, the $\sigma$-algebras $\sigma(A_1, A_2, \ldots)$ and $\sigma(B_1, B_2, \ldots)$ are independent classes, i.e. if $S_1 \in \sigma(A_1, A_2, \ldots)$ and $S_2 \in \sigma(B_1, B_2, \ldots)$, then $P(S_1 \cap S_2) = P(S_1)\, P(S_2)$.

Proof. For any distinct $i_1, i_2, \ldots, i_n$, let $A = B_{i_1} \cap \ldots \cap B_{i_n}$. Then it follows immediately that $A, A_1, A_2, \ldots$ are independent. Hence, from Lemma 3.5.2, if $S_1 \in \sigma(A_1, A_2, \ldots)$, then $P(A \cap S_1) = P(A)\, P(S_1)$. Since this is true for all distinct $i_1, \ldots, i_n$, it follows that $S_1, B_1, B_2, \ldots$ are independent. Lemma 3.5.2 then implies that if $S_2 \in \sigma(B_1, B_2, \ldots)$, then $S_1$ and
$S_2$ are independent. ∎

Proof of Theorem 3.5.1. We can now easily prove the Kolmogorov Zero-One Law. The proof is rather remarkable! Indeed, $A \in \sigma(A_n, A_{n+1}, \ldots)$, so by Corollary 3.5.3, $A, A_1, A_2, \ldots, A_{n-1}$ are independent. Since this is true for all $n \in \mathbf{N}$, and since independence is defined in terms of finite subcollections only, it is also true that $A, A_1, A_2, \ldots$ are independent. Hence, from Lemma 3.5.2, $A$ and $S$ are independent for all $S \in \sigma(A_1, A_2, \ldots)$. On the other hand, $A \in \tau \subseteq \sigma(A_1, A_2, \ldots)$. It follows that $A$ is independent of itself (!). This implies that $P(A \cap A) = P(A)\, P(A)$. That is, $P(A) = P(A)^2$, so $P(A) = 0$ or 1. ∎

3.6. Exercises.

Exercise 3.6.1. Let $X$ be a real-valued random variable defined on a probability triple $(\Omega, \mathcal{F}, P)$. Fill in the following blanks:
(a) $\mathcal{F}$ is a collection of subsets of ______.
(b) $P(A)$ is a well-defined element of ______ provided that $A$ is an element of ______.
(c) $\{X \le 5\}$ is shorthand notation for the particular subset of ______ which is defined by: ______.
(d) If $S$ is a subset of ______, then $\{X \in S\}$ is a subset of ______.
(e) If $S$ is a ______ subset of ______, then $\{X \in S\}$ must be an element of ______.

Exercise 3.6.2. Let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$. Let $A = (1/2, 3/4)$ and $B = (0, 2/3)$. Are $A$ and $B$ independent events?

Exercise 3.6.3. Give an example of events $A$, $B$, and $C$, each of probability strictly between 0 and 1, such that
(a) $P(A \cap B) = P(A)P(B)$, $P(A \cap C) = P(A)P(C)$, and $P(B \cap C) = P(B)P(C)$; but it is not the case that $P(A \cap B \cap C) = P(A)P(B)P(C)$. [Hint: You can let $\Omega$ be a set of four equally likely points.]
(b) $P(A \cap B) = P(A)P(B)$, $P(A \cap C) = P(A)P(C)$, and $P(A \cap B \cap C) = P(A)P(B)P(C)$; but it is not the case that $P(B \cap C) = P(B)P(C)$. [Hint: You can let $\Omega$ be a set of eight equally likely points.]

Exercise 3.6.4. Suppose $\{A_n\} \nearrow A$. Let $f : \Omega \to \mathbf{R}$ be any function. Prove that $\lim_{n\to\infty} \inf_{\omega \in A_n} f(\omega) = \inf_{\omega \in A} f(\omega)$.
Exercise 3.6.5. Let $(\Omega, \mathcal{F}, P)$ be a probability triple such that $\Omega$ is countable. Prove that it is impossible for there to exist a sequence $A_1, A_2, \ldots \in \mathcal{F}$ which is independent, such that $P(A_i) = \frac12$ for each $i$. [Hint: First prove that for each $\omega \in \Omega$, and each $n \in \mathbf{N}$, we have $P(\{\omega\}) \le 1/2^n$. Then derive a contradiction.]

Exercise 3.6.6. Let $X$, $Y$, and $Z$ be three independent random variables, and set $W = X + Y$. Let $B_{k,n} = \{(n-1)2^{-k} < X \le n 2^{-k}\}$ and let $C_{k,m} = \{(m-1)2^{-k} < Y \le m 2^{-k}\}$. Let
$$A_k = \bigcup_{(n+m)2^{-k} < x} (B_{k,n} \cap C_{k,m}).$$
Fix $x, z \in \mathbf{R}$, and let $A = \{X + Y < x\} = \{W < x\}$ and $D = \{Z < z\}$.
(a) Prove that $\{A_k\} \nearrow A$.
(b) Prove that $A_k$ and $D$ are independent.
(c) By continuity of probabilities, prove that $A$ and $D$ are independent.
(d) Use this to prove that $W$ and $Z$ are independent.

Exercise 3.6.7. Let $(\Omega, \mathcal{F}, P)$ be the uniform distribution on $\Omega = \{1,2,3\}$, as in Example 2.2.2. Give an example of a sequence $A_1, A_2, \ldots \in \mathcal{F}$ such that
$$P\big(\liminf_n A_n\big) < \liminf_n P(A_n) < \limsup_n P(A_n) < P\big(\limsup_n A_n\big),$$
i.e. such that all three inequalities are strict.

Exercise 3.6.8. Let $\lambda$ be Lebesgue measure on $[0,1]$, and let $0 \le a \le b \le c \le d \le 1$ be arbitrary real numbers with $d \le b + c - a$. Give an example of a sequence $A_1, A_2, \ldots$ of intervals in $[0,1]$, such that $\lambda(\liminf_n A_n) = a$, $\liminf_n \lambda(A_n) = b$, $\limsup_n \lambda(A_n) = c$, and $\lambda(\limsup_n A_n) = d$. For bonus points, solve the question when $d > b + c - a$, with each $A_n$ a finite union of intervals.

Exercise 3.6.9. Let $A_1, A_2, \ldots, B_1, B_2, \ldots$ be events.
(a) Prove that
$$\big(\limsup_n A_n\big) \cap \big(\limsup_n B_n\big) \supseteq \limsup_n (A_n \cap B_n).$$
(b) Give an example where the above inclusion is strict, and another example where it holds with equality.
Exercise 3.6.10. Let $A_1, A_2, \ldots$ be a sequence of events, and let $N \in \mathbf{N}$. Suppose there are events $B$ and $C$ such that $B \subseteq A_n \subseteq C$ for all $n \ge N$, and such that $P(B) = P(C)$. Prove that $P(\liminf_n A_n) = P(\limsup_n A_n) = P(B) = P(C)$.

Exercise 3.6.11. Let $\{X_n\}_{n=1}^\infty$ be independent random variables, with $X_n \sim \text{Uniform}(\{1, 2, \ldots, n\})$ (cf. Example 2.2.2). Compute $P(X_n = 5 \text{ i.o.})$, the probability that an infinite number of the $X_n$ are equal to 5.

Exercise 3.6.12. Let $X$ be a random variable with $P(X > 0) > 0$. Prove that there is $\delta > 0$ such that $P(X \ge \delta) > 0$. [Hint: Don't forget continuity of probabilities.]

Exercise 3.6.13. Let $X_1, X_2, \ldots$ be defined jointly on some probability space $(\Omega, \mathcal{F}, P)$, with $E[X_i] = 0$ and $E[(X_i)^2] = 1$ for all $i$. Prove that $P(X_n \ge n \text{ i.o.}) = 0$.

Exercise 3.6.14. Let $\delta, \epsilon > 0$, and let $X_1, X_2, \ldots$ be a sequence of non-negative random variables such that $P(X_i \ge \delta) \ge \epsilon$ for all $i$. Prove that with probability one, $\sum_{i=1}^\infty X_i = \infty$.

Exercise 3.6.15. Let $A_1, A_2, \ldots$ be a sequence of events, such that (i) $A_{i_1}, A_{i_2}, \ldots, A_{i_k}$ are independent whenever $i_{j+1} > i_j + 2$ for $1 \le j \le k-1$, and (ii) $\sum_n P(A_n) = \infty$. Then the Borel-Cantelli Lemma does not directly apply. Still, prove that $P(\limsup_n A_n) = 1$.

Exercise 3.6.16. Consider infinite, independent, fair coin tossing as in Subsection 2.6, and let $H_n$ be the event that the $n$th coin is heads. Determine the following probabilities.
(a) $P(H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{n+9} \text{ i.o.})$.
(b) $P(H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{2n} \text{ i.o.})$.
(c) $P(H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{n+[2\log_2 n]} \text{ i.o.})$.
(d) Prove that $P(H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{n+[\log_2 n]} \text{ i.o.})$ must equal either 0 or 1.
(e) Determine $P(H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{n+[\log_2 n]} \text{ i.o.})$. [Hint: Find the right subsequence of indices.]

Exercise 3.6.17. Show that Lemma 3.5.2 is false if we require only that $P(B \cap B_n) = P(B)\, P(B_n)$ for each $n \in \mathbf{N}$, but do not require that the $\{B_n\}$ be independent of each other. [Hint: Don't forget Exercise 3.6.3(a).]

Exercise 3.6.18. Let $A_1, A_2, \ldots$ be any independent sequence of events, and let $S_x = \{\lim_{n\to\infty} \frac1n \sum_{i=1}^n \mathbf{1}_{A_i} \le x\}$. Prove that for each $x \in \mathbf{R}$ we have $P(S_x) = 0$ or 1.
Exercise 3.6.19. Let $A_1, A_2, \ldots$ be independent events. Let $Y$ be a random variable which is measurable with respect to $\sigma(A_n, A_{n+1}, \ldots)$ for each $n \in \mathbf{N}$. Prove that there is a real number $a$ such that $P(Y = a) = 1$. [Hint: Consider $P(Y \le x)$ for $x \in \mathbf{R}$; what values can it take?]

3.7. Section summary.

In this section, we defined random variables, which are functions on the state space. We also defined independence of events and of random variables. We derived the continuity property of probability measures. We defined limit events and proved the important Borel-Cantelli Lemma. We defined tail fields and proved the remarkable Kolmogorov Zero-One Law.
4. Expected values.

There is one more notion that is fundamental to all of probability theory, that of expected values. The general definition of expected value will be developed in this section.

4.1. Simple random variables.

Let $(\Omega, \mathcal{F}, P)$ be a probability triple, and let $X$ be a random variable defined on this triple. We begin with a definition.

Definition 4.1.1. A random variable $X$ is simple if $\text{range}(X)$ is finite, where $\text{range}(X) = \{X(\omega);\ \omega \in \Omega\}$.

That is, a random variable is simple if it takes on only a finite number of different values. If $X$ is a simple random variable, then listing the distinct elements of its range as $x_1, x_2, \ldots, x_n$, we can then write $X = \sum_{i=1}^n x_i \mathbf{1}_{A_i}$, where $A_i = \{\omega \in \Omega;\ X(\omega) = x_i\} = X^{-1}(\{x_i\})$, and where the $\mathbf{1}_{A_i}$ are indicator functions. We note that the sets $A_i$ form a finite partition of $\Omega$.

For such a simple random variable $X = \sum_{i=1}^n x_i \mathbf{1}_{A_i}$, we define its expected value or expectation or mean by $E(X) = \sum_{i=1}^n x_i P(A_i)$. That is,
$$E\Big(\sum_{i=1}^n x_i \mathbf{1}_{A_i}\Big) = \sum_{i=1}^n x_i P(A_i), \qquad \{A_i\} \text{ a finite partition of } \Omega. \qquad (4.1.2)$$
We sometimes write $\mu_X$ for $E(X)$.

Exercise 4.1.3. Prove that (4.1.2) is well-defined, in the sense that if $\{A_i\}$ and $\{B_j\}$ are two different finite partitions of $\Omega$, such that $\sum_{i=1}^n x_i \mathbf{1}_{A_i} = \sum_{j=1}^m y_j \mathbf{1}_{B_j}$, then $\sum_{i=1}^n x_i P(A_i) = \sum_{j=1}^m y_j P(B_j)$. [Hint: collect together those $A_i$ and $B_j$ corresponding to the same values of $x_i$ and $y_j$.]

For a quick example, let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$, and define a simple random variable $Y$ by
$$Y(\omega) = \begin{cases} 2, & \omega \text{ rational} \\ 4, & \omega = 1/\sqrt{2} \\ 6, & \text{other } \omega < 1/4 \\ 8, & \text{otherwise.} \end{cases}$$
Since the first two cases each have probability 0, it is easily seen that $E(Y) = 6 \cdot \frac14 + 8 \cdot \frac34 = 15/2$.

From equation (4.1.2), we see immediately that $E(\mathbf{1}_A) = P(A)$, and that $E(c) = c$ for a constant $c$.
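Definition (4.1.2) is mechanical enough to code directly. Here is a small sketch (my own addition; the helper `expectation` is hypothetical, not from the text) that represents a simple random variable by its (value, probability) partition and evaluates the sum, using the example $Y$ above:

```python
def expectation(partition):
    """E(X) = sum of x_i * P(A_i) over a finite partition, as in (4.1.2)."""
    assert abs(sum(p for _, p in partition) - 1.0) < 1e-12
    return sum(x * p for x, p in partition)

# The simple random variable Y above: the rationals and {1/sqrt(2)} have
# Lebesgue measure zero, the remaining omega < 1/4 have measure 1/4, and
# the rest have measure 3/4.
Y = [(2, 0.0), (4, 0.0), (6, 0.25), (8, 0.75)]
print(expectation(Y))    # 7.5 = 15/2
```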
We now claim that $E(\cdot)$ is linear. Indeed, if $X = \sum_i x_i \mathbf{1}_{A_i}$ and $Y = \sum_j y_j \mathbf{1}_{B_j}$, where $\{A_i\}$ and $\{B_j\}$ are finite partitions of $\Omega$, and if $a, b \in \mathbf{R}$, then $\{A_i \cap B_j\}$ is again a finite partition of $\Omega$, and we have
$$E(aX + bY) = E\Big(\sum_{i,j} (a x_i + b y_j)\, \mathbf{1}_{A_i \cap B_j}\Big) = \sum_{i,j} (a x_i + b y_j)\, P(A_i \cap B_j) = a\,E(X) + b\,E(Y),$$
as claimed. It follows that $E\big(\sum_{i=1}^n x_i \mathbf{1}_{A_i}\big) = \sum_{i=1}^n x_i P(A_i)$ for any finite collection of subsets $A_i \subseteq \Omega$, even if they do not form a partition.

It also follows that $E(\cdot)$ is order-preserving, i.e. if $X \le Y$ (meaning that $X(\omega) \le Y(\omega)$ for all $\omega \in \Omega$), then $E(X) \le E(Y)$. Indeed, in that case $Y - X \ge 0$, so from (4.1.2) we have $E(Y - X) \ge 0$; by linearity this implies that $E(Y) - E(X) \ge 0$. In particular, since $-|X| \le X \le |X|$, we have $|E(X)| \le E(|X|)$, which is sometimes referred to as the (generalised) triangle inequality. If $X$ takes on the two values $a$ and $b$, each with probability $\frac12$, then this inequality reduces to the usual $|a + b| \le |a| + |b|$.

Finally, if $X$ and $Y$ are independent simple random variables, then $E(XY) = E(X)\,E(Y)$. Indeed, again writing $X = \sum_i x_i \mathbf{1}_{A_i}$ and $Y = \sum_j y_j \mathbf{1}_{B_j}$, where $\{A_i\}$ and $\{B_j\}$ are finite partitions of $\Omega$ and where the $\{x_i\}$ are distinct and the $\{y_j\}$ are distinct, we see that $X$ and $Y$ are independent if and only if $P(A_i \cap B_j) = P(A_i)\, P(B_j)$ for all $i$ and $j$. In that case,
$$E(XY) = \sum_{i,j} x_i y_j\, P(A_i \cap B_j) = \sum_{i,j} x_i y_j\, P(A_i)\, P(B_j) = E(X)\,E(Y),$$
as claimed. Note that this may be false if $X$ and $Y$ are not independent; for example, if $X$ takes on the values $\pm 1$, each with probability $\frac12$, and if $Y = X$, then $E(X) = E(Y) = 0$ but $E(XY) = 1$. Also, we may have $E(XY) = E(X)\,E(Y)$ even if $X$ and $Y$ are not independent; for example, this occurs if $X$ takes on the three values 0, 1, and 2 each with probability $\frac13$, and if $Y$ is defined by $Y(\omega) = 1$ whenever $X(\omega) = 0$ or 2, and $Y(\omega) = 5$ whenever $X(\omega) = 1$.

If $X = \sum_{i=1}^n x_i \mathbf{1}_{A_i}$, with $\{A_i\}$ a finite partition of $\Omega$, and if $f : \mathbf{R} \to \mathbf{R}$ is any function, then $f(X) = \sum_{i=1}^n f(x_i) \mathbf{1}_{A_i}$ is also a simple random variable, with $E(f(X)) = \sum_{i=1}^n f(x_i) P(A_i)$. In particular, if $f(x) = (x - \mu_X)^2$, we get the variance of $X$, defined by $\text{Var}(X) = E\big((X - \mu_X)^2\big)$. Clearly $\text{Var}(X) \ge 0$. Expanding the square and using linearity, we see that $\text{Var}(X) = E(X^2) - \mu_X^2 = E(X^2) - E(X)^2$. In particular, we always have
$$\text{Var}(X) \le E(X^2). \qquad (4.1.4)$$
It also follows immediately that
$$\text{Var}(\alpha X + \beta) = \alpha^2\, \text{Var}(X). \qquad (4.1.5)$$
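Returning to the product rule above, here is a quick enumeration sketch (my own, not from the text) contrasting the independent case with the dependent counterexample $Y = X$:

```python
from itertools import product

# Two independent fair +/-1 variables: E(XY) = E(X) E(Y) = 0.
outcomes = [(x, y, 0.25) for x, y in product((-1, 1), repeat=2)]
E_X  = sum(x * p for x, y, p in outcomes)
E_Y  = sum(y * p for x, y, p in outcomes)
E_XY = sum(x * y * p for x, y, p in outcomes)
print(E_XY, E_X * E_Y)                       # 0.0 0.0

# The dependent case Y = X: now E(XY) = E(X^2) = 1, although E(X) E(Y) = 0.
dep = [(x, x, 0.5) for x in (-1, 1)]
print(sum(x * y * p for x, y, p in dep))     # 1.0
```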
We also see that $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$, where $\text{Cov}(X, Y) = E\big((X - \mu_X)(Y - \mu_Y)\big) = E(XY) - E(X)E(Y)$ is the covariance; in particular, if $X$ and $Y$ are independent then $\text{Cov}(X, Y) = 0$, so that $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$. More generally, $\text{Var}\big(\sum_i X_i\big) = \sum_i \text{Var}(X_i) + 2\sum_{i<j} \text{Cov}(X_i, X_j)$, so we see that
$$\text{Var}(X_1 + \ldots + X_n) = \text{Var}(X_1) + \ldots + \text{Var}(X_n), \qquad \{X_n\} \text{ independent}. \qquad (4.1.6)$$
Finally, if $\text{Var}(X) > 0$ and $\text{Var}(Y) > 0$, then the correlation between $X$ and $Y$ is defined by $\text{Corr}(X, Y) = \text{Cov}(X, Y)\big/\sqrt{\text{Var}(X)\,\text{Var}(Y)}$; see Exercises 4.5.11 and 5.5.6.

This concludes our discussion of the basic properties of expectation for simple random variables. (Indeed, it is possible to read Section 5 immediately at this point, provided that one restricts attention to simple random variables only.) We now note a fact that will help us to define $E(X)$ for random variables $X$ which are not simple. It follows immediately from the order-preserving property of $E(\cdot)$.

Proposition 4.1.7. If $X$ is a simple random variable, then
$$E(X) = \sup\{E(Y);\ Y \text{ simple},\ Y \le X\}.$$

4.2. General non-negative random variables.

If $X$ is not simple, then it is not clear how to define its expected value $E(X)$. However, Proposition 4.1.7 provides a suggestion of how to proceed. Indeed, for a general non-negative random variable $X$, we define the expected value $E(X)$ by
$$E(X) = \sup\{E(Y);\ Y \text{ simple},\ Y \le X\}.$$
By Proposition 4.1.7, if $X$ happens to be a simple random variable then this definition agrees with the previous one, so there is no confusion in reusing the same symbol $E(\cdot)$. Indeed, this one single definition (with a minor modification for negative values in Subsection 4.3 below) will apply to all random variables, be they discrete, absolutely continuous, or neither (cf. Section 6).

We note that it is indeed possible that $E(X)$ will be infinite. For example, suppose $(\Omega, \mathcal{F}, P)$ is Lebesgue measure on $[0,1]$, and define $X$ by $X(\omega) = 2^n$ for $2^{-n} \le \omega < 2^{-(n-1)}$. (See Figure 4.2.1.) Then $E(X) \ge \sum_{i=1}^N 2^i\, 2^{-i} = N$ for any $N \in \mathbf{N}$. Hence, $E(X) = \infty$.

Recall that for $k \in \mathbf{N}$, the $k$th moment of a random variable $X$ is defined to be $E(X^k)$; it is finite if $E(|X|^k) < \infty$.
[Figure 4.2.1: A random variable $X$ having $E(X) = \infty$.]

Since $|x|^{k-1} \le \max(|x|^k, 1) \le |x|^k + 1$ for any $x \in \mathbf{R}$, we see that if $E(|X|^k)$ is finite, then so is $E(|X|^{k-1})$.

It is immediately apparent that our general definition of $E(\cdot)$ is still order-preserving. However, proving linearity is less clear. To assist, we have the following result. We say that $\{X_n\} \nearrow X$ if $X_1 \le X_2 \le \ldots$, and also $\lim_{n\to\infty} X_n(\omega) = X(\omega)$ for each $\omega \in \Omega$. (That is, the sequence $\{X_n\}$ converges monotonically to $X$.)

Theorem 4.2.2. (The monotone convergence theorem.) Suppose $X_1, X_2, \ldots$ are random variables with $E(X_1) > -\infty$, and $\{X_n\} \nearrow X$. Then $X$ is a random variable, and $\lim_{n\to\infty} E(X_n) = E(X)$.

Proof. We know from (3.1.6) that $X$ is a random variable (alternatively, simply note that in this case, $\{X \le x\} = \bigcap_n \{X_n \le x\} \in \mathcal{F}$ for all $x \in \mathbf{R}$). Furthermore, by monotonicity we have $E(X_1) \le E(X_2) \le \ldots \le E(X)$, so that $\lim_n E(X_n)$ exists (though it may be infinite if $E(X) = +\infty$), and is $\le E(X)$.

To finish, it suffices to show that $\lim_n E(X_n) \ge E(X)$. If $E(X_1) = +\infty$ this is trivial, so assume $E(X_1)$ is finite. Then, by replacing $X_n$ by $X_n - X_1$ and $X$ by $X - X_1$, it suffices to assume the $X_n$ and $X$ are non-negative.
By the definition of $E(X)$ for non-negative $X$, it suffices to show that $\lim_n E(X_n) \ge E(Y)$ for any simple random variable $Y \le X$. Writing $Y = \sum_i v_i \mathbf{1}_{A_i}$, we see that it suffices to prove that $\lim_n E(X_n) \ge \sum_i v_i P(A_i)$, where $\{A_i\}$ is any finite partition of $\Omega$, with $v_i \le X(\omega)$ for all $\omega \in A_i$.

To that end, choose $\epsilon > 0$, and set $A_{in} = \{\omega \in A_i;\ X_n(\omega) \ge v_i - \epsilon\}$. Then $\{A_{in}\} \nearrow A_i$ as $n \to \infty$. Furthermore, $E(X_n) \ge \sum_i (v_i - \epsilon) P(A_{in})$. As $n \to \infty$, by continuity of probabilities this converges to $\sum_i (v_i - \epsilon) P(A_i) = \sum_i v_i P(A_i) - \epsilon$. Hence, $\lim_n E(X_n) \ge \sum_i v_i P(A_i) - \epsilon$. Since this is true for any $\epsilon > 0$, we must (cf. Proposition A.3.1) have $\lim_n E(X_n) \ge \sum_i v_i P(A_i)$, as required. ∎

Remark 4.2.3. Since expected values are unchanged if we modify the random variable's values on sets of probability 0, we still have $\lim_{n\to\infty} E(X_n) = E(X)$ provided $\{X_n\} \nearrow X$ almost surely (a.s.), i.e. on a subset of $\Omega$ having probability 1. (Compare Remark 3.1.9.)

We note that the monotonicity assumption of Theorem 4.2.2 is indeed necessary. For example, if $(\Omega, \mathcal{F}, P)$ is Lebesgue measure on $[0,1]$, and if $X_n = n\,\mathbf{1}_{(0, 1/n)}$, then $X_n \to 0$ (since for each $\omega \in [0,1]$ we have $X_n(\omega) = 0$ for all $n \ge 1/\omega$), but $E(X_n) = 1$ for all $n$.

To make use of Theorem 4.2.2, set $\Psi_n(x) = \min(n, 2^{-n} \lfloor 2^n x \rfloor)$ for $x \ge 0$, where $\lfloor r \rfloor$ is the floor of $r$, or greatest integer not exceeding $r$. (See Figure 4.2.4.) Then $\Psi_n(x)$ is a slightly rounded-down version of $x$, truncated at $n$. Indeed, for fixed $x \ge 0$ we have that $\Psi_n(x) \ge 0$, and $\{\Psi_n(x)\} \nearrow x$ as $n \to \infty$. Furthermore, the range of $\Psi_n$ is finite (of size $n 2^n + 1$).

[Figure 4.2.4: A graph of $y = \Psi_2(x)$, lying just below the line $y = x$.]

Hence, this shows:

Proposition 4.2.5. Let $X$ be a general non-negative random variable. Set $X_n = \Psi_n(X)$ with $\Psi_n$ as above. Then $X_n \ge 0$ and $\{X_n\} \nearrow X$, and each $X_n$ is a simple random variable. In particular, there exists a sequence of simple random variables increasing to $X$.

Using Theorem 4.2.2 and Proposition 4.2.5, it is straightforward to prove the linearity of expected values for general non-negative random variables. Indeed, with $X_n = \Psi_n(X)$ and $Y_n = \Psi_n(Y)$, we have (using linearity of expectation for simple random variables) that for $X, Y \ge 0$ and $a, b \ge 0$,
$$E(aX + bY) = \lim_n E(aX_n + bY_n) = \lim_n \big(a\,E(X_n) + b\,E(Y_n)\big) = a\,E(X) + b\,E(Y). \qquad (4.2.6)$$
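The function $\Psi_n$ is easy to implement, and one can watch $E(\Psi_n(X))$ climb toward $E(X)$. Here is a quick sketch (mine, not the book's) using $X(\omega) = \omega^2$ on $[0,1]$, whose expected value is $1/3$:

```python
import math
import random

def psi(n, x):
    """Psi_n(x) = min(n, 2^-n * floor(2^n x)): round x down to a dyadic grid, cap at n."""
    return min(n, math.floor(2**n * x) / 2**n)

random.seed(6)
sample = [random.random() ** 2 for _ in range(200_000)]   # X = omega^2, E(X) = 1/3

for n in (1, 2, 4, 8):
    approx = sum(psi(n, x) for x in sample) / len(sample)
    print(n, approx)    # increases toward 1/3 as n grows
```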
Similarly, if $X$ and $Y$ are independent, then by Proposition 3.2.3, $X_n$ and $Y_n$ are also independent. Hence, if $E(X)$ and $E(Y)$ are finite,
$$E(XY) = \lim_n E(X_n Y_n) = \lim_n E(X_n)\, E(Y_n) = E(X)\, E(Y), \qquad (4.2.7)$$
as was the case for simple random variables. It then follows that $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$ exactly as in the previous case.

We also have countable linearity. Indeed, if $X_1, X_2, \ldots \ge 0$, then by the monotone convergence theorem,
$$E(X_1 + X_2 + \ldots) = E\Big(\lim_n (X_1 + X_2 + \ldots + X_n)\Big) = \lim_n E(X_1 + X_2 + \ldots + X_n)$$
$$= \lim_n \big[E(X_1) + E(X_2) + \ldots + E(X_n)\big] = E(X_1) + E(X_2) + \ldots. \qquad (4.2.8)$$
If the $X_i$ are not necessarily non-negative, then (4.2.8) may fail, though it still holds under certain conditions; see Exercise 4.5.14 (and Corollary 9.4.4).

We also have the following.

Proposition 4.2.9. If $X$ is a non-negative random variable, then $\sum_{k=1}^\infty P(X \ge k) = E\lfloor X \rfloor$, where $\lfloor X \rfloor$ is the greatest integer not exceeding $X$. (In particular, if $X$ is non-negative-integer valued, then $\sum_{k=1}^\infty P(X \ge k) = E(X)$.)

Proof. We compute that
$$\sum_{k=1}^\infty P(X \ge k) = \sum_{k=1}^\infty \big[P(k \le X < k+1) + P(k+1 \le X < k+2) + \ldots\big]$$
$$= \sum_{\ell=1}^\infty \ell\, P(\ell \le X < \ell+1) = \sum_{\ell=1}^\infty \ell\, P(\lfloor X \rfloor = \ell) = E\lfloor X \rfloor. \quad ∎$$
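Proposition 4.2.9 can be verified exactly on a small discrete example (a sketch of my own, not from the text): for a fair die roll, both the tail sum and the mean equal $7/2$.

```python
from fractions import Fraction

# X = result of one fair die roll, so P(X >= k) = (7 - k)/6 for k = 1..6.
tail_sum = sum(Fraction(7 - k, 6) for k in range(1, 7))
mean     = sum(Fraction(x, 6) for x in range(1, 7))
print(tail_sum, mean)    # both 7/2, as Proposition 4.2.9 predicts
```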
4.3. Arbitrary random variables.

Finally, we consider random variables which may be neither simple nor non-negative. For such a random variable $X$, we may write $X = X^+ - X^-$, where $X^+(\omega) = \max(X(\omega), 0)$ and $X^-(\omega) = \max(-X(\omega), 0)$. Both $X^+$ and $X^-$ are non-negative, so the theory of the previous subsection applies to them. We may then set
$$E(X) = E(X^+) - E(X^-). \qquad (4.3.1)$$

We note that $E(X)$ is undefined if both $E(X^+)$ and $E(X^-)$ are infinite. However, if $E(X^+) = \infty$ and $E(X^-) < \infty$, then we take $E(X) = \infty$. Similarly, if $E(X^+) < \infty$ and $E(X^-) = \infty$, then we take $E(X) = -\infty$. Obviously, if $E(X^+) < \infty$ and $E(X^-) < \infty$, then $E(X)$ will be a finite number.

We next check that, with this modification, expected value retains the basic properties of being order-preserving, linear, etc.:

Exercise 4.3.2. Let $X$ and $Y$ be two general random variables (not necessarily non-negative) with well-defined means, such that $X \le Y$.
(a) Prove that $X^+ \le Y^+$ and $X^- \ge Y^-$.
(b) Prove that expectation is still order-preserving, i.e. that $E(X) \le E(Y)$ under these assumptions.

Exercise 4.3.3. Let $X$ and $Y$ be two general random variables with finite means, and let $Z = X + Y$.
(a) Express $Z^+$ and $Z^-$ in terms of $X^+$, $X^-$, $Y^+$, and $Y^-$.
(b) Prove that $E(Z) = E(X) + E(Y)$, i.e. that $E(Z^+) - E(Z^-) = E(X^+) - E(X^-) + E(Y^+) - E(Y^-)$. [Hint: Re-arrange the relations of part (a) so that you can make use of (4.2.6).]
(c) Prove that expectation is still (finitely) linear, for general random variables with finite means.

Exercise 4.3.4. Let $X$ and $Y$ be two independent general random variables with finite means, and let $Z = XY$.
(a) Prove that $X^+$ and $Y^+$ are independent, and similarly for each of $X^+$ and $Y^-$, and $X^-$ and $Y^+$, and $X^-$ and $Y^-$.
(b) Express $Z^+$ and $Z^-$ in terms of $X^+$, $X^-$, $Y^+$, and $Y^-$.
(c) Prove that $E(XY) = E(X)\, E(Y)$.

4.4. The integration connection.

Given a probability triple $(\Omega, \mathcal{F}, P)$, we sometimes write $E(X)$ as $\int_\Omega X\, dP$ or $\int_\Omega X(\omega)\, P(d\omega)$ (the "$\Omega$" is sometimes omitted). We call this the integral of the (measurable) function $X$ with respect to the (probability) measure P. Why do we make this identification? Well, certainly expected value satisfies some similar properties to that of the integral: it is linear, order-preserving, etc. But a more convincing reason is given by

Theorem 4.4.1. Let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$. Let $X : [0,1] \to \mathbf{R}$ be a bounded function which is Riemann integrable (i.e. integrable in the usual calculus sense). Then $X$ is a random variable with respect to $(\Omega, \mathcal{F}, P)$, and $E(X) = \int_0^1 X(t)\, dt$. In words, the expected value of the random variable $X$ is equal to the calculus-style integral of the function $X$.

Proof. Recall the definitions of lower and upper integrals, viz.
$$L\int_0^1 X = \sup\Big\{\sum_{i=1}^n (t_i - t_{i-1}) \inf_{t_{i-1} \le t \le t_i} X(t);\ 0 = t_0 < t_1 < \ldots < t_n = 1\Big\};$$
$$U\int_0^1 X = \inf\Big\{\sum_{i=1}^n (t_i - t_{i-1}) \sup_{t_{i-1} \le t \le t_i} X(t);\ 0 = t_0 < t_1 < \ldots < t_n = 1\Big\}.$$
Recall that we always have $L\int_0^1 X \le U\int_0^1 X$, and that $X$ is Riemann integrable if $L\int_0^1 X = U\int_0^1 X$, in which case $\int_0^1 X(t)\, dt$ is defined to be this common value.

But $\sum_{i=1}^n (t_i - t_{i-1}) \inf_{t_{i-1} \le t \le t_i} X(t) = E(Y)$, where the simple random variable $Y$ is defined by $Y(\omega) = \inf_{t_{i-1} \le t \le t_i} X(t)$ whenever $t_{i-1} \le \omega < t_i$. Hence, since $X$ is Riemann integrable, for each $n \in \mathbf{N}$ we can find a simple
random variable $Y_n \le X$ with $E(Y_n) \ge L\int_0^1 X(t)\, dt - \frac1n$. Similarly we can find $Z_n \ge X$ with $E(Z_n) \le U\int_0^1 X(t)\, dt + \frac1n$. Let $A_n = \max(Y_1, \ldots, Y_n)$, and $B_n = \min(Z_1, \ldots, Z_n)$. We claim that $\{A_n\} \nearrow X$ a.s. Indeed, $\{A_n\}$ is increasing by construction. Furthermore, $A_n \le X \le B_n$, and
$$\lim_{n\to\infty} E(A_n) = \lim_{n\to\infty} E(B_n) = \int_0^1 X(t)\, dt.$$
Hence, if $S_k = \{\omega \in \Omega : \lim_{n\to\infty} (B_n(\omega) - A_n(\omega)) \ge 1/k\}$, then $0 = \lim_{n\to\infty} E(B_n - A_n) \ge P(S_k)/k$, so $P(S_k) = 0$ for all $k \in \mathbf{N}$. Then $P[\lim_{n\to\infty} A_n < X] \le P[\lim_{n\to\infty} A_n < \lim_{n\to\infty} B_n] \le P[\bigcup_k S_k] = 0$ by countable subadditivity, proving the claim. Hence, by Theorem 4.2.2 and Remarks 4.2.3 and 3.1.9, $X$ must be a random variable, with $E(X) = \lim_{n\to\infty} E(A_n) = \int_0^1 X(t)\, dt$. ∎

On the other hand, there are many functions $X$ which are not Riemann integrable, but which nevertheless are random variables with respect to Lebesgue measure, and thus have well-defined expected values. For example, if $X$ is defined by $X(t) = 1$ for $t$ irrational, but $X(t) = 0$ for $t$ rational, i.e. $X = \mathbf{1}_{\mathbf{Q}^c}$ where $\mathbf{Q}^c$ is the set of irrational numbers, then $X$ is not Riemann-integrable, but we still have $E(X) = \int X\, dP = 1$ being perfectly well-defined.

We sometimes call the expected value of $X$, with respect to the Lebesgue measure probability triple, its Lebesgue integral, and if $E|X| < \infty$ we say that $X$ is Lebesgue integrable. By Theorem 4.4.1, the Lebesgue integral is a generalisation of the Riemann integral. Hence, in addition to learning many things about probability theory, we are also learning more about integration at the same time!

We also note that we do not need to restrict attention to integrals over the unit interval $[0,1]$. Indeed, if $X : \mathbf{R} \to [0, \infty)$ is Borel-measurable, then we can define $\int_{\mathbf{R}} X\, d\lambda$, where $\lambda$ stands for Lebesgue measure on the entire real line $\mathbf{R}$, by
$$\int_{\mathbf{R}} X(t)\, \lambda(dt) = \sum_{n \in \mathbf{Z}} \int_0^1 X(n + t)\, P(dt), \qquad (4.4.2)$$
or equivalently
$$\int_{\mathbf{R}} X(t)\, \lambda(dt) = \sum_{n \in \mathbf{Z}} E(Y_n),$$
where $Y_n(t) = X(n + t)$ and the expectation is with respect to Lebesgue measure on $[0,1]$. That is, we can integrate a non-negative function over the entire real line by adding up its integral over each interval $[n, n+1)$. (For general $X$, we can then write $X = X^+ - X^-$, as usual.) In other words, we can represent $\lambda$, Lebesgue measure on $\mathbf{R}$, as a sum of countably many copies of Lebesgue measure on unit intervals.
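Equation (4.4.2) translates directly into code. Here is a rough Monte Carlo sketch (my own addition; the truncation range and sample counts are arbitrary choices of mine) integrating $X(t) = e^{-|t|}$ over $\mathbf{R}$, whose exact integral is 2:

```python
import math
import random

random.seed(7)

def integral_over_R(f, n_range=20, samples=5_000):
    """Approximate (4.4.2): sum over n of the integral of f(n + t) dt on [0,1)."""
    total = 0.0
    for n in range(-n_range, n_range):
        total += sum(f(n + random.random()) for _ in range(samples)) / samples
    return total

print(integral_over_R(lambda t: math.exp(-abs(t))))   # approximately 2.0
```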
We shall see in Section 6 that we may use Lebesgue measure on $\mathbf{R}$ to help us define the distributions of general absolutely-continuous random variables.

Remark 4.4.3. Measures like $\lambda$ above, which are not finite but which can be written as the countable sum of finite measures, are called $\sigma$-finite measures. They are not probability measures, and cannot be used to define probability spaces; however by countable additivity they still satisfy many properties that probability measures do. Of course, unlike in the probability measure case, $\int_{\mathbf{R}} X\, d\lambda$ may be infinite or undefined even if $X$ is bounded.

4.5. Exercises.

Exercise 4.5.1. Let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$, and set
$$X(\omega) = \begin{cases} 1, & 0 \le \omega < 1/4 \\ 2\omega^2, & 1/4 \le \omega < 3/4 \\ \omega^2, & 3/4 \le \omega \le 1. \end{cases}$$
Compute $P(X \in A)$ where
(a) $A = [0,1]$.
(b) $A = [\frac12, 1]$.

Exercise 4.5.2. Let $X$ be a random variable with finite mean, and let $a \in \mathbf{R}$ be any real number. Prove that $E(\max(X, a)) \ge \max(E(X), a)$. [Hint: Consider separately the cases $E(X) \ge a$ and $E(X) < a$.] (See also Exercise 5.5.7.)

Exercise 4.5.3. Give an example of random variables $X$ and $Y$ defined on Lebesgue measure on $[0,1]$, such that $P(X > Y) > \frac12$, but $E(X) < E(Y)$.

Exercise 4.5.4. Let $(\Omega, \mathcal{F}, P)$ be the uniform distribution on $\Omega = \{1,2,3\}$, as in Example 2.2.2. Find random variables $X$, $Y$, and $Z$ on $(\Omega, \mathcal{F}, P)$ such that $P(X > Y)\, P(Y > Z)\, P(Z > X) > 0$, and $E(X) = E(Y) = E(Z)$.

Exercise 4.5.5. Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$, and suppose that $\Omega$ is a finite set. Prove that $X$ is a simple random variable.

Exercise 4.5.6. Let $X$ be a random variable defined on Lebesgue measure on $[0,1]$, and suppose that $X$ is a one-to-one function, i.e. that if $\omega_1 \ne \omega_2$ then $X(\omega_1) \ne X(\omega_2)$. Prove that $X$ is not a simple random variable.
Exercise 4.5.7. (Principle of inclusion-exclusion, general case.) Let $A_1, A_2, \ldots, A_n \in \mathcal{F}$. Generalise the principle of inclusion-exclusion to:
$$P(A_1 \cup \ldots \cup A_n) = \sum_{i=1}^n P(A_i) - \sum_{1 \le i < j \le n} P(A_i \cap A_j) + \sum_{1 \le i < j < k \le n} P(A_i \cap A_j \cap A_k) - \ldots \pm P(A_1 \cap \ldots \cap A_n).$$
[Hint: Expand $1 - \prod_{i=1}^n (1 - \mathbf{1}_{A_i})$, and take expectations of both sides.]

Exercise 4.5.8. Let $f(x) = ax^2 + bx + c$ be a second-degree polynomial function (where $a, b, c \in \mathbf{R}$ are constants).
(a) Find necessary and sufficient conditions on $a$, $b$, and $c$ such that the equation $E(f(\alpha X)) = \alpha^2 E(f(X))$ holds for all $\alpha \in \mathbf{R}$ and all random variables $X$.
(b) Find necessary and sufficient conditions on $a$, $b$, and $c$ such that the equation $E(f(X - \beta)) = E(f(X))$ holds for all $\beta \in \mathbf{R}$ and all random variables $X$.
(c) Do parts (a) and (b) account for the properties of the variance function? Why or why not?

Exercise 4.5.9. In proving property (4.1.6) of variance, why did we not simply proceed by induction on $n$? That is, suppose we know that $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$ whenever $X$ and $Y$ are independent. Does it follow easily that $\text{Var}(X + Y + Z) = \text{Var}(X) + \text{Var}(Y) + \text{Var}(Z)$ whenever $X$, $Y$, and $Z$ are independent? Why or why not? How does Exercise 3.6.6 fit in?

Exercise 4.5.10. Let $X_1, X_2, \ldots$ be i.i.d. with mean $\mu$ and variance $\sigma^2$, and let $N$ be an integer-valued random variable with mean $m$ and variance $v$, with $N$ independent of all the $X_i$. Let $S = X_1 + \ldots + X_N = \sum_{i=1}^\infty X_i \mathbf{1}_{N \ge i}$. Compute $\text{Var}(S)$ in terms of $\mu$, $\sigma^2$, $m$, and $v$.

Exercise 4.5.11. Let $X$ and $Z$ be independent, each with the standard normal distribution, let $a, b \in \mathbf{R}$ (not both 0), and let $Y = aX + bZ$.
(a) Compute $\text{Corr}(X, Y)$.
(b) Show that $|\text{Corr}(X, Y)| \le 1$ in this case. (Compare Exercise 5.5.6.)
(c) Give necessary and sufficient conditions on the values of $a$ and $b$ such that $\text{Corr}(X, Y) = 1$.
(d) Give necessary and sufficient conditions on the values of $a$ and $b$ such that $\text{Corr}(X, Y) = -1$.
Exercise 4.5.12. Let $X$ and $Y$ be independent general non-negative random variables, and let $X_n = \Psi_n(X)$, where $\Psi_n(x) = \min(n, 2^{-n}\lfloor 2^n x \rfloor)$ as in Proposition 4.2.5.
(a) Give an example of a sequence of functions $\Phi_n : [0,\infty) \to [0,\infty)$, other than $\Phi_n(x) = \Psi_n(x)$, such that for all $x$, $0 \le \Phi_n(x) \le x$ and $\{\Phi_n(x)\} \nearrow x$ as $n \to \infty$.
(b) Suppose $Y_n = \Phi_n(Y)$ with $\Phi_n$ as in part (a). Must $X_n$ and $Y_n$ be independent?
(c) Suppose $\{Y_n\}$ is an arbitrary collection of non-negative simple random variables such that $\{Y_n\} \nearrow Y$. Must $X_n$ and $Y_n$ be independent?
(d) Under the assumption of part (c), determine (with proof) which quantities in equation (4.2.7) are necessarily equal.

Exercise 4.5.13. Give examples of a random variable $X$ defined on Lebesgue measure on $[0,1]$, such that
(a) $E(X^+) = \infty$ and $0 < E(X^-) < \infty$.
(b) $E(X^-) = \infty$ and $0 < E(X^+) < \infty$.
(c) $E(X^+) = E(X^-) = \infty$.
(d) $E(X) < \infty$ but $E(X^2) = \infty$.

Exercise 4.5.14. Let $Z_1, Z_2, \ldots$ be general random variables with $E|Z_i| < \infty$, and let $Z = Z_1 + Z_2 + \ldots$.
(a) Suppose $\sum_i E(Z_i^+) < \infty$ and $\sum_i E(Z_i^-) < \infty$. Prove that $E(Z) = \sum_i E(Z_i)$.
(b) Show that we still have $E(Z) = \sum_i E(Z_i)$ if we have at least one of $\sum_i E(Z_i^+) < \infty$ or $\sum_i E(Z_i^-) < \infty$.
(c) Let $\{Z_i\}$ be independent, with $P(Z_i = +1) = P(Z_i = -1) = \frac12$ for each $i$. Does $E(Z) = \sum_i E(Z_i)$ in this case? How does that relate to (4.2.8)?

Exercise 4.5.15. Let $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$ be two probability triples. Let $A_1, A_2, \ldots \in \mathcal{F}_1$, and $B_1, B_2, \ldots \in \mathcal{F}_2$. Suppose that it happens that the sets $\{A_n \times B_n\}$ are all disjoint, and furthermore that $\bigcup_{n=1}^\infty (A_n \times B_n) = A \times B$ for some $A \in \mathcal{F}_1$ and $B \in \mathcal{F}_2$.
(a) Prove that for each $\omega \in \Omega_1$, we have
$$\mathbf{1}_A(\omega)\, P_2(B) = \sum_{n=1}^\infty \mathbf{1}_{A_n}(\omega)\, P_2(B_n).$$
[Hint: This is essentially countable additivity of $P_2$, but you do need to be careful about disjointness.]
(b) By taking expectations of both sides with respect to $P_1$ and using countable additivity of $P_1$, prove that
$$P_1(A)\, P_2(B) = \sum_{n=1}^\infty P_1(A_n)\, P_2(B_n).$$
(c) Use this result to prove that the $\mathcal{J}$ and P for product measure, presented in Subsection 2.6, do indeed satisfy (2.5.5).

4.6. Section summary.

In this section we defined the expected value $E(X)$ of a random variable $X$, first for simple random variables and then for general random variables. We proved basic properties such as linearity and order-preservation. We defined the variance $\text{Var}(X)$. If $X$ and $Y$ are independent, then $E(XY) = E(X)E(Y)$ and $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$. We also proved the monotone convergence theorem. Finally, we connected expected value to Riemann (calculus-style) integration.
5. Inequalities and convergence.

In this section we consider various relationships regarding expected values and limits.

5.1. Various inequalities.

We begin with two very important inequalities about random variables.

Proposition 5.1.1. (Markov's inequality.) If $X$ is a non-negative random variable, then for all $\alpha > 0$,
$$P(X \ge \alpha) \le E(X)/\alpha.$$
In words, the probability that $X$ exceeds $\alpha$ is bounded above by its mean divided by $\alpha$.

Proof. Define a new random variable $Z$ by
$$Z(\omega) = \begin{cases} \alpha, & X(\omega) \ge \alpha \\ 0, & X(\omega) < \alpha. \end{cases}$$
Then clearly $Z \le X$, so that $E(Z) \le E(X)$ by the order-preserving property. On the other hand, we compute that $E(Z) = \alpha\, P(X \ge \alpha)$. Hence, $\alpha\, P(X \ge \alpha) \le E(X)$. ∎

Markov's inequality applies only to non-negative random variables, but it immediately implies another inequality which holds more generally:

Proposition 5.1.2. (Chebychev's inequality.) Let $Y$ be an arbitrary random variable, with finite mean $\mu_Y$. Then for all $\alpha > 0$,
$$P(|Y - \mu_Y| \ge \alpha) \le \text{Var}(Y)/\alpha^2.$$
In words, the probability that $Y$ differs from its mean by more than $\alpha$ is bounded above by its variance divided by $\alpha^2$.

Proof. Set $X = (Y - \mu_Y)^2$. Then $X$ is a non-negative random variable. Thus, using Proposition 5.1.1, we have
$$P(|Y - \mu_Y| \ge \alpha) = P(X \ge \alpha^2) \le E(X)/\alpha^2 = \text{Var}(Y)/\alpha^2. \quad ∎$$

We shall use the above two inequalities extensively, including to prove the laws of large numbers presented below.
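Both bounds are easy to check numerically. Here is a quick sketch of my own (not from the text), using an exponential distribution with mean and variance 1, where both inequalities hold with plenty of room to spare:

```python
import random

random.seed(8)
sample = [random.expovariate(1.0) for _ in range(100_000)]   # mean 1, variance 1
n = len(sample)

alpha = 3.0
markov_lhs = sum(x >= alpha for x in sample) / n
print(markov_lhs, 1.0 / alpha)        # ~0.05 <= 0.333...: Markov's bound holds

mu = sum(sample) / n
cheb_lhs = sum(abs(x - mu) >= alpha for x in sample) / n
print(cheb_lhs, 1.0 / alpha**2)       # ~0.018 <= 0.111...: Chebychev's bound holds
```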
Two other sometimes-useful inequalities are as follows.

Proposition 5.1.3. (Cauchy-Schwarz inequality.) Let X and Y be random variables with E(X²) < ∞ and E(Y²) < ∞. Then

E|XY| ≤ √(E(X²) E(Y²)).

Proof. Let Z = |X| / √E(X²) and W = |Y| / √E(Y²), so that E(Z²) = E(W²) = 1. Then

0 ≤ E((Z − W)²) = E(Z² + W² − 2ZW) = 1 + 1 − 2E(ZW),

so E(ZW) ≤ 1, i.e. E|XY| ≤ √(E(X²) E(Y²)). ■

Proposition 5.1.4. (Jensen's inequality.) Let X be a random variable with finite mean, and let φ : R → R be a convex function, i.e. a function such that λφ(x) + (1 − λ)φ(y) ≥ φ(λx + (1 − λ)y) for x, y ∈ R and 0 ≤ λ ≤ 1. Then E(φ(X)) ≥ φ(E(X)).

Proof. Since φ is convex, we can find a linear function g(x) = ax + b which lies entirely below the graph of φ but which touches it at the point x = E(X), i.e. such that g(x) ≤ φ(x) for all x ∈ R, and g(E(X)) = φ(E(X)). Then

E(φ(X)) ≥ E(g(X)) = E(aX + b) = aE(X) + b = g(E(X)) = φ(E(X)). ■

5.2. Convergence of random variables.
If Z, Z₁, Z₂, ... are random variables defined on some (Ω, F, P), what does it mean to say that {Z_n} converges to Z as n → ∞? One notion we have already seen (cf. Theorem 4.2.2) is pointwise convergence, i.e. lim_{n→∞} Z_n(ω) = Z(ω). A slightly weaker notion which often arises is convergence almost surely (or, a.s., or with probability 1, or w.p. 1, or almost everywhere), meaning that P(lim_{n→∞} Z_n = Z) = 1, i.e. that P{ω ∈ Ω : lim_{n→∞} Z_n(ω) = Z(ω)} = 1. As an aid to establishing such convergence, we have the following:

Lemma 5.2.1. Let Z, Z₁, Z₂, ... be random variables. Suppose for each ε > 0, we have P(|Z_n − Z| ≥ ε i.o.) = 0. Then P(Z_n → Z) = 1, i.e. {Z_n} converges to Z almost surely.

Proof. It follows from Proposition A.3.3 that

P(Z_n → Z) = P(∀ε > 0, |Z_n − Z| < ε a.a.) = 1 − P(∃ε > 0, |Z_n − Z| ≥ ε i.o.).
By countable subadditivity, we have that

P(∃ε > 0, ε rational, |Z_n − Z| ≥ ε i.o.) ≤ Σ_{ε>0, ε rational} P(|Z_n − Z| ≥ ε i.o.) = 0.

But given any ε > 0, there exists a rational ε' > 0 with ε' ≤ ε. For this ε', we have that {|Z_n − Z| ≥ ε i.o.} ⊆ {|Z_n − Z| ≥ ε' i.o.}. It follows that

P(∃ε > 0, |Z_n − Z| ≥ ε i.o.) ≤ P(∃ε' > 0, ε' rational, |Z_n − Z| ≥ ε' i.o.) = 0,

thus giving the result. ■

Combining Lemma 5.2.1 with the Borel-Cantelli Lemma, we obtain:

Corollary 5.2.2. Let Z, Z₁, Z₂, ... be random variables. Suppose for each ε > 0, we have Σ_n P(|Z_n − Z| ≥ ε) < ∞. Then P(Z_n → Z) = 1, i.e. {Z_n} converges to Z almost surely.

Another notion involves only probabilities: we say that {Z_n} converges to Z in probability if for all ε > 0, lim_{n→∞} P(|Z_n − Z| ≥ ε) = 0. We next consider the relation between convergence in probability and convergence almost surely.

Proposition 5.2.3. Let Z, Z₁, Z₂, ... be random variables. Suppose Z_n → Z almost surely (i.e., P(Z_n → Z) = 1). Then Z_n → Z in probability (i.e., for any ε > 0, we have P(|Z_n − Z| ≥ ε) → 0). That is, if a sequence of random variables converges almost surely, then it converges in probability to the same limit.

Proof. Fix ε > 0, and let A_n = {ω : ∃m ≥ n, |Z_m − Z| ≥ ε}. Then {A_n} is a decreasing sequence of events. Furthermore, if ω ∈ ∩_n A_n, then Z_n(ω) does not converge to Z(ω) as n → ∞. Hence, P(∩_n A_n) ≤ P(Z_n does not converge to Z) = 0. By continuity of probabilities, we have P(A_n) → P(∩_n A_n) = 0. Hence, P(|Z_n − Z| ≥ ε) ≤ P(A_n) → 0, as required. ■

On the other hand, the converse to Proposition 5.2.3 is false. (This justifies the use of "strong" and "weak" in describing the two Laws of Large Numbers below.) For a first example, let {Z_n} be independent, with P(Z_n = 1) = 1/n = 1 − P(Z_n = 0). (Formally, the existence of such {Z_n} follows from Theorem 7.1.1.) Then clearly Z_n converges to 0 in probability. On the other hand, by the Borel-Cantelli Lemma, P(Z_n = 1 i.o.) = 1, so P(Z_n → 0) = 0, so Z_n does not converge to 0 almost surely.
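This first example is easy to visualise by simulation. The sketch below (ours, not from the text) generates one sample path of such {Z_n}: the proportion of 1's dies out, consistent with convergence in probability, yet isolated 1's keep appearing arbitrarily late, consistent with P(Z_n = 1 i.o.) = 1:

import random

# One sample path of independent Z_n with P(Z_n = 1) = 1/n.  Since
# P(|Z_n| >= eps) = 1/n -> 0, Z_n -> 0 in probability, but since
# sum 1/n = infinity, Borel-Cantelli forces Z_n = 1 infinitely often.
random.seed(1)
N = 10_000
path = [1 if random.random() < 1.0 / n else 0 for n in range(1, N + 1)]
last_one = max(n for n, z in enumerate(path, start=1) if z == 1)
print("number of 1's seen:", sum(path))   # grows like log N
print("a 1 occurred as late as n =", last_one)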
For a second example, let (Ω, F, P) be Lebesgue measure on [0,1], and set

Z₁ = 1_{[0,½)}, Z₂ = 1_{[½,1]}, Z₃ = 1_{[0,¼)}, Z₄ = 1_{[¼,½)}, Z₅ = 1_{[½,¾)}, Z₆ = 1_{[¾,1]}, Z₇ = 1_{[0,⅛)}, Z₈ = 1_{[⅛,¼)}, etc.

Then, by inspection, Z_n converges to 0 in probability, but Z_n does not converge to 0 almost surely.

5.3. Laws of large numbers.
Here we prove a first form of the weak law of large numbers.

Theorem 5.3.1. (Weak law of large numbers - first version.) Let X₁, X₂, ... be a sequence of independent random variables, each having the same mean m, and each having variance ≤ v < ∞. Then for all ε > 0,

lim_{n→∞} P(|(1/n)(X₁ + X₂ + ... + X_n) − m| ≥ ε) = 0.

In words, the partial averages (1/n)(X₁ + X₂ + ... + X_n) converge in probability to m.

Proof. Set S_n = (1/n)(X₁ + X₂ + ... + X_n). Then using linearity of expected value, and also properties (4.1.5) and (4.1.6) of variance, we see that E(S_n) = m and Var(S_n) ≤ v/n. Hence by Chebychev's inequality (Proposition 5.1.2), we have

P(|(1/n)(X₁ + X₂ + ... + X_n) − m| ≥ ε) ≤ v/(ε²n) → 0, n → ∞,

as required. ■

For example, in the case of infinite coin tossing, if X_i = r_i = 0 or 1 as the i-th coin is tails or heads, then Theorem 5.3.1 states that the probability that the fraction of heads on the first n tosses differs from ½ by more than ε goes to 0 as n → ∞. Informally, the fraction of heads gets closer and closer to ½ with higher and higher probability.
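As a quick illustration (ours, not from the text), the following Python sketch estimates the deviation probability in Theorem 5.3.1 for fair coin tossing, with ε = 0.02:

import random

# Estimate P(|fraction of heads - 1/2| >= eps) for increasing n,
# by repeating the n-toss experiment many times.
random.seed(2)
eps, trials = 0.02, 2000

def deviation_prob(n):
    bad = 0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 for _ in range(n))
        if abs(heads / n - 0.5) >= eps:
            bad += 1
    return bad / trials

for n in (100, 1000, 10000):
    print(n, deviation_prob(n))   # decreases toward 0 as n grows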
We next prove a first form of the strong law of large numbers.

Theorem 5.3.2. (Strong law of large numbers - first version.) Let X₁, X₂, ... be a sequence of independent random variables, each having the same finite mean m, and each having E((X_i − m)⁴) ≤ a < ∞. Then

P(lim_{n→∞} (1/n)(X₁ + X₂ + ... + X_n) = m) = 1.

In words, the partial averages (1/n)(X₁ + X₂ + ... + X_n) converge almost surely to m.

Proof. Since (X_i − m)² ≤ (X_i − m)⁴ + 1 (consider separately the two cases (X_i − m)² ≤ 1 and (X_i − m)² > 1), it follows that each X_i has variance ≤ v ≡ a + 1 < ∞. We assume for simplicity that m = 0; if not, then we can simply replace X_i by X_i − m.

Set S_n = X₁ + X₂ + ... + X_n, and consider E(S_n⁴). Now, S_n⁴ will (when multiplied out) contain many terms of the form X_i X_j X_k X_ℓ for i, j, k, ℓ distinct, but all of these have expected value zero. Similarly, it will contain many terms of the form X_i X_j (X_k)² and X_i (X_j)³ which also have expected value zero. The only terms with non-vanishing expectation will be the n terms of the form (X_i)⁴, and the (n choose 2)(4 choose 2) = 3n(n−1) terms of the form (X_i)²(X_j)² with i ≠ j. Now, E((X_i)⁴) ≤ a. Furthermore, if i ≠ j then X_i² and X_j² are independent by Proposition 3.2.3, so since m = 0 we have E((X_i)²(X_j)²) = E((X_i)²) E((X_j)²) = Var(X_i) Var(X_j) ≤ v². We conclude that

E(S_n⁴) ≤ na + 3n(n−1)v² ≤ Kn², where K = a + 3v².

This is the key. To finish, we note that for any ε > 0, we have by Markov's inequality that

P(|S_n/n| ≥ ε) = P(|S_n| ≥ nε) = P(|S_n|⁴ ≥ n⁴ε⁴) ≤ E(S_n⁴)/(n⁴ε⁴) ≤ Kn²/(n⁴ε⁴) = K ε⁻⁴ n⁻².

Since Σ_{n=1}^∞ n⁻² < ∞ by (A.3.7), it follows from Corollary 5.2.2 that S_n/n converges to 0 almost surely. ■

For example, in the case of coin tossing, Theorem 5.3.2 states that the fraction of heads on the first n tosses will converge, as n → ∞, to ½. Although this conclusion sounds quite similar to the corresponding conclusion from Theorem 5.3.1, we know from Proposition 5.2.3 (and the examples following) that it is actually a stronger result.

5.4. Eliminating the moment conditions.
Theorems 5.3.1 and 5.3.2 provide clear evidence that the partial averages (1/n)(X₁ + ... + X_n) are indeed converging to the common mean m in some sense. However, they require that the variance (i.e., second moment) or even the fourth moment of the X_i be finite (and uniformly bounded). This is an unnatural condition which is sometimes difficult to check. Thus, in this section we develop a new form of the strong law of large numbers which requires only that the mean (i.e., first moment) of the random variables be finite. However, as a penalty, it demands that the random variables be "i.i.d." as opposed to merely independent. (Of course, once we have proven a strong law, we will immediately obtain a weak law by using
Proposition 5.2.3.) Our proof follows the approach in Billingsley (1995).
We begin with a definition.

Definition 5.4.1. A collection of random variables {X_α}_{α∈I} are identically distributed if for any Borel-measurable function f : R → R, the expected value E(f(X_α)) does not depend on α, i.e. is the same for all α ∈ I.

Remark 5.4.2. It follows from Proposition 6.0.2 and Corollary 6.1.3 below that {X_α}_{α∈I} are identically distributed if and only if for all x ∈ R, the probability P(X_α ≤ x) does not depend on α.

Definition 5.4.3. A collection of random variables {X_α}_{α∈I} are i.i.d. if they are independent and are also identically distributed.

Theorem 5.4.4. (Strong law of large numbers - second version.) Let X₁, X₂, ... be a sequence of i.i.d. random variables, each having finite mean m. Then

P(lim_{n→∞} (1/n)(X₁ + X₂ + ... + X_n) = m) = 1.

In words, the partial averages (1/n)(X₁ + X₂ + ... + X_n) converge almost surely to m.

Proof. The proof is somewhat difficult because we do not assume the finiteness of any moments of the X_i other than the first. Instead, we shall use a "truncation argument", defining new random variables Y_i which are truncated versions of the corresponding X_i. Thus, higher moments will exist for the Y_i, even though the Y_i will tend to be "similar" to the X_i.

To begin, we assume that X_i ≥ 0; if not, we can consider separately X_i⁺ and X_i⁻. (Note that we cannot now assume without loss of generality that m = 0.) We set Y_i = X_i 1_{X_i ≤ i}, i.e. Y_i = X_i unless X_i exceeds i, in which case Y_i = 0. Then since 0 ≤ Y_i ≤ i, we have E(Y_i^k) ≤ i^k < ∞. Also, by Proposition 3.2.3, the {Y_i} are independent. Furthermore, since the X_i are i.i.d., we have that

E(Y_i) = E(X_i 1_{X_i ≤ i}) = E(X₁ 1_{X₁ ≤ i}) ↗ E(X₁) = m as i → ∞,

by the monotone convergence theorem.

We set S_n = Σ_{i=1}^n X_i, and set S_n* = Σ_{i=1}^n Y_i. We compute using (4.1.6) and (4.1.4) that

Var(S_n*) = Var(Y₁) + ... + Var(Y_n) ≤ E(Y₁²) + ... + E(Y_n²)
= E(X₁² 1_{X₁ ≤ 1}) + ... + E(X_n² 1_{X_n ≤ n}) ≤ n E(X₁² 1_{X₁ ≤ n}) ≤ n³ < ∞.   (5.4.5)

We now choose α > 1 (we will later let α ↘ 1), and set u_n = ⌊αⁿ⌋, i.e. u_n is the greatest integer not exceeding αⁿ. Then u_n ≤ αⁿ. Furthermore,
since αⁿ ≥ 1, it follows (consider separately the cases αⁿ ≤ 2 and αⁿ > 2) that u_n ≥ αⁿ/2, i.e. 1/u_n ≤ 2/αⁿ. Hence, for any x > 0, we have that

Σ_{n: u_n ≥ x} 1/u_n ≤ Σ_{n: αⁿ ≥ x} 2/αⁿ ≤ Σ_{k ≥ log_α x} 2/α^k = 2 / (x(1 − α⁻¹)).   (5.4.6)

(Note that here we sum over k = log_α x, log_α x + 1, ..., even if log_α x is not an integer.)

We now proceed to the heart of the proof. For any ε > 0, we have that

Σ_{n=1}^∞ P(|S*_{u_n}/u_n − E(S*_{u_n}/u_n)| ≥ ε)
≤ Σ_{n=1}^∞ Var(S*_{u_n}/u_n) / ε²   [by Chebychev's inequality]
= (1/ε²) Σ_{n=1}^∞ Var(S*_{u_n}) / u_n²   [by (4.1.5)]
≤ (1/ε²) Σ_{n=1}^∞ E(X₁² 1_{X₁ ≤ u_n}) / u_n   [by (5.4.5)]
= (1/ε²) E(X₁² Σ_{n: u_n ≥ X₁} 1/u_n)   [by countable linearity]
≤ (1/ε²) E(X₁² · 2/((1 − α⁻¹) X₁))   [by (5.4.6)]
= 2 E(X₁) / (ε²(1 − α⁻¹)) = 2m / (ε²(1 − α⁻¹)) < ∞.

This finiteness is the key. It now follows from Corollary 5.2.2 that

(S*_{u_n} − E(S*_{u_n})) / u_n converges to 0 almost surely.   (5.4.7)

To complete the proof, we need to replace E(S*_{u_n})/u_n by m, then replace S*_{u_n} by S_{u_n}, and finally replace the index u_n by the general index k. We consider each of these three issues in turn.

First, since E(Y_i) → m as i → ∞, and since u_n → ∞ as n → ∞, it follows immediately that E(S*_{u_n})/u_n → m as n → ∞. Hence, it follows from (5.4.7) that

S*_{u_n}/u_n converges to m almost surely.   (5.4.8)

Second, we note by Proposition 4.2.9 that

Σ_{k=1}^∞ P(X_k ≠ Y_k) = Σ_{k=1}^∞ P(X_k > k) ≤ Σ_{k=1}^∞ P(X_k ≥ k)
= Σ_{k=1}^∞ P(X₁ ≥ k) ≤ E(X₁) = m < ∞.

Hence, again by Borel-Cantelli, we have that P(X_k ≠ Y_k i.o.) = 0, so that P(X_k = Y_k a.a.) = 1. It follows that, with probability one, as n → ∞ the limit of S*_{u_n}/u_n coincides with the limit of S_{u_n}/u_n. Hence, (5.4.8) implies that

S_{u_n}/u_n converges to m almost surely.   (5.4.9)

Finally, for an arbitrary index k, we can find n = n_k so that u_n ≤ k < u_{n+1}. But then, since we assumed X_i ≥ 0 (so that S_k is non-decreasing in k),

(u_n/u_{n+1}) (S_{u_n}/u_n) = S_{u_n}/u_{n+1} ≤ S_k/k ≤ S_{u_{n+1}}/u_n = (u_{n+1}/u_n) (S_{u_{n+1}}/u_{n+1}).

Now, as k → ∞, we have n = n_k → ∞, so that u_n/u_{n+1} → 1/α and u_{n+1}/u_n → α. Hence, for any α > 1 and δ > 0, with probability 1 we have

m/((1 + δ)α) ≤ S_k/k ≤ (1 + δ)αm

for all sufficiently large k. For any ε > 0, choosing α > 1 and δ > 0 so that m/((1 + δ)α) > m − ε and (1 + δ)αm < m + ε, this implies that P(|S_k/k − m| ≥ ε i.o.) = 0. Hence, by Lemma 5.2.1, we have that as k → ∞,

S_k/k converges to m almost surely,

as required. ■

Using Proposition 5.2.3, we immediately obtain a corresponding statement about convergence in probability.

Corollary 5.4.11. (Weak law of large numbers - second version.) Let X₁, X₂, ... be a sequence of i.i.d. random variables, each having finite mean m. Then for all ε > 0,

lim_{n→∞} P(|(1/n)(X₁ + X₂ + ... + X_n) − m| ≥ ε) = 0.

In words, the partial averages (1/n)(X₁ + X₂ + ... + X_n) converge in probability to m.
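To see the second version of the strong law at work where the first version does not apply, one can simulate i.i.d. variables with finite mean but infinite variance. The sketch below (ours, not from the text) uses Pareto variables with index 1.5, whose mean is 3 but whose variance is infinite; the partial averages still settle, slowly, near 3:

import random

# i.i.d. Pareto(1.5) samples via inverse CDF: X = (1 - U)^(-1/alpha).
# Mean = alpha/(alpha - 1) = 3, variance infinite, so Theorems 5.3.1
# and 5.3.2 do not apply, but Theorem 5.4.4 does.
random.seed(3)
alpha = 1.5

total = 0.0
for n in range(1, 1_000_001):
    total += (1.0 - random.random()) ** (-1.0 / alpha)
    if n in (10**2, 10**4, 10**6):
        print(n, total / n)            # wanders, but settles near 3
print("true mean:", alpha / (alpha - 1))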
5.5. Exercises.
Exercise 5.5.1. Suppose E(2^X) = 4. Prove that P(X ≥ 3) ≤ 1/2.

Exercise 5.5.2. Give an example of a random variable X and α > 0 such that P(X ≥ α) > E(X)/α. [Hint: Obviously X cannot be non-negative.] Where does the proof of Markov's inequality break down in this case?

Exercise 5.5.3. Give examples of random variables Y with mean 0 and variance 1 such that
(a) P(|Y| ≥ 2) = 1/4.
(b) P(|Y| ≥ 2) < 1/4.

Exercise 5.5.4. Suppose X is a non-negative random variable with E(X) = ∞. What does Markov's inequality say in this case?

Exercise 5.5.5. Suppose Y is a random variable with finite mean μ_Y and with Var(Y) = ∞. What does Chebychev's inequality say in this case?

Exercise 5.5.6. For general jointly defined random variables X and Y, prove that |Corr(X, Y)| ≤ 1. [Hint: Don't forget the Cauchy-Schwarz inequality.] (Compare Exercise 4.5.11.)

Exercise 5.5.7. Let a ∈ R, and let φ(x) = max(x, a) as in Exercise 4.5.2. Prove that φ is a convex function. Relate this to Jensen's inequality and to Exercise 4.5.2.

Exercise 5.5.8. Let φ(x) = x².
(a) Prove that φ is a convex function.
(b) What does Jensen's inequality say for this choice of φ?
(c) Where in the text have we already seen the result of part (b)?

Exercise 5.5.9. Prove Cantelli's inequality, which states that if X is a random variable with finite mean m and finite variance v, then for α > 0,

P(X − m ≥ α) ≤ v / (v + α²).

[Hint: First show P(X − m ≥ α) ≤ P((X − m + y)² ≥ (α + y)²). Then use Markov's inequality, and minimise the resulting bound over the choice of y ≥ 0.]

Exercise 5.5.10. Let X₁, X₂, ... be a sequence of random variables, with E[X_n] = 8 and Var[X_n] = 1/√n for each n. Prove or disprove that {X_n} must converge to 8 in probability.
Exercise 5.5.11. Give (with proof) an example of a sequence {Y_n} of jointly-defined random variables, such that as n → ∞: (i) Y_n/n converges to 0 in probability; and (ii) Y_n/n² converges to 0 with probability 1; but (iii) Y_n/n does not converge to 0 with probability 1.

Exercise 5.5.12. Give (with proof) an example of two discrete random variables having the same mean and the same variance, but which are not identically distributed.

Exercise 5.5.13. Let r ∈ N. Let X₁, X₂, ... be identically distributed random variables having finite mean m, which are r-dependent, i.e. such that X_{k₁}, X_{k₂}, ..., X_{k_s} are independent whenever k_{i+1} > k_i + r for each i. (Thus, independent random variables are 0-dependent.) Prove that with probability one, (1/n) Σ_{i=1}^n X_i → m as n → ∞. [Hint: Break up the sum Σ_{i=1}^n X_i into r different sums.]

Exercise 5.5.14. Prove the converse of Lemma 5.2.1. That is, prove that if {X_n} converges to X almost surely, then for each ε > 0 we have P(|X_n − X| ≥ ε i.o.) = 0.

Exercise 5.5.15. Let X₁, X₂, ... be a sequence of independent random variables with P(X_n = 3ⁿ) = P(X_n = −3ⁿ) = ½. Let S_n = X₁ + ... + X_n.
(a) Compute E(X_n) for each n.
(b) For n ∈ N, compute R_n = sup{r ∈ R : P(|S_n| ≥ r) = 1}, i.e. the largest number such that |S_n| is always at least R_n.
(c) Compute lim_{n→∞} (1/n) R_n.
(d) For which ε > 0 (if any) is it the case that P((1/n)|S_n| ≥ ε) → 0?
(e) Why does this result not contradict the various laws of large numbers?

5.6. Section summary.
This section presented inequalities about random variables. The first, Markov's inequality, provides an upper bound on P(X ≥ α) for non-negative random variables X. The second, Chebychev's inequality, provides an upper bound on P(|Y − μ_Y| ≥ α) in terms of the variance of Y. The Cauchy-Schwarz and Jensen inequalities were also discussed. It then discussed convergence of sequences of random variables, and presented various versions of the Law of Large Numbers. This law concerns partial averages (1/n)(X₁ + ... + X_n) of collections of independent random variables {X_n} all having the same mean m. Under the assumption of either finite higher moments (first version) or identical distributions (second version), we proved that these partial averages converge in probability (weak law) or almost surely (strong law) to the mean m.
6. Distributions of random variables.
The distribution or law of a random variable is defined as follows.

Definition 6.0.1. Given a random variable X on a probability triple (Ω, F, P), its distribution (or law) is the function μ defined on B, the Borel subsets of R, by

μ(B) = P(X ∈ B) = P(X⁻¹(B)), B ∈ B.

If μ is the law of a random variable, then (R, B, μ) is a valid probability triple. We shall sometimes write μ as L(X) or as P X⁻¹. We shall also write X ~ μ to indicate that μ is the distribution of X.

We define the cumulative distribution function of a random variable X by F_X(x) = P(X ≤ x), for x ∈ R. By continuity of probabilities, the function F_X is right-continuous, i.e. if {x_n} ↘ x then F_X(x_n) → F_X(x). It is also clearly a non-decreasing function of x, with lim_{x→∞} F_X(x) = 1 and lim_{x→−∞} F_X(x) = 0. We note the following.

Proposition 6.0.2. Let X and Y be two random variables (possibly defined on different probability triples). Then L(X) = L(Y) if and only if F_X(x) = F_Y(x) for all x ∈ R.

Proof. The "if" part follows from Corollary 2.5.9. The "only if" part is immediate upon setting B = (−∞, x]. ■

6.1. Change of variable theorem.
The following result shows that distributions specify completely the expected values of random variables (and of functions of them).

Theorem 6.1.1. (Change of variable theorem.) Given a probability triple (Ω, F, P), let X be a random variable having distribution μ. Then for any Borel-measurable function f : R → R, we have

∫_Ω f(X(ω)) P(dω) = ∫_R f(t) μ(dt),   (6.1.2)

i.e. E_P[f(X)] = E_μ(f), provided that either side is well-defined. In words, the expected value of the random variable f(X) with respect to the probability measure P on Ω is equal to the expected value of the function f with respect to the measure μ on R.
Proof. Suppose first that f = 1_B is an indicator function of a Borel set B ⊆ R. Then ∫_Ω f(X(ω)) P(dω) = ∫_Ω 1_{X(ω)∈B} P(dω) = P(X ∈ B), while ∫_R f(t) μ(dt) = ∫_R 1_B(t) μ(dt) = μ(B) = P(X ∈ B), so equality holds in this case.

Now suppose that f is a non-negative simple function. Then f is a finite positive linear combination of indicator functions. But since both sides of (6.1.2) are linear functions of f, we see that equality still holds in this case.

Next suppose that f is a general non-negative Borel-measurable function. Then by Proposition 4.2.5, we can find a sequence {f_n} of non-negative simple functions such that {f_n} ↗ f. We know that (6.1.2) holds when f is replaced by f_n. But then by letting n → ∞ and using the Monotone Convergence Theorem (Theorem 4.2.2), we see that (6.1.2) holds for f as well.

Finally, for general Borel-measurable f, we can write f = f⁺ − f⁻. Since (6.1.2) holds for f⁺ and for f⁻ separately, and since it is linear, therefore it must also hold for f. ■

Remark. The method of proof used in Theorem 6.1.1 (namely, considering first indicator functions, then non-negative simple functions, then general non-negative functions, and finally general functions) is quite widely applicable; we shall use it again in the next subsection.

Corollary 6.1.3. Let X and Y be two random variables (possibly defined on different probability triples). Then L(X) = L(Y) if and only if E[f(X)] = E[f(Y)] for all Borel-measurable f : R → R for which either expectation is well-defined. (Compare Proposition 6.0.2 and Remark 5.4.2.)

Proof. If L(X) = L(Y) = μ (say), then Theorem 6.1.1 says that E[f(X)] = E[f(Y)] = ∫_R f dμ. Conversely, if E[f(X)] = E[f(Y)] for all Borel-measurable f : R → R, then setting f = 1_B shows that P[X ∈ B] = P[Y ∈ B] for all Borel B ⊆ R, i.e. that L(X) = L(Y). ■

Corollary 6.1.4. If X and Y are random variables with P(X = Y) = 1, then E[f(X)] = E[f(Y)] for all Borel-measurable f : R → R for which either expectation is well-defined.

Proof. It follows directly that L(X) = L(Y). Then, letting μ = L(X) = L(Y), we have from Theorem 6.1.1 that E[f(X)] = E[f(Y)] = ∫_R f dμ. ■
6.2. Examples of distributions.
For a first example of a distribution of a random variable, suppose that P(X = c) = 1, i.e. that X is always (or, at least, with probability 1) equal to some constant real number c. Then the distribution of X is the point mass δ_c, defined by δ_c(B) = 1_B(c), i.e. δ_c(B) equals 1 if c ∈ B and equals 0 otherwise. In this case we write X ~ δ_c, or L(X) = δ_c. From Corollary 6.1.4, since P(X = c) = 1, we have E(X) = E(c) = c, and E(X³ + 2) = E(c³ + 2) = c³ + 2, and more generally E[f(X)] = f(c) for any function f. In symbols, ∫_Ω f(X(ω)) P(dω) = ∫_R f(t) δ_c(dt) = f(c). That is, the mapping f ↦ E[f(X)] is an evaluation map.

For a second example, suppose X has the Poisson(5) distribution considered earlier. Then P(X ∈ A) = Σ_{j∈A} e⁻⁵ 5^j / j!, which implies that L(X) = Σ_{j=0}^∞ (e⁻⁵ 5^j / j!) δ_j, a convex combination of point masses. The following proposition shows that we then have E(f(X)) = Σ_{j=0}^∞ f(j) e⁻⁵ 5^j / j! for any function f : R → R.

Proposition 6.2.1. Suppose μ = Σ_i β_i μ_i, where {μ_i} are probability distributions, and {β_i} are non-negative constants (summing to 1, if we want μ to also be a probability distribution). Then for Borel-measurable functions f : R → R,

∫ f dμ = Σ_i β_i ∫ f dμ_i,

provided either side is well-defined.

Proof. As in the proof of Theorem 6.1.1, it suffices (by linearity and the monotone convergence theorem) to check the equation when f = 1_B is an indicator function of a Borel set B. But in this case the result follows immediately since μ(B) = Σ_i β_i μ_i(B). ■

Clearly, any other discrete random variable can be handled similarly to the Poisson(5) example. Thus, discrete random variables do not present any substantial new technical issues.
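As a small numerical illustration (ours, not from the text) of Proposition 6.2.1, the following Python sketch integrates functions f against the Poisson(5) law by summing f(j) e⁻⁵ 5^j / j!, truncating the series once the remaining terms are negligible:

import math

# Integrate f against L(X) = sum_j (e^{-5} 5^j / j!) delta_j: by
# Proposition 6.2.1, this is just the weighted sum of f at the atoms.
lam = 5.0

def E(f, terms=60):    # terms=60 is ample: the Poisson(5) tail is tiny
    return sum(f(j) * math.exp(-lam) * lam**j / math.factorial(j)
               for j in range(terms))

print(E(lambda x: x))                             # ~5.0  (the mean)
print(E(lambda x: x * x))                         # ~30.0 (= lam + lam^2)
print(E(lambda x: x * x) - E(lambda x: x) ** 2)   # ~5.0  (the variance)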
For a third example of a distribution of a random variable, suppose X has the Normal(0,1) distribution considered earlier (henceforth denoted N(0,1)). We can define its law μ_N by

μ_N(B) = ∫_{−∞}^∞ φ(t) 1_B(t) λ(dt), B Borel,   (6.2.2)

where λ is Lebesgue measure on R (cf. (4.4.2)) and where φ(t) = (1/√(2π)) e^{−t²/2}. We note that for a mathematically complete definition, it is necessary to use the Lebesgue integral rather than the Riemann integral. Indeed, the Riemann integral is undefined unless B is a rather simple set (e.g. a finite union of intervals, or more generally a set whose boundary has measure 0), while we need μ_N(B) to be defined for all Borel sets B. Furthermore, since φ is continuous it is Borel-measurable (Proposition 3.1.8), so Lebesgue integrals such as (6.2.2) make sense.

Similarly, given any Borel-measurable function (called a density function) f such that f ≥ 0 and ∫_{−∞}^∞ f(t) λ(dt) = 1, we can define a law μ by

μ(B) = ∫_{−∞}^∞ f(t) 1_B(t) λ(dt), B Borel.

We shall sometimes write this as μ(B) = ∫_B f or μ(B) = ∫_B f(t) λ(dt), or even as μ(dt) = f(t) λ(dt) (where such equalities of "differentials" have the interpretation that the two sides are equal when integrated over t ∈ B for any Borel B, i.e. that ∫_B μ(dt) = ∫_B f(t) λ(dt) for all B). We shall also write this as dμ/dλ = f, and shall say that μ is absolutely continuous with respect to λ, and that f is the density for μ with respect to λ. We then have the following.

Proposition 6.2.3. Suppose μ has density f with respect to λ. Then for any Borel-measurable function g : R → R,

∫_{−∞}^∞ g(t) μ(dt) = ∫_{−∞}^∞ g(t) f(t) λ(dt),

provided either side is well-defined. In words, to compute the integral of a function with respect to μ, it suffices to compute the integral of the function times the density with respect to λ.

Proof. Once again, it suffices to check the equation when g = 1_B is an indicator function of a Borel set B. But in that case, ∫ g(t) μ(dt) = ∫ 1_B(t) μ(dt) = μ(B), while ∫ g(t) f(t) λ(dt) = ∫ 1_B(t) f(t) λ(dt) = μ(B) by definition. The result follows. ■
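As a numerical illustration (ours, not from the text) of Proposition 6.2.3, the following sketch approximates E(g(X)) for X ~ N(0,1) by a simple midpoint Riemann sum of g(t)φ(t) over [−10, 10]; the neglected tails are negligible:

import math

# E(g(X)) for X ~ N(0,1) as the integral of g(t)*phi(t) dt,
# approximated by a midpoint rule on [-10, 10].
def phi(t):
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def E(g, a=-10.0, b=10.0, steps=200_000):
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) * phi(a + (i + 0.5) * h)
               for i in range(steps)) * h

print(E(lambda t: t))        # ~0.0
print(E(lambda t: t ** 2))   # ~1.0
print(E(lambda t: t ** 4))   # ~3.0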
By combining Theorem 4.4.1 and Proposition 6.2.3, it is possible to do explicit computations with absolutely-continuous random variables. For example, if X ~ N(0,1), then E(X) = ∫ t μ_N(dt) = ∫ t φ(t) λ(dt) = ∫ t φ(t) dt, and more generally

E(g(X)) = ∫ g(t) μ_N(dt) = ∫ g(t) φ(t) λ(dt) = ∫_{−∞}^∞ g(t) φ(t) dt

for any Riemann-integrable function g; here the last expression is an ordinary, old-fashioned, calculus-style Riemann integral. It can be computed in this manner that E(X) = 0, E(X²) = 1, E(X⁴) = 3, etc.

For an example combining Propositions 6.2.3 and 6.2.1, suppose that L(X) = ¼δ₁ + ¼δ₂ + ½μ_N, where μ_N is again the N(0,1) distribution. Then E(X) = ¼(1) + ¼(2) + ½(0) = ¾, E(X²) = ¼(1) + ¼(4) + ½(1) = 7/4, and so on. Note, however, that it is not the case that Var(X) equals the corresponding linear combination of variances (indeed, the variance of a point mass is 0, so that the corresponding linear combination of variances is ¼(0) + ¼(0) + ½(1) = ½); rather, the formula Var(X) = E(X²) − E(X)² = 7/4 − (¾)² = 19/16 should be used.

Exercise 6.2.4. Why does Proposition 6.2.1 not imply that Var(X) equals the corresponding linear combination of variances?

6.3. Exercises.
Exercise 6.3.1. Let (Ω, F, P) be Lebesgue measure on [0,1], and set

X(ω) = 1 for 0 ≤ ω < 1/4; X(ω) = 2ω² for 1/4 ≤ ω < 3/4; X(ω) = ω² for 3/4 ≤ ω ≤ 1.

Compute P(X ∈ A) where (a) A = [0,1]. (b) A = [1/2, 1].

Exercise 6.3.2. Suppose P(Z = 0) = P(Z = 1) = ½, that Y ~ N(0,1), and that Y and Z are independent. Set X = YZ. What is the law of X?

Exercise 6.3.3. Let X ~ Poisson(5). (a) Compute E(X) and Var(X). (b) Compute E(3^X).

Exercise 6.3.4. Compute E(X), E(X²), and Var(X), where the law of X is given by
(a) L(X) = ⅓δ₁ + ⅔λ, where λ is Lebesgue measure on [0,1].
(b) L(X) = ⅓δ₂ + ⅔μ_N, where μ_N is the standard normal distribution N(0,1).

Exercise 6.3.5. Let X and Z be independent, with X ~ N(0,1), and with P(Z = 1) = P(Z = −1) = 1/2. Let Y = XZ (i.e., Y is the product of X and Z).
(a) Prove that Y ~ N(0,1).
(b) Prove that P(|X| = |Y|) = 1.
(c) Prove that X and Y are not independent.
(d) Prove that Cov(X, Y) = 0.
(e) It is sometimes claimed that if X and Y are normally distributed random variables with Cov(X, Y) = 0, then X and Y must be independent. Is that claim correct?

Exercise 6.3.6. Let X and Y be random variables on some probability triple (Ω, F, P). Suppose E(X⁴) < ∞, and that P[m ≤ X ≤ z] = P[m ≤ Y ≤ z] for all integers m and all z ∈ R. Prove or disprove that we necessarily have E(Y⁴) = E(X⁴).

Exercise 6.3.7. Let X be a random variable, and let F_X(x) be its cumulative distribution function. For fixed x ∈ R, we know by right-continuity that lim_{y↘x} F_X(y) = F_X(x).
(a) Give a necessary and sufficient condition that lim_{y↗x} F_X(y) = F_X(x).
(b) More generally, give a formula for F_X(x) − lim_{y↗x} F_X(y), in terms of a simple property of X.

Exercise 6.3.8. Consider the statement: f(x) = (f(x))² for all x ∈ R.
(a) Prove that the statement is true for all indicator functions f = 1_B.
(b) Prove that the statement is not true for the identity function f(x) = x.
(c) Why does this fact not contradict the method of proof of Theorem 6.1.1?

6.4. Section summary.
This section defined the distribution (or law), L(X), of a random variable X, to be a corresponding distribution on the real line. It proved that L(X) is completely determined by the cumulative distribution function, F_X(x) = P(X ≤ x), of X. It proved that the expectation E(f(X)) of any function of X can be computed (in principle) once L(X) or F_X(x) is known. It then considered a number of examples of distributions of random variables, including discrete and continuous random variables and various combinations of them. It provided a number of results for computing expected values with respect to such distributions.
7. Stochastic processes and gambling games.
Now that we have covered most of the essential foundations of rigorous probability theory, it is time to get "moving", i.e. to consider random processes rather than just static random variables.

A (discrete time) stochastic process is simply a sequence X₀, X₁, X₂, ... of random variables defined on some fixed probability triple (Ω, F, P). The random variables {X_n} are typically not independent. In this context we often think of n as representing time; thus, X_n represents the value of a random quantity at the time n.

For a specific example, let (r₁, r₂, ...) be the result of infinite fair coin tossing (so that {r_i} are independent, and each r_i equals 0 or 1 with probability ½; see Subsection 2.6), and set

X₀ = 0; X_n = r₁ + r₂ + ... + r_n, n ≥ 1;   (7.0.1)

thus, X_n represents the number of heads obtained up to time n. Alternatively, we might set

X₀ = 0; X_n = 2(r₁ + r₂ + ... + r_n) − n, n ≥ 1;   (7.0.2)

then X_n represents the number of heads minus the number of tails obtained up to time n.

This last example suggests a gambling game: each time we obtain a head we increase X_n by 1 (i.e., we "win"), while each time we obtain a tail we decrease X_n by 1 (i.e., we "lose"). To allow for non-fair games, we might wish to generalise (7.0.2) to

X₀ = 0; X_n = Z₁ + Z₂ + ... + Z_n, n ≥ 1,

where the {Z_i} are assumed to be i.i.d. random variables, satisfying P(Z_i = 1) = p and P(Z_i = −1) = 1 − p, for some fixed 0 < p < 1. (If p = ½ then this is equivalent to (7.0.2), with Z_i = 2r_i − 1.)

This raises an immediate issue: can we be sure that such random variables {Z_i} even exist? In fact the answer to this question is yes, as the following subsection shows.

7.1. A first existence theorem.
We here show the existence of sequences of independent random variables having prescribed distributions.

Theorem 7.1.1. Let μ₁, μ₂, ... be any sequence of Borel probability measures on R. Then there exists a probability space (Ω, F, P), and random
variables X₁, X₂, ... defined on (Ω, F, P), such that {X_n} are independent, and L(X_n) = μ_n.

We begin with a lemma.

Lemma 7.1.2. Let U be a random variable whose distribution is Lebesgue measure (i.e., the uniform distribution) on [0,1]. Let F be any cumulative distribution function, and set φ(u) = inf{x : F(x) ≥ u} for 0 < u < 1. Then P(φ(U) ≤ z) = F(z) for each z ∈ R; in words, the cumulative distribution function of φ(U) is F.

Proof. Since F is non-decreasing, if F(z) < u, then φ(u) > z. On the other hand, since F is right-continuous, we have that inf{x : F(x) ≥ u} = min{x : F(x) ≥ u}; that is, the infimum is actually attained. It follows that if F(z) ≥ u, then φ(u) ≤ z. Hence, φ(u) ≤ z if and only if u ≤ F(z). Since 0 ≤ F(z) ≤ 1, we obtain that P(φ(U) ≤ z) = P(U ≤ F(z)) = F(z). ■

Proof of Theorem 7.1.1. We let (Ω, F, P) be infinite independent fair coin tossing, so that r₁, r₂, ... are i.i.d. with P(r_i = 0) = P(r_i = 1) = ½. Let {Z_ij} be a two-dimensional array filled by these r_i, one diagonal at a time, so that each r_k appears in exactly one position:

( Z₁₁ Z₁₂ Z₁₃ Z₁₄ ... )   ( r₁  r₂  r₄  r₇  ... )
( Z₂₁ Z₂₂ Z₂₃ ...     ) = ( r₃  r₅  r₈  ...     )
( Z₃₁ Z₃₂ ...         )   ( r₆  r₉  ...         )
( Z₄₁ ...             )   ( r₁₀ ...             )

Hence, {Z_ij} are independent, with P(Z_ij = 0) = P(Z_ij = 1) = ½. Then, for each n ∈ N, we set U_n = Σ_{k=1}^∞ Z_{nk}/2^k. By Corollary 3.5.3, the {U_n} are independent. Furthermore, by the way the U_n were constructed, we have P(j/2^k ≤ U_n < (j+1)/2^k) = 1/2^k for k ∈ N and 0 ≤ j < 2^k. By additivity and continuity of probabilities, this implies that P(a ≤ U_n ≤ b) = b − a whenever 0 ≤ a ≤ b ≤ 1. Hence, by Proposition 2.5.8, each U_n follows the uniform distribution (i.e. Lebesgue measure) on [0,1].

Finally, we set F_n(x) = μ_n((−∞, x]) for x ∈ R, set φ_n(u) = inf{x : F_n(x) ≥ u} for 0 < u < 1, and set X_n = φ_n(U_n). Then {X_n} are independent by Proposition 3.2.3, and X_n ~ μ_n by Lemma 7.1.2, as required. ■
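Lemma 7.1.2 is also the basis of a standard simulation technique, usually called inverse-CDF (or inversion) sampling. The following sketch (ours, not from the text) applies it with F(x) = 1 − e^{−x}, the Exponential(1) distribution function, for which φ has a closed form:

import math, random

# phi(U) = inf{x : F(x) >= u} applied to a Uniform[0,1] variable U
# yields a variable with cdf F (Lemma 7.1.2).  For F(x) = 1 - e^{-x},
# phi(u) = -log(1 - u) exactly.
random.seed(4)

def phi(u):
    return -math.log(1.0 - u)

xs = [phi(random.random()) for _ in range(100_000)]

# Empirical cdf at z = 1.0 versus F(1.0) = 1 - e^{-1} ~ 0.632:
z = 1.0
print(sum(1 for x in xs if x <= z) / len(xs), 1 - math.exp(-z))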
7.2. Gambling and gambler's ruin.
By Theorem 7.1.1, for fixed 0 < p < 1 we can find random variables {Z_i} which are i.i.d. with P(Z_i = 1) = p and P(Z_i = −1) = 1 − p = q. We then set X_n = a + Z₁ + Z₂ + ... + Z_n (with X₀ = a) for some fixed integer a. We shall interpret X_n as a gambling player's "fortune" (in dollars) at time n when repeatedly making $1 bets, and shall refer to the stochastic process {X_n} as simple random walk. Thus, our player begins with $a, and at each time has probability p of winning $1 and probability q = 1 − p of losing $1.

We first note the distribution of X_n. Indeed, clearly P(X_n = a + k) = 0 unless −n ≤ k ≤ n with n + k even. For such k, there are (n choose (n+k)/2) different possible sequences Z₁, ..., Z_n such that X_n = a + k, namely all sequences consisting of (n+k)/2 symbols +1 and (n−k)/2 symbols −1. Furthermore, each such sequence has probability p^{(n+k)/2} q^{(n−k)/2}. We conclude that

P(X_n = a + k) = (n choose (n+k)/2) p^{(n+k)/2} q^{(n−k)/2}, −n ≤ k ≤ n, n + k even,

with P(X_n = a + k) = 0 otherwise.

This is a rather "static" observation about the process {X_n}; of greater interest are questions which depend on its time-evolution. One such question is the gambler's ruin problem, defined as follows. Suppose that 0 ≤ a ≤ c, and let τ₀ = inf{n ≥ 0 : X_n = 0} and τ_c = inf{n ≥ 0 : X_n = c} be the first hitting times of 0 and c, respectively. (These infima are taken to be +∞ if the condition is satisfied for no n.) The gambler's ruin question is: what is P(τ_c < τ₀)? That is, what is the probability that the player's fortune will reach the value c before it reaches the value 0? Informally, what is the probability that the gambler gets rich before going broke? (Note that {τ_c < τ₀} includes the case when τ₀ = ∞ while τ_c < ∞; but it does not include the case when τ_c = τ₀ = ∞.)

Solving this question is not straightforward, since there is no limit to how long it will take until either the fortune c or the fortune 0 is reached. However, with the right trick, the solution presents itself. We set s(a) = s_{c,p}(a) = P(τ_c < τ₀). Writing the dependence on a explicitly will allow us to vary a, and to relate s(a) to s(a−1) and s(a+1). Indeed, for 1 ≤ a ≤ c−1 we have that

s(a) = P(τ_c < τ₀) = P(Z₁ = −1, τ_c < τ₀) + P(Z₁ = +1, τ_c < τ₀) = q s(a−1) + p s(a+1).   (7.2.1)

That is, s(a) is a simple convex combination of s(a−1) and s(a+1); this is the key. We further have by definition that s(0) = 0 and s(c) = 1.
Now, (7.2.1) gives c−1 equations for the c−1 unknowns s(1), ..., s(c−1). This system of equations can then be solved in several different ways (see Exercises 7.2.4, 7.2.5, and 7.2.6 below), to obtain that

s(a) = s_{c,p}(a) = (1 − (q/p)^a) / (1 − (q/p)^c), p ≠ ½,   (7.2.2)

and

s(a) = s_{c,p}(a) = a/c, p = ½.   (7.2.3)

(This last equation is suggestive: for a fair game (p = ½), with probability a/c you end up with c dollars; and the product of these two quantities is your initial fortune a. We will consider this issue again when we study martingales; see in particular Exercises 14.4.10 and 14.4.11.)

Exercise 7.2.4. Verify that (7.2.2) and (7.2.3) satisfy (7.2.1), and also satisfy s(0) = 0 and s(c) = 1.

Exercise 7.2.5. Solve equation (7.2.1) by direct algebra, as follows.
(a) Show that (7.2.1) implies that for 1 ≤ a ≤ c−1, s(a+1) − s(a) = (q/p)(s(a) − s(a−1)).
(b) Show that this implies that for 0 ≤ a ≤ c−1, s(a+1) − s(a) = (q/p)^a s(1).
(c) Show that this implies that for 0 ≤ a ≤ c, s(a) = Σ_{i=0}^{a−1} (q/p)^i s(1).
(d) Solve for s(1), and verify (7.2.2) and (7.2.3).

Exercise 7.2.6. Solve equation (7.2.1) using the theory of difference equations, as follows.
(a) Show that the corresponding "characteristic equation" t⁰ = q t⁻¹ + p t¹ has two distinct roots t₁ and t₂ when p ≠ 1/2, and one double root t₃ when p = 1/2. Solve for t₁, t₂, and t₃.
(b) When p ≠ 1/2, the theory of difference equations says that we must have s_{c,p}(a) = C₁(t₁)^a + C₂(t₂)^a for some constants C₁ and C₂. Assuming this, use the boundary conditions s_{c,p}(0) = 0 and s_{c,p}(c) = 1 to solve for C₁ and C₂. Verify (7.2.2).
(c) When p = 1/2, the theory of difference equations says that we must have s_{c,p}(a) = C₃(t₃)^a + C₄ a(t₃)^a for some constants C₃ and C₄. Assuming this, use the boundary conditions to solve for C₃ and C₄. Verify (7.2.3).

As a specific example, suppose you start with $9,700 (i.e., a = 9700) and your goal is to win $10,000 before going broke (i.e., c = 10000). If p = ½,
your probability of success is a/c = 0.97, which is very high; on the other hand, if p = 0.49, then your probability of success is given by

(1 − (0.51/0.49)^9700) / (1 − (0.51/0.49)^10000),

which is approximately 6.1 × 10⁻⁶, or about one chance in 163,000. This shows rather dramatically that even a small disadvantage on each bet can lead to a very large disadvantage in the long run!

Now let r(a) = r_{c,p}(a) = P(τ₀ < τ_c) be the probability that our gambler goes broke before reaching the desired fortune. Then clearly s(a) and r(a) are related. Indeed, by "considering the bank's point of view", we see immediately that r_{c,p}(a) = s_{c,1−p}(c−a) (that is, the chance of going broke before obtaining c dollars, when starting with a dollars and having probability p of winning each bet, is the same as the chance of obtaining c dollars before going broke, when starting with c−a dollars and having probability 1−p of winning each bet), so that

r_{c,p}(a) = (1 − (p/q)^{c−a}) / (1 − (p/q)^c), p ≠ ½,

and r_{c,p}(a) = (c−a)/c for p = ½.

Finally, let us consider the probability that the gambler will eventually go broke if they never stop gambling, i.e. P(τ₀ < ∞) without regard to any target fortune c. Well, if we let H_c = {τ₀ < τ_c}, then clearly {H_c} is increasing up to {τ₀ < ∞} as c → ∞. Hence, by continuity of probabilities,

P(τ₀ < ∞) = lim_{c→∞} P(H_c) = lim_{c→∞} r_{c,p}(a) = 1 if p ≤ ½; (q/p)^a if p > ½.   (7.2.7)

Thus, if p ≤ ½ (i.e., if the gambler has no advantage), then the gambler is certain to eventually go broke. On the other hand, if say p = 0.6 and a = 1, then the gambler has probability 1/3 of never going broke.
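The formulas (7.2.2) and (7.2.3) are easy to evaluate directly. The following small calculator (ours, not from the text) reproduces the numbers above:

# Gambler's ruin probabilities from (7.2.2) and (7.2.3).
def s(a, c, p):
    """P(hit c before 0), starting from a, with win probability p."""
    if p == 0.5:
        return a / c
    r = (1 - p) / p                    # q/p
    return (1 - r**a) / (1 - r**c)

print(s(9700, 10000, 0.5))    # 0.97
print(s(9700, 10000, 0.49))   # ~6.1e-06

# Ruin probability via r_{c,p}(a) = s_{c,1-p}(c-a):
print(s(300, 10000, 0.51))    # ~0.999994: ruin is nearly certain at p = 0.49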
7.3. Gambling policies.
Suppose now that our gambler is allowed to choose how much to bet each time. That is, the gambler can choose random variables W_n so that their fortune at time n is given by

X_n = a + W₁Z₁ + W₂Z₂ + ... + W_nZ_n,

with {Z_n} as before. To avoid "cheating", we shall insist that W_n ≥ 0 (you can't bet a negative amount), and also that W_n = f_n(Z₁, Z₂, ..., Z_{n−1}) is a function only of the previous bet results (you can't know the result of a bet before you choose how much to wager). Here the f_n : {−1, 1}^{n−1} → R_{≥0} are fixed deterministic functions, collectively referred to as the gambling policy. (If W_n = 1 for each n, then this is equivalent to our previous gambling model.)

Note that X_n = X_{n−1} + W_nZ_n, with W_n and Z_n independent, and with E(Z_n) = p − q. Hence, E(X_n) = E(X_{n−1}) + (p − q)E(W_n). Furthermore, since W_n ≥ 0, therefore E(W_n) ≥ 0. It follows that
(a) if p = ½, then E(X_n) = E(X_{n−1}) = ... = E(X₀) = a, so that lim E(X_n) = a;
(b) if p < ½, then E(X_n) ≤ E(X_{n−1}) ≤ ... ≤ E(X₀) = a, so that lim E(X_n) ≤ a;
(c) if p > ½, then E(X_n) ≥ E(X_{n−1}) ≥ ... ≥ E(X₀) = a, so that lim E(X_n) ≥ a.
This seems simple enough, and corresponds to our intuition: if p ≤ ½ then the player's expected value can only decrease. End of story? Perhaps not.

Consider the "double 'til you win" policy, defined by W₁ = 1 and, for n ≥ 2,

W_n = 2^{n−1} if Z₁ = ... = Z_{n−1} = −1; otherwise W_n = 0.

That is, we first bet $1. Each time we lose, we double our bet on the succeeding turn. As soon as we win once, we bet zero from then on. It is easily seen that, with this gambling policy, we will be up $1 as soon as we win a bet. That is, letting τ = inf{n ≥ 1 : Z_n = +1}, we have that X_n = a + 1 provided that τ ≤ n. Now, clearly P(τ > n) = (1 − p)ⁿ. Hence, assuming p > 0, we see that P(τ < ∞) = 1. It follows that P(lim X_n = a + 1) = 1, so that E(lim X_n) = a + 1. In words, with probability one we will gain $1 with this gambling policy, for any positive value of p, and thus "cheat fate".

How can this be, in light of (a) and (b) above? The answer, of course, is that in this case E(lim X_n) (which equals a + 1) is not the same as lim E(X_n) (which must be ≤ a). This is what allows us to "cheat fate". On the other hand, we may need to lose an arbitrarily large amount of money before we win our $1, so "infinite capital" is required to follow this gambling policy. We show now that, if the fortunes X_n must remain bounded (i.e., if we only have a finite amount of capital to draw on), then E(lim X_n) must indeed be the same as lim E(X_n).
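The policy is easy to simulate. The sketch below (ours, not from the text) runs "double 'til you win" many times with p = 0.49: every run ends up exactly $1, but it also records how far the fortune dips along the way, which is exactly the unbounded capital requirement just described:

import random

# "Double 'til you win" with p = 0.49: the final profit is always
# exactly $1, but interim losses can be enormous.
random.seed(5)
p, trials = 0.49, 100_000
worst_dip = 0
for _ in range(trials):
    fortune, bet = 0, 1
    while True:
        if random.random() < p:       # win: up exactly $1 overall
            fortune += bet
            break
        fortune -= bet                # lose: double and try again
        worst_dip = min(worst_dip, fortune)
        bet *= 2
    assert fortune == 1
print("every run ended up exactly +$1; worst interim loss:", worst_dip)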
Theorem 7.3.1. (The bounded convergence theorem.) Let {X_n} be a sequence of random variables, with lim X_n = X. Suppose there is K ∈ R such that |X_n| ≤ K for all n ∈ N (i.e., the {X_n} are uniformly bounded). Then E(X) = lim E(X_n).

Proof. We have from the triangle inequality that

|E(X) − E(X_n)| = |E(X − X_n)| ≤ E(|X − X_n|).

We shall show that this last expression goes to 0 as n → ∞. Indeed, fix ε > 0, and set A_n = {ω ∈ Ω : |X(ω) − X_n(ω)| ≥ ε}. Then

|X(ω) − X_n(ω)| ≤ ε + 2K 1_{A_n}(ω),

so that E(|X − X_n|) ≤ ε + 2K P(A_n). Hence, using Proposition 3.4.1, we have that

limsup_n E(|X − X_n|) ≤ ε + 2K limsup_n P(A_n) ≤ ε + 2K P(limsup_n A_n) = ε,

since |X(ω) − X_n(ω)| → 0 for all ω ∈ Ω, so that limsup_n A_n is the empty set. Hence, E(|X − X_n|) → 0, as claimed. ■

It follows immediately that, if we are gambling with p ≤ ½, and we use any gambling policy which leaves the fortunes {X_n} uniformly bounded, then lim X_n (if it exists) will have expected value equal to lim E(X_n), and therefore be ≤ a. So, if we have only finite capital (no matter how large), then it is not possible to cheat fate.

Finally, we consider the following question. Suppose p < ½, and 0 < a < c. What gambling system maximises P(τ_c < τ₀), i.e. maximises the probability that we reach the fortune c before losing all our money?

For example, suppose again that p = 0.49, and that a = 9700, c = 10000. If W_n = 1 (i.e. we bet exactly $1 each time), then we already know that P(τ_c < τ₀) = [(0.51/0.49)^9700 − 1] / [(0.51/0.49)^10000 − 1] ≈ 6.1 × 10⁻⁶. On the other hand, if W_n = 2, then this is equivalent to instead having a = 4850 and c = 5000, so we see that P(τ_c < τ₀) = [(0.51/0.49)^4850 − 1] / [(0.51/0.49)^5000 − 1] ≈ 2.5 × 10⁻³, which is about 400 times better. In fact, if W_n = 100, then P(τ_c < τ₀) = [(0.51/0.49)^97 − 1] / [(0.51/0.49)^100 − 1] ≈ 0.885, which is a very favourable probability.

This example suggests that, if p < ½ and we wish to maximise P(τ_c < τ₀), then it is best to bet in larger amounts, i.e. to get the game over with as quickly as possible. This is indeed true. More precisely, we define bold play to be the gambling strategy W_n = min(X_{n−1}, c − X_{n−1}), i.e. the strategy of betting as much as possible each time. It is then a theorem that, when p ≤ ½, this is the optimal strategy in the sense of maximising P(τ_c < τ₀). For a proof, see e.g. Billingsley (1995, Theorem 7.3).
Disclaimer. We note that for p < ½, even though to maximise P(τ_c < τ₀) it is best to bet large amounts, still overall it is best not to gamble at all. Indeed, by (7.2.7), if p ≤ ½ and you have any finite amount of money, then if you keep betting you will eventually lose it all and go broke. This is the reason few probabilists attend gambling casinos!

7.4. Exercises.
Exercise 7.4.1. For the stochastic process {X_n} given by (7.0.1), compute (for n, k ≥ 0)
(a) P(X_n = k).
(b) P(τ_k = n).
[Hint: These two questions do not have the same answer.]

Exercise 7.4.2. For the stochastic process {X_n} given by (7.0.2), compute (for n, k ≥ 0)
(a) P(X_n = k).
(b) P(X_n ≥ 0).

Exercise 7.4.3. Prove that there exist random variables Y and Z such that P(Y = 1) = P(Y = −1) = P(Z = 1) = P(Z = −1) = ¼, P(Y = 0) = P(Z = 0) = ½, and such that Cov(Y, Z) = ¼. (In particular, Y and Z are not independent.) [Hint: First use Theorem 7.1.1 to construct independent random variables X₁, X₂, and X₃ each having certain two-point distributions. Then construct Y and Z as functions of X₁, X₂, and X₃.]

Exercise 7.4.4. For the gambler's ruin model of Subsection 7.2, with c = 10000 and p = 0.49, find the smallest positive integer a such that s_{c,p}(a) ≥ ½. Interpret your result in plain English.

Exercise 7.4.5. For the gambler's ruin model of Subsection 7.2, let β_n = P(min(τ₀, τ_c) > n) be the probability that the player's fortune has not hit 0 or c by time n.
(a) Find any explicit, simple expression γ_n such that β_n ≤ γ_n for all n ∈ N, and such that lim_{n→∞} γ_n = 0.
(b) Find any explicit, simple expression α_n such that β_n ≥ α_n > 0 for all n ∈ N.

Exercise 7.4.6. Let {W_n} be i.i.d. with P[W_n = +1] = P[W_n = 0] = 1/4 and P[W_n = −1] = 1/2, and let a be a positive integer. Let X_n = a + W₁ + ... + W_n, and let τ₀ = inf{n ≥ 0 : X_n = 0}. Compute P(τ₀ < ∞). [Hint: Let {Y_n} be like {X_n}, except with immediate repetitions of values omitted. Is {Y_n} a simple random walk? With what parameter p?]
7.5. Section summary. 81 '*o 7 4.7« Verify explicitly that rcJa) + scJa) = 1 for all a, c, and?- cise 7.4-8- In gambler's ruin, recall that {rc < r0} is the event the player eventually wins, and {to < rc} is the event that the player eventually loses. ( \ Give a similar plain-English description of the complement of the union of these two events, i.e. ({rc < r0} U {r0 < rc}) . (b) Give three different proofs that the event described in part (a) has probability 0: one using Exercise 7.4.7; a second using Exercise 7.4.5; and a third recalling how the probabilities sCiP(a) were computed in the text, and seeing to what extent the computation would have differed if we had instead replaced sClP(a) by SCiP(a) = P(rc < r0). (c) Prove that, if c > 4, then the event described in part (a) contains un- countably many outcomes (i.e., that uncountably many different sequences Z\, Z2, • • • correspond to this event, even though it has probability 0). [Hint: This is not entirely dissimilar from the analysis of the Cantor set in Subsection 2.4.] Exercise 7.4.9. For the gambling policies model of Subsection 7.3, consider the "triple 'til you win" policy defined by W\ = 1, and for n > 2, Wn = 3n_1 if Zi = ... = Zn-i = -1 otherwise Wn = 0. (a) Prove that, with probability 1, the limit limn^oo Xn exists. (b) Describe precisely the distribution of linin^oo Xn. Exercise 7.4.10. Consider the gambling policies model, with p = 1/3, a = 6, and c = 8. (a) Compute the probability sc?p(a) that the player will win (i.e. hit c before hitting 0) if they bet $1 each time (i.e. if Wn = 1). (b) Compute the probability that the player will win if they bet $2 each time (i.e. if Wn = 2). (c) Compute the probability that the player will win if they employ the strategy of Bold Play (i.e., if Wn = min(Xn_1, c - Xn-i)). [Hint: While it is difficult to do explicit computations involving Bold Play in general, here the numbers are small enough that it is not difficult.] 7.5. Section summary. This section introduced the concept of stochastic processes, from the point of view of gambling games. It proved a general theorem about the existence of sequences of independent random variables having arbitrary .specified distributions. It used this to define several models of gambling
specified distributions. It used this to define several models of gambling games. If the player bets $1 each time, and has independent probability p of winning each bet, then explicit formulae were developed for the probability that the player achieves some specified target fortune c before losing all their money. The formula is very sensitive to values of p near ½. If the player is allowed to choose how much to bet at each stage, then with clever betting they can "cheat fate" and always win; however, this requires them to have infinite capital available. If their capital is finite and p ≤ ½, then by the Bounded Convergence Theorem, on average they will always lose money.
8. Discrete Markov chains.
In this section, we consider the general notion of a (discrete time, discrete space, time-homogeneous) Markov chain.

A Markov chain is characterised by three ingredients: a state space S, which is a finite or countable set; an initial distribution {ν_i}_{i∈S} consisting of non-negative numbers summing to 1; and transition probabilities {p_ij}_{i,j∈S} consisting of non-negative numbers with Σ_{j∈S} p_ij = 1 for each i ∈ S.

Intuitively, a Markov chain represents the random motion of some particle moving around the space S. ν_i represents the probability that the particle starts at the point i, while p_ij represents the probability that, if the particle is at the point i, it will then jump to the point j on the next step.

More formally, a Markov chain is defined to be a sequence of random variables X₀, X₁, X₂, ... taking values in the set S, such that

P(X₀ = i₀, X₁ = i₁, ..., X_n = i_n) = ν_{i₀} p_{i₀i₁} p_{i₁i₂} ... p_{i_{n−1}i_n}

for any n ∈ N and any choice of i₀, i₁, ..., i_n ∈ S. (Note that, for these {X_n} to be random variables in the sense of Definition 3.1.1, we need to have S ⊆ R; however, if we allow more general random variables as in Remark 3.1.10, then this restriction is not necessary.) It then follows that, for example,

P(X₁ = j) = Σ_{i∈S} P(X₀ = i, X₁ = j) = Σ_{i∈S} ν_i p_ij.

This also has a matrix interpretation: writing [p] for the matrix {p_ij}_{i,j∈S}, and [μ^(n)] for the row vector {P(X_n = i)}_{i∈S}, we have [μ^(1)] = [μ^(0)][p], and more generally [μ^(n+1)] = [μ^(n)][p].

We present some simple examples of Markov chains here. Note that, except in the first example, we do not bother to specify initial probabilities {ν_i}; we shall see that initial probabilities are often not crucial when studying a chain's properties.

Example 8.0.1. (Simple random walk.) Let S = Z be the set of all integers. Fix a ∈ Z, and let ν_a = 1, with ν_i = 0 for i ≠ a. Fix a real number p with 0 < p < 1, and let p_{i,i+1} = p and p_{i,i−1} = 1 − p for each i ∈ Z, with p_ij = 0 if j ≠ i ± 1. Thus, this Markov chain begins at the point a (with probability 1), and at each step either increases by 1 (with probability p) or decreases by 1 (with probability 1 − p). It is easily seen that this Markov chain corresponds precisely to the gambling game of Subsection 7.2.
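The formal definition translates directly into a simulation recipe: sample X₀ from {ν_i}, then repeatedly sample the next state from row X_n of the transition matrix. A generic sketch (ours, not from the text; the three-state chain at the bottom is a made-up example for testing):

import random

# Simulate a Markov chain: draw X_0 from nu, then jump from state i
# to state j with probability p[i][j].  States are dictionary keys.
random.seed(6)

def pick(dist):
    u, acc = random.random(), 0.0
    for state, prob in dist.items():
        acc += prob
        if u < acc:
            return state
    return state                      # guard against rounding

def run_chain(nu, p, steps):
    x = pick(nu)
    path = [x]
    for _ in range(steps):
        x = pick(p[x])
        path.append(x)
    return path

# A hypothetical lazy walk on {0, 1, 2}, just to exercise the sampler:
nu = {0: 1.0}
p = {0: {0: 0.5, 1: 0.5},
     1: {0: 0.25, 1: 0.5, 2: 0.25},
     2: {1: 0.5, 2: 0.5}}
print(run_chain(nu, p, 20))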
Example 8.0.2. Let S = {1, 2, 3} consist of just three elements, and define the transition probabilities {p_ij} in matrix form by

        ( 0    1/2  1/2 )
[p]  =  ( 1/3  1/3  1/3 )
        ( 1/4  1/4  1/2 )

(so that p₃₁ = ¼, etc.). This Markov chain jumps around on the three points {1, 2, 3} in a random and interesting way.

Example 8.0.3. (Random walk on Z/(d).) Let S = {0, 1, 2, ..., d−1}, and define the transition probabilities by

p_ij = 1/3 if j = i, or j = i + 1 (mod d), or j = i − 1 (mod d); p_ij = 0 otherwise.

If we think of the d elements of S as arranged in a circle, then our particle, at each step, either stays where it is, or moves one step clockwise, or moves one step counter-clockwise, each with probability ⅓.

Example 8.0.4. (Ehrenfest's urn.) Consider two urns, Urn 1 and Urn 2. Suppose there are d balls divided between the two urns. Suppose at each step, we choose one ball uniformly at random from among the d balls, and switch it to the opposite urn. We let X_n be the number of balls in Urn 1 at time n. Thus, S = {0, 1, 2, ..., d}, with p_{i,i−1} = i/d and p_{i,i+1} = (d−i)/d, for 0 ≤ i ≤ d (with p_ij = 0 if j ≠ i ± 1). Thus, this Markov chain moves randomly among the possible numbers {0, 1, ..., d} of balls in Urn 1 at each time. One might expect that, if d is large and the Markov chain is run for a long time, there would most likely be approximately d/2 balls in Urn 1. (We shall consider such questions in Subsection 8.3 below.)
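The matrix interpretation [μ^(n+1)] = [μ^(n)][p] from earlier in this section is easy to check numerically. The following sketch (ours, not from the text) iterates it for the chain of Example 8.0.2 started at state 1; the distribution appears to settle down to a limit, a phenomenon studied in Subsection 8.3:

# Iterate [mu^(n+1)] = [mu^(n)][p] for the chain of Example 8.0.2.
P = [[0.0, 0.5, 0.5],
     [1/3, 1/3, 1/3],
     [0.25, 0.25, 0.5]]
mu = [1.0, 0.0, 0.0]                   # X_0 = 1 with probability 1

def step(mu, P):
    n = len(mu)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

for n in range(20):
    mu = step(mu, P)
print([round(x, 4) for x in mu])       # settles near (2/9, 1/3, 4/9)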
We note that we can also interpret Markov chains in terms of conditional probability. Recall that, if A and B are events with P(B) > 0, then the conditional probability of A given B is P(A|B) = P(A ∩ B)/P(B); intuitively, it is the probabilistic proportion of the event B which is also contained in the event A. Thus, for a Markov chain, if P(X_k = i) > 0, then

P(X_{k+1} = j | X_k = i) = P(X_k = i, X_{k+1} = j) / P(X_k = i)
= [Σ_{i₀,i₁,...,i_{k−1}} ν_{i₀} p_{i₀i₁} ... p_{i_{k−1}i} p_{ij}] / [Σ_{i₀,i₁,...,i_{k−1}} ν_{i₀} p_{i₀i₁} ... p_{i_{k−1}i}] = p_ij.

This formally justifies the notion of p_ij as a transition probability. Note also that this conditional probability does not depend on the starting time k, justifying the notion of time-homogeneity. (The only reason we do not use this conditional probability as our definition of a Markov chain is that the conditional probability is not well-defined if P(X_k = i) = 0.)

We can similarly compute, for any n ∈ N, again assuming P(X_k = i) > 0, that P(X_{k+n} = j | X_k = i) is equal to

Σ_{i_{k+1},...,i_{k+n−1}} p_{i,i_{k+1}} p_{i_{k+1},i_{k+2}} ... p_{i_{k+n−1},j} ≡ p_ij^(n),

where p_ij^(n) is an n-th order transition probability. Note again that this probability does not depend on the starting time k (despite appearances to the contrary). By convention, we set

p_ij^(0) = δ_ij = 1 if i = j; 0 otherwise,   (8.0.5)

i.e. in zero time units the Markov chain just stays where it is.

8.1. A Markov chain existence theorem.
To rigorously study Markov chains, we need to be sure that they exist. Fortunately, this is relatively straightforward.

Theorem 8.1.1. Given a non-empty countable set S, and non-negative numbers {ν_i}_{i∈S} and {p_ij}_{i,j∈S}, with Σ_j ν_j = 1 and Σ_j p_ij = 1 for each i ∈ S, there exists a probability triple (Ω, F, P), and random variables X₀, X₁, ... defined on (Ω, F, P), such that

P(X₀ = i₀, X₁ = i₁, ..., X_n = i_n) = ν_{i₀} p_{i₀i₁} ... p_{i_{n−1}i_n}

for all n ∈ N and all i₀, ..., i_n ∈ S.

Proof. We let (Ω, F, P) be Lebesgue measure on [0,1]. We construct the random variables {X_n} as follows.
1. Partition [0,1] into intervals {I_{i₀}^(0)}_{i₀∈S}, with length(I_{i₀}^(0)) = ν_{i₀}.
2. Partition each I_{i₀}^(0) into intervals {I_{i₀i₁}^(1)}_{i₁∈S}, with length(I_{i₀i₁}^(1)) = ν_{i₀} p_{i₀i₁}.
3. Inductively, partition [0,1] into intervals {I_{i₀i₁...i_n}^(n)}_{i₀,i₁,...,i_n∈S}, such that I_{i₀i₁...i_n}^(n) ⊆ I_{i₀...i_{n−1}}^(n−1) and length(I_{i₀i₁...i_n}^(n)) = ν_{i₀} p_{i₀i₁} ... p_{i_{n−1}i_n}, for all n ∈ N.
4. Define X_n by saying that X_n(ω) = i_n if ω ∈ I_{i₀i₁...i_{n−1}i_n}^(n) for some choice of i₀, ..., i_{n−1} ∈ S.
Then it is easily verified that the random variables {X_n} have the desired properties. ■

We thus see that, as in Theorem 7.1.1, an "old friend" (in this case Lebesgue measure on [0,1]) was able to serve as a probability space on which to define important random variables.

8.2. Transience, recurrence, and irreducibility.
In this subsection, we consider some fundamental notions related to Markov chains. For simplicity, we shall assume that ν_i > 0 for all i ∈ S, and shall write P_i(...) as shorthand for the conditional probability P(... | X₀ = i). Intuitively, P_i(A) stands for the probability that the event A would have occurred, had the Markov chain been started at the state i. We shall also write E_i(...) for expected values with respect to P_i.

To proceed, we define, for i, j ∈ S and n ∈ N, the probabilities

f_ij^(n) = P_i(X_n = j, but X_m ≠ j for 1 ≤ m ≤ n−1);
f_ij = P_i(∃ n ≥ 1 : X_n = j) = Σ_{n=1}^∞ f_ij^(n).

That is, f_ij^(n) is the probability, starting from i, that we first hit j at the time n; f_ij is the probability, starting from i, that we ever hit j.

A state i ∈ S is called recurrent (or persistent) if f_ii = 1, i.e. if starting from i we will certainly eventually return to i. It is called transient if it is not recurrent, i.e. if f_ii < 1. Recurrence and transience are very important concepts in Markov chain theory, and we prove some results about them here. (Recall that i.o. stands for infinitely often.)

Theorem 8.2.1. Let {X_n} be a Markov chain, on a state space S. Let i ∈ S. Then i is transient if and only if P_i(X_n = i i.o.) = 0, if and only if Σ_{n=1}^∞ p_ii^(n) < ∞. On the other hand, i is recurrent if and only if P_i(X_n = i i.o.) = 1, if and only if Σ_{n=1}^∞ p_ii^(n) = ∞.

To prove this theorem, we begin with a lemma.

Lemma 8.2.2. We have P_i(#{n ≥ 1 : X_n = j} ≥ k) = f_ij (f_jj)^{k−1}, for k = 1, 2, .... In words, the probability that, starting from i, we eventually hit the state j at least k times, is given by f_ij (f_jj)^{k−1}.

Proof. Starting from i, to hit j at least k times is equivalent to first hitting j once (starting from i), then to return to j at least k − 1 more
times (each time starting from j). The result follows. ■

Proof of Theorem 8.2.1. By continuity of probabilities, and by Lemma 8.2.2, we have that

P_i(X_n = i i.o.) = lim_{k→∞} P_i(#{n ≥ 1 : X_n = i} ≥ k) = lim_{k→∞} (f_ii)^k = 0 if f_ii < 1; 1 if f_ii = 1.

This proves the first equivalence to each of transience and recurrence. Also, using first countable linearity, and then Proposition 4.2.9, and then Lemma 8.2.2, we compute that

Σ_{n=1}^∞ p_ii^(n) = Σ_{n=1}^∞ P_i(X_n = i) = Σ_{n=1}^∞ E_i(1_{X_n=i}) = E_i(#{n ≥ 1 : X_n = i})
= Σ_{k=1}^∞ P_i(#{n ≥ 1 : X_n = i} ≥ k) = Σ_{k=1}^∞ (f_ii)^k = f_ii/(1 − f_ii) < ∞ if f_ii < 1; = ∞ if f_ii = 1,

thus proving the remaining two equivalences. ■

As an application of this theorem, we consider simple symmetric random walk (cf. Subsection 7.2 or Example 8.0.1, with X₀ = 0 and p = ½). We recall that here S = Z, and for any i ∈ Z we have p_ii^(n) = (n choose n/2)(½)ⁿ = n! / (((n/2)!)² 2ⁿ) for n even (with p_ii^(n) = 0 for n odd). Using Stirling's approximation, which states that n! ~ (n/e)ⁿ √(2πn) as n → ∞, we compute that for large even n, we have p_ii^(n) ~ √(2/(πn)). Hence, we see (cf. (A.3.7)) that Σ_n p_ii^(n) = ∞. Therefore, by Theorem 8.2.1, simple symmetric random walk is recurrent.

On the other hand, if p ≠ ½ for simple random walk, then p_ii^(n) = (n choose n/2)(p(1−p))^{n/2} with p(1−p) < ¼. Stirling then gives, for large even n, that p_ii^(n) ~ [4p(1−p)]^{n/2} √(2/(πn)), with 4p(1−p) < 1. It follows that Σ_n p_ii^(n) < ∞, so that simple asymmetric random walk is not recurrent.

In higher dimensions, suppose that S = Z^d, with each of the d coordinates independently following simple symmetric random walk. Then clearly p_ii^(n) for this process is simply the d-th power of the corresponding quantity for ordinary (one-dimensional) simple symmetric random walk. Hence, for
In higher dimensions, suppose that $S = \mathbf{Z}^d$, with each of the $d$ coordinates independently following simple symmetric random walk. Then clearly $p_{ii}^{(n)}$ for this process is simply the $d$th power of the corresponding quantity for ordinary (one-dimensional) simple symmetric random walk. Hence, for large even $n$, we have $p_{ii}^{(n)} \sim (2/\pi n)^{d/2}$ in this case. It follows (cf. (A.3.7)) that $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$ if and only if $d \le 2$. That is, higher-dimensional simple symmetric random walk is recurrent in dimensions 1 and 2, but transient in dimensions $\ge 3$, a somewhat counter-intuitive result.

Finally, we note that for irreducible Markov chains, somewhat more can be said. A Markov chain is irreducible if $f_{ij} > 0$ for all $i, j \in S$, i.e. if it is possible for the chain to move from any state to any other state. Equivalently, the Markov chain is irreducible if for any $i, j \in S$, there is $n \in \mathbf{N}$ with $p_{ij}^{(n)} > 0$. (A chain is reducible if it is not irreducible, i.e. if $f_{ij} = 0$ for some $i, j \in S$.) We can now prove

Theorem 8.2.3. Let $\{p_{ij}\}_{i,j \in S}$ be the transition probabilities for an irreducible Markov chain on a state space $S$. Then the following are equivalent:
(1) There is $k \in S$ with $f_{kk} = 1$, i.e. with $k$ recurrent.
(2) For all $i, j \in S$, we have $f_{ij} = 1$ (so, in particular, all states are recurrent).
(3) There are $k, \ell \in S$ with $\sum_{n=1}^{\infty} p_{k\ell}^{(n)} = \infty$.
(4) For all $i, j \in S$, we have $\sum_{n=1}^{\infty} p_{ij}^{(n)} = \infty$.
If any of (1)-(4) hold, we say that the Markov chain itself is recurrent.

Proof. That (2) implies (1) is immediate. That (1) implies (3) follows immediately from Theorem 8.2.1.

To show that (3) implies (4): Assume that (3) holds, and let $i, j \in S$. By irreducibility, there are $m, r \in \mathbf{N}$ with $p_{ik}^{(m)} > 0$ and $p_{\ell j}^{(r)} > 0$. But then, since $p_{ij}^{(m+n+r)} \ge p_{ik}^{(m)} p_{k\ell}^{(n)} p_{\ell j}^{(r)}$, we have $\sum_n p_{ij}^{(n)} \ge p_{ik}^{(m)} p_{\ell j}^{(r)} \sum_n p_{k\ell}^{(n)} = \infty$, as claimed.

To show that (4) implies (2): Suppose, to the contrary of (2), that $f_{ij} < 1$ for some $i, j \in S$. Then $1 - f_{jj} \ge \mathbf{P}_j(\tau_i < \tau_j)(1 - f_{ij}) > 0$. (Here $\mathbf{P}_j(\tau_i < \tau_j)$ stands for the probability, starting from $j$, that we hit $i$ before returning to $j$; it is positive by irreducibility.) Hence, $f_{jj} < 1$, so by Theorem 8.2.1 we have $\sum_n p_{jj}^{(n)} < \infty$, contradicting (4). ■

For example, for simple symmetric random walk, since $\sum_n p_{ii}^{(n)} = \infty$, this theorem says that from any state $i$, the walk will (with probability 1) eventually reach any other state $j$. Thus, the walk will keep on wandering from state to state forever; this is related to "fluctuation theory" for random walks. In particular, this implies that for simple symmetric random walk, with probability 1 we have $\limsup_n X_n = \infty$ and $\liminf_n X_n = -\infty$.
Such considerations are related to the remarkable law of the iterated logarithm. Let $X_n = Z_1 + \ldots + Z_n$ define a random walk, where $Z_1, Z_2, \ldots$ are i.i.d. with mean 0 and variance 1. (For example, perhaps $\{X_n\}$ is simple symmetric random walk.) Then it is a fact that with probability 1, $\limsup_n X_n / \sqrt{2n\log\log n} = 1$. In other words, for any $\epsilon > 0$, we have $\mathbf{P}(X_n > (1+\epsilon)\sqrt{2n\log\log n}\ \text{i.o.}) = 0$ and $\mathbf{P}(X_n > (1-\epsilon)\sqrt{2n\log\log n}\ \text{i.o.}) = 1$. This gives extremely precise information about how the peaks of $\{X_n\}$ grow for large $n$. For a proof see e.g. Billingsley (1995, Theorem 9.5).

8.3. Stationary distributions and convergence.

Given a Markov chain on a state space $S$, with transition probabilities $\{p_{ij}\}$, let $\{\pi_i\}_{i \in S}$ be a distribution on $S$ (i.e., $\pi_i \ge 0$ for all $i \in S$, and $\sum_{i \in S} \pi_i = 1$). Then $\{\pi_i\}$ is stationary for the Markov chain if $\sum_{i \in S} \pi_i p_{ij} = \pi_j$ for all $j \in S$. (In matrix form, $[\pi][p] = [\pi]$.) What this means is that if we start the Markov chain in the distribution $\{\pi_i\}$ (i.e., $\mathbf{P}[X_0 = i] = \pi_i$ for all $i \in S$), then one time unit later the distribution will still be $\{\pi_i\}$ (i.e., $\mathbf{P}[X_1 = i] = \pi_i$ for all $i \in S$); this is the reason for the terminology "stationary". It follows immediately by induction that the chain will still be in the same distribution $\{\pi_i\}$ any number $n$ of steps later (i.e., $\mathbf{P}[X_n = i] = \pi_i$ for all $i \in S$). Equivalently, $\sum_{i \in S} \pi_i p_{ij}^{(n)} = \pi_j$ for any $n \in \mathbf{N}$. (In matrix form this is even clearer: $[\pi][p]^n = [\pi]$.)

Exercise 8.3.1. A Markov chain is said to be reversible with respect to a distribution $\{\pi_i\}$ if, for all $i, j \in S$, we have $\pi_i p_{ij} = \pi_j p_{ji}$. Prove that, if a Markov chain is reversible with respect to $\{\pi_i\}$, then $\{\pi_i\}$ is a stationary distribution for the chain.

For example, for random walk on $\mathbf{Z}/(d)$ as in Example 8.0.3, it is computed that the uniform distribution, given by $\pi_i = 1/d$ for $i = 0, 1, \ldots, d-1$, is a stationary distribution. For Ehrenfest's Urn (Example 8.0.4), it is computed that the binomial distribution, given by $\pi_i = \binom{d}{i}\frac{1}{2^d}$ for $i = 0, 1, \ldots, d$, is a stationary distribution.

Exercise 8.3.2. Verify these last two statements.
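As a numerical companion to Exercise 8.3.2, here is a brief Python sketch (ours, not from the text; it simply checks the defining equation $\pi P = \pi$ for the two chains, taking $d = 6$ for concreteness):

    import numpy as np
    from math import comb

    d = 6

    # Random walk on Z/(d): move +1 or -1 (mod d), each with probability 1/2.
    P_cycle = np.zeros((d, d))
    for i in range(d):
        P_cycle[i, (i + 1) % d] = 0.5
        P_cycle[i, (i - 1) % d] = 0.5
    pi_uniform = np.full(d, 1.0 / d)

    # Ehrenfest's urn on {0,...,d}: a uniformly chosen ball switches urns,
    # so p_{i,i-1} = i/d and p_{i,i+1} = (d-i)/d.
    P_urn = np.zeros((d + 1, d + 1))
    for i in range(d + 1):
        if i > 0:
            P_urn[i, i - 1] = i / d
        if i < d:
            P_urn[i, i + 1] = (d - i) / d
    pi_binom = np.array([comb(d, i) / 2 ** d for i in range(d + 1)])

    # Stationarity means pi P = pi: one step preserves the distribution.
    print(np.allclose(pi_uniform @ P_cycle, pi_uniform))  # True
    print(np.allclose(pi_binom @ P_urn, pi_binom))        # True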
Now, given a Markov chain with a stationary distribution, one might expect that if the chain is run for a long time (i.e. $n \to \infty$), then the probability of being at a particular state $j \in S$ might converge to $\pi_j$, regardless of the initial state chosen. That is, one might expect that $\lim_{n\to\infty} \mathbf{P}_i(X_n = j) = \pi_j$ for any $i, j \in S$. This is not true in complete generality, as the following two examples show. However, we shall see in Theorem 8.3.10 below that this is indeed true for many Markov chains.

For a first example, suppose $S = \{1,2\}$, and that the transition probabilities are given by
$$(p_{ij}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$
That is, this Markov chain never moves at all! Hence, any distribution is stationary for this chain. In particular, we could take $\pi_1 = \pi_2 = \frac12$ as a stationary distribution. On the other hand, we clearly have $\mathbf{P}_1(X_n = 1) = 1$ for all $n \in \mathbf{N}$, which certainly does not approach $\frac12$. Thus, this Markov chain does not converge to the stationary distribution $\{\pi_i\}$. In fact, this Markov chain is clearly reducible (i.e., not irreducible), which is the obstacle to convergence.

For a second example, suppose again that $S = \{1,2\}$, and that the transition probabilities are given this time by
$$(p_{ij}) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
Again we may take $\pi_1 = \pi_2 = \frac12$ as a stationary distribution (in fact, this time the stationary distribution is unique). Furthermore, this Markov chain is irreducible. On the other hand, we have $\mathbf{P}_1(X_n = 1) = 1$ for $n$ even, and $\mathbf{P}_1(X_n = 1) = 0$ for $n$ odd. Hence, again we do not have $\mathbf{P}_1(X_n = 1) \to \frac12$. (On the other hand, the Cesàro averages of $\mathbf{P}_1(X_n = 1)$, i.e. $\frac1n \sum_{\ell=1}^n \mathbf{P}_1(X_\ell = 1)$, do indeed converge to $\frac12$, which is not a coincidence.) Here the obstacle to convergence is that the Markov chain is "periodic", with period 2, as we now discuss.

Definition 8.3.3. Given Markov chain transitions $\{p_{ij}\}$ on a state space $S$, and a state $i \in S$, the period of $i$ is the greatest common divisor of the set $\{n \ge 1;\ p_{ii}^{(n)} > 0\}$.

That is, the period of $i$ is the g.c.d. of the times at which it is possible to travel from $i$ to $i$. For example, if the period of $i$ is 2, then it is only possible to travel from $i$ to $i$ in an even number of steps. (Such was the case for the second example above.) Clearly, if the period of a state is greater than 1, then this will be an obstacle to convergence to a stationary distribution. This prompts the following definition.

Definition 8.3.4. A Markov chain is aperiodic if the period of each state is 1. (A chain which is not aperiodic is said to be periodic.)

Before proceeding, we note a fact about periods that makes aperiodicity easier to verify.
Lemma 8.3.5. Let $i$ and $j$ be two states of a Markov chain, and suppose that $f_{ij} > 0$ and $f_{ji} > 0$ (i.e., $i$ and $j$ "communicate"). Then the periods of $i$ and of $j$ are equal.

Proof. Since $f_{ij} > 0$ and $f_{ji} > 0$, there are $r, s \in \mathbf{N}$ with $p_{ij}^{(r)} > 0$ and $p_{ji}^{(s)} > 0$. Since $p_{ii}^{(r+n+s)} \ge p_{ij}^{(r)} p_{jj}^{(n)} p_{ji}^{(s)}$, this implies that
$$p_{ii}^{(r+n+s)} > 0 \quad \text{whenever} \quad p_{jj}^{(n)} > 0. \qquad (8.3.6)$$
Now, if we let the periods of $i$ and $j$ be $t_i$ and $t_j$, respectively, then (8.3.6) with $n = 0$ implies that $t_i$ divides $r + s$. Then, for any $n \in \mathbf{N}$ with $p_{jj}^{(n)} > 0$, (8.3.6) implies that $t_i$ divides $r + n + s$, hence that $t_i$ divides $n$. That is, $t_i$ is a common divisor of $\{n \in \mathbf{N};\ p_{jj}^{(n)} > 0\}$. Since $t_j$ is the greatest common divisor of this set, we must have $t_i \le t_j$. Similarly, $t_j \le t_i$, so we must have $t_i = t_j$. ■

This immediately implies

Corollary 8.3.7. If a Markov chain is irreducible, then all of its states have the same period.

Hence, for an irreducible Markov chain, it suffices to check aperiodicity at any single state.

We shall prove in Theorem 8.3.10 below that all Markov chains which are irreducible and aperiodic, and have a stationary distribution, do in fact converge to it. Before doing so, we require two further lemmas, which give more concrete implications of irreducibility and aperiodicity.

Lemma 8.3.8. If a Markov chain is irreducible, and has a stationary distribution $\{\pi_i\}$, then it is recurrent.

Proof. Suppose to the contrary that the chain is not recurrent. Then, by Theorem 8.2.3, we have $\sum_n p_{ij}^{(n)} < \infty$ for all states $i$ and $j$; in particular, $\lim_{n\to\infty} p_{ij}^{(n)} = 0$. Now, since $\{\pi_i\}$ is stationary, we have $\pi_j = \sum_i \pi_i p_{ij}^{(n)}$ for all $n \in \mathbf{N}$. Hence, letting $n \to \infty$, we see (formally, by using the M-test, cf. Proposition A.4.8) that we must have $\pi_j = 0$ for all states $j$. This contradicts the fact that $\sum_j \pi_j = 1$. ■

Remark. The converse to Lemma 8.3.8 is false, in the sense that just because an irreducible Markov chain is recurrent, it might not have a stationary distribution (e.g. this is the case for simple symmetric random walk).
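For a finite chain, periods are easy to compute directly from Definition 8.3.3: take the g.c.d. of the powers $n$ for which $(P^n)_{ii} > 0$. A small Python sketch (ours; it truncates at a modest power, which suffices for small chains, and compares exact zeros):

    import numpy as np
    from math import gcd

    def period(P, i, n_max=50):
        """The gcd of those n <= n_max with (P^n)_{ii} > 0; for a small
        irreducible chain this already pins down the period of state i."""
        g = 0
        Pn = np.eye(len(P))
        for n in range(1, n_max + 1):
            Pn = Pn @ P
            if Pn[i, i] > 0:
                g = gcd(g, n)
        return g

    flip = np.array([[0.0, 1.0], [1.0, 0.0]])   # second example above
    print(period(flip, 0))                      # 2: periodic
    lazy = np.array([[0.5, 0.5], [1.0, 0.0]])
    print(period(lazy, 0))                      # 1: aperiodic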
Lemma 8.3.9. If a Markov chain is irreducible and aperiodic, then for each pair $(i,j)$ of states, there is a number $n_0 = n_0(i,j)$, such that $p_{ij}^{(n)} > 0$ for all $n \ge n_0$.

Proof. Fix states $i$ and $j$. Let $T = \{n \ge 1;\ p_{ii}^{(n)} > 0\}$. Then, by aperiodicity, $\gcd(T) = 1$. Hence, we can find $m \in \mathbf{N}$, and $k_1, \ldots, k_m \in T$ and integers $b_1, \ldots, b_m$, such that $k_1 b_1 + \ldots + k_m b_m = 1$. We furthermore choose any $a \in T$, and also (by irreducibility) choose $c \in \mathbf{N}$ with $p_{ij}^{(c)} > 0$. These values shall be the key to defining $n_0$.

We now set $M = k_1|b_1| + \ldots + k_m|b_m|$ (i.e., a sum that without the absolute value signs would be 1), and define $n_0 = n_0(i,j) = aM + c$. Then, if $n \ge n_0$, then letting $r = \lfloor (n-c)/a \rfloor$, we can write $n = c + ra + s$ where $0 \le s < a$ and $r \ge M$. We then observe that, since $\sum_{\ell=1}^m b_\ell k_\ell = 1$ and $\sum_{\ell=1}^m |b_\ell| k_\ell = M$, we have
$$n = (r - M)\,a + \sum_{\ell=1}^m \left( a|b_\ell| + s\,b_\ell \right) k_\ell + c,$$
where the quantities in brackets are non-negative. Hence, recalling that $a, k_\ell \in T$, and that $p_{ij}^{(c)} > 0$, we have that
$$p_{ij}^{(n)} \ \ge\ \left(p_{ii}^{(a)}\right)^{r-M} \left[\prod_{\ell=1}^m \left(p_{ii}^{(k_\ell)}\right)^{a|b_\ell| + s b_\ell}\right] p_{ij}^{(c)} \ >\ 0,$$
as required. ■

We are now able to prove our main Markov chain convergence theorem.

Theorem 8.3.10. If a Markov chain is irreducible and aperiodic, and has a stationary distribution $\{\pi_i\}$, then for all states $i$ and $j$, we have $\lim_{n\to\infty} \mathbf{P}_i(X_n = j) = \pi_j$.

Proof. The proof uses the method of "coupling". Let the original Markov chain have state space $S$, and transition probabilities $\{p_{ij}\}$. We define a new Markov chain $\{(X_n, Y_n)\}_{n=0}^{\infty}$, having state space $\bar S = S \times S$, and transition probabilities given by $\bar p_{(i,j),(k,\ell)} = p_{ik}\, p_{j\ell}$. That is, our new Markov chain has two coordinates, each of which is an independent copy of the original Markov chain. It follows immediately that the distribution on $\bar S$ given by $\bar\pi_{(i,j)} = \pi_i \pi_j$ (i.e., a product of two probabilities in the original stationary distribution) is in fact a stationary distribution for our new Markov chain.
Furthermore, from Lemma 8.3.9 above, we see that our new Markov chain will again be irreducible and aperiodic (indeed, we have $\bar p^{(n)}_{(i,j),(k,\ell)} > 0$ whenever $n \ge \max(n_0(i,k), n_0(j,\ell))$). Hence, from Lemma 8.3.8, we see that our new Markov chain is in fact recurrent. This is the key to what follows.

To complete the proof, we choose $i_0 \in S$, and set $\tau = \inf\{n \ge 1;\ X_n = Y_n = i_0\}$. Note that, for $m \le n$, the quantities $\mathbf{P}_{(i,j)}(\tau = m, X_n = k)$ and $\mathbf{P}_{(i,j)}(\tau = m, Y_n = k)$ are equal; indeed, they both equal $\mathbf{P}_{(i,j)}(\tau = m)\, p_{i_0 k}^{(n-m)}$. Hence, for any states $i$, $j$, and $k$, we have that
$$\begin{aligned} \left| p_{ik}^{(n)} - p_{jk}^{(n)} \right| &= \left| \mathbf{P}_{(i,j)}(X_n = k) - \mathbf{P}_{(i,j)}(Y_n = k) \right| \\ &= \Big| \sum_{m=1}^n \mathbf{P}_{(i,j)}(X_n = k, \tau = m) + \mathbf{P}_{(i,j)}(X_n = k, \tau > n) \\ &\qquad - \sum_{m=1}^n \mathbf{P}_{(i,j)}(Y_n = k, \tau = m) - \mathbf{P}_{(i,j)}(Y_n = k, \tau > n) \Big| \\ &= \left| \mathbf{P}_{(i,j)}(X_n = k, \tau > n) - \mathbf{P}_{(i,j)}(Y_n = k, \tau > n) \right| \ \le\ \mathbf{P}_{(i,j)}(\tau > n) \end{aligned}$$
(where the inequality follows since we are considering a difference of two non-negative quantities, each of which is $\le \mathbf{P}_{(i,j)}(\tau > n)$).

Now, since the new Markov chain is irreducible and recurrent, it follows from Theorem 8.2.3 that $\bar f_{(i,j),(i_0,i_0)} = 1$. That is, with probability 1, the chain will eventually hit the state $(i_0, i_0)$, in which case $\tau < \infty$. Hence, as $n \to \infty$, we have $\mathbf{P}_{(i,j)}(\tau > n) \to 0$, so that $\left|p_{ik}^{(n)} - p_{jk}^{(n)}\right| \to 0$.

On the other hand, by stationarity, we have for the original Markov chain that
$$\left| p_{ij}^{(n)} - \pi_j \right| = \left| \sum_{k \in S} \pi_k \left( p_{ij}^{(n)} - p_{kj}^{(n)} \right) \right| \le \sum_{k \in S} \pi_k \left| p_{ij}^{(n)} - p_{kj}^{(n)} \right|,$$
and we see (again, by the M-test) that this converges to 0 as $n \to \infty$. Since $p_{ij}^{(n)} = \mathbf{P}_i(X_n = j)$, this proves the theorem. ■
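The conclusion of Theorem 8.3.10 can be watched numerically: the rows of $P^n$, which are the laws of $X_n$ from the various starting states, all flatten onto $\{\pi_j\}$. A Python sketch (ours, using a small irreducible aperiodic chain of our own choosing):

    import numpy as np

    P = np.array([[0.0, 0.5, 0.5],
                  [0.3, 0.2, 0.5],
                  [0.4, 0.4, 0.2]])

    # Solve pi P = pi, sum(pi) = 1, via the left eigenvector for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi /= pi.sum()

    # Row i of P^n is the distribution of X_n started from state i;
    # every row approaches pi, as Theorem 8.3.10 asserts.
    Pn = np.linalg.matrix_power(P, 50)
    print(pi)
    print(Pn)  # all three rows agree with pi to many decimal places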
Finally, we note that Theorem 8.3.10 remains true if the chain begins in some other initial distribution $\{\nu_i\}$, besides those with $\nu_i = 1$ for some $i$:

Corollary 8.3.11. If a Markov chain is irreducible and aperiodic, and has a stationary distribution $\{\pi_i\}$, then regardless of the initial distribution, for all $j \in S$, we have $\lim_{n\to\infty} \mathbf{P}(X_n = j) = \pi_j$.

Proof. Using the M-test and Theorem 8.3.10, we have
$$\lim_{n\to\infty} \mathbf{P}(X_n = j) = \lim_{n\to\infty} \sum_{i \in S} \mathbf{P}(X_0 = i, X_n = j) = \lim_{n\to\infty} \sum_{i \in S} \nu_i\, \mathbf{P}_i(X_n = j) = \sum_{i \in S} \nu_i \lim_{n\to\infty} \mathbf{P}_i(X_n = j) = \pi_j,$$
which gives the result. ■

8.4. Existence of stationary distributions.

Theorem 8.3.10 gives powerful information about the convergence of irreducible and aperiodic Markov chains. However, this theorem requires the existence of a stationary distribution $\{\pi_i\}$. It is reasonable to ask for conditions under which a stationary distribution will exist. We consider that question here.

Given a Markov chain $\{X_n\}$ on a state space $S$, and given a state $i \in S$, we define the mean return time $m_i$ by $m_i = \mathbf{E}_i(\inf\{n \ge 1;\ X_n = i\})$. That is, $m_i$ is the expected time to return to the state $i$. We always have $m_i \ge 1$. If $i$ is transient, then with positive probability we will have $\inf\{n \ge 1;\ X_n = i\} = \infty$, so of course we will have $m_i = \infty$. On the other hand, if $i$ is recurrent, then we shall call $i$ null recurrent if $m_i = \infty$, and shall call $i$ positive recurrent if $m_i < \infty$. (The names come from considering $1/m_i$ rather than $m_i$.)

The main theorem of this subsection is

Theorem 8.4.1. If a Markov chain is irreducible, and if each state $i$ of the Markov chain is positive recurrent with (finite) mean return time $m_i$, then the Markov chain has a unique stationary distribution $\{\pi_i\}$, given by $\pi_i = 1/m_i$ for each state $i$.

This theorem is rather surprising. It is not immediately clear that the mean return times $m_i$ have anything to do with a stationary distribution; it is even less expected that they provide a precise formula for the stationary distribution values. The proof of the theorem relies heavily on the following lemma.

Lemma 8.4.2. Let $G_n(i,j) = \mathbf{E}_i(\#\{\ell;\ 1 \le \ell \le n,\ X_\ell = j\}) = \sum_{\ell=1}^n p_{ij}^{(\ell)}$. Then for an irreducible recurrent Markov chain,
$$\lim_{n\to\infty} \frac{G_n(i,j)}{n} = \frac{1}{m_j}$$
for any states $i$ and $j$.
Proof. Let $T_j^r$ be the time of the $r$th hit of the state $j$. Then $T_j^r = T_j^1 + (T_j^2 - T_j^1) + \ldots + (T_j^r - T_j^{r-1})$; here the $r-1$ terms in brackets are i.i.d. and have mean $m_j$. Hence, from the Strong Law of Large Numbers (second version, Theorem 5.4.4), we have that $\lim_{r\to\infty} T_j^r / r = m_j$ with probability 1.

Now, for $n \in \mathbf{N}$, let $r(n) = \#\{\ell;\ 1 \le \ell \le n,\ X_\ell = j\}$. Then $\lim_{n\to\infty} r(n) = \infty$ with probability 1 by recurrence. Also, clearly $T_j^{r(n)} \le n < T_j^{r(n)+1}$, so that
$$\frac{T_j^{r(n)}}{r(n)} \ \le\ \frac{n}{r(n)} \ <\ \frac{T_j^{r(n)+1}}{r(n)}.$$
Hence, we must have $\lim_{n\to\infty} n/r(n) = m_j$ with probability 1 as well. Therefore, with probability 1 we have $\lim_{n\to\infty} r(n)/n = 1/m_j$. On the other hand, $G_n(i,j) = \mathbf{E}_i(r(n))$, and furthermore $0 \le \frac{r(n)}{n} \le 1$. Hence, by the bounded convergence theorem,
$$\lim_{n\to\infty} \frac{G_n(i,j)}{n} = \lim_{n\to\infty} \mathbf{E}_i\!\left(\frac{r(n)}{n}\right) = 1/m_j,$$
as claimed. ■

Remark 8.4.3. We note that this proof goes through without change in the case $m_j = \infty$, so that $\lim_{n\to\infty} \frac{G_n(i,j)}{n} = 0$ in that case.

Proof of Theorem 8.4.1. We begin by proving the uniqueness. Indeed, suppose that $\sum_i \alpha_i p_{ij} = \alpha_j$ for all states $j$, for some probability distribution $\{\alpha_i\}$. Then, by induction, $\sum_i \alpha_i p_{ij}^{(t)} = \alpha_j$ for all $t \in \mathbf{N}$, so that also $\frac1n \sum_{t=1}^n \sum_i \alpha_i p_{ij}^{(t)} = \alpha_j$. Hence,
$$\alpha_j = \lim_{n\to\infty} \frac1n \sum_{t=1}^n \sum_i \alpha_i p_{ij}^{(t)} = \sum_i \alpha_i \lim_{n\to\infty} \frac1n \sum_{t=1}^n p_{ij}^{(t)} = \sum_i \alpha_i \cdot \frac{1}{m_j} = 1/m_j,$$
where the second equality uses the M-test and the third uses the Lemma. That is, $\alpha_j = 1/m_j$ is the only possible stationary distribution.
Now, fix any state $z$. Then for all $t \in \mathbf{N}$, we have $\sum_j p_{zj}^{(t)} = 1$. Hence, if we set $C = \sum_j 1/m_j$, then using the Lemma and (A.4.9), we have
$$1 = \lim_{n\to\infty} \frac1n \sum_{t=1}^n \sum_j p_{zj}^{(t)} \ \ge\ \sum_j \lim_{n\to\infty} \frac1n \sum_{t=1}^n p_{zj}^{(t)} = \sum_j \frac{1}{m_j} = C.$$
In particular, $C < \infty$.

Again fix any state $z$. Then for all states $j$, and for all $t \in \mathbf{N}$, we have $\sum_i p_{zi}^{(t)} p_{ij} = p_{zj}^{(t+1)}$. Hence, again by the Lemma and (A.4.9), we have
$$\frac{1}{m_j} = \lim_{n\to\infty} \frac1n \sum_{t=1}^n p_{zj}^{(t+1)} = \lim_{n\to\infty} \frac1n \sum_{t=1}^n \sum_i p_{zi}^{(t)}\, p_{ij} \ \ge\ \sum_i \lim_{n\to\infty} \frac1n \sum_{t=1}^n p_{zi}^{(t)}\, p_{ij} = \sum_i \frac{1}{m_i}\, p_{ij}. \qquad (8.4.4)$$
But then summing both sides over all states $j$, and recalling that $\sum_j \frac{1}{m_j} = C < \infty$, we see that we must have equality in (8.4.4), i.e. we must have $\frac{1}{m_j} = \sum_i \frac{1}{m_i} p_{ij}$. But this means that the probability distribution defined by $\pi_i = 1/(C m_i)$ must be a stationary distribution for the Markov chain! Finally, by uniqueness, we must have $\pi_j = 1/m_j$ for each state $j$, i.e. we must have $C = 1$. This completes the proof of the theorem. ■

On the other hand, states which are not positive recurrent cannot contribute to a stationary distribution:

Proposition 8.4.5. If a Markov chain has a stationary distribution $\{\pi_i\}$, and if a state $j$ is not positive recurrent (i.e., satisfies $m_j = \infty$), then we must have $\pi_j = 0$.

Proof. We have that $\sum_i \pi_i p_{ij}^{(t)} = \pi_j$ for all $t \in \mathbf{N}$. Hence, using the M-test and also Lemma 8.4.2 and Remark 8.4.3, we have
$$\pi_j = \lim_{n\to\infty} \frac1n \sum_{t=1}^n \sum_i \pi_i p_{ij}^{(t)} = \sum_i \pi_i \lim_{n\to\infty} \frac1n \sum_{t=1}^n p_{ij}^{(t)} = \sum_i \pi_i \cdot 0 = 0,$$
as claimed. ■
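The identity $\pi_i = 1/m_i$ of Theorem 8.4.1 can also be checked by simulation, estimating each mean return time by averaging excursion lengths. A sketch (ours; the chain is the same toy example used in the earlier sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    P = np.array([[0.0, 0.5, 0.5],
                  [0.3, 0.2, 0.5],
                  [0.4, 0.4, 0.2]])

    def mean_return_time(P, i, trips):
        """Average, over many excursions, the time taken to return to i."""
        total, state = 0, i
        for _ in range(trips):
            steps = 0
            while True:
                state = rng.choice(len(P), p=P[state])
                steps += 1
                if state == i:
                    break
            total += steps
        return total / trips

    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi /= pi.sum()
    for i in range(3):
        print(i, 1 / pi[i], mean_return_time(P, i, trips=20_000))
    # The simulated mean return times m_i match 1/pi_i, as the theorem predicts.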
Corollary 8.4.6. If a Markov chain has no positive recurrent states, then it does not have a stationary distribution.

Proof. Suppose to the contrary that it did have a stationary distribution $\{\pi_i\}$. Then from the above proposition, we would necessarily have $\pi_j = 0$ for each state $j$, contradicting the fact that $\sum_j \pi_j = 1$. ■

Theorem 8.4.1 and Corollary 8.4.6 provide clear information about Markov chains where all states are positive recurrent, or where none are, respectively. One could still wonder about chains which have some positive recurrent and some non-positive-recurrent states. We now show that, for irreducible Markov chains, this cannot happen. The statement is somewhat analogous to that of Lemma 8.3.5.

Proposition 8.4.7. Let $i$ and $j$ be two states of a Markov chain. Suppose that $f_{ij} > 0$ and $f_{ji} > 0$ (i.e., the states $i$ and $j$ communicate). Then if $i$ is positive recurrent, then $j$ is also positive recurrent.

Proof. Find $r, t \in \mathbf{N}$ with $p_{ji}^{(r)} > 0$ and $p_{ij}^{(t)} > 0$. Then by Lemma 8.4.2 and Remark 8.4.3,
$$\frac{1}{m_j} = \lim_{n\to\infty} \frac1n \sum_{m=1}^n p_{jj}^{(m)} \ \ge\ \lim_{n\to\infty} \frac1n \sum_{m=r+t}^n p_{ji}^{(r)}\, p_{ii}^{(m-r-t)}\, p_{ij}^{(t)} \ =\ \frac{p_{ji}^{(r)}\, p_{ij}^{(t)}}{m_i} \ >\ 0,$$
so that $m_j < \infty$. ■

Corollary 8.4.8. For an irreducible Markov chain, either all states are positive recurrent or none are.

Combining this corollary with Theorem 8.4.1 and Corollary 8.4.6, we see that

Theorem 8.4.9. Consider an irreducible Markov chain. Then either
(a) all states are positive recurrent, and there exists a unique stationary distribution, given by $\pi_j = 1/m_j$, to which (assuming aperiodicity) $\mathbf{P}_i(X_n = j)$ converges as $n \to \infty$; or
(b) no states are positive recurrent, and there does not exist a stationary distribution.

For example, consider simple symmetric random walk, with state space the integers $\mathbf{Z}$. It is clear that $m_j$ must be the same for all states $j$ (i.e. no state is any "different" from any other state). Hence, it is impossible that $\sum_{j \in \mathbf{Z}} 1/m_j = 1$. Thus, simple symmetric random walk cannot fall into category (a) above. Hence, simple symmetric random walk falls into category (b), and does not have a stationary distribution; in fact, it is null recurrent.

Finally, we observe that all irreducible Markov chains on finite state spaces necessarily fall into category (a) above:
Proposition 8.4.10. For an irreducible Markov chain on a finite state space, all states are positive recurrent (and hence a unique stationary distribution exists).

Proof. Fix a state $i$. Write $h_j^{(m)} = \mathbf{P}_j(X_t = i \text{ for some } 1 \le t \le m) = \sum_{t=1}^m f_{ji}^{(t)}$. Then $\lim_{m\to\infty} h_j^{(m)} = f_{ji} > 0$, for each state $j$. Hence, since the state space is finite, we can find $m \in \mathbf{N}$ and $\delta > 0$ such that $h_j^{(m)} \ge \delta$ for all states $j$. But then we must have $1 - h_i^{(n)} \le (1-\delta)^{\lfloor n/m \rfloor}$, so that letting $\tau_i = \inf\{n \ge 1;\ X_n = i\}$, we have by Proposition 4.2.9 that
$$m_i = \sum_{n=0}^{\infty} \mathbf{P}_i(\tau_i > n) = \sum_{n=0}^{\infty} \left(1 - h_i^{(n)}\right) \le \sum_{n=0}^{\infty} (1-\delta)^{\lfloor n/m \rfloor} = \frac{m}{\delta} < \infty. \quad ■$$

8.5. Exercises.

Exercise 8.5.1. Consider a discrete-time, time-homogeneous Markov chain with state space $S = \{1,2\}$, and transition probabilities given by $p_{11} = a$, $p_{12} = 1-a$, $p_{21} = 1$, $p_{22} = 0$. For each $0 < a < 1$:
(a) Compute $p_{ij}^{(n)} = \mathbf{P}(X_n = j \mid X_0 = i)$ for each $i, j \in S$ and $n \in \mathbf{N}$.
(b) Classify each state as recurrent or transient.
(c) Find all stationary distributions for this Markov chain.

Exercise 8.5.2. For any $\epsilon > 0$, give an example of an irreducible Markov chain on a countably infinite state space, such that $|p_{ij} - p_{ik}| < \epsilon$ for all states $i$, $j$, and $k$.

Exercise 8.5.3. For an arbitrary Markov chain on a state space consisting of exactly $d$ states, find (with proof) the largest possible positive integer $N$ such that for some states $i$ and $j$, we have $p_{ij}^{(N)} > 0$ but $p_{ij}^{(n)} = 0$ for all $n < N$.

Exercise 8.5.4. Given Markov chain transition probabilities $\{p_{ij}\}_{i,j \in S}$ on a state space $S$, call a subset $C \subseteq S$ closed if $\sum_{j \in C} p_{ij} = 1$ for each $i \in C$. Prove that a Markov chain is irreducible if and only if it has no closed subsets (aside from the empty set and $S$ itself).
Exercise 8.5.5. Suppose we modify Example 8.0.3 so that the chain moves one unit clockwise with probability $r$, or one unit counter-clockwise with probability $1-r$, for some $0 < r < 1$. That is, $S = \{0, 1, 2, \ldots, d-1\}$ and
$$p_{ij} = \begin{cases} r, & j = i + 1 \ (\mathrm{mod}\ d) \\ 1 - r, & j = i - 1 \ (\mathrm{mod}\ d) \\ 0, & \text{otherwise.} \end{cases}$$
Find (with explanation) all stationary distributions of this Markov chain.

Exercise 8.5.6. Consider the Markov chain with state space $S = \{1,2,3\}$ and transition probabilities $p_{12} = p_{23} = p_{31} = 1$. Let $\pi_1 = \pi_2 = \pi_3 = 1/3$.
(a) Determine whether or not the chain is irreducible.
(b) Determine whether or not the chain is aperiodic.
(c) Determine whether or not the chain is reversible with respect to $\{\pi_i\}$.
(d) Determine whether or not $\{\pi_i\}$ is a stationary distribution.
(e) Determine whether or not $\lim_{n\to\infty} p_{11}^{(n)} = 1/3$.

Exercise 8.5.7. Give an example of an irreducible Markov chain, and two distinct states $i$ and $j$, such that $f_{ij}^{(n)} > 0$ for all $n \in \mathbf{N}$, and such that $f_{ij}^{(n)}$ is not a decreasing function of $n$ (i.e. for some $n \in \mathbf{N}$, $f_{ij}^{(n)} < f_{ij}^{(n+1)}$).

Exercise 8.5.8. Prove the identity $f_{ij} = p_{ij} + \sum_{k \ne j} p_{ik} f_{kj}$. [Hint: Condition on $X_1$.]

Exercise 8.5.9. For each of the following transition probability matrices, determine which states are recurrent and which are transient, and also compute $f_{11}$ for each:
(a) (b) /1/2 2/3 0 V 1 / 1 1/2 0 0 1/10. , 0 V 0 1/2 0 0 0 0 0 1/5 0 0 0 0 0 0 \ 0 1/3 4/5 1/5 1 0 0 / 0 0 1/2 0 4/5 0 1/3 1/3 0 0 0 0 0 0 0 0 0 1/3 7/10 0 0 0 0 0 0 0 0 1 0 \ 0 0 0 1/5 1 j 0 /
Exercise 8.5.10. Consider a Markov chain (not necessarily irreducible) on a finite state space.
(a) Prove that at least one state must be recurrent.
(b) Give an example where exactly one state is recurrent (and all the rest are transient).
(c) Show by example that if the state space is countably infinite then part (a) is no longer true.

Exercise 8.5.11. For asymmetric one-dimensional simple random walk (i.e. where $\mathbf{P}(X_n = X_{n-1} + 1) = p = 1 - \mathbf{P}(X_n = X_{n-1} - 1)$ for some $p \ne \frac12$), provide an asymptotic upper bound for $\sum_{n=N}^{\infty} p_{ii}^{(n)}$. That is, find an explicit expression $\gamma_N$, with $\lim_{N\to\infty} \gamma_N = 0$, such that $\sum_{n=N}^{\infty} p_{ii}^{(n)} \le \gamma_N$ for all sufficiently large $N$.

Exercise 8.5.12. Let $P = (p_{ij})$ be the matrix of transition probabilities for a Markov chain on a finite state space.
(a) Prove that $P$ always has 1 as an eigenvalue. [Hint: Recall that the eigenvalues of $P$ are the same whether it acts on row vectors to the left or on column vectors to the right.]
(b) Suppose that $v$ is a row eigenvector for $P$ corresponding to the eigenvalue 1, so that $vP = v$. Does $v$ necessarily correspond to a stationary distribution? Why or why not?

Exercise 8.5.13. Call a Markov chain doubly stochastic if its transition matrix $\{p_{ij}\}_{i,j \in S}$ has the property that $\sum_{i \in S} p_{ij} = 1$ for each $j \in S$. Prove that, for a doubly stochastic Markov chain on a finite state space $S$, the uniform distribution (i.e. $\pi_i = 1/|S|$ for each $i \in S$) is a stationary distribution.

Exercise 8.5.14. Give an example of a Markov chain on a finite state space, such that three of the states each have a different period.

Exercise 8.5.15. Consider Ehrenfest's Urn (Example 8.0.4).
(a) Compute $\mathbf{P}_0(X_n = 0)$ for $n$ odd.
(b) Prove that $\mathbf{P}_0(X_n = 0) \not\to 2^{-d}$ as $n \to \infty$.
(c) Why does this not contradict Theorem 8.3.10?
Exercise 8.5.16. Consider the Markov chain with state space $S = \{1,2,3\}$ and transition probabilities given by
$$(p_{ij}) = \begin{pmatrix} 0 & 2/3 & 1/3 \\ 1/4 & 0 & 3/4 \\ 4/5 & 1/5 & 0 \end{pmatrix}.$$
(a) Find an explicit formula for $\mathbf{P}_1(\tau_1 = n)$ for each $n \in \mathbf{N}$, where $\tau_1 = \inf\{n \ge 1;\ X_n = 1\}$.
(b) Compute the mean return time $m_1 = \mathbf{E}(\tau_1)$.
(c) Prove that this Markov chain has a unique stationary distribution, to be called $\{\pi_i\}$.
(d) Compute the stationary probability $\pi_1$.

Exercise 8.5.17. Give an example of a Markov chain for which some states are positive recurrent, some states are null recurrent, and some states are transient.

Exercise 8.5.18. Prove that if $f_{ij} > 0$ and $f_{ji} = 0$, then $i$ is transient.

Exercise 8.5.19. Prove that for a Markov chain on a finite state space, no states are null recurrent. [Hint: The previous exercise provides a starting point.]

Exercise 8.5.20. (a) Give an example of a Markov chain on a finite state space which has multiple (i.e. two or more) stationary distributions.
(b) Give an example of a reducible Markov chain on a finite state space, which nevertheless has a unique stationary distribution.
(c) Suppose that a Markov chain on a finite state space is decomposable, meaning that the state space can be partitioned as $S = S_1 \cup S_2$, with $S_1$ and $S_2$ non-empty, such that $f_{ij} = f_{ji} = 0$ whenever $i \in S_1$ and $j \in S_2$. Prove that the chain has multiple stationary distributions.
(d) Prove that for a Markov chain as in part (b), some states are transient. [Hint: Exercise 8.5.18 may help.]

8.6. Section summary.

This section gave a lengthy exploration of Markov chains (in discrete time and space). It gave several examples of Markov chains, and proved their existence. It defined the important concepts of transience and recurrence, and proved a number of equivalences of them. For an irreducible Markov chain, either all states are transient or all states are recurrent. The section then discussed stationary distributions. It proved that the transition probabilities of an irreducible, aperiodic Markov chain will converge to the stationary distribution, regardless of the starting point. Finally, it related stationary distributions to the mean return times of the chain's states, proving that (for an irreducible chain) a stationary distribution exists if and only if these mean return times are all finite.
9. More probability theorems.

In this section, we discuss a few probability ideas that we have not needed so far, but which will be important for the more advanced material to come.

9.1. Limit theorems.

Suppose $X, X_1, X_2, \ldots$ are random variables defined on some probability triple $(\Omega, \mathcal{F}, \mathbf{P})$. Suppose further that $\lim_{n\to\infty} X_n(\omega) = X(\omega)$ for each fixed $\omega \in \Omega$ (or at least for all $\omega$ outside a set of probability 0), i.e. that $\{X_n\}$ converges to $X$ almost surely. Does it necessarily follow that $\lim_{n\to\infty} \mathbf{E}(X_n) = \mathbf{E}(X)$?

We already know that this is not the case in general. Indeed, it was not the case for the "double 'til you win" gambling strategy of Subsection 7.3 (where if $0 < p < 1/2$, then $\mathbf{E}(X_n) < a$ for all $n$, even though $\mathbf{E}(\lim X_n) = a + 1$). Or, for a simple counter-example, let $\Omega = \mathbf{N}$, with $\mathbf{P}(\omega) = 2^{-\omega}$ for $\omega \in \Omega$, and let $X_n(\omega) = 2^n \delta_{\omega,n}$ (i.e., $X_n(\omega) = 2^n$ if $\omega = n$, and equals 0 otherwise). Then $\{X_n\}$ converges to 0 with probability 1, but $\mathbf{E}(X_n) = 1 \not\to 0$ as $n \to \infty$.

On the other hand, we already have two results giving conditions under which it is true that $\mathbf{E}(X_n) \to \mathbf{E}(X)$, namely the Monotone Convergence Theorem (Theorem 4.2.2) and the Bounded Convergence Theorem (Theorem 7.3.1). Such limit theorems are sometimes very helpful. In this section, we shall establish two more similar limit theorems, namely the Dominated Convergence Theorem and the Uniformly Integrable Convergence Theorem.

A first key result is

Theorem 9.1.1. (Fatou's Lemma) If $X_n \ge C$ for all $n$, and some constant $C > -\infty$, then
$$\mathbf{E}\left(\liminf_{n\to\infty} X_n\right) \le \liminf_{n\to\infty} \mathbf{E}(X_n).$$
(We allow the possibility that both sides are infinite.)

Proof. Set $Y_n = \inf_{k \ge n} X_k$, and let $Y = \lim_{n\to\infty} Y_n = \liminf_{n\to\infty} X_n$. Then $Y_n \ge C$ and $\{Y_n\} \nearrow Y$, and furthermore $Y_n \le X_n$. Hence, by the order-preserving property and the monotone convergence theorem, we have $\liminf_n \mathbf{E}(X_n) \ge \liminf_n \mathbf{E}(Y_n) = \mathbf{E}(Y)$, as claimed. ■
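Returning to the simple counter-example above, a few lines of Python (ours; the expectation series is truncated, which is harmless here) make the failure of $\mathbf{E}(\lim X_n) = \lim \mathbf{E}(X_n)$ concrete:

    # Under P(omega) = 2^(-omega) on Omega = N, the variable X_n equals 2^n
    # only at omega = n, so E(X_n) = 2^n * 2^(-n) = 1 for every n,
    # even though X_n -> 0 pointwise.
    def expectation(X, n_terms=60):
        """E(X) under P(omega) = 2^(-omega), truncating the series (the
        tail is negligible for the variables used here)."""
        return sum(X(omega) * 2.0 ** (-omega) for omega in range(1, n_terms))

    for n in (1, 5, 20):
        X_n = lambda omega, n=n: 2.0 ** n if omega == n else 0.0
        print(n, expectation(X_n))          # prints 1.0 each time
    print(expectation(lambda omega: 0.0))   # E(lim X_n) = 0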
Remarks.
1. In this theorem, "$\liminf_{n\to\infty} X_n$" is of course interpreted pointwise; that is, its value at $\omega$ is $\liminf_{n\to\infty} X_n(\omega)$.
2. For the "simple counter-example" mentioned above, it is easily verified that $\mathbf{E}(\liminf_{n\to\infty} X_n) = 0$ while $\liminf_{n\to\infty} \mathbf{E}(X_n) = 1$, so the theorem gives $0 \le 1$, which is true. If we replace $X_n$ by $-X_n$ in that example, we instead obtain $0 \le -1$, which is false; however, in that case the $\{X_n\}$ are not bounded below.
3. For the "double 'til you win" gambling strategy of Subsection 7.3, the fortunes $\{X_n\}$ are not bounded below. However, they are bounded above (by $a+1$), so their negatives are bounded below. When the theorem is applied to their negatives, it gives that $-(a+1) \le \liminf \mathbf{E}(-X_n)$, so that $\limsup \mathbf{E}(X_n) \le a+1$, which is certainly true since $X_n \le a+1$ for all $n$. (In fact, if $p \le \frac12$, then $\limsup \mathbf{E}(X_n) \le a$.)

It is now straightforward to prove

Theorem 9.1.2. (The Dominated Convergence Theorem) If $X, X_1, X_2, \ldots$ are random variables, and if $\{X_n\} \to X$ with probability 1, and if there is a random variable $Y$ with $|X_n| \le Y$ for all $n$ and with $\mathbf{E}(Y) < \infty$, then $\lim_{n\to\infty} \mathbf{E}(X_n) = \mathbf{E}(X)$.

Proof. We note that $Y + X_n \ge 0$. Hence, applying Fatou's Lemma to $\{Y + X_n\}$, we obtain that
$$\mathbf{E}(Y) + \mathbf{E}(X) = \mathbf{E}(Y + X) \le \liminf_n \mathbf{E}(Y + X_n) = \mathbf{E}(Y) + \liminf_n \mathbf{E}(X_n).$$
Hence, cancelling the $\mathbf{E}(Y)$ terms (which is where we use the fact that $\mathbf{E}(Y) < \infty$), we see that $\mathbf{E}(X) \le \liminf_n \mathbf{E}(X_n)$. Similarly, $Y - X_n \ge 0$, and applying Fatou's Lemma to $\{Y - X_n\}$, we obtain that $\mathbf{E}(Y) - \mathbf{E}(X) \le \mathbf{E}(Y) + \liminf_n \mathbf{E}(-X_n) = \mathbf{E}(Y) - \limsup_n \mathbf{E}(X_n)$, so that $\mathbf{E}(X) \ge \limsup_n \mathbf{E}(X_n)$. But we always have $\limsup_n \mathbf{E}(X_n) \ge \liminf_n \mathbf{E}(X_n)$. Hence, we must have that $\limsup_n \mathbf{E}(X_n) = \liminf_n \mathbf{E}(X_n) = \mathbf{E}(X)$, as claimed. ■

Remark. Of course, if the random variable $Y$ is constant, then the dominated convergence theorem reduces to the bounded convergence theorem.

For our second new limit theorem, we need a definition. Note that for any random variable $X$ with $\mathbf{E}(|X|) < \infty$ (i.e., with $X$ "integrable"), we have by the dominated convergence theorem that $\lim_{a\to\infty} \mathbf{E}(|X| \mathbf{1}_{|X| > a}) = \mathbf{E}(0) = 0$. Taking a supremum over $n$ makes a collection of random variables uniformly integrable:
Definition 9.1.3. A collection $\{X_n\}$ of random variables is uniformly integrable if
$$\lim_{a\to\infty} \sup_n \mathbf{E}\left(|X_n| \mathbf{1}_{|X_n| > a}\right) = 0. \qquad (9.1.4)$$

Uniform integrability immediately implies boundedness of certain expectations:

Proposition 9.1.5. If $\{X_n\}$ is uniformly integrable, then $\sup_n \mathbf{E}(|X_n|) < \infty$. Furthermore, if also $\{X_n\} \to X$ a.s., then $\mathbf{E}|X| < \infty$.

Proof. Choose $\alpha_1$ so that (say) $\sup_n \mathbf{E}(|X_n| \mathbf{1}_{|X_n| > \alpha_1}) < 1$. Then
$$\sup_n \mathbf{E}(|X_n|) = \sup_n \mathbf{E}\left(|X_n| \mathbf{1}_{|X_n| \le \alpha_1} + |X_n| \mathbf{1}_{|X_n| > \alpha_1}\right) \le \alpha_1 + 1 < \infty.$$
It then follows from Fatou's Lemma that, if $\{X_n\} \to X$ a.s., then $\mathbf{E}(|X|) \le \liminf_n \mathbf{E}(|X_n|) \le \sup_n \mathbf{E}(|X_n|) < \infty$. ■

The main use of uniform integrability is given by:

Theorem 9.1.6. (The Uniform Integrability Convergence Theorem) If $X, X_1, X_2, \ldots$ are random variables, and if $\{X_n\} \to X$ with probability 1, and if $\{X_n\}$ are uniformly integrable, then $\lim_{n\to\infty} \mathbf{E}(X_n) = \mathbf{E}(X)$.

Proof. Let $Y_n = |X_n - X|$, so that $Y_n \to 0$. We shall show that $\mathbf{E}(Y_n) \to 0$; it then follows from the triangle inequality that $|\mathbf{E}(X_n) - \mathbf{E}(X)| \le \mathbf{E}(Y_n) \to 0$, thus proving the theorem. We will consider $\mathbf{E}(Y_n)$ in two pieces, using that $Y_n = Y_n \mathbf{1}_{Y_n \le a} + Y_n \mathbf{1}_{Y_n > a}$.

Let $Y_n^{(a)} = Y_n \mathbf{1}_{Y_n \le a}$. Then for any fixed $a > 0$, we have $|Y_n^{(a)}| \le a$, and also $Y_n^{(a)} \to 0$ as $n \to \infty$, so by the bounded convergence theorem we have
$$\lim_{n\to\infty} \mathbf{E}\left(Y_n^{(a)}\right) = 0, \qquad a > 0. \qquad (9.1.7)$$
For the second piece, we note by the triangle inequality (again) that $Y_n \le |X_n| + |X| \le 2M_n$ where $M_n = \max(|X_n|, |X|)$. Hence, if $Y_n > a$, then we must have $M_n > a/2$, and thus either $|X_n| > a/2$ or $|X| > a/2$. This implies that $Y_n \mathbf{1}_{Y_n > a} \le 2 M_n \mathbf{1}_{M_n > a/2} \le 2|X_n| \mathbf{1}_{|X_n| > a/2} + 2|X| \mathbf{1}_{|X| > a/2}$. Taking expectations and supremums gives
$$\sup_n \mathbf{E}(Y_n \mathbf{1}_{Y_n > a}) \le 2 \sup_n \mathbf{E}\left(|X_n| \mathbf{1}_{|X_n| > a/2}\right) + 2\,\mathbf{E}\left(|X| \mathbf{1}_{|X| > a/2}\right).$$
Now, as $a \to \infty$, the first of these two terms goes to 0 by the uniform integrability assumption; and the second goes to 0 by the dominated convergence theorem (since $\mathbf{E}|X| < \infty$). We conclude that
$$\lim_{a\to\infty} \sup_n \mathbf{E}(Y_n \mathbf{1}_{Y_n > a}) = 0. \qquad (9.1.8)$$
To finish the proof, let $\epsilon > 0$. By (9.1.8), we can find $a_0 > 0$ such that $\sup_n \mathbf{E}(Y_n \mathbf{1}_{Y_n > a_0}) < \epsilon/2$. By (9.1.7), we can find $n_0 \in \mathbf{N}$ such that $\mathbf{E}(Y_n^{(a_0)}) < \epsilon/2$ for all $n \ge n_0$. Then, for any $n \ge n_0$, we have $\mathbf{E}(Y_n) = \mathbf{E}(Y_n^{(a_0)}) + \mathbf{E}(Y_n \mathbf{1}_{Y_n > a_0}) < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon$. Hence, $\mathbf{E}(Y_n) \to 0$, as desired. ■

Remark 9.1.9. While the above limit theorems were all stated for a countable limit of random variables, they apply equally well for continuous limits. For example, suppose $\{X_t\}_{t \ge 0}$ is a continuous-parameter family of random variables, and that $\lim_{t \searrow 0} X_t(\omega) = X_0(\omega)$ for each fixed $\omega \in \Omega$. Suppose further that the family $\{X_t\}_{t \ge 0}$ is dominated (or uniformly integrable) as in the previous limit theorems. Then for any countable sequence of parameters $\{t_n\} \searrow 0$, the theorems say that $\mathbf{E}(X_{t_n}) \to \mathbf{E}(X_0)$. Since this is true for any sequence $\{t_n\} \searrow 0$, it follows that $\lim_{t \searrow 0} \mathbf{E}(X_t) = \mathbf{E}(X_0)$ as well.

9.2. Differentiation of expectation.

A classic question from multivariate calculus asks when we can "differentiate under the integral sign". For example, is it true that
$$\frac{d}{dt} \int_0^1 e^{\omega t}\, d\omega = \int_0^1 \frac{\partial}{\partial t}\, e^{\omega t}\, d\omega\,?$$
The answer is yes, as the following proposition shows. More generally, the proposition considers a family of random variables $\{F_t\}$ (e.g. $F_t(\omega) = e^{\omega t}$), and the derivative (with respect to $t$) of the function $\mathbf{E}(F_t)$ (e.g. of $\int_0^1 e^{\omega t}\, d\omega$).

Proposition 9.2.1. Let $\{F_t\}_{a < t < b}$ be a collection of random variables with finite expectations, defined on some probability triple $(\Omega, \mathcal{F}, \mathbf{P})$. Suppose that for each $\omega$ and each $a < t < b$, the derivative $F_t'(\omega) = \frac{\partial}{\partial t} F_t(\omega)$ exists. Then $F_t'$ is a random variable. Suppose further that there is a random variable $Y$ on $(\Omega, \mathcal{F}, \mathbf{P})$ with $\mathbf{E}(Y) < \infty$, such that $|F_t'| \le Y$ for all $a < t < b$.
Then, if we define $\phi(t) = \mathbf{E}(F_t)$, then $\phi$ is differentiable, with finite derivative $\phi'(t) = \mathbf{E}(F_t')$ for all $a < t < b$.

Proof. To see that $F_t'$ is a random variable, let $t_n = t + \frac1n$. Then $F_t' = \lim_{n\to\infty} n(F_{t_n} - F_t)$, and hence it is the countable limit of random variables, and therefore is itself a random variable by (3.1.6).

Next, note that we always have $\left|\frac{F_{t+h} - F_t}{h}\right| \le Y$. Hence, using the dominated convergence theorem together with Remark 9.1.9, we have
$$\phi'(t) = \lim_{h\to 0} \frac{\phi(t+h) - \phi(t)}{h} = \lim_{h\to 0} \mathbf{E}\left(\frac{F_{t+h} - F_t}{h}\right) = \mathbf{E}\left(\lim_{h\to 0} \frac{F_{t+h} - F_t}{h}\right) = \mathbf{E}(F_t'),$$
and this is finite since $\mathbf{E}(|F_t'|) \le \mathbf{E}(Y) < \infty$. ■

9.3. Moment generating functions and large deviations.

The moment generating function of a random variable $X$ is the function
$$M_X(s) = \mathbf{E}(e^{sX}), \qquad s \in \mathbf{R}.$$
At first glance, it may appear that this function is of little use. However, we shall see that a surprising amount of information about the distribution of $X$ can be obtained from $M_X(s)$.

If $X$ and $Y$ are independent random variables, then $e^{sX}$ and $e^{sY}$ are independent by Proposition 3.2.3, so we have by (4.2.7) that
$$M_{X+Y}(s) = M_X(s)\, M_Y(s), \qquad X, Y \text{ independent.} \qquad (9.3.1)$$
Clearly, we always have $M_X(0) = 1$. However, we may have $M_X(s) = \infty$ for certain $s \ne 0$. For example, if $\mathbf{P}[X = m] = c/m^2$ for all integers $m \ne 0$, where $c = (\sum_{j \ne 0} j^{-2})^{-1}$, then $M_X(s) = \infty$ for all $s \ne 0$. On the other hand, if $X \sim N(0,1)$, then completing the square gives that
$$M_X(s) = \int_{-\infty}^{\infty} e^{sx} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx = e^{s^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(x-s)^2/2}\, dx = e^{s^2/2}\,(1) = e^{s^2/2}. \qquad (9.3.2)$$
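Formula (9.3.2) is easily tested by Monte Carlo; the following sketch (ours) compares sample averages of $e^{sX}$ against $e^{s^2/2}$:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(1_000_000)

    # Monte Carlo estimate of M_X(s) = E(e^{sX}) for X ~ N(0,1),
    # compared against the exact value e^{s^2/2} from (9.3.2).
    for s in (0.0, 0.5, 1.0, 2.0):
        print(s, np.mean(np.exp(s * x)), np.exp(s ** 2 / 2))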
A key property of moment generating functions, at least when they are finite in a neighbourhood of 0, is the following.

Theorem 9.3.3. Let $X$ be a random variable such that $M_X(s) < \infty$ for $|s| < s_0$, for some $s_0 > 0$. Then $\mathbf{E}|X^n| < \infty$ for all $n$, and $M_X(s)$ is analytic for $|s| < s_0$, with
$$M_X(s) = \sum_{n=0}^{\infty} \mathbf{E}(X^n)\, s^n / n!\,.$$
In particular, the $r$th derivative at $s = 0$ is given by $M_X^{(r)}(0) = \mathbf{E}(X^r)$.

Proof. The idea of the proof is that
$$M_X(s) = \mathbf{E}(e^{sX}) = \mathbf{E}\left(1 + (sX) + (sX)^2/2! + \ldots\right) = 1 + s\,\mathbf{E}(X) + \frac{s^2}{2}\,\mathbf{E}(X^2) + \ldots\,.$$
However, the final equality requires justification. For this, fix $s$ with $|s| < s_0$, and let $Z_n = 1 + (sX) + (sX)^2/2! + \ldots + (sX)^n/n!$. We have to show that $\mathbf{E}(\lim_{n\to\infty} Z_n) = \lim_{n\to\infty} \mathbf{E}(Z_n)$. Now, for all $n \in \mathbf{N}$,
$$|Z_n| \le 1 + |sX| + |sX|^2/2! + \ldots + |sX|^n/n! \le 1 + |sX| + |sX|^2/2! + \ldots = e^{|sX|} \le e^{sX} + e^{-sX} \equiv Y.$$
Since $|s| < s_0$, we have $\mathbf{E}(Y) = M_X(s) + M_X(-s) < \infty$. Hence, by the dominated convergence theorem, $\mathbf{E}(\lim_{n\to\infty} Z_n) = \lim_{n\to\infty} \mathbf{E}(Z_n)$. ■

Remark. Theorem 9.3.3 says that the $r$th derivative of $M_X$ at 0 equals the $r$th moment of $X$ (thus explaining the terminology "moment generating function"). For example, $M_X(0) = 1$, $M_X'(0) = \mathbf{E}(X)$, $M_X''(0) = \mathbf{E}(X^2)$, etc. This result also follows since $(\frac{d}{ds})^r \mathbf{E}(e^{sX}) = \mathbf{E}\left((\frac{d}{ds})^r e^{sX}\right) = \mathbf{E}(X^r e^{sX})$, where the exchange of derivative and expectation can be justified (for $|s| < s_0$) using Proposition 9.2.1 and induction.

We now consider the subject of large deviations. If $X_1, X_2, \ldots$ are i.i.d. with common mean $m$ and finite variance $v$, then it follows from Chebychev's inequality (as in the proof of the Weak Law of Large Numbers) that for all $\epsilon > 0$,
$$\mathbf{P}\left(\frac{X_1 + \ldots + X_n}{n} \ge m + \epsilon\right) \le v / n\epsilon^2,$$
which converges to 0 as $n \to \infty$. But how quickly is this limit reached? Does the probability really just decrease as $O(1/n)$, or does it decrease faster? In fact, if the moment generating functions are finite in a neighbourhood of 0, then the convergence is exponentially fast:

Theorem 9.3.4. Suppose $X_1, X_2, \ldots$ are i.i.d. with common mean $m$, and with $M_{X_i}(s) < \infty$ for $-a < s < b$ where $a, b > 0$. Then for all $\epsilon > 0$,
$$\mathbf{P}\left(\frac{X_1 + \ldots + X_n}{n} \ge m + \epsilon\right) \le \rho^n,$$
where $\rho = \inf_{0 < s < b}\left(e^{-s(m+\epsilon)} M_{X_1}(s)\right) < 1$.

This theorem gives an exponentially small upper bound on the probability that the average of the $X_i$ exceeds its mean by at least $\epsilon$. This is a (very simple) example of a large deviations result, and shows that in this case the probability is decreasing to zero exponentially quickly, much faster than just $O(1/n)$.

To prove Theorem 9.3.4, we begin with a lemma:

Lemma 9.3.5. Let $Z$ be a random variable with $\mathbf{E}(Z) < 0$, such that $M_Z(s) < \infty$ for $-a < s < b$, for some $a, b > 0$. Then $\mathbf{P}(Z \ge 0) \le \rho$, where $\rho = \inf_{0 < s < b} M_Z(s) < 1$.

Proof. For any $0 < s < b$, since the function $x \mapsto e^{sx}$ is increasing, Markov's inequality implies that
$$\mathbf{P}(Z \ge 0) = \mathbf{P}(e^{sZ} \ge 1) \le \frac{\mathbf{E}(e^{sZ})}{1} = M_Z(s).$$
Hence, taking the infimum over $0 < s < b$,
$$\mathbf{P}(Z \ge 0) \le \inf_{0 < s < b} M_Z(s) = \rho.$$
Furthermore, since $M_Z(0) = 1$ and $M_Z'(0) = \mathbf{E}(Z) < 0$, we must have that $M_Z(s) < 1$ for all positive $s$ sufficiently close to 0. In particular, $\rho < 1$. ■

Remark. In Lemma 9.3.5, we need $a > 0$ to ensure that $M_Z'(0) = \mathbf{E}(Z)$. However, the precise value of $a$ is unimportant.

Proof of Theorem 9.3.4. Let $Y_i = X_i - m - \epsilon$, so $\mathbf{E}(Y_i) = -\epsilon < 0$. Then for $-a < s < b$, $M_{Y_1}(s) = \mathbf{E}(e^{sY_1}) = e^{-s(m+\epsilon)}\mathbf{E}(e^{sX_1}) = e^{-s(m+\epsilon)} M_{X_1}(s) < \infty$. Using Lemma 9.3.5 and (9.3.1), we have
$$\mathbf{P}\left(\frac{X_1 + \ldots + X_n}{n} \ge m + \epsilon\right) = \mathbf{P}\left(\frac{Y_1 + \ldots + Y_n}{n} \ge 0\right) = \mathbf{P}(Y_1 + \ldots + Y_n \ge 0) \le \inf_{0 < s < b} M_{Y_1 + \ldots + Y_n}(s) = \inf_{0 < s < b} \left(M_{Y_1}(s)\right)^n = \rho^n,$$
where $\rho = \inf_{0 < s < b} M_{Y_1}(s) = \inf_{0 < s < b}\left(e^{-s(m+\epsilon)} M_{X_1}(s)\right)$. Furthermore, from Lemma 9.3.5, $\rho < 1$. ■
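For a concrete instance of Theorem 9.3.4, take $X_i = \pm 1$ with probability $\frac12$ each, so that $m = 0$ and $M_{X_1}(s) = \cosh(s)$. The sketch below (ours; the infimum over $s$ is taken on a grid) computes the bound $\rho^n$ and compares it with a simulated tail probability:

    import numpy as np

    eps, n = 0.1, 400
    s_grid = np.linspace(1e-3, 2, 2000)
    rho = np.min(np.exp(-s_grid * eps) * np.cosh(s_grid))
    print("rho =", rho, "   bound rho^n =", rho ** n)

    rng = np.random.default_rng(2)
    steps = 2 * rng.integers(0, 2, size=(50_000, n), dtype=np.int8) - 1
    sims = steps.mean(axis=1)
    print("simulated P(mean >= eps) =", np.mean(sims >= eps))
    # The simulated probability indeed falls below the exponential bound.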
9.4. Fubini's Theorem and convolution.

In multivariate calculus, an iterated integral like $\int_0^1 \int_0^1 x^2 y^3\, dx\, dy$ can be computed in three different ways: integrate first $x$ and then $y$, or integrate first $y$ and then $x$, or compute a two-dimensional "double integral" over the full two-dimensional region. It is well known that, under mild conditions, these three different integrals will all be equal. A generalisation of this is Fubini's Theorem, which allows us to compute expectations with respect to product measure in terms of an "iterated integral", where we integrate first with respect to one variable and then with respect to the other, in either order. (For non-negative $f$, the theorem is sometimes referred to as Tonelli's theorem.)

Theorem 9.4.1. (Fubini's Theorem) Let $\mu$ be a probability measure on $\mathcal{X}$, and $\nu$ a probability measure on $\mathcal{Y}$, and let $\mu \times \nu$ be product measure on $\mathcal{X} \times \mathcal{Y}$. If $f : \mathcal{X} \times \mathcal{Y} \to \mathbf{R}$ is measurable with respect to $\mu \times \nu$, then
$$\int_{\mathcal{X} \times \mathcal{Y}} f\, d(\mu \times \nu) = \int_{\mathcal{X}} \left( \int_{\mathcal{Y}} f(x,y)\, \nu(dy) \right) \mu(dx) = \int_{\mathcal{Y}} \left( \int_{\mathcal{X}} f(x,y)\, \mu(dx) \right) \nu(dy), \qquad (9.4.2)$$
provided that either $\int_{\mathcal{X} \times \mathcal{Y}} f^+\, d(\mu \times \nu) < \infty$ or $\int_{\mathcal{X} \times \mathcal{Y}} f^-\, d(\mu \times \nu) < \infty$ (or both). [This is guaranteed if, for example, $f \ge C > -\infty$, or $f \le C < \infty$, or $\int_{\mathcal{X} \times \mathcal{Y}} |f|\, d(\mu \times \nu) < \infty$. Note that we allow that the inner integrals (i.e., the integrals inside the brackets) may be infinite or even undefined on a set of probability 0.]

The proof of Theorem 9.4.1 requires one technical lemma:

Lemma 9.4.3. The mapping $E \mapsto \int_{\mathcal{Y}} \left( \int_{\mathcal{X}} \mathbf{1}_E(x,y)\, \mu(dx) \right) \nu(dy)$ is a well-defined, countably additive function of measurable subsets $E$.

Proof (optional). For $E \subseteq \mathcal{X} \times \mathcal{Y}$, and $y \in \mathcal{Y}$, let $S_y(E) = \{x \in \mathcal{X};\ (x,y) \in E\}$. We first argue that, for any $y \in \mathcal{Y}$ and any set $E$ which is measurable with respect to $\mu \times \nu$, the set $S_y(E)$ is measurable with respect to $\mu$. Indeed, this is certainly true if $E = A \times B$ with $A$ and $B$ measurable, for then $S_y(E)$ is always either $A$ or $\emptyset$. On the other hand, since $S_y$ preserves set operations (i.e., $S_y(E_1 \cup E_2) = S_y(E_1) \cup S_y(E_2)$, $S_y(E^c) = S_y(E)^c$, etc.), the collection of sets $E$ for which $S_y(E)$ is measurable is a $\sigma$-algebra. Furthermore, the measurable rectangles $A \times B$ generate the product measure's entire $\sigma$-algebra. Hence, for any measurable set $E$, $S_y(E)$ is a measurable subset of $\mathcal{X}$, so that $\mu(S_y(E))$ is well-defined.

Next, consider the collection $\mathcal{G}$ of those measurable subsets $E \subseteq \mathcal{X} \times \mathcal{Y}$ for which $\mu(S_y(E))$ is a measurable function of $y \in \mathcal{Y}$. This $\mathcal{G}$ certainly contains all the measurable rectangles $A \times B$, since then $\mu(S_y(E)) = \mathbf{1}_B(y)\,\mu(A)$, which is measurable.
Also, $\mathcal{G}$ is closed under complements, since $\mu(S_y(E^c)) = \mu(S_y(E)^c) = 1 - \mu(S_y(E))$. It is also closed under countable disjoint unions, since if $\{E_n\}$ are disjoint, then so are $\{S_y(E_n)\}$, and thus $\mu(S_y(\cup_n E_n)) = \mu(\cup_n S_y(E_n)) = \sum_n \mu(S_y(E_n))$. From these facts, it follows (formally, from the "$\pi$-$\lambda$ theorem", e.g. Billingsley, 1995, p. 42) that $\mathcal{G}$ includes all measurable subsets $E$, i.e. that $\mu(S_y(E))$ is a measurable function of $y$ for any measurable set $E$. Thus, integrals like $\int_{\mathcal{Y}} \mu(S_y(E))\, \nu(dy)$ are well-defined. Since $\int_{\mathcal{Y}} \left( \int_{\mathcal{X}} \mathbf{1}_E(x,y)\, \mu(dx) \right) \nu(dy) = \int_{\mathcal{Y}} \mu(S_y(E))\, \nu(dy)$, the claim about being well-defined follows.

To prove countable additivity, let $\{E_n\}$ be disjoint. Then, as above, $\mu(S_y(\cup_n E_n)) = \sum_n \mu(S_y(E_n))$. Countable additivity then follows from countable linearity of expectations with respect to $\nu$. ■

Proof of Theorem 9.4.1. We first consider the case where $f = \mathbf{1}_E$ is an indicator function of a measurable set $E \subseteq \mathcal{X} \times \mathcal{Y}$. By Lemma 9.4.3, the mapping $E \mapsto \int_{\mathcal{Y}} \left( \int_{\mathcal{X}} \mathbf{1}_E(x,y)\, \mu(dx) \right) \nu(dy)$ is a well-defined, countably additive function of subsets $E$. When $E = A \times B$, this integral clearly equals $\mu(A)\nu(B) = (\mu \times \nu)(E)$. Hence, by uniqueness of extensions (Proposition 2.5.8), we must have $\int_{\mathcal{Y}} \left( \int_{\mathcal{X}} \mathbf{1}_E(x,y)\, \mu(dx) \right) \nu(dy) = (\mu \times \nu)(E)$ for any measurable subset $E$. Similarly, $\int_{\mathcal{X}} \left( \int_{\mathcal{Y}} \mathbf{1}_E(x,y)\, \nu(dy) \right) \mu(dx) = (\mu \times \nu)(E)$, so that (9.4.2) holds whenever $f = \mathbf{1}_E$.

We complete the proof by our usual arguments. Indeed, by linearity, (9.4.2) holds whenever $f$ is a simple function. Then, by the monotone convergence theorem, (9.4.2) holds for general measurable non-negative $f$. Finally, again by linearity since $f = f^+ - f^-$, we see that (9.4.2) holds for general functions $f$, as long as we avoid the $\infty - \infty$ case where $\int_{\mathcal{X} \times \mathcal{Y}} f^+\, d(\mu \times \nu) = \int_{\mathcal{X} \times \mathcal{Y}} f^-\, d(\mu \times \nu) = \infty$ (while still allowing that the inner integrals may be $\infty - \infty$ on a set of probability 0). ■

Remark. If $\int_{\mathcal{X} \times \mathcal{Y}} f^+\, d(\mu \times \nu) = \int_{\mathcal{X} \times \mathcal{Y}} f^-\, d(\mu \times \nu) = \infty$, then Fubini's Theorem might not hold; see Exercises 9.5.13 and 9.5.14.

By additivity, Fubini's Theorem still holds if $\mu$ and/or $\nu$ are $\sigma$-finite measures (cf. Remark 4.4.3), rather than just probability measures. In particular, letting $\mu = \mathbf{P}$ and $\nu$ be counting measure on $\mathbf{N}$ leads to the following generalisation of countable linearity (essentially a re-statement of Exercise 4.5.14(a)):

Corollary 9.4.4. Let $Z_1, Z_2, \ldots$ be random variables with $\sum_i \mathbf{E}|Z_i| < \infty$. Then $\mathbf{E}(\sum_i Z_i) = \sum_i \mathbf{E}(Z_i)$.
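Equation (9.4.2) can be illustrated numerically by discretising both coordinates. In the sketch below (ours), the two iterated Riemann sums for $f(x,y) = x^2 y^3$ on $[0,1]^2$ agree with the double integral $\frac13 \cdot \frac14 = \frac1{12}$:

    import numpy as np

    n = 2000
    grid = (np.arange(n) + 0.5) / n                 # midpoints of [0,1]
    f = grid[:, None] ** 2 * grid[None, :] ** 3     # f(x,y) = x^2 * y^3

    inner_y_first = f.mean(axis=1).mean()   # integrate over y, then over x
    inner_x_first = f.mean(axis=0).mean()   # integrate over x, then over y
    print(inner_y_first, inner_x_first, 1 / 12)     # all ~ 0.083333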
As an application of Fubini's Theorem, we consider the convolution formula, about sums of independent random variables.

Theorem 9.4.5. Suppose $X$ and $Y$ are independent random variables with distributions $\mu = \mathcal{L}(X)$ and $\nu = \mathcal{L}(Y)$. Then the distribution of $X + Y$ is given by $\mu * \nu$, where
$$(\mu * \nu)(H) = \int_{\mathbf{R}} \mu(H - y)\, \nu(dy), \qquad H \subseteq \mathbf{R},$$
with $H - y = \{h - y;\ h \in H\}$. Furthermore, if $\mu$ has density $f$ and $\nu$ has density $g$ (with respect to $\lambda =$ Lebesgue measure on $\mathbf{R}$), then $\mu * \nu$ has density $f * g$, where
$$(f * g)(x) = \int_{\mathbf{R}} f(x-y)\, g(y)\, \lambda(dy), \qquad x \in \mathbf{R}.$$

Proof. Since $X$ and $Y$ are independent, we know that $\mathcal{L}((X,Y)) = \mu \times \nu$, i.e. the distribution of the ordered pair $(X,Y)$ is equal to product measure. Given a Borel subset $H \subseteq \mathbf{R}$, let $B = \{(x,y) \in \mathbf{R}^2;\ x + y \in H\}$. Then, using Fubini's Theorem, we have
$$\mathbf{P}(X + Y \in H) = \mathbf{P}((X,Y) \in B) = (\mu \times \nu)(B) = \int_{\mathbf{R}} \left( \int_{\mathbf{R}} \mathbf{1}_B(x,y)\, \mu(dx) \right) \nu(dy) = \int_{\mathbf{R}} \mu(H - y)\, \nu(dy),$$
so $\mathcal{L}(X+Y) = \mu * \nu$.

If $\mu$ has density $f$ and $\nu$ has density $g$, then using Proposition 6.2.3, shift invariance of Lebesgue measure as in (1.2.5), and Fubini's Theorem again,
$$\begin{aligned} (\mu * \nu)(H) &= \int_{\mathbf{R}} \mu(H-y)\, \nu(dy) = \int_{\mathbf{R}} \left( \int_{H-y} f(x)\, \lambda(dx) \right) g(y)\, \lambda(dy) \\ &= \int_{\mathbf{R}} \left( \int_H f(x-y)\, \lambda(dx) \right) g(y)\, \lambda(dy) = \int_H \left( \int_{\mathbf{R}} f(x-y)\, g(y)\, \lambda(dy) \right) \lambda(dx) = \int_H (f * g)(x)\, \lambda(dx), \end{aligned}$$
so $\mu * \nu$ has density given by $f * g$. ■
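The density formula $f * g$ is also easy to evaluate numerically. In the sketch below (ours), convolving two Uniform$[0,1]$ densities recovers the triangular density on $[0,2]$, which equals $x$ for $0 \le x \le 1$ and $2-x$ for $1 \le x \le 2$:

    import numpy as np

    dy = 0.001
    y = np.arange(0.0, 1.0, dy)

    def f(t):
        """Density of Uniform[0,1]."""
        return ((t >= 0) & (t <= 1)).astype(float)

    for x in (0.25, 0.5, 1.0, 1.5):
        conv = np.sum(f(x - y) * f(y)) * dy   # Riemann sum of f(x-y)g(y) dy
        exact = x if x <= 1 else 2 - x
        print(x, conv, exact)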
9.5. Exercises.

Exercise 9.5.1. For the "simple counter-example" with $\Omega = \mathbf{N}$, $\mathbf{P}(\omega) = 2^{-\omega}$ for $\omega \in \Omega$, and $X_n(\omega) = 2^n \delta_{\omega,n}$, verify explicitly that the hypotheses of each of the Monotone Convergence Theorem, the Bounded Convergence Theorem, the Dominated Convergence Theorem, and the Uniform Integrability Convergence Theorem are all violated.

Exercise 9.5.2. Give an example of a sequence of random variables which is unbounded but still uniformly integrable. For bonus points, make the sequence also be undominated, i.e. violate the hypothesis of the Dominated Convergence Theorem.

Exercise 9.5.3. Let $X, X_1, X_2, \ldots$ be non-negative random variables, defined jointly on some probability triple $(\Omega, \mathcal{F}, \mathbf{P})$, each having finite expected value. Assume that $\lim_{n\to\infty} X_n(\omega) = X(\omega)$ for all $\omega \in \Omega$. For $n, K \in \mathbf{N}$, let $Y_{n,K} = \min(X_n, K)$. For each of the following statements, either prove it must be true, or provide a counter-example to show it is sometimes false.
(a) $\lim_{K\to\infty} \lim_{n\to\infty} \mathbf{E}(Y_{n,K}) = \mathbf{E}(X)$.
(b) $\lim_{n\to\infty} \lim_{K\to\infty} \mathbf{E}(Y_{n,K}) = \mathbf{E}(X)$.

Exercise 9.5.4. Suppose that $\lim_{n\to\infty} X_n(\omega) = 0$ for all $\omega \in \Omega$, but $\lim_{n\to\infty} \mathbf{E}[X_n] \ne 0$. Prove that $\mathbf{E}(\sup_n |X_n|) = \infty$.

Exercise 9.5.5. Suppose $\sup_n \mathbf{E}(|X_n|^r) < \infty$ for some $r > 1$. Prove that $\{X_n\}$ is uniformly integrable. [Hint: If $|X_n(\omega)| > a > 0$, then $|X_n(\omega)| < |X_n(\omega)|^r / a^{r-1}$.]

Exercise 9.5.6. Prove that Theorem 9.1.6 implies Theorem 9.1.2. [Hint: Suppose $|X_n| \le Y$ where $\mathbf{E}(Y) < \infty$. Prove that $\{X_n\}$ satisfies (9.1.4).]

Exercise 9.5.7. Prove that Theorem 9.1.2 implies Theorem 4.2.2, assuming that $\mathbf{E}|X| < \infty$. [Hint: Suppose $\{X_n\} \nearrow X$ where $\mathbf{E}|X| < \infty$. Prove that $\{X_n\}$ is dominated.]

Exercise 9.5.8. Let $\Omega = \{1,2\}$, with $\mathbf{P}(\{1\}) = \mathbf{P}(\{2\}) = \frac12$, and let $F_t(1) = t^2$ and $F_t(2) = t^4$ for $0 < t < 1$.
(a) What does Proposition 9.2.1 conclude in this case?
(b) In light of the above, what rule from calculus is implied by Proposition 9.2.1?
Exercise 9.5.9. Let $X_1, X_2, \ldots$ be i.i.d., each with $\mathbf{P}(X_i = 1) = \mathbf{P}(X_i = -1) = \frac12$.
(a) Compute the moment generating functions $M_{X_i}(s)$.
(b) Use Theorem 9.3.4 to obtain an exponentially-decreasing upper bound on $\mathbf{P}\left(\frac1n(X_1 + \ldots + X_n) \ge 0.1\right)$.

Exercise 9.5.10. Let $X_1, X_2, \ldots$ be i.i.d., each having the standard normal distribution $N(0,1)$. Use Theorem 9.3.4 to obtain an exponentially-decreasing upper bound on $\mathbf{P}\left(\frac1n(X_1 + \ldots + X_n) \ge 0.1\right)$. [Hint: Don't forget (9.3.2).]

Exercise 9.5.11. Let $X$ have the distribution Exponential(5), with density $f_X(x) = 5e^{-5x}$ for $x \ge 0$ (with $f_X(x) = 0$ for $x < 0$).
(a) Compute the moment generating function $M_X(t)$.
(b) Use $M_X(t)$ to compute (with explanation) the expected value $\mathbf{E}(X)$.

Exercise 9.5.12. Let $\alpha > 2$, and let $M(t) = e^{-|t|^{\alpha}}$ for $t \in \mathbf{R}$. Prove that $M(t)$ is not a characteristic function of any probability distribution. [Hint: Consider $M''(t)$.]

Exercise 9.5.13. Let $\mathcal{X} = \mathcal{Y} = \mathbf{N}$, and let $\mu\{n\} = \nu\{n\} = 2^{-n}$ for $n \in \mathbf{N}$. Define $f : \mathcal{X} \times \mathcal{Y} \to \mathbf{R}$ by $f(n,n) = (4^n - 1)$ and $f(n, n+1) = -2(4^n - 1)$, with $f(n,m) = 0$ otherwise.
(a) Compute $\int_{\mathcal{X}} \left( \int_{\mathcal{Y}} f(x,y)\, \nu(dy) \right) \mu(dx)$.
(b) Compute $\int_{\mathcal{Y}} \left( \int_{\mathcal{X}} f(x,y)\, \mu(dx) \right) \nu(dy)$.
(c) Why does the result not contradict Fubini's Theorem?

Exercise 9.5.14. Let $\lambda$ be Lebesgue measure on $[0,1]$, and let $f(x,y) = 8xy(x^2 - y^2)(x^2 + y^2)^{-3}$ for $(x,y) \ne (0,0)$, with $f(0,0) = 0$.
(a) Compute $\int_0^1 \left( \int_0^1 f(x,y)\, \lambda(dy) \right) \lambda(dx)$. [Hint: Make the substitution $u = x^2 + y^2$, $v = x$, so $du = 2y\, dy$, $dv = dx$, and $x^2 - y^2 = 2v^2 - u$.]
(b) Compute $\int_0^1 \left( \int_0^1 f(x,y)\, \lambda(dx) \right) \lambda(dy)$.
(c) Why does the result not contradict Fubini's Theorem?

Exercise 9.5.15. Let $X \sim \text{Poisson}(a)$ and $Y \sim \text{Poisson}(b)$ be independent. Let $Z = X + Y$. Use the convolution formula to compute $\mathbf{P}(Z = z)$ for all $z \in \mathbf{R}$, and prove that $Z \sim \text{Poisson}(a+b)$.

Exercise 9.5.16. Let $X \sim N(a,v)$ and $Y \sim N(b,w)$ be independent. Let $Z = X + Y$. Use the convolution formula to prove that $Z \sim N(a+b, v+w)$.
Exercise 9.5.17. For $\alpha, \beta > 0$, the Gamma($\alpha, \beta$) distribution has density function $f(x) = \beta^{\alpha} x^{\alpha-1} e^{-x\beta} / \Gamma(\alpha)$ for $x > 0$ (with $f(x) = 0$ for $x \le 0$), where $\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1} e^{-t}\, dt$ is the gamma function. (Hence, when $\alpha = 1$, Gamma(1, $\beta$) = Exp($\beta$).) Let $X \sim$ Gamma($\alpha, \beta$) and $Y \sim$ Gamma($\gamma, \beta$) be independent, and let $Z = X + Y$. Use the convolution formula to prove that $Z \sim$ Gamma($\alpha + \gamma, \beta$). [Note: You may use the facts that $\Gamma(\alpha+1) = \alpha\,\Gamma(\alpha)$, and $\Gamma(n) = (n-1)!$ for $n \in \mathbf{N}$, and
$$\int_0^x t^{r-1}(x-t)^{s-1}\, dt = x^{r+s-1}\, \frac{\Gamma(r)\,\Gamma(s)}{\Gamma(r+s)}$$
for $r, s, x > 0$.]

9.6. Section summary.

This section presented various probability results that will be required for the more advanced portions of the text. First, the Dominated Convergence Theorem and the Uniform Integrability Convergence Theorem were proved, to extend the Monotone Convergence Theorem and the Bounded Convergence Theorem studied previously, providing further conditions under which $\lim \mathbf{E}(X_n) = \mathbf{E}(\lim X_n)$. Second, a result was given allowing derivative and expectation operators to be exchanged. Third, moment generating functions were introduced, and some of their basic properties were studied. Finally, Fubini's Theorem for iterated integration was proved, and applied to give a convolution formula for the distribution of a sum of independent random variables.
10. Weak convergence.

Given Borel probability distributions $\mu, \mu_1, \mu_2, \ldots$ on $\mathbf{R}$, we shall write $\mu_n \Rightarrow \mu$, and say that $\{\mu_n\}$ converges weakly to $\mu$, if $\int_{\mathbf{R}} f\, d\mu_n \to \int_{\mathbf{R}} f\, d\mu$ for all bounded continuous functions $f : \mathbf{R} \to \mathbf{R}$.

This is a rather natural definition, though we draw the reader's attention to the fact that this convergence need hold only for continuous functions $f$ (as opposed to all Borel-measurable $f$; cf. Proposition 3.1.8). That is, the "topology" of $\mathbf{R}$ is being used here, not just its measure-theoretic properties. (In fact, weak convergence corresponds to the weak* ("weak-star") topology from functional analysis, with $X$ the set of all continuous functions on $\mathbf{R}$ vanishing at infinity (cf. Exercise 10.3.8), with norm defined by $\|f\| = \sup_{x \in \mathbf{R}} |f(x)|$, and with dual space $X^*$ consisting of all finite signed Borel measures on $\mathbf{R}$. The Helly Selection Principle below then follows from Alaoglu's Theorem. See e.g. pages 161-2, 205, and 216 of Folland (1984).)

10.1. Equivalences of weak convergence.

We now present a number of equivalences of weak convergence (see also Exercise 10.3.8). For condition (2), recall that the boundary of a set $A \subseteq \mathbf{R}$ is $\partial A = \{x \in \mathbf{R};\ \forall \epsilon > 0,\ A \cap (x-\epsilon, x+\epsilon) \ne \emptyset,\ A^c \cap (x-\epsilon, x+\epsilon) \ne \emptyset\}$.

Theorem 10.1.1. The following are equivalent:
(1) $\mu_n \Rightarrow \mu$ (i.e., $\{\mu_n\}$ converges weakly to $\mu$);
(2) $\mu_n(A) \to \mu(A)$ for all measurable sets $A$ such that $\mu(\partial A) = 0$;
(3) $\mu_n((-\infty, x]) \to \mu((-\infty, x])$ for all $x \in \mathbf{R}$ such that $\mu\{x\} = 0$;
(4) (Skorohod's Theorem) there are random variables $Y, Y_1, Y_2, \ldots$ defined jointly on some probability triple, with $\mathcal{L}(Y) = \mu$ and $\mathcal{L}(Y_n) = \mu_n$ for each $n \in \mathbf{N}$, such that $Y_n \to Y$ with probability 1;
(5) $\int_{\mathbf{R}} f\, d\mu_n \to \int_{\mathbf{R}} f\, d\mu$ for all bounded Borel-measurable functions $f : \mathbf{R} \to \mathbf{R}$ such that $\mu(D_f) = 0$, where $D_f$ is the set of points where $f$ is discontinuous.

Proof. (5) $\Rightarrow$ (1): Immediate.

(5) $\Rightarrow$ (2): This follows by setting $f = \mathbf{1}_A$, so that $D_f = \partial A$, and $\mu(D_f) = \mu(\partial A) = 0$. Then $\mu_n(A) = \int f\, d\mu_n \to \int f\, d\mu = \mu(A)$.

(2) $\Rightarrow$ (3): Immediate, since the boundary of $(-\infty, x]$ is $\{x\}$.

(1) $\Rightarrow$ (3): Let $\epsilon > 0$, and let $f$ be the function defined by $f(t) = 1$ for $t \le x$, $f(t) = 0$ for $t \ge x + \epsilon$, with $f$ linear on the interval $(x, x+\epsilon)$ (see Figure 10.1.2(a)). Then $f$ is continuous, with $\mathbf{1}_{(-\infty,x]} \le f \le \mathbf{1}_{(-\infty,x+\epsilon]}$. Hence,
$$\limsup_{n\to\infty} \mu_n((-\infty,x]) \le \limsup_{n\to\infty} \int f\, d\mu_n = \int f\, d\mu \le \mu((-\infty, x+\epsilon]).$$
This is true for any $\epsilon > 0$, so we conclude that $\limsup_{n\to\infty} \mu_n((-\infty,x]) \le \mu((-\infty,x])$.

[Figure 10.1.2. Functions used in the proof of Theorem 10.1.1: (a) the function $f$, equal to 1 up to $x$ and decreasing linearly to 0 on $(x, x+\epsilon)$; (b) the function $g$, equal to 1 up to $x-\epsilon$ and decreasing linearly to 0 on $(x-\epsilon, x)$.]

Similarly, if we let $g$ be the function defined by $g(t) = 1$ for $t \le x - \epsilon$, $g(t) = 0$ for $t \ge x$, with $g$ linear on the interval $(x-\epsilon, x)$ (see Figure 10.1.2(b)), then $\mathbf{1}_{(-\infty,x-\epsilon]} \le g \le \mathbf{1}_{(-\infty,x]}$, and we obtain that
$$\liminf_{n\to\infty} \mu_n((-\infty,x]) \ge \liminf_{n\to\infty} \int g\, d\mu_n = \int g\, d\mu \ge \mu((-\infty, x-\epsilon]).$$
This is true for any $\epsilon > 0$, so we conclude that $\liminf_{n\to\infty} \mu_n((-\infty,x]) \ge \mu((-\infty,x))$. But if $\mu\{x\} = 0$, then $\mu((-\infty,x]) = \mu((-\infty,x))$, so we must have
$$\limsup_n \mu_n((-\infty,x]) = \liminf_n \mu_n((-\infty,x]) = \mu((-\infty,x]),$$
as claimed.

(3) $\Rightarrow$ (4): We first define the cumulative distribution functions, by $F_n(x) = \mu_n((-\infty,x])$ and $F(x) = \mu((-\infty,x])$. Then, if we let $(\Omega, \mathcal{F}, \mathbf{P})$ be Lebesgue measure on $[0,1]$, and let $Y_n(\omega) = \inf\{x;\ F_n(x) > \omega\}$ and $Y(\omega) = \inf\{x;\ F(x) > \omega\}$, then as in Lemma 7.1.2 we have $\mathcal{L}(Y_n) = \mu_n$ and $\mathcal{L}(Y) = \mu$. Note that if $F(z) < a$, then $Y(a) \ge z$, while if $F(w) > b$, then $Y(b) \le w$.

Since $\{F_n\} \to F$ at most points, it seems reasonable that $\{Y_n\} \to Y$ at most points. We will prove that $\{Y_n\} \to Y$ at points of continuity of $Y$.
Then, since $Y$ is non-decreasing, it can have at most a countable number of discontinuities: indeed, for any $m, n \in \mathbf{N}$, it has at most $m\left(Y(1-\frac1n) - Y(\frac1n)\right) < \infty$ discontinuities of size $\ge 1/m$ within the interval $[\frac1n, 1-\frac1n]$; then take the countable union over $m$ and $n$. Since countable sets have Lebesgue measure 0, this implies that $\{Y_n\} \to Y$ with probability 1, proving (4).

Suppose, then, that $Y$ is continuous at $\omega$, and let $y = Y(\omega)$. For any $\epsilon > 0$, we claim that $F(y-\epsilon) < \omega < F(y+\epsilon)$. Indeed, if we had $F(y-\epsilon) = \omega$, then setting $w = y - \epsilon$ and $b = \omega$ above, this would imply $Y(\omega) \le y - \epsilon = Y(\omega) - \epsilon$, a contradiction. Or, if we had $F(y+\epsilon) = \omega$, then setting $z = y + \epsilon$ and $a = \omega + \delta$ above, this would imply $Y(\omega + \delta) \ge y + \epsilon = Y(\omega) + \epsilon$ for all $\delta > 0$, contradicting the continuity of $Y$ at $\omega$. So, $F(y-\epsilon) < \omega < F(y+\epsilon)$ for all $\epsilon > 0$.

Next, given $\epsilon > 0$, find $\epsilon'$ with $0 < \epsilon' < \epsilon$ such that $\mu\{y-\epsilon'\} = \mu\{y+\epsilon'\} = 0$. Then $F_n(y-\epsilon') \to F(y-\epsilon')$ and $F_n(y+\epsilon') \to F(y+\epsilon')$, so $F_n(y-\epsilon') < \omega < F_n(y+\epsilon')$ for all sufficiently large $n$. This in turn implies (setting first $z = y - \epsilon'$ and $a = \omega$ above, and then $w = y + \epsilon'$ and $b = \omega$ above) that $y - \epsilon' \le Y_n(\omega) \le y + \epsilon'$, i.e. $|Y_n(\omega) - Y(\omega)| \le \epsilon' < \epsilon$ for all sufficiently large $n$. Hence, $Y_n(\omega) \to Y(\omega)$.

(4) $\Rightarrow$ (5): Recall that if $f$ is continuous at $x$, and if $\{x_n\} \to x$, then $f(x_n) \to f(x)$. Hence, if $\{Y_n\} \to Y$ and $Y \notin D_f$, then $\{f(Y_n)\} \to f(Y)$. It follows that $\mathbf{P}[\{f(Y_n)\} \to f(Y)] \ge \mathbf{P}[\{Y_n\} \to Y \text{ and } Y \notin D_f]$. But by assumption, $\mathbf{P}[\{Y_n\} \to Y] = 1$ and $\mathbf{P}[Y \notin D_f] = \mu(D_f^c) = 1$, so also $\mathbf{P}[\{f(Y_n)\} \to f(Y)] = 1$. If $f$ is bounded, then from the bounded convergence theorem, $\mathbf{E}[f(Y_n)] \to \mathbf{E}[f(Y)]$, i.e. $\int f\, d\mu_n \to \int f\, d\mu$, as claimed. ■

For a first example, let $\mu$ be Lebesgue measure on $[0,1]$, and let $\mu_n$ be defined by $\mu_n\left(\left\{\frac{i}{n}\right\}\right) = \frac1n$ for $i = 1, 2, \ldots, n$. Then $\mu$ is purely continuous while $\mu_n$ is purely discrete; furthermore, $\mu(\mathbf{Q}) = 0$ while $\mu_n(\mathbf{Q}) = 1$ for each $n$. On the other hand, for any $0 \le x \le 1$, we have $\mu((-\infty,x]) = x$ while $\mu_n((-\infty,x]) = \lfloor nx \rfloor / n$. Hence, $|\mu_n((-\infty,x]) - \mu((-\infty,x])| \le \frac1n \to 0$ as $n \to \infty$, so we do indeed have $\mu_n \Rightarrow \mu$. (Note that $\partial\mathbf{Q} = [0,1]$, so that $\mu(\partial\mathbf{Q}) \ne 0$.)

For a second example, suppose $X_1, X_2, \ldots$ are i.i.d. with finite mean $m$, and $S_n = \frac1n(X_1 + \ldots + X_n)$. Then the weak law of large numbers says that for any $\epsilon > 0$ we have $\mathbf{P}(S_n \le m - \epsilon) \to 0$ and $\mathbf{P}(S_n \le m + \epsilon) \to 1$ as $n \to \infty$. It follows that $\mathcal{L}(S_n) \Rightarrow \delta_m(\cdot)$, a point mass at $m$. Note that it is not necessarily the case that $\mathbf{P}(S_n \le m) \to \delta_m((-\infty,m]) = 1$, but this is no contradiction since the boundary of $(-\infty,m]$ is $\{m\}$, and $\delta_m\{m\} \ne 0$.
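For the first example above, condition (3) of Theorem 10.1.1 can be watched directly: the step-function c.d.f.s $F_n(x) = \lfloor nx \rfloor / n$ approach $F(x) = x$ uniformly. A brief sketch (ours; the test points are chosen at random so as not to align with the atoms of $\mu_n$):

    import numpy as np

    rng = np.random.default_rng(3)
    xs = np.sort(rng.uniform(0, 1, 10_000))

    def F_n(x, n):
        """c.d.f. of the uniform measure on {1/n, 2/n, ..., n/n}."""
        return np.floor(n * x) / n

    for n in (10, 100, 1000):
        print(n, np.max(np.abs(F_n(xs, n) - xs)))   # ~ 1/n, so mu_n => mu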
10.2. Connections to other convergence.

We now explore a sufficient condition for weak convergence.

Proposition 10.2.1. If $\{X_n\} \to X$ in probability, then $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$.

Proof. For any $\epsilon > 0$, if $X > z + \epsilon$ and $|X_n - X| \le \epsilon$, then we must have $X_n > z$. That is, $\{X > z+\epsilon\} \cap \{|X_n - X| \le \epsilon\} \subseteq \{X_n > z\}$. Taking complements, $\{X \le z+\epsilon\} \cup \{|X_n - X| > \epsilon\} \supseteq \{X_n \le z\}$. Hence, by the order-preserving property and subadditivity, $\mathbf{P}(X_n \le z) \le \mathbf{P}(X \le z+\epsilon) + \mathbf{P}(|X - X_n| > \epsilon)$. Since $\{X_n\} \to X$ in probability, we get that $\limsup_{n\to\infty} \mathbf{P}(X_n \le z) \le \mathbf{P}(X \le z+\epsilon)$. Letting $\epsilon \searrow 0$ gives $\limsup_n \mathbf{P}(X_n \le z) \le \mathbf{P}(X \le z)$.

Similarly, interchanging $X$ and $X_n$ and replacing $z$ with $z - \epsilon$ in the above gives $\mathbf{P}(X \le z-\epsilon) \le \mathbf{P}(X_n \le z) + \mathbf{P}(|X - X_n| > \epsilon)$, i.e. $\mathbf{P}(X_n \le z) \ge \mathbf{P}(X \le z-\epsilon) - \mathbf{P}(|X - X_n| > \epsilon)$, so $\liminf_n \mathbf{P}(X_n \le z) \ge \mathbf{P}(X \le z-\epsilon)$. Letting $\epsilon \searrow 0$ gives $\liminf_n \mathbf{P}(X_n \le z) \ge \mathbf{P}(X < z)$.

If $\mathbf{P}(X = z) = 0$, then $\mathbf{P}(X < z) = \mathbf{P}(X \le z)$, so we must have $\liminf_n \mathbf{P}(X_n \le z) = \limsup_n \mathbf{P}(X_n \le z) = \mathbf{P}(X \le z)$, as claimed. ■

Remark. We sometimes write $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$ simply as $X_n \Rightarrow X$, and say that $\{X_n\}$ converges weakly (or, in distribution) to $X$.

We now have an interesting near-circle of implications. We already knew (Proposition 5.2.3) that if $X_n \to X$ almost surely, then $X_n \to X$ in probability. We now see from Proposition 10.2.1 that this in turn implies that $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$. And from Theorem 10.1.1(4), this implies that there are random variables $Y_n$ and $Y$ having the same laws, such that $Y_n \to Y$ almost surely.

Note that the converse to Proposition 10.2.1 is clearly false, since the fact that $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$ says nothing about the underlying relationship between $X_n$ and $X$; it only says something about their laws. For example, if $X, X_1, X_2, \ldots$ are i.i.d., each equal to $\pm 1$ with probability $\frac12$, then of course $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$, but on the other hand $\mathbf{P}(|X - X_n| \ge 2) = \frac12 \not\to 0$, so $X_n$ does not converge to $X$ in probability or with probability 1. However, if $X$ is constant, then the converse to Proposition 10.2.1 does hold (Exercise 10.3.1).

Finally, we note that Skorohod's Theorem may be used to translate results involving convergence with probability 1 to results involving weak convergence (or, by Proposition 10.2.1, convergence in probability). For example, we have

Proposition 10.2.2. Suppose $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$, with $X_n \ge 0$. Then $\mathbf{E}(X) \le \liminf_n \mathbf{E}(X_n)$.

Proof. By Skorohod's Theorem, we can find random variables $Y_n$ and $Y$ with $\mathcal{L}(Y_n) = \mathcal{L}(X_n)$, $\mathcal{L}(Y) = \mathcal{L}(X)$, and $Y_n \to Y$ with probability 1. Then, from Fatou's Lemma,
$$\mathbf{E}(X) = \mathbf{E}(Y) = \mathbf{E}(\liminf_n Y_n) \le \liminf_n \mathbf{E}(Y_n) = \liminf_n \mathbf{E}(X_n). \quad ■$$
Finally, we note that Skorohod's Theorem may be used to translate results about convergence with probability $1$ into results about weak convergence (or, by Proposition 10.2.1, convergence in probability). For example, we have

Proposition 10.2.2. Suppose $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$, with $X_n \geq 0$. Then $E(X) \leq \liminf_n E(X_n)$.

Proof. By Skorohod's Theorem, we can find random variables $Y_n$ and $Y$ with $\mathcal{L}(Y_n) = \mathcal{L}(X_n)$, $\mathcal{L}(Y) = \mathcal{L}(X)$, and $Y_n \to Y$ with probability $1$. Then, from Fatou's Lemma,
$$E(X) = E(Y) = E(\liminf_n Y_n) \leq \liminf_n E(Y_n) = \liminf_n E(X_n). \;\blacksquare$$

For example, if $X = 0$, and if $P(X_n = n) = \frac{1}{n}$ and $P(X_n = 0) = 1 - \frac{1}{n}$, then $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X) = \delta_0$, and $0 = E(X) \leq \liminf_n E(X_n) = 1$. (In fact, here $X_n \to X$ in probability as well.)

Remark 10.2.3. We note that most of these weak convergence concepts have direct analogues for higher-dimensional distributions, not considered here; see e.g. Billingsley (1995, Section 29).

10.3. Exercises.

Exercise 10.3.1. Suppose $\mathcal{L}(X_n) \Rightarrow \delta_c$ for some $c \in \mathbf{R}$. Prove that $\{X_n\}$ converges to $c$ in probability.

Exercise 10.3.2. Let $X, Y_1, Y_2, \ldots$ be independent random variables, with $P(Y_n = 1) = 1/n$ and $P(Y_n = 0) = 1 - 1/n$. Let $Z_n = X + Y_n$. Prove that $\mathcal{L}(Z_n) \Rightarrow \mathcal{L}(X)$, i.e. that the law of $Z_n$ converges weakly to the law of $X$.

Exercise 10.3.3. Let $\mu_n = N(0, \frac{1}{n})$ be a normal distribution with mean $0$ and variance $\frac{1}{n}$. Does the sequence $\{\mu_n\}$ converge weakly to some probability measure? If yes, to what measure?

Exercise 10.3.4. Prove that weak limits, if they exist, are unique. That is, if $\mu, \nu, \mu_1, \mu_2, \ldots$ are probability measures, and $\mu_n \Rightarrow \mu$, and also $\mu_n \Rightarrow \nu$, then $\mu = \nu$.

Exercise 10.3.5. Let $\mu_n$ be the Poisson($n$) distribution, and let $\mu$ be the Poisson(5) distribution. Show explicitly that each of the four conditions of Theorem 10.1.1 is violated.

Exercise 10.3.6. Let $a_1, a_2, \ldots$ be any sequence of non-negative real numbers with $\sum_i a_i = 1$. Define the discrete measure $\mu$ by $\mu(\cdot) = \sum_{i \in \mathbf{N}} a_i \, \delta_i(\cdot)$, where $\delta_i(\cdot)$ is a point mass at the positive integer $i$. Construct a sequence $\{\mu_n\}$ of probability measures, each having a density with respect to Lebesgue measure, such that $\mu_n \Rightarrow \mu$.

Exercise 10.3.7. Let $\mathcal{L}(Y) = \mu$, where $\mu$ has continuous density $f$. For $n \in \mathbf{N}$, let $Y_n = \lfloor nY \rfloor / n$, and let $\mu_n = \mathcal{L}(Y_n)$.
(a) Describe $\mu_n$ explicitly.
(b) Prove that $\mu_n \Rightarrow \mu$.
(c) Is $\mu_n$ discrete, or absolutely continuous, or neither? What about $\mu$?
Exercise 10.3.8. Prove that the following are equivalent.
(1) $\mu_n \Rightarrow \mu$.
(2) $\int f \, d\mu_n \to \int f \, d\mu$ for all non-negative bounded continuous $f: \mathbf{R} \to \mathbf{R}$.
(3) $\int f \, d\mu_n \to \int f \, d\mu$ for all non-negative continuous $f: \mathbf{R} \to \mathbf{R}$ with compact support, i.e. such that there are finite $a$ and $b$ with $f(x) = 0$ for all $x < a$ and all $x > b$.
(4) $\int f \, d\mu_n \to \int f \, d\mu$ for all continuous $f: \mathbf{R} \to \mathbf{R}$ with compact support.
(5) $\int f \, d\mu_n \to \int f \, d\mu$ for all non-negative continuous $f: \mathbf{R} \to \mathbf{R}$ which vanish at infinity, i.e. such that $\lim_{x \to -\infty} f(x) = \lim_{x \to \infty} f(x) = 0$.
(6) $\int f \, d\mu_n \to \int f \, d\mu$ for all continuous $f: \mathbf{R} \to \mathbf{R}$ which vanish at infinity.
[Hints: You may assume the fact that all continuous functions on $\mathbf{R}$ which have compact support or vanish at infinity are bounded. Then, showing that (1) $\Rightarrow$ each of (4)-(6), and that each of (4)-(6) $\Rightarrow$ (3), is easy. For (2) $\Rightarrow$ (1), note that if $|f| \leq M$, then $f + M$ is non-negative. For (3) $\Rightarrow$ (2), note that if $f$ is non-negative bounded continuous and $m \in \mathbf{Z}$, then $f_m = f \, \mathbf{1}_{[m, m+1)}$ is non-negative bounded with compact support and is "nearly" continuous; then recall Figure 10.1.2, and that $f = \sum_{m \in \mathbf{Z}} f_m$.]

Exercise 10.3.9. Let $0 < M < \infty$, and let $f, f_1, f_2, \ldots : [0,1] \to [0, M]$ be Borel-measurable functions with $\int_0^1 f \, d\lambda = \int_0^1 f_n \, d\lambda = 1$. Suppose $\lim_n f_n(x) = f(x)$ for each fixed $x \in [0,1]$. Define probability measures $\mu, \mu_1, \mu_2, \ldots$ by $\mu(A) = \int_A f \, d\lambda$ and $\mu_n(A) = \int_A f_n \, d\lambda$, for Borel $A \subseteq [0,1]$. Prove that $\mu_n \Rightarrow \mu$.

Exercise 10.3.10. Let $f: [0,1] \to (0, \infty)$ be a continuous function such that $\int_0^1 f \, d\lambda = 1$ (where $\lambda$ is Lebesgue measure on $[0,1]$). Define probability measures $\mu$ and $\{\mu_n\}$ by $\mu(A) = \int_0^1 f \, \mathbf{1}_A \, d\lambda$ and
$$\mu_n(A) = \sum_{i=1}^n f(i/n) \, \mathbf{1}_A(i/n) \Big/ \sum_{i=1}^n f(i/n).$$
(a) Prove that $\mu_n \Rightarrow \mu$. [Hint: Recall Riemann sums from calculus.]
(b) Explicitly construct random variables $Y$ and $\{Y_n\}$ so that $\mathcal{L}(Y) = \mu$, $\mathcal{L}(Y_n) = \mu_n$, and $Y_n \to Y$ with probability $1$. [Hint: Remember the proof of Theorem 10.1.1.]
10.4. Section summary.

This section introduced the notion of weak convergence. It proved equivalences of weak convergence in terms of convergence of expectations of bounded continuous functions, convergence of probabilities, convergence of cumulative distribution functions, and the existence of corresponding random variables which converge with probability $1$. It proved that if random variables converge in probability (or with probability $1$), then their laws converge weakly. Weak convergence will be very important in the next section, including allowing for a precise statement and proof of the Central Limit Theorem (Theorem 11.2.2 on page 134).
11. Characteristic functions.

Given a random variable $X$, we define its characteristic function (or Fourier transform) by
$$\phi_X(t) = E(e^{itX}) = E[\cos(tX)] + i \, E[\sin(tX)], \qquad t \in \mathbf{R}.$$
The characteristic function is thus a function from the real numbers to the complex numbers. Of course, by the Change of Variable Theorem (Theorem 6.1.1), $\phi_X(t)$ depends only on the distribution of $X$. We sometimes write $\phi_X(t)$ as $\phi(t)$.

The characteristic function is clearly very similar to the moment generating function introduced earlier; the only difference is the appearance of the imaginary number $i = \sqrt{-1}$ in the exponent. However, this change is significant: since $|e^{itX}| = 1$ for any (real) $t$ and $X$, the triangle inequality implies that $|\phi_X(t)| \leq 1 < \infty$ for all $t$ and all random variables $X$. This is quite a contrast to the case of moment generating functions, which could be infinite for any $s \neq 0$.

As for moment generating functions, we have $\phi_X(0) = 1$ for any $X$, and if $X$ and $Y$ are independent then $\phi_{X+Y}(t) = \phi_X(t) \, \phi_Y(t)$ by (4.2.7).

We further note that, with $\mu = \mathcal{L}(X)$, we have
$$|\phi_X(t+h) - \phi_X(t)| = \Big| \int \big( e^{i(t+h)x} - e^{itx} \big) \, \mu(dx) \Big| \leq \int \big| e^{i(t+h)x} - e^{itx} \big| \, \mu(dx) = \int |e^{ihx} - 1| \, \mu(dx).$$
Now, as $h \to 0$, this last quantity decreases to $0$ by the bounded convergence theorem (since $|e^{ihx} - 1| \leq 2$). We conclude that $\phi_X$ is always a (uniformly) continuous function.
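Characteristic functions are also easy to approximate by simulation, since $\phi_X(t)$ is just an expected value. The following Python sketch (ours; the sample size and seed are arbitrary) does this for $X \sim \text{Exponential}(1)$, whose characteristic function is $1/(1 - it)$ (cf. Exercise 11.5.11(c)), and also displays $|\phi_X(t)| \leq 1$:

```python
import numpy as np

# Monte Carlo approximation of phi_X(t) = E[e^{itX}] for X ~ Exponential(1);
# the exact characteristic function is 1/(1 - it).
rng = np.random.default_rng(1)
X = rng.exponential(scale=1.0, size=10**6)
for t in [0.0, 0.5, 1.0, 5.0]:
    approx = np.mean(np.exp(1j * t * X))
    exact = 1.0 / (1.0 - 1j * t)
    print(f"t = {t:3.1f}   MC ≈ {approx:.4f}   exact = {exact:.4f}   |phi(t)| = {abs(exact):.4f}")
```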
The derivatives of $\phi_X$ are also straightforward. The following proposition is somewhat similar to the corresponding result for $M_X(s)$ (Theorem 9.3.3), except that here we do not require a severe condition like "$M_X(s) < \infty$ for all $|s| \leq s_0$".

Proposition 11.0.1. Suppose $X$ is a random variable with $E(|X|^k) < \infty$. Then for $0 \leq j \leq k$, $\phi_X$ has finite $j^{\text{th}}$ derivative, given by $\phi_X^{(j)}(t) = E[(iX)^j e^{itX}]$. In particular, $\phi_X^{(j)}(0) = i^j \, E(X^j)$.

Proof. We proceed by induction on $j$. The case $j = 0$ is trivial. Assume now that the statement is true for $j - 1$. For $t \in \mathbf{R}$, let $F_t = (iX)^{j-1} e^{itX}$, so that $|F_t'| = |(iX)^j e^{itX}| = |X|^j$. Since $E(|X|^k) < \infty$, therefore also $E(|X|^j) < \infty$. It thus follows from Proposition 9.2.1 that
$$\phi_X^{(j)}(t) = \frac{d}{dt} \, E\big[ (iX)^{j-1} e^{itX} \big] = E\Big[ \frac{d}{dt} (iX)^{j-1} e^{itX} \Big] = E\big[ (iX)^j e^{itX} \big]. \;\blacksquare$$

11.1. The continuity theorem.

In this subsection we shall prove the continuity theorem for characteristic functions (Theorem 11.1.14), which says that if characteristic functions converge pointwise, then the corresponding distributions converge weakly: $\mu_n \Rightarrow \mu$ if and only if $\phi_n(t) \to \phi(t)$ for all $t$. This is a very important result; for example, it is used to prove the central limit theorem in the next subsection. Unfortunately, the proof is somewhat technical; we must show that characteristic functions completely determine the corresponding distribution (Theorem 11.1.1 and Corollary 11.1.7 below), and must also establish a simple criterion for weak convergence of "tight" measures (Corollary 11.1.11).

We begin with an inversion theorem, which tells how to recover information about a probability distribution from its characteristic function.

Theorem 11.1.1. (Fourier inversion theorem) Let $\mu$ be a Borel probability measure on $\mathbf{R}$, with characteristic function $\phi(t) = \int_{\mathbf{R}} e^{itx} \, \mu(dx)$. If $a < b$ and $\mu\{a\} = \mu\{b\} = 0$, then
$$\mu([a,b]) = \lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \, \phi(t) \, dt.$$

To prove Theorem 11.1.1, we use two computational lemmas.

Lemma 11.1.2. For $T > 0$ and $a < b$,
$$\int_{\mathbf{R}} \int_{-T}^{T} \Big| \frac{e^{-ita} - e^{-itb}}{it} \, e^{itx} \Big| \, dt \, \mu(dx) \leq 2T(b-a) < \infty.$$
Proof. We first note by the triangle inequality that
$$\Big| \frac{e^{-ita} - e^{-itb}}{it} \, e^{itx} \Big| = \Big| \frac{e^{-ita} - e^{-itb}}{it} \Big| = \Big| \int_a^b e^{-itr} \, dr \Big| \leq \int_a^b |e^{-itr}| \, dr = \int_a^b 1 \, dr = b - a.$$
Hence,
$$\int_{\mathbf{R}} \int_{-T}^{T} \Big| \frac{e^{-ita} - e^{-itb}}{it} \, e^{itx} \Big| \, dt \, \mu(dx) \leq \int_{\mathbf{R}} \int_{-T}^{T} (b-a) \, dt \, \mu(dx) = \int_{\mathbf{R}} 2T(b-a) \, \mu(dx) = 2T(b-a). \;\blacksquare$$

Lemma 11.1.3. For $T > 0$ and $\theta \in \mathbf{R}$,
$$\lim_{T \to \infty} \int_{-T}^{T} \frac{\sin(\theta t)}{t} \, dt = \pi \, \mathrm{sign}(\theta), \qquad (11.1.4)$$
where $\mathrm{sign}(\theta) = 1$ for $\theta > 0$, $\mathrm{sign}(\theta) = -1$ for $\theta < 0$, and $\mathrm{sign}(0) = 0$. Furthermore, there is $M < \infty$ such that $\big| \int_{-T}^{T} [\sin(\theta t)/t] \, dt \big| \leq M$ for all $T > 0$ and $\theta \in \mathbf{R}$.

Proof (optional). When $\theta = 0$ both sides of (11.1.4) vanish, so assume $\theta \neq 0$. Making the substitution $s = |\theta| t$, $dt = ds/|\theta|$, gives
$$\int_{-T}^{T} \frac{\sin(\theta t)}{t} \, dt = \mathrm{sign}(\theta) \int_{-T}^{T} \frac{\sin(|\theta| t)}{t} \, dt = \mathrm{sign}(\theta) \int_{-|\theta| T}^{|\theta| T} \frac{\sin s}{s} \, ds, \qquad (11.1.5)$$
and hence
$$\lim_{T \to \infty} \int_{-T}^{T} \frac{\sin(\theta t)}{t} \, dt = 2 \, \mathrm{sign}(\theta) \int_0^{\infty} \frac{\sin s}{s} \, ds. \qquad (11.1.6)$$
Furthermore,
$$\int_0^{\infty} \frac{\sin s}{s} \, ds = \int_0^{\infty} (\sin s) \Big( \int_0^{\infty} e^{-us} \, du \Big) \, ds = \int_0^{\infty} \Big( \int_0^{\infty} (\sin s) \, e^{-us} \, ds \Big) \, du.$$
Now, for $u > 0$, integrating by parts twice,
$$I_u \equiv \int_0^{\infty} (\sin s) \, e^{-us} \, ds = (-\cos s) \, e^{-us} \Big|_{s=0}^{\infty} - \int_0^{\infty} (-\cos s)(-u) \, e^{-us} \, ds$$
$$= 0 - (-1) - u \Big( (\sin s) \, e^{-us} \Big|_{s=0}^{\infty} - \int_0^{\infty} (\sin s)(-u) \, e^{-us} \, ds \Big) = 1 - u^2 I_u.$$
Hence, $I_u = 1 - u^2 I_u$, so $I_u = 1/(1 + u^2)$. We then compute that
$$\int_0^{\infty} \frac{\sin s}{s} \, ds = \int_0^{\infty} I_u \, du = \int_0^{\infty} \frac{1}{1 + u^2} \, du = \arctan(u) \Big|_0^{\infty} = \arctan(\infty) - \arctan(0) = \pi/2 - 0 = \pi/2.$$
Combining this with (11.1.6) gives (11.1.4). Finally, since convergent sequences are bounded, it follows from (11.1.4) that the set $\big\{ \int_{-T}^{T} [\sin(t)/t] \, dt \big\}_{T > 0}$ is bounded. It then follows from (11.1.5) that the set $\big\{ \int_{-T}^{T} [\sin(\theta t)/t] \, dt \big\}_{T > 0, \, \theta \in \mathbf{R}}$ is bounded as well. ■

Proof of Theorem 11.1.1. We compute that
$$\frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \, \phi(t) \, dt = \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \Big( \int_{\mathbf{R}} e^{itx} \, \mu(dx) \Big) \, dt \quad \text{[by definition of } \phi(t) \text{]}$$
$$= \frac{1}{2\pi} \int_{\mathbf{R}} \int_{-T}^{T} \frac{e^{it(x-a)} - e^{it(x-b)}}{it} \, dt \, \mu(dx) \quad \text{[by Fubini and Lemma 11.1.2]}$$
$$= \frac{1}{2\pi} \int_{\mathbf{R}} \int_{-T}^{T} \frac{\sin(t(x-a)) - \sin(t(x-b))}{t} \, dt \, \mu(dx) \quad \text{[since the cosine terms give an odd integrand in } t \text{, which integrates to } 0 \text{]}.$$
Hence, we may use Lemma 11.1.3, together with the bounded convergence theorem and Remark 9.1.9, to conclude that
$$\lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \, \phi(t) \, dt = \frac{1}{2\pi} \int_{-\infty}^{\infty} \pi \, \big[ \mathrm{sign}(x-a) - \mathrm{sign}(x-b) \big] \, \mu(dx) = \int_{-\infty}^{\infty} \frac{1}{2} \big[ \mathrm{sign}(x-a) - \mathrm{sign}(x-b) \big] \, \mu(dx).$$
(The last equality follows because $\frac{1}{2} [\mathrm{sign}(x-a) - \mathrm{sign}(x-b)]$ is equal to $0$ if $x < a$ or $x > b$; is equal to $1/2$ if $x = a$ or $x = b$; and is equal to $1$ if $a < x < b$.) But if $\mu\{a\} = \mu\{b\} = 0$, then this is precisely equal to $\mu([a,b])$, as claimed. ■
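The inversion formula can be sanity-checked numerically. The Python sketch below (ours; it simply truncates the limit at a large $T$ and uses a Riemann sum) applies the formula to the standard normal distribution, whose characteristic function is $e^{-t^2/2}$ (Proposition 11.2.1 below):

```python
import numpy as np
from math import erf, sqrt

# Truncated, discretized version of the inversion integral of Theorem 11.1.1
# for mu = N(0,1), phi(t) = exp(-t^2/2); compare with mu([a, b]).
def inversion(a, b, T=50.0, n=400001):
    t = np.linspace(-T, T, n)
    t[t == 0] = 1e-12            # the integrand tends to b - a as t -> 0
    integrand = (np.exp(-1j*t*a) - np.exp(-1j*t*b)) / (1j*t) * np.exp(-t**2/2)
    return (integrand.sum() * (2*T / (n - 1))).real / (2*np.pi)

a, b = -1.0, 2.0
exact = 0.5 * (erf(b/sqrt(2)) - erf(a/sqrt(2)))   # Phi(b) - Phi(a)
print(f"inversion ≈ {inversion(a, b):.6f}   exact mu([a,b]) = {exact:.6f}")
```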
From Theorem 11.1.1 easily follows the important

Corollary 11.1.7. (Fourier uniqueness theorem) Let $X$ and $Y$ be random variables. Then $\phi_X(t) = \phi_Y(t)$ for all $t \in \mathbf{R}$ if and only if $\mathcal{L}(X) = \mathcal{L}(Y)$, i.e. if and only if $X$ and $Y$ have the same distribution.

Proof. Suppose $\phi_X(t) = \phi_Y(t)$ for all $t \in \mathbf{R}$. From the theorem, we know that $P(a \leq X \leq b) = P(a \leq Y \leq b)$ provided that $P(X = a) = P(X = b) = P(Y = a) = P(Y = b) = 0$, i.e. for all but countably many choices of $a$ and $b$. But then by taking limits and using continuity of probabilities, we see that $P(X \in I) = P(Y \in I)$ for all intervals $I \subseteq \mathbf{R}$. It then follows from uniqueness of extensions (Proposition 2.5.8) that $\mathcal{L}(X) = \mathcal{L}(Y)$. Conversely, if $\mathcal{L}(X) = \mathcal{L}(Y)$, then Corollary 6.1.3 implies that $E(e^{itX}) = E(e^{itY})$, i.e. $\phi_X(t) = \phi_Y(t)$ for all $t \in \mathbf{R}$. ■

This last result makes the continuity theorem at least plausible. However, to prove the continuity theorem we require some further results.

Lemma 11.1.8. (Helly Selection Principle) Let $\{F_n\}$ be a sequence of cumulative distribution functions (i.e. $F_n(x) = \mu_n((-\infty, x])$ for some probability distribution $\mu_n$). Then there is a subsequence $\{F_{n_k}\}$, and a non-decreasing right-continuous function $F$ with $0 \leq F \leq 1$, such that $\lim_{k \to \infty} F_{n_k}(x) = F(x)$ for all $x \in \mathbf{R}$ such that $F$ is continuous at $x$. [On the other hand, we might not have $\lim_{x \to -\infty} F(x) = 0$ or $\lim_{x \to \infty} F(x) = 1$.]

Proof. Since the rationals are countable, we can write them as $\mathbf{Q} = \{q_1, q_2, \ldots\}$. Since $0 \leq F_n(q_1) \leq 1$ for all $n$, the Bolzano-Weierstrass theorem (see page 204) says there is at least one subsequence $\{\ell_k^{(1)}\}$ such that $\lim_{k \to \infty} F_{\ell_k^{(1)}}(q_1)$ exists. Then, there is a further subsequence $\{\ell_k^{(2)}\}$ (i.e., $\{\ell_k^{(2)}\}$ is a subsequence of $\{\ell_k^{(1)}\}$) such that $\lim_{k \to \infty} F_{\ell_k^{(2)}}(q_2)$ exists (but also $\lim_{k \to \infty} F_{\ell_k^{(2)}}(q_1)$ exists, since $\{\ell_k^{(2)}\}$ is a subsequence of $\{\ell_k^{(1)}\}$). Continuing, for each $m \in \mathbf{N}$ there is a further subsequence $\{\ell_k^{(m)}\}$ such that $\lim_{k \to \infty} F_{\ell_k^{(m)}}(q_j)$ exists for $j \leq m$. We now define the subsequence we want by $n_k = \ell_k^{(k)}$, i.e. we take the $k^{\text{th}}$ element of the $k^{\text{th}}$ subsequence. (This trick is called the diagonalisation method.) Since, for each $m$, $\{n_k\}$ is a subsequence of $\{\ell_k^{(m)}\}$ from the $m^{\text{th}}$ term onwards, this ensures that $\lim_{k \to \infty} F_{n_k}(q) \equiv G(q)$ exists for each $q \in \mathbf{Q}$. Since each $F_n$ is non-decreasing, $G$ is also non-decreasing.

To continue, we set $F(x) = \inf\{G(q);\; q \in \mathbf{Q},\; q > x\}$. Then $F$ is easily seen to be non-decreasing, with $0 \leq F(x) \leq 1$. Furthermore, $F$ is right-continuous, since if $\{x_n\} \searrow x$ then $\{q \in \mathbf{Q} : q > x_n\} \nearrow \{q \in \mathbf{Q} : q > x\}$, and hence $F(x_n) \to F(x)$ as in Exercise 3.6.4. Also, since $G$ is non-decreasing, we have $F(q) \geq G(q)$ for all $q \in \mathbf{Q}$.
Now, if $F$ is continuous at $x$, then given $\epsilon > 0$ we can find rational numbers $r$, $s$, and $u$ with $r < u < x < s$, and with $F(s) - F(r) < \epsilon$. We then note that
$$F(x) - \epsilon \leq F(r) = \inf_{q > r} G(q) = \inf_{q > r} \lim_k F_{n_k}(q) = \inf_{q > r} \liminf_k F_{n_k}(q)$$
$$\leq \liminf_k F_{n_k}(u) \quad [\text{since } u \in \mathbf{Q},\; u > r]$$
$$\leq \liminf_k F_{n_k}(x) \quad [\text{since } x > u]$$
$$\leq \limsup_k F_{n_k}(x)$$
$$\leq \limsup_k F_{n_k}(s) \quad [\text{since } s > x]$$
$$= G(s) \leq F(s) \leq F(x) + \epsilon.$$
This is true for any $\epsilon > 0$; hence we must have
$$\liminf_k F_{n_k}(x) = \limsup_k F_{n_k}(x) = F(x),$$
so that $\lim_k F_{n_k}(x) = F(x)$, as claimed. ■

Unfortunately, Lemma 11.1.8 does not ensure that $\lim_{x \to \infty} F(x) = 1$ or $\lim_{x \to -\infty} F(x) = 0$ (see e.g. Exercise 11.5.1). To rectify this, we require a new notion. We say that a collection $\{\mu_n\}$ of probability measures on $\mathbf{R}$ is tight if for all $\epsilon > 0$, there are $a < b$ with $\mu_n([a,b]) \geq 1 - \epsilon$ for all $n$. That is, all of the measures give most of their mass to the same finite interval; mass does not "escape off to infinity".

Exercise 11.1.9. Prove that:
(a) any finite collection of probability measures is tight;
(b) the union of two tight collections of probability measures is tight;
(c) any sub-collection of a tight collection is tight.
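Tightness is easy to visualize numerically. The following sketch (ours; the interval $[-10,10]$ is an arbitrary choice) reports the mass each measure assigns to a fixed interval: for $\mu_n = N(0, 1/n)$ the mass stays near $1$, whereas for the point masses $\mu_n = \delta_n$ of Exercise 11.5.1 it escapes to infinity:

```python
from scipy import stats

# Mass assigned to [-10, 10]: near 1 for the tight family N(0, 1/n),
# but eventually 0 for the point masses delta_n.
for n in [1, 10, 100, 1000]:
    normal = stats.norm(loc=0, scale=(1 / n) ** 0.5)
    mass_normal = normal.cdf(10) - normal.cdf(-10)
    mass_delta = 1.0 if n <= 10 else 0.0       # delta_n([-10, 10])
    print(f"n = {n:4d}   N(0,1/n) mass = {mass_normal:.6f}   delta_n mass = {mass_delta}")
```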
We then have the following.

Theorem 11.1.10. If $\{\mu_n\}$ is a tight sequence of probability measures, then there is a subsequence $\{\mu_{n_k}\}$ and a probability measure $\mu$, such that $\mu_{n_k} \Rightarrow \mu$, i.e. $\{\mu_{n_k}\}$ converges weakly to $\mu$.

Proof. Let $F_n(x) = \mu_n((-\infty, x])$. Then by Lemma 11.1.8, there is a subsequence $\{F_{n_k}\}$ and a function $F$ such that $F_{n_k}(x) \to F(x)$ at all continuity points of $F$; furthermore $0 \leq F \leq 1$. We now claim that $F$ is actually a probability distribution function, i.e. that $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$. Indeed, let $\epsilon > 0$. Then using tightness, we can find points $a < b$ which are continuity points of $F$, such that $\mu_n((a, b]) \geq 1 - \epsilon$ for all $n$. But then
$$\lim_{x \to \infty} F(x) - \lim_{x \to -\infty} F(x) \geq F(b) - F(a) = \lim_k \big[ F_{n_k}(b) - F_{n_k}(a) \big] = \lim_k \mu_{n_k}((a, b]) \geq 1 - \epsilon.$$
This is true for all $\epsilon > 0$, so we must have $\lim_{x \to \infty} F(x) - \lim_{x \to -\infty} F(x) = 1$, proving the claim. Hence, $F$ is indeed a probability distribution function. Thus, we can define the probability measure $\mu$ by $\mu((a, b]) = F(b) - F(a)$ for $a < b$. Then $\mu_{n_k} \Rightarrow \mu$ by Theorem 10.1.1, and we are done. ■

A main use of this theorem comes from the following corollary.

Corollary 11.1.11. Let $\{\mu_n\}$ be a tight sequence of probability distributions on $\mathbf{R}$. Suppose that $\mu$ is the only possible weak limit of $\{\mu_n\}$, in the sense that whenever $\mu_{n_k} \Rightarrow \nu$ then $\nu = \mu$ (that is, whenever a subsequence of the $\{\mu_n\}$ converges weakly to some probability measure, then that probability measure must be $\mu$). Then $\mu_n \Rightarrow \mu$, i.e. the full sequence converges weakly to $\mu$.

Proof. If $\mu_n \not\Rightarrow \mu$, then by Theorem 10.1.1, it is not the case that $\mu_n((-\infty, x]) \to \mu((-\infty, x])$ for all $x \in \mathbf{R}$ with $\mu\{x\} = 0$. Hence, we can find $x \in \mathbf{R}$, $\epsilon > 0$, and a subsequence $\{n_k\}$, with $\mu\{x\} = 0$, but with
$$\big| \mu_{n_k}((-\infty, x]) - \mu((-\infty, x]) \big| \geq \epsilon, \qquad k \in \mathbf{N}. \qquad (11.1.12)$$
On the other hand, $\{\mu_{n_k}\}$ is a subcollection of $\{\mu_n\}$ and hence tight, so by Theorem 11.1.10 there is a further subsequence $\{\mu_{n_{k_j}}\}$ which converges weakly to some probability measure, say $\nu$. But then by hypothesis we must have $\nu = \mu$, which is a contradiction to (11.1.12). ■

Corollary 11.1.11 is nearly the last thing we need to prove the continuity theorem. We require just one further result, concerning a sufficient condition for a sequence of measures to be tight.

Lemma 11.1.13. Let $\{\mu_n\}$ be a sequence of probability measures on $\mathbf{R}$, with characteristic functions $\phi_n(t) = \int e^{itx} \, \mu_n(dx)$. Suppose there is a function $g$ which is continuous at $0$, such that $\lim_n \phi_n(t) = g(t)$ for each $|t| \leq t_0$, for some $t_0 > 0$. Then $\{\mu_n\}$ is tight.
Proof. We first note that $g(0) = \lim_n \phi_n(0) = \lim_n 1 = 1$. We then compute that, for $y > 0$,
$$\mu_n\big( (-\infty, -\tfrac{2}{y}] \cup [\tfrac{2}{y}, \infty) \big) = \int_{|x| \geq 2/y} 1 \, \mu_n(dx) \leq \int_{|x| \geq 2/y} 2 \Big( 1 - \frac{\sin(yx)}{yx} \Big) \, \mu_n(dx) \leq \int_{\mathbf{R}} 2 \Big( 1 - \frac{\sin(yx)}{yx} \Big) \, \mu_n(dx)$$
$$= \frac{1}{y} \int_{\mathbf{R}} \int_{-y}^{y} \big( 1 - \cos(tx) \big) \, dt \, \mu_n(dx) = \frac{1}{y} \int_{-y}^{y} \int_{\mathbf{R}} \big( 1 - \cos(tx) \big) \, \mu_n(dx) \, dt = \frac{1}{y} \int_{-y}^{y} \big( 1 - \phi_n(t) \big) \, dt.$$
Here the first inequality uses that $1 - \frac{\sin(yx)}{yx} \geq \frac{1}{2}$ whenever $|x| \geq 2/y$ (since then $|\sin(yx)| \leq 1 \leq |yx|/2$); the second inequality uses that $\frac{\sin(yx)}{yx} \leq 1$, so the integrand is non-negative; the following equality uses that $\int_{-y}^{y} \cos(tx) \, dt = \frac{2 \sin(yx)}{x}$; the exchange of the integrals uses Fubini's theorem (which is justified since the function is bounded and hence has finite double integral); and the final equality uses that $\int_{-y}^{y} \sin(tx) \, dt = 0$, so that only the real part of $\phi_n(t)$ contributes.

To finish the proof, let $\epsilon > 0$. Since $g(0) = 1$ and $g$ is continuous at $0$, we can find $y_0$ with $0 < y_0 \leq t_0$ such that $|1 - g(t)| < \epsilon/4$ whenever $|t| \leq y_0$. Then $\big| \frac{1}{y_0} \int_{-y_0}^{y_0} (1 - g(t)) \, dt \big| \leq \epsilon/2$. Now, $\phi_n(t) \to g(t)$ for all $|t| \leq y_0$, and $|\phi_n(t)| \leq 1$. Hence, by the bounded convergence theorem, we can find $n_0 \in \mathbf{N}$ such that $\big| \frac{1}{y_0} \int_{-y_0}^{y_0} (1 - \phi_n(t)) \, dt \big| < \epsilon$ for all $n \geq n_0$. Hence, $\mu_n\big( (-\frac{2}{y_0}, \frac{2}{y_0}) \big) = 1 - \mu_n\big( (-\infty, -\frac{2}{y_0}] \cup [\frac{2}{y_0}, \infty) \big) \geq 1 - \epsilon$ for all $n \geq n_0$. It follows from the definition (together with Exercise 11.1.9(a) and (b), to handle the finitely many $n < n_0$) that $\{\mu_n\}$ is tight. ■

We are now, finally, in a position to prove the continuity theorem.

Theorem 11.1.14. (Continuity Theorem) Let $\mu, \mu_1, \mu_2, \ldots$ be probability measures, with corresponding characteristic functions $\phi, \phi_1, \phi_2, \ldots$. Then $\mu_n \Rightarrow \mu$ if and only if $\phi_n(t) \to \phi(t)$ for all $t \in \mathbf{R}$. In words, the probability measures $\{\mu_n\}$ converge weakly to $\mu$ if and only if their characteristic functions converge pointwise to that of $\mu$.

Proof. First, suppose that $\mu_n \Rightarrow \mu$. Then, since $\cos(tx)$ and $\sin(tx)$ are bounded continuous functions of $x$, we have, as $n \to \infty$, for each $t \in \mathbf{R}$, that
$$\phi_n(t) = \int \cos(tx) \, \mu_n(dx) + i \int \sin(tx) \, \mu_n(dx) \to \int \cos(tx) \, \mu(dx) + i \int \sin(tx) \, \mu(dx) = \phi(t).$$
Conversely, suppose that $\phi_n(t) \to \phi(t)$ for each $t \in \mathbf{R}$. Then by Lemma 11.1.13 (with $g = \phi$), the $\{\mu_n\}$ are tight. Now, suppose that we have $\mu_{n_k} \Rightarrow \nu$ for some subsequence $\{\mu_{n_k}\}$ and some measure $\nu$. Then, from the previous paragraph we must have $\phi_{n_k}(t) \to \phi_\nu(t)$ for all $t$, where $\phi_\nu(t) = \int e^{itx} \, \nu(dx)$. On the other hand, we know that $\phi_{n_k}(t) \to \phi(t)$ for all $t$; hence, we must have $\phi_\nu = \phi$. But from Fourier uniqueness (Corollary 11.1.7), this implies that $\nu = \mu$. Hence, we have shown that $\mu$ is the only possible weak limit of the $\{\mu_n\}$. Therefore, from Corollary 11.1.11, we must have $\mu_n \Rightarrow \mu$, as claimed. ■

11.2. The Central Limit Theorem.

Now that we have proved the continuity theorem (Theorem 11.1.14), it is very easy to prove the classical central limit theorem. First, we compute the characteristic function of the standard normal distribution $N(0,1)$, i.e. of a random variable $X$ having density with respect to Lebesgue measure given by $f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. That is, we wish to compute
$$\phi_X(t) = \int_{-\infty}^{\infty} e^{itx} \, \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx.$$
Comparing with the computation leading to (9.3.2), we might expect that $\phi_X(t) = M_X(it) = e^{(it)^2/2} = e^{-t^2/2}$. This is in fact correct, and can be justified using the theory of complex analysis. But to avoid such technicalities, we instead resort to a trick.

Proposition 11.2.1. If $X \sim N(0,1)$, then $\phi_X(t) = e^{-t^2/2}$ for all $t \in \mathbf{R}$.

Proof. By Proposition 9.2.1 (with $F_t = e^{itX}$ and $Y = |X|$, so that $E(Y) < \infty$ and $|F_t'| = |(iX) e^{itX}| = |X| \leq Y$ for all $t$), we can differentiate under the integral sign, to obtain
$$\phi_X'(t) = \int_{-\infty}^{\infty} ix \, e^{itx} \, \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx = -\int_{-\infty}^{\infty} i \, e^{itx} \, \frac{1}{\sqrt{2\pi}} \, \frac{d}{dx}\big( e^{-x^2/2} \big) \, dx.$$
Integrating by parts gives
$$\phi_X'(t) = \int_{-\infty}^{\infty} i \, (it) \, e^{itx} \, \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx = -t \, \phi_X(t).$$
Hence, $\phi_X'(t) = -t \, \phi_X(t)$, so that $\frac{d}{dt} \log \phi_X(t) = -t$. Also, we know that $\log \phi_X(0) = \log 1 = 0$. Hence, we must have $\log \phi_X(t) = \int_0^t (-s) \, ds = -t^2/2$,
whence $\phi_X(t) = e^{-t^2/2}$. ■

We can now prove

Theorem 11.2.2. (Central Limit Theorem) Let $X_1, X_2, \ldots$ be i.i.d. with finite mean $m$ and finite variance $v$. Set $S_n = X_1 + \ldots + X_n$. Then as $n \to \infty$,
$$\mathcal{L}\Big( \frac{S_n - nm}{\sqrt{nv}} \Big) \Rightarrow \mu,$$
where $\mu = N(0,1)$ is the standard normal distribution, i.e. the distribution having density $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ with respect to Lebesgue measure.

Proof. By replacing $X_i$ by $\frac{X_i - m}{\sqrt{v}}$, we can (and do) assume that $m = 0$ and $v = 1$. Let $\phi_n(t) = E\big( e^{itS_n/\sqrt{n}} \big)$ be the characteristic function of $S_n/\sqrt{n}$. By the continuity theorem (Theorem 11.1.14), and by Proposition 11.2.1, it suffices to show that $\lim_n \phi_n(t) = e^{-t^2/2}$ for each fixed $t \in \mathbf{R}$.

To this end, set $\phi(t) = E(e^{itX_1})$. By Proposition 11.0.1, $\phi'(0) = i\,E(X_1) = 0$ and $\phi''(0) = i^2 E(X_1^2) = -1$. Then as $n \to \infty$, using a Taylor expansion,
$$\phi_n(t) = E\big( e^{it(X_1 + \ldots + X_n)/\sqrt{n}} \big) = \phi(t/\sqrt{n})^n = \Big( 1 + \frac{t}{\sqrt{n}} \phi'(0) + \frac{1}{2} \Big( \frac{t}{\sqrt{n}} \Big)^2 \phi''(0) + o(1/n) \Big)^n = \Big( 1 - \frac{t^2}{2n} + o(1/n) \Big)^n \to e^{-t^2/2},$$
as claimed. (Here $o(1/n)$ means a quantity $q_n$ such that $q_n/(1/n) \to 0$ as $n \to \infty$. Formally, the limit holds since for any $\epsilon > 0$, for sufficiently large $n$ we have $q_n \geq -\epsilon/n$ and also $q_n \leq \epsilon/n$, so that the liminf is $\geq e^{-(t^2/2) - \epsilon}$ and the limsup is $\leq e^{-(t^2/2) + \epsilon}$.) ■

Since the normal distribution has no points of positive measure, this theorem immediately implies (by Theorem 10.1.1) the simpler-seeming

Corollary 11.2.3. Let $\{X_n\}$, $m$, $v$, and $S_n$ be as above. Then for each fixed $x \in \mathbf{R}$,
$$\lim_{n \to \infty} P\Big( \frac{S_n - nm}{\sqrt{nv}} \leq x \Big) = \Phi(x), \qquad (11.2.4)$$
where $\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \, dt$ is the cumulative distribution function of the standard normal distribution.
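As an informal check of (11.2.4), here is a small simulation (ours: exponential summands, with arbitrary seed and sample sizes) that estimates the left-hand side at $x = 1$:

```python
import numpy as np
from scipy import stats

# Standardized sums of i.i.d. Exponential(1) variables (so m = v = 1),
# compared with the standard normal CDF at x = 1.
rng = np.random.default_rng(2)
x = 1.0
for n in [5, 50, 500]:
    S = rng.exponential(1.0, size=(20000, n)).sum(axis=1)
    estimate = np.mean((S - n) / np.sqrt(n) <= x)
    print(f"n = {n:3d}   P((S_n - nm)/sqrt(nv) <= 1) ≈ {estimate:.4f}   Phi(1) = {stats.norm.cdf(x):.4f}")
```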
Corollary 11.2.3 can also be written as $P(S_n \leq nm + x\sqrt{nv}) \to \Phi(x)$. That is, $X_1 + \ldots + X_n$ is approximately equal to $nm$, with deviations from this value of order $\sqrt{n}$.

For example, suppose $X_1, X_2, \ldots$ each have the Poisson(5) distribution. This implies that $m = E(X_i) = 5$ and $v = \mathrm{Var}(X_i) = 5$. Hence, for each fixed $x \in \mathbf{R}$, we see that
$$P\big( X_1 + \ldots + X_n \leq 5n + x\sqrt{5n} \big) \to \Phi(x), \qquad n \to \infty.$$

Remarks.
1. It is not essential in the Central Limit Theorem to divide by $\sqrt{v}$. Without doing so, the theorem asserts instead that
$$\mathcal{L}\Big( \frac{S_n - nm}{\sqrt{n}} \Big) \Rightarrow N(0, v).$$
2. Going backwards, the Central Limit Theorem in turn implies the WLLN, since if $y > m$, then as $n \to \infty$,
$$P[(S_n/n) \leq y] = P[S_n \leq ny] = P\big[ (S_n - nm)/\sqrt{nv} \leq (ny - nm)/\sqrt{nv} \big] \approx \Phi\big[ (ny - nm)/\sqrt{nv} \big] = \Phi\big[ \sqrt{n}(y - m)/\sqrt{v} \big] \to \Phi(+\infty) = 1,$$
and similarly if $y < m$ then $P[(S_n/n) \leq y] \to \Phi(-\infty) = 0$. Hence, $\mathcal{L}(S_n/n) \Rightarrow \delta_m$, and so $S_n/n$ converges to $m$ in probability.

11.3. Generalisations of the Central Limit Theorem.

The classical central limit theorem (Theorem 11.2.2 and Corollary 11.2.3) is extremely useful in many areas of science. However, it does have certain limitations. For example, it provides no quantitative bounds on the convergence in (11.2.4). Also, the insistence that the random variables be i.i.d. is sometimes too severe.

The first of these problems is solved by the Berry-Esseen Theorem, which states that if $X_1, X_2, \ldots$ are i.i.d. with finite mean $m$, finite positive variance $v$, and $E(|X_i - m|^3) = \rho < \infty$, then
$$\Big| P\Big( \frac{X_1 + \ldots + X_n - nm}{\sqrt{vn}} \leq x \Big) - \Phi(x) \Big| \leq \frac{3\rho}{v^{3/2}\sqrt{n}}.$$
This theorem thus provides a quantitative bound on the convergence in (11.2.4), depending only on the third moment. For a proof see e.g. Feller (1971, Section XVI.5). Note, however, that this error bound is absolute, not relative: as $x \to -\infty$, both $P\big( \frac{X_1 + \ldots + X_n - nm}{\sqrt{vn}} \leq x \big)$ and $\Phi(x)$ get small, and the Berry-Esseen Theorem says less and less.
In particular, the theorem does not assert that $P\big( \frac{X_1 + \ldots + X_n - nm}{\sqrt{vn}} \leq x \big)$ decreases as $O(e^{-x^2/2})$ as $x \to -\infty$, even though $\Phi(x)$ does. (We have already seen in Theorem 9.3.4 that $P\big( \frac{X_1 + \ldots + X_n}{n} \leq x \big)$ often decreases as $\rho^n$ for some $\rho < 1$.)

Regarding the second problem, we mention just two of many results. To state them, we shall consider collections $\{Z_{nk};\; n \geq 1,\; 1 \leq k \leq r_n\}$ of random variables such that each row $\{Z_{nk}\}_{1 \leq k \leq r_n}$ is independent, called triangular arrays. (If $r_n = n$ they form an actual triangle.) We shall assume for simplicity that $E(Z_{nk}) = 0$ for each $n$ and $k$. We shall further set $\sigma_{nk}^2 = E(Z_{nk}^2)$ (assumed to be finite), $S_n = Z_{n1} + \ldots + Z_{nr_n}$, and $s_n^2 = \mathrm{Var}(S_n) = \sigma_{n1}^2 + \ldots + \sigma_{nr_n}^2$.

For such a triangular array, the Lindeberg Central Limit Theorem states that $\mathcal{L}(S_n/s_n) \Rightarrow N(0,1)$, provided that for each $\epsilon > 0$, we have
$$\lim_{n \to \infty} \frac{1}{s_n^2} \sum_{k=1}^{r_n} E\big[ Z_{nk}^2 \, \mathbf{1}_{|Z_{nk}| \geq \epsilon s_n} \big] = 0. \qquad (11.3.1)$$
This Lindeberg condition states, roughly, that as $n \to \infty$, the tails of the $Z_{nk}$ contribute less and less to the variance of $S_n$.

Exercise 11.3.2. Consider the special case where $r_n = n$, with $Z_{nk} = \frac{1}{\sqrt{vn}} Y_k$, where $\{Y_n\}$ are i.i.d. with mean $0$ and variance $v < \infty$ (so $s_n = 1$).
(a) Prove that the Lindeberg condition (11.3.1) is satisfied in this case. [Hint: Use the Dominated Convergence Theorem.]
(b) Prove that the Lindeberg CLT implies Theorem 11.2.2.
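Exercise 11.3.2(a) asks for a proof, but the Lindeberg sum can also be estimated by simulation. In the i.i.d. setting of the exercise the sum reduces to $\frac{1}{v} E\big[ Y^2 \, \mathbf{1}_{|Y| \geq \epsilon \sqrt{vn}} \big]$; the sketch below (ours; centred exponential summands and an arbitrary $\epsilon$) shows it tending to $0$:

```python
import numpy as np

# Y = Exponential(1) - 1 has mean 0 and variance v = 1, and Z_nk = Y_k/sqrt(vn);
# the Lindeberg sum (11.3.1) then equals (1/v) E[Y^2 1{|Y| >= eps*sqrt(vn)}].
rng = np.random.default_rng(3)
Y = rng.exponential(1.0, size=10**6) - 1.0
eps = 0.1
for n in [10, 100, 1000, 10000]:
    lindeberg = np.mean(Y**2 * (np.abs(Y) >= eps * np.sqrt(n)))
    print(f"n = {n:6d}   Lindeberg sum ≈ {lindeberg:.2e}")
```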
This raises the question: if (11.3.1) is not satisfied, then what other limiting distributions may arise? Call a distribution $\mu$ a possible limit if there exists a triangular array as defined above, with $\sup_n s_n^2 < \infty$ and $\lim_{n \to \infty} \max_{1 \leq k \leq r_n} \sigma_{nk}^2 = 0$ (so that no one term dominates the contribution to $\mathrm{Var}(S_n)$), such that $\mathcal{L}(S_n) \Rightarrow \mu$. Then we can ask, what distributions are possible limits?

Obviously the normal distributions $N(0, v)$ are possible limits; indeed $\mathcal{L}(S_n) \Rightarrow N(0, v)$ whenever (11.3.1) is satisfied and $s_n^2 \to v$. But what else? The answer is that the possible limits are precisely the infinitely divisible distributions having mean $0$ and finite variance. Here a distribution $\mu$ is called infinitely divisible if for all $n \in \mathbf{N}$, there is a distribution $\nu_n$ such that the $n$-fold convolution of $\nu_n$ equals $\mu$ (in symbols: $\nu_n * \nu_n * \ldots * \nu_n = \mu$). Recall that this means that, if $X_1, X_2, \ldots, X_n \sim \nu_n$ are independent, then $X_1 + \ldots + X_n \sim \mu$.

Half of this theorem is obvious: indeed, if $\mu$ is infinitely divisible, then we can take $r_n = n$ and $\mathcal{L}(Z_{nk}) = \nu_n$ in the triangular array, to get that $\mathcal{L}(S_n) \Rightarrow \mu$. For a proof of the converse, see e.g. Billingsley (1995, Theorem 28.2).

11.4. Method of moments.

There is another way of proving weak convergence of probability measures, which does not explicitly use characteristic functions or the continuity theorem (though its proof of correctness does, through Corollary 11.1.11). Instead, it uses moments, as we now discuss.

Recall that a probability distribution $\mu$ on $\mathbf{R}$ has moments defined by $\alpha_k = \int x^k \, \mu(dx)$, for $k = 1, 2, 3, \ldots$. Suppose these moments all exist and are all finite. Then is $\mu$ the only distribution having precisely these moments? And, if a sequence $\{\mu_n\}$ of distributions has moments which converge to those of $\mu$, does it then follow that $\mu_n \Rightarrow \mu$, i.e. that the $\mu_n$ converge weakly to $\mu$? We shall see in this section that such conclusions hold sometimes, but not always.

We shall say that a distribution $\mu$ is determined by its moments if all its moments are finite, and if no other distribution has identical moments. (That is, we have $\int |x|^k \, \mu(dx) < \infty$ for all $k \in \mathbf{N}$, and furthermore whenever $\int x^k \, \mu(dx) = \int x^k \, \nu(dx)$ for all $k \in \mathbf{N}$, then we must have $\nu = \mu$.) We first show that, for those distributions determined by their moments, convergence of moments implies weak convergence of distributions; this result thus reduces the second question above to the first question.

Theorem 11.4.1. Suppose that $\mu$ is determined by its moments. Let $\{\mu_n\}$ be a sequence of distributions, such that $\int x^k \, \mu_n(dx)$ is finite for all $n, k \in \mathbf{N}$, and such that $\lim_{n \to \infty} \int x^k \, \mu_n(dx) = \int x^k \, \mu(dx)$ for each $k \in \mathbf{N}$. Then $\mu_n \Rightarrow \mu$, i.e. the $\mu_n$ converge weakly to $\mu$.

Proof. We first claim that $\{\mu_n\}$ is tight. Indeed, since the moments converge to finite quantities, we can find $K_k \in \mathbf{R}$ with $\int x^k \, \mu_n(dx) \leq K_k$ for all $n \in \mathbf{N}$. But then, by Markov's inequality, letting $Y_n \sim \mu_n$, we have
$$\mu_n([-R, R]) = P(|Y_n| \leq R) = 1 - P(|Y_n| > R) = 1 - P(Y_n^2 > R^2) \geq 1 - \big( E[Y_n^2]/R^2 \big) \geq 1 - (K_2/R^2),$$
which is $\geq 1 - \epsilon$ whenever $R \geq \sqrt{K_2/\epsilon}$, thus proving tightness.

We now claim that if any subsequence $\{\mu_{n_r}\}$ converges weakly to some distribution $\nu$, then we must have $\nu = \mu$. The theorem will then follow from Corollary 11.1.11.
Indeed, suppose $\mu_{n_r} \Rightarrow \nu$. By Skorohod's theorem, we can find random variables $Y$ and $\{Y_r\}$ with $\mathcal{L}(Y) = \nu$ and $\mathcal{L}(Y_r) = \mu_{n_r}$, such that $Y_r \to Y$ with probability $1$. But then also $Y_r^k \to Y^k$ with probability $1$. Furthermore, for $k \in \mathbf{N}$ and $a > 0$, we have
$$E\big( |Y_r|^k \, \mathbf{1}_{|Y_r|^k \geq a} \big) \leq E\Big( \frac{|Y_r|^{2k}}{a} \, \mathbf{1}_{|Y_r|^k \geq a} \Big) \leq \frac{1}{a} E\big( |Y_r|^{2k} \big) = \frac{1}{a} E\big( (Y_r)^{2k} \big) \leq \frac{K_{2k}}{a},$$
which is independent of $r$, and goes to $0$ as $a \to \infty$. Hence, the $\{Y_r^k\}$ are uniformly integrable. Thus, by Theorem 9.1.6, $E(Y_r^k) \to E(Y^k)$, i.e. $\int x^k \, \mu_{n_r}(dx) \to \int x^k \, \nu(dx)$. But we already know that $\int x^k \, \mu_{n_r}(dx) \to \int x^k \, \mu(dx)$. Hence, the moments of $\nu$ and $\mu$ must coincide. And, since $\mu$ is determined by its moments, we must have $\nu = \mu$, as claimed. ■

This theorem leads to the question of which distributions are determined by their moments. Unfortunately, not all distributions are, as the following exercise shows.

Exercise 11.4.2. Let $f(x) = \frac{1}{x\sqrt{2\pi}} e^{-(\log x)^2/2}$ for $x > 0$ (with $f(x) = 0$ for $x \leq 0$) be the density function of the random variable $e^X$, where $X \sim N(0,1)$. Let $g(x) = f(x)\big( 1 + \sin(2\pi \log x) \big)$. Show that $g(x) \geq 0$, and that $\int x^k g(x) \, dx = \int x^k f(x) \, dx$ for $k = 0, 1, 2, \ldots$. [Hint: Consider $\int x^k f(x) \sin(2\pi \log x) \, dx$, and make the substitution $x = e^s e^k$, $dx = e^s e^k \, ds$.] Show further that $\int |x|^k f(x) \, dx < \infty$ for all $k \in \mathbf{N}$. Conclude that $g$ is a probability density function, and that $g$ gives rise to the same (finite) moments as does $f$. Relate this to Theorem 11.4.1 above and Theorem 11.4.3 below.
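Exercise 11.4.2 can be previewed numerically (an illustration of ours, not a substitute for the requested proof). With the substitution $x = e^s$, the $k^{\text{th}}$ moment under $f$ becomes $\int e^{ks} \varphi(s) \, ds$ (with $\varphi$ the standard normal density), and the perturbation from $g$ contributes $\int e^{ks} \varphi(s) \sin(2\pi s) \, ds$, which vanishes for every integer $k$:

```python
import numpy as np

# Plain Riemann-sum check: the k-th moment equals e^{k^2/2} for both f and g,
# since the sine term integrates to 0 for integer k.
s = np.linspace(-15.0, 30.0, 2_000_001)
ds = s[1] - s[0]
phi = np.exp(-s**2 / 2) / np.sqrt(2 * np.pi)     # standard normal density
for k in range(5):
    moment = np.sum(np.exp(k * s) * phi) * ds
    extra = np.sum(np.exp(k * s) * phi * np.sin(2 * np.pi * s)) * ds
    print(f"k = {k}   moment ≈ {moment:12.4f} (exact {np.exp(k**2 / 2):12.4f})   sine term ≈ {extra:.1e}")
```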
On the other hand, if a distribution's moment generating function is finite in a neighbourhood of the origin, then the distribution is determined by its moments, as we now show. (Unfortunately, the proof requires a result from complex function theory.)

Theorem 11.4.3. Let $s_0 > 0$, and let $X$ be a random variable with moment generating function $M_X(s)$ which is finite for $|s| \leq s_0$. Then $\mathcal{L}(X)$ is determined by its moments (and also by $M_X(s)$).

Proof (optional). Let $f_X(z) = E(e^{zX})$ for $z \in \mathbf{C}$. Since $|e^{zX}| = e^{X \, \mathrm{Re}\, z}$, we see that $f_X(z)$ is finite whenever $|\mathrm{Re}\, z| \leq s_0$. Furthermore, just as for $M_X(s)$, it follows that $f_X(z)$ is analytic on $\{z \in \mathbf{C};\; |\mathrm{Re}\, z| < s_0\}$. Now, if $Y$ has the same moments as does $X$, then for $|s| \leq s_0$, we have by the order-preserving property and countable linearity that
$$E(e^{sY}) \leq E(e^{sY} + e^{-sY}) = 2\Big( 1 + \frac{s^2}{2!} E(Y^2) + \frac{s^4}{4!} E(Y^4) + \ldots \Big) = M_X(s) + M_X(-s) < \infty,$$
where the last equality uses that the moments of $Y$ equal those of $X$. Hence, $M_Y(s) < \infty$ for $|s| \leq s_0$. It now follows from Theorem 9.3.3 that $M_Y(s) = M_X(s)$ for $|s| < s_0$, i.e. that $f_X(s) = f_Y(s)$ for real $|s| < s_0$. By the uniqueness of analytic continuation, this implies that $f_Y(z) = f_X(z)$ for $|\mathrm{Re}\, z| < s_0$. In particular, since $\phi_X(t) = f_X(it)$ and $\phi_Y(t) = f_Y(it)$, we have $\phi_X = \phi_Y$. Hence, by the uniqueness theorem for characteristic functions (Corollary 11.1.7), we must have $\mathcal{L}(Y) = \mathcal{L}(X)$, as claimed. ■

Remark 11.4.4. Proving weak convergence by showing convergence of moments is called the method of moments. (This should not be confused with the statistical estimation procedure of the same name, which estimates unknown parameters by choosing them to make observed moments equal theoretical ones.) Indeed, it is possible to prove the central limit theorem in this manner, under appropriate assumptions.

Remark 11.4.5. By similar reasoning, it is possible to show that if $M_{X_n}(s) < \infty$ and $M_X(s) < \infty$ for all $n \in \mathbf{N}$ and $|s| \leq s_0$, and also $M_{X_n}(s) \to M_X(s)$ for all $|s| \leq s_0$, then we must have $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$.
11.5. Exercises.

Exercise 11.5.1. Let $\mu_n = \delta_n$ be a point mass at $n$ (for $n = 1, 2, \ldots$).
(a) Is $\{\mu_n\}$ tight?
(b) Does there exist a subsequence $\{\mu_{n_k}\}$, and a Borel probability measure $\mu$, such that $\mu_{n_k} \Rightarrow \mu$? (If so, then specify $\{n_k\}$ and $\mu$.) Relate this to theorems from this section.
(c) Setting $F_n(x) = \mu_n((-\infty, x])$, does there exist a non-decreasing, right-continuous function $F$ such that $F_n(x) \to F(x)$ for all continuity points $x$ of $F$? (If so, then specify $F$.) Relate this to the Helly Selection Principle.
(d) Repeat part (c) for the case where $\mu_n = \delta_{-n}$ is a point mass at $-n$.

Exercise 11.5.2. Let $\mu_n = \delta_{n \bmod 3}$ be a point mass at $n \bmod 3$. (Thus, $\mu_1 = \delta_1$, $\mu_2 = \delta_2$, $\mu_3 = \delta_0$, $\mu_4 = \delta_1$, $\mu_5 = \delta_2$, $\mu_6 = \delta_0$, etc.)
(a) Is $\{\mu_n\}$ tight?
(b) Does there exist a Borel probability measure $\mu$, such that $\mu_n \Rightarrow \mu$? (If so, then specify $\mu$.)
(c) Does there exist a subsequence $\{\mu_{n_k}\}$, and a Borel probability measure $\mu$, such that $\mu_{n_k} \Rightarrow \mu$? (If so, then specify $\{n_k\}$ and $\mu$.)
(d) Relate parts (b) and (c) to theorems from this section.

Exercise 11.5.3. Let $\{x_n\}$ be any sequence of points in the interval $[0,1]$. Let $\mu_n = \delta_{x_n}$ be a point mass at $x_n$.
(a) Is $\{\mu_n\}$ tight?
(b) Does there exist a subsequence $\{\mu_{n_k}\}$, and a Borel probability measure $\mu$, such that $\mu_{n_k} \Rightarrow \mu$? [Hint: by compactness, there must be a subsequence of points $\{x_{n_k}\}$ which converges, say to $y \in [0,1]$. Then what does $\mu_{n_k}$ converge to?]

Exercise 11.5.4. Let $\mu_{2n} = \delta_0$, and let $\mu_{2n+1} = \delta_n$, for $n = 0, 1, 2, \ldots$.
(a) Does there exist a Borel probability measure $\mu$, such that $\mu_n \Rightarrow \mu$?
(b) Suppose for some subsequence $\{\mu_{n_k}\}$ and some Borel probability measure $\nu$, we have $\mu_{n_k} \Rightarrow \nu$. What must $\nu$ be?
(c) Relate parts (a) and (b) to Corollary 11.1.11. Why is there no contradiction?

Exercise 11.5.5. Let $\mu_n = \text{Uniform}[0, n]$, so $\mu_n([a,b]) = (b-a)/n$ for $0 \leq a \leq b \leq n$.
(a) Prove or disprove that $\{\mu_n\}$ is tight.
(b) Prove or disprove that there is some probability measure $\mu$ such that $\mu_n \Rightarrow \mu$.

Exercise 11.5.6. Suppose $\mu_n \Rightarrow \mu$. Prove or disprove that $\{\mu_n\}$ must be tight.

Exercise 11.5.7. Define the Borel probability measure $\mu_n$ by $\mu_n(\{x\}) = 1/n$, for $x = 0, \frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}$. Let $\lambda$ be Lebesgue measure on $[0,1]$.
(a) Compute $\phi_n(t) = \int e^{itx} \, \mu_n(dx)$, the characteristic function of $\mu_n$.
(b) Compute $\phi(t) = \int e^{itx} \, \lambda(dx)$, the characteristic function of $\lambda$.
(c) Does $\phi_n(t) \to \phi(t)$, for each $t \in \mathbf{R}$?
(d) What does the result in part (c) imply?

Exercise 11.5.8. Use characteristic functions to provide an alternative solution of Exercise 10.3.2.

Exercise 11.5.9. Use characteristic functions to provide an alternative solution of Exercise 10.3.3.
Exercise 11.5.10. Use characteristic functions to provide an alternative solution of Exercise 10.3.4.

Exercise 11.5.11. Compute the characteristic function $\phi_X(t)$, and also $\phi_X'(0) = i\,E(X)$, where $X$ follows
(a) the binomial distribution: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, for $k = 0, 1, 2, \ldots, n$.
(b) the Poisson($\lambda$) distribution: $P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}$, for $k = 0, 1, 2, \ldots$.
(c) the exponential distribution, with density with respect to Lebesgue measure given by $f_X(x) = \lambda e^{-\lambda x}$ for $x \geq 0$, and $f_X(x) = 0$ for $x < 0$.

Exercise 11.5.12. Suppose that for $n \in \mathbf{N}$, we have $P[X_n = 5] = 1/n$ and $P[X_n = 6] = 1 - (1/n)$.
(a) Compute the characteristic function $\phi_{X_n}(t)$, for all $n \in \mathbf{N}$ and $t \in \mathbf{R}$.
(b) Compute $\lim_{n \to \infty} \phi_{X_n}(t)$.
(c) Specify a distribution $\mu$ such that $\lim_{n \to \infty} \phi_{X_n}(t) = \int e^{itx} \, \mu(dx)$ for all $t \in \mathbf{R}$.
(d) Determine (with explanation) whether or not $\mathcal{L}(X_n) \Rightarrow \mu$.

Exercise 11.5.13. Let $\{X_n\}$ be i.i.d., each having mean $3$ and variance $4$. Let $S = X_1 + X_2 + \ldots + X_{10{,}000}$. In terms of $\Phi(x)$, give an approximate value for $P[S \leq 30{,}500]$.

Exercise 11.5.14. Let $X_1, X_2, \ldots$ be i.i.d. with mean $4$ and variance $9$. Find values $C(n, x)$, for $n \in \mathbf{N}$ and $x \in \mathbf{R}$, such that as $n \to \infty$, $P[X_1 + X_2 + \ldots + X_n \leq C(n, x)] \approx \Phi(x)$.

Exercise 11.5.15. Prove that the Poisson($\lambda$) distribution, and the $N(m, v)$ (normal) distribution, are both infinitely divisible (for any $\lambda > 0$, $m \in \mathbf{R}$, and $v > 0$). [Hint: Use Exercises 9.5.15 and 9.5.16.]

Exercise 11.5.16. Let $X$ be a random variable whose distribution $\mathcal{L}(X)$ is infinitely divisible. Let $a > 0$ and $b \in \mathbf{R}$, and set $Y = aX + b$. Prove that $\mathcal{L}(Y)$ is infinitely divisible.

Exercise 11.5.17. Prove that the Poisson($\lambda$) distribution, the $N(m, v)$ distribution, and the Exp($\lambda$) (exponential) distribution, are all determined by their moments, for any $\lambda > 0$, $m \in \mathbf{R}$, and $v > 0$.

Exercise 11.5.18. Let $X, X_1, X_2, \ldots$ be random variables which are uniformly bounded, i.e. there is $M \in \mathbf{R}$ with $|X| \leq M$ and $|X_n| \leq M$ for all $n$. Prove that $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$ if and only if $E(X_n^k) \to E(X^k)$ for all $k \in \mathbf{N}$.
11.6. Section summary.

This section introduced the characteristic function $\phi_X(t) = E(e^{itX})$. After establishing its basic properties, it proved an Inversion Theorem (to recover the distribution of a random variable from its characteristic function) and a Uniqueness Theorem (which shows that if two random variables have the same characteristic function, then they have the same distribution). Then, using the Helly Selection Principle and the notion of tightness, it proved the important Continuity Theorem, which asserts the equivalence of weak convergence of distributions and pointwise convergence of characteristic functions. This important theorem was used to prove the Central Limit Theorem about weak convergence of averages of i.i.d. random variables to a normal distribution. Some generalisations of the Central Limit Theorem were briefly discussed. The section ended with a discussion of the method of moments, an alternative way of proving weak convergence of random variables using only their moments, but not their characteristic functions.
12. Decomposition of probability laws.

Let $\mu$ be a Borel probability measure on $\mathbf{R}$. Recall that $\mu$ is discrete if it puts all its mass at individual points, i.e. if $\sum_{x \in \mathbf{R}} \mu\{x\} = \mu(\mathbf{R})$. Also, $\mu$ is absolutely continuous if there is a non-negative Borel-measurable function $f$ such that $\mu(A) = \int_A f(x) \, \lambda(dx) = \int \mathbf{1}_A(x) f(x) \, \lambda(dx)$ for all Borel sets $A$, where $\lambda$ is Lebesgue measure on $\mathbf{R}$.

One could ask, are these the only possible types of probability distributions? Of course the answer to this question is no, as $\mu$ could be a mixture of a discrete and an absolutely continuous distribution, e.g. $\mu = \frac{1}{2}\delta_0 + \frac{1}{2}N(0,1)$. But can every probability distribution at least be written as such a mixture? Equivalently, is it true that every distribution $\mu$ with no discrete component (i.e. which satisfies $\mu\{x\} = 0$ for each $x \in \mathbf{R}$) must necessarily be absolutely continuous?

To examine this question, say that $\mu$ is dominated by $\lambda$ (written $\mu \ll \lambda$) if $\mu(A) = 0$ whenever $\lambda(A) = 0$. Then clearly any absolutely continuous measure $\mu$ must be dominated by $\lambda$, since whenever $\lambda(A) = 0$ we would then have $\mu(A) = \int \mathbf{1}_A(x) f(x) \, \lambda(dx) = 0$. (We shall see in Corollary 12.1.2 below that in fact the converse to this statement also holds.)

On the other hand, suppose $Z_1, Z_2, \ldots$ are i.i.d., taking the value $1$ with probability $2/3$ and the value $0$ with probability $1/3$. Set $Y = \sum_{n=1}^{\infty} Z_n 2^{-n}$ (i.e., the base-2 expansion of $Y$ is $0.Z_1 Z_2 \ldots$). Further, define $S \subseteq \mathbf{R}$ by
$$S = \Big\{ x \in [0,1];\; \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} d_i(x) = \frac{2}{3} \Big\},$$
where $d_i(x)$ is the $i^{\text{th}}$ digit in the (non-terminating) base-2 expansion of $x$. Then by the strong law of large numbers, we have $P(Y \in S) = 1$, while $\lambda(S) = 0$ (indeed, with respect to $\lambda$, the digits $d_i$ are i.i.d., equal to $0$ or $1$ with probability $1/2$ each, so by the strong law again the limiting digit frequency is $1/2$ for $\lambda$-almost every $x$). Hence, from the previous paragraph, the law of $Y$ cannot be absolutely continuous. But clearly $\mathcal{L}(Y)$ has no discrete component. We conclude that $\mathcal{L}(Y)$ cannot be written as a mixture of a discrete and an absolutely continuous distribution. In fact, $\mathcal{L}(Y)$ is singular with respect to $\lambda$ (written $\mathcal{L}(Y) \perp \lambda$), meaning that there is a subset $S \subseteq \mathbf{R}$ with $\lambda(S) = 0$ and $P(Y \in S^c) = 0$.
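This singular law is easy to simulate. The sketch below (ours; 50 binary digits and an arbitrary seed) samples $Y$ and confirms the behaviour just described: the digit frequencies concentrate near $2/3$ (not $1/2$), no single value of $Y$ carries noticeable mass, and yet, e.g., $P(Y \leq 1/2) = P(Z_1 = 0) = 1/3$ rather than the Lebesgue value $1/2$:

```python
import numpy as np

# Y = sum_n Z_n 2^{-n} with P(Z_n = 1) = 2/3, truncated to 50 digits.
rng = np.random.default_rng(4)
n_digits, n_samples = 50, 10**5
Z = (rng.random((n_samples, n_digits)) < 2/3).astype(float)
Y = Z @ (0.5 ** np.arange(1, n_digits + 1))
print("average digit frequency ≈", Z.mean())                 # near 2/3
print("P(Y <= 1/2)             ≈", np.mean(Y <= 0.5))         # near 1/3
counts = np.unique(Y, return_counts=True)[1]
print("largest point mass      ≈", counts.max() / n_samples)  # near 0: no atoms
```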
12.1. Lebesgue and Hahn decompositions.

The main result of this section is the following.

Theorem 12.1.1. (Lebesgue Decomposition) Any probability measure $\mu$ on $\mathbf{R}$ can uniquely be decomposed as $\mu = \mu_{disc} + \mu_{ac} + \mu_{sing}$, where the measures $\mu_{disc}$, $\mu_{ac}$, and $\mu_{sing}$ satisfy:
(a) $\mu_{disc}$ is discrete, i.e. $\sum_{x \in \mathbf{R}} \mu_{disc}\{x\} = \mu_{disc}(\mathbf{R})$;
(b) $\mu_{ac}$ is absolutely continuous, i.e. $\mu_{ac}(A) = \int_A f \, d\lambda$ for all Borel sets $A$, for some non-negative, Borel-measurable function $f$, where $\lambda$ is Lebesgue measure on $\mathbf{R}$;
(c) $\mu_{sing}$ is singular continuous, i.e. $\mu_{sing}\{x\} = 0$ for all $x \in \mathbf{R}$, but there is $S \subseteq \mathbf{R}$ with $\lambda(S) = 0$ and $\mu_{sing}(S^c) = 0$.

Theorem 12.1.1 will be proved below. We first present an important corollary.

Corollary 12.1.2. (Radon-Nikodym Theorem) A Borel probability measure $\mu$ is absolutely continuous (i.e. there is $f$ with $\mu(A) = \int_A f \, d\lambda$ for all Borel $A$) if and only if it is dominated by $\lambda$ (i.e. $\mu \ll \lambda$, i.e. $\mu(A) = 0$ whenever $\lambda(A) = 0$).

Proof. We have already seen that if $\mu$ is absolutely continuous then $\mu \ll \lambda$. For the converse, suppose $\mu \ll \lambda$, and let $\mu = \mu_{disc} + \mu_{ac} + \mu_{sing}$ as in Theorem 12.1.1. Since $\lambda\{x\} = 0$ for each $x$, we must have $\mu\{x\} = 0$ as well, so that $\mu_{disc} = 0$. Similarly, if $S$ is such that $\lambda(S) = 0$ and $\mu_{sing}(S^c) = 0$, then we must have $\mu(S) = 0$, so that $\mu_{sing}(S) = 0$, so that $\mu_{sing} = 0$. Hence, $\mu = \mu_{ac}$, i.e. $\mu$ is absolutely continuous. ■

Remark 12.1.3. If $\mu(A) = \int_A f \, d\lambda$ for all Borel $A$, we write $\frac{d\mu}{d\lambda} = f$, and call $f$ the density, or Radon-Nikodym derivative, of $\mu$ with respect to $\lambda$.

It remains to prove Theorem 12.1.1. We begin with a lemma.

Lemma 12.1.4. (Hahn Decomposition) Let $\phi$ be a finite "signed measure" on $(\Omega, \mathcal{F})$, i.e. $\phi = \mu - \nu$ for some finite measures $\mu$ and $\nu$. Then there is a partition $\Omega = A^+ \,\dot\cup\, A^-$, with $A^+, A^- \in \mathcal{F}$, such that $\phi(E) \geq 0$ for all measurable $E \subseteq A^+$, and $\phi(E) \leq 0$ for all measurable $E \subseteq A^-$.

Proof. Following Billingsley (1995, Theorem 32.1), we set $\alpha = \sup\{\phi(A);\; A \in \mathcal{F}\}$. We shall construct a subset $A^+$ such that $\phi(A^+) = \alpha$. Once we have done this, we can then take $A^- = \Omega \setminus A^+$. Then if $E \subseteq A^+$ but $\phi(E) < 0$, then $\phi(A^+ \setminus E) = \phi(A^+) - \phi(E) > \phi(A^+) = \alpha$, which contradicts the definition of $\alpha$. Similarly, if $E \subseteq A^-$ but $\phi(E) > 0$, then $\phi(A^+ \cup E) = \phi(A^+) + \phi(E) > \alpha$, again contradicting the definition of $\alpha$. Thus, to complete the proof, it suffices to construct $A^+$ with $\phi(A^+) = \alpha$.
To that end, choose (by the definition of $\alpha$) subsets $A_1, A_2, \ldots \in \mathcal{F}$ such that $\phi(A_n) \to \alpha$. Let $A = \bigcup_n A_n$, and let
$$\mathcal{G}_n = \Big\{ \bigcap_{k=1}^n A_k' \;:\; \text{each } A_k' = A_k \text{ or } A \setminus A_k \Big\}$$
(so that $\mathcal{G}_n$ contains at most $2^n$ different subsets, which are all disjoint). Then let
$$C_n = \bigcup_{\substack{S \in \mathcal{G}_n \\ \phi(S) \geq 0}} S,$$
i.e. $C_n$ is the union of those elements of $\mathcal{G}_n$ with non-negative measure under $\phi$. Finally, set $A^+ = \limsup_n C_n$. We claim that $\phi(A^+) = \alpha$.

First, note that since $A_n$ is a union of certain particular elements of $\mathcal{G}_n$ (namely, all those formed with $A_n' = A_n$), and $C_n$ is a union of all the $\phi$-non-negative elements of $\mathcal{G}_n$, it follows that $\phi(C_n) \geq \phi(A_n)$. Next, note that $C_m \cup \ldots \cup C_n$ is formed from $C_m \cup \ldots \cup C_{n-1}$ by including some additional $\phi$-non-negative elements of $\mathcal{G}_n$, so $\phi(C_m \cup C_{m+1} \cup \ldots \cup C_n) \geq \phi(C_m \cup C_{m+1} \cup \ldots \cup C_{n-1})$. It follows by induction that $\phi(C_m \cup \ldots \cup C_n) \geq \phi(C_m) \geq \phi(A_m)$. Since this holds for all $n$, we must have (by continuity of probabilities, applied to $\mu$ and $\nu$ separately) that $\phi(C_m \cup C_{m+1} \cup \ldots) \geq \phi(A_m)$ for all $m$. But then (again by continuity of probabilities, applied to $\mu$ and $\nu$ separately) we have
$$\phi(A^+) = \phi(\limsup_n C_n) = \lim_{m \to \infty} \phi(C_m \cup C_{m+1} \cup \ldots) \geq \lim_{m \to \infty} \phi(A_m) = \alpha.$$
Hence, $\phi(A^+) = \alpha$, as claimed. ■

Remarks.
1. The Hahn decomposition is unique up to sets of $\phi$-measure $0$, i.e. if $A^+ \cup A^-$ and $B^+ \cup B^-$ are two Hahn decompositions, then $\phi(A^+ \setminus B^+) = \phi(B^+ \setminus A^+) = \phi(A^- \setminus B^-) = \phi(B^- \setminus A^-) = 0$, since e.g. $A^+ \setminus B^+ = A^+ \cap B^-$ must have $\phi$-measure both $\geq 0$ and $\leq 0$.
2. Using the Radon-Nikodym theorem, it is very easy to prove the Hahn decomposition; indeed, we can let $\psi = \mu + \nu$, let $f = \frac{d\mu}{d\psi}$ and $g = \frac{d\nu}{d\psi}$, and set $A^+ = \{f \geq g\}$. However, this reasoning would be circular, since we are going to use the Hahn decomposition to prove the Lebesgue decomposition, which in turn proves the Radon-Nikodym theorem!
3. If $\phi$ is any countably additive mapping from $\mathcal{F}$ to $\mathbf{R}$ (not necessarily non-negative, nor assumed to be of the form $\mu - \nu$), then continuity of probabilities follows just as in Proposition 3.3.1, and the proof of Lemma 12.1.4 then goes through without change.
Furthermore, it then follows that $\phi = \mu - \nu$, where $\mu(E) = \phi(E \cap A^+)$ and $\nu(E) = -\phi(E \cap A^-)$; i.e., every such function is in fact a signed measure.

We are now able to prove our main theorem.

Proof of Theorem 12.1.1. We first take care of $\mu_{disc}$: indeed, we simply define $\mu_{disc}(A) = \sum_{x \in A} \mu\{x\}$, and then $\mu - \mu_{disc}$ has no discrete component. Hence, we assume from now on that $\mu$ has no discrete component. Second, we note that by countable additivity, it suffices to assume that $\mu$ is supported on $[0,1]$, so that we may also take $\lambda$ to be Lebesgue measure on $[0,1]$.

To continue, call a function $g$ a candidate density if $g \geq 0$ and $\int_E g \, d\lambda \leq \mu(E)$ for all Borel sets $E$. We note that if $g_1$ and $g_2$ are candidate densities, then so is $\max(g_1, g_2)$, since
$$\int_E \max(g_1, g_2) \, d\lambda = \int_{E \cap \{g_1 \geq g_2\}} g_1 \, d\lambda + \int_{E \cap \{g_1 < g_2\}} g_2 \, d\lambda \leq \mu(E \cap \{g_1 \geq g_2\}) + \mu(E \cap \{g_1 < g_2\}) = \mu(E).$$
Also, by the monotone convergence theorem, if $h_1, h_2, \ldots$ are candidate densities and $h_n \nearrow h$, then $h$ is also a candidate density. It follows from these two observations that if $g_1, g_2, \ldots$ are candidate densities, then so is $\sup_n g_n = \lim_{n \to \infty} \max(g_1, \ldots, g_n)$.

Now, let $\beta = \sup\{\int_{[0,1]} g \, d\lambda;\; g \text{ a candidate density}\}$. Choose candidate densities $g_n$ with $\int_{[0,1]} g_n \, d\lambda \geq \beta - \frac{1}{n}$, and let $f = \sup_{n \geq 1} g_n$, to obtain that $f$ is a candidate density with $\int_{[0,1]} f \, d\lambda = \beta$, i.e. $f$ is (up to a set of measure $0$) the largest possible candidate density. This $f$ shall be our density for $\mu_{ac}$. That is, we define $\mu_{ac}(A) = \int_A f \, d\lambda$. We then (of course) define $\mu_{sing}(A) = \mu(A) - \mu_{ac}(A)$. Since $f$ is a candidate density, $\mu_{sing}(A) \geq 0$.

To complete the existence proof, it suffices to show that $\mu_{sing}$ is singular. For each $n \in \mathbf{N}$, let $[0,1] = A_n^+ \cup A_n^-$ be a Hahn decomposition (cf. Lemma 12.1.4) for the signed measure $\phi_n = \mu_{sing} - \frac{1}{n}\lambda$. Set $M = \bigcup_n A_n^+$. Then $M^c = \bigcap_n A_n^-$, so that $M^c \subseteq A_n^-$ for each $n$. It follows that $(\mu_{sing} - \frac{1}{n}\lambda)(M^c) \leq 0$ for all $n$, so that $\mu_{sing}(M^c) \leq \frac{1}{n}\lambda(M^c)$ for all $n$. Hence, $\mu_{sing}(M^c) = 0$.

We claim that $\lambda(M) = 0$, so that $\mu_{sing}$ is indeed singular. To prove this, we assume that $\lambda(M) > 0$, and derive a contradiction. If $\lambda(M) > 0$, then there is $n \in \mathbf{N}$ with $\lambda(A_n^+) > 0$. For this $n$, we have $(\mu_{sing} - \frac{1}{n}\lambda)(E) \geq 0$, i.e. $\mu_{sing}(E) \geq \frac{1}{n}\lambda(E)$, for all measurable $E \subseteq A_n^+$. We now claim that the function $g = f + \frac{1}{n}\mathbf{1}_{A_n^+}$ is a candidate density.
Indeed, we compute for any Borel set $E$ that
$$\int_E g \, d\lambda = \int_E f \, d\lambda + \tfrac{1}{n}\lambda(A_n^+ \cap E) \leq \mu_{ac}(E) + \mu_{sing}(A_n^+ \cap E) \leq \mu_{ac}(E) + \mu_{sing}(E) = \mu(E),$$
thus verifying the claim. On the other hand, we have
$$\int_{[0,1]} g \, d\lambda = \int_{[0,1]} f \, d\lambda + \frac{1}{n}\int_{[0,1]} \mathbf{1}_{A_n^+} \, d\lambda = \beta + \frac{1}{n}\lambda(A_n^+) > \beta,$$
which contradicts the maximality of $f$. Hence, we must actually have had $\lambda(M) = 0$, showing that $\mu_{sing}$ must indeed be singular, thus proving the existence part of the theorem.

Finally, we prove the uniqueness. Indeed, suppose $\mu = \mu_{ac} + \mu_{sing} = \nu_{ac} + \nu_{sing}$, with $\mu_{ac}(A) = \int_A f \, d\lambda$ and $\nu_{ac}(A) = \int_A g \, d\lambda$. Since $\mu_{sing}$ and $\nu_{sing}$ are singular, we can find $S_1$ and $S_2$ with $\lambda(S_1) = \lambda(S_2) = 0$ and $\mu_{sing}(S_1^c) = \nu_{sing}(S_2^c) = 0$. Let $S = S_1 \cup S_2$, and let $B = \{\omega \in S^c;\; f(\omega) < g(\omega)\}$. Then $g - f > 0$ on $B$; but since $B \subseteq S^c$, we have $\mu_{sing}(B) = \nu_{sing}(B) = 0$, whence
$$\int_B (g - f) \, d\lambda = \nu_{ac}(B) - \mu_{ac}(B) = \mu(B) - \mu(B) = 0.$$
Hence, $\lambda(B) = 0$. But we also have $\lambda(S) = 0$, hence $\lambda\{f < g\} = 0$. Similarly, $\lambda\{f > g\} = 0$. We conclude that $\lambda\{f = g\} = 1$, whence $\mu_{ac} = \nu_{ac}$, whence $\mu_{sing} = \nu_{sing}$. ■

Remark. Note that, while $\mu_{ac}$ is unique, the density $f = \frac{d\mu_{ac}}{d\lambda}$ is only unique up to a set of measure $0$.

12.2. Decomposition with general measures.

Finally, we note that similar decompositions may be made with respect to other measures. Instead of considering absolute continuity or singularity with respect to $\lambda$, we can consider them with respect to any other probability measure $\nu$, and the same proofs apply. Furthermore, by countable additivity, the above proofs go through virtually unchanged for $\sigma$-finite $\nu$ as well (cf. Remark 4.4.3). We state the more general results as follows. (For the general Radon-Nikodym theorem, we write $\mu \ll \nu$, and say that $\mu$ is dominated by $\nu$, if $\mu(A) = 0$ whenever $\nu(A) = 0$.)

Theorem 12.2.1. (Lebesgue Decomposition, general case) If $\mu$ and $\nu$ are two $\sigma$-finite measures on some measurable space $(\Omega, \mathcal{F})$, then $\mu$ can uniquely be decomposed as $\mu = \mu_{ac} + \mu_{sing}$, where $\mu_{ac}$ is absolutely continuous with respect to $\nu$ (i.e., there is a non-negative measurable function $f: \Omega \to \mathbf{R}$ such that $\mu_{ac}(A) = \int_A f \, d\nu$ for all $A \in \mathcal{F}$), and $\mu_{sing}$ is singular (i.e., there is $S \subseteq \Omega$ with $\nu(S) = 0$ and $\mu_{sing}(S^c) = 0$).
Remark. In the above statement, for simplicity we avoid mentioning the discrete part $\mu_{disc}$, thus allowing for the possibility that $\nu$ may itself have some discrete component (in which case the corresponding discrete component of $\mu$ would still be absolutely continuous with respect to $\nu$). However, if $\nu$ has no discrete component, then if we wish we can extract $\mu_{disc}$ from $\mu_{sing}$ as in Theorem 12.1.1.

Corollary 12.2.2. (Radon-Nikodym Theorem, general case) If $\mu$ and $\nu$ are two $\sigma$-finite measures on some measurable space $(\Omega, \mathcal{F})$, then $\mu$ is absolutely continuous with respect to $\nu$ if and only if $\mu \ll \nu$.

If $\mu \ll \nu$, so $\mu(A) = \int_A f \, d\nu$ for all $A \in \mathcal{F}$, then we write $\frac{d\mu}{d\nu} = f$, and call $\frac{d\mu}{d\nu}$ the Radon-Nikodym derivative of $\mu$ with respect to $\nu$. Thus, $\mu(A) = \int_A \big( \frac{d\mu}{d\nu} \big) \, d\nu$. If $\nu$ is a probability measure, then we can also write this as $\mu(A) = E_\nu\big[ \frac{d\mu}{d\nu} \, \mathbf{1}_A \big]$, where $E_\nu$ stands for expectation with respect to $\nu$.
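For a concrete feel for Radon-Nikodym derivatives, consider two discrete measures that dominate each other; then $\frac{d\mu}{d\nu}(x) = \mu\{x\}/\nu\{x\}$. The following sketch (ours; the Poisson example and the truncation at 50 are arbitrary choices) verifies $\mu(A) = E_\nu\big[ \frac{d\mu}{d\nu} \mathbf{1}_A \big]$ numerically:

```python
import numpy as np
from scipy import stats

# mu = Poisson(3), nu = Poisson(5): both give positive mass to every
# non-negative integer, so mu << nu, with dmu/dnu(x) = mu{x}/nu{x}.
x = np.arange(50)                  # truncation; the omitted tail is negligible
mu = stats.poisson.pmf(x, 3)
nu = stats.poisson.pmf(x, 5)
dmu_dnu = mu / nu
A = x < 5                          # the event A = {0, 1, 2, 3, 4}
print("mu(A)               =", mu[A].sum())
print("E_nu[(dmu/dnu) 1_A] =", (dmu_dnu * A * nu).sum())
```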
12.3. Exercises.

Exercise 12.3.1. Prove that $\mu$ is discrete if and only if there is a countable set $S$ with $\mu(S^c) = 0$.

Exercise 12.3.2. Let $X$ and $Y$ be discrete random variables (not necessarily independent), and let $Z = X + Y$. Prove that $\mathcal{L}(Z)$ is discrete.

Exercise 12.3.3. Let $X$ be a random variable, and let $Y = cX$ for some constant $c > 0$.
(a) Prove or disprove that if $\mathcal{L}(X)$ is discrete, then $\mathcal{L}(Y)$ must be discrete.
(b) Prove or disprove that if $\mathcal{L}(X)$ is absolutely continuous, then $\mathcal{L}(Y)$ must be absolutely continuous.
(c) Prove or disprove that if $\mathcal{L}(X)$ is singular continuous, then $\mathcal{L}(Y)$ must be singular continuous.

Exercise 12.3.4. Let $X$ and $Y$ be random variables, with $\mathcal{L}(Y)$ absolutely continuous, and let $Z = X + Y$.
(a) Assume $X$ and $Y$ are independent. Prove that $\mathcal{L}(Z)$ is absolutely continuous, regardless of the nature of $\mathcal{L}(X)$. [Hint: Recall the convolution formula of Subsection 9.4.]
(b) Show that if $X$ and $Y$ are not independent, then $\mathcal{L}(Z)$ may fail to be absolutely continuous.

Exercise 12.3.5. Let $X$ and $Y$ be random variables, with $\mathcal{L}(X)$ discrete and $\mathcal{L}(Y)$ singular continuous. Let $Z = X + Y$. Prove that $\mathcal{L}(Z)$ is singular continuous. [Hint: If $\lambda(S) = P(Y \in S^c) = 0$, consider the set $U = \{s + x : s \in S,\; P(X = x) > 0\}$.]

Exercise 12.3.6. Let $A, B, Z_1, Z_2, \ldots$ be i.i.d., each equal to $+1$ with probability $2/3$, or equal to $0$ with probability $1/3$. Let $Y = \sum_{i=1}^{\infty} Z_i 2^{-i}$ as at the beginning of this section (so $\nu = \mathcal{L}(Y)$ is singular continuous), and let $W \sim N(0,1)$. Finally, let $X = A(BY + (1-B)W)$, and set $\mu = \mathcal{L}(X)$. Find a discrete measure $\mu_{disc}$, an absolutely continuous measure $\mu_{ac}$, and a singular continuous measure $\mu_s$, such that $\mu = \mu_{disc} + \mu_{ac} + \mu_s$.

Exercise 12.3.7. Let $\mu$, $\nu$, and $\rho$ be probability measures with $\mu \ll \nu \ll \rho$. Prove that $\frac{d\mu}{d\rho} = \frac{d\mu}{d\nu} \frac{d\nu}{d\rho}$ with $\rho$-probability $1$. [Hint: Use Proposition 6.2.3.]

Exercise 12.3.8. Let $\mu$ and $\nu$ be probability measures with $\mu \ll \nu$ and $\nu \ll \mu$. (This is sometimes written as $\mu \equiv \nu$.) Prove that $\frac{d\mu}{d\nu} > 0$ with $\mu$-probability $1$, and in fact $\frac{d\nu}{d\mu} = 1 \big/ \frac{d\mu}{d\nu}$.

Exercise 12.3.9. Let $\mu$ and $\nu$ be discrete probability measures, with $\phi = \mu - \nu$. Write down an explicit Hahn decomposition $\Omega = A^+ \cup A^-$ for $\phi$.

Exercise 12.3.10. Let $\mu$ and $\nu$ be absolutely continuous probability measures on $\mathbf{R}$ (with the Borel $\sigma$-algebra), with $\phi = \mu - \nu$. Write down an explicit Hahn decomposition $\mathbf{R} = A^+ \cup A^-$ for $\phi$.

12.4. Section summary.

This section proved the Lebesgue Decomposition theorem, which says that every probability measure may be uniquely written as the sum of a discrete measure, an absolutely continuous measure, and a singular continuous measure. From this followed the Radon-Nikodym Theorem, which gives a simple condition under which a measure is absolutely continuous.
13. Conditional probability and expectation.

Conditioning is a very important concept in probability, and we consider it here. Of course, conditioning on events of positive probability is quite straightforward. We have already noted that if $A$ and $B$ are events, with $P(B) > 0$, then we can define the conditional probability $P(A \mid B) = P(A \cap B)/P(B)$; intuitively, this represents the probabilistic proportion of the event $B$ which also includes the event $A$.

More generally, if $Y$ is a random variable, and if we define $\nu$ by $\nu(S) = P(Y \in S \mid B) = P(Y \in S,\, B)/P(B)$, then $\nu = \mathcal{L}(Y \mid B)$ is a probability measure, called the conditional distribution of $Y$ given $B$. We can then define conditional expectation by $E(Y \mid B) = \int y \, \nu(dy)$. Also, $\mathcal{L}(Y \mathbf{1}_B) = P(B)\,\mathcal{L}(Y \mid B) + P(B^c)\,\delta_0$, so taking expectations and re-arranging,
$$E(Y \mid B) = E(Y \mathbf{1}_B)/P(B). \qquad (13.0.1)$$
No serious difficulties arise.

On the other hand, if $P(B) = 0$ then this approach does not work at all. Indeed, it is quite unclear how to define something like $P(Y \in S \mid B)$ in that case. Unfortunately, it frequently arises that we wish to condition on events of probability $0$.

13.1. Conditioning on a random variable.

We begin with an example.

Example 13.1.1. Let $(X, Y)$ be uniformly distributed on the triangle $T = \{(x, y) \in \mathbf{R}^2;\; 0 \leq y \leq 2,\; y \leq x \leq 2\}$; see Figure 13.1.2. (That is, $P((X, Y) \in S) = \frac{1}{2}\lambda_2(S \cap T)$ for Borel $S \subseteq \mathbf{R}^2$, where $\lambda_2$ is Lebesgue measure on $\mathbf{R}^2$; briefly, $dP = \frac{1}{2}\mathbf{1}_T \, dx \, dy$.) Then what is $P(Y > \frac{3}{4} \mid X = 1)$? What is $E(Y \mid X = 1)$? Since $P(X = 1) = 0$, it is not clear how to proceed. We shall return to this example below.

Because of this problem, we take a different approach. Given a random variable $X$, we shall consider conditional probabilities like $P(A \mid X)$, and also conditional expected values like $E(Y \mid X)$, to themselves be random variables. We shall think of them as functions of the "random" value $X$. This is very counter-intuitive: we are used to thinking of $P(\cdots)$ and $E(\cdots)$ as numbers, not random variables. However, we shall think of them as random variables, and we shall see that this allows us to partially resolve the difficulty of conditioning on sets of measure $0$ (such as $\{X = 1\}$ above).
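Although $P(X = 1) = 0$, we can still explore this example by simulation, conditioning on the small event $\{|X - 1| < d\}$ and letting $d$ shrink. The sketch below (ours; seed and sample size arbitrary) suggests the answer $E(Y \mid X = 1) = \frac{1}{2}$, consistent with the formula $E(Y \mid X) = X/2$ verified later in this example:

```python
import numpy as np

# (X, Y) uniform on the triangle T: X has density x/2 on [0, 2], and given
# X = x, Y is uniform on [0, x].
rng = np.random.default_rng(5)
N = 10**6
x = 2 * np.sqrt(rng.random(N))
y = x * rng.random(N)
for d in [0.5, 0.1, 0.01]:
    near = np.abs(x - 1) < d
    print(f"d = {d:5.2f}   E[Y | |X - 1| < d] ≈ {y[near].mean():.4f}")
```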
[Figure 13.1.2. The triangle $T$ in Example 13.1.1: the region with vertices $(0,0)$, $(2,0)$, and $(2,2)$.]

The idea is that, once we define these quantities to be random variables, then we can demand that they satisfy certain properties. For starters, we require that
$$E\big[ P(A \mid X) \big] = P(A), \qquad E\big[ E(Y \mid X) \big] = E(Y). \qquad (13.1.3)$$
In words, these random variables must have the correct expected values. Unfortunately, this does not completely specify the distributions of the random variables $P(A \mid X)$ and $E(Y \mid X)$; indeed, there are infinitely many different distributions having the same mean. We shall therefore impose a stronger requirement.

To state it, recall that if $\mathcal{G}$ is a sub-$\sigma$-algebra (i.e. a $\sigma$-algebra contained in the main $\sigma$-algebra $\mathcal{F}$), then a random variable $Z$ is $\mathcal{G}$-measurable if $\{Z \leq z\} \in \mathcal{G}$ for all $z \in \mathbf{R}$. (It follows that also $\{Z = z\} = \{Z \leq z\} \setminus \bigcup_n \{Z \leq z - \frac{1}{n}\} \in \mathcal{G}$.) Also, $\sigma(X) = \{\{X \in B\} : B \subseteq \mathbf{R} \text{ Borel}\}$.

Definition 13.1.4. Given random variables $X$ and $Y$ with $E|Y| < \infty$, and an event $A$, $P(A \mid X)$ is a conditional probability of $A$ given $X$ if it is a $\sigma(X)$-measurable random variable and, for any Borel $S \subseteq \mathbf{R}$, we have
$$E\big( P(A \mid X) \, \mathbf{1}_{X \in S} \big) = P\big( A \cap \{X \in S\} \big). \qquad (13.1.5)$$
Similarly, $E(Y \mid X)$ is a conditional expectation of $Y$ given $X$ if it is a $\sigma(X)$-measurable random variable and, for any Borel $S \subseteq \mathbf{R}$, we have
$$E\big( E(Y \mid X) \, \mathbf{1}_{X \in S} \big) = E\big( Y \, \mathbf{1}_{X \in S} \big). \qquad (13.1.6)$$
That is, we define conditional probabilities and expectations by specifying that certain expected values of them should assume certain values. This prompts some observations.

Remarks.
1. Requiring these quantities to be $\sigma(X)$-measurable means that they are functions of $X$ alone, and do not otherwise depend on the sample point $\omega$. Indeed, if a random variable $Z$ is $\sigma(X)$-measurable, then for each $z \in \mathbf{R}$, $\{Z = z\} = \{X \in B_z\}$ for some Borel $B_z \subseteq \mathbf{R}$. Then $Z = f(X)$, where $f$ is defined by $f(x) = z$ for all $x \in B_z$; i.e., $Z$ is a function of $X$.
2. Of course, if we set $S = \mathbf{R}$ in (13.1.5) and (13.1.6), we obtain the special case (13.1.3).
3. Since expected values are unaffected by changes on a set of measure $0$, we see that conditional probabilities and expectations are only unique up to a set of measure $0$. Thus, if $P(X = x) = 0$ for some particular value of $x$, then we may change $P(A \mid X)$ on the set $\{X = x\}$ without restriction. However, we may only change its value on a set of measure $0$; thus, for "most" values of $X$, the value $P(A \mid X)$ cannot change. In this sense, we have mostly (but not entirely) overcome the difficulty of Example 13.1.1.

These conditional probabilities and expectations always exist:

Proposition 13.1.7. Let $X$ and $Y$ be jointly defined random variables (with $E|Y| < \infty$), and let $A$ be an event. Then $P(A \mid X)$ and $E(Y \mid X)$ exist (though they are only unique up to a set of probability $0$).

Proof. We may define $P(A \mid X)$ to be $\frac{d\nu}{dP_0}$, where $P_0$ and $\nu$ are measures on $\sigma(X)$ defined as follows: $P_0$ is simply $P$ restricted to $\sigma(X)$, and $\nu(E) = P(A \cap E)$ for $E \in \sigma(X)$. Note that $\nu \ll P_0$, so by the general Radon-Nikodym Theorem (Corollary 12.2.2), $\frac{d\nu}{dP_0}$ exists and is unique up to a set of probability $0$. Similarly, we may define $E(Y \mid X)$ to be $\frac{d\rho^+}{dP_0} - \frac{d\rho^-}{dP_0}$, where
$$\rho^+(E) = E\big( Y^+ \, \mathbf{1}_E \big), \qquad \rho^-(E) = E\big( Y^- \, \mathbf{1}_E \big), \qquad E \in \sigma(X).$$
Then (13.1.5) and (13.1.6) are automatically satisfied, since e.g.
$$E\big[ P(A \mid X) \, \mathbf{1}_{X \in S} \big] = E\Big[ \frac{d\nu}{dP_0} \, \mathbf{1}_{X \in S} \Big] = \int_{\{X \in S\}} \frac{d\nu}{dP_0} \, dP_0
= ∫_{X∈S} dν = ν({X ∈ S}) = P(A ∩ {X ∈ S}).  ∎

Example 13.1.1 continued. To better understand Definition 13.1.4, we return to the case where (X, Y) is assumed to be uniformly distributed over the triangle T = {(x, y) ∈ R²; 0 < y < 2, y < x < 2}. We can then define P(Y ≥ 3/4 | X) and E(Y | X) to be what they "ought" to be, namely

P(Y ≥ 3/4 | X) = (X − 3/4)/X for X ≥ 3/4, and 0 for X < 3/4;   E(Y | X) = X/2.

We then compute that, for example,

E(P(Y ≥ 3/4 | X) 1_{X∈S}) = ∫_{S∩[3/4,2]} ((x − 3/4)/x)(x/2) λ(dx) = ∫_{S∩[3/4,2]} (x − 3/4)(1/2) λ(dx)

and

P(Y ≥ 3/4, X ∈ S) = ∫_{S∩[3/4,2]} ∫_{[3/4,x]} (1/2) λ(dy) λ(dx) = ∫_{S∩[3/4,2]} (x − 3/4)(1/2) λ(dx),

thus verifying (13.1.5) for the case A = {Y ≥ 3/4}. Similarly, for S ⊆ [0, 2],

E(E(Y | X) 1_{X∈S}) = ∫_S (x/2)(x/2) λ(dx) = ∫_S (x²/4) λ(dx),

while

E(Y 1_{X∈S}) = ∫_S ∫_0^x y (1/2) λ(dy) λ(dx) = ∫_S (x²/4) λ(dx),

so that the two expressions in (13.1.6) are also equal.

This example was quite specific, but similar ideas can be used more generally; see Exercises 13.4.3 and 13.4.4. In summary, conditional probabilities and expectations always exist, and are unique up to sets of probability 0,
but finding them may require guessing appropriate random variables and then verifying that they satisfy (13.1.5) and (13.1.6).

13.2. Conditioning on a sub-σ-algebra.

So far we have considered conditioning on just a single random variable X. We may think of this, equivalently, as conditioning on the generated σ-algebra, σ(X). More generally, we may wish to condition on something other than just a single random variable X. For example, we may wish to condition on a collection of random variables X₁, X₂, ..., Xₙ, or we may wish to condition on certain other events.

Given any sub-σ-algebra G (in place of σ(X) above), we define the conditional probability P(A | G) and the conditional expectation E(Y | G) to be G-measurable random variables satisfying that

E(P(A | G) 1_G) = P(A ∩ G)   (13.2.1)

and

E(E(Y | G) 1_G) = E(Y 1_G)   (13.2.2)

for any G ∈ G. If G = σ(X), then this definition reduces precisely to the previous one, with P(A | X) being shorthand for P(A | σ(X)), and E(Y | X) being shorthand for E(Y | σ(X)). Similarly, we shall write e.g. E(Y | X₁, X₂) as shorthand for E(Y | σ(X₁, X₂)), where σ(X₁, X₂) is the σ-algebra generated by the sets {X₁ ≤ a} ∩ {X₂ ≤ b} for a, b ∈ R, i.e. the σ-algebra generated by X₁ and X₂. As before, these conditional random variables exist (and are unique up to a set of probability 0) by the Radon–Nikodym Theorem.

Two further examples may help to clarify matters. If G = {∅, Ω} (i.e., G is the trivial σ-algebra), then P(A | G) = P(A) and E(Y | G) = E(Y) are constants. On the other hand, if G = F (i.e., G is the full σ-algebra), or more generally if A ∈ G and X is G-measurable, then P(A | G) = 1_A and E(X | G) = X. Intuitively, if G is small then the conditional values cannot depend too much on the sample point ω, but rather they must represent average values. On the other hand, if G is large then the conditional values can (and must) be very close approximations to the unconditional quantities. In brief, the larger G is, the more random are the conditional values with respect to G. See also Exercise 13.4.1.

Exercise 13.2.3. Let G₁ and G₂ be two sub-σ-algebras.
(a) Prove that if Z is G₁-measurable, and G₁ ⊆ G₂, then Z is also G₂-measurable.
(b) Prove that if Z is G₁-measurable, and also Z is G₂-measurable, then Z is (G₁ ∩ G₂)-measurable.
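(Informal computational aside. The defining property (13.2.2) is easy to verify mechanically on a finite sample space, and doing so may help fix ideas before Exercise 13.4.1, which treats the case G = σ(B) abstractly. The Python/NumPy sketch below uses an arbitrary assumed six-point space and arbitrary weights; it is an illustration, not part of the formal development.)

    import numpy as np

    # A finite probability space Omega = {0,...,5} with assumed weights.
    omega = np.arange(6)
    p = np.array([0.1, 0.2, 0.1, 0.25, 0.15, 0.2])   # sums to 1
    Y = omega.astype(float) ** 2                     # an arbitrary random variable

    B = omega < 2                                    # an event B with 0 < P(B) < 1
    # sigma(B) = {empty set, B, B^c, Omega}:
    G_sets = [np.zeros(6, bool), B, ~B, np.ones(6, bool)]

    # E(Y | sigma(B)) is constant on B and on B^c, equal there to
    # E(Y 1_B)/P(B) and E(Y 1_{B^c})/P(B^c) respectively.
    condY = np.where(B, (p * Y * B).sum() / p[B].sum(),
                        (p * Y * ~B).sum() / p[~B].sum())

    # Check (13.2.2): E( E(Y|G) 1_G ) = E( Y 1_G ) for every G in sigma(B).
    for G in G_sets:
        assert abs((p * condY * G).sum() - (p * Y * G).sum()) < 1e-12
    print("E(Y | sigma(B)) =", condY)

The point of the check is that it is the averages over sets G ∈ G, and nothing finer, that pin down the conditional expectation.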
Exercise 13.2.4. Let S₁ and S₂ be two disjoint events, and let G be a sub-σ-algebra. Show that the following hold with probability 1. [Hint: Use proof by contradiction.]
(a) 0 ≤ P(S₁ | G) ≤ 1.
(b) P(S₁ ∪ S₂ | G) = P(S₁ | G) + P(S₂ | G).

Remark 13.2.5. Exercise 13.2.4 shows that conditional probabilities behave "essentially" like ordinary probabilities. Now, the additivity property (b) is only guaranteed to hold with probability 1, and the exceptional set of probability 0 could perhaps be different for each S₁ and S₂. This suggests that perhaps P(S | G) cannot be defined in a consistent, countably additive way for all S ∈ F. However, it is a fact that if a random variable Y is real-valued (or, more generally, takes values in a Polish space, i.e. a complete separable metric space like R), then there always exist regular conditional distributions, which are versions of P(Y ∈ B | G) defined precisely (not just w.p. 1) for all Borel B in a countably additive way; see e.g. Theorem 10.2.2 of Dudley (1989), or Theorem 33.3 of Billingsley (1995).

We close with two final results about conditional expectations.

Proposition 13.2.6. Let X and Y be random variables, and let G be a sub-σ-algebra. Suppose that E(Y) and E(XY) are finite, and furthermore that X is G-measurable. Then with probability 1,

E(XY | G) = X E(Y | G).

That is, we can "factor" X out of the conditional expectation.

Proof. Clearly X E(Y | G) is G-measurable. Furthermore, if X = 1_{G₀}, with G₀, G ∈ G, then using the definition of E(Y | G), we have

E(X E(Y | G) 1_G) = E(E(Y | G) 1_{G∩G₀}) = E(Y 1_{G∩G₀}) = E(XY 1_G),

so that (13.2.2) holds in this case. But then by the usual linearity and monotone convergence arguments, (13.2.2) holds for general X. It then follows from the definition that X E(Y | G) is indeed a version of E(XY | G).  ∎

For our final result, suppose that G₁ ⊆ G₂. Note that since E(Y | G₁) is G₁-measurable, it is also G₂-measurable, so from the above discussion we have

E(E(Y | G₁) | G₂) = E(Y | G₁).

That is, conditioning first on G₁ and then on G₂ is equivalent to conditioning just on the smaller sub-σ-algebra, G₁. What is perhaps surprising is that we obtain the same result if we condition in the opposite order:
Proposition 13.2.7. Let Y be a random variable with finite mean, and let G₁ ⊆ G₂ be two sub-σ-algebras. Then with probability 1,

E(E(Y | G₂) | G₁) = E(Y | G₁).

That is, conditioning first on G₂ and then on G₁ is equivalent to conditioning just on the smaller sub-σ-algebra G₁.

Proof. By definition, E(E(Y | G₂) | G₁) is G₁-measurable. Hence, to show that it is a version of E(Y | G₁), it suffices to check that, for any G ∈ G₁ ⊆ G₂,

E[E(E(Y | G₂) | G₁) 1_G] = E(Y 1_G).

But using the definitions of E(··· | G₁) and E(··· | G₂), respectively, and recalling that G ∈ G₁ and G ∈ G₂, we have that

E[E(E(Y | G₂) | G₁) 1_G] = E(E(Y | G₂) 1_G) = E(Y 1_G).  ∎

In the special case G₁ = G₂ = G, we obtain that E[E(Y | G) | G] = E[Y | G]. In words, repeating the operation "take conditional expectation with respect to G" multiple times is equivalent to doing it just once. That is, conditional expectation is a projection operator, and can be thought of as "projecting" the random variable Y onto the σ-algebra G.

13.3. Conditional variance.

Given jointly defined random variables X and Y, we can define the conditional variance of Y given X by

Var(Y | X) = E[(Y − E(Y | X))² | X].

Intuitively, Var(Y | X) is a measure of how much uncertainty there is in Y even after we know X. Since X may well provide some information about Y, we might expect that on average Var(Y | X) ≤ Var(Y). That is indeed the case. More precisely:

Theorem 13.3.1. Let Y be a random variable, and G a sub-σ-algebra. If Var(Y) < ∞, then

Var(Y) = E[Var(Y | G)] + Var[E(Y | G)].

Proof. We compute (writing m = E(Y) = E[E(Y | G)], and using (13.1.3) and Exercise 13.2.4) that

Var(Y) = E[(Y − m)²] = E[E[(Y − m)² | G]]
= E[E[(Y − E(Y | G) + E(Y | G) − m)² | G]]
= E[E[(Y − E(Y | G))² | G]] + E[E[(E(Y | G) − m)² | G]] + 2 E[E[(Y − E(Y | G))(E(Y | G) − m) | G]]
= E[Var(Y | G)] + Var[E(Y | G)] + 0,
since using Proposition 13.2.6, we have

E[(Y − E(Y | G))(E(Y | G) − m) | G] = (E(Y | G) − m) E[(Y − E(Y | G)) | G]
= (E(Y | G) − m)[E(Y | G) − E(Y | G)] = 0.  ∎

For example, if G = σ(X), then E[Var(Y | X)] represents the average uncertainty in Y once X is known, while Var[E(Y | X)] represents the uncertainty in Y caused by uncertainty in X. Theorem 13.3.1 asserts that the total variance of Y is given by the sum of these two contributions.

13.4. Exercises.

Exercise 13.4.1. Let A and B be events, with 0 < P(B) < 1. Let G = σ(B) be the σ-algebra generated by B.
(a) Describe G explicitly.
(b) Compute P(A | G) explicitly.
(c) Relate P(A | G) to the earlier notion of P(A | B) = P(A ∩ B) / P(B).
(d) Similarly, for a random variable Y with finite mean, compute E(Y | G), and relate it to the earlier notion of E(Y | B) = E(Y 1_B) / P(B).

Exercise 13.4.2. Let G be a sub-σ-algebra, and let A be any event. Define the random variable X to be the indicator function 1_A. Prove that E(X | G) = P(A | G) with probability 1.

Exercise 13.4.3. Suppose X and Y are discrete random variables. Let q(x, y) = P(X = x, Y = y).
(a) Show that with probability 1, E(Y | X) = Σ_y y q(X, y) / Σ_y q(X, y). [Hint: One approach is to first argue that it suffices in (13.1.6) to consider the case S = {x₀}.]
(b) Compute P(Y = y | X). [Hint: Use Exercise 13.4.2 and part (a).]
(c) Show that E(Y | X) = Σ_y y P(Y = y | X).

Exercise 13.4.4. Let X and Y be random variables with joint distribution given by L(X, Y) = dP = f(x, y) λ₂(dx, dy), where λ₂ is two-dimensional Lebesgue measure, and f : R² → R is a non-negative Borel-measurable function with ∫_{R²} f dλ₂ = 1. (Example 13.1.1 corresponds to the case f(x, y) = ½ 1_T(x, y).) Show that we can take P(Y ∈ B | X) = ∫_B g_X(y) λ(dy) and E(Y | X) = ∫_R y g_X(y) λ(dy), where the function g_x :
R → R is defined by g_x(y) = f(x, y) / ∫_R f(x, t) λ(dt) whenever ∫_R f(x, t) λ(dt) is positive and finite, and otherwise (say) g_x(y) = 0.

Exercise 13.4.5. Let Ω = {1, 2, 3}, and define random variables X and Y by Y(ω) = ω, and X(1) = X(2) = 5 and X(3) = 6. Let Z = E(Y | X).
(a) Describe σ(X) precisely.
(b) Describe (with proof) Z(ω) for each ω ∈ Ω.

Exercise 13.4.6. Let G be a sub-σ-algebra, and let X and Y be two independent random variables. Prove by example that E(X | G) and E(Y | G) need not be independent. [Hint: Don't forget Exercise 3.6.3(a).]

Exercise 13.4.7. Suppose Y is σ(X)-measurable, and also X and Y are independent. Prove that there is C ∈ R with P(Y = C) = 1. [Hint: First prove that P(Y ≤ y) = 0 or 1 for each y ∈ R.]

Exercise 13.4.8. Suppose Y is G-measurable. Prove that Var(Y | G) = 0.

Exercise 13.4.9. Suppose X and Y are independent.
(a) Prove that E(Y | X) = E(Y) w.p. 1.
(b) Prove that Var(Y | X) = Var(Y) w.p. 1.
(c) Explicitly verify Theorem 13.3.1 (with G = σ(X)) in this case.

Exercise 13.4.10. Give an example of jointly defined random variables which are not independent, but such that E(Y | X) = E(Y) w.p. 1.

Exercise 13.4.11. Let X and Y be jointly defined random variables.
(a) Suppose E(Y | X) = E(Y) w.p. 1. Prove that E(XY) = E(X) E(Y).
(b) Give an example where E(XY) = E(X) E(Y), but it is not the case that E(Y | X) = E(Y) w.p. 1.

Exercise 13.4.12. Let {Zₙ} be independent, each with finite mean. Let X₀ = a, and Xₙ = a + Z₁ + ... + Zₙ for n ≥ 1. Prove that E(Xₙ₊₁ | X₀, X₁, ..., Xₙ) = Xₙ + E(Zₙ₊₁).

Exercise 13.4.13. Let X₀, X₁, ... be a Markov chain on a countable state space S, with transition probabilities {p_ij}, and with E|Xₙ| < ∞ for all n. Prove that with probability 1:
(a) E(Xₙ₊₁ | X₀, X₁, ..., Xₙ) = Σ_{j∈S} j p_{Xₙ,j}. [Hint: Don't forget Exercise 13.4.3.]
(b) E(Xₙ₊₁ | Xₙ) = Σ_{j∈S} j p_{Xₙ,j}.
13.5. Section summary.

This section discussed conditioning, with an emphasis on the problems that arise (and their partial resolution) when conditioning on events of probability 0, and more generally on random variables and on sub-σ-algebras. It presented definitions, examples, and various properties of conditional probability and conditional expectation in these cases.
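(Informal computational postscript. The variance decomposition of Theorem 13.3.1 is easy to check numerically for Example 13.1.1, where given X = x the variable Y is uniform on [0, x], so that E(Y | X) = X/2 and Var(Y | X) = X²/12. The Python/NumPy sketch below, with arbitrary sample size and seed, compares the two sides of the decomposition.)

    import numpy as np

    rng = np.random.default_rng(1)

    # (X, Y) uniform on the triangle T of Example 13.1.1, by rejection.
    N = 2_000_000
    x = rng.uniform(0, 2, N)
    y = rng.uniform(0, 2, N)
    keep = y < x
    x, y = x[keep], y[keep]

    # Given X = x, Y ~ Uniform[0, x], so:
    cond_mean = x / 2        # E(Y | X)
    cond_var = x**2 / 12     # Var(Y | X)

    total = np.var(y)
    decomposed = np.mean(cond_var) + np.var(cond_mean)
    print(f"Var(Y)                      ~ {total:.4f}")
    print(f"E[Var(Y|X)] + Var[E(Y|X)]   ~ {decomposed:.4f}")

The two printed values should agree to within Monte Carlo error, illustrating that the total variance splits into the average residual uncertainty plus the variance explained by X.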
14. Martingales.

In this section we study a special kind of stochastic process called a martingale. A stochastic process X₀, X₁, ... is a martingale if E|Xₙ| < ∞ for all n, and with probability 1,

E(Xₙ₊₁ | X₀, X₁, ..., Xₙ) = Xₙ.

Of course, this really means that E(Xₙ₊₁ | Fₙ) = Xₙ, where Fₙ is the σ-algebra σ(X₀, X₁, ..., Xₙ). Intuitively, it says that on average, the value of Xₙ₊₁ is the same as that of Xₙ.

Remark 14.0.1. More generally, we can define a martingale by specifying that E(Xₙ₊₁ | Fₙ) = Xₙ for some choice of increasing sub-σ-fields Fₙ such that Xₙ is measurable with respect to Fₙ. But then σ(X₀, ..., Xₙ) ⊆ Fₙ, so by Proposition 13.2.7,

E(Xₙ₊₁ | X₀, ..., Xₙ) = E[E(Xₙ₊₁ | Fₙ) | X₀, ..., Xₙ] = E[Xₙ | X₀, ..., Xₙ] = Xₙ,

so the above definition is also satisfied, i.e. the two definitions are consistent.

Markov chains often provide good examples of martingales (though there are non-Markovian martingales too, cf. Exercise 14.4.1). Indeed, by Exercise 13.4.13, a Markov chain on a countable state space S, with transition probabilities p_ij, and with E|Xₙ| < ∞, will be a martingale provided that

Σ_{j∈S} j p_ij = i,   i ∈ S.   (14.0.2)

(Intuitively, given that the chain is at state i at time n, on average it will still equal i at time n + 1.) An important specific case is simple symmetric random walk, where S = Z and X₀ = 0 and p_{i,i−1} = p_{i,i+1} = ½ for all i ∈ Z.

We shall also have occasion to consider submartingales and supermartingales. The sequence X₀, X₁, ... is a submartingale if E|Xₙ| < ∞ for all n, and also

E(Xₙ₊₁ | X₀, X₁, ..., Xₙ) ≥ Xₙ.   (14.0.3)

It is a supermartingale if E|Xₙ| < ∞ for all n, and also

E(Xₙ₊₁ | X₀, X₁, ..., Xₙ) ≤ Xₙ.

(These names are very standard, even though they are arguably the reverse of what they should be.) Thus, a process is a martingale if and only if it is
both a submartingale and a supermartingale. And, {Xₙ} is a submartingale if and only if {−Xₙ} is a supermartingale.

Furthermore, again by Exercise 13.4.13, a Markov chain {Xₙ} with E|Xₙ| < ∞ is a submartingale if Σ_{j∈S} j p_ij ≥ i for all i ∈ S, or is a supermartingale if Σ_{j∈S} j p_ij ≤ i for all i ∈ S. Similarly, if X₀ = 0 and Xₙ = Z₁ + ... + Zₙ, where {Zᵢ} are i.i.d. with finite mean, then by Exercise 13.4.12, {Xₙ} is a martingale if E(Zᵢ) = 0, is a supermartingale if E(Zᵢ) ≤ 0, or is a submartingale if E(Zᵢ) ≥ 0.

If {Xₙ} is a submartingale, then taking expectations of both sides of (14.0.3) gives that E(Xₙ₊₁) ≥ E(Xₙ), so by induction

E(Xₙ) ≥ E(Xₙ₋₁) ≥ ... ≥ E(X₁) ≥ E(X₀).   (14.0.4)

Similarly, using Proposition 13.2.7,

E[Xₙ₊₂ | X₀, ..., Xₙ] = E[E(Xₙ₊₂ | X₀, ..., Xₙ₊₁) | X₀, ..., Xₙ] ≥ E[Xₙ₊₁ | X₀, ..., Xₙ] ≥ Xₙ,

so by induction

E(X_m | X₀, X₁, ..., Xₙ) ≥ Xₙ,   m ≥ n.   (14.0.5)

For supermartingales, analogous statements to (14.0.4) and (14.0.5) hold with each ≥ replaced by ≤, and for martingales they can be replaced by =.

14.1. Stopping times.

If X₀, X₁, ... is a martingale, then as in (14.0.4), E(Xₙ) = E(X₀) for all n ∈ N. But what about E(X_τ), where τ is a random time? That is, if we define a new random variable Y by Y(ω) = X_{τ(ω)}(ω), then must we also have E(Y) = E(X₀)? If τ is independent of {Xₙ}, then E(X_τ) is simply a weighted average of different E(Xₙ), and therefore still equals E(X₀). But what if τ and {Xₙ} are not independent?

We shall assume that τ is a stopping time, i.e. a non-negative-integer-valued random variable with the property that {τ = n} ∈ σ(X₀, ..., Xₙ). (Intuitively, this means that one can determine if τ = n just by knowing the values X₀, ..., Xₙ; τ does not "look into the future" to decide whether or not to stop at time n.) Under these conditions, must it be true that E(X_τ) = E(X₀)?

The answer to this question is no in general. Indeed, consider simple symmetric random walk, with X₀ = 0, and let τ = inf{n ≥ 0; Xₙ = −5}. Then τ is a stopping time, since {τ = n} = {X₀ ≠ −5, ..., Xₙ₋₁ ≠ −5, Xₙ = −5} ∈ σ(X₀, ..., Xₙ). Furthermore, since {Xₙ} is recurrent, we
have P(τ < ∞) = 1. On the other hand, clearly X_τ = −5 with probability 1, so that E(X_τ) = −5, not 0.

However, for bounded stopping times the situation is quite different, as the following result (one version of the Optional Sampling Theorem) shows.

Theorem 14.1.1. Let {Xₙ} be a submartingale, let M ∈ N, and let τ₁, τ₂ be stopping times such that 0 ≤ τ₁ ≤ τ₂ ≤ M with probability 1. Then E(X_{τ₂}) ≥ E(X_{τ₁}).

Proof. Note first that {τ₁ < k ≤ τ₂} = {τ₁ ≤ k − 1} ∩ {τ₂ ≤ k − 1}^c, so that (since τ₁ and τ₂ are stopping times) the event {τ₁ < k ≤ τ₂} is in σ(X₀, ..., X_{k−1}). We therefore have, using a telescoping series, linearity of expectation, and the definition of E(··· | X₀, ..., X_{k−1}), that

E(X_{τ₂}) − E(X_{τ₁}) = E(X_{τ₂} − X_{τ₁})
= E(Σ_{k=τ₁+1}^{τ₂} (X_k − X_{k−1}))
= E(Σ_{k=1}^{M} (X_k − X_{k−1}) 1_{τ₁<k≤τ₂})
= Σ_{k=1}^{M} E((X_k − X_{k−1}) 1_{τ₁<k≤τ₂})
= Σ_{k=1}^{M} E((E(X_k | X₀, ..., X_{k−1}) − X_{k−1}) 1_{τ₁<k≤τ₂}).

But since {Xₙ} is a submartingale, E(X_k | X₀, ..., X_{k−1}) − X_{k−1} ≥ 0. Therefore, E(X_{τ₂}) − E(X_{τ₁}) ≥ 0, as claimed.  ∎

Corollary 14.1.2. Let {Xₙ} be a martingale, let M ∈ N, and let τ₁, τ₂ be stopping times such that 0 ≤ τ₁ ≤ τ₂ ≤ M with probability 1. Then E(X_{τ₂}) = E(X_{τ₁}).

Proof. Since both {Xₙ} and {−Xₙ} are submartingales, the theorem gives that E(X_{τ₂}) ≥ E(X_{τ₁}) and also E(−X_{τ₂}) ≥ E(−X_{τ₁}). The result follows.  ∎

Setting τ₁ = 0, we obtain:

Corollary 14.1.3. If {Xₙ} is a martingale, and τ is a bounded stopping time, then E(X_τ) = E(X₀).

Example 14.1.4. Let {Xₙ} be simple symmetric random walk with X₀ = 0, and let τ = min(10¹², inf{n ≥ 0; Xₙ = −5}). Then τ is indeed a bounded stopping time (with τ ≤ 10¹²), so we must have E(X_τ) = 0. This is surprising, since the probability is extremely high that τ < 10¹², and in this case X_τ = −5. However, in the rare case that τ = 10¹², X_τ will (on
average) be so large as to cancel this −5 out. More precisely, using the observation (13.0.1) that E(Z | B) = E(Z 1_B)/P(B) when P(B) > 0, we have that

0 = E(X_τ) = E(X_τ 1_{τ<10¹²}) + E(X_τ 1_{τ=10¹²})
= E(X_τ | τ < 10¹²) P(τ < 10¹²) + E(X_τ | τ = 10¹²) P(τ = 10¹²)
= (−5) P(τ < 10¹²) + E(X_{10¹²} | τ = 10¹²) P(τ = 10¹²).

Here P(τ < 10¹²) ≈ 1 and P(τ = 10¹²) ≈ 0, but E(X_{10¹²} | τ = 10¹²) is so huge that the equation still holds.

A generalisation of Theorem 14.1.1, allowing for unbounded stopping times, is as follows.

Theorem 14.1.5. Let {Xₙ} be a martingale with stopping time τ. Suppose P(τ < ∞) = 1, and E|X_τ| < ∞, and lim_{n→∞} E[Xₙ 1_{τ>n}] = 0. Then E(X_τ) = E(X₀).

Proof. Let Zₙ = X_{min(τ,n)} for n = 0, 1, 2, .... Then Zₙ = X_τ 1_{τ≤n} + Xₙ 1_{τ>n} = X_τ − X_τ 1_{τ>n} + Xₙ 1_{τ>n}, so X_τ = Zₙ − Xₙ 1_{τ>n} + X_τ 1_{τ>n}. Hence,

E(X_τ) = E(Zₙ) − E[Xₙ 1_{τ>n}] + E[X_τ 1_{τ>n}].

Since min(τ, n) is a bounded stopping time (cf. Exercise 14.4.6(b)), it follows from Corollary 14.1.3 that E(Zₙ) = E(X₀) for all n. As n → ∞, the second term goes to 0 by assumption. Also, the third term goes to 0 by the Dominated Convergence Theorem, since E|X_τ| < ∞, and 1_{τ>n} → 0 w.p. 1 since P[τ < ∞] = 1. Hence, letting n → ∞, we obtain that E(X_τ) = E(X₀).  ∎

Remark 14.1.6. Theorem 14.1.5 in turn implies Corollary 14.1.3, since if τ ≤ M, then Xₙ 1_{τ>n} = 0 for all n ≥ M, and also E|X_τ| = E|X₀ 1_{τ=0} + ... + X_M 1_{τ=M}| ≤ E|X₀| + ... + E|X_M| < ∞. However, this reasoning is circular, since we used Corollary 14.1.3 in the proof of Theorem 14.1.5.

Corollary 14.1.7. Let {Xₙ} be a martingale with stopping time τ, such that P[τ < ∞] = 1. Assume that {Xₙ} is bounded up to time τ, i.e. that there is M < ∞ such that |Xₙ| 1_{n≤τ} ≤ M 1_{n≤τ} for all n. Then E(X_τ) = E(X₀).
Proof. We have |X_τ| ≤ M, so that E|X_τ| ≤ M < ∞. Also,

|E(Xₙ 1_{τ>n})| ≤ E(|Xₙ| 1_{τ>n}) ≤ E(M 1_{τ>n}) = M P(τ > n),

which converges to 0 as n → ∞ since P[τ < ∞] = 1. Hence, the result follows from Theorem 14.1.5.  ∎

Remark 14.1.8. For submartingales {Xₙ} we may replace = by ≥ in Corollary 14.1.3 and Theorem 14.1.5 and Corollary 14.1.7, while for supermartingales we may replace = by ≤.

Another approach is given by:

Theorem 14.1.9. Let τ be a stopping time for a martingale {Xₙ}, with P(τ < ∞) = 1. Then E(X_τ) = E(X₀) if and only if lim_{n→∞} E[X_{min(τ,n)}] = E[lim_{n→∞} X_{min(τ,n)}].

Proof. By Corollary 14.1.3, lim_{n→∞} E[X_{min(τ,n)}] = lim_{n→∞} E(X₀) = E(X₀). Also, since P(τ < ∞) = 1, we must have P[lim_{n→∞} X_{min(τ,n)} = X_τ] = 1, so E[lim_{n→∞} X_{min(τ,n)}] = E(X_τ). The result follows.  ∎

Combining Theorem 14.1.9 with the Dominated Convergence Theorem yields:

Corollary 14.1.10. Let τ be a stopping time for a martingale {Xₙ}, with P(τ < ∞) = 1. Suppose there is a random variable Y with E(Y) < ∞ and |X_{min(τ,n)}| ≤ Y for all n. Then E(X_τ) = E(X₀).

As a particular case, we have:

Corollary 14.1.11. Let τ be a stopping time for a martingale {Xₙ}. Suppose E(τ) < ∞, and |Xₙ − Xₙ₋₁| ≤ M < ∞ for all n. Then E(X_τ) = E(X₀).

Proof. Let Y = |X₀| + Mτ. Then E(Y) ≤ E|X₀| + M E(τ) < ∞. Also, |X_{min(τ,n)}| ≤ |X₀| + M min(τ, n) ≤ Y. Hence, the result follows from Corollary 14.1.10.  ∎

Similarly, combining Theorem 14.1.9 with the Uniform Integrability Convergence Theorem yields:

Corollary 14.1.12. Let τ be a stopping time for a martingale {Xₙ}, with P(τ < ∞) = 1. Suppose

lim_{α→∞} sup_n E(|X_{min(τ,n)}| 1_{|X_{min(τ,n)}| ≥ α}) = 0.
Then E(X_τ) = E(X₀).

Example 14.1.13. (Sequence waiting time.) Let {rₙ}ₙ≥₁ be infinite fair coin tossing (cf. Subsection 2.6), and let τ = inf{n ≥ 3 : rₙ₋₂ = 1, rₙ₋₁ = 0, rₙ = 1} be the first time the sequence heads-tails-heads is completed. Surprisingly, E(τ) can be computed using martingales. Indeed, suppose that at each time n, a new player appears and bets $1 on heads, then if they win they bet $2 on tails, then if they win again they bet $4 on heads. (They stop betting as soon as they either lose once or win three bets in a row.) Let Sₙ be the total amount won by all the bettors by time n. Then by construction, {Sₙ} is a martingale with stopping time τ, and furthermore |Sₙ − Sₙ₋₁| ≤ 7 < ∞. Also, E(τ) < ∞, and S_τ = −τ + 10 by Exercise 14.4.12. Hence, by Corollary 14.1.11, 0 = E(S_τ) = −E(τ) + 10, whence E(τ) = 10. (See also Exercise 14.4.13.)

Finally, we observe the following fact about E(X_τ) when {Xₙ} is a general random walk, i.e. is given by sums of i.i.d. random variables. (If the {Zᵢ} are bounded, then part (a) below also follows from applying Corollary 14.1.11 to {Xₙ − nm}.)

Theorem 14.1.14. (Wald's theorem) Let {Zᵢ} be i.i.d. with finite mean m. Let X₀ = a, and Xₙ = a + Z₁ + ... + Zₙ for n ≥ 1. Let τ be a stopping time for {Xₙ}, with E(τ) < ∞. Then
(a) E(X_τ) = a + m E(τ); and furthermore
(b) if m = 0 and E(Z₁²) < ∞, then Var(X_τ) = E(Z₁²) E(τ).

Proof. Since {i ≤ τ} = {τ ≤ i − 1}^c ∈ σ(X₀, ..., X_{i−1}) = σ(Z₁, ..., Z_{i−1}), it follows from Lemma 3.5.2 that {i ≤ τ} and {Zᵢ ∈ B} are independent for any Borel B ⊆ R, so that 1_{i≤τ} and Zᵢ are independent for each i.

For part (a), assume first that the Zᵢ are non-negative. It follows from countable linearity and independence that

E(X_τ − a) = E(Z₁ + ... + Z_τ) = E(Σ_{i=1}^∞ Zᵢ 1_{i≤τ})
= Σ_{i=1}^∞ E(Zᵢ 1_{i≤τ}) = Σ_{i=1}^∞ E(Zᵢ) E(1_{i≤τ})
= E(Z₁) Σ_{i=1}^∞ P(τ ≥ i) = E(Z₁) E(τ),

the last equality following from Proposition 4.2.9.
For general Zᵢ, the above shows that E(Σᵢ |Zᵢ| 1_{i≤τ}) = E|Z₁| E(τ) < ∞. Hence, by Exercise 4.5.14(a) or Corollary 9.4.4, we can write

E(X_τ − a) = E(Σᵢ Zᵢ 1_{i≤τ}) = Σᵢ E(Zᵢ 1_{i≤τ}) = E(Z₁) E(τ).

For part (b), if m = 0 then from the above E(X_τ) = a. Assume first that τ is bounded, i.e. τ ≤ M < ∞. Then using part (a),

Var(X_τ) = E((X_τ − a)²) = E((Σ_{i=1}^{M} Zᵢ 1_{i≤τ})²) = E(Σ_{i=1}^{M} Zᵢ² 1_{i≤τ}) + 0 = E(Z₁²) E(τ),

where the cross terms equal 0 since as before {j ≤ τ} ∈ σ(Z₁, ..., Z_{j−1}), so Zᵢ 1_{j≤τ} is independent of Z_j for i < j, and hence

E(Σ_{i<j} Zᵢ Z_j 1_{j≤τ}) = Σ_{1≤i<j≤M} E(Zᵢ Z_j 1_{j≤τ}) = Σ_{1≤i<j≤M} E(Zᵢ 1_{j≤τ}) E(Z_j) = 0.

If τ is not bounded, then let ρ_k = min(τ, k) be corresponding bounded stopping times. The above gives that Var(X_{ρ_k}) = E(Z₁²) E(ρ_k) for each k. As k → ∞, E(ρ_k) → E(τ) by the Monotone Convergence Theorem, so the result follows from Lemma 14.1.15 below.  ∎

The final argument in the proof of Theorem 14.1.14 requires two technical lemmas involving L² theory:

Lemma 14.1.15. In the proof of Theorem 14.1.14, lim_{k→∞} Var(X_{ρ_k}) = Var(X_τ).

Proof (optional). Clearly X_{ρ_k} → X_τ a.s. Let ‖W‖ = (E(W²))^{1/2} be the usual L² norm. Note first that for m < n, using Theorem 14.1.14(a) and since the cross terms again vanish, we have

‖X_{ρ_n} − X_{ρ_m}‖² = E[(X_{ρ_n} − X_{ρ_m})²] = E[(Z_{ρ_m+1} + ... + Z_{ρ_n})²]
= E[Z²_{ρ_m+1} + ... + Z²_{ρ_n}] + 0
= E[Z₁² + ... + Z²_{ρ_n}] − E[Z₁² + ... + Z²_{ρ_m}]
= E(Z₁²) E(ρ_n) − E(Z₁²) E(ρ_m) = E(Z₁²)[E(ρ_n) − E(ρ_m)],

which → E(Z₁²)[E(τ) − E(τ)] = 0 as n, m → ∞. This means that the sequence {X_{ρ_n}}ₙ₌₁^∞ is a Cauchy sequence in L², and since L² is complete (see e.g. Dudley (1989), Theorem 5.2.1), we must have lim_{n→∞} ‖X_{ρ_n} − Y‖ = 0 for some random variable Y. It then follows from Lemma 14.1.16 below (with p = 2) that Y = X_τ a.s., so that lim_{n→∞} ‖X_{ρ_n} − X_τ‖ = 0. The triangle inequality then gives that |‖X_{ρ_n} − a‖ − ‖X_τ − a‖| ≤ ‖X_{ρ_n} − X_τ‖, so lim_{n→∞} ‖X_{ρ_n} − a‖ = ‖X_τ − a‖. Hence, lim_{n→∞} ‖X_{ρ_n} − a‖² = ‖X_τ − a‖², i.e. lim_{n→∞} Var(X_{ρ_n}) = Var(X_τ).  ∎

Lemma 14.1.16. If {Yₙ} → Y in L^p, i.e. lim_{n→∞} E(|Yₙ − Y|^p) = 0, where 1 ≤ p < ∞, then there is a subsequence {Y_{n_k}} with Y_{n_k} → Y a.s. In particular, if {Yₙ} → Y in L^p, and also {Yₙ} → Z a.s., then Y = Z a.s.

Proof (optional). Let {Y_{n_k}} be any subsequence such that E(|Y_{n_k} − Y|^p) ≤ 4^{−k}, and let A_k = {ω ∈ Ω : |Y_{n_k}(ω) − Y(ω)| ≥ 2^{−k/p}}. Then

4^{−k} ≥ E(|Y_{n_k} − Y|^p) ≥ E(|Y_{n_k} − Y|^p 1_{A_k}) ≥ (2^{−k/p})^p P(A_k) = 2^{−k} P(A_k).

Hence, P(A_k) ≤ 2^{−k}, so Σ_k P(A_k) ≤ 1 < ∞, so by the Borel–Cantelli Lemma, P(lim sup_k A_k) = 0. For any ε > 0, since {ω ∈ Ω : |Y_{n_k}(ω) − Y(ω)| ≥ ε} ⊆ A_k for all k ≥ p log₂(1/ε), this shows that P(|Y_{n_k}(ω) − Y(ω)| ≥ ε i.o.) ≤ P(lim sup_k A_k) = 0. Lemma 5.2.1 then implies that {Y_{n_k}} → Y a.s.  ∎

14.2. Martingale convergence.

If {Xₙ} is a martingale (or a submartingale), will it converge almost surely to some (random) value? Of course, the answer to this question in general is no. Indeed, simple symmetric random walk is a martingale, but it is recurrent, so it will forever oscillate between all the integers, without ever settling down anywhere.

On the other hand, consider the Markov chain on the non-negative integers with (say) X₀ = 50, and with transition probabilities given by p_ij = 1/(2i+1) for 0 ≤ j ≤ 2i, with p_ij = 0 otherwise. (That is, if Xₙ = i, then Xₙ₊₁ is uniformly distributed on the 2i + 1 points 0, 1, ..., 2i.) This Markov chain is a martingale. Clearly if it converges at all it must converge to 0; but does it?
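(Before answering in general, an informal simulation is suggestive. The Python/NumPy sketch below runs the chain just described; the chain length, number of paths, and seed are arbitrary illustrative choices.)

    import numpy as np

    rng = np.random.default_rng(2)

    # Given X_n = i, X_{n+1} is uniform on {0, 1, ..., 2i}; 0 is absorbing.
    def run_chain(steps=200, start=50):
        x = start
        for _ in range(steps):
            x = int(rng.integers(0, 2 * x + 1)) if x > 0 else 0
        return x

    hits_zero = sum(run_chain() == 0 for _ in range(1000))
    print(f"{hits_zero}/1000 paths are at 0 after 200 steps")

Typically all, or nearly all, simulated paths are absorbed at 0, suggesting that Xₙ → 0 a.s.; the theorem below confirms this.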
This question is answered by:

Theorem 14.2.1. (Martingale Convergence Theorem) Let {Xₙ} be a submartingale. Suppose that supₙ E|Xₙ| < ∞. Then there is a (finite) random variable X such that Xₙ → X almost surely (a.s.).

We shall prove Theorem 14.2.1 below. We first note the following immediate corollary.

Corollary 14.2.2. Let {Xₙ} be a martingale which is non-negative (or more generally satisfies that either Xₙ ≥ C for all n, or Xₙ ≤ C for all n, for some C ∈ R). Then there is a (finite) random variable X such that Xₙ → X a.s.

Proof. If {Xₙ} is a non-negative martingale, then E|Xₙ| = E(Xₙ) = E(X₀), so the result follows directly from Theorem 14.2.1. If Xₙ ≥ C [respectively Xₙ ≤ C], then the result follows since {Xₙ − C} [respectively {−Xₙ + C}] is a non-negative martingale.  ∎

It follows from this corollary that the above Markov chain example does indeed converge to 0 a.s.

For a second example, suppose X₀ = 50 and that p_ij = 1/(2 min(i, 100−i) + 1) for |j − i| ≤ min(i, 100 − i). This Markov chain lives on {0, 1, 2, ..., 100}, at each stage jumping uniformly to one of the points within min(i, 100 − i) of its current position. By Corollary 14.2.2, {Xₙ} → X a.s. for some random variable X, and indeed it is easily seen that P(X = 0) = P(X = 100) = ½.

For a third example, let S = {2ⁿ; n ∈ Z}, with a Markov chain on S having transition probabilities p_{i,2i} = ⅓ and p_{i,i/2} = ⅔. This is again a martingale, and in fact it converges to 0 a.s. (even though it is unbounded).

For a fourth example, let S = N be the set of positive integers, with X₀ = 1. Let p_{ii} = 1 for i even, with p_{i,i−1} = 2/3 and p_{i,i+2} = 1/3 for i odd. Then it is easy to see that this Markov chain is a non-negative martingale, which converges a.s. to a random variable X having the property that P(X = i) > 0 for every even non-negative integer i.

It remains to prove Theorem 14.2.1. We require the following lemma.

Lemma 14.2.3. (Upcrossing Lemma) Let {Xₙ} be a submartingale. For M ∈ N and α < β, let

U_{αβ}^{(M)} = sup{k; ∃ a₁ < b₁ < a₂ < b₂ < ... < a_k < b_k ≤ M; X_{a_i} ≤ α, X_{b_i} ≥ β}.
(Intuitively, U_{αβ}^{(M)} represents the number of times the process {Xₙ} "upcrosses" the interval [α, β] by time M.) Then

E(U_{αβ}^{(M)}) ≤ E|X_M − X₀| / (β − α).

Proof. By Exercise 14.4.2, the sequence {max(Xₙ, α)} is also a submartingale, and it clearly has the same number of upcrossings of [α, β] as does {Xₙ}. Furthermore, |max(X_M, α) − max(X₀, α)| ≤ |X_M − X₀|. Hence, for the remainder of the proof we can (and do) replace Xₙ by max(Xₙ, α), i.e. we assume that Xₙ ≥ α for all n.

Let u₀ = v₀ = 0, and iteratively define u_j and v_j for j ≥ 1 by

u_j = min(M, inf{k ≥ v_{j−1}; X_k ≤ α}),   v_j = min(M, inf{k ≥ u_j; X_k ≥ β}).

We necessarily have v_M = M, so that

E(X_M) = E(X_{v_M}) = E(X_{v_M} − X_{u_M} + X_{u_M} − X_{v_{M−1}} + X_{v_{M−1}} − X_{u_{M−1}} + ... + X_{u_1} − X₀ + X₀)
= E(X₀) + E(Σ_{k=1}^{M} (X_{v_k} − X_{u_k})) + Σ_{k=1}^{M} E(X_{u_k} − X_{v_{k−1}}).

Now, E(X_{u_k} − X_{v_{k−1}}) ≥ 0 by Theorem 14.1.1, since {Xₙ} is a submartingale and v_{k−1} ≤ u_k ≤ M. Hence, Σ_{k=1}^{M} E(X_{u_k} − X_{v_{k−1}}) ≥ 0. (This is the subtlest part of the proof; since usually X_{u_k} ≤ α and X_{v_{k−1}} ≥ β, it is surprising that E(X_{u_k} − X_{v_{k−1}}) ≥ 0.)

Furthermore, we claim that

E(Σ_{k=1}^{M} (X_{v_k} − X_{u_k})) ≥ E((β − α) U_{αβ}^{(M)}).   (14.2.4)

Indeed, each upcrossing contributes at least β − α to the sum. And, any "null cycle" where u_k = v_k = M contributes nothing, since then X_{v_k} − X_{u_k} = X_M − X_M = 0. Finally, we may have one "incomplete cycle" where u_k < M but v_k = M. However, since we are assuming that Xₙ ≥ α, we must have X_{v_k} − X_{u_k} = X_M − X_{u_k} ≥ α − α = 0 in this case; that is, such an incomplete cycle could only increase the sum. This establishes (14.2.4).

We conclude that E(X_M) ≥ E(X₀) + (β − α) E(U_{αβ}^{(M)}). Re-arranging,

(β − α) E(U_{αβ}^{(M)}) ≤ E(X_M) − E(X₀) ≤ E|X_M − X₀|,

which gives the result.  ∎
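(Informal computational aside. The quantity U_{αβ}^{(M)} can be computed directly along any sample path, and the bound just proved can be checked by averaging over simulated paths. The Python/NumPy sketch below does this for simple symmetric random walk, which is a martingale and hence also a submartingale; the path length, interval, number of repetitions, and seed are arbitrary choices.)

    import numpy as np

    rng = np.random.default_rng(3)

    def upcrossings(path, alpha, beta):
        # Count completed upcrossings of [alpha, beta] along the path:
        # a visit to (-inf, alpha] followed later by a visit to [beta, inf).
        count, below = 0, False
        for v in path:
            if v <= alpha:
                below = True
            elif v >= beta and below:
                count += 1
                below = False
        return count

    M, alpha, beta, reps = 1000, -2.0, 2.0, 5000
    us, gaps = [], []
    for _ in range(reps):
        path = np.concatenate([[0], np.cumsum(rng.choice([-1, 1], M))])
        us.append(upcrossings(path, alpha, beta))
        gaps.append(abs(path[-1] - path[0]))

    print(f"E(U) estimate:  {np.mean(us):.3f}")
    print(f"bound estimate: {np.mean(gaps) / (beta - alpha):.3f}")

The estimated E(U_{αβ}^{(M)}) should fall below the estimated E|X_M − X₀|/(β − α), as the Upcrossing Lemma requires.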
Proof of Theorem 14.2.1. Let K = supₙ E(|Xₙ|) < ∞. Note first that by Fatou's lemma,

E(lim inf_n |Xₙ|) ≤ lim inf_n E|Xₙ| ≤ K < ∞.

It follows that P(|Xₙ| → ∞) = 0, i.e. {Xₙ} will not diverge to ±∞.

Suppose now that P(lim inf Xₙ < lim sup Xₙ) > 0. Then since

{lim inf Xₙ < lim sup Xₙ} = ∪_{q∈Q} ∪_{k∈N} {lim inf Xₙ < q < q + 1/k < lim sup Xₙ},

it follows from countable subadditivity that we can find α, β ∈ Q with P(lim inf Xₙ < α < β < lim sup Xₙ) > 0. With U_{αβ}^{(M)} as in Lemma 14.2.3, this implies that P(lim_{M→∞} U_{αβ}^{(M)} = ∞) > 0, so E(lim_{M→∞} U_{αβ}^{(M)}) = ∞. Then by the monotone convergence theorem, lim_{M→∞} E(U_{αβ}^{(M)}) = E(lim_{M→∞} U_{αβ}^{(M)}) = ∞. But Lemma 14.2.3 says that for all M ∈ N, E(U_{αβ}^{(M)}) ≤ E|X_M − X₀| / (β − α) ≤ 2K / (β − α) < ∞, a contradiction.

We conclude that P(limₙ→∞ |Xₙ| = ∞) = 0 and also P(lim inf Xₙ < lim sup Xₙ) = 0. Hence, we must have P(limₙ→∞ Xₙ exists and is finite) = 1, as claimed.  ∎

14.3. Maximal inequality.

Markov's inequality says that for α > 0, P(X₀ ≥ α) ≤ P(|X₀| ≥ α) ≤ E|X₀| / α. Surprisingly, for a submartingale, the same inequality holds with X₀ replaced by max₀≤i≤n Xᵢ, even though usually (max₀≤i≤n Xᵢ) ≫ X₀.

Theorem 14.3.1. (Martingale maximal inequality) If {Xₙ} is a submartingale, then for all α > 0,

P((max₀≤i≤n Xᵢ) ≥ α) ≤ E|Xₙ| / α.

Proof. Let A_k be the event {X_k ≥ α, but Xᵢ < α for i < k}, i.e. the event that the process first reaches α at time k. And let A = ∪₀≤k≤n A_k be the event that the process reaches α by time n.
We need to show that P(A) ≤ E|Xₙ| / α, i.e. that α P(A) ≤ E|Xₙ|. To that end, we compute (letting F_k = σ(X₀, X₁, ..., X_k)) that

α P(A) = Σ_{k=0}^{n} α P(A_k)   since {A_k} disjoint
= Σ_{k=0}^{n} E(α 1_{A_k})
≤ Σ_{k=0}^{n} E(X_k 1_{A_k})   since X_k ≥ α on A_k
≤ Σ_{k=0}^{n} E(E(Xₙ | F_k) 1_{A_k})   by (14.0.5)
= Σ_{k=0}^{n} E(Xₙ 1_{A_k})   since A_k ∈ F_k
= E(Xₙ 1_A)   since {A_k} disjoint
≤ E(|Xₙ| 1_A) ≤ E|Xₙ|,

as required.  ∎

Theorem 14.3.1 is clearly false if {Xₙ} is not a submartingale. For example, if the Xᵢ are i.i.d. equal to 0 or 2 each with probability ½, and if α = 2, then P[(max₀≤i≤n Xᵢ) ≥ α] = 1 − (½)ⁿ⁺¹, which for n ≥ 1 is not ≤ E|Xₙ| / α = ½.

If {Xₙ} is in fact a non-negative martingale, then E|Xₙ| = E(Xₙ) = E(X₀), so the bound in Theorem 14.3.1 does not depend on n. Hence, letting n → ∞ and using continuity of probabilities, we obtain:

Corollary 14.3.2. If {Xₙ} is a non-negative martingale, then for all α > 0,

P((sup₀≤i<∞ Xᵢ) ≥ α) ≤ E(X₀) / α.

For example, consider the third Markov chain example above, where S = {2ⁿ; n ∈ Z} and p_{i,2i} = ⅓, p_{i,i/2} = ⅔. If, say, X₀ = 1, then we obtain that P[(sup₀≤i<∞ Xᵢ) ≥ 2] ≤ ½, which is perhaps surprising. (This result also follows from applying (7.2.7) to the simple non-symmetric random walk {log₂ Xₙ}.)

Remark 14.3.3. (Martingale Central Limit Theorem) If Xₙ = Z₀ + ... + Zₙ, where {Zᵢ} are i.i.d. with mean 0 and variance 1, then we already know from the classical Central Limit Theorem that Xₙ/√n ⇒ N(0, 1), i.e. that the properly normalised Xₙ converge weakly to a standard normal distribution. In fact, a similar result is true for more general martingales {Xₙ}. Let Fₙ = σ(X₀, X₁, ..., Xₙ). Set σ₀² = Var(X₀), and for n ≥ 1 set

σₙ² = Var(Xₙ | Fₙ₋₁) = E(Xₙ² − Xₙ₋₁² | Fₙ₋₁).

It then follows from induction that Var(Xₙ) = Σ_{k=0}^{n} E(σ_k²), and of course E(Xₙ) = E(X₀) does not grow with n. Hence, if we set v_t = min{n ≥
0; Σ_{k=0}^{n} σ_k² ≥ t}, then for large t, the random variable X_{v_t}/√t has mean close to 0 and variance close to 1. We then have X_{v_t}/√t ⇒ N(0, 1) as t → ∞ under certain conditions, for example if Σₙ σₙ² = ∞ with probability one and the differences Xₙ − Xₙ₋₁ are uniformly bounded. For a proof see e.g. Billingsley (1995, Theorem 35.11).

14.4. Exercises.

Exercise 14.4.1. Let {Zᵢ} be i.i.d. with P(Zᵢ = 1) = P(Zᵢ = 0) = 1/2. Let X₀ = 0, X₁ = Z₁, and for n ≥ 2, Xₙ = Xₙ₋₁ + (1 + Z₁ + ... + Zₙ₋₁)(2Zₙ − 1). (Intuitively, this corresponds to wagering, at each time n, one dollar more than the number of previous victories.)
(a) Prove that {Xₙ} is a martingale.
(b) Prove that {Xₙ} is not a Markov chain.

Exercise 14.4.2. Let {Xₙ} be a submartingale, and let α ∈ R. Let Yₙ = max(Xₙ, α). Prove that {Yₙ} is also a submartingale. [Hint: Use Exercise 4.5.2.]

Exercise 14.4.3. Let {Xₙ} be a non-negative submartingale. Let Yₙ = (Xₙ)². Assuming E(Yₙ) < ∞ for all n, prove that {Yₙ} is also a submartingale.

Exercise 14.4.4. The conditional Jensen's inequality states that if φ is a convex function, then E(φ(X) | G) ≥ φ(E(X | G)).
(a) Assuming this, prove that if {Xₙ} is a submartingale, then so is {φ(Xₙ)} whenever φ is non-decreasing and convex with E|φ(Xₙ)| < ∞ for all n.
(b) Show that the conclusions of the two previous exercises follow from part (a).

Exercise 14.4.5. Let Z be a random variable on a probability triple (Ω, F, P), and let G₀ ⊆ G₁ ⊆ ... ⊆ F be a nested sequence of sub-σ-algebras. Let Xₙ = E(Z | Gₙ). (If we think of Gₙ as the amount of information we have available at time n, then Xₙ represents our best guess of the value Z at time n.) Prove that X₀, X₁, ... is a martingale. [Hint: Use Proposition 13.2.7 to show that E(Xₙ₊₁ | Gₙ) = Xₙ. Then use the fact that Xᵢ is Gᵢ-measurable to prove that σ(X₀, ..., Xₙ) ⊆ Gₙ.]

Exercise 14.4.6. Let {Xₙ} be a stochastic process, let τ and ρ be two non-negative-integer-valued random variables, and let m ∈ N.
(a) Prove that τ is a stopping time for {Xₙ} if and only if {τ ≤ n} ∈ σ(X₀, ..., Xₙ) for all n ≥ 0.
(b) Prove that if τ is a stopping time, then so is min(τ, m).
(c) Prove that if τ and ρ are stopping times for {Xₙ}, then so is min(τ, ρ).

Exercise 14.4.7. Let C ∈ R, and let {Zᵢ} be an i.i.d. collection of random variables with P[Zᵢ = −1] = 3/4 and P[Zᵢ = C] = 1/4. Let X₀ = 5, and Xₙ = 5 + Z₁ + Z₂ + ... + Zₙ for n ≥ 1.
(a) Find a value of C such that {Xₙ} is a martingale.
(b) For this value of C, prove or disprove that there is a random variable X such that as n → ∞, Xₙ → X with probability 1.
(c) For this value of C, prove or disprove that P[Xₙ = 0 for some n ∈ N] = 1.

Exercise 14.4.8. Let {Xₙ} be simple symmetric random walk, with X₀ = 0. Let τ = inf{n ≥ 5 : Xₙ₊₁ = Xₙ + 1} be the first time after 4 which is just before the chain increases. Let ρ = τ + 1.
(a) Is τ a stopping time? Is ρ a stopping time?
(b) Use Theorem 14.1.5 to compute E(X_ρ).
(c) Use the result of part (b) to compute E(X_τ). Why does this not contradict Theorem 14.1.5?

Exercise 14.4.9. Let 0 < a < c be integers, with c ≥ 3. Let {Xₙ} be simple symmetric random walk with X₀ = a, let σ = inf{n ≥ 1 : Xₙ = 0 or c}, and let ρ = σ − 1. Determine whether or not E(X_σ) = a, and whether or not E(X_ρ) = a. Relate the results to Corollary 14.1.7.

Exercise 14.4.10. Let 0 < a < c be integers. Let {Xₙ} be simple symmetric random walk, started at X₀ = a. Let τ = inf{n ≥ 1; Xₙ = 0 or c}.
(a) Prove that {Xₙ} is a martingale.
(b) Prove that E(X_τ) = a. [Hint: Use Corollary 14.1.7.]
(c) Use this fact to derive an alternative proof of the gambler's ruin formula given in Section 7.2, for the case p = 1/2.

Exercise 14.4.11. Let 0 < p < 1 with p ≠ 1/2, and let 0 < a < c be integers. Let {Xₙ} be simple random walk with parameter p, started at X₀ = a. Let τ = inf{n ≥ 1; Xₙ = 0 or c}. Let Zₙ = ((1 − p)/p)^{Xₙ} for n = 0, 1, 2, ....
(a) Prove that {Zₙ} is a martingale.
(b) Prove that E(Z_τ) = ((1 − p)/p)^a. [Hint: Use Corollary 14.1.7.]
(c) Use this fact to derive an alternative proof of the gambler's ruin formula given in Section 7.2, for the case p ≠ 1/2.

Exercise 14.4.12. Let {Sₙ} and τ be as in Example 14.1.13.
(a) Prove that E(τ) < ∞. [Hint: Show that P(τ > 3m) ≤ (7/8)ᵐ, and
use Proposition 4.2.9.]
(b) Prove that S_τ = −τ + 10. [Hint: By considering the τ different players one at a time, argue that S_τ = (τ − 3)(−1) + 7 − 1 + 1.]

Exercise 14.4.13. Similar to Example 14.1.13, let {rₙ}ₙ≥₁ be infinite fair coin tossing, σ = inf{n ≥ 3 : rₙ₋₂ = 0, rₙ₋₁ = 1, rₙ = 1}, and ρ = inf{n ≥ 4 : rₙ₋₃ = 0, rₙ₋₂ = 1, rₙ₋₁ = 0, rₙ = 1}.
(a) Describe σ and ρ in plain English.
(b) Compute E(σ).
(c) Does E(σ) = E(τ), with τ as in Example 14.1.13? Why or why not?
(d) Compute E(ρ).

Exercise 14.4.14. Why does the proof of Theorem 14.1.1 fail if M = ∞? [Hint: Exercise 4.5.14 may help.]

Exercise 14.4.15. Modify the proof of Theorem 14.1.1 to show that if {Xₙ} is a submartingale with |Xₙ₊₁ − Xₙ| ≤ M < ∞, and τ₁ ≤ τ₂ are stopping times (not necessarily bounded) with E(τ₂) < ∞, then we still have E(X_{τ₂}) ≥ E(X_{τ₁}). [Hint: Corollary 9.4.4 may help.] (Compare Corollary 14.1.11.)

Exercise 14.4.16. Let {Xₙ} be simple symmetric random walk, with X₀ = 10. Let τ = inf{n ≥ 1; Xₙ = 0}, and let Yₙ = X_{min(n,τ)}. Determine (with explanation) whether each of the following statements is true or false.
(a) E(X₂₀₀) = 10.
(b) E(Y₂₀₀) = 10.
(c) E(X_τ) = 10.
(d) E(Y_τ) = 10.
(e) There is a random variable X such that {Xₙ} → X a.s.
(f) There is a random variable Y such that {Yₙ} → Y a.s.

Exercise 14.4.17. Let {Xₙ} be simple symmetric random walk with X₀ = 0, and let τ = inf{n ≥ 0; Xₙ = −5}.
(a) What is E(X_τ) in this case?
(b) Why does this fact not contradict Wald's theorem part (a) (with a = m = 0)?

Exercise 14.4.18. Let 0 < p < 1 with p ≠ 1/2, and let 0 < a < c be integers. Let {Xₙ} be simple random walk with parameter p, started at X₀ = a. Let τ = inf{n ≥ 1; Xₙ = 0 or c}. (Thus, from the gambler's ruin solution of Subsection 7.2, P(X_τ = c) = [((1−p)/p)^a − 1] / [((1−p)/p)^c − 1].)
(a) Compute E(X_τ) by direct computation.
(b) Use Wald's theorem part (a) to compute E(τ) in terms of E(X_τ).
(c) Prove that the game's expected duration satisfies E(τ) = (a − c[((1−p)/p)^a − 1] / [((1−p)/p)^c − 1]) / (1 − 2p).
(d) Show that the limit of E(τ) as p → 1/2 is equal to a(c − a).

Exercise 14.4.19. Let 0 < a < c be integers, and let {Xₙ} be simple symmetric random walk with X₀ = a. Let τ = inf{n ≥ 1; Xₙ = 0 or c}.
(a) Compute Var(X_τ) by direct computation.
(b) Use Wald's theorem part (b) to compute E(τ) in terms of Var(X_τ).
(c) Prove that the game's expected duration satisfies E(τ) = a(c − a).
(d) Relate this result to part (d) of the previous exercise.

Exercise 14.4.20. Let {Xₙ} be a martingale with |Xₙ₊₁ − Xₙ| ≤ 10 for all n. Let τ = inf{n ≥ 1 : |Xₙ| ≥ 100}.
(a) Prove or disprove that this implies that P(τ < ∞) = 1.
(b) Prove or disprove that this implies there is a random variable X with {Xₙ} → X a.s.
(c) Prove or disprove that this implies that P[τ < ∞, or there is a random variable X with {Xₙ} → X] = 1. [Hint: Let Yₙ = X_{min(τ,n)}.]

Exercise 14.4.21. Let Z₁, Z₂, ... be independent, with

P(Zᵢ = 2^i/(2^i − 1)) = (2^i − 1)/2^i   and   P(Zᵢ = −2^i) = 1/2^i.

Let X₀ = 0, and Xₙ = Z₁ + ... + Zₙ for n ≥ 1.
(a) Prove that {Xₙ} is a martingale. [Hint: Don't forget (14.0.2).]
(b) Prove that P[Zᵢ ≥ 1 a.a.] = 1, i.e. that with probability 1, Zᵢ ≥ 1 for all but finitely many i. [Hint: Don't forget the Borel–Cantelli Lemma.]
(c) Prove that P[limₙ→∞ Xₙ = ∞] = 1. (Hence, even though {Xₙ} is a martingale and thus represents a player's fortune in a "fair" game, it is still certain that the player's fortune will converge to +∞.)
(d) Why does this result not contradict Corollary 14.2.2?
(e) Let τ = inf{n ≥ 1 : Xₙ ≤ 0}, and let Yₙ = X_{min(τ,n)}. Prove that {Yₙ} is also a martingale, and that P[limₙ→∞ Yₙ = ∞] > 0. Why does this result not contradict Corollary 14.2.2?

14.5. Section summary.

This section provided a brief introduction to martingales, including submartingales, supermartingales, and stopping times. Some examples were given, mostly arising from Markov chains. Various versions of the Optional Sampling Theorem were proved, giving conditions such that E(X_τ) = E(X₀). This together with the Upcrossing Lemma was then used to prove the important Martingale Convergence Theorem. Finally, the Martingale maximal inequality was presented.
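(Informal computational postscript. The optional-sampling computation of Example 14.1.13 gave E(τ) = 10 for the waiting time of the pattern heads-tails-heads, and this is easy to check by direct simulation. The Python/NumPy sketch below, with an arbitrary number of repetitions and seed, estimates E(τ); here 1 stands for heads and 0 for tails, as in the example.)

    import numpy as np

    rng = np.random.default_rng(4)

    def wait_for_HTH():
        # Return the first time n >= 3 with (r_{n-2}, r_{n-1}, r_n) = (1, 0, 1).
        last3, n = [], 0
        while True:
            n += 1
            last3 = (last3 + [int(rng.integers(0, 2))])[-3:]
            if last3 == [1, 0, 1]:
                return n

    taus = [wait_for_HTH() for _ in range(100_000)]
    print(f"E(tau) estimate: {np.mean(taus):.3f}   (theory: exactly 10)")

The empirical mean should be close to 10, in agreement with the martingale argument (and with Exercise 14.4.13, which shows that different patterns of the same length can have different expected waiting times).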
15. General stochastic processes.

We end this text with a brief look at more general stochastic processes. We attempt to give an intuitive discussion of this area, without being overly careful about mathematical precision. (A full treatment of these processes would require another course, perhaps following one of the books in Subsection B.5.) In particular, in this section a number of results are stated without being proved, and a number of equations are derived in an intuitive and non-rigorous manner. All exercises are grouped by subsection, rather than at the end of the entire section.

To begin, we define a (completely general) stochastic process to be any collection {X_t; t ∈ T} of random variables defined jointly on some probability triple. Here T can be any non-empty index set. If T = {0, 1, 2, ...} then this corresponds to our usual discrete-time stochastic processes; if T = [0, ∞) is the non-negative real numbers, then this corresponds to a continuous-time stochastic process as discussed below.

15.1. Kolmogorov Existence Theorem.

As with all mathematical objects, a proper analysis of stochastic processes should begin with a proof that they exist. In two places in this text (Theorems 7.1.1 and 8.1.1), we have proved the existence of certain random variables, defined on certain underlying probability triples, having certain specified properties. The Kolmogorov Existence Theorem is a huge generalisation of these results, which allows us to define stochastic processes quite generally, as we now discuss.

Given a stochastic process {X_t; t ∈ T}, and k ∈ N, and a finite collection t₁, ..., t_k ∈ T of distinct index values, we define the Borel probability measure μ_{t₁...t_k} on R^k by

μ_{t₁...t_k}(H) = P((X_{t₁}, ..., X_{t_k}) ∈ H),   H ⊆ R^k Borel.

The distributions {μ_{t₁...t_k}; k ∈ N, t₁, ..., t_k ∈ T distinct} are called the finite-dimensional distributions for the stochastic process {X_t; t ∈ T}. These finite-dimensional distributions clearly satisfy two sorts of consistency conditions:

(C1) If (s(1), s(2), ..., s(k)) is any permutation of (1, 2, ..., k) (meaning that s : {1, 2, ..., k} → {1, 2, ..., k} is one-to-one), then for distinct t₁, ..., t_k ∈ T, and any Borel H₁, ..., H_k ⊆ R, we have

μ_{t₁...t_k}(H₁ × ... × H_k) = μ_{t_{s(1)}...t_{s(k)}}(H_{s(1)} × ... × H_{s(k)}).   (15.1.1)
That is, if we permute the indices tᵢ, and correspondingly permute the sets in the product H₁ × ... × H_k, then we do not change the probabilities. For example, we must have P(X ∈ A, Y ∈ B) = P(Y ∈ B, X ∈ A), even though this will not usually equal P(Y ∈ A, X ∈ B).

(C2) For distinct t₁, ..., t_k ∈ T, and any Borel H₁, ..., H_{k−1} ⊆ R, we have

μ_{t₁...t_k}(H₁ × ... × H_{k−1} × R) = μ_{t₁...t_{k−1}}(H₁ × ... × H_{k−1}).   (15.1.2)

That is, allowing X_{t_k} to be anywhere in R is equivalent to not mentioning X_{t_k} at all. For example, P(X ∈ A, Y ∈ R) = P(X ∈ A).

Conditions (C1) and (C2) are quite obvious and uninteresting. However, what is surprising is that they have an immediate converse; that is, for any collection of finite-dimensional distributions satisfying them, there exists a corresponding stochastic process. A formal statement is:

Theorem 15.1.3. (Kolmogorov Existence Theorem) A family of Borel probability measures {μ_{t₁...t_k}; k ∈ N, tᵢ ∈ T distinct}, with μ_{t₁...t_k} a measure on R^k, satisfies the consistency conditions (C1) and (C2) above if and only if there exists a probability triple (R^T, F^T, P), and random variables {X_t}_{t∈T} defined on this triple, such that for all k ∈ N, distinct t₁, ..., t_k ∈ T, and Borel H ⊆ R^k, we have

P((X_{t₁}, ..., X_{t_k}) ∈ H) = μ_{t₁...t_k}(H).   (15.1.4)

The theorem thus says that, under extremely general conditions, stochastic processes exist. Theorems 7.1.1 and 8.1.1 follow immediately as special cases (Exercises 15.1.6 and 15.1.7).

The "only if" direction is immediate, as discussed above. To prove the "if" direction, we can take

R^T = {all functions T → R}   and   F^T = σ({X_t ∈ H}; t ∈ T, H ⊆ R Borel).

The idea of the proof is to first use (15.1.4) to define P for subsets of the form {(X_{t₁}, ..., X_{t_k}) ∈ H} (for distinct t₁, ..., t_k ∈ T, and Borel H ⊆ R^k), and then extend the definition of P to the entire σ-algebra F^T, analogously to the proof of the Extension Theorem (Theorem 2.3.1). The argument is somewhat involved; for details see e.g. Billingsley (1995, Theorem 36.1).

Exercise 15.1.5. Suppose that (15.1.1) holds whenever s is a transposition, i.e. a one-to-one function s : {1, 2, ..., k} → {1, 2, ..., k} such that
s(i) = i for k − 2 choices of i ∈ {1, 2, ..., k}. Prove that the first consistency condition is satisfied, i.e. that (15.1.1) holds for all permutations s.

Exercise 15.1.6. Let X₁, X₂, ... be independent, with Xₙ ~ μₙ.
(a) Specify the finite-dimensional distributions μ_{t₁,t₂,...,t_k} for distinct non-negative integers t₁, t₂, ..., t_k.
(b) Prove that these μ_{t₁,t₂,...,t_k} satisfy (15.1.1) and (15.1.2).
(c) Prove that Theorem 7.1.1 follows from Theorem 15.1.3.

Exercise 15.1.7. Consider the definition of a discrete Markov chain {Xₙ}, from Section 8.
(a) Specify the finite-dimensional distributions μ_{t₁,t₂,...,t_k} for non-negative integers t₁ < t₂ < ... < t_k.
(b) Prove that they satisfy (15.1.1).
(c) Specify the finite-dimensional distributions μ_{t₁,t₂,...,t_k} for general distinct non-negative integers t₁, t₂, ..., t_k, and explain why they satisfy (15.1.2). [Hint: It may be helpful to define the order statistics, whereby t_(r) is the r-th-largest element of {t₁, t₂, ..., t_k} for 1 ≤ r ≤ k.]
(d) Prove that Theorem 8.1.1 follows from Theorem 15.1.3.

15.2. Markov chains on general state spaces.

In Section 8 we considered Markov chains on countable state spaces S, in terms of an initial distribution {νᵢ}_{i∈S} and transition probabilities {p_ij}_{i,j∈S}. We now generalise many of the notions there to general (perhaps uncountable) state spaces.

We require a general state space X, which is any non-empty (perhaps uncountable) set, together with a σ-algebra F of measurable subsets. The transition probabilities are then given by {P(x, A)}_{x∈X, A∈F}. We make the following two assumptions:

(A1) For each fixed x ∈ X, P(x, ·) is a probability measure on (X, F).
(A2) For each fixed A ∈ F, P(x, A) is a measurable function of x ∈ X.

Intuitively, P(x, A) is the probability, if the chain is at a point x, that it will jump to the subset A at the next step. If X is countable, then P(x, {i}) corresponds to the transition probability p_{xi} of the discrete Markov chains of Section 8. But on a general state space, we may have P(x, {i}) = 0 for all i ∈ X.

We also require an initial distribution ν, which is any probability distribution on (X, F). The transition probabilities and initial distribution then give rise to a (discrete-time, general state space, time-homogeneous) Markov chain X₀, X₁, X₂, ..., where

P(X₀ ∈ A₀, X₁ ∈ A₁, ..., Xₙ ∈ Aₙ)
= ∫_{x₀∈A₀} ν(dx₀) ∫_{x₁∈A₁} P(x₀, dx₁) ... ∫_{x_{n−1}∈A_{n−1}} P(x_{n−2}, dx_{n−1}) P(x_{n−1}, Aₙ).   (15.2.1)

Note that these integrals (i.e., expected values) are well-defined because of condition (A2) above.

As before, we shall write P_x(···) for the probability of an event conditional on X₀ = x, i.e. under the assumption that the initial distribution ν is a point-mass at the point x. And, we define higher-order transition probabilities inductively by P¹(x, A) = P(x, A), and Pⁿ⁺¹(x, A) = ∫_X P(x, dz) Pⁿ(z, A) for n ≥ 1.

Analogous to the countable state space case, we define a stationary distribution for a Markov chain to be a probability measure π(·) on (X, F), such that π(A) = ∫_X π(dx) P(x, A) for all A ∈ F. (This generalises our earlier definition π_j = Σ_{i∈S} πᵢ p_ij.) As in the countably infinite case, Markov chains on general state spaces may or may not have stationary distributions.

Example 15.2.2. Consider the Markov chain on the real line (i.e. with X = R), where P(x, ·) = N(x/2, 3/4) for each x ∈ X. In words, if Xₙ is equal to some real number x, then the conditional distribution of Xₙ₊₁ will be normal, with mean x/2 and variance 3/4. Equivalently, Xₙ₊₁ = ½Xₙ + Uₙ₊₁, where {Uₙ} are i.i.d. with Uₙ ~ N(0, 3/4). This example is analysed in Exercise 15.2.5 below; in particular, it is shown that π(·) = N(0, 1) is a stationary distribution for this chain.

For countable state spaces S, we defined irreducibility to mean that for all i, j ∈ S, there is n ∈ N with p_ij^(n) > 0. On uncountable state spaces this definition is of limited use, since we will often (e.g. in the above example) have p_ij^(n) = 0 for all i, j ∈ X and all n ≥ 1. Instead, we say that a Markov chain on a general state space X is φ-irreducible if there is a non-zero, σ-finite measure ψ on (X, F) such that for any A ∈ F with ψ(A) > 0, we have P_x(τ_A < ∞) > 0 for all x ∈ X. (Here τ_A = inf{n ≥ 0; Xₙ ∈ A} is the first hitting time of the subset A; thus, τ_A < ∞ is the event that the chain eventually hits the subset A, and φ-irreducibility is the statement that the chain has positive probability of eventually hitting any subset A of positive ψ measure.)

Similarly, on countable state spaces S we defined aperiodicity to mean that for all i ∈ S, gcd{n ≥ 1 : p_ii^(n) > 0} = 1, but on uncountable state spaces we will usually have p_ii^(n) = 0 for all n. Instead, we define the period of a general-state-space Markov chain to be the largest (finite) positive integer d such that there are non-empty disjoint subsets X₁, ..., X_d ⊆ X,
with P(x, X_{i+1}) = 1 for all x ∈ Xᵢ (1 ≤ i ≤ d − 1), and P(x, X₁) = 1 for all x ∈ X_d. The chain is periodic if its period is greater than 1; otherwise the chain is aperiodic.

In terms of these definitions, a fundamental theorem about general state space Markov chains is the following generalisation of Theorem 8.3.10. For a proof see e.g. Meyn and Tweedie (1993).

Theorem 15.2.3. If a discrete-time Markov chain on a general state space is φ-irreducible and aperiodic, and furthermore has a stationary distribution π(·), then for π-almost every x ∈ X, we have that

lim_{n→∞} sup_{A∈F} |Pⁿ(x, A) − π(A)| = 0.

In words, the Markov chain converges to its stationary distribution in the total variation distance metric.

Exercise 15.2.4. Consider a Markov chain which is φ-irreducible with respect to some non-zero σ-finite measure ψ, and which is periodic with corresponding disjoint subsets X₁, ..., X_d. Let B = ∪_{i=1}^d Xᵢ.
(a) Prove that Pⁿ(x, B^c) = 0 for all x ∈ B.
(b) Prove that ψ(B^c) = 0.
(c) Prove that ψ(Xᵢ) > 0 for some i.

Exercise 15.2.5. Consider the Markov chain of Example 15.2.2.
(a) Let π(·) = N(0, 1) be the standard normal distribution. Prove that π(·) is a stationary distribution for this Markov chain.
(b) Prove that this Markov chain is φ-irreducible with respect to λ, where λ is Lebesgue measure on R.
(c) Prove that this Markov chain is aperiodic. [Hint: Don't forget Exercise 15.2.4.]
(d) Apply Theorem 15.2.3 to this Markov chain, writing your conclusion as explicitly as possible.

Exercise 15.2.6. (a) Prove that a Markov chain on a countable state space X is φ-irreducible if and only if there is j ∈ X such that Pᵢ(τⱼ < ∞) > 0 for all i ∈ X.
(b) Give an example of a Markov chain on a countable state space which is φ-irreducible, but which is not irreducible in the sense of Subsection 8.2.

Exercise 15.2.7. Show that, for a Markov chain on a countable state space S, the definition of aperiodicity from this subsection agrees with the previous definition from Definition 8.3.4.
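(Informal computational aside, related to Exercise 15.2.5. The convergence asserted by Theorem 15.2.3 for the chain of Example 15.2.2 can be seen empirically: run many independent copies of the chain from a fixed starting point and compare the empirical law of Xₙ with N(0, 1). The Python/NumPy sketch below uses arbitrary choices of starting point, number of copies, number of steps, and seed.)

    import numpy as np

    rng = np.random.default_rng(5)

    # The chain of Example 15.2.2: X_{n+1} = X_n / 2 + N(0, 3/4).
    chains = np.full(100_000, 20.0)          # start far from stationarity
    for _ in range(30):
        chains = chains / 2 + rng.normal(0.0, np.sqrt(0.75), chains.size)

    # Compare with the stationary distribution N(0, 1):
    print(f"mean      ~ {chains.mean():.3f}   (target 0)")
    print(f"variance  ~ {chains.var():.3f}   (target 1)")
    print(f"P(X <= 1) ~ {(chains <= 1).mean():.3f}   (target about 0.841)")

After only a few dozen steps the influence of the starting point 20 is negligible and the empirical law is very close to N(0, 1), consistent with total variation convergence.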
Exercise 15.2.8. Consider a discrete-time Markov chain with state space X = R, and with transition probabilities such that P(x, ·) is uniform on the interval [x − 1, x + 1]. Determine whether or not this chain is φ-irreducible.

Exercise 15.2.9. Recall that counting measure is the measure ψ(·) defined by ψ(A) = |A|, i.e. ψ(A) is the number of elements in the set A, with ψ(A) = ∞ if the set A is infinite.
(a) For a Markov chain on a countable state space X, prove that irreducibility in the sense of Subsection 8.2 is equivalent to φ-irreducibility with respect to counting measure on X.
(b) Prove that the Markov chain of Exercise 15.2.5 is not φ-irreducible with respect to counting measure on R. (That is, prove that there is a set A, and x ∈ X, such that P_x(τ_A < ∞) = 0, even though ψ(A) > 0 where ψ is counting measure.)
(c) Prove that counting measure on R is not σ-finite (cf. Remark 4.4.3).

Exercise 15.2.10. Consider the Markov chain with X = R, and with P(x, ·) = N(x, 1) for each x ∈ X.
(a) Prove that this chain is φ-irreducible and aperiodic.
(b) Prove that this chain does not have a stationary distribution. Relate this to Theorem 15.2.3.

Exercise 15.2.11. Let X = {1, 2, ...}. Let P(1, {1}) = 1, and for x ≥ 2, P(x, {1}) = 1/x² and P(x, {x + 1}) = 1 − (1/x²).
(a) Prove that this chain has stationary distribution π(·) = δ₁(·).
(b) Prove that this chain is φ-irreducible and aperiodic.
(c) Prove that if X₀ = x ≥ 2, then P[Xₙ = x + n for all n] > 0, and ‖Pⁿ(x, ·) − π(·)‖ does not converge to 0. Relate this to Theorem 15.2.3.

Exercise 15.2.12. Show that the finite-dimensional distributions implied by (15.2.1) satisfy the two consistency conditions of the Kolmogorov Existence Theorem. What does this allow us to conclude?

15.3. Continuous-time Markov processes.

In general, a continuous-time stochastic process is a collection {X_t}_{t≥0} of random variables, defined jointly on some probability triple, taking values in some state space X with σ-algebra F, and indexed by the non-negative real numbers T = {t ≥ 0}. Usually we think of the variable t as representing (continuous) time, so that X_t is the (random) state at some time t ≥ 0.
Such a process is a (continuous-time, time-homogeneous) Markov process if there are transition probabilities P^t(x, ·) for all t ≥ 0 and all x ∈ X, and an initial distribution ν, such that

P(X₀ ∈ A₀, X_{t₁} ∈ A₁, ..., X_{tₙ} ∈ Aₙ)
= ∫_{x₀∈A₀} ν(dx₀) ∫_{x₁∈A₁} P^{t₁}(x₀, dx₁) ∫_{x₂∈A₂} P^{t₂−t₁}(x₁, dx₂) ... ∫_{xₙ∈Aₙ} P^{tₙ−tₙ₋₁}(xₙ₋₁, dxₙ)   (15.3.1)

for all times 0 < t₁ < ... < tₙ and all subsets A₀, A₁, ..., Aₙ ∈ F. Letting P⁰(x, ·) be a point-mass at x, it then follows that

P^{s+t}(x, A) = ∫_X P^s(x, dy) P^t(y, A),   s, t ≥ 0, x ∈ X, A ∈ F.   (15.3.2)

(This is the semigroup property of the transition probabilities: P^{s+t} = P^s P^t.) On a countable state space X, we sometimes write p^t_{ij} for P^t(i, {j}); (15.3.2) can then be written as p^{s+t}_{ij} = Σ_{k∈X} p^s_{ik} p^t_{kj}. As in (8.0.5), p⁰_{ij} = δ_{ij} (i.e., at time t = 0, p^t_{ij} equals 1 for i = j and 0 otherwise).

Exercise 15.3.3. Let {P^t(x, ·)} be a collection of Markov transition probabilities satisfying (15.3.2). Define finite-dimensional distributions μ_{t₁...t_k} for t₁, ..., t_k ≥ 0 by μ_{t₁...t_k}(A₁ × ... × A_k) = P(X_{t₁} ∈ A₁, ..., X_{t_k} ∈ A_k) as implied by (15.3.1).
(a) Prove that the Kolmogorov consistency conditions are satisfied by {μ_{t₁...t_k}}. [Hint: You will need to use (15.3.2).]
(b) Apply the Kolmogorov Existence Theorem to {μ_{t₁...t_k}}, and describe precisely the conclusion.

Another important concept for continuous-time Markov processes is the generator. If the state space X is countable (discrete), then the generator is a matrix Q = (q_ij), defined for i, j ∈ X by

q_ij = lim_{t↘0} (p^t_{ij} − δ_{ij}) / t.   (15.3.4)

Since 0 ≤ p^t_{ij} ≤ 1 for all t ≥ 0 and all i and j, we see immediately that q_ii ≤ 0, while q_ij ≥ 0 if i ≠ j. Also, since Σ_j (p^t_{ij} − δ_{ij}) = 1 − 1 = 0, we have Σ_j q_ij = 0 for each i. The matrix Q represents a sort of "derivative" of P^t with respect to t. Hence, p^s_{ij} ≈ δ_{ij} + s q_ij for small s > 0, i.e. P^s = I + sQ + o(s) as s ↘ 0.

Exercise 15.3.5. Let {P^t} = {p^t_{ij}} be the transition probabilities for a continuous-time Markov process on a finite state space X, with finite
generator Q = (q_ij) given by (15.3.4). Show that P^t = exp(tQ). (Given a matrix M, we define exp(M) by exp(M) = I + M + (1/2!)M² + (1/3!)M³ + ..., where I is the identity matrix.) [Hint: P^t = (P^{t/n})ⁿ, and as n → ∞, we have P^{t/n} = I + (t/n)Q + O(1/n²).]

Exercise 15.3.6. Let X = {1, 2}, and let Q = (q_ij) be the generator of a continuous-time Markov process on X, with

Q = ( −3   3 )
    (  6  −6 ).

Compute the corresponding transition probabilities P^t = (p^t_{ij}) of the process, for any t > 0. [Hint: Use the previous exercise. It may help to know that the eigenvalues of Q are 0 and −9, with corresponding left eigenvectors (2, 1) and (1, −1).]

The generator Q may be interpreted as giving "jump rates" for our Markov processes. If the chain starts at state i and next jumps to state j, then the time before it jumps is exponentially distributed with mean 1/q_ij. We say that the process jumps from i to j at rate q_ij. Exercise 15.3.5 shows that the generator Q contains all the information necessary to completely reconstruct the transition probabilities P^t of the Markov process.

A probability distribution π(·) on X is a stationary distribution for the process {X_t}_{t≥0} if ∫ π(dx) P^t(x, ·) = π(·) for all t ≥ 0; if X is countable we can write this as π_j = Σ_{i∈X} πᵢ p^t_{ij} for all j ∈ X and all t ≥ 0. Note that it is no longer sufficient to check this equation for just one particular value of t (e.g. t = 1). Similarly, we say that {X_t}_{t≥0} is reversible with respect to π(·) if π(dx) P^t(x, dy) = π(dy) P^t(y, dx) for all x, y ∈ X and all t ≥ 0.

Exercise 15.3.7. Let {X_t}_{t≥0} be a Markov process. Prove that
(a) if {X_t}_{t≥0} is reversible with respect to π(·), then π(·) is a stationary distribution for {X_t}_{t≥0}.
(b) if for some T > 0 we have ∫ π(dx) P^t(x, ·) = π(·) for all 0 < t < T, then π(·) is a stationary distribution for {X_t}_{t≥0}.
(c) if for some T > 0 we have π(dx) P^t(x, dy) = π(dy) P^t(y, dx) for all x, y ∈ X and all 0 < t < T, then {X_t}_{t≥0} is reversible with respect to π(·).

Exercise 15.3.8. For a Markov chain on a finite state space X with generator Q, prove that {πᵢ}_{i∈X} is a stationary distribution if and only if πQ = 0, i.e. if and only if Σ_{i∈X} πᵢ q_ij = 0 for all j ∈ X.

Exercise 15.3.9. Let {Zₙ} be i.i.d. ~ Exp(5), so P[Zₙ > z] = e^{−5z} for z ≥ 0. Let Tₙ = Z₁ + Z₂ + ... + Zₙ for n ≥ 1. Let {X_t}_{t≥0} be a continuous-time Markov process on the state space X = {1, 2}, defined as follows. The
Exercise 15.3.9. Let $\{Z_n\}$ be i.i.d. $\sim \mathrm{Exp}(5)$, so $\mathbf{P}[Z_n > z] = e^{-5z}$ for $z \ge 0$. Let $T_n = Z_1 + Z_2 + \ldots + Z_n$ for $n \ge 1$. Let $\{X_t\}_{t \ge 0}$ be a continuous-time Markov process on the state space $\mathcal{X} = \{1, 2\}$, defined as follows. The process does not move except at the times $T_n$, and at each time $T_n$, the process jumps from its current state to the opposite state (i.e., from 1 to 2, or from 2 to 1). Compute the generator $Q$ for this process. [Hint: You may use without proof the following facts, which are consequences of the "memoryless property" of the exponential distribution: for any $0 \le a < b$, (i) $\mathbf{P}(\exists n : a < T_n < b) = 1 - e^{-5(b-a)}$, which is $\approx 5(b-a)$ as $b \searrow a$; and also (ii) $\mathbf{P}(\exists n : a < T_n < T_{n+1} < b) = 1 - e^{-5(b-a)}(1 + 5(b-a))$, which is $\approx 25(b-a)^2/2$ as $b \searrow a$.]

Exercise 15.3.10. (Poisson process.) Let $\lambda > 0$, let $\{Z_n\}$ be i.i.d. $\sim \mathrm{Exp}(\lambda)$, and let $T_n = Z_1 + Z_2 + \ldots + Z_n$ for $n \ge 1$. Let $\{N_t\}_{t \ge 0}$ be a continuous-time Markov process on the state space $\mathcal{X} = \{0, 1, 2, \ldots\}$, with $N_0 = 0$, which does not move except at the times $T_n$, and increases by 1 at each time $T_n$. (Equivalently, $N_t = \#\{n \in \mathbf{N} : T_n \le t\}$; intuitively, $N_t$ counts the number of events by time $t$.)
(a) Compute the generator $Q$ for this process. [Don't forget the hint from the previous exercise.]
(b) Prove that $\mathbf{P}(N_t \le m) = e^{-\lambda t} (\lambda t)^m / m! + \mathbf{P}(N_t \le m - 1)$ for $m = 0, 1, 2, \ldots$. [Hint: First argue that $\mathbf{P}(N_t \le m) = \mathbf{P}(T_{m+1} > t)$ and that $T_m \sim \mathrm{Gamma}(m, \lambda)$, and then use integration by parts; you may assume the result and hints of Exercise 9.5.17.]
(c) Conclude that $\mathbf{P}(N_t = j) = e^{-\lambda t} (\lambda t)^j / j!$, i.e. that $N_t \sim \mathrm{Poisson}(\lambda t)$. [Remark: More generally, there are Poisson processes with variable rate functions $\lambda : [0, \infty) \to [0, \infty)$, for which $N_t \sim \mathrm{Poisson}(\int_0^t \lambda(s)\, ds)$.]

Exercise 15.3.11. Let $\{N_t\}_{t \ge 0}$ be a Poisson process with rate $\lambda > 0$.
(a) For $0 < s < t$, compute $\mathbf{P}(N_s = 1 \,|\, N_t = 1)$.
(b) Use this to specify precisely the conditional distribution of the first event time $T_1$, conditional on $N_t = 1$.
(c) More generally, for $r < t$ and $m \in \mathbf{N}$, specify the conditional distribution of $T_m$, conditional on $N_r = m - 1$ and $N_t = m$.

Exercise 15.3.12. Let $\{N_t\}_{t \ge 0}$ be a Poisson process with rate $\lambda > 0$, let $0 < s < t$, and let $U_1, U_2$ be i.i.d. $\sim \mathrm{Uniform}[0, t]$.
(a) Compute $\mathbf{P}(N_s = 0 \,|\, N_t = 2)$.
(b) Compute $\mathbf{P}(U_1 > s \text{ and } U_2 > s)$.
(c) Compare the answers to parts (a) and (b). What does this comparison seem to imply?
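Remark. The conclusion of Exercise 15.3.10(c) can be illustrated by simulation (a sketch of ours, not from the text): build $N_t$ from i.i.d. exponential gaps and compare the empirical law of $N_t$ with the $\mathrm{Poisson}(\lambda t)$ probabilities.

```python
# Simulate a Poisson process via Exp(lambda) gaps and compare the empirical
# distribution of N_t with the Poisson(lambda * t) probabilities.
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(0)
lam, t, reps = 2.0, 3.0, 100_000

# N_t = #{n : T_n <= t}; 60 gaps of mean 1/lam = 0.5 easily covers t = 3.
gaps = rng.exponential(1.0 / lam, size=(reps, 60))
T = np.cumsum(gaps, axis=1)
N_t = (T <= t).sum(axis=1)

for j in range(4, 9):
    empirical = np.mean(N_t == j)
    poisson = exp(-lam * t) * (lam * t) ** j / factorial(j)
    print(j, round(empirical, 4), round(poisson, 4))
```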
15.4. Brownian motion as a limit.

Having discussed continuous-time processes on countable state spaces, we now turn to continuous-time processes on continuous state spaces. Such processes can move in continuous curves and therefore give rise to interesting random functions. The most important such process is Brownian motion, also called the Wiener process.

Brownian motion is best understood as a limit of discrete-time processes, as follows. Let $Z_1, Z_2, \ldots$ be any sequence of i.i.d. bounded random variables having mean 0 and variance 1, e.g. $Z_i = \pm 1$ with probability 1/2 each. For each $n \in \mathbf{N}$, define a discrete-time random process $\{Y_{i/n}^{(n)}\}_{i=0}^{\infty}$ iteratively by $Y_0^{(n)} = 0$, and

$$Y_{(i+1)/n}^{(n)} = Y_{i/n}^{(n)} + \frac{1}{\sqrt{n}}\, Z_{i+1}, \qquad i = 0, 1, 2, \ldots. \qquad (15.4.1)$$

Thus, $Y_{i/n}^{(n)} = \frac{1}{\sqrt{n}}(Z_1 + Z_2 + \ldots + Z_i)$. In particular, $\{Y_{i/n}^{(n)}\}$ is like simple symmetric random walk, except with time sped up by a factor of $n$, and space shrunk by a factor of $\sqrt{n}$.

[Figure 15.4.2. Constructing Brownian motion.]

We can then "fill in" the missing values by linear interpolation on each interval $[\frac{i}{n}, \frac{i+1}{n}]$. In this way, we obtain a continuous-time process $\{Y_t^{(n)}\}_{t \ge 0}$; see Figure 15.4.2.
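Remark. The construction (15.4.1) is easy to carry out on a computer; the following sketch (ours, not from the text; it assumes the matplotlib plotting library) draws one path for each of several values of $n$, so one can see the paths roughen as $n$ grows:

```python
# Random-walk paths Y^(n) with +/-1 steps, linearly interpolated on [0, 1],
# as in equation (15.4.1) and Figure 15.4.2.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

def Y_path(n):
    """Times i/n and values Y_{i/n}^{(n)} = (Z_1 + ... + Z_i) / sqrt(n)."""
    Z = rng.choice([-1.0, 1.0], size=n)
    return np.arange(n + 1) / n, np.concatenate([[0.0], np.cumsum(Z)]) / np.sqrt(n)

for n in [10, 100, 10_000]:
    times, values = Y_path(n)
    plt.plot(times, values, label=f"n = {n}")   # plot() interpolates linearly
plt.legend(); plt.title("Constructing Brownian motion (cf. Figure 15.4.2)")
plt.show()
```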
Thus, $\{Y_t^{(n)}\}_{t \ge 0}$ is a continuous-time process which agrees with $Y_{i/n}^{(n)}$ whenever $t = i/n$. Furthermore, $Y_t^{(n)}$ is always within $O(1/\sqrt{n})$ of $Y_{\lfloor tn \rfloor / n}^{(n)}$, where $\lfloor r \rfloor$ is the greatest integer not exceeding $r$. Intuitively, then, $\{Y_t^{(n)}\}_{t \ge 0}$ is like an ordinary discrete-time random walk, except that it has been interpolated to a continuous-time process, and furthermore time has been sped up by a factor of $n$ and space has been shrunk by a factor of $\sqrt{n}$. In summary, this process takes lots and lots of very small steps.

Remark. It is not essential that we choose the missing values so as to make the function linear on $[\frac{i}{n}, \frac{i+1}{n}]$. The same intuitive limit will be obtained no matter how the missing values are defined, provided only that they are close (as $n \to \infty$) to the corresponding $Y_{i/n}^{(n)}$ values. For example, another option is to use a constant interpolation of the missing values, by setting $Y_t^{(n)} = Y_{\lfloor tn \rfloor / n}^{(n)}$. In fact, this constant interpolation would make the following calculations cleaner by eliminating the $O(1/\sqrt{n})$ errors. However, the linear interpolation has the advantage that each $Y_t^{(n)}$ is then a continuous function of $t$, which makes it more intuitive why $B_t$ is also continuous.

Now, the factors $n$ and $\sqrt{n}$ have been chosen carefully. We see that $Y_t^{(n)} \approx Y_{\lfloor tn \rfloor / n}^{(n)} = \frac{1}{\sqrt{n}}(Z_1 + Z_2 + \ldots + Z_{\lfloor tn \rfloor})$. Thus, $Y_t^{(n)}$ is essentially $\frac{1}{\sqrt{n}}$ times the sum of $\lfloor tn \rfloor$ different i.i.d. random variables, each having mean 0 and variance 1. It follows from the ordinary Central Limit Theorem that, as $n \to \infty$ with $t$ fixed, we have $\mathcal{L}(Y_t^{(n)}) \Rightarrow N(0, t)$, i.e. for large $n$ the random variable $Y_t^{(n)}$ is approximately normal with mean 0 and variance $t$.

Let us note a couple of other facts. Firstly, for $s < t$,
$$\mathbf{E}\big(Y_s^{(n)} Y_t^{(n)}\big) \approx \mathbf{E}\big(Y_{\lfloor sn \rfloor / n}^{(n)}\, Y_{\lfloor tn \rfloor / n}^{(n)}\big) = \frac{1}{n}\, \mathbf{E}\big((Z_1 + \ldots + Z_{\lfloor sn \rfloor})(Z_1 + \ldots + Z_{\lfloor tn \rfloor})\big).$$
Since the $Z_i$ have mean 0 and variance 1, and are independent, this is equal to $\frac{\lfloor sn \rfloor}{n}$, which converges to $s$ as $n \to \infty$. Secondly, if $n \to \infty$ with $s < t$ fixed, the difference $Y_t^{(n)} - Y_s^{(n)}$ is within $O(1/\sqrt{n})$ of $\frac{1}{\sqrt{n}}(Z_{\lfloor sn \rfloor + 1} + \ldots + Z_{\lfloor tn \rfloor})$. Hence, by the Central Limit Theorem, $Y_t^{(n)} - Y_s^{(n)}$ converges weakly to $N(0, t - s)$ as $n \to \infty$, and is in fact independent of $Y_s^{(n)}$.

Brownian motion is, intuitively, the limit as $n \to \infty$ of the processes $\{Y_t^{(n)}\}_{t \ge 0}$. That is, we can intuitively define Brownian motion $\{B_t\}_{t \ge 0}$ by saying that $B_t = \lim_{n \to \infty} Y_t^{(n)}$ for each fixed $t \ge 0$. This analogy is very useful, in that all the $n \to \infty$ properties of $Y_t^{(n)}$ discussed above will apply to Brownian motion. In particular:
• normally distributed: $\mathcal{L}(B_t) = N(0, t)$ for any $t > 0$.
• covariance structure: $\mathbf{E}(B_s B_t) = s$ for $0 \le s \le t$.
• independent normal increments: $\mathcal{L}(B_{t_2} - B_{t_1}) = N(0, t_2 - t_1)$, and $B_{t_4} - B_{t_3}$ is independent of $B_{t_2} - B_{t_1}$ whenever $0 \le t_1 < t_2 \le t_3 < t_4$.

In addition, Brownian motion has one other very important property. It has continuous sample paths, meaning that for each fixed $\omega \in \Omega$, the function $B_t(\omega)$ is a continuous function of $t$. (However, it turns out that, with probability one, Brownian motion is not differentiable anywhere at all!)

Exercise 15.4.3. Let $\{B_t\}_{t \ge 0}$ be Brownian motion. Compute $\mathbf{E}[(B_2 + B_3 + 1)^2]$.

Exercise 15.4.4. Let $\{B_t\}_{t \ge 0}$ be Brownian motion, and let $X_t = 2t + 3B_t$ for $t \ge 0$.
(a) Compute the distribution of $X_t$ for $t \ge 0$.
(b) Compute $\mathbf{E}[(X_t)^2]$ for $t \ge 0$.
(c) Compute $\mathbf{E}(X_s X_t)$ for $0 \le s \le t$.

Exercise 15.4.5. Let $\{B_t\}_{t \ge 0}$ be Brownian motion. Compute the distribution of $Z_n$ for $n \in \mathbf{N}$, where $Z_n = \frac{1}{n}(B_1 + B_2 + \ldots + B_n)$.

Exercise 15.4.6. (a) Let $f : \mathbf{R} \to \mathbf{R}$ be a Lipschitz function, i.e. a function for which there exists $\alpha \in \mathbf{R}$ such that $|f(x) - f(y)| \le \alpha |x - y|$ for all $x, y \in \mathbf{R}$. Compute $\lim_{h \searrow 0} (f(t + h) - f(t))^2 / h$ for any $t \in \mathbf{R}$.
(b) Let $\{B_t\}$ be Brownian motion. Compute $\lim_{h \searrow 0} \mathbf{E}\big((B_{t+h} - B_t)^2 / h\big)$ for any $t \ge 0$.
(c) What do parts (a) and (b) seem to imply about Brownian motion?
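Remark. The covariance structure above is easy to check by Monte Carlo (a sketch of ours, not from the text), using the independent-increments property to generate $(B_s, B_t)$:

```python
# Monte Carlo check that E(B_s B_t) = min(s, t) = s for s <= t.
import numpy as np

rng = np.random.default_rng(2)
s, t, reps = 0.7, 1.5, 200_000

# Independent normal increments: B_s ~ N(0, s), B_t - B_s ~ N(0, t - s).
B_s = rng.normal(0.0, np.sqrt(s), reps)
B_t = B_s + rng.normal(0.0, np.sqrt(t - s), reps)

print(np.mean(B_s * B_t))   # approximately 0.7 = min(s, t)
print(np.var(B_t))          # approximately 1.5 = t
```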
15.5. Existence of Brownian motion.

We thus see that Brownian motion has many useful properties. However, we cannot make mathematical use of these properties until we have established that Brownian motion exists, i.e. that there is a probability triple $(\Omega, \mathcal{F}, \mathbf{P})$, with random variables $\{B_t\}_{t \ge 0}$ defined on it, which satisfy the properties specified above.

Now, the Kolmogorov Existence Theorem (Theorem 15.1.3) is of some help here. It ensures the existence of a stochastic process having the same finite-dimensional distributions as our desired process $\{B_t\}$. At first glance this might appear to be all we need. Indeed, the properties of being normally distributed, of having the right covariance structure, and of having independent increments, all follow immediately from the finite-dimensional distributions.

The problem, however, is that finite-dimensional distributions cannot guarantee the property of continuous sample paths. To see this, let $\{B_t\}_{t \ge 0}$ be Brownian motion, and let $U$ be a random variable which is distributed according to (say) Lebesgue measure on $[0, 1]$. Define a new stochastic process $\{B_t'\}_{t \ge 0}$ by $B_t' = B_t + \mathbf{1}_{\{U = t\}}$. That is, $B_t'$ is equal to $B_t$, except that if we happen to have $U = t$ then we add one to $B_t$. Now, since $\mathbf{P}(U = t) = 0$ for any fixed $t$, we see that $\{B_t'\}_{t \ge 0}$ has exactly the same finite-dimensional distributions as $\{B_t\}_{t \ge 0}$ does. On the other hand, obviously if $\{B_t\}_{t \ge 0}$ has continuous sample paths, then $\{B_t'\}_{t \ge 0}$ cannot. This shows that the Kolmogorov Existence Theorem is not sufficient to properly construct Brownian motion. Instead, one of several alternative approaches must be taken.

In the first approach, we begin by setting $B_0 = 0$ and choosing $B_1 \sim N(0, 1)$. We then define $B_{1/2}$ to have its correct conditional distribution, conditional on the values of $B_0$ and $B_1$. Continuing, we then define $B_{1/4}$ to have its correct conditional distribution, conditional on the values of $B_0$ and $B_{1/2}$; and similarly define $B_{3/4}$ conditional on the values of $B_{1/2}$ and $B_1$. Iteratively, we see that we can define $B_t$ for all values of $t$ which are dyadic rationals, i.e. which are of the form $\frac{i}{2^n}$ for some integers $i$ and $n$. We then argue that the function $\{B_t;\ t \text{ a non-negative dyadic rational}\}$ is uniformly continuous, and use this to argue that it can be "filled in" to a continuous function $\{B_t;\ t \ge 0\}$ defined for all non-negative real values $t$. For the details of this approach, see e.g. Billingsley (1995, Theorem 37.1).

In another approach, we define the processes $\{Y_t^{(n)}\}$ as above, and observe that $Y_t^{(n)}$ is a (random) continuous function of $t$. We then prove that as $n \to \infty$, the distributions of the functions $\{Y_t^{(n)}\}_{t \ge 0}$ (regarded as probability distributions on the space of all continuous functions) are tight, and in fact converge to the distribution of $\{B_t\}_{t \ge 0}$ that we want. We then use this fact to conclude the existence of the random variables $\{B_t\}_{t \ge 0}$. For details of this approach, see e.g. Fristedt and Gray (1997, Section 19.2).

Exercise 15.5.1. Construct a process $\{B_t''\}_{t \ge 0}$ which has the same finite-dimensional distributions as do $\{B_t\}_{t \ge 0}$ and $\{B_t'\}_{t \ge 0}$, but such that neither $\{B_t''\}_{t \ge 0}$ nor $\{B_t'' - B_t'\}_{t \ge 0}$ has continuous sample paths.

Exercise 15.5.2. Let $\{X_t\}_{t \in T}$ and $\{X_t'\}_{t \in T}$ be stochastic processes with the same countable time index $T$. Suppose $\{X_t\}_{t \in T}$ and $\{X_t'\}_{t \in T}$ have identical finite-dimensional distributions. Prove or disprove that $\{X_t\}_{t \in T}$ and $\{X_t'\}_{t \in T}$ must have the same full joint distribution.
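Remark. The first approach above can be sketched in code. It uses the standard fact (assumed here, not derived in the text) that conditional on $B_s$ and $B_t$, the midpoint value $B_{(s+t)/2}$ is normal with mean $\frac{1}{2}(B_s + B_t)$ and variance $(t-s)/4$. The following sketch of ours fills in the dyadic values on $[0, 1]$ level by level:

```python
# Dyadic (midpoint-type) construction sketch on [0, 1]: fill in B_t at the
# dyadic rationals i / 2^n, level by level.  Assumed fact: given B_s and B_t,
# the midpoint B_{(s+t)/2} is distributed N((B_s + B_t)/2, (t - s)/4).
import numpy as np

rng = np.random.default_rng(3)
levels = 10                                  # defines B at t = i / 2^levels
B = {0.0: 0.0, 1.0: rng.normal(0.0, 1.0)}    # B_0 = 0, B_1 ~ N(0, 1)

for n in range(1, levels + 1):
    step = 1.0 / 2 ** n
    for i in range(1, 2 ** n, 2):            # odd i give the new points
        s, t = (i - 1) * step, (i + 1) * step
        B[i * step] = rng.normal(0.5 * (B[s] + B[t]), np.sqrt((t - s) / 4.0))

times = sorted(B)
print(times[:5], [round(B[u], 3) for u in times[:5]])
```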
15.6. Diffusions and stochastic integrals.

Suppose we are given continuous functions $\mu, \sigma : \mathbf{R} \to \mathbf{R}$, and we replace equation (15.4.1) by

$$Y_{(i+1)/n}^{(n)} = Y_{i/n}^{(n)} + \frac{1}{\sqrt{n}}\, Z_{i+1}\, \sigma\big(Y_{i/n}^{(n)}\big) + \frac{1}{n}\, \mu\big(Y_{i/n}^{(n)}\big). \qquad (15.6.1)$$

That is, we multiply the effect of $Z_{i+1}$ by $\sigma(Y_{i/n}^{(n)})$, and we add in an extra drift of $\frac{1}{n}\, \mu(Y_{i/n}^{(n)})$. If we then interpolate this function to a function $\{Y_t^{(n)}\}$ defined for all $t \ge 0$, and let $n \to \infty$, we obtain a diffusion, with drift $\mu(x)$ and diffusion coefficient or volatility $\sigma(x)$. Intuitively, the diffusion is like Brownian motion, except that its mean is drifting according to the function $\mu(x)$, and its local variability is scaled according to the function $\sigma(x)$. In this context, Brownian motion is the special case $\mu(x) = 0$ and $\sigma(x) = 1$.

Now, for large $n$ we see from (15.4.1) that $\frac{1}{\sqrt{n}} Z_{i+1} \approx B_{(i+1)/n} - B_{i/n}$, so that (15.6.1) is essentially saying that

$$Y_{(i+1)/n}^{(n)} - Y_{i/n}^{(n)} \approx \big(B_{(i+1)/n} - B_{i/n}\big)\, \sigma\big(Y_{i/n}^{(n)}\big) + \frac{1}{n}\, \mu\big(Y_{i/n}^{(n)}\big).$$

If we now (intuitively) set $X_t = \lim_{n \to \infty} Y_t^{(n)}$ for each fixed $t \ge 0$, then we can write the limiting version of (15.6.1) as

$$dX_t = \sigma(X_t)\, dB_t + \mu(X_t)\, dt, \qquad (15.6.2)$$

where $\{B_t\}_{t \ge 0}$ is Brownian motion. The process $\{X_t\}_{t \ge 0}$ is called a diffusion with drift $\mu(x)$ and diffusion coefficient or volatility $\sigma(x)$. Its definition (15.6.2) can be interpreted roughly as saying that, as $h \searrow 0$, we have

$$X_{t+h} \approx X_t + \sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h; \qquad (15.6.3)$$

thus, given that $X_t = x$, we have that approximately $X_{t+h} \sim N(x + \mu(x) h,\ \sigma^2(x) h)$. If $\mu(x) = 0$ and $\sigma(x) = 1$, then the diffusion coincides exactly with Brownian motion.

We can also compute a generator for a diffusion defined by (15.6.2). To do this we must generalise the matrix generator (15.3.4) for processes on finite state spaces. We define the generator to be an operator $Q$ acting on smooth functions $f : \mathbf{R} \to \mathbf{R}$ by

$$(Qf)(x) = \lim_{h \searrow 0} \frac{1}{h}\, \mathbf{E}_x\big(f(X_{t+h}) - f(X_t)\big), \qquad (15.6.4)$$

where $\mathbf{E}_x$ means expectation conditional on $X_t = x$. That is, $Q$ maps one smooth function $f$ to another function, $Qf$.
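Remark. The approximation (15.6.3) translates directly into a simulation scheme, commonly called the Euler-Maruyama scheme (this name and the example drift below are ours, not the text's): step forward with $X_{t+h} \approx X_t + \sigma(X_t)(B_{t+h} - B_t) + \mu(X_t) h$, drawing each Brownian increment as $N(0, h)$.

```python
# Euler-Maruyama sketch for dX = sigma(X) dB + mu(X) dt, via (15.6.3).
import numpy as np

rng = np.random.default_rng(4)

def simulate_diffusion(mu, sigma, x0, T=1.0, n=10_000):
    """Approximate one path of the diffusion (15.6.2) on [0, T]."""
    h = T / n
    X = np.empty(n + 1)
    X[0] = x0
    dB = rng.normal(0.0, np.sqrt(h), n)      # Brownian increments ~ N(0, h)
    for i in range(n):
        X[i + 1] = X[i] + sigma(X[i]) * dB[i] + mu(X[i]) * h
    return X

# Illustrative choice: mu(x) = -x/2, sigma(x) = 1 (cf. Exercise 15.6.12 below).
path = simulate_diffusion(lambda x: -x / 2.0, lambda x: 1.0, x0=0.0)
print(path[-1], path.var())
```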
Now, for a diffusion satisfying (15.6.3), we can use a Taylor series expansion to approximate $f(X_{t+h})$ as

$$f(X_{t+h}) \approx f(X_t) + (X_{t+h} - X_t)\, f'(X_t) + \tfrac{1}{2}(X_{t+h} - X_t)^2\, f''(X_t). \qquad (15.6.5)$$

On the other hand, $X_{t+h} - X_t \approx \sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h$. Conditional on $X_t = x$, the conditional distribution of $B_{t+h} - B_t$ is normal with mean 0 and variance $h$; hence, the conditional distribution of $X_{t+h} - X_t$ is approximately normal with mean $\mu(x)\, h$ and variance $\sigma^2(x)\, h$. So, taking expected values conditional on $X_t = x$, we conclude from (15.6.5) that

$$\mathbf{E}_x[f(X_{t+h}) - f(X_t)] \approx \mu(x)\, h\, f'(x) + \tfrac{1}{2}\big[(\mu(x)\, h)^2 + \sigma^2(x)\, h\big]\, f''(x) = \mu(x)\, h\, f'(x) + \tfrac{1}{2}\, \sigma^2(x)\, h\, f''(x) + O(h^2).$$

It then follows from the definition (15.6.4) of generator that

$$(Qf)(x) = \mu(x)\, f'(x) + \tfrac{1}{2}\, \sigma^2(x)\, f''(x). \qquad (15.6.6)$$

We have thus given an explicit formula for the generator $Q$ of a diffusion defined by (15.6.2). Indeed, as with Markov processes on discrete spaces, diffusions are often characterised in terms of their generators, which again provide all the information necessary to completely reconstruct the process's transition probabilities.

Finally, we note that Brownian motion can be used to define a certain unusual kind of integral, called a stochastic integral or Itô integral. Let $\{B_t\}_{t \ge 0}$ be Brownian motion, and let $\{C_t\}_{t \ge 0}$ be a non-anticipative real-valued random process (i.e., the value of $C_t$ is conditionally independent of $\{B_s\}_{s > t}$, given $\{B_s\}_{s \le t}$). Then it is possible to define an integral of the form $\int_a^b C_t\, dB_t$, i.e. an integral "with respect to Brownian motion". Roughly speaking, this integral is defined by

$$\int_a^b C_t\, dB_t = \lim \sum_{i=0}^{m-1} C_{t_i}\big(B_{t_{i+1}} - B_{t_i}\big),$$

where $a = t_0 < t_1 < \ldots < t_m = b$, and where the limit is over finer and finer partitions $\{(t_{i-1}, t_i]\}$ of $(a, b]$. Note that this integral is random both because of the randomness of the values of the integrand $C_t$, and also
because of the randomness from Brownian motion $B_t$ itself. In particular, we now see that (15.6.2) can be written equivalently as

$$X_b - X_a = \int_a^b \sigma(X_t)\, dB_t + \int_a^b \mu(X_t)\, dt,$$

where the first integral is with respect to Brownian motion. Such stochastic integrals are used in a variety of research areas; for further details see e.g. Bhattacharya and Waymire (1990, Chapter VII), or the other references in Subsection B.5.

Exercise 15.6.7. Let $\{X_t\}_{t \ge 0}$ be a diffusion with constant drift $\mu(x) = a$ and constant volatility $\sigma(x) = b > 0$.
(a) Show that $X_t = at + b B_t$.
(b) Show that $\mathcal{L}(X_t) = N(at, b^2 t)$.
(c) Compute $\mathbf{E}(X_s X_t)$ for $0 \le s \le t$.

Exercise 15.6.8. Let $\{B_t\}_{t \ge 0}$ be standard Brownian motion, with $B_0 = 0$. Let $X_t = \int_0^t a\, ds + \int_0^t b\, dB_s = at + b B_t$ be a diffusion with constant drift $\mu(x) = a > 0$ and constant volatility $\sigma(x) = b > 0$. Let $Z_t = \exp\big[-(a + \tfrac{1}{2} b^2)\, t + X_t\big]$.
(a) Prove that $\{Z_t\}_{t \ge 0}$ is a martingale, i.e. that $\mathbf{E}[Z_t \,|\, Z_u\ (0 \le u \le s)] = Z_s$ for $0 \le s \le t$.
(b) Let $A, B > 0$, and let $T_A = \inf\{t \ge 0 : X_t = A\}$ and $T_{-B} = \inf\{t \ge 0 : X_t = -B\}$ denote the first hitting times of $A$ and $-B$, respectively. Compute $\mathbf{P}(T_A < T_{-B})$. [Hint: Use part (a); you may assume that Corollary 14.1.7 also applies in continuous time.]

Exercise 15.6.9. Let $g : \mathbf{R} \to \mathbf{R}$ be a smooth function with $g(x) > 0$ and $\int g(x)\, dx = 1$. Consider the diffusion $\{X_t\}$ defined by (15.6.2) with $\sigma(x) = 1$ and $\mu(x) = \frac{1}{2}\, g'(x) / g(x)$. Show that the probability distribution having density $g$ (with respect to Lebesgue measure on $\mathbf{R}$) is a stationary distribution for the diffusion. [Hint: Show that the diffusion is reversible with respect to $g(x)\, dx$. You may use (15.6.3) and Exercise 15.3.7.] This diffusion is called a Langevin diffusion.

Exercise 15.6.10. Let $\{p_{ij}^{(t)}\}$ be the transition probabilities for a Markov chain on a finite state space $\mathcal{X}$. Define the matrix $Q = (q_{ij})$ by (15.3.4). Let $f : \mathcal{X} \to \mathbf{R}$ be any function, and let $i \in \mathcal{X}$. Prove that $(Qf)_i$ (i.e., $\sum_{k \in \mathcal{X}} q_{ik} f_k$) corresponds to $(Qf)(i)$ from (15.6.4).

Exercise 15.6.11. Suppose a diffusion $\{X_t\}_{t \ge 0}$ satisfies
$$(Qf)(x) = 17\, f'(x) + 23\, x^2\, f''(x)$$
for all smooth $f : \mathbf{R} \to \mathbf{R}$. Compute the drift and volatility functions for this diffusion.

Exercise 15.6.12. Suppose a diffusion $\{X_t\}_{t \ge 0}$ has generator given by $(Qf)(x) = -\frac{x}{2}\, f'(x) + \frac{1}{2}\, f''(x)$. (Such a diffusion is called an Ornstein-Uhlenbeck process.)
(a) Write down a formula for $dX_t$. [Hint: Use (15.6.6).]
(b) Show that $\{X_t\}_{t \ge 0}$ is reversible with respect to the standard normal distribution, $N(0, 1)$. [Hint: Use Exercise 15.6.9.]

15.7. Itô's Lemma.

Let $\{X_t\}_{t \ge 0}$ be a diffusion satisfying (15.6.2), and let $f : \mathbf{R} \to \mathbf{R}$ be a smooth function. Then we might expect that $\{f(X_t)\}_{t \ge 0}$ would itself be a diffusion, and we might wonder what drift and volatility would correspond to $\{f(X_t)\}_{t \ge 0}$. Specifically, we would like to compute an expression for $d(f(X_t))$.

To that end, we consider small $h \approx 0$, and use the Taylor approximation

$$f(X_{t+h}) \approx f(X_t) + f'(X_t)(X_{t+h} - X_t) + \tfrac{1}{2}\, f''(X_t)(X_{t+h} - X_t)^2 + o(h).$$

Now, in the classical world, an expression like $(X_{t+h} - X_t)^2$ would be $O(h^2)$ and hence could be neglected in this computation. However, we shall see that for Itô integrals this is not the case.

We continue by writing

$$X_{t+h} - X_t = \int_t^{t+h} \sigma(X_s)\, dB_s + \int_t^{t+h} \mu(X_s)\, ds \approx \sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h.$$

Hence,

$$f(X_{t+h}) \approx f(X_t) + f'(X_t)\big[\sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h\big] + \tfrac{1}{2}\, f''(X_t)\big[\sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h\big]^2 + o(h).$$

In the expansion of $\big(\sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h\big)^2$, all terms are $o(h)$ aside from the first, so that

$$f(X_{t+h}) \approx f(X_t) + f'(X_t)\big[\sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h\big] + \tfrac{1}{2}\, f''(X_t)\big[\sigma(X_t)(B_{t+h} - B_t)\big]^2 + o(h).$$
But $B_{t+h} - B_t \sim N(0, h)$, so that $\mathbf{E}[(B_{t+h} - B_t)^2] = h$, and in fact $(B_{t+h} - B_t)^2 = h + o(h)$. (This is the departure from classical integration theory; there, we would have $(B_{t+h} - B_t)^2 = o(h)$.) Hence, as $h \to 0$,

$$f(X_{t+h}) \approx f(X_t) + f'(X_t)\big[\sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h\big] + \tfrac{1}{2}\, f''(X_t)\, \sigma^2(X_t)\, h + o(h).$$

We can write this in differential form as

$$d(f(X_t)) = f'(X_t)\big[\sigma(X_t)\, dB_t + \mu(X_t)\, dt\big] + \tfrac{1}{2}\, f''(X_t)\, \sigma^2(X_t)\, dt,$$

or

$$d(f(X_t)) = f'(X_t)\, \sigma(X_t)\, dB_t + \Big(f'(X_t)\, \mu(X_t) + \tfrac{1}{2}\, f''(X_t)\, \sigma^2(X_t)\Big)\, dt. \qquad (15.7.1)$$

That is, $\{f(X_t)\}$ is itself a diffusion, with new drift $f'(x)\, \mu(x) + \tfrac{1}{2}\, f''(x)\, \sigma^2(x)$ and new volatility $f'(x)\, \sigma(x)$.

Equation (15.7.1) is Itô's Lemma, or Itô's formula. It is a generalisation of the classical chain rule, and is very important for doing computations with Itô integrals. By (15.6.2), it implies that $d(f(X_t))$ is equal to $f'(X_t)\, dX_t + \tfrac{1}{2}\, f''(X_t)\, \sigma^2(X_t)\, dt$. The extra term $\tfrac{1}{2}\, f''(X_t)\, \sigma^2(X_t)\, dt$ arises because of the unusual nature of Brownian motion, namely that $(B_{t+h} - B_t)^2 \approx h$ instead of being $O(h^2)$. (In particular, this again indicates why Brownian motion is not differentiable; compare Exercise 15.4.6.)

Exercise 15.7.2. Consider the Ornstein-Uhlenbeck process $\{X_t\}$ of Exercise 15.6.12, with generator $(Qf)(x) = -\frac{x}{2}\, f'(x) + \frac{1}{2}\, f''(x)$.
(a) Let $Y_t = (X_t)^2$ for each $t \ge 0$. Compute $dY_t$.
(b) Let $Z_t = (X_t)^3$ for each $t \ge 0$. Compute $dZ_t$.

Exercise 15.7.3. Consider the diffusion $\{X_t\}$ of Exercise 15.6.11, with generator $(Qf)(x) = 17\, f'(x) + 23\, x^2\, f''(x)$. Let $W_t = (X_t)^5$ for each $t \ge 0$. Compute $dW_t$.
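Remark. The key fact $(B_{t+h} - B_t)^2 = h + o(h)$ can be seen numerically: summing squared Brownian increments over $[0, 1]$ gives approximately 1, however fine the partition, whereas the same sum for a smooth function vanishes. A small sketch of ours (not from the text):

```python
# Quadratic variation: sum of squared Brownian increments over [0, 1]
# concentrates near 1 as the partition is refined; for a Lipschitz function
# the analogous sum is of order 1/m and so tends to 0 (cf. Exercise 15.4.6).
import numpy as np

rng = np.random.default_rng(5)
for m in [100, 10_000, 1_000_000]:
    h = 1.0 / m
    dB = rng.normal(0.0, np.sqrt(h), m)    # increments B_{(i+1)h} - B_{ih}
    print(m, np.sum(dB ** 2))               # ~ 1 in each case

grid = np.linspace(0.0, 1.0, 1_000_001)
print(np.sum(np.diff(np.sin(grid)) ** 2))  # ~ 0 for the smooth function sin
```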
15.8. The Black-Scholes equation.

In mathematical finance, the prices of stocks are often modeled as diffusions. Various calculations, such as the value of complicated stock options or derivatives, then proceed on that basis. In this final subsection, we give a very brief introduction to financial models, and indicate the derivation of the famous Black-Scholes equation for pricing a certain kind of stock option. Our treatment is very cursory; for more complete discussion see books on mathematical finance, such as those listed in Subsection B.6.

For simplicity, we suppose there is a constant "risk-free" interest rate $r$. This means that an amount of money $M$ at present is worth $e^{rt} M$ at a time $t$ later. Equivalently, an amount of money $M$ a time $t$ later is worth only $e^{-rt} M$ at present.

We further suppose that there is a "stock" available for purchase. This stock has a purchase price $P_t$ at time $t$. It is assumed that this stock price can be modeled as a diffusion process, according to the equation

$$dP_t = b\, P_t\, dt + \sigma\, P_t\, dB_t, \qquad (15.8.1)$$

i.e. with $\mu(x) = bx$ and $\sigma(x) = \sigma x$. Here $B_t$ is Brownian motion, $b$ is the appreciation rate, and $\sigma$ is the volatility of the stock. For present purposes we assume that $b$ and $\sigma$ are constant, though more generally they (and also $r$) could be functions of $t$.

We wish to find a value, or fair price, for a "European call option" (an example of a financial derivative). A European call option is the option to purchase the stock at a fixed time $T > 0$ for a fixed price $q > 0$. Obviously, one would exercise this option if and only if $P_T > q$, and in this case one would gain an amount $P_T - q$. Therefore, the value at time $T$ of the European call option will be precisely $\max(0, P_T - q)$; it follows that the value at time 0 is $e^{-rT} \max(0, P_T - q)$.

The problem is that, at time 0, the future price $P_T$ is unknown (i.e. random). Hence, at time 0, the true value of the option is also unknown. What we wish to compute is a fair price for this option at time 0, given only the current price $P_0$ of the stock.

To specify what a fair price means is somewhat subtle. This price is taken to be the (unique) value such that neither the buyer nor the seller could strictly improve their payoff by investing in the stock directly (as opposed to investing in the option), no matter how sophisticated an investment strategy they use (including selling short, i.e. holding a negative number of shares) and no matter how often they buy and sell the stock (without being charged any transaction commissions). It turns out that for our model, the fair price of the option is equal to the expected value of the option given $P_0$, but only after replacing $b$ by $r$, i.e. after replacing the appreciation rate by the risk-free interest rate. We do not justify this fact here; for details see e.g. Theorem 1.2.1 of Karatzas (1997). The argument involves switching to a different probability measure (the risk-neutral equivalent martingale measure), under which $\{B_t\}$ is still a (different) Brownian motion, but where (15.8.1) is replaced by $dP_t = r\, P_t\, dt + \sigma\, P_t\, dB_t$.
We are thus interested in computing

$$\mathbf{E}\big(e^{-rT} \max(0,\, P_T - q) \,\big|\, P_0\big), \qquad (15.8.2)$$

under the condition that

$$b \text{ is replaced by } r. \qquad (15.8.3)$$

To proceed, we use Itô's Lemma to compute $d(\log P_t)$. This means that we let $f(x) = \log x$ in (15.7.1), so that $f'(x) = 1/x$ and $f''(x) = -1/x^2$. We apply this to the diffusion (15.8.1), under the condition (15.8.3), so that $\mu(x) = rx$ and $\sigma(x) = \sigma x$. We compute that

$$d(\log P_t) = \frac{1}{P_t}\, \sigma P_t\, dB_t + \Big(r P_t\, \frac{1}{P_t} + \frac{1}{2}\Big(-\frac{1}{P_t^2}\Big)(\sigma P_t)^2\Big)\, dt = \sigma\, dB_t + \big(r - \tfrac{1}{2}\sigma^2\big)\, dt.$$

Since these coefficients are all constants, it now follows that

$$\log(P_T / P_0) = \log P_T - \log P_0 = \sigma(B_T - B_0) + \big(r - \tfrac{1}{2}\sigma^2\big)(T - 0),$$

whence

$$P_T = P_0 \exp\big(\sigma(B_T - B_0) + (r - \tfrac{1}{2}\sigma^2)\, T\big).$$

Recall now that $B_T - B_0$ has the normal distribution $N(0, T)$, with density function $\frac{1}{\sqrt{2\pi T}}\, e^{-x^2 / 2T}$. Hence, by Proposition 6.2.3, we can compute the value (15.8.2) of the European call option as

$$\int_{-\infty}^{\infty} e^{-rT} \max\Big(0,\ P_0 \exp\big(\sigma x + (r - \tfrac{1}{2}\sigma^2)\, T\big) - q\Big)\, \frac{1}{\sqrt{2\pi T}}\, e^{-x^2 / 2T}\, dx. \qquad (15.8.4)$$

After some re-arranging and substituting, we simplify this integral to

$$P_0\, \Phi\Big(\tfrac{1}{\sigma\sqrt{T}}\big(\log(P_0/q) + T(r + \tfrac{1}{2}\sigma^2)\big)\Big) \; - \; q\, e^{-rT}\, \Phi\Big(\tfrac{1}{\sigma\sqrt{T}}\big(\log(P_0/q) + T(r - \tfrac{1}{2}\sigma^2)\big)\Big), \qquad (15.8.5)$$

where $\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-s^2/2}\, ds$ is the usual cumulative distribution function of the standard normal. Equation (15.8.5) thus gives the time-0 value, or fair price, of a European call option to buy at time $T > 0$ for price $q > 0$ a stock having initial price $P_0$ and price process $\{P_t\}_{t \ge 0}$ governed by (15.8.1).
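Remark. The derivation can be checked numerically (a sketch of ours, not from the text): evaluate (15.8.5) directly, and compare with a Monte Carlo estimate of (15.8.2)-(15.8.3) using $P_T = P_0 \exp(\sigma(B_T - B_0) + (r - \tfrac{1}{2}\sigma^2)T)$. The parameter values below are arbitrary illustrative choices.

```python
# Black-Scholes price (15.8.5) versus a Monte Carlo estimate of (15.8.2).
import numpy as np
from math import log, sqrt, exp
from statistics import NormalDist

Phi = NormalDist().cdf   # standard normal cdf

def black_scholes_call(P0, q, r, sigma, T):
    """European call price, equation (15.8.5)."""
    d1 = (log(P0 / q) + T * (r + 0.5 * sigma ** 2)) / (sigma * sqrt(T))
    d2 = (log(P0 / q) + T * (r - 0.5 * sigma ** 2)) / (sigma * sqrt(T))
    return P0 * Phi(d1) - q * exp(-r * T) * Phi(d2)

P0, q, r, sigma, T = 100.0, 105.0, 0.03, 0.2, 1.0
print(black_scholes_call(P0, q, r, sigma, T))

rng = np.random.default_rng(6)
x = rng.normal(0.0, sqrt(T), 1_000_000)                  # B_T - B_0 ~ N(0, T)
PT = P0 * np.exp(sigma * x + (r - 0.5 * sigma ** 2) * T)
print(np.mean(np.exp(-r * T) * np.maximum(PT - q, 0.0)))  # should agree
```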
Equation (15.8.5) is the well-known Black-Scholes equation for computing the price of a European call option. Recall that here $r$ is the risk-free interest rate, and $\sigma$ is the volatility. Furthermore, the appreciation rate $b$ does not appear in the final formula, which is advantageous because $b$ is difficult to estimate in practice. For further details of this subject, including generalisations to more complicated financial models, see any introductory book on the subject of mathematical finance, such as those listed in Subsection B.6.

Exercise 15.8.6. Show that (15.8.4) is indeed equal to (15.8.5).

Exercise 15.8.7. Consider the price formula (15.8.5), with $r$, $\sigma$, $T$, and $P_0$ fixed positive quantities.
(a) What happens to the price (15.8.5) as $q \searrow 0$? Does this result make intuitive sense?
(b) What happens to the price (15.8.5) as $q \to \infty$? Does this result make intuitive sense?

Exercise 15.8.8. Consider the price formula (15.8.5), with $r$, $\sigma$, $P_0$, and $q$ fixed positive quantities.
(a) What happens to the price (15.8.5) as $T \searrow 0$? [Hint: Consider separately the cases $q > P_0$, $q = P_0$, and $q < P_0$.] Does this result make intuitive sense?
(b) What happens to the price (15.8.5) as $T \to \infty$? Does this result make intuitive sense?

15.9. Section summary.

This section considered various aspects of the theory of stochastic processes, in a somewhat informal manner. All of the topics considered can be thought of as generalisations of the discrete-time, discrete-space processes of Sections 7 and 8.

First, the section considered existence questions, and stated (but did not prove) the important Kolmogorov Existence Theorem. It then discussed discrete-time processes on general (non-discrete) state spaces. It provided generalised notions of irreducibility and aperiodicity, and stated a theorem about convergence of irreducible, aperiodic Markov chains to their stationary distributions.

Next, continuous-time processes were considered. The semigroup property of Markov transition probabilities was presented. Generators of continuous-time, discrete-space Markov processes were discussed.

The section then focused on processes in continuous time and space. Brownian motion was developed as a limit of random walks that take smaller and smaller steps, more and more frequently; the normal, independent increments of Brownian motion then followed naturally.
Diffusions were then developed as generalisations of Brownian motion. Generators of diffusions were described, as were stochastic integrals. The section closed with intuitive derivations of Itô's Lemma, and of the Black-Scholes equation from mathematical finance. The reader was encouraged to consult the books of Subsections B.5 and B.6 for further details.
A. Mathematical Background.

This section reviews some of the mathematics that is necessary as a prerequisite to understanding this text. The material is reviewed in a form which is most helpful for our purposes. Note, however, that this material is merely an outline of what is needed; if this section is not largely review for you, then perhaps your background is insufficient to follow this text in detail, and you may wish to first consult references such as those in Subsection B.1.

A.1. Sets and functions.

A fundamental object in mathematics is a set, which is any collection of objects.* For example, $\{1, 3, 4, 7\}$, the collection of rational numbers, the collection of real numbers, and the empty set $\emptyset$ containing no elements, are all sets. Given a set, a subset of that set is any subcollection. For example, $\{3, 7\}$ is a subset of $\{1, 3, 4, 7\}$; in symbols, $\{3, 7\} \subseteq \{1, 3, 4, 7\}$.

Given a subset, its complement is the set consisting of all elements of a "universal" set which are not in the subset. In symbols, if $A \subseteq \Omega$, where $\Omega$ is understood to be the universal set, then $A^c = \{\omega \in \Omega;\ \omega \notin A\}$.

Given a collection of sets $\{A_\alpha\}_{\alpha \in I}$ (where $I$ is any index set; perhaps $I = \{1, 2\}$), we define their union to be the set of all elements which are in at least one of the $A_\alpha$; in symbols, $\bigcup_\alpha A_\alpha = \{\omega;\ \omega \in A_\alpha \text{ for some } \alpha \in I\}$. Similarly, we define their intersection to be the set of all elements which are in all of the $A_\alpha$; in symbols, $\bigcap_\alpha A_\alpha = \{\omega;\ \omega \in A_\alpha \text{ for all } \alpha \in I\}$. Union, intersection, and complement satisfy de Morgan's Laws, which state that $(\bigcup_\alpha A_\alpha)^c = \bigcap_\alpha A_\alpha^c$ and $(\bigcap_\alpha A_\alpha)^c = \bigcup_\alpha A_\alpha^c$. In words, complement converts unions to intersections and vice-versa.

A collection $\{A_\alpha\}_{\alpha \in I}$ of subsets of a set $\Omega$ is disjoint if the intersection of any pair of them is the empty set; in symbols, if $A_{\alpha_1} \cap A_{\alpha_2} = \emptyset$ whenever $\alpha_1, \alpha_2 \in I$ are distinct. The collection $\{A_\alpha\}_{\alpha \in I}$ is called a partition of $\Omega$ if $\{A_\alpha\}_{\alpha \in I}$ is disjoint and also $\bigcup_\alpha A_\alpha = \Omega$. If $\{A_\alpha\}_{\alpha \in I}$ are disjoint, we sometimes write their union as $\dot{\bigcup}_\alpha A_\alpha$ (e.g. $A_{\alpha_1} \mathbin{\dot{\cup}} A_{\alpha_2}$) and call it a disjoint union. We also define the difference of sets $A$ and $B$ by $A \setminus B = A \cap B^c$; for example, $\{1, 3, 4, 7\} \setminus \{3, 7, 9\} = \{1, 4\}$.

Given sets $\Omega_1$ and $\Omega_2$, we define their (Cartesian) product to be the set of all ordered pairs having first element from $\Omega_1$ and second element from $\Omega_2$; in symbols, $\Omega_1 \times \Omega_2 = \{(\omega_1, \omega_2);\ \omega_1 \in \Omega_1,\ \omega_2 \in \Omega_2\}$. Similarly, larger Cartesian products $\prod_\alpha \Omega_\alpha$ are defined.

*In fact, certain very technical restrictions are required to avoid contradictions such as Russell's paradox. But we do not consider that here.
A function $f$ is a mapping from a set $\Omega_1$ to a second set $\Omega_2$; in symbols, $f : \Omega_1 \to \Omega_2$.* Its domain is the set of points in $\Omega_1$ for which it is defined, and its range is the set of points in $\Omega_2$ which are of the form $f(\omega_1)$ for some $\omega_1 \in \Omega_1$.

If $f : \Omega_1 \to \Omega_2$, and if $S_1 \subseteq \Omega_1$ and $S_2 \subseteq \Omega_2$, then the image of $S_1$ under $f$ is given by $f(S_1) = \{\omega_2 \in \Omega_2;\ \omega_2 = f(\omega_1) \text{ for some } \omega_1 \in S_1\}$, and the inverse image of $S_2$ under $f$ is given by $f^{-1}(S_2) = \{\omega_1 \in \Omega_1;\ f(\omega_1) \in S_2\}$. We note that inverse images preserve the usual set operations, for example $f^{-1}(S_1 \cup S_2) = f^{-1}(S_1) \cup f^{-1}(S_2)$, $f^{-1}(S_1 \cap S_2) = f^{-1}(S_1) \cap f^{-1}(S_2)$, and $f^{-1}(S^c) = f^{-1}(S)^c$.

Special sets include the integers $\mathbf{Z} = \{0, 1, -1, 2, -2, 3, \ldots\}$, the positive integers or natural numbers $\mathbf{N} = \{1, 2, 3, \ldots\}$, the rational numbers $\mathbf{Q} = \{\frac{m}{n};\ m, n \in \mathbf{Z},\ n \ne 0\}$, and the real numbers $\mathbf{R}$.

Given a subset $S \subseteq \Omega$, its indicator function $\mathbf{1}_S : \Omega \to \mathbf{R}$ is defined by
$$\mathbf{1}_S(\omega) = \begin{cases} 1, & \omega \in S \\ 0, & \omega \notin S. \end{cases}$$

Finally, the Axiom of Choice states that given a collection $\{A_\alpha\}_{\alpha \in I}$ of non-empty sets (i.e., $A_\alpha \ne \emptyset$ for all $\alpha$), their Cartesian product $\prod_\alpha A_\alpha$ is also non-empty. In symbols,

$$\prod_\alpha A_\alpha \ne \emptyset \quad \text{whenever } A_\alpha \ne \emptyset \text{ for each } \alpha. \qquad (A.1.1)$$

That is, it is possible to "choose" one element simultaneously from each of the non-empty sets $A_\alpha$ -- a fact that seems straightforward, but does not follow from the other axioms of set theory.

A.2. Countable sets.

One important property of sets is their cardinality, or size. A set $\Omega$ is finite if for some $n \in \mathbf{N}$ and some function $f : \mathbf{N} \to \Omega$, we have $f(\{1, 2, \ldots, n\}) \supseteq \Omega$. A set is countable if for some function $f : \mathbf{N} \to \Omega$, we have $f(\mathbf{N}) = \Omega$. (Note that, by this definition, all finite sets are countable. As a special case, the empty set $\emptyset$ is both finite and countable.) A set is countably infinite if it is countable but not finite. It is uncountable if it is not countable.

*More formally, it is a collection of ordered pairs $(\omega_1, \omega_2) \in \Omega_1 \times \Omega_2$, with $\omega_1 \in \Omega_1$ and $\omega_2 = f(\omega_1) \in \Omega_2$, such that no $\omega_1$ appears in more than one pair.
Intuitively, a set is countable if its elements can be arranged in a sequence. It turns out that countability of sets is very important in probability theory. The axioms of probability theory make certain guarantees about countable operations (e.g., the countable additivity of probability measures) which do not necessarily hold for uncountable operations. Thus it is important to understand which sets are countable and which are not.

The natural numbers $\mathbf{N}$ are obviously countable; indeed, we can take the identity function (i.e., $f(n) = n$ for all $n \in \mathbf{N}$) in the definition of countable above. Furthermore, it is clear that any subset of a countable set is again countable; hence, any subset of $\mathbf{N}$ is also countable. The integers $\mathbf{Z}$ are also countable: we can take, say, $f(1) = 0$, $f(2) = 1$, $f(3) = -1$, $f(4) = 2$, $f(5) = -2$, $f(6) = 3$, etc., to ensure that $f(\mathbf{N}) = \mathbf{Z}$. More surprising is

Proposition A.2.1. Let $\Omega_1$ and $\Omega_2$ be countable sets. Then their Cartesian product $\Omega_1 \times \Omega_2$ is also countable.

Proof. Let $f_1$ and $f_2$ be such that $f_1(\mathbf{N}) = \Omega_1$ and $f_2(\mathbf{N}) = \Omega_2$. Then define a new function $f : \mathbf{N} \to \Omega_1 \times \Omega_2$ by $f(1) = (f_1(1), f_2(1))$, $f(2) = (f_1(1), f_2(2))$, $f(3) = (f_1(2), f_2(1))$, $f(4) = (f_1(1), f_2(3))$, $f(5) = (f_1(2), f_2(2))$, $f(6) = (f_1(3), f_2(1))$, $f(7) = (f_1(1), f_2(4))$, etc. (See Figure A.2.2.) Then $f(\mathbf{N}) = \Omega_1 \times \Omega_2$, as required.

[Figure A.2.2. Constructing $f$ in Proposition A.2.1: the pairs $(i, j)$ are enumerated along successive diagonals.]

From this proposition, we see that e.g. $\mathbf{N} \times \mathbf{N}$ is countable, as is $\mathbf{Z} \times \mathbf{Z}$.
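The diagonal enumeration in the proof is easy to write out explicitly (a small sketch of ours, not from the text):

```python
# Enumerate N x N along the diagonals {(i, j) : i + j = d}, d = 2, 3, 4, ...,
# reproducing the order f(1), f(2), f(3), ... used in Proposition A.2.1.
def diagonal_pairs(count):
    """Return the first `count` pairs (i, j) of N x N in diagonal order."""
    out = []
    d = 2
    while len(out) < count:
        for i in range(1, d):
            out.append((i, d - i))
        d += 1
    return out[:count]

print(diagonal_pairs(10))
# [(1, 1), (1, 2), (2, 1), (1, 3), (2, 2), (3, 1), (1, 4), (2, 3), (3, 2), (4, 1)]
```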
It follows immediately (and, perhaps, surprisingly) that the set of all rational numbers is countable; indeed, $\mathbf{Q}$ is equivalent to a subset of $\mathbf{Z} \times \mathbf{Z}$ if we identify the rational number $m/n$ (in lowest terms) with the element $(m, n) \in \mathbf{Z} \times \mathbf{Z}$. It also follows that, given a sequence $\Omega_1, \Omega_2, \ldots$ of countable sets, their countable union $\Omega = \bigcup_i \Omega_i$ is also countable; indeed, if $f_i(\mathbf{N}) = \Omega_i$, then we can identify $\Omega$ with a subset of $\mathbf{N} \times \mathbf{N}$ by the mapping $(m, n) \mapsto f_m(n)$.

For another example of the use of Proposition A.2.1, recall that a real number is algebraic if it is a root of some non-constant polynomial with integer coefficients.

Exercise A.2.3. Prove that the set of all algebraic numbers is countable.

On the other hand, the set of all real numbers is not countable. Indeed, even the unit interval $[0, 1]$ is not countable. To see this, suppose to the contrary that it were countable, with $f : \mathbf{N} \to [0, 1]$ such that $f(\mathbf{N}) = [0, 1]$. Imagine writing each element of $[0, 1]$ in its usual base-10 expansion, and let $d_i$ be the $i$th digit of the number $f(i)$. Now define $c_i$ by: $c_i = 4$ if $d_i = 5$, while $c_i = 5$ if $d_i \ne 5$. (In particular, $c_i \ne d_i$.) Then let $x = \sum_{i=1}^{\infty} c_i\, 10^{-i}$ (so that the base-10 expansion of $x$ is $0.c_1 c_2 c_3 \ldots$). Then $x$ differs from $f(i)$ in the $i$th digit of their base-10 expansions, for each $i$. Hence, $x \ne f(i)$ for any $i$. This contradicts the assumption that $f(\mathbf{N}) = [0, 1]$.

A.3. Epsilons and Limits.

Real analysis and probability theory make constant use of arguments involving arbitrarily small $\epsilon > 0$. A basic starting block is

Proposition A.3.1. Let $a$ and $b$ be two real numbers. Suppose that, for any $\epsilon > 0$, we have $a \le b + \epsilon$. Then $a \le b$.

Proof. Suppose to the contrary that $a > b$. Let $\epsilon = \frac{a - b}{2}$. Then $\epsilon > 0$, but it is not the case that $a \le b + \epsilon$. This is a contradiction.

Arbitrarily small $\epsilon > 0$ are used to define the important concept of a limit. We say that a sequence of real numbers $x_1, x_2, \ldots$ converges to the real number $x$ (or, has limit $x$) if, given any $\epsilon > 0$, there is $N \in \mathbf{N}$ (which may depend on $\epsilon$) such that for any $n \ge N$, we have $|x_n - x| < \epsilon$. We shall also write this as $\lim_{n \to \infty} x_n = x$, and shall sometimes abbreviate this to $\lim_n x_n = x$ or even $\lim x_n = x$. We also write this as $\{x_n\} \to x$.

Exercise A.3.2. Use the definition of a limit to prove that
(a) $\lim_{n \to \infty} \frac{1}{n} = 0$.
(b) $\lim_{n \to \infty} \frac{1}{n^k} = 0$ for any $k > 0$.
(c) $\lim_{n \to \infty} 2^{1/n} = 1$.

A useful reformulation of limits is given by

Proposition A.3.3. Let $\{x_n\}$ be a sequence of real numbers. Then $\{x_n\} \to x$ if and only if for each $\epsilon > 0$, the set $\{n \in \mathbf{N};\ |x_n - x| \ge \epsilon\}$ is finite.

Proof. If $\{x_n\} \to x$, then given $\epsilon > 0$, there is $N \in \mathbf{N}$ such that $|x_n - x| < \epsilon$ for all $n \ge N$. Hence, the set $\{n;\ |x_n - x| \ge \epsilon\}$ contains at most $N - 1$ elements, and so is finite. Conversely, given $\epsilon > 0$, if the set $\{n;\ |x_n - x| \ge \epsilon\}$ is finite, then it has some largest element, say $K$. Setting $N = K + 1$, we see that $|x_n - x| < \epsilon$ whenever $n \ge N$. Hence, $\{x_n\} \to x$.

Exercise A.3.4. Use Proposition A.3.3 to provide an alternative proof for each of the three parts of Exercise A.3.2.

A sequence which does not converge is said to diverge. There is one special case. A sequence $\{x_n\}$ converges to infinity if for all $M \in \mathbf{R}$, there is $N \in \mathbf{N}$, such that $x_n > M$ whenever $n \ge N$. (Similarly, $\{x_n\}$ converges to negative infinity if for all $M \in \mathbf{R}$, there is $N \in \mathbf{N}$, such that $x_n < M$ whenever $n \ge N$.) This is a special kind of "convergence" which is also sometimes referred to as divergence!

Limits have many useful properties. For example,

Exercise A.3.5. Prove the following:
(a) If $\lim_n x_n = x$, and $a \in \mathbf{R}$, then $\lim_n a x_n = a x$.
(b) If $\lim_n x_n = x$ and $\lim_n y_n = y$, then $\lim_n (x_n + y_n) = x + y$.
(c) If $\lim_n x_n = x$ and $\lim_n y_n = y$, then $\lim_n (x_n y_n) = x y$.
(d) If $\lim_n x_n = x$, and $x \ne 0$, then $\lim_n (1 / x_n) = 1 / x$.

Another useful property is given by

Proposition A.3.6. Suppose $x_n \to x$, and $x_n \le a$ for all $n \in \mathbf{N}$. Then $x \le a$.

Proof. Suppose to the contrary that $x > a$. Let $\epsilon = \frac{x - a}{2}$. Then $\epsilon > 0$, but for all $n \in \mathbf{N}$ we have $|x_n - x| \ge x - x_n \ge x - a > \epsilon$, contradicting the assertion that $x_n \to x$.
Given a sequence $\{x_n\}$, a subsequence $\{x_{n_k}\}$ is any sequence formed from $\{x_n\}$ by omitting some of the elements. (For example, one subsequence of the sequence $(1, 3, 5, 7, 9, \ldots)$ of all positive odd numbers is the sequence $(3, 9, 27, 81, \ldots)$ of all positive integer powers of 3.) The Bolzano-Weierstrass theorem, a consequence of compactness, says that every bounded sequence contains a convergent subsequence.

Limits are also used to define infinite series. Indeed, the sum $\sum_{i=1}^{\infty} s_i$ is simply a shorthand way of writing $\lim_{n \to \infty} \sum_{i=1}^{n} s_i$. If this limit exists and is finite, then we say that the series $\sum_{i=1}^{\infty} s_i$ converges; otherwise we say the series diverges. (In particular, for infinite series, converging to infinity is usually referred to as diverging.) If the $s_i$ are non-negative, then $\sum_{i=1}^{\infty} s_i$ either converges to a finite value or diverges to infinity; we write this as $\sum_{i=1}^{\infty} s_i < \infty$ and $\sum_{i=1}^{\infty} s_i = \infty$, respectively. For example, it is not hard to show (by comparing $\sum_{i=1}^{\infty} i^{-a}$ to the integral $\int_1^{\infty} t^{-a}\, dt$) that

$$\sum_{i=1}^{\infty} (1/i^a) < \infty \quad \text{if and only if} \quad a > 1. \qquad (A.3.7)$$

Exercise A.3.8. Prove that $\sum_{i=2}^{\infty} \big(1 / (i \log(i))\big) = \infty$. [Hint: Show $\sum_{i=2}^{\infty} \big(1 / (i \log(i))\big) \ge \int_2^{\infty} \big(dx / (x \log x)\big)$.]

Exercise A.3.9. Prove that $\sum_{i=3}^{\infty} \big(1 / (i \log(i) \log\log(i))\big) = \infty$, but $\sum_{i=3}^{\infty} \big(1 / (i \log(i) [\log\log(i)]^2)\big) < \infty$.
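Partial sums give a rough numerical feel for (A.3.7) and Exercise A.3.8 (a suggestive sketch of ours, not a proof: divergence of a series can never be verified by finite computation):

```python
# Partial sums: 1/i^2 settles down quickly, while 1/i and 1/(i log i) keep
# growing -- the latter extremely slowly, like log log n.
import numpy as np

i = np.arange(2, 1_000_001, dtype=float)
for label, terms in [("1/i^2", 1 / i**2),
                     ("1/i", 1 / i),
                     ("1/(i log i)", 1 / (i * np.log(i)))]:
    partial = np.cumsum(terms)
    print(label, partial[998], partial[-1])   # n = 1e3 versus n = 1e6
```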
A.4. Infimums and supremums.

Given a set $\{x_\alpha\}_{\alpha \in I}$ of real numbers, a lower bound for them is a real number $\ell$ such that $x_\alpha \ge \ell$ for all $\alpha \in I$. The set $\{x_\alpha\}_{\alpha \in I}$ is bounded below if it has at least one lower bound. A lower bound is called the greatest lower bound if it is at least as large as any other lower bound. A very important property of the real numbers is: Any non-empty set of real numbers which is bounded below has a unique greatest lower bound. The uniqueness part of this is rather obvious; however, the existence part is very subtle. For example, this property would not hold if we restricted ourselves to rational numbers. Indeed, the set $\{q \in \mathbf{Q};\ q > 0,\ q^2 > 2\}$ does not have a greatest lower bound if we allow rational numbers only; however, if we allow all real numbers, then the greatest lower bound is $\sqrt{2}$.

The greatest lower bound is also called the infimum, and is written as $\inf\{x_\alpha;\ \alpha \in I\}$ or $\inf_{\alpha \in I} x_\alpha$. We similarly define the supremum of a set of real numbers to be their least upper bound. By convention, for the empty set $\emptyset$, we define $\inf \emptyset = \infty$ and $\sup \emptyset = -\infty$. Also, if a set $S$ is not bounded below, then $\inf S = -\infty$; similarly, if it is not bounded above, then $\sup S = \infty$. One obvious but useful fact is

$$\inf S \le x, \quad \text{for any } x \in S. \qquad (A.4.1)$$

Of course, if a set of real numbers has a minimal element (for example, if the set is finite), then this minimal element is the infimum. However, infimums exist even for sets without minimal elements. For example, if $S$ is the set of all positive real numbers, then $\inf S = 0$, even though $0 \notin S$.

A simple but useful property of infimums is the following.

Proposition A.4.2. Let $S$ be a non-empty set of real numbers which is bounded below. Let $a = \inf S$. Then for any $\epsilon > 0$, there is $s \in S$ with $a \le s < a + \epsilon$.

Proof. Suppose, to the contrary, that there is no such $s$. Clearly there is no $s \in S$ with $s < a$ (otherwise $a$ would not be a lower bound for $S$); hence, it must be that all $s \in S$ satisfy $s \ge a + \epsilon$. But in this case, $a + \epsilon$ is a lower bound for $S$ which is larger than $a$. This contradicts the assertion that $a$ was the greatest lower bound for $S$.

For example, if $S$ is the interval $(5, 20)$, then $\inf S = 5$, and $5 \notin S$, but for any $\epsilon > 0$, there is $x \in S$ with $x < 5 + \epsilon$.

Exercise A.4.3. (a) Compute $\inf\{x \in \mathbf{R} : x > 10\}$.
(b) Compute $\inf\{q \in \mathbf{Q} : q > 10\}$.
(c) Compute $\inf\{q \in \mathbf{Q} : q \ge 10\}$.

Exercise A.4.4. (a) Let $R, S \subseteq \mathbf{R}$ each be non-empty and bounded below. Prove that $\inf(R \cup S) = \min(\inf R, \inf S)$.
(b) Prove that this formula continues to hold if $R = \emptyset$ and/or $S = \emptyset$.
(c) State and prove a similar formula for $\sup(R \cup S)$.

Exercise A.4.5. Suppose $\{a_n\} \to a$. Prove that $\inf_n a_n \le a \le \sup_n a_n$. [Hint: Use proof by contradiction.]

Two special kinds of limits involve infimums and supremums, namely the limit inferior and the limit superior, defined by $\liminf_n x_n = \lim_{n \to \infty} \inf_{k \ge n} x_k$ and $\limsup_n x_n = \lim_{n \to \infty} \sup_{k \ge n} x_k$, respectively. Indeed, such limits always exist (though they may be infinite). Furthermore, $\lim_n x_n$ exists if and only if $\liminf_n x_n = \limsup_n x_n$.
Exercise A.4.6. Suppose $\{a_n\} \to a$ and $\{b_n\} \to b$, with $a < b$. Let $c_n = a_n$ for $n$ odd, and $c_n = b_n$ for $n$ even. Compute $\liminf_n c_n$ and $\limsup_n c_n$.

We shall sometimes use the "order" notations, $O(\cdot)$ and $o(\cdot)$. A quantity $g(x)$ is said to be $O(h(x))$ as $x \to \xi$ if $\limsup_{x \to \xi} |g(x) / h(x)| < \infty$. (Here $\xi$ can be $\infty$, or $-\infty$, or 0, or any other quantity. Also, $h(x)$ can be any function, such as $x$, or $x^2$, or $1/x$, or 1.) Similarly, $g(x)$ is said to be $o(h(x))$ as $x \to \xi$ if $\lim_{x \to \xi} g(x) / h(x) = 0$.

Exercise A.4.7. Prove that $\{a_n\} \to 0$ if and only if $a_n = o(1)$ as $n \to \infty$.

Finally, we note the following. We see by Exercise A.3.5(b) and induction that if $\{x_{nk}\}$ are real numbers for $n \in \mathbf{N}$ and $1 \le k \le K < \infty$, with $\lim_{n \to \infty} x_{nk} = 0$ for each $k$, then $\lim_{n \to \infty} \sum_{k=1}^{K} x_{nk} = 0$ as well. However, if there are an infinite number of different $k$ then this is not true in general (for example, suppose $x_{nn} = 1$ but $x_{nk} = 0$ for $k \ne n$). Still, the following proposition gives a useful condition under which this conclusion holds.

Proposition A.4.8. (The M-test.) Let $\{x_{nk}\}_{n, k \in \mathbf{N}}$ be a collection of real numbers. Suppose that $\lim_{n \to \infty} x_{nk} = a_k$ for each fixed $k \in \mathbf{N}$. Suppose further that $\sum_{k=1}^{\infty} \sup_n |x_{nk}| < \infty$. Then $\lim_{n \to \infty} \sum_{k=1}^{\infty} x_{nk} = \sum_{k=1}^{\infty} a_k$.

Proof. The hypotheses imply that $\sum_{k=1}^{\infty} |a_k| < \infty$. Hence, by replacing $x_{nk}$ by $x_{nk} - a_k$, it suffices to assume that $a_k = 0$ for all $k$. Fix $\epsilon > 0$. Since $\sum_{k=1}^{\infty} \sup_n |x_{nk}| < \infty$, we can find $K \in \mathbf{N}$ such that $\sum_{k=K+1}^{\infty} \sup_n |x_{nk}| < \epsilon/2$. Since $\lim_{n \to \infty} x_{nk} = 0$, we can find (for $k = 1, 2, \ldots, K$) numbers $N_k$ with $|x_{nk}| < \epsilon / 2K$ for all $n \ge N_k$. Let $N = \max(N_1, \ldots, N_K)$. Then for $n \ge N$, we have $\big|\sum_{k=1}^{\infty} x_{nk}\big| \le \sum_{k=1}^{K} |x_{nk}| + \sum_{k=K+1}^{\infty} \sup_n |x_{nk}| < K \cdot \frac{\epsilon}{2K} + \frac{\epsilon}{2} = \epsilon$. Hence, $\lim_{n \to \infty} \sum_{k=1}^{\infty} x_{nk} = 0$. The result follows.

If $\lim_{n \to \infty} x_{nk} = a_k$ for each fixed $k \in \mathbf{N}$, with $x_{nk} \ge 0$, but we do not know that $\sum_{k=1}^{\infty} \sup_n |x_{nk}| < \infty$, then we still have

$$\lim_{n \to \infty} \sum_{k=1}^{\infty} x_{nk} \ge \sum_{k=1}^{\infty} a_k, \qquad (A.4.9)$$

assuming this limit exists. Indeed, if not, then we could find some finite $K \in \mathbf{N}$ with $\lim_{n \to \infty} \sum_{k=1}^{\infty} x_{nk} < \sum_{k=1}^{K} a_k$, contradicting the fact that $\lim_{n \to \infty} \sum_{k=1}^{\infty} x_{nk} \ge \lim_{n \to \infty} \sum_{k=1}^{K} x_{nk} = \sum_{k=1}^{K} a_k$.
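The M-test can be illustrated numerically (a sketch of ours, not from the text): take $x_{nk} = \frac{n}{n+k} \cdot \frac{1}{k^2}$, so that $a_k = 1/k^2$ and $\sup_n |x_{nk}| \le 1/k^2$ is summable, and watch $\sum_k x_{nk}$ approach $\sum_k 1/k^2 = \pi^2/6$:

```python
# Numerical illustration of Proposition A.4.8 with x_nk = (n/(n+k)) / k^2.
import numpy as np

k = np.arange(1, 1_000_001, dtype=float)   # truncate the (convergent) sum
for n in [1, 10, 1_000, 100_000]:
    x_nk = (n / (n + k)) / k**2
    print(n, x_nk.sum())
print("limit:", np.pi**2 / 6)
```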
A.5. Equivalence relations.

In one place in the notes (the proof of Proposition 1.2.6), the idea of an equivalence relation is used. Thus, we briefly review equivalence relations here.

A relation on a set $S$ is a boolean function on $S \times S$; that is, given $x, y \in S$, either $x$ is related to $y$ (written $x \sim y$), or $x$ is not related to $y$ (written $x \not\sim y$). A relation is an equivalence relation if (a) it is reflexive, i.e. $x \sim x$ for all $x \in S$; (b) it is symmetric, i.e. $x \sim y$ whenever $y \sim x$; and (c) it is transitive, i.e. $x \sim z$ whenever $x \sim y$ and $y \sim z$.

Given an equivalence relation $\sim$ and an element $x \in S$, the equivalence class of $x$ is the set of all $y \in S$ such that $y \sim x$. It is straightforward to verify that, if $\sim$ is an equivalence relation, then any pair of equivalence classes is either identical or disjoint. It follows that, given an equivalence relation, the collection of equivalence classes forms a partition of the set $S$. This fact is used in the proof of Proposition 1.2.6 herein.

Exercise A.5.1. For each of the following relations on $S = \mathbf{Z}$, determine whether or not it is an equivalence relation; and if it is, then find the equivalence class of the element 1:
(a) $x \sim y$ if and only if $|y - x|$ is an integer multiple of 3.
(b) $x \sim y$ if and only if $|y - x| \le 5$.
(c) $x \sim y$ if and only if $|x|, |y| \le 5$.
(d) $x \sim y$ if and only if either $x = y$, or $|x|, |y| \le 5$ (or both).
(e) $x \sim y$ if and only if $|x| = |y|$.
(f) $x \sim y$ if and only if $|x| \ge |y|$.
B. Bibliography.

The books in Subsection B.1 below should be used for background study, especially if your mathematical background is on the weak side for understanding this text. The books in Subsection B.2 are appropriate for lower-level probability courses, and provide lots of intuition, examples, computations, etc. (though they largely avoid the measure theory issues). The books in Subsection B.3 cover material similar to that of this text, though perhaps at a more advanced level; they also contain much additional material of interest. The books in Subsection B.4 do not discuss probability theory, but they do discuss measure theory in detail. Finally, the books in Subsections B.5 and B.6 discuss advanced topics in the theory of stochastic processes and of mathematical finance, respectively, and may be appropriate for further reading after mastering the content of this text.

B.1. Background in real analysis.

K.R. Davidson and A.P. Donsig (2002), Real analysis with real applications. Prentice Hall, Saddle River, NJ.
J.E. Marsden (1974), Elementary classical analysis. W.H. Freeman and Co., New York.
W. Rudin (1976), Principles of mathematical analysis (3rd ed.). McGraw-Hill, New York.
M. Spivak (1994), Calculus (3rd ed.). Publish or Perish, Houston, TX. (This is no ordinary calculus book!)

B.2. Undergraduate-level probability.

W. Feller (1968), An introduction to probability theory and its applications, Vol. I (3rd ed.). Wiley & Sons, New York.
G.R. Grimmett and D.R. Stirzaker (1992), Probability and random processes (2nd ed.). Oxford University Press.
D.G. Kelly (1994), Introduction to probability. Macmillan Publishing Co., New York.
J. Pitman (1993), Probability. Springer-Verlag, New York.
S. Ross (1994), A first course in probability (4th ed.). Macmillan Publishing Co., New York.
R.L. Scheaffer (1995), Introduction to probability and its applications (2nd ed.). Duxbury Press, New York.
B.3. Graduate-level probability.

P. Billingsley (1995), Probability and measure (3rd ed.). John Wiley & Sons, New York.
L. Breiman (1992), Probability. SIAM, Philadelphia.
K.L. Chung (1974), A course in probability theory (2nd ed.). Academic Press, New York.
R.M. Dudley (1989), Real analysis and probability. Wadsworth, Pacific Grove, CA.
R. Durrett (1996), Probability: Theory and examples (2nd ed.). Duxbury Press, New York.
W. Feller (1971), An introduction to probability theory and its applications, Vol. II (2nd ed.). Wiley & Sons, New York.
B. Fristedt and L. Gray (1997), A modern approach to probability theory. Birkhäuser, Boston.
J.C. Taylor (1997), An introduction to measure and probability. Springer, New York.
D. Williams (1991), Probability with martingales. Cambridge University Press.

B.4. Pure measure theory.

D.L. Cohn (1980), Measure theory. Birkhäuser, Boston.
J.L. Doob (1994), Measure theory. Springer-Verlag, New York.
G.B. Folland (1984), Real analysis: Modern techniques and their applications. John Wiley & Sons, New York.
P.R. Halmos (1964), Measure theory. Van Nostrand, Princeton, NJ.
H.L. Royden (1988), Real analysis (3rd ed.). Macmillan Publishing Co., New York.
W. Rudin (1987), Real and complex analysis (3rd ed.). McGraw-Hill, New York.

B.5. Stochastic processes.

S. Asmussen (1987), Applied probability and queues. John Wiley & Sons, New York.
R. Bhattacharya and E.C. Waymire (1990), Stochastic processes with applications. John Wiley & Sons, New York.
J.L. Doob (1953), Stochastic processes (2nd ed.). John Wiley & Sons, New York.
D. Freedman (1983), Brownian motion and diffusion. Springer-Verlag, New York.
S. Karlin and H.M. Taylor (1975), A first course in stochastic processes. Academic Press, New York.
S.P. Meyn and R.L. Tweedie (1993), Markov chains and stochastic stability. Springer-Verlag, London.
E. Nummelin (1984), General irreducible Markov chains and non-negative operators. Cambridge University Press.
S. Resnick (1992), Adventures in stochastic processes. Birkhäuser, Boston.

B.6. Mathematical finance.

M. Baxter and A. Rennie (1996), Financial calculus: An introduction to derivative pricing. Cambridge University Press.
N.H. Bingham and R. Kiesel (1998), Risk-neutral valuation: Pricing and hedging of financial derivatives. Springer, London, New York.
R.J. Elliott and P.E. Kopp (1999), Mathematics of financial markets. Springer, New York.
I. Karatzas (1997), Lectures on the mathematics of finance. American Mathematical Society, Providence, Rhode Island.
I. Karatzas and S.E. Shreve (1998), Methods of mathematical finance. Springer, New York.
A.V. Mel'nikov (1999), Financial markets: Stochastic analysis and the pricing of derivative securities. American Mathematical Society, Providence, Rhode Island.
Index

$A^c$, see complement
$L^2$, 168
$L^p$, 168
$M_X(s)$, see moment generating function
$N(0,1)$, 69
$O(\cdot)$, 206
$\mathbf{E}$, see expected value
$\mathbf{E}_i$, 86
$\mathcal{F}$, see $\sigma$-algebra
$\mathbf{N}$, 200
$\mathbf{Q}$, 200
$\mathbf{R}$, 200
$\mathbf{Z}$, 200
$\Omega$, see sample space
$\mathbf{P}$, see probability measure
$\mathbf{P}_i$, 86
$\cap$, see intersection
$\cup$, see union
$\delta_c$ (point mass), 69
$\delta_{ij}$, 85, 183
$\dot{\cup}$, see disjoint union
$\emptyset$, see empty set
$\mu_X$, 43
$\mu_n \Rightarrow \mu$, 117
$\mathbf{1}_S$, see indicator function
$\phi$-irreducibility, 180
$\phi_X(t)$, see characteristic function
$\setminus$, see difference of sets
$\sigma$-finite measure, 52, 147, 182
$\sigma$-algebra, 7
  Borel, 16
  generated, 16
$\sigma$-field, see $\sigma$-algebra
$\subseteq$, see subset
A?, 86
$o(\cdot)$, 206
absolute continuity, 70, 143, 144, 148
additivity
  countable, 2, 24
  finite, 2
  uncountable, 2, 24
Alaoglu's theorem, 117
algebra, 10, 13
  semi, 9
  sigma, 7
algebraic number, 3, 202
almost always, 34
almost everywhere, 58
almost sure convergence, see convergence
analytic function, 108
aperiodicity, 90, 180, 181
appreciation rate, 195
area, 23
a.a., 34
a.s., 47, 58
axiom of choice, 3, 4, 200
background, mathematical, 9, 199, 209
Berry-Esseen theorem, 135
binomial distribution, 141
Black-Scholes equation, 194, 197
bold play, 79
Bolzano-Weierstrass theorem, 129, 204
Borel $\sigma$-algebra, 16
Borel set, 16
Borel-Cantelli Lemma, 35
Borel-measurable, 30
boundary conditions, 76
boundary of set, 117
bounded convergence theorem, 78
bounded set, 204
Brownian motion, 186-188
  properties, 187
call option, 195
candidate density, 146
Cantelli's inequality, 65
Cantor set, 16, 81
cardinality, 200
Cartesian product, 199
Cauchy sequence, 168
Cauchy-Schwarz inequality, 58, 65
central limit theorem, 134
  Lindeberg, 136
  martingale, 172
Cesàro average, 90
change of variable theorem, 67
characteristic function, 125
Chebychev's inequality, 57
closed subset of a Markov chain, 98
CLT, see central limit theorem
coin tossing, 21, 22
  sequence waiting time, 166, 175
communicating states, 91, 97
compact support, 122
complement, 199
complete metric space, 168
complete probability space, 15, 16
composition, 30
conditional distribution, 151
  regular, 156
conditional expectation, 152, 155
conditional probability, 84, 152, 155
conditional variance, 157
consistency conditions, 177, 178
continuity of probabilities, 33
continuity theorem, 132
continuous sample paths, 188, 189
continuum hypothesis, 4
convergence, 202
  almost sure (a.s.), 47, 58-60, 62, 103, 120
  in distribution, 120
  in probability, 59, 60, 64, 120
  of series, 204
  pointwise, 58
  to infinity, 203
  weak, 117, 120
  with probability one (w.p. 1), 58
convergence theorem
  bounded, 78
  dominated, 104, 165
  martingale, 169
  monotone, 46
  uniform integrability, 105, 165
convex function, 58, 65, 173
convolution, 112
correlation (Corr), 45, 53, 65
countable additivity, 2, 24
countable linearity, 48, 111
countable monotonicity, 10
countable set, 200
countable subadditivity, 8
counting measure, 5, 182
coupling, 92
covariance (Cov), 45
cumulative distribution function, 67
de Morgan's Laws, 199
decomposable Markov chain, 101
decreasing events, 33
density function, 70, 144
derivative
  financial, 194, 195
  Radon-Nikodym, 144, 148
determined by moments, 137
diagonalisation method, 129
difference equations, 76
difference of sets, 199
diffusion, 190
  coefficient, 190
  Langevin, 192
  Ornstein-Uhlenbeck, 193, 194
discrete measure, 69, 143, 144, 148
discrete probability space, 9, 21
disjoint set, 199
disjoint union, 8, 199
distribution, 67
  binomial, 141
  conditional, 151
  exponential, 141
  finite-dimensional, 177
  function, 67
  gamma, 114
  initial, 179, 183
  normal, 1, 69, 70, 133, 134, 141, 193
  Poisson, 1, 8, 69, 141, 185
  stationary, 89, 180, 184
  uniform, 2, 9, 16
divergence, 203
  of series, 204
domain, 200
dominated convergence theorem, 104, 165
dominated measure ($\ll$), 143, 147
double 'til you win, 78
doubly stochastic Markov chain, 100
drift, 190
drift (of diffusion), 190
dyadic rational, 189
Ehrenfest's urn, 84
empty set, 199
equivalence class, 3, 207
equivalence relation, 3, 207
equivalent martingale measure, 195
European call option, 195
evaluation map, 69
event, 7
  decreasing, 33
  increasing, 33
expected value, 1, 43, 45
  conditional, 152, 155
  infinite, 45, 49
  linearity of, 43, 46, 47, 50
  order-preserving, 44, 46, 49
exponential distribution, 141
fair price, 195
Fatou's lemma, 103
field, see algebra
financial derivative, 194, 195
finite additivity, 2
finite intersection property, 22
finite superadditivity, 10
finite-dimensional distributions, 177
first hitting time, 75, 180
floor, 47
fluctuation theory, 88
Fourier inversion, 126
Fourier transform, 125
Fourier uniqueness, 129
Fubini's theorem, 110
function, 200
  analytic, 108
  convex, 58, 65, 173
  density, 70, 144
  distribution, 67
  identity, 201
  indicator, 43, 200
  Lipschitz, 188
  measurable, 29, 30
  random, 186
gambler's ruin, 75, 175
gambling policy, 78
  double 'til you win, 78
gamma distribution, 114
gamma function, 115
generalised triangle inequality, 44
generated $\sigma$-algebra, 16
generator, 183, 190, 191
greatest integer not exceeding, 47
greatest lower bound, 204
Hahn decomposition, 144
Heine-Borel Theorem, 16
Helly selection principle, 117, 129
hitting time, 75, 180
i.i.d., 62
i.o., 34, 86
identically distributed, 62
identity function, 201
image, 200
  inverse, 200
inclusion-exclusion, principle of, 8, 53
increasing events, 33
independence, 31, 32
  of sums, 39
indicator function, 43, 200
inequality
  Cantelli's, 65
  Cauchy-Schwarz, 58, 65
  Chebychev's, 57
  Jensen, 65, 173
  Jensen's, 58
  Markov's, 57
  maximal, 171
infimum, 204
infinite expected value, 45, 49
infinite fair coin tossing, 22
infinitely divisible, 136
infinitely often, 34, 86
initial distribution, 83, 179, 183
integers ($\mathbf{Z}$), 200
  positive ($\mathbf{N}$), 200
integrable
  Lebesgue, 51
  Riemann, 50, 51
  uniformly, 105
integral, 50
  Itô, 191
  iterated, 110
  Lebesgue, 51
  lower, 50
  Riemann, 50
  stochastic, 191
  upper, 50
interest rate, 195
intersection, 199
interval, 9, 15
inverse image, 200
irreducibility, 88, 180
Itô integral, 191
Itô's lemma, 194
iterated integral, 110
iterated logarithm, 88
Jensen's inequality, 58, 65, 173
jump rate, 184
Kolmogorov consistency conditions, 177, 178
Kolmogorov existence theorem, 178, 182
Kolmogorov zero-one law, 37
Langevin diffusion, 192
large deviations, 108, 109
law, see distribution
law of large numbers
  strong, 60, 62
  weak, 60, 64
law of the iterated logarithm, 88
least upper bound, 204
Lebesgue decomposition, 143, 147
Lebesgue integral, 51
Lebesgue measure, 16, 50, 51
liminf, 205
limit, 202
  inferior, 205
  superior, 205
limsup, 205
Lindeberg CLT, 136
linearity of expected values, 43, 46, 47, 50
  countable, 48, 111
Lipschitz function, 188
lower bound, 204
  greatest, 204
lower integral, 50
M-test, 206
Markov chain, 83, 179
  decomposable, 101
Markov process, 183
Markov's inequality, 57
martingale, 76, 161
  central limit theorem, 172
  convergence theorem, 169
  maximal inequality, 171
  sampling theorem, 163
mathematical background, see background, mathematical
maximal inequality, 171
mean, see expected value
mean return time, 94
measurable
  function, 29, 30
  rectangle, 23
  set, 4, 7
  space, 31
measure
  $\sigma$-finite, 52, 147, 182
  counting, 5, 182
  Lebesgue, 16, 50, 51
  outer, 11
  probability, 7
  product, 22, 23
  signed, 144, 146
measure space, see probability space
method of moments, 139
mixture distribution, 143
moment, 45, 108, 137
moment generating function, 107
monotone convergence theorem, 46
monotonicity, 8, 11, 46
  countable, 10
natural numbers, 200
normal distribution, 1, 69, 70, 133, 134, 141, 193
null recurrence, 94
option, 195
optional sampling theorem, 163
order statistics, 179
order-preserving, 44, 46, 49
Ornstein-Uhlenbeck process, 193, 194
outer measure, 11
partition, 43, 199
period of Markov chain, 90, 180
periodicity, 90, 181
permutation, 177, 179
persistence, 86
point mass, 69
Poisson distribution, 1, 8, 69, 141, 185
Poisson process, 185
policy, see gambling policy
Polish space, 156
positive recurrence, 94
prerequisites, 8, 199
price, fair, 195
principle of inclusion-exclusion, 8, 53
probability, 1
  conditional, 84, 152, 155
probability measure, 7
  defined on an algebra, 25
probability space, 7
  complete, 15, 16
  discrete, 9, 21
probability triple, 7
process
  Markov, 183
  stochastic, 182
process, stochastic, 73
product measure, 22, 23
product set, 199
projection operator, 157
Radon-Nikodym derivative, 144, 148
Radon-Nikodym theorem, 144, 148
random function, 186
random variable, 29
  absolutely continuous, 1, 70
  discrete, 1, 69
  simple, 43
random walk, 75, 87, 97, 161, 166
range, 43, 200
rate, jump, 184
rational numbers ($\mathbf{Q}$), 200
real numbers ($\mathbf{R}$), 200
recurrence, 86, 88
  null, 94
  positive, 94
reducibility, 88
reflexive relation, 207
regular conditional distribution, 156
relation, 207
  equivalence, 207
  reflexive, 207
  symmetric, 207
  transitive, 207
return time, 94
reversibility, 89, 184, 192
Riemann integrable, 50, 51
right-continuous, 67
risk-free interest rate, 195
risk-neutral measure, 195
Russell's paradox, 199
sample paths, continuous, 188, 189
sample space, 7
semialgebra, 9, 10
semigroup, 183
set, 199
  Borel, 16
  boundary, 117
  Cantor, 16, 81
  cardinality, 200
  complement, 199
  countable, 200
  difference, 199
  disjoint, 8, 199
  empty, 199
  finite, 200
  intersection, 199
  measurable, 4, 7
  product, 199
  uncountable, 200
  union, 199
  union, disjoint, 8, 199
  universal, 199
shift, 3
sigma-algebra, see $\sigma$-algebra
sigma-field, see $\sigma$-algebra
sigma-finite measure, see $\sigma$-finite measure
sign(0), 127
signed measure, 144, 146
simple random variable, 43
simple random walk, 75, 97
  symmetric, 87, 161
singular measure, 143, 144, 148
Skorohod's theorem, 117
state space, 83, 179
stationary distribution, 89, 180, 184
Stirling's approximation, 87
stochastic integral, 191
stochastic process, 73, 177, 182
  continuous time, 182
stock, 195
stock options, 194, 195
  value of, 195
stopping time, 162
strong law of large numbers, 60, 62
sub-$\sigma$-algebra, 152
subadditivity, 8, 11
submartingale, 161
  convergence theorem, 169
  sampling theorem, 163
subsequence, 204
subset, 199
superadditivity, 10
supermartingale, 161
supremum, 204
symmetric relation, 207
tightness, 130
Tonelli's theorem, 110
total variation distance, 181
transience, 86
transition matrix, 83, 84, 89
transition probabilities, 83, 179
  nth order, 85
transitive relation, 207
triangle inequality, 44
triangular array, 136
truncation argument, 62
Tychonov's Theorem, 22
uncountable additivity, 2, 24
uncountable set, 200
uniform distribution, 2, 9, 16
uniform integrability, 105
  convergence theorem, 105, 165
uniformly bounded, 78
union, 199
  disjoint, 8, 199
universal set, 199
upcrossing lemma, 169
upper bound, 204
  least, 204
upper integral, 50
value of stock option, 195
vanishing at infinity, 117, 122
variable, random, see random variable
variance (Var), 44
  conditional, 157
volatility, 190, 195
volume, 23
w.p. 1, see convergence
Wald's theorem, 166, 175, 176
weak convergence, 117, 120
weak law of large numbers, 60, 64
weak* topology, 117
Wiener process, 186
zero-one law, Kolmogorov, 37