Author: Rosenthal J.S.  

Tags: mathematics, probability theory

ISBN: 978-981-270-370-5

Year: 2006

A FIRST LOOK AT RIGOROUS PROBABILITY THEORY
Second Edition


To my wonderful wife, Margaret Fulford: always supportive, always encouraging.
Preface to the First Edition.

This text grew out of my lecturing the Graduate Probability sequence STA 2111F / 2211S at the University of Toronto over a period of several years. During this time, it became clear to me that there are a large number of graduate students from a variety of departments (mathematics, statistics, economics, management, finance, computer science, engineering, etc.) who require a working knowledge of rigorous probability, but whose mathematical background may be insufficient to dive straight into advanced texts on the subject. This text is intended to answer that need. It provides an introduction to rigorous (i.e., mathematically precise) probability theory using measure theory. At the same time, I have tried to make it brief and to the point, and as accessible as possible. In particular, probabilistic language and perspective are used throughout, with necessary measure theory introduced only as needed.

I have tried to strike an appropriate balance between rigorously covering the subject and avoiding unnecessary detail. The text provides mathematically complete proofs of all of the essential introductory results of probability theory and measure theory. However, more advanced and specialised areas are ignored entirely or only briefly hinted at. For example, the text includes a complete proof of the classical Central Limit Theorem, including the necessary Continuity Theorem for characteristic functions; however, the Lindeberg Central Limit Theorem and Martingale Central Limit Theorem are only briefly sketched and are not proved. Similarly, all necessary facts from measure theory are proved before they are used, but more abstract and advanced measure theory results are not included. Furthermore, the measure theory is almost always discussed purely in terms of probability, as opposed to being treated as a separate subject which must be mastered before probability theory can be studied.

I hesitated to bring these notes to publication. There are many other books available which treat probability theory with measure theory, and some of them are excellent. For a partial list see Subsection B.3. (Indeed, the book by Billingsley was the textbook from which I taught before I started writing these notes. While much has changed since then, the knowledgeable reader will still notice Billingsley's influence in the treatment of many topics herein. The Billingsley book remains one of the best sources for a complete, advanced, and technically precise treatment of probability theory with measure theory.) In terms of content, therefore, the current text adds very little indeed to what has already been written. It was only the reaction of certain students, who found the subject easier to learn from my notes than from longer, more advanced, and more all-inclusive books, that convinced me to go ahead and publish. The reader is urged to consult
other books for further study and additional detail.

There are also many books available (see Subsection B.2) which treat probability theory at the undergraduate, less rigorous level, without the use of general measure theory. Such texts provide intuitive notions of probabilities, random variables, etc., but without mathematical precision. In this text it will generally be assumed, for purposes of intuition, that the student has at least a passing familiarity with probability theory at this level. Indeed, Section 1 of the text attempts to link such intuition with the mathematical precision to come. However, mathematically speaking we will not require many results from undergraduate-level probability theory.

Structure. The first six sections of this book could be considered to form a "core" of essential material. After learning them, the student will have a precise mathematical understanding of probabilities and $\sigma$-algebras; random variables, distributions, and expected values; and inequalities and laws of large numbers. Sections 7 and 8 then diverge into the theory of gambling games and Markov chain theory. Section 9 provides a bridge to the more advanced topics of Sections 10 through 14, including weak convergence, characteristic functions, the Central Limit Theorem, Lebesgue Decomposition, conditioning, and martingales. The final section, Section 15, provides a wide-ranging and somewhat less rigorous introduction to the subject of general stochastic processes. It leads up to diffusions, Ito's Lemma, and finally a brief look at the famous Black-Scholes equation from mathematical finance. It is hoped that this final section will inspire readers to learn more about various aspects of stochastic processes.

Appendix A contains basic facts from elementary mathematics. This appendix can be used for review and to gauge the book's level. In addition, the text makes frequent reference to Appendix A, especially in the earlier sections, to ease the transition to the required mathematical level for the subject. It is hoped that readers can use familiar topics from Appendix A as a springboard to less familiar topics in the text. Finally, Appendix B lists a variety of references, for background and for further reading.

Exercises. The text contains a number of exercises. Those very closely related to textual material are inserted at the appropriate place. Additional exercises are found at the end of each section, in a separate subsection. I have tried to make the exercises thought-provoking without being too difficult. Hints are provided where appropriate. Rather than always asking for computations or proofs, the exercises sometimes ask for explanations and/or examples, to hopefully clarify the subject matter in the student's mind.

Prerequisites. As a prerequisite to reading this text, the student should
have a solid background in basic undergraduate-level real analysis (not including measure theory). In particular, the mathematical background summarised in Appendix A should be very familiar. If it is not, then books such as those in Subsection B.1 should be studied first. It is also helpful, but not essential, to have seen some undergraduate-level probability theory at the level of the books in Subsection B.2.

Further reading. For further reading beyond this text, the reader should examine the similar but more advanced books of Subsection B.3. To learn additional topics, the reader should consult the books on pure measure theory of Subsection B.4, and/or the advanced books on stochastic processes of Subsection B.5, and/or the books on mathematical finance of Subsection B.6. I would be content to learn only that this text has inspired students to look at more advanced treatments of the subject.

Acknowledgements. I would like to thank several colleagues for encouraging me in this direction, in particular Mike Evans, Andrey Feuerverger, Keith Knight, Omiros Papaspiliopoulos, Jeremy Quastel, Nancy Reid, and Gareth Roberts. Most importantly, I would like to thank the many students who have studied these topics with me; their questions, insights, and difficulties have been my main source of inspiration.

Jeffrey S. Rosenthal
Toronto, Canada, 2000
jeff@math.toronto.edu
http://probability.ca/jeff/

Second Printing (2003). For the second printing, a number of minor errors have been corrected. Thanks to Tom Baird, Meng Du, Avery Fullerton, Longhai Li, Hadas Moshonov, Nataliya Portman, and Idan Regev for helping to find them.

Third Printing (2005). A few more minor errors were corrected, with thanks to Samuel Hikspoors, Bin Li, Mahdi Lotfinezhad, Ben Reason, Jay Sheldon, and Zemei Yang.
Preface to the Second Edition.

I am pleased to have the opportunity to publish a second edition of this book. The book's basic structure and content are unchanged; in particular, the emphasis on establishing probability theory's rigorous mathematical foundations, while minimising technicalities as much as possible, remains paramount. However, having taught from this book for several years, I have made considerable revisions and improvements. For example:

• Many small additional topics have been added, and existing topics expanded. As a result, the second edition is over forty pages longer than the first.

• Many new exercises have been added, and some of the existing exercises have been improved or "cleaned up". There are now about 275 exercises in total (as compared with 150 in the first edition), ranging in difficulty from quite easy to fairly challenging, many with hints provided.

• Further details and explanations have been added in steps of proofs which previously caused confusion.

• Several of the longer proofs are now broken up into a number of lemmas, to more easily keep track of the different steps involved, and to allow for the possibility of skipping the most technical bits while retaining the proof's overall structure.

• A few proofs, which are required for mathematical completeness but which require advanced mathematics background and/or add little understanding, are now marked as "optional".

• Various interesting, but technical and inessential, results are presented as remarks or footnotes, to add information and context without interrupting the text's flow.

• The Extension Theorem now allows the original set function to be defined on a semialgebra rather than an algebra, thus simplifying its application and increasing understanding.

• Many minor edits and rewrites were made throughout the book to improve the clarity, accuracy, and readability.

I thank Ying Oi Chiew and Lai Fun Kwong of World Scientific for facilitating this edition, and thank Richard Dudley, Eung Jun Lee, Neal Madras, Peter Rosenthal, Hermann Thorisson, and Balint Virag for helpful comments. Also, I again thank the many students who have studied and discussed these topics with me over many years.

Jeffrey S. Rosenthal
Toronto, Canada, 2006

Second Printing (2007). A few very minor corrections were made, with thanks to Joe Blitzstein and Emil Zeuthen.
Contents

Preface to the First Edition
Preface to the Second Edition

1 The need for measure theory
1.1 Various kinds of random variables
1.2 The uniform distribution and non-measurable sets
1.3 Exercises
1.4 Section summary

2 Probability triples
2.1 Basic definition
2.2 Constructing probability triples
2.3 The Extension Theorem
2.4 Constructing the Uniform[0,1] distribution
2.5 Extensions of the Extension Theorem
2.6 Coin tossing and other measures
2.7 Exercises
2.8 Section summary

3 Further probabilistic foundations
3.1 Random variables
3.2 Independence
3.3 Continuity of probabilities
3.4 Limit events
3.5 Tail fields
3.6 Exercises
3.7 Section summary

4 Expected values
4.1 Simple random variables
4.2 General non-negative random variables
4.3 Arbitrary random variables
4.4 The integration connection
4.5 Exercises
4.6 Section summary

5 Inequalities and convergence
5.1 Various inequalities
5.2 Convergence of random variables
5.3 Laws of large numbers
5.4 Eliminating the moment conditions
5.5 Exercises
5.6 Section summary

6 Distributions of random variables
6.1 Change of variable theorem
6.2 Examples of distributions
6.3 Exercises
6.4 Section summary

7 Stochastic processes and gambling games
7.1 A first existence theorem
7.2 Gambling and gambler's ruin
7.3 Gambling policies
7.4 Exercises
7.5 Section summary

8 Discrete Markov chains
8.1 A Markov chain existence theorem
8.2 Transience, recurrence, and irreducibility
8.3 Stationary distributions and convergence
8.4 Existence of stationary distributions
8.5 Exercises
8.6 Section summary

9 More probability theorems
9.1 Limit theorems
9.2 Differentiation of expectation
9.3 Moment generating functions and large deviations
9.4 Fubini's Theorem and convolution
9.5 Exercises
9.6 Section summary

10 Weak convergence
10.1 Equivalences of weak convergence
10.2 Connections to other convergence
10.3 Exercises
10.4 Section summary

11 Characteristic functions
11.1 The continuity theorem
11.2 The Central Limit Theorem
11.3 Generalisations of the Central Limit Theorem
11.4 Method of moments
11.5 Exercises
11.6 Section summary

12 Decomposition of probability laws
12.1 Lebesgue and Hahn decompositions
12.2 Decomposition with general measures
12.3 Exercises
12.4 Section summary

13 Conditional probability and expectation
13.1 Conditioning on a random variable
13.2 Conditioning on a sub-σ-algebra
13.3 Conditional variance
13.4 Exercises
13.5 Section summary

14 Martingales
14.1 Stopping times
14.2 Martingale convergence
14.3 Maximal inequality
14.4 Exercises
14.5 Section summary

15 General stochastic processes
15.1 Kolmogorov Existence Theorem
15.2 Markov chains on general state spaces
15.3 Continuous-time Markov processes
15.4 Brownian motion as a limit
15.5 Existence of Brownian motion
15.6 Diffusions and stochastic integrals
15.7 Ito's Lemma
15.8 The Black-Scholes equation
15.9 Section summary

A Mathematical Background
A.1 Sets and functions
A.2 Countable sets
A.3 Epsilons and Limits
A.4 Infimums and supremums
A.5 Equivalence relations

B Bibliography
B.1 Background in real analysis
B.2 Undergraduate-level probability
B.3 Graduate-level probability
B.4 Pure measure theory
B.5 Stochastic processes
B.6 Mathematical finance

Index
1. The need for measure theory.

This introductory section is directed primarily to those readers who have some familiarity with undergraduate-level probability theory, and who may be unclear as to why it is necessary to introduce measure theory and other mathematical difficulties in order to study probability theory in a rigorous manner. We attempt to illustrate the limitations of undergraduate-level probability theory in two ways: the restrictions on the kinds of random variables it allows, and the question of what sets can have probabilities defined on them.

1.1. Various kinds of random variables.

The reader familiar with undergraduate-level probability will be comfortable with a statement like, "Let X be a random variable which has the Poisson(5) distribution." The reader will know that this means that X takes as its value a "random" non-negative integer, such that the integer $k \ge 0$ is chosen with probability $P(X = k) = e^{-5} 5^k / k!$. The expected value of, say, $X^2$, can then be computed as $E(X^2) = \sum_{k=0}^{\infty} k^2 e^{-5} 5^k / k!$. X is an example of a discrete random variable.

Similarly, the reader will be familiar with a statement like, "Let Y be a random variable which has the Normal(0,1) distribution." This means that the probability that Y lies between two real numbers $a < b$ is given by the integral $P(a \le Y \le b) = \int_a^b \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \, dy$. (On the other hand, $P(Y = y) = 0$ for any particular real number y.) The expected value of, say, $Y^2$, can then be computed as $E(Y^2) = \int_{-\infty}^{\infty} y^2 \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \, dy$. Y is an example of an absolutely continuous random variable.

But now suppose we introduce a new random variable Z, as follows. We let X and Y be as above, and then flip an (independent) fair coin. If the coin comes up heads we set Z = X, while if it comes up tails we set Z = Y. In symbols, $P(Z = X) = P(Z = Y) = 1/2$. Then what sort of random variable is Z? It is not discrete, since it can take on an uncountable number of different values. But it is not absolutely continuous, since for certain values z (specifically, when z is a non-negative integer) we have $P(Z = z) > 0$. So how can we study the random variable Z? How could we compute, say, the expected value of $Z^2$?

The correct response to this question, of course, is that the division of random variables into discrete versus absolutely continuous is artificial. Instead, measure theory allows us to give a common definition of expected value, which applies equally well to discrete random variables (like X above), to continuous random variables (like Y above), to combinations of them (like Z above), and to other kinds of random variables not yet imagined. These issues are considered in Sections 4, 6, and 12.
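The mixture Z can be made concrete by simulation. The following is a minimal sketch of ours (not from the text; it assumes the numpy library), estimating $E(Z)$ and $E(Z^2)$ by Monte Carlo and comparing with the values that the measure-theoretic definition of expectation will eventually justify: $E(Z^2) = \frac{1}{2}E(X^2) + \frac{1}{2}E(Y^2) = \frac{1}{2}(5 + 25) + \frac{1}{2}(1) = 15.5$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

# Flip a fair coin for each sample: heads -> Z = X ~ Poisson(5),
# tails -> Z = Y ~ Normal(0,1).
heads = rng.integers(0, 2, size=n).astype(bool)
x = rng.poisson(5, size=n)
y = rng.normal(0, 1, size=n)
z = np.where(heads, x, y)

# E(X^2) = Var(X) + (E X)^2 = 5 + 25 = 30, and E(Y^2) = 1,
# so E(Z^2) = (30 + 1)/2 = 15.5; similarly E(Z) = (5 + 0)/2 = 2.5.
print(z.mean(), (z * z).mean())  # approx 2.5 and 15.5
```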
1.2. The uniform distribution and non-measurable sets.

In undergraduate-level probability, continuous random variables are often studied in detail. However, a closer examination suggests that perhaps such random variables are not completely understood after all. To take the simplest case, suppose that X is a random variable which has the uniform distribution on the unit interval [0,1]. In symbols, X ~ Uniform[0,1]. What precisely does this mean?

Well, certainly this means that $P(0 \le X \le 1) = 1$. It also means that $P(0 \le X \le 1/2) = 1/2$, that $P(3/4 \le X \le 7/8) = 1/8$, etc., and in general that $P(a \le X \le b) = b - a$ whenever $0 \le a \le b \le 1$, with the same formula holding if $\le$ is replaced by $<$. We can write this as

$$P([a,b]) = P((a,b]) = P([a,b)) = P((a,b)) = b - a, \qquad 0 \le a \le b \le 1. \qquad (1.2.1)$$

In words, the probability that X lies in any interval contained in [0,1] is simply the length of the interval. (We include in this the degenerate case when $a = b$, so that $P(\{a\}) = 0$ for the singleton set $\{a\}$; in words, the probability that X is equal to any particular number a is zero.)

Similarly, this means that

$$P(1/4 \le X \le 1/2 \text{ or } 2/3 \le X \le 5/6) = P(1/4 \le X \le 1/2) + P(2/3 \le X \le 5/6) = 1/4 + 1/6 = 5/12,$$

and in general that if A and B are disjoint subsets of [0,1] (for example, if A = [1/4, 1/2] and B = [2/3, 5/6]), then

$$P(A \cup B) = P(A) + P(B). \qquad (1.2.2)$$

Equation (1.2.2) is called finite additivity. Indeed, to allow for countable operations (such as limits, which are extremely important in probability theory), we would like to extend (1.2.2) to the case of a countably infinite number of disjoint subsets: if $A_1, A_2, A_3, \ldots$ are disjoint subsets of [0,1], then

$$P(A_1 \cup A_2 \cup A_3 \cup \cdots) = P(A_1) + P(A_2) + P(A_3) + \cdots. \qquad (1.2.3)$$

Equation (1.2.3) is called countable additivity.

Note that we do not extend equation (1.2.3) to uncountable additivity. Indeed, if we did, then we would expect that $P([0,1]) = \sum_{x \in [0,1]} P(\{x\})$,
which is clearly false since the left-hand side equals 1 while the right-hand side equals 0. (There is no contradiction to (1.2.3) since the interval [0,1] is not countable.) It is for this reason that we restrict attention to countable operations. (For a review of countable and uncountable sets, see Subsection A.2. Also, recall that for non-negative uncountable collections $\{r_\alpha\}_{\alpha \in I}$, the sum $\sum_{\alpha \in I} r_\alpha$ is defined to be the supremum of $\sum_{\alpha \in J} r_\alpha$ over finite $J \subseteq I$.)

Similarly, to reflect the fact that X is "uniform" on the interval [0,1], the probability that X lies in some subset should be unaffected by "shifting" (with wrap-around) the subset by a fixed amount. That is, if for each subset $A \subseteq [0,1]$ we define the r-shift of A by

$$A \oplus r = \{a + r : a \in A,\ a + r \le 1\} \cup \{a + r - 1 : a \in A,\ a + r > 1\}, \qquad (1.2.4)$$

then we should have

$$P(A \oplus r) = P(A), \qquad 0 \le r \le 1. \qquad (1.2.5)$$

So far so good. But now suppose we ask, what is the probability that X is rational? What is the probability that $X^n$ is rational for some positive integer n? What is the probability that X is algebraic, i.e. the solution to some polynomial equation with integer coefficients? Can we compute these things? More fundamentally, are all probabilities such as these necessarily even defined? That is, does P(A) (i.e., the probability that X lies in the subset A) even make sense for every possible subset $A \subseteq [0,1]$?

It turns out that the answer to this last question is no, as the following proposition shows. The proof requires equivalence relations, but can be skipped if desired since the result is not used elsewhere in this book.

Proposition 1.2.6. There does not exist a definition of P(A), defined for all subsets $A \subseteq [0,1]$, satisfying (1.2.1) and (1.2.3) and (1.2.5).

Proof (optional). Suppose, to the contrary, that P(A) could be so defined for each subset $A \subseteq [0,1]$. We will derive a contradiction to this.

Define an equivalence relation (see Subsection A.5) on [0,1] by: $x \sim y$ if and only if the difference $y - x$ is rational. This relation partitions the interval [0,1] into a disjoint union of equivalence classes. Let H be a subset of [0,1] consisting of precisely one element from each equivalence class (such H must exist by the Axiom of Choice; see Subsection A.2). For definiteness, assume that $0 \notin H$ (say, if $0 \in H$, then replace it by 1/2).

Now, since H contains an element of each equivalence class, we see that each point in (0,1] is contained in the union $\bigcup_{r \in [0,1),\ r \text{ rational}} (H \oplus r)$ of shifts of H. Furthermore, since H contains just one point from each equivalence class, we see that these sets $H \oplus r$, for rational $r \in [0,1)$, are all disjoint.
But then, by countable additivity (1.2.3), we have

$$P((0,1]) = \sum_{r \in [0,1),\ r \text{ rational}} P(H \oplus r).$$

Shift-invariance (1.2.5) implies that $P(H \oplus r) = P(H)$, whence

$$1 = P((0,1]) = \sum_{r \in [0,1),\ r \text{ rational}} P(H).$$

This leads to the desired contradiction: a countably infinite sum of the same non-negative quantity repeated can only equal 0 (if $P(H) = 0$) or $\infty$ (if $P(H) > 0$), but it can never equal 1. ∎

This proposition says that if we want our probabilities to satisfy reasonable properties, then we cannot define them for all possible subsets of [0,1]. Rather, we must restrict their definition to certain "measurable" sets. This is the motivation for the next section.

Remark. The existence of problematic sets like H above turns out to be equivalent to the Axiom of Choice. In particular, we can never define such sets explicitly; they arise only implicitly via the Axiom of Choice as in the above proof. (In fact, assuming the Continuum Hypothesis, Proposition 1.2.6 continues to hold if we require only (1.2.3) and that $0 < P([0,1]) < \infty$ and $P(\{x\}) = 0$ for all x; see e.g. Billingsley (1995, p. 46).)

1.3. Exercises.

Exercise 1.3.1. Suppose that $\Omega = \{1,2\}$, with $P(\emptyset) = 0$ and $P(\{1,2\}) = 1$. Suppose $P(\{1\}) = 1/3$. Prove that P is countably additive if and only if $P(\{2\}) = 2/3$.

Exercise 1.3.2. Suppose $\Omega = \{1,2,3\}$ and $\mathcal{F}$ is the collection of all subsets of $\Omega$. Find (with proof) necessary and sufficient conditions on the real numbers x, y, and z, such that there exists a countably additive probability measure P on $\mathcal{F}$, with $x = P(\{1,2\})$, $y = P(\{2,3\})$, and $z = P(\{1,3\})$.

Exercise 1.3.3. Suppose that $\Omega = \mathbf{N}$ is the set of positive integers, and P is defined for all $A \subseteq \Omega$ by P(A) = 0 if A is finite, and P(A) = 1 if A is infinite. Is P finitely additive?

Exercise 1.3.4. Suppose that $\Omega = \mathbf{N}$, and P is defined for all $A \subseteq \Omega$ by $P(A) = |A|$ if A is finite (where |A| is the number of elements in
the subset A), and $P(A) = \infty$ if A is infinite. This P is of course not a probability measure (in fact it is counting measure), however we can still ask the following. (By convention, $\infty + \infty = \infty$.)
(a) Is P finitely additive?
(b) Is P countably additive?

Exercise 1.3.5.
(a) In what step of the proof of Proposition 1.2.6 was (1.2.1) used?
(b) Give an example of a countably additive set function P, defined on all subsets of [0,1], which satisfies (1.2.3) and (1.2.5), but not (1.2.1).

1.4. Section summary.

In this section, we have discussed why measure theory is necessary to develop a mathematically rigorous theory of probability. We have discussed basic properties of probability measures, such as additivity. We have considered the possibility of random variables which are neither absolutely continuous nor discrete, and therefore do not fit easily into undergraduate-level understanding of probability. Finally, we have proved that, for the uniform distribution on [0,1], it will not be possible to define a probability on every single subset.
2. Probability triples.

In this section we consider probability triples and how to construct them. In light of the previous section, we see that to study probability theory properly it will be necessary to keep track of which subsets A have a probability P(A) defined for them.

2.1. Basic definition.

We define a probability triple or (probability) measure space or probability space to be a triple $(\Omega, \mathcal{F}, P)$, where:

• the sample space $\Omega$ is any non-empty set (e.g. $\Omega = [0,1]$ for the uniform distribution considered above);

• the $\sigma$-algebra (read "sigma-algebra") or $\sigma$-field (read "sigma-field") $\mathcal{F}$ is a collection of subsets of $\Omega$, containing $\Omega$ itself and the empty set $\emptyset$, and closed under the formation of complements and countable unions and countable intersections (e.g. for the uniform distribution considered above, $\mathcal{F}$ would certainly contain all the intervals [a,b], but would contain many more subsets besides);

• the probability measure P is a mapping from $\mathcal{F}$ to [0,1], with $P(\emptyset) = 0$ and $P(\Omega) = 1$, such that P is countably additive as in (1.2.3).

This definition will be in constant use throughout the text. Furthermore it contains a number of subtle points. Thus, we pause to make a few additional observations.

The $\sigma$-algebra $\mathcal{F}$ is the collection of all events or measurable sets. These are the subsets $A \subseteq \Omega$ for which P(A) is well-defined. We know from Proposition 1.2.6 that in general $\mathcal{F}$ might not contain all subsets of $\Omega$, though we still expect it to contain most of the subsets that come up naturally. (For the definitions of complements, unions, intersections, etc., see Subsection A.1.)

To say that $\mathcal{F}$ is closed under the formation of complements and countable unions and countable intersections means, more precisely, that:
(i) for any subset $A \subseteq \Omega$, if $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$;
(ii) for any countable (or finite) collection of subsets $A_1, A_2, A_3, \ldots \subseteq \Omega$, if $A_i \in \mathcal{F}$ for each i, then the union $A_1 \cup A_2 \cup A_3 \cup \cdots \in \mathcal{F}$;
(iii) for any countable (or finite) collection of subsets $A_1, A_2, A_3, \ldots \subseteq \Omega$, if $A_i \in \mathcal{F}$ for each i, then the intersection $A_1 \cap A_2 \cap A_3 \cap \cdots \in \mathcal{F}$.

As for countable additivity, the reason we require $\mathcal{F}$ to be closed under countable operations is to allow for taking limits, etc., when studying probability theory. Also as for additivity, we cannot extend the definition to require that $\mathcal{F}$ be closed under uncountable unions; in that case, for the example of Subsection 1.2 above, $\mathcal{F}$ would contain every subset A, since every subset can be written as $A = \bigcup_{x \in A} \{x\}$ and since the singleton sets $\{x\}$ are all in $\mathcal{F}$.
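To make the closure conditions concrete, here is a small sketch of ours (a toy four-point space, not an example from the text). For a finite collection of subsets, checking complements and pairwise unions is enough, since all countable set operations then reduce to finite ones:

```python
omega = frozenset({1, 2, 3, 4})

# A candidate collection of events (small enough to also check by hand):
F = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), omega}

def is_sigma_algebra(F, omega):
    # For a finite collection, closure under complements and pairwise
    # unions already gives closure under all countable set operations.
    has_basics = omega in F and frozenset() in F
    closed_complement = all(omega - A in F for A in F)
    closed_union = all(A | B in F for A in F for B in F)
    return has_basics and closed_complement and closed_union

print(is_sigma_algebra(F, omega))  # True

# A probability measure P on F: P(empty) = 0, P(omega) = 1, and
# additivity on disjoint events (0.25 and 0.75 are exact in binary):
P = {frozenset(): 0.0, frozenset({1, 2}): 0.25, frozenset({3, 4}): 0.75, omega: 1.0}
assert P[frozenset({1, 2}) | frozenset({3, 4})] == P[frozenset({1, 2})] + P[frozenset({3, 4})]
```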
There is some redundancy in the definition above. For example, it follows from de Morgan's Laws (see Subsection A.1) that if $\mathcal{F}$ is closed under complement and countable unions, then it is automatically closed under countable intersections. Similarly, it follows from countable additivity that we must have $P(\emptyset) = 0$, and that (once we know that $P(\Omega) = 1$ and $P(A) \ge 0$ for all $A \in \mathcal{F}$) we must have $P(A) \le 1$.

More generally, from additivity we have $P(A) + P(A^c) = P(A \cup A^c) = P(\Omega) = 1$, whence

$$P(A^c) = 1 - P(A), \qquad (2.1.1)$$

a fact that will be used often. Similarly, if $A \subseteq B$, then since $B = A \,\dot\cup\, (B \setminus A)$ (where $\dot\cup$ means disjoint union), we have that $P(B) = P(A) + P(B \setminus A) \ge P(A)$, i.e.

$$P(A) \le P(B) \quad \text{whenever } A \subseteq B, \qquad (2.1.2)$$

which is the monotonicity property of probability measures.

Also, if $A, B \in \mathcal{F}$, then

$$P(A \cup B) = P[(A \setminus B) \,\dot\cup\, (B \setminus A) \,\dot\cup\, (A \cap B)]
= P(A \setminus B) + P(B \setminus A) + P(A \cap B)
= P(A) - P(A \cap B) + P(B) - P(A \cap B) + P(A \cap B)
= P(A) + P(B) - P(A \cap B),$$

the principle of inclusion-exclusion. For a generalisation see Exercise 4.5.7.

Finally, for any sequence $A_1, A_2, \ldots \in \mathcal{F}$ (whether disjoint or not), we have by countable additivity and monotonicity that

$$P(A_1 \cup A_2 \cup A_3 \cup \cdots) = P[A_1 \,\dot\cup\, (A_2 \setminus A_1) \,\dot\cup\, (A_3 \setminus A_2 \setminus A_1) \,\dot\cup\, \cdots]
= P(A_1) + P(A_2 \setminus A_1) + P(A_3 \setminus A_2 \setminus A_1) + \cdots
\le P(A_1) + P(A_2) + P(A_3) + \cdots,$$

which is the countable subadditivity property of probability measures.

2.2. Constructing probability triples.

We clarify the definition of Subsection 2.1 with a simple example. Let us again consider the Poisson(5) distribution considered in Subsection 1.1. In this case, the sample space $\Omega$ would consist of all the non-negative integers:
$\Omega = \{0, 1, 2, \ldots\}$. Also, the $\sigma$-algebra $\mathcal{F}$ would consist of all subsets of $\Omega$. Finally, the probability measure P would be defined, for any $A \in \mathcal{F}$, by

$$P(A) = \sum_{k \in A} e^{-5} 5^k / k!.$$

It is straightforward to check that $\mathcal{F}$ is indeed a $\sigma$-algebra (it contains all subsets of $\Omega$, so it's closed under any set operations), and that P is a probability measure defined on $\mathcal{F}$ (the additivity following since if A and B are disjoint, then $\sum_{k \in A \cup B}$ is the same as $\sum_{k \in A} + \sum_{k \in B}$).

So in the case of Poisson(5), we see that it is entirely straightforward to construct an appropriate probability triple. The construction is similarly straightforward for any discrete probability space, i.e. any space for which the sample space $\Omega$ is finite or countable. We record this as follows.

Theorem 2.2.1. Let $\Omega$ be a finite or countable non-empty set. Let $p : \Omega \to [0,1]$ be any function satisfying $\sum_{\omega \in \Omega} p(\omega) = 1$. Then there is a valid probability triple $(\Omega, \mathcal{F}, P)$, where $\mathcal{F}$ is the collection of all subsets of $\Omega$, and for $A \in \mathcal{F}$, $P(A) = \sum_{\omega \in A} p(\omega)$.

Example 2.2.2. Let $\Omega$ be any finite non-empty set, $\mathcal{F}$ be the collection of all subsets of $\Omega$, and $P(A) = |A| / |\Omega|$ for all $A \in \mathcal{F}$ (where |A| is the cardinality of the set A). Then $(\Omega, \mathcal{F}, P)$ is a valid probability triple, called the uniform distribution on $\Omega$, written Uniform($\Omega$).

However, if the sample space is not countable, then the situation is considerably more complex, as seen in Subsection 1.2. How can we formally define a probability triple $(\Omega, \mathcal{F}, P)$ which corresponds to, say, the Uniform[0,1] distribution? It seems clear that we should choose $\Omega = [0,1]$. But what about $\mathcal{F}$? We know from Proposition 1.2.6 that $\mathcal{F}$ cannot contain all subsets of $\Omega$, but it should certainly contain all the intervals [a,b], [a,b), etc. That is, we must have $\mathcal{F} \supseteq \mathcal{J}$, where

$$\mathcal{J} = \{\text{all intervals contained in } [0,1]\}$$

and where "intervals" is understood to include all the open / closed / half-open / singleton / empty intervals.

Exercise 2.2.3. Prove that the above collection $\mathcal{J}$ is a semialgebra of subsets of $\Omega$, meaning that it contains $\emptyset$ and $\Omega$, it is closed under finite intersection, and the complement of any element of $\mathcal{J}$ is equal to a finite disjoint union of elements of $\mathcal{J}$.
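Before continuing with the uncountable case, here is a quick sketch of ours of the discrete construction of Theorem 2.2.1, for the Poisson(5) example (exact infinite sums are replaced by floating-point partial sums):

```python
from math import exp, factorial

def p(k):
    # Poisson(5) probability mass at k.
    return exp(-5) * 5**k / factorial(k)

def P(A):
    # Theorem 2.2.1: P(A) = sum of p(k) over k in A,
    # for any subset A of Omega = {0, 1, 2, ...}.
    return sum(p(k) for k in A)

A = {0, 1, 2}
B = {5, 6}
# Finite additivity on disjoint sets, up to floating-point rounding:
print(abs(P(A | B) - (P(A) + P(B))) < 1e-12)   # True
# P(Omega) = 1, approximated here by a long partial sum:
print(abs(P(range(200)) - 1.0) < 1e-12)        # True
```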
Since $\mathcal{J}$ is only a semialgebra, how can we create a $\sigma$-algebra? As a first try, we might consider

$$\mathcal{B}_0 = \{\text{all finite unions of elements of } \mathcal{J}\}. \qquad (2.2.4)$$

(After all, by additivity we already know how to define P on $\mathcal{B}_0$.) However, $\mathcal{B}_0$ is not a $\sigma$-algebra:

Exercise 2.2.5.
(a) Prove that $\mathcal{B}_0$ is an algebra (or, field) of subsets of $\Omega$, meaning that it contains $\Omega$ and $\emptyset$, and is closed under the formation of complements and of finite unions and intersections.
(b) Prove that $\mathcal{B}_0$ is not a $\sigma$-algebra.

As a second try, we might consider

$$\mathcal{B}_1 = \{\text{all finite or countable unions of elements of } \mathcal{J}\}. \qquad (2.2.6)$$

Unfortunately, $\mathcal{B}_1$ is still not a $\sigma$-algebra (Exercise 2.4.7). Thus, the construction of $\mathcal{F}$, and of P, presents serious challenges. To deal with them, we next prove a very general theorem about constructing probability triples.

2.3. The Extension Theorem.

The following theorem is of fundamental importance in constructing complicated probability triples. Recall the definition of semialgebra from Exercise 2.2.3.

Theorem 2.3.1. (The Extension Theorem.) Let $\mathcal{J}$ be a semialgebra of subsets of $\Omega$. Let $P : \mathcal{J} \to [0,1]$ with $P(\emptyset) = 0$ and $P(\Omega) = 1$, satisfying the finite superadditivity property that

$$P\Big(\bigcup_{i=1}^k A_i\Big) \ge \sum_{i=1}^k P(A_i) \quad \text{whenever } A_1, \ldots, A_k \in \mathcal{J},\ \bigcup_{i=1}^k A_i \in \mathcal{J}, \text{ and the } \{A_i\} \text{ are disjoint}, \qquad (2.3.2)$$

and also the countable monotonicity property that

$$P(A) \le \sum_n P(A_n) \quad \text{for } A, A_1, A_2, \ldots \in \mathcal{J} \text{ with } A \subseteq \bigcup_n A_n. \qquad (2.3.3)$$

Then there is a $\sigma$-algebra $\mathcal{M} \supseteq \mathcal{J}$, and a countably additive probability measure $P^*$ on $\mathcal{M}$, such that $P^*(A) = P(A)$ for all $A \in \mathcal{J}$. (That is,
$(\Omega, \mathcal{M}, P^*)$ is a valid probability triple, which agrees with our previous probabilities on $\mathcal{J}$.)

Remark. Of course, the conclusions of Theorem 2.3.1 imply that (2.3.2) must actually hold with equality. However, (2.3.2) need only be verified as an inequality to apply Theorem 2.3.1.

Theorem 2.3.1 provides precisely what we need: a way to construct complicated probability triples on a full $\sigma$-algebra, using only probabilities defined on the much simpler subsets (e.g., intervals) in $\mathcal{J}$. However, it is not clear how to even start proving this theorem. Indeed, how could we begin to define P(A) for all A in a $\sigma$-algebra? The key is given by outer measure $P^*$, defined by

$$P^*(A) = \inf_{\substack{A_1, A_2, \ldots \in \mathcal{J} \\ A \subseteq \bigcup_i A_i}} \sum_i P(A_i), \qquad A \subseteq \Omega. \qquad (2.3.4)$$

That is, we define $P^*(A)$, for any subset $A \subseteq \Omega$, to be the infimum of sums of $P(A_i)$, where $\{A_i\}$ is any countable collection of elements of the original semialgebra $\mathcal{J}$ whose union contains A. In other words, we use the values of P(A) for $A \in \mathcal{J}$ to help us define $P^*(A)$ for any $A \subseteq \Omega$.

Of course, we know that $P^*$ will not necessarily be a proper probability measure for all $A \subseteq \Omega$; for example, this is not possible for Uniform[0,1] by Proposition 1.2.6. However, it is still useful that $P^*(A)$ is at least defined for all $A \subseteq \Omega$. We shall eventually show that $P^*$ is indeed a probability measure on some $\sigma$-algebra $\mathcal{M}$, and that $P^*$ is an extension of P.

To continue, we note a few simple properties of $P^*$. Firstly, we clearly have $P^*(\emptyset) = 0$; indeed, we can simply take $A_i = \emptyset$ for each i in the definition (2.3.4). Secondly, $P^*$ is clearly monotone; indeed, if $A \subseteq B$ then the infimum (2.3.4) for $P^*(A)$ includes all choices of $\{A_i\}$ which work for $P^*(B)$ plus many more besides, so that $P^*(A) \le P^*(B)$. We also have:

Lemma 2.3.5. $P^*$ is an extension of P, i.e. $P^*(A) = P(A)$ for all $A \in \mathcal{J}$.

Proof. Let $A \in \mathcal{J}$. It follows from (2.3.3) that $P^*(A) \ge P(A)$. On the other hand, choosing $A_1 = A$ and $A_i = \emptyset$ for $i > 1$ in the definition (2.3.4) shows by (A.4.1) that $P^*(A) \le P(A)$. ∎

Lemma 2.3.6. $P^*$ is countably subadditive, i.e.

$$P^*\Big(\bigcup_{n=1}^\infty B_n\Big) \le \sum_{n=1}^\infty P^*(B_n) \quad \text{for any } B_1, B_2, \ldots \subseteq \Omega.$$
Proof. Let $B_1, B_2, \ldots \subseteq \Omega$. From the definition (2.3.4), we see that for any $\epsilon > 0$, we can find (cf. Proposition A.4.2), for each $n \in \mathbf{N}$, a collection $\{C_{nk}\}_{k=1}^\infty$ with $C_{nk} \in \mathcal{J}$, such that $B_n \subseteq \bigcup_k C_{nk}$ and $\sum_k P(C_{nk}) \le P^*(B_n) + \epsilon 2^{-n}$. But then the overall collection $\{C_{nk}\}_{n,k=1}^\infty$ contains $\bigcup_{n=1}^\infty B_n$. It follows that

$$P^*\Big(\bigcup_{n=1}^\infty B_n\Big) \le \sum_{n,k} P(C_{nk}) \le \sum_n P^*(B_n) + \epsilon.$$

Since this is true for any $\epsilon > 0$, we must have (cf. Proposition A.3.1) that $P^*\big(\bigcup_{n=1}^\infty B_n\big) \le \sum_n P^*(B_n)$, as claimed. ∎

We now set

$$\mathcal{M} = \{A \subseteq \Omega : P^*(A \cap E) + P^*(A^c \cap E) = P^*(E) \ \ \forall E \subseteq \Omega\}. \qquad (2.3.7)$$

That is, $\mathcal{M}$ is the set of all subsets A with the property that $P^*$ is additive on the union of $A \cap E$ with $A^c \cap E$, for all subsets E. Note that by subadditivity we always have $P^*(A \cap E) + P^*(A^c \cap E) \ge P^*(E)$, so (2.3.7) is equivalent to

$$\mathcal{M} = \{A \subseteq \Omega : P^*(A \cap E) + P^*(A^c \cap E) \le P^*(E) \ \ \forall E \subseteq \Omega\}, \qquad (2.3.8)$$

which is sometimes helpful. Furthermore, $P^*$ is countably additive on $\mathcal{M}$:

Lemma 2.3.9. If $A_1, A_2, \ldots \in \mathcal{M}$ are disjoint, then $P^*(\bigcup_n A_n) = \sum_n P^*(A_n)$.

Proof. If $A_1$ and $A_2$ are disjoint, with $A_1 \in \mathcal{M}$, then

$$P^*(A_1 \cup A_2) = P^*\big(A_1 \cap (A_1 \cup A_2)\big) + P^*\big(A_1^c \cap (A_1 \cup A_2)\big) \quad \text{since } A_1 \in \mathcal{M}$$
$$= P^*(A_1) + P^*(A_2) \quad \text{since } A_1, A_2 \text{ disjoint}.$$

Hence, by induction, the lemma holds for any finite collection of $A_i$. Then, with countably many disjoint $A_i \in \mathcal{M}$, we see that for any $m \in \mathbf{N}$,

$$\sum_{n \le m} P^*(A_n) = P^*\Big(\bigcup_{n \le m} A_n\Big) \le P^*\Big(\bigcup_n A_n\Big),$$

where the inequality follows from monotonicity. Since this is true for any $m \in \mathbf{N}$, we have (cf. Proposition A.3.6) that $\sum_n P^*(A_n) \le P^*(\bigcup_n A_n)$. On the other hand, by subadditivity we have $\sum_n P^*(A_n) \ge P^*(\bigcup_n A_n)$. Hence, the lemma holds for countably many $A_i$ as well. ∎

The plan now is to show that $\mathcal{M}$ is a $\sigma$-algebra which contains $\mathcal{J}$. We break up the proof into a number of lemmas.
Lemma 2.3.10. $\mathcal{M}$ is an algebra, i.e. $\Omega \in \mathcal{M}$ and $\mathcal{M}$ is closed under complement and finite intersection (and hence also finite union).

Proof. It is immediate from (2.3.7) that $\mathcal{M}$ contains $\Omega$ and is closed under complement. For the statement about finite intersections, suppose $A, B \in \mathcal{M}$. Then, for any $E \subseteq \Omega$, using subadditivity,

$$P^*((A \cap B) \cap E) + P^*((A \cap B)^c \cap E)$$
$$= P^*(A \cap B \cap E) + P^*\big((A^c \cap B \cap E) \cup (A \cap B^c \cap E) \cup (A^c \cap B^c \cap E)\big)$$
$$\le P^*(A \cap B \cap E) + P^*(A^c \cap B \cap E) + P^*(A \cap B^c \cap E) + P^*(A^c \cap B^c \cap E)$$
$$= P^*(B \cap E) + P^*(B^c \cap E) \quad \text{since } A \in \mathcal{M}$$
$$= P^*(E) \quad \text{since } B \in \mathcal{M}.$$

Hence, by (2.3.8), $A \cap B \in \mathcal{M}$. ∎

Lemma 2.3.11. Let $A_1, A_2, \ldots \in \mathcal{M}$ be disjoint. For each $m \in \mathbf{N}$, let $B_m = \bigcup_{n \le m} A_n$. Then for all $m \in \mathbf{N}$, and for all $E \subseteq \Omega$, we have

$$P^*(E \cap B_m) = \sum_{n \le m} P^*(E \cap A_n). \qquad (2.3.12)$$

Proof. We use induction on m. Indeed, the statement is trivially true when m = 1. Assuming it true for some particular value of m, and noting that $B_m \cap B_{m+1} = B_m$ and $B_m^c \cap B_{m+1} = A_{m+1}$, we have (noting that $B_m \in \mathcal{M}$ by Lemma 2.3.10) that

$$P^*(E \cap B_{m+1}) = P^*(B_m \cap E \cap B_{m+1}) + P^*(B_m^c \cap E \cap B_{m+1}) \quad \text{since } B_m \in \mathcal{M}$$
$$= P^*(E \cap B_m) + P^*(E \cap A_{m+1})$$
$$= \sum_{n \le m+1} P^*(E \cap A_n) \quad \text{by the induction hypothesis},$$

thus completing the induction proof. ∎

Lemma 2.3.13. Let $A_1, A_2, \ldots \in \mathcal{M}$ be disjoint. Then $\bigcup_n A_n \in \mathcal{M}$.

Proof. For each $m \in \mathbf{N}$, let $B_m = \bigcup_{n \le m} A_n$. Then for any $m \in \mathbf{N}$ and any $E \subseteq \Omega$,

$$P^*(E) = P^*(E \cap B_m) + P^*(E \cap B_m^c) \quad \text{since } B_m \in \mathcal{M}$$
$$= \sum_{n \le m} P^*(E \cap A_n) + P^*(E \cap B_m^c) \quad \text{by (2.3.12)}$$
$$\ge \sum_{n \le m} P^*(E \cap A_n) + P^*\Big(E \cap \big(\bigcup_n A_n\big)^c\Big),$$
where the inequality follows by monotonicity since $(\bigcup_n A_n)^c \subseteq B_m^c$. This is true for any $m \in \mathbf{N}$, so it implies (cf. Proposition A.3.6) that

$$P^*(E) \ge \sum_n P^*(E \cap A_n) + P^*\Big(E \cap \big(\bigcup_n A_n\big)^c\Big)$$
$$\ge P^*\Big(E \cap \big(\bigcup_n A_n\big)\Big) + P^*\Big(E \cap \big(\bigcup_n A_n\big)^c\Big),$$

where the final inequality follows by subadditivity. Hence, by (2.3.8) we have $\bigcup_n A_n \in \mathcal{M}$. ∎

Lemma 2.3.14. $\mathcal{M}$ is a $\sigma$-algebra.

Proof. In light of Lemma 2.3.10, it suffices to show that $\bigcup_n A_n \in \mathcal{M}$ whenever $A_1, A_2, \ldots \in \mathcal{M}$. Let $D_1 = A_1$, and $D_i = A_i \cap A_1^c \cap \cdots \cap A_{i-1}^c$ for $i \ge 2$. Then the $\{D_i\}$ are disjoint, with $\bigcup_i D_i = \bigcup_i A_i$, and with $D_i \in \mathcal{M}$ by Lemma 2.3.10. Hence, by Lemma 2.3.13, $\bigcup_i D_i \in \mathcal{M}$, i.e. $\bigcup_i A_i \in \mathcal{M}$. ∎

Lemma 2.3.15. $\mathcal{J} \subseteq \mathcal{M}$.

Proof. Let $A \in \mathcal{J}$. Then since $\mathcal{J}$ is a semialgebra, we can write $A^c = J_1 \cup \ldots \cup J_k$ for some disjoint $J_1, \ldots, J_k \in \mathcal{J}$. Also, for any $E \subseteq \Omega$ and $\epsilon > 0$, by the definition (2.3.4) we can find (cf. Proposition A.4.2) $A_1, A_2, \ldots \in \mathcal{J}$ with $E \subseteq \bigcup_n A_n$ and $\sum_n P(A_n) \le P^*(E) + \epsilon$. Then

$$P^*(E \cap A) + P^*(E \cap A^c)$$
$$\le P^*\Big(\big(\bigcup_n A_n\big) \cap A\Big) + P^*\Big(\big(\bigcup_n A_n\big) \cap A^c\Big) \quad \text{by monotonicity}$$
$$= P^*\Big(\bigcup_n (A_n \cap A)\Big) + P^*\Big(\bigcup_n \bigcup_{i=1}^k (A_n \cap J_i)\Big)$$
$$\le \sum_n P^*(A_n \cap A) + \sum_n \sum_{i=1}^k P^*(A_n \cap J_i) \quad \text{by subadditivity}$$
$$= \sum_n P(A_n \cap A) + \sum_n \sum_{i=1}^k P(A_n \cap J_i) \quad \text{since } P^* = P \text{ on } \mathcal{J}$$
$$= \sum_n \Big( P(A_n \cap A) + \sum_{i=1}^k P(A_n \cap J_i) \Big)$$
$$\le \sum_n P(A_n) \quad \text{by (2.3.2)}$$
$$\le P^*(E) + \epsilon \quad \text{by assumption}.$$

This is true for any $\epsilon > 0$, hence (cf. Proposition A.3.1) we have $P^*(E \cap A) + P^*(E \cap A^c) \le P^*(E)$, for any $E \subseteq \Omega$. Hence, from (2.3.8), we have $A \in \mathcal{M}$. This holds for any $A \in \mathcal{J}$, hence $\mathcal{J} \subseteq \mathcal{M}$. ∎

With all those lemmas behind us, we are now, finally, able to complete the proof of the Extension Theorem.
Proof of Theorem 2.3.1. Lemmas 2.3.5, 2.3.9, 2.3.14, and 2.3.15 together show that $\mathcal{M}$ is a $\sigma$-algebra containing $\mathcal{J}$, that $P^*$ is a probability measure on $\mathcal{M}$, and that $P^*$ is an extension of P. ∎

Exercise 2.3.16. Prove that the extension $(\Omega, \mathcal{M}, P^*)$ constructed in the proof of Theorem 2.3.1 must be complete, meaning that if $A \in \mathcal{M}$ with $P^*(A) = 0$, and if $B \subseteq A$, then $B \in \mathcal{M}$. (It then follows from monotonicity that $P^*(B) = 0$.)

2.4. Constructing the Uniform[0,1] distribution.

Theorem 2.3.1 allows us to automatically construct valid probability triples which take particular values on particular sets. We now use this to construct the Uniform[0,1] distribution.

We begin by letting $\Omega = [0,1]$, and again setting

$$\mathcal{J} = \{\text{all intervals contained in } [0,1]\}, \qquad (2.4.1)$$

where again "intervals" is understood to include all the open, closed, half-open, and singleton intervals contained in [0,1], and also the empty set $\emptyset$. Then $\mathcal{J}$ is a semialgebra by Exercise 2.2.3. For $I \in \mathcal{J}$, we let P(I) be the length of I. Thus $P(\emptyset) = 0$ and $P(\Omega) = 1$. We now proceed to verify (2.3.2) and (2.3.3).

Proposition 2.4.2. The above definition of $\mathcal{J}$ and P satisfies (2.3.2), with equality.

Proof. Let $I_1, \ldots, I_k$ be disjoint intervals contained in [0,1], whose union is some interval $I_0$. For $0 \le j \le k$, write $a_j$ for the left end-point of $I_j$, and $b_j$ for the right end-point of $I_j$. The assumptions imply that, by re-ordering, we can ensure that

$$a_0 = a_1 \le b_1 = a_2 \le b_2 = a_3 \le \ldots \le b_k = b_0.$$

Then

$$\sum_j P(I_j) = \sum_j (b_j - a_j) = b_k - a_1 = b_0 - a_0 = P(I_0). \quad ∎$$

The verification of (2.3.3) for this $\mathcal{J}$ and P is a bit more involved:

Exercise 2.4.3.
(a) Prove that if $I_1, I_2, \ldots, I_n$ is a finite collection of intervals, and if $\bigcup_{j=1}^n I_j \supseteq I$ for some interval I, then $\sum_{j=1}^n P(I_j) \ge P(I)$. [Hint: Imitate the proof of Proposition 2.4.2.]
(b) Prove that if $I_1, I_2, \ldots$ is a countable collection of open intervals, and if $\bigcup_{j=1}^\infty I_j \supseteq I$ for some closed interval I, then $\sum_{j=1}^\infty P(I_j) \ge P(I)$. [Hint:
You may use the Heine-Borel Theorem, which says that if a collection of open intervals contains a closed interval, then some finite sub-collection of the open intervals also contains the closed interval.]
(c) Verify (2.3.3), i.e. prove that if $I_1, I_2, \ldots$ is any countable collection of intervals, and if $\bigcup_{j=1}^\infty I_j \supseteq I$ for any interval I, then $\sum_{j=1}^\infty P(I_j) \ge P(I)$. [Hint: Extend the interval $I_j$ by $\epsilon 2^{-j}$ at each end, and decrease I by $\epsilon$ at each end, while making $I_j$ open and I closed. Then use part (b).]

In light of Proposition 2.4.2 and Exercise 2.4.3, we can apply Theorem 2.3.1 to conclude the following:

Theorem 2.4.4. There exists a probability triple $(\Omega, \mathcal{M}, P^*)$ such that $\Omega = [0,1]$, $\mathcal{M}$ contains all intervals in [0,1], and for any interval $I \subseteq [0,1]$, $P^*(I)$ is the length of I.

This probability triple is called either the uniform distribution on [0,1], or Lebesgue measure on [0,1]. Depending on the context, we sometimes write the probability measure $P^*$ as P or as $\lambda$.

Remark. Let $\mathcal{B} = \sigma(\mathcal{J})$ be the $\sigma$-algebra generated by $\mathcal{J}$, i.e. the smallest $\sigma$-algebra containing $\mathcal{J}$. (The collection $\mathcal{B}$ is called the Borel $\sigma$-algebra of subsets of [0,1], and the elements of $\mathcal{B}$ are called Borel sets.) Clearly, we must have $\mathcal{M} \supseteq \mathcal{B}$. In this case, it can be shown that $\mathcal{M}$ is in fact much bigger than $\mathcal{B}$; it even has larger cardinality. Furthermore, it turns out that Lebesgue measure restricted to $\mathcal{B}$ is not complete, though on $\mathcal{M}$ it is (by Exercise 2.3.16). In addition to the Borel subsets of [0,1], we shall also have occasion to refer to the Borel $\sigma$-algebra of subsets of $\mathbf{R}$, defined to be the smallest $\sigma$-algebra of subsets of $\mathbf{R}$ which includes all intervals.

Exercise 2.4.5. Let $\mathcal{A} = \{(-\infty, x] : x \in \mathbf{R}\}$. Prove that $\sigma(\mathcal{A}) = \mathcal{B}$, i.e. that the smallest $\sigma$-algebra of subsets of $\mathbf{R}$ which contains $\mathcal{A}$ is equal to the Borel $\sigma$-algebra of subsets of $\mathbf{R}$. [Hint: Does $\sigma(\mathcal{A})$ include all intervals?]

Writing $\lambda$ for Lebesgue measure on [0,1], we know that $\lambda(\{x\}) = 0$ for any singleton set $\{x\}$. It follows by countable additivity that $\lambda(A) = 0$ for any set A which is countable. This includes (cf. Subsection A.2) the rational numbers, the integer roots of the rational numbers, the algebraic numbers, etc. That is, if X is uniformly distributed on [0,1], then P(X is rational) = 0, and $P(X^n \text{ is rational for some } n \in \mathbf{N}) = 0$, and P(X is algebraic) = 0, and so on.
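The fact that countable sets are $\lambda$-null can also be seen directly from the definition (2.3.4) of outer measure; the following covering argument is standard, and is spelled out here for concreteness (it is our addition, not part of the original text at this point). Given a countable set $A = \{x_1, x_2, x_3, \ldots\} \subseteq [0,1]$ and any $\epsilon > 0$, cover each $x_n$ by an interval $I_n$ of length $\epsilon 2^{-n}$ centred at $x_n$. Then $A \subseteq \bigcup_n I_n$, so

$$\lambda(A) \;\le\; \sum_{n=1}^{\infty} P(I_n) \;=\; \sum_{n=1}^{\infty} \epsilon\, 2^{-n} \;=\; \epsilon.$$

Since $\epsilon > 0$ was arbitrary, $\lambda(A) = 0$.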
There also exist uncountable sets which have Lebesgue measure 0. The simplest example is the Cantor set K, defined as follows (see Figure 2.4.6).

[Figure 2.4.6: Constructing the Cantor set K, by repeatedly removing open middle thirds from [0,1].]

We begin with the interval [0,1]. We then remove the open interval consisting of the middle third, (1/3, 2/3). We then remove the open middle thirds of each of the two remaining pieces, i.e. we remove (1/9, 2/9) and (7/9, 8/9). We then remove the four open middle thirds (1/27, 2/27), (7/27, 8/27), (19/27, 20/27), and (25/27, 26/27) of the remaining pieces. We continue inductively, at the nth stage removing the $2^{n-1}$ middle thirds of all remaining sub-intervals, each of length $1/3^n$. The Cantor set K is defined to be everything that is left over, after we have removed all these middle thirds.

Now, the complement of the Cantor set has Lebesgue measure given by the geometric series

$$\lambda(K^c) = 1/3 + 2(1/9) + 4(1/27) + \ldots = \sum_{n=1}^\infty 2^{n-1}/3^n = \frac{1}{3} \sum_{n=1}^\infty (2/3)^{n-1} = \frac{1/3}{1 - 2/3} = 1.$$

Hence, by (2.1.1), $\lambda(K) = 1 - 1 = 0$. On the other hand, K is uncountable. Indeed, for each point $x \in K$, let $d_n(x) = 0$ or 1 depending on whether, at the nth stage of the construction of K, x was to the left or the right of the nearest open interval removed. Then define the function $f : K \to [0,1]$ by $f(x) = \sum_{n=1}^\infty d_n(x)\, 2^{-n}$. It is easily checked that $f(K) = [0,1]$, i.e. that f maps K onto [0,1]. Since [0,1] is uncountable, this means that K must also be uncountable.

Remark. The Cantor set is also equal to the set of all numbers in [0,1] which have a base-3 expansion that does not contain the digit 1. That is, $K = \{\sum_{n=1}^\infty c_n 3^{-n} : \text{each } c_n \in \{0,2\}\}$.

Exercise 2.4.7.
(a) Prove that $K, K^c \in \mathcal{B}$, where $\mathcal{B}$ are the Borel subsets of [0,1].
(b) Prove that $K, K^c \in \mathcal{M}$, where $\mathcal{M}$ is the $\sigma$-algebra of Theorem 2.4.4.
(c) Prove that $K^c \in \mathcal{B}_1$, where $\mathcal{B}_1$ is defined by (2.2.6).
(d) Prove that $K \notin \mathcal{B}_1$.
(e) Prove that $\mathcal{B}_1$ is not a $\sigma$-algebra.
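A brief computational aside of ours (not from the text, and with a caveat: the ternary digit test below is only approximate for floating-point inputs):

```python
def removed_length(n):
    # Total length removed from [0,1] after n stages:
    # 2**(k-1) open intervals of length 3**-k at stage k.
    return sum(2**(k - 1) / 3**k for k in range(1, n + 1))

print(removed_length(50))  # approaches 1, consistent with lambda(K) = 0

def in_cantor(x, digits=30):
    # Approximate membership test via the base-3 characterization:
    # inspect only the first `digits` ternary digits of x.
    for _ in range(digits):
        x *= 3
        d = int(x)
        if d == 1:
            return False
        x -= d
    return True

print(in_cantor(0.25))  # True:  1/4 = 0.020202..._3 is in K
print(in_cantor(0.5))   # False: 1/2 = 0.111..._3
```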
On the other hand, from Proposition 1.2.6 we know that:

Proposition 2.4.8. For the probability triple $(\Omega, \mathcal{M}, P^*)$ of Theorem 2.4.4 corresponding to Lebesgue measure on [0,1], there exists at least one subset $H \subseteq \Omega$ with $H \notin \mathcal{M}$.

2.5. Extensions of the Extension Theorem.

The Extension Theorem (Theorem 2.3.1) will be our main tool for proving the existence of complicated probability triples. While (2.3.2) is generally easy to verify, (2.3.3) can be more challenging. Thus, we present some alternative formulations here.

Corollary 2.5.1. Let $\mathcal{J}$ be a semialgebra of subsets of $\Omega$. Let $P : \mathcal{J} \to [0,1]$ with $P(\emptyset) = 0$ and $P(\Omega) = 1$, satisfying (2.3.2), and the "monotonicity on $\mathcal{J}$" property that

$$P(A) \le P(B) \quad \text{whenever } A, B \in \mathcal{J} \text{ with } A \subseteq B, \qquad (2.5.2)$$

and also the "countable subadditivity on $\mathcal{J}$" property that

$$P\Big(\bigcup_n B_n\Big) \le \sum_n P(B_n) \quad \text{for } B_1, B_2, \ldots \in \mathcal{J} \text{ with } \bigcup_n B_n \in \mathcal{J}. \qquad (2.5.3)$$

Then there is a $\sigma$-algebra $\mathcal{M} \supseteq \mathcal{J}$, and a countably additive probability measure $P^*$ on $\mathcal{M}$, such that $P^*(A) = P(A)$ for all $A \in \mathcal{J}$.

Proof. In light of Theorem 2.3.1, we need only verify (2.3.3). To that end, let $A, A_1, A_2, \ldots \in \mathcal{J}$ with $A \subseteq \bigcup_n A_n$. Set $B_n = A \cap A_n$. Then since $A \subseteq \bigcup_n A_n$, we have $A = \bigcup_n (A \cap A_n) = \bigcup_n B_n$, whence (2.5.3) and (2.5.2) give that

$$P(A) = P\Big(\bigcup_n B_n\Big) \le \sum_n P(B_n) \le \sum_n P(A_n). \quad ∎$$

Another version assumes countable additivity of P on $\mathcal{J}$:

Corollary 2.5.4. Let $\mathcal{J}$ be a semialgebra of subsets of $\Omega$. Let $P : \mathcal{J} \to [0,1]$ with $P(\Omega) = 1$, satisfying the countable additivity property that

$$P\Big(\bigcup_n D_n\Big) = \sum_n P(D_n) \quad \text{for } D_1, D_2, \ldots \in \mathcal{J} \text{ disjoint with } \bigcup_n D_n \in \mathcal{J}. \qquad (2.5.5)$$

Then there is a $\sigma$-algebra $\mathcal{M} \supseteq \mathcal{J}$, and a countably additive probability measure $P^*$ on $\mathcal{M}$, such that $P^*(A) = P(A)$ for all $A \in \mathcal{J}$.
Proof. Note that (2.5.5) immediately implies (2.3.2) (with equality), and $P(\emptyset) = 0$. Hence, in light of Corollary 2.5.1, we need only verify (2.5.2) and (2.5.3).

For (2.5.2), let $A, B \in \mathcal{J}$ with $A \subseteq B$. Since $\mathcal{J}$ is a semialgebra, we can write $A^c = J_1 \cup \ldots \cup J_k$, for some disjoint $J_1, \ldots, J_k \in \mathcal{J}$. Then, using (2.5.5),

$$P(B) = P(A) + P(B \cap J_1) + \ldots + P(B \cap J_k) \ge P(A).$$

For (2.5.3), let $B_1, B_2, \ldots \in \mathcal{J}$ with $\bigcup_n B_n \in \mathcal{J}$. Set $D_1 = B_1$, and $D_n = B_n \cap B_1^c \cap \ldots \cap B_{n-1}^c$ for $n \ge 2$. Then the $\{D_n\}$ are disjoint, with $\bigcup_n D_n = \bigcup_n B_n$. Furthermore, since $\mathcal{J}$ is a semialgebra, each $D_n$ can be written as a finite disjoint union of elements of $\mathcal{J}$, say $D_n = J_{n1} \cup \ldots \cup J_{n k_n}$. It then follows from (2.5.5) that

$$P\Big(\bigcup_n B_n\Big) = P\Big(\bigcup_n D_n\Big) = P\Big(\bigcup_n \bigcup_{i=1}^{k_n} J_{ni}\Big) = \sum_n \sum_{i=1}^{k_n} P(J_{ni}).$$

On the other hand,

$$B_n = \bigcup_{m \le n} \bigcup_{i=1}^{k_m} (J_{mi} \cap B_n)$$

and the union is disjoint, with $J_{ni} \subseteq B_n$, so

$$P(B_n) = \sum_{m \le n} \sum_{i=1}^{k_m} P(J_{mi} \cap B_n) \ge \sum_{i=1}^{k_n} P(J_{ni} \cap B_n) = \sum_{i=1}^{k_n} P(J_{ni}),$$

and hence

$$\sum_n P(B_n) \ge \sum_n \sum_{i=1}^{k_n} P(J_{ni}) = P\Big(\bigcup_n B_n\Big). \quad ∎$$

Exercise 2.5.6. Suppose P satisfies (2.5.5) for finite collections $\{D_n\}$. Suppose further that, whenever $A_1, A_2, \ldots \in \mathcal{J}$ are such that $A_{n+1} \subseteq A_n$ and $\bigcap_{n=1}^\infty A_n = \emptyset$, we have $\lim_{n \to \infty} P(A_n) = 0$. Prove that P also satisfies (2.5.5) for countable collections $\{D_n\}$. [Hint: Set $A_n = (\bigcup_{i=1}^\infty D_i) \setminus (\bigcup_{i=1}^n D_i)$.]

The extension of Theorem 2.3.1 also has a uniqueness property:

Proposition 2.5.7. Let $\mathcal{J}$, P, $P^*$, and $\mathcal{M}$ be as in Theorem 2.3.1 (or as in Corollary 2.5.1 or 2.5.4). Let $\mathcal{F}$ be any $\sigma$-algebra with $\mathcal{J} \subseteq \mathcal{F} \subseteq \mathcal{M}$ (e.g.
$\mathcal{F} = \mathcal{M}$, or $\mathcal{F} = \sigma(\mathcal{J})$). Let Q be any probability measure on $\mathcal{F}$, such that $Q(A) = P(A)$ for all $A \in \mathcal{J}$. Then $Q(A) = P^*(A)$ for all $A \in \mathcal{F}$.

Proof. For $A \in \mathcal{F}$, we compute

$$P^*(A) = \inf_{\substack{A_1, A_2, \ldots \in \mathcal{J} \\ A \subseteq \bigcup_i A_i}} \sum_i P(A_i) \quad \text{from (2.3.4)}$$
$$= \inf_{\substack{A_1, A_2, \ldots \in \mathcal{J} \\ A \subseteq \bigcup_i A_i}} \sum_i Q(A_i) \quad \text{since } Q = P \text{ on } \mathcal{J}$$
$$\ge \inf_{\substack{A_1, A_2, \ldots \in \mathcal{J} \\ A \subseteq \bigcup_i A_i}} Q\Big(\bigcup_i A_i\Big) \quad \text{by countable subadditivity}$$
$$\ge \inf_{\substack{A_1, A_2, \ldots \in \mathcal{J} \\ A \subseteq \bigcup_i A_i}} Q(A) \quad \text{by monotonicity}$$
$$= Q(A),$$

i.e. $P^*(A) \ge Q(A)$. Similarly, $P^*(A^c) \ge Q(A^c)$, and then (2.1.1) implies $1 - P^*(A) \ge 1 - Q(A)$, i.e. $P^*(A) \le Q(A)$. Hence, $P^*(A) = Q(A)$. ∎

Proposition 2.5.7 immediately implies the following:

Proposition 2.5.8. Let $\mathcal{J}$ be a semialgebra of subsets of $\Omega$, and let $\mathcal{F} = \sigma(\mathcal{J})$ be the generated $\sigma$-algebra. Let P and Q be two probability distributions defined on $\mathcal{F}$. Suppose that $P(A) = Q(A)$ for all $A \in \mathcal{J}$. Then P = Q, i.e. $P(A) = Q(A)$ for all $A \in \mathcal{F}$.

Proof. Since P and Q are probability measures, they both satisfy (2.3.2) and (2.3.3). Hence, by Proposition 2.5.7, each of P and Q is equal to the $P^*$ of Theorem 2.3.1. ∎

One useful special case of Proposition 2.5.8 is:

Corollary 2.5.9. Let P and Q be two probability distributions defined on the collection $\mathcal{B}$ of Borel subsets of $\mathbf{R}$. Suppose $P((-\infty, x]) = Q((-\infty, x])$ for all $x \in \mathbf{R}$. Then $P(A) = Q(A)$ for all $A \in \mathcal{B}$.

Proof. Since $P((y, \infty)) = 1 - P((-\infty, y])$, and $P((x, y]) = 1 - P((-\infty, x]) - P((y, \infty))$, and similarly for Q, it follows that P and Q agree on

$$\mathcal{J} = \{(-\infty, x] : x \in \mathbf{R}\} \cup \{(y, \infty) : y \in \mathbf{R}\} \cup \{(x, y] : x, y \in \mathbf{R}\} \cup \{\emptyset, \mathbf{R}\}. \qquad (2.5.10)$$

But $\mathcal{J}$ is a semialgebra (Exercise 2.7.10), and it follows from Exercise 2.4.5 that $\sigma(\mathcal{J}) = \mathcal{B}$. Hence, the result follows from Proposition 2.5.8. ∎
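As a quick illustration of ours (not from the text), Corollary 2.5.9 says that a Borel probability measure on $\mathbf{R}$ is completely determined by its cumulative distribution function $F(x) = P((-\infty, x])$. For instance, for the Exponential(1) distribution, the single formula $F(x) = 1 - e^{-x}$ for $x \ge 0$ (and $F(x) = 0$ for $x < 0$) already forces

$$P((x, y]) = F(y) - F(x) = e^{-x} - e^{-y}, \qquad 0 \le x < y,$$

and, by Corollary 2.5.9, pins down $P(A)$ for every Borel set A, even though no explicit formula for general A is available.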
2.6. Coin tossing and other measures.

Now that we have Theorem 2.3.1 to help us, we can easily construct other probability triples as well. For example, of frequent mention in probability theory is (independent, fair) coin tossing. To model the flipping of n coins, we can simply take

$$\Omega = \{(r_1, r_2, \ldots, r_n) : r_i = 0 \text{ or } 1\}$$

(where 0 stands for tails and 1 stands for heads), let $\mathcal{F} = 2^\Omega$ be the collection of all subsets of $\Omega$, and define P by $P(A) = |A| / 2^n$ for $A \in \mathcal{F}$. This is another example of a discrete probability space; and we know from Theorem 2.2.1 that these spaces present no difficulties.

But suppose now that we wish to model the flipping of a (countably) infinite number of coins. In this case we can let

$$\Omega = \{(r_1, r_2, r_3, \ldots) : r_i = 0 \text{ or } 1\}$$

be the collection of all binary sequences. But what about $\mathcal{F}$ and P? Well, for each $n \in \mathbf{N}$ and each $a_1, a_2, \ldots, a_n \in \{0,1\}$, let us define subsets $A_{a_1 a_2 \ldots a_n} \subseteq \Omega$ by

$$A_{a_1 a_2 \ldots a_n} = \{(r_1, r_2, \ldots) \in \Omega : r_i = a_i \text{ for } 1 \le i \le n\}.$$

(Thus, $A_0$ is the event that the first coin comes up tails; $A_{11}$ is the event that the first two coins both come up heads; and $A_{101}$ is the event that the first and third coins are heads while the second coin is tails.) Then we clearly want $P(A_{a_1 a_2 \ldots a_n}) = 1/2^n$ for each set $A_{a_1 a_2 \ldots a_n}$. Hence, if we set

$$\mathcal{J} = \{A_{a_1 a_2 \ldots a_n} : n \in \mathbf{N},\ a_1, a_2, \ldots, a_n \in \{0,1\}\} \cup \{\emptyset, \Omega\},$$

then we already know how to define P(A) for each $A \in \mathcal{J}$. To apply the Extension Theorem (in this case, Corollary 2.5.4), we need to verify that certain conditions are satisfied.

Exercise 2.6.1.
(a) Verify that the above $\mathcal{J}$ is a semialgebra.
(b) Verify that the above $\mathcal{J}$ and P satisfy (2.5.5) for finite collections $\{D_n\}$. [Hint: For a finite collection $\{D_n\} \subseteq \mathcal{J}$, there is $k \in \mathbf{N}$ such that the results of only coins 1 through k are specified by any $D_n$. Partition $\Omega$ into the corresponding $2^k$ subsets.]

Verifying (2.5.5) for countable collections unfortunately requires a bit of topology; the proof of this next lemma may be skipped.

Lemma 2.6.2. The above $\mathcal{J}$ and P (for infinite coin tossing) satisfy (2.5.5).
Proof (optional). In light of Exercises 2.6.1 and 2.5.6, it suffices to show that for $A_1, A_2, \ldots \in \mathcal{J}$ with $A_{n+1} \subseteq A_n$ and $\bigcap_{n=1}^\infty A_n = \emptyset$, we have $\lim_{n \to \infty} P(A_n) = 0$.

Give $\{0,1\}$ the discrete topology, and give $\Omega = \{0,1\} \times \{0,1\} \times \ldots$ the corresponding product topology. Then $\Omega$ is a product of compact sets $\{0,1\}$, and hence is itself compact by Tychonov's Theorem. Furthermore each element of $\mathcal{J}$ is a closed subset of $\Omega$, since its complement is open in the product topology. Hence, each $A_n$ is a closed subset of a compact space, and is therefore compact. The finite intersection property of compact sets then implies that there is $N \in \mathbf{N}$ with $A_n = \emptyset$ for all $n \ge N$. In particular, $P(A_n) \to 0$. ∎

Now that these conditions have been verified, it follows from Corollary 2.5.4 that the probabilities for the special sets $A_{a_1 a_2 \ldots a_n} \in \mathcal{J}$ can automatically be extended to a $\sigma$-algebra $\mathcal{M}$ containing $\mathcal{J}$. (Once again, this $\sigma$-algebra will be quite complicated, and we will never understand it completely. But it is still essential mathematically that we know it exists.) This will be our probability triple for infinite fair coin tossing.

As a sample calculation, let $H_n = \{(r_1, r_2, \ldots) \in \Omega : r_n = 1\}$ be the event that the nth coin comes up heads. We certainly would hope that $H_n \in \mathcal{M}$, with $P(H_n) = 1/2$. Happily, this is indeed the case. We note that

$$H_n = \bigcup_{r_1, r_2, \ldots, r_{n-1} \in \{0,1\}} A_{r_1 r_2 \ldots r_{n-1} 1},$$

the union being disjoint. Hence, since $\mathcal{M}$ is closed under countable (including finite) unions, we have $H_n \in \mathcal{M}$. Then, by countable additivity,

$$P(H_n) = \sum_{r_1, r_2, \ldots, r_{n-1} \in \{0,1\}} P(A_{r_1 r_2 \ldots r_{n-1} 1}) = \sum_{r_1, r_2, \ldots, r_{n-1} \in \{0,1\}} 1/2^n = 2^{n-1}/2^n = 1/2.$$

Remark. In fact, if we identify an element $x \in [0,1]$ with its binary expansion $(r_1, r_2, \ldots)$, i.e. so that $x = \sum_{k=1}^\infty r_k / 2^k$, then we see that infinite fair coin tossing may be viewed as being "essentially" the same thing as Lebesgue measure on [0,1].
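The Remark above can be explored numerically. The following sketch of ours truncates the infinite coin sequence at 53 coins (roughly the resolution of a double-precision float), so it is only an approximation of the infinite construction:

```python
import random

random.seed(1)

def flips(k=53):
    # One realization of the first k fair coins (r_1, ..., r_k).
    return [random.getrandbits(1) for _ in range(k)]

samples = [flips() for _ in range(10**5)]

# P(H_3), the third coin coming up heads, should be 1/2:
print(sum(r[2] for r in samples) / len(samples))  # approx 0.5

# Identifying (r_1, r_2, ...) with x = sum r_i / 2^i maps coin tossing
# (approximately, after truncation) to a Uniform[0,1] point:
xs = [sum(r / 2**(i + 1) for i, r in enumerate(seq)) for seq in samples]
print(sum(0.25 <= x <= 0.5 for x in xs) / len(xs))  # approx 0.25 = length
```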
Next, given any two probability triples $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$, we can define their product measure P on the Cartesian product set $\Omega_1 \times \Omega_2 = \{(\omega_1, \omega_2);\ \omega_i \in \Omega_i\ (i = 1,2)\}$. We set
$$\mathcal{J} = \{A \times B;\ A \in \mathcal{F}_1,\ B \in \mathcal{F}_2\}, \qquad (2.6.3)$$
and define $P(A \times B) = P_1(A)\, P_2(B)$ for $A \times B \in \mathcal{J}$. (The elements of $\mathcal{J}$ are called measurable rectangles.)

Exercise 2.6.4. Verify that the above $\mathcal{J}$ is a semialgebra, and that $\emptyset, \Omega \in \mathcal{J}$ with $P(\emptyset) = 0$ and $P(\Omega) = 1$.

We will show later (Exercise 4.5.15) that these $\mathcal{J}$ and P satisfy (2.5.5). Hence by Corollary 2.5.4, we can extend P to a $\sigma$-algebra containing $\mathcal{J}$. The resulting probability triple is called the product measure of $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$.

An important special case of product measure is Lebesgue measure in higher dimensions. For example, in dimension two, we define 2-dimensional Lebesgue measure on $[0,1] \times [0,1]$ to be the product of Lebesgue measure on $[0,1]$ with itself. This is a probability measure on $\Omega = [0,1] \times [0,1]$ with the property that
$$P([a,b] \times [c,d]) = (b-a)(d-c), \qquad 0 \le a \le b \le 1,\ 0 \le c \le d \le 1.$$
It is thus a measure of area in two dimensions. More generally, we can inductively define $d$-dimensional Lebesgue measure on $[0,1]^d$ for any $d \in \mathbf{N}$, by taking the product of Lebesgue measure on $[0,1]$ with $(d-1)$-dimensional Lebesgue measure on $[0,1]^{d-1}$. When $d = 3$, Lebesgue measure on $[0,1] \times [0,1] \times [0,1]$ is a measure of volume.
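As a quick illustration (a sketch of my own, not from the text), two-dimensional Lebesgue measure can be estimated by Monte Carlo: sampling the product measure just means drawing each coordinate independently and uniformly, and the fraction of points landing in a rectangle approximates $(b-a)(d-c)$.

```python
import random

random.seed(1)
a, b, c, d = 0.2, 0.7, 0.1, 0.5      # the rectangle [a,b] x [c,d]
trials = 200_000

# Sampling product measure: each coordinate is an independent Uniform[0,1].
inside = sum(1 for _ in range(trials)
             if a <= random.random() <= b and c <= random.random() <= d)

print(inside / trials)               # approximately (b-a)*(d-c) = 0.2
```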
2.7. Exercises.

Exercise 2.7.1. Let $\Omega = \{1,2,3,4\}$. Determine whether or not each of the following is a $\sigma$-algebra.
(a) $\mathcal{F}_1 = \{\emptyset, \{1,2\}, \{3,4\}, \{1,2,3,4\}\}$.
(b) $\mathcal{F}_2 = \{\emptyset, \{3\}, \{4\}, \{1,2\}, \{3,4\}, \{1,2,3\}, \{1,2,4\}, \{1,2,3,4\}\}$.
(c) $\mathcal{F}_3 = \{\emptyset, \{1,2\}, \{1,3\}, \{1,4\}, \{2,3\}, \{2,4\}, \{3,4\}, \{1,2,3,4\}\}$.

Exercise 2.7.2. Let $\Omega = \{1,2,3,4\}$, and let $\mathcal{J} = \{\{1\}, \{2\}\}$. Describe explicitly the $\sigma$-algebra $\sigma(\mathcal{J})$ generated by $\mathcal{J}$.

Exercise 2.7.3. Suppose $\mathcal{F}$ is a collection of subsets of $\Omega$, such that $\Omega \in \mathcal{F}$.
(a) Suppose that whenever $A, B \in \mathcal{F}$, then also $A \setminus B = A \cap B^c \in \mathcal{F}$. Prove that $\mathcal{F}$ is an algebra.
(b) Suppose $\mathcal{F}$ is a semialgebra. Prove that $\mathcal{F}$ is an algebra.
(c) Suppose that $\mathcal{F}$ is closed under complement, and also closed under finite disjoint unions (i.e. whenever $A, B \in \mathcal{F}$ are disjoint, then $A \cup B \in \mathcal{F}$). Give a counter-example to show that $\mathcal{F}$ might not be an algebra.

Exercise 2.7.4. Let $\mathcal{F}_1, \mathcal{F}_2, \ldots$ be a sequence of collections of subsets of $\Omega$, such that $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$ for each $n$.
(a) Suppose that each $\mathcal{F}_i$ is an algebra. Prove that $\bigcup_{i=1}^\infty \mathcal{F}_i$ is also an algebra.
(b) Suppose that each $\mathcal{F}_i$ is a $\sigma$-algebra. Show (by counter-example) that $\bigcup_{i=1}^\infty \mathcal{F}_i$ might not be a $\sigma$-algebra.

Exercise 2.7.5. Suppose that $\Omega = \mathbf{N}$ is the set of positive integers, and $\mathcal{F}$ is the set of all subsets $A$ such that either $A$ or $A^c$ is finite, and P is defined by $P(A) = 0$ if $A$ is finite, and $P(A) = 1$ if $A^c$ is finite.
(a) Is $\mathcal{F}$ an algebra?
(b) Is $\mathcal{F}$ a $\sigma$-algebra?
(c) Is P finitely additive?
(d) Is P countably additive on $\mathcal{F}$, meaning that if $A_1, A_2, \ldots \in \mathcal{F}$ are disjoint, and if it happens that $\bigcup_n A_n \in \mathcal{F}$, then $P(\bigcup_n A_n) = \sum_n P(A_n)$?

Exercise 2.7.6. Suppose that $\Omega = [0,1]$ is the unit interval, and $\mathcal{F}$ is the set of all subsets $A$ such that either $A$ or $A^c$ is finite, and P is defined by $P(A) = 0$ if $A$ is finite, and $P(A) = 1$ if $A^c$ is finite.
(a) Is $\mathcal{F}$ an algebra?
(b) Is $\mathcal{F}$ a $\sigma$-algebra?
(c) Is P finitely additive?
(d) Is P countably additive on $\mathcal{F}$ (as in the previous exercise)?

Exercise 2.7.7. Suppose that $\Omega = [0,1]$ is the unit interval, and $\mathcal{F}$ is the set of all subsets $A$ such that either $A$ or $A^c$ is countable (i.e., finite or countably infinite), and P is defined by $P(A) = 0$ if $A$ is countable, and $P(A) = 1$ if $A^c$ is countable.
(a) Is $\mathcal{F}$ an algebra?
(b) Is $\mathcal{F}$ a $\sigma$-algebra?
(c) Is P finitely additive?
(d) Is P countably additive on $\mathcal{F}$?

Exercise 2.7.8. For the example of Exercise 2.7.7, is P uncountably additive (cf. page 2)?

Exercise 2.7.9. Let $\mathcal{F}$ be a $\sigma$-algebra, and write $|\mathcal{F}|$ for the total number of subsets in $\mathcal{F}$. Prove that if $|\mathcal{F}| < \infty$ (i.e., if $\mathcal{F}$ consists of just a finite number of subsets), then $|\mathcal{F}| = 2^m$ for some $m \in \mathbf{N}$. [Hint: Consider those non-empty subsets in $\mathcal{F}$ which do not contain any other non-empty subset in $\mathcal{F}$. How can all subsets in $\mathcal{F}$ be "built up" from these particular subsets?]
Exercise 2.7.10. Prove that the collection $\mathcal{J}$ of (2.5.10) is a semialgebra.

Exercise 2.7.11. Let $\Omega = [0,1]$. Let $\mathcal{J}'$ be the set of all half-open intervals of the form $(a,b]$, for $0 \le a < b \le 1$, together with the sets $\emptyset$, $\Omega$, and $\{0\}$.
(a) Prove that $\mathcal{J}'$ is a semialgebra.
(b) Prove that $\sigma(\mathcal{J}') = \mathcal{B}$, i.e. that the $\sigma$-algebra generated by this $\mathcal{J}'$ is equal to the $\sigma$-algebra generated by the $\mathcal{J}$ of (2.4.1).
(c) Let $\mathcal{B}_0'$ be the collection of all finite disjoint unions of elements of $\mathcal{J}'$. Prove that $\mathcal{B}_0'$ is an algebra. Is $\mathcal{B}_0'$ the same as the algebra $\mathcal{B}_0$ defined in (2.4.2)?
[Remark: Some treatments of Lebesgue measure use $\mathcal{J}'$ instead of $\mathcal{J}$.]

Exercise 2.7.12. Let $K$ be the Cantor set as defined in Subsection 2.4. Let $D_n = K \oplus \frac{1}{n}$, where $K \oplus \frac{1}{n}$ is defined as in (1.2.4). Let $B = \bigcup_{n=1}^\infty D_n$.
(a) Draw a rough sketch of $D_3$.
(b) What is $\lambda(D_3)$?
(c) Draw a rough sketch of $B$.
(d) What is $\lambda(B)$?

Exercise 2.7.13. Give an example of a sample space $\Omega$, a semialgebra $\mathcal{J}$, and a non-negative function $P : \mathcal{J} \to \mathbf{R}$ with $P(\emptyset) = 0$ and $P(\Omega) = 1$, such that (2.5.5) is not satisfied.

Exercise 2.7.14. Let $\Omega = \{1,2,3,4\}$, with $\mathcal{F}$ the collection of all subsets of $\Omega$. Let P and Q be two probability measures on $\mathcal{F}$, such that $P\{1\} = P\{2\} = P\{3\} = P\{4\} = 1/4$, and $Q\{2\} = Q\{4\} = 1/2$, extended to $\mathcal{F}$ by linearity. Finally, let $\mathcal{J} = \{\emptyset, \Omega, \{1,2\}, \{2,3\}, \{3,4\}, \{1,4\}\}$.
(a) Prove that $P(A) = Q(A)$ for all $A \in \mathcal{J}$.
(b) Prove that there is $A \in \sigma(\mathcal{J})$ with $P(A) \ne Q(A)$.
(c) Why does this not contradict Proposition 2.5.8?

Exercise 2.7.15. Let $(\Omega, \mathcal{M}, \lambda)$ be Lebesgue measure on the interval $[0,1]$. Let $\Omega' = \{(x,y) \in \mathbf{R}^2;\ 0 \le x \le 1,\ 0 \le y \le 1\}$. Let $\mathcal{F}$ be the collection of all subsets of $\Omega'$ of the form
$$\{(x,y) \in \mathbf{R}^2;\ x \in A,\ 0 \le y \le 1\}$$
for some $A \in \mathcal{M}$. Finally, define a probability P on $\mathcal{F}$ by
$$P\big(\{(x,y) \in \mathbf{R}^2;\ x \in A,\ 0 \le y \le 1\}\big) = \lambda(A).$$
(a) Prove that $(\Omega', \mathcal{F}, P)$ is a probability triple.
(b) Let $P^*$ be the outer measure corresponding to P and $\mathcal{F}$. Define the subset $S \subseteq \Omega'$ by
$$S = \{(x,y) \in \mathbf{R}^2;\ 0 \le x \le 1,\ y = 1/2\}.$$
(Note that $S \notin \mathcal{F}$.) Prove that $P^*(S) = 1$ and $P^*(S^c) = 1$.

Exercise 2.7.16. (a) Where in the proof of Theorem 2.3.1 was assumption (2.3.3) used?
(b) How would the conclusion of Theorem 2.3.1 be modified if assumption (2.3.3) were dropped (but all other assumptions remained the same)?

Exercise 2.7.17. Let $\Omega = \{1,2\}$, and let $\mathcal{J}$ be the collection of all subsets of $\Omega$, with $P(\emptyset) = 0$, $P(\Omega) = 1$, and $P\{1\} = P\{2\} = 1/3$.
(a) Verify that all assumptions of Theorem 2.3.1 other than (2.3.3) are satisfied.
(b) Verify that assumption (2.3.3) is not satisfied.
(c) Describe precisely the $\mathcal{M}$ and $P^*$ that would result in this example from the modified version of Theorem 2.3.1 in Exercise 2.7.16(b).

Exercise 2.7.18. Let $\Omega = \{1,2\}$, $\mathcal{J} = \{\emptyset, \Omega, \{1\}\}$, $P(\emptyset) = 0$, $P(\Omega) = 1$, and $P(\{1\}) = 1/3$.
(a) Can Theorem 2.3.1, Corollary 2.5.1, or Corollary 2.5.4 be applied in this case? Why or why not?
(b) Can this P be extended to a valid probability measure? Explain.

Exercise 2.7.19. Let $\Omega$ be a finite non-empty set, and let $\mathcal{J}$ consist of all singletons in $\Omega$, together with $\emptyset$ and $\Omega$. Let $p : \Omega \to [0,1]$ with $\sum_{\omega \in \Omega} p(\omega) = 1$, and define $P(\emptyset) = 0$, $P(\Omega) = 1$, and $P(\{\omega\}) = p(\omega)$ for all $\omega \in \Omega$.
(a) Prove that $\mathcal{J}$ is a semialgebra.
(b) Prove that (2.3.2) and (2.3.3) are satisfied.
(c) Describe precisely the $\mathcal{M}$ and $P^*$ that result from applying Theorem 2.3.1.
(d) Are these $\mathcal{M}$ and $P^*$ the same as those described in Theorem 2.2.1?

Exercise 2.7.20. Let P and Q be two probability measures defined on the same sample space $\Omega$ and $\sigma$-algebra $\mathcal{F}$.
(a) Suppose that $P(A) = Q(A)$ for all $A \in \mathcal{F}$ with $P(A) \le \frac12$. Prove that $P = Q$, i.e. that $P(A) = Q(A)$ for all $A \in \mathcal{F}$.
(b) Give an example where $P(A) = Q(A)$ for all $A \in \mathcal{F}$ with $P(A) < \frac12$, but such that $P \ne Q$, i.e. that $P(A) \ne Q(A)$ for some $A \in \mathcal{F}$.

Exercise 2.7.21. Let $\lambda$ be Lebesgue measure in dimension two, i.e. Lebesgue measure on $[0,1] \times [0,1]$. Let $A$ be the triangle $\{(x,y) \in [0,1] \times [0,1];\ y < x\}$. Prove that $A$ is measurable with respect to $\lambda$, and compute $\lambda(A)$.

Exercise 2.7.22. Let $(\Omega_1, \mathcal{F}_1, P_1)$ be Lebesgue measure on $[0,1]$. Consider a second probability triple, $(\Omega_2, \mathcal{F}_2, P_2)$, defined as follows: $\Omega_2 = \{1,2\}$, $\mathcal{F}_2$ consists of all subsets of $\Omega_2$, and $P_2$ is defined by $P_2\{1\} = \frac13$, $P_2\{2\} = \frac23$, and additivity. Let $(\Omega, \mathcal{F}, P)$ be the product measure of $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$.
(a) Express each of $\Omega$, $\mathcal{F}$, and P as explicitly as possible.
(b) Find a set $A \in \mathcal{F}$ such that $P(A) = \frac34$.

2.8. Section summary.

The section gave a formal definition of a probability triple $(\Omega, \mathcal{F}, P)$, consisting of a sample space $\Omega$, a $\sigma$-algebra $\mathcal{F}$, and a probability measure P, and derived certain basic properties of them. It then considered the question of how to construct such probability triples. Discrete spaces (with countable $\Omega$) were straightforward, but other spaces were more challenging. The key tool was the Extension Theorem, which said that once a probability measure has been constructed on a semialgebra, it can then automatically be extended to a $\sigma$-algebra. The Extension Theorem allowed us to construct Lebesgue measure on $[0,1]$, and to consider some of its basic properties. It also allowed us to construct other probability triples such as infinite coin tossing, product measures, and multi-dimensional Lebesgue measure.
3. Further probabilistic foundations.

Now that we understand probability triples well, we discuss some additional essential ingredients of probability theory. Throughout Section 3 (and, indeed, throughout most of this text and most of probability theory in general), we shall assume that there is an underlying probability triple $(\Omega, \mathcal{F}, P)$ with respect to which all further probability objects are defined. This assumption shall be so universal that we will often not even mention it.

3.1. Random variables.

If we think of a sample space $\Omega$ as the set of all possible random outcomes of some experiment, then a random variable assigns a numerical value to each of these outcomes. More formally, we have

Definition 3.1.1. Given a probability triple $(\Omega, \mathcal{F}, P)$, a random variable is a function $X$ from $\Omega$ to the real numbers $\mathbf{R}$, such that
$$\{\omega \in \Omega;\ X(\omega) \le x\} \in \mathcal{F}, \qquad x \in \mathbf{R}. \qquad (3.1.2)$$

Equation (3.1.2) is a technical requirement, and states that the function $X$ must be measurable. It can also be written as $\{X \le x\} \in \mathcal{F}$, or $X^{-1}((-\infty, x]) \in \mathcal{F}$, for all $x \in \mathbf{R}$. Since complements and unions and intersections are preserved under inverse images (see Subsection A.1), it follows from Exercise 2.4.5 that equation (3.1.2) is equivalent to saying that $X^{-1}(B) \in \mathcal{F}$ for every Borel set $B$. That is, the set $X^{-1}(B)$, also written $\{X \in B\}$, is indeed an event. So, for any Borel set $B$, it makes sense to talk about $P(X \in B)$, the probability that $X$ lies in $B$.

Example 3.1.3. Suppose that $(\Omega, \mathcal{F}, P)$ is Lebesgue measure on $[0,1]$. Then we might define some random variables $X$, $Y$, and $Z$ by $X(\omega) = \omega$, $Y(\omega) = 2\omega$, and $Z(\omega) = 3\omega + 4$. We then have, for example, that $Y = 2X$, and $Z = 3X + 4 = \frac32 Y + 4$. Also,
$$P(Y \le 1/3) = P\{\omega;\ Y(\omega) \le 1/3\} = P\{\omega;\ 2\omega \le 1/3\} = P([0, 1/6]) = 1/6.$$

Exercise 3.1.4. For Example 3.1.3, compute $P(Z > a)$ and $P(X \le a \text{ and } Y \le b)$ as functions of $a, b \in \mathbf{R}$.
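The computation in Example 3.1.3 is easy to confirm by simulation. Here is a quick sketch (my own, not from the text) estimating $P(Y \le 1/3)$ by drawing $\omega$ uniformly from $[0,1]$:

```python
import random

random.seed(2)
trials = 100_000

# Omega = [0,1] with Lebesgue measure; Y(omega) = 2*omega as in Example 3.1.3.
count = 0
for _ in range(trials):
    omega = random.random()
    if 2 * omega <= 1/3:
        count += 1

print(count / trials)    # approximately P([0, 1/6]) = 1/6 ~ 0.1667
```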
Now, not all functions from $\Omega$ to $\mathbf{R}$ are random variables. For example, let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$, and let $H \subseteq \Omega$ be the non-measurable set of Proposition 2.4.8. Define $X : \Omega \to \mathbf{R}$ by $X = \mathbf{1}_{H^c}$, so $X(\omega) = 0$ for $\omega \in H$, and $X(\omega) = 1$ for $\omega \notin H$. Then $\{\omega \in \Omega : X(\omega) \le 1/2\} = H \notin \mathcal{F}$, so $X$ is not a random variable.

On the other hand, the following proposition shows that condition (3.1.2) is preserved under usual arithmetic and limits. In practice, this means that if functions from $\Omega$ to $\mathbf{R}$ are constructed in "usual" ways, then (3.1.2) will be satisfied, so the functions will indeed be random variables.

Proposition 3.1.5.
(i) If $X = \mathbf{1}_A$ is the indicator of some event $A \in \mathcal{F}$, then $X$ is a random variable.
(ii) If $X$ and $Y$ are random variables and $c \in \mathbf{R}$, then $X + c$, $cX$, $X^2$, $X + Y$, and $XY$ are all random variables.
(iii) If $Z_1, Z_2, \ldots$ are random variables such that $\lim_{n\to\infty} Z_n(\omega)$ exists for each $\omega \in \Omega$, and $Z(\omega) = \lim_{n\to\infty} Z_n(\omega)$, then $Z$ is also a random variable.

Proof. (i) If $X = \mathbf{1}_A$ for $A \in \mathcal{F}$, then $X^{-1}(B)$ must be one of $A$, $A^c$, $\emptyset$, or $\Omega$, so $X^{-1}(B) \in \mathcal{F}$.
(ii) The first two of these assertions are immediate. The third follows since for $y \ge 0$, $\{X^2 \le y\} = \{X \in [-\sqrt{y}, \sqrt{y}]\} \in \mathcal{F}$. For the fourth, note (by finding a rational number $r \in (X, x - Y)$) that
$$\{X + Y < x\} = \bigcup_{r \text{ rational}} \big(\{X < r\} \cap \{Y < x - r\}\big) \in \mathcal{F}.$$
The fifth assertion then follows since $XY = \frac12\big[(X+Y)^2 - X^2 - Y^2\big]$.
(iii) For $x \in \mathbf{R}$,
$$\{Z \le x\} = \bigcap_{m=1}^\infty \bigcup_{n=1}^\infty \bigcap_{k=n}^\infty \big\{Z_k \le x + \tfrac{1}{m}\big\}. \qquad (3.1.6)$$
But $Z_k$ is a random variable, so $\{Z_k \le x + \frac1m\} \in \mathcal{F}$. Then, since $\mathcal{F}$ is a $\sigma$-algebra, we must have $\{Z \le x\} \in \mathcal{F}$. ∎

Exercise 3.1.7. Prove (3.1.6). [Hint: remember the definition of $Z(\omega) = \lim_{n\to\infty} Z_n(\omega)$, cf. Subsection A.3.]

Suppose now that $X$ is a random variable, and $f : \mathbf{R} \to \mathbf{R}$ is a function from $\mathbf{R}$ to $\mathbf{R}$ which is Borel-measurable, meaning that $f^{-1}(A) \in \mathcal{B}$ for any $A \in \mathcal{B}$ (where $\mathcal{B}$ is the collection of Borel sets of $\mathbf{R}$). (Equivalently, $f$ is a random variable corresponding to $\Omega = \mathbf{R}$ and $\mathcal{F} = \mathcal{B}$.) We can define a new random variable $f(X)$, the composition of $X$ with $f$, by $f(X)(\omega) = f(X(\omega))$ for each $\omega \in \Omega$. Then (3.1.2) is satisfied since for $B \in \mathcal{B}$, $\{f(X) \in B\} = \{X \in f^{-1}(B)\} \in \mathcal{F}$.
Proposition 3.1.8. If $f$ is a continuous function, or a piecewise-continuous function, then $f$ is Borel-measurable.

Proof. A basic result of point-set topology says that if $f$ is continuous, then $f^{-1}(O)$ is an open subset of $\mathbf{R}$ whenever $O$ is. In particular, $f^{-1}((x, \infty))$ is open, so $f^{-1}((x, \infty)) \in \mathcal{B}$, so $f^{-1}((-\infty, x]) \in \mathcal{B}$. If $f$ is piecewise-continuous, then we can write $f = f_1 \mathbf{1}_{I_1} + f_2 \mathbf{1}_{I_2} + \ldots + f_n \mathbf{1}_{I_n}$, where the $f_j$ are continuous and the $\{I_j\}$ are disjoint intervals. It follows from the above and Proposition 3.1.5 that $f$ is Borel-measurable. ∎

For example, if $f(x) = x^k$ for $k \in \mathbf{N}$, then $f$ is Borel-measurable. Hence, if $X$ is a random variable, then so is $X^k$ for all $k \in \mathbf{N}$.

Remark 3.1.9. In probability theory, the underlying probability triple $(\Omega, \mathcal{F}, P)$ is usually complete (cf. Exercise 2.3.16; for example this is always true for discrete probability spaces, or for those such as Lebesgue measure constructed using the Extension Theorem). In that case, if $X$ is a random variable, and $Y : \Omega \to \mathbf{R}$ is such that $P(X = Y) = 1$, then $Y$ must also be a random variable.

Remark 3.1.10. In Definition 3.1.1, we assume that $X$ is a real-valued random variable, i.e. that it maps $\Omega$ into the set of real numbers equipped with the Borel $\sigma$-algebra. More generally, one could consider a random variable which mapped $\Omega$ to an arbitrary second measurable space, i.e. to some second non-empty set $\Omega'$ with its own collection $\mathcal{F}'$ of measurable subsets. We would then have $X : \Omega \to \Omega'$, with condition (3.1.2) replaced by the condition that $X^{-1}(A') \in \mathcal{F}$ whenever $A' \in \mathcal{F}'$.

3.2. Independence.

Informally, events or random variables are independent if they do not affect each other's probabilities. Thus, two events $A$ and $B$ are independent if $P(A \cap B) = P(A)P(B)$. Intuitively, the probabilistic proportion of the event $B$ which also includes $A$ (i.e., $P(A \cap B)/P(B)$) is equal to the overall probability of $A$ (i.e., $P(A)$); the definition uses products to avoid division by zero.

Three events $A$, $B$, and $C$ are said to be independent if all of the following equations are satisfied: $P(A \cap B) = P(A)P(B)$; $P(A \cap C) = P(A)P(C)$; $P(B \cap C) = P(B)P(C)$; and $P(A \cap B \cap C) = P(A)P(B)P(C)$. It is not sufficient (see Exercise 3.6.3) to check just the final one, or just the first three, of these equations.
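For instance (a quick sketch of my own, using the finite coin-tossing space of Subsection 2.6), with three fair coins one can enumerate all $2^3$ outcomes and verify the product rule directly for the events "coin 1 is heads" and "coin 2 is heads":

```python
from itertools import product

# All 8 outcomes of three fair coins, each with probability 1/8.
outcomes = list(product((0, 1), repeat=3))

def prob(event):
    """Probability of an event (a set of outcomes) under the uniform measure."""
    return len(event) / len(outcomes)

A = {w for w in outcomes if w[0] == 1}    # first coin heads
B = {w for w in outcomes if w[1] == 1}    # second coin heads

print(prob(A & B), prob(A) * prob(B))     # both 0.25, so A and B are independent
```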
More generally, a possibly-infinite collection $\{A_\alpha\}_{\alpha \in I}$ of events is said to be independent if for each $j \in \mathbf{N}$ and each distinct finite choice $\alpha_1, \alpha_2, \ldots, \alpha_j \in I$, we have
$$P(A_{\alpha_1} \cap A_{\alpha_2} \cap \ldots \cap A_{\alpha_j}) = P(A_{\alpha_1})\, P(A_{\alpha_2}) \cdots P(A_{\alpha_j}). \qquad (3.2.1)$$

Exercise 3.2.2. Suppose (3.2.1) is satisfied.
(a) Show that (3.2.1) is still satisfied if $A_{\alpha_1}$ is replaced by $A_{\alpha_1}^c$.
(b) Show that (3.2.1) is still satisfied if each $A_{\alpha_i}$ is replaced by the corresponding $A_{\alpha_i}^c$.
(c) Prove that if $\{A_\alpha\}_{\alpha \in I}$ is independent, then so is $\{A_\alpha^c\}_{\alpha \in I}$.

We shall on occasion also talk about independence of collections of events. Collections of events $\{\mathcal{A}_\alpha;\ \alpha \in I\}$ are independent if for all $j \in \mathbf{N}$, for all distinct $\alpha_1, \ldots, \alpha_j \in I$, and for all $A_1 \in \mathcal{A}_{\alpha_1}, \ldots, A_j \in \mathcal{A}_{\alpha_j}$, equation (3.2.1) holds.

We shall also talk about independence of random variables. Random variables $X$ and $Y$ are independent if for all Borel sets $S_1$ and $S_2$, the events $X^{-1}(S_1)$ and $Y^{-1}(S_2)$ are independent, i.e.
$$P(X \in S_1,\ Y \in S_2) = P(X \in S_1)\, P(Y \in S_2).$$
More generally, a collection $\{X_\alpha;\ \alpha \in I\}$ of random variables is independent if for all $j \in \mathbf{N}$, for all distinct $\alpha_1, \ldots, \alpha_j \in I$, and for all Borel sets $S_1, \ldots, S_j$, we have
$$P(X_{\alpha_1} \in S_1,\ X_{\alpha_2} \in S_2,\ \ldots,\ X_{\alpha_j} \in S_j) = P(X_{\alpha_1} \in S_1)\, P(X_{\alpha_2} \in S_2) \cdots P(X_{\alpha_j} \in S_j).$$

Independence is preserved under deterministic transformations:

Proposition 3.2.3. Let $X$ and $Y$ be independent random variables. Let $f, g : \mathbf{R} \to \mathbf{R}$ be Borel-measurable functions. Then the random variables $f(X)$ and $g(Y)$ are independent.

Proof. For Borel $S_1, S_2 \subseteq \mathbf{R}$, we compute that
$$P(f(X) \in S_1,\ g(Y) \in S_2) = P\big(X \in f^{-1}(S_1),\ Y \in g^{-1}(S_2)\big) = P\big(X \in f^{-1}(S_1)\big)\, P\big(Y \in g^{-1}(S_2)\big) = P(f(X) \in S_1)\, P(g(Y) \in S_2). \quad ∎$$

We also have the following.

Proposition 3.2.4. Let $X$ and $Y$ be two random variables, defined jointly on some probability triple $(\Omega, \mathcal{F}, P)$. Then $X$ and $Y$ are independent if and only if $P(X \le x,\ Y \le y) = P(X \le x)\, P(Y \le y)$ for all $x, y \in \mathbf{R}$.
Proof. The "only if" part is immediate from the definition. For the "if" part, fix $x \in \mathbf{R}$ with $P(X \le x) > 0$, and define the measure Q on the Borel subsets of $\mathbf{R}$ by $Q(S) = P(X \le x,\ Y \in S)/P(X \le x)$. Then by assumption, $Q((-\infty, y]) = P(Y \le y)$ for all $y \in \mathbf{R}$. It follows from Corollary 2.5.9 that $Q(S) = P(Y \in S)$ for all Borel $S \subseteq \mathbf{R}$, i.e. that
$$P(X \le x,\ Y \in S) = P(X \le x)\, P(Y \in S).$$
Then, for fixed Borel $S \subseteq \mathbf{R}$ with $P(Y \in S) > 0$, let $R(T) = P(X \in T,\ Y \in S)/P(Y \in S)$. By the above, it follows that $R((-\infty, x]) = P(X \le x)$ for each $x \in \mathbf{R}$. It then follows from Corollary 2.5.9 that $R(T) = P(X \in T)$ for all Borel $T \subseteq \mathbf{R}$, i.e. that $P(X \in T,\ Y \in S) = P(X \in T)\, P(Y \in S)$ for all Borel $S, T \subseteq \mathbf{R}$. Hence, $X$ and $Y$ are independent. ∎

Independence will come up often in this text, and its significance will become more clear as we proceed.

3.3. Continuity of probabilities.

Given a probability triple $(\Omega, \mathcal{F}, P)$, and events $A, A_1, A_2, \ldots \in \mathcal{F}$, we write $\{A_n\} \nearrow A$ to mean that $A_1 \subseteq A_2 \subseteq A_3 \subseteq \ldots$, and $\bigcup_n A_n = A$. In words, the events $A_n$ increase to $A$. Similarly, we write $\{A_n\} \searrow A$ to mean that $\{A_n^c\} \nearrow A^c$, or equivalently that $A_1 \supseteq A_2 \supseteq A_3 \supseteq \ldots$, and $\bigcap_n A_n = A$. In words, the events $A_n$ decrease to $A$. We then have

Proposition 3.3.1. (Continuity of probabilities.) If $\{A_n\} \nearrow A$ or $\{A_n\} \searrow A$, then $\lim_{n\to\infty} P(A_n) = P(A)$.

Proof. Suppose $\{A_n\} \nearrow A$. Let $B_n = A_n \cap A_{n-1}^c$ (with $A_0 = \emptyset$). Then the $\{B_n\}$ are disjoint, with $\bigcup_n B_n = \bigcup_n A_n = A$. Hence,
$$P(A) = P\Big(\bigcup_m B_m\Big) = \sum_{m=1}^\infty P(B_m) = \lim_{n\to\infty} \sum_{m=1}^n P(B_m) = \lim_{n\to\infty} P\Big(\bigcup_{m \le n} B_m\Big) = \lim_{n\to\infty} P(A_n)$$
(where the last equality is the only time we use that the $\{A_m\}$ are a nested sequence). If instead $\{A_n\} \searrow A$, then $\{A_n^c\} \nearrow A^c$, so
$$P(A) = 1 - P(A^c) = 1 - \lim_n P(A_n^c) = \lim_n \big(1 - P(A_n^c)\big) = \lim_n P(A_n). \quad ∎$$
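A tiny numerical illustration (mine, not the book's): on $[0,1]$ with Lebesgue measure, the nested events $A_n = [0, \frac12 - \frac1n]$ increase to $[0, \frac12)$, and their estimated probabilities climb toward $\frac12$.

```python
import random

random.seed(3)

def estimate(n, trials=100_000):
    """Monte Carlo estimate of P(A_n), A_n = [0, 1/2 - 1/n], under Lebesgue measure."""
    return sum(random.random() <= 0.5 - 1.0 / n for _ in range(trials)) / trials

for n in (2, 10, 100):
    print(n, estimate(n))    # increases toward P([0, 1/2)) = 0.5
```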
If the $\{A_n\}$ are not nested, then we may not have $\lim_n P(A_n) = P(A)$. For example, suppose that $A_n = \Omega$ for $n$ odd, but $A_n = \emptyset$ for $n$ even. Then $P(A_n)$ alternates between 0 and 1, so that $\lim_n P(A_n)$ does not exist.

3.4. Limit events.

Given events $A_1, A_2, \ldots \in \mathcal{F}$, we define
$$\limsup_n A_n = \{A_n \text{ i.o.}\} = \bigcap_{n=1}^\infty \bigcup_{k=n}^\infty A_k$$
and
$$\liminf_n A_n = \{A_n \text{ a.a.}\} = \bigcup_{n=1}^\infty \bigcap_{k=n}^\infty A_k.$$

The event $\limsup_n A_n$ is referred to as "$A_n$ infinitely often"; it stands for those $\omega \in \Omega$ which are in infinitely many of the $A_n$. Intuitively, it is the event that infinitely many of the events $A_n$ occur. Similarly, the event $\liminf_n A_n$ is referred to as "$A_n$ almost always"; intuitively, it is the event that all but a finite number of the events $A_n$ occur. Since $\mathcal{F}$ is a $\sigma$-algebra, we see that $\limsup_n A_n \in \mathcal{F}$ and $\liminf_n A_n \in \mathcal{F}$. Also, by de Morgan's laws, $(\limsup_n A_n)^c = \liminf_n (A_n^c)$, so $P(A_n \text{ i.o.}) = 1 - P(A_n^c \text{ a.a.})$.

For example, suppose $(\Omega, \mathcal{F}, P)$ is infinite fair coin tossing, and $H_n$ is the event that the $n$th coin is heads. Then $\limsup_n H_n$ is the event that there are infinitely many heads. Also, $\liminf_n H_n$ is the event that all but a finite number of the coins were heads, i.e. that there were only finitely many tails.

Proposition 3.4.1. We always have
$$P\big(\liminf_n A_n\big) \le \liminf_n P(A_n) \le \limsup_n P(A_n) \le P\big(\limsup_n A_n\big).$$

Proof. The middle inequality holds by definition, and the last inequality follows similarly to the first, so we prove only the first inequality. We note that as $n \to \infty$, the events $\big\{\bigcap_{k=n}^\infty A_k\big\}$ are increasing (cf. Subsection 3.3) to $\liminf_n A_n$. Hence, by continuity of probabilities,
$$P\big(\liminf_n A_n\big) = P\Big(\bigcup_{n=1}^\infty \bigcap_{k=n}^\infty A_k\Big) = \lim_{n\to\infty} P\Big(\bigcap_{k=n}^\infty A_k\Big)$$
$$= \liminf_{n\to\infty} P\Big(\bigcap_{k=n}^\infty A_k\Big) \le \liminf_{n\to\infty} P(A_n),$$
where the final equality follows by definition (if a limit exists, then it is equal to the liminf), and the final inequality follows from monotonicity (2.1.2). ∎

For example, again considering infinite fair coin tossing, with $H_n$ the event that the $n$th coin is heads, Proposition 3.4.1 says that $P(H_n \text{ i.o.}) \ge \frac12$, which is interesting but vague. To improve this result, we require a more powerful theorem.

Theorem 3.4.2. (The Borel-Cantelli Lemma.) Let $A_1, A_2, \ldots \in \mathcal{F}$.
(i) If $\sum_n P(A_n) < \infty$, then $P(\limsup_n A_n) = 0$.
(ii) If $\sum_n P(A_n) = \infty$, and $\{A_n\}$ are independent, then $P(\limsup_n A_n) = 1$.

Proof. For (i), we note that for any $m \in \mathbf{N}$, we have by countable subadditivity that
$$P\big(\limsup_n A_n\big) \le P\Big(\bigcup_{k=m}^\infty A_k\Big) \le \sum_{k=m}^\infty P(A_k),$$
which goes to 0 as $m \to \infty$ if the sum is convergent.

For (ii), since $(\limsup_n A_n)^c = \bigcup_{n=1}^\infty \bigcap_{k=n}^\infty A_k^c$, it suffices (by countable subadditivity) to show that $P\big(\bigcap_{k=n}^\infty A_k^c\big) = 0$ for each $n \in \mathbf{N}$. Well, for $n, m \in \mathbf{N}$, we have by independence and Exercise 3.2.2 (and since $1 - x \le e^{-x}$ for any real number $x$) that
$$P\Big(\bigcap_{k=n}^\infty A_k^c\Big) \le P\Big(\bigcap_{k=n}^m A_k^c\Big) = \prod_{k=n}^m \big(1 - P(A_k)\big) \le \prod_{k=n}^m e^{-P(A_k)} = e^{-\sum_{k=n}^m P(A_k)},$$
which goes to 0 as $m \to \infty$ if the sum is divergent. ∎

This theorem is striking since it asserts that if $\{A_n\}$ are independent, then $P(\limsup_n A_n)$ is always either 0 or 1; it is never $\frac12$ or $\frac13$ or any other value. In the next section we shall see that this statement is true even more generally.

We note that the independence assumption for part (ii) of Theorem 3.4.2 cannot simply be omitted. For example, consider infinite fair coin tossing, and let $A_1 = A_2 = A_3 = \ldots = \{r_1 = 1\}$, i.e. let all the events be the event that the first coin comes up heads. Then the $\{A_n\}$ are clearly not independent. And, we clearly have $P(\limsup_n A_n) = P(r_1 = 1) = \frac12$.
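The dichotomy in Theorem 3.4.2 is easy to see numerically. Here is a small simulation sketch (my own, not from the text): independent events $A_n$ with $P(A_n) = 1/n$ (divergent sum) keep occurring, while with $P(A_n) = 1/n^2$ (convergent sum) only finitely many occur.

```python
import random

random.seed(4)

def count_occurrences(p_fn, n_max):
    """Simulate independent events A_n with P(A_n) = p_fn(n); count how many occur."""
    return sum(random.random() < p_fn(n) for n in range(1, n_max + 1))

N = 100_000
print(count_occurrences(lambda n: 1.0 / n, N))      # divergent sum: many hits, ~ log N
print(count_occurrences(lambda n: 1.0 / n**2, N))   # convergent sum: only a few hits
```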
Theorem 3.4.2 provides very precise information about $P(\limsup_n A_n)$ in many cases. Consider again infinite fair coin tossing, with $H_n$ the event that the $n$th coin is heads. This theorem shows that $P(H_n \text{ i.o.}) = 1$, i.e. there is probability 1 that an infinite sequence of coins will contain infinitely many heads. Furthermore, $P(H_n \text{ a.a.}) = 1 - P(H_n^c \text{ i.o.}) = 1 - 1 = 0$, so the infinite sequence will never contain all but finitely many heads.

Similarly, we have that
$$P\big(H_{2^n+1} \cap H_{2^n+2} \cap \ldots \cap H_{2^n+[\log_2 n]} \text{ i.o.}\big) = 1,$$
since the events $\{H_{2^n+1} \cap H_{2^n+2} \cap \ldots \cap H_{2^n+[\log_2 n]}\}$ are seen to be independent for different values of $n$, and since their probabilities are approximately $1/n$, which sums to infinity. On the other hand,
$$P\big(H_{2^n+1} \cap H_{2^n+2} \cap \ldots \cap H_{2^n+[2\log_2 n]} \text{ i.o.}\big) = 0,$$
since in this case the probabilities are approximately $1/n^2$, which have finite sum.

An event like $\{B_n \text{ i.o.}\}$, where $B_n = H_n \cap H_{n+1}$, is more difficult. In this case $\sum_n P(B_n) = \sum_n (1/4) = \infty$. However, the $\{B_n\}$ are not independent, since $B_n$ and $B_{n+1}$ both involve the same event $H_{n+1}$ (i.e., the $(n+1)$st coin). Hence, Theorem 3.4.2 does not immediately apply. On the other hand, by considering the subsequence $n = 2k$ of indices, we see that $\{B_{2k}\}_{k=1}^\infty$ are independent, and $\sum_{k=1}^\infty P(B_{2k}) = \sum_{k=1}^\infty (1/4) = \infty$. Hence, $P(B_{2k} \text{ i.o.}) = 1$, so that $P(B_n \text{ i.o.}) = 1$ also.

For a similar but more complicated example, let $B_n = H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{n+[\log_2 \log_2 n]}$. Again, $\sum_n P(B_n) = \infty$, but the $\{B_n\}$ are not independent. But by considering the subsequence $n = 2^k$ of indices, we compute that $\{B_{2^k}\}$ are independent, and $\sum_k P(B_{2^k}) = \infty$. Hence, $P(B_{2^k} \text{ i.o.}) = 1$, so that $P(B_n \text{ i.o.}) = 1$ also.
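The flavour of these run-length results can be checked by simulation (a sketch of mine, not the text's): head runs of length about $\log_2 n$ keep recurring, while much longer runs do not. Below we simply track the longest run of heads among the first $N$ tosses, which grows roughly like $\log_2 N$.

```python
import math
import random

random.seed(5)

def longest_head_run(n_tosses):
    best = run = 0
    for _ in range(n_tosses):
        run = run + 1 if random.random() < 0.5 else 0
        best = max(best, run)
    return best

for N in (1_000, 10_000, 100_000):
    print(N, longest_head_run(N), round(math.log2(N), 1))   # longest run ~ log2(N)
```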
3.5. Tail fields.

Given a sequence of events $A_1, A_2, \ldots$, we define their tail field by
$$\tau = \bigcap_{n=1}^\infty \sigma(A_n, A_{n+1}, A_{n+2}, \ldots).$$

In words, an event $A \in \tau$ must have the property that for any $n$, it depends only on the events $A_n, A_{n+1}, \ldots$; in particular, it does not care about any finite number of the events $A_n$.

One might think that very few events could possibly be in the tail field, but in fact it sometimes contains many events. For example, if we are considering infinite fair coin tossing (Subsection 2.6), and $H_n$ is the event that the $n$th coin comes up heads, then $\tau$ includes the event $\limsup_n H_n$ that we obtain infinitely many heads; the event $\liminf_n H_n$ that we obtain only finitely many tails; the event $\limsup_n H_{2^n}$ that we obtain infinitely many heads on tosses $2, 4, 8, \ldots$; the event $\{\lim_{n\to\infty} \frac1n \sum_{i=1}^n r_i \le \frac12\}$ that the limiting fraction of heads is $\le \frac12$; the event $\{r_n = r_{n+1} = r_{n+2} \text{ i.o.}\}$ that we infinitely often obtain the same result on three consecutive coin flips; etc. So we see that $\tau$ contains many interesting events. A surprising theorem is

Theorem 3.5.1. (Kolmogorov Zero-One Law.) If events $A_1, A_2, \ldots$ are independent, with tail field $\tau$, and if $A \in \tau$, then $P(A) = 0$ or 1.

To prove this theorem, we need a technical result about independence.

Lemma 3.5.2. Let $B, B_1, B_2, \ldots$ be independent. Then $\{B\}$ and $\sigma(B_1, B_2, \ldots)$ are independent classes, i.e. if $S \in \sigma(B_1, B_2, \ldots)$, then $P(S \cap B) = P(S)\, P(B)$.

Proof. Assume that $P(B) > 0$; otherwise the statement is trivial. Let $\mathcal{J}$ be the collection of all sets of the form $D_{i_1} \cap D_{i_2} \cap \ldots \cap D_{i_n}$, where $n \in \mathbf{N}$ and where $D_{i_j}$ is either $B_{i_j}$ or $B_{i_j}^c$, together with $\emptyset$ and $\Omega$. Then for $A \in \mathcal{J}$, we have by independence that $P(A) = P(B \cap A)/P(B)$. Now define a new probability measure Q on $\sigma(B_1, B_2, \ldots)$ by $Q(S) = P(B \cap S)/P(B)$, for $S \in \sigma(B_1, B_2, \ldots)$. Then $Q(\emptyset) = 0$, $Q(\Omega) = 1$, and Q is countably additive since P is, so Q is indeed a probability measure. Furthermore, Q and P agree on $\mathcal{J}$. Hence, by Proposition 2.5.8, Q and P agree on $\sigma(\mathcal{J}) = \sigma(B_1, B_2, \ldots)$. That is, $P(S) = Q(S) = P(B \cap S)/P(B)$ for all $S \in \sigma(B_1, B_2, \ldots)$, as required. ∎

Applying this lemma twice, we obtain:

Corollary 3.5.3. Let $A_1, A_2, \ldots, B_1, B_2, \ldots$ be independent. Then if $S_1 \in \sigma(A_1, A_2, \ldots)$, then $S_1, B_1, B_2, \ldots$ are independent. Furthermore, the $\sigma$-algebras $\sigma(A_1, A_2, \ldots)$ and $\sigma(B_1, B_2, \ldots)$ are independent classes, i.e. if $S_1 \in \sigma(A_1, A_2, \ldots)$ and $S_2 \in \sigma(B_1, B_2, \ldots)$, then $P(S_1 \cap S_2) = P(S_1)\, P(S_2)$.

Proof. For any distinct $i_1, i_2, \ldots, i_n$, let $A = B_{i_1} \cap \ldots \cap B_{i_n}$. Then it follows immediately that $A, A_1, A_2, \ldots$ are independent. Hence, from Lemma 3.5.2, if $S_1 \in \sigma(A_1, A_2, \ldots)$, then $P(A \cap S_1) = P(A)\, P(S_1)$. Since this is true for all distinct $i_1, \ldots, i_n$, it follows that $S_1, B_1, B_2, \ldots$ are independent. Lemma 3.5.2 then implies that if $S_2 \in \sigma(B_1, B_2, \ldots)$, then $S_1$ and
$S_2$ are independent. ∎

Proof of Theorem 3.5.1. We can now easily prove the Kolmogorov Zero-One Law. The proof is rather remarkable! Indeed, $A \in \sigma(A_n, A_{n+1}, \ldots)$, so by Corollary 3.5.3, $A, A_1, A_2, \ldots, A_{n-1}$ are independent. Since this is true for all $n \in \mathbf{N}$, and since independence is defined in terms of finite subcollections only, it is also true that $A, A_1, A_2, \ldots$ are independent. Hence, from Lemma 3.5.2, $A$ and $S$ are independent for all $S \in \sigma(A_1, A_2, \ldots)$. On the other hand, $A \in \tau \subseteq \sigma(A_1, A_2, \ldots)$. It follows that $A$ is independent of itself (!). This implies that $P(A \cap A) = P(A)\, P(A)$. That is, $P(A) = P(A)^2$, so $P(A) = 0$ or 1. ∎

3.6. Exercises.

Exercise 3.6.1. Let $X$ be a real-valued random variable defined on a probability triple $(\Omega, \mathcal{F}, P)$. Fill in the following blanks:
(a) $\mathcal{F}$ is a collection of subsets of ______.
(b) $P(A)$ is a well-defined element of ______ provided that $A$ is an element of ______.
(c) $\{X \le 5\}$ is shorthand notation for the particular subset of ______ which is defined by: ______.
(d) If $S$ is a subset of ______, then $\{X \in S\}$ is a subset of ______.
(e) If $S$ is a ______ subset of ______, then $\{X \in S\}$ must be an element of ______.

Exercise 3.6.2. Let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$. Let $A = (1/2, 3/4)$ and $B = (0, 2/3)$. Are $A$ and $B$ independent events?

Exercise 3.6.3. Give an example of events $A$, $B$, and $C$, each of probability strictly between 0 and 1, such that
(a) $P(A \cap B) = P(A)P(B)$, $P(A \cap C) = P(A)P(C)$, and $P(B \cap C) = P(B)P(C)$; but it is not the case that $P(A \cap B \cap C) = P(A)P(B)P(C)$. [Hint: You can let $\Omega$ be a set of four equally likely points.]
(b) $P(A \cap B) = P(A)P(B)$, $P(A \cap C) = P(A)P(C)$, and $P(A \cap B \cap C) = P(A)P(B)P(C)$; but it is not the case that $P(B \cap C) = P(B)P(C)$. [Hint: You can let $\Omega$ be a set of eight equally likely points.]

Exercise 3.6.4. Suppose $\{A_n\} \nearrow A$. Let $f : \Omega \to \mathbf{R}$ be any function. Prove that $\lim_{n\to\infty} \inf_{\omega \in A_n} f(\omega) = \inf_{\omega \in A} f(\omega)$.
Exercise 3.6.5. Let $(\Omega, \mathcal{F}, P)$ be a probability triple such that $\Omega$ is countable. Prove that it is impossible for there to exist a sequence $A_1, A_2, \ldots \in \mathcal{F}$ which is independent, such that $P(A_i) = \frac12$ for each $i$. [Hint: First prove that for each $\omega \in \Omega$, and each $n \in \mathbf{N}$, we have $P(\{\omega\}) \le 1/2^n$. Then derive a contradiction.]

Exercise 3.6.6. Let $X$, $Y$, and $Z$ be three independent random variables, and set $W = X + Y$. Let $B_{k,n} = \{(n-1)2^{-k} < X \le n 2^{-k}\}$ and let $C_{k,m} = \{(m-1)2^{-k} < Y \le m 2^{-k}\}$. Let
$$A_k = \bigcup_{(n+m)2^{-k} < x} (B_{k,n} \cap C_{k,m}).$$
Fix $x, z \in \mathbf{R}$, and let $A = \{X + Y < x\} = \{W < x\}$ and $D = \{Z < z\}$.
(a) Prove that $\{A_k\} \nearrow A$.
(b) Prove that $A_k$ and $D$ are independent.
(c) By continuity of probabilities, prove that $A$ and $D$ are independent.
(d) Use this to prove that $W$ and $Z$ are independent.

Exercise 3.6.7. Let $(\Omega, \mathcal{F}, P)$ be the uniform distribution on $\Omega = \{1,2,3\}$, as in Example 2.2.2. Give an example of a sequence $A_1, A_2, \ldots \in \mathcal{F}$ such that
$$P\big(\liminf_n A_n\big) < \liminf_n P(A_n) < \limsup_n P(A_n) < P\big(\limsup_n A_n\big),$$
i.e. such that all three inequalities are strict.

Exercise 3.6.8. Let $\lambda$ be Lebesgue measure on $[0,1]$, and let $0 \le a \le b \le c \le d \le 1$ be arbitrary real numbers with $d \le b + c - a$. Give an example of a sequence $A_1, A_2, \ldots$ of intervals in $[0,1]$, such that $\lambda(\liminf_n A_n) = a$, $\liminf_n \lambda(A_n) = b$, $\limsup_n \lambda(A_n) = c$, and $\lambda(\limsup_n A_n) = d$. For bonus points, solve the question when $d > b + c - a$, with each $A_n$ a finite union of intervals.

Exercise 3.6.9. Let $A_1, A_2, \ldots, B_1, B_2, \ldots$ be events.
(a) Prove that
$$\big(\limsup_n A_n\big) \cap \big(\limsup_n B_n\big) \supseteq \limsup_n (A_n \cap B_n).$$
(b) Give an example where the above inclusion is strict, and another example where it holds with equality.
Exercise 3.6.10. Let $A_1, A_2, \ldots$ be a sequence of events, and let $N \in \mathbf{N}$. Suppose there are events $B$ and $C$ such that $B \subseteq A_n \subseteq C$ for all $n \ge N$, and such that $P(B) = P(C)$. Prove that $P(\liminf_n A_n) = P(\limsup_n A_n) = P(B) = P(C)$.

Exercise 3.6.11. Let $\{X_n\}_{n=1}^\infty$ be independent random variables, with $X_n \sim \text{Uniform}(\{1, 2, \ldots, n\})$ (cf. Example 2.2.2). Compute $P(X_n = 5 \text{ i.o.})$, the probability that an infinite number of the $X_n$ are equal to 5.

Exercise 3.6.12. Let $X$ be a random variable with $P(X > 0) > 0$. Prove that there is $\delta > 0$ such that $P(X \ge \delta) > 0$. [Hint: Don't forget continuity of probabilities.]

Exercise 3.6.13. Let $X_1, X_2, \ldots$ be defined jointly on some probability space $(\Omega, \mathcal{F}, P)$, with $E[X_i] = 0$ and $E[(X_i)^2] = 1$ for all $i$. Prove that $P(X_n \ge n \text{ i.o.}) = 0$.

Exercise 3.6.14. Let $\delta, \epsilon > 0$, and let $X_1, X_2, \ldots$ be a sequence of non-negative random variables such that $P(X_i \ge \delta) \ge \epsilon$ for all $i$. Prove that with probability one, $\sum_{i=1}^\infty X_i = \infty$.

Exercise 3.6.15. Let $A_1, A_2, \ldots$ be a sequence of events, such that (i) $A_{i_1}, A_{i_2}, \ldots, A_{i_k}$ are independent whenever $i_{j+1} > i_j + 2$ for $1 \le j \le k-1$, and (ii) $\sum_n P(A_n) = \infty$. Then the Borel-Cantelli Lemma does not directly apply. Still, prove that $P(\limsup_n A_n) = 1$.

Exercise 3.6.16. Consider infinite, independent, fair coin tossing as in Subsection 2.6, and let $H_n$ be the event that the $n$th coin is heads. Determine the following probabilities.
(a) $P(H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{n+9} \text{ i.o.})$.
(b) $P(H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{2n} \text{ i.o.})$.
(c) $P(H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{n+[2\log_2 n]} \text{ i.o.})$.
(d) Prove that $P(H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{n+[\log_2 n]} \text{ i.o.})$ must equal either 0 or 1.
(e) Determine $P(H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{n+[\log_2 n]} \text{ i.o.})$. [Hint: Find the right subsequence of indices.]

Exercise 3.6.17. Show that Lemma 3.5.2 is false if we require only that $P(B \cap B_n) = P(B)\, P(B_n)$ for each $n \in \mathbf{N}$, but do not require that the $\{B_n\}$ be independent of each other. [Hint: Don't forget Exercise 3.6.3(a).]

Exercise 3.6.18. Let $A_1, A_2, \ldots$ be any independent sequence of events, and let $S_x = \{\lim_{n\to\infty} \frac1n \sum_{i=1}^n \mathbf{1}_{A_i} \le x\}$. Prove that for each $x \in \mathbf{R}$ we have $P(S_x) = 0$ or 1.
Exercise 3.6.19. Let $A_1, A_2, \ldots$ be independent events. Let $Y$ be a random variable which is measurable with respect to $\sigma(A_n, A_{n+1}, \ldots)$ for each $n \in \mathbf{N}$. Prove that there is a real number $a$ such that $P(Y = a) = 1$. [Hint: Consider $P(Y \le x)$ for $x \in \mathbf{R}$; what values can it take?]

3.7. Section summary.

In this section, we defined random variables, which are functions on the state space. We also defined independence of events and of random variables. We derived the continuity property of probability measures. We defined limit events and proved the important Borel-Cantelli Lemma. We defined tail fields and proved the remarkable Kolmogorov Zero-One Law.
4. Expected values.

There is one more notion that is fundamental to all of probability theory, that of expected values. The general definition of expected value will be developed in this section.

4.1. Simple random variables.

Let $(\Omega, \mathcal{F}, P)$ be a probability triple, and let $X$ be a random variable defined on this triple. We begin with a definition.

Definition 4.1.1. A random variable $X$ is simple if $\text{range}(X)$ is finite, where $\text{range}(X) = \{X(\omega);\ \omega \in \Omega\}$.

That is, a random variable is simple if it takes on only a finite number of different values. If $X$ is a simple random variable, then listing the distinct elements of its range as $x_1, x_2, \ldots, x_n$, we can then write $X = \sum_{i=1}^n x_i \mathbf{1}_{A_i}$, where $A_i = \{\omega \in \Omega;\ X(\omega) = x_i\} = X^{-1}(\{x_i\})$, and where the $\mathbf{1}_{A_i}$ are indicator functions. We note that the sets $A_i$ form a finite partition of $\Omega$.

For such a simple random variable $X = \sum_{i=1}^n x_i \mathbf{1}_{A_i}$, we define its expected value or expectation or mean by $E(X) = \sum_{i=1}^n x_i P(A_i)$. That is,
$$E\Big(\sum_{i=1}^n x_i \mathbf{1}_{A_i}\Big) = \sum_{i=1}^n x_i P(A_i), \qquad \{A_i\} \text{ a finite partition of } \Omega. \qquad (4.1.2)$$
We sometimes write $\mu_X$ for $E(X)$.

Exercise 4.1.3. Prove that (4.1.2) is well-defined, in the sense that if $\{A_i\}$ and $\{B_j\}$ are two different finite partitions of $\Omega$, such that $\sum_{i=1}^n x_i \mathbf{1}_{A_i} = \sum_{j=1}^m y_j \mathbf{1}_{B_j}$, then $\sum_{i=1}^n x_i P(A_i) = \sum_{j=1}^m y_j P(B_j)$. [Hint: collect together those $A_i$ and $B_j$ corresponding to the same values of $x_i$ and $y_j$.]

For a quick example, let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$, and define a simple random variable $Y$ by
$$Y(\omega) = \begin{cases} 2, & \omega \text{ rational} \\ 4, & \omega = 1/\sqrt{2} \\ 6, & \text{other } \omega < 1/4 \\ 8, & \text{otherwise.} \end{cases}$$
Since the first two cases each have probability 0, it is easily seen that $E(Y) = 6 \cdot \frac14 + 8 \cdot \frac34 = 15/2$.

From equation (4.1.2), we see immediately that $E(\mathbf{1}_A) = P(A)$, and that $E(c) = c$ for a constant $c$.
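Definition (4.1.2) is mechanical enough to code directly. Here is a small sketch (my own addition; the helper `expectation` is hypothetical, not from the text) that represents a simple random variable by its (value, probability) partition and evaluates the sum, using the example $Y$ above:

```python
def expectation(partition):
    """E(X) = sum of x_i * P(A_i) over a finite partition, as in (4.1.2)."""
    assert abs(sum(p for _, p in partition) - 1.0) < 1e-12
    return sum(x * p for x, p in partition)

# The simple random variable Y above: the rationals and {1/sqrt(2)} have
# Lebesgue measure zero, the remaining omega < 1/4 have measure 1/4, and
# the rest have measure 3/4.
Y = [(2, 0.0), (4, 0.0), (6, 0.25), (8, 0.75)]
print(expectation(Y))    # 7.5 = 15/2
```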
We now claim that $E(\cdot)$ is linear. Indeed, if $X = \sum_i x_i \mathbf{1}_{A_i}$ and $Y = \sum_j y_j \mathbf{1}_{B_j}$, where $\{A_i\}$ and $\{B_j\}$ are finite partitions of $\Omega$, and if $a, b \in \mathbf{R}$, then $\{A_i \cap B_j\}$ is again a finite partition of $\Omega$, and we have
$$E(aX + bY) = E\Big(\sum_{i,j} (a x_i + b y_j)\, \mathbf{1}_{A_i \cap B_j}\Big) = \sum_{i,j} (a x_i + b y_j)\, P(A_i \cap B_j) = a\,E(X) + b\,E(Y),$$
as claimed. It follows that $E\big(\sum_{i=1}^n x_i \mathbf{1}_{A_i}\big) = \sum_{i=1}^n x_i P(A_i)$ for any finite collection of subsets $A_i \subseteq \Omega$, even if they do not form a partition.

It also follows that $E(\cdot)$ is order-preserving, i.e. if $X \le Y$ (meaning that $X(\omega) \le Y(\omega)$ for all $\omega \in \Omega$), then $E(X) \le E(Y)$. Indeed, in that case $Y - X \ge 0$, so from (4.1.2) we have $E(Y - X) \ge 0$; by linearity this implies that $E(Y) - E(X) \ge 0$. In particular, since $-|X| \le X \le |X|$, we have $|E(X)| \le E(|X|)$, which is sometimes referred to as the (generalised) triangle inequality. If $X$ takes on the two values $a$ and $b$, each with probability $\frac12$, then this inequality reduces to the usual $|a + b| \le |a| + |b|$.

Finally, if $X$ and $Y$ are independent simple random variables, then $E(XY) = E(X)\,E(Y)$. Indeed, again writing $X = \sum_i x_i \mathbf{1}_{A_i}$ and $Y = \sum_j y_j \mathbf{1}_{B_j}$, where $\{A_i\}$ and $\{B_j\}$ are finite partitions of $\Omega$ and where the $\{x_i\}$ are distinct and the $\{y_j\}$ are distinct, we see that $X$ and $Y$ are independent if and only if $P(A_i \cap B_j) = P(A_i)\, P(B_j)$ for all $i$ and $j$. In that case,
$$E(XY) = \sum_{i,j} x_i y_j\, P(A_i \cap B_j) = \sum_{i,j} x_i y_j\, P(A_i)\, P(B_j) = E(X)\,E(Y),$$
as claimed. Note that this may be false if $X$ and $Y$ are not independent; for example, if $X$ takes on the values $\pm 1$, each with probability $\frac12$, and if $Y = X$, then $E(X) = E(Y) = 0$ but $E(XY) = 1$. Also, we may have $E(XY) = E(X)\,E(Y)$ even if $X$ and $Y$ are not independent; for example, this occurs if $X$ takes on the three values 0, 1, and 2 each with probability $\frac13$, and if $Y$ is defined by $Y(\omega) = 1$ whenever $X(\omega) = 0$ or 2, and $Y(\omega) = 5$ whenever $X(\omega) = 1$.

If $X = \sum_{i=1}^n x_i \mathbf{1}_{A_i}$, with $\{A_i\}$ a finite partition of $\Omega$, and if $f : \mathbf{R} \to \mathbf{R}$ is any function, then $f(X) = \sum_{i=1}^n f(x_i) \mathbf{1}_{A_i}$ is also a simple random variable, with $E(f(X)) = \sum_{i=1}^n f(x_i) P(A_i)$. In particular, if $f(x) = (x - \mu_X)^2$, we get the variance of $X$, defined by $\text{Var}(X) = E\big((X - \mu_X)^2\big)$. Clearly $\text{Var}(X) \ge 0$. Expanding the square and using linearity, we see that $\text{Var}(X) = E(X^2) - \mu_X^2 = E(X^2) - E(X)^2$. In particular, we always have
$$\text{Var}(X) \le E(X^2). \qquad (4.1.4)$$
It also follows immediately that
$$\text{Var}(\alpha X + \beta) = \alpha^2\, \text{Var}(X). \qquad (4.1.5)$$
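Returning to the product rule above, here is a quick enumeration sketch (my own, not from the text) contrasting the independent case with the dependent counterexample $Y = X$:

```python
from itertools import product

# Two independent fair +/-1 variables: E(XY) = E(X) E(Y) = 0.
outcomes = [(x, y, 0.25) for x, y in product((-1, 1), repeat=2)]
E_X  = sum(x * p for x, y, p in outcomes)
E_Y  = sum(y * p for x, y, p in outcomes)
E_XY = sum(x * y * p for x, y, p in outcomes)
print(E_XY, E_X * E_Y)                       # 0.0 0.0

# The dependent case Y = X: now E(XY) = E(X^2) = 1, although E(X) E(Y) = 0.
dep = [(x, x, 0.5) for x in (-1, 1)]
print(sum(x * y * p for x, y, p in dep))     # 1.0
```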
We also see that $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$, where $\text{Cov}(X, Y) = E\big((X - \mu_X)(Y - \mu_Y)\big) = E(XY) - E(X)E(Y)$ is the covariance; in particular, if $X$ and $Y$ are independent then $\text{Cov}(X, Y) = 0$, so that $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$. More generally, $\text{Var}\big(\sum_i X_i\big) = \sum_i \text{Var}(X_i) + 2\sum_{i<j} \text{Cov}(X_i, X_j)$, so we see that
$$\text{Var}(X_1 + \ldots + X_n) = \text{Var}(X_1) + \ldots + \text{Var}(X_n), \qquad \{X_n\} \text{ independent}. \qquad (4.1.6)$$
Finally, if $\text{Var}(X) > 0$ and $\text{Var}(Y) > 0$, then the correlation between $X$ and $Y$ is defined by $\text{Corr}(X, Y) = \text{Cov}(X, Y)\big/\sqrt{\text{Var}(X)\,\text{Var}(Y)}$; see Exercises 4.5.11 and 5.5.6.

This concludes our discussion of the basic properties of expectation for simple random variables. (Indeed, it is possible to read Section 5 immediately at this point, provided that one restricts attention to simple random variables only.) We now note a fact that will help us to define $E(X)$ for random variables $X$ which are not simple. It follows immediately from the order-preserving property of $E(\cdot)$.

Proposition 4.1.7. If $X$ is a simple random variable, then
$$E(X) = \sup\{E(Y);\ Y \text{ simple},\ Y \le X\}.$$

4.2. General non-negative random variables.

If $X$ is not simple, then it is not clear how to define its expected value $E(X)$. However, Proposition 4.1.7 provides a suggestion of how to proceed. Indeed, for a general non-negative random variable $X$, we define the expected value $E(X)$ by
$$E(X) = \sup\{E(Y);\ Y \text{ simple},\ Y \le X\}.$$
By Proposition 4.1.7, if $X$ happens to be a simple random variable then this definition agrees with the previous one, so there is no confusion in reusing the same symbol $E(\cdot)$. Indeed, this one single definition (with a minor modification for negative values in Subsection 4.3 below) will apply to all random variables, be they discrete, absolutely continuous, or neither (cf. Section 6).

We note that it is indeed possible that $E(X)$ will be infinite. For example, suppose $(\Omega, \mathcal{F}, P)$ is Lebesgue measure on $[0,1]$, and define $X$ by $X(\omega) = 2^n$ for $2^{-n} \le \omega < 2^{-(n-1)}$. (See Figure 4.2.1.) Then $E(X) \ge \sum_{i=1}^N 2^i\, 2^{-i} = N$ for any $N \in \mathbf{N}$. Hence, $E(X) = \infty$.

Recall that for $k \in \mathbf{N}$, the $k$th moment of a random variable $X$ is defined to be $E(X^k)$; it is finite if $E(|X|^k) < \infty$.
[Figure 4.2.1: A random variable $X$ having $E(X) = \infty$.]

Since $|x|^{k-1} \le \max(|x|^k, 1) \le |x|^k + 1$ for any $x \in \mathbf{R}$, we see that if $E(|X|^k)$ is finite, then so is $E(|X|^{k-1})$.

It is immediately apparent that our general definition of $E(\cdot)$ is still order-preserving. However, proving linearity is less clear. To assist, we have the following result. We say that $\{X_n\} \nearrow X$ if $X_1 \le X_2 \le \ldots$, and also $\lim_{n\to\infty} X_n(\omega) = X(\omega)$ for each $\omega \in \Omega$. (That is, the sequence $\{X_n\}$ converges monotonically to $X$.)

Theorem 4.2.2. (The monotone convergence theorem.) Suppose $X_1, X_2, \ldots$ are random variables with $E(X_1) > -\infty$, and $\{X_n\} \nearrow X$. Then $X$ is a random variable, and $\lim_{n\to\infty} E(X_n) = E(X)$.

Proof. We know from (3.1.6) that $X$ is a random variable (alternatively, simply note that in this case, $\{X \le x\} = \bigcap_n \{X_n \le x\} \in \mathcal{F}$ for all $x \in \mathbf{R}$). Furthermore, by monotonicity we have $E(X_1) \le E(X_2) \le \ldots \le E(X)$, so that $\lim_n E(X_n)$ exists (though it may be infinite if $E(X) = +\infty$), and is $\le E(X)$.

To finish, it suffices to show that $\lim_n E(X_n) \ge E(X)$. If $E(X_1) = +\infty$ this is trivial, so assume $E(X_1)$ is finite. Then, by replacing $X_n$ by $X_n - X_1$ and $X$ by $X - X_1$, it suffices to assume the $X_n$ and $X$ are non-negative.
By the definition of $E(X)$ for non-negative $X$, it suffices to show that $\lim_n E(X_n) \ge E(Y)$ for any simple random variable $Y \le X$. Writing $Y = \sum_i v_i \mathbf{1}_{A_i}$, we see that it suffices to prove that $\lim_n E(X_n) \ge \sum_i v_i P(A_i)$, where $\{A_i\}$ is any finite partition of $\Omega$, with $v_i \le X(\omega)$ for all $\omega \in A_i$.

To that end, choose $\epsilon > 0$, and set $A_{in} = \{\omega \in A_i;\ X_n(\omega) \ge v_i - \epsilon\}$. Then $\{A_{in}\} \nearrow A_i$ as $n \to \infty$. Furthermore, $E(X_n) \ge \sum_i (v_i - \epsilon) P(A_{in})$. As $n \to \infty$, by continuity of probabilities this converges to $\sum_i (v_i - \epsilon) P(A_i) = \sum_i v_i P(A_i) - \epsilon$. Hence, $\lim_n E(X_n) \ge \sum_i v_i P(A_i) - \epsilon$. Since this is true for any $\epsilon > 0$, we must (cf. Proposition A.3.1) have $\lim_n E(X_n) \ge \sum_i v_i P(A_i)$, as required. ∎

Remark 4.2.3. Since expected values are unchanged if we modify the random variable's values on sets of probability 0, we still have $\lim_{n\to\infty} E(X_n) = E(X)$ provided $\{X_n\} \nearrow X$ almost surely (a.s.), i.e. on a subset of $\Omega$ having probability 1. (Compare Remark 3.1.9.)

We note that the monotonicity assumption of Theorem 4.2.2 is indeed necessary. For example, if $(\Omega, \mathcal{F}, P)$ is Lebesgue measure on $[0,1]$, and if $X_n = n\,\mathbf{1}_{(0, 1/n)}$, then $X_n \to 0$ (since for each $\omega \in [0,1]$ we have $X_n(\omega) = 0$ for all $n \ge 1/\omega$), but $E(X_n) = 1$ for all $n$.

To make use of Theorem 4.2.2, set $\Psi_n(x) = \min(n, 2^{-n} \lfloor 2^n x \rfloor)$ for $x \ge 0$, where $\lfloor r \rfloor$ is the floor of $r$, or greatest integer not exceeding $r$. (See Figure 4.2.4.) Then $\Psi_n(x)$ is a slightly rounded-down version of $x$, truncated at $n$. Indeed, for fixed $x \ge 0$ we have that $\Psi_n(x) \ge 0$, and $\{\Psi_n(x)\} \nearrow x$ as $n \to \infty$. Furthermore, the range of $\Psi_n$ is finite (of size $n 2^n + 1$).

[Figure 4.2.4: A graph of $y = \Psi_2(x)$, lying just below the line $y = x$.]

Hence, this shows:

Proposition 4.2.5. Let $X$ be a general non-negative random variable. Set $X_n = \Psi_n(X)$ with $\Psi_n$ as above. Then $X_n \ge 0$ and $\{X_n\} \nearrow X$, and each $X_n$ is a simple random variable. In particular, there exists a sequence of simple random variables increasing to $X$.

Using Theorem 4.2.2 and Proposition 4.2.5, it is straightforward to prove the linearity of expected values for general non-negative random variables. Indeed, with $X_n = \Psi_n(X)$ and $Y_n = \Psi_n(Y)$, we have (using linearity of expectation for simple random variables) that for $X, Y \ge 0$ and $a, b \ge 0$,
$$E(aX + bY) = \lim_n E(aX_n + bY_n) = \lim_n \big(a\,E(X_n) + b\,E(Y_n)\big) = a\,E(X) + b\,E(Y). \qquad (4.2.6)$$
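The function $\Psi_n$ is easy to implement, and one can watch $E(\Psi_n(X))$ climb toward $E(X)$. Here is a quick sketch (mine, not the book's) using $X(\omega) = \omega^2$ on $[0,1]$, whose expected value is $1/3$:

```python
import math
import random

def psi(n, x):
    """Psi_n(x) = min(n, 2^-n * floor(2^n x)): round x down to a dyadic grid, cap at n."""
    return min(n, math.floor(2**n * x) / 2**n)

random.seed(6)
sample = [random.random() ** 2 for _ in range(200_000)]   # X = omega^2, E(X) = 1/3

for n in (1, 2, 4, 8):
    approx = sum(psi(n, x) for x in sample) / len(sample)
    print(n, approx)    # increases toward 1/3 as n grows
```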
Similarly, if $X$ and $Y$ are independent, then by Proposition 3.2.3, $X_n$ and $Y_n$ are also independent. Hence, if $E(X)$ and $E(Y)$ are finite,
$$E(XY) = \lim_n E(X_n Y_n) = \lim_n E(X_n)\, E(Y_n) = E(X)\, E(Y), \qquad (4.2.7)$$
as was the case for simple random variables. It then follows that $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$ exactly as in the previous case.

We also have countable linearity. Indeed, if $X_1, X_2, \ldots \ge 0$, then by the monotone convergence theorem,
$$E(X_1 + X_2 + \ldots) = E\Big(\lim_n (X_1 + X_2 + \ldots + X_n)\Big) = \lim_n E(X_1 + X_2 + \ldots + X_n)$$
$$= \lim_n \big[E(X_1) + E(X_2) + \ldots + E(X_n)\big] = E(X_1) + E(X_2) + \ldots. \qquad (4.2.8)$$
If the $X_i$ are not necessarily non-negative, then (4.2.8) may fail, though it still holds under certain conditions; see Exercise 4.5.14 (and Corollary 9.4.4).

We also have the following.

Proposition 4.2.9. If $X$ is a non-negative random variable, then $\sum_{k=1}^\infty P(X \ge k) = E\lfloor X \rfloor$, where $\lfloor X \rfloor$ is the greatest integer not exceeding $X$. (In particular, if $X$ is non-negative-integer valued, then $\sum_{k=1}^\infty P(X \ge k) = E(X)$.)

Proof. We compute that
$$\sum_{k=1}^\infty P(X \ge k) = \sum_{k=1}^\infty \big[P(k \le X < k+1) + P(k+1 \le X < k+2) + \ldots\big]$$
$$= \sum_{\ell=1}^\infty \ell\, P(\ell \le X < \ell+1) = \sum_{\ell=1}^\infty \ell\, P(\lfloor X \rfloor = \ell) = E\lfloor X \rfloor. \quad ∎$$
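Proposition 4.2.9 can be verified exactly on a small discrete example (a sketch of my own, not from the text): for a fair die roll, both the tail sum and the mean equal $7/2$.

```python
from fractions import Fraction

# X = result of one fair die roll, so P(X >= k) = (7 - k)/6 for k = 1..6.
tail_sum = sum(Fraction(7 - k, 6) for k in range(1, 7))
mean     = sum(Fraction(x, 6) for x in range(1, 7))
print(tail_sum, mean)    # both 7/2, as Proposition 4.2.9 predicts
```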
4.3. Arbitrary random variables.

Finally, we consider random variables which may be neither simple nor non-negative. For such a random variable $X$, we may write $X = X^+ - X^-$, where $X^+(\omega) = \max(X(\omega), 0)$ and $X^-(\omega) = \max(-X(\omega), 0)$. Both $X^+$ and $X^-$ are non-negative, so the theory of the previous subsection applies to them. We may then set
$$E(X) = E(X^+) - E(X^-). \qquad (4.3.1)$$

We note that $E(X)$ is undefined if both $E(X^+)$ and $E(X^-)$ are infinite. However, if $E(X^+) = \infty$ and $E(X^-) < \infty$, then we take $E(X) = \infty$. Similarly, if $E(X^+) < \infty$ and $E(X^-) = \infty$, then we take $E(X) = -\infty$. Obviously, if $E(X^+) < \infty$ and $E(X^-) < \infty$, then $E(X)$ will be a finite number.

We next check that, with this modification, expected value retains the basic properties of being order-preserving, linear, etc.:

Exercise 4.3.2. Let $X$ and $Y$ be two general random variables (not necessarily non-negative) with well-defined means, such that $X \le Y$.
(a) Prove that $X^+ \le Y^+$ and $X^- \ge Y^-$.
(b) Prove that expectation is still order-preserving, i.e. that $E(X) \le E(Y)$ under these assumptions.

Exercise 4.3.3. Let $X$ and $Y$ be two general random variables with finite means, and let $Z = X + Y$.
(a) Express $Z^+$ and $Z^-$ in terms of $X^+$, $X^-$, $Y^+$, and $Y^-$.
(b) Prove that $E(Z) = E(X) + E(Y)$, i.e. that $E(Z^+) - E(Z^-) = E(X^+) - E(X^-) + E(Y^+) - E(Y^-)$. [Hint: Re-arrange the relations of part (a) so that you can make use of (4.2.6).]
(c) Prove that expectation is still (finitely) linear, for general random variables with finite means.

Exercise 4.3.4. Let $X$ and $Y$ be two independent general random variables with finite means, and let $Z = XY$.
(a) Prove that $X^+$ and $Y^+$ are independent, and similarly for each of $X^+$ and $Y^-$, and $X^-$ and $Y^+$, and $X^-$ and $Y^-$.
(b) Express $Z^+$ and $Z^-$ in terms of $X^+$, $X^-$, $Y^+$, and $Y^-$.
(c) Prove that $E(XY) = E(X)\, E(Y)$.

4.4. The integration connection.

Given a probability triple $(\Omega, \mathcal{F}, P)$, we sometimes write $E(X)$ as $\int_\Omega X\, dP$ or $\int_\Omega X(\omega)\, P(d\omega)$ (the "$\Omega$" is sometimes omitted). We call this the integral of the (measurable) function $X$ with respect to the (probability) measure P. Why do we make this identification? Well, certainly expected value satisfies some similar properties to that of the integral: it is linear, order-preserving, etc. But a more convincing reason is given by

Theorem 4.4.1. Let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$. Let $X : [0,1] \to \mathbf{R}$ be a bounded function which is Riemann integrable (i.e. integrable in the usual calculus sense). Then $X$ is a random variable with respect to $(\Omega, \mathcal{F}, P)$, and $E(X) = \int_0^1 X(t)\, dt$. In words, the expected value of the random variable $X$ is equal to the calculus-style integral of the function $X$.

Proof. Recall the definitions of lower and upper integrals, viz.
$$L\int_0^1 X = \sup\Big\{\sum_{i=1}^n (t_i - t_{i-1}) \inf_{t_{i-1} \le t \le t_i} X(t);\ 0 = t_0 < t_1 < \ldots < t_n = 1\Big\};$$
$$U\int_0^1 X = \inf\Big\{\sum_{i=1}^n (t_i - t_{i-1}) \sup_{t_{i-1} \le t \le t_i} X(t);\ 0 = t_0 < t_1 < \ldots < t_n = 1\Big\}.$$
Recall that we always have $L\int_0^1 X \le U\int_0^1 X$, and that $X$ is Riemann integrable if $L\int_0^1 X = U\int_0^1 X$, in which case $\int_0^1 X(t)\, dt$ is defined to be this common value.

But $\sum_{i=1}^n (t_i - t_{i-1}) \inf_{t_{i-1} \le t \le t_i} X(t) = E(Y)$, where the simple random variable $Y$ is defined by $Y(\omega) = \inf_{t_{i-1} \le t \le t_i} X(t)$ whenever $t_{i-1} \le \omega < t_i$. Hence, since $X$ is Riemann integrable, for each $n \in \mathbf{N}$ we can find a simple
random variable $Y_n \le X$ with $E(Y_n) \ge L\int_0^1 X(t)\, dt - \frac1n$. Similarly we can find $Z_n \ge X$ with $E(Z_n) \le U\int_0^1 X(t)\, dt + \frac1n$. Let $A_n = \max(Y_1, \ldots, Y_n)$, and $B_n = \min(Z_1, \ldots, Z_n)$. We claim that $\{A_n\} \nearrow X$ a.s. Indeed, $\{A_n\}$ is increasing by construction. Furthermore, $A_n \le X \le B_n$, and
$$\lim_{n\to\infty} E(A_n) = \lim_{n\to\infty} E(B_n) = \int_0^1 X(t)\, dt.$$
Hence, if $S_k = \{\omega \in \Omega : \lim_{n\to\infty} (B_n(\omega) - A_n(\omega)) \ge 1/k\}$, then $0 = \lim_{n\to\infty} E(B_n - A_n) \ge P(S_k)/k$, so $P(S_k) = 0$ for all $k \in \mathbf{N}$. Then $P[\lim_{n\to\infty} A_n < X] \le P[\lim_{n\to\infty} A_n < \lim_{n\to\infty} B_n] \le P[\bigcup_k S_k] = 0$ by countable subadditivity, proving the claim. Hence, by Theorem 4.2.2 and Remarks 4.2.3 and 3.1.9, $X$ must be a random variable, with $E(X) = \lim_{n\to\infty} E(A_n) = \int_0^1 X(t)\, dt$. ∎

On the other hand, there are many functions $X$ which are not Riemann integrable, but which nevertheless are random variables with respect to Lebesgue measure, and thus have well-defined expected values. For example, if $X$ is defined by $X(t) = 1$ for $t$ irrational, but $X(t) = 0$ for $t$ rational, i.e. $X = \mathbf{1}_{\mathbf{Q}^c}$ where $\mathbf{Q}^c$ is the set of irrational numbers, then $X$ is not Riemann-integrable, but we still have $E(X) = \int X\, dP = 1$ being perfectly well-defined.

We sometimes call the expected value of $X$, with respect to the Lebesgue measure probability triple, its Lebesgue integral, and if $E|X| < \infty$ we say that $X$ is Lebesgue integrable. By Theorem 4.4.1, the Lebesgue integral is a generalisation of the Riemann integral. Hence, in addition to learning many things about probability theory, we are also learning more about integration at the same time!

We also note that we do not need to restrict attention to integrals over the unit interval $[0,1]$. Indeed, if $X : \mathbf{R} \to [0, \infty)$ is Borel-measurable, then we can define $\int_{\mathbf{R}} X\, d\lambda$, where $\lambda$ stands for Lebesgue measure on the entire real line $\mathbf{R}$, by
$$\int_{\mathbf{R}} X(t)\, \lambda(dt) = \sum_{n \in \mathbf{Z}} \int_0^1 X(n + t)\, P(dt), \qquad (4.4.2)$$
or equivalently
$$\int_{\mathbf{R}} X(t)\, \lambda(dt) = \sum_{n \in \mathbf{Z}} E(Y_n),$$
where $Y_n(t) = X(n + t)$ and the expectation is with respect to Lebesgue measure on $[0,1]$. That is, we can integrate a non-negative function over the entire real line by adding up its integral over each interval $[n, n+1)$. (For general $X$, we can then write $X = X^+ - X^-$, as usual.) In other words, we can represent $\lambda$, Lebesgue measure on $\mathbf{R}$, as a sum of countably many copies of Lebesgue measure on unit intervals.
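Equation (4.4.2) translates directly into code. Here is a rough Monte Carlo sketch (my own addition; the truncation range and sample counts are arbitrary choices of mine) integrating $X(t) = e^{-|t|}$ over $\mathbf{R}$, whose exact integral is 2:

```python
import math
import random

random.seed(7)

def integral_over_R(f, n_range=20, samples=5_000):
    """Approximate (4.4.2): sum over n of the integral of f(n + t) dt on [0,1)."""
    total = 0.0
    for n in range(-n_range, n_range):
        total += sum(f(n + random.random()) for _ in range(samples)) / samples
    return total

print(integral_over_R(lambda t: math.exp(-abs(t))))   # approximately 2.0
```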
We shall see in Section 6 that we may use Lebesgue measure on $\mathbf{R}$ to help us define the distributions of general absolutely-continuous random variables.

Remark 4.4.3. Measures like $\lambda$ above, which are not finite but which can be written as the countable sum of finite measures, are called $\sigma$-finite measures. They are not probability measures, and cannot be used to define probability spaces; however by countable additivity they still satisfy many properties that probability measures do. Of course, unlike in the probability measure case, $\int_{\mathbf{R}} X\, d\lambda$ may be infinite or undefined even if $X$ is bounded.

4.5. Exercises.

Exercise 4.5.1. Let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$, and set
$$X(\omega) = \begin{cases} 1, & 0 \le \omega < 1/4 \\ 2\omega^2, & 1/4 \le \omega < 3/4 \\ \omega^2, & 3/4 \le \omega \le 1. \end{cases}$$
Compute $P(X \in A)$ where
(a) $A = [0,1]$.
(b) $A = [\frac12, 1]$.

Exercise 4.5.2. Let $X$ be a random variable with finite mean, and let $a \in \mathbf{R}$ be any real number. Prove that $E(\max(X, a)) \ge \max(E(X), a)$. [Hint: Consider separately the cases $E(X) \ge a$ and $E(X) < a$.] (See also Exercise 5.5.7.)

Exercise 4.5.3. Give an example of random variables $X$ and $Y$ defined on Lebesgue measure on $[0,1]$, such that $P(X > Y) > \frac12$, but $E(X) < E(Y)$.

Exercise 4.5.4. Let $(\Omega, \mathcal{F}, P)$ be the uniform distribution on $\Omega = \{1,2,3\}$, as in Example 2.2.2. Find random variables $X$, $Y$, and $Z$ on $(\Omega, \mathcal{F}, P)$ such that $P(X > Y)\, P(Y > Z)\, P(Z > X) > 0$, and $E(X) = E(Y) = E(Z)$.

Exercise 4.5.5. Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$, and suppose that $\Omega$ is a finite set. Prove that $X$ is a simple random variable.

Exercise 4.5.6. Let $X$ be a random variable defined on Lebesgue measure on $[0,1]$, and suppose that $X$ is a one-to-one function, i.e. that if $\omega_1 \ne \omega_2$ then $X(\omega_1) \ne X(\omega_2)$. Prove that $X$ is not a simple random variable.
Exercise 4.5.7. (Principle of inclusion-exclusion, general case.) Let $A_1, A_2, \ldots, A_n \in \mathcal{F}$. Generalise the principle of inclusion-exclusion to:
$$P(A_1 \cup \ldots \cup A_n) = \sum_{i=1}^n P(A_i) - \sum_{1 \le i < j \le n} P(A_i \cap A_j) + \sum_{1 \le i < j < k \le n} P(A_i \cap A_j \cap A_k) - \ldots \pm P(A_1 \cap \ldots \cap A_n).$$
[Hint: Expand $1 - \prod_{i=1}^n (1 - \mathbf{1}_{A_i})$, and take expectations of both sides.]

Exercise 4.5.8. Let $f(x) = ax^2 + bx + c$ be a second-degree polynomial function (where $a, b, c \in \mathbf{R}$ are constants).
(a) Find necessary and sufficient conditions on $a$, $b$, and $c$ such that the equation $E(f(\alpha X)) = \alpha^2 E(f(X))$ holds for all $\alpha \in \mathbf{R}$ and all random variables $X$.
(b) Find necessary and sufficient conditions on $a$, $b$, and $c$ such that the equation $E(f(X - \beta)) = E(f(X))$ holds for all $\beta \in \mathbf{R}$ and all random variables $X$.
(c) Do parts (a) and (b) account for the properties of the variance function? Why or why not?

Exercise 4.5.9. In proving property (4.1.6) of variance, why did we not simply proceed by induction on $n$? That is, suppose we know that $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$ whenever $X$ and $Y$ are independent. Does it follow easily that $\text{Var}(X + Y + Z) = \text{Var}(X) + \text{Var}(Y) + \text{Var}(Z)$ whenever $X$, $Y$, and $Z$ are independent? Why or why not? How does Exercise 3.6.6 fit in?

Exercise 4.5.10. Let $X_1, X_2, \ldots$ be i.i.d. with mean $\mu$ and variance $\sigma^2$, and let $N$ be an integer-valued random variable with mean $m$ and variance $v$, with $N$ independent of all the $X_i$. Let $S = X_1 + \ldots + X_N = \sum_{i=1}^\infty X_i \mathbf{1}_{N \ge i}$. Compute $\text{Var}(S)$ in terms of $\mu$, $\sigma^2$, $m$, and $v$.

Exercise 4.5.11. Let $X$ and $Z$ be independent, each with the standard normal distribution, let $a, b \in \mathbf{R}$ (not both 0), and let $Y = aX + bZ$.
(a) Compute $\text{Corr}(X, Y)$.
(b) Show that $|\text{Corr}(X, Y)| \le 1$ in this case. (Compare Exercise 5.5.6.)
(c) Give necessary and sufficient conditions on the values of $a$ and $b$ such that $\text{Corr}(X, Y) = 1$.
(d) Give necessary and sufficient conditions on the values of $a$ and $b$ such that $\text{Corr}(X, Y) = -1$.
Exercise 4.5.12. Let $X$ and $Y$ be independent general non-negative random variables, and let $X_n = \Psi_n(X)$, where $\Psi_n(x) = \min(n, 2^{-n}\lfloor 2^n x \rfloor)$ as in Proposition 4.2.5.
(a) Give an example of a sequence of functions $\Phi_n : [0,\infty) \to [0,\infty)$, other than $\Phi_n(x) = \Psi_n(x)$, such that for all $x$, $0 \le \Phi_n(x) \le x$ and $\{\Phi_n(x)\} \nearrow x$ as $n \to \infty$.
(b) Suppose $Y_n = \Phi_n(Y)$ with $\Phi_n$ as in part (a). Must $X_n$ and $Y_n$ be independent?
(c) Suppose $\{Y_n\}$ is an arbitrary collection of non-negative simple random variables such that $\{Y_n\} \nearrow Y$. Must $X_n$ and $Y_n$ be independent?
(d) Under the assumption of part (c), determine (with proof) which quantities in equation (4.2.7) are necessarily equal.

Exercise 4.5.13. Give examples of a random variable $X$ defined on Lebesgue measure on $[0,1]$, such that
(a) $E(X^+) = \infty$ and $0 < E(X^-) < \infty$.
(b) $E(X^-) = \infty$ and $0 < E(X^+) < \infty$.
(c) $E(X^+) = E(X^-) = \infty$.
(d) $E(X) < \infty$ but $E(X^2) = \infty$.

Exercise 4.5.14. Let $Z_1, Z_2, \ldots$ be general random variables with $E|Z_i| < \infty$, and let $Z = Z_1 + Z_2 + \ldots$.
(a) Suppose $\sum_i E(Z_i^+) < \infty$ and $\sum_i E(Z_i^-) < \infty$. Prove that $E(Z) = \sum_i E(Z_i)$.
(b) Show that we still have $E(Z) = \sum_i E(Z_i)$ if we have at least one of $\sum_i E(Z_i^+) < \infty$ or $\sum_i E(Z_i^-) < \infty$.
(c) Let $\{Z_i\}$ be independent, with $P(Z_i = +1) = P(Z_i = -1) = \frac12$ for each $i$. Does $E(Z) = \sum_i E(Z_i)$ in this case? How does that relate to (4.2.8)?

Exercise 4.5.15. Let $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$ be two probability triples. Let $A_1, A_2, \ldots \in \mathcal{F}_1$, and $B_1, B_2, \ldots \in \mathcal{F}_2$. Suppose that it happens that the sets $\{A_n \times B_n\}$ are all disjoint, and furthermore that $\bigcup_{n=1}^\infty (A_n \times B_n) = A \times B$ for some $A \in \mathcal{F}_1$ and $B \in \mathcal{F}_2$.
(a) Prove that for each $\omega \in \Omega_1$, we have
$$\mathbf{1}_A(\omega)\, P_2(B) = \sum_{n=1}^\infty \mathbf{1}_{A_n}(\omega)\, P_2(B_n).$$
[Hint: This is essentially countable additivity of $P_2$, but you do need to be careful about disjointness.]
(b) By taking expectations of both sides with respect to $P_1$ and using countable additivity of $P_1$, prove that
$$P_1(A)\, P_2(B) = \sum_{n=1}^\infty P_1(A_n)\, P_2(B_n).$$
(c) Use this result to prove that the $\mathcal{J}$ and P for product measure, presented in Subsection 2.6, do indeed satisfy (2.5.5).

4.6. Section summary.

In this section we defined the expected value $E(X)$ of a random variable $X$, first for simple random variables and then for general random variables. We proved basic properties such as linearity and order-preservation. We defined the variance $\text{Var}(X)$. If $X$ and $Y$ are independent, then $E(XY) = E(X)E(Y)$ and $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$. We also proved the monotone convergence theorem. Finally, we connected expected value to Riemann (calculus-style) integration.
5. Inequalities and convergence.

In this section we consider various relationships regarding expected values and limits.

5.1. Various inequalities.

We begin with two very important inequalities about random variables.

Proposition 5.1.1. (Markov's inequality.) If $X$ is a non-negative random variable, then for all $\alpha > 0$,
$$P(X \ge \alpha) \le E(X)/\alpha.$$
In words, the probability that $X$ exceeds $\alpha$ is bounded above by its mean divided by $\alpha$.

Proof. Define a new random variable $Z$ by
$$Z(\omega) = \begin{cases} \alpha, & X(\omega) \ge \alpha \\ 0, & X(\omega) < \alpha. \end{cases}$$
Then clearly $Z \le X$, so that $E(Z) \le E(X)$ by the order-preserving property. On the other hand, we compute that $E(Z) = \alpha\, P(X \ge \alpha)$. Hence, $\alpha\, P(X \ge \alpha) \le E(X)$. ∎

Markov's inequality applies only to non-negative random variables, but it immediately implies another inequality which holds more generally:

Proposition 5.1.2. (Chebychev's inequality.) Let $Y$ be an arbitrary random variable, with finite mean $\mu_Y$. Then for all $\alpha > 0$,
$$P(|Y - \mu_Y| \ge \alpha) \le \text{Var}(Y)/\alpha^2.$$
In words, the probability that $Y$ differs from its mean by more than $\alpha$ is bounded above by its variance divided by $\alpha^2$.

Proof. Set $X = (Y - \mu_Y)^2$. Then $X$ is a non-negative random variable. Thus, using Proposition 5.1.1, we have
$$P(|Y - \mu_Y| \ge \alpha) = P(X \ge \alpha^2) \le E(X)/\alpha^2 = \text{Var}(Y)/\alpha^2. \quad ∎$$

We shall use the above two inequalities extensively, including to prove the laws of large numbers presented below.
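Both bounds are easy to check numerically. Here is a quick sketch of my own (not from the text), using an exponential distribution with mean and variance 1, where both inequalities hold with plenty of room to spare:

```python
import random

random.seed(8)
sample = [random.expovariate(1.0) for _ in range(100_000)]   # mean 1, variance 1
n = len(sample)

alpha = 3.0
markov_lhs = sum(x >= alpha for x in sample) / n
print(markov_lhs, 1.0 / alpha)        # ~0.05 <= 0.333...: Markov's bound holds

mu = sum(sample) / n
cheb_lhs = sum(abs(x - mu) >= alpha for x in sample) / n
print(cheb_lhs, 1.0 / alpha**2)       # ~0.018 <= 0.111...: Chebychev's bound holds
```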
Two other sometimes-useful inequalities are as follows.

Proposition 5.1.3. (Cauchy-Schwarz inequality.) Let X and Y be random variables with E(X²) < ∞ and E(Y²) < ∞. Then

E|XY| ≤ √(E(X²) E(Y²)).

Proof. Let Z = |X| / √E(X²) and W = |Y| / √E(Y²), so that E(Z²) = E(W²) = 1. Then

0 ≤ E((Z − W)²) = E(Z² + W² − 2ZW) = 1 + 1 − 2E(ZW),

so E(ZW) ≤ 1, i.e. E|XY| ≤ √(E(X²) E(Y²)). ■

Proposition 5.1.4. (Jensen's inequality.) Let X be a random variable with finite mean, and let φ : R → R be a convex function, i.e. a function such that λφ(x) + (1 − λ)φ(y) ≥ φ(λx + (1 − λ)y) for x, y ∈ R and 0 ≤ λ ≤ 1. Then E(φ(X)) ≥ φ(E(X)).

Proof. Since φ is convex, we can find a linear function g(x) = ax + b which lies entirely below the graph of φ but which touches it at the point x = E(X), i.e. such that g(x) ≤ φ(x) for all x ∈ R, and g(E(X)) = φ(E(X)). Then

E(φ(X)) ≥ E(g(X)) = E(aX + b) = aE(X) + b = g(E(X)) = φ(E(X)). ■

5.2. Convergence of random variables.
If Z, Z₁, Z₂, ... are random variables defined on some (Ω, F, P), what does it mean to say that {Z_n} converges to Z as n → ∞? One notion we have already seen (cf. Theorem 4.2.2) is pointwise convergence, i.e. lim_{n→∞} Z_n(ω) = Z(ω). A slightly weaker notion which often arises is convergence almost surely (or, a.s., or with probability 1, or w.p. 1, or almost everywhere), meaning that P(lim_{n→∞} Z_n = Z) = 1, i.e. that P{ω ∈ Ω : lim_{n→∞} Z_n(ω) = Z(ω)} = 1. As an aid to establishing such convergence, we have the following:

Lemma 5.2.1. Let Z, Z₁, Z₂, ... be random variables. Suppose for each ε > 0, we have P(|Z_n − Z| ≥ ε i.o.) = 0. Then P(Z_n → Z) = 1, i.e. {Z_n} converges to Z almost surely.

Proof. It follows from Proposition A.3.3 that

P(Z_n → Z) = P(∀ε > 0, |Z_n − Z| < ε a.a.) = 1 − P(∃ε > 0, |Z_n − Z| ≥ ε i.o.).
By countable subadditivity, we have that

P(∃ε > 0, ε rational, |Z_n − Z| ≥ ε i.o.) ≤ Σ_{ε>0, ε rational} P(|Z_n − Z| ≥ ε i.o.) = 0.

But given any ε > 0, there exists a rational ε' > 0 with ε' ≤ ε. For this ε', we have that {|Z_n − Z| ≥ ε i.o.} ⊆ {|Z_n − Z| ≥ ε' i.o.}. It follows that

P(∃ε > 0, |Z_n − Z| ≥ ε i.o.) ≤ P(∃ε' > 0, ε' rational, |Z_n − Z| ≥ ε' i.o.) = 0,

thus giving the result. ■

Combining Lemma 5.2.1 with the Borel-Cantelli Lemma, we obtain:

Corollary 5.2.2. Let Z, Z₁, Z₂, ... be random variables. Suppose for each ε > 0, we have Σ_n P(|Z_n − Z| ≥ ε) < ∞. Then P(Z_n → Z) = 1, i.e. {Z_n} converges to Z almost surely.

Another notion involves only probabilities: we say that {Z_n} converges to Z in probability if for all ε > 0, lim_{n→∞} P(|Z_n − Z| ≥ ε) = 0. We next consider the relation between convergence in probability and convergence almost surely.

Proposition 5.2.3. Let Z, Z₁, Z₂, ... be random variables. Suppose Z_n → Z almost surely (i.e., P(Z_n → Z) = 1). Then Z_n → Z in probability (i.e., for any ε > 0, we have P(|Z_n − Z| ≥ ε) → 0). That is, if a sequence of random variables converges almost surely, then it converges in probability to the same limit.

Proof. Fix ε > 0, and let A_n = {ω : ∃m ≥ n, |Z_m − Z| ≥ ε}. Then {A_n} is a decreasing sequence of events. Furthermore, if ω ∈ ∩_n A_n, then Z_n(ω) does not converge to Z(ω) as n → ∞. Hence, P(∩_n A_n) ≤ P(Z_n does not converge to Z) = 0. By continuity of probabilities, we have P(A_n) → P(∩_n A_n) = 0. Hence, P(|Z_n − Z| ≥ ε) ≤ P(A_n) → 0, as required. ■

On the other hand, the converse to Proposition 5.2.3 is false. (This justifies the use of "strong" and "weak" in describing the two Laws of Large Numbers below.) For a first example, let {Z_n} be independent, with P(Z_n = 1) = 1/n = 1 − P(Z_n = 0). (Formally, the existence of such {Z_n} follows from Theorem 7.1.1.) Then clearly Z_n converges to 0 in probability. On the other hand, by the Borel-Cantelli Lemma, P(Z_n = 1 i.o.) = 1, so P(Z_n → 0) = 0, so Z_n does not converge to 0 almost surely.
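This first example is easy to visualise by simulation. The sketch below (ours, not from the text) generates one sample path of such {Z_n}: the proportion of 1's dies out, consistent with convergence in probability, yet isolated 1's keep appearing arbitrarily late, consistent with P(Z_n = 1 i.o.) = 1:

import random

# One sample path of independent Z_n with P(Z_n = 1) = 1/n.  Since
# P(|Z_n| >= eps) = 1/n -> 0, Z_n -> 0 in probability, but since
# sum 1/n = infinity, Borel-Cantelli forces Z_n = 1 infinitely often.
random.seed(1)
N = 10_000
path = [1 if random.random() < 1.0 / n else 0 for n in range(1, N + 1)]
last_one = max(n for n, z in enumerate(path, start=1) if z == 1)
print("number of 1's seen:", sum(path))   # grows like log N
print("a 1 occurred as late as n =", last_one)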
For a second example, let (Ω, F, P) be Lebesgue measure on [0,1], and set

Z₁ = 1_{[0,½)}, Z₂ = 1_{[½,1]}, Z₃ = 1_{[0,¼)}, Z₄ = 1_{[¼,½)}, Z₅ = 1_{[½,¾)}, Z₆ = 1_{[¾,1]}, Z₇ = 1_{[0,⅛)}, Z₈ = 1_{[⅛,¼)}, etc.

Then, by inspection, Z_n converges to 0 in probability, but Z_n does not converge to 0 almost surely.

5.3. Laws of large numbers.
Here we prove a first form of the weak law of large numbers.

Theorem 5.3.1. (Weak law of large numbers - first version.) Let X₁, X₂, ... be a sequence of independent random variables, each having the same mean m, and each having variance ≤ v < ∞. Then for all ε > 0,

lim_{n→∞} P(|(1/n)(X₁ + X₂ + ... + X_n) − m| ≥ ε) = 0.

In words, the partial averages (1/n)(X₁ + X₂ + ... + X_n) converge in probability to m.

Proof. Set S_n = (1/n)(X₁ + X₂ + ... + X_n). Then using linearity of expected value, and also properties (4.1.5) and (4.1.6) of variance, we see that E(S_n) = m and Var(S_n) ≤ v/n. Hence by Chebychev's inequality (Proposition 5.1.2), we have

P(|(1/n)(X₁ + X₂ + ... + X_n) − m| ≥ ε) ≤ v/(ε²n) → 0, n → ∞,

as required. ■

For example, in the case of infinite coin tossing, if X_i = r_i = 0 or 1 as the i-th coin is tails or heads, then Theorem 5.3.1 states that the probability that the fraction of heads on the first n tosses differs from ½ by more than ε goes to 0 as n → ∞. Informally, the fraction of heads gets closer and closer to ½ with higher and higher probability.
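As a quick illustration (ours, not from the text), the following Python sketch estimates the deviation probability in Theorem 5.3.1 for fair coin tossing, with ε = 0.02:

import random

# Estimate P(|fraction of heads - 1/2| >= eps) for increasing n,
# by repeating the n-toss experiment many times.
random.seed(2)
eps, trials = 0.02, 2000

def deviation_prob(n):
    bad = 0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 for _ in range(n))
        if abs(heads / n - 0.5) >= eps:
            bad += 1
    return bad / trials

for n in (100, 1000, 10000):
    print(n, deviation_prob(n))   # decreases toward 0 as n grows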
We next prove a first form of the strong law of large numbers.

Theorem 5.3.2. (Strong law of large numbers - first version.) Let X₁, X₂, ... be a sequence of independent random variables, each having the same finite mean m, and each having E((X_i − m)⁴) ≤ a < ∞. Then

P(lim_{n→∞} (1/n)(X₁ + X₂ + ... + X_n) = m) = 1.

In words, the partial averages (1/n)(X₁ + X₂ + ... + X_n) converge almost surely to m.

Proof. Since (X_i − m)² ≤ (X_i − m)⁴ + 1 (consider separately the two cases (X_i − m)² ≤ 1 and (X_i − m)² > 1), it follows that each X_i has variance ≤ v ≡ a + 1 < ∞. We assume for simplicity that m = 0; if not, then we can simply replace X_i by X_i − m.

Set S_n = X₁ + X₂ + ... + X_n, and consider E(S_n⁴). Now, S_n⁴ will (when multiplied out) contain many terms of the form X_i X_j X_k X_ℓ for i, j, k, ℓ distinct, but all of these have expected value zero. Similarly, it will contain many terms of the form X_i X_j (X_k)² and X_i (X_j)³ which also have expected value zero. The only terms with non-vanishing expectation will be the n terms of the form (X_i)⁴, and the (n choose 2)(4 choose 2) = 3n(n−1) terms of the form (X_i)²(X_j)² with i ≠ j. Now, E((X_i)⁴) ≤ a. Furthermore, if i ≠ j then X_i² and X_j² are independent by Proposition 3.2.3, so since m = 0 we have E((X_i)²(X_j)²) = E((X_i)²) E((X_j)²) = Var(X_i) Var(X_j) ≤ v². We conclude that

E(S_n⁴) ≤ na + 3n(n−1)v² ≤ Kn², where K = a + 3v².

This is the key. To finish, we note that for any ε > 0, we have by Markov's inequality that

P(|S_n/n| ≥ ε) = P(|S_n| ≥ nε) = P(|S_n|⁴ ≥ n⁴ε⁴) ≤ E(S_n⁴)/(n⁴ε⁴) ≤ Kn²/(n⁴ε⁴) = K ε⁻⁴ n⁻².

Since Σ_{n=1}^∞ n⁻² < ∞ by (A.3.7), it follows from Corollary 5.2.2 that S_n/n converges to 0 almost surely. ■

For example, in the case of coin tossing, Theorem 5.3.2 states that the fraction of heads on the first n tosses will converge, as n → ∞, to ½. Although this conclusion sounds quite similar to the corresponding conclusion from Theorem 5.3.1, we know from Proposition 5.2.3 (and the examples following) that it is actually a stronger result.

5.4. Eliminating the moment conditions.
Theorems 5.3.1 and 5.3.2 provide clear evidence that the partial averages (1/n)(X₁ + ... + X_n) are indeed converging to the common mean m in some sense. However, they require that the variance (i.e., second moment) or even the fourth moment of the X_i be finite (and uniformly bounded). This is an unnatural condition which is sometimes difficult to check. Thus, in this section we develop a new form of the strong law of large numbers which requires only that the mean (i.e., first moment) of the random variables be finite. However, as a penalty, it demands that the random variables be "i.i.d." as opposed to merely independent. (Of course, once we have proven a strong law, we will immediately obtain a weak law by using
Proposition 5.2.3.) Our proof follows the approach in Billingsley (1995).
We begin with a definition.

Definition 5.4.1. A collection of random variables {X_α}_{α∈I} are identically distributed if for any Borel-measurable function f : R → R, the expected value E(f(X_α)) does not depend on α, i.e. is the same for all α ∈ I.

Remark 5.4.2. It follows from Proposition 6.0.2 and Corollary 6.1.3 below that {X_α}_{α∈I} are identically distributed if and only if for all x ∈ R, the probability P(X_α ≤ x) does not depend on α.

Definition 5.4.3. A collection of random variables {X_α}_{α∈I} are i.i.d. if they are independent and are also identically distributed.

Theorem 5.4.4. (Strong law of large numbers - second version.) Let X₁, X₂, ... be a sequence of i.i.d. random variables, each having finite mean m. Then

P(lim_{n→∞} (1/n)(X₁ + X₂ + ... + X_n) = m) = 1.

In words, the partial averages (1/n)(X₁ + X₂ + ... + X_n) converge almost surely to m.

Proof. The proof is somewhat difficult because we do not assume the finiteness of any moments of the X_i other than the first. Instead, we shall use a "truncation argument", defining new random variables Y_i which are truncated versions of the corresponding X_i. Thus, higher moments will exist for the Y_i, even though the Y_i will tend to be "similar" to the X_i.

To begin, we assume that X_i ≥ 0; if not, we can consider separately X_i⁺ and X_i⁻. (Note that we cannot now assume without loss of generality that m = 0.) We set Y_i = X_i 1_{X_i ≤ i}, i.e. Y_i = X_i unless X_i exceeds i, in which case Y_i = 0. Then since 0 ≤ Y_i ≤ i, we have E(Y_i^k) ≤ i^k < ∞. Also, by Proposition 3.2.3, the {Y_i} are independent. Furthermore, since the X_i are i.i.d., we have that

E(Y_i) = E(X_i 1_{X_i ≤ i}) = E(X₁ 1_{X₁ ≤ i}) ↗ E(X₁) = m as i → ∞,

by the monotone convergence theorem.

We set S_n = Σ_{i=1}^n X_i, and set S_n* = Σ_{i=1}^n Y_i. We compute using (4.1.6) and (4.1.4) that

Var(S_n*) = Var(Y₁) + ... + Var(Y_n) ≤ E(Y₁²) + ... + E(Y_n²)
= E(X₁² 1_{X₁ ≤ 1}) + ... + E(X_n² 1_{X_n ≤ n}) ≤ n E(X₁² 1_{X₁ ≤ n}) ≤ n³ < ∞.   (5.4.5)

We now choose α > 1 (we will later let α ↘ 1), and set u_n = ⌊αⁿ⌋, i.e. u_n is the greatest integer not exceeding αⁿ. Then u_n ≤ αⁿ. Furthermore,
since αⁿ ≥ 1, it follows (consider separately the cases αⁿ ≤ 2 and αⁿ > 2) that u_n ≥ αⁿ/2, i.e. 1/u_n ≤ 2/αⁿ. Hence, for any x > 0, we have that

Σ_{n: u_n ≥ x} 1/u_n ≤ Σ_{n: αⁿ ≥ x} 2/αⁿ ≤ Σ_{k ≥ log_α x} 2/α^k = 2 / (x(1 − α⁻¹)).   (5.4.6)

(Note that here we sum over k = log_α x, log_α x + 1, ..., even if log_α x is not an integer.)

We now proceed to the heart of the proof. For any ε > 0, we have that

Σ_{n=1}^∞ P(|S*_{u_n}/u_n − E(S*_{u_n}/u_n)| ≥ ε)
≤ Σ_{n=1}^∞ Var(S*_{u_n}/u_n) / ε²   [by Chebychev's inequality]
= (1/ε²) Σ_{n=1}^∞ Var(S*_{u_n}) / u_n²   [by (4.1.5)]
≤ (1/ε²) Σ_{n=1}^∞ E(X₁² 1_{X₁ ≤ u_n}) / u_n   [by (5.4.5)]
= (1/ε²) E(X₁² Σ_{n: u_n ≥ X₁} 1/u_n)   [by countable linearity]
≤ (1/ε²) E(X₁² · 2/((1 − α⁻¹) X₁))   [by (5.4.6)]
= 2 E(X₁) / (ε²(1 − α⁻¹)) = 2m / (ε²(1 − α⁻¹)) < ∞.

This finiteness is the key. It now follows from Corollary 5.2.2 that

(S*_{u_n} − E(S*_{u_n})) / u_n converges to 0 almost surely.   (5.4.7)

To complete the proof, we need to replace E(S*_{u_n})/u_n by m, then replace S*_{u_n} by S_{u_n}, and finally replace the index u_n by the general index k. We consider each of these three issues in turn.

First, since E(Y_i) → m as i → ∞, and since u_n → ∞ as n → ∞, it follows immediately that E(S*_{u_n})/u_n → m as n → ∞. Hence, it follows from (5.4.7) that

S*_{u_n}/u_n converges to m almost surely.   (5.4.8)

Second, we note by Proposition 4.2.9 that

Σ_{k=1}^∞ P(X_k ≠ Y_k) = Σ_{k=1}^∞ P(X_k > k) ≤ Σ_{k=1}^∞ P(X_k ≥ k)
= Σ_{k=1}^∞ P(X₁ ≥ k) ≤ E(X₁) = m < ∞.

Hence, again by Borel-Cantelli, we have that P(X_k ≠ Y_k i.o.) = 0, so that P(X_k = Y_k a.a.) = 1. It follows that, with probability one, as n → ∞ the limit of S*_{u_n}/u_n coincides with the limit of S_{u_n}/u_n. Hence, (5.4.8) implies that

S_{u_n}/u_n converges to m almost surely.   (5.4.9)

Finally, for an arbitrary index k, we can find n = n_k so that u_n ≤ k < u_{n+1}. But then, since we assumed X_i ≥ 0 (so that S_k is non-decreasing in k),

(u_n/u_{n+1}) (S_{u_n}/u_n) = S_{u_n}/u_{n+1} ≤ S_k/k ≤ S_{u_{n+1}}/u_n = (u_{n+1}/u_n) (S_{u_{n+1}}/u_{n+1}).

Now, as k → ∞, we have n = n_k → ∞, so that u_n/u_{n+1} → 1/α and u_{n+1}/u_n → α. Hence, for any α > 1 and δ > 0, with probability 1 we have

m/((1 + δ)α) ≤ S_k/k ≤ (1 + δ)αm

for all sufficiently large k. For any ε > 0, choosing α > 1 and δ > 0 so that m/((1 + δ)α) > m − ε and (1 + δ)αm < m + ε, this implies that P(|S_k/k − m| ≥ ε i.o.) = 0. Hence, by Lemma 5.2.1, we have that as k → ∞,

S_k/k converges to m almost surely,

as required. ■

Using Proposition 5.2.3, we immediately obtain a corresponding statement about convergence in probability.

Corollary 5.4.11. (Weak law of large numbers - second version.) Let X₁, X₂, ... be a sequence of i.i.d. random variables, each having finite mean m. Then for all ε > 0,

lim_{n→∞} P(|(1/n)(X₁ + X₂ + ... + X_n) − m| ≥ ε) = 0.

In words, the partial averages (1/n)(X₁ + X₂ + ... + X_n) converge in probability to m.
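To see the second version of the strong law at work where the first version does not apply, one can simulate i.i.d. variables with finite mean but infinite variance. The sketch below (ours, not from the text) uses Pareto variables with index 1.5, whose mean is 3 but whose variance is infinite; the partial averages still settle, slowly, near 3:

import random

# i.i.d. Pareto(1.5) samples via inverse CDF: X = (1 - U)^(-1/alpha).
# Mean = alpha/(alpha - 1) = 3, variance infinite, so Theorems 5.3.1
# and 5.3.2 do not apply, but Theorem 5.4.4 does.
random.seed(3)
alpha = 1.5

total = 0.0
for n in range(1, 1_000_001):
    total += (1.0 - random.random()) ** (-1.0 / alpha)
    if n in (10**2, 10**4, 10**6):
        print(n, total / n)            # wanders, but settles near 3
print("true mean:", alpha / (alpha - 1))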
5.5. Exercises.
Exercise 5.5.1. Suppose E(2^X) = 4. Prove that P(X ≥ 3) ≤ 1/2.

Exercise 5.5.2. Give an example of a random variable X and α > 0 such that P(X ≥ α) > E(X)/α. [Hint: Obviously X cannot be non-negative.] Where does the proof of Markov's inequality break down in this case?

Exercise 5.5.3. Give examples of random variables Y with mean 0 and variance 1 such that
(a) P(|Y| ≥ 2) = 1/4.
(b) P(|Y| ≥ 2) < 1/4.

Exercise 5.5.4. Suppose X is a non-negative random variable with E(X) = ∞. What does Markov's inequality say in this case?

Exercise 5.5.5. Suppose Y is a random variable with finite mean μ_Y and with Var(Y) = ∞. What does Chebychev's inequality say in this case?

Exercise 5.5.6. For general jointly defined random variables X and Y, prove that |Corr(X, Y)| ≤ 1. [Hint: Don't forget the Cauchy-Schwarz inequality.] (Compare Exercise 4.5.11.)

Exercise 5.5.7. Let a ∈ R, and let φ(x) = max(x, a) as in Exercise 4.5.2. Prove that φ is a convex function. Relate this to Jensen's inequality and to Exercise 4.5.2.

Exercise 5.5.8. Let φ(x) = x².
(a) Prove that φ is a convex function.
(b) What does Jensen's inequality say for this choice of φ?
(c) Where in the text have we already seen the result of part (b)?

Exercise 5.5.9. Prove Cantelli's inequality, which states that if X is a random variable with finite mean m and finite variance v, then for α > 0,

P(X − m ≥ α) ≤ v / (v + α²).

[Hint: First show P(X − m ≥ α) ≤ P((X − m + y)² ≥ (α + y)²). Then use Markov's inequality, and minimise the resulting bound over the choice of y ≥ 0.]

Exercise 5.5.10. Let X₁, X₂, ... be a sequence of random variables, with E[X_n] = 8 and Var[X_n] = 1/√n for each n. Prove or disprove that {X_n} must converge to 8 in probability.
Exercise 5.5.11. Give (with proof) an example of a sequence {Y_n} of jointly-defined random variables, such that as n → ∞: (i) Y_n/n converges to 0 in probability; and (ii) Y_n/n² converges to 0 with probability 1; but (iii) Y_n/n does not converge to 0 with probability 1.

Exercise 5.5.12. Give (with proof) an example of two discrete random variables having the same mean and the same variance, but which are not identically distributed.

Exercise 5.5.13. Let r ∈ N. Let X₁, X₂, ... be identically distributed random variables having finite mean m, which are r-dependent, i.e. such that X_{k₁}, X_{k₂}, ..., X_{k_s} are independent whenever k_{i+1} > k_i + r for each i. (Thus, independent random variables are 0-dependent.) Prove that with probability one, (1/n) Σ_{i=1}^n X_i → m as n → ∞. [Hint: Break up the sum Σ_{i=1}^n X_i into r different sums.]

Exercise 5.5.14. Prove the converse of Lemma 5.2.1. That is, prove that if {X_n} converges to X almost surely, then for each ε > 0 we have P(|X_n − X| ≥ ε i.o.) = 0.

Exercise 5.5.15. Let X₁, X₂, ... be a sequence of independent random variables with P(X_n = 3ⁿ) = P(X_n = −3ⁿ) = ½. Let S_n = X₁ + ... + X_n.
(a) Compute E(X_n) for each n.
(b) For n ∈ N, compute R_n = sup{r ∈ R : P(|S_n| ≥ r) = 1}, i.e. the largest number such that |S_n| is always at least R_n.
(c) Compute lim_{n→∞} (1/n) R_n.
(d) For which ε > 0 (if any) is it the case that P((1/n)|S_n| ≥ ε) → 0?
(e) Why does this result not contradict the various laws of large numbers?

5.6. Section summary.
This section presented inequalities about random variables. The first, Markov's inequality, provides an upper bound on P(X ≥ α) for non-negative random variables X. The second, Chebychev's inequality, provides an upper bound on P(|Y − μ_Y| ≥ α) in terms of the variance of Y. The Cauchy-Schwarz and Jensen inequalities were also discussed. It then discussed convergence of sequences of random variables, and presented various versions of the Law of Large Numbers. This law concerns partial averages (1/n)(X₁ + ... + X_n) of collections of independent random variables {X_n} all having the same mean m. Under the assumption of either finite higher moments (first version) or identical distributions (second version), we proved that these partial averages converge in probability (weak law) or almost surely (strong law) to the mean m.
6. Distributions of random variables.
The distribution or law of a random variable is defined as follows.

Definition 6.0.1. Given a random variable X on a probability triple (Ω, F, P), its distribution (or law) is the function μ defined on B, the Borel subsets of R, by

μ(B) = P(X ∈ B) = P(X⁻¹(B)), B ∈ B.

If μ is the law of a random variable, then (R, B, μ) is a valid probability triple. We shall sometimes write μ as L(X) or as P X⁻¹. We shall also write X ~ μ to indicate that μ is the distribution of X.

We define the cumulative distribution function of a random variable X by F_X(x) = P(X ≤ x), for x ∈ R. By continuity of probabilities, the function F_X is right-continuous, i.e. if {x_n} ↘ x then F_X(x_n) → F_X(x). It is also clearly a non-decreasing function of x, with lim_{x→∞} F_X(x) = 1 and lim_{x→−∞} F_X(x) = 0. We note the following.

Proposition 6.0.2. Let X and Y be two random variables (possibly defined on different probability triples). Then L(X) = L(Y) if and only if F_X(x) = F_Y(x) for all x ∈ R.

Proof. The "if" part follows from Corollary 2.5.9. The "only if" part is immediate upon setting B = (−∞, x]. ■

6.1. Change of variable theorem.
The following result shows that distributions specify completely the expected values of random variables (and of functions of them).

Theorem 6.1.1. (Change of variable theorem.) Given a probability triple (Ω, F, P), let X be a random variable having distribution μ. Then for any Borel-measurable function f : R → R, we have

∫_Ω f(X(ω)) P(dω) = ∫_R f(t) μ(dt),   (6.1.2)

i.e. E_P[f(X)] = E_μ(f), provided that either side is well-defined. In words, the expected value of the random variable f(X) with respect to the probability measure P on Ω is equal to the expected value of the function f with respect to the measure μ on R.
Proof. Suppose first that f = 1_B is an indicator function of a Borel set B ⊆ R. Then ∫_Ω f(X(ω)) P(dω) = ∫_Ω 1_{X(ω)∈B} P(dω) = P(X ∈ B), while ∫_R f(t) μ(dt) = ∫_R 1_B(t) μ(dt) = μ(B) = P(X ∈ B), so equality holds in this case.

Now suppose that f is a non-negative simple function. Then f is a finite positive linear combination of indicator functions. But since both sides of (6.1.2) are linear functions of f, we see that equality still holds in this case.

Next suppose that f is a general non-negative Borel-measurable function. Then by Proposition 4.2.5, we can find a sequence {f_n} of non-negative simple functions such that {f_n} ↗ f. We know that (6.1.2) holds when f is replaced by f_n. But then by letting n → ∞ and using the Monotone Convergence Theorem (Theorem 4.2.2), we see that (6.1.2) holds for f as well.

Finally, for general Borel-measurable f, we can write f = f⁺ − f⁻. Since (6.1.2) holds for f⁺ and for f⁻ separately, and since it is linear, therefore it must also hold for f. ■

Remark. The method of proof used in Theorem 6.1.1 (namely, considering first indicator functions, then non-negative simple functions, then general non-negative functions, and finally general functions) is quite widely applicable; we shall use it again in the next subsection.

Corollary 6.1.3. Let X and Y be two random variables (possibly defined on different probability triples). Then L(X) = L(Y) if and only if E[f(X)] = E[f(Y)] for all Borel-measurable f : R → R for which either expectation is well-defined. (Compare Proposition 6.0.2 and Remark 5.4.2.)

Proof. If L(X) = L(Y) = μ (say), then Theorem 6.1.1 says that E[f(X)] = E[f(Y)] = ∫_R f dμ. Conversely, if E[f(X)] = E[f(Y)] for all Borel-measurable f : R → R, then setting f = 1_B shows that P[X ∈ B] = P[Y ∈ B] for all Borel B ⊆ R, i.e. that L(X) = L(Y). ■

Corollary 6.1.4. If X and Y are random variables with P(X = Y) = 1, then E[f(X)] = E[f(Y)] for all Borel-measurable f : R → R for which either expectation is well-defined.

Proof. It follows directly that L(X) = L(Y). Then, letting μ = L(X) = L(Y), we have from Theorem 6.1.1 that E[f(X)] = E[f(Y)] = ∫_R f dμ. ■
6.2. Examples of distributions.
For a first example of a distribution of a random variable, suppose that P(X = c) = 1, i.e. that X is always (or, at least, with probability 1) equal to some constant real number c. Then the distribution of X is the point mass δ_c, defined by δ_c(B) = 1_B(c), i.e. δ_c(B) equals 1 if c ∈ B and equals 0 otherwise. In this case we write X ~ δ_c, or L(X) = δ_c. From Corollary 6.1.4, since P(X = c) = 1, we have E(X) = E(c) = c, and E(X³ + 2) = E(c³ + 2) = c³ + 2, and more generally E[f(X)] = f(c) for any function f. In symbols, ∫_Ω f(X(ω)) P(dω) = ∫_R f(t) δ_c(dt) = f(c). That is, the mapping f ↦ E[f(X)] is an evaluation map.

For a second example, suppose X has the Poisson(5) distribution considered earlier. Then P(X ∈ A) = Σ_{j∈A} e⁻⁵ 5^j / j!, which implies that L(X) = Σ_{j=0}^∞ (e⁻⁵ 5^j / j!) δ_j, a convex combination of point masses. The following proposition shows that we then have E(f(X)) = Σ_{j=0}^∞ f(j) e⁻⁵ 5^j / j! for any function f : R → R.

Proposition 6.2.1. Suppose μ = Σ_i β_i μ_i, where {μ_i} are probability distributions, and {β_i} are non-negative constants (summing to 1, if we want μ to also be a probability distribution). Then for Borel-measurable functions f : R → R,

∫ f dμ = Σ_i β_i ∫ f dμ_i,

provided either side is well-defined.

Proof. As in the proof of Theorem 6.1.1, it suffices (by linearity and the monotone convergence theorem) to check the equation when f = 1_B is an indicator function of a Borel set B. But in this case the result follows immediately since μ(B) = Σ_i β_i μ_i(B). ■

Clearly, any other discrete random variable can be handled similarly to the Poisson(5) example. Thus, discrete random variables do not present any substantial new technical issues.
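As a small numerical illustration (ours, not from the text) of Proposition 6.2.1, the following Python sketch integrates functions f against the Poisson(5) law by summing f(j) e⁻⁵ 5^j / j!, truncating the series once the remaining terms are negligible:

import math

# Integrate f against L(X) = sum_j (e^{-5} 5^j / j!) delta_j: by
# Proposition 6.2.1, this is just the weighted sum of f at the atoms.
lam = 5.0

def E(f, terms=60):    # terms=60 is ample: the Poisson(5) tail is tiny
    return sum(f(j) * math.exp(-lam) * lam**j / math.factorial(j)
               for j in range(terms))

print(E(lambda x: x))                             # ~5.0  (the mean)
print(E(lambda x: x * x))                         # ~30.0 (= lam + lam^2)
print(E(lambda x: x * x) - E(lambda x: x) ** 2)   # ~5.0  (the variance)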
For a third example of a distribution of a random variable, suppose X has the Normal(0,1) distribution considered earlier (henceforth denoted N(0,1)). We can define its law μ_N by

μ_N(B) = ∫_{−∞}^∞ φ(t) 1_B(t) λ(dt), B Borel,   (6.2.2)

where λ is Lebesgue measure on R (cf. (4.4.2)) and where φ(t) = (1/√(2π)) e^{−t²/2}. We note that for a mathematically complete definition, it is necessary to use the Lebesgue integral rather than the Riemann integral. Indeed, the Riemann integral is undefined unless B is a rather simple set (e.g. a finite union of intervals, or more generally a set whose boundary has measure 0), while we need μ_N(B) to be defined for all Borel sets B. Furthermore, since φ is continuous it is Borel-measurable (Proposition 3.1.8), so Lebesgue integrals such as (6.2.2) make sense.

Similarly, given any Borel-measurable function (called a density function) f such that f ≥ 0 and ∫_{−∞}^∞ f(t) λ(dt) = 1, we can define a law μ by

μ(B) = ∫_{−∞}^∞ f(t) 1_B(t) λ(dt), B Borel.

We shall sometimes write this as μ(B) = ∫_B f or μ(B) = ∫_B f(t) λ(dt), or even as μ(dt) = f(t) λ(dt) (where such equalities of "differentials" have the interpretation that the two sides are equal when integrated over t ∈ B for any Borel B, i.e. that ∫_B μ(dt) = ∫_B f(t) λ(dt) for all B). We shall also write this as dμ/dλ = f, and shall say that μ is absolutely continuous with respect to λ, and that f is the density for μ with respect to λ. We then have the following.

Proposition 6.2.3. Suppose μ has density f with respect to λ. Then for any Borel-measurable function g : R → R,

∫_{−∞}^∞ g(t) μ(dt) = ∫_{−∞}^∞ g(t) f(t) λ(dt),

provided either side is well-defined. In words, to compute the integral of a function with respect to μ, it suffices to compute the integral of the function times the density with respect to λ.

Proof. Once again, it suffices to check the equation when g = 1_B is an indicator function of a Borel set B. But in that case, ∫ g(t) μ(dt) = ∫ 1_B(t) μ(dt) = μ(B), while ∫ g(t) f(t) λ(dt) = ∫ 1_B(t) f(t) λ(dt) = μ(B) by definition. The result follows. ■
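As a numerical illustration (ours, not from the text) of Proposition 6.2.3, the following sketch approximates E(g(X)) for X ~ N(0,1) by a simple midpoint Riemann sum of g(t)φ(t) over [−10, 10]; the neglected tails are negligible:

import math

# E(g(X)) for X ~ N(0,1) as the integral of g(t)*phi(t) dt,
# approximated by a midpoint rule on [-10, 10].
def phi(t):
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def E(g, a=-10.0, b=10.0, steps=200_000):
    h = (b - a) / steps
    return sum(g(a + (i + 0.5) * h) * phi(a + (i + 0.5) * h)
               for i in range(steps)) * h

print(E(lambda t: t))        # ~0.0
print(E(lambda t: t ** 2))   # ~1.0
print(E(lambda t: t ** 4))   # ~3.0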
By combining Theorem 4.4.1 and Proposition 6.2.3, it is possible to do explicit computations with absolutely-continuous random variables. For example, if X ~ N(0,1), then E(X) = ∫ t μ_N(dt) = ∫ t φ(t) λ(dt) = ∫ t φ(t) dt, and more generally

E(g(X)) = ∫ g(t) μ_N(dt) = ∫ g(t) φ(t) λ(dt) = ∫_{−∞}^∞ g(t) φ(t) dt

for any Riemann-integrable function g; here the last expression is an ordinary, old-fashioned, calculus-style Riemann integral. It can be computed in this manner that E(X) = 0, E(X²) = 1, E(X⁴) = 3, etc.

For an example combining Propositions 6.2.3 and 6.2.1, suppose that L(X) = ¼δ₁ + ¼δ₂ + ½μ_N, where μ_N is again the N(0,1) distribution. Then E(X) = ¼(1) + ¼(2) + ½(0) = ¾, E(X²) = ¼(1) + ¼(4) + ½(1) = 7/4, and so on. Note, however, that it is not the case that Var(X) equals the corresponding linear combination of variances (indeed, the variance of a point mass is 0, so that the corresponding linear combination of variances is ¼(0) + ¼(0) + ½(1) = ½); rather, the formula Var(X) = E(X²) − E(X)² = 7/4 − (¾)² = 19/16 should be used.

Exercise 6.2.4. Why does Proposition 6.2.1 not imply that Var(X) equals the corresponding linear combination of variances?

6.3. Exercises.
Exercise 6.3.1. Let (Ω, F, P) be Lebesgue measure on [0,1], and set

X(ω) = 1 for 0 ≤ ω < 1/4; X(ω) = 2ω² for 1/4 ≤ ω < 3/4; X(ω) = ω² for 3/4 ≤ ω ≤ 1.

Compute P(X ∈ A) where (a) A = [0,1]. (b) A = [1/2, 1].

Exercise 6.3.2. Suppose P(Z = 0) = P(Z = 1) = ½, that Y ~ N(0,1), and that Y and Z are independent. Set X = YZ. What is the law of X?

Exercise 6.3.3. Let X ~ Poisson(5). (a) Compute E(X) and Var(X). (b) Compute E(3^X).

Exercise 6.3.4. Compute E(X), E(X²), and Var(X), where the law of X is given by
(a) L(X) = ⅓δ₁ + ⅔λ, where λ is Lebesgue measure on [0,1].
(b) L(X) = ⅓δ₂ + ⅔μ_N, where μ_N is the standard normal distribution N(0,1).

Exercise 6.3.5. Let X and Z be independent, with X ~ N(0,1), and with P(Z = 1) = P(Z = −1) = 1/2. Let Y = XZ (i.e., Y is the product of X and Z).
(a) Prove that Y ~ N(0,1).
(b) Prove that P(|X| = |Y|) = 1.
(c) Prove that X and Y are not independent.
(d) Prove that Cov(X, Y) = 0.
(e) It is sometimes claimed that if X and Y are normally distributed random variables with Cov(X, Y) = 0, then X and Y must be independent. Is that claim correct?

Exercise 6.3.6. Let X and Y be random variables on some probability triple (Ω, F, P). Suppose E(X⁴) < ∞, and that P[m ≤ X ≤ z] = P[m ≤ Y ≤ z] for all integers m and all z ∈ R. Prove or disprove that we necessarily have E(Y⁴) = E(X⁴).

Exercise 6.3.7. Let X be a random variable, and let F_X(x) be its cumulative distribution function. For fixed x ∈ R, we know by right-continuity that lim_{y↘x} F_X(y) = F_X(x).
(a) Give a necessary and sufficient condition that lim_{y↗x} F_X(y) = F_X(x).
(b) More generally, give a formula for F_X(x) − lim_{y↗x} F_X(y), in terms of a simple property of X.

Exercise 6.3.8. Consider the statement: f(x) = (f(x))² for all x ∈ R.
(a) Prove that the statement is true for all indicator functions f = 1_B.
(b) Prove that the statement is not true for the identity function f(x) = x.
(c) Why does this fact not contradict the method of proof of Theorem 6.1.1?

6.4. Section summary.
This section defined the distribution (or law), L(X), of a random variable X, to be a corresponding distribution on the real line. It proved that L(X) is completely determined by the cumulative distribution function, F_X(x) = P(X ≤ x), of X. It proved that the expectation E(f(X)) of any function of X can be computed (in principle) once L(X) or F_X(x) is known. It then considered a number of examples of distributions of random variables, including discrete and continuous random variables and various combinations of them. It provided a number of results for computing expected values with respect to such distributions.
7. Stochastic processes and gambling games.
Now that we have covered most of the essential foundations of rigorous probability theory, it is time to get "moving", i.e. to consider random processes rather than just static random variables.

A (discrete time) stochastic process is simply a sequence X₀, X₁, X₂, ... of random variables defined on some fixed probability triple (Ω, F, P). The random variables {X_n} are typically not independent. In this context we often think of n as representing time; thus, X_n represents the value of a random quantity at the time n.

For a specific example, let (r₁, r₂, ...) be the result of infinite fair coin tossing (so that {r_i} are independent, and each r_i equals 0 or 1 with probability ½; see Subsection 2.6), and set

X₀ = 0; X_n = r₁ + r₂ + ... + r_n, n ≥ 1;   (7.0.1)

thus, X_n represents the number of heads obtained up to time n. Alternatively, we might set

X₀ = 0; X_n = 2(r₁ + r₂ + ... + r_n) − n, n ≥ 1;   (7.0.2)

then X_n represents the number of heads minus the number of tails obtained up to time n.

This last example suggests a gambling game: each time we obtain a head we increase X_n by 1 (i.e., we "win"), while each time we obtain a tail we decrease X_n by 1 (i.e., we "lose"). To allow for non-fair games, we might wish to generalise (7.0.2) to

X₀ = 0; X_n = Z₁ + Z₂ + ... + Z_n, n ≥ 1,

where the {Z_i} are assumed to be i.i.d. random variables, satisfying P(Z_i = 1) = p and P(Z_i = −1) = 1 − p, for some fixed 0 < p < 1. (If p = ½ then this is equivalent to (7.0.2), with Z_i = 2r_i − 1.)

This raises an immediate issue: can we be sure that such random variables {Z_i} even exist? In fact the answer to this question is yes, as the following subsection shows.

7.1. A first existence theorem.
We here show the existence of sequences of independent random variables having prescribed distributions.

Theorem 7.1.1. Let μ₁, μ₂, ... be any sequence of Borel probability measures on R. Then there exists a probability space (Ω, F, P), and random
variables X₁, X₂, ... defined on (Ω, F, P), such that {X_n} are independent, and L(X_n) = μ_n.

We begin with a lemma.

Lemma 7.1.2. Let U be a random variable whose distribution is Lebesgue measure (i.e., the uniform distribution) on [0,1]. Let F be any cumulative distribution function, and set φ(u) = inf{x : F(x) ≥ u} for 0 < u < 1. Then P(φ(U) ≤ z) = F(z) for each z ∈ R; in words, the cumulative distribution function of φ(U) is F.

Proof. Since F is non-decreasing, if F(z) < u, then φ(u) > z. On the other hand, since F is right-continuous, we have that inf{x : F(x) ≥ u} = min{x : F(x) ≥ u}; that is, the infimum is actually attained. It follows that if F(z) ≥ u, then φ(u) ≤ z. Hence, φ(u) ≤ z if and only if u ≤ F(z). Since 0 ≤ F(z) ≤ 1, we obtain that P(φ(U) ≤ z) = P(U ≤ F(z)) = F(z). ■

Proof of Theorem 7.1.1. We let (Ω, F, P) be infinite independent fair coin tossing, so that r₁, r₂, ... are i.i.d. with P(r_i = 0) = P(r_i = 1) = ½. Let {Z_ij} be a two-dimensional array filled by these r_i, one diagonal at a time, so that each r_k appears in exactly one position:

( Z₁₁ Z₁₂ Z₁₃ Z₁₄ ... )   ( r₁  r₂  r₄  r₇  ... )
( Z₂₁ Z₂₂ Z₂₃ ...     ) = ( r₃  r₅  r₈  ...     )
( Z₃₁ Z₃₂ ...         )   ( r₆  r₉  ...         )
( Z₄₁ ...             )   ( r₁₀ ...             )

Hence, {Z_ij} are independent, with P(Z_ij = 0) = P(Z_ij = 1) = ½. Then, for each n ∈ N, we set U_n = Σ_{k=1}^∞ Z_{nk}/2^k. By Corollary 3.5.3, the {U_n} are independent. Furthermore, by the way the U_n were constructed, we have P(j/2^k ≤ U_n < (j+1)/2^k) = 1/2^k for k ∈ N and 0 ≤ j < 2^k. By additivity and continuity of probabilities, this implies that P(a ≤ U_n ≤ b) = b − a whenever 0 ≤ a ≤ b ≤ 1. Hence, by Proposition 2.5.8, each U_n follows the uniform distribution (i.e. Lebesgue measure) on [0,1].

Finally, we set F_n(x) = μ_n((−∞, x]) for x ∈ R, set φ_n(u) = inf{x : F_n(x) ≥ u} for 0 < u < 1, and set X_n = φ_n(U_n). Then {X_n} are independent by Proposition 3.2.3, and X_n ~ μ_n by Lemma 7.1.2, as required. ■
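Lemma 7.1.2 is also the basis of a standard simulation technique, usually called inverse-CDF (or inversion) sampling. The following sketch (ours, not from the text) applies it with F(x) = 1 − e^{−x}, the Exponential(1) distribution function, for which φ has a closed form:

import math, random

# phi(U) = inf{x : F(x) >= u} applied to a Uniform[0,1] variable U
# yields a variable with cdf F (Lemma 7.1.2).  For F(x) = 1 - e^{-x},
# phi(u) = -log(1 - u) exactly.
random.seed(4)

def phi(u):
    return -math.log(1.0 - u)

xs = [phi(random.random()) for _ in range(100_000)]

# Empirical cdf at z = 1.0 versus F(1.0) = 1 - e^{-1} ~ 0.632:
z = 1.0
print(sum(1 for x in xs if x <= z) / len(xs), 1 - math.exp(-z))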
7.2. Gambling and gambler's ruin.
By Theorem 7.1.1, for fixed 0 < p < 1 we can find random variables {Z_i} which are i.i.d. with P(Z_i = 1) = p and P(Z_i = −1) = 1 − p = q. We then set X_n = a + Z₁ + Z₂ + ... + Z_n (with X₀ = a) for some fixed integer a. We shall interpret X_n as a gambling player's "fortune" (in dollars) at time n when repeatedly making $1 bets, and shall refer to the stochastic process {X_n} as simple random walk. Thus, our player begins with $a, and at each time has probability p of winning $1 and probability q = 1 − p of losing $1.

We first note the distribution of X_n. Indeed, clearly P(X_n = a + k) = 0 unless −n ≤ k ≤ n with n + k even. For such k, there are (n choose (n+k)/2) different possible sequences Z₁, ..., Z_n such that X_n = a + k, namely all sequences consisting of (n+k)/2 symbols +1 and (n−k)/2 symbols −1. Furthermore, each such sequence has probability p^{(n+k)/2} q^{(n−k)/2}. We conclude that

P(X_n = a + k) = (n choose (n+k)/2) p^{(n+k)/2} q^{(n−k)/2}, −n ≤ k ≤ n, n + k even,

with P(X_n = a + k) = 0 otherwise.

This is a rather "static" observation about the process {X_n}; of greater interest are questions which depend on its time-evolution. One such question is the gambler's ruin problem, defined as follows. Suppose that 0 ≤ a ≤ c, and let τ₀ = inf{n ≥ 0 : X_n = 0} and τ_c = inf{n ≥ 0 : X_n = c} be the first hitting times of 0 and c, respectively. (These infima are taken to be +∞ if the condition is satisfied for no n.) The gambler's ruin question is: what is P(τ_c < τ₀)? That is, what is the probability that the player's fortune will reach the value c before it reaches the value 0? Informally, what is the probability that the gambler gets rich before going broke? (Note that {τ_c < τ₀} includes the case when τ₀ = ∞ while τ_c < ∞; but it does not include the case when τ_c = τ₀ = ∞.)

Solving this question is not straightforward, since there is no limit to how long it will take until either the fortune c or the fortune 0 is reached. However, with the right trick, the solution presents itself. We set s(a) = s_{c,p}(a) = P(τ_c < τ₀). Writing the dependence on a explicitly will allow us to vary a, and to relate s(a) to s(a−1) and s(a+1). Indeed, for 1 ≤ a ≤ c−1 we have that

s(a) = P(τ_c < τ₀) = P(Z₁ = −1, τ_c < τ₀) + P(Z₁ = +1, τ_c < τ₀) = q s(a−1) + p s(a+1).   (7.2.1)

That is, s(a) is a simple convex combination of s(a−1) and s(a+1); this is the key. We further have by definition that s(0) = 0 and s(c) = 1.
Now, (7.2.1) gives c−1 equations for the c−1 unknowns s(1), ..., s(c−1). This system of equations can then be solved in several different ways (see Exercises 7.2.4, 7.2.5, and 7.2.6 below), to obtain that

s(a) = s_{c,p}(a) = (1 − (q/p)^a) / (1 − (q/p)^c), p ≠ ½,   (7.2.2)

and

s(a) = s_{c,p}(a) = a/c, p = ½.   (7.2.3)

(This last equation is suggestive: for a fair game (p = ½), with probability a/c you end up with c dollars; and the product of these two quantities is your initial fortune a. We will consider this issue again when we study martingales; see in particular Exercises 14.4.10 and 14.4.11.)

Exercise 7.2.4. Verify that (7.2.2) and (7.2.3) satisfy (7.2.1), and also satisfy s(0) = 0 and s(c) = 1.

Exercise 7.2.5. Solve equation (7.2.1) by direct algebra, as follows.
(a) Show that (7.2.1) implies that for 1 ≤ a ≤ c−1, s(a+1) − s(a) = (q/p)(s(a) − s(a−1)).
(b) Show that this implies that for 0 ≤ a ≤ c−1, s(a+1) − s(a) = (q/p)^a s(1).
(c) Show that this implies that for 0 ≤ a ≤ c, s(a) = Σ_{i=0}^{a−1} (q/p)^i s(1).
(d) Solve for s(1), and verify (7.2.2) and (7.2.3).

Exercise 7.2.6. Solve equation (7.2.1) using the theory of difference equations, as follows.
(a) Show that the corresponding "characteristic equation" t⁰ = q t⁻¹ + p t¹ has two distinct roots t₁ and t₂ when p ≠ 1/2, and one double root t₃ when p = 1/2. Solve for t₁, t₂, and t₃.
(b) When p ≠ 1/2, the theory of difference equations says that we must have s_{c,p}(a) = C₁(t₁)^a + C₂(t₂)^a for some constants C₁ and C₂. Assuming this, use the boundary conditions s_{c,p}(0) = 0 and s_{c,p}(c) = 1 to solve for C₁ and C₂. Verify (7.2.2).
(c) When p = 1/2, the theory of difference equations says that we must have s_{c,p}(a) = C₃(t₃)^a + C₄ a(t₃)^a for some constants C₃ and C₄. Assuming this, use the boundary conditions to solve for C₃ and C₄. Verify (7.2.3).

As a specific example, suppose you start with $9,700 (i.e., a = 9700) and your goal is to win $10,000 before going broke (i.e., c = 10000). If p = ½,
your probability of success is a/c = 0.97, which is very high; on the other hand, if p = 0.49, then your probability of success is given by

(1 − (0.51/0.49)^9700) / (1 − (0.51/0.49)^10000),

which is approximately 6.1 × 10⁻⁶, or about one chance in 163,000. This shows rather dramatically that even a small disadvantage on each bet can lead to a very large disadvantage in the long run!

Now let r(a) = r_{c,p}(a) = P(τ₀ < τ_c) be the probability that our gambler goes broke before reaching the desired fortune. Then clearly s(a) and r(a) are related. Indeed, by "considering the bank's point of view", we see immediately that r_{c,p}(a) = s_{c,1−p}(c−a) (that is, the chance of going broke before obtaining c dollars, when starting with a dollars and having probability p of winning each bet, is the same as the chance of obtaining c dollars before going broke, when starting with c−a dollars and having probability 1−p of winning each bet), so that

r_{c,p}(a) = (1 − (p/q)^{c−a}) / (1 − (p/q)^c), p ≠ ½,

and r_{c,p}(a) = (c−a)/c for p = ½.

Finally, let us consider the probability that the gambler will eventually go broke if they never stop gambling, i.e. P(τ₀ < ∞) without regard to any target fortune c. Well, if we let H_c = {τ₀ < τ_c}, then clearly {H_c} is increasing up to {τ₀ < ∞} as c → ∞. Hence, by continuity of probabilities,

P(τ₀ < ∞) = lim_{c→∞} P(H_c) = lim_{c→∞} r_{c,p}(a) = 1 if p ≤ ½; (q/p)^a if p > ½.   (7.2.7)

Thus, if p ≤ ½ (i.e., if the gambler has no advantage), then the gambler is certain to eventually go broke. On the other hand, if say p = 0.6 and a = 1, then the gambler has probability 1/3 of never going broke.
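The formulas (7.2.2) and (7.2.3) are easy to evaluate directly. The following small calculator (ours, not from the text) reproduces the numbers above:

# Gambler's ruin probabilities from (7.2.2) and (7.2.3).
def s(a, c, p):
    """P(hit c before 0), starting from a, with win probability p."""
    if p == 0.5:
        return a / c
    r = (1 - p) / p                    # q/p
    return (1 - r**a) / (1 - r**c)

print(s(9700, 10000, 0.5))    # 0.97
print(s(9700, 10000, 0.49))   # ~6.1e-06

# Ruin probability via r_{c,p}(a) = s_{c,1-p}(c-a):
print(s(300, 10000, 0.51))    # ~0.999994: ruin is nearly certain at p = 0.49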
7.3. Gambling policies.
Suppose now that our gambler is allowed to choose how much to bet each time. That is, the gambler can choose random variables W_n so that their fortune at time n is given by

X_n = a + W₁Z₁ + W₂Z₂ + ... + W_nZ_n,

with {Z_n} as before. To avoid "cheating", we shall insist that W_n ≥ 0 (you can't bet a negative amount), and also that W_n = f_n(Z₁, Z₂, ..., Z_{n−1}) is a function only of the previous bet results (you can't know the result of a bet before you choose how much to wager). Here the f_n : {−1, 1}^{n−1} → R_{≥0} are fixed deterministic functions, collectively referred to as the gambling policy. (If W_n = 1 for each n, then this is equivalent to our previous gambling model.)

Note that X_n = X_{n−1} + W_nZ_n, with W_n and Z_n independent, and with E(Z_n) = p − q. Hence, E(X_n) = E(X_{n−1}) + (p − q)E(W_n). Furthermore, since W_n ≥ 0, therefore E(W_n) ≥ 0. It follows that
(a) if p = ½, then E(X_n) = E(X_{n−1}) = ... = E(X₀) = a, so that lim E(X_n) = a;
(b) if p < ½, then E(X_n) ≤ E(X_{n−1}) ≤ ... ≤ E(X₀) = a, so that lim E(X_n) ≤ a;
(c) if p > ½, then E(X_n) ≥ E(X_{n−1}) ≥ ... ≥ E(X₀) = a, so that lim E(X_n) ≥ a.
This seems simple enough, and corresponds to our intuition: if p ≤ ½ then the player's expected value can only decrease. End of story? Perhaps not.

Consider the "double 'til you win" policy, defined by W₁ = 1 and, for n ≥ 2,

W_n = 2^{n−1} if Z₁ = ... = Z_{n−1} = −1; otherwise W_n = 0.

That is, we first bet $1. Each time we lose, we double our bet on the succeeding turn. As soon as we win once, we bet zero from then on. It is easily seen that, with this gambling policy, we will be up $1 as soon as we win a bet. That is, letting τ = inf{n ≥ 1 : Z_n = +1}, we have that X_n = a + 1 provided that τ ≤ n. Now, clearly P(τ > n) = (1 − p)ⁿ. Hence, assuming p > 0, we see that P(τ < ∞) = 1. It follows that P(lim X_n = a + 1) = 1, so that E(lim X_n) = a + 1. In words, with probability one we will gain $1 with this gambling policy, for any positive value of p, and thus "cheat fate".

How can this be, in light of (a) and (b) above? The answer, of course, is that in this case E(lim X_n) (which equals a + 1) is not the same as lim E(X_n) (which must be ≤ a). This is what allows us to "cheat fate". On the other hand, we may need to lose an arbitrarily large amount of money before we win our $1, so "infinite capital" is required to follow this gambling policy. We show now that, if the fortunes X_n must remain bounded (i.e., if we only have a finite amount of capital to draw on), then E(lim X_n) must indeed be the same as lim E(X_n).
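The policy is easy to simulate. The sketch below (ours, not from the text) runs "double 'til you win" many times with p = 0.49: every run ends up exactly $1, but it also records how far the fortune dips along the way, which is exactly the unbounded capital requirement just described:

import random

# "Double 'til you win" with p = 0.49: the final profit is always
# exactly $1, but interim losses can be enormous.
random.seed(5)
p, trials = 0.49, 100_000
worst_dip = 0
for _ in range(trials):
    fortune, bet = 0, 1
    while True:
        if random.random() < p:       # win: up exactly $1 overall
            fortune += bet
            break
        fortune -= bet                # lose: double and try again
        worst_dip = min(worst_dip, fortune)
        bet *= 2
    assert fortune == 1
print("every run ended up exactly +$1; worst interim loss:", worst_dip)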
Theorem 7.3.1. (The bounded convergence theorem.) Let {X_n} be a sequence of random variables, with lim X_n = X. Suppose there is K ∈ R such that |X_n| ≤ K for all n ∈ N (i.e., the {X_n} are uniformly bounded). Then E(X) = lim E(X_n).

Proof. We have from the triangle inequality that

|E(X) − E(X_n)| = |E(X − X_n)| ≤ E(|X − X_n|).

We shall show that this last expression goes to 0 as n → ∞. Indeed, fix ε > 0, and set A_n = {ω ∈ Ω : |X(ω) − X_n(ω)| ≥ ε}. Then

|X(ω) − X_n(ω)| ≤ ε + 2K 1_{A_n}(ω),

so that E(|X − X_n|) ≤ ε + 2K P(A_n). Hence, using Proposition 3.4.1, we have that

limsup_n E(|X − X_n|) ≤ ε + 2K limsup_n P(A_n) ≤ ε + 2K P(limsup_n A_n) = ε,

since |X(ω) − X_n(ω)| → 0 for all ω ∈ Ω, so that limsup_n A_n is the empty set. Hence, E(|X − X_n|) → 0, as claimed. ■

It follows immediately that, if we are gambling with p ≤ ½, and we use any gambling policy which leaves the fortunes {X_n} uniformly bounded, then lim X_n (if it exists) will have expected value equal to lim E(X_n), and therefore be ≤ a. So, if we have only finite capital (no matter how large), then it is not possible to cheat fate.

Finally, we consider the following question. Suppose p < ½, and 0 < a < c. What gambling system maximises P(τ_c < τ₀), i.e. maximises the probability that we reach the fortune c before losing all our money?

For example, suppose again that p = 0.49, and that a = 9700, c = 10000. If W_n = 1 (i.e. we bet exactly $1 each time), then we already know that P(τ_c < τ₀) = [(0.51/0.49)^9700 − 1] / [(0.51/0.49)^10000 − 1] ≈ 6.1 × 10⁻⁶. On the other hand, if W_n = 2, then this is equivalent to instead having a = 4850 and c = 5000, so we see that P(τ_c < τ₀) = [(0.51/0.49)^4850 − 1] / [(0.51/0.49)^5000 − 1] ≈ 2.5 × 10⁻³, which is about 400 times better. In fact, if W_n = 100, then P(τ_c < τ₀) = [(0.51/0.49)^97 − 1] / [(0.51/0.49)^100 − 1] ≈ 0.885, which is a very favourable probability.

This example suggests that, if p < ½ and we wish to maximise P(τ_c < τ₀), then it is best to bet in larger amounts, i.e. to get the game over with as quickly as possible. This is indeed true. More precisely, we define bold play to be the gambling strategy W_n = min(X_{n−1}, c − X_{n−1}), i.e. the strategy of betting as much as possible each time. It is then a theorem that, when p ≤ ½, this is the optimal strategy in the sense of maximising P(τ_c < τ₀). For a proof, see e.g. Billingsley (1995, Theorem 7.3).
Disclaimer. We note that for p < ½, even though to maximise P(τ_c < τ₀) it is best to bet large amounts, still overall it is best not to gamble at all. Indeed, by (7.2.7), if p ≤ ½ and you have any finite amount of money, then if you keep betting you will eventually lose it all and go broke. This is the reason few probabilists attend gambling casinos!

7.4. Exercises.
Exercise 7.4.1. For the stochastic process {X_n} given by (7.0.1), compute (for n, k ≥ 0)
(a) P(X_n = k).
(b) P(τ_k = n).
[Hint: These two questions do not have the same answer.]

Exercise 7.4.2. For the stochastic process {X_n} given by (7.0.2), compute (for n, k ≥ 0)
(a) P(X_n = k).
(b) P(X_n ≥ 0).

Exercise 7.4.3. Prove that there exist random variables Y and Z such that P(Y = 1) = P(Y = −1) = P(Z = 1) = P(Z = −1) = ¼, P(Y = 0) = P(Z = 0) = ½, and such that Cov(Y, Z) = ¼. (In particular, Y and Z are not independent.) [Hint: First use Theorem 7.1.1 to construct independent random variables X₁, X₂, and X₃ each having certain two-point distributions. Then construct Y and Z as functions of X₁, X₂, and X₃.]

Exercise 7.4.4. For the gambler's ruin model of Subsection 7.2, with c = 10000 and p = 0.49, find the smallest positive integer a such that s_{c,p}(a) ≥ ½. Interpret your result in plain English.

Exercise 7.4.5. For the gambler's ruin model of Subsection 7.2, let β_n = P(min(τ₀, τ_c) > n) be the probability that the player's fortune has not hit 0 or c by time n.
(a) Find any explicit, simple expression γ_n such that β_n ≤ γ_n for all n ∈ N, and such that lim_{n→∞} γ_n = 0.
(b) Find any explicit, simple expression α_n such that β_n ≥ α_n > 0 for all n ∈ N.

Exercise 7.4.6. Let {W_n} be i.i.d. with P[W_n = +1] = P[W_n = 0] = 1/4 and P[W_n = −1] = 1/2, and let a be a positive integer. Let X_n = a + W₁ + ... + W_n, and let τ₀ = inf{n ≥ 0 : X_n = 0}. Compute P(τ₀ < ∞). [Hint: Let {Y_n} be like {X_n}, except with immediate repetitions of values omitted. Is {Y_n} a simple random walk? With what parameter p?]
7.5. Section summary. 81 '*o 7 4.7« Verify explicitly that rcJa) + scJa) = 1 for all a, c, and?- cise 7.4-8- In gambler's ruin, recall that {rc < r0} is the event the player eventually wins, and {to < rc} is the event that the player eventually loses. ( \ Give a similar plain-English description of the complement of the union of these two events, i.e. ({rc < r0} U {r0 < rc}) . (b) Give three different proofs that the event described in part (a) has probability 0: one using Exercise 7.4.7; a second using Exercise 7.4.5; and a third recalling how the probabilities sCiP(a) were computed in the text, and seeing to what extent the computation would have differed if we had instead replaced sClP(a) by SCiP(a) = P(rc < r0). (c) Prove that, if c > 4, then the event described in part (a) contains un- countably many outcomes (i.e., that uncountably many different sequences Z\, Z2, • • • correspond to this event, even though it has probability 0). [Hint: This is not entirely dissimilar from the analysis of the Cantor set in Subsection 2.4.] Exercise 7.4.9. For the gambling policies model of Subsection 7.3, consider the "triple 'til you win" policy defined by W\ = 1, and for n > 2, Wn = 3n_1 if Zi = ... = Zn-i = -1 otherwise Wn = 0. (a) Prove that, with probability 1, the limit limn^oo Xn exists. (b) Describe precisely the distribution of linin^oo Xn. Exercise 7.4.10. Consider the gambling policies model, with p = 1/3, a = 6, and c = 8. (a) Compute the probability sc?p(a) that the player will win (i.e. hit c before hitting 0) if they bet $1 each time (i.e. if Wn = 1). (b) Compute the probability that the player will win if they bet $2 each time (i.e. if Wn = 2). (c) Compute the probability that the player will win if they employ the strategy of Bold Play (i.e., if Wn = min(Xn_1, c - Xn-i)). [Hint: While it is difficult to do explicit computations involving Bold Play in general, here the numbers are small enough that it is not difficult.] 7.5. Section summary. This section introduced the concept of stochastic processes, from the point of view of gambling games. It proved a general theorem about the existence of sequences of independent random variables having arbitrary .specified distributions. It used this to define several models of gambling
specified distributions. It used this to define several models of gambling games. If the player bets $1 each time, and has independent probability p of winning each bet, then explicit formulae were developed for the probability that the player achieves some specified target fortune c before losing all their money. The formula is very sensitive to values of p near ½. If the player is allowed to choose how much to bet at each stage, then with clever betting they can "cheat fate" and always win; however, this requires them to have infinite capital available. If their capital is finite and p ≤ ½, then by the Bounded Convergence Theorem, on average they will always lose money.
8. Discrete Markov chains.
In this section, we consider the general notion of a (discrete time, discrete space, time-homogeneous) Markov chain.

A Markov chain is characterised by three ingredients: a state space S, which is a finite or countable set; an initial distribution {ν_i}_{i∈S} consisting of non-negative numbers summing to 1; and transition probabilities {p_ij}_{i,j∈S} consisting of non-negative numbers with Σ_{j∈S} p_ij = 1 for each i ∈ S.

Intuitively, a Markov chain represents the random motion of some particle moving around the space S. ν_i represents the probability that the particle starts at the point i, while p_ij represents the probability that, if the particle is at the point i, it will then jump to the point j on the next step.

More formally, a Markov chain is defined to be a sequence of random variables X₀, X₁, X₂, ... taking values in the set S, such that

P(X₀ = i₀, X₁ = i₁, ..., X_n = i_n) = ν_{i₀} p_{i₀i₁} p_{i₁i₂} ... p_{i_{n−1}i_n}

for any n ∈ N and any choice of i₀, i₁, ..., i_n ∈ S. (Note that, for these {X_n} to be random variables in the sense of Definition 3.1.1, we need to have S ⊆ R; however, if we allow more general random variables as in Remark 3.1.10, then this restriction is not necessary.) It then follows that, for example,

P(X₁ = j) = Σ_{i∈S} P(X₀ = i, X₁ = j) = Σ_{i∈S} ν_i p_ij.

This also has a matrix interpretation: writing [p] for the matrix {p_ij}_{i,j∈S}, and [μ^(n)] for the row vector {P(X_n = i)}_{i∈S}, we have [μ^(1)] = [μ^(0)][p], and more generally [μ^(n+1)] = [μ^(n)][p].

We present some simple examples of Markov chains here. Note that, except in the first example, we do not bother to specify initial probabilities {ν_i}; we shall see that initial probabilities are often not crucial when studying a chain's properties.

Example 8.0.1. (Simple random walk.) Let S = Z be the set of all integers. Fix a ∈ Z, and let ν_a = 1, with ν_i = 0 for i ≠ a. Fix a real number p with 0 < p < 1, and let p_{i,i+1} = p and p_{i,i−1} = 1 − p for each i ∈ Z, with p_ij = 0 if j ≠ i ± 1. Thus, this Markov chain begins at the point a (with probability 1), and at each step either increases by 1 (with probability p) or decreases by 1 (with probability 1 − p). It is easily seen that this Markov chain corresponds precisely to the gambling game of Subsection 7.2.
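The formal definition translates directly into a simulation recipe: sample X₀ from {ν_i}, then repeatedly sample the next state from row X_n of the transition matrix. A generic sketch (ours, not from the text; the three-state chain at the bottom is a made-up example for testing):

import random

# Simulate a Markov chain: draw X_0 from nu, then jump from state i
# to state j with probability p[i][j].  States are dictionary keys.
random.seed(6)

def pick(dist):
    u, acc = random.random(), 0.0
    for state, prob in dist.items():
        acc += prob
        if u < acc:
            return state
    return state                      # guard against rounding

def run_chain(nu, p, steps):
    x = pick(nu)
    path = [x]
    for _ in range(steps):
        x = pick(p[x])
        path.append(x)
    return path

# A hypothetical lazy walk on {0, 1, 2}, just to exercise the sampler:
nu = {0: 1.0}
p = {0: {0: 0.5, 1: 0.5},
     1: {0: 0.25, 1: 0.5, 2: 0.25},
     2: {1: 0.5, 2: 0.5}}
print(run_chain(nu, p, 20))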
Example 8.0.2. Let S = {1, 2, 3} consist of just three elements, and define the transition probabilities {p_ij} in matrix form by

        ( 0    1/2  1/2 )
[p]  =  ( 1/3  1/3  1/3 )
        ( 1/4  1/4  1/2 )

(so that p₃₁ = ¼, etc.). This Markov chain jumps around on the three points {1, 2, 3} in a random and interesting way.

Example 8.0.3. (Random walk on Z/(d).) Let S = {0, 1, 2, ..., d−1}, and define the transition probabilities by

p_ij = 1/3 if j = i, or j = i + 1 (mod d), or j = i − 1 (mod d); p_ij = 0 otherwise.

If we think of the d elements of S as arranged in a circle, then our particle, at each step, either stays where it is, or moves one step clockwise, or moves one step counter-clockwise, each with probability ⅓.

Example 8.0.4. (Ehrenfest's urn.) Consider two urns, Urn 1 and Urn 2. Suppose there are d balls divided between the two urns. Suppose at each step, we choose one ball uniformly at random from among the d balls, and switch it to the opposite urn. We let X_n be the number of balls in Urn 1 at time n. Thus, S = {0, 1, 2, ..., d}, with p_{i,i−1} = i/d and p_{i,i+1} = (d−i)/d, for 0 ≤ i ≤ d (with p_ij = 0 if j ≠ i ± 1). Thus, this Markov chain moves randomly among the possible numbers {0, 1, ..., d} of balls in Urn 1 at each time. One might expect that, if d is large and the Markov chain is run for a long time, there would most likely be approximately d/2 balls in Urn 1. (We shall consider such questions in Subsection 8.3 below.)
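The matrix interpretation [μ^(n+1)] = [μ^(n)][p] from earlier in this section is easy to check numerically. The following sketch (ours, not from the text) iterates it for the chain of Example 8.0.2 started at state 1; the distribution appears to settle down to a limit, a phenomenon studied in Subsection 8.3:

# Iterate [mu^(n+1)] = [mu^(n)][p] for the chain of Example 8.0.2.
P = [[0.0, 0.5, 0.5],
     [1/3, 1/3, 1/3],
     [0.25, 0.25, 0.5]]
mu = [1.0, 0.0, 0.0]                   # X_0 = 1 with probability 1

def step(mu, P):
    n = len(mu)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

for n in range(20):
    mu = step(mu, P)
print([round(x, 4) for x in mu])       # settles near (2/9, 1/3, 4/9)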
We note that we can also interpret Markov chains in terms of conditional probability. Recall that, if A and B are events with P(B) > 0, then the conditional probability of A given B is P(A|B) = P(A ∩ B)/P(B); intuitively, it is the probabilistic proportion of the event B which is also contained in the event A. Thus, for a Markov chain, if P(X_k = i) > 0, then

P(X_{k+1} = j | X_k = i) = P(X_k = i, X_{k+1} = j) / P(X_k = i)
= [Σ_{i₀,i₁,...,i_{k−1}} ν_{i₀} p_{i₀i₁} ... p_{i_{k−1}i} p_{ij}] / [Σ_{i₀,i₁,...,i_{k−1}} ν_{i₀} p_{i₀i₁} ... p_{i_{k−1}i}] = p_ij.

This formally justifies the notion of p_ij as a transition probability. Note also that this conditional probability does not depend on the starting time k, justifying the notion of time-homogeneity. (The only reason we do not use this conditional probability as our definition of a Markov chain is that the conditional probability is not well-defined if P(X_k = i) = 0.)

We can similarly compute, for any n ∈ N, again assuming P(X_k = i) > 0, that P(X_{k+n} = j | X_k = i) is equal to

Σ_{i_{k+1},...,i_{k+n−1}} p_{i,i_{k+1}} p_{i_{k+1},i_{k+2}} ... p_{i_{k+n−1},j} ≡ p_ij^(n),

where p_ij^(n) is an n-th order transition probability. Note again that this probability does not depend on the starting time k (despite appearances to the contrary). By convention, we set

p_ij^(0) = δ_ij = 1 if i = j; 0 otherwise,   (8.0.5)

i.e. in zero time units the Markov chain just stays where it is.

8.1. A Markov chain existence theorem.
To rigorously study Markov chains, we need to be sure that they exist. Fortunately, this is relatively straightforward.

Theorem 8.1.1. Given a non-empty countable set S, and non-negative numbers {ν_i}_{i∈S} and {p_ij}_{i,j∈S}, with Σ_j ν_j = 1 and Σ_j p_ij = 1 for each i ∈ S, there exists a probability triple (Ω, F, P), and random variables X₀, X₁, ... defined on (Ω, F, P), such that

P(X₀ = i₀, X₁ = i₁, ..., X_n = i_n) = ν_{i₀} p_{i₀i₁} ... p_{i_{n−1}i_n}

for all n ∈ N and all i₀, ..., i_n ∈ S.

Proof. We let (Ω, F, P) be Lebesgue measure on [0,1]. We construct the random variables {X_n} as follows.
1. Partition [0,1] into intervals {I_{i₀}^(0)}_{i₀∈S}, with length(I_{i₀}^(0)) = ν_{i₀}.
2. Partition each I_{i₀}^(0) into intervals {I_{i₀i₁}^(1)}_{i₁∈S}, with length(I_{i₀i₁}^(1)) = ν_{i₀} p_{i₀i₁}.
3. Inductively, partition [0,1] into intervals {I_{i₀i₁...i_n}^(n)}_{i₀,i₁,...,i_n∈S}, such that I_{i₀i₁...i_n}^(n) ⊆ I_{i₀...i_{n−1}}^(n−1) and length(I_{i₀i₁...i_n}^(n)) = ν_{i₀} p_{i₀i₁} ... p_{i_{n−1}i_n}, for all n ∈ N.
4. Define X_n by saying that X_n(ω) = i_n if ω ∈ I_{i₀i₁...i_{n−1}i_n}^(n) for some choice of i₀, ..., i_{n−1} ∈ S.
Then it is easily verified that the random variables {X_n} have the desired properties. ■

We thus see that, as in Theorem 7.1.1, an "old friend" (in this case Lebesgue measure on [0,1]) was able to serve as a probability space on which to define important random variables.

8.2. Transience, recurrence, and irreducibility.
In this subsection, we consider some fundamental notions related to Markov chains. For simplicity, we shall assume that ν_i > 0 for all i ∈ S, and shall write P_i(...) as shorthand for the conditional probability P(... | X₀ = i). Intuitively, P_i(A) stands for the probability that the event A would have occurred, had the Markov chain been started at the state i. We shall also write E_i(...) for expected values with respect to P_i.

To proceed, we define, for i, j ∈ S and n ∈ N, the probabilities

f_ij^(n) = P_i(X_n = j, but X_m ≠ j for 1 ≤ m ≤ n−1);
f_ij = P_i(∃ n ≥ 1 : X_n = j) = Σ_{n=1}^∞ f_ij^(n).

That is, f_ij^(n) is the probability, starting from i, that we first hit j at the time n; f_ij is the probability, starting from i, that we ever hit j.

A state i ∈ S is called recurrent (or persistent) if f_ii = 1, i.e. if starting from i we will certainly eventually return to i. It is called transient if it is not recurrent, i.e. if f_ii < 1. Recurrence and transience are very important concepts in Markov chain theory, and we prove some results about them here. (Recall that i.o. stands for infinitely often.)

Theorem 8.2.1. Let {X_n} be a Markov chain, on a state space S. Let i ∈ S. Then i is transient if and only if P_i(X_n = i i.o.) = 0, if and only if Σ_{n=1}^∞ p_ii^(n) < ∞. On the other hand, i is recurrent if and only if P_i(X_n = i i.o.) = 1, if and only if Σ_{n=1}^∞ p_ii^(n) = ∞.

To prove this theorem, we begin with a lemma.

Lemma 8.2.2. We have P_i(#{n ≥ 1 : X_n = j} ≥ k) = f_ij (f_jj)^{k−1}, for k = 1, 2, .... In words, the probability that, starting from i, we eventually hit the state j at least k times, is given by f_ij (f_jj)^{k−1}.

Proof. Starting from i, to hit j at least k times is equivalent to first hitting j once (starting from i), then to return to j at least k − 1 more
times (each time starting from j). The result follows. ■

Proof of Theorem 8.2.1. By continuity of probabilities, and by Lemma 8.2.2, we have that

P_i(X_n = i i.o.) = lim_{k→∞} P_i(#{n ≥ 1 : X_n = i} ≥ k) = lim_{k→∞} (f_ii)^k = 0 if f_ii < 1; 1 if f_ii = 1.

This proves the first equivalence to each of transience and recurrence. Also, using first countable linearity, and then Proposition 4.2.9, and then Lemma 8.2.2, we compute that

Σ_{n=1}^∞ p_ii^(n) = Σ_{n=1}^∞ P_i(X_n = i) = Σ_{n=1}^∞ E_i(1_{X_n=i}) = E_i(#{n ≥ 1 : X_n = i})
= Σ_{k=1}^∞ P_i(#{n ≥ 1 : X_n = i} ≥ k) = Σ_{k=1}^∞ (f_ii)^k = f_ii/(1 − f_ii) < ∞ if f_ii < 1; = ∞ if f_ii = 1,

thus proving the remaining two equivalences. ■

As an application of this theorem, we consider simple symmetric random walk (cf. Subsection 7.2 or Example 8.0.1, with X₀ = 0 and p = ½). We recall that here S = Z, and for any i ∈ Z we have p_ii^(n) = (n choose n/2)(½)ⁿ = n! / (((n/2)!)² 2ⁿ) for n even (with p_ii^(n) = 0 for n odd). Using Stirling's approximation, which states that n! ~ (n/e)ⁿ √(2πn) as n → ∞, we compute that for large even n, we have p_ii^(n) ~ √(2/(πn)). Hence, we see (cf. (A.3.7)) that Σ_n p_ii^(n) = ∞. Therefore, by Theorem 8.2.1, simple symmetric random walk is recurrent.

On the other hand, if p ≠ ½ for simple random walk, then p_ii^(n) = (n choose n/2)(p(1−p))^{n/2} with p(1−p) < ¼. Stirling then gives, for large even n, that p_ii^(n) ~ [4p(1−p)]^{n/2} √(2/(πn)), with 4p(1−p) < 1. It follows that Σ_n p_ii^(n) < ∞, so that simple asymmetric random walk is not recurrent.

In higher dimensions, suppose that S = Z^d, with each of the d coordinates independently following simple symmetric random walk. Then clearly p_ii^(n) for this process is simply the d-th power of the corresponding quantity for ordinary (one-dimensional) simple symmetric random walk. Hence, for
In higher dimensions, suppose that $S = \mathbf{Z}^d$, with each of the $d$ coordinates independently following simple symmetric random walk. Then clearly $p_{ii}^{(n)}$ for this process is simply the $d$th power of the corresponding quantity for ordinary (one-dimensional) simple symmetric random walk. Hence, for large even $n$, we have $p_{ii}^{(n)} \sim (2/\pi n)^{d/2}$ in this case. It follows (cf. (A.3.7)) that $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$ if and only if $d \le 2$. That is, higher-dimensional simple symmetric random walk is recurrent in dimensions 1 and 2, but transient in dimensions $\ge 3$, a somewhat counter-intuitive result.

Finally, we note that for irreducible Markov chains, somewhat more can be said. A Markov chain is irreducible if $f_{ij} > 0$ for all $i, j \in S$, i.e. if it is possible for the chain to move from any state to any other state. Equivalently, the Markov chain is irreducible if for any $i, j \in S$, there is $n \in \mathbf{N}$ with $p_{ij}^{(n)} > 0$. (A chain is reducible if it is not irreducible, i.e. if $f_{ij} = 0$ for some $i, j \in S$.) We can now prove

Theorem 8.2.3. Let $\{p_{ij}\}_{i,j \in S}$ be the transition probabilities for an irreducible Markov chain on a state space $S$. Then the following are equivalent:
(1) There is $k \in S$ with $f_{kk} = 1$, i.e. with $k$ recurrent.
(2) For all $i, j \in S$, we have $f_{ij} = 1$ (so, in particular, all states are recurrent).
(3) There are $k, \ell \in S$ with $\sum_{n=1}^{\infty} p_{k\ell}^{(n)} = \infty$.
(4) For all $i, j \in S$, we have $\sum_{n=1}^{\infty} p_{ij}^{(n)} = \infty$.
If any of (1)-(4) hold, we say that the Markov chain itself is recurrent.

Proof. That (2) implies (1) is immediate. That (1) implies (3) follows immediately from Theorem 8.2.1.

To show that (3) implies (4): Assume that (3) holds, and let $i, j \in S$. By irreducibility, there are $m, r \in \mathbf{N}$ with $p_{ik}^{(m)} > 0$ and $p_{\ell j}^{(r)} > 0$. But then, since $p_{ij}^{(m+n+r)} \ge p_{ik}^{(m)} p_{k\ell}^{(n)} p_{\ell j}^{(r)}$, we have $\sum_n p_{ij}^{(n)} \ge p_{ik}^{(m)} p_{\ell j}^{(r)} \sum_n p_{k\ell}^{(n)} = \infty$, as claimed.

To show that (4) implies (2): Suppose, to the contrary of (2), that $f_{ij} < 1$ for some $i, j \in S$. Then $1 - f_{jj} \ge \mathbf{P}_j(\tau_i < \tau_j)(1 - f_{ij}) > 0$. (Here $\mathbf{P}_j(\tau_i < \tau_j)$ stands for the probability, starting from $j$, that we hit $i$ before returning to $j$; it is positive by irreducibility.) Hence, $f_{jj} < 1$, so by Theorem 8.2.1 we have $\sum_n p_{jj}^{(n)} < \infty$, contradicting (4). ■

For example, for simple symmetric random walk, since $\sum_n p_{ii}^{(n)} = \infty$, this theorem says that from any state $i$, the walk will (with probability 1) eventually reach any other state $j$. Thus, the walk will keep on wandering from state to state forever; this is related to "fluctuation theory" for random walks. In particular, this implies that for simple symmetric random walk, with probability 1 we have $\limsup_n X_n = \infty$ and $\liminf_n X_n = -\infty$.
Such considerations are related to the remarkable law of the iterated logarithm. Let $X_n = Z_1 + \ldots + Z_n$ define a random walk, where $Z_1, Z_2, \ldots$ are i.i.d. with mean 0 and variance 1. (For example, perhaps $\{X_n\}$ is simple symmetric random walk.) Then it is a fact that with probability 1, $\limsup_n X_n / \sqrt{2n\log\log n} = 1$. In other words, for any $\epsilon > 0$, we have $\mathbf{P}(X_n > (1+\epsilon)\sqrt{2n\log\log n}\ \text{i.o.}) = 0$ and $\mathbf{P}(X_n > (1-\epsilon)\sqrt{2n\log\log n}\ \text{i.o.}) = 1$. This gives extremely precise information about how the peaks of $\{X_n\}$ grow for large $n$. For a proof see e.g. Billingsley (1995, Theorem 9.5).

8.3. Stationary distributions and convergence.

Given a Markov chain on a state space $S$, with transition probabilities $\{p_{ij}\}$, let $\{\pi_i\}_{i \in S}$ be a distribution on $S$ (i.e., $\pi_i \ge 0$ for all $i \in S$, and $\sum_{i \in S} \pi_i = 1$). Then $\{\pi_i\}$ is stationary for the Markov chain if $\sum_{i \in S} \pi_i p_{ij} = \pi_j$ for all $j \in S$. (In matrix form, $[\pi][p] = [\pi]$.) What this means is that if we start the Markov chain in the distribution $\{\pi_i\}$ (i.e., $\mathbf{P}[X_0 = i] = \pi_i$ for all $i \in S$), then one time unit later the distribution will still be $\{\pi_i\}$ (i.e., $\mathbf{P}[X_1 = i] = \pi_i$ for all $i \in S$); this is the reason for the terminology "stationary". It follows immediately by induction that the chain will still be in the same distribution $\{\pi_i\}$ any number $n$ of steps later (i.e., $\mathbf{P}[X_n = i] = \pi_i$ for all $i \in S$). Equivalently, $\sum_{i \in S} \pi_i p_{ij}^{(n)} = \pi_j$ for any $n \in \mathbf{N}$. (In matrix form this is even clearer: $[\pi][p]^n = [\pi]$.)

Exercise 8.3.1. A Markov chain is said to be reversible with respect to a distribution $\{\pi_i\}$ if, for all $i, j \in S$, we have $\pi_i p_{ij} = \pi_j p_{ji}$. Prove that, if a Markov chain is reversible with respect to $\{\pi_i\}$, then $\{\pi_i\}$ is a stationary distribution for the chain.

For example, for random walk on $\mathbf{Z}/(d)$ as in Example 8.0.3, it is computed that the uniform distribution, given by $\pi_i = 1/d$ for $i = 0, 1, \ldots, d-1$, is a stationary distribution. For Ehrenfest's Urn (Example 8.0.4), it is computed that the binomial distribution, given by $\pi_i = \binom{d}{i}\frac{1}{2^d}$ for $i = 0, 1, \ldots, d$, is a stationary distribution.

Exercise 8.3.2. Verify these last two statements.
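As a numerical companion to Exercise 8.3.2, here is a brief Python sketch (ours, not from the text; it simply checks the defining equation $\pi P = \pi$ for the two chains, taking $d = 6$ for concreteness):

    import numpy as np
    from math import comb

    d = 6

    # Random walk on Z/(d): move +1 or -1 (mod d), each with probability 1/2.
    P_cycle = np.zeros((d, d))
    for i in range(d):
        P_cycle[i, (i + 1) % d] = 0.5
        P_cycle[i, (i - 1) % d] = 0.5
    pi_uniform = np.full(d, 1.0 / d)

    # Ehrenfest's urn on {0,...,d}: a uniformly chosen ball switches urns,
    # so p_{i,i-1} = i/d and p_{i,i+1} = (d-i)/d.
    P_urn = np.zeros((d + 1, d + 1))
    for i in range(d + 1):
        if i > 0:
            P_urn[i, i - 1] = i / d
        if i < d:
            P_urn[i, i + 1] = (d - i) / d
    pi_binom = np.array([comb(d, i) / 2 ** d for i in range(d + 1)])

    # Stationarity means pi P = pi: one step preserves the distribution.
    print(np.allclose(pi_uniform @ P_cycle, pi_uniform))  # True
    print(np.allclose(pi_binom @ P_urn, pi_binom))        # True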
Now, given a Markov chain with a stationary distribution, one might expect that if the chain is run for a long time (i.e. $n \to \infty$), then the probability of being at a particular state $j \in S$ might converge to $\pi_j$, regardless of the initial state chosen. That is, one might expect that $\lim_{n\to\infty} \mathbf{P}_i(X_n = j) = \pi_j$ for any $i, j \in S$. This is not true in complete generality, as the following two examples show. However, we shall see in Theorem 8.3.10 below that this is indeed true for many Markov chains.

For a first example, suppose $S = \{1,2\}$, and that the transition probabilities are given by
$$(p_{ij}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$
That is, this Markov chain never moves at all! Hence, any distribution is stationary for this chain. In particular, we could take $\pi_1 = \pi_2 = \frac12$ as a stationary distribution. On the other hand, we clearly have $\mathbf{P}_1(X_n = 1) = 1$ for all $n \in \mathbf{N}$, which certainly does not approach $\frac12$. Thus, this Markov chain does not converge to the stationary distribution $\{\pi_i\}$. In fact, this Markov chain is clearly reducible (i.e., not irreducible), which is the obstacle to convergence.

For a second example, suppose again that $S = \{1,2\}$, and that the transition probabilities are given this time by
$$(p_{ij}) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
Again we may take $\pi_1 = \pi_2 = \frac12$ as a stationary distribution (in fact, this time the stationary distribution is unique). Furthermore, this Markov chain is irreducible. On the other hand, we have $\mathbf{P}_1(X_n = 1) = 1$ for $n$ even, and $\mathbf{P}_1(X_n = 1) = 0$ for $n$ odd. Hence, again we do not have $\mathbf{P}_1(X_n = 1) \to \frac12$. (On the other hand, the Cesàro averages of $\mathbf{P}_1(X_n = 1)$, i.e. $\frac1n \sum_{\ell=1}^n \mathbf{P}_1(X_\ell = 1)$, do indeed converge to $\frac12$, which is not a coincidence.) Here the obstacle to convergence is that the Markov chain is "periodic", with period 2, as we now discuss.

Definition 8.3.3. Given Markov chain transitions $\{p_{ij}\}$ on a state space $S$, and a state $i \in S$, the period of $i$ is the greatest common divisor of the set $\{n \ge 1;\ p_{ii}^{(n)} > 0\}$.

That is, the period of $i$ is the g.c.d. of the times at which it is possible to travel from $i$ to $i$. For example, if the period of $i$ is 2, then it is only possible to travel from $i$ to $i$ in an even number of steps. (Such was the case for the second example above.) Clearly, if the period of a state is greater than 1, then this will be an obstacle to convergence to a stationary distribution. This prompts the following definition.

Definition 8.3.4. A Markov chain is aperiodic if the period of each state is 1. (A chain which is not aperiodic is said to be periodic.)

Before proceeding, we note a fact about periods that makes aperiodicity easier to verify.
Lemma 8.3.5. Let $i$ and $j$ be two states of a Markov chain, and suppose that $f_{ij} > 0$ and $f_{ji} > 0$ (i.e., $i$ and $j$ "communicate"). Then the periods of $i$ and of $j$ are equal.

Proof. Since $f_{ij} > 0$ and $f_{ji} > 0$, there are $r, s \in \mathbf{N}$ with $p_{ij}^{(r)} > 0$ and $p_{ji}^{(s)} > 0$. Since $p_{ii}^{(r+n+s)} \ge p_{ij}^{(r)} p_{jj}^{(n)} p_{ji}^{(s)}$, this implies that
$$p_{ii}^{(r+n+s)} > 0 \quad \text{whenever} \quad p_{jj}^{(n)} > 0. \qquad (8.3.6)$$
Now, if we let the periods of $i$ and $j$ be $t_i$ and $t_j$, respectively, then (8.3.6) with $n = 0$ implies that $t_i$ divides $r + s$. Then, for any $n \in \mathbf{N}$ with $p_{jj}^{(n)} > 0$, (8.3.6) implies that $t_i$ divides $r + n + s$, hence that $t_i$ divides $n$. That is, $t_i$ is a common divisor of $\{n \in \mathbf{N};\ p_{jj}^{(n)} > 0\}$. Since $t_j$ is the greatest common divisor of this set, we must have $t_i \le t_j$. Similarly, $t_j \le t_i$, so we must have $t_i = t_j$. ■

This immediately implies

Corollary 8.3.7. If a Markov chain is irreducible, then all of its states have the same period.

Hence, for an irreducible Markov chain, it suffices to check aperiodicity at any single state.

We shall prove in Theorem 8.3.10 below that all Markov chains which are irreducible and aperiodic, and have a stationary distribution, do in fact converge to it. Before doing so, we require two further lemmas, which give more concrete implications of irreducibility and aperiodicity.

Lemma 8.3.8. If a Markov chain is irreducible, and has a stationary distribution $\{\pi_i\}$, then it is recurrent.

Proof. Suppose to the contrary that the chain is not recurrent. Then, by Theorem 8.2.3, we have $\sum_n p_{ij}^{(n)} < \infty$ for all states $i$ and $j$; in particular, $\lim_{n\to\infty} p_{ij}^{(n)} = 0$. Now, since $\{\pi_i\}$ is stationary, we have $\pi_j = \sum_i \pi_i p_{ij}^{(n)}$ for all $n \in \mathbf{N}$. Hence, letting $n \to \infty$, we see (formally, by using the M-test, cf. Proposition A.4.8) that we must have $\pi_j = 0$ for all states $j$. This contradicts the fact that $\sum_j \pi_j = 1$. ■

Remark. The converse to Lemma 8.3.8 is false, in the sense that just because an irreducible Markov chain is recurrent, it might not have a stationary distribution (e.g. this is the case for simple symmetric random walk).
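For a finite chain, periods are easy to compute directly from Definition 8.3.3: take the g.c.d. of the powers $n$ for which $(P^n)_{ii} > 0$. A small Python sketch (ours; it truncates at a modest power, which suffices for small chains, and compares exact zeros):

    import numpy as np
    from math import gcd

    def period(P, i, n_max=50):
        """The gcd of those n <= n_max with (P^n)_{ii} > 0; for a small
        irreducible chain this already pins down the period of state i."""
        g = 0
        Pn = np.eye(len(P))
        for n in range(1, n_max + 1):
            Pn = Pn @ P
            if Pn[i, i] > 0:
                g = gcd(g, n)
        return g

    flip = np.array([[0.0, 1.0], [1.0, 0.0]])   # second example above
    print(period(flip, 0))                      # 2: periodic
    lazy = np.array([[0.5, 0.5], [1.0, 0.0]])
    print(period(lazy, 0))                      # 1: aperiodic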
Lemma 8.3.9. If a Markov chain is irreducible and aperiodic, then for each pair $(i,j)$ of states, there is a number $n_0 = n_0(i,j)$, such that $p_{ij}^{(n)} > 0$ for all $n \ge n_0$.

Proof. Fix states $i$ and $j$. Let $T = \{n \ge 1;\ p_{ii}^{(n)} > 0\}$. Then, by aperiodicity, $\gcd(T) = 1$. Hence, we can find $m \in \mathbf{N}$, and $k_1, \ldots, k_m \in T$ and integers $b_1, \ldots, b_m$, such that $k_1 b_1 + \ldots + k_m b_m = 1$. We furthermore choose any $a \in T$, and also (by irreducibility) choose $c \in \mathbf{N}$ with $p_{ij}^{(c)} > 0$. These values shall be the key to defining $n_0$.

We now set $M = k_1|b_1| + \ldots + k_m|b_m|$ (i.e., a sum that without the absolute value signs would be 1), and define $n_0 = n_0(i,j) = aM + c$. Then, if $n \ge n_0$, then letting $r = \lfloor (n-c)/a \rfloor$, we can write $n = c + ra + s$ where $0 \le s < a$ and $r \ge M$. We then observe that, since $\sum_{\ell=1}^m b_\ell k_\ell = 1$ and $\sum_{\ell=1}^m |b_\ell| k_\ell = M$, we have
$$n = (r - M)\,a + \sum_{\ell=1}^m \left( a|b_\ell| + s\,b_\ell \right) k_\ell + c,$$
where the quantities in brackets are non-negative. Hence, recalling that $a, k_\ell \in T$, and that $p_{ij}^{(c)} > 0$, we have that
$$p_{ij}^{(n)} \ \ge\ \left(p_{ii}^{(a)}\right)^{r-M} \left[\prod_{\ell=1}^m \left(p_{ii}^{(k_\ell)}\right)^{a|b_\ell| + s b_\ell}\right] p_{ij}^{(c)} \ >\ 0,$$
as required. ■

We are now able to prove our main Markov chain convergence theorem.

Theorem 8.3.10. If a Markov chain is irreducible and aperiodic, and has a stationary distribution $\{\pi_i\}$, then for all states $i$ and $j$, we have $\lim_{n\to\infty} \mathbf{P}_i(X_n = j) = \pi_j$.

Proof. The proof uses the method of "coupling". Let the original Markov chain have state space $S$, and transition probabilities $\{p_{ij}\}$. We define a new Markov chain $\{(X_n, Y_n)\}_{n=0}^{\infty}$, having state space $\bar S = S \times S$, and transition probabilities given by $\bar p_{(i,j),(k,\ell)} = p_{ik}\, p_{j\ell}$. That is, our new Markov chain has two coordinates, each of which is an independent copy of the original Markov chain. It follows immediately that the distribution on $\bar S$ given by $\bar\pi_{(i,j)} = \pi_i \pi_j$ (i.e., a product of two probabilities in the original stationary distribution) is in fact a stationary distribution for our new Markov chain.
Furthermore, from Lemma 8.3.9 above, we see that our new Markov chain will again be irreducible and aperiodic (indeed, we have $\bar p^{(n)}_{(i,j),(k,\ell)} > 0$ whenever $n \ge \max(n_0(i,k), n_0(j,\ell))$). Hence, from Lemma 8.3.8, we see that our new Markov chain is in fact recurrent. This is the key to what follows.

To complete the proof, we choose $i_0 \in S$, and set $\tau = \inf\{n \ge 1;\ X_n = Y_n = i_0\}$. Note that, for $m \le n$, the quantities $\mathbf{P}_{(i,j)}(\tau = m, X_n = k)$ and $\mathbf{P}_{(i,j)}(\tau = m, Y_n = k)$ are equal; indeed, they both equal $\mathbf{P}_{(i,j)}(\tau = m)\, p_{i_0 k}^{(n-m)}$. Hence, for any states $i$, $j$, and $k$, we have that
$$\begin{aligned} \left| p_{ik}^{(n)} - p_{jk}^{(n)} \right| &= \left| \mathbf{P}_{(i,j)}(X_n = k) - \mathbf{P}_{(i,j)}(Y_n = k) \right| \\ &= \Big| \sum_{m=1}^n \mathbf{P}_{(i,j)}(X_n = k, \tau = m) + \mathbf{P}_{(i,j)}(X_n = k, \tau > n) \\ &\qquad - \sum_{m=1}^n \mathbf{P}_{(i,j)}(Y_n = k, \tau = m) - \mathbf{P}_{(i,j)}(Y_n = k, \tau > n) \Big| \\ &= \left| \mathbf{P}_{(i,j)}(X_n = k, \tau > n) - \mathbf{P}_{(i,j)}(Y_n = k, \tau > n) \right| \ \le\ \mathbf{P}_{(i,j)}(\tau > n) \end{aligned}$$
(where the inequality follows since we are considering a difference of two non-negative quantities, each of which is $\le \mathbf{P}_{(i,j)}(\tau > n)$).

Now, since the new Markov chain is irreducible and recurrent, it follows from Theorem 8.2.3 that $\bar f_{(i,j),(i_0,i_0)} = 1$. That is, with probability 1, the chain will eventually hit the state $(i_0, i_0)$, in which case $\tau < \infty$. Hence, as $n \to \infty$, we have $\mathbf{P}_{(i,j)}(\tau > n) \to 0$, so that $\left|p_{ik}^{(n)} - p_{jk}^{(n)}\right| \to 0$.

On the other hand, by stationarity, we have for the original Markov chain that
$$\left| p_{ij}^{(n)} - \pi_j \right| = \left| \sum_{k \in S} \pi_k \left( p_{ij}^{(n)} - p_{kj}^{(n)} \right) \right| \le \sum_{k \in S} \pi_k \left| p_{ij}^{(n)} - p_{kj}^{(n)} \right|,$$
and we see (again, by the M-test) that this converges to 0 as $n \to \infty$. Since $p_{ij}^{(n)} = \mathbf{P}_i(X_n = j)$, this proves the theorem. ■
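The conclusion of Theorem 8.3.10 can be watched numerically: the rows of $P^n$, which are the laws of $X_n$ from the various starting states, all flatten onto $\{\pi_j\}$. A Python sketch (ours, using a small irreducible aperiodic chain of our own choosing):

    import numpy as np

    P = np.array([[0.0, 0.5, 0.5],
                  [0.3, 0.2, 0.5],
                  [0.4, 0.4, 0.2]])

    # Solve pi P = pi, sum(pi) = 1, via the left eigenvector for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi /= pi.sum()

    # Row i of P^n is the distribution of X_n started from state i;
    # every row approaches pi, as Theorem 8.3.10 asserts.
    Pn = np.linalg.matrix_power(P, 50)
    print(pi)
    print(Pn)  # all three rows agree with pi to many decimal places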
Finally, we note that Theorem 8.3.10 remains true if the chain begins in some other initial distribution $\{\nu_i\}$, besides those with $\nu_i = 1$ for some $i$:

Corollary 8.3.11. If a Markov chain is irreducible and aperiodic, and has a stationary distribution $\{\pi_i\}$, then regardless of the initial distribution, for all $j \in S$, we have $\lim_{n\to\infty} \mathbf{P}(X_n = j) = \pi_j$.

Proof. Using the M-test and Theorem 8.3.10, we have
$$\lim_{n\to\infty} \mathbf{P}(X_n = j) = \lim_{n\to\infty} \sum_{i \in S} \mathbf{P}(X_0 = i, X_n = j) = \lim_{n\to\infty} \sum_{i \in S} \nu_i\, \mathbf{P}_i(X_n = j) = \sum_{i \in S} \nu_i \lim_{n\to\infty} \mathbf{P}_i(X_n = j) = \pi_j,$$
which gives the result. ■

8.4. Existence of stationary distributions.

Theorem 8.3.10 gives powerful information about the convergence of irreducible and aperiodic Markov chains. However, this theorem requires the existence of a stationary distribution $\{\pi_i\}$. It is reasonable to ask for conditions under which a stationary distribution will exist. We consider that question here.

Given a Markov chain $\{X_n\}$ on a state space $S$, and given a state $i \in S$, we define the mean return time $m_i$ by $m_i = \mathbf{E}_i(\inf\{n \ge 1;\ X_n = i\})$. That is, $m_i$ is the expected time to return to the state $i$. We always have $m_i \ge 1$. If $i$ is transient, then with positive probability we will have $\inf\{n \ge 1;\ X_n = i\} = \infty$, so of course we will have $m_i = \infty$. On the other hand, if $i$ is recurrent, then we shall call $i$ null recurrent if $m_i = \infty$, and shall call $i$ positive recurrent if $m_i < \infty$. (The names come from considering $1/m_i$ rather than $m_i$.)

The main theorem of this subsection is

Theorem 8.4.1. If a Markov chain is irreducible, and if each state $i$ of the Markov chain is positive recurrent with (finite) mean return time $m_i$, then the Markov chain has a unique stationary distribution $\{\pi_i\}$, given by $\pi_i = 1/m_i$ for each state $i$.

This theorem is rather surprising. It is not immediately clear that the mean return times $m_i$ have anything to do with a stationary distribution; it is even less expected that they provide a precise formula for the stationary distribution values. The proof of the theorem relies heavily on the following lemma.

Lemma 8.4.2. Let $G_n(i,j) = \mathbf{E}_i(\#\{\ell;\ 1 \le \ell \le n,\ X_\ell = j\}) = \sum_{\ell=1}^n p_{ij}^{(\ell)}$. Then for an irreducible recurrent Markov chain,
$$\lim_{n\to\infty} \frac{G_n(i,j)}{n} = \frac{1}{m_j}$$
for any states $i$ and $j$.
Proof. Let $T_j^r$ be the time of the $r$th hit of the state $j$. Then $T_j^r = T_j^1 + (T_j^2 - T_j^1) + \ldots + (T_j^r - T_j^{r-1})$; here the $r-1$ terms in brackets are i.i.d. and have mean $m_j$. Hence, from the Strong Law of Large Numbers (second version, Theorem 5.4.4), we have that $\lim_{r\to\infty} T_j^r / r = m_j$ with probability 1.

Now, for $n \in \mathbf{N}$, let $r(n) = \#\{\ell;\ 1 \le \ell \le n,\ X_\ell = j\}$. Then $\lim_{n\to\infty} r(n) = \infty$ with probability 1 by recurrence. Also, clearly $T_j^{r(n)} \le n < T_j^{r(n)+1}$, so that
$$\frac{T_j^{r(n)}}{r(n)} \ \le\ \frac{n}{r(n)} \ <\ \frac{T_j^{r(n)+1}}{r(n)}.$$
Hence, we must have $\lim_{n\to\infty} n/r(n) = m_j$ with probability 1 as well. Therefore, with probability 1 we have $\lim_{n\to\infty} r(n)/n = 1/m_j$. On the other hand, $G_n(i,j) = \mathbf{E}_i(r(n))$, and furthermore $0 \le \frac{r(n)}{n} \le 1$. Hence, by the bounded convergence theorem,
$$\lim_{n\to\infty} \frac{G_n(i,j)}{n} = \lim_{n\to\infty} \mathbf{E}_i\!\left(\frac{r(n)}{n}\right) = 1/m_j,$$
as claimed. ■

Remark 8.4.3. We note that this proof goes through without change in the case $m_j = \infty$, so that $\lim_{n\to\infty} \frac{G_n(i,j)}{n} = 0$ in that case.

Proof of Theorem 8.4.1. We begin by proving the uniqueness. Indeed, suppose that $\sum_i \alpha_i p_{ij} = \alpha_j$ for all states $j$, for some probability distribution $\{\alpha_i\}$. Then, by induction, $\sum_i \alpha_i p_{ij}^{(t)} = \alpha_j$ for all $t \in \mathbf{N}$, so that also $\frac1n \sum_{t=1}^n \sum_i \alpha_i p_{ij}^{(t)} = \alpha_j$. Hence,
$$\alpha_j = \lim_{n\to\infty} \frac1n \sum_{t=1}^n \sum_i \alpha_i p_{ij}^{(t)} = \sum_i \alpha_i \lim_{n\to\infty} \frac1n \sum_{t=1}^n p_{ij}^{(t)} = \sum_i \alpha_i \cdot \frac{1}{m_j} = 1/m_j,$$
where the second equality uses the M-test and the third uses the Lemma. That is, $\alpha_j = 1/m_j$ is the only possible stationary distribution.
Now, fix any state $z$. Then for all $t \in \mathbf{N}$, we have $\sum_j p_{zj}^{(t)} = 1$. Hence, if we set $C = \sum_j 1/m_j$, then using the Lemma and (A.4.9), we have
$$1 = \lim_{n\to\infty} \frac1n \sum_{t=1}^n \sum_j p_{zj}^{(t)} \ \ge\ \sum_j \lim_{n\to\infty} \frac1n \sum_{t=1}^n p_{zj}^{(t)} = \sum_j \frac{1}{m_j} = C.$$
In particular, $C < \infty$.

Again fix any state $z$. Then for all states $j$, and for all $t \in \mathbf{N}$, we have $\sum_i p_{zi}^{(t)} p_{ij} = p_{zj}^{(t+1)}$. Hence, again by the Lemma and (A.4.9), we have
$$\frac{1}{m_j} = \lim_{n\to\infty} \frac1n \sum_{t=1}^n p_{zj}^{(t+1)} = \lim_{n\to\infty} \frac1n \sum_{t=1}^n \sum_i p_{zi}^{(t)}\, p_{ij} \ \ge\ \sum_i \lim_{n\to\infty} \frac1n \sum_{t=1}^n p_{zi}^{(t)}\, p_{ij} = \sum_i \frac{1}{m_i}\, p_{ij}. \qquad (8.4.4)$$
But then summing both sides over all states $j$, and recalling that $\sum_j \frac{1}{m_j} = C < \infty$, we see that we must have equality in (8.4.4), i.e. we must have $\frac{1}{m_j} = \sum_i \frac{1}{m_i} p_{ij}$. But this means that the probability distribution defined by $\pi_i = 1/(C m_i)$ must be a stationary distribution for the Markov chain! Finally, by uniqueness, we must have $\pi_j = 1/m_j$ for each state $j$, i.e. we must have $C = 1$. This completes the proof of the theorem. ■

On the other hand, states which are not positive recurrent cannot contribute to a stationary distribution:

Proposition 8.4.5. If a Markov chain has a stationary distribution $\{\pi_i\}$, and if a state $j$ is not positive recurrent (i.e., satisfies $m_j = \infty$), then we must have $\pi_j = 0$.

Proof. We have that $\sum_i \pi_i p_{ij}^{(t)} = \pi_j$ for all $t \in \mathbf{N}$. Hence, using the M-test and also Lemma 8.4.2 and Remark 8.4.3, we have
$$\pi_j = \lim_{n\to\infty} \frac1n \sum_{t=1}^n \sum_i \pi_i p_{ij}^{(t)} = \sum_i \pi_i \lim_{n\to\infty} \frac1n \sum_{t=1}^n p_{ij}^{(t)} = \sum_i \pi_i \cdot 0 = 0,$$
as claimed. ■
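The identity $\pi_i = 1/m_i$ of Theorem 8.4.1 can also be checked by simulation, estimating each mean return time by averaging excursion lengths. A sketch (ours; the chain is the same toy example used in the earlier sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    P = np.array([[0.0, 0.5, 0.5],
                  [0.3, 0.2, 0.5],
                  [0.4, 0.4, 0.2]])

    def mean_return_time(P, i, trips):
        """Average, over many excursions, the time taken to return to i."""
        total, state = 0, i
        for _ in range(trips):
            steps = 0
            while True:
                state = rng.choice(len(P), p=P[state])
                steps += 1
                if state == i:
                    break
            total += steps
        return total / trips

    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi /= pi.sum()
    for i in range(3):
        print(i, 1 / pi[i], mean_return_time(P, i, trips=20_000))
    # The simulated mean return times m_i match 1/pi_i, as the theorem predicts.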
Corollary 8.4.6. If a Markov chain has no positive recurrent states, then it does not have a stationary distribution.

Proof. Suppose to the contrary that it did have a stationary distribution $\{\pi_i\}$. Then from the above proposition, we would necessarily have $\pi_j = 0$ for each state $j$, contradicting the fact that $\sum_j \pi_j = 1$. ■

Theorem 8.4.1 and Corollary 8.4.6 provide clear information about Markov chains where all states are positive recurrent, or where none are, respectively. One could still wonder about chains which have some positive recurrent and some non-positive-recurrent states. We now show that, for irreducible Markov chains, this cannot happen. The statement is somewhat analogous to that of Lemma 8.3.5.

Proposition 8.4.7. Let $i$ and $j$ be two states of a Markov chain. Suppose that $f_{ij} > 0$ and $f_{ji} > 0$ (i.e., the states $i$ and $j$ communicate). Then if $i$ is positive recurrent, then $j$ is also positive recurrent.

Proof. Find $r, t \in \mathbf{N}$ with $p_{ji}^{(r)} > 0$ and $p_{ij}^{(t)} > 0$. Then by Lemma 8.4.2 and Remark 8.4.3,
$$\frac{1}{m_j} = \lim_{n\to\infty} \frac1n \sum_{m=1}^n p_{jj}^{(m)} \ \ge\ \lim_{n\to\infty} \frac1n \sum_{m=r+t}^n p_{ji}^{(r)}\, p_{ii}^{(m-r-t)}\, p_{ij}^{(t)} \ =\ \frac{p_{ji}^{(r)}\, p_{ij}^{(t)}}{m_i} \ >\ 0,$$
so that $m_j < \infty$. ■

Corollary 8.4.8. For an irreducible Markov chain, either all states are positive recurrent or none are.

Combining this corollary with Theorem 8.4.1 and Corollary 8.4.6, we see that

Theorem 8.4.9. Consider an irreducible Markov chain. Then either
(a) all states are positive recurrent, and there exists a unique stationary distribution, given by $\pi_j = 1/m_j$, to which (assuming aperiodicity) $\mathbf{P}_i(X_n = j)$ converges as $n \to \infty$; or
(b) no states are positive recurrent, and there does not exist a stationary distribution.

For example, consider simple symmetric random walk, with state space the integers $\mathbf{Z}$. It is clear that $m_j$ must be the same for all states $j$ (i.e. no state is any "different" from any other state). Hence, it is impossible that $\sum_{j \in \mathbf{Z}} 1/m_j = 1$. Thus, simple symmetric random walk cannot fall into category (a) above. Hence, simple symmetric random walk falls into category (b), and does not have a stationary distribution; in fact, it is null recurrent.

Finally, we observe that all irreducible Markov chains on finite state spaces necessarily fall into category (a) above:
Proposition 8.4.10. For an irreducible Markov chain on a finite state space, all states are positive recurrent (and hence a unique stationary distribution exists).

Proof. Fix a state $i$. Write $h_j^{(m)} = \mathbf{P}_j(X_t = i \text{ for some } 1 \le t \le m) = \sum_{t=1}^m f_{ji}^{(t)}$. Then $\lim_{m\to\infty} h_j^{(m)} = f_{ji} > 0$, for each state $j$. Hence, since the state space is finite, we can find $m \in \mathbf{N}$ and $\delta > 0$ such that $h_j^{(m)} \ge \delta$ for all states $j$. But then we must have $1 - h_i^{(n)} \le (1-\delta)^{\lfloor n/m \rfloor}$, so that letting $\tau_i = \inf\{n \ge 1;\ X_n = i\}$, we have by Proposition 4.2.9 that
$$m_i = \sum_{n=0}^{\infty} \mathbf{P}_i(\tau_i > n) = \sum_{n=0}^{\infty} \left(1 - h_i^{(n)}\right) \le \sum_{n=0}^{\infty} (1-\delta)^{\lfloor n/m \rfloor} = \frac{m}{\delta} < \infty. \quad ■$$

8.5. Exercises.

Exercise 8.5.1. Consider a discrete-time, time-homogeneous Markov chain with state space $S = \{1,2\}$, and transition probabilities given by $p_{11} = a$, $p_{12} = 1-a$, $p_{21} = 1$, $p_{22} = 0$. For each $0 < a < 1$:
(a) Compute $p_{ij}^{(n)} = \mathbf{P}(X_n = j \mid X_0 = i)$ for each $i, j \in S$ and $n \in \mathbf{N}$.
(b) Classify each state as recurrent or transient.
(c) Find all stationary distributions for this Markov chain.

Exercise 8.5.2. For any $\epsilon > 0$, give an example of an irreducible Markov chain on a countably infinite state space, such that $|p_{ij} - p_{ik}| < \epsilon$ for all states $i$, $j$, and $k$.

Exercise 8.5.3. For an arbitrary Markov chain on a state space consisting of exactly $d$ states, find (with proof) the largest possible positive integer $N$ such that for some states $i$ and $j$, we have $p_{ij}^{(N)} > 0$ but $p_{ij}^{(n)} = 0$ for all $n < N$.

Exercise 8.5.4. Given Markov chain transition probabilities $\{p_{ij}\}_{i,j \in S}$ on a state space $S$, call a subset $C \subseteq S$ closed if $\sum_{j \in C} p_{ij} = 1$ for each $i \in C$. Prove that a Markov chain is irreducible if and only if it has no closed subsets (aside from the empty set and $S$ itself).
Exercise 8.5.5. Suppose we modify Example 8.0.3 so that the chain moves one unit clockwise with probability $r$, or one unit counter-clockwise with probability $1-r$, for some $0 < r < 1$. That is, $S = \{0, 1, 2, \ldots, d-1\}$ and
$$p_{ij} = \begin{cases} r, & j = i + 1 \ (\mathrm{mod}\ d) \\ 1 - r, & j = i - 1 \ (\mathrm{mod}\ d) \\ 0, & \text{otherwise.} \end{cases}$$
Find (with explanation) all stationary distributions of this Markov chain.

Exercise 8.5.6. Consider the Markov chain with state space $S = \{1,2,3\}$ and transition probabilities $p_{12} = p_{23} = p_{31} = 1$. Let $\pi_1 = \pi_2 = \pi_3 = 1/3$.
(a) Determine whether or not the chain is irreducible.
(b) Determine whether or not the chain is aperiodic.
(c) Determine whether or not the chain is reversible with respect to $\{\pi_i\}$.
(d) Determine whether or not $\{\pi_i\}$ is a stationary distribution.
(e) Determine whether or not $\lim_{n\to\infty} p_{11}^{(n)} = 1/3$.

Exercise 8.5.7. Give an example of an irreducible Markov chain, and two distinct states $i$ and $j$, such that $f_{ij}^{(n)} > 0$ for all $n \in \mathbf{N}$, and such that $f_{ij}^{(n)}$ is not a decreasing function of $n$ (i.e. for some $n \in \mathbf{N}$, $f_{ij}^{(n)} < f_{ij}^{(n+1)}$).

Exercise 8.5.8. Prove the identity $f_{ij} = p_{ij} + \sum_{k \ne j} p_{ik} f_{kj}$. [Hint: Condition on $X_1$.]

Exercise 8.5.9. For each of the following transition probability matrices, determine which states are recurrent and which are transient, and also compute $f_{11}$ for each:
(a) (b) /1/2 2/3 0 V 1 / 1 1/2 0 0 1/10. , 0 V 0 1/2 0 0 0 0 0 1/5 0 0 0 0 0 0 \ 0 1/3 4/5 1/5 1 0 0 / 0 0 1/2 0 4/5 0 1/3 1/3 0 0 0 0 0 0 0 0 0 1/3 7/10 0 0 0 0 0 0 0 0 1 0 \ 0 0 0 1/5 1 j 0 /
Exercise 8.5.10. Consider a Markov chain (not necessarily irreducible) on a finite state space.
(a) Prove that at least one state must be recurrent.
(b) Give an example where exactly one state is recurrent (and all the rest are transient).
(c) Show by example that if the state space is countably infinite then part (a) is no longer true.

Exercise 8.5.11. For asymmetric one-dimensional simple random walk (i.e. where $\mathbf{P}(X_n = X_{n-1} + 1) = p = 1 - \mathbf{P}(X_n = X_{n-1} - 1)$ for some $p \ne \frac12$), provide an asymptotic upper bound for $\sum_{n=N}^{\infty} p_{ii}^{(n)}$. That is, find an explicit expression $\gamma_N$, with $\lim_{N\to\infty} \gamma_N = 0$, such that $\sum_{n=N}^{\infty} p_{ii}^{(n)} \le \gamma_N$ for all sufficiently large $N$.

Exercise 8.5.12. Let $P = (p_{ij})$ be the matrix of transition probabilities for a Markov chain on a finite state space.
(a) Prove that $P$ always has 1 as an eigenvalue. [Hint: Recall that the eigenvalues of $P$ are the same whether it acts on row vectors to the left or on column vectors to the right.]
(b) Suppose that $v$ is a row eigenvector for $P$ corresponding to the eigenvalue 1, so that $vP = v$. Does $v$ necessarily correspond to a stationary distribution? Why or why not?

Exercise 8.5.13. Call a Markov chain doubly stochastic if its transition matrix $\{p_{ij}\}_{i,j \in S}$ has the property that $\sum_{i \in S} p_{ij} = 1$ for each $j \in S$. Prove that, for a doubly stochastic Markov chain on a finite state space $S$, the uniform distribution (i.e. $\pi_i = 1/|S|$ for each $i \in S$) is a stationary distribution.

Exercise 8.5.14. Give an example of a Markov chain on a finite state space, such that three of the states each have a different period.

Exercise 8.5.15. Consider Ehrenfest's Urn (Example 8.0.4).
(a) Compute $\mathbf{P}_0(X_n = 0)$ for $n$ odd.
(b) Prove that $\mathbf{P}_0(X_n = 0) \not\to 2^{-d}$ as $n \to \infty$.
(c) Why does this not contradict Theorem 8.3.10?
Exercise 8.5.16. Consider the Markov chain with state space $S = \{1,2,3\}$ and transition probabilities given by
$$(p_{ij}) = \begin{pmatrix} 0 & 2/3 & 1/3 \\ 1/4 & 0 & 3/4 \\ 4/5 & 1/5 & 0 \end{pmatrix}.$$
(a) Find an explicit formula for $\mathbf{P}_1(\tau_1 = n)$ for each $n \in \mathbf{N}$, where $\tau_1 = \inf\{n \ge 1;\ X_n = 1\}$.
(b) Compute the mean return time $m_1 = \mathbf{E}(\tau_1)$.
(c) Prove that this Markov chain has a unique stationary distribution, to be called $\{\pi_i\}$.
(d) Compute the stationary probability $\pi_1$.

Exercise 8.5.17. Give an example of a Markov chain for which some states are positive recurrent, some states are null recurrent, and some states are transient.

Exercise 8.5.18. Prove that if $f_{ij} > 0$ and $f_{ji} = 0$, then $i$ is transient.

Exercise 8.5.19. Prove that for a Markov chain on a finite state space, no states are null recurrent. [Hint: The previous exercise provides a starting point.]

Exercise 8.5.20. (a) Give an example of a Markov chain on a finite state space which has multiple (i.e. two or more) stationary distributions.
(b) Give an example of a reducible Markov chain on a finite state space, which nevertheless has a unique stationary distribution.
(c) Suppose that a Markov chain on a finite state space is decomposable, meaning that the state space can be partitioned as $S = S_1 \cup S_2$, with $S_1$ and $S_2$ non-empty, such that $f_{ij} = f_{ji} = 0$ whenever $i \in S_1$ and $j \in S_2$. Prove that the chain has multiple stationary distributions.
(d) Prove that for a Markov chain as in part (b), some states are transient. [Hint: Exercise 8.5.18 may help.]

8.6. Section summary.

This section gave a lengthy exploration of Markov chains (in discrete time and space). It gave several examples of Markov chains, and proved their existence. It defined the important concepts of transience and recurrence, and proved a number of equivalences of them. For an irreducible Markov chain, either all states are transient or all states are recurrent. The section then discussed stationary distributions. It proved that the transition probabilities of an irreducible, aperiodic Markov chain will converge to the stationary distribution, regardless of the starting point. Finally, it related stationary distributions to the mean return times of the chain's states, proving that (for an irreducible chain) a stationary distribution exists if and only if these mean return times are all finite.
9. More probability theorems.

In this section, we discuss a few probability ideas that we have not needed so far, but which will be important for the more advanced material to come.

9.1. Limit theorems.

Suppose $X, X_1, X_2, \ldots$ are random variables defined on some probability triple $(\Omega, \mathcal{F}, \mathbf{P})$. Suppose further that $\lim_{n\to\infty} X_n(\omega) = X(\omega)$ for each fixed $\omega \in \Omega$ (or at least for all $\omega$ outside a set of probability 0), i.e. that $\{X_n\}$ converges to $X$ almost surely. Does it necessarily follow that $\lim_{n\to\infty} \mathbf{E}(X_n) = \mathbf{E}(X)$?

We already know that this is not the case in general. Indeed, it was not the case for the "double 'til you win" gambling strategy of Subsection 7.3 (where if $0 < p < 1/2$, then $\mathbf{E}(X_n) < a$ for all $n$, even though $\mathbf{E}(\lim X_n) = a + 1$). Or, for a simple counter-example, let $\Omega = \mathbf{N}$, with $\mathbf{P}(\omega) = 2^{-\omega}$ for $\omega \in \Omega$, and let $X_n(\omega) = 2^n \delta_{\omega,n}$ (i.e., $X_n(\omega) = 2^n$ if $\omega = n$, and equals 0 otherwise). Then $\{X_n\}$ converges to 0 with probability 1, but $\mathbf{E}(X_n) = 1 \not\to 0$ as $n \to \infty$.

On the other hand, we already have two results giving conditions under which it is true that $\mathbf{E}(X_n) \to \mathbf{E}(X)$, namely the Monotone Convergence Theorem (Theorem 4.2.2) and the Bounded Convergence Theorem (Theorem 7.3.1). Such limit theorems are sometimes very helpful. In this section, we shall establish two more similar limit theorems, namely the Dominated Convergence Theorem and the Uniformly Integrable Convergence Theorem.

A first key result is

Theorem 9.1.1. (Fatou's Lemma) If $X_n \ge C$ for all $n$, and some constant $C > -\infty$, then
$$\mathbf{E}\left(\liminf_{n\to\infty} X_n\right) \le \liminf_{n\to\infty} \mathbf{E}(X_n).$$
(We allow the possibility that both sides are infinite.)

Proof. Set $Y_n = \inf_{k \ge n} X_k$, and let $Y = \lim_{n\to\infty} Y_n = \liminf_{n\to\infty} X_n$. Then $Y_n \ge C$ and $\{Y_n\} \nearrow Y$, and furthermore $Y_n \le X_n$. Hence, by the order-preserving property and the monotone convergence theorem, we have $\liminf_n \mathbf{E}(X_n) \ge \liminf_n \mathbf{E}(Y_n) = \mathbf{E}(Y)$, as claimed. ■
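Returning to the simple counter-example above, a few lines of Python (ours; the expectation series is truncated, which is harmless here) make the failure of $\mathbf{E}(\lim X_n) = \lim \mathbf{E}(X_n)$ concrete:

    # Under P(omega) = 2^(-omega) on Omega = N, the variable X_n equals 2^n
    # only at omega = n, so E(X_n) = 2^n * 2^(-n) = 1 for every n,
    # even though X_n -> 0 pointwise.
    def expectation(X, n_terms=60):
        """E(X) under P(omega) = 2^(-omega), truncating the series (the
        tail is negligible for the variables used here)."""
        return sum(X(omega) * 2.0 ** (-omega) for omega in range(1, n_terms))

    for n in (1, 5, 20):
        X_n = lambda omega, n=n: 2.0 ** n if omega == n else 0.0
        print(n, expectation(X_n))          # prints 1.0 each time
    print(expectation(lambda omega: 0.0))   # E(lim X_n) = 0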
Remarks.
1. In this theorem, "$\liminf_{n\to\infty} X_n$" is of course interpreted pointwise; that is, its value at $\omega$ is $\liminf_{n\to\infty} X_n(\omega)$.
2. For the "simple counter-example" mentioned above, it is easily verified that $\mathbf{E}(\liminf_{n\to\infty} X_n) = 0$ while $\liminf_{n\to\infty} \mathbf{E}(X_n) = 1$, so the theorem gives $0 \le 1$, which is true. If we replace $X_n$ by $-X_n$ in that example, we instead obtain $0 \le -1$, which is false; however, in that case the $\{X_n\}$ are not bounded below.
3. For the "double 'til you win" gambling strategy of Subsection 7.3, the fortunes $\{X_n\}$ are not bounded below. However, they are bounded above (by $a+1$), so their negatives are bounded below. When the theorem is applied to their negatives, it gives that $-(a+1) \le \liminf \mathbf{E}(-X_n)$, so that $\limsup \mathbf{E}(X_n) \le a+1$, which is certainly true since $X_n \le a+1$ for all $n$. (In fact, if $p \le \frac12$, then $\limsup \mathbf{E}(X_n) \le a$.)

It is now straightforward to prove

Theorem 9.1.2. (The Dominated Convergence Theorem) If $X, X_1, X_2, \ldots$ are random variables, and if $\{X_n\} \to X$ with probability 1, and if there is a random variable $Y$ with $|X_n| \le Y$ for all $n$ and with $\mathbf{E}(Y) < \infty$, then $\lim_{n\to\infty} \mathbf{E}(X_n) = \mathbf{E}(X)$.

Proof. We note that $Y + X_n \ge 0$. Hence, applying Fatou's Lemma to $\{Y + X_n\}$, we obtain that
$$\mathbf{E}(Y) + \mathbf{E}(X) = \mathbf{E}(Y + X) \le \liminf_n \mathbf{E}(Y + X_n) = \mathbf{E}(Y) + \liminf_n \mathbf{E}(X_n).$$
Hence, cancelling the $\mathbf{E}(Y)$ terms (which is where we use the fact that $\mathbf{E}(Y) < \infty$), we see that $\mathbf{E}(X) \le \liminf_n \mathbf{E}(X_n)$. Similarly, $Y - X_n \ge 0$, and applying Fatou's Lemma to $\{Y - X_n\}$, we obtain that $\mathbf{E}(Y) - \mathbf{E}(X) \le \mathbf{E}(Y) + \liminf_n \mathbf{E}(-X_n) = \mathbf{E}(Y) - \limsup_n \mathbf{E}(X_n)$, so that $\mathbf{E}(X) \ge \limsup_n \mathbf{E}(X_n)$. But we always have $\limsup_n \mathbf{E}(X_n) \ge \liminf_n \mathbf{E}(X_n)$. Hence, we must have that $\limsup_n \mathbf{E}(X_n) = \liminf_n \mathbf{E}(X_n) = \mathbf{E}(X)$, as claimed. ■

Remark. Of course, if the random variable $Y$ is constant, then the dominated convergence theorem reduces to the bounded convergence theorem.

For our second new limit theorem, we need a definition. Note that for any random variable $X$ with $\mathbf{E}(|X|) < \infty$ (i.e., with $X$ "integrable"), we have by the dominated convergence theorem that $\lim_{a\to\infty} \mathbf{E}(|X| \mathbf{1}_{|X| > a}) = \mathbf{E}(0) = 0$. Taking a supremum over $n$ makes a collection of random variables uniformly integrable:
Definition 9.1.3. A collection $\{X_n\}$ of random variables is uniformly integrable if
$$\lim_{a\to\infty} \sup_n \mathbf{E}\left(|X_n| \mathbf{1}_{|X_n| > a}\right) = 0. \qquad (9.1.4)$$

Uniform integrability immediately implies boundedness of certain expectations:

Proposition 9.1.5. If $\{X_n\}$ is uniformly integrable, then $\sup_n \mathbf{E}(|X_n|) < \infty$. Furthermore, if also $\{X_n\} \to X$ a.s., then $\mathbf{E}|X| < \infty$.

Proof. Choose $\alpha_1$ so that (say) $\sup_n \mathbf{E}(|X_n| \mathbf{1}_{|X_n| > \alpha_1}) < 1$. Then
$$\sup_n \mathbf{E}(|X_n|) = \sup_n \mathbf{E}\left(|X_n| \mathbf{1}_{|X_n| \le \alpha_1} + |X_n| \mathbf{1}_{|X_n| > \alpha_1}\right) \le \alpha_1 + 1 < \infty.$$
It then follows from Fatou's Lemma that, if $\{X_n\} \to X$ a.s., then $\mathbf{E}(|X|) \le \liminf_n \mathbf{E}(|X_n|) \le \sup_n \mathbf{E}(|X_n|) < \infty$. ■

The main use of uniform integrability is given by:

Theorem 9.1.6. (The Uniform Integrability Convergence Theorem) If $X, X_1, X_2, \ldots$ are random variables, and if $\{X_n\} \to X$ with probability 1, and if $\{X_n\}$ are uniformly integrable, then $\lim_{n\to\infty} \mathbf{E}(X_n) = \mathbf{E}(X)$.

Proof. Let $Y_n = |X_n - X|$, so that $Y_n \to 0$. We shall show that $\mathbf{E}(Y_n) \to 0$; it then follows from the triangle inequality that $|\mathbf{E}(X_n) - \mathbf{E}(X)| \le \mathbf{E}(Y_n) \to 0$, thus proving the theorem. We will consider $\mathbf{E}(Y_n)$ in two pieces, using that $Y_n = Y_n \mathbf{1}_{Y_n \le a} + Y_n \mathbf{1}_{Y_n > a}$.

Let $Y_n^{(a)} = Y_n \mathbf{1}_{Y_n \le a}$. Then for any fixed $a > 0$, we have $|Y_n^{(a)}| \le a$, and also $Y_n^{(a)} \to 0$ as $n \to \infty$, so by the bounded convergence theorem we have
$$\lim_{n\to\infty} \mathbf{E}\left(Y_n^{(a)}\right) = 0, \qquad a > 0. \qquad (9.1.7)$$
For the second piece, we note by the triangle inequality (again) that $Y_n \le |X_n| + |X| \le 2M_n$ where $M_n = \max(|X_n|, |X|)$. Hence, if $Y_n > a$, then we must have $M_n > a/2$, and thus either $|X_n| > a/2$ or $|X| > a/2$. This implies that $Y_n \mathbf{1}_{Y_n > a} \le 2 M_n \mathbf{1}_{M_n > a/2} \le 2|X_n| \mathbf{1}_{|X_n| > a/2} + 2|X| \mathbf{1}_{|X| > a/2}$. Taking expectations and supremums gives
$$\sup_n \mathbf{E}(Y_n \mathbf{1}_{Y_n > a}) \le 2 \sup_n \mathbf{E}\left(|X_n| \mathbf{1}_{|X_n| > a/2}\right) + 2\,\mathbf{E}\left(|X| \mathbf{1}_{|X| > a/2}\right).$$
Now, as $a \to \infty$, the first of these two terms goes to 0 by the uniform integrability assumption; and the second goes to 0 by the dominated convergence theorem (since $\mathbf{E}|X| < \infty$). We conclude that
$$\lim_{a\to\infty} \sup_n \mathbf{E}(Y_n \mathbf{1}_{Y_n > a}) = 0. \qquad (9.1.8)$$
To finish the proof, let $\epsilon > 0$. By (9.1.8), we can find $a_0 > 0$ such that $\sup_n \mathbf{E}(Y_n \mathbf{1}_{Y_n > a_0}) < \epsilon/2$. By (9.1.7), we can find $n_0 \in \mathbf{N}$ such that $\mathbf{E}(Y_n^{(a_0)}) < \epsilon/2$ for all $n \ge n_0$. Then, for any $n \ge n_0$, we have $\mathbf{E}(Y_n) = \mathbf{E}(Y_n^{(a_0)}) + \mathbf{E}(Y_n \mathbf{1}_{Y_n > a_0}) < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon$. Hence, $\mathbf{E}(Y_n) \to 0$, as desired. ■

Remark 9.1.9. While the above limit theorems were all stated for a countable limit of random variables, they apply equally well for continuous limits. For example, suppose $\{X_t\}_{t \ge 0}$ is a continuous-parameter family of random variables, and that $\lim_{t \searrow 0} X_t(\omega) = X_0(\omega)$ for each fixed $\omega \in \Omega$. Suppose further that the family $\{X_t\}_{t \ge 0}$ is dominated (or uniformly integrable) as in the previous limit theorems. Then for any countable sequence of parameters $\{t_n\} \searrow 0$, the theorems say that $\mathbf{E}(X_{t_n}) \to \mathbf{E}(X_0)$. Since this is true for any sequence $\{t_n\} \searrow 0$, it follows that $\lim_{t \searrow 0} \mathbf{E}(X_t) = \mathbf{E}(X_0)$ as well.

9.2. Differentiation of expectation.

A classic question from multivariate calculus asks when we can "differentiate under the integral sign". For example, is it true that
$$\frac{d}{dt} \int_0^1 e^{\omega t}\, d\omega = \int_0^1 \frac{\partial}{\partial t}\, e^{\omega t}\, d\omega\,?$$
The answer is yes, as the following proposition shows. More generally, the proposition considers a family of random variables $\{F_t\}$ (e.g. $F_t(\omega) = e^{\omega t}$), and the derivative (with respect to $t$) of the function $\mathbf{E}(F_t)$ (e.g. of $\int_0^1 e^{\omega t}\, d\omega$).

Proposition 9.2.1. Let $\{F_t\}_{a < t < b}$ be a collection of random variables with finite expectations, defined on some probability triple $(\Omega, \mathcal{F}, \mathbf{P})$. Suppose that for each $\omega$ and each $a < t < b$, the derivative $F_t'(\omega) = \frac{\partial}{\partial t} F_t(\omega)$ exists. Then $F_t'$ is a random variable. Suppose further that there is a random variable $Y$ on $(\Omega, \mathcal{F}, \mathbf{P})$ with $\mathbf{E}(Y) < \infty$, such that $|F_t'| \le Y$ for all $a < t < b$.
Then, if we define $\phi(t) = \mathbf{E}(F_t)$, then $\phi$ is differentiable, with finite derivative $\phi'(t) = \mathbf{E}(F_t')$ for all $a < t < b$.

Proof. To see that $F_t'$ is a random variable, let $t_n = t + \frac1n$. Then $F_t' = \lim_{n\to\infty} n(F_{t_n} - F_t)$, and hence it is the countable limit of random variables, and therefore is itself a random variable by (3.1.6).

Next, note that we always have $\left|\frac{F_{t+h} - F_t}{h}\right| \le Y$. Hence, using the dominated convergence theorem together with Remark 9.1.9, we have
$$\phi'(t) = \lim_{h\to 0} \frac{\phi(t+h) - \phi(t)}{h} = \lim_{h\to 0} \mathbf{E}\left(\frac{F_{t+h} - F_t}{h}\right) = \mathbf{E}\left(\lim_{h\to 0} \frac{F_{t+h} - F_t}{h}\right) = \mathbf{E}(F_t'),$$
and this is finite since $\mathbf{E}(|F_t'|) \le \mathbf{E}(Y) < \infty$. ■

9.3. Moment generating functions and large deviations.

The moment generating function of a random variable $X$ is the function
$$M_X(s) = \mathbf{E}(e^{sX}), \qquad s \in \mathbf{R}.$$
At first glance, it may appear that this function is of little use. However, we shall see that a surprising amount of information about the distribution of $X$ can be obtained from $M_X(s)$.

If $X$ and $Y$ are independent random variables, then $e^{sX}$ and $e^{sY}$ are independent by Proposition 3.2.3, so we have by (4.2.7) that
$$M_{X+Y}(s) = M_X(s)\, M_Y(s), \qquad X, Y \text{ independent.} \qquad (9.3.1)$$
Clearly, we always have $M_X(0) = 1$. However, we may have $M_X(s) = \infty$ for certain $s \ne 0$. For example, if $\mathbf{P}[X = m] = c/m^2$ for all integers $m \ne 0$, where $c = (\sum_{j \ne 0} j^{-2})^{-1}$, then $M_X(s) = \infty$ for all $s \ne 0$. On the other hand, if $X \sim N(0,1)$, then completing the square gives that
$$M_X(s) = \int_{-\infty}^{\infty} e^{sx} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx = e^{s^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(x-s)^2/2}\, dx = e^{s^2/2}\,(1) = e^{s^2/2}. \qquad (9.3.2)$$
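Formula (9.3.2) is easily tested by Monte Carlo; the following sketch (ours) compares sample averages of $e^{sX}$ against $e^{s^2/2}$:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(1_000_000)

    # Monte Carlo estimate of M_X(s) = E(e^{sX}) for X ~ N(0,1),
    # compared against the exact value e^{s^2/2} from (9.3.2).
    for s in (0.0, 0.5, 1.0, 2.0):
        print(s, np.mean(np.exp(s * x)), np.exp(s ** 2 / 2))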
A key property of moment generating functions, at least when they are finite in a neighbourhood of 0, is the following.

Theorem 9.3.3. Let $X$ be a random variable such that $M_X(s) < \infty$ for $|s| < s_0$, for some $s_0 > 0$. Then $\mathbf{E}|X^n| < \infty$ for all $n$, and $M_X(s)$ is analytic for $|s| < s_0$, with
$$M_X(s) = \sum_{n=0}^{\infty} \mathbf{E}(X^n)\, s^n / n!\,.$$
In particular, the $r$th derivative at $s = 0$ is given by $M_X^{(r)}(0) = \mathbf{E}(X^r)$.

Proof. The idea of the proof is that
$$M_X(s) = \mathbf{E}(e^{sX}) = \mathbf{E}\left(1 + (sX) + (sX)^2/2! + \ldots\right) = 1 + s\,\mathbf{E}(X) + \frac{s^2}{2}\,\mathbf{E}(X^2) + \ldots\,.$$
However, the final equality requires justification. For this, fix $s$ with $|s| < s_0$, and let $Z_n = 1 + (sX) + (sX)^2/2! + \ldots + (sX)^n/n!$. We have to show that $\mathbf{E}(\lim_{n\to\infty} Z_n) = \lim_{n\to\infty} \mathbf{E}(Z_n)$. Now, for all $n \in \mathbf{N}$,
$$|Z_n| \le 1 + |sX| + |sX|^2/2! + \ldots + |sX|^n/n! \le 1 + |sX| + |sX|^2/2! + \ldots = e^{|sX|} \le e^{sX} + e^{-sX} \equiv Y.$$
Since $|s| < s_0$, we have $\mathbf{E}(Y) = M_X(s) + M_X(-s) < \infty$. Hence, by the dominated convergence theorem, $\mathbf{E}(\lim_{n\to\infty} Z_n) = \lim_{n\to\infty} \mathbf{E}(Z_n)$. ■

Remark. Theorem 9.3.3 says that the $r$th derivative of $M_X$ at 0 equals the $r$th moment of $X$ (thus explaining the terminology "moment generating function"). For example, $M_X(0) = 1$, $M_X'(0) = \mathbf{E}(X)$, $M_X''(0) = \mathbf{E}(X^2)$, etc. This result also follows since $(\frac{d}{ds})^r \mathbf{E}(e^{sX}) = \mathbf{E}\left((\frac{d}{ds})^r e^{sX}\right) = \mathbf{E}(X^r e^{sX})$, where the exchange of derivative and expectation can be justified (for $|s| < s_0$) using Proposition 9.2.1 and induction.

We now consider the subject of large deviations. If $X_1, X_2, \ldots$ are i.i.d. with common mean $m$ and finite variance $v$, then it follows from Chebychev's inequality (as in the proof of the Weak Law of Large Numbers) that for all $\epsilon > 0$,
$$\mathbf{P}\left(\frac{X_1 + \ldots + X_n}{n} \ge m + \epsilon\right) \le v / n\epsilon^2,$$
which converges to 0 as $n \to \infty$. But how quickly is this limit reached? Does the probability really just decrease as $O(1/n)$, or does it decrease faster? In fact, if the moment generating functions are finite in a neighbourhood of 0, then the convergence is exponentially fast:

Theorem 9.3.4. Suppose $X_1, X_2, \ldots$ are i.i.d. with common mean $m$, and with $M_{X_i}(s) < \infty$ for $-a < s < b$ where $a, b > 0$. Then for all $\epsilon > 0$,
$$\mathbf{P}\left(\frac{X_1 + \ldots + X_n}{n} \ge m + \epsilon\right) \le \rho^n,$$
where $\rho = \inf_{0 < s < b}\left(e^{-s(m+\epsilon)} M_{X_1}(s)\right) < 1$.

This theorem gives an exponentially small upper bound on the probability that the average of the $X_i$ exceeds its mean by at least $\epsilon$. This is a (very simple) example of a large deviations result, and shows that in this case the probability is decreasing to zero exponentially quickly, much faster than just $O(1/n)$.

To prove Theorem 9.3.4, we begin with a lemma:

Lemma 9.3.5. Let $Z$ be a random variable with $\mathbf{E}(Z) < 0$, such that $M_Z(s) < \infty$ for $-a < s < b$, for some $a, b > 0$. Then $\mathbf{P}(Z \ge 0) \le \rho$, where $\rho = \inf_{0 < s < b} M_Z(s) < 1$.

Proof. For any $0 < s < b$, since the function $x \mapsto e^{sx}$ is increasing, Markov's inequality implies that
$$\mathbf{P}(Z \ge 0) = \mathbf{P}(e^{sZ} \ge 1) \le \frac{\mathbf{E}(e^{sZ})}{1} = M_Z(s).$$
Hence, taking the infimum over $0 < s < b$,
$$\mathbf{P}(Z \ge 0) \le \inf_{0 < s < b} M_Z(s) = \rho.$$
Furthermore, since $M_Z(0) = 1$ and $M_Z'(0) = \mathbf{E}(Z) < 0$, we must have that $M_Z(s) < 1$ for all positive $s$ sufficiently close to 0. In particular, $\rho < 1$. ■

Remark. In Lemma 9.3.5, we need $a > 0$ to ensure that $M_Z'(0) = \mathbf{E}(Z)$. However, the precise value of $a$ is unimportant.

Proof of Theorem 9.3.4. Let $Y_i = X_i - m - \epsilon$, so $\mathbf{E}(Y_i) = -\epsilon < 0$. Then for $-a < s < b$, $M_{Y_1}(s) = \mathbf{E}(e^{sY_1}) = e^{-s(m+\epsilon)}\mathbf{E}(e^{sX_1}) = e^{-s(m+\epsilon)} M_{X_1}(s) < \infty$. Using Lemma 9.3.5 and (9.3.1), we have
$$\mathbf{P}\left(\frac{X_1 + \ldots + X_n}{n} \ge m + \epsilon\right) = \mathbf{P}\left(\frac{Y_1 + \ldots + Y_n}{n} \ge 0\right) = \mathbf{P}(Y_1 + \ldots + Y_n \ge 0) \le \inf_{0 < s < b} M_{Y_1 + \ldots + Y_n}(s) = \inf_{0 < s < b} \left(M_{Y_1}(s)\right)^n = \rho^n,$$
where $\rho = \inf_{0 < s < b} M_{Y_1}(s) = \inf_{0 < s < b}\left(e^{-s(m+\epsilon)} M_{X_1}(s)\right)$. Furthermore, from Lemma 9.3.5, $\rho < 1$. ■
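For a concrete instance of Theorem 9.3.4, take $X_i = \pm 1$ with probability $\frac12$ each, so that $m = 0$ and $M_{X_1}(s) = \cosh(s)$. The sketch below (ours; the infimum over $s$ is taken on a grid) computes the bound $\rho^n$ and compares it with a simulated tail probability:

    import numpy as np

    eps, n = 0.1, 400
    s_grid = np.linspace(1e-3, 2, 2000)
    rho = np.min(np.exp(-s_grid * eps) * np.cosh(s_grid))
    print("rho =", rho, "   bound rho^n =", rho ** n)

    rng = np.random.default_rng(2)
    steps = 2 * rng.integers(0, 2, size=(50_000, n), dtype=np.int8) - 1
    sims = steps.mean(axis=1)
    print("simulated P(mean >= eps) =", np.mean(sims >= eps))
    # The simulated probability indeed falls below the exponential bound.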
9.4. Fubini's Theorem and convolution.

In multivariate calculus, an iterated integral like $\int_0^1 \int_0^1 x^2 y^3\, dx\, dy$ can be computed in three different ways: integrate first $x$ and then $y$, or integrate first $y$ and then $x$, or compute a two-dimensional "double integral" over the full two-dimensional region. It is well known that, under mild conditions, these three different integrals will all be equal. A generalisation of this is Fubini's Theorem, which allows us to compute expectations with respect to product measure in terms of an "iterated integral", where we integrate first with respect to one variable and then with respect to the other, in either order. (For non-negative $f$, the theorem is sometimes referred to as Tonelli's theorem.)

Theorem 9.4.1. (Fubini's Theorem) Let $\mu$ be a probability measure on $\mathcal{X}$, and $\nu$ a probability measure on $\mathcal{Y}$, and let $\mu \times \nu$ be product measure on $\mathcal{X} \times \mathcal{Y}$. If $f : \mathcal{X} \times \mathcal{Y} \to \mathbf{R}$ is measurable with respect to $\mu \times \nu$, then
$$\int_{\mathcal{X} \times \mathcal{Y}} f\, d(\mu \times \nu) = \int_{\mathcal{X}} \left( \int_{\mathcal{Y}} f(x,y)\, \nu(dy) \right) \mu(dx) = \int_{\mathcal{Y}} \left( \int_{\mathcal{X}} f(x,y)\, \mu(dx) \right) \nu(dy), \qquad (9.4.2)$$
provided that either $\int_{\mathcal{X} \times \mathcal{Y}} f^+\, d(\mu \times \nu) < \infty$ or $\int_{\mathcal{X} \times \mathcal{Y}} f^-\, d(\mu \times \nu) < \infty$ (or both). [This is guaranteed if, for example, $f \ge C > -\infty$, or $f \le C < \infty$, or $\int_{\mathcal{X} \times \mathcal{Y}} |f|\, d(\mu \times \nu) < \infty$. Note that we allow that the inner integrals (i.e., the integrals inside the brackets) may be infinite or even undefined on a set of probability 0.]

The proof of Theorem 9.4.1 requires one technical lemma:

Lemma 9.4.3. The mapping $E \mapsto \int_{\mathcal{Y}} \left( \int_{\mathcal{X}} \mathbf{1}_E(x,y)\, \mu(dx) \right) \nu(dy)$ is a well-defined, countably additive function of measurable subsets $E$.

Proof (optional). For $E \subseteq \mathcal{X} \times \mathcal{Y}$, and $y \in \mathcal{Y}$, let $S_y(E) = \{x \in \mathcal{X};\ (x,y) \in E\}$. We first argue that, for any $y \in \mathcal{Y}$ and any set $E$ which is measurable with respect to $\mu \times \nu$, the set $S_y(E)$ is measurable with respect to $\mu$. Indeed, this is certainly true if $E = A \times B$ with $A$ and $B$ measurable, for then $S_y(E)$ is always either $A$ or $\emptyset$. On the other hand, since $S_y$ preserves set operations (i.e., $S_y(E_1 \cup E_2) = S_y(E_1) \cup S_y(E_2)$, $S_y(E^c) = S_y(E)^c$, etc.), the collection of sets $E$ for which $S_y(E)$ is measurable is a $\sigma$-algebra. Furthermore, the measurable rectangles $A \times B$ generate the product measure's entire $\sigma$-algebra. Hence, for any measurable set $E$, $S_y(E)$ is a measurable subset of $\mathcal{X}$, so that $\mu(S_y(E))$ is well-defined.

Next, consider the collection $\mathcal{G}$ of those measurable subsets $E \subseteq \mathcal{X} \times \mathcal{Y}$ for which $\mu(S_y(E))$ is a measurable function of $y \in \mathcal{Y}$. This $\mathcal{G}$ certainly contains all the measurable rectangles $A \times B$, since then $\mu(S_y(E)) = \mathbf{1}_B(y)\,\mu(A)$, which is measurable.
Also, $\mathcal{G}$ is closed under complements, since $\mu(S_y(E^c)) = \mu(S_y(E)^c) = 1 - \mu(S_y(E))$. It is also closed under countable disjoint unions, since if $\{E_n\}$ are disjoint, then so are $\{S_y(E_n)\}$, and thus $\mu(S_y(\cup_n E_n)) = \mu(\cup_n S_y(E_n)) = \sum_n \mu(S_y(E_n))$. From these facts, it follows (formally, from the "$\pi$-$\lambda$ theorem", e.g. Billingsley, 1995, p. 42) that $\mathcal{G}$ includes all measurable subsets $E$, i.e. that $\mu(S_y(E))$ is a measurable function of $y$ for any measurable set $E$. Thus, integrals like $\int_{\mathcal{Y}} \mu(S_y(E))\, \nu(dy)$ are well-defined. Since $\int_{\mathcal{Y}} \left( \int_{\mathcal{X}} \mathbf{1}_E(x,y)\, \mu(dx) \right) \nu(dy) = \int_{\mathcal{Y}} \mu(S_y(E))\, \nu(dy)$, the claim about being well-defined follows.

To prove countable additivity, let $\{E_n\}$ be disjoint. Then, as above, $\mu(S_y(\cup_n E_n)) = \sum_n \mu(S_y(E_n))$. Countable additivity then follows from countable linearity of expectations with respect to $\nu$. ■

Proof of Theorem 9.4.1. We first consider the case where $f = \mathbf{1}_E$ is an indicator function of a measurable set $E \subseteq \mathcal{X} \times \mathcal{Y}$. By Lemma 9.4.3, the mapping $E \mapsto \int_{\mathcal{Y}} \left( \int_{\mathcal{X}} \mathbf{1}_E(x,y)\, \mu(dx) \right) \nu(dy)$ is a well-defined, countably additive function of subsets $E$. When $E = A \times B$, this integral clearly equals $\mu(A)\nu(B) = (\mu \times \nu)(E)$. Hence, by uniqueness of extensions (Proposition 2.5.8), we must have $\int_{\mathcal{Y}} \left( \int_{\mathcal{X}} \mathbf{1}_E(x,y)\, \mu(dx) \right) \nu(dy) = (\mu \times \nu)(E)$ for any measurable subset $E$. Similarly, $\int_{\mathcal{X}} \left( \int_{\mathcal{Y}} \mathbf{1}_E(x,y)\, \nu(dy) \right) \mu(dx) = (\mu \times \nu)(E)$, so that (9.4.2) holds whenever $f = \mathbf{1}_E$.

We complete the proof by our usual arguments. Indeed, by linearity, (9.4.2) holds whenever $f$ is a simple function. Then, by the monotone convergence theorem, (9.4.2) holds for general measurable non-negative $f$. Finally, again by linearity since $f = f^+ - f^-$, we see that (9.4.2) holds for general functions $f$, as long as we avoid the $\infty - \infty$ case where $\int_{\mathcal{X} \times \mathcal{Y}} f^+\, d(\mu \times \nu) = \int_{\mathcal{X} \times \mathcal{Y}} f^-\, d(\mu \times \nu) = \infty$ (while still allowing that the inner integrals may be $\infty - \infty$ on a set of probability 0). ■

Remark. If $\int_{\mathcal{X} \times \mathcal{Y}} f^+\, d(\mu \times \nu) = \int_{\mathcal{X} \times \mathcal{Y}} f^-\, d(\mu \times \nu) = \infty$, then Fubini's Theorem might not hold; see Exercises 9.5.13 and 9.5.14.

By additivity, Fubini's Theorem still holds if $\mu$ and/or $\nu$ are $\sigma$-finite measures (cf. Remark 4.4.3), rather than just probability measures. In particular, letting $\mu = \mathbf{P}$ and $\nu$ be counting measure on $\mathbf{N}$ leads to the following generalisation of countable linearity (essentially a re-statement of Exercise 4.5.14(a)):

Corollary 9.4.4. Let $Z_1, Z_2, \ldots$ be random variables with $\sum_i \mathbf{E}|Z_i| < \infty$. Then $\mathbf{E}(\sum_i Z_i) = \sum_i \mathbf{E}(Z_i)$.
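Equation (9.4.2) can be illustrated numerically by discretising both coordinates. In the sketch below (ours), the two iterated Riemann sums for $f(x,y) = x^2 y^3$ on $[0,1]^2$ agree with the double integral $\frac13 \cdot \frac14 = \frac1{12}$:

    import numpy as np

    n = 2000
    grid = (np.arange(n) + 0.5) / n                 # midpoints of [0,1]
    f = grid[:, None] ** 2 * grid[None, :] ** 3     # f(x,y) = x^2 * y^3

    inner_y_first = f.mean(axis=1).mean()   # integrate over y, then over x
    inner_x_first = f.mean(axis=0).mean()   # integrate over x, then over y
    print(inner_y_first, inner_x_first, 1 / 12)     # all ~ 0.083333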
As an application of Fubini's Theorem, we consider the convolution formula, about sums of independent random variables.

Theorem 9.4.5. Suppose $X$ and $Y$ are independent random variables with distributions $\mu = \mathcal{L}(X)$ and $\nu = \mathcal{L}(Y)$. Then the distribution of $X + Y$ is given by $\mu * \nu$, where
$$(\mu * \nu)(H) = \int_{\mathbf{R}} \mu(H - y)\, \nu(dy), \qquad H \subseteq \mathbf{R},$$
with $H - y = \{h - y;\ h \in H\}$. Furthermore, if $\mu$ has density $f$ and $\nu$ has density $g$ (with respect to $\lambda =$ Lebesgue measure on $\mathbf{R}$), then $\mu * \nu$ has density $f * g$, where
$$(f * g)(x) = \int_{\mathbf{R}} f(x-y)\, g(y)\, \lambda(dy), \qquad x \in \mathbf{R}.$$

Proof. Since $X$ and $Y$ are independent, we know that $\mathcal{L}((X,Y)) = \mu \times \nu$, i.e. the distribution of the ordered pair $(X,Y)$ is equal to product measure. Given a Borel subset $H \subseteq \mathbf{R}$, let $B = \{(x,y) \in \mathbf{R}^2;\ x + y \in H\}$. Then, using Fubini's Theorem, we have
$$\mathbf{P}(X + Y \in H) = \mathbf{P}((X,Y) \in B) = (\mu \times \nu)(B) = \int_{\mathbf{R}} \left( \int_{\mathbf{R}} \mathbf{1}_B(x,y)\, \mu(dx) \right) \nu(dy) = \int_{\mathbf{R}} \mu(H - y)\, \nu(dy),$$
so $\mathcal{L}(X+Y) = \mu * \nu$.

If $\mu$ has density $f$ and $\nu$ has density $g$, then using Proposition 6.2.3, shift invariance of Lebesgue measure as in (1.2.5), and Fubini's Theorem again,
$$\begin{aligned} (\mu * \nu)(H) &= \int_{\mathbf{R}} \mu(H-y)\, \nu(dy) = \int_{\mathbf{R}} \left( \int_{H-y} f(x)\, \lambda(dx) \right) g(y)\, \lambda(dy) \\ &= \int_{\mathbf{R}} \left( \int_H f(x-y)\, \lambda(dx) \right) g(y)\, \lambda(dy) = \int_H \left( \int_{\mathbf{R}} f(x-y)\, g(y)\, \lambda(dy) \right) \lambda(dx) = \int_H (f * g)(x)\, \lambda(dx), \end{aligned}$$
so $\mu * \nu$ has density given by $f * g$. ■
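The density formula $f * g$ is also easy to evaluate numerically. In the sketch below (ours), convolving two Uniform$[0,1]$ densities recovers the triangular density on $[0,2]$, which equals $x$ for $0 \le x \le 1$ and $2-x$ for $1 \le x \le 2$:

    import numpy as np

    dy = 0.001
    y = np.arange(0.0, 1.0, dy)

    def f(t):
        """Density of Uniform[0,1]."""
        return ((t >= 0) & (t <= 1)).astype(float)

    for x in (0.25, 0.5, 1.0, 1.5):
        conv = np.sum(f(x - y) * f(y)) * dy   # Riemann sum of f(x-y)g(y) dy
        exact = x if x <= 1 else 2 - x
        print(x, conv, exact)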
9.5. Exercises.

Exercise 9.5.1. For the "simple counter-example" with $\Omega = \mathbf{N}$, $\mathbf{P}(\omega) = 2^{-\omega}$ for $\omega \in \Omega$, and $X_n(\omega) = 2^n \delta_{\omega,n}$, verify explicitly that the hypotheses of each of the Monotone Convergence Theorem, the Bounded Convergence Theorem, the Dominated Convergence Theorem, and the Uniform Integrability Convergence Theorem are all violated.

Exercise 9.5.2. Give an example of a sequence of random variables which is unbounded but still uniformly integrable. For bonus points, make the sequence also be undominated, i.e. violate the hypothesis of the Dominated Convergence Theorem.

Exercise 9.5.3. Let $X, X_1, X_2, \ldots$ be non-negative random variables, defined jointly on some probability triple $(\Omega, \mathcal{F}, \mathbf{P})$, each having finite expected value. Assume that $\lim_{n\to\infty} X_n(\omega) = X(\omega)$ for all $\omega \in \Omega$. For $n, K \in \mathbf{N}$, let $Y_{n,K} = \min(X_n, K)$. For each of the following statements, either prove it must be true, or provide a counter-example to show it is sometimes false.
(a) $\lim_{K\to\infty} \lim_{n\to\infty} \mathbf{E}(Y_{n,K}) = \mathbf{E}(X)$.
(b) $\lim_{n\to\infty} \lim_{K\to\infty} \mathbf{E}(Y_{n,K}) = \mathbf{E}(X)$.

Exercise 9.5.4. Suppose that $\lim_{n\to\infty} X_n(\omega) = 0$ for all $\omega \in \Omega$, but $\lim_{n\to\infty} \mathbf{E}[X_n] \ne 0$. Prove that $\mathbf{E}(\sup_n |X_n|) = \infty$.

Exercise 9.5.5. Suppose $\sup_n \mathbf{E}(|X_n|^r) < \infty$ for some $r > 1$. Prove that $\{X_n\}$ is uniformly integrable. [Hint: If $|X_n(\omega)| > a > 0$, then $|X_n(\omega)| < |X_n(\omega)|^r / a^{r-1}$.]

Exercise 9.5.6. Prove that Theorem 9.1.6 implies Theorem 9.1.2. [Hint: Suppose $|X_n| \le Y$ where $\mathbf{E}(Y) < \infty$. Prove that $\{X_n\}$ satisfies (9.1.4).]

Exercise 9.5.7. Prove that Theorem 9.1.2 implies Theorem 4.2.2, assuming that $\mathbf{E}|X| < \infty$. [Hint: Suppose $\{X_n\} \nearrow X$ where $\mathbf{E}|X| < \infty$. Prove that $\{X_n\}$ is dominated.]

Exercise 9.5.8. Let $\Omega = \{1,2\}$, with $\mathbf{P}(\{1\}) = \mathbf{P}(\{2\}) = \frac12$, and let $F_t(1) = t^2$ and $F_t(2) = t^4$ for $0 < t < 1$.
(a) What does Proposition 9.2.1 conclude in this case?
(b) In light of the above, what rule from calculus is implied by Proposition 9.2.1?
Exercise 9.5.9. Let $X_1, X_2, \ldots$ be i.i.d., each with $\mathbf{P}(X_i = 1) = \mathbf{P}(X_i = -1) = \frac12$.
(a) Compute the moment generating functions $M_{X_i}(s)$.
(b) Use Theorem 9.3.4 to obtain an exponentially-decreasing upper bound on $\mathbf{P}\left(\frac1n(X_1 + \ldots + X_n) \ge 0.1\right)$.

Exercise 9.5.10. Let $X_1, X_2, \ldots$ be i.i.d., each having the standard normal distribution $N(0,1)$. Use Theorem 9.3.4 to obtain an exponentially-decreasing upper bound on $\mathbf{P}\left(\frac1n(X_1 + \ldots + X_n) \ge 0.1\right)$. [Hint: Don't forget (9.3.2).]

Exercise 9.5.11. Let $X$ have the distribution Exponential(5), with density $f_X(x) = 5e^{-5x}$ for $x \ge 0$ (with $f_X(x) = 0$ for $x < 0$).
(a) Compute the moment generating function $M_X(t)$.
(b) Use $M_X(t)$ to compute (with explanation) the expected value $\mathbf{E}(X)$.

Exercise 9.5.12. Let $\alpha > 2$, and let $M(t) = e^{-|t|^{\alpha}}$ for $t \in \mathbf{R}$. Prove that $M(t)$ is not a characteristic function of any probability distribution. [Hint: Consider $M''(t)$.]

Exercise 9.5.13. Let $\mathcal{X} = \mathcal{Y} = \mathbf{N}$, and let $\mu\{n\} = \nu\{n\} = 2^{-n}$ for $n \in \mathbf{N}$. Define $f : \mathcal{X} \times \mathcal{Y} \to \mathbf{R}$ by $f(n,n) = (4^n - 1)$ and $f(n, n+1) = -2(4^n - 1)$, with $f(n,m) = 0$ otherwise.
(a) Compute $\int_{\mathcal{X}} \left( \int_{\mathcal{Y}} f(x,y)\, \nu(dy) \right) \mu(dx)$.
(b) Compute $\int_{\mathcal{Y}} \left( \int_{\mathcal{X}} f(x,y)\, \mu(dx) \right) \nu(dy)$.
(c) Why does the result not contradict Fubini's Theorem?

Exercise 9.5.14. Let $\lambda$ be Lebesgue measure on $[0,1]$, and let $f(x,y) = 8xy(x^2 - y^2)(x^2 + y^2)^{-3}$ for $(x,y) \ne (0,0)$, with $f(0,0) = 0$.
(a) Compute $\int_0^1 \left( \int_0^1 f(x,y)\, \lambda(dy) \right) \lambda(dx)$. [Hint: Make the substitution $u = x^2 + y^2$, $v = x$, so $du = 2y\, dy$, $dv = dx$, and $x^2 - y^2 = 2v^2 - u$.]
(b) Compute $\int_0^1 \left( \int_0^1 f(x,y)\, \lambda(dx) \right) \lambda(dy)$.
(c) Why does the result not contradict Fubini's Theorem?

Exercise 9.5.15. Let $X \sim \text{Poisson}(a)$ and $Y \sim \text{Poisson}(b)$ be independent. Let $Z = X + Y$. Use the convolution formula to compute $\mathbf{P}(Z = z)$ for all $z \in \mathbf{R}$, and prove that $Z \sim \text{Poisson}(a+b)$.

Exercise 9.5.16. Let $X \sim N(a,v)$ and $Y \sim N(b,w)$ be independent. Let $Z = X + Y$. Use the convolution formula to prove that $Z \sim N(a+b, v+w)$.
Exercise 9.5.17. For $\alpha, \beta > 0$, the Gamma($\alpha, \beta$) distribution has density function $f(x) = \beta^{\alpha} x^{\alpha-1} e^{-x\beta} / \Gamma(\alpha)$ for $x > 0$ (with $f(x) = 0$ for $x \le 0$), where $\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1} e^{-t}\, dt$ is the gamma function. (Hence, when $\alpha = 1$, Gamma(1, $\beta$) = Exp($\beta$).) Let $X \sim$ Gamma($\alpha, \beta$) and $Y \sim$ Gamma($\gamma, \beta$) be independent, and let $Z = X + Y$. Use the convolution formula to prove that $Z \sim$ Gamma($\alpha + \gamma, \beta$). [Note: You may use the facts that $\Gamma(\alpha+1) = \alpha\,\Gamma(\alpha)$, and $\Gamma(n) = (n-1)!$ for $n \in \mathbf{N}$, and
$$\int_0^x t^{r-1}(x-t)^{s-1}\, dt = x^{r+s-1}\, \frac{\Gamma(r)\,\Gamma(s)}{\Gamma(r+s)}$$
for $r, s, x > 0$.]

9.6. Section summary.

This section presented various probability results that will be required for the more advanced portions of the text. First, the Dominated Convergence Theorem and the Uniform Integrability Convergence Theorem were proved, to extend the Monotone Convergence Theorem and the Bounded Convergence Theorem studied previously, providing further conditions under which $\lim \mathbf{E}(X_n) = \mathbf{E}(\lim X_n)$. Second, a result was given allowing derivative and expectation operators to be exchanged. Third, moment generating functions were introduced, and some of their basic properties were studied. Finally, Fubini's Theorem for iterated integration was proved, and applied to give a convolution formula for the distribution of a sum of independent random variables.
10. Weak convergence.

Given Borel probability distributions $\mu, \mu_1, \mu_2, \ldots$ on $\mathbf{R}$, we shall write $\mu_n \Rightarrow \mu$, and say that $\{\mu_n\}$ converges weakly to $\mu$, if $\int_{\mathbf{R}} f\, d\mu_n \to \int_{\mathbf{R}} f\, d\mu$ for all bounded continuous functions $f : \mathbf{R} \to \mathbf{R}$.

This is a rather natural definition, though we draw the reader's attention to the fact that this convergence need hold only for continuous functions $f$ (as opposed to all Borel-measurable $f$; cf. Proposition 3.1.8). That is, the "topology" of $\mathbf{R}$ is being used here, not just its measure-theoretic properties. (In fact, weak convergence corresponds to the weak* ("weak-star") topology from functional analysis, with $X$ the set of all continuous functions on $\mathbf{R}$ vanishing at infinity (cf. Exercise 10.3.8), with norm defined by $\|f\| = \sup_{x \in \mathbf{R}} |f(x)|$, and with dual space $X^*$ consisting of all finite signed Borel measures on $\mathbf{R}$. The Helly Selection Principle below then follows from Alaoglu's Theorem. See e.g. pages 161-2, 205, and 216 of Folland (1984).)

10.1. Equivalences of weak convergence.

We now present a number of equivalences of weak convergence (see also Exercise 10.3.8). For condition (2), recall that the boundary of a set $A \subseteq \mathbf{R}$ is $\partial A = \{x \in \mathbf{R};\ \forall \epsilon > 0,\ A \cap (x-\epsilon, x+\epsilon) \ne \emptyset,\ A^c \cap (x-\epsilon, x+\epsilon) \ne \emptyset\}$.

Theorem 10.1.1. The following are equivalent:
(1) $\mu_n \Rightarrow \mu$ (i.e., $\{\mu_n\}$ converges weakly to $\mu$);
(2) $\mu_n(A) \to \mu(A)$ for all measurable sets $A$ such that $\mu(\partial A) = 0$;
(3) $\mu_n((-\infty, x]) \to \mu((-\infty, x])$ for all $x \in \mathbf{R}$ such that $\mu\{x\} = 0$;
(4) (Skorohod's Theorem) there are random variables $Y, Y_1, Y_2, \ldots$ defined jointly on some probability triple, with $\mathcal{L}(Y) = \mu$ and $\mathcal{L}(Y_n) = \mu_n$ for each $n \in \mathbf{N}$, such that $Y_n \to Y$ with probability 1;
(5) $\int_{\mathbf{R}} f\, d\mu_n \to \int_{\mathbf{R}} f\, d\mu$ for all bounded Borel-measurable functions $f : \mathbf{R} \to \mathbf{R}$ such that $\mu(D_f) = 0$, where $D_f$ is the set of points where $f$ is discontinuous.

Proof. (5) $\Rightarrow$ (1): Immediate.

(5) $\Rightarrow$ (2): This follows by setting $f = \mathbf{1}_A$, so that $D_f = \partial A$, and $\mu(D_f) = \mu(\partial A) = 0$. Then $\mu_n(A) = \int f\, d\mu_n \to \int f\, d\mu = \mu(A)$.

(2) $\Rightarrow$ (3): Immediate, since the boundary of $(-\infty, x]$ is $\{x\}$.

(1) $\Rightarrow$ (3): Let $\epsilon > 0$, and let $f$ be the function defined by $f(t) = 1$ for $t \le x$, $f(t) = 0$ for $t \ge x + \epsilon$, with $f$ linear on the interval $(x, x+\epsilon)$ (see Figure 10.1.2(a)). Then $f$ is continuous, with $\mathbf{1}_{(-\infty,x]} \le f \le \mathbf{1}_{(-\infty,x+\epsilon]}$. Hence,
$$\limsup_{n\to\infty} \mu_n((-\infty,x]) \le \limsup_{n\to\infty} \int f\, d\mu_n = \int f\, d\mu \le \mu((-\infty, x+\epsilon]).$$
This is true for any $\epsilon > 0$, so we conclude that $\limsup_{n\to\infty} \mu_n((-\infty,x]) \le \mu((-\infty,x])$.

[Figure 10.1.2. Functions used in the proof of Theorem 10.1.1: (a) the function $f$, equal to 1 up to $x$ and decreasing linearly to 0 on $(x, x+\epsilon)$; (b) the function $g$, equal to 1 up to $x-\epsilon$ and decreasing linearly to 0 on $(x-\epsilon, x)$.]

Similarly, if we let $g$ be the function defined by $g(t) = 1$ for $t \le x - \epsilon$, $g(t) = 0$ for $t \ge x$, with $g$ linear on the interval $(x-\epsilon, x)$ (see Figure 10.1.2(b)), then $\mathbf{1}_{(-\infty,x-\epsilon]} \le g \le \mathbf{1}_{(-\infty,x]}$, and we obtain that
$$\liminf_{n\to\infty} \mu_n((-\infty,x]) \ge \liminf_{n\to\infty} \int g\, d\mu_n = \int g\, d\mu \ge \mu((-\infty, x-\epsilon]).$$
This is true for any $\epsilon > 0$, so we conclude that $\liminf_{n\to\infty} \mu_n((-\infty,x]) \ge \mu((-\infty,x))$. But if $\mu\{x\} = 0$, then $\mu((-\infty,x]) = \mu((-\infty,x))$, so we must have
$$\limsup_n \mu_n((-\infty,x]) = \liminf_n \mu_n((-\infty,x]) = \mu((-\infty,x]),$$
as claimed.

(3) $\Rightarrow$ (4): We first define the cumulative distribution functions, by $F_n(x) = \mu_n((-\infty,x])$ and $F(x) = \mu((-\infty,x])$. Then, if we let $(\Omega, \mathcal{F}, \mathbf{P})$ be Lebesgue measure on $[0,1]$, and let $Y_n(\omega) = \inf\{x;\ F_n(x) > \omega\}$ and $Y(\omega) = \inf\{x;\ F(x) > \omega\}$, then as in Lemma 7.1.2 we have $\mathcal{L}(Y_n) = \mu_n$ and $\mathcal{L}(Y) = \mu$. Note that if $F(z) < a$, then $Y(a) \ge z$, while if $F(w) > b$, then $Y(b) \le w$.

Since $\{F_n\} \to F$ at most points, it seems reasonable that $\{Y_n\} \to Y$ at most points. We will prove that $\{Y_n\} \to Y$ at points of continuity of $Y$.
Then, since $Y$ is non-decreasing, it can have at most a countable number of discontinuities: indeed, for any $m, n \in \mathbf{N}$, it has at most $m\left(Y(1-\frac1n) - Y(\frac1n)\right) < \infty$ discontinuities of size $\ge 1/m$ within the interval $[\frac1n, 1-\frac1n]$; then take the countable union over $m$ and $n$. Since countable sets have Lebesgue measure 0, this implies that $\{Y_n\} \to Y$ with probability 1, proving (4).

Suppose, then, that $Y$ is continuous at $\omega$, and let $y = Y(\omega)$. For any $\epsilon > 0$, we claim that $F(y-\epsilon) < \omega < F(y+\epsilon)$. Indeed, if we had $F(y-\epsilon) = \omega$, then setting $w = y - \epsilon$ and $b = \omega$ above, this would imply $Y(\omega) \le y - \epsilon = Y(\omega) - \epsilon$, a contradiction. Or, if we had $F(y+\epsilon) = \omega$, then setting $z = y + \epsilon$ and $a = \omega + \delta$ above, this would imply $Y(\omega + \delta) \ge y + \epsilon = Y(\omega) + \epsilon$ for all $\delta > 0$, contradicting the continuity of $Y$ at $\omega$. So, $F(y-\epsilon) < \omega < F(y+\epsilon)$ for all $\epsilon > 0$.

Next, given $\epsilon > 0$, find $\epsilon'$ with $0 < \epsilon' < \epsilon$ such that $\mu\{y-\epsilon'\} = \mu\{y+\epsilon'\} = 0$. Then $F_n(y-\epsilon') \to F(y-\epsilon')$ and $F_n(y+\epsilon') \to F(y+\epsilon')$, so $F_n(y-\epsilon') < \omega < F_n(y+\epsilon')$ for all sufficiently large $n$. This in turn implies (setting first $z = y - \epsilon'$ and $a = \omega$ above, and then $w = y + \epsilon'$ and $b = \omega$ above) that $y - \epsilon' \le Y_n(\omega) \le y + \epsilon'$, i.e. $|Y_n(\omega) - Y(\omega)| \le \epsilon' < \epsilon$ for all sufficiently large $n$. Hence, $Y_n(\omega) \to Y(\omega)$.

(4) $\Rightarrow$ (5): Recall that if $f$ is continuous at $x$, and if $\{x_n\} \to x$, then $f(x_n) \to f(x)$. Hence, if $\{Y_n\} \to Y$ and $Y \notin D_f$, then $\{f(Y_n)\} \to f(Y)$. It follows that $\mathbf{P}[\{f(Y_n)\} \to f(Y)] \ge \mathbf{P}[\{Y_n\} \to Y \text{ and } Y \notin D_f]$. But by assumption, $\mathbf{P}[\{Y_n\} \to Y] = 1$ and $\mathbf{P}[Y \notin D_f] = \mu(D_f^c) = 1$, so also $\mathbf{P}[\{f(Y_n)\} \to f(Y)] = 1$. If $f$ is bounded, then from the bounded convergence theorem, $\mathbf{E}[f(Y_n)] \to \mathbf{E}[f(Y)]$, i.e. $\int f\, d\mu_n \to \int f\, d\mu$, as claimed. ■

For a first example, let $\mu$ be Lebesgue measure on $[0,1]$, and let $\mu_n$ be defined by $\mu_n\left(\left\{\frac{i}{n}\right\}\right) = \frac1n$ for $i = 1, 2, \ldots, n$. Then $\mu$ is purely continuous while $\mu_n$ is purely discrete; furthermore, $\mu(\mathbf{Q}) = 0$ while $\mu_n(\mathbf{Q}) = 1$ for each $n$. On the other hand, for any $0 \le x \le 1$, we have $\mu((-\infty,x]) = x$ while $\mu_n((-\infty,x]) = \lfloor nx \rfloor / n$. Hence, $|\mu_n((-\infty,x]) - \mu((-\infty,x])| \le \frac1n \to 0$ as $n \to \infty$, so we do indeed have $\mu_n \Rightarrow \mu$. (Note that $\partial\mathbf{Q} = [0,1]$, so that $\mu(\partial\mathbf{Q}) \ne 0$.)

For a second example, suppose $X_1, X_2, \ldots$ are i.i.d. with finite mean $m$, and $S_n = \frac1n(X_1 + \ldots + X_n)$. Then the weak law of large numbers says that for any $\epsilon > 0$ we have $\mathbf{P}(S_n \le m - \epsilon) \to 0$ and $\mathbf{P}(S_n \le m + \epsilon) \to 1$ as $n \to \infty$. It follows that $\mathcal{L}(S_n) \Rightarrow \delta_m(\cdot)$, a point mass at $m$. Note that it is not necessarily the case that $\mathbf{P}(S_n \le m) \to \delta_m((-\infty,m]) = 1$, but this is no contradiction since the boundary of $(-\infty,m]$ is $\{m\}$, and $\delta_m\{m\} \ne 0$.
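For the first example above, condition (3) of Theorem 10.1.1 can be watched directly: the step-function c.d.f.s $F_n(x) = \lfloor nx \rfloor / n$ approach $F(x) = x$ uniformly. A brief sketch (ours; the test points are chosen at random so as not to align with the atoms of $\mu_n$):

    import numpy as np

    rng = np.random.default_rng(3)
    xs = np.sort(rng.uniform(0, 1, 10_000))

    def F_n(x, n):
        """c.d.f. of the uniform measure on {1/n, 2/n, ..., n/n}."""
        return np.floor(n * x) / n

    for n in (10, 100, 1000):
        print(n, np.max(np.abs(F_n(xs, n) - xs)))   # ~ 1/n, so mu_n => mu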
10.2. Connections to other convergence.

We now explore a sufficient condition for weak convergence.

Proposition 10.2.1. If $\{X_n\} \to X$ in probability, then $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$.

Proof. For any $\epsilon > 0$, if $X > z + \epsilon$ and $|X_n - X| \le \epsilon$, then we must have $X_n > z$. That is, $\{X > z+\epsilon\} \cap \{|X_n - X| \le \epsilon\} \subseteq \{X_n > z\}$. Taking complements, $\{X \le z+\epsilon\} \cup \{|X_n - X| > \epsilon\} \supseteq \{X_n \le z\}$. Hence, by the order-preserving property and subadditivity, $\mathbf{P}(X_n \le z) \le \mathbf{P}(X \le z+\epsilon) + \mathbf{P}(|X - X_n| > \epsilon)$. Since $\{X_n\} \to X$ in probability, we get that $\limsup_{n\to\infty} \mathbf{P}(X_n \le z) \le \mathbf{P}(X \le z+\epsilon)$. Letting $\epsilon \searrow 0$ gives $\limsup_n \mathbf{P}(X_n \le z) \le \mathbf{P}(X \le z)$.

Similarly, interchanging $X$ and $X_n$ and replacing $z$ with $z - \epsilon$ in the above gives $\mathbf{P}(X \le z-\epsilon) \le \mathbf{P}(X_n \le z) + \mathbf{P}(|X - X_n| > \epsilon)$, i.e. $\mathbf{P}(X_n \le z) \ge \mathbf{P}(X \le z-\epsilon) - \mathbf{P}(|X - X_n| > \epsilon)$, so $\liminf_n \mathbf{P}(X_n \le z) \ge \mathbf{P}(X \le z-\epsilon)$. Letting $\epsilon \searrow 0$ gives $\liminf_n \mathbf{P}(X_n \le z) \ge \mathbf{P}(X < z)$.

If $\mathbf{P}(X = z) = 0$, then $\mathbf{P}(X < z) = \mathbf{P}(X \le z)$, so we must have $\liminf_n \mathbf{P}(X_n \le z) = \limsup_n \mathbf{P}(X_n \le z) = \mathbf{P}(X \le z)$, as claimed. ■

Remark. We sometimes write $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$ simply as $X_n \Rightarrow X$, and say that $\{X_n\}$ converges weakly (or, in distribution) to $X$.

We now have an interesting near-circle of implications. We already knew (Proposition 5.2.3) that if $X_n \to X$ almost surely, then $X_n \to X$ in probability. We now see from Proposition 10.2.1 that this in turn implies that $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$. And from Theorem 10.1.1(4), this implies that there are random variables $Y_n$ and $Y$ having the same laws, such that $Y_n \to Y$ almost surely.

Note that the converse to Proposition 10.2.1 is clearly false, since the fact that $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$ says nothing about the underlying relationship between $X_n$ and $X$; it only says something about their laws. For example, if $X, X_1, X_2, \ldots$ are i.i.d., each equal to $\pm 1$ with probability $\frac12$, then of course $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$, but on the other hand $\mathbf{P}(|X - X_n| \ge 2) = \frac12 \not\to 0$, so $X_n$ does not converge to $X$ in probability or with probability 1. However, if $X$ is constant, then the converse to Proposition 10.2.1 does hold (Exercise 10.3.1).

Finally, we note that Skorohod's Theorem may be used to translate results involving convergence with probability 1 to results involving weak convergence (or, by Proposition 10.2.1, convergence in probability). For example, we have

Proposition 10.2.2. Suppose $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$, with $X_n \ge 0$. Then $\mathbf{E}(X) \le \liminf_n \mathbf{E}(X_n)$.

Proof. By Skorohod's Theorem, we can find random variables $Y_n$ and $Y$ with $\mathcal{L}(Y_n) = \mathcal{L}(X_n)$, $\mathcal{L}(Y) = \mathcal{L}(X)$, and $Y_n \to Y$ with probability 1. Then, from Fatou's Lemma,
$$\mathbf{E}(X) = \mathbf{E}(Y) = \mathbf{E}(\liminf_n Y_n) \le \liminf_n \mathbf{E}(Y_n) = \liminf_n \mathbf{E}(X_n). \quad ■$$
Finally, we note that Skorohod's Theorem may be used to translate results about convergence with probability $1$ into results about weak convergence (or, by Proposition 10.2.1, convergence in probability). For example, we have

Proposition 10.2.2. Suppose $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$, with $X_n \geq 0$. Then $E(X) \leq \liminf_n E(X_n)$.

Proof. By Skorohod's Theorem, we can find random variables $Y_n$ and $Y$ with $\mathcal{L}(Y_n) = \mathcal{L}(X_n)$, $\mathcal{L}(Y) = \mathcal{L}(X)$, and $Y_n \to Y$ with probability $1$. Then, from Fatou's Lemma,
$$E(X) = E(Y) = E(\liminf_n Y_n) \leq \liminf_n E(Y_n) = \liminf_n E(X_n). \;\blacksquare$$

For example, if $X = 0$, and if $P(X_n = n) = \frac{1}{n}$ and $P(X_n = 0) = 1 - \frac{1}{n}$, then $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X) = \delta_0$, and $0 = E(X) \leq \liminf_n E(X_n) = 1$. (In fact, here $X_n \to X$ in probability as well.)

Remark 10.2.3. We note that most of these weak convergence concepts have direct analogues for higher-dimensional distributions, not considered here; see e.g. Billingsley (1995, Section 29).

10.3. Exercises.

Exercise 10.3.1. Suppose $\mathcal{L}(X_n) \Rightarrow \delta_c$ for some $c \in \mathbf{R}$. Prove that $\{X_n\}$ converges to $c$ in probability.

Exercise 10.3.2. Let $X, Y_1, Y_2, \ldots$ be independent random variables, with $P(Y_n = 1) = 1/n$ and $P(Y_n = 0) = 1 - 1/n$. Let $Z_n = X + Y_n$. Prove that $\mathcal{L}(Z_n) \Rightarrow \mathcal{L}(X)$, i.e. that the law of $Z_n$ converges weakly to the law of $X$.

Exercise 10.3.3. Let $\mu_n = N(0, \frac{1}{n})$ be a normal distribution with mean $0$ and variance $\frac{1}{n}$. Does the sequence $\{\mu_n\}$ converge weakly to some probability measure? If yes, to what measure?

Exercise 10.3.4. Prove that weak limits, if they exist, are unique. That is, if $\mu, \nu, \mu_1, \mu_2, \ldots$ are probability measures, and $\mu_n \Rightarrow \mu$, and also $\mu_n \Rightarrow \nu$, then $\mu = \nu$.

Exercise 10.3.5. Let $\mu_n$ be the Poisson($n$) distribution, and let $\mu$ be the Poisson(5) distribution. Show explicitly that each of the four conditions of Theorem 10.1.1 is violated.

Exercise 10.3.6. Let $a_1, a_2, \ldots$ be any sequence of non-negative real numbers with $\sum_i a_i = 1$. Define the discrete measure $\mu$ by $\mu(\cdot) = \sum_{i \in \mathbf{N}} a_i \, \delta_i(\cdot)$, where $\delta_i(\cdot)$ is a point mass at the positive integer $i$. Construct a sequence $\{\mu_n\}$ of probability measures, each having a density with respect to Lebesgue measure, such that $\mu_n \Rightarrow \mu$.

Exercise 10.3.7. Let $\mathcal{L}(Y) = \mu$, where $\mu$ has continuous density $f$. For $n \in \mathbf{N}$, let $Y_n = \lfloor nY \rfloor / n$, and let $\mu_n = \mathcal{L}(Y_n)$.
(a) Describe $\mu_n$ explicitly.
(b) Prove that $\mu_n \Rightarrow \mu$.
(c) Is $\mu_n$ discrete, or absolutely continuous, or neither? What about $\mu$?
Exercise 10.3.8. Prove that the following are equivalent.
(1) $\mu_n \Rightarrow \mu$.
(2) $\int f \, d\mu_n \to \int f \, d\mu$ for all non-negative bounded continuous $f: \mathbf{R} \to \mathbf{R}$.
(3) $\int f \, d\mu_n \to \int f \, d\mu$ for all non-negative continuous $f: \mathbf{R} \to \mathbf{R}$ with compact support, i.e. such that there are finite $a$ and $b$ with $f(x) = 0$ for all $x < a$ and all $x > b$.
(4) $\int f \, d\mu_n \to \int f \, d\mu$ for all continuous $f: \mathbf{R} \to \mathbf{R}$ with compact support.
(5) $\int f \, d\mu_n \to \int f \, d\mu$ for all non-negative continuous $f: \mathbf{R} \to \mathbf{R}$ which vanish at infinity, i.e. such that $\lim_{x \to -\infty} f(x) = \lim_{x \to \infty} f(x) = 0$.
(6) $\int f \, d\mu_n \to \int f \, d\mu$ for all continuous $f: \mathbf{R} \to \mathbf{R}$ which vanish at infinity.
[Hints: You may assume the fact that all continuous functions on $\mathbf{R}$ which have compact support or vanish at infinity are bounded. Then, showing that (1) $\Rightarrow$ each of (4)-(6), and that each of (4)-(6) $\Rightarrow$ (3), is easy. For (2) $\Rightarrow$ (1), note that if $|f| \leq M$, then $f + M$ is non-negative. For (3) $\Rightarrow$ (2), note that if $f$ is non-negative bounded continuous and $m \in \mathbf{Z}$, then $f_m = f \, \mathbf{1}_{[m, m+1)}$ is non-negative bounded with compact support and is "nearly" continuous; then recall Figure 10.1.2, and that $f = \sum_{m \in \mathbf{Z}} f_m$.]

Exercise 10.3.9. Let $0 < M < \infty$, and let $f, f_1, f_2, \ldots : [0,1] \to [0, M]$ be Borel-measurable functions with $\int_0^1 f \, d\lambda = \int_0^1 f_n \, d\lambda = 1$. Suppose $\lim_n f_n(x) = f(x)$ for each fixed $x \in [0,1]$. Define probability measures $\mu, \mu_1, \mu_2, \ldots$ by $\mu(A) = \int_A f \, d\lambda$ and $\mu_n(A) = \int_A f_n \, d\lambda$, for Borel $A \subseteq [0,1]$. Prove that $\mu_n \Rightarrow \mu$.

Exercise 10.3.10. Let $f: [0,1] \to (0, \infty)$ be a continuous function such that $\int_0^1 f \, d\lambda = 1$ (where $\lambda$ is Lebesgue measure on $[0,1]$). Define probability measures $\mu$ and $\{\mu_n\}$ by $\mu(A) = \int_0^1 f \, \mathbf{1}_A \, d\lambda$ and
$$\mu_n(A) = \sum_{i=1}^n f(i/n) \, \mathbf{1}_A(i/n) \Big/ \sum_{i=1}^n f(i/n).$$
(a) Prove that $\mu_n \Rightarrow \mu$. [Hint: Recall Riemann sums from calculus.]
(b) Explicitly construct random variables $Y$ and $\{Y_n\}$ so that $\mathcal{L}(Y) = \mu$, $\mathcal{L}(Y_n) = \mu_n$, and $Y_n \to Y$ with probability $1$. [Hint: Remember the proof of Theorem 10.1.1.]
10.4. Section summary.

This section introduced the notion of weak convergence. It proved equivalences of weak convergence in terms of convergence of expectations of bounded continuous functions, convergence of probabilities, convergence of cumulative distribution functions, and the existence of corresponding random variables which converge with probability $1$. It proved that if random variables converge in probability (or with probability $1$), then their laws converge weakly. Weak convergence will be very important in the next section, including allowing for a precise statement and proof of the Central Limit Theorem (Theorem 11.2.2 on page 134).
11. Characteristic functions.

Given a random variable $X$, we define its characteristic function (or Fourier transform) by
$$\phi_X(t) = E(e^{itX}) = E[\cos(tX)] + i \, E[\sin(tX)], \qquad t \in \mathbf{R}.$$
The characteristic function is thus a function from the real numbers to the complex numbers. Of course, by the Change of Variable Theorem (Theorem 6.1.1), $\phi_X(t)$ depends only on the distribution of $X$. We sometimes write $\phi_X(t)$ as $\phi(t)$.

The characteristic function is clearly very similar to the moment generating function introduced earlier; the only difference is the appearance of the imaginary number $i = \sqrt{-1}$ in the exponent. However, this change is significant: since $|e^{itX}| = 1$ for any (real) $t$ and $X$, the triangle inequality implies that $|\phi_X(t)| \leq 1 < \infty$ for all $t$ and all random variables $X$. This is quite a contrast to the case of moment generating functions, which could be infinite for any $s \neq 0$.

As for moment generating functions, we have $\phi_X(0) = 1$ for any $X$, and if $X$ and $Y$ are independent then $\phi_{X+Y}(t) = \phi_X(t) \, \phi_Y(t)$ by (4.2.7).

We further note that, with $\mu = \mathcal{L}(X)$, we have
$$|\phi_X(t+h) - \phi_X(t)| = \Big| \int \big( e^{i(t+h)x} - e^{itx} \big) \, \mu(dx) \Big| \leq \int \big| e^{i(t+h)x} - e^{itx} \big| \, \mu(dx) = \int |e^{ihx} - 1| \, \mu(dx).$$
Now, as $h \to 0$, this last quantity decreases to $0$ by the bounded convergence theorem (since $|e^{ihx} - 1| \leq 2$). We conclude that $\phi_X$ is always a (uniformly) continuous function.
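Characteristic functions are also easy to approximate by simulation, since $\phi_X(t)$ is just an expected value. The following Python sketch (ours; the sample size and seed are arbitrary) does this for $X \sim \text{Exponential}(1)$, whose characteristic function is $1/(1 - it)$ (cf. Exercise 11.5.11(c)), and also displays $|\phi_X(t)| \leq 1$:

```python
import numpy as np

# Monte Carlo approximation of phi_X(t) = E[e^{itX}] for X ~ Exponential(1);
# the exact characteristic function is 1/(1 - it).
rng = np.random.default_rng(1)
X = rng.exponential(scale=1.0, size=10**6)
for t in [0.0, 0.5, 1.0, 5.0]:
    approx = np.mean(np.exp(1j * t * X))
    exact = 1.0 / (1.0 - 1j * t)
    print(f"t = {t:3.1f}   MC ≈ {approx:.4f}   exact = {exact:.4f}   |phi(t)| = {abs(exact):.4f}")
```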
The derivatives of $\phi_X$ are also straightforward. The following proposition is somewhat similar to the corresponding result for $M_X(s)$ (Theorem 9.3.3), except that here we do not require a severe condition like "$M_X(s) < \infty$ for all $|s| \leq s_0$".

Proposition 11.0.1. Suppose $X$ is a random variable with $E(|X|^k) < \infty$. Then for $0 \leq j \leq k$, $\phi_X$ has finite $j^{\text{th}}$ derivative, given by $\phi_X^{(j)}(t) = E[(iX)^j e^{itX}]$. In particular, $\phi_X^{(j)}(0) = i^j \, E(X^j)$.

Proof. We proceed by induction on $j$. The case $j = 0$ is trivial. Assume now that the statement is true for $j - 1$. For $t \in \mathbf{R}$, let $F_t = (iX)^{j-1} e^{itX}$, so that $|F_t'| = |(iX)^j e^{itX}| = |X|^j$. Since $E(|X|^k) < \infty$, therefore also $E(|X|^j) < \infty$. It thus follows from Proposition 9.2.1 that
$$\phi_X^{(j)}(t) = \frac{d}{dt} \, E\big[ (iX)^{j-1} e^{itX} \big] = E\Big[ \frac{d}{dt} (iX)^{j-1} e^{itX} \Big] = E\big[ (iX)^j e^{itX} \big]. \;\blacksquare$$

11.1. The continuity theorem.

In this subsection we shall prove the continuity theorem for characteristic functions (Theorem 11.1.14), which says that if characteristic functions converge pointwise, then the corresponding distributions converge weakly: $\mu_n \Rightarrow \mu$ if and only if $\phi_n(t) \to \phi(t)$ for all $t$. This is a very important result; for example, it is used to prove the central limit theorem in the next subsection. Unfortunately, the proof is somewhat technical; we must show that characteristic functions completely determine the corresponding distribution (Theorem 11.1.1 and Corollary 11.1.7 below), and must also establish a simple criterion for weak convergence of "tight" measures (Corollary 11.1.11).

We begin with an inversion theorem, which tells how to recover information about a probability distribution from its characteristic function.

Theorem 11.1.1. (Fourier inversion theorem) Let $\mu$ be a Borel probability measure on $\mathbf{R}$, with characteristic function $\phi(t) = \int_{\mathbf{R}} e^{itx} \, \mu(dx)$. If $a < b$ and $\mu\{a\} = \mu\{b\} = 0$, then
$$\mu([a,b]) = \lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \, \phi(t) \, dt.$$

To prove Theorem 11.1.1, we use two computational lemmas.

Lemma 11.1.2. For $T > 0$ and $a < b$,
$$\int_{\mathbf{R}} \int_{-T}^{T} \Big| \frac{e^{-ita} - e^{-itb}}{it} \, e^{itx} \Big| \, dt \, \mu(dx) \leq 2T(b-a) < \infty.$$
Proof. We first note by the triangle inequality that
$$\Big| \frac{e^{-ita} - e^{-itb}}{it} \, e^{itx} \Big| = \Big| \frac{e^{-ita} - e^{-itb}}{it} \Big| = \Big| \int_a^b e^{-itr} \, dr \Big| \leq \int_a^b |e^{-itr}| \, dr = \int_a^b 1 \, dr = b - a.$$
Hence,
$$\int_{\mathbf{R}} \int_{-T}^{T} \Big| \frac{e^{-ita} - e^{-itb}}{it} \, e^{itx} \Big| \, dt \, \mu(dx) \leq \int_{\mathbf{R}} \int_{-T}^{T} (b-a) \, dt \, \mu(dx) = \int_{\mathbf{R}} 2T(b-a) \, \mu(dx) = 2T(b-a). \;\blacksquare$$

Lemma 11.1.3. For $T > 0$ and $\theta \in \mathbf{R}$,
$$\lim_{T \to \infty} \int_{-T}^{T} \frac{\sin(\theta t)}{t} \, dt = \pi \, \mathrm{sign}(\theta), \qquad (11.1.4)$$
where $\mathrm{sign}(\theta) = 1$ for $\theta > 0$, $\mathrm{sign}(\theta) = -1$ for $\theta < 0$, and $\mathrm{sign}(0) = 0$. Furthermore, there is $M < \infty$ such that $\big| \int_{-T}^{T} [\sin(\theta t)/t] \, dt \big| \leq M$ for all $T > 0$ and $\theta \in \mathbf{R}$.

Proof (optional). When $\theta = 0$ both sides of (11.1.4) vanish, so assume $\theta \neq 0$. Making the substitution $s = |\theta| t$, $dt = ds/|\theta|$, gives
$$\int_{-T}^{T} \frac{\sin(\theta t)}{t} \, dt = \mathrm{sign}(\theta) \int_{-T}^{T} \frac{\sin(|\theta| t)}{t} \, dt = \mathrm{sign}(\theta) \int_{-|\theta| T}^{|\theta| T} \frac{\sin s}{s} \, ds, \qquad (11.1.5)$$
and hence
$$\lim_{T \to \infty} \int_{-T}^{T} \frac{\sin(\theta t)}{t} \, dt = 2 \, \mathrm{sign}(\theta) \int_0^{\infty} \frac{\sin s}{s} \, ds. \qquad (11.1.6)$$
Furthermore,
$$\int_0^{\infty} \frac{\sin s}{s} \, ds = \int_0^{\infty} (\sin s) \Big( \int_0^{\infty} e^{-us} \, du \Big) \, ds = \int_0^{\infty} \Big( \int_0^{\infty} (\sin s) \, e^{-us} \, ds \Big) \, du.$$
Now, for $u > 0$, integrating by parts twice,
$$I_u \equiv \int_0^{\infty} (\sin s) \, e^{-us} \, ds = (-\cos s) \, e^{-us} \Big|_{s=0}^{\infty} - \int_0^{\infty} (-\cos s)(-u) \, e^{-us} \, ds$$
$$= 0 - (-1) - u \Big( (\sin s) \, e^{-us} \Big|_{s=0}^{\infty} - \int_0^{\infty} (\sin s)(-u) \, e^{-us} \, ds \Big) = 1 - u^2 I_u.$$
Hence, $I_u = 1 - u^2 I_u$, so $I_u = 1/(1 + u^2)$. We then compute that
$$\int_0^{\infty} \frac{\sin s}{s} \, ds = \int_0^{\infty} I_u \, du = \int_0^{\infty} \frac{1}{1 + u^2} \, du = \arctan(u) \Big|_0^{\infty} = \arctan(\infty) - \arctan(0) = \pi/2 - 0 = \pi/2.$$
Combining this with (11.1.6) gives (11.1.4). Finally, since convergent sequences are bounded, it follows from (11.1.4) that the set $\big\{ \int_{-T}^{T} [\sin(t)/t] \, dt \big\}_{T > 0}$ is bounded. It then follows from (11.1.5) that the set $\big\{ \int_{-T}^{T} [\sin(\theta t)/t] \, dt \big\}_{T > 0, \, \theta \in \mathbf{R}}$ is bounded as well. ■

Proof of Theorem 11.1.1. We compute that
$$\frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \, \phi(t) \, dt = \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \Big( \int_{\mathbf{R}} e^{itx} \, \mu(dx) \Big) \, dt \quad \text{[by definition of } \phi(t) \text{]}$$
$$= \frac{1}{2\pi} \int_{\mathbf{R}} \int_{-T}^{T} \frac{e^{it(x-a)} - e^{it(x-b)}}{it} \, dt \, \mu(dx) \quad \text{[by Fubini and Lemma 11.1.2]}$$
$$= \frac{1}{2\pi} \int_{\mathbf{R}} \int_{-T}^{T} \frac{\sin(t(x-a)) - \sin(t(x-b))}{t} \, dt \, \mu(dx) \quad \text{[since the cosine terms give an odd integrand in } t \text{, which integrates to } 0 \text{]}.$$
Hence, we may use Lemma 11.1.3, together with the bounded convergence theorem and Remark 9.1.9, to conclude that
$$\lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \, \phi(t) \, dt = \frac{1}{2\pi} \int_{-\infty}^{\infty} \pi \, \big[ \mathrm{sign}(x-a) - \mathrm{sign}(x-b) \big] \, \mu(dx) = \int_{-\infty}^{\infty} \frac{1}{2} \big[ \mathrm{sign}(x-a) - \mathrm{sign}(x-b) \big] \, \mu(dx).$$
(The last equality follows because $\frac{1}{2} [\mathrm{sign}(x-a) - \mathrm{sign}(x-b)]$ is equal to $0$ if $x < a$ or $x > b$; is equal to $1/2$ if $x = a$ or $x = b$; and is equal to $1$ if $a < x < b$.) But if $\mu\{a\} = \mu\{b\} = 0$, then this is precisely equal to $\mu([a,b])$, as claimed. ■
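The inversion formula can be sanity-checked numerically. The Python sketch below (ours; it simply truncates the limit at a large $T$ and uses a Riemann sum) applies the formula to the standard normal distribution, whose characteristic function is $e^{-t^2/2}$ (Proposition 11.2.1 below):

```python
import numpy as np
from math import erf, sqrt

# Truncated, discretized version of the inversion integral of Theorem 11.1.1
# for mu = N(0,1), phi(t) = exp(-t^2/2); compare with mu([a, b]).
def inversion(a, b, T=50.0, n=400001):
    t = np.linspace(-T, T, n)
    t[t == 0] = 1e-12            # the integrand tends to b - a as t -> 0
    integrand = (np.exp(-1j*t*a) - np.exp(-1j*t*b)) / (1j*t) * np.exp(-t**2/2)
    return (integrand.sum() * (2*T / (n - 1))).real / (2*np.pi)

a, b = -1.0, 2.0
exact = 0.5 * (erf(b/sqrt(2)) - erf(a/sqrt(2)))   # Phi(b) - Phi(a)
print(f"inversion ≈ {inversion(a, b):.6f}   exact mu([a,b]) = {exact:.6f}")
```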
From Theorem 11.1.1 easily follows the important

Corollary 11.1.7. (Fourier uniqueness theorem) Let $X$ and $Y$ be random variables. Then $\phi_X(t) = \phi_Y(t)$ for all $t \in \mathbf{R}$ if and only if $\mathcal{L}(X) = \mathcal{L}(Y)$, i.e. if and only if $X$ and $Y$ have the same distribution.

Proof. Suppose $\phi_X(t) = \phi_Y(t)$ for all $t \in \mathbf{R}$. From the theorem, we know that $P(a \leq X \leq b) = P(a \leq Y \leq b)$ provided that $P(X = a) = P(X = b) = P(Y = a) = P(Y = b) = 0$, i.e. for all but countably many choices of $a$ and $b$. But then by taking limits and using continuity of probabilities, we see that $P(X \in I) = P(Y \in I)$ for all intervals $I \subseteq \mathbf{R}$. It then follows from uniqueness of extensions (Proposition 2.5.8) that $\mathcal{L}(X) = \mathcal{L}(Y)$. Conversely, if $\mathcal{L}(X) = \mathcal{L}(Y)$, then Corollary 6.1.3 implies that $E(e^{itX}) = E(e^{itY})$, i.e. $\phi_X(t) = \phi_Y(t)$ for all $t \in \mathbf{R}$. ■

This last result makes the continuity theorem at least plausible. However, to prove the continuity theorem we require some further results.

Lemma 11.1.8. (Helly Selection Principle) Let $\{F_n\}$ be a sequence of cumulative distribution functions (i.e. $F_n(x) = \mu_n((-\infty, x])$ for some probability distribution $\mu_n$). Then there is a subsequence $\{F_{n_k}\}$, and a non-decreasing right-continuous function $F$ with $0 \leq F \leq 1$, such that $\lim_{k \to \infty} F_{n_k}(x) = F(x)$ for all $x \in \mathbf{R}$ such that $F$ is continuous at $x$. [On the other hand, we might not have $\lim_{x \to -\infty} F(x) = 0$ or $\lim_{x \to \infty} F(x) = 1$.]

Proof. Since the rationals are countable, we can write them as $\mathbf{Q} = \{q_1, q_2, \ldots\}$. Since $0 \leq F_n(q_1) \leq 1$ for all $n$, the Bolzano-Weierstrass theorem (see page 204) says there is at least one subsequence $\{\ell_k^{(1)}\}$ such that $\lim_{k \to \infty} F_{\ell_k^{(1)}}(q_1)$ exists. Then, there is a further subsequence $\{\ell_k^{(2)}\}$ (i.e., $\{\ell_k^{(2)}\}$ is a subsequence of $\{\ell_k^{(1)}\}$) such that $\lim_{k \to \infty} F_{\ell_k^{(2)}}(q_2)$ exists (but also $\lim_{k \to \infty} F_{\ell_k^{(2)}}(q_1)$ exists, since $\{\ell_k^{(2)}\}$ is a subsequence of $\{\ell_k^{(1)}\}$). Continuing, for each $m \in \mathbf{N}$ there is a further subsequence $\{\ell_k^{(m)}\}$ such that $\lim_{k \to \infty} F_{\ell_k^{(m)}}(q_j)$ exists for $j \leq m$. We now define the subsequence we want by $n_k = \ell_k^{(k)}$, i.e. we take the $k^{\text{th}}$ element of the $k^{\text{th}}$ subsequence. (This trick is called the diagonalisation method.) Since, for each $m$, $\{n_k\}$ is a subsequence of $\{\ell_k^{(m)}\}$ from the $m^{\text{th}}$ term onwards, this ensures that $\lim_{k \to \infty} F_{n_k}(q) \equiv G(q)$ exists for each $q \in \mathbf{Q}$. Since each $F_n$ is non-decreasing, $G$ is also non-decreasing.

To continue, we set $F(x) = \inf\{G(q);\; q \in \mathbf{Q},\; q > x\}$. Then $F$ is easily seen to be non-decreasing, with $0 \leq F(x) \leq 1$. Furthermore, $F$ is right-continuous, since if $\{x_n\} \searrow x$ then $\{q \in \mathbf{Q} : q > x_n\} \nearrow \{q \in \mathbf{Q} : q > x\}$, and hence $F(x_n) \to F(x)$ as in Exercise 3.6.4. Also, since $G$ is non-decreasing, we have $F(q) \geq G(q)$ for all $q \in \mathbf{Q}$.
Now, if $F$ is continuous at $x$, then given $\epsilon > 0$ we can find rational numbers $r$, $s$, and $u$ with $r < u < x < s$, and with $F(s) - F(r) < \epsilon$. We then note that
$$F(x) - \epsilon \leq F(r) = \inf_{q > r} G(q) = \inf_{q > r} \lim_k F_{n_k}(q) = \inf_{q > r} \liminf_k F_{n_k}(q)$$
$$\leq \liminf_k F_{n_k}(u) \quad [\text{since } u \in \mathbf{Q},\; u > r]$$
$$\leq \liminf_k F_{n_k}(x) \quad [\text{since } x > u]$$
$$\leq \limsup_k F_{n_k}(x)$$
$$\leq \limsup_k F_{n_k}(s) \quad [\text{since } s > x]$$
$$= G(s) \leq F(s) \leq F(x) + \epsilon.$$
This is true for any $\epsilon > 0$; hence we must have
$$\liminf_k F_{n_k}(x) = \limsup_k F_{n_k}(x) = F(x),$$
so that $\lim_k F_{n_k}(x) = F(x)$, as claimed. ■

Unfortunately, Lemma 11.1.8 does not ensure that $\lim_{x \to \infty} F(x) = 1$ or $\lim_{x \to -\infty} F(x) = 0$ (see e.g. Exercise 11.5.1). To rectify this, we require a new notion. We say that a collection $\{\mu_n\}$ of probability measures on $\mathbf{R}$ is tight if for all $\epsilon > 0$, there are $a < b$ with $\mu_n([a,b]) \geq 1 - \epsilon$ for all $n$. That is, all of the measures give most of their mass to the same finite interval; mass does not "escape off to infinity".

Exercise 11.1.9. Prove that:
(a) any finite collection of probability measures is tight;
(b) the union of two tight collections of probability measures is tight;
(c) any sub-collection of a tight collection is tight.
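Tightness is easy to visualize numerically. The following sketch (ours; the interval $[-10,10]$ is an arbitrary choice) reports the mass each measure assigns to a fixed interval: for $\mu_n = N(0, 1/n)$ the mass stays near $1$, whereas for the point masses $\mu_n = \delta_n$ of Exercise 11.5.1 it escapes to infinity:

```python
from scipy import stats

# Mass assigned to [-10, 10]: near 1 for the tight family N(0, 1/n),
# but eventually 0 for the point masses delta_n.
for n in [1, 10, 100, 1000]:
    normal = stats.norm(loc=0, scale=(1 / n) ** 0.5)
    mass_normal = normal.cdf(10) - normal.cdf(-10)
    mass_delta = 1.0 if n <= 10 else 0.0       # delta_n([-10, 10])
    print(f"n = {n:4d}   N(0,1/n) mass = {mass_normal:.6f}   delta_n mass = {mass_delta}")
```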
We then have the following.

Theorem 11.1.10. If $\{\mu_n\}$ is a tight sequence of probability measures, then there is a subsequence $\{\mu_{n_k}\}$ and a probability measure $\mu$, such that $\mu_{n_k} \Rightarrow \mu$, i.e. $\{\mu_{n_k}\}$ converges weakly to $\mu$.

Proof. Let $F_n(x) = \mu_n((-\infty, x])$. Then by Lemma 11.1.8, there is a subsequence $\{F_{n_k}\}$ and a function $F$ such that $F_{n_k}(x) \to F(x)$ at all continuity points of $F$; furthermore $0 \leq F \leq 1$. We now claim that $F$ is actually a probability distribution function, i.e. that $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$. Indeed, let $\epsilon > 0$. Then using tightness, we can find points $a < b$ which are continuity points of $F$, such that $\mu_n((a, b]) \geq 1 - \epsilon$ for all $n$. But then
$$\lim_{x \to \infty} F(x) - \lim_{x \to -\infty} F(x) \geq F(b) - F(a) = \lim_k \big[ F_{n_k}(b) - F_{n_k}(a) \big] = \lim_k \mu_{n_k}((a, b]) \geq 1 - \epsilon.$$
This is true for all $\epsilon > 0$, so we must have $\lim_{x \to \infty} F(x) - \lim_{x \to -\infty} F(x) = 1$, proving the claim. Hence, $F$ is indeed a probability distribution function. Thus, we can define the probability measure $\mu$ by $\mu((a, b]) = F(b) - F(a)$ for $a < b$. Then $\mu_{n_k} \Rightarrow \mu$ by Theorem 10.1.1, and we are done. ■

A main use of this theorem comes from the following corollary.

Corollary 11.1.11. Let $\{\mu_n\}$ be a tight sequence of probability distributions on $\mathbf{R}$. Suppose that $\mu$ is the only possible weak limit of $\{\mu_n\}$, in the sense that whenever $\mu_{n_k} \Rightarrow \nu$ then $\nu = \mu$ (that is, whenever a subsequence of the $\{\mu_n\}$ converges weakly to some probability measure, then that probability measure must be $\mu$). Then $\mu_n \Rightarrow \mu$, i.e. the full sequence converges weakly to $\mu$.

Proof. If $\mu_n \not\Rightarrow \mu$, then by Theorem 10.1.1, it is not the case that $\mu_n((-\infty, x]) \to \mu((-\infty, x])$ for all $x \in \mathbf{R}$ with $\mu\{x\} = 0$. Hence, we can find $x \in \mathbf{R}$, $\epsilon > 0$, and a subsequence $\{n_k\}$, with $\mu\{x\} = 0$, but with
$$\big| \mu_{n_k}((-\infty, x]) - \mu((-\infty, x]) \big| \geq \epsilon, \qquad k \in \mathbf{N}. \qquad (11.1.12)$$
On the other hand, $\{\mu_{n_k}\}$ is a subcollection of $\{\mu_n\}$ and hence tight, so by Theorem 11.1.10 there is a further subsequence $\{\mu_{n_{k_j}}\}$ which converges weakly to some probability measure, say $\nu$. But then by hypothesis we must have $\nu = \mu$, which is a contradiction to (11.1.12). ■

Corollary 11.1.11 is nearly the last thing we need to prove the continuity theorem. We require just one further result, concerning a sufficient condition for a sequence of measures to be tight.

Lemma 11.1.13. Let $\{\mu_n\}$ be a sequence of probability measures on $\mathbf{R}$, with characteristic functions $\phi_n(t) = \int e^{itx} \, \mu_n(dx)$. Suppose there is a function $g$ which is continuous at $0$, such that $\lim_n \phi_n(t) = g(t)$ for each $|t| \leq t_0$, for some $t_0 > 0$. Then $\{\mu_n\}$ is tight.
Proof. We first note that $g(0) = \lim_n \phi_n(0) = \lim_n 1 = 1$. We then compute that, for $y > 0$,
$$\mu_n\big( (-\infty, -\tfrac{2}{y}] \cup [\tfrac{2}{y}, \infty) \big) = \int_{|x| \geq 2/y} 1 \, \mu_n(dx) \leq \int_{|x| \geq 2/y} 2 \Big( 1 - \frac{\sin(yx)}{yx} \Big) \, \mu_n(dx) \leq \int_{\mathbf{R}} 2 \Big( 1 - \frac{\sin(yx)}{yx} \Big) \, \mu_n(dx)$$
$$= \frac{1}{y} \int_{\mathbf{R}} \int_{-y}^{y} \big( 1 - \cos(tx) \big) \, dt \, \mu_n(dx) = \frac{1}{y} \int_{-y}^{y} \int_{\mathbf{R}} \big( 1 - \cos(tx) \big) \, \mu_n(dx) \, dt = \frac{1}{y} \int_{-y}^{y} \big( 1 - \phi_n(t) \big) \, dt.$$
Here the first inequality uses that $1 - \frac{\sin(yx)}{yx} \geq \frac{1}{2}$ whenever $|x| \geq 2/y$ (since then $|\sin(yx)| \leq 1 \leq |yx|/2$); the second inequality uses that $\frac{\sin(yx)}{yx} \leq 1$, so the integrand is non-negative; the following equality uses that $\int_{-y}^{y} \cos(tx) \, dt = \frac{2 \sin(yx)}{x}$; the exchange of the integrals uses Fubini's theorem (which is justified since the function is bounded and hence has finite double integral); and the final equality uses that $\int_{-y}^{y} \sin(tx) \, dt = 0$, so that only the real part of $\phi_n(t)$ contributes.

To finish the proof, let $\epsilon > 0$. Since $g(0) = 1$ and $g$ is continuous at $0$, we can find $y_0$ with $0 < y_0 \leq t_0$ such that $|1 - g(t)| < \epsilon/4$ whenever $|t| \leq y_0$. Then $\big| \frac{1}{y_0} \int_{-y_0}^{y_0} (1 - g(t)) \, dt \big| \leq \epsilon/2$. Now, $\phi_n(t) \to g(t)$ for all $|t| \leq y_0$, and $|\phi_n(t)| \leq 1$. Hence, by the bounded convergence theorem, we can find $n_0 \in \mathbf{N}$ such that $\big| \frac{1}{y_0} \int_{-y_0}^{y_0} (1 - \phi_n(t)) \, dt \big| < \epsilon$ for all $n \geq n_0$. Hence, $\mu_n\big( (-\frac{2}{y_0}, \frac{2}{y_0}) \big) = 1 - \mu_n\big( (-\infty, -\frac{2}{y_0}] \cup [\frac{2}{y_0}, \infty) \big) \geq 1 - \epsilon$ for all $n \geq n_0$. It follows from the definition (together with Exercise 11.1.9(a) and (b), to handle the finitely many $n < n_0$) that $\{\mu_n\}$ is tight. ■

We are now, finally, in a position to prove the continuity theorem.

Theorem 11.1.14. (Continuity Theorem) Let $\mu, \mu_1, \mu_2, \ldots$ be probability measures, with corresponding characteristic functions $\phi, \phi_1, \phi_2, \ldots$. Then $\mu_n \Rightarrow \mu$ if and only if $\phi_n(t) \to \phi(t)$ for all $t \in \mathbf{R}$. In words, the probability measures $\{\mu_n\}$ converge weakly to $\mu$ if and only if their characteristic functions converge pointwise to that of $\mu$.

Proof. First, suppose that $\mu_n \Rightarrow \mu$. Then, since $\cos(tx)$ and $\sin(tx)$ are bounded continuous functions of $x$, we have, as $n \to \infty$, for each $t \in \mathbf{R}$, that
$$\phi_n(t) = \int \cos(tx) \, \mu_n(dx) + i \int \sin(tx) \, \mu_n(dx) \to \int \cos(tx) \, \mu(dx) + i \int \sin(tx) \, \mu(dx) = \phi(t).$$
Conversely, suppose that $\phi_n(t) \to \phi(t)$ for each $t \in \mathbf{R}$. Then by Lemma 11.1.13 (with $g = \phi$), the $\{\mu_n\}$ are tight. Now, suppose that we have $\mu_{n_k} \Rightarrow \nu$ for some subsequence $\{\mu_{n_k}\}$ and some measure $\nu$. Then, from the previous paragraph we must have $\phi_{n_k}(t) \to \phi_\nu(t)$ for all $t$, where $\phi_\nu(t) = \int e^{itx} \, \nu(dx)$. On the other hand, we know that $\phi_{n_k}(t) \to \phi(t)$ for all $t$; hence, we must have $\phi_\nu = \phi$. But from Fourier uniqueness (Corollary 11.1.7), this implies that $\nu = \mu$. Hence, we have shown that $\mu$ is the only possible weak limit of the $\{\mu_n\}$. Therefore, from Corollary 11.1.11, we must have $\mu_n \Rightarrow \mu$, as claimed. ■

11.2. The Central Limit Theorem.

Now that we have proved the continuity theorem (Theorem 11.1.14), it is very easy to prove the classical central limit theorem. First, we compute the characteristic function of the standard normal distribution $N(0,1)$, i.e. of a random variable $X$ having density with respect to Lebesgue measure given by $f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. That is, we wish to compute
$$\phi_X(t) = \int_{-\infty}^{\infty} e^{itx} \, \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx.$$
Comparing with the computation leading to (9.3.2), we might expect that $\phi_X(t) = M_X(it) = e^{(it)^2/2} = e^{-t^2/2}$. This is in fact correct, and can be justified using the theory of complex analysis. But to avoid such technicalities, we instead resort to a trick.

Proposition 11.2.1. If $X \sim N(0,1)$, then $\phi_X(t) = e^{-t^2/2}$ for all $t \in \mathbf{R}$.

Proof. By Proposition 9.2.1 (with $F_t = e^{itX}$ and $Y = |X|$, so that $E(Y) < \infty$ and $|F_t'| = |(iX) e^{itX}| = |X| \leq Y$ for all $t$), we can differentiate under the integral sign, to obtain
$$\phi_X'(t) = \int_{-\infty}^{\infty} ix \, e^{itx} \, \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx = -\int_{-\infty}^{\infty} i \, e^{itx} \, \frac{1}{\sqrt{2\pi}} \, \frac{d}{dx}\big( e^{-x^2/2} \big) \, dx.$$
Integrating by parts gives
$$\phi_X'(t) = \int_{-\infty}^{\infty} i \, (it) \, e^{itx} \, \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx = -t \, \phi_X(t).$$
Hence, $\phi_X'(t) = -t \, \phi_X(t)$, so that $\frac{d}{dt} \log \phi_X(t) = -t$. Also, we know that $\log \phi_X(0) = \log 1 = 0$. Hence, we must have $\log \phi_X(t) = \int_0^t (-s) \, ds = -t^2/2$,
whence $\phi_X(t) = e^{-t^2/2}$. ■

We can now prove

Theorem 11.2.2. (Central Limit Theorem) Let $X_1, X_2, \ldots$ be i.i.d. with finite mean $m$ and finite variance $v$. Set $S_n = X_1 + \ldots + X_n$. Then as $n \to \infty$,
$$\mathcal{L}\Big( \frac{S_n - nm}{\sqrt{nv}} \Big) \Rightarrow \mu,$$
where $\mu = N(0,1)$ is the standard normal distribution, i.e. the distribution having density $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ with respect to Lebesgue measure.

Proof. By replacing $X_i$ by $\frac{X_i - m}{\sqrt{v}}$, we can (and do) assume that $m = 0$ and $v = 1$. Let $\phi_n(t) = E\big( e^{itS_n/\sqrt{n}} \big)$ be the characteristic function of $S_n/\sqrt{n}$. By the continuity theorem (Theorem 11.1.14), and by Proposition 11.2.1, it suffices to show that $\lim_n \phi_n(t) = e^{-t^2/2}$ for each fixed $t \in \mathbf{R}$.

To this end, set $\phi(t) = E(e^{itX_1})$. By Proposition 11.0.1, $\phi'(0) = i\,E(X_1) = 0$ and $\phi''(0) = i^2 E(X_1^2) = -1$. Then as $n \to \infty$, using a Taylor expansion,
$$\phi_n(t) = E\big( e^{it(X_1 + \ldots + X_n)/\sqrt{n}} \big) = \phi(t/\sqrt{n})^n = \Big( 1 + \frac{t}{\sqrt{n}} \phi'(0) + \frac{1}{2} \Big( \frac{t}{\sqrt{n}} \Big)^2 \phi''(0) + o(1/n) \Big)^n = \Big( 1 - \frac{t^2}{2n} + o(1/n) \Big)^n \to e^{-t^2/2},$$
as claimed. (Here $o(1/n)$ means a quantity $q_n$ such that $q_n/(1/n) \to 0$ as $n \to \infty$. Formally, the limit holds since for any $\epsilon > 0$, for sufficiently large $n$ we have $q_n \geq -\epsilon/n$ and also $q_n \leq \epsilon/n$, so that the liminf is $\geq e^{-(t^2/2) - \epsilon}$ and the limsup is $\leq e^{-(t^2/2) + \epsilon}$.) ■

Since the normal distribution has no points of positive measure, this theorem immediately implies (by Theorem 10.1.1) the simpler-seeming

Corollary 11.2.3. Let $\{X_n\}$, $m$, $v$, and $S_n$ be as above. Then for each fixed $x \in \mathbf{R}$,
$$\lim_{n \to \infty} P\Big( \frac{S_n - nm}{\sqrt{nv}} \leq x \Big) = \Phi(x), \qquad (11.2.4)$$
where $\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \, dt$ is the cumulative distribution function of the standard normal distribution.
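As an informal check of (11.2.4), here is a small simulation (ours: exponential summands, with arbitrary seed and sample sizes) that estimates the left-hand side at $x = 1$:

```python
import numpy as np
from scipy import stats

# Standardized sums of i.i.d. Exponential(1) variables (so m = v = 1),
# compared with the standard normal CDF at x = 1.
rng = np.random.default_rng(2)
x = 1.0
for n in [5, 50, 500]:
    S = rng.exponential(1.0, size=(20000, n)).sum(axis=1)
    estimate = np.mean((S - n) / np.sqrt(n) <= x)
    print(f"n = {n:3d}   P((S_n - nm)/sqrt(nv) <= 1) ≈ {estimate:.4f}   Phi(1) = {stats.norm.cdf(x):.4f}")
```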
Corollary 11.2.3 can also be written as $P(S_n \leq nm + x\sqrt{nv}) \to \Phi(x)$. That is, $X_1 + \ldots + X_n$ is approximately equal to $nm$, with deviations from this value of order $\sqrt{n}$.

For example, suppose $X_1, X_2, \ldots$ each have the Poisson(5) distribution. This implies that $m = E(X_i) = 5$ and $v = \mathrm{Var}(X_i) = 5$. Hence, for each fixed $x \in \mathbf{R}$, we see that
$$P\big( X_1 + \ldots + X_n \leq 5n + x\sqrt{5n} \big) \to \Phi(x), \qquad n \to \infty.$$

Remarks.
1. It is not essential in the Central Limit Theorem to divide by $\sqrt{v}$. Without doing so, the theorem asserts instead that
$$\mathcal{L}\Big( \frac{S_n - nm}{\sqrt{n}} \Big) \Rightarrow N(0, v).$$
2. Going backwards, the Central Limit Theorem in turn implies the WLLN, since if $y > m$, then as $n \to \infty$,
$$P[(S_n/n) \leq y] = P[S_n \leq ny] = P\big[ (S_n - nm)/\sqrt{nv} \leq (ny - nm)/\sqrt{nv} \big] \approx \Phi\big[ (ny - nm)/\sqrt{nv} \big] = \Phi\big[ \sqrt{n}(y - m)/\sqrt{v} \big] \to \Phi(+\infty) = 1,$$
and similarly if $y < m$ then $P[(S_n/n) \leq y] \to \Phi(-\infty) = 0$. Hence, $\mathcal{L}(S_n/n) \Rightarrow \delta_m$, and so $S_n/n$ converges to $m$ in probability.

11.3. Generalisations of the Central Limit Theorem.

The classical central limit theorem (Theorem 11.2.2 and Corollary 11.2.3) is extremely useful in many areas of science. However, it does have certain limitations. For example, it provides no quantitative bounds on the convergence in (11.2.4). Also, the insistence that the random variables be i.i.d. is sometimes too severe.

The first of these problems is solved by the Berry-Esseen Theorem, which states that if $X_1, X_2, \ldots$ are i.i.d. with finite mean $m$, finite positive variance $v$, and $E(|X_i - m|^3) = \rho < \infty$, then
$$\Big| P\Big( \frac{X_1 + \ldots + X_n - nm}{\sqrt{vn}} \leq x \Big) - \Phi(x) \Big| \leq \frac{3\rho}{v^{3/2}\sqrt{n}}.$$
This theorem thus provides a quantitative bound on the convergence in (11.2.4), depending only on the third moment. For a proof see e.g. Feller (1971, Section XVI.5). Note, however, that this error bound is absolute, not relative: as $x \to -\infty$, both $P\big( \frac{X_1 + \ldots + X_n - nm}{\sqrt{vn}} \leq x \big)$ and $\Phi(x)$ get small, and the Berry-Esseen Theorem says less and less.
In particular, the theorem does not assert that $P\big( \frac{X_1 + \ldots + X_n - nm}{\sqrt{vn}} \leq x \big)$ decreases as $O(e^{-x^2/2})$ as $x \to -\infty$, even though $\Phi(x)$ does. (We have already seen in Theorem 9.3.4 that $P\big( \frac{X_1 + \ldots + X_n}{n} \leq x \big)$ often decreases as $\rho^n$ for some $\rho < 1$.)

Regarding the second problem, we mention just two of many results. To state them, we shall consider collections $\{Z_{nk};\; n \geq 1,\; 1 \leq k \leq r_n\}$ of random variables such that each row $\{Z_{nk}\}_{1 \leq k \leq r_n}$ is independent, called triangular arrays. (If $r_n = n$ they form an actual triangle.) We shall assume for simplicity that $E(Z_{nk}) = 0$ for each $n$ and $k$. We shall further set $\sigma_{nk}^2 = E(Z_{nk}^2)$ (assumed to be finite), $S_n = Z_{n1} + \ldots + Z_{nr_n}$, and $s_n^2 = \mathrm{Var}(S_n) = \sigma_{n1}^2 + \ldots + \sigma_{nr_n}^2$.

For such a triangular array, the Lindeberg Central Limit Theorem states that $\mathcal{L}(S_n/s_n) \Rightarrow N(0,1)$, provided that for each $\epsilon > 0$, we have
$$\lim_{n \to \infty} \frac{1}{s_n^2} \sum_{k=1}^{r_n} E\big[ Z_{nk}^2 \, \mathbf{1}_{|Z_{nk}| \geq \epsilon s_n} \big] = 0. \qquad (11.3.1)$$
This Lindeberg condition states, roughly, that as $n \to \infty$, the tails of the $Z_{nk}$ contribute less and less to the variance of $S_n$.

Exercise 11.3.2. Consider the special case where $r_n = n$, with $Z_{nk} = \frac{1}{\sqrt{vn}} Y_k$, where $\{Y_n\}$ are i.i.d. with mean $0$ and variance $v < \infty$ (so $s_n = 1$).
(a) Prove that the Lindeberg condition (11.3.1) is satisfied in this case. [Hint: Use the Dominated Convergence Theorem.]
(b) Prove that the Lindeberg CLT implies Theorem 11.2.2.
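Exercise 11.3.2(a) asks for a proof, but the Lindeberg sum can also be estimated by simulation. In the i.i.d. setting of the exercise the sum reduces to $\frac{1}{v} E\big[ Y^2 \, \mathbf{1}_{|Y| \geq \epsilon \sqrt{vn}} \big]$; the sketch below (ours; centred exponential summands and an arbitrary $\epsilon$) shows it tending to $0$:

```python
import numpy as np

# Y = Exponential(1) - 1 has mean 0 and variance v = 1, and Z_nk = Y_k/sqrt(vn);
# the Lindeberg sum (11.3.1) then equals (1/v) E[Y^2 1{|Y| >= eps*sqrt(vn)}].
rng = np.random.default_rng(3)
Y = rng.exponential(1.0, size=10**6) - 1.0
eps = 0.1
for n in [10, 100, 1000, 10000]:
    lindeberg = np.mean(Y**2 * (np.abs(Y) >= eps * np.sqrt(n)))
    print(f"n = {n:6d}   Lindeberg sum ≈ {lindeberg:.2e}")
```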
This raises the question: if (11.3.1) is not satisfied, then what other limiting distributions may arise? Call a distribution $\mu$ a possible limit if there exists a triangular array as defined above, with $\sup_n s_n^2 < \infty$ and $\lim_{n \to \infty} \max_{1 \leq k \leq r_n} \sigma_{nk}^2 = 0$ (so that no one term dominates the contribution to $\mathrm{Var}(S_n)$), such that $\mathcal{L}(S_n) \Rightarrow \mu$. Then we can ask, what distributions are possible limits?

Obviously the normal distributions $N(0, v)$ are possible limits; indeed $\mathcal{L}(S_n) \Rightarrow N(0, v)$ whenever (11.3.1) is satisfied and $s_n^2 \to v$. But what else? The answer is that the possible limits are precisely the infinitely divisible distributions having mean $0$ and finite variance. Here a distribution $\mu$ is called infinitely divisible if for all $n \in \mathbf{N}$, there is a distribution $\nu_n$ such that the $n$-fold convolution of $\nu_n$ equals $\mu$ (in symbols: $\nu_n * \nu_n * \ldots * \nu_n = \mu$). Recall that this means that, if $X_1, X_2, \ldots, X_n \sim \nu_n$ are independent, then $X_1 + \ldots + X_n \sim \mu$.

Half of this theorem is obvious: indeed, if $\mu$ is infinitely divisible, then we can take $r_n = n$ and $\mathcal{L}(Z_{nk}) = \nu_n$ in the triangular array, to get that $\mathcal{L}(S_n) \Rightarrow \mu$. For a proof of the converse, see e.g. Billingsley (1995, Theorem 28.2).

11.4. Method of moments.

There is another way of proving weak convergence of probability measures, which does not explicitly use characteristic functions or the continuity theorem (though its proof of correctness does, through Corollary 11.1.11). Instead, it uses moments, as we now discuss.

Recall that a probability distribution $\mu$ on $\mathbf{R}$ has moments defined by $\alpha_k = \int x^k \, \mu(dx)$, for $k = 1, 2, 3, \ldots$. Suppose these moments all exist and are all finite. Then is $\mu$ the only distribution having precisely these moments? And, if a sequence $\{\mu_n\}$ of distributions has moments which converge to those of $\mu$, does it then follow that $\mu_n \Rightarrow \mu$, i.e. that the $\mu_n$ converge weakly to $\mu$? We shall see in this section that such conclusions hold sometimes, but not always.

We shall say that a distribution $\mu$ is determined by its moments if all its moments are finite, and if no other distribution has identical moments. (That is, we have $\int |x|^k \, \mu(dx) < \infty$ for all $k \in \mathbf{N}$, and furthermore whenever $\int x^k \, \mu(dx) = \int x^k \, \nu(dx)$ for all $k \in \mathbf{N}$, then we must have $\nu = \mu$.) We first show that, for those distributions determined by their moments, convergence of moments implies weak convergence of distributions; this result thus reduces the second question above to the first question.

Theorem 11.4.1. Suppose that $\mu$ is determined by its moments. Let $\{\mu_n\}$ be a sequence of distributions, such that $\int x^k \, \mu_n(dx)$ is finite for all $n, k \in \mathbf{N}$, and such that $\lim_{n \to \infty} \int x^k \, \mu_n(dx) = \int x^k \, \mu(dx)$ for each $k \in \mathbf{N}$. Then $\mu_n \Rightarrow \mu$, i.e. the $\mu_n$ converge weakly to $\mu$.

Proof. We first claim that $\{\mu_n\}$ is tight. Indeed, since the moments converge to finite quantities, we can find $K_k \in \mathbf{R}$ with $\int x^k \, \mu_n(dx) \leq K_k$ for all $n \in \mathbf{N}$. But then, by Markov's inequality, letting $Y_n \sim \mu_n$, we have
$$\mu_n([-R, R]) = P(|Y_n| \leq R) = 1 - P(|Y_n| > R) = 1 - P(Y_n^2 > R^2) \geq 1 - \big( E[Y_n^2]/R^2 \big) \geq 1 - (K_2/R^2),$$
which is $\geq 1 - \epsilon$ whenever $R \geq \sqrt{K_2/\epsilon}$, thus proving tightness.

We now claim that if any subsequence $\{\mu_{n_r}\}$ converges weakly to some distribution $\nu$, then we must have $\nu = \mu$. The theorem will then follow from Corollary 11.1.11.
Indeed, suppose $\mu_{n_r} \Rightarrow \nu$. By Skorohod's theorem, we can find random variables $Y$ and $\{Y_r\}$ with $\mathcal{L}(Y) = \nu$ and $\mathcal{L}(Y_r) = \mu_{n_r}$, such that $Y_r \to Y$ with probability $1$. But then also $Y_r^k \to Y^k$ with probability $1$. Furthermore, for $k \in \mathbf{N}$ and $a > 0$, we have
$$E\big( |Y_r|^k \, \mathbf{1}_{|Y_r|^k \geq a} \big) \leq E\Big( \frac{|Y_r|^{2k}}{a} \, \mathbf{1}_{|Y_r|^k \geq a} \Big) \leq \frac{1}{a} E\big( |Y_r|^{2k} \big) = \frac{1}{a} E\big( (Y_r)^{2k} \big) \leq \frac{K_{2k}}{a},$$
which is independent of $r$, and goes to $0$ as $a \to \infty$. Hence, the $\{Y_r^k\}$ are uniformly integrable. Thus, by Theorem 9.1.6, $E(Y_r^k) \to E(Y^k)$, i.e. $\int x^k \, \mu_{n_r}(dx) \to \int x^k \, \nu(dx)$. But we already know that $\int x^k \, \mu_{n_r}(dx) \to \int x^k \, \mu(dx)$. Hence, the moments of $\nu$ and $\mu$ must coincide. And, since $\mu$ is determined by its moments, we must have $\nu = \mu$, as claimed. ■

This theorem leads to the question of which distributions are determined by their moments. Unfortunately, not all distributions are, as the following exercise shows.

Exercise 11.4.2. Let $f(x) = \frac{1}{x\sqrt{2\pi}} e^{-(\log x)^2/2}$ for $x > 0$ (with $f(x) = 0$ for $x \leq 0$) be the density function of the random variable $e^X$, where $X \sim N(0,1)$. Let $g(x) = f(x)\big( 1 + \sin(2\pi \log x) \big)$. Show that $g(x) \geq 0$, and that $\int x^k g(x) \, dx = \int x^k f(x) \, dx$ for $k = 0, 1, 2, \ldots$. [Hint: Consider $\int x^k f(x) \sin(2\pi \log x) \, dx$, and make the substitution $x = e^s e^k$, $dx = e^s e^k \, ds$.] Show further that $\int |x|^k f(x) \, dx < \infty$ for all $k \in \mathbf{N}$. Conclude that $g$ is a probability density function, and that $g$ gives rise to the same (finite) moments as does $f$. Relate this to Theorem 11.4.1 above and Theorem 11.4.3 below.
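Exercise 11.4.2 can be previewed numerically (an illustration of ours, not a substitute for the requested proof). With the substitution $x = e^s$, the $k^{\text{th}}$ moment under $f$ becomes $\int e^{ks} \varphi(s) \, ds$ (with $\varphi$ the standard normal density), and the perturbation from $g$ contributes $\int e^{ks} \varphi(s) \sin(2\pi s) \, ds$, which vanishes for every integer $k$:

```python
import numpy as np

# Plain Riemann-sum check: the k-th moment equals e^{k^2/2} for both f and g,
# since the sine term integrates to 0 for integer k.
s = np.linspace(-15.0, 30.0, 2_000_001)
ds = s[1] - s[0]
phi = np.exp(-s**2 / 2) / np.sqrt(2 * np.pi)     # standard normal density
for k in range(5):
    moment = np.sum(np.exp(k * s) * phi) * ds
    extra = np.sum(np.exp(k * s) * phi * np.sin(2 * np.pi * s)) * ds
    print(f"k = {k}   moment ≈ {moment:12.4f} (exact {np.exp(k**2 / 2):12.4f})   sine term ≈ {extra:.1e}")
```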
On the other hand, if a distribution's moment generating function is finite in a neighbourhood of the origin, then the distribution is determined by its moments, as we now show. (Unfortunately, the proof requires a result from complex function theory.)

Theorem 11.4.3. Let $s_0 > 0$, and let $X$ be a random variable with moment generating function $M_X(s)$ which is finite for $|s| \leq s_0$. Then $\mathcal{L}(X)$ is determined by its moments (and also by $M_X(s)$).

Proof (optional). Let $f_X(z) = E(e^{zX})$ for $z \in \mathbf{C}$. Since $|e^{zX}| = e^{X \, \mathrm{Re}\, z}$, we see that $f_X(z)$ is finite whenever $|\mathrm{Re}\, z| \leq s_0$. Furthermore, just as for $M_X(s)$, it follows that $f_X(z)$ is analytic on $\{z \in \mathbf{C};\; |\mathrm{Re}\, z| < s_0\}$. Now, if $Y$ has the same moments as does $X$, then for $|s| \leq s_0$, we have by the order-preserving property and countable linearity that
$$E(e^{sY}) \leq E(e^{sY} + e^{-sY}) = 2\Big( 1 + \frac{s^2}{2!} E(Y^2) + \frac{s^4}{4!} E(Y^4) + \ldots \Big) = M_X(s) + M_X(-s) < \infty,$$
where the last equality uses that the moments of $Y$ equal those of $X$. Hence, $M_Y(s) < \infty$ for $|s| \leq s_0$. It now follows from Theorem 9.3.3 that $M_Y(s) = M_X(s)$ for $|s| < s_0$, i.e. that $f_X(s) = f_Y(s)$ for real $|s| < s_0$. By the uniqueness of analytic continuation, this implies that $f_Y(z) = f_X(z)$ for $|\mathrm{Re}\, z| < s_0$. In particular, since $\phi_X(t) = f_X(it)$ and $\phi_Y(t) = f_Y(it)$, we have $\phi_X = \phi_Y$. Hence, by the uniqueness theorem for characteristic functions (Corollary 11.1.7), we must have $\mathcal{L}(Y) = \mathcal{L}(X)$, as claimed. ■

Remark 11.4.4. Proving weak convergence by showing convergence of moments is called the method of moments. (This should not be confused with the statistical estimation procedure of the same name, which estimates unknown parameters by choosing them to make observed moments equal theoretical ones.) Indeed, it is possible to prove the central limit theorem in this manner, under appropriate assumptions.

Remark 11.4.5. By similar reasoning, it is possible to show that if $M_{X_n}(s) < \infty$ and $M_X(s) < \infty$ for all $n \in \mathbf{N}$ and $|s| \leq s_0$, and also $M_{X_n}(s) \to M_X(s)$ for all $|s| \leq s_0$, then we must have $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$.
11.5. Exercises.

Exercise 11.5.1. Let $\mu_n = \delta_n$ be a point mass at $n$ (for $n = 1, 2, \ldots$).
(a) Is $\{\mu_n\}$ tight?
(b) Does there exist a subsequence $\{\mu_{n_k}\}$, and a Borel probability measure $\mu$, such that $\mu_{n_k} \Rightarrow \mu$? (If so, then specify $\{n_k\}$ and $\mu$.) Relate this to theorems from this section.
(c) Setting $F_n(x) = \mu_n((-\infty, x])$, does there exist a non-decreasing, right-continuous function $F$ such that $F_n(x) \to F(x)$ for all continuity points $x$ of $F$? (If so, then specify $F$.) Relate this to the Helly Selection Principle.
(d) Repeat part (c) for the case where $\mu_n = \delta_{-n}$ is a point mass at $-n$.

Exercise 11.5.2. Let $\mu_n = \delta_{n \bmod 3}$ be a point mass at $n \bmod 3$. (Thus, $\mu_1 = \delta_1$, $\mu_2 = \delta_2$, $\mu_3 = \delta_0$, $\mu_4 = \delta_1$, $\mu_5 = \delta_2$, $\mu_6 = \delta_0$, etc.)
(a) Is $\{\mu_n\}$ tight?
(b) Does there exist a Borel probability measure $\mu$, such that $\mu_n \Rightarrow \mu$? (If so, then specify $\mu$.)
(c) Does there exist a subsequence $\{\mu_{n_k}\}$, and a Borel probability measure $\mu$, such that $\mu_{n_k} \Rightarrow \mu$? (If so, then specify $\{n_k\}$ and $\mu$.)
(d) Relate parts (b) and (c) to theorems from this section.

Exercise 11.5.3. Let $\{x_n\}$ be any sequence of points in the interval $[0,1]$. Let $\mu_n = \delta_{x_n}$ be a point mass at $x_n$.
(a) Is $\{\mu_n\}$ tight?
(b) Does there exist a subsequence $\{\mu_{n_k}\}$, and a Borel probability measure $\mu$, such that $\mu_{n_k} \Rightarrow \mu$? [Hint: by compactness, there must be a subsequence of points $\{x_{n_k}\}$ which converges, say to $y \in [0,1]$. Then what does $\mu_{n_k}$ converge to?]

Exercise 11.5.4. Let $\mu_{2n} = \delta_0$, and let $\mu_{2n+1} = \delta_n$, for $n = 0, 1, 2, \ldots$.
(a) Does there exist a Borel probability measure $\mu$, such that $\mu_n \Rightarrow \mu$?
(b) Suppose for some subsequence $\{\mu_{n_k}\}$ and some Borel probability measure $\nu$, we have $\mu_{n_k} \Rightarrow \nu$. What must $\nu$ be?
(c) Relate parts (a) and (b) to Corollary 11.1.11. Why is there no contradiction?

Exercise 11.5.5. Let $\mu_n = \text{Uniform}[0, n]$, so $\mu_n([a,b]) = (b-a)/n$ for $0 \leq a \leq b \leq n$.
(a) Prove or disprove that $\{\mu_n\}$ is tight.
(b) Prove or disprove that there is some probability measure $\mu$ such that $\mu_n \Rightarrow \mu$.

Exercise 11.5.6. Suppose $\mu_n \Rightarrow \mu$. Prove or disprove that $\{\mu_n\}$ must be tight.

Exercise 11.5.7. Define the Borel probability measure $\mu_n$ by $\mu_n(\{x\}) = 1/n$, for $x = 0, \frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}$. Let $\lambda$ be Lebesgue measure on $[0,1]$.
(a) Compute $\phi_n(t) = \int e^{itx} \, \mu_n(dx)$, the characteristic function of $\mu_n$.
(b) Compute $\phi(t) = \int e^{itx} \, \lambda(dx)$, the characteristic function of $\lambda$.
(c) Does $\phi_n(t) \to \phi(t)$, for each $t \in \mathbf{R}$?
(d) What does the result in part (c) imply?

Exercise 11.5.8. Use characteristic functions to provide an alternative solution of Exercise 10.3.2.

Exercise 11.5.9. Use characteristic functions to provide an alternative solution of Exercise 10.3.3.
Exercise 11.5.10. Use characteristic functions to provide an alternative solution of Exercise 10.3.4.

Exercise 11.5.11. Compute the characteristic function $\phi_X(t)$, and also $\phi_X'(0) = i\,E(X)$, where $X$ follows
(a) the binomial distribution: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, for $k = 0, 1, 2, \ldots, n$.
(b) the Poisson($\lambda$) distribution: $P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}$, for $k = 0, 1, 2, \ldots$.
(c) the exponential distribution, with density with respect to Lebesgue measure given by $f_X(x) = \lambda e^{-\lambda x}$ for $x \geq 0$, and $f_X(x) = 0$ for $x < 0$.

Exercise 11.5.12. Suppose that for $n \in \mathbf{N}$, we have $P[X_n = 5] = 1/n$ and $P[X_n = 6] = 1 - (1/n)$.
(a) Compute the characteristic function $\phi_{X_n}(t)$, for all $n \in \mathbf{N}$ and $t \in \mathbf{R}$.
(b) Compute $\lim_{n \to \infty} \phi_{X_n}(t)$.
(c) Specify a distribution $\mu$ such that $\lim_{n \to \infty} \phi_{X_n}(t) = \int e^{itx} \, \mu(dx)$ for all $t \in \mathbf{R}$.
(d) Determine (with explanation) whether or not $\mathcal{L}(X_n) \Rightarrow \mu$.

Exercise 11.5.13. Let $\{X_n\}$ be i.i.d., each having mean $3$ and variance $4$. Let $S = X_1 + X_2 + \ldots + X_{10{,}000}$. In terms of $\Phi(x)$, give an approximate value for $P[S \leq 30{,}500]$.

Exercise 11.5.14. Let $X_1, X_2, \ldots$ be i.i.d. with mean $4$ and variance $9$. Find values $C(n, x)$, for $n \in \mathbf{N}$ and $x \in \mathbf{R}$, such that as $n \to \infty$, $P[X_1 + X_2 + \ldots + X_n \leq C(n, x)] \approx \Phi(x)$.

Exercise 11.5.15. Prove that the Poisson($\lambda$) distribution, and the $N(m, v)$ (normal) distribution, are both infinitely divisible (for any $\lambda > 0$, $m \in \mathbf{R}$, and $v > 0$). [Hint: Use Exercises 9.5.15 and 9.5.16.]

Exercise 11.5.16. Let $X$ be a random variable whose distribution $\mathcal{L}(X)$ is infinitely divisible. Let $a > 0$ and $b \in \mathbf{R}$, and set $Y = aX + b$. Prove that $\mathcal{L}(Y)$ is infinitely divisible.

Exercise 11.5.17. Prove that the Poisson($\lambda$) distribution, the $N(m, v)$ distribution, and the Exp($\lambda$) (exponential) distribution, are all determined by their moments, for any $\lambda > 0$, $m \in \mathbf{R}$, and $v > 0$.

Exercise 11.5.18. Let $X, X_1, X_2, \ldots$ be random variables which are uniformly bounded, i.e. there is $M \in \mathbf{R}$ with $|X| \leq M$ and $|X_n| \leq M$ for all $n$. Prove that $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$ if and only if $E(X_n^k) \to E(X^k)$ for all $k \in \mathbf{N}$.
11.6. Section summary.

This section introduced the characteristic function $\phi_X(t) = E(e^{itX})$. After establishing its basic properties, it proved an Inversion Theorem (to recover the distribution of a random variable from its characteristic function) and a Uniqueness Theorem (which shows that if two random variables have the same characteristic function, then they have the same distribution). Then, using the Helly Selection Principle and the notion of tightness, it proved the important Continuity Theorem, which asserts the equivalence of weak convergence of distributions and pointwise convergence of characteristic functions. This important theorem was used to prove the Central Limit Theorem about weak convergence of averages of i.i.d. random variables to a normal distribution. Some generalisations of the Central Limit Theorem were briefly discussed. The section ended with a discussion of the method of moments, an alternative way of proving weak convergence of random variables using only their moments, but not their characteristic functions.
12. Decomposition of probability laws.

Let $\mu$ be a Borel probability measure on $\mathbf{R}$. Recall that $\mu$ is discrete if it puts all its mass at individual points, i.e. if $\sum_{x \in \mathbf{R}} \mu\{x\} = \mu(\mathbf{R})$. Also, $\mu$ is absolutely continuous if there is a non-negative Borel-measurable function $f$ such that $\mu(A) = \int_A f(x) \, \lambda(dx) = \int \mathbf{1}_A(x) f(x) \, \lambda(dx)$ for all Borel sets $A$, where $\lambda$ is Lebesgue measure on $\mathbf{R}$.

One could ask, are these the only possible types of probability distributions? Of course the answer to this question is no, as $\mu$ could be a mixture of a discrete and an absolutely continuous distribution, e.g. $\mu = \frac{1}{2}\delta_0 + \frac{1}{2}N(0,1)$. But can every probability distribution at least be written as such a mixture? Equivalently, is it true that every distribution $\mu$ with no discrete component (i.e. which satisfies $\mu\{x\} = 0$ for each $x \in \mathbf{R}$) must necessarily be absolutely continuous?

To examine this question, say that $\mu$ is dominated by $\lambda$ (written $\mu \ll \lambda$) if $\mu(A) = 0$ whenever $\lambda(A) = 0$. Then clearly any absolutely continuous measure $\mu$ must be dominated by $\lambda$, since whenever $\lambda(A) = 0$ we would then have $\mu(A) = \int \mathbf{1}_A(x) f(x) \, \lambda(dx) = 0$. (We shall see in Corollary 12.1.2 below that in fact the converse to this statement also holds.)

On the other hand, suppose $Z_1, Z_2, \ldots$ are i.i.d., taking the value $1$ with probability $2/3$ and the value $0$ with probability $1/3$. Set $Y = \sum_{n=1}^{\infty} Z_n 2^{-n}$ (i.e., the base-2 expansion of $Y$ is $0.Z_1 Z_2 \ldots$). Further, define $S \subseteq \mathbf{R}$ by
$$S = \Big\{ x \in [0,1];\; \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} d_i(x) = \frac{2}{3} \Big\},$$
where $d_i(x)$ is the $i^{\text{th}}$ digit in the (non-terminating) base-2 expansion of $x$. Then by the strong law of large numbers, we have $P(Y \in S) = 1$, while $\lambda(S) = 0$ (indeed, with respect to $\lambda$, the digits $d_i$ are i.i.d., equal to $0$ or $1$ with probability $1/2$ each, so by the strong law again the limiting digit frequency is $1/2$ for $\lambda$-almost every $x$). Hence, from the previous paragraph, the law of $Y$ cannot be absolutely continuous. But clearly $\mathcal{L}(Y)$ has no discrete component. We conclude that $\mathcal{L}(Y)$ cannot be written as a mixture of a discrete and an absolutely continuous distribution. In fact, $\mathcal{L}(Y)$ is singular with respect to $\lambda$ (written $\mathcal{L}(Y) \perp \lambda$), meaning that there is a subset $S \subseteq \mathbf{R}$ with $\lambda(S) = 0$ and $P(Y \in S^c) = 0$.
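This singular law is easy to simulate. The sketch below (ours; 50 binary digits and an arbitrary seed) samples $Y$ and confirms the behaviour just described: the digit frequencies concentrate near $2/3$ (not $1/2$), no single value of $Y$ carries noticeable mass, and yet, e.g., $P(Y \leq 1/2) = P(Z_1 = 0) = 1/3$ rather than the Lebesgue value $1/2$:

```python
import numpy as np

# Y = sum_n Z_n 2^{-n} with P(Z_n = 1) = 2/3, truncated to 50 digits.
rng = np.random.default_rng(4)
n_digits, n_samples = 50, 10**5
Z = (rng.random((n_samples, n_digits)) < 2/3).astype(float)
Y = Z @ (0.5 ** np.arange(1, n_digits + 1))
print("average digit frequency ≈", Z.mean())                 # near 2/3
print("P(Y <= 1/2)             ≈", np.mean(Y <= 0.5))         # near 1/3
counts = np.unique(Y, return_counts=True)[1]
print("largest point mass      ≈", counts.max() / n_samples)  # near 0: no atoms
```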
12.1. Lebesgue and Hahn decompositions.

The main result of this section is the following.

Theorem 12.1.1. (Lebesgue Decomposition) Any probability measure $\mu$ on $\mathbf{R}$ can uniquely be decomposed as $\mu = \mu_{disc} + \mu_{ac} + \mu_{sing}$, where the measures $\mu_{disc}$, $\mu_{ac}$, and $\mu_{sing}$ satisfy:
(a) $\mu_{disc}$ is discrete, i.e. $\sum_{x \in \mathbf{R}} \mu_{disc}\{x\} = \mu_{disc}(\mathbf{R})$;
(b) $\mu_{ac}$ is absolutely continuous, i.e. $\mu_{ac}(A) = \int_A f \, d\lambda$ for all Borel sets $A$, for some non-negative, Borel-measurable function $f$, where $\lambda$ is Lebesgue measure on $\mathbf{R}$;
(c) $\mu_{sing}$ is singular continuous, i.e. $\mu_{sing}\{x\} = 0$ for all $x \in \mathbf{R}$, but there is $S \subseteq \mathbf{R}$ with $\lambda(S) = 0$ and $\mu_{sing}(S^c) = 0$.

Theorem 12.1.1 will be proved below. We first present an important corollary.

Corollary 12.1.2. (Radon-Nikodym Theorem) A Borel probability measure $\mu$ is absolutely continuous (i.e. there is $f$ with $\mu(A) = \int_A f \, d\lambda$ for all Borel $A$) if and only if it is dominated by $\lambda$ (i.e. $\mu \ll \lambda$, i.e. $\mu(A) = 0$ whenever $\lambda(A) = 0$).

Proof. We have already seen that if $\mu$ is absolutely continuous then $\mu \ll \lambda$. For the converse, suppose $\mu \ll \lambda$, and let $\mu = \mu_{disc} + \mu_{ac} + \mu_{sing}$ as in Theorem 12.1.1. Since $\lambda\{x\} = 0$ for each $x$, we must have $\mu\{x\} = 0$ as well, so that $\mu_{disc} = 0$. Similarly, if $S$ is such that $\lambda(S) = 0$ and $\mu_{sing}(S^c) = 0$, then we must have $\mu(S) = 0$, so that $\mu_{sing}(S) = 0$, so that $\mu_{sing} = 0$. Hence, $\mu = \mu_{ac}$, i.e. $\mu$ is absolutely continuous. ■

Remark 12.1.3. If $\mu(A) = \int_A f \, d\lambda$ for all Borel $A$, we write $\frac{d\mu}{d\lambda} = f$, and call $f$ the density, or Radon-Nikodym derivative, of $\mu$ with respect to $\lambda$.

It remains to prove Theorem 12.1.1. We begin with a lemma.

Lemma 12.1.4. (Hahn Decomposition) Let $\phi$ be a finite "signed measure" on $(\Omega, \mathcal{F})$, i.e. $\phi = \mu - \nu$ for some finite measures $\mu$ and $\nu$. Then there is a partition $\Omega = A^+ \,\dot\cup\, A^-$, with $A^+, A^- \in \mathcal{F}$, such that $\phi(E) \geq 0$ for all measurable $E \subseteq A^+$, and $\phi(E) \leq 0$ for all measurable $E \subseteq A^-$.

Proof. Following Billingsley (1995, Theorem 32.1), we set $\alpha = \sup\{\phi(A);\; A \in \mathcal{F}\}$. We shall construct a subset $A^+$ such that $\phi(A^+) = \alpha$. Once we have done this, we can then take $A^- = \Omega \setminus A^+$. Then if $E \subseteq A^+$ but $\phi(E) < 0$, then $\phi(A^+ \setminus E) = \phi(A^+) - \phi(E) > \phi(A^+) = \alpha$, which contradicts the definition of $\alpha$. Similarly, if $E \subseteq A^-$ but $\phi(E) > 0$, then $\phi(A^+ \cup E) = \phi(A^+) + \phi(E) > \alpha$, again contradicting the definition of $\alpha$. Thus, to complete the proof, it suffices to construct $A^+$ with $\phi(A^+) = \alpha$.
To that end, choose (by the definition of $\alpha$) subsets $A_1, A_2, \ldots \in \mathcal{F}$ such that $\phi(A_n) \to \alpha$. Let $A = \bigcup_n A_n$, and let
$$\mathcal{G}_n = \Big\{ \bigcap_{k=1}^n A_k' \;:\; \text{each } A_k' = A_k \text{ or } A \setminus A_k \Big\}$$
(so that $\mathcal{G}_n$ contains at most $2^n$ different subsets, which are all disjoint). Then let
$$C_n = \bigcup_{\substack{S \in \mathcal{G}_n \\ \phi(S) \geq 0}} S,$$
i.e. $C_n$ is the union of those elements of $\mathcal{G}_n$ with non-negative measure under $\phi$. Finally, set $A^+ = \limsup_n C_n$. We claim that $\phi(A^+) = \alpha$.

First, note that since $A_n$ is a union of certain particular elements of $\mathcal{G}_n$ (namely, all those formed with $A_n' = A_n$), and $C_n$ is a union of all the $\phi$-non-negative elements of $\mathcal{G}_n$, it follows that $\phi(C_n) \geq \phi(A_n)$. Next, note that $C_m \cup \ldots \cup C_n$ is formed from $C_m \cup \ldots \cup C_{n-1}$ by including some additional $\phi$-non-negative elements of $\mathcal{G}_n$, so $\phi(C_m \cup C_{m+1} \cup \ldots \cup C_n) \geq \phi(C_m \cup C_{m+1} \cup \ldots \cup C_{n-1})$. It follows by induction that $\phi(C_m \cup \ldots \cup C_n) \geq \phi(C_m) \geq \phi(A_m)$. Since this holds for all $n$, we must have (by continuity of probabilities, applied to $\mu$ and $\nu$ separately) that $\phi(C_m \cup C_{m+1} \cup \ldots) \geq \phi(A_m)$ for all $m$. But then (again by continuity of probabilities, applied to $\mu$ and $\nu$ separately) we have
$$\phi(A^+) = \phi(\limsup_n C_n) = \lim_{m \to \infty} \phi(C_m \cup C_{m+1} \cup \ldots) \geq \lim_{m \to \infty} \phi(A_m) = \alpha.$$
Hence, $\phi(A^+) = \alpha$, as claimed. ■

Remarks.
1. The Hahn decomposition is unique up to sets of $\phi$-measure $0$, i.e. if $A^+ \cup A^-$ and $B^+ \cup B^-$ are two Hahn decompositions, then $\phi(A^+ \setminus B^+) = \phi(B^+ \setminus A^+) = \phi(A^- \setminus B^-) = \phi(B^- \setminus A^-) = 0$, since e.g. $A^+ \setminus B^+ = A^+ \cap B^-$ must have $\phi$-measure both $\geq 0$ and $\leq 0$.
2. Using the Radon-Nikodym theorem, it is very easy to prove the Hahn decomposition; indeed, we can let $\psi = \mu + \nu$, let $f = \frac{d\mu}{d\psi}$ and $g = \frac{d\nu}{d\psi}$, and set $A^+ = \{f \geq g\}$. However, this reasoning would be circular, since we are going to use the Hahn decomposition to prove the Lebesgue decomposition, which in turn proves the Radon-Nikodym theorem!
3. If $\phi$ is any countably additive mapping from $\mathcal{F}$ to $\mathbf{R}$ (not necessarily non-negative, nor assumed to be of the form $\mu - \nu$), then continuity of probabilities follows just as in Proposition 3.3.1, and the proof of Lemma 12.1.4 then goes through without change.
Furthermore, it then follows that $\phi = \mu - \nu$, where $\mu(E) = \phi(E \cap A^+)$ and $\nu(E) = -\phi(E \cap A^-)$; i.e., every such function is in fact a signed measure.

We are now able to prove our main theorem.

Proof of Theorem 12.1.1. We first take care of $\mu_{disc}$: indeed, we simply define $\mu_{disc}(A) = \sum_{x \in A} \mu\{x\}$, and then $\mu - \mu_{disc}$ has no discrete component. Hence, we assume from now on that $\mu$ has no discrete component. Second, we note that by countable additivity, it suffices to assume that $\mu$ is supported on $[0,1]$, so that we may also take $\lambda$ to be Lebesgue measure on $[0,1]$.

To continue, call a function $g$ a candidate density if $g \geq 0$ and $\int_E g \, d\lambda \leq \mu(E)$ for all Borel sets $E$. We note that if $g_1$ and $g_2$ are candidate densities, then so is $\max(g_1, g_2)$, since
$$\int_E \max(g_1, g_2) \, d\lambda = \int_{E \cap \{g_1 \geq g_2\}} g_1 \, d\lambda + \int_{E \cap \{g_1 < g_2\}} g_2 \, d\lambda \leq \mu(E \cap \{g_1 \geq g_2\}) + \mu(E \cap \{g_1 < g_2\}) = \mu(E).$$
Also, by the monotone convergence theorem, if $h_1, h_2, \ldots$ are candidate densities and $h_n \nearrow h$, then $h$ is also a candidate density. It follows from these two observations that if $g_1, g_2, \ldots$ are candidate densities, then so is $\sup_n g_n = \lim_{n \to \infty} \max(g_1, \ldots, g_n)$.

Now, let $\beta = \sup\{\int_{[0,1]} g \, d\lambda;\; g \text{ a candidate density}\}$. Choose candidate densities $g_n$ with $\int_{[0,1]} g_n \, d\lambda \geq \beta - \frac{1}{n}$, and let $f = \sup_{n \geq 1} g_n$, to obtain that $f$ is a candidate density with $\int_{[0,1]} f \, d\lambda = \beta$, i.e. $f$ is (up to a set of measure $0$) the largest possible candidate density. This $f$ shall be our density for $\mu_{ac}$. That is, we define $\mu_{ac}(A) = \int_A f \, d\lambda$. We then (of course) define $\mu_{sing}(A) = \mu(A) - \mu_{ac}(A)$. Since $f$ is a candidate density, $\mu_{sing}(A) \geq 0$.

To complete the existence proof, it suffices to show that $\mu_{sing}$ is singular. For each $n \in \mathbf{N}$, let $[0,1] = A_n^+ \cup A_n^-$ be a Hahn decomposition (cf. Lemma 12.1.4) for the signed measure $\phi_n = \mu_{sing} - \frac{1}{n}\lambda$. Set $M = \bigcup_n A_n^+$. Then $M^c = \bigcap_n A_n^-$, so that $M^c \subseteq A_n^-$ for each $n$. It follows that $(\mu_{sing} - \frac{1}{n}\lambda)(M^c) \leq 0$ for all $n$, so that $\mu_{sing}(M^c) \leq \frac{1}{n}\lambda(M^c)$ for all $n$. Hence, $\mu_{sing}(M^c) = 0$.

We claim that $\lambda(M) = 0$, so that $\mu_{sing}$ is indeed singular. To prove this, we assume that $\lambda(M) > 0$, and derive a contradiction. If $\lambda(M) > 0$, then there is $n \in \mathbf{N}$ with $\lambda(A_n^+) > 0$. For this $n$, we have $(\mu_{sing} - \frac{1}{n}\lambda)(E) \geq 0$, i.e. $\mu_{sing}(E) \geq \frac{1}{n}\lambda(E)$, for all measurable $E \subseteq A_n^+$. We now claim that the function $g = f + \frac{1}{n}\mathbf{1}_{A_n^+}$ is a candidate density.
Indeed, we compute for any Borel set $E$ that
$$\int_E g \, d\lambda = \int_E f \, d\lambda + \tfrac{1}{n}\lambda(A_n^+ \cap E) \leq \mu_{ac}(E) + \mu_{sing}(A_n^+ \cap E) \leq \mu_{ac}(E) + \mu_{sing}(E) = \mu(E),$$
thus verifying the claim. On the other hand, we have
$$\int_{[0,1]} g \, d\lambda = \int_{[0,1]} f \, d\lambda + \frac{1}{n}\int_{[0,1]} \mathbf{1}_{A_n^+} \, d\lambda = \beta + \frac{1}{n}\lambda(A_n^+) > \beta,$$
which contradicts the maximality of $f$. Hence, we must actually have had $\lambda(M) = 0$, showing that $\mu_{sing}$ must indeed be singular, thus proving the existence part of the theorem.

Finally, we prove the uniqueness. Indeed, suppose $\mu = \mu_{ac} + \mu_{sing} = \nu_{ac} + \nu_{sing}$, with $\mu_{ac}(A) = \int_A f \, d\lambda$ and $\nu_{ac}(A) = \int_A g \, d\lambda$. Since $\mu_{sing}$ and $\nu_{sing}$ are singular, we can find $S_1$ and $S_2$ with $\lambda(S_1) = \lambda(S_2) = 0$ and $\mu_{sing}(S_1^c) = \nu_{sing}(S_2^c) = 0$. Let $S = S_1 \cup S_2$, and let $B = \{\omega \in S^c;\; f(\omega) < g(\omega)\}$. Then $g - f > 0$ on $B$; but since $B \subseteq S^c$, we have $\mu_{sing}(B) = \nu_{sing}(B) = 0$, whence
$$\int_B (g - f) \, d\lambda = \nu_{ac}(B) - \mu_{ac}(B) = \mu(B) - \mu(B) = 0.$$
Hence, $\lambda(B) = 0$. But we also have $\lambda(S) = 0$, hence $\lambda\{f < g\} = 0$. Similarly, $\lambda\{f > g\} = 0$. We conclude that $\lambda\{f = g\} = 1$, whence $\mu_{ac} = \nu_{ac}$, whence $\mu_{sing} = \nu_{sing}$. ■

Remark. Note that, while $\mu_{ac}$ is unique, the density $f = \frac{d\mu_{ac}}{d\lambda}$ is only unique up to a set of measure $0$.

12.2. Decomposition with general measures.

Finally, we note that similar decompositions may be made with respect to other measures. Instead of considering absolute continuity or singularity with respect to $\lambda$, we can consider them with respect to any other probability measure $\nu$, and the same proofs apply. Furthermore, by countable additivity, the above proofs go through virtually unchanged for $\sigma$-finite $\nu$ as well (cf. Remark 4.4.3). We state the more general results as follows. (For the general Radon-Nikodym theorem, we write $\mu \ll \nu$, and say that $\mu$ is dominated by $\nu$, if $\mu(A) = 0$ whenever $\nu(A) = 0$.)

Theorem 12.2.1. (Lebesgue Decomposition, general case) If $\mu$ and $\nu$ are two $\sigma$-finite measures on some measurable space $(\Omega, \mathcal{F})$, then $\mu$ can uniquely be decomposed as $\mu = \mu_{ac} + \mu_{sing}$, where $\mu_{ac}$ is absolutely continuous with respect to $\nu$ (i.e., there is a non-negative measurable function $f: \Omega \to \mathbf{R}$ such that $\mu_{ac}(A) = \int_A f \, d\nu$ for all $A \in \mathcal{F}$), and $\mu_{sing}$ is singular (i.e., there is $S \subseteq \Omega$ with $\nu(S) = 0$ and $\mu_{sing}(S^c) = 0$).
Remark. In the above statement, for simplicity we avoid mentioning the discrete part $\mu_{disc}$, thus allowing for the possibility that $\nu$ may itself have some discrete component (in which case the corresponding discrete component of $\mu$ would still be absolutely continuous with respect to $\nu$). However, if $\nu$ has no discrete component, then if we wish we can extract $\mu_{disc}$ from $\mu_{sing}$ as in Theorem 12.1.1.

Corollary 12.2.2. (Radon-Nikodym Theorem, general case) If $\mu$ and $\nu$ are two $\sigma$-finite measures on some measurable space $(\Omega, \mathcal{F})$, then $\mu$ is absolutely continuous with respect to $\nu$ if and only if $\mu \ll \nu$.

If $\mu \ll \nu$, so $\mu(A) = \int_A f \, d\nu$ for all $A \in \mathcal{F}$, then we write $\frac{d\mu}{d\nu} = f$, and call $\frac{d\mu}{d\nu}$ the Radon-Nikodym derivative of $\mu$ with respect to $\nu$. Thus, $\mu(A) = \int_A \big( \frac{d\mu}{d\nu} \big) \, d\nu$. If $\nu$ is a probability measure, then we can also write this as $\mu(A) = E_\nu\big[ \frac{d\mu}{d\nu} \, \mathbf{1}_A \big]$, where $E_\nu$ stands for expectation with respect to $\nu$.
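For a concrete feel for Radon-Nikodym derivatives, consider two discrete measures that dominate each other; then $\frac{d\mu}{d\nu}(x) = \mu\{x\}/\nu\{x\}$. The following sketch (ours; the Poisson example and the truncation at 50 are arbitrary choices) verifies $\mu(A) = E_\nu\big[ \frac{d\mu}{d\nu} \mathbf{1}_A \big]$ numerically:

```python
import numpy as np
from scipy import stats

# mu = Poisson(3), nu = Poisson(5): both give positive mass to every
# non-negative integer, so mu << nu, with dmu/dnu(x) = mu{x}/nu{x}.
x = np.arange(50)                  # truncation; the omitted tail is negligible
mu = stats.poisson.pmf(x, 3)
nu = stats.poisson.pmf(x, 5)
dmu_dnu = mu / nu
A = x < 5                          # the event A = {0, 1, 2, 3, 4}
print("mu(A)               =", mu[A].sum())
print("E_nu[(dmu/dnu) 1_A] =", (dmu_dnu * A * nu).sum())
```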
12.3. Exercises.

Exercise 12.3.1. Prove that $\mu$ is discrete if and only if there is a countable set $S$ with $\mu(S^c) = 0$.

Exercise 12.3.2. Let $X$ and $Y$ be discrete random variables (not necessarily independent), and let $Z = X + Y$. Prove that $\mathcal{L}(Z)$ is discrete.

Exercise 12.3.3. Let $X$ be a random variable, and let $Y = cX$ for some constant $c > 0$.
(a) Prove or disprove that if $\mathcal{L}(X)$ is discrete, then $\mathcal{L}(Y)$ must be discrete.
(b) Prove or disprove that if $\mathcal{L}(X)$ is absolutely continuous, then $\mathcal{L}(Y)$ must be absolutely continuous.
(c) Prove or disprove that if $\mathcal{L}(X)$ is singular continuous, then $\mathcal{L}(Y)$ must be singular continuous.

Exercise 12.3.4. Let $X$ and $Y$ be random variables, with $\mathcal{L}(Y)$ absolutely continuous, and let $Z = X + Y$.
(a) Assume $X$ and $Y$ are independent. Prove that $\mathcal{L}(Z)$ is absolutely continuous, regardless of the nature of $\mathcal{L}(X)$. [Hint: Recall the convolution formula of Subsection 9.4.]
(b) Show that if $X$ and $Y$ are not independent, then $\mathcal{L}(Z)$ may fail to be absolutely continuous.

Exercise 12.3.5. Let $X$ and $Y$ be random variables, with $\mathcal{L}(X)$ discrete and $\mathcal{L}(Y)$ singular continuous. Let $Z = X + Y$. Prove that $\mathcal{L}(Z)$ is singular continuous. [Hint: If $\lambda(S) = P(Y \in S^c) = 0$, consider the set $U = \{s + x : s \in S,\; P(X = x) > 0\}$.]

Exercise 12.3.6. Let $A, B, Z_1, Z_2, \ldots$ be i.i.d., each equal to $+1$ with probability $2/3$, or equal to $0$ with probability $1/3$. Let $Y = \sum_{i=1}^{\infty} Z_i 2^{-i}$ as at the beginning of this section (so $\nu = \mathcal{L}(Y)$ is singular continuous), and let $W \sim N(0,1)$. Finally, let $X = A(BY + (1-B)W)$, and set $\mu = \mathcal{L}(X)$. Find a discrete measure $\mu_{disc}$, an absolutely continuous measure $\mu_{ac}$, and a singular continuous measure $\mu_s$, such that $\mu = \mu_{disc} + \mu_{ac} + \mu_s$.

Exercise 12.3.7. Let $\mu$, $\nu$, and $\rho$ be probability measures with $\mu \ll \nu \ll \rho$. Prove that $\frac{d\mu}{d\rho} = \frac{d\mu}{d\nu} \frac{d\nu}{d\rho}$ with $\rho$-probability $1$. [Hint: Use Proposition 6.2.3.]

Exercise 12.3.8. Let $\mu$ and $\nu$ be probability measures with $\mu \ll \nu$ and $\nu \ll \mu$. (This is sometimes written as $\mu \equiv \nu$.) Prove that $\frac{d\mu}{d\nu} > 0$ with $\mu$-probability $1$, and in fact $\frac{d\nu}{d\mu} = 1 \big/ \frac{d\mu}{d\nu}$.

Exercise 12.3.9. Let $\mu$ and $\nu$ be discrete probability measures, with $\phi = \mu - \nu$. Write down an explicit Hahn decomposition $\Omega = A^+ \cup A^-$ for $\phi$.

Exercise 12.3.10. Let $\mu$ and $\nu$ be absolutely continuous probability measures on $\mathbf{R}$ (with the Borel $\sigma$-algebra), with $\phi = \mu - \nu$. Write down an explicit Hahn decomposition $\mathbf{R} = A^+ \cup A^-$ for $\phi$.

12.4. Section summary.

This section proved the Lebesgue Decomposition theorem, which says that every probability measure may be uniquely written as the sum of a discrete measure, an absolutely continuous measure, and a singular continuous measure. From this followed the Radon-Nikodym Theorem, which gives a simple condition under which a measure is absolutely continuous.
13. Conditional probability and expectation.

Conditioning is a very important concept in probability, and we consider it here. Of course, conditioning on events of positive probability is quite straightforward. We have already noted that if $A$ and $B$ are events, with $P(B) > 0$, then we can define the conditional probability $P(A \mid B) = P(A \cap B)/P(B)$; intuitively, this represents the probabilistic proportion of the event $B$ which also includes the event $A$.

More generally, if $Y$ is a random variable, and if we define $\nu$ by $\nu(S) = P(Y \in S \mid B) = P(Y \in S,\, B)/P(B)$, then $\nu = \mathcal{L}(Y \mid B)$ is a probability measure, called the conditional distribution of $Y$ given $B$. We can then define conditional expectation by $E(Y \mid B) = \int y \, \nu(dy)$. Also, $\mathcal{L}(Y \mathbf{1}_B) = P(B)\,\mathcal{L}(Y \mid B) + P(B^c)\,\delta_0$, so taking expectations and re-arranging,
$$E(Y \mid B) = E(Y \mathbf{1}_B)/P(B). \qquad (13.0.1)$$
No serious difficulties arise.

On the other hand, if $P(B) = 0$ then this approach does not work at all. Indeed, it is quite unclear how to define something like $P(Y \in S \mid B)$ in that case. Unfortunately, it frequently arises that we wish to condition on events of probability $0$.

13.1. Conditioning on a random variable.

We begin with an example.

Example 13.1.1. Let $(X, Y)$ be uniformly distributed on the triangle $T = \{(x, y) \in \mathbf{R}^2;\; 0 \leq y \leq 2,\; y \leq x \leq 2\}$; see Figure 13.1.2. (That is, $P((X, Y) \in S) = \frac{1}{2}\lambda_2(S \cap T)$ for Borel $S \subseteq \mathbf{R}^2$, where $\lambda_2$ is Lebesgue measure on $\mathbf{R}^2$; briefly, $dP = \frac{1}{2}\mathbf{1}_T \, dx \, dy$.) Then what is $P(Y > \frac{3}{4} \mid X = 1)$? What is $E(Y \mid X = 1)$? Since $P(X = 1) = 0$, it is not clear how to proceed. We shall return to this example below.

Because of this problem, we take a different approach. Given a random variable $X$, we shall consider conditional probabilities like $P(A \mid X)$, and also conditional expected values like $E(Y \mid X)$, to themselves be random variables. We shall think of them as functions of the "random" value $X$. This is very counter-intuitive: we are used to thinking of $P(\cdots)$ and $E(\cdots)$ as numbers, not random variables. However, we shall think of them as random variables, and we shall see that this allows us to partially resolve the difficulty of conditioning on sets of measure $0$ (such as $\{X = 1\}$ above).
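Although $P(X = 1) = 0$, we can still explore this example by simulation, conditioning on the small event $\{|X - 1| < d\}$ and letting $d$ shrink. The sketch below (ours; seed and sample size arbitrary) suggests the answer $E(Y \mid X = 1) = \frac{1}{2}$, consistent with the formula $E(Y \mid X) = X/2$ verified later in this example:

```python
import numpy as np

# (X, Y) uniform on the triangle T: X has density x/2 on [0, 2], and given
# X = x, Y is uniform on [0, x].
rng = np.random.default_rng(5)
N = 10**6
x = 2 * np.sqrt(rng.random(N))
y = x * rng.random(N)
for d in [0.5, 0.1, 0.01]:
    near = np.abs(x - 1) < d
    print(f"d = {d:5.2f}   E[Y | |X - 1| < d] ≈ {y[near].mean():.4f}")
```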
[Figure 13.1.2. The triangle $T$ in Example 13.1.1: the region with vertices $(0,0)$, $(2,0)$, and $(2,2)$.]

The idea is that, once we define these quantities to be random variables, then we can demand that they satisfy certain properties. For starters, we require that
$$E\big[ P(A \mid X) \big] = P(A), \qquad E\big[ E(Y \mid X) \big] = E(Y). \qquad (13.1.3)$$
In words, these random variables must have the correct expected values. Unfortunately, this does not completely specify the distributions of the random variables $P(A \mid X)$ and $E(Y \mid X)$; indeed, there are infinitely many different distributions having the same mean. We shall therefore impose a stronger requirement.

To state it, recall that if $\mathcal{G}$ is a sub-$\sigma$-algebra (i.e. a $\sigma$-algebra contained in the main $\sigma$-algebra $\mathcal{F}$), then a random variable $Z$ is $\mathcal{G}$-measurable if $\{Z \leq z\} \in \mathcal{G}$ for all $z \in \mathbf{R}$. (It follows that also $\{Z = z\} = \{Z \leq z\} \setminus \bigcup_n \{Z \leq z - \frac{1}{n}\} \in \mathcal{G}$.) Also, $\sigma(X) = \{\{X \in B\} : B \subseteq \mathbf{R} \text{ Borel}\}$.

Definition 13.1.4. Given random variables $X$ and $Y$ with $E|Y| < \infty$, and an event $A$, $P(A \mid X)$ is a conditional probability of $A$ given $X$ if it is a $\sigma(X)$-measurable random variable and, for any Borel $S \subseteq \mathbf{R}$, we have
$$E\big( P(A \mid X) \, \mathbf{1}_{X \in S} \big) = P\big( A \cap \{X \in S\} \big). \qquad (13.1.5)$$
Similarly, $E(Y \mid X)$ is a conditional expectation of $Y$ given $X$ if it is a $\sigma(X)$-measurable random variable and, for any Borel $S \subseteq \mathbf{R}$, we have
$$E\big( E(Y \mid X) \, \mathbf{1}_{X \in S} \big) = E\big( Y \, \mathbf{1}_{X \in S} \big). \qquad (13.1.6)$$
That is, we define conditional probabilities and expectations by specifying that certain expected values of them should assume certain values. This prompts some observations.

Remarks.
1. Requiring these quantities to be $\sigma(X)$-measurable means that they are functions of $X$ alone, and do not otherwise depend on the sample point $\omega$. Indeed, if a random variable $Z$ is $\sigma(X)$-measurable, then for each $z \in \mathbf{R}$, $\{Z = z\} = \{X \in B_z\}$ for some Borel $B_z \subseteq \mathbf{R}$. Then $Z = f(X)$, where $f$ is defined by $f(x) = z$ for all $x \in B_z$; i.e., $Z$ is a function of $X$.
2. Of course, if we set $S = \mathbf{R}$ in (13.1.5) and (13.1.6), we obtain the special case (13.1.3).
3. Since expected values are unaffected by changes on a set of measure $0$, we see that conditional probabilities and expectations are only unique up to a set of measure $0$. Thus, if $P(X = x) = 0$ for some particular value of $x$, then we may change $P(A \mid X)$ on the set $\{X = x\}$ without restriction. However, we may only change its value on a set of measure $0$; thus, for "most" values of $X$, the value $P(A \mid X)$ cannot change. In this sense, we have mostly (but not entirely) overcome the difficulty of Example 13.1.1.

These conditional probabilities and expectations always exist:

Proposition 13.1.7. Let $X$ and $Y$ be jointly defined random variables (with $E|Y| < \infty$), and let $A$ be an event. Then $P(A \mid X)$ and $E(Y \mid X)$ exist (though they are only unique up to a set of probability $0$).

Proof. We may define $P(A \mid X)$ to be $\frac{d\nu}{dP_0}$, where $P_0$ and $\nu$ are measures on $\sigma(X)$ defined as follows: $P_0$ is simply $P$ restricted to $\sigma(X)$, and $\nu(E) = P(A \cap E)$ for $E \in \sigma(X)$. Note that $\nu \ll P_0$, so by the general Radon-Nikodym Theorem (Corollary 12.2.2), $\frac{d\nu}{dP_0}$ exists and is unique up to a set of probability $0$. Similarly, we may define $E(Y \mid X)$ to be $\frac{d\rho^+}{dP_0} - \frac{d\rho^-}{dP_0}$, where
$$\rho^+(E) = E\big( Y^+ \, \mathbf{1}_E \big), \qquad \rho^-(E) = E\big( Y^- \, \mathbf{1}_E \big), \qquad E \in \sigma(X).$$
Then (13.1.5) and (13.1.6) are automatically satisfied, since e.g.
$$E\big[ P(A \mid X) \, \mathbf{1}_{X \in S} \big] = E\Big[ \frac{d\nu}{dP_0} \, \mathbf{1}_{X \in S} \Big] = \int_{\{X \in S\}} \frac{d\nu}{dP_0} \, dP_0
= ∫_{X∈S} dν = ν({X ∈ S}) = P(A ∩ {X ∈ S}).  ∎

Example 13.1.1 continued. To better understand Definition 13.1.4, we return to the case where (X, Y) is assumed to be uniformly distributed over the triangle T = {(x, y) ∈ R²; 0 < y < 2, y < x < 2}. We can then define P(Y ≥ 3/4 | X) and E(Y | X) to be what they "ought" to be, namely

P(Y ≥ 3/4 | X) = (X − 3/4)/X for X ≥ 3/4, and 0 for X < 3/4;   E(Y | X) = X/2.

We then compute that, for example,

E(P(Y ≥ 3/4 | X) 1_{X∈S}) = ∫_{S∩[3/4,2]} ((x − 3/4)/x)(x/2) λ(dx) = ∫_{S∩[3/4,2]} (x − 3/4)(1/2) λ(dx)

and

P(Y ≥ 3/4, X ∈ S) = ∫_{S∩[3/4,2]} ∫_{[3/4,x]} (1/2) λ(dy) λ(dx) = ∫_{S∩[3/4,2]} (x − 3/4)(1/2) λ(dx),

thus verifying (13.1.5) for the case A = {Y ≥ 3/4}. Similarly, for S ⊆ [0, 2],

E(E(Y | X) 1_{X∈S}) = ∫_S (x/2)(x/2) λ(dx) = ∫_S (x²/4) λ(dx),

while

E(Y 1_{X∈S}) = ∫_S ∫_0^x y (1/2) λ(dy) λ(dx) = ∫_S (x²/4) λ(dx),

so that the two expressions in (13.1.6) are also equal.

This example was quite specific, but similar ideas can be used more generally; see Exercises 13.4.3 and 13.4.4. In summary, conditional probabilities and expectations always exist, and are unique up to sets of probability 0,
but finding them may require guessing appropriate random variables and then verifying that they satisfy (13.1.5) and (13.1.6).

13.2. Conditioning on a sub-σ-algebra.

So far we have considered conditioning on just a single random variable X. We may think of this, equivalently, as conditioning on the generated σ-algebra, σ(X). More generally, we may wish to condition on something other than just a single random variable X. For example, we may wish to condition on a collection of random variables X₁, X₂, ..., Xₙ, or we may wish to condition on certain other events.

Given any sub-σ-algebra G (in place of σ(X) above), we define the conditional probability P(A | G) and the conditional expectation E(Y | G) to be G-measurable random variables satisfying that

E(P(A | G) 1_G) = P(A ∩ G)   (13.2.1)

and

E(E(Y | G) 1_G) = E(Y 1_G)   (13.2.2)

for any G ∈ G. If G = σ(X), then this definition reduces precisely to the previous one, with P(A | X) being shorthand for P(A | σ(X)), and E(Y | X) being shorthand for E(Y | σ(X)). Similarly, we shall write e.g. E(Y | X₁, X₂) as shorthand for E(Y | σ(X₁, X₂)), where σ(X₁, X₂) is the σ-algebra generated by the sets {X₁ ≤ a} ∩ {X₂ ≤ b} for a, b ∈ R, i.e. the σ-algebra generated by X₁ and X₂. As before, these conditional random variables exist (and are unique up to a set of probability 0) by the Radon–Nikodym Theorem.

Two further examples may help to clarify matters. If G = {∅, Ω} (i.e., G is the trivial σ-algebra), then P(A | G) = P(A) and E(Y | G) = E(Y) are constants. On the other hand, if G = F (i.e., G is the full σ-algebra), or more generally if A ∈ G and X is G-measurable, then P(A | G) = 1_A and E(X | G) = X. Intuitively, if G is small then the conditional values cannot depend too much on the sample point ω, but rather they must represent average values. On the other hand, if G is large then the conditional values can (and must) be very close approximations to the unconditional quantities. In brief, the larger G is, the more random are the conditional values with respect to G. See also Exercise 13.4.1.

Exercise 13.2.3. Let G₁ and G₂ be two sub-σ-algebras.
(a) Prove that if Z is G₁-measurable, and G₁ ⊆ G₂, then Z is also G₂-measurable.
(b) Prove that if Z is G₁-measurable, and also Z is G₂-measurable, then Z is (G₁ ∩ G₂)-measurable.
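(Informal computational aside. The defining property (13.2.2) is easy to verify mechanically on a finite sample space, and doing so may help fix ideas before Exercise 13.4.1, which treats the case G = σ(B) abstractly. The Python/NumPy sketch below uses an arbitrary assumed six-point space and arbitrary weights; it is an illustration, not part of the formal development.)

    import numpy as np

    # A finite probability space Omega = {0,...,5} with assumed weights.
    omega = np.arange(6)
    p = np.array([0.1, 0.2, 0.1, 0.25, 0.15, 0.2])   # sums to 1
    Y = omega.astype(float) ** 2                     # an arbitrary random variable

    B = omega < 2                                    # an event B with 0 < P(B) < 1
    # sigma(B) = {empty set, B, B^c, Omega}:
    G_sets = [np.zeros(6, bool), B, ~B, np.ones(6, bool)]

    # E(Y | sigma(B)) is constant on B and on B^c, equal there to
    # E(Y 1_B)/P(B) and E(Y 1_{B^c})/P(B^c) respectively.
    condY = np.where(B, (p * Y * B).sum() / p[B].sum(),
                        (p * Y * ~B).sum() / p[~B].sum())

    # Check (13.2.2): E( E(Y|G) 1_G ) = E( Y 1_G ) for every G in sigma(B).
    for G in G_sets:
        assert abs((p * condY * G).sum() - (p * Y * G).sum()) < 1e-12
    print("E(Y | sigma(B)) =", condY)

The point of the check is that it is the averages over sets G ∈ G, and nothing finer, that pin down the conditional expectation.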
Exercise 13.2.4. Let S₁ and S₂ be two disjoint events, and let G be a sub-σ-algebra. Show that the following hold with probability 1. [Hint: Use proof by contradiction.]
(a) 0 ≤ P(S₁ | G) ≤ 1.
(b) P(S₁ ∪ S₂ | G) = P(S₁ | G) + P(S₂ | G).

Remark 13.2.5. Exercise 13.2.4 shows that conditional probabilities behave "essentially" like ordinary probabilities. Now, the additivity property (b) is only guaranteed to hold with probability 1, and the exceptional set of probability 0 could perhaps be different for each S₁ and S₂. This suggests that perhaps P(S | G) cannot be defined in a consistent, countably additive way for all S ∈ F. However, it is a fact that if a random variable Y is real-valued (or, more generally, takes values in a Polish space, i.e. a complete separable metric space like R), then there always exist regular conditional distributions, which are versions of P(Y ∈ B | G) defined precisely (not just w.p. 1) for all Borel B in a countably additive way; see e.g. Theorem 10.2.2 of Dudley (1989), or Theorem 33.3 of Billingsley (1995).

We close with two final results about conditional expectations.

Proposition 13.2.6. Let X and Y be random variables, and let G be a sub-σ-algebra. Suppose that E(Y) and E(XY) are finite, and furthermore that X is G-measurable. Then with probability 1,

E(XY | G) = X E(Y | G).

That is, we can "factor" X out of the conditional expectation.

Proof. Clearly X E(Y | G) is G-measurable. Furthermore, if X = 1_{G₀}, with G₀, G ∈ G, then using the definition of E(Y | G), we have

E(X E(Y | G) 1_G) = E(E(Y | G) 1_{G∩G₀}) = E(Y 1_{G∩G₀}) = E(XY 1_G),

so that (13.2.2) holds in this case. But then by the usual linearity and monotone convergence arguments, (13.2.2) holds for general X. It then follows from the definition that X E(Y | G) is indeed a version of E(XY | G).  ∎

For our final result, suppose that G₁ ⊆ G₂. Note that since E(Y | G₁) is G₁-measurable, it is also G₂-measurable, so from the above discussion we have

E(E(Y | G₁) | G₂) = E(Y | G₁).

That is, conditioning first on G₁ and then on G₂ is equivalent to conditioning just on the smaller sub-σ-algebra, G₁. What is perhaps surprising is that we obtain the same result if we condition in the opposite order:
Proposition 13.2.7. Let Y be a random variable with finite mean, and let G₁ ⊆ G₂ be two sub-σ-algebras. Then with probability 1,

E(E(Y | G₂) | G₁) = E(Y | G₁).

That is, conditioning first on G₂ and then on G₁ is equivalent to conditioning just on the smaller sub-σ-algebra G₁.

Proof. By definition, E(E(Y | G₂) | G₁) is G₁-measurable. Hence, to show that it is a version of E(Y | G₁), it suffices to check that, for any G ∈ G₁ ⊆ G₂,

E[E(E(Y | G₂) | G₁) 1_G] = E(Y 1_G).

But using the definitions of E(··· | G₁) and E(··· | G₂), respectively, and recalling that G ∈ G₁ and G ∈ G₂, we have that

E[E(E(Y | G₂) | G₁) 1_G] = E(E(Y | G₂) 1_G) = E(Y 1_G).  ∎

In the special case G₁ = G₂ = G, we obtain that E[E(Y | G) | G] = E[Y | G]. In words, repeating the operation "take conditional expectation with respect to G" multiple times is equivalent to doing it just once. That is, conditional expectation is a projection operator, and can be thought of as "projecting" the random variable Y onto the σ-algebra G.

13.3. Conditional variance.

Given jointly defined random variables X and Y, we can define the conditional variance of Y given X by

Var(Y | X) = E[(Y − E(Y | X))² | X].

Intuitively, Var(Y | X) is a measure of how much uncertainty there is in Y even after we know X. Since X may well provide some information about Y, we might expect that on average Var(Y | X) ≤ Var(Y). That is indeed the case. More precisely:

Theorem 13.3.1. Let Y be a random variable, and G a sub-σ-algebra. If Var(Y) < ∞, then

Var(Y) = E[Var(Y | G)] + Var[E(Y | G)].

Proof. We compute (writing m = E(Y) = E[E(Y | G)], and using (13.1.3) and Exercise 13.2.4) that

Var(Y) = E[(Y − m)²] = E[E[(Y − m)² | G]]
= E[E[(Y − E(Y | G) + E(Y | G) − m)² | G]]
= E[E[(Y − E(Y | G))² | G]] + E[E[(E(Y | G) − m)² | G]] + 2 E[E[(Y − E(Y | G))(E(Y | G) − m) | G]]
= E[Var(Y | G)] + Var[E(Y | G)] + 0,
since using Proposition 13.2.6, we have

E[(Y − E(Y | G))(E(Y | G) − m) | G] = (E(Y | G) − m) E[(Y − E(Y | G)) | G]
= (E(Y | G) − m)[E(Y | G) − E(Y | G)] = 0.  ∎

For example, if G = σ(X), then E[Var(Y | X)] represents the average uncertainty in Y once X is known, while Var[E(Y | X)] represents the uncertainty in Y caused by uncertainty in X. Theorem 13.3.1 asserts that the total variance of Y is given by the sum of these two contributions.

13.4. Exercises.

Exercise 13.4.1. Let A and B be events, with 0 < P(B) < 1. Let G = σ(B) be the σ-algebra generated by B.
(a) Describe G explicitly.
(b) Compute P(A | G) explicitly.
(c) Relate P(A | G) to the earlier notion of P(A | B) = P(A ∩ B) / P(B).
(d) Similarly, for a random variable Y with finite mean, compute E(Y | G), and relate it to the earlier notion of E(Y | B) = E(Y 1_B) / P(B).

Exercise 13.4.2. Let G be a sub-σ-algebra, and let A be any event. Define the random variable X to be the indicator function 1_A. Prove that E(X | G) = P(A | G) with probability 1.

Exercise 13.4.3. Suppose X and Y are discrete random variables. Let q(x, y) = P(X = x, Y = y).
(a) Show that with probability 1, E(Y | X) = Σ_y y q(X, y) / Σ_y q(X, y). [Hint: One approach is to first argue that it suffices in (13.1.6) to consider the case S = {x₀}.]
(b) Compute P(Y = y | X). [Hint: Use Exercise 13.4.2 and part (a).]
(c) Show that E(Y | X) = Σ_y y P(Y = y | X).

Exercise 13.4.4. Let X and Y be random variables with joint distribution given by L(X, Y) = dP = f(x, y) λ₂(dx, dy), where λ₂ is two-dimensional Lebesgue measure, and f : R² → R is a non-negative Borel-measurable function with ∫_{R²} f dλ₂ = 1. (Example 13.1.1 corresponds to the case f(x, y) = ½ 1_T(x, y).) Show that we can take P(Y ∈ B | X) = ∫_B g_X(y) λ(dy) and E(Y | X) = ∫_R y g_X(y) λ(dy), where the function g_x :
R → R is defined by g_x(y) = f(x, y) / ∫_R f(x, t) λ(dt) whenever ∫_R f(x, t) λ(dt) is positive and finite, and otherwise (say) g_x(y) = 0.

Exercise 13.4.5. Let Ω = {1, 2, 3}, and define random variables X and Y by Y(ω) = ω, and X(1) = X(2) = 5 and X(3) = 6. Let Z = E(Y | X).
(a) Describe σ(X) precisely.
(b) Describe (with proof) Z(ω) for each ω ∈ Ω.

Exercise 13.4.6. Let G be a sub-σ-algebra, and let X and Y be two independent random variables. Prove by example that E(X | G) and E(Y | G) need not be independent. [Hint: Don't forget Exercise 3.6.3(a).]

Exercise 13.4.7. Suppose Y is σ(X)-measurable, and also X and Y are independent. Prove that there is C ∈ R with P(Y = C) = 1. [Hint: First prove that P(Y ≤ y) = 0 or 1 for each y ∈ R.]

Exercise 13.4.8. Suppose Y is G-measurable. Prove that Var(Y | G) = 0.

Exercise 13.4.9. Suppose X and Y are independent.
(a) Prove that E(Y | X) = E(Y) w.p. 1.
(b) Prove that Var(Y | X) = Var(Y) w.p. 1.
(c) Explicitly verify Theorem 13.3.1 (with G = σ(X)) in this case.

Exercise 13.4.10. Give an example of jointly defined random variables which are not independent, but such that E(Y | X) = E(Y) w.p. 1.

Exercise 13.4.11. Let X and Y be jointly defined random variables.
(a) Suppose E(Y | X) = E(Y) w.p. 1. Prove that E(XY) = E(X) E(Y).
(b) Give an example where E(XY) = E(X) E(Y), but it is not the case that E(Y | X) = E(Y) w.p. 1.

Exercise 13.4.12. Let {Zₙ} be independent, each with finite mean. Let X₀ = a, and Xₙ = a + Z₁ + ... + Zₙ for n ≥ 1. Prove that E(Xₙ₊₁ | X₀, X₁, ..., Xₙ) = Xₙ + E(Zₙ₊₁).

Exercise 13.4.13. Let X₀, X₁, ... be a Markov chain on a countable state space S, with transition probabilities {p_ij}, and with E|Xₙ| < ∞ for all n. Prove that with probability 1:
(a) E(Xₙ₊₁ | X₀, X₁, ..., Xₙ) = Σ_{j∈S} j p_{Xₙ,j}. [Hint: Don't forget Exercise 13.4.3.]
(b) E(Xₙ₊₁ | Xₙ) = Σ_{j∈S} j p_{Xₙ,j}.
13.5. Section summary.

This section discussed conditioning, with an emphasis on the problems that arise (and their partial resolution) when conditioning on events of probability 0, and more generally on random variables and on sub-σ-algebras. It presented definitions, examples, and various properties of conditional probability and conditional expectation in these cases.
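(Informal computational postscript. The variance decomposition of Theorem 13.3.1 is easy to check numerically for Example 13.1.1, where given X = x the variable Y is uniform on [0, x], so that E(Y | X) = X/2 and Var(Y | X) = X²/12. The Python/NumPy sketch below, with arbitrary sample size and seed, compares the two sides of the decomposition.)

    import numpy as np

    rng = np.random.default_rng(1)

    # (X, Y) uniform on the triangle T of Example 13.1.1, by rejection.
    N = 2_000_000
    x = rng.uniform(0, 2, N)
    y = rng.uniform(0, 2, N)
    keep = y < x
    x, y = x[keep], y[keep]

    # Given X = x, Y ~ Uniform[0, x], so:
    cond_mean = x / 2        # E(Y | X)
    cond_var = x**2 / 12     # Var(Y | X)

    total = np.var(y)
    decomposed = np.mean(cond_var) + np.var(cond_mean)
    print(f"Var(Y)                      ~ {total:.4f}")
    print(f"E[Var(Y|X)] + Var[E(Y|X)]   ~ {decomposed:.4f}")

The two printed values should agree to within Monte Carlo error, illustrating that the total variance splits into the average residual uncertainty plus the variance explained by X.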
14. Martingales.

In this section we study a special kind of stochastic process called a martingale. A stochastic process X₀, X₁, ... is a martingale if E|Xₙ| < ∞ for all n, and with probability 1,

E(Xₙ₊₁ | X₀, X₁, ..., Xₙ) = Xₙ.

Of course, this really means that E(Xₙ₊₁ | Fₙ) = Xₙ, where Fₙ is the σ-algebra σ(X₀, X₁, ..., Xₙ). Intuitively, it says that on average, the value of Xₙ₊₁ is the same as that of Xₙ.

Remark 14.0.1. More generally, we can define a martingale by specifying that E(Xₙ₊₁ | Fₙ) = Xₙ for some choice of increasing sub-σ-fields Fₙ such that Xₙ is measurable with respect to Fₙ. But then σ(X₀, ..., Xₙ) ⊆ Fₙ, so by Proposition 13.2.7,

E(Xₙ₊₁ | X₀, ..., Xₙ) = E[E(Xₙ₊₁ | Fₙ) | X₀, ..., Xₙ] = E[Xₙ | X₀, ..., Xₙ] = Xₙ,

so the above definition is also satisfied, i.e. the two definitions are consistent.

Markov chains often provide good examples of martingales (though there are non-Markovian martingales too, cf. Exercise 14.4.1). Indeed, by Exercise 13.4.13, a Markov chain on a countable state space S, with transition probabilities p_ij, and with E|Xₙ| < ∞, will be a martingale provided that

Σ_{j∈S} j p_ij = i,   i ∈ S.   (14.0.2)

(Intuitively, given that the chain is at state i at time n, on average it will still equal i at time n + 1.) An important specific case is simple symmetric random walk, where S = Z and X₀ = 0 and p_{i,i−1} = p_{i,i+1} = ½ for all i ∈ Z.

We shall also have occasion to consider submartingales and supermartingales. The sequence X₀, X₁, ... is a submartingale if E|Xₙ| < ∞ for all n, and also

E(Xₙ₊₁ | X₀, X₁, ..., Xₙ) ≥ Xₙ.   (14.0.3)

It is a supermartingale if E|Xₙ| < ∞ for all n, and also

E(Xₙ₊₁ | X₀, X₁, ..., Xₙ) ≤ Xₙ.

(These names are very standard, even though they are arguably the reverse of what they should be.) Thus, a process is a martingale if and only if it is
both a submartingale and a supermartingale. And, {Xₙ} is a submartingale if and only if {−Xₙ} is a supermartingale.

Furthermore, again by Exercise 13.4.13, a Markov chain {Xₙ} with E|Xₙ| < ∞ is a submartingale if Σ_{j∈S} j p_ij ≥ i for all i ∈ S, or is a supermartingale if Σ_{j∈S} j p_ij ≤ i for all i ∈ S. Similarly, if X₀ = 0 and Xₙ = Z₁ + ... + Zₙ, where {Zᵢ} are i.i.d. with finite mean, then by Exercise 13.4.12, {Xₙ} is a martingale if E(Zᵢ) = 0, is a supermartingale if E(Zᵢ) ≤ 0, or is a submartingale if E(Zᵢ) ≥ 0.

If {Xₙ} is a submartingale, then taking expectations of both sides of (14.0.3) gives that E(Xₙ₊₁) ≥ E(Xₙ), so by induction

E(Xₙ) ≥ E(Xₙ₋₁) ≥ ... ≥ E(X₁) ≥ E(X₀).   (14.0.4)

Similarly, using Proposition 13.2.7,

E[Xₙ₊₂ | X₀, ..., Xₙ] = E[E(Xₙ₊₂ | X₀, ..., Xₙ₊₁) | X₀, ..., Xₙ] ≥ E[Xₙ₊₁ | X₀, ..., Xₙ] ≥ Xₙ,

so by induction

E(X_m | X₀, X₁, ..., Xₙ) ≥ Xₙ,   m ≥ n.   (14.0.5)

For supermartingales, analogous statements to (14.0.4) and (14.0.5) hold with each ≥ replaced by ≤, and for martingales they can be replaced by =.

14.1. Stopping times.

If X₀, X₁, ... is a martingale, then as in (14.0.4), E(Xₙ) = E(X₀) for all n ∈ N. But what about E(X_τ), where τ is a random time? That is, if we define a new random variable Y by Y(ω) = X_{τ(ω)}(ω), then must we also have E(Y) = E(X₀)? If τ is independent of {Xₙ}, then E(X_τ) is simply a weighted average of different E(Xₙ), and therefore still equals E(X₀). But what if τ and {Xₙ} are not independent?

We shall assume that τ is a stopping time, i.e. a non-negative-integer-valued random variable with the property that {τ = n} ∈ σ(X₀, ..., Xₙ). (Intuitively, this means that one can determine if τ = n just by knowing the values X₀, ..., Xₙ; τ does not "look into the future" to decide whether or not to stop at time n.) Under these conditions, must it be true that E(X_τ) = E(X₀)?

The answer to this question is no in general. Indeed, consider simple symmetric random walk, with X₀ = 0, and let τ = inf{n ≥ 0; Xₙ = −5}. Then τ is a stopping time, since {τ = n} = {X₀ ≠ −5, ..., Xₙ₋₁ ≠ −5, Xₙ = −5} ∈ σ(X₀, ..., Xₙ). Furthermore, since {Xₙ} is recurrent, we
have P(τ < ∞) = 1. On the other hand, clearly X_τ = −5 with probability 1, so that E(X_τ) = −5, not 0.

However, for bounded stopping times the situation is quite different, as the following result (one version of the Optional Sampling Theorem) shows.

Theorem 14.1.1. Let {Xₙ} be a submartingale, let M ∈ N, and let τ₁, τ₂ be stopping times such that 0 ≤ τ₁ ≤ τ₂ ≤ M with probability 1. Then E(X_{τ₂}) ≥ E(X_{τ₁}).

Proof. Note first that {τ₁ < k ≤ τ₂} = {τ₁ ≤ k − 1} ∩ {τ₂ ≤ k − 1}^c, so that (since τ₁ and τ₂ are stopping times) the event {τ₁ < k ≤ τ₂} is in σ(X₀, ..., X_{k−1}). We therefore have, using a telescoping series, linearity of expectation, and the definition of E(··· | X₀, ..., X_{k−1}), that

E(X_{τ₂}) − E(X_{τ₁}) = E(X_{τ₂} − X_{τ₁})
= E(Σ_{k=τ₁+1}^{τ₂} (X_k − X_{k−1}))
= E(Σ_{k=1}^{M} (X_k − X_{k−1}) 1_{τ₁<k≤τ₂})
= Σ_{k=1}^{M} E((X_k − X_{k−1}) 1_{τ₁<k≤τ₂})
= Σ_{k=1}^{M} E((E(X_k | X₀, ..., X_{k−1}) − X_{k−1}) 1_{τ₁<k≤τ₂}).

But since {Xₙ} is a submartingale, E(X_k | X₀, ..., X_{k−1}) − X_{k−1} ≥ 0. Therefore, E(X_{τ₂}) − E(X_{τ₁}) ≥ 0, as claimed.  ∎

Corollary 14.1.2. Let {Xₙ} be a martingale, let M ∈ N, and let τ₁, τ₂ be stopping times such that 0 ≤ τ₁ ≤ τ₂ ≤ M with probability 1. Then E(X_{τ₂}) = E(X_{τ₁}).

Proof. Since both {Xₙ} and {−Xₙ} are submartingales, the theorem gives that E(X_{τ₂}) ≥ E(X_{τ₁}) and also E(−X_{τ₂}) ≥ E(−X_{τ₁}). The result follows.  ∎

Setting τ₁ = 0, we obtain:

Corollary 14.1.3. If {Xₙ} is a martingale, and τ is a bounded stopping time, then E(X_τ) = E(X₀).

Example 14.1.4. Let {Xₙ} be simple symmetric random walk with X₀ = 0, and let τ = min(10¹², inf{n ≥ 0; Xₙ = −5}). Then τ is indeed a bounded stopping time (with τ ≤ 10¹²), so we must have E(X_τ) = 0. This is surprising, since the probability is extremely high that τ < 10¹², and in this case X_τ = −5. However, in the rare case that τ = 10¹², X_τ will (on
average) be so large as to cancel this −5 out. More precisely, using the observation (13.0.1) that E(Z | B) = E(Z 1_B)/P(B) when P(B) > 0, we have that

0 = E(X_τ) = E(X_τ 1_{τ<10¹²}) + E(X_τ 1_{τ=10¹²})
= E(X_τ | τ < 10¹²) P(τ < 10¹²) + E(X_τ | τ = 10¹²) P(τ = 10¹²)
= (−5) P(τ < 10¹²) + E(X_{10¹²} | τ = 10¹²) P(τ = 10¹²).

Here P(τ < 10¹²) ≈ 1 and P(τ = 10¹²) ≈ 0, but E(X_{10¹²} | τ = 10¹²) is so huge that the equation still holds.

A generalisation of Theorem 14.1.1, allowing for unbounded stopping times, is as follows.

Theorem 14.1.5. Let {Xₙ} be a martingale with stopping time τ. Suppose P(τ < ∞) = 1, and E|X_τ| < ∞, and lim_{n→∞} E[Xₙ 1_{τ>n}] = 0. Then E(X_τ) = E(X₀).

Proof. Let Zₙ = X_{min(τ,n)} for n = 0, 1, 2, .... Then Zₙ = X_τ 1_{τ≤n} + Xₙ 1_{τ>n} = X_τ − X_τ 1_{τ>n} + Xₙ 1_{τ>n}, so X_τ = Zₙ − Xₙ 1_{τ>n} + X_τ 1_{τ>n}. Hence,

E(X_τ) = E(Zₙ) − E[Xₙ 1_{τ>n}] + E[X_τ 1_{τ>n}].

Since min(τ, n) is a bounded stopping time (cf. Exercise 14.4.6(b)), it follows from Corollary 14.1.3 that E(Zₙ) = E(X₀) for all n. As n → ∞, the second term goes to 0 by assumption. Also, the third term goes to 0 by the Dominated Convergence Theorem, since E|X_τ| < ∞, and 1_{τ>n} → 0 w.p. 1 since P[τ < ∞] = 1. Hence, letting n → ∞, we obtain that E(X_τ) = E(X₀).  ∎

Remark 14.1.6. Theorem 14.1.5 in turn implies Corollary 14.1.3, since if τ ≤ M, then Xₙ 1_{τ>n} = 0 for all n ≥ M, and also E|X_τ| = E|X₀ 1_{τ=0} + ... + X_M 1_{τ=M}| ≤ E|X₀| + ... + E|X_M| < ∞. However, this reasoning is circular, since we used Corollary 14.1.3 in the proof of Theorem 14.1.5.

Corollary 14.1.7. Let {Xₙ} be a martingale with stopping time τ, such that P[τ < ∞] = 1. Assume that {Xₙ} is bounded up to time τ, i.e. that there is M < ∞ such that |Xₙ| 1_{n≤τ} ≤ M 1_{n≤τ} for all n. Then E(X_τ) = E(X₀).
Proof. We have |X_τ| ≤ M, so that E|X_τ| ≤ M < ∞. Also,

|E(Xₙ 1_{τ>n})| ≤ E(|Xₙ| 1_{τ>n}) ≤ E(M 1_{τ>n}) = M P(τ > n),

which converges to 0 as n → ∞ since P[τ < ∞] = 1. Hence, the result follows from Theorem 14.1.5.  ∎

Remark 14.1.8. For submartingales {Xₙ} we may replace = by ≥ in Corollary 14.1.3 and Theorem 14.1.5 and Corollary 14.1.7, while for supermartingales we may replace = by ≤.

Another approach is given by:

Theorem 14.1.9. Let τ be a stopping time for a martingale {Xₙ}, with P(τ < ∞) = 1. Then E(X_τ) = E(X₀) if and only if lim_{n→∞} E[X_{min(τ,n)}] = E[lim_{n→∞} X_{min(τ,n)}].

Proof. By Corollary 14.1.3, lim_{n→∞} E[X_{min(τ,n)}] = lim_{n→∞} E(X₀) = E(X₀). Also, since P(τ < ∞) = 1, we must have P[lim_{n→∞} X_{min(τ,n)} = X_τ] = 1, so E[lim_{n→∞} X_{min(τ,n)}] = E(X_τ). The result follows.  ∎

Combining Theorem 14.1.9 with the Dominated Convergence Theorem yields:

Corollary 14.1.10. Let τ be a stopping time for a martingale {Xₙ}, with P(τ < ∞) = 1. Suppose there is a random variable Y with E(Y) < ∞ and |X_{min(τ,n)}| ≤ Y for all n. Then E(X_τ) = E(X₀).

As a particular case, we have:

Corollary 14.1.11. Let τ be a stopping time for a martingale {Xₙ}. Suppose E(τ) < ∞, and |Xₙ − Xₙ₋₁| ≤ M < ∞ for all n. Then E(X_τ) = E(X₀).

Proof. Let Y = |X₀| + Mτ. Then E(Y) ≤ E|X₀| + M E(τ) < ∞. Also, |X_{min(τ,n)}| ≤ |X₀| + M min(τ, n) ≤ Y. Hence, the result follows from Corollary 14.1.10.  ∎

Similarly, combining Theorem 14.1.9 with the Uniform Integrability Convergence Theorem yields:

Corollary 14.1.12. Let τ be a stopping time for a martingale {Xₙ}, with P(τ < ∞) = 1. Suppose

lim_{α→∞} sup_n E(|X_{min(τ,n)}| 1_{|X_{min(τ,n)}| ≥ α}) = 0.
Then E(X_τ) = E(X₀).

Example 14.1.13. (Sequence waiting time.) Let {rₙ}ₙ≥₁ be infinite fair coin tossing (cf. Subsection 2.6), and let τ = inf{n ≥ 3 : rₙ₋₂ = 1, rₙ₋₁ = 0, rₙ = 1} be the first time the sequence heads-tails-heads is completed. Surprisingly, E(τ) can be computed using martingales. Indeed, suppose that at each time n, a new player appears and bets $1 on heads, then if they win they bet $2 on tails, then if they win again they bet $4 on heads. (They stop betting as soon as they either lose once or win three bets in a row.) Let Sₙ be the total amount won by all the bettors by time n. Then by construction, {Sₙ} is a martingale with stopping time τ, and furthermore |Sₙ − Sₙ₋₁| ≤ 7 < ∞. Also, E(τ) < ∞, and S_τ = −τ + 10 by Exercise 14.4.12. Hence, by Corollary 14.1.11, 0 = E(S_τ) = −E(τ) + 10, whence E(τ) = 10. (See also Exercise 14.4.13.)

Finally, we observe the following fact about E(X_τ) when {Xₙ} is a general random walk, i.e. is given by sums of i.i.d. random variables. (If the {Zᵢ} are bounded, then part (a) below also follows from applying Corollary 14.1.11 to {Xₙ − nm}.)

Theorem 14.1.14. (Wald's theorem) Let {Zᵢ} be i.i.d. with finite mean m. Let X₀ = a, and Xₙ = a + Z₁ + ... + Zₙ for n ≥ 1. Let τ be a stopping time for {Xₙ}, with E(τ) < ∞. Then
(a) E(X_τ) = a + m E(τ); and furthermore
(b) if m = 0 and E(Z₁²) < ∞, then Var(X_τ) = E(Z₁²) E(τ).

Proof. Since {i ≤ τ} = {τ ≤ i − 1}^c ∈ σ(X₀, ..., X_{i−1}) = σ(Z₁, ..., Z_{i−1}), it follows from Lemma 3.5.2 that {i ≤ τ} and {Zᵢ ∈ B} are independent for any Borel B ⊆ R, so that 1_{i≤τ} and Zᵢ are independent for each i.

For part (a), assume first that the Zᵢ are non-negative. It follows from countable linearity and independence that

E(X_τ − a) = E(Z₁ + ... + Z_τ) = E(Σ_{i=1}^∞ Zᵢ 1_{i≤τ})
= Σ_{i=1}^∞ E(Zᵢ 1_{i≤τ}) = Σ_{i=1}^∞ E(Zᵢ) E(1_{i≤τ})
= E(Z₁) Σ_{i=1}^∞ P(τ ≥ i) = E(Z₁) E(τ),

the last equality following from Proposition 4.2.9.
For general Zᵢ, the above shows that E(Σᵢ |Zᵢ| 1_{i≤τ}) = E|Z₁| E(τ) < ∞. Hence, by Exercise 4.5.14(a) or Corollary 9.4.4, we can write

E(X_τ − a) = E(Σᵢ Zᵢ 1_{i≤τ}) = Σᵢ E(Zᵢ 1_{i≤τ}) = E(Z₁) E(τ).

For part (b), if m = 0 then from the above E(X_τ) = a. Assume first that τ is bounded, i.e. τ ≤ M < ∞. Then using part (a),

Var(X_τ) = E((X_τ − a)²) = E((Σ_{i=1}^{M} Zᵢ 1_{i≤τ})²) = E(Σ_{i=1}^{M} Zᵢ² 1_{i≤τ}) + 0 = E(Z₁²) E(τ),

where the cross terms equal 0 since as before {j ≤ τ} ∈ σ(Z₁, ..., Z_{j−1}), so Zᵢ 1_{j≤τ} is independent of Z_j for i < j, and hence

E(Σ_{i<j} Zᵢ Z_j 1_{j≤τ}) = Σ_{1≤i<j≤M} E(Zᵢ Z_j 1_{j≤τ}) = Σ_{1≤i<j≤M} E(Zᵢ 1_{j≤τ}) E(Z_j) = 0.

If τ is not bounded, then let ρ_k = min(τ, k) be corresponding bounded stopping times. The above gives that Var(X_{ρ_k}) = E(Z₁²) E(ρ_k) for each k. As k → ∞, E(ρ_k) → E(τ) by the Monotone Convergence Theorem, so the result follows from Lemma 14.1.15 below.  ∎

The final argument in the proof of Theorem 14.1.14 requires two technical lemmas involving L² theory:

Lemma 14.1.15. In the proof of Theorem 14.1.14, lim_{k→∞} Var(X_{ρ_k}) = Var(X_τ).

Proof (optional). Clearly X_{ρ_k} → X_τ a.s. Let ‖W‖ = (E(W²))^{1/2} be the usual L² norm. Note first that for m < n, using Theorem 14.1.14(a) and since the cross terms again vanish, we have

‖X_{ρ_n} − X_{ρ_m}‖² = E[(X_{ρ_n} − X_{ρ_m})²] = E[(Z_{ρ_m+1} + ... + Z_{ρ_n})²]
= E[Z²_{ρ_m+1} + ... + Z²_{ρ_n}] + 0
= E[Z₁² + ... + Z²_{ρ_n}] − E[Z₁² + ... + Z²_{ρ_m}]
= E(Z₁²) E(ρ_n) − E(Z₁²) E(ρ_m) = E(Z₁²)[E(ρ_n) − E(ρ_m)],

which → E(Z₁²)[E(τ) − E(τ)] = 0 as n, m → ∞. This means that the sequence {X_{ρ_n}}ₙ₌₁^∞ is a Cauchy sequence in L², and since L² is complete (see e.g. Dudley (1989), Theorem 5.2.1), we must have lim_{n→∞} ‖X_{ρ_n} − Y‖ = 0 for some random variable Y. It then follows from Lemma 14.1.16 below (with p = 2) that Y = X_τ a.s., so that lim_{n→∞} ‖X_{ρ_n} − X_τ‖ = 0. The triangle inequality then gives that |‖X_{ρ_n} − a‖ − ‖X_τ − a‖| ≤ ‖X_{ρ_n} − X_τ‖, so lim_{n→∞} ‖X_{ρ_n} − a‖ = ‖X_τ − a‖. Hence, lim_{n→∞} ‖X_{ρ_n} − a‖² = ‖X_τ − a‖², i.e. lim_{n→∞} Var(X_{ρ_n}) = Var(X_τ).  ∎

Lemma 14.1.16. If {Yₙ} → Y in L^p, i.e. lim_{n→∞} E(|Yₙ − Y|^p) = 0, where 1 ≤ p < ∞, then there is a subsequence {Y_{n_k}} with Y_{n_k} → Y a.s. In particular, if {Yₙ} → Y in L^p, and also {Yₙ} → Z a.s., then Y = Z a.s.

Proof (optional). Let {Y_{n_k}} be any subsequence such that E(|Y_{n_k} − Y|^p) ≤ 4^{−k}, and let A_k = {ω ∈ Ω : |Y_{n_k}(ω) − Y(ω)| ≥ 2^{−k/p}}. Then

4^{−k} ≥ E(|Y_{n_k} − Y|^p) ≥ E(|Y_{n_k} − Y|^p 1_{A_k}) ≥ (2^{−k/p})^p P(A_k) = 2^{−k} P(A_k).

Hence, P(A_k) ≤ 2^{−k}, so Σ_k P(A_k) ≤ 1 < ∞, so by the Borel–Cantelli Lemma, P(lim sup_k A_k) = 0. For any ε > 0, since {ω ∈ Ω : |Y_{n_k}(ω) − Y(ω)| ≥ ε} ⊆ A_k for all k ≥ p log₂(1/ε), this shows that P(|Y_{n_k}(ω) − Y(ω)| ≥ ε i.o.) ≤ P(lim sup_k A_k) = 0. Lemma 5.2.1 then implies that {Y_{n_k}} → Y a.s.  ∎

14.2. Martingale convergence.

If {Xₙ} is a martingale (or a submartingale), will it converge almost surely to some (random) value? Of course, the answer to this question in general is no. Indeed, simple symmetric random walk is a martingale, but it is recurrent, so it will forever oscillate between all the integers, without ever settling down anywhere.

On the other hand, consider the Markov chain on the non-negative integers with (say) X₀ = 50, and with transition probabilities given by p_ij = 1/(2i+1) for 0 ≤ j ≤ 2i, with p_ij = 0 otherwise. (That is, if Xₙ = i, then Xₙ₊₁ is uniformly distributed on the 2i + 1 points 0, 1, ..., 2i.) This Markov chain is a martingale. Clearly if it converges at all it must converge to 0; but does it?
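(Before answering in general, an informal simulation is suggestive. The Python/NumPy sketch below runs the chain just described; the chain length, number of paths, and seed are arbitrary illustrative choices.)

    import numpy as np

    rng = np.random.default_rng(2)

    # Given X_n = i, X_{n+1} is uniform on {0, 1, ..., 2i}; 0 is absorbing.
    def run_chain(steps=200, start=50):
        x = start
        for _ in range(steps):
            x = int(rng.integers(0, 2 * x + 1)) if x > 0 else 0
        return x

    hits_zero = sum(run_chain() == 0 for _ in range(1000))
    print(f"{hits_zero}/1000 paths are at 0 after 200 steps")

Typically all, or nearly all, simulated paths are absorbed at 0, suggesting that Xₙ → 0 a.s.; the theorem below confirms this.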
This question is answered by:

Theorem 14.2.1. (Martingale Convergence Theorem) Let {Xₙ} be a submartingale. Suppose that supₙ E|Xₙ| < ∞. Then there is a (finite) random variable X such that Xₙ → X almost surely (a.s.).

We shall prove Theorem 14.2.1 below. We first note the following immediate corollary.

Corollary 14.2.2. Let {Xₙ} be a martingale which is non-negative (or more generally satisfies that either Xₙ ≥ C for all n, or Xₙ ≤ C for all n, for some C ∈ R). Then there is a (finite) random variable X such that Xₙ → X a.s.

Proof. If {Xₙ} is a non-negative martingale, then E|Xₙ| = E(Xₙ) = E(X₀), so the result follows directly from Theorem 14.2.1. If Xₙ ≥ C [respectively Xₙ ≤ C], then the result follows since {Xₙ − C} [respectively {−Xₙ + C}] is a non-negative martingale.  ∎

It follows from this corollary that the above Markov chain example does indeed converge to 0 a.s.

For a second example, suppose X₀ = 50 and that p_ij = 1/(2 min(i, 100−i) + 1) for |j − i| ≤ min(i, 100 − i). This Markov chain lives on {0, 1, 2, ..., 100}, at each stage jumping uniformly to one of the points within min(i, 100 − i) of its current position. By Corollary 14.2.2, {Xₙ} → X a.s. for some random variable X, and indeed it is easily seen that P(X = 0) = P(X = 100) = ½.

For a third example, let S = {2ⁿ; n ∈ Z}, with a Markov chain on S having transition probabilities p_{i,2i} = ⅓ and p_{i,i/2} = ⅔. This is again a martingale, and in fact it converges to 0 a.s. (even though it is unbounded).

For a fourth example, let S = N be the set of positive integers, with X₀ = 1. Let p_{ii} = 1 for i even, with p_{i,i−1} = 2/3 and p_{i,i+2} = 1/3 for i odd. Then it is easy to see that this Markov chain is a non-negative martingale, which converges a.s. to a random variable X having the property that P(X = i) > 0 for every even non-negative integer i.

It remains to prove Theorem 14.2.1. We require the following lemma.

Lemma 14.2.3. (Upcrossing Lemma) Let {Xₙ} be a submartingale. For M ∈ N and α < β, let

U_{αβ}^{(M)} = sup{k; ∃ a₁ < b₁ < a₂ < b₂ < ... < a_k < b_k ≤ M; X_{a_i} ≤ α, X_{b_i} ≥ β}.
(Intuitively, U_{αβ}^{(M)} represents the number of times the process {Xₙ} "upcrosses" the interval [α, β] by time M.) Then

E(U_{αβ}^{(M)}) ≤ E|X_M − X₀| / (β − α).

Proof. By Exercise 14.4.2, the sequence {max(Xₙ, α)} is also a submartingale, and it clearly has the same number of upcrossings of [α, β] as does {Xₙ}. Furthermore, |max(X_M, α) − max(X₀, α)| ≤ |X_M − X₀|. Hence, for the remainder of the proof we can (and do) replace Xₙ by max(Xₙ, α), i.e. we assume that Xₙ ≥ α for all n.

Let u₀ = v₀ = 0, and iteratively define u_j and v_j for j ≥ 1 by

u_j = min(M, inf{k ≥ v_{j−1}; X_k ≤ α}),   v_j = min(M, inf{k ≥ u_j; X_k ≥ β}).

We necessarily have v_M = M, so that

E(X_M) = E(X_{v_M}) = E(X_{v_M} − X_{u_M} + X_{u_M} − X_{v_{M−1}} + X_{v_{M−1}} − X_{u_{M−1}} + ... + X_{u_1} − X₀ + X₀)
= E(X₀) + E(Σ_{k=1}^{M} (X_{v_k} − X_{u_k})) + Σ_{k=1}^{M} E(X_{u_k} − X_{v_{k−1}}).

Now, E(X_{u_k} − X_{v_{k−1}}) ≥ 0 by Theorem 14.1.1, since {Xₙ} is a submartingale and v_{k−1} ≤ u_k ≤ M. Hence, Σ_{k=1}^{M} E(X_{u_k} − X_{v_{k−1}}) ≥ 0. (This is the subtlest part of the proof; since usually X_{u_k} ≤ α and X_{v_{k−1}} ≥ β, it is surprising that E(X_{u_k} − X_{v_{k−1}}) ≥ 0.)

Furthermore, we claim that

E(Σ_{k=1}^{M} (X_{v_k} − X_{u_k})) ≥ E((β − α) U_{αβ}^{(M)}).   (14.2.4)

Indeed, each upcrossing contributes at least β − α to the sum. And, any "null cycle" where u_k = v_k = M contributes nothing, since then X_{v_k} − X_{u_k} = X_M − X_M = 0. Finally, we may have one "incomplete cycle" where u_k < M but v_k = M. However, since we are assuming that Xₙ ≥ α, we must have X_{v_k} − X_{u_k} = X_M − X_{u_k} ≥ α − α = 0 in this case; that is, such an incomplete cycle could only increase the sum. This establishes (14.2.4).

We conclude that E(X_M) ≥ E(X₀) + (β − α) E(U_{αβ}^{(M)}). Re-arranging,

(β − α) E(U_{αβ}^{(M)}) ≤ E(X_M) − E(X₀) ≤ E|X_M − X₀|,

which gives the result.  ∎
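(Informal computational aside. The quantity U_{αβ}^{(M)} can be computed directly along any sample path, and the bound just proved can be checked by averaging over simulated paths. The Python/NumPy sketch below does this for simple symmetric random walk, which is a martingale and hence also a submartingale; the path length, interval, number of repetitions, and seed are arbitrary choices.)

    import numpy as np

    rng = np.random.default_rng(3)

    def upcrossings(path, alpha, beta):
        # Count completed upcrossings of [alpha, beta] along the path:
        # a visit to (-inf, alpha] followed later by a visit to [beta, inf).
        count, below = 0, False
        for v in path:
            if v <= alpha:
                below = True
            elif v >= beta and below:
                count += 1
                below = False
        return count

    M, alpha, beta, reps = 1000, -2.0, 2.0, 5000
    us, gaps = [], []
    for _ in range(reps):
        path = np.concatenate([[0], np.cumsum(rng.choice([-1, 1], M))])
        us.append(upcrossings(path, alpha, beta))
        gaps.append(abs(path[-1] - path[0]))

    print(f"E(U) estimate:  {np.mean(us):.3f}")
    print(f"bound estimate: {np.mean(gaps) / (beta - alpha):.3f}")

The estimated E(U_{αβ}^{(M)}) should fall below the estimated E|X_M − X₀|/(β − α), as the Upcrossing Lemma requires.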
Proof of Theorem 14.2.1. Let K = supₙ E(|Xₙ|) < ∞. Note first that by Fatou's lemma,

E(lim inf_n |Xₙ|) ≤ lim inf_n E|Xₙ| ≤ K < ∞.

It follows that P(|Xₙ| → ∞) = 0, i.e. {Xₙ} will not diverge to ±∞.

Suppose now that P(lim inf Xₙ < lim sup Xₙ) > 0. Then since

{lim inf Xₙ < lim sup Xₙ} = ∪_{q∈Q} ∪_{k∈N} {lim inf Xₙ < q < q + 1/k < lim sup Xₙ},

it follows from countable subadditivity that we can find α, β ∈ Q with P(lim inf Xₙ < α < β < lim sup Xₙ) > 0. With U_{αβ}^{(M)} as in Lemma 14.2.3, this implies that P(lim_{M→∞} U_{αβ}^{(M)} = ∞) > 0, so E(lim_{M→∞} U_{αβ}^{(M)}) = ∞. Then by the monotone convergence theorem, lim_{M→∞} E(U_{αβ}^{(M)}) = E(lim_{M→∞} U_{αβ}^{(M)}) = ∞. But Lemma 14.2.3 says that for all M ∈ N, E(U_{αβ}^{(M)}) ≤ E|X_M − X₀| / (β − α) ≤ 2K / (β − α) < ∞, a contradiction.

We conclude that P(limₙ→∞ |Xₙ| = ∞) = 0 and also P(lim inf Xₙ < lim sup Xₙ) = 0. Hence, we must have P(limₙ→∞ Xₙ exists and is finite) = 1, as claimed.  ∎

14.3. Maximal inequality.

Markov's inequality says that for α > 0, P(X₀ ≥ α) ≤ P(|X₀| ≥ α) ≤ E|X₀| / α. Surprisingly, for a submartingale, the same inequality holds with X₀ replaced by max₀≤i≤n Xᵢ, even though usually (max₀≤i≤n Xᵢ) ≫ X₀.

Theorem 14.3.1. (Martingale maximal inequality) If {Xₙ} is a submartingale, then for all α > 0,

P((max₀≤i≤n Xᵢ) ≥ α) ≤ E|Xₙ| / α.

Proof. Let A_k be the event {X_k ≥ α, but Xᵢ < α for i < k}, i.e. the event that the process first reaches α at time k. And let A = ∪₀≤k≤n A_k be the event that the process reaches α by time n.
We need to show that P(A) ≤ E|Xₙ| / α, i.e. that α P(A) ≤ E|Xₙ|. To that end, we compute (letting F_k = σ(X₀, X₁, ..., X_k)) that

α P(A) = Σ_{k=0}^{n} α P(A_k)   since {A_k} disjoint
= Σ_{k=0}^{n} E(α 1_{A_k})
≤ Σ_{k=0}^{n} E(X_k 1_{A_k})   since X_k ≥ α on A_k
≤ Σ_{k=0}^{n} E(E(Xₙ | F_k) 1_{A_k})   by (14.0.5)
= Σ_{k=0}^{n} E(Xₙ 1_{A_k})   since A_k ∈ F_k
= E(Xₙ 1_A)   since {A_k} disjoint
≤ E(|Xₙ| 1_A) ≤ E|Xₙ|,

as required.  ∎

Theorem 14.3.1 is clearly false if {Xₙ} is not a submartingale. For example, if the Xᵢ are i.i.d. equal to 0 or 2 each with probability ½, and if α = 2, then P[(max₀≤i≤n Xᵢ) ≥ α] = 1 − (½)ⁿ⁺¹, which for n ≥ 1 is not ≤ E|Xₙ| / α = ½.

If {Xₙ} is in fact a non-negative martingale, then E|Xₙ| = E(Xₙ) = E(X₀), so the bound in Theorem 14.3.1 does not depend on n. Hence, letting n → ∞ and using continuity of probabilities, we obtain:

Corollary 14.3.2. If {Xₙ} is a non-negative martingale, then for all α > 0,

P((sup₀≤i<∞ Xᵢ) ≥ α) ≤ E(X₀) / α.

For example, consider the third Markov chain example above, where S = {2ⁿ; n ∈ Z} and p_{i,2i} = ⅓, p_{i,i/2} = ⅔. If, say, X₀ = 1, then we obtain that P[(sup₀≤i<∞ Xᵢ) ≥ 2] ≤ ½, which is perhaps surprising. (This result also follows from applying (7.2.7) to the simple non-symmetric random walk {log₂ Xₙ}.)

Remark 14.3.3. (Martingale Central Limit Theorem) If Xₙ = Z₀ + ... + Zₙ, where {Zᵢ} are i.i.d. with mean 0 and variance 1, then we already know from the classical Central Limit Theorem that Xₙ/√n ⇒ N(0, 1), i.e. that the properly normalised Xₙ converge weakly to a standard normal distribution. In fact, a similar result is true for more general martingales {Xₙ}. Let Fₙ = σ(X₀, X₁, ..., Xₙ). Set σ₀² = Var(X₀), and for n ≥ 1 set

σₙ² = Var(Xₙ | Fₙ₋₁) = E(Xₙ² − Xₙ₋₁² | Fₙ₋₁).

It then follows from induction that Var(Xₙ) = Σ_{k=0}^{n} E(σ_k²), and of course E(Xₙ) = E(X₀) does not grow with n. Hence, if we set v_t = min{n ≥
0; Σ_{k=0}^{n} σ_k² ≥ t}, then for large t, the random variable X_{v_t}/√t has mean close to 0 and variance close to 1. We then have X_{v_t}/√t ⇒ N(0, 1) as t → ∞ under certain conditions, for example if Σₙ σₙ² = ∞ with probability one and the differences Xₙ − Xₙ₋₁ are uniformly bounded. For a proof see e.g. Billingsley (1995, Theorem 35.11).

14.4. Exercises.

Exercise 14.4.1. Let {Zᵢ} be i.i.d. with P(Zᵢ = 1) = P(Zᵢ = 0) = 1/2. Let X₀ = 0, X₁ = Z₁, and for n ≥ 2, Xₙ = Xₙ₋₁ + (1 + Z₁ + ... + Zₙ₋₁)(2Zₙ − 1). (Intuitively, this corresponds to wagering, at each time n, one dollar more than the number of previous victories.)
(a) Prove that {Xₙ} is a martingale.
(b) Prove that {Xₙ} is not a Markov chain.

Exercise 14.4.2. Let {Xₙ} be a submartingale, and let α ∈ R. Let Yₙ = max(Xₙ, α). Prove that {Yₙ} is also a submartingale. [Hint: Use Exercise 4.5.2.]

Exercise 14.4.3. Let {Xₙ} be a non-negative submartingale. Let Yₙ = (Xₙ)². Assuming E(Yₙ) < ∞ for all n, prove that {Yₙ} is also a submartingale.

Exercise 14.4.4. The conditional Jensen's inequality states that if φ is a convex function, then E(φ(X) | G) ≥ φ(E(X | G)).
(a) Assuming this, prove that if {Xₙ} is a submartingale, then so is {φ(Xₙ)} whenever φ is non-decreasing and convex with E|φ(Xₙ)| < ∞ for all n.
(b) Show that the conclusions of the two previous exercises follow from part (a).

Exercise 14.4.5. Let Z be a random variable on a probability triple (Ω, F, P), and let G₀ ⊆ G₁ ⊆ ... ⊆ F be a nested sequence of sub-σ-algebras. Let Xₙ = E(Z | Gₙ). (If we think of Gₙ as the amount of information we have available at time n, then Xₙ represents our best guess of the value Z at time n.) Prove that X₀, X₁, ... is a martingale. [Hint: Use Proposition 13.2.7 to show that E(Xₙ₊₁ | Gₙ) = Xₙ. Then use the fact that Xᵢ is Gᵢ-measurable to prove that σ(X₀, ..., Xₙ) ⊆ Gₙ.]

Exercise 14.4.6. Let {Xₙ} be a stochastic process, let τ and ρ be two non-negative-integer-valued random variables, and let m ∈ N.
(a) Prove that τ is a stopping time for {Xₙ} if and only if {τ ≤ n} ∈ σ(X₀, ..., Xₙ) for all n ≥ 0.
(b) Prove that if τ is a stopping time, then so is min(τ, m).
(c) Prove that if τ and ρ are stopping times for {Xₙ}, then so is min(τ, ρ).

Exercise 14.4.7. Let C ∈ R, and let {Zᵢ} be an i.i.d. collection of random variables with P[Zᵢ = −1] = 3/4 and P[Zᵢ = C] = 1/4. Let X₀ = 5, and Xₙ = 5 + Z₁ + Z₂ + ... + Zₙ for n ≥ 1.
(a) Find a value of C such that {Xₙ} is a martingale.
(b) For this value of C, prove or disprove that there is a random variable X such that as n → ∞, Xₙ → X with probability 1.
(c) For this value of C, prove or disprove that P[Xₙ = 0 for some n ∈ N] = 1.

Exercise 14.4.8. Let {Xₙ} be simple symmetric random walk, with X₀ = 0. Let τ = inf{n ≥ 5 : Xₙ₊₁ = Xₙ + 1} be the first time after 4 which is just before the chain increases. Let ρ = τ + 1.
(a) Is τ a stopping time? Is ρ a stopping time?
(b) Use Theorem 14.1.5 to compute E(X_ρ).
(c) Use the result of part (b) to compute E(X_τ). Why does this not contradict Theorem 14.1.5?

Exercise 14.4.9. Let 0 < a < c be integers, with c ≥ 3. Let {Xₙ} be simple symmetric random walk with X₀ = a, let σ = inf{n ≥ 1 : Xₙ = 0 or c}, and let ρ = σ − 1. Determine whether or not E(X_σ) = a, and whether or not E(X_ρ) = a. Relate the results to Corollary 14.1.7.

Exercise 14.4.10. Let 0 < a < c be integers. Let {Xₙ} be simple symmetric random walk, started at X₀ = a. Let τ = inf{n ≥ 1; Xₙ = 0 or c}.
(a) Prove that {Xₙ} is a martingale.
(b) Prove that E(X_τ) = a. [Hint: Use Corollary 14.1.7.]
(c) Use this fact to derive an alternative proof of the gambler's ruin formula given in Section 7.2, for the case p = 1/2.

Exercise 14.4.11. Let 0 < p < 1 with p ≠ 1/2, and let 0 < a < c be integers. Let {Xₙ} be simple random walk with parameter p, started at X₀ = a. Let τ = inf{n ≥ 1; Xₙ = 0 or c}. Let Zₙ = ((1 − p)/p)^{Xₙ} for n = 0, 1, 2, ....
(a) Prove that {Zₙ} is a martingale.
(b) Prove that E(Z_τ) = ((1 − p)/p)^a. [Hint: Use Corollary 14.1.7.]
(c) Use this fact to derive an alternative proof of the gambler's ruin formula given in Section 7.2, for the case p ≠ 1/2.

Exercise 14.4.12. Let {Sₙ} and τ be as in Example 14.1.13.
(a) Prove that E(τ) < ∞. [Hint: Show that P(τ > 3m) ≤ (7/8)ᵐ, and
use Proposition 4.2.9.]
(b) Prove that S_τ = −τ + 10. [Hint: By considering the τ different players one at a time, argue that S_τ = (τ − 3)(−1) + 7 − 1 + 1.]

Exercise 14.4.13. Similar to Example 14.1.13, let {rₙ}ₙ≥₁ be infinite fair coin tossing, σ = inf{n ≥ 3 : rₙ₋₂ = 0, rₙ₋₁ = 1, rₙ = 1}, and ρ = inf{n ≥ 4 : rₙ₋₃ = 0, rₙ₋₂ = 1, rₙ₋₁ = 0, rₙ = 1}.
(a) Describe σ and ρ in plain English.
(b) Compute E(σ).
(c) Does E(σ) = E(τ), with τ as in Example 14.1.13? Why or why not?
(d) Compute E(ρ).

Exercise 14.4.14. Why does the proof of Theorem 14.1.1 fail if M = ∞? [Hint: Exercise 4.5.14 may help.]

Exercise 14.4.15. Modify the proof of Theorem 14.1.1 to show that if {Xₙ} is a submartingale with |Xₙ₊₁ − Xₙ| ≤ M < ∞, and τ₁ ≤ τ₂ are stopping times (not necessarily bounded) with E(τ₂) < ∞, then we still have E(X_{τ₂}) ≥ E(X_{τ₁}). [Hint: Corollary 9.4.4 may help.] (Compare Corollary 14.1.11.)

Exercise 14.4.16. Let {Xₙ} be simple symmetric random walk, with X₀ = 10. Let τ = inf{n ≥ 1; Xₙ = 0}, and let Yₙ = X_{min(n,τ)}. Determine (with explanation) whether each of the following statements is true or false.
(a) E(X₂₀₀) = 10.
(b) E(Y₂₀₀) = 10.
(c) E(X_τ) = 10.
(d) E(Y_τ) = 10.
(e) There is a random variable X such that {Xₙ} → X a.s.
(f) There is a random variable Y such that {Yₙ} → Y a.s.

Exercise 14.4.17. Let {Xₙ} be simple symmetric random walk with X₀ = 0, and let τ = inf{n ≥ 0; Xₙ = −5}.
(a) What is E(X_τ) in this case?
(b) Why does this fact not contradict Wald's theorem part (a) (with a = m = 0)?

Exercise 14.4.18. Let 0 < p < 1 with p ≠ 1/2, and let 0 < a < c be integers. Let {Xₙ} be simple random walk with parameter p, started at X₀ = a. Let τ = inf{n ≥ 1; Xₙ = 0 or c}. (Thus, from the gambler's ruin solution of Subsection 7.2, P(X_τ = c) = [((1−p)/p)^a − 1] / [((1−p)/p)^c − 1].)
(a) Compute E(X_τ) by direct computation.
(b) Use Wald's theorem part (a) to compute E(τ) in terms of E(X_τ).
(c) Prove that the game's expected duration satisfies E(τ) = (a − c[((1−p)/p)^a − 1] / [((1−p)/p)^c − 1]) / (1 − 2p).
(d) Show that the limit of E(τ) as p → 1/2 is equal to a(c − a).

Exercise 14.4.19. Let 0 < a < c be integers, and let {Xₙ} be simple symmetric random walk with X₀ = a. Let τ = inf{n ≥ 1; Xₙ = 0 or c}.
(a) Compute Var(X_τ) by direct computation.
(b) Use Wald's theorem part (b) to compute E(τ) in terms of Var(X_τ).
(c) Prove that the game's expected duration satisfies E(τ) = a(c − a).
(d) Relate this result to part (d) of the previous exercise.

Exercise 14.4.20. Let {Xₙ} be a martingale with |Xₙ₊₁ − Xₙ| ≤ 10 for all n. Let τ = inf{n ≥ 1 : |Xₙ| ≥ 100}.
(a) Prove or disprove that this implies that P(τ < ∞) = 1.
(b) Prove or disprove that this implies there is a random variable X with {Xₙ} → X a.s.
(c) Prove or disprove that this implies that P[τ < ∞, or there is a random variable X with {Xₙ} → X] = 1. [Hint: Let Yₙ = X_{min(τ,n)}.]

Exercise 14.4.21. Let Z₁, Z₂, ... be independent, with

P(Zᵢ = 2^i/(2^i − 1)) = (2^i − 1)/2^i   and   P(Zᵢ = −2^i) = 1/2^i.

Let X₀ = 0, and Xₙ = Z₁ + ... + Zₙ for n ≥ 1.
(a) Prove that {Xₙ} is a martingale. [Hint: Don't forget (14.0.2).]
(b) Prove that P[Zᵢ ≥ 1 a.a.] = 1, i.e. that with probability 1, Zᵢ ≥ 1 for all but finitely many i. [Hint: Don't forget the Borel–Cantelli Lemma.]
(c) Prove that P[limₙ→∞ Xₙ = ∞] = 1. (Hence, even though {Xₙ} is a martingale and thus represents a player's fortune in a "fair" game, it is still certain that the player's fortune will converge to +∞.)
(d) Why does this result not contradict Corollary 14.2.2?
(e) Let τ = inf{n ≥ 1 : Xₙ ≤ 0}, and let Yₙ = X_{min(τ,n)}. Prove that {Yₙ} is also a martingale, and that P[limₙ→∞ Yₙ = ∞] > 0. Why does this result not contradict Corollary 14.2.2?

14.5. Section summary.

This section provided a brief introduction to martingales, including submartingales, supermartingales, and stopping times. Some examples were given, mostly arising from Markov chains. Various versions of the Optional Sampling Theorem were proved, giving conditions such that E(X_τ) = E(X₀). This together with the Upcrossing Lemma was then used to prove the important Martingale Convergence Theorem. Finally, the Martingale maximal inequality was presented.
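(Informal computational postscript. The optional-sampling computation of Example 14.1.13 gave E(τ) = 10 for the waiting time of the pattern heads-tails-heads, and this is easy to check by direct simulation. The Python/NumPy sketch below, with an arbitrary number of repetitions and seed, estimates E(τ); here 1 stands for heads and 0 for tails, as in the example.)

    import numpy as np

    rng = np.random.default_rng(4)

    def wait_for_HTH():
        # Return the first time n >= 3 with (r_{n-2}, r_{n-1}, r_n) = (1, 0, 1).
        last3, n = [], 0
        while True:
            n += 1
            last3 = (last3 + [int(rng.integers(0, 2))])[-3:]
            if last3 == [1, 0, 1]:
                return n

    taus = [wait_for_HTH() for _ in range(100_000)]
    print(f"E(tau) estimate: {np.mean(taus):.3f}   (theory: exactly 10)")

The empirical mean should be close to 10, in agreement with the martingale argument (and with Exercise 14.4.13, which shows that different patterns of the same length can have different expected waiting times).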
15. General stochastic processes.

We end this text with a brief look at more general stochastic processes. We attempt to give an intuitive discussion of this area, without being overly careful about mathematical precision. (A full treatment of these processes would require another course, perhaps following one of the books in Subsection B.5.) In particular, in this section a number of results are stated without being proved, and a number of equations are derived in an intuitive and non-rigorous manner. All exercises are grouped by subsection, rather than at the end of the entire section.

To begin, we define a (completely general) stochastic process to be any collection {X_t; t ∈ T} of random variables defined jointly on some probability triple. Here T can be any non-empty index set. If T = {0, 1, 2, ...} then this corresponds to our usual discrete-time stochastic processes; if T = [0, ∞) is the non-negative real numbers, then this corresponds to a continuous-time stochastic process as discussed below.

15.1. Kolmogorov Existence Theorem.

As with all mathematical objects, a proper analysis of stochastic processes should begin with a proof that they exist. In two places in this text (Theorems 7.1.1 and 8.1.1), we have proved the existence of certain random variables, defined on certain underlying probability triples, having certain specified properties. The Kolmogorov Existence Theorem is a huge generalisation of these results, which allows us to define stochastic processes quite generally, as we now discuss.

Given a stochastic process {X_t; t ∈ T}, and k ∈ N, and a finite collection t₁, ..., t_k ∈ T of distinct index values, we define the Borel probability measure μ_{t₁...t_k} on R^k by

μ_{t₁...t_k}(H) = P((X_{t₁}, ..., X_{t_k}) ∈ H),   H ⊆ R^k Borel.

The distributions {μ_{t₁...t_k}; k ∈ N, t₁, ..., t_k ∈ T distinct} are called the finite-dimensional distributions for the stochastic process {X_t; t ∈ T}. These finite-dimensional distributions clearly satisfy two sorts of consistency conditions:

(C1) If (s(1), s(2), ..., s(k)) is any permutation of (1, 2, ..., k) (meaning that s : {1, 2, ..., k} → {1, 2, ..., k} is one-to-one), then for distinct t₁, ..., t_k ∈ T, and any Borel H₁, ..., H_k ⊆ R, we have

μ_{t₁...t_k}(H₁ × ... × H_k) = μ_{t_{s(1)}...t_{s(k)}}(H_{s(1)} × ... × H_{s(k)}).   (15.1.1)
That is, if we permute the indices tᵢ, and correspondingly permute the sets in the product H₁ × ... × H_k, then we do not change the probabilities. For example, we must have P(X ∈ A, Y ∈ B) = P(Y ∈ B, X ∈ A), even though this will not usually equal P(Y ∈ A, X ∈ B).

(C2) For distinct t₁, ..., t_k ∈ T, and any Borel H₁, ..., H_{k−1} ⊆ R, we have

μ_{t₁...t_k}(H₁ × ... × H_{k−1} × R) = μ_{t₁...t_{k−1}}(H₁ × ... × H_{k−1}).   (15.1.2)

That is, allowing X_{t_k} to be anywhere in R is equivalent to not mentioning X_{t_k} at all. For example, P(X ∈ A, Y ∈ R) = P(X ∈ A).

Conditions (C1) and (C2) are quite obvious and uninteresting. However, what is surprising is that they have an immediate converse; that is, for any collection of finite-dimensional distributions satisfying them, there exists a corresponding stochastic process. A formal statement is:

Theorem 15.1.3. (Kolmogorov Existence Theorem) A family of Borel probability measures {μ_{t₁...t_k}; k ∈ N, tᵢ ∈ T distinct}, with μ_{t₁...t_k} a measure on R^k, satisfies the consistency conditions (C1) and (C2) above if and only if there exists a probability triple (R^T, F^T, P), and random variables {X_t}_{t∈T} defined on this triple, such that for all k ∈ N, distinct t₁, ..., t_k ∈ T, and Borel H ⊆ R^k, we have

P((X_{t₁}, ..., X_{t_k}) ∈ H) = μ_{t₁...t_k}(H).   (15.1.4)

The theorem thus says that, under extremely general conditions, stochastic processes exist. Theorems 7.1.1 and 8.1.1 follow immediately as special cases (Exercises 15.1.6 and 15.1.7).

The "only if" direction is immediate, as discussed above. To prove the "if" direction, we can take

R^T = {all functions T → R}   and   F^T = σ({X_t ∈ H}; t ∈ T, H ⊆ R Borel).

The idea of the proof is to first use (15.1.4) to define P for subsets of the form {(X_{t₁}, ..., X_{t_k}) ∈ H} (for distinct t₁, ..., t_k ∈ T, and Borel H ⊆ R^k), and then extend the definition of P to the entire σ-algebra F^T, analogously to the proof of the Extension Theorem (Theorem 2.3.1). The argument is somewhat involved; for details see e.g. Billingsley (1995, Theorem 36.1).

Exercise 15.1.5. Suppose that (15.1.1) holds whenever s is a transposition, i.e. a one-to-one function s : {1, 2, ..., k} → {1, 2, ..., k} such that
s(i) = i for k − 2 choices of i ∈ {1, 2, ..., k}. Prove that the first consistency condition is satisfied, i.e. that (15.1.1) holds for all permutations s.

Exercise 15.1.6. Let X₁, X₂, ... be independent, with Xₙ ~ μₙ.
(a) Specify the finite-dimensional distributions μ_{t₁,t₂,...,t_k} for distinct non-negative integers t₁, t₂, ..., t_k.
(b) Prove that these μ_{t₁,t₂,...,t_k} satisfy (15.1.1) and (15.1.2).
(c) Prove that Theorem 7.1.1 follows from Theorem 15.1.3.

Exercise 15.1.7. Consider the definition of a discrete Markov chain {Xₙ}, from Section 8.
(a) Specify the finite-dimensional distributions μ_{t₁,t₂,...,t_k} for non-negative integers t₁ < t₂ < ... < t_k.
(b) Prove that they satisfy (15.1.1).
(c) Specify the finite-dimensional distributions μ_{t₁,t₂,...,t_k} for general distinct non-negative integers t₁, t₂, ..., t_k, and explain why they satisfy (15.1.2). [Hint: It may be helpful to define the order statistics, whereby t_(r) is the r-th-largest element of {t₁, t₂, ..., t_k} for 1 ≤ r ≤ k.]
(d) Prove that Theorem 8.1.1 follows from Theorem 15.1.3.

15.2. Markov chains on general state spaces.

In Section 8 we considered Markov chains on countable state spaces S, in terms of an initial distribution {νᵢ}_{i∈S} and transition probabilities {p_ij}_{i,j∈S}. We now generalise many of the notions there to general (perhaps uncountable) state spaces.

We require a general state space X, which is any non-empty (perhaps uncountable) set, together with a σ-algebra F of measurable subsets. The transition probabilities are then given by {P(x, A)}_{x∈X, A∈F}. We make the following two assumptions:

(A1) For each fixed x ∈ X, P(x, ·) is a probability measure on (X, F).
(A2) For each fixed A ∈ F, P(x, A) is a measurable function of x ∈ X.

Intuitively, P(x, A) is the probability, if the chain is at a point x, that it will jump to the subset A at the next step. If X is countable, then P(x, {i}) corresponds to the transition probability p_{xi} of the discrete Markov chains of Section 8. But on a general state space, we may have P(x, {i}) = 0 for all i ∈ X.

We also require an initial distribution ν, which is any probability distribution on (X, F). The transition probabilities and initial distribution then give rise to a (discrete-time, general state space, time-homogeneous) Markov chain X₀, X₁, X₂, ..., where

P(X₀ ∈ A₀, X₁ ∈ A₁, ..., Xₙ ∈ Aₙ)
= ∫_{x₀∈A₀} ν(dx₀) ∫_{x₁∈A₁} P(x₀, dx₁) ... ∫_{x_{n−1}∈A_{n−1}} P(x_{n−2}, dx_{n−1}) P(x_{n−1}, Aₙ).   (15.2.1)

Note that these integrals (i.e., expected values) are well-defined because of condition (A2) above.

As before, we shall write P_x(···) for the probability of an event conditional on X₀ = x, i.e. under the assumption that the initial distribution ν is a point-mass at the point x. And, we define higher-order transition probabilities inductively by P¹(x, A) = P(x, A), and Pⁿ⁺¹(x, A) = ∫_X P(x, dz) Pⁿ(z, A) for n ≥ 1.

Analogous to the countable state space case, we define a stationary distribution for a Markov chain to be a probability measure π(·) on (X, F), such that π(A) = ∫_X π(dx) P(x, A) for all A ∈ F. (This generalises our earlier definition π_j = Σ_{i∈S} πᵢ p_ij.) As in the countably infinite case, Markov chains on general state spaces may or may not have stationary distributions.

Example 15.2.2. Consider the Markov chain on the real line (i.e. with X = R), where P(x, ·) = N(x/2, 3/4) for each x ∈ X. In words, if Xₙ is equal to some real number x, then the conditional distribution of Xₙ₊₁ will be normal, with mean x/2 and variance 3/4. Equivalently, Xₙ₊₁ = ½Xₙ + Uₙ₊₁, where {Uₙ} are i.i.d. with Uₙ ~ N(0, 3/4). This example is analysed in Exercise 15.2.5 below; in particular, it is shown that π(·) = N(0, 1) is a stationary distribution for this chain.

For countable state spaces S, we defined irreducibility to mean that for all i, j ∈ S, there is n ∈ N with p_ij^(n) > 0. On uncountable state spaces this definition is of limited use, since we will often (e.g. in the above example) have p_ij^(n) = 0 for all i, j ∈ X and all n ≥ 1. Instead, we say that a Markov chain on a general state space X is φ-irreducible if there is a non-zero, σ-finite measure ψ on (X, F) such that for any A ∈ F with ψ(A) > 0, we have P_x(τ_A < ∞) > 0 for all x ∈ X. (Here τ_A = inf{n ≥ 0; Xₙ ∈ A} is the first hitting time of the subset A; thus, τ_A < ∞ is the event that the chain eventually hits the subset A, and φ-irreducibility is the statement that the chain has positive probability of eventually hitting any subset A of positive ψ measure.)

Similarly, on countable state spaces S we defined aperiodicity to mean that for all i ∈ S, gcd{n ≥ 1 : p_ii^(n) > 0} = 1, but on uncountable state spaces we will usually have p_ii^(n) = 0 for all n. Instead, we define the period of a general-state-space Markov chain to be the largest (finite) positive integer d such that there are non-empty disjoint subsets X₁, ..., X_d ⊆ X,
with P(x, X_{i+1}) = 1 for all x ∈ Xᵢ (1 ≤ i ≤ d − 1), and P(x, X₁) = 1 for all x ∈ X_d. The chain is periodic if its period is greater than 1; otherwise the chain is aperiodic.

In terms of these definitions, a fundamental theorem about general state space Markov chains is the following generalisation of Theorem 8.3.10. For a proof see e.g. Meyn and Tweedie (1993).

Theorem 15.2.3. If a discrete-time Markov chain on a general state space is φ-irreducible and aperiodic, and furthermore has a stationary distribution π(·), then for π-almost every x ∈ X, we have that

lim_{n→∞} sup_{A∈F} |Pⁿ(x, A) − π(A)| = 0.

In words, the Markov chain converges to its stationary distribution in the total variation distance metric.

Exercise 15.2.4. Consider a Markov chain which is φ-irreducible with respect to some non-zero σ-finite measure ψ, and which is periodic with corresponding disjoint subsets X₁, ..., X_d. Let B = ∪_{i=1}^d Xᵢ.
(a) Prove that Pⁿ(x, B^c) = 0 for all x ∈ B.
(b) Prove that ψ(B^c) = 0.
(c) Prove that ψ(Xᵢ) > 0 for some i.

Exercise 15.2.5. Consider the Markov chain of Example 15.2.2.
(a) Let π(·) = N(0, 1) be the standard normal distribution. Prove that π(·) is a stationary distribution for this Markov chain.
(b) Prove that this Markov chain is φ-irreducible with respect to λ, where λ is Lebesgue measure on R.
(c) Prove that this Markov chain is aperiodic. [Hint: Don't forget Exercise 15.2.4.]
(d) Apply Theorem 15.2.3 to this Markov chain, writing your conclusion as explicitly as possible.

Exercise 15.2.6. (a) Prove that a Markov chain on a countable state space X is φ-irreducible if and only if there is j ∈ X such that Pᵢ(τⱼ < ∞) > 0 for all i ∈ X.
(b) Give an example of a Markov chain on a countable state space which is φ-irreducible, but which is not irreducible in the sense of Subsection 8.2.

Exercise 15.2.7. Show that, for a Markov chain on a countable state space S, the definition of aperiodicity from this subsection agrees with the previous definition from Definition 8.3.4.
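(Informal computational aside, related to Exercise 15.2.5. The convergence asserted by Theorem 15.2.3 for the chain of Example 15.2.2 can be seen empirically: run many independent copies of the chain from a fixed starting point and compare the empirical law of Xₙ with N(0, 1). The Python/NumPy sketch below uses arbitrary choices of starting point, number of copies, number of steps, and seed.)

    import numpy as np

    rng = np.random.default_rng(5)

    # The chain of Example 15.2.2: X_{n+1} = X_n / 2 + N(0, 3/4).
    chains = np.full(100_000, 20.0)          # start far from stationarity
    for _ in range(30):
        chains = chains / 2 + rng.normal(0.0, np.sqrt(0.75), chains.size)

    # Compare with the stationary distribution N(0, 1):
    print(f"mean      ~ {chains.mean():.3f}   (target 0)")
    print(f"variance  ~ {chains.var():.3f}   (target 1)")
    print(f"P(X <= 1) ~ {(chains <= 1).mean():.3f}   (target about 0.841)")

After only a few dozen steps the influence of the starting point 20 is negligible and the empirical law is very close to N(0, 1), consistent with total variation convergence.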
Exercise 15.2.8. Consider a discrete-time Markov chain with state space X = R, and with transition probabilities such that P(x, ·) is uniform on the interval [x − 1, x + 1]. Determine whether or not this chain is φ-irreducible.

Exercise 15.2.9. Recall that counting measure is the measure ψ(·) defined by ψ(A) = |A|, i.e. ψ(A) is the number of elements in the set A, with ψ(A) = ∞ if the set A is infinite.
(a) For a Markov chain on a countable state space X, prove that irreducibility in the sense of Subsection 8.2 is equivalent to φ-irreducibility with respect to counting measure on X.
(b) Prove that the Markov chain of Exercise 15.2.5 is not φ-irreducible with respect to counting measure on R. (That is, prove that there is a set A, and x ∈ X, such that P_x(τ_A < ∞) = 0, even though ψ(A) > 0 where ψ is counting measure.)
(c) Prove that counting measure on R is not σ-finite (cf. Remark 4.4.3).

Exercise 15.2.10. Consider the Markov chain with X = R, and with P(x, ·) = N(x, 1) for each x ∈ X.
(a) Prove that this chain is φ-irreducible and aperiodic.
(b) Prove that this chain does not have a stationary distribution. Relate this to Theorem 15.2.3.

Exercise 15.2.11. Let X = {1, 2, ...}. Let P(1, {1}) = 1, and for x ≥ 2, P(x, {1}) = 1/x² and P(x, {x + 1}) = 1 − (1/x²).
(a) Prove that this chain has stationary distribution π(·) = δ₁(·).
(b) Prove that this chain is φ-irreducible and aperiodic.
(c) Prove that if X₀ = x ≥ 2, then P[Xₙ = x + n for all n] > 0, and ‖Pⁿ(x, ·) − π(·)‖ does not converge to 0. Relate this to Theorem 15.2.3.

Exercise 15.2.12. Show that the finite-dimensional distributions implied by (15.2.1) satisfy the two consistency conditions of the Kolmogorov Existence Theorem. What does this allow us to conclude?

15.3. Continuous-time Markov processes.

In general, a continuous-time stochastic process is a collection {X_t}_{t≥0} of random variables, defined jointly on some probability triple, taking values in some state space X with σ-algebra F, and indexed by the non-negative real numbers T = {t ≥ 0}. Usually we think of the variable t as representing (continuous) time, so that X_t is the (random) state at some time t ≥ 0.
Such a process is a (continuous-time, time-homogeneous) Markov process if there are transition probabilities P^t(x, ·) for all t ≥ 0 and all x ∈ X, and an initial distribution ν, such that

P(X₀ ∈ A₀, X_{t₁} ∈ A₁, ..., X_{tₙ} ∈ Aₙ)
= ∫_{x₀∈A₀} ν(dx₀) ∫_{x₁∈A₁} P^{t₁}(x₀, dx₁) ∫_{x₂∈A₂} P^{t₂−t₁}(x₁, dx₂) ... ∫_{xₙ∈Aₙ} P^{tₙ−tₙ₋₁}(xₙ₋₁, dxₙ)   (15.3.1)

for all times 0 < t₁ < ... < tₙ and all subsets A₀, A₁, ..., Aₙ ∈ F. Letting P⁰(x, ·) be a point-mass at x, it then follows that

P^{s+t}(x, A) = ∫_X P^s(x, dy) P^t(y, A),   s, t ≥ 0, x ∈ X, A ∈ F.   (15.3.2)

(This is the semigroup property of the transition probabilities: P^{s+t} = P^s P^t.) On a countable state space X, we sometimes write p^t_{ij} for P^t(i, {j}); (15.3.2) can then be written as p^{s+t}_{ij} = Σ_{k∈X} p^s_{ik} p^t_{kj}. As in (8.0.5), p⁰_{ij} = δ_{ij} (i.e., at time t = 0, p^t_{ij} equals 1 for i = j and 0 otherwise).

Exercise 15.3.3. Let {P^t(x, ·)} be a collection of Markov transition probabilities satisfying (15.3.2). Define finite-dimensional distributions μ_{t₁...t_k} for t₁, ..., t_k ≥ 0 by μ_{t₁...t_k}(A₁ × ... × A_k) = P(X_{t₁} ∈ A₁, ..., X_{t_k} ∈ A_k) as implied by (15.3.1).
(a) Prove that the Kolmogorov consistency conditions are satisfied by {μ_{t₁...t_k}}. [Hint: You will need to use (15.3.2).]
(b) Apply the Kolmogorov Existence Theorem to {μ_{t₁...t_k}}, and describe precisely the conclusion.

Another important concept for continuous-time Markov processes is the generator. If the state space X is countable (discrete), then the generator is a matrix Q = (q_ij), defined for i, j ∈ X by

q_ij = lim_{t↘0} (p^t_{ij} − δ_{ij}) / t.   (15.3.4)

Since 0 ≤ p^t_{ij} ≤ 1 for all t ≥ 0 and all i and j, we see immediately that q_ii ≤ 0, while q_ij ≥ 0 if i ≠ j. Also, since Σ_j (p^t_{ij} − δ_{ij}) = 1 − 1 = 0, we have Σ_j q_ij = 0 for each i. The matrix Q represents a sort of "derivative" of P^t with respect to t. Hence, p^s_{ij} ≈ δ_{ij} + s q_ij for small s > 0, i.e. P^s = I + sQ + o(s) as s ↘ 0.

Exercise 15.3.5. Let {P^t} = {p^t_{ij}} be the transition probabilities for a continuous-time Markov process on a finite state space X, with finite
generator Q = (q_ij) given by (15.3.4). Show that P^t = exp(tQ). (Given a matrix M, we define exp(M) by exp(M) = I + M + (1/2!)M² + (1/3!)M³ + ..., where I is the identity matrix.) [Hint: P^t = (P^{t/n})ⁿ, and as n → ∞, we have P^{t/n} = I + (t/n)Q + O(1/n²).]

Exercise 15.3.6. Let X = {1, 2}, and let Q = (q_ij) be the generator of a continuous-time Markov process on X, with

Q = ( −3   3 )
    (  6  −6 ).

Compute the corresponding transition probabilities P^t = (p^t_{ij}) of the process, for any t > 0. [Hint: Use the previous exercise. It may help to know that the eigenvalues of Q are 0 and −9, with corresponding left eigenvectors (2, 1) and (1, −1).]

The generator Q may be interpreted as giving "jump rates" for our Markov processes. If the chain starts at state i and next jumps to state j, then the time before it jumps is exponentially distributed with mean 1/q_ij. We say that the process jumps from i to j at rate q_ij. Exercise 15.3.5 shows that the generator Q contains all the information necessary to completely reconstruct the transition probabilities P^t of the Markov process.

A probability distribution π(·) on X is a stationary distribution for the process {X_t}_{t≥0} if ∫ π(dx) P^t(x, ·) = π(·) for all t ≥ 0; if X is countable we can write this as π_j = Σ_{i∈X} πᵢ p^t_{ij} for all j ∈ X and all t ≥ 0. Note that it is no longer sufficient to check this equation for just one particular value of t (e.g. t = 1). Similarly, we say that {X_t}_{t≥0} is reversible with respect to π(·) if π(dx) P^t(x, dy) = π(dy) P^t(y, dx) for all x, y ∈ X and all t ≥ 0.

Exercise 15.3.7. Let {X_t}_{t≥0} be a Markov process. Prove that
(a) if {X_t}_{t≥0} is reversible with respect to π(·), then π(·) is a stationary distribution for {X_t}_{t≥0}.
(b) if for some T > 0 we have ∫ π(dx) P^t(x, ·) = π(·) for all 0 < t < T, then π(·) is a stationary distribution for {X_t}_{t≥0}.
(c) if for some T > 0 we have π(dx) P^t(x, dy) = π(dy) P^t(y, dx) for all x, y ∈ X and all 0 < t < T, then {X_t}_{t≥0} is reversible with respect to π(·).

Exercise 15.3.8. For a Markov chain on a finite state space X with generator Q, prove that {πᵢ}_{i∈X} is a stationary distribution if and only if πQ = 0, i.e. if and only if Σ_{i∈X} πᵢ q_ij = 0 for all j ∈ X.

Exercise 15.3.9. Let {Zₙ} be i.i.d. ~ Exp(5), so P[Zₙ > z] = e^{−5z} for z ≥ 0. Let Tₙ = Z₁ + Z₂ + ... + Zₙ for n ≥ 1. Let {X_t}_{t≥0} be a continuous-time Markov process on the state space X = {1, 2}, defined as follows. The
Exercise 15.3.9. Let $\{Z_n\}$ be i.i.d. $\sim \mathrm{Exp}(5)$, so $\mathbf{P}[Z_n > z] = e^{-5z}$ for $z \ge 0$. Let $T_n = Z_1 + Z_2 + \ldots + Z_n$ for $n \ge 1$. Let $\{X_t\}_{t \ge 0}$ be a continuous-time Markov process on the state space $\mathcal{X} = \{1, 2\}$, defined as follows. The process does not move except at the times $T_n$, and at each time $T_n$, the process jumps from its current state to the opposite state (i.e., from 1 to 2, or from 2 to 1). Compute the generator $Q$ for this process. [Hint: You may use without proof the following facts, which are consequences of the "memoryless property" of the exponential distribution: for any $0 \le a < b$, (i) $\mathbf{P}(\exists n : a < T_n < b) = 1 - e^{-5(b-a)}$, which is $\approx 5(b-a)$ as $b \searrow a$; and also (ii) $\mathbf{P}(\exists n : a < T_n < T_{n+1} < b) = 1 - e^{-5(b-a)}(1 + 5(b-a))$, which is $\approx 25(b-a)^2/2$ as $b \searrow a$.]

Exercise 15.3.10. (Poisson process.) Let $\lambda > 0$, let $\{Z_n\}$ be i.i.d. $\sim \mathrm{Exp}(\lambda)$, and let $T_n = Z_1 + Z_2 + \ldots + Z_n$ for $n \ge 1$. Let $\{N_t\}_{t \ge 0}$ be a continuous-time Markov process on the state space $\mathcal{X} = \{0, 1, 2, \ldots\}$, with $N_0 = 0$, which does not move except at the times $T_n$, and increases by 1 at each time $T_n$. (Equivalently, $N_t = \#\{n \in \mathbf{N} : T_n \le t\}$; intuitively, $N_t$ counts the number of events by time $t$.)
(a) Compute the generator $Q$ for this process. [Don't forget the hint from the previous exercise.]
(b) Prove that $\mathbf{P}(N_t \le m) = e^{-\lambda t} (\lambda t)^m / m! + \mathbf{P}(N_t \le m - 1)$ for $m = 0, 1, 2, \ldots$. [Hint: First argue that $\mathbf{P}(N_t \le m) = \mathbf{P}(T_{m+1} > t)$ and that $T_m \sim \mathrm{Gamma}(m, \lambda)$, and then use integration by parts; you may assume the result and hints of Exercise 9.5.17.]
(c) Conclude that $\mathbf{P}(N_t = j) = e^{-\lambda t} (\lambda t)^j / j!$, i.e. that $N_t \sim \mathrm{Poisson}(\lambda t)$. [Remark: More generally, there are Poisson processes with variable rate functions $\lambda : [0, \infty) \to [0, \infty)$, for which $N_t \sim \mathrm{Poisson}(\int_0^t \lambda(s)\, ds)$.]

Exercise 15.3.11. Let $\{N_t\}_{t \ge 0}$ be a Poisson process with rate $\lambda > 0$.
(a) For $0 < s < t$, compute $\mathbf{P}(N_s = 1 \,|\, N_t = 1)$.
(b) Use this to specify precisely the conditional distribution of the first event time $T_1$, conditional on $N_t = 1$.
(c) More generally, for $r < t$ and $m \in \mathbf{N}$, specify the conditional distribution of $T_m$, conditional on $N_r = m - 1$ and $N_t = m$.

Exercise 15.3.12. Let $\{N_t\}_{t \ge 0}$ be a Poisson process with rate $\lambda > 0$, let $0 < s < t$, and let $U_1, U_2$ be i.i.d. $\sim \mathrm{Uniform}[0, t]$.
(a) Compute $\mathbf{P}(N_s = 0 \,|\, N_t = 2)$.
(b) Compute $\mathbf{P}(U_1 > s \text{ and } U_2 > s)$.
(c) Compare the answers to parts (a) and (b). What does this comparison seem to imply?
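Remark. The conclusion of Exercise 15.3.10(c) can be illustrated by simulation (a sketch of ours, not from the text): build $N_t$ from i.i.d. exponential gaps and compare the empirical law of $N_t$ with the $\mathrm{Poisson}(\lambda t)$ probabilities.

```python
# Simulate a Poisson process via Exp(lambda) gaps and compare the empirical
# distribution of N_t with the Poisson(lambda * t) probabilities.
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(0)
lam, t, reps = 2.0, 3.0, 100_000

# N_t = #{n : T_n <= t}; 60 gaps of mean 1/lam = 0.5 easily covers t = 3.
gaps = rng.exponential(1.0 / lam, size=(reps, 60))
T = np.cumsum(gaps, axis=1)
N_t = (T <= t).sum(axis=1)

for j in range(4, 9):
    empirical = np.mean(N_t == j)
    poisson = exp(-lam * t) * (lam * t) ** j / factorial(j)
    print(j, round(empirical, 4), round(poisson, 4))
```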
15.4. Brownian motion as a limit.

Having discussed continuous-time processes on countable state spaces, we now turn to continuous-time processes on continuous state spaces. Such processes can move in continuous curves and therefore give rise to interesting random functions. The most important such process is Brownian motion, also called the Wiener process.

Brownian motion is best understood as a limit of discrete-time processes, as follows. Let $Z_1, Z_2, \ldots$ be any sequence of i.i.d. bounded random variables having mean 0 and variance 1, e.g. $Z_i = \pm 1$ with probability 1/2 each. For each $n \in \mathbf{N}$, define a discrete-time random process $\{Y_{i/n}^{(n)}\}_{i=0}^{\infty}$ iteratively by $Y_0^{(n)} = 0$, and

$$Y_{(i+1)/n}^{(n)} = Y_{i/n}^{(n)} + \frac{1}{\sqrt{n}}\, Z_{i+1}, \qquad i = 0, 1, 2, \ldots. \qquad (15.4.1)$$

Thus, $Y_{i/n}^{(n)} = \frac{1}{\sqrt{n}}(Z_1 + Z_2 + \ldots + Z_i)$. In particular, $\{Y_{i/n}^{(n)}\}$ is like simple symmetric random walk, except with time sped up by a factor of $n$, and space shrunk by a factor of $\sqrt{n}$.

[Figure 15.4.2. Constructing Brownian motion.]

We can then "fill in" the missing values by linear interpolation on each interval $[\frac{i}{n}, \frac{i+1}{n}]$. In this way, we obtain a continuous-time process $\{Y_t^{(n)}\}_{t \ge 0}$; see Figure 15.4.2.
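Remark. The construction (15.4.1) is easy to carry out on a computer; the following sketch (ours, not from the text; it assumes the matplotlib plotting library) draws one path for each of several values of $n$, so one can see the paths roughen as $n$ grows:

```python
# Random-walk paths Y^(n) with +/-1 steps, linearly interpolated on [0, 1],
# as in equation (15.4.1) and Figure 15.4.2.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

def Y_path(n):
    """Times i/n and values Y_{i/n}^{(n)} = (Z_1 + ... + Z_i) / sqrt(n)."""
    Z = rng.choice([-1.0, 1.0], size=n)
    return np.arange(n + 1) / n, np.concatenate([[0.0], np.cumsum(Z)]) / np.sqrt(n)

for n in [10, 100, 10_000]:
    times, values = Y_path(n)
    plt.plot(times, values, label=f"n = {n}")   # plot() interpolates linearly
plt.legend(); plt.title("Constructing Brownian motion (cf. Figure 15.4.2)")
plt.show()
```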
Thus, $\{Y_t^{(n)}\}_{t \ge 0}$ is a continuous-time process which agrees with $Y_{i/n}^{(n)}$ whenever $t = i/n$. Furthermore, $Y_t^{(n)}$ is always within $O(1/\sqrt{n})$ of $Y_{\lfloor tn \rfloor / n}^{(n)}$, where $\lfloor r \rfloor$ is the greatest integer not exceeding $r$. Intuitively, then, $\{Y_t^{(n)}\}_{t \ge 0}$ is like an ordinary discrete-time random walk, except that it has been interpolated to a continuous-time process, and furthermore time has been sped up by a factor of $n$ and space has been shrunk by a factor of $\sqrt{n}$. In summary, this process takes lots and lots of very small steps.

Remark. It is not essential that we choose the missing values so as to make the function linear on $[\frac{i}{n}, \frac{i+1}{n}]$. The same intuitive limit will be obtained no matter how the missing values are defined, provided only that they are close (as $n \to \infty$) to the corresponding $Y_{i/n}^{(n)}$ values. For example, another option is to use a constant interpolation of the missing values, by setting $Y_t^{(n)} = Y_{\lfloor tn \rfloor / n}^{(n)}$. In fact, this constant interpolation would make the following calculations cleaner by eliminating the $O(1/\sqrt{n})$ errors. However, the linear interpolation has the advantage that each $Y_t^{(n)}$ is then a continuous function of $t$, which makes it more intuitive why $B_t$ is also continuous.

Now, the factors $n$ and $\sqrt{n}$ have been chosen carefully. We see that $Y_t^{(n)} \approx Y_{\lfloor tn \rfloor / n}^{(n)} = \frac{1}{\sqrt{n}}(Z_1 + Z_2 + \ldots + Z_{\lfloor tn \rfloor})$. Thus, $Y_t^{(n)}$ is essentially $\frac{1}{\sqrt{n}}$ times the sum of $\lfloor tn \rfloor$ different i.i.d. random variables, each having mean 0 and variance 1. It follows from the ordinary Central Limit Theorem that, as $n \to \infty$ with $t$ fixed, we have $\mathcal{L}(Y_t^{(n)}) \Rightarrow N(0, t)$, i.e. for large $n$ the random variable $Y_t^{(n)}$ is approximately normal with mean 0 and variance $t$.

Let us note a couple of other facts. Firstly, for $s < t$,
$$\mathbf{E}\big(Y_s^{(n)} Y_t^{(n)}\big) \approx \mathbf{E}\big(Y_{\lfloor sn \rfloor / n}^{(n)}\, Y_{\lfloor tn \rfloor / n}^{(n)}\big) = \frac{1}{n}\, \mathbf{E}\big((Z_1 + \ldots + Z_{\lfloor sn \rfloor})(Z_1 + \ldots + Z_{\lfloor tn \rfloor})\big).$$
Since the $Z_i$ have mean 0 and variance 1, and are independent, this is equal to $\frac{\lfloor sn \rfloor}{n}$, which converges to $s$ as $n \to \infty$. Secondly, if $n \to \infty$ with $s < t$ fixed, the difference $Y_t^{(n)} - Y_s^{(n)}$ is within $O(1/\sqrt{n})$ of $\frac{1}{\sqrt{n}}(Z_{\lfloor sn \rfloor + 1} + \ldots + Z_{\lfloor tn \rfloor})$. Hence, by the Central Limit Theorem, $Y_t^{(n)} - Y_s^{(n)}$ converges weakly to $N(0, t - s)$ as $n \to \infty$, and is in fact independent of $Y_s^{(n)}$.

Brownian motion is, intuitively, the limit as $n \to \infty$ of the processes $\{Y_t^{(n)}\}_{t \ge 0}$. That is, we can intuitively define Brownian motion $\{B_t\}_{t \ge 0}$ by saying that $B_t = \lim_{n \to \infty} Y_t^{(n)}$ for each fixed $t \ge 0$. This analogy is very useful, in that all the $n \to \infty$ properties of $Y_t^{(n)}$ discussed above will apply to Brownian motion. In particular:
• normally distributed: $\mathcal{L}(B_t) = N(0, t)$ for any $t > 0$.
• covariance structure: $\mathbf{E}(B_s B_t) = s$ for $0 \le s \le t$.
• independent normal increments: $\mathcal{L}(B_{t_2} - B_{t_1}) = N(0, t_2 - t_1)$, and $B_{t_4} - B_{t_3}$ is independent of $B_{t_2} - B_{t_1}$ whenever $0 \le t_1 < t_2 \le t_3 < t_4$.

In addition, Brownian motion has one other very important property. It has continuous sample paths, meaning that for each fixed $\omega \in \Omega$, the function $B_t(\omega)$ is a continuous function of $t$. (However, it turns out that, with probability one, Brownian motion is not differentiable anywhere at all!)

Exercise 15.4.3. Let $\{B_t\}_{t \ge 0}$ be Brownian motion. Compute $\mathbf{E}[(B_2 + B_3 + 1)^2]$.

Exercise 15.4.4. Let $\{B_t\}_{t \ge 0}$ be Brownian motion, and let $X_t = 2t + 3B_t$ for $t \ge 0$.
(a) Compute the distribution of $X_t$ for $t \ge 0$.
(b) Compute $\mathbf{E}[(X_t)^2]$ for $t \ge 0$.
(c) Compute $\mathbf{E}(X_s X_t)$ for $0 \le s \le t$.

Exercise 15.4.5. Let $\{B_t\}_{t \ge 0}$ be Brownian motion. Compute the distribution of $Z_n$ for $n \in \mathbf{N}$, where $Z_n = \frac{1}{n}(B_1 + B_2 + \ldots + B_n)$.

Exercise 15.4.6. (a) Let $f : \mathbf{R} \to \mathbf{R}$ be a Lipschitz function, i.e. a function for which there exists $\alpha \in \mathbf{R}$ such that $|f(x) - f(y)| \le \alpha |x - y|$ for all $x, y \in \mathbf{R}$. Compute $\lim_{h \searrow 0} (f(t + h) - f(t))^2 / h$ for any $t \in \mathbf{R}$.
(b) Let $\{B_t\}$ be Brownian motion. Compute $\lim_{h \searrow 0} \mathbf{E}\big((B_{t+h} - B_t)^2 / h\big)$ for any $t \ge 0$.
(c) What do parts (a) and (b) seem to imply about Brownian motion?
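Remark. The covariance structure above is easy to check by Monte Carlo (a sketch of ours, not from the text), using the independent-increments property to generate $(B_s, B_t)$:

```python
# Monte Carlo check that E(B_s B_t) = min(s, t) = s for s <= t.
import numpy as np

rng = np.random.default_rng(2)
s, t, reps = 0.7, 1.5, 200_000

# Independent normal increments: B_s ~ N(0, s), B_t - B_s ~ N(0, t - s).
B_s = rng.normal(0.0, np.sqrt(s), reps)
B_t = B_s + rng.normal(0.0, np.sqrt(t - s), reps)

print(np.mean(B_s * B_t))   # approximately 0.7 = min(s, t)
print(np.var(B_t))          # approximately 1.5 = t
```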
15.5. Existence of Brownian motion.

We thus see that Brownian motion has many useful properties. However, we cannot make mathematical use of these properties until we have established that Brownian motion exists, i.e. that there is a probability triple $(\Omega, \mathcal{F}, \mathbf{P})$, with random variables $\{B_t\}_{t \ge 0}$ defined on it, which satisfy the properties specified above.

Now, the Kolmogorov Existence Theorem (Theorem 15.1.3) is of some help here. It ensures the existence of a stochastic process having the same finite-dimensional distributions as our desired process $\{B_t\}$. At first glance this might appear to be all we need. Indeed, the properties of being normally distributed, of having the right covariance structure, and of having independent increments, all follow immediately from the finite-dimensional distributions.

The problem, however, is that finite-dimensional distributions cannot guarantee the property of continuous sample paths. To see this, let $\{B_t\}_{t \ge 0}$ be Brownian motion, and let $U$ be a random variable which is distributed according to (say) Lebesgue measure on $[0, 1]$. Define a new stochastic process $\{B_t'\}_{t \ge 0}$ by $B_t' = B_t + \mathbf{1}_{\{U = t\}}$. That is, $B_t'$ is equal to $B_t$, except that if we happen to have $U = t$ then we add one to $B_t$. Now, since $\mathbf{P}(U = t) = 0$ for any fixed $t$, we see that $\{B_t'\}_{t \ge 0}$ has exactly the same finite-dimensional distributions as $\{B_t\}_{t \ge 0}$ does. On the other hand, obviously if $\{B_t\}_{t \ge 0}$ has continuous sample paths, then $\{B_t'\}_{t \ge 0}$ cannot. This shows that the Kolmogorov Existence Theorem is not sufficient to properly construct Brownian motion. Instead, one of several alternative approaches must be taken.

In the first approach, we begin by setting $B_0 = 0$ and choosing $B_1 \sim N(0, 1)$. We then define $B_{1/2}$ to have its correct conditional distribution, conditional on the values of $B_0$ and $B_1$. Continuing, we then define $B_{1/4}$ to have its correct conditional distribution, conditional on the values of $B_0$ and $B_{1/2}$; and similarly define $B_{3/4}$ conditional on the values of $B_{1/2}$ and $B_1$. Iteratively, we see that we can define $B_t$ for all values of $t$ which are dyadic rationals, i.e. which are of the form $\frac{i}{2^n}$ for some integers $i$ and $n$. We then argue that the function $\{B_t;\ t \text{ a non-negative dyadic rational}\}$ is uniformly continuous, and use this to argue that it can be "filled in" to a continuous function $\{B_t;\ t \ge 0\}$ defined for all non-negative real values $t$. For the details of this approach, see e.g. Billingsley (1995, Theorem 37.1).

In another approach, we define the processes $\{Y_t^{(n)}\}$ as above, and observe that $Y_t^{(n)}$ is a (random) continuous function of $t$. We then prove that as $n \to \infty$, the distributions of the functions $\{Y_t^{(n)}\}_{t \ge 0}$ (regarded as probability distributions on the space of all continuous functions) are tight, and in fact converge to the distribution of $\{B_t\}_{t \ge 0}$ that we want. We then use this fact to conclude the existence of the random variables $\{B_t\}_{t \ge 0}$. For details of this approach, see e.g. Fristedt and Gray (1997, Section 19.2).

Exercise 15.5.1. Construct a process $\{B_t''\}_{t \ge 0}$ which has the same finite-dimensional distributions as do $\{B_t\}_{t \ge 0}$ and $\{B_t'\}_{t \ge 0}$, but such that neither $\{B_t''\}_{t \ge 0}$ nor $\{B_t'' - B_t'\}_{t \ge 0}$ has continuous sample paths.

Exercise 15.5.2. Let $\{X_t\}_{t \in T}$ and $\{X_t'\}_{t \in T}$ be stochastic processes with the same countable time index $T$. Suppose $\{X_t\}_{t \in T}$ and $\{X_t'\}_{t \in T}$ have identical finite-dimensional distributions. Prove or disprove that $\{X_t\}_{t \in T}$ and $\{X_t'\}_{t \in T}$ must have the same full joint distribution.
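Remark. The first approach above can be sketched in code. It uses the standard fact (assumed here, not derived in the text) that conditional on $B_s$ and $B_t$, the midpoint value $B_{(s+t)/2}$ is normal with mean $\frac{1}{2}(B_s + B_t)$ and variance $(t-s)/4$. The following sketch of ours fills in the dyadic values on $[0, 1]$ level by level:

```python
# Dyadic (midpoint-type) construction sketch on [0, 1]: fill in B_t at the
# dyadic rationals i / 2^n, level by level.  Assumed fact: given B_s and B_t,
# the midpoint B_{(s+t)/2} is distributed N((B_s + B_t)/2, (t - s)/4).
import numpy as np

rng = np.random.default_rng(3)
levels = 10                                  # defines B at t = i / 2^levels
B = {0.0: 0.0, 1.0: rng.normal(0.0, 1.0)}    # B_0 = 0, B_1 ~ N(0, 1)

for n in range(1, levels + 1):
    step = 1.0 / 2 ** n
    for i in range(1, 2 ** n, 2):            # odd i give the new points
        s, t = (i - 1) * step, (i + 1) * step
        B[i * step] = rng.normal(0.5 * (B[s] + B[t]), np.sqrt((t - s) / 4.0))

times = sorted(B)
print(times[:5], [round(B[u], 3) for u in times[:5]])
```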
15.6. Diffusions and stochastic integrals.

Suppose we are given continuous functions $\mu, \sigma : \mathbf{R} \to \mathbf{R}$, and we replace equation (15.4.1) by

$$Y_{(i+1)/n}^{(n)} = Y_{i/n}^{(n)} + \frac{1}{\sqrt{n}}\, Z_{i+1}\, \sigma\big(Y_{i/n}^{(n)}\big) + \frac{1}{n}\, \mu\big(Y_{i/n}^{(n)}\big). \qquad (15.6.1)$$

That is, we multiply the effect of $Z_{i+1}$ by $\sigma(Y_{i/n}^{(n)})$, and we add in an extra drift of $\frac{1}{n}\, \mu(Y_{i/n}^{(n)})$. If we then interpolate this function to a function $\{Y_t^{(n)}\}$ defined for all $t \ge 0$, and let $n \to \infty$, we obtain a diffusion, with drift $\mu(x)$ and diffusion coefficient or volatility $\sigma(x)$. Intuitively, the diffusion is like Brownian motion, except that its mean is drifting according to the function $\mu(x)$, and its local variability is scaled according to the function $\sigma(x)$. In this context, Brownian motion is the special case $\mu(x) = 0$ and $\sigma(x) = 1$.

Now, for large $n$ we see from (15.4.1) that $\frac{1}{\sqrt{n}} Z_{i+1} \approx B_{(i+1)/n} - B_{i/n}$, so that (15.6.1) is essentially saying that

$$Y_{(i+1)/n}^{(n)} - Y_{i/n}^{(n)} \approx \big(B_{(i+1)/n} - B_{i/n}\big)\, \sigma\big(Y_{i/n}^{(n)}\big) + \frac{1}{n}\, \mu\big(Y_{i/n}^{(n)}\big).$$

If we now (intuitively) set $X_t = \lim_{n \to \infty} Y_t^{(n)}$ for each fixed $t \ge 0$, then we can write the limiting version of (15.6.1) as

$$dX_t = \sigma(X_t)\, dB_t + \mu(X_t)\, dt, \qquad (15.6.2)$$

where $\{B_t\}_{t \ge 0}$ is Brownian motion. The process $\{X_t\}_{t \ge 0}$ is called a diffusion with drift $\mu(x)$ and diffusion coefficient or volatility $\sigma(x)$. Its definition (15.6.2) can be interpreted roughly as saying that, as $h \searrow 0$, we have

$$X_{t+h} \approx X_t + \sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h; \qquad (15.6.3)$$

thus, given that $X_t = x$, we have that approximately $X_{t+h} \sim N(x + \mu(x) h,\ \sigma^2(x) h)$. If $\mu(x) = 0$ and $\sigma(x) = 1$, then the diffusion coincides exactly with Brownian motion.

We can also compute a generator for a diffusion defined by (15.6.2). To do this we must generalise the matrix generator (15.3.4) for processes on finite state spaces. We define the generator to be an operator $Q$ acting on smooth functions $f : \mathbf{R} \to \mathbf{R}$ by

$$(Qf)(x) = \lim_{h \searrow 0} \frac{1}{h}\, \mathbf{E}_x\big(f(X_{t+h}) - f(X_t)\big), \qquad (15.6.4)$$

where $\mathbf{E}_x$ means expectation conditional on $X_t = x$. That is, $Q$ maps one smooth function $f$ to another function, $Qf$.
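Remark. The approximation (15.6.3) translates directly into a simulation scheme, commonly called the Euler-Maruyama scheme (this name and the example drift below are ours, not the text's): step forward with $X_{t+h} \approx X_t + \sigma(X_t)(B_{t+h} - B_t) + \mu(X_t) h$, drawing each Brownian increment as $N(0, h)$.

```python
# Euler-Maruyama sketch for dX = sigma(X) dB + mu(X) dt, via (15.6.3).
import numpy as np

rng = np.random.default_rng(4)

def simulate_diffusion(mu, sigma, x0, T=1.0, n=10_000):
    """Approximate one path of the diffusion (15.6.2) on [0, T]."""
    h = T / n
    X = np.empty(n + 1)
    X[0] = x0
    dB = rng.normal(0.0, np.sqrt(h), n)      # Brownian increments ~ N(0, h)
    for i in range(n):
        X[i + 1] = X[i] + sigma(X[i]) * dB[i] + mu(X[i]) * h
    return X

# Illustrative choice: mu(x) = -x/2, sigma(x) = 1 (cf. Exercise 15.6.12 below).
path = simulate_diffusion(lambda x: -x / 2.0, lambda x: 1.0, x0=0.0)
print(path[-1], path.var())
```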
Now, for a diffusion satisfying (15.6.3), we can use a Taylor series expansion to approximate $f(X_{t+h})$ as

$$f(X_{t+h}) \approx f(X_t) + (X_{t+h} - X_t)\, f'(X_t) + \tfrac{1}{2}(X_{t+h} - X_t)^2\, f''(X_t). \qquad (15.6.5)$$

On the other hand, $X_{t+h} - X_t \approx \sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h$. Conditional on $X_t = x$, the conditional distribution of $B_{t+h} - B_t$ is normal with mean 0 and variance $h$; hence, the conditional distribution of $X_{t+h} - X_t$ is approximately normal with mean $\mu(x)\, h$ and variance $\sigma^2(x)\, h$. So, taking expected values conditional on $X_t = x$, we conclude from (15.6.5) that

$$\mathbf{E}_x[f(X_{t+h}) - f(X_t)] \approx \mu(x)\, h\, f'(x) + \tfrac{1}{2}\big[(\mu(x)\, h)^2 + \sigma^2(x)\, h\big]\, f''(x) = \mu(x)\, h\, f'(x) + \tfrac{1}{2}\, \sigma^2(x)\, h\, f''(x) + O(h^2).$$

It then follows from the definition (15.6.4) of generator that

$$(Qf)(x) = \mu(x)\, f'(x) + \tfrac{1}{2}\, \sigma^2(x)\, f''(x). \qquad (15.6.6)$$

We have thus given an explicit formula for the generator $Q$ of a diffusion defined by (15.6.2). Indeed, as with Markov processes on discrete spaces, diffusions are often characterised in terms of their generators, which again provide all the information necessary to completely reconstruct the process's transition probabilities.

Finally, we note that Brownian motion can be used to define a certain unusual kind of integral, called a stochastic integral or Itô integral. Let $\{B_t\}_{t \ge 0}$ be Brownian motion, and let $\{C_t\}_{t \ge 0}$ be a non-anticipative real-valued random process (i.e., the value of $C_t$ is conditionally independent of $\{B_s\}_{s > t}$, given $\{B_s\}_{s \le t}$). Then it is possible to define an integral of the form $\int_a^b C_t\, dB_t$, i.e. an integral "with respect to Brownian motion". Roughly speaking, this integral is defined by

$$\int_a^b C_t\, dB_t = \lim \sum_{i=0}^{m-1} C_{t_i}\big(B_{t_{i+1}} - B_{t_i}\big),$$

where $a = t_0 < t_1 < \ldots < t_m = b$, and where the limit is over finer and finer partitions $\{(t_{i-1}, t_i]\}$ of $(a, b]$. Note that this integral is random both because of the randomness of the values of the integrand $C_t$, and also
because of the randomness from Brownian motion $B_t$ itself. In particular, we now see that (15.6.2) can be written equivalently as

$$X_b - X_a = \int_a^b \sigma(X_t)\, dB_t + \int_a^b \mu(X_t)\, dt,$$

where the first integral is with respect to Brownian motion. Such stochastic integrals are used in a variety of research areas; for further details see e.g. Bhattacharya and Waymire (1990, Chapter VII), or the other references in Subsection B.5.

Exercise 15.6.7. Let $\{X_t\}_{t \ge 0}$ be a diffusion with constant drift $\mu(x) = a$ and constant volatility $\sigma(x) = b > 0$.
(a) Show that $X_t = at + b B_t$.
(b) Show that $\mathcal{L}(X_t) = N(at, b^2 t)$.
(c) Compute $\mathbf{E}(X_s X_t)$ for $0 \le s \le t$.

Exercise 15.6.8. Let $\{B_t\}_{t \ge 0}$ be standard Brownian motion, with $B_0 = 0$. Let $X_t = \int_0^t a\, ds + \int_0^t b\, dB_s = at + b B_t$ be a diffusion with constant drift $\mu(x) = a > 0$ and constant volatility $\sigma(x) = b > 0$. Let $Z_t = \exp\big[-(a + \tfrac{1}{2} b^2)\, t + X_t\big]$.
(a) Prove that $\{Z_t\}_{t \ge 0}$ is a martingale, i.e. that $\mathbf{E}[Z_t \,|\, Z_u\ (0 \le u \le s)] = Z_s$ for $0 \le s \le t$.
(b) Let $A, B > 0$, and let $T_A = \inf\{t \ge 0 : X_t = A\}$ and $T_{-B} = \inf\{t \ge 0 : X_t = -B\}$ denote the first hitting times of $A$ and $-B$, respectively. Compute $\mathbf{P}(T_A < T_{-B})$. [Hint: Use part (a); you may assume that Corollary 14.1.7 also applies in continuous time.]

Exercise 15.6.9. Let $g : \mathbf{R} \to \mathbf{R}$ be a smooth function with $g(x) > 0$ and $\int g(x)\, dx = 1$. Consider the diffusion $\{X_t\}$ defined by (15.6.2) with $\sigma(x) = 1$ and $\mu(x) = \frac{1}{2}\, g'(x) / g(x)$. Show that the probability distribution having density $g$ (with respect to Lebesgue measure on $\mathbf{R}$) is a stationary distribution for the diffusion. [Hint: Show that the diffusion is reversible with respect to $g(x)\, dx$. You may use (15.6.3) and Exercise 15.3.7.] This diffusion is called a Langevin diffusion.

Exercise 15.6.10. Let $\{p_{ij}^{(t)}\}$ be the transition probabilities for a Markov chain on a finite state space $\mathcal{X}$. Define the matrix $Q = (q_{ij})$ by (15.3.4). Let $f : \mathcal{X} \to \mathbf{R}$ be any function, and let $i \in \mathcal{X}$. Prove that $(Qf)_i$ (i.e., $\sum_{k \in \mathcal{X}} q_{ik} f_k$) corresponds to $(Qf)(i)$ from (15.6.4).

Exercise 15.6.11. Suppose a diffusion $\{X_t\}_{t \ge 0}$ satisfies
$$(Qf)(x) = 17\, f'(x) + 23\, x^2\, f''(x)$$
for all smooth $f : \mathbf{R} \to \mathbf{R}$. Compute the drift and volatility functions for this diffusion.

Exercise 15.6.12. Suppose a diffusion $\{X_t\}_{t \ge 0}$ has generator given by $(Qf)(x) = -\frac{x}{2}\, f'(x) + \frac{1}{2}\, f''(x)$. (Such a diffusion is called an Ornstein-Uhlenbeck process.)
(a) Write down a formula for $dX_t$. [Hint: Use (15.6.6).]
(b) Show that $\{X_t\}_{t \ge 0}$ is reversible with respect to the standard normal distribution, $N(0, 1)$. [Hint: Use Exercise 15.6.9.]

15.7. Itô's Lemma.

Let $\{X_t\}_{t \ge 0}$ be a diffusion satisfying (15.6.2), and let $f : \mathbf{R} \to \mathbf{R}$ be a smooth function. Then we might expect that $\{f(X_t)\}_{t \ge 0}$ would itself be a diffusion, and we might wonder what drift and volatility would correspond to $\{f(X_t)\}_{t \ge 0}$. Specifically, we would like to compute an expression for $d(f(X_t))$.

To that end, we consider small $h \approx 0$, and use the Taylor approximation

$$f(X_{t+h}) \approx f(X_t) + f'(X_t)(X_{t+h} - X_t) + \tfrac{1}{2}\, f''(X_t)(X_{t+h} - X_t)^2 + o(h).$$

Now, in the classical world, an expression like $(X_{t+h} - X_t)^2$ would be $O(h^2)$ and hence could be neglected in this computation. However, we shall see that for Itô integrals this is not the case.

We continue by writing

$$X_{t+h} - X_t = \int_t^{t+h} \sigma(X_s)\, dB_s + \int_t^{t+h} \mu(X_s)\, ds \approx \sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h.$$

Hence,

$$f(X_{t+h}) \approx f(X_t) + f'(X_t)\big[\sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h\big] + \tfrac{1}{2}\, f''(X_t)\big[\sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h\big]^2 + o(h).$$

In the expansion of $\big(\sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h\big)^2$, all terms are $o(h)$ aside from the first, so that

$$f(X_{t+h}) \approx f(X_t) + f'(X_t)\big[\sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h\big] + \tfrac{1}{2}\, f''(X_t)\big[\sigma(X_t)(B_{t+h} - B_t)\big]^2 + o(h).$$
But $B_{t+h} - B_t \sim N(0, h)$, so that $\mathbf{E}[(B_{t+h} - B_t)^2] = h$, and in fact $(B_{t+h} - B_t)^2 = h + o(h)$. (This is the departure from classical integration theory; there, we would have $(B_{t+h} - B_t)^2 = o(h)$.) Hence, as $h \to 0$,

$$f(X_{t+h}) \approx f(X_t) + f'(X_t)\big[\sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h\big] + \tfrac{1}{2}\, f''(X_t)\, \sigma^2(X_t)\, h + o(h).$$

We can write this in differential form as

$$d(f(X_t)) = f'(X_t)\big[\sigma(X_t)\, dB_t + \mu(X_t)\, dt\big] + \tfrac{1}{2}\, f''(X_t)\, \sigma^2(X_t)\, dt,$$

or

$$d(f(X_t)) = f'(X_t)\, \sigma(X_t)\, dB_t + \Big(f'(X_t)\, \mu(X_t) + \tfrac{1}{2}\, f''(X_t)\, \sigma^2(X_t)\Big)\, dt. \qquad (15.7.1)$$

That is, $\{f(X_t)\}$ is itself a diffusion, with new drift $f'(x)\, \mu(x) + \tfrac{1}{2}\, f''(x)\, \sigma^2(x)$ and new volatility $f'(x)\, \sigma(x)$.

Equation (15.7.1) is Itô's Lemma, or Itô's formula. It is a generalisation of the classical chain rule, and is very important for doing computations with Itô integrals. By (15.6.2), it implies that $d(f(X_t))$ is equal to $f'(X_t)\, dX_t + \tfrac{1}{2}\, f''(X_t)\, \sigma^2(X_t)\, dt$. The extra term $\tfrac{1}{2}\, f''(X_t)\, \sigma^2(X_t)\, dt$ arises because of the unusual nature of Brownian motion, namely that $(B_{t+h} - B_t)^2 \approx h$ instead of being $O(h^2)$. (In particular, this again indicates why Brownian motion is not differentiable; compare Exercise 15.4.6.)

Exercise 15.7.2. Consider the Ornstein-Uhlenbeck process $\{X_t\}$ of Exercise 15.6.12, with generator $(Qf)(x) = -\frac{x}{2}\, f'(x) + \frac{1}{2}\, f''(x)$.
(a) Let $Y_t = (X_t)^2$ for each $t \ge 0$. Compute $dY_t$.
(b) Let $Z_t = (X_t)^3$ for each $t \ge 0$. Compute $dZ_t$.

Exercise 15.7.3. Consider the diffusion $\{X_t\}$ of Exercise 15.6.11, with generator $(Qf)(x) = 17\, f'(x) + 23\, x^2\, f''(x)$. Let $W_t = (X_t)^5$ for each $t \ge 0$. Compute $dW_t$.
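Remark. The key fact $(B_{t+h} - B_t)^2 = h + o(h)$ can be seen numerically: summing squared Brownian increments over $[0, 1]$ gives approximately 1, however fine the partition, whereas the same sum for a smooth function vanishes. A small sketch of ours (not from the text):

```python
# Quadratic variation: sum of squared Brownian increments over [0, 1]
# concentrates near 1 as the partition is refined; for a Lipschitz function
# the analogous sum is of order 1/m and so tends to 0 (cf. Exercise 15.4.6).
import numpy as np

rng = np.random.default_rng(5)
for m in [100, 10_000, 1_000_000]:
    h = 1.0 / m
    dB = rng.normal(0.0, np.sqrt(h), m)    # increments B_{(i+1)h} - B_{ih}
    print(m, np.sum(dB ** 2))               # ~ 1 in each case

grid = np.linspace(0.0, 1.0, 1_000_001)
print(np.sum(np.diff(np.sin(grid)) ** 2))  # ~ 0 for the smooth function sin
```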
15.8. The Black-Scholes equation.

In mathematical finance, the prices of stocks are often modeled as diffusions. Various calculations, such as the value of complicated stock options or derivatives, then proceed on that basis. In this final subsection, we give a very brief introduction to financial models, and indicate the derivation of the famous Black-Scholes equation for pricing a certain kind of stock option. Our treatment is very cursory; for more complete discussion see books on mathematical finance, such as those listed in Subsection B.6.

For simplicity, we suppose there is a constant "risk-free" interest rate $r$. This means that an amount of money $M$ at present is worth $e^{rt} M$ at a time $t$ later. Equivalently, an amount of money $M$ a time $t$ later is worth only $e^{-rt} M$ at present.

We further suppose that there is a "stock" available for purchase. This stock has a purchase price $P_t$ at time $t$. It is assumed that this stock price can be modeled as a diffusion process, according to the equation

$$dP_t = b\, P_t\, dt + \sigma\, P_t\, dB_t, \qquad (15.8.1)$$

i.e. with $\mu(x) = bx$ and $\sigma(x) = \sigma x$. Here $B_t$ is Brownian motion, $b$ is the appreciation rate, and $\sigma$ is the volatility of the stock. For present purposes we assume that $b$ and $\sigma$ are constant, though more generally they (and also $r$) could be functions of $t$.

We wish to find a value, or fair price, for a "European call option" (an example of a financial derivative). A European call option is the option to purchase the stock at a fixed time $T > 0$ for a fixed price $q > 0$. Obviously, one would exercise this option if and only if $P_T > q$, and in this case one would gain an amount $P_T - q$. Therefore, the value at time $T$ of the European call option will be precisely $\max(0, P_T - q)$; it follows that the value at time 0 is $e^{-rT} \max(0, P_T - q)$.

The problem is that, at time 0, the future price $P_T$ is unknown (i.e. random). Hence, at time 0, the true value of the option is also unknown. What we wish to compute is a fair price for this option at time 0, given only the current price $P_0$ of the stock.

To specify what a fair price means is somewhat subtle. This price is taken to be the (unique) value such that neither the buyer nor the seller could strictly improve their payoff by investing in the stock directly (as opposed to investing in the option), no matter how sophisticated an investment strategy they use (including selling short, i.e. holding a negative number of shares) and no matter how often they buy and sell the stock (without being charged any transaction commissions). It turns out that for our model, the fair price of the option is equal to the expected value of the option given $P_0$, but only after replacing $b$ by $r$, i.e. after replacing the appreciation rate by the risk-free interest rate. We do not justify this fact here; for details see e.g. Theorem 1.2.1 of Karatzas (1997). The argument involves switching to a different probability measure (the risk-neutral equivalent martingale measure), under which $\{B_t\}$ is still a (different) Brownian motion, but where (15.8.1) is replaced by $dP_t = r\, P_t\, dt + \sigma\, P_t\, dB_t$.
We are thus interested in computing

$$\mathbf{E}\big(e^{-rT} \max(0,\, P_T - q) \,\big|\, P_0\big), \qquad (15.8.2)$$

under the condition that

$$b \text{ is replaced by } r. \qquad (15.8.3)$$

To proceed, we use Itô's Lemma to compute $d(\log P_t)$. This means that we let $f(x) = \log x$ in (15.7.1), so that $f'(x) = 1/x$ and $f''(x) = -1/x^2$. We apply this to the diffusion (15.8.1), under the condition (15.8.3), so that $\mu(x) = rx$ and $\sigma(x) = \sigma x$. We compute that

$$d(\log P_t) = \frac{1}{P_t}\, \sigma P_t\, dB_t + \Big(r P_t\, \frac{1}{P_t} + \frac{1}{2}\Big(-\frac{1}{P_t^2}\Big)(\sigma P_t)^2\Big)\, dt = \sigma\, dB_t + \big(r - \tfrac{1}{2}\sigma^2\big)\, dt.$$

Since these coefficients are all constants, it now follows that

$$\log(P_T / P_0) = \log P_T - \log P_0 = \sigma(B_T - B_0) + \big(r - \tfrac{1}{2}\sigma^2\big)(T - 0),$$

whence

$$P_T = P_0 \exp\big(\sigma(B_T - B_0) + (r - \tfrac{1}{2}\sigma^2)\, T\big).$$

Recall now that $B_T - B_0$ has the normal distribution $N(0, T)$, with density function $\frac{1}{\sqrt{2\pi T}}\, e^{-x^2 / 2T}$. Hence, by Proposition 6.2.3, we can compute the value (15.8.2) of the European call option as

$$\int_{-\infty}^{\infty} e^{-rT} \max\Big(0,\ P_0 \exp\big(\sigma x + (r - \tfrac{1}{2}\sigma^2)\, T\big) - q\Big)\, \frac{1}{\sqrt{2\pi T}}\, e^{-x^2 / 2T}\, dx. \qquad (15.8.4)$$

After some re-arranging and substituting, we simplify this integral to

$$P_0\, \Phi\Big(\tfrac{1}{\sigma\sqrt{T}}\big(\log(P_0/q) + T(r + \tfrac{1}{2}\sigma^2)\big)\Big) \; - \; q\, e^{-rT}\, \Phi\Big(\tfrac{1}{\sigma\sqrt{T}}\big(\log(P_0/q) + T(r - \tfrac{1}{2}\sigma^2)\big)\Big), \qquad (15.8.5)$$

where $\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-s^2/2}\, ds$ is the usual cumulative distribution function of the standard normal. Equation (15.8.5) thus gives the time-0 value, or fair price, of a European call option to buy at time $T > 0$ for price $q > 0$ a stock having initial price $P_0$ and price process $\{P_t\}_{t \ge 0}$ governed by (15.8.1).
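Remark. The derivation can be checked numerically (a sketch of ours, not from the text): evaluate (15.8.5) directly, and compare with a Monte Carlo estimate of (15.8.2)-(15.8.3) using $P_T = P_0 \exp(\sigma(B_T - B_0) + (r - \tfrac{1}{2}\sigma^2)T)$. The parameter values below are arbitrary illustrative choices.

```python
# Black-Scholes price (15.8.5) versus a Monte Carlo estimate of (15.8.2).
import numpy as np
from math import log, sqrt, exp
from statistics import NormalDist

Phi = NormalDist().cdf   # standard normal cdf

def black_scholes_call(P0, q, r, sigma, T):
    """European call price, equation (15.8.5)."""
    d1 = (log(P0 / q) + T * (r + 0.5 * sigma ** 2)) / (sigma * sqrt(T))
    d2 = (log(P0 / q) + T * (r - 0.5 * sigma ** 2)) / (sigma * sqrt(T))
    return P0 * Phi(d1) - q * exp(-r * T) * Phi(d2)

P0, q, r, sigma, T = 100.0, 105.0, 0.03, 0.2, 1.0
print(black_scholes_call(P0, q, r, sigma, T))

rng = np.random.default_rng(6)
x = rng.normal(0.0, sqrt(T), 1_000_000)                  # B_T - B_0 ~ N(0, T)
PT = P0 * np.exp(sigma * x + (r - 0.5 * sigma ** 2) * T)
print(np.mean(np.exp(-r * T) * np.maximum(PT - q, 0.0)))  # should agree
```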
Equation (15.8.5) is the well-known Black-Scholes equation for computing the price of a European call option. Recall that here $r$ is the risk-free interest rate, and $\sigma$ is the volatility. Furthermore, the appreciation rate $b$ does not appear in the final formula, which is advantageous because $b$ is difficult to estimate in practice. For further details of this subject, including generalisations to more complicated financial models, see any introductory book on the subject of mathematical finance, such as those listed in Subsection B.6.

Exercise 15.8.6. Show that (15.8.4) is indeed equal to (15.8.5).

Exercise 15.8.7. Consider the price formula (15.8.5), with $r$, $\sigma$, $T$, and $P_0$ fixed positive quantities.
(a) What happens to the price (15.8.5) as $q \searrow 0$? Does this result make intuitive sense?
(b) What happens to the price (15.8.5) as $q \to \infty$? Does this result make intuitive sense?

Exercise 15.8.8. Consider the price formula (15.8.5), with $r$, $\sigma$, $P_0$, and $q$ fixed positive quantities.
(a) What happens to the price (15.8.5) as $T \searrow 0$? [Hint: Consider separately the cases $q > P_0$, $q = P_0$, and $q < P_0$.] Does this result make intuitive sense?
(b) What happens to the price (15.8.5) as $T \to \infty$? Does this result make intuitive sense?

15.9. Section summary.

This section considered various aspects of the theory of stochastic processes, in a somewhat informal manner. All of the topics considered can be thought of as generalisations of the discrete-time, discrete-space processes of Sections 7 and 8.

First, the section considered existence questions, and stated (but did not prove) the important Kolmogorov Existence Theorem. It then discussed discrete-time processes on general (non-discrete) state spaces. It provided generalised notions of irreducibility and aperiodicity, and stated a theorem about convergence of irreducible, aperiodic Markov chains to their stationary distributions.

Next, continuous-time processes were considered. The semigroup property of Markov transition probabilities was presented. Generators of continuous-time, discrete-space Markov processes were discussed.

The section then focused on processes in continuous time and space. Brownian motion was developed as a limit of random walks that take smaller and smaller steps, more and more frequently; the normal, independent increments of Brownian motion then followed naturally.
Diffusions were then developed as generalisations of Brownian motion. Generators of diffusions were described, as were stochastic integrals. The section closed with intuitive derivations of Itô's Lemma, and of the Black-Scholes equation from mathematical finance. The reader was encouraged to consult the books of Subsections B.5 and B.6 for further details.
A. Mathematical Background.

This section reviews some of the mathematics that is necessary as a prerequisite to understanding this text. The material is reviewed in a form which is most helpful for our purposes. Note, however, that this material is merely an outline of what is needed; if this section is not largely review for you, then perhaps your background is insufficient to follow this text in detail, and you may wish to first consult references such as those in Subsection B.1.

A.1. Sets and functions.

A fundamental object in mathematics is a set, which is any collection of objects.* For example, $\{1, 3, 4, 7\}$, the collection of rational numbers, the collection of real numbers, and the empty set $\emptyset$ containing no elements, are all sets. Given a set, a subset of that set is any subcollection. For example, $\{3, 7\}$ is a subset of $\{1, 3, 4, 7\}$; in symbols, $\{3, 7\} \subseteq \{1, 3, 4, 7\}$.

Given a subset, its complement is the set consisting of all elements of a "universal" set which are not in the subset. In symbols, if $A \subseteq \Omega$, where $\Omega$ is understood to be the universal set, then $A^c = \{\omega \in \Omega;\ \omega \notin A\}$.

Given a collection of sets $\{A_\alpha\}_{\alpha \in I}$ (where $I$ is any index set; perhaps $I = \{1, 2\}$), we define their union to be the set of all elements which are in at least one of the $A_\alpha$; in symbols, $\bigcup_\alpha A_\alpha = \{\omega;\ \omega \in A_\alpha \text{ for some } \alpha \in I\}$. Similarly, we define their intersection to be the set of all elements which are in all of the $A_\alpha$; in symbols, $\bigcap_\alpha A_\alpha = \{\omega;\ \omega \in A_\alpha \text{ for all } \alpha \in I\}$. Union, intersection, and complement satisfy de Morgan's Laws, which state that $(\bigcup_\alpha A_\alpha)^c = \bigcap_\alpha A_\alpha^c$ and $(\bigcap_\alpha A_\alpha)^c = \bigcup_\alpha A_\alpha^c$. In words, complement converts unions to intersections and vice-versa.

A collection $\{A_\alpha\}_{\alpha \in I}$ of subsets of a set $\Omega$ is disjoint if the intersection of any pair of them is the empty set; in symbols, if $A_{\alpha_1} \cap A_{\alpha_2} = \emptyset$ whenever $\alpha_1, \alpha_2 \in I$ are distinct. The collection $\{A_\alpha\}_{\alpha \in I}$ is called a partition of $\Omega$ if $\{A_\alpha\}_{\alpha \in I}$ is disjoint and also $\bigcup_\alpha A_\alpha = \Omega$. If $\{A_\alpha\}_{\alpha \in I}$ are disjoint, we sometimes write their union as $\dot{\bigcup}_\alpha A_\alpha$ (e.g. $A_{\alpha_1} \mathbin{\dot{\cup}} A_{\alpha_2}$) and call it a disjoint union. We also define the difference of sets $A$ and $B$ by $A \setminus B = A \cap B^c$; for example, $\{1, 3, 4, 7\} \setminus \{3, 7, 9\} = \{1, 4\}$.

Given sets $\Omega_1$ and $\Omega_2$, we define their (Cartesian) product to be the set of all ordered pairs having first element from $\Omega_1$ and second element from $\Omega_2$; in symbols, $\Omega_1 \times \Omega_2 = \{(\omega_1, \omega_2);\ \omega_1 \in \Omega_1,\ \omega_2 \in \Omega_2\}$. Similarly, larger Cartesian products $\prod_\alpha \Omega_\alpha$ are defined.

*In fact, certain very technical restrictions are required to avoid contradictions such as Russell's paradox. But we do not consider that here.
A function $f$ is a mapping from a set $\Omega_1$ to a second set $\Omega_2$; in symbols, $f : \Omega_1 \to \Omega_2$.* Its domain is the set of points in $\Omega_1$ for which it is defined, and its range is the set of points in $\Omega_2$ which are of the form $f(\omega_1)$ for some $\omega_1 \in \Omega_1$.

If $f : \Omega_1 \to \Omega_2$, and if $S_1 \subseteq \Omega_1$ and $S_2 \subseteq \Omega_2$, then the image of $S_1$ under $f$ is given by $f(S_1) = \{\omega_2 \in \Omega_2;\ \omega_2 = f(\omega_1) \text{ for some } \omega_1 \in S_1\}$, and the inverse image of $S_2$ under $f$ is given by $f^{-1}(S_2) = \{\omega_1 \in \Omega_1;\ f(\omega_1) \in S_2\}$. We note that inverse images preserve the usual set operations, for example $f^{-1}(S_1 \cup S_2) = f^{-1}(S_1) \cup f^{-1}(S_2)$, $f^{-1}(S_1 \cap S_2) = f^{-1}(S_1) \cap f^{-1}(S_2)$, and $f^{-1}(S^c) = f^{-1}(S)^c$.

Special sets include the integers $\mathbf{Z} = \{0, 1, -1, 2, -2, 3, \ldots\}$, the positive integers or natural numbers $\mathbf{N} = \{1, 2, 3, \ldots\}$, the rational numbers $\mathbf{Q} = \{\frac{m}{n};\ m, n \in \mathbf{Z},\ n \ne 0\}$, and the real numbers $\mathbf{R}$.

Given a subset $S \subseteq \Omega$, its indicator function $\mathbf{1}_S : \Omega \to \mathbf{R}$ is defined by
$$\mathbf{1}_S(\omega) = \begin{cases} 1, & \omega \in S \\ 0, & \omega \notin S. \end{cases}$$

Finally, the Axiom of Choice states that given a collection $\{A_\alpha\}_{\alpha \in I}$ of non-empty sets (i.e., $A_\alpha \ne \emptyset$ for all $\alpha$), their Cartesian product $\prod_\alpha A_\alpha$ is also non-empty. In symbols,

$$\prod_\alpha A_\alpha \ne \emptyset \quad \text{whenever } A_\alpha \ne \emptyset \text{ for each } \alpha. \qquad (A.1.1)$$

That is, it is possible to "choose" one element simultaneously from each of the non-empty sets $A_\alpha$ -- a fact that seems straightforward, but does not follow from the other axioms of set theory.

A.2. Countable sets.

One important property of sets is their cardinality, or size. A set $\Omega$ is finite if for some $n \in \mathbf{N}$ and some function $f : \mathbf{N} \to \Omega$, we have $f(\{1, 2, \ldots, n\}) \supseteq \Omega$. A set is countable if for some function $f : \mathbf{N} \to \Omega$, we have $f(\mathbf{N}) = \Omega$. (Note that, by this definition, all finite sets are countable. As a special case, the empty set $\emptyset$ is both finite and countable.) A set is countably infinite if it is countable but not finite. It is uncountable if it is not countable.

*More formally, it is a collection of ordered pairs $(\omega_1, \omega_2) \in \Omega_1 \times \Omega_2$, with $\omega_1 \in \Omega_1$ and $\omega_2 = f(\omega_1) \in \Omega_2$, such that no $\omega_1$ appears in more than one pair.
Intuitively, a set is countable if its elements can be arranged in a sequence. It turns out that countability of sets is very important in probability theory. The axioms of probability theory make certain guarantees about countable operations (e.g., the countable additivity of probability measures) which do not necessarily hold for uncountable operations. Thus it is important to understand which sets are countable and which are not.

The natural numbers $\mathbf{N}$ are obviously countable; indeed, we can take the identity function (i.e., $f(n) = n$ for all $n \in \mathbf{N}$) in the definition of countable above. Furthermore, it is clear that any subset of a countable set is again countable; hence, any subset of $\mathbf{N}$ is also countable. The integers $\mathbf{Z}$ are also countable: we can take, say, $f(1) = 0$, $f(2) = 1$, $f(3) = -1$, $f(4) = 2$, $f(5) = -2$, $f(6) = 3$, etc., to ensure that $f(\mathbf{N}) = \mathbf{Z}$. More surprising is

Proposition A.2.1. Let $\Omega_1$ and $\Omega_2$ be countable sets. Then their Cartesian product $\Omega_1 \times \Omega_2$ is also countable.

Proof. Let $f_1$ and $f_2$ be such that $f_1(\mathbf{N}) = \Omega_1$ and $f_2(\mathbf{N}) = \Omega_2$. Then define a new function $f : \mathbf{N} \to \Omega_1 \times \Omega_2$ by $f(1) = (f_1(1), f_2(1))$, $f(2) = (f_1(1), f_2(2))$, $f(3) = (f_1(2), f_2(1))$, $f(4) = (f_1(1), f_2(3))$, $f(5) = (f_1(2), f_2(2))$, $f(6) = (f_1(3), f_2(1))$, $f(7) = (f_1(1), f_2(4))$, etc. (See Figure A.2.2.) Then $f(\mathbf{N}) = \Omega_1 \times \Omega_2$, as required.

[Figure A.2.2. Constructing $f$ in Proposition A.2.1: the pairs $(i, j)$ are enumerated along successive diagonals.]

From this proposition, we see that e.g. $\mathbf{N} \times \mathbf{N}$ is countable, as is $\mathbf{Z} \times \mathbf{Z}$.
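The diagonal enumeration in the proof is easy to write out explicitly (a small sketch of ours, not from the text):

```python
# Enumerate N x N along the diagonals {(i, j) : i + j = d}, d = 2, 3, 4, ...,
# reproducing the order f(1), f(2), f(3), ... used in Proposition A.2.1.
def diagonal_pairs(count):
    """Return the first `count` pairs (i, j) of N x N in diagonal order."""
    out = []
    d = 2
    while len(out) < count:
        for i in range(1, d):
            out.append((i, d - i))
        d += 1
    return out[:count]

print(diagonal_pairs(10))
# [(1, 1), (1, 2), (2, 1), (1, 3), (2, 2), (3, 1), (1, 4), (2, 3), (3, 2), (4, 1)]
```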
It follows immediately (and, perhaps, surprisingly) that the set of all rational numbers is countable; indeed, $\mathbf{Q}$ is equivalent to a subset of $\mathbf{Z} \times \mathbf{Z}$ if we identify the rational number $m/n$ (in lowest terms) with the element $(m, n) \in \mathbf{Z} \times \mathbf{Z}$. It also follows that, given a sequence $\Omega_1, \Omega_2, \ldots$ of countable sets, their countable union $\Omega = \bigcup_i \Omega_i$ is also countable; indeed, if $f_i(\mathbf{N}) = \Omega_i$, then we can identify $\Omega$ with a subset of $\mathbf{N} \times \mathbf{N}$ by the mapping $(m, n) \mapsto f_m(n)$.

For another example of the use of Proposition A.2.1, recall that a real number is algebraic if it is a root of some non-constant polynomial with integer coefficients.

Exercise A.2.3. Prove that the set of all algebraic numbers is countable.

On the other hand, the set of all real numbers is not countable. Indeed, even the unit interval $[0, 1]$ is not countable. To see this, suppose to the contrary that it were countable, with $f : \mathbf{N} \to [0, 1]$ such that $f(\mathbf{N}) = [0, 1]$. Imagine writing each element of $[0, 1]$ in its usual base-10 expansion, and let $d_i$ be the $i$th digit of the number $f(i)$. Now define $c_i$ by: $c_i = 4$ if $d_i = 5$, while $c_i = 5$ if $d_i \ne 5$. (In particular, $c_i \ne d_i$.) Then let $x = \sum_{i=1}^{\infty} c_i\, 10^{-i}$ (so that the base-10 expansion of $x$ is $0.c_1 c_2 c_3 \ldots$). Then $x$ differs from $f(i)$ in the $i$th digit of their base-10 expansions, for each $i$. Hence, $x \ne f(i)$ for any $i$. This contradicts the assumption that $f(\mathbf{N}) = [0, 1]$.

A.3. Epsilons and Limits.

Real analysis and probability theory make constant use of arguments involving arbitrarily small $\epsilon > 0$. A basic starting block is

Proposition A.3.1. Let $a$ and $b$ be two real numbers. Suppose that, for any $\epsilon > 0$, we have $a \le b + \epsilon$. Then $a \le b$.

Proof. Suppose to the contrary that $a > b$. Let $\epsilon = \frac{a - b}{2}$. Then $\epsilon > 0$, but it is not the case that $a \le b + \epsilon$. This is a contradiction.

Arbitrarily small $\epsilon > 0$ are used to define the important concept of a limit. We say that a sequence of real numbers $x_1, x_2, \ldots$ converges to the real number $x$ (or, has limit $x$) if, given any $\epsilon > 0$, there is $N \in \mathbf{N}$ (which may depend on $\epsilon$) such that for any $n \ge N$, we have $|x_n - x| < \epsilon$. We shall also write this as $\lim_{n \to \infty} x_n = x$, and shall sometimes abbreviate this to $\lim_n x_n = x$ or even $\lim x_n = x$. We also write this as $\{x_n\} \to x$.

Exercise A.3.2. Use the definition of a limit to prove that
(a) $\lim_{n \to \infty} \frac{1}{n} = 0$.
(b) $\lim_{n \to \infty} \frac{1}{n^k} = 0$ for any $k > 0$.
(c) $\lim_{n \to \infty} 2^{1/n} = 1$.

A useful reformulation of limits is given by

Proposition A.3.3. Let $\{x_n\}$ be a sequence of real numbers. Then $\{x_n\} \to x$ if and only if for each $\epsilon > 0$, the set $\{n \in \mathbf{N};\ |x_n - x| \ge \epsilon\}$ is finite.

Proof. If $\{x_n\} \to x$, then given $\epsilon > 0$, there is $N \in \mathbf{N}$ such that $|x_n - x| < \epsilon$ for all $n \ge N$. Hence, the set $\{n;\ |x_n - x| \ge \epsilon\}$ contains at most $N - 1$ elements, and so is finite. Conversely, given $\epsilon > 0$, if the set $\{n;\ |x_n - x| \ge \epsilon\}$ is finite, then it has some largest element, say $K$. Setting $N = K + 1$, we see that $|x_n - x| < \epsilon$ whenever $n \ge N$. Hence, $\{x_n\} \to x$.

Exercise A.3.4. Use Proposition A.3.3 to provide an alternative proof for each of the three parts of Exercise A.3.2.

A sequence which does not converge is said to diverge. There is one special case. A sequence $\{x_n\}$ converges to infinity if for all $M \in \mathbf{R}$, there is $N \in \mathbf{N}$, such that $x_n > M$ whenever $n \ge N$. (Similarly, $\{x_n\}$ converges to negative infinity if for all $M \in \mathbf{R}$, there is $N \in \mathbf{N}$, such that $x_n < M$ whenever $n \ge N$.) This is a special kind of "convergence" which is also sometimes referred to as divergence!

Limits have many useful properties. For example,

Exercise A.3.5. Prove the following:
(a) If $\lim_n x_n = x$, and $a \in \mathbf{R}$, then $\lim_n a x_n = a x$.
(b) If $\lim_n x_n = x$ and $\lim_n y_n = y$, then $\lim_n (x_n + y_n) = x + y$.
(c) If $\lim_n x_n = x$ and $\lim_n y_n = y$, then $\lim_n (x_n y_n) = x y$.
(d) If $\lim_n x_n = x$, and $x \ne 0$, then $\lim_n (1 / x_n) = 1 / x$.

Another useful property is given by

Proposition A.3.6. Suppose $x_n \to x$, and $x_n \le a$ for all $n \in \mathbf{N}$. Then $x \le a$.

Proof. Suppose to the contrary that $x > a$. Let $\epsilon = \frac{x - a}{2}$. Then $\epsilon > 0$, but for all $n \in \mathbf{N}$ we have $|x_n - x| \ge x - x_n \ge x - a > \epsilon$, contradicting the assertion that $x_n \to x$.
Given a sequence $\{x_n\}$, a subsequence $\{x_{n_k}\}$ is any sequence formed from $\{x_n\}$ by omitting some of the elements. (For example, one subsequence of the sequence $(1, 3, 5, 7, 9, \ldots)$ of all positive odd numbers is the sequence $(3, 9, 27, 81, \ldots)$ of all positive integer powers of 3.) The Bolzano-Weierstrass theorem, a consequence of compactness, says that every bounded sequence contains a convergent subsequence.

Limits are also used to define infinite series. Indeed, the sum $\sum_{i=1}^{\infty} s_i$ is simply a shorthand way of writing $\lim_{n \to \infty} \sum_{i=1}^{n} s_i$. If this limit exists and is finite, then we say that the series $\sum_{i=1}^{\infty} s_i$ converges; otherwise we say the series diverges. (In particular, for infinite series, converging to infinity is usually referred to as diverging.) If the $s_i$ are non-negative, then $\sum_{i=1}^{\infty} s_i$ either converges to a finite value or diverges to infinity; we write this as $\sum_{i=1}^{\infty} s_i < \infty$ and $\sum_{i=1}^{\infty} s_i = \infty$, respectively. For example, it is not hard to show (by comparing $\sum_{i=1}^{\infty} i^{-a}$ to the integral $\int_1^{\infty} t^{-a}\, dt$) that

$$\sum_{i=1}^{\infty} (1/i^a) < \infty \quad \text{if and only if} \quad a > 1. \qquad (A.3.7)$$

Exercise A.3.8. Prove that $\sum_{i=2}^{\infty} \big(1 / (i \log(i))\big) = \infty$. [Hint: Show $\sum_{i=2}^{\infty} \big(1 / (i \log(i))\big) \ge \int_2^{\infty} \big(dx / (x \log x)\big)$.]

Exercise A.3.9. Prove that $\sum_{i=3}^{\infty} \big(1 / (i \log(i) \log\log(i))\big) = \infty$, but $\sum_{i=3}^{\infty} \big(1 / (i \log(i) [\log\log(i)]^2)\big) < \infty$.
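Partial sums give a rough numerical feel for (A.3.7) and Exercise A.3.8 (a suggestive sketch of ours, not a proof: divergence of a series can never be verified by finite computation):

```python
# Partial sums: 1/i^2 settles down quickly, while 1/i and 1/(i log i) keep
# growing -- the latter extremely slowly, like log log n.
import numpy as np

i = np.arange(2, 1_000_001, dtype=float)
for label, terms in [("1/i^2", 1 / i**2),
                     ("1/i", 1 / i),
                     ("1/(i log i)", 1 / (i * np.log(i)))]:
    partial = np.cumsum(terms)
    print(label, partial[998], partial[-1])   # n = 1e3 versus n = 1e6
```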
A.4. Infimums and supremums.

Given a set $\{x_\alpha\}_{\alpha \in I}$ of real numbers, a lower bound for them is a real number $\ell$ such that $x_\alpha \ge \ell$ for all $\alpha \in I$. The set $\{x_\alpha\}_{\alpha \in I}$ is bounded below if it has at least one lower bound. A lower bound is called the greatest lower bound if it is at least as large as any other lower bound. A very important property of the real numbers is: Any non-empty set of real numbers which is bounded below has a unique greatest lower bound. The uniqueness part of this is rather obvious; however, the existence part is very subtle. For example, this property would not hold if we restricted ourselves to rational numbers. Indeed, the set $\{q \in \mathbf{Q};\ q > 0,\ q^2 > 2\}$ does not have a greatest lower bound if we allow rational numbers only; however, if we allow all real numbers, then the greatest lower bound is $\sqrt{2}$.

The greatest lower bound is also called the infimum, and is written as $\inf\{x_\alpha;\ \alpha \in I\}$ or $\inf_{\alpha \in I} x_\alpha$. We similarly define the supremum of a set of real numbers to be their least upper bound. By convention, for the empty set $\emptyset$, we define $\inf \emptyset = \infty$ and $\sup \emptyset = -\infty$. Also, if a set $S$ is not bounded below, then $\inf S = -\infty$; similarly, if it is not bounded above, then $\sup S = \infty$. One obvious but useful fact is

$$\inf S \le x, \quad \text{for any } x \in S. \qquad (A.4.1)$$

Of course, if a set of real numbers has a minimal element (for example, if the set is finite), then this minimal element is the infimum. However, infimums exist even for sets without minimal elements. For example, if $S$ is the set of all positive real numbers, then $\inf S = 0$, even though $0 \notin S$.

A simple but useful property of infimums is the following.

Proposition A.4.2. Let $S$ be a non-empty set of real numbers which is bounded below. Let $a = \inf S$. Then for any $\epsilon > 0$, there is $s \in S$ with $a \le s < a + \epsilon$.

Proof. Suppose, to the contrary, that there is no such $s$. Clearly there is no $s \in S$ with $s < a$ (otherwise $a$ would not be a lower bound for $S$); hence, it must be that all $s \in S$ satisfy $s \ge a + \epsilon$. But in this case, $a + \epsilon$ is a lower bound for $S$ which is larger than $a$. This contradicts the assertion that $a$ was the greatest lower bound for $S$.

For example, if $S$ is the interval $(5, 20)$, then $\inf S = 5$, and $5 \notin S$, but for any $\epsilon > 0$, there is $x \in S$ with $x < 5 + \epsilon$.

Exercise A.4.3. (a) Compute $\inf\{x \in \mathbf{R} : x > 10\}$.
(b) Compute $\inf\{q \in \mathbf{Q} : q > 10\}$.
(c) Compute $\inf\{q \in \mathbf{Q} : q \ge 10\}$.

Exercise A.4.4. (a) Let $R, S \subseteq \mathbf{R}$ each be non-empty and bounded below. Prove that $\inf(R \cup S) = \min(\inf R, \inf S)$.
(b) Prove that this formula continues to hold if $R = \emptyset$ and/or $S = \emptyset$.
(c) State and prove a similar formula for $\sup(R \cup S)$.

Exercise A.4.5. Suppose $\{a_n\} \to a$. Prove that $\inf_n a_n \le a \le \sup_n a_n$. [Hint: Use proof by contradiction.]

Two special kinds of limits involve infimums and supremums, namely the limit inferior and the limit superior, defined by $\liminf_n x_n = \lim_{n \to \infty} \inf_{k \ge n} x_k$ and $\limsup_n x_n = \lim_{n \to \infty} \sup_{k \ge n} x_k$, respectively. Indeed, such limits always exist (though they may be infinite). Furthermore, $\lim_n x_n$ exists if and only if $\liminf_n x_n = \limsup_n x_n$.
Exercise A.4.6. Suppose $\{a_n\} \to a$ and $\{b_n\} \to b$, with $a < b$. Let $c_n = a_n$ for $n$ odd, and $c_n = b_n$ for $n$ even. Compute $\liminf_n c_n$ and $\limsup_n c_n$.

We shall sometimes use the "order" notations, $O(\cdot)$ and $o(\cdot)$. A quantity $g(x)$ is said to be $O(h(x))$ as $x \to \xi$ if $\limsup_{x \to \xi} |g(x) / h(x)| < \infty$. (Here $\xi$ can be $\infty$, or $-\infty$, or 0, or any other quantity. Also, $h(x)$ can be any function, such as $x$, or $x^2$, or $1/x$, or 1.) Similarly, $g(x)$ is said to be $o(h(x))$ as $x \to \xi$ if $\lim_{x \to \xi} g(x) / h(x) = 0$.

Exercise A.4.7. Prove that $\{a_n\} \to 0$ if and only if $a_n = o(1)$ as $n \to \infty$.

Finally, we note the following. We see by Exercise A.3.5(b) and induction that if $\{x_{nk}\}$ are real numbers for $n \in \mathbf{N}$ and $1 \le k \le K < \infty$, with $\lim_{n \to \infty} x_{nk} = 0$ for each $k$, then $\lim_{n \to \infty} \sum_{k=1}^{K} x_{nk} = 0$ as well. However, if there are an infinite number of different $k$ then this is not true in general (for example, suppose $x_{nn} = 1$ but $x_{nk} = 0$ for $k \ne n$). Still, the following proposition gives a useful condition under which this conclusion holds.

Proposition A.4.8. (The M-test.) Let $\{x_{nk}\}_{n, k \in \mathbf{N}}$ be a collection of real numbers. Suppose that $\lim_{n \to \infty} x_{nk} = a_k$ for each fixed $k \in \mathbf{N}$. Suppose further that $\sum_{k=1}^{\infty} \sup_n |x_{nk}| < \infty$. Then $\lim_{n \to \infty} \sum_{k=1}^{\infty} x_{nk} = \sum_{k=1}^{\infty} a_k$.

Proof. The hypotheses imply that $\sum_{k=1}^{\infty} |a_k| < \infty$. Hence, by replacing $x_{nk}$ by $x_{nk} - a_k$, it suffices to assume that $a_k = 0$ for all $k$. Fix $\epsilon > 0$. Since $\sum_{k=1}^{\infty} \sup_n |x_{nk}| < \infty$, we can find $K \in \mathbf{N}$ such that $\sum_{k=K+1}^{\infty} \sup_n |x_{nk}| < \epsilon/2$. Since $\lim_{n \to \infty} x_{nk} = 0$, we can find (for $k = 1, 2, \ldots, K$) numbers $N_k$ with $|x_{nk}| < \epsilon / 2K$ for all $n \ge N_k$. Let $N = \max(N_1, \ldots, N_K)$. Then for $n \ge N$, we have $\big|\sum_{k=1}^{\infty} x_{nk}\big| \le \sum_{k=1}^{K} |x_{nk}| + \sum_{k=K+1}^{\infty} \sup_n |x_{nk}| < K \cdot \frac{\epsilon}{2K} + \frac{\epsilon}{2} = \epsilon$. Hence, $\lim_{n \to \infty} \sum_{k=1}^{\infty} x_{nk} = 0$. The result follows.

If $\lim_{n \to \infty} x_{nk} = a_k$ for each fixed $k \in \mathbf{N}$, with $x_{nk} \ge 0$, but we do not know that $\sum_{k=1}^{\infty} \sup_n |x_{nk}| < \infty$, then we still have

$$\lim_{n \to \infty} \sum_{k=1}^{\infty} x_{nk} \ge \sum_{k=1}^{\infty} a_k, \qquad (A.4.9)$$

assuming this limit exists. Indeed, if not, then we could find some finite $K \in \mathbf{N}$ with $\lim_{n \to \infty} \sum_{k=1}^{\infty} x_{nk} < \sum_{k=1}^{K} a_k$, contradicting the fact that $\lim_{n \to \infty} \sum_{k=1}^{\infty} x_{nk} \ge \lim_{n \to \infty} \sum_{k=1}^{K} x_{nk} = \sum_{k=1}^{K} a_k$.
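The M-test can be illustrated numerically (a sketch of ours, not from the text): take $x_{nk} = \frac{n}{n+k} \cdot \frac{1}{k^2}$, so that $a_k = 1/k^2$ and $\sup_n |x_{nk}| \le 1/k^2$ is summable, and watch $\sum_k x_{nk}$ approach $\sum_k 1/k^2 = \pi^2/6$:

```python
# Numerical illustration of Proposition A.4.8 with x_nk = (n/(n+k)) / k^2.
import numpy as np

k = np.arange(1, 1_000_001, dtype=float)   # truncate the (convergent) sum
for n in [1, 10, 1_000, 100_000]:
    x_nk = (n / (n + k)) / k**2
    print(n, x_nk.sum())
print("limit:", np.pi**2 / 6)
```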
A.5. Equivalence relations.

In one place in the notes (the proof of Proposition 1.2.6), the idea of an equivalence relation is used. Thus, we briefly review equivalence relations here.

A relation on a set $S$ is a boolean function on $S \times S$; that is, given $x, y \in S$, either $x$ is related to $y$ (written $x \sim y$), or $x$ is not related to $y$ (written $x \not\sim y$). A relation is an equivalence relation if (a) it is reflexive, i.e. $x \sim x$ for all $x \in S$; (b) it is symmetric, i.e. $x \sim y$ whenever $y \sim x$; and (c) it is transitive, i.e. $x \sim z$ whenever $x \sim y$ and $y \sim z$.

Given an equivalence relation $\sim$ and an element $x \in S$, the equivalence class of $x$ is the set of all $y \in S$ such that $y \sim x$. It is straightforward to verify that, if $\sim$ is an equivalence relation, then any pair of equivalence classes is either identical or disjoint. It follows that, given an equivalence relation, the collection of equivalence classes forms a partition of the set $S$. This fact is used in the proof of Proposition 1.2.6 herein.

Exercise A.5.1. For each of the following relations on $S = \mathbf{Z}$, determine whether or not it is an equivalence relation; and if it is, then find the equivalence class of the element 1:
(a) $x \sim y$ if and only if $|y - x|$ is an integer multiple of 3.
(b) $x \sim y$ if and only if $|y - x| \le 5$.
(c) $x \sim y$ if and only if $|x|, |y| \le 5$.
(d) $x \sim y$ if and only if either $x = y$, or $|x|, |y| \le 5$ (or both).
(e) $x \sim y$ if and only if $|x| = |y|$.
(f) $x \sim y$ if and only if $|x| \ge |y|$.
B. Bibliography.

The books in Subsection B.1 below should be used for background study, especially if your mathematical background is on the weak side for understanding this text. The books in Subsection B.2 are appropriate for lower-level probability courses, and provide lots of intuition, examples, computations, etc. (though they largely avoid the measure theory issues). The books in Subsection B.3 cover material similar to that of this text, though perhaps at a more advanced level; they also contain much additional material of interest. The books in Subsection B.4 do not discuss probability theory, but they do discuss measure theory in detail. Finally, the books in Subsections B.5 and B.6 discuss advanced topics in the theory of stochastic processes and of mathematical finance, respectively, and may be appropriate for further reading after mastering the content of this text.

B.1. Background in real analysis.

K.R. Davidson and A.P. Donsig (2002), Real analysis with real applications. Prentice Hall, Saddle River, NJ.
J.E. Marsden (1974), Elementary classical analysis. W.H. Freeman and Co., New York.
W. Rudin (1976), Principles of mathematical analysis (3rd ed.). McGraw-Hill, New York.
M. Spivak (1994), Calculus (3rd ed.). Publish or Perish, Houston, TX. (This is no ordinary calculus book!)

B.2. Undergraduate-level probability.

W. Feller (1968), An introduction to probability theory and its applications, Vol. I (3rd ed.). Wiley & Sons, New York.
G.R. Grimmett and D.R. Stirzaker (1992), Probability and random processes (2nd ed.). Oxford University Press.
D.G. Kelly (1994), Introduction to probability. Macmillan Publishing Co., New York.
J. Pitman (1993), Probability. Springer-Verlag, New York.
S. Ross (1994), A first course in probability (4th ed.). Macmillan Publishing Co., New York.
R.L. Scheaffer (1995), Introduction to probability and its applications (2nd ed.). Duxbury Press, New York.
B.3. Graduate-level probability.

P. Billingsley (1995), Probability and measure (3rd ed.). John Wiley & Sons, New York.
L. Breiman (1992), Probability. SIAM, Philadelphia.
K.L. Chung (1974), A course in probability theory (2nd ed.). Academic Press, New York.
R.M. Dudley (1989), Real analysis and probability. Wadsworth, Pacific Grove, CA.
R. Durrett (1996), Probability: Theory and examples (2nd ed.). Duxbury Press, New York.
W. Feller (1971), An introduction to probability theory and its applications, Vol. II (2nd ed.). Wiley & Sons, New York.
B. Fristedt and L. Gray (1997), A modern approach to probability theory. Birkhäuser, Boston.
J.C. Taylor (1997), An introduction to measure and probability. Springer, New York.
D. Williams (1991), Probability with martingales. Cambridge University Press.

B.4. Pure measure theory.

D.L. Cohn (1980), Measure theory. Birkhäuser, Boston.
J.L. Doob (1994), Measure theory. Springer-Verlag, New York.
G.B. Folland (1984), Real analysis: Modern techniques and their applications. John Wiley & Sons, New York.
P.R. Halmos (1964), Measure theory. Van Nostrand, Princeton, NJ.
H.L. Royden (1988), Real analysis (3rd ed.). Macmillan Publishing Co., New York.
W. Rudin (1987), Real and complex analysis (3rd ed.). McGraw-Hill, New York.

B.5. Stochastic processes.

S. Asmussen (1987), Applied probability and queues. John Wiley & Sons, New York.
R. Bhattacharya and E.C. Waymire (1990), Stochastic processes with applications. John Wiley & Sons, New York.
J.L. Doob (1953), Stochastic processes (2nd ed.). John Wiley & Sons, New York.
D. Freedman (1983), Brownian motion and diffusion. Springer-Verlag, New York.
S. Karlin and H.M. Taylor (1975), A first course in stochastic processes. Academic Press, New York.
S.P. Meyn and R.L. Tweedie (1993), Markov chains and stochastic stability. Springer-Verlag, London.
E. Nummelin (1984), General irreducible Markov chains and non-negative operators. Cambridge University Press.
S. Resnick (1992), Adventures in stochastic processes. Birkhäuser, Boston.

B.6. Mathematical finance.

M. Baxter and A. Rennie (1996), Financial calculus: An introduction to derivative pricing. Cambridge University Press.
N.H. Bingham and R. Kiesel (1998), Risk-neutral valuation: Pricing and hedging of financial derivatives. Springer, London, New York.
R.J. Elliott and P.E. Kopp (1999), Mathematics of financial markets. Springer, New York.
I. Karatzas (1997), Lectures on the mathematics of finance. American Mathematical Society, Providence, Rhode Island.
I. Karatzas and S.E. Shreve (1998), Methods of mathematical finance. Springer, New York.
A.V. Mel'nikov (1999), Financial markets: Stochastic analysis and the pricing of derivative securities. American Mathematical Society, Providence, Rhode Island.
Index

$A^c$, see complement
$L^2$, 168
$L^p$, 168
$M_X(s)$, see moment generating function
$N(0,1)$, 69
$O(\cdot)$, 206
$\mathbf{E}$, see expected value
$\mathbf{E}_i$, 86
$\mathcal{F}$, see $\sigma$-algebra
$\mathbf{N}$, 200
$\mathbf{Q}$, 200
$\mathbf{R}$, 200
$\mathbf{Z}$, 200
$\Omega$, see sample space
$\mathbf{P}$, see probability measure
$\mathbf{P}_i$, 86
$\cap$, see intersection
$\cup$, see union
$\delta_c$ (point mass), 69
$\delta_{ij}$, 85, 183
$\dot{\cup}$, see disjoint union
$\emptyset$, see empty set
$\mu_X$, 43
$\mu_n \Rightarrow \mu$, 117
$\mathbf{1}_S$, see indicator function
$\phi$-irreducibility, 180
$\phi_X(t)$, see characteristic function
$\setminus$, see difference of sets
$\sigma$-finite measure, 52, 147, 182
$\sigma$-algebra, 7
  Borel, 16
  generated, 16
$\sigma$-field, see $\sigma$-algebra
$\subseteq$, see subset
A?, 86
$o(\cdot)$, 206
absolute continuity, 70, 143, 144, 148
additivity
  countable, 2, 24
  finite, 2
  uncountable, 2, 24
Alaoglu's theorem, 117
algebra, 10, 13
  semi, 9
  sigma, 7
algebraic number, 3, 202
almost always, 34
almost everywhere, 58
almost sure convergence, see convergence
analytic function, 108
aperiodicity, 90, 180, 181
appreciation rate, 195
area, 23
a.a., 34
a.s., 47, 58
axiom of choice, 3, 4, 200
background, mathematical, 9, 199, 209
Berry-Esseen theorem, 135
binomial distribution, 141
Black-Scholes equation, 194, 197
bold play, 79
Bolzano-Weierstrass theorem, 129, 204
Borel $\sigma$-algebra, 16
Borel set, 16
Borel-Cantelli Lemma, 35
Borel-measurable, 30
boundary conditions, 76
boundary of set, 117
bounded convergence theorem, 78
bounded set, 204
Brownian motion, 186-188
  properties, 187
call option, 195
candidate density, 146
Cantelli's inequality, 65
Cantor set, 16, 81
cardinality, 200
Cartesian product, 199
Cauchy sequence, 168
Cauchy-Schwarz inequality, 58, 65
central limit theorem, 134
  Lindeberg, 136
  martingale, 172
Cesàro average, 90
change of variable theorem, 67
characteristic function, 125
Chebychev's inequality, 57
closed subset of a Markov chain, 98
CLT, see central limit theorem
coin tossing, 21, 22
  sequence waiting time, 166, 175
communicating states, 91, 97
compact support, 122
complement, 199
complete metric space, 168
complete probability space, 15, 16
composition, 30
conditional distribution, 151
  regular, 156
conditional expectation, 152, 155
conditional probability, 84, 152, 155
conditional variance, 157
consistency conditions, 177, 178
continuity of probabilities, 33
continuity theorem, 132
continuous sample paths, 188, 189
continuum hypothesis, 4
convergence, 202
  almost sure (a.s.), 47, 58-60, 62, 103, 120
  in distribution, 120
  in probability, 59, 60, 64, 120
  of series, 204
  pointwise, 58
  to infinity, 203
  weak, 117, 120
  with probability one (w.p. 1), 58
convergence theorem
  bounded, 78
  dominated, 104, 165
  martingale, 169
  monotone, 46
  uniform integrability, 105, 165
convex function, 58, 65, 173
convolution, 112
correlation (Corr), 45, 53, 65
countable additivity, 2, 24
countable linearity, 48, 111
countable monotonicity, 10
countable set, 200
countable subadditivity, 8
counting measure, 5, 182
coupling, 92
covariance (Cov), 45
cumulative distribution function, 67
de Morgan's Laws, 199
decomposable Markov chain, 101
decreasing events, 33
density function, 70, 144
derivative
  financial, 194, 195
  Radon-Nikodym, 144, 148
determined by moments, 137
diagonalisation method, 129
difference equations, 76
difference of sets, 199
diffusion, 190
  coefficient, 190
  Langevin, 192
  Ornstein-Uhlenbeck, 193, 194
discrete measure, 69, 143, 144, 148
discrete probability space, 9, 21
disjoint set, 199
disjoint union, 8, 199
distribution, 67
  binomial, 141
  conditional, 151
  exponential, 141
  finite-dimensional, 177
  function, 67
  gamma, 114
  initial, 179, 183
  normal, 1, 69, 70, 133, 134, 141, 193
  Poisson, 1, 8, 69, 141, 185
  stationary, 89, 180, 184
  uniform, 2, 9, 16
divergence, 203
  of series, 204
domain, 200
dominated convergence theorem, 104, 165
dominated measure ($\ll$), 143, 147
double 'til you win, 78
doubly stochastic Markov chain, 100
drift, 190
drift (of diffusion), 190
dyadic rational, 189
Ehrenfest's urn, 84
empty set, 199
equivalence class, 3, 207
equivalence relation, 3, 207
equivalent martingale measure, 195
European call option, 195
evaluation map, 69
event, 7
  decreasing, 33
  increasing, 33
expected value, 1, 43, 45
  conditional, 152, 155
  infinite, 45, 49
  linearity of, 43, 46, 47, 50
  order-preserving, 44, 46, 49
exponential distribution, 141
fair price, 195
Fatou's lemma, 103
field, see algebra
financial derivative, 194, 195
finite additivity, 2
finite intersection property, 22
finite superadditivity, 10
finite-dimensional distributions, 177
first hitting time, 75, 180
floor, 47
fluctuation theory, 88
Fourier inversion, 126
Fourier transform, 125
Fourier uniqueness, 129
Fubini's theorem, 110
function, 200
  analytic, 108
  convex, 58, 65, 173
  density, 70, 144
  distribution, 67
  identity, 201
  indicator, 43, 200
  Lipschitz, 188
  measurable, 29, 30
  random, 186
gambler's ruin, 75, 175
gambling policy, 78
  double 'til you win, 78
gamma distribution, 114
gamma function, 115
generalised triangle inequality, 44
generated $\sigma$-algebra, 16
generator, 183, 190, 191
greatest integer not exceeding, 47
greatest lower bound, 204
Hahn decomposition, 144
Heine-Borel Theorem, 16
Helly selection principle, 117, 129
hitting time, 75, 180
i.i.d., 62
i.o., 34, 86
identically distributed, 62
identity function, 201
image, 200
  inverse, 200
inclusion-exclusion, principle of, 8, 53
increasing events, 33
independence, 31, 32
  of sums, 39
indicator function, 43, 200
inequality
  Cantelli's, 65
  Cauchy-Schwarz, 58, 65
  Chebychev's, 57
  Jensen, 65, 173
  Jensen's, 58
  Markov's, 57
  maximal, 171
infimum, 204
infinite expected value, 45, 49
infinite fair coin tossing, 22
infinitely divisible, 136
infinitely often, 34, 86
initial distribution, 83, 179, 183
integers ($\mathbf{Z}$), 200
  positive ($\mathbf{N}$), 200
integrable
  Lebesgue, 51
  Riemann, 50, 51
  uniformly, 105
integral, 50
  Itô, 191
  iterated, 110
  Lebesgue, 51
  lower, 50
  Riemann, 50
  stochastic, 191
  upper, 50
interest rate, 195
intersection, 199
interval, 9, 15
inverse image, 200
irreducibility, 88, 180
Itô integral, 191
Itô's lemma, 194
iterated integral, 110
iterated logarithm, 88
Jensen's inequality, 58, 65, 173
jump rate, 184
Kolmogorov consistency conditions, 177, 178
Kolmogorov existence theorem, 178, 182
Kolmogorov zero-one law, 37
Langevin diffusion, 192
large deviations, 108, 109
law, see distribution
law of large numbers
  strong, 60, 62
  weak, 60, 64
law of the iterated logarithm, 88
least upper bound, 204
Lebesgue decomposition, 143, 147
Lebesgue integral, 51
Lebesgue measure, 16, 50, 51
liminf, 205
limit, 202
  inferior, 205
  superior, 205
limsup, 205
Lindeberg CLT, 136
linearity of expected values, 43, 46, 47, 50
  countable, 48, 111
Lipschitz function, 188
lower bound, 204
  greatest, 204
lower integral, 50
M-test, 206
Markov chain, 83, 179
  decomposable, 101
Markov process, 183
Markov's inequality, 57
martingale, 76, 161
  central limit theorem, 172
  convergence theorem, 169
  maximal inequality, 171
  sampling theorem, 163
mathematical background, see background, mathematical
maximal inequality, 171
mean, see expected value
mean return time, 94
measurable
  function, 29, 30
  rectangle, 23
  set, 4, 7
  space, 31
measure
  $\sigma$-finite, 52, 147, 182
  counting, 5, 182
  Lebesgue, 16, 50, 51
  outer, 11
  probability, 7
  product, 22, 23
  signed, 144, 146
measure space, see probability space
method of moments, 139
mixture distribution, 143
moment, 45, 108, 137
moment generating function, 107
monotone convergence theorem, 46
monotonicity, 8, 11, 46
  countable, 10
natural numbers, 200
normal distribution, 1, 69, 70, 133, 134, 141, 193
null recurrence, 94
option, 195
optional sampling theorem, 163
order statistics, 179
order-preserving, 44, 46, 49
Ornstein-Uhlenbeck process, 193, 194
outer measure, 11
partition, 43, 199
period of Markov chain, 90, 180
periodicity, 90, 181
permutation, 177, 179
persistence, 86
point mass, 69
Poisson distribution, 1, 8, 69, 141, 185
Poisson process, 185
policy, see gambling policy
Polish space, 156
positive recurrence, 94
prerequisites, 8, 199
price, fair, 195
principle of inclusion-exclusion, 8, 53
probability, 1
  conditional, 84, 152, 155
probability measure, 7
  defined on an algebra, 25
probability space, 7
  complete, 15, 16
  discrete, 9, 21
probability triple, 7
process
  Markov, 183
  stochastic, 182
process, stochastic, 73
product measure, 22, 23
product set, 199
projection operator, 157
Radon-Nikodym derivative, 144, 148
Radon-Nikodym theorem, 144, 148
random function, 186
random variable, 29
  absolutely continuous, 1, 70
  discrete, 1, 69
  simple, 43
random walk, 75, 87, 97, 161, 166
range, 43, 200
rate, jump, 184
rational numbers ($\mathbf{Q}$), 200
real numbers ($\mathbf{R}$), 200
recurrence, 86, 88
  null, 94
  positive, 94
reducibility, 88
reflexive relation, 207
regular conditional distribution, 156
relation, 207
  equivalence, 207
  reflexive, 207
  symmetric, 207
  transitive, 207
return time, 94
reversibility, 89, 184, 192
Riemann integrable, 50, 51
right-continuous, 67
risk-free interest rate, 195
risk-neutral measure, 195
Russell's paradox, 199
sample paths, continuous, 188, 189
sample space, 7
semialgebra, 9, 10
semigroup, 183
set, 199
  Borel, 16
  boundary, 117
  Cantor, 16, 81
  cardinality, 200
  complement, 199
  countable, 200
  difference, 199
  disjoint, 8, 199
  empty, 199
  finite, 200
  intersection, 199
  measurable, 4, 7
  product, 199
  uncountable, 200
  union, 199
  union, disjoint, 8, 199
  universal, 199
shift, 3
sigma-algebra, see $\sigma$-algebra
sigma-field, see $\sigma$-algebra
sigma-finite measure, see $\sigma$-finite measure
sign(0), 127
signed measure, 144, 146
simple random variable, 43
simple random walk, 75, 97
  symmetric, 87, 161
singular measure, 143, 144, 148
Skorohod's theorem, 117
state space, 83, 179
stationary distribution, 89, 180, 184
Stirling's approximation, 87
stochastic integral, 191
stochastic process, 73, 177, 182
  continuous time, 182
stock, 195
stock options, 194, 195
  value of, 195
stopping time, 162
strong law of large numbers, 60, 62
sub-$\sigma$-algebra, 152
subadditivity, 8, 11
submartingale, 161
  convergence theorem, 169
  sampling theorem, 163
subsequence, 204
subset, 199
superadditivity, 10
supermartingale, 161
supremum, 204
symmetric relation, 207
tightness, 130
Tonelli's theorem, 110
total variation distance, 181
transience, 86
transition matrix, 83, 84, 89
transition probabilities, 83, 179
  nth order, 85
transitive relation, 207
triangle inequality, 44
triangular array, 136
truncation argument, 62
Tychonov's Theorem, 22
uncountable additivity, 2, 24
uncountable set, 200
uniform distribution, 2, 9, 16
uniform integrability, 105
  convergence theorem, 105, 165
uniformly bounded, 78
union, 199
  disjoint, 8, 199
universal set, 199
upcrossing lemma, 169
upper bound, 204
  least, 204
upper integral, 50
value of stock option, 195
vanishing at infinity, 117, 122
variable, random, see random variable
variance (Var), 44
  conditional, 157
volatility, 190, 195
volume, 23
w.p. 1, see convergence
Wald's theorem, 166, 175, 176
weak convergence, 117, 120
weak law of large numbers, 60, 64
weak* topology, 117
Wiener process, 186
zero-one law, Kolmogorov, 37