/
Text
A.N. Shiryaev
Probability
Second Edition
Springer
N.-AvV
***«*.
"Order out of chaos"
(Courtesy of Professor A. T. Fomenko of the Moscow State University)
Preface to the Second Edition
In the Preface to the first edition, originally published in 1980, we mentioned
that this book was based on the author's lectures in the Department of
Mechanics and Mathematics of the Lomonosov University in Moscow,
which were issued, in part, in mimeographed form under the title
"Probability, Statistics, and Stochastic Processors, I, II" and published by that
University. Our original intention in writing the first edition of this book was to
divide the contents into three parts: probability, mathematical statistics, and
theory of stochastic processes, which corresponds to an outline of a three-
semester course of lectures for university students of mathematics. However,
in the course of preparing the book, it turned out to be impossible to realize
this intention completely, since a full exposition would have required too
much space. In this connection, we stated in the Preface to the first edition
that only probability theory and the theory of random processes with discrete
time were really adequately presented.
Essentially all of the first edition is reproduced in this second edition.
Changes and corrections are, as a rule, editorial, taking into account
comments made by both Russian and foreign readers of the Russian original and
of the English and German translations [Sll]. The author is grateful to all of
these readers for their attention, advice, and helpful criticisms.
In this second English edition, new material also has been added, as
follows: in Chapter III, §5, §§7-12; in Chapter IV, §5; in Chapter VII, §§8-10.
The most important addition is the third chapter. There the reader will
find expositions of a number of problems connected with a deeper study of
themes such as the distance between probability measures, metrization of
weak convergence, and contiguity of probability measures. In the same
chapter, we have added proofs of a number of important results on the rapidity of
convergence in the central limit theorem and in Poisson's theorem on the
Vlll
Preface to the Second Edition
approximation of the binomial by the Poisson distribution. These were
merely stated in the first edition.
We also call attention to the new material on the probability of large
deviations (Chapter IV, §5), on the central limit theorem for sums of
dependent random variables (Chapter VII, §8), and on §§9 and 10 of Chapter VII.
During the last few years, the literature on probability published in Russia
by Nauka has been extended by Sevastyanov [S10], 1982; Rozanov [R6],
1985; Borovkov [B4], 1986; and Gnedenko [G4], 1988. It appears that these
publications, together with the present volume, being quite different and
complementing each other, cover an extensive amount of material that is
essentially broad enough to satisfy contemporary demands by students in
various branches of mathematics and physics for instruction in topics in
probability theory.
Gnedenko's textbook [G4] contains many well-chosen examples,
including applications, together with pedagogical material and extensive surveys of
the history of probability theory. Borovkov's textbook [B4] is perhaps the
most like the present book in the style of exposition. Chapters 9 (Elements of
Renewal Theory), 11 (Factorization of the Identity) and 17 (Functional Limit
Theorems), which distinguish [B4] from this book and from [G4] and [R6],
deserve special mention. Rozanov's textbook contains a great deal of
material on a variety of mathematical models which the theory of probability and
mathematical statistics provides for describing random phenomena and their
evolution. The textbook by Sevastyanov is based on his two-semester course
at the Moscow State University. The material in its last four chapters covers
the minimum amount of probability and mathematical statistics required in
a one-year university program. In our text, perhaps to a greater extent than
in those mentioned above, a significant amount of space is given to set-
theoretic aspects and mathematical foundations of probability theory.
Exercises and problems are given in the books by Gnedenko and
Sevastyanov at the ends of chapters, and in the present textbook at the end
of each section. These, together with, for example, the problem sets by A. V.
Prokhorov and V. G. and N. G. Ushakov (Problems in Probability Theory,
Nauka, Moscow, 1986) and by Zubkov, Sevastyanov, and Chistyakov
(Collected Problems in Probability Theory, Nauka, Moscow, 1988) can be used by
readers for independent study, and by teachers as a basis for seminars for
students.
Special thanks to Harold Boas, who kindly translated the revisions from
Russian to English for this new edition.
Moscow
A. Shiryaev
Preface to the First Edition
This textbook is based on a three-semester course of lectures given by the
author in recent years in the Mechanics-Mathematics Faculty of Moscow
State University and issued, in part, in mimeographed form under the title
Probability, Statistics, Stochastic Processes, I, II by the Moscow State
University Press.
We follow tradition by devoting the first part of the course (roughly one
semester) to the elementary theory of probability (Chapter I). This begins
with the construction of probabilistic models with finitely many outcomes
and introduces such fundamental probabilistic concepts as sample spaces,
events, probability, independence, random variables, expectation,
correlation, conditional probabilities, and so on.
Many probabilistic and statistical regularities are effectively illustrated
even by the simplest random walk generated by Bernoulli trials. In this
connection we study both classical results (law of large numbers, local and
integral De Moivre and Laplace theorems) and more modern results (for
example, the arc sine law).
The first chapter concludes with a discussion of dependent random
variables generated by martingales and by Markov chains.
Chapters II-IV form an expanded version of the second part of the course
(second semester). Here we present (Chapter II) Kolmogorov's generally
accepted axiomatization of probability theory and the mathematical methods
that constitute the tools of modern probability theory (<r-algebras, measures
and their representations, the Lebesgue integral, random variables and
random elements, characteristic functions, conditional expectation with
respect to a <r-algebra, Gaussian systems, and so on). Note that two measure-
theoretical results—Caratheodory's theorem on the extension of measures
and the Radon-Nikodym theorem—are quoted without proof.
X
Preface to the First Edition
The third chapter is devoted to problems about weak convergence of
probability distributions and the method of characteristic functions for
proving limit theorems. We introduce the concepts of relative compactness
and tightness of families of probability distributions, and prove (for the
real line) Prohorov's theorem on the equivalence of these concepts.
The same part of the course discusses properties "with probability 1"
for sequences and sums of independent random variables (Chapter IV). We
give proofs of the "zero or one laws" of Kolmogorov and of Hewitt and
Savage, tests for the convergence of series, and conditions for the strong law
of large numbers. The law of the iterated logarithm is stated for arbitrary
sequences of independent identically distributed random variables with
finite second moments, and proved under the assumption that the variables
have Gaussian distributions.
Finally, the third part of the book (Chapters V-VIII) is devoted to random
processes with discrete parameters (random sequences). Chapters V and VI
are devoted to the theory of stationary random sequences, where "
stationary " is interpreted either in the strict or the wide sense. The theory of random
sequences that are stationary in the strict sense is based on the ideas of
ergodic theory: measure preserving transformations, ergodicity, mixing, etc.
We reproduce a simple proof (by A. Garsia) of the maximal ergodic theorem;
this also lets us give a simple proof of the BirkhofT-Khinchin ergodic theorem.
The discussion of sequences of random variables that are stationary in
the wide sense begins with a proof of the spectral representation of the
covariance fuction. Then we introduce orthogonal stochastic measures, and
integrals with respect to these, and establish the spectral representation of
the sequences themselves. We also discuss a number of statistical problems:
estimating the covariance function and the spectral density, extrapolation,
interpolation and filtering. The chapter includes material on the Kalman-
Bucy filter and its generalizations.
The seventh chapter discusses the basic results of the theory of martingales
and related ideas. This material has only rarely been included in traditional
courses in probability theory. In the last chapter, which is devoted to Markov
chains, the greatest attention is given to problems on the asymptotic behavior
of Markov chains with countably many states.
Each section ends with problems of various kinds: some of them ask for
proofs of statements made but not proved in the text, some consist of
propositions that will be used later, some are intended to give additional
information about the circle of ideas that is under discussion, and finally,
some are simple exercises.
In designing the course and preparing this text, the author has used a
variety of sources on probability theory. The Historical and Bibliographical
Notes indicate both the historial sources of the results and supplementary
references for the material under consideration.
The numbering system and form of references is the following. Each
section has its own enumeration of theorems, lemmas and formulas (with
Preface to the First Edition
XI
no indication of chapter or section). For a reference to a result from a
different section of the same chapter, we use double numbering, with the
first number indicating the number of the section (thus, (2.10) means formula
(10) of §2). For references to a different chapter we use triple numbering
(thus, formula (II.4.3) means formula (3) of §4 of Chapter II). Works listed
in the References at the end of the book have the form [L«], where L is a
letter and n is a numeral.
The author takes this opportunity to thank his teacher A. N. Kolmogorov,
and B. V. Gnedenko and Yu. V. Prokhorov, from whom he learned probability
theory and under whose direction he had the opportunity of using it. For
discussions and advice, the author also thanks his colleagues in the
Departments of Probability Theory and Mathematical Statistics at the Moscow
State University, and his colleagues in the Section on probability theory of the
Steklov Mathematical Institute of the Academy of Sciences of the U.S.S.R.
Moscow A. N. Shiryaev
Steklov Mathematical Institute
Translator's acknowledgement. I am grateful both to the author and to
my colleague, C. T. Ionescu Tulcea, for advice about terminology.
R. P. B.
Contents
Preface to the Second Edition vii
Preface to the First Edition ix
Introduction 1
CHAPTER I
Elementary Probability Theory 5
§1. Probabilistic Model of an Experiment with a Finite Number of
Outcomes 5
§2. Some Classical Models and Distributions 17
§3. Conditional Probability. Independence 23
§4. Random Variables and Their Properties 32
§5. The Bernoulli Scheme. I. The Law of Large Numbers 45
§6. The Bernoulli Scheme. II. Limit Theorems (Local,
De Moivre-Laplace, Poisson) 55
§7. Estimating the Probability of Success in the Bernoulli Scheme 70
§8. Conditional Probabilities and Mathematical Expectations with
Respect to Decompositions 76
§9. Random Walk. I. Probabilities of Ruin and Mean Duration in
Coin Tossing 83
§10. Random Walk. II. Reflection Principle. Arcsine Law 94
§11. Martingales. Some Applications to the Random Walk 103
§12. Markov Chains. Ergodic Theorem. Strong Markov Property 110
CHAPTER II
Mathematical Foundations of Probability Theory 131
§1. Probabilistic Model for an Experiment with Infinitely Many
Outcomes. Kolmogorov's Axioms 131
xiv Contents
§2. Algebras and a-algebras. Measurable Spaces 139
§3. Methods of Introducing Probability Measures on Measurable Spaces 151
§4. Random Variables. I. 170
§5. Random Elements 176
§6. Lebesgue Integral. Expectation 180
§7. Conditional Probabilities and Conditional Expectations with
Respect to a a-Algebra 212
§8. RandomVariables.il. 234
§9. Construction of a Process with Given Finite-Dimensional
Distribution 245
§10. Various Kinds of Convergence of Sequences of Random Variables 252
§11. The Hilbert Space of Random Variables with Finite Second Moment 262
§12. Characteristic Functions 274
§13. Gaussian Systems 297
CHAPTER III
Convergence of Probability Measures. Central Limit
Theorem 308
§1. Weak Convergence of Probability Measures and Distributions 308
§2. Relative Compactness and Tightness of Families of Probability
Distributions 317
§3. Proofs of Limit Theorems by the Method of Characteristic Functions 321
§4. Central Limit Theorem for Sums of Independent Random Variables. I.
The Lindeberg Condition 328
§5. Central Limit Theorem for Sums of Independent Random Variables. II.
Nonclassical Conditions 337
§6. Infinitely Divisible and Stable Distributions 341
§7. Metrizability of Weak Convergence 348
§8. On the Connection of Weak Convergence of Measures with Almost Sure
Convergence of Random Elements ("Method of a Single Probability
Space") 353
§9. The Distance in Variation between Probability Measures.
Kakutani-Hellinger Distance and Hellinger Integrals. Application to
Absolute Continuity and Singularity of Measures 359
§10. Contiguity and Entire Asymptotic Separation of Probability Measures 368
§11. Rapidity of Convergence in the Central Limit Theorem 373
§12. Rapidity of Convergence in Poisson's Theorem 376
CHAPTER IV
Sequences and Sums of Independent Random Variables 379
§1. Zero-or-One Laws 379
§2. Convergence of Series 384
§3. Strong Law of Large Numbers 388
§4. Law of the Iterated Logarithm 395
§5. Rapidity of Convergence in the Strong Law of Large Numbers and
in the Probabilities of Large Deviations 400
Contents
XV
CHAPTER V
Stationary (Strict Sense) Random Sequences and
Ergodic Theory 404
§1. Stationary (Strict Sense) Random Sequences. Measure-Preserving
Transformations 404
§2. Ergodicity and Mixing 407
§3. Ergodic Theorems 409
CHAPTER VI
Stationary (Wide Sense) Random Sequences. L2 Theory 415
§1. Spectral Representation of the Covariance Function 415
§2. Orthogonal Stochastic Measures and Stochastic Integrals 423
§3. Spectral Representation of Stationary (Wide Sense) Sequences 429
§4. Statistical Estimation of the Covariance Function and the Spectral
Density 440
§5. Wold's Expansion 446
§6. Extrapolation, Interpolation and Filtering 453
§7. The Kalman-Bucy Filter and Its Generalizations 464
CHAPTER VII
Sequences of Random Variables that Form Martingales 474
§1. Definitions of Martingales and Related Concepts 474
§2. Preservation of the Martingale Property Under Time Change at a
Random Time 484
§3. Fundamental Inequalities 492
§4. General Theorems on the Convergence of Submartingales and
Martingales 508
§5. Sets of Convergence of Submartingales and Martingales 515
§6. Absolute Continuity and Singularity of Probability Distributions 524
§7. Asymptotics of the Probability of the Outcome of a Random Walk
with Curvilinear Boundary 536
§8. Central Limit Theorem for Sums of Dependent Random Variables 541
§9. Discrete Version of Ito's Formula 554
§10. Applications to Calculations of the Probability of Ruin in Insurance 558
CHAPTER VIII
Sequences of Random Variables that Form Markov Chains 564
§1. Definitions and Basic Properties 564
§2. Classification of the States of a Markov Chain in Terms of
Arithmetic Properties of the Transition Probabilities p{$ 569
§3. Classification of the States of a Markov Chain in Terms of
Asymptotic Properties of the Probabilities p$ 573
§4. On the Existence of Limits and of Stationary Distributions 582
§5. Examples 587
Historical and Bibliographical Notes 596
References 603
Index of Symbols 609
Index 611
Introduction
The subject matter of probability theory is the mathematical analysis of
random events, i.e., of those empirical phenomena which—under certain
circumstance—can be described by saying that:
They do not have deterministic regularity (observations of them do not
yield the same outcome);
whereas at the same time
They possess some statistical regularity (indicated by the statistical
stability of their frequency).
We illustrate with the classical example of a "fair" toss of an "unbiased"
coin. It is clearly impossible to predict with certainty the outcome of each
toss. The results of successive experiments are very irregular (now "head,"
now "tail") and we seem to have no possibility of discovering any regularity
in such experiments. However, if we carry out a large number of
"independent" experiments with an "unbiased" coin we can observe a very definite
statistical regularity, namely that "head" appears with a frequency that is
"close" to^.
Statistical stability of a frequency is very likely to suggest a hypothesis
about a possible quantitative estimate of the "randomness" of some event A
connected with the results of the experiments. With this starting point,
probability theory postulates that corresponding to an event A there is a
definite number P(A), called the probability of the event, whose intrinsic
property is that as the number of "independent" trials (experiments)
increases the frequency of event A is approximated by P(A).
Applied to our example, this means that it is natural to assign the proba-
2
Introduction
bility \ to the event A that consists of obtaining "head" in a toss of an
"unbiased" coin.
There is no difficulty in multiplying examples in which it is very easy to
obtain numerical values intuitively for the probabilities of one or another
event. However, these examples are all of a similar nature and involve (so far)
undefined concepts such as "fair" toss, "unbiased" coin, "independence,"
etc.
Having been invented to investigate the quantitative aspects of
"randomness," probability theory, like every exact science, became such a science
only at the point when the concept of a probabilistic model had been clearly
formulated and axiomatized. In this connection it is natural for us to discuss,
although only briefly, the fundamental steps in the development of
probability theory.
Probability theory, as a science, originated in the middle of the seventeenth
century with Pascal (1623-1662), Fermat (1601-1655) and Huygens
(1629-1695). Although special calculations of probabilities in games of chance
had been made earlier, in the fifteenth and sixteenth centuries, by Italian
mathematicians (Cardano, Pacioli, Tartaglia, etc.), the first general methods
for solving such problems were apparently given in the famous
correspondence between Pascal and Fermat, begun in 1654, and in the first book on
probability theory, De Ratiociniis in Aleae Ludo (On Calculations in Games of
Chance), published by Huygens in 1657. It was at this time that the
fundamental concept of "mathematical expectation" was developed and theorems
on the addition and multiplication of probabilities were established.
The real history of probability theory begins with the work of James
Bernoulli (1654-1705), Ars Conjectandi (The Art of Guessing) published in
1713, in which he proved (quite rigorously) the first limit theorem of
probability theory, the law of large numbers; and of De Moivre (1667-1754),
Miscellanea Analytica Supplementum (a rough translation might be The
Analytic Method or Analytic Miscellany, 1730), in which the central limit
theorem was stated and proved for the first time (for symmetric Bernoulli
trials).
Bernoulli deserves the credit for introducing the "classical" definition of
the concept of the probability of an event as the ratio of the number of
possible outcomes of an experiment, that are favorable to the event, to the
number of possible outcomes.
Bernoulli was probably the first to realize the importance of considering
infinite sequences of random trials and to make a clear distinction between
the probability of an event and the frequency of its realization.
De Moivre deserves the credit for defining such concepts as independence,
mathematical expectation, and conditional probability.
In 1812 there appeared Laplace's (1749-1827) great treatise Theorie
Analytique des Probabilities (Analytic Theory of Probability) in which he
presented his own results in probability theory as well as those of his
predecessors. In particular, he generalized De Moivre's theorem to the general
Introduction
3
(unsymmetric) case of Bernoulli trials, and at the same time presented De
Moivre's results in a more complete form.
Laplace's most important contribution was the application of
probabilistic methods to errors of observation. He formulated the idea of
considering errors of observation as the cumulative results of adding a large number
of independent elementary errors. From this it followed that under rather
general conditions the distribution of errors of observation must be at least
approximately normal.
The work of Poisson (1781-1840) and Gauss (1777-1855) belongs to the
same epoch in the development of probability theory, when the center of the
stage was held by limit theorems. In contemporary probability theory we
think of Poisson in connection with the distribution and the process that
bear his name. Gauss is credited with originating the theory of errors and, in
particular, with creating the fundamental method of least squares.
The next important period in the development of probability theory is
connected with the names of P. L. Chebyshev (1821-1894), A. A. Markov
(1856-1922), and A. M. Lyapunov (1857-1918), who developed effective
methods for proving limit theorems for sums of independent but arbitrarily
distributed random variables.
The number of Chebyshev's publications in probability theory is not
large—four in all—but it would be hard to overestimate their role in
probability theory and in the development of the classical Russian school of that
subject.
"On the methodological side, the revolution brought about by Chebyshev
was not only his insistence for the first time on complete rigor in the proofs of
limit theorems,... but also, and principally, that Chebyshev always tried to
obtain precise estimates for the deviations from the limiting regularities that are
available for large but finite numbers of trials, in the form of inequalities that are
valid unconditionally for any number of trials."
(A. N. Kolmogorov [30])
Before Chebyshev the main interest in probability theory had been in the
calculation of the probabilities of random events. He, however, was the
first to realize clearly and exploit the full strength of the concepts of random
variables and their mathematical expectations.
The leading exponent of Chebyshev's ideas was his devoted student
Markov, to whom there belongs the indisputable credit of presenting his
teacher's results with complete clarity. Among Markov's own significant
contributions to probability theory were his pioneering investigations of
limit theorems for sums of independent random variables and the creation
of a new branch of probability theory, the theory of dependent random
variables that form what we now call a Markov chain.
"Markov's classical course in the calculus of probability and his original
papers, which are models of precision and clarity, contributed to the greatest
extent to the transformation of probability theory into one of the most significant
4
Introduction
branches of mathematics and to a wide extension of the ideas and methods of
Chebyshev."
(S. N. Bernstein [3])
To prove the central limit theorem of probability theory (the theorem
on convergence to the normal distribution), Chebyshev and Markov used
what is known as the method of moments. With more general hypotheses
and a simpler method, the method of characteristic functions, the theorem
was obtained by Lyapunov. The subsequent development of the theory has
shown that the method of characteristic functions is a powerful analytic
tool for establishing the most diverse limit theorems.
The modern period in the development of probability theory begins with
its axiomatization. The first work in this direction was done by S. N.
Bernstein (1880-1968), R. von Mises (1883-1953), and E. Borel (1871-1956).
A. N. Kolmogorov's book Foundations of the Theory of Probability appeared
in 1933. Here he presented the axiomatic theory that has become generally
accepted and is not only applicable to all the classical branches of probability
theory, but also provides a firm foundation for the development of new
branches that have arisen from questions in the sciences and involve infinite-
dimensional distributions.
The treatment in the present book is based on Kolmogorov's axiomatic
approach. However, to prevent formalities and logical subtleties from
obscuring the intuitive ideas, our exposition begins with the elementary theory of
probability, whose elementariness is merely that in the corresponding
probabilistic models we consider only experiments with finitely many
outcomes. Thereafter we present the foundations of probability theory in their
most general form.
The 1920s and '30s saw a rapid development of one of the new branches of
probability theory, the theory of stochastic processes, which studies families
of random variables that evolve with time. We have seen the creation of
theories of Markov processes, stationary processes, martingales, and limit
theorems for stochastic processes. Information theory is a recent addition.
The present book is principally concerned with stochastic processes with
discrete parameters: random sequences. However, the material presented
in the second chapter provides a solid foundation (particularly of a logical
nature) for the-study of the general theory of stochastic processes.
It was also in the 1920s and '30s that mathematical statistics became a
separate mathematical discipline. In a certain sense mathematical statistics
deals with inverses of the problems of probability: If the basic aim of
probability theory is to calculate the probabilities of complicated events under a
given probabilistic model, mathematical statistics sets itself the inverse
problem: to clarify the structure of probabilistic-statistical models by
means of observations of various complicated events.
Some of the problems and methods of mathematical statistics are also
discussed in this book. However, all that is presented in detail here is
probability theory and the theory of stochastic processes with discrete parameters.
CHAPTER I
Elementary Probability Theory
§1. Probabilistic Model of an Experiment with a
Finite Number of Outcomes
1. Let us consider an experiment of which all possible results are included
in a finite number of outcomes col9 ...,coN. We do not need to know the
nature of these outcomes, only that there are a finite number N of them.
We call colt ...,coN elementary events, or sample points, and the finite set
Q = {cox,..., coN},
the space of elementary events or the sample space.
The choice of the space of elementary events is theirs* step in formulating
a probabilistic model for an experiment. Let us consider some examples of
sample spaces.
Example 1. For a single toss of a coin the sample space Q consists of two
points:
ft = (H, T},
where H = "head" and T = "tail". (We exclude possibilities like "the coin
stands on edge," "the coin disappears," etc.)
Example 2. For n tosses of a coin the sample space is
Q = {co: co = (alt..., an), a{ = H or T}
and the general number N(Q) of outcomes is 2".
6
I. Elementary Probability Theory
Example 3. First toss a coin. If it falls "head" then toss a die (with six faces
numbered 1, 2, 3, 4, 5, 6); if it falls "tail", toss the coin again. The sample
space for this experiment is
Q = {HI, H2, H3, H4, H5, H6, TH, TT}.
We now consider some more complicated examples involving the
selection of n balls from an urn containing M distinguishable balls.
2. Example 4 (Sampling with replacement). This is an experiment in which
after each step the selected ball is returned again. In this case each sample of
n balls can be presented in the form (^,..., an), where at is the label of the
ball selected at the ith step. It is clear that in sampling with replacement
each at can have any of the M values 1, 2,..., M. The description of the
sample space depends in an essential way on whether we consider samples
like, for example, (4, 1, 2, 1) and (1, 4, 2, 1) as different or the same. It is
customary to distinguish two cases: ordered samples and unordered samples.
In the first case samples containing the same elements, but arranged
differently, are considered to be different. In the second case the order of
the elements is disregarded and the two samples are considered to be the
same. To emphasize which kind of sample we are considering, we use the
notation (^,..., a„) for ordered samples and [alt..., aj for unordered
samples.
Thus for ordered samples the sample space has the form
Q = {co: co = (fllf..., an), a{ = 1,..., M]
and the number of (different) outcomes is
N(Q) = Mn. (1)
If, however, we consider unordered samples, then
Q = {co: co = [fllf..., «„], a,- = 1,..., M}.
Clearly the number JV(Q) of (different) unordered samples is smaller than
the number of ordered samples. Let us show that in the present case
^(«) = CJ[f+II_l> (2)
where Clk = k\/\l\ (k - /)!] is the number of combinations of I elements,
'taken/cat a time.
We prove this by induction. Let N(M, n) be the number of outcomes of
interest. It is clear that when k < M we have
N(k, 1) = k = CI
§1. Probabilistic Model of an Experiment with a Finite Number of Outcomes 7
Now suppose that N(k, n) = C£+n_ x for k < M; we show that this formula
continues to hold when n is replaced by n + 1. For the unordered samples
[>!,..., an+1] that we are considering, we may suppose that the elements
are arranged in nondecreasing order: ax < a2 < • • • < an. It is clear that the
number of unordered samples with ax = 1 is N(M, n), the number with
ax = 2 is N(M — 1, n), etc. Consequently
N(M, n + 1) = N(M, n) + N(M - 1, n) + • • • + N(l, n)
= Cm+„_ x + CM_ i +„-i + • • • C„
— V^M + n ~" ^M + n-l) + ^M-l+n ~~ ^M-l+n-1^
+ ••■ + («:} -CD = C3,++\,;
here we have used the easily verified property
of the binomial coefficients.
Example 5 (Sampling without replacement). Suppose that n < M and that
the selected balls are not returned. In this case we again consider two
possibilities, namely ordered and unordered samples.
For ordered samples without replacement the sample space is
Q = {co: co = (fllf..., an), ak ^ au k ^ f, at = 1,..., M},
and the number of elements of this set (called permutations) is M(M — 1)---
(M — n + 1). We denote this by (M)„ or 4M and call it "the number of
permutations of M things, n at a time").
For unordered samples (called combinations) the sample space
Q = {co: co = [fllt..., flj, ak ^ ^, fc ^ f, 0i = 1,..., M}
consists of
JV(Q) = CM (3)
elements. In fact, from each unordered sample [alt...,aj consisting of
distinct elements we can obtain n! ordered samples. Consequently
N(Q)-n! = (M)n
and therefore
The results on the numbers of samples of n from an urn with M balls are
presented in Table 1.
8
I. Elementary Probability Theory
Table 1
M"
(M)„
Ordered
Cm + ii-1
cnM
Unordered
With
replacement
Without
replacement
^^^^.^^ Sample
Type ^\^^
For the case M = 3 and n = 2, the corresponding sample spaces are
displayed in Table 2.
Example 6 (Distribution of objects in cells). We consider the structure of
the sample space in the problem of placing n objects (balls, etc.) in M cells
(boxes, etc.). For example, such problems arise in statistical physics in
studying the distribution of n particles (which might be protons, electrons,...)
among M states (which might be energy levels).
Let the cells be numbered 1, 2,..., M, and suppose first that the objects
are distinguishable (numbered 1, 2,..., ri). Then a distribution of the n
objects among the M cells is completely described by an ordered set
(«!,..., an), where at is the index of the cell containing object i. However,
if the objects are indistinguishable their distribution among the M cells
is completely determined by the unordered set [al9...9 a„], where at is the
index of the cell into which an object is put at the ith step.
Comparing this situation with Examples 4 and 5, we have the following
correspondences:
(ordered samples) «-► (distinguishable objects),
(unordered samples) «-► (indistinguishable objects),
Table 2
(1,1) (1.2) (1,3)
(2,1) (2,2) (2,3)
(3,1) (3,2) (3,3)
(1,2) (1,3)
(2,1) (2,3)
(3,1) (3,2)
Ordered
[1,1] [2,2] [3,3]
[1,2] [1,3]
[2,3]
[1,2] [1,3]
[2,3]
Unordered
1 With
replacement i
Without
replacement
^^\SampIe
Type ^"\l
§1. Probabilistic Model of an Experiment with a Finite Number of Outcomes 9
by which we mean that to an instance of an ordered (unordered) sample of
n balls from an urn containing M balls there corresponds (one and only one)
instance of distributing n distinguishable (indistinguishable) objects among
M cells.
In a similar sense we have the following correspondences:
(sampling with replacement) *
(a cell may receive any number\
of objects I'
(sampling without replacement)
These correspondences generate others of the same kind
a cell may receive at most\
one object )
/an unordered sample in\
sampling without
replacement
(indistinguishable objects in the
problem of distribution among cells
when each cell may receive at
most one object
etc.; so that we can use Examples 4 and 5 to describe the sample space for
the problem of distributing distinguishable or indistinguishable objects
among cells either with exclusion (a cell may receive at most one object) or
without exclusion (a cell may receive any number of objects).
Table 3 displays the distributions of two objects among three cells. For
distinguishable objects, we denote them by W (white) and B (black). For
indistinguishable objects, the presence of an object in a cell is indicated
by a +.
Table 3
IWIBI I ||W|B| ||W| |B|
|b|w| II |w|b| II |w|b|
|B| |W|| IBIWM | |W|B|
IW|B| I |WJ L1J
|B|W| | 1 |W|B|
LbJ |W| 1 |B|W|
Distinguishable
objects
i + + i i ii i+ + i II
l+l+l 1 i+i
1 l±l±J
1 I+ + I
J±J
l+l+l I 1+J_
1 l+l+J
J+J
Indistinguishable
objects
w c
3 O
C "K
if
^sDistribu-
KindXtion
ofobjeclsN^^
10 I. Elementary Probability Theory
Table 4
N(Q) in the problem of placing n objects in M cells
^\^^ Kind of
^s"\v^^ objects
Distribution ^"\^^
Without exclusion
With exclusion
Distinguishable
objects
(Maxwell-
Boltzmann
statistics)
(M)„
Ordered
samples
Indistinguishable
objects
(Bose-
Einstein
statistics)
(Fermi-Dirac
statistics)
Unordered
samples
With
replacement
Without
replacement
^\Sample
Type^\|
N(Q) in the problem of choosing n balls from an urn
containing M balls
The duality that we have observed between the two problems gives us
an obvious way of finding the number of outcomes in the problem of placing
objects in cells. The results, which include the results in Table 1, are given in
Table 4.
In statistical physics one says that distinguishable (or indistinguishable,
respectively) particles that are not subject to the Pauli exclusion principle!
obey Maxwell-Boltzmann statistics (or, respectively, Bose-Einstein
statistics). If, however, the particles are indistinguishable and are subject to the
exclusion principle, they obey Fermi-Dirac statistics (see Table 4). For
example, electrons, protons and neutrons obey Fermi-Dirac statistics.
Photons and pions obey Bose-Einstein statistics. Distinguishable particles
that are subject to the exclusion principle do not occur in physics.
3. In addition to the concept of sample space we now need the fundamental
concept of event.
Experimenters are ordinarily interested, not in what particular outcome
occurs as the result of a trial, but in whether the outcome belongs to some
subset of the set of all possible outcomes. We shall describe as events all
subsets A c Q for which, under the conditions of the experiment, it is possible
to say either "the outcome co e A" or "the outcome co ¢ A"
t At most one particle in each cell. (Translator)
§1. Probabilistic Model of an Experiment with a Finite Number of Outcomes 11
For example, let a coin be tossed three times. The sample space Q consists
of the eight points
Q = {HHH, HHT,..., TTT}
and if we are able to observe (determine, measure, etc.) the results of all three
tosses, we say that the set
A = {HHH, HHT, HTH, THH}
is the event consisting of the appearance of at least two heads. If, however,
we can determine only the result of the first toss, this set A cannot be
considered to be an event, since there is no way to give either a positive or negative
answer to the question of whether a specific outcome co belongs to A.
Starting from a given collection of sets that are events, we can form new
events by means of statements containing the logical connectives "or,"
"and," and "not," which correspond in the language of set theory to the
operations "union," "intersection," and "complement."
If A and B are sets, their union, denoted by A u B, is the set of points that
belong either to A or to B:
A u B = {co e Q: co e A or co e B}.
In the language of probability theory, A u B is the event consisting of the
realization either of A or of B.
The intersection of A and Bt denoted by A n B, or by AB, is the set of
points that belong to both A and B:
A nB = {coeQ:coeA and coeB}.
The event A nB consists of the simultaneous realization of both A and B.
For example, if A = {HH, HT, TH} and B = {TT, TH, HT} then
AvB = {HH, HT, TH, TT} (= Q),
AnB = {TH, HT}.
If A is a subset of Q, its complement, denoted by A, is the set of points of
Q that do not belong to A.
If B\A denotes the difference of B and A (i.e. the set of points that belong
to B but not to A) then A = C1\A. In the language of probability, A is
the event consisting of the nonrealization of A. For example, if A =
{HH, HT, TH} then A = {TT}, the event in which two successive tails occur.
The sets A and A have no points in common and consequently A n A is
empty. We denote the empty set by 0. In probability theory, 0 is called an
impossible event. The set Q is naturally called the certain event.
When A and B are disjoint (AB = 0), the union A u B is called the
sum of A and B and written A + B.
If we consider a collection jtf0 of sets AgQwe may use the set-theoretic
operators u, n and \ to form a new collection of sets from the elements of
12
I. Elementary Probability Theory
s#0; these sets are again events. If we adjoin the certain and impossible
events Q and 0we obtain a collection $£ of sets which is an algebra, i.e. a
collection of subsets of Q for which
(1) Qe^,
(2) if A e sd, B e sd, the sets AvB,AnB,A\B also belong to s0.
It follows from what we have said that it will be advisable to consider
collections of events that form algebras. In the future we shall consider only
such collections.
Here are some examples of algebras of events:
(a) {Q, 0}, the collection consisting of Q and the empty set (we call this the
trivial algebra);
(b) {A, A, Q, 0}, the collection generated by A;
(c)i={/4:/4g Q}, the collection consisting of all the subsets of Q
(including the empty set 0).
It is easy to check that all these algebras of events can be obtained from the
following principle.
We say that a collection
® = {Dlt...,Dm}
of sets is a decomposition of Q, and call the Dt the atoms of the decomposition,
if the Dt are not empty, are pairwise disjoint, and their sum is Q:
Dt + ■ • • + D„ = Q.
For example, if Q consists of three points, Q = {1, 2, 3}, there are five
different decompositions:
Qx = {DJ with D1 = {1,2,3};
Q)2 = {/>!, D2} with £>! = {1, 2}, D2 = {3};
<2>3 = {Dlt D2} with Dx = {1, 3}, D2 = {2};
04 = {£>!, D2) with £>! = {2, 3}, D2 = {1};
^5 = {^i, D2, D3} with D, = {1}, D2 = {2}, D3 = {3}.
(For the general number of decompositions of a finite set see Problem 2.)
If we consider all unions of the sets in Q), the resulting collection of sets,
together with the empty set, forms an algebra, called the algebra induced by
3), and denoted by ol(3>). Thus the elements of ol(2>) consist of the empty set
together with the sums of sets which are atoms of 3).
Thus if Q) is a decomposition, there is associated with it a specific algebra
@ = ol(3)).
The converse is also true. Let ^ be an algebra of subsets of a finite space
Q. Then there is a unique decomposition 3) whose atoms are the elements of
§1. Probabilistic Model of an Experiment with a Finite Number of Outcomes 13
^, with @ = a(2>). In fact, let De@ and let D have the property that for
every B e 38 the set D n B either coincides with D or is empty. Then this
collection of sets D forms a decomposition 3) with the required property
a(^) = <%. In Example (a), Q) is the trivial decomposition consisting of the
single set Dx = Q; in (b), Q) = {/1, A}. The most fine-grained decomposition
3), which consists of the singletons {co,-}, w.efi, induces the algebra in
Example (c), i.e. the algebra of all subsets of Q.
Let ®x and S)2 be two decompositions. We say that 3)2 1S finer than ^i>
and write ^,<^2, if ol(3>x) c <x(^2).
Let us show that if Q consists, as we assumed above, of a finite number of
points (ol9..., coN, then the number N(sf) of sets in the collection sf is
equal to 2N. In fact, every nonempty set A e s# can be represented as A =
{co£l,..., coik}, where co£j. e Q, 1 < fr < N. With this set we associate the
sequence of zeros and ones
(0,...,0,1,0,...,0,1,...),
where there are ones in the positions iit...,ik and zeros elsewhere. Then
for a given k the number of different sets A of the form {cotl,..., coik} is the
same as the number of ways in which k ones (k indistinguishable objects)
can be placed in N positions (N cells). According to Table 4 (see the lower
right-hand square) we see that this number is CkN. Hence (counting the empty
set) we find that
JV(jO= 1 + Ch + •■• + CJJ = (1 + If = 2N.
4. We have now taken the first two steps in defining a probabilistic model
of an experiment with a finite number of outcomes: we have selected a sample
space and a collection $0 of subsets, which form an algebra and are called
events. We now take the next step, to assign to each sample point (outcome)
co.eQ,-, i = 1,..., N, a weight. This is denoted by p((Oi) and called the
probability of the outcome Wj; we assume that it has the following properties:
(a) 0 < p(cOi) < 1 (nonnegativity),
(b) pio^x) + • • • + p(coN) = 1 (normalization).
Starting from the given probabilities p(cOi) of the outcomes C0j, we define
the probability P(A) of any event A e s# by
P(A)= X pica,). (4)
U-.OieA]
Finally, we say that a triple
(Q, sf, P),
where Q = {a)u ..., coN}t s? is an algebra of subsets of O and
P = {P(A);Aetf}
14
I. Elementary Probability Theory
defines (or assigns) a probabilistic model, or a probability space, of experiments
with a (finite) space Q of outcomes and algebra s# of events.
The following properties of probability follow from (4):
P(0) = 0, (5)
P(0) = 1, (6)
P(A uB) = P(A) + P(B) - P(A n B). (7)
In particular, if A n B = 0, then
P(A + B) = P(A) + P(B) (8)
and
P(I) = 1 - P04). (9)
5. In constructing a probabilistic model for a specific situation, the
construction of the sample space Q and the algebra sf of events are ordinarily
not difficult. In elementary probability theory one usually takes the algebra
s# to be the algebra of all subsets of Q. Any difficulty that may arise is in
assigning probabilities to the sample points. In principle, the solution to this
problem lies outside the domain of probability theory, and we shall not
consider it in detail. We consider that our fundamental problem is not the
question of how to assign probabilities, but how to calculate the
probabilities of complicated events (elements of &#) from the probabilities of the
sample points.
It is clear from a mathematical point of view that for finite sample spaces
we can obtain all conceivable (finite) probability spaces by assigning non-
negative numbers pl9..., pN, satisfying the condition py + • • ■ + pN = 1, to
the outcomes au ..., coN.
The validity of the assignments of the numbers px,...,pN can, in specific
cases, be checked to a certain extent by using the law of large numbers
(which will be discussed later on). It states that in a long series of
"independent" experiments, carried out under identical conditions, the frequencies
with which the elementary events appear are "close" to their probabilities.
In connection with the difficulty of assigning probabilities to outcomes,
we note that there are many actual situations in which for reasons of
symmetry it seems reasonable to consider all conceivable outcomes as equally
probable. In such cases, if the sample space consists of points cou . .tcoN,
with N < oo, we put
p(Wl) = ... = p(coN) = 1/JV,
and consequently
P(A) = N(A)/N (10)
§1. Probabilistic Model of an Experiment with a Finite Number of Outcomes 15
for every event Aesf, where N(A) is the number of sample points in A.
This is called the classical method of assigning probabilities. It is clear that
in this case the calculation of P(A) reduces to calculating the number of
outcomes belonging to A. This is usually done by combinatorial methods,
so that combinatorics, applied to finite sets, plays a significant role in the
calculus of probabilities.
Example 7 (Coincidence problem). Let an urn contain M balls numbered
1, 2,..., M. We draw an ordered sample of size n with replacement. It is
clear that then
Q = {co:co = (fli,..., an), a{ = 1,:.., M}
and N(Q) = M". Using the classical assignment of probabilities, we consider
the M" outcomes equally probable and ask for the probability of the event
A = {co: co = (alt..., an\ at ^ ajt i ^ ;},
i.e., the event in which there is no repetition. Clearly N(A) = M(M — 1)---
(M — n + 1), and therefore
This problem has the following striking interpretation. Suppose that
there are n students in a class. Let us suppose that each student's birthday
is on one of 365 days and that all days are equally probable. The question
is, what is the probability Pn that there are at least two students in the class
whose birthdays coincide? If we interpret selection of birthdays as selection
of balls from an urn containing 365 balls, then by (11)
The following table lists the values of Pn for some values of n:
n
Pn
4
0.016
16
0.284
22
0.476
23
0.507
40
0.891
64
0.997
It is interesting to note that (unexpectedly!) the size of class in which there
is probability \ of finding at least two students with the same birthday is not
very large: only 23.
Example 8 (Prizes in a lottery). Consider a lottery that is run in the following
way. There are M tickets numbered 1, 2,..., M, of which n, numbered
1,..., n, win prizes (M > 2n). You buy n tickets, and ask for the probability
(P, say) of winning at least one prize.
16
I. Elementary Probability Theory
Since the order in which the tickets are drawn plays no role in the presence
or absence of winners in your purchase, we may suppose that the sample space
has the form
Q = {co: co = [al9..., a„], ak ^ alt k ^ /, ax = 1,..., M}.
By Table 1, N(Q) = CnM. Now let
A0 = {co:co = [alt..., aj, ak ^ at,k ^ 1,^ = n + 1 M}
be the event that there is no winner in the set of tickets you bought. Again
by Table 1, N(A0) = CnM_„. Therefore
P(^o)" Ct (M)n
\ M/\ M-l/ \ M-n + l)
and consequently
'■^■'-('-^-inl-firiTT)-
If M = h2 and n -> oo, then P(A0) -> e~l and
P_l _g-i «0.632.
The convergence is quite fast: for n = 10 the probability is already P = 0.670.
6. Problems
1. Establish the following properties of the operators n and u:
A \j B = B yj A, AB = BA (commutativity),
A u (B u C) = (A u B) u C, /1(#C) = (4£)C (associativity),
/l(# uQ^Bu/lC, /I u (BC) = U u #)(/1 u C) (distributivity),
Au A = A, AA = A (idempotency).
Show also that
AZTB = AnB, AB = AkjB.
2. Let Q contain N elements. Show that the number d(N) of different decompositions of
Q, is given by the formula
oo UN
WO-T'Itj. (12)
*=0 K-
(Hint: Show that
N-l
d(N) = £ C^_! d(k), where d(0) = 1,
*=o
and then verify that the series in (12) satisfies the same recurrence relation.)
§2. Some Classical Models and Distributions
17
3. For any finite collection of sets Alt...,Ant
P(AX u ■•■ u An) < P(At) + ■•■ + P(/t„).
4. Let A and B be events. Show that AB u BA is the event in which exactly one of A
and B occurs. Moreover,
P(AB u BA) = PU) + P(#) - 2P(/I#).
5. Let Al9..., i4„ be events, and define S0tSlt...,Sn as follows: S0 = 1,
Sr = I PUkl n ■•• n Akr), 1 < r < «,
where the sum is over the unordered subsets Jr = [kl9..., fcr] of {1,...,«}.
Let Bm be the event in which each of the events Alt ...tAn occurs exactly m times.
Show that
P(iU= £(-irmq"Sr.
In particular, for m = 0
P(B0)=l-S1+S2 ±Sn.
Show also that the probability that at least m of the events AX,...,A„ occur
simultaneously is
P(fl,) + — + P(Bn) = £ (-D'-Cri1^.
In particular, the probability that at least one of the events Alt..., A„ occurs is
HBi) + ••• + P(£„) = St - S2 + .-■ ± Sn.
§2. Some Classical Models and Distributions
1. Binomial distribution. Let a coin be tossed n times and record the results
as an ordered set (alt..., an), where at = 1 for a head ("success") and at = 0
for a tail ("failure"). The sample space is
Q = {co: co = (alt..., an), a( = 0, 1}.
To each sample point co = (alt..., an) we assign the probability
p(co) = p*y "z<\
where the nonnegative numbers p and g satisfy p + q = 1. In the first place,
we verify that this assignment of the weights p{co) is consistent. It is enough
to show that Y^oen P(w) = 1-
We consider all outcomes co = (al,...,an) for which £,-a{ = k, where
k = 0, 1,..., n. According to Table 4 (distribution of k indistinguishable
18
I. Elementary Probability Theory
ones in n places) the number of these outcomes is Ckn. Therefore
I Pico) = t CknPkqn~k = (p + qf = 1.
weCl fc = 0
Thus the space Q together with the collection &0 of all its subsets and the
probabilities P(/l) = Y,*>eA P(w)» A est, defines a probabilistic model. It is
natural to call this the probabilistic model for n tosses of a coin.
In the case n = 1, when the sample space contains just the two points
co = 1 ("success") and co = 0 ("failure"), it is natural to call p(l) = p the
probability of success. We shall see later that this model for n tosses of a
coin can be thought of as the result of n "independent" experiments with
probability p of success at each trial.
Let us consider the events
Ak = {co:co = (al9..., an), ax -\ + an = k}, k = 0, 1,..., n,
consisting of exactly k successes. It follows from what we said above that
P(Ak) = C*pY-\ (1)
and SU0 P04k) = 1.
The set of probabilities (P(/40),..., P(An)) is called the binomial
distribution (the number of successes in a sample of size n). This distribution plays an
extremely important role in probability theory since it arises in the most
diverse probabilistic models. We write Pn(k) = P(Ak), k = 0, 1,..., n.
Figure 1 shows the binomial distribution in the case p = \ (symmetric coin)
for « = 5,10,20.
We now present a different model (in essence, equivalent to the preceding
one) which describes the random walk of a "particle."
Let the particle start at the origin, and after unit time let it take a unit
step upward or downward (Figure 2).
Consequently after n steps the particle can have moved at most n units
up or n units down. It is clear that each path co of the particle is completely
specified by a set (a u ..., an), where at— +1 if the particle moves up at the
ith step, and a:— — 1 if it moves down. Let us assign to each path co the
weight p(co) = pvMqn-v«»)i where v(co) is the number of + Ts in the sequence
co = (a!,..., an), i.e. v(co) = [(^ H + an) + n]/2, and the nonnegative
numbers p and q satisfy p + q = 1.
Since ^,,^ p(co) = 1, the set of probabilities p(co) together with the space
Q of paths co = (alt ...,an) and its subsets define an acceptable probabilistic
model of the motion of the particle for n steps.
Let us ask the following question: What is the probability of the event Ak
that after n steps the particle is at a point with ordinate /c? This condition
is satisfied by those paths co for which v(co) — (n — v(co)) = k, i.e.
v(co) = -^-.
§2. Some Classical Models and Distributions
19
0.3 4-
0.2-1-
o.i 4-
0.3 4-
0.2 4-
0.1 4-
0 12 3 4 5
23456789 10
0.3 4-
0.2 4-
0.1 4-
/7 = 20
012345678910 20
Figure 1. Graph of the binomial probabilities P„(k) for n = 5,10,20.
The number of such paths (see Table 4) is C5,n+kl/2, and therefore
P(Ak) = c\?+kV2pln+kV2qln~kV2.
Consequently the binomial distribution (P(/4_n),..., P(/40),..., P(An))
can be said to describe the probability distribution for the position of the
particle after n steps.
Note that in the symmetric case (p = q = |) when the probabilities of
the individual paths are equal to 2"",
P(Ak) = C[n"+kl/2 • 2"".
Let us investigate the asymptotic behavior of these probabilities for large n.
If the number of steps is 2n, it follows from the properties of the binomial
coefficients that the largest of the probabilities P(Ak), \k\ < 2n, is
PUo) = Qn.2"2".
Figure 2
20
I. Elementary Probability Theory
-4-3-2-10 1 2 3 4
Figure 3. Beginning of the binomial distribution.
From Stirling's formula (see formula (6) in Section 4)
Consequently
2n"W
and therefore for large n
P04o) ~
y/nn
y/mi
Figure 3 represents the beginning of the binomial distribution for 2n
steps of a random walk (in contrast to Figure 2, the time axis is now directed
upward).
2. Multinomial distribution. Generalizing the preceding model, we now
suppose that the sample space is
Q = {co: co = («1,..., an)t at = blt..., br}t
where bu ..., br are given numbers. Let v^co) be the number of elements of
co = (alt..., an) that are equal to 6,-, i = 1,..., r9 and define the probability
of co by
where p( > 0 and px + h pr = 1. Note that
£p(co)= £ Cn(Ml,...,«r)p^--.p^,
<oeSl rni^,..., nP^0,i
l»U + ••• +nP = n >
where CB(ni,..., nr) is the number of (ordered) sequences (alt...,an) in
which by occurs nx times, ...,br occurs nr times. Since nl elements bt can
t The notation/(«) ~ g(n) means that f(n)/g(n) -> 1 as n -> oo.
§2. Some Classical Models and Distributions
21
be distributed into n positions in CJj1 ways; n2 elements b2 into n — nx
positions in CJj2_ni ways, etc., we have
Cn(nlt..., nr) = Cn"' • CnnLni ■ • • Cn"i(ni+... +np_l}
_ M-' (w — Hi)!
~~ ny! (n - ^i)! n2 ! (n - ^ - n2)!
n!
Therefore
£ P(w) = I „ |W'M|P"'"-P"r = (Pi + '•• + Pr)" = 1,
coett rn!^0 nP^O.i "1 • " * nr •
lni+ —+nP = n /
and consequently we have defined an acceptable method of assigning
probabilities.
Let
K, „P = {co: v^co) = nlf..., v£co) = «r}.
Then
P(4.t J = «11!,.... «r)p?« • • • Pr"-. (2)
The set of probabilities
{P(^1....,J}
is called the multinomial (or polynomial) distribution.
We emphasize that both this distribution and its special case, the binomial
distribution, originate from problems about sampling with replacement.
3. The multidimensional hypergeometric distribution occurs in problems that
involve sampling without replacement.
Consider, for example, an urn containing M balls numbered 1, 2,..., M,
where Mt balls have the color bl9 ..., Mr balls have the color br, and
Mj + • ■ • + Mr = M. Suppose that we draw a sample of size n < M without
replacement. The sample space is
D = {co: co = (au ..., an), ak^ at,k ^ lta: = 1,..., M}
and N(Q) = (M)„. Let us suppose that the sample points are equiprobable,
and find the probability of the event Bni „r in which nx balls have
color blt..., nr balls have color br, where nx + • • • + nr = n. It is easy to
show that
**(Bmi J = CJLnl9.... nr){My)ni • • • (Mr)„p,
22
I. Elementary Probability Theory
and therefore
p(B , N(Bni ,,)¾...¾ (3)
K ni nr) N(Q) CnM
The set of probabilities {P(Bm np)} is called the multidimensional
hypergeometric distribution. When r = 2 it is simply called the hypergeometric
distribution because its "generating function" is a hypergeometric function.
The structure of the multidimensional hypergeometric distribution is
rather complicated. For example, the probability
P(£n,„2) =-^7^, nx+n2 = n, Ml+M2 = M, (4)
contains nine factorials. However, it is easily established that if M -» oo
and Mj -> oo in such a way that MJM -» p (and therefore M2/M -> 1 — p)
then
p(5ni>n2)-Q+n2p^(i-pr. (5)
In other words, under the present hypotheses the hypergeometric
distribution is approximated by the binomial; this is intuitively clear since
when M and Mx are large (but finite), sampling without replacement ought
to give almost the same result as sampling with replacement.
Example. Let us use (4) to find the probability of picking six "lucky"
numbers in a lottery of the following kind (this is an abstract formulation of the
"sportloto," which is well known in Russia):
There are 49 balls numbered from 1 to 49; six of them are lucky (colored
red, say, whereas the rest are white). We draw a sample of six balls, without
replacement. The question is, What is the probability that all six of these
balls are lucky? Taking M = 49, Mx = 6, nx = 6, n2 = 0, we see that the
event of interest, namely
B6 o = {6 balls, all lucky}
has, by (4), probability
PtfW = t4~ * 7.2 x lO"8.
4. The numbers n! increase extremely rapidly with n. For example,
10! = 3,628,800,
15! = 1,307,674,368,000,
and 100! has 158 digits. Hence from either the theoretical or the
computational point of view, it is important to know Stirling's formula,
;GMa-
n\ = Jinn - exp -£ , 0 < 0n < 1, (6)
§3. Conditional Probability. Independence
23
whose proof can be found in most textbooks on mathematical analysis
(see also [69]).
5. Problems
1. Prove formula (5).
2. Show that for the multinomial distribution {P(y4ni,..., A„r)} the maximum
probability is attained at a point (/c,,...,/cr) that satisfies the inequalities npt — 1 <
kt < [n + r - l)pf, i = 1,..., r.
3. One-dimensional I sing model. Consider n particles located at the points 1,2,...,«.
Suppose that each particle is of one of two types, and that there are «, particles of the
first type and n2 of the second («, + n2 = «). We suppose that all n\ arrangements of
the particles are equally probable.
Construct a corresponding probabilistic model and find the probability of the
event A&mx x, m,2, m2\, m22) = {v, j = m,,,..., v22 = m22}t where vy is the number
of particles of type i following particles of type; (ij = 1, 2).
4. Prove the following inequalities by probabilistic reasoning:
£¢=2-,
fc = 0
£(02 = cn2n,
fc = 0
£ (- l)""kC* = Q_„ m > n + 1,
fc = 0
£ k(k - l)C^ = m(m - l)2m~2, m > 2.
fc = 0
§3. Conditional Probability. Independence
1. The concept of probabilities of events lets us answer questions of the
following kind: If there are M balls in an urn, Mj white and M2 black, what is
the probability P(A) of the event A that a selected ball is white? With the
classical approach, P(A) = MJM.
The concept of conditional probability, which will be introduced below,
lets us answer questions of the following kind: What is the probability that
the second ball is white (event B) under the condition that the first ball was
also white (event A)? (We are thinking of sampling without replacement.)
It is natural to reason as follows: if the first ball is white, then at the
second step we have an urn containing M — 1 balls, of which Mx — 1 are
white and M2 black; hence it seems reasonable to suppose that the
(conditional) probability in question is (Mi — 1)/(M — 1).
24
I. Elementary Probability Theory
We now give a definition of conditional probability that is consistent
with our intuitive ideas.
Let (Q, sf, P) be a (finite) probability space and A an event (i.e. A e sf).
Definition 1. The conditional probability of event B assuming event A with
PG4) > 0 (denoted by P(B\A)) is
P(AB)
P(A) '
In the classical approach we have P(A) = N(A)/N(Q), P(AB) =
N(AB)/N(Q), and therefore
From Definition 1 we immediately get the following properties of
conditional probability:
P(A\A)= 1,
P(0M) = O,
P(B\A)=1, B^A,
P(B, + B2\A) = PifiM) + P(B2\A).
It follows from these properties that for a given set A the conditional
probability P( • | A) has the same properties on the space (ClnA,s/n A),
where sf n A = {B n A: B e sf}, that the original probability P(-) has on
(¾ ^).
Note that
however in general
P(B\A) + P(B\A) = 1;
P(B\A) + P(B\A) * 1,
P(B\A) + P(B\A) * 1.
Example 1. Consider a family with two children. We ask for the probability
that both children are boys, assuming
(a) that the older child is a boy;
(b) that at least one of the children is a boy.
§3. Conditional Probability. Independence
25
The sample space is
Q = {BB, BG, GB, GG},
where BG means that the older child is a boy and the younger is a girl, etc.
Let us suppose that all sample points are equally probable:
P(BB) = P(BG) = P(GB) = P(GG) = i
Let A be the event that the older child is a boy, and B, that the younger
child is a boy. Then A u B is the event that at least one child is a boy, and
AB is the event that both children are boys. In question (a) we want the
conditional probability P{AB\A), and in (b), the conditional probability
P{AB\A\jB).
It is easy to see that
"-"""•-igsfe-i-}-
2. The simple but important formula (3), below, is called the formula for
total probability. It provides the basic means for calculating the
probabilities of complicated events by using conditional probabilities.
Consider a decomposition Q) — {Alt..., An} with P04,) > 0, i = 1,..., n
(such a decomposition is often called a complete set of disjoint events). It
is clear that
and therefore
But
B = BAy + ••• + BAn
P{B) = £ P(BAd.
P(BAt) = P(B\AdP(Ad.
Hence we have the formula for total probability:
P(B) = £ PiB\AdP(A£ (3)
i=l
In particular, if 0 < P(A) < 1, then
P(B) = P(B\A)P(A) + P(B\A)P(A). (4)
26
I. Elementary Probability Theory
Example 2. An urn contains M balls, m of which are "lucky." We ask for the
probability that the second ball drawn is lucky (assuming that the result of
the first draw is unknown, that a sample of size 2 is drawn without
replacement, and that all outcomes are equally probable). Let A be the event that
the first ball is lucky, B the event that the second is lucky. Then
m(m — 1)
P{B\A) =
P{B\A) -
P{BA)
" P04)
P(BA)
" PiA)
M(M -
m
M
m(M -
M(M -
~~ M -
-J)
m)
m
m - r
and
M - 1
M ~
P(B) = P(B\A)P(A) + P{B\A)P{A)
m — \m m M — m m
M - 1 M M - 1 M M'
It is interesting to observe that P{A) is precisely m/M. Hence, when the
nature of the first ball is unknown, it does not affect the probability that the
second ball is lucky.
By the definition of conditional probability (with P(A) > 0),
P(AB) = P(B\A)P(A). (5)
This formula, the multiplication formula for probabilities, can be generalized
(by induction) as follows: If Al9..., An- x are events with P(Ay • • ■ An-1) > 0,
then
P(Al-..An) = P(Al)P(A2\Al)...P(An\Al---An.l) (6)
(here A1--An= A1nA2n--n An).
3. Suppose that A and B are events with P(A) > 0 and P(B) > 0. Then
along with (5) we have the parallel formula
P(AB) = P(A\B)P(B). (7)
From (5) and (7) we obtain Bayes's formula
P(A)P(B\A)
P(AlB)=P&)' (8)
§3. Conditional Probability. Independence
27
If the events Au...,An form a decomposition of Q, (3) and (8) imply
Bayes's theorem:
In statistical applications, Alt...,An {Ax + • • • + An = Q) are often
called hypotheses, and P04£) is called the a priori^ probability of At. The
conditional probability P(At\B) is considered as the a posteriori probability
of At after the occurrence of event B.
Example 3. Let an urn contain two coins: Au a fair coin with probability
j of falling H; and A2, a biased coin with probability \ of falling H. A coin is
drawn at random and tossed. Suppose that it falls head. We ask for the
probability that the fair coin was selected.
Let us construct the corresponding probabilistic model. Here it is natural
to take the sample space to be the set Q = {/^H, A {I, A2H, A2T}t which
describes all possible outcomes of a selection and a toss (A ^ means that
coin Ax was selected and fell heads, etc.) The probabilities p(co) of the various
outcomes have to be assigned so that, according to the statement of the
problem,
P(Al)=P(A2) = i
and
P(HM1) = i P(H\A2) = i
With these assignments, the probabilities of the sample points are uniquely
determined:
P(^1H) = i P(AlT) = l P(A2H)=l P(A2T)=i
Then by Bayes's formula the probability in question is
P(A {m_ P^PjHlAQ = 3
r^iin; p^p^j^) + p(A2)P(H\A2) 5'
and therefore
P042|H) = f.
4. In certain sense, the concept of independence, which we are now going to
introduce, plays a central role in probability theory: it is precisely this concept
that distinguishes probability theory from the general theory of measure
spaces.
t A priori: before the experiment; a posteriori: after the experiment.
28
I. Elementary Probability Theory
If A and B are two events, it is natural to say that B is independent of A
if knowing that A has occurred has no effect on the probability of B. In other
words, "B is independent of A" if
P{B\A) = P(B) (10)
(we are supposing that P(A) > 0).
Since
it follows from (10) that
P(AB) = P(A)P(B). (11)
In exactly the same way, if P(B) > 0 it is natural to say that" A is independent
ofB"if
P(A\B) = P(A).
Hence we again obtain (11), which is symmetric in A and B and still makes
sense when the probabilities of these events are zero.
After these preliminaries, we introduce the following definition.
Definition 2. Events A and B are called independent or statistically independent
(with respect to the probability P) if
P(AB) = P(A)P(B).
In probability theory it is often convenient to consider not only
independence of events (or sets) but also independence of collections of events (or
sets).
Accordingly, we introduce the following definition.
Definition 3. Two algebras sty and st2 of events (or sets) are called
independent or statistically independent (with respect to the probability P) if all pairs
of sets Ay and A2, belonging respectively to sty and st2, are independent.
For example, let us consider the two algebras
stx = {Ay, Ay, 0, CI} and st2 = {A2, A2, 0, Q},
where Ay and A2 are subsets of Q. It is easy to verify that sty and s#2 are
independent if and only if Ay and A2 are independent. In fact, the
independence of sty and s#2 means the independence of the 16 events Ay and A2,
Ay and A2,..., CI and CI. Consequently Ay and A2 are independent.
Conversely, if Ay and A2 are independent, we have to show that the other 15
§3. Conditional Probability. Independence
29
pairs of events are independent. Let us verify, for example, the independence
of A! and A2. We have
P{A,A2) = P04O - P{A,A2) = P{AX) - P{AX)P{A2)
= P(^1)(l-P(^2)) = P(^i)P(^2).
The independence of the other pairs is verified similarly.
5. The concept of independence of two sets or two algebras of sets can be
extended to any finite number of sets or algebras of sets.
Thus we say that the sets Au...,An are collectively independent or
statistically independent (with respect to the probability P) if for k = 1,..., n
and 1 < il < i2 < • • • < ik <, n
P(^V^lc) = P(^l)"P(^lc). (12)
The algebras s/l9...9 sf„ of sets are called independent or statistically
independent (with respect to the probability P) if all sets A „ ...,An belonging
respectively to jflt..., $£n are independent.
Note that pairwise independence of events does not imply their
independence. In fact if, for example, Q = {colt co2, co3, co4} and all outcomes are
equiprobable, it is easily verified that the events
A = {colt co2}, B = {cou co3}, C = {colt co4}
are pairwise independent, whereas
P(ABC) = i* (¾3 = P04)P(f*)P(C).
Also note that if
P(ABC) = P04)P(B)P(C)
for events A, B and C, it by no means follows that these events are pairwise
independent. In fact, let Q consist of the 36 ordered pairs (i,y), where ij =
1,2,..., 6 and all the pairs are equiprobable. Then if A = {(i,j):j = 1,2 or 5},
B = {(ij):j = 4, 5 or 6}, C = {(i,j): i+j = 9} we have
P(AB) = * * i = P(A)P(B),
P(AC) =£ ^ = P04)P(Q,
P(BQ =^ ** = P(B)P{C\
but also
P(^BC) = £=P04)P(f*)P(C).
6. Let us consider in more detail, from the point of view of independence,
the classical model (Q, sf, P) that was introduced in §2 and used as a basis
for the binomial distribution.
30
I. Elementary Probability Theory
In this model
Q = {co: co = (au ..., an), a-t = 0, 1}, st = {A: A c Q}
and
p{d) = p1 V"10'. (13)
Consider an event A £ Q. We say that this event depends on a trial at
time k if it is determined by the value ak alone. Examples of such events are
Ak = {co: ak = 1}, Ak = {co: ak = 0}.
Let us consider the sequence of algebras s/i9 sf2,..., s#„, where sdk =
{Ak, Ak, 0, Q} and show that under (13) these algebras are independent.
It is clear that
P04*)= I PM= I PsV"Zfl'
{o>'.ak=l) {toiak=l}
_ p y „01+ "+ak-i+ak+i + -+an
ifli ak-i,ak+i a„)
n— 1
x (n-l)-(a1 + -+flk-1+flk+1+-+fl„) _ p y Ql „1 in-1)-I _
£ = 0
and a similar calculation shows that P(Ak) = q and that, for k ^ 1,
P&MO = p2, P^,) = pq, P&kAd = «2.
It is easy to deduce from this that $£k and $£x are independent for k ^ /.
It can be shown in the same way that s/lt sf2t.. ■, .s/„ are independent.
This is the basis for saying that our model (Q, sf, P) corresponds to "n
independent trials with two outcomes and probability p of success." James
Bernoulli was the first to study this model systematically, and established
the law of large numbers (§5) for it. Accordingly, this model is also called
the Bernoulli scheme with two outcomes (success and failure) and probability
p of success.
A detailed study of the probability space for the Bernoulli scheme shows
that it has the structure of a direct product of probability spaces, defined
as follows.
Suppose that we are given a collection (Q1} @lt Pj),..., (Q„, ^n, P„) of
finite probability spaces. Form the space Q = Qj x Cl2 x • • - x Cl„ of points
co = (alt...t an\ where a{ e Qf. Let &0 = #t ® • • • ® @n be the algebra of
the subsets of Q that consists of sums of sets of the form
A = By x B2 x • • • x Bn
with BiG&t. Finally, for co = (al9 ...,an) take p(co) = Pi(ax)• • • pn(an) and
define P{A) for the set A = Bx x B2 x • - • x Bn by
P04)= I Pi(ai)"-P„(a„).
{aieBi a„eB„)
§3. Conditional Probability. Independence
31
It is easy to verify that P(0) = 1 and therefore the triple (O, sf, P) defines
a probability space. This space is called the direct product of the probability
spaces (Qlf @u PJ,..., (fK, ®n, Pn).
We note an easily verified property of the direct product of probability
spaces: with respect to P, the events
A, = {wiflieB!} ^ = {co:aneBn},
where Bt e &it are independent. In the same way, the algebras of subsets of O,
s0x = {Al:Al = {<D:aleBl},Ble@l},
sfn = {Am: An = {a>: aneBn}, Bne<2n}
are independent.
It is clear from our construction that the Bernoulli scheme
(O, jf, P) with O = {co: co = (alt..., an), a{ = 0 or 1}
s/ = {A:AcQ} and p(co) = p1 a'qn _s at
can be thought of as the direct product of the probability spaces (0£, 96 x, P,),
i = 1,2,..., n, where
0,.= (0,1), ^=((0),(1),0,0,.),
Pi({l» = P, Pi({0}) = q-
7. Problems
1. Give examples to show that in general the equations
P{B\A) + P(B\A) = 1,
P(B\A) + P(B\A) = 1
are false.
2. An urn contains M balls, of which M, are white. Consider a sample of size n. Let Bj
be the event that the ball selected at theyth step is white, and Ak the event that a sample
of size n contains exactly k white balls. Show that
P(Bj\Ak) = k/n
both for sampling with replacement and for sampling without replacement.
3. Let Ax,..., A„ be independent events. Then
\i=t / i=t
4. Let /4, A„ be independent events with P(A{) = pt. Then the probability P0
that neither event occurs is
^o= fid-Pi)-
32
I. Elementary Probability Theory
5. Let A and B be independent events. In terms of P(A) and P(B), find the probabilities
of the events that exactly k, at least k, and at most k of A and B occur (k — 0,1, 2).
6. Let event A be independent of itself, i.e. let A and A be independent. Show that
P(A) is either 0 or 1.
7. Let event A have P(/l) = 0 or 1. Show that A and an arbitrary event B are
independent.
8. Consider the electric circuit shown in Figure 4:
Figure 4
Each of the switches A, B, C, D, and E is independently open or closed with
probabilities p and q, respectively. Find the probability that a signal fed in at "input"
will be received at "output". If the signal is received, what is the conditional
probability that E is open?
§4. Random Variables and Their Properties
1. Let (Q, sf, P) be a probabilistic model of an experiment with a finite
number of outcomes, N(Q) < oo, where $£ is the algebra of all subsets of
Q. We observe that in the examples above, where we calculated the
probabilities of various events A e sf, the specific nature of the sample space Q was
of no interest. We were interested only in numerical properties depending
on the sample points. For example, we were interested in the probability of
some number of successes in a series of n trials, in the probability distribution
for the number of objects in cells, etc.
The concept "random variable," which we now introduce (later it will
be given a more general form) serves to define quantities that are subject to
"measurement" in random experiments.
Definition 1. Any numerical function £ = £(co) defined on a (finite) sample
space Q is called a (simple) random variable. (The reason for the term " simple "
random variable will become clear after the introduction of the general
concept of random variable in §4 of Chapter II.)
§4. Random Variables and Their Properties
33
Example 1. In the model of two tosses of a coin with sample space Q =
{HH, HT, TH, TT}, define a random variable f = f(co) by the table
0)
«w)
HH
2
HT
1
TH
1
TT
0
Here, from its very definition, £(co) is nothing but the number of heads in the
outcome co.
Another extremely simple example of a random variable is the indicator
(or characteristic function) of a set A e &0\
C = Uco),
where t
fl, coeA,
7^) = |o, co* A
When experimenters are concerned with random variables that describe
observations, their main interest is in the probabilities with which the
random variables take various values. From this point of view they are
interested, not in the distribution ofthe probability P over (Q, $t\ but in
its distribution over the range of a random variable. Since we are considering
the case when Q contains only a finite number of points, the range X of
the random variable £ is also finite. Let X = {xlt..., xm}, where the
(different) numbers xlt..., xm exhaust the values of f.
Let SC be the collection of all subsets of X, and let BeSC. We can also
interpret B as an event if the sample space is taken to be X, the set of values
of*
On (X, SC\ consider the probability P£(-) induced by £ according to the
formula
P£B) = P{co: £(co)eB}9 Be3C.
It is clear that the values of this probability are completely determined by
the probabilities
Pfrd = P{co: ¢(0,) = xj, XieX.
The set of numbers (P^),..., P^xm)} is called the probability
distribution of the random variable \.
t The notation 1(A) is also used. For frequently used properties of indicators see Problem 1.
34
I. Elementary Probability Theory
Example 2. A random variable £ that takes the two values 1 and 0 with
probabilities p ("success") and q ("failure"), is called a Bernoullit random
variable. Clearly
Pfr) = p*ql-X, x = 0, 1. (1)
A binomial (or binomially distributed) random variable £ is a random
variable that takes the n + 1 values 0, 1,..., n with probabilities
Pfc) = QpV"x, x = 0, 1,..., n. (2)
Note that here and in many subsequent examples we do not specify the
sample spaces (Q, $£, P), but are interested only in the values of the random
variables and their probability distributions.
The probabilistic structure of the random variables £ is completely
specified by the probability distributions {/^(x,), i — 1,..., m}. The concept
of distribution function, which we now introduce, yields an equivalent
description of the probabilistic structure of the random variables.
Definition 2. Let xeR1. The function
F£x) = P{co:t(cD)<x}
is called the distribution function of the random variable £.
Clearly
[i:xi<x)
and
Pfrd = Fficd - Ffic, -),
where F£x-) = limyTx F^OO-
If we suppose that xt < x2 < • • • < xm and put F^(x0) = 0, then
P&d = F&) - F«(x,_ 0, i =1,--., m.
The following diagrams (Figure 5) exhibit J^(x) and F^(x) for a binomial
random variable.
It follows immediately from Definition 2 that the distribution F^ = F^(x)
has the following properties:
(l)F,(-a)) = 0,F,(+oo)=l;
(2) F£x) is continuous on the right (F^(x+) = F^(x)) and piecewise constant.
t We use the terms "Bernoulli, binomial, Poisson, Gaussian,..., random variables" for what
are more usually called random variables with Bernoulli, binomial, Poisson, Gaussian,...,
distributions.
§4. Random Variables and Their Properties
35
***>.
Wj
I !
i
t
t—r
1 2
Figure 5
Along with random variables it is often necessary to consider random
vectors £ = (£ i,..., £,.) whose components are random variables. For
example, when we considered the multinomial distribution we were dealing
with a random vector v = (v^..., vr), where v,- = v,(co) is the number of
elements equal to bt9 i = 1,..., r, in the sequence co = (a l9 ..., a„).
The set of probabilities
P^xu ..., xr) = P{co: Mco) = xl9..., £r(co) = xr},
where x.-eA",-, the range of ££, is called the probability distribution of the
random vector £, and the function
F^Xi,..., xr) = P{co: f !(co) < xl9..., fr(co) < x,},
where xjeU1, is called the distribution function of the random vector £ =
(f 1,..., fr>
For example, for the random vector v = (v1}...,vr) mentioned above,
Pv(nu ..., nr) = Cn(nlt..., nr)p\l •••pn/
(see (2.2)).
2. Let £i,..., £r be a set of random variables with values in a (finite) set
X c Rl. Let SC be the algebra of subsets of A\
36
1. Elementary Probability Theory
Definition 3. The random variables f i,..., fr are said to be independent
(collectively independent) if
P{£1 = x1,...,£r = xr} = P{£1 = x1}...P{£r = xr}
for all Xi,..., xr e X; or, equivalently, if
PK16B1,...,4eBr} = P{^1eB1}-P{^eBr}
for all ^,..., Bre#".
We can get a very simple example of independent random variables
from the Bernoulli scheme. Let
Q = {co: co = (au ..., an\ at = 0, 1}, p(co) = pz V-1"'
and &(co) = at for co = (alt..., a„), i = 1,..., n. Then the random variables
f ii f 2» • • •, £„ are independent, as follows from the independence of the events
Ax — {co: ax = 1},..., An = {co: an = 1},
which was established in §3.
3. We shall frequently encounter the problem of finding the probability
distributions of random variables that are functions/(£l9..., £,.) of random
variables f i,..., fr. For the present we consider only the determination
of the distribution of a sum £ = ^ + r\ of random variables.
If ^ and »7 take values in the respective sets X = {xl9..., xk} and Y =
{.Vi» • • •»}>/}> the random variable £ = £ + r\ takes values in the set Z =
{z: z = x,- + yj, i = 1,..., fc;/ = 1,..., /}. Then it is clear that
P&) = P(C = z) = P{£ + r\ = z} = X P{£ = xlfi7 = y,.}.
{(i.j):x,+yj=z}
The case of independent random variables ^ and r\ is particularly
important. In this case
P{£ = xf> /7 = yj) = P{£ = x,}Pfe = yjh
and therefore
W = I PfrdPJyj) = I PfrdPfr - xf) (3)
for all z e Z, where in the last sum Pn{z — x,) is taken to be zero if z - xt ¢ Y.
For example, if £ and »7 are independent Bernoulli random variables,
taking the values 1 and 0 with respective probabilities p and q, then Z =
{0, 1, 2} and
Pf (0) = ^(0)^(0) = g2,
^c(l) = W^) + ^W) = 2pg,
Pf(2) = P.CDP.Ci) = p2.
§4. Random Variables and Their Properties
37
It is easy to show by induction that if f lt £2,..., £„ are independent
Bernoulli random variables with P{^ = 1} = p, P{^f = 0} = q, then the
random variable C = ^i + ••• + £„ has the binomial distribution
*c(*) = C£pV~*. fc = 0, 1,..., n. (4)
4. We now turn to the important concept of the expectation, or mean value,
of a random variable.
Let (Q, sd, P) be a (finite) probability space and ^ = £(co) a random
variable with values in the set X = {xlf..., xk}. If we put At — {coil; = x,},
i = 1,..., k, then £ can evidently be represented as
«g>) = £ X../04,.), (5)
i= 1
where the sets Ax,...,Ak form a decomposition of Q (i.e., they are pairwise
disjoint and their sum is Q; see Subsection 3 of §1).
Let Pi = P{£ = x,}. It is intuitively plausible that if we observe the values
of the random variable £ in "n repetitions of identical experiments", the
value xf ought to be encountered about ptn times, i = 1,..., k. Hence the
mean value calculated from the results of n experiments is roughly
1 *
- [nPi*i + • • • + np^J = £ Pi*i-
n (= i
This discussion provides the motivation for the following definition.
Definition 4. The expectation^ or mean value of the random variable ^ =
£*= l X|f 04i) is the number
E£ = I x£JW (6)
i=l
Since /1,- = {co: £(co) = x,} and P£x() = P(/l,), we have
E£ = £ x,.^). (7)
Recalling the definition of F^ = F^(x) and writing
AF^(x) = Ffic) - F£x -),
we obtain P£xt) = AF^x,) and consequently
E£ = £ x.-AF^x,). (8)
i= 1
t Also known as mathematical expectation, or expected value, or (especially in physics)
expectation value. (Translator)
38
I. Elementary Probability Theory
Before discussing the properties of the expectation, we remark that it is
often convenient to use another representation of the random variable £,
namely
/= 1
where Bl + • • • + Bt = Q, but some of the x's may be repeated. In this case
E£ can be calculated from the formula Yj=i x/P(5/)> which differs formally
from (5) because in (5) the x£ are all different. In fact,
£ x'jP(Bj) = Xi £ P(B;) = x,.P(^)
ij:x'j = xi) {j--xj = x,}
and therefore
5. We list the basic properties of the expectation:
(1) Ift>OthenEt>0.
(2) E(a£ + bn) = aE£ + ££»7, where a and fe are constants.
(3) Ift > r\ then E£ > En.
(4) |E£|<E|£|.
(5) If ^ and rj are independent, then E£n = E£ • E/7.
(6) (EI ^|)2 < E£2 -E?/2 (Cauchy-Bunyakovskii inequality).^
(7) //£ = /U) then E£ = P(,4).
Properties (1) and (7) are evident. To prove (2), let
« /
Then
a£ + bn = aX xJiAt n Bj) + b £ yjKAi ^ Bj)
«./ «./
= 1(^ + ^,^¾
«./
and
E(a£ + brj) = £ (ax, + b^OPUf n ^)
1./
I I
= flSxfPM,) + &5>jP(*/) = aEt + K17.
«' /
t Also known as the Cauchy-Schwarz or Schwarz inequality. (Translator)
§4. Random Variables and Their Properties
39
Property (3) follows from (1) and (2). Property (4) is evident, since
|E£h
<ZlxI-|P(^) = E|^|.
I > I
To prove (5) we note that
= E I XtyjHAt n Bj) = £ xiyJP{Ai n Bj)
«./ us
= I.xtyjP(AdPiBj)
i.j
= (l*|PW,)) • (l^PP*,)) = Ef-E*.
where we have used the property that for independent random variables the
events
At = {co: £(co) = x,} and Bj = {co: rj(co) = ^}
are independent: P(A{ n Bj) = P(Ai)P(Bj).
To prove property (6) we observe that
^ = ^/04,.), n2 = Y,y]m)
i J
and
E£2 = I x?PU,.), E/y2 = I tfPOtf.
Let Ef 2 > 0, E/y2 > 0. Put
<? = -^?. n-
v/E?' ' Ve?'
Since 2||^| < |2 + J72, we have 2E||/y| < E?2 + E/y2 = 2. Therefore
E||/7|<land(E|£7l)2<E£2-E/72.
However, if, say, E£2 = 0, this means that ££ xfP(At) = 0 and
consequently the mean value of £ is 0, and P{co: £(co) = 0} = 1. Therefore if at
least one of E£2 or Erj2 is zero, it is evident that E |^| = 0 and consequently
the Cauchy-Bunyakovskii inequality still holds.
Remark. Property (5) generalizes in an obvious way to any finite number of
random variables: if £u ..., £r are independent, then
The proof can be given in the same way as for the case r = 2, or by induction.
40
f. Elementary Probability Theory
Example 3. Let £ be a Bernoulli random variable, taking the values 1 and 0
with probabilities p and q. Then
E£=l.P{£=l}+0.P{£ = 0} = p.
Example 4. Let f „..., £„ be n Bernoulli random variables with Pfo = 1}
= p, P{& = 0} = q, p + g = 1. Then if
S„ = fi +■■• + «■
we find that
ESn = wp.
This result can be obtained in a different way. It is easy to see that ES„
is not changed if we assume that the Bernoulli random variables f i,..., f „
are independent. With this assumption, we have according to (4)
P(Sn = k) = C*pV-\ k = Q,l9...9n.
Therefore
ES„ = t *Ptf„ = *) = I *Ctpty"*
k = 0 k = 0
= npY ^ ~ ^! pfc-i0(n-i)-(k-i)
= npY (" ~ 1)! pVn-1)"' = «P.
However, the first method is more direct.
6. Let ^ = £,- x,-I(A,), where /1,- = {co: £(co) = xj, and <p = <p(£(co)) is a
function of £(co). If B, = {co: q>(£{co)) = j;,-}, then
j
and consequently
E? = Z,yjP(Bj) = ZyjP^j). (9)
j I
But it is also clear that
?(£(")) = 2>(x,-)/04,-).
§4. Random Variables and Their Properties
41
Hence, as in (9), the expectation of the random variable q> = <p(£) can be
calculated as
i
7. The important notion of the variance of a random variable £ indicates
the amount of scatter of the values of £ around E£.
Definition 5. The variance (also called the dispersion) of the random variable
£ (denoted by V£) is
V£ = E(£ - E£)2.
The number a = + yjv~£ is called the standard deviation.
Since
E(£ - E£)2 = E(£2 - 2* • E£ + (E£)2) = E£2 - (E£)2,
we have
V£ = E£2 - (E£)2.
Clearly V£ ^ 0. It follows from the definition that
M{a + b£) = fc2V<!;, where a and b are constants.
In particular, Ma = 0, V(fcf) = b2V£.
Let £ and r\ be random variables. Then
V« + n) = E((£ - EO + (//- E/7))2
= V£ + V/y + 2E(£ - E0fo - E/y).
Write
COv(£, /7) = E(£ - EQ(i7 - E/y).
This number is called the covariance of £ and rj. If V£ > 0 and V/7 > 0, then
cov(f, rfi
ptt*) =
v/wFv^
is called the correlation coefficient of £, and »7. It is easy to show (see Problem
7 below) that if p(£, rf) = ±1, then <!; and »7 are linearly dependent:
r\ = a£ + b,
with a > 0 if p(^, rf) = 1 and a < 0 if p(£, »7) = -1.
We observe immediately that if I, and r\ are independent, so are £, — E£
and n -En. Consequently by Property (5) of expectations,
cov(£, n) = Etf - EO • Efo - E/7) = 0.
42
I. Elementary Probability Theory
Using the notation that we introduced for covariance, we have
V(£ + ri) = V£ + V/y + 2cov(£, n); (10)
if £ and rj are independent, the variance of the sum ^ + rj is equal to the sum
of the variances,
V(£ + »y)= V£ + V17. (11)
It follows from (10) that (11) is still valid under weaker hypotheses than
the independence of ^ and rj. In fact, it is enough to suppose that £ and r\ are
uncorrelated, i.e. cov(£, r\) — 0.
Remark. If £ and r\ are uncorrelated, it does not follow in general that they
are independent. Here is a simple example. Let the random variable a take
the values 0, n/2 and n with probability 3. Then £ = sin a and r\ = cos a are
uncorrelated; however, they are not only stochastically dependent (i.e., not
independent with respect to the probability P):
P{£ = 1, rj = 1} = 0 * i = P{£ = l}P{i7 = 1},
but even functionally dependent: £2 + n2 = 1.
Properties (10) and (11) can be extended in the obvious way to any
number of random variables:
vf I fc) = I Vfc + 2 £ cov(&. y. (12)
V=i / 1 = 1 t>j
In particular, if f lf..., £„ are pairwise independent (pairwise uncorrelated
is sufficient), then
v(i&) = 1; vfc. (13)
Example 5. If £ is a Bernoulli random variable, taking the values 1 and 0
with probabilities p and q, then
V£ = E(£ - E£)2 = (£ - p)2 = (1 - p)2p + p2<? = M.
It follows that if f t,..., £„ are independent identically distributed Bernoulli
random variables, and S„ = f 1 + ■ • • + f„, then
VSn = npg. (14)
8. Consider two random variables ^ and r\. Suppose that only £ can be
observed. If £ and r\ are correlated, we may expect that knowing the value of £
allows us to make some inference about the values of the unobserved
variable n.
Any function f = /(£) of ^ is called an estimator for n. We say that an
estimator/* = /*(£) is best in the mean-square sense if
Eft -/*(£))2 = mf Eft -/(0)2.
§4. Random Variables and Their Properties
43
Let us show how to find a best estimator in the class of linear estimators
A(£) = a + b£. We consider the function g(a, b) = E{n - (a + H))2.
Differentiating g(a, b) with respect to a and b, we obtain
dg(atb)
= - 2E[»y - (a + fcC)],
da
dg(fl, ft)
at
= - 2EKn - (a + &©)a,
whence, setting the derivatives equal to zero, we find that the best mean-
square linear estimator is A*(£) = a* + fr*£, where
In other words,
a* = E„-6*E«, 6* =22^. (15)
A*© = E, + C-^«-Ef). (16)
The number E(rj — X*{^))2 is called the mean-square error of observation.
An easy calculation shows that it is equal to
cov2(& rj)
Consequently, the larger (in absolute value) the correlation coefficient
p(6 n) between £ and /7, the smaller the mean-square error of observation
A*. In particular, if |p(& n)\ = 1 then A* = 0 (cf. Problem 7). On the other
hand, if £ and n are uncorrelated (p(<^, n) = 0), then A*(£) = En, i.e. in the
absence of correlation between ^ and n the best estimate of n in terms of ^ is
simply En (cf. Problem 4).
9. Problems
1. Verify the following properties of indicators IA = /^(co):
/0 = 0, /« = 1, /,, + /*= 1,
Iaub = Ia + Ib ~ Iab-
The indicator of (J"=l At is 1 - f]?=i (1 — ^fX tne indicator of (J"=1 y4{ is
n?=i (1 _ JaX and the indicator of JJ=1 4£ is JJ=1 /^..
where /1 A J3 is the symmetric difference of /1 and J3, i.e. the set (A\B) u (J3\/4).
44
I. Elementary Probability Theory
2. Let £1,..., £,n be independent random variables and
f m:n = min(f,,..., f n), f max = max(f l5..., f n).
Show that
P{£ni„>x}= flP{^>^}.
i=t
P{£max<*} = flP{^<x}.
i=t
3. Let ¢,,..., £n be independent Bernoulli random variables such that
Pfo = 0} = 1 - A.A,
P{& = 1} = A.-A,
where A is a small number, A > 0, A,- > 0.
Show that
Ptfi + --- + ^=i} = ( Z A«)A + °<A2)'
P{^ + -- + ^n>l} = 0(A2).
4. Show that inf_00<fl<00 E(£ — a)2 is attained for a = E£ and consequently
inf E(£ - a)2 = V£.
— 00<fl<00
5. Let £ be a random variable with distribution function F£x) and let m, be a median
of F^x), i.e. a point such that
F^m,.-) < i < F4(m,).
Show that
inf E|£-af=E|£-mt.|.
— 0O<O< 00
6. Let P4(x) = P{£ = x) and F4(x) = P(£ < x}. Show that
^+tW = FA——)
for a > 0 and — oo < fc < oo. If y > 0, then
Ff<y) = ^+^) - J^-n/Jo + P&-yfy)-
Let f+ = max(f, 0). Then
{0, x < 0,
^(0), x = 0,
F^(x), x>0.
§5. The Bernoulli Scheme. I. The Law of Large Numbers
45
7. Let g and i/ be random variables with Vg > 0, Vi/ > 0, and let p = p(£, ly) be their
correlation coefficient. Show that \p\ < 1. If \p\ = 1, find constants a and b such
that i/ = ag + b. Moreover, if p = 1, then
17 - Eiy _ g - Eg
(and therefore a > 0), whereas if p = -1, then
i/ - Eiy _ g - Eg
(and therefore a < 0).
8. Let g and ly be random variables with Eg = Eiy = 0, Vg = Viy = 1 and correlation
coefficient p = p(Ji, ly). Show that
Emax(g2,iy2)<l+V1-P2-
9. Use the equation
(indicator of |J ,4,.) = jf[(l - IA(\
to deduce the formula P(J30) = 1 - St + S2 + - ■ • ± S„ from Problem 4 of §1.
10. Let gt,..., gn be independent random variables, (px = </>1(g1,..., gfc) and q>2 =
^2½¾ + ii•- • > €nX functions respectively of gt,..., gfc and gfc+lt..., gn. Show that the
random variables cpt and <p2 are independent.
11. Show that the random variables gt,..., gn are independent if and only if
Ftt *„(xi,.•.,*„) = F4l(x,) --^,,00
for all xlf..., xn, where F4l ^(x,,..., xn) = P{gt < xu -.., gn < *„}•
12. Show that the random variable g is independent of itself (i.e., g and g are
independent) if and only if g = const.
13. Under what hypotheses on g are the random variables g and sin g independent?
14. Let g and ly be independent random variables and ly # 0. Express the probabilities
of the events P{giy < z} and P{g/iy < z} in terms of the probabilities P4(x) and P^(y).
§5. The Bernoulli Scheme. I. The Law of
Large Numbers
1. In accordance with the definitions given above, a triple
(Q, s/, P) with Q = {co: co = (alf..., an), at = 0, 1},
is called a probabilistic model of n independent experiments with two
outcomes, or a Bernoulli scheme.
46
I. Elementary Probability Theory
In this and the next section we study some limiting properties (in a sense
described below) for Bernoulli schemes. These are best expressed in terms of
random variables and of the probabilities of events connected with them.
We introduce random variables ^,..., £n by taking £,(co) = ai9 i =
I,..., n, where co = (alt... t an). As we saw above, the Bernoulli variables
£,-(co) are independent and identically distributed:
P{£=l} = p, PK, = 0} = *, 1 = 1,...,11.
It is natural to think of £, as describing the result of an experiment at the
ith stage (or at time i).
Let us put S0(co) = 0 and
!£i + --- + &,
k= !,...,«.
As we found above, ESn = np and consequently
(1)
In other words, the mean value of the frequency of "success", i.e. SJn,
coincides with the probability p of success. Hence we are led to ask how much
the frequency SJn of success differs from its probability p.
We first note that we cannot expect that, for a sufficiently small s > 0
and for sufficiently large n, the deviation of SJn from p is less than £ for all
co, i.e. that
Sn(co)
In fact, when 0 < p < 1,
p{^=1}=p{^
< £, CO E CI.
(2)
1,...,6,= i} = p",
pp! = oj= P{^ =0,...,^ = 0} = q\
whence it follows that (2) is not satisfied for sufficiently small £ > 0.
We observe, however, that when n is large the probabilities of the events
{SJn = 1} and {SJn = 0} are small. It is therefore natural to expect that the
total probability of the events for which |[Sn(co)/n] - p\ > e will also be
small when n is sufficiently large.
We shall accordingly try to estimate the probability of the event
{co: \[Sn(co)/n\ — p\ > e}. For this purpose we need the following inequality,
which was discovered by Chebyshev.
§5. The Bernoulli Scheme. I. The Law of Large Numbers 47
Chebyshev's inequality. Let (Q, sf, P) be a probability space and £ = £(co) a
nonnegative random variable. Then
P{t > £} < E£/£ (3)
for all £ > 0.
Proof. We notice that
f = f/(£ > «) + £/(£ < £) > £/(£ > £) > e/(£ > £),
ator of A
ties of the expectation,
E£>£E/(£>£) = £P(£>£),
where 1(A) is the indicator of A.
Then, by the properties of the expectation
which establishes (3).
Corollary. //£ is any random variable, we have for £ > 0,
P{|£|>£}<E|£|/£,
P{|£|>£} = P{£2>£2}<E£2/£2, (4)
P{|£-E£|>£}<V£/£2.
In the last of these inequalities, take ^ = SJn. Then using (4.14), we obtain
\\ n | J £ « £ « £ «£
Therefore
W-'H
*»*' (5)
M£2 4rt£Z
from which we see that for large n there is rather small probability that the
frequency SJn of success deviates from the probability p by more than £.
For n > 1 and 0 < k < n, write
*>„(*) = cjpY-fc.
Then
IS
P<
P U 4 = I ^(*),
| J {fc:|(k/n)-p|^c}
and we have actually shown that
z pm* ™ *A. (6)
48
I. Elementary Probability Theory
Pn(k)k
1
np + ne
Figure 6
i.e. we have proved an inequality that could also have been obtained
analytically, without using the probabilistic interpretation.
It is clear from (6) that
-> oo. (7)
I PM-
[k:\(k(n)-p\*e)
o,
We can clarify this graphically in the following way. Let us represent the
binomial distribution (P„(/c), 0 < k < n} as in Figure 6.
Then as n increases the graph spreads out and becomes flatter. At the same
time the sum of Pn(k\ over k for which np — m< k < np 4- ne, tends to 1.
Let us think of the sequence of random variables S0, Slt...,Sn as the
path of a wandering particle. Then (7) has the following interpretation.
Let us draw lines from the origin of slopes kp, k(p + £), and k(p — e). Then
on the average the path follows the kp line, and for every £ > 0 we can say that
when n is sufficiently large there is a large probability that the point Sn
specifying the position of the particle at time n lies in the interval
[n(p — £), n(p + £)]; see Figure 7.
We would like to write (7) in the following form:
{|tH-£H'
(8)
Figure 7
k(p + e)
MP - e)
§5. The Bernoulli Scheme. I. The Law of Large Numbers
49
However, we must keep in mind that there is a delicate point involved
here. Indeed, the form (8) is really justified only if P is a probability on a
space (Q, s#) on which infinitely many sequences of independent Bernoulli
random variables £l9 £2,...,are defined. Such spaces can actually be
constructed and (8) can be justified in a completely rigorous probabilistic
sense (see Corollary 1 below, the end of §4, Chapter II, and Theorem 1, §9,
Chapter II). For the time being, if we want to attach a meaning to the analytic
statement (7), using the language of probability theory, we have proved only
the following.
Let (Q(n), st{n\ P(n)), n > 1, be a sequence of Bernoulli schemes such that
Q(n) = {co(n). ^(n, = (flw ^^ aW)t fljn) = ^ ^
sf™ = {A: A cQC)},
p(nVn)) = p£< 4*-W
and
S£n V>) = £? V}) + • • ■ + <&" Vn)),
where, for n < 1, ftn),..., fj0 are sequences of independent identically
distributed Bernoulli random variables.
Then
p(n))w(n):
^^-p\>e}= £ PJBQ-+Q. n-oo. (9)
n | J (k:\(k/n)-pUe)
Statements like (7)-(9) go by the name of James Bernoulli's law of large
numbers. We may remark that to be precise, Bernoulli's proof consisted in
establishing (7), which he did quite rigorously by using estimates for the
"tails" of the binomial probabilities Pn(k) (for the values of k for which
\{k/n) — p\>e). A direct calculation of the sum of the tail probabilities of
the binomial distribution Y,[k:\wn)-Pue) pnQ<) is rather difficult problem for
large n, and the resulting formulas are ill adapted for actual estimates of the
probability with which the frequencies Sjn differ from p by less than £.
Important progress resulted from the discovery by De Moivre (for p = j)
and then by Laplace (for 0 < p < 1) of simple asymptotic formulas for Pn(k)t
which led not only to new proofs of the law of large numbers but also to
more precise statements of both local and integral limit theorems, the essence
of which is that for large n and at least for k ~ np,
PM „ 1 e-(fc-"p)2/(2np,)j
yjlnnpq
and
{k:\{k/n)-p\se) ^/2ll J-*/HIpq
50
I. Elementary Probability Theory
2. The next section will be devoted to precise statements and proofs of these
results. For the present we consider the question of the real meaning of the
law of large numbers, and of its empirical interpretation.
Let us carry out a large number, say N, of series of experiments, each of
which consists of "n independent trials with probability p of the event C of
interest." Let S\Jn be the frequency of event C in the rth series and Ne the
number of series in which the frequency deviates from p by less than £:
NE is the number of Vs for which |(Sj/n) — p\ < £. Then
NJN ~ P£ (10)
where P£ = P{|(S»-p|<fi}.
It is important to emphasize that an attempt to make (10) precise inevitably
involves the introduction of some probability measure, just as an estimate for
the deviation ofSJn from p becomes possible only after the introduction of a
probability measure P.
3. Let us consider the estimate obtained above,
p{ft-pM= X p-(/c)-i' (11)
(J n I J lk:\ik/n)-p\>e) 4rt£
as an answer to the following question that is typical of mathematical
statistics: what is the least number n of observations that is guaranteed to
have (for arbitrary 0 < p < 1)
flf-'H*1-*
(12)
where a is a given number (usually small)?
It follows from (11) that this number is the smallest integer n for which
_1
' 4£2<X
»*T3Z- (1¾
For example, if a = 0.05 and e = 0.02, then 12 500 observations guarantee
that (12) will hold independently of the value of the unknown parameter p.
Later (Subsection 5, §6) we shall see that this number is much overstated;
this came about because Chebyshev's inequality provides only a very crude
upper bound for P{\(SJri) — p\ > e}.
4. Let us write
f \S„(w) I 1
C(n,£) = <co: -^^-p <£^.
From the law of large numbers that we proved, it follows that for every
e > 0 and for sufficiently large n, the probability of the set C(n, £) is close to
1. In this sense it is natural to call paths (realizations) co that are in C(n, £)
typical (or (n, £)-typical).
§5. The Bernoulli Scheme. I. The Law of Large Numbers
51
We ask the following question: How many typical realizations are there,
and what is the weight p(co) of a typical realization?
For this purpose we first notice that the total number N(Q) of points is 2",
and that if p = 0 or 1, the set of typical paths C(n, e) contains only the single
path (0,0,.... 0) or (1,1,..., 1). However, if p = £, it is intuitively clear that
"almost all" paths (all except those of the form (0,0,..., 0) or (1, 1,..., 1))
are typical and that consequently there should be about 2" of them.
It turns out that we can give a definitive answer to the question whenever
0 < p < 1; it will then appear that both the number of typical realizations
and the weights p(co) are determined by a function of p called the entropy.
In order to present the corresponding results in more depth, it will be
helpful to consider the somewhat more general scheme of Subsection 2 of
§2 instead of the Bernoulli scheme itself.
Let (pt, p2,..., pr) be a finite probability distribution, i.e. a set of nonnega-
tive numbers satisfying px + • • • 4- pr — 1. The entropy of this distribution is
H = ~ £ P.lnP.->
(14)
with 0 • In 0 = 0. It is clear that H > 0, and H = 0 if and only if every pi9
with one exception, is zero. The function /(x) = — x In x, 0 < x < 1, is
convex upward, so that, as know from the theory of convex functions,
/M + -+/(xf)
*f(±±^*y
Pi + ' ■ ' + Pr
.h,(» + ;"+»)-iT.
Consequently
r
H = - £ Pi\npi< -r
i=\
In other words, the entropy attains its largest value for px = • • • = pr = 1/r
(see Figure 8 for H = H(p) in the case r = 2).
If we consider the probability distribution (pu p2,..., pr) as giving the
probabilities for the occurrence of events AltA2,-.., Artsay, then it is quite
clear that the "degree of indeterminancy" of an event will be different for
H(pY
Figure 8. The function H(p) = -p In p - (1 - p)ln(l - p).
52
I. Elementary Probability Theory
different distributions. If, for example, px = 1, p2 = • • • = pr = 0, it is clear
that this distribution does not admit any indeterminacy: we can say with
complete certainty that the result of the experiment will be Ax. On the other
hand, if pt = • • • = pr = 1/r, the distribution has maximal indeterminacy,
in the sense that it is impossible to discover any preference for the occurrence
of one event rather than another.
Consequently it is important to have a quantitative measure of the
indeterminacy of different probability distributions, so that we may compare
them in this respect. The entropy successfully provides such a measure of
indeterminacy; it plays an important role in statistical mechanics and in many
significant problems of coding and communication theory.
Suppose now that the sample space is
Q = {co: co = (fllf..., fln), at = 1,..., r}
and that p(co) = p\l{to) • • • pvrAto), where v.(co) is the number of occurrences of i
in the sequence co, and (pl9..., pr) is a probability distribution.
For £ > 0 and n = 1, 2,..., let us put
C(n,£) = \co:
Vj(C0)
< e,
It is clear that
P(C(M,£))>
i-£p{
n
'}■
P. >£f.
and for sufficiently large n the probabilities P{\(\'i(co)/ri) — p,| > £} are
arbitrarily small when n is sufficiently large, by the law of large numbers
applied to the random variables
««)-{« %:". *=i «•
Hence for large n the probability of the event C(n, £) is close to 1. Thus, as in
the case n = 2, a path in C(n, £) can be said to be typical.
If all pi > 0, then for every coeQ.
K») = exp{-ni(-^lnPli)}.
Consequently if co is a typical path, we have
|i;(-^lnP,)-H|<-i|^>-PJlnft<-£i,„p,.
I*=l\ n / | fc=l | n | k=l
It follows that for typical paths the probability p(co) is close to e~"H and —
since, by the law of large numbers, the typical paths "almost" exhaust CI
when n is large—the number of such paths must be of order enH. These
considerations lead up to the following proposition.
§5. The Bernoulli Scheme. I. The Law of Large Numbers 53
Theorem (Macmillan). Let p, > 0, i = 1,..., r and 0 < £ < 1. Then there is
an n0 = n0(e; plt..., pr) swc/i that for all n> n0
(a) 6?n(H-£) < N(C(n, £0) < e"(H+£);
(b) e~n{H+t) < p(co) < e-nlH-£\ co e C(n, fil);
(c) P(C(h, fil)) = £ p(co) - 1, n - oo,
toeC(n.e„)
where
ex is the smaller of e and e/< — 2 £ In pk >.
Proof. Conclusion (c) follows from the law of large numbers. To establish
the other conclusions, we notice that if co e C(n, e) then
npk — e,n < vk(co) < npk + £^, k = 1,..., r,
and therefore
p{co) = exp{ -£ vk In pk) < exp{-n £ pk In pk - etn £ In pk}
<exp{-n(/f-i£)}.
Similarly
pico) > exp{- n(H + j£)}.
Consequently (b) is now established.
Furthermore, since
P(C(n, £0) ^: N(C(n, £0) • min pico),
toeC{n,£i)
we have
N(C(M'£l))- min p(co)<e-"(H+(1/2>£> '
and similarly
N(C(n, 8l» a-*<ML > P(C(«, £l)K"H-("2)0-
max p(co)
t»eC(n, Ei)
Since P(C(n, ^)) -> 1, n -> oo, there is an nx such that P(C(n, ^)) > 1 — £
for n > nu and therefore
NiCin, £l)) > (1 - £) expMH - J)}
= exp{n(H - £) + (infi + ln(l - £))}.
54
I. Elementary Probability Theory
Let n2 be such that
ine + ln(l - £) > 0.
for n> n2. Then when n > n0 = max(iii, n2) we have
This completes the proof of the theorem.
5. The law of large numbers for Bernoulli schemes lets us give a simple and
elegant proof of Weierstrass's theorem on the approximation of continuous
functions by polynomials.
Let/ = f(p) be a continuous function on the interval [0,1]. We introduce
the polynomials
b„(p) = i /(-) c*Py-\
*=o yv
which are called Bernstein polynomials after the inventor of this proof of
Weierstrass's theorem.
If £i,..., £n is a sequence of independent Bernoulli random variables
with P{£,- = 1} = p, P{£ = 0} = q and Sn = ^ + • • • + <L then
«©-
B„(P).
Since the function f = /(p), being continuous on [0, 1], is uniformly
continuous, for every £ > 0 we can find 5 > 0 such that | /(x) - f(y)\ < e
whenever |x - y\ < 3. It is also clear that the function is bounded: | /(x)| <,
M < oo.
Using this and (5), we obtain
i f(p) - bm\ = 11 [m - /(£)] c*pv-
< z \m-f(-n
+ I
{k:|(k/n)-p|>5}
m-f[-
Ctfq"-
Cknpkqn~
M
Hence
< £ + 2M z ckyq»-k <£+™=s +
[k:Ukirii-P\>5) 4«d 2nd
lim max | f(p) - Bn(p)\ = 0,
n-»oo 0<p^l
which is the conclusion of Weierstrass's theorem.
§6. The Bernoulli Scheme. II. Limit Theorems (Local, De Moivre-Laplace, Poisson) 55
6. Problems
1. Let ^ and i/ be random variables with correlation coefficient p. Establish the following
two-dimensional analog of Chebyshev's inequality:
P{|£ " E£| > ejvtor |i/ - Ei/| > e./Sh,} <i(l + yfT^p1).
(Hint: Use the result of Problem 8 of §4.)
2. Let f = f{x) be a nonnegative even function that is nondecreasing for positive x.
Then for a random variable £ with | £(co) \ < C,
f(Q /00
In particular, if f(x) = x2,
EE2 — e2 V£
qc2 <P{|g-Eg|>£}<-^.
3. Let £t,..., £n be a sequence of independent random variables with V££ < C. Then
l\ n n I J nez
(15)
(With the same reservations as in (8), inequality (15) implies the validity of the law of
large numbers in more general contexts than Bernoulli schemes.)
4. Let f lt..., £„ be independent Bernoulli random variables with P{& = 1} = p > 0,
P{£,- =—1} = 1 — p. Derive the following inequality of Bernstein: there is a number
a > 0 such that
||5l_(2p-l)|>£l<2c--2",
P<
where Sn = f t H h <!;„ and £ > 0.
§6. The Bernoulli Scheme. II. Limit Theorems
(Local, De Moivre-Laplace, Poisson)
1. As in the preceding section, let
Sn = £i +••• + &,.
Then
e|-* (1)
and by (4.14)
■(H'-f
(2)
56
I. Elementary Probability Theory
It follows from (1) that SJn ~ p, where the equivalence symbol ~ has been
given a precise meaning in the law of large numbers as the assertion
P{l(s„A0 - p| > e} -> 0. It is natural to suppose that, in a similar way, the
relation
ft
(3)
which follows from (2), can also be given a precise probabilistic meaning
involving, for example, probabilities of the form
- P
<x I— k xeR\
or equivalently
ii Tvs: I j
(since ES„ = np and VS„ = npq).
If, as before, we write
Pn(k) = C*pY
0 < k < n,
for n > 1, then
p{|^^h4= £ PM-
(4)
We set the problem of finding convenient asymptotic formulas, as n -> oo,
for Pn(k) and for their sum over the values of k that satisfy the condition on
the right-hand side of (4).
The following result provides an answer not only for these values of k
(that is, for those satisfying \k - np\ = 0(y/npq)) but also for those satisfying
\k-np\ = o{npqfl\
Local Limit Theorem. Let 0 < p < 1; then
1
Pnik) <
(k-np)2/(2npg)
y/2nnpq
uniformly for k such that \k — np\ = o(npq)2/3, i.e. as n -> oo
(5)
sup
[k:\k-np\<vin))
Pnik)
1 e-(fc-np)2/(2npq)
Jlnnpq
► 0,
where <p(ri) = o(npq)2'
§6. The Bernoulli Scheme. II. Limit Theorems (Local, De Moivre-Laplace, Poisson) 57
The proof depends on Stirling's formula (2.6)
n\ =j27ine-nnn(l + R(n)\
where R(ri) -> 0 as n -> oo.
Then if n -*■ oo, k -> oo, n — k -> oo, we have
'" fc!(n-fc)!
yfhtne~nnn 1 + R(n)
Jink • 2tt(« - fc) «"*** ■ e-t"-k>(« - fc)»-*(l + K(fr))(l + J*(n - k))
1 1 + fi(n, k,n-k)
fie m-r
where £ = £(n, k, n — k) is defined in an evident way and £ -> 0 as n -» oo,
fr -» oo, n — k -> oo.
Therefore
fei-f
Write p = k/n. Then
1 /n\k/\ n\n-k
J2nnp{\ - p)
w._4^©'(^)"-«»
= , 1 expiring + fn - fc)ln^ ^1-(1 + £)
= y 1 exp{.^lng+(l-^lnJ-|1j(l+£)
V2tt^(1 -/>) I |_" ^ \ n) \- p\ j
' ^2^(1 - p)
exp{-«H(p)}(l +£),
where
H(x) = xln- + (1 -x)\n\ -.
We are considering values of k such that |k - np\ = o(npq)2,3t and
consequently p — /? -> 0, n -> oo.
58
I. Elementary Probability Theory
Since, for 0 < x < 1,
x2 (1-x)2'
if we write H(p) in the form H(p + (p - p)) and use Taylor's formula, we
find that for sufficiently large n
H(p) = H(p) + H'(p)(P - p) + iH"(pKp - p)2 + OQp - p\3)
= \(~ + ^iP - P)2 + 0(\p - p\3).
Consequently
Pn(k) = I exJ- JLtf - pY + nO{\p - p|3)l(l + £)•
y/2nnp{\ - P) I 2pq J
Notice that
np)2
" ,* x2 n (k V (k
2pq 2pq \n J 2npq
Therefore
Pn(k) = 1 g-(fc-"p)2/(2np9)(1 + £/(n> fc> „ _ fc))>
V27wpg
3'(n, fc, n - fc) = (1 + £(n, fc, n - fc))exp{n OQp - p\3)} A*| _ ^
where
l+a'(
and, as is easily seen,
sup|£'(n, k, n — k)\ -> 0, n -> oo,
if the sup is taken over the values of k for which
\k — np\ < <p(ri), <p(n) = o(npq)213.
This completes the proof.
§6. The Bernoulli Scheme. II. Limit Theorems (Local, De Moivre-Laplace, Poisson) 59
Corollary. The conclusion of the local limit theorem can be put in thefollowing
equivalent form: For all xeR1 such that x = o(npq)lt6t and for np + Xyjnpq
an integer from the set {0,1 n},
Pninp + xjnpq) ~ e~x2'2t
yjlnnpq
i.e. as n -> oo,
sup
{x:|x|<0(n)}
Pn(np + Xy/npq)
1
yjlnnpq
-x2/2
(7)
(8)
where i]/(ri) = o(npq)1/6.
With the reservations made in connection with formula (5.8), we can
reformulate these results in probabilistic language in the following way:
P{Sn = k} ~ 1 e-ik-nP)2,{2nPq)t |fc _ npj = ofapq)2'3, (9)
yjlnnpq
p\Sn~np = A L
1 \/npq J ~ yfht
yjlnnpq
e-x2l\ x = o(npq)116.
(10)
(In the last formula np + Xyjnpq is assumed to have one of the values
0, 1,..., n.)
If we put tk = (k — np)/yfnpq and Atk = tk+l — tk = l/y/npq9 the
preceding formula assumes the form
(^-np = J _ ^-,^ h = 0(npq)m (11)
I yjnpq J yjln
It is clear that Atk = l/y/npq -> 0 and the set of points {tk} as it were
"fills" the real line. It is natural to expect that (11) can be used to obtain the
integral formula
I \fnpq J yJlnJa
— co < a < b < oo.
Let us now give a precise statement.
2. For — oo < a < fc < oo let
Pn(a,b-]= I Pn(np + XyfnTq\
where the summation is over those x for which np + Xyjnpq is an integer.
60
I. Elementary Probability Theory
It follows from the local theorem (see also (11)) that for all tk defined by
k = np + ffcx/npg and satisfying \tk\ < T < oo,
Pn(np + ^/¾ = -7= e"'2/2[1 + e('*' n)1 (12)
„"'2/2
/27T
where
sup \e(tk, n)\ -► 0, n -> 00. (13)
l'k|<T
Consequently, if a and!? are given so that — T < a <b < % then
= -^= f ^"x2/2 <** + Jtfte /7) + K<2)(fl, b\ (14)
where
a<tk^by/2n y/111 Ja
a<tk£b y/27l
From the standard properties of Riemann sums,
sup 1/^(^)1-0, n-00. (15)
-T<a<b£T
It also clear that
sup \R™(a9b)\
-T£a£b£T
< sup \e(tktn)\. £ -%e-*12
\tk\zT \tk\<T y/2n
< sup \e(tk,n)\
\'u\<T
x \-±= f e-*2'2 dx + sup 1¾¾ b)\\ - 0, (16)
Lv27lJ-T -T£a£b<T J
where the convergence of the right-hand side to zero follows from (15) and
from
-^L f e~x2'2 dx<-^=r e-*2'2 dx = 1, (17)
the value of the last integral being well known.
§6. The Bernoulli Scheme. II. Limit Theorems (Local, De Moivre-Laplace, Poisson) 61
We write
<D(x) =
1 f e-'2/2dt.
Then it follows from (14)-(16) that
sup | Pn(a, f] - (Otf?) - 0>(a))| - 0, n - oo. (18)
-T<a£b£T
We now show that this result holds for T = oo as well as for finite T. By
(17), corresponding to a given £ > 0 we can find a finite T = T(e) such that
-^=1 e-x2/2dx> 1 -i £.
(19)
According to (18), we can find an N such that for all n > N and T = T(s)
we have
sup | Pn(a, b] - WJb) - 0>(a))| < i £. (20)
-T£a<b£T
It follows from this and (19) that
P„(-7; T] > 1 - ±£,
and consequently
JU-oo, ^] + Pn(T, oo) < Jfi,
where Pn(- oo, T] = limsj^ Pn(S, T] and Pn(T, oo) = limstoo Pn(T, S].
Therefore for -oo < a < -T< T< b <> oo,
|pn(fl,!7]-4=r^x2/2^|
I v27C ^° I
<|pn(-x;r]-^Lf e-^2rfx
+ I P„(a, - T] - -^ f T e-*2'2 dx I + I Pn(T, « - -^= (V*1'2 <** I
I yj2nJa | | yjlnh |
< i£ + P„(- oo, - T] + -4= f ^""2/2 <** + Pn(Tt oo)
4 ^/2nJ-oo
By using (18) it is now easy to see that Pn(a, b] tends uniformly to d>(b) —
O(a) for — oo < a < b < oo.
Thus we have proved the following theorem.
62
I. Elementary Probability Theory
De Moivre-Laplace Integral Theorem. Let 0 < p < 1,
Pn(k) = CfcnPy-\ Pn(a,b-]= £ Pn(np + Xyfiti),
a<x£b
Then
sup Pn(a, f] - -^= f 6? ~x2/2 rfx - 0, n - oo. (21)
-oo<a<b<oo | yJ2>7l *° I
With the same reservations as in (5.8), (21) can be stated in probabilistic
language in the following'way:
I f Sn-ESn <b\--^= Ce-x2'2dxUo, n^co.
sup \P<a<—^=- J ^/¾^ |
-oo<a<b<oo| (. V*«Jn
It follows at once from this formula that
P{A <S^B}- \*(^S) - *fe?)l - 0, (22)
L V y/npq J \ yjnpq /J
as n -> oo, whenever —oo</l<B<oo.
Example. A true die is tossed 12 000 times. We ask for the probability P that
the number of 6's lies in the interval (1800,2100].
The required probability is
/l\*/5\ 12000-fc
"~ 2L £-120001/:
1800<fc;£2100 \°,
An exact calculation of this sum would obviously be rather difficult.
However, if we use the integral theorem we find that the probability P in
question is (n = 12 000, p = £, a = 1800, b = 2100)
Of2)00-2000) - of1?00"2000) = 0(,/6) - 0(-2^6)
Vl20004-f/ Vl2000.£-f/ W V
« 0(2.449) - 0(- 4.898) « 0.992,
where the values of 0(2.449) and 0(-4.898) were taken from tables of 0(x)
(this is the normal distribution function; see Subsection 6 below).
3. We have plotted a graph of Pn(np + Xy/npq) (with x assumed such that
np + Xy/ripq is an integer) in Figure 9.
Then the local theorem says that when x = o(npq)116, the curve
(l/y/27tnpq)e~x2t2 provides a close fit to Pn(np + x^/npq). On the other hand
the integral theorem says that Pn(a, b~\ = P{a^/npq < Sn — np < b^/npq} =
P{np + Oy/npq < Sn<np + byjnpq) is closely approximated by the integral
(1/7¾^2 dx.
§6. The Bernoulli Scheme. II. Limit Theorems (Local, De Moivre-Laplace, Poisson) 63
P„(np + Xy/npq)k
^X-
s
s
s
'
•s.
s.
1 c-x*n
Innpq
0
Figure 9
We write
™-«-«^ (=pfe?-})-
Then it follows from (21) that
sup |F„(x) - 0>(x)| -
-oo<Jf<oo
(23)
It is natural to ask how rapid the approach to zero is in (21) and (23),
as n -> oo. We quote a result in this direction (a special case of the Berry-
Esseen theorem: see §6 in Chapter III):
sup |F„(x)-
-oo<Jf^oo
*Mi <
p2 + q2
(24)
npq
It is important to recognize that the order of the estimate (1/y/npq)
cannot be improved; this means that the approximation of F„(x) by O(x)
can be poor for values of p that are close to 0 or 1, even when n is large. This
suggests the question of whether there is a better method of approximation
for the probabilities of interest when p or q is small, something better than
the normal approximation given by the local and integral theorems. In this
connection we note that for p = |, say, the binomial distribution {Pn(k)} is
symmetric (Figure 10). However, for small p the binomial distribution is
asymmetric (Figure 10), and hence it is not reasonable to expect that the
normal approximation will be satisfactory.
PJLk)
0.3-^
0.2 +
0.1 +
w
p = \t n = 10
b
0.3 +
0.2
0.1 +
*=*-
/rv P = hn= 10
K
v\
I
frn-
Figure 10
64
I. Elementary Probability Theory
4. It turns out that for small values of p the distribution known as the Poisson
distribution provides a good approximation to {P„(k)}.
Let
p^_ICnW-\ k = 0,l,...,n,
nK ' |0, fc = n + 1, n + 2,. .,
and suppose that p is a function p(n) of n.
Poisson's Theorem. Let p{n) -> 0, n -> oo, in swc/i a way that np(n) -> A,
wnere A > 0. Then for k =1,2,...,
PM-*nk9 n-oo, (25)
vvnere
"* = -Jp. ^ = 0,1,.... (26)
The proof is extremely simple. Since p(n) = (X/n) + o(l/n) by hypothesis,
for a given /c = 0, 1,... and sufficiently large n,
PM = Cknpkq»-k
But
n(n_l)...(n_fc + l)^ + 0^T
n(n-l)---(n-/c+ 1),., ,
= 7 \_X + o(l)J* -> AT, n -> oo,
and
which establishes (25).
The set of numbers {nk, k = 0, 1,...} defines the Poisson probability
distribution (nk > 0, Jj?=o ™k = 1)- Notice that all the (discrete) distributions
considered previously were concentrated at only a finite number of points.
The Poisson distribution is the first example that we have encountered of a
(discrete) distribution concentrated at a countable number of points.
The following result of Prokhorov exhibits the rapidity with which Pn(k)
converges to nk as n -> oo: if np(n) = X > 0, then
oo 2/1
£ | Pn(k) - nk | < — • min(2, X). (27)
^ = n n
(A proof of a somewhat weaker result is given in §12, Chapter III.)
§6. The Bernoulli Scheme. II. Limit Theorems (Local, De Moivre-Laplace, Poisson) 65
5. Let us return to the De Moivre-Laplace limit theorem, and show how it
implies the law of large numbers (with the same reservation that was made
in connection with (5.8)). Since
p{|^_pU8j = p{|^UeR
Un I J II yfim I VP4)
it is clear from (21) that when £ > 0
P\\--P <4--7= e-x2'2dx-+0,
till" I J v^J-*^
(28)
whence
■OH 4
n -> oo,
which is the conclusion of the law of large numbers.
From (28)
flc I ") 1 reyffipq
P\\--P <4 7=\ e-*2>2dx,
whereas Chebyshev's inequality yielded only
(29)
KIH-H-s-
It was shown at the end of §5 that Chebyshev's inequality yielded the estimate
1
n >
4e2a
for the number of observations needed for the validity of the inequality
p||—-p| <el > 1 -a.
■AH4
Thus with £ = 0.02 and a = 0.05, 12 500 observations were needed. We can
now solve the same problem by using the approximation (29).
We define the number k(a) by
{|!-p|s«}ai-*
"*2/2 dx = 1 - a.
Since £ ^/(n/pq) > 2fiN/n, if we define n as the smallest integer satisfying
2e^i > k(a) (30)
we find that
1½
n
(31)
66
I. Elementary Probability Theory
We find from (30) that the smallest integer n satisfying
guarantees that (31) is satisfied, and the accuracy of the approximation can
easily be established by using (24).
Taking £ = 0.02, a = 0.05, we find that in fact 2500 observations suffice,
rather than the 12 500 found by using ChebysheVs inequality. The values
of k(cx) have been tabulated. We quote a number of values of k(a) for various
values of a:
a
0.50
0.3173
0.10
0.05
0.0454
0.01
0.0027
m
0.675
1.000
1.645
1.960
2.000
2.576
3.000
6. The function
<*>(x) = -4= f e~t2'2 dt> (32)
yjlll J-00
which was introduced above and occurs in the De Moivre-Laplace integral
theorem, plays an exceptionally important role in probability theory. It is
known as the normal or Gaussian distribution on the real line, with the
(normal or Gaussian) density
0.67 1.96 2.58
Figure 11. Graph of the normal probability density <p(x).
§6. The Bernoulli Scheme. II. Limit Theorems (Local, De Moivre-Laplace, Poisson) 67
<D(x) = 4=r- Fe-y2'2 dy
Figure 12. Graph of the normal distribution <b(x).
We have already encountered (discrete) distributions concentrated on a
finite or countable set of points. The normal distribution belongs to another
important class of distributions that arise in probability theory. We have
mentioned its exceptional role; this comes about, first of all, because under
rather general hypotheses, sums of a large number of independent random
variables (not necessarily Bernoulli variables) are closely approximated by
the normal distribution (§4 of Chapter III). For the present we mention only
some of the simplest properties of <p(x) and €>(x), whose graphs are shown in
Figures 11 and 12.
The function <p(x) is a symmetric bell-shaped curve, decreasing very
rapidly with increasing |x|: thus <p(l) = 0.24197, <p(2) = 0.053991, <p(3) =
0.004432, <p(4) = 0.000134, <p(5) = 0.000016. Its maximum is attained at
x = 0 and is equal to (In)'1/2 « 0.399.
The curve <E(x) = (l/^/ln) ^-^e''2'2 dt approximates 1 very rapidly
as x increases: ¢(1) = 0.841345, ¢(2) = 0.977250, ¢(3) = 0.998650, ¢(4) =
0.999968, ¢(4, 5) = 0.999997.
For tables of <p(x) and ®(x), as well as of other important functions that
are used in probability theory and mathematical statistics, see [Al].
7. At the end of subsection 3, §5, we noticed that the upper bound for the
probability of the event {co: \(SJn) — p\ > £}, given by Chebyshev's
inequality, was rather crude. That estimate was obtained from Chebyshev's
inequality ?{X > e} < EX2/e2 for nonnegative random variables X > 0. We may,
however, use Chebyshev's inequality in the form
P{X>£} = P{X2k>£2k}<
EX2k
p2*
(33)
However, we can go further by using the "exponential form" of Chebyshev's
68
I. Elementary Probability Theory
inequality: if X > 0 and X > 0, this states that
?{X > £} = ?{exx > ex'} < Ee«x~s\ (34)
Since the positive number X is arbitrary, it is clear that
P{X>e}< inf Ee«x-S\ (35)
Let us see what the consequences of this approach are in the case when
X = SJntSn = ^+- + ^P{^=l) = p,P^i = 0) = qJ>l.
Let us set <p(X) = Ee**K Then
q>(k) = 1 - p + pex
and, under the hypothesis of the independence of f lf £2, -. •, £„,
E**" = [<p(A)]".
Therefore, (0 < a < 1)
•£4
, < juf £gA(Sn/n-a) _ jnf e-n[Aa/n-ln«>(A/n)]
= inf e~nfflS~ln «»(«)] = g-nsupa>0[as-ln ^(s)^ /3^
s>0
Similarly,
1i4
The function /(s) = as — log[l — p + pes~\ attains its maximum for
p ^ a ^ 1 at the point s0(f'(s0) = 0) determined by the equation
p(l-a)'
Consequently,
sup/(s) = H(a),
s>0
where
H(a) = aln- + (l-a)ln^—-
p 1 -p
is the function that was previously used in the proof of the local theorem
(subsection 1).
Thus, for p ^ a < 1
■{!-}
< e"nH(fl), (38)
and therefore, since H(p + x) > 2x2 and 0 ^ p + x ^ 1, we have, for £ > 0
and 0 < p ^ 1,
§6. The Bernoulli Scheme. II. Limit Theorems (Local, De Moivre- Laplace, Poisson) 69
P&-p>e\<e-2n£2. (39)
We can establish similarly that for a ^ p <, 1
pj^<flj<e-nH(fl), (40)
and consequently, for every £ > 0 and 0 < p < 1,
P&-p<-e\<e-2n£2. (41)
Therefore,
p||^-p|>£J<2e-2"£2. (42)
Hence, it follows that the number n3(a) of observations of the inequality
p{|^-p|<;£J>l-a, (43)
that are guaranteed to be satisfied for every p, 0 < p < 1, is determined by the
formula
[ln(2/<x)l
(44)
where [x] is the integral part of x. If we neglect "integral parts" and compare
n3(a) with nx(a) = [(4ae2)-1], we find that
"M L^oo, «i0.
a
It is clear from this that when a [ 0, an estimate of the smallest number of
observations needed that can be obtained from the exponential Chebyshev
inequality is more precise than the estimate obtained from the ordinary
Chebyshev inequality, especially for small a.
There is no difficulty in applying the formula
1 f00
?-y2/2dy 1^-tf"*2/2.
/27CX
to show that k2{<x) ~ 2 log(2/a) when a |0. Therefore,
^-1, «10.
Inequalities like (38)-(42) are known as inequalities for the probability of
large deviations. This terminology can be explained in the following way.
70
I. Elementary Probability Theory
The De Moivre-Laplace integral theorem makes it possible to estimate in
a simple way the probabilities of the events {\Sn — np\< Xy/n}
characterizing the "standard" deviation (up to order y/n) of Sn from np. Even the
inequalities (39), (41), and (42) provide an estimate of the probabilities of the
events {co: \Sn — np\ < xn}, describing deviations of order greater than yfn,
in fact of order n.
We shall continue the discussion of probabilities of large deviations, in
more general situations, in §5, chap. IV.
8. Problems
1. Let n = 100, p = i^, re» lfc» T5» T5- Using tables (for example, those in [Al]) of the
binomial and Poisson distributions, compare the values of the probabilities
P{10 < S.oo < 12}, P{20 < Sl00 < 22},
P{33 < Sl00 < 35}, P{40 < S.oo < 42},
P{50 < S.oo < 52}
with the corresponding values given by the normal and Poisson approximations.
2. Let p = 5 and Zn = 2Sn - n (the excess of l's over 0's in n trials). Show that
sup|y/^P{Z2n = j} - e-?i*n\ - 0, n - oo.
j
3. Show that the rate of convergence in Poisson's theorem is given by
k\ | n
§7. Estimating the Probability of Success
in the Bernoulli Scheme
1. In the Bernoulli scheme (Q, sf, P) with Q = {coico = (xu..., xn), xt =
0,1)}^ = A: A cQ),
P(cd) = p=*^-s*',
we supposed that p (the probability of success) was known.
Let us now suppose that p is not known in advance and that we want to
determine it by observing the outcomes of experiments; or, what amounts
to the same thing, by observations of the random variables £u ..., £Bf where
££co) = xt. This is a typical problem of mathematical statistics, and can be
formulated in various ways. We shall consider two of the possible
formulations: the problem of estimation and the problem of constructing confidence
intervals.
In the notation used in mathematical statistics, the unknown parameter
is denoted by 6, assuming a priori that 6 belongs to the set © = [0, 1]. We
say that the set (A s/, Pe; 6 e ©) with pe(co) = 6* *•(! - 0)"-£ *< is a probabil-
sup
§7. Estimating the Probability of Success in the Bernoulli Scheme
71
istic-statistical model (corresponding to " n independent trials " with
probability of "success" 0e©), and any function Tn = Tn(oS) with values in © is
called an estimator.
If Sn = f j + • • • + £„ and T* = Sjn, it follows from the law of large
numbers that T* is consistent, in the sense that (e > 0)
Pe{|7*-6>|>£}-0, n-oo. (1)
Moreover, this estimator is unbiased: for every 0
EeT* = 6>, (2)
where Ee is the expectation corresponding to the probability Pe.
The property of being unbiased is quite natural: it expresses the fact that
any reasonable estimate ought, at least "on the average," to lead to the
desired result. However, it is easy to see that T* is not the only unbiased
estimator. For example, the same property is possessed by every estimator
_ Mi + • • • + bnxn
n
where by + • • • + bn = n. Moreover, the law of large numbers (1) is also
satisfied by such estimators (at least if \bt\ < K < oo; see Problem 2, §3,
Chapter III) and so these estimators Tn are just as "good" as T*.
In this connection there arises the question of how to compare different
unbiased estimators, and which of them to describe as best, or optimal.
With the same meaning of "estimator," it is natural to suppose that an
estimator is better, the smaller its deviation from the parameter that is being
estimated. On this basis, we call an estimator fn efficient (in the class of
unbiased estimators 7^) if,
Vefn = infVeTn, 0 e©, (3)
where Ve Tn is the dispersion of Tn, i.e. Ee(Tn — 6)2.
Let us show that the estimator T*, considered above, is efficient. We have
/Sn\ yeSn "0(1 - 0) 0(1 - 0)
VeT* = Ve(— = -V = 2 = -• (4)
6 n \n) n2 n2 n
Hence to establish that T* is efficient, we have only to show that
Wv.r.a«LLS. (5)
This is obvious for 0 = 0 or 1. Let 0 e (0, 1) and
Pe(x,-) = 0^(1-0)1^.
It is clear that
pe(a>) = fl Pe(xd.
72
I. Elementary Probability Theory
Let us write
Le(ctf) = In pe(coi).
Then
Le(co) = In 0 - £ xt + ln(l - 0)£(1 - xt)
and
dLe(a>) _ £(*, - 0)
d0 0(1 - 0) '
Since
1 =Eel = £pe(a>),
CO
and since rn is unbiased,
0 = Eern = £rn(a>)Pe(co).
CO
After differentiating with respect to 0, we find that
Therefore
and by the Cauchy-Bunyakovskii inequality,
whence
^-^^- (6)
where
is known as Fisher's information.
From (6) we can obtain a special case of the Rao-Cramer inequality
for unbiased estimators Tn:
T'^mr (7>
§7. Estimating the Probability of Success in the Bernoulli Scheme 73
In the present case
0)]2 0(1 - I
which also establishes (5), from which, as we already noticed, there follows
the efficiency of the unbiased estimator T* = SJn for the unknown
parameter 0.
2. It is evident that, in considering T* as a pointwise estimator for 6, we have
introduced a certain amount of inaccuracy. It can even happen that the
numerical value of T* calculated from observations of xi,...,x„ differs
rather severely from the true value 0. Hence it would be advisable to
determine the size of the error.
It would be too much to hope that T*(co) differs little from the true value
6 for all sample points co. However, we know from the law of large numbers
that for every 3 > 0 and for sufficiently large n, the probability of the event
{|0 - T*(co)| > 3} will be arbitrarily small.
By Chebyshev's inequality
and therefore, for every X > 0,
ith Pe-proba
1 M1 -e)
If we take, for example, A = 3, then with Pe-probability greater than 0.8
(1 - (1/32) = | « 0.8889) the event
10 - T*| < 3
will be realized, and a fortiori the event
|0-T*|<-^=,
2y/n
since 0(1 - 0) < \.
Therefore
4" - ™ * iri=Pelr" - ir* *' *T*+Ma °-8888-
In other words, we can say with probability greater than 0.8888 that the exact
value of 6 is in the interval [T* - (3/2^), T* + (3/2^/¾]. This statement
is sometimes written in the symbolic form
74
I. Elementary Probability Theory
6^T*±—r (>88%),
where " >88 %" means "in more than 88 % of all cases."
The interval [T* - (3/2,/n), T* + (3/2^/n)] is an example of what are
called confidence intervals for the unknown parameter.
Definition. An interval of the form
»l(4 *2(0>)]
where ij/i(co) and ^2(w) ^re functions of sample points, is called a confidence
interval of reliability 1 — 3 (or of significance level 3) if
P*{*iM < 6 < ij/2(co)} > 1 - <5.
forall0e©.
The preceding discussion shows that the interval
X — 'Ll
T*-^=,T;+:
ijn" " 2jn~\
has reliability 1 — (1/A2). In point of fact, the reliability of this confidence
interval is considerably higher, since Chebyshev's inequality gives only
crude estimates of the probabilities of events.
To obtain more precise results we notice that
|co: |0 - T*\ < xj^^i = {co: ^(TJ, n) < 6 < ^(T*. n»,
where tyx = ^(T*, n) and i]/2 = ii/2(T%> n) are the roots of the quadratic
equation
(0-T*)2 = ^0(1-0),
which describes an ellipse situated as shown in Figure 13.
Figure 13
§7. Estimating the Probability of Success in the Bernoulli Scheme
75
Now let
1^*0(1 - 0) J
Then by (6.24)
1
sup|F2(x)-0>(x)| <
: y/nOO. - 0)
Therefore if we know a priori that
O<A<0<1-A<1,
where A is a constant, then
1
sup|Fg(x)-0>(x)|<-
and consequently
Pb{*i(T*, n) < 6< ij,2(T*, n)} = Po{|0 - T*\ < xlfXLJ>\
UMi - 6) J
> (20(A)- 1)- -^=.
Ay/n
Let X* be the smallest X for which
(20(A) -1)- -1= > 1 - 3*,
Ayjn
where 3* is a given significance level. Putting 3 = 3* — (2/Ay/n), we find
that X* satisfies the equation
<D(A)=l-i&
For large n we may neglect the term 2/A^/n and assume that X* satisfies
<D(A*) = l-is*.
In particular, if A* = 3 then 1 - 3* = 0.9973 .... Then with probability
approximately 0.9973
-"> (8)
or, after iterating and then suppressing terms of order 0(n~3/4), we obtain
76
I. Elementary Probability Theory
T* _ 3 /Tn*(l^ T*) < 0 < Tn* + 3 /^1
(9)
(10)
*(1 - T*)
Hence it follows that the confidence interval
L 2y/n 2y/n\
has (for large n) reliability 0.9973 (whereas Chebyshev's inequality only
provided reliability approximately 0.8889).
Thus we can make the following practical application. Let us carry out
a large number N of series of experiments, in each of which we estimate the
parameter 0 after n observations. Then in about 99.73 % of the N cases, in
each series the estimate will differ from the true value of the parameter by
at most 3/2^/n. (On this topic see also the end of §5.)
3. Problems
1. Let it be known a priori that 6 has a value in the set 0O S [0,1]. Construct an unbiased
estimator for 0, taking values only in 0O.
2. Under the hypotheses of the preceding problem, find an analog of the Rao-Cramer
inequality and discuss the problem of efficient estimators.
3. Under the hypotheses of the first problem, discuss the construction of confidence
intervals for 6.
§8. Conditional Probabilities and Mathematical
Expectations with Respect to Decompositions
1. Let (Q, sf, P) be a finite probability space and
9 = {Du...,Dk}
a decomposition of Q (Dt e sf, P(Z)£) > 0, i = 1,..., k, and Dy + • • • + Dk =
Q). Also let A be an event from sf and P(A\Dt) the conditional probability of
A with respect to Dt.
With a set of conditional probabilities {P(A\Dd, i = 1, • • -, k} we may
associate the random variable
tt(o>)= X P{A\DdlDl{w) (1)
(cf. (4.5)), that takes the values P(A\Dt) on the atoms of Z),-. To emphasize
that this random variable is associated specifically with the decomposition
% we denote it by
P(A\9) or P{A\S)){cd)
§8. Conditional Probabilities and Mathematical Expectations
77
and call it the conditional probability of the event A with respect to the
decomposition Q).
This concept, as well as the more general concept of conditional
probabilities with respect to a <r-algebra, which will be introduced later, plays an
important role in probability theory, a role that will be developed progressively
as we proceed.
We mention some of the simplest properties of conditional probabilities:
P(A + B\2>) = P(A\2>) + P(B\®)\ (2)
if Q) is the trivial decomposition consisting of the single set Q then
P04|O) = P(A). (3)
The definition of P(A\@) as a random variable lets us speak of its
expectation; by using this, we can write the formula (33) for total probability
in the following compact form:
EP(A\@) = P(A). (4)
In fact, since
P(A\@)= £ P{A\DtfDi(o>\
i= 1
then by the definition of expectation (see (4.5) and (4.6))
k k
EPiA\m = £ P(A\DdP(Dd = I P(ADd = P(A).
£=1 i=l
Now let r\ = rj(co) be a random variable that takes the values yu...,yk
with positive probabilities:
k
ri(co) = £ yjID.(co),
j=i
where D} = {co: rj(co) = j/J. The decomposition Q)^ = {/)^..., Dk} is called
the decomposition induced by rj. The conditional probability P(A\@,j) will
be denoted by P(A\n) or P(A\n\co\ and called the conditional probability
of A with respect to the random variable r\. We also denote by P(A\r\ =yj)
the conditional probability P(A\Dj), where Dj = {co: rj(co) = y$.
Similarly, if rjlt rj2,..., rjm are random variables and Bnuni nm is the
decomposition induced by rjlt rj2, . • •, r\m with atoms
Dyi,y2 ym = {co: ny(co) = yu..., rjm(co) = yj,
then P(A|Dnun ,J will be denoted by P(A\nu n2t..., rjm) and called the
conditional probability of A with respect to nu n2, •.., nm.
Example 1. Let £ and rj be independent identically distributed random
variables, each taking the values 1 and 0 with probabilities p and q. For k =
0, 1, 2, let us find the conditional probability P(£ + r\ = k\rj) of the event
A = {co:£ + n = k} with respect to rj.
78
I. Elementary Probability Theory
To do this, we first notice the following useful general fact: if £ and rj are
independent random variables with respective values x and y, then
P{Z + ri = z\rl = y)=P{Z,+y = z). (5)
In fact,
P(£ + rj = z, rj = y)
P(£ + fl=f\fl = y)-
Pfo = y)
P(£ + y = z,n = y) P(£ + y = z)PQ? = q)
p(ii = j') Pto = 3>)
= P« + y = z).
Using this formula for the case at hand, we find that
P« + n = k\rj) = P« + n = fc|iy = 0)/(,.0,(0)
+ PK + iy = *|ij = 1)/(,-1,(0))
= P« = k)I{„=0)(co) + P« = * - l}/h= i,(o»).
Thus
f «/{,-0,(0»), * = 0.
P« + 17 = *|i7) = I p/„-0j(o») + «/„- 1}(o)), * = 1, (6)
[p/{„=i,(o>), k = 2,
or equivalently
[¢(1 — if X k = 0,
P« + tf = *l*) = jp(l-ij) + OT, *=1, (7)
[PI. k = 2,
2. Let £ = £(co) be a random variable with values in the set X = {xl9..., x„} :
i
£ = £ XjIAj(co\ Aj = {co:£ = Xj}
and let B = {Du ..., Dk} be a decomposition. Just as we defined the
expectation of £ with respect to the probabilities P(Aj),j = 1,...,/.
E£ = I *,PG4A (8)
./=1
it is now natural to define the conditional expectation of £ with respect to B
by using the conditional probabilities P(A}\Q)\ j = 1,..., /. We denote
this expectation by E(£ | 3)) or E(£ | B) (co), and define it by the formula
E«|0)= £ x,-PC4,.| 0). (9)
y=i
According to this definition the conditional expectation E(^ | 3)) (co) is a
random variable which, at all sample points co belonging to the same atom
§8. Conditional Probabilities and Mathematical Expectations
79
P(-) —®-> E£
1(3.1)
P(-|D) (10) » E«|D)
(1)
(11)
P(-|0) (9) > E«|0)
Figure 14
D£, takes the same value £j=1 XjP^lZ),). This observation shows that the
definition of E(£|<2) could have been expressed differently. In fact, we could
first define E(£|Z>i), the conditional expectation of £ with respect to Dh by
E({\Dd = I XjPtAjm ( = ^jfy, (10)
and then define
Etf|0Xo>)= EE(^|A)/0<M (11)
i=l
(see the diagram in Figure 14).
It is also useful to notice that E(£\D) and E(^| 3)) are independent of the
representation of ^.
The following properties of conditional expectations follow immediately
from the definitions:
E(a£ + bri\3)) = c£(£| &) + bE(n| 3)\ a and b constants; (12)
E(£|Q) = E£; (13)
E(C| 3>) = C, C constant; (14)
if£ = IA(co) then
E(£|0) = PG4|0). (15)
The last equation shows, in particular, that properties of conditional
probabilities can be deduced directly from properties of conditional expectations.
The following important property generalizes the formula for total
probability (5):
EE(£|0) = E£. (16)
For the proof, it is enough to notice that by (5)
EE«| &) = E I XjPlAjl 9) = 2 x,EP04,.| 9) = £ x,P(^) = £¢.
./=1 ./=1 ./=1
Let 3) = {£>! Z)k} be a decomposition and rj = rj(co) sl random
variable. We say that rj is measurable with respect to this decomposition,
80
I. Elementary Probability Theory
or ^-measurable, if 0, =^ 0, i.e. rj = rj(coi) can be represented in the form
k
»7(fi>)= Z J^d.M'
i=l
where some yt might be equal. In other words, a random variable is
immeasurable if and only if it takes constant values on the atoms of 0.
Example 2. If Q) is the trivial decomposition, Q) = {O}, then rj fs
^-measurable if and only if rj = C, where C is a constant. Every random variable
rj is measurable with respect to Q)^.
Suppose that the random variable rj is ^-measurable. Then
E«ij|0) = uE«|0) (17)
and in particular
Efo|0) = ij (Eft 10,,) = *;). (18)
To establish (17) we observe that if £ = £j=1 x^-J^, then
f k
ft = Z Z Xjyi*AjDi
j=n=i
and therefore
E(ft|0) = Z I x^iPM^I®)
= Z Z ^y, Z WAjDADJIoJa)
j=n=i m=i
= Z Z x^PC^DJDOW©)
j=i i=i
= Z Z¥«P(^IW4 (19)
j=i «=i
On the other hand, since I2D. = /^ and IDt • /^ = 0, i ^ m, we obtain
ijEtf| 0) = ^Z yiW©)] • [Z XjPC^-l 0)1
= [Z yiW©)] • JE [z *;P04,-|z)m)l • /*»
= Z Z^^pc^iaO'/dM
.=1./=1
which, with (19), establishes (17).
We shall establish another important property of conditional expectations.
Let S)x and S>2 be two decompositions, with S>x =< S)2 (02 is "finer"
§8. Conditional Probabilities and Mathematical Expectations
81
than S){). Then
EffKI^I^J-EKI^t). (20)
For the proof, suppose that
Then if £ = £j= i xjIaj> we have
E«|02) = Z^.PC^I^),
y=i
and it is sufficient to establish that
E[PU;| 92)\ 9J = P(Aj\ ®x). (21)
Since
P(Aj\@2)= £ P(Aj\D2q)ID2q,
q=l
we have
E[PU^2)I^]= t P(AJ\D2q)P(D2q\3l)
9=1
= I P(^U>2«) jj£ P(D2q\Dlp)ID J
= I^/i PUyl^)P(/)2j/)ip)
p=l g=l
= Z W Z PU;l^2,)P(/)2j/)ip)
p=l {g:D2g = Dlp}
« PjAjD^) P(P2g)
£ ""',,:*>£,»„, P(C2,) P(Ol„)
m
= Z /Dlp-PU;|Z)ip) = PU;|^1),
p=l
which establishes (21).
When 3) is induced by the random variables i/i,..., iyft (0 = ^l7l ^),
the conditional expectation E^l^ „k) is denoted by E(£|>7i,..., rjk),
or E(£|»7i,...,»7fc)(G>), and is called the conditional expectation of £ with
respect torju...ink.
It follows immediately from the definition of E(£|>7) that if £ and rj are
independent, then
E(£k) = E£. (22)
From (18) it also follows that
E<ff |ff> — if- (23)
82 I. Elementary Probability Theory
Property (22) admits the following generalization. Let £ be independent
of Q) (i.e. for each Dt e <& the random variables £ and ID. are independent).
Then
E(£|0) = E£. (24)
As a special case of (20) we obtain the following useful formula:
Zm\ril,rJ2)\ril] = E(t\r,l). (25)
Example 3. Let us find E(£ + rj\ri) for the random variables £ and rj
considered in Example 1. By (22) and (23),
This result can also be obtained by starting from (8):
E(£ + rj\rj) = £ kP{£, +rj = k\rj) = p(l - rj) + qrj + 2prj = p + rj.
k=0
Example 4. Let £ and rj be independent and identically distributed random
variables. Then
E(£|£ + >7) = E07|£ + ,,)=£±^. (26)
In fact, if we assume for simplicity that £ and rj take the values 1,2,..., m,
we find (1 <, k <, m, 2 < I < 2m)
' P« + i| = 0 Ptf + * = /)
P« = /QPfo = / - k) _ P(rj = k)Pjt = l-k)
P(t + rj = l) Ptf + r, = 0
= Pfo = /c|^+ ^ = 0-
This establishes the first equation in (26). To prove the second, it is enough
to notice that
2E(£|£ + rj) = Etftf + 1,) + Efolf + I,) = E« + i||^ + rj) = £ + rj.
3. We have already noticed in §1 that to each decomposition Q) =
{£>!,..., Z)k} of the finite set Q there corresponds an algebra a{Qi) of subsets
of Q. The converse is also true: every algebra & of subsets of the finite space
Q generates a decomposition Q) {!% = ol{3>)). Consequently there is a one-
to-one correspondence between algebras and decompositions of a finite
space Q. This should be kept in mind in connection with the concept, which
will be introduced later, of conditional expectation with respect to the special
systems of sets called tr-algebras.
For finite spaces, the concepts of algebra and tr-algebra coincide. It will
§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing 83
turn out that if & is an algebra, the conditional expectation E(£|«$) of a
random variable £ with respect to {% (to be introduced in §7 of Chapter II)
simply coincides with E(£ | Q)\ the expectation of £ with respect to the
decomposition Q) such that & = a( Q)\ In this sense we can, in dealing with
finite spaces in the future, not distinguish between E(£|^) and E(£|0),
understanding in each case that E(£|^) is simply defined to be E(£| Q)\
4. Problems
1. Give an example of random variables £ and i/ which are not independent but for
which
Etfl*) = E£
(a. (22).)
2. The conditional variance of £ with respect to <& is the random variable
V(£|0)=E[(£-E(£|0))2|0].
Show that
V£= EV(£|0) + VE(£|0).
3. Starting from (17), show that for every function f = f(rj) the conditional expectation
E(£|i/) has the property
E[/(f7)E(^)] = EK/M].
4. Let £ and i/ be random variables. Show that infy E(i/ - f(0)2 is attained for/*(£) =
E(i/10- (Consequently, the best estimator for i/ in terms of £, in the mean-square sense,
is the conditional expectation E(i/|£)).
5. Let f lt..., £n, t be independent random variables, where £,,..., £„ are identically
distributed and x takes the values 1, 2,..., n. Show that if St = ^, -\ h £r is the
sum of a random number of the random variables,
E(Sr|t) = tE£„ V(Sr|t) = tV£,
and
ESr = Et • E£„ VSr = Et • V^ + Vt • (E^)2.
6. Establish equation (24).
§9. Random Walk. I. Probabilities of Ruin and
Mean Duration in Coin Tossing
1. The value of the limit theorems of §6 for Bernoulli schemes is not just
that they provide convenient formulas for calculating probabilities P(S„ = k)
and P(A < Sn < B). They have the additional significance of being of a
84
I. Elementary Probability Theory
universal nature, i.e. they remain useful not only for independent Bernoulli
random variables that have only two values, but also for variables of much
more general character. In this sense the Bernoulli scheme appears as the
simplest model, on the basis of which we can recognize many probabilistic
regularities which are inherent also in much more general models.
In this and the next section we shall discuss a number of new probabilistic
regularities, some of which are quite surprising. The ones that we discuss are
again based on the Bernoulli scheme, although many results on the nature
of random oscillations remain valid for random walks of a more general
kind.
2. Consider the Bernoulli scheme (Q, $0, P), where Q = {co: co = (xlt..., xn),
X| = ±1}, s/ consists of all subsets of Q, and p{co) = p**»q»-*<>>\ v(a>) =
(£ x,- + n)/2. Let £(¢)) = x,, i = 1,..., n. Then, as we know, the sequence
<*!,..., <!;„ is a sequence of independent Bernoulli random variables,
Pft = i) = p, P(ft =-i) = g, p + * = i.
Let us put S0 = 0, Sk = <*! H h £k, 1 < k < n. The sequence S0,
Si,..., Sn can be considered as the path of the random motion of a particle
starting at zero. Here Sk+1 = Sk + £k, i.e. if the particle has reached the
point Sk at time k, then at time k + 1 it is displaced either one unit up (with
probability p) or one unit down (with probability q).
Let A and B be integers, A < 0 < B. An interesting problem about this
random walk is to find the probability that after n steps the moving particle
has left the interval (A, B). It is also of interest to ask with what probability
the particle leaves {A, E) at A or at B.
That these are natural questions to ask becomes particularly clear if we
interpret them in terms of a gambling game. Consider two players (first
and second) who start with respective bankrolls (-/4) and B. If £t =+1,
we suppose that the second player pays one unit to the first; if ££ = — 1, the
first pays the second. Then Sk = £t + • • • + £k can be interpreted as the
amount won by the first player from the second (if Sk < 0, this is actually
the amount lost by the first player to the second) after k turns.
At the instant k < n at which for the first time Sk = B (Sk = A) the
bankroll of the second (first) player is reduced to zero; in other words, that player
is ruined. (If k < n, we suppose that the game ends at time k, although the
random walk itself is well defined up to time n, inclusive.)
Before we turn to a precise formulation, let us introduce some notation.
Let x be an integer in the interval \_A, B~\ and for 0 < k < n let Sk = x + Sk,
*!:£ = min{0 < I < k: Sf = A or B}, (1)
where we agree to take t£ = k if A < Sf < B for all 0 < I < k.
For each k in 0 < k < n and x e [/4, B], the instant t£, called a stopping
time (see §11), is an integer-valued random variable defined on the sample
space Q (the dependence of t£ on Q is not explicitly indicated).
§9. Random Walk. 1. Probabilities of Ruin and Mean Duration in Coin Tossing 85
It is clear that for all / < k the set {co: xx = /} is the event that the random
walk {Sf, 0 < i < k], starting at time zero at the point x, leaves the interval
(A, B) at time /. It is also clear that when / < k the sets {co: xx = /, Sf = A}
and {co:xx = /, Sf ~ B} represent the events that the wandering particle
leaves the interval (A, B) at time / through A or B respectively.
For 0 < k < n, we write
■** = Z {co:r£=/,Sf = ,4},
0</<k
(2)
** = £ {co:iZ = /,Sf = £},
0</<k
and let
<xk(x) = P(^), /?k(x) = P(^)
be the probabilities that the particle leaves (A, B\ through A or B respectively,
during the time interval [0, fc]. For these probabilities we can find recurrent
relations from which we can successively determine a^x),..., a„(x) and
.0i(x),...,/?„(x).
Let, then, A < x < B. It is clear that a0(x) = fi0(x) = 0. Now suppose
1 < k < n. Then by (8.5),
A(x) = P&£) = P«|S? = * + DP(£i = 1)
+ PWIS? = x - l)P«i = -1)
= pP(@x\Sx = x + 1) + qP(@x\Sx = x - 1). (3)
We now show that
P{@xk\Sx = x + 1) = Pm±!), P(@xk\Sx = x - 1) = PiOi-1)-
To do this, we notice that <%* can be represented in the form
Of = {co: (x, x + <*!,..., x + <*! + - + ^6¾
where £k is the set of paths of the form
(x, x + xlt..., x + Xi + • • • xfc)
with xt = ± 1, which during the time [0, fr] first leave (A, B) at B (Figure 15).
We represent Bx in the form Bx'x+ 1 + Bxx~ \ where Bxx+1 and E^x~l
are the paths in Bx for which xx = +1 or xt = —1, respectively.
Notice that the paths (x, x + 1, x + 1 + x2,..., x + 1 + x2 + • • • + xk)
in Bx,x+ 1 are in one-to-one correspondence with the paths
(x + 1, x + 1 + x2,..., x + 1 + x2,..., x + 1 + x2 + • • • + xk)
in Bxt \. The same is true for the paths in Bxx~*. Using these facts, together
with independence, the identical distribution of ^,..., £k, and (8.6), we
obtain
86
I. Elementary Probability Theory
B
x
0
A
Figure 15. Example of a path from the set B%.
PWIS? = x + 1)
= PW|^ = 1)
= P{(*,x + {i,...,x + ^+... + WeBSI^ = l}
= P{(x + l,x + 1 + £2, ...,x + 1 + £2 + ••• + WeBJl}}
= P{(x + l,x + 1 + ^,...,x + 1 + £i + ••• + &-i)eflj£}}
= PW-1).
In the same way,
P(^|Sf = x - 1) = P(^:}).
Consequently, by (3) with x g (/1, £) and k ^ n,
ft(x) = Pft-i(x + 1) + «ft_ t(x - 1), (4)
where
/?,(B) = 1, 0,04) = 0, 0</<m. (5)
Similarly
ak(x) = pak_ t(x + 1) + gak_ t(x - 1) (6)
with
at(A) = 1, a,(B) = 0, 0 < I < n.
Since a0(x) = /J0(x) = 0,xe 04, B\ these recurrent relations can (at least
in principle) be solved for the probabilities
a^x),..., an(x) and ft(x),..., /?n(x).
Putting aside any explicit calculation of the probabilities, we ask for their
values for large n.
For this purpose we notice that since @$_xcz @$, k < n, we have
&_i(x) < /?k(x) ^ 1. It is therefore natural to expect (and this is actually
§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing 87
the case; see Subsection 3) that for sufficiently large n the probability /?„(x)
will be close to the solution /J(x) of the equation
#x) = p«x + 1) + qfrx - 1) (7)
with the boundary conditions
KB) = 1, RA) = 0, (8)
that result from a formal approach to the limit in (4) and (5).
To solve the problem in (7) and (8), we first suppose that p ^ q. We see
easily that the equation has the two particular solutions a and b(q/p)x, where
a and b are constants. Hence we look for a solution of the form
«x) = a + b{qlp)x. (9)
Taking account of (8), we find that for A < x < B
m = Mr - (g/py m
M) (q/p)B-(q/Pr ( }
Let us show that this is the only solution of our problem. It is enough to
show that all solutions of the problem in (7) and (8) admit the
representation (9).
Let /?(x) be a solution with p(A) = 0, fi(B) = 1. We can always find
constants a and b such that
a + b{q/pY = hA\ a + b\q/p)A+ * = &4 + 1).
Then it follows from (7) that
p{A + 2) = a + bXqlp)A+2
and generally
Ax) = a + b(q/p)x. .
Consequently the solution (10) is the only solution of our problem.
A similar discussion shows that the only solution of
<x(x) = pa(x + 1) + q*(x - 1), x e (A, B) (11)
with the boundary conditions
a(A) = 1, ol(B) = 0 (12)
is given by the formula
If p = q = ^ the only solutions /?(x) and a(x) of (7), (8) and (11), (12) are
respectively
**-j£4 (14)
and
Elementary Probability Theory
«(*) =
(15)
We note that
a(x) + 0(x) = 1 (16)
forO <p < 1.
We call a(x) and /?(x) the probabilities of ruin for the first and second
players, respectively (when the first player's bankroll is x — A, and the second
player's is B — x) under the assumption of infinitely many turns, which of
course presupposes an infinite sequence of independent Bernoulli random
variables f lf £2,..., where ff = +1 is treated as a gain for the first player,
and f,- = -1 as a loss. The probability space (Q, s#, P) considered at the
beginning of this section turns out to be too small to allow such an infinite
sequence of independent variables. We shall see later that such a sequence
can actually be constructed and that /?(x) and a(x) are in fact the probabilities
of ruin in an unbounded number of steps.
We now take up some corollaries of the preceding formulas.
If we take A = 0,0 < x < B, then the definition of /?(x) implies that this
is the probability that a particle starting at x arrives at B before it reaches 0.
It follows from (10) and (14) (Figure 16) that
ix/B, p = q = i
Kx) = i MPT -1
{(1/P)B
r
p^q-
(17)
Now let q > p, which means that the game is unfavorable for the first
player, whose limiting probability of being ruined, namely a = a(0), is given
by
(Q/P)B ~ 1
(q/p)B ~ (q/p)A
Next suppose that the rules of the game are changed: the original bankrolls
of the players are still ( — A) and B, but the payoff for each player is now \,
Figure 16. Graph of /?(x), the probability that a particle starting from x reaches B
before reaching 0.
§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing 89
rather than 1 as before. In other words, now let P(£,- = j) = p, P(& = —j) =
q. In this case let us denote the limiting probability of ruin for the first player
by a1/2. Then
„ (q/p)2B ~ 1
lt2-(q/p)2B-(q/p)2A'
and therefore
{q/p)B + 1
ail2~a\q/p)B^(q/p)A<°"
iiq>p.
Hence we can draw the following conclusion: if the game is unfavorable
to the first player (i.e.t q > p) then doubling the stake decreases the probability
of ruin.
3. We now turn to the question of how fast a„(x) and /?„(x) approach their
limiting values a(x) and J3(x).
Let us suppose for simplicity that x = 0 and put
an = <xn(0), ft = /?n(0), yn = 1 - (<x„ + ft).
It is clear that
yn = P{A <Sk<B,0<k<n},
where {A < Sk < B, 0 < k < n} denotes the event
fl {A<Sk<B}.
Osfc<n
Let n = rm, where r and m are integers and
Ci = £i+■•• + £«,
t2 = Sm+l + * • • + C,2m->
Cr = W- 1)+1 + "" + £rm-
Then if C = \A \ + B, it is easy to see that
{A< Sk < B, 1 < k < rm} s {|Cil < C,.... |C,I < Q,
and therefore, since Ci» • • •» C are independent and identically distributed,
yn < P{ICil < c,...,iu < c} = fl P{ic,| < c} = (P{|di < c}y. (18)
£= 1
We notice that V£i = w[l - (p - ¢)2]. Hence, for 0 < p < 1 and
sufficiently large m,
P{ICil<C}<£i, (19)
where ex < 1, since Vd < C2 if P(lf:| < C} - 1.
90
I. Elementary Probability Theory
If p = 0 or p = 1, then P{|Cil < C} = 0 for sufficiently large m, and
consequently (19) is satisfied for 0 < p ^ 1.
It follows from (18) and (19) that for sufficiently large n
yn < e", (2°)
where £ = e\lm < 1.
According to (16), a + )5 = 1. Therefore
(a - an) + (/? - ft) = yn,
and since a > a„, /? > ft, we have
0 < a - <xn < yn < £",
0 < P - ft < yn < £n, £ < 1.
There are similar inequalities for the differences a(x) — a„(x) and /?(x) — ft(x).
4. We now consider the question of the mean duration of the random walk.
Let mk(x) = Et£ be the expectation of the stopping time rk,k < n.
Proceeding as in the derivation of the recurrent relations for ft(x), we find that,
for x g 04, B),
mfc(x) = ErZ= £ /P(r£ = /)
l£/£fc
= I /• l>P(T£ = /|£! = 1) + «P« — f|«. = -1)]
l</<fc
= I I ■ [pPCtf-J =/-1) + «P(tfi{ = /-1)]
l£/<fc
= Z « + i)[pp(« = o + «pw-1 = 0]
0£/£fc-l
= pwfc_!(x +1) + gwfc-i(x - 1)
+ Z [pPW£i! =0 + sPttfi} = /)]
0</<fc-l
= pwfc_!(x +1) + qm^^x -1) + 1.
Thus, for x g (/1, £) and 0 < k < n, the functions wfc(x) satisfy the recurrent
relations
wfc(x) = 1 + pwfc-i(x +1) + qm^^x - 1), (21)
with w0(x) = 0. From these equations together with the boundary conditions
mk(A) = mk(B) = 0, (22)
we can successively find wii(x),..., mn(x).
Since wfc(x) < mk+l(x), the limit
w(x) = lim mn(x)
§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing 91
exists, and by (21) it satisfies the equation
m{x) = 1 + pm{x +1) + qm{x - 1) (23)
with the boundary conditions
m(A) = m{B) = 0. (24)
To solve this equation, we first suppose that
m(x) < oo, x e (A, B). (25)
Then if p ^ q there is a particular solution of the form x/{q — p) and the
general solution (see (9)) can be written in the form
m(x) = —?— + a + b(-) .
Q-P \P/
Then by using the boundary conditions m(A) = m(B) = 0 we find that
m{x) = —!— (BP(x) + Aa(x) - x], (26)
p- q
where /?(x) and a(x) are defined by (10) and (13). If p = q = j, the general
solution of (23) has the form
m(x) = a + bx — x2,
and since m(A) = m(B) = 0 we have
m(x) = (B- x)(x - A). (27)
It follows, in particular, that if-the players start with equal bankrolls
(B= -A), then
m(0) = B2.
If we take B = 10, and suppose that each turn takes a second, then the
(limiting) time to the ruin of one player is rather long: 100 seconds.
We obtained (26) and (27) under the assumption that m(x) < oo, x g (A, B).
Let us now show that in fact m(x) is finite for all x g (A, B). We consider only
the case x = 0; the general case can be analyzed similarly.
Let p = q = j. We introduce the random variable STn defined in terms of
the sequence S0,SU...9S„ and the stopping time t„ = t£ by the equation
S*„= ISfc/{tn=fc}(co). (28)
*=o
The descriptive meaning of SXn is clear: it is the position reached by the
random walk at the stopping time t„. Thus, if t„ < n, then SXn = A or B;
ifrn = n, then/1 <SXn<B.
92
I. Elementary Probability Theory
Let us show that when p = q = i,
ESt„ = 0, (29)
ES?„ = Et„. (30)
To establish the first equation we notice that
ESt„ = Z ECSJ,,,,^)]
fc = 0
= I E[Sn/{tn=fc}(co)] + f E[(Sfc - Sn)I{Xn=k)(co)]
k=0 k=0
= ES„ + t EKSfc " Sn)I{Xn=k)(co)l (31)
fc = 0
where we evidently have ES„ = 0. Let us show that
I E[(S* " Sn)/{tn=fc}(co)] = 0.
*=o
To do this, we notice that {t„ > k} = {A < St < B,..., A < Sk < B}
when 0 < k < n. The event {A < Sx < B,..., A < Sk < B} can evidently
be written in the form
{co: (*!,...,&) eAkl (32)
where Ak is a subset of {— 1, +1}*. In other words, this set is determined by
just the values of £u ..., £k and does not depend on £k+ lt..., £„. Since
(T„ = k) = {T„ > k - 1}\{T„ > k},
this is also a set of the form (32). It then follows from the independence of
f i,..., f„ and from Problem 10 of §4 that the random variables Sn — Sk and
I{tn=k) are independent, and therefore
E[(S„ - Sfc)/{tn=fc}] = E[S„ - SJ ■ E/{tn=fc} = 0.
Hence we have established (29).
We can prove (30) by the same method:
ES?n = I ESfc2/{tn=fc} = t E([S„ + (Sk - Sn)]2/{tn=fc})
fc=0 fc=0
= I [ES„2/{tn=fc} + 2ES„(Sfc - Sn)I[tn=k)
fc = 0
+ E(S„ - Sk)2I{tn=k{] = ESn2 - £ E(S„ - Sfc)2/{tn=fc}
*=o
= „ _ £ („ _ k)P{Tn = k)=t kP(Tn = k) = ET„.
fc = 0 fc=0
§9. Random Walk. I. Probabilities of Ruin and Mean Duration in Coin Tossing 93
Thus we have (29) and (30) when p = q = j. For general p and q
(P + q = 1) it can be shown similarly that
EStn = (p-<z)-Ern, (33)
E[Stn-Tn.E^]2 = V^.ETn, (34)
where E^ = p - q, V^ = 1 - (p - q)2.
With the aid of the results obtained so far we can now show that
lim^^ mn(0) = m(0) < oo.
If p = q = ± then by (30)
Et„ < rnaxU2, B2). (35)
Ifp^g,thenby(33),
^max(U|,B)
ET"- \p~q\ ' (36>
from which it is clear that m(0) < oo.
We also notice that when p = q = \
Ern = ESl = ^2 • <xn + B2 ■ ft, + E[Sn2/u<Sn<B}]
and therefore
^2-an + B2-pn < Et„ < ^2-an + B2■ fin + max(^2,B2)-);n.
It follows from this and (20) that as n -> oo, Et„ converges with exponential
rapidity to
m(0) = 42a + B2fi = A2—^ - B2 ~^ = |^|.
There is a similar result when p ^ q:
aA + /?B
Et„ -* m(0) = , exponentially fast.
p- q
5. Problems
1. Establish the following generalizations of (33) and (34):
ES* = x + (p - q)Exxn,
E[SiS-t:.E{1]2 = V^.EtJ.
2. Investigate the limits of a(x), /?(x), and m(x) when the level A [ - oo.
3. Let p = q = % in the Bernoulli scheme. What is the order of E \Sn\ for large «?
94
I. Elementary Probability Theory
4. Two players each toss their own symmetric coins, independently. Show that the
probability that each has the same number of heads after n tosses is 2~ 2..2=0
Hence deduce the equation JJ=0 (C*)2 = Qn.
Let an be the first time when the number of heads for the first player coincides with
the number of heads for the second player (if this happens within n tosses; an = w + 1
if there is no such time). Find Emin(an, «).
§10. Random Walk. II. Reflection Principle.
Arcsine Law
1. As in the preceding section, we suppose that £lt £2, • • •, £2„ *s a sequence
of independent identically distributed Bernoulli random variables with
P(6 = 1) = P, P«, = -!) = «,
Sk = £t + • • • + £», 1 < k < 2«; S0 = 0.
We define
a2n = min{l < k < In: Sk = 0},
putting a2n = oo if Sk ^ 0 for 1 < fr < 2n.
The descriptive meaning of o2n is clear: it is the time of first return to
zero. Properties of this time are studied in the present section, where we
assume that the random walk is symmetric, i.e. p = q = \.
For 0 < k < n we write
"2* = P(S2k = 0), f2k = P(a2n = 2k). (1)
It is clear that w0 = 1 and
u2k = ^2k " 2
Our immediate aim is to show that for 1 < k < n the probability f2k is
given by
flk = 2^M2(fc-l)- (2)
It is clear that
{<r2n = 2/c} = {Sy ^0,S2 ^ 0,..., S2fc-! ^0,S2fc = 0}
for 1 < k < n, and by symmetry
fik = P{Si * 0,..., S2fc_ t * 0, S2k = 0}
= 2PIS! > 0,..., S2k. t > 0, S2k = 0}. (3)
§10. Random Walk. II. Reflection Principle. Arcsine Law
95
A sequence (S0,..., Sk) is called a path of length k; we denote by Lk(A)
the number of paths of length k having some specified property A. Then
and S2fc+1 = a2k+u ..., S2n = a2fc+1 + ■ • ■ + a2n)■ 2_2n
= 2L2fc(S! > 0,..., S2fc_ x > 0, S2fc = 0) • 2"2\ (4)
where the summation is over all sets (a2k+ u ..., a2n) with af = ±1.
Consequently the determination of the probability f2k reduces to
calculating the number of paths L^Si > 0,..., S2fc_ x > 0, S2k = 0).
Lemma 1. Let a and b be nonnegative integers, a — b > 0 and k = a + b.
Then
Lk(Sx > 0,..., S^ > 0, Sk = a - b) = ^=-^ C£. (5)
Proof. In fact,
Lfc(Sj > 0,..., Sk_ n > 0, Sk = a - b)
= Lk(Si = 1,S2 >0,..., Sfc_! >0,Sfc = fl-fe)
= Lfc(Si = 1, Sfc = a - b) - Lk(S1 = l,Sk = a-b;
and 3 i, 2 < i < fc - 1, such that Sf < 0). (6)
In other words, the number of positive paths (Slt S2,..., Sk) that originate
at (1, 1) and terminate at (k, a — b) is the same as the total number of paths
from (1, 1) to (k, a — b) after excluding the paths that touch or intersect the
time axis.*
We now notice that
Lk(Sx = 1, Sk = a - b\ 3 i, 2 < i < k - 1, such that S,- < 0)
= Lk(S1 = -l,Sk = a-b), (7)
i.e. the number of paths from a = (1,1) to (3 = (k, a — b\ neither touching
nor intersecting the time axis, is equal to the total number of paths that
connect a* = (1, — 1) with p. The proof of this statement, known as the
reflection principle, follows from the easily established one-to-one
correspondence between the paths A = (Sit ...,Sat Sa+ u ...,Sk) joining a and
/?, and paths B = (-Su ..., -Sa, Sa+U..., Sk) joining a* and /? (Figure 17);
a is the first point where A and B reach zero.
* A path (5,,..., Sk) is called positive (or nonnegative) if all St > 0 (5^ > 0); a path is said to
touch the time axis if Sj > 0 or else Sj < 0, for 1 <j<k> and there is an i, 1 < i < k, such
that Si = 0; and a path is said to intersect the time axis if there are two times i andj such that
St > 0 and Sj < 0.
96
I. Elementary Probability Theory
Figure 17. The reflection principle.
From (6) and (7) we find
Llfc(S1>a...>Slk_1>0>Slk = fl-Q
= msl = l,Sk = a-b)- US, = -l,Sk = a-b)
— W-l ~~ W-l — 7 ^fc>
which establishes (5).
Turning to the calculation of/"2*i we find that by (4) and (5) (with a = k,
b = k-l),
f2k = 2L2k{Sx > 0,..., S2k. t > 0, S2k = 0) ■ 2~2k
= ^2^.^ >a...,s2*-i = i)-2-2*
i ^ i
= 2-2"
2k
—£ Ck2k- 1 — ^ "2(fc- 1)-
Hence (2) is established.
We present an alternative proof of this formula, based on the following
observation. A straightforward verification shows that
_1_
2k
u2ik-l) — u2{k-l) ~ "2fc.
(8)
At the same time, it is clear that
{G2n = 2k} = {a2n > 2(k - l)}\>2n > 2k},
{a2n>2l} = {S1^0t...tS2l^0}
and therefore
{a2n = 2k} = {St # 0,..., S2(fc_ 1} * 0}\{S, * 0,..., S2k * 0}.
Hence
fik = P{Si ± 0,..., Sw_„ * 0} - PfSi * 0,..., S2fc * 0},
§10. Random Walk. II. Reflection Principle. Arcsine Law
97
Figure 18
and consequently, because of (8), in order to show that f2k = (l/2fr)w2(fc-i)
it is enough to show only that
L^S, * 0,..., S2k * 0) = L2k(S2k = 0). (9)
For this purpose we notice that evidently
L2k(Sl * 0,..., S2k * 0) = 2L2k{Sy > 0,..., S2k > 0).
Hence to verify (9) we need only establish that
2L2k(Sx > 0,..., S2k > 0) = L2fc(S! > 0,"..., S2k > 0) (10)
and
L2fc(S! > 0,..., S2k > 0) = L2k(S2k = 0). (11)
Now (10) will be established if we show that we can establish a one-to-one
correspondence between the paths A = (Slt..., S2k) for which at least one
Si = 0, and the positive paths B = (Slt..., S2k).
Let A = (Si,..., S2k) be a nonnegative path for which the first zero occurs
at the point a (i.e., Sa = 0). Let us construct the path, starting at (a, 2),
(Sa + 2, Sa+! + 2,..., S2k + 2) (indicated by the broken lines in Figure 18).
Then the path B = (Sl9..., Sfl_ i, Sa + 2,..., S2k + 2) is positive.
Conversely, let B = (S^ ..., S2fc) be a positive path and b the last instant
at which Sb = 1 (Figure 19). Then the path
A = (Sl,...,Sb,Sb+l-2,...,Sk-2)
Figure 19
98
I. Elementary Probability Theory
— m
Figure 20
is nonnegative. It follows from these constructions that there is a one-to-one
correspondence between the positive paths and the nonnegative paths with
at least one S£ = 0. Therefore formula (10) is established.
We now establish (11). From symmetry and (10) it is enough to show that
WSi > 0,..., S2k > 0) + L^Si > 0,..., S2k > 0 and 3 i,
1 < i < 2k, such that S, = 0) = L2k(S2k = 0).
The set of paths (S2k = 0) can be represented as the sum of the two sets
#! and #2> where ^x contains the paths (S0,..., S2k) that have just one
minimum, and #2 contains those for which the minimum is attained at at
least two points.
Let Cle<&l (Figure 20) and let y be the minimum point. We put the path
C\ = (S0, Si,..., S2k) in correspondence with the path C% obtained in the
following way (Figure 21). We reflect (S0, Slt..., S,) around the vertical
line through the point /, and displace the resulting path to the right and
upward, thus releasing it from the point (2/c, 0). Then we move the origin to
the point (/, — m). The resulting path C* will be positive.
In the same way, if C2 e <tf2 we can use the same device to put it into
correspondence with a nonnegative path C%.
A
•i^—I 1 1 1 1 1 1 1 ►
2k
Figure 21
§10. Random Walk. II. Reflection Principle. Arcsine Law
99
Conversely, let Cf = (Sx > 0,..., S2k > 0) be a positive path with
S2k = 2m (see Figure 21). We make it correspond to the path Cx that is
obtained in the following way. Let p be the last point at which Sp = m.
Reflect (Sp,..., S2m) with respect to the vertical line x = p and displace the
resulting path downward and to the left until its right-hand end coincides
with the point (0,0). Then we move the origin to the left-hand end of the
resulting path (this is just the path drawn in Figure 20). The resulting path
^i = (S0,...,S2fc) has a minimum at S2k = 0. A similar construction
applied to paths (St > 0,..., S2k > 0 and 3 i, 1 < i' <, 2k, with St = 0) leads
to paths for which there are at least two minima and S2k = 0. Hence we have
established a one-to-one correspondence, which establishes (11).
Therefore we have established (9) and consequently also the formula
flk = "2(fc- 1) - "2* = OWl^Cfc-l)-
By Stirling's formula
u2k = C2k-2~2k--^=, fc-*oo.
Therefore
1
J2k ~ 1= » K -* 00.
Hence it follows that the expectation of the first time when zero is reached,
namely
n
Emin(o-2„, 2n) = J] 2kP(a2n = 2k) + 2nu2n
* = i
= ZM2(fc-l) + 2""2n»
fc=l
can be arbitrarily large.
In addition, ^"=1 w2(fc-i) = °°» anc* consequently the limiting value of
the mean time for the walk to reach zero (in an unbounded number of steps)
is oo.
This property accounts for many of the unexpected properties of the
symmetric random walk that we have been discussing. For example, it
would be natural to suppose that after time 2n the number of zero net scores
in a game between two equally matched players (p = q = j), i.e. the number
of instants i at which S{ = 0, would be proportional to 2n. However, in fact
the number of zeros has order y/2n (see [Fl] and (15) in §9, Chapter VII).
Hence it follows, in particular, that, contrary to intuition, the "typical" walk
(S0, Sl9..., Sn) does not have a sinusoidal character (so that roughly half the
time the particle would be on the positive side and half the time on the negative
side), but instead must resemble a stretched-out wave. The precise formulation
of this statement is given by the arcsine law, which we proceed to investigate.
100
I. Elementary Probability Theory
2. Let P2fc.2„ be the probability that during the interval [0, 2n\ the particle
spends 2k units of time on the positive side.*
Lemma 2. Let u0 = 1 and 0 < k < n. Then
Plk,2n= u2k-u2n-2k (12)
Proof. It was shown above that f2k = u2ik-i) — w2*- Let us show that
k
"2fc= Z/2r-"2(fc-r)- (13)
Since {S2k = 0} c {a2n < 2k), we have
{S2k = 0} = {S2k = 0} n {a2n <2k}= £ {S2k = 0} n {a2n = 2/}.
i</<*
Consequently
u2k = P(S2k = 0) = £ P(S2fc = 0, a2n = 2/)
i<*<*
= Z P(S2fc = 0|<72fc = 2/)P(<72n = 2/).
i</<fc
But
P(S2fc = 0|<r2n = 2/) = P(S2fc = 0\St ± 0,..., S2/_t * 0, S2/ = 0)
= P(S2, + (£2/+1 + • ■ • + £2fc) = 0^ * 0,..., S,^ * 0, S2/ = 0)
= P(S2/ + (£2/+ ! + •■• + £2fc) = 0|S2/ = 0)
= P(£2*+i + ■•• + Z2k = 0) = P(S2(fc_0 = 0).
Therefore
"2*= Z P(S2(fc-0 = 0)P(t72n = 2/),
l£/£fc
which establishes (13).
We turn now to the proof of (12). It is obviously true for k = 0 and k = n.
Now let 1 < k < n — 1. If the particle is on the positive side for exactly 2k
instants, it must pass through zero. Let 2r be the time of first passage through
zero. There are two possibilities: either Sk > 0, k < 2r, or Sk < 0, k < 2r.
The number of paths of the first kind is easily seen to be
(l2 f2t)'2 Puk-r),2{n-r) = 2*2 ' f2r ' -^2(1:-r),2(n-r) •
* We say that the particle is on the positive side in the interval [m — 1, ni] if one, at least, of the
values S^-! and S„ is positive.
§10. Random Walk. II. Reflection Principle. Arcsine Law
101
The corresponding number of paths of the second kind is
1*2 "'jlr' ^2fc,2(n-r)*
Consequently, for 1 < k < n — 1,
1 k 1 k
Plk, 2n = X l_, J2r ' ^2(k - r), 2(n - r) + X X fir ' ^2fc. 2(n - r) • (14)
■^ r= 1 -^ r= 1
Let us suppose that P2*.2m = u2k • Uim-ik holds for m = 1,..., n — 1. Then
we find from (13) and (14) that
k k
Plk.2n — 2lhn-2k ' E fir ' ^2k-2r + 2M2fc ' zl, fir' "2n-2r-2fc
= 2li2n-2k ' u2k + 2**2* * u2n-2k = u2k'u2n-2k'
This completes the proof of the lemma.
Now let y(2n) be the number of time units that the particle spends on the
positive axis in the interval [0, 2ri]. Then, when x < 1,
lZ Zn J {fc,l/2<(2fc/2n)<x}
Since
""-^
as k -* oo, we have
1
Plk,2n — u2k ' M2(n-fc)
njHn - k)'
{k:l/2<(2k/2n)£x} ' {k:l/2<(2k/2n)^x}7m L»1\ H/l
as k -* oc and n — /fr -> oo.
Therefore
whence
^ 1 fx dt
E ^2k2n~- / ~»°. H-»00.
<(2k/2n)£x} ' 7T J 1/2 . A(l - 0
/ J •» 2fc,2n I /-t: r
{k: 1/2 <(2k/2n)£x} 7T J 1/2 ^y f (1 — f)
But, by symmetry,
E Plk, 2n -* 2
{fc:fc/n<l/2}
102
1. Elementary Probability Theory
and
1 fx dt 2 . r- !
1 , = - arcsm ux —. ?•
n Ji/2 ^O^T) *
Consequently we have proved the following theorem.
Theorem (Arcsine Law). The probability that the fraction of the time spent
by the particle on the positive side is at most x tends to In-1 arcsin y/x:
y\ Pik.in ~* 27T~! arcsin yfa (15)
{kik/nzx)
We remark that the integrand p{t) in the integral.
i f dt
*Jo Vf(l - 0
represents a U-shaped curve that tends to infinity as t -> 0 or 1.
Hence it follows that, for large n,
i.e., it is more likely that the fraction of the time spent by the particle on the
positive side is close to zero or one, than to the intuitive value \.
Using a table of arcsines and noting that the convergence in (15) is indeed
quite rapid, we find that
PJ2g) £0.024} *ai.
P{f>,0,},0,,
p{Zg> ^0.2^0.3,
p|2g)£0.65Uo.6.
Hence if, say, n = 1000, then in about one case in ten, the particle spends
only 24 units of time on the positive axis and therefore spends the greatest
amount of time, 976 units, on the negative axis.
3. Problems
1. How fast does Emin(a2n, 2w) -► oo as n -* oo?
2. Let t„ = min{l < k < n: Sk = 1}, where we take t„ = oo if Sk < 1 for 1 < k < n.
What is the limit of Emin(Tn, n) as n -> oo for symmetric (p = q = \) and for un-
symmetric (p # q) walks?
§11. Martingales. Some Applications to the Random Walk
103
§11. Martingales. Some Applications to the
Random Walk
1. The Bernoulli random walk discussed above was generated by a sequence
€ii. • •, €„ of independent random variables. In this and the next section we
introduce two important classes of dependent random variables, those that
constitute martingales and Markov chains.
The theory of martingales will be developed in detail in Chapter VII.
Here we shall present only the essential definitions, prove a theorem on the
preservation of the martingale property for stopping times, and apply this
to deduce the "ballot theorem." In turn, the latter theorem will be used for
another proof of proposition (10.5), which was obtained above by applying
the reflection principle.
2. Let (Q, sf, P) be a finite probability space and 3>x < 3>2 ^ * * • ^ @>n a
sequence of decompositions.
Definition 1. A sequence of random variables £l9..., £n is called a martingale
(with respect to the decomposition Q)x =< 3>2 ^ * • * ^ ^n) if
(1) £k is ^-measurable,
(2)E(£fc+1|<2fc) = £fc,l</c<H-l.
In order to emphasize the system of decompositions with respect to which
the random variables form a martingale, we shall use the notation
f = (&,0*)i ***,., (1)
where for the sake of simplicity we often do not mention explicitly that
1 < k < n.
When 3>k is induced by f t,..., f„, i.e.
@k = ®«i &.
instead of saying that £ = (£fc, 2>k) is a martingale, we simply say that the
sequence £ = (£fc) is a martingale.
Here are some examples of martingales.
Example 1. Let nu ..., rjn be independent Bernoulli random variables with
P(ifc = l)=P(ifc=-l) = i
Sk = rji + • • • + nk and Q)k = Qnx „k.
We observe that the decompositions 3)k have a simple structure:
Qx = {D+,D~},
104
I. Elementary Probability Theory
where
D+ = {(o: r]x = +1}, D~ = {cd: ^, = -1},
where
D++ = {w:^ = + 1,¾ = +1},...,/)-- = {w:>h= -1,¾ = -1).
etc.
It is also easy to see that ^,,...,^ = S>si sk-
Let us show that (Sk, 3)k) forms a martingale. In fact, Sk is ^-measurable,
and by (8.12), (8.18) and (8.24),
E(Sk+l\@k) = E(Sk + rik+1\@k)
= E(Sk\@k) + E(r}k+1\@k) = Sk + Erjk+l = Sk.
If we put S0 = 0 and take D0 = {Q}, the trivial decomposition, then the
sequence (Sfc, ^fc)0<*<n also forms a martingale.
Example 2. Let rjlt..., r]n be independent Bernoulli random variables with
Pixii = 1) = P. Pirii = -1)-0- If p # g, each of the sequences ^ = (&)
with
& = \j) . & = Sk - k(p - q\ where Sk = ^ + • • • + rjkt
is a martingale.
Example 3. Let »7 be a random variable, ®x =<•••=< 3>n, and
^ = E(^|^fc). (2)
Then the sequence £ = (£fc, £2fc) is a martingale. In fact, it is evident that
E{ri\S)k) is ^-measurable, and by (8.20)
E(£fc+1|£y = E[Efa|0k+1)£y = Efal^) = £k.
In this connection we notice that if ^ = (£fc, S)k) is any martingale, then
by (8.20)
& = E(&+i|0*) = E[E(^+2|^fc+1)|^fc]
= E(^+2l^) = -" = E(^|^fc). (3)
Consequently the set of martingales £ = (£k, 2>k) is exhausted by the
martingales of the form (2). (We note that for infinite sequences £ =
(£*> ^*)*>i this is, in general, no longer the case; see Problem 7 in §1 of
Chapter VII.)
§11. Martingales. Some Applications to the Random Walk
105
Example 4. Let nlt..., nn be a sequence of independent identically distributed
random variables, Sk = nx H + nkt and @>x = 2)Sn, S)2 = ^s„,sn_,» • • •»
Q)n = ^s„,s„-i s,- Let us show that the sequence £, = (£fc, 3)k) with
is a martingale. In the first place, it is clear that 3>k < @k+l and £k is Q)k-
measurable. Moreover, we have by symmetry, for; < n - k + 1,
E(nj\%) = E(rJl\^k) (4)
(compare (8.26)). Therefore
(n - k + l^Ciftl^^) = " I+1EfojW = E(S„_fc+1|0fc) = S„_fc+1,
./= i
and consequently
and it follows from Example 3 that £ = (£fc, ^fc) is a martingale.
Remark. From this martingale property of the sequence £ = (£fc, S)k)i^k^n,
it is clear why we will sometimes say that the sequence {Sk/k\^k^n forms a
reversed martingale. (Compare problem 6 in §1 of Chapter VII.)
Example 5. Let nu..., nn be independent Bernoulli random variables with
Pfoi= +l) = Pfoi =-1)-1,
Sk = rji H + »7*. Let /1 and B be integers, /1 < 0 < B. Then with 0 < X <
n/2, the sequence £ = (£fc, £2fc) with £2fc = £2Sl Sk and
tk = (cos X)~k expjufa - ^y^)l (5)
is a complex martingale (i.e., the real and imaginary parts of £fc form
martingales).
3. It follows from the definition of a martingale that the expectation E£fc is
the same for every k:
E& = E*!.
It turns out that this property persists if time k is replaced by a random
time.
In order to formulate this property we introduce the following definition.
Definition 2. A random variable t = t(co) that takes the values 1, 2,..., n is
called a stopping time (with respect to a decomposition (^fc)i<fc<„, Bx =<
@>i =^ ■ * ^ ^n) if. f°r ^ = 1, • • •, n, the random variable /{t=fc}(co) is
immeasurable.
106
I. Elementary Probability Theory
If we consider Q)k as the decomposition induced by observations for k
steps (for example, S>k = g>n „fc, the decomposition induced by the
variables nu..., nk), then the ^fc-measurability of I{T=k)(ca) means that the
realization or nonrealization of the event {t = k) is determined only by
observations for k steps (and is independent of the "future").
If 08k = o>(Q)k\ then the immeasurability of IlT=k)(co) is equivalent to the
assumption that
{r = k}e<%k. (6)
We have already introduced specific examples of stopping times: the times
t£, a2n introduced in §§9 and 10. Those times are special cases of stopping
times of the form
rA = min{0 <k <n:£ke A},
(7)
aA = min{0 < k < n: £k e A},
which are the times (respectively the first time after zero and the first time)
for a sequence £0, £lt..., £„ to attain a point of the set A.
4. Theorem 1. Let £ = (£fc, @>k\<k<n be a martingale and x a stopping time
with respect to the decomposition {S>k)\^k<n- Then
E«.|3i) = f i, (8)
where
¢.= 1 &/(.=«(*>) (9)
fc = l
and
Ef. = E^. (10)
Proof (compare the proof of (9.29)). Let D e Q)x. Using (3) and the properties
of conditional expectations, we find that
EfclP) = E(Ud)
KM } P(Z))
ZE[E(^n|^). /t.^./J
_1
1 "
= P(D) ?E^n'<*='>' ^
'p^ek./jO-ekjd),
§11. Martingales. Some Applications to the Random Walk
107
and consequently
E«t|®1) = E«,,|®1) = «1.
The equation E£t = E^ then follows in an obvious way.
This completes the proof of the theorem.
Corollary. For the martingale (Sk, @k)i<k<„ of Example 1, and any stopping
time t (with respect to (3>k)) we have the formulas
ESt = 0, ESt2 = Er, (11)
known as Wald's identities (cf (9.29) and (9.30); see also Problem 1 and
Theorem 3 in §2 of Chapter VII).
5. Let us use Theorem 1 to establish the following proposition.
Theorem 2 (Ballot Theorem). Let rjlt ...,rj„ be a sequence of independent
identically distributed random variables whose values are nonnegative integers,
Sfc = »7i + • • ■ + nk, 1 < k < n. Then
P{Sk < kfor allk,l<k< n\Sn} = 11 - ^ j , (12)
where a+ = max(a, 0).
Proof. On the set {co: Sn > n} the formula is evident. We therefore prove
(12) for the sample points at which Sn < n.
Let us consider the martingale £ = (£k, 3>k)i<k<n introduced in Example
4, with £k = Sn+1 _fc/(n + 1 - k) and ®k = 0Sn ~ _~k Sn.
We define
t = min{l <k <n:£k> 1},
taking t = n on the set {£k < 1 for all k such that 1 < k < n} =
{max1</<n(S//0 < 1}- It is clear that £t = £n = Si = 0 on this set, and
therefore
{max ^ < ll = {max ^ < 1, Sn < n] c {£ = 0}. (13)
Now let us consider those outcomes for which simultaneously
max, ^/<„(S//0 ^ 1 and Sn < n. Write a = n + 1 — t. It is easy to see that
a = max{l <k <n:Sk>k}
and therefore (since Sn < n) we have a < n, Sa > o, and Sa+l < a + 1.
Consequently na+1 = Sa+l - Sa < (a + 1) - a = 1, i.e. na+l = 0.
Therefore a <Sa = Sa+1 <o + 1, and consequently Sa = a and
108
I. Elementary Probability Theory
Therefore
{max ^1, Skills tf.= l}. (14)
From (13) and (14) we find that
| max ^ > 1, Sn < n\ = {£t = 1} n {Sn < n}.
Therefore, on the set {Sn < n}, we have
p{max 5» > i|Sl = pKt = i|SJ = £(£,1¾
where the last equation follows because £t takes only the two values 0 and 1.
Let us notice now that E(^t|Sn) = E^J^O, and (by Theorem 1)
E(ftl^i) = £i = sJn- Consequently, on the set {Sn < n} we have P{Sk < k
for all k such that 1 < k < n\Sn} = 1 - (SJn).
This completes the proof of the theorem.
We now apply this theorem to obtain a different proof of Lemma 1 of
§10, and explain why it is called the ballot theorem.
Let £ i,..., £„ be independent Bernoulli random variables with
PKi = 1)= P(&= -1) = i
Sk = £i + ••• + & and a, b nonnegative integers such that a — b > 0,
a + b = n. We are going to show that
P{Si > 0,..., Sn > 0|S„ = a - b} = ^-^. (15)
a + b
In fact, by symmetry,
P{S1>0,...,S„>0|S„ = a-&}
= P{Si <0,...,S„<0|S„ = -(fl-fc)}
= P{Sn + 1 < 1,..., Sn + n < n\Sn + n = n - (a - b)}
= Pf^ < 1,..., I/!+ ..- +I/,, <n|i71 + ••• + nn = n-(a-b)}
i-_b
L n J n a +
where we have put /7* = £fc + 1 and applied (12).
Now formula (10.5) follows from (15) in an evident way; the formula was
also established in Lemma 1 of §10 by using the reflection principle.
§11. Martingales. Some Applications to the Random Walk
109
Let us interpret f £ = +1 as a vote for candidate A and £t = — 1 as a vote
for B. Then Sk is the difference between the numbers of votes cast for A and
B at the time when k votes have been recorded, and
P{S! >0,...,Sn>0\Sn = a-b}
is the probability that A was always ahead of B, with the understanding that
A received a votes in all, B received b votes, and a — b > 0, a + b = n.
According to (15) this probability is (a - b)/n.
6. Problems
1. Let @)0 =^ ^, =^ • • • =^ 3)n be a sequence of decompositions with @)0 = {Q}, and let
t]k be ^-measurable variables, 1 < k < n. Show that the sequence f = (£fc, Qk) with
tk= Ifoi-EOfcUP,-,)]
is a martingale.
Z Let the random variables f|a fjfc satisfy E(»7fc 1^,,...,^.,) = 0. Show that the
sequence £ = (&), £fc£n with £, = »7, and
fc
Sfc+t = 5>*+i/i0h.---.1/1).
i=t
where yj are given functions, is a martingale.
3. Show that every martingale f = (&,£&*) has uncorrected increments: if a < b <
c < d then
cov<&-&,&-« = a
4. Let ^ = (£„...,^„) be a random sequence such that £fc is ^-measurable
{& =^ 30 2 ^ *"" ^ ^n)- Show that a necessary and sufficient condition for this
sequence to be a martingale (with respect to the system (^fc)) is that E£r = Ec, for
every stopping time t (with respect to (jS)k)). (The phrase "for every stopping time"
can be replaced by "for every stopping time that assumes two values.")
5. Show that if ^ = (^fc, @>k)x <*<„ is a martingale and t is a stopping time, then
£[£„/*=*,] = EKfc/,t=fc}]
for every k.
6. Let £ = (&, ^fc) and tj = (rjk, %) be two martingales, £, = 17, = 0. Show that
Ef„>7„ = £ E(& - &- tK^fc - %-1)
fc = 2
and in particular that
E£„2 = 1%-^.,)2.
110
I. Elementary Probability Theory
7. Let tju..., rjn be a sequence of independent identically distributed random variables
with Erji = 0. Show that the sequence £, = (£fc) with
_expA(^! +••• + t]k)
Cfc (EexpA^)"
is a martingale.
8. Let i/lt..., */„ be a sequence of independent identically distributed random variables
taking values in a finite set Y. Let f0(y) = P(r]l = y), y e F, and let /,(3;) be a non-
negative function with J]yGy /i(j>) = 1- Show that the sequence £ = (¾. ^J?) with
«- A, *,
- = /iO/i)---/iOh)
* Mm)—Mid*
is a martingale. (The variables £fc, known as likelihood ratios, are extremely important
in mathematical statistics.)
§12. Markov Chains. Ergodic Theorem.
Strong Markov Property
1. We have discussed the Bernoulli scheme with
Q={co:co = (x1,...,xn),xI = ai},
where the probability p(co) of each outcome is given by
p(co) = p(x1) -pCxJ, (1)
with p(x) = pxql~x. With these hypotheses, the variables £l9...,£n with
£i(co) = Xj are independent and identically distributed with
P«i = x) = ••• = P(£„ = x) = p(x), x = 0,1.
If we replace (1) by
p(co) = p1(x1)--pn(xn),
where p,(x) = pf(l - p,-), 0 < p,- < 1, the random variables f lf..., §„ are
still independent, but in general are differently distributed:
P«i = x) = Pl(x),.... P(£„ = x) = p„(x).
We now consider a generalization that leads to dependent random variables
that form what is known as a Markov chain.
Let us suppose that
Q = {co: co = (x0, xl9..., xn), x,- e X},
§12. Markov Chains. Ergodic Theorem. Strong Markov Property
111
where X is a finite set. Let there be given nonnegative functions p0(x),
Pi(x, y),.. -, pn(x, y) such that
IPo(x)=l,
(2)
Y,Pk(x>y)=l> fc=l,...,n; yeX.
yeX
For each co = (x0, xlt..., x„), put
p(co) = Po(x0)Pi(x0, xx) • • • P„(xn_x, xn). (3)
It is easily verified that J^e^ p(co) = 1, and consequently the set of numbers
p(co) together with the space Q and the collection of its subsets defines a
probabilistic model, which it is usual to call a model of experiments that form
a Markov chain.
Let us introduce the random variables £0, £lt..., £n with ^(co) = x,-. A
simple calculation shows that
P(£o = a) = p0(fl),
(4)
P(£o = «o» • • •» Zk = ak) = p0(a0)Pi(a0t ax) • • • pk(ak-lt ak).
We now establish the validity of the following fundamental property of
conditional probabilities:
P{&+i = ak+1\Zk = ak,...,t;0 = «o} = P{&+i =flfc+il^ = flfc} (5)
(under the assumption that P(£fc = akt..., £0 = a0) > 0).
By (4),
P{&+i = a*+il& = flfc,...,^o = flo}
_ p{&+1 = ak+1. • • •. £o = flo}
_ Po(Qq)Pi(Qq. ^l)• • • Pfc+ifofc. Qfc-n) _ „ , „ v
" Po(a0)--Pk(ak-1,ak) ~ *+!<*. W
In a similar way we verify
P{&+i = flfc+ilf* = «J = Pfc+ifcfc.flfc+i). (6)
which establishes (5).
Let Q)\ = Q)^ ^ be the decomposition induced by f0» •••»&» anc*
Then, in the notation introduced in §8, it follows from (5) that
P{&+1 = ak+l\Ot} = P{£fc + l = ak+1\tk} (7)
or
PKfc+i =ak+i\to, ■■■>&} = P{&+i =fl*+il&}.
112
I. Elementary Probability Theory
If we use the evident equation
P(AB\C) = P(A\BC)P(B\C),
we find from (7) that
P{Z„ = an,..., tk+l = ak + l\®l) = P{£„ = a,,..., &+l = ak+1\tk} (8)
or
P{£„ = «„,..., tk+l = ak+l\t0, ...,&} = P{£n = an,. ..,&+l =0fc+il6J.
(9)
This equation admits the following intuitive interpretation. Let us think
of £fc as the position of a particle "at present," (£0,..., £fc_ x) as the "past,"
and (^fc +!,..., 4) as the " future." Then (9) says that if the past and the present
are given, the future depends only on the present and is independent of how
the particle arrived at £fc, i.e. is independent of the past (£0,..., £fc_ j).
LetF = (fn = an,...,&+1 = ak+1), N = {& = ak},
B = {&-! = a*-i,...,fo = «o>-
Then it follows from (9) that
P(F|NB) = P(F|N),
from which we easily find that
P(FB | N) = P(F | N)P(B | N). (10)
In other words, it follows from (7) that for a given present N, the future F
and the past B are independent. It is easily shown that the converse also
holds: if (10) holds for all k = 0,1,..., n - 1, then (7) holds for every k = 0,
1,...,«- 1.
The property of the independence of future and past, or, what is the same
thing, the lack of dependence of the future on the past when the present is
given, is called the Markov property, and the corresponding sequence of
random variables £0,..., £„ is a Markov chain.
Consequently if the probabilities p(oj>) of the sample points are given by
(3), the sequence (£0,..., £n) with £{©) = x,- forms a Markov chain.
We give the following formal definition.
Definition. Let (Q, sf, P) be a (finite) probability space and let £ = (£0,..., £„)
be a sequence of random variables with values in a (finite) set X. If (7) is
satisfied, the sequence £ = (£0,..., £„) is called a (finite) Markov chain.
The set X is called the phase space or state space of the chain. The set of
probabilities (p„(x)), xeX, with p0(x) = P(£0 = x) is the initial distribution,
and the matrix ||pfc(x,y)\\, x, yeX, with p(x,y) = P{& = y\£k_l = x) is
the matrix of transition probabilities (from state x to state _y) at time
k = 1,..., n.
§12. Markov Chains. Ergodic Theorem. Strong Markov Property
113
When the transition probabilities pk(x, y) are independent of k, that is,
Pkfr, y) = p(x, y), the sequence £ = (£0,..., £„) is called a homogeneous
Markov chain with transition matrix ||p(x, y)\\.
Notice that the matrix ||p(x, y)\\ is stochastic: its elements are nonnegative
and the sum of the elements in each row is 1: ]£,, p(x, y) = l,xeX.
We shall suppose that the phase space X is a finite set of integers
(X = {0, 1,..., N}, X = {0, ± 1,..., ±N}, etc.), and use the traditional
notation pt = p0(i) and pi} = p(i,j).
It is clear that the properties of homogeneous Markov chains completely
determine the initial distributions p£ and the transition probabilities ptj. In
specific cases we describe the evolution of the chain, not by writing out the
matrix ||p£j|| explicitly, but by a (directed) graph whose vertices are the states
in X, and an arrow from state i to state j with the number p(j over it indicates
that it is possible to pass from point i to pointy with probability py. When
Pij = 0, the corresponding arrow is omitted.
Example 1. Let X = {0,1, 2} aitf
»PiJ= i o i).
\i o i/
The following graph corresponds to this matrix:
Here state 0 is said to be absorbing: if the particle gets into this state it remains
there, since p00 = 1. From state 1 the particle goes to the adjacent states 0
or 2 with equal probabilities; state 2 has the property that the particle remains
there with probability 3 and goes to state 0 with probability f.
Example 2. Let X = {0, ±1,..., ±N}, p0 = 1, pNN = P(-n)(-n) = 1, and,
for|i|<JV,
(p, ; = i+L
Pu=U J = i~U (11)
lo otherwise.
114
I. Elementary Probability Theory
The transitions corresponding to this chain can be presented graphically ii
the following way (N = 3):
This chain corresponds to the two-player game discussed earlier, when each
player has a bankroll N and at each turn the first player wins +1 from the
second with probability p, and loses (wins — 1) with probability q. If we
think of state i as the amount won by the first player from the second, then
reaching state N or — N means the ruin of the second or first player,
respectively.
In fact, if rjlt rj2,..., rj„ are independent Bernoulli random variables with
P(rji = +1) = p, P(^= -1) = q, So = 0 and Sk = ^ + ••• + rjk the
amounts won by the first player from the second, then the sequence S0,
Slt..., Sn is a Markov chain with p0 = 1 and transition matrix (11), since
P{S*+1 = J' |S* = **> $k-1 = h-1» • • •}
= P(S* + r}k+1 =j\Sk = ik,Sk.l = ifc_ x,...}
= P{Sk + rjk+l =j\Sk = ik} = P{,yfc+1 =; - ik}.
This Markov chain has a very simple structure:
Sfc+i-Sfc + fc+i. 0<k<n- 1,
where »ylf rj2,..., r\n is a sequence of independent random variables.
The same considerations show that if £0, 17^...,17,, are independent
random variables then the sequence f0» f 1» ■ • • i fn with
&+1 = /*(&, rjk+1), 0<k<n-l, (12)
is also a Markov chain.
It is worth noting in this connection that a Markov chain constructed in
this way can be considered as a natural probabilistic analog of a
(deterministic) sequence x = (x0,..., x„) generated by the recurrent equations
xk + l = fk(xk)-
We now give another example of a Markov chain of the form (12); this
example arises in queueing theory.
Example 3. At a taxi stand let taxis arrive at unit intervals of time (one at a
time). If no one is waiting at the stand, the taxi leaves immediately. Let rjk be
the number of passengers who arrive at the stand at time k, and suppose that
rjlt ...,rj„ are independent random variables. Let £k be the length of the
§12. Markov Chains. Ergodic Theorem. Strong Markov Property
115
waiting line at time k, £0 = 0. Then if £k = i, at the next time k + 1 the length
£k+x of the waiting line is equal to
. fa+i if'= 0,
In other words,
&+i = «* ~ 1)+ + f]k+l, 0<k<n-l,
where a+ = max(a, 0), and therefore the sequence £ = (£0,..., £„) is a
Markov chain.
Example 4. This example comes from the theory of branching processes. A
branching process with discrete times is a sequence of random variables
€o» €i» • • • > €n» where £k is interpreted as the number of particles in existence
at time k, and the process of creation and annihilation of particles is as
follows: each particle, independently of the other particles and of the
"prehistory" of the process, is transformed into j particles with probability pj,
j = 0, 1,..., M.
We suppose that at the initial time there is just one particle, £0 = 1. If at
time k there are £k particles (numbered 1, 2,..., £fc), then by assumption
£k+! is given as a random sum of random variables,
sjic + i — n\ t •• + */&>
where iff* is the number of particles produced by particle number i. It is
clear that if £k = 0 then £k+l = 0. If we suppose that all the random variables
nf\ k > 0, are independent of each other, we obtain
P{£*+i = **+il£fc = ifc> £*-i = '*-if-} = P(£fc+i = h+i\£k = '*}
= P{rf»+ --- + ^ = ^+1}.
It is evident from this that the sequence £0, f lf..., £„ is a Markov chain.
A particularly interesting case is that in which each particle either vanishes
with probability q or divides in two with probability p, p + q = 1. In this
case it is easy to calculate that
Pu = PWk+i =;l& = 0
is given by the formula
_ fc/'V'V-'72, J = 0,..., 2i,
j ~ (0 in all other cases.
2. Let ^ = (£fc, m, P) be a homogeneous Markov chain with starting vectors
(rows) m = (p^ and transition matrix H = ||p,v||. It is clear that
P« = P«i = ;lfo = /} = •■• = P{4 =714-1 = 0.
116 I. Elementary Probability Theory
We shall use the notation
p|? = P(& =;lfo = 0 (=PK*+i =;I6 = 0)
for the probability of a transition from state i to state y in fr steps, and
pf = P{&=./}
for the probability of finding the particle at pointy at time k. Also let
n(fc) = yf% Plk) = \\p{!}\\.
Let us show that the transition probabilities p\f satisfy the Kolmogorov-
Chapman equation
a
or, in matrix form,
p(k + l) _ p>(fc). p>(/) Q4\
The proof is extremely simple: using the formula for total probability
and the Markov property, we obtain
p!J+/> = P(£fc+/ =;|£0 = /) = £ P(&+/ = ;, & = <x|£0 = i)
a
= I *&♦, -;i& = <*)P(& = «16, = 0 = 1 p9p8?.
a a
The following two cases of (13) are particularly important:
the backward equation
pJT'-IiwB a?)
and the forward equation
pr"=np£% 06)
a
(see Figures 22 and 23). The forward and backward equations can be written
in the following matrix forms
P*fc+1> = [pW-tP, (17)
prfft+D = p.pw (18)
§12. Markov Chains. Ergodic Theorem. Strong Markov Property
117
1 /+1
Figure 22. For the backward equation.
Similarly, we find for the (unconditional) probabilities pf] that
(19)
or in matrix form
In particular,
(forward equation) and
|-|](fc+i) = n](fc).p.
ru<fc+i) _ r-|](i).p)(fc)
(backward equation). Since Pl l ] = IP, IIl x' = II, it follows from these equations
that
p><*> = p\ n(k) = n*.
Consequently for homogeneous Markov chains the fr-step transition
probabilities pj]p are the elements of the kth powers of the matrix IP, so that
many properties of such chains can be investigated by the methods of matrix
analysis.
0 k k+ 1
Figure 23. For the forward equation.
118
I. Elementary Probability Theory
Example 5. Consider a homogeneous Markov chain with the two states 0 and
1 and the matrix
ip = /Poo Poi\
\Pio Pn)
It is easy to calculate that
p.
and (by induction)
Pn
! _ / Poo + PoiPio Poi(Poo + Pn)\
\Pio(Poo + Pn) Pii + PoiPio j
1 /1-Pn l~Poo\
2 - Poo - Pn \1 - Pn 1 - Poo/
| (Poo + Pn - 1)" / 1-Poo -0-Poo)\
2-Poo-Pn \—(1 — Pn) 1-Pn /
(under the hypothesis that |p00 + pxl — 1| < 1).
Hence it is clear that if the elements of P satisfy |p00 + plt — 1| < 1 (in
particular, if all the transition probabilities p0- are positive), then as n -> oo
2 - Poo
and therefore
1 /1-P„ 1-Pcc\ (20)
o - Pn \1 - Pu 1 - Poo/
„(") = 1-P" Urn »<«> ^ X ~ P°°
lim * = o . " - lim rf? =
POO — Pi 1 n ^- POO — Pi 1
Consequently if |p00 + pu — 1| < 1, such a Markov chain exhibits
regular behavior of the following kind: the influence of the initial state on
the probability of finding the particle in one state or another eventually
becomes negligible (pj-^ approach limits n-r independent of i and forming a
probability distribution: n0 > 0, izx > 0, n0 + nt = 1); if also all p0- > 0
then 7r0 > 0 and TCj > 0.
3. The following theorem describes a wide class of Markov chains that have
the property called ergodicity: the limits n} = limn p0- not only exist, are
independent of i, and form a probability distribution (nj > 0, *£j nj — 1). but
also nj > 0 for all j (such a distribution ni is said to be ergodic).
Theorem 1 (Ergodic Theorem). Let P = ||p0|| be the transition matrix of a
chain with a finite state space X = {1, 2,..., N}.
(a) If there is an n0 such that
min pg0) > 0, (21)
§12. Markov Chains. Ergodic Theorem. Strong Markov Property
119
then there are numbers n^ ..., nN such that
j
and
pg> -+ tzj, n^co (23)
for every i e X.
(b) Conversely, if there are numbers nlt...,nN satisfying (22) and (23), there
is an n0 such that (21) holds.
(c) The numbers (nlt..., nN) satisfy the equations
*J = I.*aPaJ, J= I,- - N. (24)
a
mf = min p<f, Mf = max pg>.
i i
pB+i>-Zpi.p!?. (25)
Proof, (a) Let
Since
we have
ni
t?+ x> = min pg+ » = min £ pfapjy > min £ pfa min pg> = mjf >,
i i a i a a
whence mf* < n^n+1) and similarly MJn) > Mf+1}. Consequently, to establish
(23) it will be enough to prove that
Mf - my0 -> 0, n -► oo, ; = 1,..., N.
Let £ = minItJ- pgo) > 0. Then
p(.jo+n, = £ p(„0,p(n, = £ ^(no) _ £pjn>]p<n> + £ £ pjn>pg>
as a
-1 IMf*-»S?]p»+ «*■*.
a
But pS£0) - epJJ > 0; therefore
pjno + n) > mjn). £ [pjno) _ £pJn)-j + £p(2n) = mJn)(1 _ fi) + £pj2„)>
a
and consequently
m<no + n) > mjn)(1 _ £) + £pj2„)
In a similar way
Mjn0 + n) < Mjn)(1 _ fi) + fip(2n)
Combining these inequalities, we obtain
Mjn0 + n) _ mino + n) < (MJ„) _ rfn)) . (1 _ 8)
120
I. Elementary Probability Theory
and consequently
Mikno + n) _ mikn0 + n) < (MJ„) _ ^Jn)^ _ ^k j ^ k _ Q0>
Thus Mfp) - mf'* -> 0 for some subsequence np, np -> oo. But the
difference Mf - mf is monotonic in n, and therefore Mf] - mf] -> 0,
n -> oo.
If we put Uj = limn wj"*, it follows from the preceding inequalities that
IplJ* - wj| < MJn) - wjn) < (1 - e)[n/no*-1
for n> n0, that is, pi"* converges to its limit Uj geometrically (i.e., as fast as a
geometric progression).
It is also clear that ntf* > mfo) > £ > 0 for n > n0, and therefore nj > 0.
(b) Inequality (21) follows from (23) and (25).
(c) Equation (24) follows from (23) and (25).
This completes the proof of the theorem.
4. Equations (24) play a major role in the theory of Markov chains. A
nonnegative solution (nl,..., nN) satisfying £a na = 1 is said to be a stationary
or invariant probability distribution for the Markov chain with transition
matrix ||py||. The reason for this terminology is as follows.
Let us select an initial distribution (7^,..., nN) and take pj = nr Then
a
and in general pjn) = ny In other words, if we take (nl9..., nN) as the initial
distribution, this distribution is unchanged as time goes on, i.e. for any k
p«*=;) = p«o=;>. ./= 1,...,w.
Moreover, with this initial distribution the Markov chain ^ = (£, 11, P) is
really stationary: the joint distribution of the vector (£fc, 4+1, • • •, £*+/) is
independent of k for all I (assuming that k + I < n).
Property (21) guarantees both the existence of limits nj = lim p!"*, which
are independent of U and the existence of an ergodic distribution, i.e. one
with 7ij > 0. The distribution (7^,..., nN) is also a stationary distribution.
Let us now show that the set (nlt..., nN) is the only stationary distribution.
In fact, let (ftu ..., ftN) be another stationary distribution. Then
*] = Z *«p*j = • • • = Z *«p«j>
a a
and since pl"J -> 7c,- we have
*j = Z (¾' wj) = nj-
These problems will be investigated in detail in Chapter VIII for Markov
chains with countably many states as well as with finitely many states.
§12. Markov Chains. Ergodic Theorem. Strong Markov Property
121
We note that a stationary probability distribution (even unique) may
exist for a nonergodic chain. In fact, if
then
and consequently the limits lim pW do not exist. At the same time, the
system
n. = £ napajt j = 1, 2,
reduces to
of which the unique solution satisfying 7rt + n2 = 1 is (j, j).
We also notice that for this example the system (24) has the form
1*0 = ^o Poo + nlPlO-
ni = noPoi + ^lPllJ
from which, by the condition n0 = izx = 1, we find that the unique stationary
distribution (n0, n^ coincides with the one obtained above:
1-Pn 1 -Poo
TZq = - - —
2-Poo-Pn 2-Poo-Pu
We now consider some corollaries of the ergodic theorem.
Let A be a set of states, A c X and
ft
, xeA,
Consider
vM= ^n
which is the fraction of the time spent by the particle in the set A. Since
E[/x«*)lfo = Q = P(tkeA\Z0 = i) = EpSJ^p!^)),
122
I. Elementary Probability Theory
we have
n + 1 *=1o
and in particular
E[vu)(n)|{„ = .]=^|oP!?.
It is known from analysis (see also Lemma 1 in §3 of Chapter IV) that if
an -> a then (a0 + h an)/(n + 1) -> a, n -> oo. Hence if pj}* -> njt k -> oo,
then
Ev,y)(n) -> 7tif EVyl(n) -> nA, where tt^ = £ *;•
For ergodic chains one can in fact prove more, namely that the following
result holds for IA(t0),..., IA{£n\ ....
Law of Large Numbers. //£o> £i, • ..form a finite ergodic Markov chain, then
P{|v» - nA\ > £} ^ 0, n ^ oo, (26)
/or every s > 0 and every iniYia/ distribution.
Before we undertake the proof, let us notice that we cannot apply the
results of §5 directly to IA(£o), • • •» f >i(£n)> • • •» smce tnese variables are, in
general, dependent. However, the proof can be carried through along the
same lines as for independent variables if we again use Chebyshev's
inequality, and apply the fact that for an ergodic chain with finitely many
states there is a number p, 0 < p < 1, such that
\p^-nj\<C-pn. (27)
Let us consider states i and j (which might be the same) and show that,
for e > 0,
P{|vf» - nj\ > e\t0 = i} ^ 0, n -+ oo. (28)
By Chebyshev's inequality,
P{|v(,(«)-,^.16, = o <E<IV")~^0 = --}
Hence we have only to show that
E{|vw(n) - nj\2\Z0 = i}^0, n^ oo.
§12. Markov Chains. Ergodic Theorem. Strong Markov Property
123
A simple calculation shows that
E{|v{» - *,-|2l£o = '"> = (^TT)^'E{L?0(/°l(^) " ^Jh = '}
= . y y m^o
where
<° = E{[/ui«^w(«]lfo = <}
- tj-E[/U({*)lfo = 0 - «j-E[/u,«/)l{o = ']+ "|
-10-^-^-10--,-10 + ^.
s = min(fr, /) and * = \k — l\.
By (27),
p!;> = 7^ + £!;>, W$\<cP\
Therefore
K*'i < Ciiy + p'+ p* + p'x
where Cx is a constant. Consequently
V1 + XJ k = 0 / = 0 l/* + XJ fc = 0 / = 0
„ 4C, 2(ti + l) 8C,
^(n + 1)2' 1-p -(„ + l)(l-p)^0' "^°°-
Then (28) follows from this, and we obtain (26) in an obvious way.
5. In §9 we gave, for a random walk S0, Slt... generated by a Bernoulli
scheme, recurrent equations for the probability and the expectation of the
exit time at either boundary. We now derive similar equations for Markov
chains.
Let £ = (<!;0,..., <!;„) be a Markov chain with transition matrix ||p0-|| and
phase space X = {0, ± 1,..., ±N}. Let A and B be two integers, —N<
A < 0 < B < N, and xeX. Let 08k+ x be the set of paths (x0, xlt..., xk),
Xf e X, that leave the interval (/4, B) for the first time at the upper end, i.e.
leave (A, B) by going into the set (B, B + 1,..., N).
ForA<x<B, put
Wx) = P{«o We*l+I|{0 = x}.
In order to find these probabilities (for the first exit of the Markov chain
from (A, B) through the upper boundary) we use the method that was
applied in the deduction of the backward equations.
124
I. Elementary Probability Theory
We have
Wx) = P{«o,...,We^+1|fc = x}
= 2>*y-P{(£o, ...,^)e^fc+1|^0 = x,^=y}.
where, as is easily seen by using the Markov property and the homogeneity
of the chain,
= P{(x,y, £2,..., to e<%k+l\£0 = x, ^ = y}
aPfc{2 We*|{, = rf
Therefore
y
for /1 < x < B and 1 < k < n. Moreover, it is clear that
&(x) = 1, x = B, B + 1,..., N,
and
)5fc(x) = 0, x= -N,...,A
In a similar way we can find equations for afc(x), the probabilities for first
exit from (A9 B) through the lower boundary.
Let Tfc = min{0 < / < k: £, $ (A, B)}, where rk = k if the set {•} = 0.
Then the same method, applied to mk(x) = E(rfc |£0 = x), leads to the
following recurrent equations:
Wfc(x)= 1 +£lWjk-i(j>)Pxj,
y
(here 1 < k < n, A < x < B). We define
Wfc(x) = 0, x $ {A, B).
It is clear that if the transition matrix is given by (11) the equations for
afc(*)> Pk(x) and mk(x) become the corresponding equations from §9, where
they were obtained by essentially the same method that was used here.
These equations have the most interesting applications in the limiting
case when the walk continues for an unbounded length of time. Just as in §9,
the corresponding equations can be obtained by a formal limiting process
(k -+ oo).
By way of example, we consider the Markov chain with states {0, 1,..., B}
and transition probabilities
Poo = 1> Pbb = 1»
§12. Markov Chains. Ergodic Theorem. Strong Markov Property
and
(Pi > 0, j = i+ 1,
(Pi > 0, j = i+ 1,
U>0, j = i-l,
for 1 < i < B — 1, where p£ + qx + rf = 1.
For this chain, the corresponding graph is
■Qt
It is clear that states 0 and B are absorbing, whereas for every other state
i the particle stays there with probability ri9 moves one step to the right with
probability p£, and to the left with probability qt.
Let us find a(x) = lim*.^ afc(x), the limit of the probability that a particle
starting at the point x arrives at state zero before reaching state B. Taking
limits as k -> oo in the equations for afc(x), we find that
<*(/) = qj<3 -1) + rjaij) + P}ol(J + 1)
when 0 <j < B, with the boundary conditions
a(0) = 1, a(B) = 0.
Since rj — 1 — qs — pjt we have
pMJ + d -«(/» = «M/> - «a - D)
and consequently
*(j + 1)- o0) = PjW) - 1),
where
But
Therefore
Pi--- Pj
«(/+ 1)-1= I («(!- + I)" «(0).
i = 0
«(; +1)-l=(a(l)-1)-1/¾.
i=0
126
I. Elementary Probability Theory
If/ = B - 1, we have a(/ + 1) = a(B) = 0, and therefore
«d) = 1 = - v*n-r.
whence
«(1) = ^V^ and ^-¾¾. ;=1 B-
(This should be compared with the results of §9.)
Now let m(x) = \\mk mk(x), the limiting value of the average time taken
to arrive at one of the states 0 or B. Then m(0) = m(B) = 0,
m(x) = 1 + £ m(y)pxy
y
and consequently for the example that we are considering,
m(j) = 1 + qjm(j -1) + r}m(J) + psm(j + 1)
forj = 1, 2,..., B — 1. To find m(J) we put
M(J) = m(J) - m(J - 1), / = ftl B.
Then
PjM(j + 1) = qjM(j) - 1, j=l,...,B-l,
and consequently we find that
M(j + 1) = p,M(l) - K,-,
where
Therefore
m(i) = m(j) -
"•pj Pji Pj-i
m(0) = £ M(i + 1)
Pj
0---Pi J
= I (P.^(l) - *.) = 1*1) 'l Pi - iV
» = 0 f = 0 i = 0
It remains only to determine m(l). But w(B) = 0, and therefore
and for 1 < j < B,
^(1) = ¾¾
Lt=o Pi
j-l yB-l n j-l
i=0 2^=0 Pi i=0
§12. Markov Chains. Ergodic Theorem. Strong Markov Property
127
(This should be compared with the results in §9 for the case r,- = 0, p£ = p,
Qi = q)
6. In this subsection we consider a stronger version of the Markov property
(8), namely that it remains valid if time k is replaced by a random time (see
also Theorem 2). The significance of this, the strong Markov property, can
be illustrated in particular by the example of the derivation of the recurrent
relations (38), which play an important role in the classification of the states
of Markov chains (Chapter VIII).
Let £ = (£i, ...,£„) be a homogeneous Markov chain with transition
matrix ||py||; let & = (@t)o<k<n be a system of decompositions, Q)\ =
3>s0 £k. Let <%l denote the algebra a(^|) generated by the decomposition
We first put the Markov property (8) into a somewhat different form. Let
B e #j. Let us show that then
P{£„ = an,..., £fc+1 = ak+l\B n (£k = ak)}
= P{£„ = am9..., &+1 = ak+l\£k = ak} (29)
(assuming that P{B n (£k = ak)} > 0). In fact, B can be represented in the
form
* = I* tto = <*.-.. & = <*}.
where J]* extends over some set (ag,..., ag). Consequently
P{£„ = an,..., £k+l = ak+l\B n (¾ = ak)}
P{(L = an,---,tk = ak)nB}
P{(£* = ak) n B}
= £*P{«w = flw,...,& = fl*)ntf0 = £18,-.-,¾ = £tf)}
P{(tk = ak)nB} ' K >
But, by the Markov property,
P{(£„ = an,.... Zk = ak) n (£0 = a£,.... Zk = at)}
IP{£n = an,...,£fc+1 =ak+1\£0 = a$,. ..,^ = ^}
x P{£0 = «o. • • •. L = a*) if a* = a*.
0 ifflfc^flfc*,
fP{£n = an,.... £fc+1 = afc+1|& = aJP{£0 = *.-.,& = "ft
= | ifafc = c£,
[0 ifafc*flfc*,
[>{£„ = an,..., £fc+1 = afc+1l£* = ak}P{tfk = afc) n B}
= 1 \fak = ak*t
[0 iiak*at. ■ ~
128
I. Elementary Probability Theory
Therefore the sum Z* in (30) is equal to
P{£„ = an,..., £fc+1 = ak+l\£k = ak}P{(£k = ak) n B},
This establishes (29).
Let t be a stopping time (with respect to the system D* = (DQozkzn* see
Definition 2 in §11).
Definition. We say that a set B in the algebra @\ belongs to the system of sets
@\ if, for each k,0<k<n,
Bn{r = k}e@l (31)
It is easily verified that the collection of such sets B forms an algebra
(called the algebra of events observed at time t).
Theorem 2. Let £ = (^0,..., <!;„) be a homogeneous Markov chain with
transition matrix ||py||, t a stopping time (with respect to &), B e 38\ and
A = {co: x + / < n). Then if P{A nBn(£x = a0)} > 0, we have
PK,+/= £%,..-, ¢,+1 =a1\AnBn(£x = a0)}
= P«t+/ = ¾ {t+1=fl,Mn8t = flo». (32)
and if P{A n (f t = a0)} > 0 then
P{£t+/ = al,...,£t+1=a1\An (ft = a0)} = pao<Xl... pfl|_lfl|. (33)
For the sake of simplicity, we give the proof only for the case 1=1. Since
Bn(T = fc)e4we have, according to (29),
PK,+ i = fli,i4r>Br>«t = fl0)}
= Z P{6k+ias*i.6kasflo.T = *.fl}
fc;£n-l
= Z P{^+l=«ll^ = fl0,T = fc,B}P{^ = fl0,T = fc,B}
fc^n- 1
= Z PK*+iasflil& = flo}PK* = flo.* = fc,B}
= P-o-i- Z PK* = flo.T = fc,B}=paoBl-P{i4nBn«t = flo)},
fc^n-l
which simultaneously establishes (32) and (33) (for (33) we have to take
B = Q).
Remark. When I = 1, the strong Markov property (32), (33) is evidently
equivalent to the property that
P«t+ xeC\AnBn(Zx = a0)} = Pao(C\ (34)
§12. Markov Chains. Ergodic Theorem. Strong Markov Property
129
for every C c x, where
Pa0(Q = £ Paoaf
ateC
In turn, (34) can be restated as follows: on the set A = {t < n - 1},
PKt+ieC|0{} = Pfc(CX (35)
which is a form of the strong Markov property that is commonly used in the
general theory of homogeneous Markov processes.
7. Let { = K0 <!;„) be a homogeneous Markov chain with transition
matrix ||py||, and let
ft} = P{tk = U/ * U < I < k - l|fc = i} (36)
and
for i ?£ j be respectively the probability of first return to state i at time k and
the probability of first arrival at state j at time k.
Let us show that
Pi? = I /SMT". where pf = 1. (38)
fc=l
The intuitive meaning of the formula is clear: to go from state i to state j
in n steps, it is necessary to reach statej for the first time in k steps (1 < k < n)
and then to go from state j to state j inn — k steps. We now give a rigorous
derivation.
Let 7 be given and
t = min{l < k < n:£k= j},
assuming that t = n + 1 if {•} = 0- Then /g? = P{t = k\£0 = i} and
p!? = P{^ =./16, = 0
= I PKB=;,T = fc|fc = i'}
l<fc<n
= Z PK.+B-*=J,T = fc|fc = i}, (39)
l<fc<n
where the last equation follows because £T+n-k = ^„ on the set {t = fr}.
Moreover, the set {t = k) = {x = k, £T = 7} for every fr, 1 < k < n. Therefore
if p{£0 = it T = k] > 0, it follows from Theorem 2 that
PKr+„-fc =Mo = i,r = k} = P{tt+n.k =j\Z0 = U t = fc, t = J'}
= P{^+„-fc=7l6=7} = PJrfc>
130
I. Elementary Probability Theory
and by (37)
Pi? = t PK.+--* =;lfc = Ur = k}P{r = k\£0 = i}
~ L, Vjj J ij »
fc=l
which establishes (38).
8. Problems
1. Let £ = (£0,..., £„) be a Markov chain with values in X and f = /(x) (x e X) a
function. Will the sequence (/(£<>), f(Zn)) form a Markov chain? Will the
"reversed" sequence
(¢.,^-1,...^0)
form a Markov chain?
2. Let IP = ||py||t 1 < ij < r, be a stochastic matrix and X an eigenvalue of the matrix,
i.e. a root of the characteristic equation det||P — XE\\ = 0. Show that Aq = 1 is an
eigenvalue and that all the other eigenvalues have moduli not exceeding 1. If all the
eigenvalues X u..., Xr are distinct, then pJJ} admits the representation
PB* = ^ + ^./1)^+... + ^)^,
where tcj, fly(l),..., a./r) can be expressed in terms of the elements of P. (It follows
from this algebraic approach to the study of Markov chains that, in particular, when
|Xl | < 1,..., \Xr\ < 1, the limit lim plff exists for every j and is independent of i.)
3. Let £ = (£0,..., £n) be a homogeneous Markov chain with state space X and
transition matrix P = \\pxy\\. Let
Tcp(x) = Elcp(tlM0 = x]
l = Y,<p<y)Pxy)-
Let the nonnegative function <p satisfy
T<p(x) = <p(x), xeX.
Show that the sequence of random variables
C = &,3JD with £* = ?(£*)
is a martingale.
4. Let f = (¢,,, n, IP) and I = (£„, n,P) be two Markov chains with different initial
distributions n = (plt..., pr) and n = (p"i pr). Show that if minitip0- > a > 0
then
Z l^n> -P5n,l <2(1 -ay.
i=l
CHAPTER II
Mathematical Foundations of
Probability Theory
§1. Probabilistic Model for an Experiment with
Infinitely Many Outcomes. Kolmogorov's Axioms
1. The models introduced in the preceding chapter enabled us to give a
probabilistic-statistical description of experiments with a finite number of
outcomes. For example, the triple (CI, &0, P) with
Cl = {co:co = (alf...,an),at = 0,1},j^ = {A: A c CI}
and p(co) = pIfl'gn-Zfl' is a model for the experiment in which a coin is tossed
n times "independently" with probability p of falling head. In this model the
number N(Cl) of outcomes, i.e. the number of points in CI, is the finite
number 2".
We now consider the problem of constructing a probabilistic model for
the experiment consisting of an infinite number of independent tosses of a
coin when at each step the probability of falling head is p.
It is natural to take the set of outcomes to be the set
CI = {co:co = (au a2,. • •), a{ = 0,1},
i.e. the space of sequences co = (alt a2t.. •) whose elements are 0 or 1.
What is the cardinality N(C1) of ft? It is well known that every number
a e [0, 1) has a unique binary expansion (containing an infinite number of
zeros)
a=y +!§ + ••• to = 0,1).
132 II. Mathematical Foundations of Probability Theory
Hence it is clear that there is a one-to-one correspondence between the points
co of CI and the points a of the set [0,1), and therefore CI has the cardinality of
the continuum.
Consequently if we wish to construct a probabilistic model to describe
experiments like tossing a coin infinitely often, we must consider spaces CI
of a rather complicated nature.
We shall now try to see what probabilities ought reasonably to be assigned
(or assumed) in a model of infinitely many independent tosses of a fair coin
(P + q = i).
Since we may take CI to be the set [0,1), our problem can be considered
as the problem of choosing points at random from this set. For reasons of
symmetry, it is clear that all outcomes ought to be equiprobable. But the
set [0,1) is uncountable, and if we suppose that its probability is 1, then it
follows that the probability p(co) of each outcome certainly must equal
zero. However, this assignment of probabilities (p(co) = 0, co e [0, 1)) does
not lead very far. The fact is that we are ordinarily not interested in the
probability of one outcome or another, but in the probability that the result
of the experiment is in one or another specified set A of outcomes (an event).
In elementary probability theory we use the probabilities p(co) to find the
probability P(A) of the event A: P(A) = Y^^a p(4 In the present case, with
p(co) = 0, co e [0, 1), we cannot define, for example, the probability that a
point chosen at random from [0,1) belongs to the set [0, i). At the same time,
it is intuitively clear that this probability should be j.
These remarks should suggest that in constructing probabilistic models for
uncountable spaces CI we must assign probabilities, not to individual
outcomes but to subsets of CI. The same reasoning as in the first chapter shows
that the collection of sets to which probabilities are assigned must be closed
with respect to unions, intersections and complements. Here the following
definition is useful.
Definition 1. Let Q be a set of points co. A system sf of subsets of CI is called
an algebra if
(a) fie^,
(bJ^BGi^^uBei, An Best,
(c) Aesf ^> Aesf
(Notice that in condition (b) it is sufficient to require only that either
/4uBe^or that /InBei, since A \j B = A n B and A r\B = A\jB.)
The next definition is needed in formulating the concept of a probabilistic
model.
Definition 2. Let si be an algebra of subsets of CI. A set function \i = fi(A),
Aesf, taking values in [0, oo], is called a. finitely additive measure defined
§1. Probabilistic Model for an Experiment with Infinitely Many Outcomes 133
011 s/ if
KA + B) = KA) + KB). (1)
for every pair of disjoint sets A and B in s/.
A finitely additive measure \i with fi(p) < oo is called finite, and when
/4Q) = 1 it is called a finitely additive probability measure, or a finitely
additive probability.
2. We now define a probabilistic model (in the extended sense).
Definition 3. An ordered triple (Q, s/, P), where
(a) Q is a set of points co;
(b) s/ is an algebra of subsets of Q;
(c) P is a finitely additive probability on A,
is a probabilistic model in the extended sense.
It turns out, however, that this model is too broad to lead to a fruitful
mathematical theory. Consequently we must restrict both the class of
subsets of Q that we consider, and the class of admissible probability measures.
Definition 4. A system #" of subsets of Q is a a-algebra if it is an algebra and
satisfies the following additional condition (stronger than (b) of
Definition 1):
(b*) if Ane^9n = 1,2,..., then
\jAHe&, f]Ane^
(it is sufficient to require either that (J An e & or that f] An e &).
Definition 5. The space Q, together with a <r-algebra & of its subsets is a
measurable space, and is denoted by (Q, !F\
Definition 6. A finitely additive measure \i defined on an algebra si of subsets
of Q is countably additive (or o-additive), or simply a measure, if, for all
pairwise disjoint subsets Al9 A2,... of A with £/!„ e si
(oo \ oo
n=l / n=l
A finitely additive measure \i is said to be <r-finite if Q can be represented in
the form
n=l
with /i(Qn) < oo, n = 1, 2,....
134
II. Mathematical Foundations of Probability Theory
If a countably additive measure P on the algebra A satisfies P(fi) — *»
it is called a probability measure or a probability (defined on the sets that
belong to the algebra sf).
Probability measures have the following properties.
V 0 is the empty set then
P(0) = 0.
If A,B est then
P(A uB) = P(A) + P(B) - P(A n B).
If A,Best andB^A then
P(B) < P{A).
If An etf,n = 1,2,..., and \J An est, then
P(AX u A2 u • • •) < P{AX) + P{A2) + • • •.
The first three properties are evident. To establish the last one it is enough
to observe that \J™=1 An = £*=1 Bn, where B1 = A1, Bn = /^ rv-• n
An-i nAn,n> 2, B{ n B} = 0, i ^ j, and therefore
(oo \ / oo \ oo oo
U A. = P( I * = I P(Bn)< I P^jL
n=l / \„=1 / n=l «=1
The next theorem, which has many applications, provides conditions
under which a finitely additive set function is actually countably additive.
Theorem. Let P be a finitely additive set function defined over the algebra s#,
with P(Q) = 1. The following four conditions are equivalent:
(1) P is a-additive (P is a probability);
(2) P is continuous from below, i.e. for any sets Ay, /42,...g^ such that
An c An+l and \J™=1 An e s/,
\imP(AJ=p(\jAn\,
(3) P is continuous from above, i.e. for any sets A u A2,... such that An^An+1
and()Z>=lAnest,
\imP(An) = P^r\An^,
§1. Probabilistic Model for an Experiment with Infinitely Many Outcomes 135
(4) P is continuous at 0, Le. for any sets A^A2,...Esd such that A„+l ^ A„
lim PC4„) = 0.
n
Proof. (1) => (2). Since
0 An = A, + U2V1) + U3V2) + • • •,
we have
p( CUn) = P(A0 + P042Mi) + P(A3\A2) + -
= P{AX) + PU2) - PUO + P(^3) - PG42) + • •
= lim P(4n).
n
(2) => (3). Let n > 1; then
P04„) = PUAUiV„)) = P(^i) - PUiVU-
The sequence {A i\An}n>l of sets is nondecreasing (see the table in Subsection
3 below) and
Then, by (2)
limP(^1Mn) = pfO(M4))
n \n=l /
and therefore
lim PG4„) = PUO - lim PUA4)
n
= pui) - p( 0 MiVj) = p(^i) - p(^i\ ck)
= p(^o - puo + p( n^n) = p(/k)
(3) => (4). Obvious.
(4) => (1). Let Alt A2t • •. g sd be pairwise disjoint and let ]££L! /ln g $2.
Then
Table
Notation
Set-theoretic interpretation
Interpretation in probability theory
CO
n
&
AsP
/4 = n\/4
AuB
AnB(or AB)
0
AnB = 0
A+B
A\B
AAB
n=l
element or point
set of points
tr-algebra of subsets
set of points
complement of A, i.e. the set of points co that are not in A
union of A and B, i.e. the set of points co belonging either
to A or to B
intersection of A and 5, i.e. the set of points co belonging to
both A and B
empty set
A and B are disjoint
sum of sets, i.e. union of disjoint sets
difference of A and B, i.e. the set of points that belong to A
but not to B
symmetric difference of sets, i.e. (A\B) u (B\A)
union of the sets Al9 A2, •..
outcome, sample point, elementary event
sample space; certain event
tr-algebra of events
event (if co e /1, we say that event A occurs)
event that A does not occur
event that either A or B occurs
event that both A and B occur
impossible event
events A and B are mutually exclusive, i.e. cannot occur
simultaneously
event that one of two mutually exclusive events occurs
event that A occurs and B does not
event that A or B occurs, but not both
event that at least one of Alt A2,... occurs
£ A„ sum, i.e. union of pairwise disjoint sets Alt A2,... event that one of the mutually exclusive events Au A2,...
occurs
f] An intersection of A t, A 2,... event that all the events Ai,A2,... occur
n=l
A„ t A the increasing sequence of sets An converges to A, i.e. the increasing sequence of events converges to event A
for A = lim t A„ 1 Ax s /l2 c ... and A = (J /4„
/4„ J, A the decreasing sequence of sets An converges to A, i.e. the decreasing sequence of events converges to event A
(or A = lim J, An J /4t 2 /42 2 • • • and A = f] A„
\ n / n=l
Hrn /4n the set event that infinitely many of events A l5 A2 occur
(or lim sup An - °°
or*M.i.o.}) NU ^
Hm A„ the set event that all the events A t, A 2,... occur with the possible
" oo oo exception of a finite number of them
(or lim inf A„) (J f) Ak
n=l fc=n
* i.o. = infinitely often.
138
II. Mathematical Foundations of Probability Theory
and since £," „+1 ^i i 0»n -*■ °°> we have
£ PU.) = lim £ PC4,-) = lim p( £ a\
=lir[p(l/-)-p(..|+^)]
=p(|i.i)-.irp(j+/()=p(|i4
3. We can now formulate Kolmogorov's generally accepted axiom system,
which forms the basis for the concept of a probability space.
Fundamental Definition. An ordered triple (CI, IF, P) where
(a) CI is a set of points co,
(b) !F is a a-algebra of subsets of CI,
(c) P is a probability on IF,
is called a probabilistic model or a probability space. Here CI is the sample space
or space of elementary events, the sets A in !F are events, and P(A) is the
probability of the event A.
It is clear from the definition that the axiomatic formulation of probability
theory is based on set theory and measure theory. Accordingly, it is useful to
have a table (pp. 136-137) displaying the ways in which various concepts are
interpreted in the two theories. In the next two sections we shall give examples
of the measurable spaces that are most important for probability theory and
of how probabilities are assigned on them.
4. Problems
1. Let CI = {r: r e [0,1]} be the set of rational points of [0,1], sf the algebra of sets
each of which is a finite sum of disjoint sets A of one of the forms {r\a <r < b},
{r:a<r < b}, {r:a<r < b}, {r:a<r < b}, and P(A) = b - a. Show that P(A),
Aes/,is finitely additive set function but not countably additive.
2. Let CI be a countable set and & the collection of all its subsets. Put y{A) = 0 if A is
finite and /i(A) = oo if A is infinite. Show that the set function fi is finitely additive
but not countably additive.
3. Let /i be a finite measure on a a-algebra SF, An e $F, n = 1, 2,..., and A = limn An
(i.e., A = lim„ A„ = hm„ A„). Show that y{A) = limn fj(An).
4. Prove that P(A AB)= P(A) + P(#) - 2P(/I n B).
§2. Algebras and <r-Algebras. Measurable Spaces
139
5. Show that the "distances" px(A, B) and p2(A, B) defined by
Pl{A, B) = P(A A B),
p2(A, B) = mAvB)
[o if P(A u B) = 0
satisfy the triangle inequality.
6. Let p. be a finitely additive measure on an algebra sf, and let the sets Al,A2,-.-es
be pairwise disjoint and satisfy A = £-1, At e j^. Then p(A) > J> x //(/1,-).
7. Prove that
lim sup /!„ = lim inf An, lim inf An = lim sup Ant
lim inf/!„ e Hm SUp An, lim sup(y4n u Bn) = lim sup /!„ u lim sup B„,
lim sup A„ n lim inf B„ s Hm sup04n n Bn) e Hm sup /!„ n lim sup £n.
IMn T 4 or 4n J 4, then
lim inf An = lim sup A„.
8. Let {xn} be a sequence of numbers and An = (—oo, xn). Show that x = lim sup x„
and /1 = lim sup An are related in the following way: (—oo, x) £ A £ (— oo, x].
In other words, A is equal to either (—oo, x) or to (—oo, x].
9. Give an example to show that if a measure takes the value + oo, it does not follow in
general that countable additivity implies continuity at 0.
§2. Algebras and <r-Algebras. Measurable Spaces
1. Algebras and oalgebras are the components out of which probabilistic
models are constructed. We shall present some examples and a number of
results for these systems.
Let Q be a sample space. Evidently each of the collections of sets
^* = (0,0}, &*= {A'.A^Cl}
is both an algebra and a <r-algebra. In fact, ^ is trivial, the "poorest"
<T-algebra, whereas #"* is the "richest" <r-algebra, consisting of all subsets
of a
When CI is a finite space, the tr-algebra #"* is fully surveyable, and
commonly serves as the system of events in the elementary theory. However, when
the space is uncountable the class #"* is much too large, since it is impossible
to define "probability" on such a system of sets in any consistent way.
If A ^ fi, the system
140
II. Mathematical Foundations of Probability Theory
is another example of an algebra (and a <r-algebra), the algebra (or <r-algebra)
generated by A.
This system of sets is a special case of the systems generated by
decompositions. In fact, let
g> = {DUD2,...}
be a countable decomposition of CI into nonempty sets:
Cl = Dl+D2 + ---; DinDj=0, i * j.
Then the system sf = a(2>), formed by the sets that are unions of finite
numbers of elements of the decomposition, is an algebra.
The following lemma is particularly useful since it establishes the important
principle that there is a smallest algebra, or <r-algebra, containing a given
collection of sets.
Lemma 1. Let £ be a collection of subsets of CI. Then there are a smallest
algebra a(£) and a smallest a-algebra a{S) containing all the sets that are in S.
Proof. The class #"* of all subsets of CI is a <r-algebra. Therefore there are at
least one algebra and one <r-algebra containing S. We now define cl(£)
(or a{S)) to consist of all sets that belong to every algebra (or <r-algebra)
containing S. It is easy to verify that this system is an algebra (or tr-algebra)
and indeed the smallest.
Remark. The algebra a(£) (or o(E\ respectively) is often referred to as the
smallest algebra (or <r-algebra) generated by £.
We often need to know what additional conditions will make an algebra,
or some other system of sets, into a <r-algebra. We shall present several results
of this kind.
Definition 1. A collection Jt of subsets of CI is a monotonic class if An e M,
n= 1,2,..., together with An t A or An J A, implies that AtJt.
Let £ be a system of sets. Let /*(<f) be the smallest monotonic class
containing S. (The proof of the existence of this class is like the proof of Lemma 1.)
Lemma 2. A necessary and sufficient condition for an algebra s^ to be a
a-algebra is that it is a monotonic class.
Proof. A tr-algebra is evidently a monotonic class. Now let $0 be a monotonic
class and Anesft n = 1, 2, It is clear that Bn = \J"=l Ax e &0 and
Bn c Bn+1. Consequently, by the definition of a monotonic class,
Bn T U£ l AiE **• Similarly we could show that f]f= x A-x e st.
By using this lemma, we can prove that, starting with an algebra sf, we
can construct the <r-algebra g(j#) by means of monotonic limiting processes.
§2. Algebras and <r-Algebras. Measurable Spaces
141
Theorem 1. Let s$ be an algebra. Then
Visi) = ffGsO. (1)
Proof. By Lemma 2, ix(d) c o(st). Hence it is enough to show that ix(s#)
is a <T-algebra. But Jt = i^sf) is a monotonic class, and therefore, by Lemma
2 again, it is enough to show that /i(«sO is an algebra.
Let y4el; we show that AeJt. For this purpose, we shall apply a
principle that will often be used in the future, the principle of appropriate sets,
which we now illustrate.
Let
Jt = {B:5eJ,BeJ}
be the sets that have the property that concerns us. It is evident that
stf c Jt c Jt. Let us show that Jt is a monotonic class.
Let Bn e Jt; then Bn e Jt, Bn e Jt, and therefore
\\m\BnsJt, \\m\BnsJt, lim J BneJt, lim I BneJt.
Consequently
lim t Bn = lim \BneJt, lim j Bn = lim |BnGJ^,
lim t Bn = lim j Bn e Jt, lim j Bn = lim T #„ e ^,
and therefore Jt is a monotonic class. But Jt <=, Jt and ^ is the smallest
monotonic class. Therefore Jt — Jt, and if /4 e Jt — ix(sf), then we also
have A e Jt, i.e. Jt is closed under the operation of taking complements.
Let us now show that Jt is closed under intersections.
Let AeJt and
JtA= {B:BeJt,Ar\BeJt}.
From the equations
lim i (A n Bn) = A n lim j Bn,
lim t {A n Bn) = A n lim t £„
it follows that JtA is a monotonic class.
Moreover, it is easily verified that
(AeJtB)o(BeJtA). (2)
Now let Ae sf; then since sd is an algebra, for every Be^ the set
A n Best and therefore
.^ c JtA c ^.
But ^ is a monotonic class (since lim | /lBn = /1 lim t #„ and lim J /lBn =
A lim | £„), and Jt is the smallest monotonic class. Therefore JtA = ^ for
all A est. But then it follows from (2) that
(A e JtB)o(B eJtA = Jt).
142
II. Mathematical Foundations of Probability Theory
whenever A g s# and BeJt. Consequently ifAesf then
A 6lB
for every BeJt. Since A is any set in sf, it follows that
Therefore for every BeJt
JtB — Jt,
i.e. if B g Jt and C g Jt then C r\BeJt.
Thus ^ is closed under complementation and intersection (and therefore
under unions). Consequently Jt is an algebra, and the theorem is established.
Definition 2. Let Q be a space. A class 3) of subsets of Q is a d-system if
(a) fie^;
(b) A,B,e@,A^B^> B\A e Q)\
(c) Ane@,An^An+l^\jAne@.
If ^ is a collection of sets then d{8) denotes the smallest d-system
containing 8.
Theorem 2. If the collection 8 of sets is closed under intersections, then
d(£) = a{8) (3)
Proof. Every <r-algebra is a d-system, and consequently d{8) c o(8). Hence
if we prove that d{8) is closed under intersections, d{8) must be a <r-algebra
and then, of course, the opposite inclusion o{8) ^ d{8) is valid.
The proof once again uses the principle of appropriate sets.
Let
A = {Be d{8): BnAe d{8) for all A e 8).
If B g 8 then B n A e 8 for all A e 8 and therefore 8 ^ ^. But 8X is a
d-system. Hence d{8) ^ 8X. On the other hand, 8y c d(<f) by definition.
Consequently
8X = d{8).
Now let
82 = {Be d{8)\ BnAe d{8) for all A e d{8)}.
Again it is easily verified that 82 is a d-system. If B e 8, then by the definition
of 8X we obtain that BnAed{8) for all Ae8t = d{8). Consequently
8 £ <f2 and d(<f) £ <f2. But d(<f) 3 <f2; hence d(<f) = 82, and therefore
§2. Algebras and <r-Algebras. Measurable Spaces
143
whenever A and B are in d(£\ the set A n B also belongs to d(£\ i.e: d(£) is
closed under intersections.
This completes the proof of the theorem.
We next consider some measurable spaces (Q, IF) which are extremely
important for probability theory.
2. The measurable space (R, <%(R)). Let R = (— oo, oo) be the real line and
(a,b] = {xeR:a <x < b}
for all a and b, — co < a < b < oo. The interval (a, oo] is taken to be (a, oo).
(This convention is required' if the complement of an interval (— oo, b] is
to be an interval of the same form, i.e. open on the left and closed on the
right.)
Let sd be the system of subsets of R which are finite sums of disjoint
intervals of the form (a, b]:
n
A e sd if A = £ (at, 6,], n < oo.
i= 1
It is easily verified that this system of sets, in which we also include the
empty set 0, is an algebra. However, it is not a <r-algebra, since if An =
(0,1 - 1/n] g sf, we have |J„ K = (0,1) $ si.
Let <%(R) be the smallest <r-algebra o(sd) containing sd. This <r-algebra,
which plays an important role in analysis, is called the Borel algebra of subsets
of the real line, and its sets are called Borel sets.
UJ is the system of intervals J of the form (a, b], and a(S) is the smallest
<T-algebra containing ./, it is easily verified that o(J) is the Borel algebra.
In other words, we can obtain the Borel algebra from J without going
through the algebra s£y since o(J) = o(a(S)).
We observe that
[*,*]= n (* 4 4
Thus the Borel algebra contains not only intervals (a, b] but also the
singletons {a} and all sets of the six forms
(atb\ [0,¾ [o,fe), (-oo,fe), (-00,6], (a, oo). (4)
a <b,
a <b,
144
II. Mathematical Foundations of Probability
Let us also notice that the construction of @(R) could have been based on
any of the six kinds of intervals instead of on (a, b], since all the minimal
<T-algebras generated by systems of intervals of any of the forms (4) are the
same as <%(R).
Sometimes it is useful to deal with the <r-algebra 88(R) of subsets of the
extended real line R = [- oo, oo]. This is the smallest <r-algebra generated by
intervals of the form
(a, fe] = {x g R: a < x < b}, - oo < a < b < oo,
where (— oo, b~\ is to stand for the set {x g R: — oo < x < b}.
Remark 1. The measurable space (R, @(R)) is often denoted by (R, Sff) or
(R\ @{).
Remark 2. Let us introduce the metric
on the real line R (this is equivalent to the usual metric |x — y\) and let
&Bq(R) be the smallest tr-algebra generated by the open sets Sp(x°) =
{x g R: Pi(x, x°) < p},p> 0, x° g R. Then @0(R) = @(R) (see Problem 7).
3. The measurable space (#", @{Rn)). Let Rn = R x • • • x R be the direct, or
Cartesian, product of n copies of the real line, i.e. the set of ordered n-tuples
x = (xi,..., x„), where — oo < xfc < oo, k = 1,..., n. The set
I = IX x ••• x /n,
where Ik = (akt fej, i.e. the set {x g Rn: xk g Iki k = 1,...,«}, is called a
rectangle, and lk is a side of the rectangle. Let J* be the set of all rectangles /.
The smallest <r-algebra o(J) generated by the system J is the Borel algebra
of subsets of Rn and is denoted by &B(Rn). Let us show that we can arrive at
this Borel algebra by starting in a different way.
Instead of the rectangles I = It x - - • x In let us consider the rectangles
B = Bx x • • • x Bn with Borel sides (Bk is the Borel subset of the real line
that appears in the kth place in the direct product R x • • • x R). The smallest
<T-algebra containing all rectangles with Borel sides is denoted by
@(R) ® • • • ® @(R)
and called the direct product of the <7-algebras <%(R). Let us show that in fact
@(Rn) = @(R) ® • • • ® @{R\
§2. Algebras and <r-Algebras. Measurable Spaces
145
In other words, the smallest cr-algebra generated by the rectangles I =
11 x • • • x J„ and the (broader) class of rectangles B = Bt x •■> x Bn with
Borel sides are actually the same.
The proof depends on the following proposition.
Lemma 3. Let Shea class of subsets ofd, let B c Q and define
6 nB= {AnB:Ae£}. (5)
Then
o(£ r\B) = a(£) n B. (6)
Proof. Since £ c G(£\ we have
S nB^ o(£) n B. (7)
But g(£) n B is a cr-algebra; hence it follows from (7) that
o(£ nB) c o(£) n B.
To prove the conclusion in the opposite direction, we again use the
principle of appropriate sets.
Define
<€B = {A e a(£): Ar\Bea(£ n B)}.
Since o(£) and a(£ n £) are cr-algebras, #B is also a cr-algebra, and evidently
S C #fi C <7(A
whence cr(<£) c <7(#B) = #B c <7(<f) and therefore o(£) = <$B. Therefore
Ar\Bea{S r\B)
for every A c a{£\ and consequently cr(<f) n B ^ o(£ n £).
This completes the proof of the lemma.
Proof that <%(Rn) and ^ ® • • • ® ^ are the same. This is obvious for n = 1.
We now show that it is true for n = 2.
Since @(R2) c # ® #, it is enough to show that the Borel rectangle
Bt x B2 belongs to @(R2).
Let R2 = Rx x R2, where Rx and #2 are the "first" and "second" real
lines, Ii=^jX JS2, ^2 = #1 x ^2» where #t x R2 (or J^ x ^2) is the
collection of sets of the form Bx x R2(oxRy x B2\vj'\t\iBle@l(orB2e@2).
Also let Jx and J2 be the sets of intervals in R x and R2, and ^ = Jx x R2,
•72 = *i x ^2- Then, by (6),
#! xB2 = B1nB26j1nl2 = cr(y t) n £2
= a{Jl n B2) ^ (7(,/L n ^2)
= C^ x J2\
as was to be proved.
146
II. Mathematical Foundations of Probability Theory
The case of any n, n > 2, can be discussed in the same way.
Remark. Let @0(R") be the smallest <7-algebra generated by the open sets
Sp(x°) = {xe Rn: pn(x, x°) < p}, x° g Rn, p > 0,
in the metric
P„(x,x°) = t2-fcpi(xfc,xfc0),
fc = i
where x = (xlf..., xn), x° = (x?,..., x£).
Then @o(K) = ®(R") (Problem 7).
4. The measurable space (R00, ^(R00)) plays a significant role in probability
theory, since it is used as the basis for constructing probabilistic models of
experiments with infinitely many steps.
The space /?°° is the space of ordered sequences of numbers,
x = (xi, x2,...), — oo < xfc < oo, k = 1, 2,...
Let Ik and Bk denote, respectively, the intervals (aki fej and the Borel subsets
of the kth line (with coordinate xfc). We consider the cylinder sets
J{lx x ••• xIJ={x:x = (xl,x2,...),xlelu...9xHeIH}, (8)
J(BX x • • • x Bn) = {x.x = (xlt x2 ■ ■ •), xx e Blt..., xn e Bn}, (9)
J(Bn) = {x: (xlf..., xn) e Bn}t (10)
where Bn is a Borel set in @(Rn). Each cylinder J{BX x • • x Bn), or S(Bn),
can also be thought of as a cylinder with base in Rn+1, R"+ 2,..., since
J(BX x • • • x Bn) = ^{Bx x • • x Bn x R),
S(Bn) = S(Bn+l),
where Bn+1 = Bn x R.
It follows that both systems of cylinders f(B1 x • • • x Bn) and S(Bn)
are algebras. It is easy to verify that the unions of disjoint cylinders
Hh x-x/J
also form an algebra. Let @(Rm\ &\(Rm) and @2(Rm) be the smallest
<7-algebras containing all the sets (8), (9) or (10), respectively. (The <7-algebra
@y(Rm) is often denoted by @{K) ® @{R) x • --.) It is clear that @(Rm) c
^(K00) c @2(R<°). As a matter of fact, all three <r-algebras are the same.
To prove this, we put
<gn = {A e Rn: {x: (xlf..., xn) e A} e @(Ra)}
for n = 1, 2,... . Let Bn e @(Rn). Then
B"6^c @(R«>).
§2. Algebras and <r-Algebras. Measurable Spaces
147
But <&n is a <T-algebra, and therefore
@(Rn) c <7(^n) = #n c ^(K°°);
consequently
^2(K°°) c ^(K°°).
Thus @(Rm) = M^R™) = @2(Rm).
From now on we shall describe sets in ^(.R00) as Borel sets (in R00).
Remark. Let &0(Rm) be the smallest <r-algebra generated by the open sets
Sp(x°) = {xG/?00:p00(x,x°)<p}, x°gK°°, p>0,
in the metric
P«,(x,x°)= Z2-fcpi(xfc,xfc°),
fc=i
where x = (xlf x2,...), x° = (xj, x\,...). Then ^(K00) = ^0(K°°)
(Problem 7).
Here are some examples of Borel sets in J1?00:
(a) {xG#°°:supxn> a},
{xGK°°:inf xn<a}\
(b) {xERm:hmxn<a}t
{xGK°°:limxn> a},
where, as usual,
Iimxn = inf supxm, limxn = sup inf xm;
(c) {xeR™: x„ ->}, the set of x g .R00 for which lim x„ exists and is finite;
(d) {xG/?°°:limxn> a};
(e) {xeR»:lZLl\xH\>a};
(f) {x g Rm: Yi=i xk = 0 for at least one n > 1}.
To be convinced, for example, that sets in (a) belong to the system @(Rm\
it is enough to observe that
{x: sup xn> a} = \J {x: xn > a} e @(Rm\
n
{x: inf xn < a} = \J {x: xn < a) e @(Rm).
n
5. The measurable space (RT, <%(RT)\ where T is an arbitrary set. The space
RT is the collection of real functions x = (x,) defined for t g Tf. In general
we shall be interested in the case when T is an uncountable subset of the real
t We shall also use the notations x = (x,),eRT and x = (x,), t e RT, for elements of RT.
148
II. Mathematical Foundations of Probability Theory
line. For simplicity and definiteness we shall suppose for the present that
T = [0, oo).
We shall consider three types of cylinder sets
Jtx ,n(/i x ••• x /„)= {x:xtlelli...,xtnell}, (H)
•*.i tn(Bi * • • • x Bn) = {x: xtl eBu...,xtne Bn}> (12)
^ tnm = {*'■ (xll9.... xj e Bnl (13)
where Ik is a set of the form (aki fyj, Bk is a Borel set on the line, and B" is a
Borel set in Rn.
The set Jtl tn (/1 x • • • x /n) is just the set of functions that, at times
*i»•■•»*!!» "get through the windows" Iu...,In and at other times have
arbitrary values (Figure 24).
Let @(RT\ &i(RT) and @2(RT) be tne smallest <r-algebras corresponding
respectively to the cylinder sets (11), (12) and (13). It is clear that
@(RT) c @y(RT) c @2(RT).
(14)
As a matter of fact, all three of these tr-algebras are the same. Moreover, we
can give a complete description of the structure of their sets.
Theorem 3. Let T be any uncountable set. Then @(RT) = ^(K7) = @2(RT\
and every set A e &(RT) has the following structure: there are a countable set of
points tut2i...ofTanda Borel set B in @(Rm) such that
A = {x:(xtl,xt2,...)eB}.
(15)
Proof. Let 8 denote the collection of sets of the form (15) (for various
aggregates (tu t2,...) and Borel sets B in ^(K00)). If Al9 A2i...e8 and the
corresponding aggregates are T(1) = tff\ t{2\...), T(2) = (t{?\ t{22\...),...,
A* V Vyy/
Figure 24
§2. Algebras and <r-Algebras. Measurable Spaces
149
then the set T(00) = (jk Tlk) can be taken as a basis, so that every Am has a
representation
At = {x:(xu,xX2,...)eBi},
where B{ is a set in one and the same cr-algebra ^(.R00), and xt e T(00).
Hence it follows that the system £ is a cr-algebra. Clearly this cr-algebra
contains all cylinder sets of the form (1) and, since @2(R ) ls tne smallest
cr-algebra containing these sets, and since we have (14), we obtain
@(RT) s a^R?) c @2(RT) c g. (16)
Let us consider a set A from <f, represented in the form (15). For a given
aggregate (tut2,...\ the same reasoning as for the space (Rm, &(Rm)) shows
that A is an element of the cr-algebra generated by the cylinder sets (11). But
this cr-algebra evidently belongs to the cr-algebra <%(RT)\ together with (16),
this established both conclusions of the theorem.
Thus every Borel set A in the cr-algebra @(RT) is determined by restrictions
imposed on the functions x = (x,), t e T9 on an at most countable set of points
tut2,... • Hence it follows, in particular, that the sets
Ax = {x: sup x, < C for all t e [0, 1]},
A2 = {x: xt = 0 for at least one t e [0,1]},
A3 = {x: x, is continuous at a given point t0 e [0,1]},
which depend on the behavior of the function on an uncountable set of points,
cannot be Borel sets. And indeed none of these three sets belongs to &(R10, l]).
Let us establish this for Ax. If Ay e &(R[0' l]\ then by our theorem there
are a point (t J, tj,...) and a set B° e @(R °°) such that
\x: sup xt<C,te [0, 1] I = {x: (x,?, x,s,...) e B0}.
It is clear that the function yt = C - 1 belongs to Alt and consequently
(j/,o,...) g B°. Now form the function
fC-1, fG(f?,f5,...),
Zt {C+1, t*(t?,tS,...).
It is clear that
(v,0,^5.---) = (ztf> ^5,...),
and consequently the function z = (zr) belongs to the set {x: (x,y, ...}eB0}.
But at the same time it is clear that it does not belong to the set {x: sup x, < C}.
This contradiction shows that Al $ @(Rl°-l]).
150
II. Mathematical Foundations of Probability Theory
Since the sets Au A2 and A3 are nonmeasurable with respect to the
<T-algebra @[Rl°-1]) in the space of all functions x = (x,), t e [0,1], it is
natural to consider a smaller class of functions for which these sets are
measurable. It is intuitively clear that this will be the case if we take the
intial space to be, for example, the space of continuous functions.
6. The measurable space (C, ^(C)). Let T = [0,1] and let C be the space of
continuous functions x = (xf), 0 < t ^ 1. This is a metric space with the
metric p(x,y) = supteT\xt - yt\. We introduce two <7-algebras in C:
^(C) is the <7-algebra generated by the cylinder sets, and ^0(C) is generated
by the open sets (open with respect to the metric p(x, y)). Let us show that in
fact these <7-algebras are the same: ^(C) = &0(C).
Let B = {x: xt0 < b} be a cylinder set. It is easy to see that this set is open.
Hence it follows that {x:xfl < blt...,xtn < bn) e^0(C), and therefore
^(C) c ^0(C).
Conversely, consider a set Bp = {y:ye Sp(x0)} where x° is an element of C
and Sp(x°) = {x e C: supf6T|xf — x,°| < p) is an open ball with center at
x°. Since the functions in C are continuous,
Bp = {yeC:yeSp(x0)} = \yeC: max \yt - x,°| < pi
= f| {y e C: \ytk - x°J < p) e ®{C\ (17)
where tk are the rational points of [0,1]. Therefore ^0(C) c ^(C).
The following example is fundamental.
7. The measurable space (Z), ^(Z))), where Z) is the space of functions x = (x,),
t e [0,1], that are continuous on the right (xf = xr+ for all t < 1) and have
limits from the left (at every t > 0).
Just as for C, we can introduce a metric d(x, y) on D such that the <r-algebra
@0(D) generated by the open sets will coincide with the cr-algebra &(D)
generated by the cylinder sets. This metric d(x, y), which was introduced
by Skorohod, is defined as follows:
d(x, y) = inf{£ > 0: 3 X e A: sup |xt — yMt) + sup \t - A(t)l < e}, (18)
t t
where A is the set of strictly increasing functions X = Mt) that are continuous
on [0,1] and have A(0) = 0, A(l) = 1.
8. The measurable space (flier ^. Ht6r-^)- Along with the space
(RT, @(RT)\ which is the direct product of T copies of the real line together
with the system of Borel sets, probability theory also uses the measurable
space (n*er ^t» Hter «^t)» which is defined in the following way.
§3. Methods of Introducing Probability Measures on Measurable Spaces
151
Let T be any set of indices and (Q„ ^t) a measurable space, t e T. Let
Q = YlteT ^t» the set of functions co = (cot\ t e T, such that cot e Q, for each
teT.
The collection of cylinder sets
Jtl tn(Bl x • • • x Bn) = {co:cotieBlt...,cotneBn},
where Bt.e^t{i is easily shown to be an algebra. The smallest tr-algebra
containing all these cylinder sets is denoted by {sj,eT !Fti and the measurable
space (Jl ^,-, H &t) is called the direct product of the measurable spaces
(ati<Ft),teT.
9. Problems
1. Let 3St and 362 be a-algebras of subsets of Q. Are the following systems of sets a-
algebras?
3SX n @2 = {A: A e38 x and A e 382},
3Stu3S2 = {A: A e3SL or A e 382}.
2. Let CJ> = {Dlt D2,...} be a countable decomposition of Q and 3$ = o(@)). Are there
also only countably many sets in 361
3. Show that
38{Rn) ® 3S(R) = 3S(Rn+1).
4. Prove that the sets (b)-(f) (see Subsection 4) belong to ^(K°°).
5. Prove that the sets A2 and A3 (see Subsection 5) do not belong to 3S(R10'l]).
6. Prove that the function (15) actually defines a metric.
7. Prove that 3S0{Rn) = 38{R"), n > 1, and ^0(^°°) = W°)-
8. Let C = C[0, co) be the space of continuous functions x = (xt) defined for t > 0.
Show that with the metric
P(x, y) = £ 2"n min sup \xt - yt\, 1 L x, y e C,
n=t LOSISn J
this is a complete separable metric space and that the a-algebra 3S0(C) generated by
the open sets coincides with the a-algebra 38(C) generated by the cylinder sets.
§3. Methods of Introducing Probability Measures
on Measurable Spaces
1. The measurable space (R, @{R)). Let P = P(A) be a probability measure
defined on the Borel subsets A of the real line. Take A = (— oo, x] and put
F(x) = P(-oo,x], xeR. (1)
152
II. Mathematical Foundations of Probability Theory
This function has the following properties:
(1) F(x) is nondecreasing;
(2) F(-oo) = 0, F(+oo) = 1, where
F(-oo) = lim F(x), F(+oo) = IimF(x);
xi -co xf oo
(3) F(x) is continuous on the right and has a limit on the left at each xe R.
The first property is evident, and the other two follow from the continuity
properties of probability measures.
Definition 1. Every function F = F(x) satisfying conditions (1)-(3) is called
a distribution function (on the real line R).
Thus to every probability measure P on (R, @{R)) there corresponds (by
(1)) a distribution function. It turns out that the converse is also true.
Theorem 1. Let F = F(x) be a distribution function on the real line R. There
exists a unique probability measure P on (R, &(R)) such that
P(a, b] = F(b) - F(£i) (2)
for all a, b, — oo < a < b < oo.
Proof. Let rf be the algebra of the subsets A of R that are finite sums of
disjoint intervals of the form (a, &]:
n
k=l
On these sets we define a set function P0 by putting
PoG4)= tlFih) - F(ak)l A erf. (3)
This formula defines, evidently uniquely, a finitely additive set function on &0.
Therefore if we show that this function is also countably additive on this
algebra, the existence and uniqueness of the required measure P on <%(R)
will follow immediately from a general result of measure theory (which we
quote without proof).
Caratheodory's Theorem. Let CI be a space, rf an algebra of its subsets, and
& = o{sf) the smallest a-algebra containing s#. Let /*0 be a a-additive measure
on (CI, A). Then there is a unique measure \i on (CI, o(&0)) which is an extension
of Hq , le. satisfies
KA) = ti0(A), A erf.
§3. Methods of Introducing Probability Measures on Measurable Spaces 153
We are now to show that P0 is countably additive on «c/. By a theorem
from §1 it is enough to show that P0 is continuous at 0, i.e. to verify that
P0G4n)|0, AA0, Anes/.
Let A i, A2, •.. be a sequence of sets from s# with the property An j 0. Let
us suppose first that the sets An belong to a closed interval [—AT, AG, N < oo.
Since A is the sum of finitely many intervals of the form (a, b] and since
P0(a\ b] = F(b) - F{a') -> F(b) - F{a) = P0(a, b]
as a' i a, because F(x) is continuous on the right, we can find, for every An,
a set Bn e si such that its closure [B J £ An and
Po{An)-P0{Bn)<s-2-\
where £ is a preassigned positive number.
By hypothesis, f]An = 0 and therefore f] [BJ = 0. But the sets [BJ
are closed, and therefore there is a finite n0 = n0(e) such that
f| IB J = 0. (4)
n = 1
(In fact, [—N, AG is compact, and the collection of sets {[—AT, AG\[.B„]}n£i
is an open covering of this compact set. By the Heine-Borel theorem there
is a finite subcovering:
n0
U([-N,N]\[Bn]) = [-N,N]
n=l
and therefore f]nn°=i LB J = 0).
Using (4) and the inclusions Ano ^ Ano-1 c...c^ljWe obtain
P,
lUno) = Po(^\ 0¾) + Po(fcQ^)
= Po^\ n **) ^ po(jjC^V**))
no no
< !PoUA^)< Ie-2-fc<£.
fc=i fc=i
Therefore Po04„) I 0, n -> oo.
We now abandon the assumption that An ^ [—AT, N] for some N. Take
an £ > 0 and choose N so that P0[-N, N] > 1 - £/2. Then, since
An = Ann l-N, N-] + Ann [-AT, N],
we have
PoUn) = PoKC-N. AT] + P0(^n n l-N, AG)
< P0(4. n [-AT, AT]) + £/2
154
II. Mathematical Foundations of Probability Theory
and, applying the preceding reasoning (replacing An by An n [—N, N]), we
find that P0(An n [-N, NJ) < e/2 for sufficiently large n. Hence once again
P0(>4n) i 0, n -> oo. This completes the proof of the theorem.
Thus there is a one-to-one correspondence between probability measures
P on (R, @(R)) and distribution functions F on the real line R. The measure
P constructed from the function F is usually called the Lebesgue-Stieltjes
probability measure corresponding to the distribution function F.
The case when
10, x < 0,
x, 0 < x < 1,
1, x>l.
is particularly important. In this case the corresponding probability measure
(denoted by X) is Lebesgue measure on [0, 1]. Clearly A(a, b] = b — a. In
other words, the Lebesgue measure of (a, b~\ (as well as of any of the intervals
{a, b), [a, b] or [a, b)) is simply its length b — a.
Let
^([0, 1]) = {A n [0,1]: A e @(R)}
be the collection of Borel subsets of [0,1]. It is often necessary to consider,
besides these sets, the Lebesgue measurable subsets of [0,1]. We say that a
set A c [0,1] belongs to ^([0,1)] if there are Borel sets A and B such that
A c A c B and A(B\A) = 0. It is easily verified that ^([0, 1]) is a <r-algebra.
It is known as the system of Lebesgue measurable subsets of [0, 1]. Clearly
^([0, 1]) c J([0, 1]).
The measure k, defined so far only for sets in ^([0,1]), extends in a
natural way to the system ^([0, 1]) of Lebesgue measurable sets. Specifically,
if Ae <f([0,1]) andicAcfi, where A and Be J([0, l])_and t(B\A) = 0,
we define ^(A) = %A). The set function I = X(A), A e ^([0,1]), is easily
seen to be a probability measure on ([0,1], ^([0,1])). It is usually called
Lebesgue measure (on the system of Lebesgue-measurable sets).
Remark. This process of completing (or extending) a measure can be applied,
and is useful, in other situations. For example, let (Q, «^", P) be a probability
space. Let «^p be the collection of all the subsets A of Q for which there are
sets Bx and B2 of & such that By^ A^B2 and P(B2\By) = 0. The
probability measure can be defined for sets A e #p in a natural way (by P(A) =
P(2*i)). The resulting probability space is the completion of (Q, ^", P) with
respect to P.
A probability measure such that #p = #" is called complete, and the
corresponding space (Q, ^", P) is a complete probability space.
The correspondence between probability measures P and distribution
functions F established by the equation P(a, b~\ = F(b) — F(a) makes it
§3. Methods of Introducing Probability Measures on Measurable Spaces 155
m
i
jAFfa)
i
,W2)
1
1
1—
—►
*2 *3
Figure 25
possible to construct various probability measures by obtaining the
corresponding distribution functions.
Discrete measures are measures P for which the corresponding
distributions F = F{x) are piecewise constant (Figure 25), changing their values
at the points xltx2,... (AF(x.) > 0, where AF(x) = F(x) - F(x-). In
this case the measure is concentrated at the points xlt x2,...:
P({xfc}) = AF(xk) > 0, £ P({xfc}) = 1.
k
The set of numbers (plt p2,.. .)> where pk = P({xfc}), is called a discrete
probability distribution and the corresponding distribution function F = F(x)
is called discrete.
We present a table of the commonest types of discrete probability
distribution, with their names.
Table 1
Distribution Probabilities pk Parameters
Discrete uniform 1//V, k= 1,2,...,N N = 1,2,...
Bernoulli Pi=P»Po = 9 0<p<l,q=l—p
Binomial Cknf^qn~k, k = 0, 1,...,n 0 < p < I, q = 1 - p,
n = 1,2,...
Poisson e~k/k\, k = 0,l,... A>0
Geometric q*~ lp, k = 0, 1,... 0 < p < 1, q = 1 - p
Negative binomial QlJpV"', k = r,r + 1,... 0 < p < I, q = 1 — p,
r=l,2,...
Absolutely continuous measures. These are measures for which the
corresponding distribution functions are such that
*•(*)= f f(t)dt,
J — oo
(5)
156
II. Mathematical Foundations of Probability Theory
where f = f(t) are nonnegative functions and the integral is at first taken in
the Riemann sense, but later (see §6) in that of Lebesgue.
The function f = /(x), x e R, is the density of the distribution function
F = F(x) (or the density of the probability distribution, or simply the density)
and F = F(x) is called absolutely continuous.
It is clear that every nonnegative f = /(x) that is Riemann integrable and
such that ^^/(x) dx = 1 defines a distribution function by (5). Table 2
presents some important examples of various kinds of densities f = f(x)
with their names and parameters (a density/(x) is taken to be zero for values
of x not listed in the table).
Table 2
Distribution
Uniform on [a, b~\
Normal or Gaussian
Gamma
Beta
Exponential (gamma
with a = 1, P = 1/A)
Bilateral exponential
Chi-squared, x*
(gamma with a
a = n/2tp = 2)
Student, t
F
Cauchy
Density
l/(b -a), a<x<b
(27^)-^-^-^^, xeR
x"-le-^
iw • x~°
xr-\i-xy-1 ft
—i-—e—- o<x<i
B(r,s)
Xe'**, x>0
ik-*W xeR
2 - n/2xn/2 - 1^-^/2^(^), X > 0
(W^r(w/2)(1+TJ ' xeR
{m/nf1'2 x""2"1
B(wi/2, «/2) (1 + mx/n)lm+n)l2
e
n(x2 + e2y
Parameters
a,beR; a<b
m e R, a > 0
a > 0, P > 0
r > 0, s > 0
A>0
A>0
« = 1,2,...
«=1,2,...
m, « = 1,2,...
0>O
Singular measures. These are measures whose distribution functions are
continuous but have all their points of increases on sets of zero Lebesgue
measure. We do not discuss this case in detail; we merely give an example of
such a function.
§3. Methods of Introducing Probability Measures on Measurable Spaces 157
Figure 26
We consider the interval [0,1] and construct F(x) by the following
procedure originated by Cantor.
We divide [0,1] into thirds and put (Figure 26)
fi *e(if),
F2(x) =
defining it in the intermediate intervals by linear interpolation.
Then we divide each of the intervals [0, |] and [f, 1] into three parts and
define the function (Figure 27) with its values at other points determined by
linear interpolation.
h
I
0,
1,
x e (i fi,
xefi.fi.
x = 0,
x = l
Figure 27
158
II. Mathematical Foundations of Probability Theory
Continuing this process, we construct a sequence of functions F„(x),
n = 1, 2,..., which converges to a nondecreasing continuous function F(x)
(the Cantor function), whose points of increase (x is a point of increase of F(x)
if F(x + e) - F(x - £) > 0 for every £ > 0) form a set of Lebesgue measure
zero. In fact, it is clear from the construction of F(x) that the total length of
the intervals (3, §), (J, §), (J, |),... on which the function is constant is
12 4 1 " (2\n , ,^
Let Jf be the set of points of increase of the Cantor function F(x). It
follows from (6) that ^JT) = 0. At the same time, if \i is the measure
corresponding to the Cantor function F(x), we have ix(N) = 1. (We then say
that the measure is singular with respect to Lebesgue measure X.)
Without any further discussion of possible types of distribution functions,
we merely observe that in fact the three types that have been mentioned cover
all possibilities. More precisely, every distribution function can be represented
in the form pjFj 4- p2F2 + P3F3, where Fx is discrete, F2 is absolutely
continuous, and F3 is singular, and p£ are nonnegative numbers, px 4- p2 +
p3 = l.
2. Theorem 1 establishes a one-to-one correspondence between probability
measures on (R, &(R)) and distribution functions on R. An analysis of the
proof of the theorem shows that in fact a stronger theorem is true, one that in
particular lets us introduce Lebesgue measure on the real line.
Let fi be a <r-finite measure on (Q, sf), where $2 is an algebra of subsets of
Q. It turns out that the conclusion of Caratheodory's theorem on the
extension of a measure and an algebra $0 to a minimal a-algebra o(s#) remains
valid with a tr-finite measure; this makes it possible to generalize Theorem 1.
A Lebesgue-Stieltjes measure on (R, &B(R)) is a (countably additive)
measure \i such that the measure /*(/) of every bounded interval I is finite.
A generalized distribution function on the real line R is a nondecreasing
function G = G(x), with values on (—00, 00), that is continuous on the right.
Theorem 1 can be generalized to the statement that the formula
/i(fl, b] = G(b) - G(a), a < b,
again establishes a one-to-one correspondence between Lebesgue-Stieltjes
measures \i and generalized distribution functions G.
In fact, if G(+ 00) — G(—00) < 00, the proof of Theorem 1 can be taken
over without any change, since this case reduces to the case when G(+ 00) —
G(-oo) = 1 and G(-oo) = 0.
Now let G(+oo) - G(-oo) = 00. Put
|G(x), |x|<n,
Gn(x) = lG(n) x = «,
(G(-h), x = -n.
§3. Methods of Introducing Probability Measures on Measurable Spaces 159
On the algebra s# let us define a finitely additive measure /*0 such that
ti0(a, b] = G(b) — G(a), and let /*„ be the finitely additive measure previously
constructed (by Theorem 1) from G„(x).
Evidently /*„ J /i0 on $£. Now let Au A2,... be disjoint sets in $£ and
A = £ i4B e si. Then (Problem 6 of §1)
t*o(A)> 2>oU„).
n= 1
ff Z^°= l A*oG4„) = °o then fi0(A) = ]£?L, fi0(An). Let us suppose that
X>oG4„) < oo. Then
oo
lx0(A) = lim /in(^) = lim £ /i^).
n n fc=l
By hypothesis, £ ^oO^n) < °°. Therefore
0 < lidA) - £ /i004*) = lim [ f (/i„04fc) - /i0C4*))l < 0,
fc=l n [_k=l J
since/in<ju0.
Thus a <T-finite finitely additive measure /*0 is countably additive on si,
and therefore (by Caratheodory's theorem) it can be extended to a countably
additive measure \i on a(si).
The case G(x) = x is particularly important. The measure X corresponding
to this generalized distribution function is Lebesgue measure on (R, @(R)).
As for the interval [0,1] of the real line, we can define the system §8(R) by
writing A e §8(R) if there are Borel sets A and B such that A ^ A c b,
A{B\A) = 0. Then Lebesgue measure 1 on @{R) is defined by 1(A) = A(/l)
if/4cAcB,Ae <%(R) and A(B\/1) = 0.
3. The measurable space (Rn, @l(Rn). Let us suppose, as for the real line, that
P is a probability measure on (Rn, @(Rn).
Let us write
Fn(Xi, ..., xn) = P((-oo, xj x • • • x (-oo, xj),
or, in a more compact form,
F„(x) = P(-oo,x],
wherex = (xu...,xn), (—oo, x] = (— oo, xj x ••• x (—oo, xj.
Let us introduce the difference operator Aa.bt: Rn ->R, defined by the
formula
4.,.6,^1. • • • , *n) = F„(xl9 • • • , *i- 1, K *i+ 1 • • •)
-Fn(x1,...,xl_1,fl£,x£+1...)
160
II. Mathematical Foundations of Probability Theory
where at < fe,-. A simple calculation shows that
Afllbl...AflnbnFn(x1...xn) = P(a,fe], (7)
where (a, b] = (alt fej x • • • x (an, fej. Hence it is clear, in particular, that
(in contrast to the one-dimensional case) P(a, b] is in general not equal to
FJJb) - Fn(a).
Since P(a, b] > 0, it follows from (7) that
Afllbl"-AflnbnFn^i.---^n)>0 (8)
for arbitrary a = (au..., an), b = (bu..., bn).
It also follows from the continuity of P that Fn(x„ ..., x„) is continuous
on the right with respect to the variables collectively, i.e. if x{k) [ x, x(fc) =
(xf >,..., x!f>),±en
F„(x(k,)lF„(4 *->oo. (9)
It is also clear that
Fn(+oo,..., +oo) = l (10)
and
HmFn(x1,...,xn) = 0, (11)
xly
if at least one coordinate Of y is — oo.
Definition 2. An n-dimensional distribution function (on Rn) is a function
F = F(Xi,..., xn) with properties (8)-(11).
The following result can be established by the same reasoning as in
Theorem 1.
Theorem 2. Let F = F„(xlt ...,xn)bea distribution function on R". Then there
is a unique probability measure P on (Rn, <%(Rn)) such that
P(a, b] = Aaibl • • • Aanf,nFn(xu..., x„). (12)
Here are some examples of n-dimensional distribution functions.
Let F1,..., F" be one-dimensional distribution functions (on R) and
Fn(x1,...,x„) = F1(x1)...F"(xn).
It is clear that this function is continuous on the right and satisfies (10) and
(11). It is also easy to verify that
Afllbl • • • AflnbnFn(*i> • • •, xn) = J] LF\bk) - Fk(akJ] > 0.
Consequently Fn(xu..., x„) is a distribution function.
§3. Methods of Introducing Probability Measures on Measurable Spaces 161
The case when
10 xk < 0,
xk, 0<xfc<l,
1, xk > 1
is particularly important. In this case
Fn(x1,...,xn) = x1---xn.
The probability measure corresponding to this n-dimensional distribution
function is n-dimensional Lebesgue measure on [0, 1]".
Many n-dimensional distribution functions appear in the form
Fn(x1,...,xn) = p .-. p fn(t1,...,tn)dtl---dtn,
where fn(tu..., O is a nonnegative function such that
/•co /»oo
J ...J fn(t1,...,tn)dtl---dtn = i,
and the integrals are Riemann (more generally, Lebesgue) integrals. The
function / = fn(tlt..., tn) is called the density of the n-dimensional
distribution function, the density of the n-dimensional probability distribution,
or simply an n-dimensional density.
When n = 1, the function
f(x) = —\—e-^-m^2a\ xeR,
Oy/27Z
with a > 0 is the density of the (nondegenerate) Gaussian or normal
distribution. There are natural analogs of this density when n > 1.
Let U = ||rv|| be a nonnegative definite symmetric n x n matrix:
n
£ rijkikj > 0, /lj e R, i = 1,..., n, rtJ = rfi.
When U is a positive definite matrix, | U | = det U > 0 and consequently there
is an inverse matrix A = ||fl0||.
\A\112
fn(xu..., xn) = -—72 exp{ -i £ fl0<Xi - niiXx; - w£)}, (13)
where mteR,i = 1,...,n, has the property that its (Riemann) integral over
the whole space equals 1 (this will be proved in §13) and therefore, since it is
also positive, it is a density.
This function is the density of the n-dimensional (nondegenerate) Gaussian
or normal distribution (with vector mean m = (mu..., mn) and co variance
matrix R = ^4-1).
162 II. Mathematical Foundations of Probability Theory
Figure 28. Density of the two-dimensional Gaussian distribution.
When n = 2 the density f2(xlt x2) can be put in the form
2-110^02^/1 — p
f J f(*i - ™i)2
xeM-2-o37)[—^—
0 (Xi - WjXxz - m2) , (x2 - w2)2
— 2p 1 2
Ox02 ^2
where ot > 0, |p| < 1. (The meanings of the parameters mt, o( and p will be
explained in §8.)
Figure 28 indicates the form of the two-dimensional Gaussian density.
Remark. As in the case n = 1, Theorem 2 can be generalized to (similarly
defined) Lebesgue-Stieltjes measures on (Rn, &(Rn)) and generalized
distribution functions on Rn. When the generalized distribution function
G„(xlt..., x„) is Xi • • • x„, the corresponding measure is Lebesgue measure
on the Borel sets of Rn. It clearly satisfies
Ha,®- n (&,-«,),
i=l
i.e. the Lebesgue measure of the "rectangle"
(a, U] = {alib{\ x •• x (fln,fej
is its "content."
4. The measurable space (R00, @(Rm)). For the spaces Rn, n > 1, the proba-
ability measures were constructed in the following way: first for elementary
sets (rectangles (a, bj), then, in a natural way, for sets A = £ (ai9 fe,], and
finally, by using Caratheodory's theorem, for sets in @(Rn).
A similar construction for probability measures also works for the space
(K00, @(Rm)).
]} (14)
§3. Methods of Introducing Probability Measures on Measurable Spaces 163
Let
Jfn(B) = {xeRm: (xu ...,xn)eB}, Be @(Rn\
denote a cylinder set in Rm with base B e @(Rn). We see at once that it is
natural to take the cylinder sets as elementary sets in Rm, with their
probabilities defined by the probability measure on the sets of <%(R°°).
Let P be a probability measure on (Rm, @(Rm)). For n = 1, 2,..., we
take
Pn(B) = P(J?n(B)), B e ®(Rn\ (15)
The sequence of probability measures PltP2i... defined respectively on
(R, @(R)\ (R2, @{R2)\..., has the following evident consistency property:
for n = 1, 2,... and B e @(Rn),
Pn+l(BxR) = Pn(B). (16)
It is noteworthy that the converse also holds.
Theorem 3 (Kolmogorov's Theorem on the Extension of Measures in
(R°°, &(Rm))). Let Pu P2,... be a sequence of probability measures on
(R, 38{R)\ (R2, @(R2)),..., possessing the consistency property (16). Then
there is a unique probability measure P on (J1?00, @{Rm)) such that
P{Jfn{B)) = Pn(B\ B e <%(R"). (17)
for n = 1, 2,... .
Proof. Let Bn e @(Rn) and let Jn(Bn) be the cylinder with base B". We assign
the measure P(Sn(Bn)) to this cylinder by taking P(Sn(Bn)) = Pn(Bn).
Let us show that, in virtue of the consistency condition, this definition is
consistent, i.e. the value of P{J>n{Bn)) is independent of the representation of
the set Jn{Bn). In fact, let the same cylinder be represented in two way:
Jfn(Bn) = Jfn+k(Bn+k).
It follows that, if (xi,..., xn+fc) e Rn+\ we have
(xlf..., x„) e Bn o (xlf..., xn+fc) e Bn+k, (18)
and therefore, by (16) and (18),
Pn(B") = Pn+1((x1,...,xn+1):(x1,...,xn)eB")
= • • • = Pn+k((xl9..., xn+k):(xu..., xn)eBn)
= Pn+k(Bn+k).
Let sf(Rm) denote the collection of all cylinder sets Bn = Jn(Bn\ Bne@(Rn),
n = 1, 2,... .
164
II. Mathematical Foundations of Probability Theory
Now let Bu..., Bk be disjoint sets in sf(R°°). We may suppose without loss
of generality that Bt = >n(B"), i = 1,..., K for some n, where Bnu..., Bl are
disjoint sets in ^(Kn). Then
p(£ft)=p(|/"W>)=^(t/*)=^Pn{B?)=liPA)'
i.e. the set function P is finitely additive on the algebra sf(R°°).
Let us show that P is "continuous at zero," i.e. if the sequence of sets
£„ i 0» 1-+00, then P(Bn) -»• 0, n -► oo. Suppose the contrary, i.e. let
lim P(Bn) = 3 > 0. We may suppose without loss of generality that {Bn}
has the form
Bn={x:(x1,...,xn)eBn}, Bne<2(Rn).
We use the following property of probability measures Pn on (Rn, @(Rn))
(see Problem 9): if Bne@(Rn\ for a given 3 > 0 we can find a compact set
An e @(Rn) such that An c Bn and
Therefore if
we have
Pn(Bn\An)<5/2»+l.
, = {x:(xl9...,xJeA„},
P(Bn\A,d = Pn(Bn\An)<d/2n+1.
Form the set €„= f)k=l Ak and let C„ be such that
Cn={x:(Xl,...,Xn)eCn}.
Then, since the sets Bn decrease, we obtain
P(Bn\Cn) < £ P(Bn\Ak) < t ?(Bk\Ak) < 3/2.
k=l fc=l
But by assumption lim„ P(fln) = 3 > 0, and therefore limn P(€n) > 3/2 > 0.
Let us show that this contradicts the condition Cn J 0.
Let us choose a point x(n) = (x?*, x(2n),...) in Cn. Then (xj\..., x|In)) e Cn
for n > 1.
Let («i) be a subsequence of (n) such that x?0 -► xj, where xj is a point
in Cx. (Such a sequence exists since xf* e Cx and Cn is compact.) Then select
a subsequence (n2) of (ny) such that (x?2*, x(2n2>) -► (xj, x?) e C2. Similarly let
(x(1nk),...,xi£n,'))-^(xJ,...,x2)eCfc. Finally form the diagonal sequence
(mk), where mk is the frth term of (nk). Then xjmk) -► x? as mfc -► oo for i = 1,2,... ;
and(x°, x2,...) e £„forn = 1,2,... .which evidently contradicts the
assumption that Cn i 0, n -+ oo. This completes the proof of the theorem.
§3. Methods of Introducing Probability Measures on Measurable Spaces 165
Remark. In the present case, the space R00 is a countable product of lines,
R°° = R x R x • — . It is natural to ask whether Theorem 3 remains true if
(R00, &(Rm)) is replaced by a direct product of measurable spaces (Q£l #",),
f = 1, 2,
We may notice that in the preceding proof the only topological property
of the real line that was used was that every set in &(Rn) contains a compact
subset whose probability measure is arbitrarily close to the probability
measure of the whole set. It is known, however, that this is a property not only
of spaces (Rn, &(R"), but also of arbitrary complete separable metric spaces
with <T-algebras generated by the open sets.
Consequently Theorem 3 remains valid if we suppose that Pu P2,... is a
sequence of consistent probability measures on (Qlt #^),
where (Q£, J^) are complete separable metric spaces with tr-algebras ^
generated by open sets, and (Rm, @(Rm)) is replaced by
(Qi x Q2 x •' •, &i ® ^2 ® • • •)-
In §9 (Theorem 2) it will be shown that the result of Theorem 3 remains
valid for arbitrary measurable spaces (Q£, ^ if the measures Pn are
concentrated in a particular way. However, Theorem 3 may fail in the general
case (without any hypotheses on the topological nature of the measurable
spaces or on the structure of the family of measures {Pn}). This is shown by
the following example.
Let us consider the space Q = (0,1], which is evidently not complete, and
construct a sequence #i c ^2 c ... of tr-algebras in the following way. For
n = 1, 2,..., let
fl, 0 < co < 1/n,
^> = {o, l/*<eo<l,
<gn = {A e Q: A = {co: <pn(co) eB},Be@(R)}
and let !Fn = <t{#i,...,#„} be the smallest <r-algebra containing the sets
#!,..., #„. Clearly#i c ^2 c .... Let & = a((j &Q be the smallest
<T-algebra containing all the &n. Consider the measurable space (0,¾
and define a probability measure Pn on it as follows:
P.{«:(9l(a,X...,^«,))eir} = {J ^^
where Bn = &(R").'It is easy to see that the family {Pn} is consistent: if
Ae!Fn then Pn+ X(A) = Pn(A). However, we claim that there is no probability
measure P on (Q, !F) such that its restriction P\!Fn (i.e., the measure P
166
II. Mathematical Foundations of Probability Theory
considered only on sets in ^) coincides with Pn for n = 1, 2, In fact, let
us suppose that such a probability measure P exists. Then
P{co: (Piico) = • • • = <pn(co) = 1} = Pn{co: (p^coi) = • • • = <pn(co) = 1} = 1
(19)
for n = 1,2,... . But
{co: v^co) = • • • = <p„(co) = 1} = (0, 1/n) j 0,
which contradicts (19) and the hypothesis of countable additivity (and
therefore continuity at the "zero" 0) of the set function P.
We now give an example of a probability measure on (R00, ^(R00)). Let
F^x), F2(x),... be a sequence of one-dimensional distribution functions.
Define the functions G(x) = F^x), G2(xu x2) = F1(x1)F2(x2),..., and denote
the corresponding probability measures on (R, ^(R)), (R2, ^(R2)), • • • by
Pu P2, Then it follows from Theorem 3 that there is a measure P on
(R00, ^(R00)) such that
F{x e R00: (xlf ...,xJeB} = Pn(B), B e ^(Rn)
and, in particular,
P{x e R00: xt < alf..., x„ <: «„} = F^) • • • Fn(an).
Let us take F,(x) to be a Bernoulli distribution,
10, x<0,
q, 0 < x < 1,
1, x > 1.
Then we can say that there is a probability measure P on the space Q of
sequences of numbers x = (xls x2,.. .),X; = Oor 1, together with the <r-algebra
of its Borel subsets, such that
P{x: x, = au..., x„ = a J = pS V"Zfl'.
This is precisely the result that was not available in the first chapter for
stating the law of large numbers in the form (1.5.8).
5. The measurable space (RT, ^(RT)). Let T be a set of indices t e T and Rt a
real line corresponding to the index t. We consider a finite unordered set
t = [tl9..., t„] of distinct indices tit t( eT,n> 1, and let Px be a probability
measure on (R\ ^(R1)), where RT = Rtl x • • • x Rtn.
We say that the family {Px} of probability measures, where t runs through
all finite unordered sets, is consistent if, for all sets t = [tu..., t„] and
a = [su..., sj such that trctwe have
faKr • • >*J:(xSl,. • •,xSt) eB} = Pt{(xtl,...,xtn):(xSl,...,xSk) e B}
(20)
for every B e ^(R°).
§3. Methods of Introducing Probability Measures on Measurable Spaces
167
Theorem 4 (Kolmogorov's Theorem on the Extension of Measures in
(RT, <%{RT))). Let {PT} be a consistent family of probability measures on
(R\ @(RX)). Then there is a unique probability measure P on (RT, @(RT)) such
that
P(J?X(B)) = PX(B) (21)
for all unordered sets x = [tlt..., tn~\ of different indices tt eT,Be @(RX) and
J?t(B) = {xeRT:(xtl,...,xtn)eB}.
Proof. Let the set B e 38{RJ\ By the theorem of §2 there is an at most
countable set S = {su s2,...)- c t such that B = {x:(xSl, xS2,...)eB}, where
B e @{RS\ Rs = RSlx RS2 x • • •. In other words, B = J^S(B) is a cylinder
set with base B e @(RS).
We can define a set function P on such cylinder sets by putting
PWS)) = Ps(B), (22)
where Ps is the probability measure whose existence is guaranteed by
Theorem 3. We claim that P is in fact the measure whose existence is asserted
in the theorem. To establish this we first verify that the definition (22) is
consistent, i.e. that it leads to a unique value of P(B) for all possible
representations of B; and second, that this set function is countably additive.
Let 8 = JSl(Bi) and B = JS:i{B2). It is clear that then B = SSlus2(B3)
with some B3 e@(RSl<jS2); therefore it is enough to show that if S c S'
and B e @(RS\ then PS(B') = P^B\ where
B' = {(xsl, xS2,.. .):(xSl ,xS2,...)eB}
with S' = {s'i, s'2,...}, S = {slt s2,.. .}• But by the assumed consistency of
(20) this equation follows immediately from Theorem 3. This establishes that
the value of P(B) is independent of the representation of B.
To verify the countable additivity of P, let us suppose that {Bn} is a sequence
of pairwise disjoint sets in &8{RT). Then there is an at most countable set
S c t such that Bn = J^s(#n) for all n > 1, where Bn e @(RS). Since Ps
is a probability measure, we have
P(E Bn) = P£ /» = PS(Z Bn) = Z Ps(Bn)
= EP(/s(£„)) = EP(£„).
Finally, property (21) follows immediately from the way in which P was
constructed.
This completes the proof.
Remark 1. We emphasize that T is any set of indices. Hence, by the remark
after Theorem 3, the present theorem remains valid if we replace the real
lines Rt by arbitrary complete separable metric spaces Qi (with tr-algebras
generated by open sets).
168
II. Mathematical Foundations of Probability Theory
Remark 2. The original probability measures {PT} were assumed defined
on unordered sets t = [tlt..., tj of different indices. It is also possible to
start from a family of probability measures {Px} where t runs through all
ordered sets t = (t,,..., tn) of different indices. In this case, in order to have
Theorem 4 hold we have to adjoin to (20) a further consistency condition:
PUl ,JLAtl x ... x AJ = PlUi tin)(Atii x ... x Ati\ (23)
where (i,..., i„) is an arbitrary permutation of (1,..., n) and 4r. e &(Rti). As
a necessary condition for the existence of P this follows from (21) (with
P[tl tnffi replaced by Pltl tn)(B)).
From now on we shall assume that the sets t under consideration are
unordered. If T is a subset of the real line (or some completely ordered set),
we may assume without loss of generality that the set t = [tu...,£j
satisfies tt < t2 < ••• < tn. Consequently it is enough to define "finite-
dimensional" probabilities only for sets t = [tlt...,tj for which tx <
t2<-'<tn.
Now consider the case T = [0, oo). Then RT is the space of all real
functions x = (xr)r>0. A fundamental example of a probability measure on
(R[0-m), @(Rl°- ^)) is Wiener measure, constructed as follows.
Consider the family {<p,0>|x)},>0 of Gaussian densities (as functions of y
for fixed x):
<pt(y\x) = -4= e~ly-x)2l2t, yeR,
y/27Zt
and for each z = [tu..., tj, tx < t2 < • • • < tn, and each set
B = IX x .. x /n, Ik = (ak,bk),
construct the measure PX(B) according to the formula
/>,(/! x...x/„)
= J ••• J ^,(^110)^-^2^1)---^-,,,-^1^-1) ^i" dan (24)
(integration in the Riemann sense). Now we define the set function P for each
cylinder set Ju... J^ x • • • x /n) = {xeRT: xh e/, x,ne/„} by taking
P(>fl...rn(/i x .. x In)) = P[tl...tn](Ii x • x U
The intuitive meaning of this method of assigning a measure to the cylinder
set -?u ...tA}\ x • • • x /n) is as follows.
The set Jtx ...tn(/i x • • • x /„) is the set of functions that at times tu..., tn
pass through the '* windows nIl9...,I„ (see Figure 24 in §2). We shall interpret
§3. Methods of Introducing Probability Measures on Measurable Spaces 169
(Ptk-tk-S.cikWk-\) as the probability that a particle, starting at ak-x at time
h — h-1» arrives in a neighborhood of ak. Then the product of densities that
appears in (24) describes a certain independence of the increments of the
displacements of the moving "particle" in the time intervals
The family of measures {PT} constructed in this way is easily seen to be
consistent, and therefore can be extended to a measure on (Rl0j m\ &(R10'm))).
The measure so obtained plays an important role in probability theory. It
was introduced by N. Wiener and is known as Wiener measure.
6. Problems
1. Let F(x) = P(—oo, x]. Verify the following formulas:
P(a, b] = F(b) - F(a\ P(a, b) = F(b-) - F(a\
P[a, b] = F(b) - F(a-\ P[a, b) = F(b-) - F{a-\
P{x) = F(x) - F(x-),
where F(x - ) = limy T x F(y).
2. Verify (7).
3. Prove Theorem 2.
4. Show that a distribution function F = F(x) on jR has at most a countable set of
points of discontinuity. Does a corresponding result hold for distribution functions
on R"'!
5. Show that each of the functions
G(x, y) = [x + y], the integral part of x + y,
is continuous on the right, and continuous in each argument, but is not a (generalized)
distribution function on jR2.
6. Let \l be the Lebesgue-Stieltjes measure generated by a continuous distribution
function. Show that if the set A is at most countable, then y(A) = 0.
7. Let c be the cardinal number of the continuum. Show that the cardinal number of the
collection of Borel sets in R" is c, whereas that of the collection of Lebesgue
measurable sets is 2C.
8. Let (Q, SF^ P) be a probability space and st an algebra of subsets of Q such that
o{sf) = 3F. Using the principle of appropriate sets, prove that for every e > 0 and
B e SF there is a set A e si such that
P(AAB)<e.
170
II. Mathematical Foundations of Probability Theory
9. Let P be a probability measure on (Kn, @{R")). Using Problem 8, show that, f<
every e > 0 and B e @{Rn\ there is a compact subset A of^(R") such that A £
and
P(B\A) < e.
(This was used in the proof of Theorem 1.)
10. Verify the consistency of the measure defined by (21).
§4. Random Variables. I
1. Let (Q, &) be a measurable space and let (R, @(R)) be the real line with
the system <%(R) of Borel sets.
Definition 1. A real function £ = £(co) defined on (Q, F) is an ^-measurable
function, or a random variable, if
{co:t(a))eB}e& (1)
for every B e @(R)\ or, equivalently, if the inverse image
r!(B)s{fl):«a))eB}
is a measurable set in Q.
When (Q, ^) = (Rn, @{Rn% the ^(immeasurable functions are called
Borel functions.
The simplest example of a random variable is the indicator IA(co) of an
arbitrary (measurable) set /4 e #".
A random variable £ that has a representation
tfo>) = I xtl^co), (2)
where £ At = Q, /1£ e J5", is called discrete. If the sum in (2) is finite, the
random variable is called simple.
With the same interpretation as in §4 of Chapter I, we may say that a
random variable is a numerical property of an experiment, with a value
depending on "chance." Here the requirement (1) of measurability is
fundamental, for the following reason. If a probability measure P is defined on
(Q, #"), it then makes sense to speak of the probability of the event (£(co) e B}
that the value of the random variable belongs to a Borel set B.
We introduce the following definitions.
Definition 2. A probability measure P4 on (R, &(R)) with
P£B) = P{co: £(co) eB}, Be @(R),
is called the probability distribution of £ on (R, @l(R)).
§4. Random Variables. I
171
Definition 3. The function
F£x) = P(co: £(co) < x}, xeR,
is called the distribution function of £.
For a discrete random variable the measure P% is concentrated on an at
most countable set and can be represented in the form
P,(B)= I p(xfc), (3)
lk:xkeB)
where p(xk) = P{£ = xk} = AF^(xfc).
The converse is evidently true: If P4 is represented in the form (3) then £
is a discrete random variable.
A random variable £ is called continuous if its distribution function F^x)
is continuous for xeR.
A random variable £ is called absolutely continuous if there is a nonnegative
function f = f^(x), called its density, such that
F,(x)= f fjy)dy, xeR, (4)
(the integral can be taken in the Riemann sense, or more generally in that of
Lebesgue; see §6 below).
2. To establish that a function £ = £(co) is a random variable, we have to
verify property (1) for all sets Bef. The following lemma shows that the
class of such "test" sets can be considerably narrowed.
Lemma 1. Let £ be a system of sets such that o($) = &(R). A necessary and
sufficient condition that a function £ = £(co) is ^-measurable is that
{co:£(co)eE}e& (5)
for all E e£.
Proof. The necessity is evident. To prove the sufficiency we again use the
principle of appropriate sets.
Let 3) be the system of those Borel sets D in @(R) for which f ~ \D) e &.
The operation "form the inverse image" is easily shown to preserve the set-
theoretic operations of union, intersection and complement:
r,(yu.)-y«-,(B.x
rl(0B°) = 0r,(Ba)' (6)
172
II. Mathematical Foundations of Probability Theory
It follows that S> is a <r-algebra. Therefore
S c S) c @(R)
and
<r(<f) c CT(^)) = Q) c ^(R).
But <r(£) = @(R) and consequently 0 = @(R).
Corollary. A necessary and sufficient condition for £ = £(co) to be a random
variable is that
{co:£(co) <x}e&r
for every xeRtor that
{co:£(co) <ix}e&r
for every xeR.
The proof is immediate, since each of the systems
Sx = {x:x < c,ceR},
g2 = {x:x< c,ceR}
generates the <r-algebra @(R): aiEx) = <r(E2) = @(R) (see §2).
The following lemma makes it possible to construct random variables as
functions of other random variables.
Lemma 2. Let <p = <p(x) be a Bor el function and £ = £(co) a random variable.
Then the composition rj = <p <> £, le. the function rj(co) = <p(£(co))t is also a
random variable.
The proof follows from the equations
{co:rj(co)eB} = {ft):<p(MeB} = {co: £(co) e (p~l(B)} e& (7)
for B e @(R\ since <p~ *(B) e @(R).
Therefore if ^ is a random variable, so are, for examples, £", £ + = max(£, 0),
£~ = — min(£, 0), and |£|, since the functions x", x+, x~ and |x| are Borel
functions (Problem 4).
3. Starting from a given collection of random variables {£„}, we can construct
new functions, for example, ^°=1 |£fc|, Km £„, lim £„, etc. Notice that in
general such functions take values on the extended real line R = [—oo, oo].
Hence it is advisable to extend the class of ^-measurable functions somewhat
by allowing them to take the values + oo.
§4. Random Variables. I
173
Definition 4. A function £ = £(co) defined on (CI, &) with values in R =
[—00,00] will be called an extended random variable if condition (1) is
satisfied for every Borel set B e @(R).
The following theorem, despite its simplicity, is the key to the construction
of the Lebesgue integral (§6).
Theorem 1.
(a) For every random variable £ = £(co) (extended ones included) there is a
sequence of simple random variables £u £2,..., such that \£n\ < \£\ and
£n(co) -* £(co), n -> 00, for all coeCl.
(b) If also £(co) > 0, there is a sequence of simple random variables ^, £2 > ■ • •»
such that £„(co) | £(co), n -+ 00, for all co e Q.
Proof. We begin by proving the second statement. For n = 1,2,..., put
n2h k — 1
Ua>) = Z -^-hjco) + nIHm^n](co).
k=l *■
where lKn is the indicator of the set {(k — l)/2n < £(co) < k/2n}. It is easy
to verify that the sequence £n(co) so constructed is such that £„(co) | £(co)
for all co e CI. The first statement follows from this if we merely observe that
^ can be represented in the form £ = £+ — £". This completes the proof of
the theorem.
We next show that the class of extended random variables is closed under
point wise convergence. For this purpose, we note first that if ^, £2,... is a
sequence of extended random variables, then sup £„, inf £„, lim £„ and lim £n
are also random variables (possibly extended). This follows immediately from
{co: sup £„ > x} = |J {co: £n > x} e ^,
n
{co: inf £„ < x} = |J {co: £n < x} e ^,
and
fim £„ = inf sup £m, Hm f„ = sup inf £„,.
Theorem 2. Lef f it £2,... ^e a sequence of extended random variables and
£(co) = lim £„(co). Then %(co) is also an extended random variable.
The proof follows immediately from the remark above and the fact that
{co: £(co) < x} = {co: lim £n(co) < x}
= {co: hm £n(co) = Hm £n(co)} n {IET £n(co) < x}
= CI n {hm £n(co) < x} = {fim £n(co) < x} e ^.
174
II. Mathematical Foundations of Probability Theory
4. We mention a few more properties of the simplest functions of random
variables considered on the measurable space (Q, ^) and possibly taking
values on the extended real line R = [ — 00, oo].f
If £ and rj are random variables, £ + rj, £ — rj, fy, and £/rj are also random
variables (assuming that they are defined, i.e. that no indeterminate forms like
00 — 00, 00/00, a/0 occur.
In fact, let {£„} and {nn} be sequences of random variables converging to
^ and rj (see Theorem 1). Then
L ± In - f ± 1*
L | Z
*\n +-^ = 0)(^)
The functions on the left-hand sides of these relations are simple random
variables. Therefore, by Theorem 2, the limit functions ^ ± nt £rj and %/rj
are also random variables.
5. Let ^ be a random variable. Let us consider sets from £F of the form
{co: £(co)ef?}, Be&(R). It is easily verified that they form a tr-algebra,
called the a-algebra generated by £, and denoted by ^.
If <p is a Borel function, it follows from Lemma 2 that the function rj = cpo^
is also a random variable, and in fact ^-measurable, i.e. such that
{co: rj(co) e B} e ^, Be @(R)
(see (7)). It turns out that the converse is also true.
Theorem 3. Let rj be a iF^-measurable random variable. Then there is a Borel
function <p such that r\ = <p 0 f, i.e. n(co) = cp(£(co))for every co e Q.
Proof. Let O* be the class of ^-measurable functions rj = rj{co) and O^
the class of .^-measurable functions representable in the form <p 0 £, where
¢) is a Borel function. It is clear that O^ c ¢^. The conclusion of the theorem
is that in fact ¢^ = O^.
Let A e &^ and n(co) = IA(co). Let us show that rj e O^. In fact, \l Ae^
there is a £ e ^(R) such that /1 = {co: £(co) e B}. Let
Xb(x) = <L , D
(0, x$B.
Then/^(co) = Xb(^M) e ¢^. Hence it follows that every simple ^-measurable
function £"= l CiIA.(co), Ate^9 also belongs to <^.
t We shall assume the usual conventions about arithmetic operations in R: if ae R then
a ± 00 = ±00, a/±oo = 0; a • 00 = 00 if a > 0, and a ■ 00 = —00 if a < 0; 0 • (±00) = 0,
00 + 00 = 00, —00 — 00 = —00.
§4. Random Variables. I
175
Now let r\ be an arbitrary ^-measurable function. By Theorem 1 there
is a sequence of simple ^-measurable functions {nn} such that nn{co) -> rj(co),
n -> oo, co e Q. As we just showed, there are Borel functions <pn = cpn(x) such
that rj„(co) = cpn(^(co)). Then <pn(^(co)) -> rj(co), n -> oo, co e Q.
Let Bdenote the set {xe R: lim„ cpn(x) exists}. This is a Borel set. Therefore
{lim <pn(x),
n
0,
xeB,
cp{x
x$B
is also a Borel function (see Problem 7).
But then it is evident that rj(co) = limn <pn(^(co)) = cp(£(co)) for all co e CL
Consequently ^ = O^.
6. Consider a measurable space (Q, !F) and a finite or countably infinite
decomposition $) = {Dl,D2t...} of the space Q: namely, Dte^ and
££Z){ = £1 We form the algebra s# containing the empty set 0 and the sets
of the form J]aZ)a, where the sum is taken in the finite or countably infinite
sense. It is evident that the system $£ is a monotonic class, and therefore,
according to Lemma 2, §2, chap. II, the algebra s# is at the same time a
cr-algebra, denoted o(3)) and called the cr-algebra generated by the
decomposition S>. Clearly a{S>) c &.
Lemma 3. Let £ = £{co) be a o(S))-measurable random variable. Then £ is
representable in the form
^)=1^(60), (8)
fc=i
where xkeR,k>l, i.e., £(co) is constant on the elements Dk of the
decomposition, k>l.
Proof. Let us choose a set Dk and show that the cr(0)-measurable function £
has a constant value on that set.
For this purpose, we write
xk = sup[c: Dk n {co: £(co) < c} = 0].
Since {co: £(eo) < xk} = \Jr<xk {w: £(w) < r}> we nave Ak n {w: £(w) <
xk} = 0. rralional
Now let c > xk. Then Dkn{co: £(co) < c} ^ 0, and since the set
{co: £(eo) < c} has the form £aZ)a, where the sum is over a finite or countable
collection of indices, we have
Z)fcn{co:£(co)<c} = Z)fc.
Hence, it follows that, for all c> xk,
Dkn{co:£(co)>c} = 0,
176
II. Mathematical Foundations of Probability Theory
and since {co: £{co) > xk} = {Jr>x {co: £(co) > r}, we have
r rational
Dkn{co:t(co)>xk} = 0.
Consequently, Dk n {co: £(co) ^ xk} = 0, and therefore
Dk^{co:£(co) = xk}
as required.
7. Problems
1. Show that the random variable c; is continuous if and only if P(f = x) = 0 for all
xeR.
2. If | f | is ^"-measurable, is it true that f is also ^"-measurable?
3. Show that f = £(co) is an extended random variable if and only if {co: f(co) e B} e &
for all #e <#(£).
4. Prove that x", x+ = max(x, 0), x~ = -min(x, 0), and \x\ = x+ + x~ are Borel
functions.
5. If f and r\ are ^"-measurable, then {co: f(co) = »7(co)} e &.
6. Let f and »7 be random variables on (CI, ^), and Ae&. Then the function
Ua>) = t(a>)-IA + ri(a>)Iz
is also a random variable.
7. Let dji,..., £„ be random variables and <p(xu...,x„) a Borel function. Show that
<p(£i(o>). • • •. £n(o>)) is also a random variable.
8. Let £ and tj be random variables, both taking the values I, 2,...,N. Suppose that
&t = & Show that there is a permutation (iu i2,..., iw) of (1,2,..., N) such that
{co:t=j} = {co:ij = ij}(oTJ= 1,2,...,N.
§5. Random Elements
1. In addition to random variables, probability theory and its applications
involve random objects of more general kinds, for example random points,
vectors, functions, processes, fields, sets, measures, etc. In this connection it is
desirable to have the concept of a random object of any kind.
Definition 1. Let (CI, &) and (E, S) be measurable spaces. We say that a
function X = X(co), defined on CI and taking values in E, is &I8-measurable,
or is a random element (with values in E), if
{co: X(co) eB}e&
(1)
§5. Random Elements
177
for every Beg. Random elements (with values in £) are sometimes called
£-valued random variables.
Let us consider some special cases.
If (£, S) = (R, @(R)\ the definition of a random element is the same as the
definition of a random variable (§4).
Let (£, <f) = (R\ @(Rn)). Then a random element X(co) is a "random
point" in Rn. If 7¾ is the projection ofRn on the kth coordinate axis, X(co) can
be represented in the form
W = (^(4-5M, (2)
where £k = nk o X.
It follows from (1) that £k is an ordinary random variable. In fact, for
B e @(R) we have
{co:Mco)eB} = {co:tl(co)eR,...,tk-leR,tkeB,tk+leRt...}
= {co: X(co) e (R x ■ ■ • x R x B x R x • • • x R)} e &,
since Rx-xRxBxRx-xRe @(R").
Definition 2. An ordered set (^(co),..., rj„(co)) of random variables is called
an n-dimensional random vector.
According to this definition, every random element X{co) with values in
Rn is an n-dimensional random vector. The converse is also true: every
random vector X(co) = (^(co),..., £n(co)) is a random element in Rn. In fact,
if Bk e @{R\ k = 1,..., n, then
{co: X(co) e (B, x •. • x B„)} = f\ {co: ^(co) eBk}e&
k=l
But &(Rn) is the smallest cr-algebra containing the sets Bx x ••• x Bn.
Consequently we find immediately, by an evident generalization of Lemma 1
of §4, that whenever B e @(Rn\ the set {co: X(co) e B} belongs to &.
Let (£, £) = (Z, B(Z)\ where Z is the set of complex numbers x + iy,
x,yeR, and B{Z) is the smallest tr-algebra containing the sets {z:z = x + iy,
al < x < &i, a2 < y < b2}. It follows from the discussion above that a
complex-valued random variable Z(co) can be represented as Z(co) =
X(co) + iY(to\ where X(co) and Y(co) are random variables. Hence we may
also call Z(co) a. complex random variable.
Let (£, S) = (RT, @(RT)), where T is a subset of the real line. In this case
every random element X = X{co) can evidently be represented as X = (£t)teT
with £t = nt o X, and is called a random function with time domain T.
Definition 3. Let T be a subset of the real line. A set of random variables
X = (£t)teT is called a random process^ with time domain T.
t Or stochastic process (Translator).
178
II. Mathematical Foundations of Probability Theory
If T = {1, 2,...} we call X = (£l5 £2,...) a random process with discrete
time, or a random sequence.
If T = [0, 1], (-00,00), [0,00),..., we call X = (QteT a random
process with continuous time.
It is easy to show, by using the structure of the <r-algebra <%(RT) (§2) that
every random process X = (£,)teT (in the sense of Definition 3) is also a
random function on the space (RT, &(RT)).
Definition 4. Let X = (£t)teT be a random process. For each given coeQ
the function (£t(co))teT is said to be a realization or a trajectory of the process,
corresponding to the outcome co.
The following definition is a natural generalization of Definition 2 of §4.
Definition 5. Let X = (£t)teT be a random process. The probability measure
Px on (RT, @(RT)) defined by
px(£) = P{g>: X(co) eB}, Be@(RT),
is called the probability distribution ofX. The probabilities
Ptl J^sP^ft, {JeB}
with ty < t2 < •" < tn, tie T, are called finite-dimensional probabilities
(or probability distributions). The functions
Ftl rn(*i. • • • > *„) = P{g>: 6, < *i, • • •, ttn < *„}
with tt < t2 < ••• < tn, t{eT, are called finite-dimensional distribution
functions.
Let (£, <f) = (C, ^o(Q)> where C is the space of continuous functions
x = C*t)reT on T = [0, 1] and ^0(C) is the <r-algebra generated by the open
sets (§2). We show that every random element X on (C, &0(C)) is also a
random process with continuous trajectories in the sense of Definition 3.
In fact, according to §2 the set A = {xeC:xt <a} is open in &0(C).
Therefore
{co: £t(co) < a} = {a>: X(a>) e 4} e &.
On the other hand, let X = (£f(a>))f6T be a random process (in the sense
of Definition 3) whose trajectories are continuous functions for every
coeQ. According to (2.14),
{xeCixe Sp(x0)} = f] {x e C: \xtk - x° | < p},
tic
where tk are the rational points of [0,1]. Therefore
{co: X(co) e Sp(X°co))} = f]{co:\ &» - £>» \ < p) e ^
tk
and therefore we also have {co: X(co) e B} e SF for every B e &0(C).
§5. Random Elements
179
Similar reasoning will show that every random element of the space
(D, @0(D) can be considered as a random process with trajectories in the
space of functions with no discontinuities of the second kind; and conversely.
2. Let (Q, <F9 P) be a probability space and (Ea9 <fa) measurable spaces,
where a belongs to an (arbitrary) set 21.
Definition 6. We say that the ^"/^a-measurable functions (Xa(co))9 a e 21,
are independent (or collectively independent) if, for every finite set of
indices ^,...,¾ the random elements XUl,..., Xan are independent, i.e.
P(Xai e Bai,..., Xan e BJ = P(Xai e BJ ■ ■ ■ P(Xan e BJ, (3)
where BaeSa.
Let 21 = {1,2,...,«}, let £a be random variables, let a e 21 and let
F^(x1,...,xn) = P(^1<x1,...,^<xn)
be the n-dimensional distribution function of the random vector
£ = (£1,...,0. Let F^(x,) be the distribution functions of the random
variables £,-, i = 1,..., n.
Theorem. A necessary and sufficient condition for the random variables
£i,..., £„ to be independent is that
F^(x1,...,xn) = F^(x1).-.FJxn) (4)
for all (xlf..., xn) e Rn.
Proof. The necessity is evident. To prove the sufficiency we put
a = (al9...9an)9b = (by9...9bn)9
Pi(a,b] = P{(D:ay <£i < bl9...9an < £„ < bn}9
Ptffli, &.•] = P{«.- < 6 ^ W-
Then
pfa ft = n iFjbd - FtHasi = n^fli. w
i=l i=l
by (4) and (3.7), and therefore
P«! eIl9...9 ^eIn}=f\P{^e/J, (5)
i=l
where /,- = (aif fej.
We fix 12,..., /„ and show that
P{ZleBl,Z2eI29...,ZHeIm} = P{ZleBl}flP{ZleIt} (6)
i=2
for all By e@(R). Let Jt be the collection of sets in @(R) for which (6)
180
II. Mathematical Foundations of Probability Theory
holds. Then Jt evidently contains the algebra $£ of sets consisting of sums of
disjoint intervals of the form lx = (alt &J. Hence s# c Jt c @(R). From
the countable additivity (and therefore continuity) of probability measures it
also follows that Jt is a monotonic class. Therefore (see Subsection 1 of §2)
H(st) c Jt c @(R).
But ]x{sf) = o{sf) = @{R) by Theorem 1 of §2. Therefore Jt = @(R).
Thus (6) is established. Now fix Bl9 /3,..., /„; by the same method we
can establish (6) with /2 replaced by the Borel set B2. Continuing in this
way, we can evidently arrive at the required equation,
P(tleBl,...,tneBn) = P(Z1eBl)..P(tneBn),
where Bt e &(R). This completes the proof of the theorem.
3. Problems
1. Let f i,..., £„ be discrete random variables. Show that they are independent if and
only if
P(^ =Xl,...,tn = xn)= flP(& = xf)
£=1
for all real xi,...,xH.
2. Carry out the proof that every random function (in the sense of Definition 1) is a
random process (in the sense of Definition 3) and conversely.
3. Let X i,..., X„ be random elements with values in (£lt S^\..., (E„, £„), respectively.
In addition let (E'uS\\...,(E'n,&',) be measurable spaces and let #i,..., #„ be
&J&\,...,^,,/^-measurable functions, respectively. Show that if XU...,X„ are
independent, the random elements gx • Xlt..., g„ • X„ are also independent.
§6. Lebesgue Integral. Expectation
1. When (Q, ^", P) is a finite probability space and £ = £(co) is a simple
random variable,
«©)= IxiW4 (1)
fc=i
the expectation E£ was defined in §4 of Chapter I. The same definition of the
expectation E£ of a simple random variable £ can be used for any probability
space (Q, ^, P). That is, we define
E£ = t ***W (2)
k= 1
§6. Lebesgue Integral. Expectation
181
This definition is consistent (in the sense that E£ is independent of the
particular representation of £ in the form (1)), as can be shown just as for
finite probability spaces. The simplest properties of the expectation can be
established similarly (see Subsection 5 of §4 of Chapter I).
In the present section we shall define and study the properties of the
expectation E^ of an arbitrary random variable. In the language of analysis,
E£ is merely the Lebesgue integral of the ^"-measurable function £ = £(a>)
with respect to the measure P. In addition to Ef we shall use the notation
Jn«a>)P(<MorJnfdP.
Let ^ = £(a>) be a nonnegative random variable. We construct a sequence
of simple nonnegative random variables {£„}„;> i such that £„(co) | £(o>),
n -> oo, for each coeCl (see Theorem 1 in §4).
Since E£n < E£n+1 (cf. Property 3) of Subsection 5, §4, Chapter I), the
limit limn E£„ exists, possibly with the value + oo.
Definition 1. The Lebesgue integral of the nonnegative random variable
^ = £(cd), or its expectation, is
E£ = limE£n. (3)
n
To see that this definition is consistent, we need to show that the limit is
independent of the choice of the approximating sequence {£„}. In other
words, we need to show that if £„ | £ and r\m \ £, where {rjm} is a sequence of
simple functions, then
limE^limE^. (4)
n m
Lemma 1. Let rj and £„ be simple random variables, n > 1, and
Then
limE£n>E>7. (5)
n
Proof. Let £ > 0 and
An = {co: £„ > r\ — £}.
It is clear that An T O and
L = Ua„ + Ua„ > UAn >(ri- s)IAn.
Hence by the properties of the expectations of simple random variables we
find that
EL > E(rj - e)IAn = Er,IAn - sP{An)
= Erj- Enl2n - eP(An) > En - CP{An) - £,
182
II. Mathematical Foundations of Probability Theory
where C = max,,, rj(co). Since e is arbitrary, the required inequality (5) follows.
It follows from this lemma that limn E£n > limm Enm and by symmetry
limm Erjm > limn E£n, which proves (4).
The following remark is often useful.
Remark 1. The expectation E£ of the nonnegative random variable ^ satisfies
E£ = sup Es, (6)
{seS:s<Q
where S = {s} is a set of simple random variables (Problem 1).
Thus the expectation is well defined for nonnegative random variables.
We now consider the general case.
Let £ be a random variable and £+ = max(£, 0), £" = — min(^, 0).
Definition 2. We say that the expectation E£ of the random variable £
exists, or is defined, if at least one of E£+ and E^" is finite:
min(E£-\E<r)<oo.
In this case we define
E£ = E«r-E<r.
The expectation E£ is also called the Lebesgue integral (of the function £ with
respect to the probability measure P).
Definition 3. We say that the expectation of ^ is finite if E£+ < oo and
Ef" < oo.
Since |£| = £+ — £", the finiteness of E£, or |E£| < oo, is equivalent to
E | £ | < oo. (In this sense one says that the Lebesgue integral is absolutely
convergent.)
Remark 2. In addition to the expectation E£, significant numerical
characteristics of a random variable £ are the number E£r (if defined) and E | £|r, r > 0,
which are known as the moment of order r (or rth moment) and the absolute
moment of order r (or absolute rth moment) of £.
Remark 3. In the definition of the Lebesgue integral Jn £(co)P(dco) given
above, we suppose that P was a probability measure (P(Q) = 1) and that
the .^"-measurable functions (random variables) £ had values in
R = (—oo, oo). Suppose now that ix is any measure defined on a measurable
space (Q, &) and possibly taking the value + oo, and that ^ = £(co) is an
^-measurable function with values in R = [— oo, oo] (an extended random
variable). In this case the Lebesgue integral j"n Z(co)n(dco) is defined in the
§6. Lebesgue Integral. Expectation
183
same way: first, for nonnegative simple £ (by (2) with P replaced by /*),
then for arbitrary nonnegative £, and in general by the formula
f &w)ii{dw) = f vKdio) - f r«,
Jsi Jci Ja
provided that no indeterminacy of the form oo — oo arises.
A case that is particularly important for mathematical analysis is that in
which (Q, F) = (R, @(R)) and n is Lebesgue measure. In this case the
integral \R ftxMdx) is written JR £(x) dx, or ^m f (x) dx, or (L) {"«, flx) dx
to emphasize its difference from the Riemann integral (R) J"^ £(x) dx. If the
measure fi (Lebesgue-Stieltjes) corresponds to a generalized distribution
function G = G(x), the integral JR £(x)^(dx) is also called a Lebesgue-
Stieltjes integral and is denoted by (L-S) JR £(x)G(dx), a notation that
distinguishes it from the corresponding Riemann-Stieltjes integral
(R-S) f
«x)G(rfx)
R
(see Subsection 10 below).
It will be clear from what follows (Property D) that if E^ is defined then
so is the expectation E(£IA) for every Ae^. The notations E(£; A) or \A £ dP
are often used for E(£IA) or its equivalent, Jn £IA dP. The integral \A ^ dP is
called the Lebesgue integral oft with respect to P over the set A.
Similarly, we write \A £ d\x instead of Jn f • lA d\i for an arbitrary measure
\x. In particular, if n is an n-dimensional Lebesgue-Stieltjes measure, and
A = (alt &J x • • • x (fln, frj, we use the notation
M
Jai Ja
fen /•
f (xi,..., xn)|i(rfx1 • • • dx„) instead of £ d^.
If ^ is Lebesgue measure, we write simply dxx-"dxn instead of n(dxu..., dxn).
2. Properties of the expectation E£ of the random variable £.
A. Let cbea constant and let E£ exist. Then E(c£) exists and
E(c0 = <££-
B. Let £<rj; then
Et<Er,
wfr/i f/ie understanding that
if — oo < E£ f/ien — oo < E»y and E£ < En
or
if En < oo f/ien E£ < oo and E£ < Ew.
184
II. Mathematical Foundations of Probability Theory
C. IfE£ exists then
|E£|<E|a
D. If E£ exists then E(£IA) exists for each Ae^;ifE^ is finite, E(£IA) is
finite.
E.IfZ and n are nonnegative random variables, or such that E |£| < oo and
E\rj\ < oo, then
E(£ + rj) = E£ + En.
(See Problem 2 for a generalization.)
Let us establish A-E.
A. This is obvious for simple random variables. Let £ > 0, £n | & where f„
are simple random variables and c > 0. Then c£n | c£ and therefore
E(c0 = lim E(cU = c lim E£„ = cE£.
In the general case we need to use the representation £ = £+ — £"
and notice that (c£)+ = c£+, (c£)~ = c£~ when c > 0, whereas when
c<Q,(c0+ = -cr,(c0- = -cf+.
B. If 0 < ^ < »y, then E£ and E»y are defined and the inequality E£ < En
follows directly from (6). Now let E£ > — oo; then E£~ < oo. If ^ <rj,
we have £+ < »y+ and f" > »y". Therefore E»y" < E£~ < oo;
consequently En is defined and E£ = E£+ - E£~ < E»y+ - E»y" = En. The
case when En < oo can be discussed similarly.
C. Since — |£| < £ < |£|, Properties A and B imply
-E|£|<E£<E|a
i.e.|E£|<E|a
D. This follows from B and
B/J+ = <r/x < <r, «/j- = vi a ^ r.
E. Let £>0,n>0, and let {£„} and }nn) be sequences of simple functions
such that £„ T £ and »yn |»/. Then E(fn + *7„) = Efn + E»;n and
E(£„ + nn) T E(£ + rj), E£„ | E£, Ely, | Ei?
and therefore E(£ + rj) = E£ + En. The case when E|£|<oo and
E | n | < oo reduces to this if we use the facts that
t = C-C, r, = n+-n-, <T < l£l, C < |£|,
and
§6. Lebesgue Integral. Expectation
185
The following group of statements about expectations involve the notion
of "P-almost surely." We say that a property holds "P-almost surely" if there
is a set .fe^ with P(^T) = 0 such that the property holds for every point
co ofQ\jV. Instead of" P-almost surely" we often say " P-almost everywhere"
or simply "almost surely" (a.s.) or "almost everywhere" (a.e.).
F. Ifc = 0(a.s.)thenEf = 0.
In fact, if ^ is a simple random variable, ^ = J]xfc/ylk(G)) and xk ^ 0,
we have P(Ak) = 0 by hypothesis and therefore E£ = 0. If £ > Oand 0<s<f,
where s is a simple random variable, then s = 0 (a.s.) and consequently
Es = 0 and E^ = sup{SGS:s<5) Es = 0. The general case follows from this by
means of the representation £ = £+ — f" and the facts that £+ < |£|,
T <iaand|£| = 0(a.s.).
G. If £ = r\ (a.s.) and E|£| < oo, then E[771 < 00 and E£ = En (see also
Problem 3).
In fact, let Jf = {co: £ =£ n}. Then P(JT) = 0 and £ = ¢/^ + £/^,
n = rjlv + nljr = nl^- + £/^. By properties E and F, we have E£ = E^J^ +
E£/r = E^J p. But E^/^ = 0, and therefore E£ = E^/jp + E^/^ = En, by
Property E.
H. Let £ > 0 and E£ = 0. Then £ = 0 (a.s).
For the proof, let A = {a>: £(a>) > 0}, An = {co: £(co) > 1/n}. It is clear
that An T A and 0 <Z-IAn<£- IA. Hence, by Property B,
0 < E^ < E£ = 0.
Consequently
0 = E^n>ip^n)
and therefore P(An) = 0 for all n > 1. But P(,4) = lim P(,4n) and therefore
P04) = 0.
I. Let £ and »7 be such that E|£| < 00, E\n\ < 00 and E^f/J < EO7/J for
all Ae^. Then £ < »7 (a.s.).
In fact, let B = {a>: £(co) > 77(0))}. Then E(»y/B) < E(£JB) < E(»y/B) and
therefore E(£/B) = E(»y/B). By Property E, we have E((£ - n)IB) = 0 and by
Property H we have (£ — n)IB = 0 (a.s.), whence P(B) = 0.
J. Let £ be an extended random variable and E|£| < 00. Then |£| < 00
(a. s). In fact, let A = {o>|£(o))| = 00} and P(A) > 0. Then E|£| >
E(I^Uyi) = °° • pG4) = 00, which contradicts the hypothesis E|£| < 00.
(See also Problem 4.)
3. Here we consider the fundamental theorems on taking limits under the
expectation sign (or the Lebesgue integral sign).
186
II. Mathematical Foundations of Probability Theory
Theorem 1 (On Monotone Convergence). Let n, £, ^, £2>--- be random
variables.
(a) If £„ > r\ for all n > 1, En > - oo, and £„ | & ^ew
E£„ T E£.
(b) If £„ < »y for all n > 1, En < oo, awd £n j £, t/iew
E£„ i E£.
Proof, (a) First suppose that n > 0. For each k > 1 let {^n)}„^ i be a sequence
of simple functions such that &n) l£k,n^co. Put £(n) = maxx^k^N&}.
Then
£<n-l) < £<n) = max £jn) < max ^ = ¢^
Let C = Hmn C(n). Since
«■> < C(n) < Zn
for 1 < k < n, we find by taking limits as n -> oo that
& < C < £
for every k > 1 and therefore ^ = C-
The random variables £(n) are simple and £(n) f £. Therefore
Ef = EC = lim EC(n) < lim Efn.
On the other hand, it is obvious, since £n < £„+1 < £, that
lim E£„ <; E£.
Consequently lim E£„ = E£.
Now let n be any random variable with En > — oo.
If Ety = oo then E£n = E£ = oo by Property B, and our proposition is
proved. Let En < oo. Then instead of En > — oo we find E\n\ < oo. It is
clear that 0 < £„ — n f £ — n for all co e Q. Therefore by what has been
established, E(£n — n)\E{^ — n) and therefore (by Property E and Problem 2)
E£,,-E,TE£-Ei,.
But E \n\ < oo, and therefore E£n f E^, n -*■ oo.
The proof of (b) follows from (a) if we replace the original variables by
their negatives.
Corollary. Let {»/„}„> i be a sequence of nonnegative random variables. Then
oo oo
EI*- I EH..
n=1 n=1
§6. Lebesgue Integral. Expectation 187
The proof follows from Property E (see also Problem 2), the monotone
convergence theorem, and the remark that
k oo
n=l n=l
Theorem 2 (Fatou's Lemma). Let rit£u£2*---be random variables.
(a) If £n>n for all n > 1 and En > — oo, then
E Hm £„ < Hm E£n.
(b) If £„<n for alln>\ and En < oo, then
IuHE^<Ehm^n.
(c) //141 ^ »//or alln>\ and En < oo, f/ien
E Hm£n < HmE£n < IIm"E£n < E hm£n. (7)
Proof, (a) Let £, = infm>„ £m; then
Hm £„ = lim inf £m = lim £,.
n m>n n
It is clear that £„ f lim f „ and £„ > »7 for all n > 1. Then by Theorem 1
E Hm f„ = E lim £„ = lim E£„ = Hm E£n < Hm E£n,
n n n
which establishes (a). The second conclusion follows from the first. The third
is a corollary of the first two.
Theorem 3 (Lebesgue's Theorem on Dominated Convergence). Let n, £,
f i, £2, • • • fee random variables such that |£„| < nt En < oo and £n -> ^ (a.s.).
T/ienE|^| < oo,
E£,,-E£ (8)
and
E|4 — ¢1-0 (9)
as n -> oo.
Proof. Formula (7) is valid by Fatou's lemma. By hypothesis, Hm f„ =
lim ^ = £ <a.s.). Therefore by Property G,
E Hm 4 = Hm E^ = Hm E^ = E Rm 4 = E£,
which establishes (8). It is also clear that |£| < n. Hence E |£| < oo.
Conclusion (9) can be proved in the same way if we observe that
\tn-t\<2n.
188
II. Mathematical Foundations of Probability Theory
Corollary. Let n, £, £i,... be random variables such that \£„\ < v\-> £n ~* £
(a.s.) and Erf < oo for some p > 0. Then E |£|p < oo and E |£ - £„|p -»• 0»
n -> oo.
For the proof, it is sufficient to observe that
i£i < ^ is - up < cm + iur < (2^.
The condition "|£„l < rj, En < oo" that appears in Fatou's lemma and
the dominated convergence theorem, and ensures the validity of formulas
(7)-(9), can be somewhat weakened. In order to be able to state the
corresponding result (Theorem 4), we introduce the following definition.
Definition 4. A family {£„}„;> t of random variables is said to be uniformly
integrable if
sup f \UP(daS) - 0, c - oo, (10)
n J{\Sn\>c)
or, in a different notation,
supE[ia/{|ft.i>J-0, c-oo. (11)
n
It is clear that if £n, n > 1, satisfy |£n| < rj, En < oo, then the family
(£n}n;>i is uniformly integrable.
Theorem 4. Let {£„}n2.i be a uniformly integrable family of random variables.
Then
(a) E!im£n<!imE£n <hmE£n <Ehm£n.
(b) /f in addition £n -> t, (a.s.) then £ is integrable and
E£n-E£, ii-oo,
EIf,-¢1-0, n-oo.
Proof, (a) For every c > 0
E£„ = EKB/Kii<_ J + E[£n/{^_c}]. (12)
By uniform integrability, for every £ > 0 we can take c so large that
suplEKJ^.JKe. (13)
n
By Fatou's lemma,
ymEK./tt.a-d] ^ E[lim ^/(6ii_ J. ,
But £,,/^ a_c} > £n and therefore
limE[Sn/{^_c}] > E[Hm £„]. (14)
§6. Lebesgue Integral. Expectation
189
From (12)-(14) we obtain
!imEfn>EDimfJ-£.
Since £ > 0 is arbitrary, it follows that lim E£n > E Hm £„. The inequality
with upper limits, lim E£„ < E lim £„, is proved similarly.
Conclusion (b) can be deduced from (a) as in Theorem 3.
The deeper significance of the concept of uniform integrability is revealed
by the following theorem, which gives a necessary and sufficient condition
for taking limits under the expectation sign.
Theorem 5. Let 0 < £n -> £ and E£n < oo. Then E£n -> E£ < oo if and only if
the family {^J„^i is uniformly integrable.
Proof. The sufficiency follows from conclusion (b) of Theorem 4. For the
proof of the necessity we consider the (at most countable) set
A = {a:P(£ = a)>0}.
Then we have ^nI^n<a] -*■ £/K<flJ for each a$A, and the family
{£nftfn<a}}n&l
is uniformly integrable. Hence, by the sufficiency part of the theorem, we have
E£„/tfn<fl) -> E£ftfn<fl}> a t A, and therefore
Efn'tfn>fl>-Ef/{^fl>, at A, n-oo. (15)
Take an £ > 0 and choose a0 £ A so large that E^/^Sfloj < e/2; then choose
N0 so large that
for all n > N0, and consequently E^n/{€n>flo} < £. Then choose ax > a0 so
large that E^{€nSfll} < £ for all n < N0. Then we have
supEfn/{£n>fll} < £,
n
which establishes the uniform integrability of the family {£„}„> i of random
variables.
4. Let us notice some tests for uniform integrability.
We first observe that if {£,,} is a family of uniformly integrable random
variables, then
supE|£n|<oo. (16)
190
II. Mathematical Foundations of Probability Theory
In fact, for a given s > 0 and sufficiently large c > 0
supElfJ = sup[E(l4 1/,,6,,^,) + E(|£n |/{„n,<c})]
n n
< supE(|4 |/{Kn,^c}} + supE(|£n|/(Kn|<c)) < e + C
n n
which establishes (16).
It turns out that (16) together with a condition of uniform continuity is
necessary and sufficient for uniform integrability.
Lemma 2. A necessary and sufficient condition for a family {£„}„£ i of random
variables to be uniformly integrable is that E | £n|, n > l,are uniformly bounded
(i.e., (16) holds) and that E{ \Z„\IA},n > 1, are uniformly absolutely continuous
(i.e. sup E {| £n | IA) - 0 when P(A) - 0).
Proof. Necessity. Condition (16) was verified above. Moreover,
E{|£„|/x} = £(1^1/^^,^,} +E{|£n|/yln{Kn|<c}}
<E{|«U/{|U^}}+cP04). (17)
Take c so large that sup„ E{|^[/{Un, >c>} < e/2. Then if P(A) < e/2c, we have
suPE{|4|/x}<£
n
by (17). This establishes the uniform absolute continuity.
Sufficiency. Let e > 0 and 5 > 0 be chosen so that P(4) < d implies that
E(|^„|/yi) < e, uniformly in n. Since
E|^l>E|^|/{|U^}>CP{|^n|>C}
for every c > 0 (cf. Chebyshev's inequality), we have
sup P{|fn| > c) < - sup E 16,1 -> 0, c -> oo,
n C
and therefore, when c is sufficiently large, any set {|£n| > c}, n > 1, can be
taken as A Therefore sup E(| f„|/{kJ ^,) < e, which establishes the uniform
integrability. This completes the proof of the lemma.
The following proposition provides a simple sufficient condition for
uniform integrability.
Lemma 3. Let £u £2»--- be a sequence of integrable random variables and
G = G(t) a nonnegative increasing function, defined for t > 0, such that
lim — = oo. (18)
t-+oo *
supE[G(|£n|)] < oo. (19)
n
Then the family {£„}„> x is uniformly integrable.
§6. Lebesgue Integral. Expectation
191
Proof. Let e > 0, M = supnE[G(|^n|)], a = M/e. Take c so large that
G(t)/t > a for t > c. Then
uniformly for n > 1.
5. If ^ and w are independent simple random variables, we can show, as in
Subsection 5 of §4 of Chapter I, that E£rj = E£ • En. Let us now establish a
similar proposition in the general case (see also Problem 5).
Theorem 6. Let £ and n be independent random variables with E|£| < oo,
E1771 < 00. ThenE\£n\ < 00 and
E^n = E^Erj. (20)
Proof. First let £ > 0, rj > 0. Put
00 k
£n = £ - * {fc/n < §(«) < (fc + 1 )/n} >
fc = 0 M
- V kr
*ln — L ~ '[k/nz 17(a))<(fc+l)/n}•
fc = 0 n
Then fn < £, |£„ - f I ^ l/n and »yn < »y, |nn - n| < 1/n. Since Ef < 00 and
E»y < 00, it follows from Lebesgue's dominated convergence theorem that
lim E£„ = E£, lim Enn = En.
Moreover, since £ and r\ are independent,
kl
E£„»7n = Z ^2 ^^{fc/ns«<(fc+l)/n}^{//nsi7<(/+l)/n}
fc./^O "
= Z/ ^2 EVs«<(ft+l)/n} ' E^{//n^i7<(J+l)/n} = E£n " E,7n-
fc,/>0 n
Now notice that
|Efr - E^nnn\ < E |fr - Un\ < E[|«| • \rj - ijj]
+ E[|^|.|^-^n|]<iE^+iE^+^0, n^oo.
Therefore E^ = limn E£n*7n = lim E£„ • lim Enn = E£ • E»y, and E£n < 00.
The general case reduces to this one if we use the representations
£ = V - C,rj = n+ - r\-^r\ = £ V - <T*7+ - £V + fiT. This
completes the proof.
6. The inequalities for expectations that we develop in this subsection are
regularly used both in probability theory and in analysis.
192
II. Mathematical Foundations of Probability Theory
Chebyshev's Inequality. Let £ be a nonnegative random variable. Then for
every e > 0
P(£>£)<^. (2D
£
The proof follows immediately from
E£ > EK-/KiJ > £E/(^£) = fiP(£ > £).
From (21) we can obtain the following variant of Chebyshev's inequality:
If £ is any random variable then
P(£>£)<^ (22)
and
P(l£-E£|>£)<^, (23)
where V£ = E(£ — E£)2 is the variance of £.
The Cauchy-Bunyakovskii Inequality. Let £ andrj satisfyE£2 < oo,E»y2 < oo.
Then E | £rj \ < oo and
(E|£7l)2<E£2.E*72. (24)
Proof. Suppose that E£2 > 0, Erj2 > 0. Then, with I = Z/y/Ei?, fj =
rj/y/En2, we find, since 2\%rj\ < £2 + fj2, that
2E|#y|<E|2+E*;2 = 2,
i.e. E |lr\\ < 1, which establishes (24).
On the other hand if, say, E£2 = 0, then £ = 0 (a.s.) by Property I, and
then E£rj = 0 by Property F, i.e. (24) is still satisfied.
Jensen's Inequality. Let the Borel function g = g(x) be convex downward and
E|£| < oo. Then
0<E0 < Erffr (25)
Proof. If g = g(x) is convex downward, for each x0 e R there is a number
A(x0) such that
9(x) > g(x0) + (x - x0) - Mxo) (26)
for all xeR. Putting x = £ and x0 = E £, we find from (26) that
0(£)>0(E3 + (£-E£)-;i(Ea
and consequently Eg(£) > g(E%).
§6. Lebesgue Integral. Expectation
193
A whole series of useful inequalities can be derived from Jensen's inequality.
We obtain the following one as an example.
Lyapunov's Inequality. IfO<s<t,
(Ems)1/s<(Emrr. <zn
To prove this, let r = t/s. Then, putting rj = \£\s and applying Jensen's
inequality to g(x) = |x|r, we obtain \Erj\r < E \rj\r, i.e.
(Ei£is)r/s<Em<,
which establishes (27).
The following chain of inequalities among absolute moments in a
consequence of Lyapunov's inequality:
E|^|<(E|^|2)1/2<---<(E|^|n)1/n. (28)
Holder's Inequality. Let 1 < p < oo, 1 < q < oo, and (1/p) + (1/q) = 1. If
E|£|p < oo andE\rj\q < oo, thenE\£rj\ < oo and
E|£7l<(E|m1/p(EM«)1Al. (29)
If E |£|p = 0 or E\rj\q = 0, (29) follows immediately as for the Cauchy-
Bunyakovskii inequality (which is the special case p = q = 2 of Holder's
inequality).
Now let E|f |p > 0, E \rj\q > 0 and
C (E|^n1/P' * (EM*)1'*'
We apply the inequality
xayb < ax + by, (30)
which holds for positive x, y, a, b and a + b = 1, and follows immediately
from the concavity of the logarithm:
ln[flx + by] > a In x + b In y = In xfl/.
Then, putting x = |?|p, j; = |fjf|*, a = 1/p, b = 1/g, we find that
1^-1^ + -1^,
P Q
whence
Bm^-E\t\p+-mq=-+-= i.
P Q P Q
This establishes (29).
194
II. Mathematical Foundations of Probability Theory
Minkowski's Inequality.If E |£\p < oo, E \rj\p < op, 1 < p < oo, then we have
E|£ + n\p < ooand
(E|£ + *;|p)1/p < (E|£|p)1/P + (EM')1/p. (31)
We begin by establishing the following inequality: if a, b > 0 and p > 1,
then
^ + ^^2^(^ + ^ (32)
In fact, consider the function F(x) = (a + x)p - 2P" ^a" + xp). Then
F'(x) = pfc + *)p_1 - 2p-lpxp-\
and since p > 1, we have F'(a) = 0, F'(x) > 0 for x < a and F'(x) < 0 for
x > a. Therefore
F(b) < max F(x) = F(a) = 0,
from which (32) follows.
According to this inequality,
IS + rj\p < (|£| + \rj\y < 2p-\\t\p + M") (33)
and therefore if E |£|p < oo and E \rj\p < oo it follows that E |£ + rj\p < oo.
If p = 1, inequality (31) follows from (33).
Now suppose that p > 1. Take q > 1 so that (1/p) + (1/q) = 1. Then
1^ + ^^ = 1^ + ^1-1^ + ^1^^1^1-1^ + ^^^ + 1^11^ + ^^-1. (34)
Notice that (p — l)g = p. Consequently
E(|« + ijr1y = E|« + iy|"<oo>
and therefore by Holder's inequality
E(i£i i* + ^r1) < (Ei^ni/p(Ei^ + nfp-i>*y">
= (E|^n1/p(E|^ + »;n1/9<oo.
In the same way,
e(m i« + ^r1) < (Ei^ni/p(E^ + i,ni/fl.
Consequently, by (34),
E|£ + if|*£(E|£ + ij |p)1/9((E I £|p)1/p + <E | ij |p)1/p). (35)
If E | £ + r\ \p = 0, the desired inequality (31) is evident. Now let E | £ + r\ \p > 0.
Then we obtain
<E|£ + ^|p)1_(1/9) < (E|£|p)1/P + (EMp)1/p
from (35), and (31) follows since 1 - (1/q) = 1/p.
§6. Lebesgue Integral. Expectation
195
7. Let £ be a random variable for which E£ is defined. Then, by Property D,
the set function
Q(A)= ffdP, Ae3F, (36)
is well defined. Let us show that this function is countably additive.
First suppose that £ is nonnegative. IfAltA2t... are pairwise disjoint sets
from & and A = £ An, the corollary to Theorem 1 implies that
Q(A) = E(£ • IA) = E(£ • JZA]) = E(£ £ • /^)
If ^ is an arbitrary random variable for which E£ is defined, the countable
additivity of Q(A) follows from the representation
QG4) = Q+G4) - Q-\A), (37)
where
Q+(>1) = f C dP, Q~(A) = f r dP,
J A J A
together with the countable additivity for nonnegative random variables and
the fact that min(Q+(Q), Q~(Q)) < oo.
Thus if E^ is defined, the set function 0 = 0(A) is a signed measure—
a countably additive set function representable as 0 = Oi — 02» where at
least one of the measures Qi and O2 is finite.
We now show that 0 = Q04) has the following important property of
absolute continuity with respect to P:
if P(yl) = 0 then Q04) = O (AeP)
(this property is denoted by the abbreviation 0 < P).
To prove the sufficiency we consider nonnegative random variables. If
^ = Yi=ixk^Ak is a simple nonnegative random variable and P(A) = 0,
then
QG4) = E« • lA) = t xkP(Ak ^ A) = 0.
fc=i
If {<!;„}„ ^ 1 is a sequence of nonnegative simple functions such that £„ f £ > 0,
then the theorem on monotone convergence shows that
Q04) = E(£. IA) = \im E(tnIA) = Ot
since E(^n • IA) = 0 for all n > 1 and A with P(A) = 0.
Thus the Lebesgue integral Q(A) = J"A £ dP, considered as a function of
sets A e #", is a signed measure that is absolutely continuous with respect to
P (0 < P). It is quite remarkable that the converse is also valid.
196
II. Mathematical Foundations of Probability Theory
Radon-Nikodym Theorem. Let (Q, &) be a measurable space, \i a <r-finite
measure, and \ a signed measure (i.e., X — Xx — A2, where at least one of the
measures Ax and X2 is finite) which is absolutely continuous with respect to \l
Then there is an &:-measurable function f = f(co) with values inR = [ — oo, oo]
such that
%A) = f f(coMdco), Ae& (38)
Ja
The function f(co) is unique up to sets of fi-measure zero: if h = h(co) is
another ^-measurable function such that X(A) = \A h(co)ii(dco), Ae&, then
/i{£0:/(£0)^/2(£0)}=0.
If k is a measure, then f = f(coi) has its values inR+ = [0, oo].
Remark. The function f = f(co) in the representation (38) is called the
Radon-Nikodym derivative or the density of the measure X with respect to /*,
and denoted by dX/dfi or (idX/dfi)(co).
The Radon-Nikodym theorem, which we quote without proof, will play
a key role in the construction of conditional expectations (§7).
8. If ^ = Z?= i xi^At is a simple random variable,
E*«) = Z <Kx,)P04.) = Z 0(Xi)A^(x,.). (39)
In other words, in order to calculate the expectation of a function of the
(simple) random variable £ it is unnecessary to know the probability measure
P completely; it is enough to know the probability distribution P^ or, equiv-
alently, the distribution function F4 of ^.
The following important theorem generalizes this property.
Theorem 7 (Change of Variables in a Lebesgue Integral). Let (Q, &) and
(E, S) be measurable spaces and X — X(co) an &/^-measurable function with
values in E. Let P be a probability measure on (Q, &) and Px the probability
measure on (E, S) induced by X = X(co):
PX(A) = P{co: X(co) eA}, Ae S. (40)
Then
f g(x)Px(dx) = f g(X(a>))P(dco), AeS, (41)
J A JX-HA)
for every ^-measurable function g = g(x), xeE (in the sense that if one
integral exists, the other is well defined, and the two are equal).
Proof. Let A e S and g{x) = /B(x), where BeS. Then (41) becomes
PX(AB) = P(X~ \A) n X- \B)), (42)
§6. Lebesgue Integral. Expectation
197
which follows from (40) and the observation that X~l(A) n X~\B) =
X~\AnB).
It follows from (42) that (41) is valid for nonnegative simple functions
9 — 9(XX and therefore, by the monotone convergence theorem, also for all
nonnegative <f-measurable functions.
In the general case we need only represent g as g+ — g~. Then, since (41)
is valid for g+ and g~, if (for example) $A9+(x)Px(dx) < °°, we have
f g+(X(co))P(dco) < oo
Jx-ha)
also, and therefore the existence of \A g(x)Px(dx) implies the existence of
Jjr-iM>0(*(o>))P(<Mi
Corollary. Let (£, S) = (R, @(R)) and let £ = f(co) be a random variable with
probability distribution P%. Then ifg = g(x) is a Borel function and either of the
integrals \A g(x)P£dx) or J^- HA) g(£(co))P(dco) exists, we have
[g{x)Ps(dx)= f g(t(cD))P(dco).
J A J§-IM)
In particular, for A = R we obtain
E0(«fl>)) = f g{^d))P{doS) = f g(x)Ps(dx). (43)
Jii Jr
The measure P% can be uniquely reconstructed from the distribution
function F^ (Theorem 1 of §3). Hence the Lebesgue integral j"R g{x)P^{dx) is
often denoted by \R g(x)F£dx) and called a Lebesgue-Stieltjes integral
(with respect to the measure corresponding to the distribution function
Let us consider the case when F^(x) has a density f$(x\ i.e. let
*■«(*)= f Uy)dy, (44)
J-oo
where f^ = f^(x) is a nonnegative Borel function and the integral is a Lebesgue
integral with respect to Lebesgue measure on the set ( — oo, x] (see Remark 2
in Subsection 1). With the assumption of (44), formula (43) takes the form
E0(«fl>))= {" rt*Vfr)dx> (45)
J-oo
where the integral is the Lebesgue integral of the function ^(x)/^(x) with
respect to Lebesgue measure. In fact, if g{x) = /B(x), B e @(R\ the formula
becomes
P,(B)
= [ A(x)dx, Be@(R)\ (46)
198
II. Mathematical Foundations of Probability Theory
its correctness follows from Theorem 1 of §3 and the formula
F£b) - F£a) = { fjpOdx.
Ja
In the general case, the proof is the same as for Theorem 7.
9. Let us consider the special case of measurable spaces (Q, &) with a
measure /*, where Q = Clx x Q2» & — &\ ® ^2» ant* A* = A*i x /½ ls tne
direct product of measures /iL and \i2 0e-»tne measure on ^ such that
/i! x /i2(^ x B) = ndAMB), Ae&t, Bef2;
the existence of this measure follows from the proof of Theorem 8).
The following theorem plays the same role as the theorem on the reduction
of a double Riemann integral to an iterated integral.
Theorem 8 (Fubini's Theorem). Let £ = £(0^, o)2) be an &\®
^^-measurable function, integrable with respect to the measure fix x /i2:
I l«o>i, w2) I</(/*! x /i2) < 00. (47)
Then the integrals J^ £(0^, co2)/i1(^/co1) and Jn2 £(cou cQ2)fi2(dco2)
(1) are defined for all co j and o)2;
(2) are respectively 3F2- and 3F\-measurable functions with
/*2Jw2:J ^(^^¢02)1^1(^)=001 = 0,
/M^V J 1^(0)1,0)2)1^2(^2) = ooi = 0
(48)
and (3)
f (fl)l, o)2) rfO/i x /i2) = «o)lf co2)fi2(dco2) Li(rffi)i)
«/«1x02 ^fiiL^ J
= J J ^1, 0)2)/il(rfO)!) /i2(rfO)2).
(49)
Proof. We first show that ^(©2) = f(<»i. ^2) is .^-measurable with
respect to co2, for each co1 e Qla
Let Fe &± ® ^2 and £(0^, o)2) = /F(o)!, o)2). Let
^«1 = (w2 e Q2: (wi, o)2) e ^}
§6. Lebesgue Integral. Expectation
199
be the cross-section of F at cox, and let #^ = {Fe^: Fm e F2}. We must
show that #^ = & for every cox.
lfF = Ax BtAe#ri,Be^2,then
,,™ [B if coxe A,
{A x B)^ = \ x , ;
[0 ifco^A.
Hence rectangles with measurable sides belong to <&m. In addition, if
Fe&, then (F)m = F^, and if {F"}^, are sets in ^, then ((J F")ttl =
(J F^. It follows that #^ = &
Now let ^(coltco2) ^ 0. Then, since the function ^(coltco2) is ^-measurable
for each colt the integral J^ £(g>i, co2)fi2(dco2) is defined. Let us show that this
integral is an ^-measurable function and
£(«!, co2)/*2(dco2) L(dco0 = I ^i,fi)2)%ix/i2). (50)
Let us suppose that £(cot, co2) = IAx^coi, co2)t A e < Be^2. Then since
IaxiA®u w2> = ^(wj/^coz), we have
'iixn(fl)i, ^2)/^2(^2) = /^0 1^2)112(^2) (51)
Jn2 Jn2
and consequently the integral on the left of (51) is an ^-measurable function.
Now let £(©!, co2) = /F(w!, co2)> F e ^ = &x ® «^2 • Le* us show that the
integral f(u>x) = ^2 IF(colt co2)ti2(da)2)is ^-measurable. For this purpose we
put # = {Fe^:/(fl)i) is ^-measurable}. According to what has been
proved, the set A x B belongs to<&(Ae&rltBe #2)and therefore the algebra
s? consisting of finite sums of disjoint sets of this form also belongs to #. It
follows from the monotone convergence theorem that ^ is a monotonic
class, # = /*(#). Therefore, because of the inclusions ^ = ^ = ^ and
Theorem 1 of §2, we have & = o(sf) = \l{s0) c /*(#) = ^cf, i.e. ^ = ^.
Finally, if £(g>i, co2) is an arbitrary nonnegative ^-measurable function,
the ^-measurability of the integral Jn2 £(fl>lt ca2)ti2(dco) follows from the
monotone convergence theorem and Theorem 2 of §4.
Let us now show that the measure \i = \ix x \i2 defined on & = #"2 ® &2 •>
with the property (jxx x fi2)(A x B) = fix(A) • fi2(B), Ae&^Be&i, actually
exists and is unique.
For F e & we put
/i(F) = f f IF(o(co2)ti2(dca2) WiidtOi).
As we have shown, the inner integral is an ^-measurable function, and
consequently the set function /*(F) is actually defined for F e J* It is clear
200
II. Mathematical Foundations of Probability Theory
that if F = A x B, then /*(/4 x B) = /^04)/^(3). Now let {Fn} be disjoint
sets from SF. Then
= I Z I 'f- (^2)/^2(^2)/^(^)
=z r rr ^^^^^=^^-
i.e. /i is a (tr-finite) measure on SF.
It follows from Caratheodory's theorem that this measure \i is the unique
measure with the property that fi(A x B) = Hi(A)fi2(B).
We can now establish (50). If £,(<£>u co2) = /^xb^i. <a2), A e &l9 Be^2,
then
/ytxB(w1,co2M^i x /½) = (/^1 x /^)(4 x E), (52)
and since IAitdfQlt <a2) = /,4(001)/3(^2).we have
= £ L(coi) £ IB(colt co2)/i2(^2)l/ii(^i) = lii(A)ti2(B). (53)
But, by the definition of/iL x /i2,
0*i x /^)04 x B) = iiM)Hi{B).
Hence it follows from (52) and (53) that (50) is valid for £(cou co2) =
/>1xb(C0i,C02)-
Now let £(co!, co2) = /f(<^i» w2), FeJf The set function
MF) = f /F(co1} co2y(/*i x /i2), F e ^
is evidently a cr-fmite measure. It is also easily verified that the set function
<F) = J J *f(g>i, co2)ti2(dco2) L(^fi)i)
is a cr-fmite measure. It will be shown below that X and v coincide on sets of
the form F = A x B, and therefore on the algebra !F. Hence it follows by
Caratheodory's theorem that X and v coincide for all F e ^
§6. Lebesgue Integral. Expectation
201
We turn now to the proof of the full conclusion of Fubini's theorem.
By (47),
f+ (fl>lf o)2) </(/*! x /i2) < oo, <T(o)i, w2y(/ii X fi2) < 00.
Jsii x n2 Jciy x n2
By what has already been proved, the integral Jn2 £+(o)i, co2)fi2(dco2) is an
^i-measurable function of o)x and
^(0)1,0)2)^2(^2)^1(^1)=1 ^+(0)i,0)2)^(/ii x/i2)< 00.
Consequently by Problem 4 (see also Property J in Subsection 2)
£+(o)i, o)2)/i2(do)2) < 00 (/ira.s.).
Jn2
Jn2
In the same way
Jn2
and therefore
£ (o)i, €o2)/JLi(d(Oi) < 00 (/i!-a.s.),
Jn2
|fto)lf o)2)|/i2(^/o)2) < 00 (/ira.s.).
Jn2
It is clear that, except on a set Jf of /ij -measure zero,
£(0^,0)2)/^0)2)= ^(0)^0)2)/^2(^/0)2)- ^"(0)1,0)2)^2(^0)2).
•/«2 Jfi2 ^¾
(54)
Taking the integrals to be zero for cot e Jf9 we may suppose that (54) holds
for all o) e Qi. Then, integrating (54) with respect to \ix and using (50), we
obtain
J J £(0)1, o)2)/i2(rfo)2) L(^O)i) = J J £+(o)i,o)2)/i2(rfo)2)Ui(rfo)i)
-J J £"(0)i,0)2)/i2(rf0)2)Ui(rf0)i)
£ + (o)i, o)2)^i x /i2)
«/n,xn2
£-(0)1, o)2)rf(/ii x /i2)
«/«1 x n2
£(0)!, 0)2M/il X /i2).
Jcii x n2
202
II. Mathematical Foundations of Probability Theory
Similarly we can establish the first equation in (48) and the equation
I ^l5 co2) dUit x n2) = I I £(€0» cozlHiidcot) L^)-
This completes the proof of the theorem.
Corollary. If J"^ [Jft21^(^, co2)|ju2(da>2)]jUi (dcoi) < °°» the conclusion of
Fubini's theorem is still valid.
In fact, under this hypothesis (47) follows from (50), and consequently
the conclusions of Fubini's theorem hold.
Example. Let (<!;, n) be a pair of random variables whose distribution has a
two-dimensional density /^(x, y\ i.e.
P((& r,) e B) = J /^(x, y) dx dy, B e @(R2),
where /^(x, 3;) is a nonnegative ^(/?2)-measurable function, and the integral
is a Lebesgue integral with respect to two-dimensional Lebesgue measure.
Let us show that the one-dimensional distributions for £ and n have
densities/^(x) and/^Cv), and furthermore
/»00
/<to = fefix,y)dy
J -00
and (55)
/,00= r/ateJOdx.
In fact, if X e ^(/?), then by Fubini's theorem
PtfeA) = P((^)eX x jR) = £ Mx,y)dxdy = |jj"/^x, jO 4>1 dx.
This establishes both the existence of a density for the probability distribution
of £ and the first formula in (55). The second formula is established similarly.
According to the theorem in §5, a necessary and sufficient condition that
£ and n are independent is that
F,n{xiy) = F£x)Fn{y\ (x,y)eR2.
Let us show that when there is a two-dimensional density fitfx, y), the
variables £ and n are independent if and only if
fefx>y) = ffr)Uy) (56)
(where the equation is to be understood in the sense of holding almost
surely with respect to two-dimensional Lebesgue measure).
§6. Lebesgue Integral. Expectation
203
In fact, in (56) holds, then by Fubini's theorem
*W*. y) = f /ft("> v) dudv=[ /«(ii)/,(») du dv
»'(-oo,jc]x(-oo,y] »'(-oo,Jc]x(-oo,y]
= f /«(«)«fo
J(-oo,x]
and consequently £ and 17 are independent.
Conversely, if they are independent and have a density/5„(x, y), then again
by Fubini's theorem
/ft(", v) du dv =
J(-co,x]x(-co,y]
= 1 fs(u)f,(v)dudv.
J(—oo,x]x(—oo,y]
It follows that
J /ft(*. jO *c dy = J f£x)fjiy) dx dy
for every Be@ (R2), and it is easily deduced from Property I that (56) holds.
10. In this subsection we discuss the relation between the Lebesgue and
Riemann integrals.
We first observe that the construction of the Lebesgue integral is
independent of the measurable space (Q, &) on which the integrands are given.
On the other hand, the Riemann integral is not defined on abstract spaces in
general, and for Q, = R" it is defined sequentially: first for R1, and then
extended, with corresponding changes, to the case n > 1.
We emphasize that the constructions of the Riemann and Lebesgue
integrals are based on different ideas. The first step in the construction of the
Riemann integral is to group the points xeR1 according to their distances
along the x axis. On the other hand, in Lebesgue's construction (for Q, = R1)
the points xeR1 are grouped according to a different principle: by the
distances between the values of the integrand. It is a consequence of these
different approaches that the Riemann approximating sums have limits only
for "mildly" discontinuous functions, whereas the Lebesgue sums converge
to limits for a much wider class of functions.
Let us recall the definition of the Riemann-Stieltjes integral. Let G = G(x)
be a generalized distribution function on R (see subsection 2 of §3) and [i its
corresponding Lebesgue-Stieltjes measure, and let g = g(x) be a bounded
function that vanishes outside [a, b].
(f Uv)dv) = Fi(.x)F,(y)
(f fM<t»)(\ /,(«>) *)
\J<-00.*1 / \J(-CD.>] /
204
II. Mathematical Foundations of Probability Theory
Consider a decomposition 0* — {x0,..., x„},
a = x0 < *! < • • ■ < x„ = b,
of [a, fc], and form the upper and lower sums
Z = Z UGixd - G0cH)l Z = Z &[G(*i) " G(*«-ifl
9 i=l "^ i=l _
where
gt = sup 5(3;), gt = inf 5(3;).
Define simple functions g&(x) and g&(x) by taking
£te(x) = 9i, 9 Ax) = &,
onxj-! < x < xif and define g&(a) = g^(a) = g{a). It is clear that then
Z = (L-S) fW(x)G(rfx)
& Ja
and
5-^1
<te(x)G(rfx).
Now let {^k} be a sequence of decompositions such that 0>k ^ ^k+ x. Then
q9x > g&>2 > • • • > g > • • • > g&2 > g&l,
and if \g(x) \ < C we have, by the dominated convergence theorem,
lim Z = (L-S) f 9(x)G(dx\
\ (57)
lim Z = (L-S) f g(x)G(dx),
fc-»oo &k J a
where g{x) = limk g^fx\ q(x) = limk^fc(x).
If the limits limk Z^fc anc* nmfc Z^k are /*mte flM^ egwa/, and t^«V common
value is independent of the sequence~of decompositions {^k}, we say that g = g(x)
is Riemann-Stieltjes integrable, and the common value of the limits is denoted
by
(R-S) Cg(x
Ja
l(x)G(dx). (58)
When G(x) = x, the integral is called a Riemann integral and denoted by
(R)jbg(x)dx.
§6. Lebesgue Integral. Expectation
205
Now let (L-S) $g(x)G(dx) be the corresponding Lebesgue-Stieltjes
integral (see Remark 2 in Subsection 2).
Theorem 9.1fg = g(x) is continuous on [a, fc], it is Riemann-Stieltjes
integrable and
(R-S) jbg(x)G(dx) = (L-S) ^g(x)G(dx). (59)
Proof. Since g(x) is continuous, we have g(x) = g(x) = g(x). Hence by (57)
lim^^ Y,3i>k — nmk-oo Z^ic- Consequently g = g(x) is Riemann-Stieltjes
integral (again by (57))r
Let us consider in more detail the question of the correspondence between
the Riemann and Lebesgue integrals for the case of Lebesgue measure on the
line R.
Theorem 10. Let g(x) be a bounded function on [a, b].
(a) The function g = g(x) is Riemann integrable on [a, b~\ if and only if it is
continuous almost everywhere (with respect to Lebesgue measure 1 on
(b) VQ — 9(x) fc Riemann integrable, it is Lebesgue integrable and
(R) Jrtx) dx = (L) fg(x)%dx). (60)
Proof, (a) Let g = g(x) be Riemann integrable. Then, by (57),
(L)f g{x)X{dx) = {L)\ gix)k{dx).
Ja Ja
But g(x) < g(x) < g(x\ and hence by Property H
9(x) = 9(x) = g(x) a-a.s.), (61)
from which it is easy to see that g(x) is continuous almost everywhere (with
respect to X).
Conversely, let g = g(x) be continuous almost everywhere (with respect
to 1). Then (61) is satisfied and consequently g(x) differs from the (Borel)
measurable function g(x) only on a set jV with \Jf) = 0. But then
{x: g(x) < c} = {x: g(x) < c} n JP + {x: g(x) < c} n Jf
= {x: g(x) <c} r\JT + {x:g(x) < c} nJT
It is clear that the set {x: g(x) < c} n JV eW&a, fc]), and that
{x: g(x) < c} n Jf
206
II. Mathematical Foundations of Probability Theory
is a subset of ^ having Lebesgue measure 1 equal to zero and therefore also
belonging to ^(|>, &)]. Therefore g(x) is W([_a, fc])-measurable and, as a
bounded function, is Lebesgue integrable. Therefore by Property G,
(L) fg(x)X(dx) = (L) fg(x)X(dx) = (L) fg(x)Z(dx),
Ja *fa * a
which completes the proof of (a).
(b) If g = g(x) is Riemann integrable, then according to (a) it is continuous
(X-a..s.). It was shown above than then g(x) is Lebesgue integrable and its
Riemann and Lebesgue integrals are equal.
This completes the proof of the theorem.
Remark. Let [i be a Lebesgue-Stieltjes measure on ^([a, bj). Let ^([a, bj)
be the system consisting of those subsets A ^ [a, b] for which there are sets
A and B in ^(|>, 6]) such that A c A c B and n(B\A) = 0. Let ji be an
extension of ;u to W^a, bj) (ju(A) = ju(X) for A such that A ^ A ^ B and
H(B\A) = 0). Then the conclusion of the theorem remains valid if we
consider p. instead of Lebesgue measure X, and the Riemann-Stieltjes and
Lebesgue-Stieltjes measures with respect to /Z instead of the Riemann and
Lebesgue integrals.
11. In this part we present a useful theorem on integration by parts for the
Lebesgue-Stieltjes integral.
Let two generalized distribution functions F = F(x) and G = G(x) be
given on (R, @(R)).
Theorem 11. The following formulas are valid for all real a and b,a < b:
F(b)G(b) - F(a)G(a) = f F(s-)dG(s) + (G(s)dF(s\ (62)
Ja Ja
or equivalently
F(b)G(b) - F(a)G(a) = f F(s-)dG(s) + f G(s-)dF(s)
Ja Ja
+ £ AF(s)AG(s), (63)
a<s£b
wtere F(s-) = lim,Ts F(t), AF(s) = F(s) - F(s-).
Remark 1. Formula (62) can be written symbolically in "differential" form
d(FG) = F_dG + G dF. (64)
§6. Lebesgue Integral. Expectation
207
Remark 2. The conclusion of the theorem remains valid for functions F and G
of bounded variation on [o, b]. (Every such function that is continuous on
the right and has limits on the left can be represented as the difference of two
monotone nondecreasing functions.)
Proof. We first recall that in accordance with Subsection 1 an integral
J* (•) means J(fltfc] (•). Then (see formula (2) in §3)
(F(b) - F(a))(G(b) - G(a)) = |V(s) • pG(0-
LetF x G denote the direct product of the measures corresponding to F and
G. Then by Fubini's theorem
(FQ>) - F(a))(G(b) - G(a)) = f d(F x G\s, t)
= f /(.*,)(*, 0 d(F x G\Si t) + f /{5<t}(s, t) d{F x G){s, t)
J(a, b] x (a. fc] J (a, b] x (a, b]
= f (G(s) - G(a))dF(s) + f (F(f-) - F(a)) dG(t)
J{a, bf J (a, fc]
= fG(s)dF(s) + jV(S-)rfG(s) - G(a\F(b) - F(a)) - F(a\G(b) - G(a)\
(65)
where IA is the indicator of the set A.
Formula (62) follows immediately from (65). In turn, (63) follows from
(62) if we observe that
f (G(s) - G(s-))dF(s) = £ AG(s) • AF(s). (66)
Ja a<s<b
Corollary 1. If F(x) and G(x) are distribution functions, then
F(x)G(x) = f F(s-)dG(s)+ f G(s)dF(s). (67)
J — oo J— 00
If also
F(x) = f f(s)ds,
J-co
then
F(x)G(x) = (* F(s)rfG(s)+ f G(s)f(s)ds. (68)
•/—00 ^-00
208
II. Mathematical Foundations of Probability Theory
Corollary 2. Let £ be a random variable with distribution function F(x) and
E|£|n< oo. Then
f x" dF(x) = n ("x^^l - F(x)] dx, (69)
Jo Jo
f |x|" dF(x) = - Px" dF(-x) = n Px""lF{-x) dx (70)
J-oo Jo Jo
E |£|" = f °° |x|" ^(x) = n (^^[1 - F(x) + F(-x)] rfx. (71)
J-oo Jo
To prove (69) we observe that
f xndF{x)=-[ xnd(\-F{x))
Jo Jo
= - b\\ - F{b)) +n[xn~ \l - F(x)) dx. (72)
Let us show that since E | £ |" < oo,
b\l - F(b) + F(-b)) < bnP(| £ | > fc) - 0. (73)
In fact,
Em"= I f 1*1" dF(x) < oo
k=l Jfc-1
and therefore
But
£fc+l Jk-l
\x |" dF(x) -> 0, M -> 00.
X f ixrjF(x)>&"P(Ki>fc),
:fc+l ^k-1
k>
which establishes (73).
Taking the limit as b -> oo in (72), we obtain (69).
Formula (70) is proved similarly, and (71) follows from (69) and (70).
12. Let /4(r), t > 0, be a function of locally bounded variation (i.e., of bounded
variation on each finite interval [a, fc]), which is continuous on the right and
has limits on the left. Consider the equation
Zf= 1 + \zs-dA(s), (74)
Jo
§6. Lebesgue Integral. Expectation
209
which can be written in differential form as
dZ = Z_ dA, Z0 = 1. (75)
The formula that we have proved for integration by parts lets us solve (74)
explicitly in the class of functions of bounded variation.
We introduce the function
St{A) = em~M0) J! (1 + Ay4(s))e-A*s), (76)
0<s<f
where AA(s) = A(s) - A(s-) for s > 0, and AA(0) = 0.
The function A(s\ 0 < s < r, has bounded variation and therefore has at
most a countable number of discontinuities, and so the series Y,o<s<t I A^(s)l
converges. It follows that
[] (1 + AA(s))e-*Als)
0<s<t
is a function of locally bounded variation.
If Ac(t) = A(t) — YjO<s<t &A(s) is the continuous component of A(t\
we can rewrite (76) in the form
St{A) = eAC{t)-AC{0) n (1 + ^00)- (77)
0<s<f
Let us write
F(t) = eAC(t)-ACi0\ G(t) =[](! + A^(s»-
Then by (62)
St{A) = F(r)G(t) = 1 + Cf(s) dG(s) + Cg(s-) dF(s)
Jo Jo
= 1 + X F(s)G(s-)AA(s) + f G(s-)F(s)<L4c(s)
0<s<r ^0
= 1 + fVs-(X) dA(s).
Jo
Therefore £t{A\ t > 0, is a (locally bounded) solution of (74). Let us show that
this is the only locally bounded solution.
Suppose that there are two such solutions and let Y = Y(t), t > 0, be their
difference. Then
7(0 = ('Y(s-)dA(s).
Jo
Jo
Put
T = inf{r > 0: Y(t) * 0},
where we take T = oo if Y(t) = 0 for t > 0.
210
II. Mathematical Foundations of Probability Theory
Since A(t) is a function of locally bounded variation, there are two
generalized distribution functions Ax(t) and A2(t) such that A(t) = Ai(t) —
A2(t). If we suppose that T < oo, we can find a finite V such that
lA^T) + A2(T)1 - {AX{T) + A2(T)2 < i
Then it follows from the equation
7(0= JV(s-)^(s), r>T,
that
sup|7(0l<isup|7(0l
t<,t' t£T'
and since sup| 7(01 < oo, we have 7(0 = 0 for T < t < V, contradicting
the assumption that T < oo.
Thus we have proved the following theorem.
Theorem 12. There is a unique locally bounded solution o/(74), and it is given
by (76).
13. Problems
1. Establish the representation (6).
2. Prove the following extension of Property E. Let £ and i/ be random variables for
which E£ and Erj are defined and the sum E£ + Et] is meaningful (does not have the
form oo — oo or — oo + oo). Then
Etf + rj) = E£ + E*
3. Generalize Property G by showing that if £ = i/ (a.s.) and E£ exists, then Erj exists and
E£ = E»y.
4. Let £ be an extended random variable, \i a a-finite measure, and \a \£\diL < oo.
Show that |£| < oo 0*-a.s.) (cf. Property J).
5. Let fi be a a-finite measure, £ and i/ extended random variables for which E£ and Erj
are defined. If $A £ dP < $A t\ dP for all A e ^ then f < 17 (/z-a.s.). (Cf. Property I.)
6. Let £ and 17 be independent nonnegative random variables. Show that E£i/ = E£ • Ei/.
7. Using Fatou's lemma, show that
P(lim An) < lim P(An\ P(fim" 4,) > HE PU„).
8. Find an example to show that in general it is impossible to weaken the hypothesis
"|£J < 17, Ei/ < 00" in the dominated convergence theorem.
§6. Lebesgue Integral. Expectation
211
9. Find an example to show that in general the hypothesis "£„ <n,En > — oo" in
Fatou's lemma cannot be omitted.
10. Prove the following variants of Fatou's lemma. Let the family {^}„>1 of random
variables be uniformly integrable and let E Hm £„ exist. Then
nmE^<Enm^.
Let £„ < n„t n > 1, where the family {^}n^.1 is uniformly integrable and nn
converges a.s. (or only in probability—see §10 below) to a random variable n. Then
EE^<EIE^
11. Dirichlet's function
J1, x irrational,
~ (0, x rational,
is defined on [0,1], Lebesgue integrable, but not Riemann integrable. Why?
12. Find an example of a sequence of Riemann integrable functions {f„)n7, i, defined on
[0,1], such that \f„\ < 1,/, -*f almost everywhere (with Lebesgue measure), but
f is not Riemann integrable.
13. Let (atj; ij > 1) be a sequence of real numbers such that £It/ |aft;| < oo. Deduce
from Fubinfs theorem that
,?/*=?(? flyH(?4 (78)
14. Find an example of a sequence (ai}\it j > 1) for which £i,jlavl = oo and the equation
in (78) does not hold.
15. Starting from simple functions and using the theorem on taking limits under the
Lebesgue integral sign, prove the following result on integration by substitution.
Let h = h{y) be a nondecreasing continuously differentiable function on [a, £?],
and let/(;x) be (Lebesgue) integrable on \h(a\ h(by\. Then the function f{h{y))h'(y)
is integrable on [a, b] and
(mf(x)dx= \bf(h{yW{y)dy.
•>h(a) Ja
16. Prove formula (70).
17. Let £, £i, £2i • • • to nonnegative integrable random variables such that E^„ -> E£ and
P(f - &, > e) -> 0 for every e > 0. Show that then E \£„ - f | -> 0, n -> oo.
18. Let ^, nt £ and £„t n„, £„, n > 1, be random variables such that
£,-£ tlm*>tl. C^C, */«<£,<£«, ">1,
EC^EC, Enn^Ent
and the expectations E£, En, E£ are finite. Show that then E£„ -> E£ (Pratfs lemma).
If also nn <, 0 < £, then E|£, - f | - 0,
Deduce that if £, £ & E|fJ -> E|f | and E|f | < oo, then E|£, - £| -> 0.
212
II. Mathematical Foundations of Probability Theory
§7. Conditional Probabilities and Conditional
Expectations with Respect to a <r-Algebra
1. Let (£2, ^ P) be a probability space, and let A eg? be an event such that
P(/4) > 0. As for finite probability spaces, the conditional probability ofB with
respect to A (denoted by P(B\A)) means P(BA)/P(A\ and the conditional
probability of B with respect to the finite or countable decomposition S> =
{DUD2...} with POD,-) > 0, i > 1 (denoted by P(B | S>)) is the random variable
equal to P(J31D{) for co e Dl3 i > 1:
P(B|0)= X P(B|D,)/0|.M-
i> 1
In a similar way, if ^ is a random variable for which E^ is defined, the
conditional expectation of £ with respect to the event A with P(A) > 0 (denoted
by E(£|/i)) is E(ZIA)/P(A) (cf. (1.8.10).
The random variable P(B \ S>) is evidently measurable with respect to the
cr-algebra ^ = o{3)), and is consequently also denoted by P(B\<g) (see §8 of
Chapter I).
However, in probability theory we may have to consider conditional
probabilities with respect to events whose probabilities are zero.
Consider, for example, the following experiment. Let ^ be a random
variable that is uniformly distributed on [0, 1]. If £ = x, toss a coin for which
the probability of head is x, and the probability of tail is 1 — x. Let v be the
number of heads in n independent tosses of this coin. What is the "conditional
probability P(v = k\£ = x)"? Since P(£ = x) = 0, the conditional
probability P(v = k\£ = x) is undefined, although it is intuitively plausible that
"it ought to be Cknx\\ - x)"-\"
Let us now give a general definition of conditional expectation (and, in
particular, of conditional probability) with respect to a cr-algebra ^, ^ ^ ^
and compare it with the definition given in §8 of Chapter I for finite probability
spaces.
2. Let (Q, &, P) be a probability space, ^ a cr-algebra, ^ c ^ (eg is a o-
subalgebra of SF\ and £ = £(cd) a random variable. Recall that, according to
§6, the expectation E £ was defined in two stages: first for a nonnegative random
variable ^, then in the general case by
Edj = Edj+ -E<T,
and only under the assumption that
min(E<r,Edj+)< <*).
A similar two-stage construction is also used to define conditional
expectations E(dj|^).
§7. Conditional Probabilities and Expectations with Respect to a a-Algcbra 213
Definition 1.
(1) The conditional expectation of a nonnegative random variable £ with
respect to the o-algebra <& is a nonnegative extended random variable,
denoted by E(^|^) or E(£|^)(co), such that
(a) E(£|«0 is ^-measurable;
(b) for every A e ^
J\^P = J*E(<^)</P. (1)
(2) The conditional expectation E(£\&\ or E(^|^)(co), of any random
variable £ with respect to the o-algebra <&, is considered to be defined if
min(E(<T|^),E(<ri^))<°o,
P-a.s., and it is given by the formula
E(<^) = E(<ri^)-E(rin
where, on the set (of probability zero) of sample points for which E(^+ | ^)
= E(c~ |^) = oo, the difference E(^+\<&) — E(£~ |^) is given an arbitrary
value, for example zero.
We begin by showing that, for nonnegative random variables, E(£|^)
actually exists. By (6.36) the set function
Q04)
= f^P, Xe^, (2)
is a measure on (£2, ^), and is absolutely continuous with respect to P
(considered on (£2, ^), ^ Q &). Therefore (by the Radon-Nikodym theorem)
there is a nonnegative ^-measurable extended random variable E(^|^) such
that
0(,4) = Je(£|«0</P. (3)
Then (1) follows from (2) and (3).
Remark 1. In accordance with the Radon-Nikodym theorem, the
conditional expectation E(^|^) is defined only up to sets of P-measure zero.
In other words, E(^|^) can be taken to be any ^-measurable function f(co)
for which Q(A) = $Af(a))dP, A e<& (a "variant" of the conditional
expectation).
Let us observe that, in accordance with the remark on the Radon-
Nikodym theorem,
E(<^) = ^(o>), (4)
214
II. Mathematical Foundations of Probability Theory
i.e. the conditional expectation is just the derivative of the Radon-Nikodym
measure 0 with respect to P (considered on (£2, ^)).
Remark 2. In connection with (1), we observe that we cannot in general put
E(£|^) = & since £ is not necessarily ^-measurable.
Remark 3. Suppose that ^ is a random variable for which E£ does not exist.
Then E(^ | ^) may be definable as a ^-measurable function for which (1) holds.
This is usually just what happens. Our definition E(£\&) = E(£+\<&)—
E(^~|^) has the advantage that for the trivial cr-algebra & = {0, £2} it
reduces to the definition of E^ but does not presuppose the existence of E£
(For example, ifc is a random variable with E^+ = oo,E^~ = oo,and^ = «^;
then E^ is not defined, but in terms of Definition 1, E(^ \<&) exists and is simply
£ = <r - r.
Remark 4. Let the random variable £ have a conditional expectation E(^|^)
with respect to the cr-algebra ^. The conditional variance (denoted by V(£|^)
or V(^|^)(a)) of £ is the random variable
V(^|^) = E[(^-E(^|^))2|^].
(Cf. the definition of the conditional variance V(£\&>) of £ with respect to a
decomposition ®, as given in Problem 2, §8, Chapter I.)
Definition 2. Let Be <F. The conditional expectation E(/B|^) is denoted by
P{B | ^), or P(B | &)(co\ and is called the conditional probability of the event B
with respect to the cr-algebra &, & Q !F.
It follows from Definitions 1 and 2 that, for a given Be^ P(B\&) is a
random variable such that
(a) P(B\&) is ^-measurable,
(b) P(AnB)= (p(B\&)dP (5)
for every Ae<&.
Definition 3. Let ^ be a random variable and <&n the cr-algebra generated by
a random element n. Then E(£|^n), if defined, means E(^|i? or E(£|n)(cD\
and is called the conditional expectation of£ with respect to n.
The conditional probability P(£|^„) is denoted by P{B\r\) or P(B\n)(co),
and is called the conditional probability ofB with respect to r\.
3. Let us show that the definition of E(^ |^) given here agrees with the
definition of conditional expectation in §8 of Chapter I.
§7. Conditional Probabilities and Expectations with Respect to a a-Algcbra 215
Let 3) = {£>!, £>2,...} be a finite or countable decomposition with atoms
£>,- with respect to the probability P (i.e. P(£>,) > 0, and if A Q £>., then
either P(A) = 0 or P(D\A) = 0).
Theorem 1. If & = a(3>) and £> is a random variable for which E£ is defined,
then
E{£\9) = E(t\Dt) (P-a.s. on £>,) (6)
or equivalently
E«W = T^J (P-a.s.onDd.
(The notation "f = n (P-a.s. on A)" or
"£ = rj(A; P-a.s.)" means that P(A n {£ ^ 17}) = 0.)
Proof. According to Lemma 3 of §4, E(£\&) = K{ on £>;, where /Cf are
constants. But
\ ZdP= f E(^)dP = KiP(Di\
JDi JD{
whence
This completes the proof of the theorem.
Consequently the concept of the conditional expectation E(£\@) with
respect to a finite decomposition B = {D\,..., £>„}, as introduced in
Chapter I, is a special case of the concept of conditional expectation with
respect to the cr-algebra ^ = a(D).
4. Properties of conditional expectations. (We shall suppose that the
expectations are defined for all the random variables that we consider and that
A*. If C is a constant and £ = C (a.s.), then E(£\&) = C (a.s.).
B*. If£ < n (a.s.) then E(£\&) < E(n\<0) (a.s.).
C*. |E(£|^|<E(m|^(a.s.).
D*. If a, b are constants and aE£ + bEn is defined, then
E(a£ + bri\<Z) = aE(£\<g) + bE(n \ <$) (a.s.).
E*. Let J5* = {(p, Q} be the trivial a-algebra. Then
E(£|^) = E£ (a.s.).
216
II. Mathematical Foundations of Probability Theory
F*. E(£|^) = £(fl.s.).
G*. E(E(£|«0) = E£.
H*. If <3y <=&2then
ElE(t\<Z2)\$l-]=E(t\<Zl) (a.s.).
I*. If <3X 2^2 then
E[E(^|^2)|^1) = E(^|^2) (a.s.).
J*. Let a random variable £ for which E£ is defined be independent of the
a-algebra <& (i.e., independent of IB, Be <&). Then
E(£|?) = E£ (a.s.).
K*. Let n be a <&-measurable random variable, E|^| < oo and E|^| < oo.
Then
E(fr|0) = *E(£|9) (^).
Let us establish these properties.
A*. A constant function is measurable with respect to 0. Therefore we need
only verify that
w.
CdP, Ae<$.
But, by the hypothesis t; = C (a.s.) and Property G of §6, this equation
is obviously satisfied.
B*. If f < n (a.s.), then by Property B of §6
f £</P< fndP, Ae&,
and therefore
(E(£\&)dP< fE(i7|9)£fP, /4e^.
The required inequality now follows from Property I (§6).
C*. This follows from the preceding property if we observe that —1^|
<£<ia
D*. If A e & then by Problem 2 of §6,
f (flf + to?) rfP = f o£ dP + f to? rfP = f <£(£|9) rfP
f bE(n\&)dP = f [flE(^|S0 + «0?|9)] ^P»
+
which establishes D*.
§7. Conditional Probabilities and Expectations with Respect to a a-Algcbra 217
E*. This property follows from the remark that E^ is an ^.-measurable
function and the evident fact that if A = Q or A = 0 then
J>P=I;
E£dP.
F*. Since £ if .^"-measurable and
f £dP = f £</P, Ae^
we have E(f |^) = £ (a.s.).
G*. This follows from E* and H* by taking <3X = {0, Q} and &2 = <$.
H*. Let ,4 e^; then
JEMVJdP-fzdP.
J A
Since <3X c ^2> we have Ae<g2 and therefore
J EDEtflSa)!*!] ^P = J E(£|02) </P = J £ dP.
Consequently, when Ae<gu
JEWVJdP- JE[E(^2)\^JdP
and by Property I (§6) and Problem 5 (§6)
E(^|^1) = E[E(^|^2)|^1] (a.s.)-
I*. If A e^, then by the definition of E[E(f 1^)1^]
JeCE^I^I^J^P = J"E(£|02)£fP.
The function E(£|^2) is ^-measurable and, since ^2 c #lf also
^-measurable. It follows that E(£\&2) is a variant of the expectation
E[E(£|^2)|^!], which proves Property I*.
J*. Since E^ is a ^-measurable function, we have only to verify that
I*-I
E£rfP,
i.e. that E[^ • /B] = E^ • EIB. If E | ^| < oo, this follows immediately from
Theorem 6 of §6. The general case can be reduced to this by applying
Problem 6 of §6.
The proof of Property K* will be given a little later; it depends on
conclusion (a) of the following theorem.
218
II. Mathematical Foundations of Probability Theory
Theorem 2 (On Taking Limits Under the Expectation Sign). Let {£n}n^ 1
be a sequence of extended random variables.
(a) If \£„\ < n, En < oo and £„ -> £ (a.s.\ then
E«J0)->E(£|0) (a.s.)
and
E(|&, -£l|S)->0 (a.s.).
(b) // £, > 17, En > - oo and £„H («•«•). ^"
E«JSOtE«W (**).
(c) If £„ < n, En < oo, and <!;„ I £ (a.s.), rfcen
E«JS01E«|S0 M-)-
(d) // <!;„ > n, En > — oo, then
E(!im<^)<!imE(£n|^) (a.s.).
(e) // £„ < n, En < oo, then
fim" E(£„| ^) < E(hm £„ |^) (a.s.).
(0 //&,=> Often
E(Z^|S0 = IE(^|S0 (**)•
Proof, (a) Let £„ = supm>„ \£m — £\. Since <!;„-»> ^ (a.s.), we have C„J,0
(a.s.). The expectations E£„ and E^ are finite; therefore by Properties D*
and C* (a.s.)
ie«j9) - e(^i^)| = iek. - ^isoi < E(i^ - t\m < E(un
Since E(Cn+ A&) < E(C„W (a.s.), the limit h = lim„ E(CJ^) exists (a.s.). Then
0 < \hdP < f E(CJ^) </P = f C </P -► 0, n - oo,
where the last statement follows from the dominated convergence theorem,
since 0 < £, < 2m, En < oo. Consequently Jn h dP = 0 and then Jz = 0
(a.s.) by Property H.
(b) First let n = 0. Since E(£n|«f) < E(^n+1|^) (a.s.) the limit C(«) =
lim„ E(£„\<g) exists (a.s.). Then by the equation
UndP= \E{U&)dP, AeV,
J A JA
and the theorem on monotone convergence,
\^dP = fc</P, Ae&.
J A J A
Consequently ^ = C (as.) by Property I and Problem 5 of §6.
§7. Conditional Probabilities and Expectations with Respect to a a-Algcbra 219
For the proof in the general case, we observe that 0 < & T £+> and bv
what has been proved,
EiCmmCm (a.s.). (7)
But 0 < C < <T, E<T < oo, and therefore by (a)
E(£T|0)->E(ri?),
which, with (7), proves (b).
Conclusion (c) follows from (b).
(d) Let Cn = infma.„ £m; then £„ | £ where £ = lim <!;„. According to (b),
E(CJ^) T E(CI^) (a.s.). Therefore (a.s.) E(lim £,|?) = E(C|^) = lim„ E(CJ^)
= !imE(CJ^)<!imE(<U^).
Conclusion (e) follows from (d).
(f) If 4 > 0, by Property D* we have
E(£<^) = £E(£k|«0 (a.s.)
which, with (b), establishes the required result.
This completes the proof of the theorem.
We can now establish Property K*. Let rj = IB, Be&. Then, for every
\^dP = f £dP = f E(£|^)</P= \lBE{^)dP = fiyE^l^rfP.
•M J/inB JAnB JA JA
By the additivity of the Lebesgue integral, the equation
hrjdP= f>?E(£|^)rfP, Ae&, (8)
remains valid for the simple random variables *7 = SUi 3^¾^ Bke<&.
Therefore, by Property I (§6), we have
E($||9) = i|E(£|9) (a.s.) (9)
for these random variables.
Now let r\ be any ^-measurable random variable with E\r\\ < oo, and
let {»?„}„> i be a sequence of simple ^-measurable random variables such that
\rj„| < y\ and r\n -+ r\. Then by (9)
E(^J^) = ifcEtfl?) (a.s.).
It is clear that |^„| < |fiy|, where E|£„| < oo. Therefore E(^J^)-^
E(^|^) (a.s.) by Property (a). In addition, since E |^| < oo, we have E(£|^)
finite (a.s.) (see Property C* and Property J of §6). Therefore »?nE(^|^ -►
r]E(£\&) (a.s.). (The hypothesis that E(£|^) is finite, almost surely, is essential,
since, according to the footnote on p. 172, 0 • oo = 0, but if r\n = 1/n, y\ = 0,
we have 1/n • oo -A 0 • oo = 0.)
220
II. Mathematical Foundations of Probability Theory
5. Here we consider the more detailed structure of conditional expectations
E(^|^), which we also denote, as usual, by E(^|i?).
Since E(^|i?) is a ^-measurable function, then by Theorem 3 of §4 (more
precisely, by its obvious modification for extended random variables) there
is a Borel function m = m{y) from R to R such that
m(r,(co)) = E(t\ri)(o>) (10)
for all cdeQ. We denote this function m(y) by E(^|i? = y) and call it the
conditional expectation oft, 'with respect to the event {n = y), or the conditional
expectation oft, under the condition that r\ = y.
Correspondingly we define
UdP = \\L(Z\ri)dP= [m{r()dP, ^¾. (11)
Ja Ja Ja
Therefore by Theorem 7 of §6 (on change of variable under the Lebesgue
integral sign)
f m(ri)dP= \m{y)Pn{dy\ Be<%(R\ (12)
J {to: tie B) JB
where Pn is the probability distribution of n. Consequently m = m(y) is a
Borel function such that
f £dP= \m{y)dP,. (13)
Jlw.tieB) JB
for every B e @(R).
This remark shows that we can give a different definition of the conditional
expectation E(^\n = y).
Definition 4. Let £ and n be random variables (possible, extended) and let
E^ be defined. The conditional expectation of the random variable £ under
the condition that n = y is any ^(K)-measurable function m = m(y) for
which
f £dP= [m{y)PJ(dyl Be@{R). (14)
J{(o:t}eB) JB
That such a function exists follows again from the Radon-Nikodym theorem
if we observe that the set function
Q(B) = I ZdP
-J <
is a signed measure absolutely continuous with respect to the measure Pt
§7. Conditional Probabilities and Expectations with Respect to a a-Algebra 221
Now suppose that m(y) is a conditional expectation in the sense of
Definition 4. Then if we again apply the theorem on change of variable under the
Lebesgue integral sign, we obtain
f £ = f m{y)PJLdy) = \ mfoX B e Ofo
J{a:tieB} JB J{to:t)eB}
The function m(ri) is ^-measurable, and the sets {w.rieB}, Be@l(R\
exhaust the subsets of ^.
Hence it follows that w(i?) is the expectation E(^|iy). Consequently if we
know E(£\r] = y) we can reconstruct E(^|i?), and conversely from E(^\rj) we
can find E(£1*7 = y).
From an intuitive point of view, the conditional expectation E(^|iy = y)
is simpler and more natural than E(^|i?). However, E(£|*7), considered as a
^-measurable random variable, is more convenient to work with.
Observe that Properties A*-K* above and the conclusions of Theorem 2
can easily be transferred to E(£\r] = y) (replacing "almost surely" by
"P^-almost surely"). Thus, for example, Property K* transforms as follows:
if E | ^ | < oo and E | £f{rj) \ < oo, where f = f(y) is a <%(R) measurable
function, then
EWOfllif = y)= f(ym\ri = y) (P.-a-S.). (15)
In addition (cf. Property J*), if £ and r\ are independent, then
E«|i, = jO = E£ (P.-a-S.).
We also observe that if B e<%(R2) and £ and r\ are independent, then
^\}^n)\n = y\ = ^i^y) (P„-a.s.), (16)
and if <p = <p(x, y) is a ^(/?2)-measurable function such that E \(p{^, rj)\ < oo,
then
E|>(£ n)\n =-y] = EM y)1 (P.-a-S.).
To prove (16) we make the following observation. If B = Bl x B2, the
validity of (16) will follow from
f W bJ&* r})P(dw) = f E/BlX Ba(& y)P,(dy).
J{a:tieA) •'(yeX)
But the left-hand side is P{£eBl3 tjeA n B2}> and the right-hand side is
P^eB^PirjeA n B2)\ their equality follows from the independence of £
and r\. In the general case the proof depends on an application of Theorem 1,
§2, on monotone classes (cf. the corresponding part of the proof of Fubini's
theorem).
Definition 5. The conditional probability of the event Ae^- under the
condition that r\ = y (notation: P(A\rj = y)) is E(IA\r\ = y).
222
II. Mathematical Foundations of Probability Theory
It is clear that P(A \rj = y) can be defined as the ^(K)-measurable function
such that
P(An{r}EB})= f P{A\n = y)P,(dy), Be@(R). d?)
Jb
6. Let us calculate some examples of conditional probabilities and
conditional expectations.
Example 1. Let i? be a discrete random variable with P(rj = yk) > 0,
I?*iPfo =^)=1. Then
For 3>£ (3>i, 3>2, • • •} the conditional probability P(A\r] = y) can be defined
in any way, for example as zero.
If ^ is a random variable for which E£ exists, then
When y £ {y^ y2,...} the conditional expectation E(£\r] = y) can be defined
in any way (for example, as zero).
Example 2. Let (<!;, rj) be a pair of random variables whose distribution has a
density f^{x, y):
P{(£, rfi e B} = jfs„(x, y) dx dy, B e <%{R2).
Let f£x) and fjiy) be the densities of the probability distribution of £ and rj
(see (6.46), (6.55) and (6.56).
Let us put
^13^¾^ °8)
taking^|„(x|3/) = 0 if f,(y) = 0.
Then
P«eC|ij =y) = jf^x\y)dx9 Ce@{R\ (19)
i.e./^xly) is the density of a conditional probability distribution.
§7. Conditional Probabilities and Expectations with Respect to a a-Algcbra 223
In fact, in order to prove (19) it is enough to verify (17) for B e
A = {£ e C}. By (6.43), (6.45) and Fubini's theorem,
fz\v(x\y)fr,(y)dxdy
Jcxb
= I f*f&y)dxdy
JCXB
= P{(& ri)eCxB} = P{(^eC) n(^B)},
which proves (17).
In a similar way we can show that if E^ exists, then
/•00
E(f|*7 = j>)= x^|„(x|j/)rfx.
J-oo
(20)
Example 3. Let the length of time that a piece of apparatus will continue to
operate be described by a nonnegative random variable y\ = 17(a)) whose
distribution F^y) has a density fn(y) (naturally, Fn{y) = fn(y) = 0 for y < 0).
Find the conditional expectation E{r\ — a\r\ > a), i.e. the average time for
which the apparatus will continue to operate on the hypothesis that it has
already been operating for time a.
Let PO7 > a) > 0. Then according to the definition (see Subsection 1) and
(6.45),
E(„ -a\r,>a) = EH*T "^ = J" ^ 'J^-f^
1 P(r]>a) P(rj> a)
_l?(y-g)My)dy
!?My)dy '
It is interesting to observe that if r\ is exponentially distributed, i.e.
{te~kv v > 0,
0 yZo, (21>
then Eri = E(r}\r] > 0) = 1/A and EO7 -a\r\ > a) = 1/A for every a > 0. In
other words, in this case the average time for which the apparatus continues
to operate, assuming that it has already operated for time a, is independent
of a and simply equals the average time Erj.
Under the assumption (21) we can find the conditional distribution
P(*7 — a < x\r\ > a).
224
II. Mathematical Foundations of Probability Theory
We have
P(»Z -a <x\rj >a) =
P(a<ri <a +^x)
P{ri > a)
Fn(a + x) - F„(fl) + Pfr = a)
1 - Fn{a) + P(r, = a)
[1 _ e-Ma+x)] - [1 - e-kar]
1 - [1 - e~Xtr]
-ka
= \-e~
Therefore the conditional distribution P(rj — a < x\rj > a) is the same
as the unconditional distribution P(rj < x). This remarkable property
is unique to the exponential distribution: there are no other distributions
that have densities and possess the property P(rj — a < x\rj > a) = P(rj < x),
a > 0,0 < x < oo.
Example 4 (Buffon's needle). Suppose that we toss a needle of unit length
"at random" onto a pair of parallel straight lines, a unit distance apart, in
a plane. What is the probability that the needle will intersect at least one of the
lines?
To solve this problem we must first define what it means to toss the
needle "at random." Let ^ be the distance from the midpoint of the needle to
the left-hand line. We shall suppose that ^ is uniformly distributed on [0, 1],
and (see Figure 29) that the angle 6 is uniformly distributed on [—n/2, n/2].
In addition, we shall assume that ^ and 6 are independent.
Let A be the event that the needle intersects one of the lines. It is easy to
see that if
B = {(a, x): \a\ < |, x e [0, icos d] u [1 - icos a, 1]},
then A = {cd: (6, £) e J3}, and therefore the probability in question is
P{A) = EIA(w) = E/B(0(co), £(co)).
Figure 29
§7. Conditional Probabilities and Expectations with Respect to a <r-Algcbra
225
By Property G* and formula (16),
EW4 «©)) = E(E[/B(0(co), «fl)))|0(fl))])
= fE[/B(0(co),^(co))|0(co)]P(rfco)
= f/2 E[/B(0(co),^(co))|0(co) = a]Pe(^)
1 fK/2 1 fK/2 2
= - E/B(a, £(co)) da = - cos a da = -,
tt J-tt/2 7C J-K/2 K
where we have used the fact that
E/b(<3, £(co)) = P{^ e [0, \ cos a] u [1 — 7 cos a]} = cos a.
Thus the probability that a "random" toss of the needle intersects one of
the lines is 2/n. This result could be used as the basis for an experimental
evaluation of n. In fact, let the needle be tossed N times independently.
Define ^£ to be 1 if the needle intersects a line on the ith toss, and 0 otherwise.
Then by the law of large numbers (see, for example, (1.5.6))
P{p-^7-^ - P(^> > ej "> °> N - oo.
for every e > 0.
In this sense the frequency satisfies
and therefore
2N
*l+-" + &
This formula has actually been used for a statistical evaluation of n. In
1850, R. Wolf (an astronomer in Zurich) threw a needle 5000 times and
obtained the value 3.1596 for n. Apparently this problem was one of the first
applications (now known as Monte Carlo methods) of probabilistic-
statistical regularities to numerical analysis.
7. If {<!;„}„£! is a sequence of nonnegative random variables, then according
to conclusion (f) of Theorem 2,
E(I<y^) = lE(<y«0 (as.)
In particular, if Blf B2,... is a sequence of pairwise disjoint sets,
P(LBnm = Y,P{Bn\<$) (as.) (22)
226
II. Mathematical Foundations of Probability Theory
It must be emphasized that this equation is satisfied only almost surely
and that consequently the conditional probability P(B\&)(cd) cannot be
considered as a measure on B for given cd. One might suppose that, except
for a set Jf of measure zero, P(-| ^)(co) would still be a measure for coe Jf.
However, in general this is not the case, for the following reason. Let
^{Bu B2,.. -) be the set of sample points cd such that the countable additivity
property (22) fails for these BUB2, Then the excluded set Jf is
jr = \Jjr{Bl9B2,...), (23)
where the union is taken over all Bl3 B2i... in !F. Although the P-measure
of each set ^{Blt B2i...)is zero, the P-measure of Jf can be different from
zero (because of an uncountable union in (23)). (Recall that the Lebesgue
measure of a single point is zero, but the measure of the set Jf = [0,1],
which is an uncountable sum of the individual points {x}, is 1).
However, it would be convenient if the conditional probability P(-| ^)(co)
were a measure for each cdeQ since then, for example, the calculation of
conditional probabilities E(^|^) could be carried out (see Theorem 3 below)
in a simple way by averaging with respect to the measure P(|^)(co):
E«|0)= f«fl>)P(«M9) (a.s.)
Jsi
(cf. (1.8.10)).
We introduce the following definition.
Definition6. A function P(co; B\ defined for all co e Q, and Be^iisa. regular
conditional probability with respect to ^ if
(a) P(co; •) is a probability measure on !F for every coe CI;
(b) For each Bef the function P(co; B\ as a function of co, is a variant of the
conditional probability P(J3|^)(co), i.e. P(co: B) = P(B|^)(co) (a.s.).
Theorem3. Let P(co; B) be a regular conditional probability with respect to
& and let £ be an integrable random variable. Then
E«|0)(fl>)= [mP(v,dw) (a.s.). (24)
Proof. If ^ = IBi B e ^\ the required formula (24) becomes
P(B\<&)(co) = P(co; B) (a.s.),
which holds by Definition 6(b). Consequently (24) holds for simple functions.
§7. Conditional Probabilities and Expectations with Respect to a <r-Algcbra 227
Now let ^ > 0 and £„ | ^, where £„ are simple functions. Then by (b) of
Theorem 2 we have E(^|^)(co) = lim„ E(£„|^)(co) (a.s.). But since P(co; •)
is a measure for every co e £2, we have
lim E(£„|«0(co) = lim f a^)^(w; deb) = f ^(ft>)P(co; rfco)
by the monotone convergence theorem.
The general case reduces to this one if we use the representation £ =
c - r.
This completes the proof.
Corollary. Let ^ = ^, -where n is a random variable, and let the pair (<!;, n)
have a probability distribution with density f^fx, y). Let E\g(Q\ < oo. Then
£(9(Q\ri = y)= V 9{x)fajx\y)dx,
«/-00
where f^tl(x\y) is the density of the conditional distribution (see (18)).
In order to be able to state the basic result on the existence of regular
conditional probabilities, we need the following definitions.
Definition 7. Let (£, S) be a measurable space, X = X(co) a random element
with values in £, and ^ a cr-subalgebra of &. A function Q(co; B\ defined
for co eQ, and B e£ is a regular conditional distribution oj'X with respect to
(a) for each coeQ the function Q(co; B) is a probability measure on (£, <f);
(b) for each B e & the function Q(co\ B), as a function of co, is a variant of the
conditional probability P(X e B\&)(co), i.e.
' e(co;J3) = P(XeJ3|^)(co) (a.s.).
Definition8. Let ^ be a random variable. A function F = F(co;x), coeQ
x eR, is a regular distribution function for £ with respect to& if :
(a) F(co; x) is, for each co e £2, a distribution function on /?;
(b) F(co; x) = P(£ < x|^)(co) (a.s.), for each xeR.
Theorem 4. A regular distribution function and a regular conditional
distribution function always exist for the random variable £ with respect to <0.
228
II. Mathematical Foundations of Probability Theory
Proof. For each rational number re/?, define Fr(co) = P(£ < r 1^)(^),
where P(£ < r\&)(co) = E(/{§:Sr}|^)(co) is any variant of the conditional
probability, with respect to ^, of the event {£ < r}. Let {r,-} be the set of
rational numbers in R. If r,- < rj9 Property B* implies that P(£ < ^1^) <
P(£ < rj|0) (a.s.), and therefore if Akj = {co: Fr/co) < Fri(w)}, ,4 = LMu»
we have P(A) = 0. In other words, the set of points co at which the
distribution function Fr(o>), y e {r,-}, fails to be monotonic has measure zero.
Now let
Bt = {co: lim Fri+(1/n)(co) * Fri(4, B = Q *i-
It is clear that /(^,-.+<i/n)> 1 ^r.}. n ~* °°- Therefore, by (a) of Theorem 2,
Fr.+(1/n)(co) -> Fr.(co) (a.s.), and therefore the set B on which continuity on the
right fails (with respect to the rational numbers) also has measure zero,
P(J3) = 0.
In addition, let
C = <co: lim F„(cd) # 1 I u \co: lim F„(co) > 0V.
(, n-K» J (. n--oo J
Then, since {<!; < n} | £2, n -> oo, and {^ < n} 10, n -» — oo, we have
P(C) = 0.
Now put
_, . f lim Fr(
F(w; x) = J rlx
lG(x\
.(co), cD¢AvBvC,
coeAu BuC,
where G(x) is any distribution function on R\ we show that F(co; x) satisfies
the conditions of Definition 8.
Let co^AuBuC. Then it is clear that F(co; x) is a nondecreasing
function of x. If x < x' < r, then F(co; x) < F(co; x') < F(co; r) = Fr(co) I F(co, x)
when rlx. Consequently F(co;x) is continuous on the right. Similarly
lim^^ F(co; x) = 1, lim^..^ F(co; x) = 0. Since F(co; x) = G(x) when
co6/4uBuC, it follows that F(co; x) is a distribution function on R for
every coeQ, i.e. condition (a) of Definition 8 is satisfied.
By construction, P(£ < r)|^)(co) = Fr(co) = F(co; r). If rjx, we have
F(co; r) I F(co; x) for all co Eft by the continuity on the right that we just
established. But by conclusion (a) of Theorem 2, we have P(£ < r|^)(co) -►
P(£<x|^)(co) (a.s.). Therefore F(co;x) = P(£ < x|G)(co) (a.s.), which
establishes condition (b) of Definition 8.
We now turn to the proof of the existence of a regular conditional
distribution of £ with respect to ^.
Let F(co; x) be the function constructed above. Put
C(co; B) = F(cd; dx\
\f(co\i
Jb
§7. Conditional Probabilities and Expectations with Respect to a <r-Algcbra 229
where the integral is a Lebesgue-Stieltjes integral. From the properties of
the integral (see §6, Subsection 7), it follows that Q(co\ B) is a measure on B
for each given cd e Q. To establish that Q(co\ B) is a variant of the conditional
probability P(£ei3|^)(co), we use the principle of appropriate sets.
Let # be the collection of sets B in @(R) for which Q(co; B) =
P(£eB\&)(cD) (a.s.). Since F(co;x) = P(£ < x\<0)(co) (a.s.), the system #
contains the sets B of the form B = ( — oo, x], xeR. Therefore # also
contains the intervals of the form (a, b], and the algebra s0 consisting of finite
sums of disjoint sets of the form (a, b]. Then it follows from the continuity
properties of Q(co; B) (co fixed) and from conclusion (b) of Theorem 2 that #
is a monotone class, and since s$ c # c g&{R\ we have, from Theorem 1
of §2,
@(R) = aisST) s (jtff) = /4#) = # c ^(/?),
whence # = @(R).
This completes the proof of the theorem.
By using topological considerations we can extend the conclusion of
Theorem 4 on the existence of a regular conditional distribution to random
elements with values in what are known as Borel spaces. We need the
following definition.
Definition 9. A measurable space (E, <f) is a Borel space if it is Borel equivalent
to a Borel subset of the real line, i.e. there is a one-to-one mapping cp = cp(e):
(£, £) -► (/?, @(R)) such that
(1) cp(E) = {cp(e): e e E} is a set in @(R);
(2) cp is ^-measurable (cp~\A) eSyAe cp(E) n @(R)),
(3) cp~l is ^(/?)/^-measurable (cp(B)e<p(E) n @(R\ Beg).
Theorem 5. Let X = X(co) be a random element with values in the Borel space
(£, $). Then there is a regular conditional distribution ofX with respect to <&.
Proof. Let cp = cp(e) be the function in Definition 9. By (2), cp(X(co)) is a
random variable. Hence, by Theorem 4, we can define the conditional
distribution Q(co; A) of cp(X(co)) with respect to^Ae <p(E) n @(R).
We introduce the function Q(w; B) = Q(cd\ cp(B\ Be£. By (3) of
Definition 9, cp(B) e cp(E) n <%(R) and consequently Q(co\ B) is defined.
Evidently Q(co; B) is a measure on Be^ for every co. Now fixBetf. By the
one-to-one character of the mapping cp = cp(e)t
Q(w; B) = Q(cd; cp(B)) = P{cp(X)ecp(B)m = P{XeB|0} (a.s.).
Therefore Q(co; B) is a regular conditional distribution of X with respect
to^.
This completes the proof of the theorem.
230
II. Mathematical Foundations of Probability Theory
Corollary. LetX = X(cd) be a random element with values in a complete
separable metric space (E, S). Then there is a regular conditional distribution of X
with respect to &. In particular, such a distribution exists for the spaces
(/?", @(Rn)) and (Rm, @{R<°)).
The proof follows from Theorem 5 and the well known topological result
that such spaces are Borel spaces.
8. The theory of conditional expectations developed above makes it possible
to give a generalization of Bayes's theorem; this has applications in statistics.
Recall that if @ = {Alt...t A„} is a partition of the space Q with
P(/4.) > 0, Bayes's theorem (1.3.9) states that
p(^|B)-S^Pw- (25)
for every B with P(B) > 0. Therefore if 6 = Ys=i a^At ls a discrete random
variable then, according to (1.8.10),
EMmim - S-irta'*W(BI4) r2M
tumm- B=iP(WMj) . (*>
or
E[9(0)|B] " P„P(B\e = aWda) ■ (2?)
On the basis of the definition of EQ?(0)|i3] given at the beginning of this
section, it is easy to establish that (27) holds for all events B with P(B) > 0,
random variables 6 and functions g = g(a) with E \g(6)\ < oo.
We now consider an analog of (27) for conditional expectations
E\jg(0)\&1 with respect to a <r-algebra ^, ^ c ^
Let
Q(B) = [g{6)P{dw\ Be<$.
Jb
(28)
Then by (4)
EWOT=J(4 (29)
P(B)= fl
We also consider the cr-algebra &0. Then, by (5),
P(B\%)dP (30)
or, by the formula for change of variable in Lebesgue integrals,
P(B)=C P(B\0 = a)Pe(da). (31)
J -oo
§7. Conditional Probabilities and Expectations with Respect to a <r-Algcbra 231
Since
Q(B) = EWfl/j = Efc?(0) • E(/bI^)1
we have
Q(B) = P g(a)P(B\6 = a)P^da). (32)
•/-oo
Now suppose that the conditional probability P(B\6 = a) is regular and
admits the representation
P(B\6 = a) = [p(<D\a)Kdo>\ (33)
where p = p(a>; a) is nonnegative and measurable in the two variables
jointly, and A is a cr-finite measure on (Q, &).
Let E |#(0)| < oo. Let us show that (P-a.s.)
J-oo P(cd; a)Pe(da)
(generalized Bayes theorem).
In proving (34) we shall need the following lemma.
Lemma. Let (£2, 2F) be a measurable space.
(a) Let p and X be a-finite measures, and f = f(co) an ^-measurable function.
Then
Lfdti=LfiM <35>
(in the sense that if either integral exists, the other exists and they are equal).
(b) If v is a signed measure and p, A are cr-finite measures v <£ p, p < K then
dv dv dp.
ST^S <*"> (36)
and
Proof, (a) Since
dv _ dv Idp
dp dk\ dk
a.s.) (37)
(35) is evidently satisfied for simple functions f = £/-/^. The general case
follows from the representation f = f+ — f and the monotone
convergence theorem.
232
II. Mathematical Foundations of Probability Theory
(b) From (a) with f = dv/dfi we obtain
-=£(I)*-10#-
Then v < X and therefore
whence (36) follows since A is arbitrary, by Property I (§6).
Property (37) follows from (36) and the remark that
4-M-t
(on the set {w: dfi/dA = 0} the right-hand side of (37) can be defined
arbitrarily, for example as zero). This completes the proof of the lemma.
To prove (34) we observe that by Fubini's theorem and (33),
Q(B) = £ [p g(a)p(a>; a^da^cD), (38)
P(B) = £ I"^ p(co; a)Pe(da)h.(dcD). (39)
Then by the lemma
dQ dQ/dX
(P-a.s.).
dP dPIdX
Taking account of (38), (39) and (29), we have (34).
Remark. Formula (34) remains valid if we replace 6 by a random element
with values in some measurable space (£, £) (and replace integration over
R by integration over E).
Let us consider some special cases of (34).
Let the cr-algebra ^ be generated by the random variable ^, ^ = #4.
Suppose that
P(ZeA\0 = a)= f q(x; a)Mdx\ Ae@(R\ (40)
Ja
where q = q(x; a) is a nonnegative function, measurable with respect to both
variables jointly, and A is a cr-finite measure on (/?, @(R)). Then we obtain
§7. Conditional Probabilities and Expectations with Respect to a a-Algcbra
233
In particular, let (0, ^) be a pair of discrete random variables, 0 = ^^/^,
£>=Y,xjlBy Then, taking A to be the counting measure (A({x,})= 1, i = 1,2,...)
we find from (40) that
WK XjJ ZiP(.i = xi\8 = ai)P(e = ai) ■ (42)
(Compare (26).)
Now let (0, ^) be a pair of absolutely continuous measures with density
fe,£a,x). Then by (19) the representation (40) applies with q(x; a) =
ft\e(x\a) and Lebesgue measure h Therefore
i-* fs\fA.x\a)fe(a) da
9. Problems
1. Let f and tj be independent identically distributed random variables with E£ defined.
Show that
E(Z\Z + r,) = Em + r,) = ^- (a.s.).
2. Let ^,^2.--- be independent identically distributed random variables with
E|£.| < oo. Show that
E(5, | S„,Sn+ .,...) = - (a-s-X
/7
where S„ = ^ H h fn.
3. Suppose that the random elements (X, Y) are such that there is a regular distribution
px(B) = P{YeB\X = x). Show that if E \g(X, Y)\ < oo then
E[0(X, Y)\X = x] = jg(x, y)Px(dy) (P^-a.s.).
4. Let f be a random variable with distribution function F4(x). Show that
(assuming that F^fe) - F^(a) > 0).
5. Let g = g(x) be a convex Borel function with E|g(f)| < oo. Show that Jensen's
inequality
rtEtfW) < E(0(0|S)
holds for the conditional expectations.
234
II. Mathematical Foundations of Probability Theory
6. Show that a necessary and sufficient condition for the random variable f and the
a-algebra <S to be independent (i.e., the random variables £ and JB(co) are
independent for every JSe<?) is that E{g(J;)\<g) = Egtf) for every Borel function g(x) with
E|0«)|<oo.
7. Let f be a nonnegative random variable and ^ a a-algebra, ^ c &. Show that
E(f |^) < oo (a.s.) if and only if the measure 0, defined on sets A e^ by Q(A) =
\A f dP, is a-finite.
§8. Random Variables. II
1. In the first chapter we introduced characteristics of simple random
variables, such as the variance, covariance, and correlation coefficient. These
extend similarly to the general case. Let (Q, ^ P) be a probability space and
£ = £(cd) a random variable for which E^ is defined.
The variance of £> is
V£ = E« - EQ2.
The number a = +yJVt, is the standard deviation.
If ^ is a random variable with a Gaussian (normal) density
f£x) = -jLr- e-K*-m)»V2«^ a > 0> _ oo < w < oo, (1)
the parameters m and a in (1) are very simple:
m = E& a2 = V£.
Hence the probability distribution of this random variable ^, which we call
Gaussian, or normally distributed, is completely determined by its mean
value m and variance a2. (It is often convenient to write ^ ~ Jf (w, cr2).)
Now let (<!;, 17) be a pair of random variables. Their covariance is
cov«,!,) = E«-EO(i,-Ei,) (2)
(assuming that the expectations are defined).
If cov(^, n) = 0 we say that £ and n are uncorrected.
If V£ > 0 and V17 > 0, the number
is the correlation coefficient of £ and »7.
The properties of variance, covariance, and correlation coefficient were
investigated in §4 of Chapter I for simple random variables. In the general
case these properties can be stated in a completely analogous way.
§8. Random Variables. II
235
Let £ = (gl9..., £n) be a random vector whose components have finite
second moments. The covariance matrix of £ is the n x n matrix U = WRyW,
where Ru =cov(£l-, £,). It is clear that R is symmetric. Moreover, it is non-
negative definite, i.e.
for all X{ e R, i = 1,..., n, since
£ RtjXiXj = e\£ (¾ - E«aT > 0.
The following lemma shows that the converse is also true.
Lemma. A necessary and sufficient condition that an n x n matrix U is the
covariance matrix of a vector £ = (^,..., <!;„) is that the matrix is symmetric
and nonnegative definite, or, equivalently, that there is an n x k matrix A
(1 < k £ n) such that
U = AAT,
where T denotes the transpose.
Proof. We showed above that every covariance matrix is symmetric and
nonnegative definite.
Conversely, let U be a matrix with these properties. We know from matrix
theory that corresponding to every symmetric nonnegative definite matrix U
there is an orthogonal matrix 0 (i.e., 00T = E, the unit matrix) such that
0TUO = D,
where
„ (di • 0\
is a diagonal matrix with nonnegative elements dt,i = I,..., n.
It follows that
U = ODOT = (tfBXB1^1),
where B is the diagonal matrix with elements bt = +\fdl, i = I,..., n.
Consequently if we put A = OB we have the required representation
U = AAT for U.
It is clear that every matrix AAT is symmetric and nonnegative definite.
Consequently we have only to show that U is the covariance matrix of some
random vector.
Let 1/1,1/2.---^11 t>e a sequence of independent normally distributed
random variables, ^V(0,1). (The existence of such a sequence follows, for
example, from Corollary 1 of Theorem 1, §9, and in principle could easily
236
II. Mathematical Foundations of Probability Theory
be derived from Theorem 2 of §3.) Then the random vector £ = Arj (vectors
are thought of as column vectors) has the required properties. In fact,
E^T = E(Arj)(Ar])T = A • Erjtj7 • AT = AEAT = AAT.
(If C = IIC.jll is a matrix whose elements are random variables, EC means the
matrix ||E£l7||).
This completes the proof of the lemma.
We now turn our attention to the two-dimensional Gaussian (normal)
density
_ 2p (x-m,)(y-mj + (^¾ (4)
^1^2 ^2 JJ
characterized by the five parameters ml3 w2, ou o2 and p (cf. (3.14)), where
|mx | < oo, | m21 < oo, o^ > 0, o2 > 0, \p \ < 1. An easy calculation identifies
these parameters:
Wl = E£, cr? = V£,
m2 = Erj, a\ = Vrj,
P = P(£ n)-
In §4 of Chapter I we explained that if ^ and rj are uncorrelated (p(£, rj) = 0),
it does not follow that they are independent. However, if the pair (£, rj) is
Gaussian, it does follow that if £ and rj are uncorrelated then they are
independent.
In fact, if p = 0 in (4), then
f, (x, j;) = e-H*-»u)2]/2ff? . e-[(y-m,)2]/2<7§
But by (6.55) and (4),
J-co yJAKGi
Consequently
from which it follows that ^ and ?; are independent (see the end of Subsection
9 of §6).
§8. Random Variables. II
237
2. A striking example of the utility of the concept of conditional expectation
(introduced in §7) is its application to the solution of the following problem
which is connected with estimation theory (cf. Subsection 8 of §4 of Chapter
I).
Let (<?, rj) be a pair of random variables such that £ is observable but n is
not. We ask how the unobservable component n can be "estimated" from
the knowledge of observations of £.
To state the problem more precisely, we need to define the concept of an
estimator. Let <p = q>(x) be a Borel function. We call the random variable
<p(C) an estimator off? in terms of ^, and E[rj — <p(£)]2 tne (mean square) error
of this estimator. An estimator <p*(£) is called optimal (in the mean-square
sense) if
A s E[f, - <p*(£)]2 = inf E[i, - (piOV, (5)
where inf is taken over all Borel functions q> = q>(x).
Theorem 1. Let En2 < oo. Then there is an optimal estimator q>* = q>*(£)
and <p*(x) can be taken to be the junction
<?*(*) = Efo|£ = x). (6)
Proof. Without loss of generality we may consider only estimators <p(£)
for which E<p2(£) < oo. Then if <p(£) is such an estimator, and q>*(£) = E(n | £),
we have
ED, - (p(OV = E[(f, - 9*(0) + (<p*(0 - vim2
= ed, - 9*(0]2 + e[9*(o - cpior
+ 2E[fo - <p*(Q){<p*{Q " 9(0)] * E[i, - 9*(0]2,
since E[p*(f) — <p(<!;)]2 > 0 and, by the properties of conditional
expectations,
E[(f, - <P*(0)(<P*(0 ~ 9(0)] = E{E[(f, - 9*(0)(9*(0 " 9(0)1^}
= E{(9*(0 - cp^Win - <p*(Q\Q) = 0.
This completes the proof of the theorem.
Remark. It is clear from the proof that the conclusion of the theorem is still
valid when £ is not merely a random variable but any random element
with values in a measurable space (£, <f). We would then assume that
q> = <p(x) is an <f/^CR)-measurable function.
Let us consider the form of <p*(x) on the hypothesis that (£, n) is a Gaussian
pair with density given by (4).
238
II. Mathematical Foundations of Probability Theory
From (1), (4) and (7.10) we find that the density /„K(y |x) of the conditional
probability distribution is given by
/„# |x) = l ^ ^-^^.id-P^ (7)
^/271(1 - p2)o2
where
m(x) = w2 + — p • (x — mj. (8)
Then by the Corollary of Theorem 3, §7,
E(i||£ = x) = f j^0>|x)^ = m(x) (9)
J-00
and
Vfo|£ = x) = E[fo - Efo|£ = x))2|£ = x]
= C (y-m(x))%^\x)dy
J-co
= ai(l - p2). (10)
Notice that the conditional variance V(n \ £ = x) is independent of x and
therefore
A=ED?-Eto|<m2 = <Ti(l-p2). (11)
Formulas (9) and (11) were obtained under the assumption that V^ > 0
and Vn > 0. However, if V^ > 0 and Vn = 0 they are still evidently valid.
Hence we have the following result (cf. (1.4.16) and (1.4.17)).
Theorem 2. Let (<!;, rj) be a Gaussian vector with V^ > 0. Then the optimal
estimator ofn in terms oft, is
covf^ n)
£¢,10-6, + -^%-Eft (12)
and its error is
A=E[,-E(,|{)]2 = V,-5?^p. (13)
Remark. The curve y(x) = E(rj\£ = x) is the curve of regression of n on £
or of n with respect to £. In the Gaussian case E(n \ £ = x) = a + bx and
consequently the regression of n and £ is linear. Hence it is not surprising
that the right-hand sides of (12) and (13) agree with the corresponding parts
of (1.4.6) and (1.4.17) for the optimal linear estimator and its error.
§8. Random Variables. II
239
Corollary. Let £j and e2 be independent Gaussian random variables with mean
zero and unit variance, and
£ = «1^1 + a2B2, r\ = b^ + b2B2.
Then E£ = En = 0, V£ = a\ + af, Vm = b\ + &i,cov(£m) = axbx + a2fc2,
and r/<3i + a\ > 0, tJzen
EOf|0-£«*|±f|*i6 (14)
<3i + <22
A = (a1b2-a2b1)>
a\ + a22
3. Let us consider the problem of determining the distribution functions of
random variables that are functions of other random variables.
Let ^ be a random variable with distribution function F£x) (and density
/$(x), if it exists), let q> = q>(x) be a Borel function and w = q>{£). Letting
Iy = (—oo, y), we obtain
F,iy) = Pin <y) = P(*(0 e Iy) = P(£e q>-\Iy)) = f F^x), (16)
which expresses the distribution function F^iy) in terms of F^(x) and ¢).
For example, if n = a£ + b, a > 0, we have
If w = £2, it is evident that F„(y) = 0 for 3; < 0, while for j; > 0
JW = P(e <y) = Pi-y/y < Z < y/y)
= F^y) - F£-yfi) + P(£ = -^5). (18)
We now turn to the problem of determining /,,00-
Let us suppose that the range of £ is a (finite or infinite) open interval
I = (a, b\ and that the function q> = <p(x), with domain (a, b\ is continuously
differentiable and either strictly increasing or strictly decreasing. We also
suppose that <p'(x) ^ 0, xel. Let us write h(y) = <p~l(y) and suppose for
definiteness that <p(x) is strictly increasing. Then when y e <p(I\
F,(y) = P(ij *y) = P(<p(Q <y) = P(£ < gT'GO)
£(y)
/,(x)dx. (19)
00
240
II. Mathematical Foundations of Probability Theory
By Problem 15 of §6,
fs(x)dx = ) fs(h(z)W(z)dz (2°)
and therefore
Uy) = M¥y)W(y)- (21>
Similarly, if <p(x) is strictly decreasing,
ffr) = f&tyM-K(y)).
Hence in either case
m = fjm)W<y)l (22)
For example, if y\ = a^ + b, a ^ 0, we have
m=y^± and m = ±.A(^JL}
If £ - JT (w, cr2) and 17 = e\ we find from (22) that
1 f ln(j;/M)2l
//y)=j>/S«iy 1 2<72 J' ' (23)
lo y < 0,
with M = e™.
A probability distribution with the density (23) is said to be lognormal
(logarithmically normal).
If (p = <p(x) is neither strictly increasing nor strictly decreasing, formula
(22) is inapplicable. However, the following generalization suffices for many
applications.
Let q> — <p(x) be defined on the set £fc=i [ak,fck], continuously
differentiate and either strictly increasing or strictly decreasing on each open
interval Ik = (ak, bk\ and with q>\x) ^ 0 for xelk. Let hk = hk(y) be the
inverse of <p(x) for x e Ik. Then we have the following generalization of (22):
Uy) = t Uh(y)Wk(y)\- iDk(y\ (24)
where Dk is the domain of hk(y).
For example, if r\ = £2 we can take Jx = (—oo, 0), /2 = (0, oo), and
find that ht(y) = —y/y, h2(y) = yfy, and therefore
w> = i wy (25)
10, j;<0.
§8. Random Variables. II
241
We can observe that this result also follows from (18), since P(^ = — ^fy) = 0.
In particular, if f ~ Jf (0, 1),
1 ze"y/2, j>>0,
f^AyfhTy ' ' (26)
k y < 0.
A straightforward calculation shows that
AW-g"**-* -J
4. We now consider functions of several random variables.
If £ and rj are random variables with joint distribution F^(x, y), and
q> = <p(x, 3;) is a Borel function, then if we put £ = <p(<!;, rj) we see at once that
F,{z) = f <*Fft(x,j/). (29)
J{x,y.q>(x,y)<.z)
For example, if <p(x, y) = x + y, and ^ and ?; are independent (and
therefore F^(x, 3;) = F^(x) • FJiy)) then Fubini's theorem shows that
Fi(z)= f dFM-dFJiy)
J{x,y:x + y£z}
(30)
and similarly
F,(z)= C F,(z-y)dF,(y). (31)
J-00
If F and G are distribution functions, the function
H(z)= f F(z-x)dG(x)
J-00
is denoted by F * G and called the convolution of F and G.
Thus the distribution function Ff of the sum of two independent random
variables £ and n is the convolution of their distribution functions F% and Fn:
It is clear that F, * F. = F„ * F,.
242
II. Mathematical Foundations of Probability Theory
Now suppose that the independent random variables £ and r\ have
densities/^ and/,. Then we find from (31), with another application of Fubini s
theorem, that
f"(z)=n [f-j*(u) du]fti(y) dy
= £^ \j[j& - 3» dH\fJiy) dy = J* ]T ftu - y)fn(y) ^J du,
whence
and similarly
m = \™/&-y)mdy,
&z)= r f„(z-x)fs(x)dx.
J -00
(32)
(33)
Let us see some examples of the use of these formulas.
Let f lf f2. • • •» £n be a sequence of independent identically distributed
random variables with the uniform density on [— 1,1]:
Then by (32) we have
and by induction
(0, |x| > 1.
2-
0,
(3
-
4
-
x\
>
l*l)2
16
3-x2
8
0,
1*1 < 2,
|x| > 2,
, 1 < |x| < 3,
0< |x|< 1,
|x| > 3,
II l(n + x)/2]
0, |x| > n.
Now let f ~ ^r (mlf <r?) andy ~ jV (m2, <ri). If we write
<?(*) =
1
x/2^
-*2/2
§8. Random Variables. II
243
then
1 (x - wA 1 (x - m2\
and the formula
V erf + ^2 \ V <7i + ^2/
follows easily from (32).
Therefore the sum of two independent Gaussian random variables is again a
Gaussian random variable with mean w, + m2 and variance a\ + a\.
Let £„ ..,, £n be independent random variables each of which is normally
distributed with mean 0 and variance 1. Then it follows easily from (26) (by
induction) that
\ y<n/2)-l -x/2 v > 0
2«/>rW2)X ' X>U' (34)
0, x < 0.
The variable £\ + • • • + £2 is usually denoted by #«, and its distribution
(with density (30)) is the x2 distribution ("chi-square distribution") with n
degrees of freedom (cf. Table 2 in §3).
If we write x„ = +VZ, it follows from (28) and (34) that
I2x"-le-x2/2
2"'2rW2) ' X~ ' (35)
0, x < 0.
The probability distribution with this density is the ^-distribution (chi-
distribution) with n degrees of freedom.
Again let £ and n be independent random variables with densities f% and
f„. Then
*"*(*) = JJ ffr)fjiy)dxdyt
{x,y:xy<z)
*W*)= JJ ffr)Uy)dxdy.
{x,y:x/y<z}
Hence we easily obtain
«*-r.4va-r.*©*»s »
and
/«W = £ ffcy)Uy)M <ty- 07)
244
II. Mathematical Foundations of Probability Theory
Putting £ = £0 and rj = V(£i + ■ ■ ■ + GVn, in (37), where &. £1>•: *' ^n
are independent Gaussian random variables with mean 0 and variance
g2 > 0, and using (35), we find that
/&/[./(1/ii)«?+"-+3)]W — —/= AA 7 v2\(n +1)/2 V '
The variable to/ty/iX/nHti + ■ ■ • + &?)] is denoted by t, and its distribution
is the t-distribution, or Student's distribution, with n degrees of freedom (cf.
Table 2 in §3). Observe that this distribution is independent of a.
5. Problems
1. Verify formulas (9), (10), (24), (27), (28), and (34)-(38).
2. Let f j,..., £„, n > 2, be independent identically distributed random variables with
distribution function F(x) (and density /(x), if it exists), and let | = max(f j,..., O.
I = min(£„ .... O, p = Z-§. Show that
^<**> = W, y<x,
Afro- {T
- l)lF(y) - F(x)T-2f(x)f(y), y > x,
y < Xy
(n J=„ [FW - F(y - x)]- '/CO #, x ;> 0,
f"(X)=l0, x<0,
W-j*"^
, DF(y) - F(y - x)]"-2/(y - x)/G» dy, x > 0,
x <0.
3. Let £t and £2 be independent Poisson random variables with respective parameters
At and A2. Show that £t + £2 has a Poisson distribution with parameter At + A2.
4. Let mj = m2 = 0 in (4). Show that
^ ' n{c22z - 2pcxc2z + a2)'
5. The maximal correlation coefficient of £ and f/ is p*(£,f/) = supH<tf p(u(£X u(0)>
where the supremum is taken over the Borel functions u = u(x) and y = u(x) for
which the correlation coefficient p(u(£)> »(£)) is defined. Show that £ and n are
independent if and only if /)*(£,»/) = 0.
6. Let t,, t2 x„ be independent nonnegative identically distributed random
variables with the exponential density
fit) = Xe~x\ t > 0.
§9. Construction of a Process with Given Finite-Dimensional Distribution 245
Show that the distribution of xx + h xk has the density
and that
*-i o*y
^+- + 1^0 = 2^.
7. Let $ ~ */T(0, <x2). Show that, for every p > 1,
E|£|* = CP<7*
where
and T(s) = Jo e~xx*~' d* is the gamma function. In particular, for each integer n ^ 1,
Ef2w = (2n- l)!!<x2w.
§9. Construction of a Process with Given
Finite-Dimensional Distribution
1. Let £ = £(cd) be a random variable defined on the probability space
(O, &, P), and let
Ffic) = P{cd: &cd) < x}
be its distribution function. It is clear that F^(x) is a distribution function
on the real line in the sense of Definition 1 of §3.
We now ask the following question. Let F = F(x) be a distribution
function on R. Does there exist a random variable whose distribution function is
F(x)?
One reason for asking this question is as follows. Many statements in
probability theory begin, "Let ^ be a random variable with the distribution
function F(x); then ...". Consequently if a statement of this kind is to be
meaningful we need to be certain that the object under consideration actually
exists. Since to know a random variable we first have to know its domain
(£2, #"), and in order to speak of its distribution we need to have a probability
measure P on (£2, #"), a correct way of phrasing the question of the existence
of a random variable with a given distribution function F(x) is this:
Do there exist a probability space (£2, !F, P) and a random variable £ = £(cd)
on it, such that
P{cd: £(cd) <x} = F(x)?
246
II. Mathematical Foundations of Probability Theory
Let us show that the answer is positive, and essentially contained in
Theorem 1 of §1.
In fact, let us put
Q = R, & = @(R).
It follows from Theorem 1 of §1 that there is a probability measure P (and
only one) on (R, @(R)) for which P(a, fc)] = F(b) - F(a\ a < b.
Put £(co) = w. Then
P{w: £(cd) < x} = P{w: cd < x} = P(-oo, x] = F(x).
Consequently we have constructed the required probability space and the
random variable on it.
2. Let us now ask a similar question for random processes.
Let X = (£t)teT be a random process (in the sense of Definition 3, §5)
defined on the probability space (Q, &, P), with teT^ R.
From a physical point of view, the most fundamental characteristic of a
random process is the set {Ftl tn(xu ..., x„)} of its finite-dimensional
distribution functions
Ftl JLxu ..., x„) = P{cd: £tl < x„ ..., £n < x„}, (1)
defined for all sets £,,...,£„ with tY < t2 < ••• <t„.
We see from (1) that, for each set tu...,t„ with tl < t2 < ••• <t„ths
functions Ftl fn(x!,..., x„) are n-dimensional distribution functions (in
the sense of Definition 2, §3) and that the collection {Ftl Jx^ ..., x„)}
has the following consistency property:
lim Ftl fn(Xi,..., x„) = Ftl Tfc Jxi,..., xk,..., x„) (2)
where " indicates an omitted coordinate.
Now it is natural to ask the following question: under what conditions
can a given family {Ftl fn(x!,..., x„)} of distribution functions
Ftl fn(Xi,..., x„) (in the sense of Definition 2, §3) be the family of finite-
dimensional distribution functions of a random process? It is quite
remarkable that all such conditions are covered by the consistency condition (2).
Theorem 1 (Kolmogorov's Theorem on the Existence of a Process). Let
{Ftl In(xlf..., x„)}, with tieT^R,tl < t2 < • • • < r„, n > 1, be a given
family of finite-dimensional distribution functions, satisfying the consistency
condition (2). Then there are a probability space (CI, SF, P) and a random
process X = (£t)teT sucn tnat
P{cd: Ztl <xx,...Atn< x„} = Ftl... Jxj • • - x„). (3)
§9. Construction of a Process with Given Finite-Dimensional Distribution
247
Proof. Put
Q = RT, & = @(RT\
i.e. take Q to be the space of real functions cd = (cDt)teT with the <r-algebra
generated by the cylindrical sets.
Let t = [tl9..., £„], tx < t2 < • • ■ < t„. Then by Theorem 2 of §3 we can
construct on the space (/?", ^(R")) a unique probability measure Pr such that
Pt{(a>tl, - • •, <*J: cotl < xu ..., (Dtn < xn) = Ftl...fn(Xi,..., x„). (4)
It follows from the consistency condition (2) that the family {Pr} is also
consistent (see (3.20)). According to Theorem 4 of §3 there is a probability
measure P on (RT, @(RT)) such that
P{G>:(G>tti...,cDjeB} = Pt(B)
for every set t = [tu .. , £„], tt < • • • < t„.
From this, it also follows that (4) is satisfied. Therefore the required
random process X = (£t(co))teT can be taken to be the process defined by
&(co) = co„ teT. (5)
This completes the proof of the theorem.
Remark 1. The probability space (RT, @(RT), P) that we have constructed
is called canonical, and the construction given by (5) is called the coordinate
method of constructing the process.
Remark 2. Let (£a, <fa) be complete separable metric spaces, where a belongs
to some set 31 of indices. Let {Pr} be a set of consistent finite-dimensional
distribution functions Pr, t = [ax,..., a„] on
(£ai x •••x^,^®...®^).
Then there are a probability space (£2, #", P) and a family of «F/<f a-measurable
functions (ATa(co))ae2I such that
P{(Xai,...,Xan)eB} = Pr(B)
for all t = [a„..., a„] and B e Sa ® ■ ■ ■ ® San.
This result, which generalizes Theorem 1, follows from Theorem 4 of §3
if we put Q = Y\a E*> & = Ha &a ancl ^<x(w) = w<x for each co = co(coa), a 8¾.
Corollary 1. Let Fx(x), F2(x), ...be a sequence of one-dimensional distribution
functions. Then there exist a probability space (CI, fF, P) and a sequence of
independent random variables £u £2> ■ ■ ■ sucn tnat
P{a>: £(co) <x} = F£(x). (6)
248
II. Mathematical Foundations of Probability
In particular, there is a probability space (Q, &, P) on which an infinite
sequence of Bernoulli random variables is defined (in this connection see
Subsection 2 of §5 of Chapter I). Notice that Q can be taken to be the space
Q = {cd: cd = (al3 a2,...), «i = 0, 1}
(cf. also Theorem 2).
To establish the corollary it is enough to put Fx „(xi, • • •, xn) =
Fi(xi)-- F„(x„) and apply Theorem 1.
Corollary 2. Let T = [0, oo) and let {p(s, x; t, B} be a family of nonnegative
functions defined for s, t e T, t > s, x e K, J3 e @(R\ and satisfying the following
conditions:
(a) p(s, x\t,B) is a probability measure on Bfor given s, x and t;
(b) for given s, t and B, the function p(s, x\t,B) is a Borel function of x;
(c) for 0 < s < t < t and B e @(R), the Kolmogorov-Chapman equation
p(s, x;x,B)= p(s, x; t, dy)p(t, y,r,B) (7)
Jr
is satisfied.
Also let n = n(B) be a probability measure on (R, <%(R)). Then there are
a probability space (£2, &, P) and a random process X = (<!;,), >0 defined on
it, such that
P{ti0 ^ *o, £, <xu...,£tn<xn} = f ° n(dy0) f ' p(0, y0; t„ dyx)
J — 00 J — 00
' p(tn-l,yn-l;tn,dyn) (8)
-r
for 0 = t0 < 11 < • • • < t„.
The process X so constructed is a Markov process with initial distribution
7r and transition probabilities {p(s, x; t, J3}.
Corollary 3. Let T= {0, 1, 2,...} aw/ tet {Pk(x; B)} be a family of non-
negative functions defined for k > 1, xei?, Be<%(R\ and such that pk(x\ J3)
is a probability measure on B (for given k and x) and measurable in x (for
given k and B). In addition, let n = n(B) be a probability measure on (R, 8d(R)).
Then there is a probability space (Q, 8F, P) with a family of random
variables X = {£0, f,,...} defined on it, such that
Erxi
<dy0) P^yoit^dy,)
o J — oo
••• f p(t„-„^-i;tn,rf3;n)
J-oo
§9. Construction of a Process with Given Finite-Dimensional Distribution
249
3. In the situation of Corollary 1, there is a sequence of independent random
variables £t,£2i... whose one-dimensional distribution functions are
F,, F2,..., respectively.
Now let (El9 £x\ (E2i S2\ • • • be complete separable metric spaces and
let Pu P2, •.. be probability measures on them. Then it follows from Remark
2 that there are a probability space (£2, 3F, P) and a sequence of independent
elements Xl,X2,... such that X„ is #7<f„-measurable and P(X„eB) =
Pn(B),BE<?„.
It turns out that this result remains valid when the spaces (E„, <f„) are
arbitrary measurable spaces.
Theorem 2 (Ionescu Tulcea's Theorem on Extending a Measure and the
Existence of a Random Sequence). Let (Q„, ^„), n = 1, 2,..., be arbitrary
measurable spaces and £2 = f} £2„, 2F = js{ ,^. Suppose that a probability
measure P1 is given on (Ql7 #i) and that, for every set (cdu ..., cd„)eQ1 x
• • • x Qn, n > 1, probability measures P^,..., co„; •) are given on (Q, +,, J* + j).
Suppose that for every Be^„+1 the functions P{mu ..., co„; B) are Borel
functions on (co^ ..., a>„)and let
Pn(Ai x ••• x An) = f P^dwJ f Pico^dwJ
Jai Ja2
••• I P(cDu...,con-l;dcDn) Axe&x, n>l. (9)
Then there is a unique probability measure P on (CI, &*) such that
P{cd: 0)^4..., cd„eA„} = P„(AX x • ■ • x A„) (10)
for every n > 1, and there is a random sequence X = (Xi(co), X2(cd), ...)
such that
P{co: X^co) eAu..., X„(cd) e An} = P„(Ai x • • • x An), (11)
where Axe Sx.
Proof. The first step is to establish that for each n > 1 the set function P„
defined by (9) on the rectangle Ax x ••• x A„ can be extended to the
<r-algebra ^, ® • • • ® ^,.
For each n > 2 and B e ^ (g) • • • (g) &n we put
pn(B)= r PiidcDj r PicD^dcoj r ^,...,^.2^^.0
x Ib(o>u • • •. (0„)P{col3 • • • > u>n-1 '■> du>„). (12)
Jo,
It is easily seen that when B = Ax x • • • x A„ the right-hand side of (12)
is the same as the right-hand side of (9). Moreover, when n = 2 it can be
250
II. Mathematical Foundations of Probability Theory
shown, just as in Theorem 8 of §6, that P2 is a measure. Consequently it is
easily established by induction that P„ is a measure for all n > 2.
The next step is the same as in Kolmogorov's theorem on the extension of
a measure in (Rm, 3B(Rm)) (Theorem 3, §3). Thus for every cylindrical set
jn(E) = {cdeQ: (cdu ..., cd„)e£}, Be^®--®^,, we define the set
function P by
P(J„(B)) = Pn(B). (13)
If we use (12) and the fact that P(col3..., cok; •) are measures, it is easy to
establish that the definition (13) is consistent, in the sense that the value of
P(J„(B)) is independent of the representation of the cylindrical set.
It follows that the set function P defined in (13) for cylindrical sets, and in
an obvious way on the algebra that contains all the cylindrical sets, is a
finitely additive measure on this algebra. It remains to verify its countable
additivity and apply Caratheodory's theorem.
In Theorem 3 of §3 the corresponding verification was based on the
property of (/?", 38(Rn)) that for every Borel set B there is a compact set
A c B whose probability measure is arbitrarily close to the measure of B.
In the present case this part of the proof needs to be modified in the following
way.
As in Theorem 3 of §3, let {B„}„^.l be a sequence of cylindrical sets
that decrease to the empty set 0, but have
lim P(£„) > 0. (14)
n-»oo
For n > 1, we have from (12)
p(4) = f /i"(»,)P.(rf«>,),
where
/iVl) = f Pfrl \dw2)" f IBn(cDu ..., CD„)P(CD2, . . . , CO„_ 1 i dcD„).
Since Bn+1 c fini We have Bn+l c B„ x Qn+l and therefore
'Bn+,(a>i, •.., con+1) < IBn(wl9..., co„)/nn+l(con+1).
Hence the sequence {/!,1)(co1)}nsl decreases. Let /(1)(g>i) = lim„/^(^0-
By the dominated convergence theorem
lim P0„) = lim f fXKiDtyPiidtOi) = f /^(^)^1(^).
n n JSli Jfl,
By hypothesis, lim„ P(B„) > 0. It follows that there is an co? e B such that
/(1)(co?) > 0, since if col$Bl then fXKioJ = 0 for n > 1.
§9. Construction of a Process with Given Finite-Dimensional Distribution 251
Moreover, for n > 2,
/iV!)= f /£2)(^2)PK;dco2), (15)
where
f(n\oy2)= \ P(CD°uCD2;dcD3)
••' \ hS<»U G>2 , • • • , C0n)P(C0°u W2, . . ., C0„_ i, </«„).
•'An
We can establish, as for {/^(coi)}* that {/i,2,(co2)} is decreasing. Let
f2\w2) = lim^^ f™(cD2). Then it follows from (15) that
0 < /«>K) = f /<2>(co2)P« rfco2),
Jsi2
and there is a point w02eO.2 such that /(2)(co2) > 0. Then (co?, co5)eB2-
Continuing this process, we find a point (co?,..., coJJ) e B„ for each n.
Consequently (coj,..., co°,...) e f) B„, but by hypothesis we have f] B„ = 0.
This contradiction shows that lim„ P(B„) = 0.
Thus we have proved the part of the theorem about the existence of the
probability measure P. The other part follows from this by putting X„(cd)
= co„,n > 1.
Corollary 1. Let (E„, <f„)n>i be any measurable spaces and 0P„)„>i, measures
on them. Then there are a probability space (CI, SF, P) and a family of
independent random elements XUX2,... with values in (EX,S{), (E2, S2\ ...,
respectively, such that
P{w: X„(cd)eB} = Pn(B), BeS„,n>\.
Corollary 2. Let E = {1, 2,...}, and let {pk(x, y)} be a family of nonnegative
functions, k > 1, x,yeE, such that £ye£ pk(x; y) = 1, xeE, k > 1. Also
let n = n(x)be a probability distribution on E (that is,n(x) > 0,£xe£7r(x) = 1).
Then there are a probability space (Q, &, P) and a family X = {£0, ^,...}
of random variables on it, such that
P{£o = x0,£l = xl,...,£„ = x„} = n(x0)pl(x0, xx) • • ■ pn(x«-i, x„) (16)
(cf. (1.12.4)) for all xt e E and n > 1. We may take Q to be the space
Q = {CD'.CD = (X0,X!, . . .), X,- E E}.
A sequence X = {£0, ^,...} of random variables satisfying (16) is a
Markov chain with a countable set E of states, transition matrix {pk(x, y)}
and initial probability distribution n. (Cf. the definition in §12 of Chapter I.)
252
II. Mathematical Foundations of Probability Theory
4. Problems
1. Let Q = [0,1], let & be the class of Borel subsets of [0,1], and let P be Lebesgue
measure on [0,1]. Show that the space (CI, &, P) is universal in the following sense.
For every distribution function F(x) on (O, SF, P) there is a random variable £ = £(fi>)
such that its distribution function F£x) = P(£ < x) coincides with F(x). (Hint.
£(co) = F'^coX 0 < Q)< 1, where F'1^) = sup{x: F(x) < co}, when 0 < co< 1,
and £(0), £(1) can be chosen arbitrarily.)
2. Verify the consistency of the families of distributions in the corollaries to Theorems
1 and 2.
3. Deduce Corollary 2, Theorem 2, from Theorem 1.
§10. Various Kinds of Convergence of Sequences
of Random Variables
1. Just as in analysis, in probability theory we need to use various kinds of
convergence of random variables. Four of these are particularly important:
in probability, with probability one, in mean of order p, in distribution.
First some definitions. Let £, £u £2. • • ■ be random variables defined on a
probability space (CI, &, P).
Definition 1. The sequence £l3 £2>--- of random variables converges in
probability to the random variable £ (notation: £„ -¾ ¢) if for every e > 0
P{|&-£l >«}-><>, n^oo. (1)
We have already encountered this convergence in connection with the
law of large numbers for a Bernoulli scheme, which stated that
\\n \ I
(see §5 of Chapter I). In analysis this is known as convergence in measure.
Definition 2. The sequence f lf ^2,... of random variables converges with
probability one (almost surely, almost everywhere) to the random variable
£if
P{a>: 6,-^0=0, (2)
i.e. if the set of sample points cd for which £„(cd) does not converge to £ has
probability zero.
This convergence is denoted by £„ -»• £ (P-a.s.), or £„ ^> £ or £„ "' ^.
§10. Various Kinds of Convergence of Sequences of Random Variables
253
Definition3. The sequence flf f2.--- °f random variables converges in
mean of order p, 0 < p < oo, to the random variable £ if
E|£,-£|*^0, n-^oo. (3)
In analysis this is known as convergence in LP, and denoted by £„ £ £.
In the special case p = 2 it is called mean square convergence and denoted by
£ = Li.m. £„ (for "limit in the mean").
Definition^ The sequence ^, £2,... of random variables converges in
distribution to the random variable £ (notation: £„ -4 ^) if
E/(&)-> E/(0, n-^oo, (4)
for every bounded continuous function f = f(x). The reason for the
terminology is that, according to what will be proved in Chapter III, §1,
condition (4) is equivalent to the convergence of the distribution F§n(x) to
F£x) at each point x of continuity of F^(x). This convergence is denoted by
We emphasize that the convergence of random variables in distribution
is defined only in terms of the convergence of their distribution functions.
Therefore it makes sense to discuss this mode of convergence even when the
random variables are defined on different probability spaces. This
convergence will be studied in detail in Chapter III, where, in particular, we
shall explain why in the definition of F§n => F§ we require only convergence
at points of continuity of F§(x) and not at all x.
2. In solving problems of analysis on the convergence (in one sense or
another) of a given sequence of functions, it is useful to have the concept of a
fundamental sequence (or Cauchy sequence). We can introduce a similar
concept for each of the first three kinds of convergence of a sequence of
random variables.
Let us say that a sequence {£„}„^ 1 of random variables is fundamental in
probability, or with probability 1, or in mean of order, p, 0 < p < 00, if the
corresponding one of the following properties is satisfied: P{\£„ — £m\ > e}
-*■ 0, as m, n -» 00 for every s > 0; the sequence {^o>)}Mai is fundamental
for almost all coeCl; the sequence {^„(co)}„>1 is fundamental in LP, i.e.
E16, - £m\p -> 0 as n, m -> 00.
3. Theorem 1.
(a) A necessary and sufficient condition that £„ -*■ £ (P-a.s.) is that
pjsup|£k-£l >el-0, n-00. (5)
for every e > 0.
254
II. Mathematical Foundations of Probability Theory
pfef'"*,la£H n^
(b) The sequence (On^i is fundamental with probability 1 if and only if
(6)
for every e > 0; or equivalently
P{sup|£n+k - U > s\ -* 0, n -* oo. (7)
Proof, (a) Let Azn = {cd: |£„ - £| > e}, Az = I5ii4« = f|"=i LU« ^*-Then
£^0 m=l
But
P(X£) = lim P|
Hence (a) follows from the following chain of implications:
(b) Let
ain of imp]
0 = P{cd: £„ -A Q = P( |J A*J *> p( 0 ^1/m) = °
*> P(A1,m) = 0, w > 1 o P(X£) = 0, e > 0,
opA)^)->°. n^oooP^sup|4-^|>e^O,
n -*■ oo.
^., = {fi>:|&-6l^«}. ** = 0 U Bli-
n-l fc^n
Then {cd: {£,,(^)),,^ is not fundamental} = (J£>0 B\ and it can be shown
as in (a) that P{cd: {£,,(^)),,^ is not fundamental} = 0o(6). The
equivalence of (6) and (7) follows from the obvious inequalities
sup|£n+k - U < sup|£n+k - £n+l\ < 2sup|£n+k - £„|.
fc^O fc^O fc^O
This completes the proof of the theorem.
Corollary. Since
p{sup|£k - £| > e\ = PJU (Kk " £1 > fi)j < ZP(I& ~ ^1 ^ «>.
§10. Various Kinds of Convergence of Sequences of Random Variables
255
a sufficient condition for £„ ^+ £ is that
£p{|£k-£|>£}< oo (8)
is satisfied for every e > 0.
It is appropriate to observe at this point that the reasoning used in
obtaining (8) lets us establish the following simple but important result which
is essential in studying properties that are satisfied with probability 1.
Let Alt A2i... be a sequence of events in F. Let (see the table in §1)
{A„ i.o.} denote the event Tm\A„ that consists in the realization of infinitely
many of Al9 A2i
Borel-Cantelli Lemma.
(a) //£ P(AJ < oo then P{An lo.} = 0.
(b) lfYj p(^n) = °° and AlyA2i... are independent, then P{A„ i.o.} = 1.
Proof, (a) By definition
{Ani.o.}=mAn= 0 \JAk.
n— 1 fc^n
Consequently
P{An Lo.} = p{ 0 U 4k j = lim p( U A] * lim Z p(A)>
and (a) follows.
(b) If AlfA2i... are independent, so are Au A2i Hence for N > n
we have
/ N \ N
\k = n / k = n
and it is then easy to deduce that
p(rU)=f[po4*)- (9)
Since log(l — x) < —x, 0 < x < 1,
log fl [1 - P(A)] = £ log[l - P(A)] < - I P(^4k) =-oo.
fc = n k = n k = n
Consequently
plo/>)=°
for all n, and therefore P(X„ i.o.) = 1.
This completes the proof of the lemma.
256
II. Mathematical Foundations of Probability Theory
Corollary 1. If A* = {co: |£„ - 0 > e} then (8) shows that Y?-i p(>U < °°»
e > 0, and rten fry rJze Borel-Cantelli lemma we have 9(4") = 0, e > 0, where
A£ = iim A%. Therefore
Z P(I4 - ^1 > e} < oo, e > 0 => P(A£) = 0, e > 0
=>P{co: £,^0) = 0,
as we already observed above.
Corollary 2. Let (e„)na. t be a sequence of positive numbers such that e„ 10,
n-^ co. If
£p{I4-0 >£„}<«>, (1°)
n= 1
In fact, let X„ = {|£„ - 0 > £„}. Then P(A„ i.o.) = 0 by the Borel-
Cantelli lemma. This means that, for almost every co eQ, there is an N =
N(co) such that \£„(co) - £(co)| < e„ for n > N(co). But e„ J, 0, and therefore
£„(co) -> ^(co) for almost every co e £2.
4. Theorem 2. Wie toe the following implications:
£,^ =>£,££, (11)
£,-£=>£,-^, P>0, (12)
£,-^=>£,-^. (13)
Proof. Statement (11) follows from comparing the definition of convergence
in probability with (5), and (12) follows from Chebyshev's inequality.
To prove (13), let/(x) be a continuous function, let \j (x)| < c, let e > 0,
and let N be such that P(|0 > N) < e/4c. Take 3 so that |/(x) - f(y)\ <
e/2c for |x| < N and |x — y\ < 3. Then (cf. the proof of Weierstrass's
theorem in Subsection 5, §5, Chapter I)
E|/«J -/(01 = Ed/gJ -/(01; It, - ¢1 < <5, |C| < N)
+ E(|/(« -/(01; 16, - «1 < S, 10 > JV)
+ E(|/(4)-/(OI;l4-0>5)
<e/2 + e/2 + 2cP{|^-i|>5}
= fi + 2cP{|£,-0>*}.
But P{|£„ - ¢1 > 5} -^ 0, and hence E|/(£„) - /(£)| < 2e for sufficiently
large n; since e > 0 is arbitrary, this establishes (13).
This completes the proof of the theorem.
We now present a number of examples which show, in particular, that the
converses of (11) and (12) are false in general.
§10. Various Kinds of Convergence of Sequences of Random Variables 257
ExampleI (i-S^fi,**; £>«*&=*& Let Q = [0, 1], ^ =
^([0,1])» P = Lebesgue measure. Put
= [^4 fi = /^(o>X i = 1,2,...,n;n>l.
4
Then the sequence
Ki;«,£i;«,£i,£i;-»}
of random variables converges both in probability and in mean of order
p > 0, but does not converge at any point co e [0,1].
ExAMPLE2(£n^£^£nA<^£n^£p>0). Again let Q = [0,1], ^ =
, 1], P = Lebesgue measure, and let
,, , fe", 0 < co < 1/n,
^) = {o, co > 1/n.
Then {£„} converges with probability 1 (and therefore in probability) to
zero, but
e"p
E|£JP = > oo, n->oo,
for every p > 0.
Example 3 (f „ ^*
variables with
£ =££/-$£). Let (U be a
P(£„ = 1) = P.,
Then it is easy to show that
^0op„
iSoop,
p(£„
-►0,
-►0,
seque
= 0)
n->
n-*
;nce
= 1
00,
00,
(14)
(15)
4-0=>fpn<oo. (16)
n= 1
In particular, if p„ = 1/n then <!;„ — 0 for every p > 0, but <!;„ -£ 0.
The following theorem singles out an interesting case when almost sure
convergence implies convergence in L1.
Theorem 3. Let (<!;„) be a sequence of nonnegative random variables such that
£, =* £ and E£„ -+ E£ < oo. Then
E|£„ -^1-0, n-oo. (17)
258
II. Mathematical Foundations of Probability Theory
Proof. We have E^„ < oo for sufficiently large n, and therefore for such n
we have
E |{? - U = E« - «J/KiW + E(£„ - Q/(fii>0
= 2E(£ - 4)W> + E(fi, - ©.
But 0 ^ (£ - ^n)/{^w ^ £. Therefore, by the dominated convergence
theorem, lim„ E(£ - £„)/{^ > = 0» which together with E£„ -> E£ proves
(17).
Remark. The dominated convergence theorem also holds when almost sure
convergence is replaced by convergence in probability (see Problem 1).
Hence in Theorem 3 we may replace "<!;„ ^+ f" by "<!;„ -½ f."
5. It is shown in analysis that every fundamental sequence (x„), xn e /?, is
convergent (Cauchy criterion). Let us give a similar result for the convergence
of a sequence of random variables.
Theorem 4 (Cauchy Criterion for Almost Sure Convergence). A necessary and
sufficient condition for the sequence (£„)„^i of random variables to converge
with probability 1 (to a random variable £) is that it is fundamental with
probability 1.
Proof. If£„"£ then
sup|£k - f,| < sup 16 - £| + sup 16 - £|,
k>n kz.n l^n
whence the necessity follows.
Now let (£n)n:>i be fundamental with probability 1. Let JT = {co: (£„(co))
is not fundamental}. Then whenever coeQ\^r the sequence of numbers
(6(w))n2li is fundamental and, by Cauchy's criterion for sequences of
numbers, lim £„(($) exists. Let
„.£■«» :-«
The function so defined is a random variable, and evidently £„ ^4" ^.
This completes the proof.
Before considering the case of convergence in probability, let us establish
the following useful result.
Theorem 5. If the sequence (<!;„) is fundamental (or convergent) in probability,
it contains a subsequence (£„k) that is fundamental (or convergent) with
probability 1.
§10. Various Kinds of Convergence of Sequences of Random Variables
259
Proof. Let (f n) be fundamental in probability. By Theorem 4, it is enough
to show that it contains a subsequence that converges almost surely.
Take nx — 1 and define nk inductively as the smallest n > nk_ x for which
P{|£-£|>2-k}<2-\
for all s > n, t > n. Then
ZP{l<U+.-£J>2-k}<Z2-k<a)
k
and by the Borel-Cantelli lemma
P{l^fc + .-4J>2-ki.o.} = 0.
Hence
00
fc=l
with probability 1.
Let Jf = {co: J |£„fc + l - f J = oo}. Then if we put
fc=l
o, co e ^r,
we obtain ^ ^ £.
If the original sequence converges in probability, then it is fundamental in
probability (see also (19)), and consequently this case reduces to the one
already considered.
This completes the proof of the theorem.
Theorem 6 (Cauchy Criterion for Convergence in Probability). A necessary
and sufficient condition for a sequence (<!;„)„>. i ofrandom variables to converge in
probability is that it is fundamental in probability.
Proof. If £„ £ f then
P{l£, " U > £} < P{\L ~ £1 > £/2} + P{lk - ¢1 > £/2} (19)
and consequently (<!;„) is fundamental in probability.
Conversely, if (£„) is fundamental in probability, by Theorem 5 there are
a subsequence (£„k) and a random variable ^ such that £^ ^ c;. But then
Pflft, - ¢1 > e} < Pflft, - ftj > e/2} + P{|£„fc - ¢1 > a/2},
from which it is clear that £„ -½ ^. This completes the proof.
Before discussing convergence in mean of order p, we make some
observations about Lp spaces.
**>) =
260
11. Mathematical Foundations of Probability Theory
We denote by Lp = LP(Q, &\ P) the space of random variables £, = £(w)
with E|^|p = Jn |^|p dP < oo. Suppose that p > 1 and put
imip = (Eim1/p-
It is clear that
imiP>o, (20)
Kllp=|c|||£||p, c constant, (21)
and by Minkowski's inequality (6.31)
U + n\\P < U\\P + IMIp- (22)
Hence, in accordance with the usual terminology of functional analysis, the
function ||||p, defined on LP and satisfying (20)-(22), is (for p > 1) a semi-
norm.
For it to be a norm, it must also satisfy
ll£llp = 0=>£ = 0. (23)
This property is, of course, not satisfied, since according to Property H
(§6) we can only say that £ = 0 almost surely.
This fact leads to a somewhat different view of the space LP. That is, we
connect with every random variable £ e LP the class [£] of random variables
in LP that are equivalent to it (c and n are equivalent if £ = n almost surely).
It is easily verified that the property of equivalence is reflexive, symmetric,
and transitive, and consequently the linear space LP can be divided into
disjoint equivalence classes of random variables. If we now think of [Lp] as
the collection of the classes [£] of equivalent random variables £ e LP, and
define
K] + M = K + n\
«K] = \.a^\ where a is a constant,
raip=imip>
then [Lp] becomes a normed linear space.
In functional analysis, we ordinarily describe elements of a space [Lp], not
as equivalence classes of functions, but simply as functions. In the same way
we do not actually use the notation [Lp]. From now on, we no longer think
about sets of equivalence classes of functions, but simply about elements,
functions, random variables, and so on.
It is a basic result of functional analysis that the spaces LP, p > 1, are
complete, i.e. that every fundamental sequence has a limit. Let us state and
prove this in probabilistic language.
Theorem 7 (Cauchy Test for Convergence in Mean pth Power). A necessary
and sufficient condition that a sequence (£n)r.>i of random variables in V
§10. Various Kinds of Convergence of Sequences of Random Variables 261
convergences in mean of order ptoa random variable in Lp is that the sequence
is fundamental in mean of order p.
Proof. The necessity follows from Minkowski's inequality. Let (4) be
fundamental (||4 — £J|p -»• 0, n, m -*■ oo). As in the proof of Theorem 5, we
select a subsequence (4fc) such that 4fc ^+ ^, where ^ is a random variable with
Kh < oo.
Let nx = 1 and define nk inductively as the smallest n > nk-l for which
Ut-UP<2~2k
for all s > n, t > n. Let
Ak={cD:\t„k + l-ZJ>2-k}.
Then by Chebyshev's inequality
P(Ak) <: '*■»♦;.» €°J 5= -^w = 2"" £ 2"'.
As in Theorem 5, we deduce that there is a random variable £ such that
*»nfc «»•
We now deduce that ||4 — ^||p -» 0 as n -» oo. To do this, we fix e > 0
and choose N = N(e) so that 114 - 4,11^, < £ for all n > N, m > N. Then for
any fixed n > N, by Fatou's lemma,
E |4 - W = e{ lim |4 - 4Jpj = e{ lim |4 - £J*|
< lim E |4 - 4JP = Urn ||4 - 4J|J < e.
nic-»oo nic-»oo
Consequently E |4 — £|p -► 0, n -► oo. It is also clear that since £ = (<!; — 4)
+ 4 we nave E | ^ |p < oo by Minkowski's inequality.
This completes the proof of the theorem.
Remark 1. In the terminology of functional analysis a complete normed
linear space is called a Banach space. Thus Lp,p> 1, is a Banach space.
Remark 2. If 0 < p < 1, the function ||£||p = (£|£|p)1/p does not satisfy the
triangle inequality (22) and consequently is not a norm. Nevertheless the
space (of equivalence classes) Lp, 0 < p < 1, is complete in the metric
Remark 3. Let Lm — L°°(f2, ^ P) be the space (of equivalence classes of)
random variables ^ = £(co) for which \\£\\m < oo, where H^IL, the essential
supremum of ^, is defined by
IKIIco = ess sup|£| = inf{0 < c < oo: P(|£| > c) = 0}.
The function || -1| ro is a norm, and .C00 is complete in this norm.
262
II. Mathematical Foundations of Probability Theory
6. Problems
1. Use Theorem 5 to show that almost sure convergence can be replaced by
convergence in probability in Theorems 3 and 4 of §6.
2. Prove that L°° is complete.
3. Show that if £„ £ f and also £„ £ rj then f and r\ are equivalent (P(f ¥" *l) = 0).
4. Let f „ £ f, »/„ £ f/, and let ^ and rj be equivalent. Show that
P{l6.-tfnl ^ £)-^0, /7->oo,
for every £ > 0.
5. Let £„ £ ¢, *7n £ »7. Show that a£„ + brj„ & a£ + brj (a, b constants), |£,| £ |f |,
6. Let (£, - O2 -> 0. Show that f 2 - f 2.
7. Show that if £„ -4 C, where C is a constant, then this sequence converges in
probability:
8. Let (On^, have the property that Y%=i El£.lp < °° for some P > °- Show that
f „ -► 0 (P-a.s.).
9. Let (^n)n2: i be a sequence of independent identically distributed random variables.
Show that
E|«il<ooof P{|{,| >e-n}< oo
n=l
oVpi pM >el<oo=> —->0 (P-a.s.).
.-i ll«l J n
10. Let (f n)n .> i be a sequence of random variables. Suppose that there are a random
variable f and a sequence {nj such that £„k -> ^ (P-a.s.) and maxnk_, <,<nk | £i — &*-, I -*0
(P-a.s.) as k ->• oo. Show that then £„ -> £ (P-a.s.).
11. Let the d-metric on the set of random variables be defined by
and identify random variables that coincide almost surely. Show that convergence
in probability is equivalent to convergence in the d-metric.
12. Show that there is no metric on the set of random variables such that convergence
in that metric is equivalent to almost sure convergence.
§11. The Hilbert Space of Random Variables with
Finite Second Moment
1. An important role among the Banach spaces Lp, p > 1, is played by the
space L2 = L2(Q, &\ P), the space of (equivalence classes of) random
variables with finite second moments.
§11. The Hilbert Space of Random Variables with Finite Second Moment
263
If i and n e L2, we put
(&if)sEft. (1)
It is clear that if ^, n, £ e L2 then
{at, + to,, 0 = o«, 0 + «1. ft ^ * e jR,
«, 0 > 0
and
(^ = 00^ = 0.
Consequently (<!;, 17) is a sca/ar product. The space L2 is complete with
respect to the norm
imi = «, o1/2 (2)
induced by this scalar product (as was shown in §10). In accordance with the
terminology of functional analysis, a space with the scalar product (1) is a
Hilbert space.
Hilbert space methods are extensively used in probability theory to study
properties that depend only on the first two moments of random variables
("L2-theory"). Here we shall introduce the basic concepts and facts that will
be needed for an exposition of L2-theory (Chapter VI).
2. Two random variables £ and n in L2 are said to be orthogonal (^ _L n)
if (<!;, ri) = Eft = 0. According to §8, q and r\ are uncorrected ifcov(^, r\) = 0,
i.e. if
Eft = EflEij.
It follows that the properties of being orthogonal and of being uncorrelated
coincide for random variables with zero mean values.
A set M c L2 is a system of orthogonal random variables if c _L 17 for
every £, 17 e M (£ 96 »7).
If also || £11 = 1 for every (eM, then M is an orthonormal system.
3. Let M = {17 x,..., n„) be an orthonormal system and £ any random
variable in L2. Let us find, in the class of linear estimators £"=, 0,17,-, the best mean-
square estimator for £ (cf. Subsection 2, §8).
A simple computation shows that
£ - I Wi
II " II2 / \
II «=1 II \ «'=i «'=i /
= IKII2 - 2 t °fo nd + ft fliiji. I <vjf)
i=l \i=l i=l /
i=l i = l
264
II. Mathematical Foundations of Probability Theory
= imi2 - zi«.iji>ia+ Ii«.- -«. ^)i2
1=1 1=1
>l|{||2-11(4.^)12. <3)
i = l
where we used the equation
af - 2afo m) = I a, " «, ld\2 ~ l(& If)I2-
It is now clear that the infimum of E|f - Z"=i 0,17,-12 over all real
fli,..., a„ is attained for a,- = (<!;, 17,), i = 1,..., n.
Consequently the best (in the mean-square sense) estimator for f in terms
of i?!,...,!?„ is
1=1(4.¾.½. (4)
1=1
Here
A = infE
£ - z *<?*
= E|£-||2=||£||2- £ |«, i?,-)!2 (5)
(compare (1.4.17) and (8.13)).
Inequality (3) also implies BesseVs inequality: if M = Oh,*h>---} *s
an orthonormal system and ^ e L2, then
Z 1(^^.)12 ^ imi2; (6)
i=l
and equality is attained if and only if
£ = l.i.m. Z (& ij*i. (7)
The fcesf /mear estimator of ^ is often denoted by £(£1^,..., *7„) and called
the conditional expectation (of £ with respect to i^,..., 17J in the wide sense.
The reason for the terminology is as follows. If we consider all estimators
<P ~ <P0h> - - -. In) of £ in terms of rjl3..., nn (where q> is a Borel function),
the best estimator will be <p* = E(£ | nu..., 17,,), i.e. the conditional expectation
of £ with respect to nu ..., n„ (cf. Theorem 1, §8). Hence the best linear
estimator is, by analogy, denoted by §(£| 17^...,17,,) and called the
conditional expectation in the wide sense. We note that if nu ..., n„ form a
Gaussian system (see §13 below), then E(£\tjl3..., rjn) and £(£1^,..., n„)
are the same.
Let us discuss the geometric meaning of \ — E(£ 1^,..., n„).
Let <£ = y{nu ...,n„} denote the linear manifold spanned by the ortho-
normal system of random variables n1,...ir]n (i.e., the set of random
variables of the form JJ^ x 0,17,-, a{ e R).
§11. The Hilbert Space of Random Variables with Finite Second Moment
265
Then it follows from the preceding discussion that £ admits the
"orthogonal decomposition"
£ = { + «-& (8)
where | e S£ and ^ — | _L S£ in the sense that ^ — | _L A for every A e <£.
It is natural to call | the projection of <!; on j£? (the element of JS? "closest"
to 0, and to say that ^ — {is perpendicular to JSP.
4. The concept of orthonormality of the random variables nu...,nn makes it
easy to find the best linear estimator (the projection) | of ^ in terms of
iji,..., n„. The situation becomes complicated if we give up the hypothesis of
orthonormality. However, the case of arbitrary nx,..., n„ can in a certain
sense be reduced to the case of orthonormal random variables, as will be
shown below. We shall suppose for the sake of simplicity that all our random
variables have zero mean values.
We shall say that the random variables rjl9..., n„ are linearly independent
if the equation
i>,ij, = 0 (P-a.s.)
i=l
is satisfied only when all a, are zero.
Consider the covariance matrix
U =EiyiyT
of the vector n = (1/1,...,1/,,). It is symmetric and nonnegative definite,
and as noticed in §8, can be diagonalized by an orthogonal matrix <9\
0TIR0 = D,
where
has nonnegative elements di9 the eigenvalues of IR, i.e. the zeros X of the
characteristic equation det(IR — IE) = 0.
If rjl3..., nn are linearly independent, the Gram determinant (det IR) is
not zero and therefore dt > 0. Let
and V ° ''>/**)
P = B~lO'Tri. (9)
Then the covariance matrix of /J is
E0Pr = B~l &rEr]r]rOB-1 = JT^IKPJr1 = £,
and therefore /? = (fiu ..., /?„) consists of uncorrected random variables.
266
II. Mathematical Foundations of Probability Theory
It is also clear that
n = (&B)P. (10)
Consequently if nu ..., n„ are linearly independent there is an orthonormal
system such that (9) and (10) hold. Here
This method of constructing an orthonormal system ft, ..., ft is
frequently inconvenient. The reason is that if we think of i/£ as the value of the
random sequence (nu..., n„) at the instant i, the value ft constructed above
depends not only on the "past," (1^,...,1/1), but also on the "future,"
(i/h-i, -.., *7„). The Gram-Schmidt orthogonalization process, described
below, does not have this defect, and moreover has the advantage that it can
be applied to an infinite sequence of linearly independent random variables
(i.e. to a sequence in which every finite set of the variables are linearly
independent).
Let ih, ifc» ■■• be a sequence of linearly independent random variables in
L2. We construct a sequence el3 e2,... as follows. Let ex = i/i/||)/ill* H"
£!,...,£„_! have been selected so that they are orthonormal, then
--*»-*■ (11)
" Ito.-WT
where f}„ is the projection of n„ on the linear manifold SP{eu...,en^^
generated by
n-l
% = Z Ok. e*K- (12)
fc=l
Since nu...,n„ are linearly independent and J£? {i/i,..., n„- x} =
JSf {fii,..., e„_ x}, we have ||i?„ — fj„\\ > 0 and consequently s„ is well defined.
By construction, ||e„|| = 1 for n > 1, and it is clear that (e„, ek) = 0 for
k < n. Hence the sequence el3 e2,... is orthonormal. Moreover, by (11),
n» = nn + Mn,
where fen = ||i?n — i)„|| and »Jn is defined by (12).
Now let nu ..., n„ be any set of random variables (not necessarily linearly
independent). Let det IR = 0, where IR == ||r,v|| is the covariance matrix of
Oh, - • •, 1n\ and let
rank IR = r < n.
Then, from linear algebra, the quadratic form
n
Q(a) = Yi rijaiaj> a = (fli. • • •, an\
has the property that there are n — r linearly independent vectors aa\ ...,
a(n~r) such that Q(ali)) = 0, i = 1,..., n - r.
§11. The Hilbert Space of Random Variables with Finite Second Moment 267
But
(2(^) = E( |>k>7kj .
Consequently
ZflFfc-OL 1-1.....n-r,
with probability 1.
In other words, there are n — r linear relations among the variables
i/i,..., *7„. Therefore if, for example, nu...,nr are linearly independent, the
other variables iyr+lJ..., i?„ can be expressed linearly in terms of them, and
consequently £?{tiu ...,*/„} = £f{el3..., er}. Hence it is clear that we can
find r orthonormal random variables e^ ..., er such that nu..., n„ can be
expressed linearly in terms of them and JSPfai, •-.,»/„} = &{&u • • •, er}.
5. Let ij1, *72.--- be a sequence of random variables in L2. Let 3? —
«^0h. *h.---} be the linear manifold spanned by tyls ife, ..-ii-e. the set of
random variables of the form £"=10,-1/,-, n ^ 1, ateR. Then j£? =
JSPfai, »/2. • • •} denotes the closed linear manifold spanned by nu rj2,...,
i.e. the set of random variables in S£ together with their mean-square limits.
We say that a set rjl3 rj2,... is a countable orthonormal basis (or a complete
orthonormal system) if:
(a) *7i. *72. • • • is an orthonormal system,
(b) ^,^,...) = 1,2.
A Hilbert space with a countable orthonormal basis is said to be separable.
By (b), for every £eL2 and a given e > 0 there are numbers als..., a„
such that
K - Z Oifo ^ £•
Then by (3)
II '•=! II
Consequently every element of a separable Hilbert space L2 can be
represented as
¢- I«.vi)-ifi. (13>
i= 1
or more precisely as
f = l.i.m. X (£, ^,-.
268
II. Mathematical Foundations of Probability Theory
We infer from this and (3) that ParsevaVs equation holds:
ll£l2=£ l(&i?i)l2, ZeL2. (14)
i=l
It is easy to show that the converse is also valid: if nu n2, • • - is an ortho-
normal system and either (13) or (14) is satisfied, then the system is a basis.
We now give some examples of separable Hilbert spaces and their bases.
Example 1. Let Q = R, & = ^(/?), and let P be the Gaussian measure,
P(—oo, a~\— <p(x)dx, <p(x) — —-= e~*2/2.
J-m ^/In
Let D = d/dx and
fljW.(-y n±0. (15)
(pyx)
We find easily that
D<p(x) — —x<p(x),
D2<p(x) = (x2 - l)<p(x\ (16)
D3<p(x) = (3x - x3)<p(xl
It follows that H„(x) are polynomials (the Hermite polynomials). From (15)
and (16) we find that
H0(x) = 1,
HiM = x,
H2(x) = x2 - 1,
H3(x) = x3 - 3x,
A simple calculation shows that
(Hm,H„)= f Hm(x)H„(x)dP
/•CO
HJx)Hn{x)q>{x)dx^n\dnm,
J-m
where <5mn is the Kronecker delta (0, if m ^ n, and 1 if m = n). Hence if we
put
^-¾
§11. The Hilbert Space of Random Variables with Finite Second Moment
269
the system of normalized Hermite polynomials {hn(x)}„^0 will be an ortho-
normal system. We know from functional analysis that if
lim f°° e^'P^x) <oo, (17)
c],0 J—m
the system {1, x, x2,...} is complete in L2, i.e. every function £ = ^(x) in L2
can be represented either as JJ=1 a^-fr), where j/,(x) = x', or as a limit of
these functions (in the mean-square sense). If we apply the Gram-Schmidt
orthogonalization process to the sequence ^(x), ni{x\ • • •. with n^x) = x1',
the resulting orthonormal system will be precisely the system of normalized
Hermite polynomials. In the present case, (17) is satisfied. Hence {h„(x)}„>.0
is a basis and therefore every random variable £ = ^(x) on this probability
space can be represented in the form
^(x) = l.i.m. £ (£, hMx). (18)
n £ = 0
Example 2. Let Q = {0, 1, 2,...} and let P = {Plf P2,...} be the Poisson
distribution
Px =——, x = 0, 1,...; A>0.
x!
Put Af (x) = f(x) - f(x - 1) (/(x) = 0, x < 0), and by analogy with (15)
define the Poisson-Charlier polynomials
(—1VA"P
n„(x) = 5L_ii—*, n;>i, n0 = i. (19)
*■ x
Since
(nm, n„) = f; nm(x)n„(x)px = c„5mn,
x=0
where c„ are positive constants, the system of normalized Poisson-Charlier
polynomials {7c„(x)}n2.0. nn(x) = n„(x)/v/c^, is an orthonormal system, which
is a basis since it satisfies (17).
Example 3. In this example we describe the Rademacher and Haar systems,
which are of interest in function theory as well as in probability theory.
Let Q = [0, 1], & — ^([0, 1]), and let P be Lebesgue measure. As we
mentioned in §1, every x e [0, 1] has a unique binary expansion
x = Y + ¥ + -'
where xf = 0 or 1. To ensure uniqueness of the expansion, we agree to
consider only expansions containing an infinite number of zeros. Thus we
270
£,(*),
II. Mathematical Foundations of Probability Theory
£z(x).
-I—I -I I >
—*» ^x
0 i 1 0 i i i 1 "
Figure 30
choose the first of the two expansions
1 _ 1 0 0 _ 0 1 1
2~2 + 22" + 2I + "*~2 + 22 + 23 + '"
We define random variables ^(x), ^(x), ... by putting
£„(*) = *«■
Then for any numbers a„ equal to 0 or 1,
P{x: ^=^,...,6, = ¾}
= p{x:xe[f
fl2 an ^1 ^2
22 2" 2 22
2" 2" J
5=+111-1.
2" 2"Jj 2"
It follows immediately that f lf f 2. ■ ■ ■ f°rm a sequence of independent Bernoulli
random variables (Figure 30 shows the construction of f x = ^x(x) and
£2 = £2(*)).
1
*z(*);
1
-+-► 0
M-^
Figure 31. Rademacher functions.
§11. The Hilbert Space of Random Variables with Finite Second Moment 271
If we now set R„(x) = 1 — 2^„(x), n > 1, it is easily verified that {R„}
(the Rademacher functions, Figure 31) are orthonormal:
ER„Rm= \ Rn(x)Rm(x)dx = 3„,
Jo
Notice that (1, R„) = ER„ = 0. It follows that this system is not complete.
However, the Rademacher system can be used to construct the Haar
system, which also has a simple structure and is both orthonormal and
complete.
Again let Q = [0, 1) and «F = ^([0, 1)). Put
Hl(x) = 1,
H2(x) = /^(x),
[2il2R.{x) if^i<x<^, n = 2J' + fc, 1 < k < 2jJ > 1,
Hn(x) = \ 2 2J
L0, otherwise.
It is easy to see that H„(x) can also be written in the form
(2mf2, O^x <2~lm+1\
I2'
H2m+l(x) = \-
lo,
H2m+£x) = H2m+1(x-^J, j =1,...,2-
2"*2, 2-(m+1)<x<2-M, m =1,2,...,
otherwise,
Figure 32 shows graphs of the first eight functions, to give an idea of the
structure of the Haar functions.
It is easy to see that the Haar system is orthonormal. Moreover, it is
complete both in L1 and in L2, i.e. if f = f(x) e If for p = 1 or 2, then
f l/M " I (/, Hk)Hk(x) \"dx-^0, n^ oo.
Jo fc=l
The system also has the property that
t(f>Hk)Hk(x)^f(x), n-^oo,
with probability 1 (with respect to Lebesgue measure).
In §4, Chapter VII, we shall prove these facts by deriving them from general
theorems on the convergence of martingales. This will, in particular, provide
a good illustration of the application of martingale methods to the theory of
functions.
272
II. Mathematical Foundations of Probability
-21 ^ -2-\ V -21 W -2-}
Figure 32. The Haar functions Ht(x\ ..., /J8(x).
6. If!?!,..., !?„ is a finite orthonormal system then, as was shown above, for
every random variable £eL2 there is a random variable £ in the linear
manifold 3? = S£{r\x, ..., i?n}, namely the projection of ^ on JSP, such that
||£ -ei|=inf{|K-CI|:Ce^{^,..., !?„}}.
Here £ = Yj=i (& ^)^- This result has a natural generalization to the case
when !?!, i72l •.. is a countable orthonormal system (not necessarily a basis).
In fact, we have the following result.
Theorem. Let Y\^r\i,...be an orthonormal system of random variables, and
L = L{nu n2* • • •} tne closed linear manifold spanned by the system. Then
there is a unique element £ e L such that
Moreover,
||£-ei|=inf{||£-CI|:Ce^}.
I = l.i.m. Yj (& ndni
(20)
(21)
andt-lL^eL.
§11. The Hilbert Space of Random Variables with Finite Second Moment 273
Proof. Let d = infflK — £||: (, e&} and choose a sequence Ci, C2, • • • such
that 11^ — CJI -»■ d. Let us show that this sequence is fundamental. A simple
calculation shows that
c + C II2
It " U2 = 2|t - £||2 + 2IC - £112 - 4
It is clear that (C„ + CJ/2 e 2\ consequently ||[(C„ + CJ/2] - £||2 > d2 and
therefore ||C„ — CJI2 -> 0, n, m -► 00.
The space L2 is complete (Theorem 7, §10). Hence there is an element |
such that IIC - III - 0. But <£ is closed, so | e j^. Moreover, ||C„ - £|| -> d,
and consequently IK — ||| = d, which establishes the existence of the
required element.
Let us^show that | is the only element of <£ with the required property.
Let |e^ and let
IK - ||| = |K - Id = d.
Then (by Problem 3)
III + I - 2£||2 + \\l - Hi2 = 2||| - £||2 + 2||| - £||2 = 4d2.
But
Hi + I- 2£||2 =4||i(| + |)- £||2 > 4d2.
Consequently ||| — |||2 = 0. This establishes the uniqueness of the element
of S£ that is closest to ^.
Now let us show that f - | _L £ C e J*. By (20),
IK -1 - cCII > IK - III
for every ceR. But
U - I - cCII2 = IK - III2 + c2||CII2 - M ~ I cQ.
Therefore
c2||CII2> 2(£- I, cQ. (22)
Take c = A(£ - |, Q, A e /?. Then we find from (22) that
(^-i,cm2ncii2-2A]>o.
We have A2||C||2 — 2A < 0 if X is a sufficiently small positive number.
Consequently (£ - I, C) = 0, C e L.
It remains only to prove (21).
The set S£ — S£{r\i, r}2, •..} is a closed subspace of L2 and therefore a
Hilbert spacejwith the same scalar product). Now the system r}urj2,...
is a basis for S£ (Problem 5), and consequently
f = l.i.m. £(!,%)%. (23)
n fc=l
But £ - I _L i?k, A: > 1, and therefore (|, i?k) = (£, i?k), k > 0. This, with (23)
establishes (21).
This completes the proof of the theorem.
274
II. Mathematical Foundations of Probability Theory
Remark. As in the finite-dimensional case, we say that | is the projection oi
£ on L = L{rju r\2,...}, that £ - | is perpendicular to L, and that the
representation
£ = I + (£ - |)
is the orthogonal decomposition of ^.
We also denote £ by E(f |»h. »h. ■ ■ •) and call it the conditional expectation
in the wide sense (of ^ with respect to nu rj2, •. •)• From the point of view of
estimating £ in terms of nu n2, ■. •, the variable | is the optimal linear
estimator, with error
A = E|«-||2 = ||{ - |p = U\\2 ~ I 1«, *>l*.
£=1
which follows from (5) and (23).
7. Problems
1. Show that if f = l.i.m. £, then ||fJI -> |f ||.
2. Show that if f = 1-i-m. £„ and»/ = l.i.m. n„ then (£„, »/n) -> (&»/).
3. Show that the norm ||-|| has the parallelogram property
H + ri\\2 + M-r,\\2 = 2(H\\2 + W\2).
4. Let (f „ ..., O be a family of orthogonal random variables. Show that they have the
Pythagorean property,
5. Let nlf n2,... be an orthonormal system and S£ = S?{ny, n2,...} the closed linear
manifold spanned by tjlt n2, Show that the system is a basis for the (Hilbert)
space <£.
6. Let £t, £2« - - ■ be a sequence of orthogonal random variables and S„ = £l + ■■■ + {,,.
Show that if ^£= i Ef2 < oo there is a random variable S with ES2 < oo such that
l.i.m. S„ = S, i.e. \\S„ - S\\2 = E \Sn - S\2 - 0, n - oo.
7. Show that in the space L2 = L2([-7r, 7t], ^([-7C, 7r]) with Lebesgue measure jz
the system {(1/^/270^, n = 0, ±1,...} is an orthonormal basis.
§12. Characteristic Functions
1. The method of characteristic functions is one of the main tools of the
analytic theory of probability. This will appear very clearly in Chapter III
in the proofs of limit theorems and, in particular, in the proof of the central
limit theorem, which generalizes the De Moivre-Laplace theorem. In the
present section we merely define characteristic functions and present their
basic properties.
First we make some general remarks.
§12. Characteristic Functions
275
Besides random variables which take real values, the theory of
characteristic functions requires random variables that take complex values (see
Subsection 1 of §5).
Many definitions and properties involving random variables can easily
be carried over to the complex case. For example, the expectation EC of a
complex random variable C = £ + in will exist if the expectations E^ and
En exist. In this case we define EC = E£ + iErj. It is easy to deduce from the
definition of the independence of random elements (Definition 6, §5) that
the complex random variables Ci = £i + »7i and C2 = £2 + »fc are
independent if and only if the pairs (gl9 ^) and (£2, rj2) are independent; or,
equivalently, the cr-algebras JSf4lifll and ^^,m are independent.
Besides the space L2 of real random variables with finite second moment,
we shall consider the Hilbert space of complex random variables C = £ + iq
with E|C|2 < 00, where |C|2 = £2 + rj2 and the scalar product (d, C2) IS
defined by ECif2» where J2 is the complex conjugate of C- The term "random
variable" will now be used for both real and complex random variables,
with a comment (when necessary) on which is intended.
Let us introduce some notation.
We consider a vector a g R" to be a column vector,
a =
and ar to be a row vector, ar = (alt..., a„). If a and beR" their scalar product
(a, b) is Y2=i a-ib-i. Clearly (a, b) = arb.
IfaeR" and 03 = ||ry|| is an n by n matrix,
(Ua,a) = arUa = £ r^aj. (1)
£../=1
2. Definition 1. Let F = F(xl9 • ■■ix„) be an n-dimensional distribution
function in (R", @(R")). Its characteristic Junction is
<p(t)= f eilt'x)dF(x% teR". (2)
Jit*
Definition 2. If f = (^,..., £,) is a random vector defined on the probability
space (CI, !F, P) with values in R", its characteristic Junction is
<pfi)= f e*-*dFjx), teR", (3)
where F^ = F^fo,..., x„) is the distribution function of the vector £ =
(£„...,«.
If F(x) has a density / = /(x) then
<p(t) = f e*'*>f{x)dx.
276
II. Mathematical Foundations of Probability Theory
In other words, in this case the characteristic function is just the Fourier
transform of f(x).
It follows from (3) and Theorem 6.7 (on change of variable in a Lebesgue
integral) that the characteristic function q>£t) of a random vector can also
be defined by
9<(f) = £<?*• o teR". (4)
We now present some basic properties of characteristic functions, stated
and proved for n = 1. Further important results for the general case will be
given as problems.
Let i = £(cd) be a random variable, F^ = F£x) its distribution function,
and
its characteristic function.
We see at once that if rj = a£ + b then
q,fi) = Ee11" = Eeitla*+b) = eitbEeiat*.
Therefore
cp.it) = eitbcp£at). (5)
Moreover, if £i, £2» •■■» £n are independent random variables and
Sn = £i + ••• + £„,then
<Psn(0 = f[ 9</M- (6)
In fact,
<pSn = Eg*«.+ - +««> = Eeu^ • • • e**"
= Eei^---Eeit*» = fl<ptfi),
where we have used the property that the expectation of a product of
independent (bounded) random variables (either real or complex; see Theorem 6
of §6, and Problem 1) is equal to the product of their expectations.
Property (6) is the key to the proofs of limit theorems for sums of
independent random variables by the method of characteristic functions (see §3,
Chapter III). In this connection we note that the distribution function FSn
is expressed in terms of the distribution functions of the individual terms in a
rather complicated way, namely FSn = F§1 * • • • * F^ where * denotes
convolution (see §8, Subsection 4).
Here are some examples of characteristic functions.
Example 1. Let £ be a Bernoulli random variable with P(£ = 1) = p,
P(£ = 0) = q, p + q = 1,1 > p > 0; then
<Pt(t) = peil + q.
§12. Characteristic Functions
277
If ¢1 in are independent identically distributed random variables like
^, then, writing T„ = (S„ — np)/y/npq, we have
(pT(t) = EeiT"t = e-t'^^tpe1"^™ + qj
Notice that it follows that asw->oo
<Prn(t)^e-t2,2> Tn = ^^. (8)
/npq
Example2. Let £, ~ Jfim, a2), \m\ < oo, a2 > 0. Let us show that
<pfi) = eitm-t2a2'2 (9)
Let rj = (^ — m)/a. Then r\ ~ .^(0, 1) and, since
q>^t) = Jtnq>n{ot)
by (5), it is enough to show that
%{t) = e-*2. (10)
We have
tpfi) = Be*" = -$= f eitxe-*2/2 dx
yj2n J-m
_J_r yWe-*i = ff±[" „.,-** dx
= y 0£(2„-i)M = y ¢££01
where we have used the formula (see Problem 7 in §8)
-)= C x2ne-x212 dx = Erj2n = (2n - 1)!!.
y/2n J-oo
Example 3. Let ^ be a Poisson random variable,
Ptt = k) = -—i fe = Q,l
Then
Ee* _£«-«_! = «-*£«!>. = exp{^« - 1». (11)
k=0 Ki k=0 *!
278
II. Mathematical Foundations of Probability Theory
3. As we observed in §9, Subsection 1, with every distribution function in
(R, &(R)) we can associate a random variable of which it is the distribution
function. Hence in discussing the properties of characteristic functions (in
the sense either of Definition 1 or Definition 2), we may consider only
characteristic functions q>(t) = q>£t) of random variables £ = £(co).
Theorem 1. Let £bea random variable with distribution function F — F(x) and
tp(t) = Eeh*
its characteristic function. Then q> has the following properties:
(1) I<p(0I<p(0) = 1;
(2) q>(t) is uniformly continuous for teR;
(3) q>{t) = ¢(=0:
(4) <p(t) is real-valued if and only if F is symmetric (JB dF(x) = J_B dF(x)\
Be@(R), -B = {-x:xeB}\
(5) ifE\£\" < oo for some n > 1, then q>{r\i) exists for every r < n, and
q>^(t)= f {ix)reitx dF{x\ (12)
E^ = « (13)
9(t)=i^E? + ^sn(t), (14)
where |e„(f)l < 3E \£\" and e„(t) -* 0, t -> 0;
(6) if<pt2"\0) exists and is finite then E£2" < oo;
(7) ifE\£\" < oo for all n> 1 and
(E|£n^ 1
lim = —- < oo,
n e-R
then
«P(0 = f^fE{». (15)
forall\t\ <R.
Proof. Properties (1) and (3) are evident. Property (2) follows from the
inequality
\cp{t + /0- <p(t)\ = lEe'V* - 1)1 < E \eih* - 11
and the dominated convergence theorem, according to which E | eiH — 11 —> 0,
Property (4). Let F be symmetric. Then if g(x) is a bounded odd Borel
function, we have JR g(x) dF(x) = 0 (observe that for simple odd functions
§12. Characteristic Functions
279
this follows directly from the definition of the symmetry of F). Consequently
Jit sin tx dF(x) = 0 and therefore
q>(i) = E cos t£.
Conversely, let <^(f) be a real function. Then by (3)
<p.£i) = q>£-t) = ^(0 = <pfi), t e R.
Hence (as will be shown below in Theorem 2) the distribution functions
F_£ and F^ of the random variables — ^ and £ are the same, and therefore (by
Theorem 3.1)
P(^eB) = P(-^B) = P(^-B)
for every Be£
Property (5). If E |^|" < oo, we have E \£\r < oo for r < n, by Lyapunov's
inequality (6.28).
Consider the difference quotient
q>{t + ft) - q>{t)
h
Since
-^m
x - 1
—Is '*'•
and E |^| < oo, it follows from the dominated convergence theorem that the
limit
lim
exists and equals
Eei« nm(^lfl\ = £Kctt) = ,- r xe><* dF(x). (16)
h-*0\ " / J-m
Hence <p'(t) exists and
q>\t) = KE&'t) = i f xeitxdF(x).
J — 00
The existence of the derivatives q>{r\t), 1 <r < n, and the validity of (12),
follow by induction.
Formula (13) follows immediately from (12). Let us now establish (14).
Since
"-1 (iv)k (iv)"
eiy = cos y + i sin y = £ ^ff- + ^-f- [cos 0xy + i sin 02y]
280
II. Mathematical Foundations of Probability Theory
for real y, with |0X | < 1 and \62 \ < 1, we have
J* =¾1 ^TT + ^T £cos *i(®« + » sin ««*a <17>
and
where
Ee« =¾ (g_ E? + g^ [E£« + e„(0], (18)
=-o fe! »!
en(t) = E[£"(cos 0X (">)*£ + i sin 02(co)f£ - 1)].
It is clear that \e„(t)\ < 3E|<f;"|- The theorem on dominated convergence
shows that e„(t) -► 0, t -► 0.
Property (6). We give a proof by induction. Suppose first that <p"(0)
exists and is finite. Let us show that in that case E^2< oo. By L'Hopital's
rule and Fatou's lemma,
,,o,-„„.[ffi^ + St|^]
h-»o o" h-»o ™
/.oo / ihx _ -ihx\2
-aL(—a—)*«->
.. f00 /sin/*x\2 , lr,, , f00 ,. /sin/wcV , ,„, v
= -hm [ — x2 dF(x) < - hm[ — x2 dF(x)
*-.oJ-co\ hx J J-oo/^oV hx J
= - r x2 dF(x).
J -O0
Therefore,
f x2dF{x)< -<p"(0)< oo.
J-oo
Now let q>l2k+2\0) exist, finite, and let f±£ x2k dF(x) < oo. If f f?w x2kdF{x)
= 0, then {!?„ x2k+2 dF(x) = 0 also. Hence we may suppose that
{?«, x2k dF(x) > 0. Then, by Property (5),
<Pl2k)(t)= f O'x)2 keitx dF(x)
J -oo
and therefore,
(_l)y2fc)(,)= p gft*dG(xX
J —oo
where G(x) = f* ro u2k JF(u).
§12. Characteristic Functions
281
Consequently the function (—l)k<pl2k)(t)G(co)~l is the characteristic
function of the probability distribution G(x) • G-1(oo) and by what we have
proved,
G~\oo) f x2dG(x)< oo.
J-00
But G_1(oo) > 0, and therefore
r x2k+2 dF(x) = r x2 dG(x) < oo.
J — 00 J — 00
Property (7). Let 0 < tQ < R. Then, by Stirling's formula we find that
n e*f0 w e \ n! /
Consequently the series £ [E|f |M«o/w!] converges by Cauchy's test, and
therefore the series Y?=o [(Oflr\\E? converges for \t\ < t0. But by (14),
for n > 1,
" (itY
where |/?„(*)I < 3( 11 \n/n !)E | f |". Therefore
for all 111 < R. This completes the proof of the theorem.
Remark 1. By a method similar to that used for (14), we can establish that if
E |^|" < oo for some n > 1, then
*') = I ^TT^ f" ^ dF&> + ^7^- «* - s)- (19)
where |e„(f - s)\ < 3E|f"|, and e„(f -s)-»0asr-s-»0.
Remark 2. With reference to the condition that appears in Property (7),
see also Subsection 9, below, on the "uniqueness of the solution of the
moment problem."
4. The following theorem shows that the characteristic function is uniquely
determined by the distribution function.
282
II. Mathematical Foundations of Probability Theory
fix)
b-e
Figure 33
Theorem 2 (Uniqueness). Let F and G be distribution functions with the same
characteristic function, i.e.
C eitx dF{x) = C eitx dG(x)
J — 00 J —00
(20)
for all t e R. Then F(x) = G(x).
Proof. Choose a and beR, and e > 0, and consider the function f£ = f\x)
shown in Figure 33. We show that
n nx)dF{x)= r nx)dG{x).
J — 00 J — 00
(21)
Let n > 0 be large enough so that [a — e, b + e] ^ [—n, h], and let the
sequence {3„} be such that 1 > 8„ 10, n -> oo. Like every continuous function
on [—n, n] that has equal values at the endpoints,/c = fc(x) can be uniformly
approximated by trigonometric polynomials (Weierstrass's theorem), i.e.
there is a. finite sum
/«(*) = Z flk expfiroc - J
such that
sup \fe(x)-ff,(x)\<dn.
-n£x<n
Let us extend the periodic function f„(x) to all of/?, and observe that
sup|/^(x)|<2.
X
Then, since by (20)
f pn{x)dF{x)= r f%x)dG{x\
J — 00 J — 00
(22)
(23)
§12. Characteristic Functions
283
we have
J"°° f\x)dF{x) - ^ f\x)dG(x)\ = \[_fdF- J" PdG
^\LndF-LndG
I /»00 /»00
ALfUF-lf"dG
+ 2dn
+ 2<5„
+ 2F([-n, n]) + 2G([-n, n]),
(24)
where F(A) = $A dF(x)t G{A) = \A dG(x). As n -> oo, the right-hand side
of (24) tends to zero, and this establishes (21).
As e -> 0, we have /£(x) -> /(fl,b](x). It follows from (21) by the theorem on
distribution functions' being the same.
p Wx)dF(x)= f°° /(fl,fc](x)rfG(x),
J— 00 J — 00
i.e. F(fc) — F(a) = G(fc) — G(fl). Since a and fc are arbitrary, it follows that
F(x) = G(x)forallxeK.
This completes the proof of the theorem.
5. The preceding theorem says that a distribution function F = F(x) is
uniquely determined by its characteristic function <p = <p(t). The next theorem
gives an explicit representation of F in terms of <p.
Theorem 3 (Inversion Formula). Let F = F(x) be a distribution junction and
(p(t) = r eitx dF(x)
J-00
its characteristic function.
(a) For pairs of points a and b(a < b) at which F = F(x) is continuous,
F(b) - F(a) = Km — r— <p(f) dt;
c-»oo27Tj_c It
(25)
(b) If $?m \<p(t)\ dt < oo, the distribution function F(x) has a density /(x),
*"(*)= f f(y)dy (26)
J-00
and
l r°°
>(0 dt.
(27)
284
II. Mathematical Foundations of Probability Theory
Proof. We first observe that if F(x) has density f(x) then
<p(t)= P eitxf(x)dx,
J -00
(28)
and (27) is just the Fourier transform of the (integrable) function <p(t).
Integrating both sides of (27) and applying Fubini's theorem, we obtain
F(b) - F(a) = £/(x) dx = ± £ [/"/"***« dt] dx
=i\yifae~itxdx]dt
After these remarks, which to some extent clarify (25), we turn to the proof,
(a) We have
1. f e-«« _ e->*
•«"&J_, Tt «*>*
= r Vc(x)dF(x), (29)
J-oo
where we have put
1 rc e~ita — p~itb
cV ' 2tt J_c if
-eitxdt
and applied Fubini's theorem, which is applicable in this case because
-Ita a~itb
lt I I J.
e~itxdx
<b - a
and
f f (b - a) dF{x) < 2c(b - a) < oo.
J — c J—m
§12. Characteristic Functions
285
In addition,
1 fc sin t(x — a) — sin t(x — b)
1 r(x-fl) sin v J 1 /**-*> sin u J
2n J-clx-a) v 2n J-clx-b) u
(30)
The function
, . f* sin v ,
g(s,t)= dv
Js V
is uniformly continuous in s and U and
9(s,t)-+n (31)
as s i — oo and 11 oo. Hence there is a constant C such that | ¥c(x) | < C < oo
for all c and x. Moreover, it follows from (30) and (31) that
¥c(x) - ¥(x), c -^ oo,
where
T(x)
{0, x < a, x > b,
i, x = a, x = b,
1, a < x < b.
Let fi be a measure on (/?, ^(/?)) such that ju(o, b] = F(b) - F(a). Then
if we apply the dominated convergence theorem and use the formulas of
Problem 1 of §3, we find that, as c -*■ oo,
<DC = P ¥c(x) dF(x) - C ¥(x) dF(x)
J — oo J— 00
= F0-) - F(a) + «F(fl)- F(a-) + F(fc) - F(fc-)]
= F(b) + F(fr-) _ F(«) + F(«-) = _
2 2
where the last equation holds for all points a and b of continuity of F(x).
Hence (25) is established,
(b) Let pw |<p(OI * < oo- Write
/(*) = ^JVtep(0A.
286
II. Mathematical Foundations of Probability Theory
It follows from the dominated convergence theorem that this is a continuous
function of x and therefore is integrable on [a, b]. Consequently we find,
applying Fubini's theorem again, that
fj™dx=[ULe~!'Mt)dt)dx
1 ff e~ita — e~itb
= lim ±\ - -^- <p(t) dt = F(b) - F{a)
for all points a and b of continuity of F(x).
Hence it follows that
F(x)= [* /(y)«ty, xe/?,
J-oo
and since f(x) is continuous and F(x) is nondecreasing, f{x) is the density
ofF(x).
This completes the proof of the theorem.
Corollary. The inversion formula (25) provides a second proof of Theorem 2.
Theorem 4. A necessary and sufficient condition for the components of the
random vector £ = (^l3..., <!;„) to be independent is that its characteristic
function is the product of the characteristic functions of the components:
Eentiti + - +..A0 = fl Eeitk*k, (flf..., tn) £ R".
fc=l
Proof. The necessity follows from Problem 1. To prove the sufficiency we
let F(xx,..., x„) be the distribution function of the vector f = (f lt..., &,)
and Fk(x), the distribution functions of the £k, 1 < k < n. Put G = G(xx,..., x„)
= ^1(^1)'' • f«(4 Then, by Fubini's theorem, for all (tlt..., t„) e R",
f entlXl + ••• +«„*„) dG(Xi ... xj = Yl f ei,fc*fc </Fk(x)
Jjt" k=i Jji
= ff Eei,fc*fc = Eef(,^l + ""+,fc5fc)
fc=l
= f J«lXl+"m+,**ddF(kxl---xJ.
Therefore by Theorem 2 (or rather, by its multidimensional analog; see
Problem 3) we have F = G, and consequently, by the theorem of §5, the
random variables f 1,..., f„ are independent.
§12. Characteristic Functions
287
6. Theorem 1 gives us necessary conditions for a function to be a characteristic
function. Hence if q> — q>(t) fails to satisfy, for example, one of the first three
conclusions of the theorem, that function cannot be a characteristic function.
We quote without proof some results in the same direction.
Bochner-Khinchin Theorem. Let q>{t) be continuous, t e Ry with <p(0) — \. A
necessary and sufficient condition that q>(i) is a characteristic function is that it
is positive semi-definite, i.e. that for all real tlt...,t„ and all complex klt..., ^,
n= 1,2,...,
£ q>{U - tjftXj > 0.
1,./=1
The necessity of (32) is evident since if <p(t) = ]"?«, eitx dF(x) then
(32)
dF(x) > 0.
t.;=i J-oo|k=i
The proof of the sufficiency of (32) is more difficult.
Polya's Theorem. Let a continuous even function q>(t) satisfy q>{t) > 0,
<p(0) = 1, q>{i) -> 0 as t -> oo and let q>{t) be convex on 0 < t < oo. Then
<p(t) is a characteristic function.
This theorem provides a very convenient method of constructing
characteristic functions. Examples are
^(O^-W
Jl-UI, M<1,
*aW*fo U|>1.
Another is the function <p3(t) drawn in Figure 34. On [—a, a], the function
<p3(0 coincides with q>2(t). However, the corresponding distribution
functions F2 and F3 are evidently different. This example shows that in general
two characteristic functions can be the same on a finite interval without their
distribution functions' being the same.
03(O,
288
II. Mathematical Foundations of Probability Theory
Marcinkiewicz's Theorem. If a characteristic function q>(t) is of the form
exp 0>(t\ where 0>(t) is a polynomial, then this polynomial is of degree at
most 2.
It follows, for example, that e~tA is not a characteristic function.
7. The following theorem shows that a property of the characteristic
function of a random variable can lead to a non trivial conclusion about the
nature of the random variable.
Theorem 5. Let q>^{t) be the characteristic function of the random variable £.
(a) If \q>z(tQ)\ = 1 for some tQ ^ 0, then £ is concentrated at the points
a + nh,h = 2n/t0ifor some a, that is,
£ Ptf = a + nfc} = 1, (33)
n— — oo
where a is a constant.
(b) If \<pfi)\ = l<Ps(«OI = 1 for two different points t and at, where a is
irrational, then £ is degenerate:
PK-«}-i.
where a is some number.
(c) If\q>fi)\ = 1, then £ is degenerate.
Proof, (a) If \<p$0)\ = 1, tQ # 0, there is a number a such that <p(t0) = eitoa.
Then
eitoa = f eit0x dF(x) =» 1 = f °° eW*-a) dp^ ^
J — 00 J — 00
/•00 /»00
1 = cos t0(x - a) rfF(x) => [1 - cos t0(x - a)] dF(x) = 0.
J — 00 J — 00
Since 1 — cos f0(x — a) > 0, it follows from property H (Subsection 2 of
§6) that
1 = cos t0(£ — a) (P-a.s.),
which is equivalent to (33).
(b) It follows from \<p£t)\ = |tp4(af)| = 1 and from (33) that
If ^ is not degenerate, there must be at least two pairs of common points:
2n 2n 2n , 2n
a H Wi = b H Wi, a H n2 — b-\ m2,
t at y at
§12. Characteristic Functions
289
in the sets
<a + —n,n = 0,±l,...i and \b + — m, m = 0, ±1, ...X
whence
2rc In
— («1 ~ «2) = ~(Wl ~ w2),
and this contradicts the assumption that a is irrational. Conclusion (c)
follows from (b).
This completes the proof of the theorem.
8. Let ^ = (£!,..., £k) be a random vector,
<p,(0 = Eef('-*>, ^ = (^,...,^),
its characteristic function. Let us suppose that E|£,.|" < 00 for some n ^ 1,
i = 1,..., k. From the inequalities of Holder (6.29) and Lyapunov (6.27)
it follows that the (mixed) moments E(£\l • • • g*) exist for all nonnegative
vu ..., vk such that vx + • • • + vk < n.
As in Theorem 1, this implies the existence and continuity of the partial
derivatives
^vi + — +Vfc
for vx + • • • + vk < n. Then if we expand <pfily..., tk) in a Taylor series,
we see that
<P&19..., tk) = X '7 *Vfcf ^¾1 Vkhi * * * '*" + ^^ <34>
v,+ -+vkin Vjl ■■• Vk!
where |r| = |fx | -H — -H Inland
mp Vfc) = Eft1---^
is the mixed moment of order v = (vx,..., vk).
Now <p£tu..., tk) is continuous, <^(0,..., 0) = 1, and consequently this
function is different from zero in some neighborhood \t\ <5 of zero. In
this neighborhood the partial derivative
QVi + — +Vfc
exists and is continuous, where In z denotes the principal value of the
logarithm (if z = reie, we take In z to be In r + W). Hence we can expand
In (p£t!,..,, **) by Taylor's formula,
In <P^i ••**)= I f,* *Vfc|4Vl k)tV'--tik + o(\tn (35)
v.+ -+vk£« V' Vfc!
290
II. Mathematical Foundations of Probability Theory
where the coefficients sJWl Vfc) are the (mixed) semi-invariants or cumulants
of order v = v(v1}..., vk) of £ = £lt.., £k.
Observe that if ^ and n are independent, then
In <p<+JLt) = In q>£t) + In <p£t\ (36)
and therefore
sjvu....vk) = gCv! vfc) + s<vt vfc) (37)
(It is this property that gives rise to the term "semi-invariant" for sfl Vfc).)
To simply the formulas and make (34) and (35) look "one-dimensional,"
we introduce the following notation.
If v = (yl9..., vk) is a vector whose components are nonnegative integers,
we put
v! = Vl! ••• vk!, |v| = Vl + ... + vk, f = tp -. ■ fj*
We also put sf = sfl Vfc), mf1 = mp Vfc).
Then (34) and (35) can be written
;|v|
^(0= I -m?t" + o(\tn (38)
|v|in v!
ta9«(0- I ^s?tv + o(\tn (39)
The following theorem and its corollaries give formulas that connect
moments and semi-invariants.
Theorem6. Let £ = (^,...,4) be a random vector with E|^,.|"<oo,
i = 1,..., fc, n ^ 1. Then for v = (v1}..., vk) such that | v| < n
4V,= I,, hjan^IU™. (40)
s<v,= j (~1J"\,1),V' „„ flmr", (41)
wJzere Xa<i>+-+a<«>=v indicates summation over all ordered sets of nonnegative
integral vectors A(p), |A(p)| > 0, whose sum is v.
Proof. Since
<p£t) = exp(ln <^(0),
if we expand the function exp by Taylor's formula and use (39), we obtain
" 1 I i|A| \q
^(0 = 1 + I-t I -jysfV) +o(\t\«). (42)
§12. Characteristic Functions
291
Comparing terms in tA on the right-hand sides of (38) and (42), and using
|A(1)| + • • • + \Alq)\ = |A(1) + • • • + A(9>|, we obtain (40).
Moreover,
ln^0 = ln[l+ I ^mftA + 0(Unl. (43)
L l£|A|£n /! J
For small z we have the expansion
ln(l + z) = Y ^-^ z« + o(z«).
Using this in (43) and then comparing the coefficients of tx with the
corresponding coefficients on the right-hand side of (38), we obtain (41).
Corollary 1. The following formulas connect moments and semi-invariants'.
1 v!
L —
:{riA<I)+..5^,=v} r,! • • • rx\ (*»!)" - • - (^>!)^ ^"^ (44)
s<v) = y (-lr^-D! v! fr Lmr)JJ
S* ^+..^.., r,! • • • rx\ {X"T • • • <A«iy-M L * J '
(45)
where X(»-iA<l)+-+r»A<*)=v} denotes summation over all unordered sets of
different nonnegative integral vectors A°\ | AlJi \ > 0, and over all ordered sets of
positive integral numbers rs such that rx#l) + • • • + rxX(x) — v.
To establish (44) we suppose that among all the vectors A(1),..., A(9)
that occur in (40), there are rx equal to A(l'l),..., rx equal to X(ix) (rj > 0,
rx + • • • + rx = q), where all the A°'s) are different. There are q \/{rl!... rx!)
different sets of vectors, corresponding (except for order) with the set {A(1),...
A(9)}). But if two sets, say, {A(1),..., A(9)} and {I(1>,..., Jlq)} differ only in order,
then f]?=i 4A(P>> = n?=i sf<P>)- Hence if we identify sets that differ only in
order, we obtain (44) from (40).
Formula (45) can be deduced from (41) in a similar way.
Corollary 2. Let us consider the special case when v = (1,..., 1). In this case
the moments w|v) s E£x ■ ■ ■ fk, and the corresponding semi-invariants, are
called simple.
Formulas connecting simple moments and simple semi-invariants can
be read off from the formulas given above. However, it is useful to have them
written in a different way.
For this purpose, we introduce the following notation.
Let ^ = (^,..., 4) be a vector, and /^ = {1, 2,..., k} its set of indices.
If I c /4> let \t denote the vector consisting of the components of ^ whose
292 H. Mathematical Foundations of Probability Theory
indices belong to /. Let #(/) be the vector {#i,..., x„} for which Xt = 1 if
i e /, and Xi = 0 if i $ /. These vectors are in one-to-one correspondence with
the sets /^/^. Hence we can write
m^I) = w4*,(/), s£I) = s^(/».
In other words, m^J) and s^(/) are simple moments and semi-invariants
of the subvector £, of ^.
In accordance with the definition given on p. 12, a decomposition of
a set / is an unordered collection of disjoint nonempty sets Ip such that
!,', = '•
In terms of these definitions, we have the formulas
»tfJ)« I iWp), (46)
^(/)= I (-ir^-oirWu (47)
Z|=l/P=/ P=i
where Xi|=i/P=/ denotes summation over all decompositions of /,
1 < q < iV(/), where N(I) is the number of elements of the set /.
We shall derive (46) from (44). If v = *(/) and A(1) + • • • + A(9) = v, then
A(*> = x(lp\ lp c /, where the X{p) are all different, A(p)! = v! = 1, and every
unordered set (x(/i),..., x(/g)} is in one-to-one correspondence with the
decomposition I = ^=1 /p. Consequently (46) follows from (44).
In a similar way, (47) follows from (35).
Example 1. Let £ be a random variable (k — 1) and m„ = w|n) = E£",
s„ = s£°. Then (40) and (41) imply the following formulas:
mi — su
™i = s2 + sf,
W3 = S3 + 3S1S2 + Si,
w4 = s4 + 3sf + 4SiS3 + 6sfs2 + sf, (48)
and
Sl = Wl = E£,
s2 = mi — ™\ = V£,
s3 = m3 — 3mlm2 + 2wi,
s4 = w4 — 3wf — 4^1^3 + \2m\m2 — 6wf, (49)
§12. Characteristic Functions
293
Example 2. Let f ~ Jf{m, a2). Since, by (9),
In <pfi) = ifm —,
we have sx = m, s2 = <72 by (39), and all the semi-invariants, from the third
on, are zero: s„ = 0, n > 3.
We may observe that by Marcinkiewicz's theorem a function exp 0>(t\
where & is a polynomial, can be a characteristic function only when the
degree of that polynomial is at most 2. It follows, in particular, that the
Gaussian distribution is the only distribution with the property that all its
semi-invariants s„ are zero from a certain index onward.
Example 3. If ^ is a Poisson random variable with parameter k > 0, then
by (11)
In <pfi) = Me* - 1).
It follows that
sn = A (50)
for all n > 1.
Example 4. Let £ = (^ 1}..., <!;„) be a random vector. Then
m,(l) = s,(l),
m,(l, 2) = s,(l, 2) + s,(l)s,(2),
m,(l, 2, 3) = s,(l, 2, 3) + s,(l, 2)5,(3) + (51)
+ s,(l, 3)s,(2) +
+ 5,(2, 3)5,(1) + 5,(1)5,(2)5^3)
These formulas show that the simple moments can be expressed in terms
of the simple semi-invariants in a very symmetric way. If we put £x = £2 =
• • • s 4, we then, of course, obtain (48).
The group-theoretical origin of the coefficients in (48) becomes clear
from (51). It also follows from (51) that
s,(l, 2) = m,(l, 2) - m,(l)m,(2) = E<^2 - E^Efc, (52)
i.e., s,(l, 2) is just the covariance of ^ and ^2-
9. Let ^ be a random variable with distribution function F = F(x) and
characteristic function <p(i). Let us suppose that all the moments m„ = E^",
n > 1, exist.
It follows from Theorem 2 that a characteristic function uniquely
determines a probability distribution. Let us now ask the following question
294
II. Mathematical Foundations of Probability Theory
(uniqueness for the moment problem): Do the moments {w„}„^i determine
the probability distribution!
More precisely, let F and G be distribution functions with the same
moments, i.e.
P x" dF{x) = P x" dG{x) (53)
for all integers n > 0. The question is whether F and G must be the same.
In general, the answer is "no." To see this, consider the distribution F
with density
/(X) = {o, x<0,
where a > 0,0 < X < |, and k is determined by the condition J" /(*) dx = 1.
Write ft = a tan Arc and let g(x) = 0 for x < 0 and
0(X) = ke'^Xl + e sin(0xA)], |e| < 1, x > 0.
It is evident that g(x) > 0. Let us show that
["xTe-***sin 0xA dx = 0 (54)
Jo
for all integers n > 0.
For p > 0 and complex # with Re # > 0, we have
Take p = (n + 1)/A, g = a + #, r = xA. Then
J
«>o
fV{[(n+i)/Ai-i}e-(a+.7J)^AxA-i rfx = A rrox"e-(a+,'^Arfx
Jo Jo
= A f x"e-a*A cos0xa dx - iA f x"e--** sin 0xA dx
Jo Jo
•ft1)
a(n+i)M(1 +r-tanA7c)(n+1)/;i"
But
(1 + i tan A7c)(n+1)/A = (cos An + i sin An){n+l)/\cos Xn)-^l),x
= efn(n+1,(cosA7t)-(n+1,/A
= cos n(n + 1) • cos(A7t)-(n+ 1)/A,
since sin 7c(n + 1) = 0.
(55)
§12. Characteristic Functions
295
Hence right-hand side of (55) is real and therefore (54) is valid for all
integral n > 0. Now let G(x) be the distribution function with density g{x).
It follows from (54) that the distribution functions F and G have equal
moments, i.e. (53) holds for all integers n > 0.
We now give some conditions that guarantee the uniqueness of the
solution of the moment problem.
Theorem 7. Let F = F(x) be a distribution function and n„ = $™m \x\" dF(x).
If
hm £=- < oo, (56)
n-»oo "
the moments {?«„}„>!, where m„ = ]""„ x" dF(x\ determine the distribution
F = F(x) uniquely.
Proof. It follows from (56) and conclusion (7) of Theorem 1 that there is a
t0 > 0 such that, for all 11\ < t0i the characteristic function
<p(r)= f eitxdF(x)
J -00
can be represented in the form
k=0 Kl
and consequently the moments {w„}„> x uniquely determine the
characteristic function <p(t) for 11 \ < t0.
Take a point s with \s\ < tQ/Z Then, as in the proof of (15), we deduce
from (56) that
00 ik(t — s\k
«>(') = i LV%<"(s)
k=0 K1
for |t — s| ^ f0, where
«/-00
is uniquely determined by the moments {w„}n2. x. Consequently the moments
determine <p(i) uniquely for |t| < ft0. Continuing this process, we see that
{mj^! determines <p(t) uniquely for all t, and therefore also determines
F(x).
This completes the proof of the theorem.
Corollary 1. The moments completely determine the probability distribution
if it is concentrated on a finite interval.
296
II. Mathematical Foundations of Probability Theory
Corollary 2. A sufficient condition for the moment problem to have a unique
solution is that
hm ^¾^ < oo. (57)
For the proof it is enough to observe that the odd moments can be
estimated in terms of the even ones, and then use (56).
Example. Let F(x) be the normal distribution function,
F(x) = ^=L= f e-'2'2aZdt.
yJlTZG2 J -oo
Then m2n+ x = 0, m2n = [(2n) \/2"n !]cr2n, and it follows from (57) that these
are the moments only of the normal distribution.
Finally we state, without proof:
Carleman's test for the uniqueness of the moment problem.
(a) Let {m„}„^ l be the moments of a probability distribution, and let
£ 1
^o (m2n)h
-= oo.
Then they determine the probability distribution uniquely.
(b) If {mj^j are the moments of a distribution that is concentrated on
[0, oo), then the solution will be unique if we require only that
I
i
(»0
l/2n
10. Let F = F(x) and G = G(x) be distribution functions with
characteristic functions f = f(t) and g = g(t\ respectively. The following theorem,
which we give without proof, makes it possible to estimate how close F
and G are to each other (in the uniform metric) in terms of the closeness of
f and g.
Theorem (Esseen's Inequality). Let G(x) have derivative G'(x) with
sup|G'(x)| < C. Then for every T > 0
sup|F(x) ,=,^-. 1/(0-*)
2 CT
)~G(x)\<-
n Jo
dt+ ^sup \G'(x)\. (58)
TlT x
(This will be used in §6 of Chapter III to prove a theorem on the rapidity
of convergence in the central limit theorem.)
§13. Gaussian Systems
297
11. Problems
1. Let f and rj be independent random variables, /(x) = /i(x) + i/iOO, 9(x) = 9i(x)
+ W2(x), where /k(x) and gk(x) are Borel functions, k = 1,2. Show that if E |/(f)|< oo
andE\g(ri)\ < oo, then
z\mgbi)\<«>
and
E/<9flfo) = E/(O.E0fo).
2. Let £ = (¢,,..., O and E ||f H" < oo, where ||f || = +N/£#- Show that
^(0=InE(r,£)* + £|J(OI|r|r,
k=0R-
where t = {tl3 ...>t„)and e„(t) -> 0, r -> 0.
3. Prove Theorem 2 for /7-dimensional distribution functions F = F„(xlt..., x„) and
GJLxlt...txJ.
4. Letf = F(xly..., x„) bean w-dimensional distribution function and <p = (p(tly...,t„)
its characteristic function. Using the notation of (3.12), establish the inversion formula
(We are to suppose that (a, fc] is an interval of continuity of P(a, b]t i.e. for k = 1,
..., n the points ak, bk are points of continuity of the marginal distribution functions
Fk(Xk) which are obtained from F(xlt..., x„) by taking all the variables except
xk equal to + oo.)
5. Let <pk(t\ k > 1, be a characteristic function, and let the nonnegative numbers Ak,
k > 1, satisfy £ Ak = 1. Show that J] Ak^)k(0 is a characteristic function.
6. If q>{t) is a characteristic function, are Re <p(t) and Im <p(t) characteristic functions?
7. Let <plt <p2 and ^>3 be characteristic functions, and (p^<p2 = (p\<Pz • Does it follow that
8. Construct the characteristic functions of the distributions given in Tables 1 and 2
of §3.
9. Let f be an integral-valued random variable and <p{ (t) its characteristic function.
Show that
P(f = fc) = J- V e-ikl(p^t)du fc = 0,±l,±2....
§13. Gaussian Systems
1. Gaussian, or normal, distributions, random variables, processes, and
systems play an extremely important role in probability theory and in
mathematical statistics. This is explained in the first instance by the central
298
II. Mathematical Foundations of Probability Theory
limit theorem (§4 of Chapter HI and §8 of Chapter VII), of which the De
Moivre-Laplace limit theorem is a special case (§6, Chapter I). According
to this theorem, the normal distribution is universal in the sense that the
distribution of the sum of a large number of random variables or random
vectors, subject to some not very restrictive conditions, is closely
approximated by this distribution.
This is what provides a theoretical explanation of the "law of errors" of
applied statistics, which says that errors of measurement that result from
large numbers of independent "elementary" errors obey the normal
distribution.
A multidimensional Gaussian distribution is specified by a small number
of parameters; this is a definite advantage in using it in the construction of
simple probabilistic models. Gaussian random variables have finite second
moments, and consequently they can be studied by Hilbert space methods.
Here it is important that in the Gaussian case "uncorrected" is equivalent
to "independent," so that the results of L2-theory can be significantly
strengthened.
2. Let us recall that (see §8) a random variable £ = £(co) is Gaussian, or
normally distributed, with parameters m and a2 (^ ~ jV(m, cr2)), |w| < oo,
a2 > 0, if its density f£x) has the form
#x)--l «-*-"**•», (1)
where g = -\-yfff2.
As cr X 0, the density f£x) "converges to the <5-function supported at
x = w." It is natural to say that £ is normally distributed with mean m
and a2 = 0 (£ ~ jV(m, 0)) if £ has the property that P(£ = m) = 1.
We can, however, give a definition that applies both to the nondegenerate
{a2 > 0) and the degenerate (a2 = 0) cases. Let us consider the characteristic
function q>^t) s Ee'"'4, t e R.
If P(£ = m) = 1, then evidently
<P&) = eitm, (2)
whereas if ^ ~ jV(m, a2\ a2 > 0,
(p£t) = eitm-^l2)t2a\ (3)
It is obvious that when a2 = 0 the right-hand sides of (2) and (3) are the
same. It follows, by Theorem 2 of §12, that the Gaussian random variable
with parameters m and a2 (| m \ < oo, cr2 > 0) must be the same as the random
variable whose characteristic function is given by (3). This is an illustration
of the "attraction of characteristic functions," a very useful technique in the
multidimensional case.
§13. Gaussian Systems
299
Let £ = {£u ..., <!;„) be a random vector and
<p£t) = Eei('-o, t = (tl5..., £„) e /?", (4)
its characteristic function (see Definition 2, §12).
Definition 1. A random vector £ = {£l9..., <!;„) is Gaussian, or normally
distributed, if its characteristic function has the form
<^f) = ^)-(1/2)((^ (5)
where m = (mlf..., wj, |wk| < oo and IR = ||rw|| is a symmetric nonnega-
tive definite n x n matrix; we use the abbreviation £ ~ Jfijn, IR).
This definition immediately makes us ask whether (5) is in fact a
characteristic function. Let us show that it is.
First suppose that IR is nonsingular. Then we can define the inverse
A = U~l and the function
/to = ^2^71 exp{-i(X(x - m\ (x - m))}, (6)
where x = (xu..., x„) and \A\ = det A. This function is nonnegative. Let
us show that
{
e*>x)f(x) dx = ei{t'm)-{ll2mt^\
or equivalently that
1 == f 0iiUx-m)\^]__a-{Lll2)(A(x-m)Ax-m)) Jy _ _-(l/2)(RM) /7^
Let us make the change of variable
x-m = ®u, t = &v,
where 0 is an orthogonal matrix such that
0TUO = D,
and
Hi-:)
is a diagonal matrix with d,- > 0 (see the proof of the lemma in §8). Since
| IR| = det IR # 0, we have dt > 0, i = 1,..., n. Therefore
\A\ = \U-l\ = dTl---d;K (8)
300
II. Mathematical Foundations of Probability Theory
Moreover (for notation, see Subsection 1, §12)
i(U x - m) - i(A(x - m), x - w)) = i(Ov, Gu) - i(A0u, Ou)
= i(Ov)TOu - i(Ou)TA(Ou)
= ivTu - \i?(FA(9u
= ii?Tu — iMTD-1u.
Together with (8) and (12.9), this yields
I„ = (2nynl2{dx • • • d„y1/2 f exp(ii>Tu - iiiTD"hi) du
= fH2ndkrm f exp
= exp(~ivTDv) = exp(-^vT0TROv) = exp(-%tTRt) = exp(-|(IR*, *)).
It also follows from (6) that
f/(x)rfx = l. (9)
JR"
Therefore (5) is the characteristic function of a nondegenerate n-di-
mensional Gaussian distribution (see Subsection 3, §3).
Now let IR be singular. Take e > 0 and consider the positive definite
symmetric matrix IRE = IR + eE. Then by what has been proved,
<p\t) = exp{i(r, m) - i(IRcf, t)}
is a characteristic function:
<pe(t)= f ei{ux)dFt(x\
JR"
where F£(x) = F£(xls..., x„) is an n-dimensional distribution function.
As e -► 0,
q>\t) -+ q>{t) = exp{i(t, m) - i(U% 0}-
The limit function <p(t) is continuous at (0,..., 0). Hence, by Theorem 1
and Problem 1 of §3 of Chapter III, it is a characteristic function.
We have therefore established Theorem 1.
3. Let us now discuss the significance of the vector m and the matrix
IR = Ikwll that appear in (5).
Since
n j n
In <pfi) = KU m) - i(IRt, 0 = i I hmk - o I ruhU, (10)
fc=l M=l
(ivkuk - |M duk = ft exp(-^ dk)
§13. Gaussian Systems
301
we find from (12.35) and the formulas that connect the moments and the
semi-invariants that
mx = s*1-0 °> = E^ mk = s<° 0-» = E&.
Similarly
'11 = 42*° °} = VZi> rl2 = s^.i.o....) =Cov(^, £2),
and generally
ru=cov(&,«.
Consequently m is the mean-value vector of ^ and IR is its covariance
matrix.
If IR is nonsingular, we can obtain this result in a different way. In fact,
in this case £ has a density f(x) given by (6).
A direct calculation shows that
E&s Jxk/(x)</x = mk, (11)
cov(&, Q = (xk - mk)(x, - wi,)/(x) rfx = rw.
4. Let us discuss some properties of Gaussian vectors.
Theorem 1
(a) The components of a Gaussian vector are uncorrelated if and only if they
are independent.
(b) A vector £ = (^ x,..., <!;„) is Gaussian if and only if, for every vector
X — (Ax,..., An), Ak eR, the random variable (<!;, A) = A^ + • • • + A„^„
fos a Gaussian distribution.
Proof, (a) If the components of ^ = (£i> ---.^)are uncorrelated, it follows
from the form of the characteristic function q>^{t) that it is a product of
characteristic functions. Therefore, by Theorem 4 of §12, the components
are independent.
The converse is evident, since independence always implies lack of
correlation.
(b) If ^ is a Gaussian vector, it follows from (5) that
EexpOX^A! + --- + ^A„)} = expji£(£Akmk)-y Q>wAkA,)>, teR,
and consequently
(<U)~^£Akmk,2>wAkA,).
302 H. Mathematical Foundations of Probability Theory
Conversely, to say that the random variable (<!;, A) = £>\h\ + "'" + ^"^"
is Gaussian means, in particular, that
Ee«t,» = expW A) - ^1^1 = exp{i £ AkE£k - i £ AkA,cov(£k, «}.
Since A,,..., A„ are arbitrary it follows from Definition 1 that the vector
£ = (f ls..., £„) is Gaussian.
This completes the proof of the theorem.
Remark. Let (0, ^) be a Gaussian vector with 0 = (0X,..., 0k) and £ =
(£i» • • -. £k)- If 0 and ^ are uncorrelated, i.e. cov(0,-, £,-) = 0, /= 1,..., k;
j — 1,...,/, they are independent.
The proof is the same as for conclusion (a) of the theorem.
Let £ = (^lf..., <!;„) be a Gaussian vector; let us suppose, for simplicity,
that its mean-value vector is zero. If rank IR = r < n, then (as was shown in
§11), there are n — r linear relations connecting £l9..., £„. We may then
suppose that, say, ^,..., ^r are linearly independent, and the others can
be expressed linearly in terms of them. Hence all the basic properties of the
vector £ = £i,..., £„ are determined by the first r components (gl9..., £,.)
for which the corresponding covariance matrix is already known to be
nonsingular.
Thus we may suppose that the original vector £ = (f t,..., <!;„) had linearly
independent components and therefore that |IR| > 0.
Let G be an orthogonal matrix that diagonalizes IR,
0TUO = D.
The diagonal elements of D are positive and therefore determine the inverse
matrix. Put B2 = D and
p = B-l&r£.
Then it is easily verified that
E £*((./» = EeipTt = e-u/2xa.0f
i.e. the vector /? = (fil9..., /?„) is a Gaussian vector with components that are
uncorrelated and therefore (Theorem 1) independent. Then if we write
A = OB we find that the original Gaussian vector £ = (^l9..., <!;„) can be
represented as
Z~AP, " (12)
where /? = (fil9..., /?„) is a Gaussian vector with independent components,
Pk ~ .^(0, 1). Hence we have the following result. Let £ = (^l9..., £,) be a
§13. Gaussian Systems
303
vector with linearly independent components such that E£k — 0, k = 1,
..., n. This vector is Gaussian if and only if there are independent Gaussian
variables fil9..., fi„, fik ~ «/T(0, 1), and a nonsingular matrix A of order n
such that £, — Aft. Here IR = AAr is the covariance matrix of ^.
If | U | ^ 0, then by the Gram-Schmidt method (see §11)
& = & + Mk. fc=l,...,«, (13)
where since e = (e1}..., ek) ~ «/T(0, E) is a Gaussian vector,
l* = 1(^,^, (14)
** = n& - Lw (is)
and
^Kl-.-.W = ^^,...,¾}. (16)
We see immediately from the orthogonal decomposition (13) that
&-*(&!&_!,...,«. (17)
From this, with (16) and (14), it follows that in the Gaussian case the
conditional expectation E(<!;k|£k_ lt..., f x) is a linear function of (glt..., ^k_ x):
£(^1^,,...,^)= 1^.- (18)
i=i
(This was proved in §8 for the case k = 2.)
Since, according to a remark made in Theorem 1 of §8, E(^k|^k_ l9..., £x)
is an optimal estimator (in the mean-square sense) for £k in terms of
fx ^k_!, it follows from (18) that in the Gaussian case the optimal
estimator is linear.
We shall use these results in looking for optimal estimators of 0=(0,,..., 0k)
in terms of f = (f ls..., &) under the hypothesis that (0, 0 is Gaussian. Let
m0 = E0, w^ = E£
be the column-vector mean values and
Vee = cov(6>, 0) = ||cov(ft, 61,-) ||, 1 < ij < k,
V^ = cov(6l, © s ||cov(6>,., y ||, 1 < i < k, 1 < j < /,
V« = covtf, © = llcovft, y ||, 1 < ij < I
the covariance matrices. Let us suppose that V^ has an inverse. Then we have
the following theorem.
Theorem 2 (Theorem on Normal Correlation). For a Gaussian vector (0, £),
the optimal estimator E(0|<!;) of 0 in terms of £,, and its error matrix
A = E[0-E(0|£)][0-E(0(£)]T
304
II. Mathematical Foundations of Probability Theory
are given by the formulas
E(0|0 = me + Ve^V(£ - m& (19)
A = Vee-Ve,V,V(Ve,)T. (20)
Proof. Form the vector
n^{6-me)-Me,M^-m,). (21)
We can verify at once that En(£ — w5)T = 0, i.e. n is not correlated with
(^ — m^). But since (0, ^) is Gaussian, the vector (n, £) is also Gaussian. Hence
by the remark on Theorem 1, n and £ — m$ are independent. Therefore n and
^ are independent, and consequently E(n\\) = E^ = 0. Therefore
E[0 - 11¾10 " V^Vk1^ - m«) = 0.
which establishes (19).
To establish (20) we consider the conditional covariance
cov(6>, 0|O = E[(6I - E(0|©)(0 - E(0|O)Tia (22)
Since 0 — E(010 = 17, and 17 and £ are independent, we find that
cov(0,0|£) = E(wT|£) = EwT
Since cov(0, 0|O does not depend on "chance," we have
A = E cov(0, 010 = cov(0, 010,
and this establishes (20).
Corollary. Let (0, £lt...t £„) be an (n + l)-dimensional Gaussian vector, with
f x,..., £„ independent. Then
E(»l«. O-EO+i^^^tt-Ett
A = v0-y^^i>
(cf. (8.12) and (8.13)).
5. Let £,, £2.... be a sequence of Gaussian random vectors that converge
in probability to £. Let us show that £ is also Gaussian.
In accordance with (b) of Theorem 1, it is enough to establish this only for
random variables.
§13. Gaussian Systems
305
Let mn = E£n, a2 = V^n. Then by Lebesgue's dominated convergence
theorem
lim g*"H-ci/2>«*» = iim Ee'"^ = Ee^.
n-»oo n-»oo
It follows from the existence of the limit on the left-hand side that there are
numbers m and g2 such that
m — lim mn, a2 — lim a2.
n-*oo n-*oo
Consequently
Egi*S _ gitm-(1/2)^2^
i.e. £ ~ >'(w, <r2).
It follows, in particular, that the closed linear manifold S£{£^ £2.---}
generated by the Gaussian variables f lf f 2» • • • (see §11» Subsection 5) consists
of Gaussian variables.
6. We now turn to the concept of Gaussian systems in general.
Definition 2. A collection of random variables £ = (4), where a belongs to
some index set 31, is a Gaussian system if the random vector (4,, • • •, £«n) is
Gaussian for every n ^ 1 and all indices a1}..., a„ chosen from 31.
Let us notice some properties of Gaussian systems.
(a) If £ = (£a), a e 31, is a Gaussian system, then every subsystem £' = (£«.),
a' e 31' c 5is is also Gaussian.
(b) If 4» ae3I, are independent Gaussian variables, then the system
£ = (4), a e 31, is Gaussian. _
(c) If £ = (4), a e 31, is a Gaussian system, the closed linear manifold J5f(^),
consisting of all variables of the form £J= t ca.£a., together with their
mean-square limits, forms a Gaussian system.
Let us observe that the converse of (a) is false in general. For example,
let 4 and nx be independent and 4 ~ «/T(0, 1), nl ~ jV(0, 1). Define the
system
(C,,) l«,.-ln,l) if«,<0. (2J)
Then it is easily verified that £ and n are both Gaussian, but (£, 17) is not.
Let £ = (4)«e9i be a Gaussian system with mean-value vector m — (wa),
oce3I, and covariance matrix IR = (rap)a,pegi, where ma = E£a. Then IR is
evidently symmetric (raP = rPa) and nonnegative definite in the sense that
for every vector c = (ca)ae9I with values in /?a, and only a finite number of
nonzero coordinates cai
(Rc,c)s Jr^c^tt (24)
306
II. Mathematical Foundations of Probability Theory
We now ask the converse question. Suppose that we are given a parameter
set 31 = {a}, a vector m = (wa)ae9I and a symmetric nonnegative definite
matrix IR = (raP)^PeW. Do there exist a probability space (£2, &, P) and a
Gaussian system of random variables £ = (4)«e9i on it, sucn tnat
cov«., £,) - r..„ a,jSe«l?
If we take a finite set a1}..., a„, then for the vector m = (mai,..-, mUf)
and the matrix M = (raP\ a, 0 = a1}..., a„, we can construct in /?" the
Gaussian distribution Faii...tan(x!,..., x„) with characteristic function
¢)(0 = exp{z(f, m) - i(IRt, 0}, t = (faii,..., O-
It is easily verified that the family
ft, Jx, xj;«ie«}
is consistent. Consequently by Kolmogorov's theorem (Theorem 1, §9,
and Remark 2 on this) the answer to our question is positive.
7. If 51 = {1, 2,...}, then in accordance with the terminology of §5 the
system of random variables £ = (£a)a€ai is a random sequence and is denoted
by £ = (^1,^2,---)- A Gaussian sequence is completely described by its
mean-value vector m = (mlf m2i. ■ ■) and covariance matrix IR = ||r0-||,
rtj =cov(£,., <y. In particular, if r,j = of <5lV, then f = (fi, f2» ■•■) lS a
Gaussian sequence of independent random variables with ££ ~ ^(m,-, erf),
i> 1-
When S& = [0, 1], [0, 00), (- 00, 00),..., the system £ = (£), f e SI, is a
random process with continuous time.
Let us mention some examples of Gaussian random processes. If we take
their mean values to be zero, their probabilistic properties are completely
described by the covariance matrices ||rj|. We write r(s, t) instead of ra
and call it the covariance function.
Example 1. If T = [0, 00) and
r(s, t) = min(s, 0, • (25)
the Gaussian process £ = (£,),^ 0 with this covariance function (see Problem
2) and £0 = 0 is a Brownian motion or Wiener process.
Observe that this process has independent increments; that is, for arbitrary
tx < t2 < • • • < t„ the random variables
are independent. In fact, because the process is Gaussian it is enough to
verify only that the increments are uncorrelated. But ifs<f<u<i? then
EK, " «Kp " &] = L>k v) - r(t, u)] - lr(s, v) - rfc «)]
= (t - t) - (s - s) = 0.
§13. Gaussian Systems
307
Example 2. The process Z = (Q, 0 ^ t <, 1, with £0 = 0 and
r(s, 0 = min(s, t) - sf (26)
is a conditional Wiener process (observe that since r(l, 1) = 0 we have
P«i = 0) = 1).
Example 3. The process £ = (£,), — oo < t < oo, with
Ks, £) = <?" "~s| (27)
is a Gauss-Markov process.
8. Problems
1. Let f!, f2, f3 be independent Gaussian random variables, f,- ~ ^T(0,1). Show that
*1 + ^3 ~ ^(0, 1).
(In this case we encounter the interesting problem of describing the nonlinear
transformations of independent Gaussian variables £t,..., £„ whose distributions
are still Gaussian.)
2. Show that (25), (26) and (27) are nonnegative definite (and consequently are actually
covariance functions).
3. Let A be an m x n matrix. Annxm matrix A® is a pseudoinverse of A if there are
matrices U and V such that
AA®A = A, A&= UAT = ATV.
Show that A® exists and is unique.
4. Show that (19) and (20) in the theorem on normal correlation remains valid when
V^ is singular provided that V^1 is replaced by Vg.
5. Let (0, £) = (6l3..., 0k; £ls..., £|) be a Gaussian vector with nonsingular matrix
A = VM - VgV^. Show that the distribution function
P(e<a\o = P(el<ali...3ek<ak\o
has (P-a.s.) the density p{au ..., ak|f) defined by
IA"1/2I
^p[exp{-i(fl - EpiOFA-'fc - E(0|£))}.
6. (S. N. Bernstein). Let f and rj be independent identically distributed random variables
with finite variances. Show that if f + rj and f — rj are independent, then f and j/
are Gaussian.
CHAPTER III
Convergence of Probability Measures.
Central Limit Theorem
§1. Weak Convergence of Probability Measures and
Distributions
1. Many of the fundamental results in probability theory are formulated as
limit theorems. Bernoulli's law of large numbers was formulated as a limit
theorem; so was the De Moivre-Laplace theorem, which can fairly be called
the origin of a genuine theory of probability and, in particular, which led the
way to numerous investigations that clarified the conditions for the validity
of the central limit theorem. Poisson's theorem on the approximation of the
binomial distribution by the "Poisson" distribution in the case of rare events
was formulated as a limit theorem. After the example of these propositions,
and of results on the rapidity of convergence in the De Moivre-Laplace and
Poisson theorems, it became clear that in probability it is necessary to deal
with various kinds of convergence of distributions, and to establish the
rapidity of convergence connected with the introduction of various "natural"
measures of the distance between distributions. In the present chapter we
shall discuss some general features of the convergence of probability
distributions and of the distance between them. In this section we take up questions
in the general theory of weak convergence of probability measures in metric
spaces. The De Moivre-Laplace theorem, the progenitor of the central limit
theorem, finds a natural place in this theory. From §3, it is clear that the
method of characteristic functions is one of the most powerful means for
proving limit theorems on the weak convergence of probability distributions
in R". In §6, we consider questions of metrizability of weak convergence.
Then, in §8, we turn our attention to a different kind of convergence of
distributions (stronger than weak convergence), namely convergence in vari-
§1. Weak Convergence of Probability Measures and Distributions
309
ation. Proofs of the simplest results on the rapidity of convergence in the
central limit theorem and Poisson's theorem will be given in §§10 and 11.
2. We begin by recalling the statement of the law of large numbers (Chapter
I, §5) for the Bernoulli scheme.
Let <!?!, £2,... be a sequence of independent identically distributed
random variables with P(f£ = 1) = p, P(f£ = 0) = q, p + q = 1. In terms
of the concept of convergence in probability (Chapter II, §10), Bernoulli's
law of large numbers can be stated as follows:
^-¾ p, n->oo, (1)
where S„ = f x + • • • + £„. (It will be shown in Chapter IV that in fact we
have convergence with probability 1.)
We put
""-ft IV;,
FIX)-
(2)
where F(x) is the distribution function of the degenerate random variable
£ = p. Also let P„ and P be the probability measures on (/?, 88(R))
corresponding to the distributions F„ and F.
In accordance with Theorem 2 of §10, Chapter II, convergence in
probability, SJn A p, implies convergence in distribution, SJn A p, which means that
<#
► E/(p), n -+ 00, (3)
for every function/ = f(x) belonging to the class C(R) of bounded
continuous functions on R.
Since
EJ^A = jRf(x)Pn(dx), E/(p) = jR f(x)P(dx),
(3) can be written in the form
f f(x)Pn(dx)-^ f f{x)P{dx\ feC(R),
JR Jr
or (in accordance with §6 of Chapter II) in the form
f f(x) dFn(x) -+ f f(x) dF{x\ fe C(R).
Jr Jr
In analysis, (4) is called weak convergence (of P„ to P, n -»■ 00) and written
P„ -^ P. It is also natural to call (5) weak convergence of F„ to F and denote
it by Fn -3- F.
(4)
(5)
310
III. Convergence of Probability Measures. Central Limit Theorem
Thus we may say that in a Bernoulli scheme
^Ap=>F„-^F. (6)
It is also easy to see from (1) that, for the distribution functions defined
in (2),
F„(x) -► F(x), n-»oo,
for all points xeR except for the single point x = p, where F(x) has a
discontinuity.
This shows that weak convergence F„-> F does not imply pointwise
convergence of F„(x) to F(x), n -► oo, for all points xeR. However, it turns
out that, both for Bernoulli schemes and for arbitrary distribution functions,
weak convergence is equivalent (see Theorem 2 below) to "convergence
in general" in the sense of the following definition.
Definition 1. A sequence of distribution functions {F„}, defined on the real
line, converges in general to the distribution function F (notation: F„ => F)
if as n -> oo
F„(x)^F(x), xePc(F),
where PC(F) is the set of points of continuity of F = F(x).
For Bernoulli schemes, F = F(x) is degenerate, and it is easy to see
(see Problem 7 of §10, Chapter II) that
Therefore, taking account of Theorem 2 below,
(^pj^F^/OoJF^^^pJ (7)
and consequently the law of large numbers can be considered as a theorem
on the weak convergence of the distribution functions defined in (2).
Let us write
I s/npq J
F(x)=-^= f e~u2/2du. (8)
yjln J- 00
The De Moivre-Laplace theorem (§6, Chapter I) states that F„(x) -» F(x)
for all x e /?, and consequently F„ => F. Since, as we have observed, weak
convergence F„^> F and convergence in general, F„ => F, are equivalent,
we may therefore say that the De Moivre-Laplace theorem is also a theorem
on the weak convergence of the distribution functions defined by (8).
§1. Weak Convergence of Probability Measures and Distributions
311
These examples justify the concept of weak convergence of probability
measures that will be introduced below in Definition 2. Although, on the
real line, weak convergence is equivalent to convergence in general of the
corresponding distribution functions, it is preferable to use weak convergence
from the beginning. This is because in the first place it is easier to work with,
and in the second place it remains useful in more general spaces than the
real line, and in particular for metric spaces, including the especially
important spaces Rn, Rm, C, and D (see §3 of Chapter II).
3. Let (£, <f, p) be a metric space with metric p = p(x, y) and cr-algebra S
of Borel subsets generated by the open sets, and let P, Plf P2,... be
probability measures on (£, <f, p).
Definition 2. A sequence of probability measures {P„} converges weakly to the
probability measure P (notation: P„ ^ P) if
j f{x)PJLdx)^ f f(x)P{dx) (9)
for every function f = f(x) in the class C(£) of continuous bounded
functions on E.
Definition 3. A sequence of probability measures {P„} converges in general
to the probability measure P (notation: P„ => P) if
P£A)-+P(A) (10)
for every set A of & for which
P(cU) = 0. (11)
(Here dA denotes the boundary of A:
dA = [A\ n [A\
where [A~\ is the closure of A.)
The following fundamental theorem shows the equivalence of the
concepts of weak convergence and convergence in general for probability
measures, and contains still another equivalent statement.
Theorem 1. The following statements are equivalent.
(I) Pg_-* P-
(II) lim Pn(A) < P(A\ A closed.
(III) Jim P„(A) > P(A\ A open.
(IV) P„=>P.
312
III. Convergence of Probability Measures. Central Limit Theorem
Proof. (I) => (II). Let A be closed, f(x) = IA(x) and
Mx) = gfiip(x,A)\9 e>0,
where
p(x, A) = inf{p(x, y):ye A},
II, t < 0,
1 - t, 0 < £ < 1,
0, £ > 1.
Let us also put
A& — {x:p(x,A) < e}
and observe that At IA as e 10.
Since/£(x) is bounded, continuous, and satisfies
Pn(A) = f IA(x)PJdx) < jEf£(x)Pn(dx),
we have
llm" P„04) < 1¾ f ft{x)Pn{dx) = f /£(x)P(rfx) < P(AE) i P(A), e [ 0,
» " Je Je
which establishes the required implication.
The implications (II) => (III) and (III) => (II) become obvious if we take
the complements of the sets concerned.
(III) => (IV). Let A0 = A\dA be the interior, and [,4] the closure, of A.
Then from (II), (III), and the hypothesis P(dA) = 0, we have
Hm Pn(A) < Em" P„(D4]) < P(M) = P(,4),
n n
Hm Pn{A) > Hm P„04°) > P(^t°) = P(,4),
and therefore P„(A) -> P(A) for every A such that P(dA) = 0.
(IV) -► (I). Let/ = /(x) be a bounded continuous function with | f(x)\ <
M. We put
D = {tE/?:P{x:/(x) = r}^0}
and consider a decomposition Tk = (t0, t1}..., tk) of [—M, A^]:
—M = t0 < tx < • • • < tk = M, A: > 1,
with t, £ D, i = 0, 1,...,/:. (Observe that D is at most countable since the
sets/_1{t} are disjoint and P is finite.)
Let Bt = {x: t{ < f(x) < ti+1}. Since /(x) is continuous and therefore
the set/"1^,-, ti+1) is open, we have dBt c /_1{t,} u/"1^^!}. The points
£,-, £,-+! $ D; therefore P(dJ3,-) = 0 and, by (IV),
§1. Weak Convergence of Probability Measures and Distributions
313
i=o *=o
But
HI
/(x)P„(rfx) -
+ ll'««P„(B.) •
| i = 0
+
i=0
< 2 max (ti+1
0<5i<;fc-l
+
zVn(s.)-
i=0
i=0 |
- l't.P(B.)|
i=0 |
- f/(x)P(dx)
-0
- z't.PW) .
i=0 1
whence, by (12), since the Tk (k > 1) are arbitrary,
lim f /(x)P„(rfx) = f /(x)P(rfx).
n Je Je
This completes the proof of the theorem.
Remark 1. The functions /(x) = IA(x) and f£(x) that appear in the proof
that (I) => (II) are respectively upper semicontinuous and uniformly continuous.
Hence it is easy to show that each of the conditions of the theorem is
equivalent to one of the following:
00 Je/(x)PiX*) dx -* hf(x)P(dx) for all bounded uniformly continuous
/M;
(VI) lim SEf(x)Pn(dx) < \Ef{x)P{dx)for all bounded f(x) that are upper
semicontinuous (hm/(x„) < f(x), x„ -> x);
(VII) Hm Uf{x)Pn{dx) > [Ef(x)P(dx)for all bounded f(x) that are lower
semicontinuous (lim /(x„) > /(x), x„ -> x).
n
Remark 2.Theorem 1 admits a natural generalization to the case when the
probability measures P and P„ defined on (£, <f, p) are replaced by arbitrary
(not necessarily probability) finite measures ji and jin. For such measures
we can introduce weak convergence Hn^fi and convergence in general
fin=> H and, just as in Theorem 1, we can establish the equivalence of the
following conditions:
(I*) i±Z k
(II*) lim iin(A) < ji{A\ where A is closed and n„(E) -> n(E);
(III*) lim ix„(A) > n(A\ where A is open and n„(E) -► ;u(£);
(IV*) /,„=>/,.
314
III. Convergence of Probability Measures. Central Limit Theorem
Each of these is equivalent to any of (V*), (VI*), and (VII*), which are
(V), (VI), and (VII) with P„ and P replaced by jin and ji.
4. Let (/?, @(R)) be the real line with the system @(R) of sets generated by
the Euclidean metric p(x, y) ~ \x — y\ (compare Remark 2 of subsection 2
of §2 of Chapter II). Let P and P„, n > 1, be probability measures on (R, @(R))
and let F and F„, n > 1, be the corresponding distribution functions.
Theorem 2. The following conditions are equivalent:
(1)P„-P,
(2) P„=>P,
(3) F„A F,
(4)F„=>F.
Proof. Since (2) o (1) o (3), it is enough to show that (2) o (4).
If P„ => P, then in particular
P„(—°o, *] -* P( —oo, x]
for all x e R such that P{x} = 0. But this means that F„ => F.
Now let Fn => F. To prove that P„ => P it is enough (by Theorem 1) to
show that lim„ P„{A) > P(A) for every open set A.
If A is open, there is a countable collection of disjoint open intervals
/,, /2,... (of the form (a, b)) such that A = £"= 14- Choose e > 0 and in
each interval Ik = (ak, fck) select a subinterval /k = (ak, 6k] such that a'k,
bkePc(F) and P(Ik) < P(I'k) + e-2"k. (Since F(x) has at most countably
many discontinuities, such intervals /k, k > 1, certainly exist.) By Fatou's
lemma,
lim P„(,4) = Hm f P^/J > £ Hm P„(/k)
n n k=1 k=1 n
> £ limP„(/k).
k=l "
But
Pn(/k) = F„(fek) - Fn{a'k) - F(b'k) - F(4) = P(/D.
Therefore
Hm P„(A) > £ P(/k) > £ (P(/k) - e • 2"k) = P(,4) - e.
n k=l k=l
Since e > 0 is arbitrary, this shows that lim„ P„(A) > P(A) if A is open.
This completes the proof of the theorem.
5. Let (F, S) be a measurable space. A collection Jf0(E) c g of subsets is
a determining class whenever two probability measures P and Q on (F, <f)
§1. Weak Convergence of Probability Measures and Distributions
315
satisfy
P(A) = Q(A) for all AeJfr0(E)
it follows that the measures are identical, i.e.,
P(A) = Q(A) for all Ae£
If (E, S, p) is a metric space, a collection Jf^E) c £ is a convergence-
determining class whenever probability measures P, Pl9 P2,... satisfy
Pn(A)->P(A) for all AeJfT^E) with P(dA) = 0
it follows that
P„(A)->P(A) for all Xe£ with P(dA) = 0.
When (£, <f) = (/?, ^(/?)), we can take a determining class JT0(R) to be
the class of "elementary" sets Jf = {( —oo, x], xeR} (Theorem 1, §3,
Chapter II). It follows from the equivalence of (2) and (4) of Theorem 2
that this class Jf is also a convergence-determining class.
It is natural to ask about such determining classes in more general spaces.
For R", n>2, the class JT of "elementary" sets of the form (—oo, x] =
(—00, xj x • • • x ( — oo, x„], where x = (xlf..., x„)eR", is both a
determining class (Theorem 2, §3, Chapter II) and a convergence-determining
class (Problem 2).
For Rm the cylindrical sets Jf0(/?°°) are the "elementary" sets whose
probabilities uniquely determine the probabilities of the Borel sets (Theorem
3, §3, Chapter II). It turns out that in this case the class of cylindrical sets is
also the class of convergence-determining sets (Problem 3). Therefore
JT^jR") = JT0(Rm).
We might expect that the cylindrical sets would still constitute
determining classes in more general spaces. However, this is, in general, not the
case.
\Jn 2/ii
Figure 35
316
111. Convergence of Probability Measures. Central Limit Theorem
For example, consider the space (C, @0(C\ p) with the uniform metric
p (see subsection 6, §2, Chapter II). Let P be the probability measure
concentrated on the element x = x(t) = 0, 0 < t < 1, and let P„, n > 1, be the
probability measures each of which is concentrated on the element x — x„(t)
shown in Figure 35. It is easy to see that P„(A) -► P(A) for all cylindrical
sets A with P(dA) = 0. But if we consider, for example, the set
A = {aeC: |a(r)| < i, 0 < t < 1} e<#0(C),
then P(dA) = 0, P„(A) = 0, P(A) = 1 and consequently P„ =f P.
Therefore Jr0(C) = @Q(C) but JT0(C) c JfT^C) (with strict inclusion).
6. Problems
1. Let us say that a function F = F{x\ defined on R", is continuous atxeR" provided
that, for every £ > 0, there is a 5 > 0 such that \F(x) - F(y)\ < e for all yeR" that
satisfy
x — be < y < x + 5et
where e = (1,..., 1) e R". Let us say that a sequence of distribution functions {Fn}
converges in general to the distribution function F (F„ => F) if JF^x) -> i^x), for all
points x e R" where F = F(x) is continuous.
Show that the conclusion of Theorem 2 remains valid for R"y n > 1. (See the remark
on Theorem 2.)
2. Show that the class X of "elementary" sets in R" is a convergence-determining class.
3. Let E be one of the spaces /*°°, C, or D. Let us say that a sequence {P„} of problem
measures (defined on the a-algebra & of Borel sets generated by the open sets)
converges in general in the sense of finite-dimensional distributions to the probability
measure P (notation: P„ 4 P) if P„(y4) -> P(A)t n -> oo, for all cylindrical sets A with
P(dA) = 0.
For R°°, show that
(P„4p)o(Pn^P).
4. Let F and G be distribution functions on the real line and let
L(F, G) = inf{fc > 0: F(x - h) - h < G(x) < F{x + h) + h}
be the Levy distance (between F and G). Show that convergence in general is
equivalent to convergence in the Levy metric:
(Fn=>F)oL(FIJ,F)-0.
5. Let F„=>F and let F be continuous. Show that in this case F„(x) converges uniformly
to F(x):
sup |JF„(x) — F(x)| -► 0, n -» oo.
6. Prove the statement in Remark 1 on Theorem 1.
§2. Relative Compactness and Tightness of Families of Probability Distributions 317
7. Establish the equivalence of (I*)-(IV*) as stated in Remark 2 on Theorem 1.
8. Show that P„ -^ P if and only if every subsequence {P„.} of {P„} contains a
subsequence {P„..} such that P„.. -S P.
§2. Relative Compactness and Tightness of Families
of Probability Distributions
1. If we are given a sequence of probability measures, then before we can
consider the question of its (weak) convergence to some probability measure,
we have of course to establish whether the sequence converges in general
to some measure, or has at least one convergent subsequence.
For example, the sequence {P„}, where P2n = P, P2n+1 = 0, and P and 0
are different probability measures, is evidently not convergent, but has the
two convergent subsequences {P2n} and {P2n+i}«
It is easy to construct a sequence {P„} of probability measures P„, n > 1,
that not only fails to converge, but contains no convergent subsequences at
all. All that we have to do is to take P„, n > 1, to be concentrated at {«} (that
is, P„{n} = 1). In fact, since lim„ P„(a, b] = 0 whenever a < b, a limit measure
would have to be identically zero, contradicting the fact that 1 = P„(R) -/> 0,
n -> oo. It is interesting to observe that in this example the corresponding
sequence {F„} of distribution functions,
Fn(x) = ft **".
"y J (0, x < n,
is evidently convergent: for every xeR,
F„(x) - G(x) = 0.
However, the limit function G = G(x) is not a distribution function (in the
sense of Definition 1 of §3, Chapter II).
This instructive example shows that the space of distribution functions is
not compact. It also shows that if a sequence of distribution functions is to
converge to a limit that is also a distribution function, we must have some
hypothesis that will prevent mass from "escaping to infinity."
After these introductory remarks, which illustrate the kinds of difficulty
that can arise, we turn to the basic definitions.
2. Let us suppose that all measures are defined on the metric space (E, S, p).
Definition 1. A family of probability measures & — {Pa; a e 51} is relatively
compact if every sequence of measures from & contains a subsequence which
converges weakly to a probability measure.
318
III. Convergence of Probability Measures. Central Limit Theorem
We emphasize that in this definition the limit measure is to be a probability
measure, although it need not belong to the original class &. (This is why the
word "relatively" appears in the definition.)
It is often far from simple to verify that a given family of probability
measures is relatively compact. Consequently it is desirable to have simple
and useable tests for this property. We need the following definitions.
Definition 2. A family of probability measures & = {Pa; a e 91} is tight if,
for every e > 0, there is a compact set K £ E such that
supPXEVQ < e. (1)
ae2I
Definition 3. A family of distribution functions F = {Fa; a e 91} defined on
R",n> 1, is relatively compact (or tight) if the same property is possessed by
the family & — {Pa; a e 51} of probability measures, where Pa is the measure
constructed from Fa.
3. The following result is fundamental for the study of weak convergence of
probability measures.
Theorem 1 (Prokhorov's Theorem). Let & = {Pa; ae9I} be a family of
probability measures defined on a complete separable metric space (F, <f, p).
Then & is relatively compact if and only if it is tight.
We shall give the proof only when the space is the real line. (The proof can
be carried over, almost unchanged, to arbitrary Euclidean spaces /?", n > 2.
Then the theorem can be extended successively to /?°°, to cr-compact spaces,-
and finally to general complete separable metric spaces, by reducing each
case to the preceding one.)
Necessity. Let the family & — {Pa: a e 51} of probability measures defined
on (/?, @(R)) be relatively compact but not tight. Then there is an e > 0 such
that for every compact K ^ R
sup Pa(R\K) > £,
a
and therefore, for each interval I = (a, b\
sup Pa(R\I) > B.
a
It follows that for every interval /„ = (—n, n), n > 1, there is a measure Pttn
such that
Pan(R\ln) > s.
Since the original family & is relatively compact, we can select from {P«n}n> i
a subsequence {PanJ such that Pttnfc ^ Q, where Q is a probability measure.
Then, by the equivalence of conditions (I) and (II) in Theorem 1 of §1, we
have
hm Pank(R\In) < Q(R\In) (2)
fc-*CO
§2. Relative Compactness and Tightness of Families of Probability Distributions 319
for every n > 1. But Q(R\In) J, 0, n -> oo, and the left side of (2) exceeds
e > 0. This contradiction shows that relatively compact sets are tight.
To prove the sufficiency we need a general result (Helly's theorem) on the
sequential compactness of families of generalized distribution functions
(Subsection 2 of §3 of Chapter II).
Let «/ = {G} be the collection of generalized distribution functions
G = G(x) that satisfy:
(1) G(x) is nondecreasing;
(2) 0<G(-oo),G(+oo)< 1;
(3) G(x) is continuous on the right.
Then «/ clearly contains the class of distribution functions 2F = {F}
for which F(-oo) = 0 and F( + oo) = 1.
Theorem 2 (Helly's Theorem). The class J = {G} of generalized distribution
Junctions is sequentially compact, i.e., for every sequence {G„} of functions
from«/ we can find a function Ge/ and a sequence {nk} c {n} such that
Gnk{x) - G(x), k - oo,
for every point x of the set PC(G) of points of continuity ofG = G(x).
Proof. Let T — {xj, x2,...} be a countable dense subset of R. Since the
sequence of numbers (G„(xi)} is bounded, there is a subsequence N1 =
{nli\ nl2],...} such that G^i^x^ approaches a limit gx as i -> oo. Then we
extract from Nx a. subsequence N2 = {n(i2), nl£\...} such that G„*2>(x2)
approaches a limit g2 as i -> oo; and so on.
On the set T c /? we can define a function GT(x) by
Gt(x,) = 0m x.er,
and consider the "Cantor" diagonal sequence N = {/1^, nl22),. -.}. Then, for
each x,- e T, as m -> oo, we have
G„<-)(x,) -► G^x,).
Finally, let us define G = G(x) for all x e R by putting
G(x) = inf{GTO0: yeT,y>x}. (3)
We claim that G = G(x) is the required function and G„jno(x) -> G(x) at all
points x of continuity of G.
Since all the functions G„ under consideration are nondecreasing, we have
G„^{x) < Gnin»(y) for all x and y that belong to T and satisfy the inequality
x < y. Hence for such x and y,
GT(x) < GT(y).
It follows from this and (3) that G « G(x) is nondecreasing.
320
III. Convergence of Probability Measures. Central Limit Theorem
Now let us show that it is continuous on the right. Let xk|x and d —
limk G(xk). Clearly G(x) < d, and we have to show that actually G(x) = d.
Suppose the contrary, that is, let G(x) < d. It follows from (3) that there is
a y e X x < j>, such that Gjiy) < d. But x < xk < y for sufficiently large k,
and therefore G(xk) < GT(y) < d and lim G(xk) < d, which contradicts
d = limk G(xk). Thus we have constructed a function G that belongs to J.
We now establish that Gn<no(x°) -► G(x°) for every x° e PC(G).
Ifx° < y eT, then
lim Gnfn,,(x°) < lim Gnto,,(^) = GT(y\
m m
whence
lim" GnLm,(x0) < inf{GT(j/): y > *°, j> e T} = G(x°). (4)
m
On the other hand, let x1 < y <x°,yeT. Then
G(xl) < Gjiy) = lim Gnifr)(y) = lim Gnte.,(y) < Hm Gnto,,(x°).
m m m
Hence if we let x1 t x° we find that
G(xo -)<M Gnifr>(x0). (5)
But if G(x° -) = G(x°) we can infer from (4) and (5) that Gnfn,,(x°) -► G(x°),
m -> oo.
This completes the proof of the theorem.
We can now complete the proof of Theorem 1.
Sufficiency. Let the family & be tight and let {P„} be a sequence of
probability measures from 0>. Let {F„} be the corresponding sequence of
distribution functions.
By Helly's theorem, there are a subsequence {FnJ c {F„} and a generalized
distribution function GeJ such that F„fc(x) -► G(x) for x e PC(G). Let us
show that because & was assumed tight, the function G = G(x) is in fact a
genuine distribution function (G(—oo) = 0, G( + oo) = 1).
Take e > 0, and let I = (a, fc] be the interval for which
sup P„(R\I) < e,
or, equivalently,
1 - £ < Pn(a, b], n > 1.
Choose points a', V e PC(G) such that a' < a,b' > b. Then 1 - e < P„k(a, b]
< Pnk(a\ fc'] = Fnk{V) - Fnfc(a') -+ Gib') - G(a'). It follows that G(+oo) -
G(-oo) = 1, and since 0 < G(-oo) < G( + oo) < 1, we have G(-oo) = 0
and G(+oo) = 1.
§3. Proofs of Limit Theorems by the Method of Characteristic Functions 321
Therefore the limit function G = G(x) is a distribution function and
F„k => G. Together with Theorem 2 of §1 this shows that P„fc ^ 0, where Q
is the probability measure corresponding to the distribution function G.
This completes the proof of Theorem 1.
4. Problems
1. Carry out the proofs of Theorems 1 and 2 for Rn, n>2.
2. Let Pa be a Gaussian measure on the real line, with parameters ma and <r„, a e 31.
Show that the family & = {Pa; a e 21} is tight if and only if
IwJ^a, cl<b, ae2l.
3. Construct examples of tight and nontight families & = {Pa; a e 21} of probability
measures defined on (R°°, ^(R°°)).
§3. Proofs of Limit Theorems by the Method of
Characteristic Functions
1. The proofs of the first limit theorems of probability theory—the law of
large numbers, and the De Moivre-Laplace and Poisson theorems for
Bernoulli schemes—were based on direct analysis of the limit functions of the
distributions F„, which are expressed rather simply in terms of binomial
probabilities. (In the Bernoulli scheme, we are adding random variables that
take only two values, so that in principle we can find Fn explicitly.) However,
it is practically impossible to apply a similar direct method to the study of
more complicated random variables.
The first step in proving limit theorems for sums of arbitrarily distributed
random vaxiables was taken by Chebyshev. The inequality that he discovered,
and which is now known as Chebyshev's inequality, not only makes it
possible to give an elementary proof of James Bernoulli's law of large numbers,
but also lets us establish very general conditions for this law to hold, when
stated in the form
{|S ES I 1
— - > e > -> 0, n -> oo, every e > 0,
\n n\ J
(1)
for sums S„ = ^ + • • • + <!;„, n > 1, of independent random variables. (See
Problem 2.)
Furthermore, Chebyshev created (and Markov perfected) the "method of
moments" which made it possible to show that the conclusion of the De
Moivre-Laplace theorem, written in the form
322
III. Convergence of Probability Measures. Central Limit Theorem
is "universal," in the sense that it is valid under very general hypotheses
concerning the nature of the random variables. For this reason it is known as
the central limit theorem of probability theory.
Somewhat later Lyapunov proposed a different method for proving the
central limit theorem, based on the idea (which goes back to Laplace) of the
characteristic function of a probability distribution. Subsequent
developments have shown that Lyapunov's method of characteristic functions is
extremely effective for proving the most diverse limit theorems. Consequently
it has been extensively developed and widely applied.
In essence, the method is as follows.
2. We already know (Chapter II, §12) that there is a one-to-one
correspondence between distribution functions and characteristic functions. Hence we
can study the properties of distribution functions by using the corresponding
characteristic functions. It is a fortunate circumstance that weak convergence
F„ ^ F of distributions is equivalent to pointwise convergence q>„ -> <p of
the corresponding characteristic functions. Moreover, we have the following
result, which provides the basic method of proving theorems on weak
convergence for distributions on the real line.
Theorem 1 (Continuity Theorem). Let {F„} be a sequence of distribution
functions F„ = Fn(x\ xeR, and let {q>„} be the corresponding sequence of
characteristic functions,
q>n(t)= f eitxdF„(xl t e R.
J- 00
(1) If F„^> F, where F = F(x) is a distribution function, then q>n(t) -> <p(r),
teR, where <p(t) is the characteristic function oj'F — F(x).
(2) If lim„ <p„(t) exists for each t eR and <p(i) — lim„ q>n(i) is continuous at
t = 0, then <p{i) is the characteristic function of a probability distribution
F = F(x), and
Fn^F.
The proof of conclusion (1) is an immediate consequence of the definition
of weak convergence, applied to the functions Re eitx and Im eitx.
The proof of (2) requires some preliminary propositions.
Lemma 1. Let {P„} be a tight family of probability measures. Suppose that
every weakly convergent subsequence {P„.} of {P„} converges to the same
probability measure P. Then the whole sequence {P„} converges to P.
Proof. Suppose that P„ -J* P. Then there is a bounded continuous function
f — f(x) such that
f/(x)Pn(</x)^ { f(x)P(dx).
Jr Jr
§3. Proofs of Limit Theorems by the Method of Characteristic Functions 323
It follows that there exist e > 0 and an infinite sequence {«'} c {n} such
that
I jRf(x)PAdx) - jRf(x)P(dx)
> e > 0. (3)
By Prokhorov's theorem (§2) we can select a subsequence {P„»} of {P„.} such
that P„» -^ 0, where 0 is a probability measure.
By the hypotheses of the lemma, Q = P, and therefore
{f{x)PAdx)-+ ( f(x)P(dx)i
JR JR
which leads to a contradiction with (3). This completes the proof of the
lemma.
Lemma 2. Let {P„} be a tight family of probability measures on (/?,
A necessary and sufficient condition for the sequence {P„} to converge weakly
to a probability measure is that for each t eRthe limit lim„ <p„(t) exists, where
<p„(t) is the characteristic function of P„:
<pn(t)= f eitxP„(dx).
Jr
Proof. If {P„} is tight, by Prohorov's theorem there is a subsequence
{P„.} and a probability measure P such that P„. -^ P. Suppose that the whole
sequence {P„} does not converge to P (P„ ^ P). Then, by Lemma 1, there is a
subsequence {P„..} and a probability measure 0 such that P„» A Q, and
P?*Q.
Now we use the existence of lim„ q>„(t) for each t e R. Then
and therefore
lim f eitxPn(dx) = lim f eitxPn.(dx)
n1 JR n- JR
\eitxP(dx)= f eitxQ(dx\ teR.
Jr Jr
But the characteristic function determines the distribution uniquely
(Theorem 2, §12, Chapter II). Hence P = Q, which contradicts the assumption
thatP„^P.
The converse part of the lemma follows immediately from the definition
of weak convergence.
The following lemma estimates the "tails" of a distribution function in
terms of the behavior of its characteristic function in a neighborhood of
zero.
Lemma 3. Let F = F(x) be a distribution function on the real line and let
324 III. Convergence of Probability Measures. Central Limit Theorem
(p = (p(t) be its characteristic function. Then there is a constant K > 0 such
that for every a > 0
f dF(x) < - f[l - Re ¢)(0] dt. (4)
J\x\ZUa a Jo
Proof. Since Re <p(t) = $-m cos tx dF(x\ we find by Fubini's theorem that
- f[l - Re ¢)(0] dt = - f P (1 - cos tx) dF(x) dt
= f°° \- f (1 - cos tx)dt\dF(x)
= -^ f ^00,
^ J\x\^Ha
where
1= inffl-^Ul-sinl>i,
so that (4) holds with K = 7. This establishes the lemma.
Proof of conclusion (2) of Theorem 1. Let <p„(t) -> ¢)(0, n -* oo, where
¢)(0 is continuous at 0. Let us show that it follows that the family of
probability measures {P„} is tight, where P„ is the measure corresponding to F„.
By (4) and the dominated convergence theorem,
-- f [1 -Re 9(f)] A
a Jo
as n -*■ oo.
Since, by hypothesis, ¢)(0 is continuous at 0 and ¢)(0) = 1, for every e > 0
there is an a > 0 such that
for all n > 1. Consequently {P„} is tight, and by Lemma 2 there is a prob-
§3. Proofs of Limit Theorems by the Method of Characteristic Functions 325
ability measure P such that
P„*P.
Hence
q>n(t)= f eitxP„(dx)-> f eitxP(dx),
J — oo J — 00
but also <p„(t) -► q>(i). Therefore q>(i) is the characteristic function of P.
This completes the proof of the theorem.
Corollary. Let {F„} be a sequence of distribution functions and {<p„} the
corresponding sequence of characteristic functions. Also let Fbea distribution
function and <p its characteristic function. Then F„^ F if and only if<p„(t) -*■
(p(t)for all t e R.
Remark. Let n, nu n2i ■ • • be random variables and Fnn -^ Fn. In accordance
with the definition of §10 of Chapter II, we then say that the random variables
Vu 12> • ■ ■ converge to n in distribution, and write n„ ^* n.
Since this notation is self-explanatory, we shall frequently use it instead
of F,n A Fn when stating limit theorems.
3. In the next section, Theorem 1 will be applied to prove the central limit
theorem for independent but not identically distributed random variables.
In the present section we shall merely apply the method of characteristic
functions to prove some simple limit theorems.
Theorem 2 (Law of Large Numbers). Let ^, ^2. ■ • • be a sequence ofindependent
identically distributed random variables with E |^x | < oo, S„ — £x + • ■ ■ + £„
and E^x = m. Then SJn -¾ w, that is, for every e > 0
P"J — — m > £>-► 0, n-»oo.
11" I J
Proof. Let <p(i) = Eeitil and <pSjn(t) = EeitSnf". Since the random variables
are independent, we have
by (II. 12.6). But according to (11.12.14)
<p(t) = 1 + km + o(f), t -> 0.
326
III. Convergence of Probability Measures. Central Limit Theorem
Therefore for each given t e R
(pl-\ = 1 + i-m + of-), n-»oo,
\n/ n \nl
and therefore
Vsn,n(t)=\l+i^m + o^
-+eitr
The function <p(t) = eitm is continuous at 0 and is the characteristic function
of the degenerate probability distribution that is concentrated at m. Therefore
► w,
and consequently (see Problem 7, §10, Chapter II)
This completes the proof of the theorem.
Theorem 3 (Central Limit Theorem for Independent Identically Distributed
Random Variables). Let £u £2>- •• oe a sequence of independent identically
distributed (nondegenerate) random variables with E£\ < oo and S„ =
£i + • ■ ■ + fii- Then asn-^ oo
where
¢(:
:X) = iJ-
e-»2'2 du.
Proof. Let E<^ = m, V<^ = <r2 and
q>(t) = Eeitl^-m).
Then if we put
<pn(t) = E exp
we find that
(5)
§3. Proofs of Limit Theorems by the Method of Characteristic Functions 327
But by (1112.14)
Therefore
as n -»■ oo for fixed t.
The function e~t2/2 is the characteristic function of a random variable
(denoted by jV(0, 1)) with mean zero and unit variance. This, by Theorem 1,
also establishes (5). In accordance with the remark in Theorem 1, this can
also be written in the form
S„ - ES„
> ^"(0, 1).
(6)
This completes the proof of the theorem.
The preceding two theorems have dealt with the behavior of the
probabilities of (normalized and symmetrized) sums of independent and identically
distributed random variables. However, in order to state Poisson's theorem
(§6, Chapter I) we have to use a more general model.
Let us suppose that for each n > 1 we are given a sequence of independent
random variables £nli..., <!;„„. In other words, let there be given a triangular
array
/<„
I £21» £22
\^31> ^32» £33/
of random variables, those in each row being independent. Put
S„ = £,!+•■• + Ln-
Theorem 4 (Poisson's Theorem). For each n>\ let the independent random
variables ^nl,..., £nn be such that
Pnk + <ink = 1
andY?k=iPnk
Proof. Since
P(U = l) =
Suppose that
max
l£k£>
-> A > 0, n -»■ 00.
= Pnk,
1
Then, for
e~AA
P(s„ = m)„
p«,*
n-
each
m
= 0)
-»■ OO,
m =
n->
= Qnk
0, 1,..
00.
(7)
Ee1** = Pnke" + q*
328
III. Convergence of Probability Measures. Central Limit Theorem
for 1 < k < n, by our assumptions we have
<pSn(t) = EeitS»= fliP^ + qnk)
= fl(l + ft*fe" - 1)) -* exp{A(e'< - 1)}, n -+ oo.
fc=l
The function <p{t) = exp{A(efr — 1)} is the characteristic function of the
Poisson distribution (11.12.11), so that (7) is established.
If 7t(A) denotes a Poisson random variable with parameter A, then (7) can
be written like (6), in the form
S„ ±> 7t(A).
This completes the proof of the theorem.
4. Problems
1. Prove Theorem 1 for R\ n>2.
2. Let £t, €2>--- he a sequence of independent random variables with finite means
E | f„ | and variances Vf„ such that Vf„ < K < oo, where K is a constant. Use
Chebyshev's inequality to prove the law of large numbers (1).
3. Show, as a corollary to Theorem 1, that the family {q>„} is uniformly continuous and
that <pn^q> uniformly on every finite interval.
4. Let £„, n > 1, be random variables with characteristic functions (pin(t\ n>l. Show
that &, -^ 0 if and only if (pin(t) -> 1, n -> oo, in some neighborhood of t = 0.
5. Let X„ X2, ■ • - be a sequence of independent random vectors (with values in Rk) with
mean zero and (finite) covariance matrix T. Show that
x> + -_+x»A ^(o,r).
V"
(Compare Theorem 3.)
§4. Central Limit Theorem for Sums of Independent
Random Variables.
I. The Lindeberg Condition
1. In this section, the central limit theorem for (normalized and centralized)
sums of independent random variables £x, ^2> ■•• will t>e proved under the
traditional hypothesis that the classical Lindeberg condition is satisfied. In the
next section, we shall consider a more general situation. First, the central
limit theorem will be stated in the "series form" and, second, we shall prove
it under the so-called nonclassical hypotheses.
§4. Central Limit Theorem for Sums of Independent Random Variables 329
Theorem 1. Let £lf £2> ■ ■ be a sequence of independent random variables with
finite second moments. Let mk = E£k, o* = V^k > 0, S„ = ^ + ■•■ + £„,
&n = Sfc=i °ft» and let Fk = Fk(x) be the distribution function of the random
variable £k.
Let us suppose that the Lindeberg condition is satisfied: for every e > 0
(L) 4 1 [ (x- mk)2 dFk(x) -+ 0, n - oo. (1)
^n ^=1 J{x:|x-mfc|ScDn}
Then
Sn - ES„ ,
-^(0,1). (2)
VVS„
Proof. Without loss of generality we assume that mk = 0 for k > 1. We set
%(t) = Ee*& T„ = 5,,/7¾ = Sn//)„, %ii(t) = Eeu\ <pTJt) = EeitT».
Then
<pTjt)=&«■■=e*~=^QJ=j «*(£), (3)
and for the proof of (2) it is sufficient (by Theorem 1 of §3) to establish that,
for every t e R,
<pTn(r)-e~t2/2, n-oo. (4)
We choose a t e R and suppose that it is fixed throughout the proof. By
the representations
e»_l + iy + e-lf,
which are valid for all real y, with 0X = O^y) and 62 = Qi(y\ such that \QX \ <
1,| 021 <, 1, we obtain
%(t) = Ee^ = f°° eta <*Fft(x) = f (1 + fcx + ^^) dFk(x)
+ 1 + fcx— +^l-L\dFk(x)
J\x\<eD„\ 2 V /
= 1 + % f ^1 x2 <*Fk(x) - £ f x2 dFk(x)
+!£( e2|x|3dfi(x)
0 J\x\<eD„
(here we have also used the fact that, by hypothesis, mk = $™m x dFk(x) = 0).
330
III. Convergence of Probability Measures. Central Limit Theorem
Consequently,
OUn J\x\<tDn
Since
|2
if 0lX2dFk(x)\<U x2dFk(x),
* J\x\>:tD„ I Z J|x|2eft.
we have
if elx2dFk(x) = 6l\ x2dFk{x\ (6)
where 0X = 0^, k, n) and |0X | < 1/2.
In the same way,
- 02|x|3 dFk(x)\ <-\ -=.|x|3 JFk(x) <- sD„x2 dFk{x\
°J\x\<tD„ | °J|x|<eDn \x\ °J\x\<tDn
|6
and therefore,
i f 02|x|3 </Fk(x) = 02 f e£>„x2 dFk(x), (7)
where 02 = 62(t, fc, n) and |02| < 1/6.
We now set
Akn = j&\ x2dFk(x),
Un J\x\<tDn
un J\x\^tD„
Then, by (5)-(7),
%Qr) = 1-^ + 12¾¾ + |t|3tf24b, = 1 + Ckn. (8)
We note that
t (4* + Bkn) = 1 (9)
and by (1)
E4^0, n-oo. (10)
fc=l
Consequently, for sufficiently large n,
§4. Central Limit Theorem for Sums of Independent Random Variables 331
and
max |CkJ < t2e2 + e|r|3
l£k£n
Z |QJ < t2 + s\t\*.
(11)
(12)
We now appeal to the fact that, for any complex numbers z with \z\ < 1/2,
ln(l +z) = z + 0|z|2,
where 0 = 6(z) with |0| < 1 and In denotes the principal value of the
logarithm.* Then, for sufficiently large n, it follows from (8) and (11) that, for
sufficiently small E > 0,
ln<pk[
= ln(l + Ckn) = Ckn + 0kJCJ2
where |0kJ < 1. Consequently, by (3),
t2 t2
- + In q>Tn{t) = - +
itiy,
£in ^(w)=? + £ c*» + £ e^c^2-
k=i \DnJ I k=i fc=i
But
% + t Ckn = ^( 1 - f xkn) + r2 f 0x(r, *, n)Bkn + e|r|3 f 02(r, k, n)Akni
Z fc=l l \ k=l J fc=l fc=l
and by (9) and (10), for any b > 0 we can find numbers n0 and e > 0, with n0
so large that for all n > n0
-+IQ,
'2'
In addition, by (11) and (12), we can find a positive number e such that
X ^JCftJ
< max |CJ £ |CJ<(r2e2 + e|r|3)(r2 + e|r|3).
l^k^n fc=l
Therefore, for sufficiently large n, we can choose e > 0 so that
5
,_ *»ic;tari<-
fc=l
and consequently,
- + In <pTn(t)
<5.
♦The principal value lnz of the complex number z is denned by lnz = ln|z| + iargz,
—it < arg z<,it.
332 III. Convergence of Probability Measures. Central Limit Theorem
Therefore, for any real ty
q>Tn(t)et2'2 -► 1, n -► oo
and hence,
q>Tn(t) -► e~t2/2, n -► oo.
This completes the proof of the theorem.
2. We turn our attention to some special cases in which the Lindeberg
condition (1) is satisfied and consequently, the central limit theorem is valid.
a) Let the "Lyapunov condition" be satisfied: for some 8 > 0
* £EKk-mk|2+^0, ii-oo. (13)
vn fc=l
Let e > 0; then
E|£k - mk\2+d =T |x - mk\2+d dFk{x)
J -oo
>f \x-mk\2+5dFk(x)
J {x: x\x-mk\ 2leD„}
>e*Ds„[ (x-mk)2dFk(x)
J {x:\x-mk\>:eD„}
and therefore,
4 I f (* " mk)2 dFk(x) < * • * ± E|& - mk\2+s.
Consequently, the Lyapunov condition implies the Lindeberg condition.
b) Let f x, £2.... be independent identically distributed random variables
with m = E^i and variance 0 < g2 = V^ < oo. Then
4tf |x-m|2JFk(x)
Mi fc=l J{x:|x-m|S:£Dn}
= Af |x-m|2 4^)-+0,
W<* J {x:|x-m|££oV^
since {x: |x — m| > eo^^/n} J, 0, n -»■ oo, and cr2 = E|^t — m|2 < oo.
Therefore, the Lindeberg condition is satisfied and consequently, Theorem
3 of §3 follows from the proof of Theorem 1.
c) Let £ls £2> • • • be independent random variables such that for all n > 1
|£k|<K<oo,
where K is a constant and D„ -> oo, n -* oo.
§4. Central Limit Theorem for Sums of Independent Random Variables 333
Then by Chebyshev's inequality
x - mk\2 dFk{x) = E[(& - mk)2/(|£k - mk\ > eDJ]
f,
{x:\x-mk\^eD„}
< (2K)2P{|^ - mk\ £ eD„} < (2K)2-^j
e Lf„
and therefore,
«sL
|x-mk|2JFk(x)< ^--0,
{x:|x-mfc|£CDn} £ "„
Consequently, the Lindeberg condition is satisfied again and therefore, the
central limit theorem is verified.
3. Remark 1. Let Tn = (S„ - ES„)/D„ and FTn(x) = P(T„ < x). Then
proposition (2) shows that for all x e R
ft„(x) -> O(x), n -► oo.
Since 0(x) is continuous, the convergence here is actually uniform (problem
5,§1):
sup |FTn(x) - 0(x)| -►(), n->oo. (14)
xeJt
In particular, it follows that
P{S„<x}-0.(^^)-0, 1,-00.
This proposition is often expressed by the statement that for sufficiently large
n the value S„ is approximately normally distributed with mean ES„ and
variance D2 = VS„.
Remark 2. Since, according to the preceding remarks, FTn(x)-*0(x) as
n -> oo, uniformly in x, it is natural to raise the question of the rate of
convergence in (14). In the case when the numbers ^, £2>. • • are independent
and uniformly distributed with El^l3 < oo, this question is answered by the
Berry-Esseen inequality:
sup \FTn(x) - 0>(x)| < C^1"^1' f (15)
x cr y/n
where the absolute constant C satisfies the inequality
1/V(2*0 < C < 0.8.
The proof of (15) will be given in §11.
Remark 3. We can state the Lindeberg condition in a somewhat different
334
III. Convergence of Probability Measures. Central Limit Theorem
(and more compact) version which is especially convenient in the "series
form."
Let ^, £2, ••• be a sequence of independent random variables, with
mk = E£k, al = V£k > 0, D* = £2-i <rk2, and Znk = (¾ - mk)/Dn. In this
notation, condition (1) assumes the following form:
(L) £ E[&/(|U > «)] - 0, n ^ oo. (16)
k=i
If Sn = ^nl + ••• + <!;„„, we have MSn = 1 and Theorem 1 can be given the
following form: if (16) is satisfied, we have
Sn -^(0,1).
In this form the central limit theorem is valid without the assumption that ^
has the special form (£nk — mk)/D„. In fact, we have the following result whose
proof is word for word the same as that of Theorem 1.
Theorem 2. For each n>llet
£nl> £n2>---> £nn
be a sequence of independent random variables for which E£nk = 0 and VS„ = 1,
where Sn = £nl+- + tm.
Then the Lindeberg condition (16) is a sufficient condition for the
convergence S„ -i JV(0,1).
4. Since
max E& < e2 + Z E[&!(!£ J > e)],
l£fc:Sn fc=l
it is clear that the Lindeberg condition (16) implies that
max E^k->0, n-»oo. (17)
l£k:Sn
It is noteworthy that when this condition is satisfied, it follows automatically
from the validity of the central limit theorem that the Lindeberg condition is
satisfied (Lindberg-Feller theorem).
Theorem 3. For each n>llet
^nl> ^n2> • ■ • > Qnn
be a sequence of independent random variables for which E£nk = 0 and VS„ = 1,
where Sn = £nl + ••• + £„„. Let (17) be satisfied. Then the Lindeberg
condition is necessary and sufficient for the validity of the central limit theorem,
S„ -^(0,1).
The sufficiency follows from Theorem 2. To establish the necessity we need
the following lemma (compare Lemma 3, §3, Chapter III).
§4. Central Limit Theorem for Sums of Independent Random Variables
335
Lemma. Let ^ be a random variable with distribution function F = F(x),
E£ = 0, V£ = y > 0. Then for every a>0
f x2 dF(x) < \ [Re fiy/ia) - 1 + 3ya2l
J\x\>lla a
(18)
J\x\>l/a
where f(t) = Eeil* is the characteristic function of £
Proof. We have
Re Jit) - 1 + iyt2 = b*2 - f°° [1 - cos tx] dF(x)
J -oo
= bt2 - f [1 - cos fcc] dF(x) - f [1 - cos fcc] dF(x)
J\x\<l/a J\x\^l/a
> b*2 - it2 f x2 dF(x) - 2a2 f x2 dF{x)
J\x\<l/a JW^l/fl
= {\t2-2a2y\ x2dF(x).
If we set t = y/6a, we obtain (18), as required.
We now turn to the proof of the necessity in Theorem 3.
Let
Fnk(x) = P(U<x\ Jnk(t) = Eeit^
e^ = o, \/U = ynk>oi (19)
n
£y«k = 1> maxy^-O, n-oo.
k=l l£k£n
Let In z denote the principal value of the logarithm of the complex number z.
Then
lnf[/nk(0=i;in/nk(r) + 27«m,
k=i k=i
where m = m(n, t) is an integer. Consequently,
Re In f[ Jnk(t) = Re £ In /nk(r). (20)
fc=l fc=l
Since
we have
n/-»
336
Therefore,
For \z\ < 1
111. Convergence of Probability Measures. Central Limit Theorem
Re In n fnk(t) = Re In JJ /*(*)
z2 z3
ln(l+z) = z-y + y-
(21)
(22)
and for \z\ < 1/2
|ln(l+z)-z|<|z|2. (23)
By (19), for each fixed t, all sufficiently large n and all k = 1,2,..., n, we have
l/*W- H <W2^i (24)
Hence, we obtain from (23) and (24)
Z {ln[l + (/*« - 1)] - (f*(t) - 1)}
fc=l
<z luo-h2
(25)
^T max ynk Z tt* = T max ^fc"*0' w->oo,
** l£k£n k=l ^ l£k:Sn
and consequently,
j Z ln/*« " Re t U*M " 1)U Q. » - <».
fc=l fc=l |
It follows from (20), (21), and (25) that
Re Z U*M " 1) + i'2 = Z [Re f*(t) ~ 1 + it2yj - 0, n ^ co.
fc=l fc=l
Setting £ = ^/6(3, we find that for each a > 0
Z [Re/„k(x/M - 1 + 3a2ynk] -+ 0, n ^ co. (26)
k=i
Finally, from (18) with a = 1/e and (26), we obtain
Z *K*I(\U>*\=t [ *2<lFnk(x)
fc=l fc=l J\x\2>t
<s2t CRe /„k(x/6«) - 1 + 3a2ynk] -+ 0, n ^ co,
k=i
which shows that the Lindeberg condition is satisfied.
5. Problems
1. Let £,, £2> ••• he a sequence of independent identically distributed random
variables with Ef! = 0 and Eff = 1. Show that
§5. Central Limit Theorem for Sums of Independent Random Variables
337
2. Show that in the Bernoulli scheme the quantity supx|l^n(x) — 0(x)| is of order
§5. Central Limit Theorem for Sums of Independent
Random Variables
II. Nonclassical Conditions
1. It was shown in §4 that the Lindeberg condition (4.16) implies that the
condition
max EfiJk-*0,
l£fc:Sn
is satisfied. In turn, this implies the so-called condition of being negligible in
the limit (asymptotically infinitesimal), that is, for every e > 0,
max P{|^nk| > e}->0, n-*co.
l£fc£n
Consequently, we may say that Theorems 1 and 2 of §4 provide a
condition of realization of the central limit theorem for sums of independent
random variables under the hypothesis of negligibility in the limit. Limit
theorems in which conditions of negligibility in the limit are imposed on
individual terms are usually called theorems with a classical formulation.
It is easy, however, to give examples of nondegenerate random variables
for which neither the Lindeberg condition nor negligibility in the limit is
satisfied, but nevertheless the central limit theorem is satisfied. Here is the
simplest example.
Let £lf £2> ••• be a sequence of independent normally distributed random
variables with E£„ = 0, V^ = 1, V£k = 2k"2, k > 2. Let S„ = £nl + • • • + Zm
with
It is easily verified that here neither the Lindeberg condition nor the
condition of negligibility in the limit is satisfied, although the validity of the central
limit theorem is evident, since S„ is normally distributed with ES„ = 0 and
VS„ = 1.
Theorem 1 (below) provides a sufficient (and necessary) condition for the
central limit theorem without assuming the "classical" condition of
negligibility in the limit. In this sense, condition (A), presented below, is an example
of "nonclassical" conditions which reflect the title of this section.
338
111. Convergence of Probability Measures. Central Limit Theorem
2. We shall suppose that for each n > 1 there is a given sequence ("series
form") of independent random variables
with EU = 0, V^ = o£ > 0, SU <r„2k = 1. Let Sn = £Hl + ■•• + £„„,
Fnk(x)=Pfck<x}, *(x) = (2w)-^r e"y2/2^, 0^) = 0
J-00
Theorem 1. To have
Sn -i .^(0,1), (1)
ft is sufficient (and necessary) that for every e>0the condition
(A) £ f |x||Fnk(x)-Onk(x)|dx--0, n->oo (2)
fc=l J\x\>t
is satisfied.
The following theorem clarifies the connection between condition (A) and
the classical Lindeberg condition
(L) £ f x2rfFnk(x)-0, n^oo. (3)
fc=l J|x|>s
Theorem 2.1. Tfce Lindeberg condition implies that condition (A) is satisfied:
(L)=>(A).
2. // max1:Sk<.n E^ -> 0 as n -> oo, ffce condition (A) implies the Lindeberg
condition (L):
(A)=>(L).
Proof of Theorem 1. The proof of the necessity of condition (A) is rather
complicated. Here we only prove the sufficiency.
Let
/*(t)=Eeft6*, f„(t)=Eei's»,
<Pr*(t) = r eitx dO^x), q>(t) = f eitx dO(x\
J— 00 J— 00
It follows from §12, Chapter II, that
<pjt) = e-****2, q>(t) = e~t2f2.
By the corollary of Theorem 1 of §3, we have S„ 4. jV(0, 1) if and only if
f„(t) -> <p(t) as n -> oo, for every real t.
§5. Central Limit Theorem for Sums of Independent Random Variables
339
We have
fn(t)-<p(t)=Ylfnk(t)-l\<pnk(t).
Since \fnk(t)\ < 1 and \<pnk(t)\ < 1, we have
\fn(t)-<p(t)\ =
Ufn^)-U<Pnk(t)
fc=l fc=l
^ I \fjfi - <Pjt)\ = 11 f" **i{F* - o„t
fc=l k=l | J-oo
=il£(eta-to+¥)^-^
(4)
where we have used the fact that
f xkdFnk=f xkdOnk for fc = l,2.
J—oo J—00
If we apply the formula for integration by parts (Theorem 11, §6, Chapter
II) to the integral
£ (eitx - itx + l^f}d{Fnk - Onk),
we obtain (taking account of the limits x2[l — Fnk(x) + Fnfc(—x)] -»0), and
x2[l - Onk(x) + Onfc(-x)] -* 0, x-* oo)
|ro ^.« _ itx+^!^(Fnk _ 0nk)
= it f° (eta - 1 - Ux)(F*(x) - Onk(x)) dx.
J-00
From (4) and (5), we obtain
\fn{t) - <p(t)\ < t I * f <**" " 1 " fcx)W*M " *«*(*))
fc=l | J-oo
\t\3 n r
+ 2t2 £ f |x||Fnk(x)-Onk(x)|dx
k=l J|x|>£
< e\t\3 t <& + 2'2 t f |x||Fnk(x) - Onk(x)| dx, (6)
k=l k=l J|x|>£
where we have used the inequality
(5)
340
111. Convergence of Probability Measures. Central Limit Theorem
I \x\\Fnk(x)-Onk(x)\dx^2x7^ (7)
which is easily established by using (71), §6, Chapter II.
It follows from (6) that fn(t)-► q>{t) asn-»oo, because e is an arbitrary
positive number and condition (A) is satisfied.
This completes the proof of the theorem.
Proof of Theorem 2.1. By §4 of the Lindeberg condition (L), it follows that
maxls.ks.n<r4 -► 0. Hence, if we use the fact that Y?k=i o^ = 1, we obtain
£ [ x2 dO^x) < f x2 dO(x) - 0, n -► oo. (8)
Together with Condition (L), this shows that, for every e > 0,
*2d[_Fnk{x) + OUMl^O, n-oo. (9)
fe=l J\x\>t
Let us fix e > 0. Then there is a continuous differentiable even function
h = h(x) for which \h(x)\ < x2, \h'(x)\ < 4x, and
h(x) = K M>*
' l0, |x| < e.
For h(x), we have by (9)
t \ h(x)dlFnk(x) + Onk(x)^^0> n->oo. (10)
By integrating by parts in (10), we obtain
t f *'(x)[(l " F*M) + (1- <J>.*M)] &c
= £f ^ft + OJ-^O,
t f *'MK*(*) + **(*)] &c = £ f *M«F* + <■>*] -o.
fc=l Jx^-c fe=l Jx£-t
Since fc'(x) = 2x for |x| > 2e, we obtain
t \ \x\\Fnk(x)-Onk(x)\dx^01 n^oo.
Therefore, since e is an arbitrary positive number, we find that (L) => (A).
2. For the function h = h(x) introduced above, we find by (8) and the
condition maxx £k£n o2k -> 0 that
£ f h(x) dOJpc) <t\ x2 d**(x) - 0, n -+ oo. (11)
fc=l J|x|>8 fc = l J|x|>£
§6. Infinitely Divisible and Stable Distributions
341
If we integrate by parts, we obtain
I I f Hx)dlFnk - Onk] I < I t f Hx)dUl - Fnk) -(1- Onk)]
+ |£ f Hx)dlFnk-Onk]\
< t f Ifc'MI IC(1 " *W -(1- *.*)]l dx
+1 [ mx)\\Fnk-^nk\dx
<4£ f |x||Fnk(x)-<X>nk(x)|rfx.
It follows from (11) and (12) that
(12)
£ f x2 JFnk(x) < £ f h(x) dFnk(x) -+ 0, n -
i.e., the Lindeberg condition (L) is satisfied.
This completes the proof of the theorem.
3. Problems
1. Establish formula (5).
2. Verify relations (10) and (12).
§6. Infinitely Divisible and Stable Distributions
1. In stating Poisson's theorem in §3 we found it necessary to use a triangular
array, supposing that for each n > 1 there was a sequence of independent
random variables {^„tk}, 1 < k < n.
Put
^ = ^1 + --- + ^.., «>!■ (1)
The idea of an infinitely divisible distribution arises in the following
problem: how can we determine all the distributions that can be expressed as
limits of sequences of distributions of random variables T„, n > 1?
Generally speaking, the problem of limit distributions is indeterminate
in such great generality. Indeed, if £ is a random variable and £nt x = £, £„ k = 0,
1 < k < n, then T„ = £ and consequently the limit distribution is the
distribution of ^, which can be arbitrary.
In order to have a more meaningful problem, we shall suppose in the
342
III. Convergence of Probability Measures. Central Limit Theorem
present section that the variables £nf,,..., £„t„ are, for each n > 1, not only
independent, but also identically distributed.
Recall that this was the situation in Poisson's theorem (Theorem 4 of §3).
The same framework also includes the central limit theorem (Theorem 3
of §3) for sums Sk = f, + • ■ • + £,, n > 1, of independent identically
distributed random variables £,, &,... . In fact, if we put
I = ^ ~ E(^ y2 = yS
then
t _ V ^ _ ^" ~ "
k — 1 vn
Consequently both the normal and the Poisson distributions can be
presented as limits in a triangular array. If T„ -> T, it is intuitively clear that
since T„ is a sum of independent identically distributed random variables, the
limit variable T must also be a sum of independent identically distributed
random variables. With this in mind, we introduce the following definition.
Definition 1. A random variable T, its distribution FT, and its
characteristic function <pT are said to be infinitely divisible if, for each n > 1, there are
independent identically distributed random variables rju..., n„ such thatf
T - *li + ''' + *ln (°r» equivalently, FT = Fni * ■ ■ ■ *Frin, or <pT = {(p^J1).
Theorem 1. A random variable T can be a limit of sums T„ = Yi=i £„fk if and
only if T is infinitely divisible.
Proof. If T is infinitely divisible, for each n > 1 there are independent
identically distributed random variables £„A £n_k such that T =
£«. l + ' • * + Cn.fc* and tnis means that T = Tni n > 1.
Conversely, let T„-^T. Let us show that T is infinitely divisible, i.e., for
each k there are independent identically distributed random variables
nXi...,riksuch that T = ^ + • • ■ + rjk.
Choose a k > 1 and represent Tnk in the form Cl,1) + ■ ■ • + C„\ where
(n — £nk, 1 + -- + £„fc,„, • • • , Cn = €nk,n(k- 1)+1 + * " + Znk.nk-
Since Tnk ^* T, n -> oo, the sequence of distribution functions corresponding
to the random variables Tnk, n > 1, is relatively compact and therefore, by
Prohorov's theorem, is tight. Moreover,
[PCft" > z)]k = P(Ci,1) > z,..., Cik) > z) < P(T* > kz)
and
cpcci11 < -*)]k = p(cin < -z,..., e} < -z) < p(Tnk < -kz).
t The notation £ = t] means that the random variables £ and »/ agree in distribution, i.e.,
F,(x) = Fn(x),xeR.
§6. Infinitely Divisible and Stable Distributions
343
The family of distributions for g*\ n > 1, is tight because of the preceding
two inequalities and because the family of distributions for Tnk, n> 1, is
tight. Therefore there is a subsequence {nj =" {n} and a random variable
rji such that Cjj* -4 ^ as n{,-> oo. Since the variables £j,l), •. •, ff0 are identically
distributed, we have (J? -^ >?2, • • •, CS? ** *7k> where >a = ij2 = - • • = r}k.
Since C5,1 ^---, £ik) are independent, it follows from the corollary to Theorem 1
of §3 that qu..., r\k are independent and
^ = ^ + -- + ^111+-- + ¾.
But Tn.k -4 T, therefore (Problem 1)
7 = ^ +••• + rjk.
This completes the proof of the theorem.
Remark. The conclusion of the theorem remains valid if we replace the
hypothesis that £n> l9..., £n%n are identically distributed for each n > 1 by the
hypothesis that they are uniformly asymptotically infinitesimal (4.2).
2. To test whether a given random variable T is infinitely divisible, it is
simplest to begin with its characteristic function q>{i). If we can find
characteristic functions <pn(t) such that <p(t) = \jp„(t)Y for every n > 1, then T is
infinitely divisible.
In the Gaussian case,
<p(t) = eitme~ w2)'2"2,
and if we put
<pn(t) = ei"nl»e-ll/2)t2<T2/\
we see at once that q>(i) = [<pn(t)]n-
In the Poisson case,
<p(t) = exp{ V ~ 1)},
and if we put <pn(r) = exp{(A/n)(el'f - 1)} then <p(t) = [<pn(t)J.
If a random variable T has a T-distribution with density
Ixa-le-x/p
0, x < 0,
it is easy to show that its characteristic function is
Consequently <p(0 = 0„(t)]" where
and therefore T is infinitely divisible.
344
III. Convergence of Probability Measures. Central Limit Theorem
We quote without proof the following result on the general form of the
characteristic functions of infinitely divisible distributions.
Theorem 2 (Levy-Khinchin Theorem). A random variable T is infinitely
divisible if and only if<p(t) — exp ij/(t) and
w = » - ¥ + £Jf -1 - ■£?) l-±£** rc
where fie R,g2 >0and A is a finite measure on (R, @(R)) with A{0} = 0.
3. Let £1} £2,... be a sequence of independent identically distributed
random variables and S„ = f x + • • • + f„. Suppose that there are constants
b„ and a„ > 0, and a random variable T, such that
^-^4T. (3)
We ask for a description of the distributions (random variables T) that can be
obtained as limit distributions in (3).
If the independent identically distributed random variables £lf £2»--«
satisfy 0 < cr2 = V£x < oo, then if we put b„ = wE^ and an = cr^/n, we
find by §4 that T is the normal distribution JV(0, 1).
If f(x) = 6/n(x2 + 02) is the Cauchy density (with parameter 0 > 0)
and £ls ^2.--- are independent random variables with density /(x), the
characteristic functions <p^(t) are equal to e~e|f| and therefore q>Sn/n(t) =
(e-e|f|/n^n __ e-e\t\^ j e ^ ^yn ajso has a Cauchy distribution (with the same
parameter 6).
Consequently there are other limit distributions besides the normal, for
example the Cauchy distribution.
If we put £nk = (£Jan) - (b„/na„\ 1 < k < n, we find that
^- it.. (=¾
"n fc=l
Therefore all conceivable distributions for T that can conceivably appear as
limits in (3) are necessarily (in agreement with Theorem 1) infinitely divisible.
However, the specific characteristics of the variable T„ = (S„ — b„)/a„ may
make it possible to obtain further information on the structure of the limit
distributions that arise.
For this reason we introduce the following definition.
Definition 2. A random variable T, its distribution function F(x), and its
characteristic function q>(t) are stable if, for every n > 1, there are constants
a„ > 0, b„, and independent random variables £1}..., £„, distributed like T,
such that
anT + ^=4^+... + 6, (4)
§6. Infinitely Divisible and Stable Distributions
345
or, equivalently, F[(x - bn)/a„] = F * . • • * F(x), or
n times
Mt)T = lq>{ant)]eib^ (5)
Theorem 3. A necessary and sufficient condition for the random variable T
to be a limit in distribution of random variables (S„ — b„)/a„, a„ > 0, is that T is
stable.
Proof. If T is stable, then by (4)
* S -h
a„
where S„ = £x + • • • + f„, and consequently (S„ — b„)/a„ ^* T.
Conversely, let £l3 £2, • • • be a sequence of independent identically
distributed random variables, S„ = £t + • • ■ + f„ and (S„ - fc„)A*n -> T, a„ > 0.
Let us show that T is a stable random variable.
If T is degenerate, it is evidently stable. Let us suppose that T is non-
degenerate.
Choose k > 1 and write
si" = ¢1 + ••• + 6.,..-, s*> = 6*->,„+i + ••• + fit.,
*■ n » • • • > x n —
a„ an
It is clear that all the variables Tl„l\..., Tp have the same distribution and
7J,°-4T, n-oo, i = 1 fc.
Write
Up = Tlnl) + --- + Tp.
Then
lj(k) 4, ^cn + ... + T(k)
where T(1) J= .-.i=T<k)^T.
On the other hand,
^ = ^1 + -- + ^-^
a„
kb„
. 0^ (tl + • • • + &h ~ fcfcn\ + &knj
= al^n + flk), (6)
where
a(fe) = fkn ^ ^(fe) = fefcn- k&n
«n «n
346
III. Convergence of Probability Measures. Central Limit Theorem
and
£! + ••• + &,-
K„ = -
It is clear from (6) that
_ U™ - ftk)
Vkn~ «<k) '
where V^ -4 7, L/J,k) -4 7(1) + • • • + T(k>, n -► oo.
It follows from the lemma established below that there are constants
a(k) > 0 and 0(k) such that aj,k) -► a(k) and flk) -► 0(k) as n -► oo. Therefore
, 7^ + ---+7^-/^
i " a(k)
which shows that T is a stable random variable.
This completes the proof of the theorem.
We now state and prove the lemma that we used above.
Lemma. Let £„ -4 £ and let there be constants a„ > 0 and bn such that
anZn + bn±l
where the random variables £ and f are not degenerate. Then there are
constants a > 0 and b such that lim a„ = a, lim b„ = b, and
l = a£ + b.
Proof. Let <p„, <p and q> be the characteristic functions of <!;„, £ and |,
respectively. Then <pfln5n+bn(0> the characteristic function of a„£„ + b„, is equal
to ^(pJLant) and, by Theorem 1 and Problem 3 of §3,
eitb"(Pn(ant)^m, (7)
<PM -+ ¢,(0 (8)
uniformly on every finite interval of length t.
Let {nt-} be a subsequence of {n} such that a„. -> a. Let us first show that
a < oo. Suppose that a = oo. By (7),
sup IIPnfe.01 ~ 10(011 -> 0, n -► oo
for every c > 0. We replace t by foAv Then, since ani -> oo, we have
K*3I-K9I-
and therefore
kni.(fo)l-19(0)1 = 1.
§6. Infinitely Divisible and Stable Distributions
347
But \<Pnt(tQ)\ -> \<p(t0)\. Therefore \<p(t0)\ = 1 for every t0eR, and
consequently, by Theorem 5, §12, Chapter II, the random variable ^ must be
degenerate, which contradicts the hypotheses of the lemma.
Thus a < oo. Now suppose that there are two subsequences {nf} and {n'£}
such that a„{ -► a, ani -»■ a', where a =fc a'; suppose for definiteness that
0 < a' < a. Then by (7) and (8),
I <PnK 01 -H <P("t) I. I <P«ffl*t 0 I - 10(01
and
I PmK 01 - I <p(*'t) I, | <pnfai 01 -I0(01 •
Consequently
\<p(at)\ = k(fl'0|,
and therefore, for all t e /?,
If&)l-|»(f)|—-|»((^"t)|-I, -co.
Therefore |<p(OI = 1 and, by Theorem 5 of §12, Chapter II, it follows that £
is a degenerate random variable. This contradiction shows that a — d
and therefore that there is a finite limit lim an = a, with a > 0.
Let us now show that there is a limit lim fc„ = b, and that a > 0. Since (8)
is satisfied uniformly on each finite interval, we have
<Pn(an t)-Kp(at\
and therefore, by (7), the limit lim,,.^ eitbn exists for all t such that <p{at) ^ 0.
Let 3 > 0 be such that <p(at) ^ 0 for alHf | < <5. For such t, lim eitbn exists.
Hence we can deduce (Problem 9) that lim \b„\ < oo.
Let there be two sequences {nf} and {«•} such that lim bni = b and
lim bni = b'. Then
for 1t1 < 5, and consequently b = fc'. Thus there is a finite limit b = lim fc„
and, by (7),
0(0 = eitb<p(at\
which means that £ = a£ + b. Since £ is not degenerate, we have a > 0.
This completes the proof of the lemma.
4. We quote without proof a theorem on the general form of the
characteristic functions of stable distributions.
Theorem 4 (Levy-Khinchin Representation). A random variable T is stable
if and only if its characteristic function <p(t) has the form <p(t) = exp ^(0,
348
III. Convergence of Probability Measures. Central Limit Theorem
m = itP - d\tr(l + iOyly &f, a)\ (9)
whereO<a < 2,/Se R,d > O,|0| < l,t/\t\ = 0fort = 0,and
G(t,a) = {tan^a ^9^ (10)
l" ' 1(2/71) log |r| jfa-1.
Observe that it is easy to exhibit characteristic functions of symmetric
stable distributions:
<p{t) = *-<""", (11)
where 0 < a < 2, d > 0.
5. Problems
1. Show that f = t\ if f„ i f and £, -i ij.
2. Show that if <p, and q>2 are infinitely divisible characteristic functions, so is q>i- (p2-
3. Let <p„ be infinitely divisible characteristic functions and let q>„(t) -»<p(t) for every
t e K, where <p(f) is a characteristic function. Show that (pit) is infinitely divisible.
4. Show that the characteristic function of an infinitely divisible distribution cannot take
the value 0.
5. Give an example of a random variable that is infinitely divisible but not stable.
6. Show that a stable random variable f always satisfies the inequality E | f |r < oo for all
r e (0, a).
7. Show that if f is a stable random variable with parameter 0 < a < 1, then <p(t) is not
differentiable at t = 0.
8. Prove that e~dUia is a characteristic function provided that d > 0, 0 < a < 2.
9. Let (&„)„., ,_bea sequence of numbers such that limn eitbn exists for all \t\ < d, d > 0.
Show that lim |fcn| < oo.
§7. Metrizability of Weak Convergence
1. Let (£, <f, p) be a metric space and &>(E) = {P}, a family of probability
measures on (£, S). It is natural to raise the question of whether it is possible
to "metrize" the weak convergence P„^>P that was introduced in §1, that
is, whether it is possible to introduce a distance ;u(P, P) between any two
measures P and P in 0>{E) in such a way that the limit ;u(Pn, P)-»0 is
equivalent to the limit P„ ^ P.
In connection with this formulation of the problem, it is useful to recall
that convergence of random variables in probability, £„ -^ £, can be metrized
§7. Metrizability of Weak Convergence
349
by using, for example, the distance dP(£, rj) = inf{e > 0: P(|£ — rj\ > e) < e}
or the distances d(£ rj) = E(|£ - ij|/(1 + If - n\)l d{& ij) = E min(l, |£ - n\\
(More generally, we can set d(£, rj) = Eg(|£ — rj\\ where the function g = g{x),
x > 0, can be chosen as any nonnegative increasing Borel function that is
continuous at zero and has the properties g(x + j;) < g{x) + g{y) for x > 0,
y > 0, g(0) = 0, and g(x) > 0 for x > 0.) However, at the same time there is,
in the space of random variables over (£2, ^, P), no distance d(£, rj) such that
d(£,n, £) -> 0 if and only if £„ converges to £ with probability one. (In this
connection, it is easy to find a sequence of random variables £n, n > 1, that
converges to £ in probability but does not converge with probability one.) In
other words, convergence with probability one is not metrizable. (See the
statements of problems 11 and 12 in §10, Chapter II.)
The aim of this section is to obtain concrete instances of two metrics,
L(P, P) and \\P — P\\$L in the space ^(E) of measures, that metrize weak
convergence:
Pn^PoL{Pn,P)^0o\\Pn-PnL^0.
2. The Levy-Prokhorov metric L(P9 P). Let
p(x, A) = inf{p(x, y):ye A},
A* = {xe E: p(x, A) < e}, A e <f.
For any two measures P and P e ^(E), we set
<r(P, P) = inf{e > 0: P{F) < P(Fe) + e for all closed sets F e <f} (2)
and
L(P, P) = maxWP, P), <r(P, P)]. (3)
The following lemma shows that the function L{P, P) e &>(E\ which is
defined in this way, and is called the Levy-Prokhorov metric, actually defines
a metric.
Lemma 1. The function L(P, P) has the following properties:
(a) L{P, P) = L{P, P)(= <r(P, P) = <r(P, P)),
(b) L(P, P) < L(P, P) + L(P, P),
(c) L(P, P) = 0if and only if P = P.
Proof, a) It is sufficient to show that (with a > 0 and ft > 0)
"P(F) < P(Fa) + 0 for all closed sets Fe<T (4)
if and only if
"P(F) < P(F*) + 0 for all closed sets Fe<T. (5)
Let T be a closed subset of S. Then the set Ta is open and it is easy to verify
that T £ E\(E\Taf. If (4) is satisfied, then, in particular,
350
III. Convergence of Probability Measures. Central Limit Theorem
P(E\Ta) < P((E\Taf) + 0
and therefore,
P(T) < P(E\(E\Taf) < P(Ta) + ft
which establishes the equivalence of (4) and (5). Hence, it follows that
<r(P, P) = <t(P, P) (6)
and therefore,
L{P, P) = <r(P, P) = <r(P, P) = L(P, P). (7)
b) Let L(P, P) < St and L(P, P)<32. Then for each closed set F e £
P(F) < P(F5*) + 32 < PdF**)5*) + <5X + 52 < P(F5*+d>) + 6t + 82
and therefore, L(P, P)<dl+ 52. Hence, it follows that
L(P, P) < L{P, P) + L(P, P).
c) If L{P, P) = 0, then for every closed set F e £ and every a > 0
P(P) <, P(Fa) + a. (8)
Since F* IF, a 10, we find, by taking the limit in (8) as a 10, that P(F) <
P(F) and by symmetry P(F) < P(F). Hence, P(F) = P(P) for all closed sets
FES'. For each Borel set A e & and every e > 0, there is an open set G£^A
and a closed set Ft^A such that P(G£\F£) < e. Hence, it follows that every
probability measure P on a metric space (£, S, p) is completely determined
by its values on closed sets. Consequently, it follows from the condition
P(F) = P(F) for all closed sets F e S that P(A) = P(A) for all Borel sets A e S.
Theorem 1. The Levy-Prokhorov metric L{P, P) metrizes weak convergence:
L(Pn,P)^0oPn^P. (9)
Proof. (=>) Let L(P„, P) -► 0, n -► oo. Then for every specified closed set F e £
and every e > 0, we have, by (2) and equation a) of Lemma 1,
hm Pn(F) < P(F*) + e. (10)
If we then let e 10, we find that
fim Pn(F)<:P(F).
According to Theorem 1 of §1, it follows that
Pn^P. (11)
The proof of the implication (<=) will be based on a series of deep and
powerful facts that illuminate the content of the concept of weak convergence
and the method of establishing it, as well as methods of studying rates of
convergence.
§7. Metrizability of Weak Convergence
351
Thus, let P„ ^ P. This means that for every bounded continuous function
/ = /(*)•
| f(x)Pn(dx)^^f(x)P(dx). (12)
Now suppose that ^ is a class of equicontinuous functions g = g(x) (for
every e > 0 there is a 8 > 0 such that \g(y) — g(x)\ < e if p(x, y) < 3 for all
ge<0) and \g(x)\ < C for the same constant C > 0 (for all x e E and g e ^).
By Theorem 3, §8, the following condition, stronger than (12), is valid for ^:
Pn -^P=>sup f g(x)Pn(dx) - f 0(x)P(rfx) -0. (13)
ge& \Je Je I
For each A e <f and e > 0, we set (as in Theorem 1, §1)
JBW-[i-^J. (14)
It is clear that
U*) <£(*)</*(*) (15)
and
\f!(x) ~ fHy)\ < B~l\p(x, A) - p(y, A)\ < s-lp(x, y).
Therefore, we have (13) for the class &* = {fl(x\ A e <f}, i.e.,
An = sup f fj(x)Pn{dx) - [ fj(x)P(dx)
AeS\JE JE
► 0, n -► oo. (16)
Je Je I
From this and (15) we conclude that, for every closed set A e S and e > 0,
P(A*) > | /i(x) JP > J fX(x) rfP„ - A„ > P„(,4) - A„. (17)
We choose n(e) so that An < e for n > n(e). Then, by (17), for n > n(e)
P(A*)>Pn(A)~£- (18)
Hence, it follows from definitions (2) and (3) that L(Pn, P) < e as soon as
n > n(e). Consequently,
Pn-^P=»An-^0=»L(Pn,P)-0.
The theorem is now proved (up to (13)).
3. The metric \\P — P\\bl- We denote by BL the set of bounded continuous
functions f = /(x), xeE (with \\f\\m = supx|/(x)| < oo) that also satisfy the
Lipschitz condition
ll/L=»P^<».
**y P(x, y)
352
III. Convergence of Probability Measures. Central Limit Theorem
We set ||/IIbl= H/IL + II/Il- The sPace BL with the norm INIbl is a
Banach space.
We define the metric \\P — P||j£L by setting
||P - Pllfc = sup jl \fd(P - P) : I/Hju. < 1 j
/e£L (J J | J
(19)
(We can verify that \\P — P||j£L actually satisfies the conditions for a matric;
Problem 2.)
Theorem 2. 77ze metric \\P — P||j£L metrizes weak convergence:
WPu-PUl^OoP^P.
Proof. The implication (<=) follows directly from (13). To prove (=>), it is
enough to show that in the definition of weak convergence P„ -^ P as given
by (12) for every continuous bounded function f = f(x), it is enough to
restrict consideration to the class of bounded functions that satisfy a Lipschitz
condition. In other words, the implication (=>) will be proved if we establish
the following result.
Lemma 2. Weak convergence Pn^*P occurs if and only if property (12) is
satisfied for every function f — f(x) of class BL.
Proof. The proof is obvious in one direction. Let us now consider the
functions fX = f£(x) defined in (14). As was established above in the proof of
Theorem 1, for each e > 0 the class ^£ = {/i(x), A e ^} c BL. If we now
analyze the proof of the implication (I) => (II) in Theorem 1 of §1, we can
observe that it actually establishes property (12) not for all bounded
continuous functions but only for functions of class ^£, e > 0. Since &£ £ BL, e > 0,
it is evidently true that the satisfaction of (12) for functions of class BL implies
proposition II of Theorem 1, §1, which is equivalent (by the same Theorem 1,
§1) to the weak convergence P„ -^ P.
Remark. The conclusion of Theorem 2 can be derived from Theorem 1 (the
same as before) if we use the following inequalities between the metrics
L(Pt P) and \\P — P||bl» which are valid for the separable metric spaces
(£,*,p):
I|P-J1Sl<2L(P,P), (20)
9>(L(P,P))<||P-P|||L, (21)
where q>{x) = 2x2/(2 + x).
We notice that, for x > 0, we have 0 < q> < 2/3 if and only if x < 1; and
(2/3)x2 < <p(x) for 0 < x < 1; we deduce from (20) and (21) that if L(P, P) < 1
or ||P - P||j§L < 2/3, we have
|L2(P, P) < ||P - Pllfe < 2L(P, P). (22)
§8. On the Connection of Weak Convergence of Measures with Almost Sure 353
4. Problems
1. Show that in case E = R the Levy-Prokhorov metric between the probability
distributions P and P becomes the Levy distance L(F, F) between the distributions
F and F that correspond to P and P (see Problem 4 in §1).
2. Show that formula (19) defines a metric on the space BL.
3. Establish the inequalities (20), (21), and (22).
§8. On the Connection of Weak Convergence of
Measures with Almost Sure Convergence of
Random Elements ("Method of a Single
Probability Space")
1. Let us suppose that on the probability space (£2, &, P) there are given
random elements X — X(co), X„ — Xn(ay\ n > 1, taking values in the metric
space (E, <f, p); see §5, Chapter II. We denote by P and P„ the probability
distributions of X and Xni i.e., let
P(A) = ?{cd: X(co) e A}, P„(A) = P{co: X„(cd) e A), Ae <f.
Generalizing the concept of convergence in distribution of random
variables (see §10, chapter II), we introduce the folio wing definition.
Definition 1. A sequence of random elements Xni n> 1, is said to converge in
distribution, or in law (notation: X„ -> X, or Xn -> X), if Pn ^* P.
By analogy with the definitions of convergence of random variables in
probability or with probability one (§10, Chapter II), it is natural to introduce
the following definitions.
Definition 2. A sequence of random elements Xnin> 1, is said to converge in
probability to X if
P{co:p(^ft>),X(co))>e}->0, n-oo. (1)
Definition 3. A sequence of random elements X„, n > 1, is said to converge to
X with probability one (almost surely, almost everywhere) if p(Xn(co)i X(co)) %►"
0, n -> oo.
Remark 1. Both of the preceding definitions make sense, of course, provided
that p(Xn(w\ X(cd)) are, as functions of cd e Q, random variables, i.e., SF-
measurable functions. This will certainly be the case if the space (£, <f, p) is
separable (Problem 1).
354
III. Convergence of Probability Measures. Central Limit Theorem
Remark 2. In connection with Definition 2, we note that our convergence
in probability is metrized by the following metric that connects random
elements X and Y (defined on (£2, ^, P) with values in £):
dP(X, Y) = inf{e > 0: P{p(X(cd), Y(cd)) >b}< b}. (2)
Remark 3. If the definitions of convergence in probability and with
probability one are defined for random elements on the same probability space, the
definition J„-»Iof convergence in distribution is connected only with the
convergence of distributions, and consequently, we may suppose that X(cd),
X^cd), X2(co)i ... have values in the same space E, but may be defined on
"their own" probability spaces (£2, ^, P), (Ql9 &u Px), (£22, &ly P2),....
However, without loss of generality we may always suppose that they are defined
on the same probability space, taken as the direct product of the preceding
spaces and with the definitions X(co, colico2i...) = X(cd), Xl(coicolico2i...)
2. By Definition 1 and the theorem on change of variables under the Lebesgue
integral sign (Theorem 7, §6, Chapter II)
Xn%XoEnXn)^Ef(X) (3)
for every bounded continuous function f = f(x), xeE.
From (3) it is clear that, by Lebesgue's theorem on dominated convergence
(Theorem 3, §6, Chapter II), the limit X„^X immediately implies the limit
Xn -> X, which is hardly surprising if we think of the situation when X and
X„ are random variables (Theorem 2, §10, Chapter II). More unexpectedly, in
a certain sense there is a converse result, the precise formulation and
application we now turn to.
Preliminarily, we introduce a definition.
Definition 4. Random elements X = X(co') and Y = Y{co"\ defined on
probability spaces (£?, ^', P') and (Q", &", P") and with values in the same space
£, are said to be equivalent in distribution (notation: X = Y), if they have
congruent probability distributions.
Theorem 1. Let (£, <f, p) be a separable metric space.
1. Let random elements X, Xni n > 1, defined on a probability space
(Q, ^, P), and with values in £, have the property that X„ -+ X. Then we can
find a probability space (Q,*, &*, P*) and random elements X*, X*, n ^ 1,
defined on it, with values in £, such that
and
X* = X, X* = Xni n> 1.
§8. On the Connection of Weak Convergence of Measures with Almost Sure 355
2. Let P, Pn, n > 1, be probability measures on (£, <£ p). Then there is a
probability space (£2*, ^*, P*) and random elements X*, X*y n > 1, defined on
it, with values in E, such that
X*a^X*
and
P*=P, P„* = P„, »>1,
where P* and P* are the probability distributions ofX* and X*.
Before turning to the proof, we first notice that it is enough to prove only
the second conclusion, since the first follows from it if we take P and P„ to be
the distributions of X and X„. Similarly, the second conclusion follows from
the first. Second, we notice that a proof of the theorem in full generality is
technically rather complicated. For this reason, here we give a proof only of
the case E = R. This proof is rather transparent and moreover, provides a
simple, clear construction of the required objectives. (Unfortunately, this
construction does not work in the general case, even for E = R2.)
Proof of the theorem in the case E = R. Let F — F(x) and F„ = Fn(x) be
distribution functions corresponding to the measures P and P„ on (/?, @(R)).
We associate with a function F = F(x) its corresponding quantile Junction
Q = Q(u\ uniquely defined by the formula
Q(u) = inf {x: F(x) > u}, 0 < u < 1. (4)
It is easily verified that
F(x)>uoQ(u)<x. (5)
We now take Q* = (0,1), &* = ^(0,1), P* to be Lebesgue measure, and
P*(dx) = dx. We also take X*(cd*) = Q(cd*) and cd* e CI*. Then
P*{co*: X*(co*) <x} = P*{cd*: Q(cd*) <x} = P*{cd*: cd* < F(x)} = F(x),
i.e., the distribution of the random variable X*(co*) = Q(cd*) coincides
exactly with P. Similarly, the distribution of X*(cd*) = Q„(co*) coincides
with P„.
In addition, it is not difficult to show that the convergence of F„(x) to F(x)
at each point of continuity of the limit function F = F(x) (equivalent, if
E = R, to the limit P„^P; see Theorem 1 in §1) implies that the sequence of
quantiles Qn(u% n > 1, also converges to Q(u) at every point of continuity
of the limit Q - Q(u). Since the set of points of discontinuity of Q = Q(u\
u e (0,1), is at most countable, its Lebesgue measure P* is zero and therefore,
X*(cd*) = Q„(cd*)^ X*(cd*) = Q(cd*).
The theorem is established in the case of E = R.
356
III. Convergence of Probability Measures. Central Limit Theorem
This construction in Theorem 1 of a passage from given random elements
X and Xn to new elements X* and X*y defined on the same probability space,
explains the announcement in the heading of this section of the method of a
single probability space.
We now turn to a number of propositions that are established very simply
by using this method.
3. Let us assume that the random elements X and Xni n > 1, are defined, for
example, on a probability space (£2, ^, P) with values in a separable metric
space (£, <f, p), so that X„ -> X. Also let h = h(x\ x e £, be a measurable
mapping of (£, <f, p) into another separable metric space (F, <f, p'). In
probability and mathematical statistics it is often necessary to deal with the search
for conditions under which we can say of h = h(x) that the limit X„ -»■ X
implies the limit h(X„) -¾ h(X).
For example, let ^, £2,... be independent identically distributed random
variables with E^ = m, V^ = a2 > 0. Let X„ = (^ + • ■ • + £„)/n. The
central limit theorem shows that
G
Let us ask, for what functions h = h(x) can we guarantee that
ft(^(*;-m))^(^(o,D)?
(The Mann-Wald theorem, which is applicable to the present case, since it
is satisfied for continuous functions h = h(x), says that n(X — m)2/a2 -> Xi»
where xl is a random variable with a chi-squared distribution with one
degree of freedom; see Table 2 in §3, Chapter I.)
A second example. If X = X(t, co), X„ = X„(t, co\ teT, are random processes
(see §5, Chapter II) and h(X) = supter |A"(t, co)|, &(*„) = supter |X„(r, co)|, our
problem amounts to asking under what conditions on the convergence in
distribution of the processes X„-^X will there follow the convergence in
distribution of their suprema, h(X„) -> h(X).
A simple condition that guarantees the validity of the implication
is that the mapping h = h(x) is continuous. In fact, if f = f(x') is a bounded
continuous function on £', the function f(h(x)) will also be a bounded
continuous function on E. Consequently,
Xn 4 X => Ef(h(X„)) -+ Ef(h(X)).
The theorem given below shows that in fact the requirement of continuity
of the function h = h(x) can be somewhat weakened by using the limit
properties of the random element X.
§8. On the Connection of Weak Convergence of Measures with Almost Sure 357
We denote by Ah the set {xeE: h(x) is not p-continuous at x}; i.e., let Ah
be the set of points of discontinuity of the function h = Jz(x). We note that
Ah e S (problem 4).
Theorem 2.1. Let (£, <£ p) and (F, S\ p') be separable metric spaces, and
Xn -+ X. Let the mapping h = h(x% xeE, have the property that
P{cd:X(cd)eA„} = 0. (6)
Thenh(X„)%h(X).
2. Let P, P„, n > 1, be probability distributions on the separable metric
space (E, St p) and h = h(x) a measurable mapping of (E, $, p) on a separable
metric space (£', £', p'). Let
P{x:xeAh} = 0.
Then I* A P", wnere Pn"(>4) = Pn{h(x) e A}, P\A) = P{h(x) e A}, A e £'.
Proof. As in Theorem 1, it is enough to prove the validity of, for example, the
first proposition.
Let X* and X*f n > 1, be random elements constructed by the "method of
a single probability space," so that X* = X, X* = Xni n > 1, and X* a4" X*.
Let A* = {co*; p(X*f X*) + 0}, B* = {co*: X*(co*) e A„}. Then P*(A* u B*)
= 0, and for w* ¢A*u B*
h(X*(co*))-+h(X*(co*)l
which implies that Jz(X*)^ h(X*). As we noticed in subsection 1, it follows
that h(X*) 4 h(X *). But Jz(X*) = h(X„) and Jz(**) = h(X). Therefore, h(X*) 4
This completes the proof of the theorem.
4. In §6, in the proof of the implication (<=) in Theorem 1, we used (13). We
now give a proof that again relies on the "method of a single probability
space."
Let (£, <£ p) be a separable metric space, and & a class of equicontinuous
functionsg = g(x)for which also \g(x)\ < Cfor allxe£ and ge&
Theorem 3. Let P and P„, n > 1, be probability measures on (£, <f, p) /or wJzicJz
P„^P. Then
sup
| 0(x)Pn(dx)-J g(x)P(dx)\
(7)
Proof. Let (7) not occur. Then there are an a > 0 and functions gl9 g2 ...
from & such that
| 5n(x)P„(rfx)-| g(x)P(dx)
>a>0 (8)
358
III. Convergence of Probability Measures. Central Limit Theorem
for infinitely many values of n. Turning by the "method of a single probability
space" to random elements X* and X* (see Theorem 1), we transform (8) to
the form
| E*gn{X*) - E*gn(X*)\ > a > 0 (9)
for infinitely many values of n. But, by the properties of % for every e > 0
there is a 3 > 0 for which \g(y) — g(x)\ < e for all ge% if p(x, y) < 3. In
addition, \g(x)\ < C for all x e E and ge<&. Therefore,
\E*gn{X*) - E*gn{X*)\ < E*{\gn(XZ) - g„(X*)\; p(X*f X*) > 3}
+ E*{\gn(XZ) ~ 9n(X*)\; p(X*, X*) < 3}
<2CP{p(X*,X*)>3}+e.
Since X*1^X*, we have ?*{p(X*f X*) > 3} -► 0 as n -► oo. Consequently,
since e > 0 is arbitrary,
llm" \E*gn{X*) - E*gn{X*)\ = 0,
which contradicts (9).
This completes the proof of the theorem.
5. In this section the idea of the "method of a single probability space" used
in Theorem 1 will be applied to estimating upper bounds of the Levy-
Prokhorov metric L(P, P) between two probability distributions on a
separable space (£, <£ p).
Theorem 4. For each pair P, P of measures we can find a probability space
(£2*, ^*„ P*) and random elements X and X on it with values in E such that
their distributions coincide respectively with P and P and
L(P, P) < dP.(X, X) = inf{e > 0: P*(p(X, X) > s) < e}. (10)
Proof. By Theorem 1, we can actually find a probability space (£2*, ^*, P*)
and random elements X and X such that P*(X eA) = P(A) and P*(X eA) =
P(A), Ae&
Let e > 0 have the property that
P*(P(*, £)>£)<;£. (11)
Then for every AeS
P(A) = P*(X e A)=P*(X e A,X e A£) + P*(X e AX $A£)
< P*(X e Ae) + P*(p(X, X) > e) <; P(A£) + e.
Hence, by the definition of the Levy-Prokhorov metric (subsection 2, §6)
L(P, P) < e. (12)
From (11) and (12), if we take the infimum for e > 0 we obtain the required
assertion (10).
§9. The Distance in Variation between Probability Measures
359
Corollary. Let X and X be random elements defined on a probability space
(£2, ^ P) with values in E. Let Px and Px be their probability distributions.
Then
L(PXiPx)<dP(X,Xy
Remark 1. The preceding proof shows that in fact (10) is valid whenever we
can exhibit on any probability space (£2*, #"*, P*) random elements X and X
with values in E whose distributions coincide with P and P and for which
the set {w*: p(X(cd*), X(cd*)) > e} e ^*, e > 0. Hence, the property of (10)
depends in an essential way on how well, with respect to the measures P and
P, the objects (£2*, #"*, P*) and X, X are constructed. (The procedure for
constructing £2*, #"*, P* and X, X as well as the measure P*, is called
coupling (joining, linking).) We could, for example, choose P* equal to the
direct product of the measures P and P, but this choice would, as a rule, not
lead to a good estimate (10).
Remark 2. It is natural to raise the question of when there is equality in (10).
In this connection we state the following result without proof: Let P and P be
two probability measures on a separable metric space (£, <f, p); then there are
(ft*, ^*, P*) and X, X, such that
L(P, P) = dP.(X, X) = inf{e > 0: P*(p(X, X)>e)< e}.
5. Problems
1. Prove that in the case of separable metric spaces the real function p(X(co), Y(co)) is
a random variable for all random elements X(co) and Y(co>) denned on a probability
space (fi, ^, P).
2. Prove that the function dP(Xf Y) denned in (2) is a metric in the space of random
elements with values in E.
3. Establish (5).
4. Prove that the set Ah = {x e E: h(x) is not p-continuous at x} e <£
§9. The Distance in Variation between Probability
Measures. Kakutani-Hellinger Distance and
Hellinger Integrals. Application to Absolute
Continuity and Singularity of Measures
1. Let (Q, &) be a measurable space and & = {P} a family of probability
measures on it.
Definition 1. The distance in variation between measures P and P in &
(notation: ||P - P||) is the total (signed) variation of P - P, i.e.,
360
III. Convergence of Probability Measures. Central Limit Theorem
(1)
Var(P - P) = sup f (p(w)d(P - P)
|Jn
where the sup is over the class of all ^-measurable functions that satisfy the
condition that |<p(co)| < 1.
Lemma 1. The distance in variation is given by
||P-P||=2sup|P(X)-P(X)|. (2)
Ae&
Proof. Since, for all A e &,
P(A)-P(A) = P(A)-P(A),
we have
2\P(A) - P(A)\ = \P(A) - P(A)\ + \P(A) - P(A)\ < \\P - P\\,
where the last inequality follows from (1).
For the proof of the converse inequality we turn to the Hahn
decomposition (see, for example, [K9] or [HI], p. 121) of a signed measure n = P — P.
In this decomposition the measure fi is represented in the form \i = n+ — ;u_,
where the nonnegative measures /li+ and ju_ (the upper and lower variations
of fi) are of the form
JAnM JAnM
bsetofj^. Here
Var ju = Var /i+ + Var fi_ = fi+(Q) + fi_(Q).
JAnM
where M is a subset of <F. Here
Since
fi+(Q) = P(M) - P{M\ /jl.(C1) = P(M) - P(M),
we have
||P - p|| = (P(M) - p{M)) + (P(M) - P(M)) < 2 sup \P(A) - P(A)\.
Ae&
This completes the proof of the lemma.
Definition 2. A sequence of probability measures P„, n > 1, is said to be
convergent in variation to the measure P if
||P„-P||-0, n-oo. (3)
From this definition and Theorem 1, §1, Chapter III, it is easily seen that
convergence in variation of probability measures defined on a metric space
(H, ^, p) implies their weak convergence.
The proximity in variation of distributions is, perhaps, the strongest form of
closeness of probability distributions, since if two distributions are close in
§9. The Distance in Variation between Probability Measures 361
variation, then in practice, in specific situations, they can be considered
indistinguishable. In this connection, the impression may be created that the
study of distance in variation is not of much probabilistic interest. However,
for example, in Poisson's theorem (§6, Chapter I) the convergence of the
binomial to the Poisson distribution takes place in the sense of convergence
in variation to the zero distribution. (Later, in §11, we shall obtain an upper
bound for this distance.)
We also provide an example from the field of mathematical statistics,
where the necessity of determining the distance in variation between
measures P and P arises in a natural way in connection with the problem of
discrimination between the results of observations of two statistical
hypotheses H (the true distribution is P) and H (the true distribution is P) in
connection with the question of whether the measure P or P, defined on (£2, 2P), is
more plausible. If co e Q is treated as the result of an observation, by a test
(for different hypotheses H and H) we understand any ^-measurable
function <p = <p(co) with values in [0,1], the statistical meaning of which is that
<p(aj>) is "the probability with which hypothesis H is accepted if the result of
the observation is co."
We shall characterize the quality of different hypotheses H and H by the
probabilities of errors of the first and second kind:
u(q>) = Eq>(co) (= Prob (accepting H\H is true)),
P(q>) = 1(1 — q>(co)) ( = Prob (accepting H\H is true)).
In the case when hypotheses H and H are equivalent, the optimum is naturally
to consider a test q>* — <p*(co) (if there is such a test) that minimizes the sum
a(<p) + P(q>) of the errors.
We set
*r(P,J^ = inf [«(?)+ «?)]. (4)
Let Q = (P + P)/2 and z = dP/dQ, z = dP/dQ. Then
Sr(P, P) = inf [Eq> + 1(1 - ¢))]
= inf EQ[z<p + z(l - ¢))] = 1 + inf EQl<p(z - z)]„
where EQ is the expectation of the measure Q.
It is easy to see that the inf is attained by the function
<P*(cd) = I{z <z}
and, since EQ(z — z) — 0, that
Sr(P, P) = 1 - iEc|z - z\ = 1 - i\\P - P\\, (5)
where the last equation will follow from Lemma 2, below. Therefore, it is
clear from (5) that the quality of various hypotheses that characterize the
362
III. Convergence of Probability Measures. Central Limit Theorem
function Sr(Pt P) really depends on the degree of proximity of the measures
P and P in the sense of distance in variation.
Lemma 2. Let Qbea a-finite measure such that P « Q, P « Q and z = dP/dQ,
z = dP/dQ are Radon-Nikodym measures of P and P with respect to Q. Then
||P-P|| = Ec|z-z| (6)
and ifQ = (P + P)/2, we have
||P - P|| = EQ\z -z\ = 2EC|1 - z\ = 2EC|1 - z\. (7)
Proof. For all ^"-measurable functions if/ = \J/(cd) with |^(co)| < 1, we see
from the definitions of z and z that
|E^ - E^| = |Ec^(z - z)\ < Ec|^||z - z\ ^ EQ\z - z\. (8)
Therefore,
||P-P||<Ec|z-z|. (9)
However, for the function
ij/ = sgn(z - z)
we have
_ | 1, z > z,
~\— 1, z<z,
|E^-E^| = Ec|z-z|. (10)
We obtain the required equation (6) from (9) and (10). Then (7) follows
from (6) because z + z — 2 (Q-a.s.).
Corollary 1. Let P and P be two probability distributions on (P, ^(P)) with
probability densities (with respect to Lebesgue measure dx) p(x) and p(x), xeR.
Then
\\p- p\\=r \p(x)~p(x)\dx. at)
J -oo
(As the measure Q, we are to take Lebesgue measure on (P, £
Corollary 2. Let P and P be two discrete measures, P = (pl9 p2,...), P =
(p, p2,---\ concentrated on a countable set of points xltx2, Then
\\P-?\\ = t\Pi-Pil (12)
(As the measure Q, we are to take the counting measure, Le., that with
e({*j) = u = i,2,....)
2. We now turn to still another measure of the proximity of two probability
measures, from among many (as will follow later) related proximities of
measures in variation.
§9. The Distance in Variation between Probability Measures
363
Let P and P be probability measures on (Q, &) and Q, the third
probability measure, dominating P and P, i.e., with the probabilities P « Q and P « Q.
We again use the notation
_dP _ = dP
z~hq: Z~dQ'
Definition 3. The Kakutani-Hellinger distance between the measures P and P
is the nonnegative number p(P, P) such that
p2{P,P) = iEcCv^-V^]2- (13)
Since
ec[v^-75]2=£[
it is natural to write pl(P, P) symbolically in the form
p2(P,P) = if Ly/dP -yfdPl2. (15)
Jsi
If we set
H{P, P) = EQy/7z, (16)
then, by analogy with (15), we may write symbolically
H(P, P) = f VdPdP. (17)
Jn
From (13) and (16), as well as from (15) and (17), it is clear that
p2(P, P) = 1 - H(P, P). (18)
The number H (P, P) is called the Hellinger integral of the measures P and
P. It turns out to be convenient, for many purposes, to consider the Hellinger
integrals H(a; P, P) of order a e (0, 1), defined by the formula
H(a;P,P)=Eezaz1-a, (19)
or, symbolically,
H(a; P,P)=[ {dPY{d?)l-«. (20)
Jsi
It is clear that tf (1/2; P, P) = H(P, P).
For Definition 3 to be reasonable, we need to show that the number
p2(P, P) is independent of the choice of the dominating measure and that in
fact p(P, P) satisfies the requirements of the concept of "distance."
Lemma 3.1. The Hellinger integral of order a e (0, 1) (and consequently also
p{P, P)) is independent of the choice of the dominating measure Q.
(14)
364
III. Convergence of Probability Measures. Central Limit Theorem
2. The function p defined in (13) is a metric on the set of probability
measures.
Proof. 1. If the measure Q' dominates P and P, Q' also dominates Q =
(P + P)/2. Hence, it is enough to show that if Q « Q\ we have
Ec(z«z1-«) = Ec,(z')a(z')1-a,
where z' = dP/dQ' and z' = dP/dQ'.
Let us set v = dQ/dQ. Then z' — zv, T — zv, and
Eflfz-z1-) = ZQ.{vz«zl-a) = Efl.CzTe?')1-",
which establishes the first assertion.
2. If p(P, P) = 0 we have z = z (Q-a.e.) and hence, P = P. By symmetry,
we evidently have p(P, P) = p(P, P). Finally, let P, P\ and P" be three
measures, P « Q, P' « Q, and P" « Q, with z = rfP/rfQ, z' = dF/dQ, and
z" = dP"/dQ. By using the validity of the triangle inequality for the norm in
L2(H, ^ 0, we obtain
[Ec(Vz" - Vz7)2]1'2 <c [Ec(Vz" - V?)2]1/2 + [E^V? - V^)2]1/2,
i.e.,
p(P, P") <; p(P, P') + p{P\ P").
This completes the proof of the lemma.
By Definition (19) and Fubini's theorem (§6, Chapter II), it follows
immediately that in the case when the measures P and P are direct products of
measures, P = Pt x • ■ • x Pn,P = Px x • • • x Pn (see subsection 9, §6, Chapter
II), the Hellinger integral between the measures P and P is equal to the
product of the corresponding Hellinger integrals:
tf(a;P,P)=ntf(a;P£,P£).
The following theorem shows the connection between distance in
variation and Kakutani-Hellinger distance (or, equivalently, the Hellinger
integral). In particular, it shows that these distances define the same topology in
the space of probability measures on (H, #").
Theorem 1. We have the following inequalities:
2[1 - H(P, P)] <; ||P - P|| <; V8[l- fl(P,P)], (21)
||P - P|| <; 2>/l- H2(P,P). (22)
In particular,
2p2(P, P) <; ||P - P|| <; yfepiP, P). (23)
Proof. Since H(P, P) ^ 1 and 1 - x2 <, 2(1 - x) for 0 ^ x <, 1, the right-
§9. The Distance in Variation between Probability Measures
365
hand inequality in (21) follows from (22), the proof of which is provided by
the following chain of inequalities (where Q = (1/2)(P + P)):
il|P - P\\ = Ec|l - z\ < VEC|1 - z\2 = Vl - Ecz(2 - z)
= 71 - EQzz = Vl - Ec(7^)2 < Vl - (Ecx/^)2
= \/l - H2(P, P).
Finally, the first inequality in (21) follows from the fact that by the
inequality
\\_J~z - ^/2 - z\2 < \z - 1|, z e [0, 2],
and we have (again, Q = (1/2)(P + P))
1 - H(P, P) = p2(P, P) = iEc[v^ - 72^¾2 < iEc|z - 1| = i||P - P||.
Remark. It can be shown in a similar way that, for every a e (0,1),
2[1 - tf (a; P, P)] < ||P - P|| < x/ca(l-H(a;P,P)), (24)
where ca is a constant.
Corollary 1. Let P and P", n > 1, fee probability measures on (£2, ^"). TJzen (as
n-»>oo)
||P" - P|| -► 0*>H(P", P) -► 1 *>p(P", P) -► 0,
||P" - P|| - 2*>H(P", P) - 0*>p(P", P) - 1.
Corollary 2. Since by (5)
<frOP,P)=l-illP-PH,
we toe, by (21) and (22),
i#2(P, P) < 1 - >/l-H2(P,P) < ^r(P, P) < H(P, P). (25)
In particular, let
P" = Px-xP, Pn = Px-xP
be direct products of measures. Then, since H(P", P") = [H(P, P)]" = e~A"
with A = -In H(P, P) ^ p2(P, P), we obtain from (25) the inequalities
\e~2Xn < <fr(P", P") < e~Xn < e-np2iP'p\ (26)
In connection with the problem, considered above, of distinguishing two
statistical hypotheses from these inequalities, we have the following result.
Let £x, £2> •■■ be independent identically distributed random elements,
that have either the probability distribution P (Hypothesis H) or P (Hypothe-
366
III. Convergence of Probability Measures. Central Limit Theorem
sis H), with P¥>P, ^nd therefore, p2(P, P) > 0. Therefore, when n -> oo,
the function Sr{PnLPn\ which describes the quality of optimality of the
hypotheses H and H as observations of £1} £2,..., decreases exponentially to
zero.
4. In using Hellinger integrals of order a (described above), it will be
convenient to introduce the notions of absolute continuity and singularity of
probability measures.
Let P and P be two probability measures defined on a measurable space
(£2, &?). We say that P is absolutely continuous with respect to P (notation:
P«P) if P(A) = 0 whenever P(A) = 0 for A e &. If P « P and P « P, we say
that P and P are equivalent (P ~ P). The measures P and P are called^sin^u/ar
or orthogonal (P _L P), if there is an A e & for which P(A) = 1 and P(A) = 1
(i.e., P and P "sit" on different sets).
Let Q be a probability measure, with P « Q, P « Q, z = dP/dg, £ = dP/dQ.
Theorem 2. The following conditions are equivalent:
(a) P«P,
(b) P(z > 0) = 1,
(c) H(a;P,P)-l,a|0.
Theorem 3. The following conditions are equivalent:
(a) P_LP,
(b) P(z > 0) = 0,
(c) H(a;P,P)-0,oUO,
(d) if (a; P, P) = 0 /or a// a e (0, 1),
(e) H(a; P, P) = 0 for some a e (0, 1).
The proofs of these theorems will be given simultaneously. By the
definitions ofzandz,
P{z = 0) = EQ[_zI(z = 0)] = 0, (27)
P(A n{z> 0}) = EQlzI(A n{z> 0})]
= Ee zZ-I(An{z > 0}) = E -I(An{z > 0}) 1
=e[H
(28)
Consequently, we have the Lebesgue decomposition
P(X) = E -/(^)+P(Xn{z = 0}), AeSF, (29)
in which Z = zjz is called the Lebesgue derivative of P with respect to P and
§9. The Distance in Variation between Probability Measures
367
denoted by dP/dP (compare the remark on the Radon-Nikodym theorem,
§6, Chapter III).
Hence, we immediately obtain the equivalence of (a) and (b) in both
theorems.
Moreover, since
zaz1-«-f/(z>0), a 10,
and for a e (0,1)
0 < zazx~a <oz + (\-ol)z<z + z
with EQ(z + z) — 2, we have, by Lebesgue's dominated convergence theorem,
lim H{a\ P, P) = EQzI(z > 0) = P(z > 0)
and therefore, (b)o(c) in both theorems.
Finally, let us show that in the second theorem (c)o(d)o(e). For this,
we need only note that H{a\P, P) = E(z/zfl(z > 0) and P(z > 0) = 1. Hence,
for each a e (0, 1) we have P(z > 0) = 0oH(u; P, P) = 0, from which there
follows the implication (c)o(d)o(e).
Example 1. Let P = Pt x P2 x ..., P = Pt x P2 ..., where Pk and Pk are
Gaussian measures on (R, @(R)) with densities
A(x) = JU(*~flk>2/2' AW = 4=e~(*~ak>2/2-
Since
H(a;P,P)=f[H(a;Pk,Pk),
where a simple calculation shows that
H(a;Pk,Pk) = p pJCxJpJ-'CxJdx^e-w1-^2^-^
J-co
we have
H(<x; P, P) = e-^1-0')/2^*"!^-5*)2.
From Theorems 2 and 3, we find that
CO
P « PoP « Pop ~ Po £ (ak - 5k)2 < oo,
k=i
PlPof(ak-5k)2 = ^.
fc=l
Example 2. Again letP = P^^x ...,P = ^xPjX ..., wherePk and Pk
are Poisson distributions with respective parameters Ak > 0 and Ik > 0. Then
it is easily shown that
368
III. Convergence of Probability Measures. Central Limit Theorem
PkkPoPkkPoP-PoY {y/Xk-y/lk)2<ooi
k=1 (30)
Pipof (7^-7¾)2 -oo.
fc=l
5. Problems
1. In the notation of Lemma 2, set
P a P = EQ(z a z),
where z a z = min(z, z). Show that
||P-P||=2(1-PaP)
(and consequently, <£V(P, P) = P a P).
2. Let P, P„, n ^ 1, be probability measures on (R, 38(R)) with densities (with respect
to Lebesgue measure) p(x\ p„(x), n > 1. Let p„(x)->p(x) for almost all x (with
respect to Lebesgue measure). Show that then
>-Pn\\=r
J—c
\p(x) ~ Pn(x)\ dx-+0, n -» oo
(compare Problem 17 in §6, Chapter II).
3. Let P and P be two probability measures. We define Kullback information K(P, P)
as information by using P against P, by the equation
"-H
K(P,P) = lElnidP,d?> lfP<<jF'
otherwise.
Show that
K(P, P)>-2 ln(l - p2(P, P)) > 2p2(P, P).
4. Establish formulas (11) and (12).
5. Prove inequalities (24).
6. Let P, P, and Q be probability measures on (R,^(R)); P*Q and P*Q, their
convolutions (see subsection 4, §8, Chapter II). Then
iip*e-p*en<np-pii.
7. Prove (30).
§10. Contiguity and Entire Asymptotic
Separation of Probability Measures
1. These concepts play a fundamental role in the asymptotic theory of
mathematical statistics, being natural extensions of the concepts of absolute conti-
§10. Contiguity and Entire Asymptotic Separation of Probability Measures 369
nuity and singularity of two measures in the case of sequences of pairs of
measures.
Let us begin with definitions.
Let (Q", ^")BS1 be a sequence of measurable spaces] let (P")„^i and (P")n^i
be sequences of probability measures with P" and P" defined on (Q", &n\
n>\.
Definition 1. We say that a sequence (P") of measures is contiguous to the
sequence (P") (notation: (P")o (P")) if, for all A" e &" such that P"(A") -► 0
as n -*■ oo, we have P"(A") -► 0, n -► oo.
Definition 2. We say that sequences (P") and (P") of measures are entirely
(asymptotically) separated (or for short: (P") A(P")), if there is a subsequence
nk t oo, fc -► oo, and sets X"fc e «^"fc such that
P"*^"*)-*! and P%4"fc) -► 0, fc->oo.
We notice immediately that entire separation is a symmetric concept: (P") A
(P")o(Pn) A(P"). Contiguity does not have this property. If (P")o (P") and
(P")<i (P"), we write (P")o o(P") and say that the sequences (P") and (P") of
measures are mutually contiguous.
We notice that in the case when (Q", J^") = (H, J^), F = P, pn = p for all
n > 1, we have
(Pn)<i(Pn)oP«P, (1)
(P")<jo(P")*>P~P, (2)
(Pn)A(Pn)*>P_LP. (3)
These properties and the definitions given above explain why contiguity
and entire asymptotic separation are often thought of as "asymptotic
absolute continuity" and "asymptotic singularity" for sequences (P") and
(P").
2. Theorems 1 and 2 presented below are natural extensions of Theorems 2
and 3 of §8 to sequences of measures.
Let (Q", ^"")„£i be a sequence of measurable spaces; Q", a probability
measure on (Q", 2rn)\ and <!;", a random variable (generally speaking,
extended; see §4, Chapter II) on (Q", &n\ n > 1.
Definition 3. A sequence (£") of random variables is tight with respect to a
sequence of measures (Q") (notation: (£"16") is tight) if
Urn hm Q-flf "| > JV) = 0. (4)
JVfco n
(Compare the corresponding definition of tightness of a family of probability
measures in §2.)
370
III. Convergence of Probability Measures. Central Limit Theorem
We shall always set
pn + pn dpn dpn
O" = , z" = , z" = .
^ 2 ' dQni dQn
We shall also use the notation
Z" = zn/zn (5)
for the Lebesgue derivative of P" with respect to P" (see (29) in §9), taking
2/0 = oo. We note that if P" « P", Z" is precisely one of the versions
of the density dP"/dP" of the measure P" with respect to P" (see §6,
Chapter II).
For later use it is convenient to note that since
Kz"4)=e-Hz"4))4 (6)
and Z" < 2/z", we have
((l/z")|P") tight, (Z-|P-) tight. (7)
Theorem 1. The following statements are equivalent:
(a) (P")<i(P"),
(b) (z-"|P") is tight,
(b') (Z"|Pn) is tight,
(c) limai0Um„/f(a; P", P") = 1.
Theorem 2. The following statements are equivalent:
(a) (P")A(P"),
(b) limn P"(z" > e) = 0 for every e > 0,
(b') iim„ ^(Z" < N) = 0 /or every N > 0,
(c) lim^oHm^ajP", P") =? 0,
(d) Hm„tf (a; P", P") = 0/or a// a e (0,1),
(e) Hm„ tf(a; P", P") = 0 for some a e (0,1).
Proof of Theorem 1.
(a) => (b). If (b) is not satisfied, there are an e > 0 and a sequence n f oo
such that P"ft(z"ft < l/nk) > e. But by (6), Pn«(zn« < l/nk) £ 1/¾. fc - oo, which
contradicts the assumption that (P")<i (P").
(b)*>(b')- We have only to note that Z" = 2z~k - 1.
(b) => (a). Let A" e &n and Pn{An) -► 0, n -► oo. We have
P"(X") < P"(z" < e) + ECn(zn/(y4n n {zn > e}))
< P"(z" < e) + - EQn(znI(An)) = P"{zn < e) + -P"(y4").
§10. Contiguity and Entire Asymptotic Separation of Probability Measures
371
Therefore,
fim P"(A") < fim P"(z" < e), e > 0.
n n
Proposition (b) is equivalent to saying that limc40fimn P"(z" < e) = 0.
Therefore, PV) - 0, i.e., (b) => (a).
(b)=>(c). Let £>0. Then
H(a;P",P")=Ec„[(znr(5n)1-a]
>ECn^0/(z">e)/(z">O)z"J
= E*[(0/(*" > £)] > (f)"*V > e), (8)
since z" + z" = 2. Therefore, for e > 0,
ljm ljm H(a; P", P") > lim (t) ljm P"(z" > e) = lim P"(z" > e). (9)
aiO n aTo \2/ n n
By (b), hm£ioljmn P"(z" > e) = 1. Hence, (c) follows from (9) and the fact that
H(a;P",P")<l.
(c)=>(b). Let 3 e (0,1). Then
Jf(a; P", P") = Eff.Kz-HS")1"'/^ < e)]
+ £Qnl{znY{zn)l-«I{zn > e, z" < 3)-]
+ Ecn[(z")a(n1"a/(^>£,^>5)]
^ 2ea + 251-" + EeJ 5"f pj/(z" > £, z" > <5)
< 2£a + 251-" + (lfp"(zn > e). (10)
Consequently,
lim ljm P"{zn > e)-> ( ff ljm tf (a; P", P") - ^3
eXo n \2/ n 2
for all a e (0,1), 3 e (0, 1). If we first let a J, 0, use (c), and then let 3J,0, we
obtain
lim lim P"(z" ^ e) > 1, -
e4-0 n
from which (b) follows.
Proof of Theorem 2.
(a) => (b). Let (P") A (P"), nk | oo, and let X"fc e ^"fc have the property that
p»K(A"k) -► 1 and P"k(A"k) -► 0. Then, since z" + z" = 2, we have
372
III. Convergence of Probability Measures. Central Limit Theorem
P"k(z"k > e) < P"k{A"k) + EQnAznk-^I(Ank)I(znk > e)\
= P"k(A»k) + Epnit^I(A"k)I(z"k 2> e)|
< P""(Ank) + -P"k(A"k).
e
Consequently, P"k(z"k > e) -► 0 and therefore, (b) is satisfied,
(b) => (a). If (b) is satisfied, there is a sequence nk f oo such that
H^i)^0- k-
Hence, having observed (see (6)) that Pnk(z"k ^ 1/k) ^1- (1/fc), we obtain (a).
(b) => (b'). We have only to observe that Z" = (2/zn) - 1.
(b)=>(d). By (10) and (b),
ljm H(u; P", P") < 2ea + 25l~a
for arbitrary s and 8 on the interval (0,1). Therefore, (d) is satisfied,
(d) => (c) and (d) => (e) are evident.
Finally, from (8) we have
ljm P"(z" > e) < (-) ljm H{a\ P", P").
Therefore, (c) => (b) and (e) => (b), since (2/e)a -► 1, a J, 0.
3. We now consider a special case corresponding to the method of
independent observations, where the calculation of the integrals H(a;P",P") and
application of Theorems 1 and 2 do not present much difficulty.
Let us suppose that the measures P" and P" are direct products of
measures:
P" = P1 x ••• xP„, PH = PlX'~xP„, n^l,
where Pfc and Pfc are given on (Qki ^), k ^ 1.
Since in this case
H(a; P", P") = ft #(«; A> ft) = eIn*=lln[1 -(1 -*<«:*fcH
fc=l
we obtain the following result from Theorems 1 and 2:
(n<a (P")olim 05 t [! " »(«: A, ft)] = 0, (11)
aiO n fc=l
(P")A(P-)oG5 £ [1 - H(a; P*, ft)] = oo. (12)
§11. Rapidity of Convergence in the Central Limit Theorem
373
Example. Let (Qki &k) = (P, @(R)\ ak e [0,1),
Pk{dx) = I[0t ^(x) dx, Pk{dx) = /[flft, u(x) dx.
i — ak
Since here H(a;Pk,Pk) = (1 - ak)a, ae(0,l), from (11) and the fact that
H{ol\ Pk, Pk) = H(l - a; Pk, Pk), we obtain
(P")<i (P^fim na„ < oo, i.e., a„ = O (-),
(Pn) o (P") o fim na„ = 0, i.e., an = o f -
(P")A(Pn)*>firnnan = oo.
4. Problems
1. Let P" = P? x •■• x P£, P" = ft x ••• x ft, w ^ 1, where P£ and ft are Gaussian
measures with parameters (ak, 1) and (ak, 1). Find conditions on (a^) and (££) under
which (P")<3 (P") and (Pn) A(Pn>
2 Let P" = PJ x • • • x P„n and P" = ft x • • • x ft, where P% and ft are probability
measures on (P, ^(P)) for which Pkn(dx) = JIOli](x) dx and ft(dx) = I[flntl+fln](dx),
0 < a„ < 1. Show that JJ(a; P£, ft) = 1 - a„ and
(P")<j (P")o(P»)<3 (P")ofirH na„ = 0, (P") A(P")oBm na„ = oo.
§11. Rapidity of Convergence in the
Central Limit Theorem
1. Let fnl,..., £„„ be a sequence of independent random variables, S„ =
&,! + ■•■ + ¢-. ^(x) = P(S„ < x). If S„ -> ^(0,1), then F„(x) -><D(x) for
every x e P. Since <I>(x) is continuous, the convergence here is actually
uniform (Problem 5 in §1):
sup | F„(x) - <X>(x) | -► 0, n -> oo. (1)
X
It is natural to ask how rapid the convergence in (1) is. We shall establish
a result for the case when
sn = ^ + -r+i", „*1,
Gy/n
where £l9 £2, • • • is a sequence of independent identically distributed random
variables with Efk = 0, Vfk = cr2 and E|f x |3 < oo.
374
III. Convergence of Probability Measures. Central Limit Theorem
Theorem (Berry and Esseen). We have the bound
CE\tx\
sup|F„(x) -<X<x)| <
(2)
where C is an absolute constant ((27t)~1/2 < C < 0.8).
Proof. For simplicity, let a2 = 1 and p3 = E|^|3. By Esseen's inequality
(Subsection 10, §12, Chapter II)
dt +
24 1
nT ^fhz
(3)
(4)
2 r fit) - <p{t)
sup|Fn(x)-0>(x)|<- \JnK) yu
x n J0 \ t
where <p(t) = e~t2'2 and
/,(0 = WlyftfW
with/(0 = Ee'^.
In (3) we may take T arbitrarily. Let us choose
T = 7^/(503).
We are going to show that for this T,
1/,(0 - 9(01 ^ \h> \t?e~*\ \t\ < T.
The required estimate (2), with C an absolute constant, will follow
immediately from (3) by means of (4). (A more detailed analysis shows that
C < 0.8.)
We now turn to the proof of (4).
By formula (18) from §2, Chapter II (n = 3, Ef x = 0, Eff = 1, E|^|3 < oo)
we obtain
t2 (if\3
/(0 = Ee«« = 1 - - + ^-[Ef?(cos Orfi + i sin OrfM, (5)
where |0X| < 1, \02\ < 1. Consequently,
If \t\ < T = y/n/5p3i we find, by using the inequality )¾ > a3 = 1 (see (28),
§6, Chapter II), that
Consequently, for \t\ < Tit is possible to have the representation
+ 3n3'2 ~25'
[^1--^-
(6)
§11. Rapidity of Convergence in the Central Limit Theorem
375
where In z means the principal value of the logarithm of the complex number
z (In z — ln|z| + i arg z, — n < arg z < n).
Since fi3 < oo, we obtain from Taylor's theorem with the Lagrange
remainder (compare (35) in §12, Chapter II)
since the semi-invariants are s^* — E^ = 0, s^2) = cr2 = 1.
In addition,
0n /(s))w = /W(5)-/2(5) - 3/^(5)/-(5)/(5) + 2(/-(5))3
_ EEKOVg"]/2^) ~ 3E[(tf1)2gig-nE[(»g1)g,c^/](s) + 2E[(tf1)g|C"]3
/3(s)
From this, taking into account that 1/(^/^)1 > 24/25 for \t\ < T and
|/(s)| < 1, we obtain
|an/r(^)|^ + 3^ + 2/j?^fo (8)
(& = Elf il*> k = 1,2, 3; ft < &1/2 < &1/3; see (28), §6, Chapter II).
From (6)-(8), using the inequality \ez - 1| < |z|e|r|, we find for \t\ < T =
V^/5^3 that
mi-*-
_ |enln/(f/v^)_e-»2/2|
'GKH^GHK^-
This completes the proof of the theorem.
Remark. We observe that unless we make some supplementary hypothesis
about the behavior of the random variables that are added, (2) cannot be
improved. In fact, let £lt £2, ... be independent identically distributed
Bernoulli random variables with
P(& = 1)=P(& =-1) = 1-
It is evident by symmetry that
/ 2n \
1,
2P(|&<°) + P(I^ = 0) =
376
III. Convergence of Probability Measures. Central Limit Theorem
and hence, by Stirling's formula ((6), §2, chap. I)
| ^1,4 <0)"*|-^l^*0)
" l^n V(27r).(2n)"
It follows, in particular, that the constant C in (2) cannot be less than (2n)~1/2.
2. Problems
1. Prove (8).
2. Let f!, f 2,... be independent identically distributed random variables with E£k = 0,
V^k = ff2andE|^|3 < oo.
It is known that the following nonuniform inequality holds: for all x e R,
l^(x)-0(x)|^CE1^13 '
ay£ (1 + 1*I)3"
Prove this, at least for Bernoulli random variables.
§12. Rapidity of Convergence in Poisson's Theorem
1. Let f,, f2, • • •, f„ be independent Bernoulli random variables that take the
values 1 and 0 with probabilities
P(^ = l) = Pk>P(^ = 0) = ^k(=l-pk), l<fc<n.
We set S = f! + • • • + f„; let B = (B0, Bx,..., Bn) be the binomial
distribution of probabilities of the sum S, where Bk = P(5 = k). Also let II =
(II0, nl9...) be the Poisson distribution with parameter X, where
e~x2k
We noticed in §6, Chapter I, that if
Pi =••• = ?„, A = np, (1)
there is the following estimate (Prokhorov) for the distance in variation
between the measures B and II (Bn+1 = B„+2 — • • • = 0):
\\B - n|| = £ \Bk - Ilk| < d(A)p = Q(A)--, (2)
fc=0 w
where
C1(A) = 2min(2,A). (3)
§12. Rapidity of Convergence in Poisson's Theorem
377
For the case when pk are not necessarily equal, but satisfy £k=1 Pk — K
LeCam showed that
||B - n|| = £ \Bk - nk| < C2(A) max pk, (4)
where
C2{k) = 2 min(9, X). (5)
A theorem to be presented below will imply the estimate
||J3 - n|| < C3(A) max pk, (6)
l<.k<.n
in which
C3(A) = 2A. (7)
Although C2(A) < C3(A) for X > 9, i.e., (6) is worse than (4), we nevertheless
have preferred to give a proof of (6), since this proof is essentially elementary,
whereas an emphasis on obtaining a "good" constant C2(A) in (4) greatly
complicates the proof.
2. Theorem. LetA = £fc=i pk. Then
||B-n||=£|Bfc-nk|<2£pk2. (8)
fc=0 fc=l
Proof. We use the fact that each of the distributions B and n is a convolution
of distributions:
B = B(pl)*B(p2)*-*B(pn),
n = n(p1)*n(p2)*---*n(p„),
understood as a convolution of the corresponding distribution functions (see
subsection 4, §8, Chapter II), where B{pk) = (1 — pk, pk) is a Bernoulli
distribution on the points 0 and 1, and n(pk) is a Poisson distribution supported
on the points 0,1,... with the parameters pk.
It is easy to show that the difference B — Ti can be represented in the form
B-n = K1+- • + *„, (10)
where
Rk = (B(Pk)-Tl(pk))*Fk (11)
with
F1 = n(p2)*---*n(pj,
/rk = JB(Pi)*---*B(pk-i)*n(pk+1)*---*n(p„), 2<k<n-l,
F„ = B(p1)*~*B(p„„1)TI,
378
III. Convergence of Probability Measures. Central Limit Theorem
By problem 6 in §9, we have ||Rk|| < \\B(pk) - n(pk)||. Consequently, we
see immediately from (10) that
|£-n|£il*fo)-ri(n)|. (12)
fc=l
By formula (12) in §9, we see that there is no difficulty in calculating the
variation ||J3(pk)-II(pk)||:
\\B(pk)-Tl(pk)\\
= 1(1 - Pk) ~ e-p*\ + \pk - pke-*-\ + £ ^^
= 1(1 " Pk) ~ e~Pk\ + \pk - pke~Pk\ + 1 - e-'* - pke~^
^2pk(l-e-^)<2pl
From this, together with (12), we obtain the required inequality (8).
This completes the proof of the theorem.
Corollary. Since £k=1 pk < X max1^k^„pk, we obtain (6).
3. Problems
1. Show that, if Ak = -ln(l - pk\
\\B(pk) - II(Ak)|| = 2(1 - e->* - he-**) < Ak3
and consequently, ||JS - II || < YUi K.-
2. Establish the representations (9) and (10).
CHAPTER IV
Sequences and Sums of Independent
Random Variables
§1. Zero-or-One Laws
1. The series Y*=i (Vn) diverges and the series Y*=i (~l)"(Vn) converges.
We ask the following question. What can we say about the convergence or
divergence of a series ££. t (£Jn\ where ^, f2, • • • is a sequence of independent
identically distributed Bernoulli random variables with P(f x = +1) =
P(^i = — 1) = \1 In other words, what can be said about the convergence
of a series whose general term is + 1/n, where the signs are chosen in a random
manner, according to the sequence £l3 f2, • • • ?
Let
Ai = \cd: £ — converges>
be the set of sample points for which J^l x (ZJri) converges (to a finite
number) and consider the probability PiAj) of this set. It is far from clear,
to begin with, what values this probability might have. However, it is a
remarkable fact that we are able to say that the probability can have only two
values, 0 or 1. This is a corollary of Kolmogorov's "zero-one law," whose
statement and proof form the main content of the present section.
2. Let (£2, ^ P) be a probability space, and let ^, f2,... be a sequence of
random variables. Let &™ = cr(f„, fn+lt...) be the cr-algebra generated by
fn> £».+ii---t and write
n=l
380
IV. Sequences and Sums of Independent Random Variables
Since an intersection of cr-algebras is again a cr-algebra, SC is a cr-algebra. It is
called a tail algebra (or terminal or asymptotic algebra), because every
event A e SC is independent of the values of £lt..., £„ for every finite number n,
and is determined, so to speak, only by the behavior of the infinitely remote
values of ^, £2,----
Since, for every k > 1,
^, {|f converges} = {|f,
we have Ax e f]k &™ = SC. In the same way, if £lf £2,... is any sequence,
1 converges > e SF™,
A2
-{?
£„ converges > e SC.
The following events are also tail events:
{£„ g /„ for infinitely many n},
where /„ e
n> 1;
/15 =
A6 =
An =
A« =
}■
im £„ < 00
IIm^ + - + ^<oo
^-^1+--- + ^ I
im — < c }
— converges >;
}>
'7-
In log n J
On the other hand,
Bx = {£„ = 0foralln> 1},
B:
= jlimtfi
+ ••• + £„) exists and is less than c >
are examples of events that do not belong to SC.
Let us now suppose that our random variables are independent. Then by
the Borel-Cantelli lemma it follows that
P(^3) = OoX%6/„)<00,
Therefore the probability of A3 can take only the values 0 or 1 according to
the convergence or divergence of J] P(£n e /„). This is Borel's zero-one law.
§1. Zero-or-One Laws
381
Theorem 1 (Kolmogorov's Zero-One Law). Let £l3 f2,... be a sequence of
independent random variables and let AeSC. The P(A) can only have one of the
values zero or one.
Proof. The idea of the proof is to show that every tail event A is independent
of itself and therefore P(AnA)=P(A)'P(A% i.e., P(A) = P2 (/1), so that
P(/i) = 0orl.
If A ear then A e 3Ff = <t{^,^2, ...} = <r(U„ &\\ where ^ =
o-{fi,...,f,,}, and we can find (Problem 8, §3, Chapter II) sets A„e&\,
m > 1, such that P(A A A„) -► 0, n -► oo. Hence
P(A„)^P(A), P(A„ n A) -+ P(A). (1)
But if A e #\ the events An and A are independent for every n > 1. Hence it
follows from (1) that P(A) = P2(A) and therefore P(A) = 0 or 1.
This completes the proof of the theorem.
Corollary. Let nbea random variable that is measurable with respect to the tail
a-algebra SC, i.e., {n e B} e SC, B e 08(R). Then r\ is degenerate, i.e., there is a
constant c such that P(r] = c)~ 1.
3. Theorem 2 below provides an example of a nontrivial application of
Kolmogorov's zero-one law.
Let ^, £2,... be a sequence of independent Bernoulli random variables
with P(4 = 1) = p, P(4 = -1) = q, p + q=l n > 1, and let S„ =
£i + • • • + £„. It seems intuitively clear that in the symmetric case (p = £)
a "typical" path of the random walk S„, n > 1, will cross zero infinitely often,
whereas when p ^ \ it will go off to infinity. Let us give a precise formulation.
Theorem 2. (a) Ifp = \ then P(S„ = 0 i.o.) = 1.
(b) Ifp ^ £, then P(S„ = 0 i.o.) = 0.
Proof. We first observe that the event B = (S„ = 0 i.o.) is not a tail event, i.e.,
B¢£ = P| «^°, ^3° = 0-(6,, £,+ !,...}. Consequently it is, in principle,
not clear that B should have only the values 0 or 1.
Statement (b) is easily proved by applying (the first part of) the Borel-
Cantelli lemma. In fact, if B2n = {S2„ = 0}, then by Stirling's formula
P(B2n) = C"2np"q" r * Pq)
and therefore £ P(B2n) < oo. Consequently P(S„ = 0 i.o.) = 0.
To prove (a), it is enough to prove that the event
{lim7=n
—^ — oo, hm—-= ■■
yjn y/n
has probability 1 (since A Q B).
382
IV. Sequences and Sums of Independent Random Variables
Let
Ac = <fim—J= > c> rWlim—^= < —cU=A'enA'c).
Then Ac J, A, c -> oo, and all the events A, Ac, A'c, A'c are tail events. Let us
show that P(/i;) = P{A"C) = 1 for each c> 0. Since A'c e X and A"c e 9C, it is
sufficient to show only that P(A'C) > 0, P{A'^) > 0. But by Problem 5
p(uffl^<-c) = p(iffi^>c)snsp(^>c)>o,
where the last inequality follows from the De Moivre-Laplace theorem.
Thus P(AC) = 1 for all c> 0 and therefore P(A) = lim^ P(AC) = 1.
This completes the proof of the theorem.
4. Let us observe again that B = {Sn = 0 i.o.} is not a tail event. Nevertheless,
it follows from Theorem 2 that, for a Bernoulli scheme, the probability of this
event, just as for tail events, takes only the values 0 and 1. This phenomenon
is not accidental: it is a corollary of the Hewitt-Savage zero-one law, which
for independent identically distributed random variables extends the result
of Theorem 1 to the class of "symmetric" events (which includes the class of
tail events).
Let us give the essential definitions. A one-to-one mapping n =
(nl3 n2,...) of the set (1, 2,...) on itself is said to be a finite permutation if
n„ = n for every n with a finite number of exceptions.
If £ = £lt £2,... is a sequence of random variables, n(^) denotes the
sequence (fKl, fKa,...). If A is the event {£eJ3}, Be@(Rm\ then n(A)
denotes the event {n(Q eB},Be @(Rm).
. We call an event A = {<!; e B}, B e ^(/?°°), symmetric if n(A) coincides with
A for every finite permutation n.
An example of a symmetric event is A = {Sn = 0 i.o.}, where Sn =
£! + ■•• + £„. Moreover, we may suppose (Problem 4) that every event in
the tail <r-algebra &(S) = f] ^°(S) = <x{co: S„, Sn+1,...} generated by
Sx = £lt S2 = Zi + £2, • • • is symmetric.
Theorem 3 (Hewitt-Savage Zero-One Law). Let ^, £2,.. .be a sequence of
independent identically distributed random variables, and
A = {a>:(Zl.Z29...)eB}
a symmetric event. Then P(A) = 0 or 1.
Proof. Let A = {<!; e B} be a symmetric event. Choose sets J3„ e @(R") such
that, for ^in = {co: (^,..., 4) eB„},
P(^i A An) -► 0, n -»- 00. (2)
§1. Zero-or-One Laws
383
Since the random variables ^, £2,.-- are independent and identically
distributed, the probability distributions P^B) = P(£ e B) and PnwAG(B) =
P(7tn(Q e J3) coincide. Therefore
P(A A An) = P£B A Bn) = F^JJ A B„). (3)
Since ^4 is symmetric, we have
A = {ZeB} = *JLA) = MQeB).
Therefore
Pnn(^B A 2*„) = PWO e B) A (W|I(© e B„)}
= P{(£ eB)A MQ e B„)} = P{X A 7rn04n)}. (4)
Hence, by (3) and (4),
P(AAAn) = P(AAn„(An)). (5)
It then follows from (2) that
P{AA{Anrnin{An)))-^^ n^oo. (6)
Hence, by (2), (5), and (6), we obtain
P(A„) -* P(A), P(nn(A)) -* P(A\
P(A„ n n„(An)) - PW (7)
Moreover, since t>l and ^2 are independent,
P{An n 7r„K)) = P{(£lf..., 4) e B„, (4+i, • - •, t2n) e £„}
= P«£lf..., « e *„} • P{(£„+!,--., £2n) e Bn}
= P(^)P(7T„K)),
whence by (7)
P(A) = P2(A)
and therefore P(A) = 0 or 1.
This completes the proof of the theorem.
5. Problems
1. Prove the corollary to Theorem 1.
2. Show that if (<^„) is a sequence of independent random variables, the random variables
fim f „ and iim £„ are degenerate.
3. Let (&,) be a sequence of independent random variables, S„ = £t -\ + ^,,, and let
the constants b„ satisfy 0 < bn | oo. Show that the random variables fim(SJb„) and
\im(SJb„) are degenerate.
4. Let S„ = f, + • • • + £,, n > 1, and <T(S) = f] 3F"{S\ &"{$) = o{(o: 5„, Sn+ „...}.
Show that every event in SC{S) is symmetric.
5. Let (<!;„) be a sequence of random variables. Show that {Iim &, > c} 2 Iim{^n > c}
for each c > 0.
384
IV. Sequences and Sums of Independent Random Variables
§2. Convergence of Series
1. Let us suppose that £,, £2,... is a sequence of independent random
variables, S„ = £x + • • • + £„, and let A be the set of sample points co for
which £ £„(cd) converges to a finite limit. It follows from Kolmogorov's
zero-one law that P(A) = 0 or 1, i.e., the series £<!;„ converges or diverges
with probability 1. The object of the present section is to give criteria that will
determine whether a sum of independent random variables converges or
diverges.
Theorem 1 (Kolmogorov and Khinchin).
(a) Let Ef„ = 0, n > 1. Then if
ZEg<W, (1)
the series ]T £„ converges with probability 1.
(b) If the random variables £„, n > \,are uniformly bounded (i.e., P(|£„| < c)
= 1, c < oo), the converse is true: the convergence of £ £„ with probability
1 implies (1).
The proof depends on
Kolmogorov's Inequality
(a) Let cx, £2,..., £„ be independent random variables with Ef £ = 0, E£? < oo,
i < n. Then for every e > 0
pjmax |SJ>e}<^. (2)
UrSfcrSn J e
(b) If also P(|£.| ^ c) = 1, i < n, r/ien
p{max|Sfc|>4>l-^-1^. (3)
Proof, (a) Put
A = {max|SJ > e},
\ = {\St\<e,i = 1,...,fc- 1, |SJ>e}, 1 <fr<n.
Then /4 = J] Ak and
§2. Convergence of Series
385
But
ES2IAtc = E(Sk + (&+1 + - - • + 4))2/^
= ES2IAlt + 2ESk(£k+1 +--- + OU, + E(£k+1 + • - • + Q2IAlc
> £S2kIAk,
since
ESk(£k+! + • - - + «7^ = ESk/^ - E(£k+, + • - • + O = 0
because of independence and the conditions E^,- = 0, i < n. Hence
ES„2 > Y, ES?/^ > «2 I P(A) = e2P(A),
which proves the first inequality,
(b) To prove (3), we observe that
ES2IA = ES2 - ES2IA > ES2 - e2P(A) = ES2 - e2 + e2P(A). (4)
On the other hand, on the set Ak
IVil<£, |Skl<|Sk-il + 141 <e + c
and therefore
ES}lA = £ ESIIA„ + £ B(IA,{S„ - Stf)
k k
<hc)2iw+ ipw t E<y
k fc=l j = fc+l
< P(A)\(b + c)2 + £ E§1 = P(X)[(e + c)2 + ES2].
(5)
From (4) and (5) we obtain
KW ^ /- . „\2 , r C2 „2 l /„ . „\2 , rc2 „2 ^ X
(e + c)2 + ES2 - e2 (e + c)2 + ES2 - e2 ~ ES2 '
This completes the proof of (3).
Proof of Theorem 1. (a) By Theorem 4 of §10, Chapter II, the sequence
(S„), n > 1, converges with probability 1, if and only if it is fundamental with
probability 1. By Theorem 1 of §10, Chapter II, the sequence (S„), n > 1, is
fundamental (P-a.s.) if and only if
P<sup|Sn+k-S„| >ei-»0, n-
(6)
U^i )
By (2),
pisup|Sn+k - S„\ > b\ = Urn p] max |Sn+k - S„\ > el
U>1 J ^-»00 llZkZN )
W-oo e e
Therefore (6) is satisfied if £k°= x E£2 < oo, and consequently £ £k converges
with probability 1.
386
IV. Sequences and Sums of Independent Random Variables
(b) Let Y, £k converge. Then, by (6), for sufficiently large n,
pjsup|Sn+k-Sn|>ej<i. (7)
By (3),
pfsup \Sn+k - Sn\ > s\ > 1 -ff,+ 'L
Therefore if we suppose that Yj?= i E^ = oo, we obtain
p|sup|Sn+k-SJ>ej=l,
which contradicts (7).
This completes the proof of the theorem.
Example. If ^, £2>--- is a sequence of independent Bernoulli random
variables with P(£„ = +1) = P(f„ = -1) = i, then the series Xf„a„, with
\a„\ < c, converges with probability 1, if and only if £ a* < oo.
2. Theorem 2 (Two-Series Theorem). A sufficient condition for the convergence
of the series £ £„ of independent random variables, with probability 1, is that
both series £ E£„ and £ V^„ converge. If P( | <!;„ | < c) = 1, rfce condition is also
necessary.
Proof. If ^ V^„ < oo, then by Theorem 1 the series £ (<!;„ - Efn) converges
(P-a.s.). But by hypothesis the series £ E£n converges; hence J] <!;„ converges
(P-a.s.).
To prove the necessity we use the following symmetrization method. In
addition to the sequence £l3 £2.--- we consider a different sequence |x,
|2, • • • of independent random variables such that |„ has the same
distribution as £„, n > 1. (When the original sample space is sufficiently rich, the
existence of such a sequence follows from Theorem 1 of §9, Chapter II. We
can also show that this assumption involves no loss of generality.)
Then if £ £„ converges (P-a.s.), the series £ |„ also converges, and hence
so does £ (L ~ l„). But E(£„ - |„) = 0 and P(|£„ - |„| < 2c) = 1.
Therefore £ V(£„ — |„) < 00 by Theorem 1. In addition,
IV4 = iIV(£n-|n)<oo.
Consequently, by Theorem 1, £ (£„ — E£n) converges with probability 1,
and therefore £ E£n converges.
Thus if J] £n converges (P-a.s.) (and P( | £„ | < c) = 1, n > 1) it follows that
both J] E£„ and £ V£„ converge.
This completes the proof of the theorem.
3. The following theorem provides a necessary and sufficient condition for
the convergence of J] £„ without any boundedness condition on the random
variables.
§2. Convergence of Series 387
Let c be a constant and
fc m<c,
* 10, |£|>c
Theorem 3 (Kolmogorov's Three-Series Theorem). Let ^, £2»--- ^e a
sequence of independent random variables. A necessary condition for the
convergence ofYj £n with probability 1 is that the series
£E£, £V£, lP(|5„|>c)
converge for every c > 0; a sufficient condition is that these series converge
for some c > 0.
Proof. Sufficiency. By the two-series theorem, J] ££ converges with probability
1. But if J] P(|£M| > c) < oo, then by the Borel-Cantelli lemma we have
£/(141 ^ c) < °o with probability 1. Therefore <!;„ = ^ for all n with at
most finitely many exceptions. Therefore J] £„ also converges (P-a.s.).
Necessity. If £ <!;„ converges (P-a.s.) then £„ -»■ 0 (P-a.s.), and therefore,
for every c > 0, at most a finite number of the events {|^„| > c} can occur
(P-a.s.). Therefore £ I(\£„\ > c) < oo (P-a.s.), and, by the second part of the
Borel-Cantelli lemma, £ P(|5„| > c) < oo. Moreover, the convergence of
Y, £„ implies the convergence of £ £c„. Therefore, by the two-series theorem,
both of the series £ E^ and £ V^ converge.
This completes the proof of the theorem.
Corollary. Let £i, £2 > • • • be independent variables with E£„ = 0. TJzen 1/
tJze series £ <!;„ converges with probability 1.
For the proof we observe that
IE y^\ < °°^1^(16.1 ^ « + \UK\U > 1)] < 00.
Therefore if £l = 6,/(16,1 ^ 1), we have
£E(6!)2<ao.
Since E^„ = 0, we have
I |Efi| = £|E6,/(I6,I < Dl = I|E£,/(|<U > Dl
<lE|4|/(l4l>D<oo.
Therefore both £ E£! and I V£! converge. Moreover, by Chebyshev's
inequality,
p{|6l > 1} = p{|6|/(I6I >D>1}< E(|6I/(I6I > 1).
Therefore £ P(|6I > 1) < °°- Hence the convergence of £ £, follows from
the three-series theorem.
388
IV. Sequences and Sums of Independent Random Variables
4. Problems
1. Let f|, £2,.-. be a sequence of independent random variables, S„ = fp-■•»£«•
Show, using the three-series theorem, that
(a) if £ f j? < 00 (P-a.s.) then £ £„ converges with probability 1, if and only if
£Ef|/(IGl£l) converges;
(b) if £ £„ converges (P-a.s.) then £ fjf < 00 (P-a.s.) if and only if
I(E|U/(IU<l))2<oo.
2. Let £1, £2, • • • be a sequence of independent random variables. Show that £ £,1 < 00
(P-a.s.) if and only if
SErfc<co-
3. Let £1, £2,... be a sequence of independent random variables. Show that £ £,
converges (P-a.s.) if and only if it converges in probability.
§3. Strong Law of Large Numbers
1. Let f!, f 2,... be a sequence of independent random variables with finite
second moments; S„ = ¢1 + • ■ ■ + 6,- By Problem 2, §3, Chapter III, if the
numbers V££ are uniformly bounded, we have the law of large numbers:
SH-ESnP,
0, R - 00. (1)
A strong law of large numbers is a proposition in which convergence in
probability is replaced by convergence with probability 1.
One of the earliest results in this direction is the following theorem.
Theorem 1 (Cantelli). Let £l3 £2,... be independent random variables with
finite fourth moments and let
E|£„-E<U4<C, n>l
for some constant C. Then as n -> 00
^=^=-0 (P-a.S.). (2)
Proof. Without loss of generality, we may assume that E£„ = 0 for n > 1.
By the corollary to Theorem 1, §10, Chapter II, we will have SJn -»- 0 (P-a.s.)
provided that
MI5I"}
< 00
for every e > 0. In turn, by Chebyshev's inequality, this will follow from
Is."
£E
< 00.
§3. Strong Law of Large Numbers
389
Let us show that this condition is actually satisfied under our hypotheses.
We have
si-«, + - + «4 - £tf - I =^6¾
+ I2TTT7T^^+ 2 4!«,«,
t*j ^: ill: i<j<k<i
i*k
4!
Remembering that E£k = 0, k < n, we then obtain
ESJ = £E# + 6 £ EtfE# < nC + 6 £ ^/Etf-E^
i=l I, j=l U=l
i<i
< nC + 6W(W " ^ C = (3n2 - 2n)C < 3n2C.
Consequently
This completes the proof of the theorem.
2. The hypotheses of Theorem 1 can be considerably weakened by the use of
more precise methods. In this way we obtain a stronger law of large numbers.
Theorem 2 (Kolmogorov). Let £,x, £,2,...bea sequence of independent random
variables with finite second moments, and let there be positive numbers b„ such
that b„ | oo and
I^<°°- (3)
Then
In particular, if
then
*!—*$•_>() (P-*.*). (4)
Z^<^ (5)
S" ES»^o (P-a.s.). (6)
n
For the proof of this, and of Theorem 2 below, we need two lemmas.
390
IV. Sequences and Sums of Independent Random Variables
Lemma 1 (Toeplitz). Let {a„} be a sequence of nonnegative numbers, b„ =
Yj=i ««> b„>0 for n> 1, and b„ | oo, n -► oo. Let {x„} be a sequence of
numbers converging to x. Then
On j= 1
In particular, ifa„ = 1 then
*i + ■ • ■ + x„
(7)
(8)
Proof. Let e > 0and let nQ = n0(e) be such that \x„ - x\ < e/2 for all n>n0.
Choose n1 > nQ so that
Then, for n > n1
1
"nj=
laJXJ-
£ |x,.-x|<e/2.
^i- i°j\xj-x\
unj=\
I n0 1 "
= jT HaJ\xi-x\+— X aj\xj-x\
°nj=\ °nj=n0+l
J "o 1 "
^JT I,aj\xj-x\+r I a;l*;-*l
This completes the proof of the lemma.
Lemma 2 (Kronecker). Let {b„} be a sequence of positive increasing numbers,
b„ | oo, n -»■ oo, and let {x„} be a sequence of numbers such that £ x„ converges.
Then
l-j^bjXj-^O, n-oo. (9)
In particular, ifbn = n, x„ = j;„/n and £ (j>„/«) converges, then
J"+- + *^ft „^oo. (10)
Proof. Let fc0 = 0, S0 = 0, S„ = £"=! xj. Then (by summation by parts)
I *;*; = I bASJ ~ SJ-1) = * A - Mo - I Sj- l(bJ - bt. 0
§3. Strong Law of Large Numbers
391
and therefore
°n »= 1 °n /= 1
since, if S„ -► x, then by Toeplitz's lemma,
1
This establishes the lemma.
Proof of Theorem 1. Since
Sn-
on j= i
;*-&«
a suflficient condition for (4) is, by Kronecker's lemma, that the series
E K^fe ~" ^£k)/bk\ converges (P-a.s.). But this series does converge by (3) of
Theorem 1, §2.
This completes the proof of the theorem.
Example 1. Let f lt f2,... be a sequence of independent Bernoulli random
variables with P(£„ = 1) = P(£n= -l)=i- Then, since £ [l/(n log2 n)]<oo,
we have
.0 (P-a.s.). (11)
y/n\ogn
3. In the case when the variables £u ^2> • • • are not onty independent but
also identically distributed, we can obtain a strong law of large numbers
without requiring (as in Theorem 2) the existence of the second moment,
provided that the first absolute moment exists.
Theorem 3 (Kolmogorov). Let ^, f2,... be a sequence of independent
identically distributed random variables with E |^x | < oo. Then
— -m (P-a.s.) (12)
where m — E^.
We need the following lemma.
Lemma 3. Let £bea nonnegative random variable. Then
£ P(£ > n) < E£ < 1 + £ P(£ > n). (13)
392
IV. Sequences and Sums of Independent Random Variables
The proof consists of the following chain of inequalities:
f P(£ > n) = £ XP(^<^<H1)
n=l n=l fc>n
= f /cP(/c ^^</c+l)= f E[W(/c <£<* + !)]
fc=l fc=0
^ I EKJ(* <£ ^ < * + 1)]
fc = 0
= E£ < £ E[(* + l)/(* < ^ < k + 1)]
fc = 0
= £ (* + i)p(/c ^ ^ < fc +1)
fc = 0
= £ P(£ > n) + fp(/c^^<fc+l)= fp(U«) + l.
n=l fc = 0 n=\
Proof of Theorem 3. By Lemma 3 and the Borel-Cantelli lemma,
EKil<ax>XP{K1|>n}<oo
<>E P(I4I >«}<ooo P{|fcJ > n i.o.} = 0.
Hence |^„| < n, except for a finite number of n, with probability 1.
Let us put
. ft. 141 <",
Cn 10, 141 >",
and suppose that E£n = 0, n > 1. Then (^ + --- + 4)/n -► 0 (P-a.s.), if
and only if (^ + • • • + |„)/n -»- 0 (P-a.s.). Note that in general E|„ ^ 0 but
E?„ = Efc /(141 < n) = E^/(|^ | < n) - E^t = 0.
Hence by Toeplitz's lemma
- Z E4 - °> « -^ oo, *
nk=i
and consequently (^ + • • • + 4)/n -► 0 (P-a.s.), if and only if
(|1-E|,) + ^ + (|„-E4)^0j ^ (p_as)j ^ (M)
Write J„ = f„ — E|„. By Kronecker's lemma, (14) will be established if
Yi (4/n) converges (P-a.s.). In turn, by Theorem 1 of §2, this will follow if we
show that, when E 1^ | < oo, the series £ (Vf„/«2) converges.
§3. Strong Law of Large Numbers
393
We have
wE co p22 co i
I -$-* _! -jjr- _I ^E[«. /(|{.| < *)]2
= I A EK?'(l«il < ")] = I A lE[fl/(k - 1) ^ |{,l< fc)]
n=\ n n=l " fc=l
= ZE[fi/(fc-l^|^|<fc)]. f -1
k=l n=kW
fc=l ^
< 2 f E[|^|/(fc - 1 < |^| < k)1 = 2E|^| < oo.
fc=i
This completes the proof of the theorem.
Remark 1. The theorem admits a converse in the following sense. Let
£l3 ^.--- be a sequence of independent identically distributed random
variables such that
¢1 + -- + 6, r
► C,
n
with probability 1, where C is a (finite) constant. Then E | f x | < oo and C = Ef x.
In fact, if S„/n -► C (P-a.s.) then
^ = -=- -JL_L^o (P-a.s.)
n n \ n J n — 1
and therefore P(|^„| > n i.o.) = 0. By the Borel-Cantelli lemma,
lP(|^|>n)<oo,
and by Lemma 3 we have E |^, | < oo. Then it follows from the theorem that
C = E^.
Consequently for independent identically distributed random variables
the condition E | £t \ < oo is necessary and sufficient for the convergence (with
probability 1) of the ratio SJn to a finite limit.
Remark 2. If the expectation m = E^ exists but is not necessarily finite, the
conclusion (10) of the theorem remains valid.
In fact, let, for example, E£;~ < oo and E£j* = oo. With C > 0, put
i= 1
394
IV. Sequences and Sums of Independent Random Variables
Then (P-a.s.).
Hm^ > Urn-=- = EWfo ^ Q.
But as C -»■ oo,
E^/(^<C)-E^ = oo;
therefore Sn/n -»• +oo (P-a.s.).
4. Let us give some applications of the strong law of large numbers.
Example 1 (Application to number theory). Let Q = [0,1), let ^ be the
algebra of Borel subsets of Q, and let P be Lebesgue measure on [0, 1).
Consider the binary expansions co=0. co^ ... of numbers cdeQ (with infinitely
many O's) and define random variables ^i(co),^2(w), . • • by putting £„(cd) = cd„.
Since, for all n> > 1 and all xu..., x„ taking the values 0 or 1,
{cd: ^(co) = xl9..., £„(co) = xn}
{X, x2 x„ X, x„ 1)
the P-measure of this set is 1/2". It follows that ^, £„,... is a sequence of
independent identically distributed random variables with
P(£i = 0) = % = 1) = i
Hence, by the strong law of large numbers, we have the following result of
Borel: almost every number in [0, 1) is normal, in the sense that with probability
1 the proportion of zeros and ones in its binary expansion tends to ^, i.e.,
Z Z/(&-D-i (P-a.s.).
n fc=l
Example 2 (The Monte Carlo method). Let f{x) be a continuous function
defined on [0, 1], with values on [0, 1]. The following idea is the foundation
of the statistical method of calculating [lf(x)dx (the "Monte Carlo
method").
Let fii *hi £2> *h»--- be a sequence of independent random variables,
uniformly distributed on [0, 1]. Put
fl if/(«>i|l,
Pi [0 if/(fix*.
It is clear that
Epi = P{/(« > iji} = f/(x)</x.
«/0
§4. Law of the Iterated Logarithm
395
By the strong law of large numbers (Theorem 3)
" I P.— CfMdx (P-as.)-
n ,-=i Jo
Consequently we can approximate an integral [lf{x)dx by taking a
simulation consisting of a pair of random variables (f,-, 17,-), i > 1, and then
calculating pt and (1/n) £J=1 p£.
5. Problems
1. Show that Ef2 < 00 if and only if ££,, nP(|f | > n) < 00.
2. Supposing that f|, f2J... are independent and identically distributed, show that if
E |f t |a < 00 for some a, 0 < a < 1, then Sn/nlla -> 0 (P-a.s.), and if E |f, \p < 00 for
some ft 1 < $ < 2, then (Sn - nE^)/nl,p -> 0 (P-a.s.).
3. Let f!, f2,... be a sequence of independent identically distributed random variables
and let E |f 11 = 00. Show that
IS I
— — an = 00 (P-a.s.)
n I
for every sequence of constants {«„}.
4. Show that a rational number on [0,1) is never normal (in the sense of Example 1,
Subsection 4).
§4. Law of the Iterated Logarithm
1. Let £ x, f2,... be a sequence of independent Bernoulli random variables
with P(4 = 1) = P(£, = -1) = i; let S„ = f x + • • . + f„. It follows from
the proof of Theorem 2, §1, that
fiE-^==+oo, ljm-^==-oo, (1)
y/n y/n
with probability 1. On the other hand, by (3.11),
*Jn logn
0 (P-a.s.). (2)
Let us compare these results.
It follows from (1) that with probability 1 the paths of (S„)„>i intersect
the "curves" ±e^/n infinitely often for any given e; but at the same time (2)
396
IV. Sequences and Sums of Independent Random Variables
shows that they only finitely often leave the region bounded by the curves
+ ex/nlogn. These two results yield useful information on the amplitude
of the oscillations of the symmetric random walk (S„)„>i- The law of the
iterated logarithm, which we present below, improves this picture of the
amplitude of the oscillations of (S„)„> i-
Let us introduce the following definition. We call a function (p* = <?*(«),
n > 1, upper (for (S„)„>i) if. witn probability 1, S„ < (p*(n) for all n from
n = n0(co) on.
We call a function <p^ = <pjn)->n > ljower (for (S„)„>i) if, with probability
1, S„ > <pjji) for infinitely many n.
Using these definitions, and appealing to (1) and (2), we can say that every
function <p* = e^/n log w, e > 0, is upper, whereas <p# = e^/n is lower, e > 0.
Let <p = <p(ri) be a function and q>* — (1 + e)(p, (p^e = (1 — e)(p, where
e > 0. Then it is easily seen that
^fim-7!h< U =]lim sup -^-\ < \>
o < sup —r-z < 1 + e for every e > 0, from some nJs) on >
Uni(C) <P(m) J
<*■ {5,„ < (1 + s)<p(m) for every e > 0, from some w,(e) on}.
(3)
In the same way,
o < sup ——- < 1 + e for every e > 0, /rom some w,(e) on >
{Sm > (1 — E)<p(m) for every e > 0 and for infinitely
many m larger than some n3(e) > n2(e)} • (4)
It follows from (3) and (4) that in order to verify that each function <p* =
(1 + e)<p, e > 0, is upper, we have to show that
But to show that q>^t = (1 — e)<p,e > 0, is lower, we have to show that
■t5^1}-1- (6)
§4. Law of the Iterated Logarithm
397
2. Theorem 1 (Law of the Iterated Logarithm). Let f,, £2,. . - be a sequence of
independent identically distributed random variables with E^ = 0 and E£f =
a2 > 0. Then
p{mwrl}=1' (7)
where
ij/(n) = y/2(j2n log log n. (8)
For uniformly bounded random variables, the law of the iterated logarithm
was established by Khinchin (1924). In 1929 Kolmogorov generalized this
result to a wide class of independent variables. Under the conditions of
Theorem 1, the law of the iterated logarithm was established by Hartman
and Wintner (1941).
Since the proof of Theorem 1 is rather complicated, we shall confine
ourselves to the special case when the random variables £„ are normal,
£, - JT1& 1), n > 1.
We begin by proving two auxiliary results.
Lemma 1. Let £i,..., £„ be independent random variables that are symmetrically
distributed (P(& 6B) = P(-^eB) for every Be@ (/?), k < n). Then for
every real number a
P( max Sk > a) < 2P(S„ > a).
\i<fc<« /
(9)
Proof. Let A — {ma.Xi<k<„ Sk > a}, Ak = {S( < a, i < k — 1; Sk > a} and
B = {S„ > a}. Since Sn > a on Ak (because Sk < S„), we have
P(B n Ak) > P(Ak n {S„ > Sk}) = P(Ak)P(S„ > Sk)
= P(Xk)P(4+1 + ■•• + £,=><>).
By the symmetry of the distributions of the random variables f lf..., &,,
we have
P(4+, + ■•• + 4 > 0) = P(4+i + .-• + £,< 0).
Hence P(4+1 + ■ • ■ + L > 0) > i, and therefore
P(B) > t P^* n *> * \ I p(^> = } P^)>
k=i zk=i z
which establishes (9).
Lemma 2. Let S„ ~ ^T (0, <r2(n)), <r2(n) | oo, aw/ fet a(n\ n > 1, sotis^
a(n)/cr(n) -> oo, n -> oo. Tfcen
P(S„ > fl(n» - -7=^- exp{ -M»)/*2W}. (10)
V27ca(n)
398
IV. Sequences and Sums of Independent Random Variables
The proof follows from the asymptotic formula
x-»oo,
since SJa(n) ~ JV(0, 1).
1 /.00 1
-j= e~y212 dy ~ -7L- e~x2'\
Proof of Theorem 1 (for f £ ~ jV(0, 1)).
Let us first establish (5). Let e > 0, X = 1 + e, nk = Ak, where k>kQ, and
k0 is chosen so that In In kQ is defined. We also define
Ak — {$n > #(«) for some ne(nkink+ J}, (11)
and put
A = {Ak i.o.} = {Sn > hj/{n) for infinitely many n}.
In accordance with (3), we can establish (5) by showing that P(A) — 0.
Let us show that £ P(Ak) < oo. Then P(A) = 0 by the Borel-Cantelli
lemma.
From (11), (9), and (10) we find that
P(Ak) < P{S„ > hl/(nk) for some ne(nk,nk+l)}
< P{S„ > hjf(nk) for some n < nk+1}
< Cx exp(-A In In Xk) < Ce~Mnk = C2k~\
where Cx and C2 are constants. But ££- L /c~A < oo, and therefore
£ P(Xk) < oo.
Consequently (5) is established.
We turn now to the proof of (6). In accordance with (4) we must show that,
with X — 1 — e, e > 0, we have with probability 1 that S„ > X\j/{n) for
infinitely many n.
Let us apply (5), which we just proved, to the sequence (—S„)„^. L. Then we
find that for all n, with finitely many exceptions, — Sn < 2ij/(ri) (P-a.s.).
Consequently if nk = Nk, N > 1, then for sufficiently large k, either
or
Snk>Yk-2^,.,1 (12)
where Yk = Snk-S„k_r
Hence if we show that for infinitely many k
Yk>X^{nk) + 2^{nk.l\ (13)
§4. Law of the Iterated Logarithm
399
this and (12) show that (P-a.s.) S„k > ty(nk) for infinitely many k. Take some
X e (A, 1). Then there is an N > 1 such that for all k
Xl2(Nk - Nk~») In In N"]1/2 > A(2N" In In N*)1'2
+ 2(2^-1 In In N*"1)1'2 = ^(Nfc) + 2*}/(Nk-1).
It is now enough to show that
Yk > X[2(Nk - Nk~l) In In N*]1'2 (14)
for infinitely many k. Evidently Yk ~ ^V(0, Nk — Nk~l). Therefore, by
Lemma 2,
1 — /J'\2l„l„ Wfc
P{Yk > A'[2(N* - Nk~l) In In N"]1/2} -
JziiXXl In In N")1/2
*A*k-**>-c*
(Ink)112 ~k\nk'
Since £ (\/k In k) = oo, it follows from the second part of the Borel-Cantelli
lemma that, with probability 1, inequality (14) is satisfied for infinitely many
/c, so that (6) is established.
This completes the proof of the theorem.
Remark 1. Applying (7) to the random variables (—S„)„^. u we find that
Hm-^-=-l. (15)
<p(n)
It follows from (7) and (15) that the law of the iterated logarithm can be put
in the form
+:SH-:
(16)
Remark 2. The law of the iterated logarithm says that for every e > 0 each
function ij/f — (1 + e)^ is upper, and ^w = (1 — e)^ is lower.
The conclusion (7) is also equivalent to the statement that, for each e > 0,
P{|SJ>(l-e)^(«)i.o.} = l,
P{|S„|>(1 + e)^(n) i.o.} = 0.
3. Problems
1. Let f J, f2,... be a sequence of independent random variables with fn ~ jV(0,1).
Show that
p{lim~-7J===ll=l.
I Jllnn J
400
IV. Sequences and Sums of Independent Random Variables
2 Let ¢,, £2,... be a sequence of independent random variables, distributed according
to Poisson's law with parameter X > 0. Show that (independently of A)
L_ f „ In In n }
3. Let f „ £2,... be a sequence of independent identically distributed random variables
with
Ee"«« = e-l«la, 0 < a < 2
Show that
{I o ll/dnlnn) *)
m\M =e"°\=]
4. Establish the following generalization of (9). Let f „..., fn be independent random
variables. Levy's inequality
p\ max [Sk + n(Sn - SJ] > a \ < 2P(Sn > a\ S0 = 0,
l.0<k<n J
holds for every real a, where n(0 is the median of f, i.e. a constant such that
P(£>MO)>i P(£<M£))>i
§5. Rapidity of Convergence in the Strong Law
of Large Numbers and in the Probabilities
of Large Deviations
1. By the results of §6, Chapter I, we have the following estimate for the
Bernoulli scheme:
IN-}
< 2e~2n£2 (1)
(see (42), subsection 7, §6, Chapter I). From this, of course, there follows the
inequalities
p{sup|^-p|>4< X p{|^-p|>el<_
[m>n\m \ ) r&n [\m | J 1
(2)
which provide an approximation of the rate of convergence to p by the
quantity SJn with probability 1.
We now consider the question of the validity of formulas of the types (1)
and (2) in some more general situations, when S„ = f x + • • • + £„ is a sum of
independent identically distributed random variables.
2. Let ^, fi, f2, ••• be a sequence of independent random variables. We say
that a random variable satisfies Cramer's condition if there is a X > 0 for
which
§5. Rapidity of Convergence in the Strong Law
401
<p(X) = Eem < oo (3)
(it can be shown that this condition is equivalent to an exponential decrease
ofP(|f|>x),x-oo).
Let
A = {AeR:<p(A)<oo}, (4)
where we suppose that A contains a neighborhood of the point A = 0, i.e.,
Cramer's condition (3) is satisfied for some A > 0.
On the set A the function
^(A)-In 9(A) (5)
is convex (from below), and strictly convex if the random variable £ is not
degenerate. We also notice that
,/,(0) = 0, ^'(0) = w(= Ef), ^"(A)>0.
We extend ^(A) to all A e R by setting ij/(X) = oo for A $ A.
We define the function
H{a) = sup [aA - ^(A)], aeR, (6)
A
called the Cramer transform (of the distribution function F = F(x) of the
random variable f). The function if (a) is also convex (from below) and its
minimum is zero, attained at A = m.
Ua>m, we have
H{a) = sup [aA - ^(A)].
A>0
Then
P{f :> a} <£ inf EeA(*-fl> = inf <t&*-*wh = <r«w (7)
A>0 A>0
Similarly, for a < m we have H(a) = supA<0 [aA — ^(A)] and
P{f < fl} < e-H(fl). (8)
Consequently (compare (42), §6, Chapter 1),
P{|£ - m\ ^ 6} ^ e-«nin{H(m-£),H(m+£)} (9)
If £, fls ... <!;„ are independent identically distributed random variables
that satisfy Cramer's condition (3), S„ = f x + —I- <!;„, ^„(A) = In E exp(AS^/n),
^(A) = In EeA«, and
tfn(a) = sup[flA-MA)], (10)
A
then
Hn(a) = nH{a) ( = n sup [aA - ^(A)] J
and the inequalities (7), (8), and (9) assume the following forms:
402
IV. Sequences and Sums of Independent Random Variables
'{i
>a>^e~nHla\ a>m,
PJ— <a\<>e-nma\ a<m,
p\\— — m\>B><l 2e~min{Hlm~t),Hlm+e)}
Remark. Results of the type
'{
— - m\ >£> <ae~bn,
n 1 J
(11)
(12)
(13)
(14)
where a > 0 and b > 0, indicate exponential convergence "adjusted" by the
constants a and b. In the theory of large deviations, results are often presented
in a somewhat different, "cruder," form:
BmilnPJp-m >ej<0, (15)
that clearly arises from (14) and refers to the "exponential" rapidity of
convergence, but without specifying the values of the constants a and b.
Now we turn to the question of upper bounds for the probabilities
P<sup-^>ai, piinf-^<flL P<sup \-±-m >ei,
which can provide definite bounds on the rapidity of convergence in the strong
law of large numbers.
Let us suppose that the independent identically distributed nondegenerate
random variables ^, ^, ^2» • • • satisfy Cramer's condition, i.e., <p(A) = Eem <
oo for some k > 0.
We fix n > 1 and set
k = inf<k>n:-£ >a>,
taking k = oo if SJk < a for k > n.
In addition, let a and X > 0 satisfy
Xa - In q>{X) > 0. (16)
Then
pfeH=pfe(H}
= p<pi> fl, K < ooj = P{e^ > <?**, K < 00}
§5. Rapidity of Convergence in the Strong Law
403
= p/g^-Kln^A) > g»c(Aa-ln<f>(A))^ K < ^J
< P{eAS«~,c,nvW > en(Aa-ln<p(A)) K < ^j
< p J SUp g ASfc-ft In <p(A) > e«(Aa-ln <p(A) I /jy\
Ift^n J'
To take the final step, we notice that the sequence of random variables
eAsfc-fcin<,(A)} k > 1}
with respect to the flow of cr-algebras &\ = crf^,..., £ft}, fc > 1, forms a
martingale. (For more details, see Chapter VII and, in particular, Example 1
in §1). Then it follows from inequality (8) in §3, Chapter VII, that
P \ SUp e*S*-* ■"«»<« > en(Xa-lntpW) ( < e-n(Aa-ln ?(A»
U>» J
and consequently, (by (16)) we obtain the inequality
P jsup ^ > 4 < <r«w«-i»rf*» (18)
Let a>m. Since the function /(A) = aA — In q>{k) has the properties /(0) = 0,
/'(0) > 0, there is a A > 0 for which (16) is satisfied, and consequently, we
obtain from (18) that if a > m we have
p J SUp _* > a \ < e~" supx>° lXa~ln <P(A)1 = e~nH(a). (19)
[ten k J
Similarly, if a < m, we have
p J gup — < a [ < e~nsuPA<o[^-m«*A)] = e~nH(a) ^20)
[k^n k J
From (19) and (20), we obtain
P {sup ^ - m > el < 2e-minmm-thHim+t)]-". (21)
Remark. Combining the right-hand sides of the inequalities (11) and (19)
leads us to suspect that this situation is not random. In fact, this expectation
is concealed in the fact that the sequences (Sk/k)„^k^N form, for every n < N,
reversed martingales (see Problem 6 in §1, Chapter VII, and Example 4 in §11,
Chapter I).
2. Problems
1. Carry out the proof of inequalities (8) and (20).
2. Investigate the properties of H(a).
CHAPTER V
Stationary (Strict Sense) Random
Sequences and Ergodic Theory
§1. Stationary (Strict Sense) Random Sequences.
Measure-Preserving Transformations
1. Let (Q, #", P) be a probability space and £, = (£l9 £2> • • •) a sequence of
random variables or, as we say, a random sequence. Let 6k £ denote the sequence
(£k+l> £fc + 2> • • •)•
Definition 1. A random sequence £ is stationary (in the strict sense) if the
probability distributions of 6k£ and £ are the same for every k > 1:
P((^, fe,.. .)eB) = P((£k+1, £k+2i.. ,)eB\ Be@(Rm).
The simplest example is a sequence £ = (£i, £2» ■■■) °f independent
identically distributed random variables. Starting from such a sequence, we
can construct a broad class of stationary sequences n = (i^, n2,. • •) by
choosing any Borel function g(xl9..., x„) and setting nk = g(£,ki f *+1» ■•■»&+ i)-
If £ = (£,, ^2, • • •) is a sequence of independent identically distributed
random variables with E |^, | < 00 and E^ = m, the law of large numbers
tells us that, with probability 1,
£1 + --- + 6,
* m, n -*■ 00.
n
In 1931 Birkhoflf obtained a remarkable generalization of this fact for the case
of stationary sequences. The present chapter consists mainly of a proof of
Birkhoflf's theorem.
The following presentation is based on the idea of measure-preserving
transformations, something that brings us in contact with an interesting
§1. Stationary (Strict Sense) Random Sequences. Measure-Preserving Transformations 405
branch of analysis (ergodic theory), and at the same time shows the
connection between this theory and stationary random proceesses.
Let (Q, SF, P) be a probability space.
Definition 2. A transformation T of Q into Q, is measurable if, for every A e «F,
T~lA = {co:TcoeA}e^.
Definition 3. A measurable transformation T is a measure-preserving
transformation (or morphism) if, for every AeF,
P(T~lA) = P(A).
Let T be a measure-preserving transformation, T" its nth iterate, and f x =
£i(cd) a random variable. Put ^k(co) = £i(T"~ 1co)i n > 2, and consider the
sequence £ = (£lt £2,.. .)• We claim that this sequence is stationary.
In fact, let A = {cd:£eB} and Ax = {w.O^eB}, where BE@(Rm).
Since A = {cd: (^(co), ^(Tcd), ...) e B}, and Al = {w: {^{Tw\ ^{T2w\...)
eBKwehavecoe^ifandonlyifeitherTcoeyloryl! = T~1A.ButP{T~1A)
= P(A\ and therefore P(At) = P(A). Similarly P(Ak) = P(A) for every
Ak= {co:0k£eJ3},fc>2.
Thus we can use measure-preserving transformations to construct
stationary (strict sense) random variables.
In a certain sense, there is a converse result: for every stationary sequence
£ considered on (£2, ^", P) we can construct a new probability space (ft, #, P),
a random variable fi(d>) and a measure-preserving transformation T,
such that the distribution of \ = {^(cb), ?i(Tc&),...} coincides with the
distribution of ^.
In fact, take ft to be the coordinate space Rm and put # = @(Rm\
P = Pf, where P£B) = P{cd: £eB}, BE@(Rm). The action of f on ft is
given by
f(x1,x2,...) = (x2,x3,...).
If d) = (xi, X2,.. .)> Put
li(0) = x„ ?„(«) = Uf"- xc5), n > 2
Now let A = {&: (xlt..., xk) e B}, B e@(Rk\ and
f-lA = {cb:(x2,...,xk+l)EB}.
Then the property of being stationary means that
P(A) = P{w: {Zl9..., « e B} = P{cd: (£2,..., &+ x) 6 2*} = P(f " M),
i.e. T is a measure-preserving transformation. Since P{ct>: (f!,..., fk) e J3} =
P {co: (f x,..., £k) eB} for every fc, it follows that £ and \ have the same
distribution.
Here are some examples of measure-preserving transformations.
406
V. Stationary (Strict Sense) Random Sequences and Ergodic Theory
Example 1. Let Q = {colt..., co„} consist of n points (a finite number), n > 2,
let «F be the collection of its subsets, and let Tcot = coi+l,\ < i < n - 1, and
7^ = cov If P(coj) = 1/n, the transformation T is measure-preserving.
Example 2. If Q = [0,1), & = ^([0, 1)), P is Lebesgue measure, X e [0, 1),
then Tx = (x + X) mod 1 and T = 2x mod 1 are both measure-preserving
transformations.
2. Let us pause to consider the physical hypotheses that led to the
consideration of measure-preserving transformations.
Let us suppose that Q, is the phase space of a system that evolves (in discrete
time) according to a given law of motion. If co is the state at instant n = 1, then
T"co, where T is the translation operator induced by the given law of motion,
is the state attained by the system after n steps. Moreover, if A is some set of
states co then T~1A = {co: TcoeA} is, by definition, the set of states co that
lead to A in one step. Therefore if we interpret Q as an incompressible fluid, the
condition P(T~1A) = P(A) can be thought of as the rather natural condition
of conservation of volume. (For the classical conservative Hamiltonian
systems, Liouville's theorem asserts that the corresponding transformation T
preserves Lebesgue measure.)
3. One of the earliest results on measure-preserving transformations was
Poincare's recurrence theorem (1912).
Theorem 1. Let (£2, ,^, P)bea probability space, let Tbea measure-preserving
transformation, and let AeF. Then, for almost every point coeA, we have
T"coeAfor infinitely many n > 1.
Proof. Let C = {coeA: T"co$A, for all n > 1}. Since C n T~nC = 0 for
all n > 1, we have T~mC n T~(m+n)C = T~m{C n T~"C) = 0. Therefore
the sequence {T~"C} consists of disjoint sets of equal measure. Therefore
In°°=o P(Q = XT-o P(T"nC) < P(Q) = 1 and consequently P(C) = 0.
Therefore, for almost every point coeA, for at least one n > 1, we have
T"coeA. It follows that T"coeA for infinitely many n.
Let us apply the preceding result to Tk, k > 1. Then for every coeA\N,
where N is a set of probability zero, the union of the corresponding sets
corresponding to the various values of k, there is an nk such that (Tk)"kco e A.
It is then clear that T"co e A for infinitely many n. This completes the proof of
the theorem.
CoroHary. Let £(co) > 0. Then
f^(Tkco) = a) (P-a.s.)
fc = 0
on the set {co: £(co) > 0}.
§2. Ergodicity and Mixing
407
In fact, let A„ = {co: £(co) > 1/n}. Then, according to the theorem,
£fc°=o £(Tkco) = oo (P-a.s.) on Ani and the required result follows by letting
n-»oo.
Remark. The theorem remains valid if we replace the probability measure P
by any finite measure [i with fi(Q) < oo.
4. Problems
1. Let The a measure-preserving transformation and f = f (co) a random variable whose
expectation Ef (co) exists. Show that Ef(co) = E£(Tco).
2. Show that the transformations in Examples 1 and 2 are measure-preserving.
3. Let Q = [0,1), F = ^?([0,1)) and let P be a measure whose distribution function is
continuous. Show that the transformations Tx = Ax,0 < A < 1, and Tx = x2 are not
measure-preserving.
§2. Ergodicity and Mixing
1. In the present section T denotes a measure-preserving transformation on
the probability space (Q, #", P).
Definition 1. A set Ae^ is invariant if T~lA = A. A set Xef is almost
invariant if A and T~lA differ only by a set of measure zero, i.e. P(A A T~ lA)
= 0.
It is easily verified that the classes J and J* of invariant or almost
invariant sets, respectively, are cr-algebras.
Definition 2. A measure-preserving transformation T is ergodic (or metrically
transitive) if every invariant set A has measure either zero or one.
Definition 3. A random variable ^ = ^(co) is invariant (or almost invariant) if
£(cd) = £(Tcd) for all cd eCl (or for almost all co eQ).
The following lemma establishes a connection between invariant and
almost invariant sets.
Lemma 1. If A is almost invariant, there is an invariant set B such that
P(A AB) = 0.
Proof. Let B = fim T~nA.Then T~lB = fim T~{n+l)A = B, i.e. BeJf.lt is
easily seen that AAB^ (J"=o (T~kA A T~lk+1)A). But
P{T~kA A T~lk+l)A) = P(A A T~lA) = 0.
Hence P(A AB) = 0.
408
V. Stationary (Strict Sense) Random Sequences and Ergodic Theory
Lemma 2. A transformation T is ergodic if and only if every almost invariant set
has measure zero or one.
Proof. Let AeJ*\ then according to Lemma 1 there is an invariant set B
such that P(A AB) = 0. But T is ergodic and therefore P(B) = 0 or 1.
Therefore P(A) = 0 or 1. The converse is evident, since J ^ J*. This
completes the proof of the lemma.
Theorem 1. Let Tbea measure-preserving transformation. Then the following
conditions are equivalent'.
(1) T is ergodic;
(2) every almost invariant random variable is (P-a.s.) constant;
(3) every invariant random variable is (P-a.s.) constant.
Proof. (1) o (2). Let T be ergodic and £, almost invariant, i.e. (P-a.s.) £(co) =
£(Tcd). Then for every ceR we have Ac= {w: £(co) < c} e./*, and then
P(v4c) = 0 or 1 by Lemma 2. Let C = sup{c: P(,4C) = 0}. Since Ac | CI as
c | oo and Ac 10 as c I — oo, we have \C\ < oo. Then
P{cd: «©) < C} = p| (J j«fl>) < C - iJJ = 0
and similarly P{co: £(a>) > C} = 0. Consequently P{co: ^(co) = C} = 1.
(2) => (3). Evident.
(3) => (1). Let A e «/; then IA is an invariant random variable and therefore,
(P-a.s.), IA = 0 or IA = 1, whence P(v4) = 0 or 1,
Remark. The conclusion of the theorem remains valid in the case when
"random variable" is replaced by "bounded-random variable".
We illustrate the theorem with the following example.
Example. Let Q = [0,1), & = ^([0, 1)), let P be Lebesgue measure and let
Tcd = (co + k) mod 1. Let us show that T is ergodic if and only if X is
irrational.
Let E, = £(cd) be a random variable with E^2(co) < oo. Then we know that
the Fourier series ££L _ m c„e2ninto of £(co) converges in the mean square sense,
£|c„|2<oo, and, because T is a measure-preserving transformation
(Example 2, §1), we have (Problem 1, §1) that for the random variable £
cnE£(ft>)e27rin*(co) = E^TcD)e2ninTt0 = e2Tcin*E£(TcD)e2Tcint0
So c„(l — e2ninX) = 0. By hypothesis, X is irrational and therefore e2ninX ^ 1
for all n ^0. Therefore c„ = 0, n ^ 0, £(co) = c0 (P-a.s.), and T is ergodic by
Theorem 1.
§3. Ergodic Theorems
409
On the other hand, let X be rational, i.e. X = k/m, where k and m are
integers. Consider the set
A 27~2 [ * k + n
It is clear that this set is invariant; but P(A) = \. Consequently T is not ergodic.
2. Definition 4. A measure-preserving transformation is mixing (or has the
mixing property) if, for all A and BeF,
lim P(A n T"nB) = P(A)P(B). (1)
n-»oo
The following theorem establishes a connection between ergodicity and
mixing.
Theorem 2. Every mixing transformation T is ergodic.
Proof. Let Aeg?, BeJ. Then B= T~"B,n> 1, and therefore
P{A n T-"J3) = P04 n J3)
for all « > 1. Because of (1), P{A n B) = P04)P(J3). Hence we find, when
,4 = J3, that P(B) = P2(J3), and consequently P{B) = 0 or 1. This completes
the proof.
3. Problems
1. Show that a random variable £ is invariant if and only if it is ^-measurable.
2 Show that a set A is almost invariant if and only if either
P(7~ lA\A) = 0 or P(A\T~ lA) = 0.
3. Show that the transformation considered in the example of Subsection 1 of the
present section is not mixing.
4. Show that a transformation is mixing if and only if, for all random variables £ and rj
with E£2 < oo and Erj2 < oo,
E^(rw)#) -> Ef(G>)E»/(G>), n -> oo.
§3. Ergodic Theorems
1. Theorem 1 (Birkhoff and Khinchin). Let T be a measure-preserving
transformation and f — £(cd) a random variable with E | f | < oo. Then (P-a.s.)
lim^ I^(Tkco) = E(^|^). (1)
n nk=0
410
V. Stationary (Strict Sense) Random Sequences and Ergodic Theory
If also T is ergodic then (P-a.s.)
1 n-l
lim - £ Z(Tkco) = Ef. (2)
n nk=Q
The proof given below is based on the following proposition, whose simple
proof was given by A. Garsia (1965).
Lemma (Maximal Ergodic Theorem). Let T be a measure-preserving
transformation, let £be a random variable with E |£| < oo, and let
Sk(CD) = «fl>) + &TCD) + • • • + ftr'-'co),
Mk(co) = max{0, S^co),..., Sk(w)}.
Then
E[£(co)/{Mn>0}(co)] > 0
for every n > 1.
Proof. If n > k, we have M„(Tcd) > Sk(Tco) and therefore f(co) + M„(Tco) >
f(co) + Sk(Tco) = Sk+1(co). Since it is evident that f(co) > S^co) - M„(Tcd%
we have
«a>) > maxtf^co),..., S„(co)} - M„(Tco).
Therefore
E[f(">)Wo>(">)] > E^ax^^co),..., S„(co)) - Mn(Ta>)\
But max(S!,..., S„) = M„ on the set {M„ > 0}. Consequently,
E[»/,Mn>o}N] > E{(M„(co) - Mn(Tco))/{Mn(CI))>0}}
> E{M„(co) - M„(Tco)} = 0,
since if T is a measure-preserving transformation we have EM„(co) =
EM„(Tco) (Problem 1, §1).
This completes the proof of the lemma.
Proof of the Theorem. Let us suppose that E(f|*/) = 0 (otherwise replace
fbyf-E(£M0).
Let n = hm(SJri) and q = lim(5n/w). It will be enough to establish that
(P-a.s.)
0<3 <ff <0.
Consider the random variable fj = fj(co). Since fj(co) = fj(Tco), the variable i; is
invariant and consequently, for every e > 0, the set Ae = {fj(co) > e} is also
invariant. Let us introduce the new random variable
^(^) = (^)-6)/^(0)),
and put
S*(co) = f *(co) + • • • + f *(Tk~ ^), Mj?(o>) = max(0, S?,..., S£).
§3. Ergodic Theorems
411
Then, by the lemma,
EK*/{M>o}]>0
for every n > 1. But as n -► oo,
{M* > 0} = j maxSk* > ol T jsup Sk* > 0 j = {sup ^ > 0 j
where the last equation follows because supk^^S^/k) ^ rj, and X£ =
{co: ^ > e}.
Moreover, E|f*| < E|f| + e. Hence, by the dominated convergence
theorem,
Thus
0 <£ EK*/^ = E[tf - e)/^] = ER/^] - eP(A£)
= ECEKI^/J - eP(A£) = -eP(X£),
so that P(/4£) = 0 and therefore P(>j < 0) = 1.
Similarly, if we consider — £(cd) instead of f(co), we find that
=(-¾ —«»1—a
and P(-g < 0) = 1, i.e. Pfo > 0) = 1. Therefore 0 <n<fj < 0 (P-a.s.) and
the first part of the theorem is established.
To prove the second part, we observe that sjnce E(£\f) is an invariant
random variable, we have E(£\f) = Ef (P-a.s.) in the ergodic case.
This completes the proof of the theorem.
Corollary. A measure-preserving transformation T is ergodic if and only if for
all A and Be^,
lim - £ P(A n T~kB) = P(A)P(B). (3)
n nk=0
ToprovetheergodicityofTweuseX = B e J in (3). Then A n T~kB = B
and therefore P(B) = P2(B), i.e. P(B) = 0 or 1. Conversely, let T be ergodic.
Then if we apply (2) to the random variable f = /B(co), where Be^we find
that (P-a.s.)
lim - £ h-^i^) = P(B).
n wk=0
412
V. Stationary (Strict Sense) Random Sequences and Ergodic Theory
If we now integrate both sides over A e 2F and use the dominated convergence
theorem, we obtain (3) as required.
2. We now show that, under the hypotheses of Theorem 1, there is not only
almost sure convergence in (1) and (2), but also convergence in mean. (This
result will be used below in the proof of Theorem 3.)
Theorem2. Let Tbea measure-preserving transformation and let £ = £(co) be a
random variable with E |£| < co. Then
- nXf(Tka>)-E(f|JO
If also T is ergodic, then
- Z t(Tkco) - E£
(4)
(5)
Proof. For every e > 0 there is a bounded random variable rj(\rj(cD)\ < M)
such that E |f - rj\ < e. Then
- ££(Tkco)-E(£|JO
«1=0
<E
+ E
- "t^Tkco)-E(r]\Jf)
nk=0
X (&TkcD) - rj(TkcD))\
£ = 0 |
+ E|E(£|J0-Efo|J0|. (6)
Since \r\ \ <, M, then by the dominated convergence theorem and by using (1)
we find that the second term on the right of (6) tends to zero as n -»• oo. The
first and third terms are each at most e. Hence for sufficiently large n the left-
hand side of (6) is less than 2e, so that (4) is proved. Finally, if T is ergodic, then
(5) follows from (4) and the remark that E(f |/) = Ef (P-a.s.).
This completes the proof of the theorem.
3. We now turn to the question of the validity of the ergodic theorem for
stationary (strict sense) random sequences f = (f x, f 2, • • •) defined on a
probability space (Q, #; P). In general, (Q, #; P) need not carry any measure-
preserving transformations, so that it is not possible to apply Theorem 1
directly. However,^asjve observed in §1, we can construct a coordinate
probability space (Q, &, P), random variables I = (J1} f2» • • •)» and a measure-
preserving transformation T such that f„(c5) = \i(fn~l<b) and the
distributions of £ and J are the same. Since such properties as almost sure
convergence and convergence in the mean are defined only for
probability distributions, from the convergence of (1/n) ][£= x J^T*"^) (P-a.s.
and in mean) to a random variable fj it follows that (1/n) Yi=i 6k(°0 als°
converges (P-a.s. and in mean) to a random variable « such that rj = fj.lt
§3. Ergodic Theorems
413
follows from Theorem 1 that if E | Jx | < oo then = £(JX \J\ where 3 is a
collection of invariant sets (E is the average with respect to the measure P). We
now describe the structure of?;.
Definition 1. A set A e 2F is invariant with respect to the sequence f if there is a
set B e @(Roo) such that for n > 1
A = {a>:(ZH,ZH+u...)eB}.
The collection of all such invariant sets is a cr-algebra, denoted by J^.
Definition 2. A stationary sequence £ is ergodic if the measure of every
invariant set is either 0 or 1.
Let us now show that the random variable n can be taken equal to E(^ | f£.
In fact, let AeS^. Then since
\nk=l |
we have
-t ffc^- \ndP- (7)
nk=l J A J A
Let Be@(Rm) be such that A = {w:(fk, fk+1,...)eB} for all k > 1. Then
since f is stationary,
f&dP- f trfP- f ttdP- U,dP.
JA J {«:<&.&+i.-)eB) J{co:($i,$2,...)eB} ^
Hence it follows from (7) that for all AeJ^ which implies (see §7, Chapter II)
that n = E(f x \S$. Here E(f x |^) = Eft if f is ergodic.
Therefore we have proved the following theorem.
Theorem 3 (Ergodic Theorem). Let £ = (f lf f2, ■ • •) &e « stationary (strict
sense) random sequence with E | f x | < oo. 77zen (P-a.s., and in tne wean)
lim± £«©) = £«! 1^
wk=i
//£ is a/so an ergqdic sequence^ then (P-a.s., and in the mean)
limi f4(co) = E^.
4. Problems
1. Let £ = (£ n £2,...) be a Gaussian stationary sequence with E£„ = 0 and covariance
function K(n) = E£k+IJ£k. Show that R(ii) -> 0 is a sufficient condition for £ to be
ergodic.
414
V. Stationary (Strict Sense) Random Sequences and Ergodic Theory
2. Show that every sequence f = (f lf f2,...) of independent identically distributed
random variables is ergodic.
3. Show that a stationary sequence f is ergodic if and only if
- 11^ &+k) - P((f i £i+k) e B) (P-a.s.)
for every B e @{R\ k = 1,2,....
CHAPTER VI
Stationary (Wide Sense) Random
Sequences. L2-Theory
§1. Spectral Representation of the
Covariance Function
1. According to the definition given in the preceding chapter, a random
sequence £ = (f lf £2,...) is stationary in the strict sense if, for every set
B e ^(R00) and every n > 1,
P{(t1,t2i...)eB} = P{(tn+1,tn+2i...)eB}. (1)
It follows, in particular, that if E£? < oo then E£n is independent of n:
E£„ = E^, (2)
and the covariance cov(fn+fnfn) = E(£n+m - Efn+J(fn - EQ depends only
on m:
cov(£n+f„, C) = covO^ +m, ^)- (3)
In the present chapter we study sequences that are stationary in the wide
sense (and have finite second moments), namely those for which (1) is
replaced by the (weaker) conditions (2) and (3).
The random variables £„ are understood to be defined for n e Z =
{0, ± 1,...} and to be complex-valued. The latter assumption not only does
not complicate the theory, but makes it more elegant. It is also clear that
results for real random variables can easily be obtained as special cases of the
corresponding results for complex random variables.
Let H2 = H2(Q, ^, P) be the space of (complex) random variables
f = a + ift a, fieR, with E|f |2 < oo, where |f |2 = a2 + 02. If f and
y\ e H2, we put
&ri) = E«, (4)
416
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
where y\ = a — ifi is the complex conjugate of r\ = a + if} and
As for real random variables, the space H2 (more precisely, the space of
equivalence classes of random variables; compare §§10 and 11 of Chapter II)
is complete under the scalar product (£, n) and norm ||f ||. In accordance with
the terminology of functional analysis, H2 is called the complex (or unitary)
Hilbert space (of random variables considered on the probability space
(Q,iF,P)).
If £, y\ e H2 their covariance is
cov«,i|) = E«[-EOfo-Ei7)i (6)
It follows from (4) and (6) that if Ef = Erj = 0 then
cov«, If) = «,!/> (7)
Definition. A sequence of complex random variables £ = (£n)neZ with
E | £„ |2 < oo, n e Z, is stationary (in the wide sense) if, for all «eZ,
Et = E£o,
cov(£fc+n, £fc) = cov(4, £0), fceZ. (8)
As a matter of convenience, we shall always suppose that E£0 = 0. This
involves no loss of generality, but does make it possible (by (7)) to identify the
covariance with the scalar product and hence to apply the methods and results
of the theory of Hilbert spaces.
Let us write
K(n) = cov(£n,S0), WeZ, (9)
and (assuming R(0) = E|f0|2 # 0)
*„) = §! neZ. (10)
We call jR(h) the covariance function, and p(n\ the correlation function, of the
sequence £ (assumed stationary in the wide sense).
It follows immediately from (9) that R(n) is nonnegative-definite, i.e. for all
complex numbers at am and tlt..., tm e Z, m > 1, we have
m
X a.ajRiu-t^O. (11)
u=i
It is then easy to deduce (either from (11) or directly from (9)) the following
properties of the covariance function (see Problem 1):
jR(0) > 0, R(-n) = Rfn), \R(n)\ < R(0),
|R(n) - R(m)\2 < 2jR(0)[R(0) - Re R(n - m)]. (12)
§1. Spectral Representation of the Covariance Function
417
2. Let us give some examples of stationary sequences f = (fn)ne2. (From
now on, the words "in the wide sense" and the statement neZ will both be
omitted.)
Example 1. Let £„ = f0 •#(«), where E^0 = 0, E£g = 1 and g = g(n) is_a
function. The sequence £ = (£,) will be stationary if and only if g(k + n)g(k)
depends only on n. Hence it is easy to see that there is a A such that
g(n) = 0(Oy*".
Consequently the sequence of random variables
L = Zo 9(P)eiXn
is stationary with
R(ri) = \9(0)\2ean.
In particular, the random "constant" £ = £0 is a stationary sequence.
Example 2. An almost periodic sequence. Let
C-£**". (13)
fc=l
where zx zN are orthogonal (Ez£Zj = 0, i # j) random variables with zero
means and E \zk\2 = c\ > 0; — n < Xk < n, k = 1 N; X{^ Xjt i ¥= j.
The sequence £ = (£„) is stationary with
R(n)= £ <**"*". (14)
jt=i
As a generalization of (13) we now suppose that
& = I WiXk\ (15)
fc=-oo
where zk, keZ, have the same properties as in (13). If we suppose that
Zfc°= - oo Gk < °°»the series on the right of (15) converges in mean-square and
R(n)= I <^£Akn. (16)
fc=-oo
Let us introduce the function
F(X)= I ol (17)
Then the covariance function (16) can be written as a Lebesgue-Stieltjes
integral,
R(n) = f* e£A" dF(A). (18)
418
VI. Stationary (Wide Sense) Random Sequences. Z,2-Theory
The stationary sequence (15) is represented as a sum of "harmonics" el k"
with "frequencies" Xk and random "amplitudes" zk of "intensities" g\ =
E |zj2. Consequently the values of F(X) provide complete information on the
"spectrum" of the sequence f, i.e. on the intensity with which each frequency
appears in (15). By (18), the values of F(X) also completely determine the
structure of the covariance function R(n).
Up to a constant multiple, a (nondegenerate) F(X) is evidently a
distribution function, which in the examples considered so far has been piecewise
constant. It is quite remarkable that the covariance function of every stationary
(wide sense) random sequence can be represented (see the theorem in
Subsection 3) in the form (18), where F(X) is a distribution function (up to
normalization), whose support is concentrated on [—7t, n\ i.e. F(X) = 0 for
A < -71 and F(X) = F(n) for X > n.
The result on the integral representation of the covariance function, if
compared with (15) and (16), suggests that every stationary sequence also
admits an "integral" representation. This is in fact the case, as will be shown in
§3 by using what we shall learn to call stochastic integrals with respect to
orthogonal stochastic measures (§2).
Example 3 (White noise). Let e = (e„) be an orthonormal sequence of random
variables, Ee„ = 0, Ee^- = <5lV, where <5£j- is the Kronecker delta. Such a
sequence is evidently stationary, and
„, x fl n = 0,
m = \0, n#0.
Observe that R(n) can be represented in the form
* eikn dF(X\ (19)
/W = —, -n<X<n. (20)
Comparison of the spectral functions (17) and (20) shows that whereas the
spectrum in Example 2 is discrete, in the present example it is absolutely
continuous with constant "spectral density "/(A) = 57c. In this sense we can
say that the sequence e = (e„) "consists of harmonics of equal intensities." It
is just this property that has led to calling such a sequence e = (e„) "white
noise " by analogy with white light, which consists of different frequencies with
the same intensities.
Example 4 (Moving averages) Starting from the white noise e = (e„)
introduced in Example 3, let us form the new sequence
00
4=1 aken-ki (21)
fc=-oo
R(n) =
where
F(X)= f f(y)dv;
J — n
§1. Spectral Representation of the Covariance Function
419
where ak are complex numbers such that X"=-oo lfljtl2 < °°- By Parseval's
equation,
00
cov(^+m, fj = cov(f„, f0) = I an+kak,
fc=-oo
so that £ = (£fc) is a stationary sequence, which we call the sequence obtained
from e = (ek) by a (two-sided) moving average.
In the special case when the ak of negative index are zero, i.e.
00
L = £ Afcfin-fc.
fc = 0
the sequence £ = (£„) is a one-sided moving average. If, in addition, ak = 0 for
fc> p, i.e. if
&. = flofin + «ifi„- !+••■ + ape„_pt (22)
then £ = (£„) is a moving average of order p.
We can show (Problem 5) that (22) has a covariance function of the form
#(") — i-n &**fW dh where the spectral density is
m) = ^t\P(e-i")\2 (23)
with
P(z) = a0 + axz + ••■ + apzp.
Example 5 (Autoregression). Again let e = (e„) be white noise. We say that a
random sequence £ = (£„) is described by an autoregressive model of order q if
£. + tit-1 + --- + *,£,-, = V (24)
Under what conditions on b1 b„ can we say that (24) has a stationary
solution? To find an answer, let us begin with the case q = 1:
£, = 06,^+¾. (25)
where a = — bl. If |a| < 1, it is easy to verify that the stationary sequence
e = (Dwith
?„ = f>V/ (26)
j=0
is a solution of (25). (The series on the right of (26) converges in mean-square.)
Let us now show that, in the class of stationary sequences £ = (£„) (with
finite second moments) this is the only solution. In fact, we find from (25),
by successive iteration, that
jt-i
£„ = a^n_! + s„ = a[afn_2 + e^] + e„ = • • • = afcfn_fc + £ aje„-j.
420
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
Hence it follows that
Ek - "l «V;T = E[^„_fc]2 = a2fcE£2-* = «2fcE£2 - 0, k - oo.
Therefore when |a| < 1 a stationary solution of (25) exists and is
representable as the one-sided moving average (26).
There is a similar result for every q > 1: if all the zeros of the polynomial
Q{z)=\ +blz + --' + bq2« (27)
lie outside the unit disk, then the autoregression equation (24) has a unique
stationary solution, which is representable as a one-sided moving average
(Problem 2). Here the covariance function R(n) can be represented (Problem
5) in the form
R(n) = f eiXn dF(X), F(X) = f /(v) dv, (28)
J-n J-n
where
J 1_
2n' |Q(e~iX)\2
/»>~i 'TnTFW (29)
In the special case q= 1, we find easily from (25) that E£0 = 0,
and
(when n < 0 we have R(«) = R(-n)\ Here
2tt |1 -ae-,A|2
Example 6. This example illustrates how autoregression arises in the
construction of probabilistic models in hydrology. Consider a body of water;
we try to construct a probabilistic model of the deviations of the level of the
water from its average value because of variations in the inflow and
evaporation from the surface.
If we take a year as the unit of time and let Hn denote the water level in
year «, we obtain the following balance equation:
Hn+1 = H„-KS(Hn) + Zn+1, (30)
where £„+1 is the inflow in year (n + 1), S(H) is the area of the surface of the
water at level H, and K is the coefficient of evaporation.
§1. Spectral Representation of the Covariance Function
421
Let £„ = H„ — H be the deviation from the mean level (which is obtained
from observations over many years) and suppose that S(H) = S(H) +
c{H — H). Then it follows from the balance equation that £„ satisfies
6.+ 1 =afc, +¢.+ 1 (31)
with a = 1 - cK, e„ = Z„ - KS(H). It is natural to assume that the random
variables s„ have zero means and are identically distributed. Then, as we
showed in Example 5, equation (31) has (for \a\ < 1) a unique stationary
solution, which we think of as the steady-state solution (with respect to time
in years) of the oscillations of the level in the body of water.
As an example of practical conclusions that can be drawn from a
(theoretical) model (31), we call attention to the possibility of predicting the
level for the following year from the results of the observations of the present
and preceding years. It turns out (see also Example 2 in §6) that (in the mean-
square sense) the optimal linear estimator of £n+1 in terms of the values of
•••>f„-i>f„ is simply u£„.
Example 7 (Autoregression and moving average (mixed model)). If we
suppose that the right-hand side of (24) contains a0e„ + ale„-1 + ••• +
ape„-p instead of e„, we obtain a mixed model with autoregression and
moving average of order (p, q):
£„ + b^n_x + ••• + bq£„-q = a0sn + a1e„-1 + ••• + ape„_p. (32)
Under the same hypotheses as in Example 5 on the zeros it will be shown later
(Corollary 2 to Theorem 3 of §3) that (32) has the stationary solution £ = (£„)
for which the covariance function is R(n) = Jln eiXn dF(X) with F(X) =
Ji„/(v)dv, where
3. Theorem (Herglotz). Let R(ri) be the covariance function of a stationary
(wide sense) random sequence with zero mean. Then there is, on
([-71, 71), ^([-71, 71))),
a finite measure F = F(B), B e &([—n, 7i)), such that for every neZ
R(n) = f* eikn F(dX). (33)
J — n
Proof. For N > 1 and Xe [-7i, 7i], put
^(A> = 9^I lK(fc-/)e-£fcV". (34)
Z7WV fc=1 /=1
P(e~a)
Q(e~iX)
422
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
Let
Then
(36)
Since R(n) is nonnegative definite, fN(X) is nonnegative. Since there are
N — \m\ pairs (k, I) for which k — I = w, we have
fM = i I (i-^W-"1 <35>
FN(B)= [fMdK Be@(l-n,n)).
Jb
lo, \n\ > N.J
The measures FN, N > 1, are supported on the interval [—7i, 7t] and
Fn(L—n> ^]) = ^(0) < oo for all N > 1. Consequently the family of measures
{FN}t N > 1, is tight, and by Prokhorov's theorem (Theorem 1 of §2 of
Chapter III) there are a sequence {Nk} £ {N} and a measure F such that
FNk^F. (The concepts of tightness, relative compactness, and weak
convergence, together with Prokhorov's theorem, can be extended in an
obvious way from probability measures to any finite measures.)
It then follows from (36) that
f ^nF(dX) = lim f eiXnFNk(dX) = R(n).
J — n Nfc-»oo J— n
The measure F so constructed is supported on [—n, ii\. Without changing
the integral §™m eiA"F(dX)t we can redefine F by transferring the "mass"
F({n}\ which is concentrated at n, to — n. The resulting new measure (which
we again denote by F) will be supported on [—n, n).
This completes the proof of the theorem.
Remark 1. The measure F = F(B) involved in (33) is known as the spectral
measure, and F(A) = F([—n, A]) as the spectral function, of the stationary
sequence with covariance function R(n).
In Example 2 above the spectral measure was discrete (concentrated at
Xk, k = 0, ± 1,...). In Examples 3-6 the spectral measures were absolutely
continuous.
Remark 2. The spectral measure F is uniquely defined by the covariance
function. In fact, let Ft and F2 be two spectral measures and let
P e^F^dX) = f eanF2(dX% neZ.
J -it J-n
§2. Orthogonal Stochastic Measures and Stochastic Integrals
423
Since every bounded continuous function g(X) can be uniformly
approximated on [—71,7i) by trigonometric polynomials, we have
f AxyFMX) = f AWitdxy.
J — n J-it
It follows (compare the proof in Theorem 2, §12, Chapter II) that FX(B)
= F2(B) for all B e@([-n, n)).
Remark 3. If £ =(£„) is a stationary sequence of real random variables £„,
then
R(n)= f" cos XnF(dX).
4. Problems
1. Derive (12) from (11).
2. Show that the autoregression equation (24) has a stationary solution if all the zeros
of the polynomial Q(z) defined by (27) lie outside the unit disk.
3. Prove that the covariance function (28) admits the representation (29) with spectral
density given by (30).
4. Show that the sequence f = (f„) of random variables, where
CO
L = Z («fc sin Kn + Pk cos Xkn)
fc=i
and ak and f}k are real random variables, can be represented in the form
fc=-oo
with zk = j(J}k — iak) for k > 0 and zk = z_fc, Xk = — A_fc for k < 0.
5. Show that the spectral functions of the sequences (22) and (24) have densities given
respectively by (23) and (29).
6. Show that if J] \R(n)\ < oo, the spectral function F(A) has density f{X) given by
/W-7- £ e-""R(n).
^71 f!=-00
§2. Orthogonal Stochastic Measures and
Stochastic Integrals
1. As we observed in §1, the integral representation of the covariance
function and the example of a stationary sequence
L = I zke^ (1)
fc=-oo
424
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
with pairwise orthogonal random variables zk,keZ, suggest the possibility
of representing an arbitrary stationary sequence as a corresponding integral
generalization of (1).
If we put
Z«= I «*, (2)
{fc:Afc<A}
we can rewrite (1) in the form
fc=-oo
where AZ(Afc) = Z(Afc) - Z(Xk -) = zk.
The right-hand side of (3) reminds us of an approximating sum for an
integral J"„ eUn dZ(X) of Riemann-Stieltjes type. However, in the present
case Z(X) is a random function (it also depends on co). Hence it is clear that
for an integral representation of a general stationary sequence we need to use
functions Z(X) that do not have bounded variation for each co. Consequently
the simple interpretation of {*„ eiXn dZ{X) as a Riemann-Stieltjes integral
for each a> is inapplicable.
2. By analogy with the general ideas of the Lebesgue, Lebesgue-Stieltjes
and Riemann-Stieltjes integrals (§6, Chapter II), we begin by defining
stochastic measure.
Let (Q, ^ P) be a probability space, and let £ be a subset, with an algebra
@q of subsets and the cr-algebra @ generated by <f>0.
Definition 1. A complex-valued function Z(A) = Z(a>\ A), defined for coeQ
and Ael0, is a finitely additive stochastic measure if
(1) E|Z(A)|2 < oo for every Ae^o;
(2) for every pair Ax and A2 of disjoint sets in g0i
Z(AX + A2) = Z(AX) + Z(A2) (P-a.s.) (4)
Definition 2. A finitely additive stochastic measure Z(A) is an elementary
stochastic measure if, for all disjoint sets A^ A2,... of £0 such that A =
Z(A)-£Z(Afc)
► 0, n -+ oo. (5)
Remark 1. In this definition of an elementary stochastic measure on subsets
of ^0, it is assumed that its values are in the Hilbert space H2 = H2(Q, ^ P),
and that countable additivity is understood in the mean-square sense (5).
There are other definitions of stochastic measures, without the requirement
of the existence of second moments, where countable additivity is defined
(for example) in terms of convergence in probability or with probability
one.
§2. Orthogonal Stochastic Measures and Stochastic Integrals
425
Remark 2. In analogy with nonstochastic measures, one can show that for
finitely additive stochastic measures the condition (5) of countable additivity
(in the mean-square sense) is equivalent to continuity (in the mean-square
sense) at "zero":
E|Z(A„)|2-+0, A„|0, Ane<?0- (6)
A particularly important class of elementary stochastic measures consists
of those that are orthogonal according to the following definition.
Definition 3. An elementary stochastic measure Z(A), Ae<f0, is orthogonal
(or a measure with orthogonal values) if
EZ(Ax)Z(Ar) = 0 (7)
for every pair of disjoint sets Ax and A2 in S0\ or, equivalently, if
EZ(AX)Z(A2) = E |Z(At n A2)|2 (8)
for all Ax and A2 in S0.
We write
m(A) = E|Z(A)|2, Ae^o- P)
For elementary orthogonal stochastic measures, the set function m = m(A),
Ae^0, is, as is easily verified, a finite measure, and consequently by
Caratheodory's theorem (§3, Chapter II) it can be extended to (£, <f).
The resulting measure will again be denoted by m = m(A) and called the
structure function (of the elementary orthogonal stochastic measure Z =
Z(A),Ae<f0).
The following question now arises naturally: since the set function
m = w(A) defined on (£, <^0) admits an extension to (£, <f), where 8 = c($0\
cannot an elementary orthogonal stochastic measure Z = Z(A), A e<f0, be
extended to sets A in £ in such a way that E |Z(A)|2 = m(A), A e<f?
The answer is affirmative, as follows from the construction given below.
This construction, at the same time, leads to the stochastic integral which we
need for the integral representation of stationary sequences.
3. Let Z = Z(A) be an elementary orthogonal stochastic measure, Ae<f0>
with structure function m = m(A), A e S. For every function
/W = I/JAfc, Afce<f0> (10)
with only a finite number of different (complex) values, we define the random
variable
^(/) = I/fcZ(Afc).
426
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
Let L2 = L2(E, <f, m) be the Hilbert space of complex-valued functions
with the scalar product
and the norm ||/|| = </,/>1/2, and let H2 = H2(Q, &, P) be the Hilbert
space of complex-valued random variables with the scalar product
and the norm ||£| = (f, f)1'2.
Then it is clear that, for every pair of functions f and g of the form (10),
and
IW)II2 = ll/ll2 = \ \f(X)\2m(dX).
Je
Now let/e L2 and let {/,} be functions of type (10) such that \\f - fn\\ -► 0,
n -» oo (the existence of such functions follows from Problem 2). Consequently
IW„) - ^(/Jll = \\fn ~ /Jl - 0, ii, m - oo.
Therefore the sequence {./(/„)} is fundamental in the mean-square sense
and by Theorem 7, §10, Chapter II, there is a random variable (denoted by
J{f)) such that JF(f) e H2 and \\J{fn) - Jf(f)\\ - 0, n - oo.
The random variable ./(/) constructed in this way is uniquely defined
(up to stochastic equivalence) and is independent of the choice of the
approximating sequence {/„}. We call it the stochastic integral of feL2 with
respect to the elementary orthogonal stochastic measure Z and denote it by
>(/)= [mm**).
Je
We note the following basic properties of the stochastic integral J(f)\
these are direct consequences of its construction (Problem 1). Let g, f, and
/„eL2.Then
(SU).s(g)) = <f,9>i (ID
IW)II = 11/11; (12)
f(af + bg) = aJf(f) + bJ{g) (P-a.s.) (13)
where a and b are constants;
IWn) -^(/)11-0, (14)
§2. Orthogonal Stochastic Measures and Stochastic Integrals
427
4. Let us use the preceding definition of the stochastic integral to extend
the elementary stochastic measure Z(A), A e<f0, to sets in 8 = c{£0).
Since m is assumed to be finite, we have IA = /A(A) e L2 for all A e S.
Write Z(A) = J(IJ. It is clear that Z(A) = Z(A) for A e <f0. It follows from
(13) that if At n A2 = 0 for Ax and A2 e <f, then
2(Ai + A2) = Z(At) + _Z(A2) (P-a.s.)
and it follows from (12) that
E|Z(A)|2 = m(A), Ae<f.
Let us show that the random set function Z(A), A e <f, is countably additive
in the mean-square sense. In fact, let Afc e £ and A = £fc°=i Afc. Then
fc=i
where
Z(A) - £ Z(Afc) = J{gn\
fc=i
fc=l fc=n+l
But
E|^„)|2=||^H2 = m(Zn)10, n-+oo,
E|Z(A)- £Z(Afc)|2-+0, «-oo.
fc=i
It also follows from (11) that
EZ(A!)I(A2) = 0
when Ax n A2 = 0, Alf A2 e S.
Thus our function Z(A), defined on A e <f, is countably additive in the
mean-square sense and coincides with Z(A) on the sets A e £0. We shall call
Z(A), A e <f, an orthogonal stochastic measure (since it is an extension of
the elementary orthogonal stochastic measure Z(A)) with respect to the
structure function m(A), AgI; and we call the integral J(f) = Je/0*)Z(*M),
defined above, a stochastic integral with respect to this measure.
5. We now consider the case (E, ^) = (R, ^(jR)), which is the most
important for our purposes. As we know (§3, Chapter II), there is a one-to-one
correspondence between finite measures m = m(A) on (R, @(R)) and certain
(generalized) distribution functions G = G(x), with m(a, b] = G(b) — G(a).
It turns out that there is something similar for orthogonal stochastic
measures. We introduce the following definition.
428
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
Definition 4. A set of (complex-valued) random variables {ZA}, XeR,
defined on (Q, ^ P), is a random process with orthogonal increments if
(1) E|ZA|2 <oo,AeR;
(2) for every X e R
E|ZA-ZAJ2-+0, A„U KeR;
(3) whenever Xx < X2 < X3 < A4,
E(ZA„ - ZXJ(Z,2 - ZAl) = 0.
Condition (3) is the condition of orthogonal increments. Condition (1)
means that ZA e H2. Finally, condition (2) is included for technical reasons;
it is a requirement of continuity on the right (in the mean-square sense) at
each X e R.
Let Z = Z(A) be an orthogonal stochastic measure with respect to the
structure function m = m(A), of finite measure, with the (generalized)
distribution function G(X). Let us put
ZA = Z(-oo, A].
Then
E |Z;J2 = m(-oo, X] = G(X) < oo, E |ZA - Zxf = m(A„, X] 10, K1 K
and (evidently) 3) is satisfied also. Then this process {ZA} is called a process
with orthogonal increments.
On the other hand, if {ZA} is such a process with E |ZA|2 = G(X), G(— oo)
= 0, G(+oo) < oo, we put
Z(A) = Zb-Za
when A = (a, b]. Let £0 be the algebra of sets
A = I(£fc,« and Z(A)=tz(aktbkl
fc=i jt=i
It is clear that
E|Z(A)|2 = m(A),
where m(A) = Yi=i LG(bk) - G(ak)] and
EZ(A2)Z(A2) = 0
for disjoint intervals Aj = (al9 bx~] and A2 = (a2t fcj.
Therefore Z = Z(A), Ae<f0> is an elementary stochastic measure with
orthogonal values. The set function m = m(A), Ae^0, has a unique extension
to a measure on £ = ^(R), and it follows from the preceding constructions
that Z = Z(A), Ae^0) can also be extended to the set A e <f, where S = ^(R),
and E | Z(A) |2 = m(A), A e B{@).
§3. Spectral Representation of Stationary (Wide Sense) Sequences
429
Therefore there is a one-to-one correspondence between processes
{ZA}, AeR, with orthogonal increments and E|ZA|2 = G(A), G(-oo) = 0,
G( + oo) < oo, and orthogonal stochastic measures Z = Z(A), Aeg§(R%
with structure functions m = m(A). The correspondence is given by
ZA = Z(- oo, X], G(X) = m(- oo, A]
and
Z(a, b] = Z„ - Zfl, m(a, b] = G(b) - G{a).
By analogy with the usual notation of the theory of Riemann-Stieltjes
integration, the stochastic integral JR f(A) dZXi where {ZA} is a process with
orthogonal increments, means the stochastic integral JR f(X)Z{dX) with
respect to the corresponding process with an orthogonal stochastic measure.
6. Problems
1. Prove the equivalence of (5) and (6).
2. Let f e L2. Using the results of Chapter II (Theorem 1 of §4, the Corollary to Theorem
3 of §6, and Problem 9 of §3), prove that there is a sequence of functions f„ of the
form (10) such that ||/ - /„|| -»• 0, n -> oo.
3. Establish the following properties of an orthogonal stochastic measure Z(A) with
structure function m(A):
E|Z(A1)-Z(A2)|2 = m(A1AA2X
Z(AX\A2) = Z(AX) - Z(At n A2) (P-a.s.),
Z(AX A A2) = Z(AX) + Z(A2) - 2Z(At n A2) (P-a.s.).
§3. Spectral Representation of Stationary
(Wide Sense) Sequences
1. If £ = (fn) is a stationary sequence with Ef„ = 0, neZ, then by the
theorem of §1, there is a finite measure F = F(A) on ([—7c, 7c), ^([—7c, 7c)))
such that its covariance function R(n) = cov(£fc+n, £fc) admits the spectral
representation
R(n) = f" e**"F{dX). (1)
The following result provides the corresponding spectral representation
of the sequence f = (fn), «eZ, itself.
430
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
Theorem 1. There is an orthogonal stochastic measure Z = Z(A),
A e ^([—7t, 7t)), such that for every neZ (P-a.s.)
£, = f e"nZ(dX). (2)
J -n
Moreover, E |Z(A)|2 = F(A).
The simplest proof is based on properties of Hilbert spaces.
Let L2{F) = L2(E, <f, F) be a Hilbert space of complex functions,
E = [-7t, 7t), S = ^([-7t, 7t)), with the scalar product
<f>0>= f fWSWF{dX\ (3)
and let L&F) be the linear manifold (L&F) c L2(F)) spanned by e„ = en(X),
n e Z, where e„(X) = ea".
Observe that since E = [ — n, ri) and F is finite, the closure of Lq(F)
coincides (Problem 1) with L2(F):
U0(F) = L\F).
Also let LoOJ;) be the linear manifold spanned by the random variables £„,
n e Z, and let L2(£) be its closure in the mean-square sense (with respect
toP).
We establish a one-to-one correspondence between the elements of
L&F) and Lo(f), denoted by "<-►", by setting
e„++Z„, neZ, (4)
and defining it for elements in general (more precisely, for equivalence
classes of elements) by linearity:
£«„*„<-£«„£„ (5)
(here we suppose that only finitely many of the complex numbers a„ are
different from zero).
Observe that (5) is a consistent definition, in the sense that Z a„e„ = 0
almost everywhere with respect to F if and only if £ a„£„ = 0 (P-a.s.).
The correspondence "<-►" is an isometry, i.e. it preserves scalar products.
In fact, by (3),
<*„. O = f eJiXyeJLXMdX) = f ^""'TO = K(n - m)
and similarly
<I «,*„, I ftO = (I «,,£,, I &«• (6)
§3. Spectral Representation of Stationary (Wide Sense) Sequences
431
Now let y\ e L2(f). Since L2(f) = Lo(f), there is a sequence {rj„} such that
r]neL%(£) and \\rj„ — #71| -► 0, n -» 00. Consequently {?;„} is a fundamental
sequence and therefore so is the sequence {/,}, where /„ eL&F) and f„<^r}„.
The space L2(F) is complete and consequently there is an feL\F) such
that ||/„-/|| —0.
There is an evident converse: if feL2(F) and \\f — fn\\ ->0, /neLo(F),
there is an element r\ of L2(f) such that ||iy — iy„|| -► 0, ??„ e Lj(^) and »?„<-►/,.
Up to now the isometry "<-►" has been defined only as between elements
of LgOJ;) and Lj(F). We extend it by continuity, taking/ <-► ?; when/and ?; are
the elements considered above. It is easily verified that the correspondence
obtained in this way is one-to-one (between classes of equivalent random
variables and of functions), is linear, and preserves scalar products.
Consider the function f(X) = IA(A), where Ae^([- n, n)\ and let
Z(A) be the element of L2(f) such that /A(A) <-► Z(X). It is clear that ||/A(A)||2
= F(A) and therefore E|Z(A)|2 = F(A). Moreover, if A1nA2 = 0, we
have EZ(A1)Z(A2) = 0 and E |Z(A) - JJ=1 Z(Afc)|2 - 0, n - 00, where
Hence the family of elements Z(A), Ae^([- 7i, 7i)), form an orthogonal
stochastic measure, with respect to which (according to §2) we can define
the stochastic integral
>(/)= f f(X}Z(dX)9 feL2(F).
J — n
Let /e L2(F) and !/«-►/. Denote the element r\ by ¢(/) (more precisely,
select single representatives from the corresponding equivalence classes of
random variables or functions). Let us show that (P-a.s.)
S(f) = ^(/)- CO
In fact, if
/(A) = EM*,(A) (8)
is a finite linear combination of functions /Afc(A), Afc = (afc, bj, then, by the
very definition of the stochastic integral, J(f) = £ ocfcZ(Afc), which is
evidently equal to ¢(/). Therefore (7) is valid for functions of the form (8).
But if feL2(F) and \\f„ - f\\ - 0, where /„ are functions of the form (8),
then 110)(/,) - ¢(/)11 - 0 and \\S(fn) - S(f)\\ - 0 (by (2.14)). Therefore
¢(/) = >(/) (P-a-s.).
Consider the function /(A) = elkn. Then ^e,An) = fn by (4), but on the
other hand J{eiXn) = /"_„ eanZ{dX\ Therefore
£„= f e,AnZ(<U), neZ (P-a.s.)
J —n
by (7). This completes the proof of the theorem.
432
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
Corollary 1. Let £ = (£„) be a stationary sequence of real random variables £„,
neZ. Then the stochastic measure Z = Z(A) involved in the spectral
representation (2) has the property that
Z(A) = ZFA) (9)
for every A = ^([—n, n)), where —A = {X: — X e A}.
In fact, let f(X) = £ ak eak and n = £ ocfc £k (finite sums). Then f ++n and
therefore
^ = Iafc^-Ia^,Afc=7FI). (10)
Since ^A(A)<-»>Z(A), it follows from (10) that either ^A(-A)<-»>Z(A) or
Jf-A(X)<r+Z(A). On the other hand, ^_A(A)<-»Z(-A). Therefore Z(A)
= Z(-A)(P-a.s.).
Corollary 2. Again let £ = (^„) be a stationary sequence of real random
variables £„ and Z(A) = ZX(A) + iZ2(A). Then
EZ1(A1)Z2(A2) = 0 (11)
for every A1 and A2; and ifAlnA2 = 0 then
EZ^A^Z^) = 0, EZ2(A1)Z2(A2) = 0. (12)
In fact, since Z(A) = Z(—A), we have
Zx(-A) = ZX(A), Z2(-A) = -Z2(A). (13)
Moreover, since EZ(A2)Z(A2) = E |Z(AX n A2)|2, we have Im EZ(AX)Z(A2)
= 0, i.e.
EZ^A^Z^A,) + EZ2(A1)Z1(A2) = 0. (14)
If we take the interval - A2 instead of Aj we therefore obtain
EZ^-A^Z^AJ + EZ^-A^Z^) = 0,
which, by (13), can be transformed into
EZ^A^Z^A,) - EZ2(A1)Z1(A2) = 0. (15)
Then (11) follows from (14) and (15).
On the other hand, if A2 n A2 = 0 then EZ(A2)Z(A2) = 0, whence
Re EZ(A2)Z(A2) = 0 and Re EZ(-AX)Z(A2) = 0, which, with (13), provides
an evident proof of (12).
Corollary 3. Let £ = (£„) be a Gaussian sequence. Then, for every family
A2 Afc, the vector (Z^Aj) Z^AJ, Z2(A2) Z2(Afc)) is normally
distributed.
§3. Spectral Representation of Stationary (Wide Sense) Sequences
433
In fact, the linear manifold Lq(£) consists of (complex-valued) Gaussian
random variables r\, i.e. the vector (Re r\t Im rj) has a Gaussian distribution.
Then, according to Subsection 5, §13, Chapter II, the closure of Lj(f) also
consists of Gaussian variables. It follows from Corollary 2 that, when
£ = (£„) is a Gaussian sequence, the real and imaginary parts of Zj and Z2
are independent in the sense that the families of random variables (Z^Aj),
..., Zt(^)) and (Z2(Aj),..., Z2(Afc)) are independent. It also follows
from (12) that when the sets Ax Afc are disjoint, the random variables
Z.^Aj),..., Z,(Afc) are collectively independent, i = 1, 2.
Corollary 4. If £ = (£„) is a stationary sequence of real random variables,
then (P-fl.s.)
£, = f cos Xn Z^dX) + f sin Xn Z2(dX). (16)
J-n J-n
Remark. If {ZA}, Ae[—7i, 7t), is a process with orthogonal increments,
corresponding to an orthogonal stochastic measure Z = Z(A), then in
accordance with §2 the spectral representation (2) can also be written in the
following form:
<-£
eiXndZAt neZ. (17)
2. Let £ = (£„) be a stationary sequence with the spectral representation (2)
and let r\ e L2(£). The following theorem describes the structure of such
random variables.
Theorem 2.1frje L2(£), there is a function <p e L2(F) such that (P-a.s.)
.n= f <p(X)Z(dX). (18)
J -71
Proof. If
then by (2)
ri„ = I «*&> (19)
|fc|<n
»n= f (l OLkeAz{dX\ (20)
J -n \|fc|<n /
i.e. (18) is satisfied with
<PM = I *neak. (21)
|fc|<;n
In the general case, when r\ e L2(£), there are variables r\n of type (19) such
that \r\ - n„\\ -► 0, n -► oo. But then \\<p„ - <pm\\ = \\n„ - nm\\ -+ 0, n, m -► oo.
434
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
Consequently {<p„} is fundamental in L2(F) and therefore there is a function
<peL2(F) such that \\<p — <pn\\ -»> 0, n -»> oo.
By property (2.14) we have ||./(<pn) - S((p)\\ -► 0, and since nn =•/(<?„)
we also have y\ = «/(<p) (P-a.s.).
This completes the proof of the theorem.
Remark. Let H0(£) and H0(F) be the respective closed linear manifolds
spanned by the variables £„ and by the functions en when n < 0. Then if
y\ e H0(£) there is a function <p e Hq(F) such that (P-a.s.) r\ = \n-n (p(X)Z(dX).
3. Formula (18) describes the structure of the random variables that are
obtained from £„, «eZ, by linear transformations, i.e. in the form of finite
sums (19) and their mean-square limits.
A special but important class of such linear transformations are defined
by means of what are known as (linear) filters. Let us suppose that, at instant
m, a system (filter) receives as input a signal xmt and that the output of the
system is, at instant «, the signal h(n — m)xmi where h = h(s), seZ, is a
complex valued function called the impulse response (of the filter).
Therefore the total signal obtained from the input can be represented in
the form
00
yn = Z h(" - ni)xm. (22)
m= —oo
For physically realizable systems, the values of the input at instant n
are determined only by the "past" values of the signal, i.e. the values xm for
m < n. It is therefore natural to call a filter with the impulse response h(s)
physically realizable if h(s) = 0 for all s < 0, in other words if
n oo
yn = Z Kn - m)xm = £ h(m)xn-m. (23)
m=— oo m = 0
An important spectral characteristic of a filter with the impulse response h
is its Fourier transform
(p(X)= £ e-amh{m\ (24)
m= —oo
known as the frequency characteristic or transfer function of the filter.
Let us now take up conditions, about which nothing has been said so far,
for the convergence of the series in (22) and (24). Let us suppose that the input
is a stationary random sequence £ = (£„), neZ, with covariance function
R(n) and spectral decomposition (2). Then if
£ h(k)R(k - l)h(l) < oo, (25)
fc./=-oo
§3. Spectral Representation of Stationary (Wide Sense) Sequences
435
the series £^= -mh(n — m)£m converges in mean-square and therefore there
is a stationary sequence r\ = (»/„) with
1n= I Kn-m)Zm= £ h(m)^_m. (26)
m oo m oo
In terms of the spectral measure, (25) is evidently equivalent to saying that
«p(A)eL2(F),i.e.
f \<P(W2F(dX) < oo. (27)
J — n
Under (25) or (27), we obtain the spectral representation
nn = f e'V)ZW (28)
of r\ from (26) and (2). Consequently the covariance function R^n) of r\
is given by the formula
*„(")= {* e»*\<p{X)\2F{dX). (29>
In particular, if the input to a filter with frequency characteristic <p = <p(A)
is taken to be white noise b =b (£„), the output will be a stationary sequence
(moving average)
1.= t Km)B„.m (30)
m= —oo
with spectral density
The following theorem shows that there is a sense in which every stationary
sequence with a spectral density is obtainable by means of a moving average.
Theorem 3. Let r\ = (»/„) be a stationary sequence with spectral density fn(X).
Then (possibly at the expense of enlarging the original probability space) we
can find a sequence e = (e„) representing white noise, and a filter, such that the
representation (30) holds.
Proof. For a given (nonnegative) function fn(X) we can find a function <p(X)
such that fn(X) = (l/2n)\<p(X)\2.Since $1nf^X)dX < oo,wehave«p(A)eL2(^),
where \i is Lebesgue measure on [—n, n). Hence <p can be represented as a
Fourier series (24) with h(m) = (l/2n)j1Ln eimX<p(X) dX, where convergence is
understood in the sense that
\<pW- £ e-amh(m)
I \m\£n
I2
<tt-0,
436
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
Let
nn= f* eiXnZ(dX), neZ.
J -n
Besides the measure Z = Z(A) we introduce another independent
orthogonal stochastic measure Z = Z(A) with E|Z(a, b]\2 = (b - a)/2n. (The
possibility of constructing such a measure depends, in general, on having a
sufficiently "rich" original probability space.) Let us put
Z(A) = f <p®{X)Z{dX) + f [1 - <p®(X)<p(Xy]Z(dX),
Ja Ja
where
fa"1, ifa#0,
[0, iffl = 0.
The stochastic measure Z = Z(A) is a measure with orthogonal values, and
for every A = (a, b] we have
E |Z(A)|2 = ^jjcp®(X)\2\<p(X)\2 dX + i- J 11 - <p®(X)<p(X)\2 dX = ^,
where |A| = b — a. Therefore the stationary sequence e = (e„), neZ, with
£„ = f" e'A"Z(<U),
J—n
is a white noise.
We now observe that
^ e"n<p{X)Z(dX) = ^ ein*Z(dX) = nn (31)
and, on the other hand, by property (2.14) (P-a.s.)
f* einX(p(X)Z(dX) = f* e>*n( £ e-'A,"/i(m))z(^)
J-n J-n \m=-co /
= I Km) f ^"-m»Z(« = £ /i(m)6n_m,
m=— oo ^-n m=—oo
which, together with (31), establishes the representation (30).
This completes the proof of the theorem.
Remark. If f£X) > 0 (almost everywhere with respect to Lebesgue measure),
the introduction of the auxiliary measure Z = Z(A) becomes unnecessary
(since then 1 — <p®(X)<p(X) = 0 almost everywhere with respect to Lebesgue
measure), and the reservation concerning the necessity of extending the
original probability space can be omitted.
§3. Spectral Representation of Stationary (Wide Sense) Sequences
437
Corollary 1. Let the spectral density fn(X) > 0 (almost everywhere with
respect to Lebesgue measure) and
MX) = ^\<PM\2:
where
(p(X) = Yde-akh(k)i I |/i(/c)|2 < oo.
fc=0 fc=0
Then the sequence rj admits a representation as a one-sided moving average,
00
fin = Z Km)e„-m.
m = 0
In particular, let P(z) = a0 + a^z H V apzp be a polynomial that has
no zeros on {z: \z\ = 1}. Then the sequence y\ = (r]„) with spectral density
UV = ^\P(e-")\2
can be represented in the form
In = a0e„ + flifin_! + ••• + ape„-p.
Corollary 2. Let £ = (£„) be a sequence with rational spectral density
m =
P(e~a)
\Q(e-a)\ '
where P(z) = a0 + axz + • • • + apzp, Q(z) = 1 + bxz + • • • + fc9z«
(32)
Let us show that if P(z) and Q{z) have no zeros on {z: \z\ = 1}, there is a
white noise e = e(n) such that (P-a.s.)
£, + b1^„-1 + ••• + fef^_f = a0en + «!£„_! + ••• + ape„_p. (33)
Conversely, every stationary sequence f = (fn) that satisfies this equation
with some white noise e = (e„) and some polynomial Q(z) with no zeros on
{z: |z| = 1} has a spectral density (32).
In fact, let *„ = £, + fr^-i + ■ ■ - + *,£,_,. Then #A) = (l/27r)|P(e-'A)|2
and the required representation follows from Corollary 1.
On the other hand, if (33) holds and F£X) and F„(A) are the spectral
functions of f and rj, then
F„(X) = j* \QLe-»)\2 dFjy) = i- j" |P(e-")|2 rfv.
Since \Q(e~iv)\2 > 0, it follows that F£X) has a density defined by (32).
438
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
4. The following mean-square ergodic theorem can be thought of as an
analog of the law of large numbers for stationary (wide sense) random
sequences.
Theorem4. Let £ = (£„), neZ, be a stationary sequence with E£n = 0,
covariance function (1), and spectral resolution (2). Then
and
Proof. By (2),
-I&^Z({0»
n fc = 0
- £R(/c)-+F({0}).
n fc=0
iz^- f J "z«**z(<tt)- f<p„a)z(dA),
where
It is clear that
Hfc=0
l%WI ^ i-
Moreover, <p„(A) ^2 /{0)(A) and therefore by (2.14)
f <pJLX)Z(dX) £ f* /(0)(A)Z(rfA) = Z({0}),
which establishes (34).
Relation (35) can be proved in a similar way.
This completes the proof of the theorem.
(34)
(35)
(36)
Corollary. If the spectral function is continuous at zero, i.e. F({0}) = 0, then
Z({0}) = 0 (P-a.s.) and by (34) and (35),
i n-l i n-l
wfc=0 "fc=0
Since
j n-l
«fc=0
I /i n-l \ 12
I \" fc = 0 / I
1 n-l
§3. Spectral Representation of Stationary (Wide Sense) Sequences
439
the converse implication also holds:
1 n-l 2 i n-1
-Z&-o.»-£j«k)-.a
"fc=0 "fc=0
Therefore the condition (l/«) Yi=h ^CO -► 0 is necessary and sufficient
for the convergence (in the mean-square sense) of the arithmetic means
(1/h) Yi=o £kto zero- It follows that if the original sequences £ = (£„) has
expectation m (that is, E£0 = m), then
ilVw-Ooi'l&^m (37)
where R(n) = E(£„ - E£„)(£0 - E£0).
Let us also observe that if Z({0}) # 0 (P-a.s.) and m = 0, then £„ "contains
a random constant a":
£n = tt + In,
where a = Z({0}); and in the spectral representation r\n = ^1LneiXnZn(dX)
the measure Z„ = Z„(A) is such that Z„({0}) = 0 (P-a.s.). Conclusion (34)
means that the arithmetic mean converges in mean-square to precisely this
random constant a.
5. Problems
1. Show that Lq(F) = L\F) (for the notation see the proof of Theorem 1).
2. Let ^ = (^„) be a stationary sequence with the property that %n+N = ^„ for some N
and all n. Show that the spectral representation of such a sequence reduces to (1.13).
3. Let ^ = (^„) be a stationary sequence such that E£„ = 0 and
±1 iw-o-J I Wi-!Jflscw-
iV fc=o / = 0 ™ |fc|<N-l L ^J
for some C > 0, a > 0. Use the Borel-Cantelli lemma to show that then
TF I&-0 (p-as>
iV fc=0
4. Let the spectral density f£X) of the sequence £ = (£,,) be rational,
1 IP- i(c"w)|
where P„_ i(z) = a0 + «iz + ^ an- i*"~l and QJz) = 1 + bxz + h fcnZ", and
all the zeros of these polynomials lie outside the unit disk.
Show that there is a white noise £ = (em), meZ, such that the sequence (^m) is a
component of an ^-dimensional sequence (¢^, ££,..., ¢£), £^ = £m, that satisfies
440 VI. Stationary (Wide Sense) Random Sequences. L2-Theory
the system of equations
«:*.--"i;Vi«i+i + A^*.. (39)
where pt = a0, ft = «,-_! -J]l-i &&*-*•
§4. Statistical Estimation of the Covariance Function
and the Spectral Density
1. Problems of the statistical estimation of various characteristics of the
probability distributions of random sequences arise in the most diverse
branches of science (geophysics, medicine, economics, etc.) The material
presented in this section will give the reader an idea of the concepts and
methods of estimation, and of the difficulties that are encountered.
To begin with, let £ = (£„), n e Z, be a sequence, stationary in the wide
sense (for simplicity, real) with expectation E£„ = m and covariance R(n) =
Let x0ixx xw_j be the results of observing the random variables
£0, £i,..., fw-i. How are we then to construct a "good" estimator of the
(unknown) mean value m?
Let us put
j N-\
^n(x) = — X xk- (1)
iV fc = o
Then it follows from the elementary properties of the expectation that this is
a "good" estimator of m in the sense that it is unbiased "in the mean over all
kinds of data x0 xN- x", i.e.
-E(Ks*)-
Emw«) = E - YtkUrn. (2)
In addition, it follows from Theorem 4 of §3 that when (1/N) ££L0 R(k) - 0,
N -*■ oo, our estimator is consistent (in mean-square), i.e.
E|m,v(f)-m|2-0, N - oo. (3)
Next we take up the problem of estimating the covariance function R(n\
the spectral function F(A) = F([-7r, X]\ and the spectral density f(A\ all
under the assumption that m = 0.
§4. Statistical Estimation of the Covariance Function and the Spectral Density
441
Since R(n) = Efn+fcffc, it is natural to estimate this function on the basis
of N observations x0, xx xN-1 (when 0 < n < N) by
i N-n-l
RN(n;x) = — £ xn+kxk.
iv — n k=0
It is clear that this estimator is unbiased m the sense that
ERN(n;0 = R(n\ 0<n<N.
Let us now consider the question of its consistency. If we replace £fc in
(3.37) by £n+k£k and suppose that the sequence £ = (£„) under consideration
has a fourth moment (E£j < oo), we find that the condition
-j; I EK„+fc& - *(")] LUo ~ RWl - 0, N -+ oo, (4)
iV fc = 0
is necessary and sufficient for
E\RN(n;0-R(n)\2^>0, N-+oo. (5)
Let us suppose that the original sequence £ = (£n) is Gaussian (with zero
mean and covariance R(n)). Then by (11.12.51)
EK»+*& - RinmUo - R(n)] = EZm+kZktmZo - *2(")
= E£n+fc£fc • E£„£0 + E£n+fc£„' E^fc^o
+ Etn+kt0-Etkt„-R2(n)
= R2(k) + R(n + k)R(n - k\
Therefore in the Gaussian case condition (4) is equivalent to
Tr Z [K2C0 + R(n + k)R(n - /c)] - 0, N - oo. (6)
Since |K(n + /c)jR(h - k)\ < \R(n + /c)|2 + \R(n - /c)|2, the condition
4 Z^)-0. W-°°» (7>
N fc=0
implies (6). Conversely, if (6) holds for n = 0, then (7) is satisfied.
We have now established the following theorem.
Theorem. Let £ = (£„) be a Gaussian stationary sequence with E£„ = 0 and
covariance function R(n). Then (7) is a necessary and sufficient condition that,
for every n > 0, the estimator RN(n; x) is mean-square consistent, (i.e. that
(5) is satisfied).
442
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
Remark. If we use the spectral representation of the covariance function,
we obtain
iV fc = 0 J-n J-n iV fc = 0
= f f Att v)F(dX)F{dv\
J—it J—it
where (compare (3.35))
II, A = v,
1 - ei(A-v)JV _ ,
N[l - ^"v>] ' X^V-
But as N -> oo
Therefore
^ Y^) - f f f(K v)F(<tt)F(rfv)
iv fc=0 j_„ j_„
= fF(W)F(^) = XF2(U}),
where the sum over X contains at most a countable number of terms since
the measure F is finite.
Hence (7) is equivalent to
XF2({A}) = 0, (8)
A
which means that the spectral function F(A) = F([—n, X]) is continuous.
2. We now turn to the problem of finding estimators for the spectral function
F(A) and the spectral density f{X) (under the assumption that they exist).
A method that naturally suggests itself for estimating the spectral density
follows from the proof of Herglotz' theorem that we gave earlier. Recall that
the function
^=^0-"h*~"" (9)
introduced in §1 has the property that the function
FM= ffN(y)dv
J —It
§4. Statistical Estimation of the Covariance Function and the Spectral Density 443
converges on the whole (Chapter III, §1) to the spectral function F(X).
Therefore if F(X) has a density f(X\ we have
f fN(y)dv^ f f(y)dv
J-n J-n
(10)
for each A e [—71,71).
Starting from these facts and recalling that an estimator for R(n) (on
the basis of the observations x0» *i» • • • i *w-i) *s ^w("J x)» we ta^e as an
estimator for f(X) the function
/^)=^(1-^(^-- (id
putting RN(n; x) = RN( \ n |; x) for | n \ < N.
The function fN(X\ x) is known as a periodogram. It is easily verified that
it can also be represented in the following more convenient form:
/w(*; *):
1
ItzN
Z v*
(12)
Since Etf^n; f) = R(n), |n| < JV, we have
If the spectral function F(A) has density f(X), then, since /W(A) can also be
written in the form (1.34), we find that
Z7T7V fc = 0 / = 0 J-n
£.
- t..div[?/-"'r'<">"-
The function
**M-Siv
I^fc
l
:27iN
sin-N
sin A/2
is the Fejer kernel. It is known, from the properties of this function, that for
almost every X (with respect to Lebesgue measure)
t
%^-v)f(v)dv^f(X).
(13)
Therefore for almost every Xe [—n, n)
EA(A;0-/(A); (14)
in other words, the estimator fN(X; x) off(X) on the basis of x0, Xj xw_ j
is asymptotically unbiased.
444
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
In this sense the estimator fN(X; x) can be considered to be "good."
However, at the individual observed values x0,...,xN-l the values of the
periodogram fN(X; x) usually turn out to be far from the actual values/(A).
In fact, let £ = (£„) be a stationary sequence of independent Gaussian random
variables, £„ - ^(0,1). Then f(X) = 1/2tt and
/*(«) = ^
i N-l
limE \fN(X;t)-f(X)\2 = {
Then at the point X = 0 we have /^(0, f) coinciding in distribution with the
square of the Gaussian random variable r\ ~ jV(Q, 1). Hence, for every N,
E|/*(0; £)-/(0)12 = ^E\n2 - 1|2 > 0.
Moreover, an easy calculation shows that if f(X) is the spectral density of a
stationary sequence £ = (£„) that is constructed as a moving average:
£„=!>**„-* (15)
fc = 0
with Yj?=o kfcl < °°> Zfc°=o kfcl2 < °°» where e = (e„) is white noise with
EeJ < °o, then
2/2(0), A = 0,±tt,
/2(AX A#0, ±tt. u°'
Hence it is clear that the periodogram cannot be a satisfactory estimator of
the spectral density. To improve the situation, one often uses an estimator for
f(X) of the form
7JW- f WMA - v)/„(v; x) dv, (17)
J—n
which is obtained from the periodogram fN{X\ x) and a smoothing function
W^(A), and which we call a spectral window. Natural requirements on
WN(X) are:
(a) WN(X) has a sharp maximum at X = 0;
(b) J-,Wfctt)<tt=:l;
(c) P|/S'a;£)-/(A)|2-0, JV->oo, A e [-71,71).
By (14) and (b) the estimators f™{X; £) are asymptotically unbiased.
Condition (c) is the condition of asymptotic consistency in mean-square, which,
as we showed above, is violated for the periodogram. Finally, condition (a)
ensures that the required frequency X is "picked out" from the periodogram.
Let us give some examples of estimators of the form (17).
Bartlett's estimator is based on the spectral window
WN(X) = aNB(aNX),
§4. Statistical Estimation of the Covariance Function and the Spectral Density 445
where aN | oo, aN/N -► 0, JV -► oo, and
B(X) =
sin(A/2)
X/2
Parzen's estimator takes the spectral window to be
WM = aNP(aNX\
where aN are the same as before and
P(X).
8tt
sin(A/4)
X/4
Zhurbenko's estimator is constructed from a spectral window of the form
WM = aNZ(aNX)
with
a + 1 a + 1
2a
|A|>1,
where 0 < a < 2 and the aN are selected in a particular way.
We shall not spend any more time on problems of estimating spectral
densities; we merely note that there is an extensive statistical literature
dealing with the construction of spectral windows and the comparison of
the corresponding estimators/^(A; x).
3. We now consider the problem of estimating the spectral function F(X) =
F([—71, A]). We begin by defining
FN(X) = J_ fN(v) dv, FN(X; x) = J_ /w(v; x) dv,
where f^v; x) is the periodogram constructed with (x0, x1 xN-i).
It follows from the proof of Herglotz' theorem (§1) that
f* eiAn dF„(X) - f* eiXn dF(X)
J —it J—n
for every neZ. Hence it follows (compare the corollary to Theorem 1, §3,
Chapter III) that FN => F, i.e. FN(X) converges to F(X) at each point of
continuity of F(X).
Observe that
J%* <#»<*; 0-^0(1-¾)
446
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
for all M < JV. Therefore if we suppose that RN(n; f) converges to R(n)
with probability one as N -» oo, we have
f ea" dFN{X\ f) - f* eUn dF(X) (P-a.s.)
J— n J—it
and therefore FN(X; f) => F(X) (P-a.s.).
It is then easy to deduce (if necessary, passing from a sequence to a
subsequence) that if RN(n; f) -► R(n) in probability, then PN(X; f) => F(X) in
probability.
4. Problems
1. In (15) let e„ ~ jr(p, l). Show that
(N - nWR^n, ¢) -+ 2n f (1 + e2inX)f2(X) dX
•'-it
for every «, as N -»• oo.
2. Establish (16) and the following generalization:
f 2/2(0), X = v = 0, ±7t,
lim cov(/N(A; ¢), A(v; ¢)) = | /2(A), A = v # 0, ±«,
N"" lo, A#±v.
§5. Wold's Expansion
1. In contrast to the representation (3.2) which gives an expansion of a
stationary sequence in the frequency domain, Wold's expansion operates
in the time domain. The main point of this expansion is that a stationary
sequence f = (£n), n e Z, can be represented as the sum of two stationary
sequences, one of which is completely predictable (in the sense that its
values are completely determined by its "past"), whereas the second does
not have this property.
We begin with some definitions. Let iJ„(f) = L2(f") and H(0 = L2(f)
be closed linear manifolds, spanned respectively by £" = (..., £n-u f„) and
£ = (•..£„_!,£„,...). Let
n
For every r\ e H(£\ denote by
Uri) = mHn\0)
the projection of rj on the subspace //„(£) (see §11, Chapter II). We also
write
*-Ja) = ms(0).
§5. Wold's Expansion
447
Every element r\ e H(£) can be represented as
r\ = ii-^) + (n- ft-Jfl)\
where y\ — n_ m(rj) ± H-Jljj). Therefore H(£) is represented as the orthogonal
sum
M = S(0©R(ft
where S(£) consists of the elements it-Jin) with y\ e H(£), and jR(£) consists
of the elements of the form r\ — it-Jin).
We shall now assume that E ft = 0 and Vft > 0. Then H(£) is automatically
nontrivial (contains elements different from zero).
Definition 1. A stationary sequence £ = (ft) is regular if
mo = R(&
and singular if
mo = s(&
Remark. Singular sequences are also called deterministic and regular
sequences are called purely or completely nondeterministic. If S(ft is a proper
subspace of H(0 we just say that £ is nondeterministic.
Theorem 1. Every stationary (wide sense) random sequence £ has a unique
decomposition
*. = c + e, (i)
w/zere ft = (££) is regular and ft = (ft) is singular. Here ft amd ft are
orthogonal (ft ± ft, /or a//« and m).
Proof. We define
Since ft ± S(ft, for every n, we have S(ft) ± S(ft. On the other hand, S(ft)
£ S(ft and therefore S(ft) is trivial (contains only random sequences that
coincide almost surely with zero). Consequently ft is regular.
Moreover, Hn(0 £ tf„(ft) 0 //n(ft) and tf„(ft) c tfn(ft, tf„(ft) c tf„(ft.
Therefore tf„(ft = H„(ft) 0 tf„(ft) and hence
S(^W©« (2)
for every n. Since ft ± S(ft it follows from (2) that
S(ft s tf„(ft),
448
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
and therefore S(f) <= S(<f) c tf(f*). But g <= S(£); hence tf(<f) <= S(0 and
consequently
S(0 = S(« = tf(«,
which means that f5 is singular.
The orthogonality of £s and fr follows in an obvious way from g e S(f)
and£±S(£).
Let us now show that (1) is unique. Let £„ = if„ + if„t where tf and 17s are
regular and singular orthogonal sequences. Then since H„(rf) = Hirj*), we
have
HJ® = Hn(rf) 0 H^O = tf „W) 0 tf (if),
and therefore S(f) = S(?/r) 0 tffos)- But S(rf) is trivial, and therefore
S(0 = fldfl.
^ Since i/JeflOrt-SK) and ifi ± tf(if) = S«), we have E(£JS(£)) =
E(»7^ + ^|S(<^)) = rfn, i.e. 77* coincides with g; this establishes the uniqueness
of(l).
This completes the proof of the theorem.
2. Definition 2. Let £ = (£„) be a nondegenerate stationary sequence. A
random sequence e = (e„) is an innovation sequence (for £) if
(a) e = (e„) consists of pairwise orthogonal random variables with Ee„ = 0,
E|e„|2 = l;
(b) tf„(£) = tfn(6)forallneZ.
Remark. The reason for the term "innovation" is that en+1 provides, so to
speak, new " information " not contained in tf „(£) (in other words," innovates "
in tfn(£) the information that is needed for forming tfn+1(£)).
The following fundamental theorem establishes a connection between
one-sided moving averages (Example 4, §1) and regular sequences.
Theorem 2. A necessary and sufficient condition for a nondegenerate sequence
£ to be regular is that there are an innovation sequence e = (e„) and a sequence
(a„) of complex numbers, n > 0, with Y,n=o M2 < °°> such that
00
fc. = I ake„-k (P-a.s.) (3)
fc = 0
Proof. Necessity. We represent tf „(<!;) in the form
Hn(t) = Hn.1(t)®Bn.
Since tfn(£) is spanned by elements of //„_!(£) and elements of the form
P • £„, where /? is a complex number, the dimension (dim) of B„ is either zero
or one. But the space tfn(£) cannot coincide with tf„_ ^£) for any value of n.
§5. Wold's Expansion
449
In fact, if B„ is trivial for some n, then by stationarity Bk is trivial for all /c,
and therefore H(Q = S(f), contradicting the assumption that f is regular.
Thus B„ has the dimension dim B„ = 1. Let rj„ be a nonzero element of fl„.
Put
E =-3=-
" foj'
where ||rjn\\2 = E\r]n\2 > 0.
For given n and /c > 0, consider the decomposition
hjlq = H„_k(o e Bn_k+, e • • • e b„.
Then £„_fc e„ is an orthogonal basis in B„_k+i® ••■ ® B„ and
fc-i
&. = I «j«^-j + *„_*(£„), (4)
J=0
where a,- = E£,5,_,.
By Bessel's inequality (II. 11.16)
I kjl2 < IIU2 < oo.
j=o
It follows that Yj=oai£n-j converges in mean square, and then, by (4),
equation (3) will be established as soon as we show that 7cn_fc(^„)^0,
/c-» oo.
It is enough to consider the case n = 0. Since
fc
7T_fc = 7T0 + Z tf-i ~ ^-£+ll
£=0
and the terms that appear in this sum are orthogonal, we have for every
k>0
£ ||7T_£ - 7T_£+1||2 = J t (*-«• - ft-£+l)ir
£ = 0 II £ = 0 ||
= ||7T_fc - TToll2 < 4||^ol|2 < 00.
Therefore the limit lim^^ 7c_fc exists (in mean square). Now 7r_fcef/_fc(<5;)
for each /c, and therefore the limit in question must belong to Hfc>o Hk(0 =
S(£). But, by assumption, S(Q is trivial, and therefore n_k ^ 0, k -► co.
Sufficiency. Let the nondegenerate sequence £ have a representation (3),
where e = (e„) is an orthonormal system (not necessarily satisfying the
condition H„(0 = Hn(el neZ). Then H„(0 = H„(e) and therefore S(f) =
Hfc Hk(0 c Hn(e) for every n. But sn+1 ± H„(e), and therefore e„+1 1 S(Q
and at the same time s = (s„) is a basis in H(£). It follows that S(f) is trivial,
and consequently £ is regular.
This completes the proof of the theorem.
450
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
Remark. It follows from the proof that a nondegenerate sequence £ is regular
if and only if it admits a representation as a one-sided moving average
00
L = I akE„-ki (5)
fc = 0
where e = e„ is an orthonormal system which (it is important to emphasize
this!) does not necessarily satisfy the condition H„(Q = H„(e% neZ.ln this
sense the conclusion of Theorem 2 says more, and specifically that for a
regular sequence £ there exist a = (a„) and an orthonormal system e = (¾)
such that not only (5), but also (3), is satisfied, with //„(£) = H„(e), neZ.
The following theorem is an immediate corollary of Theorems 1 and 2.
Theorem 3 (Wold's Expansion). If £ = (£„) is a nondegenerate stationary
sequence, then
£n = £+I>fc£n-fc> (6)
fc = 0
where Yj?=o \ak\2 < °° and £ = (e„) is an innovation sequence (for £r).
3. The significance of the concepts introduced here (regular and singular
sequences) becomes particularly clear if we consider the following (linear)
extrapolation problem, for whose solution the Wold expansion (6) is
especially useful.
Let H0(£) = L2(£°) be the closed linear manifold spanned by the variables
£° = (..., £_ u £0). Consider the problem of constructing an optimal (least-
squares) linear estimator ln of £„ in terms of the "past" £° = (..., £_ lf £0).
It follows from §11, Chapter II, that
L = Z(L\H0(0). (7)
(In the notation of Subsection 1, \n = n0(£n).) Since fr and £s are orthogonal
and H0(0 = H0(?) 0 tf0(f*), we obtain, by using (6),
£, = 6(fi + £|tf0(£)) = E(£|H0(£)) + E(£|H0(£))
= EorjtfoOH e HoOD) + E(£itf0or) e HoO
= E(^|H0(^)+E(^|Ho(^))
\fc=0
In (6), the sequence e = (e„) is an innovation sequence for £r = (¾) and
therefore H0(^) = H0(e). Therefore
In = fi + E( £ akEn-k\H0(E)) = e + £ akE„.k (8)
\fc=0 / fc=n
§5. Wold's Expansion
451
and the mean-square error of predicting £„ by f 0 = (..., f _ lf f0) is
v2n=Z\tn-L\2 = J:\ak\2. (9)
fc = 0
We can draw two important conclusions.
(a) If f is singular, then for every n > 1 the error (in the extrapolation) <72
is zero; in other words, we can predict £„ without error from its "past"
^ = (...,^-.,6,).
(b) If £ is regular, then <72 < <72+, and
lim al = £ |flfc|2. (10)
n-»oo fc = 0
Since
f|flfc|2 = E|^|2,
fe = 0
it follows from (10) and (9) that
l„ -» 0, « -»• oo;
i.e. as n increases, the prediction of £„ in terms of f0 = (..., f _ lf f0) becomes
trivial (reducing simply to E£n = 0).
4. Let us suppose that £ is a nondegenerate regular stationary sequence.
According to Theorem 2, every such sequence admits a representation as a
one-sided moving average
00
L= I>fcen_fc, (11)
fc = 0
where J]*=0 1^ I2 < °o and the orthonormal sequence e = (e„) has the
important property that
HJLQ = HM neZ. (12)
The representation (11) means (see Subsection 3, §3) that £„ can be
interpreted as the output signal of a physically realizable filter with impulse
response a = (afc), k > 0, when the input is s = (e„).
Like any sequence of two-sided moving averages, a regular sequence has
a spectral density f(X). But since a regular sequence admits a representation
as a one-sided moving average it is possible to obtain additional information
about properties of the spectral density.
In the first place, it is clear that
452
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
where
fc = 0 fc=0
Put
<D(z) = £ flfczfc. (14)
fc = 0
This function is analytic in the open domain \z\ < 1 and since Y*=o I^12
< oo it belongs to the Hardy class H2, the class of functions g = g{z\ analytic
in \z\ < 1, satisfying
sup ^- f \g(re>6)\2 d6 < oo. (15)
0<r<l ** J-n
In fact,
and
i- f mrJ6)\2d6=t M2r2
Ln J-n fc=0
sup I|flfc|2r2fc<X|flfc|2<oo.
0<r<l
It is shown in the theory of functions of a complex variable that the
boundary function <J>(e,A), — n < X < n, of <J>ef/2, not identically zero, has
the property that
f* \n\Q(e-ix)\dX> -oo. (16)
J—n
In our case
/a) = ^i^-£A)i2,
where O e H2. Therefore
In f(X) = -In 2tt + 2 In |Ofe-")|,
and consequently the spectral density f(X) of a regular process satisfies
f_Jnf(X)
dX>-oo. (17)
On the other hand, let the spectral density/(A) satisfy (17). It again follows
from the theory of functions of a complex variable that there is then a
function 0(z) = Yj?=o afczfc in the Hardy class H2 such that (almost
everywhere with respect to Lebesgue measure)
f(X) = ^l<Ke-")l2.
§6. Extrapolation, Interpolation and Filtering
453
Therefore if we put <p(X) = 0>(e~£A) we obtain
/W = ^k(A)|2,
where <p(X) is given by (13). Then it follows from the corollary to Theorem 3,
§3, that £ admits a representation as a one-sided moving average (11), where
e = (e„) is an orthonormal sequence. From this and from the Remark on
Theorem 2, it follows that £ is regular.
Thus we have the following theorem.
Theorem 4 (Kolmogorov). Let £ be a nondegenerate regular stationary
sequence. Then there is a spectral density f(X) such that
f* \nf(X)dX> -oo. (18)
J-n
In particular, f(X) > 0 (almost everywhere with respect to Lebesgue measure).
Conversely, if £ is a stationary sequence with a spectral density satisfying
(18), the sequence is regular.
5. Problems
1. Show that a stationary sequence with discrete spectrum (piecewise-constant spectral
function F(A)) is singular.
2. Let al = E|f„ - |„|2, |„ = E(fjtf0(0). Show that if a* = 0 for some n > 1, the
sequence is singular; if o* -* R(fi) as n -* oo, the sequence is regular.
3. Show that the stationary sequence ^ = (^„), ^„ = ein,p, where q> is a uniform random
variable on [0, 27t], is regular. Find the estimator |„ and the number c*, and show
that the nonlinear estimator
provides a correct estimate of f„ by the "past" f° = (..., ¢- lt f0)> i-e-
E||n-£n|2 = 0, «^1-
§6. Extrapolation, Interpolation and Filtering
1. Extrapolation. According to the preceding section, a singular sequence
admits an error-free prediction (extrapolation) of £„, n > 1, in terms of the
"past," £° = (..., £_i, £0)- Consequently it is reasonable, when considering
the problem of extrapolation for arbitrary stationary sequences, to begin
with the case of regular sequences.
454
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
According to Theorem 2 of §5, every regular sequence f = (fn) admits a
representation as a one-sided moving average,
00
£. = E ^n-fc (1)
fc = 0
with ££°=0 \ak\2 < oo and some innovation sequence e = (e„). It follows
from §5 that the representation (1) solves the problem of finding the optimal
(linear) estimator | = E(£n|H0(£)) since, by (5.8),
00
ln=Zflfcen-fc (2)
k = n
and
rf = E|&-*J2=lW (3)
fc = 0
However, this can be considered only as a theoretical solution, for the
following reasons.
The sequences that we consider are ordinarily not given to us by means
of their representations (1), but by their covariance functions R(ri) or the
spectral densities f(X) (which exist for regular sequences). Hence a solution
(2) can only be regarded as satisfactory if the coefficients ak are given in
terms of R(ri) or of/(A), and ek are given by their values ---¾.^¾.
Without discussing the problem in general, we consider only the special
case (of interest in applications) when the spectral density has the form
/W = ^We-tt)l2, (4)
where <b(z) = Yj?=o Kzk nas radius of convergence r > 1 and has no zeros
in \z\ < 1.
Let
fc, = f e»"Z(dX) (5)
be the spectral representation of £ = (£„), neZ.
Theorem 1. If the spectral density of £ has the density (4), then the optimal
(linear) estimator |„ of£„ in terms of£° = (..'., f_ lf f0) is given by
In = f 0MZ(dX)t (6)
J — it
where
*"w=e ^4 (7)
and
§6. Extrapolation, Interpolation and Filtering
455
Proof. According to the remark on Theorem 2 of §3, every variable
|„ e H0(0 admits a representation in the form
ln= [ <pn{X)Z{dX\ <p„eH0(n (8)
where H0(F) is the closed linear manifold spanned by the functions e„ = eix"
forn<0(F(X) = SLnf(y)dv).
Since
E|£, - |„|2 = e| J" (*"" - ^(A))Z(dA)|2
= f \eiXn-ipM\2f{X)dK
the proof that (6) is optimal reduces to proving that
inf f \eiXn - <pn(X)\2f(X) dX = f \eiXn - «A)|2/W dX. (9)
It follows from Hilbert-space theory (§11, Chapter II) that the optimal
function $n(X) (in the sense of (9)) is determined by the two conditions
(1) 0n(X)eHo(Fl (10)
(2) eiXn-0n(X)±Ho(F).
Since
eiXnQ>„(e-a) = eiXnlbne~iXn + bn+1e~iX{n+1) + -]6i/0(F)
and in a similar way l/<D(e~'A)e H0(F% the function $„(X) defined in (7)
belongs to H0(F). Therefore in proving that 0n(X) is optimal it is sufficient
to verify that, for every m > 0,
ean - 0n(X) ± eiXm,
i.e.
h,m = f [*"" ~ MV]e-iXmf(X) dX = 0, m > 0.
The following chain of equations shows that this is actually the case:
= _L f gW"—)^-") - On(e-ix)We~iX)dX
2te J_„
2n J_„ \fc=o / \i=o /
456
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
where the last equation follows because, for m > 0 and r > 1,
f" e~iAmeiAr dX = 0.
J — n
This completes the proof of the theorem.
Remark 1. Expanding 0„(X) in a Fourier series, we find that the predicted
value |„ of fn, n > 1, in terms of the past, f° = (..., f_ lf f0), is given by the
formula
In = C0^o + C-^-! + C_2£_2 + .-..
Remark 2. A typical example of a spectral density represented in the form
(4) is the rational function
where the polynomials P(z) = a0 + axz + ■■■ + apzp and Q(z) = 1 + bxz
+ • • • + bqZ9 have no zeros in {z: |z| <, 1}.
In fact, in this case it is enough to put O(z) = P(z)/Q(z). Then <D(z)
= £j£=o Qzfc and the radius of convergence of this series is greater than one.
Let us illustrate Theorem 1 with two examples.
Example 1. Let the spectral density be
/(A) = ^ (5 + 4 cos X).
The corresponding covariance function R(n) has the shape of a triangle with
R(0) = 5, R(± 1) = 2, R(n) = 0 for |n| > 2. (11)
Since this spectral density can be represented in the form
we may apply Theorem 1. We find easily that
¢^ = ^2 + 1-^ «*) = ° for n^2- (12>
Therefore |„ = 0 for all n > 2, i.e. the (linear) prediction of £„ in terms of
£° = (..., £_ j, £o) is trivial, which is not at all surprising if we observe that,
by (11), the correlation between £„ and any of £0> £-1, •.. is zero for n > 2.
P(e-lX)\2
Q(e-iX)\ '
§6. Extrapolation, Interpolation and Filtering
457
For n = 1 we find from (6) and (12) that
_i r l - (-1)' f
Z°° (~ 1) Cfc _1e i; ,
Ofc+1 ~~ 2C0 — 4C-1 + ■*■•
fc = 0 Z
Example 2. Let the covariance function be
R(n) = an, \a\<\.
Then (see Example 5 in §1)
/CD-:' ^1^
|fc fit
e-ik3LZ(dX)
2n\l -ae~iX\2i
1
where
/a) = ^|0(e-£A)|2,
0(z) = U |fl" = (1 - |flp)i#2 j; (azft
1 - «2 fc=o
from which cpn{X) = a" and therefore
1 = r fl"z(<u)=«0.
In other words, in order to predict the value of £„ from the observations
£° = (..., £_lt £0) it is sufficient to know only the last observation £0.
Remark 3. It follows from the Wold expansion of the regular sequence
£ = (£„) with
fc = 0
that the spectral density f(X) admits the representation
/W = ^l^-iA)l2, (14)
where
¢(4 = 1^. (is)
fc = 0
458
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
It is evident that the converse also holds, that is, if f(X) admits the
representation (14) with a function <D(z) of the form (15), then the Wold expansion of £„
has the form (13). Therefore the problem of representing the spectral density
in the form (14) and the problem of determining the coefficients ak in the
Wold expansion are equivalent.
The assumptions that <D(z) in Theorem 1 has no zeros for |z\ < 1 and that
r > 1 are in fact not essential. In other words, if the spectral density of a
regular sequence is represented in the form (14), then the optimal estimator
|„ (in the mean square sense) for £„ in terms of £° = (..., £_ i, £0) is
determined by formulas (6) and (7).
Remark 4. Theorem 1 (with the preceding remark) solves the prediction
problem for regular sequences. Let us show that in fact the same answer
remains valid for arbitrary stationary sequences. More precisely, let
-£
L = «1 + £, L = ^nZ(dX% F(A) = E |Z(A)|2,
and let fr(X) = (l/2n)\0(e~iX)\2 be the spectral density of the regular
sequence £r = (£r„). Then |„ is determined by (6) and (7).
In fact, let (see Subsection 3, §5)
L = f ww4 & = f mwidxi
where Zr(A) is the orthogonal stochastic measure in the representation of the
regular sequence £r. Then
E|£„ - L\2 = JV'An - ^)I2W
> J" WXn - 0n(X)\2frW dX > jn \eiXn - FM\2fr(X) dX
= E|£,-£|2. (16)
But £, - |„ = £ - ?. Hence E|£„ - |„|2 = E|« - &\\ and it follows
from (16) that we may take $n(X) to be $rn(X).
2. Interpolation. Suppose that £ = (£„) is a regular sequence with spectral
density f(X). The simplest interpolation problem is the problem of
constructing the optimal (mean-square) linear estimator from the results of the
measurements {^„, n = ±1, ±2,...} omitting f0.
Let H°(£) be the closed linear manifold spanned by fn, n # 0. Then
according to the results of Theorem 2, §3, every random variable r\ e H°(£)
can be represented in the form
I
<p{X)Z{dX\
§6. Extrapolation, Interpolation and Filtering
459
where <p belongs to H°(F), the closed linear manifold spanned by the
functions eiXn, n # 0. The estimator
f0 = f ffl}Z(dX) (17)
J-n
will be optimal if and only if
inf E|£0 - r\|2 = inf f |1 - <p(X)\2F(dX)
= f \l-<p(X)\2F(dX) = £\to-U2.
It follows from the perpendicularity properties of the Hilbert space
H°(F) that 0(A) is completely determined (compare (10)) by the two conditions
(1) <p(X)eH°(F)t (18)
(2) l-q>(X)±H°(F).
Theorem 2 (Kolmogorov). Let £ = (f „) be a regular sequence such that
Or* <19)
Then
m = l~wy (20)
where
2n
(21)
r —
and the interpolation error b2 = E |^0 - f o \2 is given by d2 = 2n • a.
Proof. We shall give the proof only under very stringent hypotheses on the
spectral density, specifically that
0 < c < f(X) < C < oo. (22)
It follows from (2) and (18) that
l
[1 - <p(V]einXf(X) dX = 0 (23)
for every n # 0. By (22), the function [1 - <p(X)]f(A) belongs to the Hilbert
space L2([—7i, 7i], ^[—7i, it], n) with Lebesgue measure p. In this space the
functions {einXly/2lzi n = 0, ± 1,...} form an orthonormal basis (Problem 7,
§11, Chapter II). Hence it follows from (23) that [1 - <p{X)]f{X) is a constant,
460
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
which we denote by a. Thus the second condition in (18) leads to the
conclusion that
™-l-m (24)
Starting from the first condition (18), we now determine a.
By (22), <p e L2 and the condition <p e H°(F) is equivalent to the
condition that <p belongs to the closed (in the L2 norm) linear manifold spanned
by the functions ea"t n # 0. Hence it is clear that the zeroth coefficient in the
expansion of q>(X) must be zero. Therefore
0 = f q>(X) dX = 2n - a f dX
and hence a is determined by (21).
Finally,
s2 = e 1¾ - ioi2 = J" n - m\2fmdx
~|a| Lm*~ r w
L
This completes the proof (under condition (22)).
Corollary. If
9W- I c,«°*.
0<|fc|<W
then
Co = I ck f" fi^Z(£tt) = X cfc£fc.
0<|fc|<W J-n 0<|fc|<W
Example 3. Let f(X) be the spectral density in Example 2 above. Then an
easy calculation shows that
and the interpolation error is
^ = 1^.
1 + 1« I2
3. Filtering. Let (0, f) = ((0n), (£„)), n e Z, be a partially observed sequence,
where 0 = (0„) and £ = (£n) are respectively the unobserved and the observed
components.
§6. Extrapolation, Interpolation and Filtering
461
Each of the sequences 0 and f will be supposed stationary (wide sense)
with zero mean; let the spectral densities be
0„ = f* <P*ZMX\ and £, = f e^Z^dX).
J-n J-n
We write
Fe(A) = E |Ze(A)|2, F£A) = E\Z4(A)\2
and
F^(A) = EZe(A)Z^(A).
In addition, we suppose that 0 and £ are connected in a stationary way, i.e.
that their covariance functions coy(0„, ^m) = E6nZm depend only on the
differences n — m. Let Re£n) = E0nfo; then
The filtering problem that we shall consider is the construction of the
optimal (mean-square) linear estimator 0„ of 0„ in terms of some observation
of the sequence £.
The problem is easily solved under the assumption that 0„ is to be
constructed from all the values £mi m e Z. In fact, since 6„ = E(0„|//(£)) there is
a function 0„(X) such that
6n = f MVZ^dX). (25)
J — n
As in Subsections 1 and 2, the conditions to impose on the optimal cpn(X)
are that
(1) Qtf)eH{F&
(2) 0„ -0„±tf(£).
From the latter condition we find
f J*-m)Fe£dX) - f e-iXm0n(X)F^dX) = 0 (26)
J-n J-n
for every m e Z. Therefore if we suppose that Fe^X) and F£X) have densities
/^(A) and f£X\ we find from (26) that
f <P*-*U*&) ~ e-fQMf&yidX = 0.
If f£X) > 0 (almost everywhere with respect to Lebesgue measure) we
find immediately that
MX) = ean0(Xl (27)
462
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
where
M) = fe&)-ftto
and /®(A) is the "pseudotransform" of f£X\ i.e.
fiW-\0, /^) = 0.
Then the filtering error is
E1¾ - Bn\2 = f UM - fiAQftiXf] M. (28)
As is easily verified, 0 e #(F5), and consequently the estimator (25), with
the function (27), is optimal.
Example 4. Detection of a signal in the presence of noise. Let £„ = 6n + r\ni
where the signal 0 = (6„) and the noise r\ = (*/„) are uncorrelated sequences
with spectral densities fe(X) and fn(X). Then
ft, = f e"n0(X)Zs(dX)t
where
and the filtering error is
E|^-^|2= f UMUXf]LAW + uxyf dX.
J — n
The solution (25) obtained above can now be used to construct an optimal
estimator 6n+m of 6„+m as a result of observing £kik < n, where m is a given
element of Z. Let us suppose that £ = (£„) is regular, with spectral density
/a) = ^|0>(e-£A)|2,
where O(z) = Yj?=o flfcz*- By the Wold expansion,
00
4 = Z flfcen-fc»
fc=0
where e = (ek) is white noise with the spectral resolution
e„ = f e^ZJidX).
Since
en+m = tiem+m\Hjioi = £\£ieH+m\H(&}\H,t£n = §[ft,+ji«©]
§6. Extrapolation, Interpolation and Filtering
463
and
where
then
Bn+m = f e^"+m>0(Xme-iX)ZE(dX) = X an+m-keki
ak = ^ f ei"k0(XMe-iX) dk (29)
^ = ^ I an+m-kek\Hn(o\
\jc£n + m J
But Hn{£) = Hn(e) and therefore
kzn J-n \_k<n J
= f f^t^e-'^ie-^Z^dkl
where O® is the pseudotransform of O.
We have therefore established the following theorem.
Theorem 3. If the sequence £ = (£„) under observation is regular, then the
optimal {mean-square) linear estimator 8n+m of 6n+m in terms of £fc, k < n,
is given by
Bn+m = f e^H^e-^Z^dXl (30)
J —71
where
Hm(e~iX) = £ al+me-iuQ>®(e-ix) (31)
and the coefficients ak are defined by (29).
4. Problems
1. Let f be a nondegenerate regular sequence with spectral density (4). Show that 0(z)
has no zeros for \z\ < 1.
2. Show that the conclusion of Theorem 1 remains valid even without the hypotheses
that 0(z) has radius of convergence r > 1 and that the zeros of 0(z) all lie in |z| > 1.
464
VI. Stationary (Wide Sense) Random Sequences. Z,2-Theory
3. Show that, for a regular process, the function <D(z) introduced in (4) can be
represented in the form
<D(z) = ^S expire + £ ckA \z\<h
where
ck = ±f e** In f(X)dX.
Deduce from this formula and (5.9) that the one-step prediction error a\ = E ||, — £x \2
is given by the Szego-Kolmogorov formula
tr? = 27rexp|^J't \nf(X)dx\.
4. Prove Theorem 2 without using (22).
5. Let a signal 0 and a noise rj, not correlated with each other, have spectral densities
Using Theorem 3, find an estimator Bn+m for 6n+m in terms of £k, k < «, where
£k = Ok + *lk- Consider the same problem for the spectral densities
feW = ^\2 + e-"\2 and fn(X) = i-.
§7. The Kalman-Bucy Filter and Its Generalizations
1. From a computational point of view, the solution presented above for
the problem of filtering out an unobservable component 6 by means of
observations of £ is not practical, since, because it is expressed in terms of the
spectrum, it has to be carried out by spectral methods. In the method proposed
by Kalman and Bucy, the synthesis of the optimal filter is carried out
recursively; this makes it possible to do it with a digital computer. There are
also other reasons for the wide use of the Kalman-Bucy filter, one being that
it still "works" even without the assumption that the sequence (0, £) is
stationary.
We shall present not only the usual Kalman-Bucy method, but also
a generalization in which the recurrent equations determined by (0, £)
have coefficients that depend on all the data observed in the past.
Thus, let us suppose that (0, f) = ((0n), (f„)) is a partially observed
sequence, and let
0» = (0i(«)..-..0*00) and £, = (^...,600)
be governed by the recurrent equations
0,,+ 1 = a0(n, 0 + fl,(«, f)0„ + ^(n, ^(n +1) + b2(n, £)e2(n + 1),
fc,+ 1 = A0(n, 0 + Afa £)0n + Bx{nt Oe,(n +1) + B2(n, £)e2(n + 1).
(1)
§7. The Kalman-Bucy Filter and Its Generalizations
465
Here
fiiOO = fci i(«) eu(n)) and s2(n) = (e21(h),..., e2l{n))
are independent Gaussian vectors with independent components, each of
which is normally distributed with parameters 0 and 1; a0(n, 0 = (a01(n, 0,
..., <*(*(«, 0) and A0(n, 0 = (A0l(n, 0,..., A0l(n, 0) are vector functions,
where the dependence on £ = (£0,..., £„) is determined without looking
ahead, i.e. for a given n the functions a0"(w» 0,--, ^o/(", £) depend only on
£o £„", the matrix functions
Wh. 0 = 11¾^ 011» «*. 0 = 11¾¾ 0II,
*i(n, 0 = l|5Sj>(«, 0|, fl2(n, 0 = l&fiKn, £)||,
fll(W, ft = ||flSj>(W, oil, ^(n, 0 = lAQXn, Oil
have orders k x /c, /c x /, / x /c, / x /, fc x /c, / x /c, respectively, and also
depend on £ without looking ahead. We also suppose that the initial vector
(#o> £o) is independent of the sequences ex = (e^ri)) and e2 = (£2(1))-
To simplify the presentation, we shall frequently not indicate the
dependence of the coefficients on £.
So that the system (1) will have a solution with finite second moments,
we assume that E(||0O||2 + ||£0||2) < °°
(ll*||2 = £*?, x = (Xl xfc)), \tf\n, 01 < C, \A$\n, 01 < C,
and if g(n, £) is any of the functions a0i, A0ji b%\ b[j\ Bty or B\f then
E|#(«, 012 < °o, n = 0,1,.... With these assumptions, (0,0 has
E(||0J|2 + ll£„||2) < oo, n > 0.
Now let &% = ¢7(0): £0,.•., £„} be the smallest cr-algebra generated by
£0,...,4 and
m„ = E(0„|J^), yn = E[(0„ - mn)(0„ - m„)*|J^].
According to Theorem 1, §8, Chapter II, m„ = (m^w),..., mk(n)) is an optimal
estimator (in the mean square sense) for the vector 6„ = (0i(«),..., 0fc(«)),
and Ey„ = E[(0„ - m„)(0„ - m„)*] is the matrix of errors of observation.
To determine these matrices for arbitrary sequences (0, 0 governed by
equations (1) is a very difficult problem. However, there is a further
supplementary condition on (0O, £0) that leads to a system of recurrent equations
for mn and ym that still contains the Kalman-Bucy filter. This is the condition
that the conditional distribution P(0O < a|£0|) ls Gaussian,
P(0o S a I &) = -4= f expf- (*~?o)2l dx, (2)
with parameters m0 = m0(£0), 7o = Vo(£o)-
To begin with, let us establish an important auxiliary result.
466
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
Lemma 1. Under the assumptions made above about the coefficients of (I),
together with (2), the sequence (0, £) is conditionally Gaussian, i.e. the
conditional distribution function
is (P-fl.s.) the distribution Junction of an n-dimensional Gaussian vector whose
mean and covariance matrix depend on (£o,..., £„).
Proof. We prove only the Gaussian character of P(0„ < a\^)\ this is
enough to let us obtain equations for mn and yn.
First we observe that (1) implies that the conditional distribution
is Gaussian with mean-value vector
and covariance matrix
/bob b»B\
\(b°B)* BoB/'
where bo b = bxb% + b2b%, b°B = M* + b2B%, B<>B = B^ + B2B%.
Let C = (0n, £„) and t = (tl9.... tk+l). Then
E[exp(if*C„+i)l^, OJ = exp{if*(A0(n, 0 + Arfu Q6J - i**B(n, {)*}.
(3)
Suppose now that the conclusion of the lemma holds for some n > 0. Then
E[exp(if Arfw, £)0n)|^] = expOW^Oi, &mn - if^n, Oym ARn, 0)t.
(4)
Let us show that (4) is also valid when n is replaced by n + 1.
From (3) and (4), we have
E[exp(ir*C„+i)l^] = exp{if*(A0(n, Q + Arfn, £)m„)
-it*B(ii, Ot - i^Arfn, Oy,,AKii, OW.
Hence the conditional distribution
P(en+1<a,tn+l<zx\3?t) (5)
is Gaussian.
As in the proof of the theorem on normal correlation (Theorem 2, §13,
Chapter II) we can verify that there is a matrix C such that the vector
r\ = [0,,+ 1 - E(0n+1|J^)] - CK,+ 1 - E(4+1|^)]
has the property that (P-a.s.)
edj«Ui - E«„+ii^>mfl = o.
§7. The Kalman-Bucy Filter and Its Generalizations
467
It follows that the conditionally-Gaussian vectors rj and fn+1, considered
under the condition ^, are independent, i.e.
PfyeA, £,+ 1 eB\**> = P(neA\**). P(£„+1 eB\P*>
fora\\Ae@(Rk%Be@(Rl).
Therefore if s = (s1 s„) then
E[exrffe*0B+1)l^5,fc,+ i]
= E{exp(is*[E(0n+i 1-^) + n + C[£n+1 - E«B+1 I^IDI^J, £„+1}
= cxp{ls*DEft+1|^9 + C[£n+1 - E«B+1|^S)]}
xE[exp(is^)|^,4+1]
= exp{is*[E(0n+1|J^)] + CR,;, - E(£n+1|J^)]}
x E(exp(fe*i7)|^a. (6)
By (5), the conditional distribution P(rj < y\^%) is Gaussian. With (6),
this shows that the conditional distribution P(0n+1 <a\^+l) is also
Gaussian.
This completes the proof of the lemma.
Theorem 1. Let (0, £) be a partial observation of a sequence that satisfies the
system (1) and condition (2). Then (m„ty„) obey the following recursion
relations:
mn+1 = [fl0 + fltmj + \boB + aiyuA?\\BoB + A^AfT
x K«+i -Ao-A^, (7)
yn+1 = [fliy.flT + b°b] - \b °B + aiy„AnLBoB+ A^Aft*
x U>oB + aiyHAf]*. (8)
Proof. From (1),
E(6U 11 ^) = a0 + fllmn, E(£n+11 &*) = A0 + >l1mn (9)
and
0„+i - E(0n+1|J^) = fliW, - mj + Mi(n + D + M2(" + 1).
£„+1 - E(£n+1|J^) = Ax\fiu - mj + Biei(ii + 1) + B2e2(n + 1).
(10)
Let us write
dn=cov(0n+1,0n+1|.^)
= E{[0„+1 - E(0n+1|^)][0n+1 - E{0H+l\P*Q*/*lh
d12=cov(0n+1,^+1|^«)
= E{[0n+1 - E[0l+1|^]Kl+i - E(t+1|^£]*/jF<},
d22=cov(£n+1,£n+1|J^)
= E(K„+1 - E(£n+1|^)][£n+1 - E&+1|^]*/^9.
468
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
Then, by (10),
du = aiynaX + bob, d12 = axynA*x + b°B, d22 = Aiy„A^ + B °fl.
(11)
By the theorem on normal correlation (see Theorem 2 and Problem 4,
§13, Chapter II),
mn+1 = E(0n+1|J^,£n+1) = E(0n+1|^) + dl2df2(tn+1 - E(£n+1|^))
and
yn+1 =cov(en_lten+^ltn+1)= d^ - d12df2d*l2-
If we then use the expressions from (9) for E(0n+1|J*-;) and E(fn+1|J^-;)
and those for dn, d12t d22 from (11), we obtain the required recursion
formulas (7) and (8).
This completes the proof of the theorem.
Corollary 1. If the coefficients a0(nt £) B2(n, £) in (I) are independent of
£ the corresponding method is known as the Kalman-Bucy method, and
equations (7) and (8)/or mn and yn describe the Kalman-Bucy filter. It is important
to observe that in this case the conditional and unconditional error matrices
y„ agree, i.e.
yn = Ey„ = E[(0„ - mn)(6n - m„)*].
Corollary 2. Suppose that a partially observed sequence (0„, £„) has the
property that 6n satisfies the first equation (1), and that £„ satisfies the equation
ft, = A0(n -UO + A,(n - 1, £)0„
+ B,{n - 1, Ob^h) + B2(n - 1, £)e2(n). (12)
Then evidently
£„+i = Ao(", t) + Ax{n, t)£a0(n, ¢) + aM &6n
+ bM Oe^n +1) + b2(nt £)e2{n+ 1)] + B^n, Qe^n + 1)
+ B2(n, £)e2{n + 1),
and with the notation
A0 = A0 + /4^0, Ax = A^a^
Bl = AJ>X + Blt B2 = Axb2 + B2i
we find that the case under consideration also depends on the model (1), and
that m„ and yn satisfy (7) and (8).
2. We now consider a linear model (compare (1))
0„+i = a0 + fli0„ + a2f„ + b^in +1) + b2e2(n + 1),
fc,+ i = A0 + A,en + A2£H + B^n +1) + B2e2(n + 1), (13)
§7. The Kalman-Bucy Filter and Its Generalizations
469
where the coefficients a0 B„ may depend on n (but not on £), and e£/h)
are independent Gaussian random variables with Ee£/h) = 0 and EagOO = 1.
Let (13) be solved for the initial values (0O, £0) so that the conditional
distribution P(0O < a\£0) is Gaussian with parameters m0 = E(0O, £0) and
y = cov(0o, 0ol£o) = Ey0. Then, by the theorem on normal correlation and
(7) and (8), the optimal estimator m„ = E(6„\^f,) is a linear function of
«0.*I fc-
This remark makes it possible to prove the following important statement
about the structure of the optimal linear filter without the assumption that
it is Gaussian.
Theorem 2. Let (0, £) = (0„, £„)„>0 be a partially observed sequence that
satisfies (13), where Ej/h) are uncorrected random variables with Ee£/«) = 0,
Efiy(w) = 1» and tne components of the initial vector (0O, £0) have finite second
moments. Then the optimal linear estimator m„ = E(0n\£0 £„) satisfies
(7) with a0(n, f) = a0(n) + a2(n)£nt A0(n, f) = A0(n) + A2(n)£ni and the
error matrix y„ = E[(0„ — 6m)(6n — m„)*] satisfies (8) with initial values
m0 = cov(0o, f0)cov®(f0> f o) • f o,
(14)
y0 = cov(0o, 0O) ~ cov(0o, fo)cov®(f 0, fo)cov*(0o, f <>)•
For the proof of this lemma, we need the following lemma, which reveals
the role of the Gaussian case in determining optimal linear estimators.
Lemma 2. Let(at /?) be a two-dimensional random vector with E(oc2 + /?2) < oo,
a(6c, /¾ a two-dimensional Gaussian vector with the same first and second
moments as (a, /?), Le.
Ea£ = Ea£, E/?£ = E/?£, i = 1, 2; Eaj? = Ea/J.
Lef A(b) be a linear function ofb such that
X(b) = W\P = b).
Then X{f$) is the optimal (in the mean square sense) linear estimator of a in
terms of'/?, i.e.
§(a|j8) = A(J0.
/fere EA(/?) = Ea.
Proof. We first observe that the existence of a linear function X(b) coinciding
with E(a|/? = b) follows from the theorem on normal correlation. Moreover,
let 1(b) be any other linear estimator. Then
e[« - KM2 > E[« - KM2
470
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
and since 1(b) and Mb) are linear and the hypotheses of the lemma are
satisfied, we have
E[a - X(/?)]2 = E[a - 1(¾]2 > E[fi - KM = E[« - W)]2,
which shows that MP) is optimal in the class of linear estimators. Finally,
Em = E*(ft = E[E(a|?)] = Ea = Ea.
This completes the proof of the lemma.
Proof of Theorem 2. We consider, besides (13), the system
0,.+ 1 = a0 + aA + a2ln + Mu(« +1) + b2el2(n + 1),
ln+l = A0 + Afiu + A2\n + B^in +1) + B2s22(n + 1),
where eu(h) are independent Gaussian random variables with Ee£/h) = 0
and Eefj(ri) = 1. Let (0O, J0) also be a Gaussian vector which has the same
first moment and covariance as (60, £0) and is independent of £,/«). Then
since (15) is linear, the vector (0O,..., 0„, J0 {„) is Gaussian and
therefore the conclusion of the theorem follows from Lemma 2 (more precisely,
from its multidimensional analog) and the theorem on normal covariance.
This completes the proof of the theorem.
3. Let us consider some illustrations of Theorems 1 and 2.
Example 1. Let 6 = (6„) and rj = (rj„) be two stationary (wide sense)
uncorrected random sequences with E0„ = Erjn = 0 and spectral densities
2tt| 1 + bxe-a\2 J+ ' 2tt 11 + b2e~a\2i
where^| < l,|b2| < 1.
We are going to interpret 6 as a useful signal and rj as noise, and suppose
that observation produces a sequence £ = (£„) with
ft, = On + 17».
According to Corollary 2 to Theorem 3 of §3 there are (mutually
uncorrected) white noises ex = (e^n)) and s2 = (e2(ri)) such that
0„+i + bxen = e^n + 1), r]n+l + b2T]„ = s2(n + 1).
Then
fc,+ i = 0n+i + tf.,+1 = ~M„ - M„ + ^i(« +1) + e2(n + 1)
= -b2(0„ + *7„) - 0n(fci - b2) + ei(« +1) + e2(n + 1)
= -½ - (*>i - h2^n + *i{n +1) + E2(n + 1).
§7. The Kalman-Bucy Filter and Its Generalizations
471
Hence 6 and f satisfy the recursion relations
0„+i= -Mn + e^ + 1),
(16)
fc.+i = -0>i - b2)6„ - b2£„ + e^n + 1) + s2(n + 1),
and, according to Theorem 2, m„ = E(0n|fo f„) and 7n = E(#n - ™n)2
satisfy the following system of recursion equations for optimal linear filtering:
mn+1 = -b1m„ + ,^,1 _l)Y" KB+1 + (^ - b2)m„ + fc2£J,
2 + ¢1 - W2?.
(17)
Let us find the initial conditions under which we should solve this system.
Write du = E0„2, d12 = E0„fn, d22 = Ef2. Then we find from (16) that
dn = b\dlx + 1,
d12 = b1(bl - b2)dlx + bxb2dl2 + 1,
d22 = (b, - b2fdlx + b\d22 + 2b2{b, - b2)d12 + 2,
from which
dll_r^fcf d'2-FTfcf d" -(i- 6j)(i - biy
which, by (14), leads to the following initial values:
2io~2-b]
1 1 - fcf 1
""•"^n*0- 2-fcf-fc^0-
70 = ^11-^- =
^22 1-½ (1 - b2)(2 - b2 - bi) 2-b\-b\ (18)
Thus the optimal (in the least squares sense) linear estimators mn for the
signal 0„ in terms of £o £„ and the mean-square error are determined
by the system of recurrent equations (17), solved under the initial conditions
(18). Observe that the equation for y„ does not contain any random
components, and consequently the number yn, which is needed for finding m„,
can be calculated in advance, before the filtering problem has been solved.
Example 2. This example is instructive because it shows that the result of
Theorem 2 can be applied to find the optimal linear filter in a case where the
sequence (0, £) is described by a (nonlinear) system which is different from
(13).
472
VI. Stationary (Wide Sense) Random Sequences. L2-Theory
Let e1 = (£i(«)) and s2 = (s2(n)) be two independent Gaussian sequences
of independent random variables with Ee£(h) = 0 and Ee?(h) = 1, n > 1.
Consider a pair of sequences (0, £) = (0„, £„), n > 0, with
^+^^ + (1+^)8^+1),
Zn+l = A6n + e2(n+ 1).
We shall suppose that 60 is independent of (elt e2) and that 60 ~ ^*(m0, 7o)-
The system (19) is nonlinear, and Theorem 2 is not immediately applicable.
However, if we put
,«0,+,)-^rft?^+,x
we can observe that Ee^h) = 0, E^O^E^m) = 0, « # m, EeJXh) = 1. Hence
we have reduced (19) to a linear system
0n+1 = aien + b1e1(n+ 1),
4+i = ^i0„ + £2(h+ 1),
where bx = yjE{\ + 0„)2, and {^(w)} is a sequence of uncorrected random
variables.
Now (20) is a linear system of the same type as (13), and consequently
the optimal linear estimator mn = E(0n|£o £„) and its error yn can be
determined from (7) and (8) via Theorem 2, applied in the following form in
the present case:
mn+1 = almn + { + [fn+1 - AlmJ,
where bx = yjE(l + 6n)2 must be found from the first equation in (19).
Example 3. Estimators for parameters. Let 6 = (6lt..., 6k) be a Gaussian
vector with E6 = m and cov(0, 0) = y. Suppose that (with known m and v)
we want the optimal estimator of 6 in terms of observations on an
/-dimensional sequence £ = (£„), n > 0, with
4+1 = ^o(* 0 + Mn, 06 + JJ^ii, Qe^n + 1), & = 0, (21)
where ex is as in (1).
Then from (7) and (8), with m„ = E(0|^) and yn, we find that
mn+1 =mn + y„A*(n, OL(BlBf)(n9 0 + ^(n, £)M«n, £)]e
x K„+i - ^o("» f) - Ai("> &nj,
y»+i = r» - y^Kn BR^UDOi. £) + i4,(ii, ©Mftn, QfA^n, Qyn
(22)
§7. The Kalman-Bucy Filter and Its Generalizations
473
If the matrices BXB^ are nonsingular, the solution of (22) is given by
mn+1 = h + y | Aftm, £)(5^)-'(m, &4?(m, o]
x L + y J] XKm, ©(B^rr^m, 0«M+1 - X0(m, 0)1,
7n+i = h + y I XJdn, OO^fr^m, ©^(m, ©1 V (23)
where E is a unit matrix.
4. Problems
1. Show that the vectors mn and 0„ — iw„ in (1) are uncorrected:
E[m*(6 - mj] = 0.
2. In (1), let y and the coefficients other than a0(«, ¢) and A0(nt £) be independent of
"chance" (i.e. of £). Show that then the conditional covariance yn is independent of
"chance": yn = Ey„.
3. Show that the solution of (22) is given by (23).
4. Let(0, £) = (0„, £„) be a Gaussian sequence satisfying the following special case of (1):
9n+ !=a9n + bex{n + 1), £n+ l = A6n + Be2(n + 1).
Show that if A ^ 0, b ^ 0, B ^ 0, the limiting error of filtering, y = lim,,..^ y„,
exists and is determined as the positive root of the equation
r +
F^-4-?=*
CHAPTER VII
Sequences of Random Variables
That Form Martingales
§1. Definitions of Martingales and Related Concepts
1. The study of the dependence of random variables arises in various ways
in probability theory. In the theory of stationary (wide sense) random
sequences, the basic indicator of dependence is the covariance function,
and the inferences made in this theory are determined by the properties of
that function. In the theory of Markov chains (§12 of Chapter I; Chapter
VIII) the basic dependence is supplied by the transition function, which
completely determines the development of the random variables involved
in Markov dependence.
In the present chapter (see also §11, Chapter I), we single out a rather wide
class of sequences of random variables (martingales and their
generalizations) for which dependence can be studied by methods based on a discussion
of the properties of conditional expectations.
2. Let (Q, !F, P) be a given probability space, and let (^) be a family of
<7-algebras J*, n > 0, such that J^ c ^ c ... c J*.
Let X0, Xu ... be a sequence of random variables defined on (Q, 3F, P).
If, for each n > 0, the variable Xn is ^-measurable, we say that- the set X =
(X„, ^), n > 0, or simply X = (Xn, &\), is a stochastic sequence.
If a stochastic sequence X = (Xn, &\) has the property that, for each
n > 1, the variable Xn is ^_ ^measurable, we write X■= (Xn, ^,-i), taking
F-i = F0, and call X a predictable sequence. We call such a sequence
increasing ifX0 = 0 and Xn < Xn+l (P-a.s.).
Definition 1. A stochastic sequence X = (X„, ^„) is a martingale, or a sub-
martingale, if, for all n > 0,
E|X„|<oo (1)
§1. Definitions of Martingales and Related Concepts
475
and, respectively,
E(*„+11&n) = Xn (P-a.s.) (martingale)
or (2)
E(*n+1|^) > X„ (P-a.s.) (submartingale).
A stochastic sequence X = {Xn&\) is a supermartingale if the sequence
— X = (—Xni ^) is a submartingale.
In the special case when &n = ^*t where &% = a{co:X0,..., Xn}, and
the stochastic sequence X = (X„, ^) is a martingale (or submartingale),
we say that the sequence (Xn)n^.0 itself is a martingale (or submartingale).
It is easy to deduce from the properties of conditional expectations that
(2) is equivalent to the property that, for every n > 0 and A e ^,,
f XH+ldP = f XmdP
f Xn+ldP> f XndP.
Ja Ja
(3)
Example 1. If (£n)n2.0 is a sequence of independent random variables with
Ef„ = 0 and Xn = & + ••• + £.. ^n = o{co:£0 £,}, the stochastic
sequence X = {Xnt &\) is a martingale.
Example 2. If (£„)„> 0 is a sequence of independent random variables with
Efn = 1, the stochastic sequence (Xni ^) with X„ = Y[k=o £k> ^,=
g{w. £0 £„} is also a martingale.
Example 3. Let £ be a random variable with E | £| < oo and
^¾ £ ^ c ... c ^-.
Then the sequence X = (Xni &0 with Ar„ = E(^|^ is a martingale.
Example 4. If (£„)*> o is a sequence of nonnegative integrable random
variables, the sequence (Xn) with Xn = f0 + h f„ is a submartingale.
Example 5. If X = (Xni &% is a martingale and g(x) is convex downward
with E \g{X^\ < oo, n > 0, then the stochastic sequence (g(X„), &\) is a
submartingale (as follows from Jensen's inequality).
If X = (Xnt $\) is a submartingale and g(x) is convex downward and
nondecreasing, with E \g(X„)\ < oo for all n > 0, then (g(Xn\ ^) is also a
submartingale.
Assumption (1) in Definition 1 ensures the existence of the conditional
expectations E(Xn+1 \^„)t n > 0. However, these expectations can also exist
without the assumption that E|Xn+11 < oo. Recall that by §7 of Chapter
476
VII. Sequences of Random Variables That Form Martingalt
II, E(X++1\&$ and E(X~+1\^) are always defined. Let us write A = B
(P-a.s.) when P(A A fl) = 0. Then if
{o>: E(X;+11¾ < 00} u {or. E(X" W < 00} = Q (P-a.s.)
we say that E(Xn+11 J^J) is also defined and is given by
EfciW = E(Xn++1|JD - E(X-+1|JD.
After this, the following definition is natural.
Definition 2. A stochastic sequence X = (X„, &\) is a generalized martingale
(or submartingale) if the conditional expectations E(Xn+1|^D are defined
for every n > 0 and (2) is satisfied.
Notice that it follows from this definition that E(In"+1|^) < 00 for a
generalized submartingale, and the E(\Xn+l\\£r„) < 00 (P-a.s.) for a
generalized martingale.
3. In the following definition we introduce the concept of a Markov time,
which plays a very important role in the subsequent theory.
Definition 3. A random variable t = t(co) with values in the set {0,1 + 00 }
is a Markov time (with respect to (¾) (or a random variable independent of
the future) if, for each n > 0,
{r = n}e&n. (4)
When P(t < 00) = 1, a Markov time t is called a stopping time.
Let X = (Xnt &\) be a stochastic sequence and let t be a Markov time
(with respect to (Jj|)). We write
00
n = 0
(hence Xx = 0 on the set {co: t = 00}).
Then for every B e ^(R),
{co:XteB}= f {I„6B,T = n}e^
n = 0
and consequently Xx is a random variable.
Example 6. Let X = (Xni&\) be a stochastic sequence and let Be^(jR).
Then the time of first hitting the set B, that is,
TB = inf{n>0:A:neB}
§1. Definitions of Martingales and Related Concepts
477
(with xB= + oo if {•} = 0) is a Markov time, since
{xB = n) = {X0iB Xn_^BiXneB}e3?
for every n > 0.
Example7. Let X = {Xni JQ be a martingale (or submartingale) and t a
Markov time (with respect to (¾). Then the "stopped" process Xx =
(I„AI1 &\) is also a martingale (or submartingale).
In fact, the equation
n-l
^hm = Lt XmIix = m) + XnIix>n]
m = 0
implies that the variables Ar„At are ^-measurable, are integrable, and satisfy
A(b+1)ai ~~ ^nAt = '{t>n}(-^n+l — -^n)»
whence
E[X(n+1)At - Xn„m = I{x>n)EJLXn+1 - Xm\#] = 0 (or > 0).
Every system (J*|) and Markov time t corresponding to it generate a
collection of sets
&\ = {A e &: A n {t = n} e Wn for all n > 0}.
It is clear that Qe^and $\ is closed under countable unions. Moreover, if
Ae&r, then ^n{t=w} = {t=h}\(^n{T=w})e^n and therefore Ae$\.
Hence it follows that &\ is a <r-algebra.
If we think of ^ as a collection of events observed up to time n (inclusive),
then &\ can be thought of as a collection of events observed at the "random"
time t.
It is easy to show (Problem 3) that the random variables t and Xx are
^-measurable.
4. Definition 4. A stochastic sequence X = (Xn, &\) is a local martingale
(or submartingale) if there is a (localizing) sequence (xk)k a j of Markov times
such that Tfc < tfc+1 (P-a.s.), xk | oo (P-a.s.) as fc -» oo, and every "stopped"
sequence XXk = (1^^ • /{rfc>o}> «^) is a martingale (or submartingale).
In Theorem 1 below, we show that in fact the class of local martingales
coincides with the class of generalized martingales. Moreover, every local
martingale can be obtained by a "martingale transformation" from a
martingale and a predictable sequence.
478
VII. Sequences of Random Variables That Form Martingales
Definition 5. Let Y= (Ynt &\) be a stochastic sequence and let V = (Vn,^,,-1)
be a predictable sequence (^_! = ^). The stochastic sequence V • Y =
«K-7),,,¾ with
(V-Y)„=V0Y0+ £ ^A7£, (5)
£= 1
where AY{ = Yt - ^_i, is called the transform ofYbyV. If, in addition, yis
a martingale, we say that V- Yis a martingale transform.
Theorem 1. Let X = (Xni &$,)„>0 be a stochastic sequence and let X0 = 0
(P-a.s.). The following conditions are equivalent:
(a) X is a local martingale;
(b) X is a generalized martingale;
(c) X is a martingale transform^ i.e.t there are a predictable sequence V =
(1^,, J^_i) with V0 = 0 and a martingale Y=(Yni^„) with Y0 = 0 such
thatX = VY
Proof, (a) => (b). Let A~ be a local martingale and let (rfc) be a local sequence
of Markov times for X. Then for every m > 0
ED***JWoil <°°> (6)
and therefore
E[|^(„+i,AtJ/{tfc>n)] = E[|A-n+1|/{tfc>n)] <a). (7)
The random variable /{tfc>n) is ^-measurable. Hence it follows from (7) that
E[|A-n+1|/{tfc>n)|^] = /{tfc>n)E[|An+1||^] <oo (P-a.s.).
Here /{tfc>n) -»• 1 (P-a.s.), k -► oo, and therefore
E[|A-n+1||J£|<oo (P-a.s.). (8)
Under this condition, E[An+1|^] is defined, and it remains only to show
thatE[An+1|^] = A„(P-a.s.).
To do this, we need to show that
j Xn+1dP= j XndP
for Ae J*. By Problem 7, §7, Chapter II, we have E[|A"n+1|J*] <oo (P-a.s.)
if and only if the measure $A \X„+l \ dPt A e &m is ff-finite. Let us show that
the measure \A \Xn\ dPt A e &nt is also ff-finite.
Since XXk is a martingale, \XXk\ = (|ArtfcAn|/{tfc>0}, &n) is a submartingale,
and therefore (since {rk >n}e &n)
479
f \X„\dP=\ |X„A,J/{t|i>0,dP
JAn{tk>n} JAn{xk>n}
<:[ |jr(„+1)Atj/{t„>0}dP=f \xn+l\dP.
JAn{xk>n} JAn{xk>n}
Letting k -* oo, we have
J* \X„\dP<j \Xn+l\dPt
from which there follows the required <7-finiteness of the measures \A \Xn\ dPt
Ae3?n.
Let A e ^n have the property \A \Xn+l \ d? < oo. Then, by Lebesgue's
theorem on dominated convergence, we may take limits in the relation
r xndp=\ xn+1dp,
JAn{xk>n} JAn{xk>n]
which is valid since X is a local martingale. Therefore,
j^XndP = ^Xn+1dP
for all A e &„ such that $A \Xn+1 \ dP < oo. It then follows that the preceding
relation also holds for every A e &ni and therefore, E(Xn+l \^n) = Xn (P-a.s.).
b)=*c). Let AXn = Xn- *„_!, X0 = 0, and V0 = 0,Vn= EClAXJl^-J,
n ^ 1. We set
and Yn = £?=1 W^AX£, n ;> 1. It is clear that
EOAFJl^J < 1, ECAYJJVJ = 0,
and consequently, Y = (Ynt &„) is a martingale.
Consequently, Y = (Y„, ^,) is a martingale. Moreover, X0 = V0 • Y0 = 0
and A(V- Y)n = AXn. Therefore
X= V-Y.
(c) => (a). Let X = V- Y where Kis a predictable sequence, Yis a
martingale and V0= Y0 = 0. Put
Tfc = inf{/i^0:|Kn+1|>/c},
and suppose that rk = oo if the set {•} = 0. Since Vn+i is ^-measurable,
the variables xk are Markov times for every k > 1.
Consider a "stopped" sequence XXk = ((V■ Y)n/,XkIiZk>0)t&\,). On the
set {tfc > 0}, the inequality |K,AtJ < k is in effect. Hence it follows that
E|(^ • ^)nAtJ{tk>o}l < °° for every w ^ 1. In addition, for n > 1,
480
VII. Sequences of Random Variables That Form Martingales
E{[(K- Y\n
since (see Example 7) E{^n+1)Atfc - YnAXk\&„} = 0.
Thus for every k>\ the stochastic sequences XXk are martingales,
Tfc t oo (P-a.s.), and consequently X is a local martingale.
This completes the proof of the theorem.
5. Example 8. Let (w„)n;>i be a sequence of independent identically
distributed Bernoulli random variables and let P(w„ = 1) = p, P(w„ = — 1) = q,
p + q = 1. We interpret the event {rj„ = 1} as success (gain) and {rj„ = — 1}
as failure (loss) of a player at the wth turn. Let us suppose that the player's
stake at the wth turn is Vn. Then the player's total gain through the wth turn is
Xn= t Virh = Xn_l + Vnrlni X0 = 0.
i= 1
It is quite natural to suppose that the amount V„ at the wth turn may depend
on the results of the preceding turns, i.e., on Vu..., Vn_x and on r\^ >/„_!.
In other words, if we put Fo = {0, Q} and Fn = g{co: y\± w„}, then
Vn is an tFn-x-measurable random variable, i.e., the sequence V = (Vnt ^n-x)
that determines the player's "strategy" is predictable. Putting Yn =
wx + • • + w„, we find that
Xn=t V^Yit
i= 1
i.e., the sequence X = (Xni &n) with X0 = 0 is the transform of Y by V.
From the player's point of view, the game in question is/air (or favorable,
or unfavorable) if, at every stage, the conditional expectation
E(Xn+ 1-Xn\$r) = 0 (or > 0 or < 0).
Moreover, it is clear that the game is
fair if p = q = \t
favorable if p > qr,
unfavorable, if p < q.
Since X = (X„,^)isa
martingale if p = q = \,
submartingale if p > q,
supermartingale if p < q,
we can say that the assumption that the game is fair (or favorable, or
unfavorable) corresponds to the assumption that the sequence A" is a martingale
(or submartingale, or supermartingale).
Let us now consider the special class of strategies V= (Vni i^_1)II>1
with Vl = 1 and (for w > 1)
§1. Definitions of Martingales and Related Concepts
481
f2"-' if* = -,1 ^-, = -1,
" [0 otherwise.
In such a strategy, a player, having started with a stake Vl = 1, doubles the
stake after a loss and drops out of the game immediately after a win.
If r\i = — 1,..., n„ = — 1, the total loss to the player after n turns will be
£ 2'"1 = 2n- 1.
i=l
Therefore if also nn+1 = 1, we have
Xn+1 =Xn+ Vn+1 = -(2" - 1) + 2" = 1.
Let t = inf{w > 1: X„ = 1}. If p = q = %t i.e., the game in question is
fair, then P(t = n) = (%)", P(t < oo) = 1, P(XX = 1) = 1, and EXZ = 1.
Therefore even for a fair game, by applying the strategy (9), a player can in a
finite time (with probability unity) complete the game "successfully,"
increasing his capital by one unit (EXt = 1 > X0 = 0).
In gambling practice, this system (doubling the stakes after a loss and
dropping out of the game after a win) is called a martingale. This is the origin
of the mathematical term "martingale."
Remark. When p = q = |, the sequence X = (X„t ^) with X0 = 0 is a
martingale and therefore
EX„ = EXo = 0 for every n > 1.
We may therefore expect that this equation is preserved if the instant n is
replaced by a random instant t. It will appear later (Theorem 1, §2) that
EXZ = EX0 in "typical" situations. Violations of this equation (as in the
game discussed above) arise in what we may describe as physically
unrealizable situations, when either t or | Xn \ takes values that are much too large.
(Note that the game discussed above would be physically unrealizable, since
it supposes an unbounded time for playing and an unbounded initial capital
for the player.)
6. Definition 6. A stochastic sequence £ = (£„, ^) is a martingale-difference
ifE|f| <oo for all n > Oand
E(£„+iTO = 0 (P-a.s.). (10)
The connection between martingales and martingale-differences is clear
from Definitions 1 and 6. Thus if X = (Ar„, J^) is a martingale, then £ =
(C„, ^) with f0 = X0 and £„ = AXni n > 1, is a martingale-difference. In
turn, if f = (fn, J£) is a martingale-difference, then X = (X„, J^) with
Xn = f o + • •' + f n is a martingale.
482
VII. Sequences of Random Variables That Form Martingales
In agreement with this terminology, every sequence £ = (^„)n^o °* m"
dependent integrable random variables with E£„ = 0 is a martingale-
difference (with J* = o{co: f0, fi £„})•
7. The following theorem elucidates the structure of submartingales (or
supermartingales).
Theorem 2 Poob). Let X = (Xnt &n) be a submartingale. Then there is a
martingale m = (m„, &Q and a predictable increasing sequence A = (Ani ^_ j)
such that, for every n > 0, Doob's decomposition
Xn = mn + An (P-a.s.) (11)
holds. A decomposition of this kind is unique.
Proof. Let us put m0 = X0i A0 = 0 and
mn = m0+ £ LXj+1 - E(Xj+l\^)l (12)
;=o
An = nZ[E(Xi+ tl^-XJ. (13)
i=o
It is evident that m and A, defined in this way, have the required properties.
In addition, let Xn = m'„ + A'„, where m' = (m'„, &„) is a martingale and A' =
(A'n, Fn) is a predictable increasing sequence. Then
A'n+1 - A'„ = (An+l - A„) + (mn+1 - m„) - (m'n+l - m'n\
and if we take conditional expectations on both sides, we find that (P-a.s.)
A'n+l — A'n = An+1 — An. But A0 = A'0 = 0, and therefore A„ = A'„ and
m„ = m'„ (P-a.s.) for all n > 0.
This completes the proof of the theorem.
It follows from (11) that the sequence A = (A„, Fn_i) compensates X =
(Xn, Fn) so that it becomes a martingale. This observation is justified by the
following definition.
Definition?. A predictable increasing sequence A = {Ani^!rn_l) appearing
in the Doob decomposition (11) is called a compensator (of the
submartingale X).
The Doob decomposition plays a key role in the study of square integrable
martingales M = (M„, Fn) i.e., martingales for which EM2 < oo, n > 0; this
depends on the observation that the stochastic sequence M2 = (M2, ^) is
a submartingale. According to Theorem 2 there is a martingale m = (mni ^,)
and a predictable increasing sequence <M> = «M>„, ^-1) such that
M2n =mn + <M>„. (14)
§1. Definitions of Martingales and Related Concepts
483
The sequence <M> is called the predictable quadratic variation or the
quadratic characteristic of M and, in many respects, determines its structure
and properties.
It follows from (12) that
<M>„= £ EKAM,.)2!^] (15)
i=i
and, for all / < /c,
E[(Mfc - M,)2|^] = E[Mfc2 - M?|*j] = E[<M\ - <M>,|^]. (16)
In particular, if M0 = 0 (P-a.s.) then
EM2 = E<M>fc. (17)
It is useful to observe that if M0 = 0 and M„ = ^+--- + ^„, where
(£„) is a sequence of independent random variables with E££ = 0andE£2 <oo,
the quadratic variation
<M>„ = EM2 = V& +-.. + V£„ (18)
is not random, and indeed coincides with the variance.
If X = (Xni &\) and Y = (Ynt ^) are square integrable martingales, we
put
<x, y\ = K<x + y>„ - <x - y>J. (19)
It is easily verified that (Xn Yn — <A\ y>„, ^) is a martingale and therefore,
for / < /c,
E[(xfc - ^)(¾ - ioi ^a = e[<x, y>fc - <*, K>,i^a. (20
In the case when X„ = £t + ••• + £„, Yn = nx + • • • + n„, where (f„)
and (r]„) are sequences of independent random variables with E^£ = Ef/£ = 0
and ££? <oo, Erjf <oo, the variable <A", y>„ is given by
<X,y>„= £cov(6,ifc).
The sequence <X, y> = «X, y>„, .^) defined in (19) is often called the
mutual characteristic of the (square integrable) martingales X and Y
It is easy to show (compare with (15)) that
<X9Y>N='jtEtAXtAYl\rl-l-].
In the theory of martingales, an important role is also played by the
quadratic covariation,
K ^n = E AXiAY»
and the quadratic variation,
484
VII. Sequences of Random Variables That Form Martingales
pa = £(Ax£)2,
which can be defined for all random sequences X = (XnX^ and Y = (Yn)n-zi •
8. Problems
1. Show that (2) and (3) are equivalent.
2. Let a and t be Markov times. Show that t + <r, t a <t, and t v a are also Markov
times; and if P(c < x) = 1, then ^ff c ^t.
3. Show that t and Xt are .^-measurable.
4. Let y = (y„, ^„) be a martingale (or submartingale), let V = (Vni ^_j) be a
predictable sequence, and let (P- y)„ be integrable random variables, n > 0. Show that
P- yis a martingale (or submartingale).
5. Let^-, c^2c...bea nondecreasing family of a-algebras and £ an integrable
random variable. Show that (X„)ni: i with Xn = E(£|££) is a martingale.
6. Let ^, 2 ^2 2 • • • be a nonincreasing family of <r-algebras and let £ be an integrable
random variable. Show that (X„)„sl with Xn = E(f|^„) is a reversed martingale, i.e.,
E(Xn\Xn+l, Xn+2, ...) = Xn+1 (P-a.s.)
for every n > 1.
7. Let f lt £2> £3.--- be independent random variables, P(f, = 0) = P(& = 2) = \ and
X„ = Yl"=i &. Show that there does not exist an integrable random variable £ and a
nondecreasing family (^,) of <r-algebras such that Xn = E(£|.^). This example
shows that not every martingale (X„)n2:1 can be represented in the form (E(f |^,))n;sl
(compare Example 3, §11, Chapter I.)
§2. Preservation of the Martingale Property Under
Time Change at a Random Time
1. If X = (X„, ^)„>0 is a martingale, we have
EXn = EX0 (1)
for every n > 1. Is this property preserved if the time n is replaced by a
Markov time t? Example 8 of the preceding section shows that, in general,
the answer is "no": there exist a martingale X and a Markov time t (finite
with probability 1) such that
EXt#EX0. (2)
The following basic theorem describes the "typical" situation, in which,
in particular, EXX = EX0.
§2. Preservation of the Martingale Property Under Time Change at a Random Time 485
Theorem 1 (Doob). Let X = (Xn, &\) be a martingale {or submartingale),
and ti and t2, stopping times for which
E|Xt||<oo, 1 = 1,2, (3)
lim f \Xn\dP = Ot i= 1,2. (4)
n-»oo J{xt>n)
Then
E(XJ*y = Xxx ({t2 > t,}; P-fl.5.). (5)
If also P{zl < t2) = 1, then
EXt2 = EXtl. (6)
(Here and in the formulas below, read the upper symbol for martingales
and the lower symbol for submartingales.)
Proof. It is sufficient to show that, for every A e &\lt
f XndP= f X„dP. (7)
•Mn{t2;>Ti} le=' ^n{t2>ti)
For this, in turn, it is sufficient to show that, for every n > 0,
f XX2dP = f XXldP,
•Mn{t2;>Ti}n{Ti=n} (^) JAnix27>xi)nix1=n)
or, what amounts to the same thing,
JBn{T2>n) JBn{x2>n)
(8)
where B = An {^ = n}e^n.
We have
f XndP=[ XndP+\ XndP= f XndP
JBn{x2>n) JBn{x2 = n) JBn{x2>n) l_ JBn{x2 = n)
+ f E(Xn+1\&n)dP = f Xt2dP+f Xn+1</P
JBn{t2>n) JBn{T2 = n} «'Bn{t2^n+l}
= f xtIdP+ f xn+2dP = ...
lS' JBn{n^t2<n+l) JBn{t2^n+2}
= r xt2</p+ r x^rfp,
(-) ./Bn{n<ST2<Srn} JBn{t2>m}
nSt2
whence
f x„a> ~ r x„dp- r xmdp
JBn{n£r2<m) JBnin^Tj) JBn{m<X2)
486
VJI. Sequences of Random Variables That Form Martingales
and since Xm = 2XJ, - \Xm\, we have, by (4),
f XndP s HinTf XndP- f Xmdp\
JBn(r22n) m-*ao\_ JBninZxj) JBn{m<x2) J
= f XndP- ]im f XmdP= f XndP,
JBn{n£T2) m-»oo JBn{rn<T2} JBn{z2>n)
which establishes (8), and hence (5). Finally, (6) follows from (5).
This completes the proof of the theorem.
Corollary 1. If there is a constant Nsuch that P(t, < N) = landP(r2 < N) =
1, then (3) and (4) are satisfied. Hence if in addition, P(t, < t2) = 1 and X is
a martingale, then
EX0 = EXtl = EXZ2 = EXN. (9)
Corollary 2. If the random variables {Xn} are uniformly integrable (in
particular, if\X„\ < C <oo, n > 0), then (3) and (4) are satisfied.
In fact, P(T| > n) -► 0, n -> oo, and hence (4) follows from Lemma 2, §6,
Chapter IL In addition, since the family {X„} is uniformly integrable, we
have (see 11.6.(16))
supE|XN|<oo. (10)
If t is a stopping time and A" is a submartingale, then by Corollary 1, applied
to the bounded time rN = x a N,
EX0 <: EXXN.
Therefore
E\XZN\ = 2EXt+N - EXtN < 2EX+ - EX0. (11)
The sequence X+ = (X+, ^) is a submartingale (Example 5, §1) and
therefore
E*£= I f */<*P + f XSdP< £ f «dP
i=0 %=jl J{x>N) j=0J{xN=j]
+ f X+dP = EX£ <E\XN\<supE\XN\.
J{r>N) N
From this and (11) we have
E|AtN|<3supE|AA,|,
TV
and hence by Fatou's lemma
E|At|<3supE|AN|.
§2. Preservation of the Martingale Property Under Time Change at a Random Time 487
Therefore if we take t = rh i = 1, 2, and use (10), we obtain E |ATT.| <oo,
i = 1, 2.
Remark. In Example 8 of the preceding section,
f \X„\ dP = (2n - 1)P{t >n} = (2n - 1) • 2"" -> 1, n - oo,
J{x>n)
and consequently (4) is violated (for t2 = t).
2. The following proposition, which we shall deduce from Theorem 1, is
often useful in applications.
Theorem 2. Let X = (X„) be a martingale (or submartingale) and x a stopping
time (with respect to {&*\ where IF* = <t{cd:X0 X„). Suppose that
Et <oo,
and that for some n>0 and some constant C
E{\Xn+l-Xn\\&*} <C ({t > n}; P-a.s.).
Then
E\XX\ <oo
and
EXt = EX0- (12)
We first verify that hypotheses (3) and (4) of Theorem 1 are satisfied with
T2 = T.
Let
y0 = l*ol, Yj = \Xj-xj-1\, j>\.
Then\Xx\<Yj=oYjand
\J=0 / JQ\j=0 J n = oJ{t=n}j = 0
oo n /• oooo/» oo /•
- Z S f YJdP= I I YJdP= I YJdp-
„ = 0j=0 J[T = n) j=0n=jJ{T=n) j=0 J[t7>j)
The set {t > j} = Q\{r <j} e &j-uj > 1. Therefore
f YjdP= f ElYj\X0 Xj_JdP<CP{r>j}
488
VII. Sequences of Random Variables That Form Martingales
for j > l;and
E|*.l ^ Eft y\ <E\X0\ + cf P{t >j} = E\X0\ + CEt <oo.
(13)
Moreover, if z > n, then
1¾^ 1¾.
;=o j=0
and therefore
f lATJ^P < f tr,a>.
J{x>n) J{x>n)j=0
E Yj=o Yj <oo and {t>«}|0,h
t yields
lim f \Xn\dP<M f [i^P = 0.
Hence since (by (13)) E Yj=o Yj <co and {t > n} 10, n -> oo, the dominated
convergence theorem yields
Hence the hypotheses of Theorem 1 are satisfied, and (12) follows as
required.
This completes the proof of the theorem.
3. Here we present some applications of the preceding theorems.
Theorem 3 (Wald's Identities). Let £lt ^2i...be independent identically
distributed random variables with E|£,| <oo and z a stopping time (with
respect to ^), where &* = g{cd: ^1 £„}, z > 1), and Et < oo. Then
EG?! +.-. +W-E^-Et. (14)
IfalsoE£f <oo then
E{(£i + ••■ + £)- tE^}2 = V^ -Et. (15)
Proof. It is clear that X = (Xn, ^¾^ with Xn = (£t + • • • + £„) - nE^
is a martingale with
EUXn+i-Xn\\Xli...iXn]=Etttn+1-Et1\\ti U
= E|£n+1-E£1|<2E|£1|<oo.
Therefore EXX = EX0 = 0, by Theorem 2, and (14) is established.
Similar considerations applied to the martingale Y=(Yni^) with
Yn = Xl - nVf x lead to a proof of (15).
Corollary. Let £lt£2>---be independent identically distributed random variables
with
P(6 = i) = p(& = -i) = is„ = ^i + ••• + ft,
§2. Preservation of the Martingale Property Under Time Change at a Random Time 489
and x = inf{n > 1:S„ = 1}. Then P{x <oo} = 1 {seejor example, (1.9.20))
and therefore P(St = 1) = 1, ESX = 1. Hence it follows from (14) that Ex = oo.
Theorem 4 (Wald's Fundamental Identity). Let £u £2 be a sequence of
independent identically distributed random variables, S„ = ^l + • • • + £„,
and n > 1. Let <P(t) = Ee'^,teR, and for some t0 # 0 let (P(t0) exist and
Hto) > 1.
Ifx is a stopping time (with respect to (&%), &1 = o{a>: ^l <!;„}, x > 1),
such that \S„\ < C ({t > n}; P-a.s.) and Ex <oo, then
J e'°s* 1
"LWod
1. (16)
l_vv*o;;j
Proof. Take
Yn = e'»H<p(t0)rn-
Then Y = (Yn, ^)n2:i is a martingale with EYn = 1 and, on the set {x>n},
e«**i-*n* Y"} = Y"E\!i^-lh 4
= yn.E{|^>-1(t0)-i|}<5<oo,
where B is a constant. Therefore Theorem 2 is applicable, and (16) follows
since EYl = 1.
This completes the proof.
Example 1. This example will let us illustrate the use of the preceding
examples to find the probabilities of ruin and of mean duration in games
(see §9, Chapter I).
Let £lt £2i • • • be a sequence of independent Bernoulli random variables
withP(£,. = 1) = p, P(£ = -1) = q,p + q = \,S = ^ + ••• + £„,and
x = inf{« > 1: Sn = B or A}, (17)
where (—A) and B are positive integers.
It follows from (1.9.20) that P(t <oo) = 1 and Ex <oo. Then if a =
P(St = A), p = P(St = B\ we have a + /? = 1. If p = q = i, we obtain
0 = ESX = aA + pB, from (14),
whence
fl + Ml' p fl + MI
Applying (15), we obtain
Ez = ES2 = aA2 +pB2 = \AB\.
490
VII. Sequences of Random Variables That Form Martingales
However, if p # q we find, by considering the martingale ((q/p)s")nzi> that
Em =E(«r = i,
\P/ \p,
and therefore
Together with the equation a + /? = 1 this yields
"m-RF- "'m'^- (18)
Finally, since ESt = (p — ^)Et, we find
Et = E5* = ^ + ^
where a and )5 are defined by (18).
Example 2. In the example considered above, let p = q = \. Let us show that
for every X in 0 < X < n/(B + \A |) and every time t defined in (17),
, B + A
COS A
E(CMA)" -bTRT (19)
cos A ^—-
For this purpose we consider the martingale X = (Xn, ^J)n2.0 with
X„ = (cos Xyn cos ^S„ - ?±A\
(20)
and S0 = 0. It is clear that
EXn = EX0 = cos;i^y^. (21)
Let us show that the family {A"nAJ is uniformly integrable. For this purpose
we observe that, by Corollary 1 to Theorem 1 for 0 < X < n/(B + | A |),
EX0 = EXnAX = E (cos A)-<»^cos x(s„AX - ^±A\
> E (cos Xy(n At> cos X B~A.
§2. Preservation of the Martingale Property Under Time Change at a Random Time 491
Therefore, by (21),
,B + A
COS A
E(cosA)-(nAt)<
,B + \A\>
cos A
and consequently by Fatou's lemma,
cos A—-—
E(cosA)-^ (22)
cos A ^—-
Consequently, by (20),
\Xn„\<{cosX)~\
With (22), this establishes the uniform integrability of the family {XnAX}.
Then, by Corollary 2 to Theorem 1,
cos k—-— = EXq = BXX = E(cos A)_tcos X—-—,
from which the required inequality (19) follows.
4. Problems
1. Show that Theorem 1 remains valid for submartingales if (4) is replaced by
lim f X* dP = 0, i = 1, 2.
n-oo J{x,>n}
2. Let X = (Xn, &jX^0 be a square-integrable martingale, t a stopping time and
lim I* XZdP = 0,
n-oo J[t>n}
lim f \Xn\dP = 0.
n-oo*'{t>n}
Show that then
EX? = E<X>t(=E£(AX,.)2),
where AX0 = X0, AXj = Xj - Xj_lyj £ 1.
3. Show that
E|Xt|< limE|XJ
for every martingale or nonnegative submartingale X = (Xn, ^)„^0 and everv
stopping time t.
492
VII. Sequences of Random Variables That Form Martingales
4. Let X = (A:„,^)n£0 be a submartingale such that Xn > E(£|^) (P-a.s.), n >l 0,
where E | ^ | < oc. Show that if t, and t2 are stopping times with P(t , < t?) = 1, then
Xtl>E(Xt2\^u) (P-a.s.).
5. Let ¢,, £2»... be a sequence of independent random variables with P(£,- = 1) =
P(£, = — 1) = ^, o. and b positive numbers, b > a,
Xn = at /(& =+l)-*>L /«* =-D
and
t = inf{« > 1: Xn < -r}, r > 0.
Show that EeAt < oo for A < a0 and EeXt = oo for X > a0, where
b , 2fc a , la
a0 = In + In .
a+b a+b a+b a+b
6. Let ¢,, ^2,... be a sequence of independent random variables with E& = 0, V£, = of,
Sn = ¢, + ••• + £n* &n = a{(JtJ'-¢11 ■ ■ • ifn}- Prove the following generalizations of
Wald's identities(14)and (15): If E jj=, E |¢,1 < 00 then ESt = 0;if E £/=, Eg < 00,
then
ESt2 = E £ % = E £ 0]. (23)
;= 1 /= 1
§3. Fundamental Inequalities
1. Let X = (Xni &\)„>o be a stochastic sequence,
X* = max \Xj\t \\Xn\\p = (E\X„n,p, P > 0.
In Theorems 1-3 below, we present Doob's fundamental "maximal
inequalities for probabilities" and "maximal inequalities in Lp" for submartingales,
supermartingales and martingales.
Theorem 1.1. Let X = (X„, J^)„>0 be a submartingale. Then for allX>0
XP jmax Xk > X j < E \X*I (max Xk > Xj\ < EXn+, (1)
X? jmin Xk < -xl < E \x„I (min Xk > -X J - EX0 < EXn+ - EX0i (2)
AP^max \Xk\ > X> < 3 max E\Xk\. (3)
3. Fundamental Inequalities 493
II. Let Y = (Yn, «^,)n;>o be a supermartingale. Then for allX>0
APJmax Yfc> XV <EY0- E Ynl( max Yk<XJ \< EY0 + EYn", (4)
ipjmin Yfc<-;ii< -E Ynl(minYk< -Xj \< EY~t
Lp{max|yfc|>4 <3 max El Kl.
APJminyfc<-A>< -E YnllminYk< -X] \< EY~t (5)
AP^max|yfc|>A> <3maxE|Yfc|. (6)
III. Let Y = (Yni !Fn)n7t0 be a nonnegative supermartingale. Then for all
X>0
APjmax Yfc>;i><EYo, (7)
APJsup Yfc>;il<EYn. (8)
Theorem 2. Let X = (Xni ^,)„^0 be a nonnegative submartingale. Then for
p>lwe have the following inequalities:
IXAiZlXSliZji-jlXJi,; (9)
ifP=h
IA.I < |A?|, < j^j {1 + 11¾ ln+ XJ, }■ (10)
Theorem 3. Let X = (X„, &n)n*.o be a martingale, X>0andp>l. Then
p{ma^|>A}<^^ (11)
and if p > 1
|X.I,£|XriF£^ryl*.l,- (12)
In particular, if p = 2
p{max|Xi|*4*^r-' <13)
E|maxA:fc2l<4EA:n2.
(14)
494
VII. Sequences of Random Variables that Form Martingales
Proof of Theorem 1. Since a submartingale with the opposite sign is a
supermartingale, (1)-(3) follow from (4)-(6). Therefore, we consider the case
of a supermartingale Y = (Yni ^,)n2.0-
Let us set t = inf{fc < n: Yk > X) with t = n if maxfc5n Yk < X. Then, by
(2.6),
Ey0>Eyr=E Yv;maxYk>X +E yr;maxyfc<A
L ten J L ten J
> X? \ max Yk > X > + E Yn; max Yk < X ,
(. k£n J L ten J
which proves (4).
Now let us set o = inf{fc < n: Yk ^ —X}, and take o = n if minfc5n Yk > —X.
Again, by (26),
EF„^Eyr=E yr;minrfc^ -X + E Yv;mmYk> -X\
L ten J L ten J
<XP\mmYk< -x\ + E\ Yn;minYk>-X \.
I ten ) L ten J
Hence,
X?\minYk< -X>< -E r„;min^< -X \< EY~
IkZn J L ten J
which proves (5).
To prove (6), we notice that Y~ = (- Y)+ is a submartingale. Then, by (4)
and(l),
pimax |Ffc| > X> < P^max Yk+ > x\ + pimax rfc" > a!
(. k£n ) (. k<Zn ) ( kZn )
= P<maxl^>A> + P<max Yk >X\
{ten ) I k£n J
< Ey0 + 2EY~ < 3 max E| Yk\.
k£n
Inequality (7) follows from (4).
To prove (8), we set y = inf{fc > n: Yk > X}, taking y = oo if Yk < X for all
k > n. Now let n < N < oo. Then, by (2.6),
EYn > EYyAN > ElYyANI(y ^ iV)] > X?{y <> N},
from which, as N -*■ oo,
EYn > AP{y < oo} = APjsup Yk > A.
§3. Fundamental Inequalities
495
Proof of Theorem 2.
The first inequalities in (9) and (10) are evident.
To prove the second inequality in (9), we first suppose that
|A?|F<oo, (15)
and use the fact that, for every nonnegative random variable £ and for r > 0,
E£' = r C t'-lP(Z>t)dt. (16)
Jo
Then we obtain, by (1) and Fubini's theorem, that for p > 1
£(X*r = P [mt^P{x*>t}dt<P fV2(f xndp)dt
Jo Jo \J {**;>»} /
= P J" f-2 [£ XnI{X* > t) dp] dt
Hence, by Holder's inequality,
E(x*y <; giixji,- ux*y-\ = «|x.i,[E(x?)T*, (is)
where q = p/(p — 1).
If (15) is satisfied, we immediately obtain the second inequality in (9) from
(18).
However, if (15) is not satisfied, we proceed as follows. In (17), instead of
X* we consider (X* a L), where L is a constant. Then we obtain
E(X* a Lf < qElXn(X* a L)'"1] < q\\Xn\\p^(X* a LfY>\
from which it follows, by the inequality E(X* a Lf < IF < oo, that
E(X*ALr<q^EX^ = qnXX
and therefore,
E(X*f = lim E(X* a Lf < qp\\Xn\\^
We now prove the second inequality in (10).
Again applying (1), we obtain
EX* - 1 < E(X* - 1)+ = J* ?{X* -l>t\dt
496
VII. Sequences of Random Variables that Form Martingales
Since, for arbitrary a > 0 and b > 0,
a In b < a ln+ a + be~\ (19)
we have
EX: - 1 ^ EX„ In X* < EX„ In+ X„ + <T«EX„*.
If EX* < oo, we immediately obtain the second inequality (10).
However, if EX* = oo, we proceed, as above, by replacing X* by X* a L.
This proves the theorem.
The proof of Theorem 3 follows from the remark that \X\P, p> 1, is a
nonnegative submartingale (if E\Xn\p < oo, n > 0), and from inequalities (1)
and (9).
Corollary of Theorem 3. Let X„ = £0 + ••• + £„, n > 0, where (£fc)fc;>o is a
sequence of independent random variables with E£k = 0 and E^ < oo. Then
inequality (13) becomes Kolmogorov's inequality (§2, Chapter IV).
2. Let X = (Xnt ^,) be a nonnegative submartingale and
X„ = M„ + 4,,
its Doob decomposition. Then, since EM„ = 0, it follows from (1) that
P{X„*>E}<^.
Theorem 4, below, shows that this inequality is valid, not only for sub-
martingales, but also for the wider class of sequences that have the property
of domination in the following sense.
Definition. Let X = (Xn, ^,) be a nonnegative stochastic sequence, and
A = (Ant ^„-i) an increasing predictable sequence. We shall say that X is
dominated by the sequence A if
EXr £ EAX (20)
for every stopping time t.
Theorem 4. If X = (Xn, &n) is a nonnegative stochastic sequence dominated by
an increasing predictable sequence A = (Ant ^n-x\ then for A > 0, a > 0, and
any stopping time t,
P{X?>X}<t£, (21)
P{X? > X) < i E(AZ A a) + P(AZ £ a\ (22)
§3. Fundamental Inequalities
497
<**■* * (l3^)1/P l^i" 0 < p < 1. (23)
Proof. We set
¢7,, = min{j < t a w: Xj > X},
ing cr„ = t a n, if {•} = 0. Then
EAT > EACn > EXCn > [ Xan d? > XP{X*AIt > X},
J[Xt\n>X}
from which
P{AJA1>i}^E4
and we obtain (21) by Fatou's lemma.
For the proof of (22), we introduce the time
y = 'mf{j:AJ+1>a},
setting y = ooif{-} = 0. Then
P{X* >X} = P{X* > K Av < a} + P{X* > K Av > a}
<P{I{A,<a]X*>X} + P{Av>a}
< P{X?A7 >X} + P{Av>a}<jEA„y + PAy > a}
<jE(AvAa) + P(Az>a),
where we used (21) and the inequality I{At<a]X* < X*^r Finally, by (22),
iix*ij = E(x*y = J" p{(x*r >t}dt = J" p{x* > t1*} dt
< r r"'E[A, a f1"] dt + P P{A? > t} dt
Jo Jo
= E f ' dt + E P (Azr^) dt + EA? = \^- EA*.
Jo J a? l~P
This completes the proof.
Remark. Let us suppose that the hypotheses of Theorem 4 are satisfied,
except that the sequence A = (Ani ^,)n2.0 1S not necessarily predictable, but
has the property that for some positive constant c
3{sup|A,4fc| <c} =
498
VII. Sequences of Random Variables that Form Martingales
where AAk = Ak — Ak_x. Then the following inequality is satisfied (compare
(22)):
P{X? >*)<\ E[>lt a (a + c)] + ?{AX > a}. (24)
The proof is analogous to that of (22). We have only to replace the time
y = inf{j: Aj+1 > a] by y = inf{j: Aj > a} and notice that Ay < a + c.
Corollary. Let the sequences Xk = (Xk, &k) and Ak = (Ak, &k\ n>Q,k>l
satisfy the hypotheses of Theorem 4 or the remark. Also, let (t*)^! be a
sequence of stopping times (with respect to #"* = 0^,*)) and A** -^ 0. Then
(x% ^0.
3. In this subsection we present (without proofs, but with applications) a
number of significant inequalities for martingales. These generalize the
inequalities of Khinchin and of Marcinkiewicz and Zygmund for sums of
independent random variables.
Khinchin's Inequalities. Let £,, £2> . . . be independent identically distributed
Bernoulli random variables with P(££ = 1) = P(£,- = — 1) = \ and let (c„)n2:1
be a sequence of numbers.
Then for every p, 0 < p <oo, there are universal constants Ap and Bp
(independent of(cn)) such that
A{icf^c4^B{ic'T (25)
for every n > 1.
The following result generalizes these inequalities (for p > 1).
Marcinkiewicz and Zygmund'sInequalities. If £1} £2>--- is a sequence of
independent integrable random variables with E£,- = 0, then for p > 1 there
are universal constants Ap and Bp (independent of(£„)) such that
for every n > 1.
In (25) and (26) the sequences X = (X„) with X„ = £;=1 cfij and Xn =
Yj=i Zj are martingales. It is natural to ask whether the inequalities can be
extended to arbitrary martingales.
The first result in this direction was obtained by Burkholder.
(26)
\p
§3. Fundamental Inequalities
499
Burkholder's Inequalities. If X = (Xn,^n) is a martingale, then for every
p > 1 there are universal constants Ap and Bp (independent of X) such that
ApW^VQJp < \\Xn\\p < Bp\\j[xTn\\P, (27)
for every n > 1, where [X~\n is the quadratic variation ofXn,
m. = t (^,)2. *o = 0. (28)
The constants Ap and Bp can be taken to have the values
Ap = [18p3'2/(p - I)]"\ Bp = 18p3'2/(p - D1/2.
It follows from (17), by using (2), that
Aplly/VQJp < \\X*\\P < Bjllx/m^llp, (29)
where
Ap = [18p3/2/(p - l)]"1, B* = 18p5'2/(p - 1)3/2-
Burkholder's inequalities (27) hold for p > 1, whereas the Marcinkiewicz-
Zygmund inequalities (26) also hold when p = 1. What can we say about the
validity of (27) for p = 1 ? It turns out that a direct generalization to p = 1
is impossible, as the following example shows.
Example. Let flf £2, • • be independent Bernoulli random variables with
P(6=l)=P(£=-l) = iandlet
HAT
where
inf|«>l: t ^=l|.
The sequence X = (Xnt &*) is a martingale with
n-X-Ji = EIA-..I = 2EJT+ — 2, ii-oo.
But
'M.
= EVTXL = e( £ 1J = Zy/^i -+ oo.
Consequently the first inequality in (27) fails.
It turns out that when p = 1 we must generalize not (27), but (29) (which
is equivalent when p > 1).
Davis's Inequality. If X = (Xn, ^n) is a martingale, there are universal
500
VII. Sequences of Random Variables That Form Martingales
constants A and B, 0 < A < B < co, such that
Aiiy/mj, < \\x*\\i < WPoji, (3°)
i.e.t
AE It (AX,)2 < e\ max \Xn\\ < BE /£ (AX,-)2-
Corollary 1. Let £lt £2>-- be independent identically distributed random
variables; Sn = £,x + • • • + f„. //E |f t| <oo and E£,x = 0, then according to
Wald's inequality (2.14) we have
EST = 0 (31)
for every stopping time x (with respect to (^)) for which Et < oo.
It turns out that (31) is still valid under a weaker hypothesis than Et < oo
if we impose stronger conditions on the random variables. In fact, if
Ei^r<oo,
where 1 < r < 2, the condition Et1/r < oo is a sufficient condition for EST = 0.
For the proof, we put t„ = ta«J= sup„| SrJ, and let m = [f] (integral
part of f) for t > 0. By Corollary 1 to Theorem 1, §2, we have ESTn = 0.
Therefore a sufficient condition for ESZ = 0 is (by the dominated convergence
theorem) that E sup„|SrJ <oo.
Using (1) and (27), we obtain
P(y > 0 = P(t > f, Y > t) + P(t < f9 Y > t)
< P(t > f) + Pi max \SZJ\ > ti
<P(T>o + r-E|Srmr
< P(t > f) + t~rBrE
(s«)
< p(t > n + rrBrE £ ^/.
Notice that (with &% = {0, Q})
E I |£/ = E I/O^TjIf/
J= 1 j=1
= lEEc/a^oif/i^j-i]
= E £ /(J < TjEQf n^j-J = E £ E|f/ = AfrETM>
§3. Fundamental Inequalities
501
where \ir = E |^1|f'. Consequently
P(Y > 0 < P(t > f) + rrBrnrErm
= P(t > O + Brnrr\ mP(t > O + f *dP 1
< (1 + Brfxr)P(T > f) + BrA/rrr f t tfP
J{r<f}
and therefore
E
y\ P(Y> t)dt < (1 + BrA/r)Et1/r + BrA/r P rT f rdP\ dt
Jo Jo L J{r<n J
= (1 + B^Et1" + Brixr J"J £f'J dP
= A+BrAir + -^iW^<oo.
Corollary 2. Let M = (M„) fee a martingale with E |M„ |2r < oofor some r > 1
and swc/z fto (wir/i M0 = 0)
£*££<- (32)
T/ie« (compare Theorem 2 of §3, Chapter IV) we /wroe f/ze strong law of large
numbers:
^^0 (P-a.s.\ n->oo. (33)
When r = 1 the proof follows the same lines as the proof of Theorem 2,
§3, Chapter IV. In fact, let
" AMk
jt=i K
Then
n n nfcfi
and, by Kronecker's lemma (§3, Chapter IV) a sufficient condition for the
limit relation (P-a.s.)
1 "
- £ kAmk -> 0, « -> oo,
nk=i
is that the limit limnm„ exists and is finite (P-a.s.) which in turn (Theorems 1
and 4, §10, Chapter II) is true if and only if
PJsup|mn+fc-m„| >el->0, n-
(34)
502
VII. Sequences of Random Variables That Form Martingales
By(l),
pjsup|m„+s - m„| £ si < e-> f 1¾^
Hence the required result follows from (32) and (34).
Now let r > 1. Then the statement (33) is equivalent (Theorem 1, §10,
Chapter II) to the statement that
e2rpjsup!Ml>ej_0j ,,.,00 (35)
Lj*n J J
for every e > 0. By inequality (53) of Problem 1,
e^pLp \MA > gj = S2r lim pj max \M£L > g2rj
Lj2:n J J m-*co (.n<j<m J J
<±-rE\Mn\2' + £ ijEdMjP'-IMj-.l2').
It follows from Kronecker's lemma and (32) that
lim-^E|Mn|2- = 0.
n-»oo "
Hence to prove (35) we need only prove that
ZiE(|Mir-|Mj_1|20<oo. (36)
We have
j=2i
< y T 1 11E|M. » | E|MW|*
By Burkholder's inequality (27) and Holder's inequality,
E\Mj\2' < e[£ (AM£)2T < E/"1 £ |AM£|2-.
Hence
n-i i j N FIAM l2r
j = 2 7 i=l j=2 J
(Q are constants). By (32). this establishes (36).
§3. Fundamental Inequalities
503
4. The sequence of random variables {Xn}n2:1 has a limit lim Xn (finite or
infinite) with probability 1, if and only if the number of "oscillations between
two arbitrary rational numbers a and b, a < b" is finite with probability 1.
Theorem 5, below, provides an upper bound for the number of "oscillations"
for submartingales. In the next section, this will be applied to prove the
fundamental result on their convergence.
Let us choose two numbers a and b,a < b, and define the following times
in terms of the stochastic sequence X = (Xn, ^):
t0 = 0,
xx = min{« > 0: Xn < a},
x2 = min{« > rl: X„> b}t
x2m_l = min{n > x2m_2:Xn < a},
x2m = min{n >x2m_l:Xn> b},
taking xk = oo if the corresponding set {•} is empty.
In addition, for each n > 1 we define the random variables
«*fc) = {£ax{m:,
if t2 > n,
:T2m<«} ift2<".
In words, fSn(at b) is the number of upcrossings of [a, b~\ by the sequence
Theorem 5 poob). Let X = (Xni^„)„^i be a submartingale. Then, for
every n > 1,
m",b)<ELXb"-f. (37)
Proof. The number of intersections of X = (X„, &\) with [a, b] is equal to
the number of intersections of the nonnegative submartingale X+ =
((X„ — a)+t ^) with [0, b - a]. Hence it is sufficient to suppose that X is
nonnegative with a = 0, and show that
cy
E/U0,fc)<-^. (38)
Put X0 = 0, &o = {0, O}, and for i = 1, 2,..., let
{1 if xm < i < xm+! for some odd m,
0 if xm < i < xm+! for some even m.
It is easily seen that
%(0,&)< I ^-[X£ -Xt-J
and
{</>,.= !} = U [(^<0\{tm+ ,</}]e^_v
504
VII. Sequences of Random Variables That Form Martingales
Therefore
bE/?„(0, b) < E £ v-lXt - X£_,] = £ f (Xt - X,-,) dP
«=i «=i Jwi=i\
ZiXt-Xi-A&t-JdP
,= i J{Vl=i
}
< £ f [1(^,1^,-,) - X._J <*P = EX„,
i=i Jo
which establishes (38).
5. In this subsection we discuss some of the simplest inequalities for the
probability of large deviations for martingales of integrable square.
Let M = (M„, ^n)n^o be a martingale of integrable square with quadratic
variation <M> = «M>n,^n_1). If we apply inequality (22) to Xn = M„,
An = <M>„, we find that for a > 0 and b > 0
P \ max |Mfc| > an} = P < max Mfc2 > (an)2 >
= M2
< —j E[<M>„ a (Ml + P«M>„ > An}. (39)
In fact, at least in the case when \AM„\ < C for all n and co e fi, this inequality
can be substantially improved by using the ideas explained in §5, Chapter IV
for estimating the probability of large deviations for sums of independent
identically distributed random variables.
Let us recall that in §5, Chapter IV, when we introduced the
corresponding inequalities, the essential point was to use the property that the sequence
(«*■/[*«]", ^nUl, ^. = *{£l U (40)
formed a nonnegative martingale, to which we could apply the inequality (8).
If we now take M„ instead of S„, by analogy with (40), the martingale
(6^-/4(4^1,
will be nonnegative, where
wn-nE^^-i) (41)
j-i
is called the stochastic exponential.
This expression is rather complicated. At the same time, in using (8) it
is not necessary for the sequence to be a martingale. It is enough for it to
be a nonnegative supermartingale. Here we can arrange this by forming a
§3. Fundamental Inequalities
505
sequence (Z„(X), &n) ((43), below), which sufficiently depends simply on Mn
and <M>„, and to which we can apply the method used in §5, Chapter IV.
Lemma 1. Let M = (M„, ^n)ni0 be a square-integrable martingale, M0 = 0,
AM0 = 0, and \AM„(co)\ < c for all n and co. Let X > 0,
V° - 1 - Xc
c>0,
fc«H \2 (42)
and
Zn(X) = e^n-^c(A)<M>n (43)
Then for every c > 0 the sequence Z(X) = (Zn(X\ ^„)„^0 is a non-negative
supermartingale.
Proof. For |x| < c,
m2:2 W! «12:2 MI
Using this inequality and the following representation (Z„ = Zn(X))
AZ„ = ^,[(6^- - l)e-A<M>n^c(A) + (e-A<M>n^c(A) _ ^
we find that
E(AZn|^n_1)
= 2^ [E^**"" - 1^ )«-*<">»<«* + (e-A^«« - 1)]
= Z^Efe**"" - 1 - XAMJ^-Je-W****™ + (e-A<^>n^(A) _ ^
< ZH.ll^e(X)E((AMH^\^H.l)e-A<M^^ + (e-A<M>-WA) - 1)]
= Z„_! [^C(A)A<M>ne-A<M>"^(;i) + (e-A<M>MX) - 1)] ^ 0, (44)
where we have also used the fact that, for x > 0,
xe~x + (e~x - 1) <, 0.
We see from (44) that
E(ZJ^n_1)<Zn_1,
i.e., Z(X) = (Zn(X\ .¾ is a supermartingale.
This establishes the lemma.
Let the hypotheses of the lemma be satisfied. Then we can always find
X > 0 for which, for given a > 0 and b > 0, we have aX - bty£X) > 0. From
this, we obtain
506
VII. Sequences of Random Variables that Form Martingales
P < max Mk > an > = P < max eXMk > e*™ \
[kzn J (. k<,n J
< p J max gAMk-^c(A)<M>k > eXan-tc(i.KM>n (,
= P {max e^-<MA><M>k > g^-^AXM^ <M>n ^ fe„j
+ pjmax e^-<MA><Af>k > eAfl„-^(A)<M>n} <M>^ > bA
< p J max gAMfc-^c(A)<M>fc > eAan-i},cWbn ( (45)
U^n J
where the last inequality follows from (7).
Let us write
Hc(a, b) = sup [aX - tye(A)].
Then it follows from (45) that
P jmax Mk > anl < P{<M>„ > bn} + e-nH<<a-b). (46)
Passing from M to — M, we find that the right-hand side of (46) also provides
an upper bound for the probability P{minfcinMfc < — an}. Consequently,
P jmax \Mk\ >an\< 2P{<M>„ > bn} + 2e-"H<ia'b). (47)
Thus, we have proved the following theorem.
Theorem 6. Let M = (M„, !Fn) be a martingale with uniformly bounded steps,
i.e., \AM„\ < c for some constant c> 0 and all n and co. Then for every a>0
and b>0,we have the inequalities (46) and (47).
Remark 2.
W) = i(^)ln(1 + £)-?. (48)
6. Under the hypotheses of Theorem 6, we now consider the question of
estimates of probabilities of the type
p|sup-^->4>
[ten <M>fc J
which characterize, in particular, the rapidity of convergence in the strong
law of large numbers for martingales (also see Theorem 4 in §5).
§3. Fundamental Inequalities
507
Proceeding as in §5, Chapter IV, we find that for every a > 0 there is a
X > 0 for which aX - tyc(X) > 0. Then, for every b > 0,
P {sup -^- > a\ < P {sup e^-<MA)<Af>k > ^-W)]<jif>nj
Ifc^ <M>fc J (^ J
<p|i
1
SUP e^^k-^cU)<M>k > e[aX-tc(X)}bn
ten J
+ P{<M>„ < b«} < e-te[flA-WA)1 + P{<M>„ < b«},
(49)
from which
pisuP TT^r >a\<P{<M>„ < bn} + e-"H<{ab'b) (50)
(.fc2:n <M>fc J
pisuP tttH > «1 < 2P{<M>„ < bn} + 2e-"H«<fl6-6>. (51)
U;>n |<M>fc| J
We have therefore proved the following theorem.
Theorem 7. Let the hypotheses of the preceding theorem be satisfied. Then
inequalities (50) and (51) are satisfied for alla> 0 and b > 0.
Remark 3. Comparison of (51) with the estimate (21) in §5, Chapter IV, for the
case of a Bernoulli scheme, p = 1/2, M„ = S„ — («/2), b = 1/4, c = 1/2, shows
that for small e > 0 it leads to the same result
,f Mk 1 „f \Sk-k2\ s}
1 SUP 7TFT > e> = P1 Sup * ' > ->
{ten \<M\\ J [fc^ fc 4J
>T^<2e
-4e2h
7. Problems
1. Let X = (Xni &n) be a nonnegative submartingale and let V = (Vni ^j,_i) be a
predictable sequence such that 0 < Vn+1 <V„<C (P-a.s.), where C is a constant.
Establish the following generalization of (1):
epj max VjXj > el + f VnXndP < £ EVjAXj. (52)
2. Establish Krickeberg's decomposition: every martingale X = {Xm^n) with
sup E|X„| < oo can be represented as the difference of two nonnegative martingales.
3. Let flt £2,... be a sequence of independent random variables, Sn = £1 +■•■ + £„
and Smn = £"=„,+1 £/• Establish Ottavianfs inequality:
P< max |Sj| > 2b> 2
ll<j<n J
P{|S„| > e}
miniSJ<nP{|Siin|<£}
508
VII. Sequences of Random Variables That Form Martingales
and deduce that
f pj max \Sj\ > 2t\dt < 2E|S„| + 2 f P{|S„| > r}dt. (53)
•'O UsjSn J *'2E|Sn|
4. Let £lt £2,... be a sequence of independent random variables with E£j = 0. Use
(53) to show that in this case we can strengthen inequality (10) to
ES* < 8E|S„|.
5. Verify formula (16).
6. Establish inequality (19).
7. Let the a-algebra &0l..., &n be such that ^0 ^ ^ c ■ • • c &n and let the events
Ak e ^, fc = 1,..., n. Use (22) to establish Dvoretzky's inequality: for each e > 0,
p[ ^0 ^l^o] ^ ^ + P[£ P(Ak\#k-,) > £|^0] (P-a-s.).
§4. General Theorems on the Convergence of
Submartingales and Martingales
1. The following result, which is fundamental for all problems about the
convergence of submartingales, can be thought of as an analog of the fact
that in real analysis a bounded monotonic sequence of numbers has a (finite)
limit.
Theorem 1 (Doob). Let X = (Xnt ^) be a submartingale with
supE|X„|<oo. (1)
n
Then with probability 1, the limit lim Xn = Xm exists and E\Xm\ <oo.
Proof. Suppose that
P(Iim Xn > Urn Xn) > 0. (2)
Then since
{limXn > lim*n} = U {HmXn > b > a > limX„]
a<b
(here a and b are rational numbers), there are values a and b such that
P{iim X„ > b > a > lim Xn} > 0. (3)
Let /?„(a, b) be the number of upcrossings of (a, b) by the sequence
X, Xni and let /?<>, *>) = limn/?„(a, b). By (3.27),
§4. General Theorems on the Convergence of Submartingales and Martingales 509
EAfe. b) s b_a < fc_a
and therefore
Eft.fr t) = BmE^ t) < S"P"f ■* + "" <co,
which follows from (1) and the remark that
supEIXJ <ooosupEI„+ <oo
n n
for submartingales (since EX„+ < E\X„\ = 2EX„+ - EXn < 2EX„+ - EX\).
But the condition Epjja, b) < oo contradicts assumption (3). Hence lim Xn
= Xm exists with probability 1, and then by Fatou's lemma
EIXJ^supEIX^oo.
n
This completes the proof of the theorem.
Corollary 1. If X is a nonpositive submartingale, then with probability 1 the
limit lim Xn exists and is finite.
Corollary 2.IfX = (X„, &„)„>i is a nonpositive submartingale, the sequence
X = (Xn, J*) with 1 < n <oo, Xm = lim Xn and &m = o{\j^n) is a (non-
positive) submartingale.
In fact, by Fatou's lemma
E^ = E lim X„ > Urn EXn > EXX > -oo
and (P-a.s.)
£(*«,|^J = E(lim Xn\^m) > IIS E(Xn\3Fm) > Xm.
Corollary 3. If X = (Xn, ^) is a nonnegative martingale, then lim Xn exists
with probability 1.
In fact, in that case
supE|XJ = supEX„ = EXX <oo,
and Theorem 1 is applicable.
2. Let ^, £2,... be a sequence of independent random variables with
P(6 = 0) = P(& = 2) = i Then X = (Xn, 3F$, with Xn = ftUi b and
&* = a{co: ^ £„} is a martingale with EX„ = 1 and Xn -► Xm = 0
(P-a.s.). At the same time, it is clear that E\X„ — Xm\ = 1 and therefore
X„ ->* Xm. Therefore condition (1) does not in general guarantee the
convergence of X„ to Xm in the L1 sense.
Theorem 2 below shows that if hypothesis (1) is strengthened to uniform
integrability of the family {X„\ (from which (1) follows by Subsection 4,
510
VII. Sequences of Random Variables That Form Martingales
§6, Chapter II), then besides almost sure convergence we also have
convergence in L1.
Theorem 2. Let X = {Xn, &„} be a uniformly integrable submartingale (that
is, the family {Xn} is uniformly integrable). Then there is a random variable
Xm with E | Xm | < oo, such that as n -» oo
XH-+Xn (P-a.s.), (4)
*„"*.. (5)
Moreover, the sequence X = (Xn, &"„), 1 < n < oo, with &m = ^(j^,), is
also a submartingale.
Proof. Statement (4) follows from Theorem 1, and (5) follows from (4) and
Theorem 4, §6, Chapter II.
Moreover, if A e !Fn and m > n, then
EIA\Xm-Xm\^0, m->oo,
and therefore
lim [ XmdP = f XmdP.
m-oo JA JA
The sequence {lAXmdP}m>„ is nondecreasing and therefore
f XndP< f XmdP< f X„dP,
Ja Ja Ja
whence Xn < E(Xm\^ (P-a.s.) for n > 1.
This completes the proof of the theorem.
Corollary. IfX = (X„, &\) is a submartingale and, for some p > 1,
supE|XJ'<a), (6)
n
then there is an integrable random variable Xm for which (4) and (5) are
satisfied.
For the proof, it is enough to observe that, by Lemma 3 of §6 of Chapter
II, condition (6) guarantees the uniform integrability of the family{X„}.
3. We now present a theorem on the continuity properties of conditional
expectations. This was one of the very first results concerning the
convergence of martingales.
Theorem 3 (P. Levy). Let (Q, &, P) be a probability space, and let (.¾.^
be a nondecreasing family of a-algebras, ^, c&2—'" — &> Let £ be a
random variable with E|f | < oo and &„ = c(\Jn J?,). Then, both P-a.s. and
in the L1 sense,
EGlFJ-EtflF.X ii-oo. (7)
§4. General Theorems on the Convergence of Submartingales and Martingales 511
Proof. Let Xn = E(f |J*), n > 1. Then, with a > 0 and b > 0,
f \XAdP< f E(\t\\Ft)dP = f \Z\dP
<J{\Xi\>a) J{\Xt\Z:a) J{\Xt\>a)
< f \Z\dP + f \Z\dP
J{\Xt\*a) n{|«| £6} J{\Xi\>a) n{|«|>b)
<bP{\Xt\>a} + f |£|<*P
J{|«|>6)
<^E|AT£| + f \Z\dP
a Ji\S\>b)
a J{|«|>6>
Letting a -» oo and then b -> oo, we obtain
lim sup f \Xi\dP = 0i
fl-oo £ J{|Jf4|>a}
i.e., the family {X„} is uniformly integrable.
Therefore, by Theorem 2, there is a random variable Xm such that
X„ = E(£\Fn)^>Xm ((P-a.s.) and in the L1 sense). Hence we only have to
show that
XO0 = E(t\^O0) (P-a.s.).
Let m> n and A e 2Fn. Then
J* X^P = J* *„</P = J E(£\Fn)dP = j ZdP.
Since the family {X„} is uniformly integrable and since, by Theorem 5,
§6, Chapter II, we have EIA\Xm - Xm\ -► 0 as m -»> oo, it follows that
J>"P=1
ZdP. (8)
This equation is satisfied for a\\Ae^„ and therefore for all A e (J*=, &„.
Since EIAT^I <og and E |^| <oo, the left-hand and right-hand sides of (8)
are ©--additive measures; possibly taking negative as well as positive values,
but finite and agreeing on the algebra (J*= Y 3Fn. Because of the uniqueness of
the extension of a ©--additive measure to an algebra over the smallest g-
algebra containing it (Caratheodory's theorem, §3, Chapter II, equation (8)
remains valid for sets AeFa> = g(\JF„). Thus,
f XndP= fJdP = f E{Z\Pn)dP, AePn
(9)
Since XK and E(£\^m) are .^-measurable, it follows from Property I
of Subsection 2, §6, Chapter II, and from (9), that Xm = E(f |J%J (P-a.s.).
This completes the proof cf the theorem.
512 VII. Sequences of Random Variables That Form Martingales
Corollary. A stochastic sequence X = (Xn, &>„) is a uniformly integrable
martingale if and only if there is a random variable £ with E|£| < oo such that
Xn = E(£\Fn)for all n > 1. Here Xn - E{Z\Fm) {both P-a.s. and in the L1
sense) as n-> oo.
In fact, if X = (X„,^„) is a uniformly integrable martingale, then by
Theorem 2 there is an integrable random variable Xm such that Xn -> Xm
(P-a.s. and in the L1 sense) and X„ = E^^IF,,). As the random variable £
we may take the .^-measurable variable Xm.
The converse follows from Theorem 3.
4. We now turn to some applications of these theorems.
Example 1. The "zero or one" law. Let f lf £2, • ■ • be a sequence of independent
random variables, &* = a{co: £t <!;„} and let X be the ©--algebra of the
"tail" events. By Theorem 3, we have E(/x|Fj)-> E(JJFi) = IA (P-a.s.).
But IA and (f t £„) are independent. Since E(IA\F^) = EIA and therefore
Ia = EIa (P-a.s.), we find that either P(A) = 0 or P(A) = 1.
The next two examples illustrate possible applications of the preceding
results to convergence theorems in analysis.
Example 2. If/ = /(x) satisfies a Lipschitz condition on [0,1), it is absolutely
continuous and, as is shown in courses in analysis, there is a (Lebesgue)
integrable function g = g(x) such that
fix)-f(0)= [Xg{y)dy. (10)
Jo
(In this sense, g(x) is a "derivative" of/(x).)
Let us show how this result can be deduced from Theorem 1.
Let Q = [0, 1), & = ^([0,1)), and let P denote Lebesgue measure. Put
*,^ v k-l(k-l ^ k\
^(x)=Z-^-/{-^-<x<-j,
&n = a{x: £i U = <t{x: £„}, and
y Atn + 2-»)-f(tn)
Since for a given £„ the random variable £n+1 takes only the values £„ and
£„ + 2-(n+1) with conditional probabilities equal to J, we have
E[*n+1|jg = ZLXn+l\M = 2"+1E[/(£n+1 + 2-<«+1>) -/«B+1)m
= 2w+i{K/(^ + 2-<w+i>) - /(^)] + \imn + 2~n) - mn + 2-"+i>m
= 2"{/(^ + 2~n) -/(4)} -AV
§4. General Theorems on the Convergence of Submartingales and Martingales 513
It follows that X = (Xnt ^„) is a martingale, and it is uniformly integrable
since \X„\ < L, where Lis the Lipschitzconstant: |/(x) - f(y)\ < L\x — y\.
Observe that & = ^([0, 1)) = a((J J*,). Therefore, by the corollary to
Theorem 3, there is an ^-measurable function g = g(x) such that X„^g
(P-a.s.) and
Xn = E[g\^]. (11)
Consider the set B = [0, fc/2"]. Then by (11)
(Jc\ ffc/2" ffc/2»
^j-/(0)=Jo Xndx=jQ 9i*)dx,
and since n and k are arbitrary, we obtain the required equation (10).
Example 3. Let Q = [0,1), ^ = ^([0,1)) and let P denote Lebesgue
measure. Consider the Haar system {H„(x)}„±l9 as defined in Example 3
of §11, Chapter II. Put ^„ = a{x:H1 H„} and observe that o{[j^n) = &.
From the properties of conditional expectations and the structure of the
Haar functions, it is easy to deduce that
E[/(x)|^„] = t "Mx) (P-a.s.), (12)
fc=i
for every Borel function/eL, where
^ = (f,Hk)= j f(x)Hk(x)dx.
In other words, the conditional expectation E[/(x)|^jJ is a partial sum of
the Fourier series of f(x) in the Haar system. Then if we apply Theorem 3
to the martingale we find that, as n -*■ oo,
t (f,Hk)Hk(x)^f(x) (P-a.s.)
fc=i
and
f I tu\Hk)Hk(x)-f(x)\dx^0.
JO |fc=l I
Example 4. Let (£n)n:>i be a sequence of random variables. By Theorem 2,
§10, Chapter II, the P-a.e. convergence of the series £ £„ implies its
convergence in probability and in distribution. It turns out that if the random
variables £lt £2, .•. are independent, the converse is also valid: the
convergence in distribution of the series £ £„ of independent random variables
implies its convergence in probability and with probability one.
Let Sn = £i + • • • + £„, n > 1 and Sn ^ S. Then Ee*s» - EeitS for every real
number t. It is clear that there is a <5 > 0 such that | EeitS\ > 0 for all |t| < <5.
Choose t0 so that |t0| < <5. Then there is an n0 = n0(t0) such that \EeitoS»\ >
c> 0 for all n > n0, where c is a constant.
For n > n0, we form the sequence X = {Xn, &n) with
514
VII. Sequences of Random Variables That Form Martingales
Since £lt £2,... were assumed to be independent, the sequence X = (Xn, &n)
is a martingale with
sup E|XJ < c~l < oo.
nSno
Then it follows from Theorem 1 that with probability one the limit lim„ X„
exists and is finite. Therefore, the limit,,-^ eitoSn also exists with probability
one. Consequently, we can assert that there is a 8 > 0 such that for each t in
the set T = {t: \t\ < 6} the limit \im„eitSn exists with probability one.
Let T x Q = {(t, co): t e T, co e Q}, let @(T) be the <7-algebra of Lebesgue
sets on T and let X be Lebesgue measure on (T, @(T)). Also, let
C = |(t, co) e T x Q: lim efts»(c>) existsi.
It is clear that C e M{J) <g) ^.
It was shown above that P(Ct) = 1 for every t e T, where Q = {co e Q:
(t, co) e C} is the section of C at the point t. By Fubini's theorem (Theorem 8,
§6, Chapter II)
f Jc(t, a>)«l(A x P) = f ( f Ic(t, co) dp) dX
= f P(Q) AX = X{T) = 28 > 0.
On the other hand, again by Fubini's theorem,
MT) = f /cfc a>)«l(A xP)={ dp({ /c(t, a>)<tt) = [ X(CJdP,
Jtxo Jo \Jt J Jo
where Cw = {v. (t, a>) e C}.
Hence, it follows that there is a set Q with P(Q) = 1 such that X(Ca) =
X(T) = 2d > 0 for all co e Q.
Consequently, we may say that for every co e Q the limit lim„ eitSn exists for
all t e Cm. In addition, the measure of C„ is positive. From this and Problem
8, it follows that the limit lim„ S„(co) exists and is finite for coeQ. Since
P(Q) = 1, the limit lim„ Sn(co) exists and is finite with probability one.
5. Problems
1. Let {&„} be a nonincreasing family of a-algebras, ^, 2 ^2 2 ... ^ = f)yn, and
let y\ be an integrable random variable. Establish the following analog of Theorem 3:
as« ->oo,
Efo IG*) -> E(f1GJ (P-a.s. and in the V sense).
2. Let £„ £2,... be a sequence of independent identically distributed random variables
with E |£,| <00 and E£, = m; let Sn = £, + •. • + <!;„.Having shown (see Problem
2, §7, Chapter II) that
E«i|S.. Sn+ „...) = EtfJS.) = ^ (P-a.s.),
§5. Sets of Convergence of Submartingales and Martingales
515
deduce from Problem 1 a stronger form of the law of large numbers: as n ->oo,
— -»• m (P-a.s. and in the L1 sense).
n
3. Establish the following result, which combines Lebesgue's dominated convergence
theorem and P. Levy's theorem. Let {£„}„>, be a sequence of random variables such
that £„ -> £ (P-a.s.), |£J < 17, E/y < 00 and {&m}m> x is a nondecreasing family of a-
algebras, with ^ = <k\j ^). Then
KmE«J^J = E«|^J (P-a.s.).
4. Establish formula (12).
5. Let Q = [0,1], & = ^([0, 1)), let P denote Lebesgue measure, and let./' = /(x) e L1.
Put
Mk+1)2"
/„(*) = 2" /00 rfy, *2"" < x < (k + 1)2"".'
Jfc2 —
Show that /„(x) -> /(x) (P-a.s.).
6. Let Q = [0, 1), & = <#([(),1)), let P denote Lebesgue measure and let f = /(x) e L1.
Continue this function periodically on [0, 2) and put
/„(x) = £ 2-/(x + /2-).
i=l
Show that f„(x) -> /(x) (P-a.s.).
7. Prove that Theorem 1 remains valid for generalized submartingales X = (Xn, &$,
if infm supn^m E(X„+1&J < 00 (P-a.s.).
8. Let an, n > 1, be a sequence of real numbers such that for all real numbers t with
|r I < <5, S > 0, the limit lim„ eita" exists. Prove that then the limit lim a„ exists and is
finite.
§5. Sets of Convergence of Submartingales and
Martingales
1. Let X = (Xn, ^)bea stochastic sequence. Let us denote by {X„ -►}, or
{ — co < lim Xn <co}, the set of sample points for which lim Xn exists and
is finite. Let us also write A c B (P-a.s.) if P(IA <lB)= 1-
If X is a submartingale and sup E | X„ \ < 00 (or, equivalently, if sup
E X* < co), then according to Theorem 1 of §4 we have
{X„^} = Q (P-a.s.).
Let us consider the structure of sets {Xn -»>} of convergence for
submartingales when the hypothesis sup E \Xn\ <co is not satisfied.
Let a > 0, and xa = inf{« > 1: Xn > a} with xa = co if { • } = 0.
Definition. A stochastic sequence X = (Xni ^) belongs to class C+ (X e C+)
if
516
VII. Sequences of Random Variables That Form Martingales
E(A*ta)+/{tfl<oo}<oo (1)
for every a > 0, where AXn = X„ - X„- lt X0 = 0.
It is evident that X e C+ if
Esup|AA:n|<oo (2)
n
or, all the more so, if
|AX„|<C<oo (P-a.s.), (3)
for all n > 1.
Theorem 1. If the submartingale XeC+ then
{sup X„< 00)=(^-+} (P-fl.s.). (4)
Proof. The inclusion {Xn -►} c {sup Xn < oo} is evident. To establish the
inclusion in the opposite direction, we consider the stopped submartingale
*to = (*^w>^Then,by(l),
supEXt+aA„ < a + E[X+ -/{Tfl <oo}]
n (5)
< 2a + E[(AXta)+ • /{tfl <oo}] <oo,
and therefore by Theorem 1 of §4,
{tfl= oo} <={*„-+} (p-a.s.).
But Ufl>o(*fl = oo} = {sup Xn <oo}; hence {sup X„ <oo} c {Xn -►}
(P-a.s.}.
This completes the proof of the theorem.
Corollary. Let X be a martingale with E suplAXJ <oo. Then (P-a.s.)
{Xn^}u{\jmXn= -oo,IimXn= +00} = Q. (6)
In fact, if we apply Theorem 1 to X and to — X, we find that (P-a.s.)
{urn*,, <oo} = (sup Xn <oo} = {Xn -},
(limXn > -00} = {inf Xn > -00} = {*„-+}.
Therefore (P-a.s.)
{\miXn<oo}u{\jmXn> -00}= {*„-},
which establishes (6).
Statement (6) means that, provided that E sup | AXn\ <oo, either almost
all trajectories of the martingale M have finite limits, or all behave very
badly, in the sense that fim X„ = + 00 and Urn Xn= — 00.
2. If £lt £2 is a sequence of independent random variables with E£,. = 0
and I & I < c < 00, then by Theorem I of §2, Chapter IV, the series Y£i
§5. Sets of Convergence of Submartingales and Martingales
517
converges (P-a.s.) if and only if £ Ef? < oo. The sequence X = (Xn, ^n) with
X„ = £x + • • + £„ and &„ = g{cq: £x £„}„ is asquare-integrable
martingale with <A">„ = Jj=i E£?> anc^ ^ proposition just stated can be
interpreted as follows:
«X>00 <oo} = {Xn -+} = Q (P-a.s.),
where <X>00 = limn<A:>n.
The following proposition generalizes this result to more general
martingales and submartingales.
Theorem 2. Let X = (Xn, &„) be a submartingale and
X„ = m„ + A„
its Doob decomposition.
(a) IfX is a nonnegative submartingale, then (P-a.s.)
{AK <oo} e {Xn -+} c {sup*„ <oo}. (7)
(b) IfXeC+ then(P-a.s.)
{*„-+} = {supX„ <oo} e {^ <oo}. (8)
(c) //A" is a nonnegative submartingale and X e C+, t/iew (P-a.s.)
{*„ -} = {supX„ <ao} = {Am <oo}. (9)
Proof, (a) The second inclusion in (7) is obvious. To establish the first
inclusion we introduce the times
aa = inf{n > 1: An+l > a}, a> 0,
taking ca = + oo if { • } = 0. Then ACa < a and by Corollary 1 to Theorem
1 of §2, we have
EXn^a = EAn„,a<a.
Let Y° = Xn„aa. Then Ya = (Yan, &„) is a submartingale with supEy° <
a <oo. Since the martingale is nonnegative, it follows from Theorem 1,
§4, that (P-a.s.)
{Am<a} = {ca= co} <={Xn^}.
Therefore (P-a.s.)
{Am<cc}= (J {Am<a}^{Xn^}.
a>0
(b) The first equation follows from Theorem 1. To prove the second, we
notice that, in accordance with (5),
E^taAn = EXtaA„ < EXt:A„ < 2a + E[(AXta)+/{tfl <oo}]
and therefore
EAZa = E WmAXaAn <oo.
518
VII. Sequences of Random Variables That Form Martingales
Hence {xa = 00} ^ {,4^ <oo} and we obtain the required conclusion since
U«>o i*a = 00} = {supX„ <oo}.
(c) This is an immediate consequence of (a) and (b).
This completes the proof of the theorem.
Remark. The hypothesis that X is nonnegative can be replaced by the
hypothesis sup„ EX~ < 00.
Corollary 1. Let Xn = ^ + • • • + fn, where f, > 0, Ef, <oo, £• are Jv
measurable, and &0 = {0, Q}. Then (P-a.s.)
|£ E&l^-iXooje-
supn £„ < 00 then (P-a.s.)
{XEOU^_0<ooj ={*„-}• (ID
E(U^-i)<°oj <={*„-+}, (10)
and if, in addition, E sup„ £„ < 00 r/iew (P-a.s.)
Corollary 2 (Borel-Cantelli-Levy Lemma). //" t/ze events Bn e 3Fn, then if we
put £„ = IBn in (11), we find that
[l P«J*-i) <«>} = {t '*. «*>}• (12)
3. Theorem 3. Let M = (M„, ^)n:>i be a square-integrable martingale. Then
(P-a.s.)
{<M>m <00} c{Mn-}. (13)
//a/so E sup IAMJ2 <oo, then (P-a.s.)
{<M>00 <00} = {M„-+}, (14)
w/zere
<M>ro= £ E((AMn)2|^_0 (15)
n= 1
wit/z M0 = 0, J^o = (0. ^)-
Proof. Consider the two submartingales M2 = (M2, ^) and (M + 1)2 =
((M + l)2, ^,). Let their Doob decompositions be
M2n =m'n + A'n, (M„ + 1)2 = m"n + A!'n.
Then An and ylj,' are the same, since
A'n= t E(AMfc2|J^_1)= t EaAM^I^.O
jt=i jt=i
§5. Sets of Convergence of Submartingales and Martingales
519
and
A'H = £ E(A(Mfc + im-0 = t E(AM?|^_,)
fc=l fc=l
= £ EaAM,)2!^^)-
fc=l
Hence (7) implies that (P-a.s.)
{<M>m <oo} = {A'm <oo} c {M2n -+} n {(M„ + 1)2 -+} = {M„ ->}.
Because of (9), equation (14) will be established if we show that the
condition E sup|AMJ2 <oo guarantees that M2 belongs to C+.
Let tfl = inf{« > 1: M2 > a), a > 0. Then, on the set {xa <oo},
|AM2J = \M2a - Ml.A < \MZa - MXa_x\2
+ 21 Mta_, | • | MXa - Mta_ 11 < (AMJ2 + 2a1'2 \ AMla|,
whence
E|AM2J/{tfl <oo} < E(AMta)2/{tfl <oo} + 2ali2y/E(AMJ2I{ra <oo}
< E sup | AM„|2 + 2a1/VEsup|AMJ2 <oo.
This completes the proof of the theorem.
As an illustration of this theorem, we present the following result, which
can be considered as a distinctive version of the strong law of large numbers
for square-integrable martingales (compare Theorem 2 of §3, Chapter IV
and Corollary 2 of Subsection 3 of §3).
Theorem 4. Let M = (Mni&r„) be a square-integrable martingale and let
A = (An, ^_i) be a predictable increasing sequence with A± > 1, Am = oo
(P-fl.S.)-
If(P-a.s.)
then
gEKAMglF..,]
,= i Af
Mn/An^0t n-+oo, (17)
with probability 1.
In particular, if <M> = (M„, ^,) is t/ze quadratic characteristic of the square-
integrable martingale, M = (Mnt^n) and <M>ro = oo (P-a.s.% then with
probability 1
M
wrr0, "^°°- (18)
520
VII. Sequences of Random Variables That Form Martingales
Proof. Consider the square-integrable martingale m = (m„, &\) with
Then
" AM,-
i=l "-i
<^=iE[(Ay-3. (19)
i= 1 Ai
Since
Mw = S=1^fcAmfc
we have, by Kronecker's lemma (§3, Chapter IV), MJAn -»> 0 (P-a.s.) if the
limit limnm„ exists (finite) with probability 1. By (13),
«m>00<oo}c {m„-+}. (20)
Therefore it follows from (19) that (16) is a sufficient condition for (17).
If now A„ = <M>„, then (16) is automatically satisfied (see Problem 6)
and consequently we have
-^--0 (P-a.s.).
This completes the proof of the theorem.
Example. Consider a sequence f lf f 2l... of independent random variables
with E£,-= 0, V& = Vt > 0, and let the sequence X = {Xn}n^.0 be defined
recursively by
XH+l = 0XH+ZH+l9 (21)
where X0 is independent of £i,£2»--- anc* 0 *s an unknown parameter,
— oo < 6 <oo.
We interpret Xn as the result of an observation made at time n and ask
for an estimator of the unknown parameter 6. As an estimator of 6 in terms
ofXo,*! Xn, we take
a.°".°-/fe . (22)
fc = 0 Kfc+1
taking this to be 0 if the denominator is 0. (The number 0„ is the least-squares
estimator.)
It is clear from (21) and (22) that
6 = 6+^,
§5. Sets of Convergence of Submartingales and Martingales
521
where
M„ = "l £M. A, = <M>„ = "l ^-.
fc = 0 ^fc+1 fc = 0 yk+l
Therefore if the true value of the unknown parameter is 0, then
P0„ - 0) = 1, (23)
when (P-a.s.)
M
—2-0, n -oo. (24)
K
(An estimator 0„ with property (23) is said to be strongly consistent; compare
the notion of consistency in §7, Chapter I.) Let us show that the conditions
sup%i<oo, IE(fAl)=0° (25)
K
are sufficient for (24), and therefore sufficient for (23).
We have
n=l\Kn / n = 1 vn n = 1 yn
Therefore
{Ii(|aiWoo|e{<M>00=oo}.
By the three-series theorem (Theorem 3 of §2, Chapter IV) the divergence
ofZn°°=iE((£n7K,) a l)guaranteesthedivergence(P-a.s.)of5>i((^2/K,) a 1).
Therefore P«M>00 = oo} = 1. Moreover, if
" AM,-
then
<m> = f wh
and (see Problem 6) POw^ <oo) = 1. Hence (24) follows directly from
Theorem 4.
(In Subsection 5 of the next section we continue the discussion of this
example for Gaussian variables £lt £2 )
522
VII. Sequences of Random Variables That Form Martingales
Theorem 5. Let X = (X„, ^) be a submartingale, and let
Xn = m„ + A„
be its Doob decomposition. If\AXn\ < C, then (P-a.s.)
«m>00 + ^00<oo} = {Xn-}, (26)
or equivalently,
£ E[AX„ + (AX„)2|Js;-i] <oo j = {Xn -}. (27)
Proof. Since
An= £ E^XJ^^), (28)
fc=i
and
mn= t [AXfc-E(AXfc|^_1)]> (29)
fc=i
it follows from the assumption that \AXk\ < C that the martingale m =
(m„, ^) is square-integrable with |Am„| < 2C. Then by (13)
{<m>m + Am<oo} <={*„-} (30)
and according to (8)
{Xn^}cz{Am<oo}.
Therefore, by (14) and (20),
{Xn -+} = {Xn -+} n {Am <oo} = {Xn -+} n {Am <oo} n {mn -+}
= {*„ -} ri {^oo <oo} n Km)^ <oo}
= {*„ -} n {/!«, + <m>ro <oo} = {Am + <!»>«, <oo}.
Finally, the equivalence of (26) and (27) follows because, by (29),
<m>„ = BEKAX*)2!^] - [E^XJ^)]2},
and the convergence of the series Y*=i E(AXfc|^_j) of nonnegative terms
implies the convergence of ]T£°=1 [E(AA'fc|^_1)]2. This completes the proof.
4. Kolmogorov's three-series theorem (Theorem 3 of §2, Chapter IV)
gives a necessary and sufficient condition for the convergence, with
probability 1, of a series Y£n °f independent random variables. The following
theorems, whose proofs are based on Theorems 2 and 3, describe sets of
convergence of Jj£n without the assumption that the random variables
€ii €2» • • • axe independent.
§5. Sets of Convergence of Submartingalcs and Martingales 523
Theorem 6. Let £ = (£ni&\)t n > 1, be a stochastic sequence, let ^0 =
{0, Q}, and let cbe a positive constant. Then the series Yfin converges on the
set A of sample points for which the three series
converge, where ?n = fnI(|fn| < c).
Proof. Let X„ = J£=1£k. Since the series ]TP(|fn| ^ c\&n-\) converges,
by Corollary 2 of Theorem 2, and by the convergence of the series
lE^I^-xXwehave
A n {Xn -+} = A n { £ fc/flfej < c) -}
= X n { £ [fe/a&l < c) - E(&/fl&| < c)| J^)] ->}.
(31)
Let ^ = &/a&| < c) - Efo/fl&l < cXJ^-O and let Yn = £^¾. Then
Y = (Yni &„) is a square-integrable martingale with |??fc| < 2c. By Theorem
3, we have
A S (I V(£|^„-i) < oo} = «y>00 < oo} = {y„ -}. (32)
Then it follows from (31) that
An{Xn^} = A,
and therefore A ^ {jfn -».}. This completes the proof.
5. Problems
1. Show that if a submartingale X = (Xn, ^) satisfies E supJXJ <oo, then it belongs
to class C+.
2. Show that Theorems 1 and 2 remain valid for generalized submartingales.
3. Show that generalized submartingales satisfy (P-a.s.) the inclusion
jinf supE(X; \&m) < ool c {Xn -}.
4. Show that the corollary to Theorem 1 remains valid for generalized martingales.
5. Show that every generalized submartingale of class C+ is a local submartingale.
6. Let an > 0, n > 1, and let bn = JJ= t ak. Show that
524
VII. Sequences of Random Variables That Form Martingales
§6. Absolute Continuity and Singularity of
Probability Distributions
1. Let (Q, &) be a measurable space on which there is defined a family
(.&n)n> l of ff-algebras such that ^ c g?2 c ... c & and
■-<&4
a)
Let us suppose that two probability measures P and P are given on (Q, &).
Let us write
for the restrictions of these measures to ^,, i.e., let P„ and P„ be measures on
(Q, J*) and for Be J* let
Pn(B) = P(fl), Pn(B) = P(B).
Definition 1. The probability measure P is absolutely continuous with respect
to P (notation, P « P) if P(A) = 0 whenever P(A) = 0,Ae&
When P « P and P « P the measures P and P are equivalent (notation,
P~P).
The measures P and P are singular (or orthogonal) if there is a set A e !F
such that P(A) = 1 and P(A) = 1 (notation, P ± P).
Definition 2. We say that P is locally absolutely continuous with respect to
P (notation, P << P) if
Pn « P„ (2)
for every n > 1.
The fundamental question that we shall consider in this section is the
determination of conditions under which local absolute continuity P «c P
implies one of the properties P « P, P ~ P, P ± P. It will become clear that
martingale theory is the mathematical apparatus that lets us give definitive
answers to these questions.
Let us then suppose that P << P. We write
the Radon-Nikodym derivative of Pn with respect to P„. It is clear that
z„ is .^-measurable; and if A e 3Fn then
J A JAar„+i
,(A)
§6. Absolute Continuity and Singularity of Probability Distributions
525
It follows that, with respect to P, the stochastic sequence Z — (z„, ^)n2. i is a
martingale.
Write
zm = fimz„.
Since Ez„ = 1, it follows from Theorem 1, §4, that lim z„ exists P-a.s. and
therefore P(zm — lim zn) = 1. (In the course of the proof of Theorem 1 it
will be established that lim zn exists also for P, so that P(zm — lim z„) — 1.)
The key to problems on absolute continuity and singularity is Lebesgue's
decomposition.
Theorem 1. Let P << P. Then for every Ae&,
P(A) = j zmdP + P{A n (zro = oo)}, (3)
and the measures /j,(A) — P{A n (zm — oo)} and P(/4), A e^, are singular.
Proof. Let us notice first that the classical Lebesgue decomposition shows
that if P and P are two measures, there are unique measures X and /j. such
that P = X + n, where X « P and n ± P. Conclusion (3) can be thought of
as a specialization of this decomposition under the assumption that Pn «
Let us introduce the probability measures
0 = i(P + P), Qn = ¥Pn + Pn\ n > 1,
and the notation
dP dP ~dP_n ^ =dPr
3 dQ' 3 dQ' 3n dQn' 3n dQn'
Since P(g = 0) = P(3 = 0) = 0, we have Q(g = 0, 3 = 0) = 0. Consequently
the product 3-3-1 can be defined consistently on the set Q\{3 = 0, 3 = 0};
we define it to be zero on the set {3 = 0, 3 = 0}.
Since P„ « P„ « Q„, we have (see (11.7.36))
dQn~dPn dQn (0aS) (4)
i.e.,
3n = 2„3n (0-a.s.) (5)
whence
Zn = 3n'3n-1 (O^-S.)
where, as before, we take §„ • 3"1 = 0 on the set (3„ = 0, 3„ = 0}, which is of
0-measure zero.
526
VII. Sequences of Random Variables That Form Martingales
Each of the sequences (g„, &\) and (3„, 9\) is (with respect to 0) a uniformly
integrable martingale and consequently the limits lim %n and lim 3„ exist.
Moreover (0-a.s.)
lim %n = 3, lim 3„ = 3. (6)
From this and the equations z„ = g„3~1 (0-a.s.) and Q(g = 0, 3 = 0) = 0,
it follows that (0-a.s.) the limit lim z„ = zm exists and is equal to g • 3"1.
It is clear that P « 0 and P « Q. Therefore lim z„ exists both with respect
to P and with respect to P.
Now let
X(A) = J* zmdP, n{A) = P{A n (zm = 00)}.
To establish (3), we must show that
P(A) = X(A) + n(A\ A«P, ix LP.
We have
P(A) = jj dQ = J* g3f dQ + J* 3[1 - 3f] <*Q
= £i^P +J D-3f]^P= J^co^P + PM^(3 = 0)}, (7)
where the last equation follows from
p{f = 3-^ = 1, P{*.-j-r,) = i.
Furthermore,
P{A n (3 = 0)} = P{A n (3 = 0) n (3 > 0)}
= P{^n(3-3_1 = 00)}= P{An(zm = 00)},
which, together with (7), establishes (3).
It is clear from the construction of A that X « P and that P(zro < 00) = 1.
But we also have
ix{zm <oo) = P{(zm <oo) n (zm = 00)} = 0.
Consequently the theorem is proved.
The Lebesgue decomposition (3) implies the following useful tests for
absolute continuity or singularity for locally absolutely continuous
probability measures.
§6. Absolute Continuity and Singularity of Probability Distributions 527
Theorem 2. Let P << P, i.e. P„«Pn,n>\. Then
P«PoE:00= loP(z00<oo)= 1,
P lPoE:ffl = OoP(zm = oo) = 1,
where E denotes averaging with respect to P.
Proof. Putting A = Q in (3), we find that
Ezro= loP(Zoo = oo) = 0, (10)
E2co = OoP(Z(0 = 00)= 1. (11)
If P(zx = x) = 0, it again follows from (3) that P « P.
Conversely, let P « P. Then since P(zro = oo) = 0, we have P(zra = oo)
= 0.
In addition, if P ± P there is a set B e & with P(B) = 1 and P(B) = 0.
Then P(B n (zm = cc)) = 1 by (3), and therefore P(zro = oo) = 1. If, on the
other hand, P(zm = oo) = 1 the property P ± P is evident, since
P(zoc = x) = 0.
This completes the proof of the theorem.
2. It is clear from Theorem 2 that the tests for absolute continuity or
singularity can be expressed either in terms of P (verify the equation Ezro = 1 or
Ezro = 0), or in terms of P (verify that P(zro < oo) = 1 or that P(zro = oo) = 1).
By Theorem 5 of §6, Chapter II, the condition Ez^ = 1 is equivalent to
the uniform integrability (with respect to P) of the family {z„}„>i. This
allows us to give simple sufficient conditions for the absolute continuity
P « P. For example, if
supE[z„ln+zJ <oo (12)
n
or if
supEz„1+E <oo, e>0, (13)
n
then, by Lemma 3 of §6, Chapter II, the family of random variables {z„}n2:1
is uniformly integrable and therefore P « P.
In many cases it is preferable to verify the property of absolute continuity
or of singularity by using a test in terms of P, since then the question is
reduced to the investigation of the probability of the "tail" event {zm <oo},
where one can use propositions like the "zero-one" law.
Let us show, by way of illustration, that the "Kakutani dichotomy"
can be deduced from Theorem 2.
Let (Q, ^, P) be a probability space, let (R00, @m) be a measurable space
of sequences x = (x,, x2,...) of numbers with &m = ^(Rm)t and let @„ =
o{x: {xj,..., x„)}. Let £ = (f lf f2, •. •) and { = (Jlf f2,...) be sequences
of independent random variables.
(8)
(9)
528
VII. Sequences of Random Variables That Form Martingales
Let P and P be the probability distributions on (R00, @m) for f and |,
respectively, i.e.
P(B) = P{f e B}, P(B) = P{ J e B}, Be@m.
Also let
Pn = P\@n, Pn = P\@n
be the restrictions of P and P to ^„ and let
PsM) = p(£„ e A), P-^A) = P(J„ e A\ Ae ^(R1).
Theorem 3 (Kakutani Dichotomy). Let £ = (f lf f2,...) <wuf | = (|lf ?2» • • •)
fee sequences of independent random variables for which
pln«^n. " ^ 1- (14)
T/zen eit/zer P « P or P ± P.
. loc r
Proof. Condition (14) is evidently equivalent to P„ « P„, n > 1, i.e. P << P.
It is clear that
where
dP* , ^ / ^
Zn =Jp= ^l(Xl)- -««(4
^.)=^(^)- (15>
Consequently
{x:zro <oo} = {x:lnzro <oo} :
jx: X lng£(x£)<ooj-.
The event {x: YT=i ^n 3i(xi) <°°} 1S a tail event. Therefore, by the Kolmo-
gorov zero-one law (Theorem 1 of §1, Chapter IV) the probability
P{x:zaa <oo} has only two values (0 or 1), and therefore by Theorem 2
either P 1 P or P « P.
This completes the proof of the theorem.
3. The following theorem provides, in "predictable" terms, a test for absolute
continuity or singularity.
Theorem 4. Let P «c P and let
a„ = z„z®_lt n>l,
with z0 = 1. Then (with &0 = {0,Q})
P « PopjJ, [1 - E^IJVi)] < ooj = 1,
P ± P~p{ £ [1 - E(V^-i)] = ooj = 1.
(16)
(17)
§6. Absolute Continuity and Singularity of Probability Distributions
529
Proof. Since
Pn{zn = 0} = f zndP = 0,
J[z„=0)
we have (P-a.s.)
zn=nocfc = exp{t lnoj. (18)
fc=i U=i J
Putting A = {zm = 0} in (3), we find that P{zm = 0} = 0. Therefore, by
(18), we have (P-a.s.)
{zm <oo} = {0 < zm <oo} = {0 < limz„ <oo}
= -J — oo < lim £ In afc < oo >.
(19)
Let us introduce the function
u(x) = {*' ,x| - U
(signx, |x| > 1.
< - oo < lim £ In afc <oo > = < - oo < lim £ w(ln afc) <oo >.
(20)
Let E denote averaging with respect to P and let y\ be an ^-measurable
integrable random variable. It follows from the properties of conditional
expectations (Problem 4) that
2^(1^-1) = EfozJ^,) (P- and P-a.s.), (21)
Uril&n-iy-^-iHrizJ^-J (P-a.s.). (22)
Recalling that oc„ = zJ®_1z„, we obtain the following useful formula from
(22):
^1^-0 = EOx^lJ^) (P-a.s.). (23)
From this it follows, in particular, that
E(aJ^-i)=l (P-a.s.). (24)
By (23),
E[M(ln an)\^n- J = E[a„M(ln *n)\SFn_ {] (P-a.s.).
Since xw(ln x) > x - 1 for x > 0, we have, by (24),
ElXlnoOI^.J^O (^-a.s.).
530
VII. Sequences of Random Variables That Form Martingales
It follows that the stochastic sequence X = (X„, ^) with
n
Xn = £ «(ln afc)
fc=i
is a submartingale with respect to P; and | AXn\ = |w(ln a„)| < 1.
Then, by Theorem 5 of §5, we have (P-a.s.)
\ -co < lim £ w(lnafc) <ooi = \ £ E[w(lnafc) + i/2(lnafc)|^'fc_1]<ool.
(25)
Hence we find, by combining (19), (20), (22), and (25), that (P-a.s.)
{zm <oo} = j £ E[«(lnafc) + u2(lnafc)|^_1]<ool
= J Z E[afc«(ln «fc) + afcw2(ln aJI^-J < *>i
and consequently, by Theorem 2,
P«Po P!I E[>fcM(ln ocfc) + afcM2(ln afc)|^_ t] < ooj = 1, (26)
PlPoP{I Efeudnocfc) + ocfcM2(lnocfc)|^_J = ooj = 1. (27)
We now observe that by (24),
E[(l - x/^)2l^-i] = 2E[1 - 7^1^-J (P-a.s.)
and for x > 0 there are constants ^1 and B{Q < A < B <co) such that
/1(1 - V*)2 ^ xw (In x) + XM2(ln x) + 1 - x < 5(1 - ^/x)2. (28)
Hence (16) and (17) follow from (26), (27) and (24), (28).
This completes the proof of the theorem.
Corollary 1. If for all n> 1, the a-algebras a(a„) and 8Fn-x are independent
with respect to P(or P), and P << P, then we have the dichotomy: either
P « P or P ± P. Correspondingly,
00
P«PoJ [1 - E^/oJ < oo,
n= 1
PlPof [l-E^^oo.
§6. Absolute Continuity and Singularity of Probability Distributions
531
In particular, in the Kakutani situation (see Theorem 3) a„ = q„ and
P«Pot [l-Ev/^00]<oo,
P±Po£ [l-EV^OO]=oo.
Corollary 2. Let P << P. Then
pjf E(anlnan|^_l)<oo| =
1 => p « P.
For the proof, it is enough to notice that
x In x + f (1 - x) > 1 - xl/2, (29)
for all x > 0, and apply (16) and (24).
Corollary 3. Since the series ££L i [1 — E(\/aJ^i-i)]» which has nonnegative
(P-a.s.) terms, converges or diverges with the series J] | In E(>/o^ 1^,.^1,
conclusions (16) and (17) of Theorem 4 can be put in the form
P«Pop|ZHnE(x/^|^_1)l<oo|=l, (30)
PXPopjf llnE^I^OHooj^l. (31)
Corollary 4. Let there exist constants A and B such that 0<A<1,B>0 and
P{1 -A <an< 1 +B} = 1, «> 1.
Then ifp « P we have
P « Popjl E[(l - <02|JS-i] <oo| = 1,
P X Popj'f E[(l - an)2|JS-i] =oo| = 1.
For the proof it is enough to notice that when x e [1 — A, 1 + B], where
0 < A < 1, B > 0, there are constants c and C (0 < c < C < oo) such that
c(l - x)2 < (1 - ,/x)2 < C(l - x)2. (32)
4. With the notation of Subsection 2, let us suppose that f = (f lf f2,...)
and J = (Si, ?2» • ■ ■) are Gaussian sequences, P„ ~ P„, « > 1. Let us show
that, for such sequences, the "Hajek-Feldman dichotomy," either P ~ P
or P X P, follows from the "predictable" test given above.
By the theorem on normal correlation (Theorem 2 of §13, Chapter II)
the conditional expectations M(x„ |^„_ t) and M(x„|^„_ t), where M and M
532
VII. Sequences of Random Variables That Form Martingales
are averages with respect to P and P, respectively, are linear functions of
xx Vi- We denote these linear functions by a„_x(x) and a„_ x(x) and
put
^-1 = (^^,,-¾.^)]2)172.
£,-^(^^-¾.^)]2)1^
Again by the theorem on normal correlation, there are sequences
e = (fij, e2,...) and £ = (^,62,-..) of independent Gaussian random
variables with zero means and unit variances, such that
xn = an- iW + K- ie„, (P-a.s.),
(33)
x„ = a„_ x(x) + fc„_ ^,, (P-a.s.).
Notice that if bn-x = 0, or B„-l = 0, it is generally necessary to extend
the probability space in order to construct e„ or e„. However, if bn-1 = 0
the extended vector (xlt..., x„) will be contained (P-a.s.) in the linear
manifold x„ = a„_ ^x), and since by hypothesis P„ ~ P„, we have bn_l = 0,
a„_! = a„_ i(x), and a„(x) = 1 (P- or P-a.s.). Hence we may suppose without
loss of generality that b2 > 0, S2 > 0 for all n > 1, since otherwise the
contribution of the corresponding terms of the sum Y^= 1 [1 — M^/o^,\Bn_ x]
(see (16) and (17)) is zero.
Using the Gaussian hypothesis, we find from (33) that, for n > 1,
a - d~l ex J- (*n-fln-i(*))2 , (xw-flw-1(x))2]
oc-^expj ^^ + ^ J. (34)
where d„ = |5„ • £~ *| and
a0(x) = Eflf a0(x) = EJl
From (34),
Since In [2^,^/(1 + d2_i)] ^ 0, statement (30) can be written in the form
= I- (35)
The series
00 1 , J2 00
£taI + *=I and J] (*-i-l)
§6. Absolute Continuity and Singularity of Probability Distributions
533
converge or diverge together; hence it follows from (35) that
where A„(x) = a„(x) - a„(x).
Since an(x) and a„(x) are linear, the sequence of random variables
{An(x)/b„}n;>0 is a Gaussian system (with respect to both P and P). As follows
from a lemma that will be proved below, such sequences satisfy an analog
of the zero-one law:
>m (36) that
y
Then it is clear that if P and P are not singular measures, we have P « P.
But by hypothesis, Pn ~ P„, n > 1; hence by symmetry we have P « P.
Therefore we have the following theorem.
Theorem 5 (Hajek-Feldman Dichotomy). Let £ = (£,, £2,.. .) and | =
(li» ^2 > • • •) be Gaussian sequences whose finite-dimensional distributions are
equivalent: Pn ~ Pnt n > 1. Then either P ~ P or P LP. Moreover,
P
Hence it follows from (36) that
P«
and in a similar way
Let us now prove the zero-one law for Gaussian sequences that we need
for the proof of Theorem 5.
Lemma. Let (5 = (/?„)„ ^j be a Gaussian sequence defined on (Q, 3F\ P). 77zett
p{£ /¾ <ooj = 1 - £ E£ <oo. (39)
Proof. The implication <= follows from Fubini's theorem. To establish the
opposite proposition, we first suppose that E/?„ = 0, n > 1. Here it is enough
to show that
E £ ft <; |eexp(- f ptjT2, (40)
534
VII. Sequences of Random Variables That Form Martingales
since then the condition P{£02 <oo} = 1 will imply that the right-hand
side of (40) is finite.
Select an n > 1. Then it follows from §§11 and 13, Chapter II, that there
are independent Gaussian random variables ft„, k = 1,..., r < n, with
Eft,„ = 0, such that
£ Pi = £ PL-
fc=i fc=i
If we write E/?2t„ = Aki„, we easily see that
E I ft2.- = I 4.» (41)
fc=l fc=l
and
E exp(- £ Pl„) = lid + 2^)-^2. (42)
\ fc=i / fc=i
Comparing the right-hand sides of (41) and (42), we obtain
E fci Pi = E £ Pitn <, [e exp(- £ ft2„)J 2 = Je exp(- £ ft2)] *,
from which, by letting n ->oo, we obtain the required inequality (40).
Now suppose that Eft =£ 0.
Let us consider again the sequence /7 = (/?„)„>! with the same distribution
as p = (/?„)„>! but independent of it (if necessary, extending the original
probability space). If P{£n°°=1ft2 < oo} = 1, then P{ff=1(i?n - ft)2 < oo} = 1,
and by what we have proved,
2 f E(ft - Eft)2 = f E(ft-ft)2<oo.
n=l n=1
Since
(Eft)2 < 2ft2 + 2(ft - Eft)2,
we have ££L x(Eft)2 < oo and therefore
£ Eft2 = f (Eft)2 + f E(ft - Mft)2 <oo.
This completes the proof of the lemma.
5. We continue the discussion of the example in Subsection 3 of the preceding
section, assuming that f0» f i» ■ ■ ■ are independent Gaussian random variables
with Hi = 0, V& = Vt > 0.
Again we let
§6. Absolute Continuity and Singularity of Probability Distributions
535
for n > 1, where X0 = £0, and the unknown parameter 6 that is to be
estimated has values in R. Let 0„ be the least-squares estimator (see (5.22)).
Theorem 6. A necessary and sufficient condition for the estimator 6n,n > 1, to
be strongly consistent is that
V
(43)
„ = 0 vn+\
Proof. Sufficiency. Let P0 denote the probability distribution on (R00, @m)
corresponding to the sequence (A"0, Xl9...) when the true value of the
unknown parameter is 6. Let E0 denote an average with respect to P0.
We have already seen that
<M>„'
where
fc = 0 vk+\
According to the lemma from the preceding subsection,
Pe«M>m = oo) = 1 <>Ee<M>m = oo,
i.e., <M>ro = oo (P0-a.s.) if and only if
But
and
00 c v2
I^R=°°- (44)
Eexi = £ e»vt-,
fc = 0 vk+\ fc=0 yk+l \i = 0 /
oo oo j/ oo 1/ °° / °° 1/ \
fc = 0 , = fc ^i+l £ = 0 Ki+1 fc=l \i = k yi+l/
(45)
Hence (44) follows from (43) and therefore, by Theorem 4, the estimator
0„, n > 1, is strongly consistent for every 6.
Necessity. For all 6e R, let P0(0„ -► 0) = 1. It follows that if 6t # 02, the
measures P0l and P02 are singular (P0I ± P02). In fact, since the sequence
(X0i Xlt...) is Gaussian, by Theorem 5 of §5 the measures P01 and P02 are
either singular or equivalent. But they cannot be equivalent, since if P01 ~ P02,
536
VII. Sequences of Random Variables That Form Martingales
but P01(6„ - 0X) = 1, then also P02(6n - 0X) = 1. However, by hypothesis,
Pelfin ->02)= 1 and 02 # 0V Therefore P01 ± P02 for 0X # 02.
According to (5.38),
P011^0(0, - 02)2 I ^,[^-1 = oo
fc=o LKfc+iJ
for 0j # 02. Taking 0X = 0 and 02 # 0, we obtain from (45) that
«=o Vi+1
which establishes the necessity of (43).
This completes the proof of the theorem.
6. Problems
1. Prove (6).
2. Let Pn ~ P„, n > 1. Show that
p ~ P<=>p{Zoo <oo} = Pfz^ > 0} = 1,
P_LP<=>p{zw = oo} = l or P{Zoo = 0} = l.
3. Let P„ <Pn,n> 1, let t be a stopping time (with respect to (J*)), and let Px = P |J*
and Pt = P|^v be the restrictions of P and P to the a-algebra ^. Show that Px -4 Pt
if and only if {t = oo} = {zm <oo} (P-a.s.). (In particular, if P{x <oo} = 1 then
Pt < P..)
4. Prove (21) and (22).
5. Verify (28), (29), and (32).
6. Prove (34).
7. In Subsection 2 let the sequences f = (f lf f2, • • •) and £ = (£i» ^2.--) consist of
independent identically distributed random variables. Show that if P^ < P^, then
P <$ P if and only if the measures P?1 and P^, coincide. If, however, P^, <^ P^, and
P|, # P^, then P _L P.
§7. Asymptotics of the Probability of the Outcome of
a Random Walk with Curvilinear Boundary
1. Let £ lf £2,... be a sequence of independent identically distributed random
variables. Let S„ = ^ + ■ • • + £n, let # = #(«) be a "boundary," n > 1, and
let
T = inf{n> l:S„<0(n)}
be the first time at which the random walk (S„) is found below the boundary
g = g(n). (As usual, x = oo if { • } = 0.)
§7. Asymptotics of the Probability of the Outcome of a Random Walk
537
It is difficult to discover the exact form of the distribution of the time t.
In the present section we find the asymptotic form of the probability P(t > n)
as n -► oo, for a wide class of boundaries g = g(n) and assuming that the £.
are normally distributed. The method of proof is based on the idea of an
absolutely continuous change of measure together with a number of the
properties-of martingales and Markov times that were presented earlier.
Theorem 1. Let £i,£2»«" be independent identically distributed random
variables, with ££ ~./K(0, 1). Suppose that g = g(n) is such that g{\) < 0 and,
for n > 2,
0<Ag(n+l)< Ag(n\ (1)
where Agr(«) = g(n) — g(n — 1) and
lnw = of f [A^(fc)]2V «-oo.
P(t >n) = expj-i f2[A0(/c)]2(l + o(l))j, n -+ (
Before starting the proof, let us observe that (1) and (2) are satisfied if,
for example,
g(n) = anv + b, \ < v < 1, a + b < 0,
or (for sufficiently large n)
gin) = nvL(n\ \ < v < 1,
where L(n) is a slowly varying function (for example, L(n) = C(ln n)p with
arbitrary /? for \ < v < 1 or with /? > 0 for v = i).
2. We shall need the following two auxiliary propositions for the proof of
Theorem 1.
Let us suppose that £lt£2»--- ls a sequence of independent identically
distributed random variables, £• ~^(0, 1). Let ^o = {0tQ}i^n =
(j{co: £lt..., £„}, and let 0 = (0,,,^.,) be a predictable sequence with
P(|a„| < C) = 1, n > 1, where C is a constant. Form the sequence z =
(7,,,¾ with
, = exp||; «»&-- f a2l, n> 1.
(4)
It is easily verified that (with respect to P) the sequence z = (zni &\) is a
martingale with Ez„ = 1, w > 1.
Choose a value « > 1 and introduce a probability measure Pn on the
measurable space (Q, ^) by putting
P„04) = EI(A)zni Ae3?n. (5)
538
VII. Sequences of Random Variables That Form Martingales
Lemma 1. With respect to P„, the random variables £fc = £fc — afc, 1 < k < n,
are independent and normally distributed, lk ~./K(0, 1).
Proof. Let £„ denote averaging with respect to Pn. Then for Xk e R, 1 < k < n,
E„ expji X XklkI = E expji £ Xklklz„
= e[«p{'[I J*fij*.-i -eJcxp^C - <*„) + *& -y) |^-i}j
= E expji X Afcjfciz^! exp{-i^} = ..• = expj -- E 4JJ-
Now the desired conclusion follows from Theorem 4 of §12, Chapter II.
i 2. Let X = (Xn, ^)nsl be a square-integrable martingale with mean
zero and
c = inf{n> l:X„< -b},
where b is a constant, b > 0. Suppose that
P(Xl < -b) > 0.
Then there is a constant C > 0 such that, for alln> 1,
P(tf>")>^§2- (6)
Proof. By Corollary 1 to Theorem VII.2.1 we have EArffA„ = 0, whence
-EI(a<n)Xa = EI(G>n)Xn. (7)
On the set {g < n)
-Xa>b>0.
Therefore, for n > 1,
-EJ(<7 < n)Xa > bP(<r <n)> bP(c = 1) = bP(Xl < - b) > 0. (8)
On the other hand, by the Cauchy-Schwarz inequality,
E/(<7 > n)Xn < [P(«7 > it) • EX2J,/2, (9)
which, with (7) and (8), leads to the required inequality with
C = (bP(Xl < - b))2.
Proof of Theorem 1. It is enough to show that
lim In P(t > n) / I [Agf(/c)]2 > - \ (10)
n-oo I fc=2
§7. Asymptotics of the Probability of the Outcome of a Random Walk
539
and
hm In P(t > n) / £ [A^(/c)]2 < - \. (11)
n-*co / fc=2
For this purpose we consider the (nonrandom) sequence (an)n3£l with
a, = 0, a„ = A0(h), n > 2,
and the probability measure (P„)nSl defined by (5). Then by Holder's
inequality
P„(t > n) = E/(t >n)zn< (P(t > fi))1'«(Ezfl1/p, (12)
where p > 1 and g = p/(p — 1).
The last factor is easily calculated explicitly:
(Ez*)1" = expj^ t2 C^(/c)]2}. (13)
Now let us' estimate the probability Pn(r > n) that appears on the left-
hand side of (12). We have
P„(t > ii) = Pn(Sk > g(k\ 1 < k < n) = P„(Sfc > 0(1), 1 < k < n\
where Sk = Yj=i Jm J« = & ~~ 0¾. By Lemma 1, the variables are independent
and normally distributed, J,- ~jV(0, 1), with respect to the measure P„.
Then by Lemma 2 (applied to b = —g(l\P = Pn,Xn = Sn) we find that
P(t >«)>-, (14)
where c is a constant.
Then it follows from (12)-(14) that, for every p > 1,
P(t > n) > Cpexp|- | t2 [^(/c)]2 - ^-In nj, (15)
where Cp is a constant. Then (15) implies the lower bound (10) by the
hypotheses of the theorem, since p > 1 is arbitrary.
To obtain the upper bound (11), we first observe that since z„ > 0 (P-
and P-a.s.), we have by (5)
P(T>«) = En/(t>«)z„-1, (16)
where E„ denotes an average with respect to P„.
In the case under consideration, a2 = 0, a„ = Ag(n\ n > 2, and therefore
for n > 2
z-« = exp{- £ Ag(k)-tk + \ £ [A^(/c)]2l.
I fc=2 ^fc=2 J
540
VIL Sequences of Random Variables That Form Martingalt
By the formula for summation by parts (see the proof of Lemma 2 of
§3, Chapter IV)
f Ag(k)-tk = Ag(n).S„- £ S^A^/c)).
fc=2 fc=2
Hence if we recall that by hypothesis Ag(k) > 0 and A(Ag(k)) < 0, we find
that, on the set {t > n} = {Sk > g(k\ 1 < k < n},
t **<*)•& > &M'M -i(Kk- 0A(A^(/c)) - ^Ag(2)
= I [A^(/c)]2 + g(l)Ag(2) - ^Ag(2).
fc=2
Thus, by (16),
P(t >n)< expj- i i^/c)]2 - ^(1)A^(2)| E„/(t > n>
- exp{" 5 A [A^(/c)]2}E"/(T > n)e~
-«iAfl(2)
where
£„/(* > «)e"«>^2) < Ez„e-«'A«(2> = Ee-^gl2) < oo.
Therefore
P(t >n)< Cexpj- i f2 [A^(/c)]4,
where C is a positive constant; this establishes the upper bound (11).
This completes the proof of the theorem.
3. The idea of an absolutely continuous change of measure can be used to
study similar problems, including the case of a two-sided boundary. We
present (without proof) a result in this direction.
Theorem 2. Let £i, £2, •• • be independent identically distributed random
variables with ££ ~^K(0, 1). Suppose that/ =f(n) is a positive function such
that
/(w)-»oo, «-»-oo,
and
i[A/(fc)]2=o(i/-2w). »■
fc=2 \fc=l /
Then if
a = mf{n>U\Sn\>f(n)}t
§8. Central Limit Theorem for Sums of Dependent Random Variables
541
we have
P(<7 > n) = exp j-y t .T2(fe)(l + <*1))J. n -+ oo. (17)
4. Problems
1. Show that the sequence defined in (4) is a martingale.
2. Establish (13).
3. Prove (17).
§8. Central Limit Theorem for Sums of Dependent
Random Variables
1. In §4, Chapter III, the central limit theorem for sums Sn = £nl + • • • + f„n>
n > 1, of random variables £nl £„„ was established under the
assumptions of their independence, finiteness of second moments, and negligibility in
the limit of their terms. In the present section, we give up both the
assumption of independence and even that of the finiteness of the absolute values of
the first-order moments. However, the negligibility in the limit of the terms
will be retained.
Thus, we suppose that on the probability space (fi, ^, P) there are given
stochastic sequences
e = (£nfc,^), O^k^n, n>h
with fn0 = 0, ^¾1 = {0. ^}. &k ^ &k+i ^ & We set
*?=Eo£*. 0£t£l.
Theorem 1. For a given t, 0 < t < 1, let the following conditions be satisfied:
for each e e (0, 1), as n -> oo,
[nr]
(A) E P(IU>e|^-i)^0,
[nr]
(B) lE[£nfc/(IU<l)|.
^fc-il-^O,
^nfc-i]^^2>0.
[nr]
(Q E VE^/flU^e)
fc=i
Then
Xnt -4^(0, (7?).
542
VII. Sequences of Random Variables That Form Martingalt
Remark 1. Hypotheses (A) and (B) guarantee that X? can be represented in
the form X? = Ytn + Z? with Znt 4 0 and Ytn = Y}k% fe where the sequence
nn = fonfc. ^fc") is a martingale-difference, and E{rink\&k-i) = 0 with \nnk\ < c,
uniformly for 1 < k < n and n > 1. Consequently, in the cases under
consideration, the proof reduces to proving the central limit theorem for
martingale-differences.
In the case when the variables £nl,..., £„„ are independent, conditions
(A), (B), and (C), with t = 1, and a2 = of, become
(a) t P(l4fcl>e)-0,
(b) £E[4fc/(IU<l)]-+0,
(C) E V[£nfc/(|U<E)]-+<72.
fc=l
These are well known; see the book by Gnedenko and Kolmogorov [G5].
Hence we have the following corollary to Theorem 1.
Corollary. If £„u ■ ■ ■ ■> £nn are independent random variables, n > 1, then
(a), (b),(c)=>X?-^(0,(72).
Remark 2. In hypothesis (Q, the case of = 0 is not excluded. Hence, in
particular, Theorem 1 yields a convergence condition for degenerate
distributions (XI 4> 0).
Remark 3. The method used to prove Theorem 1 lets us state and prove the
following more general proposition.
Let 0 < tx < t2 < • • • < tj < 1, <72 < <722 < • • • < <72, ¢7¾ = 0, and let
£i Sj be independent Gaus_sian random variables with zero means and
Ee2 = <72k — afk_,. Form the (Gaussian) vectors (Wtl Wtj) with Wtk =
£1+--- + ¾.
Let conditions (A), (B), and (C) be satisfied for t = tt tj. Then the
joint distribution (P^ tj) of the random variables (X"t X")
converges weakly to the Gaussian distribution P(t1 tj) of the variables
W Wt):
pn w p
r'i *J r' tj'
2. The first assertion of the following theorem shows that condition (A) is
equivalent to the condition of negligibility in the limit already introduced in
§4, chapter III:
(A*) max IU-^0.
(L) £ E[&/at»| > e)\
fc=0
§8. Central Limit Theorem for Sums of Dependent Random Variables 543
Theorem 2.
(1) Condition (A) is equivalent to (A*).
(2) Assuming (A) or (A*), condition (C) is equivalent to
[nt] I
(c*> i [6*-E(£nfc/au ^ i)\n-j\2$«?.
fc=0 I
Theorem 3. For each n>\ let the sequence
tn = (tnk,^d, l<k<n,
be a square-integrable martingale-difference:
E,»<co, E(t„|^J_I)-a
Suppose that the Lindeberg condition is satisfied: for s > 0,
*!_,]* a
Then (C) is equivalent to
<*">r**?. (1)
w/iere (quadratic variation)
[nt]
<*">,= IE(^|
fc=0 I
and (C*) is equivalent to
en* of. (3)
where (quadratic variation)
[nt]
IX"1 = £ &. (4)
fc = 0
The next theorem is a corollary of Theorems 1-3.
Theorem 4. Let the square-integrable martingale-differences £" = (^,^),
n > 1, satisfy (for a given t,Q < t < I) the Lindeberg condition (L). T/ze«
£ E«i|^J_ t) -¾ c7? =* X? 4^(0, o?X (5)
fc=0
[nt]
£ 42fc-^?=**,n-^((W): (6)
fc=0
3. Proof of Theorem 1. Let us represent X"t in the form
[nt] [nt]
xn=Z fc*/(|fcj < i) + I fc*/(|fcj > 1)
fc = 0 fc=0
n- ix (2)
544
VII. Sequences of Random Variables That Form Martingales
= I EK„S/(IU ^ i)
-^-.] + iuk\u>»
+ I {4J(I^UD-EK„t/(|^|<D
-^-.1)-
(7)
We define
b;= IEK„S/(I^|<1)
fc=0
*l-ll
(8)
where T is a set from the smallest ©--algebra <7(.R\{0}) and P(£nfc e T|^_ x)
is the regular conditional distribution of £nk with respect to #"£_ t.
Then (7) can be rewritten in the following form:
XI = Bl + I f x« + £ f xd(i4 - vft
fc=0 J|x|>l fc=0 J|x|£l
(9)
which is known as the canonical decomposition of (X"t ^"). (The integrals
are to be understood as Lebesgue-Stieltjes integrals, defined for every
sample point.)
According to (B) we have B" -^ 0. Let us show that (A) implies
fc=0 J|x|>l
MdfiZQ.
(10)
We have
I f 1*1*4 = I \UK\U > 1). (ID
fc=0 J|x|>l fc = 0
For every <5 e (0, 1),
{Io \Ui(\U > 0><5J< j£ /(i^i > 1) > A (12)
It is clear that
I/(|^| > 1)= X drf(=Uf„„).
fc = 0 fc = 0 J|x|>l
§8. Central Limit Theorem for Sums of Dependent Random Variables
545
By (A),
int] r
vu= I ^0> <13>
fc = 0 •'Ix^l
and V\ is SF\_^-measurable.
Then by the corollary to Theorem 2 of §3, Chapter VII,
^,,^0=^,,^0. (14)
(By the same corollary and the inequality A £/£„,, < 1, we also have the
converse implication
^,,^0=^,,^0, (15)
which will be needed in the proof of Theorem 2.)
The required proposition (10) now follows from (11)-(14).
Thus
Xnt = Ynt + ZJ, (16)
where
17- I =t«-<ft (17)
and
fc = 0 J|x|>l
fc = 0 Je<|x|<S
[««1
(18)
•Mxl>
It then follows by Problem 1 that to establish that
X1^jV(0taf)
we need only show that
Y?4 ^(0, (7?). (19)
Let us represent Y{f in the form
Y? = Y"[nt](e) + AU(e\ ee(0, 1],
where
(20)
A?nr](E)= I xd(^-vD. (21)
fc = 0 J\x\£E
As in the proof of (14), it is easily verified that, because of (A), we have
}fe](e)^0,n-oo.
546
VII. Sequences of Random Variables That Form Martingales
The sequence A"(e) = (Ank(e\ ^1), 1 < k < n, is a square-integrable
martingale with quadratic variation
<A"(s)>* = I [ f x2dv? - ( f xdvfY]
£=0 \_J\x\Ze \J\x\ze J J
i = 0 I
Because of (C),
<A"(E)>[nr]-^C7t2.
Hence, for every e e (0, 1],
m*x{f[nt](e\ |<A"(E)>[nt] - <7?|} ^ 0.
By Problem 2 there is then a sequence of numbers s„ j0 such that
tfnrlOO^O, <A"(En)>[nr]-^C7t2.
Therefore, again by Problem 1, it is enough to prove only that
where
Mnm ± ^(0, <7?), (22)
:=^fc(E„) = £ r x(^ - v?).
£=0 J|x|<SEn
M"fc s A"fc(E„) = V x(A - v?). (23)
Forre<7(R\{0}),let
/«n = '(AM*fc e n, wn = p(AM"fc e ri ^_ o
be a regular conditional probability, AMI = Mnk - Mnk_ lt k > 1, iWg = 0.
Then the square-integrable martingale Mn = (M£, ^¾ 1 < /c < n, can
evidently be written in the form
M\
= £ AM?= £ f x«.
£=1 £=1 •'M^Cn
(Notice that |AM?| < 2e„ by (23).)
To establish (22) we have to show that, for every real A,
E exp{iAMfa} - exp( - ±A2<72). (24)
Put
J=l J|x|£2e„
and
«<?■) = n (i+ag»j.
§8. Central Limit Theorem for Sums of Dependent Random Variables
547
Observe that
1 + AGnk = 1 + f (eax- l)dvnk= f
J\x\<2e„ J\x\<2
eiXxdvnk
= E[exp(UAM"fc)|^_1]
and consequently
S"k(G") = J! (1 + AG"fc).
On the basis of a lemma that will be proved in Subsection 4, (24) will
follow if for every real X
\snuG")\ =
Int]
[] E[exp(iAAMp|
^;-i]
> C(X) > 0 (25)
and
^,(Gn)-exp(-iA2c7?).
To see this we represent £k(G") in the form
6»k(G») = exp(G»- fl (1 + AG;)exp(- AG?).
(Compare the function E£A) defined by (76) in §6, Chapter II.)
Since
we have
f xrfvj = E(AMJ | ^5-0 = 0,
J|x|<2e„
j=l J\x\<2
(e«Ax_ ! ^iXx)dvnj.
Therefore
IAGnk\ < f \eiXx - 1 - Ux| dvnk < ±A2 f x2 dvnk
J\x\<2b„ J\x\<2e„
<U2(2en)2^0
and
By(C),
<M">[nr]-^C7?.
(26)
(27)
(28)
£ IAGJI < iA2 £ f x2 dv; = iA2<M">fc. (29)
(30)
548
VII. Sequences of Random Variables That Form Martingales
Suppose first that <M">fc < a (P-a.s.), k < [nf|, where a>c?+ 1. Then by
(28), (29), and Problem 3,
lit]
I] (1 + A<©exp(-A<S)-¾ 1, woo,
fc=i
and therefore to establish (26) we only have to show that
i.e., after (27), (29), and (30), that
[nt] r
£ (g«* _ j _ /Ax + ^2X2) fan £ Q (32)
fc=W|x|<;2En
But
|e£Ax - 1 - ihc + \X2x2\ < i|Ax|3
and therefore
£ f |e"x _ j _ Ux + ixVi dn < l|A|3(2En) J] f X2 ^
fc=l J|x|<S2e„ fc=lJ|x|<S2En
= is„\M3<Mn\nt]< ien\M*a -+0, n -> oo.
Therefore if <M">[nt] < a (P-a.s.), (31) is established and consequently
(26) is established also.
Let us now verify (25). Since \eiXx - 1 - iXx\ < i(Ax)2, we find from (28)
that, for sufficiently large «,
WKOI = I fl(l + AOD\ > fl(l - iA2A<M">,)
= exp^^In(l-iA2A<M->A
But
and A<Mn>j < (2e„)2 j0, n -+ co. Therefore there is an «0 = n0(X) such that
for all n > n0(X)t
\Snk(Gn)\>exp{-X\Mn\}
and therefore
\£?nt{Gn)\ > exp{-A2<M">[nr] > e-»°.
Hence the theorem is proved under the assumption that <M">[nt] < a
(P-a.s.). To remove this assumption, we proceed as follows.
§8. Central Limit Theorem for Sums of Dependent Random Variables
549
Let
t" = min{/c < [nt]: <M">fc > a2 + 1},
taking t" = oo if <M">N < o2 + 1.
Then for Mnk = M||At., we have
<M">[nI] = <M%,]Atn < 1 + a2 + 2e„2 < 1 + a2 + 2e\ (=a\
and by what has been proved,
E exp{/AM[„r]} -> exp(— \k2c2).
But
lim|E{exp(UMfnr]) - exp(UMfnr])} | < 2 lim P(t" < oo) = 0.
n n
Consequently
lim E exp(UMn[nt]) = lim E{exp(UMfnrJ) - exp(UMfnr])}
n n
+ lim E exp(iAM[„r]) = exp(— \k2of).
n
This completes the proof of Theorem 1.
Remark. To prove the statement made in Remark 2 to Theorem 1, we need
(using the Cramer-Wold method [B3]) to show that for all real numbers
X1 Xj
E exp f^Mk, + t2 UM[ntk] - MU.,,)]
- arf-itfo?,) - t \Xlipl - <_,).
fc=2
The proof of this is similar to the proof of (24), replacing (MJj, ^JJ) by the
square-integrable martingales ¢/¾. ^¾
Jtf"fc = £ v£AM?,
where v£ = Xl for i < [wfj] and v, = Aj for [«£;-i] < i < [ntj].
4. In this subsection we prove a simple lemma which lets us reduce the
verification of (24) to the verification of (25) and (26).
Let rf1 = Ql„ki &"& 1 < /c < «, « > 1, be stochastic sequences, let
Yn = t **•
fc=i
let
*V) = f[ E[exp(U»/nfc)l^-i], *eR,
fc=i
550
VII. Sequences of Random Variables That Form Martingales
and let Y be a random variable with
^(A) = E<?£Ay, AeR.
Lemma. If {for a given k)\gn(X)\ > c(A) > 0, n > 1, a sufficient condition for
the limit relation
Ee"Y"-+EeiXY (33)
is that
£ \A) -½ * (A). (34)
Proof. Let
iXYn
Then \nf(A)\ <c~l(A) <oo, and it is easily verified that
Emn(A) = 1.
Hence by (34) and the Lebesgue dominated convergence theorem,
|Ee'Ayn - EeiAy| = |E(e£Ayn - £(A))\
= \E(nf(A)[£n(A) - £(Xj\)\ < c_1(A)E|r(A) - £(A)\ - 0, n -oo.
Remark. It follows, from (33) and the hypothesis that £"(A) > c(A) > 0, that
£(A) # 0. In fact, the conclusion of the lemma remains valid without the
assumption that \Sn(A)\ > c(A) > 0, if restated in the form: if Sn(A) -½ 6(A)
and 6(A) # 0, then (33) holds (Problem 5).
5. Proof of Theorem 2. (1) Let e > 0,3 e (0, e), and for simplicity let r = 1.
Since
max|U<E+ t lfc*|/(lfc*l>e)
l<fc£n fc=l
and
{i \umj > «> > &} = (z '(i«*i > «> > *}.
we have
pjmax \U > e + d\ < P\t /(It* I > e) > 4 "
U<fc<;n J lfc=l J
If (A) is satisfied, i.e.,
p\i f <k>4-o
(fc=l J|X|>E J
§8. Central Limit Theorem for Sums of Dependent Random Variables
551
then (compare (14)) we also have
Therefore (A) => (A*).
Conversely, let
<7n = min{fc<«:|^fc|>E/2},
supposing that an = oo ifmax^^jfj < e/2. By (A*), lim„ P(g„ <oo) = 0.
Now observe that, for every 3 e (0, 1), the sets
fZn'(IU>e/2)>4 and j max IU>i4
coincide, and by (A*)
Y'(IU>e/2) = Y f ««*a
fc=l fc=l J\x\7>e/2
Therefore by (15)
n a a„ /• n a <r„ /•
I ASS I dvZ^O,
fc=l •'IxlarE fc=l J|x|:»a/2
which, together with the property lim„ P(a„ < oo) = 0, prove that (A*)
=* (A).
(2) Again suppose that t = 1. Choose an e e (0, 1] and consider the square-
integrable martingales
A"(<5) = (A"fc(<5), PQ (\<k< wX
with 5 e (0, e]. For the given e e (0, 1], we have, according to (C),
<A"(e)>„ *> a\.
It is then easily deduced from (A) that for every 3 e (0, e]
<A"(<5)>„ 4 aj. (35)
Let us show that it follows from (C*), (A), and (A*) that, for every 3 e (0, e],
[A"05)]„iU?, (36)
where
[A"(5)]„ = f K*/au * *> - f xd<i2-
fc=l J\x\<6
In fact, it is easily verified that by (A)
[A"(5)]n-[A«(l)]n^0. (37)
552
VII. Sequences of Random Variables That Form Martingales
But
I t k* - f xd<\2 - t \tnMU < i) - f *<kT
U=l|_ J\x\Zl J fc=l L J 1*1*1 J
< t '(lfi.il > Uftt*)2 + 21^11 J* ^xdW - v"fc)|l
<5imnk\>D\u2
fc=l
<
5 max \U2 t f ^-^0.
l<fc^n fc=l J\x\>l
(38)
Hence (36) follows from (37) and (38).
Consequently to establish the equivalence of (C) and (C*) it is enough to
establish that when (C) is satisfied (for a given eg(0, 1]), then (C*) is also
satisfied for every a > 0:
lim hm P{|[A"(<7)]„ - <A"05)>„ | > a} = 0. (39)
6-0 n
Let
mnk(S) = [A"05)]fc - <A"05)\, 1 < k < n.
The sequence m\8) = (mJJ(<5), ^JJ) is a square-integrable martingale, and
(m"(<5))2 is dominated (in the sense of the definition of §3 on p. 467) by the
sequences [m"(<5)] and <m"(<5)».
It is clear that
[m-Wl = £ (Am"fc(,5))2 < max |Am"fc(5)|{[A"(5)]n + <A%5)>n}
fc=l l<fc<»
< 3<52{[A"05)]„ + <A"05)>„}. (40)
Since [A"(<5)] and <A"(<5)> dominate each other, it follows from (40) that
(m"(<5))2 is dominated by the sequences 6<52[A"(<5)] and 6<52<A"(<5)>.
Hence if (C) is satisfied, then for sufficiently small <5 (for example, for
S < |fc(<72 + 1))
hm P(6<52<An(<5)>„ > b) = 0,
n
and hence, by the corollary to Theorem 2 of §3, we have (39). If (C*) is also
satisfied, then for the same values of dt
hm P(6<52[A"05)]fc > b) = 0. (41)
n
Since | A[An(<5)]fc| < (2<5)2, the validity of (39) follows from (41) and another
appeal to Theorem 2 of §3.
This completes the proof of Theorem 2.
§8. Central Limit Theorem for Sums of Dependent Random Variables
553
6. Proof of Theorem 3. On account of the Lindeberg condition (L), the
equivalence of (C) and (1), and of (C*) and (3), can be established by direct
calculation (Problem 6).
7. Proof of Theorem 4. Condition (A) follows from the Lindeberg
condition (L). As for condition (B), it is sufficient to observe that when £" is a
martingale-difference, the variables Bn that appear in the canonical
decomposition (9) can be represented in the form
[nt] /.
B?=- I xdvkn.
fc = 0 J|x|>l
Therefore B" -¾ 0 by the Lindeberg condition (L).
8. The fundamental theorem of the present section, namely Theorem 1, was
proved under the hypothesis that the terms that are summed are
uniformly asymptotically infinitesimal. It is natural to ask for conditions for the
central limit theorem without such a hypothesis. For independent random
variables, examples of such theorems are given by Theorem 1 (assuming
finite second moments) or Theorem 5 (assuming finite first moments) from
§4, Chapter III.
We quote (without proof) an analog of the first of these theorems,
applicable only to sequences £" = (£nk, ^) that are square-integrable martingale
differences.
Let &nk(x) = P(£nfc < x\^nk-x) be a regular distribution function of
£nk with respect to &"k_l9 and let Anfc = E(f2fc| J^_ J
Theorem 5. If a square-integrable martingale-difference £„ = (£„ki 3Fnk\
0 <k < n,n> 1, 40 = 0, satisfies the condition
[nt]
£A„fc^<72, 0<<7?<OO,
fc = 0
and for every e > 0
Z f |x|\3?M) - *(tH Idx*°'
then
^-4^(0,(7,2).
9. Problems
1. Let ^„ = »7n + ^„,« > 1, where t]n -4 tj and £„ 4» 0. Prove that f„ -^ tj.
2. Let (£„(e)), n > 1, e > 0, be a family of random variables such that %n(e) £ 0 for each
e > 0 as n ->oo. Using, for example, Problem 11 of §10, Chapter II, prove that there
is a sequence e„ j,0 such that ^„(e„) £ 0.
554
VII. Sequences of Random Variables That Form Martingales
3. Let (aJJ), l<fc<«,n>l, be a complex-valued random variable such that (P-a.s.)
I]|aZ|<C, K\<aniO.
Show that then (P-a.s.)
limn(l+aZ)exp(-aZ)=l.
4. Prove the statement made in Remark 2 to Theorem 1.
5. Prove the statement made in Remark 1 to the lemma.
6. Prove Theorem 3.
7. Prove Theorem 5.
§9. Discrete Version of Ito's Formula
1. In the stochastic analysis of Brownian motion and other related processes
(for example, martingales, local martingales, semi-martingales) with
continuous time Ito's change-of-variables formula plays a key role (see, for example,
[Jl], [L12]).
This section may be viewed as a prelude to Ito's formula for Brownian
motion. In it, we present a discrete (in time) version of Ito's formula and show
briefly how a corresponding formula for continuous time could be obtained
using a limiting procedure.
2. Let X = (Xn)0^n^N and Y = (Yn)0^n^N be two sequences of random
variables on the probability space (fi, ^, P), X0 = Y0 = 0 and
where
[*, r\n = t AX*AYi (1)
£=1
is the square covariance of (X0i Xu...t Xn) and (Y0t Yx Yn) (see §1). Also,
suppose that F = F(x) is an absolutely continuous function,
F(x) = F(0)+f7(j;)dj/, (2)
Jo
where f = f(y\ y e U is a Borel function such that
f \f(y)\ dy<cot c> 0.
J\y\<;c
The change-of-variables formula in which we are interested concerns the
§9. Discrete Version of Ito's Formula
555
possibility of representing the sequence
F(X) = (F(Xn))0 (3)
in terms of 'natural* functional from the sequence X. Of course, this requires
explanation and will be explained here.
Given the function f = f(x)t we form the square covariance [Xt f{X)~] as
follows:
= I WW - /(**-.))(** - **-,). (4)
We introduce two 'discrete integrals' (cf. Definition 5 of §1):
/„(X,/(X)) =^/(^.,)^, (5)
fc=l
Tn(Xif(X))=±f(Xk)AXk. (6)
Then
[X, f(X)\ = I„(X, f(X)) - ln{Xt f(X)). (7)
For fixed N, we introduce a new (reversed) sequence X = (Xn)0:Sn:SN with
Xn = XN-n. (8)
Then clearly,
IN(Xtf(X))=-IN(Xtf(X))
and analogously,
I„(X, f(X)) = - {IN(X, f(X)) - /N_„(X, f(X))}
(we set /0 = 70 = 0).
Thus,
LXt f(X)lN = - {IN(X, f(X)) + IN(X, f(X))}
and for 0 < n < N we have:
IX /Ml = - {«*. /(*)) - wx, f(X))} - in(xt f(X))
= -{ t f(Xk)AXk+tf(Xk)*Xk\.
Cfc=N=n+l fc=l )
(9)
Remark 1. We note that the structures of the right-hand sides of (7) and (9)
are different. Equation (7) contains two different forms of "discrete integral."
The integral In(Xt f(X)) is a "forward integral," while 7n(Xt f(X)) is a
"backward integral." In (9), both integrals are "forward integrals," over two differ-
556
VIII. Sequences of Random Variables that Form Markov Chains
(10)
ent sequences X and X.
3. Since for any function g = g(x)
<7(**-i) + ilg(Xk) - 9(xk-i)l - Kg(Xk) + <7(**-i)] = o,
it is clear that
F(Xn) = F(X0) + £ giX^AX, + \\X9 g{X)\
fc=i
+ i {to - ^.,)) - «<*»-'>+^¾}.
In particular, if #(x) = /(x), where /(x) is the function of (2), then
F(XJ = F(X0) + /„(*, /(*)) + i[X, /(*)]„ + R„(X, /(*)), (11)
where
RM,XX))=±[* [m-^-^^dx. (12)
From analysis, it is well known that if the function /"(x) is continuous, then
the following formula ("trapezoidal rule") holds:
(b-a)3 r1
= ^—j-1- J x(x - l)/"(£(a + x(b - a))) dx
_ (b - a)_r&a + -(6 _ fl))) f' x(x _ 1} ^
Jo
12 "/"(»/),
where £(x), x and tj are "intermediate" points in the interval [a, 6].
Thus, in (11)
R„(Xtf(X))= ~tr(nk)(AXk)3
where Xfc_x <rjk< Xkt whence
\Rn(X,f(X))\ < A sup/"(,/) f |AXfc|3
iz fc=1
where the supremum is taken over all rj such that
min(X0, Xlt..., Xn) < rj < max(X0, Xx Xn).
We shall refer to formula (11) as the discrete analogue of Ito's formula. We
note that the right-hand side of this formula contains the following three
2
(b-af
§9. Discrete Version of Ito's Formula
557
'natural' ingredients: 'the discrete integral' I„(Xt f(X)\ the square covariance
[X, f{X)\ and the 'residual' term Rn{X, f(X)).
4. Example 1. If f(x) = a + bx, then Rn(Xt f(X)) = 0 and formula (11) takes
the following form:
F(Xn) = F(X0) + In(Xt f(X)) + HX, /(*)]„. (13)
Example 2. Let
r l, x > o
f(x) = sign x = < 0, x = 0
[-1, x<0.
Then F(x) = |x|.
Let Xk = Ski where
^ = ^1 + ^2 + - + ¾.
where flf £2, ••• are independent Bernoulli random variables taking values
± 1 with probability 1/2. If we also set S0 = 0, we obtain
ISJ-frignSUASi + AL, (14)
fc=l
where S0 = 0 and
Nn = {0<k<n:Sk = 0}.
We note that the sequence of discrete integrals (££=i sign Sk-1ASk)n^1 forms
a martingale and therefore,
It is not difficult to show that as n -> oo,
E'S»'~V& (15)
thus, application of Ito's discrete formula gives the asymptotic behavior
of the average number of changes of sign on the path of the random walk
(&,<.: EN. ~^l2n.
5. Remarks. Let B = (Bt)0^f^i be a Brownian movement and Xk = ^/n,
k = 0,1 n. Then application of formula (11) leads to the following result:
F(BJ = F(B0) + £ /(B(fc_1)/n)A^/n + i[/(B./n), B./n]n + R„(B./n, /(B./n)).
fc=i -^
(16)
It is known from the stochastic calculus of Brownian motion that
t\Bkln-B(k^Jn\^0 (17)
558
VII. Sequences of Random Variables that Form Martingales
and if/ = f(x) e L,2oc (i.e., J,x,^fc/2(x) dx < oo for any k > 0), then the limit
L2-lim £ /(B(fc_1)/n)A^/n, (18)
n fc=0
exists and is denoted by
CfiBjdB,
Jo
and is called Ito's stochastic integral off(Bs) with respect to Brownian motion
([Jl], [L12]). In addition, if the function f(x) has a second derivative and
\f"(x)\ < C, then from (9), we obtain that P-lim R„(B.fnt f(B./n)) = 0. Thus,
from (8) and (10), it follows that the limit
P-Um[/(B./n),B./n]„
exists and is denoted by
and that the following formula holds (P-a.s.)
FiBJ = F(fy + J"' f(Bs) ABS + i[/(B), B],. (19)
It can be shown that, for the given smoothness assumptions
U(B),B\=Cf'(Bs)dst (20)
Jo
which leads to the known formula of ltd for Brownian motion:
F(Bt) = F(B0) + J' f(Bs) dBs + i J' f'(Bs) ds. (21)
6. Problems
1. Show that formula (14) is true.
2. Establish that the property (16) is true.
3. Prove formula (15).
§10. Applications to Calculations of the Probability
of Ruin in Insurance
1. The material studied in the present section is a good illustration of the fact
that the theory of martingales provides a quick and simple way of calculating
the risk of an insurance company.
§10. Applications to Calculations of the Probability of Ruin in Insurance 559
We shall assume that the evolution of the capital X = (Xt)t^.0 of a certain
insurance company takes place in a probability space (Q, ^", P) as follows.
The initial capital is X0 = u > 0. Insurance payments arrive continuously
at a constant rate c> 0 (in time At the amount arriving is cAt) and claims are
received at random times 7^, T2,... (0 < 7^ < T2 < •■•) where the amounts
to be paid out at these times are described by a nonnegative random
variables iu f2,....
Thus, taking into account receipts and claims, the capital Xt at time t > 0
is determined by the formula
Xt = u + ct- Sti (1)
S, = X £•/(?; <t). (2)
T = inf{t>0:X,<0}
the first time at which the insurance company's capital becomes less than or
equal to zero ('time of ruin'). Of course, if Xt > 0 for all t > 0, then the time
T is given to be equal to +oo.
One of the main questions relating to the operation of an insurance
company is the calculation of the probability of ruin, P(T < oo), and the
probability of ruin before time t, P(T < t) (inclusively).
2. To calculate these probabilities we assume we are in the framework of
the classical Cramer-Lundeberg model characterized by the following
assumptions:
A The times 7^, T2,... at which claims are received are such that the
variables (T0 = 0)
ot=Tt-Tt.l9 i>l
are independent, identically-distributed random variables having an
exponential distribution with density Ae~Ar, t > 0 (see Table 2, §3, Chapter II).
B The random variables £j, £2> ... are independent and identically
distributed with distribution function F(x) = P(^ < x) such that F(0) = 0,
A/ = J«5xdF(x)<oo.
C The sequences (Tl9 T2i...) and (^, £2>...) are independent sequences (in
the sense of Definition 6, §5, Chapter II).
We denote the process of the number of claims by N = (Nt)t^.0, i.e., set
^=E/CS<*). (3)
f;>i
It is clear that this process has a piecewise-constant trajectory with jumps
by a unit value at times Tl9 T2,... and with value N0 = 0.
where
We denote
560
VII. Sequences of Random Variables that Form Martingales
Since
{Tk >t} = {(T1 + - + (jk>t} = {Nt < k},
under the assumption A we find that, according to Problem 6, §2, Chapter II,
fc-l (U\i
P(Nt <k) = Pfa +••• + <*>*)= X e"*^.
Whence
P(Nt = k) = e~A'^p fc = 0,1,..., (4)
i.e., the random variable Nt has a Poisson distribution (see Table 1. §3,
Chapter II) with parameter Xt. Here, ENt = Xt.
The so-called Poisson process N = (Nt)t:t0 constructed in this way is
(together with the Brownian motion; §13, Chapter II) an example of another
classical random (stochastic) process with continuous time. Like the Brownian
motion, the Poisson process is a process with independent increments (see
§13, Chapter II), where, for s<t, the increments Nt — Ns have a Poisson
distribution with parameter X(t — s) (these properties are not difficult to
derive from assumption A and the explicit construction (3) of the process N).
3. From assumption C we find that
E(Xt -X.) = ct-ESt = ct-EYi ^I(Tt ^ t)
i
= ct-Z E&.EZfl^tHct-AiEPrc^t)
f i
= <* - n X P(Nt > i) = ct- fxENt = t(c - Xn).
i
Thus, we see that, in the case under consideration, a natural requirement for
an insurance company to operate with a clear profit (i.e. E(Xt — X0) > 0,
t > 0) is that
c> XfL (5)
In the following analysis, an important role is played by the function
/i(z)=fV*-l)dF(x), z>0, (6)
Jo
which is equal to F(—z) — 1, where
F(s) = [ e~sx dF(x)
Jo
is the Laplace-Stieltjes transformation (s is a complex number).
Denoting
g(z) = Xh{z) - cz £0 = 0,
§10. Applications to Calculations of the Probability of Ruin in Insurance
561
we find that for r > 0 with h(r) < oo,
£e-r(Xt-X0) _ £e-r(Xt-u) _ g-rrt. |^"5S0 ft
= e~rct £ Eer25o««P(iVf = w)
n=0
oo e-ArrJlrtn
-.-^a + fcw-^-
_ g-rrt.eArt(r) _ er[AA(r)-cr] _ gfgW
Analogously, it can be shown that for any s < t
£c-r(Xt-Xs) _ e(r-s)g(r)^ /y\
Let ^r = o-(.Ys, s <, t). Since the process X = (Xt)t^.0 is a process with
independent increments (Problem 2) (P-a.s.)
E(e~r<**-A*>|^) = Ee~r(Xt~*s) = e(r_s)ff(r),
then (P-a.s.)
E(e-rXt-rff(r)|^j = e-rX5-Sg(r) (g)
Denoting
Z = e-r^-r9(r), t > 0 (9)
we see that property (8) rewritten in the form
E(Z,|%) = ZS, s<t (10)
is a continuous analogue of the martingale property (2) of Definition 1 of §1.
By analogy with Definition 3 of §1, we shall say that the random variable
t = r(co) with values in [0, +oo] is a Markov time relative to the system of
<7-algebras (¾^ if for each t > 0 the set
{r(co)<t}e%
The process Z = (Zt)t^.0 is nonnegative with EZt = e~ra < oo. Thus, by
analogy with Definition 1 of §1, the process Z = (Zt)t^.0 with continuous time is a
martingale.
It turns out that for martingales with continuous time, Theorem 1 of §2
remains valid (with self-evident changes to the notation). In particular,
EZ,.,= EZo (11)
for any Markov time r (see for example [L5], §2, Chapter 3).
By virtue of (9) we find from (11) that for time r=T
e-ru _ Ee-rXt«T-{tAT)g(r)
> E[e-»*^-<*ATto(r)|T < qp(t < t)
= E[e-rX*-Tglr)\T < t]P(T < t)
> E[e-Tglr)\T < i]P(T <t)> min t~^r)P{T < t).
562
VII. Sequences of Random Variables that Form Martingales
Moreover,
p(T ^ *) ^ z^z i=S5w = e_ra max «"**■ (12)
Let us consider the function
g(r) = Xh(r) - cr
in more detail. Clearly, #(0) = 0, g'(0) = X\i — c < 0 (by virtue of (5)) and
g"(r) = Xh"(r) > 0. Thus, there exists a unique positive value r = R with
g(R) = 0.
Noting that for r > 0
f °° e"(l - F(x)) dx = r f" eTx dF(y) dx
Jo Jo Jx
= M fV'dxWo;)
-if-
r Jo
(ery-l)dF(y) = h(r\
R may be asserted to be the (unique) root of the equation
- f°°erx(l-F(x))dx=l. (13)
c Jo
Let us set r = R in (12). Then we obtain, for any t > 0,
P(T <t)< e~Ru (14)
whence
P(T < oo) < e~Ru. (15)
Moreover, we prove the following
Theorem. Suppose that in the Cramer-Lundeberg model assumptions A, B, C
and property (5) are satisfied (i.e.t X\i < c). Then the bound of (15) holds for the
probability of ruin P(T < oo), where R is the positive root of equation (13).
4. Remark. The above discussion could be greatly simplified from a
stochastic point of view if we assumed a geometric distribution for the Cj (Pfa = k) =
qk~1pi k= 1,2,...) instead of an exponential distribution. In this case, all the
time variables (Th T) would take discrete values, it would not be necessary to
call upon the results of the theory of martingales for continuous time and the
whole study could strictly be carried out based only on the 'discrete' (in time)
methods from the theory of martingales studied in the present chapter.
However, we have turned our attention to the "continuous" (in time)
scheme to illustrate both the method and the usefulness of the general theory
of martingales for the case of continuous time, based on the given example.
§10. Applications to Calculations of the Probability of Ruin in Insurance 563
5. Problems
1. Prove that the process N = (Nt)t^0 (under assumption A) is a process with
independent increments, where Nt — Ns has a Poisson distribution with parameter
X(t-s).
2. Prove that the process X = (Xt)t^0 is a process with independent increments.
3. Consider the problem of determining the probability of ruin P(T < oo)
assuming that the variables ax have a geometric (rather than exponential) distribution
(P(oi = k) = qk-lptk=lt2,...).
CHAPTER VIII
Sequences of Random Variables
That Form Markov Chains
§1. Definitions and Basic Properties
1. In Chapter I (§12), for finite probability spaces, we took the basic idea
to be that of Markov dependence between random variables. We also
presented a variety of examples and considered the simplest regularities that
are possessed by random variables that are connected by a Markov chain.
In the present chapter we give a general definition of a stochastic sequence
of random variables that are connected by Markov dependence, and devote
our main attention to the asymptotic properties of Markov chains with
countable state spaces.
2. Let (Q, ^ P) be a probability space with a distinguished nondecreasing
family (.¾ of <7-algebras, ^0£^i£-£^
Definition. A stochastic sequence X = (Xn, #„) is called a Markov chain
(with respect to the measure P) if
P{XmeB\PJ = P{XHeB\Xm} (P-a.s.) (1)
for all n > m > 0 and all B e @(R).
Property (1), the Markov property, can be stated in a number of ways.
For example, it is equivalent to saying that
^MXn)\^m] = E[jg(Xn)\Xm] (P-a.s.) (2)
for every bounded Borel function g = g(x).
§1. Definitions and Basic Properties
565
Property (1) is also equivalent to the statement that, for a given "present"
Xmi the "future" F and the "past" P are independent, i.e.
P(FP\Xm) = P(F\Xm)P(P\Xm\ (3)
where F e g{co: Xit i > m}, and B e J*,, n < m.
In the special case when
&n = FXn =°{CO:X0 Xn)
and the stochastic sequence X = (X„, &*) is a Markov chain, we say that
the sequence {Xn} itself is a Markov chain. It is useful to notice that if X =
{Xn, &„} is a Markov chain, then (X„) is also a Markov chain.
Remark. It was assumed in the definition that the variables Xm are real-
valued. In a similar way, we can also define Markov chains for the case when
X„ takes values in some measurable space (£. £). In this case, if all singletons
are measurable, the space is called a phase space, and we say that X =
(Xn, &?„) is a Markov chain with values in the phase space (£, S). When E
is finite or countably infinite (and £ is the ©--algebra of all its subsets) we
say that the Markov chain is discrete. In turn, a discrete chain with a finite
phase space is called a finite chain.
The theory of finite Markov chains, as presented in §12, Chapter I, shows
that a fundamental role is played by the one-step transition probabilities
P(Xn+1eB\ Xn). By Theorem 3, §7, Chapter II, there are functions P„+1 (x; £),
the regular conditional probabilities, which (for given x) are measures on
(R, @(R)\ and (for given B) are measurable functions of x, such that
P(Xn+i eB\Xn) = Pn+l(Xn; B) (P-a.s.). (4)
The functions P„ = P„(xt £), n > 0, are called transition functions, and
in the case when they coincide (Px = P2 = • ■ •)» the corresponding Markov
chain is said to be homogeneous (in time).
From now on we shall consider only homogeneous Markov chains, and
the transition function Px = P^x, £) will be denoted simply by P = P(x, £).
Besides the transition function, an important probabilistic property of a
Markov chain is the initial distribution n = rc(£), that is, the probability
distribution defined by 7t(£) = P(X0 e£).
The set of pairs (n, P), where n is an initial distribution and P is a transition
function, completely determines the probabilistic properties of A", since every
finite-dimensional distribution can be expressed (Problem 2) in terms of n
and P: for every n > 0 and A e @(Rn+!)
P{(X0 Xn)eA)
= (n(dxo) fp(x0;^x1)-- (\(xo xn)P(xn_1;dx„). (5)
Jr. Jr Jr
566
VIII. Sequences of Random Variables That Form Markov Chains
We deduce, by a standard limiting process, that for any @(Rn+ ^-measurable
function g(x0 x„), either of constant sign or bounded,
E0(*o Xn)
= f n(dx0) f P(x0; dXl) • • • f g(x0 xn)P(x„_,; dxn). (6)
Jr Jr Jr
3. Let Pln) = P(n)(x; B) denote a regular variant of the n-step transition
probability:
P(Xn eB\X0) = P(nKX0; B) (P-a.s.). (7)
It follows at once from the Markov property that for all k and /, (Jc, I > 1),
P(fc+l)(A:o;J?)= (Plk)(X0\dy)P(l)(y,B) (P-a.s.)- (8)
It does not follow, of course, that for allxeR
P(fc+/)(x; B) = f P(fc)(x; <ty)P(/)0;; B). (9)
Jr
It turns out, however, that regular variants of the transition probabilities
can be chosen so that (9) will be satisfied for allxeR (see the discussion in the
historical and bibliographical notes, p. 559).
Equation (9) is the Kolmogorov-Chapman equation (compare (1.12.13))
and is the starting point for the study of the probabilistic properties of Markov
chains.
4. It follows from our discussion that with every Markov chain X = (Xn, ^,),
defined on (Q, ^, P) there is associated a set (7i, P). It is natural to ask
what properties a set (n, P) must have in order for n = n(B) to be a
probability distribution on (R, ^(jR)) and for P = P(x; B) to be a function that is
measurable in x for given B, and a probability measure on B for every x, so
that n will be the initial distribution, and P the transition function, for some
Markov chain. As we shall now show, no additional hypotheses are required.
In fact, let us take (Q, J5") to be the measurable space (R00, ^(jR00)). On
the sets A e<%{Rn+1) we define a probability measure by the right-hand side
of formula (5). It follows from §9, Chapter II, that a probability measure P
exists on (jR00, ^(jR00) for which
P{co:(x0 x„)eA}
= (n(dx0) fp(x0;rfx1)-- (lA(x0 xn)P(xn_i;^xn). (10)
Jr Jr Jr
Let us show that if we put Xn(co) = x„ for co = (x0, xl9...), the sequence
X = (X„)„zo wiU constitute a Markov chain (with respect to the measure P
just constructed).
§1. Definitions and Basic Properties
567
In fact, if B e@{R) and C e@(Rn+ *), then
P{Xn+leB,(X0 Xn)eC)
= n(dx0) PixoldXi)--- /B(xn+1)Jc(x0 xn)P{xn\ dxn+x)
Jr Jr Jr
= [<dx0) (Pixoldx,)-- fp(xn;£)Jc(x0 xn)P(xn-x\ dxn)
Jr Jr Jr
= f P(X„;B)dPt
J{to:(X0 X„)eC)
whence (P-a.s.)
P{Xn+1eB\X0 Xn} = P(Xn;B). (11)
Similarly we can verify that (P-a.s.)
P{Xn+leB\Xn} = P(Xn;B). (12)
Equation (1) now follows from (11) and (12). It can be shown in the same
way that for every k > 1 and n > 0,
P{Xn+keB\X0 Xn} = P{Xn+keB\Xn} (P-a.s.).
This implies the homogeneity of Markov chains.
The Markov chain X = (X„) that we have constructed is known as the
Markov chain generated by (n, P). To emphasize that the measure P on
(R00, ^(jR00)) has precisely the initial distribution n, it is often denoted
byP„.
If 7i is concentrated at the single point x, we write Px instead of P„, and
the corresponding Markov chain is called the chain generated by the point x
(since PX(X0 =x} = 1).
Consequently, each transition function P = P(x, B) is in fact connected
with the whole family of probability measures {PXi x e jR}, and therefore with
the whole family of Markov chains that arise when the sequence (Xn)n:t0 is
considered with respect to the measures Px, xejR. From now on, we shall
use the phrase "Markov chain with given transition function" to mean the
family of Markov chains in the sense just described.
We observe that the measures P„ and Px constructed from the transition
function P = P(x, B) are consistent in the sense that, when A e^(jR°°),
Pn{(X0i Xlt...)eA\X0 = x} = Px{(X0i X„ ...) e A} (7i-a.s.) (13)
and
P„{(X0, ^,...)6^)= f PX{(X0, Xl9...) e A}n(dx). (14)
Jr
568
VIII. Sequences of Random Variables that Form Markov Chains
5. Let us suppose that (Q, &) = (R00, ^(R00)) and that we are considering
a sequence X = (Xn) that is defined coordinate-wise, that is, Xn{co) — x„ for
co = (x0, Xi, - - .)• Also let J*. = <r{<a- X0 Xn% n>0.
Let us define the shifting operators 6ni n > 0, on Q by the equation
6„(x0i xlt...) = (x„, x„+lt...),
and let us define, for every random variable y\ = r\{co\ the random variables
®nn by putting
(MHO)) = *7(0„o)).
In this notation, the Markov property of homogeneous chains can
(Problem 1) be given the following form: For every ^-measurable r\ = rj(co)t
every n > 0, and B e @(R),
P{enrteB\<Fn) = PXn{n e B} (P-a.s.). (15)
This form of the Markov property allows us to give the following
important generalization: (15) remains valid if we replace n by stopping times t.
Theorem. Let X = (X„) be a homogeneous Markov chain defined on
(jR00, &(Rm\ P) and let xbea stopping time. Then the following strong Markov
property is valid:
P{6xn eBm = PXx{rj e B} (P-a.s.). (16)
Proof. If A e &x then
P{6xrj e B, A} = £ P{Oxn eB,A,T = n}
n = 0
= £ P{6nr,eBiAiT = n}. (17)
n = 0
The events A n {t = n} e &„, and therefore
P{6nneBtAn{T = n}}= f P{6„ri e B\<Fn) dP
JAn{x = n)
= f PXn{neB}dP = f PXx{neB}dPt
JAr\{x = n) JAn{T = n)
which, with (17), establishes (16).
Corollary. If a is a stopping time such that P(a > t) = 1 and o is ^-measurable,
then
P{XaeB, g < oo|^} = PXr(B) ({ct < oo}; P-a.s.). (18)
6. As we said above, we are going to consider only discrete Markov chains
(with phase space E = {..., ij, k,...}). To simplify the notation, we shall
now denote the transition functions P(i; {/}) by ptj and call them transition
§2. Classification of the States of a Markov Chain in Terms of Arithmetic Properties 569
probabilities; an «-step transition probability from i to j will be denoted
bypi?.
Let E = {i, 2,...}. The principal questions that we study in §§2-4 are
intended to clarify the conditions under which:
(A) The limits nj = lim pi") exist and are independent of i;
(B) The limits (jzu n2i...) form a probability distribution, that is, 7r{ > 0,
71,. = 1;
(C) The chain is ergodic, that is, the limits (7^, n2,...) have the properties
*,->o,I?=i *,■=!;
(D) There is one and only one stationary probability distribution Q =
(<Zi> Qi > • • •)»tnat is,one such that qx > 0, JJi j gj = 1, and q} = £,- q;pijt
jeE.
In the course of answering these questions we shall develop a classification
of the states of a Markov chain as they depend on the arithmetic and
asymptotic properties of pj^ and pjf.
7. Problems
1. Prove the equivalence of definitions (1), (2), (3) and (15) of the Markov property.
2. Prove formula (5).
3. Prove equation (18).
4. Let (Xn)n a 0 be a Markov chain. Show that the reversed sequence (..., X„, X„_,,..., X0)
is also a Markov chain.
§2. Classification of the States of a Markov Chain in
Terms of Arithmetic Properties of the Transition
Probabilities p\f
1. We say that a state ieE = {1, 2,...} is inessential if, with positive
probability, it is possible to escape from it after a finite number of steps, without
ever returning to it; that is, there exist m andj such that pW > 0, but pjf = 0
for all n andj.
Let us delete all the inessential states from E. Then the remaining set of
essential states has the property that a wandering particle that encounters
it can never leave it (Figure 36). As will become clear later, it is essential
states that are the most interesting.
Let us now consider the set of essential states. We say that state j is
accessible from the point i (i -> j) if there is an m > 0 such that p\f > 0
570
VIII. Sequences of Random Variables that Form Markov Chains
Inessential state
Essential state
Figure 36
(pj^ = 1 if i = j, and 0 if / =^ j). States i andy communicate (i *-*j) if/ is
accessible from i and i is accessible fromy.
By the definition, the relation "<-►" is symmetric and reflexive. It is easy
to verify that it is also transitive (i«-»j, j *-*■ k => i «-»• k). Consequently the
set of essential states separates into a finite or countable number of disjoint
sets Elt E2 each of which consists of communicating sets but with the
property that passage between different sets is impossible.
By way of abbreviation, we call the sets Elt E2i. • • classes or
indecomposable classes (of essential communicating sets), and we call a Markov chain
indecomposable if its states form a single indecomposable class.
As an illustration we consider the chain with matrix
P =
i 1:0 0 0
H:0 0 0
0 0:010
o o;i o i
0 0:0 1 0
= (P1 0\
\0 P2)'
The graph of this chain, with set of states E = {1, 2, 3, 4, 5} has the form
i
It is clear that this chain has two indecomposable classes El = {1,2},
£2 = (3,4, 5}, and the investigation of their properties reduces to the
investigation of the two separate chains whose states are the sets Ex and E2,
and whose transition matrices are P1 and P2-
Now let us consider any indecomposable class E, for example the one
sketched in Figure 37.
§2. Classification of the States of a Markov Chain in Terms of Arithmetic Properties 571
Figure 37. Example of a Markov chain with period d = 2.
Observe that in this case a return to each state is possible only after an
even number of steps; a transition to an adjacent state, after an odd number;
the transition matrix has block structure,
Therefore it is clear that the class E = {1,2, 3, 4} separates into two
subclasses C0 = {1, 2} and C1 = {3, 4} with the following cyclic property:
after one step from C0 the particle necessarily enters Clf and from C, it
returns to C0.
This example suggests a classification of indecomposable classes into
cyclic subclasses.
2. Let us say that state j has period d = d(J) if the following two conditions
are satisfied:
(1) Pj?>0 only for values of n of the form dm;
(2) d is the largest number satisfying (1).
In other words, d is the greatest common divisor of the numbers n for which
pjff > 0. (If pff = 0 for all n> 1, we put d(J) = 0.)
Let us show that all states of a single indecomposable class E have the
same period d, which is therefore naturally called the period of the class,
d = d{E).
Let i and j e E. Then there are numbers k and I such that pff > 0 and
pf > 0. Consequently /?S?+1) > p$p$ > 0, and therefore k + I is divisible
by d(i). Suppose that n > 0 and n is not divisible by d(i). Then n + k + I is
also not divisible by d(i) and consequently p\1+k+l) = 0. But
I%+k+l)>P??PWP$
572
VIII. Sequences of Random Variables that Form Markov Chains
Figure 38. Motion among cyclic subclasses.
and therefore p$ = 0. It follows that if p$ > 0 we have n divisible by d(i),
and therefore d(i) < d{j). By symmetry, d(j) < d(i). Consequently d(i) = d(J).
Ud(J) = 1 (d(E) = 1), the state/ (or class E) is said to be aperiodic.
Let d = d(E) be the period of an indecomposable class E. The transitions
within such a class may be quite freakish, but (as in the preceding example)
there is a cyclic character to the transitions from one group of states to
another. To show this, let us select a state i0 and introduce (for d > 1) the
following subclasses:
C0 - l/eE.-pQ >0=>n = 0(mod d)};
Cl = {j e E: p© > 0 =*> n = l(mod d)};
Cd_! = 0" e E: p{Sj >0=>n = d- l(mod d)}.
Clearly E = C0 + Cx + h Cd_ j. Let us show that the motion from
subclass to subclass is as indicated in Figure 38.
In fact, let state i e Cp and pi} > 0. Let us show that necessarily
je Cp+Hmodd). Let n be such that p["/ > 0. Then n = ad + p and therefore
n = p(mod d) and n + 1 = p + l(mod d). Hence pfj^ > 0 and
/e^p+l(modd)«
Let us observe that it now follows that the transition matrix P of an
indecomposable chain has the following block structure:
§3. Classification of the States of a Markov Chain in Terms of Asymptotic Properties 573
£3 J Indecomposable classes
Cd_,) Cyclic subclasses
Figure 39. Classification of states of a Markov chain in terms of arithmetic properties of
the probabilities p"j.
Consider a subclass Cp. If we suppose that a particle is in the set C0 at the
initial time, then at time s = p + dt, t = 0, 1 it will be in the subclass Cp.
Consequently, with each subclass Cp we can connect a new Markov chain
with transition matrix (p^j)ttjec » which is indecomposable and aperiodic.
Hence if we take account of the classification that we have outlined (see the
summary in Figure 39) we infer that in studying problems on limits of
probabilities p\f we can restrict our attention to aperiodic indecomposable
chains.
3. Problems
1. Show that the relation "«-+" is transitive.
2. For Example 1, §5, show that when 0 < p < 1, all states belong to a single class with
period d = 2.
3. Show that the Markov chains discussed in Examples 4 and 5 of §5 are aperiodic.
§3. Classification of the States of a Markov Chain in
Terms of Asymptotic Properties of the
Probabilities pW
1. Let P = WpijW be the transition matrix of a Markov chain,
fV = Pi{Xk = UXl*iil<l<k-l} (1)
and for i # j
JV = Pt{Xk =j\Xl^jil<l<k- 1}. (2)
574
VIII. Sequences of Random Variables that Form Markov Chains
For X0 = i, these are respectively the probability of first return to state i
at time k, and the probability of first arrival at state j at time k.
Using the strong Markov property (1.16), we can show as in (1.12.38) that
For each ie£we introduce
/«= I /!?', (4)
n= 1
which is the probability that a particle that leaves state i will sooner or later
return to that state. In other words, fu = Pj{o-j < oo}, where Oi = inf{« > 1:
Xn = i} with ¢7, = oo when {•} = 0.
We say that a state i is recurrent if
fa = 1.
and nonrecurrent if
fn < 1-
Every recurrent state can, in turn, be classified according to whether the
average time of return is finite or infinite.
Let us say that a recurrent state i is positive if
and null if
^^(JS/i?) '=0.
Thus we obtain the classification of the states of the chain, as displayed in
Figure 40.
c
Positive
„ states
r Set of all "\
v^ states J
Recurrent ^\ f Nonrecurrent
states J y states
7Y
^ ( Null A
J \ states J
)
Figure 40. Classification of the states of a Markov chain in terms of the asymptotic
properties of the probabilities pj-0.
§3. Classification of the States of a Markov Chain in Terms of Asymptotic Properties
575
2. Since the calculation of the functions ffl can be quite complicated, it is
useful to have the following tests for whether a state i is recurrent or not.
Lemma 1
(a) The state i is recurrent if and only if
I d? = oo. (5)
n = l
(b) Ifstate j is recurrent and i *-*j then state i is also recurrent.
Proof, (a) By (3),
n
Pa — h J a Pa »
fc=i
and therefore (with pi?> = I)
IpW = 1 iftfrir» = tfPtrir»
n=l n= 1 fc= 1 fc= 1 n = k
n=0 \ n=l J
Therefore if £"= l p\f < oo, we have fu < 1 and therefore state i is
nonrecurrent. Furthermore, let £"=1 p\f = oo. Then
I # -is /s?vir" = I /!?>£ prk) z I /P I p!!>>
n=l n=lfc=l fc=l n = fc fc= 1 / = 0
and therefore
g-iPff
fc=i fc=i 2^,1=0 Pa
Thus if £"=! pif = oo then /-,• = 1, that is, the state i is recurrent,
(b) Let pW > 0 and /$> > 0. Then
(n+s+O ■> (s) (n) (r)
Pii ^-PijPjiPji*
and if ££L 1 pf> = oo, then also ££L x pjp = oo, that is, the state i is recurrent.
3. From (5) it is easy to deduce a first result on the asymptotic behavior
Lemma 2. If state j is nonrecurrent then
f>!;'<co (6)
n= 1
576
VIII. Sequences of Random Variables that Form Markov Chains
for every i, and therefore
pg> - 0, n -. oo. (?)
Proof. By (3) and Lemma 1,
n= 1 n= 1 fc= 1 fc= 1 n = 0
«=0 n=0
Here we used the inequality fu = Yj?=i f{!j ^ 1» which holds because the
series represents the probability that a particle starting at /" eventually arrives
at/. This establishes (6) and therefore (7).
Let us now consider recurrent states.
Lemma 3. Let j be a recurrent state with d(j) = 1.
(a) If i communicates with /, then
$>->!, .7-00. (8)
If in addition j is a positive state then
pW - 1 > 0, /7-00. (9)
If however J is a null state, then
I pg'-O, /7-00. (10)
(b) If i and) belong to different classes of communicating states, then
/?!;>-^', n-co. (11)
The proof of the lemma depends on the following theorem from analysis.
Let fi,f2, ■ •. be a sequence of nonnegative numbers with £?1 lj\= 1,
such that the greatest common divisor of the indices/ for which /} > 0 is 1.
Let k0 = 1, u„ = Y*= i /*"„-*, n = 1, 2,..., and let /H = £" , nfn. Then
u„ — \/n as n — oo. (For a proof, see [Fl], §10 of Chapter XIII.)
Taking account of (3), we apply this to u„ = p{J\\ fk = ff>. Then we
immediately find that
where ^- = £^=, n/j?.
§3. Classification of the States of a Markov Chain in Terms of Asymptotic Properties 577
Taking pff = 0 for s < 0, we can rewrite (3) in the form
fc=l
By what has been proved, we have pjj ~fc) -> nj 1, n -* oo, for each given k.
Therefore if we suppose that
Km I fVriJ-" = £ fu lim PS"fe>» (13>
n fc=l fc=l n
we immediately obtain
tf^Uifv)-^,,. (14)
which establishes (11).
Recall that ftj is the probability that a particle starting from state i arrives,
sooner or later, at state j. State j is recurrent, and if i communicates with j,
it is natural to suppose that ftj= 1. Let us show that this is indeed the case.
Let f\j be the probability that a particle, starting from state i, visits state j
infinitely often. Clearly fXJ > /j.. Therefore if we show that, for a recurrent
state j and a state i that communicates with it, the probability/jj = 1, we
will have established thatyjj = 1.
According to part (b) of Lemma 1, the state i is also recurrent, and therefore
A = I/P=1- (15)
Let
<7f = inf{«> l:Xn = i}
be the first time (for times n > 1) at which the particle reaches state i; take
¢7,- = oo if no such time exists.
Then
1 = A = I /i?) = f PK'i = ")= Pfo < oo). (16)
n=1 n=1
and consequently to say that state i is recurrent means that a particle starting
at i will eventually return to the same state (at a random time cr,). But after
returning to this state the "life" of the particle starts over, so to speak (because
of the strong Markov property). Hence it appears that if state i is recurrent
the particle must return to it infinitely often:
P,{Xn = i for infinitely many n) = 1. (17)
Let us now give a formal proof.
Let i be a state (recurrent or nonrecurrent). Let us show that the probability
of return to that state at least r times is (/j,)r.
578
VIII. Sequences of Random Variables that Form Markov Chains
For r = 1 this follows from the definition of fu. Suppose that the
proposition has been proved for r = m — 1. Then by using the strong Markov
property and (16), we have
P; (number of returns to i is greater than or equal to m)
_ " (ot = k* and the number of returns to i after time k\
"ij H is at least m - 1 )
£ « , . x« /at least m - 1 values I , \
-1^. = ^.^,.^,....^^ = ^
00
= Z pfc< = *0p«-(at least m - ! values of Xu X2, • ■ • equal i)
fc=i
Hence it follows in particular that formula (17) holds for a recurrent
state i. If the state is nonrecurrent, then
Pi{X„ = i for infinitely many n) = 0. (18)
We now turn to the proof that /y = 1. Since the state i is recurrent, we
have by (17) and the strong Markov property
1 = Z P,K- = k) + Pto = oo)
fc=i
" / the number of returns to i \
^Lfyj = K after time k is infinite ) + P^J = °°>
£ / infinitely many values of\
= Zp«K = /c' v v f . + P,(^j = oo)
ro /infinitely many I _ \
= Z Pifrj = *) ■ P,- values of Xaj+ lf X„_ 2. y "_ ,)
V... equal i I "~7
- v /■»> p /infiniteJy many vaIues\ ,/i /• x
"ftV° V^i^2, • • - equal i) + (1 ~ fij)
= Z /8Vu + (i - /</) = /Wi/ + 0 - /*)•
fc=l
Thus
and therefore
A/ = fij'fij-
) _,• +P,(^=C0)
§3. Classification of the States of a Markov Chain in Terms of Asymptotic Properties
579
Since i <-►/, we have fl} > 0, and consequently f\s = 1 and ftj = 1.
Therefore if we assume (13), it follows from (14) and the equation fu = 1
that, for communicating states i and/,
"j
As for (13), its validity follows from the theorem on dominated convergence
together with the remark that
This completes the proof of the lemma.
Next we consider periodic states.
i4. Let j be a recurrent state and let d(J) > 1.
(a) Ifi andj belong to the same class (of states), and ifi belongs to the cyclic
subclass Cr andj to Cr+fl, then
pif+°>->^-. (19)
(b) With an arbitrary i,
p(nd + a) _> f" £ /g<* + «>] .£% £1 = 0,1 rf- 1. (20)
Proof, (a) First let a = 0. With respect to the transition matrix Pd the state
j is recurrent and aperiodic. Consequently, by (8),
-1-. ^ ! = <—L
Suppose that (19) has been proved for a = r. Then
" d d
(nd+r+l) _ V n n(n.d + r) -> Y V:v ' — = —.
Pij - LPikPkj -+ LtV* „. •
fc=l fc=l A*r A*j
(b) Clearly
nd + a
P{lf+a) = I f\?Ptf+a+k)> a = 0,l,...,d-L
State/ has period d, and therefore p{jf+a~k) = 0, except when k — a has the
form r • d. Therefore
n
r = 0
and the required result (20) follows from (19).
This completes the proof of the lemma.
580 VIII. Sequences of Random Variables that Form Markov Chains
Lemmas 2-4 imply, in particular, the following result about limits
ofP|;».
Theorem 1. Let a Markov chain be indecomposable (that is, its states form a
single class of essential communicating states) and aperiodic.
Then:
(a) If all states are either null or nonrecurrent, then, for all i and),
p£>-0, m-oo; (21)
(b) if all states j are positive, then, for all i,
pS>-->0, n-oo; (22)
4. Let us discuss the conclusion of this theorem in the case of a Markov chain
with a finite number of states, E = {1, 2,"..., r}. Let us suppose that the
chain is indecomposable and aperiodic. It turns out that then it is
automatically increasing and positive:
indecomposability\
recurrence | (23)
positivity I
d=\ I
For the proof, we suppose that all states are nonrecurrent. Then by (21)
and the finiteness of the set of states of the chain,
l=lim £ pg>= £limpg> = 0. (24)
n j=l j=l n
The resulting contradiction shows that not all states can be nonrecurrent.
Let i0 be a recurrent state andy an arbitrary state. Since i0 <-► j\ Lemma 1 shows
that j is also recurrent.
Thus all states of an aperiodic indecomposable chain are recurrent.
Let us now show that all recurrent states are positive.
If we suppose that they are all null states, we again obtain a contradiction
with (24). Consequently there is at least one positive state, say i0. Let i be any
other state. Since i «-► i0, there are s and t such that pjjj > 0 and p$0 > 0, and
therefore
(n + s+f) ■> „[s)„ln) (r) (s) _j_ . _(f) ^ /\ f~r\
Pii Z- PiioPioioPioi ^ Piio ~ P'oi ■> U \ZV
"»'o
Hence there is a positive s such that pW > s > 0 for all sufficiently large n.
But pW -> \/fi; and therefore ^,- > 0. Consequently (23) is established.
Let Tij = 1/nj. Then itj > 0 by (22) and since
l=lim t pg>= £n.t
n j=l j=l
(indecomposabilityX
d-i )-
§3. Classification of the States of a Markov Chain in Terms of Asymptotic Properties 581
the (aperiodic indecomposable) chain is ergodic. Clearly, for all ergodic finite
chains,
there is an n0 such that min pj"* > Ofor all n>n0. (26)
It was shown in §12 of Chapter I that the converse is also valid: (26) implies
ergodicity.
Consequently we have the following implications:
(indecomposability\
recurrence U ergodicity ~ (26).
positivity J
d=l I
However, we can prove more.
Theorem 2. For a finite Markov chain
(indecomposability\
recurrence \^{trgodletty)<>(^
positivity J
d=l I
Proof. We have only to establish
(ergodicity) =>
/indecomposability\
I recurrence I
1 positivity /"
\ d=\ I
Indecomposability follows from (26). As for aperiodicity, increasingness, and
positivity, they are valid in more general situations (the existence of a limiting
distribution is sufficient), as will be shown in Theorem 2, §4.
5. Problems
1. Consider an indecomposable chain with states 0,1,2, A necessary and sufficient
condition for it to be nonrecurrent is that the system of equations Uj = £* w,-py,
7 = 0, I, , has a bounded solution such that ut £ c, / = 0,1,
2. A sufficient condition for an indecomposable chain with states 0,1,... to be recurrent
is that there is a sequence (w0, ux,...) with ut -» oo, i -» oo, such that Uj > J^ M,p0- for
all; 96 0.
3. A necessary and sufficient condition for an indecomposable chain with states 0,1,...
to be recurrent and positive is that the system of equations u} = £,■ w.p.j,./ = 0,1,...,
has a solution, not identically zero, such that £f |w;| < 00.
582
VIII. Sequences of Random Variables that Form Markov Chains
4. Consider a Markov chain with states 0,1,... and transition probabilities
POO = r0> Poi = Po > 0»
Pij =
+ 1,
- 1,
fp,>0, j = i
U>0, j = i
U>0, j = i
10 otherwise.
Let p0 = \,pm = (ql... gm)/(p,... pm). Prove the following propositions.
Chain is recurrent o £ pm = oo,
Chain is nonrecurrent o £ pm < oo,
Chain is positive o V < oo,
PmPm
Chain is null o Y pm = oo, Y = oo.
PmPm
5. Show that /ft^/y/jfc and sup pg> </■; < f Pi?-
n n=l
6. Show that for every Markov chain with countably many states, the limit of p{ff always
exists in the Cescaro sense:
lim-ipi? = ^.
„ "*=i /f;
7. Consider a Markov chain <^0, ¢,,... with £fc+, = (£fc+) + /7fc+,, where i}„ f/2,... is a
sequence of independent identically distributed random variables with P(nk=j)=pj,
7 = 0, 1, Write the transition matrix and show that if p0 > 0, p0 + p, < 1, the
chain is recurrent if and only if J^ kpk < 1.
§4. On the Existence of Limits and of Stationary
Distributions
1. We begin with some necessary conditions for the existence of stationary
distributions.
Theorem 1. Let a Markov chain with countably many states E = {1, 2,...} and
transition matrix P = ||/?,7|| be such that the limits
Iimpg> = 7tj,
n
exist for all i and / and do not depend on i.
§4. On the Existence of Limits and of Stationary Distributions
Then
(a) !>,•< l, TiniPu = n.i>
(b) either all n} = 0or Yj ns = 1;
(c) if all 71, = 0, there is no stationary distribution; if Y,inj= '»
n = (re,, n2,...) is the unique stationary distribution.
Proof. By Fatou's lemma,
XK; = IlimpS><!imIp!?=l.
j J n n J
Moreover,
I «iPtf = I (fim rf?) p0- s lim I rf? Po- = Um pS+ " = *,.
that is, for each /,
£ niPij < 7ij.
Suppose that
i
for some y0. Then
I wj > Z (I w*pij) = £ *i X p« = I **•
This contradiction shows that
£ T^iPij = nj
i
for all j.
It follows from (1) that
i
Therefore
7r,. = lim X 7c,pg> = £ 7rf lim pg> = h n)nj9
that is, for all j\
K,(i-2>.)-ft
from which (b) follows.
584
VIII. Sequences of Random Variables that Form Markov Chains
Now let Q = (qu q2,...) be a stationary distribution. Since £,■ q-ptf = qj
and therefore £f q^j = qjt that is, nj = qj for ally, this stationary distribution
must coincide with n = (7^, 7r2, •. .)• Therefore if all itj = 0, there is no
stationary distribution. If, however, £,- 7ij = 1, then n = (nlt n2, ■ • •) is the
unique stationary distribution.
This completes the proof of the theorem.
Let us state and prove a fundamental result on the existence of a unique
stationary distribution.
Theorem 2. For Markov chains with countably many states, there is a unique
stationary distribution if and only if the set of states contains precisely one
positive recurrent class (of essential communicating states).
Proof. Let N be the number of positive recurrent classes.
Suppose N = 0. Then all states are either nonrecurrent or are recurrent
null states, and by (3.10) and (3.20), lim„ p\f = 0 for all i and/. Consequently,
by Theorem 1, there is no stationary distribution.
Let N = 1 and let C be the unique positive recurrent class. If d(C) = 1 we
have, by (3.8),
pS«)_l>0, iJeC.
"i
If/ $ C, then j is nonrecurrent, and pjf -> 0 for all i as n -* 00, by (3.7).
Put
I—>0, jeC,
0, j$C.
Then, by Theorem 1, the set Q = (qlt q2,...) is the unique stationary
distribution.
Now let d = d(C) > 1. Let C0, •. •, Cd_ x be the cyclic subclasses. With
respect to Pd, each subclass Cfc is a recurrent aperiodic class. Then if i and
j e Cfc we have
pgd>-->0
by (3.19). Therefore on each set Cfc, the set d/fxjtjeCk, forms (with respect
toPd) the unique stationary distribution. Hence it follows, in particular,
that £i6Cfc (d/vij) = 1, that is, ZjecJWj) = l/d.
Let us put
I-|-, jeC = C0 +--- + 0,,-,,
0, y*c,
§4. On the Existence of Limits and of Stationary Distributions
585
and show that, for the original chain, the set Q = (qu q2,...) is the unique
stationary distribution.
In fact, for i e C,
Then by Fatou's lemma,
£ = Mm PF> a I Jim p!f -»p, = J] ip,
A*. n jeC n jeC f*j
and therefore
But
1 ^ V l
1 d-1 / 1\ Ml
ieC A*. fc = 0 \«eCfcAV fc = 0 "
As in Theorem 1, it can now be shown that in fact
1= vi
This shows that the set Q = (qx, q2i. ■ ■) is a stationary distribution, which is
unique by Theorem 1.
Now let there be N > 2 positive recurrent classes. Denote them by
C1,..., CN, and let Q* = (q\, q2,...) be the stationary distribution
corresponding to the class C and constructed according to the formula
I— >0, jeC,
0, /*C.
Then, for all nonnegative numbers ax,...,aN such that ax + • • • + aN = 1,
the set AiQ1 + • • • + a^Q^ will also form a stationary distribution, since
(fliQ1 + • • • + awQw)P = a^P + • • • + aNQNP = fliQ1 + - • - + aNQN.
Hence it follows that when N > 2 there is a continuum of stationary
distributions. Therefore there is a unique stationary distribution only in the case
N = 1.
This completes the proof of the theorem.
2. The following theorem answers the question of when there is a limit
distribution for a Markov chain with a countable set of states E.
586
VIII. Sequences of Random Variables that Form Markov Chains
Theorem 3. A necessary and sufficient condition for the existence of a limit
distribution is that there is, in the set E of states of the chain, exactly one
aperiodic positive recurrent class C such that fti = 1 for all] e C and i e E.
Proof. Necessity. Let qi = lim pl$ and let Q = (qlt q2,...) be a distribution
(qt > 0, £,- qi = 1). Then by Theorem 1 this limit distribution is the unique
stationary distribution, and therefore by Theorem 2 there is one and only one
recurrent positive class C. Let us show that this class has period d= \.
Suppose the contrary, that is, let J > 1. Let C0, Cu..., Cd_ i be the cyclic
subclasses. If i e C0 and j e C„ then by (19), p%d+1) - d/fr and pgd) = 0 for
all n. But d/fXj > 0, and therefore p\f does not have a limit as n -»> oo; this
contradicts the hypothesis that lim„ p{$ exists. Now let/e C and i e E. Then,
by (3.11), p\f -> fj/fXj. Consequently itj = fjny But rc,- is independent of i.
Therefore ^. = /),= 1.
Sufficiency. By (3.11), (3.10) and (3.7),
1-^, jeC, ieE,
0, y*C, ieE.
Therefore iffj = 1 for ally e C and i e E, then ^ = lim„ p\f is independent of i.
Class C is positive and therefore qi > 0 for./ e C. Then, by Theorem 1, we have
£\ q. = l and the set Q = (qlt q2,...) is a limit distribution.
3. Let us summarize the results obtained above on the existence of a limit
distribution, the uniqueness of a stationary distribution and ergodicity, for
the case of finite chains.
Theorem 4. We have the following implications for finite Markov chains:
. /chain indecomposableX
{ergodicity) <=> I recurrent, positive, I
\ with d=\ j
(limit distribution\ (2} (there existS ^tly one\
I • t I <=> I recurrent positive class I
V exists J \ withd=\ J
(unique stationary^ 0} (there exists exactly one\
I distribution )<>\ recurrent positive class)
Proof. The "vertical" implications are evident. {1} is established in Theorem
2, §3; {2} in Theorem 3; {3} in Theorem 2.
§5. Examples
587
4. Problems
1. Show that, in Example 1 of §5, neither stationary nor a limit distribution occurs.
2. Discuss the question of stationarity and limit distribution for the Markov chain with
transition matrix
3. Let P = \\pij\\ be a finite doubly stochastic matrix, that is, JjL { pu = \J = 1,..., m.
Show that the stationary distribution of the corresponding Markov chain is the vector
Q = (l/m 1/m).
§5. Examples
1. We present a number of examples to illustrate the concepts introduced
above, and the results on the classification and limit behavior of transition
probabilities.
Example 1. A simple random walk is a Markov chain such that a particle
remains in each state with a certain probability, and goes to the next state with
a certain probability.
The simple random walk corresponding to the graph
describes the motion of a particle among the states E = {0, ± 1,...} with
transitions one unit to the right with probability p and to the left with
probability q. It is clear that the transition probabilities are
fp, j=i+ 1,
-U j = i~h
[O otherwise.
P + q=h
If p = 0, the particle moves deterministically to the left; if p = 1, to the
right. These cases are of little interest since all states are inessential. We
therefore assume that 0 < p < 1.
With this assumption, the states of the chain form a single class (of essential
communicating states). A particle can return to each state after 2,4,6,... steps.
Hence the chain has period d = 2.
588
VIII. Sequences of Random Variables that Form Markov Chains
Since, for each i e E,
A*0- Cilpqf = (^{pqf,
then by Stirling's formula (which says n! ~ yjlnn n"e ") we have
v$n) -
(4pqT
Therefore £„ p\fn) = coifp = q, and ^ p\fn) < ooifp # q. In other words,
the chain is recurrent if p = q, but if p ^ q it is nonrecurrent. It was shown
in §10, Chapter I, that/!?n) - 1/(2^3¾ n->oo,ifp = g = J. Therefore
V-i — Z" (2«)y]!-2n> = oo, that is, all recurrent states are null states. Hence by
Theorem 1 of §3, pff -> 0 as n -» oo for all / and j.
There are no stationary, limit, or ergodic distributions.
Example 2. Consider a simple random walk with E = {0,1,2,...}, where 0 is
an absorbing barrier:
0 < p < I
State 0 forms a unique positive recurrent class with d = \. All other states
are nonrecurrent. Therefore, by Theorem 2 of §4, there is a unique stationary
distribution
II = (7T0, 7t,,7t2,...)
with 7r0 = 1 and 7r,- = 0, / > 1.
Let us now consider the question of limit distributions. Clearly pJJ = 1,
p\f -> 0, j > 1, i > 0. Let us now show that for i > 1 the numbers
a(i) = lim„ p\# are given by the formulas
p>q,
(1)
We begin by observing that since state 0 is absorbing we have
pj.J = Yjk<nfio an^ consequently a(/) = fi0, that is, the probability a(i) is the
probability that a particle starting from state i sooner or later reaches the null
§5. Examples
589
state. By the method of §12, Chapter I (see also §2 of Chapter VII) we can
obtain the recursion relation
«(i) = pa(i +1) + qa(i - 1), (2)
with a(0) = 1. The general solution of this equation has the form
«(0 = a + b{q/p)\ (3)
and the condition a(0) = 1 imposes the condition a + b = 1.
If we suppose that q > p, then since a(i) is bounded we see at once that
b = 0, and therefore a(i) = 1. This is quite natural, since when q > p the
particle tends to move toward the null state.
If, on the other hand, p > q the opposite is true: the particle tends to move
to the right, and so it is natural to expect that
a(0 -► 0, i -> oo, (4)
and consequently a = 0 and
<0-g)' (5)
To establish this equation, we shall not start from (4), but proceed differently.
In addition to the absorbing barrier at 0 we introduce an absorbing barrier
at the integral point N. Let us denote by aN(i) the probability that a particle
that starts at i reaches the zero state before reaching N. Then aN(i) satisfies (2)
with the boundary conditions
aw(Q) = 1, aN(N) = 0,
and, as we have already shown in §9, Chapter I,
W 14
<*"(0 = W /jf; . 0<i<N. (6)
l -
Hence
lim aN(i) =
and consequently to prove (5) we have only to show that
a(i) = lim aN(i). (7)
N
This is intuitively clear. A formal proof can be given as follows.
Let us suppose that the particle starts from a given state i. Then
a(i) = PfC4),
(8)
590
VIII. Sequences of Random Variables that Form Markov Chains
where A is the event in which there is an N such that a particle starting from i
reaches the zero state before reaching state N. If
An = {particle reaches 0 before N},
then A = U"=i+i AN. It is clear that AN £ AN+l and
P,/ 0 AN) = «
\N = i+l / N
lim Pi(AN).
(9)
But aN(i) = P£AN), so that (7) follows directly from (8) and (9).
Thus if p > q the limit lim pffl depends on i, and consequently there is no
limit distribution in this case. If, however, p < q, then in all cases lim pj-jj = 1
and lim p\f = 0, j > 1. Therefore in this case the limit distribution has the
form n = (1, 0, 0,...).
Example 3. Consider a simple random walk with absorbing barriers at 0
and N:
Here there are two positive recurrent classes {0} and {N}. All other states
{1,..., N — 1} are nonrecurrent. It follows from Theorem 1, §3, that there
are infinitely many stationary distributions Yl = (n0tn1 nN) with
n0 = a, nN = b, nx = • • • = nN_, = 0, where a > 0, b > 0, a + b = I.
From Theorem 4, §4 it also follows that there is no limit distribution. This is
also a consequence of the equations (Subsection 2, §9, Chapter I)
lim pfe>={
l-w
JT* P*<*>
p = q,
(10)
lim p&> = 1 - lim pft> and lim p\f = 0,1 < j < N - 1.
Example 4. Consider a simple random walk with E = {0,1,...} and a
reflecting barrier at 0:
0 < p < l
§5. Examples
591
It is easy to see that the chain is periodic with period d = 2. Suppose that
p > q (the moving particle tends to move to the right). Let i > 1; to determine
the probability fn we may use formula (1), from which it follows that
fn~®~1<u i>h
All states of this chain communicate with each other. Therefore if state i is
recurrent, state 1 will also be recurrent. But (see the proof of Lemma 3 in §3)
in that casefn must be 1. Consequently when p > q all the states of the chain
are nonrecurrent. Therefore pff -» 0, n -► oo for i and j e E, and there is
neither a limit distribution nor a stationary distribution.
Now let p < q. Then, by (1), fn = 1 for i > 1 and ftl = q + pf21 = 1.
Hence the chain is recurrent.
Consider the system of equations determining the stationary distribution
n = (n0,nlt...):
ni — no + n2q>
n2 = KiP + n3 q*
that is,
71! = Tl^q + 7T2g,
7T2 = n2q + n3q,
whence
»'=©*'- J=2-3
If p = q we have nl = n2 = ..., and consequently
71q = tc1 = n2 = — = 0.
In other words, if p = q, there is no stationary distribution, and therefore no
limit distribution. From this and Theorem 3, §4, it follows, in particular, that
in this case all states of the chain are null states.
It remains to consider the case p < q. From the condition ^JL0 itj = 1 we
find that
.[.♦.♦©♦0V-]-1-
that is
ni = —^—
2q
592
VIII. Sequences of Random Variables that Form Markov Chains
and
q-p
2q
j>2.
Therefore the distribution n is the unique stationary distribution. Hence
when p < q the chain is recurrent and positive (Theorem 2, §4). The
distribution n is also a limit distribution and is ergodic.
Example 5. Again consider a simple random walk with reflecting barriers at
QandN:
0 <p < l
q^N- 1 l
All the states of the chain are periodic with period d = 2, recurrent, and
positive. According to Theorem 4 of §4, the chain is ergodic. Solving the
system 7tj = Yj=o niPij subject to Ys=o nt= U we obtain the ergodic
distribution
71,=-
and
i+ I
n0 = nxq,
2<j£N - 1,
-iP-
2. Example 6. It follows from Example 1 that the simple random walk
considered there on the integral points of the line is recurrent if p = q, but
nonrecurrent if p # q. Now let us consider simple random walks in the plane
and in space, from the point of view of recurrence or nonrecurrence.
For the plane, we suppose that a particle in any state (i, J) moves up, down,
to the right or to the left with probability J (Figure 41).
—1—
i
1
4
11
i
I
'
i
i
^. 4
i
4
►
* >
I ^
i
1 ►
-1 0^1 2
Figure 41. A walk in the plane.
§5. Examples
593
For definiteness, consider the state (0,0). Then the probability
pk = p!o, 0).(0, o) of going from (0,0) to (0,0) in k steps is given by
^2„+i=0, « = 0,1,2,...,
«i,j):i + j=n.0<i<n) ll'JJ'
Multiplying numerators and denominators by (n !)2, we obtain
p2n = a)2n cn2n t acnr = a)2n(cn2n)\
i = 0
since
tcicr1 = c2n.
i=0
Applying Stirling's formula, we find that
P ^-
Pln nn'
and therefore J] P2n = cxd. Consequently the state (0,0) (likewise any other)
is recurrent.
It turns out, however, that in three or more dimensions the symmetric
random walk is nonrecurrent. Let us prove this for walks on the integral
points (i, j, k) in space.
Let us suppose that a particle moves from (i, j, k) by one unit along a
coordinate direction, with probability £ for each.
594
VIII. Sequences of Random Variables that Form Markov Chains
Then if Pk is the probability of going from (0, 0, 0) to (0, 0, 0) in k steps, we
have
P2n =
P2n+i = 0> « = 0,1,...,
r (2n)!
r(i)2'
{(f. j):0<,+J^„fo^^n, 0*j<n) 0'!)2 0'-')2((" ~ I ~ 7)02
Z {(f,j):0^« + j<n,0sfsn,0^jsn} [}-Jl\n l J)lJ
^ *"n 92n *- 2n ™ Z, ., ., , _ . _ ^, <,X>
Z J «i.j):0^i+j<n, O^i^n, Osjin) lJ- Kn l J)'
= C„2^C"2„1 „=1,2
(ID
where
Q= max [ n\ -1. (12)
Let us show that when n is large, the max in (12) is attained for i ~ n/3,
j ~ n/3. Let i0 and j0 be the values at which the max is attained. Then the
following inequalities are evident:
k]- O'o - 1)! (" - k ~ io + 1)! Jo]-hHn - k ~ io)!'
nl
n\
j0\(i0 + 1)! (n - j0 ~ k> ~ 1)! ~ O'o - 1)! hHn ~ k - io + 1)!
n\
~0o + l)!io!("-7o-io-l)!'
whence
n - i0 - 1 < 2j0 < n - i0 + 1,
" - 7o - 1 < 2/0 < " - k + 1»
and therefore we have, for large n, i0 ~ n/3,j0 ~ n/3, and
n\
By Stirling's formula,
1 r. l 3v/3
^n -i?»i *-"2n -»n
3" 27r3'V2'
§5. Examples
595
and since
£ 3^/3
„?12^V^<00'
we have £n P2n < °o. Consequently the state (0,0, 0), and likewise any
other state, is nonrecurrent. A similar result holds for dimensions greater
than 3.
Thus we have the following result (Polya):
Theorem. For R1 and R2, the symmetric random walk is recurrent; for R",
n > 3, it is nonrecurrent.
3. Problems
1. Derive the recursion relation (1).
2. Establish (4).
3. Show that in Example 5 all states are aperiodic, recurrent, and positive.
4. Classify the states of a Markov chain with transition matrix
(P q 0 0\
p q 0 0 '
0 0 p q/
where p + q = 1, p > 0, q > 0.
Historical and Bibliographical Notes
Introduction
The history of probability theory up to the time of Laplace is described by
Todhunter [Tl]. The period from Laplace to the end of the nineteenth
century is covered by Gnedenko and Sheinin in [K10]. Maistrov [Ml]
discusses the history of probability theory from the beginning to the thirties
of the present century. There is a brief survey in Gnedenko [G4]. For the
origin of much of the terminology of the subject see Aleksandrova [A3].
For the basic concepts see Kolmogorov [K8], Gnedenko [G4], Borovkov
[B4], Gnedenko and Khinchin [G6], A. M. and I. M. Yaglom [Yl], Prokhorov
and Rozanov [P5], Feller [Fl, F2], Neyman [N3], Loeve [L7], and Doob
[D3]. We also mention [M3] which contains a large number of problems
on probability theory.
In putting this text together, the author has consulted a wide range of
sources. We mention particularly the books by Breiman [B5], Ash [A4, A5],
and Ash and Gardner [A6], which (in the author's opinion) contain an
excellent selection and presentation of material.
For current work in the field see, for example, Annals of Probability
(formerly Annals of Mathematical Statistics) and Theory of Probability
and its Applications (translation of Teoriya Veroyatnostei i ee Primeneniya).
Mathematical Reviews and Zentralblatt fur Mathematik contain abstracts
of current papers on probability and mathematical statistics from all over the
world.
For tables for use in computations, see [Al].
598
Historical and Bibliographical Notes
Chapter I
§1. Concerning the construction of probabilistic models see Kolmogorov
[K7] and Gnedenko [G4]. For further material on problems of distributing
objects among boxes see, e.g., Kolchin, Sevastyanov and Chistyakov [K3],
§2. For other probabilistic models (in particular, the one-dimensional
Ising model) that are used in statistical physics,, see Jsihara..[I2]. ,
§3. Bayes's formula and theorem form the basis for the "Bayesian
approach" to mathematical statistics. See, for example, De Groot [Dl] and
Zacks [Zl].
§4. A variety of problems about random variables and their probabilistic
description can be found in Meshalkin [M3].
§5. A combinatorial proof of the law of large numbers (originating with
James Bernoulli) is given in, for example, Feller [Fl]. For the empirical
meaning of the law of large numbers see Kolmogorov [K.7].
§6. For sharper forms of the local and in tegrated theorems, and of Poisson's
theorem, see Borovkov [B4] and Prokhorov [P3].
§7. The examples of Bernoulli schemes illustrate some of the basic
concepts and methods of mathematical statistics. For more detailed discussions
see, for example, Cramer [C5] and van der Waerden [Wl].
§8. Conditional probability and conditional expectation with respect to
a partition will help the reader understand the concepts of conditional
probability and conditional expectation with respect to ©--algebras, which will
be introduced later.
§9. The ruin problem was considered in essentially the present form by
Laplace. See Gnedenko and Sheinin [K10]. Feller [Fl] contains extensive
material from the same circle of ideas.
§10. Our presentation essentially follows Feller [Fl]. The method for
proving (10) and (11) is taken from Doherty [D2].
§11. Martingale theory is throughly covered in Doob [D3]. A different
proof of the ballot theorem is given, for instance, in Feller [Fl].
§12. There is extensive material on Markov chains in the books by Feller
[Fl], Dynkin [D4], Kemeny and Snell [K2], Sarymsakov [SI], and
Sirazhdinov [S8]. The theory of branching processes is discussed by
Sevastyanov [S3].
Chapter II
§1. Kolmogorov's axioms are presented in his book [K8].
§2. Further material on algebras and ©--algebras can be found in, for
example, Kolmogorov and Fomin [K8], Neveu [Nl], Breiman [B5], and
Ash IAS].
§3. For a proof of Caratheodory's theorem see Loeve [17] or Halmos
[HI].
Historical and Bibliographical Notes
599
§§4-5. More material on measurable functions is available in Halmos
[HI].
§6. See also Kolmogorov and Fomin [K8], Halmos [HI], and Ash and
Gardner [A6]. The Radon-Nikodym theorem is proved in these books. The
inequality
P(|£|>e)<-^-
is sometimes called Chebyshev's inequality, and the inequality
P(|£|>£)<-^-, r>0,
£
is called Markov's inequality.
For Pratt's lemma see [P2].
§7. The definitions of conditional probability and conditional expectation
with respect to a ©--algebra were given by Kolmogorov [K8]. For additional
material see Breiman [B5] and Ash [A5]. The result quoted in the Corollary
to Theorem 5 can be found in [M5].
§8. See also Borovkov [B4], Ash [A5], Cramer [C5], and Gnedenko
[G4].
§9. Kolmogorov's theorem on the existence of a process with given finite-
dimensional distribution is in his book [K8]. For Ionescu-Tulcea's theorem
see also Neveu [Nl] and Ash [A5]. The proof in the text follows [AS].
§§10-11. See also Kolmogorov and Fomin [K9], Ash [A5], Doob [D3],
and Loeve [L7].
§12. The theory of characteristic functions is presented in many books.
See, for example, Gnedenko [G4], Gnedenko and Kolmogorov [G5],
Ramachandran [Rl], Lukacs [L8], and Lukacs and Laha [L9]. Our
presentation of the connection between moments and semi-invariants
follows Leonov and Shiryaev [L4].
§13. See also Ibragimov and Rozanov [II], Breiman [B5], and Liptser
and Shiryayev [L5].
Chapter III
§1. Detailed investigations of problems on weak convergence of
probability measures are given in Gnedenko and Kolmogorov [G5] and Billings-
ley [B3].
§2. Prokhorov's theorem appears in his paper [P4].
§3. The monograph [G5] by Gnedenko and Kolmogorov studies the
limit theorems of probability theory by the method of characteristic functions.
See also Billingsley [B3]. Problem 2 includes both Bernoulli's law of large
numbers and Poisson's law of large numbers (which assumes that £lt £2, - • •
are independent and take only two values (1 and 0), but in general are
differently distributed: P(f, = 1) = p„ P(ff = 0) = 1 - pi9 i > 1).
600
Historical and Bibliographical Notes
§4. Here we give the standard proof of the central limit theorem for sums
of independent random variables under the Lindeberg condition. Compare
[G5] and [P6],
In the first edition, we gave the proof of Theorem 3 here.
§5. Questions of the validity of the central limit theorem without the
hypothesis of negligibility in the limit have already attracted the attention of
P. Levy. A detailed account of the current state of the theory of limit
theorems in the nonclassical setting is contained in Zolotarev [Z4]. The statement
and proof of Theorem 1 were given by Rotar [R5].
§6. The presentation uses material from Gnedenko and Kolmogorov
[G5], Ash [A5], and Petrov [PI], [P6].
§7. The Levy-Prokhorov metric was introduced in a well-known work
by Prokhorov [P4], to whom the results on metrizability of weak
convergence of measures given on metric spaces are also due. Concerning the metric
\\P ~ P\\bl> see Dudley [D6] and Pollard [P7].
§8. Theorem 1 is due to Skorokhod. Useful material on the method of a
single probability space may be found in Borovkov [B4] and in Pollard [P7].
§§9-10. A number of books contain a great deal of material touching on
these questions: Jacod and Shiryaev [Jl], LeCam [L10], Greenwood and
Shiryaev [G7], and Liese and Vajda [Lll].
§11. Petrov [P6] contains a lot of material on estimates of the rate of
convergence in the central limit theorem. The proof given of the theorem of
Berry and Esseen is contained in Gnedenko and Kolmogorov [G5].
§12. The proof follows Presman [P8].
Chapter IV
§1. Kolmogorov's zero-or-one law appears in his book [K8]. For the
Hewitt-Savage zero-or-one law see also Borovkov [B4], Breiman [B5], and
Ash [A5].
§§2-4. Here the fundamental results were obtained by Kolmogorov and
Khinchin (see [K8] and the references given there). See also Petrov [PI] and
Stout [S9]. For probabilistic methods in number theory see Kubilius [Kll],
§5. Regarding these questions, see Petrov [P6], Borovkov [B4], and
Dacunha-Castelle and Duflo [D5].
Chapter V
§§1-3. Our exposition of the theory of (strict sense) stationary random
processes is based on Breiman [B5], Sinai [S7], and Lamperti [12]. The
simple proof of the maximal ergodic theorem was given by Garsia [Gl].
Historical and Bibliographical Notes
601
Chapter VI
§1. The books by Rozanov [R4], and Gihman and Skorohod [G2, G3]
are devoted to the theory of (wide sense) stationary random processes.
Example 6 was frequently presented in Kolmogorov's lectures.
§2. For orthogonal stochastic measures and stochastic integrals see also
Doob [D3], Gihman and Skorohod [G3], Rozanov [R4], and Ash and
Gardner [A6].
§3. The spectral representation (2) was obtained by Cramer and Loeve
(see, for example, [17]). The same representation (in different language) is
contained in Kolmogorov [K5]. Also see Doob [D3], Rozanov [R4], and
Ash and Gardner [A6].
§4. There is a detailed exposition of problems of statistical estimation of
the covariance function and spectral density in Hannan [H2, H3].
§§5-6. See also Rozanov [R4], Lamperti [L2], and Gihman and Skorohod
[G2, G3].
§7. The presentation follows Lipster and Shiryaev [L5].
Chapter VII
§1. Most of the fundamental results of the theory of martingales were
obtained by Doob [D3]. Theorem 1 is taken from Meyer [M4]. Also see
Meyer and Dellacherie [M5], Lipster and Shiryaev [15], and Gihman and
Skorohod [G3].
§2. Theorem 1 is often called the theorem "on transformation under a
system of optional stopping" (Doob [D3]). For the identities (14) and (15)
and Wald's fundamental identity see Wald [W2].
§3. Chow and Teicher [C3] contains an illuminating study of the results
presented here, including proofs of the inequalities of Khinchin, Marcinkie-
wicz and Zygmund, Burkholder, and Davis. Theorem 2 was given by Lenglart
[L3].
§4. See Doob [D3].
§5. Here we follow Kabanov, Liptser and Shiryaev [Kl], Engelbert and
Shiryaev [El], and Neveu [N2]. Theorem 4 and the example were given by
Liptser.
§6. This approach to problems of absolute continuity and singularity, and
the results given here, can be found in Kabanov, Liptser and Shiryaev [Kl].
Theorem 6 was obtained by Kabanov.
§7. Theorems 1 and 2 were given by Novikov [N4]. Lemma 1 is a discrete
analog of Girsanov's lemma (see [Kl]).
§8. See also Liptser and Shiryaev [L12] and Jacod and Shiryaev [Jl],
which discuss limit theoremsfor random processes of a rather general nature
(for example, martingales, semi-martingales).
602
Historical and Bibliographical Notes
Chapter VIII
§1. For the basic definitions see Dynkin [D4], Ventzel [V2], Doob [D3],
and Gihman and Skorohod [G3]. The existence of regular transition
probabilities such that the Kolmogorov-Chapman equation (9) is satisfied for all
x e R is proved in [N1] (corollary to Proposition V.2.1) and in [G3] (Volume
I, Chapter II, §4). Kuznetsov (see Abstracts of the Twelfth European Meeting
of Statisticians, Varna, 1979) has established the validity (which is far from
trivial) of a similar result for Markov processes with continuous times and
values in universal measurable spaces.
§§2-5. Here the presentation follows Kolmogorov [K4], Borovkov
[B4], and Ash [A4].
References *
[Al] M. Abramovitz and I. A. Stegun, editors. Handbook of Mathematical
Functions with Formulas, Graphs and Mathematical Tables. National Bureau of
Standards, Washington, D.C., 1964.
[A2] P. S. Aleksandrov. Einjuhrung in die Mengenlehre und die Theorie der reellen
Funktionen. DVW, Berlin, 1956.
[A3] N. V. Aleksandrova. Mathematical Terms [Matematicheskie terminy].
Vysshaia Shkola, Moscow, 1978.
[A4] R. B. Ash. Basic Probability Theory. Wiley, New York, 1970.
[A5] R. B. Ash. Real Analysis and Probability. Academic Press, New York, 1972.
[A6] R. B. Ash and M. F. Gardner. Topics in Stochastic Processes. Academic
Press, New York, 1975.
[Bl] S. N. Bernshtein. Chebyshev's work on the theory of probability (in Russian),
in The Scientific Legacy of P. L. Chebyshev [Nauchnoe nasledie P. L.
Chebysheva'], pp. 43-68, Akademiya Nauk SSSR, Moscow-Leningrad, 1945.
[B2] S. N. Bernshtein. Theory of Probability [Teoriya veroyatnostei\t 4th ed.
Gostehizdat, Moscow, 1946.
[B3] P. Billingsley. Convergence of Probability Measures. Wiley, New York, 1968.
[B4] A. A. Borovkov. Wahrscheinlichkeitstheorie: eine Einfuhrung, first edition
Birkhauser, Basel-Stuttgart, 1976; Theory of Probability, second edition
[Teoriya veroyatnostet}. "Nauka," Moscow, 1986.
[B5] L. Breiman. Probability. Addison-Wesley, Reading, MA, 1968.
[CI] P. L. Chebyshev. Theory of Probability: Lectures Given in 1879 and 1880
[Teoriya veroyatnostet: Lektsii akad. P. L. Chebysheva chitannye v 1879,1880
gg.~]. Edited by A. N. Krylov from notes by A. N. Lyapunov. Moscow-
Leningrad, 1936.
[C2] Y. S. Chow, H. Robbins, and D. Siegmund. The Theory of Optimal Stopping.
Dover, New York, 1991.
f Translator's note: References to translations into Russian have been replaced by references
to the original works. Russian references have been replaced by their translations whenever I
could locate translations; otherwise they are reproduced (with translated titles). Names of
journals are abbreviated according to the forms used in Mathematical Reviews (1982 versions).
604
References
[C3] Y. S. Chow and H. Teicher. Probability Theory: Independence, Interchanga-
bility, Martingales. Springer-Verlag, New York, 1978.
[C4] Kai-Lai Chung. Markov Chains with Stationary Transition Probabilities.
Springer-Verlag, New York, 1967.
[C5] H. Cramer, Mathematical Methods of Statistics. Princeton University Press,
Princeton, NJ, 1957.
[Dl] M. H. De Groot. Optimal Statistical Decisions. McGraw-Hill, New York,
1970.
[D2] M. Doherty. An amusing proof in fluctuation theory. Lecture Notes in
Mathematics, no. 452,101-104, Springer-Verlag, Berlin, 1975.
[D3] J. L. Doob. Stochastic Processes. Wiley, New York, 1953.
[D4] E. B. Dynkin. Markov Processes. Plenum, New York, 1963.
[El] H. J. Engelbert and A. N. Shiryaev. On the sets of convergence and
generalized submartingales. Stochastics 2 (1979), 155-166.
[Fl] W. Feller. An Introduction to Probability Theory and Its Applications, vol. 1,
3rd ed. Wiley, New York, 1968.
[F2] W. Feller. An Introduction to Probability Theory and Its Applications, vol. 2,
2nd ed. Wiley, New York, 1966.
[Gl] A. Garcia. A simple proof of E. Hopfs maximal ergodic theorem. J. Math.
Mech. 14 (1965), 381-382.
[G2] 1.1. Gihman [Gikhman] and A. V. Skorohod [Skorokhod]. Introduction to
the theory of random processes, first edition. Saunders, Philadelphia, 1969;
second edition [Vvedenie v teoriyu sluchainyh protsessov]. "Nauka", Moscow,
1977.
[G3] I. I. Gihman and A. V. Skorohod. Theory of Stochastic Processes, 3 vols.
Springer-Verlag, New York-Berlin, 1974-1979.
[G4] B. V. Gnedenko. The theory of probability. Mir, Moscow, 1988.
[G5] B. V. Gnedenko and A. N. Kolmogorov. Limit Distributions for Sums of
Independent Random Variables, revised edition. Addison-Wesley, Reading,
1968.
[G6] B. V. Gnedenko and A. Ya. Khinchin. An Elementary Introduction to the
Theory of Probability. Freeman, San Francisco, 1961; ninth edition [Ele-
mentarnoe vvedenie v teoriyu veroyatnosteQ. "Nauka", Moscow, 1982.
[HI] P. R. Halmos. Measure Theory. Van Nostrand, New York, 1950.
[H2] E. J. Hannan. Time Series Analysis. Methuen, London, 1960.
[H3] E. J. Hannan. Multiple Time Series. New York, Wiley, 1970.
[II] I. A. Ibragimov and Yu. V. Linnik. Independent and Stationary Sequences of
Random Variables. Walters-Noordhoff, Groningen, 1971.
[12] I. A. Ibragimov and Yu. A. Rozanov. Gaussian Random Processes. Springer-
Verlag, New York, 1978.
[13] A. Isihara. Statistical Physics. Academic Press, New York, 1971.
[Kl] Yu. M. Kabanov, R. Sh. Liptser, and A. N. Shiryaev. On the question of the
absolute continuity and singularity of probability measures. Math. USSR-Sb.
33(1977),203-221.
[K2] J. Kemeny and L. J. Snell. Finite Markov Chains. Van Nostrand, Princeton,
1960.
[K3] V. F. Kolchin, B. A. Sevastyanov, and V. P. Chistyakov. Random Allocations.
Halsted, New York, 1978.
[K4] A. N. Kolmogorov, Markov chains with countably many states. Byull.
Moskov. Univ. 1 (1937), 1-16 (in Russian).
[K5] A. N. Kolmogorov. Stationary sequences in Hilbert space. Byull. Moskov.
Univ. Mat. 2 (1941), 1-40 (in Russian).
[K6] A. N. Kolmogorov, The contribution of Russian science to the development
References
605
of probability theory. Uchen. Zap. Moskov. Univ. 1947, no. 91, 56ff. (in
Russian).
[K7] A. N. Kolmogorov, Probability theory (in Russian), in Mathematics: Its
Contents, Methods, and Value [Matematika, ee soderzhanie, metody i
znachenie]. Akad. Nauk SSSR, vol. 2,1956.
[K8] A. N. Kolmogorov. Foundations of the Theory of Probability. Chelsea, New
York, 1956; second edition [Osnovnye poniatiya teorii veroyatnostef].
"Nauka", Moscow, 1974.
[K9] A. N. Kolmogorov and S. V. Fomin. Elements of the Theory of Functions
and Functionals Analysis. Graylok, Rochester, 1957 (vol. 1), 1961 (vol. 2);
sixth edition [Elementy teorii funktsif i funktsional'nogo analiza]. "Nauka",
Moscow, 1989.
[K10] A. N. Kolmogorov and A. P. Yushkevich, editors. Mathematics of the
Nineteenth Century [Matematika XIX veka]. Nauka, Moscow, 1978.
[Kll] J. Kubilius. Probabilistic Methods in the Theory of Numbers. American
Mathematical Society, Providence, 1964.
[LI] J. Lamperti. Probability. Benjamin, New York, 1966.
[L2] J. Lamperti. Stochastic Processes. Springer-Verlag, New York, 1977.
[L3] E. Lenglart. Relation de domination entre deux processus. Ann. Inst. H.
Poincare. Sect. B (N.S.) 13 (1977), 171-179.
[L4] • V. P. Leonov and A. N. Shiryaev. On a method of calculation of semi-
invariants. Theory Probab. Appl. 4 (1959), 319-329.
[L5] R. S. Liptser and A. N. Shiryaev. Statistics of Random Processes. Springer-
Verlag, New York, 1977.
[L6] R. Sh. Liptser and A. N. Shiryaev. A functional central limit theorem for
semimartingales. Theory Probab. Appl. 25 (1980), 667-688.
[L7] M. Loeve. Probability Theory. Springer-Verlag, New York, 1977-78.
[L8] E. Lukacs. Characteristic Functions. Hafner, New York, 1960.
[L9] E. Lukacs and R. G. Laha. Applications of Characteristic Functions. Hafner,
New York, 1964.
[Ml] D. E. Maistrov. Probability Theory: A Historical Sketch. Academic Press,
New York, 1974.
[M2] A. A. Markov. Calculus of Probabilities [Ischislenie veroyatnostef], 3rd ed. St
Petersburg, 1913.
[M3] L. D. Meshalkin. Collection of Problems on Probability Theory \Sbomik
zadach po teorii veroyatnostef]. Moscow University Press, 1963.
[M4] P.-A. Meyer, Martingales and stochastic integrals. I. Lecture Notes in
Mathematics, no. 284. Springer-Verlag, Berlin, 1972.
[M5] P.-A. Meyer and C. Dellacherie. Probabilities and potential. North-Holland
Mathematical Studies, no. 29. Hermann, Paris; North-Holland, Amsterdam,
1978.
[Nl] J. Neveu. Mathematical Foundations of the Calculus of Probability. Holden-
Day, San Francisco, 1965.
[N2] J. Neveu. Discrete Parameter Martingales. North-Holland, Amsterdam, 1975.
[N3] J. Neyman. First Course in Probability and Statistics. Holt, New York,
1950.
[N4] A. A. Novikov. On estimates and the asymptotic behavior of the probability
of nonintersection of moving boundaries by sums of independent random
variables. Math. USSR-Izv. 17 (1980), 129-145.
[PI] V. V. Petrov. Sums of Independent Random Variables. Springer-Verlag,
Berlin, 1975.
[P2] J. W. Pratt. On interchanging limits and integrals. Ann. Math. Stat. 31 (1960),
74-77.
606
References
[P3] Yu. V. Prohorov [Prokhorov]. Asymptotic behavior of the binomial
distribution. Uspekhi Mat. Nauk 8, no. 3(55) (1953), 135-142 (in Russian).
[P4] Yu. V. Prohorov. Convergence of random processes and limit theorems in
probability theory. Theory Probab. Appl. 1 (1956), 157-214.
[P5] Yu. V. Prokhorov and Yu. A. Rozanov. Probability theory. Springer-Veriag,
Berlin-New York, 1969; second edition [Teoriia veroiatnostei]. "Nauka",
Moscow, 1973.
[Rl] B. Ramachandran. Advanced Theory of Characteristic Functions. Statistical
Publishing Society, Calcutta, 1967.
[R2] A. Renyi. Probability Theory, North-Holland, Amsterdam, 1970.
[R3] V. I. Rotar. An extension of the Lindeberg-Feller theorem. Math. Notes 18
(1975), 660-663. ■
[R4] Yu. A. Rozanov. Stationary Random Processes. Holden-Day, San Francisco,
1967.
[SI] T. A. Sarymsakov. Foundations of the Theory of Markov Processes [Osnovy
teorii protsessov Markova'}. GITTL, Moscow, 1954.
[S2] V. V. Sazonov. Normal approximation: Some recent advances. Lecture Notes
in Mathematics, no. 879. Springer-Veriag, Berlin-New York, 1981.
[S3] B. A. Sevyastyanov [Sewastjanow]. Verzweigungsprozesse. Oldenbourg,
Munich-Vienna, 1975.
[S4] A. N. Shiryaev. Random Processes [Sluchainye processy\. Moscow State
University Press, 1972.
[S5] A. N Shiryaev. Probability, Statistics, Random Processes [Veroyatnost,
statistika, sluchainye protsessy], vols. I and II. Moscow State University
Press, 1973,1974.
[S6] A. N Shiryaev. Statistical Sequential Analysis [Statisticheskii posledovatelnyi
analiz']. Nauka, Moscow, 1976.
[S7] Ya. G. Sinai. Introduction to Ergodic Theory. Princeton Univ. Press,
Princeton, 1976.
[S8] S. H. Sirazhdinov. Limit Theorems for Stationary Markov Chains {Predelnye
teoremy dlya odnorodnyh tsepei Markova']. Akad. Nauk Uzbek. SSR,
Tashkent, 1955.
[S9] W. F. Stout. Almost Sure Convergence. Academic Press, New York, 1974.
[Tl] I. Todhunter. A History of the Mathematical Theory of Probability from the
Time of Pascal to that of Laplace. Macmillan, London, 1865.
[VI] E. Valkeila. A general Poisson approximation theorem. Stochastics 7 (1982),
159-171.
[V2] A. D. Venttsel. A Course in the Theory of Stochastic Processes. McGraw-Hill,
New York, 1981.
[Wl] B. L. van der Waerden. Mathematical Statistics. Springer-Veriag, Berlin-
New York, 1969.
[W2] A. Wald. Sequential Analysis. Wiley, New York, 1947.
[Yl] A. M. Yaglom and I. M. Yaglom. Probability and Information. Reidel,
Dordrecht, 1983.
[Zl] S. Zacks. The Theory of Statistical Inference. Wiley, New York, 1971.
[Z2] V. M. Zolotarev. A generalization of the Lindeberg-Feller theorem. Theory
Probab. Appl. 12 (1967), 606-618.
[Z3] V. M. Zolotarev. Theoremes limites pour les sommes de variables aleatoires
independantes qui ne sont pas infinitesimales. C.R. Acad. Sci. Paris. Ser. A-B
264 (1967), A799-A800.
[D5] D. Dacunha-Castelle and M. Duflo. Probabilites et statistiques. 1. Problemes
a temps fixe. 2. Problemes a temps mobile. Masson, Paris, 1982; Probability
and Statistics. Springer-Verlag, New York, 1986 (English translation).
References
607
[D6] R. M. Dudley. Distances of probability measures and random variables. Ann.
Math. Statist. 39 (1968), 1563-1572.
[G7] P. E. Greenwood and A. N. Shiryaev. Contiguity and the Statistical In-
variance Principle. Gordon and Breach, New York, 1985.
[Jl] J. Jacod and A. N. Shiryaev. Limit Theorems for Stochastic Processes.
Springer-Verlag, Berlin Heidelberg, 1987.
[K12] V. S. Korolyuk. Aide-memoire de theorie des probabilites et de statistique
mathematique. Mir, Moscow, 1983; second edition {Spravochnik po teorii
veroyatnostei i matematicheskoi statistike]. "Nauka", Moscow, 1985.
[L10] L. LeCam. Asymptotic Methods in Statistical Theory. Springer-Verlag, New
York, 1986.
[LI 1] F. Liese and I. Vajda. Convex Statistical Distances. Teubner, Leipzig, 1987.
[L12] R. Sh. Liptser and A. N. Shiryaev. Theory of Martingales. Kluwer, Dordrecht,
Boston, 1989.
[P6] V. V. Petrov. Limit Theorems for Sums of Independent Random Variables
{Predel'nye teoremy dlya summ nezavisimyh sluchainyh velichin}. Nauka,
Moscow, 1987.
[P7] D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, New York,
1984.
[P8] E. L. Presman. Approximation in variation of the distribution of a sum of
independent Bernoulli variables with a Poisson law. Theory of Probability
and Its Applications 30 (1985), no. 2,417-422
[R6] Yu. A. Rozanov. The Theory of Probability, Stochastic Processes, and
Mathematical Statistics [Teoriya veroyatnostei, sluchainye protsessy i matematiche-
skaya statistika~]. Nauka, Moscow, 1985.
[S10] B. A. Sevastyanov. A Course in the Theory of Probability and Mathematical
Statistics [Kurs teorii veroyatnostei i matematicheskoi statistiki}. Nauka,
Moscow, 1982.
[Sll] A. N. Shiryaev. Probability. Springer-Verlag, Berlin Heidelberg, 1984
(English translation); Wahrscheinlichkeit, 1988 (German translation).
[Z4] V. M. Zolotarev. Modern theory of summation of random variables [Sovre-
mennaya teoriya summirovaniya nezavisimyh sluchainyh velichin}. Nauka,
Moscow, 1986.
Index of Symbols
^. u> n> n i36>i3?
5, boundary 311
0, empty set 11,136
0 447
® 30,144
=, identity, or definition 151
~, asymptotic equality 20; or
equivalence 298
=>, o, implications 141,142
=>, also used for "convergence in
general" 310,311
= 342
=4 316
=^, finer than 13
U 137
± 265,524
a.s.^ a.e.^ d P LP ^ oco.
i 310*'""'
«,<< 524
<X, y> 483
{*„-} 515
(XI, closure of ,4 153,311
A, complement of A 11,136
[tx tn~\ combination, unordered
set 7
(tx t„) permutation, ordered
set 7
A + B union of disjoint sets 11,
136
AAB 43,136
{A„ i.o.} - lim sup A„ 137
a = min(fl, 0); a+ = max(a, 0) 107
a® = a-\a^0;0ta = 0 462
a a b = min(fl, 6); a v 6 = max(a, b)
484
i4® 307
j/ 13
ST, index set 317
^(R) = ^(K1) = ^ = ^ 143,144
@(Rn) 144,159
^otR") 146
^{Rm) 146,160; ^o(^°°) 147
^(Rr) 147,166
^(C), ^o(Q 150
@{D) 150
^[0,1] 154
@(R) ® • • • ® @{R) = @(Rn) 144
C 150
C = C[0, 00) 151
(C, ^(C)) 150
Clk 6
C+ 515
cov(f,>/) 41,234
Dt{D,@(D)) 150
0 12, 76,103,140,175
Ef 37,180,181,182
E«|D),E«|0) 78
E(£|9) 213,215,226
610
Index of Symbols
E(fM 81,221,238
Etflih tin) 81
6(^,...,%) 264
ess sup 261
(f,9),<f,9> 426
F*G 241
.^ 133,138
.^p 154
&l* 176
■^'*,^c»^4 139
H2 452
M*),tf„(*) 268,271
inf 44
J 163,425
/4,/(.4) 33
i~j 570
jxfdP 183
jofdP 183
(L-S)J,(R-S)J,(L)J,(R)J 183,
204, 205
L2 262
Lp 261
^ 264
2_ 267
lim, lim, lim sup, lim inf, lim t,
lim i 137,173
l.i.m. 253
(M)n 7
usr 140
N(A) 14,15
N(^) 13
N{Q) 5
jV(mto2) 234
0 235,265
p(o)) 13,17,20,110
^ 317
P 133
P(A) 10,134
P(/4|/)), P{A\@) 24,76,212
P(yl|^) 214
P(/4|f) 74,214
PxtPn 178
PcUO 310
P 115,275
P* 116
IIPtfll 115
\\p(*,y)\\ in
^ = {/V,oceSr} 317
<Q = (¢1,42,-..) 585
R 143
jR 144,173
jR" 144, 159
R 161,235,265
(jR, @) = (jR1 ; ^) = (jR, mR)) 143,
144,151
jR", @{Rn) 144,159
jR00,^*00) 146,162
RT, {RTt @(RT)) 147,166;
(jRT, ^(jRt), P) 247
Vf 41,234
WW 83
V(f|4?) 214
X*=max,SlI|X,| 492
11¾ 492
Z 177
Z 415
Z 436
Z(A),Z(A) 424
oc(<2>) 12
Sq 268
A 59,160, 237, 238,239, 264
6k£ 404,568
t^txiA) 132
^! x n2 198
f 32,170
<T 44
| 279
n=||A7ll 115
H 150
(n^r^,H,er^) 150
p«,ifl 41,234
<p,0 455
0 459
<D(x) 61,66
X,X2 156,243
Xb 174
Q 5
(Q,^, P) 18,29
(ft,iF, P) 138
Index
Also see the Historical and Bibliographical Notes (pp. 597-602), and the Index of Symbols.
Absolute continuity with respect to P 195,
524
Absolute moment 182
Absolutely continuous
distributions 155
functions 156
measures 155
probability measures 195,524
random variables 171
Absorbing 113,588
Accessible state 569
a.e. 185
Algebra
correspondence with decomposition 82
induced by a decomposition 12
of events 12,128
of sets 132,139
o- 133,139
smallest 140
tail 380
Almost everywhere 185
Almost invariant 407
Almost periodic 417
Almost surely 185
Amplitude 418
Aperiodic state 572
a posteriori probability 27
Appropriate sets, principle of 141
a priori probability 27
Arcsine law 102
Arithmetic properties 569
a.s. 185
Asymptotic algebra 380
Asymptotic properties 573
Asymptotically
infinitesimal 337
unbiased 443
Asymptotics 536
Atoms 12
Attraction of characteristic functions 298
Autoregression, autoregressive 419,421
Average time of return 574
Averages, moving 419,421,437
Axioms 138
Backward equation 117
Balance equation 420
Ballot theorem 107
Banach space 261
Barriers 588,590
Bartlett's estimator 444
Basis, orthonormal 267
Bayes's formula 26
Bayes's theorem 27,230
Bernoulli, James 2
distribution 155
law of large numbers 49
random variable 34,46
scheme 30,45,55,70
Bernstein, S. N. 4,307
inequality 55
612
polynomials 54
proof of Weierstrass' theorem 54
Berry-Esseen theorem 63,374
Bessel's inequality 264
Best estimator 42, 69. Also see Optimal
estimator
Beta distribution 156
Bilateral exponential distribution 156
Binary expansion 131,394
Binomial distribution 17,155
negative 155
Binomial random variable 34
Birkhoff, G. D. 404
Birthday problem 15
Bochner-Khinchin theorem viii, 287,4(
Borel, E. 4
algebra 139
function 170
rectangle 145
sets 143,147
space 229
zero-or-one law 380
Borel-Cantelli lemma 255
Borel-Cantelli-Levy theorem 518
Bose-Einstein 10
Boundary 536
Bounded variation 207
Branching process 115
Brownian motion 306
Buffon's needle 224
Bunyakovskii, V. Ya. 38,192
Burkholder's inequality 499
Canonical
decomposition 544
probability space 247
Cantelli, F. P. 255,388
Cantor, G.
diagonal process 319
function 157,158
Caratheodory's theorem 152
Carleman's test 296
Cauchy
distribution 156,344
inequality 38
sequence 253
Cauchy Bunyakovskii inequality 38,192
Cauchy criterion for
almost sure convergence 258
convergence in probability 259
convergence in mean-p 260
Central limit theorem 4,322,326,348
for dependent variables 541
nonclassical condition for 337
Certain event 11,136
Cesaro limit 582
Change of variable in integral 196
Chapman, D. G. 116,248, 566
Characteristic function of
distribution 4,274
random vector 275
set 33
Charlier, C V. L. 269
Chebyshev, P. L. 3, 321
inequality 47,55,192
Chi, chi-squared distributions 156
Class
C+ 515
convergence-determining 315
determining 315
monotonic 140
of states 570
Classical
method 15
models 17
Classification of states 569,573
Closed linear manifold 267
Coin tossing 1,5,17, 33, 83,131
Coincidence problem 15
Collectively independent 36
Combinations
Communicating states 570
Compact
relatively 317
sequentially 319
Compensator 482
Complement 11,136
Complete
function space 260
probability measure 154
probability space 154
Completely nondeterministic 447
Conditional
distribution 227
probability 23,77,221
regular 227
with respect to a
decomposition 77,212
tr-algebra 212,214
random variable 77,214
variance 83,214
Wiener process 307
Conditional expectation
in the wide sense 264,274
with respect to
decomposition 78
event 220
Index
613
set of variables 81
tr-algebra 214
Conditionally Gaussian 466
Confidence interval 74
Consistency property 163,246
Consistent estimator 71, 521, 535
Construction of a process 245,246
Continuity theorem 322
Continuous at 0 153,164
Continuous from above or below 134
Continuous time 177
Convergence-determining class 315
Convergence of
martingales and submartingales 508,
515
probability measures Chap. Ill, 308
random variables: equivalences 252
sequences. See Convergence of sequences
series 384
Convergence of sequences
almost everywhere 252,353
almost sure 252,353
at points of continuity 253
dominated 187
in distribution 252, 325,353
in general 310
in mean 252
in mean of order p 252
in mean square 252
in measure 252
in probability 252,348
monotone 186
weak 309,311
with probability 252
Convolution 241,377
Coordinate method 247
Correlation
coefficient 41,234
function 416
maximal 244
Counting measure 233
Covariance 41,232,293
function 306,416
matrix 235
Cramer condition 400
Cramer-Lundberg model 559
Cramer transform 401
Cramer-Wold method 549
Cumulant 290
Curve of regression 238
Curvilinear boundary 536
Cyclic property 571
Cyclic subclasses 571
Cylinder set 146
Davis's inequality 499
Decomposition
canonical 544
countable 140
Doob 482
Krickeberg 507
Lebesgue 525
of martingale 507
offi 12,140
of probability measure 525
of random sequence 447
of set 12,292
ofsubmartingale 482
trivial 80
Degenerate
distribution 298
distribution function 298
random variable 298
Delta function 298
Delta, Kronecker 268
DeMoivre,A. 2,49
De Moivre-Laplace limit theorem 62
Density
Gaussian 66,156,161,238
n-dimensional 161
of distribution 156
of measure with respect to a measure 196
of random variable 171
Dependent random variables 103
central limit theorem for 541
Derivative, Radon-Nikodym 196
Detection of signal 462
Determining class 315
Deterministic 447
regularity 1
Dichotomy
Hajek-Feldman 533
Kakutani 529
Difference of sets 11,136
Direct product 31,144,151
Dirichlet's function 211
Discrete
measure 155
random variable 171
time 177
uniform density 155
Discrete version of Ito's formula 554
Disjoint 136
Dispersion 41
Distance in variation 355, 376
Distribution. Also see Distribution function,
Probability distribution
Bernoulli 155
beta 156
614
Index
binomial 17,18,155
Cauchy 156,344
chi, chi-squared 243
conditional 212
discrete (list) 155
discrete uniform 155
entropy of 51
ergodic 118
exponential 156
gamma 156
Gaussian 66,156,161,293
geometric 155
hypergeometric 21
infinitely divisible 341
initial 112,565
invariant 120
limit 545
lognormal 240
multidimensional 160
multinomial 20
negative binomial 155
normal 66,156
of process 178
of sum 36
Poisson 64,155
polynomial 20
probability 33
stable 341
stationary 120
Student's 155,244
t- 155,244
two-sided exponential 155
uniform 156
with density (list) 156
Distribution function 34,35,152,171
absolutely continuous 156
degenerate 288
discrete 155
finite-dimensional 246
generalized 158
n-dimensional 160
of functions of random variables 36,
239ff.
of sum 36,241
Distribution of objects in cells 8
^-measurable 76
Dominated convergence 187
Dominated sequence 496
Doob, J. L. 482,485,492
Doubling stakes 89,481
Doubly stochastic 587
d-system 142
Duration of random walk 90
Dvoretzky's inequality 508
Efficient estimator 71
Eigenvalue 130
Electric circuit 32
Elementary
events 5,136
probability theory Chap. I
stochastic measure 424
Empty set 11,136
Entropy 51
Equivalent measures 524
Ergodic
sequence 407
theorems 110,409,413
maximal 410
mean-square 438
theory 409
transformation 408
Ergodicity 118,409,581
Errors
laws of 298
mean-square 43
of observation 2,3
Esseen's inequality 296
Essential state 569
Essential supremum 261
Estimation 70,237,440,454
Estimator 42, 70,237
Bartlett's 444
best 42,69
consistent 71
efficient 71
for parameters 472,520, 535
linear 43
of spectral quantities 442
optimal 70,237,303,454,461,463,469
Parzen's 445
unbiased 71,440
Zhurbenko's 445
Events 5,10,136
certain 11,136
elementary 5
impossible 11,136
independent 28,29
mutually exclusive 136
symmetric 382
Existence of limits and stationary
distributions 582ff.
Expectation 37,182
inequalities for 192,193
of function 55
of maximum 45
of random variable with respect to
decomposition 76
set of random variables 81
Index
615
tr-algebra 212
of sum 38
Expected value 37
Exponential distribution 156
Exponential random variable 156,244,
245
Extended random variable 178
Extension of a measure 150,163,249,427
Extrapolation 453
Fair game 480
Fatou's lemma 187,211
Favorable game 89,480
F-distribution 156
Feldman, J. 533
Feller, W. 597
Fermat, P. de 2
Fermi-Dirac 10
Filter 434,464
physically realizable 451
Filtering 453,464
Finer decomposition 80
Finite second moment 262
Finite-dimensional distribution function
246
Finitely additive 132,424
First
arrival 129,574
exist 123
return 94,129, 574
Fisher's information 72
^"-measurable 170
Forward equation 117
Foundations Chap. II, 131
Fourier transform 276
Frequencies 418
Frequency 46
Frequency characteristic 434
Fubini's theorem 198
Fundamental inequalities (for martingales)
492
Fundamental sequence 253,258
Gamma distribution 156,343
Garcia, A. viii, 410
Gauss, C F. 3
Gaussian
density 66,156,161,236
distribution 66,156,161,293
measure 268
random variables 234,243,298
random vector 299
sequence 306,413,439,441,466
systems 297,305
Gauss Markov process 307
Generalized
Bayes theorem 231
distribution function 158
martingale 476
submartingale 476,523
Geometric distribution 155
Gnedenko, B. V. vii, 510, 542
Gram determinant 265
Gram-Schmidt process 266
Haar functions 271,482
Hajek-Feldman dichotomy 533
Hardy class 452
Harmonics 418
Hartman, P. 372
Hellinger integral 3
Helly's theorem 319
Herglotz,G. 421
Hermite polynomials 268
Hewitt, E. 382
Hilbert space 262
complex 275,416
separable 267
unitary 416
Hinchin. See Khinchin
History 597-602
Holder inequality 193
Huygens, C. 2
Hydrology 420,421
Hypergeometric distribution 21
Hypotheses 27
Impossible event 11,136
Impulse response 434
Increasing sequence 137
Increments
independent 306
uncorrected 109,306
Indecomposable 580
Independence 27
linear 265,286
Independent
algebras 28,29
events 28,29
functions 179
increments 306
random variables 36,77,81,179,380,
513
Indicator 33,43
616
Index
Inequalities
Berry-Esseen 333
Bernstein 55
Bessel 264
Burkholder 499
Cauchy-Bunyakovskii 38,192
Cauchy-Schwarz 38
Chebyshev 3,321
Davis 499
Dvoretzky 508
Holder 193
Jensen 192,233
Khinchin 347,498
Kolmogorov 496
Levy 400
Lyapunov 193
Marcinkiewicz-Zygmund 498
Markov 598
martingale 492
Minkowski 194
nonuniform 376
Ottaviani 507
Rao-Cramer 73
Schwarz 38
two-dimensional Chebyshev 55
Inessential state 569
Infinitely divisible 341
Infinitely many outcomes 131
Information 72
Initial distribution 112,565
Innovation sequence 448
Insurance 558
Integral
Lebesgue 180
Lebesgue-Stieltjes 183
Riemann 183,205
Riemann-Stieltjes 205
stochastic 423
Integral equation 208
Integral theorem 62
Integration by
parts 206
substitution 211
Intensity 418
Interpolation 453
Intersection 11,136
Introducing probability measures 151
Invariant set 407,413
Inversion formulas 283,295
i.o. 137
Ionescu Tulcea, C. T. vii, 249
Ising model 23
Isometry 430
Iterated logarithm 395
Ito's formula for
Brownian motion 558
Jensen's inequality 192,233
Kakutani dichotomy 527,528
Kakutani-Hellinger distance 363
Kalman-Bucy filter 464
Khinchin, A. Ya 287,468
Kolmogorov, A. N. vu, 3,4, 384,395,498,
542
axioms 131
inequality 384
Kolmogorov-Chapman equation 116,248,
566
Kolmogorov's theorems
convergence of series 384
existence of process 246
extension of measures 167
iterated logarithm 395
stationary sequences 453,455
strong law of large numbers 366, 389,391
three-series theorem 387
two-series theorem 386
zero-or-one law 381
Krickeberg's decomposition 507
Kronecker, L. 390
delta 268
Kullback information 368
A (condition) 338
Laplace, P. S. 2,55
Large deviation 69,402
Law of large numbers 45,49,325
for Markov chains 122
for square-integrable martingales 519
Poisson's 599
strong 388
Law of the iterated logarithm 395
Least squares 3
Lebesgue, H.
decomposition 366,525
derivative 366
dominated convergence theorem 187
integral 180,181
change of variable in 196
measure 154,159
Lebesgue-Stieltjes integral 197
Lebesgue-Stieltjes measure 158,205
LeCam, L. 377
Levy, P.
Index
617
convergence theorem 510
distance 316
inequality 400
Levy-Khinchin representation 347
Levy-Khinchin theorem 344
Levy-Prokhorov metric 349
Likelihood ratio 110
lim inf, lim sup 137
Limit theorems 55
Limits under
expectation signs 180
integral signs 180
Lindeberg condition 328
Lindeberg-Feller theorem 334
Linear manifold 264
Linearly independent 265
Liouville's theorem 406
Lipschitz condition 512
Local limit theorem 55, 56
Local martingale, submartingale 477
Locally absolutely continuous 524
Locally bounded variation 206
Lognormal 240
Lottery 15,22
Lower function 396
L2-theory Chap. VI, 415
Lyapunov, A. M. 3,322
condition 332
inequality 193
Macmillan's theorem 59
Marcinkiewicz's theorem 288
Marcinkiewicz-Zygmund inequality 498
Markov, A. A. viii, 3,321
depencence 564
process 248
property 112,127,564
time 476
Markov chains 110, Chap. VIII, 251, 564
classification of 569, 573
discrete 565
examples 113,587
finite 565
homogeneous 113,5 65
Martingale 103, Chap. VII, 474
convergence of 508
generalized 476
inequalities for 492
in gambling 480
local 477
oscillations of 503
reversed 484
sets of convergence for 515
square-integrable 482,493, 538
uniformly integrable 512
Martingale-difference 481, 543, 559
Martingale transform 478
Mathematical expectation 37,76. Also see
Expectation
Mathematical foundations Chap. II, 131
Matrix
covariance 235
doubly stochastic 587
of transition probabilities 112
orthogonal 235,265
stochastic 113
transition 112
Maximal
correlation coefficient 244
ergodic theorem 410
Maxwell-Boltzmann 10
Mean
duration 90,489
-square 42
ergodic theorem 438
value 37
vector 301
Measurable
function 170
random variable 80
set 154
spaces 133
{C,@{Q) 150
(D,@(D)) 150
(nfi„H^(0) 150
(R,@(R)) 151
(U80, 38(10°)) 162
(Rn, @(R")) 159
(RT,@(RT)) 166
transformation 404
Measure 133
absolutely continuous 155,524
complete 154
consistent 567
countably additive 133
counting 233
discrete 155
elementary 424
extending a 152,249,427
finite 132
finitely additive 132
Gaussian 268
Lebesgue 154
Lebesgue-Stieltjes 158
orthogonal 366,425,524
probability 134
restriction of 165
618
Index
a-additive 133
(j-finite 133
signed 196
singular 158,366,524
stochastic 423
Wiener 169
Measure-preserving transformations 404ff.
Median 44
Method
of characteristic functions 321 if.
of moments 4,321
Metrically transitive 407
Minkowski inequality 194
Mises, R. von 4
Mixed model 421
Mixed moment 289
Mixing 409
Moivre. See De Moivre
Moment 182
absolute 182
and semi-invariant 290
method 4
mixed 289
problem 294ff.
Monotone convergence theorem 186
Monotonic class 140
Monte Carlo method 225,394
Moving averages 418,421
Multinomial distribution 20, 21
Multiplication formula 26
Mutual variation 483
Needle (Buffon) 224
Negative binomial 155
Noise 418,435
Nonclassical hypotheses 328,337
Nondeterministic 447
Nonlinear estimator 453
Nonnegative definite 235
Nonrecurrent state 574
Norm 260
Normal
correlation 303,307
density 66,161
distribution function 62, 66,156,161
number 394
Normally distributed 299
Null state 574
Occurrence of event 136
Optimal estimator 71,237,303,454,461,
463,469
Optional stopping 601
Ordered sample 6
Orthogonal
decomposition 265
increments 428
matrix 235,265
random variables 263
stochastic measures 423,425
system 263
Orthogonalization 266
Orthonormal 263,267,271
Oscillations of
submartingales 503
water level 420
Ottaviani's inequality 507
Outcome 5,136
Pairwise independence 29
P-almost surely, almost everywhere 185
Parallelogram property 274
Parseval's equation 268
Partially observed sequences 460ff.
Parzen's estimator 445
Pascal, B. 2
Path 48,85,95
Pauli exclusion principle 10
Period of Markov chain 571
Periodogram 443
Permutation 7,382
Perpendicular 265
Phase space 112,565
Physically realizable 434,451
w 225
Poincare recurrence principle 406
Poisson, D. 3
distribution 64,155
law of large numbers 599
limit theorem 64, 327
Poisson-Charlier polynomials 269
Polya's theorems
characteristic functions 287
random walk 595
Polynomials
Bernstein 54
Hermite 268
Poisson-Charlier 269
Positive semi-definite 287
Positive state 574
Pratt's lemma 211,599
Predictable sequence 446,474
Predictable quadratic variation characteristic
483
Preservation of martingale property 484
Index
619
Principle of appropriate sets 141
Probabilistic model 5,14,131
in the extended sense 133
Probability 2,134
a posteriori, a priori 27
classical IS
conditional 23, 76,214
finitely additive 132
measure 131,151
multiplication 26
of first arrival or return 574
of mean duration 90
of ruin 83
of success 70
total 25,77,79
transition 566
Probability distribution 33,170,178
discrete 155
lognormal 240
stationary 569
table 155,156
Probability of ruin in insurance 558
Probability measure 134,154,524
absolutely continuous 524
complete 154
Probability space 14,138
canonical 247
complete 154
universal 252
Probability of error 361
Problems on
arrangements 8
coincidence 15
ruin 88
Process
branching 115
Brownian motion 306
construction of 245ff.
Gaussian 306
Gauss-Markov 307
Markov 248
stochastic 4,177
Wiener 306,307
with independent increments 306
Prohorov, Yu. V. vii, 64,318
Projection 265,273
Pseudoinverse 307
Pseudotransform 462
Purely nondeterministic 447
Pythagorean property 274
Quadratic characteristic 483
Quadratic covariation 483
Quadratic variation 483
Queueing theory 114
Rademacher system 271
Radon-Nikodym
derivative 196
theorem 196,599
Random
elements 176ff.
function 177
process 177,306
with orthogonal increments 428
sequences 4, Chap. V, 404
existence of 246,249
orthogonal 447
Random variables 32ff., 166,234ff.
absolutely continuous 171
almost invariant 407
complex 177
continuous 171
degenerate 298
discrete 171
exponential 156,244,245
£-valued 177
extended 173
Gaussian 234,243,298
invariant 407
normally distributed 234
simple 170
uncorrelated 234
Random vectors 35,177
Gaussian 299,301
Random walk 18, 83,381
in two and three dimensions 592
simple 587
symmetric 94,381
with curvilinear boundary 536
Rao-Cramer inequality 72
Rapidity of convergence 373,376,400,402
Realization of a process 178
Recurrent state 574, 593
Reflecting barrier 592
Reflection principle 94,96
Regression 238
Regular
conditional distribution 227
conditional probability 226
stationary sequence 447
Relatively compact 317
Reliability 74
Restriction of a measure 165
Reversed martingale 105,403,484
Reversed sequence 130
620
Index
Riemann integral 204
Riemann-Stieltjes integral 204
Ruin 84,87,489
Sample points, space 5
Sampling
with replacement 6
without replacement 7,21, 23
Savage, L. J. 389
Scalar product 263
Schwarz inequality 38. Also see
Bunyakovskii, Cauchy
Semicontinuous 313
Semi-definite 287
Semi-invariant 290
Semi-norm 260
Separable 267
Sequences
almost periodic 417
moving average 418
of independent random variables 379
partially observed 460
predictable 446,474
random 176,404
regular 447
singular 447
stationary (strict sense) 404
stationary (wide sense) 416
stochastic 474,483
Sequential compactness 318
Series of random variables 384
Sets of convergence SIS
Shifting operators 568
tr-additive 134
Sigma algebra 133,138
asymptotic 380
generated by £ 174
tail, terminal 380
Signal, detection of 462
Significance level 74
Simple
moments 291
random variable 32
random walk 587
semi-invariants 291
Singular measure 158
Singular sequence 447
Singularity of distributions 524
Skorohod, A. V. 150
Slowly varying 537
Spectral
characteristic 434
density 418
rational 437,456
function 422
measure 422
representation of
covariance function 415
sequences 429
window 444
Spectrum 418
Square-integrable martingale 482,493, 518,
538
Stable 344
Standard deviation 41,234
State space 112
States, classification of 234,569, 573
Stationary
distribution 120, 569, 580
Markov chain 110
sequence Chap. V, 404; Chap. VI, 415
Statistical estimation
regularity 440
Statistically independent 28
Statistics 4,50
Stieltjes, T. J. 183,204
Stirling's formula 20,22
Stochastic
exponential 504
integral 423,426
matrix 113,587
measure 403,424
extension of 427
orthogonal 425
with orthogonal values 425,426
process 4,177
sequence 474,564
Stochastically independent 42
Stopped process 477
Stopping time 84,105,476
Strong law of large numbers 388,389,501,
515
Strong Markov property 127
Structure function 425
Student distribution 156,244
Submartingales 475
convergence of 508
generalized 476, 515
local 477
nonnegative 509
nonpositive 509
sets of convergence of 515
uniformly integrable 510
Substitution, integration by 211
Sum of
dependent random variables 591
events 11,137
Index
621
exponential random variables 245
Gaussian random variables 243
independent random variables 328
IV, 379
Poisson random variables 244
sets 11,136
Summation by parts 390
Supermartingale 475
Symmetric difference A 43,136
Symmetric events 382
Szego-Kolmogorov formula 464
Tables
continuous densities 156
discrete densities 155
terms in set theory and probability
137
Tail 49,323,335
algebra 380
Taxi stand 114
t-distribution 156,244
Terminal algebra 380
Three-series theorem 387
Tight 318
Time
change (in martingale) 484
continuous 177
discrete 177
domain 177
Toeplitz, O. 390
Total probability 25, 77, 79
Trajectory 178
Transfer function 434
Transform 478
Transformation, measure-preserving 405
Transition
function 565
matrix 112
probabilities 112,248,566
Trial 30
Trivial algebra 12
Tulcea. See Ionescu Tulcea
Two-dimensional Gaussian density 162
Two-series theorem 386
Typical
path 50,52
realization 50
Unbiased estimator 71,440
Uncorrected 42,234
increments 109
Unfavorable game 86, 89,480
Chap. Uniform distribution 155,156
Uniformly
asymptotically infinitesimal 337
continuous 328
integrable 188
Union 11,136,137
Uniqueness of
distribution function 282
solution of moment problem 295
Universal probability space 252
Unordered
samples 6
sets 166
Upper function 396
Variance 41
conditional 83
of sum 42
Variation quadratic 483
Vector
Gaussian 238
random 35,177,238,301
Wald's identities 107,488,489
Water level 421
Weak convergence 309
Weierstrass approximation theorem
for polynomials 54
for trigonometric polynomials 282
White noise 418,435
Wiener, N.
measure 169
process 306,307
Window, spectral 444
Wintner.A. 397
Wold's
expansion 446,450
method 549
Wolf,R. 225
Zero-or-one laws 354ff., 379, 512
Borel 380
for Gaussian sequences 533
Hewitt-Savage 382
Kolmogorov 381
Zhurbenko's estimator 445
Zygmund, A. 498
Graduate Texts in Mathematics
continuedfrom page ii
61 Whitehead. Elements of Homotopy
Theory.
62 Kargapolov/Merlzjakov. Fundamentals
of the Theory of Groups.
63 Bollobas. Graph Theory.
64 Edwards. Fourier Series. Vol. I 2nd ed.
65 Wells. Differential Analysis on Complex
Manifolds. 2nd ed.
66 Waterhouse. Introduction to Affine
Group Schemes.
67 Serre. Local Fields.
68 Weidmann. Linear Operators in Hilbert
69 Lang. Cyclotomic Fields II.
70 Massey. Singular Homology Theory.
71 Farkas/Kra. Riemann Surfaces. 2nd ed.
72 Stillwell. Classical Topology and
Combinatorial Group Theory. 2nd ed.
73 Hungerford. Algebra.
74 Davenport. Multiplicative Number
Theory. 2nd ed.
75 Hochscheld. Basic Theory of Algebraic
Groups and Lie Algebras.
76 Iitaka. Algebraic Geometry.
77 Hecke. Lectures on the Theory of
Algebraic Numbers.
78 Burris/Sankappanavar. A Course in
Universal Algebra.
79 Walters. An Introduction to Ergodic
Theory.
80 Robinson. A Course in the Theory of
Groups. 2nd ed.
81 Forster. Lectures on Riemann Surfaces.
82 Bott/Tu. Differential Forms in Algebraic
Topology.
83 Washington. Introduction to Cyclotomic
Fields. 2nd ed.
84 Ireland/Rosen. A Classical Introduction
to Modem Number Theory. 2nd ed.
85 Edwards. Fourier Series. Vol. II. 2nd ed.
86 van Lint. Introduction to Coding Theory.
2nd ed.
87 Brown. Cohomology of Groups.
88 Pierce. Associative Algebras.
89 Lang. Introduction to Algebraic and
Abelian Functions. 2nd ed.
90 Brondsted. An Introduction to Convex
Polytopes.
91 Beardon. On the Geometry of Discrete
Groups.
92 Deestel. Sequences and Series in Banach
93 Dubrovin/Fomenko/Novikov. Modem
Geometry—Methods and Applications.
Part I. 2nd ed.
94 Warner. Foundations of Differentiable
Manifolds and Lie Groups.
95 Shiryaev. Probability. 2nd ed.
96 Conway. A Course in Functional
Analysis. 2nd ed.
97 Koblitz. Introduction to Elliptic Curves
and Modular Forms. 2nd ed.
98 Brocker/Tom Deeck. Representations of
Compact Lie Groups.
99 Grove/Benson. Finite Reflection Groups.
2nd ed.
100 Berg/Christensen/Ressel. Harmonic
Analysis on Semigroups: Theory of
Positive Definite and Related Functions.
101 Edwards. Galois Theory.
102 Varadarajan. Lie Groups, Lie Algebras
and Their Representations.
103 Lang. Complex Analysis. 3rd ed.
104 Dubrovin/Fomenko/Novikov. Modern
Geometry—Methods and Applications.
Part II.
105 Lang. SLj(R).
106 Silverman. The Arithmetic of Elliptic
Curves.
107 Olver. Applications of Lie Groups to
Differential Equations. 2nd ed.
108 Range. Holomorphic Functions and
Integral Representations in Several
Complex Variables.
109 Lehto. Univalent Functions and
Teichmflller Spaces.
110 Lang. Algebraic Number Theory.
111 HusemOller. Elliptic Curves.
112 Lang. Elliptic Functions.
113 Karatzas/Shreve. Brownian Motion and
Stochastic Calculus. 2nd ed.
114 Koblitz. A Course in Number Theory
and Cryptography. 2nd ed.
115 Berger/Gostiaux. Differential
Geometry: Manifolds, Curves, and
Surfaces.
116 Kelley/Srintvasan. Measure and
Integral. Vol. I.
117 Serre. Algebraic Groups and Class
Fields.
118 Pedersen. Analysis Now.
119 Rotman. An Introduction to Algebraic
Topology.
120 Zeemer. Weakly Differentiable Functions:
Sobolev Spaces and Functions of
Bounded Variation.
121 Lang. Cyclotomic Fields I and II.
Combined 2nd ed.
122 Rbmmert. Theory of Complex Functions.
Readings in Mathematics
123 Ebbinghaus/Hermes et al. Numbers.
Readings in Mathematics
124 Dubrovin/Fomenko/Novikov. Modern
Geometry—Methods and Applications.
Part ID.
125 Berenstein/Gay. Complex Variables: An
Introduction.
126 Borel. Linear Algebraic Groups. 2nd ed.
127 Massey. A Basic Course in Algebraic
Topology.
128 Rauch. Partial Differential Equations.
129 Fulton/Harris. Representation Theory:
A First Course.
Readings in Mathematics
130 Dodson/Poston. Tensor Geometry.
131 LAM. A First Course in Noncommutative
Rings.
132 Beardon. Iteration of Rational Functions.
133 Harris. Algebraic Geometry: A First
Course.
134 Roman. Coding and Information Theory.
135 Roman. Advanced Linear Algebra.
136 Adkins/Weintraub. Algebra: An
Approach via Module Theory.
137 Axler/Bourdon/Ramey. Harmonic
Function Theory.
138 Cohen. A Course in Computational
Algebraic Number Theory.
139 Bredon. Topology and Geometry.
140 AUBIN. Optima and Equilibria. An
Introduction to Nonlinear Analysis.
141 Becker/Weispfenning/Kredel. Grobner
Bases. A Computational Approach to
Commutative Algebra.
142 Lang. Real and Functional Analysis.
3rded.
143 DOOB. Measure Theory.
144 DENNIS/FaRB. Noncommutative
Algebra.
145 VICK. Homology Theory. An
Introduction to Algebraic Topology.
2nd ed.
146 Bridges. Computability: A
Mathematical Sketchbook.
147 ROSENBERG. Algebraic AT-Theory
and Its Applications.
148 ROTMAN. An Introduction to the
Theory of Groups. 4th ed.
149 RATCLIFFE. Foundations of
Hyperbolic Manifolds.
150 ElSENBUD. Commutative Algebra
with a View Toward Algebraic
Geometry.
151 SILVERMAN. Advanced Topics in
the Arithmetic of Elliptic Curves.
152 ZlEGLER. Lectures on Polytopes.
153 FULTON. Algebraic Topology: A
First Course.
154 Brown/Pearcy. An Introduction to
Analysis.
155 Kassel. Quantum Groups.
156 Kechris. Classical Descriptive Set
Theory.
157 Malliavin. Integration and
Probability.
158 ROMAN. Field Theory.
159 CONWAY. Functions of One
Complex Variable II.
160 LANG. Differential and Riemannian
Manifolds.
161 Borwein/Erdelyi. Polynomials and
Polynomial Inequalities.
162 ALPERIN/BELL. Groups and
Representations.
163 DIXON/MORTIMER. Permutation
Groups.
164 NATHANSON. Additive Number Theory:
The Classical Bases.
165 NATHANSON. Additive Number Theory:
Inverse Problems and the Geometry of
Sumsets.
166 SHARPE. Differential Geometry: Cartan's
Generalization of Klein's Erlangen
Program.
167 MORANDI. Field and Galois Theory.
168 EWALD. Combinatorial Convexity and
Algebraic Geometry.
169 BHATIA. Matrix Analysis.
170 Bredon. Sheaf Theory. 2nd ed.
171 Petersen. Riemannian Geometry.
172 REMMERT. Classical Topics in Complex
Function Theory.
173 DIESTEL. Graph Theory.
174 BRIDGES. Foundations of Real and
Abstract Analysis.
175 LICKORISH. An Introduction to Knot
Theory.
176 LEE. Riemannian Manifolds.
177 NEWMAN. Analytic Number Theory.
178 Clarke/Ledyaev/Stern/Wolenski.
Nonsmooth Analysis and Control
Theory.
179 DOUGLAS. Banach Algebra Techniques in
Operator Theory. 2nd ed.
180 SRTVASTAVA. A Course on Borel Sets.
181 KRESS. Numerical Analysis.
182 WALTER. Ordinary Differential
Equations.
183 MEGGINSON. An Introduction to Banach
Space Theory.
184 BOLLOBAS. Modem Graph Theory.
185 Cox/Little/O'Shea. Using Algebraic
Geometry.
186 RAMAKRISHNAN/VALENZA. Fourier
Analysis on Number Fields.
This book contains a systematic treatment of probability from the
ground up. starting with intuitive ideas and gradually developing more
sophisticated subjects, such as random walks, martingales Markov
chains, ergodic theory, weak convergence of probability measures,
stationary stochastic processes, and the Kalman-Bucy filter Many
examples are discussed in detail, and there are a large number of
exercises. The book is accessible to advanced undergraduates and
can be used as a text for self-study.
This new edition contains substantial revisions and updated
references. The reader will find a deeper study of topics such as the
distance between probability measures, metrization of weak
convergence, and contiguity of probability measures. Proofs for a number
of some important results which were merely stated in the first
edition have been added. The author has included new material on the
probability of large deviations, on the central limit theorem for sums
of dependent random variables, and on a discrete version of Ito's
formula.
ISBN 0 367^<.5<.9 0
springeronline.com
s