Driving the Power of AIX
Performance Tuning on IBM Power Systems
Ken Milberg
MC Press Online, LP
Lewisville, TX 75077
Driving the Power of AIX : Performance Tuning on IBM Power Systems
Ken Milberg
Photography by Michele Huttler Silver, Michele Silver Photography
First Printing—October 2009
© 2009 Ken Milberg. All rights reserved.
Portions © MC Press Online, LP
Every attempt has been made to provide correct information. However, the publisher and the author do not
guarantee the accuracy of the book and do not assume responsibility for information included in or omitted
from it.
IBM is a registered trademark of International Business Machines Corporation in the United States, other
countries, or both. AIX, POWER and POWER6 are registered trademarks of International Business Machines
Corporation in the United States, other countries, or both. All other product names are trademarked or copyrighted by their respective manufacturers.
Printed in Canada. All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission
in any form or by any means, electronic, mechanical, photocopying, recording, or likewise.
MC Press offers excellent discounts on this book when ordered in quantity for bulk purchases or special
sales, which may include custom covers and content particular to your business, training goals, marketing
focus, and branding interest.
For information regarding permissions or special orders, please contact:
MC Press
Corporate Offices
125 N. Woodland Trail
Lewisville, TX 75077 USA
For information regarding sales and/or customer service, please contact:
MC Press
P.O. Box 4300
Big Sandy, TX 75755-4300 USA
ISBN: 978-158347-098-5
Acknowledgements
First and foremost, this book is dedicated to my children—Hadara, Ori, Rani
and Elana, whom I love and adore with all my heart and who have been a constant source of joy to me throughout their lives. Thank you Vera, for providing
me with these incredible children. Thank you Mom and Dad, for all the love
you have given me through the years. This book is dedicated to my parents’
family, all of whom perished during the Holocaust, except for my dear Aunt
Molly, who passed away several years ago and whom I still miss dearly.
The publication of this book could not have been possible without the support
and encouragement of many individuals throughout my career. I want to thank
David Brodt for giving me my first job in systems and keeping me around even
after I mistakenly destroyed his entire B90 Burroughs system (even though it
was a Burroughs VMS bug) along with all his backups during a failed operations activity. I stayed on and led their project, my first, to convert their legacy
system to Unix over 20 years ago—SCO Unix 3.2.2. I want to thank Terry
Every for giving me my first opportunity in NYC in the early 1990s as a Unix
Systems Manager, working on HP9000s and HP-UX. I learned so much from
him, less about systems (though he is technical), and more about people and
class.
I want to thank Mark Mulconry for giving me my first opportunity to manage a
large production IBM AIX environment and my homeboys at Empire BC/BS
(Greg Pastuzyn, Steven Goldman, Steven Gerasimovich, Amit Goel, Arkady
Getselis) as well as my homegal, Marilyn Walter. To Winston, an AIX system
administrator who worked for me at the World Trade Center. We’ll always remember you. You will never be forgotten!
I want to thank the folks at IBM, who at the turn of the century thought
enough of me to put me on their AIX performance team in Washington DC,
working for the US Census Bureau (which is perhaps where this whole train
started).
I want to thank Nicolete McFadden and Bharvi Parikh for their work helping
me through many IBM initiatives, including founding and leading the NY
Metro PowerAIX/Linux Users Group. And thanks go to Randy Default, the former President of COMMON, who made me a permanent Guest on their Board
of Directors representing AIX interests. I want to thank Bess Protacio and her
AIX team of Bradd Baldwin, Abid Khwaja, and Jonathan Mencher for the times
we had at Adecco migrating to AIX from that nameless Sun Unix operating system. I want to thank Dan Raju and Wahid Ullah for the great AIX fun we had
in Ann Arbor and Ed Braunstein for providing my first exposure to AIX in
1996, when I was a CIO (before my career started going downhill) and for the
great times we had at LAS.
I want to thank Brian Shorter, Mitch Diodato, Bruce Slaven, Jennifer Weems and
Tim Paramore at Arrow for giving me the confidence and tools to start my own
company, PowerTCO, an IBM Business Partner, and Raffi Princian for believing in me and leading our first assessment. Thanks also to the fine folks at Future
Tech (Bob Venero, Phil Preston, Karen Sinda, Mike Rosatto, Steven Vames, Bill
Daub, and Lynn Keegan) who showed me the ropes of working for a BP.
It must be said that I would not even have considered writing if not for the
folks at TechTarget who took a chance years ago on a neophyte writer. Thank
you TechTarget (in the early days it was Amy Kucharik and Jan Stafford) for
sticking by me and helping me launch my Ask The Expert Linux site as well as
my writing career. I still do quite a bit of work for searchdatacenter.techtarget.com
and searchenterpriselinux.com and love the assignments (thank you Matt
Stansberry and Leah Rosin). You can see my blog also at itknowledgeexchange,
another TechTarget offering.
I want to thank James Proescholdt, formerly of IBM Systems Magazine for giving me the opportunity to write for them and Rob McNelly, who runs their
AIXchange blog, who provided me with contact information that enabled me to
further my writing career with IBM. Thank you to Natalie Boike, my present
editor at IBM Systems Magazine for all the fun work. I am also very thankful to
Troy Mott at Backstop Media for being my editor/publisher on content through
IBM developerWorks and for helping advise me during the early conceptual
stages of my book.
I want to thank Susan Schreitmueller, IBM’s most renowned and well-known
performance expert, who reviewed my book and from whom I learned so much.
And Jaqui Lynch, among other performance gurus, from whom I also learned
so much through the years.
Finally the publication of this book could not have been possible but for the ungrudging efforts put in by the writer of the foreword of my book, IBM Distinguished Engineer Joefon Jann, and for Chris Gibson, IBM AIX guru and writer
who took the time out of his busy schedule to proofread the myriad mistakes in
my first drafts.
I want to thank Michele Huttler Silver, with Michele Silver Photography
(msilverphotograpy.com) for the incredible job she did with the breathtaking
photographs you will see interspersed throughout the book.
And thanks again to my publisher Merrikay Lee—for giving me the opportunity to write this book, for believing in me, for sponsoring our book signing,
book fair, and presentation seminar during the summer of 2009 in NYC and for
taking a chance on an IBM Power AIX book. Thanks also go to my copy editor,
Katie, for the stellar job. You are amazing!
I’ll add a special mention to my dear friends, Steven and Shelly, Mitch and
Candy, David and Laurie, who’ve always been there for me and my children,
through thick and thin.
Last, but definitely not least, thank you M—the love of my life, the one who
makes my heart sing and race, and the one person in my life who has never wavered in her belief in me. You’re my muse and inspiration to keep going (with
this book and through all life’s trials and tribulations), and one of the few folks
who think that I am more than an idiot savant. You are the one who has helped
keep things together for me, through good times and bad.
—Ken Milberg
September 2009
Contents
Foreword
Preface
SECTION I: INTRODUCTION
Chapter 1: Performance Tuning Methodology
Step 1. Establishing a Baseline
Step 2. Stress Testing and Monitoring
Step 3. Identifying the Bottleneck
Step 4. Tuning
Step 5. Repeat
Chapter 2: Introduction to AIX
Unix
AIX
AIX Market Share
Chapter 3: Introduction to POWER Architecture
POWER5
POWER6
Section I: Summary, Tips, and Quiz
Summary
Tips
QUIZ
Multiple Choice
True or False
Fill In the Blank(s)
SECTION II: CPU
Chapter 4: CPU: Introduction
Chapter 5: CPU: Monitoring
vmstat (Unix-generic)
sar (Unix-generic)
iostat (Unix-generic)
w (Unix-generic)
lparstat (AIX-specific)
mpstat (AIX-specific)
topas (AIX-specific)
nmon
Using nmon for Historical Analysis
ps (Unix-generic)
Tracing Tools
tprof
Timing Tools
time
timex
Chapter 6: CPU: Tuning
Process and Thread Management
nice
renice
ps
schedo
sched_R and sched_D
fixed_pri_global
timeslice
bindprocessor
smtctl
gprof
Section II: Summary, Tips, and Quiz
Summary
Tips
QUIZ
Multiple Choice
True or False
Fill in the Blank(s)
SECTION III: MEMORY
Chapter 7: Memory: Introduction
Virtual Memory Manager
Computational Memory
File Memory
Paging and Swapping
VMM Tuning Evolution
Chapter 8: Memory: Monitoring
vmstat (Unix-generic)
Virtual Memory Summary
sar (Unix-generic)
lsps (AIX-specific)
ps (Unix-generic)
svmon (AIX-specific)
Memory Leak
Chapter 9: Memory: Tuning
vmo
minperm, maxperm, maxclient, and lru_file_repage
minfree and maxfree
Page Space Allocation
How Much Paging Space?
Paging Space Tuning
Thrashing and Load Control
Memory Scanning and lrubucket
rmss
Section III: Summary, Tips, and Quiz
Summary
Tips
QUIZ
Multiple Choice
True or False
Fill in the Blank(s)
SECTION IV: DISK I/O
Chapter 10: Disk I/O: Introduction
Direct I/O
Concurrent I/O
Asynchronous I/O
Logical Volumes and Disk Placement: Intra- and Inter-Policy
Inter-Disk Policy
File Systems
Chapter 11: Disk I/O: Monitoring
sar
topas
Logical Volume Monitoring
AIX LVM Commands
filemon and fileplace
filemon
fileplace
Chapter 12: Disk I/O: Tuning
lvmo
ioo
JFS2 Tuning Options
Section IV: Summary, Tips, and Quiz
Summary
Tips
QUIZ
Multiple Choice
True or False
Fill in the Blank
SECTION V: NETWORK I/O
Chapter 13: Network I/O: Introduction
Network I/O Overview
NFS
Media Speed
Network Subsystem Memory Management
Virtual and Shared Ethernet
Chapter 14: Network I/O: Monitoring
netpmon
Monitoring NFS
nfsstat
nfs4cl
netpmon and NFS
Monitoring Network Packets
iptrace, ipreport, and ipfilter
tcpdump
Chapter 15: Network I/O: Tuning
Name Resolution
Maximum Transfer Unit
Tuning: Client
Tuning: Server
Section V: Summary, Tips, and Quiz
Summary
Tips
QUIZ
Multiple Choice
True or False
Fill in the Blank
SECTION VI: BONUS TOPICS
Chapter 16: AIX 6.1
Introduction
Memory
CPU
Disk I/O
JFS2
iSCSI
I/O Pacing
Asynchronous I/O
Network
NFS
Section VI: Chapter 16 Quiz
Multiple Choice
True or False
Fill in the Blank
Chapter 17: Tuning AIX for Oracle
Memory
CPU
Asynchronous I/O Servers
Concurrent I/O
Oracle Tools
Statspack
Oracle Enterprise Manager
Section VI: Chapter 17 Quiz
Multiple Choice
True or False
Fill in the Blank
Chapter 18: Linux on Power
Monitoring
Handy Linux Commands
Virtualization
Tuning
Section VI: Chapter 18 Quiz
Multiple Choice
True or False
Fill in the Blank(s)
Quiz Answers
Section I: Introduction
Section II: CPU
Section III: Memory
Section IV: Disk I/O
Section V: Network I/O
Section VI / Chapter 16: AIX 6.1
Section VI / Chapter 17: Tuning AIX for Oracle
Section VI / Chapter 18: Linux on Power
Foreword
As computers have become increasingly sophisticated, the task of tuning
the operating system to yield high performance for its applications while
providing optimal total cost of ownership (TCO) for the IT owners has
become increasingly complex. In the early days of computers, the OS typically ran only one application at a time, and most performance tuning was
targeted at minimizing the number of instructions required to run the application within the limited resources (CPU, memory, disk/tape, networking)
of a uniprocessor system. With advances in virtual memory, multitasking,
multicore, caches, faster networks, huge storage devices and databases,
and, in the past decade, the flourishing of virtualization technologies (e.g.,
LPARs, DLPARs, simultaneous multithreading, WPARs, virtual Ethernet,
virtual SCSI), the task of performance optimization has become far more
complex and has shifted to tuning the OS and balancing the hardware
resources across LPARs within a hardware box. Nonetheless, the tuning
goals remain the same: to yield high performance for applications while
providing optimal TCO for IT owners.
Ken Milberg, with his rich background in managing, operating, and writing
about Unix and Linux systems, has abstracted the essence of the complex
tuning process, which he clearly describes in Chapter 1. In fact, the tuning
methodology described therein is applicable to most OS types: establish a
baseline, stress test and monitor, identify the bottleneck, tune, and repeat.
The rest of the book highlights the important monitoring and tuning tools
for each major subcomponent of the AIX/POWER system. The progression of the topics is great, from the core to progressively further-away
components — from CPU to memory to disk to network, paralleling the
AIX tools schedo, vmo, ioo, no, and nfso.
The tips and quiz at the end of each section are a treat. Not only do they
give a summary review of the key items covered, but they also provide a
lot of fun and satisfaction, especially when you can verify whether you’ve
understood everything correctly by checking against the provided answers.
To sum up, this is a book that every AIX system administrator and systems
manager should read.
—Joefon Jann
Distinguished Engineer,
Research Lead in AIX and POWER Systems Software
IBM Thomas J. Watson Research Center, Yorktown Heights, New York
Preface
Why this book? Although a Google search may show a fair number of
books about AIX, including a couple about performance tuning, just about
all of them are at least a decade old. IBM provides a tremendous amount
of information through its portals and Redbooks, but it is not unusual for
administrators seeking to tune their boxes to examine dozens of Web sites
and Redbooks before finding the information they need. This book brings
it all together for you, and more. Further, I review best practices and provide tips and tricks that are not usually covered in the IBM literature. Last,
the book provides an impartial view (I don’t work for IBM) of systems
performance tuning based on the real-world experiences of a battle-scarred
systems administration veteran.
This book is intended for systems professionals who need to understand,
monitor, and control the factors that affect AIX performance on their IBM
POWER servers. It also includes bonus chapters on the recent innovations
of AIX 6.1, Linux on Power (LoP) performance, and running Oracle on
AIX.
This is an intermediate book about AIX performance analysis and systems
tuning. The material comes both from IBM sources and from real life,
based on my experiences as a Unix professional supporting production systems for more than 20 years (almost half of them on AIX), in many capacities and for a broad range of industries.
Because this book is not an introduction to Unix, prior knowledge of Unix
(and AIX in particular) is recommended, although I would not say it is a
prerequisite. The book covers tuning methodology, systems monitoring,
and performance tuning on all subsystems, including CPU, RAM, and
I/O (network and disk). As an introduction, I review time-tested tuning
and analysis methodology, steps that will assist you throughout the tuning
lifecycle.
The monitoring sections describe tools that let you gain an immediate
foothold on your system by taking quick-and-dirty snapshots of its health.
They also discuss tools that help you collect historical data for analyzing
trends and results. All the tools used in this book either are part of the
standard IBM AIX systems build or are open-source products written by folks
who work for IBM (e.g., nmon) and used widely in the field of battle.
—Ken Milberg
August 2009
Section I
Introduction
This section introduces the concept of performance tuning methodology
and discusses the AIX operating system and how it has evolved through
the years. We also explore the development of IBM’s POWER architecture
and how it has changed from its early stages to the POWER6.
Chapter 1: Performance Tuning Methodology
Performance tuning is a never-ending process, and an important concept
to understand is that it is not unusual to fix one bottleneck only to create another. That’s part of what makes our lives as AIX administrators so
indispensable! The following time-tested tuning and analysis methodology
will aid you throughout the tuning lifecycle:
1. Establish a baseline
2. Stress test and monitor
3. Identify the bottleneck
4. Tune
5. Repeat (starting with step 2)
Step 1. Establishing a Baseline
Well before you ever tune a system, it is imperative to establish a baseline.
The baseline is a snapshot of what the system looks like when you first put
it into production, while it is performing at acceptable enough levels to the
business for it to be deployed. The baseline should not only capture performance statistics but also document the actual configuration of the system (amount of memory, CPU, and disk). It’s important to document the
system configuration because otherwise you won’t be comparing apples
with apples when the time comes to compare the baseline with your current
configuration. This step is particularly relevant in our new partitioned
world, when you can dynamically add or subtract CPU resources at a
moment’s notice.
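As a rough illustration of why the baseline should record configuration alongside performance statistics, the following Python sketch stores both and reports any configuration fields that have changed since the baseline was taken. The class names, field names, and numbers are hypothetical and are not AIX tooling; in practice the values would come from whichever monitoring tools you choose.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SystemConfig:
    """Hardware resources assigned to the system when it was measured."""
    cpus: float        # processing units (may be fractional in a partition)
    memory_gb: int
    disks: int


@dataclass
class Baseline:
    """Configuration plus the performance statistics captured with it."""
    config: SystemConfig
    metrics: dict = field(default_factory=dict)


def config_differences(baseline, current):
    """Return {field: (baseline_value, current_value)} for each changed field."""
    changed = {}
    for name in ("cpus", "memory_gb", "disks"):
        old, new = getattr(baseline, name), getattr(current, name)
        if old != new:
            changed[name] = (old, new)
    return changed


# Hypothetical numbers, for illustration only.
baseline = Baseline(SystemConfig(cpus=2.0, memory_gb=16, disks=4),
                    metrics={"cpu_busy_pct": 35.0})
current = SystemConfig(cpus=4.0, memory_gb=16, disks=4)

diffs = config_differences(baseline.config, current)
if diffs:
    print("Configuration changed since baseline:", diffs)
```

If the comparison reports a difference, the current measurements and the baseline are no longer apples with apples, and the baseline should be retaken or the difference accounted for.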
To come up with a proper baseline, you must first identify the appropriate tools to use for monitoring. Some tools are more suited to immediate
gratification, while others are geared more toward historical trending and
analysis. Tools such as nmon and topas, which we’ll discuss in detail in
Chapter 5, can serve both purposes.
Once you’ve identified your monitoring tools, you need to gather your
statistics and performance measurements. This information helps you to
define what an acceptable level of performance is for a given system. You
need to know what a well-performing system looks like before you start
receiving calls complaining about performance. You should also work with
the appropriate application and functional teams to define exactly what a
well-behaved system is. At that time, you would translate that definition
into an acceptable service level agreement (SLA), on which the customer
would sign off.
Step 2. Stress Testing and Monitoring
This step is where you monitor the system at peak workloads and during
problem periods. Stressing your system, preferably in a controlled environment, can help you make the right diagnosis — an essential part of performance tuning. Is your bottleneck really a CPU bottleneck, or is it related
more to memory or I/O?
It’s also important not to fall too much in love with any one utility. I like to
use several monitoring tools here to help validate my findings. For example, I might use an interactive tool (e.g., vmstat) and then a data capturing
tool (nmon) to help me track data historically.
The monitoring step is critical because you cannot effectively tune anything without having an accurate historical record of what has been going
on in your system, particularly during periods of stress. Larger organizations that recognize the importance of this process even have their own
stress-testing teams, which work together with application and infrastructure teams to test new deployments before putting them into production.
It’s also essential here to establish performance policies for the system.
You can determine the measures that are relevant during monitoring, analyze them historically, and then examine them further during stress testing.
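One way to turn captured monitoring data into something you can compare across runs is to reduce it to a few summary figures. The Python sketch below assumes you have already extracted a list of CPU-busy percentages from a monitoring tool; the function name, the sample values, and the 80 percent threshold are illustrative assumptions, not recommended settings.

```python
from statistics import mean


def summarize(samples, threshold):
    """Summarize utilization samples (percent busy) from one monitoring run."""
    if not samples:
        raise ValueError("no samples captured")
    over = sum(1 for s in samples if s > threshold)
    return {
        "mean": mean(samples),
        "peak": max(samples),
        "pct_over_threshold": 100.0 * over / len(samples),
    }


# Hypothetical CPU-busy percentages sampled during a stress test.
cpu_busy = [40, 55, 90, 95, 60, 85, 98, 70]
report = summarize(cpu_busy, threshold=80)
print(report)
```

Keeping the same summary for every run makes it easier to line up a stress test against the baseline and against later runs.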
Step 3. Identifying the Bottleneck
The objective of stressing and monitoring the system is to determine the
bottleneck. Ask any doctor: you cannot provide the correct medicine (the
tuning) without the proper diagnosis. If the system is in fact CPU-bound,
you can run additional tools, such as curt, ps, splat, tprof, and trace (we’ll
discuss these utilities later), to further identify the actual processes that are
causing the bottleneck.
It’s possible that your system might in fact be memory- or I/O-bound and
not CPU-bound. Fixing one bottleneck, such as a memory problem, can
actually cause another, such as a CPU bottleneck, because in this case your
system is now letting the CPU perform to its optimum capacity. At one
point in time, it might not have had the capacity to handle the increased
amount of resources given to it. I’ve seen this situation quite often, and it
isn’t necessarily a bad thing. Quite the opposite: it ultimately helps you
isolate all your bottlenecks and tune the system to its max.
You’ll find that monitoring and tuning systems is quite a dynamic process
and not always predictable. That’s what makes performance tuning as challenging as it is.
Step 4. Tuning
Once you’ve identified the bottleneck, it’s time to tune it. For a CPU
bottleneck, that usually means one of four solutions:
● Balancing system workload: This solution involves running processes at different intervals to make more efficient use of the 24-hour day. More often than not, this is how we resolve CPU bottlenecks.
● Tuning the scheduler: Using nice or renice lets you assign different priorities to running processes to prevent CPU hogs.
● Tuning scheduler parameters: Adjust scheduler parameters to fine-tune priority formulas. For example, you can use the schedo command to change the amount of time the operating system lets a given process run before calling the dispatcher to choose another.
● Increasing resources: Add CPUs or, in a virtualized environment, reconfigure logical partitions (LPARs) to boost available resources. This solution might include uncapping partitions or adding more virtual processors to existing partitions. Virtualizing the partitioned environment appropriately can help increase physical resource utilization, decrease CPU bottlenecks on specific LPARs, and reduce the expense of idle capacity in LPARs that are not “breathing heavy.”
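To make the priority idea behind nice and renice concrete, here is a minimal Python sketch that calls the POSIX nice() interface through os.nice to lower the priority of the current process. It illustrates the general Unix mechanism rather than the AIX commands themselves, and the helper name and the increment of 5 are arbitrary choices for the example.

```python
import os


def lower_own_priority(increment):
    """Raise this process's nice value by `increment`; return the new value.

    A higher nice value means a lower scheduling priority. Unprivileged
    processes may only increase their nice value, never decrease it.
    """
    if increment < 0:
        raise ValueError("increment must be non-negative")
    return os.nice(increment)


before = os.nice(0)            # an increment of 0 just reports the current value
after = lower_own_priority(5)  # ask the scheduler to favor other processes
print(f"nice value: {before} -> {after}")
```

A CPU-hungry batch job started this way yields to interactive work without any change to the scheduler's own parameters.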
Step 5. Repeat
After tuning, you need to go through the process again, starting with step
2, stress testing and monitoring. Only by repeating your tests and consistently monitoring your systems can you determine whether your tuning
has made an impact. I know some administrators who simply tune certain
parameters based on best practices for a specific application and then move
on. That is the worst thing you can do. For one thing, what works in some
environments might not work in yours. More important, how do you really
know whether what you’ve tuned has helped the bottleneck unless you
look at the data?
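A simple way to look at the data is to compare the same measurements from identical test runs before and after a change. The Python sketch below computes the percentage change per metric; the metric names and values are invented for illustration.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("baseline value must be non-zero")
    return 100.0 * (after - before) / before


# Hypothetical measurements from two identical stress-test runs.
before_tuning = {"avg_response_ms": 420.0, "cpu_busy_pct": 96.0}
after_tuning = {"avg_response_ms": 315.0, "cpu_busy_pct": 81.0}

changes = {name: percent_change(before_tuning[name], after_tuning[name])
           for name in before_tuning}
for name, delta in changes.items():
    print(f"{name}: {delta:+.1f}%")
```

A negative change in response time or CPU utilization suggests the tuning helped; a change near zero suggests the bottleneck lies elsewhere.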
To reiterate, AIX performance tuning is a dynamic and iterative process,
and to achieve real success, you need to consistently monitor your systems,
which can only happen once you’ve established a baseline and SLA. The
bottom line is, if you can’t define the behavior of a system that runs well,
how will you define the behavior of a system that doesn’t?
Chapter 2: Introduction to AIX
AIX, which stands for Advanced Interactive eXecutive, is a POSIX-compliant and X/Open-certified Unix operating system introduced by
IBM in 1986. While AIX is based on UNIX System V, it has roots in the
Berkeley Software Distribution (BSD) version of Unix as well. Today, AIX
has an abundance of both flavors (you can go with chocolate one day and
vanilla the next), providing another reason for its popularity.
Unix
From its introduction in 1969 and development in the mid-1970s, Unix
has evolved into one of the most successful operating systems to date.
The roots of this operating system go as far back as the mid-1960s, when
AT&T’s Bell Labs partnered with General Electric and the Massachusetts
Institute of Technology (MIT) to develop a multi-user operating system
called Multics (which stood for Multiplexed Information and Computing
Service). Dennis Ritchie and Ken Thompson worked on this project until
AT&T withdrew from it. The two eventually created another operating
system in an effort to port a computer game that simulated space travel.
They did so on a Digital Equipment Corporation (DEC) PDP-7 computer,
and they named the new operating system Unics (for Uniplexed Information and Computing Service). Somewhere along the way, “Unics” evolved
into “Unix.”
AIX
AIX was the first operating system to introduce the idea of a journaling
file system, an advance that enabled fast boot times by avoiding the need
to perform file system checking (fsck) for disks on reboot. AIX also has
a strong, built-in Logical Volume Manager (LVM), introduced as early as
1990, which helps to partition and administer groups of disks.
Another important innovation was the introduction of shared libraries,
which avoided the need for an application to statically link to the libraries
it used. The resulting smaller binaries used less of the hardware RAM to
run and required less disk space for installation.
IBM ported AIX to its RS/6000 platform of products in 1990. The release
of AIX Version 3 coincided with the announcement of the first RS/6000
models. At the time, these systems were considered unique in that they not
only outperformed all other machines in integer compute performance but
also beat the competition by a factor of 10 in floating-point performance.
Version 4, introduced in 1994, added support for symmetric multiprocessing (SMP) with the first RS/6000 SMP servers. The operating system
evolved until 1999, when AIX 4.3.3 introduced workload management
(WLM). In May 2001, IBM unveiled AIX 5L (the L stands for “Linux affinity”), coinciding with the release of its POWER4 servers, which provided for the logical partitioning of servers. In October of the following year,
IBM announced dynamic logical partitioning (DLPAR) with AIX 5.2.
The latest update to AIX 5L, AIX 5.3 (introduced in August 2004), provided innovative new features for virtualization, security, reliability, systems
management, and administration. Most important, AIX 5.3 fully supported
the Advanced Power Virtualization (APV) capabilities of the POWER5
architecture, including micropartitioning, virtual I/O servers, and simultaneous multithreading (SMT). Arguably, this was the most important release
of AIX in more than a decade, and it remains the most popular (as of this
writing). That is why we’ll primarily focus on AIX 5.3 for the purposes of
this book.
IBM announced AIX 6-Beta in May 2007 and formally introduced AIX
6.1 in November 2007. Major innovations of AIX 6.1 include workload
partitions (WPARs), which are similar to Solaris containers, and Live
Application Mobility (not available with Solaris), which lets you move
the partitions without application down time. Chapter 16 discusses performance monitoring and tuning on AIX 6.1.
AIX Market Share
AIX celebrated its 20th anniversary in January 2006, and it appears to have
an extremely bright future in the Unix space. IBM’s AIX has been the only
Unix that increased its market share through the years, and IBM continues
to own the market space for Unix servers. Most of the Unix growth at this
time stems from IBM.
AIX has benefited from the many hardware innovations that the POWER
platform has introduced through the years, and it continues to do so. IBM
has also made good decisions around its Linux strategy. Linux, supported
natively on the POWER5, more or less complements, rather than competes
with, AIX on the POWER architecture.
Chapter 3: Introduction to POWER Architecture
The “POWER” in POWER architecture stands for Performance Optimization
With Enhanced RISC, and it is the processor architecture behind IBM’s midrange
Unix offering, AIX. POWER is a descendant of IBM’s 801 CPU and is a
second-generation Reduced Instruction Set Computer (RISC) processor. It
was introduced in 1990 to support Unix RS/6000 systems.
The POWER architecture incorporated many characteristics that were
already common in most RISC architectures. The instructions were fixed
in length (four bytes) and had consistent formats. What made the architecture unique among existing RISC architectures was that it was functionally
partitioned, separating the functions of program flow control, fixed-point
computation, and floating-point computation.
The objective of most RISC architectures was to be extremely simple
so that implementations would have an extremely short cycle time. This
approach would result in processors that could execute instructions at
the fastest possible clock rate. The designers of the POWER architecture
chose instead to minimize the total time spent to complete a task. This time
is the product of three components: path length (the number of instructions
executed), the number of cycles needed to complete an instruction, and cycle time.
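The relationship between those three components can be sketched as a simple calculation. In the Python example below, the two designs and all of their numbers are hypothetical; the point is only that a design with a longer cycle time can still finish a task sooner if it needs fewer instructions.

```python
def execution_time(path_length, cycles_per_instruction, cycle_time_ns):
    """Task time in nanoseconds: instructions x cycles/instruction x ns/cycle."""
    return path_length * cycles_per_instruction * cycle_time_ns


# Hypothetical designs running the same task.
simple = execution_time(path_length=1_000_000, cycles_per_instruction=1.0,
                        cycle_time_ns=20.0)
richer = execution_time(path_length=600_000, cycles_per_instruction=1.2,
                        cycle_time_ns=25.0)

print(f"simple design: {simple / 1e6:.1f} ms")
print(f"richer design: {richer / 1e6:.1f} ms")
```

With these made-up figures, the design with the slower clock and the higher cycle count per instruction still completes the task first, because its path length is much shorter.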
During the early 1990s, five different RISC architectures actively competed with one another. IBM partnered with Apple and Motorola to come up
with a common architecture that would meet the standards of an alliance
they would form. The first design was very simple, and all its instructions
were completed in one cycle. It lacked floating-point and parallel processing capability. The POWER architecture was a real attempt to correct this
flaw. It consisted of more than 100 instructions and was known as a complex RISC system.
The POWER1 chip consisted of 800,000 transistors per chip and was
functionally partitioned. It had separate floating-point registers and could
scale from low-end to the highest-end workstations. The first chip actually consisted of several chips on a single motherboard but was refined to
one RISC chip with more than a million transistors. Some of you may be
surprised to learn that a radiation-hardened version of this chip was used
as the CPU for the Mars Pathfinder mission.
The POWER2 chip was released in 1993 and was the standard-bearer for
nearly five years. It contained 15 million transistors per chip. It also added
a second floating-point unit (FPU) and extra cache. This chip was known
for powering the IBM Deep Blue supercomputer that would beat Garry
Kasparov at chess in 1997. (Joefon Jann, whose team developed this system, wrote the Foreword to this book.)
The POWER3 architecture was the first 64-bit symmetric multiprocessor.
Designed to work on both scientific and technical computer applications,
it included a data prefetch engine, dual floating-point execution units, and
a nonblocked interleaved data cache. It used copper interconnect, which
delivered double the performance for the same price.
The POWER4 (code-named Regatta) architecture, released in 2001,
featured 174 million transistors per processor. It incorporated micron
copper and silicon-based technology. Each processor had 64-bit, 1 GHz
PowerPC cores and could execute as many as 200 instructions simultaneously. POWER4 became the driving force behind the IBM Regatta servers, which supported logical partitioning through a new privileged processor state called the
POWER Hypervisor mode.
As wonderful as the Regattas were, if you purchased one shortly before the
POWER5 systems were released, you were not a happy camper.
POWER5
IBM’s POWER5 architecture, introduced in 2003, contained 276 million
transistors per processor. It was based on the 130 nm copper/silicon-on-insulator (SOI) process and featured chip multiprocessing, a larger cache,
a memory controller on the chip, simultaneous multithreading (SMT),
advanced power management, and improved Hypervisor technology. The
POWER5 was built to allow up to 256 logical partitions and was available
on IBM’s System i and System p servers. Each POWER5 core is designed
to support SMT and single-threaded modes. The software (the Hypervisor)
switches the processor from SMT to single-threaded mode.
Some of the objectives of the POWER5 were
● To maintain binary compatibility with older POWER4 systems
● To enhance and extend symmetric multiprocessing (SMP) scalability
● To improve performance and reliability
● To provide additional server flexibility
● To improve power efficiency
● To provide virtualization capabilities
As a result of its dual-core design and support for SMT, one POWER5 chip
appears as a four-way microprocessor to the operating system. Processors
using SMT can issue multiple instructions from different code paths during
a single cycle; that is, instructions from both hardware threads can be
issued in the same cycle.
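To make the four-way arithmetic concrete, here is a minimal Python sketch. The function name and its parameters are my own illustration, not an AIX interface; it simply multiplies cores by hardware threads per core.

```python
def logical_cpus(cores, threads_per_core):
    """Logical CPUs the operating system sees (hypothetical helper)."""
    if cores < 1 or threads_per_core < 1:
        raise ValueError("cores and threads_per_core must be positive")
    return cores * threads_per_core


# One dual-core POWER5 chip with SMT on (two hardware threads per core)
print(logical_cpus(2, 2))   # 4
# The same chip in single-threaded mode
print(logical_cpus(2, 1))   # 2
```

The same multiplication explains the lcpu figure that the monitoring commands report for a partition with SMT enabled.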
Figure 3.1 depicts the Hypervisor, without which there is no virtualization.
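[Figure 3.1 shows three partition stacks resting on the POWER Hypervisor, which in turn runs on the POWER 64-bit processor: programs on AIX 5L over Open Firmware and RTAS; programs on Linux over Open Firmware and RTAS; and programs on IBM i over TIMI and SLIC.]
Figure 3.1: Hypervisor architecture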
As you examine this architecture, you can see that the layers above the
POWER Hypervisor are similar, but the contents are characterized by the
operating system. The layers of code supporting AIX and Linux consist of
system firmware and Run-Time Abstraction Services (RTAS). Open Firmware and RTAS are both platform-specific firmware, and both are tailored
by the platform developer to manipulate the specific platform hardware.
In the POWER5 processor, IBM introduced further design enhancements
that enabled the sharing of processors by multiple partitions. The POWER
Hypervisor Decrementer (HDEC) is a new hardware facility in the POWER5 design that is programmed to provide the POWER Hypervisor with
a timed interrupt independent of partition activity. It was the POWER5
architecture, along with the extraordinary virtualization capabilities of
Advanced Power Virtualization (APV), that really paved the way for server
consolidation around IBM POWER systems. (IBM has since rebranded the
term Advanced Power Virtualization to PowerVM.)
POWER6
The POWER6, with approximately 790 million transistors, debuted in
June 2007. Its dual-core design enabled it to reach 4.7 GHz. Innovations
in energy and cooling let it retain the same power consumption as the
POWER5 while almost doubling performance.
The POWER6 has hardware support for decimal arithmetic. It also has the
first decimal floating-point unit integrated in silicon. Several important
APV enhancements were also released with the POWER6, including Live
Partition Mobility, Decimal Floating Point, and Dynamic Energy Management. It was around this time that IBM rebranded APV to PowerVM.
Section I
Summary, Tips, and Quiz
Summary
● The five-step performance tuning methodology is:
1. Establish a baseline
2. Stress test and monitor
3. Identify bottleneck
4. Tune
5. Repeat (starting with step 2)
● Unix was “invented” in 1969, the result of an effort by Dennis Ritchie
and Ken Thompson to port a computer game to a DEC PDP-7 following
their work with AT&T’s Bell Labs.
● AIX, which stands for Advanced Interactive eXecutive, was introduced
by IBM in 1986. It is the first version of Unix to provide a journaling
file system and to incorporate a Logical Volume Manager (LVM) in the
base operating system.
● IBM’s Power Optimization with Enhanced RISC (POWER) architecture
was introduced in 1990 to support RS/6000 systems.
● AIX 5L, introduced in May 2001, provided for the logical partitioning
of servers with the POWER4 architecture.
● AIX 5.3, released in 2004, would become the most important release
of AIX in more than a decade. It boasted support for Advanced Power
Virtualization (APV) and the new POWER5 architecture. IBM has since
rebranded the term Advanced Power Virtualization to PowerVM.
● AIX 6 and the POWER6 architecture were released in 2007 (POWER6
in June and AIX 6 in the fall). AIX 6 enhancements include
workload partitioning and Live Application Mobility. POWER6 innovations include Live Partition Mobility, Decimal Floating Point, and
Dynamic Energy Management.
Tips
● Do not, under any circumstances, introduce an application into
production without first implementing a proactive performance
monitoring strategy. Otherwise, you will never really know what
your subsystems (CPU, I/O, memory) should look like when the
system is performing well and its performance has been deemed acceptable to the business and/or application folks. The time to start monitoring your system is before you’ve been told that the system is slow,
not after.
● Use more than one monitoring tool so that you can use each to validate
the findings of the others.
● Create multiple environments for your application architecture, including development, test, and/or quality assurance.
● Establish a deployment and stress-testing strategy for how applications
are tested and deployed into production. These measures will help you
ensure the reliability and performance of your applications.
● Spend time analyzing your performance data. Remember, you can’t
prescribe the right medicine (tune) without a proper diagnosis (analysis
of historic data).
● Introduce one change at a time when tuning your systems. Otherwise,
how will you really know what the true effect of each change is?
● Use the virtualization capabilities of AIX 5.3 and APV (now
PowerVM). These innovations can help you save big money on total
cost of ownership and help drive a large return on investment for server
and data center consolidation projects.
● Don’t upgrade to AIX 6.1 simply because you’ve fallen in love with
the new technology. Remember that your production application might
not share that love. Create a 6.1 partition on your POWER server so
that you can start playing nicely in the sandbox. Note that POWER6
innovations such as Live Partition Mobility are fully supported on AIX
5.3 (Technology Level 7, or TL_7).
Quiz
Multiple Choice
1. AIX stands for
a. Advanced Interactive Unix
b. Advanced Interactive eXecutive
c. Advanced Unix
d. It’s just an acronym.
2. AIX was introduced in
a. 1969
b. 1986
c. 1990
d. 1994
3. Which is the first Unix that introduced journaling file systems?
a. Solaris
b. HP-UX
c. AIX
d. Linux
4. Advanced Power Virtualization was introduced with which combination?
a. AIX 5.3 and POWER5
b. AIX 5.2 and POWER5
c. AIX 5L and POWER4
d. AIX 6.1 and POWER5
5. DLPAR stands for
a. Logical partitioning
b. Advanced power virtualization
c. Dynamic logical partitioning
d. Nothing
True or False
6. Linux cannot run natively on the POWER architecture.
7. Performance monitoring and tuning is a never-ending process.
8. Fixing a bottleneck should not cause another bottleneck to occur.
9. Never make more than one tuning change at the same time.
Fill In the Blank(s)
10. Fill in the missing steps of the five-step tuning methodology described
in this book:
1. __________________
2. Stress test and monitor
3. __________________
4. __________________
5. __________________
Section II
CPU
This section provides an overview of CPU monitoring and tuning and
discusses best practices for CPU performance tuning, given the various
considerations that can impact performance.
Chapter 4
CPU: Introduction
Unlike other subsystems (e.g., memory, I/O), when it comes to CPU, there
is less to actually tune and more you can do on the back end (e.g., balancing systems workload) to ensure your systems are running smoothly. As a
Unix administrator, you need to understand which tools are best used for
which purpose. As far as monitoring is concerned, some tools are better
suited to quick-and-dirty system snapshots, while others are clearly more
effective for long-term trending and analysis. Choose the tool that best fits
the situation you’re faced with.
For example, if you’re experiencing a serious production problem, you
don’t have five days to perform long-term analysis — you may not even
have more than five minutes to come up with something. Nevertheless, you
still need to arrive at the right diagnosis to help determine the bottleneck.
Often, you’ll find that the bottleneck isn’t actually CPU but relates to
memory or I/O. Most users assume CPU is the problem and figure the
box needs more horsepower. However, CPU usually isn’t the culprit, and
throwing more iron at a problem is neither the quickest nor the most cost-effective way to solve the issue. Furthermore, trying to tune the CPU subsystem when virtual memory is the problem could be a real disaster. Before
you look for a way to tune, take the time to analyze the system properly.
I don’t mean to be condescending here. It’s just that sometimes we don’t
take the time to monitor and analyze. We rush to judgment because of the
pressure we’re under to solve problems and move on to the next issue or
production concern. This is one reason that, when first investigating any
performance bottleneck, I prefer to use tools that focus less on a specific
area but provide a better understanding of the big picture. The bottom line
is that you really want to make sure you have a CPU problem if that’s what
you’re trying to tune. More on this point later.
As an AIX administrator, you should already know some of the basic
tools of performance monitoring — commands such as vmstat and topas
— and you should be familiar with ways to identify processes that are
CPU hogs. What some people have a hard time understanding is that CPU
performance tuning isn’t about running some tuning commands but about
proactively monitoring systems, particularly when you’re not experiencing
performance problems. Without historical data to analyze, there can be no
effective performance tuning.
Performance in a virtualized environment provides challenges to even the
most senior of administrators, so I’ll also go over specific concepts for a
virtualized environment, including simultaneous multithreading (SMT),
virtual processors, and the POWER Hypervisor.
As for methodology, when investigating a perceived performance
problem, start by monitoring the statistics of CPU utilization. It’s important
to continuously observe system performance because you need to compare
the loaded system data with normal usage data, which is the baseline. Because the CPU is one of the fastest components of the system, if CPU utilization keeps the CPU 100 percent busy (which happens to every system at
some time), you’ll need to investigate the process that causes this situation.
AIX provides many trace and profiling tools to follow the most complex of
processes. Don’t be afraid to also use any application or database tools at
your disposal to help you further. In a CPU-bound system, all the processors are 100 percent busy, and some jobs may be waiting for CPU time in
the run queue. Generally speaking, a system has an excellent chance of becoming CPU-bound if the CPU is 100 percent busy, has a large run queue
compared with the number of CPUs, and requires more context switches
than usual.
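The rule of thumb in the last paragraph can be written down as a quick check. What follows is only a sketch in Python: the function name, its parameters, and the idea of comparing context switches against a baseline figure you collected earlier are my assumptions for illustration, not an AIX tool or an IBM formula.

```python
def looks_cpu_bound(busy_pct, run_queue, cpus, ctx_switches, baseline_ctx_switches):
    """Return True when all three warning signs from the text are present.

    All inputs are numbers you would read off your own monitoring tools;
    the comparisons below are illustrative assumptions, not fixed limits.
    """
    fully_busy = busy_pct >= 100                             # CPU is 100 percent busy
    long_run_queue = run_queue > cpus                        # more runnable threads than CPUs
    more_switching = ctx_switches > baseline_ctx_switches    # above the normal level
    return fully_busy and long_run_queue and more_switching


# A loaded system: pegged CPU, 12 runnable threads on 4 logical CPUs
print(looks_cpu_bound(100, 12, 4, 5000, 1200))   # True
# A quiet system
print(looks_cpu_bound(35, 1, 4, 900, 1200))      # False
```

Note that the check is only as good as the baseline you feed it, which is one more reason to collect data while the system is healthy.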
That’s the quick and dirty. We’ll get into much more detail in the next
couple of chapters.
Chapter 5
CPU: Monitoring
AIX systems administrators have much more at their disposal than the average Unix administrator. Not only can you use the standard Unix generic
monitoring tools that have been around nearly as long as Unix itself, but
a potpourri of AIX-specific commands is also available. Some of these
commands come standard with an AIX build, while others are tools that,
although not officially supported by IBM, are widely distributed and are
used by most administrators. We’ll discuss all these types of monitoring
tools in this chapter, including those we don’t use very often.
As we go through the tools, note that four commands — mpstat, sar,
topas, and vmstat — have been enhanced in AIX 5.3 to enable the tools
to report back accurate statistics about shared partitions using Advanced
Power Virtualization (PowerVM). The trace-based tools curt, filemon,
netpmon, pprof, and splat have also been updated. One command not
covered here, lparmon, is the most comprehensive tool you can use in a
partitioned environment.
vmstat (Unix-generic)
vmstat [-fsviItlw] [[-p|-P] pagesize|ALL] [Drives] [Interval [Count]]
While the vmstat command is more commonly associated with viewing information about virtual memory (hence the “vm”), it is the first
tool most administrators invoke when trying to get a quick assessment of
their systems. That’s because vmstat reports back all kinds of pertinent
performance-related information, including data about memory, paging,
blocked I/O, and overall CPU activity. Because it reports virtually all
subsystem information line by line in a quick and painless way, running
vmstat is probably the simplest and most efficient way to gauge exactly
what is going on in your system.
A common way to run vmstat is for five iterations every two seconds:
vmstat 2 5
Running the command in this way produces the following results:
# vmstat 2 5
System configuration: lcpu=4 mem=3072MB ent=0.40

kthr     memory              page               faults            cpu
----- ------------- ----------------------- ------------- ----------------------
 r  b    avm    fre  re  pi  po  fr  sr  cy   in   sy  cs us sy id wa    pc   ec
 1  0 128826 641397   0   0   0   0   0   0  448   87 138  0  1 98  0  0.01  2.8
 1  0 128826 641397   0   0   0   0   0   0  385   10 136  0  1 99  0  0.01  2.2
 1  0 128826 641397   0   0   0   0   0   0  381   13 138  0  1 99  0  0.01  2.2
 1  0 128826 641397   0   0   0   0   0   0  364   40 138  0  1 99  0  0.01  2.4
 1  0 128826 641397   0   0   0   0   0   0  610   13 138  0  2 98  0  0.01  3.3
In addition to specific monitoring information, vmstat provides a very
high-level snapshot of the system, which can be useful. Just by running
vmstat in the preceding snapshot, we know that we have a system with
four logical CPUs and 3 GB of RAM and are using shared processors. (In
actuality, this partition is using two virtual processors; simultaneous multithreading is enabled, yielding the four logical CPUs. More about SMT later.)
Some of the more important fields in the vmstat output include the
following:
● r — The average number of runnable kernel threads over the sampling interval you have chosen.
● b — The average number of kernel threads in the virtual memory
waiting queue over the sampling interval. The r value should always
be higher than b; if it is not, you probably have a CPU bottleneck.
● fre — The size of the memory free list. Don’t worry too much if this
number is really small. More important, determine whether any paging is going on if this size is small.
● pi — Pages paged in from paging space.
● po — Pages paged out to paging space.
Our focus in this chapter is on the last section of output, CPU:
● us — User time
● sy — System time
● id — Idle time
● wa — Time spent waiting on I/O
● pc — Number of physical processors consumed (displayed only if
the partition is configured with shared processors)
● ec — Percentage of entitled capacity (displayed only if the partition
is configured with shared processors)
Clearly, the system in our example has no bottleneck to speak of. How can
we tell this? Let’s look at us and sy. If these entries combined consistently
averaged more than 80 percent, you more than likely would have a CPU
bottleneck. If you are in a state where the CPU is running at 100 percent
(which happens on occasion to everyone), your system is really breathing hot and heavy. If the numbers are small but the wait time (wa) is on
the high side (usually greater than 30), this usually signals that there may
be I/O problems, which in turn can cause the CPU not to work as hard as
it can. Alternatively, if more time is spent in sy time than us time, your
system is probably spending less time crunching numbers and more time
processing kernel data. When this happens, it is usually a sign either of
badly written code or that something has run amok.
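If you want to apply these guidelines to numbers you have already averaged, a few lines of Python will do. This is a sketch under stated assumptions: the function name is hypothetical, the 80 and 30 thresholds come from the paragraph above, and the min_load floor is my own addition so that a nearly idle system is not flagged merely because sy exceeds us.

```python
def read_cpu_columns(us, sy, wa, busy_limit=80, wait_limit=30, min_load=20):
    """Interpret averaged vmstat CPU percentages using the chapter's guidelines.

    busy_limit and wait_limit come from the text; min_load is an assumed
    floor so that a nearly idle system is not flagged for sy > us.
    """
    notes = []
    if us + sy > busy_limit:
        notes.append("likely CPU bottleneck")
    if wa > wait_limit:
        notes.append("possible I/O problem")
    if us + sy >= min_load and sy > us:
        notes.append("more system time than user time")
    return notes or ["nothing obvious"]


print(read_cpu_columns(us=0, sy=1, wa=0))     # the idle system shown earlier
print(read_cpu_columns(us=10, sy=5, wa=45))   # hypothetical I/O-heavy sample
```

Treat the result as a prompt for further investigation rather than a diagnosis; the thresholds are guidelines, not hard limits.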
Let’s look at another system:
# vmstat 2 5
System configuration: lcpu=4 mem=3072MB ent=0.40

kthr     memory              page               faults             cpu
----- ------------- ----------------------- --------------- ----------------------
 r  b    avm    fre  re  pi  po  fr  sr  cy   in    sy  cs us sy id wa    pc   ec
 2  1 169829 600290   0   0   0   0   0   0  553 36538 175 64 32  4  0  0.79 84.9
 3  2 169829 600290   0   0   0   0   0   0  778 33033 175 60 29 11  0  0.84 73.2
 4  1 169828 600291   0   0   0   0   0   0  403 11904 179 76 10  4 10  0.69 87.8
 2  1 169828 600291   0   0   0   0   0   0  368 30745 175 82 14  2  2  0.91 85.5
 6  2 169830 600289   0   0   0   0   0   0  395 27898 173 57 34  4  5  0.89 91.5
What kind of determination can we make here? When we add us and sy,
our numbers come out much differently this time — fairly close to 100
percent. This system is clearly CPU-bound. If paging were going on, we
would see numbers in the paging (page) columns. In this case, no paging
is occurring, nor are there any I/O problems to speak of. Because vmstat is
an all-purpose utility, it can help you perform this quick-and-dirty analysis
on the fly. If the blocked processes represented three times the number of
runnable processes and everything else stayed the same, I/O would likely
be causing the CPU bottleneck. In that case, you should be prepared to
have even more of a CPU bottleneck once you fix the I/O problem. As I
explained previously, this is all part of systems tuning; fixing one bottleneck often causes another.
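To see how close to 100 percent this system really is, you can add the columns yourself. The short Python sketch below does the arithmetic on the five samples from the listing; the variable names are mine, and the "three times" comparison is simply the warning sign mentioned in the paragraph above expressed as code.

```python
# (r, b, us, sy) for each of the five samples in the listing above
samples = [
    (2, 1, 64, 32),
    (3, 2, 60, 29),
    (4, 1, 76, 10),
    (2, 1, 82, 14),
    (6, 2, 57, 34),
]

busy = [us + sy for _r, _b, us, sy in samples]
average_busy = sum(busy) / len(busy)

# The "three times" I/O warning sign: blocked threads at least 3x runnable ones
blocked_heavy = [b >= 3 * r for r, b, _us, _sy in samples]

print(busy)                 # [96, 89, 86, 96, 91]
print(average_busy)         # 91.6
print(any(blocked_heavy))   # False
```

An average above 90 percent with no sample showing blocked threads outnumbering runnable ones is consistent with the CPU-bound reading.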
sar (Unix-generic)
sar {-A [-M]|[-a][-b][-c][-d][-k][-m][-q][-r][-u][-v][-w][-y][-M]}
[-s hh[:mm[:ss]]] [-e hh[:mm[:ss]]]
[-P processor_id[,...] | ALL]
[-f file] [-i seconds] [-o file] [interval [number]]
[-X file] [-i seconds] [-o file] [interval [number]]
The sar command is the Unix System Activity Reporting tool (part of the
bos.acct fileset). It is most commonly used to analyze CPU activity. The
command writes to standard output the contents of the cumulative activity,
similar to vmstat. The default version of sar produces a CPU utilization
report:
# sar 2 5
AIX lpar30p682e_pub 3 5 00CED82E4C00    12/24/07
System configuration: lcpu=4 ent=0.40 mode=Uncapped

10:13:40    %usr    %sys    %wio   %idle   physc   %entc
10:13:42      13      31       0      57    0.18    44.5
10:13:44      12      30       0      58    0.17    43.5
10:13:46      14      35       0      51    0.20    50.8
10:13:48       6      11       0      83    0.07    18.0
10:13:50       9      24       0      67    0.14    34.5

Average       11      26       0      63    0.15    38.3
Used this way, the sar command provides the same type of high-level
information that vmstat does, although it also lets you know the mode
in which the system is running, which is helpful. In the example, we can
see that our partition is an uncapped partition, which, when configured
as such, lets the partition use more resources than its entitled capacity. In
this default view, the fields themselves are the same as the vmstat fields,
but us becomes usr, sy becomes sys, id becomes idle, wa becomes wio, pc
becomes physc, and ec becomes entc.
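The relationship between physc and %entc is easy to check by hand: the percentage is the physical processor consumption divided by the partition's entitlement. The Python sketch below uses a hypothetical helper name and the first sample of the sar output above; because sar prints physc rounded to two decimals, the result lands near the reported 44.5 rather than matching it exactly.

```python
def entitled_capacity_pct(physc, entitlement):
    """Percentage of entitled capacity consumed (hypothetical helper)."""
    if entitlement <= 0:
        raise ValueError("entitlement must be positive")
    return 100.0 * physc / entitlement


# First sar sample: physc=0.18 on a partition with ent=0.40
print(round(entitled_capacity_pct(0.18, 0.40), 1))
# An uncapped partition can exceed its entitlement, for example 0.60 consumed
print(round(entitled_capacity_pct(0.60, 0.40), 1))
```

The second call illustrates why the uncapped mode matters: values above 100 percent are possible when spare capacity is available in the shared pool.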
A more effective way to run sar is to view all processors by using the ALL
flag:
# sar -u -P ALL 2 5
AIX lpar30p682e_pub 3 5 00CED82E4C00    12/24/07
System configuration: lcpu=4 ent=0.40 mode=Uncapped

10:24:18 cpu    %usr    %sys    %wio   %idle   physc   %entc
10:24:20  0       27      71       0       2    0.15    37.5
          1        0      35       0      65    0.00     0.5
          2        0      36       0      64    0.00     0.0
          3        0      29       0      71    0.00     0.0
          U        -       -       0      62    0.25    61.8
          -       10      27       0      63    0.15    38.2
10:24:22  0       32      66       0       2    0.15    37.2
          1        0      37       0      63    0.00     0.6
          2        0      35       0      65    0.00     0.0
          3        0      30       0      70    0.00     0.0
          U        -       -       0      62    0.25    62.1
          -       12      25       0      63    0.15    37.9
10:24:24  0       29      69       0       2    0.15    37.7
I prefer using vmstat to sar because vmstat provides a quick snapshot of
all subsystems, not just CPU. Although you can use other flags to obtain
additional subsystem information using sar, it just is not as efficient or
simple.
One advantage sar provides that vmstat does not is the ability to capture
information and analyze data. This is done through the System Activity Data Collector (sadc), which is essentially a back end to sar. When
enabled through cron (it is commented out on a typical default AIX partition), sadc collects data periodically in binary format. In the following
example, we run it from the command line:
# /usr/lib/sa/sadc 2 5 /tmp/sarinfo
To view the results (remember it’s in binary format), we need to use the -f
flag:
# sar -f /tmp/sarinfo
AIX lpar30p682e_pub 3 5 00CED82E4C00    12/24/07
System configuration: lcpu=4 ent=0.40 mode=Uncapped

10:41:42    %usr    %sys    %wio   %idle   physc   %entc
10:41:44       0       1       0      99    0.01     2.4
10:41:46       0       1       0      98    0.01     2.6
10:41:48       0       1       0      99    0.01     2.1
10:41:50       0       1       0      99    0.01     1.9
Average        0       1       0      99    0.01     2.3
iostat (Unix-generic)
iostat [-a][-l][-s][-t][-T][-z] [{-A [-P] [-q|Q]} | {-d|-D [-R]} ]
[-m] [Drives] [Interval [Count]]
The iostat command is another good first-impression type of tool, which
is more commonly used for I/O information. When run with the -t flag, it
provides only tty/cpu information. I also like to use the -T flag to obtain the
timestamp:
# iostat -tT 1
System configuration: lcpu=4 ent=0.40

tty:    tin    tout   avg-cpu: % user  % sys  % idle  % iowait  physc  % entc  time
        0.0    41.0               0.0    1.1    98.8       0.0    0.0     2.2  10:51:13
        0.0   182.0               0.0    0.9    99.0       0.0    0.0     1.8  10:51:14
        0.0    92.0               0.0    0.9    99.1       0.0    0.0     1.7  10:51:15
        0.0    92.0               0.1    1.1    98.8       0.0    0.0     2.1  10:51:16
        0.0    92.0               0.0    1.4    98.6       0.0    0.0     2.7  10:51:17
w (Unix-generic)
/usr/bin/w64 [ -hlsuwX ] [ user ]
The w command prints a summary of all current activity on the system.
I like this command — always have and always will. Sometimes I run it
even before vmstat. I appreciate the clear, concise way in which w provides important information, such as load average. You can tell a lot about
your system from the load average. If my load average commonly varies
between 2 and 5 but is 37 when I run this command, I’m about ready to
say, “Houston, we have a problem.” In the following case, we’re okay.
# w
08:29AM   up 1 day, 23:44,  2 users,  load average: 1.00, 1.00, 1.01
User     tty     login@     idle   JCPU   PCPU  what
u0004773 pts/0   06:40AM       0      0      0  -ks
u0004773 pts/1   08:28AM       0      0      0  -ksh
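If you collect this header line from many systems, it can be handy to pull the load averages out programmatically. Here is a minimal Python sketch: the function names are hypothetical, and the idea of flagging a load more than twice its usual ceiling is an assumption I chose for illustration, not a rule from AIX.

```python
import re


def load_averages(uptime_line):
    """Pull the 1-, 5-, and 15-minute load averages out of a w/uptime header line."""
    match = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", uptime_line)
    if match is None:
        raise ValueError("no load average found in line")
    return tuple(float(value) for value in match.groups())


def worth_a_look(current_load, usual_high, factor=2.0):
    """True when the load is well above the usual ceiling (factor is an assumption)."""
    return current_load > factor * usual_high


header = "08:29AM   up 1 day, 23:44,  2 users,  load average: 1.00, 1.00, 1.01"
one_minute, five_minute, fifteen_minute = load_averages(header)
print(one_minute, worth_a_look(one_minute, usual_high=5))   # 1.0 False
print(worth_a_look(37, usual_high=5))                       # True
```

What counts as "usual" is something only your own baseline can tell you, which is why the ceiling is passed in rather than hard-coded.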
lparstat (AIX-specific)
lparstat { -i | [-H|-h] [Interval [Count]] }
The purpose of the lparstat command is to report logical partition (LPAR)
information statistics. This command also displays hypervisor statistical data about many POWER Hypervisor calls. Introduced in AIX 5.2,
lparstat is commonly used to assist in shared-processor partitioned
environments.
In the following command output, you should recognize the entries up
until entitled capacity (entc).
# lparstat 2 5
System configuration: type=Shared mode=Uncapped smt=On lcpu=4 mem=3072 psize=16 ent=0.40

%user  %sys  %wait  %idle  physc  %entc  lbusy  vcsw  phint
-----  ----  -----  -----  -----  -----  -----  ----  -----
  0.1   1.4    0.0   98.5   0.01    2.6    0.0   582      0
  0.0   1.4    0.0   98.6   0.01    2.6    0.0   635      0
  0.0   1.3    0.0   98.7   0.01    2.4    0.0   593      0
  0.0   1.5    0.0   98.5   0.01    2.8    1.2   685      0
  0.1   1.1    0.0   98.8   0.01    2.1    0.0   458      1
On shared partitions, lparstat provides the following information:
● lbusy — The percentage of logical processor utilization (executing at
the user and system level)
● vcsw — The number of virtual context switches that are virtual processor hardware preemptions
● phint — The number of phantom interrupts (redirected to other partitions in the shared pool)
An important flag worth a mention is the -H flag, which shows detailed POWER
Hypervisor call statistics:
# lparstat -H 2 5
System configuration: type=Shared mode=Uncapped smt=On lcpu=4 mem=3072 psize=16 ent=0.40

Detailed information on Hypervisor Call

Hypervisor    Number of   %Total Time   %Hypervisor   Avg Call   Max Call
Call          Calls       Spent         Time Spent    Time(ns)   Time(ns)
remove                0           0.0           0.0          1        656
read                  0           0.0           0.0          1          0
nclear_mod            0           0.0           0.0          1          0
page_init           265           0.0           0.9        604       6593
clear_ref             0           0.0           0.0          1          0
protect               0           0.0           0.0          1          0
put_tce               0           0.0           0.0          1          0
xirr                565           0.1           2.4        758       1406
Hypervisor information includes:
● Number of calls — The number of Hypervisor calls
● %Total Time Spent — Percentage of total time spent on call
● %Hypervisor Time Spent — Percentage of Hypervisor time spent on
call
● Avg Call Time — Average call time for this type of call (in nanoseconds)
● Max Call Time — Maximum call time for this type of call (in nanoseconds)
For partitions running AIX 5.2 or AIX 5.3, either in a dedicated environment or in shared and capped mode, the overall CPU utilization is based on
the user, sys, wait, and idle values. In AIX 5.3 partitions running in uncapped mode, the utilization is based on the entitled capacity percentage.
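The choice of which number to watch can be expressed as a small decision function. The Python below is a sketch only: the function and parameter names are hypothetical, and treating user plus sys as the "busy" figure for dedicated and capped partitions is my simplification of the rule just stated.

```python
def overall_cpu_utilization(partition_type, mode, user_pct, sys_pct, entc_pct):
    """Pick the utilization figure according to the partition's configuration.

    partition_type: "Dedicated" or "Shared"; mode: "Capped" or "Uncapped"
    (for a dedicated partition the mode is ignored).
    """
    if partition_type == "Shared" and mode == "Uncapped":
        return entc_pct              # percentage of entitled capacity
    return user_pct + sys_pct        # busy time; wait and idle make up the rest


# First lparstat sample above: type=Shared mode=Uncapped
print(overall_cpu_utilization("Shared", "Uncapped", user_pct=0.1, sys_pct=1.4, entc_pct=2.6))
```

For the uncapped sample partition the function returns the %entc column, which is the figure to trend over time on such systems.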
mpstat (AIX-specific)
mpstat [ { -a | -d | -i | -s | -h } ] [ -w ] [ interval [ count ] ]
The mpstat command (part of the bos.acct fileset) was introduced in AIX
5.3. This tool displays overall performance numbers for all logical CPUs
on your partitioned system. When you run the command, two sections
of statistics are displayed. The first section shows system configuration
information, which is displayed when the command starts and whenever
a change in the system configuration occurs; the second section, which is
displayed at user-specified intervals, shows utilization statistics:
# mpstat 1 2
System configuration: lcpu=4 ent=0.4 mode=Uncapped

cpu  min  maj  mpc  int   cs  ics   rq  mig  lpa sysc  us  sy  wa  id    pc   %ec  lcs
  0   18    0    0  524  125   56    1    0  100  100   8  58   0  34  0.01   2.1  465
  1    0    0    0  108    0    0    0    0    -    0   0  36   0  64  0.00   0.5  108
  2    0    0    0   10    0    0    0    0    -    0   0  32   0  68  0.00   0.0   10
  3    0    0    0   10    0    0    0    0    -    0   0  29   0  71  0.00   0.0   10
  U    -    -    -    -    -    -    -    -    -    -   -   -   0  97  0.39  97.3    -
ALL   18    0    0  652  125   56    1    0  100  100   0   1   0  98  0.01   2.7  593
-------------------------------------------------------------------------------
  0    3    0    0  392  127   58    1    0  100   67   5  56   0  38  0.01   1.4  331
  1    0    0    0   70    0    0    0    0    -    0   0  34   0  66  0.00   0.4   70
  2    0    0    0   10    0    0    0    0    -    0   0  32   0  68  0.00   0.0   10
  3    0    0    0   10    0    0    0    0    -    0   0  29   0  71  0.00   0.0   10
  U    -    -    -    -    -    -    -    -    -    -   -   -   0  98  0.39  98.2    -
ALL    3    0    0  482  127   58    1    0  100   67   0   1   0  99  0.01   1.8  421

Information given includes:
● cpu — Logical CPU processor ID
● min — Minor page faults
● maj — Major page faults
● mpc — Total number of interprocessor calls
● int — Total number of interrupts
● cs — Total number of context switches
● ics — Total number of involuntary context switches
● rq — Total run queues
● mig — Total number of thread migrations
● lpa — Logical processor affinity
● sysc — Total number of system calls
● us — CPU time spent on user activity
● sy — CPU time spent on system activity
● wa — CPU time spent waiting on I/O
● id — CPU time idle
● pc — Fraction of processor consumed
● %ec — Percentage of entitled capacity consumed
● lcs — Total number of logical context switches
The mpstat command is a very useful command because it reports collection information for each logical CPU on your partition in a format that is
clearly illustrated. You can even view SMT utilization by specifying the -s
flag:
# mpstat -s 1
System configuration: lcpu=4 ent=0.4 mode=Uncapped

      Proc0               Proc1
      1.01%               0.02%
  cpu0      cpu1      cpu2      cpu3
  0.85%     0.16%     0.01%     0.01%
------------------------------------------------------------------
      Proc0               Proc1
      0.74%               0.02%
  cpu0      cpu1      cpu2      cpu3
  0.56%     0.18%     0.01%     0.01%
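One way to read this output is that each processor's figure is the sum of the logical CPUs (SMT threads) beneath it. The Python sketch below checks that reading against the first interval shown; the dictionary layout and helper name are my own, and the two-decimal rounding is an assumption that mirrors the precision of the listing.

```python
# Values from the first interval of the mpstat -s listing above (percentages)
virtual_processors = {
    "Proc0": {"reported": 1.01, "logical": {"cpu0": 0.85, "cpu1": 0.16}},
    "Proc1": {"reported": 0.02, "logical": {"cpu2": 0.01, "cpu3": 0.01}},
}


def logical_total(proc):
    """Sum of the logical CPU percentages under one processor, to two decimals."""
    return round(sum(proc["logical"].values()), 2)


for name, proc in virtual_processors.items():
    print(name, logical_total(proc), "reported:", proc["reported"])
```

In this sample nearly all of the work lands on cpu0, the primary thread of Proc0, which is typical of a lightly loaded SMT system.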
topas (AIX-specific)
IBM has improved the topas command (part of the bos.perf.tools fileset)
substantially in AIX 5.3. Before these changes, topas did not have the
ability to capture historical data, nor was it enhanced for use in shared
partitioned environments. (The command’s -L flag now reports partitioned
information.) By incorporating these changes to let you collect performance data from multiple partitions, IBM has really simplified the capability of topas as a performance management and capacity planning tool. The
command’s look and feel is quite similar to top and monitor (used in other
Unix variants).
The topas utility displays all kinds of information on your screen in a text-based, graphical type of format. In its default mode, it provides a myriad of
CPU, memory, and I/O information. Some recent changes:
● As of TL_4 of AIX 5.3, topas uses a daemon named xmwlm, which
is automatically started from the inittab.
● As of TL_5 of AIX 5.3, the system keeps seven days of data as a
default and records almost all the topas data that is displayed interactively, except for process and Workload Manager (WLM) information. You can use the topasout command to generate text-based
reports. By specifying the -C flag, you can actually view monitoring
information across all partitions in an IBM POWER system.
nmon
My favorite of all performance monitoring tools is nmon, which until
recently was not an “officially” supported IBM tool; if you were going to
send data to IBM for analysis, this was not the tool you would use. nmon
is almost the perfect AIX analysis tool (it’s also available now for Linux
on POWER). The data it collects is available either from your screen or
through reports, which you can run from cron. In the words of nmon’s
creator, Nigel Griffiths, “Why use five or six tools when one free tool can
give you everything you need?”
What attracts most people to nmon is that not only does it have a very
efficient front-end monitor, but it also provides the ability (unlike topas)
to capture data to a text file for graphing reports because the output is in a
.csv (spreadsheet) format. In fact, moments after running an nmon session, you can actually view the nicely rendered charts in a Microsoft Excel
spreadsheet, which you can hand off to senior management or other technical
teams for further analysis. Further, in contrast to topas, I’ve never seen
any performance-type overhead with this utility.
Using nmon for Historical Analysis
First, we’ll tell nmon to create a file, name the run, and do data collection
every 30 seconds for one hour (120 intervals):
# ./nmon -f -t -r test3 -s 30 -c 120
AIX version 5.3.0.0 and starting up nmon nmon_aix5
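The -s and -c values are just an interval and a count, so choosing them is simple arithmetic. As a quick sketch (the function name is hypothetical, and rounding up is my own choice so the run covers at least the requested period), the Python below works out the count for a given duration.

```python
def snapshots_needed(duration_seconds, interval_seconds):
    """Number of nmon snapshots (-c) to cover a duration at a given interval (-s)."""
    if duration_seconds <= 0 or interval_seconds <= 0:
        raise ValueError("duration and interval must be positive")
    # Round up so the collection covers at least the requested duration
    return -(-duration_seconds // interval_seconds)


print(snapshots_needed(3600, 30))         # one hour at 30-second intervals: 120
print(snapshots_needed(24 * 3600, 300))   # a full day at 5-minute intervals: 288
```

The first result matches the 120 intervals used in the command above; the second shows how the same calculation scales to a day-long collection.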
When monitoring is completed, we’ll sort the file:
# sort -A p682e_pub_071224_1411.nmon > lpar30p682e_pub_071224_411.csv
Now, we can FTP the spreadsheet to a PC and open it up. Start the nmon
analyzer, and click on Analyze nmon data. Enter the location of the file,
wait about 20 seconds, and you’ll see your nmon data in all its glory!
Figure 5.1 shows some sample output from the nmon analyzer.
Figure 5.1: Sample nmon analyzer output
The nmon analyzer is an awesome tool, written by Stephen Atkins, that
graphically presents data (CPU, memory, network, or I/O) from an Excel
spreadsheet. Perhaps the only drawback that prevents it from being perceived as an enterprise type of tool is that it lacks the ability to gather
statistics about large numbers of LPARs at once (although it now has a
partition-viewing capability similar to that of topas). The analyzer is not
a database, nor was it meant to be. That is where a tool such as Ganglia
helps; this utility has actually received the blessing of Nigel Griffiths as the
tool that can integrate nmon analysis.
You can download the nmon analyzer for free from http://www.ibm.com/developerworks/aix/library/au-nmon_analyser. For more information
about Ganglia, see http://ganglia.info.
ps (Unix-generic)
ps [-ANPaedfklmMZ] [-n namelist] [-F Format] [-o
specifier[=header],...] [-p proclist][-G|-g grouplist] [-t
termlist] [-U|-u userlist] [-c classlist] [ -T pid] [ -L pidlist]
ps [aceglnsuvwxU] [t tty] [processnumber]
The ps command shows the current status of processes. Upon viewing the
syntaxes shown above, the first question you may have is, why the two
sets of usage parameters? To make a long story short, the answer has to do
with the basic history of Unix — the old Berkeley versus System V (now
referred to as X/Open Standards) wars. As we discussed in Chapter 2, AIX
is a hybrid of sorts, and it contains both flavors of Unix. Most of you are
probably more familiar with the X/Open Standards usage of ps (e.g., ps
–ef), which is the first usage shown above.
How can you best use ps in CPU systems monitoring? In other words,
how can you identify processes that are taking an inordinate amount of
CPU time? If you can find these processes, you can take action on them. I
like using the Berkeley syntax better here; the information it provides is in
a nicer, more presentable format. Let’s look at ps ux, which displays the
CPU execution time of processes:
# ps ux | more
USER    PID %CPU %MEM   SZ  RSS TTY STAT    STIME  TIME COMMAND
root   8196  0.1  0.0  384  384   - A    08:45:25  1:02 wait
root  53274  0.0  0.0  384  384   - A    08:45:25  0:30 wait
root  86118  0.0  0.0  504  512   - A    08:45:27  0:08 /usr/sbin/syncd
root 299158  0.0  0.0  472  500   - A    08:45:44  0:06 /usr/sbin/getty
root  69666  0.0  0.0  960  960   - A    08:45:25  0:04 gi
root      0  0.0  0.0  384  384   - A    08:45:25  0:04 swappe
root  57372  0.0  0.0  384  384   - A    08:45:25  0:02 wait
root  61470  0.0  0.0  384  384   - A    08:45:25  0:02 wait
root 286880  0.0  0.0  900  928   - A    08:45:44  0:01 /usr/bin/xmwlm
root 258190  0.0  0.0 1216 1216   - A    08:45:35  0:01 rpc.lock
root 151642  0.0  0.0  512  512   - A    08:45:27  0:01 rtcmd
root 233606  0.0  0.0  840  956   - A    08:45:44  0:00 /usr/sbin/sshd
This ps command uses two key parameters:
● u displays user-oriented output about each process: the USER
(user), PID (process ID), %CPU (percentage of CPU time used), %MEM
(percentage of memory used), SZ (size of process core image), RSS
(resident set size), TTY (controlling terminal name), STAT (process
state), STIME (start time), TIME (total CPU time used), and COMMAND
(executed command) fields.
● x displays processes without a controlling terminal in addition
to processes with a controlling terminal. To see processes that don’t
include daemons, substitute a for x.
For our purposes, the most important field of the ps output is %CPU. This
field reports the percentage of CPU time that the process has used since it
started.
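To pick out the heaviest consumers without reading the whole listing, you can sort on the %CPU column. The sketch below assumes the column layout shown above, where %CPU is the third field; the function name top_cpu is hypothetical:

```shell
# Hypothetical helper: read "ps ux"-style output on stdin, print the header
# line, then print the N rows with the highest %CPU (assumed to be field 3).
top_cpu() {
    n=${1:-5}
    {
        # Pass the header line through unchanged.
        IFS= read -r header && printf '%s\n' "$header"
        # Sort the remaining rows numerically, highest %CPU first.
        sort -k3,3nr | head -n "$n"
    }
}
```

You would typically run it as `ps ux | top_cpu 5`. Keep in mind that %CPU here is an average over the life of the process, so a long-running process that has only recently started spinning may not rise to the top right away.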
Tracing Tools
Tracing tools come in handy when you want to drill down further to analyze processes that are causing bottlenecks. Among these tools are curt,
splat, tprof, trace, and trcrpt. We’ll use the tprof and trace tools here.
tprof
tprof [ -c ] [ -C { all | cpuidslist } ] [ -d ] [ -D ] [ -e ]
{ [ -E { ALIGNMENT | EMULATION | ISLBMISS | DSLBMISS | PM_<event> } ]
[ -f interval ] } [ -F ] [ -j ] [ -J profilehook ] [ -k ] [ -l ]
[ -L objectslist ] [ -m objectslist ] [ -M sourcepathlist ]
[ -p processlist ] [ -P { all | pidslist } ] [ -s ]
[ -S searchpathlist ] [ -t ] [ -T buffersize ] [ -u ] [ -v ]
[ -V verbosefilename ] [ -I ] [ -N ] { [-z] [-Z] | -R }
{ { -r rootstring } [ -X { xmloptions } ] |
{ { [ -A { all | cpuidslist } ] [-n] } [ -r rootstring ] -x command }
}
The tprof command reports CPU usage for both individual programs and
the system as a whole. The output provides an estimate of the amount of
CPU time spent for each process that was executing while tprof was running. It also contains an estimate of the amount of CPU time spent in each
of three address spaces: the kernel address space, the user address
space, and shared library address spaces.
You can use tprof to view a basic global program and thread-level summary by running the command in the following fashion:
# tprof -x sleep 20
Mon Dec 24 18:55:54 2
System: AIX 5.3 Node: lpar30p682e_pub Machine: 00CED82E4C0
Starting Command sleep 2
stopping trace collection.
Generating sleep.prof
root@lpar30p682e_pub[/]
Let’s view the file (sleep.prof) that we just created:
# more sleep.prof
Configuration information
=========================
System: AIX 5.3 Node: lpar30p682e_pub Machine: 00CED82E4C00
Next, let’s use the trace command to run a manual trace:
/usr/bin/trace -ad -M -L 109113753 -T 500000 -j
000,00A,001,002,003,38F,005,006,134,139,5A2,5A5,465,234, -o
Total Samples = 1088
Traced Time = 20.02s (out of a total execution time of 20.02s)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Process          Freq  Total Kernel   User Shared  Other
=======          ====  ===== ======   ==== ======  =====
wait                4  99.82  99.82   0.00   0.00   0.00
swapper             1   0.09   0.09   0.00   0.00   0.00
/usr/bin/tprof      1   0.09   0.00   0.00   0.09   0.00
Total               6 100.00  99.91   0.00   0.09   0.00
Process            PID     TID  Total Kernel   User Shared  Other
=======            ===     ===  ===== ======   ==== ======  =====
wait              8196    8197  44.58  44.58   0.00   0.00   0.00
swapper              0       3   0.09   0.09   0.00   0.00   0.00
/usr/bin/tprof  418000  688307   0.09   0.00   0.00   0.09   0.00
=======            ===     ===  ===== ======   ==== ======  =====
Total                          100.00  99.91   0.00   0.09   0.00
The tprof command is an excellent tool for identifying runaway processes
because these processes appear at the top of the output list.
Timing Tools
Two tools, time and timex, provide access to information about command
execution time.
time
time [ -p ] Command [ Argument ... ]
The time command returns the total execution time of your program,
including real time, user time, and system time. This information can be
useful when you’re trying to figure out the amount of time it takes for commands to execute. time works by counting the CPU ticks from the time the
command was first started until the time it ends:
# time find ./ -depth 1>/dev/null
real    0m23.30s
user    0m0.22s
sys     0m2.10s
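One quick way to read these three numbers is to ask what share of the elapsed time was actually spent on the CPU. The helper below is a hypothetical sketch of that calculation, (user + sys) / real, and is not something time provides itself:

```shell
# Hypothetical helper: percentage of elapsed (real) time that a command spent
# on the CPU, given the real, user, and sys figures in seconds.
cpu_share() {
    awk -v real="$1" -v user="$2" -v sys="$3" \
        'BEGIN { printf "%.1f\n", (user + sys) / real * 100 }'
}

cpu_share 23.30 0.22 2.10    # the find run above: prints 10.0
```

A figure this low suggests the command spent most of its time waiting (most likely on disk I/O, in the case of find) rather than computing, although that reading is an inference and not something the output states directly.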
timex
timex [ -s ][ -o ][ -p [ -fhkmrt ] ] cmd
Without any flags, the timex command provides the same type of information as time, but with a prettier view. Used with the –s flag, it summarizes
all system activity while the command is being executed. This spares you
the task of starting up a sar or vmstat process while running a timing. For
this reason alone, I like to use timex, and I’ve found it a very useful tool
through the years.
# timex -s find ./ -depth 1>/dev/null
real 21.69
user 0.20
sys 2
AIX lpar30p682e_pub 3 5 00CED82E4C00    12/26/07
System configuration: lcpu=4 ent=0.40 mode=Uncapped
08:40:08    %usr    %sys    %wio   %idle   physc   %entc
08:40:30       5      33       0      62    0.17    43.2
System configuration: lcpu=4 ent=0.40 mode=Uncapped
08:40:08 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
08:40:30       0       0       0       0       0       0       0       0
System configuration: lcpu=4 mem=3072MB ent=0.40 mode=Uncapped
08:40:08   slots cycle/s fault/s  odio/s
08:40:30  392358    0.00   18.11    0.00
System configuration: lcpu=4 ent=0.40 mode=Uncapped
08:40:08 rawch/s canch/s outch/s rcvin/s xmtin/s mdmin/s
08:40:30       0       0       0       0       0       0
System configuration: lcpu=4 ent=0.40 mode=Uncapped
08:40:08 scall/s sread/s swrit/s  fork/s  exec/s rchar/s wchar/s
08:40:30   19659       8    5522    0.14    0.18   12407  308149
System configuration: lcpu=4 ent=0.40 mode=Uncapped
08:40:08 cswch/s
08:40:30    5617
System configuration: lcpu=4 ent=0.40 mode=Uncapped
08:40:08  iget/s lookuppn/s dirblk/s
08:40:30       0       8513        0
System configuration: lcpu=4 ent=0.40 mode=Uncapped
08:40:08 runq-sz %runocc swpq-sz %swpocc
08:40:30     1.3      95
System configuration: mode=Uncapped
08:40:08    proc-sz    inod-sz    file-sz     thrd-sz
08:40:30  68/262144      0/170   387/1124  219/524288
System configuration: lcpu=4 ent=0.40 mode=Uncapped
08:40:08   msg/s  sema/s
08:40:30    0.00    0.00
Chapter 6
CPU: Tuning
This chapter identifies the AIX tools you’ll use to help resolve CPU system
bottlenecks and improve performance. Notice that I didn’t use the word
“tune” in the preceding sentence. You’ll find that improving CPU performance is less about tuning and more about improving workload utilization
and managing processes and threads more efficiently.
Process and Thread Management
A junior administrator might consider process management as little more
than monitoring active processes and killing zombie and/or runaway processes. In reality, there is a lot more to it than that.
Let’s start by addressing a fundamental question: how do processes relate
to threads? The answer is simple. While the process is the entity that AIX
uses to control the use of system resources, the threads control the actual
processor time consumption because each kernel thread is a single sequential flow
of control. Each process is made up of one or more threads. Controlling
thread use is where you can make a difference. To do this, you need to
understand the tools that let you work with threads to improve CPU performance. Although AIX Version 4 introduced the use of threads to control
processor time consumption, it was in AIX 5L that system management
tools really evolved to help monitor and analyze thread usage.
nice
nice [-n Increment] Command [Argument...]
nice [-Increment] Command [Argument...]
The nice command lets you run a command at an adjusted priority. The
default nice value for processes is 20, except for Korn shell (ksh) background
processes, which are set to 24. With nice, the larger the increment number
you specify, the lower the priority.
You can use the ps command with the –l (lowercase L) flag to view this
information. The NI column shows the nice value for each process:
# ps -l | more
     F S      UID    PID   PPID C PRI NI     ADDR  SZ WCHAN   TTY TIME CMD
240001 A 20004773  90156 164038 0  60 20 30448400 724       pts/0 0:00 ksh
200005 A        0 376960  90156 2  60 20   45c400 736       pts/0 0:00 ksh
200001 A        0 409730 376960 0  60 20   446400 724       pts/0 0:00 ps
Let’s start a new ksh with nice:
# nice -10 ksh
# ps -l | more
     F S      UID    PID   PPID C PRI NI     ADDR  SZ WCHAN   TTY TIME CMD
240001 A 20004773  90156 164038 0  60 20 30448400 724       pts/0 0:00 ksh
200005 A        0 311534 376960 1  80 30 48376400 688       pts/0 0:00 ksh
200001 A        0 376960  90156 0  60 20 6045c400 736       pts/0 0:00 ksh
The preceding output shows that a new process (PID 311534) has been
added and that its priority has changed from the default: its nice value is
30 and its priority 80, rather than 20 and 60. Its parent process (PID 376960),
the shell from which it was forked, is also shown.
Watch the nice syntax; it can be a little confusing. The minus sign
(–) introduces the increment value, which is assumed to be positive. To
specify a negative increment, you must use two minus signs, with no
spaces in between. When you use the renice command (covered next), the
parameter following the command name is the value, whether it is positive
or negative.
renice
renice [ -n Increment ] [ -g|-p|-u ] ID . . .
The renice command dynamically reassigns a priority to a running process. Using renice can cause the system to assign either a higher or a lower
priority to a given process. When you use renice, you actually change the
value of the priority of a thread (default value of 40) by changing the nice
value of its process.
Assume that the following processes are currently running:
# ps -l
     F S      UID    PID   PPID C PRI NI     ADDR  SZ WCHAN   TTY TIME CMD
240001 A 20004773  90156 164038 0  60 20 30448400 724       pts/0 0:00 ksh
200005 A        0 311534 376960 0  80 30 48376400 688       pts/0 0:00 ksh
200001 A        0 376960  90156 0  60 20 6045c400 736       pts/0 0:00 ksh
200001 A        0 417842 311534 3  81 30 30468400 732       pts/0 0:00 ps
Let’s increase the priority of a thread by changing the nice value for the
process that contains it:
# renice -10 376960
# ps -l
     F S      UID    PID   PPID C PRI NI     ADDR  SZ WCHAN   TTY TIME CMD
240001 A 20004773  90156 164038 0  60 20 30448400 724       pts/0 0:00 ksh
200005 A        0 311534 376960 0  80 30 48376400 688       pts/0 0:00 ksh
200001 A        0 376960  90156 0  50 10 6045c400 736       pts/0 0:00 ksh
200001 A        0 417842 311534 3  81 30 30468400 732       pts/0 0:00 ps
It’s important to note that when not run as root, renice has some limitations. Without root authority, you can change only processes owned by the current user ID.
In addition, you cannot make a nice value more favorable again after making
it less favorable.
ps
In the preceding chapter, we looked at the ps command and how you can
use it to monitor CPUs. You’ll find that ps is one of the most versatile
commands in Unix. Specifying it with the –mo flag gives you a granular
look at your threads:
# ps -mo THREAD
    USER    PID   PPID    TID ST CP PRI SC WCHAN      F    TT BND COMMAND
u0004773  90156 164038      -  A  0  60  1     - 240001 pts/0   - -ksh
       -      -      - 933995  S  0  60  1     -  10400     -   - -
    root 311534 376960      -  A  0  80  1     - 200005 pts/0   - ksh
       -      -      - 823311  S  0  80  1     -  10400     -   - -
    root 376960  90156      -  A  0  50  1     - 200001 pts/0   - -ksh
       -      -      - 835591  S  0  50  1     -  10400     -   - -
    root 409778 311534      -  A  3  81  1     - 200001 pts/0   - ps -mo THREAD
       -      -      - 880775  R  3  81  1     - 400010     -   - -
The TID column lists the thread ID, while the BND column shows the
processes and threads bound to a processor. Why do you need to know
this information? Because you can actually change the priority of threads,
globally. To do so, you modify the CPU scheduling parameters (using the
schedo command) that calculate the priority for each thread.
schedo
schedo -h [tunable] | {-L [tunable]} | {-x [tunable]}
schedo [-p|-r] (-a | {-o tunable})
schedo [-p|-r] (-D | ({-d tunable} {-o tunable=value}))
The schedo command manages the CPU scheduler tunable parameters; it
can be run only by root. Similar to other tunable commands (e.g., vmo),
schedo can make immediate changes or can defer the changes until the
next reboot, depending on the flags you use. Use of the –r flag defers the
changes until the next reboot, while the –p flag applies them both immediately and at subsequent reboots.
First, let’s display the existing scheduling parameters by using schedo with
the –L flag:
# schedo -L
NAME                    CUR    DEF    BOOT   MIN    MAX     UNIT         TYPE
     DEPENDENCIES
%usDelta                100    100    100    0      100
affinity_lim            7      7      7      0      100     dispatches   D
allowMCMmigrate         0      0      0      0      1       boolean      D
big_tick_size           1      1      1      1      100     10 ms        D
ded_cpu_donate_thresh   80     80     80     0      100     % busy
fixed_pri_global        1      0      0      0      1       boolean
force_grq               0      0      0      0      1       boolean
hotlocks_enable         0      0      0      0      1       boolean
idle_migration_barrier  4      4      4      0      100     sixteenth    D
krlock_confer2self      1      1      1      0      1       boolean      D
krlock_conferb4alloc    0      0      0      0      1       boolean      D
krlock_enable           1      1      1      0      1       boolean
krlock_spinb4alloc      1      1      1      1      2G-1
krlock_spinb4confer     1K     1K     1K     0      2G-1
maxspin                 16K    16K    16K    1      4G-1
n_idle_loop_vlopri      100    100    100    0      976K                 D
pacefork                10     10     10     10     2G-1    clock ticks  D
sched_D                 16     16     16     0      32
sched_R                 16     16     16     0      32
search_globalrq_mload   256    256    256    0      4095M
search_smtrunq_mload    256    256    256    0      4095M
setnewrq_sidle_mload    384    384    384    0      4095M
shed_primrunq_mload     64     64     64     0      4095M
sidle_S1runq_mload      64     64     64     0      4095M
     sidle_S2runq_mload
sidle_S2runq_mload      134    134    134    0      4095M                D
     sidle_S1runq_mload
     sidle_S3runq_mload
sidle_S3runq_mload      134    134    134    0      4095M
     sidle_S2runq_mload
     sidle_S4runq_mload
sidle_S4runq_mload      4095M  4095M  4095M  0      4095M                D
     sidle_S3runq_mload
slock_spinb4confer      1K     1K     1K     0      2G-1
smt_snooze_delay        0      0      0      -1     97656K  microsecs    D
smtrunq_load_diff       2      2      2      1      4095M                D
tb_balance_S0           0      0      0      0      2       ticks        D
tb_balance_S1           2      2      2      0      2       ticks
tb_threshold            100    100    100    10     1000    ticks        D
timeslice               1      1      1      0      2G-1    clock ticks  D
unboost_inflih          1      1      1      0      1       boolean      D
v_exempt_secs           2      2      2      0      2G-1    seconds      D
v_min_process           2      2      2      0      2G-1    processes    D
v_repage_hi             0      0      0      0      2G-1
v_repage_proc           4      4      4      0      2G-1
v_sec_wait              1      1      1      0      2G-1    seconds
vpm_xvcpus              0      0      0      -1     2G-1    processors   D
--------------------------------------------------------------------------------
n/a means parameter not supported by the current platform or kernel
Parameter types:
S = Static: cannot be changed
D = Dynamic: can be freely changed
B = Bosboot: can only be changed using bosboot and reboot
R = Reboot: can only be changed during reboot
C = Connect: changes are only effective for future socket connections
M = Mount: changes are only effective for future mountings
I = Incremental: can only be incremented
d = deprecated: deprecated and cannot be changed
Value conventions:
K = Kilo: 2^10    G = Giga: 2^30    P = Peta: 2^50
M = Mega: 2^20    T = Tera: 2^40    E = Exa: 2^60
You can also display these parameters using the –a flag, although the information given is far less meaningful.
sched_R and sched_D
The sched_R and sched_D scheduling parameters relate to process priority
calculations. The scheduler’s priority calculations are based on sched_R
and sched_D, whose values are expressed in units of 1/32 (thirty-seconds). I won’t
bore you here with the complex algorithms associated with these parameters. The net of it is that lowering sched_R has the effect of helping the
scheduler distinguish between background processes and processes running as interactive foreground processes, thereby enabling it to assign a
greater priority to foreground processes. The following example lowers
sched_R from its default value of 16 to 5:
# schedo -o sched_R=5
Setting sched_R to 5
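For readers who do want to see the arithmetic, here is a minimal sketch of the priority calculation as I understand it from IBM's performance documentation. Treat the formula, the truncation of the fractional part, and the function name aix_priority as assumptions and not as the kernel's actual code. It takes the NI and C values that ps -l reports, plus sched_R:

```shell
# Sketch of the AIX thread-priority calculation (an assumption based on IBM's
# performance documentation, not on kernel source).
#   $1 = nice value as shown in the NI column of ps -l (default is 20)
#   $2 = recent CPU usage as shown in the C column of ps -l
#   $3 = sched_R (optional; defaults to 16)
aix_priority() {
    awk -v ni="$1" -v c="$2" -v r="${3:-16}" 'BEGIN {
        p_nice = 40 + ni                                   # base priority of 40 plus nice
        x_nice = (p_nice > 60) ? 2 * p_nice - 60 : p_nice  # nice above default counts double
        penalty = c * r / 32 * (x_nice + 4) / 64           # CPU penalty, weighted by sched_R
        print int(x_nice + penalty)                        # fractional part dropped (assumed)
    }'
}

aix_priority 20 0      # default process, no recent CPU use: 60
aix_priority 30 3      # niced shell child with C=3: 81
aix_priority 30 3 5    # same thread with sched_R lowered to 5: 80
```

Under these assumptions the function reproduces the PRI values in the ps -l listings earlier in this chapter (60 for NI 20, 80 and 81 for NI 30, 50 for NI 10). It also shows why a smaller sched_R helps interactive work: the CPU penalty shrinks, so recent CPU use costs a thread less priority.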
fixed_pri_global
When a CPU is ready to dispatch a thread, the system checks the global run
queue before any of the others. When the thread completes its running slice
on the CPU, it gets put back on the queue, which helps maintain something
called processor affinity. Processor affinity is defined as the probability of
dispatching a thread to a processor that previously executed it. To improve
overall thread performance, you can enable an environment variable called
RT_GRQ, which is set to off by default. Turning on RT_GRQ automatically
places the thread on the global run queue. All fixed-priority threads will be
placed on the global run queue if you change fixed_pri_global from its default of 0 to 1.
Let’s use schedo to change the default value of fixed_pri_global:
# schedo -a | grep fixed_pri_global
fixed_pri_global = 0
# schedo -o fixed_pri_global=1
Setting fixed_pri_global to 1
# schedo -a | grep fixed_pri_global
fixed_pri_global = 1
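If you check tunables from scripts, grep can match more names than you intended. The following sketch reads the "name = value" lines that schedo -a prints and returns the value for an exact tunable name; the function name tunable_value is hypothetical:

```shell
# Hypothetical helper: read "name = value" lines on stdin (the format shown
# in the schedo -a output above) and print the value of one tunable.
tunable_value() {
    awk -v name="$1" '$1 == name && $2 == "=" { print $3 }'
}
```

On an AIX system you would call it as `schedo -a | tunable_value fixed_pri_global`, which makes it easy to compare the current setting against the value you expect before and after a change.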
The actual priority of the user processes varies over time, depending on the
amount of overall CPU time that a process has used most recently. Please
note that in some instances, fixed_pri_global should be turned off because it can
impact SMT performance. Make sure that you test this in your environment to determine what works best for your application.
timeslice
Perhaps the most important schedo parameter is timeslice. This setting
represents the largest number of clock ticks that a thread can be in control
of the CPU before facing the possibility of being replaced by another thread. In
some cases, increasing the timeslice can improve system throughput by
reducing context switching.
Before changing the timeslice setting, make sure you run vmstat (or sar)
enough to determine whether there really is a considerable amount of
context switching going on. If there is, the overhead of dispatching threads
is more costly than letting them run for longer slices.
The following example increases the timeslice from 1 to 2:
# schedo -p -o timeslice=2
Setting timeslice to 2 in nextboot file
Setting timeslice to 2
In this case, we’ve also used the –p flag, which applies the change immediately
and also saves it in the nextboot file so that it persists across reboots.
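To judge whether context switching is high enough to justify a larger timeslice, it helps to average the cs column over a number of vmstat samples. The helper below is a sketch; the name avg_column is hypothetical, and the default column number assumes the AIX vmstat layout in which cs is the thirteenth field:

```shell
# Hypothetical helper: average one numeric column of vmstat output read from
# stdin. Header lines are skipped because their first field is not a number.
# The default column (13) assumes the AIX layout
# "r b avm fre re pi po fr sr cy in sy cs ...", where cs is field 13.
avg_column() {
    awk -v col="${1:-13}" '
        $1 ~ /^[0-9]+$/ { sum += $col; n++ }
        END { if (n > 0) printf "%.1f\n", sum / n }'
}
```

For example, `vmstat 2 30 | avg_column` would average context switches over one minute. Compare the figure before and after the timeslice change, on the same workload, before deciding to keep the new setting.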
bindprocessor
bindprocessor { -q|-u ProcessID|-s SmtSetID|-b BindId|ProcessID
[ProcessorNum] }
CPU binding lets processes run on a specific processor, a capability that
relates to the processor affinity concept I defined earlier. Binding threads
to specific processors has many purposes; for example, you might bind
threads to a given processor to find the root cause of a hanging program.
More commonly, the technique is used when you’re trying to spread
around the wealth of a system — in a symmetric multiprocessing (SMP)
box, for example. To display the available (logical) processors on your
box, you would use the –q flag:
# bindprocessor -q
The available processors are: 0 1 2
Assuming that simultaneous multithreading (SMT) is enabled (it is by default), each and every hardware thread of the physical processor is listed
as a separate processor when you run the bindprocessor command. On
POWER5 chips, two hardware threads exist on each processor. With
shared processor logical partitions (LPARs), using this command binds
to virtual CPUs, so you must be careful because problems can result for
applications that are predisposed to run on a specific CPU. If you want to
bind a process to a particular CPU, it’s as simple as running this command:
# bindprocessor 12769 3
This example assigns process ID (PID) 12769 to logical CPU 3.
smtctl
smtctl [ -m off|on [ -w boot|now ] ]
The smtctl command (introduced in AIX 5.3) controls SMT mode and displays SMT information.
To determine whether SMT is enabled, you simply run the command without any flags:
# smtctl
This system is SMT capable
SMT is currently enabled
SMT boot mode is not set
SMT threads are bound to the same virtual processor.
proc0 has 2 SMT thread
Bind processor 0 is bound with proc
Bind processor 1 is bound with proc
proc2 has 2 SMT threads.
Bind processor 2 is bound with proc2
Bind processor 3 is bound with proc2
System performance usually increases about 30 percent when SMT is
enabled, so you almost always want to activate this functionality. Processor affinity also occurs naturally. When a thread is running on a CPU and is
interrupted, it usually is placed back on the same CPU because the processor’s cache might still have lines belonging to the thread. If the thread were
to be dispatched to a different CPU, it might have to obtain information
from RAM, which would slow processing time dramatically. You can also
bind threads using subroutines, although I advise caution if you attempt
to do so. This technique binds all kernel threads in a process to a processor, which has the effect of forcing these threads to be run on that specific
processor until they are unbound.
gprof
/usr/ccs/bin/gprof [-b] [ -c [filename] ] [-e Name] [-E Name] [-f
Name] [-g filename] [-i filename] [-p filename] [-F Name] [-PathName]
[-s] [-x [filename]] [-z] [a.out [gmon.out ...]]
The gprof command, used during programming, produces an execution
profile of your compiled programs in C, Fortran, Pascal, or even Cobol.
The command reports on flow control through all the subroutines of your
program and tells you the amount of CPU time each subroutine consumed.
This information is useful when you’re troubleshooting how processes use
CPU resources. You can use gprof to profile your program and determine
which functions are using the CPU. The profile data is taken from the call
graph profile file (gmon.out by default).
AIX 5.3 lets you assign a user-specified name to the profiling output files
by setting special environment variables. Version 5.3 also provides additional profiling support for threads and new options that affect the type of
profiling data collected.
Section II
Summary, Tips, and Quiz
Summary
● CPU monitoring tools you can use include iostat, lparstat, mpstat,
nmon, sar, topas, vmstat, and w.
● Tracing tools include curt, splat, tprof, trace, and trcrpt.
● The nice and renice commands are important utilities that can help you
prioritize your processes and threads.
● schedo is the command used to manage the CPU scheduler’s tunable
parameters.
● The smtctl command is used for simultaneous multithreading (SMT).
● ps is an extremely versatile command that can help you identify process
hogs, thread utilization, and nice priorities.
● lparstat and mpstat are important performance tools you should use in
a partitioned environment.
Tips
● Identifying workload is of paramount importance to improving CPU utilization. Running jobs and processes during off-peak hours, using cron
and/or other third-party types of scheduling tools (e.g., IBM’s Workload
Manager, CA’s AutoSys) can make a big difference in performance.
● Users usually will assume that your system’s bottleneck is with the CPU,
but more often than not the problem is either memory- or I/O-related.
Tune a subsystem only when you’re certain of the diagnosis.
● Before making any changes to production systems, make the changes
to either your test or development environment first so you can analyze
their effect. This advice is particularly important when using the schedo
command for AIX CPU tuning.
● Once you’ve determined that you’re experiencing a CPU bottleneck,
adding CPUs is always an option. With dynamic logical partitioning
(DLPAR), this solution is much easier to accomplish than it used to be
because you can just add or subtract CPUs dynamically. Tools such as
the DLPAR toolset and Partition Load Manager (PLM) can automate
the process, letting you add or subtract CPUs to or from your partition
based on variables you’ve already identified. Uncapping partitions in a
virtual environment can also alleviate CPU bottlenecks.
● Using nmon, ps, tprof, or any number of other tools, you might have
identified processes that are hogging CPU time. If you question whether
these processes are necessary, try contacting the process owners (if possible). You may find out that you can kill the processes. If you’re told
you can do so, be sure to kill them using kill -1 and not kill -9. Also, be
careful about zombie processes that can be created when you kill parent
processes and leave their children alone. It may not sound proper, but
make sure the children are dead, too; otherwise, you’re at risk for runaway and/or zombie processes.
● Starting with the POWER5, SMT is built into the POWER architecture.
This capability provides two independent threads of instruction execution for each processor. Enabling SMT makes one processor appear as
two processors on the partition. Always make sure SMT is enabled (by
running the smtctl command), except where an ISV explicitly states
that it is not recommended. SMT’s performance gain depends on many
variables, each of which you should analyze carefully. SMT is best suited
for multithreaded, I/O-intensive applications. It is not a good fit
for numerically intensive workloads.
● Use tools such as nmon (and the nmon analyzer) or topas to store
historical performance data for trending and analysis. Don’t wait to use
these tools until you have a problem. You should be using them when
you are first in production with your system.
● Other IBM utilities are available that don’t come standard with AIX and
have a cost associated with them, including:
❍ Performance Toolbox (PTX), which as of AIX 5.3 includes the
procmon utility (used for process management)
❍ IBM Tivoli Monitoring System Edition for System p for AIX 5L V5
❍ PM for System p, an IBM Global Technology Services offering
Quiz
Multiple Choice
1. Which iostat flag reports AIO information?
a. –A
b. –a
c. –v
d. –e
2. Which topas flag reports all partitioned information in your managed
system?
a. –c
b. –L
c. –j
d. –p
3. Which tool is best used to monitor performance numbers for all logical
CPUs on a partitioned system?
a. iostat
b. lparstat
c. mpstat
d. lvms
4. Which ps flag reports thread information?
a. –a
b. –u
c. –ef
d. –mo
5. Which of the following is not an example of a trace tool?
a. lprof
b. tprof
c. curt
d. splat
6. timex –s reports the total execution time of the program as well as the
a. Number of referenced inodes
b. Percentage of blocked processes
c. Number of threads
d. Summary of systems activity
7. Given the following results of the vmstat command, are you experiencing a CPU bottleneck?
# vmstat 2
System configuration: lcpu=4 mem=3072MB ent=0.
kthr     memory            page            faults            cpu
----- ------------- ----------------- ----------- ---------------------
 r  b    avm    fre re pi po fr sr cy  in  sy  cs us sy id wa   pc   ec
 1  4 128826 641397  0  0  0  0  0  0 448  87 138 24  1 40 35 0.01  2.8
 1  7 128826 641397  0  0  0  0  0  0 385  10 136 35 14 20 31 0.01  2.2
 2  7 128826 641397  0  0  0  0  0  0 381  13 138 35  4 20 41 0.01  2.2
 3  4 128826 641397  0  0  0  0  0  0 364  40 138 40 17 16 27 0.01   2.
a. Yes
b. No
c. Maybe
d. Not enough information
True or False
8. With nice, the larger the number, the lower the priority.
9. Lowering the schedo parameter sched_R has the effect of giving a higher preference to foreground processes than to background
processes.
Fill in the Blank(s)
10. Define processor affinity:
__________________________________________
Section III
Memory
This section provides an overview of the AIX Virtual Memory Manager
and other important memory-related concepts, including how to monitor
and tune your virtual memory. We also discuss best practices for virtual
memory monitoring, analysis, and tuning, given the various considerations
that can impact performance.
Chapter 7
Memory: Introduction
What, exactly, is involved in memory performance tuning? As a systems
administrator, you’re probably already familiar with the basics of memory,
such as the differences between physical and virtual memory. What we’ll
be discussing here is how the Virtual Memory Manager (VMM) works in
AIX and how it relates to overall systems performance. We’ll also review
some of the more important recent enhancements.
Let me reiterate that regardless of which subsystem you want to tune, you
should always think of the process as an ongoing one. Start monitoring
your system as soon as you put it into production and have it running well,
rather than when users are screaming about slow performance. Review
Chapter 1 on tuning methodology. I’m not saying that you must follow that
specific methodology, but without a plan, you won’t succeed in optimizing
the performance of your environments. Further, be sure to make only one
change at a time unless otherwise noted (as when changing related parameters, such as minperm% and maxperm%). In addition, capture and analyze
data as quickly as possible after making a change to determine what difference, if any, the change has really made.
Virtual Memory Manager
AIX newbies are sometimes surprised to hear that the Virtual Memory
Manager (VMM) services all memory requests from the system, not
just virtual memory. When the system accesses random access memory
(RAM), the VMM needs to allocate space, even when plenty of physical
memory is left on the system. It implements a process of early allocation
of paging space. Using this method, the VMM plays a vital role in helping
manage real memory, not just virtual memory. In AIX, all virtual memory
segments are partitioned into pages, with a default page size of 4K. Because virtual memory consists of real memory and paging space, allocated
virtual memory segments can be either RAM or paging space (virtual
memory stored on disk).
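Because most AIX memory statistics are reported in pages and not in megabytes, it helps to be able to convert between the two. The helpers below are a hypothetical sketch that assumes the default 4 KB page size mentioned above, with an optional second argument for other page sizes:

```shell
# Hypothetical helpers: convert between page counts and megabytes.
# The second argument is the page size in KB (default 4, the AIX default).
pages_to_mb() {
    echo $(( $1 * ${2:-4} / 1024 ))
}

mb_to_pages() {
    echo $(( $1 * 1024 / ${2:-4} ))
}

mb_to_pages 3072      # a 3072 MB partition holds 786432 4 KB pages
pages_to_mb 641397    # 641397 free pages is about 2505 MB (rounded down)
```

The two sample figures come from the quiz in Section II, where vmstat reports mem=3072MB and a fre value of 641397 pages.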
This is an important concept to understand, so read that last paragraph at
least twice.
VMM also maintains what is referred to as a free list, which is defined as
unallocated page frames. These are used to satisfy page faults. The VMM
usually keeps only a small number of page frames unallocated (a number you
can configure); when the list runs low, it frees up space by reassigning page frames. The VMM then
selects the virtual memory pages (whose page frames are to be reassigned)
using its page replacement algorithm. The paging algorithm determines
which virtual memory pages currently in RAM ultimately have their page
frames brought back to the free list. AIX uses all available memory, except
that which is configured to be unallocated — the free list.
To reiterate, the purpose of VMM is to manage the allocation of both
RAM and virtual pages. VMM’s objectives are to help minimize both the
response time of page faults and the use of virtual memory where it can.
Given the choice between RAM and paging space, the preference is to use
physical memory — if the RAM is available.
VMM also classifies virtual memory segments into two distinct categories, which are critical for you to understand. This concept is the most
important to grasp, and I’ll admit that when I first started working with
AIX, it took me a while to fully understand the concept and the tuning recommendations (which we’ll discuss later) behind it. The two
categories are working segments using computational memory and persistent segments using file memory. Simply put, without fully grasping
these concepts, you won’t be able to tune your systems to their optimum
capabilities.
Computational Memory
Computational memory is used while your processes are actually working
on computing information. These working segments are temporary (transitory) and exist only up until the time a process terminates or the page is
stolen. They have no real permanent disk storage location. When a process
terminates, both the physical and paging spaces are released in many cases.
When a large spike occurs in available pages, you can actually see this
happening while monitoring your system.
In the world of virtual memory, when free physical memory starts getting
low, programs that have not been used recently are moved from RAM to
paging space to help release physical memory for more real work. Remember, virtual memory consists of real and paging space; it is not just paging
space. The most important point to remember about computational memory is that when the system pages, you do not want it to page out computational memory; your preference is file memory.
File Memory
File memory (unlike computational memory) uses persistent segments (not
working segments), and it has a permanent storage location on the disk.
Data files or executable programs are mapped to persistent segments rather
than to working segments. The data files can relate to file systems, such as
the Journaled File System (JFS), Enhanced Journaled File System (JFS2),
or Network File System (NFS). These files remain in memory until the
file system is unmounted, a page is stolen, or the file is unlinked. After a
data file is copied into RAM, VMM controls when these pages are overwritten or used to store other data. Given the alternative, you would much
rather have file memory paged to disk than computational memory.
Paging and Swapping
When a process references a page on disk, the page must be paged in,
which could cause other pages to page out again. VMM is constantly
working in the background, stealing frames that have not been recently
referenced using the page replacement algorithm. It also helps detect
thrashing, which can occur when memory is extremely low and pages are
constantly being paged in and out to support processing. VMM actually
has a memory load control algorithm, which can detect whether the system
is thrashing and actually tries to remedy the situation. Unabashed thrashing
can literally cause a system to come to a standstill, as the kernel becomes
so concerned with making room for pages that it just can’t do anything
productive.
What about swapping? Although the terms are often used interchangeably, there is a subtle difference between paging and swapping. As we’ve
discussed, with paging, only parts of the process are moved back and forth
between disk and RAM. When swapping occurs, you are moving entire
processes back and forth. For this to happen, AIX would need to suspend
the entire process before moving it to paging space. It could then only continue to process when the process was swapped back into RAM at a later
time. The difference that is not subtle is this: while paging is often okay,
swapping is a very bad thing.
VMM Tuning Evolution
Before AIX 5L, you would have used the vmtune command to tune your
VMM system. Although the vmo command came around in AIX 5.2,
vmtune actually hung around until AIX 5.3. With AIX 5.3, vmtune is no
more. Although most of the actual parameters are the same (and remain
the same in AIX 6.1), there are some fundamental changes in the recommended tuning parameters. (AIX 5.3 also does away with the schedtune
command, whose function is now performed by schedo.)
One important change in AIX 5L relates to page frames. Starting with the
POWER4 processor, AIX supported up to 16MB page sizes. The POWER5
chip supports four virtual memory page sizes: 4K, 64K, 16 MB, and 16
GB. With a simple vmo change here that reflects these sizes, you can actually tune the system to provide for large page usage, which can improve
system performance substantially in very memory-intensive applications.
The recommendations for the minperm and maxperm settings have also
changed substantially. Furthermore, starting with AIX 5.2, we no longer
save our tunables in rc.tune but in /etc/tunables.
Chapter 8
Memory: Monitoring
As with CPU monitoring, the AIX systems administrator has a myriad
tools at his or her disposal when tuning the Virtual Memory Manager
(VMM). Some of the tools are Unix-generic, while others are AIX-specific.
We’ll discuss these tools in the context of real performance issues and what
you can do to address them. IBM enhanced the following tools in AIX
5.3 to allow more accurate statistics on shared partitions using APV: sar,
topas, and vmstat.
Suppose that while you’re surfing the Internet and enjoying your coffee,
one of the DBAs knocks on the side of your cubicle (why Unix administrators never get an office, I’ll never know) and informs you that “We have
a real memory problem.” Although your first reaction might be to dismiss
the suggestion entirely (do you tell the DBA that the indexes need rebuilding?), the first thing I would do is ask why the person came to this conclusion. The more information you have at your disposal, the more effective
you’ll be in your efforts to resolve the alleged bottleneck.
More than likely, the DBA used a graphical tool such as nmon or topas that
indicated that free real memory was low. This is a common event. However,
one of the biggest incorrect assumptions is concluding that you have a
memory problem because free real memory is low. On the contrary, we want
free real memory to be low, because that means we’ve sized the system
properly.
So, where do we begin to troubleshoot this issue? If you’ve read the CPU
monitoring chapter, you’ll know that I like to start with vmstat.
vmstat (Unix-generic)
vmstat [-fsviItlw] [[-p|-P] pagesize|ALL] [Drives] [Interval [Count]]
In Chapter 6, we used vmstat to monitor CPU. In this chapter, we’ll look
at how to use this command for virtual memory analysis, which was actually the intended purpose of the tool (remember the “vm” in vmstat).
Here’s a summary of the relevant output fields:
● r: Average number of runnable kernel threads over a sampling
  interval, which you specify when running the command. “Runnable”
  includes threads that are ready but are waiting to run as well as those
  that are already running. I start to become concerned when this number
  is three or four times greater than the number of processors on
  the system.
● b: Average number of kernel threads placed in the VMM wait
  queue that are waiting on I/O. This is an extremely important field; if
  these numbers are higher than r (runnable processes), that is usually
  symptomatic of I/O problems. Watch this field very carefully!
● avm: Contrary to what most people think, this field does not report
  the average memory. It shows the number of active virtual pages,
  the sum of virtual and real memory pages (remember this concept).
  Each page is 4,096 bytes.
● fre: Size of the free list. It’s important to note that you shouldn’t
  concern yourself too much if these numbers look really low, because
  a large part of RAM is used as a cache for file system data. Applications
  people will always point out this field to you and say, “There is
  no more memory left on the system.” If no bottlenecks are occurring,
  it is just not a problem.
● re: Pager input/output list
● pi: Number of pages paged in from paging space. This field becomes
  populated when there are lots of processes starting up, which
  can occur during a CPU or memory bottleneck.
● po: Number of pages paged out to paging space. If the numbers are
  high here, paging is occurring, which can certainly signify a memory
  bottleneck.
● fr: Number of pages freed (page replacement)
● sr: Number of pages scanned by the page replacement algorithm
● cy: Number of clock cycles executed by the page replacement
  algorithm
● in: Device interrupts
● sy: System calls
● cs: Kernel thread context switches
The following output is a snapshot of a very well-behaved system. This
system is easily handling the number of runnable processes; there are no
blocked processes, no paging going on, nor any waiting on I/O. I love this
system.
# vmstat 2 5
System configuration: lcpu=4 mem=3072MB ent=0.40

kthr     memory              page              faults              cpu
----- ------------- ---------------------- -------------- ---------------------
 r  b    avm    fre re  pi  po  fr  sr  cy  in    sy   cs us sy id wa   pc   ec
 3  0 173838 576044  0   0   0   0   0   0 365    87  144  0  1 99  0 0.01  2.4
 3  0 173837 576045  0   0   0   0   0   0 297    13  149  0  1 99  0 0.01  1.9
 3  0 173838 576044  0   0   0   0   0   0 329    37  143  0  1 99  0 0.01  2.2
 3  0 173838 576044  0   0   0   0   0   0 337    10  143  0  1 99  0 0.01  2.0
 3  0 173838 576044  0   0   0   0   0   0 364    13  143  0  1 99  0 0.01  2.1
Here is a snapshot of the system that the DBA was looking at:
# vmstat 2 5
System configuration: lcpu=4 mem=3072MB ent=0.40

kthr     memory              page              faults              cpu
----- ------------- ---------------------- -------------- ---------------------
 r  b    avm    fre re  pi  po  fr  sr  cy  in    sy   cs us sy id wa   pc   ec
 4 19 173838    123  0   9  92 104 208 417 365 12001 6004 12  4 30 54   .4  2.4
 9 36 173837    567  0   7  45  53 109 229 297 17002 8124 21  9 12 58   .8  1.9
 2 19 173838    191  0  22 127 140 287 567 329 18229 9374 41 23  2 34   .5  2.1
At first glance, it appears that there are memory problems. In this case,
we’re looking at the free (fre) list because there is paging going on. If there
were not, I wouldn’t even give the low numbers a second glance. Oftentimes, one bottleneck will be the cause of another.
In this case, it appears that significant I/O problems are causing other
bottlenecks to occur. There are many blocked processes (b); and the wait
time (wa) in the CPU section is also extremely high. Preliminary analysis
shows us that the system just cannot keep up with the workload. The CPU
can’t work hard because of the I/O problems. The paging is occurring
because of the excessive I/O, which appears to have also caused a memory
bottleneck.
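The checks I just walked through by eye can be sketched as a small filter. Treat this as a minimal sketch rather than a finished tool: the function name vmstat_triage is hypothetical, the thresholds come from the rules of thumb in this chapter (the 25 percent wa cutoff is purely my assumption for illustration), and the field positions assume the 19-column layout shown in the snapshots above.

```shell
# Hypothetical helper: flag vmstat data rows using the rules of thumb
# from this chapter. Assumes the column order
#   r b avm fre re pi po fr sr cy in sy cs us sy id wa pc ec
# Usage on AIX: vmstat 2 5 | vmstat_triage 4    (4 = logical CPU count)
vmstat_triage() {
    awk -v lcpu="$1" '
        # Data rows start with a number; headers and rulers are skipped.
        $1 ~ /^[0-9]+$/ && NF >= 17 {
            msg = ""
            if ($1 > 4 * lcpu) msg = msg " run-queue"   # r far above CPU count
            if ($2 > $1)       msg = msg " blocked>r"   # more blocked than runnable
            if ($7 > 0)        msg = msg " page-outs"   # po: paging to paging space
            if ($17 > 25)      msg = msg " io-wait"     # wa cutoff is an assumption
            if (msg == "") msg = " ok"
            print "r=" $1 " b=" $2 " po=" $7 " wa=" $17 ":" msg
        }'
}
```

Fed the first row of the DBA's snapshot, the filter reports blocked threads, page-outs, and I/O wait together, which is the same combination that pointed us toward an I/O problem dragging memory down with it.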
Let’s change just a few numbers around. What do we see now?
kthr     memory              page              faults              cpu
----- ------------- ---------------------- -------------- ---------------------
 r  b    avm    fre re  pi  po  fr  sr  cy  in    sy   cs us sy id wa   pc   ec
 4  2 173838    123  0   9  92 104 208 417 365 12001 6004 81 19  0  0   .4  2.4
 9  1 173837    567  0   7  45  53 109 229 297 17002 8124 74 25  1  0   .8  1.9
 9  1 173838    191  0  22 127 140 287 567 329 18229 9374 41 23  2  0   .5  2.1
Clearly, there are no I/O problems to speak of; the wait times and blocked
processes are not there. The CPU obviously is running hot and heavy. Do
we have a CPU bottleneck? Sure we do, because the CPU is running at
almost 100 percent busy. But what is causing that bottleneck? Because
excessive paging is going on and the numbers in the fre column are low, I would
guess that in this case a memory bottleneck is causing the CPU bottleneck, not the reverse. In fact, this snapshot could have been taken after we
fixed the I/O bottleneck in the previous snapshot. Remember, fixing one
bottleneck often causes others, but that’s okay; it’s just part of the circle of
tuning. In any case, the system is at least processing data here, where in the
prior example, it was just stuck in the mud. If we can tune the memory accordingly here, the CPU bottleneck may break through, or it may continue.
In the latter event, we might have to throw more iron at the box or manage
our workload more efficiently.
Virtual Memory Summary
You’ll be interested to know that AIX 5.3 introduced a new vmstat flag,
-v, that summarizes overall virtual memory statistics:
# vmstat -v
               786432 memory pages
               748478 lruable pages
               574790 free pages
                    5 memory pools
               110858 pinned pages
                 80.0 maxpin percentage
                 20.0 minperm percentage
                 80.0 maxperm percentage
                  4.4 numperm percentage
                33524 file pages
                  0.0 compressed percentage
                    0 compressed pages
                  4.4 numclient percentage
                 80.0 maxclient percentage
                33524 client pages
                    0 remote pageouts scheduled
                    0 pending disk I/Os blocked with no pbuf
                    0 paging space I/Os blocked with no psbuf
                 2484 filesystem I/Os blocked with no fsbuf
                    0 client filesystem I/Os blocked with no fsbuf
                    0 external pager filesystem I/Os blocked with no fsbuf
                    0 Virtualized Partition Memory Page Faults
                 0.00 Time resolving virtualized partition memory page faults
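When you only care about one or two of these counters, it can help to pull them out by label. The following is a minimal sketch under some assumptions: vmstat_v_value is a hypothetical helper name, and it assumes every line has the "number, then label" shape shown in the listing above.

```shell
# Hypothetical helper: print the number in front of a given label in
# "vmstat -v" style output read from standard input.
# Usage on AIX: vmstat -v | vmstat_v_value "numperm percentage"
vmstat_v_value() {
    awk -v label="$1" '
        {
            value = $1
            # Rebuild the label from the remaining fields.
            rest = ""
            for (i = 2; i <= NF; i++) rest = rest (i > 2 ? " " : "") $i
            if (rest == label) { print value; exit }
        }'
}
```

Matching on the full label rather than a substring avoids confusing, say, "compressed percentage" with "compressed pages".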
sar (Unix-generic)
sar { -A [-M] | [-a][-b][-c][-d][-k][-m][-q][-r][-u][-v][-w][-y][-M] }
[-s hh[:mm[:ss]]] [-e hh[:mm[:ss]]] [-P processor_id[,...] | ALL]
[-f file] [-X file] [-i seconds] [-o file] [interval [number]]
Let’s turn our attention now to the sar command and try using it to examine
data that can impact virtual memory performance. In the following view,
we’ll use the -rm flags, which enable us to view paging statistics (-r)
and message and semaphore information (-m). The output reports the
following fields:
● cycle/s: Number of page replacement cycles per second
● fault/s: Number of page faults per second
● slots: Number of free pages on the paging spaces
● odio/s: Number of non-paging disk I/Os per second
● msg/s: Number of Interprocess Communication (IPC) message
  primitives
● sema/s: Number of IPC semaphore primitives
# sar -rm 1 5
AIX lpar30p682e_pub 3 5 00CED82E4C00    12/30/07
System configuration: lcpu=4 mem=3072MB ent=0.40 mode=Uncapped

15:21:14   slots cycle/s fault/s  odio/s
           msg/s  sema/s
15:21:15  392354    0.00   44.00    0.00
            0.00    0.00
15:21:16  392354    0.00    3.00    0.00
            0.00    0.00
15:21:17  392354    0.00    0.00    0.00
            0.00    0.00
15:21:18  392354    0.00    0.00    0.00
            0.00    0.00
15:21:19  392354    0.00    0.00    0.00
            0.00    0.00

Average   392354       0       9       0
Average     0.00    0.00
The preceding example shows a lot of page faults per second, but not much
else. We can also see that there are 392,354 4K pages available on the
paging space, which comes out to about 1.5 GB of available paging space.
We can validate this number by running the lsps command, which reports
the same result:
# lsps -a
Page Space  Physical Volume  Volume Group    Size  %Used  Active  Auto  Type
hd6         hdisk0           rootvg        1536MB      1     yes   yes    lv
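The page-to-megabyte arithmetic used above is easy to check for yourself. Here is a minimal sketch, assuming 4 KB pages throughout; pages_to_mb is a hypothetical helper name, and it truncates to whole megabytes.

```shell
# Convert a count of 4 KB pages to whole megabytes (integer division).
# Assumes the 4 KB page size used by the sar "slots" column in this example.
pages_to_mb() {
    echo $(( $1 * 4 / 1024 ))
}

pages_to_mb 392354    # free paging-space slots from sar: prints 1532
pages_to_mb 786432    # "memory pages" from vmstat -v: prints 3072
```

The first result is the "about 1.5 GB" of free paging space, sitting just under the 1536MB that lsps reports for hd6 with 1 percent used. The second agrees with the mem=3072MB shown in the System configuration line.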
lsps (AIX-specific)
lsps {-s | [-c | -l] {-a | Psname | -t {lv|nfs} } }
The lsps command provides paging space statistics. This very important
command should definitely be part of your repertoire. Besides the -a view
illustrated above, one additional view is useful: the -s flag.
It’s important to note that the -a flag reports only paging space that is being
used, while -s provides a summary of all paging space allocated, including
early page space allocation. We’ll discuss the various page space allocation
policies in Chapter 11, when we dig into page space tuning.
We’ve looked at various snapshots and seen some semblance of memory
problems. Where do we go from here? Let’s try to identify some memory
hog processes. If you recall, we previously looked at using ps commands
to identify CPU hogs. It just takes another flag to identify the memory hog.
(As I noted earlier, ps is one versatile command.)
ps (Unix-generic)
ps [-ANPaedfklmMZ] [-n namelist] [-F Format] [-o specifier]
[=header],... [-p proclist][-G|-g grouplist] [-t termlist]
[-U|-u userlist] [-c classlist] [ -T pid] [-L pidlist]
ps [aceglnsuvwxU] [t tty] [processnumber]
For our purposes here, we’ll use ps gv, the second usage shown above,
which is based on the Berkeley method:
# ps gv | head -n 1; ps gv | egrep -v "RSS" | sort +6b -7 -n -r
   PID TTY STAT  TIME PGIN  SIZE   RSS LIM TSIZ  TRS %CPU %MEM COMMAND
319648   - A     0:00  119  6576  6784  xx  201  208  0.0  1.0 /usr/sbin/rsct/bin/IBM.ERr
401612   - A     0:06   86  2284  2664  xx  828  380  0.0  0.0 /usr/sbin/IBM.CSMAgentRM
336046   - A     0:03  374  1672  2160  xx  522  488  0.0  0.0 /usr/sbin/rmcd -a
106552   - A     0:00    0  2048  2048  xx    0       0.0  0.0 j2p
286880   - A     0:41   22  1956 32768  xx   60   33  0.0  0.0 /usr/bin
188568   - A     0:00   25  1772  1920  xx  116  148  0.0  0.0 /usr/sb
                                                               IBM.LPCommands -r
Let me briefly identify what some of this information means:
● SIZE: The amount of paging space allocated for the process (text
  and data).
● RSS: The amount of RAM used for the text and data segments of
  the process (in kilobytes). Note that PID 286880 is using 32,768K.
● TRS: The amount of RAM used for the text segment of the process
  (in kilobytes).
● %MEM: The amount of memory the process uses, as a percentage of
  total RAM. Watch for processes whose %MEM value is 40 to 70 percent.
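One caution about the pipeline shown earlier: the sort +6b -7 key syntax is the historical form, and sort implementations on other systems may reject it. A minimal sketch of the same idea using the -k key syntax follows. The helper name top_rss is hypothetical, and the sketch assumes RSS is the seventh whitespace-separated field, which holds only when every row has a value in the TTY and STAT columns.

```shell
# Hypothetical helper: sort "ps gv" style rows by RSS (7th field),
# largest first. Reads rows on standard input and drops the header
# line (any line containing "RSS").
# Usage on AIX: ps gv | top_rss 5
top_rss() {
    grep -v 'RSS' | sort -k7,7nr | head -n "${1:-10}"
}
```

The optional argument limits how many rows come back, so you see only the biggest consumers of real memory rather than the whole process table.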
The ps command provides a lot of useful information, but I don’t usually
start with it unless one of my trusted administrators has already diagnosed
that a memory issue of some kind exists on the system. Although ps has
helped us identify some of the processes, it’s really time to call in our
cleanup hitter, svmon.
svmon (AIX-specific)
svmon [-G [-i Intvl [NumIntvl] ][-z] ]
svmon [-P [pid1...pidn] [-r] [-u|-p|-g|-v] [-ns] [-wfc] [-q [s|m|L|S]]
[-t Count] [ -i Intvl [NumIntvl] ] [-l] [-j] [-z] [-m] ]
svmon [-S [sid1...sidn] [-r] [-u|-p|-g|-v] [-ns] [-wfc] [-q [s|m|L|S]]
[-t Count] [ -i Intvl [NumIntvl] ] [-l] [-j] [-z] [-m] ]
svmon [-D sid1...sidn [-b] [-q [s|m|L|S]] [-i Intvl [NumIntvl] ][-z]]
svmon [-F [fr1...frn] [-q [s|m|L|S]] [-i Intvl [NumIntvl] ][-z] ]
svmon [-C cmd1...cmdn [-r] [-u|-p|-g|-v] [-ns] [-wfc] [-q [s|m|L|S]]
[-t Count] [ -i Intvl [NumIntvl] ] [-d] [-l] [-j] [-z] [-m] ]
svmon [-U [lognm1...lognmn] [-r] [-u|-p|-g|-v] [-ns] [-wfc] [-t Count]
[ -i Intvl [NumIntvl] ] [-d] [-l] [-j] [-z] [-m] ]
svmon [-W [class1...classn] [-e] [-r] [-u|-p|-g|-v] [-ns] [-wfc]
[-q [s|m|L|S]] [-t Count] [ -i Intvl [NumIntvl] ] [-l] [-j] [-z]
[-m] ]
svmon [-T [tier1...tiern] [-a superclass] [-x] [-e] [-r] [-u|-p|-g|-v]
[-ns] [-wfc] [-q [s|m|L|S]] [-t Count] [ -i Intvl [NumIntvl] ] [-l]
[-z] [-m] ]
From the usage alone, it’s clear how much you can do with the svmon utility. You use svmon specifically for VMM. It provides a potpourri of information about the current state of memory and really helps you drill down
and determine which processes, users, programs, and segments consume
the most virtual (real and paging) memory.
The statistics themselves are based on 4K pages, including real, virtual,
and paging space memory used. The -G flag gives you a global view of
memory utilization on your host:
# svmon -G
               size      inuse       free        pin    virtual
memory       786432     211735     574697     110862     174420
pg space     393216        863

               work       pers       clnt
pin          110862          0          0
in use       174420          0       3731

PageSize   PoolSize      inuse       pgsp        pin    virtual
s   4 KB          -     183863        863      97342     146548
m  64 KB          -       1742          0        845       1742
Let’s look at the first part of the data:
● size: The size of real memory frames, or simply real memory
  (including any frames that may have been reduced by using the rmss
  command, which we discuss in the next chapter)
● inuse: The number of frames containing actual pages; pages in
  RAM in use by processes plus persistent pages that belonged to a
  terminated process and remain resident in RAM
● free: The number of pages on the free list
● pin: The number of pages pinned in physical memory (RAM),
  which cannot be paged out
● virtual: The number of pages allocated in the virtual space
The next section of the output provides statistics about pin and inuse memory. The pin entry here specifies statistics about the subset of real memory
containing pinned pages, while inuse provides statistics about the subset of
all real memory in use. The information includes:
● work: Number of frames containing working segment pages
● pers: Number of frames containing persistent segment pages
● clnt: Number of frames containing client segment pages
The third and final section provides individual statistics per page size
(where alternative page sizes are available):
● PageSize: Page size
● PoolSize: Number of pages in pool
● inuse: Number of pages of this size that are used
● pgsp: Number of pages allocated to paging space
● pin: Number of pinned pages of this size
● virtual: Number of pages of this size that are allocated in the
  system virtual space
To gain a better understanding of what is going on, you can correlate some
of the svmon fields to vmstat. In this case, the svmon free field matches
up with the vmstat fre column, and the svmon virtual field matches the
vmstat avm column. The net is that while svmon provides more overall
information about memory, vmstat gives you more overall systems
information. Let’s look at both:
# svmon
               size      inuse       free        pin    virtual
memory       786432     211735     574697     110862     174420
pg space     393216        863

               work       pers       clnt
pin          110862          0          0
in use       174420          0       3731

PageSize   PoolSize      inuse       pgsp        pin    virtual
s   4 KB          -     183863        863      97342     146548
m  64 KB          -       1742          0        845       1742

# vmstat
System configuration: lcpu=4 mem=3072MB ent=0.4

kthr     memory              page              faults              cpu
----- ------------- ---------------------- -------------- ---------------------
 r  b    avm    fre re  pi  po  fr  sr  cy  in    sy   cs us sy id wa   pc   ec
 1  0 174418 574699  0   0   0   0   0   0 439   333  159  1  2 97  0 0.02  4.0
 1  0 174418 574699  0   0   0   0   0   0 452    20  146  0  2 98  0 0.01  2.7
In addition to the global view, you can create eight other types of reports
using svmon: user, command, class, tier, process, segment, detailed segment, and frame. I won’t review each one here, but I strongly recommend
that you check out all these views to see how each one can assist you.
Memory Leak
Let’s look at one more way to use svmon. Memory leaks can be a big
problem on a system. A memory leak is any program or process that keeps
on allocating more memory and does not release it. This situation can
cause real memory to be used up extremely quickly and, in a worst-case
scenario, can even precipitate a system crash by causing the system to run
out of paging space. I’m not ashamed to admit that this has happened to
me. In fact, before I knew about svmon, I saw it happening before my eyes
and couldn’t stop it because I wasn’t certain what was causing it!
To identify the cause of memory leaks, you first need to identify the
processes that are using up the most memory. Here is one way to do this:
# svmon -uP -t 5 | grep -p Pi
-------------------------------------------------------------------------------
     Pid Command          Inuse      Pin     Pgsp  Virtual 64-bit Mthrd  16MB
  286880 xmwlm            21074     7802        0    20859      N     N     N
  319648 IBM.ERrmd        20666     7815        0    20532      N     Y
  336046 rmcd             19919     7805        0    19276      N     Y
  413902 IBM.ServiceRM    19680     7818        0    19242      N     Y
  401612 IBM.CSMAgentR    19623     7816        0    19462      N     Y
For this purpose, we’ve used the following flags with svmon:
● -u: Specifies that the displayed information be sorted in decreasing
  order, thereby displaying the top offender first
● -P: Displays process information
● -t: Indicates the number of processes to display
After identifying the process you’re most concerned about (let’s assume
it’s the top offending process), you can track it further to make sure that
neither the working nor the kernel segments are increasing rapidly. You can
use svmon similarly to vmstat, with a counter. To illustrate, we’ll set the
counter to run for two intervals, with five-second iterations. The resulting
output confirms that we are not having problems:
# svmon -P 286880 -i 5 2
-------------------------------------------------------------------------------
     Pid Command          Inuse      Pin     Pgsp  Virtual 64-bit Mthrd  16MB
  286880 xmwlm            20769     7802        0    20559      N     N     N

     PageSize      Inuse        Pin       Pgsp    Virtual
     s   4 KB      13809       7802          0      13599
     m  64 KB        435          0          0        435

    Vsid      Esid Type Description              PSize  Inuse   Pin Pgsp Virtual
       0         0 work kernel segment               s  11584  7799    0   11584
   330ad         d work shared library text          m    435     0    0     435
   6425d         c work shared memory segment        s   1480     0    0    1480
   541f1         2 work process private              s    444     3    0     444
   50250         - clnt /dev/hd4:921                 s    194     0    -       -
   6825e         f work shared library data          s     91     0    0      91
   2426d         1 clnt code,/dev/hd2:152455         s     15     0    -       -
   6025c         - clnt /dev/hd2:41407               s      1     0    -       -
-------------------------------------------------------------------------------
     Pid Command          Inuse      Pin     Pgsp  Virtual 64-bit Mthrd  16MB
  286880 xmwlm            20769     7802        0    20559      N     N     N

     PageSize      Inuse        Pin       Pgsp    Virtual
     s   4 KB      13809       7802          0      13599
     m  64 KB        435          0          0        435

    Vsid      Esid Type Description              PSize  Inuse   Pin Pgsp Virtual
       0         0 work kernel segment               s  11584  7799    0   11584
   330ad         d work shared library text          m    435     0    0     435
   6425d         c work shared memory segment        s   1480     0    0    1480
   541f1         2 work process private              s    444     3    0     444
   50250         - clnt /dev/hd4:921                 s    194     0    -       -
   6825e         f work shared library data          s     91     0    0      91
   2426d         1 clnt code,/dev/hd2:152455         s     15     0    -       -
   6025c         - clnt /dev/hd2:41407               s      1     0    -       -
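Comparing samples by eye works for two intervals, but a leak usually shows itself over many. The following minimal sketch checks whether a series of Virtual page counts rises with every sample. The helper name leak_trend is hypothetical, and it assumes you have already extracted one Virtual value per line from successive svmon -P samples.

```shell
# Hypothetical helper: read one "Virtual" page count per line (for
# example, values copied from successive svmon -P samples) and report
# whether every sample is larger than the one before it.
leak_trend() {
    awk '
        NR == 1 { steady = 1 }
        NR > 1 && ($1 + 0) <= prev { steady = 0 }
        { prev = $1 + 0 }
        END {
            if (NR >= 2 && steady)
                print "growing"
            else
                print "not steadily growing"
        }'
}
```

For the two xmwlm samples above, both of which report a Virtual value of 20559, the sketch reports "not steadily growing", which matches the conclusion that this process is not leaking. A single growing sample proves little; it is the sustained climb across many intervals that deserves your attention.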
Chapter 9
Memory: Tuning
In this chapter, I identify and show you how to tune your virtual memory
subsystem. In contrast to other subsystems, there is a lot you can do to
improve performance from a virtual memory perspective.
Before we get started, let me again state that, unless instructed otherwise,
you should change only one parameter at a time. If you make multiple
changes, you won’t know precisely what caused the impact on performance. This point is particularly relevant to virtual memory.
vmo
vmo -h [tunable] | {-L [tunable]} | {-x [tunable]}
vmo [-p|-r] (-a | {-o tunable})
vmo [-p|-r] (-D | ({-d tunable} {-o tunable=value}))
Let us assume that we’re running an Oracle online transaction processing
(OLTP) application and we’ve determined from some vmstat output that
the system is paging. We’ve also looked at nmon data, which helped us
reach the same conclusion. What can we do to improve the situation? This
is where the vmo command comes into play.
You will probably use vmo more than any other tunable command because
it is with virtual memory that you have the greatest ability to positively
affect performance by changing parameters. The vmo command provides
a staggering 61 tunables in AIX 5.3. (The situation changes a bit in AIX
6.1 with the introduction of restricted parameters, which permit changes
but make it a little more difficult to get into trouble.) I won’t describe each
vmo parameter here, but I will go through the key ones as we try to tune
our memory subsystem.
minperm, maxperm, maxclient, and lru_file_repage
Perhaps the most important concepts that relate to tuning revolve
around our prior discussions about working and persistent storage. We
definitely want the Virtual Memory Manager (VMM) to favor working
storage, meaning that we don’t want AIX to page working storage. What
we really want is for the system to favor the caching that the database
(Oracle in this case) uses. The way to do this is to set the vmo command’s
maxperm parameter to a high enough value while also making certain that
the lru_file_repage parameter is set correctly. Here’s a description of the
involved parameters:
● minperm%: The point below which the page stealer algorithm will
  steal file or computational pages, regardless of repaging rates
● maxperm%: The point above which the page stealer will steal only
  file pages
● maxclient%: The maximum percentage of RAM that can be used to
  cache client pages
● lru_file_repage: Setting this value to 0 (off) allows AIX to free only
  file cache memory (provided numperm is greater than minperm and
  VMM can steal enough memory to satisfy demand), virtually guaranteeing
  that working storage remains in memory
Background
Arguably, the most important vmo settings are minperm% and
maxperm%. Setting these parameters appropriately will ensure that your
system is tuned to favor either computational memory or file memory.
In most cases, you don’t want to page working segments, because doing so will cause your system to page unnecessarily and will decrease
performance.
First, some background and history. The way things used to work was
actually much simpler. If the percentage of memory occupied by file pages
(numperm, a statistic that vmstat -v reports rather than a vmo parameter)
was greater than maxperm, the page replacement algorithm would steal only
file pages. When the number of file pages fell below minperm, both file
and computational pages could be stolen. If the number fell between the
minimum and maximum values, the page replacement algorithm would steal
only file pages, unless the file repage rate was greater than the
computational repage rate. In other words, if your numperm was greater
than maxperm, you would start to steal from persistent storage.
Based on this methodology, the old approach to tuning minperm and maxperm
was to set maxperm to a low number, much lower than its default value
of 80, and set minperm to less than or equal to 10. This is how we
normally would have tuned our database server. Don’t do this anymore!
Starting with AIX 5.2 Maintenance Level 5 (ML5) and AIX 5.3 ML2, the
rules have changed.
A New Approach
The new approach is to set maxperm to a very high value — higher than
its default (80) — and to make sure lru_file_repage is set to 0. IBM introduced the lru_file_repage parameter in AIX 5.2 with ML4 and in AIX 5.3
with ML1. The lru_file_repage value indicates whether the VMM repage
counts should be considered and what type of memory should be stolen.
The default setting is 1 (it becomes 0 in AIX 6.1), so we need to change
it to 0 to have the VMM steal file pages rather than computational pages.
This technique solves the old problem of having to limit JFS2 file cache to
guarantee memory for applications such as Oracle.
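To make the rules concrete, here is a minimal sketch of the decision as this section describes it. It is my reading of the behavior, not AIX source code: the function name steal_target and its argument order are hypothetical, percentages must be whole numbers, and the repage arguments stand in for the file and computational repage counts.

```shell
# Sketch of the page-steal choice described in this section (not AIX code).
# Arguments: numperm% minperm% maxperm% lru_file_repage file_repages comp_repages
# Prints which kind of page the page stealer would target.
steal_target() {
    numperm=$1 minperm=$2 maxperm=$3 lru_file_repage=$4
    file_repages=$5 comp_repages=$6

    if [ "$numperm" -lt "$minperm" ]; then
        echo "file and computational"   # below minperm: repage rates ignored
    elif [ "$numperm" -gt "$maxperm" ]; then
        echo "file only"                # above maxperm: only file pages
    elif [ "$lru_file_repage" -eq 0 ]; then
        echo "file only"                # new approach: protect working storage
    elif [ "$file_repages" -gt "$comp_repages" ]; then
        echo "file and computational"   # file pages are being repaged more often
    else
        echo "file only"
    fi
}
```

With numperm at 50 percent between limits of 20 and 80, the old default of lru_file_repage=1 lets a busy file cache push computational pages out, while lru_file_repage=0 keeps the stealing confined to file pages. That difference is the whole reason for the new approach.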
Let’s not lose sight of the fact that the primary reason you need to tune
lru_file_repage is because you want to protect the computational memory
— that is, process memory, kernel memory, and shared memory, which
includes Oracle’s System Global Area (SGA). Because Oracle uses its own
cache, using AIX file caching for this purpose only causes confusion, so
we want to stop it. In this scenario, if you were to reduce maxperm, you’d
be making the mistake of stopping the application caching programs that
are running. You’d also be permitting lrud, the kernel process responsible
for stealing memory when required, to do more work than necessary.
You should always be tracking your numperm, something you can do
using nmon or topas or from the command line using vmstat (with the -v
flag). If you leave the lru_file_repage default of 1, VMM will continue to
use the computational and noncomputational repage counts (defined at the
top) in determining whether to steal computational or file memory.
Here are the recommendations for configuring the other parameters we’ve
discussed:
vmo -p -o minperm%=5
vmo -p -o maxperm%=90
vmo -p -o maxclient%=90
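Because I keep insisting that you change one parameter at a time, it can be useful to print each change before you commit to it. The following is a minimal sketch of a dry-run wrapper. The function name set_tunable and the APPLY variable are hypothetical, and the lru_file_repage line simply restates the setting recommended earlier in this section.

```shell
# Hypothetical dry-run wrapper: print the vmo command instead of running
# it, so each change can be reviewed and applied one at a time.
# Set APPLY=1 to execute for real (requires root on an AIX system).
set_tunable() {
    cmd="vmo -p -o $1=$2"
    if [ "${APPLY:-0}" -eq 1 ]; then
        $cmd
    else
        echo "$cmd"
    fi
}

set_tunable lru_file_repage 0
set_tunable minperm% 5
set_tunable maxperm% 90
set_tunable maxclient% 90
```

Run as shown, the script only prints the four commands. Measure after each real change before moving on to the next one.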
In AIX 6.1, IBM has changed the default parameter values to reflect these
common recommendations, so you’ll have less to do in that release. You
should also leave strict_maxperm and strict_maxclient at their default
values. We used to change these settings, but we don’t need to anymore.
Changing them to 1 (the old approach) places a hard limit on the amount of
memory that can be used for persistent file cache by making the maxperm
value the upper limit for the cache. These days, this step is unnecessary.
Changing lru_file_repage is a far more effective way of tuning when we
prefer that AIX file caching not be used at all.
minfree and maxfree
Two other important vmo parameters worth noting here are minfree and
maxfree. These values set the lower and upper limits of the free list, which
keeps track of the real memory frames released:
● minfree: Specifies the minimum number of frames on the free list,
  at which point the VMM will start to steal pages to replenish it
● maxfree: Specifies the number of frames on the free list at which
  page stealing is to stop
If the number of pages on your free list falls below the minfree value, the
VMM starts to steal pages (just to add to the free list), which is not good.
It will continue to do this until the free list contains at least the number of
pages specified in the maxfree parameter.
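The start and stop behavior just described is a simple hysteresis, which the following minimal sketch models. It is an illustration rather than AIX code: the function name stealing_state is hypothetical, and the caller supplies all four values.

```shell
# Sketch of the minfree/maxfree hysteresis described above (not AIX code).
# Arguments: free_frames minfree maxfree currently_stealing(0|1)
# Prints 1 if page stealing should be active after this check, else 0.
stealing_state() {
    free=$1 minfree=$2 maxfree=$3 stealing=$4
    if [ "$free" -lt "$minfree" ]; then
        echo 1              # free list too short: start (or keep) stealing
    elif [ "$free" -ge "$maxfree" ]; then
        echo 0              # free list replenished: stop stealing
    else
        echo "$stealing"    # between the limits: carry on as before
    fi
}
```

The gap between the two values matters. With illustrative settings of 120 and 128, a free list of 124 frames means "keep stealing" if the VMM was already stealing and "do nothing" if it was not, so a very narrow gap makes the VMM start and stop constantly.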
You want to keep your free list higher, because you want the VMM to always
get the page frames it needs from the free list and you don’t want your
processes to be killed if the minfree value is reached. I remember when
the defaults used to be 120, and if I hadn’t raised the values, users would
nag me, saying no memory was left on the system. You also don’t want the
system to experience excessive I/O because it’s always stealing pages to
expand the free list. The default values now depend on the physical memory of the system. maxfree equals the lesser of the number of memory pages
divided by 128, or 128. These values are the sum of all memory pools. The
maxfree value should also be greater than or equal to maxpgahead.
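To make the thresholds concrete, here is a minimal shell sketch of the two rules just described: the maxfree default quoted above and the steal/stop behavior around minfree and maxfree. The function names and the sample values (960 and 1088) are illustrative assumptions, not output from vmo.

```shell
# Default maxfree per the rule quoted in the text:
# the lesser of (memory pages / 128) and 128.
default_maxfree() {
    pages=$1
    candidate=$((pages / 128))
    if [ "$candidate" -lt 128 ]; then
        echo "$candidate"
    else
        echo 128
    fi
}

# Describe what the VMM does for a given free-list size.
# Args: free frames, minfree, maxfree (hypothetical sample values below).
free_list_state() {
    fre=$1; minfree=$2; maxfree=$3
    if [ "$fre" -lt "$minfree" ]; then
        echo "stealing"      # below minfree: page stealing starts
    elif [ "$fre" -ge "$maxfree" ]; then
        echo "replenished"   # at or above maxfree: page stealing stops
    else
        echo "between"       # stealing continues only if it had already started
    fi
}

default_maxfree 786432          # 3 GB expressed in 4 KB pages
free_list_state 123 960 1088
```

On a live system you would feed the second function the fre column from vmstat and the minfree and maxfree values reported by vmo -a.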
Page Space Allocation
AIX provides three different modes of paging space allocation: deferred
page space allocation (DPSA), late page space allocation (LPSA), and
early page space allocation (EPSA). The default policy is deferred page
space allocation. DPSA works by delaying the allocation of paging space
until the time when it is necessary to page out the page. This approach
ensures that there is no wasted paging space, an important component of
demand paging. In fact, when you have a large amount of RAM, you may
actually never even use any of your paging space.
Here is an example:
# lsps -a
Page Space  Physical Volume  Volume Group  Size    %Used  Active  Auto  Type
hd6         hdisk0           rootvg        1536MB      1     yes
Only 1 percent of paging space is used here. Let’s view how AIX is presently handling paging space allocation:
# vmo -a | grep defps
defps = 1
The preceding output shows that the default method, DPSA, is being used.
To disable this policy, you would set the defps parameter to 0. This value
would cause the LPSA policy to be used. LPSA causes paging disk blocks
not to be allocated until the corresponding pages in RAM are touched. This
method is usually intended for environments where optimum performance
is more important than reliability, because in this scenario it’s possible for
a program to fail due to lack of memory.
The EPSA policy is usually used when you want to make sure that processes won’t be killed because of low paging conditions. EPSA ensures
this by preallocating paging space. This is the opposite end of the spectrum from LPSA. EPSA is used in environments where reliability rules.
To turn on EPSA, you set the PSALLOC environment variable to early
(PSALLOC=early).
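The three policies boil down to two switches: the defps tunable and the PSALLOC environment variable. The following sketch only names the policy that results from a given combination; the function name and the ./critical_app placeholder are assumptions for illustration.

```shell
# Name the paging space allocation policy a process would run under.
# Args: value of the vmo defps tunable, value of PSALLOC (may be empty).
paging_policy() {
    defps=$1
    psalloc=${2:-}
    if [ "$psalloc" = "early" ]; then
        echo "EPSA"     # PSALLOC=early preallocates paging space
    elif [ "$defps" = "1" ]; then
        echo "DPSA"     # the default, deferred allocation
    else
        echo "LPSA"     # defps=0 selects late allocation
    fi
}

paging_policy 1
paging_policy 0
paging_policy 1 early

# Hypothetical usage: run a single program under EPSA without touching
# the system-wide setting (./critical_app is a placeholder name).
# PSALLOC=early ./critical_app
```

Because PSALLOC is an environment variable, it can be set for one process tree rather than for the whole system.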
You should also be aware of the garbage collection feature introduced in
AIX 5.3. Garbage collection frees up paging-space disk blocks, letting you
configure less paging space than you ordinarily would need to. This feature
is available only for the default DPSA policy.
How Much Paging Space?
How much paging space do you need on your system? What is the rule of
thumb?
To determine the answer, start with the folks who own your applications.
For example, your DB2 or Oracle teams should be able to tell you how
much paging space needs to be allocated on the system from a database
perspective. If yours is a small shop, you’ll have to do the research on
your own. Be careful, though. Database administrators usually like to
request the highest number of everything and might instruct you to double
the amount of paging space versus your RAM (an older rule of thumb).
Generally speaking, if a system has less than 4 GB of RAM, I usually like
to create a one-to-one ratio of paging space versus RAM. If it has 8 GB or
higher, I set my paging space to as little as half the size of RAM.
Monitor the system frequently after going live. If you see that you’re
never really approaching 50 percent of paging space utilization, don’t
add the space. A quick look at the recent Oracle for AIX documentation
confirms this principle; it recommends that the initial setting for paging
space be half the size of RAM plus 4 GB, with an upper limit of 32 GB.
The documentation further suggests monitoring space with the lsps –a
command and not worrying unless the utilization is more than 25 percent
on the system.
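The two sizing rules above are easy to turn into arithmetic. This is a rough sketch, not a sizing tool: the function names are made up, and because the text gives no figure for systems between 4 GB and 8 GB, the sketch assumes a one-to-one ratio for that range.

```shell
# My rule of thumb. Arg: RAM in MB. Prints paging space in MB.
# Assumption: RAM between 4 GB and 8 GB is treated as 1:1, since the
# text gives no figure for that range.
rule_of_thumb_mb() {
    ram_mb=$1
    if [ "$ram_mb" -ge 8192 ]; then
        echo $((ram_mb / 2))     # 8 GB or more: as little as half of RAM
    else
        echo "$ram_mb"           # smaller systems: one-to-one
    fi
}

# Oracle starting point quoted above: half of RAM plus 4 GB,
# with an upper limit of 32 GB.
oracle_initial_mb() {
    ram_mb=$1
    size=$((ram_mb / 2 + 4096))
    cap=32768
    if [ "$size" -gt "$cap" ]; then
        size=$cap
    fi
    echo "$size"
}

rule_of_thumb_mb 3072      # prints 3072
rule_of_thumb_mb 16384     # prints 8192
oracle_initial_mb 16384    # prints 12288
oracle_initial_mb 131072   # prints 32768
```

Treat either number as a starting point and let monitoring with lsps decide whether more space is ever needed.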
Adding space that you won’t use gives you absolutely nothing extra. I’m
often asked how one can tell whether a process is using paging space. Let’s
go back to the svmon command for a moment. Here is how you do it.
First, use the ps command to identify a process you want to view. Then,
use svmon as follows:
# svmon -P | grep -p 286880
-------------------------------------------------------------------------------
     Pid Command          Inuse      Pin     Pgsp  Virtual 64-bit Mthrd  16MB
  286880 xmwlm            21009     7802        0    20925      N     N     N
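The Pgsp column is the one that answers the question; a value of 0, as for xmwlm here, means the process has no pages in paging space. If you check this often, a small filter saves some squinting. The sketch below assumes the column order of the svmon summary line shown above (Pid, Command, Inuse, Pin, Pgsp, ...); the function name is my own.

```shell
# Print the Pgsp column (pages held in paging space) for one PID.
# Reads svmon -P style summary lines on stdin; assumes Pgsp is the
# fifth column, as in the sample output.
pgsp_for_pid() {
    awk -v pid="$1" '$1 == pid { print $5 }'
}

# Sample input copied from the output above.
printf '%s\n' \
    'Pid Command Inuse Pin Pgsp Virtual 64-bit Mthrd 16MB' \
    '286880 xmwlm 21009 7802 0 20925 N N N' | pgsp_for_pid 286880
```

On a live system you would pipe the real svmon output into the function instead of the sample lines.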
Paging Space Tuning
When your free list is really low and you’re paging incessantly, your
system will start to release processes to avoid thrashing. It will even kill
processes if sufficient paging space is not available. To prevent this from
happening, you can tune these three vmo values:
● npskill: This parameter specifies the number of free paging space pages at which AIX starts killing (SIGKILL) processes.
● npswarn: This parameter specifies the number of free paging space pages at which AIX starts sending warnings (SIGDANGER) to processes.
● nokillroot: Setting this parameter to 1 prevents processes owned by root from being killed when parameter npskill has started to take effect.
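The relationship between the first two thresholds can be sketched as a simple comparison. The function name and the sample threshold values are hypothetical; read the real values from vmo -a on your own system.

```shell
# What AIX does at a given amount of free paging space.
# Args: free paging space pages, npswarn, npskill.
paging_space_action() {
    free=$1; npswarn=$2; npskill=$3
    if [ "$free" -le "$npskill" ]; then
        echo "SIGKILL"       # kill threshold reached
    elif [ "$free" -le "$npswarn" ]; then
        echo "SIGDANGER"     # warning threshold reached
    else
        echo "none"
    fi
}

# Hypothetical thresholds: npswarn=4096, npskill=1024.
paging_space_action 5000 4096 1024
paging_space_action 4000 4096 1024
paging_space_action 1000 4096 1024
```

For the warning to be of any use, npswarn has to be larger than npskill, so that processes hear about the problem before the killing starts.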
Thrashing and Load Control
Thrashing is what occurs when memory resources are so overloaded that
the system is in a state of utter exhaustion. To be more specific, the system is constantly paging in and out whole processes in a futile attempt to
process data, which it can’t properly do because of the excessive paging
operations.
Using the CPU tuning command schedo, you can affect the criteria used
to determine thrashing by tuning the VMM load control facility, which
further helps protect an overloaded system from thrashing. More than
anything, load control is meant to help straighten out infrequent spikes in
load. Let’s look at some schedo parameters you can adjust to specify the
thresholds for the algorithm that controls memory load control:
● v_exempt_secs: Defines the period of time (in elapsed seconds) that a reactivated suspended process is exempt from suspension.
● v_min_process: Defines the number of active processes that can be run and waiting for page I/O.
● v_repage_hi: Controls the threshold for memory overcommitment. If this threshold is exceeded, load control will try to suspend processes.
● v_repage_proc: Determines whether the process is eligible for suspension. This value is further used to set a threshold for the number of repages and the number of page faults that the process has accumulated in the past second.
● v_sec_wait: Defines the number of intervals for which the po/fr fraction can remain below v_repage_hi before suspended processes are reactivated. Here, po is the number of pages written to paging space in the last second, and fr is the number of page steals occurring during that time.
After tuning these values and playing around with some of these settings,
you can always reset them to their defaults using the schedo –D command.
Memory Scanning and lrubucket
The vmo command’s lrubucket parameter indicates the number of memory frames per bucket. On systems with multiple memory pools, the parameter’s setting is per memory pool. Tuning this value can help you reduce
scanning overhead on systems that have a large amount of memory.
This point has to do with how the page replacement algorithm works. The
algorithm’s role is to scan and look for free frames — to be used for new
pages or for page replacement. With larger systems, because there are so
many frames to scan, memory is divvied into buckets of frames. The larger
the bucket, the fewer the frames that must be scanned.
The following example increases the bucket to 2 GB (you specify the value
in 4K pages):
# vmo -o lrubucket=524288
Setting lrubucket to 524288
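Because the value is given in 4K pages, it helps to check the conversion before typing it in. A minimal sketch (the function name is my own):

```shell
# Convert a bucket size in GB to a count of 4 KB page frames.
gb_to_4k_frames() {
    gb=$1
    echo $((gb * 1024 * 1024 / 4))   # GB -> KB, then divide by 4 KB per frame
}

gb_to_4k_frames 2   # prints 524288, the value used above
```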
rmss
rmss [-s startmemsize] [-f finalmemsize] [-d deltamemsize]
[-n numiterations] [-o outputfile] command
rmss -c memsize
rmss -r
rmss -p
Before the POWER4’s Hypervisor gave folks access to
dynamic logical partitioning (DLPAR) of memory, the rmss command was
the only tool you could use for capacity planning as it related to memory.
It is still the only tool that lets you reduce available memory without either
physically removing RAM from your box or performing a DLPAR operation to reduce RAM.
Although rmss isn’t a performance-tuning tool in the strict sense of the
word, it is an invaluable aid that you should use when sizing systems.
Most administrators are just throwing RAM in the garbage because they
choose not to care whether their systems require it — often for fear they’ll
be blamed by application folks for not providing the excessive amount of
memory requested. Using rmss, you can quickly subtract memory (and
just as quickly add it) to determine how your application reacts.
First, let’s see how much memory we have on the box:
# lsattr -El mem0
goodsize 3072 Amount of usable physical memory in Mbytes False
size     3072 Total amount of physical memory in Mbytes  False
Now, let’s use rmss to view the current memory size:
# rmss -p
simulated memory size is 3072 Mb.
Let’s change it:
# rmss -c 2048
Simulated memory size changed to 2048 Mb.
The system still sees 3 GB of physical memory.
When you’re ready, you can restore the real memory size:
# rmss -r
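When sizing a system, you typically step the simulated memory down in stages and rerun your workload at each stage. The sketch below only prints the rmss commands for such a sweep (a dry run), so you can review them before running anything as root; the function name and the step sizes are illustrative assumptions.

```shell
# Print the rmss commands for stepping simulated memory down from
# start_mb to final_mb in steps of delta_mb. Dry run only: nothing
# is executed.
rmss_steps() {
    size=$1; final=$2; delta=$3
    if [ "$delta" -le 0 ]; then
        echo "delta must be positive" >&2
        return 1
    fi
    while [ "$size" -ge "$final" ]; do
        echo "rmss -c $size"     # run your workload after each change
        size=$((size - delta))
    done
    echo "rmss -r"               # always finish by restoring real memory
}

rmss_steps 3072 2048 512
```

The -s, -f, and -d flags in the syntax summary above ask rmss itself to perform this kind of sweep around a single command.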
Section III
Summary, Tips, and Quiz
Summary
● The Virtual Memory Manager (VMM) services all memory requests on the system, not just virtual memory.
● Working segments use computational memory, and persistent segments use file memory. When paging, you prefer that the system does not page out working/computational memory because this is the working storage for processes that are currently executing.
● File memory uses persistent storage and has a permanent location on the disk.
● Paging is a normal condition of AIX, due to its tight integration with the VMM and AIX’s implementation of demand paging.
● Data is constantly shuffled back and forth between paging space and RAM because the kernel loads only a few pages at a time into memory.
● The vmtune and schedtune commands are no more, replaced in AIX 5L by vmo and schedo and eliminated completely in AIX 5.3.
● Starting with AIX 5.2, tunables are saved in /etc/tunables. Before this release, they were saved in rc.tune.
● vmstat -v (-v is a new flag) provides a summary of all virtual memory statistics.
● Thrashing is a condition that occurs when virtual memory resources are overloaded and the free list is abnormally low. This condition can cause entire processes to be swapped out to disk and can even cause a system to crash, if the paging space fills up.
● Memory leaks occur when a process keeps on allocating more memory without releasing it. The svmon command can help find these leaks.
● Memory monitoring tools you should use include lsps, nmon, ps, sar, svmon, topas, and vmstat.
● vmo is the primary tuning tool used to manage the virtual memory tunable parameters. You use schedo to tune the VMM load control facility, which helps protect an overloaded system from thrashing.
● AIX provides three different modes of paging space allocation: deferred page space allocation (DPSA), late page space allocation (LPSA), and early page space allocation (EPSA). The default policy is DPSA.
Tips
● With systems having so much more memory than back in the day, the ratios for paging space recommendations are much lower than ever. Just because your DBA tells you he or she needs a 1:1 (or greater) ratio of physical to paging space doesn’t mean you have to provide it.
● Even Oracle now recommends that the initial paging space setting be half the size of RAM plus 4 GB with an upper limit of 32 GB. It is much easier to add paging space than delete it, and it’s easy enough to determine whether your system uses a lot of paging space. If you do your job properly, you won’t have to over-architect your paging space. Having said that, you should always check with your ISV for recommendations before deploying your paging strategy.
● Remember the new minperm% and maxperm% tuning recommendations (starting with AIX 5.3 ML2) to favor computational memory over working persistent storage. IBM kernel engineers came out with these new recommendations for a reason. Don’t forget to also set lru_file_repage to 0; otherwise, you’ll defeat the purpose of the new recommendations, and your system will be slower, not faster!
● If you want to save your tuning changes on a reboot, make sure you save them to /etc/tunables (there is no more rc.tune). The -p flag on the vmo command will take care of this.
● You can tune an extraordinary number of parameters with vmo, more than for any other subsystem. However, “don’t touch that dial” unless you fully understand what the parameters mean. And when you do tune parameters, test your changes in a staging or development environment before rolling them out in production, and remember to implement only one change at a time.
● Learn the svmon command. Most system administrators are stubborn mules and will use only the same tools they’ve been using for decades. svmon is easily the best memory analysis tool out there today; take the time to learn how to use it.
● Don’t wait for your system to start thrashing before you look at the free list using vmstat or other utilities. A thrashing system can lead to a crashing system and is one of the worst things that can happen to you as a systems administrator.
● Tuning the lrubucket parameter can help you reduce scanning overhead on systems that have a large amount of memory. In most cases, you’ll do fine if you at least double the default.
● The rmss command can help with memory capacity planning. It lets you temporarily reduce the amount of RAM without having to either physically reduce memory or run a DLPAR operation.
● Just as identifying workload is of paramount importance to improving CPU utilization, it can also be important when managing batch jobs that may accumulate a lot of virtual memory. Don’t be afraid to use your 24-hour day, particularly if you see excessive paging.
● Similar to CPUs, adding RAM is always an option if you’ve determined that you’re experiencing a memory bottleneck. All it takes is a simple DLPAR operation. The task is much easier than it used to be; you can just add or subtract RAM dynamically. Tools such as the DLPAR toolset and the Partition Load Manager (PLM) can automate this process. One caveat here: As of AIX 6.1, PLM is no more. If you’re looking at using uncapped partitions to do this, sorry: that solution works only with CPUs, not RAM.
Quiz
Multiple Choice
1. Which vmstat flag reports summary information?
a. –c
b. –t
c. –a
d. –v
2. Which sar flag reports paging statistics?
a. –a
b. –g
c. –c
d. –d
3. What command summarizes the amount of paging space on your
system?
a. lsps -a
b. svmon
c. vmstat
d. stat
4. Which ps flag reports memory information?
a. gv
b. ux
c. –e
d. –m
Use the following output to answer Questions 5 through 7:
# vmstat 2 5
System configuration: lcpu=4 mem=3072MB ent=0.40
kthr     memory              page                  faults           cpu
----- ------------ ------------------------ ---------------- -----------------
 r  b    avm  fre  re  pi  po  fr  sr  cy   in    sy   cs us sy id wa  pc  ec
 4  3 173838  123   0   9  92 104 208 417  365 12001 6004 69  4 20  7  .8 2.4
 9  6 173837  567   0   7  45  59 109 229  297 17002 8124 51  9 12 28   1 2.9
 9  3 173838  191   0  22 127 187 287 567  329 18229 9374 71 23  2  4  .5 2.1
5. Given the preceding information, are you experiencing a RAM bottleneck?
a. Yes
b. No
c. Maybe
d. Not enough information
6. Which of the following would be an acceptable next action after you
come up with your analysis?
a. Notify the DBA team.
b. Do a vmstat –v to look at a summary of your memory and paging
statistics.
c. Tune the vmo command’s minperm and maxperm setting.
d. Run a trace.
7. It is not unusual to see multiple bottlenecks on a system. Does it appear
that you are having either a CPU or an I/O problem?
a. Yes
b. No
c. Maybe
d. Not enough information
True or False
8. Computational memory is made up of working segments and is transitory.
9. The Virtual Memory Manager (VMM) manages all memory requests,
including physical RAM, not just virtual memory.
Fill in the Blank(s)
10. Define a memory leak:
_________________________________________________
Section IV
Disk I/O
This section gives you an overview of disk management on AIX, including how to monitor and tune your disk I/O subsystem. We also discuss best
practices for disk placement, file system management, optimum hardware
configuration, and concepts such as direct and concurrent I/O, and asynchronous I/O (AIO).
Chapter 10
Disk I/O: Introduction
What, exactly, is involved in tuning your disk subsystem? Tuning disk is
a little trickier than tuning your CPU or virtual memory subsystem. One
important reason is that you can do more to optimize throughput during the initial configuration of your I/O devices than you can ever do with
tuning. It’s simply much easier to move things around during the initial
build-out of your environment than to re-architect production.
Furthermore, understand that the slowest operation for running programs
is the time spent on actually retrieving your data from disk. This activity
involves the physical disk as well as its logical components, such as the
Logical Volume Manager (LVM). All the tuning in the world will do little
if you have a poorly architected subsystem. Let’s look at the I/O stack,
which is depicted in Figure 10.1.
[Figure 10.1 shows the I/O stack as layers, listed here from top to bottom with the commands and options the figure places beside each layer:
Application: open(), close(), read(), write(); async, sync, and other options for both open and read/write (I/O to a file)
Filesystem: mount, unmount; mount options that affect the I/O (dio, cio, rbr, rbrw, rbw); crfs, chfs, mkfs, logform, mount, mknfsexp, chnfsexp, showmount, cfsadmin (I/O to blocks in a filesystem)
VMM: ioo, vmo, vmtune (file I/O to filesystem cache)
LVM: mkvg, extendvg, mklv, importvg, exportvg, chvg, cplv, mklvcopy, mirrorvg, migratepv, varyonvg (block I/O to logical disk)
Device drivers: lsattr, chdev, mkdev, rmdev (block I/O to a physical disk)
Disk subsystem: software varies
Disk
Disk I/O flows from top to bottom; physical disk layout flows from bottom to top.]
Figure 10.1: I/O stack
The figure clearly shows the tight integration between physical components as they relate to both the logical disk and its application I/O. When
you configure your disk, you should work from the ground up. Start with
the physical system and then move to the device layers, logical volumes,
file systems, files, and applications. The physical component is crucial.
Configuring this component involves determining the amount of disk, type
(speed), size, and throughput.
One important challenge to note with storage technology is that although
the storage capabilities of disk are increasing dramatically, disk rotational
speed increases more slowly. Disk I/O is clearly the weakest link on a system: while RAM access takes about 540 CPU cycles, disk access can take
20 million CPU cycles.
To reiterate, poor layout of your data affects I/O performance much more
than any tunable I/O parameter. Returning to the I/O stack, you can clearly
see the truth in this statement just by looking at where the tunables are on
the stack. They are much closer to the top than disk placement and logical
volumes.
Direct I/O
First introduced in AIX 4.3, direct I/O bypasses the Virtual Memory Manager (VMM), enabling the transfer of data directly to disk from the user’s
buffer. Direct I/O is not for everyone, because although it is possible to
improve performance using this technique, it is also possible to degrade
performance if you turn on direct I/O where you shouldn’t.
Implementing direct I/O can provide near raw logical volume performance
while at the same time maintaining the flexibility and manageability of file
systems. What are good candidates for direct I/O? Applications that have
files with poor cache utilization are one example. Another is applications
that use synchronous writes, because these writes must go to disk. Because
direct I/O bypasses the cache and goes directly to disk, the dual data copy
is eliminated and CPU usage drops.
What are not good candidates for direct I/O? Applications that have
smaller requests with persistent segments (which translate into permanent
locations).
Concurrent I/O
Introduced in AIX 5.2, concurrent I/O (CIO) is nearly identical to direct
I/O, but one better. With direct I/O, inodes (data structures that are associated with files) are locked to prevent a condition in which multiple threads
might try to change the contents of a file at the same time. CIO actually
bypasses this inode lock, letting multiple threads read and write data
concurrently to the same file. This capability is enabled due to the way in
which JFS2 is implemented with a write-exclusive inode lock, which lets
multiple users read the same file simultaneously. This design has the effect of increasing performance dramatically when multiple users read from
the same data file.
Direct I/O can cause major problems with databases that continuously read
from the same file. Concurrent I/O solves this problem, making it the preferred method of running databases. You turn on CIO either when mounting
the file system or through the open() system call (with the O_CIO flag). It’s as simple as running the
mount command with the cio option:
# mount -o cio /u01
When you mount the file system using this method, all files in the file system will use CIO.
Unlike direct I/O, you can use CIO only with JFS2. As with direct I/O,
some environments won’t benefit from turning on CIO. For example, applications that could benefit from a file system read-ahead or high buffer
cache might actually experience decreased performance. Test, test, test,
and then test some more!
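Part of that testing is confirming that the file system really is mounted the way you think it is. The sketch below checks the option list of a mount entry; it assumes the options are the last whitespace-separated field of each line, as in AIX mount output, and the function name and sample line (device and log names included) are made up for illustration.

```shell
# Succeed if the given mount point is listed with the given option.
# Reads mount-style lines on stdin; assumes the comma-separated
# option list is the last field on each line.
mounted_with() {
    awk -v mp="$1" -v opt="$2" '
        {
            hit = 0
            for (i = 1; i < NF; i++)
                if ($i == mp) hit = 1
            if (hit) {
                n = split($NF, o, ",")
                for (j = 1; j <= n; j++)
                    if (o[j] == opt) found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    '
}

# Illustrative sample line; on a live system, pipe in the output of mount.
echo '/dev/fslv00 /u01 jfs2 Jun 04 07:11 rw,cio,log=/dev/loglv00' |
    mounted_with /u01 cio && echo "/u01 uses CIO"
```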
Asynchronous I/O
Asynchronous I/O (AIO) conceptually relates to whether applications are
waiting for I/O to complete before processing additional data. In other
words, AIO lets applications continue to process while I/O runs in the
background. This approach improves performance because processing can
occur simultaneously.
An AIX 6.1 note: virtually everything AIO-related has changed with the
implementation of AIX 6.1. For information about these changes, see
Chapter 16.
Logical Volumes and Disk Placement:
Intra- and Inter-Policy
Figure 10.2 depicts the relationship between the logical volumes and the
physical disk.
[Figure 10.2 shows three layers. Application layer: JFS/JFS2 file systems and raw logical volumes. Logical layer: the Logical Volume Manager, with the volume group, its logical volumes, the logical volume device driver, and the physical volumes. Below the physical volumes sit the device drivers and the physical layer: physical disks and physical arrays.]
Figure 10.2: System layers
The logical volume layer sits between the application and physical layers.
In other words, the application layer correlates to the file system or raw
logical volume. The physical layer consists of the actual disk. Logical Volume Manager is the AIX disk management system that maps data between
logical and physical storage. LVM also lets data reside on multiple physical platters and be managed and analyzed using specialized LVM commands. LVM controls all the physical disk resources on your system while
providing a logical view of the storage subsystem.
Knowing that the logical layer sits directly between the application layer
and the physical layer should help you understand why the logical layer is
probably the most important of all the layers. Even your physical volumes
themselves are part of the logical layer because the physical layer encompasses only the actual physical components.
What about the other elements that make up the preceding illustration?
From the bottom up, each of the drives is named as a physical volume.
Multiple physical volumes make up the volume group. The logical volumes are defined within the volume group, and LVM enables the data to
be on multiple physical drives, although they might be configured to be on
a single volume group. A logical volume can consist of one or more logical partitions. Each logical partition has at least one physical partition that
corresponds to it. This is where you actually mirror your system, by having
multiple copies of the physical partitions.
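The arithmetic behind mirroring is worth spelling out, because it determines how much raw disk a logical volume really consumes. A minimal sketch, with a made-up function name and sample sizes:

```shell
# Physical partitions consumed by a logical volume.
# Args: number of logical partitions, number of copies
# (1 = unmirrored, 2 or 3 = mirrored).
pps_needed() {
    lps=$1; copies=$2
    echo $((lps * copies))
}

# Raw disk consumed in MB. Args: logical partitions, copies, PP size in MB.
raw_mb_needed() {
    echo $(( $(pps_needed "$1" "$2") * $3 ))
}

pps_needed 100 2          # prints 200
raw_mb_needed 100 2 64    # prints 12800
```

In other words, a mirrored 100-partition logical volume on 64 MB partitions presents 6400 MB to the application but occupies 12800 MB of disk.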
How does logical volume creation correlate with physical volumes? Figure
10.3 illustrates the storage position on the physical disk platter.
[Figure 10.3 labels the regions of a disk platter: edge, middle, center, inner middle, and inner edge.]
Figure 10.3: Physical disk platter layout
As a general rule, data written toward the center of the platter has faster
seek times than data written on the outer edge. This has to do with the
concept of data density. Because data is more dense as it moves toward the
center, there will be less movement of the head. Because the inner edge
will usually have the slowest seek times, more intensive I/O applications
should be brought closer to the center of the physical volumes. Is this
always the case? There are exceptions. For example, disks hold more data
per track on the edges of the disk than on the center. For this reason, logical volumes being accessed sequentially should actually be placed on the
edge for better performance. The same holds true for logical volumes that
have Mirror Write Consistency Check (MWCC) turned on. This is because
the MWCC sector is on the edge of the disk (not at the center), which
relates to the intra-disk policy of logical volumes.
Inter-Disk Policy
The inter-disk policy defines the number of actual disks on which the
physical partitions of a logical volume reside. The general rule is that the
minimum policy provides the greatest reliability and availability, while the
maximum policy improves performance. Simply put, the more drives your
data is spread on, the better the performance. Some other best practices
include the following:
● Allocating intensive logical volumes to separate physical volumes
● Defining the logical volumes to the maximum size you need
● Placing frequently used logical volumes close together
These are all reasons to understand your data before configuring your systems so that you can create policies that make sense from the start. You can
define your policies when creating the logical volumes themselves using
the System Management Interface Tool (SMIT) fastpath command:
# smitty mklv
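If you prefer the command line to SMIT, the same two policies map onto the mklv flags -a (intra-disk position) and -e (inter-disk range). The sketch below only builds and prints the command so that you can review it first; the function name and the datalv and datavg names are hypothetical.

```shell
# Build (but do not run) an mklv command with explicit placement policies.
# Args: LV name, volume group, number of logical partitions,
#       intra-disk position, inter-disk range.
build_mklv() {
    lv=$1; vg=$2; lps=$3; position=$4; range=$5
    case $position in
        edge)         a=e  ;;
        middle)       a=m  ;;
        center)       a=c  ;;
        inner-middle) a=im ;;
        inner-edge)   a=ie ;;
        *) echo "unknown position: $position" >&2; return 1 ;;
    esac
    case $range in
        minimum) e=m ;;
        maximum) e=x ;;
        *) echo "unknown range: $range" >&2; return 1 ;;
    esac
    echo "mklv -y $lv -a $a -e $e $vg $lps"
}

# An I/O-intensive logical volume: center of the disk, spread across
# as many physical volumes as possible.
build_mklv datalv datavg 100 center maximum
```

This prints mklv -y datalv -a c -e x datavg 100, which you can then run once you are happy with the layout.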
File Systems
Two types of kernels exist in AIX: a 32-bit kernel and a 64-bit kernel.
(AIX 6.1 has only a 64-bit kernel.) Although both types of kernels share
some common libraries and most commands and utilities, you should
understand their differences and how the kernel relates to overall performance tuning. JFS2 is optimized for the 64-bit kernel, while JFS is optimized for the 32-bit kernel. Always use JFS2 if you can. Both JFS and
JFS2 are journaling file systems, which have been associated with performance overheads. In fact, with JFS, where availability was not an issue and
peak performance was necessary, you could disable metadata logging in an
effort to increase performance. With JFS2 (AIX 5.3 only), that technique is
no longer possible (or necessary) because the file system is tuned to handle
metadata-intensive types of applications more efficiently. With AIX 6.1
you can now mount file systems without logging.
The most important advantage of JFS2 lies in its ability to scale. With
JFS2, you can have files up to 16 TB; JFS imposes a file size limit of 64
GB. JFS2 also includes changes in the directory organization. It uses a
binary tree representation while performing inode searches, rather than the
linear method used by JFS.
Chapter 11
Disk I/O: Monitoring
This chapter provides an overview of the AIX-specific tools (sar, nmon,
and topas) available to monitor disk I/O activity. These commands let you
quickly troubleshoot a performance problem and capture data for historical
trending and analysis. Don’t expect to see iostat here. That Unix utility lets
you quickly determine whether there is an imbalanced I/O load between
your physical disks and adapters. But unless you decide to write your own
scripting tools using iostat, it won’t help you with long-term trending and
capturing data.
sar
The sar command, whose syntax is given in Chapter 8, is one of those
older, generic Unix tools that have been improved over the years. Although
I generally prefer to use more specific AIX tools, such as nmon and topas,
sar provides strong information with respect to disk I/O. Let’s run a typical
sar command to examine I/O activity:
# sar -d 1 2
AIX newdev 3 5 06/04/
System Configuration: lcpu=4 disk=5
07:11:16   device   %busy   avque   r+w/s   blks/s   avwait   avserv
07:11:17   hdisk1       0     0.0       0        0      0.0      0.0
           hdisk0      29     0.0     129       85      0.0      0.0
           hdisk3       0     0.0       0        0      0.0      0.0
Here’s a breakdown of the column headings:
● %busy: Portion of time the device was busy servicing transfer requests
● avque: Number of requests waiting to be sent to disk (as of AIX 5.3)
● r+w/s: Number of read or write transfers to or from a device
● blks/s: Amount of data transferred (in 512-byte units)
● avwait: Average wait time per request (in milliseconds)
● avserv: Average service time per request (in milliseconds)
You want to be wary of any disk that approaches 100 percent utilization
or shows a large number of queue requests waiting for disk. Although the
sample output shows some activity, we have no real I/O problems because
no waiting for I/O is occurring. We should continue to monitor this system
to ensure other disks in addition to hdisk0 are being used.
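When a system has dozens of disks, it is easier to filter the sar output than to read it. Here is a minimal sketch that prints only the disks at or above a %busy threshold. It assumes device names of the form hdiskN and the column order shown above; the function name and the threshold of 20 are arbitrary choices for illustration.

```shell
# Print devices whose %busy is at or above a threshold.
# Reads sar -d style rows on stdin. The first row of each interval
# carries a timestamp, so the device name is located by pattern and
# %busy is taken from the field that follows it.
busy_disks() {
    awk -v limit="$1" '
        {
            for (i = 1; i < NF; i++)
                if ($i ~ /^hdisk[0-9]+$/) {
                    if ($(i + 1) + 0 >= limit + 0)
                        print $i, $(i + 1)
                    break
                }
        }
    '
}

# Sample rows copied from the output above.
printf '%s\n' \
    '07:11:17 hdisk1 0 0.0 0 0 0.0 0.0' \
    'hdisk0 29 0.0 129 85 0.0 0.0' \
    'hdisk3 0 0.0 0 0 0.0 0.0' | busy_disks 20
```

With the sample rows, only hdisk0 is reported, which matches what we saw by eye.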
Where sar differs from iostat is in its ability to capture data for long-term
analysis and trending using its system activity data collector (sadc) utility.
Usually turned off in cron, the sadc utility lets you capture data for historical trending and analysis.
Here’s how this works. As delivered by default on AIX systems, two shell
scripts, /usr/lib/sa/sa1 and /usr/lib/sa/sa2, whose cron entries are normally commented out, provide daily reports on the activity of the system. The sar
command actually calls the sadc routine to access system data. The following example shows how the shell scripts are usually kicked off from cron:
# crontab -l | grep sa1
0 8-17 * * 1-5 /usr/lib/sa/sa1 1200 3 &
0 * * * 0,6 /usr/lib/sa/sa1 &
0 18-7 * * 1-5 /usr/lib/sa/sa1 &
topas
What about something a little more user-friendly? Did you say topas? The
topas command is a nice performance-monitoring tool that you can use for
a number of purposes, including monitoring your disk subsystem.
Let’s take a look at the topas output from a disk perspective:
Topas output for host - Testhost              Events/Queues     FILE/TTY
Mon May  7 07:33:38 2007   Interval:  2       Cswitch     500   Readch     487
                                              Syscall    1298   Writech    943
Kernel    0.5  |#                          |  Reads         2   Rawin        0
User      0.5  |#                          |  Writes        1   Ttyout     459
Wait      0.0  |                           |  Forks         0   Igets        0
Idle     99.0  |###########################|  Execs         0   Namei       25
                                              Runqueue    0.0   Dirblk       0
Network  KBPS  I-Pack  O-Pack  KB-In  KB-Out  Waitqueue   0.0
en1       0.6     1.0     1.0    0.1     0.5
lo0       0.1     1.0     1.0    0.0     0.0
                                              PAGING            MEMORY
Disk    Busy%  KBPS   TPS  KB-Read  KB-Writ   Faults        1   Real,MB   4095
hdisk0    0.0   0.0   0.0      0.0      0.0   Steals        0   % Comp    13.8
hdisk1    0.0   0.0   0.0      0.0      0.0   PgspIn        0   % Noncomp 87.1
hdisk3    0.0   0.0   0.0      0.0      0.0   PgspOut       0   % Client   0.5
cd0       0.0   0.0   0.0      0.0      0.0   PageIn        0
hdisk2    0.0   0.0   0.0      0.0      0.0   PageOut       0   PAGING SPACE
                                              Sios          0   Size,MB   4096
Name       PID   CPU%  PgSp  Owner            NFS (calls/sec)   % Used     0.5
X        15256    0.8   2.5  root             ServerV2      0   % Free    99.4
topas    22320    0.2   1.5  root             ClientV2      0
syncd    15016    0.0   0.6  root             ServerV3      0   Press:
lrud      9030    0.0   0.0  root             ClientV3      0   "h" for help
gil      10320    0.0   0.1  root                               "q" to quit
i4llmd   12434    0.0   1.1  root
prngd    19154    0.0   0.2  root
rpc.lock 26878    0.0   0.0  root
nfsd     28238    0.0   0.0  root
tcl      17906    0.0   0.8  root
i4lmd    25352    0.0   1.3  root
dtwm     22752    0.0   1.9  rds
xmgc      9804    0.0   0.0  root
dtsessio 20700    0.0   1.8  rds
init         1    0.0   0.7  root
vmstat   37288    0.0   0.2  root
dtfile   20444    0.0   1.7  rds
cron     27720    0.0   0.4  root
rshell   33334    0.0   0.8  user
netm     10062    0.0   0.0  root
No I/O activity at all is going on here. Besides the physical disk, pay close
attention to the “Wait” information (in the CPU section up top), which
also helps you determine whether the system is I/O-bound. If you see high
numbers here, you can then use other tools, such as filemon, fileplace,
lslv, or lsof, to help you figure out which processes, adapters, or file systems are causing your bottlenecks.
The topas command is useful for quickly troubleshooting an issue when
you want a little more than iostat can provide. In a sense, topas is a
graphical mix of iostat and vmstat, although recent improvements now
provide the ability to capture data for historical analysis. These improvements, introduced in AIX 5.3, no doubt were made because of the popularity of nmon.
While nmon provides a front end similar to topas, it is much more useful in terms of long-term trending and analysis. Further, as you learned
in Chapter 5, nmon gives system administrators the ability to output data
to an Excel spreadsheet for presentation in graphical charts (tailor-made
for senior management and functional teams) that clearly illustrate bottlenecks. The nmon analyzer tool provides the hooks into nmon. (Figure 5.1
in Chapter 5 shows some sample output from the nmon analyzer.) With
respect to disk I/O, nmon reports the following data: disk I/O rates, data
transfers, read/write ratios, and disk adapter statistics.
Here is one small example of where nmon really shines. Let’s say you
want to know which processes are hogging most of the disk I/O, and you
want to be able to correlate that activity with the actual disk to clearly illustrate I/O per process. nmon usage helps you here more than any other
tool. To perform this task with nmon, use the –t option; set your timing and
then sort by I/O channel.
How do you use nmon to capture data and import it into the analyzer? Use
the open-source sudo command and run nmon for three hours, taking a
snapshot every 30 seconds (360 snapshots in all):
# sudo nmon -f -t -r test1 -s 30 -c 360
Next, sort the created output file:
# sort -A testsystem_yymmdd.nmon > testsystem_yymmdd.csv
Then FTP the .csv file to your PC, start the nmon analyzer spreadsheet
(enabling macros), and click on Analyze nmon data. The nmon command
also helps track the configuration of asynchronous I/O servers.
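When you plan a capture, the value for nmon's -c flag is simply the run length divided by the -s interval. Here is a minimal sketch of that arithmetic in Python; the function name is my own invention for illustration and is not part of nmon:

```python
def nmon_snapshot_count(duration_hours, interval_seconds):
    """Return the -c value needed to cover duration_hours
    when nmon takes one snapshot every interval_seconds."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    total_seconds = duration_hours * 3600
    return int(total_seconds // interval_seconds)


if __name__ == "__main__":
    # Three hours at one snapshot every 30 seconds
    count = nmon_snapshot_count(3, 30)
    print(f"nmon -f -t -s 30 -c {count}")
```

Three hours at a 30-second interval works out to 360 snapshots; a count of 180 at the same interval would cover only 90 minutes.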
Logical Volume Monitoring
Say that a ticket has just been opened up with the service desk that relates
to slow performance on some database server. You suspect there might
be an I/O issue, so you start with iostat. iostat, the equivalent of using
vmstat for virtual memory, is arguably the most effective way to get a first
glance at what is happening with your I/O subsystem. Let’s run iostat, in
this case once a second:
# iostat 1
System configuration: lcpu=4 disk=4

tty:      tin         tout    avg-cpu:  % user    % sys    % idle    % iowait
          0.0        392.0               5.2       5.5      88.3       1.1

Disks:      % tm_act     Kbps      tps     Kb_read     Kb_wrtn
hdisk1         0.5       19.5      1.4    53437739    21482563
hdisk0         0.7       29.7      3.0    93086751    21482563
hdisk4         1.7      278.2      6.2   238584732   832883320
hdisk3         2.1      294.3      8.0   300653060   832883320
The command reports the following information:
● % tm_act: Percentage of time that the physical disk was active
(that is, busy servicing requests)
● Kbps: Amount of data (in kilobytes per second) transferred to or
from the drive
● tps: Number of transfers per second issued to the physical disk
● Kb_read: Total data (in kilobytes) read from the physical volume
during the measured interval
● Kb_wrtn: Total data (in kilobytes) written to the physical volume
during the measured interval
You need to watch % tm_act very carefully because if this utilization
exceeds roughly 60 to 70 percent, that usually indicates that processes are
starting to wait for I/O. This might be your first clue of impending I/O
problems. Moving data to less busy drives can obviously help ease this
burden. Generally speaking, the more drives your data hits, the better.
Just like anything else, too much of a good thing can also be bad, and you
also have to make sure you don’t have too many drives hitting any one
adapter. One way to determine whether an adapter is saturated is to sum the
Kbps amounts for all disks attached to one adapter. The total should be below the disk adapter’s throughput rating, usually less than 70 percent. Using the –a flag with iostat helps you drill down further to examine adapter
utilization. In the following output, there clearly are no bottlenecks:
# iostat -a

Adapter:                   Kbps      tps    Kb_read    Kb_wrtn
scsi0                       0.0      0.0          0          0

Paths/Disk:     % tm_act   Kbps      tps    Kb_read    Kb_wrtn
hdisk1_Path0       37.0    89.0      0.0          0          0
hdisk0_Path0       67.0    47.0      0.0          0          0
hdisk4_Path0        0.0     0.0      0.0          0          0
hdisk3_Path0        0.0     0.0      0.0          0          0

Adapter:                   Kbps      tps    Kb_read    Kb_wrtn
ide0                        0.0      0.0          0          0

Paths/Disk:     % tm_act   Kbps      tps    Kb_read    Kb_wrtn
cd0                 0.0     0.0      0.0          0          0
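The two rules of thumb in this section (watch for % tm_act above roughly 60 to 70 percent, and keep the summed Kbps for an adapter below about 70 percent of its rating) are easy to script. The sketch below is illustrative only: it parses disk rows copied from the iostat sample shown earlier, and the adapter rating of 160,000 KB/s is a made-up figure, so substitute the rating of your own hardware.

```python
# Disk rows copied from the iostat sample in this chapter:
# name, % tm_act, Kbps, tps, Kb_read, Kb_wrtn
SAMPLE = """\
hdisk1 0.5 19.5 1.4 53437739 21482563
hdisk0 0.7 29.7 3.0 93086751 21482563
hdisk4 1.7 278.2 6.2 238584732 832883320
hdisk3 2.1 294.3 8.0 300653060 832883320
"""

# Hypothetical adapter throughput rating in KB/s (an assumption, not from iostat)
ADAPTER_RATING_KBPS = 160_000


def parse_disk_rows(text):
    """Turn whitespace-separated iostat disk rows into dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip headers and blank lines
        try:
            tm_act, kbps, tps = (float(x) for x in fields[1:4])
        except ValueError:
            continue  # not a data row
        rows.append({"disk": fields[0], "tm_act": tm_act,
                     "kbps": kbps, "tps": tps})
    return rows


def busy_disks(rows, threshold=60.0):
    """Names of disks whose % tm_act is at or above the threshold."""
    return [r["disk"] for r in rows if r["tm_act"] >= threshold]


def adapter_load_percent(rows, rating_kbps):
    """Summed Kbps of the disks as a percentage of the adapter rating."""
    total = sum(r["kbps"] for r in rows)
    return 100.0 * total / rating_kbps


if __name__ == "__main__":
    rows = parse_disk_rows(SAMPLE)
    print("Busy disks:", busy_disks(rows) or "none")
    load = adapter_load_percent(rows, ADAPTER_RATING_KBPS)
    print(f"Adapter load: {load:.2f}% of rating")
```

In practice you would feed the function one adapter's disks at a time, because the 70 percent guideline applies per adapter, not to the whole system.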
AIX LVM Commands
We examined disk placement earlier, and I stressed the importance of
architecting your systems correctly from the beginning. Unfortunately, you
don’t always have that option. As system administrators, we sometimes inherit systems that must be fixed. Let’s look at a sample layout of the logical
volumes on disks to determine whether we need to change definitions or
rearrange data. We’ll examine a volume group and find the logical volumes
that are a part of it.
The lsvg command provides volume group information:
# lsvg -l data2vg
data2vg:
LV NAME      TYPE     LPs   PPs   PVs   LV STATE     MOUNT POINT
data2lv      jfs      128   256   2     open/syncd   /data
loglv00      jfslog   1     2     2     open/syncd   N/A
appdatalv    jfs      128   256   2     open/syncd   /appdata
Now, let’s use lslv, which provides information about logical volumes:
# lslv data2lv
LOGICAL VOLUME:  data2lv                  VOLUME GROUP:  data2vg
LV IDENTIFIER:   0003a0ec00004c00000000fb076f3f41.1
                                          PERMISSION:    read/write
VG STATE:        active/complete          LV STATE:      opened/syncd
TYPE:            jfs                      WRITE VERIFY:  off
MAX LPs:         512                      PP SIZE:       64 megabyte(s)
COPIES:          2                        SCHED POLICY:  parallel
LPs:             128                      PPs:           256
STALE PPs:       0                        BB POLICY:     relocatable
INTER-POLICY:    minimum                  RELOCATABLE:   yes
INTRA-POLICY:    center                   UPPER BOUND:   32
MOUNT POINT:     /data                    LABEL:         /data
This view provides a detailed description of the logical volume attributes.
What do we have here? The intra-policy is at the center, which normally is
the best policy for I/O-intensive logical volumes. As you recall from an earlier discussion, there are exceptions to this rule. Unfortunately, you’ve just
hit one of them. Because Mirror Write Consistency Check (MWCC) is on,
the volume would have been better served if it were placed on the edge.
Let’s look at its inter-policy. The inter-policy is minimum, which is usually
the best policy if availability matters more than performance. Further, there
are twice as many physical partitions as logical partitions, which signifies
that you are mirroring your systems. In this case, let’s assume you were
told that raw performance was the most important objective, so the logical
volume wasn’t configured to reflect the reality of how the volume is being
used. Further, if you are mirroring the system and using an external storage
array, the situation would even be worse, because you’re already providing
mirroring at the hardware layer, which is actually more effective than using
AIX mirroring.
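The mirroring check described above is just a ratio: physical partitions divided by logical partitions gives the number of copies. A small sketch, using the LPs and PPs columns from the lsvg listing (the helper name is hypothetical):

```python
# (LV name, LPs, PPs) taken from the lsvg -l sample in this chapter
LSVG_ROWS = [
    ("data2lv", 128, 256),
    ("loglv00", 1, 2),
    ("appdatalv", 128, 256),
]


def copies_per_lv(lps, pps):
    """Number of copies of each logical partition (1 = not mirrored)."""
    if lps <= 0 or pps % lps != 0:
        raise ValueError("PPs should be a whole multiple of LPs")
    return pps // lps


if __name__ == "__main__":
    for name, lps, pps in LSVG_ROWS:
        copies = copies_per_lv(lps, pps)
        state = "mirrored" if copies > 1 else "not mirrored"
        print(f"{name}: {copies} copies ({state})")
```

A result of 2 or 3 means AIX is keeping that many copies of every logical partition, which is worth questioning if the storage array already mirrors at the hardware layer.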
The lslv command’s –l (lowercase L) flag lists all the physical volumes
associated with the logical volumes and shows the distribution for each
logical volume:
# lslv -l data2lv
data2lv:/data2
PV        COPIES         IN BAND   DISTRIBUTION
hdisk2    128:000:000    100%      000:108:020:000:000
hdisk3    128:000:000    100%      000:108:020:000:000
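With this detail, you can see how the logical volume's physical partitions
are laid out on each disk. The IN BAND column reports the percentage of the
logical volume's partitions on that disk that were allocated in the region
requested by its intra-policy. The DISTRIBUTION column shows the number of
physical partitions in each region of the physical volume, in the order
outer edge, outer middle, center, inner middle, inner edge. From here, you
can assess how well the volume's intra-disk policy is being honored.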
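The DISTRIBUTION string is easier to reason about once it is split into named regions. The following sketch assumes the five colon-separated counts run from the outer edge of the disk to the inner edge; the function names are mine, not part of any AIX tool.

```python
REGIONS = ("outer edge", "outer middle", "center",
           "inner middle", "inner edge")


def parse_distribution(dist):
    """Map an lslv -l DISTRIBUTION string to per-region partition counts."""
    counts = [int(part) for part in dist.split(":")]
    if len(counts) != len(REGIONS):
        raise ValueError("expected five colon-separated counts")
    return dict(zip(REGIONS, counts))


def share_in_region(dist, region):
    """Percentage of the listed partitions that sit in one region."""
    counts = parse_distribution(dist)
    total = sum(counts.values())
    return 100.0 * counts[region] / total if total else 0.0


if __name__ == "__main__":
    sample = "000:108:020:000:000"  # from the lslv -l output above
    for region, count in parse_distribution(sample).items():
        print(f"{region:13s} {count:4d}")
    print(f"center share: {share_in_region(sample, 'center'):.1f}%")
```

Comparing the share in the region named by the intra-policy against the total is a quick way to see whether the partitions actually landed where the policy asked for them.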
Let’s drill down even further, using the lspv command with the -p flag:
# lspv -p hdisk2
hdisk2:
PP RANGE   STATE   REGION         LV ID           TYPE     MOUNT POINT
  1-108    free    outer edge
109-109    used    outer edge     loglv00         jfslog   N/A
110-217    used    outer middle   data2lv         jfs      /data2
218-237    used    center         appdatalv       jfs      /appdata
238-325    used    center         testdatalv      jfs      /testdata
326-365    used    inner middle   stagingdatalv   jfs      /staging
366-433    free    inner middle
434-542    free    inner edge
The preceding view shows you what is free on the physical volume, what
has been used, and which partitions are used where. The regions are listed
from the outside of the disk inward: outer edge, outer middle, center,
inner middle, inner edge. The sample report shows that most of the used
partitions are in the outer middle and center regions.
You can do a lot with lsvg and lslv; run a man on these commands to find
out more about them.
One of the best tools for looking at LVM use is lvmstat. Because the lvmstat view is not enabled by default, you need to enable it before running
the tool:
# lvmstat -v data2vg -e
The following command takes a snapshot of Logical Volume Manager
information every second for 10 intervals:
# lvmstat -v data2vg 1 10
The resulting output shows the most utilized logical volumes on your system since you started the data collection tool:
# lvmstat -v data2vg
Logical Volume      iocnt     Kb_read     Kb_wrtn      Kbps
  appdatalv        306653    47493022      383822     103.2
  loglv00              34           0        3340       2.8
  data2lv             453      234543      234343      89.3
This detail is very helpful when drilling down to the logical volume layer
in tuning your systems:
● iocnt: Number of read and write requests
● Kb_read: Total data (in kilobytes) read during your measured interval
● Kb_wrtn: Total data (in kilobytes) written during your measured interval
● Kbps: Amount of data transferred (in kilobytes per second)
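Once you have these columns, ranking the logical volumes by request count tells you where to look first. This sketch hard-codes the three rows from the sample output; on a live system you would parse them from lvmstat yourself, and the helper names here are only illustrative.

```python
# (logical volume, iocnt, Kb_read, Kb_wrtn, Kbps) from the lvmstat sample
ROWS = [
    ("appdatalv", 306653, 47493022, 383822, 103.2),
    ("loglv00", 34, 0, 3340, 2.8),
    ("data2lv", 453, 234543, 234343, 89.3),
]


def rank_by_iocnt(rows):
    """Return rows ordered from most to fewest I/O requests."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


def iocnt_share(rows):
    """Percentage of all I/O requests attributed to each logical volume."""
    total = sum(row[1] for row in rows)
    return {row[0]: 100.0 * row[1] / total for row in rows} if total else {}


if __name__ == "__main__":
    shares = iocnt_share(ROWS)
    for name, iocnt, *_ in rank_by_iocnt(ROWS):
        print(f"{name:10s} {iocnt:8d} requests  {shares[name]:6.2f}%")
```

In the sample, nearly all requests go to a single logical volume, which makes it the obvious candidate for relocation or spreading across more disks.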
Be sure to review the documentation for all the commands discussed here
before adding them to your repertoire.
filemon and fileplace
This section introduces two important I/O tools, filemon and fileplace, and
discusses how you can use them during systems administration each day.
filemon
filemon [-d] [-i Trace_File -n Gennames_File] [-o File] [-O Levels]
[-P] [-T n] [-u] [-v]
The filemon command uses a trace facility to report on the I/O activity of
physical and logical storage, including your actual files. The I/O activity
monitored is based on the time interval specified when running the trace.
The command reports on all layers of file system utilization, including the
LVM, virtual memory, and physical disk layers. Run without any flags,
filemon executes in the background while application programs or system
commands are being run and monitored.
The trace starts automatically and runs until it is stopped. At that time, the
command generates an I/O activity report and exits. It can also process a
trace file that has been recorded by the trace facility. You can then generate reports from this file. Because reports generated to standard output usually scroll past your screen, I advise using the –o option to write the output
to a file:
# filemon -o dbmon.out -O all
Run trcstop command to signal end of trace.
Sun Aug 19 17:47:34 2007
System: AIX 5.3 Node: lpar29p682e_pub Machine: 00CED82E4C00
# trcstop
[filemon: Reporting started]
# [filemon: Reporting completed]
[filemon: 73.906 secs in measured interval]
When we check out the file, here is what we see:
Sun Aug 19 17:50:45 2007
System: AIX 5.3 Node: lpar29p682e_pub Machine: 00CED82E4C00
Cpu utilization:
68.2%
Cpu allocation:
77.1%
130582780 events were lost. Reported data may have inconsistencies or errors.
Most Active Files
------------------------------------------------------------------------
  #MBs   #opns   #rds   #wrs   file          volume:inode
.
.
.
Look for long seek times because they can result in decreased application
performance. By examining the read and write sequence counts in detail,
you can further determine whether the access is sequential or random. This
information helps you when it is time to do I/O tuning. The sample output
clearly illustrates that there is no I/O bottleneck to speak of in this case.
The filemon command provides a tremendous amount of detail; to be
honest, I’ve found it gives too much information at times. Further, using
filemon can impose a large performance hit. I don’t typically like to recommend performance tools that impose such a substantial overhead, so I’ll
reiterate that although filemon certainly has a purpose, you need to be very
careful when using it.
fileplace
fileplace [ {-l|-p} [-i] [-v] ] File | [-m LogicalVolumeName]
The fileplace command reports the placement of a file’s blocks within a
file system. The command is commonly used to examine and assess the
efficiency of a file’s placement on disk. For what purposes do you use it?
One reason would be to help determine whether some of your heavily used
files are substantially fragmented.
The fileplace command can also help you identify the physical volume with the highest utilization and determine whether the drive or I/O
adapter is causing the bottleneck. Let’s look at an example of a frequently
accessed file:
# fileplace -pv dbfile

File: dbfile  Size: 5374622 bytes  Vol: /dev/hd4
Blk Size: 4096  Frag Size: 4096  Nfrags: 1313
Inode: 21  Mode: -rw-r--r--  Owner: root  Group: system

  Physical Addresses (mirror copy 1)                               Logical Extent
  ----------------------------------                               -----------------
  02134816-02134943  hdisk0    128 frags    524288 Bytes,   9.7%   00004352-00004479
  02135680-02136864  hdisk0   1185 frags   4853760 Bytes,  90.3%   00005216-00006400

  1313 frags over space of 2049 frags:   space efficiency = 64.1%
  2 extents out of 1313 possible:   sequentiality = 99.9%
You should be interested in space efficiency and sequentiality here.
Higher space efficiency means files are less fragmented and provide better sequential file access. A higher sequentiality tells you that the files are
more contiguously allocated, which is also better for sequential file access.
In the example, space efficiency could be better, while sequentiality is
quite high.
If space efficiency and sequentiality are too low, you might want to consider reorganizing the file system. You would do this with the reorgvg command, which
can improve logical volume utilization and efficiency.
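It helps to see where the two percentages come from. The sketch below recomputes them from the physical extents in the sample, using formulas inferred from the numbers that fileplace prints (fragments used divided by the span they cover, and fragments minus extents plus one, divided by fragments). Treat the formulas as my reading of the output rather than an official definition.

```python
# Physical extents (first fragment, last fragment), inclusive,
# copied from the fileplace sample in this chapter
EXTENTS = [(2134816, 2134943), (2135680, 2136864)]


def placement_stats(extents):
    """Return (nfrags, span, space_efficiency_pct, sequentiality_pct)."""
    if not extents:
        raise ValueError("at least one extent is required")
    nfrags = sum(last - first + 1 for first, last in extents)
    span = max(last for _, last in extents) - min(first for first, _ in extents) + 1
    space_efficiency = 100.0 * nfrags / span
    sequentiality = 100.0 * (nfrags - len(extents) + 1) / nfrags
    return nfrags, span, space_efficiency, sequentiality


if __name__ == "__main__":
    nfrags, span, eff, seq = placement_stats(EXTENTS)
    print(f"{nfrags} frags over space of {span} frags")
    print(f"space efficiency = {eff:.1f}%")
    print(f"sequentiality = {seq:.1f}%")
```

With the sample extents this reproduces the 1313 fragments, the span of 2049 fragments, and the 64.1 and 99.9 percent figures: the gap between the two extents is what drags space efficiency down, while having only two extents keeps sequentiality high.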
Chapter 12
Disk I/O: Tuning
The best way to tune your I/O is to configure it properly before deploying
your systems. In this way, I/O tuning is different from memory or CPU
subsystem tuning. Of course, nine times out of ten, you will have inherited
an existing system, so you need to be aware of all the areas where you can
tune your disk I/O subsystems.
lvmo
lvmo -v Name -o Tunable [=NewValue]
lvmo -a [-v vgname]
You use the lvmo command to set and display pinned memory buffer,
or pbuf, tuning parameters. The Logical Volume Manager uses pbufs to
control pending disk I/O operations. The lvmo command is also used to
display blocked I/O statistics.
lvmo is one of those new commands introduced in AIX 5.3. It’s important
to note that its usage permits changes only for LVM pbuf tunables that
are dedicated to specific volume groups. The ioo utility (described next)
remains the only way to manage pbufs on a systemwide basis. That’s
because before Version 5.3, the pbuf pool parameter was a systemwide resource. With the introduction of AIX Version 5.3, LVM manages one pbuf
pool for each volume group.
Let’s display the lvmo tunables for the data2vg volume group:
# lvmo -v data2vg -a
vgname = data2vg
pv_pbuf_count = 1024
total_vg_pbufs = 1024
max_vg_pbuf_count = 8192
pervg_blocked_io_count = 7455
global_pbuf_count = 1024
global_blocked_io_count = 7455
The following parameters are available for tuning:
● pv_pbuf_count: Number of pbufs that can be added when a physical
volume is added to the volume group
● max_vg_pbuf_count: Maximum number of pbufs that can be allocated
for a volume group
● global_pbuf_count: Number of pbufs that can be added when a
physical volume is added to any volume group
Let’s increase the pbuf count for this volume group:
# lvmo -v data2vg -o pv_pbuf_count=2048
It’s important to note that if you increase the pbuf value too much,
performance may actually degrade. Truthfully, I usually stay away from
lvmo and use ioo instead. I’m more used to tuning the global parameters,
and it’s also safer this way.
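Before you change any pbuf setting, it is worth capturing the lvmo output twice and checking whether the blocked I/O counter is actually growing. The sketch below parses the "name = value" lines; the sample text reuses values from the output above, the second reading of 7500 is invented for illustration, and treating a growing counter as a sign of pbuf shortage is my working assumption rather than something lvmo itself reports.

```python
SAMPLE = """\
vgname = data2vg
pv_pbuf_count = 1024
max_vg_pbuf_count = 8192
global_blocked_io_count = 7455
"""


def parse_lvmo(text):
    """Parse 'name = value' lines into a dict, converting integers."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a name = value line
        key, value = key.strip(), value.strip()
        values[key] = int(value) if value.isdigit() else value
    return values


def blocked_io_growth(before, after, counter="global_blocked_io_count"):
    """Increase in a blocked I/O counter between two lvmo snapshots."""
    return after[counter] - before[counter]


if __name__ == "__main__":
    first = parse_lvmo(SAMPLE)
    # Hypothetical second snapshot taken some time later
    second = dict(first, global_blocked_io_count=7500)
    print("Blocked I/Os since first snapshot:",
          blocked_io_growth(first, second))
```

A counter that stays flat between snapshots suggests the existing pbuf pool is coping, in which case there is little reason to raise it.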
ioo
ioo [-p|-r] { -o Tunable [=NewValue] }
ioo [-p|-r] { -d Tunable }
ioo [-p|-r] -D
ioo [-p|-r] -a
ioo -?
ioo -h [Tunable]
ioo -L [Tunable]
ioo -x [Tunable]
The ioo command is used for virtually all I/O-related tuning parameters.
As with vmo, you need to be extremely careful when changing this command’s parameters because doing so on the fly can severely degrade
performance. Table 12.1 details specific tuning parameters used often for
JFS file systems. As you can see, most of the tuning commands for I/O use
the ioo utility.
Table 12.1: JFS tuning parameters

Function                                 JFS tuning parameter             JFS2 tuning parameter
Set the maximum amount of memory         vmo -o maxperm=value             vmo -o maxclient=value
for caching files                                                         (less than or equal to maxperm)
Set the minimum amount of memory         vmo -o minperm=value             n/a
for caching
Set a (hard) limit on memory             vmo -o strict_maxperm            vmo -o maxclient
for caching                                                               (hard limit)
Set the maximum number of pages          ioo -o maxpgahead=value          ioo -o j2_maxPageReadAhead=value
used for sequential read-ahead
Set the minimum number of pages          ioo -o minpgahead=value          ioo -o j2_minPageReadAhead=value
used for sequential read-ahead
Set the maximum number of pending        chdev -l sys0 -a maxpout=value   chdev -l sys0 -a maxpout=value
write I/Os to a file
Set the minimum number of pending        chdev -l sys0 -a minpout=value   chdev -l sys0 -a minpout=value
write I/Os to a file at which
programs blocked by maxpout
might proceed
Set the size of modified data cache      ioo -o maxrandwrt=value          ioo -o j2_maxRandomWrite
for a file with random writes                                             ioo -o j2_nRandomCluster
Control the gathering of I/Os for        ioo -o numclust=value            ioo -o j2_nPagesPerWriteBehindCluster=value
sequential write-behind
Set the number of file system            ioo -o numfsbufs=value           ioo -o j2_nBufferPerPagerDevice=value
bufstructs
There are several ways to determine the existing ioo values on your system. The long display listing for ioo gives you the most information. It lists
the values for current, reboot value, range, unit, type, and dependencies of
all tunable parameters managed by ioo. Here is a sample of some of the
parameters:
# ioo -L
NAME                      CUR    DEF    BOOT   MIN    MAX    UNIT         TYPE
     DEPENDENCIES
j2_atimeUpdateSymlink     0      0      0      0      1      boolean      D
j2_dynamicBufferPreallo   16     16     16     0      256    16K slabs    D
j2_inodeCacheSize         400    400    400    1      1000                D
j2_maxPageReadAhead       128    128    128    0      64K    4KB pages    D
j2_maxRandomWrite         0      0      0      0      64K    4KB pages    D
Let’s change a tunable:
# ioo -o maxpgahead=32
Setting maxpgahead to 32
JFS2 Tuning Options
Some important JFS2-specific file system performance enhancements include sequential page read-ahead and sequential and random write-behind.
The AIX Virtual Memory Manager anticipates page requirements by
observing the patterns in which files are accessed. When a program accesses
two successive pages of a file, the VMM assumes that the program will keep
accessing the file sequentially. You can set VMM thresholds to
configure the number of pages to be read ahead. With JFS2, make note of
two important parameters:
● j2_minPageReadAhead: Determines the number of pages to read
ahead when VMM initially detects a sequential pattern
● j2_maxPageReadAhead: Determines the maximum number of
pages VMM can read ahead in a sequential file
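To get a feel for how the two thresholds interact, here is a deliberately simplified model of read-ahead growth. It assumes the read-ahead size starts at the minimum and doubles on each further sequential access until it reaches the maximum; that doubling rule and the starting value of 2 are assumptions made for illustration, while 128 matches the j2_maxPageReadAhead value in the ioo -L sample.

```python
def read_ahead_sequence(min_pages, max_pages):
    """Simplified model: read-ahead starts at min_pages and doubles
    on each sequential access until it is capped at max_pages."""
    if min_pages <= 0 or max_pages < min_pages:
        raise ValueError("need 0 < min_pages <= max_pages")
    sizes = []
    current = min_pages
    while current < max_pages:
        sizes.append(current)
        current *= 2
    sizes.append(max_pages)  # the cap is always the final step
    return sizes


if __name__ == "__main__":
    # min of 2 is assumed; max of 128 is taken from the ioo -L sample
    steps = read_ahead_sequence(2, 128)
    print("pages read ahead per step:", steps)
    print("steps to reach the maximum:", len(steps))
```

Under this model, raising the maximum mostly helps long sequential scans, because short files never run enough steps to reach the cap.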
Sequential and random write-behind relates to writing modified pages in
memory to disk after a certain threshold is reached, rather than
waiting for the syncd daemon to flush the pages to disk. The purpose of this
functionality is to limit the number of dirty pages in memory, thereby reducing
I/O overhead and disk fragmentation. With sequential write-behind, pages do
not sit in memory until the syncd daemon runs, a delay that
can cause real bottlenecks. With random write-behind, when the number
of pages in memory exceeds a specified amount, all subsequent pages are
written to disk.
Another important area worth mentioning is large sequential I/O processing. When too much simultaneous I/O is occurring to your file systems, the
I/O can bottleneck at the file system level. In this case, you should increase
the ioo command’s j2_nBufferPerPagerDevice parameter (numfsbufs
with JFS). If you use raw I/O as opposed to file systems, the same type of
bottleneck can occur through LVM. In this case, you might want to tune
the lvm_bufcnt parameter.
Section IV
Summary, Tips, and Quiz
Summary
● Direct I/O, introduced in AIX 4.3, bypasses the Virtual Memory Manager
and transfers data directly to the disk from the user’s buffer. Turning
on this feature may increase your performance, depending on your
application. Direct I/O benefits applications that use synchronous writes,
because the writes have to go to disk.
● Concurrent I/O (CIO) has all the performance benefits of direct I/O
while also bypassing inode lock. This action lets multiple threads read
and write data concurrently to the same file. Concurrent I/O benefits
from the implementation of JFS2 with a write-exclusive inode lock,
which lets multiple users read the same file simultaneously.
● Appropriate use of asynchronous I/O (AIO) can significantly improve
the performance of writes on the I/O subsystem. AIO lets an application
continue processing while its I/O completes in the background; I/O and
application processing can thus run concurrently.
● The logical volume sits between the application and physical layers.
The Logical Volume Manager (LVM) disk management system maps
the data between logical and physical storage. This architecture lets data
reside on multiple physical platters and be managed using LVM commands.
● As a general rule, data written toward the center of the physical disk
platter has faster seek times than data written on the outer edge. This
advantage has to do with the density of the data.
● Inter-policy defines the number of disks on which the physical partitions
of a logical volume reside.
● Intra-policy defines the place on the disk where the logical volume
actually resides.
● You use the lslv, lsvg, lspv, and lvmstat commands to monitor logical
volumes.
● Commands ioo and lvmo work to tune disk I/O. Most tuning commands
for I/O use the ioo utility.
● The filemon command uses a trace facility to report on the I/O activity
of physical and logical storage, including your actual files. The I/O
activity monitored is based on the time interval you specify when running
the trace. The utility reports on all layers of file system utilization,
including the LVM, virtual memory, and physical disk layers.
● The fileplace command reports the placement of a file’s blocks within a
file system. It commonly is used to examine and assess the efficiency of
a file’s placement on disk.
● Journaling file systems, although much more secure than nonjournaling
systems, have historically been associated with performance
overheads.
● In a “Performance Rules!” shop (at the expense of availability), you
would disable metadata logging in an effort to increase performance
with the JFS file system. With JFS2, that option is no longer possible,
or even necessary, because JFS2 is tuned to handle metadata-intensive
types of applications much more efficiently. JFS imposes a limit of 64
GB for a file; with JFS2, you can have a file supporting 16 TB.
Tips
● Make sure your data is spread evenly across all spindles. If you have a
storage area network (SAN) or an external storage array, verify that your
storage administrator understands how he or she needs to configure this
system, which includes trying to create arrays of equal size and type
if possible. Try to create one logical unit (LUN) for each array and then
spread the logical volumes across all physical volumes in the volume
group.
● Make certain your mirrors are on separate disks and adapters.
● If you’re running a relational database management system (RDBMS),
make sure your indexes, temporary tablespaces, and redo logs reside on
separate physical disks or LUNs.
● Regarding adapters, spread them across multiple buses, and don’t attach
too many physical disks or LUNs to any one adapter. Remember, the
more adapters you have, the better your performance will be.
● Be sure your device drivers support multipath I/O or your storage
equivalent of that (e.g., PowerPath for EMC) to allow for further load
balancing of the I/O subsystem.
● Be careful when using the filemon command, because you incur a
performance overhead when using this tracing tool.
● The rule of thumb when configuring AIO servers in AIX 5.3 is to set the
maximum number of servers (MaxServers) equal to 10 times the number
of disks or 10 times the number of processors. You would set MinServers
at one half of this amount. Other than having some more kernel processes
hanging out that don’t get used (consuming a small amount of kernel
memory), there really is little risk in oversizing the number of MaxServers,
so don’t be afraid to bump it up. Note that in AIX 6.1, this issue is
no longer a concern.
● Consider employing concurrent I/O when using databases such as
Oracle. CIO permits multiple threads to read and write data concurrently
to the same file. This advantage accrues from the way in which JFS2 is
implemented with write-exclusive inode locks, which let multiple users
read the same file simultaneously. Performance increases dramatically
when multiple users read from the same data file.
● Never lose sight of the fact that while RAM access takes about 540 CPU
cycles, disk access can take 20 million CPU cycles. Clearly, the weakest
link on a system is the disk I/O storage system. It’s your job as the
system administrator to make sure it doesn’t become even more of a
bottleneck.
● In terms of intra-disk policy, as a best practice, the more intensive I/O
applications should be brought closer to the center of the physical
volumes. Note, though, that this rule has exceptions. Disks hold more data
per track on the edges, not on the center. That being said, logical
volumes being accessed sequentially should actually be placed on the edge
for better performance. The same advice holds true for logical volumes
that have Mirror Write Consistency Check (MWCC) turned on, because
the MWCC sector is on the edge of the disk and not at the center of it,
which relates to the intra-disk policy of logical volumes.
● Examine parameters j2_minPageReadAhead and j2_maxPageReadAhead in
an effort to increase performance when sequential I/O is encountered.
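The AIO rule of thumb from the tips above reduces to a couple of multiplications. The sketch below computes both variants (by disk count and by processor count) so you can compare them; the function name and the example counts of 4 disks and 4 logical CPUs are placeholders, and which variant fits your workload is a judgment call the rule itself does not settle.

```python
def aio_server_limits(disks, processors):
    """Apply the AIX 5.3 rule of thumb: MaxServers is 10 times the
    number of disks (or processors), and MinServers is half of that."""
    if disks < 0 or processors < 0:
        raise ValueError("counts must not be negative")
    by_disks = 10 * disks
    by_processors = 10 * processors
    return {
        "by_disks": {"maxservers": by_disks, "minservers": by_disks // 2},
        "by_processors": {"maxservers": by_processors,
                          "minservers": by_processors // 2},
    }


if __name__ == "__main__":
    # Placeholder counts; substitute the values for your own system
    for basis, limits in aio_server_limits(disks=4, processors=4).items():
        print(f"{basis}: MaxServers={limits['maxservers']} "
              f"MinServers={limits['minservers']}")
```

Because oversizing MaxServers carries little risk, taking the larger of the two results is a reasonable starting point.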
Quiz
Multiple Choice
1. What is the weakest link on your system?
a. RAM
b. CPU
c. Disk
d. CPU cache
2. What sits between the application layer and the physical layer of the
system?
a. Physical volumes
b. Logical volumes
c. File systems
d. Inodes
3. With JFS2, you can have a file that supports
a. 32 TB
b. 16 GB
c. 16 TB
d. 72 TB
4. For better performance, where on the disk platter should you place
logical volumes that are being accessed sequentially?
a. On the edge
b. In the middle
c. Inside
d. Outside
5. Which command do you use to set and display your pbuf tuning
parameter?
a. no
b. nfsm
c. lvmo
d. lsattr
6. Which command is used most often to tune disk I/O?
a. vmo
b. ioo
c. iostat
d. lsmo
7. What defines the place on the disk where the logical volume actually
resides?
a. Intra-policy
b. Inter-policy
c. Inode policy
d. LVM policy
True or False
8. The rule of thumb when configuring AIO servers in AIX 5.3 is to set the
maximum number of servers equal to 10 times the amount of disk or 10
times the number of processors.
9. filemon reports the placement of a file’s blocks within a file system.
Fill in the Blank
10. Which parameter determines the number of pages to read ahead when
VMM initially detects a sequential pattern?
__________________________________________
Section V
Network I/O
This section provides an overview of network management on AIX, including how to monitor and tune the network subsystem. It also discusses
tools you can use to monitor your hardware and the Network File System
(NFS). Unlike other subsystems, the network subsystem has many things
to monitor, so we’ll spend quite a bit of time on this topic. You’ll learn how
to monitor network packets using the netstat command. We’ll also review
best practices for tuning your network and discuss various networking
concepts as they relate to systems performance.
Chapter 13
Network I/O: Introduction
The first thing that usually comes to mind when a system administrator
hears that there might be some network contention issues is to run netstat.
The netstat command — the “net” equivalent of using vmstat or iostat
— provides a quick-and-dirty way to get an overview of how your network
is configured. Unlike vmstat or iostat, however, the command defaults
usually don’t give you as much information as you’d probably like. You
need to understand the correct usage of netstat and how best to use it
when monitoring your system.
The netstat facility isn’t really a monitoring tool in the sense that vmstat
and iostat are. Other, more suitable tools (which we’ll get to later) are
available to help you monitor your network subsystem. At the same time,
you can’t really start to monitor until you have a thorough understanding
of the various components related to network performance. These components include your network adapters, your switches and routers, and how
you are using virtualization on your host logical partitions.
If you determine that you indeed are experiencing a network bottleneck,
the solution to the problem might actually lie outside your immediate host
machine. If the network switch is improperly configured on the other end,
there is little you can do. Of course, you might be able to point the network
team in the right direction. You should also spend time gathering overall
information about your network.
How are you going to be able to understand how to troubleshoot your
network devices unless you really understand the network? In the next few
chapters, we’ll look at specific AIX network tracing tools, such as netpmon, to see how they can help you isolate your bottlenecks.
No matter which subsystem you want to tune, remember that systems
tuning is an ongoing process. As I’ve stated before, the best time to start
monitoring your systems is at the beginning, before you have any problems
and when users aren’t screaming. You need a baseline of network performance so that you know what the system looks like when it’s behaving
normally. And remember: be careful to make changes one at a time so you
can assess the actual impact of each change.
Network I/O Overview
Understanding the network subsystem as it relates to AIX is not an easy
undertaking. From a hardware and software aspect, there are far fewer
areas you need to investigate when you examine CPU and memory
bottlenecks. Tuning disk I/O is more complex than other tuning activities because many more issues affect performance, particularly during the
architecting and build-out of systems. In this respect, tuning the network
is probably most similar to tuning disk I/O — a fact that’s actually not too
surprising, given that both relate to I/O.
Let’s start by examining the AIX Transmission Control Protocol/Internet
Protocol (TCP/IP) layers, which are depicted in Figure 13.1.
Figure 13.1: AIX TCP/IP layers
From this illustration, you can clearly see that there is more to network
monitoring than simply running netstat and looking for collisions. From
the application layer through the media layer, areas need to be configured,
monitored, and tuned. At this point, you should notice some similarities between this illustration and the Open Systems Interconnection (OSI) model,
which divides network architecture into seven layers (from top to bottom):
● Application
● Presentation
● Session
● Transport
● Network
● Data link
● Physical
Perhaps the most important concept to understand is that each layer on the
host machine communicates with the corresponding layer on the remote
machine. The actual application programs transmit data using either the
User Datagram Protocol (UDP) or the TCP transport layer protocols. They
receive the data from whatever application you are using and divide that
data into packets. The packets themselves differ depending on whether a
packet is a UDP packet or a TCP packet. In general, UDP is faster because
it does less work per packet, while TCP is more reliable.
There are many tunable parameters to look at, and we’ll get to these later.
To begin, you might want to start to familiarize yourself with the no command, which is the utility designed to make most network changes. From
a hardware perspective, it is critical for you to understand the components
that must be configured appropriately to optimize performance.
Although you might work together with the network teams that manage
your switches and routers, you probably won’t be configuring those devices unless you’re a small shop or a one-person IT department. The most
important component you’ll work with is the network adapter. Most of
your adapters will probably be some version that supports Gigabit Ethernet, such as a 10/100/1000 Mbps Ethernet card. Let’s review the important
concepts you’ll need to work with here.
NFS
Introduced by Sun Microsystems in 1984, the Network File System (NFS)
lets clients access files over a network as if the files were locally attached
as disks. Version 2 of NFS, introduced in 1989, operated exclusively on
UDP. Version 3, which debuted in 1995, added TCP support, which helped
NFS thrive over a wide area network (WAN). Version 4, introduced in
2000, was the first version developed by the Internet Engineering Task
Force (after Sun relinquished control of NFS development). NFS V4 was
also the first version to provide stateful support, whereby both the client
and the server maintain current information about both open files and file
locks.
NFS was further enhanced in 2003 under RFC3530, and it is this standard
that AIX supports. AIX 5.3 supports three versions of NFS: Versions 2, 3,
and 4. The default version is Version 3. (For Red Hat Linux, the default
NFS version is Version 4.) You can choose the NFS version type during the
actual mounting of the file system, and you can run different NFS versions
on the same server.
NFS now supports both TCP and UDP. Because UDP is faster (it does
less), some environments that favor optimum performance (on a LAN)
over reliability might perform better with UDP. TCP is more reliable (because it establishes connections) and provides better performance over a
WAN (because its flow control helps minimize network latency).
A benefit of NFS is that it acts independently of machine types and operating systems. It achieves this independence through the use of remote
procedure calls (RPCs), as depicted in Figure 13.2.
[Figure: Client A (Thread m) and Client B (Thread n), each running biod daemons, connect over the LAN to the nfsd daemons on Server Z]
Figure 13.2: Interaction between client and server
The figure illustrates how NFS clients A and B access the data on NFS
server Z. The client computers first request access to the exported data by
mounting the file system. Then, when a client thread tries to process data
within the NFS mounted file system, the data is redirected to the biod daemon, which takes the data through the LAN to the NFS server and its nfsd
daemon. The server uses nfsd to export the directories that are available to
its clients. As you can see, you’ll need to tune the network and I/O parameters. If Server Z is performing poorly, that obviously affects all of its NFS
clients. If possible, tune the server specifically to function as an NFS server
(more about this later).
What about the biod daemon? This daemon is required to perform both
read-ahead and write-behind requests. biod improves overall NFS performance as it either empties or fills up the buffer cache, acting as a liaison to
the client applications. As shown in the figure, the biod daemon sends the
requests to the server. On the other side, nfsd is the liaison that provides
NFS services to clients. When the server receives biod communications
from the client, it uses the nfsd daemon until the request is completed.
How is it that NFS was not stateful until Version 4, even though it could
use TCP as early as Version 3? Figure 13.3 illustrates where NFS resides in
relation to the TCP/IP stack and the OSI model.
Figure 13.3: NFS relationship to OSI and TCP/IP
Because NFS uses remote procedure calls, it does not reside on the transport stack. RPCs are a library of procedures that enable the client and
server processes to execute system calls as if they were executed in their
own address spaces. In a typical UDP NFS Version 2 or 3 implementation, the NFS server sends its client a type of cookie after the clients are
authorized to share the volume. This approach helps minimize network
traffic. The problem is that if the server goes down, clients will continue to
inundate the network with requests. That is why there is a preference for
using TCP. Only Version 4 can use stateful connections, and Version 4
always uses TCP as its transport protocol.
NFS 4 has no interaction with portmap or other daemons such as lockd
and statd, because these functions are rolled into the kernel. In versions
other than Version 4, the portmapper is used to register RPC services and
to provide the port numbers for the communications between clients and
servers. External Data Representation (XDR) provides the mechanism
that RPC and NFS use to ensure reliable data exchange between client and
server. This interaction takes place in a way that is platform-independent
for the exchange of binary data, thus addressing the possibility of systems
representing data in different ways. Using XDR, data can be interpreted
correctly, even on platforms that are not alike.
Media Speed
Network adapters communicate with other devices based on how the media
speed is configured. Although other choices are available, you should configure your card for either 100 Mbps full duplex or auto-negotiation. With
auto-negotiation, both adapters try to communicate using the highest possible speed. The documentation might tell you that you need to configure
the card this way (IBM even defaults to auto-negotiation on the system),
but most senior AIX administrators I know prefer to set it to full duplex to
ensure they receive the fastest possible adapter speed. If this setting doesn’t
function properly, you should work with the appropriate network teams to
resolve the problem before deployment.
I prefer to take more time initially rather than set the adapter to an option
that might cause slower speeds as a result of poorly configured switches.
The lsattr command gives you the information you need. Used with the
en prefix, it displays your driver parameters; the ent prefix displays your
hardware parameters. In the following case, the interface is set to autonegotiate.
# lsattr -El ent0
alt_addr        0x000000000000   Alternate Ethernet Address         True
busintr         166              Bus interrupt level                False
busmem          0xc8030000       Bus memory address                 False
chksum_offload  yes              Enable RX Checksum Offload         True
intr_priority   3                Interrupt priority                 False
ipsec_offload   no               IPsec Offload                      True
large_send      no               Enable TCP Large Send Offload      True
media_speed     Auto_Negotiation Media Speed                        True
poll_link       no               Enable Link Polling                True
poll_link_timer 500              Time interval for Link Polling     True
rom_mem         0xc8000000       ROM memory address                 False
rx_hog          1000             RX Descriptors per RX Interrupt    True
rxbuf_pool_sz   1024             Receive Buffer Pool Size           True
rxdesc_que_sz   1024             RX Descriptor Queue Size           True
slih_hog        10               Interrupt Events per Interrupt     True
tx_preload      1520             TX Preload Value                   True
tx_que_sz       8192             Software TX Queue Size             True
txdesc_que_sz   512              TX Descriptor Queue Size           True
use_alt_addr    no               Enable Alternate Ethernet Address  True
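If you check media speed on many adapters, it can help to pull just that one value out of the lsattr listing. The following is a minimal sketch, not an AIX utility: media_speed_of is a hypothetical helper name, and it assumes the attribute name and value are the first two whitespace-separated fields, as in the listing above.

```shell
# Sketch: read `lsattr -El entN` output on stdin and print the media_speed value.
# media_speed_of is a hypothetical helper name, not an AIX command.
media_speed_of() {
    awk '$1 == "media_speed" { print $2 }'
}

# On an AIX host you would run:  lsattr -El ent0 | media_speed_of
# Here a canned line stands in for the real command output.
printf 'media_speed Auto_Negotiation Media Speed True\n' | media_speed_of
```

Run against every ent device, a helper like this makes it easy to spot an adapter whose setting differs from the rest.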
You should also check your adapter firmware levels to make sure they’re
up-to-date. I’ve seen many network problems fixed by updating to the latest levels of firmware. The lscfg command reports firmware information:
# lscfg -vp | grep -p ROM
10/100 Mbps Ethernet PCI Adapter II:
Part Number.................09P5023
FRU Number..................09P5023
EC Level....................H10971A
Manufacture ID..............YL1021
Network Address.............0002556FC98B
ROM Level.(alterable).......SCU015
Product Specific.(Z0).......A5204207
Device Specific.(YL)........U0.1-P1-I1/E1
10/100/1000 Base-TX PCI-X Adapter:
Part Number.................00P3056
FRU Number..................00P3056
EC Level....................H11635A
Manufacture ID..............YL1021
Network Address.............00096B2E31BD
ROM Level.(alterable).......GOL002
Device Specific.(YL)........U0.1-P1/E2
Network Subsystem Memory Management
You should also start to familiarize yourself with the memory management facility of network subsystems. This facility makes use of data
structures called mbufs that are used to store kernel data for incoming and
outbound traffic. The buffer sizes themselves can range from 32 bytes to
16,384 bytes. The buffer pools are created by making allocation requests
to the Virtual Memory Manager. In a symmetric multiprocessing box, each
memory pool is split evenly among the processors. An important point to note
is that a processor cannot borrow from a memory pool that belongs to another
processor.
Virtual and Shared Ethernet
Two other concepts to be familiar with are virtual Ethernet and shared
Ethernet.
First supported on AIX 5.3 on POWER5, virtual Ethernet allows for interpartition- and IP-based communications between logical partitions on the
same frame. This functionality is achieved through the use of a virtual I/O
switch. The Ethernet adapters themselves are created and configured using
the Hardware Management Console (HMC).
Shared Ethernet is one of the features of Advanced Power Virtualization (APV) or PowerVM. It enables the use of virtual I/O servers (VIOs),
whereby several host machines can actually share one physical network
adapter. Shared Ethernet is typically used in environments that don’t require substantial network bandwidth.
Although an in-depth discussion of virtualization is beyond the scope of
this book, you should understand that if you are using virtualization, there
might be other reasons for your bottleneck outside of what you’re doing on
the host machine. Virtualization is a wonderful thing, but you need to be
careful not to share too many adapters from your VIO server, or you might
pay a large network I/O penalty. Use of the appropriate monitoring tools
should inform you whether you have a problem. Further, you might want
to familiarize yourself with concepts such as Address Resolution Protocol
(ARP) and the Domain Name System (DNS), which can also affect network
performance and reliability in different ways.
Chapter 14
Network I/O: Monitoring
Let’s begin our discussion of network I/O monitoring by revisiting our old
standby, netstat, which displays overall network statistics. Probably one of
the most common commands you will type is netstat -in:
# netstat -in
Name  Mtu    Network     Address            Ipkts      Ierrs  Opkts   Oerrs  Coll
en1   1500   link#2      2a.21.70.0.90.6    21005666   0      175389  0      0
en1   1500   10.153      10.153.3.7         21005666   0      175389  0      0
en0   1500   link#3      2a.21.70.0.90.5    328241182  0      1189    0      0
en0   1500   172.29.128  172.29.137.205     328241182  0      1189    0      0
lo0   16896  link#1                         62223      0      62234   0      0
lo0   16896  127         127.0.0.1          62223      0      62234   0      0
lo0   16896  ::1                            62223      0      62234   0      0
Here is a key to the output fields:
● Name: Interface name
● Mtu: Interface Maximum Transmission Unit (MTU) size
● Network: The actual network address to which the interface connects
● Address: Media Access Control (MAC) or IP address
● Ipkts: Total number of packets received by the interface
● Ierrs: Number of input errors reported by the interface
● Opkts: Number of packets transmitted from the interface
● Oerrs: Number of errors that occurred while transmitting packets from the interface
● Coll: Number of collisions on the adapter (if you’re using Ethernet, you won’t see anything here)
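Raw error counts mean more when you compare them with the packet counts. Here is a minimal sketch that turns netstat -in output into input and output error percentages per line. The function name iface_error_rates is hypothetical. Because the Address column can be blank, the sketch counts fields from the right-hand side of each line, which assumes the last five columns are always Ipkts, Ierrs, Opkts, Oerrs, and Coll.

```shell
# Sketch: read `netstat -in` output on stdin and print
# "interface input-error-% output-error-%" for each data line.
# iface_error_rates is a hypothetical helper name.
iface_error_rates() {
    awk '$1 != "Name" && NF >= 8 {
        # Count columns from the right because Address may be empty.
        ipkts = $(NF-4); ierrs = $(NF-3); opkts = $(NF-2); oerrs = $(NF-1)
        ipct = (ipkts > 0) ? 100 * ierrs / ipkts : 0
        opct = (opkts > 0) ? 100 * oerrs / opkts : 0
        printf "%s %.3f %.3f\n", $1, ipct, opct
    }'
}

# On an AIX host you would run:  netstat -in | iface_error_rates
# Canned input taken from the en0 link line shown in the text:
printf '%s\n' \
    'Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll' \
    'en0 1500 link#3 2a.21.70.0.90.5 328241182 0 1189 0 0' | iface_error_rates
```

Each interface appears once per address family, so expect repeated names in the result.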
Another handy netstat flag is -m. This option lets you view the kernel
memory allocation statistics, including mbuf memory requests (and buffer
size), amount of memory in use, and failures by CPU:
# netstat -m
Kernel malloc statistics:

******* CPU 0 *******
By size     inuse      calls  failed  delayed   free  hiwat  freed
32            194       5203       0        2     62   2620      0
64            484       3926       0        7     28   2620      0
128           309      14913       0        8    875   1310      0
256           392     214494       0       22    136   2620      0
512          2060   26183179       0      261     60   3275      0
1024           31       2714       0        8     25   1310      0
2048          587       1237       0      292      5   1965      0
4096            9       8367       0        2      2    655      0
8192            2         12       0        2      1    327      0
16384         224        354       0       29      2    163      0
32768          48        183       0       13      3     81      0
65536          84        142       0       42      0     81      0
131072          3          4       0        0     51    102      0

******* CPU 1 *******
By size     inuse      calls  failed  delayed   free  hiwat  freed
32             17         96       0        0    111   2620      0
64            295       1214       0        5     25   2620      0
128           151      93806       0        5    713   1310      0
256            83        273       0        5     29   2620      0
512          1577   86936634       0      199     23   3275      0
1024            4         18       0        2      4   1310      0
2048          515        516       0      257      1   1965      0
4096            1        707       0        0      1    655      0
8192            1          1       0        1      4    327      0
16384          32         32       0        4      0    163      0
32768          52        193       0       15      5     90      0
65536          34         34       0       17      0     81      0
131072          0          0       0        0     44     88      0
If you’re using Ethernet, you can also use the entstat command to display
device driver statistics:
# entstat -d en1
-------------------------------------------------------------
ETHERNET STATISTICS (en1) :
Device Type: 10/100 Mbps Ethernet PCI Adapter II (1410ff01)
Hardware Address: 00:02:55:6f:c9:9b
Elapsed Time: 5 days 12 hours 14 minutes 46 seconds

Transmit Statistics:                          Receive Statistics:
--------------------                          -------------------
Packets: 803536                               Packets: 2095253
Bytes: 511099654                              Bytes: 1099945394
Interrupts: 520                               Interrupts: 2074913
Transmit Errors: 0                            Receive Errors: 0
Packets Dropped: 0                            Packets Dropped: 0
                                              Bad Packets: 0
Max Packets on S/W Transmit Queue: 38
S/W Transmit Queue Overflow: 0
Current S/W+H/W Transmit Queue Length: 1

Broadcast Packets: 535                        Broadcast Packets: 997476
The entstat output provides a potpourri of information. You won’t see
many collisions because you’ll probably be working in a switched environment. Look for transmit errors, and make sure they’re not increasing too
fast.
You need to learn to troubleshoot collision and error problems before you
even begin to think about tuning. As an alternative, you can use netstat -v,
which provides similar information.
netpmon
netpmon [-o File] [-d] [-T n] [-P] [-t] [-v] [-O ReportType ...]
[-i Trace_File -n Gennames_File]
The netpmon command reports information about CPU usage as it relates
to the network. It also provides data about network device driver I/O, Internet socket calls, and various other statistics.
Similar to its other trace brethren, tprof and filemon, netpmon starts a
trace and runs in the background until you stop it with the trcstop command. I like netpmon because it really gives you a detailed overview of
network activity and also captures data for trending and analysis (although
it’s not as useful as nmon for the latter purpose). In the following example,
we’ll use a trace buffer size of 2 million bytes:
# netpmon -T 2000000 -o /tmp/net.out
Wed Sep 5 05:30:27 2007
System: AIX 5.3 Node: lpar7ml162f_pub Machine: 00C22F2F4C00
Run trcstop to signal the end of the trace:
# trcstop
# [netpmon: Reporting started]
[netpmon: Reporting completed]
[               4 traced cpus               ]
[         245.464 secs total preempt time   ]
[netpmon: 164.813 secs in measured interval]
Let’s look at the data. Here is just a small sampling of the output:
# more net.out
Process CPU Usage Statistics:
-----------------------------
                                                   Network
Process (top 20)             PID  CPU Time   CPU %   CPU %
----------------------------------------------------------
UNKNOWN                    15920  151.2735  36.558   0.000
UNKNOWN                     7794  104.8801  25.346   0.000
UNKNOWN                     6876   73.8785  17.854   0.000
UNKNOWN                     5402   50.6225  12.234   0.000
xmwlm                      13934   15.0469   3.636   0.000
-ksh                        5040    0.0371   0.009   0.000
getty                      18688    0.0280   0.007   0.000
sshd:                      28514    0.0224   0.005   0.000
syncd                      10068    0.0212   0.005   0.000
gil                         3870    0.0163   0.004   0.004
swapper                        0    0.0135   0.003   0.000
spray                       5400    0.0085   0.002   0.000
send-mail                  18654    0.0084   0.002   0.000
rmcd                       15026    0.0081   0.002   0.000
ping                        5036    0.0068   0.002   0.000
ksh                        26642    0.0062   0.002   0.000
trcstop                     5404    0.0057   0.001   0.000
As you can see, little overall network I/O activity was going on during this
time. The top section of the output is most important. It helps you gain an
understanding of which processes are eating up network I/O time.
The lsattr command, which we used in Chapter 13 to view hardware
parameters, is another tool you’ll use frequently to display statistics about
your interfaces. The attributes reported by this command are configured
using either the chdev or the no command. Let’s display the driver parameters using lsattr:
# lsattr -El en0
alias4                IPv4 Alias including Subnet Mask           True
alias6                IPv6 Alias including Prefix Length         True
arp           on      Address Resolution Protocol (ARP)          True
authority             Authorized Users                           True
broadcast             Broadcast Address                          True
mtu           1500    Maximum IP Packet Size for This Device     True
netaddr               Internet Address                           True
netaddr6              IPv6 Internet Address                      True
netmask               Subnet Mask                                True
prefixlen             Prefix Length for IPv6 Internet Address    True
remmtu        576     Maximum IP Packet Size for REMOTE Networks True
rfc1323               Enable/Disable TCP RFC 1323 Window Scaling True
security      none    Security Level                             True
state         detach  Current Interface Status                   True
tcp_mssdflt           Set TCP Maximum Segment Size               True
tcp_nodelay           Enable/Disable TCP_NODELAY Option          True
tcp_recvspace         Set Socket Buffer Space for Receiving      True
tcp_sendspace         Set Socket Buffer Space for Sending        True
Sometimes, I also like to use the spray command to troubleshoot possible
problems (although oftentimes this command is blocked because it’s not
very secure). The spray command sends a one-way stream of packets from
your host to the remote host machines and reports the number of packets
dropped as well as the number of packets transferred:
# /usr/etc/spray lpar8test -c 2000 -l 1400 -d 1
sending 2000 packets of length 1402 to lpar8test ...
        34 packets (1.700%) dropped by lpar8test
        23667 packets/second, 33181234 bytes/second
In the preceding example, 2,000 packets were sent to the lpar8test host,
with a delay of one microsecond. Each packet consisted of 1,400 bytes.
Before using spray, make sure the sprayd entry isn’t commented out in
/etc/inetd.conf (it is commented out by default in AIX), and don’t forget
to refresh inetd. If you’re seeing a substantial number of dropped packets,
that obviously is not good.
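The percentage that spray reports is simply dropped packets divided by packets sent. If you collect the two counts from several runs, a small helper can recompute the figure for comparison. This is only a sketch: drop_pct is a hypothetical function name, and the numbers come from the example run above.

```shell
# Sketch: compute the percentage of dropped packets.
# Usage: drop_pct DROPPED SENT   (drop_pct is a hypothetical helper name)
drop_pct() {
    awk -v dropped="$1" -v sent="$2" 'BEGIN {
        if (sent > 0) printf "%.3f\n", 100 * dropped / sent
        else          print "0.000"      # avoid dividing by zero
    }'
}

# Figures from the spray example: 34 of 2000 packets dropped.
drop_pct 34 2000
```

For the example run this gives the same 1.700 percent that spray itself printed.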
Monitoring NFS
This section covers the use of the nmon, topas, nfsstat, nfs, nfs4cl, and
netpmon commands to monitor the Network File System (NFS). For NFS
tuning, you could use a tool such as topas or nmon initially because these
commands provide a nice dashboard view of what is happening in your
system. Remember that NFS performance problems might not be related
to your NFS subsystem at all; your bottleneck could be on the network or,
from a server perspective, related to CPU or disk I/O. Running a tool such
as topas or nmon can quickly help you get a sense of what the real issues
are.
Consider a system that has two CPUs and is running AIX 5.3 TL_6. The
report in Figure 14.1 shows nmon output from an NFS perspective.
Figure 14.1: NFS nmon output
Look at all the information that is available to you from an NFS (client and
server) perspective using nmon! There are no current bottlenecks at all on
this system.
Although topas has improved recently with its ability to capture data,
nmon might still be a better first choice. While topas provides a front end
similar to nmon, nmon is more useful in terms of long-term trending and
analysis.
nfsstat
The nfsstat tool is arguably the most important tool you’ll work with as
you monitor your network. This command displays all types of information about NFS and remote procedure calls (RPCs). You can use nfsstat as
a monitoring tool to troubleshoot problems and also employ it for performance tuning.
Depending on the flags you use, you can have nfsstat display NFS client
or server information. The command can also show the actual usage count
of file system operations. This detail helps you understand exactly how
each file system is utilized, so that you can know how to best tune your
system. Look at the client flag (c) first.
The r flag generates the RPC information:
# nfsstat -cr
Client rpc:
Connection oriented
calls      badcalls   badxids    timeouts   newcreds   badverfs   timers
14348      1          0          0          0          0          0
nomem      cantconn   interrupt
0          0          0
Connectionless
calls      badcalls   retrans    badxids    timeouts   newcreds   badverfs
23         0          0          0          0          0          0
timers     nomem      cantsend
0          0          0
Here’s a rundown of the connection-oriented parameters:
● calls: Number of RPC calls made
● badcalls: Number of calls rejected by the RPC layer
● badxids: Number of times a server reply was received that did not
correspond to any outstanding call
● timeouts: Number of times calls timed out while waiting for replies from the server
● newcreds: Number of times authentication information was refreshed
● badverfs: Number of times a call failed due to a bad verifier in the
response
If you notice a large number of timeouts or badxids, you could benefit
by increasing the timeo parameter with the mount command (details to
come).
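To judge whether the timeout count is "large," it helps to express it as a share of all calls. The sketch below reads nfsstat -cr output and prints the connection-oriented timeout percentage. The function name rpc_timeout_pct is hypothetical, and the parsing assumes that the calls and timeouts headings sit on the same header line, directly above their values, as in the output shown earlier.

```shell
# Sketch: read `nfsstat -cr` output on stdin and print timeouts as a
# percentage of calls for the "Connection oriented" section.
# rpc_timeout_pct is a hypothetical helper name.
rpc_timeout_pct() {
    awk '
        /^Connection oriented/ { in_sect = 1; next }
        /^Connectionless/      { in_sect = 0 }
        in_sect && $1 == "calls" {
            for (i = 1; i <= NF; i++) col[$i] = i    # map heading to column
            # The next line holds the values for those headings.
            if ((getline) > 0 && $(col["calls"]) > 0)
                printf "%.2f\n", 100 * $(col["timeouts"]) / $(col["calls"])
            exit
        }'
}

# On an AIX host you would run:  nfsstat -cr | rpc_timeout_pct
# Canned input with invented counts (200 calls, 5 timeouts) for illustration:
printf '%s\n' \
    'Client rpc:' \
    'Connection oriented' \
    'calls badcalls badxids timeouts newcreds badverfs timers' \
    '200 1 0 5 0 0 0' | rpc_timeout_pct
```

What percentage counts as too high is a judgment call for your environment; the sketch only does the arithmetic.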
Next, look at the NFS information by using the n flag:
# nfsstat -cn
Client nfs:
calls      badcalls   clgets     cltoomany
14348      1          0          0
Version 2: (0 calls)
null       getattr    setattr    root       lookup     readlink   read
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
wrcache    write      create     remove     rename     link       symlink
0 0%       0 0%       0 0%       0 0%       0 0%       0 0%       0 0%
mkdir      rmdir      readdir    statfs
0 0%       0 0%       0 0%       0 0%
Version 3: (14348 calls)
null       getattr    setattr    lookup     access     readlink   read
0 0%       3480 24%   5 0%       1790 12%   5742 40%   0 0%       30 0%
write      create     mkdir      symlink    mknod      remove     rmdir
44 0%      3 0%       0 0%       0 0%       0 0%       3 0%       0 0%
rename     link       readdir    readdir+   fsstat     fsinfo     pathconf
0 0%       2 0%       3 0%       3195 22%   5 0%       2 0%       0 0%
In NFS Version 3, the output fields include:
● calls: Number of NFS calls made
● badcalls: Number of calls rejected by the NFS layer
● clgets: Number of times a client handle was received
● cltoomany: Number of times the client handle had no unused
entries
nfs4cl
If you’re running NFS Version 4, you might be using the nfs4cl command
more often. This command displays NFS 4 statistics and properties:
# nfs4cl showfs
Server      Remote Path        fsid               Local Path
--------    ---------------    ---------------    ---------------
If after running this command, you see that there is no output, run the
mount command to obtain more detail:
# mount
  node          mounted             mounted over        vfs   date          options
------------  ------------------  ------------------  ----  ------------  ---------------
              /dev/hd4            /                   jfs   Sep 25 13:18  rw,log=/dev/hd8
192.168.1.12  /stage/middleware   /stage/middleware   nfs3  Sep 25 13:22  ro,bg,soft,intr,sec=sys
192.168.1.12  /userdata/20004773  /home/u0004773      nfs3  Sep 25 13:29  bg,hard,int
As you can tell, in this example no file systems are mounted using NFS
Version 4, only NFS Version 3.
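On a host with many mounts, you can let awk pick out the NFS entries and their versions instead of reading the whole listing. This is a minimal sketch: nfs_mounts is a hypothetical helper name, and it assumes the vfs column reads nfs3 or nfs4 and is immediately preceded by the local mount point, as in the mount output shown here.

```shell
# Sketch: read `mount` output on stdin and print "mount-point vfs-type"
# for every NFS entry. nfs_mounts is a hypothetical helper name.
nfs_mounts() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i ~ /^nfs[0-9]*$/) {      # vfs column, for example nfs3 or nfs4
                print $(i-1), $i           # the field before it is the mount point
                break
            }
    }'
}

# On an AIX host you would run:  mount | nfs_mounts
# Canned input based on the listing in the text:
printf '%s\n' \
    ' /dev/hd4 / jfs Sep 25 13:18 rw,log=/dev/hd8' \
    '192.168.1.12 /stage/middleware /stage/middleware nfs3 Sep 25 13:22 ro,bg,soft,intr,sec=sys' |
    nfs_mounts
```

If nothing in the result ends in nfs4, an empty nfs4cl showfs listing is exactly what you should expect.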
Unlike the vast majority of performance tuning commands, nfs4cl can
also be used to tune your system. You do this by using the setfsoptions
subcommand to tune NFS Version 4. Another parameter you can tune is the
previously mentioned timeo, which specifies the timeout value for the RPC
calls to the server.
netpmon and NFS
The netpmon command can also help you troubleshoot NFS bottlenecks.
In addition to monitoring many other types of network statistics, netpmon
monitors NFS activity. For clients, it tracks both read and write subroutines and NFS RPC
requests. For servers, it tracks read and write requests. The command starts a trace and runs in the background until you stop it.
First, let’s kick off the trace:
# netpmon -T 3000000 -o /tmp/nfrss.out
You run the trcstop command to signal the end of the trace, as the following message informs you:
# Sun Oct  7 07:06:14 2007
System: AIX 5.3 Node: lpar24ml162f_pub Machine: 00C22F2F4C00
Run trcstop command to signal end of trace.
Let’s stop our trace:
# trcstop
# [netpmon: Reporting started]
[netpmon: Reporting completed]
[               2 traced cpus               ]
[         245.464 secs total preempt time   ]
[netpmon: 164.813 secs in measured interval]
Now, we can check out the NFS-specific information provided in the output file:
NFSv3 Client RPC Statistics (by Server):
----------------------------------------
Server                     Calls/s
----------------------------------
p650                        126.68
------------------------------------------------------------------------
Total (all servers)         126.68

Detailed NFSv3 Client RPC Statistics (by Server):
-------------------------------------------------
SERVER: p650
calls:                  5602
  call times (msec):    avg 1.408   min 0.274   max 979.611 sdev 21.310

COMBINED (All Servers)
calls:                  5602
  call times (msec):    avg 1.408   min 0.274   max 979.611 sdev 21.310
In this case, you can see the NFS Version 3 client statistics by server.
Although netpmon is a useful trace utility, its performance overhead can
sometimes outweigh its benefits, particularly when you have other ways to
obtain similar information. So be aware of this consideration when using
this utility.
Monitoring Network Packets
Earlier, I addressed some of the very basic flags, such as -in, that you typically use with the netstat command. Using netstat, you can also monitor
more detailed information about the packets themselves. For example, the
-D option reports the overall number of packets received, transmitted, and
dropped in your communications subsystem. The command output sorts
the results by device, driver, and protocol:
# netstat -D
Source              Ipkts        Opkts     Idrops     Odrops
-------------------------------------------------------------------------------
ent_dev0        238122150         1805          0          0
ent_dev1         17583646       301547          0          0
---------------------------------------------------------------
Devices Total   255705796       303352          0          0
.
.
.
There are actually so many different ways to use netstat that the best place
to start is to look at the man page for netstat and go from there. Don’t be
afraid to run these commands, because they won’t eat up disk space or affect performance.
iptrace, ipreport, and ipfilter
The tracing tools provided within AIX are used to record detailed information about packets. Use these commands with more caution.
The tools are extremely helpful when you’re trying to determine the root
cause of network performance problems. Check out iptrace and ipreport
first. The iptrace command records all packets received from the network
interfaces. The ipreport command formats the data generated from iptrace
into a readable trace report. You can also use the ipfilter command to sort
the output file created from ipreport.
Let’s try starting the trace and running it for one minute:
# /usr/sbin/iptrace -a -i en0 iptrace.out
[1]     7375
# [774252]
[1] +  Done    /usr/sbin/iptrace -a -i en0 iptrace.out
Here, you can see the trace running:
# ps -ef | grep iptrace
    root 205030 749602   0 10:57:32  pts/0  0:00 grep iptrace
    root 774252      1   2 10:57:25      -  0:00 /usr/sbin/iptrace -a -i en0 iptrace.out
When we’re done with the trace, we need to kill the process:
# kill -1 774252
# iptrace: unload success!
Next, let’s format the trace file into a readable report:
# ipreport -r -s iptrace.out >/ipreport.network
Now, we can examine the output, which shows the captured information
about each packet, including packet size and IP address information:
# more ipreport.network
IPTRACE version: 2.0
ETH: ====( 114 bytes transmitted on interface en0 )==== 10:57:25.698790226
ETH: [ da:bb:b8:b5:26:14 -> 6e:87:76:59:6e:cd ] type 800 (IP)
IP:  < SRC = 172.29.135.44 > (lpar37p682e)
IP:  < DST = 172.29.131.16 >
IP:  ip_v=4, ip_hl=20, ip_tos=16, ip_len=100, ip_id=18349, ip_off=0 DF
IP:  ip_ttl=60, ip_sum=945f, ip_p = 6 (TCP)
TCP: <source port=22(ssh), destination port=53643 >
TCP: th_seq=337783617, th_ack=1783353394
TCP: th_off=8, flags<PUSH | ACK>
TCP: th_win=65522, th_sum=0, th_urp=0
TCP:     nop
TCP:     nop
TCP:     timestamps TSVal: 0x47414604 TSEcho: 0x47826117
TCP: 00000000  520bea13 dfaefa7b e1c517d6 ce86f960  |R......{.......`|
TCP: 00000010  fdb24d69 947c8d48 fa7b6379 235d1a63  |..Mi.|.H.{cy#].c|
TCP: 00000020  840adfc2 e1b4b916 e1002983 f96fc1fb  |..........)..o..|
As you can imagine, the trace file can become very large fairly quickly.
The file for this example grew to 40 MB in less than a minute! Be very
careful when running these traces because you’ll run out of disk space really fast if the file system that holds the trace file is small.
You can also start the trace using the System Resource Controller (SRC).
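As a rough sketch of the SRC approach, the function below strings together startsrc and stopsrc for the iptrace subsystem, followed by the ipreport step used earlier. The names trace_iface, run, and DRY_RUN are my own for illustration, and the output path is only an example. By default the sketch prints each command rather than executing it, so you can review the sequence before trying it on a test system.

```shell
# Sketch: trace one interface for a fixed time through the SRC.
# trace_iface, run, and DRY_RUN are hypothetical names for this example.
: "${DRY_RUN:=1}"     # 1 = only print the commands (default), 0 = execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Usage: trace_iface INTERFACE SECONDS OUTPUT_FILE
trace_iface() {
    run startsrc -s iptrace -a "-i $1 $3"   # start the trace under SRC control
    run sleep "$2"                          # let the trace collect packets
    run stopsrc -s iptrace                  # stop the trace
    run ipreport -r -s "$3"                 # format the binary trace file
}

trace_iface en0 60 /tmp/iptrace.out
```

Keeping the duration short and fixed is one way to guard against the runaway trace files described above.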
tcpdump
What about tcpdump? This command prints the headers of the packets that
are captured for each network interface card (NIC). One important difference with tcpdump is that, unlike iptrace, it can look at only one network
interface at a time. And because iptrace examines the entire packet from
the kernel space, its results can include lots of dropped packets. With
tcpdump, you can limit the amount of data to be traced. Also, you don’t
need to use an ipreport type of command to format the binary data because
tcpdump performs both the trace and the output.
Let’s run tcpdump:
# tcpdump -w tcp.out
tcpdump: listening on en0, link-type 1, capture size 96 bytes
The utility continues to capture packets until you press Ctrl+C. If any
packets were dropped due to a lack of buffer space, tcpdump reports that,
too:
14755 packets received by filter
0 packets dropped by kernel
13:40:28.001711 IP lpar37p682e.ssh > 172.29.131.16.53736: P
374368029:374368077(48)
The preceding output shows that the kernel dropped no packets, which is a
good thing.
Chapter 15
Network I/O: Tuning
The most important command for tuning AIX network parameters is the no
command. First, take a look at the first few parameters, using the -a flag:
root@lpar37p682e[/] > no -a
arpqsize = 12
arpt_killc = 20
arptab_bsiz = 7
arptab_nb = 149
bcastping = 0
clean_partial_conns = 0
delayack = 0
delayackports = {}
As an alternative, you can use the -L flag, which provides much more
detailed information.
The no command provides more than 100 parameters you can tune. In
older versions of AIX, thewall was an important tunable whose defaults
you needed to change; this parameter defined the upper limit for network
kernel buffers. Today, this size is defined at installation time depending on
the amount of RAM and the kernel type. For example, if you are running
AIX 5.3 on a 64-bit kernel, the parameter is set at half the size of real
memory. (I actually used to enjoy playing around with thewall, so I’m not
sure I like the new approach.) You can use netstat -m to detect shortages
or failures of network memory requests. In the following example, there
are no shortages (failures):
root@lpar37p682e[/etc/tunables] > netstat -m
Kernel malloc statistics

******* CPU 0 *******
By size     inuse      calls  failed  delayed   free  hiwat  freed
32            117        217       0        0     11   5240      0
64            109       6523       0        1     83   5240      0
128           975      15951       0       29    785   2620      0
256           520      67637       0       30   1016   5240      0

Streams mblk statistic failures
0 high priority mblk failures
0 medium priority mblk failures
0 low priority mblk failures
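Rather than scanning every row by eye, you can total the failed column across all sizes and CPUs. The following is a minimal sketch, with mbuf_failures as a hypothetical helper name. It assumes each size row has eight numeric columns with failed in the fourth position, which matches the netstat -m layout shown in this book.

```shell
# Sketch: read `netstat -m` output on stdin and print the total of the
# "failed" column over all size rows. mbuf_failures is a hypothetical name.
mbuf_failures() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 8 { total += $4 }   # size rows only
         END { print total + 0 }'
}

# On an AIX host you would run:  netstat -m | mbuf_failures
# Canned input taken from the example output:
printf '%s\n' \
    'By size inuse calls failed delayed free hiwat freed' \
    '32 117 217 0 0 11 5240 0' \
    '64 109 6523 0 1 83 5240 0' \
    '0 high priority mblk failures' | mbuf_failures
```

A result of 0 means no network memory request has failed since boot. Recording this number in your baseline makes a later increase easy to notice.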
Although you can change many parameters using the no utility, most of
them are better left alone. The most important parameters are those that
relate to TCP streaming workload tuning:
● tcp_sendspace: This parameter controls how much buffer space in
the kernel is used to buffer application data that is waiting to be sent. You really want to bump
this value up from the default because if its limit is reached, the sending application suspends data transfer until TCP has sent enough of the
buffered data to free space.
● tcp_recvspace: In addition to controlling the amount of buffer space
to be consumed by receive buffers, this value helps AIX determine
the size of the TCP window.
● udp_sendspace: When using UDP, you can set this value no
higher than 65536 because IP has an upper limit of 65,536 bytes per
packet.
● udp_recvspace: This value should be greater than udp_sendspace
because it needs to handle as many simultaneous UDP packets per
socket as it can. You can easily set this parameter to 10 times the
value of udp_sendspace.
Let’s use no to make a few changes. First, increase the size of udp_sendspace:
root@lpar37p682e[/] > no -p -o udp_sendspace=65536
Setting udp_sendspace to 65536
Setting udp_sendspace to 65536 in nextboot file
Next, change udp_recvspace to the recommended configuration of 10
times udp_sendspace:
root@lpar37p682e[/] > no -p -o udp_recvspace=655360
Setting udp_recvspace to 655360
Setting udp_recvspace to 655360 in nextboot file
Change to tunable udp_recvspace, will only be effective for future connections
Note that the -p flag applies the change immediately and also retains it across reboots by writing the updated values to the /etc/tunables/nextboot stanza file.
Regarding the TCP parameters for higher-speed adapters, there is no problem setting tcp_sendspace to twice the value of tcp_recvspace. These are
good settings.
Two other important workload parameters of the no command are rfc1323
and sb_max. The rfc1323 tunable enables the TCP window scaling option, which lets TCP use a larger window size. Turning on this parameter
enables the best TCP performance. The sb_max tunable sets an upper limit
on the number of socket buffers queued to an individual socket, controlling
the amount of buffer space consumed by buffers (queued to either a sender
or receiver socket). This number should usually be less than thewall and
approximately four times the size of the largest value of the TCP or UDP
send and receive settings. For example, if your udp_recvspace value is
655360, you can’t go wrong by doubling this to 1310720.
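The sizing rules above reduce to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the TCP values are sample numbers, and the multiplier is left as a parameter because the text mentions both "approximately four times" and doubling.

```python
def udp_recvspace_for(udp_sendspace, factor=10):
    """udp_recvspace as a multiple of udp_sendspace (10x per the guideline above)."""
    return udp_sendspace * factor


def sb_max_for(buffer_sizes, multiplier=2):
    """sb_max as a multiple of the largest TCP or UDP send/receive setting."""
    return max(buffer_sizes) * multiplier


# Illustrative settings; only the UDP figures come from the commands shown earlier.
settings = {"tcp_sendspace": 262144, "tcp_recvspace": 262144, "udp_sendspace": 65536}
settings["udp_recvspace"] = udp_recvspace_for(settings["udp_sendspace"])

print(settings["udp_recvspace"])      # 655360
print(sb_max_for(settings.values()))  # 1310720
```

Passing multiplier=4 gives the more generous figure if you prefer the four-times guideline.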
Another useful no tunable, tcp_nodelayack, prompts TCP to send an
immediate rather than a delayed acknowledgment. Although sending an
immediate acknowledgment can add more overhead in some environments,
it can greatly improve network performance in others. If changing this
parameter does not improve performance in your environment, you can
quickly change it back.
Let’s also review ipqmaxlen. This tunable controls the length of the IP input queue. If you see an overflow counter (using netstat -s), increasing the maximum length of this queue can help fix the overflow.
What about Address Resolution Protocol (ARP)? When many clients are
connected to the system, you might want to tune the ARP cache. You can
examine the relevant statistics using netstat:
root@lpar37p682e[/etc/tunables] > netstat -p arp
arp:
10 packets sent
0 packets purged
If you see a high purge count, increase the size of the ARP table. In the
preceding example, no increase is needed.
Here are the no parameters that relate to arp:
root@lpar37p682e[/etc/tunables] > no -a | grep arp
arpqsize = 12
arpt_killc = 20
arptab_bsiz = 7
arptab_nb = 149
You can tune the TCP buffer tunables (such as tcp_sendspace, tcp_recvspace, and rfc1323) either systemwide or for specific interfaces. To tune by interface, set the no command’s use_isno option to 1 (this option is enabled by default in AIX 5.3):
root@lpar37p682e[/etc/tunables] > no -a | grep use
use_isno = 1
Disabling the use_isno parameter (by setting it to 0) can serve as a diagnostic tool of sorts by setting the buffer values across the board to help
isolate performance problems. When these values are set for the specific
interfaces, they actually override the default value in the no view, which
can sometimes confuse system administrators. You can view specific interface settings using either ifconfig or lsattr:
# ifconfig en0
en0: flags=1e080863,480<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,
GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),CHAIN>
inet 172.29.135.44 netmask 0xffffc000 broadcast 172.29.191.255
tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1
In this example, look at the settings using ifconfig (see the last line, which references a couple of the tunables mentioned earlier). You can change these options (by interface) using SMIT or the chdev or ifconfig command. Note that ifconfig will not update the Object Data Manager (ODM), so on reboot, the settings will revert to their previous values. For this reason, you should use SMIT. Use the smit tcpip fastpath, and go to Further
configuration > Network interfaces > Change/Show characteristics of
an interface.
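Because interface-specific values override what no -a reports, it can help to pull them out of the ifconfig output and compare them side by side. This is a rough sketch with a hypothetical helper name (isno_settings); it assumes the name/value layout of the last line of the ifconfig example above.

```python
import re

# Interface-specific tunables that appear in the ifconfig example above.
ISNO_NAMES = ("tcp_sendspace", "tcp_recvspace", "rfc1323")


def isno_settings(ifconfig_output):
    """Extract interface-specific tunables from ifconfig text as {name: int}."""
    found = {}
    for name in ISNO_NAMES:
        match = re.search(r"\b" + name + r"\s+(\d+)", ifconfig_output)
        if match:
            found[name] = int(match.group(1))
    return found


sample = (
    "inet 172.29.135.44 netmask 0xffffc000 broadcast 172.29.191.255\n"
    "tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1\n"
)

print(isno_settings(sample))
# {'tcp_sendspace': 262144, 'tcp_recvspace': 262144, 'rfc1323': 1}
```

An empty dictionary would suggest that the interface carries no overrides and the systemwide values apply.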
Name Resolution
Name resolution is another area that can impact performance. If you know
how you want to resolve names (using either DNS or the hosts file), make
sure name resolution is set up correctly in the /etc/netsvc.conf file. If
you’re using DNS, remove the local entry if you are not using a hosts file at all, or leave it in if you are using the hosts file as a backup to DNS (but make it the second entry). If you’re not using DNS, remove the bind entry because it will slow
performance by first trying (if it is the first entry in the record) to resolve
using a name server that doesn’t exist.
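The ordering rules above can be written as a small decision function. This is only a sketch: hosts_order is a hypothetical helper, and it assumes the usual "hosts = ..." line format of /etc/netsvc.conf with the bind and local keywords discussed above.

```python
def hosts_order(use_dns, use_hosts_file):
    """Build the hosts line for /etc/netsvc.conf from the rules described above."""
    sources = []
    if use_dns:
        sources.append("bind")   # DNS first when it is in use
    if use_hosts_file:
        sources.append("local")  # hosts file alone, or as a backup after DNS
    if not sources:
        raise ValueError("at least one name-resolution source is required")
    return "hosts = " + ", ".join(sources)


print(hosts_order(use_dns=True, use_hosts_file=True))    # hosts = bind, local
print(hosts_order(use_dns=False, use_hosts_file=True))   # hosts = local
```

Verify the resulting line against your own environment before placing it in the file.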
Maximum Transfer Unit
The maximum transfer unit (MTU) is defined as the largest packet that
can be sent over a network. The size depends on the type of network. For
example, 16 Mbps token ring has a default MTU size of 17,914, while Fiber
Distributed Data Interface (FDDI) has a default size of 4,352. Ethernet’s
default size is 1,500 (or 9,000 with jumbo frames enabled). Larger packets
mean fewer packet transfers, which results in higher bandwidth utilization on your system. An exception to this rule is if your application prefers
smaller packets.
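To see why larger packets mean fewer transfers, it helps to count packets for a fixed amount of data. The sketch below is a back-of-the-envelope estimate, not a measurement: packets_needed is a hypothetical helper, and the 40-byte overhead assumes plain IPv4 and TCP headers with no options.

```python
def packets_needed(payload_bytes, mtu, header_bytes=40):
    """Estimate packets required to carry payload_bytes at a given MTU."""
    per_packet = mtu - header_bytes           # usable payload per packet
    return -(-payload_bytes // per_packet)    # ceiling division


for mtu in (1500, 9000):
    print(mtu, packets_needed(1_000_000, mtu))
# 1500 685
# 9000 112
```

Under these assumptions, jumbo frames move the same megabyte in roughly one sixth of the packets.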
If you’re using Gigabit Ethernet, you can use the jumbo frames option. To support the use of jumbo frames, your switch must be configured accordingly. To change to jumbo frames, use the smit device fastpath and go to Communication > Ethernet > Adapter > Change / Show characteristics of an Ethernet adapter. You can make the change from there.
Tuning: Client
The biod daemon plays an important role in connectivity. While biod
self-tunes the number of threads (the daemon process creates and kills
threads as needed), you can adjust the maximum number of biod threads,
depending on the overall load. An important concept to understand here is
that increasing the number of threads alone will not alleviate performance
problems caused by CPU, I/O, or memory bottlenecks. For example, if
your CPU is near 100 percent utilization, increasing the number of threads
won’t help you at all.
Increasing the number of threads can help when multiple application
threads access the same files and you don’t find any other types of bottlenecks. Using the lsof command can help you further determine which
threads are accessing which files. From earlier tuning sections, you might
remember the Virtual Memory Manager parameters minperm and maxperm. Unlike when you tune database servers, with NFS you want to
let the VMM use as much RAM as possible for NFS data caching. Most
NFS clients have little need for working segment pages. To ensure that all
memory is used for file caching, set both maxperm and maxclient to 100
percent:
root@lpar24ml162f_pub[/tmp] > vmo -o maxperm%=100
Setting maxperm% to 100
root@lpar24ml162f_pub[/tmp] > vmo -o maxclient%=100
Setting maxclient% to 100
Note that if your application uses databases and could benefit from performing its own file data caching, you should not set maxperm and maxclient to 100 percent. In this situation, set these numbers low and mount your file systems using concurrent I/O over NFS.
NFS maintains caches on each client system that contain attributes of the most recently accessed files and directories. The mount command controls the length of time that these entries are kept in cache.
The mount parameters you can change include the following: acdirmin, acdirmax, acregmin, acregmax, and actimeo. For example, the acregmin
parameter specifies the minimum length of time after an actual update that
file entries will be retained. When a file is updated, its removal from cache
depends on this parameter’s value.
Using the mount command, you can also specify whether you want a hard or soft mount. With a soft mount, if an error occurs, it is reported immediately to the requesting program; with a hard mount, NFS keeps retrying. These retries themselves could lead to performance problems. From a reliability standpoint, hard mounting read-write directories is recommended to prevent possible data corruption.
Mount parameters rsize and wsize define the maximum sizes of RPC
packets for read and write requests, respectively. The default value is
32,768 bytes. With NFS 3 and 4, if your NFS volumes are mounted on
high-speed networks, you should increase this setting to 65,536. On the
other hand, if your network is extremely slow, you might think about decreasing the default to reduce the amount of packet fragmentation by sending shorter packets. However, if you do decrease the default, more packets
will need to be sent, which could increase overall network utilization.
Understand your network, and tune it accordingly!
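The trade-off between transfer size and request count is easy to quantify. The following sketch uses hypothetical helper names; hard, soft, rsize, wsize, and acregmin are the mount options named above, while the acregmin value and the 1 MB file size are purely illustrative.

```python
def nfs_mount_options(rsize, wsize, hard=True, **extra):
    """Assemble a comma-separated option string for mount -o."""
    opts = ["hard" if hard else "soft", f"rsize={rsize}", f"wsize={wsize}"]
    opts += [f"{key}={value}" for key, value in sorted(extra.items())]
    return ",".join(opts)


def rpcs_needed(file_bytes, transfer_size):
    """Number of read or write RPCs needed to move file_bytes (ceiling division)."""
    return -(-file_bytes // transfer_size)


print(nfs_mount_options(65536, 65536, acregmin=3))
# hard,rsize=65536,wsize=65536,acregmin=3

for size in (32768, 65536):
    print(size, rpcs_needed(1048576, size))
# 32768 32
# 65536 16
```

Doubling rsize halves the number of requests for the same file, which is the benefit on fast networks; on slow or lossy links, each larger request is more likely to fragment.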
Tuning: Server
Before examining specific NFS parameters, always try to decrease the
load on the network while also looking at your CPU and I/O subsystems.
CPU bottlenecks often contribute to what appears to be an NFS-specific
problem. For example, NFS can use either TCP or UDP, depending on
the version and your preference. Make sure your tcp_sendspace and
tcp_recvspace tunables are set to values higher than the defaults because
this can have an impact on your server by increasing network performance.
You tune these values with the no command:
root@lpar24ml162f_pub[/tmp] > no -a | grep send
ipsendredirects = 1
ipsrcroutesend = 1
send_file_duration = 300
tcp_sendspace = 1638
udp_sendspace = 9216
root@lpar24ml162f_pub[/] > no -o tcp_sendspace=524288
Setting tcp_sendspace to 524288
Change to tunable tcp_sendspace, will only be effective for future connection
If you are running Version 4 of NFS, make sure you turn on rfc1323.
Doing so allows for TCP window sizes greater than 64K. Set this value on
the client as well.
root@lpar24ml162f_pub[/] > no -o rfc1323=1
Setting rfc1323 to 1
As an alternative, you can set the rfc1323 tunable using the nfso command, which manages the NFS tuning parameters:
root@lpar24ml162f_pub[/] > nfso -o nfs_rfc1323=1
Setting nfs_rfc1323 to 1
Setting rfc1323 with nfso configures the TCP window to affect only NFS
(as opposed to no, which applies this setting across the board). If you
have already set this option with no, you don’t need to change it, although
you might want to in case some other Unix administrator decides to play
around with the no command.
Similar to the client, if the server is a dedicated NFS server, make sure you
tune your VMM parameters accordingly. Set maxperm and maxclient to 100 percent so that the VMM uses as much memory as possible for caching file pages.
On the server, tune nfsd, which is multithreaded, the same way you tuned
biod. (Other daemons you can tune include rpc.mountd and rpc.lockd.)
Like biod, nfsd self-tunes, depending on the load. Increase the number
of threads using the nfso command. One parameter to check is nfs_max_read_size, which sets the maximum size of RPCs for read replies. Look at
what nfs_max_read_size is set to below:
root@lpar24ml162f_pub[/tmp] > nfso -L nfs_max_read_size
NAME                  CUR    DEF    BOOT   MIN    MAX    UNIT    TYPE   DEPENDENCIES
---------------------------------------------------------------------------
nfs_max_read_size     32K    32K    32K    512    64K    Bytes   D
Let’s increase it to 64K (using bytes):
root@lpar24ml162f_pub[/tmp] > nfso -o nfs_max_read_size=65536
root@lpar24ml162f_pub[/tmp] > nfso -L nfs_max_read_size
NAME                  CUR    DEF    BOOT   MIN    MAX    UNIT    TYPE   DEPENDENCIES
---------------------------------------------------------------------------
nfs_max_read_size     64K    32K    32K    512    64K    Bytes   D
We just changed nfs_max_read_size to the maximum value allowed. If you want to keep the new values, add your changes to the /etc/tunables/nextboot file so that the settings will remain changed after a reboot.
The nfso command offers additional parameters you can modify. To list them all, use
the –a or –L flag.
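When scanning -L output for many tunables, it is convenient to see how far each current value sits below its maximum. The sketch below is an illustration with hypothetical helper names; it assumes the column order NAME CUR DEF BOOT MIN MAX shown earlier and treats the K suffix as 1,024 bytes, which matches 64K being entered as 65536 in the example.

```python
def to_bytes(value):
    """Convert strings such as '512', '32K', or '1M' to a byte count."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = value[-1].upper()
    if suffix in units:
        return int(value[:-1]) * units[suffix]
    return int(value)


def headroom(row):
    """Return (name, current, maximum, maximum - current) for one -L data row."""
    fields = row.split()
    name = fields[0]
    current = to_bytes(fields[1])   # CUR column
    maximum = to_bytes(fields[5])   # MAX column
    return name, current, maximum, maximum - current


print(headroom("nfs_max_read_size 32K 32K 32K 512 64K Bytes D"))
# ('nfs_max_read_size', 32768, 65536, 32768)
```

A headroom of zero, as in the second listing above, means the tunable is already at its ceiling.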
Section V
Summary, Tips, and Quiz
Summary
● The OSI model consists of the following layers: physical, data link, network, transport, session, presentation, and application. The AIX TCP/IP layers correlate to the layers in the OSI model.
● The maximum transfer unit (MTU) is the largest packet that can be sent over a network. Ethernet has a default size of 1,500 (or 9,000 with jumbo frames enabled).
● The lscfg command should be used to obtain information about firmware.
● The netstat command is one of the most common commands you will use to monitor your system. The entstat command is very similar; you use it to display device driver statistics.
● The netpmon command provides information about CPU usage as it relates to the network. The command starts a trace and runs in the background, providing an overview of network activity and capturing data for trending and analysis. It can also provide information to monitor read and write subroutines for Network File System (NFS) clients and servers.
● nfsstat is the monitoring tool you will use to display information about NFS and remote procedure calls (RPCs). Other commands for NFS include netpmon, nfs, nfs4cl, nfsstat, nmon, and topas.
● The nfs4cl command provides NFS 4 statistics and properties. You can also tune using this command. One option is to set the timeout value for RPC calls to the server.
● You use the mount and nfso commands to tune NFS parameters. Use mount to tune server-based resources.
● The netstat command lets you monitor and troubleshoot network packet issues.
● You use the no command to tune the network subsystem. tcp_sendspace and udp_sendspace are important no parameters you should examine.
● Setting up DNS improperly can cause performance problems because you may not be resolving names correctly.
● Virtual Ethernet, supported on AIX 5.3 on POWER5, supports interpartition- and IP-based communications between logical partitions on the same frame. This functionality is accomplished through the use of a virtual I/O switch. Shared Ethernet, one of the features of Advanced Power Virtualization (APV) or PowerVM, enables the use of virtual I/O Servers (VIOs), letting several host machines share a single physical network adapter.
● The iptrace command records all the packets received from the network interfaces. The ipreport command formats the data generated from iptrace into a readable trace report. You can also use ipfilter to sort the output file created from ipreport.
● The tcpdump command prints the headers of packets captured for each network interface card (NIC). One important difference with tcpdump is that, unlike iptrace, it can look at only one network interface at a time. And because iptrace examines the entire packet from kernel space, it can result in many dropped packets. With tcpdump, you can also limit the amount of data to be traced.
Tips
● Although netstat is very useful, it is really not a monitoring tool in the sense that vmstat and iostat are. You can use other, more suitable tools to help monitor your network subsystem.
● Ethernet has a default size of 1,500 (9,000 with jumbo frames enabled). Larger packets require fewer packet transfers, resulting in higher bandwidth utilization on your system. An exception to this rule is if an application prefers smaller packets. If you’re using Gigabit Ethernet, you can use the jumbo frames option. To support the use of jumbo frames, your switch must be configured accordingly.
● Virtualization is a wonderful thing, but be careful not to share too many adapters from your VIO server, or you might pay a large network I/O penalty. Use the appropriate monitoring tools so you’ll know whether you have a problem.
● From a client perspective, NFS file systems use disks that are remotely attached. Anything that affects the performance of the mounted disk will affect the performance of the NFS clients.
● The maximum number of biod threads that can be tuned depends on the overall load. Increasing the number of threads alone won’t alleviate performance problems caused by CPU, I/O, or memory bottlenecks. For example, if your CPU is near 100 percent utilization, increasing the number of threads won’t help at all. Increasing the threads can help when multiple application threads access the same files and you don’t find any other types of bottlenecks. The lsof command can help you determine which threads are accessing which file.
● With NFS, you want to let the Virtual Memory Manager use as much RAM as possible for NFS data caching. Most NFS clients have little need for working segment pages. To ensure that all memory is used for file caching, set both maxperm and maxclient to 100 percent.
● The rfc1323 tunable enables the TCP window scaling option, which lets TCP use a larger window size. Turn on this option to enable the best TCP performance.
● If you’re using DNS, remove the local entry from /etc/netsvc.conf if you are not using a hosts file at all; if you are using the hosts file as a backup to DNS, keep it but make it the second entry. If you’re not using DNS at all, remove the bind entry because it will only slow your performance by trying (if it is the first entry in the record) to resolve using a name server that doesn’t exist.
Quiz
Multiple Choice
1. entstat is used to
a. Tune the Ethernet controller
b. Display device driver statistics
c. Provide ARP information
d. There is no such command.
2. What does the rfc1323 tunable do?
a. It provides information about mbufs.
b. It’s a generic text file.
c. It enables the TCP window scaling option, letting TCP use a larger
window.
d. It relates to UDP and increasing the packet size dynamically.
3. What is the packet size with jumbo frames enabled?
a. 1,500
b. 9,000
c. 90,000
d. 450
4. netpmon is used for what purpose?
a. To provide information about CPU usage as it relates to the network
b. To provide information about RAM usage as it relates to the network
c. To trace zombie processes
d. To waste time
5. If you want to keep network values after a reboot, which file must you
update?
a. /etc/config
b. /bin/tunablesc
c. /etc/tunables/nextboot
d. /tmp/configuration/tunable
6. Which of the following tunables controls how much buffer space in the
kernel is used to buffer application data?
a. tcp_send
b. tcp_sendtrace
c. tcpsendtcpip
d. tcp_sendspace
7. Which command lets you tune udp_sendspace?
a. ioo
b. no
c. nfso
d. netstat
True or False
8. Shared Ethernet, supported on AIX 5.3 on POWER5, allows for
interpartition- and IP-based communications between logical
partitions on the same frame. This is done by the use of a virtual I/O
switch.
9. NFS can use either TCP or UDP, depending on the version and your
preference.
Fill in the Blank
10. Which command is not used for NFS: nmon, nfsstat, nfs, nfs4cl, netpmon, or nfsfr?
___________________________________________
Section VI
Bonus Topics
Just when you thought you understood performance tuning on AIX, here
comes AIX 6.1 to throw you a curveball! In this section, we discuss up-to-date information about the recent changes to performance monitoring
and tuning in AIX 6.1, including CPU, virtual memory, and I/O (disk and
network). We also review AIX performance tuning as it relates to Oracle.
The last chapter provides an overview of systems performance when running Linux on Power (LoP).
Chapter 16: AIX 6.1
Many of the changes in AIX 6.1 are really less about kernel innovations
and more about ancillary features, such as improving default parameters
to more accurately reflect real-world data processing. Other enhancements
include restricted tunables, unique tunable documentation (a useful feature
that provides help messages via the new –h option for the tunable commands, including ioo, nfso, no, raso, schedo, and vmo), and various other
improvements to certain subsystems.
Introduction
AIX 6.1 provides many important innovations and improvements, including enhancements in the following categories:
● Virtualization: Features such as workload partitioning and Live Application Mobility
● Security: Features such as encrypted file systems, trusted AIX, and role-based access control (RBAC)
● Availability: Features such as AIX concurrent updates and dynamic tracing
● Manageability: Features such as the new Systems Director Console for AIX and the Workload Partition Manager
The 6.1 release also provides support for POWER6 performance innovations, such as advanced simultaneous multithreading, shared dedicated
processors, and variable page size. It’s important to fully understand which
innovations and enhancements are a reflection of POWER6, AIX 6.1, or a
combination of both. For example, from a purely operating system perspective, AIX improves on the older tunable defaults for the aio, ioo, nfso,
no, schedo, and vmo commands.
Although AIX 6.1 includes some real performance enhancements, such as
improvements in I/O pacing and AIX’s implementation of asynchronous
I/O (AIO) servers, I must say that there is nothing breathtakingly different. In fact, IBM made more performance changes from AIX 5.1 to AIX
5.2 and from 5.2 to 5.3 (including new monitoring and tuning tools, new
tunables that changed how you set Virtual Memory Manager settings, and
concurrent I/O improvements) than you will see in moving from AIX 5.3
to AIX 6.1. In AIX 6.1, all the tuning commands remain the same (except
for those that have been taken away, such as aioo), and there are no new
monitoring tools. Several existing utilities were also updated to support workload partitions; the updated utilities include curt, filemon, iostat, netpmon, pprof, procmon, proctree, svmon, topas, tprof, and vmstat.
Workload partitions (WPARs) enable the use of separate virtual partitions
within one AIX image. This feature is more of a complement to logical
partitions (LPARs) than a replacement for them. WPARs actually run inside LPARs and are similar in concept to Solaris containers.
I’ve built WPARs in less than 15 minutes. In fact, we’ll do some of our
analysis inside WPARs so that you can actually view some of the updated
tools that now support WPARs. Note that WPARs require AIX 6.1 but do not require POWER6 hardware. Some commands also run differently or don’t run at all within WPARs; we’ll discuss a few of these where applicable.
Memory
Through the years, many have complained about the default VMM parameters of AIX. The complaints have been that the default parameters just
haven’t reflected the reality of what most users run on top of AIX — for
example, mission-critical database applications such as Oracle. Because
of this, systems administrators have had to change the default settings on
many subsystems — most notably those related to virtual memory (i.e.,
minperm and maxperm). IBM engineering has listened and in AIX 6.1 has
changed the parameters to reflect that reality. Note that you shouldn’t rely
exclusively on these settings. Further, always check with your ISV to verify
its recommended settings for AIX 6.1; then make changes accordingly.
The most important changes to default settings were made to address paging issues, where database servers frequently page out computational pages
even though the system has enough free memory. In the AIX 5.3 memory
tuning discussion, I recommended changing the relevant parameters to
defaults fairly close to what was indicated on the table. The changes are
indicated in the AIX 5.3 tuning recommendation column.
In AIX 6.1, IBM now classifies many tunables as “restricted” in an attempt to discourage junior administrators from changing parameters it deems critical. The net result is that you can change only 29 vmo tunables without receiving a firm warning; 30 others are now deemed restricted tunables, which IBM officially states should not be modified unless instructed by “IBM support professionals.”
A new vmo flag, –F, lets you view all the parameters, including the restricted ones. The following snippet of content includes an example from
the restricted section.
# vmo -F -a
force_relalias_lite = 0
vmm_default_pspa = -1
##Restricted tunables
maxperm% = 90
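If you capture this output in a script, the ##Restricted tunables marker makes it straightforward to separate the two groups. The following is a minimal sketch with a hypothetical function name (split_tunables); it assumes every tunable appears as "name = value" and that everything after the marker line is restricted, as in the snippet above.

```python
def split_tunables(output):
    """Split 'name = value' lines into (unrestricted, restricted) dictionaries."""
    normal, restricted = {}, {}
    target = normal
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("##Restricted tunables"):
            target = restricted   # every later entry is restricted
            continue
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        target[name.strip()] = value.strip()
    return normal, restricted


sample = """\
force_relalias_lite = 0
vmm_default_pspa = -1
##Restricted tunables
maxperm% = 90
"""

normal, restricted = split_tunables(sample)
print(sorted(restricted))  # ['maxperm%']
```

A change script could consult the restricted dictionary before touching a tunable, so that a modification IBM discourages is never made by accident.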
Even restricted tunables can be changed. If you make such a change, you
just receive a stern warning:
# vmo -o maxperm%=99
Setting maxperm% to 99
Warning: a restricted tunable has been modified
When a restricted parameter is changed after a reboot, you’ll receive a further rebuke and be asked to confirm whether you really want to make the
change. You’ll have to physically type in “yes” to reply.
The most important out-of-the-box performance changes related to memory include new values for minperm, maxperm, maxclient, and strict_maxclient. This update is a continuation of changes that first appeared in AIX
5.3, when you no longer had to turn off strict_maxclient, increase minfree
and maxfree, or reduce minperm, maxperm, and maxclient. The new recommendation (now incorporated as the default value in AIX 6.1) is to turn
off the repage ratio check (lru_file_repage) to ensure that working storage
is not paged and to consider only file paging.
In AIX 6.1, the VMM replacement default is changed to use up to 90 percent of its real memory for file caching, favoring computational pages over
file pages. Unless the amount of active virtual memory exceeds 97 percent
of the size of real memory, minperm is reduced to 3 percent to ensure that
computational pages will not be stolen. Let’s try changing it:
# vmo -o minperm%=97
Value of the tunable minperm% cannot be changed in a WPAR
As you can see from the error message, some changes will not work in
WPARs. WPARs are a subset of an LPAR, but they are still part of the
single operating system image.
Another important change includes VMM dynamic variable page size
support (VPSS). Pages are defined as fixed-length data blocks held in
virtual memory. In AIX 6.1 (on POWER6 processors only), VMM can
now dynamically use the larger page size based on application memory
usage, which should substantially improve performance. This feature is
completely transparent to applications. AIX uses the larger page size only
if doing so does not result in increasing process memory usage. The use of
larger pages improves performance because fewer hardware address translations need to be made. This feature is supported only for working storage
memory, not persistent storage. The new parameter is vmm_default_pspa
(it works in conjunction with the existing vmm_mpsize_support tunable).
Let’s view the tunable setting for VPSS:
# vmo -a | grep pspa
vmm_default_pspa = -1
CPU
In AIX 6.1, only 27 of the schedo command’s 42 CPU-related tunables
are restricted, leaving 15 parameters that you can modify without explicit
warnings. Although some defaults have changed, no substantial changes
have been made with respect to CPU monitoring and tuning in AIX 6.1.
Disk I/O
Of the 48 tunables you can control with the ioo I/O tuning command, 27
are now restricted, leaving 21 that you can modify without explicit warnings. The most important changes affect I/O pacing and AIO dynamic
tunables.
JFS2
AIX 6.1 brings changes to the Enhanced Journaled File System (JFS2) that
let you mount a JFS2 file system without logging. Although this capability
can improve performance substantially, I don’t recommend implementing
it. If you do so and then at some point need to recover your data, you’ll
have to use the dreaded fsck command, which has been pretty much banished from memory since the advent of journaling file systems. Circumstances in which the capability might come in handy include restoring data
from backups and saving time during an activity where you might have a
very small window and availability is not a concern.
iSCSI
The target software driver can now be used over a Gigabit Ethernet
adapter, which should improve performance in this type of environment.
The target driver exports local disks or logical volumes to Internet Small
Computer System Interface (iSCSI) initiators that connect to AIX using the
iSCSI protocol. The proliferation of iSCSI represents a viable alternative to
fiber-based storage, making this an important enhancement.
I/O Pacing
Disk I/O pacing is a mechanism that lets you limit the number of pending I/O requests to a file, thereby preventing disk I/O-intensive processes
(usually in the form of large sequential writes) from exhausting the CPU.
AIX 6.1 enables I/O pacing by default. In AIX 5.3, you must explicitly enable this feature. The new defaults set the sys0 settings of the minpout and maxpout parameters to 4096 and 8193, respectively.
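The two values act as low-water and high-water marks on the pending writes for a file. The toy model below is only a conceptual sketch, not how the kernel is implemented: the class name and methods are hypothetical, and it assumes a writer is suspended once pending writes reach the high-water mark and resumes when they drain to the low-water mark.

```python
class PacedFile:
    """Toy model of I/O pacing for a single file."""

    def __init__(self, minpout=4096, maxpout=8193):
        self.minpout = minpout        # low-water mark
        self.maxpout = maxpout        # high-water mark
        self.pending = 0              # writes queued but not yet completed
        self.writer_blocked = False

    def queue_write(self):
        """Try to queue one write; return False if the writer must wait."""
        if self.pending >= self.maxpout:
            self.writer_blocked = True
        if self.writer_blocked:
            return False
        self.pending += 1
        return True

    def complete_write(self):
        """Record one completed write and wake the writer at the low-water mark."""
        if self.pending > 0:
            self.pending -= 1
        if self.writer_blocked and self.pending <= self.minpout:
            self.writer_blocked = False


# Small marks make the behavior easy to follow.
demo = PacedFile(minpout=2, maxpout=4)
print([demo.queue_write() for _ in range(5)])  # [True, True, True, True, False]
```

The gap between the two marks is what keeps a large sequential writer from waking up and stalling again on every single completion.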
Asynchronous I/O
AIO is an AIX software subsystem that permits processes to issue I/O
operations without waiting for I/O to finish. Because I/O operations and
application processing operate concurrently, they essentially run in the
background and improve performance. This advantage is particularly important in a database environment.
There are two types of AIO subsystems: Legacy AIO and POSIX AIO. The
differences between them involve different parameter passing at the application layer. In other words, the developers pick the implementation that
the application uses. Regardless of which subsystem is chosen, both run
concurrently on AIX. In AIX 5L, if applications use AIO, the subsystem
must be explicitly activated in the autoconfig parameter. The system also
requires a reboot because the kernel extensions must be loaded. In fact, any
release before AIX 5.3 TL_5 requires reboots if any changes are made to
the following tunables: maxreqs, maxservers, and minservers.
AIX 5.3 provided the aioo command, which lets you make these changes
dynamically without a reboot (decreasing required reboots). This command
does not change the Object Data Manager (ODM) attributes, meaning that
changes will not persist across a reboot. In AIX 6.1, the tunables fastpath
and fsfastpath are now restricted and are set to 1 by default. The new setting has the following effect on the tunables:
● fastpath: AIO requests to raw logical volumes are passed directly to the disk layer.
● fsfastpath: AIO requests for files opened with concurrent I/O on JFS2 are passed directly to the Logical Volume Manager or to disk.
##Restricted tunables
aio_fastpath = 1
aio_fsfastpath = 1
Further, AIO subsystems are now loaded by default but not activated.
They are started automatically when the application initiates the AIO I/O
requests. AIX 6.1 no longer provides the aioo command (what a short life
span), and these tunables are now used only with ioo.
The old method (AIX 5.3):
# aioo -a
minservers = 1
maxservers = 1
maxreqs = 4096
fsfastpath = 0
The new method with AIX 6.1:
# ioo -a | grep active
aio_active = 0
posix_aio_active =
It’s worth noting that there are no more AIO devices in the ODM. Two new parameters have also been added to ioo: aio_active and posix_aio_active. These settings can be changed only by AIX, and they are set to 1 only
when AIO kernel extensions are used and pinned. If you do a grep, you
won’t find any more AIO servers. You’ll now see aioLpools and aioPpools,
the kernel processes that manage the AIO subsystems for Legacy and
POSIX. As a result of this change, there is less pinned memory and fewer
processes running on the system; both have positive effects on overall
systems performance.
Here’s a look at the new AIO kernel processes:
# pstat -a | grep aio
39 a   2704e   1   2704e   0   0   1   aioLpool
40 a   28050   1   28050   0   0   1   aioPpool
The minservers and maxservers parameters, as they relate to AIO servers,
are now per-CPU tunables. Changing these values will not result
in changes to the number of available servers on the system; that number
depends on the number of concurrent I/O requests.
The following shows the new default values for minservers and
maxservers:
# ioo -a | grep minservers
aio_minservers = 3
posix_aio_minservers = 3
# ioo -a | grep maxservers
aio_maxservers = 30
posix_aio_maxservers = 30
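Because the limits are now per CPU, the ceiling for the whole partition scales with the processor count. Here is a rough sketch of that arithmetic; the function name aio_server_ceiling is my own invention (it is not an AIX command), and the two-CPU partition is just an assumed example.

```shell
#!/bin/sh
# Hypothetical helper (not an AIX command): upper bound on AIO servers
# for one AIO flavor (legacy or POSIX), given a per-CPU limit.
aio_server_ceiling() {
    per_cpu=$1   # for example, the aio_maxservers value reported by ioo
    ncpu=$2      # CPU count the per-CPU limit is multiplied by (assumed)
    echo $(( per_cpu * ncpu ))
}

# Default of 30 servers per CPU on an assumed two-CPU partition
aio_server_ceiling 30 2
```

This is only a ceiling. As noted above, how many servers actually run depends on the number of concurrent I/O requests.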
Network
Of the 133 no command tunables, IBM has classified only these five as
restricted:
# no -F -a
##Restricted tunables
extendednetstats = 0
inet_stack_size = 16
net_malloc_police = 16384
pseintrstack = 24576
use_isno = 1
A new network caching daemon has also been introduced to improve
performance when resolving names using Domain Name Server (DNS).
You can start this daemon from the System Resource Controller (SRC). Its
main configuration file is /etc/netcd.conf, and you can copy the one in
/usr/samples/tcpip to /etc and use that as a template.
The command used to manage the daemon is netcdctrl. With this command, you can dump the cache contents to a file, display cache usage statistics, flush the cache table, and change the logging level of the daemon.
Nothing has changed regarding the /etc/netsvc.conf file; it still
determines the order in which name-resolution methods are tried.
NFS
Of the 24 Network File System (NFS) tunables, IBM has classified 21 as
restricted. The only noteworthy change here is that RFC1323 (on the TCP/IP
stack) is now enabled by default, letting NFS connections use TCP
window scaling. The default number of biod daemons has also increased to 32 for each NFS Version 3 mount point.
Section VI
Chapter 16 Quiz
Multiple Choice
1. Which version of AIX started restricting tunables?
a. 4
b. 5
c. 5
d. 6.1
2. What has replaced AIO servers?
a. I/O pacing
b. Lpools and Ppools
c. aioLpools and aioPpools
d. mbuf
3. Which command has been taken away in AIX 6?
a. vmtune
b. ioo
c. aioo
d. aix
4. Why did IBM institute restricted tunables?
a. To decrease cache that was taking up space.
b. No reason.
c. To make things harder for people.
d. To discourage junior administrators from changing parameters.
5. In AIX 6.1 (on POWER6 processors only), VMM can now dynamically
use the larger page size based on application memory usage. Why is
this enhancement important?
a. It will increase availability.
b. It should improve performance.
c. It will decrease performance.
d. It will help in DLPAR operations.
6. AIX 6.1 introduces changes to the Enhanced Journaling File System
(JFS2) that let you mount a JFS2 file system without logging. What is
the effect of doing so?
a. It increases performance while possibly decreasing availability.
b. It decreases performance while increasing availability.
c. It increases performance and reliability.
d. No change.
7. Which is the vmo flag that provides all parameters, including restricted
ones?
a. -l
b. -v
c. -o
d. -F
True or False
8. AIX 6.1 enables I/O pacing by default. In AIX 5.3, you need to explicitly enable this feature.
9. Disk I/O pacing is a mechanism that lets you limit the number of pending I/O requests to a file.
Fill in the Blank
10. Explain the purpose of netcdctrl:
_______________________________________
Chapter 17
Tuning AIX for Oracle
This chapter provides an overview of running Oracle on AIX. We’ll drill
down into the many aspects of tuning AIX to run Oracle, examining
memory, CPU, and I/O (both disk and network). We’ll discuss in detail the
Virtual Memory Manager and the tuning commands used to tune memory
for Oracle. I’ll go over some of the tools you can use to analyze bottlenecks and make changes to the system. Last, we’ll look at a couple of
Oracle tools that can help you with performance tuning.
Many of the AIX tuning commands and parameters have changed
in recent years, and Oracle has changed, too. Changes have also been made to
tools such as the Oracle Enterprise Manager (OEM). As you’ll see, this
important utility is one you should definitely add to your repertoire and
take the time to learn.
Memory
As we discussed in earlier chapters, the AIX Virtual Memory Manager
services all memory requests from the system, not just virtual memory.
When RAM is accessed, the VMM must allocate space even when plenty
of physical memory remains on the box. This point confuses both DBAs
and systems administrators at times.
The VMM uses a process called early allocation of paging space,
and it partitions segments into pages. These pages can reside either in RAM
or in paging space (virtual memory stored on disk). At the same time, it
maintains a free list of unallocated page frames, which are used to satisfy
page faults. The VMM’s page-replacement algorithm assigns page frames
and determines exactly which virtual memory pages currently in RAM will
have their page frames brought back to the free list.
The AIX operating system will use all available memory, other than
memory that is configured to be unallocated — in other words, the free list.
Obviously, administrators prefer to use physical memory rather than paging space when the physical memory is available.
VMM classifies memory segments into two categories: persistent segments
and working segments. Persistent segments use file memory, and working segments use computational memory. What does this mean to us? It’s
the computational memory that is used when your SQL queries access the
Oracle database. These are working segments. They have no real permanent location and will terminate when the process is completed.
On the other hand, file memory uses persistent segments that do have
permanent locations on the disks. Persistent segments remain in memory,
usually until the pages are stolen or the database is recycled. Again, you
want the file memory paged to disk but not the computational memory.
How do you tune the system? One critical factor is the Translation
Lookaside Buffer (TLB). Applications such as Oracle use a tremendous
amount of virtual memory, so by using large pages you can increase performance substantially. Because each TLB entry then covers a 16 MB page instead of a 4 KB page, this effectively increases the size of the TLB and lets the system map
more virtual memory, resulting in a lower miss rate for applications, such
as Oracle, that use a lot of virtual memory. This category includes both
online transaction processing and data warehouse applications.
Oracle employs large pages for its System Global Area (SGA) because it is
the SGA that really dominates virtual memory. To reiterate, in AIX 5.3 and
later releases, you use vmo to tune; earlier releases used vmtune.
The following vmo command uses the lgpg_size and lgpg_regions parameters to configure 256 large pages of 16,777,216 bytes (16 MB) each. Because of the -r flag, the change takes effect at the next reboot. Note that each tunable needs its own -o flag:
# vmo -r -o lgpg_size=16777216 -o lgpg_regions=256
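As a quick sanity check on the sizing, the total memory set aside for large pages is simply the region count multiplied by the page size. The sketch below redoes that arithmetic in the shell; the function name lgpg_total_mb is made up for illustration and is not an AIX command.

```shell
#!/bin/sh
# Hypothetical helper (not an AIX command): total memory, in MB, set
# aside by a given lgpg_regions / lgpg_size pair.
lgpg_total_mb() {
    regions=$1      # value planned for lgpg_regions
    size_bytes=$2   # value planned for lgpg_size, in bytes
    echo $(( regions * size_bytes / 1024 / 1024 ))
}

# 256 regions of 16,777,216 bytes (16 MB) each
lgpg_total_mb 256 16777216
```

With the values used here the result is 4096 MB, so the partition should have at least 4 GB of memory to spare for the large page pool before you apply the setting.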
At the same time, with Oracle Database 10g, make sure the LOCK_SGA
Oracle initialization parameter is set to TRUE so that Oracle requests large
pages when allocating shared memory.
By far, the two most important vmo settings are minperm and maxperm.
These parameters determine whether the system favors computational
memory or file memory. The first thing to do here is make sure the
lru_file_repage parameter is set to 0. This parameter, which was introduced in ML1
of AIX 5.3, determines whether the page-stealing algorithm should consider
VMM repage counts, and so dictates the type of memory it should steal.
The default value for lru_file_repage is 1, so we need to change this setting using vmo:
# vmo -o lru_file_repage=0
Setting lru_file_repage to 0 tells the VMM that you want to steal only file
pages and not computational pages. Because this behavior will change if
numperm is less than minperm or greater than maxperm, we should also
set maxperm high and minperm very low. (Years ago, before the introduction of the lru_file_repage parameter, we used to make maxperm low. If
you did this now, you would limit file caching for the applications that are
currently running.)
Let’s change the relevant parameters:
# vmo -p -o minperm%=5
# vmo -p -o maxperm%=90
# vmo -p -o maxclient%=90
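To make the interaction between numperm, minperm, maxperm, and lru_file_repage easier to follow, here is a simplified model of the page-stealing decision. Treat it as a sketch only: the function name steal_target and the output labels are my own, the arguments are plain percentages of real memory, and the rules follow the commonly documented AIX 5.3 behavior rather than the actual VMM code.

```shell
#!/bin/sh
# Simplified model of which pages the VMM prefers to steal.
# steal_target is a hypothetical name; all arguments are integers.
steal_target() {
    numperm=$1; minperm=$2; maxperm=$3; lru_file_repage=$4
    if [ "$numperm" -gt "$maxperm" ]; then
        echo "file"                  # file cache is over its ceiling
    elif [ "$numperm" -lt "$minperm" ]; then
        echo "file+computational"    # file cache is under its floor
    elif [ "$lru_file_repage" -eq 0 ]; then
        echo "file"                  # the setting recommended here
    else
        echo "repage-counts"         # VMM weighs the repage counts
    fi
}

steal_target 40 5 90 0   # between the limits, lru_file_repage=0
steal_target 40 5 90 1   # between the limits, lru_file_repage=1
steal_target 3 5 90 0    # file cache below minperm
```

The wide gap between minperm%=5 and maxperm%=90 keeps numperm between the two limits most of the time, which is the range where lru_file_repage=0 protects computational pages.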
You also want to take a look at minfree and maxfree. When the number of pages on
the free list falls below minfree, the VMM starts to steal pages, and it keeps stealing until the free list grows back to maxfree. You don’t want that to happen too early, so beef up the free list by
raising both values. Use these values:
# vmo -p -o minfree=960
# vmo -p -o maxfree=1088
CPU
Let’s start our discussion of CPU performance and Oracle with simultaneous multithreading (SMT). This important POWER5 innovation lets a single physical processor concurrently dispatch
instructions from more than one hardware thread. In AIX 5L Version 5.3, a
dedicated partition created with one physical processor is configured as a
logical two-way when SMT is turned on. With Oracle, you should always have
SMT on:
# smtctl
This system is SMT capable.
SMT is currently enabled.
SMT boot mode is not set.
SMT threads are bound to the same virtual processor.
proc0 has 2 SMT threads.
Bind processor 0 is bound with proc0
Bind processor 1 is bound with proc0
A couple of other important concepts to keep in mind:
● Processor affinity lets processes run on specific processors. You can
actually bind specific processes to specific processors.
● The nice and renice commands change the priority of running processes. It is not recommended to renice Oracle processes.
Asynchronous I/O Servers
Asynchronous I/O (AIO) determines whether Oracle waits for I/O to
complete before starting new processing. With AIO, the system
continues processing while I/O completes in the background. Performance
improves significantly because processes can run at the same time that I/O
is going on. However, if tuned improperly, AIO can significantly degrade
the overall performance of writes on the I/O subsystem.
You can use the iostat or nmon command to monitor the AIO subsystem.
Let’s fire up iostat:
# iostat -A 1 5
System configuration: lcpu=2 drives=2 ent=0.25 paths=2 vdisks=2

aio: avgc avfc maxgc maxfc maxreqs avg-cpu: %user %sys %idle %iowait %physc %entc
        0    0   312     0    4096            3.1  7.1  89.8     0.0    0.0  16.7

Disks:    %tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk1        0.0      0.0      0.0          0         0
hdisk0        0.0      0.0      0.0          0         0
The following fields are used to monitor the AIO subsystem for the
specified interval:
● avgc: Average global AIO request count per second
● avfc: Average fastpath request count per second
● maxgc: Maximum global AIO request count since the last time this value was fetched
● maxfc: Maximum fastpath request count since the last time this value was fetched
● maxreqs: Maximum number of AIO requests allowed
In the preceding example, AIO servers are not a system bottleneck.
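One rough way to reach that conclusion is to compare the peak request count (maxgc) with the configured limit (maxreqs). The following sketch does that with the numbers from the iostat output; the function name aio_peak_pct is hypothetical, and reading "peak well below the limit" as "no bottleneck" is a rule of thumb rather than an official threshold.

```shell
#!/bin/sh
# Hypothetical helper: peak AIO requests as a whole-number percentage
# of the configured limit.
aio_peak_pct() {
    maxgc=$1     # maxgc column from iostat -A
    maxreqs=$2   # maxreqs column from iostat -A
    echo $(( 100 * maxgc / maxreqs ))
}

# Values taken from the iostat -A sample: maxgc=312, maxreqs=4096
aio_peak_pct 312 4096
```

Here the peak is only about 7 percent of maxreqs. If the figure crept toward 100, that would be the time to look at the AIO tunables more closely.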
Concurrent I/O
Concurrent I/O (CIO), introduced in AIX 5.2, is an important system capability that you should use in your Oracle environment. Similar to its predecessor, direct I/O, CIO lets file system I/O bypass the VMM and transfer
data directly between the user’s buffer and disk. CIO also permits multiple threads
to read and write data concurrently to the same file.
To turn on CIO, mount your file systems using the cio flag:
# mount -o cio /orafilesystem
Elements to consider when using CIO include:
● Raw devices: Although some Oracle DBAs like to create raw logical volumes for their data (and there is little argument about the performance benefit of doing so), in most cases this functionality is too
difficult to administer, and I’ve found that the Unix administrators
can talk the Oracle DBAs out of this one. With the advent of CIO, I
would not use raw logical volumes unless performance is the driving
factor behind everything you’re doing and you have a staff that can
manage the complexities inherent in this type of environment.
● Spreading the wealth: The more spindles you have, the more you
should spread your wealth around. The more adapters you have, the
more your performance will also increase. In addition, try to keep
indexes and redo logs off the same volumes as your data.
● Storage area network (SAN): Make sure you spend time looking at
your SAN. Optimizing the hardware will help you more than anything you can do at the operating system level.
Oracle Tools
Let’s look now at two Oracle-specific tools that can help you with your
AIX administration.
Statspack
Statspack is an Oracle performance diagnosis tool that I highly recommend
Unix administrators learn to use. Once you have it set up and configured,
which you do using SQL after Oracle is installed, it’s not that complicated
to use.
Statspack provides two basic collection options: level and threshold.
The level parameter controls the type of data collected from Oracle. The
threshold parameter acts as a filter for the collection of SQL statements
into the status summary tables.
To install Statspack, simply log on to the system as Oracle, start up sqlplus, and then follow the steps as instructed:
SQL*Plus: Release 10.1.0.2.0 - Production on Sun May 18 19:21:21 2008
Copyright (c) 1982, 2004, Oracle.  All rights reserved.
Enter user-name: system as sysdba
Enter password:
Connected to:
Oracle Database 10g Enterprise Edition Release 10.1.0.2.0 - 64bit Production
With the Partitioning, OLAP and Data Mining options
SQL> execute
SQL> @?/rdbms/admin/spcreate
Choose the PERFSTAT user’s password
-----------------------------------
Not specifying a password will result in the installation FAILING

Choose the Temporary tablespace for the PERFSTAT user
-----------------------------------------------------
Below is the list of online tablespaces in this database which can store
temporary data (e.g., for sort work areas). Specifying the SYSTEM tablespace
for the user’s temporary tablespace will result in the installation FAILING,
as using SYSTEM for workareas is not supported.
Choose the PERFSTAT user’s temporary tablespace.
Oracle Enterprise Manager
The Oracle Enterprise Manager (OEM) is a very useful and productive tool
that I’ve used for years. To use this Web-based utility, you need to make
sure you enable it when installing Oracle or creating a database using the
Oracle dbca utility. After the database is created, turn on OEM with this
command:
$ emctl start dbconsole
Then enter the following in your browser to access the tool:
http://lpar21ml16ed_pub:5505/em
There is so much you can monitor and tune within OEM that whole books
exist on this utility. If you are working in an Oracle environment, this is a
must-use system tool.
Figure 17.1 shows the graphical OEM display.
Figure 17.1: Oracle Enterprise Manager
Section VI
Chapter 17 Quiz
Multiple Choice
1. What does the following command do: emctl start dbconsole?
a. Starts VMM
b. Starts OEM
c. Shuts off the dbservice
d. Brings up kernel tuning mode
2. Which of the following was introduced in ML1 of AIX 5.3 and determines whether the VMM repage counts are considered?
a. lru_file_repage
b. vmm
c. LOCK_SGA
d. Translation Lookaside Buffer
3. You can monitor the AIO subsystem by using either iostat or which of
the following?
a. vmstat
b. svm
c. sar
d. nmon
4. Processor affinity enables processes to run
a. On specific processors.
b. Within the SGA.
c. In the hypervisor.
d. There is no such term.
5. Increasing the size of which of the following buffers lets the system
map more virtual memory, resulting in a lower miss rate for applications, such as Oracle, that use a lot of virtual memory?
a. Inode
b. Memory Buffer
c. SGA Buffer
d. Translation Lookaside Buffer
True or False
6. Statspack is an Oracle performance diagnosis tool.
7. AIO servers let the system continue processing while I/O completes in
the background.
8. The LOCK_SGA Oracle initialization parameter should be set to TRUE so
that Oracle requests large pages when allocating shared memory.
9. Direct I/O is more recent than concurrent I/O.
Fill in the Blank
10. What is the command to turn on CIO?
___________________________________________
Chapter 18
Linux on Power
This chapter provides an overview of systems performance when running
Linux on Power (LoP).
Monitoring
AIX administrators will be happy to know that the nmon command works
great with Linux. SystemTap, which conducts performance analysis by
analyzing a running kernel, also runs on the platform. Two other popular
tools, iostat and sar, are also available on Linux systems. While tools
aren’t the focus here, it’s nice to know these options are available.
For our first monitoring example, let’s inspect some basic Linux configuration files. The /proc file system is one you should use frequently, much
more so than on AIX boxes because the information is simply more valuable in Linux. A lot of kernel and process information resides here, in the
form of configuration files. One such file is cpuinfo:
[root@172_29_140_173 proc]# more cpuinfo
processor : 0
cpu : POWER5 (gr)
clock : 1654.344000MHz
revision : 2.1 (pvr 003a 02001)
It’s easy to see from this file that you’re using a POWER5 system.
Next, we can look at the release text file to determine the operating system
level:
[root@172_29_140_173 etc]# more redhat-release
Red Hat Enterprise Linux Server release 5.2 (Tikanga)
This box is running Red Hat Enterprise Linux 5 (RHEL5) on a POWER5
partition.
Handy Linux Commands
One of my favorite Linux commands is top. This command provides
realtime information quickly, in a character-based display, including the
processes that are consuming the most CPU time.
Another useful command is free. It reports the total amount of free and
used physical and swap memory:
[root@172_29_140_173 etc]# free
             total       used       free     shared    buffers     cached
Mem:       2073856    2057536      16320          0     440832    1389760
-/+ buffers/cache:     226944    1846912
Swap:            0          0          0
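At first glance this box looks almost out of memory, but most of the "used" figure is buffers and cache that Linux can give back. The -/+ buffers/cache line is derived from the Mem: line, and the short sketch below redoes that arithmetic with the numbers from this output (the variable names are my own).

```shell
#!/bin/sh
# Values (in KB) copied from the Mem: line of the free output.
used=2057536
free_kb=16320
buffers=440832
cached=1389760

# Memory that applications are really holding (buffers and cache excluded)
app_used=$(( used - buffers - cached ))
# Memory that applications could get if buffers and cache were released
app_free=$(( free_kb + buffers + cached ))

echo "$app_used $app_free"
```

The two results match the 226944 and 1846912 shown on the -/+ buffers/cache line, which is the line to watch when you judge memory pressure.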
Still, my favorite of all Unix/Linux performance commands is vmstat. I
love this old standby because, unlike other tools, vmstat provides quick-and-dirty information about all subsystems. Nothing fancy here:
[root@172_29_140_173 etc]# vmstat 1
procs -----------memory----------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0  16256 440832 1389760    0    0     1     1   23   11  0  1 98  0  0
 0  0      0  16384 440832 1389760    0    0     0     0  524   30  0  1 99  0  0
 0  0      0  16384 440832 1389760    0    0     0     0  536   16  0  1 99  0  0
 0  0      0  16320 440832 1389760    0    0     0     0  588   25  0  1 99  0  0
 0  0      0  16320 440832 1389760    0    0     0     0  628   12  0  1 99  0  0
 0  0      0  16320 440832 1389760    0    0     0    24  633   29  0  1 99  0  0
 0  0      0  16320 440832 1389760    0    0     0     0  578   18  0  1 99  0  0
You will find that vmstat output on Linux differs a bit from what you see
on AIX systems. Here’s a quick description of what the memory fields mean:
● swpd: Amount of virtual memory being used
● free: Amount of idle memory
● buff: Amount of memory used as buffers
● cache: Amount of memory used as cache
Virtualization
Some administrators don’t take full advantage of the PowerVM capabilities available to Linux. If you administer LoP the same way you administer Linux on x86 boxes,
you’re doing yourself and your organization a major disservice. Some
capabilities available on POWER systems include the following:
● Simultaneous multithreading (SMT)
● Shared processor pool and uncapped partitions
From a CPU perspective, SMT is an important feature. It lets you maximize the use of instruction sets and in some cases increase CPU performance by 30 percent. SMT enables these improvements by supporting
multithreading, a capability that’s part of PowerVM and the POWER
architecture. Multithreading enables two separate instruction streams to run
concurrently on the same physical processor, with each thread appearing
to run on its own independent logical processor. This feature is enabled by
default.
Through the POWER architecture, you can create Linux and AIX partitions. When creating Linux partitions, you can “uncap” them, which means
that the partitions will receive unused CPU cycles from the shared processor pool over and above their entitled capacity. Other than the number of
cycles left in that shared processor pool, the only limitation is the number
of virtual processors configured for the profile. I recommend uncapping
partitions whenever possible to maximize all available CPU resources and
increase performance. From a CPU perspective, I’d also take advantage of
the capability to add CPU horsepower on the fly through a dynamic LPAR
(DLPAR) operation.
Tuning
In Linux, the sysctl command changes kernel parameters. Be advised that
the method you use to change parameters may depend on your distribution;
for example, you can use the Powertweak tool with Novell SUSE Linux,
but it isn’t available with Red Hat. Because we’re using Red Hat here, sysctl is the choice. Let’s change some parameters.
One parameter that’s changed frequently is SHMMAX, which is used to define the maximum size (in bytes) for a shared memory segment. In Oracle,
you should set this value large enough for the largest System Global Area
(SGA) size. Let’s examine the default parameter:
# sysctl kernel.shmmax
kernel.shmmax = 268435456
In this case, the limit is set to 256 MB. Let’s change this to 1 GB (1,073,741,824 bytes). To do
so, use the vi command to edit the /etc/sysctl.conf file. This is where
you set the value:
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 1073741824
To make the new value take effect without a reboot, issue the sysctl command
with the -p parameter. When you query the parameter again using sysctl, you can see the change:
# sysctl kernel.shmmax
kernel.shmmax = 1073741824
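Because kernel.shmmax is expressed in bytes, it is easy to drop or add a digit when typing the value. A small sketch like the following lets the shell do the conversion for you; the helper names gb_to_bytes and bytes_to_mb are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Hypothetical helpers for checking kernel.shmmax values.
gb_to_bytes() {
    # $1 = size in gigabytes
    echo $(( $1 * 1024 * 1024 * 1024 ))
}
bytes_to_mb() {
    # $1 = size in bytes
    echo $(( $1 / 1024 / 1024 ))
}

gb_to_bytes 1          # value to use for a 1 GB limit
bytes_to_mb 268435456  # the default shown earlier, in MB
```

The first call gives 1073741824, the byte count for 1 GB, and the second confirms that the default of 268435456 is 256 MB.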
On the memory side, parameters worth examining include SEMMSL, which
controls the maximum number of semaphores per semaphore set; SEMMNI,
which controls the maximum number of semaphore sets on the entire Linux system; and SEMMNS, which controls the maximum number of semaphores (not semaphore sets) on the entire Linux system.
Another important parameter is vm.nr_hugepages. The background here is
that the POWER architecture supports page sizes of 4 KB and 16 MB. The
default page size for LoP is 4 KB, which is too small for
larger databases. To enable large pages, you need to change the vm.nr_hugepages parameter.
Let’s first view the current hugepage settings, in this case by looking at the
/proc/meminfo file.
# grep -i hugepages /proc/meminfo
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 16384 kB
Now, let’s allocate 130 large pages to support an SGA of approximately
2 GB:
# sysctl -w vm.nr_hugepages=130
Then view the meminfo file again:
[root@172_29_140_173 ~]# grep -i hugepages /proc/meminfo
HugePages_Total: 41
HugePages_Free: 41
HugePages_Rsvd: 0
Hugepagesize: 16384 kB
You can see that the system has started to allocate huge pages: 41 of the
130 requested are available so far.
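To work out how many huge pages a given SGA needs, divide the SGA size by the huge page size and round up. Here is a minimal sketch of that calculation; the function name hugepages_for is made up, and it assumes the SGA size is given in MB and the page size in kB, as /proc/meminfo reports it.

```shell
#!/bin/sh
# Hypothetical helper: huge pages needed to hold an SGA, rounded up.
hugepages_for() {
    sga_mb=$1    # SGA size in MB
    page_kb=$2   # Hugepagesize from /proc/meminfo, in kB
    page_mb=$(( page_kb / 1024 ))
    echo $(( (sga_mb + page_mb - 1) / page_mb ))
}

# A 2 GB SGA with 16 MB (16384 kB) huge pages
hugepages_for 2048 16384
```

For a 2 GB SGA with 16 MB pages this gives 128, so the 130 pages requested above leave a little headroom.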
Entire books are dedicated to Linux performance tuning. Check your application recommendations to learn how kernel parameters should be configured in your environment.
Linux starts many daemons that usually aren’t needed, including autofs,
cups, nfslock, sendmail, and xfs. You should turn off anything that isn’t
explicitly required. You can accomplish this in several ways, but the chkconfig command is probably the best method.
As an example, let’s turn off cups so that it no longer starts at boot:
[root@172_29_140_173 ~]# chkconfig --del cups
Now, let’s make sure it’s not around anymore:
# chkconfig --list cups
service cups supports chkconfig, but is not referenced in any runlevel (run 'chkconfig --add cups')
Section VI
Chapter 18 Quiz
Multiple Choice
1. What is SHMMAX used for?
a. Memory
b. I/O
c. CPU
d. Networking
2. What is the default page size for LoP?
a. 4 KB
b. 16 KB
c. 32 TB
d. 1 MB
True or False
3. The nmon command is available on Linux.
4. sysctl is available only on Red Hat.
5. SMT is not available for Linux.
6. The /proc file system is more useful on AIX than Linux.
7. SEMMSL controls the maximum number of semaphores per semaphore
set.
8. You use chkconfig to shut off services.
Fill in the Blank(s)
9. SEMMNS controls the maximum number of ____________ on the
system.
10. In vmstat output, _____ references the amount of virtual memory
being used.
Quiz Answers
Section I: Introduction
Answers: 1 – b, 2 – b, 3 – c, 4 – a, 5 – c, 6 – False, 7 – True, 8 – False,
9 – True, 10 – 1. Establish a baseline, 3. Identify bottleneck, 4. Tune,
5. Repeat (starting with step 2).
Section II: CPU
Answers: 1 – a, 2 – b, 3 – c, 4 – d, 5 – a, 6 – d, 7 – b, 8 – True, 9 – True,
10 – Processor affinity is the probability of dispatching a thread to a processor that previously executed it.
Section III: Memory
Answers: 1 – d, 2 – b, 3 – a, 4 – a, 5 – a, 6 – b, 7 – a, 8 – True, 9 – True,
10 – A memory leak occurs when a process keeps on allocating more memory without releasing it.
Section IV: Disk I/O
Answers: 1 – c, 2 – b, 3 – c, 4 – a, 5 – c, 6 – b, 7 – a, 8 – True, 9 – False,
10 – J2_minPageReadAhead
Section V: Network I/O
Answers: 1 – b, 2 – c, 3 – b, 4 – a, 5 – c, 6 – d, 7 – b, 8 – False, 9 – True,
10 – nfsfr
Section VI / Chapter 16: AIX 6.1
Answers: 1 – d, 2 – c, 3 – c, 4 – d, 5 – b, 6 – a, 7 – d, 8 – True, 9 – True,
10 – The netcdctrl command is used to manage the new network caching daemon, letting you dump the cache contents to a file, display cache
usage statistics, flush the cache table, and change the logging level of the
daemon.
Section VI / Chapter 17: Tuning AIX for Oracle
Answers: 1 – b, 2 – a, 3 – d, 4 – a, 5 – d, 6 – True, 7 – True, 8 – True,
9 – False, 10 – # mount -o cio /orafilesystem
Section VI / Chapter 18: Linux on Power
Answers: 1 – a, 2 – a, 3 – True, 4 – False, 5 – False, 6 – False, 7 – True,
8 – True, 9 – semaphores, 10 – swpd
Index
A
access time/speed
disk vs. CPU, 100, 127
network I/O and, 139–140
acdirmin, acdirmax, acregmin, acregmax, and
actime parameters to tune network I/O, 163
Address Resolution Protocol (ARP), 141, 160
Advanced Interactive eXecutive. See AIX, 7
Advanced Power Virtualization (APV), 8, 14,
17, 25, 141, 168
aio, AIX 6.1 and, 176
aio_active parameter, AIX 6.1 and, 181
aioLpools and aioPpools, 181
aioo, AIX 6.1 and, 176, 181
AIX, 7–9, 11, 14, 17, 18, 173
AIX 6.1, 175–187
aio_active parameter in, 181
aioLpools and aioPpools in, 181
asynchronous I/O (AIO) and, 180–182
autoconfig and, 180
availability in, 175
CPU tuning in, 179
disk I/O tuning in, 179, 180
Enhanced Journaled File System (JFS2) in, 179
fastpath parameter tuning in, 180, 180, 181
fsfastpath parameter tuning in, 180, 180, 181
Internet Small Computer System Interface
(iSCSI) in, 179
ioo vs. aioo to tune, 181
logical partitions (LPARs) and, 176
manageability of, 175
maxfree tuning in, 178
maxreqs parameter tuning in, 180
maxservers parameter tuning in, 180, 182
memory tuning in, 176–179, 177, 178, 179
minperm/maxperm tuning in, 177, 178
minservers parameter tuning in, 180, 182
netcdctrl daemon and, 182
Network File System (NFS) and, 183
network I/O and, 182
no to tune network I/O in, 182
POSIX and, 181
posix_aio_active parameter in, 181
POWER6 and, 176
role based access control (RBAC) in, 175
security in, 175
strict_maxclient tuning in, 178
tunable defaults in, 175, 176
variable page size support (VPSS) in, 178
virtualization in, 175
NOTE: Boldface indicates illustrations and code; t indicates a table.
AIX 6.1, continued
vm_default_pspa parameter in, 178–179, 179
vmo to tune memory in, 177–179, 177, 178, 179
workload partitions (WPARs) in, 176
analyzing performance data, 18
Apple, 11
asynchronous I/O (AIO), 97, 102, 125, 127, 180–182
Oracle and, 192–193, 192
AT&T, 7, 17
Atkins, Stephen, 37
autoconfig, AIX 6.1 and, 180
AutoSys, 55
B
balancing system workload, 5
baseline establishment, 3–4
Bell Labs, 7, 17
Berkeley Software Distribution (BSD), 7
bindprocessor, CPU tuning using, 52–53, 52, 53
biod daemon, network I/O tuning using, 138, 162, 169, 183
bottlenecks, 4, 5
CPU, 23–24
CPU–bound, 24
memory– vs. CPU–bound, 5
C
chdev, network I/O tuning using, 161
client tuning, network I/O and, 162–164. See also network I/O, tuning of
computational memory, 64, 65, 91
concurrent I/O (CIO), 97, 101–102, 125, 127
Oracle and, 193–194
CPU, 173
Advanced Power Virtualization (APV) and, 25
AIX 6.1 and, tuning of, 179
bindprocessor tuning tool for, 52–53, 52, 53
cpuinfo file in, 199
curt for monitoring, 25
filemon for monitoring, 25
gprof tuning tool for, 54
historical analysis, 56
iostat for monitoring, 31, 31
Linux on Power (LoP) and, 201
lparmon for monitoring, 25
lparstat for monitoring, 32–33, 32, 33
monitoring, 25–43
mpstat for monitoring, 25, 33–35, 34, 35
netpmon for monitoring, 25
nice to tune, 46, 46, 192
nmon for historical analysis of, 37–38, 37
nmon for monitoring, 36–37
Oracle and, tuning for, 192
pprof for monitoring, 25
process management and, 45
ps for monitoring, 38–39, 39
ps for tuning, 48, 48
renice tuning tool for, 47, 47, 192
sar for monitoring, 25, 28–30, 29, 30
sched_R and sched_D tuning tools for, 50, 50
schedo tuning tool for, 48–50, 49–50, 88, 179
smtctl tuning tool for, 53, 53
splat for monitoring, 25
thread management and, 45
time for timing of, 41, 41
timeslice tuning tool for, 51–52, 52
timex for timing of, 42, 42–43
timing tools for, 41–43
topas for monitoring, 25, 35–36
tprof for tracing of, 39–41, 40, 41
tracing tools for, 39–41
tuning, 23–24, 45–54
vmstat for monitoring, 25–29, 26, 28, 29, 30
w for monitoring, 31, 31
cpuinfo file, 199
cron, 36, 55
disk I/O monitoring using, 108
curt, 5
AIX 6.1 and, 176
CPU monitoring using, 25
CPU tracing using, 39
D
daemons, turning off, in LoP, 203–204, 204
data placement on disk, inner vs. outer areas, 104, 104, 125
Decimal Floating Point, 15, 18
Deep Blue supercomputer, 12
deferred page space allocation (DPSA), 85–86, 92
device drivers and multipath I/O, 127
Digital Equipment Corporation (DEC), 7, 17
direct I/O, 97, 101, 125
disk I/O, 97–130, 173
access time in, 100, 127
AIX 6.1 and, tuning of, 179, 180
AIX LVM commands in monitoring, 112–118
asynchronous (AIO), 97, 102, 125, 127,
180–182, 192–193
capacity of, 100, 127
concurrent (CIO), 97, 101–102, 125, 127,
193–194
cron and, 108
data placement on, inner vs. outer areas, 104,
104, 125
device drivers and multipath, 127
direct, 97, 101, 125
Enhanced Journaled File System (JFS2)
tuning and, 105, 122–123, 126
file systems and, 105
filemon to monitor, 110, 116–117, 116, 117, 126
fileplace to monitor, 110, 116, 117–118, 118, 126
inodes and, 101
inter-disk policy for, 105
inter-policy and, 113–114, 125
intra-policy and, 126, 127–128
introduction to, 99–105
ioo to tune, 120–122, 121t, 122
iostat to monitor, 108, 111–112, 111, 112, 192
JFS file system tuning parameters, using ioo,
121, 121t
journaling file systems and, 121, 126
logical units (LUNs) in, 126, 127
Logical Volume Manager (LVM) and, 99,
103–104, 112–118, 125
logical volume monitoring in, 111–112, 111, 112
logical volume placement in, 125
logical volumes and placement,
intra/inter-policy for, 102–104, 103
lslv to monitor, 110, 113–114, 113, 114, 126
lsof to monitor, 110, 169
lsvg to monitor, 113, 113
lvmo to tune, 119–120, 120, 126
lvmstat to monitor, 115–116, 115, 126
Mirror Write Consistency Check (MWCC)
and, 104, 113, 128
mirroring and, 126
monitoring of, 107–118
mount command for, 102, 102
multipath, with device drivers, 127
nmon to monitor, 107, 110–111, 110, 192
Oracle and, 127, 192–193
pacing of, 180
relational database management systems
(RDBMS) and, 127
sadc utility and, 108
sar to monitor, 107–108, 107, 108
sequential, 128
server minimum/maximum numbers and, 127
stack in, 99–100, 100
storage area networks (SANs) and, 126
syncd daemon and, 123
system layers and, 103–104, 103
System Management Interface Tool (SMIT)
and, 105
topas to monitor, 107, 108–111, 109, 110
tuning of, 119–123
Virtual Memory Manager (VMM) and, 101,
122–123
DNS. See Domain Name Server (DNS)
Domain Name Server (DNS), 141, 161–162,
168, 169, 182
Dynamic Energy Management, 15, 18
dynamic logical partitioning (DLPAR), 8, 93
CPU tuning and, 56
Linux on Power (LoP) and, 201
E
early page space allocation (EPSA), 85–86, 92, 189
Enhanced Journaled File System (JFS2)
AIX 6.1 and, 179
disk I/O and, 105, 122–123, 126
memory and, 65
entstat, network I/O monitoring using, 145,
145, 167
environments for testing, 18
NOTE: Boldface indicates illustrations and code; t indicates a table.
Ethernet
Internet Small Computer System Interface
(iSCSI) in, 179
jumbo frames in, 162, 167, 168–169
network I/O and, 136, 141, 168
External Data Representation (XDR), 139
F
fastpath parameter tuning, AIX 6.1 and, 180,
180, 181
Fiber Distributed Data Interface (FDDI),
network I/O and, 162
file memory, 64, 65, 91
file systems, disk I/O and, 105
filemon, 146
AIX 6.1 and, 176
CPU monitoring using, 25
disk I/O monitoring using, 110, 116–117,
116, 117, 126
fileplace, disk I/O monitoring using, 110, 116,
117–118, 118, 126
free, Linux on Power (LoP) monitoring using, 200
fsck, 8, 179
fsfastpath parameter tuning, AIX 6.1 and, 180,
180, 181
G
General Electric, 7
Global Technology Services, 57
gprof, CPU tuning using, 54
Griffiths, Nigel, 36
H
Hardware Management Console (HMC), 141
historical analysis
CPU, 56
nmon for, 37–38, 37
HMC. See Hardware Management Console (HMC)
hugepages parameter, Linux on Power (LoP)
tuning using, 203, 203
Hypervisor, 12, 13–14, 14, 24
Hypervisor Decrementer (HDEC), 14
I
I/O. See disk I/O; network I/O
IBM, iii–iv, 8, 9, 11, 12, 17, 57
ifconfig
monitoring network I/O using, 161
Object Data Manager (ODM) and, 161
inetd daemon, network I/O monitoring using, 148
inodes, 101
input/output. See disk I/O; network I/O
inter-policy, disk I/O, 105, 113–114, 125
Internet Protocol. See TCP/IP
Internet Small Computer System Interface
(iSCSI), 179
intra-policy, disk I/O, 126, 127–128
ioo
AIX 6.1 and, 175, 176, 181
disk I/O tuning using, 120–122, 121t, 122
iostat, 55, 133
AIX 6.1 and, 176
CPU monitoring using, 31, 31
disk I/O monitoring using, 108, 111–112,
111, 112, 192
ipqmaxlen parameter in tuning network I/O, 160
iptrace, ipreport, and ipfilter, network I/O
monitoring using, 154–155, 155, 168
J
Jann, Joefon, 12
JFS2. See Enhanced Journaled File System
Journaled File System (JFS)
memory and, 65
tuning parameters, using ioo, 121, 121t
journaling file systems, disk I/O and, 121, 126
jumbo frame Ethernet, 162, 167, 168–169
K
Kasparov, Garry, 12
kill commands, 56
L
late page space allocation (LPSA), 85–86, 92
layers, system, 103–104, 103
leaks in memory, 77–79, 78, 79, 92
lgpg_size and lgpg_regions parameters, Oracle
and, 190
libraries, 8
Linux, 9, 14, 136
Linux on Power (LoP), iii, 173, 199–206
commands for, 200–201
CPU performance in, 201
cpuinfo file in, 199
daemons automatically started in, turning off,
203–204, 204
dynamic logical partitioning (DLPAR) in, 201
free command to monitor, 200, 200
hugepages parameter in, 203, 203
meminfo file in, 203, 203
monitoring, 199–200
nmon to monitor, 199
SEMMSL parameter tuning in, 202
SHMMAX parameter tuning in, 202
symmetric multithreading (SMT) in, 201
sysctl to tune, 202
SystemTap to monitor, 199
top command to monitor, 200
tuning, 202–204
virtualization in, 201
vm.nr_hugepages parameter tuning in, 202–203
vmstat to monitor, 200–201, 200
Live Application Mobility, 9
Live Partition Mobility, 15, 18, 19
load control and memory, 87–88, 93
local area networks (LANs), 137. See also
network I/O
lockd, 139
logical partitions (LPARs), 6, 176
logical units (LUNs), 126, 127
logical volume, disk I/O and, 102–104, 103, 125
Logical Volume Manager (LVM), 17, 99,
103–104, 111–118, 125
developmental history of, 8
lvmo to tune, 119–120, 120, 126
lvmstat to monitor, 115–116, 115
logical volume monitoring, 111–112, 111, 112
lparmon, CPU monitoring using, 25
lparstat, 55
CPU monitoring using, 32–33, 32, 33
lru_file_repage, memory tuning using, 82–84,
84, 92
lru_file_repage parameter, Oracle and, 191, 191
lrubucket, memory tuning and, 88–89, 89, 93
lsattr, network I/O monitoring using, 139–140,
139, 140, 147, 147, 161
lscfg to monitor network I/O, 140, 140, 167
lslv, disk I/O monitoring using, 110, 113–114,
113, 114, 126
lsof, disk I/O monitoring using, 110, 169
lsps, memory monitoring using, 73, 92
lsvg, disk I/O monitoring using, 113, 113
lvmo, disk I/O tuning using, 119–120, 120, 126
lvmstat, disk I/O monitoring using, 115–116,
115, 126
M
market share of AIX, 9
Mars Pathfinder, 12
Massachusetts Institute of Technology (MIT), 7
maxclient
memory tuning using, 82–84, 84
network I/O tuning using, 163, 165
maxfree, 191, 191
AIX 6.1 and tuning in, 178
memory tuning using, 84, 85
maximum transfer unit (MTU), 162, 167
maxperm/minperm, 63, 66, 82–84, 84, 177, 178,
191, 191, 191
AIX 6.1 and, 177, 178
memory tuning using, 63, 66, 82–84, 84, 92
network I/O tuning using, 162–163, 163,
165, 169
Oracle and, 191, 191
maxpgahead, 85
maxreqs parameter tuning, AIX 6.1 and, 180
maxservers parameter tuning, AIX 6.1 and,
180, 182
meminfo file, Linux on Power (LoP) tuning
using, 203, 203
memory, 61–96, 173
AIX 6.1 and, tuning, 176–179, 177, 178, 179
computational, 64, 65, 91
deferred page space allocation (DPSA) in,
85–86, 92
dynamic logical partitioning (DLPAR) and, 93
early page space allocation (EPSA) in,
85–86, 92
Enhanced Journaled File System (JFS2) and, 65
file memory in, 64, 65, 91
free list in VMM and, 64
introduction to, 63–66, 63
Journaled File System (JFS) and, 65
late page space allocation (LPSA) in,
85–86, 92
leaks, 77–79, 78, 79, 92
load control and, 87–88, 93
lru_file_repage tuning using, 82–84, 84, 92,
191, 191
lrubucket to tune, 88–89, 89, 93
lsps to monitor, 73, 92
maxclient to tune, 82–84, 84
maxperm/minperm to tune, 63, 66, 82–84, 84,
177, 178, 191, 191, 191
maxpgahead to tune, 85
minfree and maxfree parameters in, 84–85,
191, 191
monitoring of, 67–79
Network File System (NFS) and, 65
network subsystem, management of, 141
nmon to monitor, 81, 92
Oracle and, 189–191, 189
page space allocation in, 85–87, 92, 189–190
paging in, 65–66, 91
Partition Load Manager (PLM) and, 93
persistent segments in, 64, 91
ps page space allocation for, 87, 87
ps to monitor, 73–74, 74, 92
RAM added to, 93
rmss to tune, 89–90, 90, 93
sar to monitor, 71–73, 72, 92
scanning and, 88–89, 89, 93
schedo CPU tuning and, 88, 91, 92
svmon to monitor, 74–77, 75, 77, 92, 93
swapping in, 65–66
thrashing and, 65–66, 87–88, 91
topas to monitor, 67, 92
Translation Lookaside Buffer (TLB) and, 190
tuning of, 81–90
variable page size support (VPSS) in, 178
Virtual Memory Manager (VMM) and, 61,
63–64, 91. See also Virtual Memory
Manager (VMM)
vm_default_pspa parameter in, 178–179, 179
VMM statistic summary using vmstat in, 71, 71
vmm_mpsize_support parameter in, 178
vmo to tune, 66, 81–82, 91, 92, 93, 190–191,
190, 191
vmstat to monitor, 67–70, 69, 70, 91, 92, 93
working segments in, 64, 91
workload balancing and, 87–88, 93
methodology of power tuning, 3–6, 17
minfree, memory tuning using, 84, 85, 191, 191
minperm. See maxperm/minperm
minservers parameter tuning, AIX 6.1 and, 180, 182
Mirror Write Consistency Check (MWCC), 104
disk I/O monitoring using, 113, 128
mirroring, disk I/O and, 126
monitoring system performance, 4–5, 18
CPU, 25–43
disk I/O and, 107–118
Linux on Power (LoP) and, 199
memory, 67–79
network I/O and, 143–156
Motorola, 11
mount, 102, 102
hard vs. soft, 163
network I/O tuning using, 163–164, 168
mpstat, 55
CPU monitoring using, 25, 33–35, 34, 35
Multics, 7
multipath I/O and device drivers, 127
Multiplexed Information and Computing Service.
See Multics
N
name resolution, network I/O and, 161–162,
168, 169, 182
netcdctrl daemon, 182
netpmon
AIX 6.1 and, 176
CPU monitoring using, 25
network I/O monitoring using, 134, 145–146,
146, 148, 152–153, 152, 153, 167
netstat, network I/O and, 131, 133, 135,
143–145, 143, 144, 154, 154, 158, 158, 167,
168
Network File System (NFS), 136–139, 137, 138
memory and, 65
monitoring, 148–153, 167, 169, 183
network I/O and, 131
network I/O, 131–172, 173
acdirmin, acdirmax, acregmin, acregmax, and
actime parameters to tune, 163
Address Resolution Protocol (ARP) and, 141,
160
Advanced Power Virtualization (APV) and,
141, 168
AIX 6.1 and, 182
biod daemon to tune, 138
biod to tune client in, 162, 169, 183
chdev and, 161
client tuning in, 162–164
Domain Name Server (DNS) and, 141,
161–162, 168, 169, 182
entstat to monitor, 145, 145, 167
Ethernet and, virtual and shared, 136, 141,
168
External Data Representation (XDR) and,
139
Fiber Distributed Data Interface (FDDI) and,
162
Hardware Management Console (HMC) and,
141
ifconfig to monitor, 161
inetd daemon to monitor, 148
Internet Small Computer System Interface
(iSCSI) in, 179
introduction to, 133–141
ipqmaxlen parameter in tuning, 160
iptrace, ipreport, and ipfilter to monitor,
154–155, 155, 168
jumbo frame Ethernet and, 162, 167,
168–169
lockd and, 139
lsattr to monitor, 139–140, 139, 140, 147,
147, 161
lscfg to monitor, 140, 140, 167
maxclient to tune, 163, 165, 169
maximum transfer unit (MTU) in, 162, 167
maxperm/minperm in tuning, 162–163, 163,
165, 169
memory management in network subsystems
and, 141
monitoring of, 143–156
mount parameters to tune, 163–164, 168
name resolution in, 161–162, 168, 169, 182
netcdctrl daemon and, 182
netpmon to monitor, 134, 145–146, 146, 148,
152–153, 152, 153, 167
netstat to monitor, 131, 133, 143–145, 143,
144, 154, 154, 158, 158, 167, 168
Network File System (NFS), 131, 136–139,
137, 138, 148–153, 167, 169, 183
nfsd in, 137–138
nfs to monitor, 148
nfs_rfc1323 in tuning, 164, 164
nfs4cl to monitor, 148, 151–152, 152, 167
nfsd to tune, 165
nfso to tune, 164–165, 164, 165, 168
nfsstat to monitor, 148, 149–151, 150, 151,
167
nmon to monitor, 148–149, 148, 167
no to tune, 157–161, 157, 159, 160, 161, 164,
164, 168, 182
Object Data Manager (ODM) and ifconfig,
161
Open Systems Interconnection (OSI) model
for networks and, 135, 138, 167
packets in, 136, 154–156, 168–169
portmap and, 139
protocols in, 141
protocols used in network and, 135–136, 164
remote procedure calls (RPCs) in, 137, 149,
163, 167
rfc1323 parameter in tuning, 159, 164–165,
164, 169
rsize and wsize parameters to tune, 163–164
sb_max parameter in tuning, 159
server tuning in, 164–165
speed of, 139–140
spray to monitor, 148, 148
System Management Interface Tool (SMIT), 161
TCP/IP layers and, 134, 135, 138
tcp_nodelayack parameter in tuning, 160
tcp_recvspace/tcp_sendspace parameters in
tuning, 158, 164, 164, 168
tcpdump to monitor, 156, 156, 168
thewall to tune, 157–158, 159–160
threads in, 162
topas to monitor, 148, 149, 167
Transmission Control Protocol (TCP) and,
135–136, 164
trcstop to stop trace in, 146, 146, 152
tuning of, 157–165, 157
udp_recvspace/udp_sendspace parameters in
tuning, 158, 159
use_isno parameter in tuning, 161, 161
User Datagram Protocol (UDP) and,
135–136, 164
virtual I/O servers (VIOs), 141, 168, 169
Virtual Memory Manager (VMM) and, 141,
162, 169
nfs, network I/O monitoring using, 148
nfs_rfc1323 in tuning network I/O, 164, 164
nfs4cl, network I/O monitoring using, 148,
151–152, 152, 167
nfsd, network I/O tuning using, 137–138, 165
nfso
AIX 6.1 and, 175, 176
network I/O tuning using, 164–165, 164, 165,
168
nfsstat, network I/O monitoring using, 148,
149–151, 150, 151, 167
nice, 5, 55
CPU tuning using, 46, 46, 192
Oracle and, 192
nmon, 4, 55, 56, 67
CPU monitoring using, 36–37
disk I/O monitoring using, 107, 110–111,
110, 192
historical analysis using, 37–38, 37
Linux on Power (LoP) monitoring using, 199
memory monitoring using, 81, 92
network I/O monitoring using, 148–149,
148, 167
no
AIX 6.1 and, 175, 176
network I/O tuning using, 157–161, 157, 159,
160, 161, 164, 164, 168, 182
O
Object Data Manager (ODM), 180
ifconfig and, 161
Open Firmware, 14
Open Systems Interconnection (OSI) model for
networks, 135, 138, 167
Oracle, 173, 189–198
asynchronous I/O (AIO) and, 192–193
concurrent I/O (CIO) in, 193–194
CPU tuning for, 192
disk I/O and, 127
early allocation of paging space and, 189
iostat to monitor AIO in, 192–193, 193
lgpg_size and lgpg_regions parameters for, 190
lru_file_repage parameter for, 191, 191
memory tuning for, 189–191
minfree and maxfree parameters in, 191, 191
minperm and maxperm parameters in, 191, 191
nice and renice to tune CPU for, 192
nmon to monitor AIO in, 192
Oracle Enterprise Manager (OEM) and, 189,
195–196, 195, 196
page space allocation and, 86–87, 92, 189–190
Statspack for, 194, 195
storage area networks (SANs) and, 194
symmetric multithreading (SMT) for, 192
System Global Area (SGA) in, 83, 190
Translation Lookaside Buffer (TLB) and, 190
Virtual Memory Manager (VMM) and, 189–190
vmo to tune memory for, 190–191, 190, 191
Oracle Enterprise Manager (OEM), 189,
195–196, 195, 196
P
pacing disk I/O, 180
packets, in network communication, 136,
154–156, 168–169
page space allocation, 85–87, 92
Oracle and, 189–190
paging in memory, 65–66, 91
Partition Load Manager (PLM), 93
CPU tuning and, 56
PDP-7 computers, 7
Performance Toolbox (PTX), 57
persistent segments of memory, 64, 91
PM, 57
portmap, 139
POSIX, 7, 181
posix_aio_active parameter, AIX 6.1 and, 181
POWER, 24
Power Optimization with Enhanced RISC. See
POWER servers
POWER servers, iii, 8, 9, 11–15, 17, 18–19, 201
power tuning methodology, 3–6, 17
POWER5, 13–14
POWER6, 14–15, 18
AIX 6.1 and, 176
PowerVM, 14, 15, 18, 141, 201
pprof
AIX 6.1 and, 176
CPU monitoring using, 25
process management, CPU, 45
procmon, 57
AIX 6.1 and, 176
proctree, AIX 6.1 and, 176
protocols, network, 135–136, 141, 164
ps, 5, 55, 56
CPU monitoring using, 39, 39
CPU tuning using, 48, 48
memory monitoring using, 73–74, 74, 92
memory page space allocation using, 87, 87
R
RAM, 93
raso, AIX 6.1 and, 175
Red Hat Linux, 136, 200
Regatta architecture, 12, 13
relational database management systems
(RDBMS), 127
remote procedure calls (RPCs), 137, 149, 163, 167
renice, 5, 55
CPU tuning using, 47, 47, 192
Oracle and, 192
repeating the tuning process, 6
resource increases, 6
rfc1323 parameter in tuning network I/O, 159,
164–165, 164, 169
RISC architecture and POWER, 11–12
Ritchie, Dennis, 7, 17
rmss, memory tuning using, 89–90, 90, 93
role based access control (RBAC), 175
RS/6000, 8, 17
rsize, network I/O tuning using, 163–164
Run-Time Abstraction Services (RTAS), 14
S
sadc, disk I/O monitoring using, 108
sar, 55, 67
CPU monitoring using, 25, 28–30, 29, 30
disk I/O monitoring and, 107–108, 107, 108
memory monitoring using, 71–73, 72, 92
sb_max parameter in tuning network I/O, 159
scanning memory, lrubucket and, 88–89, 89, 93
sched_R and sched_D, CPU tuning using, 50, 50
schedo, 55, 56, 66
AIX 6.1 and, 175, 176
CPU tuning using, 48–50, 49–50, 88, 179
memory tuning using, 91, 92
schedtune, 66, 91
scheduler tuning, 5–6
security, AIX 6.1 and, 175
SEMMSL parameter, Linux on Power (LoP)
tuning using, 202
sequential I/O, 128
servers
minimum/maximum numbers and I/O, 127
network I/O and, tuning, 164–165
virtual I/O (VIOs), 141, 168, 169
shared partitions, 67
SHMMAX parameter, Linux on Power (LoP)
tuning using, 202
simultaneous multithreading (SMT), 13, 24
SMIT. See System Management Interface Tool
(SMIT)
smtctl, 55, 56
CPU tuning using, 53, 53, 53
Solaris, 9
splat, 5
CPU monitoring using, 25
CPU tracing using, 39
spray, network I/O monitoring using, 148, 148
stack, I/O, 99–100, 100
Statspack, 194, 195
storage area networks (SANs)
disk I/O and, 126
Oracle and, 194
stress testing, 4–5, 18
strict_maxclient, AIX 6.1 and tuning in, 178
Sun Microsystems, 136
svmon
AIX 6.1 and, 176
leaks in memory monitored with, 77–79, 78,
79, 92, 93
memory monitoring, using, 74–77, 75, 77
swapping in memory, 65–66
symmetric multiprocessing (SMP), 8
symmetric multithreading (SMT), 8, 13, 24, 55,
56, 192
Linux on Power (LoP) and, 201
syncd daemon, disk I/O tuning using, 123
sysctl, Linux on Power (LoP) tuning using, 202
System Global Area (SGA), Oracle, 83, 190
system layers, 103–104, 103, 103
System Management Interface Tool (SMIT), 105
network I/O and, 161
System p, 57
SystemTap, Linux on Power (LoP) and, 199
T
TCP/IP, layers of, 134, 135, 138
tcp_nodelayack parameter in tuning network
I/O, 160
tcp_sendspace/tcp_recvspace parameter in
tuning network I/O, 158, 164, 164, 168
tcpdump, network I/O monitoring using, 156,
156, 168
testing/test environments, 56
thewall, network I/O tuning using, 157–158,
159–160
Thompson, Ken, 7, 17
thrashing, 87–88, 91
memory and paging and swapping and, 65–66
memory tuning and, 87–88
page space allocation/tuning and, 87
thread management
CPU, 45
network I/O tuning and, 162
time, CPU timing using, 41, 41
timeslice, CPU tuning using, 51–52, 52
timex, CPU timing using, 42, 42–43
timing tools, CPU, 41–43
Tivoli Monitoring System, 57
top, Linux on Power (LoP) monitoring using, 200
topas, 4, 55, 56, 67
AIX 6.1 and, 176
CPU monitoring using, 25, 35–36
CPU tuning with, 24
disk I/O monitoring using, 107, 108–111,
109, 110
memory monitoring using, 67, 92
network I/O monitoring using, 148, 149, 167
tprof, 5, 56, 146
AIX 6.1 and, 176
CPU tracing using, 39–41, 40, 41
trace, 5
CPU tracing using, 39
network I/O and, 146, 152
tracing tools, CPU, 39–41
Translation Lookaside Buffer (TLB), 190
Transmission Control Protocol (TCP), 135–136,
164. See also TCP/IP
trcrpt, CPU tracing using, 39
trcstop, 146, 146, 152
tuning, 5–6
CPU, 45–54
disk I/O and, 119–123
Linux on Power (LoP) and, 202–204
memory, 81–90
network I/O and, 157–165
U
udp_sendspace/udp_recvspace parameter in
tuning network I/O, 158
Uniplexed Information and Computing Service.
See Unix
Unix, iii–iv, 7–8, 9, 11, 17
upgrading, 18
use_isno parameter in tuning network I/O, 161, 161
User Datagram Protocol (UDP), 135–136, 164
V
variable page size support (VPSS), 178
virtual I/O servers (VIOs), 141, 168, 169
virtual memory, 173
Virtual Memory Manager (VMM), 61, 63–64,
67, 91
direct I/O and, 101
disk I/O tuning using, 122–123
early allocation of paging space and, 189
free list in, 64
network I/O and, subsystem memory and, 141
network I/O tuning using, 162, 169
Oracle and, 189–190
page space allocation for, 189–190
paging in, 65–66, 91
summary of statistics for, using vmstat, 71, 71
thrashing and, 65–66
Translation Lookaside Buffer (TLB) and, 190
tuning of, 66
variable page size support (VPSS) in, 178
vm_default_pspa parameter in, 178–179, 179
vmm_mpsize_support parameter in, 178
vmo for, 66
vmtune for, 66
virtualization, 6, 8, 14, 17, 18, 141, 169, 175
Linux on Power (LoP) and, 201
vm.nr_hugepages parameter, Linux on Power
(LoP) tuning using, 202–203
vm_default_pspa parameter, 178–179, 179
vmm_mpsize_support parameter, 178
vmo, 66
AIX 6.1 and, 175, 176
AIX 6.1 and, memory tuning using, 177–179,
177, 178, 179
memory tuning using, 81–82, 91, 92, 93,
190–191, 190, 191
Oracle, memory tuning using, 190–191, 190, 191
vmstat, 4, 55, 67, 78, 81, 133
AIX 6.1 and, 176
CPU monitoring using, 25–29, 26, 28, 29, 30
CPU tuning with, 24
Linux on Power (LoP) monitoring using, 200
memory monitoring using, 67–70, 69, 70,
92, 93
memory tuning and, 91
vmtune, 66, 91
W
w, 55
CPU monitoring using, 31, 31
wide area networks (WANs), 136. See also
network I/O
working segments of memory, 64, 91
workload analysis, 55
workload balancing, 5, 87–88, 93
Workload Manager, 55
workload partitions (WPARs), 8–9
AIX 6.1 and, 176
wsize, network I/O tuning using, 163–164
X
X/OPEN, 7
XDR. See External Data Representation (XDR)
Z
zombie processes, 56