

Alois Reitbauer

Alois Reitbauer works as Technology Strategist for dynaTrace. As a major contributor to dynaTrace Labs technology he influences the company's future technological direction. Besides his engineering work he supports Fortune 500 companies in implementing successful performance management. In his former life Alois worked for a number of high-tech companies as an architect and developer of innovative enterprise software. At Segue Software (now Borland, a Micro Focus company) he was part of the engineering teams of the company's active and passive monitoring products. He is a regular speaker at conferences like TheServerSide Java Symposium Europe, QCon, Jazoon, Devoxx or JAX. He was the author of the Performance Series of the German Java magazine as well as author of other online and print publications and contributor to several books.

Andreas Grabner
Andreas Grabner has 10 years of experience as an architect and developer in the Java and .NET space. In his current role as a Technology Strategist, Andi influences the dynaTrace product strategy and works closely with customers in implementing performance management solutions across the entire application lifecycle.
Copyright dynaTrace software, A Division of Compuware. All rights reserved.
Trademarks remain the property of their respective owners.
The dynaTrace Application Performance Almanac is
brought to you by our authors at blog.dynatrace.com
Michael Kopp

Michael Kopp has 10 years of experience as an architect and developer in Java/JEE and C++. He currently works as a Technology Strategist and product evangelist in the dynaTrace Center of Excellence. In this role he specializes in the architecture and performance of large-scale production deployments. As part of the R&D team he influences the dynaTrace product strategy and works closely with key customers in implementing performance management solutions for the entire lifecycle. Before joining dynaTrace he was the Chief Architect at GoldenSource, a major player in the EDM space, where a special focus has always been the performance and scalability of their enterprise offerings.

Klaus Enzenhofer
Klaus Enzenhofer has several years of experience and expertise in the field of web performance optimization and user experience management. He works as Technical Strategist in the Center of Excellence Team at dynaTrace software. In this role he influences the development of the dynaTrace application performance management solution and the web performance optimization tool dynaTrace AJAX Edition. He mainly gathered his experience in web and performance by developing and running large-scale web portals at Tiscover GmbH.
Welcome
dynaTrace Application Performance
Almanac Issue 2012
A Year of APM Knowledge
We are proud to present the second issue of the Application Performance Management Almanac, a collection of technical articles drawn from our most read and discussed blog articles of last year.

While keeping our focus on classical performance topics like memory management, some new topics piqued our interest. Specifically, Cloud, Virtualization and Big Data performance are becoming increasingly important and are moving into the focus of software companies as these technologies gain traction. Web Performance has continued to be a hot topic. This topic also broadened from Web Diagnostics to production monitoring using User Experience Management, looking beyond one's own application into Third Party component performance, which is becoming a primary contributor to application performance.

Besides providing deep technical insight we also tried to answer controversial "why" questions, as in our comparison of different end user monitoring approaches or our discussion about the use cases for NoSQL technologies.

We are also glad that some of our top customers have generously agreed to allow us to feature their applications as real-world examples. Furthermore, a number of excellent guest authors have contributed articles, to whom we extend our thanks.

We decided to present the articles in chronological order to reflect the development of performance management over the year from our perspective. As the articles do not depend on each other, however, they can be read as individual pieces depending on your current interests.

We thank our readers for their loyalty and hope you enjoy this Almanac.
Klaus Enzenhofer
Andreas Grabner
Michael Kopp
Alois Reitbauer
The Application Performance Almanac is an annual release of leading edge application knowledge brought to you by blog.dynatrace.com
Web

5 Steps to Set Up ShowSlow as Web Performance Repository for dynaTrace Data
5 Things to Learn from JC Penney and Other Strong Black Friday and Cyber Monday Performers
eCommerce Business Impact of Third Party Address Validation Services
How Case-Sensitivity for ID and ClassName can Kill Your Page Load Time
How Proper Redirects and Caching Saved Us 3.5 Seconds in Page Load Time
Is Synthetic Monitoring Really Going to Die?
Microsoft Not Following Best Practices Slows Down Firefox on Outlook Web Access
Real Life Ajax Troubleshooting Guide
Slow Page Load Time in Firefox Caused by Old Versions of YUI, jQuery, and Other Frameworks
Step by Step Guide: Comparing Page Load Time of US Open across Browsers
Testing and Optimizing Single Page Web 2.0/AJAX Applications - Why Best Practices Alone Don't Work Any More
The Impact of Garbage Collection on Java Performance
Third Party Content Management Applied: Four Steps to Gain Control of Your Page Load Performance!

Cloud

Cassandra Write Performance - a Quick Look Inside
Clouds on Cloud Nine: the Challenge of Managing Hybrid-Cloud Environments
Goal-oriented Auto Scaling in the Cloud
NoSQL or RDBMS? Are We Asking the Right Questions?
Pagination with Cassandra, And What We Can Learn from It
Performance of a Distributed Key Value Store, or Why Simple is Complex
DevOps

Application Performance Monitoring in Production - A Step-by-Step Guide - Measuring a Distributed System
Application Performance Monitoring in Production - A Step-by-Step Guide Part 1
Automatic Error Detection in Production - Contact Your Users Before They Contact You
Business Transaction Management Explained
Field Report - Application Performance Management in WebSphere Environments
How to Manage the Performance of 1000+ JVMs
Top 8 Performance Problems on Top 50 Retail Sites before Black Friday
Troubleshooting Response Time Problems - Why You Cannot Trust Your System Metrics
Why Performance Management is Easier in Public than On-Premise Clouds
Why Response Times are Often Measured Incorrectly
Why SLAs on Request Errors Do Not Work and What You Should Do Instead
Why You Really Do Performance Management in Production
You Only Control 1/3 of Your Page Load Performance!

Mobile

How Server-side Performance Affects Mobile User Experience
Why You Have Less Than a Second to Deliver Exceptional Performance
Automation

Automated Cross Browser Web 2.0 Performance Optimizations: Best Practices from GSI Commerce
dynaTrace in Continuous Integration - The Big Picture
How to do Security Testing with Business Transactions - Guest Blog by Lucy Monahan from Novell
Tips for Creating Stable Functional Web Tests to Compare across Test Runs and Browsers
To Load Test or Not to Load Test: That is Not the Question
White Box Testing - Best Practices for Performance Regression and Scalability Analysis

Tuning

Behind the Scenes of Serialization in Java
How Garbage Collection Differs in the Three Big JVMs
How to Explain Growing Worker Threads under Load
Major GCs - Separating Myth from Reality
The Cost of an Exception
The Reason I Don't Monitor Connection Pool Usage
The Top Java Memory Problems - Part 1
The Top Java Memory Problems - Part 2
Why Object Caches Need to be Memory-sensitive - Guest Blog by Christopher André
5 Steps to Set Up ShowSlow as Web
Performance Repository for dynaTrace Data
by Andreas Grabner
Alois Reitbauer has explained in detail how dynaTrace continuously monitors
several thousand URLs and uploads the performance data to the public
ShowSlow.com instance. More and more of our dynaTrace AJAX Edition
Community Members are taking advantage of this integration in their
internal testing environments. They either use Selenium, Watir or other
functional testing tools to continuously test their web applications. They
use dynaTrace AJAX Edition to capture performance metrics such as Time
to First Impression, Time to Fully Loaded, Number of Network Requests
or Size of the Site. ShowSlow is then used to receive those performance
beacons, store them in a repository and provide a nice Web UI to analyze the
captured data over time. The following illustration shows a graph from the
public ShowSlow instance that contains performance results for a tested
website over a period of several months:
Analyze Performance Metrics from dynaTrace over time using ShowSlow as Repository
As we received several questions regarding installation and setup of this integration I thought it was time to write a quick step-by-step guide on how to use a private ShowSlow instance and dynaTrace AJAX Edition in your test environment. I just went through the whole installation process on my local Windows 7 installation and want to describe the steps I've taken to get it running.
Step 1: Download Software
Sergey, the creator of ShowSlow, provides a good starting point for our installation: http://www.showslow.org/Installation_and_configuration

I started by downloading the latest ShowSlow version, Apache 2.2, the PHP5 binaries for Windows and MySQL. If you don't have dynaTrace AJAX Edition yet, also go ahead and get it from our AJAX Edition web site.
Step 2: Installing Components
I have to admit I am not a pro when it comes to Apache, PHP or MySQL but even I managed to get it all installed and configured in minutes. Here are the detailed steps:
Initial Configuration of Apache

1. During the setup process I configured Apache to run on port 8080 in order not to conflict with my local IIS
2. Update Apache's httpd.conf to let the DocumentRoot point to my extracted ShowSlow directory
3. Enable all modules as explained in Sergey's installation and configuration description, that is mod_deflate, mod_rewrite and mod_expires
Installing the Database

1. Use the mysql command line utility and follow the instructions in Sergey's description. Before running tables.sql I had to manually switch to the ShowSlow database by executing the "use showslow" statement
2. Rename config.samples.php in your ShowSlow installation directory to config.php and change the database credentials according to your installation
Configuring PHP

1. In my extracted PHP directory (c:\php) I renamed php-development.ini to php.ini
   1. Remove the comment for the two MySQL extensions php_mysql.dll and php_mysqli.dll
   2. Set the extension_dir to c:/php/ext
   3. If you want to use the WebPageTest integration you also need to remove the comment for the extension php_curl.dll
2. In order for PHP to work in Apache I had to add the following lines to httpd.conf, following these recommendations:
   1. LoadModule php5_module c:/php/php5apache2_2.dll -> at the end of the LoadModule section
   2. AddType application/x-httpd-php .php -> at the end of IfModule
   3. AddType application/x-httpd-php .phtml -> at the end of IfModule
   4. PHPIniDir c:/php -> at the very end of the config file
   5. Change DirectoryIndex from index.html to index.php to default to this file
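Put together, the corresponding additions to httpd.conf look roughly like this (a sketch assuming PHP was extracted to c:\php; adjust the paths to your own setup):

# at the end of the LoadModule section
LoadModule php5_module "c:/php/php5apache2_2.dll"

# inside the IfModule block
AddType application/x-httpd-php .php
AddType application/x-httpd-php .phtml

# at the very end of the config file
PHPIniDir "c:/php"

# serve index.php by default
DirectoryIndex index.php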
Step 3: Launching ShowSlow
Now you can either run Apache as a Windows Service or simply start httpd.exe from the command line. When you open the browser and browse to http://localhost:8080 you should see the following:
ShowSlow running on your local machine
Step 4: Configure dynaTrace AJAX Edition

dynaTrace AJAX Edition is configured to send performance data to the public ShowSlow instance. This can be changed by modifying dtajax.ini (located in your installation directory) and adding the following parameters to it:
-Dcom.dynatrace.diagnostics.ajax.beacon.uploadurl=http://localhost:8080/beacon/dynatrace
-Dcom.dynatrace.diagnostics.ajax.beacon.portalurl=http://localhost:8080/
These two parameters allow you to manually upload performance data
to your local ShowSlow instance through the Context Menu in the
Performance Report:

dynaTrace AJAX Edition will prompt you before the upload actually happens in order to avoid an accidental upload. After the upload to the uploadUrl you will also be prompted to open the actual ShowSlow site. If you click Yes, a browser will be opened navigating to the URL configured in portalUrl - in our case this is our local ShowSlow instance. Now we will see our uploaded result:

Manually upload a result to ShowSlow
Step 5: Automation

The goal of this integration is not to manually upload the results after every test run but to automate this process. There is an additional parameter that you can configure in dtajax.ini:
-Dcom.dynatrace.diagnostics.ajax.beacon.autoupload=true
After restarting dynaTrace AJAX Edition the performance beacon will be sent to the configured ShowSlow instance once a dynaTrace Session is completed. What does that mean? When you manually test a web page using dynaTrace AJAX Edition, or if you use a functional testing tool such as Selenium in combination with dynaTrace AJAX Edition, a dynaTrace Session is automatically recorded. When you or the test tool closes the browser the dynaTrace session gets completed and moves to the stored session folder. At this point dynaTrace AJAX Edition automatically sends the performance beacon to the configured ShowSlow instance.

Uploaded data visible in ShowSlow under "Last Measurements"
If you want to know more about how to integrate tools such as Selenium
with dynaTrace then read these blogs: How to Use Selenium with dynaTrace,
5 Steps to Use Watir with dynaTrace
Want more data and better automation support?
The integration with ShowSlow is a great way to automate your performance analysis in a continuous manner. The performance beacon that gets sent to ShowSlow can of course also be used by any other tool. The beacon is a JSON formatted object that gets sent to the configured endpoint via HTTP POST. Feel free to write your own endpoint listener if you wish to do so.
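As a minimal sketch of such a listener - assuming a Java servlet container and a hypothetical /beacon mapping; the exact JSON fields depend on the beacon version - something like this is enough to receive and log the POSTed data:

import java.io.BufferedReader;
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical endpoint that accepts the JSON beacon sent via HTTP POST
public class BeaconListenerServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        StringBuilder body = new StringBuilder();
        BufferedReader reader = request.getReader();
        String line;
        while ((line = reader.readLine()) != null) {
            body.append(line);
        }
        // store or forward the raw JSON beacon; here we simply log it
        System.out.println("Received performance beacon: " + body);
        response.setStatus(HttpServletResponse.SC_OK);
    }
}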
dynaTrace also offers a solution that extends what is offered in dynaTrace AJAX and ShowSlow. If you need more metrics and better automation support check out Web Performance Automation. If you want to analyze more than just what's going on in the browser check out Server-Side Tracing, End-to-End Visibility. If you want to become more proactive in identifying and reacting to performance regressions check out Proactive Performance Management.
How to Explain Growing Worker Threads
under Load
by Andreas Grabner
I recently engaged with a client who ran an increasing load test against their load-balanced application. I got involved because they encountered a phenomenon they couldn't explain - here is an excerpt of the email:

"We have a jBoss application with mySQL that runs stable in a load testing environment with let's say 20 threads. At one point this suddenly changes and jBoss uses up to 100 threads for handling requests (or whatever the configured max number of threads in jBoss might be) - until now we have not been able to pinpoint what causes this issue"
I requested their collected performance metrics to take a quick look at it.
Here were my steps to analyze their problem.
Step 1: Verify what they said
Of course I trusted what I read in the email - but it is always good to confirm and verify. They subscribed several jBoss measures such as the Thread Pool Current Threads Busy. Charting them verified what they said:


Looking at the busy worker threads shows us the load behavior on the
load balanced web servers
The initial high number of worker threads is probably caused by the application having to warm up - meaning that individual requests were slow, so more worker threads were needed to handle the initial load. Throughout the test the 2nd application server constantly had more threads than the first. At the end of the test we see that both servers spike in load, with server 2 maxing out its worker threads. This confirms what they told me in the email. Let's see why that is.
Step 2: Correlate with other metrics
The next thing I do is to look at actual web request throughput, CPU Utilization and response times of requests. I chart these additional metrics on my dashboard. These metrics are provided by the application server or by the performance management solution we used in this case. Let's look at the dashboard including my red marks that I want to focus on:


Correlating thread count with CPU, throughput and response time

The top graph shows the busy worker threads again with a red mark around the problematic timeframe. The second graph shows the number of successful web requests - which is basically the throughput. It seems that the throughput increased in the beginning, which is explained by the increasing workload. Even before we see the worker threads go up we can see that throughput stops increasing and stays flat. On server 2 (brown line) we even see a drop in throughput until the end of the test. So even though we have more worker threads we actually have fewer requests handled successfully.
The CPU Graph now actually explains the root cause of the problem. I chose to split the chart to have a separate graph for each application server. Both servers max out their CPU at the point that I marked in red. The time correlates with the time when we see the throughput becoming flat. It is also the time when we see more worker threads being used. The problem though is that new worker threads won't help to handle the still increasing load because the servers are just out of CPU.
The bottom graph shows the response time of those requests that are handled successfully, again split by the two application servers. The red area shows the time spent on CPU, the blue area shows the total execution time. We can again observe that the contribution of the CPU plateaus once we have reached the CPU limit. From there on we only see increased Execution Time. This is time the application spends waiting on resources such as I/O, Network or even CPU cycles. Interesting here is that the 2nd application server has a much higher execution time than application server 1. I attribute this to the 2nd server getting all these additional requests from the load balancer, which results in more worker threads that all compete for the same scarce resources.
Summarizing the problems: It seems we have a CPU problem in combination with a load balancer configuration problem. As we run out of CPU the throughput stalls and execution times get higher, but the load balancer still forwards incoming requests to the already overloaded application servers. It also seems that the load balancer is unevenly distributing the load, causing even more problems on server 2.
Is it really a CPU Problem?
Based on the analysis it seems that CPU is our main problem here. The question is whether there is something we can do about it (by fixing a problem in the application), or whether we just reached the limits of the hardware used (fixable by adding more machines).
The customer uses dynaTrace with a detailed level of instrumentation
which allows me to analyze which methods consume most of the CPU.
The following illustration shows what we call the Methods Dashlet.
It allows me to look at those method executions that have been traced
while executing the load test. Sorting it by CPU Time shows me that there are two methods called getXML which spend most of their time on the CPU:

There are two different implementations of getXML that consume most of the
CPU
The first getXML method consumes by far more CPU than all other methods combined. We need to look at the code to see what is going on in there. Looking at the parameters I assume it is reading content from a file and then returning it as a String. File access explains the difference in Execution Time and CPU Time. The CPU Time is then probably spent in processing the file and generating an XML representation of the content. Inefficient usage of an XML Parser or String concatenations would be a good explanation. It seems like we have a good chance to optimize this method, save CPU and then be able to handle more load on that machine.
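To illustrate the String concatenation hypothesis (the actual getXML code was not available to me, so this is a hypothetical sketch): concatenating Strings in a loop copies the growing result over and over and burns CPU, while a StringBuilder appends into a single buffer:

// Hypothetical illustration of the suspected pattern - not the customer's actual code
static String buildXmlSlow(java.util.List<String> lines) {
    String xml = "<items>";
    for (String line : lines) {
        xml += "<item>" + line + "</item>"; // each += copies the whole string built so far
    }
    return xml + "</items>";
}

// Cheaper alternative: append into a single, growing buffer
static String buildXmlFast(java.util.List<String> lines) {
    StringBuilder sb = new StringBuilder("<items>");
    for (String line : lines) {
        sb.append("<item>").append(line).append("</item>");
    }
    return sb.append("</items>").toString();
}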
What about these spikes on server 2?
In the response time chart we could see that Server 2 had a totally different response time behavior than server 1, with very high Execution Times. Let's take a closer look at some of these long running requests. The following illustration shows parts of a PurePath. A PurePath is the transactional trace of one request that got executed during the load test. In this case it was a request executed against server 2 taking a total of 68s to execute. It seems it calls an external web service that returns invalid content after 60 seconds. I assume the web service simply ran into a timeout and after 60 seconds returned an error HTML page instead of SOAP content:
External Web Service Call runs into a timeout and returns an error page instead of SOAP Content

Looking at other PurePaths from server 2 it seems that many transactions ran into the same problem, causing the long execution times. The PurePath contains additional context information such as method arguments and parameters passed to the web service. This information can be used when talking with the external service provider to narrow down the problematic calls.
Conclusion
The most important thing to remember is to monitor enough metrics: CPU, I/O, Memory, Execution Times, Throughput, Application Server Specifics, and so on. These metrics allow you to figure out whether you have problems with CPU or other resources. Once that is figured out you need more in-depth data to identify the actual root cause. We have also learned that it is not always our own application code that causes problems. External services or frameworks that we use are also places to look.
dynaTrace in Continuous Integration - The
Big Picture
by Andreas Grabner
Agile development practices have been widely adopted in R&D organizations. A core component is Continuous Integration, where code changes are continuously integrated and tested to achieve the goal of having potentially shippable code at the end of every Sprint/Iteration.

In order to verify code changes, Agile team members write Unit or Functional Tests that get executed against every build and every milestone. The results of these tests tell the team whether the functionality of all features is still there and whether the recent code changes have introduced a regression.
Verify for Performance, Scalability and Architecture
Now we have all these JUnit, NUnit, TestNG, Selenium, WebDriver, Silk or QTP tests that verify the functionality of the code. By adding dynaTrace to the Continuous Integration Process these existing tests automatically verify performance, scalability and architectural rules. Besides knowing that (for example) the Product Search feature returns the correct result we want to know:

How much CPU and network bandwidth does it take to execute the search query?
How many database statements are executed to retrieve the search result?
Will product images be cached in the browser?
How does JavaScript impact the Page Load Time in the browser?
Did the last code change affect any of these Performance, Scalability or Architectural Rules?
dynaTrace analyzes all Unit and Browser Tests and validates execution characteristics such as number of database statements, transferred bytes, cache settings ... against previous builds and test runs. In case there is a change (regression) the developers are notified about what has actually changed.
dynaTrace automatically detects abnormal behavior on all subscribed measures, e.g. execution time, number of database statements, number of JavaScript files, ...

Automatically validate rules such as number of remoting calls or number of bytes transferred

Compare the difference between the problematic and the Last Known Good run

Besides providing confidence about functionality these additional checks ensure that the current code base performs, scales and adheres to architectural rules.
Step-by-Step Guide to enable dynaTrace in CI
In order to integrate dynaTrace we need to modify the CI Process. The
following is a high-level, step-by-step guide that explains all steps in a
typical CI environment. When a new build is triggered the Build Server
executes an Ant, NAnt, Maven, or any other type of automation script that
will execute the following tasks:
1. Check-out current Code Base
2. Generate a new Build Number and then Compile the Code
3. (dynaTrace) Start Session Recording (through REST)
4. (dynaTrace) Set Test Meta Data Information (through REST)
5. Execute JUnit, Selenium, ... Tests - (dynaTrace Agent gets loaded into
Test Execution Process)
6. (dynaTrace) Stop Session Recording (through REST)
7. Generate Test Reports including dynaTrace Results (through REST)
dynaTrace provides an Automation Library with both a Java and a .NET implementation to call the dynaTrace Server REST services. It also includes Ant, NAnt and Maven tasks that make it easy to add the necessary calls to dynaTrace. The Demo Application includes a fully configured sample including Ant, JUnit and Selenium.
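If you prefer not to use the library, the REST calls can also be scripted directly from the build. The sketch below uses plain HttpURLConnection; the server address and the REST path are placeholders and need to be replaced with the endpoints documented for your dynaTrace installation:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class SessionRecordingClient {

    // placeholder URL - look up the real REST endpoint in your dynaTrace documentation
    private static final String START_RECORDING_URL =
            "http://dynatrace-server:8020/rest/startrecording";

    public static void startRecording() throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) new URL(START_RECORDING_URL).openConnection();
        connection.setRequestMethod("GET");
        int status = connection.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IOException("Starting session recording failed with HTTP status " + status);
        }
        connection.disconnect();
    }
}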
Once these steps are done dynaTrace will automatically

identify Unit and Browser Tests
learn the expected behavior of each individual test
raise an incident when tests start behaving unexpectedly

Captured results of tests are stored in individual dynaTrace Sessions, which makes them easy to compare and share. More details on Test Automation can be found in the Online Documentation.
Conclusion
Take your Continuous Integration Process to the next level by adding performance, scalability and architectural rule validations without the need to write any additional tests. This allows you to find more problems earlier in the development lifecycle, which will reduce the time spent in load testing and minimize the risk of production problems.
Behind the Scenes of Serialization in Java
by Alois Reitbauer
When building distributed applications one of the central performance-critical components is serialization. Most modern frameworks make it very easy to send data over the wire. In many cases you don't see at all what is going on behind the scenes. Choosing the right serialization strategy, however, is central to achieving good performance and scalability. Serialization problems affect CPU, memory, network load and response times.

Java provides us with a large variety of serialization technologies. The actual amount of data which is sent over the wire can vary substantially. We will use a very simple example where we send a first name, last name and birth date over the wire. Then we'll see how big the actual payload gets.
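For reference, the entity used in the following examples might look like this minimal sketch (the field names are taken from the listings below; the birth date is held in a GregorianCalendar as mentioned later):

import java.io.Serializable;
import java.util.GregorianCalendar;

// Minimal sketch of the entity serialized in the following examples
public class Person implements Serializable {
    private static final long serialVersionUID = 1L;

    private String firstName;
    private String lastName;
    private GregorianCalendar birthDate;

    public Person(String firstName, String lastName, GregorianCalendar birthDate) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.birthDate = birthDate;
    }

    public String getFirstName() { return firstName; }
    public String getLastName() { return lastName; }
    public GregorianCalendar getBirthDate() { return birthDate; }
}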
Binary Data
As a reference we start by sending only the payload. This is the most efficient way of sending data, as there is no overhead involved. The downside is that due to the missing metadata the message can only be de-serialized if we know the exact serialization method. This approach also has a very high testing and maintenance effort and we have to handle all implementation complexity ourselves. The figure below shows what our payload looks like in binary format.
Binary Representation of Address Entity
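As a rough sketch of what sending only the payload means, the three fields can be written by hand with a DataOutputStream; the exact byte layout shown in the figure may differ from this assumption:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hand-rolled binary serialization: only the payload, no metadata
public class BinaryPersonWriter {
    public static byte[] write(Person person) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF(person.getFirstName());
        out.writeUTF(person.getLastName());
        // store the birth date as epoch milliseconds instead of a full calendar object
        out.writeLong(person.getBirthDate().getTimeInMillis());
        out.flush();
        return bytes.toByteArray();
    }
}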
Java Serialization
Now we switch to standard serialization in Java. As you can see below we
are now transferring much more metadata. This data is required by the
Java Runtime to rebuild the transferred object at the receiver side. Besides
structural information the metadata also contains versioning information
which allows communication across different versions of the same object.
In reality, this feature often turns out to be more troublesome than it initially
looks. The metadata overhead in our example is rather high. This is caused
by large amount of data in the GregorianCalendar Object we are using.
The conclusion that Java serialization comes with a very high overhead
per se, however, is not valid. Most of this metadata will be cached for
subsequent invocations.
Person Entity with Java Serialization
Java also provides the ability to override serialization behavior using the Externalizable interface. This enables us to implement a more efficient serialization strategy. In our example we could serialize only the birth date as a long rather than a full object. The downside again is the increased effort regarding testing and maintainability.
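A sketch of what that could look like for our example, writing the birth date as a plain long as suggested above:

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.util.GregorianCalendar;

// Sketch: take over serialization ourselves and store the birth date as a plain long
public class ExternalizablePerson implements Externalizable {
    private String firstName;
    private String lastName;
    private GregorianCalendar birthDate;

    public ExternalizablePerson() {
        // a public no-arg constructor is required by Externalizable
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(firstName);
        out.writeUTF(lastName);
        out.writeLong(birthDate.getTimeInMillis());
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        firstName = in.readUTF();
        lastName = in.readUTF();
        birthDate = new GregorianCalendar();
        birthDate.setTimeInMillis(in.readLong());
    }
}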
Java serialization is used by default in RMI communication when not using IIOP as a protocol. Application server providers also offer their own serialization stacks which are more efficient than default serialization. If interoperability is not important, provider-specific implementations are the better choice.
Alternatives
The Java ecosystem also provides interesting alternatives to Java serialization. A widely known one is Hessian, which can easily be used with Spring. Hessian allows an easy, straightforward implementation of services. Underneath it uses a binary protocol. The figure below shows our data serialized with Hessian. As you can see the transferred data is very slim. Hessian therefore provides an interesting alternative to RMI.

Hessian Binary Representation of Person Object

JSON
A newcomer in serialization formats is JSON (JavaScript Object Notation). Originally used as a text-based format for representing JavaScript objects, it's increasingly been adopted in other languages as well. One reason is the rise of Ajax applications; another is the availability of frameworks for most programming languages.
As JSON is a purely text-based representation it comes with a higher overhead
than the serialization approaches shown previously. The advantage is that
it is more lightweight than XML and it has good support for describing
metadata. The Listing below shows our person object represented in JSON.
{
  "firstName": "Franz",
  "lastName": "Musterman",
  "birthDate": "1979-08-13"
}
XML
XML is the standard format for exchanging data in heterogeneous systems. One nice feature of XML is out-of-the-box support for data validation, which is especially important in integration scenarios. The amount of metadata, however, can become really high depending on the mapping used. All data is transferred in text format by default. However, the usage of CDATA tags enables us to send binary data. The listing below shows our person object in XML. As you can see the metadata overhead is quite high.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<person>
  <birthDate>1979-08-13T00:00:00-07:00</birthDate>
  <firstName>Franz</firstName>
  <lastName>Musterman</lastName>
</person>
Fast InfoSet
Fast InfoSet is becoming a very interesting alternative to XML. It is more or less a lightweight version of XML, which reduces unnecessary overhead and redundancies in data. This leads to smaller data sets and better serialization and deserialization performance.

When working with JAX-WS 2.0 you can enable Fast InfoSet serialization by using the @FastInfoset annotation. Web Service stacks then automatically detect whether it can be used for cross-service communication using HTTP Accept headers.

When looking at data serialized using Fast InfoSet the main difference you will notice is that there are no end tags. After their first occurrence they are only referenced by an index. There are a number of other indexes for content, namespaces etc.
Data is prefixed with its length. This allows for faster and more efficient parsing. Additionally, binary data can avoid being serialized in base64 encoding as it is in XML.

In tests with standard documents the transfer size could be shrunk down to only 20 percent of the original size and the serialization speed could be doubled. The listing below shows our person object now serialized with Fast InfoSet. For illustration purposes I skipped the processing instructions and I used a textual representation instead of binary values. Values in curly braces refer to indexed values. Values in brackets refer to the use of an index.
{0}<person>
{1}<birthDate>{0}1979-08-13T00:00:00-07:00
{2}<firstName>{1}Franz
{3}<lastName>{2}Musterman

The real advantage can be seen when we look at what the next address object would look like. As the listing below shows, we can work mostly with index data only.

[0]<>
[1]<>{0}
[2]<>{3}Hans
[3]<>{4}Musterhaus
Object Graphs

Object graphs can be quite tricky to serialize. This form of serialization is not supported by all protocols. As we need to work with references to entities, the language used by the serialization approach must provide a proper language construct. While this is no problem in serialization formats which are used for (binary) RPC-style interactions, it is often not supported out of the box by text-based protocols. XML itself, for example, supports serializing object graphs using references. The WS-I, however, forbids the usage of the required language construct.

If a serialization strategy does not support this feature it can lead to performance and functional problems, as entities get serialized individually for each occurrence of a reference. If we are, for example, serializing addresses which refer to country information, this information will be serialized for each and every address object, leading to large serialization sizes.
Conclusion
Today there are numerous variants to serialize data in Java. While binary serialization remains the most efficient approach, modern text-based formats like JSON or Fast InfoSet provide valid alternatives, especially when interoperability is a primary concern. Modern frameworks often allow using multiple serialization strategies at the same time. So the approach can even be selected dynamically at runtime.
Real Life Ajax Troubleshooting Guide
by Andreas Grabner
One of our clients occasionally runs into the following problem with their web app: they host their B2B web application in their East Coast data center with their clients accessing the app from all around the United States. Occasionally they have clients complain about bad page load times or that certain features just don't work on their browsers. When the problem can't be reproduced in-house and all of the usual suspects (problem with internet connection, faulty proxy, user error, and so on) are ruled out they actually have to fly out an engineer to the client to analyze the problem on-site. That's a lot of time and money spent to troubleshoot a problem.
Capturing data from the End User
In one recent engagement we had to work with one of their clients on the West Coast complaining that they could no longer log in to the application. After entering username and password and clicking the login button, the progress indicator shown while validating the credentials never goes away. The login worked fine when trying it in-house. The login also worked for other users in the same geographical region using the same browser version. They run dynaTrace on their application servers which allowed us to analyze the requests that came from that specific user. No problems could be detected on the server side. So we ruled out all potential
problems that we could identify from within the data center. Instead of flying somebody to the West Coast we decided to use a different approach. We asked the user on the West Coast to install the dynaTrace Browser Agent. The Browser Agent captures similar data to dynaTrace AJAX Edition. The advantage of the agent is that it automatically ties into the backend. Requests by the browser that execute logic on the application server can be traced end-to-end, from the browser all the way to the database.
dynaTrace Timeline showing browser (JavaScript, rendering, network) and
server-side activity (method executions and database statements)
The Timeline view as shown above gives us a good understanding of what
is going on in the browser when the user interacts with a page. Drilling into
the details lets us see where time is spent, which methods are executed and
where we might have a problem/exception:
End-to-end PurePath that shows what really happens when clicking on a button
on a web page
Why the Progress Indicator Didn't Stop

In order to figure out why the progress indicator didn't stop spinning and therefore blocked the UI for this particular user we compared the data of the user that experienced the problem with the data from a user that had no problems. From a high level we compared the Timeline views.
Identifying the general difference by comparing the two Timeline Views
Both Timelines show the mouse click which ultimately results in sending two XHR Requests. In the successful case we can see a long running JavaScript block that processes the returned XHR Response. In the failing case this processing block is very short (only a few milliseconds). We could also see that in the failing case the progress indicator was not stopped, as we can still observe the rendering activity that updates the rotating progress indicator.
In the next step we drilled into the response handler of the second XHR Request as that's where we saw the difference. It turned out that the XHR Response was an XML Document and the JavaScript handler used an XML DOM parser to parse the response and then iterate through nodes that match a certain XPath Query:
JavaScript loads the XML response and iterates through the DOM nodes
using an XPath expression
The progress indicator itself was hidden after this loop. In the successful case we saw the hideProgressIndicator() method being called, in the failing one it wasn't. That brought us to the conclusion that something in the load function above caused the JavaScript to fail.

Wrong XML Encoding Caused the Problem

dynaTrace not only captures JavaScript execution but also captures network traffic. We looked at the two XML Responses that came back in the successful and failing cases. Both XML Documents were about 350k in size with very similar content. Loading the two documents in an XML Editor highlighted the problem. In the problematic XML Document certain special characters such as German umlauts were not encoded correctly. This caused the dom.loadXML function to fail and exit the method without stopping the progress indicator.
Incorrect encoding of umlauts and other special characters caused the problem
in the XML Parser
As there was no proper error handling in place this problem never made it to the surface in the form of an error message.
Conclusion
To troubleshoot problems it is important to have as much information at hand as possible. Deep-dive diagnostics as we saw in this use case is ideal as it makes it easy to spot the problem and therefore allows us to fix problems faster.

Want to know more about dynaTrace and how we support web performance optimization from development to production? Then check out the following articles:

Best Practices on Web Performance Optimization
dynaTrace in Continuous Integration - The Big Picture
How to integrate dynaTrace with your Selenium Tests
Major GCs - Separating Myth from Reality
by Michael Kopp
In a recent post we showed how the Java Garbage Collection MXBean Counters have changed for the Concurrent Mark-and-Sweep Collector. It now reports all GC runs instead of just major collections. That prompted me to think about what a major GC actually is, or what it should be. It is actually quite hard to find any definition of major and minor GCs. This well-known Java Memory Management Whitepaper only mentions in passing that a full collection is sometimes referred to as a major collection.
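For reference, these counters can be read through the standard java.lang.management API; a small sketch (the reported collector names differ between JVMs and collector configurations):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Print the collection counters exposed by each garbage collector MXBean
public class GcCounterDump {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms");
        }
    }
}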
Stop-the-world
One of the more popular definitions is that a major GC is a stop-the-world event. While that is true, the reverse is not. It is often forgotten that every single GC, even a minor one, is a stop-the-world event. Young Generation collections are only fast if there is a high mortality rate among young objects. That's because they copy the few surviving objects and the number of objects to check is relatively small compared to the old generation. In addition, they are done in parallel nowadays. But even the Concurrent GC has to stop the JVM during the initial mark and the remark phases.
That brings us immediately to the second popular definition.
The Old Generation GC
Very often GC runs in the old generation are considered major GCs. When you read the tuning guides or other references, GC in the tenured or old generation is often equated with a major GC. While every major GC cleans up the old generation, not all such runs can be considered major. The CMS (Concurrent Mark and Sweep) was designed to run concurrently to the application. It executes more often than the other GCs and only stops the application for very short periods of time. Until JDK6 Update 23 its runs were not reported via its MXBean. Now they are, but the impact on the application has not changed and for all intents and purposes I would not consider them major runs. In addition not all JVMs have a generational Heap; IBM and JRockit both feature a continuous Heap by default. We would still see GC runs that we would either consider minor or major. The best definition that we could come up with is that a major GC stops the world for a considerable amount of time and thus has a major impact on response time. With that in mind there is exactly one scenario that fits all the time: a Full GC.
Full GC
According to the aforementioned whitepaper a Full GC will be triggered whenever the heap fills up. In such a case the young generation is collected first, followed by the old generation. If the old generation is too full to accept the content of the young generation, the young generation GC is omitted and the old generation GC is used to collect the full heap, either in parallel or serially. Either way the whole heap is collected with a stop-the-world event. The same is true for a continuous heap strategy, as apart from the concurrent strategy every GC run is a Full GC!
In case of the concurrent GC the old generation should never fill up.
Hence it should never trigger a major GC, which is of course the desired
goal. Unfortunately the concurrent strategy will fail if too many objects
are constantly moved into the old generation, the old generation is too
full or if there are too many allocations altogether. In that case it will fall
back on one of the other strategies and in case of the Sun JVM will use the
Serial Old Collector. This in turn will of course lead to a collection of the
complete heap. This was exactly what was reported via the MXBean prior to
Update 23.
Now we have a good and useful definition of a major GC. Unfortunately since JDK6 Update 23 we cannot monitor for a major GC in case of the concurrent strategy anymore. It should also be clear by now that monitoring for major GCs might not be the best way to identify memory problems, as it ignores the impact minor GCs have. In one of my next posts I will show how we can measure the impact garbage collection has on the application in a better way.
Slow Page Load Time in Firefox Caused
by Old Versions of YUI, jQuery, and Other
Frameworks
by Andreas Grabner
We blogged a lot about performance problems in Internet Explorer caused
by the missing native implementation of getElementsByClassName (in 101
on jQuery Selector Performance, 101 on Prototype CSS Selectors, Top 10
Client Side Performance Problems and several others). Firefox, on the other
hand, has always implemented this and other native lookup methods. This
results in much faster page load times on many pages that rely on lookups
by class name in their onLoad JavaScript handlers. But this is only true if the
web page also takes advantage of these native implementations.
Yahoo News with 1 Second CSS Class Name Lookups
per Page
Looking at a site like Yahoo News shows that this is not necessarily the
case. The following screenshot shows the JavaScript/Ajax Hotspot analysis
from dynaTrace AJAX Edition for http://news.yahoo.com on Firefox 3.6:
6 element lookups by classname result in about 1 second of pure JavaScript
execution time
The screenshot shows the calls to getElementsByClassName doing lookups
for classes such as "yn-menu", "dynamic-_ad" or "filter-controls". Even
though Firefox supports a native implementation of CSS class name
lookups it seems that YUI 2.7 (which is used on this page) does not take
advantage of this. When we drill into the PurePath for one of these calls we
can see what these calls actually do and why it takes them about 150ms to
return a result:
The YUI 2.7 implementation of getElementsByClassName iterates through 1451
DOM elements checking the class name of every element
How to Speed Up These Lookups?
There are two solutions to this specific problem: a) upgrade to a newer version of YUI or b) specify a tag name in addition to the class name.
Upgrade to a Newer Library Version
Framework developers have invested a lot in improving performance over the last couple of years. The guys from Yahoo did a great job in updating their libraries to take advantage of browser-specific implementations. Other frameworks such as jQuery did the same thing. Check out the blogs from the YUI Team or jQuery. If you use a different framework I am sure you will find good blogs or discussion groups on how to optimize performance on these frameworks.

Coming back to YUI: I just downloaded the latest version and compared the implementation of getElementsByClassName from 2.7 (dom.js) to 3.3 (compat.js). There has been a lot of change between the version that is currently used at Yahoo News and the latest version available. Changing framework versions is not always as easy as just using the latest download, as a change like this involves a lot of testing - but the performance improvements are significant and everybody should consider upgrading to newer versions whenever it makes sense.
Specify a Tag Name in Addition to the Class Name

The following text is taken from the getElementsByClassName documentation of YUI 2.7: "For optimized performance, include a tag and/or root node when possible."

Why does this help? Instead of iterating through all DOM elements YUI can query elements by tag name first (using the native implementation of getElementsByTagName) and then only iterate through this subset. This works if the elements you query are of the same type. On all the websites I've analyzed the majority actually query elements of the same type. Also, if you are just looking for elements under a certain root node, specify the root node, e.g. a root DIV element of your dynamic menus.

Implementing this best practice should be fairly easy. It doesn't require an upgrade to a newer framework version but will significantly improve JavaScript execution time.
Conclusion: Stay Up-to-date With Your Frameworks
To sum this blog post up: follow the progress of the frameworks you are
using. Upgrade whenever possible and whenever it makes sense. Also
stay up-to-date with blogs and discussion forums about these frameworks.
Follow the framework teams on Twitter or subscribe to their news feeds.
If you are interested in dynaTrace AJAX Edition check out the latest Beta
announcement and download it for free on the dynaTrace website.
Testing and Optimizing Single Page Web 2.0/AJAX Applications - Why Best Practices Alone Don't Work Any More
by Andreas Grabner
Testing and optimizing what I call traditional page-based web applications is not too hard to do. Take CNN as an example. You have the home page www.cnn.com. From there you can click through the news sections such as U.S., World, Politics and many more - each click loading a new page with a unique URL. Testing this site is rather easy. Use Selenium, WebDriver or even an HTTP-based testing tool to model your test cases. Navigate to all these URLs and - if you are serious about performance and load time - use tools such as YSlow, PageSpeed or dynaTrace AJAX Edition (these tools obviously only work when testing is done through a real browser). These tools analyze page performance based on common Best Practices for every individually tested URL.

This is great for traditional web sites but doesn't work anymore for modern Web 2.0 applications that provide most of their functionality on a single page. An example here would be Google and their apps like Search, Docs or GMail. Only a very small amount of time is actually spent in the initial page load. The rest is spent in JavaScript, XHR calls and DOM manipulations triggered by user actions on the same URL. The following illustration gives us an overview of what part of the overall User Interaction Time actually gets
analyzed with current Best Practice Approaches and which parts are left
out:
The initial Page Load Time on Web 2.0 applications only contributes a small percentage to the overall performance perceived by the end user

Let me explain the differences between traditional and modern web applications and give you some ideas on how to solve these new challenges.
Optimizing Individual Pages Has Become Straightforward - Anybody Can Do It

Let's look at CNN again. For each URL it makes sense to verify the number of downloaded resources, the time until the onLoad Event, the number of JavaScript files and the execution time of the onLoad Event handlers. All these are things that contribute to the Page Load Time of a single URL. The following screenshot shows the timeline of the CNN Start Page:
Key Performance Indicators that can be optimized for individual pages by
optimizing Network Roundtrips, JavaScript and Rendering
Now we can follow recommendations like reducing the number of images, JavaScript and CSS files.
Optimizing Network Resources and with that speeding up Page Load Time for
single pages
Once you are through the recommendations your pages will most likely load faster. Job well done - at least for web sites that have a lot of static pages. But what about pages that leverage JavaScript, DOM manipulations and make use of XHR? You will only speed up the initial page load time, which might only be one percent of the time the user spends on your page.

Single Page Web 2.0 Applications - That's the Next Challenge!

I am sure you use Google on a regular basis, whether it is Google Search, Docs or Gmail. Work with these Web Apps and pay attention to the URL. It hardly ever changes even though you interact with the application by clicking on different links. Not all of these actions actually cause a new page with a different URL to be loaded. Let's look at a simple example: I open
my local Google Search (in my case it is Google Austria) and see the search field in the middle of the page. Once I start typing a keyword two things happen:

1. the search field moves to the top and Google shows me Instant Results based on the currently entered fraction of the keyword
2. a drop-down box with keyword suggestions pops up

But the URL stays the same - I am still on www.google.at. Check out the following screenshot:

All the time while I execute actions on the site I stay on the same page
When we now look at the dynaTrace timeline we see all these actions and the browser activities corresponding to these actions:

Executing a Google Search includes several actions that are all executed on the same URL

The question that arises is, well, is the page performance good? Does Google follow its own Best Practices?
Analyzing Page Load Time is Not Enough
If we want to analyze this scenario with PageSpeed in Firefox we run into the problem that PageSpeed and YSlow are focused on optimizing Page Load Time. These tools just analyze the loading of a URL. In our Google scenario it is just the loading of the Google home page, which (no surprise) gets a really good score:

PageSpeed and YSlow only analyze activities when loading the initial page

Why doesn't this work? In this scenario we miss all the activities that happen on that page after the initial page was loaded.
Analyzing All Activities on a Page Delivers Misleading
Results
On the other hand we have tools like dynaTrace AJAX Edition and Google SpeedTracer that not only look at the page load time but at all activities while the user interacts with the page. This is a step forward but can produce a misleading result. Let's look at the Google example once again: whenever I strike a key, an XHR request is sent to the Google Search servers returning a JavaScript file that is used to retrieve suggestions and instant results. The longer my entered keyword, the more JavaScript files are downloaded. This is as designed for the action "Search something on Google" - but it violates some Web Performance Best Practice rules. That's why analyzing all activities on a single URL will lead to the wrong conclusions. Check out the following screenshot - it tells me that we have too many JavaScript files on that page:
12 JavaScript files from the same domain can be problematic when analyzing a
single page loading but not when analyzing a single-page Web 2.0 application
How to Test and Optimize Single Page Web 2.0
Applications?
Testing Web 2.0 applications is not as challenging as it used to be a couple
of years back. Modern web testing tools both commercial and open
source provide good support for Web 2.0 applications by actually driving
a real browser instance executing different actions such as loading a URL,
clicking on a link or typing keys. I see a lot of people using Selenium or
WebDriver these days. These open source tools work well for most scenarios
and also work across various browsers. But depending on the complexity of
the web site it is possible that you will nd limitations and need to consider
commercial tools that in general do a better job in simulating a real end
user, e.g. really simulating mouse moves and keystrokes, and not just doing
this through JavaScript injection or low-level browser APIs.
For my Google Search example I will use WebDriver. It works well across
Firefox and Internet Explorer. It gives me access to all DOM elements on a
page which is essential for me to verify if certain elements are on the page,
whether they are visible or not (e.g. verifying if the suggestion drop-
down box becomes visible after entering a key) and what values certain
controls have (e.g. what are the suggested values in a Google Search).
Page Object Pattern for Action-based Testing
The following is the test case that I implemented using WebDriver. It is really
straightforward. I open the Google home page. I then need to make sure
we are logged in because Instant Search only works when you are logged
in. I then go back to the main Google Search page and start entering a
keyword. Instead of taking the search result of my entered keyword I pick a
randomly suggested keyword:
Unit Test Case that tests the Google Search Scenario using Page Object Pattern
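The screenshot of the test case is not reproduced here; the following is only a rough sketch of how such a test might be structured with WebDriver and the Page Object pattern. The method names, the suggestion-list selector and the keyword are assumptions for illustration, and the login step and the dynaTrace marker calls are omitted.

import java.util.List;
import java.util.Random;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class GoogleInstantSearchTest {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            GoogleHomePage home = new GoogleHomePage(driver);
            home.open();
            // typing a keyword triggers the suggestion drop-down and Instant Results
            GoogleSuggestion suggestion = home.typeKeyword("performance");
            // pick a random suggestion instead of the entered keyword
            suggestion.clickRandomSuggestion(new Random());
        } finally {
            driver.quit();
        }
    }
}

// Hypothetical page object: encapsulates the actions available on the home page
class GoogleHomePage {
    private final WebDriver driver;
    GoogleHomePage(WebDriver driver) { this.driver = driver; }

    void open() {
        driver.get("http://www.google.at");
    }

    GoogleSuggestion typeKeyword(String keyword) {
        WebElement searchField = driver.findElement(By.name("q"));
        searchField.sendKeys(keyword);
        return new GoogleSuggestion(driver);
    }
}

// Hypothetical page object: represents the suggestion drop-down
class GoogleSuggestion {
    private final WebDriver driver;
    GoogleSuggestion(WebDriver driver) { this.driver = driver; }

    void clickRandomSuggestion(Random random) {
        // the selector depends on Google's markup and will likely need adjusting
        List<WebElement> entries = driver.findElements(By.cssSelector("ul[role='listbox'] li"));
        if (!entries.isEmpty()) {
            entries.get(random.nextInt(entries.size())).click();
        }
    }
}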
The script is really straightforward and fairly easy to read. As you can
see I implemented classes called GoogleHomePage, GoogleSuggestion
and GoogleResultPage. These classes implement the actions that I want
to execute in my test case. Let's look at the method implemented on the
GoogleHomePage class that returns a GoogleSuggestion object:
In order to test the suggestion box we simulate a user typing in keys, then wait
for the result and return a new object that handles the actual suggestions
The code of this method is again not that complicated. What you notice
though is that I added calls to a dynaTrace helper class. dynaTrace AJAX
Edition allows me to set Markers that will show up in the Timeline View (for
details on this see the blog post "Advanced Timing and Argument Capturing").
The addTimerName method is included in the premium version
of dynaTrace AJAX Edition, which we will discuss in a little bit.
Timestamp-based Performance Analysis of Web 2.0
Actions
When I execute my test and instruct WebDriver to launch the browser with
the dynaTrace-specific environment variables, dynaTrace AJAX Edition will
automatically capture all browser activities executed by my WebDriver
script. Read more on these environment variables on our forum post
Automation with dynaTrace AJAX Edition.
Let's have a look at the recorded dynaTrace AJAX session that we get from
executing this script and at the Timeline that shows all the actions that
were executed to get the suggestions, as well as clicking on a random link:
The Markers in the Timeline also have a timestamp that allows us to measure
performance of individual actions we executed
The Timeline shows every marker that we inserted through the script. In
addition to the two that are called "Start/Stop Suggestion", I also placed
a marker before clicking a random suggestion and placed another when
the final search result was rendered to the page. This timestamp-based
approach is a step forward in tracking performance of individual actions.
From here we can manually drill into the timeframe between two markers
and analyze the network roundtrips, JavaScript executions and rendering
activity. The problem that we still have though is that we can't really apply
Best Practices such as "# of Roundtrips" as this number would be very action-
specific. The goal here must be to see how many requests we have per
action and then track this over time. And that's what we want, but we
want it automated!
Action/Timer-based Analysis That Allows Automation
Remember the calls to dynaTrace.addTimerName? This allows me to tag
browser activities with a special name, a timer name. In the premium
extension of dynaTrace, activities are analyzed by these timer names, allowing
me not only to track the execution time of an action but also all
sorts of metrics such as number of downloaded resources, execution time
of JavaScript, number of XHR requests, and so on. The following screenshot
shows the analysis of a single test run focusing on one of the actions that I
named according to the action in my test script:
Key Performance Indicators by Timer Name (=Action in
the Test Script)
This allows us to see how many network requests, JavaScript executions,
XHR Calls, etc. we have per action. Based on these numbers we can come
up with our own Best Practice values for each action and verify that we
meet these numbers for every build we test, avoiding regressions. The
following screenshot shows which measures dynaTrace allows us to track
over time:
Multiple Key Performance Indicators can be subscribed and tracked over time
Instead of looking at these metrics manually dynaTrace supports us with
automatically detecting regressions on individual metrics per Timer
Name (=Action). If I run this test multiple times dynaTrace will learn the
expected values for a set of metrics. If metrics fall outside the expected
value range I get an automated notification. The following screenshot
shows how over time an expected value range will be calculated for us. If
values fall out of this range we get a notication:
Automatically identify regressions on the number of network resources
downloaded for a certain user action
If you are interested in more, check out my posts on "dynaTrace in CI: The
Big Picture" and "How to Integrate dynaTrace with Selenium".
Conclusion: Best Practices Only Work on Page Load Time,
Not on Web 2.0 Action-based Applications
It is very important to speed up Page Load Time, don't get me wrong.
It is the initial perceived performance by a user who interacts with your
site. But it is not all we need to focus on. Most of the time in modern
web applications is spent in JavaScript, DOM manipulations, XHR calls and
rendering that happen after the initial page load. Automatic verification
against Best Practices won't work here anymore because we have to analyze
individual user actions that do totally different things. The way this will
work is to analyze the individual user actions, track performance metrics
and automate regression detection based on these measured values.
The Impact of Garbage Collection on Java
Performance
by Michael Kopp
In my last post I explained what a major Garbage Collection is. While a
major Collection certainly has a negative impact on performance it is not
the only thing that we need to watch out for. And in the case of the CMS
we might not always be able to distinguish between a major and minor
GC. So before we start tuning the garbage collector we first need to know
what we want to tune for. From a high level there are two main tuning
goals.
Execution Time vs. Throughput
This is the first thing we need to clarify: do we want to minimize the time the
application needs to respond to a request, or do we want to maximize
throughput? As with every other optimization these are competing goals
and we can only fully satisfy one of them. If we want to minimize response
time we care about the impact a GC has on the response time first and
on resource usage second. If we optimize for throughput we don't care
about the impact on a single transaction. That gives us two main things
to monitor and tune for: runtime suspension and Garbage Collection CPU
usage. Regardless of which we tune for, we should always make sure that a
GC run is as short as possible. But what determines the duration of a GC run?
What makes a GC slow?
Although it is called Garbage Collection, the amount of collected garbage
has only an indirect impact on the speed of a run. What actually determines
this is the number of living objects. To understand this let's take a quick
look at how Garbage Collection works.
Every GC will traverse all living objects beginning at the GC roots and mark
them as alive. Depending on the strategy it will then copy these objects to
a new area (Copy GC), move them (compacting GC) or put the free areas
into a free list. This means that the more objects stay alive the longer the
GC takes. The same is true for the copy phase and the compacting phase.
The more objects stay alive, the longer it takes. The fastest possible run is
when all objects are garbage collected!
With this in mind let's have a look at the impact of garbage collections.
Impact on Response Time
Whenever a GC is triggered all application threads are stopped. In my last
post I explained that this is true for all GCs to some degree, even for so-
called minor GCs. As a rule every GC except the CMS (and possibly the
G1) will suspend the JVM for the complete duration of a run.
The easiest way to measure impact on the response time is to use your
favorite tool to monitor for major and minor collections via JMX and
correlate the duration with the response time of your application.
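A minimal sketch of such a JMX-based check from within the JVM itself; the polling interval is just an example, and an external monitoring tool would read the same MBeans remotely:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPoller {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // cumulative number of collections and total time spent in them
                System.out.printf("%s: count=%d, time=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            // anything that happens between two polls cannot be correlated precisely
            Thread.sleep(5000);
        }
    }
}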
The problem with this is that we only look at aggregates, so the impact on
a single transaction is unknown. In this picture it does seem like there is no
impact from the garbage collections. A better way of doing this is to use
the JVM-TI interface to get notified about stop-the-world events. This way
the response time correlation is 100% correct, whereas otherwise it would
depend on the JMX polling frequency. In addition, measuring the impact
that the CMS has on response time is harder as its runs do not stop the JVM
for the whole time and since Update 23 the JMX Bean does not report the
real major GC anymore. In this case we need to use either -verbose:gc or a
solution like dynaTrace that can accurately measure runtime suspensions
via a native agent technology.
Here we see a constant but small impact on average, but the impact
on specific PurePaths is sometimes in the 10 percent range. Optimizing
for minimal response time impact has two sides. First we need to get
the sizing of the young generation just right. The optimum would be that no
object survives its first garbage collection, because then the GC would be
fastest and the suspension the shortest possible. As this optimum cannot
be achieved we need to make sure that no object gets promoted to old
space and that an object dies as young as possible. We can monitor that by
looking at the survivor spaces.
This chart shows the survivor space utilization. It always stays well above
50% which means that a lot of objects survive each GC. If we were to look
at the old generation it would most likely be growing, which is obviously
not what we want. Getting the sizing right also means using the smallest
young generation possible. If it is too big, more objects will be alive and
need to be checked, thus a GC will take longer.
If after the initial warm-up phase no more objects get promoted to old
space, we will not need to do any special tuning of the old generation. If
only a few objects get promoted over time and we can take a momentary
hit on response time once in a while we should choose a parallel collector
in the old space, as it is very efficient and avoids some problems that the
CMS has. If we cannot take the hit in response time, we need to choose
the CMS.
The Concurrent Mark and Sweep Collector will attempt to have as little
response time impact as possible by working mostly concurrently with
the application. There are only two scenarios where it will fail. Either we
allocate too many objects too fast, in which case it cannot keep up and
will trigger an old-style major GC; or no object can be allocated due to
fragmentation. In such a case a compaction or a full GC (serial old) must
be triggered. Compaction cannot occur concurrently to the application
running and will suspend the application threads.
If we have to use a continuous heap and need to tune for response time we
will always choose a concurrent strategy.
CPU
Every GC needs CPU. In the young generation this is directly related to
the number of times and duration of the collections. In old space and a
continuous heap things are different. While the CMS is a good choice to achieve
low pause times, it will consume more CPU due to its higher complexity.
If we want to optimize throughput without having any SLA on a single
transaction we will always prefer a parallel GC to the concurrent one. There
are two conceivable optimization strategies: either provide enough memory so that
no objects get promoted to old space and old generation collections never
occur, or keep the number of living objects as small as possible at all times. It
is important to note that the first option does not imply that increasing
memory is a solution for GC-related problems in general. If the old space
keeps growing or fluctuates a lot, then increasing the heap does not help; it
will actually make things worse. While GC runs will occur less often, they will
be that much longer as more objects might need checking and moving. As
GC becomes more expensive with the number of living objects, we need
to minimize that factor.
Allocation Speed
The last and least known impact of a GC strategy is the allocation speed.
While a young generation allocation will always be fast, this is not true in
the old generation or in a continuous heap. In these two cases continued
allocation and garbage collection leads to memory fragmentation.
To solve this problem the GC will do a compaction to defragment the area.
But not all GCs compact all the time or incrementally. The reason is simple:
compaction would again be a stop-the-world event, which GC strategies
try to avoid. The Concurrent Mark and Sweep collector of the Sun JVM does not
compact at all. Because of that such GCs must maintain a so-called free list
to keep track of free memory areas. This in turn has an impact on allocation
speed. Instead of just allocating an object at the end of the used memory,
Java has to go through this list and find a big enough free area for the
newly allocated object.
This impact is the hardest to diagnose, as it cannot be measured directly.
One indicator is a slowdown of the application without any other apparent
reasons, only to be fast again after the next major GC. The only way to
avoid this problem is to use a compacting GC, which will lead to more
expensive GCs. The only other thing we can do is to avoid unnecessary
allocations while keeping the amount of memory usage low.
Conclusion
Allocate as much as you like, but forget as soon as you can, before the
next GC run if possible. Don't overdo it either; there is a reason why using
StringBuilder is more efficient than simple String concatenation. And finally,
keep your overall memory footprint and especially your old generation as
small as possible. The more objects you keep alive, the worse the GC will perform.
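To illustrate the StringBuilder remark, a minimal sketch of the two variants (method names are made up): the builder reuses one backing buffer, while naive concatenation creates a new String on every iteration.

import java.util.List;

public class ConcatExample {
    // one backing buffer, one final String allocation
    static String concatenate(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    // every += creates a new String (and a temporary builder),
    // producing garbage proportional to the number of parts
    static String concatenateNaively(List<String> parts) {
        String result = "";
        for (String part : parts) {
            result += part;
        }
        return result;
    }
}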
Troubleshooting Response Time Problems:
Why You Cannot Trust Your System Metrics
by Michael Kopp
Production monitoring is about ensuring the stability and health of your
system, including the application. A lot of times we encounter production
systems that concentrate on system monitoring, under the assumption that
a stable system leads to stable and healthy applications. So let's see what
system monitoring can tell us about our application.
Let's take a very simple two-tier web application:
A simple two-tier web application
This is a simple multi-tier eCommerce solution. Users are concerned about
bad performance when they do a search. Let's see what we can find out
about it if performance is not satisfactory. We start by looking at a couple
of simple metrics.
CPU Utilization
The best known operating system metric is CPU utilization, but it is also
the most misunderstood. This metric tells us how much time the CPU
spent executing code in the last interval and how much more it could
execute theoretically. Like all other utilization measures it tells us something
about capacity, but not about health, stability or performance. Simply put:
99% CPU utilization can either be optimal or indicate impending disaster
depending on the application.
The CPU charts show no shortage on either tier
Let's look at our setup. We see that the CPU utilization is well below 100%,
so we do have capacity left. But does that mean the machine or the
application can be considered healthy? Let's look at another measure that is
better suited for the job, the Load Average (System\Processor Queue Length
on Windows). The Load Average tells us how many threads or processes are
currently executing or waiting to get CPU time.
Unix top output: load average: 1.31, 1.13, 1.10
Linux systems display three sliding load averages for the last one, five and
15 minutes. The output above shows that in the last minute there were on
average 1.3 processes that needed a CPU core at the same time.
If the Load Average is higher than the number of cores in the system we
should either see near 100% CPU utilization, or the system has to wait for
other resources and cannot max out the CPU. Examples would be Swapping
or other I/O related tasks. So the Load Average tells us if we should trust
the CPU usage on the one hand and if the machine is overloaded on the
other. It does not tell us how well the application itself is performing, but
whether the shortage of CPU might impact it negatively. If we do notice a
problem we can identify the application that is causing the issue, but not
why it is happening.
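If you want to check this from within the JVM itself, the standard OperatingSystemMXBean exposes the one-minute load average; a small sketch (the value is -1 on platforms such as Windows where it is not available):

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadAverageCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();   // 1-minute average, -1 if not available
        int cores = os.getAvailableProcessors();
        System.out.printf("load average: %.2f, cores: %d%n", load, cores);
        if (load > cores) {
            // more runnable threads than cores: expect CPU waits or other resource contention
            System.out.println("Machine is overloaded or waiting on other resources");
        }
    }
}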
In our case we see that neither the load average nor the CPU usage sheds
any light on our performance issue. If they were to show high CPU utilization
or a high load average we could assume that the shortage in CPU is a
problem, but we could not be certain.
Memory Usage
Memory use is monitored because lack of memory will lead to system
instability. An important fact to note is that Unix and Linux operating
systems will mostly show close to 100% memory utilization over time.
They fill the memory up with buffers and caches which get discarded, as
opposed to swapped out, if that memory is needed otherwise. In order to
get the real memory usage we need to subtract these. In Linux we can do
so by using the free command.
Memory Usage on the two systems, neither is suffering memory problems
If we do not have enough memory we can try to identify which application
consumes the most by looking at the resident memory usage of a process.
Once identified we will have to use other means to identify why the
process uses up the memory and whether this is OK. When we look at
memory thinking about Java/.NET performance we have to make sure that
the application itself is never swapped out. This is especially important
because Java accesses all its memory in a random-access fashion and if
a portion were to be swapped out it would have severe performance
penalties. We can monitor this via swapping measures on the process
itself. So what we can learn here is whether the shortage of memory has a
negative impact on application performance. As this is not the case, we are
tempted to ignore memory as the issue.
We could look at other measures like network or disk, but in all cases the
same thing would be true: the shortage of a resource might have an impact,
but we cannot say for sure. And if we don't find a shortage it does not
necessarily mean that everything is fine.
Databases
An especially good example of this problem is the database. Very often the
database is considered the source of all performance problems, at least by
application people. From a DBA and operations point of view the database is
often running fine though. Their reasoning is simple enough: the database
is not running out of any resources, there are no especially long-running
or CPU-consuming statements or processes running, and most statements
execute quite fast. So the database cannot be the problem.
Let's look at this from an application point of view.
Looking at the Application
As users are reporting performance problems the first thing that we do is to
look at the response time and its distribution within our system.
The overall distribution in our system does not show any particular bottleneck
At first glance we don't see anything particularly interesting when looking
at the whole system. As users are complaining about specific requests let's
go ahead and look at these in particular:
The response time distribution of the specic request shows a bottleneck in the
backend and a lot of database calls for each and every search request
We see that the majority of the response time lies in the backend and
the database layer. That the database contributes a major portion to the
response time does not mean however that the DBA was wrong. We see
that every single search executes 416 statements on average! That means
that every statement is executing in under one millisecond and this is fast
enough from the database point of view. The problem really lies within the
application and its usage of the database. Let's look at the backend next.
The heap usage and GC activity chart shows a lot of GC runs, but does it have
a negative impact?
Looking at the JVM we immediately see that it does execute a lot of garbage
collection (the red spikes), as you would probably see in every monitoring
tool. Although this gives us a strong suspicion, we do not know how this is
affecting our users. So lets look at that impact:
These are the runtime suspensions that directly impact the search. It is
considerable but still amounts to only 10% of the response time
A single transaction is hit by garbage collection several times and if we
do the math we find out that garbage collection contributes 10% to the
response time. While that is considerable, it would not make sense to spend a
lot of time on tuning it just now. Even if we reduce it by half it will only
save us 5% of the response time. So while monitoring garbage collection
is important, we should always analyze the impact before we jump to
conclusions.
So let's take a deeper look at where that particular transaction is spending
time on the backend. To do this we need to have application centric
monitoring in place which we can then use to isolate the root cause.
The detailed response time distribution of the search within the backend shows
two main problems: too many EJB calls and a very slow doPost method
With the right measuring points within our application we immediately
see the root causes of the response time problem. At first we see that the
WebService call done by the search takes up a large portion of the response
time. It is also the largest CPU hotspot within that call. So while the host
is not suffering CPU problems, we are in fact consuming a lot of it in that
particular transaction. Secondly we see that an awful lot of EJB calls are
done which in turn leads to the many database calls that we have already
noticed.
That means we have identified a small memory-related issue, although
there are no memory problems noticeable if we were to look only at system
monitoring. We also found that we have a CPU hotspot, but the machine
itself does not have a CPU problem. And finally we found that the biggest
issue is squarely within the application: too many database and EJB calls,
which we cannot see on a system monitoring level at all.
Conclusion
System metrics do a very good job at describing the environment - after
all, that is what they are meant for. If the environment itself has resource
shortages we can almost assume that this has a negative impact on the
applications, but we cannot be sure. If there is no obvious shortage this
does not, however, imply that the application is running smoothly. A
healthy and stable environment does not guarantee a healthy, stable and
performing application.
Similar to the system, the application needs to be monitored in detail and
with application-specic metrics in order to ensure its health and stability.
There is no universal rule as to what these metrics are, but they should enable
us to describe the health, stability and performance of the application itself.
The Cost of an Exception
by Alois Reitbauer
Recently there was an extensive discussion at dynaTrace about the cost
of exceptions. When working with customers we very often find a lot of
exceptions they are not aware of. After removing these exceptions, the code
runs significantly faster than before. This creates the assumption that using
exceptions in your code comes with a significant performance overhead.
The implication would be that you had better avoid using exceptions.
As exceptions are an important construct for handling error situations,
avoiding exceptions completely does not seem to be a good solution. All in
all this was reason enough to have a closer look at the costs of throwing
exceptions.
The Experiment
I based my experiment on a simple piece of code that randomly throws an
exception. This is not a really scientifically profound measurement and we
also don't know what the HotSpot compiler does with the code as it runs.
Nevertheless it should provide us with some basic insights.
public class ExceptionTest {

    public long maxLevel = 20;

    public static void main(String... args) {
        ExceptionTest test = new ExceptionTest();

        long start = System.currentTimeMillis();
        int count = 10000;
        for (int i = 0; i < count; i++) {
            try {
                test.doTest(2, 0);
            } catch (Exception ex) {
                // ex.getStackTrace();
            }
        }
        long diff = System.currentTimeMillis() - start;
        System.out.println(String.format("Average time for invocation: %1$.5f",
                ((double) diff) / count));
    }

    public void doTest(int i, int level) {
        if (level < maxLevel) {
            try {
                doTest(i, ++level);
            } catch (Exception ex) {
                // ex.getStackTrace();
                throw new RuntimeException("UUUPS", ex);
            }
        } else {
            if (i > 1) {
                throw new RuntimeException("Ups".substring(0, 3));
            }
        }
    }
}
The Result
The result was very interesting. The cost of throwing and catching an
exception seems to be rather low. In my sample it was about 0.002ms per
exception. This can more or less be neglected unless you really throw too
many exceptions; we're talking about 100,000 or more.
While these results show that exception handling itself is not affecting code
performance, it leaves open the question: what is responsible for the huge
performance impact of exceptions? So obviously I was missing something,
something important.
After thinking about it again, I realized that I was missing an important
part of exception handling. I missed the part about what you do when
exceptions occur. In most cases you hopefully do not just catch the
exception and that's it. Normally you try to compensate for the problem
and keep the application functioning for your end users. So the point I
was missing was the compensation code that is executed for handling an
exception. Depending on what this code is doing the performance penalty
can become quite significant. In some cases this might mean retrying to
connect to a server, in other cases it might mean falling back to a default
solution that performs far worse.
While this seemed to be a good explanation for the behavior we saw in
many scenarios, I decided I was not done yet with the analysis. I had the
feeling that there was something else that I was missing here.
Stack Traces
Still curious about this problem I looked into how the situation changes
when I collect stack traces. This is what very often happens: you log an
exception and its stack trace to try to figure out what the problem is.
I therefore modified my code to get the stack trace of an exception as
well. This changed the situation dramatically. Getting the stack traces of
exceptions had a 10x higher impact on the performance than just catching
and throwing them. So while stack traces help to understand where and
possibly also why a problem has occurred, they come with a performance
penalty.
The impact here is often substantial, as we are not talking about a single
stack trace. In most cases exceptions are thrown and caught at multiple
levels. Let us look at a simple example of a Web Service client connecting
to a server. First there is an exception at the Java library level for the failed
connection. Then there is a framework exception for the failed client and
then there might be an application-level exception that some business
logic invocation failed. This totals three stack traces being collected.
In most cases you should see them in your log files or application
output. Writing these potentially long stack traces also comes with some
performance impact. At least you normally see them and can react to them
if you look at your log files regularly (you do look at your log files regularly,
don't you?). In some cases I have seen even worse behavior due to incorrect
logging code. Instead of checking whether a certain log level is enabled by
calling log.isXxxEnabled() first, developers just call logging methods. When
this happens, logging code is always executed, including getting stack traces
of exceptions. As the log level however is set too low, they never show up
anywhere; you might not even be aware of them. Checking for log levels
first should be a general rule, as it also avoids unnecessary object creation.
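As a small sketch of that guarded logging pattern (using SLF4J purely for illustration; the connection helper is made up):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PaymentClient {
    private static final Logger log = LoggerFactory.getLogger(PaymentClient.class);

    void connect(String host) {
        try {
            openConnection(host);   // hypothetical helper that may throw
        } catch (Exception ex) {
            // guard the expensive part: the stack trace is only rendered
            // when debug logging is actually enabled
            if (log.isDebugEnabled()) {
                log.debug("Connection to " + host + " failed", ex);
            }
            // on higher log levels the message alone is usually enough
            log.warn("Connection to {} refused: {}", host, ex.getMessage());
        }
    }

    private void openConnection(String host) throws Exception {
        throw new Exception("connection refused: " + host);
    }
}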
Conclusion
Not using exceptions because of their potential performance impact is a
bad idea. Exceptions help to provide a uniform way to cope with runtime
problems and they help to write clean code. You however need to trace the
number of exceptions that are thrown in your code. Although they might
be caught they can still have a significant performance impact. In dynaTrace
we, by default, track thrown exceptions and in many cases people are
surprised by what is going on in their code and what the performance
impact is in resolving them.
While exception usage is good you should avoid capturing too many
stack traces. In many cases they are not even necessary to understand
the problem, especially if they cover a problem you already expect. The
exception message therefore might prove to be enough information. I get
enough out of a "connection refused" message to not need the full stack
trace into the internals of the java.net call stack.
Application Performance Monitoring in
Production: A Step-by-Step Guide, Part 1
by Michael Kopp
Setting up application performance monitoring is a big task, but like
everything else it can be broken down into simple steps. You have to know
what you want to achieve and subsequently where to start. So let's start at
the beginning and take a top-down approach.
Know What You Want
The first thing to do is to be clear about what we want when monitoring
the application. Let's face it: we do not want to ensure that CPU utilization
is below 90 percent or that network latency is under one millisecond. We
are also not really interested in garbage collection activity or whether the
database connection pool is large enough. What we really want to ensure is
the health and stability of our application and business services. To
ensure that, we need to leverage all of the above-mentioned metrics. What
does the health and stability of the application mean though? A healthy
and stable application performs its function without errors and delivers
accurate results within a predefined satisfactory time frame. In technical
terms this means low response time and/or high throughput and a low to
non-existent error rate. If we monitor and ensure this then the health and
stability of the application is likewise guaranteed.
Define Your KPIs
First we need to define what satisfactory performance means. In case of
an end-user facing application things like first impression and page load
time are good KPIs. The good thing is that "satisfactory" is relatively simple,
as the user will tolerate up to a 3-4 second wait but will get frustrated after
that. Other interactions, like a credit card payment or a search, have very
different thresholds though and you need to define them. In addition to
response time, you also need to define how many concurrent users you
want, or need, to be able to serve without impacting the overall response
time. These two KPIs, response time and concurrent users, will get you very
far if you apply them on a granular enough level. If we are talking about
a transaction-oriented application your main KPI will be throughput. The
desired throughput will depend on the transaction type. Most likely you
will have a time window within which you have to process a certain known
number of transactions, which dictates what satisfactory performance
means to you.
Resource and hardware usage can be considered secondary KPIs. As long
as the primary KPI is not met, we will not look too closely at the secondary
ones. On the other hand, as soon as the primary KPI is met optimizations
must always be towards improving these secondary KPIs.
If we take a strict top-down approach and measure end-to-end we will not
need more detailed KPIs for response time or throughput. We of course
need to measure in more detail than that in order to ensure performance.
Know What, Where and How to Measure
In addition to specifying a KPI for e.g. the response time of the search
feature we also need to define where we measure it.
The different places where we can measure response time
This picture shows several different places where we can measure the
response time of our application. In order to have objective and comparable
measurements we need to define where we measure it. This needs to be
communicated to all involved parties. This way you ensure that everybody
talks about the same thing. In general the closer you come to the end
user the closer it gets to the real world and also the harder it is to measure.
We also need to define how we measure. If we measure the average we
will need to define how it is calculated. Averages themselves are alright
if you talk about throughput, but very inaccurate for response time. The
average tells you nearly nothing about the actual user experience, because
it ignores volatility. Even if you are only interested in throughput, volatility
is interesting. It is harder to plan capacity for a highly volatile application
than for one that is stable. Personally I prefer percentiles over averages, as
they give us a good picture of response time distribution and thus volatility.
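To make the difference concrete, here is a small sketch with made-up response times (in milliseconds) and a simple nearest-rank percentile calculation: a handful of slow outliers drags the average far away from what most users actually experience.

import java.util.Arrays;

public class Percentiles {
    // nearest-rank percentile of a set of response times
    static long percentile(long[] responseTimesMs, double p) {
        long[] sorted = responseTimesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        // illustrative values only: most requests are fast, a few are very slow
        long[] samples = {300, 320, 350, 400, 420, 500, 3000, 5000, 12000, 20000};
        long sum = 0;
        for (long sample : samples) {
            sum += sample;
        }
        System.out.println("average         = " + (sum / samples.length) + " ms");
        System.out.println("50th percentile = " + percentile(samples, 50) + " ms");
        System.out.println("90th percentile = " + percentile(samples, 90) + " ms");
    }
}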
50th, 75th, 90th and 95th percentile of end user response time for page load
In the above picture we see that the page load time of our sample has a
very high volatility. While 50 percent of all page requests are loaded in 3
seconds, the slowest 10 percent take between 5 and 20 seconds! That
not only bodes ill for our end user experience and performance goals,
but also for our capacity management (we'd need to over-provision a
lot to compensate). High volatility in itself indicates instability and is not
desirable. It can also mean that we measure the response time with not
enough granularity. It might not be enough to measure the response time
of e.g. the payment transactions in general. For instance, credit card and
debit card payment transactions might have very different characteristics so
we should measure them separately. Without doing that type of measuring,
response time becomes meaningless because we will not see performance
problems and monitoring a trend will be impossible.
This brings us to the next point: what do we measure? Most monitoring
solutions allow monitoring either on a URL level, servlet level (JMX/
App Servers) or network level. In many cases the URL level is good enough
as we can use pattern matching on specific URI parameters.
Create measures by matching the URI of our Application and Transaction type
For Ajax, WebService Transactions or SOA applications in general this will
not be enough. WebService frameworks often provide a single URI entry
point per application or service and distinguish between different business
transactions in the SOAP message. Transaction-oriented applications have
different transaction types which will have very different performance
characteristics, yet the entry point to the application will be the same nearly
every time (e.g. JMS). The transaction type will only be available within the
request and within the executed code. In our credit/debit card example we
would most likely see this only as part of the SOAP message. So what we
need to do is to identify the transaction within our application. We can do
this by modifying the code and providing the measurements ourselves (e.g.
via JMX). If we do not want to modify our code we could also use aspects
to inject it or use one of the many monitoring solutions that support this
kind of categorization via business transactions.
We want to measure response time of requests that call a method with a given
parameter
In our case we would measure the response time of every transaction and
label it as a debit card payment transaction when the shown method is
executed and the argument of the first parameter is "DebitCard". This way
we can measure the different types of transactions even if they cannot be
distinguished via the URI.
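A minimal home-grown sketch of that idea (the class and method names are made up; a monitoring solution or an aspect would normally inject this instead of hand-written code): the elapsed time is recorded under a business transaction name derived from the first parameter.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class PaymentService {

    private static final Map<String, AtomicLong> totalMs = new ConcurrentHashMap<String, AtomicLong>();
    private static final Map<String, AtomicLong> count = new ConcurrentHashMap<String, AtomicLong>();

    public void pay(String paymentType, double amount) {
        long start = System.currentTimeMillis();
        try {
            process(paymentType, amount);   // the actual business logic (not shown)
        } finally {
            // label the measurement with the business transaction, e.g. "Payment/DebitCard"
            record("Payment/" + paymentType, System.currentTimeMillis() - start);
        }
    }

    private static void record(String businessTransaction, long elapsedMs) {
        totalMs.putIfAbsent(businessTransaction, new AtomicLong());
        count.putIfAbsent(businessTransaction, new AtomicLong());
        totalMs.get(businessTransaction).addAndGet(elapsedMs);
        count.get(businessTransaction).incrementAndGet();
        // these per-transaction-type values could then be exposed via JMX
    }

    private void process(String paymentType, double amount) {
        // ...
    }
}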
Think About Errors
Apart from performance we also need to take errors into account. Very
often we see applications where most transactions respond within 1.5
seconds and sometimes a lot faster, e.g. 0.2 seconds. More often than not
these very fast transactions represent errors. The result is that the more
errors you have the better your average response time will get, which is of
course misleading.
Show error rate, warning rate and response time of two business transactions
We need to count errors on the business transaction level as well. If you
don't want to have your response time skewed by those errors, you should
exclude erroneous transactions from your response time measurement. The
error rate of your transactions would be another KPI on which you can
put a static threshold. An increased error rate is often the first sign of an
impending performance problem, so you should watch it carefully.
I will cover how to monitor errors in more detail in one of my next posts.
What Are Problems?
It sounds like a silly question but I decided to ask it anyway, because in
order to detect problems, we first need to understand them.
ITIL defines a problem as a recurring incident or an incident with high
impact. In our case this means that a single transaction that exceeds our
response time goal is not considered a problem. If you are monitoring a
big system you will not have the time or the means to analyze every single
violation anyway. But it is a problem if the response time goal is exceeded
by 20% of your end user requests. This is one key reason why I prefer
percentiles over averages. I know I have a problem if the 80th percentile
exceeds the response time goal.
The same can be said for errors and exceptions. A single exception or
error might be interesting to the developer. We should therefore save the
information so that it can be fixed in a later release. But as far as Operations
is concerned, it will be ignored if it only happens once or twice. On the
other hand if the same error happens again and again we need to treat it
as a problem as it clearly violates our goal of ensuring a healthy application.
Alerting in a production environment must be set up around this idea. If
we were to produce an alert for every single incident we would have a so-
called alarm storm and would either go mad or ignore them entirely. On
the other hand if we wait until the average is higher than the response time
goal, customers will be calling our support line before we are aware of the
problem.
Know Your System and Application
The goal of monitoring is to ensure proper performance. Knowing there
is a problem is not enough; we need to isolate the root cause quickly. We
can only do that if we know our application and which other resources
or services it uses. It is best to have a system diagram or flow chart that
describes your application. You most likely will want to have at least two or
three different detail levels of this.
1. System Topology
This should include all your applications, services, resources and the
communication patterns on a high level. It gives us an idea of what
exists and which other applications might influence ours.
2. Application Topology
This should concentrate on the topology of the application itself.
It is a subset of the system topology and would only include
communication flows as seen from that application's point of view. It
should end when it calls third party applications.
3. Transaction Response Flow
Here we would see the individual business transaction type. This is the
level that we use for response time measurement.
Maintaining this can be tricky, but many monitoring tools provide this
automatically these days. Once we know which other applications and
services our transaction is using we can break down the response time into
its contributors. We do this by measuring the request on the calling side,
inside our application and on the receiving end.
Show response time distribution throughout the system of a single
transaction type
This way we get a definite picture of where response time is spent. In
addition we will also see if we lose time on the network in the form of
latency.
Next Steps
At this point we can monitor the health, stability and performance of
our application and we can isolate the tier responsible in case we have a
problem. If we do this for all of our applications we will also get a good
picture of how the applications impact each other. The next steps are
to monitor each application tier in more detail, including resources used
and system metrics. In the coming weeks I will explain how to monitor
each of these tiers with the specific goal of allowing production-level root-
cause analysis. At every level we will focus on monitoring the tier from an
application and transaction point of view as this is the only way we can
accurately measure performance impact on the end user.
Finally I will also cover system monitoring. Our goal is however not to
monitor and describe the system itself, but measure how it affects the
application. In terms of application performance monitoring, system
monitoring is an integral part and not a separate discipline.
White Box Testing Best Practices for
Performance Regression and Scalability
Analysis
by Andreas Grabner
Every change in your code can potentially introduce a performance
regression and can impact the application's ability to scale. Regression
analysis addresses this problem by tracking performance aspects of different
components throughout the development process and under different
load patterns.
Black vs. White Box Analysis
There are different flavors of regression analysis. We can either look at the
overall response time of a transaction/request or at the execution times of
all involved components. The terms "black box" and "white box" testing
are commonly used: black box testing looks at the application as one
entity; white box testing on the other hand analyzes the performance of
all individual components. At first glance this might not seem like a big deal, as
both approaches allow us to identify changes in the application that affect
end user response times. But let's have a look at where white box testing
really rocks!
Simple Scalability Analysis
When you want to test how your application scales I recommend using a
simple increasing-workload load test. In this test you start with minimum
load and ramp it up over time. Using a black box approach we analyze the
response time of our application. The following illustration shows how our
application behaves from a response time perspective:
Black box testing response time analysis. We can probably predict how response
time of the application will change with increasing load - or can we?
It looks like our application is doing OK. With increasing load we see
response time going up slightly which can be expected. But does this
mean that the trend will continue if we put more load on the system? Black box
testing forces us to assume that the application will continue to scale in the
same way, which we can't really be sure of. With white box testing we can
see into the individual components and can figure out how performance
changes in the individual layers.
White box testing analyzes all involved components. We can see which
components scale better than others and how well the application really scales
Getting white box insight into application components while executing a
load test allows us to learn more about the scalability of our application.
In this case, it seems like our business logic layer is by far the worst scaling
component in our application when we increase load. Knowing this allows
us to a) focus on this component when improving performance and b) make
a decision on whether and how to distribute our application components.
Analyzing Regressions on Component Level
During the development process you also want to verify if code changes
have any negative impact on performance. An improved algorithm in one
component can improve overall performance but can also have a negative
impact on other components. The improvement effort is neutralized by
other components that can't deal with the changed behavior.
Let's have a look at an example. We have an application that contains
presentation, business and database layers. An analysis of the first
implementation shows that the business layer makes many roundtrips to
the database to retrieve objects that hardly ever change. The engineers
decide that these types of objects would be perfect for caching. The
additional cache layer frees up database resources that can be used for
other applications. In the next iteration the business owner decides to
change certain aspects of the business logic. This change doesn't seem
to have any negative impact on the response time as reported from the
black box tests. When looking at all involved components, however, we see
that the changed business logic is bypassing the cache because it requires
other objects that have not been configured for caching. This puts the
pressure back on the database and is a good example of a component-level
regression.
Identify performance regressions on component level
Where to Go from Here?
It doesn't really matter which tools you are using: whether you use
open source or commercial load testing tools, whether you use
commercial performance management software like dynaTrace or your own
home-grown performance logging mechanisms. It is important that you
run tests continuously and that you track your performance throughout the
development lifecycle. You are more efficient if you have tools that allow
you to automate many of these tasks such as automatically executing tests,
automatically collecting performance-relevant data on code level and also
automatically highlighting regressions or scalability problems after tests are
executed.
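If you go the home-grown route, a regression check in your test suite can be as simple as comparing measured times against a recorded baseline with a tolerance. A made-up sketch, not how any particular tool does it:

public class RegressionCheck {

    // fails the test/build if the measured time exceeds the baseline by more than the tolerance
    public static void assertNoRegression(String component, long measuredMs,
                                          long baselineMs, double tolerance) {
        if (measuredMs > baselineMs * (1 + tolerance)) {
            throw new AssertionError(String.format(
                    "%s regressed: %dms measured vs %dms baseline (+%.0f%% allowed)",
                    component, measuredMs, baselineMs, tolerance * 100));
        }
    }

    public static void main(String[] args) {
        // example: business layer took 430ms in this build, baseline is 400ms, 10% allowed
        assertNoRegression("BusinessLayer", 430, 400, 0.10);
    }
}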
To give you an idea here are some screenshots of how automated and
continuous white box testing can help you in your development process.
Track Performance of Tests Across Builds and
Automatically Alert on Regressions
Automatically Identify Performance Regressions on your daily performance and
integration tests
Analyze an Increasing Load test and Identify the
Problematic Transactions and Components
Analyze scalability characteristics of components during an increasing load test
Analyze Regressions on Code Level Between Builds/
Milestones
Compare performance regressions on component and method level
Additional Reading Material
If you want to know more about testing check out the other blog posts
such as "101 on Load Testing", "Load Testing with SilkPerformer" or "Load
Testing with Visual Studio".
The Top Java Memory Problems - Part 1
by Michael Kopp
Memory and garbage collection problems are still the most prominent
issues in any Java application. One of the reasons is that the very nature
of garbage collection is often misunderstood. This prompted me to write
a summary of some of the most frequent and also most obscure memory-
related issues that I have encountered in my time. I will cover the causes
of memory leaks, high memory usage, class loader problems and GC
configuration, and how to detect them. We will begin this series with the
best known one: memory leaks.
Memory Leaks
A memory leak is the most-discussed Java memory issue there is. Most
of the time people only talk about growing memory leaks - that is, the
continuous rise of leaked objects. They are, in comparison, the easiest to
track down as you can look at trending or histogram dumps.
Memory trending dump that shows the number of objects of the same type
increasing
In the picture we see the dynaTrace trending dump facility. You can achieve
similar results manually by using jmap -histo <pid> multiple times and
comparing the results. Single object leaks are less thought about. As long as
we have enough memory they seldom pose a serious problem. From time
to time though there are single object leaks that occupy a considerable
amount of memory and become a problem. The good news is that single
big leaks are easily discovered by today's heap analyzer tools, as they
concentrate on that.
Single object which is responsible for a large portion of the memory being leaked
There is also the particularly nasty, but rarely-seen, case of a lot of small,
unrelated memory leaks. It's theoretically possible, but in reality it would
need a lot of seriously bad programmers working on the same project. So
let's look at the most common causes for memory leaks.
ThreadLocal Variables
ThreadLocals are used to bind a variable or a state to a thread. Each thread
has its own instance of the variable. They are very useful but also very
dangerous. They are often used to track a state, like the current transaction
id, but sometimes they hold a little more information. A ThreadLocal
variable is referenced by its thread and as such its lifecycle is bound to it. In
most application servers threads are reused via thread pools and thus are
never garbage-collected. If the application code is not carefully clearing the
thread local variable you get a nasty memory leak.
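A minimal sketch of the usual remedy in a servlet environment (the filter and the transaction id here are made up for illustration): clear the ThreadLocal in a finally block so the value does not stay attached to the pooled worker thread.

import java.io.IOException;
import java.util.UUID;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class TransactionContextFilter implements Filter {

    private static final ThreadLocal<String> TRANSACTION_ID = new ThreadLocal<String>();

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        TRANSACTION_ID.set(UUID.randomUUID().toString());
        try {
            chain.doFilter(request, response);
        } finally {
            // without this the value stays referenced by the pooled thread and leaks
            TRANSACTION_ID.remove();
        }
    }

    public void init(FilterConfig filterConfig) { }

    public void destroy() { }
}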
These kinds of memory leaks can easily be discovered with a heap dump.
Just take a look at the ThreadLocalMap in the heap dump and follow the
references.
The heap dump shows that over 4000 objects which amount to about
10MB are held by ThreadLocals
You can then also look at the name of the thread to figure out which part
of your application is responsible for the leak.
Mutable Static Fields and Collections
The most common reason for a memory leak is the wrong usage of statics.
A static variable is held by its class and subsequently by its classloader.
While a class can be garbage-collected it will seldom happen during an
application's lifetime. Very often statics are used to hold cache information
or share state across threads. If this is not done diligently it is very easy to get
a memory leak. Static mutable collections especially should be avoided at
all costs for just that reason. A good architectural rule is not to use mutable
static objects at all; most of the time there is a better alternative.
Circular and Complex Bi-directional References
This is my favorite memory leak. It is best explained by example:
org.w3c.dom.Document doc = readXmlDocument();  // some helper that parses an XML file
org.w3c.dom.Node child = doc.getDocumentElement().getFirstChild();
doc.getDocumentElement().removeChild(child);   // detach the node from the document tree
doc = null;
At the end of the code snippet we would think that the DOM document will
be garbage-collected. That is however not the case. A DOM node object
always belongs to a document. Even when removed from the document,
the Node object still has a reference to its owning document. As long as
we keep the child object the document and all other nodes it contains will
not be garbage-collected. I've seen this and other similar issues quite often.
JNI Memory Leaks
This is a particularly nasty form of memory leak. It is not easily found unless
you have the right tool, and it is also not known to a lot of people. JNI
is used to call native code from Java. This native code can handle, call
and also create Java objects. Every Java object created in a native method
begins its life as a so-called local reference. That means that the object
is referenced until the native method returns. We could say the native
method references the Java object. So you don't have a problem unless the
native method runs forever. In some cases you want to keep the created
object even after the native call has ended. To achieve this you can either
ensure that it is referenced by some other Java object or you can change
the local reference into a global reference. A global reference is a GC root
and will never be garbage-collected until explicitly deleted by the native
code. The only way to discover such a memory leak is to use a heap dump
tool that explicitly shows global native references. If you have to use JNI
you should rather make sure that you reference these objects normally and
forgo global references altogether.
You can find this sort of leak when your heap dump analysis tool explicitly
marks the GC root as a native reference; otherwise you will have a hard
time.
Wrong Implementation of Equals/Hashcode
It might not be obvious at first glance, but if your equals/hashcode
methods violate the equals contract it will lead to memory leaks when used
as a key in a map. A hashmap uses the hashcode to look up an object and
verify that it found it by using the equals method. If two objects are equal
they must have the same hashcode, but not the other way around. If you
do not explicitly implement hashcode yourself this is not the case. The
default hashcode is based on object identity. Thus, if you use an object without
a valid hashcode implementation as a key in a map, you will be able to add
things but you will not find them anymore. Even worse: if you re-add it, it
will not overwrite the old item but actually add a new one, and just like
that you have a memory leak. You will find it easily enough as it is growing,
but the root cause will be hard to determine unless you remember this
article.
The easiest way to avoid this is to use unit test cases and one of the available
frameworks that test the equals contract of your classes (e.g. http://code.
google.com/p/equalsverifier/).
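To illustrate the effect, a small made-up example where equals() is overridden but hashCode() is not: logically equal keys end up in different hash buckets, so every put adds a new entry and lookups fail.

import java.util.HashMap;
import java.util.Map;

public class MapKeyLeak {

    static class CustomerKey {
        private final String id;
        CustomerKey(String id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof CustomerKey && id.equals(((CustomerKey) o).id);
        }
        // missing: @Override public int hashCode() { return id.hashCode(); }
    }

    public static void main(String[] args) {
        Map<CustomerKey, String> cache = new HashMap<CustomerKey, String>();
        for (int i = 0; i < 3; i++) {
            // the "same" key is added again instead of overwriting the previous entry
            cache.put(new CustomerKey("customer-42"), "cached value " + i);
        }
        System.out.println(cache.size());                              // prints 3 instead of 1
        System.out.println(cache.get(new CustomerKey("customer-42"))); // null, the lookup fails
    }
}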
Classloader Leaks
When thinking about memory leaks we think mostly about normal Java
objects. Especially in application servers and OSGi containers there is
another form of memory leak, the class loader leak. Classes are referenced
by their classloader and normally they will not get garbage-collected until
the classloader itself is collected. That however only happens when the
application gets unloaded by the application server or OSGi container.
There are two forms of classloader leaks that I can describe off the top of
my head.
In the first, an object whose class belongs to the class loader is still referenced
by a cache, a thread local or some other means. In that case the whole
classloader, and so the whole application, cannot be garbage-collected.
This is something that happens quite a lot in OSGi containers nowadays
and used to happen in JEE application servers frequently as well. As it only
happens when the application gets unloaded or redeployed it does not
happen very often.
The second form is nastier and was introduced by bytecode manipulation
frameworks like BCEL and ASM. These frameworks allow the dynamic
creation of new classes. If you follow that thought you will realize that now
classes, just like objects, can be forgotten by the developer. The responsible
code might create new classes for the same purpose multiple times. As the
class is referenced in the current class loader you get a memory leak that
will lead to an out of memory error in the permanent generation. The real
bad news is that most heap analyzer tools do not point out this problem
either, we have to analyze it manually, the hard way. This form or memory
leak became famous due to an issue in an older version of hibernate and its
usage of CGLIB.
Summary
As we see there are many different causes for memory leaks and not all of
them are easy to detect. In my next post I will look at further Java memory
problems, so stay tuned.
Application Performance Monitoring in Production - A Step-by-Step Guide
Measuring a Distributed System
by Michael Kopp
Last time I explained logical and organizational prerequisites to successful production-level application performance monitoring. I originally wanted to look at the concrete metrics we need on every tier, but was asked how you can correlate data in a distributed environment, so this will be the first thing that we look into. So let's take a look at the technical prerequisites of successful production monitoring.
Collecting Data from a Distributed Environment
The first problem that we have is the distributed nature of most applications (an example is shown in the transaction flow diagram below). In order to isolate response time problems or errors we need to know which tier and component is responsible. The first step is to record response times on every entry and exit from a tier.
A simple transaction flow showing tier response time
The problem with this is twofold. Firstly, the externalJira tier will host multiple different services which will have different characteristics. This is why we need to measure the response time on the service level and not just on the tier level. We need to do this on both sides of the fence, otherwise we will run into an averaging problem. The second problem is that externalJira is called from several other tiers, not just one.
In a complex system average tier response times are not helpful. Tier response times need to be looked at within a transaction context as shown by this transaction flow
When we look at the second transaction flow diagram (above) we see that externalJira is called from three different tiers. These tiers sometimes call the same services on externalJira, but with vastly different parameters, which leads to different response times of externalJira. We have a double averaging problem:
different tiers calling different services on externalJira, skewing the average
different tiers calling the same service on externalJira with different parameters, skewing the average
Let's look at this in a little more detail with the following example.
In this table we see which tier entry point calls which services on other tiers. The Payment 1 service calls services 1-3 and measures the response time on its side. The Payment 2 service calls the same three services but with very different response times. When we look at the times measured on services 1-3 respectively we will see a completely different timing. We measured the response times of services 1-3 irrespective of their calling context and ended up with an average! Service 1 does not contribute 500ms to the response times of either Payment 1 or 2, but the overall average is 500ms. This average becomes more and more useless the more tiers we add. One of our biggest customers hits 30 JVMs in every single transaction. In such complex environments quick root cause isolation is nearly impossible if you only measure on a tier-by-tier basis.
In order to correlate the response times in a complex system we need to
retain the transaction context of the original caller. One way to solve this is
to trace transactions, either by using a monitoring tool that can do that or
by modifying code and building it into the application.
Correlating Data in a Distributed Environment
HTTP uses a concept called referrer that enables a webpage to know from which other pages it was called. We can use something similar and leverage it for our response time monitoring. Let's assume for the moment that the calls done in our imaginary application are all WebService HTTP calls. We can then either use the referrer tag or some custom URL query parameter to track the calling services. Once that is achieved we can track response time based on that custom property. Many monitoring tools allow you to segregate response time based on referrer or query parameters. Another possibility, as always, is to report this yourself via your own JMX Bean. If we do that we get response time measurements that are context aware.
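As a rough hand-rolled illustration of the idea (not a dynaTrace feature; the header name, URL handling and aggregation are invented for this sketch), the caller can send its own identity as an HTTP header and the callee can aggregate response times per calling service, for example to be published later via a JMX bean:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class CallerContextTagging {

    // Hypothetical header used to propagate the calling service context.
    static final String CONTEXT_HEADER = "X-Calling-Service";

    // Caller side: tag the outgoing web service call with our own service name.
    static int callService(String url, String callingService) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
        con.setRequestProperty(CONTEXT_HEADER, callingService);
        return con.getResponseCode();
    }

    // Callee side: aggregate time per calling context (could be exposed through JMX).
    static final Map<String, AtomicLong> totalTimePerCaller =
            new ConcurrentHashMap<String, AtomicLong>();

    static void recordServerSide(String callingService, long durationMillis) {
        String caller = (callingService == null) ? "unknown" : callingService;
        AtomicLong total = totalTimePerCaller.get(caller);
        if (total == null) {
            total = new AtomicLong();
            AtomicLong existing = totalTimePerCaller.putIfAbsent(caller, total);
            if (existing != null) {
                total = existing;
            }
        }
        total.addAndGet(durationMillis);
    }
}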
We now see that Service 2 only calls Service 3 when it is called directly from Payment 1, which means its contribution is far less than the original table suggested. We also still see a difference between the request and the response time of the services. This is due to the network communication involved. By measuring the response time in a context-aware way we can now also see the time that we spend in the communication layer more clearly, which enables us to isolate network bottlenecks and their impact. The average response time table did not allow us to do that.
We can push this to any level we want. E.g. we can divide the Payment
1 WebService call into its three variants supported by our shop: Visa,
MasterCard, and AMEX. If we push this as a tag/referrer down the chain
we get an even more detailed picture of where we spend time.
The problem with this approach is that it is not agnostic to your application
or the remoting technology. It requires you to change your code base, and
monitoring the different response times becomes more complicated with
every tier you add. Of course you also need to maintain this alongside the
actual application features, which increases cost and risk.
This is where professional APM tools come in. Among other things they do transaction tracing and tagging transparently without code changes. They can also split measured response times in a context-aware way; they can differentiate between an AMEX and a Visa credit card payment via Business Transactions. And finally they allow you to focus on the entry response time, in our case Payment 1 and Payment 2. In case you have a problem, you can drill down to the next level from there. So there is no need to keep an eye on all the deeper-level response times.
dynaTrace automatically traces calls across tiers (synchronous and asynchronous), captures contextual information per transaction and highlights which tiers contribute how much to the response time
Beyond Response Time
By analyzing the response time distribution across services and tiers we
can quickly isolate the offending tier/service in case we face a performance
problem. I stated before that monitoring must not only allow us to detect
a problem but also isolate the root cause. To do this we need to measure
everything that can impact the response time either directly or indirectly.
In general I like to distinguish between usage, utilization and impact measures.
Usage and Utilization Measurement
A usage measure describes how much a particular application or transaction uses a particular resource. A usage metric can usually be counted and is not time based. An exception is perhaps the best-known usage measure: CPU time. But CPU time is not really time based, it is based on CPU cycles; and there is a limited number of CPU cycles that can be executed in a specific time. We can directly measure how much CPU time is consumed by our request by looking at the thread's consumed CPU time. In addition we can measure the CPU usage on the process and system level. Most of the time we are measuring a limited resource and as such we also have a utilization measure, e.g. the CPU utilization of a system. Other examples include the number of database calls of a transaction or the connection pool usage. What is important is that the usage is a characteristic of the transaction and does not increase if performance goes down. If the specific resource is fully utilized, we have to wait for it, but we will still use the same amount we always do!
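For example, the CPU time consumed by the current request thread can be read with the standard ThreadMXBean. A minimal sketch of measuring it around a transaction (the transaction itself is just a placeholder) might look like this:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuUsagePerTransaction {

    public static void main(String[] args) {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        if (!threadBean.isCurrentThreadCpuTimeSupported()) {
            System.out.println("Thread CPU time is not supported on this JVM");
            return;
        }

        long cpuBefore = threadBean.getCurrentThreadCpuTime(); // nanoseconds
        long wallBefore = System.nanoTime();

        processTransaction(); // placeholder for the real transaction work

        long cpuMillis  = (threadBean.getCurrentThreadCpuTime() - cpuBefore) / 1000000;
        long wallMillis = (System.nanoTime() - wallBefore) / 1000000;

        // The CPU time stays roughly constant per transaction type,
        // while the wall-clock response time fluctuates with load.
        System.out.println("CPU time: " + cpuMillis + " ms, response time: " + wallMillis + " ms");
    }

    private static void processTransaction() {
        long sum = 0;
        for (int i = 0; i < 5000000; i++) {
            sum += i;
        }
        if (sum == -1) {
            System.out.println(sum); // prevents dead-code elimination
        }
    }
}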
While response time and overall CPU usage fluctuate, the average CPU usage per transaction is stable
To illustrate that we can again think about the CPU time. An AMEX credit card payment transaction will always use roughly the same amount of CPU time (unless there is a severe error in your code). If the CPU is fully utilized the response time will go up, because the transaction has to wait for CPU, but the amount of CPU time consumed will stay the same. This is what is illustrated in the chart. The same should be true if you measure the number of database statements executed per transaction, how many web services were called or how many connections were used. If a usage measure has high volatility then either you are not measuring on a granular enough business transaction level (e.g. AMEX and Visa payment may very well have different usage measures) or it is an indicator of an architectural problem within the application. This in itself is, of course, useful information for the R&D department. The attentive reader will note that caches might also lead to volatile response times, but they should only do so during the warm-up phase. If we still have high volatility after that due to the cache, then the cache configuration is not optimal.
The bottom line is that a usage measure is ideally suited to measure which
sort of transactions utilize your system and resources the most. If one of
your resources reaches 100% utilization you can use this to easily identify
which transactions or applications are the main contributors. With this
information you can plan capacity properly or change the deployment
to better distribute the load. Usage measures on a transaction level are
also the starting point for every performance optimization activity and are
therefore most important for the R&D department.
Unfortunately the very fact that makes a usage measure ideal for performance
tuning makes it unsuitable for troubleshooting scalability or performance
problems in production. If the connection pool is exhausted all the time,
we can assume that it has a negative impact on performance, but we do
not know which transactions are impacted. Turning this around means that
if you have a performance problem with a particular transaction type you
cannot automatically assume that the performance would be better if the
connection pool were not exhausted!
The response time will increase, but all your transaction usage measures
will stay the same, so how do you isolate the root cause?
Impact Measurement
In contrast to usage measures, impact measures are always time based. We measure the time that we have to wait for a specific resource or a specific request. An example is the getConnection method in the case of database connection pools. If the connection pool is exhausted, the getConnection call will wait until a connection is free. That means if we have a performance problem due to an exhausted database connection pool, we can measure that impact by measuring the getConnection method. The important point is that we can measure this inside the transaction and therefore know that it negatively impacts the AMEX, but not the Visa transactions. Another example is the execution time of a specific database statement. If the database slows down in a specific area we will see this impact on the AMEX transaction by measuring how long it has to wait for its database statements.
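If you do not use a monitoring tool, a crude way to capture this impact measure yourself is to time the getConnection() call inside the transaction. A minimal sketch, assuming the DataSource is injected and the println stands in for a real metrics channel keyed by business transaction:

import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

public class ConnectionPoolImpact {

    private final DataSource dataSource;

    public ConnectionPoolImpact(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Measures how long this particular transaction had to wait for a pooled connection.
    public Connection getConnectionTimed(String transactionName) throws SQLException {
        long start = System.nanoTime();
        try {
            return dataSource.getConnection();
        } finally {
            long waitedMillis = (System.nanoTime() - start) / 1000000;
            // In a real application this would go to a metrics registry or JMX bean,
            // keyed by the business transaction (e.g. "AMEX payment" vs. "Visa payment").
            System.out.println(transactionName + " waited " + waitedMillis + " ms for a connection");
        }
    }
}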
The database statement impacts the response time by
contributing 20%
When we take that thought further we will see that every time a transaction calls a different service, we can measure the impact that the external service has by measuring its response time at the calling point. This closes the circle back to our tier response times and explains why we need to measure them in a context-aware fashion. If we only measured the overall average response time of that external service we would never know the impact it has on our transaction.
This brings us to a problem that we have with impact measurement on the
system level and in general.
Lack of Transaction Context
In an application server we can measure the utilization of the connection pool, which tells us if there is a resource shortage. We can measure the average wait time and/or the average number of threads waiting on a connection from the pool, which (similar to the Load Average) tells us that the resource shortage does indeed have an impact on our application. But both the usage and the impact measure lack transaction context. We can correlate the measurements on a time basis if we know which transactions use which resources, but we will have to live with a level of uncertainty. This forces us to do guesswork in case we have a performance problem. Instead of letting us zero in on the root cause quickly and directly, this guesswork is the main reason that troubleshooting performance problems takes a long time and lots of experts. The only way to avoid that guesswork is to measure the impact directly in the transaction, either by modifying the code or by using a monitoring tool that leverages byte-code injection and provides transaction context.
Of course there are some things that just cannot be measured directly. The CPU again is a good example. We cannot directly measure waiting for a CPU to be assigned to us, at least not easily. So how do we tackle this? We measure the usage, utilization and indirect impact. In the case of the CPU the indirect impact is measured via the load average. If the load average indicates that our process is indeed waiting for CPU we need to rule out any other cause for the performance problem. We do this by measuring the usage and impact of all other resources and services used by the application. To paraphrase Sherlock Holmes: if all directly measured root causes can be ruled out, then the only logical conclusion is that the performance problem is caused by whatever resource cannot be directly measured. In other words, if nothing else can explain the increased response time, you can be sufficiently certain that CPU exhaustion is the root cause.
What About Log Files?
As a last item I want to look at how to store and report the monitoring data. I was asked before whether log files are an option. The idea was to change the code and measure the application from inside (as hinted at several times by me) and write this to a log file. The answer is a definitive NO; log files are not a good option.
The first problem is the distributed nature of most applications. At the beginning I explained how to measure distributed transactions. It becomes clear that while you can write all this information to a log file periodically, the result will be of little use, because you would have to retrieve all the log files, correlate the distributed transactions manually and on top of that correlate them with system and application server metrics taken during that time. While doable, it is a nightmare if we are talking about more than two tiers.
Trying to manually correlate log files from all the involved servers and databases is nearly impossible in bigger systems. You need a tool that automates this.
The second problem is lack of context. If you only write averages to the log file you will quickly run into the averaging problem. One can of course refine this endlessly, but it will take a long time to reach a satisfactory level of granularity and you will have to maintain this in addition to application functionality, which is what should really matter to you. On the other hand, if you write the measured data for every transaction you will never be able to correlate the data without tremendous effort and will also have to face a third problem.
Both logging all of the measures and aggregating the measures before
logging them will lead to overhead which will have a negative impact
on performance. On the other hand if you only turn on this performance
logging in case you already have a problem, we are not talking about
monitoring anymore. You will not be able to isolate the cause of a problem
that has already occurred until it happens again.
The same is true if you automate this and, for example, automatically start capturing data once you realize something is wrong. It sounds intuitive, but it really means that you already missed the original root cause of why it is slow.
On the other hand log messages often provide valuable error or warning information that is needed to pinpoint problems quickly. The solution is to capture log messages the same way that we measure response times and execution counts: within the transaction itself. This way we get the valuable log information within the transaction context and do not have to correlate dozens of log files manually.
The dynaTrace PurePath includes log messages and
exceptions in the context of a single transaction
In addition, a viable monitoring solution must store, aggregate and automatically correlate all retrieved measurements outside the monitored application. It must store them permanently, or at least for some time. This way you can analyze the data after the problem happened and do not have to actively wait until it happens again.
Conclusion
By now it should be clear why we need to measure everything that we can
in the context of the calling transaction. By doing this we can create an
accurate picture of what is going on. It enables us to rule out possible root
causes for a problem and zero in on the real cause quickly. It also enables
us to identify resource and capacity issues on an application and service
level instead of just the server level. This is equally important for capacity
planning and cost accounting.
As a next step we will look at the exact metrics we need to measure in each tier and how to interpret and correlate them to our transaction response time.
Tips for Creating Stable Functional Web Tests
to Compare across Test Runs and Browsers
by Andreas Grabner
Over the last week I created stable functional tests for a new eCommerce application. We picked several use cases, e.g. clicking through the different links, logging in, searching for products and actually buying a product. We needed functional tests that run on both Internet Explorer and Firefox. With these tests we want to make sure to automatically find any functional problems but also performance and architectural problems (e.g. too many JavaScript files on the site, too many exceptions on the server or too many database statements executed for a certain test scenario). We also want to find problems that happen on certain browsers, which is why we ran the tests on the two major browsers.
Test Framework: Selenium WebDriver
As the test framework I decided to pick Selenium WebDriver and downloaded the latest version. I thought it would make it easier to write tests that work in a similar way on both browsers. I learned several lessons:
1. When you write a script, always test it immediately on both browsers
2. Use a page object approach when developing your scripts. With that you keep the actual implementation separated from the test cases (you
will see my test scripts later in this blog; it will make more sense when you see it)
3. Be aware of different behaviors of IE and FF
4. Make sure your test code can deal with unexpected timings or error
situations
What a Test Script Should Look Like (Slick and Easy
to Understand)
Here is a screenshot of one of my test cases.
Selenium test using a page object pattern that facilitates writing easy to
understand test cases
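Since the screenshot cannot be reproduced here, the following sketch shows roughly what such a page-object based test could look like; the page object, element locators and URL are invented for illustration and are not the original test code:

import org.junit.Test;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import static org.junit.Assert.assertTrue;

public class SearchTest {

    // Minimal page object: the test case only talks to this class,
    // never to WebDriver locators directly.
    static class SearchPage {
        private final WebDriver driver;
        SearchPage(WebDriver driver) { this.driver = driver; }

        SearchPage open(String url) {
            driver.get(url);
            return this;
        }

        SearchPage searchFor(String term) {
            driver.findElement(By.name("q")).sendKeys(term);
            driver.findElement(By.name("btnSearch")).click();
            return this;
        }

        boolean hasResults() {
            return !driver.findElements(By.cssSelector(".result")).isEmpty();
        }
    }

    @Test
    public void searchReturnsResults() {
        WebDriver driver = new FirefoxDriver();
        try {
            SearchPage page = new SearchPage(driver).open("http://myshop.example.com");
            assertTrue(page.searchFor("camera").hasResults());
        } finally {
            driver.quit();
        }
    }
}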
Common Functionality in PageObjectBase
I put lots of helper methods in a base class that I called PageObjectBase. As WebDriver won't wait for certain objects or for the page to be loaded (or at least I haven't found that functionality), I created my own waitFor methods to wait until certain objects are on the page. This allows me to verify whether my app made it to the next stage or not. Here is another screenshot of one of my helper methods. You can see that I had to work around a certain limitation I came across in IE: it seems that By.linkText doesn't work, and the same is true for most of the other lookup methods in By. What worked well for me is By.xpath, with the only limitation that certain methods such as contains() don't work on Firefox. As you can see, there are lots of things to consider; unfortunately not everything works the same way on every browser.
Helper methods in my PageObjectBase class
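A hand-rolled waitFor helper along those lines might look like the following sketch (timeout handling and the XPath workaround are simplified and only meant to illustrate the idea):

import org.openqa.selenium.By;
import org.openqa.selenium.NoSuchElementException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

public abstract class PageObjectBase {

    protected final WebDriver driver;

    protected PageObjectBase(WebDriver driver) {
        this.driver = driver;
    }

    // Polls the page until the element shows up or the timeout is reached.
    protected WebElement waitForElement(By locator, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            try {
                return driver.findElement(locator);
            } catch (NoSuchElementException notThereYet) {
                try {
                    Thread.sleep(250); // poll interval
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw new IllegalStateException("Timed out waiting for element: " + locator);
    }

    // Example of an XPath-based lookup that behaved most consistently across browsers.
    protected WebElement waitForLinkWithText(String text, long timeoutMillis) {
        return waitForElement(By.xpath("//a[text()='" + text + "']"), timeoutMillis);
    }
}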
Easy to Switch Browsers
My test classes create the WebDriver runner. Here I also created a base class that, depending on a system property I can set from my Ant script, instantiates the correct WebDriver implementation (IE or FF). This base class also checks whether dynaTrace will be used to collect performance data. If that's the case it creates a dynaTrace object that I can use to pass test and test step names to dynaTrace. This makes it easier to analyze performance data later on (more on this later in this article).
Base object that makes sure we have the correct WebDriver and that dynaTrace
is properly set up
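A sketch of such a base class, with a hypothetical browser system property (the dynaTrace setup is omitted here), could look like this:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.ie.InternetExplorerDriver;

public abstract class TestBase {

    protected WebDriver driver;

    // The "browser" property would be set from the Ant script,
    // e.g. -Dbrowser=ie or -Dbrowser=firefox (property name chosen for this sketch).
    protected void createDriver() {
        String browser = System.getProperty("browser", "firefox");
        if ("ie".equalsIgnoreCase(browser)) {
            driver = new InternetExplorerDriver();
        } else {
            driver = new FirefoxDriver();
        }
    }

    protected void quitDriver() {
        if (driver != null) {
            driver.quit();
        }
    }
}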
Analyzing Tests across Test Runs
As recently blogged, dynaTrace offers Premium Extensions to our free dynaTrace AJAX Edition. These extensions not only allow us to collect performance data automatically from Internet Explorer or Firefox, they also automatically analyze certain key metrics per test case. Metrics can be the number of resources downloaded, the time spent in JavaScript, the number of redirects or the number of database queries executed on the application server.
Identify Client Side Regressions across Builds
I have access to different builds. Against every build I run my Selenium tests and then verify the Selenium results (succeeded, failed, errors) and the numbers I get from dynaTrace (#roundtrips, time in JS, #database statements, #exceptions, and so on). With one particular build I still got all successful Selenium test executions but got a notification from dynaTrace that some values were outside of the expected value range. The following screenshot shows some of these metrics that triggered an alert:
JavaScript errors, number of resources and number of server-side exceptions
show a big spike starting with a certain build
A double click on one of the metrics of the build that has this changed
behavior opens a comparison view of this particular test case. It compares
it with the previous test run where the numbers were OK:
The Timeline makes it easy to spot the difference visually. Seems we have many
more network downloads and JavaScript executions
A side-by-side comparison of the network requests is also automatically
opened showing me the differences in downloaded network resources. It
seems that a developer added a new version of jQuery including a long list
of jQuery plugins.
If these libraries are really required we need to at least consider consolidating the jQuery library and using a minified version of these plugins
Now we know why we have so many more resources on the page. Best practices recommend that we merge all CSS and JavaScript files into a single file and deploy a minified version of it instead of deploying all these files individually. The JavaScript errors that were thrown were caused by incompatibility between the multiple versions of jQuery. So even though the Selenium test was still successful we have several problems with this build that we can now start to address.
Identify Server-Side Regressions across Builds
Even though more and more logic gets executed in the browser we still need to look at the application executed on the application server. The following screenshot shows another test case with a dramatic growth in database statements (from 1 to more than 9000). Looks like another regression.
Database executions exploded from 1 to more than 9000 for this particular test
case
The drill down to compare the results of the problematic build with the previous one works in the same way. Double-click the measure and we get to a comparison dashboard. This time we are interested in the database statements. It seems it is one statement that got called several thousand times.
This database statement was called several thousand
times more often than in the previous test runs
When we want to know who executed these statements and why they weren't executed in the build before, we can open the PurePath comparison dashlet. The PurePath represents the actual transactional trace that dynaTrace captured for every request of every test run. As we want to focus on this particular database statement we can drill from here to this comparison view and see where it's been called.
Comparing the same transaction in both builds. A code change caused the call
of allJourneys as compared to getJourneyById
Analyzing Tests across Browsers
In the same way as comparing results across test runs or builds it is possible to compare tests against different browsers. It is interesting to see how applications behave differently in different browsers. But it is also interesting to identify regressions on individual browsers and compare these results with the browser that doesn't show the regressions. The following screenshot shows the comparison of browser metrics taken from the same test executed against Internet Explorer and Firefox. It seems that for IE we have 4 more resources that get downloaded:
Not only compare metrics across test runs but also compare metrics across
browsers
From here we can go on in the same way as I showed above. Drill into
the detailed comparison, e.g.: Timeline, Network, JavaScript or Server-Side
execution and analyze the different behavior.
Want more?
Whether you use Selenium, WebDriver, QTP, Silk, dynaTrace, YSlow, PageSpeed or ShowSlow, I imagine you are interested in testing and want to automate things. Check out my recent blogs such as those on Testing Web 2.0 Applications, Why You Can't Compare Execution Times Across Browsers or dynaTrace AJAX Premium.
How to do Security Testing with Business Transactions - Guest Blog by Lucy Monahan from Novell
by Andreas Grabner
Lucy Monahan is a Principal Performance QA Engineer at Novell, and helps to
manage their distributed Agile process.
One of the most important features of an application is to provide adequate security and protect the secrets held within. Business Transactions used with continuous integration, unit, feature and negative testing specialized for security can detect known security vulnerabilities before your product goes to market.
Catch Secrets Written to Logging
Plaintext secrets written to a log file are a well-known vulnerability. Once an intruder gains access to a hard disk they can easily comb through log files to further exploit the system. Here a Business Transaction is used to search logging output to look for secrets.
Application data slips into logging in a variety of ways: lack of communication regarding which data is a secret, lingering early code, perhaps from before a comprehensive logging strategy was implemented, old debug
statements or perhaps inadvertent inclusion via localization processing.
It's a good idea to grep log files after your test run, but that will not cover output to the server console, which may contain different content. For example, users may start the application using nohup, which may write terminal logging to nohup.out. And starting an application with redirection of stdout and stderr will persist console output to a file, such as:
startserver.sh > /opt/logs/mylogfile.log 2>&1
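A quick way to scan both regular log files and redirected console output for a known test secret after a run might be the following (the paths and the secret value are placeholders):
grep -n "testuser" /opt/logs/*.log /opt/logs/nohup.out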
Use Business Transactions to search logging content during your testing
and trap secrets before they are revealed! And be sure to enable a high
logging level since trace level may contain secrets that info level does not.
This Business Transaction uses two Measures for two of the Log4J org.apache.log4j.spi.LoggingEvent class constructors. Other methods that can be trapped include those from proprietary logging classes or classes associated with the auditing channel or retrieval of localized messages.
These Measures simply search the constructor's message argument for the word testuser for purposes of demonstration:
Create the two argument measures for the two different LoggingEvent constructors
The Business Transaction Filter defines both measures with an OR operator to activate the Business Transaction:
The Business Transaction will filter transactions that contain a log message for testuser
We want to know about any instance of a secret being logged, thus the threshold Upper Severe is set to 1.0 occurrence. Each of the Business Transactions outlined here uses this threshold.
The threshold on the Business Transaction can later be used for
automatic alerting
When the secret is written to logging the Business Transaction is
activated:
The Business Transaction dashlet shows us how many log messages have been
written that include our secret text
The displayed columns in the Business Transaction can be customized to show counts of each Measure that matched the filter.
TIP: Running this Business Transaction against Install and Uninstall programs
is highly recommended since a lot of secret data is requested and used during
these processes.
From the Business Transaction dashlet we can drill down to the actual transactional trace (PurePath) and see where this secret was actually logged. The PurePath contains additional contextual information such as HTTP parameters, method arguments, exceptions, log messages, database statements, remoting calls, and so on.
The actual transaction that contains our captured secret. The dynaTrace
PurePath contains lots of contextual information valuable for developers.
Catch Exception Messages Containing Secrets
If your application prints exceptions and their messages then you need to catch any secrets embedded within them. This Business Transaction uses a measure to obtain the return value of java.lang.Throwable.getLocalizedMessage(). Depending on your application's architecture a measure for java.lang.Throwable.getMessage() may also be required.
With the Business Transaction defined, perform a negative-testing run that intentionally throws exceptions.
The measure in this example simply searches for the word authentication in the getLocalizedMessage() return value for demonstration purposes:
Argument measure that counts the occurrences of authentication in the return
value of getLocalizedMessage
Business Transaction that filters based on the captured secret in getLocalizedMessage
Business Transaction dashlet showing us instances where authentication was part of the localized message
Catch Non-SSL Protocols
Any non-SSL traffic over the network has the potential of exposing secrets. Most applications requiring high security will disallow all non-SSL traffic. Your application needs to be tested to ensure that all requests are being made over SSL. Non-SSL connections can inadvertently be opened and used when different components within an application handle their own connections, or when an oversight in installation or configuration allows such connections.
Unless you include an explicit check for non-SSL connections, they may be opened and used stealthily. One way to ensure that only SSL connections are being used is to trap any non-SSL connections being opened.
This Business Transaction uses a measure to obtain the string value of the first argument to the constructor for javax.naming.ldap.InitialLdapContext. The first argument is a hashtable and when SSL is enabled one of the entries contains the text protocol=ssl.
It's worth noting that the presence of the protocol=ssl value is particular to the JVM implementation being used. Indeed, when SSL is not enabled the protocol value is omitted, hence the use of the "not contains" operator for this measure. Use a method sensor to capture the value being used in your JVM implementation to confirm the text token for your application environment.
Measure that evaluates whether SSL is not passed as argument
Business Transaction that filters those transactions that do not pass SSL as protocol argument
Business Transaction dashlet shows the transactions that do not use SSL as parameter
What other protocols are being used in your application? You can write a
similar Business Transaction to ensure that only SSL is being used for that
protocol.
Catch Secrets Written to Print Statements
The above examples catch secrets written to exceptions and logging. But a common debug technique is to use a print statement, such as System.out.println() in Java, rather than a logger function. A Business Transaction that catches print statements is recommended for the test suite.
Having said that, in Java java.lang.System.out.println() is not accessible in the usual way because out is a field in the System class. An alternative approach may be to use a Business Transaction with a measure based on Classname value to trap all calls to java.lang.System. All related PurePaths for this Business Transaction would then merit review to assess security risks. This approach may or may not be feasible depending on how often your application calls java.lang.System. Feasibility testing of this approach would thus be necessary.
The test application used here does not call java.lang.System so an example
is not included. In any case, a companion test that performs a full text
search of the source code tree for calls to java.lang.System.out.println() is
highly recommended.
Summary
For each Business Transaction the process is:
Define which entities within your application represent secrets (e.g. passwords, account numbers)
Define which classes and methods will be used for the Measure definition
Create a new Business Transaction using the Measure definition
Ensure that the input data used in the test contains many secrets
Include these types of Business Transactions in continuous integration, unit, feature and negative testing that contain a diversity of secrets. These Business Transactions are not intended for load testing, however, since load testing may simply contain many instances of the same secret rather than a diversity of secrets, and the result will flood your result set.
Security-related issues released into the field are costly. The customer's sensitive data is at risk, and the expense of security field patches and the damage done to the corporation's reputation are high. Understanding your application's architecture will help identify possible vulnerabilities and enable you to grow a suite of Business Transactions designed specifically to trap secrets and protect your application's security.
Follow-up Links for dynaTrace Users
If you are a dynaTrace user you can check out the following links on our Community Portal that explain more about how dynaTrace does Business Transaction Management, how to improve your Continuous Integration process and how to integrate dynaTrace into your testing environment.
Week 18
Field Report - Application Performance Management in WebSphere Environments
by Andreas Grabner
Just in time for our webinar with The Bon-Ton Stores, where we talked about the challenges in operating complex WebSphere environments, we had another set of prospects running their applications on WebSphere. Francis Cordon, a colleague of mine, shares some of the sample data resulting from these engagements.
In this article I want to highlight important areas when managing performance in WebSphere environments. This includes WebSphere health monitoring, end-to-end performance analysis, performance and business impact analysis as well as WebSphere memory analysis and management.
WebSphere Health Monitoring
WebSphere application servers provide many different metrics that we
need to consider when monitoring server health. This includes system
metrics such as memory and CPU utilization. It also includes transaction
response times and connection pool usage. The following screenshot
shows a dashboard that gives a good overview of the general health of a
WebSphere server:
Monitoring WebSphere server health including memory, CPU, response times,
connection pools and thread information
From a very high-level perspective we can look at overall response times
but also at response times of individual services. The following illustration
shows a dashboard that visualizes response times and whether we have any
SLA Violations on any of our monitored service tiers:
Easy to spot whether we have any SLA breaches on any of our tiers
The following dashboard provides an extended, in-depth view. Not only does it show response times and memory usage, it also shows which layers of the application contribute to the overall performance and provides an additional overview of problematic SQL statements or exceptions:
A more in-depth WebSphere health monitor dashboard including layer
performance breakdown, database and exception activity
End-to-End Performance View
The transaction flow dashlet visualizes how transactions flow through the WebSphere environment. We can look at all transactions, certain business transactions (e.g. product searches, check-outs, logins, and many others) or can pick individual ones from specific users. From this high-level flow we can drill down to explore more technical details to understand where time is spent or where errors happen.
The following screenshot shows how to drill into the details of those transactions that cross through a specific WebSphere server node. For every transaction we get to see the full execution trace (PurePath) that contains contextual information such as executed SQL statements, exceptions, log messages, executed methods including arguments, etc.
Drill into the transactions flowing through WebSphere. Each individual transaction contains contextual information and provides the option to look up offending source code
If we want to focus on database activity we simply drill down into the
database details. Database activity is captured from within the application
server including SQL statements, bind variables and execution times. The
following 3 illustrations show different ways to analyze database activity
executed by our WebSphere transactions.
Analyze all queries including bind values executed by our WebSphere application.
Identify slow ones or those that are executed very often
We can pick an individual database statement to see which transaction
made the call and how it impacts the performance of this transaction:
Identify the impact of a database query on its transaction. In this case a stored
procedure is not returning the expected result and throws an exception
It is not enough to look at the actual transaction and its database statements. We also monitor performance metrics exposed by the database - in this case it's an Oracle database instance. dynaTrace users can download the Oracle Monitor Plugin from our Community Portal. You can also read the article on How to Monitor Oracle Database Performance.
Analyze the activity on the database by monitoring Oracle's performance metrics and correlate it to our transactional data
Business Impact Analysis
As many different end users access applications running on WebSphere, it is important to identify problems that impact all users but also problems that just impact individual users. The following illustration shows how Business Transactions allow us to analyze individual users, and from there dig deeper into the root cause of their individual performance problems.
Analyze the performance impact for individual users using Business Transactions
Analyzing Memory Usage and Memory Leaks
Memory management and analysis can be hard if you don't know what to look out for. Read our blogs on Top Memory Leaks in Java, Memory Leak Detection in Production or the Impact of Garbage Collection on Performance to make yourself familiar with the topic.
The following screenshots show how to analyze memory usage in WebSphere and how to track potential memory leaks by following object reference paths of identified memory hotspots:
We start by analyzing our memory usage
Identify hotspots and then identify the root cause of memory leaks
Final Words
Thanks again to Francis for sharing his experience with us. Existing dynaTrace customers, please check out the content we have on our Community Portal.
For more information, download the recorded version of our webinar with
The Bon-Ton Stores.
Week 19
How Garbage Collection Differs in the Three
Big JVMs
by Michael Kopp
Most articles about garbage collection (GC) ignore the fact that the Sun Hotspot JVM is not the only game in town. In fact, whenever you have to work with either IBM WebSphere or Oracle WebLogic you will run on a different runtime. While the concept of garbage collection is the same, the implementation is not, and neither are the default settings or how to tune it. This often leads to unexpected problems when running the first load tests or, in the worst case, when going live. So let's look at the different JVMs, what makes them unique and how to ensure that garbage collection is running smoothly.
The Garbage Collection Ergonomics of the Sun Hotspot
JVM
Everybody believes they know how garbage collection works in the Sun Hotspot JVM, but let's take a closer look for the purpose of reference.
The memory model of the Sun Hotspot JVM
The Generational Heap
The Hotspot JVM always uses a generational heap. Objects are first allocated in the young generation, specifically in the Eden area. Whenever the Eden space is full a young generation garbage collection is triggered. This will copy the few remaining live objects into the empty Survivor space. In addition objects that have been copied to Survivor in the previous garbage collection will be checked and the live ones will be copied as well. The result is that objects only exist in one Survivor, while Eden and the other Survivor are empty. This form of garbage collection is called copy collection. It is fast as long as nearly all objects have died. In addition allocation is always fast because no fragmentation occurs. Objects that survive a couple of garbage collections are considered old and are promoted into the Tenured/old space.
Tenured Generation GCs
The mark and sweep algorithms used in the Tenured space are different because they do not copy objects. As we have seen in one of my previous posts, garbage collection takes longer the more objects are alive. Consequently GC runs in Tenured are nearly always expensive, which is why we want to avoid them. In order to avoid GCs we need to ensure that objects are only copied from young to old when they are permanent, and in addition ensure that the Tenured space does not run full. Therefore generation sizing is the single most important optimization for the GC in the Hotspot JVM. If we cannot prevent objects from being copied to the Tenured space once in a while, we can use the concurrent mark and sweep algorithm, which collects objects concurrently with the application.
Comparison of the different garbage collector strategies
While that shortens the suspensions it does not prevent them and they
will occur more frequently. The Tenured space also suffers from another
problem: fragmentation. Fragmentation leads to slower allocation, longer
sweep phases and eventually out of memory errors when the holes get too
small for big objects.
Java heap before and after compacting
This is remedied by a compacting phase. The serial and parallel compacting GCs perform compaction for every GC run in the Tenured space. It is important to note that, while the parallel GC performs compaction every time, it does not compact the whole Tenured heap but just the area that is worth the effort. By worth the effort I mean an area where the heap has reached a certain level of fragmentation. In contrast, the concurrent mark and sweep does not compact at all. Once objects cannot be allocated anymore, a serial major GC is triggered. When choosing the concurrent mark and sweep strategy we have to be aware of that side effect.
The second big tuning option is therefore the choice of the right GC
strategy. It has big implications for the impact the GC has on the
application performance. The last and least known tuning option is around
fragmentation and compacting. The Hotspot JVM does not provide a lot of
options to tune it, so the only way is to tune the code directly and reduce
the number of allocations.
There is another space in the Hotspot JVM that we have all come to love over the years: the Permanent Generation. It holds classes and string constants that are part of those classes. While garbage collection is executed in the permanent generation, it only happens during a major GC. You might want to read up on what a major GC actually is, as it does not mean an Old Generation GC. Because a major GC does not happen often and mostly nothing happens in the permanent generation, many people think that the Hotspot JVM does not do garbage collection there at all.
Over the years all of us have run into many different forms of OutOfMemory situations in PermGen, and you will be happy to hear that Oracle intends to do away with it in future versions of Hotspot.
Oracle JRockit
Now that we have had a look at Hotspot, let us look at the differences in Oracle JRockit. JRockit is used by Oracle WebLogic Server and Oracle has announced that it will merge it with the Hotspot JVM in the future.
Heap Strategy
The biggest difference is the heap strategy itself. While Oracle JRockit does
have a generational heap it also supports a so-called continuous heap. In
addition the generational heap looks different as well.
Heap of the Oracle JRockit JVM
The young space is called Nursery and it only has two areas. When objects are first allocated they are placed in a so-called Keep Area. Objects in the Keep Area are not considered during garbage collection while all other objects still alive are immediately promoted to tenured. That has major implications for the sizing of the Nursery. While you can configure how often objects are copied between the two survivors in the Hotspot JVM, JRockit promotes objects in the second young generation GC.
In addition to this difference JRockit also supports a completely continuous heap that does not distinguish between young and old objects. In certain situations, like throughput-oriented batch jobs, this results in better overall performance. The problem is that this is the default setting on a server JVM and often not the right choice. A typical web application is not throughput but response time oriented, and you will need to explicitly choose the low pause time garbage collection mode or a generational garbage collection strategy.
Mostly Concurrent Mark and Sweep
If you choose the concurrent mark and sweep strategy you should be aware of a couple of differences here as well. The mostly concurrent mark phase is divided into four parts:
Initial marking, where the root set of live objects is identified. This is done while the Java threads are paused.
Concurrent marking, where the references from the root set are followed in order to find and mark the rest of the live objects in the heap. This is done while the Java threads are running.
Precleaning, where changes in the heap during the concurrent mark phase are identified and any additional live objects are found and marked. This is done while the Java threads are running.
Final marking, where changes during the precleaning phase are identified and any additional live objects are found and marked. This is done while the Java threads are paused.
The sweeping is also done concurrently with your application, but in contrast to Hotspot in two separate steps. It first sweeps the first half of the heap. During this phase threads are allowed to allocate objects in the second half. After a short synchronization pause the second half is swept. This is followed by another short final synchronization pause. The JRockit algorithm therefore stops more often than the Sun Hotspot JVM, but the remark phase should be shorter. Unlike the Hotspot JVM, you can tune the CMS by defining the percentage of free memory that triggers a GC run.
Compacting
JRockit does compacting for all Tenured Generation GCs, including the concurrent mark and sweep. It does so in an incremental mode for portions of the heap. You can tune this with various options like the percentage of the heap that should be compacted each time or how many objects are compacted at most. In addition you can turn off compacting completely or force a full compaction for every GC. This means that compacting is a lot more tunable in JRockit than in the Hotspot JVM, and the optimum depends very much on the application itself and needs to be carefully tested.
Thread Local Allocation
Hotspot does use thread local allocation (TLA), but it is hard to find anything in the documentation about it or how to tune it. JRockit uses it by default. This allows threads to allocate objects without any need for synchronization, which is beneficial for allocation speed. The size of a TLA can be configured, and a large TLA can be beneficial for applications where multiple threads allocate a lot of objects. On the other hand, too large a TLA can lead to more fragmentation. As a TLA is used exclusively by one thread, the size is naturally limited by the number of threads. Thus both decreasing and increasing the default can be good or bad, depending on your application's architecture.
Large and small objects
JRockit differentiates between large and small objects during allocation. The limit for when an object is considered large depends on the JVM version, the heap size, the garbage collection strategy and the platform used. It is usually somewhere between 2 and 128 KB. Large objects are allocated outside the thread local area and, in the case of a generational heap, directly in the old generation. This makes a lot of sense when you start thinking about it. The young generation uses a copy collection. At some point copying an object becomes more expensive than traversing it in every garbage collection.
No permanent Generation
And finally it needs to be noted that JRockit does not have a permanent generation. All classes and string constants are allocated within the normal heap area. While that makes life easier on the configuration front, it means that classes can be garbage collected immediately if not used anymore. In one of my future posts I will illustrate how this can lead to some hard-to-find performance problems.
The IBM JVM
The IBM JVM shares a lot of characteristics with JRockit: the default heap is a continuous one. Especially in WebSphere installations this is often the initial cause of bad performance. It differentiates between large and small objects with the same implications and uses thread local allocation by default. It also does not have a permanent generation, but while the IBM JVM also supports a generational heap model, that model looks more like Sun's than JRockit's.
The IBM JVM generational heap
Allocate and Survivor act like Eden and Survivor in the Sun JVM. New objects are allocated in one area and copied to the other on garbage collection. In contrast to JRockit the two areas are switched upon GC. This means that an object is copied multiple times between the two areas before it gets promoted to Tenured. Like JRockit, the IBM JVM has more options to tune the compaction phase. You can turn it off or force it to happen for every GC. In contrast to JRockit, the default triggers compaction based on a series of triggers but will then lead to a full compaction. This can be changed to an incremental one via a configuration flag.
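To make the differences concrete, here are examples of how the garbage collection strategy is typically selected on each of the three JVMs. The flag names below are the commonly documented ones for the JVM generations discussed here; treat them as a sketch and verify them against your vendor's documentation for your exact version:

Sun/Oracle Hotspot:
  java -XX:+UseParallelOldGC ...      (parallel collection, throughput oriented)
  java -XX:+UseConcMarkSweepGC ...    (concurrent mark and sweep, low pause times)

Oracle JRockit:
  java -Xgc:genpar ...                (generational, parallel/throughput)
  java -Xgc:gencon ...                (generational, concurrent/low pause)

IBM JVM:
  java -Xgcpolicy:optthruput ...      (continuous heap, default)
  java -Xgcpolicy:gencon ...          (generational, concurrent)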
Conclusion
We see that while the three JVMs are essentially trying to achieve the same goal, they do so via different strategies. This leads to different behavior that needs tuning. With Java 7 Oracle will finally declare the G1 (Garbage First) collector production ready, and the G1 is a different beast altogether, so stay tuned.
If you're interested in hearing me discuss more about WebSphere in a production environment, then check out our webinar with The Bon-Ton Stores. I'll be joined by Dan Gerard, VP of Technical & Web Services at Bon-Ton, to discuss the challenges they've overcome in operating a complex WebSphere production eCommerce site to deliver great web application performance and user experience. Watch it now to hear me go into more detail about WebSphere and production eCommerce environments.
Why Object Caches Need to be Memory-sensitive - Guest Blog by Christopher André
by Michael Kopp
Christopher André is an Enablement Service Consultant at dynaTrace and helps our customers maximize the value they get out of dynaTrace.
The other day, I went to a customer who was experiencing a problem that happens quite frequently: he had a cache that was constantly growing, leading to OutOfMemory errors. Other problems in the application seemed linked to it. Analyzing and finding the root cause of this memory-related problem triggered me to write this blog on why they ran into OutOfMemory errors despite having a properly configured cache.
He was trying to cache the results of database selects, so he wouldn't have to execute them multiple times. This is generally a good idea, but most Java developers don't really know how to do this right and forget about the growing size of their caches.
How can we have memory problems in Java?
Quite often, I hear Java developers saying "I can't have memory problems; the JVM is taking care of everything for me." While the JVM's memory handling is great, this does not mean we don't have to think about it at all. Even if we do not make any obvious mistakes we sometimes have to help
the JVM manage memory efficiently.
The behavior of garbage collection (GC) has been explained in several blogs. It will reclaim the memory of all objects that can't be reached from so-called GC Roots (GC Roots are objects that are assumed to be always reachable). Problems often happen when an object creates many references to different objects and the developer forgets to release them.
Java object references and GC roots
Cache systems
A cache is, to simplify to the extreme, a Map. You want to remember a particular object and associate it with an identifier. Because we don't have an endless supply of memory, there are specific algorithms dedicated to evicting entries that are no longer needed from this cache. Let's have a quick look at some of them:
Least Recently Used (LRU)
In this algorithm the cache maintains an access timestamp for every
entry. Whenever it is triggered to remove entries, due to additional
size limitations, it will evict those with the oldest access timestamp
rst.
Timed LRU
This is a special form of LRU that evicts items based on a specic
not-used timeframe instead of a size limitation. It is one of the
most frequently used algorithms for database caches and was used
in my customers case.
Least Frequently Used (LFU)
This algorithm tracks the number of times an entry has been
accessed. When the cache tries to evict some of its entries, it
removes those that are accessed least often.
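Here is the sketch referenced above: a minimal size-bounded LRU cache built on java.util.LinkedHashMap. It is only an illustration of the algorithm, not the implementation my customer used.

import java.util.LinkedHashMap;
import java.util.Map;

// Size-bounded LRU cache: with accessOrder = true, LinkedHashMap keeps entries
// in access order, and removeEldestEntry() evicts the least recently used entry
// once the configured capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {

    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true -> iteration order is LRU order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the oldest entry when we grow past the limit
    }
}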
Despite the use of the timed LRU algorithm, my customer faced the problem that the number of referenced objects grew too big. The memory used by these objects could not be reclaimed by the JVM because they were still hard-referenced by the cache. In his case the root cause was not an inappropriately sized cache or a bad eviction policy. The problem was that the cached objects were too big and occupied too much memory. The cache had not yet evicted these objects based on the timed LRU algorithm, and therefore the GC could not reclaim them to free up memory.
Solving this with Memory-sensitive Standard
Java Mechanisms
The problem caused by hard references was addressed by the Java standard library early on (version 1.2) with so-called reference objects. We'll only focus on one of them: soft references.
SoftReference is a class that was created explicitly for the purpose of being used with memory-sensitive caches. A soft reference will, according to the official javadoc, be "cleared at the discretion of the garbage collector in response to memory demand." In other words, the reference is kept as long as there is no need for more memory and can potentially be cleared if the JVM needs more memory. While the specification states that this can happen at any time, the implementations I know only do this to prevent an OutOfMemoryError. When the garbage collector cannot free enough memory, it will clear all SoftReferences and then run another garbage collection before throwing an OutOfMemoryError. If the SoftReference is the only thing that keeps an object tree alive, the whole tree can be collected. In my customer's case, it would mean that the cache would be flushed before an OutOfMemoryError would be triggered, preventing it from happening and making it a perfect fail-safe option for his cache.
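A minimal sketch of that behavior (the 10 MB byte array simply stands in for a large cached query result):

import java.lang.ref.SoftReference;

public class SoftReferenceDemo {
    public static void main(String[] args) {
        // Wrap a large value in a SoftReference. As long as memory is plentiful,
        // get() returns the value; under memory pressure the GC may clear the
        // reference, and get() then returns null.
        SoftReference<byte[]> ref = new SoftReference<byte[]>(new byte[10 * 1024 * 1024]);

        byte[] value = ref.get();
        if (value == null) {
            // The GC cleared the reference to avoid running out of memory;
            // a cache would reload the value from its source at this point.
            value = new byte[10 * 1024 * 1024];
            ref = new SoftReference<byte[]>(value);
        }
        System.out.println("cached payload size: " + value.length + " bytes");
    }
}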
Side Effects
While Soft References have their uses, nothing is without side effects.
Garbage Collection and Memory Demand
First of all: a memory-sensitive cache often has the effect that people size it too generously. The assumption is that because the cache can be cleared on memory demand, we should use all available memory for the cache, as it will not pose a problem. The cache will keep growing until it fills a large portion of the memory. As we have learned before, this leads to slower and slower garbage collections because of the many objects to check. At some point the GC will flush the cache and everything will be peachy again, right? Not really: the cache will grow again. This is often mistaken for a memory leak.
Because many heap analyzers treat soft references in a special way, it is not easily found.
Additionally, SoftReference objects occupy memory themselves, and the mere number of these soft reference objects can also cause OutOfMemoryErrors. These reference objects are not cleared by the garbage collector when their referents are! For example, if you create a SoftReference object for every key and every value in your Map, these SoftReference objects are not going to be collected when the objects they point to are collected. That means that you'll get the same problem as previously mentioned, except that instead of being caused by many objects of type "Key" and "Value", it'll be triggered by SoftReference objects.
The small diagram below shows how a soft reference works in Java and how an OutOfMemoryError can still happen:
Soft references before and after flush
This is why you generally cannot use a memory-sensitive cache without the eviction algorithms mentioned above: the combination of a good cache algorithm with SoftReferences creates a very robust cache system that should limit the number of OutOfMemoryError occurrences.
Cache System Must Handle Flushed Soft References
Your cache system must be able to deal with the situation when it gets flushed. It might sound obvious, but sometimes we assume the values or even the keys of the hashmap are always going to be stored in memory, causing NullPointerExceptions when they are garbage collected as the cache gets flushed. Either the cache needs to reload the flushed data upon the next access, which some caches can do, or it needs to clean out flushed entries periodically and on access (this is what most systems do).
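As a sketch of that pattern (hard-referenced keys, soft-referenced values, stale entries cleaned out on access), not a production-ready cache:

import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal memory-sensitive cache: values are held via SoftReference, so the GC
// may clear them under memory pressure. A cleared value is treated as a miss
// and its stale entry is removed so that SoftReference objects do not pile up.
public class SoftValueCache<K, V> {

    private final Map<K, SoftReference<V>> map = new ConcurrentHashMap<K, SoftReference<V>>();

    public void put(K key, V value) {
        map.put(key, new SoftReference<V>(value));
    }

    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) {
            return null; // never cached
        }
        V value = ref.get();
        if (value == null) {
            map.remove(key); // value was flushed by the GC -- drop the stale entry
        }
        return value; // null signals a miss; the caller reloads from the database
    }
}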
There is No Standard Map Implementation Using
SoftReferences.
To preempt any hardcore coders posting comments about WeakHashMap, let me explain that a WeakHashMap uses WeakReferences and not SoftReferences. As a weak reference can be flushed on every GC, even minor ones, it is not suitable for a cache implementation. In my customer's case, the cache would have been flushed too often, not adding any real value.
This being said, do not despair! The Apache Commons Collections library provides you with a map implementation that allows you to choose the kind of reference used for keys and values independently. This implementation is the ReferenceMap class, and it allows you to create your own cache based on a map without having to develop it from scratch.
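A small usage sketch, assuming Commons Collections 3.x (ReferenceMap is not generic there; the query string and cached value are placeholders):

import java.util.Map;

import org.apache.commons.collections.map.ReferenceMap;

public class ReferenceMapCacheExample {

    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        // Hard references for the keys, soft references for the values: an entry
        // effectively disappears once the GC clears its value under memory pressure.
        Map cache = new ReferenceMap(ReferenceMap.HARD, ReferenceMap.SOFT);

        cache.put("SELECT * FROM orders WHERE id = 42", "cached result placeholder");

        Object result = cache.get("SELECT * FROM orders WHERE id = 42");
        if (result == null) {
            // miss or flushed entry: reload from the database and put() it again
        }
        System.out.println("cache hit: " + (result != null));
    }
}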
Conclusion
Soft references provide a good means of making cache systems more stable. They should be thought of as an enhancement and not a replacement for an
eviction policy. Indeed many existing cache systems leverage them, but I encounter many home-grown cache solutions at our clients, so it is good to know about this. And finally it should be said that while soft references make a cache system better, they are not without side effects, and we should take a hard look at them before trusting that everything is fine.
Microsoft Not Following Best Practices Slows
Down Firefox on Outlook Web Access
by Andreas Grabner
From time to time I access my work emails through Outlook Web Access (OWA), which works really great in all browsers I run on my laptop (IE, FF, and Chrome). Guessing that Microsoft probably optimized OWA for its own browser, I thought that I would definitely find JavaScript code that doesn't execute as well on Firefox as it does on Internet Explorer. From an end user's perspective there seems to be no noticeable performance difference, but using dynaTrace AJAX Edition (also check out the video tutorials) I found a very interesting JavaScript method that shows a big performance difference when iterating over DOM elements.
Allow Multiple DOM Elements with the Same ID? That is
Not a Good Practice!
I recorded the same sequence of actions on both Internet Explorer 8 and
Firefox 3.6. This includes logging on, selecting an email folder, clicking
through multiple emails, selecting the Unread Mail Search Folder and then
logging out. The following screenshot shows the Timeline Tab (YouTube
Tutorial) of the Performance Report including all my activities while I was
logged on to Outlook Web Access.
Outlook Web Access Timeline showing all browser
activities when clicking through the folders
Instead of comparing the individual mouse and keyboard event handlers I opened the hotspot view (YouTube tutorial) to identify the slowest JavaScript methods. If you use dynaTrace, just drill into the HotSpot view from the selected URL in your performance report (YouTube tutorial). There was one method with a slow execution time. This is the pure method execution time, excluding the times of child method calls. The method in question is called getDescendantById. It returns the DOM element, identified by its id, that is a descendant of the current DOM element. If you look at the following screenshot we see the same method call returning no value in both browsers, meaning that the element we are looking for (divReplaceFolderId) is not on that page. It's interesting to see that the method executes in 0.47ms on Internet Explorer but takes 12.39ms on Firefox. A closer look at the method implementation makes me wonder what the developer was trying to accomplish with this method:
Special implementation for non-IE Browsers to get elements by ID
If I understand the intention of this method correctly, it should return THE UNIQUE element identified by its id. It seems, though, that IE allows having multiple elements with the same ID on a single page. That's why the method implements a workaround by using getElementsByTagName and then accessing the returned element array by ID. In case there are multiple elements with the same ID the method returns the first element. In case no element was found and we are not running on IE, the implementation iterates through ALL DOM elements and returns the first element that matches the ID. This looks like an odd implementation to me, with the result that on non-IE browsers we have to iterate through this loop, which will probably never return any element anyway. Here is some pseudo-code on how I would implement this (happy to get your input on this):
// d is the id we are looking for; this._get_DomNode() returns the current element
var elem = document.getElementById(d);
if (elem != null) {
    // check if this element is a descendant of our current element
    var checkNode = elem.parentNode;
    while (checkNode != null && checkNode != this._get_DomNode())
        checkNode = checkNode.parentNode;
    if (checkNode == null) elem = null; // not a descendant
}
return elem;
This code works in both IE and FF even if there were duplicated elements with the same ID, which we should definitely avoid anyway.
Firefox Faster in Loading Media Player?
I continued exploring the sessions I recorded on both IE and FF. Interestingly enough, I found a method that initializes a Media Player JavaScript library. Check out the following image. It shows the difference in execution time for both IE and FF. This time Firefox is much faster, at least at first glance:
It appears as if initializing the Media Player library is much faster in Firefox
The time difference here is very significant: 358ms compared to 0.08ms. When we look at the actual execution trace of both browsers, however, we see that IE is executing the if(b.controls) control branch whereas Firefox does not. This tells me that I haven't installed the Media Player plugin on Firefox:
Actual JavaScript trace comparison between Internet
Explorer and Firefox
Lesson learned here is that we always have to look at the actual PurePath
(YouTube tutorial) as we can only compare performance when both
browsers executed the same thing.
Conclusion
Before writing this blog I hoped to find JavaScript performance problem patterns in Firefox similar to the IE-related problem patterns I blogged about in the past, such as Top 10 Client-Side Performance Problems. Instead of finding real JavaScript performance patterns it seems I keep finding code that has only been optimized for IE but was not really tested, updated or optimized for Firefox. Similar to my blog Slow Page Load Time in Firefox caused by older versions of YUI, jQuery, I consider the findings in this blog to go in the same direction. Microsoft implements its JavaScript code to deal with an IE-specific situation (allowing multiple elements with the same ID) but hasn't implemented the workaround to work efficiently enough in other browsers. By the way, comparing JavaScript method executions as I did here with the free dynaTrace AJAX Edition is easier and can be automated using the dynaTrace Premium AJAX Extensions.
If you have any similar findings or actual Firefox JavaScript performance problem patterns, let me know; I would be happy to blog about them.
Why Performance Management is Easier in Public than On-premise Clouds
by Michael Kopp
Performance is one of the major concerns in the cloud. But the question
should not really be whether or not the cloud performs, but whether the
application in question can and does perform in the cloud. The main
problem here is that application performance is either not managed at all or
managed incorrectly and therefore this question often remains unanswered.
Now granted, performance management in cloud environments is harder
than in physical ones, but it can be argued that it is easier in public clouds
than in on-premise clouds or even a large virtualized environment. How do
I come to that conclusion? Before answering that, let's look at the unique challenges that virtualization in general and clouds in particular pose to the realm of application performance management (APM).
Time is relative
The problem with timekeeping is well known in the VMware community.
There is a very good VMware whitepaper that explains this in quite some
detail. It doesn't tell the whole story, however, because obviously there are
other virtualization solutions like Xen, KVM, Hyper-V and more. All of them
solve this problem differently. On top of that the various guest operating
systems behave very differently as well. In fact I might write a whole article
just about that, but the net result is that time measurement inside a guest is
not accurate, unless you know what you are doing. It might lag behind real
time and speed up to catch up in the next moment. If your monitoring tool
is aware of that and supports native timing calls it can work around that
and give you real response times. Unfortunately that leads to yet another
problem. Your VM is not running all the time: like a process it will get de-
scheduled from time to time; however, unlike a process it will not be aware
of that. While real time is important for response time, it will screw with
your performance analysis on a deeper level.
The effects of timekeeping on response and execution time
If you measure real time, then Method B looks more expensive than it
actually is. This might lead you down the wrong track when you are looking
for a performance problem. When you measure apparent time then you
dont have this problem, but your response times do not reect the real user
experience. There are generally two ways of handling that. Your monitoring
solution can capture these de-schedule times and account for this all the
way against your execution times. The more granular your measurement
the more overhead this will produce. The more pragmatic approach is to
simply account for this once per transaction and thus capture the impact
that the de-schedules have on your response time. Yet another approach is
to periodically read the CPU steal time (either from vSphere or via mpstat
on Xen) and correlate this with your transaction data. This will give you a
better grasp on things. Even then it will add a level of uncertainty in your
performance diagnostics, but at least you know the real response time and
how fast your transactions really are. Bottom line, those two are no longer
the same thing.
The Impact of Shared Environments
The sharing of resources is what makes virtualization and cloud environments
compelling from a cost perspective. Most normal data centers have an
average CPU utilization far below 20%. The reason is twofold: on the one
hand they isolate the different applications by running them on different
hardware; on the other hand they have to provision for peak load. By
using virtualization you can put multiple isolated applications on the
same hardware. Resource utilization is higher, but even then it does not go
beyond 30-40 percent most of the time, as you still need to take peak load
into account. But the peak loads for the different applications might occur
at different times! The first order of business here is to find the optimal balance.
The first thing to realize is that your VM is treated like a process by the virtualization infrastructure. It gets a share of resources; how much can be configured. If it reaches the configured limit it has to wait. The same is true if the physical resources are exhausted. To drive utilization higher, virtualization and cloud environments overcommit. That means they allow, say, ten 2GHz VMs on a 16GHz physical machine. Most of the time this is perfectly fine as not all VMs will demand 100 percent CPU at the same time. If there is not enough CPU to go around, some will be de-scheduled and will be given a greater share the next time around. Most importantly, this is not only true for CPU but also for memory, disk and network I/O.
What does this mean for performance management? It means that
increasing load on one application, or a bug in the same, can impact another
negatively without you being aware of this. Without having a virtualization-
aware monitoring solution that also monitors other applications you will
not see this. All you see is that the application performance goes down!
When the load increases on one application it affects the
other
With proper tools this is relatively easy to catch for CPU-related problems,
but a lot harder for I/O-related issues. So you need to monitor both
applications, their VMs and the underlying virtualization infrastructure and
correlate the information. That adds a lot of complexity. The virtualization
vendors try to solve this by looking purely at VM and host level system
metrics. What they forget is that high utilization of a resource does not
mean the application is slow! And it is the application we care about.
OS Metrics are Worse than Useless
Now for the good stuff. Forget your guest operating system utilization
metrics, they are not showing you what is really going on. There are several
reasons why that is so. One is the timekeeping problem. Even if you and
your monitoring tool use the right timer and measure time correctly, your
operating system might not. In fact most systems will not read out the timer
device all the time, but rely on the CPU frequency and counters to estimate
time as it is faster than reading the timer device. As utilization metrics are
always based on a total number of possible requests or instructions per
time slice, they get screwed up by that. This is true for every metric, not just
CPU. The second problem is that the guest does not really know the upper
limit for a resource, as the virtualization environment might overcommit.
That means you may never be able to get 100% or you can get it at one
time but not another. A good example is the Amazon EC2 Cloud. Although
I cannot be sure, I suspect that the guest CPU metrics are actually correct.
They correctly report the CPU utilization of the underlying hardware, only
you will never get 100% of the underlying hardware. So without knowing
how much of a share you get, they are useless.
What does this mean? You can rely on absolute numbers like the number of
I/O requests, the number of SQL Statements and the amount of data sent
over the wire for a specific application or transaction. But you do not know
whether an over-utilization of the physical hardware presents a bottleneck.
There are two ways to solve this problem.
The first involves correlating resource and throughput metrics of your
application with the reported utilization and throughput measures on
the virtualization layer. In case of VMware that means correlating detailed
application and transaction level metrics with metrics provided by vSphere.
On EC2 you can do the same with metrics provided by CloudWatch.
EC2 cloud monitoring dashboard showing 3 instances
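As a rough illustration of pulling such a hypervisor-level metric programmatically (assuming the AWS SDK for Java; the credentials and instance id are placeholders), reading the average CPU utilization of one instance from CloudWatch looks roughly like this:

import java.util.Date;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient;
import com.amazonaws.services.cloudwatch.model.Datapoint;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.GetMetricStatisticsRequest;

public class Ec2CpuFetcher {
    public static void main(String[] args) {
        AmazonCloudWatchClient cloudWatch =
                new AmazonCloudWatchClient(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        // Average CPU utilization of one instance over the last hour, in 5-minute buckets
        GetMetricStatisticsRequest request = new GetMetricStatisticsRequest()
                .withNamespace("AWS/EC2")
                .withMetricName("CPUUtilization")
                .withDimensions(new Dimension().withName("InstanceId").withValue("i-12345678"))
                .withStatistics("Average")
                .withPeriod(300)
                .withStartTime(new Date(System.currentTimeMillis() - 3600 * 1000L))
                .withEndTime(new Date());

        for (Datapoint dp : cloudWatch.getMetricStatistics(request).getDatapoints()) {
            System.out.println(dp.getTimestamp() + ": " + dp.getAverage() + " %");
        }
    }
}

The correlation of these values with application- and transaction-level metrics then still has to happen in your own tooling.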
This is the approach recommended by some virtualization vendors. It is
possible, but because of the complexity requires a lot of expertise. You
do however know which VM consumes how much of your resources.
With a little calculation magic you can break this down to application and
transaction level; at least on average. You need this for resource optimization
and to decide which VMs should be moved to different physical hardware.
This does not do you a lot of good in case of acute performance problems
or troubleshooting as you don't know the actual impact of the resource
shortage (or if it has an impact at all). You might move a VM, and not
actually speed things up. The real crux is that just because something
is heavily used does not mean that it is the source of your performance
problem! And of course this approach only works if you are in charge of the
hardware, meaning it does not work with public clouds!
The second option is one that is, among others, proposed by Bernd Harzog,
a well-known expert in the virtualization space. It is also the one that I
would recommend.
Response Time, Response Time, Latency and More
Response Time
On the Virtualization Practice blog Bernd explains in detail why resource utilization does not help you with either performance management or capacity planning. Instead he points out that what really matters is the response time or throughput of your application. If your physical hardware or virtualization infrastructure runs into utilization problems, the easiest way to spot this is when it slows down. In effect that means that I/O requests done by your application are slowing down and you can measure that. What's more important is that you can turn this around! If your application performs fine, then whatever the virtualization or cloud infrastructure reports, there is no
then whatever the virtualization or cloud infrastructure reports, there is no
performance problem. To be more accurate, you only need to analyze the
virtualization layer if your application performance monitoring shows that
a high portion of your response time is down to CPU shortage, memory
shortage or I/O latency. If that is not the case then nothing is gained by
optimizing the virtualization layer from a performance perspective.
Network impact on the transaction is minimal, even though network utilization
is high
Diagnosing the Virtualization Layer
Of course in the case of virtualization and private clouds you still need to diagnose an infrastructure response time problem once it has been identified. You measure the infrastructure response time inside your application. If you have identified a bottleneck, meaning it slows down or is a big portion of your response time, you need to relate that infrastructure response time back to your virtualized infrastructure: which resource slows down? From there you can use the metrics provided by VMware (or whatever your virtualization vendor provides) to diagnose the root cause of the bottleneck. The key
is that you identify the problem based on actual impact and then use the
infrastructure metrics to diagnose the cause of that.
Layers Add Complexity
What this of course means is that you now have to manage performance
on even more levels than before. It also means that you have to somehow
manage which VMs run on the same physical host. We have already seen
that the nature of the shared environment means that applications can
impact each other. So a big part of managing performance in a virtualized
environment is to detect that impact and tune your environment in a
way that both minimizes that impact and maximizes your resource usage
and utilization. These are diametrically opposed goals!
Now, what about the cloud? A cloud is by nature more dynamic than a simple virtualized environment. A cloud will enable you to provision new environments on the fly and also dispose of them again. This will lead to spikes in your utilization, leading to performance impact on existing applications. So in the cloud the minimize-impact vs. maximize-resource-usage goal becomes even harder to achieve. Cloud vendors usually provide you with management software to manage the placement of your VMs. They will move them around based on complex algorithms to try and achieve the impossible goal of high performance and high utilization. The success is limited, because most of these management solutions ignore the application and only look at the virtualization layer to make these decisions. It's a vicious cycle and the price you pay for better utilizing your datacenter and faster provisioning of new environments.
Maybe a bigger issue is capacity management. The shared nature of the
environment prevents you from making straightforward predictions about
capacity usage on a hardware level. You get a long way by relating the
requests done by your application on a transactional level with the capacity
usage on the virtualization layer, but that is cumbersome and does not
lead to accurate results. Then of course a cloud is dynamic and your
application is distributed, so without having a solution that measures all
your transactions and auto detects changes in the cloud environment you
can easily make this a full time job.
Another problem is that the only way to notice a real capacity problem
is to determine if the infrastructure response time goes down and
negatively impacts your application. Remember, utilization does not equal performance, and you want high utilization anyway! But once you notice
capacity problems, it is too late to order new hardware.
That means that you not only need to provision for peak loads, effectively over-provisioning again; you also need to take all those temporary and
newly-provisioned environments into account. A match made in planning
hell.
Performance Management in a Public Cloud
First let me clarify the term public cloud here. While a public cloud has
many characteristics, the most important ones for this article are that you
don't own the hardware, have limited control over it and can provision new instances on the fly.
If you think about this carefully you will notice immediately that you have fewer problems. You only care about the performance of your application and not at all about the utilization of the hardware; it's not your hardware after all. Meaning there are no competing goals! Depending on your application you will add a new instance if response time degrades on a specific tier or if you need more throughput than you currently achieve. You provision on the fly, meaning your capacity management is done on the fly as well. Another problem solved. You still run in a shared environment and this will impact you. But your options are limited as you cannot monitor or fix this directly. What you can do is measure the latency of the infrastructure. If you notice a slowdown you can talk to your vendor, though most of the time you will not care and will just terminate the old instance and start a new one if infrastructure response time degrades. Chances are the new instance is started on a less utilized server and that's that. I won't say that this is easy. I also do not say that this is better, but I do say that performance management is easier than in private clouds.
Conclusion
Private and public cloud strategies are based on similar underlying
technologies. Just because they are based on similar technologies, however,
doesn't mean that they are similar in any way in terms of actual usage. In the private cloud, the goal is becoming more efficient by dynamically and automatically allocating resources in order to drive up utilization while also lowering the management costs of those many instances. The problem
with this is that driving up utilization and having high performance are
competing goals. The higher the utilization the more the applications will
impact one another. Reaching a balance is highly complex, and is made
more complex due to the dynamic nature of the private cloud.
In the public cloud, these competing goals are split between the cloud
provider, who cares about utilization, and the application owner, who cares
about performance. In the public cloud the application owner has limited
options: he can measure application performance; he can measure the
impact of infrastructure degradation on the performance of his business
transactions; but he cannot resolve the actual degradation. All he can do
is terminate slow instances and/or add new ones in the hope that they will
perform at a higher level. In this way, performance in the public cloud is in
fact easier to manage.
But whether it be public or private you must actively manage performance
in a cloud production environment. In the private cloud you need to
maintain a balance between high utilization and application performance,
which requires you to know what is going on under the hood. And without application performance management in the public cloud, application owners are at the mercy of cloud providers, whose goals are not necessarily aligned with their own.
Why Response Times are Often Measured
Incorrectly
by Alois Reitbauer
Response times are in many, if not most, cases the basis for performance analysis. When they are within expected boundaries everything is OK. When they get too high we start optimizing our applications.
So response times play a central role in performance monitoring and
analysis. In virtualized and cloud environments they are the most accurate
performance metric you can get. Very often, however, people measure and
interpret response times the wrong way. This is more than reason enough
to discuss the topic of response time measurements and how to interpret
them. Therefore I will discuss typical measurement approaches, the related
misunderstandings and how to improve measurement approaches.
Averaging Information Away
When measuring response times, we cannot look at each and every single
measurement. Even in very small production systems the number of
transactions is unmanageable. Therefore measurements are aggregated
for a certain timeframe. Depending on the monitoring configuration this
might be seconds, minutes or even hours.
While this aggregation helps us to easily understand response times in large
volume systems, it also means that we are losing information. The most
common approach to measurement aggregation is using averages. This
means the collected measurements are averaged and we are working with
the average instead of the real values.
The problem with averages is that in many cases they do not reflect what is happening in the real world. There are two main reasons why working with averages leads to wrong or misleading results.
In the case of measurements that are highly volatile in their values, the average is not representative of the actually measured response times. If our measurements range from 1 to 4 seconds, the average might be around 2 seconds, which certainly does not represent what many of our users perceive.
So averages provide only little insight into real-world performance. Instead of working with averages you should use percentiles. If you talk to people who have been working in the performance space for some time, they will tell you that the only reliable metrics to work with are percentiles. In contrast to averages, percentiles tell you the response time threshold that a certain percentage of your users stayed below. If the 50th percentile, for example, is 2.5 seconds, this means that the response times for 50 percent of your users were less than or equal to 2.5 seconds. As you can see, this approach is far closer to reality than using averages.
Percentiles and average of a measurement series
The only potential downside with percentiles is that they require more data
to be stored than averages do. While average calculation only requires the
sum and count of all measurements, percentiles require a whole range of
measurement values as their calculation is more complex. This is also the
reason why not all performance management tools support them.
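To make the difference concrete, here is a minimal nearest-rank percentile calculation over a handful of made-up response times (values in milliseconds):

import java.util.Arrays;

public class PercentileExample {

    // Nearest-rank percentile: the value below which (approximately) p percent
    // of all measurements fall.
    static double percentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        double[] responseTimesMs = {120, 150, 180, 200, 250, 300, 400, 800, 1200, 4000};

        double sum = 0;
        for (double t : responseTimesMs) {
            sum += t;
        }

        System.out.println("average:         " + (sum / responseTimesMs.length) + " ms"); // 760 ms, skewed by the outliers
        System.out.println("50th percentile: " + percentile(responseTimesMs, 50) + " ms"); // 250 ms
        System.out.println("90th percentile: " + percentile(responseTimesMs, 90) + " ms"); // 1200 ms
    }
}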
Putting Everything into One Box
Another important question when aggregating data is which data you use as the basis of your aggregations. If you mix together data for different transaction types, like the start page, a search and a credit card validation, the results will be of little value as the base data are as different as apples and oranges. So in addition to ensuring that you are working with percentiles it is necessary to also split transaction types properly, so that the data that form the basis for your calculations fit together.
The concept of splitting transactions by their business function is often referred to as business transaction management (BTM). While the field of BTM is wide, the basic idea is to distinguish transactions in an application by logical parameters like what they do or where they come from. An example would be a "put into cart" transaction or the requests of a certain user.
Only a combination of both approaches ensures that the response times
you measure are a solid basis for performance analysis.
Far from the Real World
Another point to consider with response times is where they are measured. Most people measure response times at the server side and implicitly assume that they represent what real users see. While server-side response times are down to 500 milliseconds and everyone thinks everything is fine, users might experience response times of several seconds.
The reason is that server-side response times don't take a lot of factors influencing end-user response times into account. First of all, server-side measurements neglect the network transfer time to the end users. This easily adds half a second or more to your response times.

Server vs. client response time
At the same time, server-side response times often only measure the initial document sent to the user. All the images, JavaScript and CSS files that are required to render a page properly are not included in this calculation at all. Experts like Souders even say that only 10 percent of the overall response time is influenced by the server side. Even if we consider this an extreme scenario, it is obvious that basing performance management solely on server-side metrics does not provide a solid basis for understanding end-user performance.
The situation gets even worse with JavaScript-heavy Web 2.0 applications, where a great portion of the application logic is executed within the browser. In this case server-side metrics cannot be taken as representative of end-user performance at all.
Not Measuring What You Want to Know
A common approach to solve this problem is to use synthetic transaction
monitoring. This approach often claims to be close to the end-user.
Commercial providers offer a huge number of locations around the world
from where you can test the performance of pre-defined transactions.
While this provides better insight into what the perceived performance of
end-users is, it is not the full truth.
The most important thing to understand is how these measurements are
collected. There are two approaches to collect this data: via emulators or
real browsers. From my very personal perspective any approach that does
not use real browsers should be avoided as real browsers are also what your
users use. They are the only way to get accurate measurements.
The issue with using synthetic transactions for performance measurement
is that it is not about real users. Your synthetic transactions might run pretty
fast, but that guy with a slow internet connection who just wants to book a $5,000 holiday (OK, a rare case) still sees 10-second response times. Is it the fault of your application? No. Do you care? Yes, because this is your business. Additionally, synthetic transaction monitoring cannot monitor all of your transactions. You cannot really book a holiday every couple of minutes, so in the end you only get a portion of your transactions covered by your monitoring.
This does not mean that there is no value in using synthetic transactions.
They are great for being informed about availability or network problems that might affect your users, but they do not represent what your users actually see. As a consequence, they do not serve as a solid basis for performance improvements.
Measuring at the End-User Level
The only way to get real user performance metrics is to measure from within the user's browser. There are two approaches to do this. You can use a tool like the free dynaTrace AJAX Edition, which uses a browser plug-in to collect performance data, or inject JavaScript code to get performance metrics. The W3C now also has a number of standardization activities for browser performance APIs. The Navigation Timing Specification is already supported by recent browser releases, as is the Resource Timing Specification. Open-source implementations like Boomerang provide a convenient way to access performance data within the browser. Products like dynaTrace User Experience Management (UEM) go further by providing a highly scalable backend and full integration into your server-side systems.
The main idea is to inject custom JavaScript code which captures timing
information like the beginning of a request, DOM ready and fully loaded.
While these events are sufficient for classic web applications they are not
enough for Web 2.0 applications which execute a lot of client-side code. In
this case the JavaScript code has to be instrumented as well.
Is it Enough to Measure on the Client-side?
The question now is whether it is enough to measure performance from
the end-user perspective. If we know how our web application performs
for each user we have enough information to see whether an application is
slow or fast. If we then combine this data with information like geo location,
browser and connection speed we know for which users a problem exists.
So from a pure monitoring perspective this is enough.
In case of problems, however, we want to go beyond monitoring. Monitoring only tells us that we have a problem but does not help in finding its cause. Especially when we measure end-user performance, our information is less rich compared to development-centric approaches. We could still use a development-focused tool like dynaTrace AJAX Edition for production troubleshooting. This however requires installing custom software on an end user's machine. While this might be an option for SaaS environments, this is not the case in a typical eCommerce scenario.
The only way to gain this level of insight for diagnostics purposes is to collect information from the browser as well as the server side to get a holistic view of application performance. As discussed, using averaged metrics is not enough in this case. Using aggregated data does not provide the insight we need. So instead of aggregated information we need the ability to identify and relate the requests of a user's browser to the corresponding server-side requests.
Client/server drill-down of pages and actions
The figure below shows an architecture based on (and abstracted from) dynaTrace UEM which provides this functionality. It shows the combination of browser and server-side data capturing on a transactional basis and a centralized performance repository for analysis.
Architecture for end-to-end user experience monitoring
Conclusion
There are many places where, and many ways in which, response times can be measured. Depending on what we want to achieve, each one of them provides more or less accurate data. For the analysis of server-side problems, measuring at the server side is enough. We do, however, have to be aware that this does not reflect the response times of our end users. It is a purely technical metric for optimizing the way we create content and service requests. The prerequisite for meaningful measurements is that we separate different transaction types properly.
Measurements from anything but the end user's perspective can only be used to optimize your technical infrastructure and only indirectly the performance of end users. Only performance measurements in the browser enable you to understand and optimize user-perceived performance.
Automated Cross Browser Web 2.0
Performance Optimizations: Best Practices
from GSI Commerce
by Andreas Grabner
A while back I hosted a webinar with Ron Woody, Director of Performance at GSI Commerce (now part of eBay). Ron and his team are users of dynaTrace, both the AJAX and Test Center Editions. During the webinar we discussed the advantages and challenges that Web 2.0 offers, with a big focus on eCommerce.
This blog is a summary of what we discussed, including Ron's tips, tricks and best practices. The screenshots are taken from the original
webinar slide deck. If you want to watch the full webinar you can go ahead
and access it online.
Web 2.0: An Opportunity for eCommerce
Web 2.0 is a great chance to make the web more interactive. Especially for eCommerce sites it brings many benefits. In order to leverage these benefits we have to understand how to manage the complexity that comes with this new technology.
The Benefits of Web 2.0
JavaScript, CSS, XHR, and many others: that's what makes interactive web sites possible, and that's what many of us consider Web 2.0 to be.
When navigating through an online shop, users can use dynamic menus or search suggestions to more easily find what they are looking for. Web 2.0 also eliminates the need for full page reloads for certain user interactions, e.g. displaying additional product information when hovering the mouse over the product image. This allows the user to become more productive by reducing the time it takes to find and purchase a product.
Web 2.0 allows us to build more interactive web sites that support the user in finding the right information faster
The Challenges
The challenge is that you are not alone in what you have to offer to your
users. Your competition leverages Web 2.0 to attract more users and is
there for those users that are not happy with the experience on your own
site. If your pages are slow or don't work as expected, online shoppers will go to your competitor. You may only lose them for this one shopping experience, but you may lose them forever if the competitor satisfies their needs. Worse than that, frustrated users share their experience with their friends, impacting your reputation.
Performance, reliability and compatibility keep your users happy. Otherwise you
lose money and damage your reputation
The Complexity of Web 2.0
Performance optimization was easier before we had powerful browsers
supporting JavaScript, CSS, DOM, AJAX, and so on.
When we take a look at a Web 2.0 application, we no longer deal with an application that lives only on the application server and whose generated content simply gets rendered by the browser. We have an application that spans both server and client (browser). Application and navigation logic has moved into the browser to provide a better end-user experience. Web 2.0 applications leverage JavaScript frameworks that make building these applications easier. But just because an application can be built faster doesn't mean it operates faster and without problems. The challenge is that we have to understand all the moving parts in a Web 2.0 application, as outlined in the following illustration:
Web 2.0 applications run on both server and client (browser) using a set of new components (JS, DOM, CSS, AJAX, etc.)
Performance in Web 2.0
With more application logic sitting in the client (browser) it becomes more
important to measure performance for the actual end-user. We need to split
page load times into time spent on the server vs. time spent on the client
(browser). The more JavaScript libraries we use, the more fancy UI effects
we add to the page and the more dynamic content we load from the server
the higher the chance that we end up with a performance problem in the
client. The following illustration shows a performance analysis of a typical
eCommerce user scenario. It splits the end user's perceived performance into time spent in the browser and time spent downloading (server time):
Up to 6 seconds spent in the browser for shipping, payment and confirm
Users get frustrated when waiting too long for a page. Optimizing on the
server side is one aspect and will speed up page load time. The more logic
that gets moved to the browser the more you need to focus on optimizing
the browser side as well. The challenge here is that you cannot guarantee the environment as you can on the server side. You have users browsing with the latest version of Chrome or Firefox, but you will also have users that browse with an older version of Internet Explorer. Why does that make
a difference? Because JavaScript engines in older browsers are slower and
impact the time spent in the browser. Older browsers also have a limited
set of core features such as looking up elements by class name. JavaScript
frameworks such as jQuery work around this problem by implementing
the missing features in JavaScript which is much slower than native
implementations. It is therefore important to test your applications on a
broad range of browsers and optimize your pages if necessary.
How GSI Commerce Tames the Web 2.0 Beast
GSI Commerce (now eBay) powers sites such as NFL, Toys R Us, ACE, Adidas,
and many other leading eCommerce companies.
In order to make sure these sites attract new and keep existing online users
it is important to test and optimize these applications before new versions
get released.
Business Impact of Performance
Ron discussed the fact that performance indeed has a direct impact on business. We've already heard about this impact when Google and Bing announced the results of their performance studies. GSI confirms these results: poor performance has a direct impact on sales. Here is why:
Our clients' competitors are only a click away
Poor performance increases site abandonment risk
Slow performance may impact brand
Client and Server-Side Testing
GSI does traditional server-side load testing using HP LoadRunner in
combination with dynaTrace Test Center Edition. They execute tests
against real-life servers hosting their applications. On the other hand
they also execute client-side tests using HP Quick Test Pro with dynaTrace
AJAX Edition to test and analyze performance in the browser. They also
leverage YSlow and WebPageTest.org.
GSI Browser Lab
Online users of web sites powered by GSI use all different types of browsers.
Therefore GSI built their own browser lab including all common versions
of Internet Explorer and Firefox. Since dynaTrace AJAX also supports Firefox
they run dynaTrace AJAX Edition on all of their test machines as it gives
them full visibility into rendering, JavaScript, DOM and AJAX. They use HP
Quick Test Pro for their test automation:
GSI Browser Lab is powered by dynaTrace AJAX Edition and includes multiple
versions of Internet Explorer and Firefox
How GSI uses dynaTrace
While HP Quick Test Pro drives the browser to test the individual use cases, dynaTrace AJAX Edition captures performance-relevant information. GSI uses a public API to extract this data and pushes it into a web-based reporting solution that helps them easily analyze performance across browsers and builds. In case there are problems on the browser side, the recorded dynaTrace AJAX Edition sessions contain all the necessary information to diagnose and fix JavaScript, rendering, DOM or AJAX problems. This allows developers to see what really happened in the browser when the error occurred, without having to try to reproduce the problem.
In case it turns out that certain requests took too long on the application server, GSI can drill into the server-side PurePaths as they also run dynaTrace on their application servers.
Using dynaTrace in testing bridges the gap between testers and
developers. Capturing this rich information and sharing it with a mouse
click makes collaboration between these departments much easier.
Developers also start using dynaTrace prior to shipping their code to
testing. Ron actually specified acceptance criteria: every new feature must at least have a certain dynaTrace AJAX Edition page rank before it can go into testing.
Besides running dynaTrace on the desktops of developers and testers, GSI
is moving towards automating dynaTrace in Continuous Integration (CI) to
identify more problems in an automated way during development.
Analyze performance and validate architectural rules by letting dynaTrace
analyze unit and functional tests in CI
Saving Time and Money
With dynaTrace's in-depth visibility and the ability to automate many tasks on both the client and server side it was possible to:
Reduce test time from 20 hours to 2 hours
Find more problems faster
Shorten project time
The fact that developers actually see what happened improves collaboration with the testers. It eliminates the constant back and forth. The deep visibility allows identification of problems that were difficult or impossible to
see before. Especially rendering, JavaScript and DOM analysis have been
really helpful for optimizing page load time.
Tips and Tricks
Here is a list of tips and tricks that Ron shared with the audience:
Clear browser cache when testing load time
> Different load behavior depending on network time
Multiple page testing
> Simulates consumer behavior with regard to caching
Test in different browsers
> IE 6, 7 & 8 have different behavior
> Compare Firefox & IE to understand cross-browser behavior
> Use weighted averages based on browser traffic
Test from different locations
> www.webpagetest.org to test sites from around the world
(including dynaTrace Session Tracking)
> Commercial solutions for network latency, cloud testing, etc.
Best Practices
Ron concluded with the following best practices:
Performance matters
Define performance targets
Test client-side performance
> Cross-browser testing
> Add server-side testing
> Tie everything together
Automate!
Get Test and Development on the same page
Get proactive
Benchmark your site against competition
Conclusion
If you are interested in listening to the full webinar, which also includes an extended Q&A session at the end, go ahead and listen to the recorded version.
Goal-oriented Auto Scaling in the Cloud
by Michael Kopp
The ability to scale your environment on demand is one of the key
advantages of a public cloud like Amazon EC2. Amazon provides a lot
of functionality like the Auto Scaling groups to make this easy. The one
downside to my mind is that basing auto scaling on system metrics is a
little nave, and from my experience only works well in a limited number of
scenarios. I wouldnt want to scale indenitely either, so I need to choose
an arbitrary upper boundary to my Auto Scaling Group. Both the upper
boundary and the system metrics are unrelated to my actual goal, which is
always application related e.g. throughput or response time.
Some time back Amazon added the ability to add custom metrics to
CloudWatch. This opens up interesting possibilities. One of them is to do
goal-oriented auto scaling.
Scale for Desired Throughput
A key use case that I see for a public cloud is batch processing. This
is throughput and not response time oriented. I can easily upload the
measured throughput to CloudWatch and trigger auto scaling events on
lower and upper boundaries. But of course I dont want to base scaling
events on throughput alone: if my application isnt doing anything I
wouldnt want to add instances. On the other hand, dening the desired
throughput statically might not make sense either, as it depends on the current job. My actual goal is to finish the batch in a specific timeframe. So let's size our EC2 environment based on that!
I wrote a simple Java program that takes the current throughput, the remaining time and the remaining number of transactions and calculates the throughput needed to finish in time. It then calculates the ratio of actual to needed throughput as a percentage and pushes this out to CloudWatch.
public void setCurrentSpeed(double transactionsPerMinute,
                            long remainingTransactions,
                            long remainingTimeInMinutes,
                            String jobName)
{
    double targetTPM;
    double currentSpeed;
    if (remainingTimeInMinutes > 0 && remainingTransactions > 0)
    {   // time left and something to be done
        targetTPM = (double) remainingTransactions / remainingTimeInMinutes;
        currentSpeed = transactionsPerMinute / targetTPM;
    }
    else if (remainingTransactions > 0) // no time left but transactions left?
        throw new SLAViolation(remainingTransactions);
    else // all done
        currentSpeed = 2; // tell our algorithm that we are too fast
                          // if we don't have anything left to do

    // cloudWatch is an AmazonCloudWatchClient instance held by this class
    PutMetricDataRequest putMetricDataRequest = new PutMetricDataRequest();
    MetricDatum o = new MetricDatum();
    o.setDimensions(Collections.singleton(
            new Dimension().withName(jobName).withValue(jobName)));
    o.setMetricName("CurrentSpeed");
    o.setUnit("Percent");
    o.setValue(currentSpeed);
    putMetricDataRequest.setMetricData(Collections.singleton(o));
    putMetricDataRequest.setNamespace("dynaTrace/APM");
    cloudWatch.putMetricData(putMetricDataRequest);
}
After that I started my batch job with a single instance and started measuring the throughput. When putting the CurrentSpeed metric into a chart it looked something like this:
The speed would start at 200% and go down according to the target time after
the start
It started at 200%, which my Java code reports if the remaining transactions
are zero. Once I start the load the calculated speed goes down to indicate
the real relative speed. It quickly dropped below 100%, indicating that it
was not fast enough to meet the time window. The longer the run took,
the less time it had to nish. This would mean that the required throughput
to be done in time would grow; in other words, the relative speed was
decreasing. So I went ahead and produced three auto scaling actions and
the respective alarms.
The rst doubled the number of instances if current speed was below 50%.
The second added 10% more instances as long the current speed was
below 105% (a little safety margin). Both actions had a proper threshold
and cool down periods attached to prevent an unlimited sprawl. The result
was that the number of instances grew quickly until the throughput was
a little more than required. I then added a third policy. This one would
remove one instance as long as the relative current speed was above 120%.
The adjustments result in higher throughput, which adjusts the relative speed
As the number of instances increased, so did my application's throughput
until it achieved the required speed. As it was faster than required, the
batch would eventually be done ahead of time. That means that every
minute that it kept being faster than needed, the required throughput kept
shrinking, which is why you see the relative speed increasing in the chart
although no more instances were added.
Upon breaching the 120% barrier the last auto scaling policy removed
an instance and the relative speed dropped. This led to a near-optimal number of instances required to finish the job.
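For illustration, here is a rough sketch of how one such policy/alarm pair could be wired up with the AWS SDK for Java. The Auto Scaling group name, credentials, thresholds and adjustment values are placeholders (and the threshold has to match however you scale the CurrentSpeed values you push); this is not the exact setup used in the experiment above.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyRequest;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;

public class SpeedBasedScalingSetup {
    public static void main(String[] args) {
        BasicAWSCredentials credentials = new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY");
        AmazonAutoScalingClient autoScaling = new AmazonAutoScalingClient(credentials);
        AmazonCloudWatchClient cloudWatch = new AmazonCloudWatchClient(credentials);

        // Policy: double the capacity of the (hypothetical) "batch-workers" group
        String policyArn = autoScaling.putScalingPolicy(new PutScalingPolicyRequest()
                .withAutoScalingGroupName("batch-workers")
                .withPolicyName("double-capacity")
                .withAdjustmentType("PercentChangeInCapacity")
                .withScalingAdjustment(100)
                .withCooldown(300)).getPolicyARN();

        // Alarm: trigger the policy when the custom CurrentSpeed metric drops below 50%
        cloudWatch.putMetricAlarm(new PutMetricAlarmRequest()
                .withAlarmName("current-speed-below-50")
                .withNamespace("dynaTrace/APM")
                .withMetricName("CurrentSpeed")
                .withStatistic("Average")
                .withComparisonOperator("LessThanThreshold")
                .withThreshold(50.0)
                .withPeriod(60)
                .withEvaluationPeriods(2)
                .withAlarmActions(policyArn));
    }
}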
Conclusion
Elastic scaling is very powerful and especially useful if we couple it with goal-oriented policies. The example provided does of course need some fine-tuning, but it shows why it makes sense to use application-specific metrics instead of indirectly related system metrics to meet an SLA target.
How Server-side Performance Affects Mobile
User Experience
by Alois Reitbauer
Testing mobile web sites on the actual device is still a challenge. While tools
like dynaTrace AJAX Edition make it very easy to get detailed performance
data from desktop browsers, we do not have the same luxury for mobile.
I was wondering whether desktop tooling can be used for analyzing and
optimizing mobile sites. My idea was to start testing mobile web sites
on desktop browsers. Many websites return mobile content even when
requested by a desktop browser. For all sites one has control over it is also
possible to override browser checks.
The basic rationale behind this approach is that if something is already slow
in a desktop browser it will not be fast in a mobile browser. Typical problem
patterns can also be more easily analyzed in a desktop environment than
on a mobile device.
I chose the United website, inspired by Maximiliano Firtman's talk at Velocity. I loaded the regular and the mobile sites with Firefox and collected all performance data with dynaTrace. The first interesting fact was that the mobile site was much slower than the regular site.
mobile.united.com is slower than united.com
This is quite surprising as the mobile site has way less visual content, as you
can see below. So why is the site that slow?
Desktop and mobile website of United
When we look at the timeline we see that the mobile site is only using one
domain while the regular site is using an additional content domain. So
serving everything from one domain has a serious impact on performance.
Timeline comparison of United sites
I checked the latest result from BrowserScope to see how many connections
mobile browsers can handle. They are using up to 35 connections, which
is quite a lot. The United mobile site does not leverage this fact for mobile.
Connections per domain and total for mobile browsers
Looking at the content reveals two optimization points. First, a lot of the
content is images which could be sprited. This would then only block one
connection and also speed up download times. The second point is that
the CSS which is used is huge. A 70k CSS file for a 12k HTML page is quite impressive.
Very large CSS file on mobile.united.com
While these improvements will make the page faster they are not the biggest concern. Looking at the requests we can see that there are several network requests which take longer than 5 seconds. One of them is the CSS file which is required to lay out the page. This means that the user does not see a nicely laid-out page within less than 5 seconds (not taking network transfer time into consideration). So in this case the server used for the mobile website is the real problem.
Request with very high server times
Conclusion
This example shows that basic analysis of mobile web site performance can also be done on the desktop. In particular, performance issues caused by slow server-side response times or non-optimized resource delivery can be found easily. The United example also shows how important effective server-side performance optimization is in a mobile environment. When we have to deal with higher latency and smaller bandwidth we have to optimize server-side delivery to get more headroom for dealing with slower networks.
Content delivery chain of web applications
Looking at the content delivery chain, which starts at the end user and goes all the way back to the server side, it becomes clear that any time we lose on the server cannot be compensated for by upstream optimization.
Step by Step Guide: Comparing Page Load
Time of US Open across Browsers
by Andreas Grabner
The US Open is one of the major world sporting events these days. Those tennis enthusiasts that can't make it to the Centre Court in Flushing Meadows are either watching the games on television or following the scores on the official US Open Web Site.
The question is: how long does it take to get the current standings? And
is my computer running Firefox (FF) faster than my friend's Internet Explorer (IE)?
Comparing US Open 2011 Page Load Time
I made this test easy. I recorded page load activity in both Firefox 6
and Internet Explorer 8 when loading http://www.usopen.org. I am
using dynaTrace AJAX Edition Premium Version to analyze and compare
the activity side-by-side. The following screenshot shows the High-Level
Key Performance Indicators (KPIs) for both browsers. The interesting
observations for me are:
27 more roundtrips in IE (column request count) resulting in 700k
more downloaded data
Slower JavaScript (JS) execution time in IE
Slower rendering in Firefox
High-level comparison between Internet Explorer and Firefox
Comparing Page Activity in Timeline
The next step is to compare page load in the Browser Timeline. In the
following screenshot you see the page activity for Internet Explorer (top)
and Firefox (bottom). I highlighted what are to me 3 significant differences:
1. Loading behavior for the Google, Twitter and Facebook JavaScript files is much faster in Firefox
2. In Internet Explorer we have 6 XHR calls whereas in Firefox we only see 5
3. Long-running onLoad event handler in Internet Explorer
Easy to spot differences in the Timeline Comparison between Internet Explorer
and Firefox
What is the Difference in Network Roundtrips?
The next step is to compare the network requests. We have already learned through the high-level KPIs that there is a significant difference in network roundtrips, e.g. Internet Explorer has 27 more resource requests than Firefox. The following screenshot shows me the browser network dashlet in Comparison Mode. It compares the network requests from IE and FF and uses color coding to highlight the differences. Grey means that this request was done by IE but not by FF. Red means that IE had more requests to a specific resource than FF. If we look at this table we can observe the following interesting facts:
Internet Explorer tries to download flashcanvas.swf twice, whereas this component is not loaded in Firefox at all (top row)
It requests certain JS, CSS and one JPG twice, making one request more than Firefox (rows in red)
It requests certain files that are not requested by Firefox (rows in gray)
A network comparison shows that Internet Explorer is requesting additional
resources and some resources more than once
What is the Extra AJAX Request?
The same browser network Comparison View allows us to focus on the AJAX requests. Looking at the data side-by-side shows us that the Flash object (the one that was requested twice) is requested once using AJAX/XHR. As this Flash component is only requested in IE we see the extra XHR request.
Comparison of AJAX requests between Firefox and Internet Explorer. It's easy to spot the extra request that downloads the Flash component
With a simple drill down we also see where this AJAX/XHR request comes
from. The following screenshot displays the browser PurePath with the
full JavaScript trace including the XHR request for the Flash component.
The AJAX request for the Flash component is triggered when flashcanvas.js is loaded
Why is the onLoad Event Handler Taking Longer in IE?
The Performance Report shows us the JavaScript hotspots in Internet Explorer. From here (for instance) we see that the $ method (the jQuery lookup method) takes significant time. The report also shows us where this method gets called. When we drill from there to the actual browser PurePaths we see which calls to the $ method were actually taking a long time:
JavaScript hotspot analysis brings us to problematic $ method calls that are
slower on Internet Explorer
Want to Analyze Your Own Web Site?
This was just a quick example on how to analyze and compare page
load performance across browsers. For more recommendations on how
to actually optimize page load time I recommend checking out our other
blogs on Ajax/JavaScript.
To analyze individual pages of medium-complexity Web 2.0 applications, download the free dynaTrace AJAX Edition.
For advanced scenarios such as the following, take a look at dynaTrace AJAX Edition Premium Version:
JavaScript-heavy Web 2.0 applications: Premium Version is unlimited in the JavaScript activities it can process; the free AJAX Edition only works for medium-complex Web 2.0 applications
Compare performance and load behavior across browsers: Premium Version automatically compares different sessions; the free AJAX Edition only allows manual comparison
Identify regressions across different versions of your web site: Premium Version automatically identifies regressions across tested versions
Automate performance analysis: Premium Version automatically identifies performance problems that can be reported through REST or HTML reports
I also encourage everybody to participate in the discussions on our
Community Forum.
How Case-Sensitivity for ID and ClassName
can Kill Your Page Load Time
by Andreas Grabner
We have often posted the recommendation to speed up your DOM element lookups by using unique IDs or at least a tag name. Instead of using $(".wishlist") you should use $("div.wishlist"), which will speed up lookups in older browsers; if you want to look up a single element then give it a unique ID and change your call to $("#mywishlist"). This will speed up the lookup in older browsers from 100-200ms to about 5-10ms (times vary depending on the number of DOM elements on your page). More on this in our blogs 101 on jQuery Selector Performance and 101 on Prototype CSS Selectors.
Case-Sensitive ID Handling Results in Interesting Performance Impact
With the recommendation from above, I was surprised to see the following $("#addtowishlist") call with a huge execution time difference in Internet Explorer (IE) 7, 8 and Firefox (FF) 6:
The same $("#addtowishlist") call with huge performance differences across browsers doesn't only reveal performance problems
So why is This Call Taking That Long?
It turns out that the ID attribute of the element in question (addtowishlist) is actually defined as addToWishList. As Class and Id are case-sensitive (read this article on the Mozilla Developer Network) the call $("#addtowishlist") should in fact return no element. This leads us to an actual functional problem on this page. The element exists but is not identified because the developer used a different name in the $ method than defined in the HTML. The performance difference is explained by a quirk of older Internet Explorer versions and the way jQuery implements its $ method.
jQuery 1.4.2 is the version used on the page we analyzed. The following
screenshot shows what happens in Internet Explorer 7:
jQuery iterates through all DOM elements in case the element returned by getElementById doesn't match the query string
The screenshot shows the dynaTrace browser PurePath for IE 7. In fact, getElementById returns the DIV tag even though it shouldn't based on the HTML standard specification. jQuery adds an additional check on the returned element. Because the DOM element's ID addToWishList does not case-sensitively match addtowishlist, jQuery calls its internal find method as a fallback. The find method iterates through ALL DOM elements (1944 in this case) and does a string comparison on the ID attribute. In the end, jQuery doesn't return any element because none match the lower-case ID. This additional check through 1944 elements takes more than 50ms in Internet Explorer 7.
Why the Time Difference Between IE 7 and IE 8/FF 6?
IE 8 and FF 6 execute so much faster because getElementById doesn't return an object and jQuery therefore also doesn't perform the additional check.
Lessons Learned: We Have a Functional and a
Performance Problem
There are two conclusions to this analysis:
We have a functional problem because IDs are written with inconsistent casing between the HTML and the JavaScript/CSS, and therefore certain event handlers are not registered correctly.
We have a performance problem because IE 7 incorrectly returns an element, leading to a very expensive jQuery check.
So watch out and check how you write your IDs and ClassNames. Use tools to verify your lookups return the expected objects and make sure you always use a lookup mechanism that performs well across browsers.
Automatic Error Detection in Production
Contact Your Users Before They Contact You
by Andreas Grabner
In my role I am responsible for our Community and our Community Portal.
In order for our Community Portal to be accepted by our users I need to
ensure that our users find the content they are interested in. In a recent
upgrade we added lots of new multi-media content that will make it easier
for our community members to get educated on Best Practices, First Steps,
and much more.
Error in Production: 3rd Party Plugin Prevents Users from
Accessing Content
Here is what happened today when I figured out that some of our users actually had a problem accessing some of the new content. I was able to directly contact these individual users before they reported the issue. We identified the root cause of the problem and are currently working on a permanent fix preventing these problems for other users. Let me walk you through my steps.
Step 1: Verify and Ensure Functional Health
One dashboard I look at to check whether there are any errors on our
Community Portal is the Functional Health dashboard. dynaTrace comes
with several out-of-the-box error detection rules. These are rules that check, for example, whether there are any HTTP 500s, exceptions being thrown between application tiers (e.g. from our authentication web service back to our frontend system), severe log messages or exceptions when accessing the database.
The following screenshot shows the Functional Health dashboard. As we
monitor more than just our Community Portal with dynaTrace I just filter
to this application. I see that we had 14 failed transactions in the last hour.
It seems we also had several unhandled exceptions and several HTTP 400s
between transaction tiers:
dynaTrace automates error detection by analyzing every transaction against
error rules. In my case I had 14 failed transactions in the last hour on our
Community Portal
My first step tells me that we have users who are experiencing a problem.
Step 2: Analyze Errors
A click on the error on the bottom right brings me to the error details,
allowing me to analyze what these errors are. The following screenshot
shows the Error dashboard with an overview of all detected errors based
on the configured error rules. A click on one error rule shows me the actual
errors on the bottom. It seems we have a problem with some of our new
PowerPoint slides we made available on our Community Portal:
The 14 errors are related to the PowerPoint slide integration we recently added to our Community Portal as well as some internal Confluence problems
Now I know what these errors are. The next step is to identify the impacted
users.
Step 3: Identify Impacted Users
A drill into our Business Transactions tells me which users were impacted by
this problem. It turns out that we had 5 internal users (those with the short
usernames) and 2 actual customers having problems.
Knowing which users are impacted by this problem allows me to proactively
contact them before they contact me
What is also interesting for me is to understand what these users were
doing on our Community Portal. dynaTrace gives me the information
about every visit including all page actions with detailed performance and
context information. The following shows the activities of one of the users
that experienced the problem. I can see how they got to the problematic
page and whether they continued browsing for other material or whether
they stopped because of this frustrating experience:
Analyzing the visit shows me where the error happened. Fortunately the user
continued browsing to other material
I now know exactly which users were impacted by the errors. I also know that even though they had a frustrating experience these users are still continuing to browse other content. Just to be safe I contacted them, letting them know we are working on the problem, and also sent them the content they couldn't retrieve through the portal.
Step 4: Identify Root Cause and Fix Problem
My last step is to identify the actual root cause of these errors because I want these errors to be fixed as soon as possible to prevent more users from being impacted. A drill into our PurePaths shows me that the error is caused by a NullPointerException thrown by the Confluence plugin we use to display PowerPoint presentations embedded in a page.
Having a PurePath for every single request (failed or not) available makes it
easy to identify problems. In this case we have a NullPointerException being
thrown all the way to the web server leading to an HTTP 500
dynaTrace also captures the actual exception including the stack trace
giving me just the information I was looking for.
The Exception Details window reveals more information about the actual
problem
Conclusion
Automatic error detection helped me to proactively work on problems and also contact my users before they reported the problem. In this particular case we identified a problem with the viewfile Confluence plugin. In case you use it, make sure you do not have path-based animations in your slides. It seems like this is the root cause of this NullPointerException.
For our dynaTrace users: If you are interested in more details on how to use
dynaTrace, best practices or self-guided Walkthroughs then check out our
updated dynaLearn Section on our Community Portal.
For those that want more information on how to become more proactive
in your application performance management check out What's New in
dynaTrace 4.
Why You Really Do Performance Management
in Production
by Michael Kopp
Often performance management is still confused with performance
troubleshooting. Others think that performance management in production
is simply about system and Java Virtual Machine (JVM) level monitoring and
that they are already doing application performance management (APM).
The first perception assumes that APM is about speeding up some arbitrary method and the second assumes that performance management is just about discovering that something is slow. Neither of these is what we at dynaTrace would consider a prime driver for APM
in production. So what does it mean to have APM in production and why
do you do it?
The reason our customers need APM in their production systems is to
understand the impact that end-to-end performance has on their end users
and therefore their business. They use this information to optimize and
fix their application in a way that has direct and measurable return on
investment (ROI). This might sound easy but in environments that include
literally thousands of JVMs and millions of transactions per hour, nothing is
easy unless you have the right approach!
True APM in production answers these questions and solves problems
such as the following:
How does performance affect the end user's buying behavior or the revenue of my tenants?
How is the performance of my search for a specific category?
Which of my 100 JVMs, 30 C++ business components and 3 databases is participating in my booking transaction and which of them is
responsible for my problem?
Enable Operations, Business and R&D to look at the same production
performance data from their respective vantage points
Enable R&D to analyze production-level data without requiring access
to the production system
Gain End-to-end Visibility
The first thing that you realize when looking at any serious web application (pick any of the big e-commerce sites) is that much of the end user response time gets spent outside their data center. Doing performance management on the server side only leaves you blind to all problems caused by JavaScript, content delivery networks (CDNs), third-party services or, in the case of mobile users, simply bandwidth.
Web delivery chain
As you are not even aware of these, you cannot fix them. Without knowing
the effect that performance has on your users you do not know how
performance affects your business. Without knowing that, how do you
decide if your performance is OK?
This dashboard shows that there is a relationship between performance and
conversion rate
The primary metric on the end user level is the conversion rate. What end-to-end APM tells you is how application performance or non-performance impacts that rate. In other words, you can put a dollar figure on response time and error rate!
Thus the first reason why you do APM in production is to understand the impact that performance and errors have on your users' behavior.
Once you know the impact that some slow request has on your business
you want to zero in on the root cause, which can be anywhere in the web
delivery chain. If your issue is on the browser side, the optimal thing to
have is the exact click path of the users affected.
A visit's click path plus the PageAction PurePath of the first click
You can use this to figure out if the issue is in a specific server-side request, related to third-party requests or in the JavaScript code. Once you have the
click path, plus some additional context information, a developer can easily
use something like dynaTrace AJAX Edition to analyze it.
If the issue is on the server side we need to isolate the root cause there.
Many environments today encompass several hundred JVMs, Common
Language Runtimes (CLRs) and other components. They are big, distributed
and heterogeneous. To isolate a root cause here you need to be able to
extend the click path into the server itself.
From the click path to server side
But before we look at that, we should look at the other main driver of performance management: the business itself.
Create Focus: It's the Business that Matters
One problem with older forms of performance management is the disconnect from the business. It simply has no meaning for the business whether average CPU on 100 servers is at 70% (or whatever else). It does not mean anything to say that JBoss xyz has a response time of 1 second on webpage abc. Is that good or bad? Why should I invest money to improve that? On top of this we don't have one server but thousands, with thousands of different webpages and services all calling each other, so where should we start? How do we even know if we should do something?
The last question is actually crucial and is the second main reason why
we do APM. We combine end user monitoring with business transaction
management (BTM). We want to know the impact that performance has
on our business, and as such we want to know if the business performance of our services is influenced by performance problems of our applications.
While end user monitoring enables you to put a general dollar figure on your end user performance, business transactions go one step further. Let's assume that the user can buy different products based on categories. If I have a performance issue I would want to know how it affects my best-selling categories and would prioritize based on that. The different product
categories trigger different services on the server side. This is important for
performance management in itself as I would otherwise look at too much
data and could not focus on what matters.
The payment transaction has a different path depending on the context
Business transaction management does not just label a specific web request with a name like "booking", but really enables you to do performance
management on a higher level. It is about knowing if and why revenue of
one tenant is affected by the response time of the booking transaction.
In this way business transactions create a twofold focus. They enable the business and management to set the right focus. That focus is always
based on company success, revenue and ROI. At the same time business
transactions enable the developer to exclude 90% of the noise from his
investigation and immediately zero in on the real root cause. This is due to
the additional context that business transactions bring. If only bookings via
credit cards are affected, then diagnostics should focus on only these and
not all booking transactions. This brings me to the actual diagnosing of
performance issues in production.
The Perfect Storm of Complexity
At dynaTrace we regularly see environments with several hundred or even over a thousand web servers, JVMs, CLRs and other components running as part of a single application environment. These environments are not homogeneous. They include native business components, integrations with, for example, Siebel or SAP and of course the mainframe. These systems are here to stay and their impact on the complexity of today's
environments cannot be underestimated. Mastering this complexity is
another reason for APM.
Today's systems serve huge user bases and in some cases need to process
millions of transactions per hour. Ironically most APM solutions and
approaches will simply break down in such an environment, but the value
that the right APM approach brings here is vital. The way to master such
an environment is to look at it from an application and transaction point
of view.
Monitoring of service level agreements
Service Level Agreement (SLA) violations and errors need to be detected automatically and the data to investigate needs to be captured; otherwise we will never have the ability to fix it. The first step is to isolate the offending tier and find out if the problem is due to a host, database, JVM, the mainframe, a third-party service or the application itself.
Isolating the credit card tier as the root cause
Instead of seeing hundreds of servers and millions of data points you can
immediately isolate the one or two components that are responsible for
your issue. Issues happening here cannot be reproduced in a test setup.
This has nothing to do with lack of technical ability; we simply do not have
the time to figure out which circumstances led to a problem. So we need
to ensure that we have all the data we need for later analysis available all
the time. This is another reason why we do APM. It gives us the ability to
diagnose and understand real-world issues.
Once we have identified the offending tier, we know whom to talk to and
that brings me to my last point: collaboration.
Breaking the Language Barrier
Operations is looking at SLA violations and uptime of services, the business
is looking at revenue statistics of products sold and R&D is thinking in terms
of response time, CPU cycles and garbage collection. It is a fact that these
three teams talk completely different languages. APM is about presenting
the same data in those different languages and thus breaking the barrier.
Another thing is that as a developer you never get access to the production
environment, so you have a hard time analyzing the issues. Reproducing
issues in a test setup is often not possible either. Even if you do have access,
most issues cannot be analyzed in real time. In order to effectively share
the performance data with R&D we first need to capture and persist it. It is important to capture all transactions and not just a subset. Some think that you only need to capture slow transactions, but there are several problems with this. Either you need to define what is slow, or if you have baselining you will only get what is slower than before. The first is a lot of work and the second assumes that performance is fine right now. That is not good enough. In addition such an approach ignores the fact that concurrency exists. Concurrently running transactions impact each other in numerous ways and whoever diagnoses an issue at hand will need that additional context.
A typical Operations to Development conversation without APM
Once you have the data you need to share it with R&D, which most of
the time means physically copying a scrubbed version of that data to
the R&D team. While the scrubbed data must exclude things like credit
card numbers, it must not lose its integrity. The developer needs to be
able to look at exactly the same picture as operations. This enables better
communication with Operations while at the same time enabling deep
dive diagnostics.
Now once a fix has been supplied Operations needs to ensure that there are
no negative side effects and will also want to verify that it has the desired
positive effect. Modern APM solves this by automatically understanding the
dynamic dependencies between applications and automatically monitoring
new code for performance degradations.
Thus APM in production improves communication, speeds up deployment
cycles and at the same time adds another layer of quality assurance. This is
the final, but by no means least important, reason we do APM.
Conclusion
The reason we do APM in production is not to fix a CPU hot spot, speed up a specific algorithm or improve garbage collection. Neither the business nor operations care about that. We do APM to understand the impact that the application's performance has on our customers and thus our business. This enables us to effectively invest precious development time where it has the most impact, thus furthering the success of the company. APM truly
serves the business of a company and its customers, by bringing focus to
the performance management discipline.
My recommendation: If you do APM in production, and you should, do it
for the right reasons.
Cassandra Write Performance: a Quick Look Inside
by Michael Kopp
I was looking at Cassandra, one of the major NoSQL solutions, and I was
immediately impressed with its write speed even on my notebook. But I
also noticed that it was very volatile in its response time, so I took a deeper
look at it.
First Cassandra Write Test
I did the first write tests on my local machine, but I had a goal in mind. I wanted to see how fast I could insert 150K data points, each consisting of 3 values. In Cassandra terms this meant I added 150K rows in a single column family, adding three columns each time. Don't be confused with the term column here; it really means a key/value pair. At first I tried to load the 150K in one single mutator call. It worked just fine, but I had huge garbage collection (GC) suspensions. So I switched to sending 10K buckets. That got nice enough performance.
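The article does not show the client code, but assuming the Hector Java client (whose Mutator API maps to Thrift's batch_mutate), the bucketed insert could look roughly like the sketch below; cluster, keyspace, column family and column names are placeholders, not the original test's values.

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

// Rough sketch of the bucketed insert: 150K rows, 3 columns each, sent in 10K-row batches.
public class BucketedInsert {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("test-cluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("MeasurementData", cluster);
        StringSerializer ser = StringSerializer.get();

        final int totalRows = 150000;
        final int bucketSize = 10000;

        Mutator<String> mutator = HFactory.createMutator(keyspace, ser);
        for (int i = 0; i < totalRows; i++) {
            String rowKey = "datapoint-" + i;
            // three columns (key/value pairs) per row, as in the test described above
            mutator.addInsertion(rowKey, "DataPoints", HFactory.createStringColumn("value1", "v1-" + i));
            mutator.addInsertion(rowKey, "DataPoints", HFactory.createStringColumn("value2", "v2-" + i));
            mutator.addInsertion(rowKey, "DataPoints", HFactory.createStringColumn("value3", "v3-" + i));
            if ((i + 1) % bucketSize == 0) {
                mutator.execute(); // one batch_mutate round trip per 10K-row bucket
                mutator = HFactory.createMutator(keyspace, ser);
            }
        }
        mutator.execute(); // flush any remainder
    }
}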
Here is the resulting response time chart:
Cassandra client/server performance and volatility
The upper chart shows client and server response time respectively. This
indicates that we leave a considerable time either on the wire or in the
client. The lower chart compares average and maximum response time on
the Cassandra server, clearly showing a high volatility. So I let dynaTrace
do its magic and looked at the transaction flow to check the difference
between client and server response time.
Getting a Look at the Insides of Cassandra
batch_mutate transactions from client to Cassandra server
This is what I got 5 minutes after I first deployed the dynaTrace agent. It
shows that we do indeed leave a large portion of the time on the wire,
either due to the network or waiting for Cassandra. But the majority is still
on the server. A quick check of the response time hotspots reveals even
more:
This shows that most of the time spent on the server is CPU and I/O
The hotspots show that most of the time on the Cassandra server is spent in
CPU and I/O, as it should be, but a considerable portion is also attributed to
GC suspension. Please note that this is not time spent in garbage collection,
but the time that my transactions were actively suspended by the garbage
collector (read about the difference here)! What is also interesting is that
a not so insignificant portion is spent inside Thrift, the communication protocol of Cassandra, which confirmed the communication as part of the
issue. Another thing that is interesting is that the majority of the transactions
are in the 75ms range (as can be seen in the upper right corner), but a lot
of transactions are slower and some go all the way up to 1.5 seconds.
Hotspots of the slowest 5% of the batch_mutate calls
I looked at the slowest 5% and could see that GC suspension plays a much
bigger role here and that the time we spend waiting on I/O is also greatly
increased. So the next thing I checked was garbage collection, always one
of my favorites.
The charts show that nearly all GC suspensions are due to minor collections
What we see here is a phenomenon that I have blogged about before. The
GC suspensions are mostly due to so-called minor collections. Major collections do happen, but are only responsible for two of the suspensions. If I had only monitored major GCs I would not have seen the impact on my performance. What it means is that Cassandra is allocating a lot of objects and my memory setup couldn't keep up with it - not very surprising with 150K rows every 10 seconds.
Finally I took a look at the single transactions themselves:
Single batch_mutate business transactions, each inserting 10K rows
What we see here is that the PurePath follows the batch_mutate call from
the client to the server. This allows us to see that it spends a lot of time
between the two layers (the two synchronization nodes indicate start and
end of the network call). More importantly we see that we only spend
about 30ms CPU in the client-side batch_mutate function, and according to the elapsed time this all happened during sending. That means that either my network was clogged or the Cassandra server couldn't accept my requests quickly enough. We also see that the majority of the time on the
server is spent waiting on the write. That did not surprise me as my disk is
not the fastest.
A quick check on the network interface showed me that my test (10x150K rows) added up to 300MB of data. A quick calculation (1.5 million rows at 10K rows per batch_mutate call is 150 calls, and 300MB / 150 is roughly 2MB) told me that a single batch_mutate call sent roughly 2MB of data over the wire, so we can safely assume that the latency is due to the network. It also means that we need to monitor the network and Cassandra's usage of it closely.
Checking the Memory
I didn't find a comprehensive GC tuning guide for Cassandra and didn't want to invest a lot of time, so I took a quick peek to get an idea about the main drivers for the obvious high object churn:
The memory trend shows the main drivers
What I saw was pretty conclusive. The Mutation creates a column family
and a column object for each single column value that I insert. More
importantly the column family holds a ConcurrentSkipListMap which keeps
track of the modified columns. That produced nearly as many allocations as
any other primitive, something I have rarely seen. So I immediately found
the reasons for the high object churn.
Conclusion
NoSQL or Big Data solutions are very, very different from your usual RDBMS, but they are still bound by the usual constraints: CPU, I/O and, most importantly, how they are used! Although Cassandra is lightning fast and mostly I/O bound, it's still Java and you have the usual problems, e.g. GC needs to be watched. Cassandra provides a lot of monitoring metrics that I didn't explain here, but seeing the flow end-to-end really helps to understand whether the time is spent on the client, network or server and makes the runtime dynamics of Cassandra much clearer.
Understanding is really the key for effective usage of NoSQL solutions as we
shall see in my next blogs. New problem patterns emerge and they cannot
be solved by simply adding an index here or there. It really requires you to
understand the usage pattern from the application point of view. The good
news is that these new solutions allow us a really deep look into their inner
workings, at least if you have the right tools at hand.
How Proper Redirects and Caching Saved Us
3.5 Seconds in Page Load Time
by Andreas Grabner
We like to blog about real life scenarios to demonstrate practical examples
on how to manage application performance. In this blog I will tell you how
we internally improved page load time for some of our community users by
3.5 seconds by simply following our own web performance guidelines that
we promote through our blog, community articles and our performance
report in dynaTrace AJAX Edition and dynaTrace AJAX Premium Version.
Step 1: Identifying That We Have a Problem
We had users complaining about long page load times when entering http://community.dynatrace.com. Testing it locally showed acceptable page load times (not perfect, but not too bad either). In order to verify the complaints
we looked at the actual end user response time captured by dynaTrace
User Experience Management (UEM). Focusing on our Community home
page we saw that we in fact have a fair amount of users experiencing very
high page load times. The following screenshot shows the response time
histogram of our Community home page.
Response times ranging from 1 to 22 seconds for our Community home page
I was also interested in seeing whether there are any particular browsers
experiencing this problem. The next screenshot therefore shows the
response time grouped by browser:
Mobile browsers have a significant performance problem when loading our home page. My first thought is latency problems
Now we know that we really have a performance problem for a set of
users. I assume that the really long load times are somehow related to
network latency as the main user group uses Mobile Safari. Let's continue and analyze page load behavior.
Step 2: Analyzing Page Load Behavior
Analyzing page load behavior of http://community.dynatrace.com and
looking at the network roundtrips shows one of our problems immediately.
The following screenshot highlights that we have a chain of 6 redirects from
entering http://community.dynatrace.com until the user ends up on the
actual home page http://community.dynatrace.com/community/display/PUB/Community+Home. Depending on network latency this can take
several seconds and would explain why mobile users experience very long
page load times.
The following screenshot shows that even on my machine, which is very close to our web servers, it takes 1.2 seconds until the browser can actually
download the initial HTML document:
6 Redirects take 1.2 seconds to resolve. On a mobile network this can be much
higher depending on latency
We have often seen this particular problem (lots of redirects) with sites we
have analyzed in the past. Now we actually ran into the same problem
on our own Community portal. There are too many unnecessary redirects
leaving the browser blank and the user waiting for a very long time. The
first step therefore is to eliminate these redirects and automatically redirect the user to the home page URL.
A second problem that became very obvious when looking at the dynaTrace
browser performance report is browser caching. We have a lot of static
content on our Community pages. Caching ratio should therefore be as
high as possible. The Content tab on our report however shows us that
1.3MB is currently not cached. This is content that needs to be downloaded
every time the user browses to our Community page even though this
content hardly ever changes:
dynaTrace tells me how well my web page utilizes client-side caching. 1.3MB is
currently not cached on a page that mainly contains static content
The report not only shows us the percentage or size of content that is
cached vs. not cached. It also shows the actual resources. It seems that our system is configured to force the browser not to cache certain images at all
by setting an expiration date in the past:
Lots of static images have an expiration header set to January 1st 1970, forcing the browser not to cache these resources
Step 3: Fix and Verify Improvements
Fixing the redirects and caching was easy. Instead of 6 redirects we only have 1 (as we have to switch from http to https we can't just do a server-side URL rewrite). Optimized caching will have a positive impact on returning users as they have to download fewer resources and with that also save on roundtrips.
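As an illustration of the caching part of the fix (this is not our actual configuration change; the file extensions and the one-year lifetime are just examples), a servlet filter can replace the 1970 expiration date with far-future caching headers for static resources:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch: add far-future caching headers for static resources instead of a 1970 Expires date.
public class StaticCachingFilter implements Filter {
    private static final long ONE_YEAR_MS = 365L * 24 * 60 * 60 * 1000;

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;
        String uri = request.getRequestURI();
        // Example extensions only; match whatever your static content actually is.
        if (uri.endsWith(".css") || uri.endsWith(".js")
                || uri.endsWith(".png") || uri.endsWith(".gif") || uri.endsWith(".jpg")) {
            response.setHeader("Cache-Control", "public, max-age=31536000");
            response.setDateHeader("Expires", System.currentTimeMillis() + ONE_YEAR_MS);
        }
        chain.doFilter(req, res);
    }

    public void init(FilterConfig config) {
    }

    public void destroy() {
    }
}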
The following image shows the comparison of the page load time before
and after the improvements:
Improved page load time by 3.5 seconds due to better caching and optimized
redirect chains
The improvement of 3.5 seconds is significant, but we still have many other opportunities to make our Community portal faster. One of the next areas we want to focus on is server-side caching, as we see a lot of time spent on our application servers for our content pages that are created on the fly by evaluating Wiki-style markup code. More on that in a later blog post.
Conclusion and More
You have to analyze performance from the end-user perspective. Just
analyzing page load time in your local network is not enough as it doesn't reflect the experience your actual end users perceive. Following the
common web performance best practices improves performance and
should not only be done when your users start complaining but should be
something you constantly check during development.
There is more for you to read:
Check out our Best Practices on WPO and our other blogs on web
performance
For more information on automating web performance optimization
check out dynaTrace AJAX Premium Version
For information on User Experience Management check out dynaTrace
UEM
To Load Test or Not to Load Test: That is Not
the Question
by Andreas Grabner
There is no doubt that performance is important for your business. If you don't agree you should check out what we and others think about performance's impact on business, or remember headlines like these:
Target.com was down after promoting new labels: article on MSN
Twitter was down and people were complaining about it on Facebook: Huffington Post article
People stranded in airports because United had a software issue: NY
Times article
The question therefore is not whether performance is important or not. The question is how to verify and ensure your application performance is good enough.
Use Your End-Users as Test Dummies?
In times of tight project schedules and very frequent releases some
companies tend to release new software versions without going through a
proper test cycle. Only a few companies can actually afford this because they
have their users' loyalty regardless of functional or performance regressions
(again only a few companies have that luxury). If the rest of us were to
release projects without proper load testing we would end up as another
headline on the news.
Releasing a new version without proper load testing is therefore not the
correct answer.
Don't Let Them Tell You that Load Testing is Hard
When asking people why they are not performing any load tests you usually
hear things along the following lines:
We don't know how to test realistic user load as we don't know the use cases or the expected load in production
We don't have the tools, expertise or hardware resources to run large-scale load tests
It is too much effort to create and especially maintain testing scripts
Commercial tools are expensive and sit too long on the shelf between test cycles
We don't get actionable results for our developers
If you are the business owner or member of a performance team you
should not accept answers like this. Let me share my opinion in order for
you to counter some of these arguments in your quest to achieve better application performance.
Answer to: What Are Realistic User Load and Use Cases?
Indeed it is not easy to know what realistic user load and use cases are if
you are about to launch a new website or service. In this case you need to
make sure to do enough research on how your new service will be used
once launched. Factor in how much money you spend on promotions and
what conversion rate you expect. This will allow you to estimate peak loads.
Learn from Your Real Users
It's going to be easier when you launch an update to an existing site. I am sure you use something like Google Analytics, Omniture, or dynaTrace UEM to monitor your end users. If so, you have a good understanding of current transaction volume. Factor in the new features and how many new users you want to attract. Also factor in any promotions you are about to do. Talk with your Marketing folks: they are going to spend a lot of money and you don't want your system to go down and all that money to be wasted.
Also analyze your Web server logs as they can give you even more valuable
information regarding request volume. Combining all this data allows you
to answer the following questions:
What are my main landing pages I need to test? What's the peak load and what is the current and expected page load time?
What are the typical click paths through the application? Do we have
common click scenarios that we can model into a user type?
Where are my users located on the world map, and what browsers do
they use? What are the main browser/location combinations we need
to test?
The following screenshots give you some examples of how we can extract
data from services such as Google Analytics or dynaTrace UEM to better
understand how to create realistic tests:
What are the top landing pages, the load behavior and page load performance?
Testing these pages is essential as it impacts whether a user stays or leaves the web site
Browser and bandwidth information allow us to do more realistic tests as these
factors impact page load time signicantly
Analyzing click sequences of real users allows us to model load test scripts that
reect real user behavior
CDN, Proxies, Latency: There is More than Meets the Eye
What we also learn from our real users is that not every request makes it
to our application environment. Between the end user and the application
we have different components that participate and impact load times:
connection speed, browser characteristics, latency, content delivery
network (CDN) and geolocation. A user in the United States on broadband
will experience a different page load time than a user on a mobile device
in Europe, though both are accessing an application hosted in the US.
To execute tests that take this into consideration you would actually need
to execute your load from different locations in the world using different
connection speeds and different devices. Some cloud-based testing services
offer this type of testing by executing load from different data centers or
even real browsers located around the globe. One example is Gomez Last
Mile Testing.
Answer to: We Don't Have the Tools or the Expertise
This is a fair point. Load testing is usually not done on a day-to-day basis, so it is hard to justify the costs for commercial tools, for hardware resources to simulate the load, or for people that need constant training on tools they hardly use.
All these challenges are addressed by a new type of load testing: load
testing done from the cloud, offered as a service. The benefits of cloud-based load testing are:
Cost control: you only pay for the actual load tests, not for the time the software sits on the shelf
Script generation and maintenance is included in the service and is
done by people that do this all the time
You do not need any hardware resources to generate the load as it is
generated by the service provider
Answer to: It's Too Much Effort to Create and Maintain
Scripts
Another very valid claim but typically caused by two facts:
A. Free vs. commercial tools: too often free load testing tools are used that
offer easy record/replay but do not offer a good scripting language
that makes it easy to customize or maintain scripts. Commercial tools
put a lot of effort into solving exactly this problem. They are more
expensive but make it easier, saving time.
B. Tools vs. service: load testing services from the cloud usually include
script generation and script maintenance done by professionals. This
removes the burden from your R&D organization.
Answer to: Commercial Tools are Too Expensive
A valid argument if you don't use your load testing tool enough, as then the cost per virtual user hour goes up. An alternative, as you can probably guess by now, is cloud-based load testing services that only charge for the actual virtual users and time executed. Here we often talk about the cost of a Virtual User Hour. If you know how often you need to run load tests and how much load you need to execute over which period of time, it will be very easy to calculate the actual cost.
Answer to: No Actionable Data after Load Test
Just running a load test and presenting the standard load testing report
to your developers will probably do no good. It's good to know under how much load your application breaks, but a developer needs more information than "We can't handle more than 100 virtual users". With only
this information the developers need to go back to their code, add log
output for later diagnostics into the code and ask the testers to run the test
again, as they need more actionable data. This usually leads to multiple
testing cycles, jeopardizes project schedules and also leads to frustrated
developers and testers.
Too many test iterations consume valuable resources and impact your
project schedules
To solve this problem, load testing should always be combined with an application performance management (APM) solution that provides rich, actionable in-depth data for developers to identify and fix problems without needing to go through extra cycles, in order to stay within your project schedules.
Capturing enough in-depth data eliminates extra test cycles, saves time
and money
The following screenshots show some examples of what data can be captured to make it very easy for developers to go right to fixing the problems. The first one shows a load testing dashboard including load characteristics, memory consumption, database activity and a performance breakdown into application layers/components:
The dashboard tells us right away whether we have hotspots in memory, database, exceptions or in one of our application layers
In distributed applications it is important to understand which tiers are
contributing to response time and where potential performance and
functional hotspots are:
Analyzing transaction flow makes it easy to pinpoint problematic hosts or services
Methods executed contributed to errors and bad response time
To speed up response time hotspot analysis we can first look at the top contributors before analyzing individual transactions that have a problem.
In-depth transactional information makes it easy to identify code-level problems
As every single transaction is captured it is possible to analyze transaction executions
including HTTP parameters, session attributes, method arguments,
exceptions, log messages or SQL statements making it easy to pinpoint
problems.
Are We on the Same Page that Load Testing is Important?
By now you should have enough arguments to push load testing in your
development organization to ensure that there won't be any business impact on new releases. I've talked about cloud-based load testing services multiple times as they come with all the benefits I explained. I also know that it is not the answer for every environment as it requires your application to be accessible from the web. Opening or tunneling ports through firewalls
or running load tests on the actual production environment during off-
hours are options you have to enable your application for cloud-based load
testing.
One Answer to these Questions: Compuware Gomez 360
Web Load Testing and dynaTrace
The new combined Gomez and dynaTrace web load testing solution provides an answer to all the questions above and even more. Without going into too much detail I want to list some of the benefits:
Realistic load generation using Gomez First Mile to Last Mile web
testing
In-depth root-cause analysis with dynaTrace Test Center Edition
Load testing as a service that reduces in-house resource requirements
Keep your costs under control with per Virtual User Hour billing
Works throughout the application lifecycle from production, to test,
to development
Running a Gomez load test allows you to execute load from both backbone testing nodes and real user browsers located around the world. The last mile in particular is an interesting option as this is the closest you can
get to your real end users. The following screenshot shows the response
time overview during a load test from the different regions in the world
allowing you to see how performance of your application is perceived in
the locations of your real end users:
The world map gives a great overview how pages perform from the different
test nodes
From here it is an easy drill down in dynaTrace to analyze how increasing load
affects performance as well as functional health of the tested application:
There is a seamless drill option from the Gomez load testing results to the
dynaTrace dashboards
Find out more about Gomez 360 Load Testing for web, mobile and cloud
applications.
Why SLAs on Request Errors Do Not Work
and What You Should Do Instead
by Klaus Enzenhofer
We often see request error rates as an indicator for service level agreement (SLA) compliance. Reality however shows that this draws a wrong picture. Let's start with an example.
We had a meeting with a customer and were talking about their SLA and what it is based on. Like in many other cases the request error rate was used, and the actual SLA they agreed on was 0.5%. From the operations team we got the input that at the moment they have a request error rate of 0.1%. So they are far below the agreed value. The assumption from the current rate is that every 1,000th customer has a problem while using the website. This really sounds good, but is this assumption true, or do more customers have problems?
Most people assume that a page load equals a single request; however, if you start thinking about it you quickly realize that this is of course not the case. A typical page consists of multiple resource requests. So from now on we focus on all resource requests.
Let's take a look at a typical eCommerce example. A customer searches for a
certain product and wants to buy it in our store. Typically he will have to go
through multiple pages. Each click will lead to a page load which executes
multiple resource requests or executes one or more Ajax requests. In our
example the visitor has to go through at least seven steps/pages starting
at the product detail page and ending up on the confirmation page.
Browser performance report from dynaTrace AJAX Edition Premium Version
showing the resource requests per page of the buying process
The report shows the total request count per page. The shortest possible click path for a successful buy leads to 317 resource requests. To achieve a good user experience we need to deliver the resources fast and without any errors. However, if we do the math for the reported error rate:
Customers with Errors = 317 requests * 0.1% = 31.7%
(Strictly speaking, 317 x 0.1% is the expected number of failed requests per visit; the probability that a visit hits at least one failed request is 1 - (1 - 0.001)^317, roughly 27%, which is hardly any better.)
That means that roughly every third user will have at least one failing request and it doesn't even violate our SLA!
The problem is that our error rate is independent of the number of requests per visiting customer. Therefore the SLA does not reflect any real-world impact. Instead of a request failure rate we need to think about failed visits. The rate of failed visits has a direct impact on the conversion rate and thus the business; as such it is a much better key performance indicator (KPI). If you ask your Operations team for this number, most will not be able to give you an exact answer. This is not a surprise, as it is not easy to correlate independent web requests together into a visit.
Another thing that needs to be taken into account is the importance of a
single resource request for the user experience. A user will be frustrated
if the photo of the product he wants to buy is missing or - even worse - if
the page does not load at all. He might not care if the background image
does not load and might even be happy if the ads do not pop up. This
means we can define which missing resources are just errors and which
constitute failed visits. Depending on the URI pattern we can distinguish
between different resources and we can define a different severity for each
rule. In our case we defined separate rules for CSS files, images used by
CSS, product images, JavaScript resources and so forth.
Error rules for different resources within dynaTrace
This allows us to count errors and severe failures separately on a per page
action or visit basis. In our case a page action is either a page load (including
all resource requests) or a user interaction (including all resource and Ajax
requests). A failed page action means that the content displayed in the
browser is incomplete or even unusable and that the user will not have a good
experience.
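The rules themselves are configured in dynaTrace; conceptually, however, they boil down to mapping a failed resource URI to a severity. The following Java sketch illustrates the idea with hypothetical URI patterns and severities - it is not how dynaTrace implements it:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class ResourceErrorClassifier {

    enum Severity { SEVERE_FAILURE, ERROR, IGNORE }

    // Hypothetical rules: the first matching pattern wins
    private static final Map<Pattern, Severity> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Pattern.compile(".*(doubleclick|adservice).*"), Severity.IGNORE);          // ads
        RULES.put(Pattern.compile(".*/products/.*\\.(jpg|png)$"), Severity.SEVERE_FAILURE);  // product images
        RULES.put(Pattern.compile(".*\\.css$"), Severity.SEVERE_FAILURE);                    // stylesheets
        RULES.put(Pattern.compile(".*\\.js$"), Severity.ERROR);                              // JavaScript
    }

    static Severity classify(String uri) {
        for (Map.Entry<Pattern, Severity> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(uri).matches()) {
                return rule.getValue();
            }
        }
        return Severity.ERROR; // default: count it, but do not fail the whole page action
    }

    public static void main(String[] args) {
        System.out.println(classify("http://shop.example.com/products/shoe-123.jpg")); // SEVERE_FAILURE
        System.out.println(classify("http://ads.doubleclick.net/banner.js"));          // IGNORE
    }
}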
Therefore instead of looking at failed requests it is much better to look at
failed page actions.
The red portion of the bars represents failed page actions
When talking about user experience we are however not only interested in
single pages but in whole visits. We can tag visits that have errors as non-
satisfied and visits that abandoned the site after an error as frustrated.
Visits by user experience
Such a failed visit rate paints a much more accurate picture of reality, of the
impact on the business and, in the end, of whether we need to investigate
further or not.
Conclusion
An SLA on the request failure rate is not enough. One might even say it is
worthless if you really want to find out how good or bad the user experience
is for your customers. It is more important to know the failure rate per visit,
and you should think about defining an SLA on this value. In addition we need
to define which failed requests constitute a failed visit and are of high priority.
This allows us to fix those problems with real impact and improve the user
experience quickly.
Week 35
NoSQL or RDBMS? Are We Asking the Right
Questions?
by Michael Kopp
Most articles on the topic of NoSQL revolve around the theme of relational
database management systems (RDBMS) vs. NoSQL. Database
administrators (DBAs) defend the RDBMS by stating that NoSQL
solutions are all dumb, immature data stores without any standards. Many
NoSQL proponents react with the argument that an RDBMS does not scale
and that today everybody needs to deal with huge amounts of data.
I think NoSQL is sold short here. Yes, Big Data plays a large role, but it is not
the primary driver in all NoSQL solutions. There are no standards because
there really is no single NoSQL solution, but rather different types of solutions
that cater for different use cases. In fact nearly all of them state that theirs is
not a replacement for a traditional RDBMS! When we compare an RDBMS
against them we need to do so on a use case basis. There are very good reasons
for choosing an RDBMS as long as the amount of data is not prohibitive.
There are however equally good reasons not to do so and to choose one of
the following solution types:
Distributed key-value stores
Distributed column family stores
(Distributed) document databases
Graph databases
It has to be said however that there are very simple and specific reasons
why traditional RDBMS solutions cannot scale beyond a handful of
database nodes, and even that is painful. Before we look at why
NoSQL solutions tend not to have that problem, we will take a look at why
and when you should choose an RDBMS and when you shouldn't.
When and Why You (Should) Choose an RDBMS
While data durability is an important aspect of an RDBMS it is not a
differentiator compared to other solutions. So I will concentrate first
and foremost on unique features of an RDBMS that also have an impact on
application design and performance.
Table based
Relations between distinct table entities and rows (the R in RDBMS)
Referential integrity
ACID transactions
Arbitrary queries and joins
If you really need all or most of these features then an RDBMS is certainly
right for you, although the amount of data you have might force you in another
direction. But do you really need them? Let's look closer. The table-based
nature of an RDBMS is not a real feature, it is just the way it stores data. While
I can think of use cases that specifically benefit from this, most of them
are simple in nature (think of Excel spreadsheets). That nature however
requires a relational concept between rows and tables in order to make up
complex entities.
Data model showing two different kinds of relationships
There are genuine relations between otherwise stand-alone entities (like
one person being married to another) and relationships that really define
hierarchical context or ownership of some sort (a room is always part of a
house). The first one is a real feature; the second is a result of the storage
nature. It can be argued that a document (e.g. an XML document) stores
such a relation more naturally because the house document contains the
room instead of having the room as a separate document.
Referential integrity is really one of the cornerstones of an RDBMS: it
ensures logical consistency of my domain model. Not only does it ensure
consistency within a certain logical entity (which might span multiple rows/
tables) but, more importantly, cross-entity consistency. If you access the
same data via different applications and need to enforce integrity at a
central location this is the way to go. We could check this in the application
as well, but the database often acts as the final authority on consistency.
The final aspect of consistency comes in the form of ACID transactions. They
ensure that either all my changes are consistently seen by others in their
entirety, or that none of my changes are committed at all. Consistency
really is the hallmark of an RDBMS. However, we often set commit points
for reasons other than consistency. How often did I use a bulk update for the
simple reason of increased performance? In many cases I did not care about
the visibility of those changes, but just wanted to have them done fast. In
other cases we would deliberately commit more often in order to decrease
locking and increase concurrency. The question is: do I care whether Peter
shows up as married while Alice is still seen as unmarried? The government
for sure does; Facebook on the other hand does not!
SELECT count(e.isbn) AS number_of_books, p.name AS publisher
FROM editions AS e
INNER JOIN publishers AS p ON (e.publisher_id = p.id)
GROUP BY p.name;
The final defining feature of an RDBMS is its ability to execute arbitrary
queries: SQL selects like the join-and-group query shown above. Very often
NoSQL is understood as not being able to execute queries. While this is not
true, it is true that RDBMS solutions do offer a far superior query language.
Especially the ability to group and join data from unrelated entities into a
new view on the data is something that makes an RDBMS a powerful tool.
If your business is defined by the underlying structured data and you need
the ability to ask different questions all the time then this is a key reason to
use an RDBMS.
However, if you know in advance how you will access the data, or are willing
to change your application when you want to access it differently, then a lot
of that advantage is overkill.
Why an RDBMS Might Not be Right for You
These features come at the price of complexity in terms of data model,
storage, data retrieval, and administration; and, as we will see shortly, a
built-in limit on horizontal scalability. If you do not need any, or only need a
few, of these features you should not use an RDBMS.
If you just want to store your application entities in a persistent and
consistent way then an RDBMS is overkill. A key-value store might be
perfect for you. Note that the value can be a complex entity in itself!
If you have hierarchical application objects and need some query
capability into them then any of the NoSQL solutions might be a
fit. With an RDBMS you can use object-relational mapping (ORM)
to achieve the same, but at the cost of adding complexity to hide
complexity.
If you ever tried to store large trees or networks you will know that an
RDBMS is not the best solution here. Depending on your other needs
a graph database might suit you.
You are running in the cloud and need to run a distributed database
for durability and availability. This is what Dynamo- and Bigtable-based
data stores were built for. RDBMS solutions on the other hand do not do well here.
You might already use a data warehouse for your analytics. This is not
too dissimilar from a column family database. If your data grows too
large to be processed on a single machine, you might look into Hadoop
or any other solution that supports distributed map/reduce.
There are many scenarios where a fully ACID-driven, relational, table-based
database is simply not the best or simplest option to go with. Now
that we have got that out of the way, let's look at the big one: amount of
data and scalability.
Why an RDBMS Does Not Scale and Many NoSQL
Solutions Do
The real problem with an RDBMS is the horizontal distribution of load and
data. The fact is that RDBMS solutions cannot easily achieve automatic
data sharding. Data sharding would require distinct data entities that can
be distributed and processed independently. An ACID-based relational
database cannot do that due to its table-based nature. This is where NoSQL
solutions differ greatly. They do not distribute a logical entity across multiple
tables; it is always stored in one place. A logical entity can be anything from
a simple value to a complex object or even a full JSON document. They do
not enforce referential integrity between these logical entities. They only
enforce consistency inside a single entity, and sometimes not even that.
NoSQL differs from RDBMS in the way entities get distributed and in that no
consistency is enforced across those entities
This is what allows them to automatically distribute data across a large
number of database nodes and also to write entities independently. If I were
to write 20 entities to a database cluster with 3 nodes, chances are I can
evenly spread the writes across all of them. The database does not need to
synchronize between the nodes for that to happen and there is no need for
a two-phase commit, with the visible effect that Client 1 might see changes
on Node 1 before Client 2 has written all 20 entities. A distributed RDBMS
solution on the other hand needs to enforce ACID consistency across all
three nodes. That means that Client 1 will either not see any changes until
all three nodes have acknowledged a two-phase commit, or will be blocked until
that has happened. In addition to that synchronization, the RDBMS also
needs to read data from other nodes in order to ensure referential integrity,
all of which happens during the transaction and blocks Client 2. NoSQL
solutions do no such thing for the most part.
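A very reduced Java sketch of this idea - plain hash-based partitioning, not the consistent hashing and replication a real store such as Cassandra or Riak uses - shows why independent writes are possible: each entity key maps to exactly one node, so no cross-node locking or two-phase commit is needed:

import java.util.List;

public class NaivePartitioner {

    private final List<String> nodes;

    NaivePartitioner(List<String> nodes) {
        this.nodes = nodes;
    }

    // Each entity is routed by its key alone; no other node has to be consulted.
    String nodeFor(String entityKey) {
        int bucket = Math.floorMod(entityKey.hashCode(), nodes.size());
        return nodes.get(bucket);
    }

    public static void main(String[] args) {
        NaivePartitioner partitioner =
                new NaivePartitioner(List.of("node-1", "node-2", "node-3"));

        // 20 entities spread across the 3-node cluster from the example above
        for (int i = 1; i <= 20; i++) {
            String key = "entity-" + i;
            System.out.println(key + " -> " + partitioner.nodeFor(key));
        }
    }
}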
The fact that such a solution can scale horizontally also means that it can
leverage its distributed nature for high availability. This is very important in
the cloud, where every single node might fail at any moment.
Another key factor is that these solutions do not allow joins and groups across
entities, as that would not be possible in a scalable way if your data runs into
the millions of entities and is distributed across 10 nodes or more. I think this is
something that a lot of us have trouble with. We have to start thinking
about how to access data and store it accordingly, and not the other way
around.
So it is true that NoSQL solutions lack some of the features that define an
RDBMS solution. They do so for the sake of scalability; that does however
not mean that they are dumb data stores. Document, column family and
graph databases are far from unstructured and simple.
What about Application Performance?
The fact that all these solutions scale in principle does however not mean
that they do so in practice or that your application will perform better
because of it! Indeed the overall performance depends to a very large
degree on choosing the right implementation for your use case. Key-value
stores are very simple, but you can still use them wrong. Column family
stores are very interesting and also very different from a table-based design.
Due to this it is easy to have a bad data model design and this will kill your
performance.
Besides the obvious factors of disk I/O, network and caching (which you
must of course take into consideration), both application performance
and scalability depend heavily on the data itself; more specifically on its
distribution across the database cluster. This is something that you need
to monitor in live systems and take into consideration during the design
phase as well. I will talk more about this and specific implementations in
the coming months.
There is one other factor that will play a key role in the choice between
NoSQL and more traditional databases. Companies are used to RDBMS
solutions and they have experts and DBAs for them. NoSQL is new and not well
understood yet. The administration is different. Performance tuning
and analysis is different, as are the problem patterns that we see. More
importantly, performance and setup are more than ever governed by the
applications that use these databases and not by index tuning.
Application performance management as a discipline is well equipped to
deal with this. In fact by looking at the end-to-end application performance
it can handle the different NoSQL solutions just like any other database.
Actually, as we have seen in my last article we can often do better!
Week 36
Is Synthetic Monitoring Really Going to Die?
by Alois Reitbauer
More and more people are talking about the end of synthetic monitoring.
It is associated with high costs and a lack of insight into real user
performance. This is supported by the currently evolving standards of
the W3C Web Performance Working Group which will help to get more accurate
data from end users directly in the browser with deeper insight. Will user
experience management (UEM) using JavaScript agents eventually replace
synthetic monitoring or will there be a coexistence of both approaches in
the end?
I think it is a good idea to compare these two approaches in a number
of categories which I see as important from a performance management
perspective. Having worked intensively with both approaches I will present
my personal experience. Some judgments might be subjective but this is
what comments are for, after all!
Real User Perspective
One of the most important requirements of real user monitoring, if not the
most important one, is to measure performance exactly as real users
experience it. This means how close the monitoring results are to what real
application users see.
Synthetic monitoring collects measurements using pre-defined scripts executed
from a number of locations. How close this is to what users see depends on
the actual measurement approach. Only solutions that use real browsers, and
don't just emulate them, provide reliable results. Some approaches only monitor
from high speed backbones like Amazon EC2 and only emulate different
connection speeds, making measurements only an approximation of real
user performance. Solutions like Gomez Last Mile in contrast measure from
real user machines spread out across the world, resulting in more precise
results.
Agent-based approaches like dynaTrace UEM measure directly in the
user's browser, taking actual connection speed and browser behavior into
account. Therefore they provide the most accurate metrics on actual user
performance.
Transactional Coverage
Transactional coverage denes how many types of business transactions
or application functionality are covered. The goal of monitoring is to
cover 100 percent of all transactions. The minimum requirement is to cover
at least all business-critical transactions.
For synthetic monitoring this directly relates to the number of transactions
which are modeled by scripts: the more scripts, the greater the coverage.
This comes at the cost of additional development and maintenance effort.
Agent-based approaches measure using JavaScript code which gets injected
into every page automatically. This results in 100 percent transactional
coverage. The only content that is not covered by this approach is streaming
content as agent-based monitoring relies on JavaScript being executed.
SLA Monitoring
SLA monitoring is a central to ensure service quality at the technical and
business level. For SLA management to be effective, not only internal but
also third party services like ads have to be monitored.
While agent-based approaches provide rich information on end-user
performance, they are not well suited for SLA management. Agent-based
measurements depend on the user's network speed, local machine, etc.
This means a very volatile environment. SLA management however requires
a well-defined and stable environment. Another issue with agent-based
approaches is that third parties like content delivery networks (CDNs) or
external content providers are very hard to monitor.
Synthetic monitoring uses pre-defined scripts and provides a stable
and predictable environment. The use of real browsers and the resulting
deeper diagnostics capabilities enable more fine-grained diagnostics and
monitoring, especially for third party content. Synthetic monitoring can
also check SLAs for services which are currently not used by actual users.
Availability Monitoring
Availability monitoring is an aspect of SLA monitoring. We look at it
separately as availability monitoring comes with some specific technical
prerequisites which are very different between agent-based and synthetic
monitoring approaches.
For availability monitoring only synthetic script-based approaches can be
used. They do not rely on JavaScript code being injected into the page
but measure from points of presence instead. This enables them to
measure a site even though it may be down, which is essential for availability
monitoring.
Agent-based approaches will not collect any monitoring data if a site is actually
down. The only exception is an agent-based solution which also runs inside the
web server or a proxy, like dynaTrace UEM. Availability problems resulting
from application server problems can then be detected based on HTTP
response codes.
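Conceptually, a synthetic availability check is nothing more than a scheduled request from a point of presence that records the HTTP status code and the response time. The Java sketch below illustrates the principle; the URL is a placeholder and this is of course not how Gomez or dynaTrace implement their checks:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class AvailabilityProbe {

    public static void main(String[] args) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.example.com/")) // placeholder URL
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();

        long start = System.nanoTime();
        try {
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            long millis = (System.nanoTime() - start) / 1_000_000;
            boolean available = response.statusCode() < 500;
            System.out.println("status=" + response.statusCode()
                    + " responseTimeMs=" + millis + " available=" + available);
        } catch (Exception e) {
            // No response at all also counts as unavailable
            System.out.println("site unreachable: " + e.getMessage());
        }
    }
}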
Understanding User-specic Problems
In some cases, especially in a SaaS environment, the actual application
functionality heavily depends on user-specific data. In case of functional or
performance problems, information on a specific user request is required
to diagnose the problem.
Synthetic monitoring is limited to the transactions covered by scripts. In
most cases they are based on test users rather than real user accounts (you
would not want a monitoring system to operate a real banking account).
For an eCommerce site where a lot of functionality does not depend on
an actual user, synthetic monitoring provides reasonable insight here. For
many SaaS applications this however is not the case.
Agent-based approaches are able to monitor every single user click,
resulting in a better ability to diagnose user-specific problems. They also
collect metrics for actual user requests instead of synthetic duplicates. This
makes them the preferred solution for web sites where functionality heavily
depends on the actual user.
Third Party Diagnostics
Monitoring of third party content poses a special challenge. As the resources
are not served from our own infrastructure we only have limited monitoring
capabilities.
Synthetic monitoring using real browsers provides the best insight here.
All the diagnostics capabilities available within browsers can be used to
monitor third party content. In fact the possibilities for third party and your
own content are the same. Besides the actual content, networking or DNS
problems can also be diagnosed.
Agent-based approaches have to rely on the capabilities accessible via
JavaScript in the browser. While new W3C standards of the Web Performance
Working Group will make this easier in the future it is hard to do in older
browsers. It requires a lot of tricks to get the information about whether
third party content loads and performs well.
Proactive Problem Detection
Proactive problem detection aims to find problems before users do. This
not only gives you the ability to react faster but also helps to minimize
business impact.
Synthetic monitoring tests functionality continuously in production. This
ensures that problems are detected and reported immediately, irrespective
of whether someone is using the site or not.
Agent-based approaches only collect data when a user actually accesses
your site. If, for example, you are experiencing a problem with a CDN from
a certain location in the middle of the night when nobody uses your site,
you will not see the problem before the first user accesses your site in the
morning.
Maintenance Effort
Cost of ownership is always an important aspect of software operation. So
the effort needed to adjust monitoring to changes in the application must
be taken into consideration as well.
As synthetic monitoring is script-based it is likely that changes to the
application will require changes to scripts. Depending on the scripting
language and the script design, the effort will vary. In any case there is
continuous manual effort required to keep scripts up-to-date.
Agent-based monitoring on the other hand does not require any changes
when the application changes. Automatic instrumentation of event
handlers and so on ensures zero effort for new functionality. At the same
time modern solutions automatically inject the HTML fragments required
to collect performance data into the HTML content at runtime.
Suitability for Application Support
Besides operations and business monitoring, support is the third main user
of end user data. In case a customer complains that a web application is
not working properly, information on what this user was doing and why it
is not working is required.
Synthetic monitoring can help here in case of general functional or
performance issues like a slow network from a certain location or broken
functionality. It is however not possible to get information on what a user
was doing exactly and to follow that user's click path.
Agent-based solutions provide much better insight. As they collect data for
real user interactions they have all the information required for understanding
potential issues users are experiencing. Even problems experienced by a
single user can be discovered.
Conclusion
Putting all these points together we can see that both synthetic monitoring
and agent-based approaches have their strengths and weaknesses. One
cannot simply choose one over the other. This is validated by the fact that
many companies use a combination of both approaches. This is also true
for APM vendors who provide products in both spaces. The advantage of
using both approaches is that modern agent-based approaches perfectly
compensate for the weaknesses of synthetic monitoring, leading to an ideal
solution.
Week 37
Business Transaction Management Explained
by Andreas Grabner
The terms business transaction and business transaction management
(BTM) are widely used in the industry, but it is not always well understood
what we really mean by them. The BTM Industry Portal provides some
good articles on this topic and is definitely worth checking out. The
general goal is to answer business-centric questions that business owners
ask application owners: How much revenue is generated by a certain
product? What are my conversion and bounce rates and what
impacts them? Or: do we meet our SLAs for our premium account
users?
Challenge 1: Contextual Information is More Than Just
the URL
In order to answer these questions we need information captured from the
underlying technical transactions that get executed by your applications
when users interact with your services/web site. Knowing the URL accessed,
its average response time and then mapping it to a business transaction is
the simplest form of business transaction management but doesn't work
in most cases because modern applications don't pass the whole business
transaction context in the URL.
Business context information such as the username, product details or
cash value usually comes from method arguments, the user session on the
application server or from service calls that are made along the processed
transaction.
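To make this concrete, the following Java sketch shows the typical places this business context lives in, using an invented checkout servlet. The class, session attributes and service call are hypothetical, and in practice an APM agent captures these values through instrumentation rather than hand-written code like this:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

// Hypothetical checkout servlet: none of the business context is visible in the URL
public class CheckoutServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        HttpSession session = request.getSession();

        // Business context from the user session on the application server
        String userName = (String) session.getAttribute("userName");
        String accountType = (String) session.getAttribute("accountType");

        // Business context from the HTTP POST body
        String productId = request.getParameter("productId");
        double cashValue = Double.parseDouble(request.getParameter("amount"));

        // Business context from a method argument of a backend service call
        new PaymentService().charge(userName, productId, cashValue);

        response.setStatus(HttpServletResponse.SC_OK);
    }

    // Placeholder for the downstream service on another tier
    static class PaymentService {
        void charge(String user, String productId, double amount) {
            // the method arguments carry the business context on this tier
        }
    }
}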
Challenge 2: Business Context is Somewhere Along a
Distributed Transaction
Modern applications are no longer monolithic. The challenge with that is
that transactions are distributed, they can take differing paths, and data we
need for our business context (username, product information, price, and
so on) is often available on different tiers. This requires us to trace every
single transaction across all tiers in order to collect this data for a single user
transaction:
Every transaction is different: it involves different services, crosses multiple
tiers and we need to capture business information along the full end-to-end
transaction
Challenge 3: Understanding Your Users Means: All Users,
All Actions, All the Time
Knowing that a certain transaction of a user failed, including all contextual
information, is great. In modern applications users have many ways to reach
a certain point in the application, e.g. checking out. So the questions are:
how did the user get to this point? What were the actions prior to the
problem? Is one of the previous actions responsible for the problem we
see?
In order to answer these questions we need access to all transactions of
every single user. This allows us to a) understand how our users interact
with the system, b) see how users reach our site (identify landing pages), c)
see where users drop off the site (important for bounce rate and bounce page
evaluation) and d) learn which paths tend to lead to errors. It also supports a
critical use case in business transaction management, which is user complaint
resolution: when user Maria calls the support line we can look up all
her transactions, follow her steps and identify where things went wrong:
Knowing every action of every user allows us to better understand our users,
follow their click paths and speed up individual problem resolution
Why Continue Reading?
In this blog I give you more examples of business transaction
management and focus on the challenges I just explained:
We need to analyze more than just URLs as modern web applications
have become more complex
We live in a distributed transactional world where business context data
can come from every tier involved
We need to focus on every user and every action to understand our
users
To make it easier to understand I also include examples from PGAC and other
real-life BTM implementations.
An Example of Business Transactions
Let's assume we have a web site for a travel agency. The interesting business
data for this type of application is:
What destinations are people searching for? I want to make sure I offer
trips to those regions and that people find what they are looking for
How many trips are actually being sold?
How much money do we make?
How many people that search actually buy (convert)? Are there certain
trips that have a better conversion rate?
From a technical perspective we can monitor every single technical
transaction, which means we can look at every single web request that
is processed by our application server. If we do that we basically look at
a huge set of technical transactions. Some of these transactions represent
those that handle search requests. Some of them handle the shopping cart
or check-out procedure. The first goal for us however is to identify the
individual business transactions such as Search:
Among all technical transactions we want to identify those that perform a
certain business transaction, e.g. search, checkout and log on
Now, not every search is the same. In our case the search keyword (trip
destination) is interesting so we want to split the search transactions by
the destination used in the search criteria. That allows us to see how many
search requests are executed by destination and also how long these search
requests take to execute. Assuming we have to query external services to
fetch the latest trip offerings it could be very interesting to see how fast or
slow searches for specific destinations are, or whether the number of search
results impacts query time and therefore user experience.
Splitting the searches by destination allows us to see how many searches are
actually executed and whether there are any performance implications for
specic destinations
The next interesting aspect is of course the actual business we do or don't
do. If our website serves different markets (US, EMEA, Asia, and others) it
is interesting to see revenue by market. The next illustration shows that
we can focus on the actual purchase business transactions. On these we
evaluate revenue generated, the market that generated the revenue and
whether there is a performance impact on business by looking at response
time as well:
Looking at those business transactions that handle the purchase allows us to
see how much money we actually make split by markets we serve
Contextual Information for Every Transaction
In order for all of this to actually work we need to capture more than just
the URL and response time. A lot of the time the type of context information
we need is in the user session, in the HTTP POST body, or in some parameter
on a method call or SQL bind value. The next illustration shows what
technical information we are talking about in order to map these values to
business transactions:
Capturing context information for every single technical transaction is
mandatory for business transaction management
As we live in a distributed transactional world we need the full end-to-
end transaction and on that transaction the ability to capture this type of
technical but business-relevant data.
Every Action of Every Visit to Identify Landing Pages,
Conversion and Bounce Rates
Some of the most interesting metrics for business owners are conversion
rates, bounce rates, user satisfaction and how well landing pages do.
In order to get these metrics we need to identify every individual user
(typically called visitors), all actions (individual technical transactions/
requests) for the user and the information on whether the user had a good
or bad experience (we factor performance and functional problems into
that). With that information we can (a short calculation sketch follows this list):
Identify the landing page of a visitor -> that's the first page that was
requested by an individual visitor. The landing page tells you how
people get to your website and whether promotions or ads/banners
work
Identify bounce rates and bounce pages -> a visitor bounces off your
site if only one page was visited. We want to know which pages
typically make people bounce off so we can optimize them.
If you spend a lot of money on marketing campaigns but, due to bad
performance or a problem on these landing pages, people bounce off,
that money is wasted
Identify the click paths visitors take until they actually convert (buy a
product) or where along this path they actually leave -> visitors
usually need to click through a series of pages before they click on
the Confirm button. If they do, these visitors convert. If they don't
and leave your site somewhere along the way we want to know
where and why they are leaving in order to optimize these pages and
therefore increase conversion rates
Identify how satisfied our end users are by looking at response
times and any functional problems along their execution path ->
visitors that experience long page load times or run into any type of
functional problem are more likely to be frustrated and leave the site
before converting. Therefore we want to track this and identify which
actions result in a bad user experience
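Here is the short calculation sketch referred to above: a minimal Java example that derives bounce and conversion rates from per-visit data. The visit representation is invented for illustration; in practice these numbers come straight out of the user experience management solution:

import java.util.List;

public class VisitMetrics {

    // Simplified visit record: pages viewed in order, plus whether the visit converted
    record Visit(List<String> pages, boolean converted) {}

    public static void main(String[] args) {
        List<Visit> visits = List.of(
                new Visit(List.of("/home"), false),                        // bounced
                new Visit(List.of("/home", "/search", "/checkout"), true), // converted
                new Visit(List.of("/search", "/product/42"), false),
                new Visit(List.of("/home"), false)                         // bounced
        );

        long bounces = visits.stream().filter(v -> v.pages().size() == 1).count();
        long conversions = visits.stream().filter(Visit::converted).count();

        System.out.printf("Bounce rate: %.1f%%%n", 100.0 * bounces / visits.size());
        System.out.printf("Conversion rate: %.1f%%%n", 100.0 * conversions / visits.size());
        System.out.println("Landing page of first visit: " + visits.get(0).pages().get(0));
    }
}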
Landing Pages Determine Whether Users Make It Past the
First Site Impression
Knowing the first action of every visit lets us identify our landing pages.
Visitors that don't make it further than the landing page bounce off our site
right away. This is the first problem we want to address as these people are
never going to generate any business. The next screenshot shows how we
want to analyze landing pages. It is interesting to see which landing pages
we have, which are frequented more often, what the bounce rate of these
landing pages is and whether there is any correlation to performance (e.g.
higher bounce rate on pages that take longer):
Landing page report shows bounce rates, access count and compares load time
to the industry standard
Bounce and Conversion Rates, User Satisfaction and
Activity are Important Metrics for Business Owners
Having every action of every visitor available, and knowing whether visitors
only visit the landing page and bounce off or whether they also make it
to the checkout page, allows us to calculate bounce and conversion rates.
Looking at response times, request counts and errors also allows us to assess
visitor satisfaction and usage. The following illustration shows all these
metrics, which a business owner is interested in:
Dashboard for business owner to monitor those metrics that impact business
Understanding Which Path Visitors Take Allows You to
Improve Conversion and Lower Bounce Rates
The underlying data for all these metrics we just discussed are the individual
actions of every visitor. Only with that information can we identify why
certain visitors convert and others don't, which will help us to improve the
conversion rate and lower the bounce rate. The following screenshot shows
a detailed view of visits. Having all this data allows us to see which visitors
actually bounce off the site after just looking at their landing page, which
users actually convert, and what paths users take when interacting with our
web site:
Having every visitor and all their actions allows us to manage and optimize
business transactions
Speed Up User Complaint Resolution With All Actions
Available
As already explained in the introduction to this blog, one important
use case is speeding up the resolution process for individual user complaints.
If you have a call center and one of your users complains about a problem,
it is best to see what this particular user did to run into the reported error.
Having all user actions available, and knowing the actual user along with
the captured transactions, allows us to look up only the transactions for that
particular user and with that to really see what happened:
When user Maria calls in we can look up all the actions from this user and see
exactly what error occurred
Deep Technical Information for Fast Problem Resolution
Besides using technical context information as input for business
transactions (e.g. username, search keywords, cash amount) we also need
very deep technical information in scenarios where we need to troubleshoot
problems. If visitors bounce off the page because of slow response times or
because of an error we want to identify and fix this problem right away. In
order to do that our engineers need access to the technical data captured
for those users that ran into a particular problem. The following screenshot
shows a user that experienced a problem on one of the pages, which is
great information for the business owner as he can proactively contact this
user. The engineer can now go deeper and access the underlying technical
information captured, including transactional traces that show the problem
encountered such as a failed web service call, a long-running database
statement or an unhandled exception:
From business impact to technical problem description: technical context
information helps to solve problems faster
To Sum Up: What We Need For Business Transaction
Management
There are several key elements we need to perform the type of business
transaction management explained above:
All Visitors, All Actions, All the Time
> First action is the landing page
> Last action is the bounce page
> Helps us to understand the click path through the site,
where people bounce off and which path people that convert
take. Knowing the click paths allows us to improve conversion
rates and lower bounce rates
> Looking up the actions of complaining users speeds up
problem resolution
Technical Context Information on a Distributed Transaction
> URLs alone are not enough as the business transaction itself is
not always reflected in the URL or URL parameters
> Need to capture business context information from the HTTP
session, method arguments, web service calls, SQL statements,
and other pertinent information sources. This information comes
from multiple tiers that participate in a distributed transaction
> Technical context information speeds up problem resolution
Out of the Box Business Transactions for Standard
Applications
From an implementation perspective the question that always comes up is:
do I need to configure all my business transactions manually? The answer
is: not all of them, but for most applications we have seen it is necessary, as
business context information is buried somewhere along the technical
transaction and is not necessarily part of the URL. Identifying business
transactions based on the URL or the web services that get called is of course
a valid start and is something that most business transaction management
solutions provide, and it actually works quite well for standard applications
that use standard web frameworks. For more complex or customized
applications it is a different story.
Business Transactions by URL
The easiest way to identify business transactions is by URL, assuming your
application uses URLs that tell you something about the related business
transactions. If your application uses URLs like the ones in the following list
you can easily map these URLs to business transactions (a minimal mapping
sketch follows the list):
/home maps to Home
/search maps to Search
/user/cart maps to Cart
/search?q=shoes still maps to Search but it would be great to actually
see the search by keyword
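Here is the minimal Java sketch of such a URL-based mapping referred to above; the patterns mirror the list, and a real BTM solution would of course configure this declaratively rather than in code:

import java.util.LinkedHashMap;
import java.util.Map;

public class UrlBusinessTransactionMapper {

    private static final Map<String, String> MAPPING = new LinkedHashMap<>();
    static {
        MAPPING.put("/home", "Home");
        MAPPING.put("/search", "Search");
        MAPPING.put("/user/cart", "Cart");
    }

    static String businessTransaction(String path, String query) {
        String name = MAPPING.getOrDefault(path, "Other");
        // Splitting "Search" further by keyword requires looking at the query string
        if ("Search".equals(name) && query != null && query.startsWith("q=")) {
            return name + " [" + query.substring(2) + "]";
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(businessTransaction("/home", null));        // Home
        System.out.println(businessTransaction("/search", "q=shoes")); // Search [shoes]
    }
}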
The following screenshot shows the automatically identified business
transactions based on URLs in dynaTrace. We automatically get information
on whether there are any errors on these transactions, what the response
time is, and how much time is spent in the database:
Business transactions by web request URL works well for standard web
applications using meaningful URLs that can easily be mapped
Business Transactions by Service/Servlet
Another often-seen scenario is business transactions based on servlet
names or web service calls that are executed by the technical transaction.
This is most often very interesting as you want to know how your calls to
the search or credit card web service are doing. The name of the invoked
web service method is often very descriptive and can therefore be used for
automatic business transactions. Here is an example:
Business transactions by web service method name works well as method names
are often very descriptive
Business Transactions by Page Title
Page titles are very often better suited than URLs; to clarify, these are the
actual titles of the pages users visit. The following shows us business
transactions per page title including information on whether problems are
to be found in the browser (browser errors), client, network or server,
allowing a first quick root cause analysis:
Business transaction by page title helps us to understand end user experience
Customized Business Transactions for Non-standard
Applications
A lot of applications our customers are using don't use standard web
frameworks where URLs tell them everything they need to identify their
business transactions. Here are some examples:
Web 2.0 applications use a single service URL. The name of the actual
business transaction executed can only be captured from a method
argument of the dispatcher method on the backend
Enforcing SLAs by account type (free, premium, elite members): the
account type of a user doesn't come via the URL but is determined by
an authentication call on the backend
Search options are passed via the URL using internal IDs. The human-
readable name of the option is queried from the database
The booking business transaction is only valid if multiple conditions
along the technical transaction are fulfilled, e.g. the credit card check is OK
and the booking was forwarded to delivery
Let's have a look at some examples of how we use customized business
transactions in our own environment:
Requests by User Type
On our Community Portal we have different user types that access our
pages. These include employees, customers, partners, and several others.
When a user is logged in we only get the actual user type from an
authentication call that is made to the backend JIRA authentication service.
We can capture the return value of the getGroup service call and use this to
define a business transaction that splits all authenticated transactions by this
user type, allowing me to see which types of users are actually consuming
content on our Community Portal:
Using a method return value allows me to analyze activity per user type
Search Conversion Rates
We have a custom search feature on our Community Portal. In order
to ensure that people find content based on their keywords I need to
understand two things: a) what keywords are used and b) which keywords
result in a successful click to a search result and which ones don't -> that
helps me to optimize the search index. The following screenshot shows
the two business transactions I created. The first splits the search requests
based on the keyword which is passed as an HTTP POST parameter on an
Ajax call. The second looks at clicks to content pages and shows those
that actually came from a previous search result. For that I use the referrer
header (I know the user came from a search result page) and the last
used search keyword (part of the user session):
Using HTTP POST parameter, referrer header and HTTP session information to
identify search keywords and the conversion rate to actual result clicks
These were just two examples of how we use business transactions internally.
In the use cases described it is not possible to just look at a URL, a servlet
or a web service name to identify the actual business transaction. In these
scenarios, and in scenarios that we regularly see with our customers, it is
necessary to capture information from within the technical transaction and
then define these business transactions based on the context data captured.
A Practical Example: How Permanent General Assurance
Corporation Uses Business Transactions
As a last example of business transaction management in real life I want to
highlight some of what was shown during a webinar by Glen Taylor, Web
Service Architect at Permanent General Assurance Corporation (PGAC).
PGAC runs TheGeneral.com and PGAC.com. Their web applications don't
have URLs that tell them whether the user is currently asking for an
insurance quote or whether they are in the process of verifying their credit
card. The information about the actual business transaction comes from
the instance class of an object passed to their ProcessFactory. dynaTrace
captures the object passed as an argument, and its actual class. With that
information they are able to split their technical transactions into business
transactions.
PGAC uses the instance name of a class to define their business transactions
If you are interested in their full story check out the recorded webinar. It is
available for download on the dynaTrace recorded webinar page.
More on Business Transaction Management
If you are already a dynaTrace user you should check out the material we
have on our dynaTrace Community Portal: Business Transactions in Core
Concepts and Business Transactions in Production.
If you are new to dynaTrace check out the information on our website
regarding Business Transaction Management and User Experience
Management.
Week 38
You Only Control 1/3 of Your Page Load
Performance!
by Klaus Enzenhofer
You don't agree? Have you ever looked at the details of your page load time
and analyzed what really drives it? Let me show you with a
real-life example and explain that in most cases you only control 1/3 of the
time required to load a page, as the rest is consumed by third party content
that you do not have under control.
Be Aware of Third Party Content
When analyzing web page load times we can use tools such as dynaTrace,
Firebug or PageSpeed. The following two screenshots show timeline views
from dynaTrace AJAX Edition Premium Version. The timelines show all
network downloads, rendering activities and JavaScript executions that
happen when loading almost exactly the same page. The question is:
where does the huge difference come from?
Timeline without/with third party content
The two screenshots below show these two pages as rendered by the
browser. From your own application's perspective it is the exact same page;
the only difference is the additional third party content. The screenshot
on the left-hand side refers to the first timeline, the screenshot on the right
to the second timeline. To make the differences easier to see I have marked
them with red boxes.
Screenshot of the page without and with highlighted third party content
The left-hand screenshot shows the page with content delivered by your
application. That's all the business-relevant content you want to deliver
to your users, e.g. information about travel offers. Over time this page
got enriched with third party content such as tracking pixels, ads, Facebook
Connect, Twitter and Google Maps. These third party components make
the difference between the two page loads. Everyone can easily see that
this enrichment has an impact on page load performance and therefore
affects user experience. Watch this video to see how users experience the
rendering of the page.
The super-fast page that finishes the download of all necessary resources
after a little over two seconds is slowed down by eight seconds. Table 1
shows 5 key performance indicators (KPIs) that represent the impact of the
third party content.
4 Typical Problems with Third Party Content
Let me explain 4 typical problems that come with adding third party
content and why they impact page load time.
Problem 1: Number of Resources
With every new third party feature we are adding new resources that have
to be downloaded by the browser. In this example the number of resources
increased by 117. Let's compare it with the SpeedOfTheWeb baseline for
the shopping industry. The best shopping page loads at least 72 resources.
If we stuck with our original page we would be the leader in this
category with just 59 resources.
In addition to the 117 roundtrips that it takes to download these resources,
the total download size of the page also grows significantly.
To download the extra (approximately) 2 MB from the servers of the third
party content providers your customers will need extra time. Depending
on bandwidth and latency the download time can vary, and if you think
of downloading the data via a mobile connection it really can be time
consuming.
Problem 2: Connection Usage
Domain sharding is a good way to enable older browsers to download
resources in parallel. Looking at modern web sites, domain sharding is often
used too aggressively. But how can you do too much domain sharding?
Table 2 shows us all the domains from which we only download one or two
resources. There are 17 domains for downloading 23 resources: domain
sharding at its best!
And what about connection management overhead? For each domain we
have to make a DNS lookup so that we know which server to connect to.
The setup of a connection also needs time. Our example needed 1,286
ms for DNS lookups and another 1,176 ms for establishing the connections
to the servers. As almost every domain refers to third party content you have
no control over them and you cannot reduce them.
Table 2: Domains from which only one or two resources are downloaded

URL                            Count
www.facebook.com               2
plusone.google.com             2
www.everestjs.net              2
pixel2823.everesttech.net      2
pixel.everesttech.net          2
metrics.tiscover.com           2
connect.facebook.net           1
apis.google.com                1
maps.google.com                1
api-read.facebook.com          1
secure.tiscover.com            1
www.googleadservices.com       1
googleads.g.doubleclick.net    1
ad-dc2.adtech.de               1
csi.gstatic.com                1
ad.yieldmanager.com            1
ssl.hurra.com                  1
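To put the connection management overhead into perspective, here is a small Java sketch using the measured totals from this example. Keep in mind that DNS lookups and connection setups partly run in parallel, so the sum is cumulative effort rather than wall-clock time:

public class ConnectionOverheadEstimate {
    public static void main(String[] args) {
        double totalDnsMs = 1286;     // measured total DNS lookup time for the page
        double totalConnectMs = 1176; // measured total connection setup time for the page

        // Cumulative connection management overhead (lookups and handshakes partly overlap)
        double overheadMs = totalDnsMs + totalConnectMs;

        // Compare against the two-second page load expectation discussed later in this almanac
        double pageBudgetMs = 2000;

        System.out.printf("Connection management overhead: %.0f ms%n", overheadMs);
        System.out.printf("Page load budget: %.0f ms%n", pageBudgetMs);
        System.out.printf("Overhead as share of budget: %.0f%%%n", 100 * overheadMs / pageBudgetMs);
    }
}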
Problem 3: Non-Minified Resources
You are trying to reduce the download size of your page as much as
possible. You have put a lot of effort into your continuous integration (CI)
process to automatically minify your JavaScript, CSS and images, and then
you are forced to put (for example) ads on your pages. On our example
page we can find an ad provider that does not minify JavaScript. The
screenshot below shows part of the uncompressed JavaScript file.
Uncompressed JavaScript code of third party content provider
I have put the whole file content into a compressor tool and the size could be
reduced by 20%. And again, you cannot do anything about it.
Problem 4: Awareness of Bad Response Times of Third
Party Content Provider
Within your data center you monitor the response times for incoming
requests. If response times degrade you will be alerted. Within your data
center you know when something is going wrong and you can do something
about it. But what about third party content? Do Facebook, Google, etc. send
you alerts if they are experiencing bad performance? You will now say that
these big providers will never have bad response times, but take a look at
the following two examples:
Timeline with slow Facebook request
This timeline shows us a very long-running resource request. You will
never see this request, which lasted 10,698 ms, in your data center monitoring
environment as the resources are provided by Facebook, one of the third
party content providers on this page.
Timeline with slow Facebook and Google+ requests
The second example shows the timeline of a different page but with the
same problem. On this page not only is Facebook slow but also Google+.
The slow requests have durations from 1.6 seconds to 3.5 seconds and
have a big impact on the experience of your users. The problem is that the
bad experience is not ascribed to the third party content provider but to
YOU!
Conclusion
What we have seen is that third party content has a big impact on user
experience. You cannot rely on big third party content providers to always
deliver high performance. You should be aware of the problems that can
occur if you put third party content on your page and you really have to
take action. In this blog I have highlighted several issues you are facing with
third party content. What should be done to prevent these types of problems
will be discussed in my next blog: Third Party Content Management!
Week 39
Why You Have Less Than a Second to Deliver
Exceptional Performance
by Alois Reitbauer
The success of the web performance movement shows that there is
increasing interest and value in fast websites. That faster websites lead
to more revenue and reduced costs is a well-proven fact today. So being
exceptionally fast is becoming the dogma for developing web applications.
But what is exceptionally fast and how hard is it to build a top-performing
web site?
Dening Exceptionally Fast
First we have to define what exceptionally fast really means. Certainly it
means faster than just meeting user expectations. So we have to look at
user expectations first. A great source for which response times people
expect from software is this book. It provides really good insight into time
perception in software. I can highly recommend it to anybody who
works in the performance space.
There is no single value that defines what performance users expect. It
depends on the type of user interaction and ranges from a tenth of a second
to five seconds. In order to ensure smooth continuous interaction with the
user an application is expected to respond within two to four seconds. So
ideally an application should respond within two seconds.
This research is also backed up by studies done by Forrester asking people
about their expectations regarding web site response times. The survey
shows that while users in 2006 accepted up to four seconds they expected
a site to load within two seconds in 2009.
It seems like two seconds is the magic number for a web site to load. As we
want to be exceptionally fast this means that our pages have to load in less
than two seconds to exceed user expectations.
How Much Faster Do We Have to Be?
From a purely technical perspective everything faster than two seconds
should be considered exceptional. This is however not the case for human
users. As we are not clocks, our time perception is not that precise. We are
not able to discriminate time differences of only a couple of milliseconds.
As a general rule we can say that humans are able to perceive time
differences of about 20 percent. This 20 percent rule means that we have
to be at least 20 percent faster to ensure that users notice the difference.
For delivering exceptional performance this means a page has to load in
1.6 seconds or faster to be perceived as exceptionally fast.
How Much Time Do We Have?
At first sight 1.6 seconds seems to be a lot of processing time for responding
to a request. This would be true if this time was under our control.
Unfortunately this is not the case. As a rule of thumb about 80 percent of
this time cannot be, or can only indirectly be, controlled by us.
Let's take a closer look at where we lose this time. A good way to understand
it is the web application delivery chain. It shows
all the parts that play together to deliver a web page and thus influence
response times.
Web application delivery chain
On the client side we have to consider rendering, parsing and executing
JavaScript. Then there is the whole Internet infrastructure required to deliver
content to the user. Then there is our server infrastructure and also the
infrastructure of all third party content providers (like ads, tracking services,
social widgets) we have on our page.
Sending Our Initial Request
The first thing we have to do is to send the initial request. Let us investigate
how much time we are losing here. To be able to send the request to the
proper server the browser has to look up the domain's IP address via DNS.
Interactions for initial web request
Whenever we communicate over the Internet we have to take two factors
into account: bandwidth and latency. Thinking of the Internet as a pipe,
bandwidth is the diameter and latency is the length. So while bandwidth
helps us to send more data at each single point in time, latency tells us how
long it takes to send each piece of data. For the initial page request we are
therefore more interested in latency as it directly reflects the delay from a
user request to the response.
So, what should we expect latency-wise? A study of the Yahoo Development
Blog has shown that latency varies between 160 and over 400 milliseconds
depending on the connection type. So even if we assume a pretty fast
connection we have to consider about 300 ms for the two roundtrips. This
means we now have 1.3 seconds left.
Getting the Content
So far we haven't downloaded any content yet. How big a site actually is
is not that easy to say; we can however use stats from the HTTP Archive.
Let's assume we have a very small page of about 200 kB. Using a 1.5 Mbit
connection it will take about a second to download all the content. This
means we now have only 300 ms left. Up to now we have lost
about 80 percent of our overall time.
Client Side Processing
Next we have to consider client side processing. Depending on the
complexity of the web page this can be quite significant. We have seen cases
where this takes up several seconds. Let's assume for now that you
are not doing anything really complex. Our own tests at SpeedoftheWeb.org
show that 300 ms is a good estimate for client side processing time.
This however means that we have no more time left for server side processing.
So if we have to do any processing on the server at all,
delivering an exceptionally fast web site is close to impossible, or we have
to apply a lot of optimization to reach this ambitious goal.
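To make the arithmetic behind this budget explicit, here is a minimal sketch that recomputes the numbers used in this article; the constants (300 ms of latency for two roundtrips, a 200 kB page on a 1.5 Mbit/s line, 300 ms of client side processing) are the rough estimates from above, not measured values.

public class PageLoadBudget {
    public static void main(String[] args) {
        double budgetMs = 2000 * 0.8;          // 20% faster than the 2 second expectation -> 1600 ms
        double latencyMs = 300;                // two roundtrips for DNS lookup and initial request
        double downloadMs = (200 * 8) / 1.5;   // 200 kB at 1.5 Mbit/s (= 1.5 kbit per ms) -> ~1067 ms
        double clientMs = 300;                 // rendering, parsing and executing JavaScript

        double serverMs = budgetMs - latencyMs - downloadMs - clientMs;
        // With these estimates the remaining server-side budget is roughly zero (slightly negative).
        System.out.printf("Budget left for server-side processing: %.0f ms%n", serverMs);
    }
}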
Conclusion
Delivering exceptional performance is hard - really hard - considering the
entire infrastructure in place to deliver content to the user. It is nothing you
can simply build in later. A survey by Gomez testing a large number of sites
shows that most pages miss the goal of delivering exceptional performance
across all browsers.
Performance across 200 web sites
Faster browsers help but are not silver bullets for exceptional performance.
Many sites even fail to deliver expected user performance. While sites do
better when we look at perceived render time, also called above-the-fold
time, they still cannot deliver exceptional performance.
Week 40
eCommerce Business Impact of Third Party
Address Validation Services
by Andreas Grabner
Are you running an eCommerce site that relies on third party services such
as address validation, credit card authorization or mapping services? Do you
know how fast, reliable and accurate these service calls (free or charged)
are for your web site? Do you know if it has an impact on your end users
when one of these services is not available or returns wrong data?
End User and Business Impact of Third Party Service Calls
In last week's webinar Daniel Schrammel, IT System Manager at Leder
& Schuh (responsible for sites such as www.shoemanic.com or
www.jelloshoecompany.com), told his story about the impact of third party
online services on his business. One specific problem Leder & Schuh had
was with a service that validates shipping address information. If the
shipping address entered is valid, users can opt for a cash on delivery
option which is highly popular in the markets they sell to. If the address
can't be validated or the service is unreachable, this convenient way of
payment is not available and users have to go with credit card payment. As
the eCommerce platform used to run their online stores also comes from a
third party provider, the Leder & Schuh IT team has no visibility into these
third party online service calls, whether they succeed and how that impacts
end user behavior.
Monthly Report Basically Means No Visibility
As stated before, Leder & Schuh uses an eCommerce Solution that was not
developed in-house. Therefore they had no option to monitor the service
calls from within the application as this was not supported by the platform
(no visibility into application code). They had to rely on a monthly report
generated by the address validation service telling them how many requests
they had last month, how many succeeded, partially succeeded (e.g. street
name incorrect) or completely failed. With that aggregated data it was:
Impossible to tell which queries actually caused the verification to fail
(was it really the user entering a wrong address or is the service not
using an up-to-date address database?)
Hard to tell whether a failing address validation has an impact on users'
decisions to actually buy shoes (is there a correlation between address
validation and bounce rates?)
Live Monitoring of Service Quality
In order to solve this problem they had to get visibility into the eCommerce
solution to monitor the calls to the third party address validation service.
They were interested in:
1. The call count (to validate the service fee they had to pay)
2. The response code of the service (to see the impact the response had
on users bouncing off the site)
3. The actual input parameters that caused the service to return an
address validation error (to verify whether addresses were really bogus
or should have been valid)
Using dynaTrace allowed them to accomplish these and other
general application performance management (APM) goals without
needing to modify the third party eCommerce platform and without any
help from the third party address validation service. The following dashboard
shows the calls to the address validation service. On the top line we see
green, which represents the calls that return a success; orange, which
represents validations with partial success; and red, which indicates those calls
that failed. The bottom left chart shows an aggregation of these 3 return
states, showing spikes where up to 30% of the validation calls don't return
a success:
Monitoring third party service calls, the response code and impact on end users
Monitoring the service like this allows Leder & Schuh to:
Get live data on service invocations -> they don't have to wait until the end
of the month
Look at those addresses that failed -> to verify whether the data was really
invalid or whether the validation service uses an out-of-date database
Verify the number of calls made to the service -> to check that the number
of calls they get charged for matches the calls actually made
Monitor availability of the service -> if the service is not
reachable this breaches the SLA
Impact of Service Quality on User Experience and Business
As indicated in the beginning, the cash on delivery option is much
more popular than paying by credit card. In case the address validation
service returns that the address is invalid, or in case the service is down (not
reachable), the user only gets the option to pay with credit card. Correlating
the status and the response time of the service call with the actual orders
that come in allows Leder & Schuh to see the actual business impact. It
turns out that more users bounce off the site when the only payment
option they are given is paying by credit card (functional impact) or if the
validation service takes too long to respond (performance impact). The
following dashboard shows how business can be impacted by the quality
of service calls:
Quality of service calls (performance or functional) has a direct impact on
orders and revenue
Want to Learn More?
During the webinar we also talked about the general response time, service level
and system monitoring they are now doing on their eCommerce platform.
With the visibility they got they achieved some significant application
performance improvements and boosted overall business. Here are some
of the numbers Daniel presented:
50% fewer database queries
30% faster landing pages
100% visibility into all transactions and third party calls
You can watch the recorded webinar in the dynaTrace webinar library.
Week 41
The Reason I Don't Monitor Connection Pool
Usage
by Michael Kopp
I have been working with performance sensitive applications for a long
time now. As can be expected most of them have to use the database at
one point or another, so you inevitably end up having a connection pool.
Now, to make sure that your application is not suffering from waiting on
connections you monitor the pool usage. But is that really helping? To be
honest - not really.
How an Application Uses the Connection Pool
Most applications these days use connection pools implicitly. They get a
connection, execute some statements and close it. The close call does not
destroy the connection but puts it back into a pool. The goal is to minimize
the so-called busy time. Under the hood most application servers refrain
from putting a connection back into the pool until the transaction has
been committed. For this reason it is a good practice to get the database
connection as late as possible during a transaction. Again the goal is to
minimize usage time, so that many application threads can share a limited
number of connections.
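As a minimal illustration of this implicit usage, the sketch below acquires a connection from an injected DataSource (the pool the application server provides), uses it briefly and closes it again; the table and query are made up for the example, and close() only returns the connection to the pool.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.sql.DataSource;

public class OrderDao {

    private final DataSource dataSource; // the connection pool, e.g. looked up via JNDI

    public OrderDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public int countOrders(long customerId) throws SQLException {
        // Acquire the connection as late as possible and hold it as briefly as possible,
        // so that many application threads can share a small number of connections.
        try (Connection con = dataSource.getConnection();
             PreparedStatement stmt = con.prepareStatement(
                     "select count(*) from orders where customer_id = ?")) {
            stmt.setLong(1, customerId);
            try (ResultSet rs = stmt.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        } // close() puts the connection back into the pool, it does not destroy it
    }
}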
All connection pools have a usage measure to determine if enough
connections are available, or in other words to see if the lack of connections
has a negative effect. However, as a connection is occupied only for very
short amounts of time - often fractions of a second - we would need to
check the usage equally often to have a statistically significant chance of
seeing the pool being maxed out under normal conditions.
Connection Pool Usage if Polled Every 10 Seconds
In reality this is not done, as checking the pool too often (say several times
a second) would lead to a lot of monitoring overhead. Most solutions check
every couple of seconds and as a result we only see pool usage reaching
100% if it is constantly maxed-out. If we were to track the usage on a
continuous basis the result would look different:
Pool usage as seen if min/max and average are tracked continuously instead
of polled
This means that by the time we see 100% pool usage with regular monitoring
solutions we would already have a significant negative performance impact
- or would we?
What Does 100% Pool Usage Really Mean?
Actually, it does not mean much. It means that all connections in the pool
are in use, but not that any transactions are suffering performance problems
due to this. In a continuous load scenario we could easily tune our setup
to have 100% pool usage all the time and not have a single transaction
suffering; it would be perfect.
However many use cases do not have a steady continuous load pattern and
we would notice performance degradation long before that. Pool usage
alone does not tell us anything; acquisition time does!
This shows the pool usage and the min/max acquisition time which is non-zero
even though the pool is never maxed out
Most application servers and connection pools have a wait or acquisition
metric that is far more interesting than pool usage. Acquisition time
represents the time that a transaction has to wait for a connection from
the pool. It therefore represents real, actionable information. If it increases
we know for a fact that we do not have enough connections in the pool all
the time (or that the connection pool itself is badly written). This measure
can show significant wait time long before the average pool usage is
anywhere close to 100%. But there is still a slight problem: the measure is
still an aggregated average across the whole pool or, more specifically, across all
transactions. Thus while it allows us to understand whether or not there
are enough connections overall, it does not enable us to identify which
business transactions are impacted and by how much.
Measuring Acquisition Time Properly
Acquisition time is simply the time it takes for the getConnection call to
return. We can easily measure that inside our transaction, and if we do that
we can account for it on a per business transaction basis and not just as an
aggregate of the whole pool. This means we can determine exactly how
much time we spend waiting for each transaction type. After all, I might not
care if I wait 10 ms in a transaction that has an average response time of a
second, but at the same time this would be unacceptable in a transaction
type with 100 ms response time.
The getConnection call as measured in a single transaction. It is 10 ms although
the pool average is 0.5ms
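A minimal sketch of how acquisition time could be captured per business transaction inside your own code; the transaction name and the in-memory map are placeholders for whatever metrics facility you use (an APM tool such as dynaTrace captures this without code changes).

import java.sql.Connection;
import java.sql.SQLException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;
import javax.sql.DataSource;

public class TimedConnectionSource {

    private final DataSource dataSource;
    // Total time spent waiting for a connection, in nanoseconds, per business transaction.
    private final ConcurrentMap<String, AtomicLong> acquisitionNanos =
            new ConcurrentHashMap<String, AtomicLong>();

    public TimedConnectionSource(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public Connection getConnection(String businessTransaction) throws SQLException {
        long start = System.nanoTime();
        try {
            return dataSource.getConnection();
        } finally {
            long waited = System.nanoTime() - start;
            AtomicLong total = acquisitionNanos.get(businessTransaction);
            if (total == null) {
                acquisitionNanos.putIfAbsent(businessTransaction, new AtomicLong());
                total = acquisitionNanos.get(businessTransaction);
            }
            total.addAndGet(waited);
        }
    }

    /** Total acquisition time for one transaction type, in milliseconds. */
    public long acquisitionMillis(String businessTransaction) {
        AtomicLong total = acquisitionNanos.get(businessTransaction);
        return total == null ? 0 : total.get() / 1000000;
    }
}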
We could even determine which transaction types are concurrently fighting
over limited connections and understand outliers, meaning the occasional
case when a transaction waits a relatively long time for a connection, which
would otherwise be hidden by the averaging effect.
Configuring the Optimal Pool Size
Knowing how big to configure a pool upfront is not always easy. In reality
most people simply set it to an arbitrary number that they assume is big
enough. In some high volume cases it might not be possible to avoid wait
time entirely, but we can understand and optimize it.
There is a very easy and practical way to do this. Simply monitor the
connection acquisition time during peak load hours. It is best if you do that
on a per business transaction basis as described above. You want to pay
special attention to how much it contributes to the response time. Make
sure that you exclude those transactions from your analysis that do not wait
at all; they would just skew your calculation.
If the average response time contribution to your specific business
transaction is very low (say below 1%) then you can reasonably say that
your connection pool is big enough. It is important to note that I am not
talking about an absolute value in terms of milliseconds but about contribution
time! If that contribution is too high (e.g. 5% or higher) you will want
to increase your connection pool until you reach an acceptable value. The
resulting pool usage might be very low on average or close to
100%; it does not really matter!
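To turn this rule of thumb into a number, a small helper like the one below could relate the measured acquisition time to the response time of each business transaction; the 10 ms / 1 second and 10 ms / 100 ms figures mirror the example given earlier, and the thresholds of 1% and 5% are the ones suggested above.

public class PoolSizingCheck {

    /** Share of the response time spent waiting for a connection, e.g. 0.01 for 1%. */
    static double acquisitionContribution(double avgAcquisitionMs, double avgResponseMs) {
        return avgAcquisitionMs / avgResponseMs;
    }

    public static void main(String[] args) {
        double slowTransaction = acquisitionContribution(10, 1000); // 1%  -> pool is big enough
        double fastTransaction = acquisitionContribution(10, 100);  // 10% -> pool likely too small

        System.out.printf("slow transaction: %.0f%%, fast transaction: %.0f%%%n",
                slowTransaction * 100, fastTransaction * 100);
    }
}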
Conclusion
The usefulness of a pool measure depends on the frequency of polling it.
The more often we poll it, the more overhead we add - and in the end it is still
only a guess. Impact measures like acquisition time are far more useful and
actionable. They allow us to tune the connection pool to a point where it
adds no, or at least acceptable, overhead when compared to response time.
Like all impact measures it is best not to use the overall average, but to
understand it in terms of contribution to the end user response time.
Week 42
Top 8 Performance Problems on Top 50 Retail
Sites before Black Friday
by Andreas Grabner
The busiest online shopping season is about to start and it's time to make
a quick check on whether the top shopping sites are prepared for the big
consumer rush, or whether they are likely going to fail because their pages
are not adhering to web performance best practices.
The holiday rush seems to start ramping up earlier and earlier each year.
The Gomez Website Performance Pulse showed an average degradation in
performance satisfaction of 13% among the top 50 retail sites in the past
24 hours when compared to a typical non-peak period. For some, like the
website below, the impact began a week ago. It's important to understand
not only whether there is a slowdown, but also why: is it the Internet or
my app?
Nobody wants their site to respond with a page like that as users will most likely
be frustrated and run off to their competition
Before we provide a deep dive analysis on Cyber Monday looking at the
pages that ran into problems, we want to look at the top 50 retail pages
and see whether they are prepared for the upcoming weekend. We will
be using the free SpeedOfTheWeb service as well as deep dive dynaTrace
browser diagnostics. We compiled the top problems we saw along the 5
parts of the web delivery chain on the pages analyzed today,
3 days before Black Friday 2011. These top problems have the potential
to lead to actual problems once these pages are pounded by millions of
Christmas shoppers.
High level results of one of the top retail sites with red and yellow indicators on
all parts of the web delivery chain
User Experience: Optimizing Initial Document Delivery
and Page Time
The actual user experience can be evaluated by 3 key metrics: first
impression, onload, and fully loaded. Looking at one of the top retail sites
we see values that are far above the average for the shopping industry
(calculated on a day-to-day basis based on URLs of the Alexa shopping
index):
4.1 seconds until the user gets the first visual indication of the page and a total of
19.2 seconds to fully load the page
Problem #1: Too Many Redirects Result in Delayed First
Impression
The browser can't render any content until there is content to render.
From entering the initial URL until the browser can render content there
are several things that happen: resolving DNS, establishing connections,
following every HTTP redirect, downloading HTML content. One thing that
can be seen on some of the retail sites is the excessive use of HTTP redirects.
Here is an example: from entering the initial URL until the initial HTML
document was retrieved, the browser had to follow 4 redirects, taking 1.3
seconds:
4 HTTP redirects from entering the initial URL until the browser can download
the initial HTML document
Problem #2: Web 2.0 / JavaScript Impacting OnLoad and
Blocking the Browser
The timeline view of one of the top retail sites makes it clear that JavaScript
- the enabler of dynamic and interactive Web 2.0 applications - does not
necessarily improve user experience but actually impacts user experience
by blocking the browser for several seconds before the user can interact
with the site:
Many network resources from a long list of source domains as well as problematic
JavaScript code impact user experience
The following problems can be seen on the page shown in the timeline
above:
All JavaScript files and certain CSS files are loaded before any images.
This delays first impression time as the browser has to parse and execute
the JavaScript files before downloading and painting images
One particular JavaScript block takes up to 15 seconds to apply dynamic
styles to specific DOM elements
Most of the 3rd party content is loaded late which is actually a good
practice
Browser: JavaScript
As already seen in the previous timeline view, JavaScript can have a huge
impact on user experience when it performs badly. The problem is that
most JavaScript actually performs well on the desktops of web developers.
Developers tend to have the latest browser version, have blocked certain
JavaScript sources (Google Ads, any types of analytics, etc.) and may not
test against the full blown web site. The analysis in this blog was done
on a laptop running Internet Explorer (IE) 8 on Windows 7. This can be
considered an average consumer machine.
Here are two common JavaScript problem patterns we have seen on the
analyzed pages:
Problem #3: Complex CSS Selectors Failing on IE 8
On multiple pages we can see complex CSS selectors such as the following
that take a long time to execute:
Complex jQuery CSS lookups taking a very long time to execute on certain
browsers
Why is this so slow? This is because of a problem in IE 8's querySelectorAll
method. The latest versions of JavaScript helper libraries (such as jQuery,
Prototype and YUI) take advantage of querySelectorAll and simply forward
the CSS selector to this method. It however seems that some of these more
complex CSS lookups cause querySelectorAll to fail. The fallback mechanism
of JavaScript helper libraries is to iterate through the whole DOM in case
querySelectorAll throws an exception. The following screenshot shows
exactly what happens in the above example:
Problem in IE's querySelectorAll stays unnoticed due to empty catch block.
Fallback implementation iterates through the whole DOM
Problem #4: 3rd Party Plugins Such as Superfish
One plugin that we found several times was Superfish. We actually
already blogged about this two years ago - it can lead to severe client side
performance problems. Check out the blog from back then: Performance
Analysis of Dynamic JavaScript Menus. One example this year is the
following Superfish call that takes 400ms to build the dynamic menu:
jQuery plugins such as Superfish can take several hundred milliseconds and
block the browser while doing the work
Content: Size and Caching
In order to load and display a page the browser is required to load the
content. That includes the initial HTML document, images, JavaScript and
CSS files. Users that come back to the same site later in the holiday season
need not necessarily download the full content again but rather access
already-cached static content from the local machine. Two problems related
to loading content are the actual size as well as the utilization of caching:
Problem #5: Large Content Leads to Long Load Times
Too much and too large content is a problem across most of the sites analyzed.
The following is an example of a site with 2MB of total page size where
1.5MB was JavaScript alone:
This site has 2MB in size which is far above the industry average of 568kb
Even on high-speed Internet connections a page size that is 4 times the
industry average is not optimal. When accessing pages of that size from
a slow connection - maybe from a mobile device - loading all content to
display the site can take a very long time, leading to frustration of the end
user.
Problem #6: Browser Caching Not Leveraged
Browser-side caching is an option web sites have to leverage: caching mainly static
content in the browser improves page load time for revisiting users. Many
of the tested retail sites show hardly any cached objects. The following
report shows one example where caching basically wasn't used at all:
Client-side caching is basically not used at all on this page. Caching would
improve page load time for revisiting users
For more information check out our Best Practices on Browser Caching.
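One straightforward way to leverage browser caching for static content on a Java stack is a servlet filter that sets explicit cache headers; the one-year lifetime below is an illustrative choice, not a value taken from the analyzed sites, and the filter would be mapped to static resources such as images, CSS and JavaScript.

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

/** Adds far-future cache headers so browsers can reuse static content on revisits. */
public class StaticCacheFilter implements Filter {

    private static final long ONE_YEAR_SECONDS = 365L * 24 * 60 * 60;

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletResponse httpResponse = (HttpServletResponse) response;
        // Allow browsers and proxies to cache the resource and skip the roundtrip on revisits.
        httpResponse.setHeader("Cache-Control", "public, max-age=" + ONE_YEAR_SECONDS);
        httpResponse.setDateHeader("Expires", System.currentTimeMillis() + ONE_YEAR_SECONDS * 1000);
        chain.doFilter(request, response);
    }

    public void init(FilterConfig filterConfig) {
        // nothing to configure
    }

    public void destroy() {
        // nothing to clean up
    }
}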
Network: Too Many Resources and Slow DNS Lookups
Analyzing the network characteristics of the resources that get downloaded
from a page can give a good indication of whether resources are distributed
optimally across the domains they get downloaded from.
Problem #7: Wait and DNS Time
The following is a table showing resources downloaded per domain. The
interesting numbers are highlighted. There seems to be a clear problem
with a DNS lookup to one of the 3rd party content domains. It is also clear
that most of the content comes from a single domain, which leads to long
wait times as the browser can't download all of them in parallel:
Too many resources lead to wait times. Too many external domains add up on
DNS and connect time. It's important to identify the bottlenecks and find the
optimal distribution
Reducing resources is a general best practice which will lower the number
of roundtrips and also the wait time per domain. Checking on DNS and
connection times - especially with 3rd party domains - allows you to speed
up these network related timings.
Server: Too Many Server Requests and Long Server
Processing Time
Besides serving static JavaScript, CSS and image files, eCommerce sites
have dynamic content that gets delivered by application servers.
Problem #8: Too Much Server Side Processing Time
Looking at the server request report gives us an indication of how much
time is spent on the web servers as well as the application servers to deliver the
dynamic content.
Server processing time and the number of dynamic requests impact highly
dynamic pages such as those we find on eCommerce sites
Long server-side processing time can have multiple causes. Check out the
latest blog on the Impact of 3rd Party Service Calls on your eBusiness as
well as our other server-side related articles on the dynaTrace blog for more
information on common application performance problems.
Waiting for Black Friday and Cyber Monday
This analysis shows that, with only several days until the busiest online
shopping season of the year starts, most of the top eCommerce sites out
there still have the potential to improve their websites in order not to deliver
a frustrating user experience once shoppers go online and actually want to
spend money. We will do another deep dive blog next week and analyze
some of the top retail pages in more detail, providing technical as well as
business impact analysis.
Week 43
5 Things to Learn from JC Penney and Other
Strong Black Friday and Cyber Monday
Performers
by Andreas Grabner
The busiest online shopping time in history has brought a significant
increase in visits and revenue over the last year. But not only were shoppers
out there hunting for the best deals - web performance experts and the
media were out there waiting for problems to happen in order to blog
and write about the business impact of sites performing badly or sites that
actually went down. We therefore know by now that even though the
Apple Store performed really well it actually went down for a short while
Friday morning.
The questions many are asking right now are: what did those that performed
strongly do right, and what did those performing weakly miss in preparing
for the holiday season?
Learn by Comparing Strong with Weak Performers
We are not here to do any finger-pointing but want to provide an objective
analysis of sites that performed well vs. sites that could do better in order
to keep users on their site. Looking at what sites did right allows you to
follow in their footsteps. Knowing what causes weak performance allows you
to avoid the things that drag your site down.
Strong Performance by Following Best Practices
JC Penney was the top performer based on Gomez Last Mile Analysis on
both Black Friday and Cyber Monday, followed by Apple and Dell. For
mobile sites it was Sears followed by Amazon and Dell. Taking a closer look
at their sites allows us to learn from them. The SpeedoftheWeb speed
optimization report shows us that they had strong ratings across all 5
dimensions in the Web performance delivery chain:
SpeedoftheWeb analysis shows strong ratings across all dimensions of the web
delivery chain
Using dynaTrace browser diagnostics technology allows us to do some
forensics on all activities that happen when loading the page or when
interacting with dynamic Web 2.0 elements on that page. The following
screenshot shows the timeline and highlights several things that allow JC
Penney to load very quickly as compared to other sites:
Low number of resources, light on 3rd party content and not overloaded with
Web 2.0 elements
Things they did well to improve performance:
Light on 3rd party content, e.g. they don't use Facebook or Twitter
JavaScript plugins but rather just provide a popup link to access their
pages on these social networks.
The page is not overloaded with hundreds of images or CSS files. They
for instance only have one CSS file. Minifying this file (removing spaces
and empty lines) would additionally reduce the size.
JavaScript on that page is very lightweight. No long running script
blocks, onLoad event handlers or expensive CSS selectors. One thing
that could be done is merging some of the JavaScript files.
Static images are hosted on a separate cache domain served with
cache-control headers. This will speed up page load time for revisiting
users. Some of these static images could even be sprited to reduce
download time.
Things they could do to become even better:
Use CSS sprites to merge some of their static images, e.g. sprite the
images in the top toolbar (Facebook, Twitter, sign-up options)
Combine and minify JavaScript files
Weak Performance by Not Following Best Practices
On the other hand we have those pages that didn't perform that well.
Load times of 15 seconds often lead to frustrated users who will then shop
somewhere else. The following is a typical SpeedoftheWeb report for these
sites, showing problems across the web delivery chain:
Bad ratings across the web delivery chain for pages that showed weak
performance on Black Friday and Cyber Monday
Now let's look behind the scenes and learn what actually impacts page load
time. The following is another timeline screenshot with highlights on the
top problem areas:
Weak performing sites have similar problem patterns in common, e.g. overloaded
with 3rd party content, heavy on JavaScript and not leveraging browser caching
Things that impact page load time:
Overloaded pages with too much content served in non-optimized
form, e.g. static images are not cached or sprited, JavaScript and CSS
files are not minified or merged.
Heavy on 3rd party plugins such as ad services, social networks or user
tracking.
Many single resource domains (mainly due to 3rd party plugins), often with
high DNS and connect time.
Heavy JavaScript execution, e.g. inefficient CSS selectors, slow 3rd
party JavaScript libraries for UI effects, etc.
Multiple redirects to end up on the correct URL, e.g. http://site.com ->
http://www.site.com -> http://www.site.com/start.jsp
To-Do List to Boost Performance for the Remaining
Shopping Season
Looking across the board at strong and weak performers allows us to come
up with a nice to-do list to make sure you are prepared for the
shoppers that will come to your site between now and Christmas. Here are the top things
to do:
Task 1: Check Your 3rd Party Content
We have seen from the previous examples that 3rd party content can
impact your performance by adding lengthy DNS and connect times, long-
lasting network downloads as well as heavy JavaScript execution.
We know that 3rd party content is necessary, but you should do a check on
what the impact is and whether there is an alternate solution to embedding
3rd party content, e.g. embed Facebook with a static link rather than the
full blown Facebook Connect plugin.
Analyze how well your 3rd party components perform with network, server and
JavaScript time
Also check out the blog on the Impact of 3rd Party Content on Your Load
Time.
Task 2: Check the Content You Deliver and Control
Too often we see content that has not been optimized at all. HTML or
JavaScript files that total up to several hundred kilobytes per file are often
caused by server-side web frameworks that generate lots of unnecessary
empty lines and blanks, add code comments to the generated output, etc. The
best practice is to combine, minify and compress text files such as HTML,
JavaScript and CSS files. There are many free tools out there such as YUI
Compressor or Closure Compiler. Also check out Best Practices on Network
Requests and Roundtrips.
Check the content size and number of resources by content type. Combining
files and compressing content helps reduce roundtrips and download time
Developer comments generated in the main HTML document are good
candidates for saving on size
Task 3: Check Your JavaScript Executions
JavaScript - whether custom-coded, a popular JavaScript framework or
added through 3rd party content - is a big source of performance problems.
Analyze the impact of JavaScript performance across all the major browsers,
not just the browsers you use in development. Problems with methods
such as querySelectorAll in IE 8 (discovered in my previous blog post) or
problems with outdated jQuery/Yahoo/GWT libraries can have a huge
impact on page load time and end user experience:
Inefficient CSS selectors or slow 3rd party JavaScript plugins can have a major
impact on page load time
Find more information on problematic JavaScript in the following
articles: Impact of Outdated JavaScript Libraries and 101 on jQuery Selector
Performance
Task 4: Check Your Redirect Settings
As already highlighted in my previous blog post where I analyzed sites prior
to Black Friday, many sites still use a series of redirects in order to get their
users onto the initial HTML document. The following screenshot shows an
example and how much time is actually wasted before the browser can
start downloading the initial HTML document:
Proper redirect configuration can save unnecessary roundtrips and speed up
page load time
For more information and best practices check out our blog post How We
Saved 3.5 Seconds by Using Proper Redirects.
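If the redirect is issued by the application itself, a single permanent redirect straight to the final URL avoids such a chain; here is a minimal servlet sketch, reusing the example URLs from the list above:

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/** Sends one permanent redirect to the canonical start page instead of a chain of redirects. */
public class CanonicalRedirectServlet extends HttpServlet {

    protected void doGet(HttpServletRequest request, HttpServletResponse response) {
        // A 301 lets browsers and search engines remember the final location,
        // so revisiting users skip the extra roundtrips entirely.
        response.setStatus(HttpServletResponse.SC_MOVED_PERMANENTLY);
        response.setHeader("Location", "http://www.site.com/start.jsp");
    }
}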
Task 5: Check Your Server-Side Performance
Dynamic pages especially - those containing location-based offers, your
shopping cart or a search result page - require server-side processing. When
servers get overloaded with too many requests and when the application
doesn't scale well, we can observe performance problems when returning
the initial HTML document. If this document is slow, the browser isn't
able to download any other objects and therefore leaves the screen blank.
Typical server-side performance problems are either a result of architectural
problems (not built to scale), implementation problems (bad algorithms,
wasteful memory usage leading to excessive garbage collection, too much work
on the database, etc.) or problems with 3rd party web services (credit
card authentication, location based services, address validation, and so
on). Check out the Top 10 Server-side Performance Problems Taken from
Zappos, Monster, Thomson and Co:
Application performance problems on the application server will slow down
initial HTML download and thus impact the complete page load
A good post on how 3rd party web services can impact your server-side
processing is the experience report of a large European
eCommerce site: Business Impact of 3rd Party Address Validation Service.
Conclusion: Performance Problems Are Avoidable
I hope this analysis gave you some ideas or pointed you to areas you
haven't thought about targeting when it comes to optimizing your web
site performance. Most of these problems are easily avoidable.
If you need further information, I leave you with some additional links to
blogs we did in the past that show how to optimize real-life application
performance:
Top 10 Client-side Performance Problems
Top 10 Server-side Performance Problems taken from Zappos, Monster,
Thomson and Co
Real life page analysis on: US Open 2011, Masters, Yahoo News
For more, check out the other blogs we have on web performance and
mobile performance.
Week 44
Performance of a Distributed Key Value Store,
or Why Simple is Complex
by Michael Kopp
Last time I talked about the key differences between RDBMS and the most
important NoSQL databases. The key reason why NoSQL databases can
scale the way they do is that they shard based on the entity. The simplest
form of NoSQL database shows this best: the distributed key/value store.
Last week I got the chance to talk to one of the Voldemort developers at
LinkedIn. Voldemort is a pure Dynamo implementation; we discussed its
key characteristics and we also talked about some of the problems. In a
funny way its biggest problem is rooted in its very cleanliness, simplicity
and modular architecture.
Performance of a Key/Value Store
A key/value store has a very simple API. All there really is to it is a put, a
get and a delete. In addition Voldemort, like most of its cousins, supports a
batch get and a form of an optimistic batch put. That very simplicity makes
the response time very predictable and should also make performance
analysis and tuning rather easy. After all one call is like the other and there
are only so many factors that a single call can be impacted by:
I/O in the actual store engine on the server side (this is pluggable in
Voldemort, but Berkeley DB is the default)
Network I/O to the Voldemort instance
Cache hit rate
Load distribution
Garbage collection
Both disk and network I/O are driven by data size (key and value) and load.
Voldemort is a distributed store that uses Dynamo's consistent hashing;
as such the load distribution across multiple nodes can vary based on
key hotness. Voldemort provides a very comprehensive JMX interface to
monitor these things. At first glance this looks rather easy. But at
second glance Voldemort is a perfect example of why simple systems
can be especially complex to monitor and analyze. Let's talk about the
distribution factor and the downside of a simple API.
Performance of Distributed Key/Value Stores
Voldemort, like most distributed key/value stores, does not have a master.
This is good for scalability and fail-over but means that the client has a little
more work to do. Even though Voldemort (and most of its counterparts) does
support server side routing, usually the client communicates with all server
instances. If we make a put call it will communicate with a certain number
of instances that hold a replica of the key (the number is configurable). In a
put scenario it will do a synchronous call to the first node. If it gets a reply
it will call the remainder of the required nodes in parallel and wait for the reply.
A get request, on the other hand, will call the required number of nodes in
parallel right away.
Transaction ow of a single benchmark thread showing that it calls both
instances every time
What this means is that the client performance of Voldemort is not only
dependent on the response time of a single server instance. It actually
depends on the slowest one or, in case of a put, the slowest plus one other.
This can hardly be monitored via the JMX metrics of the Voldemort instances. Let's
understand why.
What the Voldemort server sees is a series of put and get calls. Each and
every one can be measured. But we are talking about a lot of them and
what we get is moving averages and maximums via JMX:
Average and maximum get and put latency as measured on the Voldemort
instance
Voldemort also comes with a small benchmarking tool which reports client-
side performance of the executed test:
[reads] Operations: 899
[reads] Average(ms): 11.3326
[reads] Min(ms): 0
[reads] Max(ms): 1364
[reads] Median(ms): 4
[reads] 95th(ms): 13
[reads] 99th(ms): 70
[transactions] Operations: 101
[transactions] Average(ms): 74.8119
[transactions] Min(ms): 6
[transactions] Max(ms): 1385
[transactions] Median(ms): 18
[transactions] 95th(ms): 70
[transactions] 99th(ms): 1366
Two facts stick out. The client-side average performance is a lot worse than
reported by the server side. This could be due to the network or to the fact
that we have an average of the slower call every time instead of the overall
average (remember, we call multiple server instances for each read/write).
The second important piece of data is the relatively high volatility. Neither
of the two can be explained by looking at the server side metrics! The
performance of a single client request depends on the response time of
the replicas that hold the specific key. In order to get an understanding
of client-side performance we would need to aggregate response time on
a per key and per instance basis. The volume of statistical data would be
rather large. Capturing response times for every single key read and write is
a lot to capture but, more to the point, analyzing it would be a nightmare.
What's even more important: the key for the key/value store alone might
tell us which key range is slow, but not why. It is not actionable.
As we have often explained, context is important for performance
monitoring and analysis. In case of a key/value store the context of the API
alone is not enough and the context of the key is far too much and not
actionable. This is the downside of a simple API. The distributed nature only
makes this worse as key hotness can lead to an uneven distribution in your
cluster.
Client Side Monitoring of Distributed Key Value Stores
To keep things simple I used the performance benchmark that comes with
Voldemort to show things from the client side.
Single benchmark transaction showing the volatility of single calls to Voldemort
As we can see the client does indeed call several Voldemort nodes in
parallel and has to wait for all of them (at least in my example) to return. By
looking at things from the client side we can understand why some client
functionality has to wait for Voldemort even though server side statistics
would never show that. Furthermore we can show the contribution of
Voldemort operations overall, or of a specific Voldemort instance, to a particular
transaction type. In the picture we see that Voldemort (at least end-to-end)
contributes 3.7% to the response time of doit. We also see that the vast
majority is in the put calls of the applyUpdate. And we also see that the
response time of the nodes in the put calls varies by a factor of three!
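This kind of client side view can also be approximated by timing the store calls where they are made in your own code; the sketch below uses the Voldemort client API as shown in its quickstart documentation (bootstrap URL and store name are placeholders), so treat it as an assumption about the API rather than a drop-in example.

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class ClientSideTiming {

    public static void main(String[] args) {
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://voldemort-host:6666"));
        StoreClient<String, String> client = factory.getStoreClient("journeys");

        // Time the call as the application sees it: this includes routing to the
        // replicas and waiting for the slowest required one, which the JMX metrics
        // of any single server instance will never show.
        long start = System.nanoTime();
        Versioned<String> value = client.get("some-key");
        long clientSideMicros = (System.nanoTime() - start) / 1000;

        System.out.println("get took " + clientSideMicros + " us, value = "
                + (value == null ? null : value.getValue()));
    }
}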
Identifying the Root Cause of Key Hotness
According to a Voldemort expert, there are two key issues that are hard to
track, analyze and monitor with Voldemort. The first is key hotness. Key
hotness is a key problem for all Dynamo implementations. If a certain key
range is requested or written much more often than others it can lead to
an over-utilization of specific nodes while others are idle.
It is very hard to determine which keys are hot at any given time and
why. If the application is mostly user-driven it might be nigh impossible to
predict up front. One way to overcome this is to correlate end user business
transactions with the triggered Voldemort load and response time. The
idea is that an uneven load distribution on your distributed key/value store
should be triggered by one of three scenarios:
All the excessive load is triggered by the same application
functionality
This is pretty standard and means that the keys that you use in that
functionality are either not evenly spread, monotonically increasing or
otherwise unbalanced.
All the excessive load is triggered by a certain end user group or a
specific dimension of a business transaction
One example would be that the user group is part of the key(s)
and that user group is much more active than usual or others.
Restructuring the key might help to make it more diverse.
Another example is that you are accessing data sets like city
information and for whatever reason New York, London and Vienna
are accessed much more often than anything else (e.g. more people
book a trip to these three cities than to anywhere else).
A combination of the above
Either the same data set is accessed by several different business
transactions (in which case you need a cross cut) or the same data
structure is accessed by the same business transaction.
The key factor is that all this can be identified by tracing your application
and monitoring it via your business transactions. The number of discrete
business transactions and their dimensions (booking per location, search
by category) is smaller than the number of keys you use in your store.
More importantly, it is actionable! The fact that 80% of the load on your
6 overloaded store instances results from the business transaction "search
books thriller" enables you to investigate further. You might change the
structure of the keys, optimize the access pattern or set up a separate store
for the specific area if necessary.
Identifying Outliers
The second area of issues that are hard to track down is outliers. These
are often considered to be environmental factors. Again, JMX metrics aren't
helping much here, but taking a look at the internals quickly reveals what
is happening:
PurePath showing the root cause of an outlier
In my load test of two Voldemort instances (admittedly a rather small
cluster) the only outliers were instantly tracked down to synchronization
issues within the chosen store engine: Berkeley DB. What is interesting is
that I could see that all requests to the particular Voldemort instance that
happened during that time frame were similarly blocked in Berkeley DB.
Seven were waiting for the lock in that synchronized block and the 8th was
blocking the rest while waiting for the disk.
Hotspot showing where 7 transactions were all waiting for a lock in Berkeley DB
The root cause for the lock was that a delete had to wait for the disk
This issue happened randomly, always on the same node and was unrelated
to concurrent access of the same key. By seeing both the client side (which
has to wait and is impacted), the corresponding server side (which shows
where the problem is) and having all impacted transactions (in this case I
had 8 transactions going to Voldemort1 which were all blocked) I was able
to pinpoint the offending area of code immediately.
Granted, as a Voldemort user who doesn't want to dig into Berkeley
DB I cannot fix it, but it does tell me that the root cause for the long
synchronization block is disk wait, and I can work with that.
Conclusion
Key/value stores like Voldemort are usually very fast and have very
predictable performance. The key to this is the very clean and simple
interface (usually get and put) that does not allow for much volatility in
terms of execution path on the server side. This also means that they are
much easier to configure and optimize, at least as far as the speed of a single
instance goes.
However this very simplicity can also be a burden when trying to understand
end user performance, contribution and outliers. In addition, even simple
systems become complex when you make them distributed, add more and
more instances and make millions of requests to them. Luckily the solution is
easy: focus on your own application and its usage of the key/value store
instead of on the key/value store alone.
Week 45
Pagination with Cassandra, And What We Can
Learn from It
by Michael Kopp
Like everybody else it took me a while to wrap my head around the BigTable
concepts in Cassandra. The brain needs some time to accept that a column
in Cassandra is really not the same as a column in our beloved RDBMS.
After that I wrote my first web application with it and ran into a pretty
typical problem: I needed to list a large number of results and needed to
page this for my web page. And like many others I ran straight into the next
wall. Only this time it was not really Cassandra's fault, so I thought I would
share what I found.
Pagination in the Table Oriented World
In the mind of every developer there is a simple solution for paging. You
add a sequence column to the table that is monotonically increasing and
use a select like the following:
select * from my_data where sq_num between 26 and 50
This would get me exactly 25 rows. It is fast too, because I made sure the sq_num
column had an index attached to it. Now on the face of it this sounds easy,
but you run into problems quickly. Almost every use case requires the result
to be sorted by some of the columns. In addition the data would not be
static, but be inserted to and possibly updated all the time. Imagine you are
returning a list of names, sorted by first name. The sq_num approach will not
work because you cannot re-sequence large amounts of data every time.
But luckily databases have a solution for that. You can do crazy selects like
the following:
select name, address
from (
  select rownum r, name, address
  from (
    select name, address from person order by name
  )
)
where r > 25
  and r <= 50;
It looks crazy, but is actually quite fast on Oracle (and I think SQL Server too)
as it is optimized for it. Although all databases have similar concepts, most
don't do so well in terms of performance. Often the only thing possible
with acceptable performance is to limit the number of returned rows. Offset
queries, as presented here, incur a severe performance overhead. With that
in mind I tried to do the same for Cassandra.
Pagination in Cassandra
I had a very simple use case. I stored a list of journeys on a per tenant basis
in a column family. The name of the journey was the column name and
the value was the actual journey. So getting the first 25 items was simple.
get_slice(key : tenant_key,
          column_parent : { column_family : 'Journeys_by_Tenant' },
          predicate : {
            slice_range : { start : 'A',
                            end : 'Z',
                            reverse : false,
                            count : 25 }
          })
But like so many others I got stuck here. How to get the next 25 items? I looked,
but there was no offset parameter, so I checked Doctor Google and the
first thing I found was: don't do it! But after some more reading I found
the solution and it is very elegant indeed - more so than what I was doing
in my RDBMS, and best of all it is applicable to RDBMS!
The idea is simple: instead of using a numeric position and a counter you
simply remember the last returned column name and use it as a starting
point in your next request. So if the first result returned a list of journeys
and the 25th was 'Bermuda' then the next button would execute the
following:
get_slice(key : tenant_key,
          column_parent : { column_family : 'Journeys_by_Tenant' },
          predicate : {
            slice_range : { start : 'Bermuda',
                            end : 'Z',
                            reverse : false,
                            count : 26 }
          })
You will notice that I now retrieve 26 items. This is because start and end
are inclusive and I will simply ignore the first item in the result. Sounds
super, but how to go backwards? It turns out that is also simple: you use
the first result of your last page and execute the following:
get_slice(key : tenant_key,
          column_parent : { column_family : 'Journeys_by_Tenant' },
          predicate : {
            slice_range : { start : 'Bermuda',
                            end : 'A',
                            reverse : true,
                            count : 26 }
          })
The reverse attribute will tell get_slice to go backwards. What's important
is that the end of a reverse slice must be before the start. Done! Well, not
quite. Having a First and a Last button is no problem (simply use a reverse
slice starting with 'Z' for the last page), but if, like many web pages, you want to
have direct jumpers to the page numbers, you will have to add some ugly
cruft code.
However, you should ask yourself how useful it is to jump to page 16, really!
There is no contextual meaning of the 16th page. It might be better to add
bookmarks like A, B, C instead of direct page numbers.
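The bookkeeping around this pattern (fetch one extra column, drop the duplicate, remember both ends of the current page) is easy to get wrong, so here is a small sketch of it in Java; fetchSlice stands in for whatever client call issues the get_slice requests shown above and is not a real Cassandra API, and error handling for empty results is omitted.

import java.util.List;

public class ColumnPager {

    /** Placeholder for the client call that issues the get_slice requests shown above. */
    public interface SliceFetcher {
        List<String> fetchSlice(String start, String end, boolean reversed, int count);
    }

    private final SliceFetcher fetcher;
    private final int pageSize;

    public ColumnPager(SliceFetcher fetcher, int pageSize) {
        this.fetcher = fetcher;
        this.pageSize = pageSize;
    }

    public List<String> firstPage() {
        return fetcher.fetchSlice("A", "Z", false, pageSize);
    }

    /** Next page: start at the last column of the current page, fetch one extra, drop the duplicate. */
    public List<String> nextPage(List<String> currentPage) {
        String lastColumn = currentPage.get(currentPage.size() - 1);
        List<String> slice = fetcher.fetchSlice(lastColumn, "Z", false, pageSize + 1);
        return slice.subList(1, slice.size());
    }

    /** Previous page: slice backwards from the first column; the result is in reverse order. */
    public List<String> previousPage(List<String> currentPage) {
        String firstColumn = currentPage.get(0);
        List<String> slice = fetcher.fetchSlice(firstColumn, "A", true, pageSize + 1);
        return slice.subList(1, slice.size()); // caller reverses this list for display
    }
}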
Applying This to RDBMS?
The pagination concept found in Cassandra can be applied to every
RDBMS. For the first select simply limit the number of returned rows either by
ROWNUM, LIMIT or similar (you might also use the JDBC API).
select name, address
from person
order by name
fetch first 25 rows only;
For the next call, we can apply what we learned from Cassandra:
select name, address
from person
where name > 'Andreas'
order by name
fetch first 25 rows only;
If we want to apply this to the previous button it will look like this:
select name, address
from person
where name < 'Michael'
order by name desc
fetch first 25 rows only;
For the Last button simply omit the where clause. The advantage? It is far
more portable than offset selects - virtually every database will support
it. It should also perform rather well, as long as you have an index on
the name column (the one that you sort by). Finally, there is no need to
have a counter column!
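In JDBC terms the query for the next button could look like the following sketch; the person table and its columns are the ones from the SQL above, and depending on your database the fetch first clause may need to be replaced by LIMIT or a ROWNUM construct.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class KeysetPagination {

    /** Fetches the page of names that follows the last name of the previous page. */
    public static List<String> nextPage(Connection con, String lastName, int pageSize)
            throws SQLException {
        String sql = "select name, address from person "
                + "where name > ? order by name fetch first " + pageSize + " rows only";
        List<String> names = new ArrayList<String>();
        try (PreparedStatement stmt = con.prepareStatement(sql)) {
            stmt.setString(1, lastName);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString("name"));
                }
            }
        }
        return names;
    }
}

Statement.setMaxRows or a LIMIT clause works just as well to cap the result size; the important part is the where name > ? condition that replaces the offset.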
Conclusion
NoSQL databases challenge us because they require some rewiring of our
RDBMS-trained brain. However some of the things we learn can also make
our RDBMS applications better.
Of course you can always do even better and build pagination into your
API. Amazon's SimpleDB is doing that, but more on SimpleDB later. Stay
tuned.
Week 46
The Top Java Memory Problems Part 2
by Michael Kopp
Some time back I planned to publish a series about Java memory problems.
It took me longer than originally planned, but here is the second installment.
In the first part I talked about the different causes for memory leaks,
but memory leaks are by no means the only issue around Java memory
management.
High Memory Usage
It may seem odd, but too much memory usage is an increasingly frequent
and critical problem in today's enterprise applications. Although the average
server often has 10, 20 or more GB of memory, a high degree of parallelism
and a lack of awareness on the part of the developer leads to memory
shortages. Another issue is that while it is possible to use multiple gigabytes
of memory in today's JVMs, the side effect is very long garbage collection
(GC) pauses. Sometimes increasing the memory is seen as a workaround
for memory leaks or badly-written software. More often than not this makes
things worse in the long run, not better. These are the most common
causes of high memory usage.
HTTP Session as Cache
The session caching anti-pattern refers to the misuse of the HTTP session
as data cache. The HTTP session is used to store user data or a state that
needs to survive a single HTTP request. This is referred to as conversational
state and is found in most web applications that deal with non-trivial user
interactions. The HTTP session has several problems. First, as we can have
many users, a single web server can have quite a lot of active sessions, so
it is important to keep them small. The second problem is that they are
not specifically released by the application at a given point. Instead, web
servers have a session timeout which is often quite high to increase user
comfort. This alone can easily lead to quite large memory demands if we
consider the number of parallel users. However in reality we often see HTTP
sessions multiple megabytes in size.
These so-called session caches happen because it is easy and convenient
for the developer to simply add objects to the session instead of thinking
about other solutions like a cache. To make matters worse this is often done
in a fire-and-forget mode, meaning data is never removed. After all, why
should you? The session will be removed after the user has left the page
anyway (or so we may think). What is often ignored is that session timeouts
of 30 minutes to several hours are not unheard of.
A practical example is the storage of data that is displayed in HTML selection
fields (such as country lists). This semi-static data is often multiple kilobytes
in size and is held per user in the heap if kept in the session. It is better
to store this data - which moreover is not user-specific - in one central cache.
Another example is the misuse of the Hibernate session to manage the
conversational state. The Hibernate session is stored in the HTTP session
in order to facilitate quick access to data. This means storage of far more
state than necessary, and with only a couple of users, memory usage
immediately increases greatly. In modern Ajax applications, it may also be
possible to shift the conversational state to the client. In the ideal case, this
leads to a stateless or state-poor server application that scales much better.
Another side effect of big HTTP sessions is that session replication becomes
a real problem.
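To make the country list example concrete, here is a sketch of the anti-pattern next to a shared alternative; the class, attribute and repository names are made up for illustration.

import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import javax.servlet.http.HttpSession;

public class CountryListExample {

    /** Loads the semi-static list from the database or a file; a placeholder interface. */
    public interface CountryRepository {
        List<String> loadCountries();
    }

    // Anti-pattern: every user session carries its own copy of the same semi-static data,
    // multiplied by the number of active sessions until the session times out.
    public void storePerUser(HttpSession session, List<String> countries) {
        session.setAttribute("countries", countries);
    }

    // Better: one shared, read-only copy for all users in a central cache.
    private static final ConcurrentMap<String, List<String>> SHARED_CACHE =
            new ConcurrentHashMap<String, List<String>>();

    public List<String> countries(CountryRepository repository) {
        List<String> cached = SHARED_CACHE.get("countries");
        if (cached == null) {
            List<String> loaded = Collections.unmodifiableList(repository.loadCountries());
            cached = SHARED_CACHE.putIfAbsent("countries", loaded);
            if (cached == null) {
                cached = loaded;
            }
        }
        return cached;
    }
}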
Incorrect Cache Usage
Caches are used to increase performance and scalability by loading data only once. However, excessive use of caches can quickly lead to performance problems. In addition to the typical problems of a cache, such as misses and high turnaround, a cache can also lead to high memory usage and, even worse, to excessive GC behavior. Mostly these problems are simply due to an excessively large cache. Sometimes, however, the problem lies deeper. The key word here is the so-called soft reference. A soft reference is a special form of object reference. Soft references can be released at any time at the discretion of the garbage collector. In reality, however, they are released only to avoid an out-of-memory error. In this respect, they differ greatly from weak references, which never prevent the garbage collection of an object. Soft references are very popular in cache implementations for precisely this reason. The cache developer assumes, correctly, that the cache data is to be released in the event of a memory shortage. If the cache is incorrectly configured, however, it will grow quickly and indefinitely until memory is full. When a GC is initiated, all the soft references in the cache are cleared and their objects garbage collected. The memory usage drops back to the base level, only to start growing again. This phenomenon can easily be mistaken for an incorrectly configured young generation. It looks as if objects get tenured too early, only to be collected by the next major GC. This kind of problem often leads to a GC tuning exercise that cannot succeed.
Only proper monitoring of the cache metrics or a heap dump can help
identify the root cause of the problem.
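For illustration, a minimal sketch of such a soft-reference cache; without any size limit or eviction it behaves exactly as described above:

import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;

public class SoftCache<K, V> {

    // Values are only softly reachable: the JVM keeps them as long as memory
    // allows and clears them, typically all at once, under memory pressure.
    private final ConcurrentHashMap<K, SoftReference<V>> map =
        new ConcurrentHashMap<K, SoftReference<V>>();

    public void put(K key, V value) {
        map.put(key, new SoftReference<V>(value));
    }

    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        return ref == null ? null : ref.get(); // null once the GC cleared it
    }

    // Note: map entries are never removed here, which is exactly the kind
    // of misconfiguration that lets the cache grow until memory is full.
}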
Churn Rate and High Transactional Memory Usage
Java allows us to allocate a large number of objects very quickly. The
generational GC is designed for a large number of very short-lived objects,
but there is a limit to everything. If transactional memory usage is too high,
it can quickly lead to performance or even stability problems. The difficulty here is that this type of problem comes to light only during a load test and can be overlooked very easily during development.
If too many objects are created in too short a time, this naturally leads to
an increased number of GCs in the young generation. Young generation
GCs are only cheap if most objects die! If a lot of objects survive the GC it is
actually more expensive than an old generation GC would be under similar
circumstances! Thus high memory needs of single transactions might not
be a problem in a functional test but can quickly lead to GC thrashing
under load. If the load becomes even higher these transactional objects
will be promoted to the old generation as the young generation becomes
too small. Although one could approach this by increasing the size of the young generation, in many cases this simply pushes the problem a little further out and ultimately leads to even longer GC pauses (because more objects are alive at the time of the GC).
The worst of all possible scenarios, which we nevertheless see quite often, is an out-of-memory error due to high transactional memory demand. If memory is already tight, higher transaction load might simply max out the available heap. The tricky part is that once the OutOfMemoryError hits, transactions that wanted to allocate objects but couldn't are aborted. Subsequently a lot of memory is released and garbage collected. In other words, the very reason for the error is hidden by the OutOfMemoryError itself! As most memory tools only look at the Java memory every couple of seconds, they might not even show 100% memory usage at any point in time.
Since Java 6 it is possible to trigger a heap dump in the event of an OutOfMemoryError (via the -XX:+HeapDumpOnOutOfMemoryError flag), which will show the root cause quite nicely in such a case. If there is no OutOfMemoryError, one can use trending or -histo memory dumps (check out jmap or dynaTrace) to identify those classes whose object numbers fluctuate the most. Those are usually classes that are allocated and garbage collected a lot. The last resort is to do a full-scale allocation analysis.
Large Temporary Objects
In extreme cases, temporary objects can also lead to an out-of-memory error or to increased GC activity. This happens, for example, when very large documents (XML, PDF) have to be read and processed. In one specific case, an application was temporarily unavailable for a few minutes due to such a problem. The cause was quickly found to be memory bottlenecks and garbage collection that was operating at its limit. In a detailed analysis, it was possible to pin down the cause to the creation of a PDF document:
byte tmpData[] = new byte[1024];
int offs = 0;
do {
    // read into the remaining space of the current buffer
    int readLen = bis.read(tmpData, offs, tmpData.length - offs);
    if (readLen == -1)
        break;
    offs += readLen;
    if (offs == tmpData.length) {
        // buffer is full: grow it by only 1 KB and copy everything read so far
        byte newres[] = new byte[tmpData.length + 1024];
        System.arraycopy(tmpData, 0, newres, 0, tmpData.length);
        tmpData = newres;
    }
} while (true);
To the seasoned developer it will be quite obvious that processing multiple megabytes with such code leads to bad performance due to a lot of unnecessary allocations and ever-growing copy operations. However, a lot of the time such a problem is not noticed during testing, but only once a certain level of concurrency is reached where the number of GCs and/or the amount of temporary memory needed becomes a problem.
When working with large documents, it is very important to optimize the processing logic and to avoid holding the entire document in memory.
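As a minimal sketch of a less allocation-hungry variant, assuming the data really must end up in a single byte array (streaming it directly to its destination would be better still):

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class DocumentReader {

    // Reads the stream through a reusable chunk buffer into a
    // ByteArrayOutputStream, which grows geometrically instead of copying
    // the complete payload for every additional kilobyte.
    static byte[] readFully(InputStream in) throws IOException {
        BufferedInputStream bis = new BufferedInputStream(in);
        ByteArrayOutputStream out = new ByteArrayOutputStream(64 * 1024);
        byte[] chunk = new byte[8 * 1024];
        int readLen;
        while ((readLen = bis.read(chunk)) != -1) {
            out.write(chunk, 0, readLen);
        }
        return out.toByteArray();
    }
}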
Memory-related Class Loader Issues
Sometimes I think that the class loader is to Java what DLL hell was to Windows. When there are memory problems, one thinks primarily of objects that are located in the heap and occupy memory. In addition to normal objects, however, classes and constant values are also administered in the heap. In modern enterprise applications, the memory requirements for loaded classes can quickly amount to several hundred MB, and thus often contribute to memory problems. In the Hotspot JVM, classes are located in the so-called permanent generation or PermGen. It represents a separate memory area, and its size must be configured separately. If this area is full, no more classes can be loaded and an out-of-memory error occurs in the PermGen. The other JVMs do not have a permanent generation, but that does not solve the problem; it is merely recognized later. Class loader problems are some of the most difficult problems to detect. Most developers never have to deal with this topic, and tool support is also poorest in this area. I want to show some of the most common memory-related class loader problems:
Large Classes
It is important not to increase the size of classes unnecessarily. This is especially the case when classes contain a great many string constants, such as in GUI applications. Here all strings are held in constants. This is basically a good design approach; however, it should not be forgotten that these constants also require space in memory. On top of that, in the case of the Hotspot JVM, string constants are part of the PermGen, which can then quickly become too small. In one concrete case the application had a separate class for every language it supported, where each class contained every single text constant. Each of these classes by itself was actually too large already. Due to a coding error that happened in a minor release, all languages, meaning all classes, were loaded into memory. The JVM crashed during startup no matter how much memory was given to it.
Same Class in Memory Multiple Times
Application servers and OSGi containers especially tend to have a problem with too many loaded classes and the resulting memory usage. Application servers make it possible to load different applications or parts of applications in isolation from one another. One feature is that multiple versions of the same class can be loaded in order to run different applications inside the same JVM. Due to incorrect configuration this can quickly double or triple the amount of memory needed for classes. One of our customers had to run his JVMs with a PermGen of 700MB, a real problem since he ran it on 32-bit Windows, where the maximum overall JVM size is 1500MB. In this case the SOA application was loaded in a JBoss application server. Each service was loaded into a separate class loader without sharing the common classes. All common classes, about 90% of them, were loaded up to 20 times, and thus regularly led to out-of-memory errors in the PermGen area. The solution here was strikingly simple: proper configuration of the class loading behavior in JBoss.
The interesting point here is that it was not just a memory problem, but a major performance problem as well! The different applications did use the same classes, but as they came from different class loaders, the server had to view them as different. The consequence was that a call from one component to the next, inside the same JVM, had to serialize and deserialize all argument objects.
This problem can best be diagnosed with a heap dump or trending dump (jmap -histo). If a class is loaded multiple times, its instances are also counted multiple times. Thus, if the same class appears multiple times with a different number of instances, we have identified such a problem. The class loader responsible can be determined in a heap dump through simple reference tracking. We can also take a look at the variables of the class loader and, in most cases, will find a reference to the application module and the .jar file. This makes it possible to determine whether the same .jar file is being loaded multiple times by different application modules.
Same Class Loaded Again and Again
A rare phenomenon, but a very large problem when it occurs, is the repeated loading of the same class, which does not appear to be present twice in memory. What many forget is that classes are garbage collected too, in all three large JVMs. The Hotspot JVM does this only during a major GC, whereas both IBM and JRockit can do so during every GC. Therefore, if a class is used for only a short time, it can be removed from memory again immediately. Loading a class is not exactly cheap and usually not optimized for concurrency. If the same class is loaded by multiple threads, Java synchronizes these threads. In one real-world case, the classes of a scripting framework (BeanShell) were loaded and garbage collected repeatedly because they were used for only a short time and the system was under load. Since this took place in multiple threads, the class loader was quickly identified as the bottleneck once analyzed under load. However, the development took place exclusively on the Hotspot JVM, so this problem was not discovered until it was deployed in production.
In the case of the Hotspot JVM this specific problem will only occur under load and memory pressure, as it requires a major GC, whereas in the IBM JVM or JRockit it can already happen under moderate load. The class might not even survive the first garbage collection!
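A first hint of such churn can be taken from the standard class loading MXBean; a minimal sketch:

import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassLoadWatcher {
    public static void main(String[] args) throws InterruptedException {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        while (true) {
            // A steadily rising unloaded count together with a stable loaded
            // count hints at the same classes being loaded again and again.
            System.out.println("loaded=" + cl.getLoadedClassCount()
                + " totalLoaded=" + cl.getTotalLoadedClassCount()
                + " unloaded=" + cl.getUnloadedClassCount());
            Thread.sleep(10000);
        }
    }
}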
Incorrect Implementation of Equals and Hashcode
The relationship between the hashCode method and memory problems is not obvious at first glance. However, if we consider where the hashCode method is of high importance this becomes clearer. The hashCode and equals methods are used within hash maps to insert and find objects based on their key. However, if the implementation of these methods is faulty, existing entries are not found and new ones keep being added.
While the collection responsible for the memory problem can be identified very quickly, it may be difficult to determine why the problem occurs. We have had this case with several customers. One of them had to restart his server every couple of hours even though it was configured to run at 40GB! After fixing the problem they ran quite happily with 800MB.
A heap dump, even if complete information on the objects is available, rarely helps in this case. One would simply have to analyze too many objects to identify the problem. In this case, the best variant is to test the comparison methods proactively, in order to avoid such problems. There are a few free frameworks (such as http://code.google.com/p/equalsverifier/) that ensure that equals and hashCode conform to the contract.
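A minimal sketch of the failure mode, using a hypothetical key class that overrides equals but not hashCode:

import java.util.HashMap;
import java.util.Map;

class CacheKey {
    private final String id;

    CacheKey(String id) { this.id = id; }

    // equals is overridden, but the matching hashCode is missing, so two
    // logically equal keys usually end up in different hash buckets.
    public boolean equals(Object o) {
        return o instanceof CacheKey && ((CacheKey) o).id.equals(id);
    }
}

public class BrokenKeyDemo {
    public static void main(String[] args) {
        Map<CacheKey, String> cache = new HashMap<CacheKey, String>();
        // The "same" logical entry is added again and again because the map
        // never finds the existing one; memory grows with every call.
        for (int i = 0; i < 3; i++) {
            cache.put(new CacheKey("user-42"), "payload");
        }
        System.out.println(cache.size()); // prints 3 instead of the expected 1
    }
}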
Conclusion
High memory usage is still one of the most frequent problems that we see, and it often has performance implications. However, most of them can be identified rather quickly with today's tools. In the next installment of this series I will talk about how to tune your GC for optimal performance, provided you do not suffer from memory leaks or the problems mentioned in this blog post.
You might also want to read my other memory blogs:
The Top Java Memory Problems Part 1
How Garbage Collection differs in the three big JVMs
Major GCs - Separating Myth from Reality
The impact of Garbage Collection on Java performance
Week 47
How to Manage the Performance of 1000+
JVMs
by Michael Kopp
Most production monitoring systems I have seen have one major problem:
There are too many JVMs, CLRs and hosts to monitor.
One of our bigger customers (and a Fortune 500 company) mastered the
challenge by concentrating on what really matters: the applications!
Ensure Health
The following dashboard is taken directly from the production environment
of that customer:
High-level transaction health dashboard that shows how many transactions
are performing badly
What it does is pretty simple: it shows the load of transactions in their two data centers. The first two charts show the transaction load over different periods of time. The third shows the total execution time of all those transactions. If the execution time goes up but the transaction count does not, they know they have a bottleneck to investigate further. The pie charts to the right show the same information in a collapsed form. The color coding indicates the health of the transactions. Green ones have a response time below a second while red ones are over 3 seconds. In case of an immediate problem the red area in the five-minute pie chart grows quickly and they know they have to investigate.
The interesting thing is that instead of looking at the health of hosts or databases, the primary indicators they use for health are their end users' experience and business transactions. If the number of yellow or red transactions increases they start troubleshooting. The first lesson we learn from that is to measure health in terms that really matter to your business and end users. CPU and memory utilization do not matter to your users; response time and error rates do.
Define Your Application
Once they detect a potential performance or health issue they first need to isolate the problematic application. This might sound simple, but they have hundreds of applications running in over 1000 JVMs in this environment. Each application spans several JVMs plus several C++ components. Each transaction in turn flows through a subset of all these processes. Identifying the application responsible is important to them and for that purpose they have defined another simple dashboard that shows the applications that are responsible for the red transactions:
This dashboard shows which business transactions are the slowest and which
are very slow most often
They are using dynaTrace business transaction technology to trace and identify all their transactions. This allows them to identify which specific business transactions are slow and which of them are slow most often. They actually show this on a big screen for all to see. So not only does Operations have an easy time identifying the team responsible, most of the time that team already knows by the time they get contacted!
This is our second lesson learned: know and measure your application(s) first! This means:
You define and measure performance at the unique entry point to the application/business transaction
You know or can dynamically identify the resources, services and JVMs used by that application and measure those
Measure Your Application and its Dependencies
Once the performance problem or an error is identified, the real fun begins, as they need to identify where the problem originates in the distributed system. To do that we need to apply the knowledge that we have about the application and to measure the response time on all involved tiers. The problem might also lie between two tiers, in the database or with an external service you call. You should not only measure the entry points but also the exit points of your services. In large environments, like the one in question, it is not possible to know all the dependencies upfront. Therefore we need the ability to automatically discover the tiers and resources used instead.
Show the transaction flow of a single business transaction type
At this point, we can isolate the fault domain down to the JVM or database level. The logical next step is to measure the things that impact the application on those JVMs/CLRs. That includes the resources we use and the third party services we call. But in contrast to the usual utilization-based monitoring, we are interested in metrics that reflect the impact these resources have on our application. For example: instead of only monitoring the connection usage of a JDBC connection pool it makes much more sense to look at the average wait duration and the number of threads waiting for a connection. These metrics represent the direct impact the resource pool has. The usage, on the other hand, explains why a thread is waiting, but 100% usage does not imply that a thread is waiting! The downside of normal JMX-based monitoring of resource measures is that we still cannot directly relate their impact to a particular type of transaction or service. We can only do that if we measure the connection acquisition directly from within the service. This is similar to measuring the client side and server side of a service call. The same thing can be applied to the execution of database statements itself. Our Fortune 500 company is doing exactly that and found that their worst-performing application is executing the following statements quite regularly:
This shows that a statement that takes 7 seconds on average is executed
regularly
While you should generally avoid looking at top 10 reports for analysis, in
this case it is clear that the statements were at the core of their performance
problem.
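To illustrate the earlier point about measuring connection acquisition from within the service, here is a minimal sketch; the threshold and the log output are placeholders for whatever your monitoring solution provides:

import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

public final class ConnectionTimer {

    // Measures how long the current transaction waits for a pooled
    // connection; this reflects real impact far better than a pool
    // utilization percentage.
    public static Connection getTimedConnection(DataSource ds) throws SQLException {
        long start = System.nanoTime();
        Connection con = ds.getConnection();
        long waitedMs = (System.nanoTime() - start) / 1000000L;
        if (waitedMs > 100) {
            // Hypothetical threshold; in practice this value would be fed
            // into the monitoring system together with the transaction context.
            System.out.println("Waited " + waitedMs + " ms for a JDBC connection");
        }
        return con;
    }

    private ConnectionTimer() { }
}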
Finally we also measure CPU and memory usage of a JVM/CLR. But we again look at the application as the primary context. We measure CPU usage of a specific application or type of transaction. It is important to remember that an application in the context of SOA is a logical entity and cannot be identified by its process or its class alone. It is the runtime context, e.g. the URI or the SOAP message, that defines the application. Therefore, in order to find the applications responsible for CPU consumption, we measure it on that level. Measuring memory on a transaction level is quite hard and maybe not worth the effort, but we can measure the impact that garbage collection (GC) has. The JVM TI interface informs us whenever a GC suspends the application threads. This can be directly related to the response time impact on the currently executing transactions or applications. Our customer uses such a technique to investigate the transactions that consume the most CPU time or are impacted the most by garbage collection:
Execution time spent in fast, slow and very slow transactions compared with
their respective volume
This dashboard shows them that, although most execution time is spent
in the slow transactions, they only represent a tiny fraction of their overall
transaction volume. This tells them that much of their CPU capacity is spent
in a minority of their transactions. They use this as a starting point to go
after the worst transactions. At the same time it shows them on a very high
level how much time they spend in GC and if it has an impact. This again
lets them concentrate on the important issues.
All this gives them a fairly comprehensive, yet still manageable, picture
of where the application spends time, waits and uses resources. The only
thing that is left to do is to think about errors.
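Even without a full APM solution, the raw GC suspension trend can be approximated with the standard JMX beans; a minimal sketch:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcImpactProbe {
    public static void main(String[] args) throws InterruptedException {
        long lastTotal = 0;
        while (true) {
            long total = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                total += gc.getCollectionTime(); // accumulated milliseconds spent in GC
            }
            // Relating this interval to the transactions running at that moment
            // requires transactional tracing, but the raw trend alone is useful.
            System.out.println("GC time in the last 10s: " + (total - lastTotal) + " ms");
            lastTotal = total;
            Thread.sleep(10000);
        }
    }
}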
Monitoring for Errors
As mentioned before, most error situations need to be put into the context of their trigger in order to make sense. As an example: if we get an exception telling us that a particular parameter for a web service is invalid, we need to know how that parameter came into being. In other words, we want to know which other service produced that parameter, or whether the user entered something wrong which should have been validated on the screen already. Our customer is doing the reverse, which also makes a lot of sense. They have the problem that their clients are calling them and complaining about poor performance or errors happening. When a client calls them, they use a simple dashboard to look up the user/account and from there filter down to any errors that happened to that particular user. As errors are captured as part of the transaction, they also identify the business transaction responsible and have the deep-dive transaction trace that the developer needs in order to fix it. That is already a big step towards a solution. For their more important clients they are working on proactively monitoring those accounts and actively calling them up in case of problems.
In short, when monitoring errors we need to know which application and which flow led to that error and which input parameters were given. If possible we would also like to have stack traces of all involved JVMs/CLRs and the user that triggered it.
Making Sure an Optimization Works!
There is one other issue that you have in such a large environment. Whenever you make changes, they might have a variety of effects. You want to make sure that none are negative and that the performance actually improves. You can obviously do that in tests. You can also compare previously recorded performance data with the new data, but in such a large environment this can be quite a task, even if you automate it. Our customer came up with a very pragmatic way to do a quick check instead of going through the more comprehensive analysis right away. The fact is, all they really care about are the slow or very slow transactions, and not so much whether satisfactory performance got even better.
Transaction load performance breakdown that shows that the outliers are indeed reduced after the fix
The chart shows the transaction load (number of transactions) on one of their data centers, color-coded for satisfactory, slow and very slow response time (actually they are split into several more categories). We see the outliers on the top of the chart (red portion of the bars). The dip in the chart represents the time that they diverted traffic to the other data center to apply the necessary changes. After the load comes back, the outliers have been significantly reduced. While this does not guarantee that the change applied is optimal in all cases, it tells them that overall it has the desired effect under full production load!
What about Utilization Metrics?
At this point you might ask if I have forgotten about utilization metrics like CPU usage and the like, or if I simply don't see their uses. No, I have not forgotten them and they have their uses. But they are less important than you might think. A utilization metric tells me if that resource has reached capacity. In that regard it is very important for capacity planning, but as far as performance and stability go it only provides additional context. As an example: knowing that the CPU utilization is 99% does not tell me whether the application is stable or if that fact has a negative impact on the performance. It really doesn't! On the other hand, if I notice that an application is getting slower and none of the measured response time metrics (database, other services, connection pools) increase, while at the same time the machine that hosts the problematic service reaches 99% CPU utilization, we might indeed have hit a CPU problem. But to verify that I would in addition look at the load average which, similar to the number of waiting threads on a connection pool, signifies the number of threads waiting for a CPU and thus signifies real impact.
The value that operating-system-level utilization metrics provide gets smaller all the time. Virtualization and cloud technologies distort the measurement itself: by running in a shared environment and having the ability to get more resources on demand, resources are neither finite nor dedicated, and thus resource utilization metrics become dubious. At the same time, application response time is unaffected if measured correctly, and remains the best and most direct indicator of real performance!
Week 48
Third Party Content Management Applied:
Four Steps to Gain Control of Your Page Load
Performance!
by Klaus Enzenhofer
Today's web sites are often cluttered with third party content that slows down page load and rendering times, hampering user experience. In my first blog post, I presented how third party content impacts your website's performance and identified common problems with its integration. Today I want to share the experience I have gained as a developer and consultant in the management of third party content. In the following, I will show you best practices for integrating third party content and for convincing your business that they will benefit from establishing third party management.
First the bad news: as a developer, you have to get the commitment for establishing third party management and changing the integration of third party content from the highest level of business management possible; ideally this is CEO level. Otherwise you will run into problems implementing improvements. The good news is that, from my experience, this is not an unachievable goal; you just have to bring the problems up the right way with hard facts. Let's start our journey towards implementing third party content management from two possible starting points I have seen in the past: the first one is triggered if someone from the business has a bad user experience and wants to find out who is responsible for the slow pages.
The second one is that you as the developer know that your page is slow. No matter where you are starting, the first step you should take is to get exact, hard facts.
Step 1: Detailed Third Party Content Impact Analysis
For a developer this is nothing really difficult. The only thing we have to do is use the web performance optimization tool of our choice and take a look at the page load timing. What we get is a picture like the screenshot below. We as developers immediately recognize that we have a problem, but for the business this is a diagram that needs a lot of explanation.
Timeline with third party content
As we want to convince them, we should make it easier for them to understand. In my experience something that works well is to take the time to implement a URL parameter that turns off all the third party content for a webpage. Then we can capture a second timeline from the same page without the third party requests. Everybody can now easily see that there are huge differences:
Timeline without third party content
We can present these timelines to the business as well, but we still have to explain what all the boxes, timings, etc. mean. We should invest a few more minutes and create a table like the one below, where we compare some main key performance indicators (KPIs).
As a single page is not representative, we prepare similar tables for the 5 most important pages. Which pages these are depends on your website. Landing pages, product pages and pages on the money path are potentially interesting. Our web analytics tool can help us to find the most interesting pages.
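A minimal sketch of such a switch, here as a hypothetical servlet filter; the parameter and attribute names are illustrative, and the page templates would check the attribute before emitting any third party tags:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class ThirdPartySwitchFilter implements Filter {

    public void init(FilterConfig config) { }

    // Appending ?thirdparty=off to any page URL disables all third party
    // tags so that a clean baseline timeline can be captured.
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        boolean enabled = !"off".equals(req.getParameter("thirdparty"));
        req.setAttribute("thirdPartyEnabled", Boolean.valueOf(enabled));
        chain.doFilter(req, res);
    }

    public void destroy() { }
}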
Step 2: Inform Business about the Impact
During step 1 we found out that the impact is significant, we collected facts and we still think we have to improve the performance of our application. Now it is time to present the results of the first step to the business. From my experience the best way to do this is a face-to-face meeting with high-level business executives. CEO, CTO and other business unit executives are the appropriate attendees.
The presentation we give during this meeting should cover the following
three major topics:
Case study facts from other companies
The hard facts we have collected
Recommendations for improvements
Google, Bing, Amazon, etc. have done case studies that show the impact of slow pages on revenue and on users' interaction with the website. Amazon, for example, found that a page that is 100 ms slower reduces revenue by 1%. I have attached an example presentation to this blog post which should provide some guidance for your presentation and contains some more examples.
After this general information we can show the hard facts about our system, and as the business is now aware of the relationship between performance and revenue, they normally listen carefully. Now we are no longer talking about time but about money.
At the end of our presentation we make some recommendations about how we can improve the integration of third party content. Don't be shy: no third party content plugin is untouchable at this point. Some of the recommendations can only be decided by the business and not by development. Our goals for this meeting are that we have the commitment to proceed, that we get support from the executives when discussing the implementation alternatives, and that we have a follow-up meeting with the same attendees to show improvements. What would be nice is the consent of the executives to the recommended improvements, but in my experience they seldom commit right away.
Step 3: Check Third Party Content Implementation
Alternatives
Now that we have the commitment, we can start thinking about integration alternatives. If we stick to the standard implementation the provider recommends, we won't be able to make any improvements. We have to be creative and always have to try to create win-win situations! Here in this blog post I want to talk about four best practices I have encountered in the past.
Best Practice 1: Remove It
Every developer will now say "OK, that's easy", and every businessman will say "That's not possible because we need it!" But do you really need it? Let's take a closer look at social media plugins, tracking pixels and ads.
A lot of websites have integrated social media plugins like those for Twitter or Facebook. They are very popular these days and a lot of webpages have integrated such plugins. Have you ever checked how often your users really use one of the plugins? A customer of ours had integrated five plugins. After 6 months they checked how often each of them was used. They found out that only one was used by anyone other than the QA department, which checked that all of them were working after each release. With a little investigation they found out that four of the five plugins could be removed as simply nobody was using them.
What about tracking pixels? I have seen a lot of pages out there that have not only integrated one tracking pixel but five, seven or even more. Again, the question is: do we really need all of them? It does not matter who we ask, we will always get a good explanation of why a special pixel is needed, but stick to the goal of reducing it down to one pixel. Find out which one can deliver most or even all of the data that each department needs and remove all the others. Problems we might run into will be user privileges and business objectives that are defined for teams and individuals on specific statistics. It costs you some effort to handle this, but in the end things will get easier as you have only one source for your numbers, and you will stop discussing which statistic delivers the correct values and from which statistic to take the numbers, as there is only one left. At one customer we removed five tracking pixels in a single blow. As this led to incredible performance improvements, their marketing department made an announcement to let customers know they care about their experience. This is a textbook example of creating a win-win situation as mentioned above.
Other third party content that is a candidate for removal is banner ads. Businessmen will now say this guy is crazy to make such a recommendation, but if your main business is not earning money by displaying ads then it might be worth taking a look at it. Take the numbers from Amazon, where 100 ms of additional page load time reduces revenue by one percent, and think of the example page where ads consume roughly 1000 ms of page load time, ten times that. This would mean that we lose 10 * 1% = 10% of our revenue just because of ads. The question now is: are you really earning 10% or more of your total revenue with ads? If not, you should consider removing ads from your page.
Best Practice 2: Move Loading of Resources Back After
the Onload Event
As we have now removed all unnecessary third party content, we still have some work left. For user experience, apart from the total load time, the first impression and the onload time are the most important timings. To improve these timings we can implement lazy loading, where parts of the page are loaded after the onload event via JavaScript; several libraries are available that help you implement this. There are two things you should be aware of: the first is that you are just moving the starting point for the download of the resources, so you are not reducing the download size of your page or the number of requests. The second is that lazy loading only works when JavaScript is available in the user's browser, so you have to make sure that your page is usable without JavaScript. Candidates for moving the download starting point back are plugins that only work if JavaScript is available or that are not vital to the usage of the page. Ads, social media plugins and maps are in most cases such candidates.
Best Practice 3: Load on User Click
This is an interesting option if you want to integrate a social media plugin. The standard implementation of such a plugin, for example, looks like the picture below. It consists of a button to trigger the like/tweet action and the number of likes/tweets.
Twitter and Facebook buttons integration example
To improve this, the question that has to be answered is: do the users really need to know how often the page was tweeted, liked, etc.? If the answer is no, we can save several requests and download volume. All we have to do is deliver a link that looks like the action button, and if the user clicks on the link we can open a popup window or an overlay where the user can perform the necessary actions.
Best Practice 4: Maps vs. Static Maps
This practice focuses on the integration of maps like Google Maps or Bing Maps on our page. What can be seen all around the web are map integrations where the maps are very small and only used to give the user a hint about where the point of interest is located. To show the user this hint, several JavaScript files and images have to be downloaded. In most cases the user does not need to zoom or reposition the map, and as the map is small it is also hard to use. Why not use the static map implementation Bing Maps and Google Maps are offering? To figure out the advantages of the static implementation I have created two HTML pages which show the same map. One uses the standard implementation and the other the static implementation. Find the source files here.
After capturing the timings we get the following results:
When we take a closer look at the KPIs we can see that every KPI for the static Google Maps implementation is better. Especially when we look at the timing KPIs we can see that the first impression and the onload time improve by 34% and 22%. The total load time decreases by 1 second, which is 61% less; a really big impact on user experience.
Some people will argue that this approach is not applicable as they want to offer the map controls to their customers. But remember Best Practice 3, load on user click: as soon as the user states his intention of interacting with the map by clicking on it, we can offer him a bigger and easier-to-use map by opening a popup, an overlay or a new page. The only thing development has to do is surround the static image with a link tag.
Step 4: Monitor the Performance of your Web
Application/Third Party Content
As we need to show improvements in our follow-up meeting with business executives, it is important to monitor how the performance of our website evolves over time. There are three things that should be monitored by the business, Operations and Development:
1. Third party content usage by customers and generated business value (Business Monitoring)
2. The impact of newly added third party content (Development Monitoring)
3. The performance of third party content in the client browser (Operations Monitoring)
Business Monitoring:
An essential part of the business monitoring should be a check of whether the requested third party features contribute to business value. Is the feature used by the customer or does it help us to increase our revenue? We have to ask this question again and again, not only once at the beginning of the development, but every time business, Development and Operations meet to discuss web application performance. If we can ever state "No, the feature is not adding value", remove it as soon as possible!
Operations Monitoring:
There are only a few tools that help us to monitor the impact of third party content on our users. What we need is either a synthetic monitoring system like Gomez provides, or a monitoring tool that really sits in our users' browsers and collects the data there, like dynaTrace User Experience Management (UEM).
Synthetic monitoring tools allow us to monitor the performance from specified locations all over the world. The only downside is that we are not getting data from our real users. With dynaTrace UEM we can monitor the third party content performance of all our users wherever they are situated, and we get the timings as actually experienced. The screenshot below shows a dashboard from dynaTrace UEM that contains all the important data from the operations point of view. The pie chart and the table below it indicate which third party content providers have the biggest impact on the page load time and how that impact is distributed. The three line charts on the right side show the request trend, the total page load time and onload time, and the average time that third party content contributes to your page performance.
dynaTrace third party monitoring dashboard
Development Monitoring:
A very important thing is that Development has the ability to compare KPIs between two releases and view the differences between the pages with and without third party content. Ideally we have already established functional web tests which integrate with a web performance optimization tool that delivers the necessary values for the KPIs. We then just have to reuse the switch we established during step 1 and run automatic tests on the pages we have identified as the most important. From this moment on we will always be able to automatically find regressions caused by third party content.
We may also consider enhancing our switch and making each of the third party plugins individually switchable. This allows us to check the overhead a new plugin adds to our page. It also helps us when we have to decide which feature we want to turn on if there are two or more similar plugins.
Last but not least, now that business, Operations and Development have all the necessary data to improve the user experience, we should meet regularly to check the performance trends of our page and find solutions to upcoming performance challenges.
Conclusion
It is not a big deal to start improving the third party content integration. If we want to succeed it is necessary that business executives, Development and Operations work together. We have to be creative, we have to make compromises and we have to be ready to go different ways of integration. Never stop aiming for a top-performing website! If we take things seriously we can improve the experience of our users and therefore increase our business.
Week 49
Clouds on Cloud Nine: the Challenge of
Managing Hybrid-Cloud Environments
by Andreas Grabner
Obviously, cloud computing is not just a fancy trend anymore. Quite a few
SaaS offerings are already built on platforms like Windows Azure. Others
use Amazon's EC2 to host their complete infrastructure, or at least use it for
additional resources to handle peak load or do number-crunching. Many
also end up with a hybrid approach (running distributed across public and
private clouds). Hybrid environments especially make it challenging to
manage cost and performance overall and in each of the clouds.
In this blog post we discuss the reasons why you may want to move your
applications into the cloud, why you may need a hybrid cloud approach
or why it might be better to stay on-premise. If you choose a cloud or hybrid-cloud approach, the question of managing your apps in these silos
comes up. You want to make sure your move to the cloud makes sense in
terms of total cost of ownership while ensuring at least the same end user
experience as compared to running your applications the way you do today.
Cloud or No Cloud: A Question You Have to Answer
The decision to move to the cloud is not easy and depends on multiple
factors: is it technically feasible? Does it save cost, and how can we manage
cost and performance? Is our data secure with the cloud provider? Can we run everything in a single cloud, or do we need a hybrid-cloud approach, and how do we integrate our on-premise services?
The question is also which parts of your application landscape benefit from a move into the cloud. For some it makes sense; for some it will not.
Another often-heard question is the question of trust: we think it is beyond question that any cloud data center is physically and logically secured to the highest standards. In fact, a cloud data center is potentially more secure than many data centers of small or medium-sized enterprises. It boils down to the amount of trust you have in your cloud vendor.
Now, let's elaborate on the reasons why you may or may not move your applications to the cloud.
Reasons and Options for Pure Cloud
a) If you are a pure Microsoft shop and you have your web applications implemented on ASP.NET using SQL Server as your data repository, Microsoft will talk you into using Windows Azure. You can focus on your application and Microsoft provides the underlying platform with options to scale, optimize performance using content delivery networks (CDNs), leverage single sign-on and other services.
b) If you have a Java or Python application and don't want to have to care about the underlying hardware and deployment to application and web servers, you may want to go with Google AppEngine.
c) If you have any type of application that runs on Linux (or also Windows), there is of course the oldest and most experienced player in the cloud computing field: Amazon Web Services. The strength of Amazon is not necessarily in PaaS (Platform as a Service) but more in IaaS, as they make it very easy to spawn new virtual machine instances through EC2.
There is a nice overview that compares these three cloud providers: Choosing from the major PaaS providers. (Remember, the landscape and offerings are constantly changing, so make sure to check with the actual vendors on pricing and services.) There are a lot more providers in the more traditional IaaS space, like Rackspace, GoGrid and others.
Reasons for Staying On-Premise
Not everybody is entitled to use the cloud, and sometimes it simply doesn't make sense to move your applications from your data centers to a cloud provider. Here are three reasons:
1. It might be the case that regulatory issues or the law stop you from using cloud resources. For instance, you may be required to isolate and store data physically on premise, e.g. in the banking industry.
2. You have legacy applications requiring specific hardware or even software (operating system); it can be laborious and thus costly, or simply impossible, to move them into the cloud.
3. It is simply not cheaper to run your applications in the cloud; the cloud provider doesn't offer all the services you require to run your application on their platform, or it would become more complex to manage your applications through the tools provided by the cloud vendor.
Reasons for Hybrid Clouds
We have customers running their applications in both private and public
clouds, sometimes even choosing several different public cloud providers.
The common scenarios here are:
Cover peak load: let's assume you operate your application within your own data center and deal with seasonal peaks (e.g. four weeks of Christmas business). You might consider additional hardware provisioning in the cloud to cover these peaks. Depending on the technologies used, you may end up using multiple different public cloud providers.
Data center location constraints: in the gambling industry it is required by law to have data centers in certain countries in order to offer these online services. In order to avoid building data centers around the globe and taking them down again when local laws change, we have seen the practice of using cloud providers in these countries instead of investing a lot of money up-front to build your own data centers. Technically this is not different from choosing a traditional hosting company in that country, but a cloud-based approach provides more flexibility. And here again it is possible to choose different cloud providers, as not every cloud provider has data centers in the countries you need.
Improve regional support and market expansion: when companies grow and expand to new markets they also want to serve these new markets with the best quality possible. Therefore it's common practice to use cloud services such as CDNs or even to host the application in additional regional data centers of the current cloud providers.
Externalize frontend and secure backend: we see this scenario a lot in eCommerce applications where the critical business backend services are kept in the private data center, with the frontend application hosted in the public cloud. During business/shopping hours it is easy to add additional resources to cover the additional frontend activity. During off-hours it's easy and cost-saving to scale down instead of having many servers running idle in your own environment.
A Unified View: The (Performance) Management Challenge in Clouds
Combining your Azure usage reports with your Google Analytics statistics at the end of the month and correlating this with the data collected in your private cloud is a tedious job and in most cases won't give you the answers to the questions you have, which potentially include:
How well do we leverage the resources in the cloud(s)?
How much does it cost to run certain applications/business transactions, especially when costs are distributed across clouds?
How can we identify problems our users have with the applications
running across clouds?
How can we access data from within the clouds to speed up problem
resolution?
Central Data Collection from All Clouds and Applications
We at dynaTrace run our systems (Java, .NET and native applications)
across Amazon EC2, Microsoft Windows Azure and on private clouds for
the reasons mentioned above, and so do an increasing number of our
customers. In order to answer the questions raised above, we monitor these applications both from an infrastructure and cloud provider perspective as well as from a transactional perspective. To achieve this we:
Use the Amazon API to query information about instance usage and cost (see the sketch after this list)
Query the Azure Diagnostics Agent to monitor resource consumption
Use dynaTrace User Experience Management (UEM) to monitor end
user experience
Use dynaTrace application performance management (APM) across all
deployed applications and all deployed clouds
Monitor business transactions to map business to performance and
cost
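As a rough illustration of the first point, here is a minimal sketch using the AWS SDK for Java to list running EC2 instances; turning instance type and uptime into actual cost is left out, as it depends on the current price list:

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.DescribeInstancesResult;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.Reservation;

public class Ec2UsageProbe {
    public static void main(String[] args) {
        AmazonEC2Client ec2 = new AmazonEC2Client(
            new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        DescribeInstancesResult result = ec2.describeInstances();
        for (Reservation reservation : result.getReservations()) {
            for (Instance instance : reservation.getInstances()) {
                // Instance type, state and launch time are the raw input for
                // a simple per-instance cost estimate.
                System.out.println(instance.getInstanceId() + " "
                    + instance.getInstanceType() + " "
                    + instance.getState().getName() + " since "
                    + instance.getLaunchTime());
            }
        }
    }
}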
The following shows an overview of what central monitoring has to look
like. End users are monitored using dynaTrace UEM, the individual instances
in the cloud are monitored using dynaTrace Agents (Java, .NET, native) as
well as dynaTrace Monitors (Amazon Cost, Azure Diagnostics, and so on).
This data, combined with data captured in your on-premise deployment, is collected by the dynaTrace Server, providing central application performance management:
Getting a unied view of application performance data by monitoring all
components in all clouds
Real-Life Cross Cloud Performance Management
Now let's have a closer look at the actual benefits we get from having all this data available in a single application performance management solution.
Understand Your Cloud Deployment
Following every transaction from the end user all the way through your deployed application makes it possible to a) understand how your application actually works, b) see how your application is currently deployed in this very dynamic environment and c) identify performance hotspots:
Follow your end user transactions across your hybrid cloud environment: identify
architectural problems and hotspots
Central Cost Control
It is great to get monthly reports from Microsoft, but it is better to monitor your costs online, up to the minute. The following screenshot shows the dashboard that highlights the number of Amazon instances we are using and the related costs, giving us an overview of how many resources we are consuming right now:
Live monitoring of instances and cost on Amazon
Monitor End User Experience
If you deploy your application across multiple cloud data centers, you want to know how the users serviced by the individual data centers are doing. The following screenshot shows us how end user experience looks for our users in Europe; they should mainly be serviced by the European data centers of Azure:
Analyze regional user experience and verify how well your regional cloud data
centers service your users
Root Cause Analysis
In case your end users are frustrated because of bad performance or problems, you want to know what these problems are and whether they are application- or infrastructure-related. Capturing transactional data from within the distributed deployed application allows us to pinpoint problems down to the method level:
Identify which components or methods are your performance hotspots on I/O,
CPU, sync or wait
For developers it is great to extract individual transactions including contextual information such as exceptions, log messages, web service calls, database statements and the information on the actual hosts (web role in Azure, JVM in EC2 or application server on-premise) that executed this transaction:
dynaTrace PurePath works across distributed cloud applications, making it easy for developers to identify and fix problems
Especially the information from the underlying hosts, whether virtual in one of your clouds or physical in your data center, allows you to figure out whether a slowdown was really caused by slow application code or by an infrastructure/cloud provider problem.
For our dynaTrace users
If you want to know more about how to deploy dynaTrace in Windows Azure or Amazon EC2, check the following resources on the dynaTrace Community Portal: Windows Azure Best Practices and Tools, Amazon Account Monitor, Amazon EC2 FastPack
Index
Agile
and continuous integration [20]
Ajax
troubleshooting guide [31]
Architecture
validation with dynaTrace [20]
Automation
AJAX Edition with Showslow [8]
+ AJAX Edition with Showslow [13]
of cross-browser development [112]
+ of cross-browser development [183]
of load test [245]
of regression and scalability analysis
[86]
security with business transaction [122]
to validate code in continuous
integration [20]
Azure
hybrid with EC2 [388]
Best Practices
for Black Friday survival [332]
for cross-browser testing [183]
Microsoft not following [160]
Business Transaction Management
explained [275]
over 1000+ JVMs [366]
security testing with [122]
Caching
memory-sensitivity of [153]
Cassandra
garbage collection suspension [232]
pagination with [352]
Cloud
and key-value stores [343]
auto-scaling in [193]
hybrid performance management [388]
in the load test [249]
+ in the load test [250]
inside horizontally scaling databases
[232]
pagination in horizontally scaling databases [353]
public and private performance
management [166]
RDBMS versus horizontally scaling
databases [261]
Continuous Integration
dynaTrace in [20]
Cross-browser
DOM case sensitivity [209]
exceptional performance with browser
plurality [304]
Firefox versus Internet Explorer [202]
Javascript implementation [160]
page load optimization with UEM [238]
stable functional testing [112]
Database
connection pool monitoring [315]
DevOps
APM in WebSphere [134]
automatic error detection in
production [213]
business transaction management
explained [275]
control of page load performance [295]
incorrect measurement of response
times [177]
managing performance of 1000+ JVMs
[366]
performance management in public
and onpremise cloud [166]
step-by-step guide to APM in
production [77]
+ step-by-step guide to APM in
production [98]
top performance problems before
Black Friday [320]
troubleshooting response times [65]
why SLAs on request errors do not
work [258]
why do APM in production [220]
Dynamo
key/value stores in [342]
EC2
challenges of hybridizing [388]
eCommerce
top performance problems before
Black Friday [320]
user experience strong performers
[332]
Web 2.0 best practices in [183]
Exception
cost of [72]
Firefox
framework performance [40]
versus Internet Explorer [202]
Frameworks
problems in Firefox with old versions
[40]
Garbage collection
across JVMs [145]
impact on Java performance [59]
myths [37]
Java
major garbage collections in [145]
memory problems in [92]
+ memory problems in [357]
object caches and memory [153]
performance management of 1000+
JVMs [366]
serialization [25]
jQuery
in Internet Explorer [209]
Load Testing
cloud services [249]
+ cloud services [250]
importance of [245]
white box [86]
Memory
leaks [92]
+ leaks [357]
sensitivity of object caches [153]
Metrics
incorrect [176]
in production [98]
of third party content [376]
to deliver exceptional performance
[304]
Mobile
server-side ramifications on mobile
[198]
time to deliver exceptional
performance [304]
NoSQL
Cassandra performance [231]
or RDBMS [261]
pagination with Cassandra [353]
shard behavior [342]
Page Load Time
control of [295]
reducing with caching [238]
Production
APM in WebSphere [134]
automatic error detection in [213]
managing 1000+ JVMs in [366]
measuring a distributed system in [98]
step-by-step guide to APM [77]
why do APM in [220]
RDBMS
comparison with Cassandra [352]
or NoSQL [262]
Scalability
automatically in the cloud [193]
white box testing for [86]
Security
testing with business transactions [122]
Selenium
cross-browser functional web testing
with [112]
Serialization
in Java [25]
Server-side
connection pool usage [315]
performance in mobile [198]
SLAs
and synthetic monitoring [269]
on request errors [257]
Synthetic monitoring
will it die [269]
System metrics
distributed [98]
trustworthiness [65]
Third-Party Content
effect on page load [295]
business impact of [310]
minimizing effect on page load [376]
Tuning
connection pool usage [315]
cost of an exception [72]
garbage collection in the 3 big JVMs
[145]
myths about major garbage collections
[37]
serialization in Java [25]
top Java memory problems [92]
+ top Java memory problems [357]
why object caches need to be memory
sensitive [153]
worker threads under load [15]
User Experience
how to save 3.5 seconds of load time
with [238]
in ecommerce [310]
on Black Friday [332]
proactive management of [213]
synthetic monitoring and [269]
users as crash test dummies [245]
Virtualization
versus public cloud [166]
Web
Ajax troubleshooting guide [31]
best practices don't work for single-page applications [45]
eCommerce impact of address
validation services [310]
four steps to gain control of page load
performance [376]
frameworks slowing down Firefox [40]
how case-sensitivity can kill page load
time [209]
how to save 3.5 seconds page load time
[238]
impact of garbage collection on Java
performance [59]
lessons from strong Black Friday
performers [332]
page load time of US Open across
browsers [202]
set up ShowSlow as web performance
repository [8]
why Firefox is slow on Outlook web
access [160]
will synthetic monitoring die [269]
Web 2.0
automated optimization [183]
testing and optimizing [46]
WebSphere
APM in [134]