APM for Enterprise: How Does It Scale?
May 18, 2015

Larry Haig
Intechnica


It is easy to feel that so-called "second generation" Application Performance Management (APM) tooling rules the world.

And for good reason, many would argue. Support for highly distributed, service-oriented architectures has been positively disruptive, and many fast-moving businesses need to support a plethora of different technologies; together these are a powerful dynamic. That leaves aside the undoubted advantages of comprehensive traffic screening (as opposed to "hard" sampling), ease of installation and commissioning (relative in some cases), user accessibility, flexible reporting and a tighter, more productive association between IT and business. In short, these tools empower the DevOps and PerfOps revolution.

So, modern APM is certainly well attuned to the requirements of current business. What's not to like?

Could these technologies have an Achilles heel? Certainly, their vendors are generally strong on lists of customer logos, but tight-lipped when it comes to detailed high-volume case studies.

Hundreds or thousands of JVMs and moderately high transaction volumes are all very well (and well attested), but how do these technologies stack up for the high end enterprise? What other options might exist?

It could be argued that an organization with tens of thousands of JVMs and millions of metrics has a fundamentally different issue than those closer to the base of the pyramid. Certainly these organizations are fewer in number, but that is scant comfort for those with the responsibility of managing their application delivery. Whether in banking/financial trading, fast-moving consumer goods (FMCG) or elsewhere, the issue of effectively analyzing daily transaction flows at high scale is real. The situation is exacerbated at peak: one large UK gaming company generates 20,000-30,000 events per second during a normal daily peak. During the popular Grand National race meeting, traffic increases five- to tenfold, creating the need to transfer several terabytes a day into an APM data store.
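To get a feel for these numbers, a rough back-of-envelope calculation helps. The sketch below takes the event rates and surge multipliers quoted above; the average event size and the length of the surge are purely illustrative assumptions (neither figure comes from the company concerned), so treat the result as an order-of-magnitude check rather than a measurement.

```python
# Back-of-envelope sizing for a peak-day surge.
# The event rates and multipliers are the figures quoted in the text.
# ASSUMED_BYTES_PER_EVENT and ASSUMED_SURGE_HOURS are illustrative guesses.

NORMAL_PEAK_EPS = (20_000, 30_000)   # events per second, normal daily peak
SURGE_MULTIPLIER = (5, 10)           # Grand National uplift
ASSUMED_BYTES_PER_EVENT = 1_000      # assumption: about 1 KB per stored event
ASSUMED_SURGE_HOURS = 4              # assumption: surge sustained for 4 hours


def volume_tb(events_per_second, bytes_per_event, hours):
    """Return the data volume in terabytes (10**12 bytes) for a sustained rate."""
    total_bytes = events_per_second * bytes_per_event * hours * 3600
    return total_bytes / 1e12


low_eps = NORMAL_PEAK_EPS[0] * SURGE_MULTIPLIER[0]
high_eps = NORMAL_PEAK_EPS[1] * SURGE_MULTIPLIER[1]

low_tb = volume_tb(low_eps, ASSUMED_BYTES_PER_EVENT, ASSUMED_SURGE_HOURS)
high_tb = volume_tb(high_eps, ASSUMED_BYTES_PER_EVENT, ASSUMED_SURGE_HOURS)

print(f"Surge rate: {low_eps:,} to {high_eps:,} events per second")
print(f"Surge volume: {low_tb:.2f} to {high_tb:.2f} TB")
```

Under these assumptions the surge alone lands in the low single-digit terabytes, which is consistent with the "several terabytes a day" figure. Changing the assumed event size or duration moves the answer proportionally.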

The question is: which if any of the APM tools can even come close to these sorts of volumes?

It is certainly possible to instrument these organizations with second generation APM – but what snares lie in wait for the unwary, and what compromises will have to be made?

To some extent, the answer depends upon the particular technology deployed. All will have their own weaknesses, but those architected around collector/analysis servers are likely to be particularly vulnerable to the effects of extreme data volume unless high scale technology/architectural interventions have been made "under the covers". Cloud based solutions may duck this bullet (although they are not guaranteed to do so), but come with their own security concerns, at least in theory.

So, you are a high volume Enterprise, and have plumped for second generation APM. What issues may arise? Essentially, software agent based APM is likely to evidence stress in one or more of three principal areas:

■ Length of data storage/"live" access

■ Data granularity

■ Production system performance overhead

Compromises essentially hinge on either reducing the data flows processed by the APM tool, and hence the amount of data written to disk, or improving the inherent efficiency of the data handling. Traditionally, the former means sampling rather than screening all transactions, and this is an option for some. However, sampling has no value for businesses needing to identify and analyze a particular single customer session, since there is no guarantee that the session in question was captured.
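The limitation of sampling for single-session analysis can be made concrete with a little probability. The sketch below assumes each transaction is sampled independently at a fixed rate, which is a simplification: real APM products use a variety of sampling strategies, and the function name and figures here are hypothetical.

```python
# How likely is a sampled data store to contain a given customer session?
# Assumption: each transaction is sampled independently with probability
# sample_rate. Real tools may sample per session, per time slice, or
# adaptively, so this is only an illustration.


def capture_probabilities(sample_rate, transactions_in_session):
    """Return (P(at least one transaction captured), P(all captured))."""
    any_captured = 1 - (1 - sample_rate) ** transactions_in_session
    all_captured = sample_rate ** transactions_in_session
    return any_captured, all_captured


# A hypothetical 20-transaction session under 1% sampling.
any_1pct, all_1pct = capture_probabilities(0.01, 20)
print(f"1% sampling: any={any_1pct:.3f}, all={all_1pct:.3g}")

# Full screening (rate 1.0) always holds the complete session.
any_full, all_full = capture_probabilities(1.0, 20)
print(f"Full screening: any={any_full:.0f}, all={all_full:.0f}")
```

Under these assumptions, a 1% sample has less than a one-in-five chance of containing even a single transaction from the session, and effectively no chance of containing all of it. Only full screening guarantees that the session can be reconstructed.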

Other approaches are to increase the hardware capacity of collector/server components, or to reduce the ratio of application servers to collectors. Either way, these compromises run the risk of eroding the underlying value proposition supporting much of second generation tool philosophy. In addition, they will push the architecture of these solutions to its limit and potentially expose fundamental issues in how they scale.

Open Source approaches to extreme scale have evolved around distributed and NoSQL data technologies, producing products such as Hadoop and Elasticsearch. The pedigree of these is generally good, in that they build on strategies developed within companies such as Google and Facebook to deal with the problems of ultra-high-volume environments.

Certainly, integration of these technologies into their tooling by APM vendors can be a potential solution, providing that they have been architected/implemented appropriately – and tested with extreme scale in mind.

Given that most if not all major volume Enterprises have de facto constraints on their flexibility and speed of adoption of extension technologies (not to mention change generally), perhaps there is a case for revisiting "traditional" APM tooling models. These certainly had (and have) a track record of delivering value in large enterprise deployments, albeit without some of the bells and whistles offered by later entrants. Any high scale developments made by these vendors would certainly have the advantage of leveraging the often considerable sunk investment made in them.

Provided that any constraints are well understood, and appropriate investment is made in initial commissioning and ongoing support, then this option would in our view be worth adding to the mix – for consideration, at least.

Alternatively, perhaps a "dual tool" approach may have validity – second generation APM pre-production, and traditional high volume solutions in the live environment.

For Enterprises with extremely strong nerves, and appropriate skills, "building your own" using Open Source technologies is a possibility, although it is likely to be both extremely high risk and costly. Such an approach comes with its own ongoing maintenance challenges as well.

We would like to see more open sourcing of the key components of APM, for example the agents that instrument Java and .NET applications. Agents that conform to open standards would enable a flexible approach to open APM: choose your agents, your transport method (Apache Flume, Fluentd etc.), and your data storage and analysis methods (Elasticsearch and Kibana, for example) to suit your scale and company skillset.
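As a minimal sketch of what such a pluggable arrangement might look like, the Python below separates the agent, the transport and the store behind small interfaces so that each can be swapped independently. Every class and name here is hypothetical and in-memory only; a real deployment would put something like Fluentd or Flume behind the transport interface and something like Elasticsearch behind the store interface, and nothing below uses their actual APIs.

```python
# A toy "open APM" pipeline: agent -> transport -> store.
# All names are hypothetical; no real agent, transport or store API is used.
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Event:
    source: str         # which application instance emitted the event
    name: str           # transaction or operation name
    duration_ms: float  # measured duration


class Store(Protocol):
    def write(self, event: Event) -> None: ...


class Transport(Protocol):
    def send(self, event: Event) -> None: ...


@dataclass
class InMemoryStore:
    """Stand-in for a real data store; keeps events in a list."""
    events: list = field(default_factory=list)

    def write(self, event: Event) -> None:
        self.events.append(event)


@dataclass
class DirectTransport:
    """Stand-in for a real transport; hands events straight to the store."""
    store: Store

    def send(self, event: Event) -> None:
        self.store.write(event)


@dataclass
class Agent:
    """Stand-in for an instrumentation agent inside an application."""
    source: str
    transport: Transport

    def record(self, name: str, duration_ms: float) -> None:
        self.transport.send(Event(self.source, name, duration_ms))


store = InMemoryStore()
agent = Agent("checkout-jvm-1", DirectTransport(store))
agent.record("place_bet", 12.5)
print(store.events)
```

The point of the design is that the agent knows nothing about storage, and the store knows nothing about instrumentation, so either end can be replaced as scale or skills dictate.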

Either way, we would strongly suggest that major enterprises face these issues squarely, and certainly not make significant investments in APM without appropriate high volume (production scale) Proof of Concept preliminary trialling.

Above all, put little trust in marketing. Prove it in your environment – ideally in production.

Larry Haig is Senior Consultant at Intechnica.

This blog was written with contributions by James Billingham, Performance Architect at Intechnica.
