Projects collect lots of metrics that they do not need. All on this forum would agree that measurement is critical. But not all metrics are useful, and too many metrics can be confusing and obscure what's important.
Furthermore, measuring takes time and space resources away from doing. As computers get faster, storage gets cheaper, metrics and logging frameworks come built-in and data analysis and display becomes more powerful, the temptation grows to collect everything, just in case you need it.
Here are some observations on why we collect too many metrics, and how we can avoid it.
1. If your job is collecting data, collecting more data makes you look more productive
Collecting metrics is a means to an end, not an end unto itself. If you don't get paid unless you find more numbers to squeeze from an application then your organization needs some adjustment.
Depending on your level in the organization, the jobs should be:
- Ask a question that a metric could answer
- Decide what metric answers a question
- Implement the collection of a requested metric
- Answer a question using the collected metric values
The end goal of metrics is either to identify a problem, or fix a problem.
2. Sometimes you can see "anomalies" looking at other metrics you might not think relevant
This is actually the most compelling argument for collecting a lot of metrics. But this should be done by choice, in a purposeful way, in a non-production but realistically-loaded environment, and the result should be analyzed by somebody with the time and qualifications to judge the value of these metrics. Just turning on all the metrics all the time and hoping the bug will jump out at you when you need it, is not an engineering approach.
3. It's easier to browse existing metrics than to figure out how to enable a new metric
It shouldn't be, especially if it's one of the many that you would have been collecting already. Good tools and infrastructure should make the mechanics easy, and their use is something your developers and operations people should know: How do I enable/disable specific metrics and adjust their collection frequency and persistence? Whether it's one app-server's JMX metrics or your external network bandwidth, somebody around there should know the points at which metrics are collected, how these are configured, and where the results go. If not, then that's a problem to address.
When the person who knows is explicitly asked to look at the metrics being collected, chances are they'll see some that are not used or useful. Or, they might see metrics or logging that are not enabled, but would have been useful in the past, and that's even better. Either way: a requirement of your application's implementation and documentation should be how to easily control metrics collection.
4. It's easier to collect all the metrics than to figure out which are the right few
How do you know which few metrics you need? Of course you don't, always, in advance. This is the hardest problem and the biggest reason why we collect too much. There are two main approaches to identifying what to measure:
- negative or problem-focused
- positive or goal-focused
The negative approach might alternatively be called the House, MD approach, where we do differential diagnosis to decide which tests to run on the patient. We build a diagnostic handbook for our application by listing problems, symptoms, metrics and value ranges which confirm the problem exists; and/or metrics and value ranges which exclude that problem.
This process has the added advantage of forcing us to identify potential problems, so our QA department can test for these in advance (see The AntifragileOrganization). If testing or production shows additional problems, we add that problem, along with the metrics we used to identify and diagnose it, to our diagnostic handbook, and keep the collection of those useful metrics enabled, if possible.
The positive approach is the more familiar one: the SLA. Quantify what we want to achieve as metrics and measure that. We then use externally visible goals like the SLA to drive internal metrics, like measuring every operation comprising a transaction. Then measuring the resources used by the operations comprising a transaction. Then measuring the resources that compete with the resources that impact the operations that comprise a transaction ... And this is the trap. Everything in the entire system contributes to your SLA, so it's tempting to measure and report on everything.
However, considering both approaches together suggests a solution:
1. Measure what you want to achieve
Record user experience, transaction frequency, error rates, availability, system correctness. If you don't measure that, you can't know you have a problem. These metrics are generally those worth reporting to management and your team. (Metrics reporting follies are a topic worthy of a separate post, or book).
2. Measure what you need to know to solve the problems shown by point #1
Let diagnostic need drive the rest of your metrics, as well as your logging. When a metric proves useful, keep it enabled if it's not costly (and if it is, see if you can get it another way for next time). But don't bother producing reports about these metrics.
3. Disable all the metrics and logging that aren't either (a) identifying problems or (b) helping you solve them
You'll be amazed at how much lighter your load is.
Tom Fleck is Senior Software Engineer at OC Systems.
Monitoring a business means monitoring an entire business – not just IT or application performance. If businesses truly care about differentiating themselves from the competition, they must approach monitoring holistically. Separate, siloed monitoring systems are quickly becoming a thing of the past ...
A growing IT delivery gap is slowing down the majority of the businesses surveyed and directly putting revenue at risk, according to MuleSoft's 2017 Connectivity Benchmark Report on digital transformation initiatives and the business impact of APIs ...
Why containers are growing in popularity is no surprise — they’re extremely easy to spin up or down, but come with an unforeseen issue. Without the right foresight, DevOps and IT teams may lose a lot of visibility into these containers resulting in operational blind spots and even more haystacks to find the presumptive performance issue needle ...
Much emphasis is placed on servers and storage when discussing Application Performance, mainly because the application lives on a server and uses storage. However, the network has considerable importance, certainly in the case of WANs where there are ways of speeding up the transmission of data of a network ...
The majority of IT executives believe investment in IT Service Management (ITSM) is important to gain the agility needed to compete in an era of global, cross-industry disruption and digital transformation, according to Delivering Value to Today’s Digital Enterprise: The State of IT Service Management 2017, a report by BMC, conducted in association with Forbes ...
Let’s say your company has examined all the potential pros and cons, and moved your critical business applications to the cloud. The advertised benefits of the cloud seem like they’ll work out great. And in many ways, life is easier for you now. But as often happens when things seem too good to be true, reality has a way of kicking in to reveal just exactly how many things can go wrong with your cloud setup – things that can directly impact your business ...
IT leadership is more driven to be innovative than ever, but also more in need of justifying costs and showing value than ever. Combining the two is no mean feat, especially when individual technologies are put forward as the single tantalizing answer ...
The move to Citrix 7.X is in full swing. This has improved the centralizing of Management and reduction of costs, but End User Experience is becoming top of the business objectives list. However, delivering that is not something to be considered after the upgrade ...
As organizations understand the findings of the Cyber Monday Web Performance Index and look to improve their site performance for the next Cyber Monday shopping day, I wanted to offer a few recommendations to help any organization improve in 2017 ...
Online retailers stand to make a lot of money on Cyber Monday as long as their infrastructure can keep up with customers. If your company's site goes offline or substantially slows down, you're going to lose sales. And even top ecommerce sites experience performance or stability issues at peak loads, like Cyber Monday, according to Apica's Cyber Monday Web Performance Index ...