Projects collect lots of metrics that they do not need. All on this forum would agree that measurement is critical. But not all metrics are useful, and too many metrics can be confusing and obscure what's important.
Furthermore, measuring takes time and space resources away from doing. As computers get faster, storage gets cheaper, metrics and logging frameworks come built-in and data analysis and display becomes more powerful, the temptation grows to collect everything, just in case you need it.
Here are some observations on why we collect too many metrics, and how we can avoid it.
1. If your job is collecting data, collecting more data makes you look more productive
Collecting metrics is a means to an end, not an end unto itself. If you don't get paid unless you find more numbers to squeeze from an application then your organization needs some adjustment.
Depending on your level in the organization, the jobs should be:
- Ask a question that a metric could answer
- Decide what metric answers a question
- Implement the collection of a requested metric
- Answer a question using the collected metric values
The end goal of metrics is either to identify a problem, or fix a problem.
2. Sometimes you can see "anomalies" looking at other metrics you might not think relevant
This is actually the most compelling argument for collecting a lot of metrics. But this should be done by choice, in a purposeful way, in a non-production but realistically-loaded environment, and the result should be analyzed by somebody with the time and qualifications to judge the value of these metrics. Just turning on all the metrics all the time and hoping the bug will jump out at you when you need it, is not an engineering approach.
3. It's easier to browse existing metrics than to figure out how to enable a new metric
It shouldn't be, especially if it's one of the many that you would have been collecting already. Good tools and infrastructure should make the mechanics easy, and their use is something your developers and operations people should know: How do I enable/disable specific metrics and adjust their collection frequency and persistence? Whether it's one app-server's JMX metrics or your external network bandwidth, somebody around there should know the points at which metrics are collected, how these are configured, and where the results go. If not, then that's a problem to address.
When the person who knows is explicitly asked to look at the metrics being collected, chances are they'll see some that are not used or useful. Or, they might see metrics or logging that are not enabled, but would have been useful in the past, and that's even better. Either way: a requirement of your application's implementation and documentation should be how to easily control metrics collection.
4. It's easier to collect all the metrics than to figure out which are the right few
How do you know which few metrics you need? Of course you don't, always, in advance. This is the hardest problem and the biggest reason why we collect too much. There are two main approaches to identifying what to measure:
- negative or problem-focused
- positive or goal-focused
The negative approach might alternatively be called the House, MD approach, where we do differential diagnosis to decide which tests to run on the patient. We build a diagnostic handbook for our application by listing problems, symptoms, metrics and value ranges which confirm the problem exists; and/or metrics and value ranges which exclude that problem.
This process has the added advantage of forcing us to identify potential problems, so our QA department can test for these in advance (see The AntifragileOrganization). If testing or production shows additional problems, we add that problem, along with the metrics we used to identify and diagnose it, to our diagnostic handbook, and keep the collection of those useful metrics enabled, if possible.
The positive approach is the more familiar one: the SLA. Quantify what we want to achieve as metrics and measure that. We then use externally visible goals like the SLA to drive internal metrics, like measuring every operation comprising a transaction. Then measuring the resources used by the operations comprising a transaction. Then measuring the resources that compete with the resources that impact the operations that comprise a transaction ... And this is the trap. Everything in the entire system contributes to your SLA, so it's tempting to measure and report on everything.
However, considering both approaches together suggests a solution:
1. Measure what you want to achieve
Record user experience, transaction frequency, error rates, availability, system correctness. If you don't measure that, you can't know you have a problem. These metrics are generally those worth reporting to management and your team. (Metrics reporting follies are a topic worthy of a separate post, or book).
2. Measure what you need to know to solve the problems shown by point #1
Let diagnostic need drive the rest of your metrics, as well as your logging. When a metric proves useful, keep it enabled if it's not costly (and if it is, see if you can get it another way for next time). But don't bother producing reports about these metrics.
3. Disable all the metrics and logging that aren't either (a) identifying problems or (b) helping you solve them
You'll be amazed at how much lighter your load is.
Tom Fleck is Senior Software Engineer at OC Systems.
In 2017, every second counts and even minor issues can have a significant impact on the success or failure of a brand interaction. Our latest research found that two thirds of people have rising expectations for digital performance, showing that businesses can expect consumer pressure to grow. The App Attention Index 2017 revealed just how unforgiving consumers are of badly performing digital services ...
In today's everchanging IT industry, network engineers face a slew of challenges when it comes to network management. As networks continue to grow and become more complex, many IT professionals struggle to get a grasp on key workflows in which network engineers still rely on manual processes, including network documentation, troubleshooting, change management and cybersecurity ...
Many organizations are struggling to resolve customer-impacting incidents quickly enough to preserve brand loyalty and revenue, according to PagerDuty's recent State of Digital Operations Report ...
"Become the Automator, Not the Automated." While it's a simple enough phrase, it speaks directly to how today's organizations and IT teams must innovate to remain competitive. A critical aspect of innovation is acknowledging the digital transformation of businesses. The move to digitalization enables organizations to more effectively unlock the power of information technology (IT) to fuel and accelerate business innovation. It is a competitive weapon and a survival imperative ...
Executives in the US and Europe now place broad trust in Artificial Intelligence (AI) and machine learning systems, designed to protect organizations from more dynamic pernicious cyber threats, according to Radware's 2017 Executive Application & Network Security Survey ....
While IT service management (ITSM) has too often been viewed by the industry as an area of reactive management with fading process efficiencies and legacy concerns, a new study by Enterprise Management Associates (EMA) reveals that, in many organizations, ITSM is becoming a hub of innovation ...
Cloud is quickly becoming the new normal. The challenge for organizations is that increased cloud usage means increased complexity, often leading to a kind of infrastructure "blind spot." So how do companies break the blind spot and get back on track? ...
Hybrid IT is becoming a standard enterprise model, but there’s no single playbook to get there, according to a new report by Dimension Data entitled The Success Factors for Managing Hybrid IT ...
Any mobile app developer will tell you that one of the greatest challenges in monetizing their apps through video ads isn't finding the right demand or knowing when to run the videos; it's figuring out how to present video ads without slowing down their apps ...
40 percent of UK retail websites experience downtime during seasonal peaks, according to a recent study by Cogeco Peer 1 ...