Distributed Tracing - The Next Step of APM
September 26, 2016

Gergely Nemeth
RisingStack

Share this

Transforming your monolithic application into a microservices-based one is not as easy as many think. When you are breaking a software down into smaller pieces, you're moving the communication to the network layer and the complexity of your architecture is heavily increasing. Other issues arise as well since performance monitoring and finding the root source of an error becomes extremely challenging.

With the rise of microservices, developers need proper Application Performance Management (APM) tools to develop and operate their applications successfully. This blog examines the particular difficulties of monitoring microservices and what APM should be able to do to alleviate the major pain-points of monitoring and debugging them.

Figuring Out What Breaks in a Microservices Application

In a monolithic application, specific code pieces are communicating in the applications memory. It means that when something breaks, the log files will probably be useful to find the cause of an error and you can start debugging right away.

When something goes wrong in a microservices call-chain – called distributed transactions – all of the services participating in that request will throw back an error. It means that you need an excellent logging system, and if you have one, you'll still experience problems since you have to manually correlate the log files to find out what caused the trouble in the first place.

What's the solution to this problem? Distributed Tracing.

For microservices applications, there is a much more sophisticated application performance monitoring method available, called Distributed Tracing.

For distributed tracing, you have to attach a correlation ID to your requests which you can use to track what services are communicating with each other. With these IDs, you can to reverse engineer what happened during an error since all of the services involved in a request will be there for you to see instantly.

Next-gen APM solutions can already attach correlation IDs to requests and can also group the services taking part in a transaction and visualize the exact dataflow on a simple tree-graph. A tool like this enables you to see the distributed call stacks, the root cause of an error, and the dependencies between your microservices.


A distributed tracing timeline shows all of the services taking part in a certain transaction and the source of the error that later propagated back to all of them



There are only a few Distributed Tracing solutions available right now, but you can find open source solutions for Java monitoring and a SaaS solution focusing on Node.js – the technology primarily used for building microservices.

The concept of Distributed Tracing is based on Google’s Dapper whitepaper, which is publicly available here.

Increasing Architecture Complexity and Slow Response Times

As I mentioned above, increasing architecture complexity comes by the definition with microservices.

In a microservices application, the services will usually use a transport layer, like the HTTP protocol, RabbitMQ or Kafka. It will add delays to the internal communication of your application, and when you put services into a call chain, your response times will be higher. A modern APM solution must be prepared for this, and support message queue communication to map out a distributed system. If you have one, you'll be able to figure out what makes it slow.

Companies that build microservices should be able to deal with slow response times by using a distributed tracing APM tool. Correlation IDs let you visualize whole call chains and look for slow response times, whether it's caused by a slow service or the slow network.

If the transaction timeline graph shows that your services are fine, but your network is slow, you can to speed up your application by investigating that. One time, we could figure out that our PaaS provider was using external routing, so every request between our services went outside the public network and back, it reached more than 30 network hops, which caused the bad response time. The next step, in this case, was to choose another without external routing.

If your network times are fine, you have to investigate what slows your services down. It's quite easy if you have an APM with a built-in CPU profiler, or you have some profiling solution enabled. Requesting a CPU profile in the right time (presumably when the response time of a service gets high) will allow you to look for the slow functions and find the location of them. Thankfully, Chrome's Developer Tools support loading and analyzing javascript CPU profiles which solves this problem.

Conclusion

Application performance monitoring solutions have been around for a while, offering the same functionalities for years without major breakthroughs. This has to change. The way how we develop and deploy software is not the same than it was three years ago, and legacy APM tools are not helping as much as they used to. We need solutions that are treating microservices as first class citizens, and the developers who are building them too.

Gergely Nemeth is Co-Founder and CEO of RisingStack.

Share this

The Latest

March 24, 2017

A growing IT delivery gap is slowing down the majority of the businesses surveyed and directly putting revenue at risk, according to MuleSoft's 2017 Connectivity Benchmark Report on digital transformation initiatives and the business impact of APIs ...

March 23, 2017

Why containers are growing in popularity is no surprise — they’re extremely easy to spin up or down, but come with an unforeseen issue. Without the right foresight, DevOps and IT teams may lose a lot of visibility into these containers resulting in operational blind spots and even more haystacks to find the presumptive performance issue needle ...

March 22, 2017

Much emphasis is placed on servers and storage when discussing Application Performance, mainly because the application lives on a server and uses storage. However, the network has considerable importance, certainly in the case of WANs where there are ways of speeding up the transmission of data of a network ...

March 21, 2017

The majority of IT executives believe investment in IT Service Management (ITSM) is important to gain the agility needed to compete in an era of global, cross-industry disruption and digital transformation, according to Delivering Value to Today’s Digital Enterprise: The State of IT Service Management 2017, a report by BMC, conducted in association with Forbes ...

March 17, 2017

Let’s say your company has examined all the potential pros and cons, and moved your critical business applications to the cloud. The advertised benefits of the cloud seem like they’ll work out great. And in many ways, life is easier for you now. But as often happens when things seem too good to be true, reality has a way of kicking in to reveal just exactly how many things can go wrong with your cloud setup – things that can directly impact your business ...

March 16, 2017

IT leadership is more driven to be innovative than ever, but also more in need of justifying costs and showing value than ever. Combining the two is no mean feat, especially when individual technologies are put forward as the single tantalizing answer ...

March 15, 2017

The move to Citrix 7.X is in full swing. This has improved the centralizing of Management and reduction of costs, but End User Experience is becoming top of the business objectives list. However, delivering that is not something to be considered after the upgrade ...

March 14, 2017

As organizations understand the findings of the Cyber Monday Web Performance Index and look to improve their site performance for the next Cyber Monday shopping day, I wanted to offer a few recommendations to help any organization improve in 2017 ...

March 13, 2017

Online retailers stand to make a lot of money on Cyber Monday as long as their infrastructure can keep up with customers. If your company's site goes offline or substantially slows down, you're going to lose sales. And even top ecommerce sites experience performance or stability issues at peak loads, like Cyber Monday, according to Apica's Cyber Monday Web Performance Index ...

March 10, 2017

Applications and infrastructure are being deployed and commissioned at a faster rate than ever before, the number of tools it takes to effectively manage these services is multiplying, and the expectations placed on IT to ensure customer satisfaction is increasing, according to The State of Monitoring 2017 report from BigPanda ...