What the Facebook Outage Teaches Us About Error Monitoring

May 28, 2020

James Smith
SmartBear

On Wednesday, May 6th, iOS users all over the world experienced an app crash when they tried to open popular apps such as TikTok, GroupMe, Spotify, and Pinterest.

How did simultaneous crashes occur across so many independent apps? What's the common thread that would cause widespread app crashes?

Turns out, it was a change in behavior in the Facebook API. Some may wonder what Facebook has to do with logging into your favorite music source or social media app. And the answer is: Quite a lot, actually.

Most major consumer apps connect to Facebook to utilize functions such as seamless login, share functionality, and advertising insights. When the API change impacted Facebook's iOS SDK, consumer apps were the first in line to discover the issue, thanks to deep integrations with Facebook that occur at application launch.

What Facebook's outage demonstrates is how important it is to invest in an error monitoring solution that is built to handle widespread and concurrent crash reports without dropping them, as well as the importance of practicing defensive programming.

Why Your Selection of an Error Monitoring Solution Matters

When the Facebook issue occurred, every error monitoring provider saw an immediate and direct impact on the iOS apps they support. A huge volume of crash reports occurred simultaneously within minutes of the issue being reported on GitHub.

Naturally, this type of error spike dramatically increases the volume of crash reports ingested by error monitoring services. With a sudden deluge of crash reports, how should an error monitoring solution handle it?

The correct response is to buffer increases in crash reports without dropping any and to scale systems to handle the additional load. A robust error monitoring tool should be able to manage the spike and continue to process events, even if this means the worst-case scenario of a slight delay in processing time. Once the bug is rolled back or fixed (as was the case with Facebook within about three hours), any backlog of events should be addressed quickly.
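As a rough illustration of buffering a spike rather than dropping events, the sketch below queues every incoming report and works through the backlog at a fixed rate. The class name, the `max_per_tick` capacity, and the in-memory queue are assumptions made for the example; a production service would more likely use durable storage and add processing capacity as the queue grows.

```python
from collections import deque


class CrashReportBuffer:
    """FIFO buffer that accepts every report and processes at a fixed rate."""

    def __init__(self, max_per_tick):
        self.max_per_tick = max_per_tick  # hypothetical processing capacity per tick
        self.pending = deque()
        self.processed = []

    def ingest(self, report):
        # Never reject a report; a spike only lengthens the queue.
        self.pending.append(report)

    def tick(self):
        # Process up to max_per_tick reports, oldest first.
        for _ in range(min(self.max_per_tick, len(self.pending))):
            self.processed.append(self.pending.popleft())
        return len(self.pending)  # size of the remaining backlog


buffer = CrashReportBuffer(max_per_tick=100)
for i in range(1000):  # simulated spike of 1,000 crash reports
    buffer.ingest({"id": i})

ticks = 0
while buffer.pending:
    buffer.tick()
    ticks += 1
```

The spike costs processing delay (several ticks instead of one), but every report is eventually processed in arrival order.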

This response demonstrates what you want from an error monitoring solution, which can be summed up in three points.

1. Acknowledge there's a problem: Error monitoring providers should be first in line to provide information about the source of a problem. Because they have more insight into a widespread issue than any single app provider will have, error monitoring organizations should provide information and updates to customers to keep them abreast of the larger situation and how the problem is being handled.

2. Provide error events and data: When an app starts to see a huge spike in user crashes, developers need to know why. The first thing they do is turn to their error monitoring solution to see where the error is originating. A solid error monitoring solution should be able to process a stream of error data within a reasonable amount of time and provide some clues about what's happening.

3. Deliver, not disable, continuous processing: Delays in processing errors are one thing; suspending processing of errors is another. Disabling error processing is completely unacceptable from any error monitoring provider. The main function of these tools is to provide continuous monitoring so that developers can view, dissect, and measure all errors — in real time or, at worst, in hindsight. Anything less means the tool isn't doing its job.

As the Facebook issue demonstrated for many unsuspecting organizations, challenges arise when you rely on free error monitoring tools. First and foremost, free services are more likely to take the "easy path" when things get tough. Rather than manage an onslaught of errors, these tools may simply shut down.

That's right: Free error monitoring tools are often turned off for the duration (or longer) of a widespread problem, based on a very simple cost/benefit analysis. Since organizations aren't paying directly for the service (collection of data is the "payment"), there's no accountability from these providers, no incentive to manage the situation correctly, and no real customer support.

After all, disabling the service is much easier than handling the crisis. The only loss for the free error monitoring provider? The opportunity to collect more data.

As a result, organizations that rely on free tools don't have the benefit of hindsight. They can't use their own error data to understand what happened, nor do they see any subsequent errors that should've been captured, starting at the time of the original occurrence and extending to the moment the service is turned back on.

This situation perfectly sums up the old adage: "You get what you pay for." Sadly, it ain't a lot.

Practice Defensive Programming and Error Monitoring

The SDK issue isn't unique to Facebook. A few weeks ago, almost the exact same thing happened to DoorDash, Uber Eats, and other apps that rely on maps when the Google Maps iOS SDK experienced an issue.

The two most important takeaways from these widespread app disasters:

1. Good SDK design tenets dictate that SDKs should never crash an application. What was missing in both the Facebook and Google Maps cases — and what every app company must have — are defensive programming measures that ensure better handling of malformed data from outside servers.

2. Error monitoring solutions matter a great deal. To understand how outages and errors occur, you need real-time error processing so you can address the problem and pinpoint bad code. If you select an error monitoring provider that simply disables its service when things get tough, then you'll remain in the dark. And, with free services, you can pretty much bet on that outcome.
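To make the first takeaway concrete, here is a minimal sketch of defensive handling of data from an outside server. It is not the Facebook or Google Maps SDK code; the config keys and the `parse_remote_config` function are hypothetical. The idea is simply that unparseable, wrongly shaped, or wrongly typed input falls back to safe defaults instead of raising.

```python
import json

# Hypothetical settings an SDK might fetch from its server at launch.
DEFAULT_CONFIG = {"login_enabled": False, "share_enabled": False}


def parse_remote_config(raw):
    """Return a validated config dict, falling back to defaults on bad input."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        # Not JSON at all (or not even a string): use defaults.
        return dict(DEFAULT_CONFIG)
    if not isinstance(data, dict):
        # Valid JSON, but not the expected object shape.
        return dict(DEFAULT_CONFIG)
    config = dict(DEFAULT_CONFIG)
    for key in DEFAULT_CONFIG:
        value = data.get(key)
        if isinstance(value, bool):  # accept only the expected type
            config[key] = value
    return config
```

With this approach, a response such as `[1, 2]` or `{"login_enabled": "yes"}` yields the defaults rather than an exception at application launch.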

What these outages also demonstrate is the absolute need for good software design and error monitoring processes. Developers must know exactly what app features are controlled remotely and why, where everything is documented, and how to turn off third-party integrations when things go sideways without impacting the user experience.
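One way to picture that kind of off switch is a flag check wrapped around each third-party initializer, as in the sketch below. The flag names and the `start_integration` helper are hypothetical, and the flags would normally come from wherever the app keeps its remotely controlled settings. Note that this only illustrates the pattern: on a real mobile platform, some native crashes cannot be caught by an exception handler, which is part of why the flag check comes first.

```python
def start_integration(name, init, flags):
    """Run a third-party init only if its flag is on; never let it break launch."""
    if not flags.get(name, False):
        return "disabled"  # switched off, skip the integration entirely
    try:
        init()
    except Exception:
        # In a real app, this is where the error would be reported.
        return "failed"
    return "started"


def broken_init():
    # Stand-in for an SDK that chokes on an unexpected server response.
    raise ValueError("unexpected response type")


flags = {"social_login": True, "maps": False}
results = {
    "social_login": start_integration("social_login", broken_init, flags),
    "maps": start_integration("maps", lambda: None, flags),
}
```

Here the failing integration is contained and the disabled one is never invoked, so the rest of the launch sequence can continue.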

In an interconnected app world, errors are going to happen. The real question is, can you trust your error monitoring system to always have your crash reports?

James Smith is SVP of the Bugsnag Product Group at SmartBear
