A timeline of the events of the outage and steps for remediation
Last week, Square experienced a multi-hour outage across our services. We understand that you rely on our systems to power your business and that’s a responsibility we take seriously. We apologise for letting you down and for the length of time it took for us to get our systems back up and running.
Beginning at 13:54 U.S. Eastern Time (ET) on 7 September 2023, Square products and services were unavailable. At 14:05 ET 8th September systems began to recover with merchants able to access restored payment services by 17:19 ET. For sellers on a supported configuration that utilised offline mode, Square completed processing offline payments by 13:57 PM ET on 8th September or, if the device came online at a later time, shortly after the device came online. Square Online websites were available; however, Square Online customers were unable to process payments during the outage.
As we previously shared, this outage was caused by a key part of our infrastructure, our DNS servers. Now that we’ve completed a root cause analysis, we want to share an overview of the incident and steps for remediation.
We’re going to start with an overview of how Square’s systems work together. Square operates in multiple data centre regions. Square services use DNS and mesh-based routing infrastructure to find service dependencies and serve requests. Without DNS, Square products, internal tools and services can’t communicate, which results in service disruption. In this incident, an unrelated change to our host-based firewalls combined with a DNS service upgrade caused unexpected load on our internal DNS servers and caused them to fail. Once node-based DNS caches expired, services couldn’t communicate with their dependencies and caused external requests to fail.
Square’s host-based firewall policy is managed by a central service that pushes firewall policies to nodes in Square data centres, which then expand the policy into firewall rules. This service uses an accelerated rollout strategy to quickly adapt to changing environment state. But, in this case, a small policy change expanded to a much larger ruleset. This large ruleset caused node instability and when combined with the traffic pattern of DNS, caused DNS to start failing requests.
Square uses a microservices environment for services that handle external requests and many internal systems to manage our services. In this case, many services used for troubleshooting and recovery were also impacted, which resulted in an extended outage.
Based on a forensic analysis of the incident, we’ve ruled out a cyberattack as the cause of this incident, and there’s no evidence of a data breach or loss.
7 September 2023
8 September 2023
The incident has highlighted a number of opportunities to improve our infrastructure, and we're working on making these changes, which are designed to prevent future incidents:
Many sellers utilised Offline Mode in order to continue accepting payments. As a precautionary measure, we deferred processing offline payments for a number of hours. We are expanding support for and improving our communication regarding the availability of Offline Mode.
We apologise for the disruption our outage might have created for you, your customers, and your employees. We know this situation was made more difficult by our communication frequency and the delayed support response some of you experienced. We will learn from this event and improve our systems and processes.
We appreciate your business and we are committed to doing better to regain your trust.