Square status - Ireland

Degraded Performance: Square Services
Incident Report for Square IE
Postmortem

Incident Summary: 07/09/2023

A timeline of the events of the outage and steps for remediation

Summary

Last week, Square experienced a multi-hour outage across our services. We understand that you rely on our systems to power your business and that’s a responsibility we take seriously. We apologise for letting you down and for the length of time it took for us to get our systems back up and running.

Beginning at 13:54 U.S. Eastern Time (ET) on 7 September 2023, Square products and services were unavailable. At 02:05 ET on 8 September, systems began to recover, with merchants able to access restored payment services by 05:19 ET. For sellers on a supported configuration that utilised offline mode, Square completed processing offline payments by 13:57 ET on 8 September or, if the device came online at a later time, shortly after the device came online. Square Online websites were available; however, Square Online customers were unable to process payments during the outage.

As we previously shared, this outage was caused by a failure in a key part of our infrastructure: our DNS servers. Now that we’ve completed a root cause analysis, we want to share an overview of the incident and steps for remediation.

Service Impact

We’re going to start with an overview of how Square’s systems work together. Square operates in multiple data centre regions. Square services use DNS and mesh-based routing infrastructure to find service dependencies and serve requests. Without DNS, Square products, internal tools and services can’t communicate, which results in service disruption. In this incident, an unrelated change to our host-based firewalls, combined with a DNS service upgrade, caused unexpected load on our internal DNS servers and caused them to fail. Once node-based DNS caches expired, services couldn’t communicate with their dependencies, which caused external requests to fail.
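To illustrate why the impact appeared only after node-based DNS caches expired, here is a minimal Python sketch of a node-local cache with a fixed time-to-live (TTL). It is not Square's implementation: the class name, the 300-second TTL, the hostname and the address are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


class UpstreamDown(Exception):
    """Raised when the (hypothetical) upstream DNS server cannot answer."""


@dataclass
class NodeDnsCache:
    """Toy node-local DNS cache with a fixed TTL, in seconds."""
    ttl: float
    upstream_healthy: bool = True
    _entries: dict = field(default_factory=dict)  # name -> (address, expires_at)

    def resolve(self, name: str, now: float) -> str:
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # cache hit: the upstream server is not contacted
        # Cache miss or expired entry: the upstream server must answer.
        if not self.upstream_healthy:
            raise UpstreamDown(f"cannot resolve {name}")
        address = "10.0.0.1"  # placeholder answer, assumed for illustration
        self._entries[name] = (address, now + self.ttl)
        return address


cache = NodeDnsCache(ttl=300)
first = cache.resolve("payments.internal.example", now=0)  # fills the cache
cache.upstream_healthy = False  # the upstream DNS servers fail
during_ttl = cache.resolve("payments.internal.example", now=200)  # still served
try:
    cache.resolve("payments.internal.example", now=301)  # entry has expired
    failed_after_expiry = False
except UpstreamDown:
    failed_after_expiry = True
```

In this sketch the lookup at `now=200` still succeeds from the cache even though the upstream server is down, while the lookup at `now=301` fails. A cache of this kind delays the visible effect of a DNS failure rather than preventing it.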

Square’s host-based firewall policy is managed by a central service that pushes firewall policies to nodes in Square data centres, which then expand the policy into firewall rules. This service uses an accelerated rollout strategy to quickly adapt to changing environment state. In this case, however, a small policy change expanded into a much larger ruleset. This large ruleset caused node instability and, when combined with the traffic pattern of DNS, caused DNS to start failing requests.
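The report does not describe how a policy is expanded into rules on each node. As a rough illustration of how one added policy entry can produce a far larger ruleset, the Python sketch below assumes a simple cross-product expansion from region pairs to host pairs. The region names, host counts, port and the `expand_policy` helper are all hypothetical.

```python
from itertools import product


def expand_policy(policy, hosts_by_region):
    """Expand (source_region, dest_region, port) entries into per-host rules."""
    rules = []
    for src_region, dst_region, port in policy:
        # Assumed expansion: one rule per (source host, destination host) pair.
        for src, dst in product(hosts_by_region[src_region],
                                hosts_by_region[dst_region]):
            rules.append((src, dst, port))
    return rules


# Hypothetical inventory: a small region and a much larger one.
hosts_by_region = {
    "region-a": [f"a{i}" for i in range(10)],
    "region-b": [f"b{i}" for i in range(200)],
}

policy_before = [("region-a", "region-a", 53)]
# The "small change": a single extra entry enabling cross-region traffic.
policy_after = policy_before + [("region-a", "region-b", 53)]

rules_before = expand_policy(policy_before, hosts_by_region)
rules_after = expand_policy(policy_after, hosts_by_region)
```

Under these assumptions, the single added policy entry takes the expanded ruleset from 100 rules to 2,100. This is why the size of a policy diff can be a poor guide to the size of the resulting on-node change.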

Square uses a microservices environment both for the services that handle external requests and for many of the internal systems that manage those services. In this case, many services used for troubleshooting and recovery were also impacted, which resulted in an extended outage.

Based on a forensic analysis of the incident, we’ve ruled out a cyberattack as the cause of this incident, and there’s no evidence of a data breach or loss.

Timeline

7 September 2023

  • 11:04 U.S. Eastern Time (ET) - Host-based firewall rule change deployed to enable region communication, increasing on-node firewall rule size.
  • 13:56 ET - DNS zone change.
  • 14:02 ET - Engineers are notified of infrastructure issues and incident response begins, starting with a DNS investigation.
  • 14:47 ET - issquareup.com incident created.
  • 14:52 ET - Work begins to recover internal access and tooling.
  • 15:56 ET - Shed networking traffic to our DNS servers. Started manual work to bring up new DNS servers.
  • 18:00 ET - DNS service capacity is increased, but this does not help. Started manual deployment of networking changes to re-enable our authorisation and access services.
  • 18:29 ET - Internal access services recover. This allows engineers to start working in parallel to recover the authorisation and control plane services.
  • 19:00 ET - Started manual deployment of networking changes to all data centres.
  • 20:36 ET - Square deployment pipeline recovers.
  • 22:06 ET - Rebuild of our DNS servers.
  • 23:52 ET - New configuration based on reverted ruleset is built and configuration begins to be pushed to DNS hosts.

8 September 2023

  • 00:06 ET - Some DNS hosts are healthy and more internal tooling recovers.
  • 00:55 ET - All DNS servers are healthy.
  • 01:30 ET - Partial recovery of internal service-to-service connectivity. Partial recovery of our edge routing infrastructure.
  • 02:05 ET - Some Square systems begin recovery.
  • 02:40 ET - Payment traffic has fully recovered.
  • 03:12 ET - Edge routing infrastructure fully recovered.
  • 04:18 ET - Majority of Square products and services are recovered. issquareup.com incident is updated to say that we’ve implemented a series of fixes.
  • 05:19 ET - issquareup.com incident is resolved.
  • 06:59 ET - Additional DNS capacity is added.
  • 09:52 ET - Background processing of offline payments begins.
  • 13:57 ET - Uploaded offline payments have been fully processed.
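
The timeline above is given in ET, while the status updates further down this page are stamped in BST. As a convenience for readers comparing the two, the following Python sketch converts a few timeline entries. It assumes fixed offsets that hold for September 2023 (ET as UTC-4, BST as UTC+1), and the `et_to_bst` helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets valid for September 2023 (assumption: no DST change in range).
ET = timezone(timedelta(hours=-4), "ET")   # U.S. Eastern Daylight Time
BST = timezone(timedelta(hours=1), "BST")  # British Summer Time


def et_to_bst(year, month, day, hour, minute):
    """Convert a wall-clock time in ET to the same instant in BST."""
    return datetime(year, month, day, hour, minute, tzinfo=ET).astimezone(BST)


outage_start = et_to_bst(2023, 9, 7, 13, 54)
fix_announced = et_to_bst(2023, 9, 8, 4, 18)
incident_resolved = et_to_bst(2023, 9, 8, 5, 19)

# Elapsed time between the start of the outage and the resolved incident.
elapsed = incident_resolved - outage_start

print(fix_announced.strftime("%d %b %Y %H:%M %Z"))
print(elapsed)
```

With these offsets, the 04:18 ET entry on 8 September corresponds to 09:18 BST, and the interval from 13:54 ET on 7 September to 05:19 ET on 8 September is 15 hours 25 minutes.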

Service Improvements

The incident has highlighted a number of opportunities to improve our infrastructure, and we're working on making these changes, which are designed to prevent future incidents:

  • Transitioning our DNS infrastructure to isolated infrastructure.
  • Additional monitoring and optimisations for critical networking infrastructure.
  • Optimising dependencies between our deploy and platform infrastructure where feasible.

Many sellers utilised Offline Mode in order to continue accepting payments. As a precautionary measure, we deferred processing offline payments for a number of hours. We are expanding support for and improving our communication regarding the availability of Offline Mode.

In Closing

We apologise for the disruption our outage might have created for you, your customers, and your employees. We know this situation was made more difficult by our communication frequency and the delayed support response some of you experienced. We will learn from this event and improve our systems and processes.

We appreciate your business and we are committed to doing better to regain your trust.

Posted Sep 18, 2023 - 17:07 BST

Resolved
This incident has been resolved.
Posted Sep 08, 2023 - 14:48 BST
Update
We can now confirm that the disruption impacting Square services has been resolved.
Please be aware that sellers may encounter delays in the updating of certain products/services:

- Offline Mode Payments: Payments are being uploaded, but there will be a slight delay before they appear as completed.
Any new Offline Mode Payments will be completed as normal in the coming hours.

- Square Reporting Tools: There is a possibility of delays in updating new billing and transaction information across all Square reporting tools, including those in all Square Point of Sale apps and the Dashboard.

We understand how important it is to have your business tools fully operational, and for this reason, our engineering team is currently engaged in discussions to prevent similar disruptions from happening in the future.

We sincerely thank you for your patience as our team worked to resolve this issue, and we apologise for any inconvenience this disruption may have caused to your business.

Once this disruption has been fully investigated, we plan to publish a full review of this issue and determine what steps we can take to prevent it from happening again.
Posted Sep 08, 2023 - 14:45 BST
Update
Your continued patience and support mean a lot to us as our engineers oversee the implemented solution. Services are steadily regaining their functionality, and we will share any additional updates on this platform as soon as they become available.
Posted Sep 08, 2023 - 13:21 BST
Update
We are actively observing the recovery of all Square systems and will continue to post live updates here. Thanks again for your patience.
For instant answers to common questions, visit our Support Center at squareup.com/help or our Seller Community at sellercommunity.com.
Posted Sep 08, 2023 - 12:18 BST
Update
We appreciate your ongoing patience and support as our engineers continue to monitor the solution implemented. We are continuing to see services regain functionality and we'll post any further updates here as we have them.
Posted Sep 08, 2023 - 11:22 BST
Update
Our engineering team is continuing to monitor the results of the fix implemented and Square services are continuing to recover.
As a reminder, for instant answers to common questions, visit our Support Center at squareup.com/help or our Seller Community at sellercommunity.com. Thank you.
Posted Sep 08, 2023 - 10:18 BST
Monitoring
Our engineering team has implemented a fix and services are beginning to recover. We’re continuing to monitor the results and will be back with an update shortly. Thank you for your patience!
Posted Sep 08, 2023 - 09:17 BST
Update
At this time, we do not have a solution for the disruption, though we have all the right people working to get it resolved as soon as possible. Very sorry for the inconvenience today.
Posted Sep 08, 2023 - 08:07 BST
Update
All of the appropriate team members are working to identify what's causing this disruption. We'll be back with an update as soon as possible. Thank you for your patience!
Posted Sep 08, 2023 - 07:25 BST
Update
Checking in to let you know that our engineers are still working on a resolution. We'll continue to update you as we learn more.
Posted Sep 08, 2023 - 06:46 BST
Update
Our engineering team are actively working to identify the issue. All hands are on deck, and we'll update you as soon as we have news. Thanks for your patience again!
Posted Sep 08, 2023 - 06:11 BST
Update
Our engineering team is dedicated to finding a solution. We'll share updates as soon as possible. Thank you for your continued patience today.
Posted Sep 08, 2023 - 05:35 BST
Update
We're working to pinpoint the issue's root cause, and will continue to share updates as we get them. Thank you for your understanding!
Posted Sep 08, 2023 - 04:54 BST
Update
We're working hard to find the issue's root cause. We'll share updates ASAP. Your patience is greatly appreciated as we work through this today.
Posted Sep 08, 2023 - 04:21 BST
Update
We are actively working to resolve the disruption affecting multiple Square Services. We thank you for your ongoing patience as we await further updates on our team's progress.
Posted Sep 08, 2023 - 03:46 BST
Update
We're continuing to work on resolving this disruption, and can assure you that we're working hard to get you the information you need. We'll continue to post updates as we learn more.
Posted Sep 08, 2023 - 03:11 BST
Update
Thank you for your patience. We realise that this disruption is impacting many businesses at the moment. We've got the right people on this and we're fully committed to resolving the problem as soon as we can.
Posted Sep 08, 2023 - 02:31 BST
Update
Our engineering team is continuing to work to identify the root cause of this ongoing disruption. We will be back here as soon as any update is shared. As the day goes on, we appreciate your patience with our team.
Posted Sep 08, 2023 - 01:56 BST
Update
Thank you for your ongoing patience as our team continues to investigate the disruption impacting multiple Square Services. We remain committed to providing you with timely updates, and we'll have another update within the hour as we gather more information from our Engineers.
Posted Sep 08, 2023 - 01:16 BST
Update
We appreciate your continued patience as we continue to investigate a disruption with one of our Data Centers. At this time, reaching our Customer Success team may be a longer wait than normal. We will be back with an update within the hour as we receive more information from our Engineers.
Posted Sep 08, 2023 - 00:50 BST
Update
While we investigate the disruption to our Data Center which is currently impacting multiple Square Services, we recommend that Sellers stay logged into their account and avoid logging out.
At this time, reaching our Customer Success team may be a longer wait than normal. We will be back here to update you as soon as we receive more information. Thank you for your patience.
Posted Sep 07, 2023 - 20:22 BST
Identified
We are currently investigating a disruption with one of our Data Centers that is causing an impact on multiple Square Services. At this time, reaching our Customer Success team may be a longer wait than normal. We’ll be back to update as soon as we receive more information from our Engineers.
Posted Sep 07, 2023 - 20:12 BST
This incident affected: Payment Acceptance, Point of Sale, Online Store, Dashboard, Square for Restaurants, Square for Retail, Appointments, Phone Support, and Square Hardware.