On September 21st, all of our customer services went down around 1am BST (midnight GMT) for around 10 to 11 hours.
For those in North America, this was outside business hours, while those in New Zealand and Australia had no service for a whole day.
This was an unprecedented situation, and I would like to explain what went wrong, what we did to fix it, and what we’re changing to prevent this happening again.
First of all, this was not an outside attack. This was down to mistakes made by Exprodo Software employees. I’m going to remove all names from this account, apart from where there were actions taken by me.
It started with a desire to update our websites at www.exprodo.com and www.calpendo.com. A new version of each website was created, separately from the published website, and the plan was to switch from one version to the other. Unlike all of our other web services, which we maintain ourselves, www.exprodo.com and www.calpendo.com are hosted on Squarespace. The switch was therefore to be done using Squarespace tools, and was expected to need changes only within Squarespace.
The employee making this change decided to do it when it was thought to cause least disruption to the website, and started on it towards the end of the western North American business day. Unfortunately, this also happened to be when none of our technical support staff were awake.
After disabling the old website and enabling the new one, it was then realised that there needed to be a change outside Squarespace to complete the switch to the new website.
When an internet domain is registered, a “registrar” holds that registration. At the time, we used 123-reg.co.uk as our registrar, and we also used 123-Reg to store our DNS settings. This meant that when somebody typed an address like “university-x.calpendo.com” into their web browser, a request went to 123-Reg to work out what the internet address of “university-x.calpendo.com” was, so the browser knew how to talk to it. This process of matching names to addresses is known as “DNS”.
The employee making this change decided that the exprodo.com and calpendo.com domains should be moved from 123-Reg to Squarespace to get the website working. While this did fix the problem with getting www.exprodo.com and www.calpendo.com working, it had an unfortunate side effect. All those DNS settings that say what the internet address is for “university-x.calpendo.com” were lost.
Suddenly, no exprodo.com or calpendo.com URL would work, apart from www.exprodo.com and www.calpendo.com which were set up within Squarespace.
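To make the failure concrete, here is a minimal sketch of name-to-address lookup. It is a toy model, not how 123-Reg or Squarespace store records: the `resolve` helper, the two zone tables and the IP addresses (taken from the ranges reserved for documentation) are all assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical zone held by the DNS provider before the move:
# hostname -> IP address (documentation-range addresses, not real ones).
zone_before: Dict[str, str] = {
    "www.calpendo.com": "192.0.2.10",
    "university-x.calpendo.com": "192.0.2.20",
}

# Hypothetical zone after the move: only the website record was set up.
zone_after: Dict[str, str] = {
    "www.calpendo.com": "198.51.100.5",
}


def resolve(zone: Dict[str, str], hostname: str) -> Optional[str]:
    """Return the address for hostname, or None if the zone has no record."""
    # DNS names are case-insensitive and may carry a trailing dot.
    return zone.get(hostname.lower().rstrip("."))


# The customer hostname resolves before the move, but not after it.
before = resolve(zone_before, "university-x.calpendo.com")
after = resolve(zone_after, "university-x.calpendo.com")
website = resolve(zone_after, "www.calpendo.com")
```

In this model the customer system itself is untouched; only the record that tells browsers where to find it has disappeared.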
The first sign of trouble was an email at 2.40am BST when a customer in New Zealand reported that they could not access their Calpendo. This was followed by an Australian customer at 3.40am BST and a Singaporean customer at 4.10am.
At around 4.30am, another employee saw the support tickets raised by these customers and at 4.45am started trying to wake up the relevant people. Unfortunately, our processes were not sufficient for this task, and two hours were lost as a result while people were asleep.
At 6.45am I took charge of the situation.
It took a little while to see what had happened because the actions taken overnight had not been passed on.
By 7.20am, I knew the domains had been migrated away from 123-Reg, but not yet by whom or why. My attempts to contact the employee who had made the change did not succeed because they were asleep and not responding. Again, our processes for contacting staff were inadequate.
At 7.30am, I sent a message to Exprodo’s support team to tell everybody what the problem was so they could tell customers. At this point, I thought it was possible it had been an attack by a third party.
Also at 7.30am, the first European customer contacted us. They got a reply at 7.57am. A flurry of other customers reported problems around the same time.
At 7.36am, I sent a message to our support team to talk to customers and tell them we have a major issue that we are dealing with and we currently have no ETA for a fix. I decided to focus on fixing the problem while the support team handled communication with customers.
At 8am, I spoke to 123-Reg as their support desk opened for the day. They told me the domains had been migrated to another provider, and that the provider was Squarespace. 123-Reg also told me that once a .com domain is moved to a different registrar, it can’t be moved again for 60 days.
At 8.12am I told our support team to tell everybody to expect a long down time. At this point, I could see that we couldn’t simply roll things back. Somehow we had to move forward and make this work with Squarespace.
At 8.25am, I logged into my Squarespace account, where I am the owner of the www.calpendo.com website, and saw that the site was marked as expired. I didn’t know why at the time, but it was because the old site had been cancelled overnight and the new version of the website set up in its place. Unfortunately, my Squarespace account did not have access to the new website, which meant I couldn’t set up the missing DNS information in Squarespace without getting that access. I didn’t understand this until I spoke to Squarespace, which happened when their support opened at 9am UK time.
At around 9.20am I told our support team to inform customers that their data was safe and unaffected, and that the problem was only that access to it was broken.
At 9.38am I finished chatting with Squarespace. That’s when I knew it was an employee who had made the changes, and that I needed to get access to a different Squarespace account.
I started thinking about contingency plans in case it took too long to access the account, but at 10.00am another employee found the required Squarespace credentials. I could then see the live calpendo.com and exprodo.com domains and their DNS details. I started experimenting with the DNS settings to learn how they worked. I discovered a problem that didn’t exist with 123-Reg’s system, but I could work around it.
At 10.55am, I told our support team that I had put all the settings in place for one customer, and that they should now be fully operational.
At 12.02pm I reported to our support team that I had finished adding entries for all customers. This involved manually adding nearly 300 DNS entries.
I then asked our support team to respond to all customer tickets, telling customers that we believed the problem was fixed and asking them to verify.
All these settings were added from first principles because we did not have a backup of all the DNS settings. That was a problem with the service provided by 123-Reg, and a part of the reason that I had intended to migrate to a better DNS service.
A backup of the DNS settings would have helped most with those customers that were set up in a non-standard way, in particular the very few that had two different URLs to access the same system. A backup would also have helped with our other services, which took a bit longer to set up, such as docs.exprodo.com, downloads.exprodo.com and our internal issue tracker.
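As an illustration of what such a backup could look like, the sketch below saves a list of DNS records as JSON and compares a backup against the live records to find what is missing. It is only a sketch under assumptions: the record layout, the function names and the sample hostnames (including the hypothetical alias for a customer with two URLs) are invented for the example, and fetching the live records from a provider is left out because that depends on the provider.

```python
import json
from typing import List, Tuple

# One DNS record as (hostname, record type, value). This layout is an assumption.
Record = Tuple[str, str, str]


def dump_records(records: List[Record]) -> str:
    """Serialise records to JSON text that can be stored outside the DNS provider."""
    rows = [{"name": n, "type": t, "value": v} for n, t, v in sorted(records)]
    return json.dumps(rows, indent=2)


def load_records(text: str) -> List[Record]:
    """Read records back from JSON text written by dump_records."""
    return [(r["name"], r["type"], r["value"]) for r in json.loads(text)]


def missing_records(backup: List[Record], live: List[Record]) -> List[Record]:
    """Return the records that are in the backup but absent from the live zone."""
    live_set = set(live)
    return [r for r in backup if r not in live_set]


# Hypothetical records, including a customer reachable under two names.
backup = [
    ("university-x.calpendo.com", "A", "192.0.2.20"),
    ("booking.university-x.example", "CNAME", "university-x.calpendo.com"),
    ("docs.exprodo.com", "A", "192.0.2.30"),
]
saved = dump_records(backup)
live_now = [("docs.exprodo.com", "A", "192.0.2.30")]
to_restore = missing_records(load_records(saved), live_now)
```

With a file like this kept somewhere independent of the DNS provider, non-standard entries would not have to be rebuilt from memory.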
Breaking all these things down, there are problems that we need to address.
- All credentials used for accounts we rely on should be stored in a known and secure place so that they can be accessed when required. Nobody should set up new accounts for company use without storing the details in the appropriate place.
- Access to such accounts must be restricted to those employees who have the appropriate skills and training to use them.
- Third parties that support multiple users on a single account, with different access rights for each user, are preferred over those that provide only a single user on each account. For example, we would prefer a system that allows finance people to access the invoices charged on a third party system, but prevents those finance people from accessing the technical parts of the service.
- All system changes should be logged so that others can see what has been done and why.
- There should be more internal separation of account details so that each employee can only access the parts of a system that they have sufficient training to use properly. We will need to look at all our operations to ensure there is better segregation so that, as we grow, we do not share the keys to the kingdom too widely within Exprodo.
- We had no backup of our DNS settings. This must change.
- Customer operations (via the DNS settings) are now tied to the marketing infrastructure (in Squarespace). They were not tied like this before these events. Changes to the marketing website have now been frozen to protect against accidentally breaking things again while the two are tied in this way.
- We must migrate our DNS settings to an enterprise-class DNS service provider.
- It is also essential that we move the domain registration away from Squarespace, but this time by preparing for the move in a way that limits any down time. We cannot do this until at least 60 days after the move from 123-Reg took place.
- Our internal escalation procedures are not well enough defined or planned for an emergency. They must include the means, and the knowledge, to wake people up. We have never had this kind of emergency need before, and we were not prepared for it.
- We have no service status page for customers to look at in an emergency. Typically, these use a completely different web domain and rely on separate infrastructure so it will still work even when major problems affect the service elsewhere.
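Several of these points come down to finding out about a problem before a customer reports it. As a rough sketch of one way to do that, the code below checks whether a list of hostnames still resolves. The function names and hostnames are assumptions for illustration, and this is not a description of tooling we have in place. A check like this would need to run on infrastructure that does not depend on the domains being tested.

```python
import socket
from typing import Callable, Dict, Iterable


def default_lookup(hostname: str) -> str:
    """Resolve a hostname to an IPv4 address using the system resolver."""
    return socket.gethostbyname(hostname)


def check_hostnames(
    hostnames: Iterable[str],
    lookup: Callable[[str], str] = default_lookup,
) -> Dict[str, bool]:
    """Map each hostname to True if it resolves, or False if the lookup fails."""
    results: Dict[str, bool] = {}
    for name in hostnames:
        try:
            lookup(name)
            results[name] = True
        except OSError:
            # socket.gaierror (name not found) is a subclass of OSError.
            results[name] = False
    return results


def failed(results: Dict[str, bool]) -> list:
    """Return the hostnames that did not resolve, sorted for stable reporting."""
    return sorted(name for name, ok in results.items() if not ok)
```

The `lookup` parameter exists so the check can be exercised with a stand-in resolver; in real use the default would query DNS, and any failures would feed an alert or a status page.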
It is my firm belief that when things go wrong, one should be open about what has happened. If customers lose confidence in us because of things we have done badly, then we deserve that. I’m gutted that we had a service disruption of this severity and duration.
I can only hope that our customers can forgive the mistakes we have made and trust us to improve where we need to.
Paul Robinson
Founder, Exprodo Software