Here is an update on the problems. Sorry that I did not post this sooner:
As the result of a malfunction originating in one of our switches, the Greenhost network in Amsterdam (Haarlem) was unreachable between roughly 10:15 and 16:30 CET. After network connectivity was restored, it became apparent that access to the storage systems could not be re-established for all systems. The only way to restore this was to reboot all virtual machines in the network, which impacted all Greenhost services.
It took until 16:00 to restore all major services, and until 17:20 before all virtual machines were rebooted and available. Individual (cloud) VPS clients might have faced additional issues related to their machines rebooting.
There has been no data loss and all delayed e-mail has been delivered afterwards.
Our network has a fully redundant design to make it fault tolerant. However, one of the mechanisms in this design, the Spanning Tree Protocol (STP), was the root cause of the failure.
10:30 – Our engineers confirmed a problem in the traffic propagation of one of our customer VLANs through the network infrastructure.
14:00 – After several hours spent tracing the source of the problem, it was found that one of our core switches had experienced an STP re-election earlier that morning and had since been discarding the traffic of the VLAN in question.
14:30 – After consulting the documentation, possible courses of action to remedy the issue were discussed.
15:00 – After agreeing on a possible solution, an attempt was made to trigger a new STP root switch re-election, which in theory would have enabled proper propagation of the traffic of the VLAN in question.
15:15 – While attempting to trigger the STP root re-election, a network loop was inadvertently created over one of the access switches. This caused network congestion that left most of our infrastructure unable to communicate with the rest of the platform or the outside world. Unfortunately the out-of-band access failed, so reverting the configuration was not immediately possible.
15:25 – A team of our engineers was dispatched to our data center to fix the problem on-site, while our front office tweeted about the outage and prepared a voice message for clients calling about the issue. Meanwhile, the engineers at the headquarters tried to find a way to restore communications remotely.
16:00 – The engineers arrived at the data center, but entry was delayed because it was too crowded at the data center at that point in time.
16:10 – Once on site, the engineers reverted the problematic switch to its original configuration, which restored the network. The team at the headquarters immediately started verifying the state of the platform.
16:15 – At this point it was apparent that some of our infrastructure wasn’t able to recover from the network outage automatically. Modern approaches to Web, e-mail and VPS hosting rely on underlying storage that depends heavily on the network to function, which makes them more sensitive to network disruptions, especially if timeouts occur.
16:25 – After a quick planning session engineers set out to restart and verify the operation of our infrastructure elements that were showing erratic behaviour after the outage. This affected a large part of our services, including e-mail, Web and most of our VPS hosting customers.
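For readers unfamiliar with STP, the root re-election mentioned at 14:00 and 15:00 can be illustrated with a minimal sketch. It assumes classic spanning tree behaviour, in which the switch with the lowest bridge ID (priority first, then MAC address) becomes the root. The switch names, priorities and MAC addresses are hypothetical and do not describe the Greenhost network, and lowering a priority is shown only as one common way to force a new election, not necessarily the change that was attempted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bridge:
    name: str      # hypothetical switch name
    priority: int  # configurable bridge priority (lower is preferred)
    mac: str       # MAC address, used as the tie-breaker


def bridge_id(bridge):
    # Compare priority first, then the MAC address as an integer.
    return (bridge.priority, int(bridge.mac.replace(":", ""), 16))


def elect_root(bridges):
    # The bridge with the numerically lowest bridge ID becomes root.
    return min(bridges, key=bridge_id)


# Hypothetical topology: all switches start with the same priority.
switches = [
    Bridge("core-1", 32768, "00:00:00:00:00:0a"),
    Bridge("core-2", 32768, "00:00:00:00:00:0b"),
    Bridge("access-1", 32768, "00:00:00:00:00:0c"),
]

# Equal priorities, so the lowest MAC address wins.
print(elect_root(switches).name)

# Lowering one switch's priority changes the outcome of the election.
switches[1] = Bridge("core-2", 4096, "00:00:00:00:00:0b")
print(elect_root(switches).name)
```

The sketch only models which switch wins the election. It does not model port states or loop prevention, which is where the failure at 15:15 occurred.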