

Colo Outage Final Explanation (Geeks Only)


  • Colo Outage Final Explanation (Geeks Only)

    Final Report on Reason for Outage (RFO) – Routing Engine Replacement – TierPoint Texas – NE-080515-01

    TierPoint Texas has been working with vendors and third-party resources to fully document and identify the network events of July 31st and August 5th.

    TierPoint’s IP Managed network contains a pair of border routers. These routers work in tandem and, by design, if one of the routers were to fail, the other router would remain online, ensuring customer connectivity. Both border routers also have internal redundancies; one such redundancy is a pair of Routing Engines in an Active/Standby configuration. The events related to this RFO were the result of the active Routing Engine on Border2 failing partially, in an extremely obscure manner, with the result that the Standby Routing Engine did not assume the role of primary. All traffic transiting Border2 during the events was dropped, resulting in a ~50% reduction in throughput. TierPoint has determined that the root of the recent network issues sits with the partially failed Routing Engine on Border2.
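
    To illustrate the Active/Standby design described above, here is a minimal sketch of Routing Engine redundancy on a Juniper chassis. This is a generic illustration of the Junos feature, not TierPoint's actual configuration; note that keepalive-based failover is precisely the mechanism a *partial* RE failure can defeat, since a half-failed master may keep answering keepalives.

    ```
    # Sketch: Routing Engine redundancy on a Juniper chassis
    # (illustrative only, not TierPoint's configuration)
    chassis {
        redundancy {
            routing-engine 0 master;           # RE0 is the preferred master
            routing-engine 1 backup;           # RE1 stands by
            failover on-loss-of-keepalives;    # switch over if the master stops responding
            graceful-switchover;               # backup RE mirrors kernel state (GRES)
        }
    }
    routing-options {
        nonstop-routing;                       # preserve routing protocol state across a switchover
    }
    ```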

    During the investigation process, TierPoint engaged vendors, third-party consultants, and engineering teams from other TierPoint markets to give us as much insight and as many perspectives as possible. During this ‘deep dive’ process, we identified several opportunities to improve performance across our network. While these aren’t necessarily related to any issues with the Routing Engine, we are implementing changes to improve the overall performance and health of our network.

    TierPoint has implemented the following changes based on the internal, peer, and vendor review.

    - The failing Routing Engine on Border2 has been replaced with new hardware. Since the replacement, Border2 has remained entirely stable with zero indication of fault.

    - TierPoint has replaced NetFlow v5 with IPFIX. This change results in two improvements. First, IPFIX export is performed in hardware instead of software, moving the processing burden off of the Routing Engine and placing the transactions on the line-card ASICs. We have seen a drop from 6% to 3% utilization on the Routing Engine CPU during peak hours. Second, this increased the number of flows provided to Arbor, Noction, and PRTG, resulting in a much more granular view of our traffic and improving attack detection speed and accuracy.
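
    For readers curious what "hardware IPFIX" looks like, inline flow export on a Juniper MX-class router is configured roughly as below, with sampling handled by the line-card ASICs (inline J-Flow) rather than the Routing Engine. The collector address, source address, instance name, and sampling rate are all hypothetical placeholders, not TierPoint's values.

    ```
    # Sketch: inline IPFIX export on a Juniper MX-class router (hypothetical values)
    services {
        flow-monitoring {
            version-ipfix {
                template ipv4-flows {
                    flow-active-timeout 60;
                    ipv4-template;
                }
            }
        }
    }
    chassis {
        fpc 0 {
            sampling-instance sample-1;            # run sampling on the line card, not the RE
        }
    }
    forwarding-options {
        sampling {
            instance {
                sample-1 {
                    input { rate 1000; }           # sample 1 in 1000 packets (hypothetical rate)
                    family inet {
                        output {
                            flow-server 192.0.2.10 {        # collector (hypothetical address)
                                port 4739;                  # IANA-assigned IPFIX port
                                version-ipfix { template ipv4-flows; }
                            }
                            inline-jflow { source-address 192.0.2.1; }
                        }
                    }
                }
            }
        }
    }
    ```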

    - The Routing Engines are protected by a firewall filter. Our border routers' filters used a "reject" statement rather than a "discard" statement. A "reject" action generates a return ICMP packet for each dropped packet, increasing the processing burden on the Routing Engine, and certain types of attacks can drive up Routing Engine CPU utilization while it processes those rejects. A "discard" statement, which simply dismisses the packet without generating an ICMP response, has been implemented across all network devices.
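
    The difference between the two actions is a one-word change in the loopback filter. Below is a hypothetical protect-RE filter illustrating the "discard" action; the term names, counter name, and the single allow-BGP term are illustrative, not TierPoint's filter.

    ```
    # Sketch: protect-RE filter on lo0 (hypothetical terms and names)
    firewall {
        family inet {
            filter protect-re {
                term allow-bgp {
                    from { protocol tcp; port bgp; }
                    then accept;
                }
                term drop-rest {
                    then {
                        count re-drops;
                        discard;    # silently drop; "reject" would generate an
                                    # ICMP unreachable per packet, burning RE CPU
                    }
                }
            }
        }
    }
    interfaces {
        lo0 {
            unit 0 {
                family inet { filter { input protect-re; } }   # filter traffic destined to the RE
            }
        }
    }
    ```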

    - TierPoint has implemented additional logging to help determine root cause for any future events. Our vendor has reviewed our logging configuration and confirms it matches best practices.
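
    The RFO does not say what was added, but "additional logging" on a Junos device typically means shipping syslog off-box with longer local retention, along the lines of the sketch below. The collector address and retention sizes are hypothetical.

    ```
    # Sketch: off-box syslog with local retention on a Junos device (hypothetical values)
    system {
        syslog {
            host 198.51.100.50 {               # remote log collector (hypothetical address)
                any notice;
                interactive-commands any;      # record operator CLI activity
            }
            file messages {
                any info;
                archive size 10m files 10;     # retain history for post-event analysis
            }
        }
    }
    ```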

    - TierPoint has implemented improved Out of Band management utilizing both the IP Plus and IP Managed networks. Engineers will now be able to more efficiently access critical routing equipment remotely, regardless of network status.

    - TierPoint will be scheduling a code upgrade to a newer version of Junos, which implements additional safeguards against attacks and improved performance. The updated version of Junos is currently under TierPoint review. We expect the code upgrade for our border routers to occur in mid-September.

    Please let us know if you have any additional questions. We appreciate your patience while awaiting the release of the final RFO. TierPoint feels confident the previous issues have been fully resolved and network stability has been restored.
  • #2

    So basically, whoever set up these border routers was clueless about proper configuration, and someone thought it'd be cool, in the event of a DDoS attack, to send those packets back to the source via ICMP. Cute. That only doubled the utilized bandwidth, keeping others from legitimately using the network.

    So what they've done is get hold of someone who knows what they're doing to properly configure the border routers.


    • #3

      I blame Ouch.


      • #4

        Even with these edge routers' "issues", where was the failover? WestIP (formerly Smoothstone, Teledvance) has a COLO in my building that does not sell slots anymore, as we make the customers host their own now (gonna see if I could possibly sneak a game server in there), and even we have more power and redundancy than these guys. Hey, maybe it is because we value our customers!

        Batman is right, but I bet even he can't explain why a COLO would not have a backup route for data above and beyond this setup. NOC alarms would have been going off like crazy, and they should have been able to at least temporarily reroute the traffic. I think we have smarter NOC and IT people in the clan than TierPoint has working for them.

        Sad when you gotta go to Cisco to see what fucked up! Cisco equipment has not changed in decades. IOS is still the language, and if they are not using Cisco, shame on them. Rant over!

        http://www.tierpoint.com/data-centers/texas/dallas/
        Wake up and smell the corruption!


        • #5

          Maybe this guy tried to fix it the first time and hosed it....

          Having a dog named "Shark" at the beach was a bad idea!


          • #6

            What? o.O

