Final Report on Reason For Outage – Route Engine Replacement - Tierpoint Texas - NE-080515-01
TierPoint Texas has been working with vendors and third-party resources to fully identify and document the network events on July 31st and August 5th.
TierPoint’s IP Managed network contains a pair of border routers. These routers work in tandem and, by design, if one router fails, the other remains online to ensure customer connectivity. Both border routers also have internal redundancies; one such redundancy is a pair of Routing Engines in an Active/Standby configuration. The events related to this RFO were the result of the active Routing Engine on Border2 failing partially, in an extremely obscure manner, so that the Standby Routing Engine did not assume the role of primary. All traffic transiting Border2 during the events was dropped, resulting in a ~50% reduction in throughput. TierPoint has determined that the root cause of the recent network issues is the partially failed Routing Engine on Border2.
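The ~50% figure follows from simple arithmetic if traffic is split evenly across the two border routers. The short Python sketch below illustrates this under that assumption; the even split, the function name, and the data layout are illustrative only and are not taken from the router configuration.

```python
def delivered_fraction(shares, forwarding):
    """Return the fraction of offered traffic that is actually delivered.

    shares:     router name -> share of total traffic (assumed values)
    forwarding: router name -> True if the router is forwarding traffic
    """
    total = sum(shares.values())
    delivered = sum(share for router, share in shares.items() if forwarding[router])
    return delivered / total


# Assumption: traffic is split evenly between the two border routers.
shares = {"Border1": 0.5, "Border2": 0.5}

# During the events, Border2 dropped all traffic transiting it.
forwarding = {"Border1": True, "Border2": False}

print(delivered_fraction(shares, forwarding))
```

With a different traffic split, the reduction would equal whatever share Border2 was carrying at the time.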
During the investigation, TierPoint engaged vendors, third-party consultants, and our engineering team from other TierPoint markets to give us as much insight and as many perspectives as possible. During this ‘deep dive’ process, we identified several opportunities to improve performance across our network. While these aren’t necessarily related to the Routing Engine failure, we are implementing changes to improve the overall performance and health of our network.
Tierpoint has implemented the following changes based on the internal, peer, and vendor review.
- The failing Routing Engine on Border2 has been replaced with new hardware. Since the replacement, Border2 has remained entirely stable with no indication of fault.
- TierPoint has replaced NetFlow v5 with IPFIX. This change results in two improvements. First, IPFIX is provided via hardware instead of software, moving the processing burden off the Routing Engine and onto the linecard ASICs. We have seen Routing Engine CPU utilization during peak hours drop from 6% to 3%. Second, the change increased the number of flows provided to Arbor, Noction, and PRTG, resulting in a much more granular view of our traffic and improving attack detection speed and accuracy.
- The Routing Engines are protected by a firewall filter. On our border routers, this filter used a "reject" statement rather than a "discard" statement. A "reject" statement generates a return ICMP packet for each dropped packet, increasing the processing burden on the Routing Engine, so certain types of attacks can drive up Routing Engine CPU utilization. A "discard" statement, which simply drops the packet without generating an ICMP response, has been implemented across all network devices.
- Tierpoint has implemented additional logging to help determine root cause for any future events. Our vendor has reviewed our logging configuration and confirms it matches best practices.
- TierPoint has implemented improved out-of-band management utilizing both the IP Plus and IP Managed networks. Engineers will now be able to access critical routing equipment remotely and more efficiently, regardless of network status.
- TierPoint will be scheduling a code upgrade to a newer version of Junos, which adds safeguards against attacks and improves performance. The updated version of Junos is currently under TierPoint review. We expect the code upgrade for our border routers to occur in mid-September.
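To illustrate the "reject" versus "discard" change described in the list above, here is a minimal Python sketch that counts the ICMP responses each action would generate for a burst of filtered packets. It is a simplified model based only on the behavior described in this RFO, not router code; the names `FilterStats` and `apply_filter` are hypothetical, and the model ignores any rate limiting a real device might apply.

```python
from dataclasses import dataclass


@dataclass
class FilterStats:
    dropped: int = 0        # packets dropped by the filter
    icmp_replies: int = 0   # ICMP responses generated while dropping them


def apply_filter(action, unwanted_packets):
    """Model a filter term handling a number of unwanted packets.

    action: "reject" (drop and send an ICMP response per packet)
            or "discard" (drop silently).
    """
    if action not in ("reject", "discard"):
        raise ValueError(f"unknown action: {action}")
    stats = FilterStats()
    for _ in range(unwanted_packets):
        stats.dropped += 1
        if action == "reject":
            # Each rejected packet costs extra work to build a reply.
            stats.icmp_replies += 1
    return stats


burst = 100_000  # hypothetical number of filtered packets during an attack
print(apply_filter("reject", burst))
print(apply_filter("discard", burst))
```

In this model both actions drop the same number of packets, but only "reject" produces per-packet reply work, which is the extra load the change to "discard" is meant to remove.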
Please let us know if you have any additional questions. We appreciate your patience while waiting for the release of the final RFO. TierPoint is confident the previous issues have been fully resolved and network stability has been restored.