The Federal Communications Commission has finished investigating T-Mobile for a network outage that Chairman Ajit Pai called “unacceptable.” But instead of punishing the mobile carrier, the FCC is merely issuing a public notice to “remind” phone companies of “industry-accepted best practices” that could have prevented the T-Mobile outage.
After the 12-hour nationwide outage on June 15 disrupted texting and calling services, including 911 emergency calls, Pai wrote that “The T-Mobile network outage is unacceptable” and that “the FCC is launching an investigation. We’re demanding answers—and so are American consumers.”
Pai has a history of talking tough with carriers and not following up with punishments that might have a greater deterrent effect than sternly worded warnings. That appears to be what happened again yesterday when the FCC announced the findings from its investigation into T-Mobile. Pai said that “T-Mobile’s outage was a failure” because the carrier didn’t follow best practices that could have prevented or minimized it, but he announced no punishment. The matter appears to be closed based on yesterday’s announcement, but we contacted Chairman Pai’s office today to ask if any punishment of T-Mobile is forthcoming. We’ll update this article if we get a response.
FCC details T-Mobile mistakes
The staff-investigation report identified several mistakes made by T-Mobile during the outage, which began as T-Mobile was installing new routers in the Southeast US. When a fiber transport link in the region failed, T-Mobile’s network should have transferred traffic across a different link. But the carrier “had misconfigured the weight of the links to one of its routers,” which “prevented the traffic from flowing to the new active router as intended.” T-Mobile hadn’t implemented any fail-safe process to prevent the misconfiguration or to alert network engineers to the problem.
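To make the link-weight problem concrete, here is a minimal sketch of how a shortest-path routing protocol picks a backup route, and how one bad weight can steer traffic away from the router that was supposed to take over. The topology, router names, and weights are hypothetical and chosen only for illustration; they are not taken from the FCC report or from T-Mobile's network.

```python
import heapq

def shortest_path(links, src, dst):
    """Dijkstra over an undirected weighted graph. Returns (cost, path) or None."""
    graph = {}
    for (a, b), weight in links.items():
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, weight in graph.get(node, {}).items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return None

def failover_path(links, failed_link, src, dst):
    """Recompute the best path after removing one failed link."""
    remaining = {k: w for k, w in links.items() if k != failed_link}
    return shortest_path(remaining, src, dst)

# Hypothetical topology: lower weight means a more preferred link.
intended = {
    ("atlanta", "core"): 10,            # primary fiber link
    ("atlanta", "new_router"): 20,      # intended backup via the new router
    ("new_router", "core"): 10,
    ("atlanta", "legacy_router"): 50,   # last-resort path
    ("legacy_router", "core"): 50,
}
# Assumed misconfiguration: the weight toward the new router is far too high.
misconfigured = dict(intended)
misconfigured[("atlanta", "new_router")] = 1000

primary = ("atlanta", "core")
good = failover_path(intended, primary, "atlanta", "core")
bad = failover_path(misconfigured, primary, "atlanta", "core")
print(good)  # (30, ['atlanta', 'new_router', 'core'])
print(bad)   # (100, ['atlanta', 'legacy_router', 'core'])
```

A pre-deployment check along these lines, asserting that the failover path actually passes through the intended router, is one simple form the missing fail-safe could take.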
The Atlanta market “became isolated” from the rest of the network, causing all LTE users in the area to lose connectivity. A software error made things worse by preventing mobile devices in the Atlanta area from re-registering with the IP Multimedia Subsystem over Wi-Fi. Instead of routing device-registration attempts to a different node, “the registration system repeatedly routed re-registration attempts for each mobile device to the last node retained in its records, which was unavailable due to the market isolation.”
The software error had existed in T-Mobile’s network for months. “This software error likely did not cause problems before this outage occurred because the outage was the first notable market isolation since T-Mobile integrated this software into its network,” the FCC said. Regular testing “could have discovered the software flaw and routing misconfiguration before they could impact live calls,” the FCC also said.
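The registration flaw described by the FCC can be sketched in a few lines. The example below is a simplified model, not T-Mobile's software: the function names, node names, and retry limit are all assumptions made for illustration. It contrasts a retry loop that always returns to the last node on record with one that moves on to other nodes.

```python
def register_sticky(device, node_up, last_node, max_attempts=5):
    """Flawed behavior: every retry goes to the last node on record."""
    attempts = 0
    for _ in range(max_attempts):
        attempts += 1
        if node_up[last_node[device]]:
            return last_node[device], attempts
    return None, attempts

def register_with_failover(device, node_up, last_node, max_attempts=5):
    """Try the last node first, then fall through to the other nodes."""
    preferred = last_node[device]
    candidates = [preferred] + [n for n in node_up if n != preferred]
    attempts = 0
    for node in candidates[:max_attempts]:
        attempts += 1
        if node_up[node]:
            last_node[device] = node  # remember the node that worked
            return node, attempts
    return None, attempts

# Simulated market isolation: the node on record is unreachable.
node_up = {"node_a": False, "node_b": True}

sticky_result = register_sticky("phone1", node_up, {"phone1": "node_a"})
failover_result = register_with_failover("phone1", node_up, {"phone1": "node_a"})
print(sticky_result)    # (None, 5)
print(failover_result)  # ('node_b', 2)
```

A test that marks one node as unavailable, as this one does, is the sort of routine check that would expose the sticky behavior before it affected live calls.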
After the trouble on June 15 began, T-Mobile engineers “ended up exacerbating [the outage’s] impact because they misdiagnosed the problem.” The FCC report continued:
T-Mobile believed that the fiber transport link that failed earlier in the day was continuing to cause the ongoing outage. Acting on this belief, T-Mobile manually shut down the link in an attempt to transfer traffic away from it. Due to the still-misconfigured Open Shortest Path First weights, however, these steps recreated the outage’s initial conditions. LTE customers in the Atlanta market were again disconnected from the LTE network and forced to establish calls over Wi-Fi, and their registration attempts again failed and created a registration storm that added further congestion to T-Mobile’s IP Multimedia Subsystem.
T-Mobile engineers almost immediately recognized that they had misdiagnosed the problem. However, they were unable to resolve the issue by restoring the link because the network management tools required to do so remotely relied on the same paths they had just disabled. When T-Mobile engineers were able to access the equipment on site and correct their mistake by restoring the link an hour later, customers in the Atlanta market were again able to attempt to register to VoLTE [Voice over LTE]. However, this again created additional congestion because T-Mobile engineers had not yet addressed the software error that prevented registrations from completing.
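The remote-access problem in that account is a dependency between the management path and the link being shut down. As a rough sketch of a pre-change check, the code below asks whether a device would still be reachable from a management station once a given link is disabled. The node names and both topologies are hypothetical, and the report does not say what tooling T-Mobile uses.

```python
from collections import deque

def reachable(links, src, dst):
    """Breadth-first search over undirected links."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def safe_to_disable(links, link, mgmt_station, device):
    """True if the device stays manageable after the link is removed."""
    remaining = [l for l in links if l != link]
    return reachable(remaining, mgmt_station, device)

# Management traffic shares the production path (assumed layout).
in_band = [("noc", "core"), ("core", "atlanta_router")]
# Same network plus a separate out-of-band management path (assumed layout).
with_oob = in_band + [("noc", "oob_gateway"), ("oob_gateway", "atlanta_router")]

target = ("core", "atlanta_router")
in_band_ok = safe_to_disable(in_band, target, "noc", "atlanta_router")
oob_ok = safe_to_disable(with_oob, target, "noc", "atlanta_router")
print(in_band_ok)  # False
print(oob_ok)      # True
```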
Outage goes nationwide
The FCC report explained how the outage spread from the Atlanta market to the rest of the country. External traffic destined for the Atlanta system was redirected to other regions, which “created enough congestion in those registration systems to cause the T-Mobile network to send the registration attempts to other nodes. The software error again routed re-registration attempts to the last node on record, which was likely already experiencing severe congestion.” Shortly after, “IP Multimedia Subsystem, VoLTE, and Voice over Wi-Fi registrations began to fail nationwide.”
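A generic cascading-overload model shows why redirected traffic can take down regions that were healthy a moment earlier. This is a deliberately simplified sketch: the regions, load figures, capacity threshold, and even-split redistribution rule are assumptions for illustration and do not reproduce the specific registration behavior the FCC described.

```python
def cascade(loads, capacity, first_failure):
    """Return regions in the order they fail once first_failure goes down.

    When a region fails, its load is split evenly among the survivors;
    any survivor pushed above capacity is queued to fail next.
    """
    loads = dict(loads)  # work on a copy
    failed = []
    pending = [first_failure]
    while pending:
        region = pending.pop(0)
        if region in failed:
            continue
        failed.append(region)
        survivors = [r for r in loads if r not in failed]
        if not survivors:
            break
        share = loads[region] / len(survivors)
        for r in survivors:
            loads[r] += share
        pending.extend(
            r for r in survivors if loads[r] > capacity and r not in pending
        )
    return failed

# Hypothetical registration load per region, in arbitrary units.
loads = {"atlanta": 60, "dallas": 70, "chicago": 70, "seattle": 70}

contained = cascade(loads, capacity=100, first_failure="atlanta")
nationwide = cascade(loads, capacity=85, first_failure="atlanta")
print(contained)   # ['atlanta']
print(nationwide)  # ['atlanta', 'dallas', 'chicago', 'seattle']
```

In this toy model, the difference between a contained failure and a nationwide one is only how much headroom each region has when the extra load arrives.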
The vast majority of T-Mobile customers were unable to connect to Voice over LTE or Voice over Wi-Fi networks, and thus “fell back to T-Mobile’s 3G and 2G circuit-switched networks to make and receive calls while the device continued its registration attempts to the VoLTE network.” This resulted in 3G and 2G congestion, causing many phone calls to fail. Network nodes continued to hold resources for these call sessions after the calls terminated, overwhelming the nodes’ computing resources and causing even more call failures.
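The last failure mode, nodes holding on to resources after calls end, behaves like a classic resource leak. The sketch below is a toy model under assumed numbers: the `CallNode` class, its capacity, and the call counts are hypothetical and are not drawn from the FCC report.

```python
class CallNode:
    """Toy network node with a fixed number of call-session slots."""

    def __init__(self, capacity, release_on_end):
        self.capacity = capacity
        self.release_on_end = release_on_end  # False models the leak
        self.held = 0

    def start_call(self):
        """Reserve a slot; return False if none are free."""
        if self.held >= self.capacity:
            return False
        self.held += 1
        return True

    def end_call(self):
        """Free the slot, unless this node leaks resources."""
        if self.release_on_end and self.held > 0:
            self.held -= 1

def simulate(node, calls):
    """Place calls one after another and count how many fail to start."""
    failed = 0
    for _ in range(calls):
        if node.start_call():
            node.end_call()
        else:
            failed += 1
    return failed

leaky_failures = simulate(CallNode(capacity=3, release_on_end=False), calls=10)
healthy_failures = simulate(CallNode(capacity=3, release_on_end=True), calls=10)
print(leaky_failures)    # 7
print(healthy_failures)  # 0
```

Even with calls placed one at a time, the leaky node runs out of slots after three calls and rejects every call that follows.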