Assurance Test + Big Data Analysis Cure LTE Advanced Networks
Big data analysis is becoming as crucial to testing network systems as physical tests are, because it solves baffling problems whose causes, in some cases, no one could even guess at. In recent deployments, these tools revealed how unexpectedly finicky LTE Advanced can be; that information is helping several wireless carriers improve their mobile service.
In the past two years, an increasing number of analytical tools have hit the market, designed to enable network operators to ingest oceanic flows of data and sift out enough information-krill to keep themselves swimming.
Network assurance specialist Accedian says big data becomes especially useful for identifying dependencies between different network characteristics, including, in several recent LTE Advanced deployments, characteristics that were previously thought to have little or no relationship to each other. That started with a puzzler that popped up on SK Telecom (Nasdaq: SKM)'s network.
SK Telecom's wireless network was experiencing packet loss, leading to performance degradation and dropped calls. Engineers could not figure out what the problem might be, because the network was running at an average utilization rate of only 20% -- the network had capacity to spare.
The company began measuring utilization on the backhaul network every tenth of a second, rolled into a 1-second sample. By drilling down to this level, SK Telecom noticed that it was getting microbursts of a kind that had previously seemed to show up only in financial networks.
It turned out that LTE Advanced creates a chattier, burstier network for mobile backhaul, explained Accedian vice president of product management & services Scott Sumner. Inter-cell communication would occur in 1ms microbursts that exceeded the network's total capacity, leading to packet loss even though average utilization stayed low.
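To see how a link can drop packets at 20% average utilization, consider a minimal Python sketch. The link speed, bin size and traffic figures are hypothetical, chosen only to illustrate the averaging effect; they are not SK Telecom's measurements.

```python
# Hypothetical link: 1 Gbit/s backhaul, observed in 1 ms bins for one second.
LINK_CAPACITY_BITS_PER_MS = 1_000_000  # 1 Gbit/s == 1,000,000 bits per millisecond


def utilization_report(bits_per_ms, capacity_per_ms):
    """Return (average utilization, peak utilization, count of overloaded bins)."""
    average = sum(bits_per_ms) / (len(bits_per_ms) * capacity_per_ms)
    peak = max(bits_per_ms) / capacity_per_ms
    overloaded = sum(1 for bits in bits_per_ms if bits > capacity_per_ms)
    return average, peak, overloaded


# 1,000 one-millisecond bins with a quiet baseline at 18% load...
traffic = [180_000] * 1000
# ...plus ten 1 ms bursts that offer more than twice the link rate
# (standing in for inter-cell chatter; the sizes are made up).
for i in range(0, 1000, 100):
    traffic[i] = 2_180_000

average, peak, overloaded = utilization_report(traffic, LINK_CAPACITY_BITS_PER_MS)
print(f"average utilization over 1 s: {average:.0%}")
print(f"peak utilization in a 1 ms bin: {peak:.0%}")
print(f"1 ms bins exceeding capacity: {overloaded}")
```

With these invented numbers the one-second average looks comfortable, yet ten individual milliseconds offer more traffic than the link can carry, which is where packets would be queued or dropped.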
With LTE Advanced, he noted, "There are new problems in the network we hadn't seen before."
Accedian now knew that measuring at millisecond accuracy might not be detailed enough, and that LTE Advanced was more sensitive than previously understood. It took those lessons to another operator (identified only as a Tier 1 operator in Japan) whose throughput on calls was oscillating, swinging from 200 Mbits/sec to nearly nothing and back in the space of a second.
Accedian ran correlations and generated a graph of this operator's latency information charted against its throughput performance on a microsecond scale. The graph showed frequent 20-microsecond delays that correlated 100% with throughput loss.
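One way to read a "100% correlation" of this kind is as a coincidence check: for every interval in which throughput collapses, does a delay spike fall in the same interval? The Python sketch below does that. The sample values, the thresholds and the function name coincidence_rate are assumptions for illustration, not Accedian's data or method.

```python
def coincidence_rate(delay_us, throughput_mbps, delay_threshold_us, throughput_floor_mbps):
    """Fraction of low-throughput intervals that also contain a delay spike.

    Returns None when there are no low-throughput intervals to examine.
    """
    if len(delay_us) != len(throughput_mbps):
        raise ValueError("both series must cover the same intervals")
    dips = [i for i, t in enumerate(throughput_mbps) if t < throughput_floor_mbps]
    if not dips:
        return None
    spikes_in_dips = sum(1 for i in dips if delay_us[i] >= delay_threshold_us)
    return spikes_in_dips / len(dips)


# Hypothetical per-interval samples: delay in microseconds, throughput in Mbit/s.
delay_us = [2, 2, 20, 2, 2, 20, 2, 20, 2, 2]
throughput_mbps = [200, 200, 5, 200, 200, 5, 200, 5, 200, 200]

rate = coincidence_rate(delay_us, throughput_mbps,
                        delay_threshold_us=20, throughput_floor_mbps=100)
print(f"share of throughput dips that coincide with a delay spike: {rate:.0%}")
```

The check only works if both series are sampled finely enough for a 20-microsecond event to show up at all, which is the point Sumner makes about measurement resolution.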
Ultimately, the operator discovered the phenomenon had to do with the way its antennas were configured in MIMO mode. The spikes in delay skewed the transmission of signaling messages, leading to packets interfering with each other, Sumner said.
"They never would have seen this correlation if they hadn't been measuring down at the microsecond level," he said. They might never have even thought to look. In ordinary LTE networks, there's no correlation between loss and throughput, Sumner noted.
Another operator found that if it lost five packets in a minute, it was no problem; but if it lost five packets in a row, it would get a 1-second outage on data throughput. Under those circumstances, a mere 2% of packet loss has the potential to knock down throughput by 80%.
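The difference between scattered and consecutive loss is easy to miss if only a loss percentage is tracked. The following Python sketch, using made-up packet traces, computes both the loss ratio and the longest run of consecutive lost packets. loss_stats is a hypothetical helper, not part of any Accedian tool.

```python
def loss_stats(received_flags):
    """Return (loss ratio, longest run of consecutive losses).

    received_flags is a sequence of booleans, True when the packet arrived.
    """
    total = lost = longest = current = 0
    for arrived in received_flags:
        total += 1
        if arrived:
            current = 0  # a delivered packet ends the current loss burst
        else:
            lost += 1
            current += 1
            longest = max(longest, current)
    ratio = lost / total if total else 0.0
    return ratio, longest


# Two hypothetical traces of 1,000 packets, each losing exactly five packets.
scattered = [True] * 1000
for i in (100, 300, 500, 700, 900):
    scattered[i] = False  # five isolated losses

bursty = [True] * 1000
for i in range(500, 505):
    bursty[i] = False  # five losses in a row

scattered_ratio, scattered_burst = loss_stats(scattered)
bursty_ratio, bursty_burst = loss_stats(bursty)
print(f"scattered: loss {scattered_ratio:.1%}, longest burst {scattered_burst}")
print(f"bursty:    loss {bursty_ratio:.1%}, longest burst {bursty_burst}")
```

Both traces report the same loss ratio, so a percentage-based alarm treats them identically; only the burst length separates the harmless case from the one that could stall throughput.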
"Another example of strange correlations coming out of these networks," Sumner said.
Another example is an Accedian customer in India (again unidentified), which had nearly half of its voice-over-LTE (VoLTE) calls in entire city sectors drop every 14 minutes.
They had no idea where to start looking for a problem because every metric in the network was well below the threshold for triggering an alarm. The packet loss alarm, for example, was set to trigger at 2%, but the network was experiencing packet loss of less than 0.2%. And yet there was a 95% correlation between packet loss and call drops; packet loss was almost certainly a triggering event.
So this carrier correlated all metrics against each other -- Accedian was supplying it with more than 50 metrics daily, including jitter, class of service, mean opinion score (MOS) and packet loss. The problems with the MOS score correlated only with packet loss metrics.
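A rough sense of what correlating metrics against each other involves can be had from the Python sketch below, which ranks a handful of metrics by the strength of their Pearson correlation with call drops. The metric names and values are invented for illustration, and the source does not say which correlation measure Accedian's analytics used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(metrics, target):
    """Sort metric names by absolute correlation with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in metrics.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical per-interval samples (all values are made up).
call_drops = [0, 0, 9, 0, 1, 0, 10, 0]
metrics = {
    "packet_loss_pct": [0.01, 0.02, 0.19, 0.01, 0.03, 0.02, 0.20, 0.01],
    "jitter_ms": [1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9],
    "cpu_pct": [40, 42, 41, 39, 43, 40, 41, 42],
}

ranking = rank_by_correlation(metrics, call_drops)
for name, r in ranking:
    print(f"{name:16s} r = {r:+.2f}")
```

Note that packet loss in this toy data never approaches a 2% alarm threshold, yet it stands out clearly once it is compared against call drops rather than against a fixed limit.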
Packet loss, meanwhile, also correlated 100% to loss burst -- the loss of packets in a row. So Accedian and its customer took that data and combined it with information from the carrier's Cisco Systems Inc. (Nasdaq: CSCO) routers, Juniper Networks Inc. (NYSE: JNPR) routers and Samsung Corp. basestations, plus meta metrics such as the CPU levels of the network switch, and ran all of it through a big data analysis system. When the company re-ran the correlation, it found that these loss bursts correlated 100% with a particular router's ring switching: the router was flipping from one side of a ring to the other, east to west and west to east, for very short periods of time.
They then went on to correlate it with the angle of tilt on the antennas in the sector. They discovered the SON (self-organizing network) controller was tilting antennas to compensate for interference, and the antenna angle would hit a certain point every 14 minutes. That made a stream of packets disappear, and the loss was amplified by a Cisco router that went into protection switch mode, which made all the calls drop. Then the antennas moved out of that angle and everything came back.
So in sum: instability in the SON interference cancellation algorithm created rapid loss every 14 minutes, putting aggregation routers into protection switch instability, which amplified the loss. The burst of missing packets hit VoLTE call signaling, causing call drops after momentary drops in MOS score.
"No human would have discovered this without analytics, having to look at over a billion records," Sumner said. The network was instrumented with no hardware, just a virtual probe running on a virtual machine.
It's amazing what you can find if you look for it, Sumner said. An Accedian engineer was experiencing problems with his own cellphone service in Miami. He used Accedian's own tools and discovered that the network had a packet-doubling problem. He called the carrier and told them what the problem was, and the initial response was that they were aware that there was a service problem. "They called him back later and told him he was right," Sumner said.
Accedian took home the 2016 Leading Lights Award for Outstanding Test & Measurement Vendor in part for its success in using big data analysis as a means of detection.
— Brian Santo, Senior Editor, Components, T&M, Light Reading