In our introductory post on network metrics, we focused on a critical idea: without precise measurement, there is no way to sync network nodes. Without node synchronization, there’s no way to have properly functioning time-sensitive applications, whether that means robots on an automated assembly line or gaming with a squad of geographically dispersed friends. Conventional measurement metrics don’t address real performance concerns. For real-time applications, one second can be an eternity, and one misaligned clock across a network has an exponential impact on performance.
When we say precision, we mean a degree of granularity and speed in measurement that many network providers deem impossible to achieve. “Impossible” or not (spoiler: it’s not), we at Subspace feel that modern, global scalability requires sub-millisecond accuracy. In this article, we’ll explain why this degree of accuracy is necessary and some of how we achieve it.
Internet Meteorology — a Temporal Graph of the Internet
True story: Bad weather forecasts lead to unhappy people. Imagine if forecasts only said, “An hour ago, it was raining over here, so an hour from now, if nothing changes, it’ll probably rain over there.” Sounds like a recipe for imprecision and disappointment, right? Of course, things change. Seeing how they change and drawing accurate conclusions from those observations requires more precise measurement using better metrics.
Most often, assessing network performance depends on latency telemetry. This is typically measured with pings — the round-trip time, in milliseconds, for a signal to travel from one network node to another and for the destination’s reply to return — but pings are neither sufficiently informative nor frequent. Subspace measures each direction between nodes with accuracy better than 50 microseconds (1,000 microseconds = 1 millisecond) and does so at intervals of less than 10 µs.
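To make the contrast with round-trip pings concrete, here is a minimal Python sketch of the per-direction idea: computing one-way latency from send and receive timestamps, which only works if both nodes’ clocks are tightly synchronized. The function name and timestamp values are illustrative, not Subspace’s actual implementation.

```python
def one_way_latency_us(send_ts_ns: int, recv_ts_ns: int) -> float:
    """One-way latency in microseconds, assuming the sender's and
    receiver's clocks are synchronized to well under the value measured."""
    return (recv_ts_ns - send_ts_ns) / 1_000.0

# Hypothetical timestamps (nanoseconds) from two synchronized nodes:
sent = 1_700_000_000_000_000_000
received = sent + 42_000  # reply arrives 42 µs later
print(one_way_latency_us(sent, received))  # 42.0
```

The key point: a ping collapses both directions into one number, while timestamped one-way probes expose asymmetric paths — and any clock error between the two nodes shows up directly in the measurement, which is why synchronization accuracy matters so much.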
So many measurements create a massive streaming data load coming from thousands of nodes simultaneously. Without precision, these data points couldn’t be correlated in real-time to show geographically dispersed associations or used to draw accurate conclusions. (Subspace has developed unique methods for performing this computation that would otherwise overwhelm most computing systems.) Taken together, though, these minute data points, when plotted over time, show an extremely granular picture of network “weather.” It can be the difference between seeing a storm begin to develop as correlated readings of pressure, humidity, wind, and dozens of other factors emerge versus belatedly looking at the screen and saying, “Oh, right. That’s a hurricane.”
Subspace’s measurement granularity helps spot danger patterns in their earliest stages. This enables the network to reroute traffic onto alternate top-performance routes long before link congestion or outages become critical. Standard, slow ping measurements don’t reflect current conditions; routes may already be down by the time an issue is detected.
Weather Mapping the Internet to Prevent Losing Data in the Storm
If packet travel times are at the heart of network performance monitoring, it makes sense that anything interfering with packet movement — or outright swallowing packets — is bad for performance analysis. In our prior article, we touched on Border Gateway Protocol (BGP), but now we can spotlight the routing technology’s real issue: BGP can lose packet information and wreck traffic monitoring.
Most BGP issues stem from peering changes and route flapping. Peering is the direct exchange of data between two networks without having traffic carried by an intermediary. Peering can help control traffic performance and costs. However, when peering relationships change — after an equipment failure, for example — traffic can be greatly affected until other networks learn of the change and update accordingly.
Route flapping is most often caused by equipment misconfiguration: a router alternately advertises a destination network via one route and then another (or as unavailable, then available again) in quick sequence. This leaves routes flipping between up and down states in rapid succession, forcing recalculation by every router that hears the updates. Flapping hampers traffic and can result in packet loss.
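As an illustration — loosely following the route flap damping idea rather than Subspace’s or BGP’s actual algorithm — flapping can be flagged by counting up/down transitions inside a sliding time window:

```python
from collections import deque

def is_flapping(events, window_s=60.0, max_transitions=4):
    """Flag a route as flapping when the number of state transitions
    inside a sliding window exceeds max_transitions.
    events: time-ordered (timestamp_seconds, "up" | "down") pairs."""
    recent = deque()        # timestamps of recent transitions
    last_state = None
    flapping = False
    for ts, state in events:
        if last_state is not None and state != last_state:
            recent.append(ts)
        last_state = state
        # Drop transitions that have aged out of the window.
        while recent and ts - recent[0] > window_s:
            recent.popleft()
        if len(recent) > max_transitions:
            flapping = True
    return flapping

# A route bouncing every second trips the detector; a single
# clean failover an hour apart does not.
bouncing = [(i, "up" if i % 2 == 0 else "down") for i in range(10)]
print(is_flapping(bouncing))                      # True
print(is_flapping([(0.0, "up"), (3600.0, "down")]))  # False
```

Production flap damping (as in RFC 2439) uses an exponentially decaying penalty rather than a simple count, but the windowed-transition version above captures the core idea.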
Again, precision and measurement can come to the rescue. One way Subspace achieves this is with advanced packet sniffing. Data packets move through network stack layers within a node, are transmitted across the network until they reach the destination node, then move through that system or device’s stack. Sniffers are tools that capture, log, and analyze packet traffic. Packet capture (PCAP) analysis of sniffed traffic through Subspace’s many points of presence (PoPs) allows us to obtain telemetry measurements with excellent accuracy, but accuracy alone isn’t enough; we also need remarkable speed in measuring.
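As a small example of what PCAP analysis starts from, here is how the classic libpcap file header can be parsed in Python. The nanosecond-resolution magic number matters when you care about sub-50 µs accuracy. This parses only the standard, publicly documented file format; Subspace’s actual capture pipeline is not shown here.

```python
import struct

# Classic libpcap global header: magic, version major/minor,
# timezone offset, sigfigs, snap length, link-layer type.
PCAP_GLOBAL_HEADER = struct.Struct("<IHHiIII")

def parse_pcap_header(data: bytes) -> dict:
    magic, vmaj, vmin, _tz, _sigfigs, snaplen, linktype = \
        PCAP_GLOBAL_HEADER.unpack(data[:24])
    if magic == 0xA1B2C3D4:
        resolution = "microseconds"
    elif magic == 0xA1B23C4D:
        resolution = "nanoseconds"
    else:
        raise ValueError("not a little-endian pcap file")
    return {"version": (vmaj, vmin), "snaplen": snaplen,
            "linktype": linktype, "timestamp_resolution": resolution}

# Synthetic header: nanosecond-resolution pcap v2.4, Ethernet (linktype 1).
hdr = PCAP_GLOBAL_HEADER.pack(0xA1B23C4D, 2, 4, 0, 0, 65535, 1)
print(parse_pcap_header(hdr)["timestamp_resolution"])  # nanoseconds
```

Each captured packet then follows with its own per-packet header carrying a timestamp at the stated resolution — which is exactly the raw material a telemetry pipeline correlates across capture points.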
As mentioned earlier, Subspace achieves sub-millisecond measurements, which must be coordinated across the network in order to piece together a truly accurate “weather map” of real-time conditions. Some of the methods we use are proprietary, but some are public and deployed by others. Facebook offers a technical yet fascinating account of how their company uses Network Time Protocol (NTP) and chrony to achieve remarkably precise time measurement. Subspace combines this class of time precision with packet analysis to see how traffic is moving across the network, recognize changing conditions and patterns almost instantaneously, and respond with routing optimizations within the blink of an eye…or, if we’re being accurate, orders of magnitude faster.
Network Monitoring to Validate Traffic Flow
Just saying “we measure faster with better metrics” doesn’t convey enough, so let’s dig down a layer. We talked about measuring traffic with packet sniffing, but this comes with its own issues. Many smaller organizations do packet sniffing with “port mirroring,” which essentially intercepts packets en route between nodes and copies them onto a different system for analysis. (Clearly, there are security risks associated with this, and there are ways to mitigate those risks.) This creates a scaling problem. Port mirroring may work well when dealing with gigabytes of data, but at the backbone level, we may be talking about petabytes of copied data daily. Even if most of this data gets tossed out, just the compute load to sift through it all may be beyond the reach of most companies.
At Subspace, we use sampled flow (sFlow) for packet sampling. sFlow is implemented in ASICs embedded within routers and switches and can be set to regularly capture one out of every N packets flowing by. This significantly cuts down on data volume while providing temporal path mapping at real-time speeds. It allows Subspace to confirm traffic flow without the need for (or overhead of) direct port mirroring. With sFlow, we have better monitoring for optimizing and troubleshooting: current vs. best latency on links, traffic dips, circuit and server capacity, interface utilization, the network weather map, and queue depth.
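The 1-in-N idea can be sketched as follows. Note two simplifications: this uses a deterministic counter, whereas real sFlow agents randomize the skip count to avoid aliasing with periodic traffic, and the scaling helper is an illustrative estimator rather than the full sFlow standard.

```python
def sample_one_in_n(packets, n):
    """Keep one of every n packets (deterministic counter for clarity;
    sFlow agents randomize the skip to avoid sampling bias)."""
    return [pkt for i, pkt in enumerate(packets) if i % n == 0]

def estimate_total_bytes(sampled_sizes, n):
    """Scale sampled byte counts back up by the sampling rate."""
    return sum(sampled_sizes) * n

# 4,000 packets of 1,500 bytes, sampled at 1-in-400:
packets = [{"size": 1500}] * 4000
sampled = sample_one_in_n(packets, 400)
print(len(sampled))                                            # 10
print(estimate_total_bytes([p["size"] for p in sampled], 400)) # 6000000
```

The payoff is in the numbers: 10 sampled packets stand in for 4,000, yet the scaled estimate recovers the true 6 MB of traffic — which is why sampling scales to backbone volumes where full port mirroring cannot.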
Packet sampling is only one element of our routing platform, which is distributed throughout our global PoPs. Another is circuit-level routing. Ideally, we’d like to link communication circuits into the Subspace platform and have them instantly work at optimal levels — plug-and-play simplicity on steroids. In reality, there’s a lot of configuration involved, from performance functions to controlling egress circuit selection.
Detecting Path Shifts at Lightning-Fast Speeds
Because of our focus on precision measurement inputs, we can achieve lightning-fast response cycle times: sub-1 ms regionally, and sub-second detection of circuit failures. This lays the foundation for how Subspace performs multi-pathing, traffic engineering, load balancing, and much more. We can adapt instantaneously when a path’s performance shifts by as little as 50 µs. With continuous measurement snapshots, we can assess link latency, circuit and server capacity, interface utilization, and queue depth. This creates the framework for real-time optimization and real-time delivery.
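One simple way to flag such a shift — an illustrative sketch, not Subspace’s algorithm — is to track an exponentially weighted moving-average baseline per path and alert when a sample deviates from it by more than 50 µs:

```python
def detect_shift(samples_us, threshold_us=50.0, alpha=0.2):
    """Return the index of the first latency sample that deviates from
    an EWMA baseline by more than threshold_us, or None if stable."""
    baseline = samples_us[0]
    for i, s in enumerate(samples_us[1:], start=1):
        if abs(s - baseline) > threshold_us:
            return i  # path performance shifted here
        # Fold the new sample into the baseline.
        baseline = alpha * s + (1 - alpha) * baseline
    return None

# A path humming along near 120 µs, then jumping to ~300 µs:
path = [120.0, 122.0, 119.0, 121.0, 300.0, 305.0]
print(detect_shift(path))  # 4
```

A production system would add hysteresis and per-path thresholds so that a single noisy sample doesn’t trigger a reroute, but the structure — baseline, deviation test, react — is the same.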
The final key to our achievement is understanding that with great precision comes great responsibility…to get out of the way. As noted before, Subspace crafted a global analysis system able to assess all of this information and respond to it as needed in real-time. There is no time for humans to be in the routing decision process. This is what we mean by Subspace using a software-optimized rather than a human-optimized routing system. Our algorithms can achieve optimal path performance through up to 80% latency reduction. No matter how good their intentions or expertise, humans can’t even come close to this in real-time.
What Our Precision Measurement Means for Developers
Very little about Subspace is plug-and-play on the back end. We’ve invested countless thousands of programming hours and incredible resources in devising the world’s most robust, responsive, and reliable TCP/IP network for global-scale applications.
We did all this work so our customers wouldn’t have to. From the outside, Subspace is a plug-and-play optimization network. Our ability to predict path outages translates into massive reductions in lag and packet loss for every user. Imagine sitting down to an RTC hackathon, implementing our API call, and then taking home first place because your application used a network that trounced the performance of every competitor. It really is that easy…on the outside.
Subspace is unique in its ability to implement precision measurement, run those measurements through real-time analysis, and convert the results of those analyses into instant action for optimal network performance. To learn more about what Subspace does and how it gets done, read our network quality white paper here.