Real improvement requires better measurement
Imagine trying to assess your car’s fuel efficiency by measuring its tire pressure. Is it a relevant metric? Sure…kind of. But it fails to capture the essence of what needs to be known.
Similarly, when it comes to internet performance, most developers are baffled by the delta between what their dashboards show and real-life performance, because their service providers use the wrong metrics. Bad metrics mask real performance problems that, in turn, severely limit application scaling, especially across large geographies. Only by isolating the right metrics and using them to pursue increased network precision can performance be massively improved so that applications can reach their full potential.
What are the standard approaches to network precision?
The tire-pressure example is deliberately absurd to make a point, but the conventional network metrics used by service providers, while appearing technically apt, can be equally misleading. Following are some of the main culprits.
Time to first byte (TTFB). When a client browser sends an HTTP request, the server responds with a sequence of data, but there’s no time restriction on when all of that data must arrive. TTFB only indicates how long it took for the first byte of the response to reach the browser. That’s like asking a teenager to do the dishes and gauging task performance by how quickly you hear “OK.” TTFB fails to capture the real performance concern, which is how long it takes for the page to become usable.
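To see why TTFB alone is misleading, here’s a minimal sketch (Python standard library only, with a placeholder URL) that times the first byte separately from the full response; on large or slow pages the two numbers can differ dramatically.

```python
import time
import urllib.request

# Placeholder URL for illustration only.
url = "https://example.com/"

start = time.monotonic()
with urllib.request.urlopen(url) as response:
    response.read(1)                     # first byte arrives
    ttfb = time.monotonic() - start
    response.read()                      # rest of the body
    total = time.monotonic() - start

print(f"TTFB:       {ttfb * 1000:.1f} ms")
print(f"Full fetch: {total * 1000:.1f} ms")
```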
TCP round-trip time (RTT). When a network node sends a TCP data packet, the receiving device sends back an acknowledgement that the packet arrived. If no acknowledgement comes, the node resends the packet, which is why TCP is a great protocol for ensuring data integrity. However, with the many different load balancers and proxies in play across modern networks, TCP RTT usually measures only one of the many segments that comprise a connection.
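A rough illustration of that limitation: the time to complete a TCP handshake approximates one round trip, but only to whatever device terminates the connection, which on a modern network is often a load balancer or proxy rather than the origin server. A minimal sketch, with a placeholder host:

```python
import socket
import time

# Placeholder host and port; the handshake may terminate at a load balancer
# or proxy, so this measures only that first segment of the connection.
host, port = "example.com", 443

ip = socket.gethostbyname(host)          # resolve DNS separately so it isn't counted
start = time.monotonic()
sock = socket.create_connection((ip, port), timeout=5)
handshake = time.monotonic() - start
sock.close()

# One SYN / SYN-ACK exchange is roughly one round trip to whatever answered.
print(f"TCP handshake (~1 RTT to the terminating device): {handshake * 1000:.1f} ms")
```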
Network Time Protocol (NTP). NTP is a protocol meant to help synchronize computer clocks across networks, and synchronization is the heart of precision. You might well imagine the need for precise synchronization between numerous robotic assemblers on a high-speed manufacturing line, or for one system to record a financial transaction a split second before another event is allowed to happen elsewhere. NTP excels at maintaining sub-millisecond synchronization across a LAN. On the internet, though, sync errors frequently exceed 100 ms; overall, NTP variances fall “somewhere between 100ms and 250ms.” With real-time applications, that much discrepancy can mean that by the time a measurement arrives, it’s too late to be useful. This wide spectrum of NTP service quality is one reason some devices won’t sync over NTP at all.
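For a sense of what an NTP exchange actually measures, here’s a minimal SNTP-style query. The pool server is a placeholder, and production clients use full NTP daemons with multiple samples and filtering; this sketch simply estimates local clock offset and round-trip delay from the standard four timestamps.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800   # seconds between 1900-01-01 (NTP) and 1970-01-01 (Unix)
server = "pool.ntp.org"         # placeholder public pool server

packet = bytearray(48)
packet[0] = 0x1B                # LI = 0, version = 3, mode = 3 (client request)

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
    sock.settimeout(5)
    t0 = time.time()            # client transmit time
    sock.sendto(packet, (server, 123))
    data, _ = sock.recvfrom(48)
    t3 = time.time()            # client receive time

def to_unix(seconds, fraction):
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32

# Server receive (t1) and transmit (t2) timestamps sit at byte offsets 32 and 40.
t1 = to_unix(*struct.unpack("!II", data[32:40]))
t2 = to_unix(*struct.unpack("!II", data[40:48]))

offset = ((t1 - t0) + (t2 - t3)) / 2    # estimated local clock error
delay = (t3 - t0) - (t2 - t1)           # network round-trip delay

print(f"Clock offset: {offset * 1000:+.1f} ms, round-trip delay: {delay * 1000:.1f} ms")
```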
Real-time Transport Control Protocol (RTCP). RTCP carries flow-control and statistics reports back to a sender, which can respond by modifying its bitrate or adding forward error correction (FEC). However, those reports stack up: if every network node is amassing information on every other node, the storage and reporting volume grow rapidly with the size of the group. That, in turn, can delay RTCP reports by minutes to hours, effectively making the metric useless for conveying real-time conditions. For this reason, RTCP metrics often aren’t captured at all; the volume is too high for them to be enabled by default.
These metrics all share a common element: They don’t reflect real-time needs or differentiate between fault monitoring and performance monitoring.
Google Spanner: One answer to imprecision
Given the problems noted above, the lack of an absolute clock authority against which all clocks can sync makes it impossible to order events efficiently in a large, distributed system. The order of events matters. Who shot first, Han or Greedo? Flame wars have scorched the world over events occurring out of order. Somewhat similarly, things can fall apart when distributed databases lack precisely synchronized time to govern a serialized (first #1, then #2, etc.) order of events. Serializability is essential.
Google’s Spanner is a database platform that behaves with near-perfect global synchronization. With a global database, different regions must act with absolute coordination, including in their data replication and fault tolerance, even across many thousands of miles. Google solved the problem by putting GPS receivers and atomic clocks in its data centers. These devices constantly coordinate their time values with a master server in each location, which then coordinates with other master servers. Google calls this system TrueTime.
Google gives an upper bound of 7 ms for sync variance between global nodes. To maintain proper event ordering, Google makes a node wait 7 ms to report that a transaction has been committed. No subsequent transaction can commit earlier than this. The system is “waiting out the uncertainty.”
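TrueTime’s API is internal to Google, but the “waiting out the uncertainty” idea can be sketched. The interval-returning now() and the 7 ms bound below are hypothetical stand-ins for illustration, not Google’s implementation.

```python
import time
from dataclasses import dataclass

UNCERTAINTY = 0.007   # hypothetical 7 ms bound on clock uncertainty

@dataclass
class TTInterval:
    earliest: float   # the true time is no earlier than this
    latest: float     # the true time is no later than this

def tt_now() -> TTInterval:
    """Stand-in for a TrueTime-style now(): an interval guaranteed to contain the true time."""
    t = time.time()
    return TTInterval(t - UNCERTAINTY, t + UNCERTAINTY)

def commit(apply_writes):
    # Pick a timestamp at or after every node's current clock reading.
    commit_ts = tt_now().latest
    apply_writes(commit_ts)
    # Commit wait: don't report success until commit_ts is certainly in the past,
    # so any transaction that starts after we return gets a strictly larger timestamp.
    while tt_now().earliest <= commit_ts:
        time.sleep(0.001)
    return commit_ts

# Example: commit a no-op transaction and print its timestamp.
print("committed at", commit(lambda ts: None))
```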
Spanner is a broadly adopted example of the scalability that can come from truly reliable measurement applied to network functionality. Google focuses TrueTime and Spanner on global-scale, managed relational database platforms, especially online transaction processing (OLTP) systems that rely on split-second accuracy. However, precision is also needed elsewhere, such as in the IoT space and in gaming, where split-second errors can mean the difference between in-game life and death.
Perhaps most of all, Spanner illustrates that network precision can be achieved on a global scale, and that the benefits can change industries. When we increase the measurement accuracy of networks to sub-millisecond levels, organizations can react and adapt better than ever before.
How you measure defines how you improve
Google lit a fire under the internet in 2010 when the company announced that speed would be a ranking signal for desktop searches. In July 2018, Google did the same for mobile searches. In short, the speed of a page directly impacts its ranking and thus the site’s success.
Google uses tools such as Lighthouse to weigh a range of variables in assessing total page performance. Among these are Speed Index and Time to Interactive (TTI); Google considers a page interactive once it can respond to user interactions within 50 ms. (TTI comprises 15% of the Lighthouse performance score.) Google also weighs TTFB server response times.
For pages that can be cached, Google’s system works well. Without caching, though, all that’s being assessed is the first instant of the page’s response. Every subsequent instant is ignored. As we saw above in the TTFB discussion, this is a poor approach to measuring ongoing real-time traffic.
In a similar vein, look at Netflix’s ISP Speed Index. The sole metric that matters here is Mbps throughput from the ISP to the user, which serves as a proxy for performance between the user and the nearest Netflix caching server. However, assessing only throughput to the client node ignores instability, latency, or any other factor that impairs the user experience (such as mid-show buffering or downsampling of video and/or audio streams). Most Netflix users have experienced “try again later” messages; for those failed show requests, the ISP Speed Index is the tire-pressure metric all over again.
Prioritizing TTFB can make for a poor real-time content stream. Prioritizing bandwidth will ignore launch issues and may result in crippling retry storms. Real-time apps need a predictable bitrate, throughput, and packet interval so that, in multi-user scenarios such as games, all participants stay in sync.
Provider unreliability can appear in surprising ways
Lack of proper measurement can lead to a host of misconceptions and bad decisions. In the absence of good performance data, you might place a server rack wherever the office space is cheapest. However, server placement matters because all users in a city don’t have the same performance experience. Different providers route packets in different ways; Comcast in Atlanta may deliver entirely different performance than Charter’s Spectrum in Atlanta. It’s quite possible, and common, for a server 1,000 km away to provide faster ping times than one that’s 100 km away.
Bad statistics contribute to bad decisions. To illustrate, keep in mind that a CDN’s performance relies heavily on the local ISP’s routing performance. When a CDN says “90% of connections have excellent bitrates,” that means 10% do not, and that 10% is likely outside the CDN’s direct control. Still, it’s the end result that matters: in a game with 10 users, on average one of them will have a bad experience, and in gaming, one player’s bad session degrades the experience for everyone.
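A quick back-of-the-envelope check, assuming each player’s connection quality is independent:

```python
# Back-of-the-envelope math, assuming connection quality is independent per player.
players = 10
p_excellent = 0.90

expected_bad = players * (1 - p_excellent)   # average number of impaired players
p_all_good = p_excellent ** players          # chance the whole lobby is clean

print(f"Expected impaired players per 10-player session: {expected_bad:.1f}")
print(f"Probability every player has an excellent connection: {p_all_good:.0%}")
# Roughly 1 impaired player on average, and only about 35% of sessions are clean end to end.
```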
Measurement matters because it’s the only way to get to the heart of what’s causing those internet slow-downs. Once the problem is understood and can be consistently and accurately measured, it can be remedied. The fact that providers keep recreating the same problems indicates that they’re not measuring the right things in the right ways. Because Subspace approaches metrics and precision differently, our implementation allows us to write a software-based solution to most performance issues in short order. It’s a wholly different approach to network optimization.
Subspace addresses the underlying problem
Subspace has patented methods for synchronizing users and making sure that all traffic paths are excellent worldwide. The specifics are proprietary, but the bottom line is that Subspace assesses network conditions at the millisecond level, a feat that is impractical, if not impossible, for most network providers. Subspace relies on software rather than humans to fix issues because humans are slow and cautious about making network changes, and when network trouble strikes, there’s no time to deliberate over remedies. Fortunately, because Subspace measures and reacts within milliseconds, we can learn whether our predictions had negative results and move to a better solution before anyone even notices a change.
Such a brief description makes what we do sound far easier than it is. To illustrate, keep in mind that packet round trips from node to server and back are usually not symmetrical. It’s not like driving to the store and back, unless you drive a kilometer to get there and six to get home. This became important when we worked to improve performance between Atlanta, Georgia, and Ashburn, Virginia, near Washington, D.C. We saw 60 ms ping times, which were bafflingly high. It turned out the trip took 13 ms to Ashburn but 47 ms back to Atlanta. The outbound path met performance goals, so we didn’t need to touch it, but we had to build a fix to improve the return. Once that fix was in place, all traffic was suddenly hitting 13 ms each way.
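For intuition, here is the same arithmetic laid out as a toy example. Note that actually splitting an RTT into one-way delays requires send and receive timestamps from clocks synchronized far more tightly than the delays being measured, which circles back to the NTP problem discussed earlier.

```python
# Illustration only, using the numbers from the Atlanta/Ashburn example.
# Each timestamp is read from the clock of the city where the event happens.
t_sent_atlanta = 0.000   # probe leaves Atlanta          (Atlanta clock)
t_recv_ashburn = 0.013   # probe arrives in Ashburn      (Ashburn clock)
t_sent_ashburn = 0.013   # reply leaves Ashburn          (Ashburn clock)
t_recv_atlanta = 0.060   # reply arrives back in Atlanta (Atlanta clock)

forward = t_recv_ashburn - t_sent_atlanta   # 13 ms
reverse = t_recv_atlanta - t_sent_ashburn   # 47 ms
rtt = t_recv_atlanta - t_sent_atlanta       # 60 ms: the only number ping reports

print(f"forward {forward * 1000:.0f} ms, reverse {reverse * 1000:.0f} ms, RTT {rtt * 1000:.0f} ms")
```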
Now imagine having to address this sort of optimization for every major network path everywhere, from Boise to Berlin to Bangalore. Mathematically, this is overwhelming, which is why no one has ever done it before. The math has existed for over a decade, but no one has been able to engineer it on a practical level — until Subspace. Without proprietary IP, it would require about 10,000 servers per continent to do all this network computation.
Subspace views networks almost like quantum physics: You can never know a network state for sure. You can multiply the probabilities of X, Y, and Z factors to derive an end measurement, but by the time you derive that value, the event is already in the past. (This relates closely to the problem with trying to sync groups via NTP.) The network is constantly changing, so the less accurate our timing measurements are, the wider the error range in every individual action we take. Smaller measurement intervals make for more accurate, effective actions. It’s like how you can reasonably measure the 5K finishing times of random adults with a sundial, but you need a stopwatch when you move to high school track athletes. Switch to Olympians, and you need laser-based timing systems. Measurement accuracy must scale with performance.
With our techniques and algorithms, Subspace is building a network that appears to never fail. When a fiber cable gets backhoed, users never perceive an issue because Subspace finds an alternative path within milliseconds. Subspace does all this computation and sub-millisecond path optimization so that clients don’t have to. They simply integrate their application(s) with our API, which only takes a few minutes, and get back to work.
Demand the right answers for your applications
The old metrics of internet performance measurement are, at best, partial truths. At worst, they mask significant performance issues that impair user experience and limit application scaling. Can a provider truly say that it’s putting in its best effort to optimize the internet for real-time traffic if it’s not even measuring it?
Developers and providers need new, better metrics. As hyper-scale clouds flourish and applications seek to serve users around the world in synchronization, now is the time to ask hard questions. What exactly is being measured? Are those measurements leading to precision and performance equivalent to that delivered by Subspace? If not, maybe something should be done about that.