NDN over WebSockets == TCP over TCP
Named Data Networking (NDN) was first ported to the web browser environment in 2012. At that time, a browser-based JavaScript application could communicate with the Internet via either XMLHttpRequest or WebSocket. Feeling that WebSocket was a better match for the NDN implementation at the time, I wrote the initial code for a CCNx WebSocket proxy.
Web applications would connect to this proxy over TCP, negotiate a WebSocket connection, and send NDN packets in WebSocket frames. The proxy would then decapsulate these frames and deliver the NDN packets to the ccnd forwarder over TCP.
NDN-over-WebSockets survived multiple protocol changes over the years, and made its way into the NDN Forwarding Daemon (NFD). It worked fine for simple NDN web applications, such as status pages and text chat, and even file retrievals.
Recently, with the rise of video streaming on the NDN testbed, congestion control functionality started to show up in NDN libraries. Then a question popped into my mind: WebSockets run over TCP, and NDN congestion control algorithms are largely borrowed from TCP. Would this cause any problems?
TCP over TCP is considered a bad idea. When a packet loss occurs, both layers of TCP try to retransmit, which can eventually lead to a TCP meltdown. However, such a situation isn't guaranteed to occur: people have been running PPP-over-SSH VPNs for many years, and they just work.
When we run NDN over WebSockets or plain TCP, a similar situation can theoretically occur. Several NDN libraries implement a congestion control algorithm adapted from TCP CUBIC, which reacts to an NDN packet loss much as TCP reacts to an IP packet loss. If an IP packet is lost on the WebSocket link, the lower layer TCP transport tries to recover by retransmitting the TCP segment. Since TCP provides an in-order delivery service, it cannot deliver any WebSocket frames or NDN packets in later TCP segments to the NDN layer until the lost TCP segment has been recovered. If the additional delay for recovering the lost TCP segment is too large, the NDN congestion control algorithm sees an Interest timeout and starts its own Interest retransmissions. This leads to the same situation as a TCP meltdown.
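To make the head-of-line blocking argument concrete, here is a toy model in Python. It is a sketch, not a measurement: the 10 ms send interval, 20 ms one-way delay, 300 ms recovery delay, and 200 ms Interest timeout are all made-up values, and `delivery_times` and `late_packets` are hypothetical helpers.

```python
def delivery_times(send_times, one_way_delay, lost_index, recovery_delay, in_order):
    """Time at which each packet reaches the NDN layer (all values in ms).

    The packet at lost_index is lost once and arrives recovery_delay late
    (a transport-level or application-level retransmission).
    With in_order=True (TCP-like), later packets wait behind the lost one.
    With in_order=False (UDP-like), every packet is delivered on arrival.
    """
    delivered = []
    ready = 0.0  # earliest time the next in-order packet can be handed up
    for i, sent in enumerate(send_times):
        arrival = sent + one_way_delay
        if i == lost_index:
            arrival += recovery_delay
        if in_order:
            ready = max(ready, arrival)
            delivered.append(ready)
        else:
            delivered.append(arrival)
    return delivered


def late_packets(send_times, delivered, timeout):
    """Indices of packets whose delivery delay exceeds the timeout."""
    return [i for i, (s, d) in enumerate(zip(send_times, delivered)) if d - s > timeout]


# Illustrative parameters only (assumptions, not measured values).
send_times = [i * 10.0 for i in range(10)]
tcp_like = delivery_times(send_times, 20.0, lost_index=2, recovery_delay=300.0, in_order=True)
udp_like = delivery_times(send_times, 20.0, lost_index=2, recovery_delay=300.0, in_order=False)
late_tcp = late_packets(send_times, tcp_like, timeout=200.0)
late_udp = late_packets(send_times, udp_like, timeout=200.0)
print(late_tcp)
print(late_udp)
```

In this model, a single loss pushes eight packets past the timeout under in-order delivery, but only the lost packet itself when ordering is not enforced.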
Web UDP?
In 2017, an open letter "Public Demand for Web UDP" was posted. The main reason for wanting UDP in web browsers is that TCP's reliable and in-order delivery service causes latency spikes and congestion, hurting the user experience of web games. It would be beneficial to have an unreliable and unordered transport, such as UDP, for use in browser-based web games as well as other server-client communication use cases.
There is, in fact, an existing API to have UDP in a webapp: WebRTC. WebRTC is best known as a peer-to-peer protocol for transporting audio and video content, but it also supports DataChannel for carrying arbitrary application data, and nothing prevents a server-side application from acting as a WebRTC peer. However, implementing such a scheme is not for the faint of heart, because WebRTC is an extremely complex protocol and the WebRTC native codebase has more code than the Space Shuttle.
In June 2020, I discovered Pion, a simpler WebRTC stack written in the Go programming language. I tried its DataChannel demos and found them working very well. I thought about building the NDN-over-WebRTC proxy that I had always wanted.
I pitched this idea to the authors of Far Cry: Will CDNs Hear NDN's Call?, who were building the iViSA video streaming application at the time, and offered to write the proxy program if it was a worthwhile direction. However, I was told that they had already performed a comparison between TCP and UDP: the video player in the browser connected via WebSocket to an NFD instance on localhost (where packet loss is not expected to occur), and that NFD connected to the global NDN testbed via either TCP or UDP. No difference in video playback quality was noticed during their trial.
A few months later, in November 2020, I read about QuicTransport, an experimental feature in the Google Chrome browser that allows UDP-like unreliable datagram communication in webapps. Its client-side API is straightforward, and the server-side demo code has only 268 lines. Over a weekend, I implemented a transport module in NDNts and a corresponding server-side NDN-QUIC gateway based on the aioquic Python library. Then I deployed one instance of the NDN-QUIC gateway on a small VPS in Canada and modified a small webapp to use it.
"NDN push-ups" over QUIC
In my last article "The Reality of NDN Video Streaming", I described my "NDN push-ups" site and how it helped me collect video streaming quality of experience statistics from the real world. In February 2021, I started experimenting with QUIC transport on this website.
NDN-QUIC Gateway Deployment
I deployed four NDN-QUIC gateways around the world:
gateway | location | VPS provider | NFD | connecting to testbed router | RTT |
---|---|---|---|---|---|
AMS | Amsterdam, Netherlands | MaxKVM | yes | Queen's University Belfast, UK | 24 ms |
LAX | Los Angeles, USA | VirMach | yes | University of Memphis, USA | 56 ms |
NRT | Tokyo, Japan | Oracle Cloud | yes | Waseda University, JP | 3 ms |
YUL | Beauharnois, Canada | Gullo's Hosting | no | Northeastern University, USA | 13 ms |
Three gateways (AMS, LAX, and NRT) are deployed on KVM servers and configured similarly:
There is a local NFD running in a Docker container.
- The container is permitted to use up to 10% of a CPU core and 768MB of memory.
- NFD Content Store capacity is set to 98304 entries.
- These limits were never reached during the experiment period.
The NFD connects to one statically configured NDN testbed router over UDP.
- The chosen routers aren't necessarily the nearest ones.
- Instead, I did some "traffic engineering" and factored in their proximity to my video repository servers.
- The video repository has two replicas, attached at the Northeastern University and Waseda University routers.
- NRT and YUL each connect to the same testbed router as a video repository replica.
- AMS and LAX each connect to a testbed router 1 NDN-hop from Northeastern University.
The NDN-QUIC gateway Python script is running as a systemd service on the host machine.
- For each QUIC connection from the browser, the script creates a UDP face toward the NFD, and then translates between QUIC datagrams and UDP packets.
- If an NDN packet received over UDP is larger than the MTU of the QUIC connection, the script can also perform NDNLPv2 fragmentation.
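The fragmentation step can be pictured with the following Python sketch. It is only an illustration of splitting an oversized packet into indexed pieces and joining them again: the `Fragment` class and both functions are hypothetical, the field names merely echo the FragIndex and FragCount fields of NDNLPv2, and none of this is the actual NDNLPv2 wire encoding or the gateway's real code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Fragment:
    frag_index: int   # position of this piece within the original packet
    frag_count: int   # total number of pieces
    payload: bytes


def fragment(packet: bytes, max_payload: int) -> List[Fragment]:
    """Split a packet into pieces of at most max_payload bytes each."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    chunks = [packet[i:i + max_payload] for i in range(0, len(packet), max_payload)]
    if not chunks:
        chunks = [b""]  # an empty packet still yields one fragment
    return [Fragment(i, len(chunks), chunk) for i, chunk in enumerate(chunks)]


def reassemble(fragments: List[Fragment]) -> bytes:
    """Join fragments back together; they may arrive in any order."""
    ordered = sorted(fragments, key=lambda f: f.frag_index)
    if not ordered or len(ordered) != ordered[0].frag_count:
        raise ValueError("missing fragments")
    return b"".join(f.payload for f in ordered)


# Example: a 2560-byte packet and an assumed 1200-byte payload budget.
packet = bytes(range(256)) * 10
pieces = fragment(packet, max_payload=1200)
restored = reassemble(list(reversed(pieces)))
```

A real implementation would also need per-packet sequence numbers and a timeout for incomplete reassembly, since QUIC datagrams can be lost or reordered.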
The "YUL" gateway is deployed on an OpenVZ container with only 256MB of memory. It does not have a local NFD, but directly proxies every packet to the testbed router.
The "RTT" column is measured with TCP traceroute to port 6363 of the testbed router, using this command:
sudo traceroute -T -p 6363 DESTINATION
NDN-QUIC Gateway Selection
The logic of selecting an NDN-QUIC gateway is simple. Inspired by the NDN-FCH service, I made a script on Cloudflare Workers:
Cloudflare determines the location of requesting user using IP geolocation.
- It is provided to the worker script in the request.cf property.
- The free plan provides country-level geolocation only, as an ISO 3166 country code.
Since I only have a handful of servers, I use a lookup table to map the country code to a continent.
Then, the worker script selects an NDN-QUIC gateway based on the continent:
- AMS serves Africa and Europe.
- NRT serves Asia.
- LAX serves all other regions including Americas and Oceania.
- YUL is never selected by default due to its limited capacity and lack of local caching, but a curious viewer could choose this (or any other) gateway in the webapp.
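The selection rules above can be sketched in a few lines. The real script runs on Cloudflare Workers, so this Python version is only an illustration: `select_gateway` is a hypothetical name, and the country-to-continent table is a heavily abbreviated stand-in for the full lookup table.

```python
# Abbreviated, illustrative subset of a country-to-continent lookup table.
CONTINENT_OF = {
    "ZA": "AF", "NL": "EU", "DE": "EU",
    "JP": "AS", "CN": "AS", "IN": "AS",
    "US": "NA", "CA": "NA", "BR": "SA", "AU": "OC",
}

# Continents with a dedicated gateway; everything else goes to LAX.
GATEWAY_OF = {"AF": "AMS", "EU": "AMS", "AS": "NRT"}
DEFAULT_GATEWAY = "LAX"


def select_gateway(country_code: str) -> str:
    """Map an ISO 3166 country code to a default NDN-QUIC gateway."""
    continent = CONTINENT_OF.get(country_code.upper())
    return GATEWAY_OF.get(continent, DEFAULT_GATEWAY)


print(select_gateway("NL"))
print(select_gateway("JP"))
print(select_gateway("BR"))
```

Unknown country codes fall through to LAX, which matches the "all other regions" rule. YUL never appears in the table because it is only reachable by manual choice.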
Then I modified the webapp:
If the viewer is using the Chrome browser with the experimental QuicTransport feature, the webapp would attempt a QUIC connection before falling back to WebSockets.
- The fallback to WebSockets occurs only if the QUIC connection cannot be established. As long as the QUIC connection is established successfully, the webapp would not try WebSockets, even if it fails to fetch content via QUIC.
- While the NDNts @ndn/autoconfig package has a speed test feature to choose the fastest connection among several WebSockets, this feature isn't currently usable for QUIC connections.
If the viewer is using another browser, or a Chrome version that doesn't support QuicTransport, the webapp would only use WebSockets.
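The fallback behavior can be summarized as follows. This is a Python sketch of the decision logic only, not the webapp's actual code: `open_transport` is a hypothetical function, and the two connect callables stand in for whatever the browser-side transports do.

```python
def open_transport(supports_quic, connect_quic, connect_websocket):
    """Try QUIC first when available; fall back only if connecting fails.

    Returns a (kind, connection) tuple. Once QUIC is connected, later fetch
    failures do not trigger a WebSocket attempt, mirroring the rule above.
    """
    if supports_quic:
        try:
            return "quic", connect_quic()
        except ConnectionError:
            pass  # QUIC could not be established: fall back to WebSockets
    return "websocket", connect_websocket()


# Stub connectors for demonstration (assumptions, not real transports).
def failing_quic():
    raise ConnectionError("QUIC handshake failed")


kind_ok, _ = open_transport(True, lambda: "quic-conn", lambda: "ws-conn")
kind_fallback, _ = open_transport(True, failing_quic, lambda: "ws-conn")
kind_unsupported, _ = open_transport(False, failing_quic, lambda: "ws-conn")
```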
Viewer Locations and Counts
I collected statistics between 2021-02-12 and 2021-03-13. The number of video playback sessions from each continent is presented in the table below:
user continent | QUIC gateway | QUIC sessions | WebSocket sessions | failed sessions |
---|---|---|---|---|
Africa (AF) | AMS | 13 | 7 | 0 |
Antarctica (AN) | LAX | 0 | 0 | 0 |
Asia (AS) | NRT | 95 | 85 | 10 |
Europe (EU) | AMS | 78 | 88 | 2 |
North America (NA) | LAX | 77 | 64 | 2 |
Oceania (OC) | LAX | 8 | 4 | 0 |
South America (SA) | LAX | 5 | 3 | 0 |
The "QUIC gateway" column in the above table indicates the default choice in the worker script. There were a few changes in the gateway selection logic during the month:
- NRT wasn't deployed until 2021-03-02. Before that, Asian viewers were mostly served by AMS.
- Starting 2021-03-07, the worker script returns a QUIC gateway with 50% probability. This funnels some viewers to WebSockets even if their browser is capable of QUIC transport, so that I can collect comparison data on WebSocket connections.
The actual gateway selections and user continents are shown in the next table:
QUIC gateway | AF | AN | AS | EU | NA | OC | SA |
---|---|---|---|---|---|---|---|
AMS | 13 | 0 | 49 | 74 | 1 | 0 | 0 |
LAX | 0 | 0 | 3 | 0 | 73 | 7 | 5 |
NRT | 0 | 0 | 42 | 1 | 1 | 0 | 0 |
YUL | 0 | 0 | 1 | 3 | 2 | 1 | 0 |
Video Resolution
As described in the last article, my video application is based on Shaka Player, an adaptive video player that automatically selects a video resolution best suited for the estimated bandwidth. Every 5 seconds, the web application collects video playback statistics, and reports to an HTTP-based beacon server. During the collection period between 2021-02-12 and 2021-03-13, there were 4624 video playback log entries, representing 23120 seconds (6.4 hours) of total playback time.
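As a quick sanity check on the totals, the playback time follows directly from the number of log entries and the 5-second reporting interval. The variable names below are just for illustration.

```python
LOG_INTERVAL_SECONDS = 5   # one log entry per 5 seconds of playback
log_entries = 4624

total_seconds = log_entries * LOG_INTERVAL_SECONDS
total_hours = total_seconds / 3600

print(total_seconds, round(total_hours, 1))  # prints "23120 6.4"
```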
I analyzed these logs to see the video resolution selected by Shaka Player in every 5-second interval. 1080p resolution is folded into 720p, because the "NDN push-ups" site doesn't have 1080p content, and there was only 1 log entry at 1080p (from my other video site) during the month. The first chart is arranged by video playback time, i.e. how many seconds into playing a title:
The second chart is categorized by user continent, obtained from MaxMind GeoLite2 database:
Both charts display WebSocket (WS) and QUIC transports separately.
If the browser never established a testbed connection during a playback session, that session is excluded from these charts.
However, as long as a connection was eventually established, time spent waiting for initial data arrival is included as null resolution.
From these charts, we can see:
- Using QUIC transport instead of WebSockets significantly increases the chance of receiving high resolution (720p) content.
- European viewers were able to get 720p 42% of the time over WebSockets, and 60% of the time over QUIC.
- Similar improvements were seen in North America and Asia, to a smaller extent.
- There aren't sufficient samples to draw a conclusion in other continents.
Startup Delay
Startup delay is a critical metric in the quality of experience of video streaming, because a lower startup delay means the viewer waits less time after pressing the start button to see my awesome push-ups. The next chart shows the cumulative distribution function (CDF) of startup delay in the top three continents, separated by WebSocket transport and QUIC transport.
We can see that using QUIC transport instead of WebSockets significantly decreases the startup delay:
continent | WebSockets median | QUIC median | difference |
---|---|---|---|
Europe | 2275 ms (86 samples) | 599 ms (94 samples) | -1676 ms |
North America | 1583 ms (58 samples) | 1056 ms (90 samples) | -527 ms |
Asia | 3571 ms (73 samples) | 1721 ms (96 samples) | -1850 ms |
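The "difference" column, and the relative reduction it implies, can be recomputed from the medians in the table. This short Python check uses only the numbers above; the relative reduction is an extra derived figure, not something reported by the charts.

```python
# (WebSockets median, QUIC median) in milliseconds, copied from the table.
medians_ms = {
    "Europe": (2275, 599),
    "North America": (1583, 1056),
    "Asia": (3571, 1721),
}

differences = {}
reductions_pct = {}
for continent, (ws, quic) in medians_ms.items():
    differences[continent] = quic - ws                    # negative means QUIC is faster
    reductions_pct[continent] = 100 * (ws - quic) / ws    # relative to WebSockets

for continent in medians_ms:
    print(f"{continent}: {differences[continent]} ms, "
          f"{reductions_pct[continent]:.0f}% lower median")
```

By this measure, the median startup delay over QUIC is roughly three quarters lower in Europe, a third lower in North America, and half lower in Asia.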
The Case of an Asian NDN-QUIC Gateway
At the beginning of my experiment, I didn't have an NDN-QUIC gateway in Asia. Asian viewers were dispatched to the AMS gateway in Europe.
I read a Chinese forum post on 2021-03-02, stating that there had been an IPv4 routing change between major Chinese broadband providers and Oracle Cloud Tokyo, such that the IP routing became direct instead of going through USA. I checked online ping measurements and saw that most of China could ping Tokyo in less than 80ms, so I decided to deploy the NRT gateway in Tokyo, Japan.
This chart shows the video resolution in 5-second intervals experienced by Asian viewers. Statistics from mainland China and India (the top two regions) are presented separately from the rest of Asia (Indonesia, Singapore, Vietnam, etc). It includes viewers using QUIC transport and connected to either the AMS gateway or the NRT gateway, as well as Indian viewers connecting to the Mumbai testbed router via WebSockets.
The chart shows mixed results when switching from AMS to NRT:
- Chinese viewers were getting much better video resolutions.
- Indian viewers had a worse experience: while the percentage of 720p increased, they also spent more time watching a blurry 240p picture.
- The rest of Asia generally received an improvement.
A potential reason for the worse experience in India is that India isn't closer to NRT than to AMS. Geographically, India's largest city Mumbai is roughly equidistant from Tokyo and Amsterdam. Network-wise, two of India's largest broadband providers have better connectivity to the Netherlands than to Japan:
- Reliance Jio looking glass, from MUM-NLD-02:
- To AMS: public peering at AMS-IX, RTT 121ms.
- To NRT: transit via Telstra in Hong Kong and Tata Communications in Singapore, RTT 226ms.
- Airtel looking glass, from Mumbai GPX1:
- To AMS: peer route via Hurricane Electric in London, RTT 123ms.
- To NRT: transit via Tata Communications in London, RTT 355ms.
Nevertheless, when compared to a WebSocket-based NDN router located domestically within India, QUIC transport still managed to provide a similar level of experience, despite the physical and network distance.
Meanwhile in the United States
One of the conclusions in the last article was that, to achieve high resolution, it is necessary to connect to a router near the viewer. There are five NDN testbed routers capable of accepting WebSocket connections over TLS in different regions of the United States, but only one NDN-QUIC gateway on the west coast. How do they compare?
The next chart includes only sessions from the United States in which the viewer's IP address has city-level accuracy in the MaxMind GeoLite2 database and the viewer connected to a router located in the United States. In this bubble chart:
- Horizontal position indicates the geographical distance between the viewer and the NDN-QUIC gateway or NDN-WebSockets router, truncated to 100 km accuracy.
- Vertical category represents the video resolution.
- Bubble size is logarithmically proportional to the duration spent playing at this resolution, by a viewer at this distance.
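The two chart coordinates can be sketched as follows. This is an illustration under assumptions, not the script behind the chart: the great-circle (haversine) formula is one common way to compute geographical distance, `math.log1p` is just one possible logarithmic size scale, and all function names and the sample coordinates (approximate Los Angeles and New York) are made up for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def truncate_to_100km(distance_km):
    """Truncate (not round) a distance to a multiple of 100 km."""
    return int(distance_km // 100) * 100


def bubble_size(duration_seconds, scale=1.0):
    """Logarithmic size scale; log1p keeps a zero duration at size zero."""
    return scale * math.log1p(duration_seconds)


# Approximate coordinates, for illustration only.
distance = haversine_km(34.05, -118.24, 40.71, -74.01)
bucket = truncate_to_100km(distance)
print(bucket, bubble_size(300))
```

With a logarithmic scale, a viewer who watched ten times longer produces a visibly larger bubble without dwarfing every other point on the chart.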
As expected, the average distance to an NDN-QUIC gateway is greater than the distance to a WebSocket-enabled router, because my NDN-QUIC gateway network is considerably smaller than the global NDN testbed. However, the greater distance caused only a minor reduction in video resolution.
Conclusion
This article describes my recent experiments comparing UDP-based QUIC transport with TCP-based WebSocket transport for NDN video streaming on the "NDN push-ups" website. Using real-world data collected during February and March 2021, I analyzed quality of experience metrics such as video resolution and startup latency, which revealed that QUIC transport generally performed better than WebSockets in this application.
If QuicTransport graduates from experimental status in the Chrome browser and becomes available in other browsers, I would recommend that NDN web applications use QUIC transport instead of WebSockets. Additionally, web applications would benefit from a wider deployment of NDN-QUIC gateways, such as including them as a standard feature of the global NDN testbed.
Although this is not a scientific publication, raw data and scripts in this article are available as a GitHub Gist. If you find this article interesting, please do a few push-ups in my honor, cheers!