EPISODE 1892 [INTRODUCTION] [0:00:01] ANNOUNCER: China's Great Firewall, or GFW, is often spoken about but is rarely understood. It is one of the most sophisticated and opaque censorship systems on the planet, and it shapes how over a billion people interact with the global internet, influences the design of privacy and proxy tools worldwide, and continues to evolve in ways that challenge researchers, developers, and policy makers alike. Jackson Sippe is a PhD researcher at the University of Colorado Boulder, whose work focuses on uncovering how national-scale censorship systems operate. Jackson recently helped to lead a groundbreaking study analyzing a previously undocumented GFW technique that quietly broke fully encrypted proxy protocols across China for more than a year. In this episode, Jackson joins Gregor Vand to discuss how the Great Firewall works at a technical level, the 2021 to 2023 blocking event, the popcount-based detection algorithm his team reverse engineered, the cat-and-mouse ecosystem of censorship circumvention, and what these findings mean for the future of the open internet. Gregor Vand is a security-focused technologist, having previously been a CTO across cybersecurity, cyber insurance, and general software engineering companies. He is based in Singapore and can be found via his profile at vand.hk or on LinkedIn. [INTERVIEW] [0:01:38] GV: Hello, and welcome to Software Engineering Daily. My guest today is Jackson Sippe. [0:01:42] JS: Hey, happy to be here. [0:01:44] GV: Yeah. We're going to be getting into a pretty interesting topic with Jackson today around what is called the GFW, so the Great Firewall. And we'll get into what even that is. But also, Jackson's done a lot of research into how it operates and actually sort of how it operates through the years, because it's been changing as well. Before we get into all that, as we like to do, maybe just sort of, Jackson, what is your background? How did you get into this kind of research in the first place? [0:02:13] JS: Yeah, sure. Yeah, I am currently a PhD student at the University of Colorado Boulder, and I've been here for about five years now. I'm advised by Dr. Eric Wustrow, who's been doing this work for a ways longer than that. And as I first started my PhD, I got involved with an organization called the GFW Report that has focused on censorship within China. And so through working with them, I've done a number of different projects sort of exploring the relationship between China and censorship. [0:02:46] GV: Awesome. Maybe let's just get into what is - I always kind of get the acronym mixed up because I think there should be a C in there somewhere, because it's sort of the great Chinese firewall perhaps. But yeah, maybe what is the GFW? [0:03:00] JS: The GFW, or the Great Firewall as it's known, tends to refer to the censorship mechanism in China. That acronym has sort of expanded as it's gained popularity. Sometimes people refer to Iran's GFW. So we often refer to it as China's GFW now. And it's a collection of different techniques and deployments that are spread around China in order to filter Chinese citizens' access to the internet. It affects a number of different protocols. We see it on DNS traffic, on TLS traffic, on QUIC traffic these days. And it's also been used almost as a weapon against organizations or services that haven't complied with the censorship desires of the nation.
So, one of my favorite facets of the GFW is a tool called the Great Cannon, which is sort of uncommon, something that not many people have heard of, but it was this technique that they deployed in 2015 against GitHub actually. So, GitHub was allowing for proxies to sort of run through GitHub Pages, I think it was. And the idea was that China couldn't get them to comply with, "Hey, we don't want you to be this proxy source." And so what they did instead is they looked at all of the HTTP traffic, unencrypted traffic, that they could see going to and from China, and they would inject JavaScript into that traffic that would then make a request to GitHub, right? Effectively, they designed the largest denial of service attack ever by injecting this JavaScript into HTTP traffic that went across the GFW. [0:04:43] GV: Wow. Okay. I think I had heard of GitHub playing a part somewhere. But yeah, that's new to me as well. I guess looking at the fact that you've done a lot of research in this space. Again, before we get into what the research sort of turned up, why - again, just for the audience. Some people in the audience may have been to China, may understand roughly what happens from the internet perspective. But equally, I think, let's just assume zero knowledge. Why is this kind of research difficult in the first place? [0:05:14] JS: Sure. The goal of the GFW is to restrict Chinese citizens' access, specifically, to information outside of China that they deem undesirable. This might be Western entertainment that maybe goes against the goals of the CCP, the Chinese Communist Party. And what we really struggle with is finding ways to detect that censorship without being citizens located in China, right? One of the biggest kind of logistical issues we have is getting access to vantage points, right? Places in China that we can connect to and then send requests that may or may not get blocked by the GFW. Additionally, because we aren't part of the CCP or the Chinese government, we don't necessarily know when we've reached ground truth, right? At the end of the day, this is sort of a black-box system. And we can, to some extent, only speculate, right? It takes a number of experiments to really determine whether or not what we have observed is what we think it is, or just some other effect of the network. [0:06:25] GV: Okay. Before we kind of get into what the research was, there's also a kind of like a time frame thing here, which is helpful. So something kind of happened in - I believe it was November 2021. Let's start there. What happened then? Maybe did that kind of almost kick off why the research was then undertaken, or you can help us out? [0:06:47] JS: Absolutely. Yeah, sure. I think that the incident you're referring to in November 2021 was the deployment of a new technique by the GFW to detect and then block fully encrypted protocols. And I think to understand that, we need to sort of get back to how Chinese citizens have responded to the GFW, right? Which is there are plenty of people in China who disagree with it, right? They would like to be able to access maybe whatever is latest on Netflix or something else. And so they will use proxies in order to circumvent this censorship and connect out to whatever content they otherwise wouldn't be able to access, right? And these proxies use a number of different techniques to be censorship resistant. Right?
If you think about a proxy that you've accessed before or maybe like a VPN that you've connected to before, that traffic, it's obvious, right? We can look at it in tools like Wireshark, right? Or observe it on the network somehow and see that that is clearly a VPN connection, right? Or some type of standard proxy connection. The GFW has designed ways of detecting and blocking that, right? And so proxy developers kind of had to get more creative, and they had to develop protocols that would be less obvious to some type of network observer like the GFW. And these tools took two different forms. The first of which being mimicry, right? We are going to try and make our proxy look as close as possible to all the other traffic on the internet. And if you aren't familiar, almost all the traffic on the internet looks like TLS because everybody's doing something that involves a website, right? And so the first thing they did is tried to mimic TLS, right? And look as close to normal TLS as possible. But with TLS kind of having sharp edges, there were always these issues that would exist. And the censor could then block that traffic because they would see that, say, the encryption scheme was wrong, right? They picked the wrong cipher suite, something like that. And so this led to a different approach. If we can't look exactly like everything else, let's try to look like nothing. Let's try and blend into the background. And that led to these protocols that we refer to as fully encrypted. And a fully encrypted protocol involves some type of key exchange between the client and the server. And then after that key exchange has occurred, every single byte that is sent from the client to the server and vice versa is encrypted. There's no protocol header. There's no exchange of information about size or anything like that, right? It is just going to be encrypted bytes as the payload of a TCP connection. And the idea was, "Okay, this just sort of blends in. There's nothing to worry about. They won't be looking for this type of traffic." But these tools became pretty prolific. They became the standard, right? And every proxy had sort of implemented their own version of this fully encrypted protocol. And so in November of 2021, users, all of a sudden, just couldn't access their proxies, right? And it wasn't just one, right? It wasn't just one particular implementation, but it was widespread. It was all of these different proxy developers, right? Shadowsocks and V2Ray are two names that are pretty popular in the Chinese censorship circumvention space. And both of their services had sort of fallen over as a result of this attack. And this was alarming because this had been a very stable approach for years. And users were trying to understand what had happened. And there was no clear answer of, "Oh, this one developer has decided to pull the site," things like that. [0:10:25] GV: Yeah, this is very interesting. Because, on the layman side, I do remember there was just like a point in time when - I was living in Hong Kong. Yeah, I guess in November 2021, and beforehand. And there was this just sort of watershed moment where people started saying, "Oh, my VPN just doesn't work in China anymore." And it was sort of make sure you get maybe the "correct VPN" before you get there, because you've got no chance of getting it whilst you're there, that kind of thing. It kind of became this question of which one works.
And some people - I've even seen like posts today where I think people still go into mainland China and haven't fully anticipated the kind of tools, if you want to call it that, that they'll need to just be able to access things the way that they would hope to be accessing them. [0:11:13] JS: Absolutely. [0:11:14] GV: Yeah. Great. Let's move to the research that you and your colleagues did. I think this is like fairly groundbreaking research into sort of how this all worked. Yeah. Let's just talk through that. I mean, I believe it was a sort of six-month setup that you had to put in place. I mean, let's kind of start there. Where did the research begin, and how did you go about it? [0:11:37] JS: Yeah. As I said, I've worked extensively with this organization called GFW Report. They were sort of the leads on this project. They were the first people to be notified of what was going on. And there was no explanation to start, right? It was sort of this widespread event where no one knew exactly what tools were working, what tools weren't. And it started with effectively just seeing, "Okay, which of these tools have all of a sudden stopped working? And what techniques were they using?" Right? And that led to this realization that, okay, it is in fact these encrypted protocols that seem to be the issue, right? If we send a regular TLS connection or TLS-style proxy connection, there doesn't seem to be a problem, but the fully encrypted traffic is an issue. And so what we ended up doing is setting up a number of VPSes in China, primarily in the Tencent and Baidu clouds. And then we set up some sink servers in the US, right? We had a number of different university partners, Boulder, as well as the University of Maryland, and the University of Massachusetts Amherst, and we set up servers at all of these universities that would just sort of listen to any TCP traffic that came to them. Nothing particularly interesting on that end. But what it allowed us to do is we could now send traffic from inside of China outside to the US, right? Crossing that border and effectively triggering any censorship that may occur there. And while this sort of set up our architecture, we then sort of had to go into, "Okay, how are we going to determine what exactly is fully encrypted traffic?" Right? This is a hard question. And so what is encrypted traffic, and how did we figure it out? Well, the first thing we did is started sending, in these TCP payloads, just random bytes, right? See that this is getting blocked. Okay, that makes sense. What if we send all zero bytes or all one bytes, right? And I'm sitting in my advisor's office late at night. We're trying to figure this situation out, and we see that certain bytes would get by when others wouldn't when we just sent a long string of the same byte. And, ultimately, what we found was that they were determining the entropy of the payload based off of the number of bits that had been set in the total payload. Right? They were doing a count, what we refer to as a popcount, on each byte. And if it was roughly a 50/50 split of bits set to zero and one, then they would consider that to be high entropy traffic and then they would ultimately block those connections. [0:14:16] GV: So we're going to get into popcount shortly. Just kind of just sticking on, I guess, the setup here, were there any kind of risks with this? I don't know.
Did you ever think that, effectively within China, something would get triggered, and sort of there might be any risks associated with this, or is that just something that comes along with the research, and you have to just accept that that may be the case? [0:14:40] JS: Yeah, it's a great question. The risks with this particular work were relatively low in our sort of threat model, if you will. We try to reduce the amount of information that we leave on these servers, right? We assume that at any point in time they could get flagged as being owned or controlled by us and just shut down, and we'd lose access to whatever was on them. We are very careful in how we access those machines, making sure that all the data that we're collecting is removed quickly so that we don't, of course, have setbacks, as well as reducing what may be accessible to an adversarial nation like China were they to conduct forensics on one of these servers. But aside from the sort of technical risks there, there's also kind of these risks to the people doing the research. Right? In this particular work, again, there wasn't much of a concern. We were just doing this sort of observation of what was occurring. But in some of our other work, we have really struggled with these concerns and even struggled to have people accept the research that we have done due to the legal risks associated with it. Some people refer to some of our work as being an offensive attack against these services that are deployed in China. [0:15:51] GV: Which is very interesting because, at least from what you described so far, you're simply allowing traffic to leave China and go, in this case, I guess, into the US. There is no sort of threat data there. It's not like you're doing a kind of reverse pen test or something. It's just information. [0:16:10] JS: Exactly. You're exactly right in this case. And now, some of our other research, one particular attack, we found a memory leak in their DNS censorship that allowed us to leak vast amounts of memory off of these servers. That was one of the attacks that we were maybe a little more concerned about. But it sort of goes to show how just observing this traffic or sending traffic back and forth isn't something to be as concerned about. [0:16:36] GV: Yeah. You mentioned popcount. That is an interesting concept as you've just touched on. It refers to sort of the density or the bit density in the payloads. But let's talk a bit more about that because it seems like that was such a critical kind of heuristic for how the GFW now operates. [0:16:58] JS: Yeah, absolutely. Popcount is kind of a technical term for, or an implementation of, Hamming weight, right? With Hamming weight, we're concerned with like the number of nonzero symbols in some type of data. And with popcount, we are just counting the number of nonzero bits in some form of data, right? For each byte, we simply count, "Okay, how many bits have been set?" And there are eight bits in a byte. If we see four of those bits set, we can assume that this is relatively high entropy, right? The closer we are to 50%, the higher the likelihood that this is a packet containing high entropy and likely encrypted. Encrypted data will cluster around that number, right? Because if we have to flip a coin eight times, we would expect four heads and four tails. Same thing applies when setting bits in a byte. And ultimately, this sort of speaks to how the GFW wants to execute this censorship, right? They are doing the best they can to come up with crude but effective techniques. Right?
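The bit counting Jackson describes is cheap to express in code. Here is a minimal Rust sketch of the per-byte popcount average, showing how plaintext and random-looking bytes separate; this is purely illustrative, since the GFW's actual implementation has never been published, and the xorshift generator below simply stands in for ciphertext.

```rust
/// Average number of set bits per byte: the payload's "popcount density".
/// Encrypted or random data clusters around 4.0, like eight coin flips.
fn avg_popcount_per_byte(payload: &[u8]) -> f64 {
    if payload.is_empty() {
        return 0.0;
    }
    let set_bits: u32 = payload.iter().map(|b| b.count_ones()).sum();
    set_bits as f64 / payload.len() as f64
}

fn main() {
    // ASCII always has the top bit clear, so text tends to sit below 4.0.
    let plaintext = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n";

    // xorshift32 output stands in for an encrypted payload.
    let mut x: u32 = 0x1234_5678;
    let random_ish: Vec<u8> = (0..256)
        .map(|_| {
            x ^= x << 13;
            x ^= x >> 17;
            x ^= x << 5;
            (x & 0xff) as u8
        })
        .collect();

    println!("plaintext:  {:.2} bits/byte", avg_popcount_per_byte(plaintext));
    println!("random-ish: {:.2} bits/byte", avg_popcount_per_byte(&random_ish));
}
```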
So this is a relatively simple thing to do, right? We're just counting bits. But that's the sort of technique that they have to deploy in order to stay relatively optimized and keep up with the vast amounts of traffic that they are constantly ingesting. [0:18:16] GV: Okay. And just to back up for a second, the sort of overall thing that was kind of discovered here was the idea that rather than selectively blocking specific things, it's kind of the other way around now, which is that everything is blocked unless, I guess in this case, the bit density can be determined to fall outside a very specific range. Is that right? For it to be exempted? [0:18:43] JS: Yes. It's funny. The range was exactly 3.4 to 4.6 bits set per byte, on average. Within that threshold, they would consider that traffic to be high entropy, and they would ultimately block it. How they came up with this number, we're not exactly sure, right? That was just in our tests. That's what we observed as the exact threshold at which they would consider it to be high entropy. [0:19:05] GV: Right. And so then, for the GFW to sort of decide that this data is random and should be blocked, effectively, as you touched on, it's data that clusters around the 4.0 density. Is that correct? [0:19:18] JS: Exactly. Exactly. Count all the bits that you have in the packet and see how many are set to one. Divide the number set to one by the total number of bytes, right? And if that value shows up between 3.4 and 4.6, then you can say, "Okay, this is clearly an encrypted payload, and we're going to block it. Otherwise, we allow it." [0:19:40] GV: Gotcha. Yeah. Then there are kind of like other, I guess, exemption rules that I think you discovered as well around ASCII and protocol fingerprinting. Maybe if we could just like dive into those as well. [0:19:54] JS: Absolutely. And so I think what's important to understand here is that that entropy count actually came last, right? We refer to the rules as exemptions, right? They're trying to find a way to exempt traffic from blocking. Sort of like you said, block everything and then decide what not to block. And when we think about popcount, we have to look at every single bit in that payload in order to conduct that entropy test, right? And that can be considerably resource-intensive when you're considering the scale of traffic they're observing. They tried to identify other tests that they could run first to sort of filter off that traffic. And there were two different ways that they did this. The first being that they would look for large amounts of ASCII. If they saw a continuous string of ASCII characters, printable ASCII characters, they would exempt that traffic. If it started with ASCII characters, they would exempt that traffic. And that was effectively just saying, "Okay, this is unlikely to be encrypted data, right? Even though we don't know what this protocol is, we can allow it, right?" The other approach they took to filter off traffic before they've reached that entropy measurement was through protocol fingerprinting. It was really a crude protocol fingerprint. But effectively, they chose a handful of protocols that they knew they saw a lot of. For example, TLS. Looked at what the first few bytes of a TLS connection were. And then if they saw those bytes at the start of any packet, they would say this isn't going to be considered a fully encrypted packet, right? And that did two things.
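Taken together, the rules form a block-by-default pipeline. A rough Rust sketch of that reverse-engineered logic follows; the prefix lengths and fingerprint bytes here are illustrative stand-ins for the handful of rules the study identified, not the GFW's exact rule set.

```rust
/// Cheap exemption checks, run before the expensive popcount test.
/// Rule shapes follow the study's findings; exact constants are illustrative.
fn is_exempt(payload: &[u8]) -> bool {
    let printable = |b: &u8| (0x20..=0x7e).contains(b);

    // ASCII exemptions: a printable prefix, or a long printable run,
    // suggests an unencrypted plaintext protocol.
    if payload.len() >= 6 && payload[..6].iter().all(printable) {
        return true;
    }
    if payload.windows(20).any(|w| w.iter().all(printable)) {
        return true;
    }

    // Crude protocol fingerprints: the first bytes of a TLS record,
    // or an HTTP method, mark this as a known protocol.
    if payload.len() >= 2 && payload[0] == 0x16 && payload[1] == 0x03 {
        return true;
    }
    let http_methods: [&[u8]; 4] = [b"GET ", b"POST ", b"HEAD ", b"PUT "];
    http_methods.iter().any(|m| payload.starts_with(m))
}

/// Block everything that is not exempt and whose average popcount falls
/// inside the measured high-entropy window of 3.4 to 4.6 bits per byte.
fn would_block(payload: &[u8]) -> bool {
    if payload.is_empty() || is_exempt(payload) {
        return false;
    }
    let avg = payload.iter().map(|b| b.count_ones() as f64).sum::<f64>()
        / payload.len() as f64;
    avg > 3.4 && avg < 4.6
}

fn main() {
    assert!(!would_block(b"GET /search?q=rust HTTP/1.1\r\n")); // HTTP: exempt
    assert!(!would_block(&[0x16, 0x03, 0x01, 0x02, 0x00])); // TLS-like: exempt
    assert!(would_block(&[0x8f, 0x13, 0xd9, 0x01, 0xa7, 0x44, 0x6e, 0xce])); // random-looking: blocked
}
```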
In the case of TLS, you're able to filter off the vast majority of your traffic. In the measurements that we conducted here in that regard, about 80% of traffic will get exempted right there just by looking for TLS fingerprints. But in addition to that, if you think about a TLS packet, we've got a couple of header bytes at the top, and then we have an encrypted payload. If you aren't filtering off that traffic, you will have an insanely high false positive rate, because so much of the internet is, in fact, encrypted; it's just encrypted in a way that we're okay with. Right? As long as it's TLS, that's not a problem. Or as long as it's SSH, that's not a problem. [0:22:12] GV: Yeah. I wanted to then touch on false positives, effectively a sort of - you call it sort of collateral damage to legitimate traffic, if you like. I think there was, I believe, in the research estimated around - was it 0.6% of the normal internet traffic would be false positives? How did you, I guess, arrive at that validation? [0:22:35] JS: Absolutely. The false positive rate we concluded was 0.6%. You're exactly right. And the way that we did that is by taking these rules that we had determined and testing them against benign traffic that we had here at the University of Colorado. We are fortunate to have a resource here that allows us to see all of the network traffic going to and from the university. And we rolled out these rules and basically looked at the traffic that would or wouldn't have been blocked if this censorship was conducted here at CU. And what that allowed us to do is say, "Okay, no one at a university in the United States is going to be using censorship circumvention tools or proxies, right? Because there's no blocking occurring. Why would they go to the trouble?" And so we could sort of infer that this may be the false positive rate. And that allowed us to identify a couple of other exemptions that they were doing. We took some of the bytes that we observed in those packets that would have gotten blocked here at CU. And we found some other TLS resumption headers that they were allowlisting. And we found that the large majority of the traffic that would have gotten blocked here at CU ultimately belonged to torrent services. Those specific protocols for torrenting. And so we believe that there may have been some reason that they still wanted to block that traffic anyway. Their false positive rate might have been seen as even lower based off of that. [0:24:03] GV: Yeah. Very interesting. I guess that's sort of an interesting piece of doing it within a university. There's obviously just a lot of traffic generally. Yeah. I mean, was there any sort of - if you were to do this again, for example, would there be any benefit to doing it in a sort of more siloed environment, if you like? Or do you think that would not really affect the results? [0:24:25] JS: Do you mean for calculating the false positive rate? [0:24:27] GV: Yeah, exactly. [0:24:28] JS: Well, I don't think so. And the reason being is that having access to this sort of network that has just a large amount of random traffic should be, in theory, representative of the type of traffic that would be coming and going across that national border, right? And so our hope is that we're modeling that as closely as possible. [0:24:47] GV: Yeah, that's a really good point. Okay. If we look at the idea of popcount manipulation, I think I'm going to let you take us through this. I mean, you are a programmer. And I believe a lot of this was done in Rust.
We've got a lot of Rust listeners as well on the podcast. Could you maybe just take us through sort of what even is popcount manipulation and why was that part of the research? [0:25:13] JS: Absolutely. Once we had sort of determined, "Okay, this is how they're doing this blocking," what can we do to help proxy developers get around this blocking, right? This is one of the most common techniques for censorship circumvention. We've got to get it working again. And our solution was: we knew that they didn't have some fancy metric for determining whether or not it was encrypted, right? It was just this popcount. What if we stuff the payload full of additional bits, right? We can check the popcount of our own packet that's about to go across the network. And if we see that that payload is highly entropic, which of course it's going to be because it's encrypted, we can instead add some ones or some zeros, depending on which way we need to go, and get outside of that encrypted threshold range. Right? And so that was one of the more sophisticated techniques that we deployed in order to stop this censorship from being effective: we would add bits to the payload using a scheme based off of the key for the encryption so that it was pseudo-random, right? There wasn't just a bunch of ones at the beginning or a bunch of zeros at the beginning or something. And then we would add a few bytes at the end that would tell you exactly how many bits you need to remove, right? And so then, when the server or the other host receives this packet, they can decrypt how many bits they need to remove, and then they can determine which bits are supposed to be removed based off of that key. And ultimately, this got implemented in the Shadowsocks Rust implementation, as well as the Shadowsocks Android implementation, and has been working to this day, I believe. [0:27:02] GV: Wow. And what kind of performance hit or overhead would you say this adds, if any? [0:27:09] JS: Sure. We estimate that the overhead is about 17%. If you assume a worst-case scenario where you have a popcount of exactly four, right? Meaning we have four zeros and four ones in every single byte on average, what we need to do is add just enough bits. In this case, we'd be adding ones to exceed the popcount of 4.6. And what that works out to is about a 17.6% overhead, which we find to be very tolerable, right? Given the overhead that's already required for doing this sort of multi-layered proxy, we believe that a 17% overhead is something that you could apply to every single packet, and not necessarily just the first few packets of the connection, to get around the censorship. [0:27:57] GV: Shadowsocks Rust, to your knowledge, is this being used within VPNs, commercially available VPNs? Or how has this sort of been taken up beyond the sort of research stage? [0:28:11] JS: Yeah. Shadowsocks Android is an app that you can download, right? And you can use as a client, right? Shadowsocks Rust, they provide a command-line implementation, but it's also designed as a library that can be added into other VPNs. Off the top of my head, I don't know the names of the commercial ones that do use it, but it is heavily used by more user-friendly applications. [0:28:37] GV: Yeah, this is something I'm pretty sure I'm going to go and do some extra research on, where certain - I wouldn't sort of name names, but certain VPNs have said, "Oh, we've got this new protocol. And it's faster, better, can get through anything else that might not have been working."
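A toy Rust sketch of that padding idea, under deliberately simplified assumptions: the filler is plain 0xff bytes and the pad length travels in the clear as two trailing bytes. The actual Shadowsocks patch is more careful, deriving the filler bits from the encryption key so that the padding itself looks pseudo-random.

```rust
/// Toy popcount manipulation: append all-ones filler bytes until the average
/// popcount escapes the blocked (3.4, 4.6) window, then record the pad length
/// so the receiver can strip it. Illustrative only; the real scheme spreads
/// key-derived pseudo-random bits through the payload instead.
fn pad_out_of_blocked_range(mut payload: Vec<u8>) -> Vec<u8> {
    let avg = |p: &[u8]| -> f64 {
        p.iter().map(|b| b.count_ones() as f64).sum::<f64>() / p.len() as f64
    };
    let mut pad: u16 = 0; // u16 is plenty for packet-sized payloads
    while !payload.is_empty() && avg(&payload[..]) > 3.4 && avg(&payload[..]) < 4.6 {
        payload.push(0xff); // each filler byte adds eight set bits, raising the average
        pad += 1;
    }
    payload.extend_from_slice(&pad.to_be_bytes());
    payload
}

fn main() {
    // Stand-in "ciphertext" averaging exactly 4.0 set bits per byte,
    // the worst case Jackson describes.
    let ciphertext = vec![0b1010_1010u8; 1000];
    let padded = pad_out_of_blocked_range(ciphertext);
    println!("1000 bytes in, {} bytes out", padded.len()); // ~17.6% overhead
}
```

The 17.6% figure falls straight out of the arithmetic: starting from n bytes at exactly 4.0 set bits each, k all-ones filler bytes give an average of (4n + 8k) / (n + k), which reaches 4.6 when k/n = 0.6/3.4, roughly 0.176.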
I'm curious if maybe this is what they've actually been referring to building into their commercial products. That's a little note for me to go on afterwards. I think you also came up with other sort of circumvention methods. One of them sounds very simple, but there must be a lot more to it: adding an HTTP or TLS header. How does that even work? Yeah. [0:29:19] JS: You're exactly right. It is as simple as it sounds. And so we found that there were these exemptions, basically byte strings, right? That would be the first four bytes of a TLS connection, right? And so we said, "Okay, can we just add those four bytes to the start of a fully encrypted or a fully random payload?" And the answer was yes. And we were like, "Whoa. Well, that was really simple." And so what we ultimately ended up doing is we provided this information immediately to the proxy developers, right? We said, "Hey, guys, we're working on this. We're still not exactly sure how it works, but we do know that if you add these couple of bytes to the start of every payload, your traffic can get through no problem." And so we sort of used this as an emergency solution, right? A little band-aid to give out that would allow people to get their services back up and running while we developed the kind of more sophisticated approach behind the popcount manipulation. [0:30:14] GV: I mean, is there any evidence that the GFW is actively probing suspected proxies like this or - [0:30:21] JS: Absolutely. Active probing was a technique that was around long before this particular incident, and it's one of the primary concerns of proxy servers, right? The GFW will see a connection to an IP address, right? And they'll be like, "That connection looks a little weird. We think it might be this proxy implementation. But we don't want to go block an IP address willy-nilly. So what we're going to do is we're going to make a couple of connections to that IP address and see if we can get it to talk to us in a way that is representative of one of these proxies." Right? And so this is a common technique that they were using. And actually, they were doing active probing with sort of this entropy check prior to the passive detection that was going on in this paper. And I do think that that's an important distinction to make: this was a passive attack where they were just blocking any traffic that met these parameters, and it wasn't affecting specific IPs. There were subnets that were specifically targeted, but it wasn't like, "This specific IP is a proxy, and we're not going to allow traffic to it." [0:31:29] GV: I mean, is there any sort of defense against this, or is this just kind of part of the landscape, and it's going to be - for anyone that's like trying to do this, is this just sort of a cat and mouse thing? [0:31:42] JS: Cat and mouse is the term that we frequently use, right? It's, how are we going to get just one step ahead this week? And there are a couple of different defenses or strategies that we can use to prevent active probing. There are techniques trying to behave the same for every single connection, right? You might try to have a standard web server, right? And it will receive standard TLS connections, standard HTTP requests, right? It will respond to those. But when it gets special connections, proxy connections, it will handle those differently. And the hope is that by using some type of cryptographic key that's sent by the client, the censor will never be able to determine that it is in fact a proxy, right?
They won't have access to that key in order to get that host to respond in a specific way. There are other approaches, in a kind of similar vein, which use some sort of other application as a front, right? NaïveProxy is known to basically use Chromium as its sort of front application, and then the proxy sort of runs on top of Chromium, right? Everything looks like it's a standard Google Chrome TLS connection. But in reality, it's a tunnel for NaïveProxy. The third approach that we also use here, and that has been a large chunk of our research, is refraction networking. And in this case, what we do is we say, "Okay, you can send a connection to any IP in this range that we can observe." Right? Say, the University of Colorado's IP range. And using this network tap that we have access to, we'll just look for connections that look like they belong to us. And when we see one, we'll pick that connection up, we will start our proxy tunnel with that connection and redirect all that traffic to its intended destination, right? And why that's so advantageous and sort of defeats the cat and mouse is now the censor has to decide, "Okay, in order to block this proxy, we have to block every single IP address associated with this American university." And so we really raised the stakes of the false positives associated with blocking that sort of traffic. [0:33:53] GV: Interesting. Okay. If we look at deployment, once this was kind of all discovered, if you like, did you have to do responsible disclosure? And who do you even make that to? And how does that then, I guess, filter through into sort of actual non-academic users, I guess? [0:34:15] JS: Yeah, sure. Responsible disclosure is a funny subject to bring up when you're attacking a system like the GFW, right? There are kind of two sides to that coin, right? We are disclosing our findings to the circumvention tool developers. But then there's also this idea of, "Well, do we disclose something to the GFW? Is that something that needs to happen?" And in this particular case, no, right? We were just reverse-engineering this technique that they had. And helping them figure out what we've learned is not usually something that we're particularly interested in doing. But we do work closely with the circumvention tool developers, right? Like I said, in regards to the sort of prepending of bytes that seems very simple, that was information that went out very early in this process. January of 2022 is when we ultimately sort of found that and started giving that information out to these developers so they could start patching their tools as early as possible. And that sort of gets into what you were asking about how we get this out of the academic world and into the hands of people on the ground. And it's those same relationships, right? Where we already work closely with the developers of these tools. And when we find something out, we can let them know before it goes public, and they can try and get it to their users as quickly as possible. [0:35:33] GV: Yeah, I do find the whole, yeah, as you say, responsible disclosure. And then thinking, would you sort of disclose this to, what, the CCP? It's a strange one to sort of think about. That said, do you have any understanding that this research has been taken in on that side and any lines of communication around it, or sort of implicit or explicit changes that you've seen as a result? [0:35:59] JS: With this particular research, no. There wasn't really anything we told them, right?
And there wasn't really anything that we could observe. We'll get into some more of the timeline, I think, in a little bit. But on other works, like I mentioned, we had this offensive attack on their DNS injector. And we did try to disclose that to them. We sent emails to CNCERT. That was who we decided was the best point of contact. We never heard back. But what we did find is that there was patching of this vulnerability after we had sort of made this disclosure. Right? So we do believe that they, to some extent, are listening and, to some extent, are paying attention to what we're doing, but it's still that black box where you don't have 100% confirmation. [0:36:45] GV: And yeah, as you alluded to, there's sort of another timeline marker here, which I think is March 2023, when the GFW actually stops this dynamic blocking. Could you talk to us about what you understand from that? [0:37:00] JS: Yeah. Again, this is one of those parts where we can speculate a lot. And I can give you some of my thoughts on why, but the facts are we don't know, right? What we do know is that the blocking began on November 6th of 2021, and ended on March 15th of 2023. In that time frame, there were a number of political things going on in China. Xi Jinping was running for reelection in the beginning of March of 2023. I believe he was reelected two or three days before the blocking stopped. And we've seen that pattern occur not just in China but in a number of other countries, where when politically sensitive events are occurring, they will sort of ramp up their censorship. And then when those events sort of settle down, they tend to turn off their censorship or ramp it back down. What's interesting with China is, usually, when they find a technique that they like, they stick with it. They keep it going kind of ongoing. And so what was different about this technique is unclear. We continue to do longitudinal measurements to see if this censorship has been turned back on. We still haven't seen any evidence that it has been re-enabled. But we can only speculate that, potentially, it was too computationally intensive or that the vantage points required were too valuable to be using for this technique. [0:38:19] GV: Yeah. And I'm thinking of the dates again. 2021 and then 2023. I mean, do you think this intertwines with politics, as you've touched on? But the pandemic, that was obviously quite a sort of sensitive time for China from an information standpoint. People often refer to Hong Kong and the lockdowns - I was living in Hong Kong at the time. And I can say categorically, there were no lockdowns in Hong Kong, but there were for sure lockdowns in China, sort of around being able to leave your house and so forth. And I think it was quite sensitive as to people within China being able to see what the rest of the world was doing, especially maybe towards what the rest of the world would call the end of the pandemic, because that seemed to be kind of the worst part for China. And as you say, it was ramping up into a re-election, if you want to call it that, of Xi Jinping. Anything there that you think could be related as well? [0:39:16] JS: I think it's very possible. Again, that's really one of the things that makes this research the most challenging, though: you never really know. You don't get copies of the memos sent around the government that say, "Hey, turn on this censorship on this day for this reason." Right? And so, you're exactly right.
It could have been certain events going on at sort of the tail end of the pandemic that were causing the censorship to ramp up. It could have been the political timing. It's really hard to say. [0:39:44] GV: Yeah. China in a nutshell. [0:39:46] JS: Exactly. [0:39:47] GV: Yeah. And today, here we are in 2025. And just what is the current status? Is it the same, effectively, that it's off? Or have you seen anything change? [0:39:56] JS: Yeah, in regards to this specific technique, we haven't seen any activity, right? That's not to say the GFW is turned off, right? This is one small piece of the GFW. And they're still deploying a number of other techniques. But the idea of passively observing and blocking fully encrypted traffic, we are not seeing any signs of today. [0:40:18] GV: Looking at just speculation on how this was actually operating, even though, as you've just touched on, this specific part is probably turned off. But general GFW architecture and just a speculation on how that all works, what do you think it is? [0:40:35] JS: Yeah. From a technical perspective, there are a couple of different vantage points that they could have access to, right? And the two that sort of apply here are what we would refer to as an in-path sensor or vantage point versus an on-path vantage point, right? And an in-path vantage point sits quite literally in the path. It is one of the hops that your traffic takes between the two hosts, right? And what's special about that vantage point is that it has the ability to tamper with that connection, right? It could change the payload, or it could go so far as to just start dropping the packets. And because of the blocking technique that we observed in this research, we believe that they were using these in-path vantage points, because they were just dropping the packets afterwards. The other side of that is what's called an on-path vantage point, right? This is a less sophisticated or less valuable vantage point where they get a copy of all of your traffic, and they can even send packets back to the hosts, but they aren't sitting in the path where they can manipulate and potentially drop that traffic, right? And it's an important distinction because those in-path vantage points, if something goes wrong, that could lead to internet outages, right? And you could have major issues there. I believe that those are potentially sensitive assets, and they are very careful about what they allow to run on those. Most of the censorship that we observe ongoing tends to be more of an on-path attack. [0:42:04] GV: And do you think there's any ML, machine learning, kind of going on there, or is it kind of more of a static, I guess, implementation? [0:42:13] JS: Yeah. I think that for specifically what we observed, they had to have deterministic code that was running, right? They had if statements that were checking those bytes and checking that entropy. Because ML inference is just too slow to be doing at line rate. However, it could be possible that they were using ML to try and determine some of those heuristics, right? It's hard to say. Though we have confirmed that the GFW is using machine learning in other aspects. There's been a massive leak of information in the past couple of months coming out of a Chinese corporation called Geedge and a research lab called MESA. And we've found evidence in those repositories that they have been using machine learning to classify data and determine which traffic is coming from a circumvention tool and which is not.
However, I think that most of that research is being used in those active attacks, right? Those attacks where they're making the connections. Because to try and do it passively at line rate is just unrealistic. [0:43:19] GV: Gotcha. This is something that kind of comes up in my line of work - again, I sit in Singapore. And so I'm working currently with products that do get used within China. And people sometimes come to me and say, "Oh, do you think there's going to be a problem with this team using -" when I say problem, I mean a performance or just access problem. With this team using the product within China? I give a very kind of, at this point, layman answer. The last time I went into mainland China is probably 10 years ago at this point. I used to go quite often, and then just stopped. But my understanding is always just that there is this massive slowdown, massive latency, around traffic that goes in and out of the perimeter, if you want to call it that. And that's usually my kind of, again, standard piece of advice: it's just whether this is something that has infra based within the GFW. That's usually why people strike up deals. I believe Cloudflare has a product that's kind of like a deal where they've had to JV, joint venture, with a Chinese partner. So that's a way to enable your product to get over this latency. I think you sent me a little piece on this before we chatted today, and it's called the Great Bottleneck. Yeah, maybe could you just talk to us about that? Because I think I would also love to learn a little bit more around what this actually is, rather than me just saying, "Oh, there's just a latency problem, and that's what you should assume." [0:44:43] JS: Sure. We've got to get more creative with our naming. [0:44:47] GV: The GB something. Yeah. Exactly. [0:44:49] JS: Yeah. Yeah. Exactly. It's always got to be "the Great" something. [0:44:53] GV: Yeah. [0:44:53] JS: But yeah, so this has been an issue for, it seems like, a long time. And it's an issue that affects more people than just the citizens of China who are trying to access things they aren't supposed to, right? The Great Bottleneck is this sort of widespread idea that traffic going outside of China is going to be slow, or it's going to have some type of network issue, right? And the research that you're referring to, I think, found that it was primarily on the download side, right? When we are pulling traffic from other places, we have an issue, or pulling data from outside of the country in, we have an issue. The upload doesn't seem to be as big of an issue. I think that this isn't necessarily a censorship technique, right? I think that if it was a result of the censorship that we've observed, if they didn't have the compute resources to effectively censor this traffic, we would see it more on that upload side. Right? When that HTTP request gets made, that's when you would see the slowdown, because that's the traffic that you are trying to observe for potential censorship things. And so I think it is truly just a lack of international infrastructure in China. I think the reason that they are doing that is because they sort of have this goal of creating this isolated ecosystem, right? They want their users or their citizens to primarily be accessing Chinese-based sources, right? Chinese tools, Chinese services, what have you.
And so I think that even though they could probably afford to spend the money to improve that international infrastructure, build some more cables, undersea cables or what have you, they are choosing not to, almost as a way to incentivize both companies and users to use national services. And so, yeah, I don't think it's necessarily malicious, but it's more pushing towards this goal of isolation. [0:46:48] GV: Yeah, the cloud providers are sort of another proxy battle, I guess, where you've got like Alibaba Cloud and others within China. And then, obviously, the big three that we know in the US and the rest of the world. I think that's an interesting take on it, which is, yeah, this probably has as much of a commercial leaning as it does a censorship one. I think that's another thing where people go into China. And I asked a friend who goes more often than I do. I said, "Oh, a family member is going to China. What should I advise them?" And he just said, "Oh, you must download this specific app. Yes, it's half in Chinese, but this is how you'll be able to do stuff, like call a cab." I mean, that maybe seems obvious. You go to another country, in Singapore, say, "Hey, get the Grab app." Because Uber is not a thing here. But this is like a whole layer extra, which is just sort of you're going to be able to achieve nothing unless you actually download this super app that has maps, and transport, and food, and all this kind of stuff. Everything else just isn't going to help you out, basically. [0:47:53] JS: I think you're exactly right. This isn't something that's unique to China. I mean, that sort of super app seems an extreme example. But just like you said, there tend to be ride-sharing apps that exist in certain regional areas. Or even the US trying to block TikTok, unless it gets sold to an American entity, is really the same thing, right? It's this kind of idea of we are going to prioritize national services before international services. [0:48:21] GV: Yeah, absolutely. I mean, just a carve-out with this: Hong Kong still doesn't really experience this bottleneck effect. At least that was my experience - when I was last there, I think last year. There is no sort of GFW per se within Hong Kong. And is it used as any kind of proxy or something? I think we were touching on this before we started recording. [0:48:43] JS: Yeah. There is sort of this idea that Hong Kong tends to have a freer internet in our perspective of things. I would definitely defer to you on any lived experience. But we do see that a number of proxies will kind of use that as their first hop. And there's even some speculation on whether there are gray-market links that exist between China and Hong Kong, specifically for the transit of proxy traffic that's ultimately going to go other places. Some people even believe that a lot of the proxy traffic exists simply as a way to get around the bottleneck, right? They might not be trying to access undesirable content. They just have some type of maybe business need to get outside of the country. They aren't at the enterprise level, where they can afford these contracts with a Cloudflare. And so they use these proxies as an alternative means. [0:49:34] GV: Yeah. And this crosses into like realms of like dark fiber, I think. [0:49:39] JS: Yeah. Yeah. Yeah. [0:49:41] GV: I mean, my kind of anecdote in this is just like I could be sitting in a Starbucks here, and I'm overhearing a conversation of someone who's clearly come from Hong Kong to sell a dark fiber contract to someone in Singapore.
Could you maybe just touch on that briefly? How does that cross into this? [0:49:56] JS: I'm not much of an expert on that. I've only heard sort of, like you said, these anecdotal experiences where somebody has access to some type of link that goes from mainland China to Hong Kong, and then they're selling access to people in China. But I would love to know more. [0:50:14] GV: I mean, at least to me, it kind of looked legit. But equally, I don't know why it was being held in a Starbucks and not in someone's office. That's all we need to know. If we sort of then just look at any other countries that you at least believe may have adopted these techniques, I mean, maybe there's some obvious ones that people are already thinking of, but just from where you sit, which other places have you maybe seen adopt the GFW approach? [0:50:41] JS: Yeah. There are a couple of different things to consider there. Specifically, for the GFW approach, I think it's interesting or pertinent to get back to that sort of leak that I was referring to. And you may have heard about this. It sort of hit some of the mainstream news sources in that this massive trove of data got out of this Chinese corporation called Geedge. And this corporation's goal is effectively to commercialize censorship software and hardware and then distribute it to other nations. And in that leak, they found that Kazakhstan, Ethiopia, Pakistan and Myanmar are all confirmed clients or customers of this organization that is Geedge. And so we know that they, to some extent, have the capabilities of the GFW through these business dealings. However, those aren't the countries that we typically think of when we think of other nations that are doing censorship. I think one of the big ones that comes to mind is Iran, right? We don't know if there's any relationship between Iran and China's censorship infrastructure. These countries all tend to have their own sort of character, if you will. Iran is known to be very dynamic. They are constantly changing what's getting censored, how it's getting censored, where it's getting censored. And two neighboring ISPs in Iran can have completely different censorship experiences. And then another one is Russia. Russia, again, has sort of its own interesting culture where they have more of a capitalist spin, if you will, on their infrastructure, right? In China, you kind of have two or three major ISPs. They tend to have heavy involvement from the government. And so deploying that censorship infrastructure is very much a state activity. But when you look at Russia, the censorship is basically the state saying to all of these sort of independent ISPs, "Hey, look, you're going to block this list of sites. We don't care how you do it. Figure it out." Right? And so with Russia, you see these weird characteristics where one ISP will look different from another, and they seem to be doing almost this bare-minimum approach, where it's very easy to get around, and it's very patchwork. Yeah, the censorship landscape is fascinating in how much of a shift you see between all of these different countries. [0:52:58] GV: Yeah, for sure. And I guess just looking ahead, if you want to call it that, any predictions on where kind of things will go? I mean, we've touched on like a lot of the whole idea here is that things just seem to keep changing. And there might not be any rhyme or reason, at least from the outside, as to why.
But yeah, just if you were to kind of think ahead, anything that you suspect might change or happen? [0:53:22] JS: It's hard to say; it is a dynamic landscape. But I do believe that censorship is going to become a little easier for the censors, and it's going to become harder to circumvent. We're seeing, especially in China, this shift of middleboxes closer and closer to the end user, where, due to sort of a scarcity of IP addresses, they're implementing these NATs, to where a residential address won't have a dedicated IP, right? And so as we sort of see more and more middlebox infrastructure getting closer and closer to that end user, it will become easier for them to deploy more of a distributed censorship network that can kind of implement more advanced techniques, because it's taking on less load. [0:54:04] GV: Just as we sort of start to wrap up here, what about your own research? Are you doing more into this kind of going forwards, or are you kind of leaving this here and moving on to a different part of the censorship landscape? Yeah, where are you taking things from here? [0:54:21] JS: Yeah. I've been looking extensively into that leaked information I referred to. That's a relatively new finding for us. And so we've been looking at that to sort of be that ground truth that I said doesn't exist, right? Confirm some of these statements that we have made in all of the research we've done so far, as well as trying to understand better what role machine learning can play in this space, right? A lot of my research is focused around that network tap here at CU and having access to this massive corpus of internet traffic that is real. And finding ways that we can better classify that traffic or even generate synthetic traffic is sort of the direction I'm headed in next. [0:55:02] GV: Awesome. Well, that all sounds fascinating. I hope maybe we get to catch up in a couple of years and hear what you've been up to since then. And I just want to really thank you for your time, Jackson. I've learned a ton today. I'm sure the audience has as well. Obviously, you're doing great work. It's very important that this information is understood, at least outside of China. Yeah, thanks so much. [0:55:26] JS: Absolutely. Thank you so much for having me. [END]