EPISODE 1594 [EPISODE] [0:00:00] ANNOUNCER: If you're a sports fan and like to track sports statistics and results, you've probably heard of Sofascore. The website started in 2010 and ran on a modest single server, and now has 25 million monthly active users, covers 20 different sports, 11,000 leagues and tournaments, and is available in over 30 languages. Josip Stuhli has been with Sofascore for 13 years. He started there as an engineer and is currently CTO. Josip joins the show today to talk about the challenges Sofascore encountered over the years, and how the team solved them. He discusses dealing with traffic spikes from game days, structuring and restructuring the codebase, organizing the frontend and backend, and much more. This episode of Software Engineering Daily is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. [INTERVIEW] [0:01:00] SF: Josip, welcome to the show. [0:01:01] JS: Thank you for having me. [0:01:02] SF: Yes. Thanks so much for being here. I've been really looking forward to this. As we were discussing when we joined this call, we realized that we actually overlapped a month ago at Infobip Shift. We were both speakers, and we somehow missed each other. So, we'll have to wait until next year to meet in person. But I'm excited to meet you today, and to dive into your background of multiple years building this company from the technology side. [0:01:27] JS: Yes. I cannot believe that we managed to miss each other, but then again, there are so many people speaking there that it actually isn't such a surprise. I mean, Sofascore is really an interesting project, and I really love talking about what we did and how we got to this place, and what problems we had, and how we solved them. I always kind of see it as a way to give back to other people, so they can see what mistakes we made, and maybe not repeat them. [0:01:58] SF: Yes, I think, for myself, a big part of the way that I learn new technologies and new concepts is by listening to in-depth interviews with people who have sort of been there and done that. It's a great way to learn what mistakes people have made, but also how they've evolved their way of thinking and how different technologies that they have adopted sort of unlocked or solved various problems for them, or made things easier. Maybe they started with sort of DIY, and then they figured out, "Oh, wow, I don't need to build this. I can actually use this open source project, or I can buy this SaaS product or something like that, that alleviates some of that pain." So, we're going to dive into all that stuff. But maybe before we go too deep, can you give a little bit of background about who you are and what you do? [0:02:40] JS: Sure. So currently, I have the role of CTO at Sofascore, which I've had for 11 years now. I've been with the company for 13 years. I started as just a regular software developer, and kind of moved my way up. I've been with computers for as long as I can remember, basically, so since elementary school. That's 25 years now. I've kind of always been fascinated with them, with the ability to just tell the computer what to do, and then it will do exactly that. If something doesn't work, then it's on you, not on the computer. I also love optimizing everything. It's really, maybe even to the point of a disease. I cannot help myself if I see that something is suboptimal, and I try to make everything work faster, and cheaper, and for everything to be better. 
I also love tinkering with different technologies. So yes, that's about me. [0:03:42] SF: Yes. There's a certain personal satisfaction that you can get from really getting into the weeds of bit-twiddling things for optimization, taking something where maybe you can make a 10x or 100x improvement on its performance just by making certain tweaks, or different data structure changes, or strategy changes, or algorithmic changes, and it can have a major impact, essentially, on what you can do from a product standpoint. So, you've been at Sofascore for 13 years, and probably not everyone is super familiar with the company. I know you're based out of Croatia. But can you give a little bit of background about what is Sofascore? When did it start? And what is the scale of the company today in terms of users, traffic, requests, and so forth? Just to set some context and paint a picture of where you are. [0:04:30] JS: Yes. Sure. So, the company was started in 2010, 13 years ago. It basically happened slowly. The founders were running an online forum, and they realized that the topics people go for the most are related to sports. So, they started writing their own topics on that forum, and they gained a lot of traction. They had good SEO, and then they realized that they could reinvest the money into actually building an app that would automatically do the things they had been doing by hand. That's when I came in. I came in, myself and one other guy. Basically, what Sofascore is now, and how we started, is it's basically a live scoring app. So, you can watch your favorite teams, or sports, or players in real-time. You can see their performance, and you can also see a lot of stats, especially for football. So basically, it's like a second screen when you're watching a game. Or if you cannot watch a game, you can always look and see what's happening. We really do have a lot of different statistics. We have heat maps for players. We have over 250 statistics for each player, for each match that has good coverage, and the users recognize that. So, we are currently at 25 million monthly active users, and they generate more than 1.3 petabytes of traffic each month, which is huge. It translates to more than 300 billion requests that the servers have to handle. So, it's a lot. I mean, the thing is also that the way the app works is we have the most users when there are a lot of games. So, we get really big spikes, and our peak was 1.8 million people in real time, which was a lot. [0:06:26] SF: Yes. That's kind of the current state. I want to go back to the beginning and just shape the picture of where you started from an engineering standpoint. Back in the early days of being an engineer, can you just describe what that experience was? How big was the team and what was the infrastructure like back then? I'm sure it was significantly simpler in comparison to where it is today. [0:06:49] JS: Oh, yes. It was a totally different time. So, it was just two guys, me and one other guy. We were doing the backend and frontend and it was horrible. I'm going to be honest. Because we were students at the time, there were weeks when we didn't do anything, because we were studying. And then, we were basically kids, so we didn't really want to code when we didn't feel like it. But if we put that aside, the main issue was we didn't really know how to scale the infrastructure. Neither of us had any experience with how you should deal with spikes in traffic. 
And the nature of the app – and it was a website at the time – is that you get a lot of users in a very short amount of time. And then, the rest of the time, you don't actually get any traffic. So, the way we did it, we started writing in PHP, with a MySQL database, and it was all on one server. Then, when people came, it couldn't handle the load because it was just one machine, and then everything crashed. Then, you just kind of hope that people will go away, and that the most persistent ones will actually get to see the results, which is not a way to run a business. It's a bad business practice, as you might imagine. Then, we started to see how we could survive the spikes, and what we could do to the infrastructure to handle more. The first thing you do is you split the web from the database, which is what we did, and then we were able to handle a bit more. But still, when more people came, the site would just crash. It didn't happen often. But we didn't really have a solution at that time. Then, we started to look around at what we could do to survive those spikes. The first thing you do is you add a cache. So, we did. We added Memcached. It helped. But PHP is slow. So, even with the cache, you still have to boot PHP to fetch the data from Memcached and return it. Then we could handle more, but at a certain point, it would also crash. The thing is, we didn't manage the infra ourselves. We actually had a company that was doing that for us. And the problem was, they didn't know how to scale either. They would install PHP via WHM and cPanel, which is not something you do. So, at that point, I talked to the founders, and I said, "Listen, this is the prime use case for cloud." This was 2013, and the cloud was something really new. Not a lot of people were using it. I said, "This is the ideal case." Most of the time you have very low traffic, and then when spikes happen, you can auto scale, and when the people go away, you can just scale down. They asked me if I knew how to do it. I said, sure, we're going to do it. I did not know how to do it. I had no experience whatsoever. I was really into tech, so I knew I was going to try my best, see what I could do, read everything I could on the Internet, but honestly, I had no idea how it would actually turn out. But it did turn out great. So, we were on the cloud for a couple of years, and it worked really, really well. That's when we actually decided to look into the caching layer and figure that out, and from that point on, our caching layer has been a really robust and complex piece of work. That's what enables us to scale to the level where we are now. [0:10:28] SF: That original investment in cloud, was that with AWS? [0:10:32] JS: Yes. I don't know if there were even any other clouds at that point. AWS was just the cloud you would use. But yes, AWS. [0:10:40] SF: Yes. Even AWS 10 years ago was a much different world than it is today. You probably had S3, EC2 virtual instances. I don't even know if Elastic Beanstalk existed at that point. [0:10:55] JS: No. I don't think it did. You had Route 53, you had S3, and you had EC2. I think that's it. I don't think there were any other services at that point in time. But it was enough. You could do a lot with it. We had an Auto Scaling group, and then when traffic spikes came, it would just auto scale. Magically, it would work. It was amazing. 
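The cache-aside step described above – boot PHP, check Memcached, and only fall back to MySQL on a miss – might look something like the minimal sketch below. This is an illustration, not Sofascore's actual code; the key names, query, TTL, and connection details are all made up.

<?php
// Hypothetical cache-aside sketch: PHP checks Memcached first and only
// queries MySQL on a miss. Names and the 30-second TTL are illustrative.

$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

function getEvent(int $eventId, Memcached $cache): array
{
    $key = 'event:' . $eventId;

    $event = $cache->get($key);
    if ($event !== false) {
        return $event;           // cache hit: no database work at all
    }

    // Cache miss: PHP still has to boot and query MySQL.
    $db = new PDO('mysql:host=127.0.0.1;dbname=sofascore', 'app', 'secret');
    $stmt = $db->prepare('SELECT * FROM event WHERE id = ?');
    $stmt->execute([$eventId]);
    $event = $stmt->fetch(PDO::FETCH_ASSOC);

    // Keep it briefly so the next burst of readers is served from memory.
    $cache->set($key, $event, 30);

    return $event;
}

var_dump(getEvent(12345, $cache));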
I mean, the cloud is what enabled us to grow at that point in time, because we didn't have to plan – I mean, we couldn't plan our growth. We didn't know how many users would come. We didn't know what our growth would look like. So, the cloud really enabled us to focus on the product, and to build that and not really think about infrastructure that much. [0:11:36] SF: Then, how were you running your databases back then? Were they running directly on EC2 instances? Or were you – I'm assuming this was even pre managed services like RDS. Did you have to manage your own database cluster? [0:11:50] JS: Yes. So, I think we did it ourselves in the first days. But then we moved to RDS. RDS was really amazing. That worked really well. Then, as the project grew, we started changing some of the components that we had. We switched from one data provider to another. So basically, the way we get our data is we buy it from a data provider. And when we were switching from one data provider to another, the new provider didn't impose its own schema. You just got an XML, and you could do whatever you wanted with it. So, we thought it was a really good idea to move from MySQL to MongoDB, because Web Scale. But the thing is, we are an analytical company, so we want to do a lot of queries, and aggregations, and everything else, and that proved difficult to do in MongoDB, especially in those days. MongoDB even had collection-level locking. So, while you did updates you couldn't read, which was an issue, because we had a lot of updates, because there are a lot of games happening at the same time. That was a problem. Also, we couldn't aggregate any data with simplicity. You had to write MapReduce, and it was really, really a pain in the ass. That's when we figured out that we had made a mistake. It was not a good move for us. At that point, we also did something which I'm not sure I would do now with such ease. We decided that we were going to switch back from NoSQL to a relational database, which is a big undertaking. You have to normalize all of the data. You have to solve conflicts. There are a lot of things that go into that. But we were young and crazy and said, "We're going to do it. This is something that we need to do." We did it. We moved to Postgres, and that's something we are using to this day. [0:13:39] SF: In terms of the migration that you had to make – it sounds like you basically had to move from a SQL-style database to Mongo, and then move back from Mongo to Postgres in this case. I'm assuming you were able to do that with zero downtime, or very little downtime. So, what were some of the challenges you ran into with doing that large-scale migration, and how did you overcome those? [0:14:02] JS: Yes. We had to do it with zero downtime, because the app is global, and there's always something going on, and we cannot just allow ourselves to be down. A couple of minutes of downtime is okay. But at that point in time, we were big enough that it was a really important requirement. So, the way we did it is we timestamped all of the collections in MongoDB. Every entity had a timestamp of when it was changed. Then, we created a schema in Postgres, which was a normalized version of that, and we wrote a tool that would migrate from MongoDB to Postgres and record the timestamp. Then, it would do its thing, and that would take some time. In the meantime, new stuff got updated. 
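A minimal sketch of what this kind of timestamp-based catch-up migration could look like is below. The database, collection, table, and column names are hypothetical, not Sofascore's actual schema or tool. Each run copies only the documents changed since the previous run, so every pass has less to do, until the delta is small enough to fail over.

<?php
// Hypothetical catch-up migration sketch: copy documents changed since the
// last run from MongoDB into Postgres, then advance the high-water mark.
// Assumes an "event" table with a primary key on id.

require 'vendor/autoload.php';

$events = (new MongoDB\Client('mongodb://localhost:27017'))
    ->selectCollection('sofascore', 'events');

$pg = new PDO('pgsql:host=localhost;dbname=sofascore', 'app', 'secret');

// Timestamp of the previous run (epoch seconds); 0 on the very first run,
// which migrates everything.
$lastRun  = (int) @file_get_contents('last_run.txt');
$runStart = time();

$upsert = $pg->prepare(
    'INSERT INTO event (id, home_team, away_team, status, updated_at)
     VALUES (:id, :home, :away, :status, to_timestamp(:updated))
     ON CONFLICT (id) DO UPDATE SET
       home_team  = EXCLUDED.home_team,
       away_team  = EXCLUDED.away_team,
       status     = EXCLUDED.status,
       updated_at = EXCLUDED.updated_at'
);

// Only documents whose updatedAt is newer than the last run.
$changed = $events->find([
    'updatedAt' => ['$gt' => new MongoDB\BSON\UTCDateTime($lastRun * 1000)],
]);

foreach ($changed as $doc) {
    $upsert->execute([
        ':id'      => (string) $doc['_id'],
        ':home'    => $doc['homeTeam']['name'],
        ':away'    => $doc['awayTeam']['name'],
        ':status'  => $doc['status'],
        ':updated' => $doc['updatedAt']->toDateTime()->getTimestamp(),
    ]);
}

// The next run only has to pick up changes made after this point.
file_put_contents('last_run.txt', (string) $runStart);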
So then, we just took everything from the last time the script was run and migrated just that, and the time got shorter and shorter, and that's how we got it running with the databases in sync. And to make sure that everything was working, we ran a canary deployment to see that it could actually handle the reads. It was a read-only canary machine. Then, once we were satisfied that everything was working as it should, we just did a failover to the Postgres version. Surprisingly, everything worked. Then, we just got rid of everything that was in MongoDB, and continued our development on Postgres. [0:15:31] SF: Then, in terms of these traffic spikes that you're having, are those primarily reads or writes? I mean, you're dealing with real-time ingestion, essentially, of sports data. I'm assuming there's a fair amount of writes happening. But there's also a large amount of reads. So, what is the biggest challenge in terms of managing both the ingestion of data, as well as, essentially, the requests for the data? [0:15:55] JS: That's true. So, when we have spikes, those spikes occur either due to a small number of really popular games – for example, when popular teams play – or when there are a lot of smaller games happening. In the second case, when there are a lot of smaller games happening, we have a lot of changes. Just to give you an example, we have an invalidation system in place, which purges the cache on changes that are important to us, and we handle up to 150,000 purges per minute. So, that's the amount of change that can happen in one single minute. But that's actually not the issue. That's something we can control. If there are a lot of things happening, everything gets queued, and then we just go through the queue at our own pace. If it were to happen that the data providers shipped a lot of changes at the same time, we just queue those. The biggest issue is users. Mostly they do reads, but for some reason, when they get a push notification, they add more games. We don't know why they do that. But at the biggest spikes we have, they just keep clicking in the app, and they sign up for more subscriptions, and that's something that's not cacheable. Those are writes, and you could queue them. But we also want to give feedback to the user that they actually did the thing that they wanted to do. So, we try not to queue those things, those writes, and that's where the biggest spikes actually happen. [0:15:55] SF: Right. So essentially, if someone is adding more games, the user experience should be, one, I get feedback that I'm now following this game. But I also should be able to see the live updates – that's my expectation. So, you can't queue it for five minutes, and then I check back in 10 minutes and my data is there. [0:17:46] JS: Yes. I mean, because some people also mute their games. The expectation from the user is, if I mute this game now, I will get zero push notifications from now on. That's something we cannot queue. Also, we have a sync mechanism between devices. So, what they expect is, if I click the game here, I want it to show up on my other screen as well, instantaneously. That's the expectation they have come to have from us. So, that's also one of the reasons why we want to do it as fast as possible. [0:18:15] SF: Then, you mentioned that you have a heavy reliance on caching and the cache layer, and that's what's helped you essentially have that level of scale and responsiveness. 
What are some of the things that you've had to do in terms of developing that caching layer in order to serve the customer? I'm assuming the cache is probably distributed across multiple machines. How are you keeping those objects in sync across different machines? And what other challenges have you faced, and innovations have you had to come up with, to solve different problems? [0:18:45] JS: That's a really interesting question. I really love talking about it, because that's the core of our system, and what enables us to do the things we do without actually spending millions on infrastructure. The thing is, we realized really early on that the app itself is read-heavy, and we started to optimize our REST API to be as cacheable as possible. So, no endpoint will mix different data types, and there will not be big blobs of data. Endpoints are specific to the things that change, such that things that change at different points in time have different endpoints, so that they are more cacheable. As for the way we did the whole caching system – it started simply enough, where we only had one Varnish server. Varnish is what we use for our caching layer. It's an amazing piece of software, which is free, and it can handle an absurd number of requests per second. So, we had one Varnish on every AWS instance, and then you have a problem with scaling. If you have one machine, then all the cache is on one machine. But we had the backend and Varnish on the same machine, so when you scale your backend, you basically split your cache into two. If you extrapolate that: if you have 20 machines in your Auto Scaling group, your cache is 20 times worse than if you only had one. So, the first thing we did was separate the caching layer into its own Auto Scaling group, which helped. But then, my disease kicked in. Varnish has request coalescing, which means only one request will go to the backend, and everything else will get queued up waiting for it, which is amazing. But if you have three machines, then three identical requests will go to the backend. That's when we started looking into how we could make this more optimal. Basically, what we have now is two layers of Varnishes. The first layer is there for data locality, and to have a really good CPU and a lot of bandwidth, which it utilizes to serve the data to devices. Then, to keep all of the data in sync, what we do is shard based on the hash of the URL. So, when the Varnishes in the first layer receive a request and it's not in their cache, they will go to the second layer of Varnishes – and all of them will go to one single machine. If that machine fails, there's a consistent hashing ring, so they will all fail over to a different Varnish, and it does that in a way that spreads the load across all of the remaining Varnishes equally. It works really well. But in order to have that, you have to have cache invalidation, and that's a really difficult thing to do. There's this really cool saying: the two most difficult things in computer science are naming things, cache invalidation, and off-by-one errors. So, cache invalidation is really difficult, and we invested a lot of time to develop an in-house library, which basically builds out a graph of dependencies between entities and the endpoints that rely on those entities. 
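A minimal sketch of the idea behind such an entity-to-endpoint dependency graph is below. This is not Sofascore's actual open-source library; the endpoint templates, entity names, and the PURGE-over-HTTP convention are assumptions (Varnish only honours PURGE requests if your VCL is configured to allow them).

<?php
// Hypothetical dependency-graph sketch: map each entity type to the endpoint
// templates that depend on it, resolve concrete URLs when an entity changes,
// and send an HTTP PURGE for each of them to the caching layer.

class InvalidationGraph
{
    /** @var array<string, string[]> entity type => dependent endpoint templates */
    private array $dependencies = [
        'event' => [
            '/api/v1/event/{eventId}',
            '/api/v1/event/{eventId}/statistics',
            '/api/v1/team/{homeTeamId}/events/last',
            '/api/v1/team/{awayTeamId}/events/last',
        ],
        'player' => [
            '/api/v1/player/{playerId}',
            '/api/v1/player/{playerId}/statistics',
        ],
    ];

    /** Resolve which URLs must be purged when an entity changes. */
    public function endpointsFor(string $entityType, array $params): array
    {
        $urls = [];
        foreach ($this->dependencies[$entityType] ?? [] as $template) {
            $url = $template;
            foreach ($params as $name => $value) {
                $url = str_replace('{' . $name . '}', (string) $value, $url);
            }
            // Skip templates whose placeholders we could not fill.
            if (!str_contains($url, '{')) {
                $urls[] = $url;
            }
        }
        return $urls;
    }
}

/** Send an HTTP PURGE for each dependent endpoint to the cache. */
function purge(array $urls, string $varnish = 'http://127.0.0.1:6081'): void
{
    foreach ($urls as $path) {
        $ch = curl_init($varnish . $path);
        curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PURGE');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_exec($ch);
        curl_close($ch);
    }
}

// Example: a goal is scored, so the event entity changes.
$graph = new InvalidationGraph();
purge($graph->endpointsFor('event', [
    'eventId'    => 12345,
    'homeTeamId' => 17,
    'awayTeamId' => 33,
]));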
Then, once an entity is changed, we know exactly what endpoints need to be purged, and what parameters need to be put into the endpoints. It's actually open source, and it works really well, and it enables us to have – our current cache hit rate is more than 99%. That means that we only need to pay for 1% of the servers. We would have to pay 100 times more if we didn't have this caching layer. And also, those machines are actually distributed all around the world. So, for the first layer, we have machines in the US, we have them in Brazil, we have them in India, in Sydney, all around the world. [0:23:00] SF: Yes. So essentially, you can reduce the latency on – [0:23:03] JS: Yes, exactly. I mean, it's a really big difference. It's also what enables us to have a system that's not as complex. We don't have to do multi-data center on different continents, we just have the caching layer distributed. Everything else is in two data centers in France. So, the inter-data center latency is really low. That's where we store all of our data. And then, the way we manage to get the latency down is by having the caching layer, which is distributed, and it makes a really big difference. We managed to reduce the latency, for example for Australia, from half a second to less than 80 milliseconds. Going from something you can see – you can see the loader there – to something which is basically imperceptible. That really is something that gives you an edge over your competition, because not everybody will optimize for this. Users will see that your app is fast, and they will see a loader in your competitor's app, and that's what will drive them to recommend your app to their friends, and to gain more users. [0:24:11] SF: Yes, especially in a world where someone's using this potentially as a second screen. They're actually watching the live sports, and then if there's a noticeable delay on the stats compared to what they just watched on television, that's going to be just a bad user experience. [0:24:29] JS: I mean, it goes even further. That's a funny story. The way those streaming feeds work, they get re-encoded a couple of times. So, if you're watching on your TV, or even worse on IPTV, the app will actually be faster than what you can see on the TV because of this. People will first see the results in the app, and then they will see it on their screen. We actually had users who made a feature request to slow down the push notifications and the data refresh for the games that they are watching, because it's ruining their – [0:25:02] SF: Yes, I was wondering, is it a spoiler? [0:25:03] JS: Yes, it's a spoiler. I mean, when we watch the games in the company, everybody puts their phone on mute, because you just don't want to know. It's that fast. [0:25:12] SF: Yes. Related to that delay, I can remember watching the Olympics in an apartment building years ago – I'm Canadian, so we were watching the Winter Olympics with Canada playing. And depending on where people were watching it, the feed had different delays. So, you'd hear cheers from an apartment somewhere else and be like, "Canada scored," and you hadn't seen it yet. So, I can see how that would be a frustrating problem. But that's a high-quality problem, if someone's like, "Your service is too fast, 
we need you to slow it down, because you're faster than the television, essentially." Taking us through this journey, a big part of this has been splitting – initially, splitting data from compute. Let's not have things running on the same server, and that way you can auto-scale and handle the scaling of those different services depending on the request volume hitting those services. And then you did the same thing with caching. At what point did you have to start thinking – I'm assuming that this started as a monolith application – at some point, you probably started to break up that monolith as well, because certain parts of the application, like the API layer, are probably going to have different scaling needs than some other part of the application. So, can you talk a little bit about when that started to become a problem? What was the strategy for starting to break that up? [0:26:27] JS: Yes. So, we actually did that before it started to be a problem. It kind of came naturally. We don't have microservices, I don't like them. I think people try to use microservices too early in their development, which then causes them to have different issues. What we have are services, for lack of a better word. What we started doing is, everything that was potentially slow, we started putting into queues. That kind of grew. The first thing we did was the data ingestion. So, when our data provider sends us data, we put it in a queue, and then we have workers which take jobs from the queue, parse them, mark them as done, and then move on to the next. That enables you to have multiple workers, if you need that. Once we moved to the cloud, we started running those workers on different nodes, and that enabled us to scale that part independently of the API. Then, as time passed, we had more and more services. Now, we have more than 150 different services, which run as separate containers in Kubernetes, and they have different numbers of replicas, and they can auto-scale independently from each other. But the API layer itself is basically just another service, and since a lot of the work that's being done is not related to the API itself, but rather to parsing the data and calculating different stuff, it just works and it's simple to maintain. Everything is in one single repository. I mean, not everything – most of the stuff is in one repository, which enables us to have a really simple development process. I mean, it's simple now. But while we were growing, one of the issues that presented itself was that we had a frontend team and a backend team, and they worked on the same code, because the frontend was actually just rendered from the backend. That got really annoying, because you would have a lot of developers actively trying to modify basically the same code, and the deployments got complicated, and that's when we said the frontend should be just another platform, as the iOS and Android apps are. That's one of the biggest things that really allowed us to have teams which can work independently. Also, one of the things we had was two APIs. For some reason, the web had its own API, and the mobile apps had their own API, which means you'd have a worse cache hit rate, because you have different stuff that you need to cache. So, that's also one of the things that we changed. 
We unified the API, and really thought about how to enable the API to grow with time without worrying so much about backwards compatibility, and that has served us really well in the last couple of years. [0:29:32] SF: Right. I mean, I think another motivation for starting to break up a monolith, or any particular application, isn't just necessarily around the scaling concerns of your traffic, but also around the internal scaling concerns of your engineering team. Because if everybody's basically working against the same project, then it starts to become an organizational problem or a deployment problem, because essentially, everybody's just running over each other to some degree. [0:30:00] JS: Yes. Exactly. So, you have a lot of people working on the same code base. You get a lot of conflicts, and then you have to resolve those conflicts, and then you just get slower. I remember, in the early days, we would do a lot of development, and then, once a month, we would release a new version, and all hell would break loose. Then, we said, "Okay, we should probably deploy more often." So, we moved from one month to two weeks, and then from two weeks to one week, and from one week to every two days. Now, we have like 50 deployments every day, because every single thing that's changed is just immediately deployed, and then you can see if it's working or not. If something breaks, you can easily roll back, and you can see what change actually caused the problem, the bug, and it's really easy to do deployments that way. When we get new developers in the company, junior developers usually deploy within their first two days. It's that simple to use. [0:31:01] SF: You mentioned queueing earlier, and how important that is to some of the things that you're doing around data ingestion. When you migrated to the cloud, were you using things like SQS on AWS for queuing? Or was this something that you built yourself? And what is the status of your queuing pipeline today? Are you taking advantage of newer technologies, things like Kafka, for example? [0:31:25] JS: Yes. So, we actually used a library for queuing called Resque, which relies on Redis and is extremely fast. But the thing is, it's a client-side implementation. So, if you want to switch technologies, you have to reimplement it in another language. It was actually developed by the guys from GitHub, so it's really high quality. But once we knew that we wanted to write some services in a language other than PHP, we realized that we should probably move away from the client-side implementation and use a queue that's really a server-side queue, which would be better suited. Fortunately, we didn't use SQS, which made our migration from the cloud to our data center easier. If we had relied on SQS, we would be coupled to AWS, which is not necessarily a bad thing, right? I mean, people say you shouldn't couple to or rely on anything, but you cannot live your life worrying that everything should be completely modular and able to run anywhere. But yeah, we've moved to a job queue called Beanstalkd. It's written in C. It's really fast and it works really well. We had no need for Kafka. Kafka is a bit too big, in my opinion, for our use case. We need something that's simple and easy and nimble and doesn't use a lot of resources. That's part of what enables us to run at this scale. So, everything. Everything is queued. Everything that can be is queued. 
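A minimal sketch of that "queue everything" pattern with Beanstalkd is below, using the Pheanstalk PHP client. The tube name and payload shape are made up, and the exact Pheanstalk method signatures differ between versions, so treat this as an illustration of the producer/worker split rather than Sofascore's actual code.

<?php
// Hypothetical Beanstalkd sketch: enqueue a data-provider payload in the
// request path, and let a separate long-running worker process parse it.

require 'vendor/autoload.php';

use Pheanstalk\Pheanstalk;

$queue = Pheanstalk::create('127.0.0.1');

// Producer side: a provider payload arrives, enqueue it and return
// immediately instead of parsing it inline.
$queue->useTube('provider-ingest');
$queue->put(json_encode([
    'provider' => 'xml-feed',
    'payload'  => '<match id="12345">...</match>',
]));

// Worker side (a separate process): reserve jobs one by one, parse them,
// and delete them when done. More workers means more throughput.
$queue->watch('provider-ingest');
while (true) {
    $job  = $queue->reserve();          // blocks until a job is available
    $data = json_decode($job->getData(), true);

    // ... parse the XML, update Postgres, trigger cache purges ...

    $queue->delete($job);               // mark the job as done
}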
Everything that depends on any third-party service is queued as well. [0:33:02] SF: Then, you mentioned earlier that one of the reasons Mongo didn't work out is because you need to do a lot of analytical queries. Have you evolved the way that you're doing some of your analytics today? Are you still relying on Postgres and running queries there? Or have you invested in new types of technologies – there are new types of databases that have been developed specifically for performing analytical operations at scale that are highly performant, beyond just what you might get with an out-of-the-box SQL database? [0:33:36] JS: Yes. That's true. So, we moved to Postgres for two different reasons. One is the analytical part, and the other is that it has this cool thing called MVCC, multiversion concurrency control, which basically allows you to read and write at the same time. So, you can update your data and still be able to read it, and that's something that's really important to us, because we want users to be able to read the old state of the database while we are writing the new state. And that's something that Postgres does really, really well. As far as the analytical part goes, we are still using Postgres for the sports data, but not for analysis of user data. Every click in the app is recorded automatically by the Firebase SDK. It's anonymized when it's sent to Firebase. What we do is download all of the data and run our machine learning models on it to figure out which users will be our long-time users, which are the most valuable ones, to optimize campaigns and also to do different personalization stuff in the app, and that's a lot of data. We have one terabyte of data every day, and we started doing that in 2019. It's more than a petabyte of data right now. Honestly, that's just too much for Postgres. It wouldn't scale. We have 1 billion rows ingested every day, and the database we use for that is called ClickHouse. It was developed by Yandex and it's specifically crafted for analytical workloads. It's a column-oriented data store and it's really efficient at storing and querying the data. But for the sports data, we're still using Postgres, because that data is not as big. It's only 150 gigabytes of data. It's not an issue for Postgres. [0:35:30] SF: And a little while ago, you were talking about how your deployment model has changed over time – I think you started out deploying every month or so, and now, basically, you've evolved to a continuous deployment model. What does that CI/CD pipeline look like today? What tools are you using to essentially handle your deployments? [0:35:55] JS: Yes. It's actually more than CI/CD. It's kind of our own in-house solution, which allows us to do a lot of stuff, but it isn't really that complicated. Basically, we can deploy any branch of any project we have to production really easily. What happens is, any person on the team can say, "I want to deploy my branch," and the system will take all of the branches that are marked for auto-deploy, merge them sequentially, build out a container image, and deploy it to production. That really enables us to test things out in production without actually having to commit anything to master. Then, once something is committed to the master branch on GitHub, it will just get deployed again automatically. 
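A minimal sketch of the kind of flow described here – merge the auto-deploy branches onto master, build one image, roll it out – shelling out to git, Docker, and kubectl. The branch names, registry, and deployment name are hypothetical; the real in-house tool is not public, so this only illustrates the shape of it.

<?php
// Hypothetical deploy sketch: merge every auto-deploy branch onto master,
// build and push one container image, then point the Kubernetes deployment
// at it. Rolling back is just re-running the last step with the old tag.

$branches = ['feature/player-heatmap', 'bugfix/push-mute']; // flagged for auto-deploy
$image    = 'registry.example.com/api:' . date('YmdHis');

function run(string $cmd): void
{
    passthru($cmd, $exitCode);
    if ($exitCode !== 0) {
        fwrite(STDERR, "command failed: $cmd\n");
        exit(1);
    }
}

// Start from the tip of master and merge the auto-deploy branches in order.
run('git fetch origin');
run('git checkout -B deploy origin/master');
foreach ($branches as $branch) {
    run('git merge --no-ff origin/' . escapeshellarg($branch));
}

// One image, built once, runs identically on every machine.
run('docker build -t ' . escapeshellarg($image) . ' .');
run('docker push ' . escapeshellarg($image));

// Roll the new image out to the cluster.
run('kubectl set image deployment/api api=' . escapeshellarg($image));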
What enables us to have this sort of confidence in the system, that it will work, is that we have a lot of tests which are run automatically. The way we do development is, when we are working on a feature or a bug fix, a new branch gets created. That branch cannot be merged to master without passing certain checks. Those checks are tests. The linter has to pass. You have to get approval from QA. You have to get approval from your peers, and then it can be merged. It's a custom solution. I'm not going to say it's a couple of lines of code, but it's definitely just a couple of files with certain logic – it's not a complicated system. It's really important, I think, for companies that want to grow and be fast. You have to do deployments daily, if you can, of course. I mean, if you have to create a binary version and ship it to your customers, who then have to deploy it on their infra, then obviously you cannot do it every hour. But if you can, it really allows you to grow your product more rapidly. [0:37:54] SF: Yes. I mean, you just run into a lot of organizational challenges with people developing for long periods of time independently, and then doing a merge, and then doing a deployment, and everybody's crossing their fingers, hoping that things were – [0:38:08] JS: Exactly. I mean, think about it, our deployments before were: we would just FTP to the server and drag files over. Then, when we moved to the cloud, you would just have the Amazon Machine Image, which was amazing. Then, we would have the same code on all the machines, always. Once we decided that the cloud was too expensive and moved back to the data center, we wanted something like it, because we didn't want to FTP to every single machine to do it. That's when we started using containers, because you could just build once and then run on all of the machines, and you could have the same code. And that's something that I think is really important, because if you're uploading manually, then you have different states on different machines and it's just a nightmare to debug. Also, the thing is, you should have cattle, not pets. You shouldn't treat your servers as special little snowflakes. You should just create and destroy them. That makes your life easier. [0:39:11] SF: Yes. You mentioned doing canary deployments earlier. Is that part of your regular deployment model? Are you always doing sort of a progressive rollout? [0:39:19] JS: That's something we didn't do before, and then for certain features, we started doing canary deployments. But right now, we don't actually have the need for that, because the rollback is so easy. We can just deploy something in production, and if something breaks, you can roll back within a couple of seconds. So, we do have the support for it. It's just not needed anymore. I mean, in Kubernetes, it's really easy to do. You just have a service that fronts different deployments. But we don't have the need for it anymore. We just deploy to production, because changes are small and incremental. [0:39:56] SF: You were an early adopter of AWS, and then at some point, you started to move back to running your own infrastructure. So, are you running a hybrid setup today? [0:40:07] JS: Yes. So, the main reason we moved away from the cloud is just the price. I mean, I love the cloud, and I was sorry to move away from it. But we got to a point where our traffic was costing us more than compute, which was ridiculous. 
That's when we started to look for alternatives, and that's when we moved our caching layer away from AWS and on-prem. Once we saw that this was working really well, and that we could actually over-provision the system by double – to have more than double the capacity of our highest peak – and still have it cost less than the auto-scaling cloud, we said, "Okay, we have to move to on-prem." It's not actually our own data center. We just lease the servers, month by month. But the issue you have there is, what if you've miscalculated the capacity you need? What happens if you have more users than you have capacity for? That's why we've built a system, which is also really simple, which allows us to spin up virtual machines. All of our machines are dedicated machines, which we pay for month by month. But if the load is high enough, it will trigger an API call to our provider, which will spin up new machines. Those machines will automatically join the Kubernetes cluster, and then they'll take part of the load, and everything will work. If the load goes up again, new machines will be spun up, and that's how we have the hybrid cloud approach, where we have the low price, but also the scalability of the cloud if we need it. [0:41:48] SF: Yes. I mean, I think that's a really innovative solution. You are investing in dedicated servers to serve your baseline use case, and then leveraging cloud to scale when you have these spikes that maybe go beyond what the baseline capacity is. [0:42:04] JS: Yes. Exactly. I mean, when we first started using the cloud, we had really big spikes. But once we figured out how to do the cache layer efficiently, all of the spikes got served from the cache. So, there wasn't a really big spike on the backend itself. The spike was on the caching layer, which doesn't need to scale as much, because it can handle a lot. The bottleneck is not the CPU. It's actually the bandwidth, and if you have really good machines with a lot of bandwidth, then you don't really have the need to scale them – unless, of course, you need to put out more bandwidth than you have. That's when we got into the situation where we didn't actually need to scale as much. The Auto Scaling groups would basically remain the same most of the time. And that's what allowed us to say, "Okay, we're going to move away, and we know that it will work, because we don't actually need the Auto Scaling group as much as we did before." [0:43:06] SF: You started your journey at Sofascore when the engineering team was basically two college kids – you yourself and one other person. And then, I'm sure it's grown significantly over the last 13 years, and you've evolved a lot as an engineer yourself, now into the role of CTO. But how has the engineering or company culture changed and evolved during that time as you've become a bigger organization? [0:43:33] JS: Yes. It's changed a lot. I mean, I have a really funny anecdote. We only had the web application in the beginning, and then we started to invest in mobile apps. At that point in time, push notifications were basically non-existent. We were the first live score application that implemented push notifications. That's how early we started. But the thing is, the way we did development in those early days, we had a guy who would code up a new version, and he was too lazy to send the APK file to all of the team members to test it out. 
He would just upload it to production and say, "This is in production, please test the app." That's how development was done in the early days. But once you get to a certain number of users, you realize you cannot do that stuff anymore. Now, I don't want to have the corporate world where everything needs to get signed off on. But we do have a process which gives the developers the confidence to develop with ease. All of the code is – I mean, I already said this – developed in a separate branch, it has to pass all these checks, and then it gets deployed. If an issue arises, then it's rolled back easily. We want our developers to be able to move fast and not break things. And I mean, the company is larger now. It's no longer a startup. So, you have a lot of colleagues who can help you with your issues. When we started, I couldn't ask anybody to help me. All of the problems that arose, especially when something went down in production, I would have to fix, basically, because I was the most senior guy in the company. That was really stressful, because you couldn't go on vacation – if something broke, I would have to get online to fix it. Now, everything is split among multiple people. More people know different parts of the system, so there is no system that depends on just one person. I think that's really, really important. You cannot have a company where only one person knows one part of the system. And if that person leaves [inaudible 0:45:45]. [0:45:45] SF: Yes, absolutely. You can't have a bus factor of one on your systems. [0:45:49] JS: Exactly. [0:45:49] SF: So, what's next for you and for Sofascore? [0:45:53] JS: So, we have more ideas than we have people who can implement those ideas, which has always been a problem for us. There's so much we want to do. The thing we are most focused on in the coming weeks and months and years is that we want to empower lower-league sports to have visibility in the world. You have a lot of leagues that are still run by writing into Excel files or on paper. So, we have software called Sofascore Editor, which allows you to input data, to basically digitize your league, and to have visibility for your league in the world. Anybody in the world can see what your league is doing and how the players are performing. We also love this because we can enable a lot of people to get access to a lot of data. So, we want to invest more in crowdsourcing and in this lower-league editor as well. We see something like this as a way to help us grow as a company and also to give back to the world. [0:46:59] SF: That's great. Thank you so much for coming on the show. I thought this was really, really fascinating. I love these deep dives through the history of a company. Like I was saying at the top of the show, this is the way I learn, and I think the way a lot of people learn, from people who've actually done these types of things. I love the practicality of what you're doing as well, and how you solve these different problems. You're taking a real first-principles approach, rather than just falling in love with something: we're not just going to adopt Kafka because it's a newfangled, hot technology. We're going to look at this problem and what makes sense for the shape of the problem within our company. Should we use something like that? Or should we use something else? Should we build something, buy something, use an open source project, and so forth? 
Or even adopt microservices, because that's the cool thing that we're hearing about on Stack Overflow, or Reddit, or something like that. And I think that strategy is one that has served you well, and I think it would serve others. [0:47:49] JS: Yes. I mean, there's always this discussion about whether you're going to build it yourself, or use something that already exists in the world, or pay somebody else to do it. It's a thing that you have to decide for yourself. We have stuff that we built ourselves because we couldn't find an alternative that was good enough. But usually, most of the stuff is just: we take existing technology and find a way to fit it so it works well together with the other parts of the system. And who knows, maybe in a couple of years, I'm going to say that we are now too big and we have to move to microservices, because that's something that works well for teams that are larger than we are. [0:48:30] SF: Again, I want to thank you so much, and hopefully we meet in person next year in Croatia. Cheers. [0:48:36] JS: Yes. Thank you so much for having me. This was really an interesting talk, and hopefully we will meet in person. [END]