EPISODE 1724

[INTRO]

[0:00:00] ANNOUNCER: Tanzu GemFire is a distributed in-memory key-value store that performs read and write operations at fast speeds. It offers highly available parallel message queues, continuous availability, and a scalable event-driven architecture. It was developed to have sub-millisecond response times and accordingly found early application in automated trading environments on Wall Street.

Ivan Novick is the product manager for GemFire at the Tanzu Division of Broadcom. He joins the show to talk about Tanzu GemFire and its applications.

This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization. His best-selling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of his podcast, Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com and see all his content at leeatchison.com.

[EPISODE]

[0:01:18] LA: Ivan, welcome to Software Engineering Daily.

[0:01:20] IN: Hi, Lee.

[0:01:21] LA: So, tell me about GemFire.

[0:01:24] IN: Right. I mean, GemFire has been around for almost 20 years now, and it came up and grew up in the Wall Street setting. Wall Street traders were looking to build software that could interact, that could do trading, and could propagate real-time data among the trading floor and among the different services they had in real-time. Every millisecond counted. It came up through there. It's really a Java-based system and Java was really extremely popular at that point. I think Java still is one of the best languages out there. So, it was really built as a Java-based data store for when every millisecond counts and when the accuracy of the data counts. That's kind of where it was born out of.

[0:02:10] LA: Got it. Got it. So, it was formed as an enterprise-grade application, not, it didn't grow up to the enterprise grade. It was designed from the start as enterprise-grade.

[0:02:19] IN: Yes. It was designed at the start in Wall Street with trading firms, looking to have sub-second, sub-millisecond responses for data and to programmatically do compute of the data and to use that to process the data and to feed it back into their automated trading and their trading desks. It was really, it's like an in-memory supercomputing grid.

[0:02:44] LA: Makes sense, yes. So, let's look at some comparisons here. As I did a little bit of study about GemFire, I've never used GemFire personally. But as I did some research about it, the best analogy, the best comparison I could find is Enterprise Redis. I'm a little bit more familiar with that. I actually wrote a book on Enterprise Redis. So, I've got some knowledge in that space, and it seems like it solves many of the same problems that Enterprise Redis does, but in a different way. So, I'm wondering if you could do a comparison between the two. What makes GemFire unique from Redis Enterprise?

[0:03:21] IN: Yes. They definitely have a different personality. You're right, they definitely do have an overlap, but a distinct personality. In fact, the funny thing is Redis was also born and brought up in the same company as GemFire. It was created back at Pivotal when GemFire was also at Pivotal. So, this Pivotal, VMware, Broadcom network originated both products.

But in any case, I think of Redis truly as a key-value store. As frequently, I think of it as a cache, right? That when you want to cache something and you're going to have the permanent data in the database, then your cache is Redis. Redis has more of a clear boundary between client and server in the sense that, okay, you've got this client, I do the key, I do the value. It's not really a compute engine as well. It's really just keys and values, right?

You tell me, Redis came up more on the application caching side. Is that -

[0:04:19] LA: That's correct. Yes. That's why my comment about starting up in the enterprise level for GemFire, because that is one of the differences. Redis started more at the lower level, hobbyist level, as a better cache. It turned into a data store and really, truly today is an enterprise-grade data store. But it did start on the cache and more at a hobby level or a small application level.

[0:04:43] IN: Like a web development.

[0:04:44] LA: And web development, yes.

[0:04:46] IN: A lot of web developers, and it became just, I mean, we also, in our company, and actually myself, I'm the manager of six DB products, and for credit, Charlie Black, in my team, is the lead on GemFire. But we also manage a Redis distribution, and Redis is super popular, and it's picked up very frequently in application development and web development.

GemFire, I think of it more as your infrastructure for software, like software development. So, let's say you're building a ticketing system, and you want to sell millions and billions of tickets online. It's more than just caching, right? You want real-time data store, but it can be the source of truth. It could be having events and passing data from place A to B. It's that data infrastructure to build something, to build software that's, like you said, an enterprise-grade software application, like selling tickets for a railway or for an Amtrak or something like that, right?

We do have people who use GemFire as a cache. You can do that, right? You can configure GemFire to be a web cache or session cache, and people do that. But that's just scratching the surface of what the capability of GemFire can do.

[0:05:59] LA: No, that makes sense. I think you see Enterprise Redis going into that space as well, but you were natively in that space from the beginning. It also sounds like you have a compute component as well.

[0:06:09] IN: That's right.

[0:06:10] LA: You want to talk about that a little bit more?

[0:06:11] IN: Yes. Well, part of this original concept of the Wall Street trading systems was involving compute and computing risk. So, because GemFire really is a Java-based system, what that means is that it easily integrates writing Java code with the data. So, you can process the data and have a large-scale enterprise Java application. Within your code, you can indicate, okay, how the data is being processed and how it's working on the server side and have it in real-time, for example, when something's updated, have it, "Okay, the data was updated, so let's do this compute on the server side right after that," like on an event-based system.

So, it's not just get and put. It can have this logic built into it for processing the data. That's fundamentally a huge difference with Redis. I mean, I think one of the other things people straight away call out as a leading feature for GemFire is the active-active multi-site replication. So, we have a pretty robust inter-data center replication called WAN Replication. So, if you have New York, London, and Tokyo, and you want to have the data being updated in real-time in all sites, we support active-active where all the sites are synchronizing the data. I think that's a pretty advanced feature relative to all the other in-memory grids or products in this sector.
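To make the "compute right after the update" idea concrete, here is a minimal Python sketch of an in-memory key-value region that runs registered callbacks whenever a key is written. It is only an illustration of the event-driven pattern described above, not GemFire's API: the `Region` class, the listener signature, and the trade and exposure data are all hypothetical names invented for this example.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

# Hypothetical listener signature: (key, old_value, new_value) -> None
Listener = Callable[[str, Any, Any], None]


class Region:
    """Toy in-memory key-value region that runs callbacks after each update."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._listeners: List[Listener] = []

    def add_listener(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def put(self, key: str, value: Any) -> None:
        old = self._data.get(key)
        self._data[key] = value
        # The compute runs next to the data, right after the update lands.
        for listener in self._listeners:
            listener(key, old, value)

    def get(self, key: str) -> Any:
        return self._data.get(key)


# Example: keep a running exposure total per trader as trades arrive.
trades = Region()
exposure: Dict[str, float] = defaultdict(float)


def update_exposure(key: str, old: Any, new: Any) -> None:
    # Back out the previous version of the trade (if any), then add the new one.
    if old is not None:
        exposure[old["trader"]] -= old["notional"]
    exposure[new["trader"]] += new["notional"]


trades.add_listener(update_exposure)
trades.put("t1", {"trader": "alice", "notional": 100.0})
trades.put("t2", {"trader": "alice", "notional": 50.0})
trades.put("t1", {"trader": "alice", "notional": 20.0})  # amend trade t1
```

The point of the sketch is that the derived value (`exposure`) is maintained by the store's own update path, so callers never issue a separate "recompute" request after a `put`.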

[0:07:44] LA: Makes sense, yes. So, as far as not features, but as far as offering to the customer, the real value is performance, scalability, and availability. Would that be the good trio of things to say that you're focused on?

[0:08:01] IN: I would say yes, but I would say we think of customer first. So, the first thing we think of, okay, so let's say you have, like we said, an airline or a bank. What are the business problems that need to be solved from a software point of view in regards to data? How could GemFire be at the core of the processing of that? We're not just thinking about this as an independent, let's say, database and going into a customer and saying, "Hey, you have MySQL. Let's replace it with GemFire." We're saying, "What are the business processing problems that you're trying to solve that are in real-time complex business processing? And could GemFire be at the heart of that?"

As a grid, where data is synchronized and always up to date and being updated millions of times a second. But then you can then build business processes around it. So, it's not really just the data. It's also the application that you're going to build. The enterprise software you're going to build around it.

[0:08:59] LA: From your standpoint, what are the benefits of having the compute built into the data store as opposed to having a separate compute component and keeping the compute and data separate?

[0:09:12] IN: It allows you to solve complex real-time problems. So, when the data sets are large and when the SLAs or the timelines are fast, and when the high availability is required, having a two-tier layer, there's a lot of data synchronization problems. So, imagine, for example, you just had data in an Oracle database and you have an application. But the data may be in multiple sites. It may be bigger than what fits in one Oracle database. It may be updating faster than what you can do in an Oracle database. So, just having everything in the grid allows us to just build things that couldn't be built with a two-tier architecture like that. That's where GemFire gets picked up, really.

It's not, they don't come and say, "Hey, I need a cache. Should I pick Redis or GemFire?" They come from a different perspective, which is, I want to build world-leading competitive software infrastructure for my company. And we want to solve, we want to have real-time analytics and real-time business processing with the enterprise availability. Then, what type of software infrastructure can support that.

Otherwise, you're sending messages from here to there. There's latencies, multiple components. It reminds me almost in a certain sense of, I don't want to say it, but of a mainframe, but in a sense that it's a distributed system where you're having that ultimate speed and reliability.

[0:10:37] LA: So, let's go a little bit deeper. Let's come up with a specific business use case. You mentioned ticketing, for train ticketing. That's a fine example if you want to. Tell me some examples of some of the business use cases associated with, let's say, train ticketing, that this architecture specifically helps with.

[0:10:55] IN: Right. So, let's talk about airline ticketing, right? Imagine you're an airline and you need to keep the state online as for all the tickets, who's checked in, who's not checked in, the seats, the customer loyalty program. Then, that's interfacing with the different websites and the different handheld devices and all the different ways that you could be updating the status. There's just a lot of different nitty-gritty, critical changes and applications that all need to know the common operational picture or the common state of what exactly is happening.

Again, like, what are the flights? What are the planes? Where is every plane? Where are the seats? Who has the tickets? All of these different details need to be tracked. Then, all these different websites and systems need to then poke at that, and either modify an element like when someone checks in, okay, you have to mark it as checked in. The volumes are very high. The latency requirement is very fast. So, having it just in one database is kind of like too insufficient, right? It should be this more of a massive data grid that's interactive to support something like that.

[0:12:10] LA: Okay. So the specific capabilities you need from the data store to accomplish what you're talking about: you need massive scalability because the number of transactions is huge. You also need consistency.

[0:12:23] IN: Right. Millions of updates.

[0:12:27] LA: Consistency. High consistency between -

[0:12:29] IN: Consistency is a must.

[0:12:30] LA: Yes. And you need to have high availability.

[0:12:34] IN: High availability is a must.

[0:12:35] LA: Yes. So, you do an active-active system. You have multiple writers, multiple readers in a network with multiple nodes, and they self-back up for each other. Now, what I mean by self-back up, they act as backups for each other. Now, they're all in memory, right?

[0:12:52] IN: It's all in-memory.

[0:12:54] LA: Do you do any persistence to disk at all? Or do you allow that the in-memory with the massive number of nodes is going to be as persistent?

[0:13:01] IN: We rely on the in-memory with the massive number of nodes for the availability, but we also provide for asynchronous writing to disk so that, for example, if you're doing maintenance, you can do a clean shutdown and a clean restart and kind of try and more rapidly bring the system up to speed. But primarily, GemFire is an in-memory data grid.

[0:13:23] LA: Got it. But that offline write, it actually is an offline write. That doesn't have to be low latency. It can be -

[0:13:30] IN: Right. That doesn't have to finish update. Because what has to happen is it has to be updated in multiple nodes, potentially in different availability zones, depending on how things are configured. But it doesn't have to write to disk.
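The asynchronous disk write discussed here is commonly called a write-behind pattern: the in-memory update completes immediately and the disk write happens later on a background worker. The Python sketch below illustrates that general pattern only. It is not how GemFire implements persistence; the `WriteBehindStore` class, the JSON-lines log format, and the `recover` helper are assumptions made for the example, and replication to other nodes is left out entirely.

```python
import json
import os
import queue
import tempfile
import threading


class WriteBehindStore:
    """Toy store: puts update memory at once, disk writes trail behind."""

    def __init__(self, path):
        self._data = {}
        self._path = path
        self._pending = queue.Queue()
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def put(self, key, value):
        self._data[key] = value          # in-memory update completes first
        self._pending.put((key, value))  # disk write is queued, not awaited

    def get(self, key):
        return self._data.get(key)

    def _drain(self):
        # Background worker: append each queued update to a log file.
        with open(self._path, "a", encoding="utf-8") as log:
            while True:
                item = self._pending.get()
                if item is None:  # sentinel pushed by close()
                    break
                log.write(json.dumps(item) + "\n")
                log.flush()

    def close(self):
        # Clean shutdown: let the worker finish everything already queued.
        self._pending.put(None)
        self._writer.join()

    @staticmethod
    def recover(path):
        # Clean restart: replay the log, later entries overwrite earlier ones.
        data = {}
        with open(path, encoding="utf-8") as log:
            for line in log:
                key, value = json.loads(line)
                data[key] = value
        return data


log_path = os.path.join(tempfile.mkdtemp(), "region.log")
store = WriteBehindStore(log_path)
store.put("seat:12A", "checked-in")
store.put("seat:12A", "boarded")
store.close()
recovered = WriteBehindStore.recover(log_path)
```

The trade-off this shows is the one raised in the conversation: callers never wait on the disk, so durability on disk lags memory slightly, which is acceptable when availability comes from other in-memory copies.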

[0:13:45] LA: So, one of the challenges with an active-active system is doing things like transactions and overlapping transactions that collide and things like that and dealing with all that appropriately. What strategy do you use in those situations?

[0:13:58] IN: Right. So, conflict resolution. Now, some of the aspects that make GemFire able to have all these superpowers is because we are keys and values and not a SQL database. So, in a SQL database, things are much more complicated. By being keys and values, and the keys and values, again, this is Java-based. So, the keys and values could be Java classes, which could be complex. But everything is overridable. When you do conflict resolution, what it means is basically the same key was updated multiple times in places or in multiple locations. So, there are default conflict resolution algorithms that are published in academia, which we implement, and they're essentially based on the last time and the last update.

But because everything is overridable and programmable, some customers will implement their own business logic in the conflict resolution, which could consider other factors. So, they may have a business logic that said, some custom logic, like if New York was down for this long and I got an update, anything can be implemented in the overridden conflict resolution for a specific type of a key. But we do have default conflict resolutions, which work out of the box. Essentially, the last update will win for a given key.
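A small Python sketch can show the shape of this: a default "last update wins" rule, plus the ability to plug in custom business logic for a specific type of key. Everything here is hypothetical and for illustration only (the `Update` record, the `ReplicatedRegion` class, the key-prefix convention, and the seat-count rule); it does not reproduce GemFire's actual resolver interface, and it assumes timestamps are comparable across sites.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Update:
    key: str
    value: Any
    timestamp: float  # assumed comparable across sites
    site: str


Resolver = Callable[[Update, Update], Update]


def last_update_wins(current: Update, incoming: Update) -> Update:
    # Default rule: newest timestamp wins; ties are broken by site name
    # so that every site independently picks the same winner.
    return max(current, incoming, key=lambda u: (u.timestamp, u.site))


class ReplicatedRegion:
    """Toy region that resolves conflicting updates for the same key."""

    def __init__(self) -> None:
        self._entries: Dict[str, Update] = {}
        self._resolvers: Dict[str, Resolver] = {}

    def set_resolver(self, key_prefix: str, resolver: Resolver) -> None:
        # Override the default rule for keys such as "seats:...".
        self._resolvers[key_prefix] = resolver

    def apply(self, incoming: Update) -> None:
        current = self._entries.get(incoming.key)
        if current is None:
            self._entries[incoming.key] = incoming
            return
        prefix = incoming.key.split(":", 1)[0]
        resolver = self._resolvers.get(prefix, last_update_wins)
        self._entries[incoming.key] = resolver(current, incoming)

    def get(self, key: str) -> Any:
        entry = self._entries.get(key)
        return None if entry is None else entry.value


def keep_lowest_count(current: Update, incoming: Update) -> Update:
    # Hypothetical business rule: never let a conflict raise the seat count.
    return current if current.value <= incoming.value else incoming


region = ReplicatedRegion()
region.set_resolver("seats", keep_lowest_count)
region.apply(Update("price:NYC-BOS", 120, timestamp=10.0, site="new-york"))
region.apply(Update("price:NYC-BOS", 115, timestamp=9.0, site="london"))  # older, loses
region.apply(Update("seats:train-42", 7, timestamp=10.0, site="new-york"))
region.apply(Update("seats:train-42", 9, timestamp=11.0, site="tokyo"))   # newer, custom rule keeps 7
```

Because the resolver is just a function of the two competing updates, the default and the custom rule can coexist in one region, chosen per key type.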

[0:15:21] LA: Got it. Got it. How large is a typical, and I know typical is a bad word, but how large is the typical deployment? How many nodes are we talking about in a system? Are we talking about a dozen? Are we talking about hundreds?

[0:15:33] IN: So, I'll give you the ranges. The capability could be useful on a small scale. We've got folks who use GemFire on what I'd call the edge, where they have, let's say, a processing facility, let's say an industrial use case. Maybe they have two or four hosts on the edge that represent a high availability grid and they're doing some sort of custom processing of business, or of industrial work at the edge and they've got lots of different small mini clusters with two or four nodes. It is a clustered system, so there's not a whole lot of sense in doing it with one node. There should be some replicas and some parallelism in the processing, so you could be like an at-the-edge solution with two to four. But if you're running a global business that's doing millions and billions of transactions per second, and you're using GemFire as that single source of truth that has the information about all the transactions, you may have 200 servers. A server, these days, is like 2 terabytes of RAM with maybe 20 or 30 cores. So, you could have hundreds of those, let's say 200.

But in practicality, both GemFire and any other in-memory grid doesn't usually go into thousands. It's not like Hadoop or something like that. It's usually -

[0:16:52] LA: Tens or hundreds, but not thousands.

[0:16:54] IN: Yes. The average is more like 20, 30, 40 servers. There's a lot of use cases where if it was to work well on a single node, you would use a database like Oracle. But now, what could the power of 10 or 20 or 30 hosts do that 1 host couldn't do? Then a technology like GemFire becomes really interesting.

So, there's a lot of business problems where the power of 10, 20, 30, 40 servers is really interesting compared to the power of a non-distributed system. But the number of people who have real-world use cases where they need hundreds or thousands of servers working together are few and far between. So, it's really, I would say, like, 50 to 100 servers would be a sweet spot.

[0:17:38] LA: So, describe some of your customers. We've talked about, obviously, airline ticketing is a good example.

[0:17:45] IN: Airline is interesting. I think the Wall Street Trading Desks is a prototypical case where you've got market data updating in real-time, trade data, risk data, and it all forms into a trading hub. That's a classic GemFire case. But then also, we've got a lot of customers in payment processing where the history of payments is stored online in the GemFire grid, and that's used for things like fraud analysis. So, if you're doing payments, various forms of payment companies, they're always on the lookout for fraud.

[0:18:19] LA: So, credit card companies, major banks, distributions, money transfer agents.

[0:18:27] IN: Credit card companies, major banks, money transfer, right? So, the idea is, in real-time, can we update a data grid that has every payment? Within the second, can we be updating that grid as the payments are going, so that if two seconds later or one second later, we want to evaluate the fraud, that not only can we capture the history or the previous payments and do some statistical analysis, but I want to also have that include payments that were done one second ago. It could be correlated with payments all over the world.

So, the idea is to create this global grid of real-time data to do fraud detection online that's always up to speed, always up to date, with the compute ability to do things like fraud analysis at online speed. Payment processing is a really interesting case for GemFire, and it can also be part of the actual transfers because of its consistency. It is used in various ways by firms as part of the actual money transfer software development.
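As a rough illustration of scoring a payment against a history that includes payments from seconds ago, here is a minimal Python sketch of a per-card sliding window. The `PaymentHistory` class, the one-hour default window, and the "five times the recent average" rule are all made-up assumptions for the example; real fraud models are far more involved, and nothing here reflects a specific GemFire feature.

```python
from collections import defaultdict, deque
from statistics import mean


class PaymentHistory:
    """Toy per-card sliding window of recent payments held in memory."""

    def __init__(self, window_seconds: float = 3600.0) -> None:
        self._window = window_seconds
        self._by_card = defaultdict(deque)  # card -> deque of (timestamp, amount)

    def record(self, card: str, timestamp: float, amount: float) -> None:
        history = self._by_card[card]
        history.append((timestamp, amount))
        # Evict payments that have fallen out of the time window.
        while history and history[0][0] < timestamp - self._window:
            history.popleft()

    def looks_suspicious(self, card: str, amount: float,
                         factor: float = 5.0, min_history: int = 3) -> bool:
        # Hypothetical rule: flag a payment far above the card's recent average.
        amounts = [a for _, a in self._by_card.get(card, ())]
        if len(amounts) < min_history:
            return False  # not enough history to judge
        return amount > factor * mean(amounts)


history = PaymentHistory(window_seconds=60.0)
for t, amount in [(0.0, 20.0), (10.0, 25.0), (20.0, 15.0)]:
    history.record("card-1", t, amount)

normal = history.looks_suspicious("card-1", 30.0)    # close to the recent average
flagged = history.looks_suspicious("card-1", 500.0)  # far above the recent average
```

Since the window is updated on every `record` call, a check made one second after a payment already sees that payment, which is the property the fraud use case depends on.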

Let's say you work at a - if you're actually implementing payment, money transfer, again, that becomes a software project, right? Then, the software project needs a data state. Once it reaches these kinds of requirements of scale and reliability, then having a data hub, so to speak, which, from the point of view of the developer of the software, is always on, always working, always has the freshest sub-second data, makes the development of these enterprise-type workloads possible.

[0:20:05] LA: Yes. It makes sense. Makes sense. So, it sounds like the capabilities you need besides just basic key-value store, it sounds like you really do need a good searching capability as well, and that kind of gets into the vector analysis. I know it's a vector-based database where you have vector capabilities. First of all, is that a correct statement?

[0:20:25] IN: That's right.

[0:20:27] LA: But before we get into the vector, specifically, let's educate our audience on what a vector database is. I think they're still relatively new and not everyone understands.

[0:20:36] IN: So, when I think about a vector database, what we're doing is we're really talking about text data or image data or video data. We're not talking about storing people's credit card number or bank account number. We're talking about document data, Twitter data, text data, image data, what we call unstructured data. With the power of AI today, AI models can convert images and text into what's called a vector embedding, which essentially means that you've got a mathematical graph in multiple dimensions and we're converting and we're embedding the meaning of what's in that picture or what's in that video into a mathematical graph space. Then, what results from that is a whole bunch of numbers which represent the location in that multidimensional graph where this image sits. The interesting property of this is that the closer different points are in the graph to each other, the more similar those objects actually are in real life because the AI that created this has been trained to do that.

So, for example, if you take pictures of animals and then you convert them into points on a multidimensional graph, all the cats are going to be close to each other on that graph, and all the elephants are going to be close to each other, but they're not going to be next to the cats, and the elephants will be separated. We're converting real-world objects into numbers on a graph that tell you how similar things are. That enables use cases like search, where you can search based on -

[0:22:22] LA: How close two vectors are to each other.

[0:22:25] IN: Yes. Another amazing property that we found out about these vectors is that you could take someone's words, convert it, and embed it into this vector space, and it will be the same vector space as was used to catalog all the images. So, if I say I'm looking for a yellow boat with a red flag, that can be converted into a point on the graph, and the pictures that I have could also be points on a graph. Now, I can find the closest point for my words to that picture. It essentially allows you to find what the meaning is of the words and find the picture that matches it, which is, I would dare say, it wasn't possible before this AI revolution.
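The "closest point on the graph" idea can be sketched in a few lines of Python. The vectors below are tiny, hand-written stand-ins, not real embeddings (a real system would get them from an embedding model and would typically use a vector index rather than the brute-force scan shown here). The function names and the catalog entries are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two vectors by angle: values near 1.0 mean 'very alike'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, catalog, k=1):
    """Brute-force search: rank every stored vector by similarity to the query."""
    ranked = sorted(
        catalog,
        key=lambda name: cosine_similarity(query, catalog[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions.
catalog = {
    "cat_photo_1": [0.9, 0.1, 0.0],
    "cat_photo_2": [0.8, 0.2, 0.1],
    "elephant_photo": [0.1, 0.9, 0.2],
    "yellow_boat_photo": [0.0, 0.2, 0.9],
}

# Stand-in for the embedded text "a yellow boat with a red flag".
query = [0.1, 0.1, 0.95]
best = nearest(query, catalog, k=1)

# Searching with a cat picture's vector brings back the cats first.
similar_to_cat = nearest(catalog["cat_photo_1"], catalog, k=2)
```

Text and images can be compared this way only because both were mapped into the same vector space, which is the property described in the conversation.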

[0:23:15] LA: You can imagine lots of use cases for that. In fact, we're using it in many use cases all over the place now, but thinking specifically about GemFire, GemFire would find that useful for some of the fraud use cases you were talking about, but there's probably other cases as well. Do you want to talk about some of those cases?

[0:23:32] IN: So, in the database industry, I would say now vector features have become a must-have feature in almost every database. For most of the cases, what you're doing is you're taking all the current databases in the world. And what we're adding is a new data type, which is a vector, which is this multidimensional set of numbers. Then, we're also creating a new type of an index. An index is something that helps you to find what you're looking for.

So, in the same sense, whether it be GemFire, MySQL, Oracle, SQL Server, every database in the world pretty much has added a vector data type and a vector index that allows you to find similar vectors and to store vectors. Why would it be interesting to have the ability to search these things in a system which supports a distributed global grid and can respond in sub-millisecond time?

Well, I think I can imagine some cases like, let's say I'm doing homeland security for an airport and I'm taking image and video feed in real-time and then searching for matching images, like very trivial example. Imagine in the old days you have those pictures of the 10 most wanted up on the sheriff's wall, right? Just very trivially, if I wanted to do that in real-time as millions of people are coming through LAX airport, that's something you'd want to do with a system that had that kind of concurrency and that kind of availability, and maybe it's being updated in real-time by other sources, and maybe it's correlating other data sets and doing analytical computations of other types too.

Again, once it becomes a non-trivial use case and kind of an industrial scale use case with sub-millisecond time, then a platform like GemFire allows you to build amazing things.

[0:25:27] LA: Yes, the FBI 10 Most Wanted list turns into the 1 million most wanted this millisecond and dynamically changing and updating.

[0:25:39] IN: So, something where the results need to happen right away, and where it needs to be available all the time.

[0:25:47] LA: Is that the killer use case for GemFire or is there something else? Is that really where you see your greatest value, where you're having all the data in one location, business logic going on there, and then being able to do these correlations to do incidental things? Not incidental, but important, but side things like security analysis?

[0:26:11] IN: I think that's a differentiator back to the original question about Redis, right? Where Redis really is that key-value look up, that cache, right? I think this is the killer use case for GemFire, that you're doing data processing, you're doing analytics, you're building out something in the grid of your application. So, complex analytical systems in real-time, I do think that allows GemFire to shine. But I would say, for me, I would just look at, if I wanted to find a customer for GemFire, I would say, who are people in the world that have the most demanding IT workload demands, or use cases where they can't use off-the-shelf software. They need to write and build something unique and that it's going to have data and it's going to be operating and processing in real-time. All major corporations and entities doing interesting things are going to be good candidates for GemFire.

[0:27:10] LA: Right. Okay, that makes sense. So, GemFire started about 20 years ago, you said. I'm assuming, a lot of these use cases are very modern use cases and are probably just becoming a bigger deal in the last several years. Has your growth really shown that that's where the value is, is in those things? In other words, have you been growing a lot in the last few years? Where do you see this going?

[0:27:33] IN: We are growing a lot, and I see financial services and government are two sectors that are growing a lot. The financial services continue to have use cases where information is being passed and stored in real-time and the calculations are complex. The risk calculations, the always updating event-driven systems. If you're a famous bank, I don't want to mention names, but if you're a famous bank or famous investment bank or someone like that and you're processing billions and trillions of dollars, you're not going to be able to go and buy off-the-shelf software, you're going to be writing your own software. The same for government agencies doing unique stuff.

So, with GemFire, that's one of the key points is, it's really for software developers, right? It's an infrastructure for software developers to write applications that have data, kind of data in the cloud or data in the grid, right? People building custom software for advanced cases, it's not slowing down, right? The world keeps changing and the real-time nature of the world keeps accelerating, and the people who had a use case that maybe in the past, they had a two-tier architecture with an Oracle database. But because of the increasing demands of the workload and the lower latency requirements, they need to move to a new model. The potential market is growing and the actual business is growing. So, it's good. It's exciting. It's fun to work in IT, tech, software.

[0:29:12] LA: Definitely. Large systems like this have to be monitored. What type of monitoring do you have built in and how do you make sure these nodes stay up and operational?

[0:29:25] IN: First of all, the system does self-healing and is upgradable online. It's an always-on system, or as they used to call it in some other products, non-stop, right? It's a non-stop running system. That being said, from the perspective of performance, every node has a statistics collector, and there's thousands of different statistics that are published on a Java message queue, and that Java message queue can be exported into things like Prometheus and Grafana, or into proprietary mechanisms at customers.

We also have a management console. The management console gives you off-the-shelf reporting and visibility to the health of the cluster. But a lot of the advanced customers will export the raw stats into their own systems and then can process that.

This is a Java-based system. So, one of the surprising or important things people need to think about is which JVM to go with. I don't want to mention any specific companies or whatnot, but there are people who have made - there are different JVMs available and different garbage collectors available, and when you get into these real-time use cases the selection of the Java infrastructure actually makes a massive difference. Then, there are metrics emitted for that as well. How often is the garbage collector kicking in? There's a lot of detailed metrics, so we publish that all out to an exportable message queue and provide off-the-shelf analysis, but also let people dig in that data and kind of do their own thing.

[0:31:02] LA: Got it. Makes sense. So, Tanzu is more than just GemFire. You have other data services as well. Do you want to talk about any of those?

[0:31:11] IN: In my team, we have six database engines that we sell. We sell GemFire, which we talked about. We sell Greenplum, which is a big data, more of a disk-based analytics platform. But it's also kind of like a scale-out grid. Then, we also are the creators of RabbitMQ, which is a messaging platform. RabbitMQ is open source, but we lead the RabbitMQ project. So, that's, say, Pub/Sub and streaming data, very popular messaging software.

These three products are designed and built at Tanzu. Then, we also support open-source, Postgres, MySQL, and even Redis. We're now [inaudible 0:31:57] in our team, in our company. We sell support for that. But then Tanzu itself has a Tanzu platform, which is an application development platform that allows for very efficient scaling of app development. So, you could have thousands of developers all working off of this Tanzu platform infrastructure, publishing new updates, and multiple times per day, and patching all the applications and everything you need to build some software factory with thousands of applications running on this platform. We have a successful application development platform, and then we have this portfolio of database engines that help accelerate the use of data in your applications and in your use cases.

[0:32:43] LA: Are they all geared towards the same customer set, or are there some distinctions that separate them?

[0:32:50] IN: They're really all geared to enterprise, and they're really all geared to people who are building custom software and building unique use cases, right? As opposed to an off-the-shelf software for like, let's say, SAP for supply chain management, right? We're giving the tools, the infrastructure, for people to build their own thing, their own software, their own systems. We're basically selling to, let's say, 5,000 largest enterprises and saying, "Let us be your tool sets, your infrastructure for enterprise to build advanced software and advanced applications and to maintain that."

[0:33:31] LA: Thank you very much. So, Ivan Novick is the product manager for Tanzu GemFire, a high-performing in-memory enterprise-grade database at VMware/Broadcom. Some other time we're going to have to talk about how that merger went and how everything's going, but it's a good joint now. Anyway, he's been my guest today, and Ivan, thank you for joining me today on Software Engineering Daily.

[0:33:53] IN: Thank you, Lee. That was really fun and really enjoyed the conversation. Thank you very much.

[END]