rss-bridge 2023-11-30T00:57:00+00:00

SE Radio 592: Jaxon Repp on Distributed Data Infrastructure

Jaxon Repp of HarperDB speaks with Brijesh Ammanath about distributed data infrastructure, including what it is and why it's important. They discuss the key factors that make distributed data infrastructure attractive, as well as challenges to implementing it. The episode explores the architecture and design principles, the key security considerations, and the transition factors for distributed data Infrastructure. Brought to you by IEEE Computer Society and IEEE Software.

Jaxon Repp of HarperDB speaks with Brijesh Ammanath about distributed data infrastructure, including what it is and why it’s important. They discuss the key factors that make distributed data infrastructure attractive, as well as challenges to implementing it. The episode explores the architecture and design principles, the key security considerations, and the transition factors for distributed data Infrastructure. Brought to you by IEEE Computer Society and IEEE Software.

Show Notes

Jaxon Repp bio

Patent 10956448

Related Episodes

SE Radio 335 – Maria Gorlatova on Edge Computing

SE Radio 353 – Max Neunhoffer on Multi-Model Databases and Arangodb/

SE Radio 199 – Michael Stonebraker

SE Radio 510 – Deepthi Sigireddi on How Vitess Scales mysql/

SE Radio 102 – Relational Databases/

Transcript

Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.

Brijesh Ammanath 00:00:18 Welcome to Software Engineering Radio. I’m your host, Brijesh Ammanath. Our guest today is Jackson Repp. Jackson is the CTO and head of marketing for HarperDB. He has over 25 years of experience architecting, designing, and developing enterprise software. He’s the founder of three technology startups and has consulted with multiple Fortune 500 companies on IoT. That’s Internet of Things and Digital Transformation Initiatives. Jackson, welcome.

Jaxon Repp 00:00:43 Thank you very much for having me.

Brijesh Ammanath 00:00:45 In today’s session, we will talk about Distributed Data Infrastructure, understand what it is, benefits of using it, the challenges which are unique to it. We will deep dive into the architecture and design patterns of distributed data infrastructure and also delve into security concentrations. Finally, we’ll look at what one should consider while transitioning into a distributed data infrastructure setup. So let’s start with the fundamentals. Jackson, can you help explain to our listeners the fundamental concept of distributed data infrastructure and why it’s important?

Jaxon Repp 00:01:17 Absolutely. There are a few reasons why one would want to set up one’s data infrastructure in a way that data does not live in a single location. The old adage was that I want to be more resilient. A data center might go down. So if it does, I don’t want my entire application to fail. I would like to be able to route to a different place where the data is able to be served and my application can continue to function. That is what we have sort of dealt with for years and years. And there’s lots of rules and tools around setting up a distributed data infrastructure, mostly for resilience. And that includes things like disaster recovery, a site that’s very far away that eventually gets that data. I may not access it, so it doesn’t need to be very particularly fast, but if all of my main office buildings burned down, I still need that data.

Jaxon Repp 00:02:06 So I needed to be and exist somewhere. More recently, it has been a challenge of scale. I have so many users that are trying to interact with my data, that having a single database answer those questions or accept those inserts is not sufficient. Or if it is, I’m spending a lot of money to put a terabyte of RAM and multiple cores in a machine so that it can perform at a level that is satisfactory for my users when they’re trying to interact with that data. If I am in many locations, obviously I divide up that user base and the resource requirements for each node of my data infrastructure are lower. It introduces other challenges though, however, the most recent reason, and this is I don’t think it’s entirely pandemic driven, but before the pandemic sub 50 millisecond response times for APIs were not something everybody was demanding.

Jaxon Repp 00:03:06 We rarely heard about it. High performance applications, absolutely. But for most people it was, get it to me within half a second, 500 milliseconds, and that’s fine. And when everybody started working from home, everybody got real impatient for apps that were slow. And the newest reason we have distributed data infrastructure and the one that sort of drives my company and companies like us, is the idea of extremely low latency. So not only am I distributing the load across multiple data ingest points or access points for queries, and not only am I resilient, and not only am I protected in case of a disaster but also, I am closer to the user. So I am lowering the latency for that response or for that insert. So it’s really a combination of latency, cost, resiliency, and ultimately the whole balance of that is complexity, because obviously we keep adding moving parts to a system, it gets more and more complex.

Brijesh Ammanath 00:04:06 Yeah, all of that makes sense. So resiliency, scale, low latency, all of that is very critical and important for any business that’s out there on the net right now. But circling back to the fundamental concept, what does data, when we say it’s a distributed data infrastructure, what does that mean?

Jaxon Repp 00:04:24 It means that I’m going to store data in more than one place. The idea that I may divide my data into multiple servers, I may divide my data across multiple regions. I may replicate my data in whole or in part across multiple regions. So traditional sharding divides up very large data sets across multiple servers, keeps track of where each piece of data is and allows you to access it. So that’s the basic level of putting data in more than one place. When I talk about distributed data infrastructure, I mean your data is accessible in whole or in part in many different locations around, let’s say the globe, or it may be around your service area, but distributed data infrastructure are the servers and the connections and the software that allows you to access data as if it was sitting on your laptop as quickly as possible, even though it may be located in multiple places in the world.

Brijesh Ammanath 00:05:27 Understood. And why is it called data infrastructure? Is it same as distributed databases or when you say it’s distributed data infrastructure, are you taking a different view of what the stack is and does it include additional components and layers?

Jaxon Repp 00:05:42 Well, infrastructure requires obviously hardware, the servers to run it, the software that is going to store your data. And when you have distributed data, you need some way for those independent nodes, those independent servers to speak to each other. So there’s tooling around how those nodes communicate, how a transaction that happens on one server is replicated and reconciled on a different server where a similar transaction might have just happened to affect the exact same record. So the tooling and systems to make sure that everything in an end state is what should have happened based on the time at which it happened anywhere in the world is complex and requires a lot of individual parts. So I refer to infrastructure as a combination of software and hardware, and not just databases, but all the tooling around it that allows you to break them out and have them live more than one place.

[...]

Original source