rss-bridge 2024-11-06T18:00:00+00:00

SE Radio 641: Catherine Nelson on Machine Learning in Data Science

Catherine Nelson, author of the new O'Reilly book, Software Engineering for Data Scientists, discusses the collaboration between data scientists and software engineers -- an increasingly common pairing on machine learning and AI projects. Host Philip Winston speaks with Nelson about the role of a data scientist, the difference between running experiments in notebooks and building an automated pipeline for production, machine learning vs. AI, the typical pipeline steps for machine learning, and the role of software engineering in data science. Brought to you by IEEE Computer Society and IEEE Software magazine.

Show Notes

Software Engineering for Data Scientists (O’Reilly, 2024)

Building Machine Learning Pipelines (O’Reilly, 2020)

LinkedIn: CatherineNelson1

Related Episodes

Episode 315: Jeroen Janssens on Tools for Data Science

Episode 286: Katie Malone Intro to Machine Learning

Episode 594: Sean Moriarity on Deep Learning with Elixir and Axon

Episode 588: José Valim on Elixir, Machine Learning, and Livebook

Episode 450: Hadley Wickham on R and Tidyverse

Transcript

Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.

Philip Winston 00:00:35 Welcome to Software Engineering Radio. This is Philip Winston. My guest today is Catherine Nelson. Catherine is a freelance data scientist and the author of two O’Reilly books: this year’s Software Engineering for Data Scientists and her 2020 book, Building Machine Learning Pipelines, co-authored with Hannes Hapke. Previously, she was a principal data scientist at SAP Concur, and before that she had a career as a geophysicist. Catherine has a PhD in Geophysics from Durham University and a master’s of Earth Sciences from Oxford University. She is currently consulting for startups in the generative AI space. Welcome Catherine.

Catherine Nelson 00:01:16 Thanks Philip. It’s great to be on the podcast.

Philip Winston 00:01:19 Today we’re going to discuss the role of the data scientist and how this role can overlap with or intersect software engineering. Let’s start with what is a data scientist?

Catherine Nelson 00:01:31 That’s such a great question because what a data scientist is depends on where you work. At some companies it can be more in the data analytics space and at others it can mean that you’re spending all your time training machine learning models. But overall, I’d say being a data scientist involves translating business problems into data problems, solving them where possible, and then sometimes building machine learning powered features.

Philip Winston 00:01:57 So what skills does a data scientist need either prior to getting the role or what skills do they need to develop to be good at the role?

Catherine Nelson 00:02:05 They need to have skills for working with data. So those would include a knowledge of statistics, a knowledge of coding to be able to manipulate the data, take courses in basic machine learning, learn about the algorithms that make up machine learning, data visualization, sometimes storytelling with data, how to weave those data visualizations together into a coherent whole. A lot of data scientists will take courses on data ethics, data privacy, because sometimes that is part of the data scientist’s job as well. It’s a real mixed bag.

Philip Winston 00:02:43 It seems like data scientists need perhaps more domain knowledge or business knowledge than some engineering roles. Why do you think this is?

Catherine Nelson 00:02:55 I’d say that’s right. I think it’s because you are translating the problems from a business problem to a data problem. So you might be tasked to answer a problem such as why are our customers churning? Why do some customers leave the business? And you dig into the data to try and see what features of a company are correlated with them stopping using your product. So it might be something like the size of the business, or they might have left, having given you feedback that has some reasons for that. So you can’t really answer a problem like that without having a good sense of what the business does, what products there are, how things fit together. So yeah, I think it involves a lot more context.

Philip Winston 00:03:41 On a typical project, who does the data scientist have to communicate with?

Catherine Nelson 00:03:46 The interesting thing I’ve found with my data science career is I wouldn’t say I have a typical project. So I’ve done some projects where it’s been extremely exploratory. It’s been like, we might be considering creating a new feature for the product, is this even possible? It’s really blue sky. And then there’s other projects I’ve worked on where it’s been towards the production end of things, deploying new models into production. So I’m going to be working with different people depending on the type of project, but some commonalities would be a product organization and obviously engineers if it’s involving building features.

Philip Winston 00:04:32 For most of the episode we’re going to be talking about machine learning and AI, but as I understand it, there’s more to data science than just these two fields. Can you give me an example of a problem you solved or a solution you came up with in data science that didn’t involve ML or AI?

Catherine Nelson 00:04:51 Actually, the example that I just mentioned, looking at why customers might leave a business, involved no machine learning at all. It was a predictive modeling problem, but I didn’t use a machine learning solution. So the projects that are more around answering questions versus building features are the ones where there’s a lower level of machine learning or AI usage, and more statistics or data visualization or general data analysis skills.
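
As a rough illustration of the kind of question-answering analysis described here, the following minimal Python sketch compares churn rates across company-size segments using plain counting, with no machine learning model. The customer records, the segment names, and the function name `churn_rate_by_segment` are all hypothetical; they are assumptions made for the example and do not come from the episode.

```python
from collections import defaultdict

# Hypothetical customer records: (company_size_segment, churned).
# Invented for illustration only; not data from the episode.
customers = [
    ("small", True), ("small", True), ("small", False), ("small", True),
    ("medium", False), ("medium", True), ("medium", False), ("medium", False),
    ("large", False), ("large", False), ("large", False), ("large", True),
]


def churn_rate_by_segment(records):
    """Return the fraction of churned customers in each segment."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for segment, did_churn in records:
        totals[segment] += 1
        if did_churn:
            churned[segment] += 1
    return {segment: churned[segment] / totals[segment] for segment in totals}


rates = churn_rate_by_segment(customers)
for segment, rate in rates.items():
    print(f"{segment}: {rate:.0%} churned")
```

A gap between segments in a table like this only shows a correlation. Deciding whether company size actually explains the churn still takes the business context discussed earlier.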

Philip Winston 00:05:20 I want to mention two past episodes related to data science. There’s Episode 315, Jeroen Janssens on Tools for Data Science. That was in 2018. And Episode 286, Katie Malone, Intro to Machine Learning, 2017. Katie Malone is a data scientist. So now let’s move into talking about machine learning and AI. To start with, what is the difference between these two fields? In my research, I think I’ve seen the term AI has been evolving a lot, so I’m wondering what definitions you use.

Catherine Nelson 00:05:55 The most useful definition I’ve heard and the one that I’ve adopted and continue to use was from a podcast that I heard with Christopher Manning, who’s a professor at Stanford University in natural language processing. And that is that if you’re dealing with machine learning, then you are training a model for one particular problem, one particular use. But an AI model can answer many problems. So you might use your AI model to power a chatbot, but you could use the same model to summarize some text or extract some information from some text. Whereas in classical traditional machine learning, if you wanted to have a model that extracted some information from some text, you’d go and collect the dataset designed for exactly that problem. You’d take some of the input text and the output that you wanted it to produce, and then you’d train your model and measure how accurate it was on that particular problem.
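
The classical workflow described here (collect input text paired with the desired output, train a model for that one problem, then measure its accuracy) can be sketched in a few lines of Python. This is a toy under stated assumptions, not a method from the episode: the task (routing support messages to "billing" or "technical"), the example messages, and the word-count "model" are all invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical labelled dataset collected for exactly one task.
train_examples = [
    ("invoice charged twice this month", "billing"),
    ("refund my payment please", "billing"),
    ("update the card on my invoice", "billing"),
    ("app crashes on login", "technical"),
    ("error when uploading a file", "technical"),
    ("login page shows an error", "technical"),
]
held_out_examples = [
    ("payment failed for my invoice", "billing"),
    ("the app shows an error", "technical"),
    ("refund was charged to the wrong card", "billing"),
    ("file upload crashes", "technical"),
]


def train(examples):
    """Count how often each word appears under each label."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts


def predict(model, text):
    """Pick the label whose training words overlap most with the text."""
    words = text.lower().split()
    scores = {
        label: sum(word_counts[w] for w in words)
        for label, word_counts in model.items()
    }
    return max(scores, key=scores.get)


def accuracy(model, examples):
    """Fraction of examples where the predicted label matches the true one."""
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)


model = train(train_examples)
print(f"held-out accuracy: {accuracy(model, held_out_examples):.2f}")
```

The resulting model is only useful for this one routing task. Summarizing the same messages, for example, would mean collecting a new dataset and training a different model, which is the contrast with a general-purpose AI model drawn above.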

[...]