Agile Data Science | Thoughtworks
Brief summary
Learn how agile disciplines are applied to the complexities of data science to demonstrate incremental value within intelligent systems and solutions. Join Thoughtworks' CTO, Rebecca Parsons, and Principal Associate, Alexey Boas, as they interview agile data scientist David Johnston and Thoughtworks' head of data science, Ken Collier. The result is a better understanding of how agile practices are adapted to the uncertainty of machine learning, and how data scientists fit within cross-functional agile development teams.
Podcast Transcript
Rebecca Parsons:
Welcome to the Thoughtworks podcast. My name is Rebecca Parsons. I'm the Chief Technology Officer of Thoughtworks and one of your hosts for today.
Alexey Vilas Boas:
And I am Alexey Vilas Boas. I'm the Head of Technology for Brazil.
David Johnson:
Hi, my name is David Johnson. I'm a data scientist at Thoughtworks.
Ken Collier:
And I'm Ken Collier. I'm the Head of Data Science and Engineering at Thoughtworks.
Rebecca Parsons:
So let me start by tossing a question to you. Why is agile data science important? What is it about this combination of agile and data science that is interesting for us?
David Johnson:
I think the important part is that there is data science going on that's not being done very well or very effectively. It's not resulting in applications that are useful for clients. So we're starting to look at the delivery method and see what needs to change. The agile method of delivering software is certainly very effective for the kind of software we've been developing over the last 20 or 30 years. But this new kind of software is quite different, and the method needs to be modified to account for some of the differences in the applications that we're developing.
David Johnson:
Normally in a software app, the user is the center of the world, right? You have user stories and UAT and user feedback, because the applications are made for the user. You have a UI, for example in a web app, and the user is going to be using that to do things. Whereas the applications that we're developing now are more predictive, for example. So the user may not have a large role, right? It's the algorithm that is making the choices of what to do. So these are very different, and the agile delivery method needs some change to adapt to those kinds of things.
Rebecca Parsons:
So Ken, you've got a lot of experience in applying agile techniques in somewhat related fields of agile data warehousing, business intelligence. How would you characterize how things are different with data science from the agile perspective?
Ken Collier:
Sure. And in addition to data warehousing and business intelligence, I have a background in machine learning and data science as well. One of the challenges with conventional data science is that the data scientist often will do a lot of model tuning, model training, feature extraction, and feature engineering, which may take weeks to months in order to tune a model that has a high degree of accuracy. And that time spent doesn't always translate into business value or actionable value. So sometimes we see data scientists that will build a model in a laboratory environment and then that model has to get recoded in production code with testing and proper deployment techniques. And all of those things add friction to the delivery of these models into a production landscape. So I would say that data scientists are starting to really tweak the way they work. At least agile data scientists are starting to change the way they work so that they're much more synchronized and collaborative with developers and delivery teams.
Alexey Vilas Boas:
So Ken, does this mean that we're also bringing to the world of data science projects that principle of delivering concrete value into the hands of final customers, if you can say that, in small increments... in small iterations?
Ken Collier:
Yes. I think the same core principles apply. It's still a question of how to deliver value frequently and incrementally so that the business can take advantage of that business value, so that action can be taken and decisions can be made without waiting months to get the best possible analytical models developed. So what we've been looking at is how to start creating models very early that are part of that type of software that David mentioned, the intelligent software, and then incrementally improve those models, possibly replacing them with completely different models that do a better job. Whether that's fraud detection in the financial sector or customer churn in retail, you deliver those models instead of waiting months for them to be ready. It's very much the same as software features being delivered into production early, even though the overall software system is still maturing.
Alexey Vilas Boas:
And David, I'm curious. When we bring this mindset to the project, does the role of the data scientist change? What would you say an agile data scientist is, and how is that different from a traditional approach?
David Johnson:
Well, I think the key thing is to really be working as one team. You have software developers and you have data scientists, and they should be working as one team, pairing together, working together. The trouble in the past sometimes is that the data scientists have been separate: they go and spend three or four months creating a POC model, and then when they've proven it out they bring it back to the business and hand it over the wall to the developers for productionizing.
David Johnson:
The developers take that code and port it to some other language. And the trouble is, when you get the feedback from the production data, it's not the same code that was written by the researchers, right? The researchers have written their own code, and when something's wrong they say, well, I don't know if it's my code or my algorithm or the way it was ported over, right? So what we recommend is to really work together and write one code base. The data scientist is working with the developers, and the POC code that you write is still the first version of the production code. It's not something separate.
Alexey Vilas Boas:
You evolve the code into production and build a feedback loop around that so that you can converge towards something that has business value.
David Johnson:
So what you don't have is phases, where the research phase goes on for six months and then you come back for the productionizing phase. That doesn't work out too well. So we have just one project, and you do research, but you do it in small chunks, and you come back with the results and say, I did a research spike on this for the last week or so, this is what I've learned, and because of what I've learned, this is how we're going to change the application to make it better.
Alexey Vilas Boas:
And how has your experience been in getting engineers and data scientists working together? You mentioned collaboration as a key aspect and something very important to make this happen. Is there a cultural or practice clash when you try to merge all those personalities to work in one team?
David Johnson:
Yeah, certainly there can be, especially in the beginning. What happens often is that the data scientists come from a research background. They may have been writing code for years, but they've been writing it as scientists, so they don't really follow the same kind of practices or tests. They're not used to delivering software to someone else; they write software for themselves or their own collaboration. So they have to change. But the software developers also have to change, because they're not used to having a research spike that goes on for a week.
David Johnson:
With the kind of applications that we write that are based on the data, we can't really plan them out in the same kind of way. We have to change direction a lot more. So that feeling of, we head in this direction and it's not working, so we change direction and try something else... it takes a while for the developers to get used to that mindset of we'll try this, we'll try that, and eventually we'll find our way through. You can't really plan it all out the way you normally do, with a backlog of stories that you do one after the other, because you don't know what kind of turns you'll have to make along the way.
Alexey Vilas Boas:
And Ken, you've been seeing the impact of agile practices on data science projects for a while. Do you see any other benefits that agile practices bring to model building, as you evolve the model and start getting more feedback? How does that help the overall process?
Ken Collier:
Actually, there are several ways that this helps the overall process. One of the things that we see, and David has experienced this on teams he's worked on, is data scientists starting to learn the value of engineering discipline. And we see software engineers learning to understand the complexity and uncertainty of data science. So like David said earlier, it's not simple to just write straight-ahead requirements around data science, because we don't know what the data are going to tell us until we get into discovery and analysis and research.
Ken Collier:
So what tends to happen when we embed data scientists with data engineers and software engineers is that all parties learn something about the skill sets of the others. They start to learn how to move in faster ways and remove friction by writing tests around code early, by looking at code not as temporary or throwaway code but as something that needs to be sustainable and maintainable, and by looking at continuous delivery and continuous integration techniques around data, which looks somewhat different from conventional software. So there are quite a few ways that the entire feedback loop is tightened or shrunk when you embed data scientists collaboratively with software engineers, data engineers, and especially business analysts and testers. It becomes a pretty powerful thing.
Rebecca Parsons:
Let's dig into those disciplines a little bit more, because from an agile software development perspective we have very clear engineering practices around pair programming (David mentioned pairing a little bit), test automation, continuous integration, continuous delivery, the role of source control, frequent deployments into production. How is testing different in a context where you really don't know the answer to the question? When we think about testing a normal story, we require our business owner to say, this is what constitutes good, this is my unambiguous definition of done for this story, this is the behavior that I want. And yet in data science, particularly during the exploratory phase, we don't know what that is. So how do these things actually play out, and what kinds of changes do we need to our traditional thinking about these practices to really make them work on an agile data science project?
David Johnson:
So there are a couple of things going on here. One is that when you write code as a data scientist, it's not really different in the sense that you write it in small chunks and each of those chunks is supposed to do a certain thing. But the things they do are different. You might have a piece of code which takes a matrix and a vector and multiplies them together, right? For that kind of thing, it's very easy to write a test to see if it's working correctly. But what you can't do is go back to the user and say, what should this test look like, or what do you want this to do? They're going to tell you, I have no idea what you're talking about, I don't know what a matrix is. So it's not the same kind of thing.
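As a rough illustration of the kind of utility code David describes, here is a minimal sketch; the function name, shapes, and expected values are hypothetical. The point is that a data scientist can unit test a matrix-vector helper directly, even though no business user could write the acceptance criteria for it.

```python
# Minimal sketch: conventional unit testing of data science utility code.
# The helper and its expected values are illustrative, not from the project discussed.
import numpy as np

def weighted_scores(feature_matrix, weights):
    """Multiply a feature matrix by a weight vector to get one score per row."""
    return feature_matrix @ weights

def test_weighted_scores():
    X = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    w = np.array([0.5, 0.5])
    # The data scientist, not the user, knows these expected outputs.
    assert np.allclose(weighted_scores(X, w), np.array([1.5, 3.5]))
```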
David Johnson:
You don't deliver value in the same kind of chunks that you can bring back to the user and say, here's what I've done, this little chunk is done, and I want feedback from you to tell me whether or not it's correct. It's the data scientist who knows whether or not it's correct. You deliver value in somewhat larger pieces; the value that you deliver is not as localized, in a sense. With a web app, for example, one of the stories will be to create a dropdown menu so that the user can click on it and do this particular thing. In data science, a lot of the pieces that you work on are not like that. You can't say this thing has value by itself. It only has value when the whole thing is plugged in together and does what it's supposed to do, which is to predict whatever it is you're trying to predict, for example. It's a more holistic way of working, where you can't really break things down into pieces that each have value on their own.
Ken Collier:
And I would also add, there are a couple of different aspects in data science. There is the uncertain nature of what patterns exist in the data, and so we use validation and cross-validation techniques for verifying that our models are accurately telling us what's represented in the data. That's a data science skill that good data scientists know and understand, and it's different from testing in the software context. On the other hand, there is some code that data scientists write that is utility code, such as loading data sets from CSV files or doing other kinds of data manipulation to prepare the data for modeling, and that code can be tested in a more conventional software engineering context.
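A hedged sketch of the two layers Ken distinguishes, assuming a CSV layout with a column named "label" (an illustrative assumption, not a prescribed approach):

```python
# Two kinds of checks: a conventional unit test for utility code,
# and statistical cross-validation for the model itself.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def load_training_data(path):
    """Utility code: read a CSV and split features from the (assumed) 'label' column."""
    df = pd.read_csv(path)
    return df.drop(columns=["label"]), df["label"]

def test_load_training_data(tmp_path):
    # Conventional unit test: known input file, known expected output.
    csv = tmp_path / "toy.csv"
    csv.write_text("f1,f2,label\n1,2,0\n3,4,1\n")
    X, y = load_training_data(csv)
    assert list(X.columns) == ["f1", "f2"]
    assert list(y) == [0, 1]

def validate_model(X, y):
    # Cross-validation: judged against an agreed target score, not a pass/fail assertion.
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
```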
Ken Collier:
In addition to that, there's the question of what is the proper way for a particular project or product? What's the expected way of deploying a model into a production environment? So is it going to be deployed as a microservice? Is it going to be deployed into a streaming data pipeline or some other means? And that code also needs to be tested. So there's different layers of concerns that have to be validated in different ways. And like David said, it's not so much behavior-driven development testing in the software sense, it's more verifying that you haven't written code that is causing some kind of anomalies or not behaving in the way that you expect it to behave.
David Johnson:
Yeah, I agree. I think it's important to keep the validation separate from the testing, as you say, because the validation is something that you basically show to the client and say, we have agreed that a certain score here is what we're trying to target, right? We're trying to get the accuracy of the prediction to a certain level. So when you have a story to write some new [inaudible 00:12:36] into there that's going to change the model and make it better, you can then come back and say, I built this new feature and the validation score went up 20%. So that's a way of validating that what you have developed has created value for the client. Right?
Ken Collier:
So maybe a concrete example would help here. David was on a project a few years ago developing a revenue forecasting model, basically a month-over-month revenue forecast, in an environment where the prior techniques for forecasting were very subjective, best-guess kinds of scenarios. And so he needed to, number one, show the team how those forecasting models were going to be deployed, or could be deployed, into a forecasting application for the company. But he also needed to be able to demonstrate to the salespeople and others who cared about the forecast that his models were actually giving more accurate predictions than their previous methods. And David, you might elaborate on that, but it may be helpful to think about some concrete examples when we talk about this.
David Johnson:
Yeah, so if you take that case, one feature, for example, was the seasonality. There are some contracts for which there's a lot of seasonality, so you could show it to the client and say, look, there's a lot of variation, it goes up a lot in January every single year, so I'm going to have a feature where I build in some way to model that. So then you model it and you write the code. You have your tests to make sure the code is working, but then you come back and you show them what you developed. You can show them that it has brought the score up by 5%.
David Johnson:
You can show them how it's modeling the seasonality on certain contracts the way it's supposed to. That's the way you deliver value. Right? You might not spend the time explaining each of the little steps that go into the model that you built, because there could be details and mathematics that they don't really care about or have an opinion about. But they want to see, at the end of the day, that the score you've agreed on as the target has gone up and that it's doing what it's supposed to be doing.
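To make that concrete, here is a hypothetical sketch of how a seasonality feature and its effect on an agreed validation score might be demonstrated. The file name, column names, model choice, and metric are assumptions for illustration, not details of David's actual project.

```python
# Illustrative only: add a month-of-year feature and compare the cross-validated score.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("monthly_revenue.csv", parse_dates=["month"])  # assumed data set
base_features = ["contract_age_months", "prior_month_revenue"]  # assumed columns

def validation_score(feature_columns):
    """Cross-validated R^2 of a month-over-month revenue forecast."""
    X, y = df[feature_columns], df["revenue"]
    return cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=5).mean()

before = validation_score(base_features)
df["month_of_year"] = df["month"].dt.month  # the new seasonality feature
after = validation_score(base_features + ["month_of_year"])
print(f"Validation score: {before:.3f} before, {after:.3f} after adding seasonality")
```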
Rebecca Parsons:
So when we think about continuous delivery and continuous integration, within software development there's a very obvious mechanism for thinking about versioning. You make a change to the code, you check in the new code, and now you have a new version. You've talked about introducing new features, and there might also be new sets of data that models are trained on. How do you think about the version that would be deployed? Is that incremented? How is that really different in a data science context?
David Johnson:
So the biggest difference is that when you produce a model, for example, there's the code that you write and there are the parameters of your model, which you can store as code, and that's just the same: you would check that into the repo, for example. But then there's the question of the data that you use to train the model. Do you check the data that you trained the model on into your repo? That's usually a bad idea, because the data could be very big, for example, and it could be changing every day. So there is a question about how you do that, and I wouldn't say this is a solved problem. It really depends on the situation.
David Johnson:
For example, if the data is huge, if you have a terabyte of data that you're training on, there's no chance that you will check that into the repo or anything like that, right? And it's probably so big that you won't even want to make copies of it. So then you have to have some mechanism to ensure that you always have that data, so if you want to roll back to the way the model was a few weeks ago, you have the ability to do that. You have to save the information that gets that data and plugs it into your model, but at the same time you can't really save the data by itself.
David Johnson:
And then there's the question of what you do when you produce a model. The model is usually code which is generated by some library, right? It's code that takes the configuration code that you actually wrote and the data, runs it, and produces some kind of output, which could be, for example, a JAR. And then what do you do with that? Do you check that into the repo? Well, you probably don't want to do that, because it is generated from the data, and if you check it in you have a redundancy, in a sense. How you do this in practice, I would say, is not really a solved problem. There's not really one way to do it. It depends on the situation.
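One possible pattern, sketched here under stated assumptions (the manifest fields, file names, and serialization library are illustrative, and this is not presented as the team's actual approach): keep the large training data and the generated model artifact out of the repo, but record a small manifest of which data and which parameters produced the model, so a training run can be reproduced or rolled back.

```python
# Illustrative sketch: version the recipe, not the terabyte of data or the generated artifact.
import json
from datetime import datetime, timezone

import joblib

def save_model_with_manifest(model, params, data_uri, data_version, out_dir):
    """Serialize the trained model and record how to regenerate it."""
    joblib.dump(model, f"{out_dir}/model.joblib")   # generated artifact, kept out of the repo
    manifest = {
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "training_data_uri": data_uri,              # where the too-big-to-copy data lives
        "training_data_version": data_version,      # e.g. a snapshot ID or partition date
        "model_params": params,                     # the configuration that is checked in as code
    }
    with open(f"{out_dir}/manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
```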
Ken Collier:
But one of the things that we do know is that we want to understand not just the version of the model itself, but also the version of the data that was used to train and validate that model. Because one of the important things in data science is to monitor the drift of a model over time, as production data changes or the nature of the universe changes over time. Being able to monitor that drift tells you when it's time to retrain the model, maybe on a new data set, or to build an entirely new model that hopefully will perform better. So there's a lot of complexity around this version control.
Ken Collier:
But ultimately, there's the need for a large volume of data during the training phase, when you're building the model, and then when the model moves into deployment in production it's ideally running against either a streaming data pipeline or some other collection of production data that's not necessarily the high-volume historical data that was used for training purposes. So there's a lot of complexity, a lot of considerations to be made, in this continuous delivery and version controlling in data science.
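A minimal sketch of the drift monitoring Ken mentions, assuming labelled production data eventually arrives and that accuracy is the agreed metric (both assumptions; real monitoring would usually track several signals):

```python
# Illustrative drift check: compare recent production performance to the training baseline.
from sklearn.metrics import accuracy_score

def check_for_drift(model, X_recent, y_recent, training_score, tolerance=0.05):
    """Return True if the production score has drifted below the training baseline."""
    production_score = accuracy_score(y_recent, model.predict(X_recent))
    drifted = (training_score - production_score) > tolerance
    if drifted:
        print(f"Drift detected: {training_score:.3f} -> {production_score:.3f}; consider retraining.")
    return drifted
```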
Alexey Vilas Boas:
We're coming to the end of the episode, but I do have one final question. We all know the benefits agility can bring to the enterprise as a whole, and enterprises are also looking at becoming more and more data-driven. So how does agile data science impact the enterprise in general?
Ken Collier:
I'll take a shot at this one. There is a bigger picture, and this is something that I think a lot of enterprises, a lot of companies, have failed to fully realize. As data science has become a hot topic in the last six or seven years, companies have hired data scientists, and now we're starting to hear stories of companies that have hired data scientists and are surprised to find that it hasn't been entirely game-changing. So if you look at the bigger landscape of most companies, there is the customer and employee experience, and there are various interactions that happen with customers, with suppliers, with employees, et cetera. Those interactions and events and transactions get captured by transactional and operational systems, which typically move that data into something like a data warehouse, or what we call an insight enablement environment.
Ken Collier:
So whatever the consolidation of data for analysis is, this is where data scientists glean their data, build models, and create the insights that then get presented or delivered to the decision makers and action takers within the enterprise. And those decision makers and action takers are in a position to change the user experience, the customer experience, the employee experience in ways that hopefully materially benefit or improve the business landscape. You can think of this in retail, in technology and financial services, insurance, et cetera. The bigger picture here is: how do we remove friction from that data, to information, to insight, to decision, to action loop? There are a lot of steps in that loop, and there's a lot of thinking that we can be doing in terms of streaming data architectures and consolidated data-at-rest architectures that enable easy democratization of data, access to data, and secure, trustworthy data.
Ken Collier:
And then, to bring this back to a data science discussion, there's the ability for data scientists to easily and quickly deploy newer and better models into a production pipeline, so that all of this is automated and it's not a bench or desktop presentation of results; it's actually results in a production pipeline. For example, scoring customers on their propensity to do something, so that we can influence or improve their experience, or detecting the likelihood of anomalies or fraud, et cetera. All of these things could benefit from tightening that entire data-to-action loop I described earlier. We've been referring to this as continuous intelligence, but it really involves a lot of moving parts, including data science, machine learning platforms, data platforms, and data engineering as well as software engineering.
Alexey Vilas Boas:
Continuous intelligence. It looks like we should definitely be talking a little bit more about that. Well, anyway, it was a great conversation. It was great to have you with us. Thank you very much for joining and see you next time. Bye.