Scrum Makes No Sense for Data Scientists

6 min read

I know the title is provocative, but I believe it’s necessary to foster discussion — and it also summarizes the central idea of this article quite well: Scrum makes no sense for data scientists, and you shouldn’t force them to use it.

“Scrum is a lightweight framework that helps people, teams, and organizations generate value through adaptive solutions for complex problems.” - Scrum Guide

Is Scrum bad?

That’s a good question. In my experience, no: it’s not a bad agile methodology. But it’s also not perfect, nor the silver bullet for every organizational and management problem in IT, as its community tends to sell it. I also have reservations about its efficiency in large projects, where some ceremonies end up causing more harm than good.

“After the Manifesto became popular, the word agile became a magnet for anyone with points to defend, hours to bill, or products to sell.” - Dave Thomas, AGILE IS DEAD

Below we’ll discuss these ceremonies and also some things that simply don’t make sense for a data project.

Scrum focuses on incremental deliveries — Sprints

The concept of a Sprint, a fixed time interval (usually 2-4 weeks) that ends with an incremental (partial) delivery, doesn’t fit data science projects, precisely because of the scientific and highly speculative nature of the work. In the end it is counterproductive: the team feels “forced” to deliver something that usually changes almost completely over the following Sprints, or is simply discarded.

The delivery of a data science project normally consists of one or several ML models, or simply a report (exploratory analysis).

There are rarely intermediate features that can stand on their own. The work has a value and a flow that represents an exponential pattern, not a linear one. For Stakeholders, it can seem like the project simply achieved nothing over the course of the Sprint.

Forcing a data scientist to work within a predetermined time interval is a terrible idea. You simply can’t define in advance what tasks will be performed over the next two weeks (a Sprint). It just doesn’t work that way. Data science is experimental work that can take a long time to produce any result — if it produces one at all.

“No matter how great the talent or efforts, some things just take time. You can’t produce a baby in one month by getting nine women pregnant.” - Warren Buffett

Tasks, tasks, tasks

Scrum tries to break everything down into small tasks. Again, this doesn’t fit the nature of the work. Data science is a process of constant experimentation and iteration.

Another important point: tasks that require a lot of time are not always labor-intensive. In traditional software development, if a task needs 4 days to complete, in most cases it will keep a developer busy for 4 days (or 2 developers for 2 days, and so on).

This makes completion speed easy to predict and development flow easy to manage. However, that’s not always the case in data science. For example, if a data scientist needs to train a model they’re developing, training might take 4 days (due to the amount of data), but the data scientist won’t be busy with it the entire time: they’ll just monitor it periodically to confirm the process is running correctly.

Congratulations, you’ve just wasted a data scientist for 4 days because “you can’t work on two tasks simultaneously.”
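To make that scheduling point concrete, here is a minimal Python sketch (the function name and the simulated workload are hypothetical stand-ins, not anything from a real project) of the pattern described above: a long training run is launched in the background, and the data scientist merely polls it periodically, staying free for other work in between check-ins.

```python
import threading
import time


def long_training_job(n_epochs, finished):
    """Stand-in for a multi-day model training run (hypothetical workload)."""
    for _ in range(n_epochs):
        time.sleep(0.01)  # one simulated "epoch" of training
    finished.set()  # signal that training is complete


finished = threading.Event()
job = threading.Thread(target=long_training_job, args=(10, finished))
job.start()

# The scientist doesn't block on the run; they check in periodically.
checks = 0
while not finished.wait(timeout=0.03):
    checks += 1  # periodic monitoring: "is training still healthy?"

job.join()
```

The point of the sketch is that the monitoring loop is nearly idle: the worker thread does the 4 “days” of work, while the main thread wakes up only occasionally. Treating that whole span as one person-blocking task, as a Sprint board tends to, mistakes elapsed time for effort.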

Data scientists don’t always need a PO

“The Product Owner represents the stakeholders to the Scrum Team, which includes representing their desired requirements in the Product Backlog.” - Scrum Guide

According to Scrum, it’s the PO’s (Product Owner’s) responsibility to “translate” the complex problems of stakeholders into a list of tasks. Nice, but this also doesn’t work for data scientists.

Relying on a PO for this instead of the data scientist is a terrible idea. Getting problem context secondhand from the PO is insufficient, demotivating, and unnecessary. It’s more effective for data scientists to interact directly with the client, understand the problem context with them, dive into the discussions, and then design a plan on their own.

Much of the work data scientists do is experimental (research) and not product or feature development. Given the problem and its context, they design their own roadmap for how the project will be carried out analytically.

Since the necessary tasks aren’t yet clear to anyone (PO or stakeholders), it doesn’t make sense to remove this responsibility from the data scientist, whose job is precisely to clarify them.

What always happens is that the data scientist frequently ends up becoming their own PO, the “owner” of their projects.

The definition of “Done” is very relative

“Developers must conform to the Definition of Done. If there are multiple Scrum Teams working together on a product, they must mutually define and comply with the same Definition of Done. Understanding and applying the Scrum Framework enables teams and organizations to iteratively and incrementally deliver valuable releasable ‘Done’ software products in 30 days or less.” - Scrum Guide

Data projects are intrinsically different from one another — it’s impossible to use the same definition of “done” for all projects. Especially when working alongside a traditional software development team.

For example, “done” for one project might mean that a preliminary exploratory analysis was conducted, where it was determined that the project isn’t viable to continue with. But the development team has the next Sprint planned where they would implement feature X in the product — a feature that would be derived from the project being conducted by the data team. And now what?!

So What Should We Do?

OK, we understand that Scrum doesn’t have the best “fit” for Data Science projects, but I’m also not advocating that data teams should follow the eXtreme Go Horse methodology. However, they should certainly have more freedom to flow horizontally and vertically within the company.


The data field is very peculiar in the technology world, because at the same time that it’s a highly technical, hands-on field, it’s also extremely scientific and experimental — and it’s the one closest to the “business.” It’s very common to see a data scientist flowing between meetings with the development team and with the company’s executive team, talking about the same subject but with a completely different vocabulary and positioning in each meeting.

For this reason, I believe we should treat data scientists more like business partners than just “Devs who work with data.”

But please, share your opinion below!

What do you think of Scrum applied to data science? Have you had any experience with it? How did it go?