When is the right time to get a data scientist in?

I have had this particular discussion several times now. Every time the client wanted them ASAP and I recommended that they wait. But is my advice always right? Why did I give it? 

In his recent book Data Teams ( https://www.datateams.io/ ), Jesse Anderson describes three teams that are typically needed for successful data projects:

  1. Data Science 
  1. Data Engineering 
  1. Operations

The Data Teams 

So who are the data teams and what differentiates them? 

The Data Science people are maths focussed, and only beginner programmers. They don’t know much engineering and probably do not know what “productionise” means. They should, however, understand the data in their particular industrial sector. They may know about machine learning and similar statistical methods.

The Data Engineers are typically more advanced programmers, but more importantly they are software engineers. They understand version control and continuous integration/continuous delivery. They may understand the business data, but probably not in depth.

The Operations people like things to work reliably. They don’t want surprises. They want the business to get the benefit of the data products that they look after 24×7.

It is of course possible that one person might fall into more than one of these camps, but in practice each person will be mostly one of the three.

The Missing Team 

The first thing to notice is that there is a group of people missing from this picture. There are often Business Information people who specify and run reports to satisfy real business needs. They probably do not have the same level of maths skills as the data science people. They might be reading their data from a Data Warehouse, or a traditional RDBMS, or a Big Data system. In any case, the Business Information people typically need Data Engineers to make sure that their data pipelines are correct and up to date, so that the data they read is right. They also need Operations people to make sure that those pipelines run reliably and frequently, so that the data is readable at all.

By itself I would normally call this a valid argument for getting Data Engineers and Operations staff in before Data Scientists, but let us look some more at what Data Scientists do.  

The Data Science Hierarchy of Needs 

A few years ago Monica Rogati (@mrogati) came up with a way of describing what a data scientist needs in order to do their job. Rogati used Maslow’s Hierarchy of Needs as the model for it.

Maslow’s Hierarchy of Needs.

( Image sourced from Wikimedia commons.  

https://commons.wikimedia.org/wiki/File:Maslow%27s_Hierarchy_of_Needs2.svg ) 

Compare that to @mrogati’s pyramid:

(Image Source: https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007 ) 

The Business might see the results from the top of the pyramid, “Predictive analytics” and “AI”, and say “We want that!”, but in reality the top of the pyramid depends on almost everything underneath.
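As a rough illustration of that dependency, here is a minimal Python sketch. It is not taken from Rogati's article: the layer names are abbreviated from the pyramid, and the function names and the example firm are hypothetical. It only shows the idea that a layer is worth attempting once every layer beneath it is in place.

```python
from typing import Dict, List, Optional

# Layer names abbreviated from the pyramid, lowest first.
# This is an illustrative simplification, not the full pyramid.
PYRAMID: List[str] = [
    "collect",
    "move/store",
    "explore/transform",
    "predictive analytics / AI",
]


def first_unmet_layer(ready: Dict[str, bool]) -> Optional[str]:
    """Return the lowest layer that is not yet in place, or None if all are."""
    for layer in PYRAMID:
        if not ready.get(layer, False):
            return layer
    return None


def can_support(target: str, ready: Dict[str, bool]) -> bool:
    """A layer can be supported only if every layer beneath it is in place."""
    below = PYRAMID[: PYRAMID.index(target)]
    return all(ready.get(layer, False) for layer in below)


# Hypothetical firm: data is being collected, but the pipelines and
# storage underneath the analysis work are not reliable yet.
firm = {"collect": True, "move/store": False, "explore/transform": False}

print(first_unmet_layer(firm))                          # move/store
print(can_support("predictive analytics / AI", firm))   # False
```

In this made-up case the first thing to fix is the "move/store" layer, which is engineering and operations work rather than data science.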

Unless your firm is very simple, most of what goes on in the two bottom shaded layers is only made possible by data engineers and operations staff:

Reliable Data Flow, Infrastructure, Pipelines, ETL, Structured and Unstructured Data Storage

Instrumentation, Logging, Sensors, External Data, User Generated Content

When your data scientist wants to collect data, move it, store it, explore it and transform it, they will typically need a fully functioning big data system. They may also need data architecture to explain the data models involved and educate them about the data.

Without those data engineers you will be wasting a data scientist’s time on work which is too manual and only works once.

But maybe your data is small. Maybe you only have a simple database. Maybe your data model is well documented and easily understood. If you have all that in hand then go ahead and engage your data scientist.   

But when you do, remember the Data Scientist’s Hierarchy of Needs.

Credits

Alex McLintock runs Alephant which helps companies in London to design new systems for Big Data Analytics and Data Science. He has written several articles here.

Contact us for all your Big Data design problems. We can work through Cafe Associates if you want Architecture As A Service instead of a single project.

The article also mentions the Apress book Data Teams by Jesse Anderson.

Monica Rogati (@mrogati) is an entertaining writer and well worth reading if you want to learn more about Data Science. They invented the Data Science Hierarchy of Needs.

Photo of crystal taken by Daniele Levis Pelusi and obtained through Unsplash. Thanks.