The Harvard Business Review article ‘Data Scientist: The Sexiest Job of the 21st Century’ created a ripple across the industry. Naturally, everyone began upskilling for the hot new job role, and organizations went on to hire Data Scientists to keep up in the race. However, once Data Scientists came on board, people expected them to have a magic solution to every problem. They were expected to be Business Analysts, Software Engineers, mathematicians and statisticians packaged in one human being. In other words, this unicorn breed was expected to master Business Analysis, SQL, DevOps, programming and more.
Surely, some people could be a jack of all trades. However, that trait was good enough for proofs of concept or pilot projects, not for operationalizing a predictive analytics system in the real world. To elaborate, Data Science involves a lot of statistical analysis, mathematical modelling and intuition. Hence, people with a background in the sciences and quantitative disciplines dominated these roles, and rightly so, since they have plenty of experience in analysis and modelling along with some experience in programming. However, when it came to robust, real-time systems, they lacked the necessary experience and expertise. This was especially true as the scale and complexity of data increased (big data). Hence, to augment the Data Scientists with the necessary skills, the role of the Data Engineer emerged.
The emergence of Data Engineering
Data Engineers typically create and maintain the data pipelines that ingest and process data for consumption by Data Scientists. This role emerged from that of the traditional ETL developer (sometimes the database developer). However, with changing paradigms, the tools and technologies grew by leaps and bounds. Data started pouring in at high volume, variety and velocity, leading to the emergence of the Lambda Architecture. ETL is fast evolving into ELT with technologies like Hadoop and Spark.
Thus, the data science team now consisted of Analysts, Data Scientists and Data Engineers. The Data Engineers relieved the Data Scientists of the data collection, processing and cleansing parts of the data science life cycle. This enabled the latter to focus on business understanding, model development and so on. However, deploying a model, i.e. converting it into a data product in the real world, remained a challenge for data science teams. Here, a breed of professionals called ML Engineers emerged.
The need for ML Engineers
The democratization of AI through tools like Azure Machine Learning has greatly simplified the data science life cycle. A first-cut prototype of an ML system, for example one that puts IoT and ML to work together, is now something any data scientist or data engineer can build. However, such prototypes are static, i.e. they do not address model retraining. Hence, a natural question is “Why retrain models?” The answer lies in the concept of ‘drift’. To understand drift, we need to see why ML systems are fundamentally different from traditional software systems.
In a traditional software system, we have an input and logic written to compute an output. In an ML system, however, we have the inputs and the outputs, and the system figures out the pattern or relation between them. For instance, let’s say the system is the equation of a straight line, y = mx + c.
In traditional systems, we have m, x and c and use them to compute y. In ML systems, however, we have y and x, and we figure out m and c in order to extrapolate the values of y in the future. This forms the basis of inductive reasoning.
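To make the contrast concrete, here is a minimal sketch in plain Python. It assumes ordinary least squares as the way of "figuring out" m and c, which is only one possible choice, and the function names and data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Estimate slope m and intercept c of y = m*x + c by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def predict(m, c, x):
    """Compute y for a given x once m and c are known."""
    return m * x + c


# Traditional system: the programmer supplies m and c as fixed logic.
y_traditional = predict(2.0, 1.0, 5.0)

# ML system: m and c are recovered from observed (x, y) pairs.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # illustrative observations lying on y = 2x + 1
m, c = fit_line(xs, ys)
y_learned = predict(m, c, 5.0)  # extrapolate to an x that was never observed
```

Both paths end in the same `predict` call. The difference is where m and c come from: written down by a person in the first case, inferred from data in the second.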
Having said that, it is intuitive that ML systems depend on the underlying distribution of the data. Even a small change in the distribution of the input data can throw the system off track, because the relation between the input and output variables that the model learned may no longer hold. This is the concept of drift in Machine Learning.
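The effect of drift can be sketched with the same straight-line setting. The example below is a toy scenario, not a production drift detector: the data is made up, the relation is assumed to shift from y = 2x + 1 to y = 3x + 1 after training, and `DRIFT_THRESHOLD` is a hypothetical cut-off chosen only for illustration.

```python
def fit_line(xs, ys):
    """Estimate slope m and intercept c of y = m*x + c by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def mean_abs_error(m, c, xs, ys):
    """Average absolute gap between predictions m*x + c and the observed ys."""
    return sum(abs(y - (m * x + c)) for x, y in zip(xs, ys)) / len(xs)


# Training data follows y = 2x + 1.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [2.0 * x + 1.0 for x in train_x]
m, c = fit_line(train_x, train_y)

# Later, live data follows y = 3x + 1: the relation itself has changed.
live_x = [5.0, 6.0, 7.0, 8.0, 9.0]
live_y = [3.0 * x + 1.0 for x in live_x]

error_before = mean_abs_error(m, c, train_x, train_y)
error_after = mean_abs_error(m, c, live_x, live_y)

DRIFT_THRESHOLD = 1.0  # hypothetical tolerance on the live error
drift_detected = error_after > DRIFT_THRESHOLD
if drift_detected:
    # Retrain on recent data so that m and c match the new relation.
    m, c = fit_line(live_x, live_y)
error_retrained = mean_abs_error(m, c, live_x, live_y)
```

The stale model fits its training data well but performs poorly on live data, and retraining on recent observations brings the error back down. Real systems usually monitor richer signals than a single error threshold, but the retraining loop has the same shape.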
The emergence of ML Engineers
The problem of drift is one of the areas that ML engineers deal with, by applying DevOps practices (which can be called DataOps) to ML systems. However, DataOps is fundamentally different from DevOps.
In traditional software systems, DevOps takes care of code versioning, maintenance and deployment in production systems. As far as versioning and maintenance are concerned, all that is needed is to maintain the code and monitor system health and security. In ML systems, however, there is the additional burden of data versioning and model versioning in order to track the training history of the models. Moreover, from a security standpoint, a determined user can fool an ML model by figuring out the pattern in how the system responds.
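As a rough illustration of what tracking data versions and model versions together can look like, the sketch below uses content hashes as version identifiers. The function names, the record layout and the example data are all hypothetical, and real tools keep far more metadata than this.

```python
import hashlib
import json


def fingerprint(obj):
    """Return a short, deterministic content hash for a JSON-serializable object."""
    canonical = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


def record_training_run(history, dataset, params):
    """Append one entry linking a data version to the model version trained on it."""
    entry = {
        "run": len(history) + 1,
        "data_version": fingerprint(dataset),
        "model_version": fingerprint(params),
    }
    history.append(entry)
    return entry


history = []

# First training run on an initial (made-up) dataset.
data_v1 = {"x": [0, 1, 2], "y": [1, 3, 5]}
run1 = record_training_run(history, data_v1, {"m": 2.0, "c": 1.0})

# New data arrives, the model is retrained, and both versions change.
data_v2 = {"x": [0, 1, 2, 3], "y": [1, 3, 5, 10]}
run2 = record_training_run(history, data_v2, {"m": 2.9, "c": 0.4})
```

Because each identifier is derived from content, identical data always maps to the same version, and the history shows which model was trained on which data.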
Toolset and skillset
Now, since the skillset is different, it is only natural that the toolset varies too. As far as deployment is concerned, a model can be exposed as an API using a web framework like Flask in Python. Furthermore, there are frameworks like MLflow from Databricks which can take care of model governance and deployment together.
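The idea of exposing a model as an API can be sketched as follows. To keep the example free of dependencies it uses Python's standard-library `http.server` rather than Flask, which would play the same role with less boilerplate. The `MODEL` parameters, the request format and the function names are assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical parameters of an already-trained straight-line model.
MODEL = {"m": 2.0, "c": 1.0}


def predict_from_json(body):
    """Parse a JSON body such as '{"x": 5}' and return the prediction as JSON text."""
    payload = json.loads(body)
    x = float(payload["x"])
    y = MODEL["m"] * x + MODEL["c"]
    return json.dumps({"y": y})


class PredictHandler(BaseHTTPRequestHandler):
    """Answer POST requests with a prediction, or a 400 error for bad input."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        try:
            response, status = predict_from_json(body), 200
        except (ValueError, KeyError, TypeError):
            message = 'expected a JSON body like {"x": 5}'
            response, status = json.dumps({"error": message}), 400
        data = response.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Serve predictions on localhost until the process is interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the endpoint. The prediction logic lives in `predict_from_json`, separate from the HTTP handling, so it can be tested without starting a server and so that swapping in a retrained model only means replacing `MODEL`.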
However, ML engineering is more about mindset than skillset or toolset (although, of course, those are important). It is a mindset of taking on the uncertainty of the real world. It is not about maintaining traditionally heavy systems, but about maintaining a data infrastructure and a model infrastructure together. Hence, this role is a combination of Data Engineer, Data Scientist and Software Engineer.