Before we dive into the discipline of Machine Learning Engineering, let’s look at the diagram below from Google’s seminal paper, Hidden Technical Debt in Machine Learning Systems:
As we can see, the actual ML code is a small part of the entire Machine Learning System. As with any software engineering workload, the system is a complex amalgam of tools and tasks. Moreover, these tools and tasks require a variety of skill sets. Hence, it is rare to find one person with all the know-how required to build and maintain these systems.
When I wrote an article on Machine Learning Engineers around 2.5 years ago, I had an inkling of the broader first principles of the role. However, the rapid pace of advancement in this field has forced me to take a fresh look at the discipline(s) of AI Engineering/Machine Learning Engineering. So, what has changed in a quarter-decade? Simply put, Machine Learning Engineering has evolved from a single role into a discipline practiced by a team. Therefore, Machine Learning Engineers can broadly be divided into two categories:
- ML Platform Engineer(s)
- ML System Engineer(s)
Machine Learning Platform Engineers
To build a solid software system, a robust platform and robust processes are a must. The same principle applies to Machine Learning Engineering: to build and maintain useful ML applications, a robust Machine Learning Platform is a must. But ML platforms and infrastructure bring their own unique requirements and challenges. For instance, traditional code maintenance tracks code and dependencies, whereas ML processes need to track data, models and experiments as well. This adds to the complexity of ML platforms.
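To make the tracking requirement concrete, here is a minimal sketch of a run record that ties together code, data, parameters and the model artifact. It is an illustration under stated assumptions, not how any particular platform implements tracking: the names `RunRecord`, `record_run` and `fingerprint`, the commit id and the sample payloads are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def fingerprint(payload: bytes) -> str:
    """Return a short, stable content hash for any byte payload."""
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass(frozen=True)
class RunRecord:
    """One experiment run (hypothetical schema for illustration)."""
    code_version: str  # e.g. a commit id; the value used below is made up
    data_hash: str     # fingerprint of the training data
    params_hash: str   # fingerprint of the hyperparameters
    model_hash: str    # fingerprint of the serialized model
    metrics: dict


def record_run(code_version, data, params, model_bytes, metrics):
    # Sort keys so that the same parameters always give the same hash.
    params_blob = json.dumps(params, sort_keys=True).encode()
    return RunRecord(
        code_version=code_version,
        data_hash=fingerprint(data),
        params_hash=fingerprint(params_blob),
        model_hash=fingerprint(model_bytes),
        metrics=metrics,
    )


# Two runs on the same code and parameters, but different data.
run_a = record_run("abc123", b"x,y\n1,2\n", {"lr": 0.1, "epochs": 5},
                   b"weights-v1", {"accuracy": 0.91})
run_b = record_run("abc123", b"x,y\n1,2\n3,4\n", {"epochs": 5, "lr": 0.1},
                   b"weights-v2", {"accuracy": 0.93})

print(json.dumps(asdict(run_a), indent=2))
print("data changed:", run_a.data_hash != run_b.data_hash)
```

Versioning the code alone would treat these two runs as identical. The data and model fingerprints are what tell them apart, and that extra bookkeeping is what an ML platform has to carry.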
Having said that, various platforms like Azure Machine Learning, Kubeflow and SageMaker have emerged. These platforms abstract away a lot of components, such as the Model Registry and Deployment. Then why do we need an ML Platform Engineer? Wouldn’t the traditional SRE and DevOps folks suffice? Yes, they would, and in that capacity they are called MLOps Engineers. But the ML Platform Engineer is more than that: a strategic MLOps Engineer. This person is responsible for architecting the ML Platform and integrating it with existing systems, which calls for know-how of infrastructure, networking and security, along with traditional DevOps/SRE knowledge. Moreover, ML Platform Engineers define standards for compliance, security and auditing of ML Systems.
For more on ML Platform Engineering, see our article on Machine Learning Infrastructure.
Machine Learning System Engineers
Once the ML Platform is ready, it’s time to build ML Systems/Applications. You may argue that we have Data Scientists for that. But Data Scientists typically build prototypes of ML Models, which then need to be taken to production systematically. This is where ML System Engineers come into play. They are the “ML Engineers” known in hiring circles. To understand what ML System Engineers do, we first need to understand the first principles of ML System Design.
It is important to understand that Data Engineers with some know-how of Feature Engineering/Feature Stores could upskill to become ML System Engineers. However, the key skill of ML Engineers lies in building reliable and repeatable ML Pipelines for Model Training and Deployment. Furthermore, they are expected to set up monitoring and observability for ML Systems, given their highly dynamic nature.
To dive deeper into the high-level tasks of ML System Engineers, we recommend reading our article on Machine Learning System Design.
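As a rough illustration of what “repeatable” and “monitored” can mean in practice, consider the sketch below. It assumes a toy linear model and synthetic data, and every name in it (`make_dataset`, `train`, `run_pipeline`, `drifted`) is hypothetical. Real pipelines are far more involved, but the two ideas carry over: a fixed seed makes training reproducible, and a baseline captured at training time lets us flag drift in live data.

```python
import random
import statistics


def make_dataset(seed: int, n: int = 200):
    """Generate synthetic (x, y) pairs; a fixed seed makes this repeatable."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


def train(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def run_pipeline(seed: int):
    """Train a model and capture feature statistics for later monitoring."""
    xs, ys = make_dataset(seed)
    model = train(xs, ys)
    baseline = {"mean": statistics.fmean(xs), "stdev": statistics.stdev(xs)}
    return model, baseline


def drifted(baseline, live_values, threshold: float = 3.0) -> bool:
    """Flag drift when the live mean is more than `threshold` standard
    errors away from the mean seen at training time."""
    standard_error = baseline["stdev"] / len(live_values) ** 0.5
    gap = abs(statistics.fmean(live_values) - baseline["mean"])
    return gap > threshold * standard_error


model, baseline = run_pipeline(seed=42)
same_model, _ = run_pipeline(seed=42)
print("repeatable:", model == same_model)

# Simulate production traffic whose feature values have shifted upwards.
live_rng = random.Random(7)
shifted = [live_rng.uniform(0, 10) + 5.0 for _ in range(50)]
print("drift detected:", drifted(baseline, shifted))
```

The drift check here is deliberately simple, comparing only the mean of a single feature. It stands in for the richer monitoring and observability an ML System Engineer would set up.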
This blog is based on personal experience of being part of ML teams, and we make no guarantees about it. Definitions and dichotomies of the discipline of Machine Learning Engineering vary. For instance, here is an article by Shreya Shankar.
P.C. TensorFlow Blog