Introducing Machine Learning System Design


Why do we need Machine Learning System Design?

In their seminal paper, Hidden Technical Debt in Machine Learning Systems, Google researchers argue that ML code makes up only a small fraction of a real-world ML system. The surrounding infrastructure is vast and complex. However, certain guiding principles help us navigate this wide array of infrastructure options, and ML System Design is where we find them. Please note that the principles of ML System Design we present here apply to routine ML/DL cases. Special cases like self-driving cars need a distinct setup and design altogether.

Before elaborating on ML System Design, let us look at the first principles.

Principles of Machine Learning System Design

So what are the guiding principles for Machine Learning System Design? Broadly, Machine Learning systems have two major workflows, viz. Training and Serving, and we group six principles under them.

The first three cover the training end of an ML system:

  • Data Management
  • Orchestrated Experimentation
  • Reliable and Repeatable Training

The next three pertain to deployment, serving and monitoring:

  • Continuous Deployment
  • Reliable and Scalable Serving
  • Continuous Monitoring

Data Management

Machine Learning and Data Science run on data. Hence, a successful MLOps practice needs robust data management. What sets it apart from traditional analytics/software engineering is that datasets need to be versioned for reproducibility. Data Scientists often need to track different runs of an experiment to find the optimal training configuration, and each run involves code, data, models, etc.

Moreover, in the absence of Data Management, Data Science teams may end up spending a significant amount of time rebuilding datasets that other teams have already built. This calls for a centralized dataset repository, known as a Feature Store. Read more on feature stores here.
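
To make the dataset versioning idea concrete, here is a minimal sketch that derives a version id from the content of a dataset and stores it next to the code version and hyperparameters of a run. The function name dataset_fingerprint, the sample records and the commit id are all hypothetical; a real setup would more likely rely on a dedicated data versioning tool or a feature store.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a short, deterministic version id for a list of dict records."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys makes the serialization independent of key order
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:12]


# Illustrative records only, not real data
train_v1 = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.5}]
train_v2 = train_v1 + [{"id": 3, "amount": 7.25}]

# One run record ties together code, data and configuration
run_record = {
    "code_version": "abc123",  # hypothetical commit id
    "data_version": dataset_fingerprint(train_v1),
    "hyperparameters": {"learning_rate": 0.1},
}
```

Because the id depends only on content, two teams building the same dataset get the same id, while adding a record (as in train_v2) produces a new one. Note that this sketch treats row order as part of the content.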

Orchestrated Experimentation

Once the data is ready, Data Scientists perform experiments. In this phase, the key tasks include ML Problem Definition, Data Exploration and Selection, Feature Engineering, Model Tuning and Validation. The output of this phase is the Training Code and Hyperparameters, for building the training pipelines.

Orchestrated Experimentation is key to MLOps, and tracking each iteration and artefact of an experiment is key to success in Data Science. Hence, MLOps platforms/frameworks come with provisions to create experiments and track them. For instance, here are different ways to perform experiments in Microsoft Azure Machine Learning.
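
As a rough illustration of what such tracking involves, the sketch below keeps parameters, metrics and artefact names for each run in memory and picks the best run by a metric. The Experiment and Run classes are hypothetical and do not mirror the API of any particular MLOps platform; the metric values are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    params: dict
    metrics: dict = field(default_factory=dict)
    artifacts: list = field(default_factory=list)


@dataclass
class Experiment:
    name: str
    runs: list = field(default_factory=list)

    def log_run(self, params, metrics, artifacts=()):
        """Record one iteration together with everything it produced."""
        run = Run(dict(params), dict(metrics), list(artifacts))
        self.runs.append(run)
        return run

    def best_run(self, metric, higher_is_better=True):
        """Return the run with the best value for the given metric."""
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run logged the metric {metric!r}")
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda r: r.metrics[metric])


# Made-up values, for illustration only
experiment = Experiment("email-classifier")
experiment.log_run({"C": 0.1}, {"accuracy": 0.81}, ["model_c0.1.pkl"])
experiment.log_run({"C": 1.0}, {"accuracy": 0.88}, ["model_c1.0.pkl"])
experiment.log_run({"C": 10.0}, {"accuracy": 0.86}, ["model_c10.pkl"])
best = experiment.best_run("accuracy")
```

The parameters of the best run are exactly the hyperparameters that the experimentation phase hands over to the training pipeline.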

Reliable and Repeatable Training Pipelines

Once the ML experiments are successful, it’s time to operationalize them. This calls for reliable and repeatable Training Pipelines. Typically, such a pipeline comprises Data Extraction, Data Validation, Data Transformation, Model Training and Evaluation, and Model Registration. The output of this step is a Trained Model. Moreover, these pipelines should be repeatable for retraining. Check out our post on Azure Machine Learning training pipelines.
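
The pipeline steps listed above can be sketched as plain functions chained together. Everything here is an assumption made for illustration: the toy data, the one-parameter model fitted by least squares, and the in-memory list standing in for a model registry. A real pipeline would run on an orchestrator and evaluate on held-out data.

```python
def extract():
    # Stand-in for reading from a database or data lake (toy values)
    return [{"x": 1.0, "y": 2.1}, {"x": 2.0, "y": 3.9},
            {"x": 3.0, "y": 6.2}, {"x": 4.0, "y": 7.8}]


def validate(rows):
    # Fail fast if a record is missing a required field
    for row in rows:
        if row.get("x") is None or row.get("y") is None:
            raise ValueError(f"invalid record: {row}")
    return rows


def transform(rows):
    return [(row["x"], row["y"]) for row in rows]


def train(pairs):
    # Least-squares fit of y = w * x (a line through the origin)
    w = sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)
    return {"weight": w}


def evaluate(model, pairs):
    # Mean absolute error; computed on the training data to keep the sketch short
    errors = [abs(model["weight"] * x - y) for x, y in pairs]
    return {"mae": sum(errors) / len(errors)}


def register(model, metrics, registry):
    version = len(registry) + 1
    registry.append({"version": version, "model": model, "metrics": metrics})
    return version


def run_pipeline(registry):
    pairs = transform(validate(extract()))
    model = train(pairs)
    metrics = evaluate(model, pairs)
    return register(model, metrics, registry)


registry = []
first_version = run_pipeline(registry)
second_version = run_pipeline(registry)  # retraining reuses the same steps
```

Running the pipeline twice on the same data yields the same model under a new version number, which is the repeatability that retraining depends on.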

Continuous Deployment

This step is the beginning of the Serving workflow. Here, we take a registered model and deploy it to a target compute environment. Typically, a deployment pipeline comprises two components viz. Inference Configuration and Deployment Configuration:

The inference configuration comprises all the prerequisites and dependencies for Model Scoring/Inferencing. Primarily, it needs a scoring script and an environment config. A scoring script has two functions, viz. init() and run(). The init() function loads the registered model that requests will be scored against, while run() executes the scoring logic. The environment config lists the scoring dependencies, such as libraries. The deployment configuration, on the other hand, defines the target compute environment for model deployment and inferencing.

In Microsoft Azure, we have multiple options. For batch deployment, we have Azure ML Compute Clusters at our disposal. For real-time deployment, we have Azure Container Instances and Azure Kubernetes Service. For more details, refer to Section 1.4 i.e. Model Serving, in our article.
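
The init()/run() contract of a scoring script described above might look roughly like the following. This is a sketch under stated assumptions: ThresholdModel is a hypothetical stand-in for a registered model, and the "data" key in the request payload is an assumed convention. A real script would deserialize the registered model file inside init().

```python
import json

model = None  # populated once by init()


class ThresholdModel:
    """Hypothetical stand-in for a registered model loaded from disk."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, rows):
        return [1 if sum(row) > self.threshold else 0 for row in rows]


def init():
    """Called once when the service starts: load the model into memory."""
    global model
    # A real script would load the registered model file here
    model = ThresholdModel(threshold=1.0)


def run(raw_data):
    """Called per request: parse the payload, score it, return predictions."""
    try:
        rows = json.loads(raw_data)["data"]  # "data" key is an assumption
        return {"predictions": model.predict(rows)}
    except (KeyError, TypeError, ValueError) as exc:
        # Malformed payloads produce an error response instead of a crash
        return {"error": str(exc)}


init()
response = run(json.dumps({"data": [[0.2, 0.3], [0.9, 0.8]]}))
```

Loading the model in init() rather than in run() means the cost is paid once per service start instead of once per request.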

Reliable and Scalable Serving

The end goal of building a machine learning model is to use it for gaining insights and/or making decisions. In a serving pipeline, the deployed model/service starts accepting requests (serving data) and sends the responses (predictions) back. Thus, reliability and scalability are key requirements, and so is security.

Batch scoring can be set up like an ETL pipeline. However, real-time scoring could be more complicated. The model is exposed as a web service, which should be secure and scalable to handle the large and unpredictable volumes of data. Another key requirement could be interpretability, where the consumers of the service may want to know the rationale behind the model’s predictions.

Continuous Monitoring

Machine Learning deals with uncertainty. It tries to model complex real-world phenomena that are ever-evolving, so the patterns a model learned during training can stop holding over time. This is known as Drift. Drift in Machine Learning occurs in two forms: concept drift and data drift.

Concept drift happens when a new business scenario emerges that did not exist during the model building process. It is likely that users change their behaviour over time, thus causing new patterns to emerge. In simple words, the relationship between X and y changes. An example of this could be users using their credit cards excessively in the face of a black swan event like COVID-19. Similarly, in an email classification model, a new category may come up.

Data drift, on the other hand, results from changes in the features, typically to values unseen in the modeling phase. For instance, in an email classification model, people may start using synonyms of the words the model was trained on. In simpler terms, here X changes. This may happen either because the environment is non-stationary or because, in a big data scenario, the training data is not representative of the population. For more details, refer to our article on Data Drift.
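
One simple way to check a single numeric feature for data drift is to compare its distribution at training time with its distribution at serving time. The sketch below computes the two-sample Kolmogorov-Smirnov statistic, i.e. the largest gap between the two empirical CDFs. The sample values and the alert threshold are hypothetical, and this is only one of several possible drift measures. It looks at X alone, so it does not detect concept drift.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    ref = sorted(reference)
    cur = sorted(current)
    gap = 0.0
    # The largest gap always occurs at one of the observed values
    for value in ref + cur:
        ref_cdf = bisect_right(ref, value) / len(ref)
        cur_cdf = bisect_right(cur, value) / len(cur)
        gap = max(gap, abs(ref_cdf - cur_cdf))
    return gap


# Illustrative values only
training_feature = [float(v) for v in range(1, 11)]
serving_feature = [v + 20.0 for v in training_feature]  # shifted distribution

DRIFT_THRESHOLD = 0.5  # hypothetical alert level
drift_detected = ks_statistic(training_feature, serving_feature) > DRIFT_THRESHOLD
```

In a monitoring job, a check like this would run per feature on a schedule, and a breach of the threshold would trigger an alert or a retraining run.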

Lastly, it is also important to monitor the health of the training and serving infrastructure, especially the client-facing endpoints. Tools like Azure Application Insights come in handy here.


As with any discipline, the complexity of design increases with time, and Machine Learning is no different. These are some high-level ideas and tools to get started in the growing field of Machine Learning System Design. Since the field is evolving, we do not claim any guarantees regarding the completeness or accuracy of this overview. We will keep adding new articles and ideas to this space, so stay tuned!

I am a Data Scientist with 6+ years of experience.
