How to Approach the ML System Design Interview
An ML system design interview is a type of interview used at many tech companies (Google, Meta, Apple, etc.)
to evaluate whether a candidate can design and reason about large-scale machine learning systems,
not just train a single model. It's different from a coding or pure ML theory interview.
It's about connecting ML, software engineering, and systems thinking, and it requires an in-depth understanding of the complete lifecycle of an ML project.
This article introduces a structured framework for approaching ML system design problems and outlines the dimensions typically discussed in interviews.
1. Requirement Clarification
In real-world ML projects, the first step is always to understand the problem and then translate the business problem into a well-defined ML problem. The same approach is expected in interviews.
Often, business problems are not clearly defined. That's why you are expected to ask clarifying questions, narrow down the scope, and clearly outline what exactly you want your ML solution to address.
For example, suppose the interview question is: "Design an ML system to identify the intent categories of feedback reviews."
Here, you would first need to clarify what the intent categories are, and what the system should do once the intent is identified.
Now, consider another example:
"We've launched our app and are seeing a lot of failures. We want to prioritize login and payment failures over other issues. Users are submitting feedback, and we'd like to automatically detect login and payment failures in that feedback so the engineering team can quickly address them."
This second problem is more concrete compared to the first one. When faced with an ambiguous problem statement, your goal should be to make it more concrete by asking the right questions.
This process of clarifying and defining the scope of the problem is called functional requirements gathering (what the system should do). But there are also non-functional requirements, such as latency, scalability, and whether the model needs to run offline or online.
Let's go back to the feedback review classification example. Once you train a classifier, you can run it offline on all collected reviews and filter out those related to login or payment failures. In this case, inference is offline, so latency is not a big concern.
Now compare that with a scenario like classifying the intent of Twitter posts in real time. Here, as soon as a post is created, the model must generate predictions within strict latency limits (often <50ms). This requires a low-latency online model.
These requirements directly affect your model choices. For real-time, low-latency systems, you might prefer lightweight models with fewer parameters. For offline tasks like review classification, you have more flexibility and can choose larger models.
Key Takeaways from This Section
- In ML system design interviews, a central focus is on evaluating how effectively a candidate can transform an initially ambiguous problem statement into a clear and well-defined ML problem.
- Clearly define the problem statement by identifying the functional requirements (what the system should achieve).
- Discuss non-functional requirements such as latency constraints, online vs. offline deployment, and the scale at which the system needs to operate.
2. Data Collection Strategies
Once the ML problem is clearly defined, the next step is to identify the type of task you're solving. It could be a classification problem (binary or multiclass) or a regression problem. From there, you need to consider how you will obtain the right data for training.
- Supervised learning requires labeled data.
- Unsupervised learning does not rely on labels, so the approach is different.

If your problem is supervised, the next question is how to obtain labels. Two common strategies are (a minimal labeling sketch follows this list):

- Human annotation: hiring annotators or domain experts to label data.
- LLM-assisted labeling: using large language models to generate labels via prompt engineering.
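To make LLM-assisted labeling concrete, here is a minimal sketch. The label set, prompt, and `call_llm` function are illustrative assumptions, not part of the original problem; `call_llm` is a stand-in for whichever LLM client your stack actually provides.

```python
# Minimal sketch of LLM-assisted labeling. The label set and prompt are
# illustrative; `call_llm` is a placeholder for a real LLM API call.
LABELS = ["login_failure", "payment_failure", "other"]

PROMPT_TEMPLATE = (
    "Classify the following app review into exactly one of these categories: "
    "{labels}.\n\nReview: {review}\n\nAnswer with the category name only."
)

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM client (e.g., a chat-completion endpoint)."""
    raise NotImplementedError

def label_review(review: str) -> str:
    prompt = PROMPT_TEMPLATE.format(labels=", ".join(LABELS), review=review)
    raw = call_llm(prompt).strip().lower()
    # Fall back to "other" if the model answers outside the allowed label set.
    return raw if raw in LABELS else "other"
```

In an interview, it is worth noting that LLM labels are cheaper but noisier than expert annotation, so a small human-reviewed sample is often used to audit them.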
Once you have labeled data, the next step is splitting the dataset into training, validation, and test sets. A few points to consider:
- Training and validation sets are often rebalanced to a controlled class ratio (e.g., a positive-to-negative ratio of 1:4 to 1:7); pick a ratio based on the problem at hand.
- The test set should mimic real-world production distribution, which is often highly imbalanced.
- For balanced datasets, accuracy can be a reasonable metric.
- For imbalanced datasets, accuracy becomes misleading. Instead, focus on metrics like precision, recall, F1 score, or AUROC.
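As a minimal sketch of this splitting strategy, the snippet below uses scikit-learn's stratified splits on a synthetic imbalanced dataset (the data and split sizes are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced dataset for illustration (~10% positives).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.1).astype(int)

# Stratified splits preserve the class ratio in each subset, so the held-out
# test set keeps the original (imbalanced) distribution seen in production.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, stratify=y_train_val, random_state=42
)
# If you deliberately rebalance classes (e.g., downsampling negatives to 1:4),
# do it only on the training set, never on the test set.
```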
Key Takeaways from This Section
- Define label criteria: Clearly specify what counts as the positive class and what counts as the negative class.
- Labeling strategy: Explain how you plan to obtain labels (e.g., human annotators, automated/LLM-based labeling) and discuss trade-offs.
- Dataset distribution: Set appropriate class distributions for train, validation, and test datasets, keeping the test set close to real-world production data.
3. Model Training
When you approach an ML problem, one of the most important steps is deciding what kind of model to use. The right choice depends a lot on the type of data you are working with and the requirements of the problem. For example, if your data is tabular (2D structured data with numerical or categorical features), models such as Random Forest or XGBoost often work very well. You could also try a neural network, but what matters most in an interview is being able to explain why you would pick one model over another.
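For the tabular case, a baseline is quick to sketch. The snippet below trains a Random Forest on synthetic data standing in for a loan-default table; the dataset, class ratio, and hyperparameters are illustrative assumptions, not a prescribed recipe.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a tabular loan-default dataset (features + 0/1 label).
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 10))
y = (rng.random(5_000) < 0.15).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" compensates for the skewed label distribution.
model = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

val_scores = model.predict_proba(X_val)[:, 1]
print("Validation AUROC:", roc_auc_score(y_val, val_scores))
```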
Think about a use case in banking, such as predicting whether a customer will default on a loan. In this scenario, the people using your model, like risk analysts or regulators, often care about why a prediction was made. Tree-based models are easier to interpret than neural networks, so they may be the better choice here.
Now consider a different example: text classification. You might start with a simple approach like TF-IDF, which represents text using word frequencies. But this method has limitations because it doesn't capture context or word meanings very well. That's when you could move to better techniques such as Word2Vec or even modern language models like BERT, which understand text in a much deeper way. Again, the key is to explain why you are choosing one approach instead of another.
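A TF-IDF baseline is also easy to show in code. The tiny corpus and labels below are made up purely for illustration; in practice the training data would be the labeled feedback reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; real training data would be labeled reviews.
texts = [
    "cannot log in after the update",
    "payment keeps failing at checkout",
    "love the new dark mode",
    "login page crashes every time",
]
labels = ["login_failure", "payment_failure", "other", "login_failure"]

# TF-IDF + logistic regression is a strong, cheap baseline to beat before
# moving to embedding-based models such as Word2Vec or BERT.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["app says my password is wrong"]))
```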
After you have picked your model architecture, the next step is to decide on the right loss function. If you are solving a binary classification problem, binary cross-entropy is a good choice. For multiclass classification, cross-entropy is commonly used. And if your task is about ranking, say ordering search results or recommendations, you will need to use a ranking-specific loss function.
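The mapping from task to loss is easy to demonstrate with PyTorch; the tensor shapes below are arbitrary, and the margin ranking loss is just one example of a pairwise ranking objective.

```python
import torch
import torch.nn as nn

# Binary classification: BCEWithLogitsLoss takes raw logits and 0/1 targets.
binary_loss = nn.BCEWithLogitsLoss()
logits = torch.randn(8, 1)                      # model outputs for 8 examples
targets = torch.randint(0, 2, (8, 1)).float()
print(binary_loss(logits, targets))

# Multiclass classification: CrossEntropyLoss takes (batch, num_classes) logits
# and integer class indices.
multiclass_loss = nn.CrossEntropyLoss()
logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))
print(multiclass_loss(logits, targets))

# Ranking: a pairwise margin loss pushes the relevant item's score above the
# irrelevant item's score by at least the margin.
ranking_loss = nn.MarginRankingLoss(margin=1.0)
relevant_scores = torch.randn(8)
irrelevant_scores = torch.randn(8)
target = torch.ones(8)                          # +1 means "first should rank higher"
print(ranking_loss(relevant_scores, irrelevant_scores, target))
```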
4. Model Evaluation Metrics
Once you have selected a suitable model architecture, the next step is to decide how you will evaluate the model's performance. The choice of evaluation metrics depends on the type of problem you are solving and the nature of your dataset.
For classification tasks, commonly used metrics include precision, recall, F1 score, and accuracy. However, accuracy is only reliable when the dataset is balanced. If the dataset is imbalanced (for example, when one class appears much more frequently than the other), metrics like precision, recall, and F1 score become more meaningful than accuracy.
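A small worked example makes the point; the labels below are made up so that a majority-class predictor looks deceptively accurate.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]   # imbalanced: 2 positives out of 10
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # a model that always predicts the majority class

# Accuracy looks fine (0.8) even though no positive case is ever found;
# precision, recall, and F1 expose the failure.
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, zero_division=0))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```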
For regression problems, you would typically evaluate the model using metrics such as Mean Squared Error (MSE), R-squared, or Adjusted R-squared, which capture how well the model fits the data and how much variance it explains.
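Here is a minimal sketch of those regression metrics on made-up predictions; the feature count used for Adjusted R-squared is purely illustrative.

```python
from sklearn.metrics import mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.8, 5.3, 2.9, 6.4]

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# Adjusted R-squared penalizes adding features that don't improve the fit:
# adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1), with n samples and p features.
n, p = len(y_true), 2                   # p = 2 is an illustrative feature count
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print("MSE:", mse, "R2:", r2, "Adjusted R2:", adj_r2)
```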
For ranking problems, the focus shifts to metrics that evaluate the quality of the ranked list. Examples include Precision@K and Recall@K, which measure the relevance of the top results, as well as more advanced metrics like Normalized Discounted Cumulative Gain (NDCG), which accounts for both the relevance and the position of items in the ranking.
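A short sketch of Precision@K and NDCG, using invented relevance scores; scikit-learn's `ndcg_score` handles the position-discounted comparison against the ideal ordering.

```python
import numpy as np
from sklearn.metrics import ndcg_score

def precision_at_k(relevances, k):
    """Fraction of the top-k ranked items that are relevant (1) vs. not (0)."""
    return float(np.mean(relevances[:k]))

# Relevance of items in the order the model ranked them (1 = relevant).
ranked_relevances = np.array([1, 0, 1, 1, 0, 0])
print("Precision@3:", precision_at_k(ranked_relevances, k=3))

# NDCG compares the model's ranking against the ideal ordering of the same
# graded relevance scores; sklearn expects 2D arrays (one row per query).
true_relevance = np.array([[3, 2, 3, 0, 1, 2]])
model_scores = np.array([[0.9, 0.8, 0.7, 0.6, 0.5, 0.4]])
print("NDCG:", ndcg_score(true_relevance, model_scores))
```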
5. Model Deployment and Monitoring
After selecting and evaluating your model, the next important step is to think about how the model will be deployed and monitored in production. In interviews, you are expected to discuss this part of the ML workflow as well.
The most common and straightforward way to deploy a model is to expose it through a REST API endpoint. This allows other applications or services to send input data and receive predictions in real time. For text-based models, remember that the tokenizer is an essential component of inference, not just the model weights. To avoid mismatches or ambiguity, you should package both the tokenizer and the model together when deploying.
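As one possible serving setup, here is a minimal FastAPI sketch. The endpoint name, request schema, and `load_pipeline` placeholder are assumptions for illustration; the key point is that the tokenizer and model are loaded together as a single versioned artifact.

```python
# Minimal sketch of serving a text classifier behind a REST endpoint.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Review(BaseModel):
    text: str

def load_pipeline():
    # In a real service, load a saved bundle that packages the tokenizer
    # together with the model weights, so inference preprocessing matches training.
    raise NotImplementedError("load your tokenizer + model bundle here")

# pipeline = load_pipeline()   # loaded once at startup in a real deployment

@app.post("/predict")
def predict(review: Review):
    # prediction = pipeline(review.text)
    prediction = "login_failure"        # placeholder response for this sketch
    return {"intent": prediction}
```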
Once the model is deployed, monitoring becomes critical. Monitoring ensures that your model continues to perform well in the real world, not just on the test set. For example, if you have built an intent classification model, you might want to track metrics such as weekly precision and the volume of predicted classes. If precision drops below a certain threshold, that is a signal that the model may need retraining.
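A monitoring check of that kind can be very simple. The threshold and weekly sample below are illustrative assumptions; in practice the labels would come from a small manually reviewed slice of production traffic.

```python
from sklearn.metrics import precision_score

PRECISION_THRESHOLD = 0.85   # illustrative alerting threshold

def weekly_precision_check(y_true, y_pred) -> bool:
    """Compute precision on a week's labeled sample and flag degradation."""
    precision = precision_score(y_true, y_pred, zero_division=0)
    needs_retraining = precision < PRECISION_THRESHOLD
    if needs_retraining:
        print(f"ALERT: weekly precision {precision:.2f} is below {PRECISION_THRESHOLD}")
    return needs_retraining

# Example with a small hand-labeled weekly sample.
print(weekly_precision_check([1, 1, 0, 1, 0, 1], [1, 1, 1, 1, 0, 0]))
```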
For a ranking model, monitoring might involve setting up dashboards that display metrics such as NDCG or even business-impact metrics like revenue lift generated by the model.
Similarly, when working with tabular data, you should consider monitoring feature drift (changes in the input feature distribution over time). This can be measured using statistical tools such as KL divergence. These ideas can also be extended to unstructured data (text, images, audio), where distribution drift can still harm performance.
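A minimal drift check for a single numeric feature might look like this; the bin count and the synthetic "training vs. live" distributions are illustrative assumptions.

```python
import numpy as np
from scipy.stats import entropy

def kl_divergence(reference, current, bins=20):
    """Approximate KL(reference || current) for one numeric feature by
    histogramming both samples over a shared range of bins."""
    lo = min(reference.min(), current.min())
    hi = max(reference.max(), current.max())
    p, _ = np.histogram(reference, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(current, bins=bins, range=(lo, hi), density=True)
    eps = 1e-9                           # avoid zero-probability bins
    return entropy(p + eps, q + eps)     # scipy normalizes and computes KL

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=10_000)   # distribution at training time
live_feature = rng.normal(0.5, 1.2, size=10_000)    # shifted production distribution

print("KL divergence:", kl_divergence(train_feature, live_feature))
```

A rising KL divergence over successive monitoring windows is a signal to investigate the pipeline or retrain the model.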
In addition to model-level metrics, it is also valuable to track business metrics tied to the success of the system. For example, for a recommendation system, this might include click-through rate or conversion rate. For a fraud detection system, it could be the number of fraudulent cases prevented. Mentioning such business metrics during an interview shows that you understand not just the ML pipeline, but also how the system contributes to organizational goals.