In this first stage, the system starts with a potentially huge corpus and generates a much smaller subset of candidates. For example, YouTube's candidate generator reduces billions of videos to hundreds or thousands. Because the corpus is so large, the model must evaluate queries quickly. A single system may also employ multiple candidate generators, each nominating a different subset of candidates.
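To make the retrieval step concrete, here is a minimal sketch of candidate generation over a corpus of video embeddings. All sizes, dimensions, and the function name `retrieve_candidates` are illustrative assumptions, not actual YouTube values; the key idea is that a cheap dot-product score plus a partial top-k selection lets us scan a large corpus quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus: 1M video embeddings and one user embedding
# (sizes and dimensions are illustrative).
corpus = rng.standard_normal((1_000_000, 32)).astype(np.float32)
user = rng.standard_normal(32).astype(np.float32)

def retrieve_candidates(user_vec, video_matrix, k=500):
    """Score every video by dot product and keep the top-k as candidates."""
    scores = video_matrix @ user_vec              # one score per video
    top_k = np.argpartition(scores, -k)[-k:]      # O(n) partial selection
    return top_k[np.argsort(scores[top_k])[::-1]]  # sort only the k winners

candidates = retrieve_candidates(user, corpus, k=500)
```

In production this exhaustive scan is usually replaced by an approximate nearest-neighbor index, but the input/output contract is the same: one user vector in, a few hundred candidate IDs out.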
We can train this component with a Two-Tower model. The problem can be formulated as binary classification: has the user interacted with a video or not? If a user has interacted with a video, the user-video pair is labeled 1; otherwise, it is labeled 0.
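The formulation above can be sketched in a few lines of numpy. This is a deliberately toy version, with each tower reduced to a single linear projection and one hand-written SGD step on the binary cross-entropy loss; all sizes and names (`W_user`, `W_video`, etc.) are assumptions for illustration. A real two-tower model would use deep towers and a framework such as TensorFlow or PyTorch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy feature vectors (hypothetical sizes) and interaction labels.
n_pairs, user_dim, video_dim, emb_dim = 256, 16, 24, 8
user_feats = rng.standard_normal((n_pairs, user_dim))
video_feats = rng.standard_normal((n_pairs, video_dim))
labels = rng.integers(0, 2, n_pairs).astype(float)  # 1 = interacted, 0 = not

# Each tower projects its features into a shared embedding space.
W_user = rng.standard_normal((user_dim, emb_dim)) * 0.1
W_video = rng.standard_normal((video_dim, emb_dim)) * 0.1

def forward(uf, vf):
    u_emb = uf @ W_user                      # user tower
    v_emb = vf @ W_video                     # video tower
    logits = np.sum(u_emb * v_emb, axis=1)   # dot-product score per pair
    return u_emb, v_emb, logits

def bce(logits, y):
    p = 1 / (1 + np.exp(-logits))
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

# One gradient step, to show the shape of the training loop.
lr = 0.1
u_emb, v_emb, logits = forward(user_feats, video_feats)
loss_before = bce(logits, labels)
p = 1 / (1 + np.exp(-logits))
grad_logits = (p - labels) / n_pairs
W_user -= lr * user_feats.T @ (grad_logits[:, None] * v_emb)
W_video -= lr * video_feats.T @ (grad_logits[:, None] * u_emb)
loss_after = bce(forward(user_feats, video_feats)[2], labels)
```

The dot product between the two tower outputs is what makes the architecture efficient at serving time: video embeddings can be precomputed and indexed, and only the user tower runs per query.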
In this stage, another model scores and ranks the candidate videos to select a final set (typically around 10) to display to the user. Since this model evaluates a relatively small subset of items, it can leverage more sophisticated algorithms and additional user/video features for higher precision.
We can extract the user and video embeddings from the Two-Tower model trained in the first step. At this stage, we can choose a stricter objective, such as whether the user watched the complete video or at least 80% of it. During inference, the model trained in the scoring step ranks the candidate videos produced by the generation step and selects the top 10 most relevant ones to display.
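A sketch of that inference path, under assumed shapes: the retrieval stage hands over a few hundred candidate embeddings, a ranking model contributes a prediction per candidate (here faked with random numbers standing in for a predicted watch fraction), and we combine the two signals to pick the top 10. The combination weight and the name `rank_top_n` are illustrative choices, not a prescribed formula.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical inputs: a user embedding, 500 candidate video embeddings
# from the retrieval stage, and a per-candidate ranking-model prediction.
user_emb = rng.standard_normal(8)
cand_embs = rng.standard_normal((500, 8))
watch_frac_pred = rng.random(500)  # stand-in for the scoring model's output

def rank_top_n(user_vec, cand_vecs, extra_score, n=10):
    """Blend embedding similarity with the ranking model's prediction."""
    sim = cand_vecs @ user_vec
    score = sim + 2.0 * extra_score  # the weight is illustrative
    order = np.argsort(score)[::-1]  # best first
    return order[:n]

top10 = rank_top_n(user_emb, cand_embs, watch_frac_pred)
```

Because only ~500 items reach this stage, the scoring model can afford far more features and computation per item than the candidate generator.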
Finally, the system must account for additional constraints to produce the final ranking. For example, it may remove videos that the user has explicitly disliked or boost newer content. Re-ranking also helps to ensure diversity, freshness, and fairness in the recommendations.
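The two constraints mentioned above, a hard filter on explicit dislikes and a soft boost for recent uploads, can be sketched as a small post-processing pass. The video IDs, scores, threshold, and boost value below are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone

now = datetime(2024, 1, 15, tzinfo=timezone.utc)

# Output of the scoring stage: (video_id, score, upload_time), illustrative.
ranked = [
    ("v1", 0.92, now - timedelta(days=400)),
    ("v2", 0.90, now - timedelta(days=2)),
    ("v3", 0.88, now - timedelta(days=30)),
]
disliked = {"v1"}

def rerank(items, disliked_ids, freshness_boost=0.05, fresh_within_days=7):
    out = []
    for vid, score, uploaded in items:
        if vid in disliked_ids:               # hard filter: explicit dislikes
            continue
        if (now - uploaded).days <= fresh_within_days:
            score += freshness_boost          # soft boost: newer content
        out.append((vid, score))
    return sorted(out, key=lambda x: x[1], reverse=True)

final = rerank(ranked, disliked)
```

Here the freshness boost lifts the two-day-old video above an older, slightly higher-scored one, while the disliked video is dropped entirely; diversity and fairness adjustments would slot into the same pass.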