The two-tower model consists of three components:
- User feature encoder
- Video encoder
- Aggregation layer (dense layers)
We can deploy all three components separately. Because a video's embedding does not depend on the requesting user, the video encoder can run offline over all videos, and the resulting embeddings can be stored in a video feature store.
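As a rough illustration of the offline step, the sketch below runs a stand-in video encoder over a small catalog and writes the results into an in-memory dictionary that plays the role of the video feature store. The names `toy_video_encoder` and `build_video_feature_store`, the weights, and the catalog data are all hypothetical; a real system would load the trained video tower and write to a persistent store.

```python
from typing import Callable, Dict, List

Vector = List[float]


def toy_video_encoder(features: Vector) -> Vector:
    """Hypothetical stand-in for the trained video tower.

    Applies a fixed linear projection from 2 raw features to a 3-dim embedding.
    """
    weights = [
        [0.5, -0.2],
        [0.1, 0.3],
        [0.0, 1.0],
    ]
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def build_video_feature_store(
    video_features: Dict[str, Vector],
    encoder: Callable[[Vector], Vector],
) -> Dict[str, Vector]:
    """Offline batch job: encode every video once and key the result by video ID."""
    return {video_id: encoder(feats) for video_id, feats in video_features.items()}


# Assumed raw features for two videos (illustrative values only).
catalog = {"v1": [1.0, 2.0], "v2": [0.0, 1.0]}
video_store = build_video_feature_store(catalog, toy_video_encoder)
```

Since this job runs outside the request path, it can be rescheduled whenever the video encoder is retrained or new videos arrive, without affecting serving latency.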
The diagram below illustrates the deployment of the system.
When a user request arrives, we first compute the user features, then call the user encoder, which generates the user embedding. Video embeddings are precomputed and read from the video feature store. The user and video embeddings are then passed to the aggregation layer to compute the final score.
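The request-time flow described above can be sketched as follows. This is a minimal illustration, not the actual serving code: `compute_user_features`, `user_encoder`, `aggregation_layer`, the weights, and the contents of `VIDEO_FEATURE_STORE` are all assumptions made for the example, and the aggregation layer is reduced to one small hidden layer followed by a sigmoid.

```python
import math
from typing import Dict, List

Vector = List[float]

# Assumed to be filled by an offline job; stands in for the video feature store.
VIDEO_FEATURE_STORE: Dict[str, Vector] = {
    "v1": [0.1, 0.7],
    "v2": [-0.2, 0.3],
}


def compute_user_features(user_id: str) -> Vector:
    """Hypothetical feature lookup; unknown users get a zero vector."""
    return {"u1": [1.0, 0.0]}.get(user_id, [0.0, 0.0])


def user_encoder(features: Vector) -> Vector:
    """Hypothetical stand-in for the user tower (scales each feature by 2)."""
    return [2.0 * x for x in features]


def aggregation_layer(user_emb: Vector, video_emb: Vector) -> float:
    """Toy dense layers: concatenate, one ReLU hidden layer, sigmoid output."""
    x = user_emb + video_emb  # list concatenation, not element-wise addition
    hidden_weights = [
        [0.5, 0.0, 1.0, 0.0],
        [0.0, 0.5, 0.0, 1.0],
    ]
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in hidden_weights]
    output_weights = [1.0, -1.0]
    logit = sum(w * h for w, h in zip(output_weights, hidden))
    return 1.0 / (1.0 + math.exp(-logit))


def score_videos(user_id: str, video_ids: List[str]) -> Dict[str, float]:
    """Online path: encode the user once, then score each candidate video."""
    user_emb = user_encoder(compute_user_features(user_id))
    return {
        video_id: aggregation_layer(user_emb, VIDEO_FEATURE_STORE[video_id])
        for video_id in video_ids
    }


scores = score_videos("u1", ["v1", "v2"])
ranked = sorted(scores, key=scores.get, reverse=True)
```

Note that the user encoder runs once per request, while the per-video work is limited to a store lookup and a pass through the aggregation layer.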