Dataset Creation for Model Training


We need to create labels (the dependent variable) as well as user and video features (the independent variables).

1. Labeled Data Creation
There are a few strategies we can adopt to generate labeled data:

  • If a user has liked or shared a video, we can mark it as a positive sample.
  • If a user has watched at least 50% of a video (discuss the ideal threshold with the interviewer), we can also consider that as a positive sample.
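
The two rules above can be sketched as a small labeling function. This is a minimal illustration, not a production pipeline: the `Interaction` record, its field names, and the 0.5 default threshold are assumptions made for the example, and the threshold in particular should be agreed with the interviewer.

```python
from dataclasses import dataclass

# Assumed default; the right watch-fraction threshold is a discussion point.
WATCH_FRACTION_THRESHOLD = 0.5


@dataclass
class Interaction:
    """Hypothetical log record for one user-video interaction."""
    user_id: str
    video_id: str
    liked: bool
    shared: bool
    watch_seconds: float
    video_seconds: float


def is_positive(event, threshold=WATCH_FRACTION_THRESHOLD):
    """Return True if the interaction counts as a positive sample."""
    # Rule 1: an explicit like or share is a positive signal.
    if event.liked or event.shared:
        return True
    # Guard against malformed records with no usable video length.
    if event.video_seconds <= 0:
        return False
    # Rule 2: the user watched at least `threshold` of the video.
    return event.watch_seconds / event.video_seconds >= threshold


def positive_pairs(events, threshold=WATCH_FRACTION_THRESHOLD):
    """Collect the distinct (user_id, video_id) pairs labeled positive."""
    return {(e.user_id, e.video_id) for e in events if is_positive(e, threshold)}
```

Returning a set deduplicates repeat interactions, so a user who both liked and rewatched a video contributes a single positive pair.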

If you plan to use a model such as matrix factorization to learn user embeddings, these positive labels should suffice. More sophisticated models such as the Two-Tower architecture need both positive and negative classes. To generate negatives, you can randomly sample 5 to 8 negative user-video pairs for each positive user-video pair.
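
One possible way to implement the random negative sampling is sketched below. It assumes that a "negative" is a video drawn uniformly from the catalog that is not among the user's positives; the function name `sample_negatives`, the `seed` parameter, and the `(user_id, video_id, label)` row layout are illustrative choices, not part of the original design.

```python
import random


def sample_negatives(positives, all_video_ids, k=5, seed=0):
    """Build labeled rows: each positive pair plus up to k random negatives.

    positives: iterable of (user_id, video_id) positive pairs.
    all_video_ids: the video catalog to draw negatives from.
    k: negatives per positive (the text suggests 5 to 8).
    """
    rng = random.Random(seed)  # seeded for reproducible datasets

    # Index each user's positive videos so they are never drawn as negatives.
    positives_by_user = {}
    for user_id, video_id in positives:
        positives_by_user.setdefault(user_id, set()).add(video_id)

    catalog = sorted(all_video_ids)
    rows = []
    for user_id, video_id in sorted(positives):
        rows.append((user_id, video_id, 1))
        candidates = [v for v in catalog if v not in positives_by_user[user_id]]
        # Sample without replacement; fall back to fewer if the catalog is tiny.
        for negative_id in rng.sample(candidates, min(k, len(candidates))):
            rows.append((user_id, negative_id, 0))
    return rows
```

In this sketch the same negative video can be drawn for two different positives of the same user, and rebuilding the candidate list for every positive is only practical for small catalogs. Both points are worth raising if the interviewer asks about scale or sampling bias.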

2. Independent Variable Creation
We need to create user and video features:

  • User Features:

    user_id: Unique identifier for the user (may not be directly useful for ML model training).
    user_history: A record of all videos the user has previously watched. To convert it into useful ML features, take the videos watched over the last six months and compute the following:

    • Percentage of short, medium, and long-form videos watched.
    • Percentage of videos with more than 10k interactions (helps identify affinity towards viral content).
    • Percentage of educational vs. entertainment videos. You can define additional categories such as finance, history, tech, etc., and generate corresponding features.
    • Percentage of videos watched on mobile vs. desktop.
    • Use the titles and descriptions of the videos as input to a text encoder model to get embeddings. Then, apply an embedding aggregation strategy (e.g., max pooling) to generate a single embedding vector for each user.
    is_premium: Indicates whether the user has a premium account.
    account_age: Age of the user account.

    You can discuss additional features with the interviewer that may improve model performance.
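
The user_history features listed above can be sketched as follows. This is a rough illustration under several assumptions: the history is already filtered to the last six months, each video carries a precomputed text embedding from some text encoder, and the `WatchedVideo` fields, the duration bucket boundaries (60 s and 600 s), and the category names are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import List

# Assumed bucket boundaries and threshold; tune these for the real product.
SHORT_MAX_SECONDS = 60
MEDIUM_MAX_SECONDS = 600
VIRAL_MIN_INTERACTIONS = 10_000


@dataclass
class WatchedVideo:
    """Hypothetical record for one video in a user's watch history."""
    duration_seconds: float
    interactions: int
    category: str                 # e.g. "education", "entertainment"
    device: str                   # "mobile" or "desktop"
    text_embedding: List[float]   # title + description embedding, precomputed


def fraction(history, predicate):
    """Share of videos in the history that satisfy the predicate."""
    return sum(1 for v in history if predicate(v)) / len(history)


def user_history_features(history):
    """Return (scalar feature dict, max-pooled text embedding) for one user."""
    if not history:
        raise ValueError("history must contain at least one video")

    features = {
        "frac_short": fraction(history, lambda v: v.duration_seconds <= SHORT_MAX_SECONDS),
        "frac_medium": fraction(
            history,
            lambda v: SHORT_MAX_SECONDS < v.duration_seconds <= MEDIUM_MAX_SECONDS,
        ),
        "frac_long": fraction(history, lambda v: v.duration_seconds > MEDIUM_MAX_SECONDS),
        "frac_viral": fraction(history, lambda v: v.interactions > VIRAL_MIN_INTERACTIONS),
        "frac_education": fraction(history, lambda v: v.category == "education"),
        "frac_entertainment": fraction(history, lambda v: v.category == "entertainment"),
        "frac_mobile": fraction(history, lambda v: v.device == "mobile"),
    }

    # Max pooling: take the element-wise maximum across all video embeddings.
    pooled = [max(dim) for dim in zip(*(v.text_embedding for v in history))]
    return features, pooled
```

The scalar fractions and the pooled embedding can then be concatenated with is_premium and account_age to form the user feature vector. Mean pooling is a drop-in alternative to max pooling if a smoother summary of the history is preferred.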

  • Video Features:

    For each video, you can consider the following features:

    • Video frame features: require a video frame encoder model (to be discussed in the model training section).
    • Title and description features: require a text encoder model.
    • Length of the video.