
Data Collection


Candidate: Do we already have labeled data for this task?
Interviewer: No, we don't. You need to think about how to collect labeled data.

Candidate: What proportion of posts are actually spam, and how many posts does the platform receive in a month?
Interviewer: Good question. Let's assume the platform receives about 1 million posts per month, and roughly 0.1% to 0.2% of those are spam.

To start building a spam classifier, the first requirement is to clearly define what counts as spam and what counts as non-spam. Once these guidelines are well-documented, we can use them to design prompts for an LLM (Large Language Model) that will help with data labeling. However, given that we may have millions of social media posts available, it's not practical to run the LLM across the entire dataset; it would be too expensive. Instead, the usual approach is to work with a random sample. For example, suppose we draw a sample of 100,000 posts from one month's data. If the expected spam rate is 0.1-0.2%, then out of those 100,000 posts, only about 100-200 are likely to be spam. Running the LLM prompt on this sample would give us a starting point with a small but usable set of labeled spam and non-spam posts.
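As a rough illustration, here is what this sampling-plus-LLM-labeling step could look like in Python. The call_llm helper and the guideline text inside the prompt are placeholders, not a specific vendor API or the actual moderation policy; plug in whatever provider and documented guidelines the team has.

```python
import random

# Hypothetical helper: stands in for whatever LLM API is available.
# This is not a real library call; wire it up to your own provider.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM provider here")

# The guideline bullet below is a placeholder; the real prompt should quote
# the documented spam/non-spam guidelines.
LABELING_PROMPT = """You are a content moderator. Using the guidelines below,
label the post as exactly one of: SPAM or NOT_SPAM.

Guidelines:
- Unsolicited advertising, scams, phishing links, repeated identical content.

Post:
{post}

Label:"""

def label_posts(posts, sample_size=100_000, seed=42):
    """Draw a random sample of posts and label each one with the LLM prompt."""
    random.seed(seed)
    sample = random.sample(posts, min(sample_size, len(posts)))
    labeled = []
    for post in sample:
        answer = call_llm(LABELING_PROMPT.format(post=post)).strip().upper()
        labeled.append((post, 1 if answer == "SPAM" else 0))
    return labeled  # list of (text, label) pairs, label 1 = spam
```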

Why not just take a bigger sample? The main constraint here is LLM cost. If labeling 100,000 samples costs, say, $100, then labeling 1 million samples would cost $1,000. So the actual sample size depends on how much budget is available for LLM inference. If the budget is higher, we can increase the sample size; if not, we need to work within the limits.
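The budget arithmetic can be captured in a one-line helper. The $100-per-100k figure is only the illustrative cost used above; substitute the real per-token pricing of whatever model is used.

```python
def affordable_sample_size(budget_usd, cost_per_100k_usd=100.0):
    """How many posts we can afford to label if cost scales linearly with volume."""
    return int(budget_usd / cost_per_100k_usd * 100_000)

# e.g. a $250 labeling budget at $100 per 100k posts -> 250,000 posts
print(affordable_sample_size(250))
```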

To train a reasonably good classifier, we will eventually need at least 5,000-6,000 positive (spam) samples. Here's how we can move toward that goal:
After the initial LLM labeling, let's say we obtain around 400-500 spam samples. With this, we can train a weak classifier, for instance by using pre-trained Word2Vec embeddings and a simple classifier such as logistic regression. This weak model won't be perfect, but it can help us scale. We can then run it on six months of historical data, flagging all posts predicted as spam. Suppose this produces 200,000 candidate spam posts. Even though this set will include some false positives, it is still a much denser collection of potential spam than random sampling.
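A minimal sketch of such a weak classifier, assuming gensim's pre-trained word2vec-google-news-300 vectors and scikit-learn's logistic regression; labeled_posts is the (text, label) output of the LLM labeling step above, and any other pre-trained embeddings would work the same way.

```python
import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

# Pre-trained embeddings; any pre-trained Word2Vec/GloVe vectors would do.
word_vectors = api.load("word2vec-google-news-300")

def embed(text):
    """Average the Word2Vec vectors of the tokens that are in the vocabulary."""
    tokens = [t for t in text.lower().split() if t in word_vectors]
    if not tokens:
        return np.zeros(word_vectors.vector_size)
    return np.mean([word_vectors[t] for t in tokens], axis=0)

def train_weak_classifier(labeled_posts):
    """labeled_posts: list of (text, label) pairs from the LLM labeling step."""
    X = np.vstack([embed(text) for text, _ in labeled_posts])
    y = np.array([label for _, label in labeled_posts])
    # Spam is the rare class even in the labeled sample, so weight it up.
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(X, y)
    return clf
```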
Next, we can apply the LLM again to a subset of these candidates, for example 50k posts. Because the weak classifier has already filtered them, the proportion of true spam in this batch will be much higher. After running the LLM prompt, we may get several thousand additional spam samples.
By repeating this process for 2-3 iterations, we can gradually build up a dataset of 6,000-8,000 verified spam samples, along with a much larger set of non-spam posts. This iterative strategy balances cost with efficiency and helps ensure we have enough positive samples for training.
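Putting the pieces together, the iterative loop might look roughly like this. It reuses the embed(), train_weak_classifier(), and label_posts() helpers sketched above; the score threshold and per-iteration budget are arbitrary knobs for illustration, not values from the scenario.

```python
import numpy as np

def bootstrap_spam_dataset(historical_posts, labeled_posts, n_iterations=3,
                           llm_budget_per_iter=50_000, score_threshold=0.8):
    """Grow the labeled set by alternating weak-classifier filtering and LLM verification.

    historical_posts: unlabeled post texts (e.g. six months of data)
    labeled_posts:    list of (text, label) pairs from the initial LLM pass
    """
    already_labeled = {text for text, _ in labeled_posts}
    for _ in range(n_iterations):
        clf = train_weak_classifier(labeled_posts)
        # Score every historical post and keep the likeliest spam candidates.
        features = np.vstack([embed(p) for p in historical_posts])
        spam_scores = clf.predict_proba(features)[:, 1]
        candidates = [p for p, s in zip(historical_posts, spam_scores)
                      if s >= score_threshold and p not in already_labeled]
        # Spend the per-iteration LLM budget only on this denser candidate pool.
        verified = label_posts(candidates, sample_size=llm_budget_per_iter)
        labeled_posts = labeled_posts + verified
        already_labeled.update(text for text, _ in verified)
    return labeled_posts
```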
If human annotators are available, they can be used to validate the LLM's outputs and fine-tune the prompt for better labeling accuracy. But in many real-world cases, human annotation is limited or unavailable, which is why LLM-assisted labeling is a practical alternative.
The next step is to split the dataset into training, validation, and test sets. Because the spam detection problem is naturally very imbalanced, where spam posts make up only a tiny fraction of all posts, we need to carefully think about how to structure these splits for both training and evaluation.
One option is to keep the original imbalance in all splits (with a 0.1-0.2% spam rate, that is roughly one spam post per 500-1,000 non-spam posts). While this reflects reality, training models directly on such skewed data can be very challenging. In these cases, you would usually need to adjust the loss function by assigning higher weights to the minority (spam) class, so the model doesn't simply ignore it. This is a valid approach and is often used in practice.
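With scikit-learn, for instance, this weighting can be expressed directly on the classifier; the exact weights are a tuning choice rather than a prescribed value, and X_train, y_train here stand for the imbalanced training split.

```python
from sklearn.linear_model import LogisticRegression

# "balanced" sets each class weight to n_samples / (n_classes * class_count),
# so the rare spam class automatically gets a much larger weight. An explicit
# dict such as {0: 1.0, 1: 500.0} (mirroring a ~1:500 imbalance) also works.
clf_weighted = LogisticRegression(max_iter=1000, class_weight="balanced")
# clf_weighted.fit(X_train, y_train)  # X_train, y_train: the imbalanced training split
```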
Another common strategy is to make the training and validation sets more balanced, such as a 1:5 or 1:6 ratio of spam to non-spam. This helps the model learn to recognize spam more effectively, since it sees enough positive examples during training. Meanwhile, you keep the test set distribution close to the production environment, which is usually highly imbalanced. This way, your evaluation metrics on the test set reflect how the model will actually perform in the real world.
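A small sketch of this second strategy: downsample non-spam posts only in the training (and optionally validation) data, and leave the test split at its natural distribution. The function name and the 1:5 default are illustrative.

```python
import numpy as np

def rebalance_training_split(texts, labels, neg_per_pos=5, seed=0):
    """Downsample non-spam so the training set has ~1 spam : neg_per_pos non-spam.

    Apply this to the training/validation splits only; the test split stays at
    its natural, highly imbalanced distribution so offline metrics reflect
    production.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos_idx = np.where(labels == 1)[0]
    neg_idx = np.where(labels == 0)[0]
    n_neg_keep = min(len(neg_idx), neg_per_pos * len(pos_idx))
    keep = np.concatenate([pos_idx, rng.choice(neg_idx, size=n_neg_keep, replace=False)])
    rng.shuffle(keep)
    return [texts[i] for i in keep], labels[keep]
```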