Data Preprocessing Tricks That I Used for CTR Prediction in Recommender Systems

Brian Chan
4 min read · Sep 17

In the world of recommender systems, where the click-through rate (CTR) model reigns supreme, the construction of positive and negative samples plays a pivotal role in model performance. In this article, we’ll explore some practical tricks to optimize the process of creating these samples while preserving the quality and integrity of your data.

1. Avoid Sampling Unless Necessary

The golden rule when dealing with data for CTR models is simple: if there’s no compelling reason to sample, don’t. The data sets the upper limit on what your model can learn, and unnecessary sampling can do more harm than good.

Unless you’re dealing with an overwhelming amount of data that’s causing training difficulties, refrain from subsampling negative examples. Even then, be careful, as improper sampling can introduce biases and negatively impact your model’s performance. Remember, when you sample, you’re altering the distribution of your data, and recalibrating your predictions can be a cumbersome task.
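To illustrate why recalibration matters, here is a minimal sketch of the standard correction for uniformly downsampled negatives. It assumes that positives are all kept and that each negative is kept independently with the same probability; the function name `recalibrate` and the parameter `neg_keep_rate` are made up for this example and are not from any particular library.

```python
def recalibrate(p_sampled: float, neg_keep_rate: float) -> float:
    """Map a CTR predicted on negative-downsampled data back to the original scale.

    Assumes all positives were kept and each negative was kept with
    probability `neg_keep_rate` (hypothetical parameter name).
    """
    if not 0.0 < neg_keep_rate <= 1.0:
        raise ValueError("neg_keep_rate must be in (0, 1]")
    if not 0.0 <= p_sampled <= 1.0:
        raise ValueError("p_sampled must be a probability")
    # Downsampling negatives multiplies the odds of a click by 1 / neg_keep_rate,
    # so this undoes that inflation.
    return p_sampled / (p_sampled + (1.0 - p_sampled) / neg_keep_rate)


# Example: a true CTR of 1% with 10% of negatives kept.
true_ctr = 0.01
keep = 0.1
# CTR as seen in the sampled data: positives / (positives + kept negatives).
ctr_in_sample = true_ctr / (true_ctr + (1.0 - true_ctr) * keep)
corrected = recalibrate(ctr_in_sample, keep)
```

Every prediction has to pass through a correction like this before it can be compared with real click rates or fed into downstream ranking and bidding logic, which is part of why sampling adds overhead.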

Some may argue that sampling is necessary when dealing with users who exhibit disproportionately high levels of activity. However, if your data is accurate and well-distributed, the impact of these users should be manageable without resorting to sampling. If all else fails, consider incorporating rule-based strategies into your approach.

Data is precious, and unless there’s a compelling reason to sample, it’s best to utilize every piece of user feedback. Avoid errors and maintain data integrity, and you’ll often find that this is the best strategy of all.

2. Filtering Sessions with No Clicks

Sessions in which users don’t perform any clicks (or other interactions) are not ideal candidates for negative samples. The items exposed in such sessions usually come from high-interest scenarios or were recommended by the system, so labeling them as negatives can make the model drift away from the user’s real interests.

During training, it’s advisable to filter out sessions in which no positive interaction occurred. If a user fails to click on any item in a batch of recommendations, it is likely that they are not browsing the items…
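The session filter described in this section can be sketched as follows. This is only an illustration under assumptions: the impression log is a list of dictionaries, and the field names `session_id`, `item_id`, and `clicked` are hypothetical, as is the function name `drop_clickless_sessions`.

```python
def drop_clickless_sessions(impressions):
    """Keep only impressions that belong to sessions with at least one click."""
    # First pass: collect the sessions that contain a positive interaction.
    clicked_sessions = {
        row["session_id"] for row in impressions if row["clicked"] == 1
    }
    # Second pass: keep every impression (positive or negative) from those sessions.
    return [row for row in impressions if row["session_id"] in clicked_sessions]


# Hypothetical impression log: session s1 has a click, session s2 has none.
impressions = [
    {"session_id": "s1", "item_id": "a", "clicked": 0},
    {"session_id": "s1", "item_id": "b", "clicked": 1},
    {"session_id": "s2", "item_id": "c", "clicked": 0},
    {"session_id": "s2", "item_id": "d", "clicked": 0},
]

training_rows = drop_clickless_sessions(impressions)
```

In this sketch, the unclicked item from session s1 stays in the training set as a negative, while both impressions from session s2 are dropped.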
