Data Preprocessing Tricks that I used for CTR Prediction in Recommender System

Brian Chan
4 min read · Sep 17, 2023
Photo by Justin Morgan on Unsplash

In the world of recommender systems, where the click-through rate (CTR) model reigns supreme, the construction of positive and negative samples plays a pivotal role in model performance. In this article, we’ll explore some practical tricks to optimize the process of creating these samples while preserving the quality and integrity of your data.

1. Avoid Sampling Unless Necessary

The golden rule when dealing with data for CTR models is simple: if there’s no compelling reason to sample, don’t. Your data sets the upper limit on what your model can learn, and unnecessary sampling can do more harm than good.

Unless you’re dealing with an overwhelming amount of data that makes training impractical, refrain from subsampling negative examples. Even then, be careful: improper sampling can introduce biases and hurt your model’s performance. Remember, when you sample, you alter the distribution of your data, and recalibrating your predictions afterward can be a cumbersome task.
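To make the recalibration point concrete: if negatives are downsampled at some rate, the model's predicted probabilities are inflated relative to the true CTR and must be mapped back. Below is a minimal sketch of the standard correction for negative downsampling; the function name and the sampling rate `w` are illustrative, not from the article.

```python
def calibrate(p_sampled: float, w: float) -> float:
    """Map a probability predicted on negative-downsampled data
    back to the original (unsampled) distribution.

    p_sampled: model output trained on data where each negative
               was kept with probability w (0 < w <= 1).
    """
    return p_sampled / (p_sampled + (1.0 - p_sampled) / w)

# Example: keeping only 10% of negatives inflates predicted CTR;
# calibration undoes that inflation.
p = calibrate(0.5, 0.1)  # a "0.5" under sampling is ~0.09 originally
```

Note that with `w = 1.0` (no sampling) the function is the identity, which is exactly why skipping sampling lets you skip this extra, error-prone step.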

Some may argue that sampling is necessary when dealing with users who exhibit disproportionately high levels of activity. However, if your data is accurate and well-distributed, the impact of these users should be manageable without resorting to…

