
Impression Logs and Exposure Data: the Missing Half of Recommender Evaluation

Impression logs (exposure data) turn clicks into accountable metrics. Learn the request_id join pattern that makes recommender evaluation and debugging sane.

Aatu Harju · 14 min read

When I started building RecSys, I assumed the hard part would be ranking: if I could score items well, the rest would follow.

The first real wall I hit had nothing to do with ML. We had outcomes (clicks, purchases, playbacks), but we didn’t have a reliable record of what we actually showed. In other words: we could see what users did, but we couldn’t prove what the system did to users’ screens.

That sounds like a minor detail until you try to answer a basic question like: “Did our recommendations improve last week?” If all you have is clicks, you’re one missing denominator away from fooling yourself.

Why impression logs are the missing denominator

A click is not a vote in a vacuum. It’s a reaction to something that was visible, in a particular position, inside a particular UI, at a particular moment.

If you don’t log exposures, you can’t attribute outcomes. You can’t tell whether clicks went up because the ranking improved, because the UI changed, because traffic shifted, because the catalog changed, or because you simply rendered more recommendation slots.

That missing piece is exposure data, often called impression logs: the ordered set of items the system presented to the user at a specific moment. In the literature you’ll also see “exposures” and “slates” used for essentially the same concept. The important point isn’t the vocabulary. The important point is that recommendations are an intervention, not just a prediction. (arXiv)

Clicks alone lie to you (in two predictable ways)

There are many sources of bias in implicit feedback. Two of them are enough to break most “we’ll just use clicks” evaluation setups.

Position bias. People interact more with items that appear higher in a list, even when those items aren’t truly more relevant. If you don’t log rank positions, you can mistake “we put it higher” for “people prefer it.” This is not a theoretical edge case; it’s a well-studied, persistent effect. (Anne Schuth)

Exposure bias. Users can only interact with items they were exposed to. Items that weren’t shown are not automatically “irrelevant”; they’re often just underexposed. This is the quiet trap behind a lot of offline-evaluation disappointment: the dataset never contains evidence for items the system never gave a chance.
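Both biases become measurable once rank is in the exposure log. Here is a sketch of a position-bias diagnostic: CTR per rank slot, computed from joined logs. The event shapes and field names (`request_id`, `items`, `rank`) are illustrative, not a prescribed schema.

```python
from collections import Counter

def ctr_by_rank(exposures, clicks):
    """Position-bias diagnostic: CTR per rank slot. A steep drop-off
    by rank, even across otherwise similar items, is the signature of
    position bias. Event shapes here are illustrative."""
    shown = Counter()    # rank -> number of times that slot was filled
    hit = Counter()      # rank -> number of clicks on that slot
    rank_of = {}         # (request_id, item_id) -> rank at exposure time

    for exp in exposures:
        for it in exp["items"]:
            shown[it["rank"]] += 1
            rank_of[(exp["request_id"], it["item_id"])] = it["rank"]

    for c in clicks:
        rank = rank_of.get((c["request_id"], c["item_id"]))
        if rank is not None:
            hit[rank] += 1

    return {rank: hit[rank] / shown[rank] for rank in sorted(shown)}
```

If slot 1 dwarfs everything else regardless of what you put there, that gap is the bias you have to correct for before trusting click-based relevance.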

The meta-point is simple: a recommender system doesn’t merely observe behavior. It shapes the evidence it later trains and evaluates on. If you want evaluation you can trust, you must keep “what we showed” separate from “what happened next.”

What an impression log actually is (and why I’m picky about naming)

I like the term “impression log” because it forces clarity.

An impression (or exposure) is a served list: a selection of N items generated by the system and presented in some arrangement. An outcome is a user event: click, conversion, playback, add-to-cart, whatever matters in your product.

These are related, but they are not the same event. If you merge them too early, you lose the ability to debug, evaluate, and audit.

In practice, the impression log I care about is boring by design:

It tells you which items were shown, in which order, when, and in what context.

Boring is good. Boring is how you get honest numbers.

The join key that makes everything sane

Once you accept that exposures and outcomes are separate, the next question is: how do you reliably connect them?

In RecSys, the answer is embarrassingly simple: treat recommendation serving as a traceable request, and propagate a join key across the entire loop.

You generate a request_id at recommendation time. The service returns it alongside the ranked items. The service writes an exposure event that includes the same request_id plus the ordered items it served. Then the client emits outcome events (clicks, conversions) that carry that same request_id, so you can join “what we showed” with “what happened.” (recsys.app)
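That loop fits in a few lines of serving code. This is a minimal sketch of the pattern, not RecSys’s actual API; the function name, log format, and fields are all assumptions for illustration.

```python
import json
import time
import uuid

def serve_recommendations(user_id, ranked_items, log_file):
    """Sketch of the request_id join pattern: one ID, generated once
    at serving time, ties the response and the exposure log entry
    together. All names here are illustrative."""
    request_id = f"req-{uuid.uuid4()}"

    # 1. Write the exposure event: what we are about to show, in order.
    exposure = {
        "request_id": request_id,
        "user_id": user_id,              # a pseudonymous ID, never raw PII
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "items": [{"item_id": item, "rank": i + 1}
                  for i, item in enumerate(ranked_items)],
    }
    log_file.write(json.dumps(exposure) + "\n")

    # 2. Return the same request_id with the items, so the client can
    #    stamp it onto every outcome event (click, conversion, ...).
    return {"request_id": request_id, "items": ranked_items}
```

The client’s only job is to echo `request_id` on every outcome event it emits for that render. Nothing else about the pattern is clever.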

This is the moment where evaluation stops feeling like “ML magic” and starts feeling like basic accounting.

If you want your KPIs to mean anything, you need two things:

First, exposures as the denominator. CTR is literally clicks divided by exposures, and conversion rate is conversions divided by exposures. (minimum instrumentation)

Second, join integrity. If request_id propagation is broken, you can compute numbers, but you can’t trust them. (minimum instrumentation)

Here’s the essence, stripped down:

```json
{"request_id": "req-1", "user_id": "u_hash_1", "ts": "2026-02-05T10:00:00Z", "items": [{"item_id": "item_1", "rank": 1}, {"item_id": "item_2", "rank": 2}]}
{"request_id": "req-1", "user_id": "u_hash_1", "item_id": "item_2", "event_type": "click", "ts": "2026-02-05T10:00:03Z"}
```

Once you have this, you can answer questions like “did we improve last week?” without turning the postmortem into a guessing contest.
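The join itself is a few lines. A sketch, with field names matching the illustrative events above (your schema may differ), that computes CTR with exposures as the denominator and reports join rate as a health signal:

```python
def ctr_with_join_rate(exposures, outcomes):
    """Join outcome events to exposure events on request_id.
    CTR = clicks / exposed items; join_rate flags broken propagation.
    Field names follow the illustrative events, not a fixed schema."""
    shown = {}   # request_id -> set of item_ids that were exposed
    for exp in exposures:
        shown[exp["request_id"]] = {it["item_id"] for it in exp["items"]}

    clicks, orphans = 0, 0
    for out in outcomes:
        rid = out.get("request_id")
        if rid not in shown:
            orphans += 1   # an outcome we cannot attribute to any exposure
            continue
        if out["event_type"] == "click" and out["item_id"] in shown[rid]:
            clicks += 1

    exposed = sum(len(items) for items in shown.values())
    join_rate = 1 - orphans / len(outcomes) if outcomes else 1.0
    return {"ctr": clicks / exposed if exposed else 0.0,
            "join_rate": join_rate}
```

A falling `join_rate` is the first alarm to watch: it means the numbers you compute downstream are built on outcomes you can no longer attribute.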

The practical traps that quietly ruin attribution

Most teams don’t fail at exposure logging because they forgot to log items. They fail because the log is not faithful to what the user actually saw.

Multiple renders and re-use. If you re-use the same request_id across multiple renders, you smear attribution. If you generate request_id twice (once for the API call and again for logging), your join rate collapses. (minimum instrumentation)

Served is not always seen. A system can return items, but the UI can fail to render them, truncate them, or replace them. This is why it helps to define a strict meaning for an exposure event in your product: “this list was rendered on screen” versus “this list was returned by the API.” If you can’t enforce render-level logging, at least be explicit about which one you’re doing.

Context drift. If you don’t include surface identifiers, segments, and version fields, you can’t explain changes. If your stakeholders ask “what changed?”, “the model” is rarely a satisfying answer. (exposure logging)

Privacy sloppiness. Exposure logging is powerful, so it must be handled with care. Stable pseudonymous IDs, no raw PII in logs, and clear retention rules are not “enterprise overhead.” They are part of shipping this responsibly. (exposure logging)

None of this is glamorous. But it’s the difference between observability and mythology.

What exposure logging unlocks after you get it right

Once you have impression logs you trust, three doors open. Each door is useful on its own. Together, they’re the foundation for serious evaluation.

Offline evaluation that’s less dishonest

Offline evaluation will always be an approximation, because logged data comes from a particular historical system in a particular historical UI.

But once you can join outcomes to exposures, you can compute ranking proxies (like NDCG-style metrics against your chosen outcome signal), sanity-check join rates, and detect shifts that are actually UI or traffic changes. That’s also why many evaluation guides emphasize being explicit about what ground truth means in your setting. (Evidently AI)
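As a concrete example of such a proxy, here is a minimal NDCG@k over a single served list, with binary relevance defined as “this exposed item got the outcome we care about.” This is a generic sketch of the standard formula, not RecSys-specific code.

```python
import math

def ndcg_at_k(ranked_items, clicked_items, k=10):
    """NDCG-style ranking proxy from joined logs: gain is 1 if the
    exposed item received the chosen outcome, else 0. Note this
    measures behavior under exposure, not true relevance."""
    gains = [1.0 if item in clicked_items else 0.0
             for item in ranked_items[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaged over requests (and sliced by surface and segment), this tells you whether the outcomes you did observe landed near the top of what you served.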

You’re still not measuring true relevance. You’re measuring behavior under exposure. The difference is that now you can admit it clearly and improve from there.

Online experiments with credible attribution

A/B testing is the honest way to answer “did this new ranking improve outcomes?” But even A/B testing falls apart if you can’t reliably attribute outcomes to the right exposure.

With request-level joins, you can measure lift by surface, by segment, by variant, and you can explain to stakeholders what changed and where it changed. That’s how experimentation stops being a political argument and becomes an engineering discipline.
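The slicing falls out of the join almost for free, provided the exposure event carries a variant field. A sketch, with illustrative field names:

```python
from collections import defaultdict

def ctr_by_variant(exposures, clicks):
    """A/B attribution sketch: because every exposure carries its
    variant and request_id, each outcome is credited to the exact
    variant that produced it. Field names are illustrative."""
    shown = defaultdict(int)     # variant -> items exposed
    hit = defaultdict(int)       # variant -> clicks attributed
    variant_of = {}              # request_id -> variant

    for exp in exposures:
        variant_of[exp["request_id"]] = exp["variant"]
        shown[exp["variant"]] += len(exp["items"])

    for c in clicks:
        v = variant_of.get(c["request_id"])
        if v is not None:
            hit[v] += 1

    return {v: hit[v] / shown[v] for v in shown}
```

The same grouping works for surface and segment: anything stamped on the exposure event becomes a dimension you can explain lift by.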

Counterfactual evaluation that has a fighting chance

Counterfactual evaluation tries to estimate what would have happened if we had shown a different set of recommendations, without running the experiment.

The classic tool here is Inverse Propensity Scoring (IPS) and related methods. They depend on knowing, or estimating, the probability that the logging policy would have shown a particular item in a particular context. In practice, that propensity question is exactly where teams realize they needed better exposure logging yesterday. (Eugene Yan)
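To make the dependency concrete, here is the vanilla IPS estimator in its simplest per-item form (no slate structure, no weight clipping). Every record needs the logging policy’s propensity, which is exactly what exposure logs must capture; the record shape is illustrative.

```python
def ips_estimate(logged, target_prob):
    """Vanilla inverse propensity scoring: estimate the average reward
    a new (target) policy would have earned, from logs gathered under
    the old (logging) policy. Record shape is illustrative:
    rec = {"context": ..., "item": ..., "reward": 0 or 1,
           "propensity": P(logging policy shows item | context)}"""
    total = 0.0
    for rec in logged:
        # Re-weight each logged outcome by how much more (or less)
        # likely the target policy was to produce that exposure.
        weight = target_prob(rec["context"], rec["item"]) / rec["propensity"]
        total += weight * rec["reward"]
    return total / len(logged)
```

The fragile part is the denominator: if propensities are missing, wrong, or ever zero for items the target policy would show, the estimate is anywhere from noisy to meaningless, which is why the logging has to come first.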

I don’t treat counterfactual methods as a free lunch. I treat them as a reminder: the more you want to infer, the more you must measure.

Why this becomes a product question, not just an ML question

If you’re shipping recommendations in a real system, you’re not only building a model. You’re building something that will be questioned.

Someone will ask why the feed changed. Someone will ask whether the new model hurt revenue. Someone will ask whether the algorithm is doing something weird. Someone in compliance or security will ask what data you store and whether you can audit decisions. These questions aren’t annoying. They’re reality.

This is where I’ve found a surprisingly helpful mental move: treat exposure logging as a practice of honesty. Not moral honesty, operational honesty. “What happened, exactly?” is the only question that consistently reduces suffering in production.

When your system can answer that question, the team calms down. When it can’t, people start projecting stories onto graphs. And that’s how you end up arguing with ghosts.

A five-minute thought experiment

Imagine you get paged because recommendations look wrong.

Now ask yourself what evidence you can pull in five minutes.

If the answer is “we have clicks,” you’re going to spend the next day inventing narratives and then debating which narrative feels most plausible.

If the answer is “we have impression logs, we can join exposures to outcomes by request_id, and we can reproduce what we served,” you’re going to debug like an engineer instead of a fortune-teller.

You can copy this pattern into any stack, even if you never use RecSys itself. But if you want a reference implementation that treats exposure logging as a first-class default (including quickstart wiring and minimum instrumentation guardrails), that’s the mindset I baked into RecSys docs and schemas. (recsys.app, quickstart)

If you’re building or evaluating a recommender and want a sanity check on your instrumentation, reply with your situation.
