Counterfactual evaluation replays historical sessions under alternate ranking or bidding policies without exposing customers to risky changes. By recording propensities and intervention points, you can estimate exposure, click, and conversion impacts credibly. We discuss assumptions, variance reduction, and places where off-policy estimators break, plus validation tricks that benchmark results against small online tests so stakeholders trust conclusions enough to prioritize genuine fixes over inconclusive dashboard noise.
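As a concrete illustration of the replay idea, here is a minimal sketch of a self-normalized inverse propensity scoring (SNIPS) estimator over logged sessions. The log schema (`context`, `action`, `propensity`, `reward`) and the `new_policy_prob` callable are hypothetical names chosen for this example, not part of any specific system; self-normalization is one of the variance-reduction choices the abstract alludes to.

```python
def snips_estimate(logs, new_policy_prob):
    """Self-normalized inverse propensity scoring (SNIPS) over logged data.

    Each log entry is assumed to be a dict with the context, the action the
    logging policy actually showed, the recorded propensity of that action,
    and the observed reward (e.g. a click or conversion indicator).
    """
    num, den = 0.0, 0.0
    for entry in logs:
        # Importance weight: how much more (or less) likely the candidate
        # policy is to take the logged action than the logging policy was.
        w = new_policy_prob(entry["context"], entry["action"]) / entry["propensity"]
        num += w * entry["reward"]
        den += w
    # Self-normalizing by the summed weights trades a small bias for a
    # large variance reduction relative to plain IPS.
    return num / den if den > 0 else 0.0


# Toy usage: uniform logging policy over two actions, candidate policy
# that always picks action "a".
logs = [
    {"context": None, "action": "a", "propensity": 0.5, "reward": 1.0},
    {"context": None, "action": "b", "propensity": 0.5, "reward": 0.0},
]
always_a = lambda ctx, act: 1.0 if act == "a" else 0.0
estimate = snips_estimate(logs, always_a)
```

Note the estimator breaks exactly where the abstract warns: when the candidate policy puts mass on actions the logging policy almost never took, propensities approach zero and the weights explode, which is why validation against small online tests matters.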
Synthetic queries, produced by generators or curated by experts, stress coverage of niche attributes, ambiguous intents, or sensitive catalog boundaries. Paired with shadow indices that mirror production data but isolate particular signals, these toolkits reveal how representations drift, which features overfit, and where recall collapses. We describe building maintainable sets, sampling strategies, and acceptance criteria that prevent regressions while encouraging bold, responsible iteration across retrieval, ranking, and rewriting components.
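One simple generator pattern is a template crossed with attribute slots, gated by a regression floor. The slot values, templates, and the `min_pass_rate` threshold below are illustrative assumptions, not a prescribed toolkit; a real set would draw slots from the catalog and version them alongside the acceptance criteria.

```python
from itertools import product

# Hypothetical attribute slots and templates stressing niche combinations.
COLORS = ["burgundy", "teal"]
MATERIALS = ["linen", "cork"]
TEMPLATES = ["{color} {material} sofa", "{material} chair in {color}"]


def generate_queries():
    """Yield every template filled with every color/material combination."""
    for template in TEMPLATES:
        for color, material in product(COLORS, MATERIALS):
            yield template.format(color=color, material=material)


def passes_acceptance(results_by_query, min_pass_rate=0.8):
    """Acceptance gate for a shadow-index run.

    results_by_query maps each synthetic query to its retrieved hits; the
    fraction of queries returning at least one hit must not regress below
    the floor. Real criteria would also check ranking quality, not just
    non-empty recall.
    """
    if not results_by_query:
        return False
    hits = sum(1 for hits_list in results_by_query.values() if hits_list)
    return hits / len(results_by_query) >= min_pass_rate
```

Keeping generation deterministic (no random sampling at CI time) is what makes a set like this maintainable: a failure always reproduces, so regressions can be bisected to the signal change that caused them.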
Split-market audits divide traffic or inventory along stable, business-motivated partitions to reveal uneven performance without inferring personal characteristics. By comparing exposure, cost, and outcome parity between well-defined segments, you can detect structural disadvantages early. We outline pitfalls like leakage, seasonality, and confounding promotions, then show governance patterns for escalation, remediation ownership, and transparent communication so partners understand both the diagnosis and the concrete steps you are taking.
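The parity comparison at the core of such an audit can be sketched as a per-segment rate check against the best-performing segment. The input shape and the `alert_ratio` threshold here are assumptions for illustration; a production audit would also slice by time window to control for the seasonality and promotion confounds mentioned above.

```python
def parity_report(segment_metrics, alert_ratio=0.8):
    """Flag segments whose outcome rate trails the best segment.

    segment_metrics: {segment_name: (exposures, conversions)} computed over
    a stable, business-motivated partition (e.g. region or inventory class),
    not inferred personal characteristics.
    """
    rates = {
        seg: conversions / exposures
        for seg, (exposures, conversions) in segment_metrics.items()
        if exposures > 0  # skip segments with no traffic rather than divide by zero
    }
    if not rates:
        return {}
    best = max(rates.values())
    # A segment is flagged when its rate falls below alert_ratio of the best
    # segment's rate; the flag triggers escalation, not automatic remediation.
    return {
        seg: {"rate": rate, "flag": rate < alert_ratio * best}
        for seg, rate in rates.items()
    }
```

Ratios against the best segment are easier to communicate to partners than raw rates, since they state the diagnosis ("south converts at half the rate of north") in the same terms as the remediation target.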