Assemble representative evaluation datasets
- Sample from multiple time windows and customer segments to capture seasonality and edge cases.
- Label datasets with provenance metadata and store them in versioned S3 prefixes.
- Maintain a balanced set of positive and negative outcomes to prevent metric skew.
