An eval harness is the framework that runs a model against a dataset of cases and records the scores.
It loads cases, calls the model, applies scoring, and aggregates the results so runs are repeatable. A good harness makes it cheap to re-test after every change. It is the backbone of an evaluation workflow.