Results
Understanding and Working with Evaluation Results
When you run evaluate, Scorebook produces both aggregate scores and per-item scores.
The structure and level of detail can be controlled with the following flags (a usage sketch follows the list):
return_dict (default: True): return results as a dict/list (True) or as an EvalResult (False)
return_aggregates (default: True): include aggregate scores in the returned dict
return_items (default: False): include item scores in the returned dict
return_output (default: False): include each item's model output alongside its item scores
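For example, a call that requests per-item rows with model outputs attached might look like the sketch below; my_inference_function and qa_dataset are placeholder names, and only the flags themselves come from this page.

results = evaluate(
    my_inference_function,   # placeholder inference function
    qa_dataset,              # placeholder dataset
    return_dict=True,        # default: return a dict rather than an EvalResult
    return_aggregates=True,  # default: include aggregate scores
    return_items=True,       # also include one row per evaluated item
    return_output=True,      # attach each item's model output to its row
)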
Return Shapes
Dict Output
By default, results are returned as a dictionary with up to two sections. If return_dict=True and both return_aggregates and return_items are True, evaluate() returns a dictionary with two keys:
aggregate_results → list of dicts (one row per dataset × hyperparameter run)
item_results → list of dicts (one row per evaluated item)
If only one of return_aggregates or return_items is True, then the return value is a list containing just that section.
At least one of return_aggregates or return_items must be True when return_dict=True; otherwise a ParameterValidationError is raised.
With both sections enabled, the returned dictionary looks like this:
{
  "aggregate_results": [
    {
      "dataset": "qa_dataset",
      "run_completed": true,
      "temperature": 0.7,
      "accuracy": 0.81,
      "f1": 0.78
    }
  ],
  "item_results": [
    {
      "item_id": 0,
      "dataset_name": "qa_dataset",
      "temperature": 0.7,
      "accuracy": 1,
      "f1": 1
    }
  ]
}
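A rough sketch of reading these sections back out of the returned dictionary, using the same placeholder names as above and the metric columns shown in the example:

results = evaluate(my_inference_function, qa_dataset, return_items=True)

# With both sections enabled, the return value is a dict keyed by section name.
for row in results["aggregate_results"]:
    print(row["dataset"], row["accuracy"])

# If only one of return_aggregates / return_items were True, evaluate() would
# instead return just that list of rows.
print(len(results["item_results"]), "items scored")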
EvalResult Output
If the return_dict parameter in an evaluation is set to False, evaluate will return an EvalResult instance.
results: EvalResult = evaluate(inference_function, eval_dataset, return_dict=False)
results.scores # Dict[str, List[Dict[str, Any]]]
results.aggregate_scores # List[Dict[str, Any]] (same rows as aggregate_results)
results.item_scores # List[Dict[str, Any]] (same rows as item_results)
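As a quick illustration, and assuming the row keys match the dict example above, the aggregate rows can be iterated directly:

for row in results.aggregate_scores:
    # Each row carries the dataset name, any hyperparameters, and metric values.
    print(row["dataset"], row["accuracy"])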
Return Details
Outputs
Model outputs for each evaluation item can optionally be included with the return_output flag. They appear under "inference_output" within the item results of an evaluation.
"item_results": [
{
"item_id": 0,
"dataset_name": "basic_questions",
"inference_output": "4"
"temperature": 0.7,
"accuracy": 1,
"f1": 1
},
{
"item_id": 1,
"dataset_name": "basic_questions",
"inference_output": "Paris"
"temperature": 0.7,
"accuracy": 1,
"f1": 1
},
{
"item_id": 2,
"dataset_name": "basic_questions",
"inference_output": "William Shakespeare"
"temperature": 0.7,
"accuracy": 1,
"f1": 1
},
]
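A short sketch of retrieving those outputs, again with placeholder names for the inference function and dataset; the field names follow the example above.

results = evaluate(
    my_inference_function,
    basic_questions,      # placeholder dataset
    return_items=True,
    return_output=True,   # attach model outputs to each item row
)

for item in results["item_results"]:
    # "inference_output" holds the model output that was scored for this item.
    print(item["item_id"], item["inference_output"], item["accuracy"])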
Run Id
If using Trismik's services within Scorebook, any evaluation results uploaded to the Trismik dashboard will include a unique run_id for each run within the evaluation. An evaluation run refers to one evaluation dataset × hyperparameter configuration. For example:
{
  "dataset": "dataset",
  "run_id": "387b77604e21654f238c74ec3e12b25df33e89e7",
  "accuracy": 1.0
}