scorebook.evaluate
Model evaluation functionality for the Scorebook framework.
This module provides the core evaluation logic to assess model predictions against ground truth labels using configurable metrics. It supports:
- Batch evaluation of models across multiple datasets
- Flexible metric computation and aggregation
- Optional parameter sweeping and experiment tracking
- Customizable inference functions
The main entry point is the evaluate() function, which handles running models on datasets and computing metric scores.
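Example (a minimal sketch; the inference callable's signature, the item field "input", and the dataset name are illustrative assumptions, not part of this reference):

from scorebook.evaluate import evaluate

def my_inference(items, **hyperparameters):
    # Hypothetical callable: returns one prediction per evaluation item.
    return [item["input"].upper() for item in items]

results = evaluate(my_inference, datasets="my_dataset")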
evaluate
def evaluate(inference: Callable,
datasets: Union[str, EvalDataset, List[Union[str, EvalDataset]]],
hyperparameters: Optional[Union[Dict[str, Any],
List[Dict[str, Any]]]] = None,
experiment_id: Optional[str] = None,
project_id: Optional[str] = None,
metadata: Optional[Dict[str, Any]] = None,
upload_results: Union[Literal["auto"], bool] = "auto",
sample_size: Optional[int] = None,
parallel: bool = False,
return_dict: bool = True,
return_aggregates: bool = True,
return_items: bool = False,
return_output: bool = False) -> Union[Dict, List]
Evaluate a model over one or more datasets with specified metrics, optionally sweeping a collection of hyperparameter configurations.
Arguments:
inference
- A callable that runs model inference over a list of evaluation items.
datasets
- One or more evaluation datasets to run evaluation on.
hyperparameters
- Optional list of hyperparameter configurations or grid to evaluate.
experiment_id
- Optional ID of the experiment to upload results to on Trismik's dashboard.
project_id
- Optional ID of the project to upload results to on Trismik's dashboard.
metadata
- Optional metadata to attach to the evaluation.
upload_results
- If True, uploads results to Trismik's dashboard.
sample_size
- Optional number of items to sample from each dataset.
parallel
- If True, runs evaluation in parallel. Requires the inference callable to be async.
return_dict
- If True, returns eval results as a dict.
return_aggregates
- If True, returns aggregate scores for each dataset.
return_items
- If True, returns individual items for each dataset.
return_output
- If True, returns model outputs for each dataset item evaluated.
Returns:
Union[Dict, List, EvalResult]: The evaluation results, in the format specified by the return_* parameters:
- If return_dict=False: Returns an EvalResult object containing all run results.
- If return_dict=True: Returns the evaluation results as a dict.
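Example (a hedged sketch of a parallel run over a hyperparameter grid; the async callable's signature and the dataset names are assumptions, since only evaluate()'s own parameters are documented above):

from scorebook.evaluate import evaluate

async def my_async_inference(items, **hyperparameters):
    # Hypothetical async callable; a real implementation would call the
    # model concurrently over the batch of items.
    temperature = hyperparameters.get("temperature", 0.0)
    return [f"prediction(temperature={temperature})" for _ in items]

results = evaluate(
    my_async_inference,
    datasets=["dataset_a", "dataset_b"],
    hyperparameters=[{"temperature": 0.0}, {"temperature": 0.7}],
    parallel=True,        # parallel execution requires an async callable
    sample_size=100,      # sample 100 items from each dataset
    return_dict=True,
    return_aggregates=True,
)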