
scorebook.evaluate

Model evaluation functionality for the Scorebook framework.

This module provides the core evaluation logic to assess model predictions against ground truth labels using configurable metrics. It supports:

  • Batch evaluation of models across multiple datasets
  • Flexible metric computation and aggregation
  • Optional parameter sweeping and experiment tracking
  • Customizable inference functions

The main entry point is the evaluate() function, which handles running models on datasets and computing metric scores.
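
A minimal usage sketch (the import path, the item schema, and the prediction format are assumptions for illustration; only the evaluate() parameters documented below are taken from the API):

from scorebook import evaluate  # import path assumed for illustration

def my_inference(items, **hyperparameters):
    # Hypothetical inference callable: receives a list of evaluation items
    # and returns one prediction per item. The item schema and the prediction
    # values here are placeholders, not part of the documented API.
    return ["positive" for _ in items]

results = evaluate(
    inference=my_inference,
    datasets="my-dataset",  # a dataset identifier, or an EvalDataset instance
    sample_size=100,        # sample 100 items from each dataset
)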

evaluate

def evaluate(
    inference: Callable,
    datasets: Union[str, EvalDataset, List[Union[str, EvalDataset]]],
    hyperparameters: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
    experiment_id: Optional[str] = None,
    project_id: Optional[str] = None,
    metadata: Optional[Dict[str, Any]] = None,
    upload_results: Union[Literal["auto"], bool] = "auto",
    sample_size: Optional[int] = None,
    parallel: bool = False,
    return_dict: bool = True,
    return_aggregates: bool = True,
    return_items: bool = False,
    return_output: bool = False,
) -> Union[Dict, List]

Evaluate a model, optionally across a collection of hyperparameter configurations, over one or more datasets with the specified metrics.

Arguments:

  • inference - A callable that runs model inference over a list of evaluation items.
  • datasets - One or more evaluation datasets to run evaluation on.
  • hyperparameters - Optional hyperparameter configuration, or list of configurations (a grid), to evaluate; see the sweep sketch after this list.
  • experiment_id - Optional ID of the experiment to upload results to on Trismik's dashboard.
  • project_id - Optional ID of the project to upload results to on Trismik's dashboard.
  • metadata - Optional metadata to attach to the evaluation.
  • upload_results - If True, uploads results to Trismik's dashboard.
  • sample_size - Optional number of items to sample from each dataset.
  • parallel - If True, runs evaluation in parallel. Requires the inference callable to be async.
  • return_dict - If True, returns the evaluation results as a dict.
  • return_aggregates - If True, returns aggregate scores for each dataset.
  • return_items - If True, returns individual items for each dataset.
  • return_output - If True, returns model outputs for each dataset item evaluated.
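
A sketch of a hyperparameter sweep run in parallel. Because parallel=True requires an async inference callable, the callable below is a coroutine; the hyperparameter name "temperature" and the way configurations reach the callable are assumptions for illustration:

from scorebook import evaluate  # import path assumed for illustration

async def my_async_inference(items, **hyperparameters):
    # Hypothetical async inference callable; "temperature" is an assumed
    # hyperparameter name, not part of the documented API.
    temperature = hyperparameters.get("temperature", 0.0)
    # A real callable would pass `temperature` to the model; here we return
    # a fixed placeholder prediction per item.
    return ["positive" for _ in items]

results = evaluate(
    inference=my_async_inference,
    datasets=["dataset-a", "dataset-b"],   # evaluate over multiple datasets
    hyperparameters=[                      # configurations to sweep over
        {"temperature": 0.0},
        {"temperature": 0.7},
    ],
    parallel=True,                         # requires an async inference callable
)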

Returns:

Union[Dict, List, EvalResult]: The evaluation results, in the format specified by the return parameters:

  • If return_dict=False: returns an EvalResult object containing all run results.
  • If return_dict=True: returns the evaluation results as a dict.
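
For example, to return per-item results and raw model outputs alongside the aggregate scores, the return flags can be combined (a sketch; the structure of the returned dict beyond these flags is not specified here):

results = evaluate(
    inference=my_inference,     # the callable from the earlier sketch
    datasets="my-dataset",
    return_aggregates=True,     # aggregate scores for each dataset
    return_items=True,          # individual items for each dataset
    return_output=True,         # model outputs for each evaluated item
)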