
scorebook.evaluate

Model evaluation functionality for the Scorebook framework.

This module provides the core evaluation logic to assess model predictions against ground truth labels using configurable metrics. It supports:

  • Batch evaluation of models across multiple datasets
  • Flexible metric computation and aggregation
  • Optional parameter sweeping and experiment tracking
  • Customizable inference functions

The main entry point is the evaluate() function, which handles running models on datasets and computing metric scores.
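
A minimal usage sketch (the import path, the item schema, and the prediction format are assumptions for illustration; only the evaluate() parameters documented below are taken from the API):

from scorebook import evaluate  # import path assumed for illustration

def my_inference(items, **hyperparameters):
    # Hypothetical inference callable: receives a list of evaluation items
    # and returns one prediction per item. The item schema and the prediction
    # values here are placeholders, not part of the documented API.
    return ["positive" for _ in items]

results = evaluate(
    inference=my_inference,
    datasets="my-dataset",  # a dataset identifier, or an EvalDataset instance
    sample_size=100,        # sample 100 items from each dataset
)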

evaluate

def evaluate(
    inference: Callable,
    datasets: Union[str, EvalDataset, List[Union[str, EvalDataset]]],
    hyperparameters: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
    experiment_id: Optional[str] = None,
    project_id: Optional[str] = None,
    metadata: Optional[Dict[str, Any]] = None,
    upload_results: Union[Literal["auto"], bool] = "auto",
    sample_size: Optional[int] = None,
    parallel: bool = False,
    return_dict: bool = True,
    return_aggregates: bool = True,
    return_items: bool = False,
    return_output: bool = False,
) -> Union[Dict, List]

Evaluate a model, optionally across a collection of hyperparameter configurations, over one or more datasets with the specified metrics.

Arguments:

  • inference - A callable that runs model inference over a list of evaluation items.
  • datasets - One or more evaluation datasets to run evaluation on.
  • hyperparameters - Optional hyperparameter configuration, or list of configurations (a grid), to evaluate; see the sweep sketch after this list.
  • experiment_id - Optional ID of the experiment to upload results to on Trismik's dashboard.
  • project_id - Optional ID of the project to upload results to on Trismik's dashboard.
  • metadata - Optional metadata to attach to the evaluation.
  • upload_results - If True, uploads results to Trismik's dashboard.
  • sample_size - Optional number of items to sample from each dataset.
  • parallel - If True, runs evaluation in parallel. Requires the inference callable to be async.
  • return_dict - If True, returns the evaluation results as a dict.
  • return_aggregates - If True, returns aggregate scores for each dataset.
  • return_items - If True, returns individual items for each dataset.
  • return_output - If True, returns model outputs for each dataset item evaluated.
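
A sketch of a hyperparameter sweep run in parallel. Because parallel=True requires an async inference callable, the callable below is a coroutine; the hyperparameter name "temperature" and the way configurations reach the callable are assumptions for illustration:

from scorebook import evaluate  # import path assumed for illustration

async def my_async_inference(items, **hyperparameters):
    # Hypothetical async inference callable; "temperature" is an assumed
    # hyperparameter name, not part of the documented API.
    temperature = hyperparameters.get("temperature", 0.0)
    # A real callable would pass `temperature` to the model; here we return
    # a fixed placeholder prediction per item.
    return ["positive" for _ in items]

results = evaluate(
    inference=my_async_inference,
    datasets=["dataset-a", "dataset-b"],   # evaluate over multiple datasets
    hyperparameters=[                      # configurations to sweep over
        {"temperature": 0.0},
        {"temperature": 0.7},
    ],
    parallel=True,                         # requires an async inference callable
)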

Returns:

Union[Dict, List, EvalResult]: The evaluation results, in the format specified by the return parameters:

  • If return_dict=False: returns an EvalResult object containing all run results.
  • If return_dict=True: returns the evaluation results as a dict.
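
For example, to return per-item results and raw model outputs alongside the aggregate scores, the return flags can be combined (a sketch; the structure of the returned dict beyond these flags is not specified here):

results = evaluate(
    inference=my_inference,     # the callable from the earlier sketch
    datasets="my-dataset",
    return_aggregates=True,     # aggregate scores for each dataset
    return_items=True,          # individual items for each dataset
    return_output=True,         # model outputs for each evaluated item
)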