Quick Start
A Quick Start Guide to Running Evaluations With Scorebook
Getting started with Scorebook is simple. Install it into your project via pip:
pip install scorebook
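To confirm the installation, check that the package imports cleanly:
python -c "import scorebook"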
A Simple Scorebook Evaluation Example
The following example demonstrates the three core steps in a Scorebook evaluation:
- Creating an evaluation dataset
- Defining an inference callable
- Running an evaluation
The full implementation of this simple example can be found in example 1.
1) Creating an Evaluation Dataset
An evaluation dataset can be created from a list of evaluation items. The model being evaluated uses each evaluation item to generate a prediction, which is scored against the item's label value.
from scorebook import EvalDataset
# Create a list of evaluation items
evaluation_items = [
{"question": "What is 2 + 2?", "answer": "4"},
{"question": "What is the capital of France?", "answer": "Paris"},
{"question": "Who wrote Romeo and Juliet?", "answer": "William Shakespeare"}
]
# Create an evaluation dataset
evaluation_dataset = EvalDataset.from_list(
    name="basic_questions",    # Dataset name
    label="answer",            # Key for the label value in evaluation items
    metrics="accuracy",        # Metric/Metrics used to calculate scores
    data=evaluation_items,     # List of evaluation items
)
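The metrics argument above is labelled "Metric/Metrics", which suggests a dataset can be scored with more than one metric. A minimal sketch, assuming metrics also accepts a list of metric names:
# Hedged sketch: multiple metrics (assumes the metrics argument accepts a list)
multi_metric_dataset = EvalDataset.from_list(
    name="basic_questions",
    label="answer",
    metrics=["accuracy"],      # list of metric names, if supported by your Scorebook version
    data=evaluation_items,
)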
2) Defining an Inference Callable
An inference callable can be implemented as a function, method, or class. Its role is to handle the inference process and return a list of model predictions for a list of evaluation items.
Scorebook is model-agnostic, so you can plug in any model or framework. In this example, we use Hugging Face’s Transformers library to run a local Phi-4-mini-instruct model.
An inference callable in Scorebook must:
- Accept a list of evaluation items
- Accept hyperparameters as **kwargs
- Return a list of predictions
from typing import Any, Dict, List

import transformers
# Create a model
pipeline = transformers.pipeline(
"text-generation",
model="microsoft/Phi-4-mini-instruct",
model_kwargs={"torch_dtype": "auto"},
device_map="auto",
)
# Define an inference function
def inference_function(evaluation_items: List[Dict[str, Any]], **hyperparameters: Any) -> List[Any]:
"""Return a list of model predictions for a list of evaluation items."""
predictions = []
for evaluation_item in evaluation_items:
# Transform evaluation item into valid model input format
messages = [
{
"role": "system",
"content": hyperparameters.get("system_message"),
},
{"role": "user", "content": evaluation_item.get("question")},
]
# Run inference on the item
output = pipeline(messages, temperature=hyperparameters.get("temperature"))
# Extract and collect the output generated from the model's response
predictions.append(output[0]["generated_text"][-1]["content"])
return predictions
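Before running a full evaluation, the inference function can be sanity-checked by calling it directly on the evaluation items from step 1:
# Optional sanity check: run the inference function directly
sample_predictions = inference_function(
    evaluation_items,
    system_message="Answer the question directly and concisely.",
    temperature=0.7,
)
print(sample_predictions)  # expected to resemble ["4", "Paris", "William Shakespeare"]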
3) Running an Evaluation
A model can be evaluated by calling the evaluate function with an inference callable and an evaluation dataset. Hyperparameters can optionally be passed in as a dict.
from scorebook import evaluate
# Evaluate a model against an evaluation dataset
results: List[Dict[str, Any]] = evaluate(
inference_function, # The inference function we defined
evaluation_dataset, # The evaluation dataset we created
hyperparameters={
"temperature": 0.7,
"system_message": "Answer the question directly and concisely.",
},
)
By default, evaluate returns results as a list of dicts, with one result per evaluation. In this simple example, the evaluate call runs only one evaluation: a single model (Phi-4-mini-instruct) against a single evaluation dataset.
Example Results:
[
{
"dataset": "basic_questions",
"run_completed": true,
"temperature": 0.7,
"accuracy": 1
}
]
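Each result is a plain dict, so individual scores can be read out directly (the keys below match the example output above):
# Print the score for each evaluation result
for result in results:
    print(f"{result['dataset']}: accuracy = {result['accuracy']}")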
If you made it this far, congrats! You have completed your first Scorebook evaluation 🎉