Quick Compare SDK

The trismik Python package lets you drive the same QuickCompare flow you'd run in the web UI — upload a dataset, create an evaluation across one or more models with the metrics you choose, wait for it to finish, and export the results — directly from your own code. It's the right tool when you want to script evaluations, wire them into CI, or compare many model/prompt combinations programmatically.

Installation

pip install trismik

The package requires Python 3.10 or newer and is published on PyPI as trismik.

Authentication

The SDK reads its API key from the environment:

  • TRISMIK_API_KEY (required): your API key. You can find it in your profile.

export TRISMIK_API_KEY="your-api-key"

You can also pass the key explicitly when constructing a client:

from trismik import TrismikAsyncClient

client = TrismikAsyncClient(api_key="your-api-key")

If no key is found in either the argument or the environment, the client raises a TrismikError.

Clients

QuickCompare is exposed as a sub-client accessed through the main Trismik client. Both async and sync variants exist; the examples below are async, which mirrors the SDK's primary API.

from trismik import TrismikAsyncClient

async with TrismikAsyncClient() as client:
    quick_compare = client.quick_compare  # TrismikQuickCompareAsyncClient
    balance = await quick_compare.get_account_balance()

The synchronous equivalent uses TrismikClient, a plain with block, and drops async/await:

from trismik import TrismikClient

with TrismikClient() as client:
    quick_compare = client.quick_compare  # TrismikQuickCompareClient
    balance = quick_compare.get_account_balance()

When you obtain quick_compare from the parent client it reuses the parent's HTTP connection, so let the parent's with / async with block handle cleanup — don't call aclose() / close() on the sub-client yourself.

Quickstart

The example below runs the full lifecycle: check your balance, upload a local CSV as a dataset, create an evaluation across two models with an LLM-as-Judge metric plus accuracy, wait for completion, and request a download URL for the results.

The data/qa.csv file used below is the sample dataset from the SDK's examples/data directory — a simple two-column question,answer CSV. For the accepted dataset formats, see the Datasets guide, which also covers the agent skill for turning raw data into an upload-ready file.

import asyncio
import secrets
from pathlib import Path

from trismik import (
    TrismikAsyncClient,
    TrismikEvaluation,
    TrismikEvaluationDatasetUser,
    TrismikEvaluationMetricAccuracy,
    TrismikEvaluationMetricLLMAJ,
    TrismikEvaluationModelInputTemplate,
)


def log_progress(evaluation: TrismikEvaluation) -> None:
    pct = evaluation.progress.percentage if evaluation.progress else 0.0
    print(f"  [{evaluation.status}] {pct:.1f}%")


async def main() -> None:
    async with TrismikAsyncClient() as client:
        quick_compare = client.quick_compare

        # 1) Check your account balance
        balance = await quick_compare.get_account_balance()
        print(
            f"Balance: ${balance.balance_usd:.2f} "
            f"(held ${balance.held_usd:.2f}, available ${balance.available_usd:.2f})"
        )

        run_suffix = secrets.token_hex(4)  # keep names unique per run

        # 2) Upload a local CSV (init -> S3 upload -> finalize, all in one call)
        dataset = await quick_compare.upload_user_dataset(
            Path("data/qa.csv"),
            name=f"example-qa-{run_suffix}",
        )
        print(f"Uploaded dataset {dataset.id} ({dataset.status}, {dataset.row_count} rows)")

        # 3) Create the evaluation
        evaluation = await quick_compare.create_evaluation(
            name=f"quick-compare-demo-{run_suffix}",
            dataset=TrismikEvaluationDatasetUser(dataset_id=dataset.id),
            model_input=TrismikEvaluationModelInputTemplate(
                template="Q: {{ row.question }}\nA:",
            ),
            metrics=[
                # An LLM-as-Judge metric, when present, must come first
                TrismikEvaluationMetricLLMAJ(
                    judge_model="anthropic/claude_sonnet_4.5",
                    rubric="Rate factual accuracy from 1 to 5.",
                ),
                TrismikEvaluationMetricAccuracy(gold_column="answer"),
            ],
            models=["openai/gpt-4.1_mini", "amazon/nova_2_lite"],
            budget_limit_usd=2.0,
            sample_limit={"test": 50},
        )
        print(f"Created evaluation {evaluation.id}")

        # 4) Wait for completion, logging progress along the way
        completed = await quick_compare.wait_for_evaluation(
            evaluation.id,
            on_progress=log_progress,
        )
        for score in completed.scores:
            print(f"  {score.model} | {score.metric}: {score.value:.3f}")

        # 5) Request a presigned URL to download the full results
        export = await quick_compare.get_evaluation_export_url(
            completed.id, format="parquet"
        )
        print(f"Export ({export.status}): {export.url}")


if __name__ == "__main__":
    asyncio.run(main())
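
The quickstart prints one line per score. When comparing several models it can be handier to group the scores by model first. The sketch below is one way to do that, assuming each score exposes model, metric, and value attributes as in the loop above, and that model and metric are plain strings. pivot_scores is a hypothetical helper, and the records are stand-ins with made-up values, not real results.

```python
from collections import defaultdict
from types import SimpleNamespace


def pivot_scores(scores) -> dict[str, dict[str, float]]:
    """Group flat score records into {model: {metric: value}}."""
    table: dict[str, dict[str, float]] = defaultdict(dict)
    for score in scores:
        table[score.model][score.metric] = score.value
    return dict(table)


# Stand-in records carrying the three attributes the quickstart prints.
# The values are illustrative only.
example_scores = [
    SimpleNamespace(model="model-a", metric="accuracy", value=0.5),
    SimpleNamespace(model="model-a", metric="llmaj", value=4.0),
    SimpleNamespace(model="model-b", metric="accuracy", value=0.25),
]

for model, metrics in pivot_scores(example_scores).items():
    print(model, metrics)
```

With the real SDK you would pass completed.scores instead of example_scores.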

Operations reference

The methods below are available on the QuickCompare client (client.quick_compare). Arguments after * are keyword-only.

Datasets

  • upload_user_dataset(source, *, name, file_type=None, content_type=None) — Upload a dataset end-to-end (init → S3 upload → finalize) and return the finalized dataset. source may be a file path (str / Path), raw bytes, or a binary file-like object. Supported file_type values are csv, tsv, json, jsonl, and parquet; when source is a path the type is inferred from its extension, so file_type is only required for bytes/file-like sources.
  • init_user_dataset(*, name, file_type, size_bytes) and finalize_user_dataset(dataset_id) — The manual two-step upload that upload_user_dataset wraps. Use these if you want to manage the S3 upload yourself.
  • list_user_datasets(*, page=1, limit=20, sort_by=None, sort_order=None, search=None) — Paginated list of your datasets.
  • get_user_dataset(dataset_id) — A single dataset with its inferred column metadata.
  • list_user_dataset_rows(dataset_id, *, page=1, limit=50) — Paginated dataset rows.

Evaluations

  • create_evaluation(*, name, dataset, model_input, metrics, models, budget_limit_usd, sample_limit=None, dataset_name=None) — Create an evaluation and return the TrismikEvaluation record. See Building blocks below for the dataset, model_input, metrics, and models arguments. sample_limit is an optional per-split cap, e.g. {"test": 50}.
  • get_evaluation(evaluation_id) — Fetch a detailed evaluation, including progress, per-model/split batches, and aggregated scores.
  • list_evaluations(*, page=1, limit=20, sort_by=None, sort_order=None, search=None) — Paginated slim view of your evaluations.
  • wait_for_evaluation(evaluation_id, *, poll_interval=10.0, timeout=3600.0, on_progress=None) — Poll until the evaluation reaches a terminal status and return it. poll_interval must be at least 6.0 seconds (the API rate-limits reads to 10 per 60 s per resource). on_progress is an optional callable invoked with the latest TrismikEvaluation after each poll. Raises TimeoutError if timeout (in seconds; None disables it) elapses, and TrismikApiError if the evaluation ends in a failed state.
  • get_evaluation_export_url(evaluation_id, *, format="parquet") — Request a presigned download URL for the results. format is one of parquet, csv, or jsonl. If status is "generating" the URL isn't ready yet — retry shortly.
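
The "retry shortly" advice for get_evaluation_export_url can be wrapped in a small loop. The following is a minimal sketch, not part of the SDK: wait_for_export, delay, and max_attempts are hypothetical names, and the default 6-second delay is chosen to stay within the 10-reads-per-60-seconds limit noted above. The fetch function is injected so the helper does not depend on the client directly.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def wait_for_export(
    fetch_export: Callable[[], Awaitable[Any]],
    *,
    delay: float = 6.0,
    max_attempts: int = 10,
) -> Any:
    """Call fetch_export until its result is no longer 'generating'."""
    for attempt in range(max_attempts):
        export = await fetch_export()
        if export.status != "generating":
            return export
        if attempt < max_attempts - 1:
            # Pause between reads to stay under the per-resource rate limit.
            await asyncio.sleep(delay)
    raise TimeoutError(f"Export still generating after {max_attempts} attempts")
```

With the real client, usage might look like export = await wait_for_export(lambda: quick_compare.get_evaluation_export_url(evaluation_id, format="csv")).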

Account

  • get_account_balance() — Return a TrismikAccountBalance with balance_usd, held_usd (in-flight holds), and available_usd.
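
A pre-flight check against the balance can avoid creating an evaluation that cannot be funded. The sketch below uses a hypothetical stand-in dataclass with the three documented fields, and it assumes that budget_limit_usd is what needs to be covered by available_usd. How the service actually places holds is not described here.

```python
from dataclasses import dataclass


@dataclass
class Balance:
    """Hypothetical stand-in for TrismikAccountBalance (same three fields)."""

    balance_usd: float
    held_usd: float
    available_usd: float


def can_afford(balance: Balance, budget_limit_usd: float) -> bool:
    """Return True if the available funds cover the evaluation's budget limit."""
    # Assumption: the budget limit is compared against available_usd.
    return balance.available_usd >= budget_limit_usd
```

With the real SDK, the object returned by get_account_balance() could be passed in place of the stand-in, since only the available_usd attribute is read.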

Building blocks

These typed objects describe an evaluation. Import them from the top-level trismik package.

Dataset source

Type | Use it for
TrismikEvaluationDatasetUser(dataset_id) | A dataset you uploaded with upload_user_dataset.
TrismikEvaluationDatasetHF(repo, sub_dataset=None, splits=None) | A public HuggingFace dataset, e.g. TrismikEvaluationDatasetHF(repo="allenai/winogrande", sub_dataset="winogrande_xs", splits=["train"]).

Model input

Type | Use it for
TrismikEvaluationModelInputColumn(column) | Send a single dataset column as the model input.
TrismikEvaluationModelInputTemplate(template) | Compose the input from a template that references dataset columns.

Templates use the same {{ row.<name> }} syntax described in the Jinja templates guide.
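
To preview what a template will produce for a given row before spending budget, a rough local substitute can help. The sketch below is a deliberately simplified stand-in, not Jinja: preview_prompt is a hypothetical helper that handles only plain {{ row.<name> }} lookups and ignores filters, conditionals, and every other Jinja feature.

```python
import re

# Matches {{ row.<name> }} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*row\.(\w+)\s*\}\}")


def preview_prompt(template: str, row: dict[str, object]) -> str:
    """Substitute {{ row.<name> }} placeholders with values from one row."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in row:
            raise KeyError(f"Template references missing column {name!r}")
        return str(row[name])

    return _PLACEHOLDER.sub(substitute, template)


print(preview_prompt("Q: {{ row.question }}\nA:", {"question": "What is 2 + 2?"}))
```

The KeyError on an unknown column catches template typos locally, before an evaluation is created.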

Metrics

Type | Description
TrismikEvaluationMetricAccuracy(gold_column) | Accuracy against a reference column.
TrismikEvaluationMetricExactMatch(gold_column, normalize=True) | Exact match against a reference column.
TrismikEvaluationMetricBleu(gold_column) | BLEU score against a reference column.
TrismikEvaluationMetricRougeL(gold_column) | ROUGE-L score against a reference column.
TrismikEvaluationMetricLLMAJ(judge_model, rubric, gold_column=None, scale_min=1, scale_max=5) | LLM-as-Judge scoring against a rubric.

When an LLM-as-Judge metric is present it must be the first entry in the metrics list, and at most one is allowed per evaluation (the SDK validates this before sending the request). See the LLM-as-Judge guide for how to write effective rubrics.

Note that not every model can be used as the judge_model — only some support LLM-as-Judge. The model list at stage.trismik.com/models indicates which models support LLMAJ.
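
The ordering rule for LLM-as-Judge metrics can be expressed as a small check. This is an illustrative sketch, not the SDK's validator: validate_metric_order is a hypothetical helper, and metrics are represented by simple kind strings rather than the SDK's typed objects.

```python
def validate_metric_order(kinds: list[str]) -> None:
    """Raise ValueError unless 'llmaj' appears at most once and only first."""
    positions = [i for i, kind in enumerate(kinds) if kind == "llmaj"]
    if len(positions) > 1:
        raise ValueError("At most one LLM-as-Judge metric is allowed")
    if positions and positions[0] != 0:
        raise ValueError("The LLM-as-Judge metric must be the first entry")


validate_metric_order(["llmaj", "accuracy"])  # passes silently
validate_metric_order(["accuracy", "bleu"])   # passes: no judge metric at all
```

When building the metrics list dynamically, a simple way to satisfy the rule is to add the judge metric first and append the reference-based metrics afterwards.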

Models

Pass each model as a plain string in "<provider>/<model>" form — the Model ID shown in the table at stage.trismik.com/models, for example "anthropic/claude_sonnet_4.5" or "amazon/nova_2_lite". For finer control, pass a TrismikEvaluationModelConfig instead:

from trismik import TrismikEvaluationModelConfig

TrismikEvaluationModelConfig(
    id="openai/gpt-4.1_mini",
    thinking=False,
    reasoning_effort=None,  # "low" | "medium" | "high"
    max_tokens=None,
    temperature=None,
)

Error handling

All SDK errors derive from TrismikError and live in trismik.exceptions:

Exception | Raised when
TrismikError | Base class for all SDK errors (e.g. a missing API key).
TrismikApiError | A generic API request failure.
TrismikNotFoundError | The requested resource (404) doesn't exist or isn't yours.
TrismikValidationError | The request was rejected as invalid (422).
TrismikPayloadTooLargeError | The upload exceeded the size limit (413).
TrismikRateLimitedError | You hit the per-resource rate limit (429); see retry_after_seconds.

from trismik.exceptions import TrismikNotFoundError, TrismikRateLimitedError

try:
    evaluation = await quick_compare.get_evaluation(evaluation_id)
except TrismikRateLimitedError as e:
    print(f"Rate limited; retry after {e.retry_after_seconds}s")
except TrismikNotFoundError:
    print("No such evaluation")
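
Rather than only printing the retry hint, a caller may want to wait and try again. The sketch below shows one way to do that with a bounded retry loop. It is self-contained and therefore uses a hypothetical RateLimited exception as a stand-in for TrismikRateLimitedError; call_with_retry and max_attempts are illustrative names, not SDK features.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for TrismikRateLimitedError."""

    def __init__(self, retry_after_seconds: float) -> None:
        super().__init__(f"rate limited; retry after {retry_after_seconds}s")
        self.retry_after_seconds = retry_after_seconds


async def call_with_retry(
    call: Callable[[], Awaitable[T]], *, max_attempts: int = 3
) -> T:
    """Await call(), sleeping for retry_after_seconds after each rate limit."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except RateLimited as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(exc.retry_after_seconds)
    # Reached only when max_attempts is less than 1.
    raise ValueError("max_attempts must be at least 1")
```

With the real SDK you would catch TrismikRateLimitedError in place of the stand-in and pass something like lambda: quick_compare.get_evaluation(evaluation_id) as the call.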