Quick Compare SDK
The trismik Python package lets you drive the same QuickCompare flow you'd run
in the web UI — upload a dataset, create an evaluation across one
or more models with the metrics you choose, wait for it to finish, and export the
results — directly from your own code. It's the right tool when you want to script
evaluations, wire them into CI, or compare many model/prompt combinations
programmatically.
Installation
pip install trismik
The package requires Python 3.10 or newer and is published on PyPI as
trismik.
Authentication
The SDK reads its API key from the environment:
TRISMIK_API_KEY: your API key (required). You can find it in your profile.
export TRISMIK_API_KEY="your-api-key"
You can also pass the key explicitly when constructing a client:
from trismik import TrismikAsyncClient
client = TrismikAsyncClient(api_key="your-api-key")
If no key is found in either the argument or the environment, the client raises a
TrismikError.
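The lookup described above can be checked before a client is constructed. The sketch below is a stand-alone illustration, not the SDK's own code: `resolve_api_key` is a hypothetical helper, and the precedence it uses (explicit argument first, then `TRISMIK_API_KEY`) is an assumption about how the client resolves the key.

```python
import os
from typing import Optional


def resolve_api_key(explicit: Optional[str] = None) -> str:
    """Return the API key from the argument or, failing that, the environment.

    Hypothetical helper: the argument-before-environment order is an assumption.
    """
    key = explicit or os.environ.get("TRISMIK_API_KEY")
    if not key:
        # The SDK raises TrismikError at this point; a built-in exception
        # keeps this sketch free of third-party imports.
        raise RuntimeError("TRISMIK_API_KEY is not set and no api_key was passed")
    return key
```

Calling `resolve_api_key()` early in a CI job surfaces a missing key before any dataset is uploaded.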
Clients
QuickCompare is exposed as a sub-client accessed through the main Trismik client. Both async and sync variants exist; the examples below are async, which mirrors the SDK's primary API.
from trismik import TrismikAsyncClient
async with TrismikAsyncClient() as client:
    quick_compare = client.quick_compare  # TrismikQuickCompareAsyncClient
    balance = await quick_compare.get_account_balance()
The synchronous equivalent uses TrismikClient, a plain with block, and drops
async/await:
from trismik import TrismikClient
with TrismikClient() as client:
    quick_compare = client.quick_compare  # TrismikQuickCompareClient
    balance = quick_compare.get_account_balance()
When you obtain quick_compare from the parent client it reuses the parent's HTTP
connection, so let the parent's with / async with block handle cleanup — don't
call aclose() / close() on the sub-client yourself.
Quickstart
The example below runs the full lifecycle: check your balance, upload a local CSV as a dataset, create an evaluation across two models with an LLM-as-Judge metric plus accuracy, wait for completion, and request a download URL for the results.
The data/qa.csv file used below is the sample dataset from the SDK's examples/data
directory: a simple two-column question,answer CSV. For the accepted dataset
formats, see the Datasets guide, which also covers the agent skill for turning
raw data into an upload-ready file.
import asyncio
import secrets
from pathlib import Path

from trismik import (
    TrismikAsyncClient,
    TrismikEvaluation,
    TrismikEvaluationDatasetUser,
    TrismikEvaluationMetricAccuracy,
    TrismikEvaluationMetricLLMAJ,
    TrismikEvaluationModelInputTemplate,
)


def log_progress(evaluation: TrismikEvaluation) -> None:
    pct = evaluation.progress.percentage if evaluation.progress else 0.0
    print(f" [{evaluation.status}] {pct:.1f}%")


async def main() -> None:
    async with TrismikAsyncClient() as client:
        quick_compare = client.quick_compare

        # 1) Check your account balance
        balance = await quick_compare.get_account_balance()
        print(
            f"Balance: ${balance.balance_usd:.2f} "
            f"(held ${balance.held_usd:.2f}, available ${balance.available_usd:.2f})"
        )

        run_suffix = secrets.token_hex(4)  # keep names unique per run

        # 2) Upload a local CSV (init -> S3 upload -> finalize, all in one call)
        dataset = await quick_compare.upload_user_dataset(
            Path("data/qa.csv"),
            name=f"example-qa-{run_suffix}",
        )
        print(f"Uploaded dataset {dataset.id} ({dataset.status}, {dataset.row_count} rows)")

        # 3) Create the evaluation
        evaluation = await quick_compare.create_evaluation(
            name=f"quick-compare-demo-{run_suffix}",
            dataset=TrismikEvaluationDatasetUser(dataset_id=dataset.id),
            model_input=TrismikEvaluationModelInputTemplate(
                template="Q: {{ row.question }}\nA:",
            ),
            metrics=[
                # An LLM-as-Judge metric, when present, must come first
                TrismikEvaluationMetricLLMAJ(
                    judge_model="anthropic/claude_sonnet_4.5",
                    rubric="Rate factual accuracy from 1 to 5.",
                ),
                TrismikEvaluationMetricAccuracy(gold_column="answer"),
            ],
            models=["openai/gpt-4.1_mini", "amazon/nova_2_lite"],
            budget_limit_usd=2.0,
            sample_limit={"test": 50},
        )
        print(f"Created evaluation {evaluation.id}")

        # 4) Wait for completion, logging progress along the way
        completed = await quick_compare.wait_for_evaluation(
            evaluation.id,
            on_progress=log_progress,
        )
        for score in completed.scores:
            print(f" {score.model} | {score.metric}: {score.value:.3f}")

        # 5) Request a presigned URL to download the full results
        export = await quick_compare.get_evaluation_export_url(
            completed.id, format="parquet"
        )
        print(f"Export ({export.status}): {export.url}")


if __name__ == "__main__":
    asyncio.run(main())
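The quickstart stops at printing the export URL, so the file still has to be downloaded. A minimal sketch using only the standard library follows. `download_export` is a hypothetical helper, not part of the SDK, and it assumes the presigned URL can be fetched with a plain GET request and no extra headers.

```python
import shutil
import urllib.request
from pathlib import Path


def download_export(url: str, dest: Path) -> Path:
    """Stream the file behind `url` to `dest` and return the destination path."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Copy in chunks so large result files are never held fully in memory.
    with urllib.request.urlopen(url) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

In the quickstart this would be called as `download_export(export.url, Path("results/quick-compare.parquet"))` once the export URL is available.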
Operations reference
The methods below are available on the QuickCompare client (client.quick_compare).
Arguments after * are keyword-only.
Datasets
upload_user_dataset(source, *, name, file_type=None, content_type=None): Upload a dataset end-to-end (init → S3 upload → finalize) and return the finalized dataset. source may be a file path (str/Path), raw bytes, or a binary file-like object. Supported file_type values are csv, tsv, json, jsonl, and parquet; when source is a path the type is inferred from its extension, so file_type is only required for bytes/file-like sources.
init_user_dataset(*, name, file_type, size_bytes) and finalize_user_dataset(dataset_id): The manual two-step upload that upload_user_dataset wraps. Use these if you want to manage the S3 upload yourself.
list_user_datasets(*, page=1, limit=20, sort_by=None, sort_order=None, search=None): Paginated list of your datasets.
get_user_dataset(dataset_id): A single dataset with its inferred column metadata.
list_user_dataset_rows(dataset_id, *, page=1, limit=50): Paginated dataset rows.
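When the data is generated in code rather than read from disk, it can be serialized to bytes and uploaded with an explicit `file_type`. The sketch below relies only on the `upload_user_dataset` signature listed above. `rows_to_csv_bytes` and `upload_rows` are hypothetical helpers, and the `question`/`answer` column names are just examples.

```python
import csv
import io
from typing import Any, Iterable, Mapping, Sequence


def rows_to_csv_bytes(rows: Iterable[Mapping[str, Any]], fieldnames: Sequence[str]) -> bytes:
    """Serialize dict rows to UTF-8 CSV bytes with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


async def upload_rows(quick_compare: Any, rows: Iterable[Mapping[str, Any]], name: str) -> Any:
    """Upload in-memory rows as a CSV dataset.

    A bytes source has no file extension to infer the type from,
    so file_type is passed explicitly.
    """
    data = rows_to_csv_bytes(rows, fieldnames=["question", "answer"])
    return await quick_compare.upload_user_dataset(data, name=name, file_type="csv")
```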
Evaluations
create_evaluation(*, name, dataset, model_input, metrics, models, budget_limit_usd, sample_limit=None, dataset_name=None): Create an evaluation and return the TrismikEvaluation record. See Building blocks below for the dataset, model_input, metrics, and models arguments. sample_limit is an optional per-split cap, e.g. {"test": 50}.
get_evaluation(evaluation_id): Fetch a detailed evaluation, including progress, per-model/split batches, and aggregated scores.
list_evaluations(*, page=1, limit=20, sort_by=None, sort_order=None, search=None): Paginated slim view of your evaluations.
wait_for_evaluation(evaluation_id, *, poll_interval=10.0, timeout=3600.0, on_progress=None): Poll until the evaluation reaches a terminal status and return it. poll_interval must be at least 6.0 seconds (the API rate-limits reads to 10 per 60 s per resource). on_progress is an optional callable invoked with the latest TrismikEvaluation after each poll. Raises TimeoutError if timeout (in seconds; None disables it) elapses, and TrismikApiError if the evaluation ends in a failed state.
get_evaluation_export_url(evaluation_id, *, format="parquet"): Request a presigned download URL for the results. format is one of parquet, csv, or jsonl. If status is "generating", the URL isn't ready yet; retry shortly.
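Because an export can still be generating on the first request, a small retry loop is often useful. This is a sketch under stated assumptions: `wait_for_export_url` is a hypothetical helper, it relies only on the `status` and `url` fields mentioned above, and it treats any status other than `"generating"` as ready. The default 6-second delay follows from the read rate limit of 10 per 60 s.

```python
import asyncio
from typing import Any


async def wait_for_export_url(
    quick_compare: Any,
    evaluation_id: str,
    *,
    format: str = "parquet",
    delay: float = 6.0,
    max_attempts: int = 20,
) -> str:
    """Request the export URL repeatedly until it is no longer "generating"."""
    for attempt in range(max_attempts):
        export = await quick_compare.get_evaluation_export_url(evaluation_id, format=format)
        # Assumption: any status other than "generating" means the URL is usable.
        if export.status != "generating":
            return export.url
        if attempt < max_attempts - 1:
            await asyncio.sleep(delay)  # 60 s / 10 reads = 6 s between requests
    raise TimeoutError(
        f"export for {evaluation_id} still generating after {max_attempts} attempts"
    )
```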
Account
get_account_balance(): Return a TrismikAccountBalance with balance_usd, held_usd (in-flight holds), and available_usd.
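A pre-flight check against `available_usd` can catch an under-funded account before an evaluation is created. `has_budget` below is a hypothetical helper. Whether the API itself rejects an evaluation whose budget exceeds the available balance is not stated here, so treat this as a local convenience only.

```python
from typing import Any


def has_budget(balance: Any, budget_limit_usd: float) -> bool:
    """Return True when the available balance covers the evaluation's budget limit.

    `balance` is any object with an `available_usd` attribute, such as the
    value returned by get_account_balance().
    """
    return balance.available_usd >= budget_limit_usd
```

For example, `has_budget(await quick_compare.get_account_balance(), 2.0)` could gate the `create_evaluation` call in the quickstart.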
Building blocks
These typed objects describe an evaluation. Import them from the top-level trismik
package.
Dataset source
| Type | Use it for |
|---|---|
| TrismikEvaluationDatasetUser(dataset_id) | A dataset you uploaded with upload_user_dataset. |
| TrismikEvaluationDatasetHF(repo, sub_dataset=None, splits=None) | A public HuggingFace dataset, e.g. TrismikEvaluationDatasetHF(repo="allenai/winogrande", sub_dataset="winogrande_xs", splits=["train"]). |
Model input
| Type | Use it for |
|---|---|
| TrismikEvaluationModelInputColumn(column) | Send a single dataset column as the model input. |
| TrismikEvaluationModelInputTemplate(template) | Compose the input from a template that references dataset columns. |
Templates use the same {{ row.<name> }} syntax described in the
Jinja templates guide.
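A template that references a column the dataset does not have is easy to write by accident. The sketch below lists the columns a template refers to so they can be compared with the dataset's columns before an evaluation is created. Both functions are hypothetical helpers, and the regular expression only recognizes plain `{{ row.<name> }}` placeholders; it is not a full Jinja parser.

```python
import re
from typing import Iterable, Set

# Matches simple placeholders such as "{{ row.question }}".
_ROW_REF = re.compile(r"\{\{\s*row\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def referenced_columns(template: str) -> Set[str]:
    """Return the column names referenced as {{ row.<name> }} in a template."""
    return set(_ROW_REF.findall(template))


def missing_columns(template: str, dataset_columns: Iterable[str]) -> Set[str]:
    """Return referenced columns that are absent from the dataset."""
    return referenced_columns(template) - set(dataset_columns)
```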
Metrics
| Type | Description |
|---|---|
| TrismikEvaluationMetricAccuracy(gold_column) | Accuracy against a reference column. |
| TrismikEvaluationMetricExactMatch(gold_column, normalize=True) | Exact match against a reference column. |
| TrismikEvaluationMetricBleu(gold_column) | BLEU score against a reference column. |
| TrismikEvaluationMetricRougeL(gold_column) | ROUGE-L score against a reference column. |
| TrismikEvaluationMetricLLMAJ(judge_model, rubric, gold_column=None, scale_min=1, scale_max=5) | LLM-as-Judge scoring against a rubric. |
When an LLM-as-Judge metric is present it must be the first entry in the
metrics list, and at most one is allowed per evaluation (the SDK validates this
before sending the request). See the LLM-as-Judge guide for how
to write effective rubrics.
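The ordering rule above can be expressed as a short check. This is an illustration of the rule, not the SDK's validator: the two dataclasses are hypothetical stand-ins for the real metric types, and `check_metric_order` is a made-up name.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class JudgeMetric:  # hypothetical stand-in for TrismikEvaluationMetricLLMAJ
    rubric: str


@dataclass
class AccuracyMetric:  # hypothetical stand-in for TrismikEvaluationMetricAccuracy
    gold_column: str


def check_metric_order(metrics: Sequence[object], is_judge: Callable[[object], bool]) -> None:
    """Raise ValueError unless at most one judge metric exists and it comes first."""
    judge_positions = [i for i, metric in enumerate(metrics) if is_judge(metric)]
    if len(judge_positions) > 1:
        raise ValueError("at most one LLM-as-Judge metric is allowed")
    if judge_positions and judge_positions[0] != 0:
        raise ValueError("the LLM-as-Judge metric must be the first entry")
```

With the real types, the predicate would be something like `lambda m: isinstance(m, TrismikEvaluationMetricLLMAJ)`.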
Note that not every model can be used as the judge_model — only some support
LLM-as-Judge. The model list at
stage.trismik.com/models indicates which models
support LLMAJ.
Models
Pass each model as a plain string in "<provider>/<model>" form — the Model ID shown
in the table at stage.trismik.com/models, for
example "anthropic/claude_sonnet_4.5" or "amazon/nova_2_lite". For finer control,
pass a TrismikEvaluationModelConfig instead:
from trismik import TrismikEvaluationModelConfig
TrismikEvaluationModelConfig(
    id="openai/gpt-4.1_mini",
    thinking=False,
    reasoning_effort=None,  # "low" | "medium" | "high"
    max_tokens=None,
    temperature=None,
)
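When model IDs come from a config file or command-line flags, a quick format check can catch typos before a request is sent. `split_model_id` is a hypothetical helper that only verifies the `"<provider>/<model>"` shape; it does not check that the model actually exists in the model list.

```python
from typing import Tuple


def split_model_id(model_id: str) -> Tuple[str, str]:
    """Split "<provider>/<model>" into its two parts, rejecting malformed IDs."""
    provider, sep, model = model_id.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected '<provider>/<model>', got {model_id!r}")
    return provider, model
```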
Error handling
All SDK errors derive from TrismikError and live in trismik.exceptions:
| Exception | Raised when |
|---|---|
| TrismikError | Base class for all SDK errors (e.g. a missing API key). |
| TrismikApiError | A generic API request failure. |
| TrismikNotFoundError | The requested resource (404) doesn't exist or isn't yours. |
| TrismikValidationError | The request was rejected as invalid (422). |
| TrismikPayloadTooLargeError | The upload exceeded the size limit (413). |
| TrismikRateLimitedError | You hit the per-resource rate limit (429); see retry_after_seconds. |
from trismik.exceptions import TrismikNotFoundError, TrismikRateLimitedError
try:
    evaluation = await quick_compare.get_evaluation(evaluation_id)
except TrismikRateLimitedError as e:
    print(f"Rate limited; retry after {e.retry_after_seconds}s")
except TrismikNotFoundError:
    print("No such evaluation")
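Instead of only printing the delay, a caller may want to wait and try again. The sketch below is one possible retry wrapper, not something the SDK provides: `retry_on_rate_limit` is a hypothetical helper, the exception type is passed in so the block runs without the SDK installed, and the 6-second fallback delay is an assumption derived from the 10-reads-per-60-s limit.

```python
import asyncio
from typing import Any, Awaitable, Callable, Type


async def retry_on_rate_limit(
    call: Callable[[], Awaitable[Any]],
    rate_limit_error: Type[Exception],
    *,
    max_attempts: int = 3,
    default_delay: float = 6.0,
) -> Any:
    """Await `call()`, sleeping for the advertised delay when it is rate limited."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except rate_limit_error as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # retry_after_seconds is the attribute named in the table above;
            # fall back to default_delay if it is missing or None.
            delay = getattr(exc, "retry_after_seconds", None)
            await asyncio.sleep(default_delay if delay is None else delay)
```

With the SDK installed, a call might look like `await retry_on_rate_limit(lambda: quick_compare.get_evaluation(evaluation_id), TrismikRateLimitedError)`.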