# FUTURE-TS benchmark card

## Summary

FUTURE-TS v0.1.0 is a public-preview benchmark for time-series foundation
models. It treats evaluation as an executable protocol: task cards declare
issue times, horizons, available history, delayed labels, adaptation budgets,
resource limits, and metrics; submissions are validated, scored, and audited
against those contracts.

The core rule is future-only ranking: a submission is ranked only on labels
that were not yet available when the model was frozen.
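
The future-only rule can be illustrated with a small filter. This is a minimal sketch, not the benchmark's implementation: the `Label` class, its field names, and `rankable_labels` are hypothetical, and the timestamps are made up. It assumes each label records when it became observable, so a delayed label can still count even if the value it describes predates the freeze.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Label:
    """Hypothetical label record; field names are assumptions for this sketch."""
    task_id: str
    target_time: datetime   # timestamp the label describes
    available_at: datetime  # when the label became observable
    value: float


def rankable_labels(labels, frozen_at):
    """Keep only labels that became available strictly after the model freeze."""
    return [lab for lab in labels if lab.available_at > frozen_at]


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


frozen_at = utc(2025, 1, 1)
labels = [
    # Available before the freeze: excluded from ranking.
    Label("demo", utc(2024, 12, 30), utc(2024, 12, 31), 1.0),
    # Describes a pre-freeze timestamp, but the label arrived late: kept.
    Label("demo", utc(2024, 12, 31), utc(2025, 1, 2), 2.0),
    # Entirely after the freeze: kept.
    Label("demo", utc(2025, 1, 5), utc(2025, 1, 6), 3.0),
]

kept = rankable_labels(labels, frozen_at)
print([lab.value for lab in kept])  # [2.0, 3.0]
```

The comparison uses label availability rather than target time, which is one reading of "not yet available when the model was frozen"; the task cards remain the authoritative definition.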

## Benchmark Surface

- Release: `v0.1.0`
- Strict surface: `benchmarks/v1/benchmark.json`
- Surface identifier: `future-ts-v1`
- Tasks: 25
- Tiers: `public_dev`, `blind_archive`, `live`
- Tracks: forecasting core, covariate-aware, multivariate relational, data
  quality, event, transfer, multimodal context
- Required manifest: `require_pretraining_manifest=true`
- Headline score: `tier_weighted_score`
- Additional views: capability vector `(F, U, R, A, E, D)`, Pareto flag,
  rank-based mean-of-ranks aggregate with bootstrap CI
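
To make the mean-of-ranks view concrete, here is a minimal sketch of a rank aggregate with a percentile bootstrap interval. It is not the benchmark's implementation. It assumes that higher scores are better, that every model is scored on every task, that there are no ties, and that tasks are the unit resampled with replacement. The function names and the toy scores are hypothetical.

```python
import random
from statistics import mean


def ranks_per_task(scores):
    """scores: {task: {model: score}}, higher is better.

    Returns {model: [rank on each task]} with rank 1 as best.
    Ties are not handled specially in this sketch.
    """
    ranks = {}
    for task in sorted(scores):
        ordered = sorted(scores[task], key=scores[task].get, reverse=True)
        for position, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(position)
    return ranks


def bootstrap_mean_rank(ranks, n_boot=2000, alpha=0.05, seed=0):
    """Return {model: (mean_rank, ci_low, ci_high)} via a percentile bootstrap."""
    rng = random.Random(seed)
    n_tasks = len(next(iter(ranks.values())))
    # Draw the task resamples once so every model sees the same resamples.
    draws = [[rng.randrange(n_tasks) for _ in range(n_tasks)]
             for _ in range(n_boot)]
    result = {}
    for model, model_ranks in ranks.items():
        samples = sorted(mean(model_ranks[i] for i in draw) for draw in draws)
        low = samples[int((alpha / 2) * n_boot)]
        high = samples[int((1 - alpha / 2) * n_boot) - 1]
        result[model] = (mean(model_ranks), low, high)
    return result


# Made-up scores: model_x wins one task by a wide margin, model_y wins two narrowly.
toy_scores = {
    "task_a": {"model_x": 0.90, "model_y": 0.10},
    "task_b": {"model_x": 0.20, "model_y": 0.25},
    "task_c": {"model_x": 0.30, "model_y": 0.35},
}

toy_ranks = ranks_per_task(toy_scores)
summary = bootstrap_mean_rank(toy_ranks)
for model, (point, low, high) in sorted(summary.items()):
    print(f"{model}: mean rank {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

In the toy data `model_x` has the higher average score while `model_y` has the better mean rank, which shows why a score-based headline and a rank-based aggregate can name different leaders.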

The strict surface rejects submissions that lack a pretraining manifest, so
that "clean" and "undeclared" are not conflated.

## Current Empirical Run

The current canonical empirical artifacts are in:

```text
reports/tsfm_ai_empirical_v2_multi_budget/
```

This run is the `future-ts-empirical-v2` hosted TSFM.ai slice used by the
empirical paper. It exercises 15 real-data tasks across three context budgets:

| Budget | Context length |
|--------|----------------|
| `zero_shot` | 96 |
| `few_shot` | 192 |
| `s16` | 288 |
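
As a rough illustration of how a context budget could be applied, the sketch below truncates a history to its most recent observations. It assumes that "context length" counts the trailing observations passed to the model. The `CONTEXT_LENGTHS` mapping mirrors the table above, while `context_window` is a hypothetical helper rather than part of the `future_ts` package.

```python
# Budget names and lengths copied from the table; the helper is illustrative only.
CONTEXT_LENGTHS = {"zero_shot": 96, "few_shot": 192, "s16": 288}


def context_window(history, budget):
    """Return the most recent observations allowed under the given budget."""
    length = CONTEXT_LENGTHS[budget]
    return list(history)[-length:]


history = list(range(500))  # stand-in series of 500 observations
for budget in CONTEXT_LENGTHS:
    window = context_window(history, budget)
    print(budget, len(window), window[0], window[-1])
```

A history shorter than the budget is returned unchanged, since slicing past the start of a list is safe in Python.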

Current result snapshot:

- Surfaced public catalog entries: 52
- Merged public submissions: 51
- Scored public models: 47
- Positive overall scores: 30
- Current winner: `Datadog/Toto-2.0-1B`
- Winner score: `0.2369`
- Rank-consistency leader by mean rank: `amazon/chronos-2`

Top five by tier-weighted score:

| Rank | Model | Score |
|------|-------|-------|
| 1 | `Datadog/Toto-2.0-1B` | 0.2369 |
| 2 | `Datadog/Toto-2.0-313m` | 0.2214 |
| 3 | `NX-AI/TiRex` | 0.2024 |
| 4 | `Salesforce/moirai-2.0-R-small` | 0.1860 |
| 5 | `amazon/chronos-2` | 0.1851 |

## How To Submit A Model Entry

Submit a pull request under:

```text
submissions/community/<org>_<model>/
```

Each submission directory must contain:

- `script.py`: sealed-runner entry point
- `declaration.json`: metadata, declarations, artifact URI, and pretraining
  manifest
- `README.md`: short model description and links

CI validates the directory structure and smoke-runs `script.py` through the
sealed runner. After review, the evaluator runs the full benchmark and can
publish the resulting `BenchmarkReport`.
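
Before opening a pull request, a local layout check along these lines may catch missing files early. This is a sketch only and does not reproduce what CI runs: `check_submission_dir` is a hypothetical helper, the directory name is a placeholder, and the only rules encoded are the three required files listed above plus a check that `declaration.json` parses as JSON.

```python
import json
import tempfile
from pathlib import Path

# Required files as listed in this card.
REQUIRED_FILES = ("script.py", "declaration.json", "README.md")


def check_submission_dir(path):
    """Return a list of problems; an empty list means the layout looks complete."""
    root = Path(path)
    problems = [f"missing {name}" for name in REQUIRED_FILES
                if not (root / name).is_file()]
    declaration = root / "declaration.json"
    if declaration.is_file():
        try:
            json.loads(declaration.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"declaration.json is not valid JSON: {exc}")
    return problems


# Demonstration on a throwaway directory that deliberately lacks README.md.
with tempfile.TemporaryDirectory() as tmp:
    submission = Path(tmp) / "exampleorg_examplemodel"
    submission.mkdir()
    (submission / "script.py").write_text("# placeholder entry point\n", encoding="utf-8")
    (submission / "declaration.json").write_text("{}", encoding="utf-8")
    problems = check_submission_dir(submission)

print(problems)  # ['missing README.md']
```

The sketch does not validate the contents of `declaration.json`; the required fields are described in the submission guide.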

See [submission-guide.md](submission-guide.md) for the full process.

## What This Does Not Yet Claim

The v0.1.0 release is a local benchmark package plus a sealed-runner MVP. It
is not yet a fully hosted, externally attested live benchmark service.

Do not read the current empirical run as a permanent universal TSFM ranking.
It is a first real-data slice over 15 tasks and one materialized wave. Wider
task coverage, repeated waves, hosted attestation, immutable submission
windows, and stronger manifest evidence are the next steps.

See [validity-envelope.md](validity-envelope.md) for the precise claim
boundary.

## Local Commands

Validate the strict benchmark:

```bash
python3 -m future_ts.cli validate-benchmark benchmarks/v1/benchmark.json
```

Run the sealed reference submission:

```bash
python3 -m future_ts.cli run-sealed \
  examples/submissions/reference_seasonal_naive.py \
  .tmp/task_windows.json \
  --output .tmp/submission.json \
  --submission-id local::reference_seasonal_naive \
  --model-name "Reference Seasonal Naive"
```

Build a leaderboard from saved reports:

```bash
python3 -m future_ts.cli leaderboard reports/tsfm_ai_empirical_v2_multi_budget/*.report.json
```

These core workflows do not require TSFM.ai. A TSFM.ai API key is needed only
for the optional hosted-catalog evaluation commands, which invoke hosted
models.

## More Detail

- Design paper: `paper/future_ts_design.pdf`
- Empirical paper: `paper/future_ts_empirical.pdf`
- Benchmark design notes: [benchmark-design.md](benchmark-design.md)
- Submission guide: [submission-guide.md](submission-guide.md)
- Sealed runner: [sealed-runner.md](sealed-runner.md)
- Validity envelope: [validity-envelope.md](validity-envelope.md)
