# FUTURE-TS validity envelope

This page states what the current repository supports and where claims should
stop. It is intended for reviewers comparing the design paper, empirical
paper, and code.

## Supported today

FUTURE-TS v0.1.0 is a runnable benchmark package and integrity scaffold. Its
strict `benchmarks/v1` surface implements:

- task cards with issue times, horizons, delayed sources, revision metadata,
  resource budgets, anchors, metrics, and adaptation budgets
- validation of submissions and actuals, including full visible-label coverage
  at score time
- leakage auditing through `available_at`, `first_published_at`, and
  `last_updated_at`
- archived prediction scoring with deterministic `prediction_hash` and
  `manifest_hash`
- strict pretraining manifests for `benchmarks/v1`; manifest source entries
  are included in the manifest hash and carry evidence/confidence metadata
- local sealed-runner MVP execution with platform-stamped
  `platform_issued_at` and `platform_received_at`, CPU/wall-clock limits,
  best-effort memory limits, and Linux network namespace isolation where
  available
- prequential cutoff scoring for tasks that declare `cutoff_schedule`, emitted
  as a backward-compatible aggregate task score
- capability-vector reports plus tier-weighted and rank-style aggregate
  leaderboard surfaces

This supports the claim that FUTURE-TS is an executable future-aware benchmark
protocol and local evaluation package.
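
The leakage audit listed above can be illustrated with a minimal sketch. Only
the three timestamp field names come from this page; the `Observation` record,
the `audit_leakage` function, and the comparison rule itself are assumptions
made for illustration and may differ from the repository's auditor.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Observation:
    """Hypothetical input record; only the timestamp names come from this page."""
    series_id: str
    available_at: datetime
    first_published_at: datetime
    last_updated_at: Optional[datetime] = None


def audit_leakage(observations: List[Observation], issue_time: datetime) -> List[str]:
    """Return one finding per timestamp that falls after the task's issue time."""
    findings = []
    for obs in observations:
        for field in ("available_at", "first_published_at", "last_updated_at"):
            stamp = getattr(obs, field)
            # Assumed rule: a value stamped after issue time was not knowable then.
            if stamp is not None and stamp > issue_time:
                findings.append(f"{obs.series_id}: {field} is after issue time")
    return findings


# Illustrative data: "price" was revised two days after the issue time.
issue = datetime(2024, 1, 10, tzinfo=timezone.utc)
clean = Observation(
    "load",
    available_at=datetime(2024, 1, 9, tzinfo=timezone.utc),
    first_published_at=datetime(2024, 1, 8, tzinfo=timezone.utc),
)
revised = Observation(
    "price",
    available_at=datetime(2024, 1, 9, tzinfo=timezone.utc),
    first_published_at=datetime(2024, 1, 8, tzinfo=timezone.utc),
    last_updated_at=datetime(2024, 1, 12, tzinfo=timezone.utc),
)
findings = audit_leakage([clean, revised], issue)
print(findings)
```

Treating a late `last_updated_at` as a finding is deliberately conservative in
this sketch: a revised value may differ from what was visible at issue time
even if the series was first published earlier.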

## Not supported yet

The repository does not by itself establish a fully hosted, externally
attested live benchmark. The following remain service-layer or data-expansion
milestones:

- immutable remote submission deadlines and live waves where labels
  physically do not exist at submission time
- signed external timestamps, container digest attestation, and artifact-bucket
  immutability
- production-grade no-egress execution on every host; non-Linux local runs are
  warning-only unless a Docker/Kubernetes `--network=none` backend is used
- fixed-hardware, platform-measured per-prediction runtime/memory telemetry for
  cross-model efficiency claims
- PEFT and full fine-tuning under sealed training/evaluation
- broad decision-utility coverage; v1 grounds capD in one newsvendor task, so
  grid redispatch, hospital staffing, and capacity-planning tasks are needed
  before making broad operational-utility claims
- frozen anchor prediction artifacts and alternate-anchor sensitivity tables
- a source ontology with canonical IDs and aliases for stronger pretraining
  overlap matching
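
To make the last item concrete, the following is a rough sketch of how
canonical IDs and aliases could tighten pretraining overlap matching. It does
not describe existing repository code: the ontology shape, the source names,
and the functions `normalize`, `build_alias_index`, `canonicalize`, and
`overlapping_sources` are all hypothetical.

```python
import re
from typing import Dict, Iterable, List, Set


def normalize(name: str) -> str:
    """Lowercase and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]+", "", name.lower())


def build_alias_index(ontology: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each normalized canonical ID and alias to its canonical ID."""
    index: Dict[str, str] = {}
    for canonical_id, aliases in ontology.items():
        for name in [canonical_id, *aliases]:
            index[normalize(name)] = canonical_id
    return index


def canonicalize(names: Iterable[str], index: Dict[str, str]) -> Set[str]:
    """Resolve names to canonical IDs; unknown names keep a tagged normalized form."""
    return {index.get(normalize(n), "unresolved:" + normalize(n)) for n in names}


def overlapping_sources(
    manifest_sources: Iterable[str],
    task_sources: Iterable[str],
    index: Dict[str, str],
) -> Set[str]:
    """Return sources that appear in both the pretraining manifest and the task."""
    return canonicalize(manifest_sources, index) & canonicalize(task_sources, index)


# Illustrative ontology with invented source names.
ontology = {"src:electricity-load": ["Electricity Load", "elec_load"]}
index = build_alias_index(ontology)
overlap = overlapping_sources(
    ["Elec_Load", "weather-daily"],  # names as written in a pretraining manifest
    ["Electricity Load"],            # names as written on a task card
    index,
)
print(overlap)
```

Plain string comparison would miss the match between `Elec_Load` and
`Electricity Load` in this example, which is the gap an alias table is meant
to close.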

## Claim wording

Use:

> FUTURE-TS introduces a runnable future-aware benchmark protocol for TSFMs,
> with task-card semantics, leakage auditing, strict submission validation,
> archived predictions, manifest-based contamination flags, multi-dimensional
> scoring, and an empirical TSFM.ai run demonstrating the protocol on real
> data. The current release is a local package plus sealed-runner MVP; hosted
> attested live evaluation is the next milestone.

Avoid:

> FUTURE-TS structurally solves live benchmark integrity.

The latter claim would require a hosted attested service, immutable submission
windows, and label-release controls, none of which the current release
provides.

## Covariates and multimodal context

Task cards may declare known covariates or multimodal context. Those fields
are eligibility metadata unless the actual task-window payload handed to a
submission contains the corresponding inputs. A task should only be described
as operationally measuring covariate-aware or multimodal use when its runner
payload includes those typed inputs.
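
A reviewer could apply this rule mechanically. The sketch below assumes a
hypothetical dictionary layout (`known_covariates` and `multimodal_context` on
the task card, `inputs` on the runner payload); these key names and both
functions are illustrative and are not taken from the repository schema.

```python
from typing import Any, Dict, List


def declared_inputs(task_card: Dict[str, Any]) -> List[str]:
    """Collect covariate and multimodal names the task card declares."""
    return list(task_card.get("known_covariates", [])) + list(
        task_card.get("multimodal_context", [])
    )


def missing_declared_inputs(task_card: Dict[str, Any], payload: Dict[str, Any]) -> List[str]:
    """List declared inputs that the runner payload does not actually carry."""
    delivered = payload.get("inputs", {})
    return [name for name in declared_inputs(task_card) if name not in delivered]


def measures_declared_inputs(task_card: Dict[str, Any], payload: Dict[str, Any]) -> bool:
    """True only if something is declared and every declared input is delivered."""
    return bool(declared_inputs(task_card)) and not missing_declared_inputs(
        task_card, payload
    )


# Illustrative task card and payloads with invented field values.
card = {
    "known_covariates": ["holiday_flag", "temperature_forecast"],
    "multimodal_context": ["news_text"],
}
partial_payload = {
    "inputs": {"holiday_flag": [0, 1, 0], "temperature_forecast": [3.1, 2.8, 4.0]}
}
full_payload = {"inputs": {**partial_payload["inputs"], "news_text": ["..."]}}

print(missing_declared_inputs(card, partial_payload))
print(measures_declared_inputs(card, full_payload))
```

Under these assumptions the first payload leaves `news_text` as eligibility
metadata only, so the task should not be described as measuring multimodal use
until the payload carries that input.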

## Empirical scope

The empirical paper should be read as a first real-data slice: real hosted
TSFMs, temporally constrained tasks, archived predictions, and diagnostic
family differences. It does not establish a final ordering of TSFMs. Wider
waves, more tasks, stronger manifests, external attestation, and full
multi-budget adaptation are required before presenting a definitive public
leaderboard.