Quotas & rate limiting

Credit balance (402)

Billing is prepaid: every request checks your credit balance first. When the balance is at or below zero (and past any configured grace window), requests receive 402 Payment Required:

{
  "error": {
    "type": "billing_error",
    "code": "insufficient_balance",
    "message": "Insufficient credit balance. Please top up your account to continue.",
    "action": "topup",
    "top_up_url": "https://app.cloakapi.io/billing"
  }
}

Top up in the portal under Billing.
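A caller can tell a balance block apart from other failures by checking the status and the error type. The following is a minimal sketch, assuming the 402 body has exactly the shape shown above. `BillingBlock` and `parse_billing_error` are hypothetical names used for illustration, not part of any SDK.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class BillingBlock:
    """Details pulled from a 402 billing_error response (illustrative type)."""
    code: str
    message: str
    top_up_url: Optional[str]


def parse_billing_error(status: int, body: str) -> Optional[BillingBlock]:
    """Return a BillingBlock for a 402 billing error, or None for anything else."""
    if status != 402:
        return None
    try:
        payload = json.loads(body)
    except ValueError:
        return None  # body was not valid JSON
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict) or error.get("type") != "billing_error":
        return None
    return BillingBlock(
        code=error.get("code", ""),
        message=error.get("message", ""),
        top_up_url=error.get("top_up_url"),
    )


# Sample body copied from the 402 example above.
SAMPLE_402 = json.dumps({
    "error": {
        "type": "billing_error",
        "code": "insufficient_balance",
        "message": "Insufficient credit balance. Please top up your account to continue.",
        "action": "topup",
        "top_up_url": "https://app.cloakapi.io/billing",
    }
})

block = parse_billing_error(402, SAMPLE_402)
if block is not None:
    print(f"Blocked ({block.code}); top up at {block.top_up_url}")
```

Retrying a 402 without topping up is unlikely to succeed, so a caller would normally surface the URL to an operator rather than loop.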

Token quotas (429)

Independent of balance, every tenant has hourly and daily token budgets (per pricing segment, with per-key overrides). Requests over budget receive 429 with a Retry-After header and one of two codes:

{
  "error": {
    "type": "rate_limit_error",
    "code": "hourly_quota_exceeded",
    "message": "Hourly token quota exceeded. Retry after the top of the hour.",
    "quota": { "window": "hourly", "limit": 1000000, "used": 1000000 }
  }
}

The daily-window equivalent is daily_quota_exceeded. Successful responses carry X-CloakAPI-Quota-Hourly-Remaining and X-CloakAPI-Quota-Daily-Remaining headers. Requests routed to local Ollama are neither charged nor blocked by quotas.
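A client might use Retry-After to decide how long to wait, and the remaining-quota headers to slow down before hitting a 429. The sketch below makes two assumptions the text above does not confirm: that Retry-After may be either a number of seconds or an HTTP-date (both are allowed by the HTTP specification), and that the remaining-quota headers hold integer token counts. The function names are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Dict, Mapping, Optional


def retry_after_seconds(
    headers: Mapping[str, str], now: Optional[datetime] = None
) -> Optional[float]:
    """Seconds to wait according to Retry-After, or None if absent/unparseable.

    Real HTTP clients expose case-insensitive header mappings; a plain dict
    is used here, so the key must match exactly.
    """
    value = headers.get("Retry-After")
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():  # delay-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now if now is not None else datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())


def remaining_quota(headers: Mapping[str, str]) -> Dict[str, Optional[int]]:
    """Read the hourly/daily remaining-token headers (assumed to be integers)."""
    def as_int(name: str) -> Optional[int]:
        try:
            return int(headers[name])
        except (KeyError, ValueError):
            return None

    return {
        "hourly": as_int("X-CloakAPI-Quota-Hourly-Remaining"),
        "daily": as_int("X-CloakAPI-Quota-Daily-Remaining"),
    }
```

On a 429 with code hourly_quota_exceeded or daily_quota_exceeded, sleeping for `retry_after_seconds(...)` before retrying avoids hammering the gateway while the window is still exhausted.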

Rate limits

Request rate limits are enforced at two time scales, both using a sliding window:

Scale	Default per (tenant, capability)	Header
60s	60 req/min	`RateLimit-Limit-Minute`
3600s	1000 req/hr	`RateLimit-Limit-Hour`

Public endpoints (/api/v1/receipts/verify, /.well-known/*) have their own per-IP limits (30 req/min for verify, 120 req/min for discovery).
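To stay under these limits, a caller can throttle itself before sending. The following is a rough client-side sketch of a sliding-window log limiter that checks several windows at once. It is not the gateway's implementation, which is not described here, and `SlidingWindowLimiter` and `try_acquire` are hypothetical names.

```python
import time
from collections import deque
from typing import Deque, Iterable, Optional


class SlidingWindowLimiter:
    """Allows at most `limit` events in any trailing `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._stamps: Deque[float] = deque()

    def has_room(self, now: float) -> bool:
        # Drop timestamps that have slid out of the window, then compare.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        return len(self._stamps) < self.limit

    def record(self, now: float) -> None:
        self._stamps.append(now)


def try_acquire(
    limiters: Iterable[SlidingWindowLimiter], now: Optional[float] = None
) -> bool:
    """Record one request only if every window has room; otherwise record nothing."""
    stamp = time.monotonic() if now is None else now
    limiters = list(limiters)
    if all(limiter.has_room(stamp) for limiter in limiters):
        for limiter in limiters:
            limiter.record(stamp)
        return True
    return False


# Mirrors the documented defaults: 60 req/min and 1000 req/hr.
per_minute = SlidingWindowLimiter(60, 60.0)
per_hour = SlidingWindowLimiter(1000, 3600.0)

if try_acquire([per_minute, per_hour]):
    pass  # safe to send the request
```

Checking every window before recording keeps a request that one window rejects from being counted against the others.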

Idempotency

Send an Idempotency-Key: <uuid> header with any POST request. The gateway caches the response for 24 hours, keyed by (tenant, idempotency_key). Retries with the same key replay the same response, including the same receipt, and do not consume quota.

This is the recommended pattern for any production caller, not least because it lets the SDK retry safely during failover.
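The key point is that the key is generated once per logical request and reused on every retry. Here is a minimal sketch of that pattern with the transport passed in as a function, so it runs without a network. `post_with_idempotency` and the `send` callable are hypothetical, and a real caller would also add backoff between attempts.

```python
import uuid
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, str]  # (status code, body)
Sender = Callable[[Dict[str, str], bytes], Response]


def post_with_idempotency(
    send: Sender, payload: bytes, max_attempts: int = 3
) -> Response:
    """POST `payload`, retrying connection failures under one Idempotency-Key."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Generated once, outside the loop, so every retry carries the same key.
    headers = {
        "Idempotency-Key": str(uuid.uuid4()),
        "Content-Type": "application/json",
    }
    last_error: Optional[ConnectionError] = None
    for _ in range(max_attempts):
        try:
            return send(headers, payload)
        except ConnectionError as exc:
            last_error = exc  # response lost; retrying with the same key is safe
    assert last_error is not None
    raise last_error


# Demo transport that drops the first two attempts, then succeeds.
seen_keys = []


def flaky_send(headers: Dict[str, str], payload: bytes) -> Response:
    seen_keys.append(headers["Idempotency-Key"])
    if len(seen_keys) < 3:
        raise ConnectionError("simulated dropped connection")
    return 200, "ok"


result = post_with_idempotency(flaky_send, b"{}")
```

Generating a fresh key inside the retry loop would defeat the purpose, since each attempt would then look like a new request to the gateway.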