Errors

Every error the SDK raises inherits from robotrace.RobotraceError. Catch by type, not by parsing message strings - the messages are human-readable and may change between minor versions. The types are stable and follow the same "sacred contract" rule as log_episode.

The hierarchy

RobotraceError
├── ConfigurationError       # missing api_key / base_url, bad path, etc.
├── TransportError           # network / timeout / DNS / TLS
└── APIError                 # the server responded with an error
    ├── AuthError            # 401 - bad / missing / revoked key
    ├── NotFoundError        # 404 - episode id doesn't exist (or cross-tenant)
    ├── ConflictError        # 409 - episode is archived, etc.
    ├── ValidationError      # 400 - payload didn't match the schema
    ├── RateLimitError       # 429 - quota tripped (carries `retry_after`)
    └── ServerError          # 5xx - flag for retries

APIError and its subclasses carry two extra attributes for debugging:

exc.status_code   # int - the HTTP status the server returned
exc.response_body # parsed JSON body (or raw text on non-JSON 5xx)
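
As a rough illustration of how these two attributes might be used together, the sketch below defines a stand-in APIError (the real class comes from robotrace; its constructor signature is not documented here and is assumed) and a hypothetical helper, describe_api_error, that copes with response_body being either parsed JSON or raw text.

```python
# Stand-in for robotrace.APIError so the sketch runs without the SDK.
# The constructor signature is an assumption made for illustration.
class APIError(Exception):
    def __init__(self, message, status_code, response_body):
        super().__init__(message)
        self.status_code = status_code
        self.response_body = response_body


def describe_api_error(exc):
    """Hypothetical helper: build a one-line summary suitable for logs."""
    body = exc.response_body
    # A JSON body arrives parsed; a non-JSON 5xx body arrives as raw text.
    if isinstance(body, dict):
        detail = body.get("error", "<no error field>")
    else:
        detail = str(body)[:200]  # truncate long HTML error pages
    return f"HTTP {exc.status_code}: {detail}"


json_err = APIError("bad name", 400, {"error": "name must be ≤ 200 chars"})
text_err = APIError("upstream failure", 502, "<html>Bad Gateway</html>")
print(describe_api_error(json_err))
print(describe_api_error(text_err))
```

Checking the body's type before indexing into it avoids a second exception inside the error handler when the server returns non-JSON text.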

When you'll see each one

ConfigurationError

The SDK is missing or misconfigured. Caught at the call site, never reaches the network. Common cases:

  • api_key not passed and ROBOTRACE_API_KEY not set
  • base_url not passed and ROBOTRACE_BASE_URL not set
  • A path passed to upload_video(...) doesn't exist
  • You called ep.upload_video(...) on a metadata-only run (one opened with artifacts=[]) - the SDK fails loud rather than silently dropping bytes

from robotrace import ConfigurationError
 
try:
    rt.log_episode(name="oops", video="/missing/file.mp4")
except ConfigurationError as exc:
    print(f"fix your inputs: {exc}")

Don't retry - the inputs need to change first.

TransportError

The HTTP request failed before the server could respond. DNS, TCP reset, TLS handshake, or a timeout. The request is not known to have landed, so retrying with backoff is generally safe:

from robotrace import TransportError
import time
 
for attempt in range(3):
    try:
        rt.log_episode(...)
        break
    except TransportError:
        if attempt == 2:
            raise
        time.sleep(2 ** attempt)  # waits 1s, then 2s; the third failure re-raises

The SDK doesn't auto-retry transport errors because what's safe depends on the call: retrying a start_episode after a transport error is fine (the server might have created the row twice, but each gets a unique id); retrying an upload PUT against an expired signed URL just wastes bytes.
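
If the same backoff loop is needed in several places, it could be factored into a small helper. The sketch below is one possible shape, not part of the SDK: retry_transport, its parameters, and the stand-in TransportError class are all hypothetical, and the sleep function is injectable so the demo runs instantly.

```python
import time


# Stand-in for robotrace.TransportError so the sketch runs without the SDK.
class TransportError(Exception):
    pass


def retry_transport(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Hypothetical helper: retry `call` on TransportError with backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except TransportError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)


# Demo: a fake call that fails twice, then succeeds.
outcomes = [TransportError("dns"), TransportError("timeout"), "ok"]
delays = []


def flaky_call():
    outcome = outcomes.pop(0)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = retry_transport(flaky_call, sleep=delays.append)
print(result, delays)
```

In real code, call would be something like lambda: rt.log_episode(...), with the default time.sleep left in place. Only wrap calls where a replay is harmless, for the reasons given above.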

AuthError (401)

The API key is missing, malformed, or revoked. Don't retry - mint a fresh key from Portal → API keys.

from robotrace import AuthError
 
try:
    rt.log_episode(...)
except AuthError as exc:
    alerts.notify(
        "RoboTrace key needs rotation",
        details=str(exc),
    )
    raise

NotFoundError (404)

The episode id doesn't exist, or belongs to a different client. We deliberately make these two cases indistinguishable server-side to avoid a UUID-enumeration oracle.

This won't happen during a normal log_episode(...) flow - you only see it if you construct an Episode from a stale id and try to finalize it.

ConflictError (409)

The request is well-formed but conflicts with current server state. The most common cause: trying to finalize(...) an episode that's already been archived.

Restore the episode from /portal/episodes/<id> (or just start a fresh one) before retrying.

ValidationError (400)

The payload didn't pass server-side validation. The server's error field tells you which constraint tripped:

from robotrace import ValidationError
 
try:
    rt.log_episode(name="x" * 500, ...)  # name is capped at 200 chars
except ValidationError as exc:
    print(exc)                # human message
    print(exc.response_body)  # {'error': 'name must be ≤ 200 chars'}

Don't retry without changing the inputs.

RateLimitError (429)

The server rejected the request because a quota was tripped - too many uploads from one client over the rate window, ingest-throttle on a specific endpoint, etc. The exception carries a parsed retry_after (integer seconds) sourced from the response's Retry-After header. None means the server didn't send one.

from robotrace import RateLimitError
import time
 
try:
    rt.log_episode(...)
except RateLimitError as exc:
    # `exc.retry_after` is the server's recommended wait, or None.
    wait = exc.retry_after or 30
    time.sleep(wait)
    rt.log_episode(...)  # try again

The SDK already auto-retries for you on the call sites where re-issuing the same request can never cause a duplicate row or a double-billing event:

Call                                        Auto-retries on 429?
Client.start_episode(...) (create)          yes (up to 4 total attempts)
Episode.upload_video/sensors/actions(...)   yes (signed PUT is idempotent)
rt.evals.create_run(...)                    yes
rt.evals.run_against(...) per-result push   yes (server upserts)
Episode.finalize(...)                       no - see below
rt.evals.complete_run(...)                  no - same reason

Each retried call honors Retry-After when present (capped at 30 seconds so a misconfigured server can't pin a robot rig) and falls back to exponential backoff (1s, 2s, 4s) otherwise.
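
The delay rule described above can be sketched as a small pure function. This is an illustration of the stated policy, not the SDK's source: retry_delay and the two constants are hypothetical names, and applying the 4-attempt limit to every auto-retried call (rather than only start_episode) is an assumption.

```python
MAX_RETRY_AFTER = 30  # seconds; cap on a server-supplied Retry-After
MAX_ATTEMPTS = 4      # "up to 4 total attempts", i.e. at most 3 retries


def retry_delay(retry_index, retry_after=None):
    """Seconds to wait before retry number `retry_index` (0-based)."""
    if retry_after is not None:
        # Honor the server's hint, but never wait longer than the cap.
        return min(retry_after, MAX_RETRY_AFTER)
    # No hint: exponential backoff of 1s, 2s, 4s.
    return 2 ** retry_index


fallback = [retry_delay(i) for i in range(MAX_ATTEMPTS - 1)]
print(fallback)
print(retry_delay(0, retry_after=120))
```

With four total attempts there are three waits, which is why the fallback schedule stops at 4 seconds.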

finalize and complete_run deliberately do not auto-retry - the server may have processed the mutation before the 429 was sent back, and silently re-finalizing in a future paid tier could double-bill artifact storage. Catch RateLimitError at the call site, sleep for exc.retry_after or 30 seconds, then retry yourself.
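
A manual retry for these two calls might look like the sketch below. The helper name call_with_rate_limit_retry, the stand-in RateLimitError constructor, and the fake finalize function are all hypothetical; the sleep function is injectable so the demo does not actually wait.

```python
import time


# Stand-in for robotrace.RateLimitError; the constructor is an assumption.
class RateLimitError(Exception):
    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after


def call_with_rate_limit_retry(call, default_wait=30, sleep=time.sleep):
    """Hypothetical helper: on a 429, wait once, then try exactly once more."""
    try:
        return call()
    except RateLimitError as exc:
        wait = exc.retry_after if exc.retry_after is not None else default_wait
        sleep(wait)
        return call()  # a second RateLimitError propagates to the caller


# Demo: a fake finalize that is throttled once, then succeeds.
attempts = {"n": 0}
waits = []


def fake_finalize():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RateLimitError("quota tripped", retry_after=7)
    return "finalized"


outcome = call_with_rate_limit_retry(fake_finalize, sleep=waits.append)
print(outcome, waits)
```

The sketch retries only once on purpose: because the server may already have applied the mutation, an unbounded loop around finalize or complete_run is exactly what the SDK avoids doing for you.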

ServerError (5xx)

Something blew up on the server side - database hiccup, R2 signing failed, etc. Worth retrying with exponential backoff. The SDK deliberately does not auto-retry because retrying a finalize twice could double-bill artifact storage in future paid tiers.

from robotrace import ServerError
import time
 
for attempt in range(5):
    try:
        rt.log_episode(...)
        break
    except ServerError:
        if attempt == 4:
            raise
        time.sleep(2 ** attempt)  # waits 1, 2, 4, 8 seconds; the fifth failure re-raises

If ServerError persists past a few retries, check status.robotrace.dev (Phase 2) or ping us - there's likely an incident.
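
The per-type guidance above boils down to a retry / don't-retry split, which could be captured in one predicate. The sketch below uses stand-in classes that mirror the documented hierarchy so it runs without the SDK; should_retry and RETRYABLE are hypothetical names, not SDK exports.

```python
# Stand-ins mirroring the documented hierarchy (not the real SDK classes).
class RobotraceError(Exception): pass
class ConfigurationError(RobotraceError): pass
class TransportError(RobotraceError): pass
class APIError(RobotraceError): pass
class AuthError(APIError): pass
class NotFoundError(APIError): pass
class ConflictError(APIError): pass
class ValidationError(APIError): pass
class RateLimitError(APIError): pass
class ServerError(APIError): pass


# Types where waiting and re-issuing the same request can help.
RETRYABLE = (TransportError, RateLimitError, ServerError)


def should_retry(exc):
    """Hypothetical predicate: True if a retry with backoff is worth trying."""
    return isinstance(exc, RETRYABLE)


print(should_retry(ServerError("boom")))
print(should_retry(ValidationError("bad name")))
```

The predicate answers by error type only. Whether a particular call is safe to replay (finalize, for instance) still has to be decided at the call site.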

Catch-all pattern

For training scripts where you want one alert path for any RoboTrace problem without distinguishing types:

import sentry_sdk
from robotrace import RobotraceError
 
try:
    rt.log_episode(...)
except RobotraceError as exc:
    # Anything from the SDK - auth, config, network, server.
    # User code bugs (TypeError, ValueError) still propagate.
    sentry_sdk.capture_exception(exc)
    raise

RobotraceError deliberately does not inherit from OSError / IOError - we don't want a blanket except OSError: in your training loop to silently eat our errors and leave you wondering why nothing's showing up in the portal.
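
The effect of that design choice can be seen in a few lines. The class below is a stand-in that, like the documented one, derives from Exception rather than OSError; log_step and training_step are hypothetical names used only for the demo.

```python
class RobotraceError(Exception):
    """Stand-in for robotrace.RobotraceError (not an OSError subclass)."""


def log_step():
    raise RobotraceError("upload failed")


def training_step():
    try:
        log_step()
    except OSError:
        # Handles file and socket problems in user code only.
        return "swallowed"


try:
    training_step()
    escaped = False
except RobotraceError:
    # The OSError handler above did not match, so the error surfaces here.
    escaped = True

print(escaped)
```

Had the base class inherited from OSError, training_step would have returned "swallowed" and the failure would have gone unnoticed.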

Server vs SDK redaction

The SDK never logs:

  • The value of your API key
  • The body of an ingest request (which can carry trade secrets)
  • Signed PUT URLs (they expire fast but still)

The server side follows the same rule - ingest payloads and key material are never written to logs. If you find an exception message that leaks any of the above, it's a bug - please report it.