Errors
Every error the SDK raises inherits from robotrace.RobotraceError.
Catch by type, not by parsing message strings - the messages are
human-readable and may change between minor versions. The types are
stable and follow the same "sacred contract" rule as
log_episode.
The hierarchy
RobotraceError
├── ConfigurationError   # missing api_key / base_url, bad path, etc.
├── TransportError       # network / timeout / DNS / TLS
└── APIError             # the server responded with an error
    ├── AuthError        # 401 - bad / missing / revoked key
    ├── NotFoundError    # 404 - episode id doesn't exist (or cross-tenant)
    ├── ConflictError    # 409 - episode is archived, etc.
    ├── ValidationError  # 400 - payload didn't match the schema
    ├── RateLimitError   # 429 - quota tripped (carries `retry_after`)
    └── ServerError      # 5xx - flag for retries

APIError and its subclasses carry two extra attributes for debugging:

exc.status_code    # int - the HTTP status the server returned
exc.response_body  # parsed JSON body (or raw text on non-JSON 5xx)

When you'll see each one
ConfigurationError
The SDK's configuration is missing or invalid. Raised at the call site, before any request reaches the network. Common cases:
- api_key not passed and ROBOTRACE_API_KEY not set
- base_url not passed and ROBOTRACE_BASE_URL not set
- A path passed to upload_video(...) doesn't exist
- You called ep.upload_video(...) on a metadata-only run (one opened with artifacts=[]) - the SDK fails loud rather than silently dropping bytes

from robotrace import ConfigurationError

try:
    rt.log_episode(name="oops", video="/missing/file.mp4")
except ConfigurationError as exc:
    print(f"fix your inputs: {exc}")

Don't retry - the inputs need to change first.
TransportError
The HTTP request failed before the server could respond. DNS, TCP reset, TLS handshake, or a timeout. The request is not known to have landed, so retrying with backoff is generally safe:

import time

from robotrace import TransportError

for attempt in range(3):
    try:
        rt.log_episode(...)
        break
    except TransportError:
        if attempt == 2:
            raise  # third failure: give up
        time.sleep(2 ** attempt)  # waits 1s, then 2s

The SDK doesn't auto-retry transport errors because what's safe depends on the call:
retrying a start_episode after a transport error is fine (the server
might have created the row twice, but each gets a unique id);
retrying an upload PUT against an expired signed URL just wastes
bytes.
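
The loop above can be wrapped in a small reusable helper. This is a minimal sketch, not part of the SDK: retry_transport and its parameters are hypothetical names, and a local stand-in TransportError class is defined only so the snippet runs without the SDK installed (in real code you would import it from robotrace).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransportError(Exception):
    """Stand-in for robotrace.TransportError so this sketch runs on its own."""


def retry_transport(
    call: Callable[[], T],
    attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on TransportError with exponential backoff.

    Hypothetical helper, not an SDK function. Waits 1s, 2s, 4s, ... between
    attempts and re-raises the last error once the attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return call()
        except TransportError:
            if attempt == attempts - 1:
                raise
            sleep(2 ** attempt)
    # Only reachable when attempts < 1, since the loop body never ran.
    raise ValueError("attempts must be at least 1")
```

Usage would look like retry_transport(lambda: rt.log_episode(...)). The sleep parameter is injectable so the backoff can be tested without real waiting. Only wrap calls you are comfortable issuing twice.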
AuthError (401)
The API key is missing, malformed, or revoked. Don't retry - mint a fresh key from Portal → API keys.

from robotrace import AuthError

try:
    rt.log_episode(...)
except AuthError as exc:
    alerts.notify(
        "RoboTrace key needs rotation",
        details=str(exc),
    )
    raise

NotFoundError (404)
The episode id doesn't exist, or belongs to a different client. We deliberately make these two cases indistinguishable server-side to avoid a UUID-enumeration oracle.
This won't happen during normal log_episode(...) flow - you only
see it if you constructed an Episode from a stale id and tried to
finalize it.
ConflictError (409)
The request is well-formed but conflicts with current server state.
The most common cause: trying to finalize(...) an episode that's
already been archived.
Restore the episode from /portal/episodes/<id> (or just start a
fresh one) before retrying.
ValidationError (400)
The payload didn't pass server-side validation. The server's
error field tells you which constraint tripped:

from robotrace import ValidationError

try:
    rt.log_episode(name="x" * 500)  # name is capped at 200 chars
except ValidationError as exc:
    print(exc)                # human message
    print(exc.response_body)  # {'error': 'name must be ≤ 200 chars'}

Don't retry without changing the inputs.
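
Because response_body is either parsed JSON or raw text, code that surfaces the tripped constraint has to handle both shapes. A minimal sketch, assuming the JSON body looks like the {'error': ...} example above; constraint_message is a hypothetical helper name, not an SDK function:

```python
from typing import Any


def constraint_message(response_body: Any) -> str:
    """Return the server's 'error' text, falling back to the raw body.

    Assumes the JSON shape shown above ({'error': '...'}). Any other body,
    including raw text, is stringified as-is.
    """
    if isinstance(response_body, dict) and "error" in response_body:
        return str(response_body["error"])
    return str(response_body)
```

In an except ValidationError as exc: block, constraint_message(exc.response_body) would then give one string to log regardless of which form the server sent.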
RateLimitError (429)
The server rejected the request because a quota was tripped - too
many uploads from one client over the rate window, ingest-throttle
on a specific endpoint, etc. The exception carries a parsed
retry_after (integer seconds) sourced from the response's
Retry-After header. None means the server didn't send one.

import time

from robotrace import RateLimitError

try:
    rt.log_episode(...)
except RateLimitError as exc:
    # `exc.retry_after` is the server's recommended wait, or None.
    wait = exc.retry_after or 30
    time.sleep(wait)
    rt.log_episode(...)  # try again

The SDK already auto-retries for you on the call sites where re-issuing the same request can never cause a duplicate row or a double-billing event:
| Call | Auto-retries on 429? |
|---|---|
| Client.start_episode(...) (create) | yes (up to 4 total attempts) |
| Episode.upload_video/sensors/actions(...) | yes (signed PUT is idempotent) |
| rt.evals.create_run(...) | yes |
| rt.evals.run_against(...) per-result push | yes (server upserts) |
| Episode.finalize(...) | no - see below |
| rt.evals.complete_run(...) | no - same reason |

Each retried call honors Retry-After when present (capped at 30
seconds so a misconfigured server can't pin a robot rig) and falls
back to exponential backoff (1s, 2s, 4s) otherwise.
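
The wait policy described above can be sketched as a single function. This is an illustration of the stated rules, not the SDK's actual source; retry_delay and MAX_RETRY_AFTER are hypothetical names.

```python
from typing import Optional

MAX_RETRY_AFTER = 30  # seconds; the cap described above


def retry_delay(retry_after: Optional[int], attempt: int) -> int:
    """Seconds to wait before retry number `attempt` (0-based).

    Uses the server's Retry-After when present, capped at MAX_RETRY_AFTER.
    Otherwise falls back to exponential backoff: 1s, 2s, 4s.
    """
    if retry_after is not None:
        return min(retry_after, MAX_RETRY_AFTER)
    return 2 ** attempt
```

With up to 4 total attempts there are at most 3 waits, which is where the 1s, 2s, 4s fallback sequence comes from.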
finalize and complete_run deliberately do not auto-retry -
the server may have processed the mutation before the 429 was sent
back, and silently re-finalizing in a future paid tier could
double-bill artifact storage. Catch RateLimitError at the call
site, sleep for exc.retry_after or 30 seconds, then retry
yourself.
ServerError (5xx)
Something blew up on the server side - database hiccup, R2 signing
failed, etc. Worth retrying with exponential backoff. The SDK
deliberately does not auto-retry because retrying a finalize
twice could double-bill artifact storage in future paid tiers.

import time

from robotrace import ServerError

for attempt in range(5):
    try:
        rt.log_episode(...)
        break
    except ServerError:
        if attempt == 4:
            raise  # fifth failure: give up
        time.sleep(2 ** attempt)  # waits 1s, 2s, 4s, 8s

If ServerError persists past a few retries, check
status.robotrace.dev (Phase 2) or
ping us - there's likely an incident.
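
The retry guidance from the sections above can be condensed into one predicate. This is a sketch under assumptions: the classes below are local stand-ins that mirror the hierarchy listed on this page so the snippet runs without the SDK, and is_retryable is a hypothetical helper. NotFoundError gets no explicit retry guidance above, so treating it as non-retryable is an assumption.

```python
class RobotraceError(Exception): ...
class ConfigurationError(RobotraceError): ...
class TransportError(RobotraceError): ...
class APIError(RobotraceError): ...
class AuthError(APIError): ...
class NotFoundError(APIError): ...
class ConflictError(APIError): ...
class ValidationError(APIError): ...
class RateLimitError(APIError): ...
class ServerError(APIError): ...

# Types this page describes as worth retrying after a wait.
# Everything else needs a change first: new inputs, a fresh key,
# or a restored episode.
RETRYABLE = (TransportError, RateLimitError, ServerError)


def is_retryable(exc: BaseException) -> bool:
    """True if waiting and re-issuing the same call might succeed."""
    return isinstance(exc, RETRYABLE)
```

Whether a retry is safe still depends on the call: the predicate says nothing about finalize or complete_run, where a repeated mutation is the concern.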
Catch-all pattern
For training scripts where you want one alert path for any RoboTrace problem without distinguishing types:

from robotrace import RobotraceError

try:
    rt.log_episode(...)
except RobotraceError as exc:
    # Anything from the SDK - auth, config, network, server.
    # User code bugs (TypeError, ValueError) still propagate.
    sentry_sdk.capture_exception(exc)
    raise

RobotraceError deliberately does not inherit from
OSError / IOError - we don't want an except OSError: written for
file or network I/O in your training loop to silently eat our errors
and leave you wondering why nothing's showing up in the portal.
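
To illustrate what not inheriting from OSError buys: a handler written for file or network I/O failures will not catch the SDK's errors. A minimal sketch with a local stand-in class (assumed here to subclass Exception directly) and a hypothetical run_step function:

```python
class RobotraceError(Exception):
    """Stand-in for robotrace.RobotraceError (assumed to subclass Exception)."""


def run_step(failure: Exception) -> str:
    """Raise the given failure inside a handler that only targets OSError."""
    try:
        raise failure
    except OSError:
        # Typical I/O fallback: missing files, dropped sockets, and so on.
        return "swallowed"


# IOError is an alias of OSError in Python 3, so both names behave the same.
assert IOError is OSError
print(run_step(FileNotFoundError("missing.mp4")))  # swallowed

try:
    run_step(RobotraceError("upload rejected"))
except RobotraceError as exc:
    print(f"propagated: {exc}")  # propagated: upload rejected
```

A bare except Exception: would still catch both, so prefer narrow handlers around SDK calls.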
Server vs SDK redaction
The SDK never logs:
- The value of your API key
- The body of an ingest request (which can carry trade secrets)
- Signed PUT URLs (they expire fast, but still)
The server side follows the same rule - ingest payloads and key material are never written to logs. If you find an exception message that leaks any of the above, it's a bug - please report it.