# Retries & dead-letter

> How SchedStack retries failed deliveries, what's terminal vs retryable, the three dead-letter triggers, and how TTL time-caps retries — so nothing is ever silently lost.

SchedStack accepts a delivery, then owns it until it reaches a terminal state. A delivery
that fails on a retryable error is retried with exponential backoff. A delivery that can't
succeed is **recorded** — as `dead_letter` or `expired` — never silently dropped.

This page is the reliability model: the retry policy, what's retryable vs terminal, the
three ways a delivery dead-letters, and how TTL time-caps the whole process. For the
operational how-to — listing, inspecting, and replaying failed deliveries — see
[Dead-letter & replay](/docs/guides/dead-letter-and-replay/).

## Terminal states

Every delivery ends in exactly one of these. All three are durably recorded and queryable.

| State | Meaning |
|---|---|
| `succeeded` | The endpoint returned a 2xx. |
| `dead_letter` | Gave up after a terminal response, exhausted attempts, or an exhausted retry budget. |
| `expired` | The delivery's TTL deadline passed before it could be (re)delivered. |

At-least-once delivery means a retry can arrive **after** an earlier attempt actually
succeeded, for example when your endpoint finished the work but its response timed out before
SchedStack saw the `2xx`. SchedStack makes that safe with
[signed requests](/docs/guides/verify-signatures/) and
[idempotency keys](/docs/concepts/idempotency/), but **your endpoint must dedupe.** A `2xx`
is the only signal that stops retries, so return it only once you've durably accepted the work.
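
A minimal sketch of endpoint-side deduplication, assuming the idempotency key has already been extracted from the request (how it is delivered is covered on the idempotency page). The class name, method name, and the in-memory set are hypothetical stand-ins; a real endpoint would use a durable store such as a database table with a unique constraint.

```python
import threading


class DedupingHandler:
    """Accepts each idempotency key at most once, even when retries arrive."""

    def __init__(self) -> None:
        self._seen: set[str] = set()      # stand-in for a durable store
        self._lock = threading.Lock()
        self.processed: list[dict] = []   # work that was actually accepted

    def handle(self, idempotency_key: str, payload: dict) -> int:
        """Return the HTTP status the endpoint should respond with."""
        with self._lock:
            if idempotency_key in self._seen:
                # Duplicate of work already accepted: acknowledge it again
                # so retries stop, but do not process it a second time.
                return 200
            # Accept the work and record the key in one critical section,
            # so a concurrent retry cannot slip in between the two steps.
            self.processed.append(payload)
            self._seen.add(idempotency_key)
        return 200
```

Returning `200` for a duplicate is deliberate: the work was already accepted, and a `2xx` is what stops further retries.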

## Retry policy

Retries use exponential backoff. The delay after a failed attempt is:

```text
delay = min(base * factor^attempt, max)
```

where `attempt` is 0-based (the delay after the **first** failure uses `attempt = 0`).

Defaults, applied when you don't set `retry_policy`:

| Field | Default | Meaning |
|---|---|---|
| `max_attempts` | `8` | Total attempts before dead-lettering (1–50). |
| `base` | `5s` | Base delay. |
| `factor` | `2` | Growth factor per attempt (1–100). |
| `max` | `1h` | Cap on any single backoff delay. |

With the defaults, the delays between attempts are:

```text
attempt 1 fails → wait 5s
attempt 2 fails → wait 10s
attempt 3 fails → wait 20s
attempt 4 fails → wait 40s
attempt 5 fails → wait 1m20s
attempt 6 fails → wait 2m40s
attempt 7 fails → wait 5m20s
attempt 8 fails → dead_letter (max_attempts reached)
```
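
The schedule above can be reproduced with a few lines of Python. This is only an illustration of the formula, not SchedStack's implementation; the function names are hypothetical and durations are expressed as plain seconds.

```python
def backoff_delay(attempt: int, base_s: float = 5.0,
                  factor: float = 2.0, max_s: float = 3600.0) -> float:
    """Delay in seconds after a failed attempt; `attempt` is 0-based."""
    return min(base_s * factor ** attempt, max_s)


def backoff_schedule(max_attempts: int = 8, base_s: float = 5.0,
                     factor: float = 2.0, max_s: float = 3600.0) -> list[float]:
    """Waits between consecutive attempts: max_attempts - 1 entries,
    because nothing is scheduled after the final attempt fails."""
    return [backoff_delay(i, base_s, factor, max_s)
            for i in range(max_attempts - 1)]


if __name__ == "__main__":
    for n, delay in enumerate(backoff_schedule(), start=1):
        print(f"attempt {n} fails -> wait {int(delay)}s")
```

With the defaults the one-hour cap never engages: the longest wait is 320 seconds. The cap starts to matter with more attempts or a larger `base` or `factor`.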

The API accepts a `jitter` flag on `retry_policy`, but SchedStack does **not** currently
apply jitter: backoff delays are deterministic. Don't rely on it to spread out a thundering
herd; the flag is reserved for a future release.

### Setting a custom policy

Pass `retry_policy` when you create a schedule. Durations are strings (e.g. `"30s"`, `"1h"`),
consistent with `delay` and `ttl`.

```bash
curl -X POST https://api.schedstack.com/v1/schedules \
  -H "Authorization: Bearer sk_test_…" \
  -H "Content-Type: application/json" \
  -d '{
    "endpoint": "https://example.com/webhooks/orders",
    "method": "POST",
    "body": "{\"order_id\":\"o_123\"}",
    "delay": "5m",
    "retry_policy": {
      "max_attempts": 12,
      "base": "10s",
      "factor": 2,
      "max": "30m"
    }
  }'
```

The same request body on its own:

```json
{
  "endpoint": "https://example.com/webhooks/orders",
  "method": "POST",
  "body": "{\"order_id\":\"o_123\"}",
  "delay": "5m",
  "retry_policy": {
    "max_attempts": 12,
    "base": "10s",
    "factor": 2,
    "max": "30m"
  }
}
```

Validation: `max_attempts` must be 1–50, `factor` must be 1–100, and `base`/`max` must be
valid duration strings. Omitted fields fall back to the defaults above.
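
If you want to catch mistakes before calling the API, the rules above can be mirrored client-side. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the duration grammar (whole numbers with `s`, `m`, or `h` units, optionally chained as in `1m20s`) is inferred from the examples on this page rather than from a formal specification.

```python
import re
from typing import Optional

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}

# Defaults as documented in the table above.
DEFAULTS = {"max_attempts": 8, "base": "5s", "factor": 2, "max": "1h"}


def parse_duration(text: str) -> int:
    """Parse strings such as "30s", "1h" or "1m20s" into seconds.

    The accepted grammar is an assumption, not the API's definition.
    """
    if not isinstance(text, str) or not re.fullmatch(r"(\d+[smh])+", text):
        raise ValueError(f"invalid duration: {text!r}")
    parts = re.findall(r"(\d+)([smh])", text)
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def normalize_retry_policy(policy: Optional[dict]) -> dict:
    """Fill omitted fields with the defaults and check the documented ranges."""
    merged = {**DEFAULTS, **(policy or {})}
    if not 1 <= merged["max_attempts"] <= 50:
        raise ValueError("max_attempts must be 1-50")
    if not 1 <= merged["factor"] <= 100:
        raise ValueError("factor must be 1-100")
    parse_duration(merged["base"])  # raises ValueError if malformed
    parse_duration(merged["max"])
    return merged
```

The server remains the source of truth; a local check like this only shortens the feedback loop.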

## Retryable vs terminal

After each attempt, SchedStack classifies the outcome. Retryable outcomes back off and try
again; terminal outcomes dead-letter **immediately**, without using any of the remaining
attempts.

| Outcome | Class |
|---|---|
| `2xx` | Success — stop. |
| `408`, `429` | Retryable. |
| `5xx` | Retryable. |
| Transport fault (timeout, connection refused, DNS/TLS error) | Retryable. |
| `3xx` | **Terminal** — redirects are not followed; a 3xx is a misconfiguration. |
| `4xx` (other than 408/429) | **Terminal** — retrying a client error won't help. |
| Blocked address (SSRF guard) | **Terminal.** |
| Scheme / header-injection rejected (request never sent) | **Terminal.** |

On a retryable outcome, SchedStack reads the endpoint's `Retry-After` header (delta-seconds
or HTTP-date), falling back to `RateLimit-Reset`. If the header is present and resolves to a
time in the future, the next attempt is scheduled for the later of `now + backoff` and that
time, so a rate-limited endpoint is never re-hit before it says it's ready.
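
The classification and the header handling can be sketched as follows. This is an illustration under stated assumptions, not SchedStack's code: the function names are hypothetical, header lookup is done on a plain case-sensitive dict, `RateLimit-Reset` is parsed with the same rules as `Retry-After`, and pre-flight rejections (blocked address, rejected request) are left out because they never produce a status code.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def classify(status: Optional[int]) -> str:
    """Map an HTTP status (None = transport fault) to an outcome class."""
    if status is None:
        return "retryable"
    if 200 <= status < 300:
        return "success"
    if status in (408, 429) or 500 <= status < 600:
        return "retryable"
    return "terminal"  # 3xx, the remaining 4xx, and anything unexpected


def retry_after_time(headers: dict, now: datetime) -> Optional[datetime]:
    """Time the endpoint asks us to wait for, or None if absent, bad, or past."""
    value = headers.get("Retry-After")
    if value is None:
        value = headers.get("RateLimit-Reset")
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():  # delta-seconds form
        when = now + timedelta(seconds=int(value))
    else:                # HTTP-date form
        try:
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return None
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
    return when if when > now else None


def next_attempt_at(now: datetime, backoff: timedelta, headers: dict) -> datetime:
    """Later of the backoff time and the endpoint's requested time."""
    earliest = now + backoff
    hinted = retry_after_time(headers, now)
    return max(earliest, hinted) if hinted else earliest
```

`now` must be timezone-aware here, since HTTP-dates parse to aware datetimes and Python refuses to compare aware with naive values.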

## The three dead-letter triggers

A delivery moves to `dead_letter` for exactly one of these reasons:

1. **Terminal response.** The attempt returned a terminal class (a 3xx, a 4xx other than
   408/429, a blocked address, or a rejected request). Retrying can't help, so SchedStack
   stops immediately.

2. **Attempts exhausted.** A retryable failure occurred on the final attempt — the attempt
   count reached `max_attempts`.

3. **Retry budget exhausted.** Each tenant has a per-endpoint retry budget that caps
   aggregate retry amplification. When it's exhausted, further retries dead-letter instead
   of piling on. Only retries consume budget; first attempts don't.

Whichever trigger fires, the final attempt — status code, timing, and error — is recorded
on the delivery. Nothing disappears.
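
As a mental model, the three triggers reduce to one decision made after each failed attempt. The sketch below is hypothetical throughout: the reason strings, the parameter names, the order of the checks, and the idea of the retry budget as a simple remaining count are all assumptions made for illustration.

```python
from typing import Optional


def dead_letter_reason(outcome: str, attempts_made: int, max_attempts: int,
                       budget_remaining: int) -> Optional[str]:
    """Return why a failed delivery dead-letters, or None if it may retry.

    `outcome` is "terminal" or "retryable" for the attempt that just failed.
    """
    if outcome == "terminal":
        return "terminal_response"        # trigger 1: retrying can't help
    if attempts_made >= max_attempts:
        return "attempts_exhausted"       # trigger 2: that was the last attempt
    if budget_remaining <= 0:
        return "retry_budget_exhausted"   # trigger 3: no budget for a retry
    return None                           # schedule the next attempt
```

Because only retries consume budget, the budget is consulted only at the point where a retry is about to be scheduled, never before the first attempt.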

## TTL and expiry

`ttl` time-caps the entire process, complementing the count-based `max_attempts`. When you
set a `ttl`, the deadline is:

```text
deadline = first_fire_time + ttl
```

A delivery becomes `expired` when:

- it can't be **initiated** before the deadline (e.g. a circuit-breaker deferral for an
  unhealthy endpoint would push the next attempt past it), or
- a scheduled **retry** would land after the deadline — SchedStack expires it now rather
  than firing a request it knows is already too late.

`expired` is a distinct terminal state from `dead_letter`, but it's recorded the same way:
visible, queryable, and never a silent drop.

`max_attempts` caps retries by **count**; `ttl` caps them by **wall-clock time**. Whichever
limit is reached first ends the delivery. Set `ttl` when a late delivery is worse than no
delivery (e.g. a time-boxed notification).
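
To see how the two limits interact, the sketch below walks a backoff schedule against a deadline. It is a simplified model with hypothetical names: every attempt is assumed to fail instantly, a retry landing exactly on the deadline is assumed to be allowed, and circuit-breaker deferrals are not modeled.

```python
from datetime import datetime, timedelta, timezone


def plan_attempts(first_fire: datetime, ttl: timedelta,
                  delays: list[timedelta]) -> tuple[list[datetime], str]:
    """Return the attempt times that fit before the deadline and the
    terminal state reached if every one of those attempts fails."""
    deadline = first_fire + ttl
    times = [first_fire]
    for delay in delays:
        next_time = times[-1] + delay
        if next_time > deadline:
            # The retry would land too late: expire instead of firing it.
            return times, "expired"
        times.append(next_time)
    # All attempts fit inside the TTL, so the count limit ends the delivery.
    return times, "dead_letter"


if __name__ == "__main__":
    default_delays = [timedelta(seconds=s) for s in (5, 10, 20, 40, 80, 160, 320)]
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    times, state = plan_attempts(start, timedelta(minutes=2), default_delays)
    print(len(times), state)
```

Under these assumptions, the default policy with a `ttl` of `2m` makes five attempts (at 0s, 5s, 15s, 35s, and 75s) and then expires, because the sixth would land at 155s. With a `ttl` of `1h` all eight attempts fit and the delivery dead-letters on attempt count instead.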

## Everything ends recorded

The core guarantee: every accepted delivery reaches `succeeded`, `dead_letter`, or
`expired`, and each one is durable and inspectable. There is no fourth, silent outcome.

To find and act on failed deliveries — list the dead-letter queue, read the attempt
history, and replay — continue to [Dead-letter & replay](/docs/guides/dead-letter-and-replay/).
