# FlowMarkup Testing Framework

**Version:** 0.9.0
**Date:** 2026-03-28
**Designed by:** Łukasz Nawojczyk
**Copyright:** © 2026 Progralink Łukasz Nawojczyk. All rights reserved.
**License:** Open Web Foundation Agreement 1.0 (OWFa 1.0)

This document specifies FlowMarkup's native testing framework — a declarative, YAML-first approach to unit testing, integration testing, and assertion of flows. Tests are authored as `.flowmarkup-test.yaml` files and executed by the engine's built-in test runner.

---

## Table of Contents

1. [Design Philosophy](#1-design-philosophy)
2. [Test File Format](#2-test-file-format)
3. [Test Cases](#3-test-cases)
4. [Mocks](#4-mocks)
5. [Fixtures](#5-fixtures)
6. [Assertions](#6-assertions)
7. [Error Testing](#7-error-testing)
8. [Yield / Streaming Testing](#8-yield--streaming-testing)
9. [Cross-Cutting Concern Testing](#9-cross-cutting-concern-testing)
10. [Step-Level Testing](#10-step-level-testing)
11. [Parameterized Tests](#11-parameterized-tests)
12. [Migration & Replay Testing](#12-migration--replay-testing)
13. [Snapshot Testing](#13-snapshot-testing)
14. [Breakpoints & Inspection](#14-breakpoints--inspection)
15. [Invariants](#15-invariants)
16. [Contract Testing](#16-contract-testing)
17. [Fault Injection](#17-fault-injection)
18. [Lifecycle Hooks](#18-lifecycle-hooks)
19. [Test Runner](#19-test-runner)
20. [Concurrency & Timing](#20-concurrency--timing)
21. [Coverage](#21-coverage)
22. [SA Rules](#22-sa-rules)
23. [Common Mistakes](#23-common-mistakes)

---

## 1. Design Philosophy

### Why YAML-native testing?

FlowMarkup flows are data — YAML documents with CEL expressions. Testing them with Java/Python/Go tests creates a language barrier: the test author must context-switch between the flow's declarative semantics and an imperative test harness. Temporal inherits Go/Java testing; Airflow inherits Python's `pytest`. FlowMarkup has no host language to inherit from.

The testing framework follows three principles:

1. **Tests are flows.** A test case is a constrained flow document — same YAML syntax, same CEL expressions, same `do:` body. Flow authors already know how to write tests.

2. **Mocks replace services, not steps.** The boundary between "code under test" and "external world" is the service layer (`SERVICES.*`), the exec boundary, the HTTP boundary, the sub-flow boundary, the mail boundary, the storage boundary, the SSH boundary, and the resource binding (`RESOURCES.*`). Mocks intercept at these boundaries — never inside the flow's own control logic.

3. **Assertions are CEL.** Every assertion is a CEL expression that evaluates to `true` or throws. No new assertion DSL. The same `assert` directive used in production flows is used in tests, extended with matcher functions for richer failure messages.

### Test tiers

The framework supports three tiers of testing with increasing scope:

| Tier | Activation | I/O Behavior | Use Case |
|---|---|---|---|
| **Unit** | `mode: UNIT` (default) | All actions must be mocked; unmatched calls throw `UnmockedServiceError` | Test a single flow's logic in complete isolation |
| **Integration** | `mode: INTEGRATION` | Actions call real services unless explicitly mocked; hooks may use actions | Test flow interaction with real (or staging) services |
| **Contract** | `flowmarkup test --contracts` | No execution; validates caller/callee input/output/throws compatibility | Verify cross-flow contracts before deployment |

Unit mode is hermetic by default — the test controls every I/O boundary. Integration mode flips the default: real services unless overridden. This mirrors the distinction every mature framework makes (Temporal's `TestWorkflowEnvironment` vs `DevServer`, Camunda's `@CamundaSpringProcessTest` vs embedded engine, Prefect's `prefect_test_harness` vs real runs).

> **Production environment safeguard.** Engines MUST implement a deployment-level configuration flag (`testMode.allowIntegration`, default: `false` in production profiles) that controls whether INTEGRATION mode tests can execute. When `testMode.allowIntegration` is `false`, any attempt to run an INTEGRATION mode test MUST fail immediately with a `ConfigurationError` before any test step executes.
>
> Additionally, INTEGRATION mode tests MUST be annotated with a required `environment:` field (enum: `development`, `staging`, `ci`) that declares the intended execution environment. Engines MUST refuse to run INTEGRATION tests when the declared environment does not match the engine's current environment profile. This prevents accidental execution of integration tests against production infrastructure, which could modify real data, send real emails, or trigger real external API calls. *(CWE-668: Exposure of Resource to Wrong Sphere)*
>
> Static analysis rule SA-TEST-44 MUST flag INTEGRATION mode tests that do not specify an `environment:` field.

> **CRITICAL: Integration mode production deployment prohibition.** The engine MUST require a positive attestation that the current environment permits integration tests. The engine MUST check for the environment variable `FLOWMARKUP_ALLOW_INTEGRATION=true` (or equivalent engine configuration flag). When this variable is absent or set to any value other than `true`, integration mode MUST be rejected regardless of other environment indicators. The engine MUST NOT rely on the absence of production indicators (such as `NODE_ENV=production`) as evidence that integration tests are safe to run — this deny-list approach fails for non-standard deployment configurations. When `FLOWMARKUP_ALLOW_INTEGRATION=true` is set, the engine SHOULD additionally check for production indicators as defense-in-depth and emit a WARNING if any are detected. *(CWE-489: Active Debug Code)*
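
As a rough illustration of the safeguards above, an INTEGRATION suite might be declared as follows. This is only a sketch: the text requires an `environment:` field but does not say where it lives, so its placement at suite level is an assumption, and the test name and input values are hypothetical.

```yaml
# Sketch only: placing `environment:` at suite level is an assumption.
flowmarkup-test:
  flow: ./order-processing.flowmarkup.yaml
  mode: INTEGRATION
  environment: staging          # one of: development, staging, ci
  tests:
    - name: approves critical priority order against staging services
      input:
        order: { id: "ORD-1001", status: pending, total: 150.00 }
        priority: critical
      expect:
        output:
          result:
            outcome: approved
```

Under the rules above, such a suite would be rejected unless `testMode.allowIntegration` is enabled, `FLOWMARKUP_ALLOW_INTEGRATION=true` is set, and the engine's environment profile matches the declared `environment:` value.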

### What the framework does NOT do

- **No step mocking.** You cannot replace a `set` or `if` step. Mocks apply to I/O boundaries only (services, sub-flows, HTTP, exec, mail, storage, SSH, resources).
- **No flow mutation.** The flow under test runs exactly as authored. The test controls inputs, mocks, and environment — not the flow's internal structure.
- **No time travel.** `wait` and `waitUntil` use virtual time (the engine's clock is injectable), but there is no "rewind" mechanism. Replay testing (§12) replays from the start, not from an arbitrary midpoint.

> **Security control preservation in mocks.** Mock definitions MUST preserve the security characteristics of the actions they replace:
> 1. **Capability checking:** The engine MUST still enforce capability requirements on mocked actions. A mock for a `request` step MUST still require `REQUEST` capability in the flow's `requires` declaration. SA-TEST-45 MUST flag mock definitions that would bypass capability enforcement.
> 2. **Secret handling:** Mocked actions that accept secret inputs MUST still validate that secrets are passed through permitted injection points (not args, not URL query parameters). The mock runtime MUST NOT allow secrets to leak through mock assertion inspection.
> 3. **Taint tracking:** Mock return values MUST respect the same taint annotations (`$secret`, `$exportable`) as their real counterparts. If a real action's output is annotated `$secret: true`, the mock's return value MUST also carry that annotation.

> **Mock security control parity.** Test mocks MUST NOT bypass security controls that would apply in production. Specifically:
> 1. **Taint propagation on mock responses:** Mocked service responses MUST still pass through taint propagation — if a mock returns data that would be tainted in production, it MUST be tainted in test. Mock return values MUST be processed by the same taint analysis that applies to real service responses.
> 2. **Capability enforcement on mocked services:** Capability checks MUST be enforced even when the underlying service is mocked. A flow that lacks the required capability for an action MUST fail with `MissingCapabilityError` regardless of whether the action is mocked.
> 3. **Secret redaction on mock data:** Secret redaction MUST apply to mock response data. If a mock response contains values that match secret patterns or carry `$secret` taint, those values MUST be redacted in logs, error messages, and debug output.
>
> Engines SHOULD provide a `securityParity: true` flag (default: `true`) that enforces security controls on mocked responses. When `securityParity` is `true`, the engine MUST apply the full production security pipeline (taint propagation, capability checking, secret redaction) to all mock interactions. Disabling this flag MUST emit a warning and MUST be recorded in the test execution audit log. *(CWE-693: Protection Mechanism Failure)*
>
> `securityParity: false` MUST only be accepted when ALL of the following conditions are met: (1) the engine is running in an explicitly declared test execution context (`FLOWMARKUP_TEST_MODE=true` environment variable or equivalent engine configuration), (2) the engine-level `testMode.allowInsecureMocks` is `true`, and (3) the test file contains a `_testOnly_: true` marker that the engine verifies is not present in production flow definitions. If ANY condition is not met, the engine MUST reject `securityParity: false` with `SecurityViolationError`. SA-TEST-48 MUST fire at ERROR severity when `securityParity: false` is detected.

---

## 2. Test File Format

### Naming convention

```
<flow-name>.flowmarkup-test.yaml          # co-located test file
tests/<flow-name>.flowmarkup-test.yaml    # or in a tests/ directory
```

The engine discovers tests by the `.flowmarkup-test.yaml` suffix. A test file references exactly one flow under test.

### Structure

```yaml
flowmarkup-test:
  flow: ./order-processing.flowmarkup.yaml    # path to flow under test (required)
  readme: Tests for the order processing flow.
  mode: UNIT                                  # UNIT (default) | INTEGRATION

  # Flow-level invariants — must hold for ALL test cases
  invariants:
    - =output.result.outcome != null
    - "=!steps.charge_payment.called || steps.send_receipt.called"

  # Shared fixtures — available to all test cases
  fixtures:
    standard_order:
      id: "ORD-1001"
      status: pending
      total: 150.00
      items:
        - { sku: "WG-1001", quantity: 2, price: 50.00 }
        - { sku: "AC-2003", quantity: 1, price: 50.00 }
      customer: { name: Alice, email: alice@example.com, tier: standard }

  # Shared mock definitions — reusable across test cases
  mocks:
    payment_success:
      service: payment
      operation: "*"
      return: { tx_id: "TX-9999" }

  # Shared environment overrides
  env:
    DEFAULT_CURRENCY: USD

  # Shared resource injection (RESOURCES.* bindings)
  resources:
    app_config:
      content: '{"feature_flags": {"dark_mode": true}}'

  # Service overrides (swap production providers for test providers)
  services:
    db:
      provider: progralink.clients.db.h2
      properties: { host: localhost }

  # Test cases
  tests:
    - name: approves critical priority order
      input:
        order: ${{ fixtures.standard_order }}
        priority: critical
      mocks:
        - ${{ mocks.payment_success }}
      expect:
        output:
          result:
            outcome: approved
            notes: "Auto-approved: critical priority"

    - name: rejects missing items
      input:
        order: { id: "ORD-0001", status: pending, items: [] }
        priority: normal
      expect:
        throws: AssertionError
```

### Top-level keys

| Key | Type | Required | Description |
|---|---|---|---|
| `flow` | string | yes | Path to the flow under test (relative to test file, or absolute) |
| `readme` | string | no | Documentation for the test suite |
| `mode` | string | no | `UNIT` (default, hermetic) or `INTEGRATION` (real services unless mocked) |
| `fixtures` | map | no | Named fixture data, referenced as `${{ fixtures.<name> }}` |
| `mocks` | map | no | Named mock definitions, referenced as `${{ mocks.<name> }}` |
| `invariants` | list | no | CEL expressions or `{ when, assert }` objects that must hold for every test case (§15) |
| `env` | map | no | `ENV.*` overrides applied to all tests (merged with per-test `env:`) |
| `secrets` | map | no | `SECRET.*` overrides (string values, treated as opaque) |
| `globals` | map | no | Pre-seeded `GLOBAL.*` values |
| `context` | map | no | Pre-seeded `CONTEXT.*` values |
| `runtime` | map | no | Synthetic `RUNTIME.*` values (OS, ENGINE, PLATFORM) |
| `timeout` | duration | no | Default timeout per test case (default: `30s`) |
| `tags` | string[] | no | Suite-level tags for filtering |
| `contracts` | list | no | Explicit cross-flow contract declarations; auto-detected when omitted (§16) |
| `services` | map | no | Override flow-level `services:` definitions (mock providers, properties) |
| `resources` | map | no | Injected `RESOURCES.*` bindings for all tests (mock file/directory handles) |
| `allowReal` | string[] or object | no | I/O boundaries exempt from hermetic mocking in `UNIT` mode (see §4.13) |
| `beforeAll` | step list | no | Run once before all tests (§18) |
| `afterAll` | step list | no | Run once after all tests (§18) |
| `beforeEach` | step list | no | Run before each test (§18) |
| `afterEach` | step list | no | Run after each test (§18) |
| `tests` | list | yes | Test case definitions |

### Test interpolation (`${{ }}`)

Test files use `${{ }}` for referencing fixtures, mocks, and matrix parameters. This is a **post-parse substitution** — the engine parses the YAML first, then resolves all `${{ }}` expressions on the parsed structure.

| Syntax | Resolves to |
|---|---|
| `${{ fixtures.<name> }}` | The named fixture value (injected verbatim as YAML) |
| `${{ mocks.<name> }}` | The named mock definition |
| `${{ matrix.<key> }}` | The current matrix row's value for that key |

**Evaluation order:** The YAML document is parsed first, then `${{ }}` placeholders in the parsed tree are resolved, and finally `=` CEL expressions are evaluated at runtime.

**Type handling:** The substituted value retains its YAML type. If `${{ matrix.items }}` resolves to a list, the surrounding YAML must structurally accept a list. Inside YAML flow mappings (e.g., `{ total: ${{ matrix.total }} }`), scalar values substitute directly; complex values (lists, maps) require block form or JSON-style inline syntax in the matrix row.

**Interaction with CEL:** `=${{ matrix.total }}` first substitutes the matrix value (e.g., `50`), then string-escapes it, producing `="50"` (a CEL string literal). To allow direct CEL evaluation of substituted values, use the explicit form `${{= matrix.total }}` which produces `=50`. Outside CEL expression contexts, `${{ matrix.total }}` alone suffices and retains its YAML type without escaping.

`${{ }}` is distinct from FlowMarkup's runtime `{{ }}` template interpolation. `${{ }}` is test-framework preprocessing; `{{ }}` is CEL interpolation inside string values at flow execution time.
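
The three substitution forms described above can be compared side by side. The fragment below is a sketch under stated assumptions: the matrix row shape (one map per row) is assumed from the `matrix` key description rather than from §11, and the test name, input shape, and `label` field are hypothetical.

```yaml
tests:
  - name: quotes the matrix total
    matrix:                               # assumed row shape: one map per row
      - { total: 50 }
      - { total: 75 }
    input:
      order:
        id: "ORD-1001"
        total: ${{ matrix.total }}        # plain substitution: stays a YAML number (50)
    mocks:
      - service: pricing
        operation: calculate
        return:
          total: ${{= matrix.total }}     # explicit form: becomes =50 (CEL number)
          label: =${{ matrix.total }}     # escaped form: becomes ="50" (CEL string literal)
```

Block-style mappings are used here so that each placeholder occupies a whole scalar value.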

> **Template injection prevention.** The `${{ }}` post-parse substitution mechanism MUST be restricted to simple variable references and literal values. Specifically:
> 1. `${{ }}` expressions MUST NOT support function calls, method invocations, or property access chains deeper than 2 levels (e.g., `${{ fixtures.key }}` is allowed, `${{ fixtures.nested.deep.value.toString() }}` is not).
> 2. The substitution MUST be performed on the parsed YAML structure (post-parse), NOT on raw YAML text (pre-parse). Pre-parse text substitution enables YAML injection where a substituted value containing YAML syntax (colons, dashes, brackets) alters the document structure. Any structural divergence between the template and the substituted result MUST raise a `TemplateInjectionError`.
> 3. Values substituted via `${{ }}` MUST be treated as string literals by default. If the substitution result is placed in a CEL expression context (prefixed with `=`), the substituted value MUST be string-escaped before insertion — the engine MUST NOT allow `${{ }}` substitution to produce executable CEL syntax. For example, `=${{ matrix.total }}` MUST produce `="50"` (string literal), not `=50` (numeric expression). To allow CEL evaluation of matrix values, flow authors MUST use the explicit form `${{= matrix.total }}` (with `=` prefix inside the braces), which signals intentional CEL evaluation. SA-TEST-46 MUST fire at ERROR severity when `${{ }}` substitution occurs in a CEL expression context without the explicit `${{= }}` form and the substituted value is derived from CSV-loaded matrix data or external sources. *(CWE-94)*
>
> Static analysis rule SA-TEST-46 MUST flag `${{ }}` expressions that reference user-controlled values or contain function calls. *(CWE-94: Improper Control of Generation of Code)*

> **Template expression sandboxing.** The `${{ }}` template syntax MUST be evaluated in a sandboxed context. Template expressions MUST NOT have access to `RUNTIME.*` internals, engine configuration, or environment variables beyond those explicitly declared in the flow's `env:` section. Engines MUST reject template expressions that attempt to access out-of-scope variables with a `TemplateAccessViolationError`. Template evaluation MUST apply output encoding appropriate to the context: shell escaping for `exec` templates, URL encoding for `request` templates, and HTML entity encoding for `mail` templates. Unencoded template output in a security-sensitive context MUST be rejected by static analysis rule SA-TEST-46. *(CWE-94: Improper Control of Generation of Code)*

---

## 3. Test Cases

Each entry in `tests:` is a complete test scenario.

```yaml
tests:
  - name: calculates shipping for premium customer    # required, unique within suite
    _notes_: Verifies the premium discount path.
    tags: [premium, shipping]                          # for selective execution
    timeout: 10s                                       # override suite default

    # --- Arrange ---
    input:                          # flow input params (matches flow's input: contract)
      order: ${{ fixtures.premium_order }}
      priority: normal
    env:                            # ENV.* overrides (merged with suite-level)
      SHIPPING_RATE: "5.99"
    secrets:                        # SECRET.* overrides
      API_KEY: test-key-1234
    globals:                        # GLOBAL.* pre-seeded values
      request_count: 42
    context:                        # CONTEXT.* pre-seeded values
      correlation_id: test-corr-001
    vars:                           # override flow's vars: initial values
      discount_rate: 0.15
    runtime:                        # synthetic RUNTIME.* values
      ENGINE: { NAME: FlowMarkup, VERSION: "2.1.0", VENDOR: Acme }
      OS: { NAME: Linux, VERSION: "6.1", ARCH: amd64 }
      PLATFORM: { NAME: JVM, VERSION: "21.0.1" }  # varies by engine implementation
    services:                       # override flow-level service definitions
      db:
        provider: progralink.clients.db.h2   # swap postgres for in-memory H2
        properties:
          host: localhost
          port: 9092
    resources:                      # inject RESOURCES.* bindings
      config:
        path: ./test-fixtures/config.json
      data_dir:
        path: ./test-fixtures/data/
        $kind: DIRECTORY

    # --- Mocks ---
    mocks:
      - service: inventory
        operation: check_stock
        when: =params.sku == 'WG-1001'         # conditional mock (CEL guard)
        return: { in_stock: true, quantity: 50 }
      - ${{ mocks.payment_success }}           # reference shared mock

    # --- Act & Assert ---
    expect:
      output:                       # assert on flow output values
        shipping_cost: 5.99
        discount_applied: true
      vars:                         # assert on final variable state
        order_status: shipped
      globals:                      # assert on GLOBAL.* mutations
        request_count: 43
      context:                      # assert on CONTEXT.* mutations
        last_order: "ORD-2001"
      yields: []                    # assert yield sequence (empty = no yields)
      trace:                        # assert execution path (ordered step sequence)
        contains: [validate_order, charge_payment, reserve_inventory]
      steps:                        # step-level assertions (by _id_)
        charge_payment:
          called: true
          calledWith:
            amount: 150.00
          calledBefore: reserve_inventory
        send_notification:
          called: false
      log:                          # assert log output
        contains: ["Processing order ORD-2001"]
      resources:                    # assert resource access patterns
        config:
          accessed: true
      events:                       # assert emitted events
        count: 1
        messages:
          - event: order_processed
      mail:                         # assert mail sent
        sent: true
```

### Test case keys

| Key | Type | Required | Description |
|---|---|---|---|
| `name` | string | yes | Unique name within the suite |
| `_notes_` | string | no | Documentation for this test case |
| `tags` | string[] | no | Tags for selective execution (`--tag`) |
| `timeout` | duration | no | Override suite-level timeout |
| `input` | map | conditional | Flow input params (required unless `focus:` provides all deps) |
| `env` | map | no | `ENV.*` overrides (merged with suite-level) |
| `secrets` | map | no | `SECRET.*` overrides |
| `globals` | map | no | Pre-seeded `GLOBAL.*` values |
| `context` | map | no | Pre-seeded `CONTEXT.*` values |
| `vars` | map | no | Override flow's `vars:` initial values |
| `runtime` | map | no | Synthetic `RUNTIME.*` values: `{ OS: {NAME, VERSION, ARCH}, ENGINE: {NAME, VERSION, VENDOR}, PLATFORM: {NAME, VERSION} }` |
| `services` | map | no | Override flow-level `services:` definitions (merged with suite-level) |
| `resources` | map | no | Injected `RESOURCES.*` bindings (merged with suite-level) |
| `mocks` | list | no | Mock definitions for this test (merged with suite-level) |
| `expect` | map | no | Assertions: `output`, `vars`, `globals`, `context`, `yields`, `throws`, `trace`, `steps`, `log`, `mail`, `resources`, `events`, `snapshot`, `skipped` (§6, §20) |
| `focus` | string or string[] | no | Limit execution to `_id_`-tagged step(s) and their dependencies (§10) |
| `matrix` | list | no | Parameterized rows — generates one test per row (§11) |
| `migration` | map | no | Test `onVersionChange:` with synthetic checkpoint state (§12.1) |
| `replay` | string | no | Path to `.recording.yaml` (§4.9) or `.trace.yaml` (§12.2) — engine auto-detects format by root key |
| `breakpoints` | list | no | Mid-execution assertions at specific `_id_` steps (§14) |
| `faults` | map | no | Fault injection profile: probability, latency, network (§17) |
| `repeat` | integer | no | Run this test N times (for statistical/flaky testing) |
| `schedule` | map | no | Deterministic timing for parallel branches (§20) |
| `events` | list | no | Inject events at specific virtual times (§20) |
| `trigger` | map | no | Simulate trigger-initiated invocation: `{ event, data }` — mutually exclusive with `input:` (§20) |
| `idempotency` | map | no | Test `idempotencyKey:` deduplication: `preSeed` entries and `assertKey` (§20) |
| `skipInvariants` | integer[] | no | Indices of suite-level invariants to skip for this test |

### Test execution model

Each test case:

1. Starts a fresh, isolated flow instance (no shared state between tests)
2. Injects `env`, `secrets`, `globals`, `context`, `runtime` into the engine's scope
3. Overrides `vars:` initial values if specified (merged with flow defaults — test wins)
4. Registers all `mocks:`
5. Invokes the flow with `input:`
6. At each `breakpoints:` step (if any), pauses and evaluates intermediate assertions (§14)
7. Waits for completion (or timeout)
8. Evaluates all `expect:` assertions
9. Evaluates suite-level `invariants:` (§15)
10. Reports pass/fail with details
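
The execution model above can be traced through a single test case. The sketch below annotates each key with the step that consumes it; the test name and values are hypothetical and reuse names from the earlier example.

```yaml
tests:
  - name: walk-through of the execution model   # step 1: fresh, isolated instance
    env:
      SHIPPING_RATE: "5.99"                     # step 2: injected into the engine's scope
    vars:
      discount_rate: 0.15                       # step 3: overrides the flow's initial value
    mocks:                                      # step 4: registered before invocation
      - service: payment
        return: { tx_id: "TX-0001", status: success }
    input:                                      # step 5: the flow is invoked with these params
      priority: normal
    expect:                                     # step 8: evaluated after completion
      output:
        discount_applied: true
```

Suite-level `invariants:` are then evaluated (step 9) without being repeated in the test case.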

---

## 4. Mocks

Mocks intercept I/O at the flow boundary. Each mock type corresponds to one of the I/O boundaries listed in §1: service calls, sub-flow runs, HTTP requests, exec commands, mail, storage, SSH, and resource bindings.

### 4.1 Service mocks (`service:`)

Replace `call` step service invocations.

```yaml
mocks:
  # Throw an error instead of returning (narrowest guard first: the first match wins)
  - service: payment
    operation: charge
    when: =params.amount > 10000
    throw: { error: PaymentDeclinedError, message: "Amount exceeds limit" }

  # Conditional: match only when input matches
  - service: payment
    operation: charge
    when: =params.amount > 1000
    return: { tx_id: "TX-HIGH", status: review }

  # Operation-specific
  - service: payment
    operation: refund
    return: { refund_id: "RF-0001" }

  # Basic: match any remaining operation on the payment service (catch-all goes last)
  - service: payment
    return: { tx_id: "TX-0001", status: success }

  # Sequence: return different values on successive calls
  - service: inventory
    operation: check_stock
    sequence:
      - return: { in_stock: true, quantity: 10 }
      - return: { in_stock: true, quantity: 5 }
      - return: { in_stock: false, quantity: 0 }
      # After sequence exhausts: last entry repeats (sticky tail)

  # Delay: simulate latency (virtual time)
  - service: external_api
    delay: 2s
    return: { data: ok }

  # Dynamic return: CEL expression evaluated with mock input context
  - service: pricing
    operation: calculate
    return:
      total: =params.quantity * params.unit_price
      currency: =params.currency ?? 'USD'

  # Yield mock: simulate a streaming service
  - service: claude
    operation: complete
    yields:
      - { token: "Hello" }
      - { token: " world" }
      - { token: "!" }
    return: { result: "Hello world!" }

  # Spy: passthrough to real service but record calls (INTEGRATION mode only)
  - service: logger
    spy: true

  # Spy with assertion: passthrough but also verify params
  - service: audit
    spy: true
    operation: log_event
    # No return: — real service handles it. But step assertions can verify calledWith.
```

**Mock matching order:** Mocks are evaluated top-to-bottom. The first match wins. More specific mocks (with `operation:` and `when:`) should precede general ones.

**`operation:` matching:** Omitting `operation:` is equivalent to `operation: "*"` — matches any operation. The `"*"` wildcard matches the entire operation name; partial patterns (e.g., `"refund_*"`) are not supported. For pattern matching, use `when:` with a CEL expression: `when: =operation.startsWith('refund_')`. (In mock `when:` guards, `operation` is bound to the matched operation name and `params` is bound to the call's params map.)
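
A short sketch of the pattern-matching workaround described above, combined with the top-to-bottom matching order. The return values are hypothetical.

```yaml
mocks:
  # Guarded entry first: any operation whose name starts with "refund_"
  - service: payment
    when: =operation.startsWith('refund_')
    return: { refund_id: "RF-0001" }

  # Exact operation name
  - service: payment
    operation: charge
    return: { tx_id: "TX-0001", status: success }

  # Catch-all last: omitting operation: is the same as operation: "*"
  - service: payment
    return: { status: success }
```

Placing the catch-all first would shadow both entries above it, because the first matching mock wins.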

**Unmatched calls:** In `UNIT` mode, unmatched `call` steps throw `UnmockedServiceError`. In `INTEGRATION` mode, unmatched calls pass through to real services. Use `spy: true` to passthrough while still recording calls for step assertions.

**Spy mode:** `spy: true` passes the call through to the real service (only meaningful in `INTEGRATION` mode) while recording the invocation for step assertions (`calledWith`, `calledTimes`, `calledBefore`, etc.). In `UNIT` mode, `spy: true` is an error — there is no real service to pass through to (SA-TEST-31).

### 4.2 Sub-flow mocks (`flow:`)

Replace `run` step sub-flow invocations.

```yaml
mocks:
  # Mock with input matching (guarded entry first, since mocks are evaluated top-to-bottom)
  - flow: billing/charge-customer.flowmarkup.yaml
    when: =params.amount > 500
    throw: { error: PaymentDeclinedError, message: "Over limit" }

  # Mock a sub-flow by path (fallback for all other inputs)
  - flow: billing/charge-customer.flowmarkup.yaml
    return: { receipt: { id: "RCP-001", amount: 99.99 } }

  # Yield mock for streaming sub-flow
  - flow: ai/generate-tokens.flowmarkup.yaml
    yields:
      - "token1"
      - "token2"
    return: { full_text: "token1token2" }

  # Mock CURRENT (self-recursive flow) — limit recursion depth
  - flow: CURRENT
    when: =params.depth > 3
    return: { result: "base case" }

  # Spy: passthrough to real sub-flow but record invocations (INTEGRATION mode)
  - flow: billing/charge-customer.flowmarkup.yaml
    spy: true

  # Verify capability gating — mock asserts expected capabilities
  - flow: sensitive-operation.flowmarkup.yaml
    assertCapabilities:                          # verify caller restricts capabilities
      SERVICES: [payment, inventory]             # expected cap.SERVICES
      REQUEST: ["api.example.com"]                # expected cap.REQUEST
    return: { result: "processed" }
```

**Capability assertions:** `assertCapabilities:` on a sub-flow mock verifies that the `run` step's `cap:` results in the expected capability set. If the actual capabilities don't match, the test fails with `CapabilityMismatchError`. This tests that callers correctly restrict sub-flow permissions.

### 4.3 HTTP mocks (`request:`)

Replace `request` step HTTP calls.

```yaml
mocks:
  # Match by URL pattern
  - request:
      method: GET
      url: "https://api.example.com/users/*"
    respond:
      status: 200
      headers: { Content-Type: application/json }
      body: { users: [{ id: 1, name: "Alice" }] }

  # Match by method + host
  - request:
      method: POST
      host: api.stripe.com
    respond:
      status: 200
      body: { charge_id: "ch_test_123" }

  # Simulate HTTP error
  - request:
      method: GET
      url: "https://api.example.com/health"
    respond:
      status: 503
      body: { error: "Service unavailable" }

  # Match by path pattern + query params
  - request:
      method: GET
      path: "/api/users/*"
      query:
        page: "1"                                  # match specific query param value
        limit: "*"                                 # wildcard: any value
    respond:
      status: 200
      body: { users: [{ id: 1 }], total: 1 }

  # Match by headers
  - request:
      method: POST
      url: "https://api.example.com/data"
      headers:
        Authorization: "Bearer *"                  # wildcard pattern
        Content-Type: application/json
    respond:
      status: 200
      body: { accepted: true }

  # Match by request body (partial match)
  - request:
      method: POST
      url: "https://api.example.com/orders"
      body:
        customer_id: "C-001"                       # only match when body has this field
    respond:
      status: 201
      body: { order_id: "ORD-NEW-001" }

  # Dynamic response using CEL (request bindings available)
  - request:
      method: GET
      url: "https://api.example.com/echo/*"
    respond:
      status: 200
      body:
        echoed_path: =request.path
        echoed_method: =request.method

  # Sequence for pagination
  - request:
      method: GET
      url: "https://api.example.com/items*"
    sequence:
      - respond: { status: 200, body: { data: [1, 2, 3], next_cursor: "abc" } }
      - respond: { status: 200, body: { data: [4, 5], next_cursor: null } }

  # Spy: passthrough to real endpoint but record request/response (INTEGRATION mode)
  - request:
      method: "*"
      host: api.internal.example.com
    spy: true
```

**Request matching fields:** `method`, `url` (full URL pattern with `*` wildcards), `host`, `path` (path pattern), `query` (map of query param name→value, supports `*` wildcard), `headers` (map, supports `*` wildcard), `body` (partial match on request body). All fields are optional — omitted fields match anything.

**Dynamic responses:** In `respond.body:`, CEL expressions prefixed with `=` can reference `request.method`, `request.url`, `request.path`, `request.query`, `request.headers`, and `request.body`.
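
As a further sketch of dynamic responses, the mock below echoes a query parameter and the request path back in the body. It assumes `request.query` is a map keyed by parameter name, mirroring the `query` matching field. The response field names are hypothetical.

```yaml
mocks:
  - request:
      method: GET
      path: "/api/users/*"
      query:
        page: "*"                        # any page value
    respond:
      status: 200
      body:
        page: =request.query.page        # assumed: query params are exposed as a map
        requested_path: =request.path
```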

### 4.4 Exec mocks (`exec:`)

Replace `exec` step system command invocations.

```yaml
mocks:
  # Match by command
  - exec:
      command: curl
    return:
      stdout: '{"status": "ok"}'
      exitCode: 0

  # Match by command + args pattern
  - exec:
      command: docker
      args: ["build", "*"]
    return:
      stdout: "Successfully built abc123"
      exitCode: 0

  # Simulate command failure
  - exec:
      command: npm
      args: ["test"]
    return:
      stderr: "1 test failed"
      exitCode: 1

  # Match by environment variables
  - exec:
      command: deploy
      env:
        TARGET_ENV: staging                     # match only when env var is set
    return:
      stdout: "Deployed to staging"
      exitCode: 0

  # Match by stdin content
  - exec:
      command: jq
      stdin: "*"                                 # any stdin provided
    return:
      stdout: '{"filtered": true}'
      exitCode: 0

  # Dynamic return using CEL
  - exec:
      command: echo
    return:
      stdout: =args.join(' ')                    # echo the args back
      exitCode: 0

  # Spy: passthrough to real command but record invocation (INTEGRATION mode)
  - exec:
      command: git
    spy: true

  # Sequence for commands called multiple times
  - exec:
      command: terraform
      args: ["apply"]
    sequence:
      - return: { stdout: "Plan: 3 to add", exitCode: 0 }
      - return: { stdout: "Apply complete! Resources: 3 added", exitCode: 0 }
```

**Exec matching fields:** `command` (exact match, required), `args` (array pattern with `*` wildcards), `env` (map of env var name→value), `stdin` (string match or `*` wildcard). Omitted fields match anything.

### 4.5 Mail mocks (`mail:`)

Replace `mail` step email sends. Always mocked — the test runner never sends real email. All sent mail is captured automatically for `expect.mail` assertions (§6.8) — mail is implicitly a spy.

```yaml
mocks:
  # Default: all mail is captured (no explicit mock needed)
  # Access via expect.mail assertions

  # Throw on specific recipient (simulate SMTP failure)
  - mail:
      to: "invalid@bad-domain.example"
    throw: { error: MailError, message: "Domain not found" }

  # Match by subject pattern
  - mail:
      subject: "*password*"                      # wildcard matching
    throw: { error: MailError, message: "Security policy: password emails blocked" }

  # Match by multiple criteria
  - mail:
      to: "*@internal.example.com"
      cc: "*@compliance.example.com"
    throw: { error: MailError, message: "Internal routing error" }

  # Delay mail delivery (virtual time)
  - mail:
      to: "*@slow-domain.example"
    delay: 5s

  # Dynamic response: return custom message_id
  - mail:
      to: "*"
    return:
      message_id: "='MSG-' + to[0]"
      timestamp: "2026-03-14T10:00:00Z"
```

**Mail matching fields:** `to`, `cc`, `bcc` (string patterns with `*` wildcards — matched against any address in the list), `from`, `subject` (string pattern), `replyTo`. All fields are optional — omitted fields match anything.

**Implicit capture:** Even without explicit mail mocks, all mail is captured. Mock entries for mail are only needed to simulate errors, add delays, or customize the return value (`message_id`, `timestamp`).

### 4.6 Resource mocks (`resources:`)

Inject `RESOURCES.*` bindings for flows that declare `requires: { RESOURCES: ... }`. Resources are opaque handles — in tests, you provide synthetic file/directory handles with controlled content.

```yaml
# Suite-level resource injection (available to all tests)
resources:
  app_config:
    path: ./test-fixtures/app-config.json
    content: |                                # inline content (overrides path)
      {"feature_flags": {"dark_mode": true}, "version": "2.0"}
  templates:
    path: ./test-fixtures/templates/
    $kind: DIRECTORY
    files:                                    # synthetic directory contents
      - { name: "welcome.html", content: "<h1>Welcome</h1>" }
      - { name: "receipt.html", content: "<h1>Receipt</h1>" }

# Per-test resource overrides
tests:
  - name: handles missing config resource
    resources:
      app_config: null                        # resource declared but not bound (nullable)
    expect:
      throws: ResourceNotFoundError

  - name: reads config from resource
    resources:
      app_config:
        content: '{"feature_flags": {"dark_mode": false}}'
    expect:
      vars:
        dark_mode_enabled: false
```

**Resource handle properties:** In test-injected resources, the following properties are available via the `meta()` macro (matching production `Resource` handles):

| Property | Description |
|---|---|
| `meta(resource).value` | Lazily loaded content — MAP for JSON/YAML, TEXT for text files, BINARY for unknown |
| `meta(resource).name` | Filename or directory name |
| `meta(resource).size` | Content length in bytes (computed from `content:` or `path:` file) |

**`content:` vs `path:`:** When both are specified, `content:` takes precedence (the `path:` is ignored). When only `path:` is specified, the engine reads the file at test startup. When neither is specified, an empty resource is created.

**Null resources:** Setting a resource to `null` simulates a declared-but-unbound resource. Accessing properties on a null resource throws `ResourceNotFoundError`. This tests the flow's null-handling logic.

**UNIT mode:** Resource injection is required for all resources declared in `requires: { RESOURCES: ... }`. Accessing an uninjected resource throws `UnmockedResourceError` (analogous to `UnmockedServiceError`). Use `allowReal:` at suite level to exempt specific resources.

**INTEGRATION mode:** Uninjected resources resolve via the engine's normal resource binding. Injected resources override the engine binding.

### 4.7 Storage mocks (`storage:`)

Replace `storage` step file operations. In `UNIT` mode, unmatched storage operations throw `UnmockedStorageError` (analogous to `UnmockedServiceError`).

```yaml
mocks:
  # Match by url (alias) + operation
  - storage:
      url: s3_data                     # bare alias
      operation: get
      path: "config/settings.json"
    return:
      data: '{"region": "us-east-1", "feature_flags": {"dark_mode": true}}'
      name: settings.json
      size: 64
      type: application/json
      modified: "2026-03-14T10:00:00Z"

  # Match by url (alias) + operation (list)
  - storage:
      url: s3_data
      operation: list
      path: "data/*"
    return:
      entries:
        - { name: "file1.csv", path: "data/file1.csv", size: 1024, type: "file", modified: "2026-03-14T10:00:00Z" }
        - { name: "file2.csv", path: "data/file2.csv", size: 2048, type: "file", modified: "2026-03-14T11:00:00Z" }
      truncated: false
      cursor: null

  # Simulate storage failure
  - storage:
      url: backup_sftp
      operation: put
    throw: { error: StorageError, message: "Quota exceeded", data: { reason: "quota_exceeded" } }

  # Match by literal URL + put by path pattern
  - storage:
      url: "s3://archive-bucket"       # literal URL
      operation: put
      path: "uploads/*"
    return:
      path: ="uploads/" + MOCK.path_basename
      size: 1024

  # Spy: passthrough to real storage (INTEGRATION mode)
  - storage:
      url: s3_data
    spy: true
```

**Storage matching fields:** `url` (exact match on alias or resolved URL), `operation` (exact match), `path` (string pattern with `*` wildcards). Omitted fields match anything.
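
As an illustration of combining a `path` wildcard with `throw:`, the sketch below simulates a missing object and asserts that the error reaches the test. The `reports/*` path, the `not_found` reason, and the test name are hypothetical, and the sketch assumes the flow under test does not catch `StorageError` itself.

```yaml
mocks:
  # Any get under reports/ fails as if the object did not exist
  - storage:
      url: s3_data
      operation: get
      path: "reports/*"
    throw: { error: StorageError, message: "Object not found", data: { reason: "not_found" } }

tests:
  - name: surfaces storage failure for missing report
    expect:
      throws:
        error: StorageError
        data:
          reason: "not_found"           # partial match on ERROR.DATA (see §6.4)
```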

**Cache mock fields:** Storage mocks support a `cache:` field within `return:` to inject `RESULT.cache` metadata. This allows test flows to verify cache-aware logic without requiring a real cache implementation.

```yaml
mocks:
  # Mock a cache hit for get
  - storage:
      url: s3_data
      operation: get
      path: "config/settings.json"
    return:
      data: '{"region": "us-east-1"}'
      name: settings.json
      size: 26
      type: application/json
      modified: "2026-03-14T10:00:00Z"
      cache:
        hit: true
        stale: false
        age: 120
        source: CACHE
        revalidated: false
        negativeHit: false

  # Mock a stale-while-revalidate scenario
  - storage:
      url: s3_data
      operation: get
      path: "data/events.csv"
    return:
      data: "id,status\n1,active"
      name: events.csv
      size: 20
      type: text/csv
      modified: "2026-03-14T09:00:00Z"
      cache:
        hit: true
        stale: true
        age: 600
        source: STALE_REVALIDATING
        revalidated: false
        negativeHit: false

  # Mock a write with cache invalidation
  - storage:
      url: s3_data
      operation: put
      path: "config/settings.json"
    return:
      path: config/settings.json
      size: 64
      cache:
        invalidated: 3
```

**Cache fault injection:** The `cacheForce:` field is a sibling to `return:` and `throw:` on storage mocks. It overrides the engine's cache behavior for the matched operation, regardless of `cacheHint:` configuration:

```yaml
mocks:
  # Force a cache miss (always go to origin)
  - storage:
      url: s3_data
      operation: get
      path: "config/*"
    cacheForce: MISS

  # Force serving stale content
  - storage:
      url: s3_data
      operation: get
      path: "data/snapshot.json"
    cacheForce: STALE

  # Force a cache error (simulates cache infrastructure failure)
  - storage:
      url: s3_data
      operation: get
      path: "data/critical.json"
    cacheForce: ERROR
```

`cacheForce:` values: `MISS` (bypass cache, fetch from origin), `STALE` (serve stale entry if available, origin error if not), `ERROR` (simulate cache infrastructure failure — triggers `staleIfError` behavior if configured, otherwise proceeds to origin).

**UNIT mode:** Cache is always mocked — `RESULT.cache` is populated from mock `cache:` fields. When no `cache:` field is provided in the mock, `RESULT.cache` is `null` (consistent with "caching not supported" behavior). `cacheForce:` has no effect in UNIT mode since there is no real cache to override.

**INTEGRATION mode:** Real cache behavior applies. `cacheForce:` overrides the real cache for matched operations, enabling targeted cache fault injection during integration tests. Mock `cache:` fields in `return:` are ignored when real caching is active — the engine populates `RESULT.cache` from actual cache state.

### 4.8 SSH mocks (`ssh:`)

Replace `ssh` step remote command invocations. In `UNIT` mode, unmatched SSH invocations throw `UnmockedSshError` (analogous to `UnmockedServiceError`).

```yaml
mocks:
  # Match by host (alias) + command
  - ssh:
      host: prod_server                # bare alias
      command: df
    return:
      stdout: "Filesystem  Size  Used Avail Use%\n/dev/sda1   50G   20G   30G  40%"
      exitCode: 0

  # Match by host (alias) + command + args pattern
  - ssh:
      host: prod_server
      command: rsync
      args: ["*", "/backups/*"]
    return:
      stdout: "sending incremental file list\n\nsent 1024 bytes"
      exitCode: 0

  # Simulate command failure (literal hostname)
  - ssh:
      host: "staging.example.com"      # literal hostname
      command: deploy.sh
    return:
      stderr: "Error: deployment failed"
      exitCode: 1

  # Dynamic return
  - ssh:
      host: prod_server
      command: df
    return:
      stdout: ="Filesystem: " + args[0]
      exitCode: 0
```

**SSH matching fields:** `host` (exact match on alias or resolved hostname), `command` (exact match), `args` (array pattern with `*` wildcards), `env` (map of env var name→value). Omitted fields match anything.
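
The `env` matcher is not shown in the SSH examples above. A minimal sketch, assuming it behaves like the `env` matcher on exec mocks (§4.4); the `TARGET_ENV` variable and the output text are hypothetical:

```yaml
mocks:
  # Match only when the remote command is invoked with TARGET_ENV=staging
  - ssh:
      host: prod_server
      command: deploy.sh
      env:
        TARGET_ENV: staging
    return:
      stdout: "Deployed to staging"
      exitCode: 0
```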

### 4.9 Record-and-replay mocks

Run a flow once against real services, capture all action invocations and responses, then replay them in subsequent test runs. This is the workflow equivalent of VCR/Betamax/nock.

**Recording:**

```bash
# Run flow against real services, capture all I/O
flowmarkup test --record order-processing.flowmarkup-test.yaml

# Recordings saved to:
#   recordings/order-processing/approves-critical-priority-order.recording.yaml
```

The recording file captures every action invocation:

```yaml
# AUTO-GENERATED — do not edit
recording:
  flow: ./order-processing.flowmarkup.yaml
  test: approves critical priority order
  recorded: 2026-03-11T14:30:00Z
  interactions:
    - step: charge_payment
      action: call
      service: payment
      operation: charge
      params: { order_id: "ORD-1001", amount: 150.00 }
      result: { tx_id: "TX-8834", status: success }
      duration: 234ms
    - step: reserve_inventory
      action: call
      service: inventory
      params: { order_id: "ORD-1001", items: [...] }
      result: { reservation_id: "INV-4421" }
      duration: 89ms
    - step: fetch_shipping_rate
      action: request
      method: GET
      url: "https://api.shipping.com/rates?zip=94105"
      respond:
        status: 200
        headers: { Content-Type: application/json }
        body: { rate: 5.99, carrier: "USPS" }
      duration: 156ms
```

**Replaying:**

```yaml
tests:
  - name: approves critical priority order
    replay: ./recordings/order-processing/approves-critical-priority-order.recording.yaml
    input:
      order: ${{ fixtures.standard_order }}
      priority: critical
    expect:
      output:
        result: { outcome: approved }
```

When `replay:` is set, the recording file supplies mock responses. The test runner matches each action invocation against the recording by step `_id_` (or positional order) and replays the captured response. If the flow makes an action call not present in the recording, the test fails with `RecordingMismatchError`.
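
A replayed test can also assert on the captured interactions. The sketch below assumes that `expect.steps:` (§6.6) and `expect.trace:` (§6.5) assertions work against replayed interactions the same way they do against mocks, which this section does not state explicitly. The step names and the `tx_id` value are taken from the recording shown above.

```yaml
tests:
  - name: approves critical priority order
    replay: ./recordings/order-processing/approves-critical-priority-order.recording.yaml
    input:
      order: ${{ fixtures.standard_order }}
      priority: critical
    expect:
      steps:
        charge_payment:
          called: true
          returned:
            tx_id: "TX-8834"            # value captured in the recording
      trace:
        # ordered subsequence: the three recorded steps ran in this order
        contains: [charge_payment, reserve_inventory, fetch_shipping_rate]
```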

**Re-recording:** Run `flowmarkup test --record` periodically (or in CI on a schedule) to detect service contract drift. If a re-recording produces different response shapes, downstream tests using `replay:` will fail — signaling a breaking change.

> **Secret redaction in recordings.** The recording mechanism MUST apply secret redaction to all captured data (params, results, headers, bodies) before writing to disk. Recording files MUST be written with restricted file permissions (owner-only read/write). The engine MUST refuse `--record` mode for flows that declare `requires: { SECRET: [...] }` unless the operator explicitly enables it via the `--record-with-secrets` flag. When `--record-with-secrets` is used, the engine MUST emit a CRITICAL-level warning. *(CWE-532)*

### 4.10 Stateful mocks

For services that maintain state across multiple calls within a single flow execution, use `scenario:` to define state-machine behavior (inspired by WireMock's scenarios):

```yaml
mocks:
  - service: order_tracker
    scenario: order_lifecycle
    states:
      - state: STARTED                          # initial state
        operation: get_status
        return: { status: pending }
        nextState: PROCESSING

      - state: PROCESSING
        operation: get_status
        return: { status: processing }
        nextState: SHIPPED

      - state: SHIPPED
        operation: get_status
        return: { status: shipped }
        # terminal — stays in SHIPPED
```

The mock transitions through states as it is called. Each call matches the current state's entry, returns the defined response, and advances to `nextState`. If `nextState` is omitted, the mock stays in the current state.

**Failure behavior:** When a stateful mock receives a call that doesn't match any transition from the current state (wrong `operation` or no matching entry), the engine throws `UnmockedServiceError` — the same behavior as unmatched non-stateful mocks. Reaching a terminal state is not an error; subsequent calls are matched against the terminal state's transitions, or fall through to `UnmockedServiceError` if none match.
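
To make the failure behavior concrete, here is a sketch of a scenario with more than one operation. The `cancel` operation and the `CANCELLED` state are hypothetical, and the sketch assumes that several entries may share the same `state:` value.

```yaml
mocks:
  - service: order_tracker
    scenario: order_lifecycle
    states:
      - state: STARTED                          # initial state
        operation: get_status
        return: { status: pending }
        nextState: PROCESSING

      - state: PROCESSING
        operation: get_status
        return: { status: processing }
        # nextState omitted: stays in PROCESSING

      - state: PROCESSING
        operation: cancel
        return: { cancelled: true }
        nextState: CANCELLED

      - state: CANCELLED
        operation: get_status
        return: { status: cancelled }
        # terminal state
```

With this scenario, a `cancel` call made while the mock is still in `STARTED` has no matching entry and throws `UnmockedServiceError`. The same holds for a `cancel` call made after the mock has reached `CANCELLED`.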

### 4.11 Spy mode

Spy mode (`spy: true`) is available on all mock types except mail (which is always implicitly a spy) and resources (which are not spy-able; see below). Spy mode passes the call through to the real implementation while recording the invocation for step assertions.

| Mock type | `spy: true` behavior | Available in |
|---|---|---|
| `service` | Calls real service, records params/result | INTEGRATION only |
| `flow` | Runs real sub-flow, records params/result/yields | INTEGRATION only |
| `request` | Makes real HTTP call, records request/response | INTEGRATION only |
| `exec` | Runs real command, records command/args/output | INTEGRATION only |
| `storage` | Passes through to real storage, records the invocation | INTEGRATION only |
| `mail` | N/A — always captured (implicit spy) | Both modes |

**SA-TEST-31:** `spy: true` in `mode: UNIT` is an error — there is no real service to pass through to.

**Resources are not spy-able.** Resource mocks (§4.6) do not support `spy: true`. Resources are filesystem handles with static content — there is no "real invocation" to pass through and record. Resource access is tracked automatically via `expect.resources:` assertions (§6.9).

**Spy with conditional matching:** A spy mock can include `operation:`, `when:`, or other matching criteria. Only calls that match the spy's criteria are recorded as spy interactions. This enables selective observation:

```yaml
mocks:
  # Spy only on payment charge operations (ignore refund, verify, etc.)
  - service: payment
    operation: charge
    spy: true
```

**Spy with assertions:** Spy-recorded calls are available in `expect.steps:` assertions just like mocked calls. The `returned:` assertion reflects the real service's response.
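
A minimal sketch of a spy combined with step assertions. It assumes the suite runs in `mode: INTEGRATION` (required by SA-TEST-31) and that the flow tags its payment call with a hypothetical `_id_` of `charge_payment`; the test name and amount are illustrative.

```yaml
mocks:
  # Real payment service is called; the invocation is recorded
  - service: payment
    operation: charge
    spy: true

tests:
  - name: charges the real payment service once
    expect:
      steps:
        charge_payment:
          called: true
          calledTimes: 1
          calledWith:
            amount: 150.00
          returned: =has(value.tx_id)   # real response, so assert on shape rather than a fixed value
```

Because the response comes from the real service, a CEL shape check in `returned:` tends to be more stable than a literal match.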

### 4.12 Dynamic return CEL bindings

When a mock's `return:` (or `respond.body:`) contains CEL expressions (prefixed with `=`), the expression is evaluated with bindings specific to the mock type:

| Mock type | Available CEL bindings |
|---|---|
| `service` | `service` (alias name), `operation` (operation name), `params` (input params map) |
| `flow` | `flow` (flow path), `params` (input params map) |
| `request` | `request.method`, `request.url`, `request.path`, `request.query`, `request.headers`, `request.body` |
| `exec` | `command` (command string), `args` (args list), `env` (env map), `stdin` (stdin string or null) |
| `mail` | `to` (list), `cc` (list), `bcc` (list), `from` (string), `subject` (string) |

These bindings are also available in mock `when:` guards.
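
As a sketch of how these bindings might be used on a service mock, the example below pairs a `when:` guard with a dynamic `return:`. It assumes `when:` takes an `=`-prefixed CEL expression like other CEL-valued fields, and that mocks are tried in declaration order. The `amount` and `order_id` params and the error details are hypothetical.

```yaml
mocks:
  # Guarded mock: only charges above the limit are declined
  - service: payment
    operation: charge
    when: =params.amount > 10000
    throw: { error: PaymentDeclinedError, message: "Amount exceeds limit" }

  # Fallback mock: echo part of the input back in the result
  - service: payment
    operation: charge
    return:
      tx_id: "='TX-' + params.order_id"
      status: success
```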

### 4.13 `allowReal:` (selective passthrough in UNIT mode)

By default, `mode: UNIT` requires all I/O boundaries to be mocked. `allowReal:` exempts specific services, resources, commands, or HTTP hosts from this requirement, allowing them to pass through to real implementations without triggering an unmocked-call error such as `UnmockedServiceError`.

```yaml
flowmarkup-test:
  flow: ./my-flow.flowmarkup.yaml
  mode: UNIT
  allowReal:
    services: [logger, metrics]           # these services may call real implementations
    resources: [static_config]            # these resources resolve via engine binding
```

**`allowReal:` structure:**

| Key | Type | Description |
|---|---|---|
| `services` | string[] | Service aliases that may pass through unmocked |
| `resources` | string[] | Resource names that may resolve via engine binding |
| `exec` | string[] | Command names that may execute unmocked |
| `request` | string[] | Host patterns to which real HTTP calls are allowed |

**Use case:** Lightweight services (logging, metrics, caching) that don't affect flow correctness. By allowing them to pass through, you avoid creating noise mocks while keeping the test otherwise hermetic.

**Shorthand:** When `allowReal:` is a plain string array (not an object), it applies to services only:

```yaml
allowReal: [logger, metrics]
# equivalent to: allowReal: { services: [logger, metrics] }
```

> Static analysis rule SA-TEST-49 (ERROR) MUST flag `allowReal:` patterns containing wildcards (`*`) in `request:` URLs or security-sensitive commands in `exec:` (using the same interpreter denylist as the production `exec` directive). `allowReal: { exec: [...] }` in UNIT mode MUST be restricted to an engine-configured allowlist. `allowReal:` entries remain subject to capability enforcement — a test cannot `allowReal` an action the flow does not declare in `requires:`.
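
A sketch of an `allowReal:` block that stays within the SA-TEST-49 constraints by naming an exact host rather than a wildcard. The host name is hypothetical, and the sketch assumes the flow declares the corresponding capabilities in `requires:`.

```yaml
flowmarkup-test:
  flow: ./my-flow.flowmarkup.yaml
  mode: UNIT
  allowReal:
    services: [logger]
    request: [metrics.internal.example.com]   # exact host, no `*` wildcard
```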

**Interaction with spy mode.** `allowReal:` does not override the spy mode restriction (SA-TEST-31). `spy: true` remains an error in `mode: UNIT` even for services listed in `allowReal:`. `allowReal:` exempts services from the mocking requirement; it does not upgrade the test mode's observation capabilities. To observe real service calls, use `mode: INTEGRATION` with `spy: true`.

---

## 5. Fixtures

Fixtures are named data blocks defined at the suite level and referenced in test cases via `${{ fixtures.<name> }}`. Except for fixture factories (`$generate`, described below), they are pure data with no CEL evaluation: YAML values injected verbatim.

### Inline fixtures

```yaml
fixtures:
  standard_order:
    id: "ORD-1001"
    status: pending
    items:
      - { sku: "WG-1001", quantity: 2, price: 50.00 }

  premium_customer:
    name: Alice
    email: alice@example.com
    tier: premium
```

### Fixture composition

Fixtures can reference other fixtures:

```yaml
fixtures:
  base_order:
    id: "ORD-0001"
    status: pending
    items: []

  order_with_items:
    $merge: ${{ fixtures.base_order }}
    items:
      - { sku: "WG-1001", quantity: 1, price: 100.00 }

  big_order:
    $merge: ${{ fixtures.order_with_items }}
    id: "ORD-9999"
    items:
      - { sku: "WG-1001", quantity: 100, price: 100.00 }
```

`$merge` performs a shallow merge — the referencing fixture's top-level keys override the base entirely. For recursive merging of nested maps, use `$deepMerge`:

```yaml
fixtures:
  base_config:
    settings:
      timeout: 30
      retries: 3
      logging: { level: INFO, format: json }
    tags: [production]

  debug_config:
    $deepMerge: ${{ fixtures.base_config }}
    settings:
      logging: { level: DEBUG }           # merges into settings.logging (keeps format: json)
    tags: [debug]                          # replaces tags (arrays are replaced, not concatenated)
```

Result of `${{ fixtures.debug_config }}`: `{ settings: { timeout: 30, retries: 3, logging: { level: DEBUG, format: json } }, tags: [debug] }`.

**`$deepMerge` semantics:** Maps merge recursively (nested keys are merged, not replaced). Arrays and scalars are replaced entirely (the overriding value wins). `null` values in the override remove the key from the base.
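
The difference between the two directives, and the effect of a `null` override, can be sketched against the same base fixture. The fixture names `shallow_debug` and `no_retries` are hypothetical; the resolved values in the comments follow from the semantics stated above.

```yaml
fixtures:
  base_config:
    settings:
      timeout: 30
      retries: 3
      logging: { level: INFO, format: json }
    tags: [production]

  # Shallow merge: the top-level `settings` key replaces the base's `settings` entirely
  shallow_debug:
    $merge: ${{ fixtures.base_config }}
    settings:
      logging: { level: DEBUG }
  # resolves to: { settings: { logging: { level: DEBUG } }, tags: [production] }

  # Deep merge with null: removes only settings.retries from the base
  no_retries:
    $deepMerge: ${{ fixtures.base_config }}
    settings:
      retries: null
  # resolves to: { settings: { timeout: 30, logging: { level: INFO, format: json } }, tags: [production] }
```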

> **Security advisory:** `$deepMerge` with `null` values can remove keys from the base fixture, including security-critical fields (e.g., capability restrictions, authorization headers). Test authors should review `$deepMerge` overrides that set capability- or authorization-related keys to `null` to ensure this is intentional.
>
> Static analysis rule SA-TEST-47 (ERROR) MUST flag `$deepMerge` overrides that set security-critical keys (`cap`, `requires`, `integrity`, `auth`, `$secret`, `tls`) to `null`. For other keys set to `null`, SA-TEST-47 is reported at WARN severity.

### External fixture files

```yaml
fixtures:
  $include: ./fixtures/order-fixtures.yaml
```

The included file is a plain YAML map of fixture names to values. The test runner MUST resolve `$include` paths relative to the test file's directory and MUST reject paths containing `..` segments. This prevents path traversal (CWE-22) and aligns with ENGINE.md §5.4 item 25 (storage path validation).

> **$include path traversal prevention (strengthened).** The `$include` directive MUST reject paths containing:
> - `..` segments (parent directory traversal)
> - Absolute paths (starting with `/` or drive letter on Windows)
> - URL schemes (`http://`, `https://`, `file://`, etc.)
> - Null bytes (`\0`)
> - Backslash path separators on all platforms (normalize to forward slash)
> - Symbolic link targets that resolve outside the test suite root directory
> - Path components starting with `.` (hidden files/directories), unless explicitly allowed by engine configuration
>
> The engine MUST resolve the included path relative to the test file's directory and MUST verify that the resolved absolute path is a descendant of the test suite root directory. Resolution MUST use canonical path resolution (resolving all symlinks) before the descendant check. The engine MUST use `O_NOFOLLOW` (or platform equivalent) when opening `$include` target files and MUST perform the canonical path check on the file descriptor obtained from the open operation, not on the path string. This eliminates the TOCTOU race between path resolution and file read. If `O_NOFOLLOW` is not available on the target platform, the engine MUST reject `$include` paths that traverse symbolic links (detected via `lstat()` on each path component). *(CWE-367)* *(CWE-22: Improper Limitation of a Pathname to a Restricted Directory)*

> **$include path traversal enforcement.** The test runner MUST reject `$include` paths containing `..` segments, absolute paths, symbolic link targets outside the test root, and URL schemes other than relative file paths. The resolved inclusion path MUST be canonicalized and verified to be within the test suite's root directory. Path traversal attempts MUST cause an immediate test failure with a security error (category: `PathTraversalError`), not a silent skip. The error message MUST include the offending path (with any secret components redacted) and the test suite root boundary that was violated. Engines MUST perform this validation before any file I/O occurs on the resolved path to prevent TOCTOU race conditions. *(CWE-22: Improper Limitation of a Pathname to a Restricted Directory)*
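
For illustration, an included file might look like the following. The contents are hypothetical; the only stated requirement is that the file is a plain YAML map of fixture names to values.

```yaml
# ./fixtures/order-fixtures.yaml
standard_order:
  id: "ORD-1001"
  status: pending
premium_customer:
  name: Alice
  tier: premium
```

The sketch below contrasts an accepted `$include` path with forms that the rules above require the engine to reject. The rejected paths are illustrative examples, not an exhaustive list.

```yaml
fixtures:
  $include: ./fixtures/order-fixtures.yaml        # relative path below the test file's directory

# Rejected by the rules above:
#   $include: ../shared/fixtures.yaml             # contains a `..` segment
#   $include: /etc/fixtures.yaml                  # absolute path
#   $include: https://example.com/fixtures.yaml   # URL scheme
#   $include: ./.hidden/fixtures.yaml             # hidden path component (unless engine config allows it)
```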

### Fixture factories

For fixtures that need dynamic variation, use a `$generate` block with CEL:

```yaml
fixtures:
  order_factory:
    $generate:
      id: "='ORD-' + (1000 + INDEX)"                # auto-stringification: produces "ORD-1003" etc.
      status: pending
      total: =50.0 * (INDEX + 1)
      items:
        - { sku: "='SKU-' + INDEX", quantity: =INDEX + 1, price: 50.00 }    # auto-stringification
```

Inside `$generate:`, the special binding `INDEX` is an integer (0-based) representing the requested factory index. Reference with an index: `${{ fixtures.order_factory[3] }}` generates an order with `INDEX=3`.
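
As a worked example, the sketch below shows a test referencing the factory with index 3, followed by the value that the `order_factory` definition above would produce for `INDEX=3`. The test name and the `order` input key are hypothetical.

```yaml
tests:
  - name: processes a generated order
    input:
      order: ${{ fixtures.order_factory[3] }}

# Value generated for INDEX=3:
#   id: "ORD-1003"                                    # 'ORD-' + (1000 + 3)
#   status: pending
#   total: 200.0                                      # 50.0 * (3 + 1)
#   items:
#     - { sku: "SKU-3", quantity: 4, price: 50.00 }   # quantity = 3 + 1
```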

---

## 6. Assertions

All assertions live under the `expect:` block. Every assertion failure reports the test name, the assertion path, the expected value, and the actual value.

### 6.1 Output assertions

```yaml
expect:
  output:
    # Exact match
    result: { outcome: approved, notes: "Auto-approved" }

    # CEL expression assertion (prefix with =)
    total: =value > 100 && value <= 200

    # Schema assertion — value conforms to a declared types: schema
    receipt: =matchesSchema('ProcessingResult')

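---
# Alternative form: nested field match (partial; unmentioned fields are ignored).
# Shown as a separate document because a YAML map cannot repeat the `result:` key.
expect:
  output: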
    result:
      outcome: approved
      # notes: not asserted
```

**Matching semantics:** Literal values use deep equality. CEL expressions (prefixed with `=`) are evaluated with `value` bound to the actual value. Objects match structurally — only declared fields are checked (partial match by default). Use `$exact: true` to require exact match with no extra fields.

When `$exact: true` is present on a map assertion, the assertion fails if the actual value contains any keys not listed in the expected map. Without `$exact`, extra keys are ignored (subset matching).

```yaml
expect:
  output:
    $exact: true
    order_id: 123
    status: completed
    # Fails if output contains any keys beyond order_id and status
```

### 6.2 Data element assertions

> **`$readonly` assertions:** Tests should verify: (1) write attempts to `$readonly` data elements are rejected (SA-CONST-1), (2) copying a readonly value does not propagate `$readonly` to the target, (3) `meta(var).readonly` returns the correct boolean.

```yaml
expect:
  vars:
    order_status: completed
    item_count: =value >= 3
  globals:
    request_count: =value == old + 1    # 'old' is the pre-test value
  context:
    correlation_id: test-corr-001       # unchanged
```

**The `old` binding:** In `globals:` and `context:` assertions, `old` is bound to the variable's value at the start of the test (after `beforeEach`, before flow execution). This enables assertions like "incremented by 1" without hardcoding absolute values. `old` is not available in `vars:` assertions (flow-local variables have no pre-existing state) or `output:` assertions.

### 6.3 Yield assertions

```yaml
expect:
  yields:
    # Exact sequence
    - { token: "Hello" }
    - { token: " world" }

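---
# Or use the map form instead of a literal sequence
# (a YAML node cannot be both a list and a map)
expect:
  yields:
    # CEL for the whole sequence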
    $expr: =size(yields) > 0 && yields.last().token == '!'

    # Or just assert count
    $count: 3

    # Or assert each element matches a pattern
    $each: =has(value.token) && size(value.token) > 0

    # Or assert specific positions
    $first: { token: "Hello" }
    $last: { token: "!" }
```

### 6.4 Error assertions

```yaml
expect:
  throws: PaymentDeclinedError                    # just the type

---
# Or with details (separate document: `throws:` cannot appear twice in one map)
expect:
  throws:
    error: PaymentDeclinedError
    message: =value.contains('limit')             # CEL on message
    data:                                          # assert on ERROR.DATA.*
      amount: 10001
    cause:                                         # assert on ERROR.CAUSE (for wrapped errors)
      error: OriginalError                         # ERROR.CAUSE.TYPE
      message: =value.contains('root cause')       # ERROR.CAUSE.MESSAGE
```

**`throws:` assertion fields:**

| Field | Maps to | Description |
|---|---|---|
| `error` | `ERROR.TYPE` | Exact error type name |
| `message` | `ERROR.MESSAGE` | CEL assertion on the message string |
| `data` | `ERROR.DATA.*` | Partial match on structured error data |
| `cause` | `ERROR.CAUSE` | Nested assertion on the wrapped/chained error (recursive — `cause` can itself have `error`, `message`, `data`, `cause`) |
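
Because `cause` is recursive, a chain of wrapped errors can be asserted level by level. A minimal sketch with hypothetical error types:

```yaml
expect:
  throws:
    error: OrderFailedError                        # outermost error (ERROR.TYPE)
    cause:
      error: PaymentDeclinedError                  # ERROR.CAUSE.TYPE
      cause:
        error: TimeoutError                        # ERROR.CAUSE.CAUSE.TYPE
        message: =value.contains('timed out')
```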

### 6.5 Execution trace assertions

Assert on the **ordered sequence** of steps that actually executed. This is the workflow equivalent of Camunda's `hasCompletedElementsInOrder(...)` — a key testing capability for verifying step execution order.

Steps are identified by `_id_`. Only steps with `_id_` appear in the trace.

```yaml
expect:
  trace:
    # Exact sequence — every _id_-tagged step, in order
    exact: [validate_order, charge_payment, reserve_inventory, send_confirmation]

    # Subsequence — these steps occurred in this order (other steps may appear between)
    contains: [charge_payment, reserve_inventory]

    # Exclusion — these steps did NOT execute
    excludes: [send_refund, notify_fraud]

    # Starts/ends with
    startsWith: [validate_order]
    endsWith: [send_confirmation]

    # CEL expression on the full trace list
    $expr: =size(trace) >= 3 && trace.first() == 'validate_order'
```

**Parallel branches:** Steps in parallel branches appear in the trace in completion order. Use `schedule:` (§20) for deterministic ordering. `contains:` asserts an ordered subsequence (the listed steps appeared in that order, with other steps possibly between them) — this is robust for parallel timing because it only constrains relative order, not absolute position.
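
As a sketch, suppose `charge_payment` and `reserve_inventory` run in parallel between `validate_order` and `send_confirmation` (a hypothetical flow shape). The assertions below hold regardless of which parallel step completes first:

```yaml
expect:
  trace:
    startsWith: [validate_order]
    endsWith: [send_confirmation]
    # Ordered subsequence that names only one of the two parallel steps
    contains: [validate_order, charge_payment, send_confirmation]
    excludes: [send_refund]
    # Both parallel steps ran, with no constraint on their relative order
    $expr: ='charge_payment' in trace && 'reserve_inventory' in trace
```

Listing both parallel steps in a single `contains:` list would pin their relative order and make the test timing-dependent unless `schedule:` is used.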

### 6.6 Step assertions

Assert on step execution by `_id_`:

```yaml
expect:
  steps:
    charge_payment:
      called: true                      # step was executed
      calledTimes: 1                    # exact invocation count (including retries)
      calledWith:                       # input params matched
        amount: 150.00
        currency: USD
      returned:                         # output values
        tx_id: "TX-0001"
      duration: =value < duration('5s') # completed in under 5s (virtual time)

      # Ordering assertions — relative to other _id_-tagged steps
      calledBefore: reserve_inventory       # this step completed before that step started
      calledAfter: validate_order           # this step started after that step completed

      # Catch inspection — what error was caught by this step's catch block
      caught:
        error: TimeoutError
        message: =value.contains('timed out')     # CEL on caught error message
        data:                                      # partial match on ERROR.DATA.*
          endpoint: "api.example.com"

    send_notification:
      called: false                     # step was skipped (condition was false)

    retry_loop:
      calledTimes: 3                    # invoked 3 times in total (retries included)

    saga_group:
      # Rollback assertions
      rolledBack: true                  # transaction group rolled back
      rollbackSteps: [cancel_payment, release_inventory]   # rollback handlers that fired

    flaky_service_call:
      # Retry assertions — inspect retry behavior
      retryAttempts: 2                  # number of retry attempts (not counting the initial call)
      retryErrors:                      # errors that triggered each retry
        - { error: TimeoutError }
        - { error: ConnectionError }
      retryDurations:                   # delay between retries (virtual time)
        $each: =value >= duration('1s')

    rate_limited_call:
      # Rate limit assertions
      rateLimited: true                 # step was subject to rate limiting
      rateLimitWaited: =value > duration('0s')  # time spent waiting for rate limit token

    circuit_broken_call:
      # Circuit breaker assertions
      circuitBreakerState: CLOSED       # breaker state when step was invoked
      circuitBreakerTripped: true       # this step's failure caused the breaker to trip

    timeout_call:
      # Timeout assertions
      timedOut: true                    # step was terminated by its timeout
      onTimeoutExecuted: true           # the onTimeout handler ran
```

**Step assertion keys (complete reference):**

| Key | Type | Description |
|---|---|---|
| `called` | boolean | Step was executed |
| `calledTimes` | integer/CEL | Exact invocation count (including retries) |
| `calledWith` | map | Input params (partial match) |
| `returned` | map/CEL | Output values |
| `duration` | CEL | Total duration (virtual time) |
| `calledBefore` | string | This step completed before the named step started |
| `calledAfter` | string | This step started after the named step completed |
| `caught` | map | Error caught by this step's catch block (supports `error`, `message`, `data`, `cause` — same structure as `throws:` in §6.4) |
| `rolledBack` | boolean | Transaction group was rolled back |
| `rollbackSteps` | string[] | Rollback handlers that fired (in execution order) |
| `retryAttempts` | integer/CEL | Number of retry attempts (excludes initial call) |
| `retryErrors` | list | Errors that triggered each retry (ordered) |
| `retryDurations` | list/CEL | Delay between retries |
| `rateLimited` | boolean | Step was subject to rate limiting |
| `rateLimitWaited` | CEL | Time spent waiting for rate limit token |
| `circuitBreakerState` | string | Breaker state at invocation time: `CLOSED`, `OPEN`, `HALF_OPEN` |
| `circuitBreakerTripped` | boolean | This step's failure caused the breaker to trip OPEN |
| `timedOut` | boolean | Step was terminated by its timeout |
| `onTimeoutExecuted` | boolean | The `onTimeout` handler ran |
| `skippedReason` | string | Why step was skipped: `CONDITION_FALSE`, `CIRCUIT_OPEN`, `RATE_LIMITED`, `CANCELLED` (parallel branch cancelled by `failPolicy: FAST`) |

**Ordering semantics:** `calledBefore: X` means this step's last side-effect committed before step X's first side-effect began. In parallel groups, ordering is only assertable between steps in different branches if `schedule:` provides deterministic timing.
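
One way to read the ordering rule is as a comparison of side-effect intervals. The sketch below assumes each step is summarised by the virtual times of its first and last side effect; `StepSpan`, `called_before`, and `called_after` are hypothetical names, and the strict `<` comparison is an assumption.

```python
from typing import NamedTuple


class StepSpan(NamedTuple):
    first_effect_ms: int  # virtual time of the step's first side effect
    last_effect_ms: int   # virtual time of the step's last committed side effect


def called_before(this: StepSpan, other: StepSpan) -> bool:
    """`this` finished all of its side effects before `other` began any."""
    return this.last_effect_ms < other.first_effect_ms


def called_after(this: StepSpan, other: StepSpan) -> bool:
    """Mirror of called_before: `this` began only after `other` had finished."""
    return other.last_effect_ms < this.first_effect_ms


validate = StepSpan(0, 2)
charge = StepSpan(5, 236)
reserve = StepSpan(100, 300)  # overlaps charge, e.g. a parallel branch

print(called_before(validate, charge))  # True
print(called_before(charge, reserve))   # False: the spans overlap
print(called_after(reserve, charge))    # False: neither order holds
```

Overlapping spans satisfy neither assertion, which is why steps in different parallel branches need deterministic timing before their order can be asserted.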

### 6.7 Log assertions

```yaml
expect:
  log:
    # Contains substring (any log entry)
    contains: ["Processing order", "complete"]

    # Exact entry match
    entries:
      - level: WARN
        message: =value.contains('critical')
      - level: INFO

    # No errors logged
    not:
      levels: [ERROR]
```

### 6.8 Mail assertions

```yaml
expect:
  mail:
    sent: true
    count: 1
    messages:
      - to: ["customer@example.com"]
        subject: "Order Confirmed"
        body:
          contains: "ORD-1001"
        cc: []                                    # no CC recipients
        from: "noreply@example.com"
        replyTo: "support@example.com"
        attachments:                              # one attachment
          - $name: "receipt.pdf"
            $type: =value.endsWith('/pdf')
        headers:
          X-Priority: "1"
    not:                                          # negative assertions
      messages:
        - to: ["admin@example.com"]              # admin should NOT receive mail
```

### 6.9 Resource assertions

Assert on resource access patterns during flow execution:

```yaml
expect:
  resources:
    config:
      accessed: true                              # resource was read
      accessCount: 1                              # accessed exactly once
    templates:
      accessed: true
      filesAccessed: ["welcome.html"]             # specific files accessed in directory resource
    missing_resource:
      accessed: false                             # resource was never accessed
```

**Resource assertion keys:**

| Key | Type | Description |
|---|---|---|
| `accessed` | boolean | Resource was accessed during flow execution |
| `accessCount` | integer/CEL | Number of times the resource was accessed |
| `filesAccessed` | string[] | Files accessed within a directory resource (unordered — use `isSupersetOf()` for robust assertions) |

### 6.10 Event assertions

Assert on events emitted by the flow during execution via `emit` directives. All emitted events are captured automatically by the test runner (analogous to mail capture).

```yaml
expect:
  events:
    emitted: true                                   # at least one event was emitted
    count: 3                                        # exactly 3 events emitted

    # Assert on specific emitted events (ordered by emission time)
    messages:
      - event: order_confirmed
        data:
          order_id: "ORD-1001"
        scope: LOCAL                                # LOCAL (default), CONTEXT, or GLOBAL
      - event: notification_sent
        data:
          recipient: =value.contains('@')

    # CEL expression on the full list
    $expr: =events.filter(e, e.event == 'order_confirmed').size() == 1

    # Negative assertion
    not:
      messages:
        - event: error_reported                     # this event should NOT have been emitted
```

**Event assertion keys:**

| Key | Type | Description |
|---|---|---|
| `emitted` | boolean | At least one event was emitted |
| `count` | integer/CEL | Total number of events emitted |
| `messages` | list | Ordered list of emitted event assertions (partial match) |
| `messages[].event` | string | Event type name |
| `messages[].data` | map/CEL | Partial match on event data payload |
| `messages[].scope` | string | Emission scope: `LOCAL`, `CONTEXT`, `GLOBAL` |
| `$expr` | CEL | CEL expression on the full `events` list |
| `not.messages` | list | Events that must NOT have been emitted |

**The `events` CEL binding:** In `$expr`, the `events` binding is a list of `{ event: string, data: map, scope: string, timestamp: duration }` objects, ordered by emission time.

### 6.11 Snapshot assertions

Compare the entire test output against a saved baseline (golden file). See §13 for full details.

```yaml
expect:
  snapshot: ./snapshots/happy-path.snapshot.yaml
```

### 6.12 Severity levels

By default, assertions are hard failures — any mismatch fails the test. For non-critical checks, use `$warn` severity. Warnings are reported but do not fail the test.

```yaml
expect:
  output:
    result: { outcome: approved }          # hard failure if wrong
  steps:
    charge_payment:
      duration:
        $warn: =value < duration('1s')     # warn if slow, don't fail
        $error: =value < duration('10s')   # fail if very slow
  log:
    not:
      levels:
        $warn: [WARN]                      # warn if warnings logged
        $error: [ERROR]                    # fail if errors logged
```

**`$warn` vs `$error`:** When both are present on the same field, `$error` is the hard failure threshold and `$warn` is the soft threshold. When neither is present, a bare assertion is implicitly `$error`.
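
The two-threshold behaviour can be sketched as a small classifier. This is an illustrative model under the assumption that the hard check is evaluated first; `evaluate` and `classify_duration` are hypothetical names, and durations are plain seconds here.

```python
def evaluate(actual, error_check=None, warn_check=None):
    """Classify one asserted value as 'fail', 'warn', or 'pass'."""
    if error_check is not None and not error_check(actual):
        return "fail"   # hard threshold violated: the test fails
    if warn_check is not None and not warn_check(actual):
        return "warn"   # soft threshold violated: reported, test still passes
    return "pass"


def classify_duration(seconds):
    # Mirrors `$warn: =value < duration('1s')` and `$error: =value < duration('10s')`.
    return evaluate(seconds, error_check=lambda v: v < 10, warn_check=lambda v: v < 1)


for seconds in (0.4, 3, 12):
    print(seconds, classify_duration(seconds))
```

A bare assertion corresponds to passing only `error_check`.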

### 6.13 CEL matcher functions

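These CEL functions are available exclusively in test assertion expressions (within `expect:` blocks); they are not available in production flows. They bind `value` to the actual value being asserted, so a matcher is typically written against it (for example `=value.contains('@')` or `=value < duration('5s')`):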

| Function | Description |
|---|---|
| `approx(expected, tolerance)` | Numeric approximate equality |
| `matches(pattern)` | Regex match on string value |
| `contains(substring)` | String contains |
| `startsWith(prefix)` | String prefix |
| `endsWith(suffix)` | String suffix |
| `hasKey(key)` | Map contains key |
| `hasKeys([k1, k2])` | Map contains all keys |
| `isNull()` | Value is null |
| `isNotNull()` | Value is not null |
| `isBetween(low, high)` | Inclusive range check |
| `hasSize(n)` | Collection/string size |
| `isEmpty()` | Collection/string empty |
| `eachSatisfies(expr)` | All elements match |
| `anySatisfies(expr)` | At least one element matches |
| `isSorted()` | List is sorted ascending |
| `isUnique()` | List has no duplicates |
| `duration(str)` | Parse duration for comparison |
| `matchesSchema(TypeName)` | Value conforms to a declared `types:` schema (**test-only**) |
| `isSubsetOf(list)` | All elements are in the given list |
| `isSupersetOf(list)` | Contains all elements of the given list |
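
For readers who want a concrete reading of a few of these matchers, the following Python equivalents are a rough sketch. They rest on assumptions the table does not state: `approx` treats the tolerance as inclusive, and `isSorted` accepts equal neighbours. The function names are hypothetical.

```python
def approx(value, expected, tolerance):
    # Assumption: the tolerance bound is inclusive.
    return abs(value - expected) <= tolerance


def is_between(value, low, high):
    # The table states the range check is inclusive.
    return low <= value <= high


def is_sorted(values):
    # Assumption: equal neighbours count as sorted (non-strict ascending).
    return all(a <= b for a, b in zip(values, values[1:]))


def is_unique(values):
    # Works for hashable elements such as strings and numbers.
    return len(values) == len(set(values))


def is_subset_of(values, allowed):
    return all(v in allowed for v in values)


def is_superset_of(values, required):
    return all(r in values for r in required)


print(approx(150.004, 150.0, 0.01))                                # True
print(is_superset_of(["welcome.html", "a.css"], ["welcome.html"]))  # True
```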

---

## 7. Error Testing

### Test that a flow throws

```yaml
tests:
  - name: rejects empty order
    input:
      order: { id: "ORD-0001", status: pending, items: [] }
      priority: normal
    expect:
      throws: AssertionError

  - name: payment failure triggers rollback
    input:
      order: ${{ fixtures.standard_order }}
    mocks:
      - service: payment
        throw: { error: PaymentDeclinedError, message: "Insufficient funds" }
    expect:
      throws:
        error: RolledBackError
        cause:
          error: PaymentDeclinedError        # ERROR.CAUSE.TYPE — the original error
          message: =value.contains('Insufficient')
      steps:
        saga_group:
          rolledBack: true
          rollbackSteps: [cancel_payment]
```

### Test error hierarchy (polymorphic catch)

```yaml
tests:
  - name: catches DnsResolutionError via NetworkError handler
    mocks:
      - service: http_client
        throw: { error: DnsResolutionError, message: "NXDOMAIN" }
    input:
      endpoint: "https://bad-host.example.com"
    expect:
      throws:
        error: NetworkError
        message: =value.contains('Unrecoverable')
      steps:
        network_try:
          caught:
            error: DnsResolutionError    # ERROR.TYPE is the exact type
```

### Test that an error is NOT thrown

```yaml
tests:
  - name: gracefully handles timeout with fallback
    mocks:
      - service: primary_api
        delay: 60s    # exceeds step timeout
      - service: fallback_api
        return: { data: "fallback" }
    expect:
      output:
        result: "fallback"
      # With no throws: assertion, the test fails if any error propagates
```

### Test saga rollback ordering

```yaml
tests:
  - name: rollback fires in reverse completion order
    mocks:
      - service: payment
        return: { tx_id: "TX-001" }
      - service: inventory
        return: { reservation_id: "INV-001" }
      - service: shipping
        throw: { error: ShippingError, message: "Address invalid" }
      # Rollback mocks
      - service: inventory
        operation: cancel
        return: { cancelled: true }
      - service: payment
        operation: refund
        return: { refund_id: "RF-001" }
    expect:
      throws:
        error: RolledBackError
      trace:
        # Rollback steps execute in reverse completion order
        contains: [charge_payment, reserve_inventory, rollback_inventory, rollback_payment]
      steps:
        rollback_inventory:
          calledBefore: rollback_payment
```

---

## 8. Yield / Streaming Testing

### Assert yield sequence

```yaml
tests:
  - name: streams all tokens
    input:
      prompt: "Hello"
    mocks:
      - service: claude
        yields:
          - { token: "Hi" }
          - { token: " there" }
        return: { result: "Hi there" }
    expect:
      yields:
        - "Hi"
        - " there"
      output:
        response: "Hi there"
```

### Assert yield count without exact values

```yaml
tests:
  - name: paginates all items
    input:
      api_url: "https://api.example.com/items"
    mocks:
      - request:
          method: GET
          url: "https://api.example.com/items*"
        sequence:
          - respond: { status: 200, body: { data: [1, 2, 3], next_cursor: "pg2" } }
          - respond: { status: 200, body: { data: [4, 5], next_cursor: null } }
    expect:
      yields:
        $count: 5
        $each: =value >= 1 && value <= 5
```

### Test backpressure / buffer behavior

```yaml
tests:
  - name: respects buffer limit
    input: { prompt: "test" }
    mocks:
      - service: claude
        yields:
          - { token: "a" }
          - { token: "b" }
          - { token: "c" }
        yieldDelay: 100ms    # delay between yields
        return: { result: "abc" }
    expect:
      yields:
        $count: 3
```

---

## 9. Cross-Cutting Concern Testing

### 9.1 Retry testing

Test that retry policies work correctly — errors trigger retries, backoff is honored, and the flow recovers (or fails after exhaustion).

```yaml
tests:
  - name: retries transient failures then succeeds
    mocks:
      - service: payment
        operation: charge
        sequence:
          - throw: { error: TimeoutError, message: "timeout 1" }
          - throw: { error: ConnectionError, message: "connection reset" }
          - return: { tx_id: "TX-001" }           # third attempt succeeds
    expect:
      output:
        result: { tx_id: "TX-001" }
      steps:
        charge_payment:
          calledTimes: 3                           # initial + 2 retries
          retryAttempts: 2
          retryErrors:
            - { error: TimeoutError }
            - { error: ConnectionError }

  - name: exhausts retries and propagates error
    mocks:
      - service: payment
        operation: charge
        throw: { error: TimeoutError, message: "persistent timeout" }
    expect:
      throws:
        error: TimeoutError
      steps:
        charge_payment:
          calledTimes: 4                           # initial + 3 retries (maxAttempts: 4)
          retryAttempts: 3

  - name: non-retryable error skips retry
    mocks:
      - service: payment
        operation: charge
        throw: { error: ValidationError, message: "invalid amount" }
    expect:
      throws:
        error: ValidationError
      steps:
        charge_payment:
          calledTimes: 1                           # no retry — ValidationError is non-retryable
          retryAttempts: 0

  - name: retry backoff timing (virtual time)
    mocks:
      - service: payment
        sequence:
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - return: { tx_id: "TX-001" }
    expect:
      steps:
        charge_payment:
          calledTimes: 3                               # initial + 2 retries
          retryAttempts: 2
          retryDurations:
            # With delay: 2s, backoff: EXPONENTIAL
            - =value >= duration('2s') && value <= duration('3s')    # ~2s + jitter
            - =value >= duration('4s') && value <= duration('6s')    # ~4s + jitter
```
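
The windows asserted in the backoff test can be derived from the policy parameters. The sketch below assumes a doubling factor and an upward jitter of at most 50%, chosen only because they reproduce the ranges in the example; neither value is stated by this document, and `backoff_windows` is a hypothetical helper.

```python
def backoff_windows(delay_s, retries, factor=2.0, max_jitter=0.5):
    """Return (min, max) delay in seconds before each retry attempt."""
    windows = []
    for attempt in range(retries):
        base = delay_s * factor ** attempt          # 2s, 4s, 8s, ...
        windows.append((base, base * (1 + max_jitter)))
    return windows


print(backoff_windows(2.0, 2))  # [(2.0, 3.0), (4.0, 6.0)]
```

Asserting a window rather than an exact value keeps the test stable when jitter is enabled.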

**Breakpoint on retry:** Use `onRetry:` breakpoints (§14) to inspect state between retry attempts:

```yaml
breakpoints:
  - onRetry: charge_payment
    assert:
      steps:
        charge_payment:
          retryAttempts: =value >= 1
```

### 9.2 Circuit breaker testing

Test circuit breaker state transitions: CLOSED → OPEN → HALF_OPEN → CLOSED.

```yaml
tests:
  - name: circuit breaker trips after threshold failures
    mocks:
      - service: payment
        sequence:
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }        # 5 failures → trip threshold
          # 6th call: circuit is OPEN, not dispatched to mock
    expect:
      throws: CircuitOpenError
      steps:
        charge_payment:
          # First 5 calls: CLOSED, each fails
          calledTimes: 5                           # 5 actual calls before breaker trips
          circuitBreakerTripped: true

  - name: circuit breaker half-open recovery
    mocks:
      - service: payment
        sequence:
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }        # trips OPEN
          - return: { tx_id: "TX-RECOVERED" }     # HALF_OPEN trial succeeds
    # Virtual time auto-advances past resetTimeout (30s) when the flow
    # reaches a waitUntil or retry delay. The engine's virtual clock skips
    # forward to the next actionable time point — no schedule: needed.
    expect:
      output:
        result: { tx_id: "TX-RECOVERED" }
      steps:
        retry_charge:
          circuitBreakerState: HALF_OPEN

  - name: circuit breaker ignores non-countable errors
    mocks:
      - service: payment
        sequence:
          - throw: { error: ValidationError }      # non-countable
          - throw: { error: ValidationError }
          - throw: { error: ValidationError }
          - throw: { error: ValidationError }
          - throw: { error: ValidationError }
          - return: { tx_id: "TX-001" }
    expect:
      output:
        result: { tx_id: "TX-001" }
      steps:
        charge_payment:
          circuitBreakerTripped: false              # breaker never tripped
```
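
The state transitions these tests exercise can be modelled with a small state machine. This is a sketch under stated assumptions, not the engine's implementation: failures are counted cumulatively until a success resets them, a failed HALF_OPEN trial re-opens the breaker, and the `CircuitBreaker` class and its method names are hypothetical.

```python
class CircuitBreaker:
    def __init__(self, threshold, reset_timeout_s):
        self.threshold = threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # virtual time at which the breaker tripped

    def state(self, now):
        if self.opened_at is None:
            return "CLOSED"
        if now - self.opened_at >= self.reset_timeout_s:
            return "HALF_OPEN"  # reset timeout elapsed: one trial call allowed
        return "OPEN"

    def allow(self, now):
        return self.state(now) != "OPEN"

    def record_failure(self, now, countable=True):
        if not countable:
            return              # e.g. ValidationError does not count
        if self.state(now) == "HALF_OPEN":
            self.opened_at = now  # trial failed: re-open
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = now  # threshold reached: trip OPEN

    def record_success(self):
        self.failures = 0
        self.opened_at = None   # back to CLOSED


breaker = CircuitBreaker(threshold=5, reset_timeout_s=30)
for _ in range(5):
    breaker.record_failure(now=0)
print(breaker.state(now=1))    # OPEN
print(breaker.state(now=30))   # HALF_OPEN
breaker.record_success()
print(breaker.state(now=31))   # CLOSED
```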

**Circuit breaker shorthand forms** — string and integer shorthands desugar to object form at parse time. Test them the same way:

```yaml
# Flow under test uses string shorthand:
#   circuitBreaker: "5/com.example.orders.inventory"
# Equivalent to: { name: com.example.orders.inventory, threshold: 5 }

tests:
  - name: string shorthand circuit breaker trips at threshold
    mocks:
      - service: inventory
        sequence:
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
    expect:
      throws: CircuitOpenError
      steps:
        check_inventory:
          calledTimes: 5
          circuitBreakerTripped: true

  - name: integer shorthand circuit breaker uses _id_ as name
    # Flow under test:
    #   - call:
    #       _id_: shipping_quote
    #       circuitBreaker: 3           # name derived as "shipping_quote"
    mocks:
      - service: shipping
        sequence:
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
    expect:
      throws: CircuitOpenError
      steps:
        shipping_quote:
          calledTimes: 3
          circuitBreakerTripped: true
```
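
The desugaring described above can be sketched as follows. The object form is limited to the two fields shown in the comments (`name`, `threshold`); `desugar_circuit_breaker` is a hypothetical helper, and the error handling for malformed input is an assumption.

```python
def desugar_circuit_breaker(value, step_id=None):
    """Expand a circuitBreaker shorthand into {name, threshold} object form."""
    if isinstance(value, dict):
        return dict(value)                      # already in object form
    if isinstance(value, bool):
        raise TypeError("circuitBreaker shorthand must be a string or integer")
    if isinstance(value, int):
        if step_id is None:
            raise ValueError("integer shorthand needs the step's _id_ as name")
        return {"name": step_id, "threshold": value}
    if isinstance(value, str):
        threshold, sep, name = value.partition("/")
        if not sep or not name or not threshold.isdigit():
            raise ValueError(f"expected '<threshold>/<name>', got {value!r}")
        return {"name": name, "threshold": int(threshold)}
    raise TypeError("circuitBreaker shorthand must be a string or integer")


print(desugar_circuit_breaker("5/com.example.orders.inventory"))
print(desugar_circuit_breaker(3, step_id="shipping_quote"))
```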

### 9.3 Rate limit testing

Test rate limit enforcement — WAIT strategy queues requests, REJECT strategy throws immediately.

```yaml
tests:
  - name: rate limit queues excess requests (WAIT strategy)
    mocks:
      - service: api
        return: { data: "ok" }
    expect:
      steps:
        api_call_1:
          rateLimited: false                       # within limit
        api_call_2:
          rateLimited: true                        # exceeded limit, waited
          rateLimitWaited: =value > duration('0s') # waited for rate limit token

  - name: rate limit rejects excess requests (REJECT strategy)
    mocks:
      - service: api
        return: { data: "ok" }
    expect:
      throws: RateLimitError
      steps:
        api_call_burst:
          skippedReason: RATE_LIMITED

  - name: rate limit per-key isolation
    mocks:
      - service: api
        return: { data: "ok" }
    input:
      tenant_id: tenant_a
    expect:
      # Only tenant_a's rate is counted — other tenants unaffected
      steps:
        api_call:
          rateLimited: false
```

### 9.4 Timeout testing

Test timeout behavior — step cancellation, `onTimeout` handler execution, `TimeoutError` propagation.

```yaml
tests:
  - name: timeout fires and onTimeout handler runs
    mocks:
      - service: slow_api
        delay: 60s                                 # exceeds 30s step timeout
    expect:
      steps:
        slow_call:
          timedOut: true
          onTimeoutExecuted: true
          duration: =value == duration('30s')       # cancelled at timeout, not 60s

  - name: timeout propagates when no onTimeout handler
    mocks:
      - service: slow_api
        delay: 60s
    expect:
      throws: TimeoutError

  - name: onTimeout handler can return fallback value
    mocks:
      - service: primary
        delay: 60s
    expect:
      output:
        result: "fallback value"                   # onTimeout handler sets fallback
      steps:
        primary_call:
          timedOut: true
```

### 9.5 Transaction & rollback testing

Test saga patterns — variable scope forking, rollback ordering, partial rollback failures.

```yaml
tests:
  - name: transaction auto-rolls back variables on error
    mocks:
      - service: payment
        return: { tx_id: "TX-001" }
      - service: inventory
        throw: { error: InventoryError }
    expect:
      throws: RolledBackError
      vars:
        payment_tx: =value == null                 # auto-rolled back to pre-transaction state
      steps:
        transaction_group:
          rolledBack: true

  - name: rollback handlers fire in reverse completion order
    mocks:
      - service: payment
        operation: charge
        return: { tx_id: "TX-001" }
      - service: inventory
        operation: reserve
        return: { reservation_id: "INV-001" }
      - service: shipping
        throw: { error: ShippingError }
      # Rollback mocks
      - service: inventory
        operation: cancel_reservation
        return: { cancelled: true }
      - service: payment
        operation: refund
        return: { refund_id: "RF-001" }
    expect:
      throws: RolledBackError
      steps:
        transaction_group:
          rollbackSteps: [cancel_reservation, refund_payment]   # reverse order
        cancel_reservation:
          calledBefore: refund_payment

  - name: partial rollback failure
    mocks:
      - service: payment
        operation: refund
        throw: { error: RefundError, message: "refund gateway down" }
      - service: inventory
        operation: cancel_reservation
        return: { cancelled: true }
    expect:
      throws:
        error: RollbackFailedError                 # some rollbacks failed
        data:
          succeeded: [cancel_reservation]
          failed: [refund_payment]
```

---

## 10. Step-Level Testing

### Testing individual steps with `focus:`

Sometimes you want to test a specific step path without running the entire flow. `focus:` limits execution to a subgraph anchored at one or more `_id_`-tagged steps.

```yaml
tests:
  - name: discount calculation step
    focus: calculate_discount     # only run this _id_ and its dependencies
    vars:
      order_total: 500.00
      customer_tier: premium
    expect:
      vars:
        discount_amount: 50.00
```

**How `focus:` works:**

1. Engine resolves the step(s) by `_id_`
2. Traces backward to identify all data dependencies (variables read by the step)
3. If the test provides those variables in `vars:`, uses them directly (no dependency execution)
4. If a dependency is not provided, executes the minimal set of prior steps needed to produce it
5. Runs the focused step(s)
6. Skips all steps after the focused subgraph
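
The backward dependency walk in steps 2 to 4 can be sketched for a linear flow. This model assumes each step declares the variables it reads and writes; `plan_focus` and the step dictionaries are hypothetical and ignore branching.

```python
def plan_focus(steps, focus_ids, provided):
    """Return the ids of the steps to execute, in flow order."""
    index = {step["id"]: i for i, step in enumerate(steps)}
    selected = set(focus_ids)
    last = max(index[f] for f in focus_ids)
    needed = set()  # variables still waiting for a producer
    # Walk backward from the last focused step; everything after it is skipped.
    for i in range(last, -1, -1):
        step = steps[i]
        if step["id"] in selected or needed & set(step["writes"]):
            selected.add(step["id"])
            needed -= set(step["writes"])
            # Variables injected by the test need no producer.
            needed |= set(step["reads"]) - set(provided)
    return [step["id"] for step in steps if step["id"] in selected]


flow = [
    {"id": "fetch_customer", "reads": ["customer_id"], "writes": ["customer_tier"]},
    {"id": "compute_total", "reads": ["items"], "writes": ["order_total"]},
    {"id": "calculate_discount", "reads": ["order_total", "customer_tier"],
     "writes": ["discount_amount"]},
    {"id": "send_confirmation", "reads": ["discount_amount"], "writes": []},
]

print(plan_focus(flow, ["calculate_discount"], {"order_total", "customer_tier"}))
print(plan_focus(flow, ["calculate_discount"], {"order_total"}))
```

With both variables injected only the focused step runs; with `customer_tier` missing, `fetch_customer` is pulled in as its producer.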

This is analogous to the AWS Step Functions TestState API, which tests a single state in isolation with injected inputs. Unlike TestState, `focus:` can also resolve dependencies automatically.

### Multiple focus targets

```yaml
tests:
  - name: validate and discount
    focus: [validate_order, calculate_discount]   # run both steps (and their dependencies)
```

---

## 11. Parameterized Tests

### `matrix:` — table-driven testing

Run the same flow against many input/output combinations without duplicating test definitions. This is the workflow equivalent of JUnit's `@ParameterizedTest`, Go's table-driven tests, dbt's `given`/`expect` YAML, and Cucumber's Scenario Outlines.

```yaml
tests:
  - name: "order routing: {{ priority }} priority, {{ total }} total"
    matrix:
      - { priority: critical, total: 50,   expected_outcome: approved,  expected_notes: "Auto-approved: critical priority" }
      - { priority: critical, total: 5000, expected_outcome: approved,  expected_notes: "Auto-approved: critical priority" }
      - { priority: high,     total: 50,   expected_outcome: approved,  expected_notes: "Approved: high priority, within threshold" }
      - { priority: high,     total: 1500, expected_outcome: review,    expected_notes: null }
      - { priority: normal,   total: 50,   expected_outcome: approved,  expected_notes: "Standard processing" }
    input:
      order:
        $merge: ${{ fixtures.standard_order }}
        total: ${{ matrix.total }}
      priority: ${{ matrix.priority }}
    mocks:
      - ${{ mocks.payment_success }}
    expect:
      output:
        result:
          outcome: ${{ matrix.expected_outcome }}
```

### How `matrix:` works

1. Each entry in the `matrix:` list is a **row** of parameters
2. The test runner generates one test case per row
3. Parameters are referenced as `${{ matrix.<key> }}`
4. The test `name:` can interpolate matrix parameters via `{{ }}` for readable output
5. All other test properties (`input:`, `mocks:`, `expect:`, etc.) can reference `${{ matrix.<key> }}`
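
The expansion steps above can be sketched as a small pre-processing pass over an already-parsed test definition. This is an illustrative model only: `expand` and `substitute` are hypothetical helpers, and keeping the row value's type when a string is exactly one `${{ matrix.<key> }}` reference is an assumption.

```python
import re

NAME_PARAM = re.compile(r"\{\{\s*(\w+)\s*\}\}")
MATRIX_REF = re.compile(r"\$\{\{\s*matrix\.(\w+)\s*\}\}")


def substitute(node, row):
    """Recursively replace ${{ matrix.<key> }} references with row values."""
    if isinstance(node, dict):
        return {k: substitute(v, row) for k, v in node.items()}
    if isinstance(node, list):
        return [substitute(v, row) for v in node]
    if isinstance(node, str):
        whole = MATRIX_REF.fullmatch(node)
        if whole:
            return row[whole.group(1)]  # keep the row value's type
        return MATRIX_REF.sub(lambda m: str(row[m.group(1)]), node)
    return node


def expand(test):
    """Generate one concrete test case per matrix row."""
    cases = []
    for row in test["matrix"]:
        case = {k: substitute(v, row) for k, v in test.items()
                if k not in ("matrix", "name")}
        case["name"] = NAME_PARAM.sub(lambda m: str(row[m.group(1)]), test["name"])
        cases.append(case)
    return cases


test = {
    "name": "order routing: {{ priority }} priority, {{ total }} total",
    "matrix": [
        {"priority": "critical", "total": 50, "expected_outcome": "approved"},
        {"priority": "high", "total": 1500, "expected_outcome": "review"},
    ],
    "input": {"order": {"total": "${{ matrix.total }}"},
              "priority": "${{ matrix.priority }}"},
    "expect": {"output": {"result": {"outcome": "${{ matrix.expected_outcome }}"}}},
}

for case in expand(test):
    print(case["name"])
```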

### Test output with matrix

```
order-processing.flowmarkup-test.yaml
  order routing: critical priority, 50 total
    ✓ (8ms)
  order routing: critical priority, 5000 total
    ✓ (7ms)
  order routing: high priority, 50 total
    ✓ (9ms)
  order routing: high priority, 1500 total
    ✓ (8ms)
  order routing: normal priority, 50 total
    ✓ (7ms)

5 passed (39ms)
```

### External matrix from CSV/JSON

```yaml
tests:
  - name: "edge case: {{ description }}"
    matrix:
      $include: ./test-data/order-edge-cases.csv
    input:
      order:
        id: ${{ matrix.order_id }}
        total: ${{ matrix.total }}
        items: ${{ matrix.items }}
    expect:
      output:
        result:
          outcome: ${{ matrix.expected }}
```

CSV format:

```csv
description,order_id,total,items,expected
zero total,ORD-001,0,"[{""sku"":""A"",""quantity"":1,""price"":0}]",approved
negative total,ORD-002,-10,"[{""sku"":""B"",""quantity"":1,""price"":-10}]",rejected
max items,ORD-003,99999,"...",review
```

> CSV-loaded matrix values MUST be subject to SA-CSV-2 formula-prefix validation. Values starting with `=`, `+`, `-`, `@`, `\t`, or `\r` MUST be sanitized or rejected. CSV matrix values containing YAML special characters MUST be treated as string literals — no YAML interpretation.
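
A literal reading of the prefix rule can be sketched with Python's standard `csv` module. This sketch rejects rather than sanitizes, which is only one of the two options the rule allows; `load_matrix` is a hypothetical helper, and the actual SA-CSV-2 behaviour is defined elsewhere.

```python
import csv
import io

FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def load_matrix(csv_text):
    """Parse CSV text into matrix rows, rejecting formula-prefixed cells."""
    rows = []
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    for row in reader:
        for key, value in row.items():
            # Every cell stays a plain string; nothing is interpreted as YAML.
            if isinstance(value, str) and value.startswith(FORMULA_PREFIXES):
                raise ValueError(
                    f"line {reader.line_num}, column {key!r}: "
                    f"formula-prefixed value {value!r}"
                )
        rows.append(dict(row))
    return rows


print(load_matrix("description,total\nzero total,0\n"))
```

Under this literal reading a negative number such as `-10` is also flagged; whether numeric cells are exempt is not stated here.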

### Combining matrix with tags

```yaml
tests:
  - name: "{{ scenario }}"
    tags: [regression, "${{ matrix.tag }}"]
    matrix:
      - { scenario: "happy path",        tag: smoke,    ... }
      - { scenario: "edge case: empty",  tag: edge,     ... }
      - { scenario: "error: timeout",    tag: negative, ... }
```

Run only smoke tests: `flowmarkup test --tag smoke`.

---

## 12. Migration & Replay Testing

FlowMarkup flows have `version:` and `onVersionChange:` for checkpoint-resume compatibility when flow definitions change. Testing this path is essential for checkpoint-based flows; it is the workflow equivalent of Temporal's `WorkflowReplayer`.

### 12.1 Migration testing (`migration:`)

Test that `onVersionChange:` correctly handles checkpoint resume across versions.

```yaml
tests:
  - name: migrates from v1 to v2 at charge_payment step
    migration:
      from: 1                              # version of the checkpointed execution
      to: 2                                # version of the current flow definition
      checkpoint:
        stepCursor: charge_payment        # step where the old execution was suspended
        vars:                              # variable state at checkpoint time
          order_status: processing
          payment_tx: null
        context:                           # CONTEXT.* state at checkpoint
          correlation_id: "corr-v1-001"
        globals: {}                        # GLOBAL.* state at checkpoint
      hash:
        old: "sha256-abc123"               # optional: old flow content hash
        new: "sha256-def456"               # optional: new flow content hash
    mocks:
      - ${{ mocks.payment_success }}
    expect:
      output:
        result: { status: completed }      # flow resumes and completes
      trace:
        # onVersionChange handler runs first, then flow resumes
        startsWith: [migration_handler]
        contains: [charge_payment, send_confirmation]

  - name: rejects incompatible v0 checkpoint
    migration:
      from: 0
      to: 2
      checkpoint:
        stepCursor: old_step
        vars: {}
    expect:
      throws:
        error: MigrationError              # handler throws are wrapped as MigrationError
        cause:
          error: IncompatibleVersionError   # original error accessible via ERROR.CAUSE
          message: =value.contains('pre-v1')

  - name: gracefully terminates pre-v2 checkpoint
    migration:
      from: 1
      to: 3
      checkpoint:
        stepCursor: fetch_order
        vars: {}
    expect:
      output:
        status: terminated
        reason: incompatible_version
```

**How `migration:` works:**

1. Engine constructs a synthetic `MIGRATION` binding from the `migration:` block:
   - `MIGRATION.OLD_VERSION` ← `from:`
   - `MIGRATION.NEW_VERSION` ← `to:`
   - `MIGRATION.OLD_HASH` ← `hash.old:` (or null)
   - `MIGRATION.NEW_HASH` ← `hash.new:` (or null)
   - `MIGRATION.STEP_CURSOR` ← `checkpoint.stepCursor:`
   - `MIGRATION.LOCATION` ← auto-derived
   - `MIGRATION.CHECKPOINTED_AT` ← synthetic timestamp
2. Engine restores variable state from `checkpoint.vars:`, `checkpoint.context:`, `checkpoint.globals:`
3. Engine executes the flow's `onVersionChange:` handler
4. If the handler completes normally, engine resumes the flow from `stepCursor:`
5. If the handler executes `return`, the flow is gracefully terminated (`finally:` runs, checkpoint discarded, return value becomes flow output)
6. If the handler throws, the error is wrapped in `MigrationError` (the original error remains available as its cause), which propagates and cannot be caught
7. Test assertions evaluate against the final state
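
The three handler outcomes in steps 4 to 6 can be modelled as follows. This is a simplified sketch: the handler and the resume logic are plain Python callables, `HandlerReturn` stands in for the handler executing `return`, and all names other than `MigrationError` and `STEP_CURSOR` are hypothetical.

```python
class MigrationError(Exception):
    """Raised when the onVersionChange handler itself throws."""


class HandlerReturn(Exception):
    """Models the handler executing `return` with a value."""

    def __init__(self, value):
        super().__init__("handler returned")
        self.value = value


def run_migration(handler, resume_from, migration):
    try:
        handler(migration)
    except HandlerReturn as early:
        # Graceful termination: the return value becomes the flow output.
        return {"resumed": False, "output": early.value}
    except Exception as exc:
        # Handler failure is wrapped; the original error stays as the cause.
        raise MigrationError(str(exc)) from exc
    # Normal completion: resume the flow from the checkpointed step.
    return {"resumed": True, "output": resume_from(migration["STEP_CURSOR"])}


def terminate_old(migration):
    if migration["OLD_VERSION"] < 2:
        raise HandlerReturn({"status": "terminated", "reason": "incompatible_version"})


migration = {"OLD_VERSION": 1, "NEW_VERSION": 3, "STEP_CURSOR": "fetch_order"}
print(run_migration(terminate_old, lambda step: {"status": "completed"}, migration))
```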

### 12.2 Replay testing

Record a complete flow execution and replay it against a modified flow definition to verify version compatibility. This detects when flow changes would break in-flight executions.

**Recording an execution:**

```bash
# Record a real execution trace
flowmarkup run order-processing.flowmarkup.yaml \
  --input '{"order": {...}, "priority": "normal"}' \
  --record-trace ./recordings/order-v1-normal.trace.yaml
```

The trace file captures the complete execution event sequence:

```yaml
# AUTO-GENERATED
trace:
  flow: order-processing.flowmarkup.yaml
  version: 1
  hash: sha256-abc123
  recorded: 2026-03-11T14:30:00Z
  input: { order: {...}, priority: normal }
  events:
    - type: STEP_ENTER
      step: validate_order
      timestamp: 0ms
      vars: { order_status: new }
    - type: STEP_EXIT
      step: validate_order
      timestamp: 2ms
      vars: { order_status: new }
    - type: ACTION_INVOKE
      step: charge_payment
      action: call
      service: payment
      params: { order_id: "ORD-1001", amount: 150.00 }
    - type: ACTION_COMPLETE
      step: charge_payment
      result: { tx_id: "TX-8834" }
      timestamp: 236ms
    - type: STEP_EXIT
      step: charge_payment
      timestamp: 237ms
      vars: { order_status: new, payment_tx: "TX-8834" }
    # ... more events ...
    - type: FLOW_COMPLETE
      output: { result: { outcome: approved } }
      timestamp: 412ms
```

> **Secret redaction in traces.** Trace files MUST apply the same secret redaction as log output. Variables annotated with `$secret: true` or `$exportable: false` MUST be replaced with `[REDACTED]` in trace output.

**Replaying against a new version:**

```yaml
tests:
  - name: v2 compatible with v1 execution trace
    replay: ./recordings/order-v1-normal.trace.yaml
    expect:
      # The new flow, fed the same inputs and mock responses from the recording,
      # must produce equivalent output
      output:
        result:
          outcome: approved
      # Optionally assert the execution path is similar
      trace:
        contains: [validate_order, charge_payment]
```

**Replay semantics:**

1. Engine loads the recorded `input:` and action responses
2. Engine executes the *current* flow definition (not the recorded version)
3. Action responses are replayed from the recording (same as record-and-replay mocks)
4. If the current flow makes an action call not in the recording (new step), `ReplayDivergenceError` is thrown
5. If the current flow skips a recorded action call (removed step), a warning is emitted
6. Final output is compared against the recording's output (if no explicit `expect:` overrides it)
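
The six steps above can be modeled in a few lines. The following is a minimal Python sketch, not the engine's implementation: the `replay` function, the `invoke` callback, and the assumption that each step id is invoked at most once are all hypothetical; only the event types and `ReplayDivergenceError` come from this section.

```python
import warnings
from typing import Any, Callable, Dict


class ReplayDivergenceError(Exception):
    """The current flow made an action call that the trace never recorded."""


Invoke = Callable[[str, dict], Any]


def replay(trace: Dict[str, Any], flow: Callable[[dict, Invoke], Any]) -> Dict[str, Any]:
    events = trace["events"]
    # Steps 1 and 3: index recorded action responses by step id
    # (assumes each step id is invoked at most once).
    recorded = {e["step"]: e["result"] for e in events if e["type"] == "ACTION_COMPLETE"}
    used = set()

    def invoke(step: str, params: dict) -> Any:
        # Step 4: a call with no recorded response is a divergence.
        if step not in recorded:
            raise ReplayDivergenceError(f"no recorded response for step '{step}'")
        used.add(step)
        return recorded[step]

    # Step 2: run the *current* flow definition with the recorded input.
    output = flow(trace["input"], invoke)

    # Step 5: recorded calls the current flow skipped only produce a warning.
    for step in sorted(recorded.keys() - used):
        warnings.warn(f"recorded action '{step}' was not invoked by the current flow")

    # Step 6: compare against the recorded output.
    expected = next(e["output"] for e in events if e["type"] == "FLOW_COMPLETE")
    return {"output": output, "matches_recording": output == expected}
```

Here `flow` stands in for the current flow definition; a real engine would interpret the YAML document instead of calling a Python function.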

**Bulk replay in CI:**

```bash
# Replay all recorded traces against current flow definitions
flowmarkup test --replay-all ./recordings/

# Fail CI if any trace diverges
flowmarkup test --replay-all ./recordings/ --strict
```

This is the primary mechanism for **version compatibility testing** when modifying flows that have in-flight instances.

### 12.3 `replay:` format disambiguation

The `replay:` key on a test case accepts two file formats. The engine auto-detects by the root YAML key:

| Root key | Extension | Source | Error on mismatch | See |
|---|---|---|---|---|
| `recording:` | `.recording.yaml` | `flowmarkup test --record` (§4.9) — captures action I/O only | `RecordingMismatchError` | §4.9 |
| `trace:` | `.trace.yaml` | `flowmarkup run --record-trace` (§12.2) — captures full execution events | `ReplayDivergenceError` | §12.2 |

SA-TEST-22 (`replay:` file version mismatch) applies to `.trace.yaml` files only, since `.recording.yaml` files do not carry a flow `version:` field.
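
As a rough illustration of the root-key auto-detection, the Python sketch below dispatches on an already-parsed YAML document. The function names and the `ValueError` for documents with neither or both root keys are assumptions; the specification above does not say how such files are handled.

```python
from typing import Any, Dict

# Root key -> (expected extension, error raised on mismatch), per the table above.
REPLAY_FORMATS = {
    "recording": (".recording.yaml", "RecordingMismatchError"),
    "trace": (".trace.yaml", "ReplayDivergenceError"),
}


def detect_replay_format(document: Dict[str, Any]) -> str:
    """Return 'recording' or 'trace' based on the document's root key."""
    found = [key for key in REPLAY_FORMATS if key in document]
    if len(found) != 1:
        # Hypothetical behavior: the spec does not define this case.
        raise ValueError("expected exactly one of the root keys 'recording:' or 'trace:'")
    return found[0]


def version_check_applies(document: Dict[str, Any]) -> bool:
    """SA-TEST-22 applies only to trace files, which carry a flow version."""
    return detect_replay_format(document) == "trace"
```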

---

## 13. Snapshot Testing

Capture the complete test result as a golden file and compare against it on subsequent runs. This is the workflow equivalent of Jest's `toMatchSnapshot()`, ApprovalTests, and Temporal's history-based replay.

### Creating a snapshot

```yaml
tests:
  - name: happy path snapshot
    input:
      order: ${{ fixtures.standard_order }}
      priority: critical
    mocks:
      - ${{ mocks.payment_success }}
    expect:
      snapshot: ./snapshots/happy-path.snapshot.yaml
```

**First run:** The test executes normally. Since no snapshot file exists, the runner captures the complete result and writes it:

```yaml
# AUTO-GENERATED — review before committing
snapshot:
  test: happy path snapshot
  created: 2026-03-11T14:30:00Z
  output:
    result: { outcome: approved, notes: "Auto-approved: critical priority" }
    audit_log: "Order ORD-1001 processed: approved"
  vars:
    order_status: completed
  globals: {}
  context: {}
  yields: []
  trace: [validate_order, charge_payment, reserve_inventory, send_confirmation]
  log:
    - { level: INFO, message: "Processing order ORD-1001 with 3 items, total: $150.00" }
    - { level: INFO, message: "Order ORD-1001 processed: approved" }
  steps:
    charge_payment:
      calledTimes: 1
      calledWith: { order_id: "ORD-1001", amount: 150.00 }
      returned: { tx_id: "TX-9999" }
  mail: []
  resources:
    config: { accessed: true, accessCount: 1 }
```

**Subsequent runs:** The test executes and the result is compared against the snapshot. Any diff fails the test:

```
✗ happy path snapshot (12ms)
  Snapshot mismatch:
    trace:
      - expected: [..., reserve_inventory, send_confirmation]
      + actual:   [..., reserve_inventory, send_receipt, send_confirmation]
                                           ^^^^^^^^^^^^
    log:
      + actual has extra entry: { level: INFO, message: "Receipt sent" }

  Run with --update-snapshots to accept the new output.
```

### Updating snapshots

```bash
# Update all snapshots that differ
flowmarkup test --update-snapshots

# Update snapshots for a specific test file
flowmarkup test order-processing.flowmarkup-test.yaml --update-snapshots

# Interactive: review each diff before accepting
flowmarkup test --update-snapshots --interactive
```

### Snapshot scope

By default, snapshots capture everything (output, vars, globals, context, trace, log, steps, mail, yields, resources, events). Use `$include` or `$exclude` to narrow the snapshot scope:

```yaml
expect:
  snapshot:
    file: ./snapshots/happy-path.snapshot.yaml
    $include: [output, trace]          # only snapshot these aspects
    # or
    $exclude: [log, steps]             # snapshot everything except these
```
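
One way to picture scoping is as a filter applied to the captured result before it is written or compared. The Python sketch below is illustrative only: `scope_snapshot` and `changed_aspects` are hypothetical helpers, and rejecting a snapshot that sets both `$include` and `$exclude` is an assumption rather than documented behavior.

```python
from typing import Any, Dict, Iterable, List, Optional


def scope_snapshot(
    result: Dict[str, Any],
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Keep only the aspects selected by $include, or drop those in $exclude."""
    if include is not None and exclude is not None:
        # Assumption: the two options are treated as alternatives.
        raise ValueError("use either $include or $exclude, not both")
    if include is not None:
        keep = set(include)
    else:
        keep = set(result) - set(exclude or ())
    return {aspect: value for aspect, value in result.items() if aspect in keep}


def changed_aspects(expected: Dict[str, Any], actual: Dict[str, Any]) -> List[str]:
    """Return the sorted names of aspects whose values differ."""
    names = expected.keys() | actual.keys()
    return sorted(name for name in names if expected.get(name) != actual.get(name))
```

Filtering both the stored snapshot and the fresh result with the same scope before diffing keeps noisy aspects such as `log` from failing the comparison.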

### Combining snapshot with explicit assertions

Snapshot and explicit assertions can coexist. Explicit assertions are evaluated first; if they pass, the snapshot is compared:

```yaml
expect:
  output:
    result:
      outcome: approved                # explicit check — fails immediately if wrong
  snapshot: ./snapshots/happy-path.snapshot.yaml   # then compare full snapshot
```

---

## 14. Breakpoints & Inspection

### Mid-execution breakpoints

Inspect variable state, CEL evaluation results, and mock interactions at intermediate points during flow execution. This is the workflow equivalent of Temporal's `RegisterDelayedCallback` for querying state mid-workflow, and Step Functions' `DEBUG` inspection level.

```yaml
tests:
  - name: intermediate state is correct during order processing
    input:
      order: ${{ fixtures.standard_order }}
      priority: high
    mocks:
      - ${{ mocks.payment_success }}
      - service: inventory
        return: { reservation_id: "INV-001" }
    breakpoints:
      - after: validate_order
        assert:
          vars:
            order_status: validated
            payment_tx: =value == null           # not yet set

      - after: charge_payment
        assert:
          vars:
            payment_tx: "TX-9999"                # now set
            order_status: validated               # not yet updated to 'charged'
          steps:
            charge_payment:
              calledWith:
                amount: 150.00

      - before: send_confirmation
        assert:
          vars:
            inventory_reserved: true
            payment_tx: =value != null
    expect:
      output:
        result: { outcome: approved }
```

### Breakpoint types

| Property | Description |
|---|---|
| `before: <_id_>` | Pause before the step executes |
| `after: <_id_>` | Pause after the step completes (including output binding) |
| `thrown: <_id_>` | Pause when the step throws (before catch/retry) |
| `onRetry: <_id_>` | Pause on each retry attempt |
| `onRollback: <_id_>` | Pause when a rollback handler fires for this step |

### Conditional breakpoints

Use `when:` to make a breakpoint conditional: it fires only when the `when:` expression evaluates to `true`. This prevents hangs when a breakpoint targets a step that might be skipped.

```yaml
breakpoints:
  - after: optional_step
    when: =steps.optional_step.called             # only break if step actually executed
    assert:
      vars:
        optional_result: =value != null

  - onRetry: flaky_call
    when: =steps.flaky_call.retryAttempts >= 2    # only break on 3rd+ retry
    assert:
      steps:
        flaky_call:
          retryAttempts: =value >= 2

  - after: loop_body
    when: =i == 5                                 # flow declares forEach index: i
    assert:
      vars:
        accumulated: =value >= 50
```

### Breakpoint assertions

Inside a breakpoint's `assert:` block, the same assertion syntax as `expect:` is available (`vars:`, `globals:`, `context:`, `steps:`, `resources:`), scoped to the state at that point in execution. `output:` and `trace:` are not available in breakpoints (flow hasn't completed yet).

### Inspection mode (CLI)

For ad-hoc debugging without writing breakpoint assertions, use `--inspect`:

```bash
# Show step-by-step variable state and CEL evaluation results
flowmarkup test order-processing.flowmarkup-test.yaml \
  --name "approves critical priority order" \
  --inspect
```

Output:

```
[0ms] STEP_ENTER  validate_order
  vars: { order_status: "new", payment_tx: null }
[1ms] ASSERT      =size(order.items) > 0  →  true
[1ms] ASSERT      =order.status == 'pending'  →  true
[2ms] STEP_EXIT   validate_order

[2ms] STEP_ENTER  set (line 86)
  CEL: order.items.map(i, i.quantity).sum()  →  3
  CEL: order.items.map(i, i.price * i.quantity).sum()  →  150.00
  vars: { item_count: 3, order_total: 150.00 }
[2ms] STEP_EXIT   set

[3ms] STEP_ENTER  charge_payment
  MOCK MATCH: service=payment, operation=* → return { tx_id: "TX-9999" }
  input: { order_id: "ORD-1001", amount: 150.00 }
  output: { tx_id: "TX-9999" } → payment_tx
[3ms] STEP_EXIT   charge_payment
  vars: { ..., payment_tx: "TX-9999" }

...

[12ms] FLOW_COMPLETE
  output: { result: { outcome: "approved", notes: "Auto-approved: critical priority" } }
```

### Trace mode (CLI)

Trace mode goes further than inspection mode and also reports mock-matching decisions and skipped branches:

```bash
flowmarkup test --trace order-processing.flowmarkup-test.yaml
```

Additional output:

```
[3ms] MOCK_EVAL   service=payment, operation=charge
  Mock 1: service=inventory → SKIP (service mismatch)
  Mock 2: service=payment, operation=* → MATCH
  Response: { tx_id: "TX-9999" }

[5ms] BRANCH_EVAL if (line 105): =order_total > 1000
  CEL: order_total > 1000  →  false (order_total=150.00)
  Taking: else branch

[6ms] STEP_SKIP   send_refund (condition: =payment_failed → false)
```

---

## 15. Invariants

Declare properties that must hold across **all** test cases in a suite. Invariants are evaluated after every test case completes, whether its `expect:` assertions pass or fail. This is the workflow equivalent of TLA+'s safety invariants, checked natively by the test runner.

### Suite-level invariants

```yaml
flowmarkup-test:
  flow: ./order-processing.flowmarkup.yaml

  invariants:
    # Output invariants — properties of the result
    - =output == null || output.result.order_id != null
    - =output == null || output.result.outcome in ['approved', 'rejected', 'review']

    # State invariants — properties of final variable state
    - =vars.order_total >= 0

    # Ordering invariants — if X happened, Y must have happened first
    - "=!steps.send_confirmation.called || steps.charge_payment.called"
    - "=!steps.rollback_payment.called || steps.charge_payment.called"

    # Error invariants — if a specific error is thrown, certain state must hold
    - "=!thrown('RolledBackError') || !steps.send_confirmation.called"

    # Coverage invariants — every test must exercise at least these steps
    - =steps.validate_order.called

  tests: [...]
```

### How invariants work

1. After each test case completes (pass or fail), all invariants are evaluated
2. The invariant expression has access to: `output`, `vars`, `globals`, `context`, `steps`, `trace`, `yields`, `log`, `thrown(type)` (returns bool)
3. If any invariant evaluates to `false`, the test is marked as failed with the invariant expression and actual values
4. Invariants are evaluated even when the test case expects an error (`throws:`)

### Invariant failure output

```
✗ edge case: negative total (5ms)
  INVARIANT VIOLATION: =vars.order_total >= 0
    vars.order_total = -10.00
  This invariant must hold for all test cases.
```

### Conditional invariants

Use `when:` to apply an invariant only to tests matching a condition:

```yaml
invariants:
  - when: =!thrown()                          # only for non-error test cases
    assert: =output.result.outcome != null

  - when: =thrown('RolledBackError')              # only for saga rollback tests
    assert: =steps.charge_payment.calledTimes == 1
```

### Per-test invariant override

A test case can opt out of specific invariants by their zero-based index in the suite's `invariants:` list:

```yaml
tests:
  - name: intentionally produces invalid state for testing
    skipInvariants: [0, 2]     # skip invariants at index 0 and 2
    input: { ... }
    expect:
      throws: ValidationError
```
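
The evaluation rules, `when:` guards, and `skipInvariants:` can be combined into a single loop. The Python sketch below is a simplified model under stated assumptions: Python callables stand in for compiled CEL expressions, and the `Invariant` class and `violated_invariants` function are hypothetical names, not part of FlowMarkup.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

State = dict  # holds output, vars, steps, thrown, ... after a test case completes


@dataclass
class Invariant:
    expr: str                                   # source text, reported on violation
    check: Callable[[State], bool]              # stands in for the CEL expression
    when: Optional[Callable[[State], bool]] = None


def violated_invariants(
    invariants: List[Invariant], state: State, skip: Iterable[int] = ()
) -> List[str]:
    """Return the expressions of all invariants that evaluate to false."""
    skipped = set(skip)
    violations = []
    for index, invariant in enumerate(invariants):
        if index in skipped:
            continue  # per-test opt-out via skipInvariants
        if invariant.when is not None and not invariant.when(state):
            continue  # conditional invariant does not apply to this test
        if not invariant.check(state):
            violations.append(invariant.expr)
    return violations
```

Because the guard is evaluated before the check, an invariant such as `=output.result.outcome != null` never dereferences a missing `output` in error test cases.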

---

## 16. Contract Testing

Verify that caller/callee flow contracts are compatible without executing either flow. This is the workflow equivalent of Pact's consumer-driven contract testing.

### What contracts verify

When flow A calls flow B via `run:`, the contract test checks:

| Check | Description |
|---|---|
| **Input completeness** | A's `input:` on the `run` step provides all of B's required input params |
| **Input type compatibility** | A's input values match B's declared types |
| **Output handling** | A's `output:` mapping covers all required output params from B |
| **Error handling** | A catches (or propagates) all of B's declared `throws:` types |
| **Yield handling** | If B declares `yields:`, A has `onYield:` or `$yields` |
| **Capability compatibility** | A's `cap:` restrictions don't revoke capabilities B's `requires:` needs |
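
To make a few of these checks concrete, the Python sketch below covers input completeness, error handling, and yield handling. The dictionary shapes (`required_inputs`, `catches`, `handles_yield`) are hypothetical stand-ins for what a checker would extract from the two flow documents; type and capability checks, and error propagation, are left out.

```python
from typing import Any, Dict, List


def check_contract(caller_step: Dict[str, Any], callee: Dict[str, Any]) -> List[str]:
    """Return human-readable violations for one run: step against its target flow."""
    violations = []

    # Input completeness: every required callee input must be provided.
    missing = set(callee.get("required_inputs", ())) - set(caller_step.get("input", ()))
    for name in sorted(missing):
        violations.append(f"Input: required param '{name}' not provided")

    # Error handling: every declared error type must be caught by the caller.
    uncaught = set(callee.get("throws", ())) - set(caller_step.get("catches", ()))
    for error in sorted(uncaught):
        violations.append(f"Errors: {error} declared but not caught")

    # Yield handling: a yielding callee needs onYield: or $yields on the caller.
    if callee.get("yields") and not caller_step.get("handles_yield", False):
        violations.append("Yields: callee yields but caller does not handle them")

    return violations
```

None of these checks executes either flow; they only compare declarations, which is why contract tests can run before any deployment.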

### Declaring contracts explicitly

```yaml
flowmarkup-test:
  flow: ./orchestrator.flowmarkup.yaml

  contracts:
    - run: billing/charge-customer.flowmarkup.yaml
      # Optional: override the expected contract (useful when B is external/third-party)
      input:
        required:
          order_id: { $kind: TEXT }
          amount: { $kind: NUMBER }
      output:
        required:
          receipt: JSON
      throws:
        - PaymentDeclinedError
        - PaymentTimeoutError
```

### Auto-detected contracts

When `contracts:` is omitted, the contract checker automatically discovers all `run:` steps in the flow and resolves their target flows. For local flows, it reads the target's `input:`/`output:`/`throws:`/`yields:`/`requires:` declarations directly.

### CLI

```bash
# Verify all cross-flow contracts in the project
flowmarkup test --contracts

# Verify contracts for a specific flow
flowmarkup test --contracts order-processing.flowmarkup.yaml

# Verify contracts across all flows in a directory
flowmarkup test --contracts flows/
```

### Output

```
Contract verification: orchestrator.flowmarkup.yaml

  → billing/charge-customer.flowmarkup.yaml
    ✓ Input: all required params provided
    ✓ Output: all required params consumed
    ✗ Errors: PaymentTimeoutError declared but not caught
      orchestrator.flowmarkup.yaml:45 — run step has no catch for PaymentTimeoutError
      billing/charge-customer.flowmarkup.yaml:12 — declares: throws: [PaymentTimeoutError]
    ✓ Capabilities: all requirements satisfiable

  → notifications/send-email.flowmarkup.yaml
    ✓ Input: all required params provided
    ✓ Output: no output contract (async: true)
    ✓ Capabilities: mail granted via default inheritance

1 contract violation found.
```

### Schema evolution detection

When a callee flow changes its contract (new required input, removed output param, new error type), contract tests fail immediately, before any flow executes. This provides safety comparable to Pact's broker-based verification, entirely within the FlowMarkup ecosystem.

```bash
# In CI: run contract checks before deployment
flowmarkup test --contracts --strict   # non-zero exit on any violation
```

---

## 17. Fault Injection

Beyond per-mock `throw:` and `delay:`, fault injection profiles apply systematic failures across all actions. This is the workflow equivalent of Netflix's Chaos Monkey, Toxiproxy, and Litmus Chaos — but declarative and integrated into the test framework.

### Fault profiles

```yaml
tests:
  - name: survives 30% service failures
    input:
      order: ${{ fixtures.standard_order }}
    mocks:
      - ${{ mocks.payment_success }}
      - ${{ mocks.inventory_success }}
    faults:
      # Random failure injection
      services:
        probability: 0.3                         # 30% of service calls fail
        error: { error: TimeoutError, message: "injected timeout" }
        exclude: [payment]                       # don't fail payment (it's critical)

      # Latency injection
      latency:
        min: 100ms
        max: 2s
        exclude: []                              # apply to all actions

      # Specific step faults
      steps:
        reserve_inventory:
          failOnAttempt: 1                       # fail first attempt, succeed on retry
          error: { error: TimeoutError, message: "first attempt timeout" }
    expect:
      output:
        result: { outcome: approved }            # flow recovers via retry
    repeat: 10                                   # run 10 times, each with a different seed
```

### Fault types

| Fault | Description |
|---|---|
| `services.probability` | Random failure probability (0.0-1.0) for all service/request/exec/mail calls |
| `services.error` | Error to inject on failure |
| `services.include` / `services.exclude` | Target specific services |
| `latency.min` / `latency.max` | Random latency range added to all actions |
| `latency.include` / `latency.exclude` | Target specific services |
| `steps.<_id_>.failOnAttempt` | Fail a specific step on attempt N (1-based) |
| `steps.<_id_>.error` | Error to inject for that step |
| `steps.<_id_>.delay` | Fixed delay for that step |
| `network.partition` | Simulate network partition: all `request:` steps fail |
| `network.partition.after` | Partition occurs after this `_id_` step |
| `network.partition.duration` | Partition lasts this long (virtual time) |

### Deterministic fault injection

By default, random faults use a deterministic seed derived from the test name. This means `services.probability: 0.3` produces the same pattern of failures on every run. Use `seed:` to override:

```yaml
faults:
  seed: 42                           # explicit seed
  services:
    probability: 0.3
    error: { error: TimeoutError }
```

> Fault injection tests with `services.probability` MUST use `repeat:` with a minimum of 10 iterations. Static analysis rule SA-TEST-51 (WARN) MUST flag fault injection tests without `repeat:` or with `repeat: < 10`.
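
The following Python sketch shows why a seed derived from the test name yields a reproducible failure pattern. The derivation scheme (SHA-256 of the test name plus an iteration number) is purely an assumption for illustration; the specification does not define how the engine computes its seeds.

```python
import hashlib
import random
from typing import List


def seed_for(test_name: str, iteration: int = 0) -> int:
    """Derive a stable seed from the test name (hypothetical scheme)."""
    digest = hashlib.sha256(f"{test_name}#{iteration}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def failure_pattern(seed: int, probability: float, calls: int) -> List[bool]:
    """Decide, call by call, whether a fault is injected (True means fail)."""
    rng = random.Random(seed)  # private generator, unaffected by global state
    return [rng.random() < probability for _ in range(calls)]


# Same test name, same pattern on every run.
first = failure_pattern(seed_for("survives 30% service failures"), 0.3, 20)
second = failure_pattern(seed_for("survives 30% service failures"), 0.3, 20)
assert first == second
```

An explicit `seed:` would simply replace the value returned by `seed_for` in this model.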

### `repeat:` for statistical testing

Run a fault-injected test multiple times with different seeds to build confidence:

```yaml
tests:
  - name: resilient under chaos
    faults:
      services:
        probability: 0.2
        error: { error: TimeoutError }
    repeat: 50                        # run 50 times, each with a different seed
    expect:
      output:
        result: { outcome: approved }
    # Test passes only if ALL 50 runs pass
```

Output:

```
  resilient under chaos [x50]
    ✓ 50/50 passed (seed range: 1-50, avg: 15ms, max: 42ms)
```

### Combining faults with mocks

Mocks take precedence over faults. If a mock matches an action call, the mock response is used (no fault injection). Faults only apply to unmatched calls or calls where `spy: true` is set.
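
The precedence rule can be sketched as a small decision function. This Python model is hypothetical: the `resolve_call` name, the dictionary shapes, and the `"passthrough"` outcome for calls that neither match a mock nor draw a fault are assumptions. The random draw is passed in as `roll` so the example stays deterministic.

```python
from typing import Any, Dict, Tuple


def resolve_call(
    service: str, mocks: Dict[str, dict], fault: Dict[str, Any], roll: float
) -> Tuple[str, Any]:
    """Decide what an action call to `service` produces under mocks and faults."""
    mock = mocks.get(service)
    # A matching mock wins outright unless it is a spy.
    if mock is not None and not mock.get("spy", False):
        return ("mock", mock["return"])

    # Faults apply to unmatched calls and to spy mocks.
    excluded = service in fault.get("exclude", ())
    if not excluded and roll < fault["probability"]:
        return ("fault", fault["error"])

    return ("passthrough", None)
```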

---

## 18. Lifecycle Hooks

### `beforeAll` / `afterAll`

Run once per test suite (before/after all test cases). Useful for shared setup like seeding global state.

```yaml
flowmarkup-test:
  flow: ./my-flow.flowmarkup.yaml

  beforeAll:
    - set:
        GLOBAL.test_counter: 0

  afterAll:
    - assert: =GLOBAL.test_counter > 0

  tests: [...]
```

### `beforeEach` / `afterEach`

Run before/after each test case. Useful for per-test setup/teardown.

```yaml
flowmarkup-test:
  flow: ./my-flow.flowmarkup.yaml

  beforeEach:
    - set:
        GLOBAL.request_count: 0

  afterEach:
    - log: "='Test completed. Requests made: ' + string(GLOBAL.request_count)"

  tests: [...]
```

Hooks are step lists — they use the same step syntax as flow bodies.

**Unit mode (`mode: UNIT`):** Hooks are restricted to directives only (`set`, `log`, `assert`, etc.). Action steps (`call`, `run`, `request`, `exec`, `mail`) are forbidden (SA-TEST-16). This ensures tests are hermetic — setup/teardown cannot accidentally contact real services.

**Integration mode (`mode: INTEGRATION`):** Hooks may use action steps. This enables real-world setup/teardown scenarios: seeding a database via `call`, provisioning test resources via `request`, cleaning up via `exec`. Mocks are still respected in hooks — if a service is mocked, the hook's `call` will use the mock.

> Lifecycle hooks in INTEGRATION mode MUST be subject to capability enforcement. Hook actions MUST NOT exceed the capabilities declared by the flow under test. A separate `hookCapabilities:` declaration MAY be used to explicitly gate hook actions. When `hookCapabilities:` is not declared, hooks inherit the flow's `requires:` as their capability boundary. SA-TEST-50 (WARN) MUST flag action steps in hooks that would require capabilities not declared by the flow.

```yaml
flowmarkup-test:
  flow: ./order-processing.flowmarkup.yaml
  mode: INTEGRATION

  beforeEach:
    - call:
        service: =SERVICES.test_db
        operation: execute
        params: { action: seed, dataset: orders }
    - set:
        GLOBAL.test_run_id: "='test-run-' + string(GLOBAL.test_counter)"

  afterEach:
    - call:
        service: =SERVICES.test_db
        operation: execute
        params: { action: cleanup, run_id: =GLOBAL.test_run_id }

  tests: [...]
```

---

## 19. Test Runner

### CLI

```bash
# ─── Basic execution ───
flowmarkup test                                            # run all tests
flowmarkup test order-processing.flowmarkup-test.yaml       # run specific suite
flowmarkup test --tag premium                              # filter by tag
flowmarkup test --name "approves critical priority order"  # filter by name
flowmarkup test tests/**/*.flowmarkup-test.yaml             # filter by glob

# ─── Execution modes ───
flowmarkup test --parallel                          # parallel test cases (within and across suites)
flowmarkup test --fail-fast                         # stop on first failure
flowmarkup test --repeat 5                          # run each test 5 times (flaky detection)
flowmarkup test --dry-run                           # validate test files without executing

# ─── Inspection & debugging ───
flowmarkup test --inspect                           # step-by-step variable state + CEL evaluation
flowmarkup test --trace                             # full execution trace with mock matching
flowmarkup test --inspect --name "my test"          # inspect a specific test

# ─── Coverage ───
flowmarkup test --coverage                          # step, branch, error, mock coverage
flowmarkup test --coverage --min-step-coverage 80 --min-branch-coverage 70
flowmarkup test --coverage --suggest                # propose tests to fill coverage gaps

# ─── Snapshots ───
flowmarkup test --update-snapshots                  # accept new snapshot output
flowmarkup test --update-snapshots --interactive    # review each diff

# ─── Recording & replay ───
flowmarkup test --record                            # record real service responses as mocks
flowmarkup test --replay-all ./recordings/          # replay all recorded traces
flowmarkup test --replay-all ./recordings/ --strict # fail on any divergence

# ─── Contract testing ───
flowmarkup test --contracts                         # verify cross-flow contracts
flowmarkup test --contracts --strict                # non-zero exit on any violation

# ─── Output formats ───
flowmarkup test --reporter junit --output test-results.xml   # JUnit XML (CI)
flowmarkup test --reporter json                              # JSON
flowmarkup test --reporter tap                               # TAP (Test Anything Protocol)
flowmarkup test --verbose                                    # detailed output
```

### Output format

```
order-processing.flowmarkup-test.yaml
  ✓ approves critical priority order (12ms)
  ✓ calculates shipping for premium customer (8ms)
  ✗ rejects missing items (5ms)
    AssertionError expected but flow completed successfully
    Expected: throws AssertionError
    Actual:   output { result: { outcome: approved, notes: "..." } }
  ✓ payment failure triggers rollback (15ms)
  ⚠ large order warning threshold (9ms)
    WARN: steps.charge_payment.duration $warn: =value < duration('1s')
      actual: 1.2s
  order routing [x5 matrix] ✓✓✓✓✓ (39ms)
  resilient under chaos [x50 repeat] ✓ 50/50 (812ms)

  Invariants: 3/3 passed across all test cases

5 passed, 1 failed, 1 warning (900ms)
```

---

## 20. Concurrency & Timing

### Virtual time

The test runner uses a **virtual clock**. `wait` and `waitUntil` steps advance virtual time instantly — tests don't actually sleep. This makes timing-dependent tests deterministic and fast.

```yaml
tests:
  - name: timeout triggers after 30s
    mocks:
      - service: slow_api
        delay: 60s                       # virtual: 60s delay
    expect:
      throws: TimeoutError
      # Test completes in milliseconds despite 60s virtual delay
```
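
A virtual clock can be modeled as a priority queue of timers whose "current time" jumps straight to the next deadline. The Python sketch below is a generic illustration of the technique, not the runner's implementation; the `VirtualClock` class and its methods are hypothetical.

```python
import heapq
import itertools
from typing import Callable, List, Tuple


class VirtualClock:
    """Runs scheduled callbacks in deadline order without real sleeping."""

    def __init__(self) -> None:
        self.now = 0.0                                  # virtual seconds
        self._timers: List[Tuple[float, int, Callable[[], None]]] = []
        self._order = itertools.count()                 # tie-breaker for equal deadlines

    def call_at(self, when: float, callback: Callable[[], None]) -> None:
        heapq.heappush(self._timers, (when, next(self._order), callback))

    def run(self) -> None:
        while self._timers:
            when, _, callback = heapq.heappop(self._timers)
            self.now = max(self.now, when)              # jump instead of sleeping
            callback()


clock = VirtualClock()
fired = []
clock.call_at(60.0, lambda: fired.append(("mock response", clock.now)))
clock.call_at(30.0, lambda: fired.append(("timeout", clock.now)))
clock.run()
assert fired == [("timeout", 30.0), ("mock response", 60.0)]
```

The 30-second timeout fires before the 60-second mock delay elapses, which is the ordering the test case above relies on.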

### Testing parallel branches

```yaml
tests:
  - name: both branches complete in parallel group
    mocks:
      - service: hotel
        delay: 2s
        return: { confirmation: "H-001" }
      - service: airline
        delay: 3s
        return: { ticket: "F-001" }
    expect:
      output:
        hotel_confirmation: "H-001"
        flight_ticket: "F-001"
      steps:
        book_hotel:
          duration: =value == duration('2s')    # virtual time
        book_flight:
          duration: =value == duration('3s')
```

### Testing race conditions

The test runner is deterministic: parallel branches execute in declaration order by default. To test a specific interleaving, use `schedule:`:

```yaml
tests:
  - name: first branch wins the race
    schedule:
      book_hotel: { completeAt: 1s }
      book_flight: { completeAt: 2s }
    expect:
      output:
        winner: hotel

  - name: second branch wins the race
    schedule:
      book_hotel: { completeAt: 3s }
      book_flight: { completeAt: 1s }
    expect:
      output:
        winner: flight
```

### Testing events

```yaml
tests:
  - name: processes event when received
    events:
      # Inject events into the event bus at specific virtual times
      - event: order_confirmed
        at: 2s
        data: { order_id: "ORD-001" }
    expect:
      vars:
        confirmation_received: true

  - name: event timeout fires correctly
    events: []                           # no events injected
    expect:
      throws: TimeoutError              # waitFor times out

  - name: waitFor condition filters events
    events:
      - event: order_update
        at: 1s
        data: { status: pending, order_id: "ORD-001" }     # doesn't match condition
      - event: order_update
        at: 3s
        data: { status: completed, order_id: "ORD-001" }   # matches condition
    expect:
      vars:
        captured_status: completed       # waitFor captured the second event
```

### Testing event emission

Assert that a flow correctly emits events using `expect.events:`:

```yaml
tests:
  - name: emits order_confirmed event with correct data
    input:
      order: ${{ fixtures.standard_order }}
    mocks:
      - ${{ mocks.payment_success }}
    expect:
      events:
        count: 1
        messages:
          - event: order_confirmed
            data:
              order_id: "ORD-1001"
              status: approved
            scope: LOCAL

  - name: emits no events on validation failure
    input:
      order: { id: "ORD-BAD", items: [] }
    expect:
      throws: AssertionError
      events:
        emitted: false
```

### Testing trigger-activated flows

Flows declared with `triggers: [event: ...]` are activated by external events. To test trigger behavior, use the `trigger:` test case key to simulate a trigger-initiated invocation instead of direct `input:`:

```yaml
tests:
  - name: trigger activates flow with matching event
    trigger:
      event: order_submitted
      data:
        order_id: "ORD-1001"
        priority: rush
        total: 150.00
    mocks:
      - ${{ mocks.payment_success }}
    expect:
      output:
        result: { outcome: approved }

  - name: trigger condition filters non-matching events
    trigger:
      event: order_submitted
      data:
        order_id: "ORD-1002"
        priority: normal                  # does not match condition: =EVENT.DATA.priority == 'rush'
    expect:
      skipped: true                       # flow was not activated (trigger condition false)
```

**`trigger:` test case key:**

| Key | Type | Description |
|---|---|---|
| `trigger.event` | string | Event type name matching one of the flow's `triggers:` entries |
| `trigger.data` | map | Event data payload — mapped to flow `input:` by name |

**How `trigger:` works:**

1. Engine looks up the matching `triggers:` entry in the flow's declaration
2. Binds `EVENT.TYPE`, `EVENT.DATA`, `EVENT.SOURCE` (source defaults to `"test"`)
3. Evaluates the trigger's `condition:` (if present). If it is false, the flow is not invoked and the result is reported as `skipped: true`
4. Maps `EVENT.DATA.*` fields to flow `input:` parameters by name
5. Invokes the flow normally with the mapped input
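
The five steps above can be approximated as follows. This Python sketch is hypothetical throughout: `activate`, the list-of-dicts trigger representation, and the `LookupError` for an unknown event are assumptions, and a Python callable stands in for the trigger's CEL `condition:`.

```python
from typing import Any, Callable, Dict, List, Optional


def activate(
    triggers: List[Dict[str, Any]],
    flow_inputs: List[str],
    event: str,
    data: Dict[str, Any],
    source: str = "test",
) -> Dict[str, Any]:
    # Step 1: find the matching triggers: entry.
    entry = next((t for t in triggers if t["event"] == event), None)
    if entry is None:
        raise LookupError(f"flow declares no trigger for event '{event}'")  # assumption

    # Step 2: bind the EVENT namespace.
    event_ctx = {"TYPE": event, "DATA": data, "SOURCE": source}

    # Step 3: a false condition means the flow is not activated.
    condition: Optional[Callable[[dict], bool]] = entry.get("condition")
    if condition is not None and not condition(event_ctx):
        return {"skipped": True}

    # Step 4: map EVENT.DATA fields to flow inputs by name.
    mapped = {name: data[name] for name in flow_inputs if name in data}

    # Step 5: the caller would now invoke the flow with the mapped input.
    return {"skipped": False, "input": mapped}
```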

**`trigger:` is mutually exclusive with `input:`.** A test case with `trigger:` must not also have `input:` (SA-TEST-41). When `trigger:` is absent, the flow is invoked directly via `input:` regardless of any `triggers:` declaration.

**`expect.skipped:`** — When a trigger condition evaluates to `false`, the flow is not activated. Assert `skipped: true` to verify the trigger correctly rejected the event.

### Testing lock contention

```yaml
tests:
  - name: parallel writers serialize via lock
    schedule:
      writer_a: { startAt: 0ms }
      writer_b: { startAt: 0ms }        # both start simultaneously
    expect:
      steps:
        writer_a:
          calledBefore: writer_b         # or vice versa — lock serializes
      vars:
        counter: 2                       # both writers increment
```

**Lock behavior by test mode:**

| Mode | Lock behavior |
|---|---|
| `UNIT` | Locks are always acquired immediately (no contention). The `lock:` directive succeeds without delay. This ensures deterministic test behavior. |
| `INTEGRATION` | Locks use the real distributed lock provider. Contention is real — use `schedule:` and `timeout:` to control timing. |

**Lock timeout testing:**

```yaml
tests:
  - name: lock timeout fires when resource unavailable
    schedule:
      holder: { startAt: 0ms, completeAt: 60s }    # holds lock for 60s
      waiter: { startAt: 1ms }                       # tries to acquire same lock
    expect:
      steps:
        waiter:
          timedOut: true                              # lock timeout fired
      throws: TimeoutError

  - name: lock acquisition succeeds after holder releases
    schedule:
      holder: { startAt: 0ms, completeAt: 2s }
      waiter: { startAt: 1ms }                       # waits, then acquires
    expect:
      steps:
        waiter:
          called: true
          duration: =value >= duration('2s')          # waited for lock release
```

### Testing idempotency

Test the flow-level `idempotencyKey:` deduplication behavior.

```yaml
tests:
  - name: duplicate invocation rejected
    input:
      order_id: "ORD-1001"
      amount: 150.00
    # Simulate a prior invocation by pre-seeding the idempotency store
    idempotency:
      preSeed:                                        # pre-seed the idempotency store
        - key: "payment:ORD-1001:150.00"              # key that matches the flow's idempotencyKey
          result: { tx_id: "TX-ORIGINAL" }             # original invocation's result
    expect:
      throws: DuplicateInvocationError

  - name: different key passes
    input:
      order_id: "ORD-1002"
      amount: 200.00
    idempotency:
      preSeed:
        - key: "payment:ORD-1001:150.00"              # different key → no collision
    mocks:
      - ${{ mocks.payment_success }}
    expect:
      output:
        result: { tx_id: "TX-9999" }                  # new invocation succeeds

  - name: idempotency key evaluates correctly
    input:
      order_id: "ORD-1003"
      amount: 75.50
    idempotency:
      assertKey: "payment:ORD-1003:75.5"              # verify CEL evaluation of idempotencyKey
    mocks:
      - ${{ mocks.payment_success }}
    expect:
      output:
        result: { tx_id: "TX-9999" }
```

**`idempotency:` test case keys:**

| Key | Type | Description |
|---|---|---|
| `preSeed` | list | Pre-seed the idempotency store with `{ key, result }` entries |
| `assertKey` | string/CEL | Assert the evaluated idempotency key value |

---

## 21. Coverage

The `--coverage` flag tracks which steps, branches, declared errors, and mocks were exercised across all test cases.

### Coverage report

```
order-processing.flowmarkup.yaml
  Steps:    18/22 (81.8%)
  Branches: 4/7  (57.1%)
  Errors:   2/4  (50.0%)
  Mocks:    6/6  (100.0%)

  Uncovered:
    line 42: elseIf condition (priority == 'low')          [BRANCH]
    line 55: catch PaymentDeclinedError                    [BRANCH]
    line 61: forEach body (empty items case)               [STEP]
    line 78: else branch (non-enterprise, normal priority) [BRANCH]
    line 12: throws PaymentTimeoutError — never thrown     [ERROR]
    line 13: throws InventoryError — never thrown          [ERROR]
```

### Coverage metrics

| Metric | Description |
|---|---|
| **Step coverage** | Percentage of steps (by `_id_` or position) that were executed at least once |
| **Branch coverage** | Percentage of conditional branches (`if.then`, `if.else`, `elseIf`, `switch.cases`, `switch.default`, `catch` clauses) entered |
| **Error coverage** | Percentage of declared `throws:` error types that were actually thrown and caught/asserted |
| **Mock coverage** | Percentage of registered mocks that were actually matched (warns on unused mocks) |
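
Error coverage counts a declared type only when a test causes it to be thrown and then caught or asserted. As an illustration, a test along the following lines would close the `PaymentTimeoutError` gap from the sample report. It is a sketch: the service name `payment` and the assumption that the flow lets this error propagate to the caller are not taken from the flow itself.

```yaml
tests:
  - name: payment timeout surfaces to caller
    # input: supply whatever the flow's input contract requires
    mocks:
      - service: payment                     # assumed service name
        throw: { error: PaymentTimeoutError, message: "timed out" }
    expect:
      throws: PaymentTimeoutError            # asserting the error is what counts toward error coverage
```

If the flow catches the error instead of propagating it, the same mock still exercises the error path, and the assertion would move to `output:` or `vars:`.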

### Coverage enforcement

```bash
# Fail if coverage drops below thresholds
flowmarkup test --coverage --min-step-coverage 80 --min-branch-coverage 70
```

### Test suggestions (`--suggest`)

The `--suggest` flag analyzes uncovered branches and proposes test cases that would cover them. This capability is built into the test runner; no external tooling is required.

```bash
flowmarkup test --coverage --suggest
```

Output:

```
Coverage: 57.1% branch coverage (target: 80%)

Suggested test cases to improve coverage:

  1. Cover: elseIf (priority == 'low') at line 42
     Suggested input:
       priority: "low"
       order: { id: "ORD-001", total: 50.00, items: [...] }
     Expected path: validate_order → low_priority_handler → ...

  2. Cover: catch PaymentDeclinedError at line 55
     Suggested mock:
       - service: payment
         throw: { error: PaymentDeclinedError, message: "Declined" }
     Expected: throws PaymentDeclinedError (or caught and handled)

  3. Cover: else branch at line 78 (non-enterprise customer, normal priority)
     Suggested input:
       priority: "normal"
       order: { ..., customer: { tier: "standard" } }
     Expected path: → default → else → ...

Add --suggest-write to generate test YAML and append to the test file.
```

```bash
# Auto-generate and append suggested tests
flowmarkup test --coverage --suggest-write
```

The generated tests are appended as commented-out YAML with `# SUGGESTED` markers for human review:

```yaml
  # SUGGESTED — covers elseIf (priority == 'low') at line 42
  # - name: low priority order routing
  #   input:
  #     order: { id: "ORD-001", total: 50.00, ... }
  #     priority: low
  #   mocks:
  #     - service: payment
  #       return: { tx_id: "TX-001" }
  #   expect:
  #     output:
  #       result: { outcome: approved }
```
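
After review, a suggestion is enabled by removing the comment prefixes and filling in anything the generator elided. A sketch of what the reviewed version of the suggestion above might look like, assuming the runner gives the `# SUGGESTED` marker no special treatment once the test is uncommented:

```yaml
tests:
  # Reviewed and enabled (covers elseIf (priority == 'low') at line 42)
  - name: low priority order routing
    input:
      order: { id: "ORD-001", total: 50.00 }   # add the fields the suggestion elided
      priority: low
    mocks:
      - service: payment
        return: { tx_id: "TX-001" }
    expect:
      output:
        result: { outcome: approved }          # confirm against the flow's real low-priority outcome
```

The expected output in a generated suggestion is a guess, so it is worth checking against the flow before the test is committed.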

---

## 22. SA Rules

### SA-TEST rules (static analysis for test files)

The canonical SA-TEST rules table is maintained in [FLOWMARKUP-VALIDATION.md §1.58](FLOWMARKUP-VALIDATION.md). Refer to that document for the full, authoritative list of SA-TEST rules with descriptions and severity levels.

---

## 23. Common Mistakes

```yaml
# WRONG — mocking a directive step (set, if, log are not mockable)
mocks:
  - step: calculate_total    # ✗ directive steps are not mockable
    return: 500.00

# CORRECT — control the step's inputs via vars: or upstream mocks
vars:
  item_prices: [100.00, 200.00, 200.00]
```

```yaml
# WRONG — expect.output without flow having output: contract
expect:
  output:
    result: "done"    # flow has no output: — assertion meaningless

# CORRECT — assert on vars: if the flow doesn't declare output:
expect:
  vars:
    status: done
```

```yaml
# WRONG — using real SECRET.* values in tests
secrets:
  DB_PASSWORD: "pr0d_p@ssw0rd!"    # ✗ never commit real secrets

# CORRECT — use short synthetic test values (20 characters or fewer avoids SA-TEST-52)
secrets:
  DB_PASSWORD: "test-db-password"
```

> **Test secret plain text warning.** Test secret values in YAML source files are stored in plain text. The engine MUST emit SA-TEST-52 (WARN) when test secret values exceed 20 characters or match credential patterns (JWT format, base64-encoded strings > 32 chars, connection string patterns), as these patterns suggest real credentials may have been accidentally committed. CI/CD pipelines SHOULD use environment variable injection for test secrets rather than embedding them in YAML files.

> **Test secret parity.** Test-injected secrets (via the `secrets:` key in test definitions) MUST behave identically to production secrets at the type system level:
> 1. Test secrets MUST be wrapped in `SecretValue` opaque handles — they MUST NOT be plain strings accessible via CEL string operations.
> 2. Test secret redaction MUST be active in all test modes. Logs and error messages produced during test execution MUST redact test secrets using the same redaction pipeline as production.
> 3. Taint tracking MUST be active for test secrets. Variables derived from test secrets MUST carry `$secret: true` taint.
> 4. The `$declassify` annotation MUST function identically in test and production modes.
>
> This ensures that security-sensitive flows tested with synthetic secrets accurately reflect their production behavior. Tests that pass with plaintext test secrets but fail with opaque SecretValues indicate a flow that improperly handles secrets. *(CWE-200: Exposure of Sensitive Information to an Unauthorized Actor)*

> **Test secret provider security contracts.** Test secret providers MUST implement the same security contracts as production providers: constant-time `has()` checks (to prevent timing-based secret enumeration), rate limiting (to prevent brute-force extraction), taint propagation (all values retrieved from test secret providers MUST carry `$secret: true` taint), and audit logging (all secret access during test execution MUST be logged with the same granularity as production). Test-only secrets MUST be clearly distinguished from production secrets — engines MUST enforce a naming convention (e.g., a `test-` prefix) or use a separate provider namespace (e.g., `test/SECRET.*` vs `SECRET.*`). Engines MUST NOT allow test secret providers to be registered in production deployments; attempting to register a test secret provider when any production indicator is detected (see integration mode production safeguard above) MUST result in a fatal `ConfigurationError`. *(CWE-522: Insufficiently Protected Credentials)*

```yaml
# WRONG — asserting on mock internals instead of flow behavior
expect:
  steps:
    charge_payment:
      calledWith:
        internal_field: "..."    # testing the mock, not the flow

# CORRECT — assert on flow outputs and state changes
expect:
  output:
    receipt: { status: charged }
  vars:
    payment_complete: true
```

```yaml
# WRONG — overly broad mock catches unrelated calls
mocks:
  - service: "*"            # ✗ matches everything, hides bugs
    return: { ok: true }

# CORRECT — mock each service explicitly
mocks:
  - service: payment
    return: { tx_id: "TX-001" }
  - service: inventory
    return: { in_stock: true }
```

```yaml
# WRONG — test depends on execution order of parallel branches
expect:
  vars:
    last_completed: hotel    # ✗ non-deterministic in real engine

# CORRECT — use schedule: for deterministic parallel testing
schedule:
  book_hotel: { completeAt: 1s }
  book_flight: { completeAt: 2s }
expect:
  vars:
    last_completed: flight
```

```yaml
# WRONG — testing onVersionChange with focus:
tests:
  - name: migration handler works
    focus: onVersionChange       # ✗ focus: targets flow body steps, not lifecycle hooks

# CORRECT — use the migration: test property
tests:
  - name: migrates from v1 to v2
    migration:
      from: 1
      to: 2
      checkpoint:
        stepCursor: "process_order"
    expect:
      output:
        result: { status: migrated }
```

```yaml
# WRONG — sequence mock with no awareness of call count
mocks:
  - service: flaky_api
    sequence:
      - throw: { error: TimeoutError, message: "timeout" }
      # only one entry — what happens on second call?

# CORRECT — sequence repeats last entry (sticky tail), but be explicit
mocks:
  - service: flaky_api
    sequence:
      - throw: { error: TimeoutError, message: "timeout" }
      - throw: { error: TimeoutError, message: "timeout" }
      - return: { data: "success" }    # third attempt succeeds
```

```yaml
# WRONG — matrix rows with inconsistent keys
tests:
  - name: order routing
    matrix:
      - { priority: critical, total: 50, expected: approved }
      - { priority: high, amount: 50 }    # ✗ 'amount' not 'total'; missing 'expected'

# CORRECT — all rows have the same keys
tests:
  - name: order routing
    matrix:
      - { priority: critical, total: 50, expected: approved }
      - { priority: high, total: 50, expected: approved }
```

```yaml
# WRONG — replay file from a different flow
tests:
  - name: version compat
    replay: ./recordings/different-flow.trace.yaml    # ✗ wrong flow
    expect:
      output: { ... }

# CORRECT — replay file must match the flow under test
tests:
  - name: version compat
    replay: ./recordings/order-processing-v1.trace.yaml
    expect:
      output: { ... }
```

```yaml
# WRONG — invariant that depends on test-specific state
invariants:
  - =output.result.order_id == 'ORD-1001'    # ✗ only true for one fixture

# CORRECT — invariants should be universal properties
invariants:
  - =output == null || output.result.order_id != null
  - =output == null || output.result.order_id.startsWith('ORD-')
```

```yaml
# WRONG — breakpoint after a step that might not execute
breakpoints:
  - after: optional_step        # ✗ if condition is false, breakpoint hangs

# CORRECT — use breakpoints on steps that always execute, or add condition
breakpoints:
  - after: optional_step
    when: =steps.optional_step.called    # only break if step executed
```

```yaml
# WRONG — fault injection with no retry configured in the flow
faults:
  services:
    probability: 0.5
    error: { error: TimeoutError }
# Flow has no retry: — every failure is fatal. Test almost always fails.

# CORRECT — fault injection tests resilience; the flow must HAVE resilience
# Verify the flow has retry: or try/catch, then inject faults
faults:
  services:
    probability: 0.3
    error: { error: TimeoutError }
repeat: 20                    # statistical confidence
expect:
  output:
    result: { outcome: approved }    # flow recovers
```
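
How much confidence `repeat: 20` buys depends on how likely a single run is to exhaust its retries. The following is a rough worked example under assumptions this document does not state: one service call per run, a retry policy allowing r = 3 retries, independent faults with probability p = 0.3, and a test that passes only if all 20 runs pass.

```latex
\[
\begin{aligned}
P(\text{one run fails}) &= p^{\,r+1} = 0.3^{4} = 0.0081 \\
P(\text{at least one of 20 runs fails}) &= 1 - (1 - 0.0081)^{20} \approx 0.15
\end{aligned}
\]
```

Under these assumptions the test would still fail about 15% of the time, so the fault `probability` and the flow's retry budget need to be chosen together.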

```yaml
# WRONG — stateful mock with no transition to terminal state
mocks:
  - service: order_tracker
    scenario: lifecycle
    states:
      - state: STARTED
        operation: get_status
        return: { status: pending }
        nextState: STARTED              # ✗ infinite loop — never progresses

# CORRECT — ensure the scenario reaches a terminal state
mocks:
  - service: order_tracker
    scenario: lifecycle
    states:
      - state: STARTED
        operation: get_status
        return: { status: pending }
        nextState: PROCESSING
      - state: PROCESSING
        operation: get_status
        return: { status: complete }
        # terminal — no nextState
```

```yaml
# WRONG — using spy: true in UNIT mode
mode: UNIT
mocks:
  - service: payment
    spy: true                        # ✗ no real service in UNIT mode

# CORRECT — spy is for INTEGRATION mode only
mode: INTEGRATION
mocks:
  - service: payment
    spy: true                        # ✓ passthrough to real service, record calls
```

```yaml
# WRONG — no resources injected when flow requires them (UNIT mode)
flowmarkup-test:
  flow: ./resource-dependent.flowmarkup.yaml   # requires: { RESOURCES: { config: ... } }
  mode: UNIT
  tests:
    - name: test without resources
      input: { ... }
      # ✗ throws UnmockedResourceError at runtime

# CORRECT — inject required resources
flowmarkup-test:
  flow: ./resource-dependent.flowmarkup.yaml
  mode: UNIT
  resources:
    config:
      content: '{"key": "test-value"}'
  tests:
    - name: test with injected resource
      input: { ... }
```

```yaml
# WRONG — testing circuit breaker without enough failures to trip
tests:
  - name: circuit breaker test
    mocks:
      - service: api
        throw: { error: TimeoutError }
    expect:
      steps:
        api_call:
          circuitBreakerTripped: true   # ✗ might not trip if threshold > calledTimes

# CORRECT — ensure enough failures to reach the breaker's threshold (5 in this example)
tests:
  - name: circuit breaker trips after threshold
    mocks:
      - service: api
        sequence:
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }   # 5 failures = threshold
    expect:
      steps:
        api_call:
          calledTimes: 5
          circuitBreakerTripped: true
```

```yaml
# WRONG — asserting retryAttempts without checking calledTimes
tests:
  - name: retry test
    expect:
      steps:
        api_call:
          retryAttempts: 3             # ✗ confusing: is this retries or total calls?

# CORRECT — assert both for clarity
tests:
  - name: retry test
    expect:
      steps:
        api_call:
          calledTimes: 4               # initial + 3 retries = 4 total calls
          retryAttempts: 3             # 3 retries (excludes initial call)
```
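
The relationship is `calledTimes = 1 + retryAttempts`. To make both numbers deterministic, pair the assertions with a sequence mock that fails a known number of times. The sketch below assumes the `api_call` step calls a service named `api` and that the flow's retry policy permits at least three retries on `TimeoutError`.

```yaml
tests:
  - name: succeeds on the fourth attempt
    mocks:
      - service: api
        sequence:
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - throw: { error: TimeoutError }
          - return: { data: "success" }   # sticky tail: any later call also gets this
    expect:
      steps:
        api_call:
          calledTimes: 4      # 1 initial call + 3 retries
          retryAttempts: 3    # retries only
```

If the flow allows fewer than three retries, the step fails before the mock reaches its `return:` entry and both counts come out lower.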
