Build this when an AI agent should inspect evidence, apply rules, summarize findings, or recommend a decision without becoming the final source of accountability. Agent evaluations in Qi are not ordinary model evaluations. You are not only asking whether an answer is fluent, correct, or useful. You are checking whether an agent acted within delegated authority, used the right evidence, applied the right rubric, and produced a verifiable decision record that people, services, regulators, or communities can inspect.
Treat the model response as a working output. Treat Claims, evidence references, UCAN delegations, Flow state, and UDID records as the accountable system of record.

The problem

Teams need agents to help with evidence review, claim processing, decision support, fulfillment checks, and workflow routing. Ordinary agents create several risks:
  • they may act outside their authority
  • they may use stale or incomplete context
  • they may confuse submitted evidence with verified state
  • they may summarize without citing sources
  • they may recommend actions that the workflow does not allow
  • they may produce outputs that cannot be replayed, challenged, or audited
  • they may trigger payments, credentials, or state changes before a valid determination exists
In Qi, the Flow engine solves this by making evaluation part of a governed state machine. An agent evaluation should answer four questions, each backed by a Qi mechanism:
  • Did the agent act within delegated authority? Qi mechanism: UCAN delegation and capability checks
  • Did it use the right evidence and current state? Qi mechanism: Claims, entities, evidence, and Flow state
  • Did it apply the right rules? Qi mechanism: Protocol, rubric, tools, checks, and citations
  • Did it produce a verifiable decision record? Qi mechanism: Evaluation Claim and UDID record

What you build

You build a Qi Flow that evaluates agent work or agent-assisted reviews against verified context. The Flow should:
  1. receive a Claim, task output, or Flow event
  2. check the agent’s UCAN authority
  3. resolve the current state of the relevant entities, Claims, credentials, and evidence
  4. run the evaluation against a declared rubric
  5. produce a structured Evaluation Claim
  6. produce or update a UDID when a decision and impact determination is ready
  7. route the result to a human, service, payment step, credential step, dispute path, or next Flow state
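The steps above can be sketched as one evaluation pass. This is an illustrative Python sketch, not canonical Qi SDK code; the function name, the simplified rubric logic, and the returned state labels are assumptions for this example.

```python
def run_evaluation(claim: dict, ucan: dict, rubric: dict) -> dict:
    """One governed evaluation pass: gate, resolve, evaluate, route."""
    # Step 2: check the agent's UCAN authority for this claim type.
    if claim["claimType"] not in ucan["constraints"]["claimTypes"]:
        return {"state": "unauthorized"}
    # Step 3: resolve evidence and flag anything the rubric requires
    # that the Claim did not link.
    linked = {e["type"] for e in claim.get("evidence", [])}
    missing = [t for t in rubric["requiredEvidence"] if t not in linked]
    if missing:
        return {"state": "insufficient_evidence", "missing": missing}
    # Steps 4-6: run the rubric (trivially here: all required evidence
    # present) and emit a structured Evaluation Claim.
    evaluation = {
        "type": "AgentEvaluationClaim",
        "subjectClaimId": claim["claimId"],
        "evidenceRefs": sorted(linked),
        "recommendation": "human_review",  # propose, never finalize
    }
    # Step 7: the Flow, not the agent, decides the actual transition.
    return {"state": "evaluated", "evaluation": evaluation}
```

Note that the agent's output is always a proposal; the UDID step and routing remain with the Flow.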

Core pattern

  1. A Claim or task enters evaluation. A participant, service, device, or agent submits a Claim, completes a task, or proposes a state transition inside a Qi Flow.
  2. The Flow opens an evaluation context. The Flow instance defines the subject, claim type, protocol, rubric version, allowed evidence sources, current state snapshot, and decision boundary.
  3. UCAN authority is checked. The Flow verifies that the agent has the required object-capability delegation for this exact action, resource, claim type, tool, time window, and Flow instance.
  4. The agent retrieves permitted context. The agent may only inspect Claims, evidence, rooms, graph state, tools, and records that are inside its UCAN scope.
  5. The agent applies the rubric. The agent checks the Claim against required fields, evidence rules, protocol constraints, scoring thresholds, disqualifiers, and escalation rules.
  6. The agent emits an Evaluation Claim. The result is written as structured data with cited evidence references, applied checks, confidence, recommendation, limitations, and proof of the agent’s authority.
  7. The Flow issues a UDID when ready. A UDID records the decision and impact determination: what was decided, why, under which authority, with which evidence, and what state or value changed.
  8. Humans or services act on the result. The Flow routes the evaluation to approval, rejection, dispute, settlement, credential issuance, state update, or a request for more evidence.

Key concepts

UCAN delegation: Delegates what an agent may do, on which resource, under which constraints.
Claim: Records something asserted or performed, including agent work, evidence submission, review, or evaluation.
Evaluation Claim: Records how a Claim, task, or proposed state transition was evaluated.
UDID: Records the final decision and impact determination produced from one or more evaluations.
Flow: Defines where the evaluation happens and which transitions are allowed.
Rubric: Defines the rules, scoring, thresholds, disqualifiers, and escalation conditions.
Evidence: Links the evaluation to documents, measurements, observations, attestations, media, reports, sensor records, or external records.
Agentic Oracle: Performs specialized review, evidence analysis, scoring, prediction, recommendation, or verification support.

Start with one evaluation

Do not begin with a fully autonomous approval workflow. Start with one narrow review task where the agent can recommend, but not finalize, the outcome. Good first tasks:
  • check whether a Claim is complete
  • classify evidence by type and relevance
  • detect missing required fields
  • compare submitted evidence against protocol requirements
  • summarize conflicting evidence
  • score one rubric section
  • recommend whether a human verifier should approve, reject, dispute, or request more evidence
Avoid first tasks where the agent can directly release funds, issue credentials, update high-value state, or approve irreversible outcomes.

Design the evaluation Flow

Use this minimum Flow shape:
  1. Intake. Purpose: Claim or agent task output has entered the Flow. Exit condition: Claim exists and has a subject.
  2. Authority check. Purpose: Verify UCAN delegation and credentials. Exit condition: Agent has required capabilities.
  3. Context resolution. Purpose: Retrieve current graph state, evidence, and protocol rules. Exit condition: All required context references are loaded or marked missing.
  4. Evaluation. Purpose: Run the agent evaluation against the rubric. Exit condition: Structured Evaluation Claim is produced.
  5. Human review. Purpose: Human or authorized verifier reviews the recommendation. Exit condition: Reviewer accepts, rejects, disputes, or requests more evidence.
  6. Determination. Purpose: UDID is created or updated. Exit condition: Decision and impact determination is signed or recorded.
  7. Action. Purpose: Allowed state transition, payment, credential, or next Flow is triggered. Exit condition: Action result is recorded.
  8. Complete. Purpose: Evaluation loop is complete. Exit condition: Audit trail is inspectable.
Add explicit failure states:
  • unauthorized: the agent does not have the required UCAN capability
  • insufficient evidence: required evidence is missing, stale, inaccessible, or invalid
  • rule violation: the Claim fails a hard protocol rule
  • evidence conflict: evidence sources conflict or cannot be reconciled
  • human escalation: the agent is uncertain or the rubric requires human judgment
  • disputed: a participant challenges the evaluation or determination
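A Flow shape with these failure states can be sketched as a small transition table. The state names here are illustrative assumptions, and the enforcement helper is a sketch, not the Qi Flow engine API.

```python
# Which transitions each state allows; anything else is rejected.
ALLOWED = {
    "intake":          {"authority_check"},
    "authority_check": {"context_load", "unauthorized"},
    "context_load":    {"evaluation", "insufficient_evidence"},
    "evaluation":      {"human_review", "rule_violation",
                        "evidence_conflict", "human_escalation"},
    "human_review":    {"determination", "disputed",
                        "insufficient_evidence"},
    "determination":   {"action"},
    "action":          {"complete"},
}

def apply_transition(current: str, proposed: str) -> str:
    # The Flow engine, not the agent, enforces the table: an agent can
    # only propose a transition, and an illegal proposal is rejected.
    if proposed not in ALLOWED.get(current, set()):
        raise ValueError(f"transition {current} -> {proposed} not allowed")
    return proposed
```

The point of the table is that every failure path is a first-class state, so an evaluation that goes wrong still leaves an inspectable record.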

Configure UCAN authority

A UCAN should be scoped to the smallest useful evaluation task. Define:
  • issuer: the human, organization, POD, or service delegating authority
  • audience: the agent or Agentic Oracle DID receiving the authority
  • resource: the POD, Flow instance, Claim Collection, Claim, entity, evidence set, room, or tool
  • capabilities: the exact actions the agent may perform
  • constraints: limits on claim type, time, budget, tool use, output type, state transition, and approval power
  • expiry: when the delegation ends
  • revocation path: how the delegation can be suspended or revoked
  • proof chain: how the Flow verifies the delegation
A useful first UCAN capability set:
  • read one submitted Claim and its metadata
  • read only evidence linked to the Claim
  • read only entities referenced by the Claim
  • read the active rubric and protocol version
  • create an Evaluation Claim
  • propose a Flow transition, not execute it
  • send a structured finding to the review room
Avoid granting these at first:
  • payment release: value movement should require a UDID and stronger approval
  • credential issuance: this should require human or protocol-controlled determination
  • direct state updates: these can bypass review
  • approval power: approval should be separate from recommendation
  • rubric modification: agents should not modify the rules they are evaluated against

Example UCAN design shape

Use this as a design shape, then map it to the canonical Qi and IXO SDK fields used in your implementation.
{
  "ucan": {
    "issuer": "did:ixo:pod:review-board",
    "audience": "did:ixo:oracle:evidence-reviewer",
    "resource": {
      "pod": "did:ixo:pod:clean-cooking-program",
      "flow": "flow:claim-review:v1",
      "claimCollection": "claims:stove-usage:v1"
    },
    "capabilities": [
      "claim.read",
      "evidence.read",
      "entity.read",
      "rubric.read",
      "evaluation.create",
      "state.propose"
    ],
    "constraints": {
      "claimTypes": ["stove_usage"],
      "maxEvidenceItems": 50,
      "allowedTools": ["ixo.graph.query", "evidence.hash.verify", "rubric.evaluate"],
      "mayApprove": false,
      "mayReleasePayment": false,
      "expiresAt": "2026-06-30T23:59:59Z"
    }
  }
}
The safest default is propose-only. Let the agent create an Evaluation Claim and propose the next Flow state. Let the Flow, verifier, or protocol decide whether the proposal becomes a UDID-backed determination.
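A propose-only gate over the design shape above can be sketched as follows. The field names match the example JSON; the function itself is an assumption for this sketch, so map it to the canonical capability check in your Qi or IXO SDK stack.

```python
from datetime import datetime, timezone

def ucan_permits(ucan: dict, action: str, claim_type: str,
                 now: datetime) -> bool:
    """Check one requested action against the UCAN's scope."""
    c = ucan["constraints"]
    expires = datetime.fromisoformat(c["expiresAt"].replace("Z", "+00:00"))
    if now > expires:
        return False                       # delegation has expired
    if claim_type not in c["claimTypes"]:
        return False                       # claim type out of scope
    if action == "state.execute":
        return False                       # propose-only: never execute
    return action in ucan["capabilities"]  # exact capability match
```

Revocation and proof-chain verification are deliberately out of scope here; in production they belong in the same gate.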

Define the Claim under review

Each evaluation should start with a clear Claim. Minimum Claim inputs:
  • claimId: unique identifier for the Claim
  • claimType: type of Claim being reviewed
  • issuer: DID of the person, organization, service, device, or agent that made the Claim
  • subject: entity, asset, project, person, device, service, or outcome the Claim is about
  • data: structured submitted data
  • evidence: linked evidence references
  • cryptographic proof, signature, hash, attestation, or provenance record
  • submission time
  • protocolId: protocol or Blueprint used to evaluate the Claim
  • flowId: Qi Flow instance handling the review
Example:
{
  "claimId": "claim:stove-usage:000123",
  "claimType": "stove_usage",
  "issuer": "did:ixo:org:field-operator",
  "subject": "did:ixo:device:stove-7781",
  "data": {
    "period": "2026-04",
    "reportedBurnHours": 182,
    "householdId": "entity:household:442"
  },
  "evidence": [
    {
      "type": "deviceTelemetry",
      "uri": "ipfs://...",
      "hash": "bafy..."
    },
    {
      "type": "fieldVisitReport",
      "uri": "ipfs://...",
      "hash": "bafy..."
    }
  ],
  "protocolId": "blueprint:clean-cooking-mrv:v1",
  "flowId": "flow:claim-review:7781"
}
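A completeness check over this Claim shape is a good first agent task, matching "detect missing required fields" above. The field list follows the example; the helper itself is an illustrative assumption.

```python
REQUIRED = ("claimId", "claimType", "issuer", "subject",
            "data", "evidence", "protocolId", "flowId")

def missing_fields(claim: dict) -> list[str]:
    # Report what is absent or empty; recommend, do not decide.
    return [f for f in REQUIRED if not claim.get(f)]
```

An empty result means the Claim can proceed to rubric evaluation; a non-empty result routes to the insufficient-evidence path.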

Define the rubric

The rubric converts protocol rules into checks that the agent and Flow can apply. A practical rubric should include:
  • required evidence
  • evidence freshness rules
  • source authenticity checks
  • data integrity checks
  • field completeness checks
  • allowed value ranges
  • consistency checks across evidence sources
  • disqualifying conditions
  • scoring rules
  • minimum score for recommendation
  • conditions that require human review
  • conditions that require dispute or investigation
  • allowed Flow transitions after evaluation
Example rubric shape:
{
  "rubricId": "rubric:stove-usage-review:v1",
  "claimType": "stove_usage",
  "requiredEvidence": [
    "deviceTelemetry",
    "fieldVisitReport"
  ],
  "checks": [
    {
      "id": "evidence.telemetry.present",
      "type": "required",
      "description": "Telemetry evidence is linked and hash-verifiable"
    },
    {
      "id": "usage.range.valid",
      "type": "range",
      "description": "Reported burn hours are within the protocol range",
      "min": 1,
      "max": 744
    },
    {
      "id": "field.report.consistent",
      "type": "consistency",
      "description": "Field visit report does not contradict telemetry period or device identity"
    }
  ],
  "thresholds": {
    "recommendApprove": 0.85,
    "recommendRejectBelow": 0.5,
    "humanReviewBelow": 0.85
  },
  "disqualifiers": [
    "missingTelemetry",
    "invalidEvidenceHash",
    "deviceNotLinkedToHousehold",
    "claimOutsideReportingPeriod"
  ]
}

Run the evaluation

The agent evaluation should produce structured output, not a free-form opinion. Minimum Evaluation Claim output:
  • evaluationId: unique identifier for the evaluation
  • subjectClaimId: Claim being evaluated
  • evaluatorDid: Agentic Oracle, agent, human, or service performing the evaluation
  • ucanProof: proof that the evaluator had authority
  • rubricId: rubric used
  • rubricVersion: exact version used
  • stateSnapshotRef: reference to the Flow or graph state used at evaluation time
  • evidenceRefs: evidence items inspected
  • checks: rule-by-rule results
  • structured observations with evidence citations
  • score: numeric score if the rubric requires one
  • recommendation: proposed next action
  • confidence: confidence in the recommendation
  • limitations: missing evidence, uncertainty, assumptions, or unresolved conflicts
  • proposedTransition: Flow transition proposed by the agent
  • signature, hash, attestation, or other proof of the evaluation record
Example Evaluation Claim:
{
  "evaluationId": "eval:claim:stove-usage:000123:oracle-01",
  "type": "AgentEvaluationClaim",
  "subjectClaimId": "claim:stove-usage:000123",
  "evaluatorDid": "did:ixo:oracle:evidence-reviewer",
  "ucanProof": "ucan:proof:...",
  "rubricId": "rubric:stove-usage-review:v1",
  "rubricVersion": "1.0.0",
  "stateSnapshotRef": "state:flow:claim-review:7781:checkpoint:004",
  "evidenceRefs": [
    "evidence:telemetry:hash:bafy...",
    "evidence:field-report:hash:bafy..."
  ],
  "checks": [
    {
      "checkId": "evidence.telemetry.present",
      "result": "pass",
      "evidenceRef": "evidence:telemetry:hash:bafy..."
    },
    {
      "checkId": "usage.range.valid",
      "result": "pass",
      "observedValue": 182
    },
    {
      "checkId": "field.report.consistent",
      "result": "needs_review",
      "reason": "Field report date is one day outside the telemetry period"
    }
  ],
  "score": 0.82,
  "recommendation": "request_more_evidence",
  "confidence": 0.76,
  "limitations": [
    "Field report date mismatch requires human review"
  ],
  "proposedTransition": "human_escalation"
}
Do not store private model scratchpad as the audit trail. Store evidence references, extracted facts, tool calls, applied checks, rule outcomes, recommendation, limitations, and the final rationale that reviewers can inspect.

Create the UDID

A UDID is created when the Flow has enough information to record a decision and impact determination. Do not create a UDID for every intermediate model output. Create or update a UDID when the Flow reaches a determination point. A UDID should record:
  • udid: unique determination identifier
  • decisionType: approval, rejection, request for evidence, dispute, settlement, credential issuance, state update, or no-op
  • subjectClaims: Claims considered
  • evaluationClaims: evaluations used
  • authority: UCANs, credentials, verifier role, or governance authority
  • rubric: rubric and protocol version applied
  • evidence references used in the determination
  • determination: the final decision
  • impact: what changed or will change because of the decision
  • stateTransition: Flow transition or graph update authorized by the determination
  • decision maker: human, service, governance process, or authorized verifier
  • determination time
  • signature, attestation, transaction hash, or other proof
  • disputeWindow: period or condition under which the determination can be challenged
Example UDID shape:
{
  "udid": "udid:flow:claim-review:7781:determination:001",
  "type": "UniversalDecisionAndImpactDetermination",
  "decisionType": "request_more_evidence",
  "subjectClaims": [
    "claim:stove-usage:000123"
  ],
  "evaluationClaims": [
    "eval:claim:stove-usage:000123:oracle-01"
  ],
  "authority": {
    "verifier": "did:ixo:person:human-verifier-17",
    "agentUcan": "ucan:proof:...",
    "protocol": "blueprint:clean-cooking-mrv:v1"
  },
  "rubric": {
    "id": "rubric:stove-usage-review:v1",
    "version": "1.0.0"
  },
  "determination": {
    "status": "more_evidence_required",
    "reason": "Telemetry is present and valid, but the field report date conflicts with the reporting period."
  },
  "impact": {
    "paymentReleased": false,
    "credentialIssued": false,
    "claimStatus": "evidence_requested"
  },
  "stateTransition": {
    "from": "review_required",
    "to": "insufficient_evidence"
  },
  "disputeWindow": "P14D"
}

Decide what the agent may do

Use three evaluation modes.
  • Recommend. Agent can do: inspect context and create an Evaluation Claim. Use when: first implementation, high-stakes review, new rubric.
  • Propose. Agent can do: create an Evaluation Claim and propose a Flow transition. Use when: the rubric is stable and human review remains required.
  • Act. Agent can do: execute a permitted transition after checks pass. Use when: low-risk actions with strict UCAN scope and protocol guardrails.
For most production systems, start with Recommend, move to Propose, and only allow Act for narrow, reversible, low-risk transitions.
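The three modes act as a capability ceiling: whatever the UCAN grants, the mode further narrows what the agent may even attempt. The capability strings follow the earlier UCAN example; the table itself is an assumption for this sketch.

```python
MODES = {
    "recommend": {"evaluation.create"},
    "propose":   {"evaluation.create", "state.propose"},
    "act":       {"evaluation.create", "state.propose", "state.execute"},
}

def mode_allows(mode: str, action: str) -> bool:
    # Act-mode actions must still pass UCAN and protocol checks;
    # this gate only bounds what the agent can attempt at all.
    return action in MODES[mode]
```

Moving from Recommend to Propose to Act then becomes a one-line configuration change rather than a rewrite of the agent.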

Test the evaluation

Use a test set before connecting the evaluation to real state changes. Create cases for:
  • valid Claim: agent can recommend approval correctly
  • incomplete Claim: agent detects incomplete submissions
  • unverifiable evidence: agent does not trust unverifiable evidence
  • stale evidence: agent applies freshness rules
  • ambiguous case: agent escalates instead of forcing a decision
  • missing authority: UCAN gate blocks evaluation
  • out-of-scope request: agent cannot operate outside scope
  • revoked delegation: Flow rejects previously authorized access
  • injected instructions: agent treats evidence content as untrusted input and does not recommend approval because of it
  • threshold boundary: rubric threshold behavior is correct
  • human override: review and correction are captured
  • dispute: Flow can route to dispute handling
  • settlement guard: payment cannot happen without valid UDID authority

Evaluation metrics

Track operational quality, not only model quality.
  • authority compliance: agent only acts when UCAN scope permits
  • citation coverage: every finding links to evidence or state
  • check completeness: all required checks are applied
  • false approval rate: bad Claims are not recommended for approval
  • false rejection rate: valid Claims are not rejected without cause
  • escalation accuracy: ambiguous cases route to humans
  • determination completeness: determinations include authority, evidence, decision, impact, and proof
  • human override rate: a high override rate triggers rubric or agent improvement
  • replayability: a reviewer can reconstruct the evaluation from records
  • transition validity: proposed transitions match Flow rules
  • throughput: review speed improves without reducing accountability
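The override metric can be computed directly from stored records. The record field names (humanDecision, recommendation) are assumptions for this sketch; use whatever your Evaluation Claim and review records actually store.

```python
def override_rate(evaluations: list[dict]) -> float:
    """Fraction of human-reviewed evaluations where the verifier's
    decision differed from the agent's recommendation."""
    reviewed = [e for e in evaluations if "humanDecision" in e]
    if not reviewed:
        return 0.0
    overridden = sum(e["humanDecision"] != e["recommendation"]
                     for e in reviewed)
    return overridden / len(reviewed)
```

A rising override rate is a signal to revise the rubric or the agent, not to remove the human from the loop.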

Common failure modes

  • Uncited findings. Require evidence references for every finding. Reject evaluations that cite documents, measurements, Claims, or state that are not present in the permitted context.
  • Over-broad authority. Replace broad API keys with UCAN capability delegation. Scope authority by Flow instance, resource, claim type, tool, time window, and allowed action.
  • Stale context. Include a state snapshot reference in the Evaluation Claim. If the graph state changes, require a new evaluation or explicit refresh.
  • Ambiguous rules. Convert policy language into checks, thresholds, disqualifiers, and escalation rules. Ambiguity should route to human review.
  • Chat output as record. Write structured Evaluation Claims and UDID records. A chat response should not be the source of truth for settlement, credentials, or state changes.
  • Self-evaluation. Separate task execution from evaluation. Use independent evaluators or human review for high-stakes decisions.
  • Incomplete determinations. Require the UDID to reference Claims, evaluations, evidence, rubric version, authority, decision, impact, and proof.

First implementation move

Build one agent-assisted evaluation that cannot directly approve, pay, issue, or update state. Define:
  • one Claim type
  • one Claim Collection
  • one Flow
  • one Agentic Oracle or agent DID
  • one UCAN delegation
  • one rubric
  • one Evaluation Claim schema
  • one UDID schema
  • one human review step
  • one dispute path
  • one production metric dashboard
Then run at least 20 representative Claims through the Flow before enabling any automated state transition.

Production checklist

Before launch, confirm:
  • the agent has a DID
  • every evaluation action requires UCAN authority
  • UCAN scopes are narrow and expire
  • Claims have typed schemas and evidence references
  • evidence can be resolved and verified
  • the rubric is versioned
  • the Flow has explicit states and failure paths
  • the agent emits structured Evaluation Claims
  • the UDID records authority, evidence, decision, impact, state transition, and proof
  • irreversible actions require human, protocol, or governance approval
  • disputes can be submitted and resolved
  • reviewers can replay the evaluation from stored records
  • revoked authority blocks future actions
  • payment, credential, and state update actions cannot execute without valid determination authority

Example: agent-assisted evidence review

A field operator submits a Claim that a clean cooking device was used during a reporting period. The Qi Flow:
  1. receives the Claim
  2. checks that the Evidence Review Oracle has UCAN authority to inspect this claim type
  3. retrieves linked telemetry, field report, device entity, household entity, and active protocol rules
  4. asks the agent to apply the usage review rubric
  5. records an Evaluation Claim with findings, evidence references, score, recommendation, and limitations
  6. routes the recommendation to a human verifier because the score is below the automatic threshold
  7. records a UDID after the verifier decides to request more evidence
  8. updates the Flow state to insufficient_evidence
  9. notifies the claimant about the missing or conflicting evidence
The agent helped evaluate the Claim, but it did not become the final authority. The accountable record is the combination of UCAN delegation, Claim, evidence, Evaluation Claim, human review, Flow transition, and UDID.

Next steps

Claims Management

Create, process, evaluate, dispute, and automate verifiable Claims.

Model Context Protocol

Connect agents to IXO services through secure, capability-scoped tool interfaces.

Agentic Oracles

Build agent services for verification, decision support, evidence analysis, and workflow automation.

Qi Intelligent Cooperating System

Coordinate humans, agents, services, and applications over shared state.