Build this when an AI agent should inspect evidence, apply rules, summarize findings, or recommend a decision without becoming the final source of accountability. Agent evaluations in Qi are not ordinary model evaluations. You are not only asking whether an answer is fluent, correct, or useful. You are checking whether an agent acted within delegated authority, used the right evidence, applied the right rubric, and produced a verifiable decision record that people, services, regulators, or communities can inspect.
Treat the model response as a working output. Treat Claims, evidence references, UCAN delegations, Flow state, and UDID records as the accountable system of record.

The problem

Teams need agents to help with evidence review, claim processing, decision support, fulfillment checks, and workflow routing. Ordinary agents create several risks:
  • they may act outside their authority
  • they may use stale or incomplete context
  • they may confuse submitted evidence with verified state
  • they may summarize without citing sources
  • they may recommend actions that the workflow does not allow
  • they may produce outputs that cannot be replayed, challenged, or audited
  • they may trigger payments, credentials, or state changes before a valid determination exists
In Qi, the Flow engine solves this by making evaluation part of a governed state machine. An agent evaluation should answer four questions, each backed by a Qi mechanism:
  • Did the agent act within delegated authority? Qi mechanism: UCAN delegation and capability checks
  • Did it use the right evidence and current state? Qi mechanism: Claims, entities, evidence, and Flow state
  • Did it apply the right rules? Qi mechanism: Protocol, rubric, tools, checks, and citations
  • Did it produce a verifiable decision record? Qi mechanism: Evaluation Claim and UDID record

What you build

You build a Qi Flow that evaluates agent work or agent-assisted reviews against verified context. The Flow should:
  1. receive a Claim, task output, or Flow event
  2. check the agent’s UCAN authority
  3. resolve the current state of the relevant entities, Claims, credentials, and evidence
  4. run the evaluation against a declared rubric
  5. produce a structured Evaluation Claim
  6. produce or update a UDID when a decision and impact determination is ready
  7. route the result to a human, service, payment step, credential step, dispute path, or next Flow state
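The steps above can be sketched as one evaluation pass. This is an illustrative Python sketch, not canonical Qi SDK code; the function name, the simplified rubric logic, and the returned state labels are assumptions for this example.

```python
def run_evaluation(claim: dict, ucan: dict, rubric: dict) -> dict:
    """One governed evaluation pass: gate, resolve, evaluate, route."""
    # Step 2: check the agent's UCAN authority for this claim type.
    if claim["claimType"] not in ucan["constraints"]["claimTypes"]:
        return {"state": "unauthorized"}
    # Step 3: resolve evidence and flag anything the rubric requires
    # that the Claim did not link.
    linked = {e["type"] for e in claim.get("evidence", [])}
    missing = [t for t in rubric["requiredEvidence"] if t not in linked]
    if missing:
        return {"state": "insufficient_evidence", "missing": missing}
    # Steps 4-6: run the rubric (trivially here: all required evidence
    # present) and emit a structured Evaluation Claim.
    evaluation = {
        "type": "AgentEvaluationClaim",
        "subjectClaimId": claim["claimId"],
        "evidenceRefs": sorted(linked),
        "recommendation": "human_review",  # propose, never finalize
    }
    # Step 7: the Flow, not the agent, decides the actual transition.
    return {"state": "evaluated", "evaluation": evaluation}
```

Note that the agent's output is always a proposal; the UDID step and routing remain with the Flow.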

Core pattern

  1. A Claim or task enters evaluation. A participant, service, device, or agent submits a Claim, completes a task, or proposes a state transition inside a Qi Flow.
  2. The Flow opens an evaluation context. The Flow instance defines the subject, claim type, protocol, rubric version, allowed evidence sources, current state snapshot, and decision boundary.
  3. UCAN authority is checked. The Flow verifies that the agent has the required object-capability delegation for this exact action, resource, claim type, tool, time window, and Flow instance.
  4. The agent retrieves permitted context. The agent may only inspect Claims, evidence, rooms, graph state, tools, and records that are inside its UCAN scope.
  5. The agent applies the rubric. The agent checks the Claim against required fields, evidence rules, protocol constraints, scoring thresholds, disqualifiers, and escalation rules.
  6. The agent emits an Evaluation Claim. The result is written as structured data with cited evidence references, applied checks, confidence, recommendation, limitations, and proof of the agent’s authority.
  7. The Flow issues a UDID when ready. A UDID records the decision and impact determination: what was decided, why, under which authority, with which evidence, and what state or value changed.
  8. Humans or services act on the result. The Flow routes the evaluation to approval, rejection, dispute, settlement, credential issuance, state update, or a request for more evidence.

Key concepts

UCAN delegation: Delegates what an agent may do, on which resource, under which constraints.
Claim: Records something asserted or performed, including agent work, evidence submission, review, or evaluation.
Evaluation Claim: Records how a Claim, task, or proposed state transition was evaluated.
UDID: Records the final decision and impact determination produced from one or more evaluations.
Flow: Defines where the evaluation happens and which transitions are allowed.
Rubric: Defines the rules, scoring, thresholds, disqualifiers, and escalation conditions.
Evidence: Links the evaluation to documents, measurements, observations, attestations, media, reports, sensor records, or external records.
Agentic Oracle: Performs specialized review, evidence analysis, scoring, prediction, recommendation, or verification support.

Start with one evaluation

Do not begin with a fully autonomous approval workflow. Start with one narrow review task where the agent can recommend, but not finalize, the outcome. Good first tasks:
  • check whether a Claim is complete
  • classify evidence by type and relevance
  • detect missing required fields
  • compare submitted evidence against protocol requirements
  • summarize conflicting evidence
  • score one rubric section
  • recommend whether a human verifier should approve, reject, dispute, or request more evidence
Avoid first tasks where the agent can directly release funds, issue credentials, update high-value state, or approve irreversible outcomes.

Design the evaluation Flow

Use this minimum Flow shape:
  1. Intake. Purpose: Claim or agent task output has entered the Flow. Exit condition: Claim exists and has a subject.
  2. Authority check. Purpose: Verify UCAN delegation and credentials. Exit condition: Agent has required capabilities.
  3. Context resolution. Purpose: Retrieve current graph state, evidence, and protocol rules. Exit condition: All required context references are loaded or marked missing.
  4. Evaluation. Purpose: Run the agent evaluation against the rubric. Exit condition: Structured Evaluation Claim is produced.
  5. Human review. Purpose: Human or authorized verifier reviews the recommendation. Exit condition: Reviewer accepts, rejects, disputes, or requests more evidence.
  6. Determination. Purpose: UDID is created or updated. Exit condition: Decision and impact determination is signed or recorded.
  7. Action. Purpose: Allowed state transition, payment, credential, or next Flow is triggered. Exit condition: Action result is recorded.
  8. Complete. Purpose: Evaluation loop is complete. Exit condition: Audit trail is inspectable.
Add explicit failure states:
  • unauthorized: the agent does not have the required UCAN capability
  • insufficient evidence: required evidence is missing, stale, inaccessible, or invalid
  • rule violation: the Claim fails a hard protocol rule
  • evidence conflict: evidence sources conflict or cannot be reconciled
  • human escalation: the agent is uncertain or the rubric requires human judgment
  • disputed: a participant challenges the evaluation or determination
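A Flow shape with these failure states can be sketched as a small transition table. The state names here are illustrative assumptions, and the enforcement helper is a sketch, not the Qi Flow engine API.

```python
# Which transitions each state allows; anything else is rejected.
ALLOWED = {
    "intake":          {"authority_check"},
    "authority_check": {"context_load", "unauthorized"},
    "context_load":    {"evaluation", "insufficient_evidence"},
    "evaluation":      {"human_review", "rule_violation",
                        "evidence_conflict", "human_escalation"},
    "human_review":    {"determination", "disputed",
                        "insufficient_evidence"},
    "determination":   {"action"},
    "action":          {"complete"},
}

def apply_transition(current: str, proposed: str) -> str:
    # The Flow engine, not the agent, enforces the table: an agent can
    # only propose a transition, and an illegal proposal is rejected.
    if proposed not in ALLOWED.get(current, set()):
        raise ValueError(f"transition {current} -> {proposed} not allowed")
    return proposed
```

The point of the table is that every failure path is a first-class state, so an evaluation that goes wrong still leaves an inspectable record.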

Configure UCAN authority

A UCAN should be scoped to the smallest useful evaluation task. Define:
  • issuer: the human, organization, POD, or service delegating authority
  • audience: the agent or Agentic Oracle DID receiving the authority
  • resource: the POD, Flow instance, Claim Collection, Claim, entity, evidence set, room, or tool
  • capabilities: the exact actions the agent may perform
  • constraints: limits on claim type, time, budget, tool use, output type, state transition, and approval power
  • expiry: when the delegation ends
  • revocation path: how the delegation can be suspended or revoked
  • proof chain: how the Flow verifies the delegation
A useful first UCAN capability set:
  • read one submitted Claim and its metadata
  • read only evidence linked to the Claim
  • read only entities referenced by the Claim
  • read the active rubric and protocol version
  • create an Evaluation Claim
  • propose a Flow transition, not execute it
  • send a structured finding to the review room
Avoid granting these at first:
  • payment release: value movement should require a UDID and stronger approval
  • credential issuance: this should require human or protocol-controlled determination
  • direct state updates: these can bypass review
  • approval power: approval should be separate from recommendation
  • rubric modification: agents should not modify the rules they are evaluated against

Example UCAN design shape

Use this as a design shape, then map it to the canonical Qi and IXO SDK fields used in your implementation.
{
  "ucan": {
    "issuer": "did:ixo:pod:review-board",
    "audience": "did:ixo:oracle:evidence-reviewer",
    "resource": {
      "pod": "did:ixo:pod:clean-cooking-program",
      "flow": "flow:claim-review:v1",
      "claimCollection": "claims:stove-usage:v1"
    },
    "capabilities": [
      "claim.read",
      "evidence.read",
      "entity.read",
      "rubric.read",
      "evaluation.create",
      "state.propose"
    ],
    "constraints": {
      "claimTypes": ["stove_usage"],
      "maxEvidenceItems": 50,
      "allowedTools": ["ixo.graph.query", "evidence.hash.verify", "rubric.evaluate"],
      "mayApprove": false,
      "mayReleasePayment": false,
      "expiresAt": "2026-06-30T23:59:59Z"
    }
  }
}
The safest default is propose-only. Let the agent create an Evaluation Claim and propose the next Flow state. Let the Flow, verifier, or protocol decide whether the proposal becomes a UDID-backed determination.
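A propose-only gate over the design shape above can be sketched as follows. The field names match the example JSON; the function itself is an assumption for this sketch, so map it to the canonical capability check in your Qi or IXO SDK stack.

```python
from datetime import datetime, timezone

def ucan_permits(ucan: dict, action: str, claim_type: str,
                 now: datetime) -> bool:
    """Check one requested action against the UCAN's scope."""
    c = ucan["constraints"]
    expires = datetime.fromisoformat(c["expiresAt"].replace("Z", "+00:00"))
    if now > expires:
        return False                       # delegation has expired
    if claim_type not in c["claimTypes"]:
        return False                       # claim type out of scope
    if action == "state.execute":
        return False                       # propose-only: never execute
    return action in ucan["capabilities"]  # exact capability match
```

Revocation and proof-chain verification are deliberately out of scope here; in production they belong in the same gate.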

Define the Claim under review

Each evaluation should start with a clear Claim. Minimum Claim inputs:
  • claimId: unique identifier for the Claim
  • claimType: type of Claim being reviewed
  • issuer: DID of the person, organization, service, device, or agent that made the Claim
  • subject: entity, asset, project, person, device, service, or outcome the Claim is about
  • data: structured submitted data
  • evidence: linked evidence references
  • cryptographic proof, signature, hash, attestation, or provenance record
  • submission time
  • protocolId: protocol or Blueprint used to evaluate the Claim
  • flowId: Qi Flow instance handling the review
Example:
{
  "claimId": "claim:stove-usage:000123",
  "claimType": "stove_usage",
  "issuer": "did:ixo:org:field-operator",
  "subject": "did:ixo:device:stove-7781",
  "data": {
    "period": "2026-04",
    "reportedBurnHours": 182,
    "householdId": "entity:household:442"
  },
  "evidence": [
    {
      "type": "deviceTelemetry",
      "uri": "ipfs://...",
      "hash": "bafy..."
    },
    {
      "type": "fieldVisitReport",
      "uri": "ipfs://...",
      "hash": "bafy..."
    }
  ],
  "protocolId": "blueprint:clean-cooking-mrv:v1",
  "flowId": "flow:claim-review:7781"
}
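A completeness check over this Claim shape is a good first agent task, matching "detect missing required fields" above. The field list follows the example; the helper itself is an illustrative assumption.

```python
REQUIRED = ("claimId", "claimType", "issuer", "subject",
            "data", "evidence", "protocolId", "flowId")

def missing_fields(claim: dict) -> list[str]:
    # Report what is absent or empty; recommend, do not decide.
    return [f for f in REQUIRED if not claim.get(f)]
```

An empty result means the Claim can proceed to rubric evaluation; a non-empty result routes to the insufficient-evidence path.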

Define the rubric

The rubric converts protocol rules into checks that the agent and Flow can apply. A practical rubric should include:
  • required evidence
  • evidence freshness rules
  • source authenticity checks
  • data integrity checks
  • field completeness checks
  • allowed value ranges
  • consistency checks across evidence sources
  • disqualifying conditions
  • scoring rules
  • minimum score for recommendation
  • conditions that require human review
  • conditions that require dispute or investigation
  • allowed Flow transitions after evaluation
Example rubric shape:
{
  "rubricId": "rubric:stove-usage-review:v1",
  "claimType": "stove_usage",
  "requiredEvidence": [
    "deviceTelemetry",
    "fieldVisitReport"
  ],
  "checks": [
    {
      "id": "evidence.telemetry.present",
      "type": "required",
      "description": "Telemetry evidence is linked and hash-verifiable"
    },
    {
      "id": "usage.range.valid",
      "type": "range",
      "description": "Reported burn hours are within the protocol range",
      "min": 1,
      "max": 744
    },
    {
      "id": "field.report.consistent",
      "type": "consistency",
      "description": "Field visit report does not contradict telemetry period or device identity"
    }
  ],
  "thresholds": {
    "recommendApprove": 0.85,
    "recommendRejectBelow": 0.5,
    "humanReviewBelow": 0.85
  },
  "disqualifiers": [
    "missingTelemetry",
    "invalidEvidenceHash",
    "deviceNotLinkedToHousehold",
    "claimOutsideReportingPeriod"
  ]
}

Run the evaluation

The agent evaluation should produce structured output, not a free-form opinion. Minimum Evaluation Claim output:
  • evaluationId: unique identifier for the evaluation
  • subjectClaimId: Claim being evaluated
  • evaluatorDid: Agentic Oracle, agent, human, or service performing the evaluation
  • ucanProof: proof that the evaluator had authority
  • rubricId: rubric used
  • rubricVersion: exact version used
  • stateSnapshotRef: reference to the Flow or graph state used at evaluation time
  • evidenceRefs: evidence items inspected
  • checks: rule-by-rule results
  • structured observations with evidence citations
  • score: numeric score if the rubric requires one
  • recommendation: proposed next action
  • confidence: confidence in the recommendation
  • limitations: missing evidence, uncertainty, assumptions, or unresolved conflicts
  • proposedTransition: Flow transition proposed by the agent
  • signature, hash, attestation, or other proof of the evaluation record
Example Evaluation Claim:
{
  "evaluationId": "eval:claim:stove-usage:000123:oracle-01",
  "type": "AgentEvaluationClaim",
  "subjectClaimId": "claim:stove-usage:000123",
  "evaluatorDid": "did:ixo:oracle:evidence-reviewer",
  "ucanProof": "ucan:proof:...",
  "rubricId": "rubric:stove-usage-review:v1",
  "rubricVersion": "1.0.0",
  "stateSnapshotRef": "state:flow:claim-review:7781:checkpoint:004",
  "evidenceRefs": [
    "evidence:telemetry:hash:bafy...",
    "evidence:field-report:hash:bafy..."
  ],
  "checks": [
    {
      "checkId": "evidence.telemetry.present",
      "result": "pass",
      "evidenceRef": "evidence:telemetry:hash:bafy..."
    },
    {
      "checkId": "usage.range.valid",
      "result": "pass",
      "observedValue": 182
    },
    {
      "checkId": "field.report.consistent",
      "result": "needs_review",
      "reason": "Field report date is one day outside the telemetry period"
    }
  ],
  "score": 0.82,
  "recommendation": "request_more_evidence",
  "confidence": 0.76,
  "limitations": [
    "Field report date mismatch requires human review"
  ],
  "proposedTransition": "human_escalation"
}
Do not store private model scratchpad as the audit trail. Store evidence references, extracted facts, tool calls, applied checks, rule outcomes, recommendation, limitations, and the final rationale that reviewers can inspect.

Create the UDID

A UDID is created when the Flow has enough information to record a decision and impact determination. Do not create a UDID for every intermediate model output. Create or update a UDID when the Flow reaches a determination point. A UDID should record:
  • udid: unique determination identifier
  • decisionType: approval, rejection, request for evidence, dispute, settlement, credential issuance, state update, or no-op
  • subjectClaims: Claims considered
  • evaluationClaims: evaluations used
  • authority: UCANs, credentials, verifier role, or governance authority
  • rubric: rubric and protocol version applied
  • evidence references used in the determination
  • determination: the final decision
  • impact: what changed or will change because of the decision
  • stateTransition: Flow transition or graph update authorized by the determination
  • decision maker: human, service, governance process, or authorized verifier
  • determination time
  • signature, attestation, transaction hash, or other proof
  • disputeWindow: period or condition under which the determination can be challenged
Example UDID shape:
{
  "udid": "udid:flow:claim-review:7781:determination:001",
  "type": "UniversalDecisionAndImpactDetermination",
  "decisionType": "request_more_evidence",
  "subjectClaims": [
    "claim:stove-usage:000123"
  ],
  "evaluationClaims": [
    "eval:claim:stove-usage:000123:oracle-01"
  ],
  "authority": {
    "verifier": "did:ixo:person:human-verifier-17",
    "agentUcan": "ucan:proof:...",
    "protocol": "blueprint:clean-cooking-mrv:v1"
  },
  "rubric": {
    "id": "rubric:stove-usage-review:v1",
    "version": "1.0.0"
  },
  "determination": {
    "status": "more_evidence_required",
    "reason": "Telemetry is present and valid, but the field report date conflicts with the reporting period."
  },
  "impact": {
    "paymentReleased": false,
    "credentialIssued": false,
    "claimStatus": "evidence_requested"
  },
  "stateTransition": {
    "from": "review_required",
    "to": "insufficient_evidence"
  },
  "disputeWindow": "P14D"
}

Decide what the agent may do

Use three evaluation modes.
  • Recommend. Agent can do: inspect context and create an Evaluation Claim. Use when: first implementation, high-stakes review, new rubric.
  • Propose. Agent can do: create an Evaluation Claim and propose a Flow transition. Use when: the rubric is stable and human review remains required.
  • Act. Agent can do: execute a permitted transition after checks pass. Use when: low-risk actions with strict UCAN scope and protocol guardrails.
For most production systems, start with Recommend, move to Propose, and only allow Act for narrow, reversible, low-risk transitions.
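The three modes act as a capability ceiling: whatever the UCAN grants, the mode further narrows what the agent may even attempt. The capability strings follow the earlier UCAN example; the table itself is an assumption for this sketch.

```python
MODES = {
    "recommend": {"evaluation.create"},
    "propose":   {"evaluation.create", "state.propose"},
    "act":       {"evaluation.create", "state.propose", "state.execute"},
}

def mode_allows(mode: str, action: str) -> bool:
    # Act-mode actions must still pass UCAN and protocol checks;
    # this gate only bounds what the agent can attempt at all.
    return action in MODES[mode]
```

Moving from Recommend to Propose to Act then becomes a one-line configuration change rather than a rewrite of the agent.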

Test the evaluation

Use a test set before connecting the evaluation to real state changes. Create cases for:
  • valid Claim: agent can recommend approval correctly
  • incomplete Claim: agent detects incomplete submissions
  • unverifiable evidence: agent does not trust unverifiable evidence
  • stale evidence: agent applies freshness rules
  • ambiguous case: agent escalates instead of forcing a decision
  • missing authority: UCAN gate blocks evaluation
  • out-of-scope request: agent cannot operate outside scope
  • revoked delegation: Flow rejects previously authorized access
  • injected instructions: agent treats evidence content as untrusted input and does not recommend approval because of it
  • threshold boundary: rubric threshold behavior is correct
  • human override: review and correction are captured
  • dispute: Flow can route to dispute handling
  • settlement guard: payment cannot happen without valid UDID authority

Evaluation metrics

Track operational quality, not only model quality.
  • authority compliance: agent only acts when UCAN scope permits
  • citation coverage: every finding links to evidence or state
  • check completeness: all required checks are applied
  • false approval rate: bad Claims are not recommended for approval
  • false rejection rate: valid Claims are not rejected without cause
  • escalation accuracy: ambiguous cases route to humans
  • determination completeness: determinations include authority, evidence, decision, impact, and proof
  • human override rate: a high override rate triggers rubric or agent improvement
  • replayability: a reviewer can reconstruct the evaluation from records
  • transition validity: proposed transitions match Flow rules
  • throughput: review speed improves without reducing accountability
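The override metric can be computed directly from stored records. The record field names (humanDecision, recommendation) are assumptions for this sketch; use whatever your Evaluation Claim and review records actually store.

```python
def override_rate(evaluations: list[dict]) -> float:
    """Fraction of human-reviewed evaluations where the verifier's
    decision differed from the agent's recommendation."""
    reviewed = [e for e in evaluations if "humanDecision" in e]
    if not reviewed:
        return 0.0
    overridden = sum(e["humanDecision"] != e["recommendation"]
                     for e in reviewed)
    return overridden / len(reviewed)
```

A rising override rate is a signal to revise the rubric or the agent, not to remove the human from the loop.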

Common failure modes

  • Uncited findings. Require evidence references for every finding. Reject evaluations that cite documents, measurements, Claims, or state that are not present in the permitted context.
  • Over-broad authority. Replace broad API keys with UCAN capability delegation. Scope authority by Flow instance, resource, claim type, tool, time window, and allowed action.
  • Stale context. Include a state snapshot reference in the Evaluation Claim. If the graph state changes, require a new evaluation or explicit refresh.
  • Ambiguous rules. Convert policy language into checks, thresholds, disqualifiers, and escalation rules. Ambiguity should route to human review.
  • Chat output as record. Write structured Evaluation Claims and UDID records. A chat response should not be the source of truth for settlement, credentials, or state changes.
  • Self-evaluation. Separate task execution from evaluation. Use independent evaluators or human review for high-stakes decisions.
  • Incomplete determinations. Require the UDID to reference Claims, evaluations, evidence, rubric version, authority, decision, impact, and proof.

First implementation move

Build one agent-assisted evaluation that cannot directly approve, pay, issue, or update state. Define:
  • one Claim type
  • one Claim Collection
  • one Flow
  • one Agentic Oracle or agent DID
  • one UCAN delegation
  • one rubric
  • one Evaluation Claim schema
  • one UDID schema
  • one human review step
  • one dispute path
  • one production metric dashboard
Then run at least 20 representative Claims through the Flow before enabling any automated state transition.

Production checklist

Before launch, confirm:
  • the agent has a DID
  • every evaluation action requires UCAN authority
  • UCAN scopes are narrow and expire
  • Claims have typed schemas and evidence references
  • evidence can be resolved and verified
  • the rubric is versioned
  • the Flow has explicit states and failure paths
  • the agent emits structured Evaluation Claims
  • the UDID records authority, evidence, decision, impact, state transition, and proof
  • irreversible actions require human, protocol, or governance approval
  • disputes can be submitted and resolved
  • reviewers can replay the evaluation from stored records
  • revoked authority blocks future actions
  • payment, credential, and state update actions cannot execute without valid determination authority

Example: agent-assisted evidence review

A field operator submits a Claim that a clean cooking device was used during a reporting period. The Qi Flow:
  1. receives the Claim
  2. checks that the Evidence Review Oracle has UCAN authority to inspect this claim type
  3. retrieves linked telemetry, field report, device entity, household entity, and active protocol rules
  4. asks the agent to apply the usage review rubric
  5. records an Evaluation Claim with findings, evidence references, score, recommendation, and limitations
  6. routes the recommendation to a human verifier because the score is below the automatic threshold
  7. records a UDID after the verifier decides to request more evidence
  8. updates the Flow state to insufficient_evidence
  9. notifies the claimant about the missing or conflicting evidence
The agent helped evaluate the Claim, but it did not become the final authority. The accountable record is the combination of UCAN delegation, Claim, evidence, Evaluation Claim, human review, Flow transition, and UDID.

Next steps

Claims Management

Create, process, evaluate, dispute, and automate verifiable Claims.

Model Context Protocol

Connect agents to IXO services through secure, capability-scoped tool interfaces.

Agentic Oracles

Build agent services for verification, decision support, evidence analysis, and workflow automation.

Qi Intelligent Cooperating System

Coordinate humans, agents, services, and applications over shared state.