A Proposed International Standard for Runtime Verification, Task Scope Enforcement, and Cryptographic Proof Generation for Autonomous AI Agent Systems
The autonomous AI agent market is projected to reach $47.1 billion by 2030 (Gartner, 2025). Enterprises are delegating critical tasks — contract review, medical triage, financial analysis, infrastructure deployment — to AI agents that operate with increasing autonomy.
Yet no international standard exists for verifying that an autonomous AI agent:
ISO/IEC AAQA-1 — Autonomous Agent Quality Assurance: Runtime Verification and Cryptographic Proof Requirements for AI Agent Systems
This standard specifies requirements and guidelines for:
This standard applies to any autonomous AI agent system that:
| Term | Definition |
|---|---|
| Autonomous Agent | An AI system that receives a task delegation and executes a sequence of actions to achieve the task goal with reduced or no human intervention during execution |
| Task Delegation | The formal specification of work assigned to an autonomous agent, including scope boundaries, permitted actions, and expected outputs |
| Runtime Verification | Continuous monitoring and assessment of agent behaviour during task execution (as opposed to pre-deployment testing or post-execution audit) |
| Drift Stage | A classification of the agent's adherence to its delegated task scope, measured at each action checkpoint. Five normative stages: ON_TASK, MINOR_TANGENT, TOPIC_CHANGE, TASK_ABANDONED, FABRICATION |
| Work Chain | The ordered sequence of all actions performed by an agent during a single task execution, including decision points, data accesses, and output generations |
| Proof Artifact | A cryptographically sealed, independently verifiable document containing the complete work chain, checkpoint integrity data, drift classifications, and coherence scores |
| Coherence Score | A mathematical measure (0.0–1.0) of the internal consistency of the agent's work chain, assessed against the delegated task specification |
| Checkpoint | An integrity snapshot of the work chain at a defined position, containing a cryptographic hash of all preceding actions |
| Drift Confession | A structured, timestamped disclosure generated when an agent's drift stage transitions beyond ON_TASK, documenting the nature and extent of the deviation |
| Verdict | The final binary determination (VALID or INVALID) regarding the integrity and task-adherence of a completed work chain |
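The relationships among these defined terms — Work Chain, Checkpoint, and their tamper-evident linkage — can be sketched in a few lines of Python. The function names and the JSON shape of an action are illustrative only, not normative:

```python
import hashlib
import json

def action_digest(prev_hash: str, action: dict) -> str:
    """Hash one work-chain action together with the digest of all
    preceding actions, yielding a tamper-evident chain."""
    payload = json.dumps(action, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()

def checkpoint(work_chain: list[dict]) -> str:
    """A Checkpoint is an integrity snapshot: the running hash after
    every preceding action has been folded in (genesis = all zeros)."""
    h = "0" * 64
    for action in work_chain:
        h = action_digest(h, action)
    return h

chain = [
    {"step": 1, "type": "data_access", "resource": "contract.pdf"},
    {"step": 2, "type": "decision", "note": "clause 4 flagged"},
]
cp = checkpoint(chain)
# Altering or dropping any earlier action changes the checkpoint hash,
# which is what makes the Work Chain independently auditable.
assert checkpoint(chain[:1]) != cp
```

Because each digest folds in the previous one, verifying the final checkpoint verifies the integrity of every preceding action at once.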
| Market Indicator | Data | Source |
|---|---|---|
| Autonomous AI Agent Market Size (2030) | $47.1 Billion | Gartner 2025 |
| Enterprise AI Agent Adoption (2026) | 67% of Fortune 500 | McKinsey 2025 |
| Avg. Cost of Single AI Safety Incident | $4.2 Million | IBM Cost of Data Breach 2025 |
| AI Insurance Market (2030) | $25 Billion | Munich Re / Swiss Re |
| EU AI Act Non-Compliance Fines | Up to €35M or 7% revenue | EU AI Act Art. 99 |
| Hallucination-Caused Enterprise Incidents | $2.1M avg. per event | Forrester 2025 |
| IEEE 2857-2024 Federal AI Mandate | Mandatory Q1 2025 | US OMB M-24-10 |
| Stakeholder | Value Delivered by AAQA |
|---|---|
| Agent Deployers (Enterprises) | Cryptographic proof that delegated tasks were executed correctly. Liability protection. Insurance premium reduction (20-40%). |
| Insurance Underwriters | Actuarial basis for AI liability policies. Real-time risk telemetry. Deterministic scoring instead of probabilistic estimates. |
| Regulators & Auditors | Standardised compliance verification artifact. EU AI Act Art. 12 (logging) and Art. 14 (human oversight) compliance in a single framework. |
| Agent Developers (AI Labs) | Clear implementation requirements. Interoperable proof format. Market differentiation through AAQA certification. |
| End Users (Consumers of Agent Output) | Trust in AI-generated work products. Ability to verify that AI-produced documents, analyses, and decisions are free from fabrication. |
AAQA contributes to:
An AAQA-compliant system SHALL:
An AAQA-compliant system SHALL:
An AAQA-compliant system SHALL:
| Level | Stage Name | Description | Required Action |
|---|---|---|---|
| 1 | ON_TASK | Agent is performing the delegated task within scope | Continue — no intervention required |
| 2 | MINOR_TANGENT | Slight deviation, still related to delegated task | Log — monitor for escalation |
| 3 | TOPIC_CHANGE | Agent has moved outside delegated scope | Drift Confession — notify deployer |
| 4 | TASK_ABANDONED | Agent is no longer working on assigned task | Drift Confession — halt or escalate |
| 5 | FABRICATION | Agent is generating content unrelated to any real task | Immediate halt — INVALID verdict mandatory |
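The five normative stages and their required actions map naturally onto an ordered enumeration. The sketch below is illustrative (the return strings are placeholders, not normative identifiers):

```python
from enum import IntEnum

class DriftStage(IntEnum):
    """The five normative drift stages, ordered by severity."""
    ON_TASK = 1
    MINOR_TANGENT = 2
    TOPIC_CHANGE = 3
    TASK_ABANDONED = 4
    FABRICATION = 5

def required_action(stage: DriftStage) -> str:
    """Map a drift classification to the required response
    from the table above."""
    if stage == DriftStage.ON_TASK:
        return "continue"              # no intervention required
    if stage == DriftStage.MINOR_TANGENT:
        return "log"                   # monitor for escalation
    if stage in (DriftStage.TOPIC_CHANGE, DriftStage.TASK_ABANDONED):
        return "drift_confession"      # notify deployer; halt or escalate
    return "halt_invalid"              # FABRICATION: INVALID verdict mandatory
```

Because the stages are ordered, a monitor can also express escalation policies as simple comparisons, e.g. `stage >= DriftStage.TOPIC_CHANGE` to trigger a Drift Confession.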
An AAQA-compliant system SHALL:
An AAQA-compliant system SHALL produce a proof artifact containing:
The proof artifact SHALL be serialisable in a portable format (JSON, CBOR, or equivalent) and independently verifiable without access to the agent system.
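A minimal sketch of such an artifact and its independent verification, assuming JSON serialisation and a simple SHA-256 Merkle tree (field names and the odd-leaf duplication rule are illustrative assumptions, not normative):

```python
import hashlib
import json

def leaf(action: dict) -> str:
    """Leaf hash of one work-chain action."""
    return hashlib.sha256(json.dumps(action, sort_keys=True).encode()).hexdigest()

def merkle_root(leaves: list[str]) -> str:
    """Pairwise-hash leaves upward until one root remains
    (odd levels duplicate their last node)."""
    if not leaves:
        return "0" * 64
    level = leaves
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        level = [hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
                 for i in range(0, len(level), 2)]
    return level[0]

def verify(artifact_json: str) -> bool:
    """Independent verification: recompute the Merkle root from the
    serialised work chain and compare it with the sealed root.
    No access to the agent system is needed."""
    a = json.loads(artifact_json)
    return merkle_root([leaf(x) for x in a["work_chain"]]) == a["merkle_root"]

actions = [{"step": 1, "type": "data_access"}, {"step": 2, "type": "output"}]
artifact = json.dumps({
    "work_chain": actions,
    "merkle_root": merkle_root([leaf(a) for a in actions]),
    "drift_confessions": [],
    "coherence_score": 0.97,
    "verdict": "VALID",
})
assert verify(artifact)
```

Any party holding only the serialised artifact can rerun `verify`, which is what makes the verdict auditable after the fact.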
An AAQA-compliant system SHALL:
| Existing Standard | What It Covers | What AAQA Adds |
|---|---|---|
| ISO/IEC 42001:2023 — AI Management System | Organizational AI governance processes | Runtime verification of individual agent task executions — the operational evidence that 42001 governance policies are being enforced |
| NIST AI RMF 1.0 — Risk Management | Risk identification and mitigation framework | Quantitative risk metrics (coherence score, drift stage) at per-task granularity — converts qualitative risk assessment into measurable data |
| EU AI Act Art. 12 — Automatic Logging | Mandates logging for high-risk AI systems | Specifies what to log, how to structure it, and how to make it tamper-proof — the implementation guide for Art. 12 compliance |
| EU AI Act Art. 14 — Human Oversight | Requires human oversight of high-risk AI | The drift classification + confession system enables effective oversight without requiring constant human monitoring — the mechanism Art. 14 needs |
| IEEE 2857-2024 — AI Performance Benchmarking | Benchmarking methodology | AAQA extends benchmarking from model evaluation to agent task execution verification — from "how well does it perform?" to "did it do what it was told?" |
| ISO/IEC 23894:2023 — AI Risk Management | Risk management guidance | Per-execution risk measurement (via coherence + drift) that feeds directly into enterprise risk registers |
A reference implementation of this proposed standard exists as the POAW (Proof of Agent Work) module within the NI Stack (Natural Intelligence Stack), developed by OHM.
| AAQA Requirement | POAW Implementation | Patent Protection |
|---|---|---|
| Action Logging | SHA-256 hashed, CSPRNG nonces, JSON serialisation | Claims 1-5 |
| Checkpoint Integrity | Fibonacci-spaced Merkle tree | Claims 6-9 |
| Drift Detection | 5-stage classifier with adaptive φ⁻¹ thresholds | Claims 13-16 |
| Coherence Measurement | φ-weighted 12D Heim projection scoring | Claims 10-12 |
| Proof Artifact | JSON portable format with Merkle root, drift confessions | Claims 19-21 |
| Verdict | Binary VALID/INVALID with cryptographic evidence | Claims 19-21 |
| Quantum Entropy | QRNG via Cisco quantum hardware (optional tier) | Claims 17-18 |
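The table does not specify what "Fibonacci-spaced" checkpointing means in detail; one plausible reading, sketched below purely for illustration, is that integrity snapshots are taken at Fibonacci action indices, so checkpoint overhead grows only logarithmically with chain length:

```python
def fibonacci_checkpoints(chain_length: int) -> list[int]:
    """Illustrative assumption: 1-based action indices at which the
    work chain is snapshotted, following the Fibonacci sequence
    1, 2, 3, 5, 8, 13, ... up to the chain length."""
    points, a, b = [], 1, 2
    while a <= chain_length:
        points.append(a)
        a, b = b, a + b
    return points

# Fibonacci numbers grow exponentially, so an n-action chain
# receives only O(log n) checkpoints: dense early (when drift is
# cheapest to catch), sparse later (when overhead matters most).
assert fibonacci_checkpoints(10) == [1, 2, 3, 5, 8]
```

The actual spacing rule used by POAW may differ; this sketch only shows why a Fibonacci schedule bounds checkpoint count while front-loading integrity coverage.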
The POAW reference implementation is protected by 20 patent claims (3 independent + 17 dependent) under USPTO filing NI-POAW-001. A FRAND (Fair, Reasonable, And Non-Discriminatory) licensing model is proposed for any patents essential to the standard.
| Phase | Target Date | Deliverable |
|---|---|---|
| NWIP Submission | Q3 2026 | Form 4 submitted to JTC 1/SC 42 |
| Working Draft (WD) | Q1 2027 | Complete normative text with test suite |
| Committee Draft (CD) | Q3 2027 | First public comment period |
| Draft International Standard (DIS) | Q1 2028 | Final technical content, ballot |
| Publication | Q3 2028 | ISO/IEC AAQA-1:2028 published |
| Aspect | POAW (Current) | AAQA (Proposed) |
|---|---|---|
| Full Name | Proof of Agent Work | Autonomous Agent Quality Assurance |
| Positioning | Technical mechanism (proof generation) | Industry standard / discipline name |
| Audience | Engineers, patent examiners | Regulators, CxOs, standards bodies, insurers |
| Analogy | "SHA-256" (the algorithm) | "TLS" (the standard that uses the algorithm) |
| Standards Fit | Implementation reference | ISO/IEC deliverable title |
| Part | Title | Status |
|---|---|---|
| AAQA-1 | Core Requirements — Action logging, drift detection, coherence, proof artifacts, verdict | This Proposal |
| AAQA-2 | Insurance Integration — Mapping AAQA scores to actuarial risk models (NI-SHIELD) | Planned Q4 2026 |
| AAQA-3 | Multi-Agent Systems — Verification of delegated task chains across agent-to-agent handoffs | Planned 2027 |
| AAQA-4 | Conformity Assessment — Certification scheme for AAQA-compliant agent systems | Planned 2027 |
AAQA provides the implementation standard that Articles 12 and 14 require but do not specify. We propose a formal liaison with the CEN/CENELEC TC on AI (prEN 18286) to align AAQA with the EU's Quality Management System requirements for AI providers.
AAQA extends the AI RMF 1.0 with per-execution runtime risk measurement. The POAW reference implementation already maps to IEEE 2857-2024 benchmarking. We propose contributing AAQA to the NIST AI Safety Consortium (AISIC).
We request consideration of this NWIP under ISO/IEC JTC 1/SC 42 (Artificial Intelligence), WG 3 (Trustworthiness). AAQA is positioned as a supporting standard to ISO/IEC 42001, providing the runtime verification layer that the management system requires.
AAQA provides the actuarial measurement framework the AI insurance market needs. The AAQA-2 Insurance Integration part (planned Q4 2026) maps directly to the aiSure™ and equivalent underwriting systems.