Continuous control testing is the continuous execution of control-effectiveness tests against the full control population, producing signed, reperformable evidence aligned to PCAOB AS 2201 paragraphs .39, .42, and .46–.50. In 2026, the operative change is not the testing cadence itself — rule-based continuous controls monitoring has existed since the mid-2000s — but the arrival of LLM-driven reasoning agents that can read policies, interpret evidence, and produce walkthrough-grade test records for roughly 30–45 percent of a typical mid-market control population. The remaining controls still require human judgment. The right architecture is hybrid: agents cover high-volume deterministic families (access review, change management, ITGC baseline); humans cover judgmental families (estimates, journal entry review, anti-fraud). The PCAOB evidence bar is clear and achievable for platforms that preserve reasoning traces, cryptographically sign evidence, and support external-auditor walkthrough reperformability.
This primer is for the Controller, Internal Audit Director, or Risk/Compliance Lead evaluating whether agent-driven continuous testing can credibly sit inside a PCAOB-aligned SOX program. The reader is assumed audit-literate and appropriately skeptical. The goal is not to pitch continuous testing as a universal automation — that claim routinely fails at external audit walkthrough — but to give an honest account of what works today, what the regulator actually expects, and how to evaluate a platform without hand-waving.
What is continuous control testing in 2026?
Continuous control testing is the execution of tests of control operating effectiveness against the full relevant population on a rolling cadence, with each test execution producing an evidence record suitable for PCAOB AS 2201 walkthrough reperformance. The phrase has been in SOX programs since the mid-2000s, but it has meant three different things in three different generations.
The first generation was rule-based continuous controls monitoring (CCM): pre-configured rules that observed transactions in SAP, Oracle, NetSuite, and other ERP systems and flagged violations. Vendors like Pathlock (formerly Greenlight Technologies) and Fastpath pioneered this category for segregation-of-duties enforcement and transaction-level fraud detection. Useful, narrow, and mostly confined to the ERP layer.
The second generation was scheduled-sample automated testing: scripts that pulled evidence on a cadence — backup completion logs, job scheduling output, patching status — and compared it to expected values. Useful for ITGC where the control's operating expectation was deterministic and the evidence was structured. This generation extended CCM beyond the ERP into infrastructure and cloud.
The third generation, arriving 2024–2026, is agentic reasoning testing. LLM-driven agents that read control narratives, pull relevant evidence from source systems, reason about whether the evidence supports the control operating effectively, and produce signed test records with preserved reasoning traces. A reasoning agent can read an access-review policy, observe actual provisioning activity in Okta and Workday, and make a judgment like "this terminated user retained Snowflake access for 14 days post-termination, which violates the policy" — a judgment that previously required a human auditor reviewing a user access review spreadsheet over a weekend.
The distinction between generation three and the earlier generations matters for evaluation. Not every "AI control testing" vendor ships real reasoning. Many ship rule-based CCM with an AI-colored wrapper, and the difference is not always visible in marketing. The evaluation test is simple: does the test execution output include a reasoning trace the agent produced, or just a rule-match flag? If only the flag, the product is generation-one CCM rebadged. Move on.
Why is this shift happening now, and why 2026 specifically?
Three forces converged in 2025–2026 that make agent-driven continuous testing economically and regulatorily viable for the first time.
Force one: LLM capability crossed the audit-judgment threshold. GPT-4-class models released in 2023–2024 could produce plausible-looking control interpretations but routinely failed at audit-grade reasoning — they missed context, hallucinated policy language, and produced reasoning that would not withstand partner review. The 2025–2026 generation of models, combined with retrieval-augmented generation against the company's actual policy corpus and control matrix, produces interpretations that pass partner-level scrutiny on deterministic control families. The bar moved in one model generation.
Force two: PCAOB inspection posture normalized automated evidence. PCAOB inspection reports from late 2025 and early 2026 specifically addressed automated control testing. The tone has been cautiously accepting: automated evidence is valid when the design meets AS 2201 expectations, but firms are held to a sufficiency standard. This is the correct posture for the industry and the right level of rigor. Public inspection reports referencing agent-produced evidence have appeared, and Big 4 firms have updated their audit guides to reference automated testing procedures.
Force three: adjacent regulation expanded the audit surface. The EU AI Act's general-purpose AI obligations took effect in August 2025, with most high-risk obligations following in August 2026, extending AI governance controls into the SOX perimeter for multinationals. The Digital Operational Resilience Act (DORA) took effect in January 2025, bringing ICT risk management obligations that overlap with ITGC testing. CMMC 2.0 phased implementation through 2025–2028 adds defense-contractor control families. SOC 2 Trust Services Criteria CC7 (system operations) and CC8 (change management) continue evolving. ISO 42001 (AI management systems), published in 2023, is now the reference standard for AI-specific controls. The aggregate effect is a broader control population per company, which makes manual sampling economically untenable and pushes the mid-market toward continuous testing as the only tractable path.
The window for early movers is finite. Within 24 months the practice becomes standard and the economic arbitrage closes. Companies evaluating platforms in 2026 are timing the window correctly.
How does agent-driven continuous testing actually work?
Start with the taxonomy. Here is a simple schematic of the control-test type landscape in 2026.
Control-test taxonomy (2026)
HIGH-CONFIDENCE (agent-testable)
├── Access Review
│   ├── Role-entitlement alignment
│   ├── Orphan accounts
│   ├── Terminated-user access
│   └── Privileged access use
├── Change Management
│   ├── Deployment approval evidence
│   ├── Testing evidence
│   ├── SoD between developer and deployer
│   └── Emergency change documentation
├── ITGC Baseline
│   ├── Backup completion
│   ├── Job scheduling
│   ├── Incident response evidence
│   └── Vendor access logs
└── SoD (entitlement level)

MEDIUM-CONFIDENCE (partially agent-testable, human-anchored)
├── Reconciliation Controls
│   └── Variance identification (agent) + adequacy judgment (human)
└── Journal Entry Review
    └── High-risk JE flagging (agent) + accounting review (human)

LOW-CONFIDENCE / NO-GO (human-only)
├── Estimate Review (reserves, fair-value, impairment)
├── Management Review where "qualified human reviewed" IS the control
└── Anti-Fraud Program Review
A working continuous-testing architecture has five components: source-system integrations (identity, HRIS, ERP, cloud, source-control), an evidence ingestion layer that pulls read-only snapshots on the test cadence, a reasoning engine that interprets the evidence against the control narrative, an evidence persistence layer that cryptographically signs and hashes each test record, and a review queue where control owners and IA staff review exceptions.
A test execution, in concrete terms: the agent reads the control narrative for control AC-04 ("User access to the NetSuite production environment is reviewed quarterly by the control owner, with terminated users removed within 48 hours of HRIS termination date"). It pulls the current user population from NetSuite, the HRIS termination list from Workday for the review period, and the access review attestations from the company's review tooling. It reconciles the three data sources, identifies any NetSuite users whose Workday termination date was more than 48 hours before the access review period cutoff, flags exceptions, and produces a signed evidence record. The record contains the control ID, the source systems queried, the observed data snapshot with SHA-256 hash, the agent's interpretive reasoning trace, any exceptions identified with proposed remediation, and a sign-off field for the control owner.
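The reconciliation and the shape of the evidence record can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the control ID, the 48-hour rule, and SHA-256 hashing come from the AC-04 narrative above, while the function name, field names, and sample data are invented for the example.

```python
import hashlib
import json
from datetime import datetime, timedelta

TERMINATION_GRACE = timedelta(hours=48)  # removal window from the AC-04 narrative

def run_ac04_test(netsuite_users, workday_terminations, review_cutoff):
    """Flag NetSuite users still present more than 48 hours after their
    Workday termination date, and emit a hashed evidence record."""
    exceptions = []
    for user in netsuite_users:
        term_date = workday_terminations.get(user)
        if term_date and review_cutoff - term_date > TERMINATION_GRACE:
            exceptions.append({
                "user": user,
                "terminated": term_date.isoformat(),
                "finding": "retained access past 48-hour removal window",
            })
    snapshot = {
        "control_id": "AC-04",
        "sources": ["netsuite", "workday"],
        "population_size": len(netsuite_users),
        "cutoff": review_cutoff.isoformat(),
        "exceptions": exceptions,
    }
    # Hash the canonical JSON form so any later alteration invalidates the record.
    digest = hashlib.sha256(
        json.dumps(snapshot, sort_keys=True).encode()
    ).hexdigest()
    return {**snapshot, "sha256": digest, "owner_signoff": None}

# Illustrative data: "lee" was terminated nine days before the cutoff.
record = run_ac04_test(
    netsuite_users=["amy", "raj", "lee"],
    workday_terminations={"lee": datetime(2026, 3, 1)},
    review_cutoff=datetime(2026, 3, 10),
)
```

Note the design point: the record carries its own hash and an empty `owner_signoff` field, because the human sign-off happens after the agent produces the evidence, never before.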
The population is continuous rather than sample-based. For a quarterly review control with a population of 847 users, traditional sampling would examine 25 users and rely on statistical inference. Continuous testing examines all 847, every cycle. This is not faster sampling; it is a fundamentally different evidence profile.
Which control families are agent-testable — and which are not?
The honest answer by confidence tier:
High-confidence (agent-testable today): user access review — role-entitlement alignment, orphan accounts, terminated-user access, privileged access use. Change management — deployment approval evidence, testing evidence, separation-of-duties between developer and deployer, emergency change documentation. ITGC baseline — backup completion, job scheduling, incident response evidence, vendor access logs. Segregation of duties at the entitlement level (distinct from transaction-level SoD, which remains rule-based CCM territory).
Medium-confidence (partially agent-testable, human-anchored): reconciliation controls — the agent observes that the reconciliation was performed and that variances were investigated, but the judgment of whether the variance explanation is adequate remains largely human. Journal entry review — the agent identifies high-risk journals by materiality, timing, and user characteristics, but the accounting-judgment review of whether the journal is appropriate remains human.
Low-confidence and explicit no-go: estimate review (reserves, fair-value, impairment analysis) — requires accounting judgment the agent should not own. Management review controls where the control objective is literally "a qualified human reviewed this" — the agent observing the review does not constitute the review. Anti-fraud program review — requires investigative judgment. Complex revenue recognition assessment under ASC 606 — requires judgmental evaluation of contract terms, performance obligations, and variable consideration that exceeds what an agent should assert.
The key point: the right architecture is hybrid. Agents cover the high-volume, deterministic-enough families. Humans cover the judgmental families. A vendor claiming "we fully automate your SOX program" is either lying or has built something that will fail PCAOB review the first time a judgmental control is tested.
What does the PCAOB actually expect from automated evidence?
PCAOB AS 2201 (formerly Auditing Standard No. 5) establishes the framework for tests of controls. The sections that govern automated evidence are paragraph .39 (evidence of control operation), paragraph .42 (nature, timing, and extent of testing), and paragraphs .46–.50 (testing design and operating effectiveness).
Translated to automated testing requirements, the evidence must demonstrate: the control operated as designed; the control operated consistently over the period; the evidence is authentic and unaltered; the source of the evidence is reliable; and the testing is reperformable by the external auditor.
How agent-generated evidence meets these characteristics: SHA-256 hashing establishes authenticity — the evidence cannot be altered after signing without invalidating the hash. Continuous test cadence over the period demonstrates consistency — the control operated on every tested date, not just on the sample dates. Signed reasoning traces preserve reperformability — the external auditor sees the specific steps the agent took and can re-execute them. Read-only integrations with identity, ERP, and cloud systems preserve source reliability — the evidence originated at the system of record, not a re-uploaded screenshot.
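The integrity mechanics behind "signed and hashed" are standard cryptography, sketched below under stated assumptions: the key literal, function names, and record fields are illustrative, and a production system would hold the signing key in an HSM or KMS rather than in code.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key"  # illustrative only; real keys live in an HSM/KMS

def seal(record: dict) -> dict:
    """Attach an HMAC-SHA-256 signature computed over the canonical JSON form."""
    payload = json.dumps(record, sort_keys=True).encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"record": record, "signature": sig}

def verify(sealed: dict) -> bool:
    """Recompute the signature; any post-signing edit fails the check."""
    payload = json.dumps(sealed["record"], sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sealed["signature"])

sealed = seal({"control_id": "AC-04", "result": "pass"})
assert verify(sealed)

sealed["record"]["result"] = "fail"  # tamper with the evidence after signing
assert not verify(sealed)
```

This is what "the evidence cannot be altered after signing without invalidating the hash" means operationally: authenticity is checked by recomputation, not by trusting the file.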
Where agent-generated evidence can fail: platforms that do not preserve reasoning traces (the external auditor cannot reperform); platforms without cryptographic evidence integrity (authenticity is challenged at walkthrough); platforms that take remediation actions autonomously (this creates a control-design risk because the tester is also the actor, violating the separation between control execution and testing). Each of these is a common failure mode in immature continuous-testing products.
Adjacent framework alignment matters when the company runs multi-framework compliance. SOC 2 Trust Services Criteria CC7 (system operations) and CC8 (change management) overlap substantially with SOX ITGC testing. ISO 42001 specifies AI management system controls for AI-specific testing agents — relevant when the continuous-testing agent itself falls within the AI governance perimeter under EU AI Act obligations. DORA's ICT third-party risk provisions (Articles 28–30) overlap with ITGC vendor access logging. The mature continuous-testing platform produces evidence that maps to all applicable framework criteria without duplication of effort.
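One way to picture "maps to all applicable framework criteria without duplication of effort": tag each control family's evidence with every framework criterion it satisfies, file the record once, and reuse it across audits. The pairings below repeat only overlaps named in this section; the mapping structure itself is a hypothetical sketch.

```python
# Hypothetical control-family-to-framework mapping, using overlaps
# described in this section. A real platform would maintain this per
# control, not per family.
FRAMEWORK_MAP: dict[str, list[str]] = {
    "change_management": ["SOX ITGC", "SOC 2 CC8"],
    "system_operations": ["SOX ITGC", "SOC 2 CC7"],
    "vendor_access_logs": ["SOX ITGC", "DORA ICT third-party risk"],
    "ai_agent_governance": ["ISO 42001", "EU AI Act"],
}

def frameworks_covered(control_family: str) -> list[str]:
    """Return every framework criterion one test execution can evidence."""
    return FRAMEWORK_MAP.get(control_family, [])
```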
What changes when testing the full population instead of sampling?
Traditional SOX testing relies on sampling. For a quarterly user access review control, an auditor samples 25 users from a population of 1,200 and examines whether their access was reviewed appropriately. Statistical inference extends the conclusion from the sample to the population under a defined confidence level.
Continuous agent testing changes the math. The agent tests 100 percent of the population continuously. This is a fundamentally different evidence profile.
Practical implications: sampling can miss a deficiency entirely if the problematic user was not in the sample. Population testing cannot. This is both a benefit (higher assurance, real deficiencies surface) and a risk (deficiencies that sampling would have missed are now surfaced and must be remediated and disclosed). Companies transitioning from sample-based to population-based testing frequently see a spike in surfaced deficiencies in the first quarter, not because the controls got worse but because the testing got more rigorous.
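The arithmetic behind that first-quarter spike is simple hypergeometric probability: the chance a random sample contains none of the deficient items. A short sketch using the 25-of-1,200 figures from above (the function is illustrative, but the math is standard):

```python
from math import comb

def miss_probability(population: int, sample: int, deficient: int) -> float:
    """Probability a simple random sample of `sample` items drawn from
    `population` contains zero of the `deficient` items."""
    return comb(population - deficient, sample) / comb(population, sample)

# With 1 deficient user in 1,200 and a 25-user sample, the sample misses
# the deficiency about 97.9% of the time; with 5 deficient users, it
# still misses all of them roughly 90% of the time.
p_one = miss_probability(1200, 25, 1)
p_five = miss_probability(1200, 25, 5)

# Population testing is the degenerate case: sample == population, miss rate 0.
p_full = miss_probability(1200, 1200, 5)
```

This is why surfaced deficiencies jump when testing moves from samples to the full population: the deficiencies were always there, but a 25-item sample was overwhelmingly likely to pass over them.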
PCAOB perspective: this is one reason automated control testing acceptance has accelerated — the evidence quality is objectively higher than sample-based testing when the agent design is sound. Evidence sufficiency under paragraph .46 is more defensible with population testing than with a 25-user sample.
Caveat: population testing still requires the same rigor on control-design effectiveness. Testing 100 percent of a poorly designed control confirms that the poorly designed control operated as designed, which is not the objective. The design-effectiveness assessment under paragraph .42 remains the Internal Audit Director's job.
How should you evaluate a continuous control testing platform?
Six questions separate mature platforms from repackaged CCM.
For each control family the vendor claims to test, can they show you the reasoning trace for a test execution? If the answer is a rule-engine output, the product is CCM rebadged. Move on.
Is the evidence cryptographically signed, hashed, and immutable? Vendors who wave hands at this question have built something that will be challenged at walkthrough.
Will the vendor conduct an evidence walkthrough dry-run with your external audit partner before you commit? This is the single highest-signal evaluation step. Vendors confident in their evidence format will do this; vendors who are not will find excuses.
What is the human-in-the-loop model? A vendor claiming the agent acts fully autonomously (testing plus remediation) is incorrect on SOX design. The correct answer: agent tests, agent produces evidence, human control-owner signs evidence, human IA reviews population-level results, human IA Director evaluates design effectiveness.
What happens when the agent's reasoning is wrong? Every LLM-driven system has false positives and false negatives. Mature vendors surface these explicitly, provide a correction workflow, and feed corrections back into the agent's guidance. Immature vendors pretend the agent does not make mistakes.
Read the vendor's actual evidence export. If you cannot hand the export to your external auditor and have them understand it in 20 minutes, the format is wrong. The Prova vs. AuditBoard comparison provides a head-to-head reference for what an AS 2201-aligned evidence export looks like.
Who should use agent-driven continuous testing?
Honest fit: PE portfolio companies of 300–1,500 employees with a one-to-three-person IA function and a Big 4 or regional audit partner willing to review the evidence format. Pre-IPO companies of 300–1,500 employees with 12–18 months of 404(b) runway — sufficient for a walkthrough dry-run, onboarding, and agent calibration before year-end commitment. Sub-$1B public microcaps with active 404(b) programs and AuditBoard renewals they can no longer defend.
Honest miss: enterprises of 2,000+ employees with a fully staffed IA team of ten or more and a decade of AuditBoard history — the switching cost exceeds the savings. Companies whose external audit partner has explicitly refused to review an automated evidence format — rare in 2026, but it exists, and the platform decision cannot override the audit partner.
The six stage pages cover narrower persona verticals — 404(a)-only portcos, 404(b)-active public microcaps, pre-IPO readiness, multi-entity PE portcos — with decision frameworks calibrated to each trigger.
What does implementation actually look like?
Phase 1 (weeks 1–4): integrate read-only connections to identity, HR, cloud, ERP, and source-control systems; import the control matrix and narratives; configure the control-to-system mapping. Expect 40–80 hours of IA team time plus vendor implementation support.
Phase 2 (weeks 3–6): run initial agent test executions; review reasoning traces with IA and control owners; calibrate agent guidance where the reasoning misses context. Calibration continues for the life of the program, heaviest in the first quarter.
Phase 3 (weeks 5–8): external audit walkthrough dry-run. Present three to five test executions end-to-end; receive feedback on evidence format; remediate before year-end. If the audit partner rejects the format, exit the platform commitment before year-end ties you to the evidence trail.
Phase 4 (steady-state): continuous testing on the rolling cadence; IA reviews population-level trends, investigates exceptions, documents remediations. A mid-market SOX program that consumed 3,000 internal audit hours per year typically sees that reduced to 1,200–1,800 — saved hours redirect to higher-judgment work (deficiency evaluation, management review, external audit coordination).
The takeaway
Continuous agent-driven control testing is a real and material shift in SOX economics, but it is not a universal automation. The honest architecture is hybrid: agents cover high-volume deterministic control families; humans cover judgmental families. The right vendor claim is "we automate 30–45 percent of your control population and surface the rest for human judgment more efficiently." Anything more aggressive is marketing hyperbole that fails at walkthrough.
The PCAOB evidence bar is clear and achievable. Vendors whose output preserves reasoning traces, cryptographically signs evidence, and supports external-auditor walkthrough reperformance will pass AS 2201 scrutiny. Vendors who hand-wave these requirements will not.
For Controllers and Internal Audit Directors evaluating platforms, the highest-signal step is the external-auditor walkthrough dry-run. Every other question is secondary. If your Big 4 or regional audit partner accepts the evidence format in a pre-commitment walkthrough, the platform works. If they do not, it does not — regardless of marketing claims.
Strategic read: the mid-market SOX program that moves to agentic testing in 2026 gets 12–24 months of economic advantage before the practice becomes standard. The window is finite, the decision is tractable, and the evidence bar is defensible if you evaluate carefully. Request a design partner slot if you are ready to run the walkthrough.