Continuous control testing is the continuous execution of control-effectiveness tests against the full control population, producing signed, reperformable evidence aligned to PCAOB AS 2201 paragraphs .39, .42, and .46–.50. In 2026, the operative change is not the testing cadence itself — rule-based continuous controls monitoring has existed since the mid-2000s — but the arrival of LLM-driven reasoning agents that can read policies, interpret evidence, and produce walkthrough-grade test records for roughly 30–45 percent of a typical mid-market control population. The remaining controls still require human judgment. The right architecture is hybrid: agents cover high-volume deterministic families (access review, change management, ITGC baseline); humans cover judgmental families (estimates, journal entry review, anti-fraud). The PCAOB evidence bar is clear and achievable for platforms that preserve reasoning traces, cryptographically sign evidence, and support external-auditor walkthrough reperformability.
This primer is for the Controller, Internal Audit Director, or Risk/Compliance Lead evaluating whether agent-driven continuous testing can credibly sit inside a PCAOB-aligned SOX program. The reader is assumed audit-literate and appropriately skeptical. The goal is not to pitch continuous testing as a universal automation — that claim routinely fails at external audit walkthrough — but to give an honest account of what works today, what the regulator actually expects, and how to evaluate a platform without hand-waving.
What is continuous control testing in 2026?
Continuous control testing is the execution of tests of control operating effectiveness against the full relevant population on a rolling cadence, with each test execution producing an evidence record suitable for PCAOB AS 2201 walkthrough reperformance. The phrase has been in SOX programs since the mid-2000s, but it has meant three different things in three different generations.
The first generation was rule-based continuous controls monitoring (CCM): pre-configured rules that observed transactions in SAP, Oracle, NetSuite, and other ERP systems and flagged violations. Vendors like Pathlock (formerly Greenlight Technologies) and Fastpath pioneered this category for segregation-of-duties enforcement and transaction-level fraud detection. Useful, narrow, and mostly confined to the ERP layer.
The second generation was scheduled-sample automated testing: scripts that pulled evidence on a cadence — backup completion logs, job scheduling output, patching status — and compared it to expected values. Useful for ITGC where the control's operating expectation was deterministic and the evidence was structured. This generation extended CCM beyond the ERP into infrastructure and cloud.
The third generation, arriving 2024–2026, is agentic reasoning testing. LLM-driven agents that read control narratives, pull relevant evidence from source systems, reason about whether the evidence supports the control operating effectively, and produce signed test records with preserved reasoning traces. A reasoning agent can read an access-review policy, observe actual provisioning activity in Okta and Workday, and make a judgment like "this terminated user retained Snowflake access for 14 days post-termination, which violates the policy" — a judgment that previously required a human auditor reviewing a user access review spreadsheet over a weekend.
The distinction between generation three and the earlier generations matters for evaluation. Not every "AI control testing" vendor ships real reasoning. Many ship rule-based CCM with an AI-colored wrapper, and the difference is not always visible in marketing. The evaluation test is simple: does the test execution output include a reasoning trace the agent produced, or just a rule-match flag? If only the flag, the product is generation-one CCM rebadged. Move on.
Why is this shift happening now, and why 2026 specifically?
Three forces converged in 2025–2026 that make agent-driven continuous testing economically and regulatorily viable for the first time.
Force one: LLM capability crossed the audit-judgment threshold. GPT-4-class models released in 2023–2024 could produce plausible-looking control interpretations but routinely failed at audit-grade reasoning — they missed context, hallucinated policy language, and produced reasoning that would not withstand partner review. The 2025–2026 generation of models, combined with retrieval-augmented generation against the company's actual policy corpus and control matrix, produces interpretations that pass partner-level scrutiny on deterministic control families. The bar moved in one model generation.
Force two: PCAOB inspection posture normalized automated evidence. PCAOB inspection reports from late 2025 and early 2026 specifically addressed automated control testing. The tone has been cautiously accepting: automated evidence is valid when the design meets AS 2201 expectations, but firms are held to a sufficiency standard. This is the correct posture for the industry and the right level of rigor. Public inspection reports referencing agent-produced evidence have appeared, and Big 4 firms have updated their audit guides to reference automated testing procedures.
Force three: adjacent regulation expanded the audit surface. The EU AI Act's general-purpose AI obligations took effect in August 2025, with most high-risk obligations following in August 2026, extending AI governance controls into the SOX perimeter for multinationals. The Digital Operational Resilience Act (DORA) took effect in January 2025, bringing ICT risk management obligations that overlap with ITGC testing. CMMC 2.0 phased implementation through 2025–2028 adds defense-contractor control families. SOC 2 Trust Services Criteria CC7 (system operations) and CC8 (change management) continue evolving. ISO 42001 (AI management systems), published in 2023, is now the reference standard for AI-specific controls. The aggregate effect is a broader control population per company, which makes manual sampling economically untenable and pushes the mid-market toward continuous testing as the only tractable path.
The window for early movers is finite. Within 24 months the practice becomes standard and the economic arbitrage closes. Companies evaluating platforms in 2026 are timing the window correctly.
How does agent-driven continuous testing actually work?
Start with the taxonomy. Here is a simple schematic of the control-test type landscape in 2026.
Control-test taxonomy (2026)
HIGH-CONFIDENCE (agent-testable)
├── Access Review
│   ├── Role-entitlement alignment
│   ├── Orphan accounts
│   ├── Terminated-user access
│   └── Privileged access use
├── Change Management
│   ├── Deployment approval evidence
│   ├── Testing evidence
│   ├── SoD between developer and deployer
│   └── Emergency change documentation
├── ITGC Baseline
│   ├── Backup completion
│   ├── Job scheduling
│   ├── Incident response evidence
│   └── Vendor access logs
└── SoD (entitlement level)

MEDIUM-CONFIDENCE (partially agent-testable, human-anchored)
├── Reconciliation Controls
│   └── Variance identification (agent) + adequacy judgment (human)
└── Journal Entry Review
    └── High-risk JE flagging (agent) + accounting review (human)

LOW-CONFIDENCE / NO-GO (human-only)
├── Estimate Review (reserves, fair-value, impairment)
├── Management Review where "qualified human reviewed" IS the control
└── Anti-Fraud Program Review
A working continuous-testing architecture has five components: source-system integrations (identity, HRIS, ERP, cloud, source-control), an evidence ingestion layer that pulls read-only snapshots on the test cadence, a reasoning engine that interprets the evidence against the control narrative, an evidence persistence layer that cryptographically signs and hashes each test record, and a review queue where control owners and IA staff review exceptions.
A test execution, in concrete terms: the agent reads the control narrative for control AC-04 ("User access to the NetSuite production environment is reviewed quarterly by the control owner, with terminated users removed within 48 hours of HRIS termination date"). It pulls the current user population from NetSuite, the HRIS termination list from Workday for the review period, and the access review attestations from the company's review tooling. It reconciles the three data sources, identifies any NetSuite users whose Workday termination date was more than 48 hours before the access review period cutoff, flags exceptions, and produces a signed evidence record. The record contains the control ID, the source systems queried, the observed data snapshot with SHA-256 hash, the agent's interpretive reasoning trace, any exceptions identified with proposed remediation, and a sign-off field for the control owner.
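The reconciliation and the shape of the evidence record can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the control ID, the 48-hour rule, and SHA-256 hashing come from the AC-04 narrative above, while the function name, field names, and sample data are invented for the example.

```python
import hashlib
import json
from datetime import datetime, timedelta

TERMINATION_GRACE = timedelta(hours=48)  # removal window from the AC-04 narrative

def run_ac04_test(netsuite_users, workday_terminations, review_cutoff):
    """Flag NetSuite users still present more than 48 hours after their
    Workday termination date, and emit a hashed evidence record."""
    exceptions = []
    for user in netsuite_users:
        term_date = workday_terminations.get(user)
        if term_date and review_cutoff - term_date > TERMINATION_GRACE:
            exceptions.append({
                "user": user,
                "terminated": term_date.isoformat(),
                "finding": "retained access past 48-hour removal window",
            })
    snapshot = {
        "control_id": "AC-04",
        "sources": ["netsuite", "workday"],
        "population_size": len(netsuite_users),
        "cutoff": review_cutoff.isoformat(),
        "exceptions": exceptions,
    }
    # Hash the canonical JSON form so any later alteration invalidates the record.
    digest = hashlib.sha256(
        json.dumps(snapshot, sort_keys=True).encode()
    ).hexdigest()
    return {**snapshot, "sha256": digest, "owner_signoff": None}

# Illustrative data: "lee" was terminated nine days before the cutoff.
record = run_ac04_test(
    netsuite_users=["amy", "raj", "lee"],
    workday_terminations={"lee": datetime(2026, 3, 1)},
    review_cutoff=datetime(2026, 3, 10),
)
```

Note the design point: the record carries its own hash and an empty `owner_signoff` field, because the human sign-off happens after the agent produces the evidence, never before.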
The population is continuous rather than sample-based. For a quarterly review control with a population of 847 users, traditional sampling would examine 25 users and rely on statistical inference. Continuous testing examines all 847, every cycle. This is not faster sampling; it is a fundamentally different evidence profile.
Which control families are agent-testable — and which are not?
The honest answer by confidence tier:
High-confidence (agent-testable today): user access review — role-entitlement alignment, orphan accounts, terminated-user access, privileged access use. Change management — deployment approval evidence, testing evidence, separation-of-duties between developer and deployer, emergency change documentation. ITGC baseline — backup completion, job scheduling, incident response evidence, vendor access logs. Segregation of duties at the entitlement level (distinct from transaction-level SoD, which remains rule-based CCM territory).
Medium-confidence (partially agent-testable, human-anchored): reconciliation controls — the agent observes that the reconciliation was performed and that variances were investigated, but the judgment of whether the variance explanation is adequate remains largely human. Journal entry review — the agent identifies high-risk journals by materiality, timing, and user characteristics, but the accounting-judgment review of whether the journal is appropriate remains human.
Low-confidence and explicit no-go: estimate review (reserves, fair-value, impairment analysis) — requires accounting judgment the agent should not own. Management review controls where the control objective is literally "a qualified human reviewed this" — the agent observing the review does not constitute the review. Anti-fraud program review — requires investigative judgment. Complex revenue recognition assessment under ASC 606 — requires judgmental evaluation of contract terms, performance obligations, and variable consideration that exceeds what an agent should assert.
The key point: the right architecture is hybrid. Agents cover the high-volume, deterministic-enough families. Humans cover the judgmental families. A vendor claiming "we fully automate your SOX program" is either lying or has built something that will fail PCAOB review the first time a judgmental control is tested.
What does the PCAOB actually expect from automated evidence?
PCAOB AS 2201 (formerly Auditing Standard No. 5) establishes the framework for tests of controls. The sections that govern automated evidence are paragraph .39 (evidence of control operation), paragraph .42 (nature, timing, and extent of testing), and paragraphs .46–.50 (testing design and operating effectiveness).
Translated to automated testing requirements, the evidence must demonstrate: the control operated as designed; the control operated consistently over the period; the evidence is authentic and unaltered; the source of the evidence is reliable; and the testing is reperformable by the external auditor.
How agent-generated evidence meets these characteristics: SHA-256 hashing establishes authenticity — the evidence cannot be altered after signing without invalidating the hash. Continuous test cadence over the period demonstrates consistency — the control operated on every tested date, not just on the sample dates. Signed reasoning traces preserve reperformability — the external auditor sees the specific steps the agent took and can re-execute them. Read-only integrations with identity, ERP, and cloud systems preserve source reliability — the evidence originated at the system of record, not a re-uploaded screenshot.
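The integrity mechanics behind "signed and hashed" are standard cryptography, sketched below under stated assumptions: the key literal, function names, and record fields are illustrative, and a production system would hold the signing key in an HSM or KMS rather than in code.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key"  # illustrative only; real keys live in an HSM/KMS

def seal(record: dict) -> dict:
    """Attach an HMAC-SHA-256 signature computed over the canonical JSON form."""
    payload = json.dumps(record, sort_keys=True).encode()
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"record": record, "signature": sig}

def verify(sealed: dict) -> bool:
    """Recompute the signature; any post-signing edit fails the check."""
    payload = json.dumps(sealed["record"], sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sealed["signature"])

sealed = seal({"control_id": "AC-04", "result": "pass"})
assert verify(sealed)

sealed["record"]["result"] = "fail"  # tamper with the evidence after signing
assert not verify(sealed)
```

This is what "the evidence cannot be altered after signing without invalidating the hash" means operationally: authenticity is checked by recomputation, not by trusting the file.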
Where agent-generated evidence can fail: platforms that do not preserve reasoning traces (the external auditor cannot reperform); platforms without cryptographic evidence integrity (authenticity is challenged at walkthrough); platforms that take remediation actions autonomously (this creates a control-design risk because the tester is also the actor, violating the separation between control execution and testing). Each of these is a common failure mode in immature continuous-testing products.
Adjacent framework alignment matters when the company runs multi-framework compliance. SOC 2 Trust Services Criteria CC7 (system operations) and CC8 (change management) overlap substantially with SOX ITGC testing. ISO 42001 specifies AI management system controls for AI-specific testing agents — relevant when the continuous-testing agent itself falls within the AI governance perimeter under EU AI Act obligations. DORA's ICT third-party risk provisions (Articles 28–30) overlap with ITGC vendor access logging. The mature continuous-testing platform produces evidence that maps to all applicable framework criteria without duplication of effort.
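One way to picture "maps to all applicable framework criteria without duplication of effort": tag each control family's evidence with every framework criterion it satisfies, file the record once, and reuse it across audits. The pairings below repeat only overlaps named in this section; the mapping structure itself is a hypothetical sketch.

```python
# Hypothetical control-family-to-framework mapping, using overlaps
# described in this section. A real platform would maintain this per
# control, not per family.
FRAMEWORK_MAP: dict[str, list[str]] = {
    "change_management": ["SOX ITGC", "SOC 2 CC8"],
    "system_operations": ["SOX ITGC", "SOC 2 CC7"],
    "vendor_access_logs": ["SOX ITGC", "DORA ICT third-party risk"],
    "ai_agent_governance": ["ISO 42001", "EU AI Act"],
}

def frameworks_covered(control_family: str) -> list[str]:
    """Return every framework criterion one test execution can evidence."""
    return FRAMEWORK_MAP.get(control_family, [])
```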
What changes when testing the full population instead of sampling?
Traditional SOX testing relies on sampling. For a quarterly user access review control, an auditor samples 25 users from a population of 1,200 and examines whether their access was reviewed appropriately. Statistical inference extends the conclusion from the sample to the population under a defined confidence level.
Continuous agent testing changes the math. The agent tests 100 percent of the population continuously. This is a fundamentally different evidence profile.
Practical implications: sampling can miss a deficiency entirely if the problematic user was not in the sample. Population testing cannot. This is both a benefit (higher assurance, real deficiencies surface) and a risk (deficiencies that sampling would have missed are now surfaced and must be remediated and disclosed). Companies transitioning from sample-based to population-based testing frequently see a spike in surfaced deficiencies in the first quarter, not because the controls got worse but because the testing got more rigorous.
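The arithmetic behind that first-quarter spike is simple hypergeometric probability: the chance a random sample contains none of the deficient items. A short sketch using the 25-of-1,200 figures from above (the function is illustrative, but the math is standard):

```python
from math import comb

def miss_probability(population: int, sample: int, deficient: int) -> float:
    """Probability a simple random sample of `sample` items drawn from
    `population` contains zero of the `deficient` items."""
    return comb(population - deficient, sample) / comb(population, sample)

# With 1 deficient user in 1,200 and a 25-user sample, the sample misses
# the deficiency about 97.9% of the time; with 5 deficient users, it
# still misses all of them roughly 90% of the time.
p_one = miss_probability(1200, 25, 1)
p_five = miss_probability(1200, 25, 5)

# Population testing is the degenerate case: sample == population, miss rate 0.
p_full = miss_probability(1200, 1200, 5)
```

This is why surfaced deficiencies jump when testing moves from samples to the full population: the deficiencies were always there, but a 25-item sample was overwhelmingly likely to pass over them.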
PCAOB perspective: this is one reason automated control testing acceptance has accelerated — the evidence quality is objectively higher than sample-based testing when the agent design is sound. Evidence sufficiency under paragraph .46 is more defensible with population testing than with a 25-user sample.
Caveat: population testing still requires the same rigor on control-design effectiveness. Testing 100 percent of a poorly designed control confirms that the poorly designed control operated as designed, which is not the objective. The design-effectiveness assessment under paragraph .42 remains the Internal Audit Director's job.
How should you evaluate a continuous control testing platform?
Six questions separate mature platforms from repackaged CCM.
For each control family the vendor claims to test, can they show you the reasoning trace for a test execution? If the answer is a rule-engine output, the product is CCM rebadged. Move on.
Is the evidence cryptographically signed, hashed, and immutable? Vendors who wave hands at this question have built something that will be challenged at walkthrough.
Will the vendor conduct an evidence walkthrough dry-run with your external audit partner before you commit? This is the single highest-signal evaluation step. Vendors confident in their evidence format will do this; vendors who are not will find excuses.
What is the human-in-the-loop model? A vendor claiming the agent acts fully autonomously (testing plus remediation) is incorrect on SOX design. The correct answer: agent tests, agent produces evidence, human control-owner signs evidence, human IA reviews population-level results, human IA Director evaluates design effectiveness.
What happens when the agent's reasoning is wrong? Every LLM-driven system has false positives and false negatives. Mature vendors surface these explicitly, provide a correction workflow, and feed corrections back into the agent's guidance. Immature vendors pretend the agent does not make mistakes.
Read the vendor's actual evidence export. If you cannot hand the export to your external auditor and have them understand it in 20 minutes, the format is wrong. The Prova vs. AuditBoard comparison provides a head-to-head reference for what an AS 2201-aligned evidence export looks like.
Who should use agent-driven continuous testing?
Honest fit: PE portfolio companies of 300–1,500 employees with a one-to-three-person IA function and a Big 4 or regional audit partner willing to review the evidence format. Pre-IPO companies of 300–1,500 employees with 12–18 months of 404(b) runway — sufficient for a walkthrough dry-run, onboarding, and agent calibration before year-end commitment. Sub-$1B public microcaps with active 404(b) programs and AuditBoard renewals they can no longer defend.
Honest miss: enterprises of 2,000+ employees with a fully staffed IA team of ten or more and a decade of AuditBoard history — the switching cost exceeds the savings. Companies whose external audit partner has explicitly refused to review an automated evidence format — rare in 2026, but it exists, and the platform decision cannot override the audit partner.
The six stage pages cover narrower persona verticals — 404(a)-only portcos, 404(b)-active public microcaps, pre-IPO readiness, multi-entity PE portcos — with decision frameworks calibrated to each trigger.
What does implementation actually look like?
Phase 1 (weeks 1–4): integrate read-only connections to identity, HR, cloud, ERP, and source-control systems; import the control matrix and narratives; configure the control-to-system mapping. Expect 40–80 hours of IA team time plus vendor implementation support.
Phase 2 (weeks 3–6): run initial agent test executions; review reasoning traces with IA and control owners; calibrate agent guidance where the reasoning misses context. Calibration continues for the life of the program, heaviest in the first quarter.
Phase 3 (weeks 5–8): external audit walkthrough dry-run. Present three to five test executions end-to-end; receive feedback on evidence format; remediate before year-end. If the audit partner rejects the format, exit the platform commitment before year-end ties you to the evidence trail.
Phase 4 (steady-state): continuous testing on the rolling cadence; IA reviews population-level trends, investigates exceptions, documents remediations. A mid-market SOX program that consumed 3,000 internal audit hours per year typically sees that reduced to 1,200–1,800 — saved hours redirect to higher-judgment work (deficiency evaluation, management review, external audit coordination).
The takeaway
Continuous agent-driven control testing is a real and material shift in SOX economics, but it is not a universal automation. The honest architecture is hybrid: agents cover high-volume deterministic control families; humans cover judgmental families. The right vendor claim is "we automate 30–45 percent of your control population and surface the rest for human judgment more efficiently." Anything more aggressive is marketing hyperbole that fails at walkthrough.
The PCAOB evidence bar is clear and achievable. Vendors whose output preserves reasoning traces, cryptographically signs evidence, and supports external-auditor walkthrough reperformance will pass AS 2201 scrutiny. Vendors who hand-wave these requirements will not.
For Controllers and Internal Audit Directors evaluating platforms, the highest-signal step is the external-auditor walkthrough dry-run. Every other question is secondary. If your Big 4 or regional audit partner accepts the evidence format in a pre-commitment walkthrough, the platform works. If they do not, it does not — regardless of marketing claims.
Strategic read: the mid-market SOX program that moves to agentic testing in 2026 gets 12–24 months of economic advantage before the practice becomes standard. The window is finite, the decision is tractable, and the evidence bar is defensible if you evaluate carefully. Request a design partner slot if you are ready to run the walkthrough.