Secure Agentic Code Generation Process for AI Security Teams

TL;DR

We are not limited to only big LLMs for secure coding, but we should use stronger models for ambiguous security reasoning, architecture tradeoffs, and high-risk code review.
The current evidence says model size alone is not a reliable security control. Veracode’s Spring 2026 update reports that only 55% of tested AI code-generation tasks produced secure code, and that the earlier small advantage from model size has largely disappeared.
Earlier research found similar warning signs: Copilot produced vulnerable programs in about 40% of tested scenarios, and a user study found participants with an AI assistant wrote significantly less secure code while being more confident in their result.
A secure agentic workflow should treat the model as one component inside a controlled SDLC: security requirements, trusted context, sandboxed execution, least-privilege tools, static analysis, dependency scanning, secret scanning, adversarial evals, and human approval gates.
Unit tests and E2E tests are part of the security control surface: unit tests lock down validation and authorization rules, while E2E tests prove that real user flows preserve those controls across browser, API, session, and data layers.
The practical goal is not “vulnerability-free code.” That claim is too strong. The goal is a repeatable process that reduces vulnerability introduction, catches common failures before merge, and produces auditable evidence.

The hypothesis is tempting:

For security-sensitive code generation, maybe only the biggest LLMs are good enough.

My answer after looking at the current evidence is:

Bigger models help, but they are not the control. The control is the agentic process around the model.

That distinction matters. A frontier model can reason better about trust boundaries, authentication flows, threat models, and subtle implementation details. But a bigger model still reads untrusted context, inherits insecure patterns from training data, misunderstands app-specific policy, and can execute unsafe tool calls if the runtime lets it.

In security, “the model is smart” is not a defense. A secure agentic coding system needs a process that assumes the model will sometimes be wrong.

What the Evidence Actually Says

The strongest current signal is that AI-generated code security is not improving at the same pace as functional correctness.

Veracode’s Spring 2026 GenAI Code Security Update tested major model releases across 80 coding tasks, four languages, and four CWE classes: SQL injection, cross-site scripting, log injection, and insecure cryptography. Their headline result was uncomfortable: across all models and tasks, only 55% of generation tasks resulted in secure code. Syntax correctness exceeded 95%, but security pass rates stayed roughly flat. Veracode also reported that model size had only a very small effect in its October 2025 update, and that this marginal difference had mostly disappeared in newer releases.

That supports one part of your hypothesis and rejects another:

yes, capability matters for complex security reasoning
no, model size alone is not enough to trust generated code

This is consistent with earlier academic work. In “Asleep at the Keyboard?”, Pearce et al. generated 1,689 Copilot programs across 89 security-relevant scenarios and found approximately 40% were vulnerable. In “Do Users Write More Insecure Code with AI Assistants?”, Perry et al. found that participants with access to an AI assistant wrote significantly less secure code than participants without one, and were also more likely to believe their code was secure.

The newest benchmark direction points the same way. CyberSecEval 4 includes AutoPatchBench, which evaluates whether an LLM agent can automatically patch security vulnerabilities in native code. A.S.E, a repository-level benchmark for AI-generated code security, reports that current LLMs still struggle in realistic repo-level settings and that a larger reasoning budget does not necessarily produce better secure code. “Rethinking the Evaluation of Secure Code Generation” also warns that secure-code-generation techniques can trade away functionality, and that relying on one static analyzer can hide real risk.

The pattern is clear:

Evidence	What it suggests for engineers
AI-generated code can be syntactically correct while still vulnerable	Build security checks into the workflow, not after it
Bigger models do not consistently eliminate vulnerabilities	Use model routing, but do not treat model choice as the security boundary
Users can become more confident while producing less secure code	Require evidence: tests, scans, review notes, and trace logs
Repo-level tasks are harder than snippet tasks	Evaluate agents inside real project context, not only toy prompts
Agentic tool use creates prompt-injection and exfiltration risk	Constrain tools, network, permissions, and external content

Why Agentic Coding Is a Different Security Problem

Classic code generation risk is mostly about the code text:

insecure SQL construction
weak crypto
missing authorization checks
unsafe deserialization
bad escaping
hardcoded secrets

Agentic coding adds another layer: the agent can read files, run commands, call tools, browse the web, install dependencies, modify code, and sometimes push changes. That means the security question becomes:

What can the agent do when it is wrong, confused, or manipulated?

OWASP’s 2025 Top 10 for LLM and GenAI applications names risks that map directly to coding agents: prompt injection, supply chain risk, improper output handling, excessive agency, sensitive information disclosure, and unbounded consumption. The OWASP Agentic AI threats guide frames this as a threat-modeling problem, not just a prompt-quality problem.

OpenAI’s agent safety guidance says to design workflows so untrusted data never directly drives agent behavior, to use structured outputs, keep tool confirmations on, use guardrails, and run trace graders and evals. OpenAI’s Codex internet-access docs call out prompt injection from untrusted web content, exfiltration of code or secrets, malware or vulnerable dependencies, and license-risky content as reasons to keep network access limited.

Anthropic’s Claude Code security docs use a similar architecture: read-only defaults, explicit approval for sensitive operations, sandboxed bash, write boundaries, command blocklists for risky web-fetching commands, and user responsibility for reviewing proposed code and commands.

Google’s AI security strategy gives a concise agentic security principle: agents need well-defined human controllers, carefully limited powers, and observable actions and planning.

That is the current industry consensus: secure agents are controlled systems.

The Current Secure Agentic Coding Process

Here is the process I would use for an engineering team building code with AI agents in 2026.

1. Start With a Security Spec, Not a Coding Prompt

Before the agent writes code, give it a security contract:

assets: secrets, customer data, credentials, tokens, model outputs, logs
trust boundaries: browser, API, worker, database, third-party service, agent tool
attacker-controlled input: request body, markdown, issue text, dependency README, file upload, tool result
required controls: authn, authz, validation, escaping, rate limits, audit logs, safe defaults
forbidden patterns: raw SQL string concatenation, eval, hardcoded secrets, weak crypto, broad IAM permissions, unpinned remote scripts
acceptance evidence: tests, scanner output, review checklist, threat-model notes

This turns “build feature X” into “build feature X under these security constraints.”

A good agent should produce a plan that includes the security assumptions before touching files. If the agent cannot explain the trust boundary, it is not ready to generate the implementation.

2. Split the Workflow Into Roles

Use different roles even if they run on the same underlying model:

Planner: turns the feature request into a small implementation plan and threat model
Implementer: writes the smallest code change that satisfies the plan
Security reviewer: reviews the diff against OWASP, CWE, project policy, and data-flow risks
Verifier: runs tests, SAST, SCA, secret scanning, linting, and targeted exploit checks
Release gate: decides whether the change needs human security review before merge

This is not theater. It forces the system to separate generation from judgment. The agent that wrote the code should not be the only authority saying the code is secure.

3. Route Models by Risk, Not Ego

You do not need a frontier model for every step. You do need one where ambiguity and reasoning depth matter.

Use a stronger model for:

threat modeling
auth and permissions design
cryptography choices
multi-file refactors touching trust boundaries
code review of security-sensitive diffs
deciding whether a scanner finding is exploitable

Use a smaller or cheaper model for:

converting scanner output into issues
summarizing diffs
checking checklist completeness
classifying files by risk category
generating boilerplate tests from an explicit spec

Use deterministic tools for:

SAST
SCA and dependency vulnerability scanning
secret scanning
type checking
linting
formatting
unit and integration tests
IaC and container scanning

The lesson from the evidence is not “small models are safe.” It is “model size is the wrong place to put the security boundary.”

4. Treat Retrieved Context as Untrusted Data

Coding agents often read GitHub issues, docs, READMEs, package pages, Stack Overflow snippets, internal tickets, and web pages. Every one of those can contain instructions that are not part of your task.

The rule should be:

External content can inform the task, but it cannot grant permissions, change policy, request secrets, or cause tool execution.

Implementation patterns:

wrap retrieved text in explicit untrusted_content blocks
extract structured facts from untrusted content before planning
require citations or file references for factual claims
do not pass untrusted text into developer/system instruction channels
block network egress by default during code execution
allowlist dependency hosts and package registries
require human approval for commands that read secrets, upload data, install packages, or modify infra

This is especially important for coding agents because malicious instructions can hide in places engineers normally skim: issue descriptions, comments, dependency install scripts, docs, test fixtures, and generated markdown.

5. Put Tool Permissions on a Diet

Agentic security is mostly permission design.

Give the agent the least authority needed for the current step:

read-only repo access during planning
write access only to the working directory
no production credentials
no default cloud admin permissions
no unrestricted internet
no automatic package installation from arbitrary URLs
no direct write access to production databases
no ability to merge, deploy, rotate secrets, or approve its own PR

For local execution, use a sandbox or container. For cloud execution, use ephemeral credentials, narrow network allowlists, and scoped service accounts. For MCP or connector tools, separate read tools from write tools and require approvals for side effects.

The agent should be productive inside a small box. If it needs to leave the box, that is a security event worth reviewing.

6. Generate Code in Small, Reviewable Changes

Secure code generation gets harder as the diff gets larger.

Good agentic changes are:

one feature or fix at a time
small enough to review line by line
covered by tests in the same PR
explicit about data flow
explicit about error handling
explicit about auth and permission checks
traceable back to a spec or issue

Bad agentic changes are:

“refactor the whole auth system”
“modernize the API layer”
“make this secure”
“add tests everywhere”
hundreds of lines across unrelated modules

The agent should optimize for reviewability, not maximum code volume.

7. Run Security Checks Before the Human Review

Every agentic coding PR should attach evidence.

Minimum checks:

unit tests
integration tests for changed API boundaries
type check
linter
SAST for the relevant language
dependency vulnerability scan
secret scan
package lockfile review
license scan if internet retrieval or new dependencies were involved

Security-sensitive changes should add:

abuse-case tests
authz negative tests
input validation tests
SQL injection or XSS regression tests where relevant
fuzzing for parsers, file inputs, and protocol handlers
IaC scanning for cloud or Kubernetes changes
container scanning for deployable artifacts

The verifier agent can summarize the evidence, but deterministic tools should produce the underlying signal.

8. Use Unit and E2E Tests as Security Evidence

Unit tests and E2E tests do different security jobs. You need both.

Unit tests are the fastest way to turn security rules into executable invariants. They are especially useful for:

input validation: malicious payloads are rejected before they reach storage, rendering, or command execution
output encoding and escaping: untrusted strings are transformed consistently before display
authorization logic: users without a role, scope, tenant, or ownership relationship are denied
rate-limit and lockout rules: counters, windows, and reset behavior work at the function level
crypto wrappers: only approved algorithms, modes, and key sizes are accepted
secret handling: redaction utilities remove tokens, API keys, and credentials from logs and errors
parser behavior: malformed JSON, markdown, HTML, CSV, YAML, or uploaded files fail closed

For AI-generated code, unit tests have an extra benefit: they force the agent to express the security requirement in a form the repo can keep. A prompt disappears. A unit test stays in CI.

Good security unit tests are usually negative tests:

it("denies access when the user belongs to a different tenant", () => {
  const user = { id: "u_1", tenantId: "tenant_a", role: "member" };
  const resource = { id: "doc_1", tenantId: "tenant_b" };

  expect(canReadDocument(user, resource)).toBe(false);
});

it("redacts API tokens before writing structured logs", () => {
  const logEvent = redactSecrets({
    message: "request failed",
    authorization: "Bearer sk_live_123",
  });

  expect(JSON.stringify(logEvent)).not.toContain("sk_live_123");
});

E2E tests are slower, but they catch the failures that unit tests intentionally do not model:

browser-to-API auth flows
cookie flags and session persistence
CSRF behavior
tenant switching and deep links
file upload and download flows
XSS regression paths through real rendering
password reset and invitation flows
audit log creation after sensitive actions
permission failures as seen by real users
redirects, headers, and route guards

In an agentic coding process, E2E tests should be selective. Do not ask the agent to generate hundreds of brittle browser tests. Ask it to generate high-value security journeys:

anonymous user cannot reach authenticated routes
member cannot access another tenant’s resource by changing the URL
admin action writes an audit event with actor, target, action, timestamp, and request ID
stored user input renders as text, not executable HTML
logout invalidates the session and blocks back-button access
destructive action requires confirmation and fails without CSRF protection

The audit value is important. Security review is easier when a PR can say:

the authorization rule is covered by unit tests
the browser flow is covered by E2E tests
the audit trail is asserted in the test
the test names map to the risk in the threat model
the CI run links prove the evidence before merge

OWASP ASVS is useful here because it gives teams a common vocabulary for verification requirements. OWASP WSTG is useful because it gives testers a structured web-security testing methodology. NIST SSDF is useful because it frames testing as part of reducing vulnerabilities, mitigating exploitation impact, and addressing root causes.

For agentic code generation, the practical policy should be:

Change type	Required test evidence
Pure UI copy or styling	build, lint, visual smoke test if layout risk exists
Validation or parsing	unit tests for valid, invalid, malformed, oversized, and adversarial inputs
Authorization or tenancy	unit tests for allow/deny matrix plus E2E test for one real denied path
Authentication or session	unit tests for token/session helpers plus E2E tests for login, logout, expiry, and protected route behavior
Sensitive data handling	unit tests for redaction plus E2E/API test proving secrets do not appear in UI, logs, or exported data
Agent tool or workflow permission	unit tests for policy decisions plus E2E or integration test proving disallowed actions are blocked before execution
Audit logging	unit tests for event shape plus E2E/API test proving sensitive actions create durable audit records

This turns testing into review evidence instead of a vague quality ritual.

9. Add an Agentic Security Review Checklist

This checklist should run on every AI-generated or AI-modified PR:

Does the diff touch authentication, authorization, cryptography, secrets, payments, tenant isolation, data export, file upload, dependency resolution, networking, infra, or logging?
What attacker-controlled inputs reach this code?
What trust boundary changed?
Is every external input validated, encoded, escaped, parameterized, or rejected at the correct boundary?
Are permissions narrower than before?
Did the agent add a dependency, script, MCP server, GitHub Action, container image, or remote resource?
Are secrets absent from code, tests, logs, prompts, traces, and generated artifacts?
Are scanner findings resolved, suppressed with justification, or escalated?
Does the PR include unit tests for the low-level security rule and E2E or integration tests for the user-visible security flow?
Do tests assert audit events for sensitive actions, including actor, target, action, timestamp, and request/correlation ID?
Would a prompt injection in an issue, doc, webpage, or tool output change what the agent executed?

If the answer to any of these is unclear, the PR is not ready.

10. Use Human Approval Gates for High-Risk Actions

Human review should be mandatory when the agent touches:

authentication
authorization
tenant isolation
cryptography
secret handling
security headers
logging of sensitive data
dependency installation
CI/CD workflows
infrastructure as code
production data access
database migrations
network egress rules
agent tool permissions

This is where stronger models are most useful as assistants, not deciders. Let the model produce the threat-model summary, affected files, scanner evidence, and suggested review focus. Let a human own the approval.

11. Keep an Audit Trail

For agentic coding, the PR should preserve:

original user request
agent plan
security assumptions
commands run
files changed
tool calls
scanner results
tests run
known residual risks
human approvals

This matters for incident response. If a vulnerability ships, you need to know whether the failure was in the prompt, the model, the tool policy, the scanner coverage, the human review, or the deployment gate.

A Practical Reference Workflow

Here is a compact workflow that engineering teams can adapt:

1. Intake
   - classify feature risk
   - identify touched trust boundaries
   - define security acceptance criteria

2. Plan
   - agent reads trusted project context
   - agent produces implementation plan and threat model
   - human approves plan for high-risk changes

3. Generate
   - agent writes a small diff
   - sandboxed execution
   - least-privilege tools
   - network off or allowlisted

4. Verify
   - tests
   - type check
   - lint
   - SAST
   - SCA
   - secret scan
   - targeted security tests
   - unit tests for security invariants
   - E2E tests for high-risk user journeys
   - audit-log assertions for sensitive actions

5. Review
   - security-review agent checks diff
   - reviewer maps tests to threat model and ASVS/WSTG-style requirements
   - human reviews high-risk areas
   - findings become blocking issues

6. Merge
   - evidence attached to PR
   - approvals recorded
   - residual risk documented

7. Learn
   - production signals feed eval cases
   - vulnerabilities become regression tests
   - prompt and policy rules are updated

The most important part is not the specific tool stack. It is that every phase has an explicit control.

What “Vulnerability-Free” Should Mean in Practice

No serious security team should claim an agent can guarantee vulnerability-free code. Even human-written, manually reviewed, fully scanned code can contain vulnerabilities.

A better target is:

The agentic workflow must not be allowed to merge code unless it has produced enough evidence for the risk level of the change.

For low-risk UI copy changes, that evidence might be lint and build. For a login flow, that evidence should include auth tests, negative tests, scanner output, session behavior checks, and human review. For cryptography, the evidence should include approved library usage and security-owner review.

Security is not one universal threshold. It is a risk-adjusted gate.

Tests are part of that gate. A unit test proves the rule in isolation. An E2E test proves the rule survives the real path users and attackers take. An audit-log assertion proves the organization can reconstruct what happened later.

The Model-Size Takeaway

If your team is building an AI coding workflow, I would use frontier models for the security-heavy parts. They are better at connecting architecture, code, policy, and ambiguous requirements.

But I would not trust any model, big or small, as the final security control.

The strongest current process is model-agnostic in the way secure engineering has always been tool-agnostic:

define secure requirements
constrain execution
separate trusted instructions from untrusted data
limit permissions
scan the output
test abuse cases
review risky changes
log what happened
turn failures into regression tests

The future of secure agentic coding is not “one big model that writes safe code.”

It is a secure software factory where models help at every step, but the process refuses to confuse fluency with safety.

Source List

Veracode Spring 2026 GenAI Code Security Update - longitudinal benchmark reporting 55% secure-code pass rate across tested models and tasks, with security lagging behind syntax correctness.
CyberSecEval 4 - Meta’s cybersecurity evaluation suite, including AutoPatchBench for LLM agents patching native-code vulnerabilities.
CyberSecEval 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models - benchmark suite covering multiple cybersecurity risk categories for LLMs.
Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions - empirical study finding about 40% of generated Copilot programs vulnerable across security-relevant scenarios.
Do Users Write More Insecure Code with AI Assistants? - user study finding AI-assistant users wrote significantly less secure code and were more confident in its security.
A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code - repository-level benchmark showing current LLMs still struggle with secure code generation in realistic project contexts.
Rethinking the Evaluation of Secure Code Generation - ICSE 2026 paper warning that secure-code-generation techniques can compromise functionality and that limited analyzer coverage can distort results.
OWASP Top 10 for LLM and GenAI Applications 2025 - prompt injection, supply chain, improper output handling, excessive agency, and related GenAI application risks.
OWASP Agentic AI - Threats and Mitigations - threat-model-based reference for agentic AI risks and mitigations.
OpenAI Safety in Building Agents - guidance on prompt injection, structured outputs, tool confirmations, guardrails, trace graders, and evals.
OpenAI Codex Agent Internet Access - guidance on prompt injection, exfiltration, malware, vulnerable dependencies, allowlists, and limiting agent network access.
Anthropic Claude Code Security - security guidance covering read-only defaults, explicit approvals, sandboxed bash, write boundaries, command blocklists, and prompt-injection protections.
Google’s AI Security Strategy - agent security principles: human controllers, limited powers, and observable actions and planning.
NIST SP 800-218 Secure Software Development Framework - secure SDLC practices for reducing vulnerabilities, mitigating exploitation impact, and addressing root causes.
NIST AI Risk Management Framework: Generative AI Profile - cross-sector generative AI risk-management profile aligned to AI lifecycle governance.
Cisco Project CodeGuard announcement - model-agnostic framework for secure-by-default rules before, during, and after AI-generated code.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents - benchmark for tool-calling agents operating over untrusted data and prompt-injection attacks.
OWASP Application Security Verification Standard - verification requirements and vocabulary for testing technical security controls.
OWASP Web Security Testing Guide - structured methodology and scenarios for web application security testing.

Luis Mori Guerra

Recent Articles

Topics

A Secure Agentic Coding Process Is Not Just a Bigger LLM