TL;DR
- We are not limited to only big LLMs for secure coding, but we should use stronger models for ambiguous security reasoning, architecture tradeoffs, and high-risk code review.
- The current evidence says model size alone is not a reliable security control. Veracode’s Spring 2026 update reports that only 55% of tested AI code-generation tasks produced secure code, and that the earlier small advantage from model size has largely disappeared.
- Earlier research found similar warning signs: Copilot produced vulnerable programs in about 40% of tested scenarios, and a user study found participants with an AI assistant wrote significantly less secure code while being more confident in their result.
- A secure agentic workflow should treat the model as one component inside a controlled SDLC: security requirements, trusted context, sandboxed execution, least-privilege tools, static analysis, dependency scanning, secret scanning, adversarial evals, and human approval gates.
- Unit tests and E2E tests are part of the security control surface: unit tests lock down validation and authorization rules, while E2E tests prove that real user flows preserve those controls across browser, API, session, and data layers.
- The practical goal is not “vulnerability-free code.” That claim is too strong. The goal is a repeatable process that reduces vulnerability introduction, catches common failures before merge, and produces auditable evidence.
The hypothesis is tempting:
For security-sensitive code generation, maybe only the biggest LLMs are good enough.
My answer after looking at the current evidence is:
Bigger models help, but they are not the control. The control is the agentic process around the model.
That distinction matters. A frontier model can reason better about trust boundaries, authentication flows, threat models, and subtle implementation details. But a bigger model still reads untrusted context, inherits insecure patterns from training data, misunderstands app-specific policy, and can execute unsafe tool calls if the runtime lets it.
In security, “the model is smart” is not a defense. A secure agentic coding system needs a process that assumes the model will sometimes be wrong.
What the Evidence Actually Says
The strongest current signal is that AI-generated code security is not improving at the same pace as functional correctness.
Veracode’s Spring 2026 GenAI Code Security Update tested major model releases across 80 coding tasks, four languages, and four CWE classes: SQL injection, cross-site scripting, log injection, and insecure cryptography. Their headline result was uncomfortable: across all models and tasks, only 55% of generation tasks resulted in secure code. Syntax correctness exceeded 95%, but security pass rates stayed roughly flat. Veracode also reported that model size had only a very small effect in its October 2025 update, and that this marginal difference had mostly disappeared in newer releases.
That supports one part of your hypothesis and rejects another:
- yes, capability matters for complex security reasoning
- no, model size alone is not enough to trust generated code
This is consistent with earlier academic work. In “Asleep at the Keyboard?”, Pearce et al. generated 1,689 Copilot programs across 89 security-relevant scenarios and found approximately 40% were vulnerable. In “Do Users Write More Insecure Code with AI Assistants?”, Perry et al. found that participants with access to an AI assistant wrote significantly less secure code than participants without one, and were also more likely to believe their code was secure.
The newest benchmark direction points the same way. CyberSecEval 4 includes AutoPatchBench, which evaluates whether an LLM agent can automatically patch security vulnerabilities in native code. A.S.E, a repository-level benchmark for AI-generated code security, reports that current LLMs still struggle in realistic repo-level settings and that a larger reasoning budget does not necessarily produce better secure code. “Rethinking the Evaluation of Secure Code Generation” also warns that secure-code-generation techniques can trade away functionality, and that relying on one static analyzer can hide real risk.
The pattern is clear:
| Evidence | What it suggests for engineers |
|---|---|
| AI-generated code can be syntactically correct while still vulnerable | Build security checks into the workflow, not after it |
| Bigger models do not consistently eliminate vulnerabilities | Use model routing, but do not treat model choice as the security boundary |
| Users can become more confident while producing less secure code | Require evidence: tests, scans, review notes, and trace logs |
| Repo-level tasks are harder than snippet tasks | Evaluate agents inside real project context, not only toy prompts |
| Agentic tool use creates prompt-injection and exfiltration risk | Constrain tools, network, permissions, and external content |
Why Agentic Coding Is a Different Security Problem
Classic code generation risk is mostly about the code text:
- insecure SQL construction
- weak crypto
- missing authorization checks
- unsafe deserialization
- bad escaping
- hardcoded secrets
Agentic coding adds another layer: the agent can read files, run commands, call tools, browse the web, install dependencies, modify code, and sometimes push changes. That means the security question becomes:
What can the agent do when it is wrong, confused, or manipulated?
OWASP’s 2025 Top 10 for LLM and GenAI applications names risks that map directly to coding agents: prompt injection, supply chain risk, improper output handling, excessive agency, sensitive information disclosure, and unbounded consumption. The OWASP Agentic AI threats guide frames this as a threat-modeling problem, not just a prompt-quality problem.
OpenAI’s agent safety guidance says to design workflows so untrusted data never directly drives agent behavior, to use structured outputs, keep tool confirmations on, use guardrails, and run trace graders and evals. OpenAI’s Codex internet-access docs call out prompt injection from untrusted web content, exfiltration of code or secrets, malware or vulnerable dependencies, and license-risky content as reasons to keep network access limited.
Anthropic’s Claude Code security docs use a similar architecture: read-only defaults, explicit approval for sensitive operations, sandboxed bash, write boundaries, command blocklists for risky web-fetching commands, and user responsibility for reviewing proposed code and commands.
Google’s AI security strategy gives a concise agentic security principle: agents need well-defined human controllers, carefully limited powers, and observable actions and planning.
That is the current industry consensus: secure agents are controlled systems.
The Current Secure Agentic Coding Process
Here is the process I would use for an engineering team building code with AI agents in 2026.
1. Start With a Security Spec, Not a Coding Prompt
Before the agent writes code, give it a security contract:
- assets: secrets, customer data, credentials, tokens, model outputs, logs
- trust boundaries: browser, API, worker, database, third-party service, agent tool
- attacker-controlled input: request body, markdown, issue text, dependency README, file upload, tool result
- required controls: authn, authz, validation, escaping, rate limits, audit logs, safe defaults
- forbidden patterns: raw SQL string concatenation,
eval, hardcoded secrets, weak crypto, broad IAM permissions, unpinned remote scripts - acceptance evidence: tests, scanner output, review checklist, threat-model notes
This turns “build feature X” into “build feature X under these security constraints.”
A good agent should produce a plan that includes the security assumptions before touching files. If the agent cannot explain the trust boundary, it is not ready to generate the implementation.
2. Split the Workflow Into Roles
Use different roles even if they run on the same underlying model:
- Planner: turns the feature request into a small implementation plan and threat model
- Implementer: writes the smallest code change that satisfies the plan
- Security reviewer: reviews the diff against OWASP, CWE, project policy, and data-flow risks
- Verifier: runs tests, SAST, SCA, secret scanning, linting, and targeted exploit checks
- Release gate: decides whether the change needs human security review before merge
This is not theater. It forces the system to separate generation from judgment. The agent that wrote the code should not be the only authority saying the code is secure.
3. Route Models by Risk, Not Ego
You do not need a frontier model for every step. You do need one where ambiguity and reasoning depth matter.
Use a stronger model for:
- threat modeling
- auth and permissions design
- cryptography choices
- multi-file refactors touching trust boundaries
- code review of security-sensitive diffs
- deciding whether a scanner finding is exploitable
Use a smaller or cheaper model for:
- converting scanner output into issues
- summarizing diffs
- checking checklist completeness
- classifying files by risk category
- generating boilerplate tests from an explicit spec
Use deterministic tools for:
- SAST
- SCA and dependency vulnerability scanning
- secret scanning
- type checking
- linting
- formatting
- unit and integration tests
- IaC and container scanning
The lesson from the evidence is not “small models are safe.” It is “model size is the wrong place to put the security boundary.”
4. Treat Retrieved Context as Untrusted Data
Coding agents often read GitHub issues, docs, READMEs, package pages, Stack Overflow snippets, internal tickets, and web pages. Every one of those can contain instructions that are not part of your task.
The rule should be:
External content can inform the task, but it cannot grant permissions, change policy, request secrets, or cause tool execution.
Implementation patterns:
- wrap retrieved text in explicit
untrusted_contentblocks - extract structured facts from untrusted content before planning
- require citations or file references for factual claims
- do not pass untrusted text into developer/system instruction channels
- block network egress by default during code execution
- allowlist dependency hosts and package registries
- require human approval for commands that read secrets, upload data, install packages, or modify infra
This is especially important for coding agents because malicious instructions can hide in places engineers normally skim: issue descriptions, comments, dependency install scripts, docs, test fixtures, and generated markdown.
5. Put Tool Permissions on a Diet
Agentic security is mostly permission design.
Give the agent the least authority needed for the current step:
- read-only repo access during planning
- write access only to the working directory
- no production credentials
- no default cloud admin permissions
- no unrestricted internet
- no automatic package installation from arbitrary URLs
- no direct write access to production databases
- no ability to merge, deploy, rotate secrets, or approve its own PR
For local execution, use a sandbox or container. For cloud execution, use ephemeral credentials, narrow network allowlists, and scoped service accounts. For MCP or connector tools, separate read tools from write tools and require approvals for side effects.
The agent should be productive inside a small box. If it needs to leave the box, that is a security event worth reviewing.
6. Generate Code in Small, Reviewable Changes
Secure code generation gets harder as the diff gets larger.
Good agentic changes are:
- one feature or fix at a time
- small enough to review line by line
- covered by tests in the same PR
- explicit about data flow
- explicit about error handling
- explicit about auth and permission checks
- traceable back to a spec or issue
Bad agentic changes are:
- “refactor the whole auth system”
- “modernize the API layer”
- “make this secure”
- “add tests everywhere”
- hundreds of lines across unrelated modules
The agent should optimize for reviewability, not maximum code volume.
7. Run Security Checks Before the Human Review
Every agentic coding PR should attach evidence.
Minimum checks:
- unit tests
- integration tests for changed API boundaries
- type check
- linter
- SAST for the relevant language
- dependency vulnerability scan
- secret scan
- package lockfile review
- license scan if internet retrieval or new dependencies were involved
Security-sensitive changes should add:
- abuse-case tests
- authz negative tests
- input validation tests
- SQL injection or XSS regression tests where relevant
- fuzzing for parsers, file inputs, and protocol handlers
- IaC scanning for cloud or Kubernetes changes
- container scanning for deployable artifacts
The verifier agent can summarize the evidence, but deterministic tools should produce the underlying signal.
8. Use Unit and E2E Tests as Security Evidence
Unit tests and E2E tests do different security jobs. You need both.
Unit tests are the fastest way to turn security rules into executable invariants. They are especially useful for:
- input validation: malicious payloads are rejected before they reach storage, rendering, or command execution
- output encoding and escaping: untrusted strings are transformed consistently before display
- authorization logic: users without a role, scope, tenant, or ownership relationship are denied
- rate-limit and lockout rules: counters, windows, and reset behavior work at the function level
- crypto wrappers: only approved algorithms, modes, and key sizes are accepted
- secret handling: redaction utilities remove tokens, API keys, and credentials from logs and errors
- parser behavior: malformed JSON, markdown, HTML, CSV, YAML, or uploaded files fail closed
For AI-generated code, unit tests have an extra benefit: they force the agent to express the security requirement in a form the repo can keep. A prompt disappears. A unit test stays in CI.
Good security unit tests are usually negative tests:
it("denies access when the user belongs to a different tenant", () => {
const user = { id: "u_1", tenantId: "tenant_a", role: "member" };
const resource = { id: "doc_1", tenantId: "tenant_b" };
expect(canReadDocument(user, resource)).toBe(false);
});
it("redacts API tokens before writing structured logs", () => {
const logEvent = redactSecrets({
message: "request failed",
authorization: "Bearer sk_live_123",
});
expect(JSON.stringify(logEvent)).not.toContain("sk_live_123");
});
E2E tests are slower, but they catch the failures that unit tests intentionally do not model:
- browser-to-API auth flows
- cookie flags and session persistence
- CSRF behavior
- tenant switching and deep links
- file upload and download flows
- XSS regression paths through real rendering
- password reset and invitation flows
- audit log creation after sensitive actions
- permission failures as seen by real users
- redirects, headers, and route guards
In an agentic coding process, E2E tests should be selective. Do not ask the agent to generate hundreds of brittle browser tests. Ask it to generate high-value security journeys:
- anonymous user cannot reach authenticated routes
- member cannot access another tenant’s resource by changing the URL
- admin action writes an audit event with actor, target, action, timestamp, and request ID
- stored user input renders as text, not executable HTML
- logout invalidates the session and blocks back-button access
- destructive action requires confirmation and fails without CSRF protection
The audit value is important. Security review is easier when a PR can say:
- the authorization rule is covered by unit tests
- the browser flow is covered by E2E tests
- the audit trail is asserted in the test
- the test names map to the risk in the threat model
- the CI run links prove the evidence before merge
OWASP ASVS is useful here because it gives teams a common vocabulary for verification requirements. OWASP WSTG is useful because it gives testers a structured web-security testing methodology. NIST SSDF is useful because it frames testing as part of reducing vulnerabilities, mitigating exploitation impact, and addressing root causes.
For agentic code generation, the practical policy should be:
| Change type | Required test evidence |
|---|---|
| Pure UI copy or styling | build, lint, visual smoke test if layout risk exists |
| Validation or parsing | unit tests for valid, invalid, malformed, oversized, and adversarial inputs |
| Authorization or tenancy | unit tests for allow/deny matrix plus E2E test for one real denied path |
| Authentication or session | unit tests for token/session helpers plus E2E tests for login, logout, expiry, and protected route behavior |
| Sensitive data handling | unit tests for redaction plus E2E/API test proving secrets do not appear in UI, logs, or exported data |
| Agent tool or workflow permission | unit tests for policy decisions plus E2E or integration test proving disallowed actions are blocked before execution |
| Audit logging | unit tests for event shape plus E2E/API test proving sensitive actions create durable audit records |
This turns testing into review evidence instead of a vague quality ritual.
9. Add an Agentic Security Review Checklist
This checklist should run on every AI-generated or AI-modified PR:
- Does the diff touch authentication, authorization, cryptography, secrets, payments, tenant isolation, data export, file upload, dependency resolution, networking, infra, or logging?
- What attacker-controlled inputs reach this code?
- What trust boundary changed?
- Is every external input validated, encoded, escaped, parameterized, or rejected at the correct boundary?
- Are permissions narrower than before?
- Did the agent add a dependency, script, MCP server, GitHub Action, container image, or remote resource?
- Are secrets absent from code, tests, logs, prompts, traces, and generated artifacts?
- Are scanner findings resolved, suppressed with justification, or escalated?
- Does the PR include unit tests for the low-level security rule and E2E or integration tests for the user-visible security flow?
- Do tests assert audit events for sensitive actions, including actor, target, action, timestamp, and request/correlation ID?
- Would a prompt injection in an issue, doc, webpage, or tool output change what the agent executed?
If the answer to any of these is unclear, the PR is not ready.
10. Use Human Approval Gates for High-Risk Actions
Human review should be mandatory when the agent touches:
- authentication
- authorization
- tenant isolation
- cryptography
- secret handling
- security headers
- logging of sensitive data
- dependency installation
- CI/CD workflows
- infrastructure as code
- production data access
- database migrations
- network egress rules
- agent tool permissions
This is where stronger models are most useful as assistants, not deciders. Let the model produce the threat-model summary, affected files, scanner evidence, and suggested review focus. Let a human own the approval.
11. Keep an Audit Trail
For agentic coding, the PR should preserve:
- original user request
- agent plan
- security assumptions
- commands run
- files changed
- tool calls
- scanner results
- tests run
- known residual risks
- human approvals
This matters for incident response. If a vulnerability ships, you need to know whether the failure was in the prompt, the model, the tool policy, the scanner coverage, the human review, or the deployment gate.
A Practical Reference Workflow
Here is a compact workflow that engineering teams can adapt:
1. Intake
- classify feature risk
- identify touched trust boundaries
- define security acceptance criteria
2. Plan
- agent reads trusted project context
- agent produces implementation plan and threat model
- human approves plan for high-risk changes
3. Generate
- agent writes a small diff
- sandboxed execution
- least-privilege tools
- network off or allowlisted
4. Verify
- tests
- type check
- lint
- SAST
- SCA
- secret scan
- targeted security tests
- unit tests for security invariants
- E2E tests for high-risk user journeys
- audit-log assertions for sensitive actions
5. Review
- security-review agent checks diff
- reviewer maps tests to threat model and ASVS/WSTG-style requirements
- human reviews high-risk areas
- findings become blocking issues
6. Merge
- evidence attached to PR
- approvals recorded
- residual risk documented
7. Learn
- production signals feed eval cases
- vulnerabilities become regression tests
- prompt and policy rules are updated
The most important part is not the specific tool stack. It is that every phase has an explicit control.
What “Vulnerability-Free” Should Mean in Practice
No serious security team should claim an agent can guarantee vulnerability-free code. Even human-written, manually reviewed, fully scanned code can contain vulnerabilities.
A better target is:
The agentic workflow must not be allowed to merge code unless it has produced enough evidence for the risk level of the change.
For low-risk UI copy changes, that evidence might be lint and build. For a login flow, that evidence should include auth tests, negative tests, scanner output, session behavior checks, and human review. For cryptography, the evidence should include approved library usage and security-owner review.
Security is not one universal threshold. It is a risk-adjusted gate.
Tests are part of that gate. A unit test proves the rule in isolation. An E2E test proves the rule survives the real path users and attackers take. An audit-log assertion proves the organization can reconstruct what happened later.
The Model-Size Takeaway
If your team is building an AI coding workflow, I would use frontier models for the security-heavy parts. They are better at connecting architecture, code, policy, and ambiguous requirements.
But I would not trust any model, big or small, as the final security control.
The strongest current process is model-agnostic in the way secure engineering has always been tool-agnostic:
- define secure requirements
- constrain execution
- separate trusted instructions from untrusted data
- limit permissions
- scan the output
- test abuse cases
- review risky changes
- log what happened
- turn failures into regression tests
The future of secure agentic coding is not “one big model that writes safe code.”
It is a secure software factory where models help at every step, but the process refuses to confuse fluency with safety.
Source List
- Veracode Spring 2026 GenAI Code Security Update - longitudinal benchmark reporting 55% secure-code pass rate across tested models and tasks, with security lagging behind syntax correctness.
- CyberSecEval 4 - Meta’s cybersecurity evaluation suite, including AutoPatchBench for LLM agents patching native-code vulnerabilities.
- CyberSecEval 3: Advancing the Evaluation of Cybersecurity Risks and Capabilities in Large Language Models - benchmark suite covering multiple cybersecurity risk categories for LLMs.
- Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions - empirical study finding about 40% of generated Copilot programs vulnerable across security-relevant scenarios.
- Do Users Write More Insecure Code with AI Assistants? - user study finding AI-assistant users wrote significantly less secure code and were more confident in its security.
- A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code - repository-level benchmark showing current LLMs still struggle with secure code generation in realistic project contexts.
- Rethinking the Evaluation of Secure Code Generation - ICSE 2026 paper warning that secure-code-generation techniques can compromise functionality and that limited analyzer coverage can distort results.
- OWASP Top 10 for LLM and GenAI Applications 2025 - prompt injection, supply chain, improper output handling, excessive agency, and related GenAI application risks.
- OWASP Agentic AI - Threats and Mitigations - threat-model-based reference for agentic AI risks and mitigations.
- OpenAI Safety in Building Agents - guidance on prompt injection, structured outputs, tool confirmations, guardrails, trace graders, and evals.
- OpenAI Codex Agent Internet Access - guidance on prompt injection, exfiltration, malware, vulnerable dependencies, allowlists, and limiting agent network access.
- Anthropic Claude Code Security - security guidance covering read-only defaults, explicit approvals, sandboxed bash, write boundaries, command blocklists, and prompt-injection protections.
- Google’s AI Security Strategy - agent security principles: human controllers, limited powers, and observable actions and planning.
- NIST SP 800-218 Secure Software Development Framework - secure SDLC practices for reducing vulnerabilities, mitigating exploitation impact, and addressing root causes.
- NIST AI Risk Management Framework: Generative AI Profile - cross-sector generative AI risk-management profile aligned to AI lifecycle governance.
- Cisco Project CodeGuard announcement - model-agnostic framework for secure-by-default rules before, during, and after AI-generated code.
- AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents - benchmark for tool-calling agents operating over untrusted data and prompt-injection attacks.
- OWASP Application Security Verification Standard - verification requirements and vocabulary for testing technical security controls.
- OWASP Web Security Testing Guide - structured methodology and scenarios for web application security testing.