Software Systems Research

The Pieces of Modern, Effective Software Design

A practical ladder for growing a software system from one app and one database into an observable, event-driven, permission-aware, AI-ready architecture.


Modern software design is not one giant architecture diagram.

It is a set of pieces you add when the system earns them: isolated environments, clear layers, observability, read models, event streams, contracts, authorization, failure handling, migration paths, and eventually AI-aware workflows.

The trap is adding all of them at the beginning. The opposite trap is waiting so long that every feature becomes risky. Good architecture lives between those mistakes.

This article is a practical ladder. It starts with one server and one database, then adds each design piece only when a real pain appears.

TL;DR

  • Start with one app and one database unless the problem is already distributed.
  • Separate development and production early. Most “it worked in the demo” pain is environment pain.
  • Keep a single app modular before reaching for microservices.
  • Add observability before optimization. You cannot tune what you cannot see.
  • Split read and write paths when reads dominate and query shape becomes different from write shape.
  • Use events when features need to react to state changes without editing the core service every time.
  • Treat APIs and event schemas as contracts. Version them and test them.
  • Keep authorization separate from data access once roles, teams, tenants, or agents enter the product.
  • Design for failure with timeouts, retries, idempotency, circuit breakers, dead-letter queues, and graceful degradation.
  • Modernize legacy systems slice by slice with a strangler fig pattern, not a big-bang rewrite.
  • Treat AI as another system layer: model, tools, memory, permissions, evals, observability, and fallback behavior.

The shortest version: modern architecture is the discipline of adding boundaries at the moment they reduce risk more than they add complexity.

What You Will Learn Here

  • How a system can grow from a simple CRUD app into a production architecture without over-engineering.
  • Which pain usually justifies each architectural pattern.
  • How NATS, OpenSearch, and OpenFGA map to common modern system-design problems.
  • Where AI features and AI coding agents fit into the same architecture.
  • How to decide which piece to add next, and which pieces to avoid for now.

The Ladder

Think of the system as moving through six stages:

Act 0       Act 1          Act 2          Act 3           Act 4        Act 5
simple  ->  organized  ->  observable ->  distributed ->  resilient -> evolving
app         app            data           services        system       system

Each act is a response to a pain:

Pain                                              Design piece
----                                              ------------
“Testing on real user data is terrifying.”        Isolated environments
“Nobody knows where logic belongs.”               Layered modular structure
“Users say it is slow, but we are guessing.”      Observability
“Reads are drowning writes.”                      CQRS and read models
“Every new feature edits the core service.”       Events
“Changing one service breaks another.”            Contracts and versioning
“Permissions are copied everywhere.”              Externalized authorization
“One outage breaks everything.”                   Failure design
“The legacy system cannot be replaced safely.”    Strangler fig migration
“AI features need data, tools, and guardrails.”   AI as a first-class layer

That table is the article in miniature. The rest explains how to use it.

Act 0: Start With One App and One Database

The best first architecture is usually boring:

Browser -> Server -> Database

One deployable app. One source of truth. One path to understand.

This is not a toy. A well-built monolith can carry a serious product for a long time, especially when one team owns the whole system. The problem is not starting simple. The problem is starting simple while leaving yourself no way to grow.

At this stage, avoid:

  • a message bus with no asynchronous work
  • microservices with one team
  • CQRS before queries hurt
  • a custom authorization service before roles are complex
  • AI orchestration before there is a clear user problem

The useful discipline is to keep the first system small but not messy.

User
  |
  v
API / Web App
  |
  v
Database

The first real pain usually appears when people depend on the app and you need to change it safely.

Act 1: Make One App Safe to Change

The first design pieces are not about scale. They are about safety.

Isolate Environments

Development and production should be separate copies of the system:

Development environment             Production environment
-----------------------             ----------------------
newest code                         released code
sample or seeded data               real user data
test credentials                    production credentials
safe to break                       must stay available

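The Twelve-Factor App covers both halves of this. Its dev/prod parity factor says to keep development, staging, and production as similar as possible in shape. Its config factor says that credentials and other environment-specific settings live in the environment, not in the code, so each environment keeps its own data, credentials, and runtime state.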

This distinction explains a common product misunderstanding. “It works in dev” and “customers can use it” are different milestones. The missing step is release.

For PMs, this is not engineering bureaucracy. It is the difference between a demo, a staging validation, and a production launch.
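One way to make the isolation concrete is to read every environment-specific setting from environment variables and fail fast when one is missing or points at the wrong environment. The sketch below is illustrative only: the variable names `APP_ENV` and `DATABASE_URL`, the `loadConfig` function, and the convention that production database hosts contain "prod" are all assumptions, not part of any framework.

```typescript
// Hypothetical sketch: settings come from environment variables, never from code.
type AppEnv = "development" | "staging" | "production";

interface AppConfig {
  env: AppEnv;
  databaseUrl: string;
}

function isAppEnv(value: string | undefined): value is AppEnv {
  return value === "development" || value === "staging" || value === "production";
}

function loadConfig(vars: Record<string, string | undefined>): AppConfig {
  const env = vars.APP_ENV;
  if (!isAppEnv(env)) {
    throw new Error(`APP_ENV must be development, staging, or production (got ${String(env)})`);
  }

  const databaseUrl = vars.DATABASE_URL;
  if (!databaseUrl) {
    throw new Error("DATABASE_URL is required");
  }

  // Guard rail based on an assumed naming convention: production database
  // hosts contain "prod", and only the production environment may use them.
  if (env !== "production" && databaseUrl.includes("prod")) {
    throw new Error(`${env} must not point at a production database`);
  }

  return { env, databaseUrl };
}
```

In a Node.js service this would typically run once at startup as `loadConfig(process.env)`, so a misconfigured environment fails before it serves a single request.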

Keep the App Modular

Before splitting into services, split the codebase into clear layers:

Edge        routing, auth entry points, rate limits
Application use cases, workflows, commands
Domain      business rules and invariants
Data        repositories, queries, transactions
Async       jobs, queues, outbox, retries
Ops         logging, metrics, tracing, health checks

Those layers can live inside one deployable app. That is a modular monolith.

The point is not ceremony. The point is that a future engineer should know where a new rule belongs. If “who can approve an invoice” appears in a controller, a SQL query, a React component, and a background job, the system is already drifting.
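As a rough illustration of giving a rule one home, the invoice example might look like the sketch below. The rule itself (the approver must be a manager, must not be the invoice's author, and must stay within a limit) and all type and function names are invented for this example.

```typescript
// Hypothetical types and rule, for illustration only.
interface Approver {
  id: string;
  role: "member" | "manager";
  approvalLimitCents: number;
}

interface Invoice {
  id: string;
  createdBy: string;
  amountCents: number;
}

// Domain layer: a pure business rule with no HTTP, SQL, or UI concerns.
function canApproveInvoice(approver: Approver, invoice: Invoice): boolean {
  return (
    approver.role === "manager" &&
    approver.id !== invoice.createdBy &&
    invoice.amountCents <= approver.approvalLimitCents
  );
}

// Application layer: the use case calls the rule instead of restating it.
function approveInvoice(approver: Approver, invoice: Invoice): { invoiceId: string; approvedBy: string } {
  if (!canApproveInvoice(approver, invoice)) {
    throw new Error(`${approver.id} may not approve invoice ${invoice.id}`);
  }
  return { invoiceId: invoice.id, approvedBy: approver.id };
}
```

A controller, a background job, and an API that feeds the UI can all call `canApproveInvoice`, so changing the rule means changing one function.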

At the end of Act 1, the app still looks simple from the outside:

User -> Edge -> Modular App -> Database

But inside, the concerns have homes. That makes the next stage possible.

Act 2: Measure Before You Scale

Scale problems are often described emotionally:

  • “Search feels slow.”
  • “The dashboard hangs sometimes.”
  • “Checkout was weird last night.”
  • “Customers say data is missing.”

Those are not yet engineering facts. Observability turns them into facts.

Add Observability

For a production app, you need at least three kinds of telemetry:

Logs     what happened
Metrics  how often, how fast, how many
Traces   where a request went across system boundaries

OpenTelemetry describes these as telemetry signals that can be generated, collected, and exported through a vendor-neutral framework. The tooling matters less than the habit: every important flow should have enough evidence to debug it later.

For each critical user journey, track:

  • request rate
  • error rate
  • latency, especially p95 and p99
  • saturation, such as queue depth or connection pool usage
  • a correctness signal, such as “orders created” or “todos indexed”

That last one is easy to skip and painful later. A system can return HTTP 200 while silently doing the wrong thing.
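To show why p95 and p99 matter more than an average, here is a minimal nearest-rank percentile over a handful of recorded latencies. It is a teaching sketch with made-up sample values. Production systems usually get percentiles from histogram metrics in their telemetry stack rather than from raw arrays.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Made-up request latencies in milliseconds: mostly fast, two very slow.
const latenciesMs = [12, 15, 11, 14, 13, 240, 16, 12, 18, 900];

const summary = {
  p50: percentile(latenciesMs, 50),
  p95: percentile(latenciesMs, 95),
  p99: percentile(latenciesMs, 99),
};
```

With these samples the median is 14 ms while p95 and p99 are both 900 ms. The typical request looks healthy, and the tail is where users feel the "hangs sometimes" problem.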

Split Reads From Writes When Query Shape Demands It

At some point, the write model and the read model want different shapes.

Writes want correctness:

  • normalized tables
  • transactions
  • constraints
  • invariants

Reads want speed:

  • denormalized documents
  • precomputed views
  • search indexes
  • cached aggregates

CQRS, or Command Query Responsibility Segregation, names that split.

Write path
---------
Create todo -> App service -> Write database
                              |
                              | project changes
                              v
Read path                     Read model
---------                     ----------
Search todos -> Read API  ->  OpenSearch / cache / materialized view

The read model is not the source of truth. It is a purpose-built copy.

This design buys speed and query flexibility, but it introduces a new truth: eventual consistency. A user may create something, then wait a moment before it appears in search. That is acceptable for some flows and unacceptable for others.

Use this pattern when the read side has genuinely outgrown the write side. Do not use it because a diagram looks more modern with two databases.

Concrete Implementation: A Read Model Pipeline

Here is a small implementation shape that appears in many real systems:

Write database -> Outbox table -> Publisher -> NATS -> Projector -> OpenSearch
                                                                        |
                                                                        v
                                                                    Read API
                                                                        |
                                                                        v
                                                                  OpenFGA check

The write transaction stores both the business change and an outbox event:

BEGIN;

INSERT INTO todos (id, user_id, title, completed)
VALUES (:id, :user_id, :title, false);

INSERT INTO outbox_events (id, topic, payload, created_at)
VALUES (
  :event_id,
  'todo.created',
  json_build_object(
    'todo_id', :id,
    'user_id', :user_id,
    'title', :title
  ),
  now()
);

COMMIT;

A publisher reads unpublished outbox rows and sends them to the event bus:

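type OutboxEvent = {
  id: string;
  topic: string;
  payload: unknown;
  createdAt: Date; // outbox_events.created_at, set inside the write transaction
};

// Assumed helpers, declared here so the sketch type-checks on its own.
declare const messageBus: { publish(topic: string, message: unknown): Promise<void> };
declare function markPublished(eventId: string): Promise<void>;

async function publishOutboxBatch(events: OutboxEvent[]) {
  for (const event of events) {
    await messageBus.publish(event.topic, {
      id: event.id,
      // Use the write time, not the publish time, so read lag can be measured end to end.
      occurredAt: event.createdAt.toISOString(),
      data: event.payload,
    });

    // If this step fails after a successful publish, the row is published again
    // on the next run. Delivery is at-least-once, so consumers must be idempotent.
    await markPublished(event.id);
  }
}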

A projector updates the read model idempotently:

type TodoCreated = {
  id: string;
  data: {
    todo_id: string;
    user_id: string;
    title: string;
  };
};

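async function handleTodoCreated(event: TodoCreated) {
  // Delivery is at-least-once, so skip events that were already applied.
  if (await alreadyProcessed(event.id)) {
    return;
  }

  // Indexing by todo_id is itself idempotent: a replay overwrites the same document.
  await openSearch.index({
    index: "todos",
    id: event.data.todo_id,
    // Assuming `openSearch` is the official OpenSearch JavaScript client,
    // the document goes under `body`.
    body: {
      title: event.data.title,
      owner_id: event.data.user_id,
      completed: false,
    },
  });

  await markProcessed(event.id);
}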

The details change by stack, but the shape is stable:

  • write once to the source of truth
  • publish changes reliably
  • update read models in the background
  • make projectors idempotent
  • measure lag between write and read visibility

That lag becomes an operational signal.
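A minimal way to compute that signal, assuming each event carries the time of the original write and the projector records when it finished indexing, could look like this. The field names `occurredAt` and `projectedAt` are assumptions for the sketch.

```typescript
// Hypothetical record kept by the projector for each event it applies.
interface ProjectedEvent {
  id: string;
  occurredAt: string;  // ISO timestamp of the write
  projectedAt: string; // ISO timestamp when the read model was updated
}

function projectionLagMs(projected: ProjectedEvent): number {
  const lag = Date.parse(projected.projectedAt) - Date.parse(projected.occurredAt);
  if (isNaN(lag)) {
    throw new Error(`event ${projected.id} has an invalid timestamp`);
  }
  // Clock skew between machines can make the raw value negative.
  return Math.max(0, lag);
}

// Worst-case lag over a window of events: a reasonable value to alert on.
function maxLagMs(events: ProjectedEvent[]): number {
  return events.reduce((max, e) => Math.max(max, projectionLagMs(e)), 0);
}
```

Exporting `maxLagMs` as a metric per read model gives the team a number to set an alert threshold on, instead of learning about stale search results from users.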

Act 3: Split Systems Along Real Boundaries

Microservices are not the next step after “we have files.” They are the next step after independent parts of the system need independent ownership, scaling, release cadence, or failure isolation.

The design problem changes once there is more than one service. Calls cross boundaries. Schemas drift. Permissions duplicate. Debugging gets harder.

Use Events When Services Need to React

Direct calls couple services tightly:

Todo service -> Email service
Todo service -> Analytics service
Todo service -> Search projector
Todo service -> Notification service

Events invert that relationship:

Todo service -> todo.completed event -> Message bus
                                      -> Email worker
                                      -> Analytics worker
                                      -> Search projector
                                      -> Notification worker

The producer announces what happened. Consumers decide what to do.

NATS is one practical open-source option here. Core NATS supports pub/sub and request/reply. JetStream adds persistence, durable consumers, replayable streams, and key-value capabilities. That makes it useful when you need lightweight messaging without adopting a heavier event platform immediately.

The tradeoff: events make flows more flexible, but less obvious. You need naming conventions, schema ownership, tracing, dead-letter handling, and replay procedures.
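The inversion can be sketched with an in-memory stand-in for the bus. This is not the NATS API. It only shows the shape: the producer publishes one fact, and consumers are added without touching the producer. A real bus adds network delivery, persistence, and redelivery. All names here are hypothetical.

```typescript
type Handler<T> = (message: T) => void;

// In-memory stand-in for a message bus, for illustration only.
class InMemoryBus<T> {
  private handlers = new Map<string, Handler<T>[]>();

  subscribe(topic: string, handler: Handler<T>): void {
    const existing = this.handlers.get(topic) ?? [];
    this.handlers.set(topic, [...existing, handler]);
  }

  publish(topic: string, message: T): void {
    // The producer does not know who is listening, or how many.
    for (const handler of this.handlers.get(topic) ?? []) {
      handler(message);
    }
  }
}

interface TodoCompleted {
  todoId: string;
  userId: string;
}

const bus = new InMemoryBus<TodoCompleted>();
const emailsQueued: string[] = [];
const analyticsCounts = { completed: 0 };

// Each consumer registers itself; the todo service is not edited.
bus.subscribe("todo.completed", (message) => { emailsQueued.push(message.userId); });
bus.subscribe("todo.completed", () => { analyticsCounts.completed += 1; });

bus.publish("todo.completed", { todoId: "t1", userId: "u1" });
```

Adding a notification worker later is one more `subscribe` call in the new worker, which is the property the direct-call version lacks.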

Treat Contracts as Code

Once multiple services share APIs or events, every boundary is a contract.

Examples:

GET /v1/todos
POST /v1/todos
event todo.created.v1
event todo.completed.v2

Version numbers are not decoration. They tell consumers what can change safely.

The executable version of a contract is a test:

  • OpenAPI validation for HTTP APIs
  • schema validation for events
  • contract tests between producers and consumers
  • integration tests for the highest-risk flows

This matters even more with AI coding agents. Agents can change code quickly, but they still need an oracle. Contracts and tests are the oracle.
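As one possible shape for an executable event contract, a consumer can validate the payload at the boundary before trusting it. The sketch below hand-rolls the check for a hypothetical `todo.created.v1` schema whose fields mirror the outbox payload shown earlier. A schema library or JSON Schema would do the same job with less code.

```typescript
// Hypothetical contract for the `todo.created.v1` event payload.
interface TodoCreatedV1 {
  todo_id: string;
  user_id: string;
  title: string;
}

function parseTodoCreatedV1(payload: unknown): TodoCreatedV1 {
  if (typeof payload !== "object" || payload === null) {
    throw new Error("todo.created.v1: payload must be an object");
  }
  const record = payload as Record<string, unknown>;
  const { todo_id, user_id, title } = record;

  if (typeof todo_id !== "string") throw new Error('todo.created.v1: "todo_id" must be a string');
  if (typeof user_id !== "string") throw new Error('todo.created.v1: "user_id" must be a string');
  if (typeof title !== "string") throw new Error('todo.created.v1: "title" must be a string');

  // Unknown extra fields are ignored, so a producer can add optional fields
  // without breaking v1 consumers. Removing or retyping a field needs a v2.
  return { todo_id, user_id, title };
}
```

Running the same parser in the producer's tests and in each consumer's tests is a simple form of contract test: both sides fail in CI, not in production, when the shape drifts.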

Separate Authorization From Data Access

Authorization gets complicated when the product gets useful.

At the beginning, this may be enough:

if (user.role !== "admin") {
  throw new ForbiddenError();
}

Later, the question becomes:

Can this user, service account, or agent
perform this action
on this workspace, project, document, invoice, or tool
under this task context?

At that point, permission logic scattered through services becomes dangerous.

A cleaner design separates data retrieval from authorization:

Read API -> OpenSearch     "Which records match the query?"
Read API -> OpenFGA        "Which matching records can this subject see?"
Read API -> User           "Return only authorized records."

OpenFGA is one open-source implementation of relationship-based access control inspired by Google’s Zanzibar paper. It stores relationship tuples, evaluates an authorization model, and answers permission checks such as:

Can user:luis view document:roadmap?
Can agent:triage read ticket:123?
Can service:billing refund invoice:456?

The separation gives you a debugging advantage:

Data exists?    Permission exists?    Likely problem
------------    ------------------    --------------
yes             yes                   API or UI bug
yes             no                    authorization bug
no              yes/no                data sync or creation bug

This is especially useful in systems with read models, search indexes, tenants, and agents. You can inspect the data path and permission path independently.
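The two paths can be sketched with an in-memory stand-in for the permission check. This is not the OpenFGA SDK. A real relationship-based system also derives permissions through its authorization model (for example, through group membership), while this sketch only looks up direct tuples. The tuple shape, the `viewer` relation, and the function names are assumptions.

```typescript
// Hypothetical relationship tuple: subject has relation to object.
interface Tuple {
  subject: string;
  relation: string;
  object: string;
}

type Check = (subject: string, relation: string, object: string) => boolean;

// In-memory stand-in for an authorization service's check call.
function makeChecker(tuples: Tuple[]): Check {
  const known = new Set(tuples.map((t) => `${t.subject}#${t.relation}@${t.object}`));
  return (subject, relation, object) => known.has(`${subject}#${relation}@${object}`);
}

interface SearchHit {
  id: string;
  title: string;
}

// Data path: `hits` is whatever matched the query.
// Permission path: keep only the hits this subject may view.
function authorizedHits(subject: string, hits: SearchHit[], check: Check): SearchHit[] {
  return hits.filter((hit) => check(subject, "viewer", `document:${hit.id}`));
}

const check = makeChecker([
  { subject: "user:luis", relation: "viewer", object: "document:roadmap" },
]);
const hits: SearchHit[] = [
  { id: "roadmap", title: "Roadmap" },
  { id: "salaries", title: "Salaries" },
];
const visible = authorizedHits("user:luis", hits, check);
```

Here `visible` contains only the roadmap document. If a user reports a missing record, the team can inspect `hits` and the tuples separately to see which path dropped it.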

Act 4: Design for Failure

Distributed systems fail in ordinary ways:

  • the network drops
  • a dependency slows down
  • a message arrives twice
  • a consumer crashes halfway through
  • a search index lags
  • a permission tuple is missing
  • an AI model times out

Production design assumes this will happen.

The basic toolkit:

Piece                   What it prevents
-----                   ----------------
Timeouts                Waiting forever
Retries                 Failing on a transient error
Idempotency             Double-processing messages
Circuit breakers        Cascading dependency failure
Dead-letter queues      One bad message blocking all work
Graceful degradation    Optional features taking down core flows
Rollbacks               Bad deployments staying bad

Idempotency is the most important one once events exist:

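async function handleEvent(event: Event) {
  // Redelivered event: the work is already done.
  if (await processedEvents.has(event.id)) {
    return;
  }

  await doWork(event);
  // A crash between these two steps replays the event, so doWork should be
  // safe to repeat, or commit in the same transaction as this marker.
  await processedEvents.add(event.id);
}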

Graceful degradation is equally important for product experience:

async function createTodo(input: CreateTodoInput) {
  const todo = await saveTodo(input);

  try {
    const suggestions = await aiSuggestions.forTodo(todo);
    await saveSuggestions(todo.id, suggestions);
  } catch (error) {
    logger.warn({ error, todoId: todo.id }, "AI suggestions unavailable");
  }

  return todo;
}

The core action succeeds even if the optional AI feature fails.

But graceful degradation has a trap: it hides failure from users, so it can hide failure from the team. Every degraded path should emit logs, metrics, or traces. Otherwise the app can be broken quietly.
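Timeouts and retries from the toolkit above can be sketched as two small helpers. The attempt count and delays are placeholder assumptions, and the helper names are invented. Note that this timeout stops waiting but does not cancel the underlying work, and retries are only safe when the operation is idempotent.

```typescript
// Reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Retry a failing operation with exponential backoff, up to `maxAttempts` tries.
async function retry<T>(
  attempt: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt();
    } catch (error) {
      lastError = error;
      if (i < maxAttempts - 1) {
        const delayMs = baseDelayMs * 2 ** i; // 100, 200, 400, ... with the default base
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}
```

The two compose naturally, for example `retry(() => withTimeout(callDependency(), 2000))`, where `callDependency` is a placeholder for any idempotent call. Production versions usually add jitter to the delay so that many clients do not retry in lockstep.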

Act 5: Evolve a Living System

Once a system is valuable, replacing it becomes risky. The goal shifts from “build the new thing” to “change the running thing without breaking it.”

Use the Strangler Fig Pattern for Legacy Migration

The strangler fig pattern puts a facade in front of the old system, then routes one capability at a time to the new system:

Request -> Facade / router
             |
             +-> legacy system for old routes
             |
             +-> new system for migrated routes

The default should usually be legacy until a slice is proven safe. Useful techniques include:

  • routing by endpoint, tenant, feature flag, or cohort
  • X-Served-By headers to identify which system handled a response
  • shadow traffic to compare old and new behavior
  • change data capture to keep new read models in sync
  • canary rollout before broad migration
  • rollback paths for every migrated slice

This is how modernization ships value continuously instead of disappearing into a rewrite.
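The routing decision inside the facade can be as small as the sketch below. The rule shape, the route prefixes, and the tenant allowlist are illustrative assumptions. The one property taken from the text is that anything not explicitly migrated falls through to legacy.

```typescript
type Backend = "legacy" | "new";

// Hypothetical rule: a route prefix that has been migrated, for all tenants
// or only for an allowlisted cohort.
interface MigrationRule {
  pathPrefix: string;
  tenants: "all" | string[];
}

function chooseBackend(path: string, tenantId: string, rules: MigrationRule[]): Backend {
  for (const rule of rules) {
    if (!path.startsWith(rule.pathPrefix)) continue;
    if (rule.tenants === "all" || rule.tenants.includes(tenantId)) {
      return "new";
    }
  }
  return "legacy"; // the default stays legacy until a slice is proven safe
}

// Response header that records which system handled the request.
function servedByHeader(backend: Backend): Record<string, string> {
  return { "X-Served-By": backend };
}

const migrationRules: MigrationRule[] = [
  { pathPrefix: "/v1/todos", tenants: "all" },            // fully migrated slice
  { pathPrefix: "/v1/invoices", tenants: ["tenant-a"] },  // canary cohort only
];
```

Rolling back a slice is then a data change (removing or narrowing a rule) and not a deploy, which keeps the rollback path cheap.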

Treat AI as a First-Class Layer

AI features are not just “call the model here.”

A production AI feature usually has several pieces:

User request
  |
  v
AI orchestration layer
  |
  +-> model
  +-> tools
  +-> retrieval / vector store
  +-> authorization checks
  +-> evaluation and tracing
  +-> fallback behavior

The vector store is just another read model, optimized for semantic retrieval. OpenSearch can serve this role through vector search, and Postgres with pgvector can be enough when staying close to the primary database is more valuable than specialized search infrastructure.

The same architecture rules still apply:

  • AI calls need timeouts and fallbacks.
  • AI tools need authorization.
  • AI output needs evaluation, not just uptime checks.
  • AI traces should show which tools, documents, prompts, and model calls influenced the result.
  • Sensitive data should be filtered before retrieval and before tool invocation.

AI does not remove system design. It raises the cost of sloppy boundaries.
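Several of these rules meet at the point where the orchestration layer invokes a tool. The sketch below checks authorization first, degrades to a fallback on failure, and records a trace entry either way. Every type, the trace shape, and the choice to return a fallback on denial instead of throwing are assumptions made for illustration.

```typescript
interface ToolCall {
  subject: string; // for example "agent:triage"
  tool: string;
  input: string;
}

interface TraceEntry {
  tool: string;
  outcome: "ok" | "denied" | "failed";
}

type ToolCheck = (subject: string, tool: string) => boolean;
type ToolFn = (input: string) => Promise<string>;

async function runTool(
  call: ToolCall,
  tools: Record<string, ToolFn>,
  check: ToolCheck,
  trace: TraceEntry[],
  fallback: string,
): Promise<string> {
  // Authorization comes first: a subject gets no tool it is not allowed to use.
  if (!check(call.subject, call.tool)) {
    trace.push({ tool: call.tool, outcome: "denied" });
    return fallback;
  }

  try {
    const tool = tools[call.tool];
    if (!tool) throw new Error(`unknown tool ${call.tool}`);
    const output = await tool(call.input);
    trace.push({ tool: call.tool, outcome: "ok" });
    return output;
  } catch {
    // Degrade, but leave evidence so the failure is not silent.
    trace.push({ tool: call.tool, outcome: "failed" });
    return fallback;
  }
}
```

The trace array stands in for real tracing. In a production system each entry would become a span or structured log so that a reviewer can see which tools influenced a given answer.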

Let Agents Help Engineering, With Guardrails

AI coding agents fit into this architecture too. A useful agent loop looks like this:

Task -> Agent reads context -> edits code -> runs tests -> fixes failures -> opens PR
                                      ^                         |
                                      |                         v
                                      +------ test feedback <---+

That loop is only safe when earlier pieces exist:

  • isolated dev environments
  • clear architecture boundaries
  • tests and contract checks
  • source control
  • CI
  • human review
  • permission boundaries for tools and secrets

A stronger self-correction loop adds production evidence:

Observe regression -> classify cause -> propose patch -> test in dev
        ^                                                   |
        |                                                   v
Rollback if unhealthy <- canary release <- human approval <- PR

This is not magic self-healing. It is architecture wired into a feedback loop. Observability detects the problem. Contracts define expected behavior. The dev environment contains the experiment. Tests judge the candidate fix. Canary rollout limits risk. Rollback stops the bleeding.

The more disciplined the system, the more useful autonomy becomes.

The Full Architecture in One Picture

A mature version of the original simple app might look like this:

One environment: dev, staging, or prod

User / Client
  |
  v
Edge: auth entry point, routing, rate limits
  |
  v
Application services
  |
  +-> Write database
  |      |
  |      v
  |   Outbox events
  |      |
  |      v
  +-> Message bus: NATS / Kafka / Pub/Sub
          |
          +-> Search projector -> OpenSearch read model
          +-> Email worker
          +-> Analytics worker
          +-> AI orchestration
                    |
                    +-> model
                    +-> vector store
                    +-> tools

Read API
  |
  +-> OpenSearch for candidate data
  +-> OpenFGA for permission checks
  |
  v
Authorized response

Ops across everything:
logs, metrics, traces, alerts, SLOs, deploys, rollbacks

Do not read this as the starting point. Read it as the result of many justified steps.

A Practical Decision Matrix

Use symptoms, not fashion, to choose the next piece.

Symptom                                                   Reach for                           Avoid
-------                                                   ---------                           -----
Live data is being used for testing                       Environment isolation               More manual caution
Logic is duplicated across UI, API, and jobs              Modular layers                      A service split too early
Nobody can explain a slowdown                             Observability                       Blind caching
Queries need search, filtering, and aggregation at scale  Read model / CQRS                   Making the write schema serve every read
Features react to the same state change                   Event bus                           More direct service calls
Consumers break after producer changes                    Contracts and versioning            Slack-based coordination
Permissions differ by workspace, object, role, or agent   Externalized authorization          Copy-pasted if checks
Messages arrive twice or dependencies flap                Idempotency, retries, timeouts      Assuming the happy path
Legacy replacement is risky                               Strangler fig migration             Big-bang rewrite
AI features need private data or tools                    AI layer with auth, evals, tracing  Raw model calls from random services
AI coding agents create risky diffs                       Specs, tests, CI, review gates      Trusting generated code because it compiles

The point is not to use every pattern. The point is to know what pain each pattern is meant to cure.

Common Mistakes

Splitting Services Before Splitting Concepts

If the monolith is tangled, microservices will usually distribute the tangle. First clarify module boundaries inside one app. Then split the parts that need independent ownership or scaling.

Adding CQRS Without Measuring

CQRS adds a synchronization problem. Use it when reads genuinely need a different model, not because read/write separation sounds sophisticated.

Treating Events as Invisible Function Calls

Events need ownership, schemas, replay rules, and observability. If nobody owns an event contract, every consumer owns the fallout.

Putting Authorization Only at the UI

UI checks improve experience. They do not protect data. Permission enforcement belongs server-side, close to every read, write, tool call, and background action.

Bolting AI Onto the Side

AI features still need security, evaluation, observability, and fallback behavior. A model call is an integration point, not a product architecture.

Conclusion

Modern software design is the art of adding structure at the right time.

Start simple. Keep the code modular. Isolate environments. Measure the system. Split reads from writes when the data demands it. Use events when reactions multiply. Treat contracts and permissions as first-class boundaries. Design for failure. Migrate incrementally. Add AI as a governed layer, not a shortcut around engineering.

None of these pieces are exotic on their own. The skill is sequencing them.

Architecture becomes less scary when every box in the diagram answers one question: what pain does this solve?
