Distributed systems fail in ways that monoliths never do. In a single process, if a function call fails you get an exception — simple, local, easy to reason about. Spread that same call across a network boundary and suddenly you’re dealing with partial failures, redelivered messages, timeouts that may or may not mean the operation succeeded, and cascading outages that start from something as mundane as a slow database query.

This post is about the patterns that make inter-service communication reliable — not just the happy path, but every failure mode that will eventually bite you in production.

When Does Reliability Actually Matter?

Not every system has the same requirements. Before adding complexity, be honest about what you actually need.

Reliability is critical when:

  • Correctness affects money or legal standing — bank transfers, invoices, inventory adjustments
  • Double-processing causes real harm — booking the same seat twice, charging a card twice
  • You need a full audit trail — compliance systems, healthcare records, financial ledgers

It’s probably fine to be looser when:

  • Data loss or duplication is low-stakes — analytics events, live metrics, view counts
  • The reconciliation cost is trivial — a nightly job can heal inconsistencies cheaply
  • You’re building something where “best effort” semantics match the domain — streaming, notifications, feeds

For everything in between, the patterns below give you a toolkit to dial up exactly as much reliability as you need.

Communication Protocols

Before diving into patterns, a quick orientation on the two families of protocols that matter most here.

Synchronous (HTTP/gRPC): Service A calls Service B and waits for a response. Simple to reason about, easy to trace, but tightly couples availability — if B is down, A’s request fails immediately.

Asynchronous (Message Queues): Service A publishes a message to a broker (SQS, RabbitMQ, Kafka) and moves on. Service B consumes it later, on its own schedule. Decouples availability, but introduces eventual consistency and more complex failure modes.

Most real systems use both. The patterns below apply to each, sometimes differently.

Idempotency: The Foundation

Idempotency is the property that says: performing the same operation multiple times has the same effect as performing it once.

In distributed systems, retries are inevitable. A network timeout doesn’t tell you whether the operation succeeded or failed — it just tells you the connection dropped. To retry safely, the operation must be idempotent.

Naturally idempotent operations:

  • Setting a field to a specific value: UPDATE user SET status = 'active' WHERE id = X
  • Inserting with an upsert: INSERT ... ON CONFLICT DO NOTHING

Not naturally idempotent:

  • Incrementing a counter: UPDATE balance SET amount = amount + 100
  • Appending to a list

For the non-idempotent cases, the standard approach is an idempotency key — a unique identifier attached to the operation that the receiver tracks in persistent storage:

-- Receiver checks before processing
SELECT 1 FROM processed_events WHERE event_id = $1
-- If not found, process and insert atomically
INSERT INTO processed_events (event_id, processed_at) VALUES ($1, NOW())

A unique database constraint on event_id gives you safety for free: if two workers race to process the same message, one insert succeeds and one gets a constraint violation, with no double-processing. This is the same mechanism used by Stripe, Braintree, and most payment processors when they expose an Idempotency-Key header.

Another approach is timestamp-based ordering: attach a created_at timestamp to each event and reject any event whose timestamp is older than the last processed timestamp for that entity.

Async: Message Queues

Message queues are the backbone of resilient async communication. The basic topology is Producer → Queue → Consumer, but the interesting part is what happens when things go wrong.

Async Queue Flow with AWS SQS and Dead Letter Queue

Happy Path

  1. Producer publishes a message to the queue
  2. Consumer picks it up, processes it, and ACKs (acknowledges) the message
  3. Queue removes the message

The ACK is the critical safety mechanism. Until the consumer sends an ACK, the queue considers the message “in flight” and won’t redeliver it — but it will redeliver if the consumer disappears.

When the Consumer Crashes

If the consumer process crashes mid-processing (hardware failure, OOM kill, deployment), the message never gets ACKed. Most queue implementations expose a visibility timeout — a window within which the message is hidden from other consumers while one is processing it. When that window expires without an ACK, the message becomes visible again and another consumer instance picks it up.

This means your consumer must be idempotent. The same message can and will be delivered more than once. Design for it, not around it.

When There’s an Application Error

If the consumer receives a message but throws an exception during processing (downstream service down, unexpected data, bug), the message is typically marked as failed. The pattern here is a Dead Letter Queue (DLQ): after N failed attempts, the message is moved to a separate queue where it sits until a human (or an automated recovery process) handles it.

The DLQ is not a graveyard — it’s a holding area. You need:

  • Metrics on DLQ depth (alert when non-zero)
  • Tooling to inspect DLQ messages (what failed, why)
  • A process to replay DLQ messages once the underlying issue is fixed — which is why idempotency matters again

Sync: HTTP and the Circuit Breaker

HTTP communication is simpler to reason about but more fragile by default. When Service A calls Service B and gets an error, it has a few options.

Transient errors (network blip, momentary overload) are best handled with retries with backoff. The Fibonacci sequence works well for retry delays: 1s, 1s, 2s, 3s, 5s, 8s… It keeps early retries fast while avoiding thundering-herd problems under sustained load.

Sustained failures call for a circuit breaker.

Circuit Breaker State Machine with retry strategy

The circuit breaker is a state machine with three states:

  • Closed (normal): Requests pass through. The breaker counts failures. Below a threshold, all is well.
  • Open (tripped): Too many failures in a window. New requests fail immediately without hitting the downstream service. This protects both the caller (no more waiting for timeouts) and the dependency (no more load while it’s struggling to recover).
  • Half-Open (probing): After a timeout, a small number of requests are allowed through as a probe. If they succeed, the breaker resets to Closed. If they fail, it trips back to Open.

Well-maintained libraries handle this for you: Polly (.NET), Resilience4j (Java), cockatiel (Node.js), gobreaker (Go).

What to Do After Propagating the Error

Once you’ve retried and the circuit is open, you need to handle the failure at a higher level. What happens depends on how the original operation was triggered:

OriginOn failure
HTTP request handlerReturn HTTP error to caller (let them retry)
Queue message consumerLet message fail → DLQ
Background/batch jobLeave record in “pending” state for next run

The key principle: never silently swallow errors. Every unhandled failure should leave a trace — in a queue, in a database record state, in an alert.

The Outbox Pattern: Atomic Multi-Target Updates

Here’s a scenario that breaks naive implementations: you update your database, then try to publish a message to a queue. Between those two operations, your process crashes. The database has the update; the queue never got the message. Your system is now inconsistent and there’s no record of the gap.

The Outbox pattern solves this with one rule: never publish messages or make side-effect calls outside of a database transaction.

Outbox Pattern with RDS and background worker

Instead of calling the queue directly, you write a record to an outbox table inside the same transaction as your business data change:

BEGIN;

-- Business operation
UPDATE balance SET amount = amount + 100 WHERE user_id = $1;

-- Side effect deferred to outbox
INSERT INTO outbox (id, event_type, payload, status, created_at)
VALUES ($2, 'PAYMENT_RECEIVED', $3::jsonb, 'PENDING', NOW());

COMMIT;

Either both rows commit or neither does. Atomicity guaranteed.

A background relay worker polls the outbox table for pending records and dispatches them — publishing to a queue, making an HTTP call, whatever the operation requires — then marks each record as DONE. If dispatch fails, the record stays PENDING and is retried on the next poll.

-- Worker marks record done after successful dispatch
UPDATE outbox SET status = 'DONE', processed_at = NOW()
WHERE id = $1 AND status = 'PENDING';

Because the worker may retry the same record (worker crash, race condition), the downstream handler must be idempotent. There’s that word again.

Operational requirements:

  • Alert on outbox records that are PENDING for too long (broken worker, dispatcher bug)
  • Alert on records in FAILED status (permanent errors that need manual intervention)
  • Consider a retry_count and max retries before moving to FAILED

A note on database support: ACID transactions with multi-table atomicity are available in PostgreSQL, MySQL, SQL Server, Oracle, MongoDB (multi-document transactions), and DynamoDB (TransactWrite API). The pattern works with any of them.

Putting It All Together: Payment Processing

Let’s walk through a concrete system that combines all of these patterns.

The problem: We have a payment webhook that receives a POST from the bank when a payment is received. We need to:

  1. Update the customer’s balance
  2. Notify the compliance team (via RabbitMQ for async processing)
  3. Notify the user (via HTTP to a notification service)

All three must eventually happen, consistently, with no double-processing.

Payment Processing End-to-End Flow

The Flow

① Bank → API Gateway → Webhook Service

The bank sends a POST with a payment_id. The webhook service checks the payment_id against a processed_webhooks table — if already seen, return 200 OK immediately (idempotent receiver). Otherwise proceed.

② → ③ Webhook Service publishes to SQS

Rather than processing inline (which risks timeouts, and the bank retries on non-2xx), the webhook service publishes a message to an SQS queue and returns 200 OK immediately. The bank is happy.

④ SQS triggers the Balance Service

SQS delivers the message to the Balance Service consumer. The consumer begins processing. If the consumer crashes, the message reappears after the visibility timeout. Idempotency check at the top of the handler guards against double-processing.

⑤ Atomic DB transaction

Inside a single transaction, the Balance Service:

  • Updates the customer’s balance
  • Inserts an outbox record for compliance notification
  • Inserts an outbox record for user notification
  • Commits

If any of these three fail, the whole transaction rolls back. The SQS message will be redelivered and the handler will re-attempt from scratch.

⑥ ⑦ Outbox workers dispatch in the background

Two background workers (or one worker handling both types) poll the outbox table:

  • Compliance worker: publishes a CHECK_PAYMENT message to RabbitMQ, marks outbox record DONE
  • Notification worker: calls the notification service HTTP endpoint, marks outbox record DONE

Both workers use idempotency keys when dispatching so that retries don’t cause double notifications.

If a dispatch fails repeatedly, the outbox record moves to FAILED and an alert fires.

What the Monitoring Stack Watches

SignalThresholdAction
SQS DLQ depth> 0Page on-call
Outbox PENDING age> 5 minutesPage on-call
Outbox FAILED count> 0Page on-call
Circuit breaker open eventsAnyLog + alert
Compliance/notification HTTP error rate> 1%Alert

Summary

Reliable distributed communication comes down to a small set of composable principles:

  1. Make handlers idempotent. Retries will happen. Design for them.
  2. Use dead letter queues for async failures, and alert on them.
  3. Retry with backoff for transient HTTP errors; use a circuit breaker to stop hammering dependencies that are clearly down.
  4. Use the outbox pattern when you need atomic “at least once” delivery across multiple targets — never write to a queue and a database in separate operations.
  5. Monitor aggressively — DLQ depth, outbox staleness, circuit breaker state. Silent failures become production incidents that wake you up at 3am.

The patterns aren’t complex individually. The discipline is applying them consistently, across every service boundary, before the incident that would have motivated you to do so.