
Be Careful with Retry – Don't DDoS Your Own System

Published:  at  04:02 PM

Retry isn’t bad. But if used incorrectly, you might accidentally become a “DDoS hacker”… against your own system.

Retry, the mechanism of repeating a request after a failure, is an indispensable part of distributed system design. When an API call to another service fails because of a network error, a timeout, or another transient fault, we typically retry to increase the chance of success.

Left uncontrolled, though, retry easily turns from a supporting mechanism into the culprit behind a domino effect.

1. When Retry Is a Double-Edged Sword

Imagine a simple scenario:

Now if 1000 requests arrive at Service A simultaneously:

Uncontrolled retry = shooting yourself in the foot.

2. Dangerous Retry Patterns

3.5 When to Retry and When Not To

Not all errors should be retried.

Should retry when:

Should NOT retry when:

✅ Only retry if the error has a chance of self-recovery.
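One way to encode this rule is a small predicate that decides whether a failure is worth retrying. The sketch below is only an illustration: the function name `is_retryable` and the set of status codes are assumptions rather than anything prescribed here, and the right classification depends on the API being called.

```python
from typing import Optional

# Assumed classification: HTTP statuses that usually signal a transient condition
# (request timeout, throttling, bad gateway, unavailable, gateway timeout).
TRANSIENT_STATUSES = {408, 429, 502, 503, 504}


def is_retryable(status: Optional[int] = None,
                 exc: Optional[BaseException] = None) -> bool:
    """Return True only when the failure has a chance of recovering on its own."""
    if exc is not None:
        # Timeouts and dropped connections are typically transient.
        return isinstance(exc, (TimeoutError, ConnectionError))
    if status is None:
        return False
    # Other statuses (for example 400 or 404) will fail the same way again.
    return status in TRANSIENT_STATUSES


print(is_retryable(status=503))              # True
print(is_retryable(status=400))              # False
print(is_retryable(exc=TimeoutError("t")))   # True
```

Keeping the decision in one place makes it easier to review which errors the system is willing to repeat.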

3.6 How to Retry Correctly?

  1. Limit the number of retries. Never retry indefinitely; cap retries at 2–3 attempts, depending on context.

  2. Use delay and jitter. Add a delay between retries (exponential or linear backoff) and combine it with jitter so that clients do not all retry at the same moment.

  3. Only retry idempotent actions. For example, GET and PUT are safer to retry than POST; retrying a non-idempotent call can create duplicate orders or duplicate transfers.

  4. Use a circuit breaker. Temporarily stop calling a downstream service that keeps failing, and try again later.

  5. Deferred retry: smart retry via jobs. Instead of retrying immediately, put the request into a queue or database and process it with a background job once the system stabilizes. This avoids adding load while the system is already in trouble.

  6. Log thoroughly. Record the error cause, retry count, and retry timestamps to make debugging and alerting easier.
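The first two points, plus basic logging, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: the names `backoff_delay` and `call_with_retry`, the base delay of 0.2 s, and the 5 s cap are all hypothetical choices, and the jitter strategy shown is "full jitter" (a random delay between zero and the exponential ceiling).

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.2, cap: float = 5.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(
    operation: Callable[[], T],
    max_retries: int = 3,
    retry_on: tuple = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying at most `max_retries` times on transient errors."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except retry_on as err:
            if attempt == max_retries:
                raise  # retry limit reached: surface the last error
            delay = backoff_delay(attempt)
            # Log the cause, the attempt number, and the planned wait.
            print(f"attempt {attempt + 1} failed ({err!r}); retrying in {delay:.2f}s")
            sleep(delay)
    raise AssertionError("unreachable")
```

Errors outside `retry_on` propagate immediately, so only failures considered transient are repeated. The `sleep` parameter exists so tests can skip the real wait.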

3.7 How to Know When to Retry Again?

  1. Use a circuit breaker. Temporarily stop calling the target service if it keeps failing, then gradually reopen (half-open state).

  2. Observe health checks or metrics. Check /health or data from Prometheus or Grafana to see whether the system has recovered.

  3. Honor the Retry-After header. Some APIs return a Retry-After header that suggests when to retry.

  4. Rate-limit retries. Avoid a flood of retries that overloads the service again.
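The circuit breaker from the first point can be sketched as a tiny state machine. This is a simplified, single-threaded illustration with assumed names (`CircuitBreaker`, `CircuitOpenError`) and assumed defaults (3 consecutive failures, 30 s before a trial call); real libraries add sliding windows, thread safety, and limits on how many trial calls are allowed.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # enough time has passed: let a trial call through
        return "open"

    def call(self, operation: Callable[[], object]) -> object:
        if self.state == "open":
            raise CircuitOpenError("downstream is failing; call skipped")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of adding load to a service that is already struggling; the half-open trial call is what decides when normal traffic resumes.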

4. Tools Supporting Effective Retry Implementation

Java / Spring ecosystem:

Other languages/platforms:

Cloud-native:

5. Real Case: Saving the System During Peak Season with Strategic Retry

Context: At year-end, the system is under heavy load from a promotional campaign. A payment processing service is overloaded and keeps returning timeout errors. Meanwhile, an automated batch job is sending thousands of requests per minute, each with up to 5 retries, no delay, and no jitter.

Consequences: Flooding retries cause the payment service to completely congest → cascading impact on other systems → 15 minutes of downtime during peak hours.
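A quick calculation shows why this configuration is so damaging. The sketch below uses a hypothetical figure of 2,000 requests per minute (the text says only "thousands") and assumes every attempt fails with the same probability; the function names are illustrative.

```python
def worst_case_calls(requests: int, max_retries: int) -> int:
    """Calls reaching the downstream when every attempt fails."""
    return requests * (1 + max_retries)


def expected_calls_per_request(failure_rate: float, max_retries: int) -> float:
    """Average attempts per request: 1 + p + p**2 + ... + p**max_retries."""
    return sum(failure_rate ** k for k in range(max_retries + 1))


# Hypothetical load; the article states only "thousands of requests per minute".
requests_per_minute = 2_000

print(worst_case_calls(requests_per_minute, max_retries=5))  # 12000
print(expected_calls_per_request(1.0, max_retries=5))        # 6.0
```

With no delay between attempts, the extra calls arrive almost immediately after the original ones, so a service that is already timing out can see up to six times its normal traffic.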

Resolution:

Result: System stabilized in under 10 minutes. Retries no longer “suffocated” the backend.

Lesson:

Retry is not about “forcing it through”, but about helping the system recover in a controlled manner.

6. Conclusion

Retry is a powerful tool when used correctly. But if implemented without control, it can break the system faster than the original error.

Remember:

Retry is medicine — used correctly, it heals; used incorrectly, it poisons your own system.

