First-class support for retry.

Topics: C# Language Design
Apr 10, 2014 at 8:09 PM
I really wish .NET had built-in retry capability. Working in the cloud, there are many transient errors that are resolved by simply asking again. Given the current state of C# and .NET, this is the code I find to be best:
int tries = 0;

start:
try { tries++; DO THINGS }
catch (Exception) {
    if (tries < 3) goto start;
    throw;
}
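For comparison, the same pattern can be written without `goto`. This is just a sketch of an equivalent loop; the `RetryLoop` class and `doThings` delegate are my own names, standing in for the `DO THINGS` placeholder above:

```csharp
using System;

static class RetryLoop
{
    // Equivalent of the goto-based snippet above, written as a loop.
    public static void Run(Action doThings, int maxTries)
    {
        for (int tries = 1; ; tries++)
        {
            try
            {
                doThings();
                return;
            }
            catch (Exception)
            {
                if (tries >= maxTries)
                    throw;   // out of attempts: let the last exception escape
            }
        }
    }
}
```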
Apr 10, 2014 at 9:40 PM
Edited Apr 10, 2014 at 9:41 PM
I agree that retry logic is definitely a necessity nowadays, but it sounds like what you're looking for is more a matter of writing a helper method than of changing the language itself. I just whipped this together, so it's far from pretty, but the following sounds like it could handle what you want:
public T ExecuteWithRetries<T>(Func<T> function, Func<Exception, bool> isRetryWorthy, int maxAttempts)
{
    Exception finalException = null;
    for (int i = 0; i < maxAttempts; i++)
    {
        try
        {
            return function();
        }
        catch (Exception ex)
        {
            finalException = ex;

            if (!isRetryWorthy(ex))
                throw;   // rethrow, preserving the original stack trace
        }
    }

    throw finalException;
}
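For illustration, here is what a call site might look like. This is a self-contained sketch: `FlakyCall` is a hypothetical operation that fails twice and then succeeds, and the helper from above is repeated so the example compiles standalone:

```csharp
using System;

static class RetryExample
{
    // The helper from above, repeated here so the example compiles standalone.
    public static T ExecuteWithRetries<T>(Func<T> function, Func<Exception, bool> isRetryWorthy, int maxAttempts)
    {
        Exception finalException = null;
        for (int i = 0; i < maxAttempts; i++)
        {
            try { return function(); }
            catch (Exception ex)
            {
                finalException = ex;
                if (!isRetryWorthy(ex))
                    throw;
            }
        }
        throw finalException;
    }

    static int attempts;

    // A deliberately flaky operation: fails twice, then succeeds.
    public static int FlakyCall()
    {
        attempts++;
        if (attempts < 3) throw new TimeoutException("transient failure");
        return 42;
    }

    public static void Main()
    {
        // Only timeouts are considered retry-worthy here.
        int result = ExecuteWithRetries(FlakyCall, ex => ex is TimeoutException, 5);
        Console.WriteLine(result); // prints 42 on the third attempt
    }
}
```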
How could you change the language to accommodate this any better?
Apr 11, 2014 at 1:39 PM
Edited Apr 11, 2014 at 1:43 PM
Probably something very similar to the for loop syntax.
try {  DO THINGS }
retry (int retries = 0; retries < 2; { Task.Delay(250); retries++; })
... for some reason codeplex is encoding the plus sign -_-
Apr 11, 2014 at 9:22 PM
Edited Apr 11, 2014 at 9:28 PM
Probably something very similar to the for loop syntax.
I'm not sure I see the advantage of this syntax over a function like I showed before. Using an outside method takes a few more lines of code the first time, but by the time you wrap every call in a try-retry block as you're suggesting, that seems like it could become unwieldy pretty quickly, all at the cost of reduced functionality.

Pretty well any call you make to something you would think to wrap in this could have some exceptions that need specific action. Say, for instance, you're calling a REST API: if you get back a 500, it might be good to retry, but if you get back a 401 with a WWW-Authenticate header, or even a 400 in some cases, you'll want to deal with the exception that's thrown and not retry. However much we'd like to pretend it does sometimes, a bad password entered several times does not yield a good password. Take another example of executing T-SQL: sometimes you'll hit a deadlock or timeout that just needs to be retried, but sometimes a constraint is violated and you shouldn't retry.

To make try-retry feasible, then, we need to add some way of discerning, as I put them in my example, "retry-worthy" exceptions. Then, when something is either not retry-worthy or we exceed the maximum number of retries, we need to handle the exception. Say, for instance,
try
{
     DO THINGS
}
retry (int; Func<Exception, bool>; int);
catch (Exception ex)
{
     HANDLE EXCEPTION
}
Where retry takes parameters in the following order:
  • int: maximum retry count
  • Func<Exception, bool>: check to make sure we should retry it
  • int: the number of milliseconds to wait between failures
The exact syntax could vary, of course, but that seems like the simplest, most restrictive form that would be usable in real-world scenarios.
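To make the semantics concrete, here is one way a compiler might lower such a try-retry-catch block into ordinary C#. This desugaring is my own sketch, not part of the proposal; `RetryLowering` and its parameter names are invented for illustration:

```csharp
// Hypothetical lowering of:
//   try { DO THINGS }
//   retry (maxRetries; isRetryWorthy; delayMs);
//   catch (Exception ex) { HANDLE EXCEPTION }
using System;
using System.Threading;

static class RetryLowering
{
    public static void Run(Action body, int maxRetries,
        Func<Exception, bool> isRetryWorthy, int delayMs, Action<Exception> handler)
    {
        int attempt = 0;
        while (true)
        {
            try
            {
                body();      // the original try body
                return;
            }
            catch (Exception ex)
            {
                // Give up if the exception isn't retry-worthy or retries are exhausted.
                if (!isRetryWorthy(ex) || attempt >= maxRetries)
                {
                    handler(ex);   // the original catch block
                    return;
                }
                attempt++;
                Thread.Sleep(delayMs);  // the wait-between-failures parameter
            }
        }
    }
}
```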

An example of that might be,
var req = HttpWebRequest.CreateHttp(url);
try
{
    var resp = (HttpWebResponse)req.GetResponse();
    [Use resp]
}
retry (2; ex => ex is WebException && ((HttpWebResponse)((WebException)ex).Response).StatusCode == HttpStatusCode.RequestTimeout; 250);
catch (Exception ex)
{
    Console.WriteLine(ex);
}
By the time you have that sort of infrastructure in place, it seems easier and significantly more readable to simply call another method, such as the one I proposed. That would reduce this down to something like:
var req = HttpWebRequest.CreateHttp(url);
try
{
    var resp = ExecuteWithRetries(req.GetResponse, ex => ex is WebException && ((HttpWebResponse)((WebException)ex).Response).StatusCode == HttpStatusCode.RequestTimeout, 2);
     [Use resp]
}
catch (Exception ex)
{
    Console.WriteLine(ex);
}
You could also, of course, add support for waiting between calls with another parameter to the method. I'm just not sure this way feels any more complicated or less readable than the native support you're showing us. Having our own function, of course, also allows us to make changes down the road to how we handle retry situations. Extensibility can be valuable when you need it. Say, for instance, you wanted to handle 401 (Unauthorized) at one bottleneck for every request and pop up a "login" dialog; you could easily add support for that in this function, or even in a middleware function that calls this one, but you couldn't do that in one place with the suggested retry block.
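As a sketch of that kind of middleware, here is one shape it could take. The names `AuthRetryMiddleware`, `PromptForLogin`, and `ExecuteWithAuthRetries` are all hypothetical, and the single-retry-after-login policy is just one choice:

```csharp
using System;
using System.Net;

static class AuthRetryMiddleware
{
    // Hypothetical hook: re-authenticate the user (e.g. pop up a login dialog).
    public static Action PromptForLogin = () => { /* show login UI */ };

    // Wraps any operation: on a 401, prompt for credentials once, then retry.
    public static T ExecuteWithAuthRetries<T>(Func<T> operation)
    {
        try
        {
            return operation();
        }
        catch (WebException ex)
        {
            var resp = ex.Response as HttpWebResponse;
            if (resp == null || resp.StatusCode != HttpStatusCode.Unauthorized)
                throw;          // not a 401: let it propagate unchanged

            PromptForLogin();   // one bottleneck for every request
            return operation(); // single retry with fresh credentials
        }
    }
}
```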

Like I said at the beginning of my previous post, I definitely like your idea of having easier support for retrying. It's important, and I feel like it can be pretty easily overlooked when you're making a lot of requests. But I'm just not sure it's realistic to have a fully functioning language feature to handle it, however nice it would be to see it accomplished more easily.
... for some reason codeplex is encoding the plus sign -_-
This seems to be fixed just with a refresh, I had that as well and I've seen a couple other people comment on it on other threads.
Apr 13, 2014 at 10:30 PM
Retry definitely belongs in some library/framework, not in the language. It raises too many questions: what if I want a delay before the first retry? What if I want these delays to be progressive? How should nested retries be managed? How can retry and async be combined? How do you log retries for quality-of-service diagnostics?

All of these questions can be answered within a library or class, but doing so at the language level just increases complexity. Ideally a language should have as little syntax as possible, leaving most of the sugar/features to library code. The reason is quite simple: you can change a library much more easily than a language.
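For instance, progressive delays and async compose naturally in library code. Here is a sketch under my own naming (`Retry.WithBackoffAsync`); the doubling backoff policy is just one possible choice:

```csharp
using System;
using System.Threading.Tasks;

static class Retry
{
    // Retries an async operation with exponentially growing delays:
    // initialDelayMs doubles after each failed attempt.
    public static async Task<T> WithBackoffAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isRetryWorthy,
        int maxAttempts,
        int initialDelayMs)
    {
        int delay = initialDelayMs;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception ex)
            {
                if (!isRetryWorthy(ex) || attempt >= maxAttempts)
                    throw;
            }
            await Task.Delay(delay);
            delay *= 2;   // progressive: 250ms, 500ms, 1000ms, ...
        }
    }
}
```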
Coordinator
Apr 16, 2014 at 2:25 PM
dotnetchris wrote:
I really wish .NET had built in retry capability. Working in the cloud there are many transient errors that are resolved by simply asking again.
Out of curiosity, how do you classify "transient" vs "permanent"? I ask because I believe there is NO good general classification possible into these two buckets, and also because they're the wrong classification to use.

The question you should be asking is "Is a failure now correlated with a failure in 10ms? in 10s? in 10 hours?" If yes, then you shouldn't try again within that time period. If no, then it's fine to try again.

I have seen ZERO evidence that anyone has looked for this kind of statistical correlation data. I can tell you what I found in one year of running a super-heavy workload on AWS: no intra-datacenter failures at all. Therefore, any code you write here to do retry would be needlessly confusing, untested code at best, and a bug source at worst.

In a datacenter, if I'm running batch processing and a workitem failed, then I'd normally stick it at the end of the queue to re-run in 10 hours or so.

If I'm in a client, then I will NEVER automatically retry. Think of your web browser: it doesn't automatically retry. Instead it displays a failure message as soon as possible and lets the user hit the retry button. This is especially valuable for mobile devices, where the dominant mode of failure is "walk into a room with poor coverage", and the user wants immediate feedback so they can take immediate remedial action.


Back to the classification of transient vs permanent in the light of this discussion. If you get a timeout, is that permanent or transient? Remember that TCP already deals with automatic retries due to congestion-related packet loss. If you get a timeout, is it because of a spike in load? Will that spike last all day because there's a major news event that everyone's looking at? or will it be over in 10ms? If you get a "404 Page Not Found / 500 Server Error", is that a permanent error? Or will an operations person be woken up by his pager to fix it up within 2 hours?