Yielding IEnumerables aka nested iterators

Topics: C# Language Design
Apr 25, 2014 at 10:37 PM
Edited Apr 27, 2014 at 12:51 PM
In our wonderful new forum i would like to raise the issue of dealing with nested IEnumerables. I am far from am authority on the matter, but i have encountered situations where i have wanted to have a syntax for "flattening" multiple IEnumerables at once from an iterator block.

To quote (almost) from this connect suggestion
Suppose we have:

public IEnumerable<string> GetStrings()
{
    yield return "first";
    yield return "last";
}

and now we'd like to write:

public IEnumerable<string> GetStrings(IEnumerable<string> innerList)
{
    yield return "first";
    yield foreach innerList;
    yield return "last";
}

instead of:

public IEnumerable<string> GetStrings(IEnumerable<string> innerList)
{
    yield return "first";
    foreach (string s in innerList)
    {
        yield return s;
     }
     yield return "last";
}
A commonly cited syntax for this is a new "yield foreach" contextual keyword to augment the existing "yield return" and "yield break". Below i aggregate some discussion of the matter. The most important of which is the first link.

This Link by Wes Dyer (formerly? of the C# compiler team) gives a good explanation of the performance implications along with the proposed syntax. His post references a paper by Erik Meijer et al. from the MSR project Spec# whose concepts where born from the MSR project (which shared most of the members of the Spec# team).

Some bonus chatter is available at the following locations (and many others) :
Lastly i fully recognize that unlike many of the sugary requests and ideas on this forum, this one would require some heavy lifting inside of a very complicated section of the compiler.
Apr 26, 2014 at 4:23 AM
I'd love to see this. I've written a lot of iterator-over-enumerable code and said syntax would make life a little bit easier. I was unaware that by implementing it "properly" that it could also lead to performance benefits. I'd be happy if the syntax just expanded out to what foreach and yield return produce in an iterator today and perhaps that's a stepping stone towards a proper and performant implementation tomorrow.
Coordinator
Apr 27, 2014 at 7:45 AM
AlgorithmsAreCool, note that your syntax wouldn't work as is because it's ambiguous in the following case:
public IEnumerable<object> GetStrings(IEnumerable<string> innerList)
{
    yield return "first";
    yield return innerList; // Does this return the object "innerList" itself, or its contents?
    yield return "last";
}
It would have to be a new keyword, yield return each innerList or something like that.
Apr 27, 2014 at 11:46 AM
lwischik wrote:
AlgorithmsAreCool, note that your syntax wouldn't work as is because it's ambiguous in the following case:
It would have to be a new keyword, yield return each innerList or something like that.
Absolutely. Some of the posts he included mused over some variations of syntax for that. My personal favorite is yield foreach since it is pretty clear what it does, does not introduce any additional keywords and doesn't break any currently valid syntax.
public IEnumerable<object> GetStrings(IEnumerable<string> innerList)
{
    yield return "first";
    yield foreach innerList;
    yield return "last";
}
Apr 27, 2014 at 12:56 PM
Oh dear, in my haste I failed to note that the example code that quoted had the right idea, but did not feature the proposed syntax.

As Halo_Four has correctly noted, the middle case should read as yield foreach. I have amended the OP to reflect this.

Thanks for the save Halo_Four.
Apr 27, 2014 at 3:51 PM
Any new keyword after yield will avoid breaking currently valid syntax, since anything except return or break at that position is currently a syntax error.

A vote in favor of yield each innerList. foreach includes for because it's part of starting a loop. This doesn't, and yield each reads better and is shorter.

I came up with an alternative I like even more, yield from innerList, but that would get ridiculous if you wanted to use query comprehensions because you could write yield from from x in innerList .... If that could be made a special case where only one from was necessary, I'd like that most of all, but C# doesn't really like special cases like that, and now that'd turn into yield innerList.Where... after the query comprehension transformation which is plainly illegal.
Apr 27, 2014 at 7:08 PM
I don't have an strong opinion about the syntax, yield foreach sound reasonable.

The important thing is: Will the c# compiler be able to translate the pattern to something more performant than foreach(...) yield return?

If the answer is not, I won't bother to do anything. Is a corner case feature and will make important performance considerations due to the added level of indirection even less visible.
Apr 27, 2014 at 7:26 PM
Olmo wrote:
The important thing is: Will the c# compiler be able to translate the pattern to something more performant than foreach(...) yield return?
The initial post in this discussion mentions a paper detailing a better implementation.
Apr 27, 2014 at 11:03 PM
Olmo wrote:
I don't have an strong opinion about the syntax, yield foreach sound reasonable.

The important thing is: Will the c# compiler be able to translate the pattern to something more performant than foreach(...) yield return?

If the answer is not, I won't bother to do anything. Is a corner case feature and will make important performance considerations due to the added level of indirection even less visible.
True, the benefit of the syntax change would be a lot less without changes to the generated iterator state machine that could improve performance. However such changes very likely increase the degree of difficulty and potential impact significantly. I would be all for the syntax addition to be made sooner with the understanding that the iterator improvements would be added later as time and resources would permit but potentially after Roslyn is released.
Apr 27, 2014 at 11:20 PM
A very good point. Although it shouldn't be used without a great deal of forethought, I think I could live with a few language features that start out as more convenient syntax and that may later evolve into more efficient implementations with the same semantics. That could also allow, for example, implementing an extra interface or few methods to provide even faster/more efficient iteration. All part of specifying what's to be done rather than how.
Apr 28, 2014 at 4:43 AM
Edited Apr 28, 2014 at 4:51 AM
What it if want the yield all of the nodes if recursive data structure say a tree?
IEnumerable< Node > YieldNodes ( Node n )
{
  if ( n is null ) return
  foreach e in YieldNodes( n.left )
  {
    yield return e;
  }
  yield return n;
  foreach e in YieldNodes( n.righ )
  {
    yield return e
  }
}
It's inefficient because iterators are a method local construct, which means you have to repeatedly yield the same nodes to the recursive caller
Think its O(n²)

Plus a there is no "iterator" object that you can pass around as an additional argument to the recursive method.

Suppose C# had one.
IEnumerable< Node > YieldNode ( Node n )
{
 return YieldNodes( n, thismethod )
}

IEnumerable< Node > YieldNodes ( Node n, Iterator iter )
{
  if ( thisnode == null ) return
  YieldNodes( thisnode.Left, iter )
  yield return thisnode on iter;
  YieldNodes ( thisnode.Righ, iter )
}
This would be O(N)

How sometimes what you require is a different algorithm
Coordinator
Apr 28, 2014 at 6:34 AM
Edited Apr 28, 2014 at 6:35 AM
JesperTreetop wrote:
A very good point. Although it shouldn't be used without a great deal of forethought, I think I could live with a few language features that start out as more convenient syntax and that may later evolve into more efficient implementations with the same semantics. That could also allow, for example, implementing an extra interface or few methods to provide even faster/more efficient iteration. All part of specifying what's to be done rather than how.
Let's spell that out explicitly...
  • You could do "yield each" as a concise syntax for the current inefficient implementation. That would be bad, because people would be fooled into using it more than they should.
  • The proposed "more efficient implementation" is NOT MORE EFFICIENT. Indeed it has extra overhead for the common case (where no-one uses "yield each" on it). It's only more efficient when someone does do "yield each" on it. So it would be bad to switch wholesale to that new implementation.
  • There's no middle ground. You can't make something that uses the current "optimized-for-simple" implementation in some cases, and uses the proposed "optimized-for-nested" implementation in other cases. That's because when you write an iterator method, you don't yet know how it's going to be used. The only way I see is if you declare two iterator functions
Iterator Function f() As IEnumerable(Of Integer) ...
Iterator Function f() As INestableEnumerable(Of Integer) ...
and then you devolve your codebase into two parallel chains of iterators, one using nested, one using non-nested, since only the ultimate top-level caller knows which one he wants to use.
Apr 28, 2014 at 10:51 AM
The proposed "more efficient implementation" is NOT MORE EFFICIENT. Indeed it has extra overhead for the common case (where no-one uses "yield each" on it). It's only more efficient when someone does do "yield each" on it. So it would be bad to switch wholesale to that new implementation.
That is interesting. Obviously if the "improved" implementation is only "improved" for the edge case and is a regression for the common case then implementing that is a bad idea.
You could do "yield each" as a concise syntax for the current inefficient implementation. That would be bad, because people would be fooled into using it more than they should.
I'd still like to see a new syntax like yield each precisely because it is more concise. And for as to whether people would be fooled into using it more, considering that it's far from difficult to enumerate within an iterator as it is I don't think it would encourage people to use it more. If logically that's what the iterator needs to do then they're going to do it regardless of whether it's ~10 characters or ~30 characters.
Apr 28, 2014 at 11:31 AM
Halo_Four wrote:
I'd still like to see a new syntax like yield each precisely because it is more concise. And for as to whether people would be fooled into using it more, considering that it's far from difficult to enumerate within an iterator as it is I don't think it would encourage people to use it more. If logically that's what the iterator needs to do then they're going to do it regardless of whether it's ~10 characters or ~30 characters.
Not really, the current syntax is more clear about the behavior. It's easier that you realize in the debugger of the N level of indirection that created the O(N^2) problem. Then avoid using recursions and use a manual Stack<Node> on your tree iterator.
Apr 28, 2014 at 4:47 PM
lwischik wrote:
  • The proposed "more efficient implementation" is NOT MORE EFFICIENT. Indeed it has extra overhead for the common case (where no-one uses "yield each" on it). It's only more efficient when someone does do "yield each" on it. So it would be bad to switch wholesale to that new implementation.
This is information I didn't have. This does indeed sound like it creates more worries.
Apr 28, 2014 at 9:51 PM
Edited Apr 28, 2014 at 9:54 PM
  • The proposed "more efficient implementation" is NOT MORE EFFICIENT. Indeed it has extra overhead for the common case (where no-one uses "yield each" on it). It's only more efficient when someone does do "yield each" on it. So it would be bad to switch wholesale to that new implementation.
@lwischik Could you elaborate on this at all? Why would adding yield foreach make the common case (no yield foreach) slower? I would think you would keep the code-gen the same for existing iterators and only have different codegen if a yield foreach statement is actually used (in which case the author presumably knows what he's doing by invoking it).

If it's at all possible to add this into the languages, I hope that it does get added someday. No one would ever expect the asymptotic behavior of Concat() that Wes Dyer describes in his article unless they are intimately familiar with the codegen.
Coordinator
Apr 28, 2014 at 10:03 PM
MgSam wrote:
@lwischik Could you elaborate on this at all? Why would adding yield foreach make the common case (no yield foreach) slower?
Let's give a concrete example...
Iterator Function f() As IEnumerable(Of Integer)
   Yield 0
 
   ' OPTION 1:
   For Each x In g() : Yield x : Next
   ' OPTION 2:
   Yield Each g()

   Yield 99
End Function

Iterator Function g() As IEnumerable(Of Integer)
   Yield 1 : Yield 2 : Yield 3
End Function

' and the caller code looks like this...
For Each x In f() : Console.WriteLine(x) : Next
What happens today with "option 1" is that, each time the caller calls MoveNext() for its loop, it delegates to the MoveNext implementation of f()'s iterator, which delegates to the MoveNext implementation of g()'s iterator. All these multiple layers of indirection are inefficient. When the tree of calls gets deeper, you can imagine that all the layers of delegating end up dominating the cost of doing ForEach.


Erik Meijer's paper proposes a different "bottom-up" implementation, where there is a single "current iterator" object. At first, f() is in charge of that iterator object for returning "0". Next, it hands over control, so that g() is the current iterator object. When the caller invokes MoveNext(), it dispatches directly to the current iterator object. This avoids the inefficiency.


It's a nice idea, and pays off when there are deep trees... But,
  1. It doesn't help for the (large) body of legacy code which doesn't support these enhanced iterators: all library code would need to be re-compiled to support it.
  2. If library code isn't re-compiled, then it would need to generate an "adapter", which introduces overhead without benefit.
  3. And regardless, if you don't have deep nested trees, then managing the "current iterator" object has some small overhead as compared to how iterators behave today.
Apr 30, 2014 at 8:18 AM
Edited Apr 30, 2014 at 8:18 AM
lwischik wrote:
It's a nice idea, and pays off when there are deep trees... But,
  1. It doesn't help for the (large) body of legacy code which doesn't support these enhanced iterators: all library code would need to be re-compiled to support it.
  2. If library code isn't re-compiled, then it would need to generate an "adapter", which introduces overhead without benefit.
  3. And regardless, if you don't have deep nested trees, then managing the "current iterator" object has some small overhead as compared to how iterators behave today.
  1. Is this really a problem? In the worst case, existing libraries will behave exactly as they do today, performance-wise (except for point 2).
    Since the .NET framework itself is likely to be compiled with the new compiler, most LINQ-to-objects usages would benefit from this change, even without recompiling the callers. (I know that Where/Select calls are already optimized to use a single iterator but nesting starts as soon as you're using other methods such as SelectMany.)
  2. That might be an issue at first, but I think that it's likely to disappear over time, as most libraries get re-compiled.
  3. Are there some benchmarks available (this also applies to 2)? Erik Meijer's paper only compare nested trees.