Roslyn's performance

Topics: APIs, General
Apr 11, 2014 at 10:02 AM
Hi everyone!

From discussions here, it can be clearly seen that the Roslyn team is very mindful of performance. It would be very interesting to learn more about techniques used to optimize Roslyn.

For instance, I've seen diagnostics and annotations being stored in weak hash tables rather than directly as fields of GreenNode. Is this done purely to save memory, or are there also speed implications? What other tricks are used to improve performance?

Also, I've heard rumors about Roslyn being parallel, but I have never seen definitive information about that. Is it indeed multi-threaded? If so, what parts of Roslyn can run in parallel? Parsing? Typechecking? IL emission?

Finally, are there any benchmarks that compare Roslyn-based compilers with old ("native") ones?

Apr 11, 2014 at 12:03 PM
I'm not familiar with everything done in Roslyn, but at the basic level the two big parallelizations in the C# compiler are: the syntax trees for individual files are generated in parallel, and type compilation (fields, initializers, and methods) is done with each type in parallel. While I can't find it, I believe that binding of type and method signatures occurs sequentially between these two operations. Binding of fields and method bodies is done lazily during type compilation (in parallel). Once the language compiler is done and you have a complete in-memory representation of the IL data, putting the output file together and writing it is done sequentially.
Apr 11, 2014 at 10:36 PM
Edited Apr 12, 2014 at 3:57 PM
== Concurrency:

Most (all?) of the Roslyn compiler's APIs are thread-safe. "Thread-safe" is used here in the sense that multiple threads may access the same or different API entry points concurrently and get sound and consistent results.
That was motivated by the desire to support free-threaded clients. It is not uncommon for IDEs to do only minimal work on the UI thread to keep it responsive, and to do most of the heavy lifting in worker threads/tasks. We did not want to impose any strict requirements on the order in which APIs can be accessed, sequentially or concurrently. Note that this does not necessarily make things faster, just more convenient to use.
Once your APIs are accessible from multiple threads (and once you have already paid for this feature in terms of design and development costs), it makes sense to use concurrency in the batch compile scenario too.

At the moment the major stages of batch compile are roughly:
1) Initial work: parsing the command line, creating the compilation, etc. This is relatively fast and sequential.

2) Parsing source files: this is done concurrently with the granularity of a single file, i.e. each file is parsed sequentially, but several files may be parsed at the same time.

3) Creating the declaration table: a kind of index into the sources that tells where namespaces and types are declared. This is done sequentially. I do not know if there are reasons for that or it just does not matter. It is a fairly quick stage.

4) Getting declaration diagnostics: this is basically just going over all symbols and forcing them to fully bind. The side effect of binding symbols is that if there are problems with them (inheritance loops, unresolved symbols and such), the problems will be reported. This stage is concurrent. There is an implicit partial ordering - binding the members of a type will force binding of the base type, since there are semantic dependencies. Otherwise unrelated symbols could be bound in a fairly random order.

5) Compiling method bodies: this is the stage where method bodies are bound, validated, lowered and emitted as streams of IL and debug information. In theory we do not need to wait until #4 finishes, but we want to know whether there were any declaration errors. If there were, there is not much point in emitting method bodies, and we will do only the binding part to produce diagnostics in method bodies, if there are any. This stage is concurrent at type granularity - methods within the same type compile sequentially, one after another, in a predetermined order (lexical if possible). This is done to have stable output. Sometimes compiling a method will introduce additional synthetic members into the containing class - nested display classes for lambda closures, async and iterator state machines, etc. If the same code gets compiled twice, we want the synthetic members to be emitted in the same order and to have the same names and shapes.
This is generally the most expensive stage in terms of CPU time, but also the one that scales pretty well on multiple cores.

6) Serialization of the PE image and PDB information: this is where all the data that we have accumulated up to this point - symbols, string constants, method bodies, exception handling regions, debug information, etc. - is traversed and packed into the corresponding tables and streams in the output binary. This is a sequential process. It is relatively hard to do serialization concurrently, since the binary data has internal dependencies - entries in some tables refer to entries in other tables, and so forth.
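The shape of the pipeline above can be sketched in a few lines. This is a language-agnostic illustration in Python, not Roslyn's actual code; `parse_file`, `compile_type` and `emit` are hypothetical stand-ins for the real stages. Parallel stages use a thread pool, while the bookending stages stay sequential:

```python
# Sketch of the batch-compile pipeline: parallel per-file parsing (stage 2),
# parallel per-type compilation (stage 5), sequential emit (stage 6).
from concurrent.futures import ThreadPoolExecutor

def parse_file(path):
    return ("tree", path)                   # stand-in for a syntax tree

def compile_type(type_name):
    return (type_name, f"IL({type_name})")  # stand-in for lowered IL

def emit(compiled):
    # Serialization is sequential: PE tables cross-reference each other.
    # Sorting stands in for the predetermined, stable output order.
    return "".join(il for _, il in sorted(compiled))

def batch_compile(paths, types):
    with ThreadPoolExecutor() as pool:
        trees = list(pool.map(parse_file, paths))       # stage 2: per file
        # stages 3-4 (declaration table, declaration diagnostics) go here
        compiled = list(pool.map(compile_type, types))  # stage 5: per type
    return emit(compiled)                               # stage 6: sequential

print(batch_compile(["a.cs", "b.cs"], ["A", "B"]))  # IL(A)IL(B)
```

The point of the sketch is only the granularity: files and types are the units of parallelism, and the output is made deterministic before the sequential serialization step.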

Concurrency can be disabled programmatically via compilation options (the concurrentBuild parameter).
Command line drivers like rcsc expose that as the /parallel flag.

One interesting case where one might want to disable concurrency is when debugging a problem related to binding.
Marked as answer by TomasMatousek on 4/18/2014 at 8:25 PM
Apr 11, 2014 at 11:48 PM
Diagnostics and annotations - yes, they are stored in side tables to make green nodes smaller. And yes, that helps performance.

Note that normally far from every syntax node has syntax errors, so a field on the green node would be virtually always unused. Considering the number of syntax nodes and the rate at which the parser creates them, extra fields are a noticeable expense. In addition, larger nodes would make the overall syntax model larger and less likely to fit in CPU caches when used. As a result, storing this rarely occurring data in side tables is a profitable trade-off. It does make it more expensive to deal with diagnostics when errors do happen, but the vastly more common nodes with no diagnostics are cheaper.
Also note that nodes track whether the node or any of its descendants have associated diagnostics (the ContainsDiagnostics bit in the flags) - there is no need to touch the tables at all when a particular sub-tree as a whole has no diagnostics.

The same applies to annotations - they are not common enough to justify the cost of an extra field in the green node.
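The side-table idea can be sketched as follows. This is an illustration in Python, not Roslyn's implementation; the class and table names are made up. The node carries only a one-bit flag, the rare payload lives in a shared weak table, and the flag lets lookups short-circuit:

```python
# Sketch of the side-table pattern: nodes store a ContainsDiagnostics-style
# flag; the actual diagnostics live outside the node, in a weak table.
import weakref

_diagnostics = weakref.WeakKeyDictionary()   # side table: node -> diagnostics

class GreenNode:
    __slots__ = ("kind", "children", "contains_diagnostics", "__weakref__")

    def __init__(self, kind, children=(), diagnostics=None):
        self.kind = kind
        self.children = children
        # the bit is set if this node or any descendant has diagnostics
        self.contains_diagnostics = bool(diagnostics) or any(
            c.contains_diagnostics for c in children)
        if diagnostics:
            _diagnostics[self] = diagnostics

    def get_diagnostics(self):
        if not self.contains_diagnostics:    # cheap check, no table lookup
            return []
        own = _diagnostics.get(self, [])
        return own + [d for c in self.children for d in c.get_diagnostics()]

ok = GreenNode("Identifier")
bad = GreenNode("Identifier", diagnostics=["unexpected token"])
root = GreenNode("Expression", (ok, bad))
print(root.get_diagnostics())   # ['unexpected token']
```

The error-free node costs one bit instead of one pointer field, which is exactly the trade-off described above: cheap common case, slightly more expensive rare case.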

There are definitely more interesting performance "tricks" in the codebase. I will try describing more here later.
Marked as answer by TomasMatousek on 4/18/2014 at 8:25 PM
Apr 12, 2014 at 12:43 PM
@VSadov, thanks a lot for your posts! Looking forward to the follow-up :)
Apr 19, 2014 at 7:49 PM
Edited Apr 21, 2014 at 4:51 PM
While we are on the topic of green trees, the general mechanics of red/green trees can be found here:
and here:

I do not think I can explain that any better.

There are some additional implementation details that could be interesting from the angle of performance.

=== Deduplication of green nonterminals

As you may notice from the descriptions and from the actual implementation, green nodes are immutable and self-contained. In particular, a green node does not know its position in the text and does not know its parent. As a result, the same green node for "2 - 2 " can represent any occurrence of "2 - 2 " - as long as it parses into an identically shaped green sub-tree, why not just use the same sub-tree?
That is achieved in the green node factories - before creating a green node for "2 - 2 ", the factory will check whether it has already seen a binary expression with the same children "2 ", "- ", "2 ", and if the answer is yes, it will just return the cached node. One can see it as a kind of memoization.

When "2 * 2 - 2 * 2 " is parsed, the parser builds a tree like this:
            2 * 2 - 2 * 2
           /      |       \
     2 * 2        -       2 * 2
   /   |   \             /  |   \
  2    *    2           2   *    2
However, because of deduplication the actual green tree may look like
            2 * 2 - 2 * 2
              /   |   \
              \   -   /
               \     /
                \   /
                2 * 2
              /   |   \
              \   *   /
               \     /
                \   /
Since the identity of green nodes is irrelevant, the second tree is indistinguishable from the first, but at the same time it is much more compact: 5 nodes instead of 10.

Note the "may" in the "may look like". That is because deduplication is not a guaranteed feature. To guarantee deduplication we would need to cache all already seen nodes. That would make deduplicating caches unnecessarily big. Instead, the caches have fixed size with new entries preempting older ones. As a result deduplication is not perfect, but because of locality of syntax patterns in the code it still works pretty well.

Another trade-off here is that we do not deduplicate nonterminals with more than 3 children. The more children you need to consider, the more expensive it gets to compare nodes and the more likely it is that something will not match. All that leads to diminishing returns in terms of cache hits when larger nodes are considered.
Since the most common syntax nodes (simple names, binary expressions, member accesses, typical argument lists...) have 3 children or fewer, handling just the nodes with 3 or fewer children seems to provide the best benefit/cost ratio.
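A best-effort, fixed-size deduplicating factory can be sketched like this. This is an illustration in Python, not Roslyn's code; `make_green`, the cache size, and the direct-mapped eviction policy are assumptions chosen for brevity:

```python
# Sketch of a fixed-size deduplicating node factory: a direct-mapped cache
# where new entries overwrite old ones, so dedup is best-effort, and nodes
# with more than 3 children are never cached.
CACHE_SIZE = 1 << 10
_cache = [None] * CACHE_SIZE

class Green:
    def __init__(self, kind, children=()):
        self.kind, self.children = kind, children

def make_green(kind, children=()):
    if len(children) > 3:                      # large nodes: skip the cache
        return Green(kind, children)
    # children are compared by identity: they were themselves deduplicated
    key = (kind, tuple(id(c) for c in children))
    slot = hash(key) % CACHE_SIZE
    entry = _cache[slot]
    if entry is not None and entry[0] == key:
        return entry[1]                        # hit: reuse the cached node
    node = Green(kind, children)
    _cache[slot] = (key, node)                 # miss: evict whatever was there
    return node

two = make_green("NumericLiteral")
a = make_green("Multiply", (two, two))
b = make_green("Multiply", (two, two))
print(a is b)   # True: both "2 * 2" occurrences share one green node
```

Comparing children by identity rather than structurally is what keeps the lookup cheap, and it works precisely because the children were produced by the same deduplicating factory.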

=== Deduplication of terminals
Typical, human-readable syntax trees are very shallow. Even in enormously large code files there seem to be some natural limits on how deeply syntax can be nested. Perhaps because of that, a very large fraction of nodes in the green tree are tokens - terminal nodes like punctuation, keywords, identifiers, literals. They are also the ones with the most repetitions - once you use a variable "x" in code, you are likely to use it more than once.

That makes it even more important to dedupe tokens. The principle is the same as with nonterminals: there are fixed-size caches of tokens that are used by the token factories, or by the scanner directly, to avoid creating a new token when one can be taken from the cache. The difference is that most tokens do not have any child structure and are basically direct representations of text, so matching tokens typically means matching their textual values.

Another interesting difference in the deduping strategy here is that there are two levels of caching for tokens.
L1 is a fast, relatively small, not thread-safe local cache used exclusively by a single parsing session.
L2 is a slower, larger, thread-safe cache that is shared between L1 caches. When an L1 cache experiences a miss, it looks in L2.
So, if a token for "System" is created while parsing aaa.cs, that same token may be reused when parsing bbb.cs.
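The two-level scheme can be sketched as follows. This is an illustration in Python, not Roslyn's implementation; the class name and tuple-shaped "token" are placeholders. The key property is that L1 needs no locking because it belongs to one parsing session, while L2 is shared and locked:

```python
# Sketch of an L1/L2 token cache: a per-session, lock-free L1 backed by a
# process-wide, locked L2 shared between all parsing sessions.
import threading

class TokenCache:
    _l2 = {}                          # shared across all parsing sessions
    _l2_lock = threading.Lock()

    def __init__(self):
        self.l1 = {}                  # private to one session: no locking

    def get_token(self, text):
        tok = self.l1.get(text)
        if tok is not None:
            return tok                       # L1 hit: fastest path
        with TokenCache._l2_lock:            # L1 miss: consult shared L2
            tok = TokenCache._l2.get(text)
            if tok is None:
                tok = ("token", text)        # stand-in for a real token
                TokenCache._l2[text] = tok
        self.l1[text] = tok                  # promote into L1 for next time
        return tok

parse_a, parse_b = TokenCache(), TokenCache()    # two files, two L1 caches
t1 = parse_a.get_token("System")
t2 = parse_b.get_token("System")
print(t1 is t2)   # True: the token created for aaa.cs is reused for bbb.cs
```

A real implementation would bound both caches and evict old entries, as described for the nonterminal caches above; the unbounded dicts here are only to keep the sketch short.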

=== Deduplication of strings
There is clearly a trend here: immutable, self-contained objects are reusable, and when repetitions are common, it is hard to resist deduping them.

Strings used in code can be massively redundant. How many times would you find the string "int" or "System" in a typical C# solution?
The caching strategy here is similar to the caching of tokens - a two-layered L1/L2 cache.
The interesting difference is that L2 is shared across languages and is also used by metadata reader.

A string instance for "System" allocated when parsing VB code may end up being used by C#, or when the namespace name "System" is read from metadata.
Marked as answer by VSadov on 4/21/2014 at 9:51 AM
May 2, 2014 at 7:58 PM
@VSadov thanks a lot for these posts. They are really cool. Do you blog somewhere? If so, where?
May 7, 2014 at 8:50 PM
@xyztwentythree sadly, nowhere. I meant to start a blog more than once in the past, but every time it failed due to the "not today" factor. :-)
May 7, 2014 at 9:40 PM
=== Laziness of red trees:

Red nodes are the public façade of the syntax model. Unlike a green node, a red node:
  • knows its location span in the tree
  • knows not only children, but the parent too
  • knows the syntax tree to which the node belongs.
All this makes red nodes very convenient, but it also ensures that red nodes cannot be reused within a tree (all nodes have different spans) or between trees (different parents). A problem arises - if there is a red tree for every version of the text, how can we afford a new tree every time the user types a character? The solution is to build the red tree lazily.

The key observation here is that there is no way to get at a random red node directly. To access a node you must get its parent first and then ask for the child. That is when red nodes are constructed. Note that when we descend from a parent into a child, we know all the information needed to construct the child. We know the parent - it is the current node. We also know the green node that contains the internal data for the child (kind, width and such) - if this is the left red node of a binary expression, its green node will be the left green child of the current node, and so on. And we can compute the position of the new node, since we know the parent's position and the widths of all children that precede the given one (if it is the first child, its position is the same as the parent's).

Since we can build red nodes on demand, we do not need to create an entire new tree for every character typed. We only need to create a new green tree, which is a relatively cheap incremental operation (the parser reuses most of the original tree; generally only nodes that intersect with the edited span need replacing), and create a trivial new root red node with 0 for the position, null for the parent and the root green node for the green node. After that the red tree can build itself if/when the user descends into its child nodes. Building the red tree becomes a pay-for-play scenario.
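The descent described above can be sketched in a few lines. This is an illustration in Python, not Roslyn's code; the `Green`/`Red` classes and the `child` method are simplified placeholders. A red node materializes a child only when asked, computing the child's position from the parent's position plus the widths of the preceding green siblings:

```python
# Sketch of lazy red-node construction over an immutable green tree.
class Green:
    def __init__(self, kind, width, children=()):
        self.kind, self.width, self.children = kind, width, children

class Red:
    def __init__(self, green, parent, position):
        self.green, self.parent, self.position = green, parent, position
        self._children = [None] * len(green.children)  # built on demand

    def child(self, i):
        if self._children[i] is None:
            # position = parent position + widths of preceding siblings
            pos = self.position + sum(g.width for g in self.green.children[:i])
            self._children[i] = Red(self.green.children[i], self, pos)
        return self._children[i]

lhs = Green("2 ", 2)
op = Green("- ", 2)
rhs = Green("2 ", 2)
expr = Green("binary", 6, (lhs, op, rhs))

# After an edit, only this trivial root is created; everything below it
# is realized lazily as consumers descend into the tree.
root = Red(expr, parent=None, position=0)
print(root.child(2).position)          # 4: after "2 " and "- "
print(root.child(2) is root.child(2))  # True: reference identity preserved
```

Note that `lhs` and `rhs` can be the very same deduplicated green node, yet the two red children built from them get distinct positions and parents, which is exactly why red nodes cannot be shared the way green nodes are.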

Laziness helps tremendously in scenarios like typing. When the user is furiously typing, no one has the time or need to examine the entire red tree. Once the user types another character, the current tree becomes irrelevant, and any analysis, if started, tends to be cancelled and restarted with the new tree (often with a small delay - in case the user continues typing, there is no need to start anything too eagerly). As a result, intermediate red trees, while fully "available", will generally see only a tiny fraction of their nodes realized.
Marked as answer by VSadov on 5/7/2014 at 2:40 PM
May 8, 2014 at 9:52 PM
You may be interested in a draft of a perf article I posted here.

May 16, 2014 at 12:20 PM
Are red nodes cached?
May 16, 2014 at 3:26 PM
Red nodes are remembered by their parent red nodes in most circumstances, and the root is held by the syntax tree, so they are shared by all users of the same tree instance, but they are not otherwise cached. In a few places, like method body blocks, red nodes may be cached weakly, so they will be garbage collected if not continually referenced. This is a performance optimization and cannot be observed through normal means. Each time you type a keystroke in the VS editor, most of the green nodes in that document's tree are reused, but the red nodes are new. This does cause GC pressure, but using access patterns that focus in on tokens and the surrounding nodes can avoid most of these allocations.
Marked as answer by VSadov on 6/25/2014 at 3:45 PM
Jun 25, 2014 at 10:45 PM
== Weak red subtrees.

An interesting thing to note about red nodes is that they have reference identity. The same piece of syntax in a given tree is observably represented by the same unique red node object. Example: if one goes from the leftmost child of some node up to the node via the parent reference and then descends from the parent into its leftmost child, one gets exactly the same node one started from. This is a very important quality of red nodes. Equality of nodes is a very cheap reference comparison.
The implementation is obvious - parents store references to children that have already been created and just return them if those children are needed again. A side effect of this implementation is that it makes the red tree a fully connected graph (you can get from any node to any other). From the perspective of the GC, as long as any node in the tree is reachable, the whole red tree is retained. Once nodes are created, they live as long as the whole tree. That is a bit inconvenient. Imagine the user scrolling through a code file - the colorizer will materialize the nodes for the visible syntax, but after the view moves, the nodes become invisible and are no longer interesting for most kinds of analysis.

Note that all the information that we put in the red tree comes from the green nodes. Even node positions are actually inferred from the green node widths and the tree topology. The process of red tree creation can be seen as a form of decompression. That allows us to make some of the parent->child links weak (as in WeakReference<T>). Weakly referenced red subtrees can be reclaimed by the GC if no one uses them and, if needed, "decompressed" again. Note that this does not violate the principle of reference identity. Newly constructed red nodes are indeed new objects, but since the old nodes representing the same syntax are no longer reachable, no one can notice the difference.

Obviously, we do not want to make all the child references in the tree weak. Weak references take space themselves, are not as fast as regular references, and require extra GC housekeeping. It has been observed that many kinds of analysis, including binding, are much more likely to require syntax within the same method where the analysis is performed, while reaching into other method bodies is rare. Based on this, Roslyn red trees refer to method body syntax weakly - if a particular method body is no longer used, the GC can put its red nodes to better use. Typically, at any given time only some methods in a large file will have materialized red trees, which significantly reduces the average heap size.
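The weak-link-with-rebuild idea can be sketched as follows. This is an illustration in Python (using `weakref` in place of WeakReference&lt;T&gt;), not Roslyn's implementation; the `MethodRed` class and `body` method are made up. The parent holds the method body only weakly and "decompresses" it from the green node again if the GC has collected it:

```python
# Sketch of a weakly referenced red subtree: the parent keeps a weak ref
# to the method-body child and rebuilds it from green data when needed.
import weakref

class Green:
    def __init__(self, kind, children=()):
        self.kind, self.children = kind, children

class Red:
    def __init__(self, green, parent):
        self.green, self.parent = green, parent

class MethodRed(Red):
    def __init__(self, green, parent):
        super().__init__(green, parent)
        self._weak_body = lambda: None       # nothing materialized yet

    def body(self):
        node = self._weak_body()
        if node is None:                     # collected, or never built:
            node = Red(self.green.children[0], self)  # rebuild from green
            self._weak_body = weakref.ref(node)
        return node

method = MethodRed(Green("method", (Green("body"),)), parent=None)
b = method.body()
print(b is method.body())  # True while someone still holds a reference
```

While any consumer holds the body, reference identity is preserved; once nobody does, the GC may reclaim the subtree, and a later call simply rebuilds an equivalent one that no caller can distinguish from the old.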
Marked as answer by VSadov on 6/25/2014 at 3:45 PM
May 13, 2015 at 1:23 PM
Edited Oct 30, 2015 at 3:27 PM
Hi everyone!

I wrote down all the knowledge I've collected regarding performance considerations in Roslyn. Maybe someone would like to review it? :) That would be great!

Inside the .NET Compiler Platform - Performance Considerations during Syntax Analysis (#SpeakRoslyn)

Thanks a lot!