While we are on the topic of green trees, the general mechanics of red/green trees can be found here:
I do not think I can explain that any better.
There are, however, some additional implementation details that are interesting from a performance perspective.
=== Deduplication of green nonterminals
As you may notice from the descriptions and from the actual implementation, green nodes are immutable and self-contained. In particular, a green node does not know its position in the text and does not know its parent. As a result, the same green node for "2 - 2 " can represent any occurrence of "2 - 2 " in the program. As long as a piece of code parses into an identically shaped green sub-tree, why not just use the same sub-tree?
That is achieved in the green node factories: before creating a green node for "2 - 2 ", the factory will check whether it has already seen a binary expression with the same children "2 ", "- ", "2 ", and if the answer is yes, it will just return the cached node. One can see it as a kind of memoization.
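As a sketch of the idea, a memoizing factory could look roughly like this (hypothetical Python, not Roslyn's actual code; `GreenNode`, `GreenNodeFactory`, and `make` are invented names):

```python
class GreenNode:
    """Immutable green node: knows its kind and children,
    but not its position in text and not its parent."""
    __slots__ = ("kind", "children")

    def __init__(self, kind, children=()):
        self.kind = kind
        self.children = tuple(children)

class GreenNodeFactory:
    """Memoizing factory: returns a previously created node when an
    identically shaped one (same kind, same children) was seen before."""

    def __init__(self):
        self._cache = {}

    def make(self, kind, *children):
        # Children are themselves deduplicated, so comparing their
        # identities is enough to decide that two nodes are the same.
        key = (kind, tuple(id(c) for c in children))
        node = self._cache.get(key)
        if node is None:
            node = GreenNode(kind, children)
            self._cache[key] = node
        return node

factory = GreenNodeFactory()
two = factory.make("2")
star = factory.make("*")
left = factory.make("binary", two, star, two)
right = factory.make("binary", two, star, two)
assert left is right  # the same sub-tree object is reused
```

The sketch uses an unbounded dictionary for simplicity; as described below, the real caches are bounded.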
When "2 * 2 - 2 * 2 " is parsed, the parser builds a tree like this:
           2 * 2 - 2 * 2
          /      |       \
     2 * 2       -        2 * 2
    /  |  \              /  |  \
   2   *   2            2   *   2
However, because of deduplication the actual green tree may look like
       2 * 2 - 2 * 2
        /    |    \
        \    -    /
         \       /
          2 * 2
         /  |  \
         \  *  /
          \   /
            2
Since the identity of green nodes is irrelevant, the second tree is indistinguishable from the first, but at the same time it is much more compact: 5 nodes instead of 10.
Note the "may" in "may look like". That is because deduplication is not a guaranteed feature. To guarantee deduplication we would need to cache all nodes ever seen, which would make the deduplicating caches unnecessarily big. Instead, the caches have a fixed size, with new entries preempting older ones. As a result, deduplication is not perfect, but because of the locality of syntax patterns in code it still works pretty well.
Another trade-off here is that we do not deduplicate nonterminals with more than 3 children. The more children you need to consider, the more expensive it gets to compare nodes and the more likely it is that something will not match. All that leads to diminishing returns in terms of cache hits when larger nodes are considered.
Since the most common syntax nodes (simple names, binary expressions, member accesses, typical argument lists...) have 3 children or fewer, handling just nodes with 3 or fewer children seems to provide the best benefit/cost ratio.
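A fixed-size cache with preemption and the child-count cut-off might be sketched like this (hypothetical Python; the direct-mapped layout, the sizes, and the names are assumptions, not Roslyn's actual data structure):

```python
CACHE_SIZE = 1 << 10   # power of two, so a bit-mask can replace modulo
MAX_CHILDREN = 3       # larger nodes are not worth deduplicating

class DedupCache:
    """Direct-mapped cache: every key hashes to exactly one slot,
    and a new entry simply preempts whatever occupied that slot."""

    def __init__(self):
        self._slots = [None] * CACHE_SIZE

    def get_or_add(self, kind, children, make_node):
        if len(children) > MAX_CHILDREN:
            return make_node(kind, children)      # too big: skip the cache
        # Children are already deduplicated, so identity stands in for equality.
        key = (kind, tuple(id(c) for c in children))
        idx = hash(key) & (CACHE_SIZE - 1)
        entry = self._slots[idx]
        if entry is not None and entry[0] == key:
            return entry[1]                       # hit: reuse the cached node
        node = make_node(kind, children)
        self._slots[idx] = (key, node)            # miss: preempt the older entry
        return node
```

Because each key owns exactly one slot, eviction costs nothing: a new entry overwrites the old one, which is exactly the "new entries preempting older ones" behavior described above.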
=== Deduplication of terminals
Typical, human-readable syntax trees are very shallow. Even in enormously large code files there seem to be natural limits on how deeply syntax can be nested. Perhaps because of that, a very large fraction of the nodes in a green tree are tokens: terminal nodes like punctuation, keywords, identifiers, and literals. They are also the ones with the most repetitions; once you use a variable "x" in code, you are likely to use it more than once.
That makes it even more important to dedupe tokens. The principle is the same as with nonterminals, though: there are fixed-size caches of tokens, used by the token factories or by the scanner directly, to avoid creating a new token when one can be reused from the cache. The difference is that most tokens do not have any child structure and are basically direct representations of text, so matching tokens typically means matching their textual values.
Another interesting difference in the deduping strategy is that there are two levels of caching for terminals.
L1 is a fast, relatively small, not thread-safe, local cache used exclusively by a single parsing session.
L2 is a slower, larger, thread-safe cache that is shared between the L1 caches. When an L1 cache experiences a miss, it looks in L2.
So, if a token for "System" is created while parsing aaa.cs, that same token may be reused when parsing bbb.cs.
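The two-level scheme might look roughly like this (a hypothetical Python sketch; the class names, sizes, and single-lock L2 are invented, and real tokens carry more than just their text):

```python
import threading

class Token:
    """Immutable terminal node; for simple tokens the text is the whole identity."""
    __slots__ = ("text",)
    def __init__(self, text):
        self.text = text

class L2TokenCache:
    """Slower, larger, thread-safe cache shared by all parsing sessions."""
    def __init__(self, size=1 << 16):
        self._size = size
        self._slots = [None] * size
        self._lock = threading.Lock()

    def get_or_add(self, text):
        idx = hash(text) & (self._size - 1)
        with self._lock:
            tok = self._slots[idx]
            if tok is not None and tok.text == text:
                return tok
            tok = Token(text)
            self._slots[idx] = tok       # preempt the older entry
            return tok

class L1TokenCache:
    """Fast, small, not thread-safe cache owned by one parsing session."""
    def __init__(self, l2, size=1 << 7):
        self._size = size
        self._slots = [None] * size
        self._l2 = l2

    def get_or_add(self, text):
        idx = hash(text) & (self._size - 1)
        tok = self._slots[idx]
        if tok is not None and tok.text == text:
            return tok                   # L1 hit: no locking needed
        tok = self._l2.get_or_add(text)  # L1 miss: consult the shared L2
        self._slots[idx] = tok
        return tok

# Two parsing sessions (think aaa.cs and bbb.cs) sharing one L2:
l2 = L2TokenCache()
session_a, session_b = L1TokenCache(l2), L1TokenCache(l2)
assert session_a.get_or_add("System") is session_b.get_or_add("System")
```

The common case hits the lock-free L1; only misses pay for the lock on the shared L2, which is what makes the split worthwhile.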
=== Deduplication of strings
There is clearly a trend here: immutable, self-contained objects are reusable, and when repetitions are common, it is hard to resist deduping them.
Strings used in code could be massively redundant. How many times would you find a string "int" or "System" in a typical C# solution?
The caching strategy here is similar to the caching of tokens: a two-layered L1/L2 cache.
The interesting difference is that the L2 cache is shared across languages and is also used by the metadata reader.
A string instance for "System" allocated while parsing VB code may end up being used by C# or when the namespace name "System" is read from metadata.