Get back original nodes

Topics: APIs
May 9, 2014 at 4:24 PM
I'm not sure how feasible this would be, but it would be really nice if we could mark an original tree as 'trackable' somehow, modify it one or more times, and then be able to ask new nodes in the final tree if they came from the original tree, and if so, from where (either by getting back the original span or by getting a reference to the original node).

With the current node tracking functionality, you need to know up front which original nodes you want to track and then ask for them later in the modified tree. Technically, this could be used to track in the other direction: by tracking every original node, then iterating over all original nodes to find which one corresponds to a particular node in the current tree, but it would probably be hugely inefficient in terms of both memory and performance.
Developer
May 12, 2014 at 5:51 AM
I agree, it would be really nice if you could just mark the whole tree, or better yet not have to do anything up front and just have the tracking API figure it out based on the changes. However, the current API is the closest we've been able to achieve so far.
May 12, 2014 at 4:35 PM
Edited May 12, 2014 at 4:47 PM
Having tracking for free always would be great. I didn't even think of that, but obviously it would be more difficult.

In the absence of this feature, I have had to roll my own, but it would be great if it could be baked into Roslyn itself, so I'll describe what I did, and if the team feels my solution is correct and minimally impactful to other areas of the code/functionality/performance/etc., please consider adding something like this.

Also, my solution is currently only implemented for nodes (you can't ask tokens or trivia for their original location), but I think this solution could easily extend to those as well. Also, my solution allows you to mix parts of different syntax trees and get back the original span and file name from which a node came, and then I keep a map from original file name to tree so I can get back original trees, but this would probably be better as a guid or some other mapping I would think.

So basically, I have four possible annotations that can be added to a node:
  • Synthesized (data: none) – a node was newly created during modifications
  • OriginalLocationInfo (data: <original span start>;<original file name>) – a node was in the original tree with the specified file name, and its span started at the absolute position in the annotation data
  • StartOffset (data: <span start offset>) – a node's span start was offset by the specified amount
  • LengthOffset (data: <span length offset>) – a node's span length was changed by the specified amount
To determine the original position of a node, N:
  • Start with the node and walk up its parent chain, totaling up the StartOffset values to get a total offset (totalOffset) since the original tree.
  • If during the walk, a node is hit that contains a Synthesized annotation, stop because N is also synthesized and therefore has no original position.
  • If during the walk, a node, A, is hit that contains an OriginalLocationInfo annotation, stop because N was a child of the original ancestor node. Take the difference of N.SpanStart and A.SpanStart to get N's current relative offset to A. Subtract from that the current totalOffset value to get N's original offset from A. Add to that A's original position as specified in the OriginalLocationInfo annotation to get N's original position. And A's original file specified in the OriginalLocationInfo is the original file name of N.
Once it is determined that a node is not synthesized, its original span length can be determined by getting the LengthOffset annotation on N, if any, and subtracting that from N's current length. This coupled with the original span start and file name gives the original span and file name of the node.

So this describes how to utilize the annotations to determine original information, and the following describes how to actually annotate the nodes during modifications (to actually implement this part, I needed to auto-generate wrappers for all nodes and the syntax factory, since I need to be involved in node creation, and in the Add…, With…, and Update calls on each node, but I'm hoping I won't have to do this anymore if this functionality could be included in Roslyn):

First, I have some extension methods defined for nodes:
  • GetOriginalLocationInfo() – This will perform the walk described above and determine if a node is synthesized (or within a completely synthesized sub-tree) or if the node came from an original location. It returns a Tuple<TextSpan,string> for modified nodes, and null for synthesized nodes.
  • CaptureOriginalLocationInfo(Tuple<TextSpan,string> info) – Can take the result of GetOriginalLocationInfo() and if it is determined a node is synthesized it will return a cloned node marked with the Synthesized annotation. If, however, it is determined that the node came from an original location, it will return a cloned node marked with an OriginalLocationInfo annotation describing the original location of the node.
  • CaptureOriginalLocationInfo() – Defined as return node.CaptureOriginalLocationInfo(node.GetOriginalLocationInfo()). This ensures that the walk can be done on the resulting node even if it no longer has the same ancestors as it has previously.
Then in the wrapper factory, I use CaptureOriginalLocationInfo() on all arguments before passing them to the built-in SyntaxFactory, and I mark the return value with the Synthesized annotation before returning it, like so:
public static TrackedIfStatementSyntax IfStatement(TrackedExpressionSyntax condition, TrackedStatementSyntax statement)
{
    return 
        SyntaxFactory.IfStatement(
            condition.Unwrap().CaptureOriginalLocationInfo(), 
            statement.Unwrap().CaptureOriginalLocationInfo())
        .MarkSynthesized();
}
As for the Add…, With…, and Update methods, all of the actual annotating happens in Update since Add… and With… methods funnel to it. Here is the process for annotating in Update:
  • If all children passed to Update are the same, just return this.
  • Otherwise, declare a local named currentOffset and set it to 0.
  • Walk over all children and do the following:
    • For tokens, increment currentOffset by the new token's length minus the old token's length
    • For nodes,
      • If the new child node (newChild) is not null and not the same instance as the old child node (oldChild), set newChild to newChild.CaptureOriginalLocationInfo().
      • Increment currentOffset by the difference in leading trivia lengths from new to old (using 0 lengths for null nodes).
      • If currentOffset is not 0 and newChild is not null, set a StartOffset annotation on newChild with the value of currentOffset. If a StartOffset annotation is already present, remove it and set a new StartOffset annotation to the value of currentOffset plus the old StartOffset value.
      • Increment currentOffset by the difference in span lengths from new to old (using 0 lengths for null nodes).
      • Increment currentOffset by the difference in trailing trivia lengths from new to old (using 0 lengths for null nodes).
    • I will not describe the handling of lists because this is already a pretty long post, but they are handled similarly when they are created and then I just walk over them here and then set (or add to) the StartOffset annotations with the value of currentOffset.
  • Pass all new children (which may have been modified from their original parameter values) to the underlying node's Update(…) method to get the updated node (newNode).
  • Take the difference in span lengths of newNode and the original node and if it is not 0, set it on newNode as a LengthOffset annotation (or add to an existing one).
  • return newNode.CaptureOriginalLocationInfo(oldNode.GetOriginalLocationInfo())
By this process, you can always get back the original span and file name of a node and it only requires annotations on the ancestors, subsequent siblings, and subsequent siblings of ancestors of modified nodes.

I know this process probably has some bugs since I haven't tried out all the edge cases or even tried modifying trivia using this approach, but I think it might be a good starting point.