This project is read-only.

Pure values

Topics: C# Language Design, General
Dec 1, 2014 at 1:20 PM
Edited Dec 3, 2014 at 5:33 PM
I propose a way of explicitly declaring methods and classes as pure. Intuitively, when something is "pure" I mean that given a specific input, it always return the same output and it should not have any side effects.

For a method to be pure, it requires the following:
  • It may not set any property or field, except on local variables
  • It may not call a non-pure method (including getters and constructors)
  • It may not evaluate non-pure expressions
  • It must return a value (void-returning methods suggest they have side effects)
  • It may not have ref parameters (not sure about out parameters)
For a class to be pure, it requires the following:
  • All fields are read-only
  • All methods, including constructors and getters, are pure
  • It does not have setters
  • It must be a value-type? (at least it should be sealed)
What would be the advantage of doing this? It enables at least two mechanisms. First, an expression can (in theory) be evaluated at compile-time, and second, it gives a lot of safety such as thread-safety and testability.

Evaluating expressions at compile-time is an optimization thing. It allows us to have const values of custom classes, such as
public const Point Origin = new Point(0, 0);
which could be evaluated at compile-time if Point is a pure struct. Or,
public Point Origin { get { return new Point(0, 0); } }
would have an optimization where the same Point is always returned. There are many optimizations that can be made when we know for sure that a struct is immutable.


Example:
public pure struct Vector
{
    public double X { get; }
    public double Y { get; }
    public Vector(double x, double y)
    {
        X = x;
        Y = y;
    }
    public double Magnitude { get { return Math.Sqrt(X * X + Y * Y); } }
}
 // Now, we're allowed to do something like
 // public const SquareRootOfTwo = new Vector(1, 1).Magnitude;
Thoughts?
(ps. my first post here - some comments on my presentation is much appreciated)
Dec 1, 2014 at 3:29 PM
This is already present in CodeContracts. I wouldn't want "pure" without the rest of the CodeContracts concepts. That said, I'd very much like DbC to be baked into the language. There are other posts about doing that here.
Dec 1, 2014 at 5:39 PM
There's an attribute <Pure> that is used a lot in the Roslyn source, maybe it does the same.

Question Do thrown exceptions violate purity? Or don't because if you supply the same input values an exception will be thrown?
Dec 1, 2014 at 10:37 PM
I wasn't aware of those attributes. Thanks for letting me know, I'll look into it.

Regarding exceptions, I can think of several possible behaviors:
  1. Disallow exceptions in pure methods. This is as simple as disallowing throws from a pure method - since pure methods cannot call non-pure methods, they'll never get to a point where an exception could be thrown. This would mean that many common functions Math.Sqrt and integer division is non-pure, so it's probably not a good solution.
  2. Allow throwing exceptions like normal. This suggests we consider exceptions "pure" behavior, which feels wrong to me.
  3. Allow exception classes to be pure. Basically, ensure that the same input always throws the exact same exception. Not sure what to do with the stack trace in this case, and it seems artificial.
  4. Pure methods must explicitly state which exceptions it will throw Similar to java. Then, an expression containing such a method is pure if it either catches the exception, or explicitly rethrows it. This seems a bit unlike C#, but I like the since it has that pure feeling about it. Catching the exception is of course not required if called from a non-pure method.
The important question here is, what is the primary use of pure methods? If it is to be able to evaluate expressions at compile-time, then option 2 seems like the most straightforward solution. Either the expression evaluates successfully, or it throws an exception which produces a compile-time error.

Thinking of pure expressions in terms of safety promises, I think case 4 makes a lot of sense. It would explicitly give you a compile-time warning/error if you call a pure method without catching or explicitly propagating any possible exceptions. Forcing you to do these things explicitly seems consistent with pure behavior.


A problem with compile-time evaluation is, what if the execution never ends?

public static pure int Sum(int n)
{
int sum = 0;
for (int i = 1; i <= n; i++)
    sum += i;
return sum;
}
const int = Sum(-4);

How could the compiler handle this? Maybe give a compile-time error/warning after a specified timeout?
Dec 1, 2014 at 10:50 PM
The ability of ReferenceEquals, ConditionalWeakTable, and related other such methods and classes to operate on references to any class object means that it's not possible for any code which exposes a reference to control what information might get associated with it. The biggest use I can see for a "pure" concept would be if it could provide a means for declaring value-type constants; it wouldn't matter whether the value type pretended to be "immutable"--what would matter would be that the compiler could ascertain at compile-time the value of all the value type's fields. That non-constant instances of the value types might have field values changed after construction would be irrelevant.

The biggest difficulty I see with allowing value-type initialization is that although it is possible to define a const of type String, and it would be possible to define a const of any non-generic value type which does not contain any reference-type or variable-sized fields, I don't think CIL would support a value-type constant which contained fields of type String. For a language to allow only those const expressions which it could in turn express in CIL would lead to a more confusing language design than simply limiting const expressions to a small set of legal types.
Dec 2, 2014 at 12:17 PM
MariusUtheim wrote:
For a method to be pure, it requires the following:
  • It may not set any property or field, except on local variables
  • It may not call a non-pure method
  • It may not evaluate non-pure expressions
  • It must return a value (void-returning methods suggest they have side effects)
  • It may not have ref parameters (not sure about out parameters)
For a class to be pure, it requires the following:
  • All fields are read-only
  • All methods, including constructors and getters, are pure
  • It does not have setters
  • It must be a value-type? (at least it should be sealed)
Your list needs to include calling getters as well, as the underlying value may not depend on mutable fields. (Ditto for accessing fields for reading.)
Dec 2, 2014 at 2:46 PM
@supercat:
That's a good argument for only allowing value types. Pure reference type objects might not be equal in a ReferenceEquals sense.
Perhaps a const string could be an exception to this rule? I don't like special cases, but String is already very exceptional in many ways.


@VladD:
You're right. When I wrote "may not call a non-pure method", I meant to include property getters and constructors. A pure method can still call a pure getter or a constructor of a pure type. I updated my original post to reflect this.
Dec 2, 2014 at 3:53 PM
Edited Dec 2, 2014 at 3:53 PM
Being able to set constant values at compile-time is something I really hope is doable using pure values. There are however a few issues here.

First problem is to define what happens if the computation of the value takes a long time. Do we want to make the compiler do all these computations? Compilation could take a much longer time, and what happens if it actually never halts at all?

Second, what should be stored in the const field? It could take the raw bits stored in the struct, but we want some metadata in there. Is it interesting to keep the actual line that was used to create the value?

What if we store a compile-time constant value of a pure type defined in another assembly, then that type is changed in that assembly? For example, suppose pure struct Point is defined in Foo.dll. In Bar.dll, we have the line const Point p = new Point(2, 3);. After Bar.dll is compiled, the internal structure of Point from Foo.dll is changed, without changing any of its public interfaces. This situation probably requires a recompilation of Bar.dll, which wouldn't be necessary if we had instead used readonly Point p = new Point(2, 3);.

More?
Dec 2, 2014 at 9:50 PM
MariusUtheim wrote:
First problem is to define what happens if the computation of the value takes a long time. Do we want to make the compiler do all these computations? Compilation could take a much longer time, and what happens if it actually never halts at all?
The language rules for C# are already NP-hard; rules that allow only primitive-recursive computations to be used with in "pure" functions would guarantee that they could be processed in a finite amount of time.
Second, what should be stored in the const field? It could take the raw bits stored in the struct, but we want some metadata in there. Is it interesting to keep the actual line that was used to create the value?
I think .NET just keeps the raw bits and is agnostic to their content, which brings up an important point you notice below:
What if we store a compile-time constant value of a pure type defined in another assembly, then that type is changed in that assembly? For example, suppose pure struct Point is defined in Foo.dll. In Bar.dll, we have the line const Point p = new Point(2, 3);. After Bar.dll is compiled, the internal structure of Point from Foo.dll is changed, without changing any of its public interfaces. This situation probably requires a recompilation of Bar.dll, which wouldn't be necessary if we had instead used readonly Point p = new Point(2, 3);.
There are already issues with compilation order if an assembly makes use of a constant defined in another assembly and the latter assembly gets modified. Having the structure layout change could make things worse, however, since changing the layout could totally change the meaning of the pattern of bytes associated with the constant. Perhaps code which is going to define a structure-type constant could could require that the structure itself use a StructLayout tag to overlay a structure which contained all the primitives that were present in the outer structure, in their corresponding places, and had a name which was formed as a hash of the structure layout? In that case, a change to the layout of the structure would "cleanly" break code that would rely upon its layout (since the new layout would imply a new name for the overlaid structure) until such time as the code was recompiled.
Dec 3, 2014 at 5:32 PM
supercat wrote:
The language rules for C# are already NP-hard; rules that allow only primitive-recursive computations to be used with in "pure" functions would guarantee that they could be processed in a finite amount of time.
That's interesting. I'm not an expert on computability, but would this requirement possibly be very restricting? Either way, there is still a problem here: even if it's definitely finite-time, it could still be unreasonably long time.
Using a StructLayout doesn't feel right. If we demand an attribute, the feature loses some simplicity and purity since we now always have to refer to the correct namespaces and assemblies.
Since this feature is still quite far in the future, I'd like to make a more hypothetical suggestion. It seems conceiveable that we're gonna get record types before pure structs. Suppose we say that all pure structs must be record types (or have some similar kind of representation). For example,
pure struct Vector(double X, double Y)
{
    public double Magnitude { get { return Math.Sqrt(X * X + Y * Y); } }
}
Here, we implicitly say that X and Y are readonly. If no other fields are allowed, then this declaration always gives the same binary representation of the struct. It is impossible to change that representation without changing the public surface of the type. This also lends itself to a "canonical" representation of a struct that can be used as metadata.

Off the top if my head, I can't think of any issues introduced here that aren't already present in the current way things work. In normal usage, I don't see pure structs as having very complex structures anyway, and this seems consistent with that notion.
Dec 4, 2014 at 4:20 AM
If a structure semantically represents a fixed-sized collection of related but independent values, such as the coordinates of a vector, those values should be exposed public fields. While there are some cases where it can be useful to have a structure pretend to be an object, and structures that are pretending to be objects should have immutable fields, the notion that in someone who wants a few variables stuck together with duct tape one should have a structures try to imitate objects which act as a poor substitute for a bunch of variables stuck together with duct tape, rather having the structures act like bunches of variables stuck together with duct tape, is IMHO counterproductive and unfortunate. I see no reason for a language to encourage such an unfortunate pattern.