Resolved

VB and C# compilers fail to reject non-text files on the CHS locale

description

  1. Open File Explorer and type “Control Panel” in the location bar
  2. Under Clock, Language, and Region, click Change date, time, or number formats
  3. Click the Administrative tab
  4. Click Change system locale…
  5. Select Chinese (Simplified, China)
  6. Restart the machine
  7. Run rvbc or rcsc on a binary file (for example, a .dll)
Expected:
As in the ENU locale, the compilers reject the input file with a message similar to

error CS2015: '<file full path>' is a binary file instead of a text file

Actual:
Hundreds of compiler errors are reported, most of them

error CS1056: Unexpected character '
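
For reference, a minimal sketch of the repro without the test suite (hypothetical file name; assumes csc.exe is on PATH, with rcsc/rvbc as the Roslyn equivalents): write a small file containing consecutive NUL bytes and feed it to the command-line compiler, which should reject it with CS2015 rather than try to lex it.

using System.Diagnostics;
using System.IO;

class Repro
{
    static void Main()
    {
        // Two consecutive 0x00 bytes are what the binary-file heuristic
        // looks for (see the analysis in the comments below).
        File.WriteAllBytes("fake.dll", new byte[] { 0x4D, 0x5A, 0x00, 0x00, 0x90 });

        // Assumes csc.exe is on PATH; substitute rcsc/rvbc for Roslyn.
        Process.Start("csc", "fake.dll").WaitForExit();
    }
}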

Note:
Microsoft.CodeAnalysis.VisualBasic.UnitTests.CommandLineTests.BinaryFile() fails because of this. The test uses a rather large binary file, which produces more than 1,000,000 diagnostics; StringBuilder could not handle that many and threw an OutOfMemoryException instead.

Note 2:
Once this issue is properly fixed, be sure to update the StringTextTests_Default.GetTextDiacritic test accordingly: instead of relying on Windows-1252, pass in any extended-ASCII character in the current encoding, not specifically a diacritic one.
Also fix EncodedStringTextTests.Decode_NonUtf8 to rely on Encoding.Default instead of Windows-1252.
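
A rough sketch of the intended direction for those tests (test name as above, body approximate and hypothetical): derive the expected string from the current system default encoding instead of hard-coding Windows-1252.

using System.Text;

static class DecodeTestSketch
{
    // Sketch only: compute the expectation from Encoding.Default so the
    // assertion holds on any system locale. Any extended-ASCII byte will
    // do; it need not be a diacritic.
    static void Decode_NonUtf8_Sketch()
    {
        byte[] bytes = { 0xA9 };  // hypothetical input byte
        string expected = Encoding.Default.GetString(bytes);
        // ...decode `bytes` through EncodedStringText and assert the result
        // equals `expected` rather than Windows-1252's U+00A9.
    }
}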

comments

jeremymeng wrote Mar 25, 2014 at 7:50 PM

I haven't checked other locales. The native compilers work fine on the CHS locale, although they issue a different error message.

astasyuk wrote Apr 1, 2014 at 1:39 AM

The cause of the out-of-memory issue is likely the following:

EncodedStringText.DetectEncodingAndDecode(), which is called to read source files passed as arguments to the command-line compiler, falls back to Encoding.Default (which depends on the system locale) if no byte order mark is found and an attempt to decode the file as UTF-8 fails.

Then there is an explicit check: if (encoding.CodePage == 1252), which is the default ENU encoding, run DecodeIfNotBinary(); otherwise run Decode().
DecodeIfNotBinary() differs from its counterpart by checking whether the file contains two consecutive NUL characters (0x00); if it does, it stops trying to decode the file and yields a single error saying the file is not in the correct format.
Decode(), on the other hand, which runs for all non-ENU encodings, lacks this check and keeps decoding the file regardless of its content, yielding thousands of errors (Jeremy actually reports 1,000,000+ diagnostics) and leading to an OutOfMemoryException.
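
For clarity, here is a paraphrase of the NUL-NUL heuristic as described above; this is a sketch, not the actual Roslyn source.

using System.IO;
using System.Text;

static class BinaryFileCheck
{
    // Sketch of DecodeIfNotBinary()'s extra check as described above:
    // scan for two consecutive 0x00 bytes before committing to a full decode.
    public static string DecodeIfNotBinary(Stream stream, Encoding encoding)
    {
        int previous = -1, current;
        while ((current = stream.ReadByte()) != -1)
        {
            if (current == 0 && previous == 0)
                throw new InvalidDataException("binary file instead of a text file"); // surfaces as CS2015
            previous = current;
        }

        // No NUL-NUL pair found: decode the whole file in the given encoding.
        stream.Seek(0, SeekOrigin.Begin);
        using (var reader = new StreamReader(stream, encoding))
            return reader.ReadToEnd();
    }
}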

There are two issues here:
  1. We do not detect that the file may be binary for non-ENU encodings. This was probably done as a safe bet, in case the NUL-NUL combination is valid in some other encoding. After looking into different encodings, it appears that it is still safe to apply this heuristic to all encodings except UTF-16 and UTF-32 (see the note below).
  2. There is a more fundamental issue: there is no upper limit on the number of diagnostics we produce, which can lead to an OutOfMemoryException or, worse, pose a security threat. It is easy to craft a binary that bypasses the NUL-NUL heuristic and leads to an OOM compiler crash. We might want to consider a way to safeguard against that (like VS does, limiting Error List output to 1000 items, IIRC); a sketch of such a cap follows the note below.
Note: It appears that the only common place where 0x0000 could appear in a text file is UTF-16 or UTF-32 encoding, and only to encode the NUL character itself (U+0000), which is reasonably rare in text or code files. But Encoding.Default can never be UTF-16 or UTF-32, since no system locale has a Unicode-based default encoding.
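
A sketch of one possible safeguard for issue 2 (hypothetical type, not an existing Roslyn API): cap the number of diagnostics a single compilation will accumulate, in the spirit of the VS Error List's 1000-item limit.

using System.Collections.Generic;

// Hypothetical bounded bag; not an existing compiler type.
class BoundedDiagnosticBag
{
    private const int MaxDiagnostics = 1000;
    private readonly List<string> _diagnostics = new List<string>();

    // Returns false once the cap is hit so the caller can stop producing
    // further diagnostics instead of growing without bound.
    public bool Add(string diagnostic)
    {
        if (_diagnostics.Count >= MaxDiagnostics)
            return false;
        _diagnostics.Add(diagnostic);
        return true;
    }
}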

I’ve researched other encodings as well. The vast majority are ASCII-compatible, i.e. bytes in the range 0x00-0x7F are treated as single ASCII characters with a few tweaks, and, most importantly, 0x00 is always treated as a NUL character, so 0x0000 corresponds to a NUL-NUL sequence. This includes all legacy OEM code pages, all ISO-8859-* sub-standards, most Microsoft code pages (except 52936, the HZ encoding), and obviously UTF-8. The Asian encodings Shift-JIS, Big5, GB 2312, and GB 12345, and the various mappings of those, are also ASCII-compatible. The one interesting exception is the HZ encoding (CP 52936), which allows two input modes, one of them not ASCII-compatible; however, it seems the 0x0000 combination still cannot appear there, since code row/column numbering in GB 2312 and GB 12345 starts at 1.
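
A quick way to spot-check the NUL claim for a few of the code pages mentioned above (a sketch; on .NET Core/.NET 5+ this needs the System.Text.Encoding.CodePages package, while on the .NET Framework the legacy code pages are available directly):

using System;
using System.Text;

class NulSpotCheck
{
    static void Main()
    {
        // Required on .NET Core/.NET 5+ to expose the legacy code pages.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // 932 = Shift-JIS, 936 = GB2312, 950 = Big5, 1252 = Windows-1252
        foreach (int codePage in new[] { 932, 936, 950, 1252 })
        {
            string decoded = Encoding.GetEncoding(codePage).GetString(new byte[] { 0x00, 0x00 });
            // Expect True for every ASCII-compatible code page.
            Console.WriteLine("{0}: NUL-NUL = {1}", codePage, decoded == "\0\0");
        }
    }
}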