Escaping Invalid XML Unicode characters

Hi folks

Recently I discovered a bug in NUnit.

The issue is that NUnit may create an XmlDocument containing Unicode characters that are not valid in XML.

To fix it, we need to either strip those characters or escape them.

According to the XML spec, the only valid XML characters are:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
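As a reference point for the regexes that follow, the same ranges can be written as a plain check on a Unicode code point. This is only a sketch; the class and method names are made up for illustration.

```csharp
using System;

public static class XmlCharRanges
{
    // True when the code point falls in one of the ranges quoted from the XML spec above.
    public static bool IsValidXmlCodePoint(int codePoint)
    {
        return codePoint == 0x9
            || codePoint == 0xA
            || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

Note that this works on code points, while a .NET string is a sequence of UTF-16 code units, which is exactly where the regex gets tricky.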

Let’s construct a Regex to replace invalid XML characters.

First naive approach

var invalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]");

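This won’t work, because \U00010000-\U0010ffff is represented as UTF-16 surrogate pairs and is equivalent to \ud800\udc00-\udbff\udfff. Inside a character class that reads as \ud800, then the range \udc00-\udbff, then \udfff. The range is in reverse order, so the Regex is invalid.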
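A small way to see this for yourself is to let the Regex constructor try the pattern and catch the failure. The helper below is hypothetical, and it assumes the parse error surfaces as an ArgumentException (or a subclass of it).

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveRegexDemo
{
    // Returns true when the Regex constructor rejects the pattern.
    public static bool FailsToParse(string pattern)
    {
        try
        {
            new Regex(pattern);
            return false;
        }
        catch (ArgumentException)
        {
            // Regex parse errors are reported as ArgumentException (or a subclass).
            return true;
        }
    }
}
```

Calling NaiveRegexDemo.FailsToParse with the naive pattern above should return true, while a well-formed pattern such as "[\ud800-\udbff][\udc00-\udfff]" should return false.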

All characters \U00010000-\U0010ffff (Supplementary Planes) can be described as a Regex:

var supplementaryPlanesRegex = new Regex("[\ud800-\udbff][\udc00-\udfff]");

According to the list of valid characters shown above, [#xD800-#xDFFF] are invalid XML characters. Taking into account Supplementary Planes, this means that we are interested in surrogate characters that don’t form a valid surrogate pair.

In my previous blogpost I described a Regex to match such characters.

var invalidCharactersRegex = new Regex("([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");
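To illustrate what this regex does and does not match, here is a minimal sketch that counts lone surrogates in a string. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LoneSurrogateDemo
{
    // A high surrogate not followed by a low one, or a low surrogate not preceded by a high one.
    static readonly Regex LoneSurrogate = new Regex(
        "([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

    public static int CountLoneSurrogates(string s)
    {
        return LoneSurrogate.Matches(s).Count;
    }
}
```

For example, "a\ud800\udc00b" contains a well-formed pair and gives 0, "a\ud800b" gives 1, and the swapped pair "\udc00\ud800" gives 2 because neither half has a proper partner.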

Second naive approach

var invalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

This forms a valid Regex, but it won’t work correctly. It matches a string containing the valid Unicode code point “\U00010000”, which is equivalent to “\ud800\udc00”. The reason is that the first part of the Regex matches each surrogate on its own, because surrogates are not among its allowed ranges. We need to prevent that by adding the surrogate range \ud800-\udfff to the first part, so that surrogates are judged only by the other two alternatives.
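The false positive is easy to reproduce. The sketch below (hypothetical names) counts how many matches the second approach reports for a given string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SecondApproachDemo
{
    // The second naive pattern: surrogates are missing from the negated class.
    static readonly Regex SecondApproach = new Regex(
        "[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

    public static int CountMatches(string s)
    {
        return SecondApproach.Matches(s).Count;
    }
}
```

SecondApproachDemo.CountMatches("\U00010000") reports 2 matches, one for each half of the pair, even though the code point is valid XML.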

Third approach

var invalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ud7ff\ud800-\udfff\ue000-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

We can simplify it a bit by combining \u0020-\ud7ff\ud800-\udfff\ue000-\ufffd into the single range \u0020-\ufffd.

Final approach

var invalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

I checked, and this Regex matches exactly the characters that the spec excludes and nothing else.
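One way such a check could be done is by brute force over every UTF-16 code unit, plus a few surrogate pairs. This is a sketch under my own assumptions about what "checked" means, not necessarily the original test; all names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InvalidXmlRegexCheck
{
    static readonly Regex Invalid = new Regex(
        "[^\u0009\u000a\u000d\u0020-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

    // Expected verdict for a code unit standing alone in a string:
    // a lone surrogate is always invalid, everything else follows the spec ranges.
    static bool IsInvalidAlone(char c)
    {
        if (char.IsSurrogate(c)) return true;
        return !(c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xFFFD));
    }

    // Number of single code units on which the regex disagrees with the spec.
    public static int CountMismatches()
    {
        int mismatches = 0;
        for (int i = 0; i <= 0xFFFF; i++)
        {
            char c = (char)i;
            if (Invalid.IsMatch(c.ToString()) != IsInvalidAlone(c)) mismatches++;
        }
        return mismatches;
    }

    // True when the regex flags anything in the two-unit string high + low.
    public static bool FlagsPair(char high, char low)
    {
        return Invalid.IsMatch(new string(new[] { high, low }));
    }
}
```

CountMismatches() should return 0, and FlagsPair should return false for well-formed pairs such as ('\ud800', '\udc00') and ('\udbff', '\udfff').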

And here is the final version of the desired methods

        // Requires: using System.Text.RegularExpressions;
        // Built once and shared by both methods.
        static readonly Regex InvalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

        public static string StripInvalidXmlCharacters(string str)
        {
            return InvalidXmlCharactersRegex.Replace(str, "");
        }

        public static string EscapeInvalidXmlCharacters(string str)
        {
            // Every match is a single UTF-16 code unit, so Value[0] is the whole match.
            return InvalidXmlCharactersRegex.Replace(str, match => CharToUnicodeSequence(match.Value[0]));
        }

        static string CharToUnicodeSequence(char symbol)
        {
            return string.Format("\\u{0}", ((int) symbol).ToString("x4"));
        }
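Here is a short usage sketch. To keep it self-contained it wraps compact copies of the two methods in a hypothetical class; the input string is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlEscapingExample
{
    static readonly Regex Invalid = new Regex(
        "[^\u0009\u000a\u000d\u0020-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

    // Same behaviour as StripInvalidXmlCharacters above.
    public static string Strip(string s)
    {
        return Invalid.Replace(s, "");
    }

    // Same behaviour as EscapeInvalidXmlCharacters above.
    public static string Escape(string s)
    {
        return Invalid.Replace(s, m => "\\u" + ((int)m.Value[0]).ToString("x4"));
    }
}
```

For the input "a\u0000b\ud800c\U0001F6ADd", Strip gives "abc\U0001F6ADd", and Escape gives the text a\u0000b\ud800c followed by the untouched U+1F6AD character and d. Keep in mind that the escaped form cannot be told apart from input that already contained a literal backslash followed by u, so it is not reversible as is.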

UPD: As I was asked in a comment, here is a positive regex for valid XML characters.

My first, incorrect attempt was to simply negate the invalidXmlCharactersRegex by replacing the negated character class [^…] with a positive one […], negative lookahead (?!…) with positive lookahead (?=…), and negative lookbehind (?<!…) with positive lookbehind (?<=…)

var validXmlCharactersRegex = new Regex("[\u0009\u000a\u000d\u0020-\ufffd]|([\ud800-\udbff](?=[\udc00-\udfff]))|((?<=[\ud800-\udbff])[\udc00-\udfff])");

But this is wrong, because \u0020-\ufffd includes the surrogate characters, so it produces a false positive on the string

string badString = "\ud800";

Here is the correct version of the regex

var validXmlCharactersRegex = new Regex("[\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|([\ud800-\udbff](?=[\udc00-\udfff]))|((?<=[\ud800-\udbff])[\udc00-\udfff])");

UPD2: As I was asked in a comment, we can simplify the regex if we don’t need each half of a surrogate pair as a separate match.

var validXmlCharactersRegex = new Regex("[\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|[\ud800-\udbff][\udc00-\udfff]");
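The difference between the two positive regexes shows up in the match count. The sketch below compares them, and also adds an anchored variant for validating a whole string; that last one is my own assumption about how the pattern might be used, and all names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ValidXmlRegexDemo
{
    // Each half of a surrogate pair is reported as its own match.
    static readonly Regex PerCodeUnit = new Regex(
        "[\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|([\ud800-\udbff](?=[\udc00-\udfff]))|((?<=[\ud800-\udbff])[\udc00-\udfff])");

    // A surrogate pair is reported as one two-character match.
    static readonly Regex PerCodePoint = new Regex(
        "[\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|[\ud800-\udbff][\udc00-\udfff]");

    // Assumption: the simplified pattern anchored with \A and \z to cover the whole string.
    static readonly Regex WholeString = new Regex(
        "\\A(?:[\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd]|[\ud800-\udbff][\udc00-\udfff])*\\z");

    public static int CountPerCodeUnit(string s)
    {
        return PerCodeUnit.Matches(s).Count;
    }

    public static int CountPerCodePoint(string s)
    {
        return PerCodePoint.Matches(s).Count;
    }

    public static bool IsValidXmlString(string s)
    {
        return WholeString.IsMatch(s);
    }
}
```

For "a\U0001F6ADb" the first regex gives 4 matches and the simplified one gives 3. IsValidXmlString returns true for that string and false for "\ud800".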

About mnaoumov

Senior .NET Developer in Readify

24 Responses to Escaping Invalid XML Unicode characters

  1. yetibrain says:

    i need some help to use this regex the other way round. Not to find the invalid codepoints but in a positive way. I want to find all codepoints in a string, using #x9 | #xA | #xD and then start from #x20 up to the entire codepoint range, but exclude orphaned high- or low-surrogates and allow just surrogate-pairs. As i am not very familiar with regex i don’t have a clue on how to re-write your regex so that it becomes “inverted”. Any help highly appreciated.


  2. yetibrain says:

    Thanks a lot mnaoumov, for the reverse, the matching, positive expression. So i understand that first, we want the range up to the start of the high-surrogates, skip them, skip the low-surrogates as well and then finally let match surrogate-pairs. But why must we state a high-surrogate, followed by low-surrogate and also a low-surrogate preceded by a high-surrogate? Isn’t this redundant?

    • mnaoumov says:

      Hi yetibrain. That’s because, as I understood it, you wanted a regex which matches individual codepoints. If you are ok with getting a surrogate pair as a single two-character match, I’ll prepare a new simpler regex for you 🙂

  3. yetibrain says:

    Hi mnaoumov, no thanks, i think your regex is ok the way it is. I wanted to test it but i didn’t have access when trying to use it as a substitute for \w+. Possibly this doesn’t work because my expression already has a matching prefix as well as matching suffix. Possibly nesting doesn’t work.
    What i try to do is somehow easy, but the following regex only works for non-unicode characters:
    (?<=FUNCTION )\s*\"(?[\w+\s*]+)\”\s*\:\s\w+\s*((?=\W$)|\z)
    I tried to use your expression instead of [\w+\s*]+ but i had no success. The regex should just capture names of FCs within SCL code (PLC programming language). The SCL code looks like this:
    FUNCTION “Timer_OC” : Void
    The expression (?<=FUNCTION )\s*\"(?[\w+\s*]+)\”\s*\:\s\w+\s*((?=\W$)|\z) catches the Timer_OC but if the input is for example something like:
    FUNCTION “Timer_€_OC” : Void

    then it doesn’t work, because of the unicode character €. This is because of the \w that catches no unicode characters.

  4. yetibrain says:

    Sorry, i didn’t mean access i meant success

  5. yetibrain says:

    B.t.w., i have used a named capture for the name of the FC like this:
    (?<=FUNCTION )\s*\"(?[\w+\s*]+)\”\s*\:\s\w+\s*((?=\W$)|\z)
    The only thing i need is a substitute for \w , something that matches to all unicode codepoints except orphaned surrogates or other non-character codepoints.

  6. yetibrain says:

    Isn't ([\ud800-\udbff](?=[\udc00-\udfff])) enough to catch hi/lo surrogate pairs? The above expression seems to be redundant. The first of the two alternatives says, match a hi-surrogate followed by a lo-surrogate and the second alternative says match a lo-surrogate preceded by a hi-surrogate.

    • mnaoumov says:

      As I said before, if you want to capture individual UTF-16 code units, you will need to use the regex that captures the hi and lo surrogate parts separately. If you just want to capture them as one match, you’ll need the much simpler [\ud800-\udbff][\udc00-\udfff], without any lookaheads and lookbehinds

      The one that you suggested will capture only the hi surrogate and skips the low surrogate.

      • yetibrain says:

        hi mnaoumov, you are absolutely right. It confused me because the original regex found all the codepoints that are invalid and then of course, all orphaned single surrogates or pairs where hi- and lo-surrogates are swapped are invalid. The valid pairs are expressed as easy as in your posting. Thanks again!

      • yetibrain says:

        Thanks for your help. So i just need [\ud800-\udbff][\udc00-\udfff] to capture a surrogate pair, this i understand. But don’t i have to add a + in order to capture not just one pair but as many as there are in the string?

      • mnaoumov says:

        Hi yetibrain. If you add +:

        You will just find all consecutive occurrences of such pairs. So if you have the string

        string s = "a\ud800\udc00\ud801\udc01b\ud802\udc02c";

        the regex you suggested will capture separately
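A sketch of this comparison in code, assuming the suggested pattern is the pair regex with a + applied to the whole pair (the exact pattern is not shown above). The class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PairQuantifierDemo
{
    // One surrogate pair per match.
    static readonly Regex SinglePair = new Regex("[\ud800-\udbff][\udc00-\udfff]");

    // Assumed form of the "+" variant: a run of adjacent pairs per match.
    static readonly Regex PairRun = new Regex("(?:[\ud800-\udbff][\udc00-\udfff])+");

    // Lengths (in UTF-16 code units) of the matches found without the quantifier.
    public static int[] SinglePairLengths(string s)
    {
        return SinglePair.Matches(s).Cast<Match>().Select(m => m.Length).ToArray();
    }

    // Lengths (in UTF-16 code units) of the matches found with the quantifier.
    public static int[] PairRunLengths(string s)
    {
        return PairRun.Matches(s).Cast<Match>().Select(m => m.Length).ToArray();
    }
}
```

For the string s above, the version without + gives three matches of length 2, while the version with + gives two matches: the run "\ud800\udc00\ud801\udc01" of length 4 and the separate pair "\ud802\udc02" of length 2.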

  7. yetibrain says:

    When i use the positive expression:
    with Expresso and place a symbol of a higher plane into the input, in my case i have chosen the non-smoking-symbol with codepoint 128685 , then Expresso crashes! 😦

    • mnaoumov says:

      Hi yetibrain. Yes, that’s actually an interesting one. Expresso failed for me as well. When I attached a debugger I could even see why

      An unhandled exception of type ‘System.ArgumentException’ occurred in mscorlib.dll

      Additional information: Found a high surrogate char without a following low surrogate at index: 0. The input may not be in this encoding, or may not contain valid Unicode (UTF-16) characters.

      Stack Trace:

      mscorlib.dll!char.ConvertToUtf32(string s, int index)
      RegDecoder.dll!RegDecoder.Utility.DisplayAllASCIICharacters(string text)
      Expresso.exe!Ultrapico.Expresso.MainForm.CreateTreeNodes(System.Text.RegularExpressions.Match[] matches)

      Obviously it is a bug in Expresso, because it is failing when it tries to display hi surrogate separately.

      It will fail even if you use just the simple regex dot . for the same character


  8. yetibrain says:

    Hi mnaumov,
    concerning Expresso, did you post the beyond BMP codepoint character from a utf-8 encoded file? Well, i did so. In utf-8, the character i’ve mentioned is a 4-byte sequence. Do you think it’s possible that when copied to the clipboard and pasted into Expresso, the re-encoding from utf-8 to utf-16 doesn’t take place?

  9. Peter Brightman says:

    I got an answer from Jim Hollenhorst, he says that .NET regex class does only support utf-16 characters?! I mean, this is not a bug in .NET regex class, because i use it successfully (thanks for your regex concerning surrogate pairs) within a windows forms application. The matches return a .NET string and there are surrogate pairs within the string, only i cannot display the glyph because the non-smoking-symbol needs a special symbolica font. I believe, this must be a bug in expresso. What do you think?

    • mnaoumov says:

      Hi Peter. It’s an issue with Expresso where they are using char.ConvertToUtf32 method, which fails if hi surrogate is not followed by low surrogate. I think they just need to handle this hi/lo surrogate characters separately.

  10. Thanks for your work on this. I found it interesting that your Regex string isn’t a verbatim string literal, i.e. prefixed with @”…” as Regexes usually are to handle “\”, but it still works as the Regex gets parsed the same way.
