Stripping invalid characters from UTF-16 strings

Hi folks

The more you work with Unicode, the more discoveries you make.

.NET System.Char represents a character as a UTF-16 code unit.

UTF-16 has a concept of surrogates:

Code units from U+D800 to U+DBFF – lead surrogate aka first code unit aka high surrogate

Code units from U+DC00 to U+DFFF – tail surrogate aka second code unit aka low surrogate

To encode a valid Unicode code point, a lead surrogate must always be immediately followed by a tail surrogate, and a tail surrogate must always be preceded by a lead surrogate.

However, .NET does not enforce this rule. You can create a string that is not valid from the UTF-16 point of view.

For example

string s = "a\ud800b";

Here \ud800 is a lead surrogate, but it is followed by the letter b, which is not a low surrogate.
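
To make the pairing rule concrete, here is a minimal sketch that scans a string for the first unpaired surrogate. The class and method names (SurrogateCheck, FindFirstLoneSurrogate) are hypothetical and used only for illustration; char.IsHighSurrogate and char.IsLowSurrogate are the standard .NET checks for the two ranges above.

```csharp
using System;

public static class SurrogateCheck
{
    // Hypothetical helper: returns the index of the first unpaired surrogate,
    // or -1 if every surrogate in the string is part of a valid pair.
    public static int FindFirstLoneSurrogate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]))
            {
                // A lead surrogate must be immediately followed by a tail surrogate.
                if (i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
                {
                    i++; // skip the second half of the valid pair
                    continue;
                }
                return i;
            }

            if (char.IsLowSurrogate(s[i]))
            {
                // A tail surrogate reached here has no lead surrogate before it.
                return i;
            }
        }

        return -1;
    }
}
```

With this sketch, FindFirstLoneSurrogate("a\ud800b") returns 1, while a string containing a complete pair such as "a\ud83d\ude00b" returns -1.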

This string is not a valid Unicode string and this may cause some issues.

For example, passing it to an API that validates its input as well-formed UTF-16 fails with

System.ArgumentException: Invalid Unicode code point found at index 2.
Parameter name: strInput
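
One framework call that is documented to throw ArgumentException for strings containing invalid Unicode characters is string.Normalize(). The sketch below assumes this is the kind of call that produces an error like the one above; the NormalizeDemo class and NormalizeThrows method are hypothetical names, and the exact exception message may differ between .NET versions.

```csharp
using System;

public static class NormalizeDemo
{
    // Hypothetical helper: returns true if string.Normalize() rejects the
    // input with an ArgumentException, false if normalization succeeds.
    public static bool NormalizeThrows(string s)
    {
        try
        {
            s.Normalize();
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }
}
```

For the string from the earlier example, NormalizeThrows("a\ud800b") is expected to return true, while a well-formed string such as "ab" returns false.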

If we store such a string in a file, some text editors may fail to open that file.

So I think if we get a string from an unreliable source, we may want to strip the invalid code units.

There are some approaches but I would like to suggest another one based on Regex.

We will use negative lookahead and negative lookbehind: find lead surrogates that are not followed by a tail surrogate, and tail surrogates that are not preceded by a lead surrogate.

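public static string StripInvalidUnicodeCharacters(string str)
{
    // Needs System.Text.RegularExpressions. Matches a lead surrogate with no tail after it, or a tail surrogate with no lead before it.
    var invalidCharactersRegex = new Regex("([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");
    return invalidCharactersRegex.Replace(str, "");
}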
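
A quick way to check the behaviour is to run the function on a few hand-made strings. The sketch below repeats the function so that it compiles by itself; the UnicodeCleaner class name, the cached static Regex field and the sample inputs are assumptions added for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeCleaner
{
    // Same pattern as in the post, created once and reused across calls.
    private static readonly Regex InvalidSurrogates = new Regex(
        "([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

    public static string StripInvalidUnicodeCharacters(string str)
    {
        // Remove every unpaired surrogate; valid pairs are left untouched.
        return InvalidSurrogates.Replace(str, "");
    }
}
```

With this version, StripInvalidUnicodeCharacters("a\ud800b") and StripInvalidUnicodeCharacters("a\udc00b") both return "ab", and a string with a complete surrogate pair such as "a\ud83d\ude00b" is returned unchanged.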

About mnaoumov

Senior .NET Developer in Readify

4 Responses to Stripping invalid characters from UTF-16 strings

  1. Pingback: Escaping Invalid XML Unicode characters | mnaoumov.NET

  2. Alex says:

    Do you know what the byte[] is for this “\ud800” symbol?

    • mnaoumov says:

      It depends on the encoding.

      [System.Text.Encoding]::UTF8.GetBytes([char] 0xd800)

      [System.Text.Encoding]::Unicode.GetBytes([char] 0xd800)

  3. James says:

    Thanks for that regex. Unfortunately it does not always remove all illegal content from strings (at least string.Normalize() still fails). I did some experiments: I made a lot of Unicode strings with all kinds of garbage in them and then tried to clean them using your regex, but as mentioned, string.Normalize() sometimes still fails.

    I did a few checks and it seems that codes in the following ranges are not accepted by string.Normalize(), but they are also not removed by the regex:
    0xFDD0 – 0xFDEF
    0xFFFE – 0xFFFF
