Stripping invalid characters from UTF-16 strings

Hi folks

The more you work with Unicode, the more discoveries you make.

.NET System.Char represents a character as a UTF-16 code unit.

UTF-16 has a concept of surrogates:

Code units in the range U+D800 to U+DBFF – lead surrogate, aka first code unit, aka high surrogate

Code units in the range U+DC00 to U+DFFF – tail surrogate, aka second code unit, aka low surrogate

To form a valid Unicode code point, a lead surrogate must always be followed by a tail surrogate, and a tail surrogate must never appear without a lead surrogate before it.
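The pairing rule can be checked by hand. Here is a minimal sketch, assuming a plain loop over the code units is acceptable; the names SurrogateCheck and IsWellFormedUtf16 are made up for illustration and are not part of .NET:

```csharp
using System;

public static class SurrogateCheck
{
    // Hypothetical helper (not a .NET API): returns true when every
    // surrogate in the string belongs to a lead + tail pair.
    public static bool IsWellFormedUtf16(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]))
            {
                // A lead surrogate must be immediately followed by a tail surrogate.
                if (i + 1 >= s.Length || !char.IsLowSurrogate(s[i + 1]))
                    return false;
                i++; // skip the tail surrogate we just checked
            }
            else if (char.IsLowSurrogate(s[i]))
            {
                // A tail surrogate with no lead surrogate before it.
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(IsWellFormedUtf16("x\ud83d\ude00y")); // prints "True"
        Console.WriteLine(IsWellFormedUtf16("\udc00x"));        // prints "False"
    }
}
```

The first string contains a complete pair, so it passes; the second starts with a tail surrogate that has nothing in front of it, so it does not.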

However, .NET does not enforce this rule. You can create a string that is not valid UTF-16.

For example

string s = "a\ud800b";

Here \ud800 is a lead surrogate, but it is followed by the letter b, which is not a tail surrogate.

This string is not valid Unicode, and this may cause some issues.

For example

s.Normalize();

fails with

System.ArgumentException: Invalid Unicode code point found at index 2.
Parameter name: strInput

If we save such a string to a file, some text editors may fail to open it.
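What actually lands in the file depends on the encoder. As a rough sketch (the class and method names below are made up for illustration), the default Encoding.UTF8 substitutes the replacement character U+FFFD for a lone surrogate, while a UTF8Encoding created with throwOnInvalidBytes set to true refuses to encode it:

```csharp
using System;
using System.Text;

public static class LoneSurrogateEncodingDemo
{
    // Hypothetical helper: encode with the default UTF-8 encoder,
    // which replaces a lone surrogate with U+FFFD (bytes EF BF BD).
    public static byte[] EncodeLenient(string s)
    {
        return Encoding.UTF8.GetBytes(s);
    }

    // Hypothetical helper: returns true if a strict UTF-8 encoder accepts the string.
    public static bool CanEncodeStrict(string s)
    {
        // Second argument is throwOnInvalidBytes.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetBytes(s);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(BitConverter.ToString(EncodeLenient("a\ud800b"))); // prints "61-EF-BF-BD-62"
        Console.WriteLine(CanEncodeStrict("a\ud800b"));                      // prints "False"
    }
}
```

So with the default encoder the invalid code unit is silently replaced, and with a strict one the write fails. Either way the original data is not preserved.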

So I think if we get a string from an unreliable source, we may want to strip the invalid code units.

There are several approaches, but I would like to suggest another one based on Regex.

We will use a negative lookahead and a negative lookbehind: find lead surrogates that are not followed by a tail surrogate, and also find tail surrogates that are not preceded by a lead surrogate.

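// Matches a lead surrogate that is not followed by a tail surrogate,
// or a tail surrogate that is not preceded by a lead surrogate.
private static readonly Regex InvalidCharactersRegex =
    new Regex("([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

public static string StripInvalidUnicodeCharacters(string str)
{
    return InvalidCharactersRegex.Replace(str, "");
}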
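Here is a quick usage sketch. The method is repeated so that the snippet compiles on its own, and the wrapper class name StripDemo is made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripDemo
{
    public static string StripInvalidUnicodeCharacters(string str)
    {
        var invalidCharactersRegex = new Regex("([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");
        return invalidCharactersRegex.Replace(str, "");
    }

    public static void Main()
    {
        // The lone lead surrogate is removed, so Normalize no longer throws.
        string cleaned = StripInvalidUnicodeCharacters("a\ud800b");
        Console.WriteLine(cleaned.Normalize()); // prints "ab"

        // A complete surrogate pair is left untouched.
        string pair = "x\ud83d\ude00y";
        Console.WriteLine(StripInvalidUnicodeCharacters(pair) == pair); // prints "True"

        // A tail surrogate with nothing before it is removed as well.
        Console.WriteLine(StripInvalidUnicodeCharacters("\udc00z")); // prints "z"
    }
}
```

Both lookarounds test the original input, not the partially replaced result, so valid pairs survive even when they sit right next to a stray surrogate.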
About mnaoumov

Senior .NET Developer in Readify

3 Responses to Stripping invalid characters from UTF-16 strings

  1. Pingback: Escaping Invalid XML Unicode characters | mnaoumov.NET

  2. Alex says:

    Do you know what the byte[] for this “\ud800” symbol is?

    • mnaoumov says:

      It depends on the encoding.

      [System.Text.Encoding]::UTF8.GetBytes([char] 0xd800)
      239
      191
      189

      [System.Text.Encoding]::Unicode.GetBytes([char] 0xd800)
      253
      255
