Unicode, scripts (writing systems) and Regular Expressions

Unicode is a vast topic, and so are regular expressions (regex).

You can use regex while knowing little or even nothing about Unicode. But if you are dealing with non-English (non-Latin) text, some knowledge is needed.

To understand why the regex \p{Arabic} is valid (in many implementations) and why it works, you need to know the following fundamental things about Unicode.

The fundamentals of the Unicode fundamentals

On the one hand, Unicode maps characters to code points.

If we take the © character as an example, its code point is U+00A9. The 00A9 is in hexadecimal (base-16); if we converted that to decimal, we would get 169.
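The mapping can be checked in a few lines. This is a minimal sketch in Python (a language picked here only for illustration), using the built-in ord and chr functions; the variable names are made up for the example.

```python
# Look up the code point of a character, and the character of a code point.
char = "©"

code_point = ord(char)           # the code point as a decimal integer
as_hex = f"U+{code_point:04X}"   # the conventional U+XXXX notation

print(code_point)   # 169
print(as_hex)       # U+00A9
print(chr(0x00A9))  # ©
```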

As of Unicode 15.0, over 149,000 characters have a code point. A lot!

As mentioned, Unicode is a vast topic; there's no space here to explain why some characters can be written in more than one way, why an accented character can consist of two code points, etc.

Qualities

On the other hand, besides this mapping, Unicode also defines qualities (officially called properties) for characters.

For example, Unicode "knows" that the character A (U+0041) is an uppercase letter and that it's written from left to right.

Many other characters have similar qualities. The character E is also an uppercase letter written from left to right.
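These qualities can be looked up directly. The sketch below assumes Python and its standard unicodedata module; the Arabic letter is added only for contrast and is not an example from the text above.

```python
import unicodedata

# General category: "Lu" means Letter, uppercase.
category_a = unicodedata.category("A")
category_e = unicodedata.category("E")

# Bidirectional class: "L" means left to right.
direction_a = unicodedata.bidirectional("A")

# For contrast, the Arabic letter beh (U+0628): a letter without case ("Lo"),
# written right to left ("AL", the class used for Arabic letters).
category_beh = unicodedata.category("\u0628")
direction_beh = unicodedata.bidirectional("\u0628")

print(category_a, category_e, direction_a)  # Lu Lu L
print(category_beh, direction_beh)          # Lo AL
```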

If you want to select all uppercase letters with regex, how would you do it? Would you use [A-Z]? That matches only the 26 unaccented English capitals; it misses uppercase letters such as Ä, Д or Ω.

A regex that matches the uppercase quality is: \p{Lu}.

\p stands for property (quality), and {Lu} for a letter that is uppercase. \p{L} would match any letter.
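To see the difference between [A-Z] and the uppercase quality, here is a hedged sketch in Python. Python's built-in re module does not understand \p{...} (the third-party regex package does), so the sketch reproduces what \p{Lu} and \p{L} select by checking each character's general category. The sample text and variable names are made up for the example.

```python
import re
import unicodedata

# Sample: A with umlaut, Cyrillic De, Greek Omega, plain A, lowercase z.
text = "\u00c4 \u0414 \u03a9 A z"

# [A-Z] only knows the 26 unaccented English capitals.
ascii_upper = re.findall(r"[A-Z]", text)

# What \p{Lu} selects: every character whose general category is "Lu".
all_upper = [ch for ch in text if unicodedata.category(ch) == "Lu"]

# What \p{L} selects: every character whose category starts with "L".
all_letters = [ch for ch in text if unicodedata.category(ch).startswith("L")]

print(ascii_upper)  # ['A']
print(all_upper)    # ['Ä', 'Д', 'Ω', 'A']
print(all_letters)  # ['Ä', 'Д', 'Ω', 'A', 'z']
```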

Scripts

But there are other ways, more obvious ways, to group characters than by their case.

A and E are letters of neither the Chinese nor the Arabic writing system. They are definitely something Latin-esque, Latin-ish: they belong to the Latin script.

And not surprisingly, this quality, the quality of belonging to a writing system, is also stored in Unicode. This is referred to as the script quality (the Script property, in Unicode's own terms).

Back to the beginning

Just as regex can match the case quality, it can match the script. Conveniently, there's no abbreviation to memorize: you simply write the name of the writing system. (Some implementations want a longer form; JavaScript, for example, expects \p{Script=Arabic}.)

And so we have arrived at why the \p{Arabic} regex matches Arabic characters.

Another example: the \p{Cyrillic} regex matches Cyrillic characters.

And so on.
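A rough illustration of script matching, again in Python. Python's standard library does not expose the script quality, so this sketch approximates it with a heuristic: it keeps the characters whose official Unicode name starts with the script's name. That is an assumption made for the example, not how a regex engine implements \p{Arabic}. The helper chars_named is hypothetical. With the third-party regex package, a pattern like \p{Arabic} should do the same job directly.

```python
import unicodedata


def chars_named(text: str, prefix: str) -> str:
    # Hypothetical helper: keep the characters whose Unicode name starts
    # with `prefix`. A rough stand-in for a script match, not a real
    # lookup of the Script property.
    return "".join(
        ch for ch in text
        if unicodedata.name(ch, "").startswith(prefix + " ")
    )


# "Hello", then a Cyrillic word (U+041F ...), then an Arabic word (U+0645 ...).
text = "Hello \u041f\u0440\u0438\u0432\u0435\u0442 \u0645\u0631\u062d\u0628\u0627"

arabic = chars_named(text, "ARABIC")
cyrillic = chars_named(text, "CYRILLIC")

print(arabic)
print(cyrillic)
```

The heuristic works for these samples because their names begin with ARABIC LETTER or CYRILLIC; a real script match relies on the data Unicode stores for each character and covers cases that names alone do not.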

2023-02-22