How Character Encodings Were Normalized
Overview
As of 2024-12-18 some of the database's textual fields contain characters that use incompatible character set encodings -- that is, different bit patterns used to represent character glyphs. For example, the database contains the glyph "é" represented in the db as 3 different bit patterns: one in the Windows-1252 character encoding, one in the Mac OS Roman character encoding, and a third in Unicode UTF-8.
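The difference is visible from psql. The following is a minimal sketch using Postgres's convert_from() function; Mac OS Roman is not an encoding Postgres can convert from, so its byte pattern appears only in a comment.

    -- "é" as a Windows-1252 byte pattern: the single byte 0xE9.
    SELECT convert_from('\xe9'::bytea, 'WIN1252');  -- é
    -- "é" as a UTF-8 byte pattern: the two bytes 0xC3 0xA9.
    SELECT convert_from('\xc3a9'::bytea, 'UTF8');   -- é
    -- "é" in Mac OS Roman is the single byte 0x8E; Postgres has no Mac
    -- OS Roman conversion, so convert_from() cannot decode it.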
For the "Babase production schemas" in the "babase" database, this means switching to only US ASCII characters -- removing all accents, etc. For all other schemas and all other databases this means converting all character encodings to the UTF-8 Unicode encoding.
Note that copying any of the Babase production schemas from the babase database to any other database will copy not only the encoded data but also the encoding requirements. So, eventually, all the Babase production schemas in any of the databases will be ASCII-only.
What This Means To You
To begin with, searching for non-ASCII characters (anything not on a US keyboard) will no longer find anything in the Babase production schemas.
Searching for text containing an "é" character must be done with SQL written in UTF-8 encoded characters. This is not expected to be an issue because UTF-8 has become the standard. But there could still be some impact: because the stored text is now all UTF-8, a query searching for, say, "é" will now find every "é" glyph, where a previous query would find only those that happened to be encoded in UTF-8.
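For example, assuming a hypothetical table notes with a text column comment, the queries below now behave uniformly; in the Babase production schemas the first should return no rows at all:

    -- Find any non-ASCII characters (bytes outside 0x01-0x7F).
    SELECT comment FROM notes WHERE comment ~ '[^\x01-\x7f]';
    -- Find every "é"; one UTF-8 pattern now matches every stored "é".
    SELECT comment FROM notes WHERE comment LIKE '%é%';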
There are also issues involving sorting. Changing the character encoding changes the bytes used to store text, and because text is sorted by its bytes, an "é" that was previously stored in a non-UTF-8 encoding will now sort differently.
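A minimal sketch of the resulting sort order, using made-up values:

    SELECT x FROM (VALUES ('e'), ('z'), ('é')) AS t(x) ORDER BY x;
    -- Returns e, z, é: "é" sorts after "z" because its first UTF-8
    -- byte, 0xC3, is greater than any ASCII byte.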
What Encoding Normalization Means for Sorting and Equality Comparisons
Although character encodings were standardized, at the time of this writing the database still sorts and compares characters based on the bit patterns used to encode them. This means that accented characters will not compare equal to their unaccented counterparts. Further, non-ASCII characters -- accented letters, emoji, and the like -- may not sort as expected.
The choice to leave collation and equality untouched was made both in the interest of backwards compatibility and because the "Babase production schemas" are intended to be ASCII-only. This choice also improves performance, although it is not known whether this is significant.
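For example, under the retained byte-wise comparison:

    SELECT 'é' = 'e';            -- false: bytes 0xC3 0xA9 vs 0x65
    SELECT 'résumé' = 'resume';  -- false, for the same reason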
The Babase "Production Schemas"
The following schemas, in and only in the babase database, were switched to US ASCII only:
- babase
- babase_history
- babase_pending
- archived_data
- brook_matepref
- genetic_markers
- genetic_sequencing
- gi_parasites
- levy_protogrammetry
- pace_mate_guarding_data
- patsis_ksmith
- published_data_tables
- sca_mehav_8791
- urine_samples
- zipple_mother_infant_focals
All other schemas in the babase database and all schemas in all other databases had their non-Unicode character encodings normalized to the UTF-8 Unicode encoding.
The Process
A database dump was examined and all characters with non-ASCII character encodings were reported. These were then searched for in the database dump and the intended character glyph was guessed at by examining the surrounding text. It was assumed that the encodings used were one of UTF-8, Windows-1252, or Mac OS Roman. Bit patterns were looked up in these encodings to guess at the intended glyph.
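A sketch of why the surrounding text matters: the same byte can stand for different glyphs in different encodings.

    -- Byte 0xD5 is "Õ" in Windows-1252 ...
    SELECT convert_from('\xd5'::bytea, 'WIN1252');  -- Õ
    -- ... but a right single quotation mark in Mac OS Roman, so a 0xD5
    -- between letters, as in "don?t", was almost certainly meant as an
    -- apostrophe.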
The database was then dumped, the dump run through a program that normalizes the character encodings, and the result restored into the database. The programs used to do this are in the "util" directory under revision control.
ASCII Normalization
Unicode (UTF-8) Normalization
Background
When Babase was re-implemented in Postgres, UTF-8 support was not widespread. The choice was made to give the database a C (US-ASCII) character encoding and collation. This was done both for performance reasons and because the expectation was that only US-ASCII characters would appear in the database content.
Postgres treats the ASCII encoding specially. It allows any bit patterns to be used in textual content and does not make any assumptions about the encoding of text.
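In Postgres terms this is the SQL_ASCII server encoding. A minimal sketch of creating such a database (the name demo is hypothetical):

    CREATE DATABASE demo
        ENCODING 'SQL_ASCII'         -- accept any byte patterns in text
        LC_COLLATE 'C' LC_CTYPE 'C'
        TEMPLATE template0;          -- required when the encoding differs
                                     -- from the template database's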
When people used different systems (Mac OS, MS Windows, etc.) to enter data, the textual data was encoded in whatever encoding the given system used at the time. We have no record of what encodings were used, so no real way to know the intended character glyph of any bit pattern. For the most part, the intended glyph was obvious -- it rendered to something expected when displayed. Nonetheless, having the database contain different bit patterns that render to the same glyph on display opens the door to error and confusion.
With the release of Postgres 16, errors could be raised when characters from the database server, which did not care about encoding, were given to the database front end, which does. This prompted the character set normalization.
The Babase database is now encoded using UTF-8, and the database enforces the requirement that all textual data be UTF-8 encoded.
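The enforcement can be seen directly; a byte sequence that is not valid UTF-8 is now rejected rather than stored:

    SELECT convert_from('\x8e'::bytea, 'UTF8');
    -- ERROR:  invalid byte sequence for encoding "UTF8": 0x8e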