How Character Encodings Were Normalized
Overview
As of 2024-12-18 there were characters in the database's textual fields stored in incompatible character set encodings -- the bit patterns used to represent character glyphs. For example, the database contained the glyph "é" represented as three different bit patterns: one in the Windows-1252 encoding, one in the Mac OS Roman encoding, and a third in the Unicode UTF-8 encoding.
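For illustration only (this snippet is not part of the Babase utilities), Python can show the three bit patterns involved:

    glyph = "é"

    # The same glyph, three different bit patterns:
    print(glyph.encode("windows-1252"))  # b'\xe9'      (Windows-1252)
    print(glyph.encode("mac_roman"))     # b'\x8e'      (Mac OS Roman)
    print(glyph.encode("utf-8"))         # b'\xc3\xa9'  (UTF-8)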
For the "Babase production schemas" in the "babase" database, this means switching to only US ASCII characters -- removing all accents, etc. For all other schemas and all other databases this means converting all character encodings to the UTF-8 Unicode encoding.
What This Means To You
Searching for text containing an "é" character must be done with SQL written in UTF-8 encoded characters. This is not expected to be an issue because UTF-8 has become the standard. There could still be some impact: because the encoding is now uniformly UTF-8, a query searching for, say, "é" will now find all of the "é" glyphs, whereas before the normalization it would have found only those that happened to be encoded in UTF-8.
There are also issues involving sorting. Changing a character's encoding changes where it sorts, so an "é" that was previously stored in a non-UTF-8 encoding will now sort differently because it is stored in UTF-8.
What Encoding Normalization Means for Sorting and Equality Comparisons
Although the character encodings were normalized, at the time of this writing the database still sorts and compares characters based on the bit patterns used to encode them. This means that non-ASCII characters -- those with grave or acute accents, emoji, and so forth -- will not compare equal to their unaccented counterparts. Further, these characters may not sort as expected.
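A small Python sketch (not part of the Babase utilities) illustrates what byte-pattern comparison means in practice; this is roughly how a C-collated column behaves:

    words = ["résumé", "reverie", "resume"]

    # Sorting by the UTF-8 byte patterns, the way a byte-wise ("C") collation sorts:
    # "é" encodes as 0xC3 0xA9, which is greater than any ASCII letter, so the
    # accented word sorts after all of the unaccented ones.
    print(sorted(words, key=lambda w: w.encode("utf-8")))
    # ['resume', 'reverie', 'résumé']

    # Equality is also byte-for-byte, so accented and unaccented forms never match.
    print("résumé" == "resume")  # False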
The choice to leave collation and equality comparison untouched was made both in the interest of backward compatibility and because the "Babase production schemas" are intended to be ASCII-only. This choice also improves performance, although whether the improvement is significant is not known.
The Babase "Production Schemas"
The following schemas, which exist only in the babase database, were switched to US ASCII only:
- babase
- babase_history
- babase_pending
- archived_data
- brook_matepref
- genetic_markers
- genetic_sequencing
- gi_parasites
- levy_protogrammetry
- pace_mate_guarding_data
- patsis_ksmith
- published_data_tables
- sca_mehav_8791
- urine_samples
- zipple_mother_infant_focals
All other schemas in the babase database and all schemas in all other databases had their non-Unicode character encodings normalized to the UTF-8 Unicode encoding.
The Process
A database dump was examined and all characters with non-ASCII encodings were reported. Each was then located in the dump and the intended glyph was inferred by examining the surrounding text. It was assumed that the encodings in use were UTF-8, Windows-1252, or Mac OS Roman; the reported bit patterns were looked up in each of these encodings to guess at the intended glyph.
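The following is a hypothetical sketch of the reporting step -- the actual programs are in the "util" directory and may work differently. It scans a dump for non-ASCII byte sequences and prints them with surrounding context so the intended glyph can be guessed:

    import re

    def report_non_ascii(dump_path, context=20):
        # Read the dump as raw bytes; non-ASCII content is whatever has the
        # high bit set (0x80-0xFF).
        with open(dump_path, "rb") as f:
            data = f.read()
        for match in re.finditer(rb"[\x80-\xff]+", data):
            start = max(match.start() - context, 0)
            end = match.end() + context
            print("offset %d: bytes %s in context %r"
                  % (match.start(), match.group().hex(), data[start:end]))

    # report_non_ascii("babase.dump")  # hypothetical file name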
The database was then dumped, the dump was run through a program that normalizes the character encodings, and the result was restored into the database. The programs used to do this are in the "util" directory, under revision control.
ASCII Normalization
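A minimal sketch of one way accents can be stripped to produce US ASCII text, assuming Unicode decomposition followed by discarding the combining marks; the actual util programs may handle additional cases:

    import unicodedata

    def to_ascii(text):
        # NFKD decomposition splits "é" into "e" plus a combining acute accent;
        # encoding to ASCII with errors="ignore" then drops the accent (and any
        # other remaining non-ASCII character).
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")

    print(to_ascii("café"))  # cafe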
Unicode (UTF-8) Normalization
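A minimal sketch of the conversion, assuming each suspect byte sequence is decoded by trying UTF-8 first, then Windows-1252, then Mac OS Roman, and re-encoding the result as UTF-8. The actual util programs may choose encodings differently; as noted above, many bit patterns are valid in more than one encoding, which is why the surrounding text was also examined:

    def normalize_to_utf8(raw: bytes) -> bytes:
        # Try the assumed encodings in order; re-encode whatever decodes first.
        for encoding in ("utf-8", "windows-1252", "mac_roman"):
            try:
                return raw.decode(encoding).encode("utf-8")
            except UnicodeDecodeError:
                continue
        return raw  # leave undecodable bytes untouched

    print(normalize_to_utf8(b"caf\xe9"))      # Windows-1252 "é" -> b'caf\xc3\xa9'
    print(normalize_to_utf8(b"caf\xc3\xa9"))  # already valid UTF-8 -> unchanged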
Background
When Babase was re-implemented in Postgres, UTF-8 support was not widespread. The choice was made to give the database a C (US-ASCII) character encoding and collation.