How Character Encodings and Textual Data Types Were Normalized
Overview
As of 2024-12-20 there were some characters in the database's textual fields that used inconsistent character set encodings, the bit patterns that are used to represent character glyphs. For example, the database contains the glyph "é" encoded using 3 different bit patterns, one in the Windows-1252 character encoding, one in the Mac OS Roman character encoding, and a third as Unicode UTF-8.
For the "Babase production schemas" in the "babase" database, fixing this means switching to only US ASCII characters -- removing all accents, etc. For all other schemas and all other databases this means converting all character encodings to the UTF-8 Unicode encoding.
Note that copying any of the Babase production schemas from the babase database to any other database will copy not only the encoded data but also copy the encoding requirements. So, eventually, all the Babase production schemas in any of the databases will be ASCII-only.
As part of the normalization process some "fixed-length" character columns had their data type changed. This is not expected to impact anybody, but it could. See below for more on how this might, possibly, impact you.
What This Means To You
To begin with, searching for non-ASCII characters (anything not on a US keyboard) will no longer find anything in the Babase production schemas.
Searching for text containing non-7bit-ASCII characthers, say, an "é" character, must be done with SQL written in UTF-8 encoded characters. This is not expected to be an issue because UTF-8 has become the standard. But there still could be some impact; because the encoding is now all UTF-8, a query searching for, say, "é" might now find all the "é" glyphs where a previous query would find only those encoded in UTF-8.
There are also issues involving sorting. Changing the character encoding used in the database will change the way text is sorted, so an "é" that was previously encoded in a non-UTF-8 encoding will now sort differently because it has been changed to a UTF-8 encoding.
What Encoding Normalization Means for Sorting and Equality Comparisons
Although character encodings were standardized, at the time of this writing the database sorts and compares characters based on the bit patterns used to encode them. This means that non-ASCII characters, those with graves or accents or emojis, will not compare equal to their non-accented counterparts. Further, these characters may not sort as expected.
The choice to leave collation and equality untouched was made both in the interest of backwards compatibility and because the "Babase production schemas" are intended to be ASCII-only. This choice also improves performance, although it is not known whether this is significant.
What Changing From a "Fixed-Width" Data Type Means
The Postgres "fixed-width" data type (CHAR) has some unusual properties. Using another data type instead makes the database behavior more "normal".
Queries on "fixed-width" data columns that are more than one character long could be affected.
Postgres ignores all trailing spaces of "fixed-width" character columns. So, the following 2 queries previously produced identical results (they no longer do):
SELECT * FROM acts WHERE act = 'A';
SELECT * FROM acts WHERE act = 'A ';
It is unlikely that you write queries with extra trailing spaces, but it is possible that some of the data you have in your private schema contains such trailing spaces. If the datatype used in your private schema is CHAR then the trailing spaces in your data are ignored. This is likely the case if you copied tables from the babase schema, so this change is not expected to cause any problems.
(The recommendation is to avoid using the CHAR data type in your personal schemas.)
Columns Affected By The Data Type Change
This list does not include columns that were "fixed-width" but limited to a single character. Nor does it list the columns in views which might utilize these columns.
- acts.act
- stypes.stype
- stypes_postures.stype
- stypes_activities.stype
- stypes_ncodes.stype
- adcodes.adcode
- swerb_loc_data.adcode
- samples.stype
- interact_data.act
- populations.wild_captive
The Babase "Production Schemas"
The following schemas, in and only in the babase database, were switched to US ASCII only:
- babase
- babase_history
- babase_pending
- archived_data
- brook_matepref
- genetic_markers
- genetic_sequencing
- gi_parasites
- levy_protogrammetry
- pace_mate_guarding_data
- patsis_ksmith
- published_data_tables
- sca_mehav_8791
- urine_samples
- zipple_mother_infant_focals
All other schemas in the babase database and all schemas in all other databases had their non-Unicode character encodings normalized to the UTF-8 Unicode encoding.
The Process
A database dump was examined and all characters with non-ASCII character encodings were reported. These were then searched for in the database dump and the intended character glyph was guessed at, by examination of the surrounding text. It was assumed that the encodings used were either UTF-8, Windows-1252, or Mac OS Roman. Bit patterns were looked up in these encodings to guess at the intended glyph.
It is unlikely, but possible, that the glyph chosen for a particular bit pattern is the wrong choice. It is also possible that the same bit pattern, occurring more than once in the data, was intended to represent different glyphs, because different operating systems were used to enter the same bit pattern. Only a sampling of any one bit pattern was examined when glyphs were chosen. Or, it could be that a glyph from a character encoding that was not considered is the correct glyph in some particular case. This document lists below the details of bit-patterns, encodings, and glyphs in the event any particular choice should be re-examined.
The database was dumped, the result run through a program which normalizes the character encodings, and the result restored into the database. The programs used to do this are in the "util" directory under revision control.
ASCII Normalization
Non-7bit-ASCII encoded characters are replaced with 7bit-ASCII encodings.
For the most part the replacement glyphs are identical but there are some cases where the usage of the glyph made clear that a different glyph should be chosen.
Unicode |
Bit |
From |
To |
Count* |
Original Encoding |
0x2026 |
|
… |
... |
1 |
UTf-8, HORIZONTAL ELLIPSES (replaced w. 3 periods) |
|
0xd5 |
’ |
' |
7 |
Mac OS Roman, RIGHT SINGLE QUOTATION MARK (replaced w. apostrophe) |
|
0x92 |
’ |
' |
39 |
Windows-1252, RIGHT SINGLE QUOTATION MARK (replaced w. apostrophe) |
0xd4 |
|
Ô |
u |
182 |
UTF-8, Part of a ÔøΩ sequence all replaced w. a single "u" |
0xf8 |
|
ø |
|
182 |
Part of a sequence all replaced w. a single "u" |
0x03a9 |
|
Ω |
|
182 |
Part of a sequence all replaced w. a single "u" |
0xb5 |
|
µ |
u |
10 |
UTF-8, MICRO SIGN (replaced w. lower case "u") |
|
0xa0 |
|
|
91 |
Windows-1252, NO-BREAK SPACE (replaced w. a regular space) |
|
0x91 |
‘ |
' |
12 |
Windows-1252, LEFT SINGLE QUOTATION MARK (replaced w. apostrophe) |
|
0x83 |
É |
E |
3 |
Mac OS Roman, LATIN CAPITAL LETTER E WITH ACUTE (replaced with capital E) |
|
0xe0 |
à |
a |
1 |
Windows-1252, LATIN SMALL LETTER A WITH GRAVE (replaced with small a) |
|
0xc6 |
∆ |
' |
1 |
Mac OS Roman, capital delta, Used as an apostrophe (replaced with apostrophe) |
0x01f412 |
|
🐒 |
none |
1 |
Monkey emoji, Removed |
|
0xca |
|
|
8 |
Mac OS Roman, non-breaking space, Appears to be used in place of EOL (replaced with regular space) |
|
0xb5 |
µ |
u |
23 |
Windows-1252, MICRO SIGN (replaced w. lower case "u") |
Note that it was only possible to replace the UTF-8 HORIZONTAL ELLIPSES with 3 periods because the character only showed up in ALLMISCS.Txt for ALMID 10478, which has unlimited length.
* This is the number of times the encoding shows up in the Babase production schemas.
Unicode (UTF-8) Normalization
Non-UTF-8 encoded characters are replaced with UTF-8 encoded characters.
For the most part the replacement glyphs are identical but there are some cases where the usage of the glyph made clear that a different glyph should be chosen. This is most clear in notes appended to the rightmost column.
From Bit |
To UTF-8 |
Count* |
Glyph |
Original Encoding |
0x8e |
0x00e9 |
12 |
é |
Mac OS Roman, LATIN SMALL LETTER E WITH ACUTE |
0x92 |
0x2019 |
57 |
’ |
Windows-1252, RIGHT SINGLE QUOTATION MARK |
0x90 |
0x00ea |
5 |
ê |
Mac OS Roman, LATIN SMALL LETTER E WITH CIRUMFLEX |
0x8b |
0x00e3 |
7 |
ã |
Mac OS Roman, LATIN SMALL LETTER A WITH TILDE |
0x87 |
0x00e1 |
4 |
á |
Mac OS Roman, LATIN SMALL LETTER A WITH ACUTE |
0x97 |
0x00f3 |
2 |
ó |
Mac OS Roman, LATIN SMALL LETTER O WITH ACUTE |
0x8d |
0x00e7 |
1 |
ç |
Mac OS Roman, LATIN SMALL LETTER O WITH CEDILLA |
0xd5 |
0x2019 |
7 |
’ |
Mac OS Roman, RIGHT SINGLE QUOTATION MARK |
0xa0 |
x00a0 |
91 |
|
Windows-1252, NO-BREAK SPACE |
0x91 |
x2018 |
12 |
‘ |
Windows-1252, LEFT SINGLE QUOTATION MARK |
0x83 |
0x00c9 |
4 |
É |
Mac OS Roman, LATIN CAPITAL LETTER E WITH ACUTE |
0xe0 |
0x00e0 |
1 |
à |
Windows-1252, LATIN SMALL LETTER A WITH GRAVE |
0xc6 |
0x27 |
1 |
∆ |
Mac OS Roman, capital delta, Used as an apostrophe (replaced with apostrophe) |
0xca |
0x20 |
8 |
|
Mac OS Roman, non-breaking space, Appears to be used in place of EOL (replaced with space) |
0xb5 |
0x00b5 |
2,872 |
µ |
Windows-1252, MICRO SIGN |
0xc9 |
0x2026 |
5 |
… |
Mac OS Roman, HORIZONTAL ELLIPSES |
0xaf |
0x30 |
2 |
Ø |
Windows-1252, LATIN CHARACTER O WITH A STROKE (replaced with 0 [zero] glyph) |
0xf3 |
0x00f3 |
156 |
ó |
Windows-1252, LATIN SMALL LETTER O WITH ACUTE |
0xe3 |
0x00e3 |
258 |
ã |
Windows-1252, LATIN SMALL LETTER A WITH TILDE |
0xe9 |
0x00e9 |
15 |
é |
Windows-1252, LATIN SMALL LETTER E WITH ACUTE |
0xed |
0x00ed |
5 |
í |
Windows-1252, LATIN SMALL LETTER I WITH ACUTE |
0xea |
0x00ea |
6 |
ê |
Windows-1252, LATIN SMALL LETTER E WITH CIRCUMFLEX |
0xe1 |
0x00e1 |
9 |
á |
Windows-1252, LATIN SMALL LETTER A WITH ACUTE |
0xe7 |
0x00e7 |
3 |
ç |
Windows-1252, LATIN SMALL LETTER C WITH CEDILLA |
0xfa |
0x00fa |
1 |
ú |
Windows-1252, LATIN SMALL LETTER U WITH ACUTE |
0xf4 |
0x00f4 |
1 |
ô |
Windows-1252, LATIN SMALL LETTER O WITH CIRCUMFLEX |
0xcd |
0x00cd) |
2 |
Í |
Windows-1252, LATIN CAPITAL LETTER i WITH ACUTE |
0xe2 |
0x00e1 |
2 |
á |
Windows-1252, LATIN SMALL LETTER A WITH ACUTE |
0xc1 |
0x00c1 |
1 |
Á |
Windows-1252, LATIN CAPITAL LETTER A WITH ACUTE |
0xc7 |
0x00c7 |
1 |
Ç |
Windows-1252, LATIN CAPITAL LETTER C WITH CEDILLA |
0x85 |
0x2026 |
10 |
… |
Windows-1252, HORIZONTAL ELLIPSES |
* This is the number of times the encoding shows up anywhere in the Babase database, in any schema. This does not reflect the number of changes made, because the encodings in the Babase production schemas were changed to ASCII these encodings were already changed.
Background
When Babase was re-implemented in Postgres UTF-8 support was not widespread. The choice was made to give the database a C (US-ASCII) character encoding and collation. This was done both for performance reasons and because the expectation was that only US-ASCII characters would appear in the database content.
Postgres treats the ASCII encoding specially. It allows any bit patterns to be used in textual content and does not make any assumptions about the encoding of text.
When people used different systems (Mac OS, MS Windows, etc.) to enter data, the textual data was encoded in whatever encoding the given system used at the time. We have no record of what encodings were used, so no real way to know the intended character glyph of any bit pattern. For the most part, the intended glyph was obvious -- it rendered to something expected when displayed. None-the-less, having the database contain different bit-patterns that render to the same glyph on display opens the door to error an confusion.
With the release of Postgres 16 errors could be raised when characters from the database server, which did not care about encoding, were given to the database front-end, which does care. This prompted the character set normalization.
The Babase database is now encoded using UTF-8, and the database enforces the requirement that all textual data be UTF-8 encoded.