CharacterNormalization

How Character Encodings and Textual Data Types Were Normalized

Contents

How Character Encodings and Textual Data Types Were Normalized

Overview

As of 2024-12-20 there were some characters in the database's textual fields that used inconsistent character set encodings, the bit patterns that are used to represent character glyphs. For example, the database contains the glyph "é" encoded using 3 different bit patterns, one in the Windows-1252 character encoding, one in the Mac OS Roman character encoding, and a third as Unicode UTF-8.

For the "Babase production schemas" in the "babase" database, fixing this means switching to only US ASCII characters -- removing all accents, etc. For all other schemas and all other databases this means converting all character encodings to the UTF-8 Unicode encoding.

Note that copying any of the Babase production schemas from the babase database to any other database will copy not only the encoded data but also copy the encoding requirements. So, eventually, all the Babase production schemas in any of the databases will be ASCII-only.

As part of the normalization process some "fixed-length" character columns had their data type changed. This is not expected to impact anybody, but it could. See below for more on how this might, possibly, impact you.

What This Means To You

To begin with, searching for non-ASCII characters (anything not on a US keyboard) will no longer find anything in the Babase production schemas.

Searching for text containing non-7bit-ASCII characthers, say, an "é" character, must be done with SQL written in UTF-8 encoded characters. This is not expected to be an issue because UTF-8 has become the standard. But there still could be some impact; because the encoding is now all UTF-8, a query searching for, say, "é" might now find all the "é" glyphs where a previous query would find only those encoded in UTF-8.

There are also issues involving sorting. Changing the character encoding used in the database will change the way text is sorted, so an "é" that was previously encoded in a non-UTF-8 encoding will now sort differently because it has been changed to a UTF-8 encoding.

What Encoding Normalization Means for Sorting and Equality Comparisons

Although character encodings were standardized, at the time of this writing the database sorts and compares characters based on the bit patterns used to encode them. This means that non-ASCII characters, those with graves or accents or emojis, will not compare equal to their non-accented counterparts. Further, these characters may not sort as expected.

The choice to leave collation and equality untouched was made both in the interest of backwards compatibility and because the "Babase production schemas" are intended to be ASCII-only. This choice also improves performance, although it is not known whether this is significant.

What Changing From a "Fixed-Width" Data Type Means

The Postgres "fixed-width" data type (CHAR) has some unusual properties. Using another data type instead makes the database behavior more "normal".

Queries on "fixed-width" data columns that are more than one character long could be affected.

Postgres ignores all trailing spaces of "fixed-width" character columns. So, the following 2 queries previously produced identical results (they no longer do):

SELECT * FROM acts WHERE act = 'A';

SELECT * FROM acts WHERE act = 'A ';

It is unlikely that you write queries with extra trailing spaces, but it is possible that some of the data you have in your private schema contains such trailing spaces. If the datatype used in your private schema is CHAR then the trailing spaces in your data are ignored. This is likely the case if you copied tables from the babase schema, so this change is not expected to cause any problems.

(The recommendation is to avoid using the CHAR data type in your personal schemas.)

Columns Affected By The Data Type Change

This list does not include columns that were "fixed-width" but limited to a single character. Nor does it list the columns in views which might utilize these columns.

acts.act
stypes.stype
stypes_postures.stype
stypes_activities.stype
stypes_ncodes.stype
adcodes.adcode
swerb_loc_data.adcode
samples.stype
interact_data.act
populations.wild_captive

The Babase "Production Schemas"

The following schemas, in and only in the babase database, were switched to US ASCII only:

babase
babase_history
babase_pending
archived_data
brook_matepref
genetic_markers
genetic_sequencing
gi_parasites
levy_protogrammetry
pace_mate_guarding_data
patsis_ksmith
published_data_tables
sca_mehav_8791
urine_samples
zipple_mother_infant_focals

All other schemas in the babase database and all schemas in all other databases had their non-Unicode character encodings normalized to the UTF-8 Unicode encoding.

The Process

A database dump was examined and all characters with non-ASCII character encodings were reported. These were then searched for in the database dump and the intended character glyph was guessed at, by examination of the surrounding text. It was assumed that the encodings used were either UTF-8, Windows-1252, or Mac OS Roman. Bit patterns were looked up in these encodings to guess at the intended glyph.

It is unlikely, but possible, that the glyph chosen for a particular bit pattern is the wrong choice. It is also possible that the same bit pattern, occurring more than once in the data, was intended to represent different glyphs, because different operating systems were used to enter the same bit pattern. Only a sampling of any one bit pattern was examined when glyphs were chosen. Or, it could be that a glyph from a character encoding that was not considered is the correct glyph in some particular case. This document lists below the details of bit-patterns, encodings, and glyphs in the event any particular choice should be re-examined.

The database was dumped, the result run through a program which normalizes the character encodings, and the result restored into the database. The programs used to do this are in the "util" directory under revision control.

ASCII Normalization

Non-7bit-ASCII encoded characters are replaced with 7bit-ASCII encodings.

For the most part the replacement glyphs are identical but there are some cases where the usage of the glyph made clear that a different glyph should be chosen.

Unicode Codepoint	Bit Pattern	From Glyph	To Glyph	Count*	Original Encoding
0x2026		…	...	1	UTf-8, HORIZONTAL ELLIPSES (replaced w. 3 periods)
	0xd5	’	'	7	Mac OS Roman, RIGHT SINGLE QUOTATION MARK (replaced w. apostrophe)
	0x92	’	'	39	Windows-1252, RIGHT SINGLE QUOTATION MARK (replaced w. apostrophe)
0xd4		Ô	u	182	UTF-8, Part of a ÔøΩ sequence all replaced w. a single "u"
0xf8		ø		182	Part of a sequence all replaced w. a single "u"
0x03a9		Ω		182	Part of a sequence all replaced w. a single "u"
0xb5		µ	u	10	UTF-8, MICRO SIGN (replaced w. lower case "u")
	0xa0			91	Windows-1252, NO-BREAK SPACE (replaced w. a regular space)
	0x91	‘	'	12	Windows-1252, LEFT SINGLE QUOTATION MARK (replaced w. apostrophe)
	0x83	É	E	3	Mac OS Roman, LATIN CAPITAL LETTER E WITH ACUTE (replaced with capital E)
	0xe0	à	a	1	Windows-1252, LATIN SMALL LETTER A WITH GRAVE (replaced with small a)
	0xc6	∆	'	1	Mac OS Roman, capital delta, Used as an apostrophe (replaced with apostrophe)
0x01f412		🐒	none	1	Monkey emoji, Removed
	0xca			8	Mac OS Roman, non-breaking space, Appears to be used in place of EOL (replaced with regular space)
	0xb5	µ	u	23	Windows-1252, MICRO SIGN (replaced w. lower case "u")

Note that it was only possible to replace the UTF-8 HORIZONTAL ELLIPSES with 3 periods because the character only showed up in ALLMISCS.Txt for ALMID 10478, which has unlimited length.

* This is the number of times the encoding shows up in the Babase production schemas.

Unicode (UTF-8) Normalization

Non-UTF-8 encoded characters are replaced with UTF-8 encoded characters.

For the most part the replacement glyphs are identical but there are some cases where the usage of the glyph made clear that a different glyph should be chosen. This is most clear in notes appended to the rightmost column.

From Bit Pattern	To UTF-8 Codepoint	Count*	Glyph	Original Encoding
0x8e	0x00e9	12	é	Mac OS Roman, LATIN SMALL LETTER E WITH ACUTE
0x92	0x2019	57	’	Windows-1252, RIGHT SINGLE QUOTATION MARK
0x90	0x00ea	5	ê	Mac OS Roman, LATIN SMALL LETTER E WITH CIRUMFLEX
0x8b	0x00e3	7	ã	Mac OS Roman, LATIN SMALL LETTER A WITH TILDE
0x87	0x00e1	4	á	Mac OS Roman, LATIN SMALL LETTER A WITH ACUTE
0x97	0x00f3	2	ó	Mac OS Roman, LATIN SMALL LETTER O WITH ACUTE
0x8d	0x00e7	1	ç	Mac OS Roman, LATIN SMALL LETTER O WITH CEDILLA
0xd5	0x2019	7	’	Mac OS Roman, RIGHT SINGLE QUOTATION MARK
0xa0	x00a0	91		Windows-1252, NO-BREAK SPACE
0x91	x2018	12	‘	Windows-1252, LEFT SINGLE QUOTATION MARK
0x83	0x00c9	4	É	Mac OS Roman, LATIN CAPITAL LETTER E WITH ACUTE
0xe0	0x00e0	1	à	Windows-1252, LATIN SMALL LETTER A WITH GRAVE
0xc6	0x27	1	∆	Mac OS Roman, capital delta, Used as an apostrophe (replaced with apostrophe)
0xca	0x20	8		Mac OS Roman, non-breaking space, Appears to be used in place of EOL (replaced with space)
0xb5	0x00b5	2,872	µ	Windows-1252, MICRO SIGN
0xc9	0x2026	5	…	Mac OS Roman, HORIZONTAL ELLIPSES
0xaf	0x30	2	Ø	Windows-1252, LATIN CHARACTER O WITH A STROKE (replaced with 0 [zero] glyph)
0xf3	0x00f3	156	ó	Windows-1252, LATIN SMALL LETTER O WITH ACUTE
0xe3	0x00e3	258	ã	Windows-1252, LATIN SMALL LETTER A WITH TILDE
0xe9	0x00e9	15	é	Windows-1252, LATIN SMALL LETTER E WITH ACUTE
0xed	0x00ed	5	í	Windows-1252, LATIN SMALL LETTER I WITH ACUTE
0xea	0x00ea	6	ê	Windows-1252, LATIN SMALL LETTER E WITH CIRCUMFLEX
0xe1	0x00e1	9	á	Windows-1252, LATIN SMALL LETTER A WITH ACUTE
0xe7	0x00e7	3	ç	Windows-1252, LATIN SMALL LETTER C WITH CEDILLA
0xfa	0x00fa	1	ú	Windows-1252, LATIN SMALL LETTER U WITH ACUTE
0xf4	0x00f4	1	ô	Windows-1252, LATIN SMALL LETTER O WITH CIRCUMFLEX
0xcd	0x00cd)	2	Í	Windows-1252, LATIN CAPITAL LETTER i WITH ACUTE
0xe2	0x00e1	2	á	Windows-1252, LATIN SMALL LETTER A WITH ACUTE
0xc1	0x00c1	1	Á	Windows-1252, LATIN CAPITAL LETTER A WITH ACUTE
0xc7	0x00c7	1	Ç	Windows-1252, LATIN CAPITAL LETTER C WITH CEDILLA
0x85	0x2026	10	…	Windows-1252, HORIZONTAL ELLIPSES

* This is the number of times the encoding shows up anywhere in the Babase database, in any schema. This does not reflect the number of changes made, because the encodings in the Babase production schemas were changed to ASCII these encodings were already changed.

Background

When Babase was re-implemented in Postgres UTF-8 support was not widespread. The choice was made to give the database a C (US-ASCII) character encoding and collation. This was done both for performance reasons and because the expectation was that only US-ASCII characters would appear in the database content.

Postgres treats the ASCII encoding specially. It allows any bit patterns to be used in textual content and does not make any assumptions about the encoding of text.

When people used different systems (Mac OS, MS Windows, etc.) to enter data, the textual data was encoded in whatever encoding the given system used at the time. We have no record of what encodings were used, so no real way to know the intended character glyph of any bit pattern. For the most part, the intended glyph was obvious -- it rendered to something expected when displayed. None-the-less, having the database contain different bit-patterns that render to the same glyph on display opens the door to error an confusion.

With the release of Postgres 16 errors could be raised when characters from the database server, which did not care about encoding, were given to the database front-end, which does care. This prompted the character set normalization.

The Babase database is now encoded using UTF-8, and the database enforces the requirement that all textual data be UTF-8 encoded.