How Character Encodings and Textual Data Types Were Normalized

Overview

As of 2024-12-20 there were some characters in the database's textual fields that used inconsistent character set encodings, the bit patterns that are used to represent character glyphs. For example, the database contains the glyph "é" encoded using 3 different bit patterns, one in the Windows-1252 character encoding, one in the Mac OS Roman character encoding, and a third as Unicode UTF-8.

For the "Babase production schemas" in the "babase" database, fixing this means switching to only US ASCII characters -- removing all accents, etc. For all other schemas and all other databases this means converting all character encodings to the UTF-8 Unicode encoding.

Note that copying any of the Babase production schemas from the babase database to any other database will copy not only the encoded data but also copy the encoding requirements. So, eventually, all the Babase production schemas in any of the databases will be ASCII-only.

As part of the normalization process some "fixed-length" character columns had their data type changed. This is not expected to impact anybody, but it could. See below for more on how this might, possibly, impact you.

What This Means To You

To begin with, searching for non-ASCII characters (anything not on a US keyboard) will no longer find anything in the Babase production schemas.

Searching for text containing non-7bit-ASCII characthers, say, an "é" character, must be done with SQL written in UTF-8 encoded characters. This is not expected to be an issue because UTF-8 has become the standard. But there still could be some impact; because the encoding is now all UTF-8, a query searching for, say, "é" might now find all the "é" glyphs where a previous query would find only those encoded in UTF-8.

There are also issues involving sorting. Changing the character encoding used in the database will change the way text is sorted, so an "é" that was previously encoded in a non-UTF-8 encoding will now sort differently because it has been changed to a UTF-8 encoding.

What Encoding Normalization Means for Sorting and Equality Comparisons

Although character encodings were standardized, at the time of this writing the database sorts and compares characters based on the bit patterns used to encode them. This means that non-ASCII characters, those with graves or accents or emojis, will not compare equal to their non-accented counterparts. Further, these characters may not sort as expected.

The choice to leave collation and equality untouched was made both in the interest of backwards compatibility and because the "Babase production schemas" are intended to be ASCII-only. This choice also improves performance, although it is not known whether this is significant.

What Changing From a "Fixed-Width" Data Type Means

The Postgres "fixed-width" data type (CHAR) has some unusual properties. Using another data type instead makes the database behavior more "normal".

Queries on "fixed-width" data columns that are more than one character long could be affected.

Postgres ignores all trailing spaces of "fixed-width" character columns. So, the following 2 queries previously produced identical results (they no longer do):

SELECT * FROM acts WHERE act = 'A';

SELECT * FROM acts WHERE act = 'A ';

It is unlikely that you write queries with extra trailing spaces, but it is possible that some of the data you have in your private schema contains such trailing spaces. If the datatype used in your private schema is CHAR then the trailing spaces in your data are ignored. This is likely the case if you copied tables from the babase schema, so this change is not expected to cause any problems.

(The recommendation is to avoid using the CHAR data type in your personal schemas.)

Columns Affected By The Data Type Change

This list does not include columns that were "fixed-width" but limited to a single character. Nor does it list the columns in views which might utilize these columns.

The Babase "Production Schemas"

The following schemas, in and only in the babase database, were switched to US ASCII only:

All other schemas in the babase database and all schemas in all other databases had their non-Unicode character encodings normalized to the UTF-8 Unicode encoding.

The Process

A database dump was examined and all characters with non-ASCII character encodings were reported. These were then searched for in the database dump and the intended character glyph was guessed at, by examination of the surrounding text. It was assumed that the encodings used were either UTF-8, Windows-1252, or Mac OS Roman. Bit patterns were looked up in these encodings to guess at the intended glyph.

It is unlikely, but possible, that the glyph chosen for a particular bit pattern is the wrong choice. It is also possible that the same bit pattern, occurring more than once in the data, was intended to represent different glyphs, because different operating systems were used to enter the same bit pattern. Only a sampling of any one bit pattern was examined when glyphs were chosen. Or, it could be that a glyph from a character encoding that was not considered is the correct glyph in some particular case. This document lists below the details of bit-patterns, encodings, and glyphs in the event any particular choice should be re-examined.

The database was dumped, the result run through a program which normalizes the character encodings, and the result restored into the database. The programs used to do this are in the "util" directory under revision control.

ASCII Normalization

Non-7bit-ASCII encoded characters are replaced with 7bit-ASCII encodings.

For the most part the replacement glyphs are identical but there are some cases where the usage of the glyph made clear that a different glyph should be chosen.

Unicode
Codepoint

Bit
Pattern

From
Glyph

To
Glyph

Count*

Original Encoding

0x2026

...

1

UTf-8, HORIZONTAL ELLIPSES (replaced w. 3 periods)

0xd5

'

7

Mac OS Roman, RIGHT SINGLE QUOTATION MARK (replaced w. apostrophe)

0x92

'

39

Windows-1252, RIGHT SINGLE QUOTATION MARK (replaced w. apostrophe)

0xd4

Ô

u

182

UTF-8, Part of a ÔøΩ sequence all replaced w. a single "u"

0xf8

ø

182

Part of a sequence all replaced w. a single "u"

0x03a9

Ω

182

Part of a sequence all replaced w. a single "u"

0xb5

µ

u

10

UTF-8, MICRO SIGN (replaced w. lower case "u")

0xa0

91

Windows-1252, NO-BREAK SPACE (replaced w. a regular space)

0x91

'

12

Windows-1252, LEFT SINGLE QUOTATION MARK (replaced w. apostrophe)

0x83

É

E

3

Mac OS Roman, LATIN CAPITAL LETTER E WITH ACUTE (replaced with capital E)

0xe0

à

a

1

Windows-1252, LATIN SMALL LETTER A WITH GRAVE (replaced with small a)

0xc6

'

1

Mac OS Roman, capital delta, Used as an apostrophe (replaced with apostrophe)

0x01f412

🐒

none

1

Monkey emoji, Removed

0xca

8

Mac OS Roman, non-breaking space, Appears to be used in place of EOL (replaced with regular space)

0xb5

µ

u

23

Windows-1252, MICRO SIGN (replaced w. lower case "u")

Note that it was only possible to replace the UTF-8 HORIZONTAL ELLIPSES with 3 periods because the character only showed up in ALLMISCS.Txt for ALMID 10478, which has unlimited length.

* This is the number of times the encoding shows up in the Babase production schemas.

Unicode (UTF-8) Normalization

Non-UTF-8 encoded characters are replaced with UTF-8 encoded characters.

For the most part the replacement glyphs are identical but there are some cases where the usage of the glyph made clear that a different glyph should be chosen. This is most clear in notes appended to the rightmost column.

From Bit
Pattern

To UTF-8
Codepoint

Count*

Glyph

Original Encoding

0x8e

0x00e9

12

é

Mac OS Roman, LATIN SMALL LETTER E WITH ACUTE

0x92

0x2019

57

Windows-1252, RIGHT SINGLE QUOTATION MARK

0x90

0x00ea

5

ê

Mac OS Roman, LATIN SMALL LETTER E WITH CIRUMFLEX

0x8b

0x00e3

7

ã

Mac OS Roman, LATIN SMALL LETTER A WITH TILDE

0x87

0x00e1

4

á

Mac OS Roman, LATIN SMALL LETTER A WITH ACUTE

0x97

0x00f3

2

ó

Mac OS Roman, LATIN SMALL LETTER O WITH ACUTE

0x8d

0x00e7

1

ç

Mac OS Roman, LATIN SMALL LETTER O WITH CEDILLA

0xd5

0x2019

7

Mac OS Roman, RIGHT SINGLE QUOTATION MARK

0xa0

x00a0

91

Windows-1252, NO-BREAK SPACE

0x91

x2018

12

Windows-1252, LEFT SINGLE QUOTATION MARK

0x83

0x00c9

4

É

Mac OS Roman, LATIN CAPITAL LETTER E WITH ACUTE

0xe0

0x00e0

1

à

Windows-1252, LATIN SMALL LETTER A WITH GRAVE

0xc6

0x27

1

Mac OS Roman, capital delta, Used as an apostrophe (replaced with apostrophe)

0xca

0x20

8

Mac OS Roman, non-breaking space, Appears to be used in place of EOL (replaced with space)

0xb5

0x00b5

2,872

µ

Windows-1252, MICRO SIGN

0xc9

0x2026

5

Mac OS Roman, HORIZONTAL ELLIPSES

0xaf

0x30

2

Ø

Windows-1252, LATIN CHARACTER O WITH A STROKE (replaced with 0 [zero] glyph)

0xf3

0x00f3

156

ó

Windows-1252, LATIN SMALL LETTER O WITH ACUTE

0xe3

0x00e3

258

ã

Windows-1252, LATIN SMALL LETTER A WITH TILDE

0xe9

0x00e9

15

é

Windows-1252, LATIN SMALL LETTER E WITH ACUTE

0xed

0x00ed

5

í

Windows-1252, LATIN SMALL LETTER I WITH ACUTE

0xea

0x00ea

6

ê

Windows-1252, LATIN SMALL LETTER E WITH CIRCUMFLEX

0xe1

0x00e1

9

á

Windows-1252, LATIN SMALL LETTER A WITH ACUTE

0xe7

0x00e7

3

ç

Windows-1252, LATIN SMALL LETTER C WITH CEDILLA

0xfa

0x00fa

1

ú

Windows-1252, LATIN SMALL LETTER U WITH ACUTE

0xf4

0x00f4

1

ô

Windows-1252, LATIN SMALL LETTER O WITH CIRCUMFLEX

0xcd

0x00cd)

2

Í

Windows-1252, LATIN CAPITAL LETTER i WITH ACUTE

0xe2

0x00e1

2

á

Windows-1252, LATIN SMALL LETTER A WITH ACUTE

0xc1

0x00c1

1

Á

Windows-1252, LATIN CAPITAL LETTER A WITH ACUTE

0xc7

0x00c7

1

Ç

Windows-1252, LATIN CAPITAL LETTER C WITH CEDILLA

0x85

0x2026

10

Windows-1252, HORIZONTAL ELLIPSES

* This is the number of times the encoding shows up anywhere in the Babase database, in any schema. This does not reflect the number of changes made, because the encodings in the Babase production schemas were changed to ASCII these encodings were already changed.

Background

When Babase was re-implemented in Postgres UTF-8 support was not widespread. The choice was made to give the database a C (US-ASCII) character encoding and collation. This was done both for performance reasons and because the expectation was that only US-ASCII characters would appear in the database content.

Postgres treats the ASCII encoding specially. It allows any bit patterns to be used in textual content and does not make any assumptions about the encoding of text.

When people used different systems (Mac OS, MS Windows, etc.) to enter data, the textual data was encoded in whatever encoding the given system used at the time. We have no record of what encodings were used, so no real way to know the intended character glyph of any bit pattern. For the most part, the intended glyph was obvious -- it rendered to something expected when displayed. None-the-less, having the database contain different bit-patterns that render to the same glyph on display opens the door to error an confusion.

With the release of Postgres 16 errors could be raised when characters from the database server, which did not care about encoding, were given to the database front-end, which does care. This prompted the character set normalization.

The Babase database is now encoded using UTF-8, and the database enforces the requirement that all textual data be UTF-8 encoded.

CharacterNormalization (last edited 2024-12-20 20:10:34 by KarlPinc)

Wiki content based upon work supported by the National Science Foundation under Grant Nos. 0323553 and 0323596. Any opinions, findings, conclusions or recommendations expressed in this material are those of the wiki contributor(s) and do not necessarily reflect the views of the National Science Foundation.