A Question of Characters

Sunday, 31 May 2015

At various times, I'm confronted by persons and by systems that confuse characters with glyphs. Most of the time, that confusion is a very minor annoyance; sometimes, as when wrestling with the preparation of a technical document, it can cause many hours of difficulty.

It's probably rather easier for people first to see that a character may have multiple glyphs. For example, here are two distinct yet common glyphs for the lower-case letter a: and here are two for g:

People have a bit more trouble with the idea that a single glyph can correspond to more than one character. Perhaps most educated folk generally understand that a Greek Ρ is not our P, even though one could easily imagine an identical glyph being used in some fonts. But many people think that they're looking at an o with an umlaut in each of these two words: coöperate and schön; whereäs the two dots over the o in the first word are a diæresis, an ancient diacritical mark used in various languages to clarify whether and how a vowel is pronounced.[1] The two dots over the o in the German schön are indeed an umlaut, which evolved far more recently from a superscript e.[2] (One may alternatively write the same word schoen, whereäs schon is a different word.)

Out of context, what one sees is a glyph. Generally, we need context to tell us whether we're looking at Ϲ (upper-case lunate sigma), our familiar C, or С (upper-case Cyrillic ess); likewise for many other characters and their similar or identical glyphs. Until comparatively recently, we usually had sufficient context; mistakes were relatively infrequent and usually unimportant. (Okay, so a bunch of people thought that the Soviet Union called itself the CCCP, rather than the СССР. Meh.) But, with the development of electronic information technology, and with globalization, the distinction becomes more pressing. Most of us have seen the problems of OCR; these are essentially problems of inferring characters from glyphs. It's not so messy when converting instead from plain-text or from something such as ODF, but when character substitutions were made based upon similarity or identity of glyph, the very same problems can then arise. For example, as I said, one sees glyphs, but what is heard when the text is rendered audible will be phonetic values associated with the characters used. And sometimes the system will process a less-than sign as a left angle bracket, because everyone else is using it as such. In an abstract sense, these are of course problems of transliteration, and of its effects upon translation.
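For those who want to see the distinction as software sees it, here is a minimal sketch in Python, using only the standard unicodedata module. The helper name describe and the variable names are my own inventions for illustration; the point is merely that look-alike glyphs can hide different code points, and that a machine compares the characters, not the shapes.

```python
import unicodedata

# Three distinct characters whose glyphs are near-identical in many fonts:
# upper-case lunate sigma, Latin C, and Cyrillic ess.
LOOKALIKES = ["\u03f9", "\u0043", "\u0421"]


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in LOOKALIKES:
    print(ch, describe(ch))

# The Latin and Cyrillic spellings may look alike on screen,
# but they are different strings of characters.
latin = "CCCP"
cyrillic = "\u0421\u0421\u0421\u0420"
print(latin == cyrillic)
```

A screen reader, a search index, or a spelling checker handed the Latin string will treat it quite differently from the Cyrillic one, however similar the two may appear in print.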

Some of you will recognize the contrast between character and glyph as a special case of the contrast between content and presentation — between what one seeks to deliver and the manner of delivery. Some will also note that the boundary between the two shifts. For example, the difference between upper-case and lower-case letters originated as nothing more than a difference in glyphs. Indeed, our R was once no more than a different way of writing the Greek Ρ; our A simply was the Greek Α, and it can remain hard to distinguish them! I don't know that ſ (long ess) should be regarded as a different character from s, rather than just as an archaïc glyph thereof.
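The shifting boundary shows up in how Unicode treats these very examples. The following is a small sketch, again in Python with the standard unicodedata module, and the variable names are mine. Unicode encodes the long ess as a character of its own, yet its compatibility normalization and its case mappings fold it into the ordinary s; the Greek and Latin capitals, by contrast, stay distinct however alike they look.

```python
import unicodedata

LONG_S = "\u017f"       # LATIN SMALL LETTER LONG S
GREEK_ALPHA = "\u0391"  # GREEK CAPITAL LETTER ALPHA
GREEK_RHO = "\u03a1"    # GREEK CAPITAL LETTER RHO

# Encoded as a separate character from s ...
print(LONG_S == "s")

# ... but compatibility normalization and case mapping treat it
# as a presentation variant of s.
print(unicodedata.normalize("NFKC", LONG_S) == "s")
print(LONG_S.upper() == "S")

# The Greek capitals remain distinct characters from their Latin
# look-alikes, even after normalization.
print(unicodedata.normalize("NFKC", GREEK_ALPHA) == "A")
print(unicodedata.normalize("NFKC", GREEK_RHO) == "P")
```

Whether that folding is the right call for a given text is exactly the sort of question that the distinction between content and presentation raises.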

Still, the fact that what is sometimes mere presentation may at other times be content doesn't mean that we should forgo the gains to be had in being mindful of the distinction and in creating structures that often help us to avoid being shackled to the accidental.

[1] In English and most other languages, a diæresis over the second of two vowels indicates that the vowel is pronounced separately, rather than forming a diphthong. (So here /koˈapəˌret/ rather than /ˈkupəˌret/ or /ˈkʊpəˌret/.) Over a vowel standing alone, as in Brontë, the diæresis signals that the vowel is not silent. (In English and some other languages, a grave accent may be used to the very same effect.) Portuguese cleverly uses a diæresis over the first of two vowels to signal that a diphthong is formed where it might not be expected.

[2] Germans used to use a dreadful script — Kurrentschrift — in which such an evolution is less surprising.