Saturday, 7 January 2017

The Basics: Diacritics and ALT Codes

 Text with colourful accents
This post was inspired by Lori Kaufman's post on How To Geek about adding symbols.  She writes a regular column on how to do useful things with common applications, and it's very much worth following.

It might seem odd to label a post about diacritics and ALT codes under "The Basics" when most people don't know what the word "diacritic" means and have never used an ALT code.  But you've seen diacritics, and quite possibly used them, even if you don't know what they are, so you definitely need to know how to add them when you're writing.  And the easiest way to add them is with ALT codes.


So, what is a diacritic?

(Warning: If you've studied linguistics, philology or a European language to any extent you can probably skip down to the part about how to add them to your documents.  For everyone else, read on.)

A diacritic is a symbol or glyph added to a letter, primarily to show how that letter is to be pronounced in the context of the word. If you ever studied French at school you'll be familiar with accents such as ´ (acute) or ` (grave); accents are a type of diacritical mark. Those of you with a more Germanic bent will be familiar with the ¨ (umlaut), but even if you've never spoken a language other than English you should still be familiar with the diacritics in façade and naïve (called a cedilla and a diaeresis respectively).  A good example of how a diacritic changes the pronunciation of a word is "résumé" (also known as a CV), where only the accents distinguish it from "resume" (to begin again after a pause).
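For the curious, here is a minimal Python sketch of the same idea: a marked letter can be treated as a base letter plus a separate mark. It uses the standard unicodedata module; the function name split_diacritics is made up for this illustration.

```python
import unicodedata


def split_diacritics(word):
    """Return the bare letters of a word and the names of its marks.

    split_diacritics is a hypothetical helper written for this example.
    """
    # NFD normalisation separates each marked letter into a base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    marks = [unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)]
    return base, marks


for word in ("façade", "naïve", "résumé"):
    print(word, split_diacritics(word))
```

Run against "résumé", the bare letters come back as "resume", which is exactly the ambiguity the accents are there to prevent.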

There is plenty of material available on the use of diacritics if you want to delve into it a bit more (and please do, they're fascinating), but what we're concerned with here is how to add these marks when using standard technical communication tools. 

Adding Diacritics to your Content

We'll focus on 3 types of documentation tool: word processing, content authoring, and wikis.  Between them, these cover the vast majority of types of tools that are used for professional technical communication (we're ignoring things like JavaDocs or Swagger on the basis that code doesn't have accents). There are a lot of different tools out there, so I'll focus on the following examples:
  • Word (word processing)
  • FrameMaker (content authoring)
  • Confluence (wiki)
Each of these tools provides built-in fonts and/or glyphs that cover diacritics, as will any decent word processing or content authoring tool.  Wikis are often a bit different because they vary in how much of a GUI or WYSIWYG interface they offer.  As a general rule, the less GUI, the less built-in support you'll get.  Confluence does have built-in diacritics, but once we've covered the 3 tool examples we'll look at a more general method for typing diacritics.

Word

Word - of course - provides pretty much complete support for every conceivable diacritical mark, of which, when combined with every letter of every currently used alphabet, there are a very large number indeed.  For common words like "facade", Word will simply autocorrect them, in this case to "façade ".  That will probably cover you for most English words, simply because English has only a small number of words with diacritical marks.  But it's always better to know how to do things manually if you have to, so in Word go to Insert on the ribbon and on the far right hand side click the down arrow next to Symbol.  On the small dropdown that appears click More Symbols to open the Symbol dialogue:


(For you keyboard shortcut aficionados, you can also open this dialogue by pressing ALT, then N, U and M in sequence.)

By default, this dialogue will open with a Font value of (normal text) and a Subset value of Basic Latin. This will display the familiar Latin alphabet and if you scroll down you'll find lots of diacritically marked letters, grouped into language families.  For example, the French diacritics are together, the Slavic ones are together, and so on.

Select the Font you want, if it's different from the font you're currently using, and scroll down the list.  You will almost certainly find the letter and diacritic combination you want, whether it's the familiar Latin-derived alphabet, Cyrillic, or Greek.  If you want to know the official name of a symbol, click the character and the Unicode name will be displayed under the Recently used symbols box.  In the screenshot above, I've selected "Latin Small Letter A With Tilde".
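If you'd rather look up an official name outside Word, the names come from the Unicode standard and can be queried directly. The following is a small sketch using Python's standard unicodedata module, offered as an alternative and not as anything Word itself does.

```python
import unicodedata

# Ask for the official Unicode name of a character.
print(unicodedata.name("ã"))

# Go the other way: find a character from its official name.
print(unicodedata.lookup("LATIN SMALL LETTER A WITH TILDE"))

# Show the character's Unicode value in the usual U+XXXX notation.
print(f"U+{ord('ã'):04X}")
```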

FrameMaker

FrameMaker also provides a way of adding diacritical marks, but in a far smaller range than Word. The list of supported diacritics is as follows:
  • ´ (acute)
  • ` (grave)
  • ˜ (tilde)
  • ¨ (diaeresis)
  • ˆ (circumflex)
  • ^ (caret)
  • ° (ring)
  • ¸ (cedilla)
(For those of you who did a little German at school, a diaeresis is identical in formation to an umlaut, although they alter the word in slightly different ways).  FrameMaker allows you to enter these supported diacritics by using an Escape key sequence.  For example, to type an è (an e with a grave accent) you press the Escape key, then the ` (left quote) key, then the e key.  Unlike CTRL+ALT+DEL where all of the keys need to be pressed at the same time (i.e. in combination), in FrameMaker you need to press them one after the other (i.e. in sequence).
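The idea of typing a mark and then a letter can be sketched in a few lines of Python. This is an illustration only and not how FrameMaker works internally. Apart from the ` key for a grave accent, which is mentioned above, the key choices in the table are assumptions made up for this example, as are the names COMBINING_MARKS and compose.

```python
import unicodedata

# Hypothetical mapping from a "mark" key to a Unicode combining mark.
# Only the ` key for grave comes from the text above; the rest are
# invented for illustration and are not FrameMaker's real sequences.
COMBINING_MARKS = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "~": "\u0303",  # tilde
    ":": "\u0308",  # diaeresis
    "^": "\u0302",  # circumflex
    "*": "\u030A",  # ring
    ",": "\u0327",  # cedilla
}


def compose(mark_key, letter):
    """Combine a mark key and a letter, typed one after the other."""
    combined = letter + COMBINING_MARKS[mark_key]
    # NFC normalisation merges the letter and mark into one character
    # where Unicode defines a precomposed form.
    return unicodedata.normalize("NFC", combined)


print(compose("`", "e"))
print(compose(",", "c"))
```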

You can find the list of FrameMaker-supported diacritics and other symbols here.

Confluence

Confluence lies somewhere in the middle.  It provides more diacritics than FrameMaker, but substantially fewer than Word.  Like Word, you can select your diacritic from a modal panel rather than with a keyboard sequence and the following shows all of the diacritics and symbols that are available to select:
 


These are good examples of the type of functionality you'll find around diacritics in modern tech comm tools.  They range from the comprehensive to the minimal, but they're still better than some applications that provide no functionality at all.



What's the generic way of entering diacritics?

Different applications have different methods for entering diacritics, and different native support, which means more learning and a patchy experience.  So for those of you looking for a generic way that will work in any application, you need ALT codes.

ALT codes are a Windows method of typing special characters that aren't represented on the keyboard.  Holding down the ALT key and typing a number on the numeric keypad will produce the special character when you release ALT.  (The Option key on a Mac works differently and doesn't accept these numeric codes.)  For example, ALT+138 will type è, an e with a grave accent.  You can find a complete list of the basic ALT codes here.  Note that you need to use the numeric keypad - the square of numbers on the right hand side of a standard keyboard - to type these in.

However, there is a little more to it than that, because some ALT codes do the same thing.  If you type ALT+0232 you'll get the same è that you do with ALT+138, which seems...odd.  What's going on?


An Incredibly Brief History of ALT Codes

Historically, ALT codes were used on early PCs running Microsoft's DOS to access the characters that couldn't be typed using the standard keyboard.  The early DOS machines were based on the IBM architecture which used a character set called code page 437, and the ALT codes for these are ALT+1 to ALT+255.  In this character set ALT+138 gives you the è.  Because this character set comes from the architecture and not the operating system it's known as the OEM-encoded (Original Equipment Manufacturer) character set.  (The original 7-bit ASCII character set only went from 0 to 127; extending it to 8 bits provided space for non-English characters, which led to the classic 256 character set - known as the extended set - that is so commonly used.)

But 256 characters is quite limited, and other architectures slowly emerged as well, so Microsoft decided to use their own, additional character set, which is known variously as Windows ANSI, Windows-1252 or (loosely) ISO Latin-1.  To tell the two sets apart, codes from the Windows set are typed with a leading 0, and in this Windows-encoded character set ALT+0232 will produce è.
You can download an ALT code cheat sheet in PDF format here (Warning: link goes straight to the PDF) that includes both the OEM- and Windows-encoded ALT codes, grouped together in useful sections rather than a numeric list. 
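You can check the overlap between the two character sets without touching the ALT key. This short Python sketch decodes the same numbers using the two code pages, which Python calls cp437 (OEM) and cp1252 (Windows). It assumes a Western European or US setup; other locales use different code pages, so the results there may differ.

```python
# The number typed after ALT is looked up in a code page.
# Without a leading 0 it is the OEM set (cp437);
# with a leading 0 it is the Windows set (cp1252).
oem_138 = bytes([138]).decode("cp437")       # like ALT+138
windows_232 = bytes([232]).decode("cp1252")  # like ALT+0232
print(oem_138, windows_232, oem_138 == windows_232)

# Swap the code pages and the same numbers give different characters.
print(bytes([138]).decode("cp1252"))  # like ALT+0138
print(bytes([232]).decode("cp437"))   # like ALT+232
```

The first line prints è twice, while the swapped lookups print Š and Φ, which is why the leading 0 matters.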

Unicode

In modern incarnations of Windows, Unicode is used (stored internally in 16-bit units), as it's the global standard, but the original OEM- and Windows-encoded character sets are still there and the ALT codes for those still work.

Because the Unicode character set is vast (see here for some figures, but as a spoiler there are over 70,000 encoded Chinese characters alone), remembering even a tiny fraction of the codes that represent them is beyond mere mortals.  Luckily Windows includes a Character Map that will show you all of the available characters and their Unicode values.  To access this pre-Windows 10, type Character Map in Windows search.  In Windows 10 the Character Map is - for reasons known only to Microsoft - hidden away, and you'll have to launch it manually.  Hit Win+R and enter charmap, and up it'll pop:


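As you can see in the bottom left, the Unicode value is shown for the selected character, in this case U+0021.  That value is written in hexadecimal, whereas ALT codes are typed in decimal, so it needs converting first: hexadecimal 0021 is decimal 33, which gives ALT+0033.  This direct conversion holds for unaccented Latin characters and the common Western European accented letters (U+00A0 to U+00FF).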
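Character Map reports the Unicode value in hexadecimal, while ALT codes are typed in decimal, so a little arithmetic is involved. Here is a rough Python sketch of that arithmetic. The function name alt_codes is made up for this example, and it assumes the OEM set is code page 437 and the Windows set is Windows-1252, as described earlier. It also assumes a printable character, since the lowest OEM codes behave differently.

```python
def alt_codes(char):
    """Work out the ALT codes for one character, where they exist.

    alt_codes is a hypothetical helper; it assumes cp437 for the OEM
    set and cp1252 for the Windows set.
    """
    codes = {}
    try:
        # Windows-encoded codes are typed with a leading 0.
        codes["windows"] = f"ALT+{char.encode('cp1252')[0]:04d}"
    except UnicodeEncodeError:
        pass  # the character is not in the Windows set
    try:
        # OEM-encoded codes are typed without a leading 0.
        codes["oem"] = f"ALT+{char.encode('cp437')[0]}"
    except UnicodeEncodeError:
        pass  # the character is not in the OEM set
    return codes


# Hexadecimal 0021 (as shown by Character Map) is decimal 33.
print(int("0021", 16))
print(alt_codes("!"))
print(alt_codes("è"))
```

A character that is in neither set comes back as an empty result, which is a reasonable hint that no simple ALT code exists for it.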

If you switch on the Advanced view checkbox you'll also be able to choose a character set, group by types of ideograph and, most importantly, search for a character.  This is very useful when you've got so many characters to search through.  In the example below I've searched for "grave":


The Character Map is probably the single easiest way to search for and find the ALT code you need, and will allow you to handle any diacritics with ease.

The history and technical specification of character sets, and how they're used, is a vast topic.  What I've mentioned above is the briefest outline of a long, complex and thoroughly interesting subject.  If you're at all interested in the topic then these links are good jumping off points:
If you know of other good resources about character sets and encoding for the educated non-developer, put them in the comments.