Special Characters, Unicode and XML
XMetal 2 and 3 provide Special Characters
and Symbol toolbars which feature many of the most commonly required non-Latin
characters. If you need to produce a character in your XML not found in
these toolbars, construct a "character reference" for the character
as explained below.
A special character can be very broadly defined as
any non-Latin character that you want to include in an XML file. These characters
cannot be inserted as-is into an XML file without breaking attempts to display
the file or any HTML translation of it. They need to be represented in XML
in a standard way known as a "character reference" -- a numeric value which makes it
possible to preserve the character information across platforms, languages and
document iterations. Character references are separate from, but related to,
"entity references", which are easier to remember since they
approximate words but are of limited use since Netscape Navigator ignores them.
XMetal 2 and 3 will insert characters and character references for many of
the most commonly required special characters via a set of toolbars within the
software. There may be times you need to represent a character not included in
these toolbars. You can type character references into your XML documents by
hand via the directions below.
TARO participants decided early
in the planning stages to use Unicode character references when necessary.
Character reference tables are accessible on the Internet which list Unicode
numeric values in "decimal" and "hexadecimal" notation. The
hexadecimal value should be used for special characters in XML finding aids
prepared for submission to the TARO Archive. The most complete charts,
listing "hex" values, are available here in PDF format:
The four-place alphanumeric value you find for a character in these tables is
pasted into your document using this notation:
&#xtable-value-here;
| example | hexadecimal notation character reference | entity reference (for comparison and completeness -- DO NOT USE THIS) | description |
| … | &#x2026; | &hellip; | horizontal ellipsis |
| & | &#x0026; | &amp; | ampersand |
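As a rough illustration of the notation above, the following Python sketch builds the hexadecimal character reference for a single character. The function name hex_char_ref is our own invention for this example, and padding to four places is an assumption based on the four-place values found in the Unicode charts; it is not part of XMetal or any TARO tool.

```python
def hex_char_ref(ch: str) -> str:
    """Return a hexadecimal character reference for one character.

    Hypothetical helper for illustration only.
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # ord() gives the Unicode code point; format it as at least
    # four uppercase hexadecimal digits, e.g. 0x26 -> "0026".
    return f"&#x{ord(ch):04X};"


print(hex_char_ref("\u2026"))  # horizontal ellipsis, prints &#x2026;
print(hex_char_ref("&"))       # ampersand, prints &#x0026;
```

The output should still be checked against the published charts before it is pasted into a finding aid.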
It needs to be stressed that this coding in no way ensures
proper display of the special character in any given environment. Keep this in
mind as you decide how to represent any given character in your XML document.
It is still the case that beyond the extended Latin character set, many
characters will fail to display on most systems.
Error Checking Special Character Displays
What browser are you using? When we
receive emails or calls regarding characters not displaying correctly, this is
the first question we now ask. Netscape Navigator has a particular problem
displaying well-formed, standard entity references (i.e., things like &ldquo; and &rdquo; and &hellip;)
which are rendered without incident in Opera or Internet Explorer. We STRONGLY
suggest that any worthwhile error checking needs to be done with something other
than Navigator.
In reference to the TARO project, Apex inserted proper
Unicode values (in decimal notation, such as &#165; to represent the yen sign)
for most, but not all, special characters. We will leave it to the repositories'
discretion as to whether they wish to replace the Apex decimal notation with
the preferred hexadecimal.
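For repositories that do choose to convert, one possible approach is a small script along the following lines. This is only a sketch: the function name decimal_to_hex_refs and the regular expression are our own assumptions, and the result should be reviewed by hand before a file is submitted.

```python
import re

# Matches decimal character references such as &#165;
# (hexadecimal references like &#x00A5; are left alone).
DECIMAL_REF = re.compile(r"&#([0-9]+);")


def decimal_to_hex_refs(text: str) -> str:
    """Rewrite decimal character references in hexadecimal notation.

    Hypothetical helper for illustration only.
    """
    def to_hex(match: re.Match) -> str:
        code_point = int(match.group(1))
        # Pad to at least four uppercase hexadecimal digits.
        return f"&#x{code_point:04X};"

    return DECIMAL_REF.sub(to_hex, text)


print(decimal_to_hex_refs("1,000 &#165;"))  # prints 1,000 &#x00A5;
```

The conversion changes only the notation; both forms refer to the same Unicode character.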
Repositories need to examine their XML files for two distinct types of
errors:
1. Special characters left uncoded by Apex
These may be difficult to detect without comparison to the paper originals.
2. Special characters miscoded by Apex
Here we are referring to instances where the Unicode data for a special
character is included in the XML file but it is either misapplied or
misspelled. One example which Apex consistently miscoded is the ampersand in
Texas A&M. Unfortunately, this type of error is harder to detect. A
misapplied Unicode representation will not break the raw XML. Rather than
rendering the special character, a typographical error in Unicode data will
simply appear as text within the XML document. One possible solution is to load
the XML file into Internet Explorer 5 or higher and use the browser's
"FIND" function to scan for the pound sign (#) within the document.
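The pound-sign scan described above can also be approximated outside the browser. The Python sketch below lists every line that contains a pound sign (#) that is not part of a well-formed numeric character reference. The function name and the pattern are hypothetical, chosen for this example. Legitimate uses of the pound sign (for instance "Box #3") will be reported as well, so each hit still needs a human look.

```python
import re

# A well-formed numeric character reference: &#165; or &#x00A5;
VALID_REF = re.compile(r"&#(?:[0-9]+|x[0-9A-Fa-f]+);")


def find_suspect_pound_signs(text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that need review.

    Hypothetical helper for illustration only. A line is reported
    when a '#' remains after all well-formed references are removed.
    """
    suspects = []
    for number, line in enumerate(text.splitlines(), start=1):
        leftover = VALID_REF.sub("", line)
        if "#" in leftover:
            suspects.append((number, line))
    return suspects


sample = "Price: &#165;100\nPrice: &#l65;100\nPrice: &#165 100\nNo references"
for number, line in find_suspect_pound_signs(sample):
    print(number, line)
```

In the sample, the second line (a letter "l" typed for the digit "1") and the third (a missing semicolon) are reported, while the correctly coded first line is not. This scan does not catch characters left uncoded, or a bare ampersand such as the one in Texas A&M.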