[ Main Index | Countries and Currencies | Currency names | Currency Abbreviations | Currency Symbols | Character Set Restrictions ]
Last updated: 10-May-2003
The situation is made worse by the fact that many modern applications dealing with these media are written by people who are unaware of, do not understand or (worse) deliberately ignore Internet standards and conventions. One of the worst offenders in this regard ironically claims that the Internet was designed to be compatible with its software (even though users of its products are regularly lambasted for using "broken" software that ignores and breaks most Internet standards and conventions).
Although the Internet transmits information in 8-bit chunks, since there were no 8-bit character sets around at the time, the designers of the Internet had to choose one of the many 7-bit character sets in use around the world as the standard character set for many protocols. The natural decision (since they were in the US) was to use US ASCII (ISO 646-US) as the standard character set for these protocols (such as e-mail and later news).
The rules presented here define what it is possible to do and still be certain that you will not cause problems for others. In some cases you can break the rules without any apparent problem (it may cause problems for some but if they don't complain you won't know about it) in the same way that you can often break the speed limit and not get caught. As with the speed limit, having got away with it several times does not make it a good idea.
Some people don't particularly care that they may cause problems for others. In news posts a common attitude is that if 90% (or 75%, or whatever) of people can read their posts then that is good enough. This is a very selfish attitude if you have something to say that others would like to read and a very foolish attitude if you are requesting help (it is possible that the only person with the answer is one of those you have excluded by posting something he can't read). In practice you'll find that if you break the rules a large proportion of those with long experience of the Internet will ignore your posts whether they can read them or not, on the principle that if you're so selfish or foolish as to exclude those with older software your posts have nothing worthwhile to offer anyway.
It takes a little more effort to follow the rules, but if you have something to say you surely wish as many people as possible to read it. If not, why bother in the first place?
!"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
Also valid are the space character and the tab character.
You should not explicitly use the carriage-return or line-feed control characters: when you start a new line in e-mail or news your TCP/IP software translates whatever your mailer/newsreader uses to start a new line into the standard end-of-line convention used in transmitting e-mail and news (carriage-return followed by line-feed). The effect of inserting an isolated carriage-return or line-feed (whichever of them happens not to be used to indicate new lines on your computer) is unpredictable.
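As a rough illustration, the 7-bit rule above can be checked mechanically. The following is a minimal sketch in Python, not something from the original rules: the function name is_safe_7bit is hypothetical, and it assumes the text has already been normalised so that "\n" alone marks the end of a line (as Python's text mode does when reading a file).

```python
def is_safe_7bit(text: str) -> bool:
    """Return True if the text uses only printable US ASCII, space and tab.

    Assumption: "\n" is the only line separator. Any other control
    character, including an isolated carriage-return, is rejected.
    """
    for line in text.split("\n"):
        for ch in line:
            code = ord(ch)
            # 9 is tab; 32-126 covers space and the printable characters.
            if not (code == 9 or 32 <= code <= 126):
                return False
    return True


if __name__ == "__main__":
    print(is_safe_7bit("Price: $100\tper item\nSecond line"))
    print(is_safe_7bit("only \u00a3250,000"))
```

A check like this could be run over a message before sending; anything it rejects needs rewording in plain ASCII.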
!"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß
àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
Also valid are the space character, the tab character, the hard space and the soft hyphen. The hard space character (character code 160 decimal) is identical in appearance to the ordinary space but applications which automatically format text should not break a line at a hard space. The soft hyphen (character code 173 decimal) should be discarded by all applications unless they automatically format text, in which case it indicates a suitable point to break a line and display a hyphen (many applications either fail to discard the soft hyphen when they should or fail to treat it as a potential break point).
You should not explicitly use the carriage-return or line-feed control characters: when you start a new line in e-mail or news your TCP/IP software translates whatever your mailer/newsreader uses to start a new line into the standard end-of-line convention used in transmitting e-mail and news (carriage-return followed by line-feed). The effect of inserting an isolated carriage-return or line-feed (whichever of them happens not to be used to indicate new lines on your computer) is unpredictable.
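The behaviour described above for the hard space and the soft hyphen can be sketched in a few lines of Python. This is only an illustration under stated assumptions: both function names are hypothetical, and a real formatter would do far more than list candidate break points.

```python
from typing import List

HARD_SPACE = "\u00a0"   # character code 160 decimal
SOFT_HYPHEN = "\u00ad"  # character code 173 decimal


def for_unformatted_display(text: str) -> str:
    """Discard soft hyphens, as an application that does not reflow text should."""
    return text.replace(SOFT_HYPHEN, "")


def break_points(text: str) -> List[int]:
    """Return the indexes at which a reflowing application may break a line.

    Ordinary spaces and soft hyphens qualify; hard spaces do not.
    """
    return [i for i, ch in enumerate(text) if ch in (" ", SOFT_HYPHEN)]


if __name__ == "__main__":
    sample = "10" + HARD_SPACE + "km per" + SOFT_HYPHEN + "haps now"
    print(for_unformatted_display(sample))
    print(break_points(sample))
```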
Although the e-mail transfer protocol has been extended to allow the transfer of 8-bit characters, a significant fraction of mail transfer agents (mail sending software, mail receiving software and mail gateways) are still restricted to 7 bits. So even if your software is capable of sending 8-bit characters, they may not arrive at the destination intact: 8-bit characters may be discarded entirely or may mutate into 7-bit characters by having the top bit set to zero. This means that if you write:
German Schloß for sale in Köln, only £250,000.
it may turn into (if 8-bit characters are discarded):
German Schlo for sale in Kln, only 250,000.
or (if 8-bit characters have their top bit set to zero):
German Schlo_ for sale in Kvln, only #250,000.
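Both kinds of damage are easy to reproduce. The sketch below is in Python and assumes the message was sent as ISO 8859/1; the two function names are hypothetical stand-ins for what a 7-bit mail transfer agent might do, not a description of any particular product.

```python
def discard_8bit(data: bytes) -> bytes:
    """Drop every byte that has its top bit set."""
    return bytes(b for b in data if b < 0x80)


def strip_top_bit(data: bytes) -> bytes:
    """Force the top bit of every byte to zero."""
    return bytes(b & 0x7F for b in data)


if __name__ == "__main__":
    original = "German Schlo\u00df for sale in K\u00f6ln, only \u00a3250,000."
    raw = original.encode("iso-8859-1")  # the bytes as they would be sent
    print(discard_8bit(raw).decode("ascii"))
    print(strip_top_bit(raw).decode("ascii"))
```

The second function shows why the damage looks the way it does: ß is 0xDF, which becomes 0x5F (an underscore); ö is 0xF6, which becomes 0x76 (v); and £ is 0xA3, which becomes 0x23 (#).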
Even if you are lucky enough that your message gets through intact, your recipient may not be using the same 8-bit character set as yourself. Although ISO 8859/1 is very common (being designed to serve the needs of several European languages), there are many other ISO 8859 character sets and even more 8-bit character sets that are not part of ISO 8859. The standard Mac Roman encoding for character sets has many (but not all) of the ISO 8859/1 characters but few of the characters are in the same position as in ISO 8859/1. People in different countries will use the character set appropriate to their country, which may or may not match yours. As far as English speaking countries go, any character set which is compatible with US ASCII will serve, and incompatibilities may only surface when you need to send a currency symbol or a non-English character.
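The effect of a character set mismatch can be demonstrated by decoding the same bytes two ways. This Python sketch relies on the iso-8859-1 and mac_roman codecs that ship with Python; the helper name seen_by is hypothetical.

```python
def seen_by(data: bytes, recipient_charset: str) -> str:
    """Decode the transmitted bytes the way the recipient's software would."""
    return data.decode(recipient_charset)


if __name__ == "__main__":
    sent = "K\u00f6ln".encode("iso-8859-1")  # sender uses ISO 8859/1
    print(seen_by(sent, "iso-8859-1"))       # recipient with the same set
    print(seen_by(sent, "mac_roman"))        # recipient with Mac Roman
    # Plain US ASCII text is the same in both character sets.
    plain = b"for sale, only 250,000"
    print(seen_by(plain, "iso-8859-1") == seen_by(plain, "mac_roman"))
```

Every byte arrived intact in this example, yet the Mac Roman reader still sees a different character where the ö should be.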
Actually, even 7-bit US ASCII is not guaranteed to reach all recipients unchanged. Some older software running on some mail gateways (notably those that use the EBCDIC character set) has problems with the following characters:
[\]^`{|}~
Generally, the Internet community regards software that incorrectly handles these characters as "broken" but that is of no help if you're mailing C source code to somebody. Fortunately, this software is extremely rare these days.
The first problem with using MIME in e-mail is that not everyone has a MIME-capable mailer. You must always ensure that your recipient can handle MIME before sending MIME-encoded e-mail. If the recipient does not have a MIME-capable mailer and you send:
German Schloß for sale in Köln, only £250000 (=$375000).
he will see:
German Schlo=DF for sale in K=F6ln, only =A3250000 (=3D$375000).
Raw MIME is extremely confusing. Your recipient might figure out that "Schlo=DF" means "Schloß" and that "K=F6ln" means "Köln." It is very likely that he may think that "=A" is an obscure way of representing "£" and that you're asking £3250000 for your "Schlo=DF" (this sort of mistake is very common in these circumstances). The fact that your message already contained an equals sign is unfortunate because it gets turned into =3D and just adds to the confusion. And just to make things worse, if your message had trailing blanks at the end of a line then MIME adds equals signs to the ends of those lines.
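The "=XX" notation is MIME's quoted-printable encoding, and the raw form a non-MIME mailer displays can be reproduced with the quopri module from Python's standard library. This is a minimal sketch that assumes the message is in ISO 8859/1; a real mailer also adds MIME headers, which are not shown here.

```python
import quopri

message = "German Schlo\u00df for sale in K\u00f6ln, only \u00a3250000 (=$375000)."

# Encode the ISO 8859/1 bytes as quoted-printable: each 8-bit byte and
# each literal "=" becomes "=" followed by two hexadecimal digits.
encoded = quopri.encodestring(message.encode("iso-8859-1"))
print(encoded.decode("ascii"))

# A MIME-capable mailer reverses the process before display.
decoded = quopri.decodestring(encoded).decode("iso-8859-1")
print(decoded == message)
```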
The second problem with using MIME in e-mail is that not everyone is using the same character set as yourself. MIME will automatically include details of your character set so your recipient's software can automatically select the appropriate character set - if that character set happens to be available or the software knows how to synthesize it from available character sets. Even so, you must ensure that your recipient can deal with your character set before using MIME.
Modern news server software is "8-bit clean" which means that it passes 8-bit characters correctly. However, a lot of news servers are still running older software which is not 8-bit clean (which means it drops or mutates 8-bit characters just as with e-mail). Not only that, but news is global in scope so many people will not be using the same 8-bit character set as yourself and even if 8-bit characters reach them intact they will see something entirely different to what you wrote.
MIME is not a feasible solution to the problem of sending 8-bit characters. MIME should only be used when you can guarantee that your recipient can handle it. Since you have no knowledge or control over who receives your news article, you cannot guarantee that they can cope with MIME.
Note that some countries operate a closed group of news servers guaranteed to be 8-bit clean and with the convention that articles posted to those servers use a single, agreed character set. If, and only if, your article is being posted to a closed community of news servers then you may use 8-bit characters (provided you post using the correct character set). Newsgroups in the major hierarchies (alt, biz, comp, humanities, misc, news, rec, sci, soc, talk, etc.) as well as most English-speaking hierarchies (such as uk) are not in a closed community of this kind and you should restrict your posts in those hierarchies to US ASCII.
HTML allows you to enter all valid ISO 8859/1 characters directly. However, the 8-bit characters which are not in the ASCII (ISO 646-US) character set cause problems for those who are limited to retrieving web pages by e-mail because e-mail cannot be guaranteed to transmit other than ASCII characters correctly. It also causes problems for people using Apple Macs because most Mac browsers have problems with 8-bit characters entered directly. It is therefore considered good practice to use character entities (such as &aelig; for "æ") or numeric character references (such as &#163; for "£") unless your document contains so many 8-bit characters as to make this impractical.
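Converting 8-bit characters by hand is tedious, so the job is usually scripted. The Python sketch below shows one way to do it with the standard library; the function names to_numeric_refs and to_entities are hypothetical, and neither function escapes the markup characters &, < and >, which are assumed to have been dealt with already.

```python
from html.entities import codepoint2name


def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_entities(text: str) -> str:
    """Replace non-ASCII characters with named entities where one exists."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                       # plain ASCII passes through
        elif code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")
        else:
            out.append("&#%d;" % code)           # no name: fall back to number
    return "".join(out)


if __name__ == "__main__":
    print(to_numeric_refs("only \u00a3250"))
    print(to_entities("Schlo\u00df in K\u00f6ln"))
```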
The characters:
[\]^`{|}~
may also cause problems when web pages are retrieved by e-mail if the
e-mail passes through (now rare) older mail gateways, but it is not
considered necessary to use numeric character references to deal with
those in ordinary text (it is considered necessary to URL-encode them
when they appear in URLs).
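URL-encoding replaces each character with a percent sign followed by its two-digit hexadecimal code. A minimal Python sketch using urllib.parse from the standard library follows; note that recent Python versions treat the tilde as an unreserved character and leave it unescaped, so it is omitted from the example string.

```python
from urllib.parse import quote, unquote

# The characters that older mail gateways may damage (tilde left out,
# see the note above).
risky = "[\\]^`{|}"

# safe="" asks quote() to escape everything that is not unreserved.
encoded = quote(risky, safe="")
print(encoded)

# Decoding restores the original characters.
print(unquote(encoded) == risky)
```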
One extension to HTML, which is an adjunct to most versions of HTML but incorporated into HTML version 4, is the ability to refer to characters in the Unicode character set. The first 256 characters of the Unicode character set are identical to the ISO 8859/1 character set (in the same way that the first 128 characters of the ISO 8859/1 character set are identical to the ASCII character set). Unicode characters can be entered by means of numeric character references greater than 255 (such as &#8364; to give the Euro currency symbol, €). This depends on browsers having Unicode support and the user having Unicode fonts which contain the relevant symbols. Older browsers may display nothing at all, an error marker, or a totally incorrect character (usually a character from ISO 8859/1). Some older browsers may even crash when confronted with a numeric character reference greater than 255. Until Unicode support is much more widespread you should think very carefully before using Unicode characters in web pages.
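The relationship between ISO 8859/1 and Unicode, and the way a numeric character reference is derived from a character, can be checked with a short Python sketch. The function names are hypothetical; the Euro sign used in the example is Unicode character U+20AC.

```python
import html


def numeric_reference(ch: str) -> str:
    """Return the decimal numeric character reference for one character."""
    return "&#%d;" % ord(ch)


def latin1_matches_unicode() -> bool:
    """Check that bytes 0-255 decoded as ISO 8859/1 give code points 0-255."""
    return all(bytes([i]).decode("iso-8859-1") == chr(i) for i in range(256))


if __name__ == "__main__":
    print(latin1_matches_unicode())
    print(numeric_reference("\u00a3"))   # pound sign, inside ISO 8859/1
    print(numeric_reference("\u20ac"))   # Euro sign, outside ISO 8859/1
    print(html.unescape("&#8364;") == "\u20ac")
```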
Character entities take the form &entity-name; where entity-name is one of those listed below. Entity names are case-sensitive: &thorn; and &THORN; are lower- and upper-case versions of the same character whilst &Thorn; is not a valid entity.
Numeric character references take the form &#number; where number can be any of 9, 10, 13, 32-126, 160-255.
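Both rules lend themselves to an automated check. The Python sketch below is an illustration only: the function names are hypothetical, the permitted numbers are exactly those listed above, and the entity names come from the table in Python's html.entities module.

```python
import re
from html.entities import name2codepoint

_NUMERIC_REF = re.compile(r"&#(\d+);")


def allowed_number(n: int) -> bool:
    """True for 9, 10, 13, 32-126 and 160-255."""
    return n in (9, 10, 13) or 32 <= n <= 126 or 160 <= n <= 255


def bad_numeric_refs(markup: str) -> list:
    """Return the numbers of all numeric references outside the list above."""
    numbers = (int(m) for m in _NUMERIC_REF.findall(markup))
    return [n for n in numbers if not allowed_number(n)]


def is_known_entity(name: str) -> bool:
    """Entity names are case-sensitive, so the lookup is too."""
    return name in name2codepoint


if __name__ == "__main__":
    print(bad_numeric_refs("&#163;250 &#150; &#8364;"))
    print(is_known_entity("thorn"), is_known_entity("THORN"), is_known_entity("Thorn"))
```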
Notes:
| Character | Entity | Numeric Reference | Preferred |
|---|---|---|---|
| " | &quot; | &#34; | " |
| & | &amp; | &#38; | & |
| < | &lt; | &#60; | < |
| > | &gt; | &#62; | > |
| [Non-breaking space] | &nbsp; | &#160; |   |
| ¡ | &iexcl; | &#161; | ¡ |
| ¢ | &cent; | &#162; | ¢ |
| £ | &pound; | &#163; | £ |
| ¤ | &curren; | &#164; | ¤ |
| ¥ | &yen; | &#165; | ¥ |
| ¦ | &brvbar; | &#166; | ¦ |
| § | &sect; | &#167; | § |
| ¨ | &uml; | &#168; | ¨ |
| © | &copy; | &#169; | © |
| ª | &ordf; | &#170; | ª |
| « | &laquo; | &#171; | « |
| ¬ | &not; | &#172; | ¬ |
| [Soft Hyphen] | &shy; | &#173; | ­ |
| ® | &reg; | &#174; | ® |
| ¯ | &macr; | &#175; | ¯ |
| ° | &deg; | &#176; | ° |
| ± | &plusmn; | &#177; | ± |
| ² | &sup2; | &#178; | ² |
| ³ | &sup3; | &#179; | ³ |
| ´ | &acute; | &#180; | ´ |
| µ | &micro; | &#181; | µ |
| ¶ | &para; | &#182; | ¶ |
| · | &middot; | &#183; | · |
| ¸ | &cedil; | &#184; | ¸ |
| ¹ | &sup1; | &#185; | ¹ |
| º | &ordm; | &#186; | º |
| » | &raquo; | &#187; | » |
| ¼ | &frac14; | &#188; | ¼ |
| ½ | &frac12; | &#189; | ½ |
| ¾ | &frac34; | &#190; | ¾ |
| ¿ | &iquest; | &#191; | ¿ |
| À | &Agrave; | &#192; | À |
| Á | &Aacute; | &#193; | Á |
| Â | &Acirc; | &#194; | Â |
| Ã | &Atilde; | &#195; | Ã |
| Ä | &Auml; | &#196; | Ä |
| Å | &Aring; | &#197; | Å |
| Æ | &AElig; | &#198; | Æ |
| Ç | &Ccedil; | &#199; | Ç |
| È | &Egrave; | &#200; | È |
| É | &Eacute; | &#201; | É |
| Ê | &Ecirc; | &#202; | Ê |
| Ë | &Euml; | &#203; | Ë |
| Ì | &Igrave; | &#204; | Ì |
| Í | &Iacute; | &#205; | Í |
| Î | &Icirc; | &#206; | Î |
| Ï | &Iuml; | &#207; | Ï |
| Ð | &ETH; | &#208; | Ð |
| Ñ | &Ntilde; | &#209; | Ñ |
| Ò | &Ograve; | &#210; | Ò |
| Ó | &Oacute; | &#211; | Ó |
| Ô | &Ocirc; | &#212; | Ô |
| Õ | &Otilde; | &#213; | Õ |
| Ö | &Ouml; | &#214; | Ö |
| × | &times; | &#215; | × |
| Ø | &Oslash; | &#216; | Ø |
| Ù | &Ugrave; | &#217; | Ù |
| Ú | &Uacute; | &#218; | Ú |
| Û | &Ucirc; | &#219; | Û |
| Ü | &Uuml; | &#220; | Ü |
| Ý | &Yacute; | &#221; | Ý |
| Þ | &THORN; | &#222; | Þ |
| ß | &szlig; | &#223; | ß |
| à | &agrave; | &#224; | à |
| á | &aacute; | &#225; | á |
| â | &acirc; | &#226; | â |
| ã | &atilde; | &#227; | ã |
| ä | &auml; | &#228; | ä |
| å | &aring; | &#229; | å |
| æ | &aelig; | &#230; | æ |
| ç | &ccedil; | &#231; | ç |
| è | &egrave; | &#232; | è |
| é | &eacute; | &#233; | é |
| ê | &ecirc; | &#234; | ê |
| ë | &euml; | &#235; | ë |
| ì | &igrave; | &#236; | ì |
| í | &iacute; | &#237; | í |
| î | &icirc; | &#238; | î |
| ï | &iuml; | &#239; | ï |
| ð | &eth; | &#240; | ð |
| ñ | &ntilde; | &#241; | ñ |
| ò | &ograve; | &#242; | ò |
| ó | &oacute; | &#243; | ó |
| ô | &ocirc; | &#244; | ô |
| õ | &otilde; | &#245; | õ |
| ö | &ouml; | &#246; | ö |
| ÷ | &divide; | &#247; | ÷ |
| ø | &oslash; | &#248; | ø |
| ù | &ugrave; | &#249; | ù |
| ú | &uacute; | &#250; | ú |
| û | &ucirc; | &#251; | û |
| ü | &uuml; | &#252; | ü |
| ý | &yacute; | &#253; | ý |
| þ | &thorn; | &#254; | þ |
| ÿ | &yuml; | &#255; | ÿ |