Thanks for the ideas, they're much appreciated! You're the first person to take a technical interest in this, so it's actually quite exciting.

I looked into this a few years ago, and though I'm no expert on character encodings, the major challenge we found in what you describe is detecting the character encoding of a text file. There doesn't seem to be any reliable way of doing it.
quote BoneIdol
Alternatively, you could just default to UTF-8 (via editing your server's mime types list) and tell people to send foreign language FAQs as UTF-8, since it can display just about every character in every alphabet in common usage today and won't screw over your existing ASCII files (although some of the foreign language FAQs may need converting).

Switching to UTF-8 is also difficult due to the huge catalogue of existing iso-8859 and windows-1252 encoded FAQs we already have which contain accented characters (these seem to get corrupted when the server is told to output them as UTF-8).
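To illustrate the corruption I mean, here is a minimal sketch in Python (just because it is easy to run; the sample text and variable names are made up for the example). It encodes a line the way one of our legacy files would be encoded, then decodes it the way a browser would if the server declared UTF-8.

```python
# Hypothetical sample line standing in for a legacy FAQ.
legacy_text = "Pok\u00e9mon FAQ"  # "Pokémon FAQ"

# In windows-1252 the accented "e" is stored as the single byte 0xE9.
legacy_bytes = legacy_text.encode("windows-1252")

# Roughly what happens when those bytes are declared as UTF-8:
# 0xE9 is not a valid UTF-8 sequence here, so it is shown as the
# replacement character U+FFFD instead of the accented letter.
as_utf8 = legacy_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding the file was really saved in works fine.
as_cp1252 = legacy_bytes.decode("windows-1252")

print(repr(as_utf8))
print(repr(as_cp1252))
```

Plain ASCII files are unaffected because ASCII bytes mean the same thing in all three encodings, which is why only the FAQs with accented characters break.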
For me, multilanguage support is always a "supply chain" problem:
- Asking users to encode files a certain way is unreliable because some might not follow directions.
- Many authors ask us to acquire their FAQs from other sites, which may have the files encoded differently than we expect.
- Users who upload FAQs via the FAQCP might upload in a wrong or unexpected encoding.
- We need to know what encoding the source files were made with to output them properly.
- Detection of encoding is difficult and unreliable, so we might output the wrong encoding headers.
- Existing files have unknown encodings, so an automated conversion may not be reliable for the same reason: it depends on encoding detection.
- The system saving the files must not corrupt the encoding of the file being saved.
- The server has to output the correct headers to the browser, declaring the correct encoding.
I know GameFAQs appears to have solved the problem, so hopefully we can too. I'll take another look at this in more depth, but it seems to me the holy grail is the ability to detect the encoding of a file rather than ask people to upload in a specific encoding, so if you have any good ideas on doing so in PHP that would be "half the battle" right there!
(Actually, the true holy grail would be if Apache itself could detect the encoding of a text file and output the correct headers at the time of serving plain txt files.)
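For what it's worth, the kind of heuristic I have in mind could be sketched like this. It is written in Python only so it can be run standalone; the function names `guess_charset` and `content_type_header` are hypothetical, and falling back to windows-1252 is purely an assumption based on what most of our existing catalogue seems to use. It is a guess, not real detection: a file in some other legacy encoding would be mislabeled.

```python
def guess_charset(data: bytes, fallback: str = "windows-1252") -> str:
    """Guess a charset label for raw file bytes (heuristic only).

    If the bytes decode cleanly as strict UTF-8, assume UTF-8.
    Otherwise assume the fallback legacy encoding.
    """
    try:
        data.decode("utf-8")  # strict by default: raises on invalid bytes
    except UnicodeDecodeError:
        return fallback
    return "utf-8"


def content_type_header(data: bytes) -> str:
    """Build the header line a server would send for a plain text FAQ."""
    return f"Content-Type: text/plain; charset={guess_charset(data)}"


if __name__ == "__main__":
    print(content_type_header("caf\u00e9".encode("utf-8")))
    print(content_type_header(b"caf\xe9"))  # legacy single-byte accented "e"
```

Pure ASCII files come out as UTF-8 here, which is harmless since ASCII is a subset of UTF-8. I believe PHP's mb_check_encoding can do the same strict validity check, but I haven't tested that against our files.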
Extra note about encoding detection:
If I understand character encoding properly, the MAJOR problem is that there's no real way to detect encoding by reading the contents of a file. UTF-8 for instance is notoriously difficult to detect reliably. The technical reason appears to be that an encoding itself is nothing more than a mapping of bytes to characters, so encoding detection relies on finding sequences of bytes that are exclusive to one encoding map. I suspect many encodings actually have significant overlap in byte sequences but map them to different characters, so the end result is that searching for byte sequences can be misleading, depending on the range of maps you have to work with. Even telling iso-8859 from utf-8 is not trivial, because sometimes a specific sequence of characters in iso-8859 has bytes that map exactly to another sequence of characters in utf-8 (I think that's why you get so many bug reports and gotchas around the PHP mb_detect_encoding function). Other UTF-8 detection methods rely on reading the BOM, which is also unreliable because the BOM is optional and whether it gets written depends on the editor being used (I have some text editors that write a BOM when told to save in UTF-8 and some that don't).
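Here is a small Python sketch of both problems, under my own assumptions about which bytes to pick (the variable names and the `has_utf8_bom` helper are made up for the example). The first part shows two bytes that are perfectly valid in both encodings but mean different things; the second shows that the same text saved as UTF-8 may or may not start with a BOM.

```python
import codecs

# The same two bytes are valid in both encodings, with different meanings.
ambiguous = b"\xc3\xa9"
as_utf8 = ambiguous.decode("utf-8")         # one character: e with acute accent
as_latin1 = ambiguous.decode("iso-8859-1")  # two characters: A with tilde, copyright sign

# The same text saved as UTF-8 with and without a BOM.
with_bom = "FAQ".encode("utf-8-sig")  # Python's "utf-8-sig" codec prepends the BOM
without_bom = "FAQ".encode("utf-8")   # plain "utf-8" does not


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


print(repr(as_utf8), repr(as_latin1))
print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))
```

So a BOM tells you something when it is present, but its absence tells you nothing, and byte validity alone cannot decide which reading of the ambiguous bytes the author intended.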
Edit: Mar 27, 11