From Joomla! Documentation
Implementation of UTF-8 in Joomla! 1.5
This text is based on an article rewritten by former core team member David Gal for a German Linux publication.
UTF-8 is a variable-length character encoding that uses one to four bytes per character, depending on the Unicode code point. Four bytes may seem like a lot for one character, but this is required only for characters outside the Basic Multilingual Plane, which are generally very rare. The first 128 code points (0-127) are each encoded as a single byte with the same value as in ASCII, which gives UTF-8 full backward compatibility with ASCII.
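As a small illustration of the variable-length scheme, the Python sketch below encodes a few characters and reports how many bytes each one occupies. The sample characters and the helper name `utf8_length` are chosen for illustration only.

```python
# Sample characters chosen for illustration; each is a single Unicode code point.
samples = {
    "A": "ASCII letter",
    "\u00e4": "Latin letter a with umlaut",
    "\u20ac": "euro sign (inside the Basic Multilingual Plane)",
    "\U0001D11E": "musical G clef (outside the Basic Multilingual Plane)",
}

def utf8_length(char: str) -> int:
    """Return the number of bytes the character occupies in UTF-8."""
    return len(char.encode("utf-8"))

for char, description in samples.items():
    print(f"U+{ord(char):04X} {description}: {utf8_length(char)} byte(s)")

# ASCII text is byte-for-byte identical in ASCII and in UTF-8.
ascii_is_unchanged = "Joomla!".encode("ascii") == "Joomla!".encode("utf-8")
print("ASCII unchanged:", ascii_is_unchanged)
```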
UTF-8 is becoming the internationally accepted standard for multilingual environments and is the preferred way to communicate non-ASCII characters over the internet. As an encoding of Unicode rather than a separate character set, UTF-8 has the special benefit of needing only one byte to store or transmit each ASCII character, where fixed-width Unicode encodings need two or four. As the bulk of internet transmissions use the 7-bit ASCII characters, UTF-8 encoding saves volume and bandwidth.
It also provides a single encoding for all other characters, which were previously implemented using 8-bit character codes hand-in-hand with a specific encoding table (e.g. ISO-8859-2) that determined how to represent each character code. Up to now this basically limited Web page display to ASCII Latin characters plus one other language or set of diacritic Latin characters (accents and umlauts, for example). UTF-8 now provides one encoding for all languages.
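The dependence on an encoding table can be sketched in Python: the same byte value decodes to different characters depending on which single-byte table is assumed, whereas UTF-8 text needs no such side information. The byte value 0xB1 is picked only as an example.

```python
raw = bytes([0xB1])  # one byte, as stored by a single-byte encoding

as_latin1 = raw.decode("iso-8859-1")  # Western European table
as_latin2 = raw.decode("iso-8859-2")  # Central European table

print(f"ISO-8859-1 reads 0xB1 as U+{ord(as_latin1):04X}")
print(f"ISO-8859-2 reads 0xB1 as U+{ord(as_latin2):04X}")

# With UTF-8, both characters can live in the same text without a code page.
mixed = as_latin1 + as_latin2
round_trip = mixed.encode("utf-8").decode("utf-8")
print("Round trip preserved both characters:", round_trip == mixed)
```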
Migration to UTF-8 promises to be simple for existing ASCII texts, because ASCII text is already valid UTF-8 and needs no conversion.
Implementing in Joomla! – a Bit More Than Setting Encoding to UTF-8
Up to now, changing from one encoding to another only required changing the _ISO definition in the language file, which resulted in the charset=myNewEncoding statement in the HTML meta tag. This was simple because all encodings were single-byte character encodings. For the entire Joomla! system, a character equalled a byte, and Joomla! didn't really care which character a particular byte represented.
Now, in Joomla! 1.5, we are starting to use multi-byte characters, and not only that: their length varies, so some are one byte long and others are two, three or even four bytes long. How does this affect Joomla!, and what is required to truly be able to state that Joomla! supports UTF-8?
There are Four Affected Areas
- The database needs to support UTF-8 data storage. For example: a text field of type varchar was given a length of 20 with the intention of being able to store up to 20 characters, which also meant 20 bytes. Now, with UTF-8, 20 characters might mean between 20 and 60 bytes (MySQL's utf8 character set stores at most three bytes per character). The database needs to be able to adjust accordingly. Fortunately, MySQL versions 4.1.2 and up support UTF-8.
- The connection between the Joomla! application and the database needs to know the encoding of the data in order to know whether encoding translation is required. (There are cases where the application uses one encoding and the database uses another.)
- The HTML page needs to know which encoding it carries. This is trivial and all that is required is that the HTML meta tags will show that the encoding is UTF-8.
- Joomla!'s PHP string handling functions need to be UTF-8 aware. This is not at all trivial, as PHP's regular string functions all assume one byte per character, so a special set of functions is needed.
The Challenges and the Solutions
The first major challenge was the fact that there are still many hosts that are running MySQL version 4.0.x and older databases. These do not have UTF-8 support. It is possible to store UTF-8 data in non-UTF-8 tables. As far as the database is concerned, it is storing bytes and returning them to the application when needed.
However, as already mentioned, the user may want to store a 20-character value containing UTF-8 characters outside the regular ASCII range. If these are committed to a varchar(20) database field that counts bytes, the data will be truncated. This is a problem not only for non-Latin languages (whose characters are normally in the multi-byte area) but also for all European languages, with the possible exception of English. Every one of these languages has some special Latin characters (accents and umlauts, for example) that are now multi-byte characters. The word käse is now 5 bytes long!
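To make the truncation risk concrete, the following Python sketch measures a string in characters and in UTF-8 bytes and then cuts it at a byte limit, which is roughly what a byte-oriented column would do. The function name `truncate_bytes` and the sample values are illustrative assumptions and do not reproduce MySQL or Joomla! behaviour exactly.

```python
def truncate_bytes(text: str, limit: int) -> bytes:
    """Cut the UTF-8 form of text at `limit` bytes, as a byte-oriented store would."""
    return text.encode("utf-8")[:limit]

word = "k\u00e4se"  # "käse"
print(len(word), "characters,", len(word.encode("utf-8")), "bytes")

# A 20-character value made only of umlauts needs 40 bytes in UTF-8,
# so a 20-byte field keeps just half of it.
value = "\u00e4" * 20
stored = truncate_bytes(value, 20)
survived = stored.decode("utf-8", errors="replace")
print(len(value), "characters sent,", len(survived), "characters kept")

# A cut can also fall in the middle of a character and leave an invalid
# sequence, which shows up here as the replacement character U+FFFD.
broken = truncate_bytes(word, 2).decode("utf-8", errors="replace")
print(repr(broken))
```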
The core team rightly decided that Joomla! 1.5 should also be able to work on older databases and, beyond that, that the backward compatibility should be transparent to the user. The installer now checks the version of MySQL. If it is version 4.1.2 or later, UTF-8 tables are created and the user can choose the desired collation. If the database does not support UTF-8, the installer runs a separate script that creates a database structure with extra storage space for potentially longer strings. This is anticipated to eliminate the danger of data truncation by the database.
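The installer's decision could be sketched as below. This is a Python illustration only, not the actual Joomla! installer code: the function names, the returned labels and the sample version strings are all hypothetical.

```python
import re

UTF8_MIN_VERSION = (4, 1, 2)  # first MySQL version treated as UTF-8 capable

def parse_version(version: str) -> tuple:
    """Extract the leading numeric 'major.minor.patch' part of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())

def choose_schema(version: str) -> str:
    """Pick which (hypothetical) table-creation script to run."""
    if parse_version(version) >= UTF8_MIN_VERSION:
        return "utf8"            # UTF-8 tables with a user-selected collation
    return "compatibility"       # wider columns to absorb multi-byte strings

for sample in ("4.0.27-standard", "4.1.2", "5.0.45-log"):
    print(sample, "->", choose_schema(sample))
```

Comparing the parsed numbers as a tuple avoids the pitfalls of comparing version strings alphabetically, where "4.10.0" would sort before "4.2.0".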
The second major challenge relates to the lack of UTF-8 support in PHP. All standard string functions in PHP are only able to work with single-byte characters. Using these functions on UTF-8 encoded data can result in logical failures and also in data corruption.
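The kind of failure described here can be reproduced in Python by treating a UTF-8 string as raw bytes, which is approximately how a single-byte string function sees it. The byte-wise operations below stand in for functions such as PHP's `strlen` and `strrev`; they are an analogy, not PHP code.

```python
text = "k\u00e4se"            # "käse"
raw = text.encode("utf-8")    # what a single-byte function actually operates on

# Length: a byte count overstates the number of characters.
byte_length = len(raw)        # analogous to a single-byte strlen
char_length = len(text)       # what the user means by "length"

# Reversal: reversing bytes splits the two-byte sequence for the umlaut.
reversed_bytes = raw[::-1]
try:
    reversed_bytes.decode("utf-8")
    corrupted = False
except UnicodeDecodeError:
    corrupted = True          # the result is no longer valid UTF-8

# The character-aware equivalent keeps the text intact.
reversed_text = text[::-1]

print(byte_length, char_length, corrupted, reversed_text)
```

The wrong length is a logical failure, while the byte-wise reversal is an instance of data corruption: the output can no longer be decoded as UTF-8.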
The problem lies in the fact that until PHP 6 is released, there is no comprehensive native UTF-8 support in PHP. There is a multi-byte extension named mbstring, which exists from version 4.1, but it is not loaded by default. In addition, it also serves other multi-byte encodings, such as those used for some Far Eastern languages. This means that it may be present but not configured with the correct settings for UTF-8. An additional extension named iconv, which has some parallel capability, is present in PHP 5 but is optional, and missing some functions, in PHP 4.
Here again, the core team opted for full backward compatibility and for a solution that is transparent to the user. The solution uses the functions provided by PHP if they are present, and falls back to a special library of UTF-8 aware string functions if no native PHP functions are available. This provides the best performance (when PHP functions are available) together with complete backward compatibility. A Joomla! string class provides this functionality, and it will be included in the API for third-party developers.
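The prefer-native, fall-back-to-library strategy can be sketched as follows. This is a Python analogy with hypothetical names (`native_strlen`, `fallback_strlen`, `pick_strlen`, `utf8_strlen`); it does not reproduce the Joomla! string class or any PHP extension. The fallback counts every byte that is not a UTF-8 continuation byte, which is one common way to count characters without decoding.

```python
from typing import Callable, Optional

def fallback_strlen(data: bytes) -> int:
    """Count characters by skipping continuation bytes (bit pattern 10xxxxxx)."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)

def native_strlen(data: bytes) -> int:
    """Stand-in for a fast built-in routine, here Python's own decoder."""
    return len(data.decode("utf-8"))

def pick_strlen(native: Optional[Callable[[bytes], int]]) -> Callable[[bytes], int]:
    """Use the native routine when the host provides one, else the fallback."""
    return native if native is not None else fallback_strlen

# Callers use one name and never need to know which implementation is active.
utf8_strlen = pick_strlen(native_strlen)

sample = "k\u00e4se \u20ac".encode("utf-8")  # "käse €"
print(utf8_strlen(sample), pick_strlen(None)(sample))
```

Selecting the implementation once, behind a single name, is what keeps the choice invisible to calling code.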
There is no user configuration or setup required regarding PHP UTF-8 support. There is one small exception to this rule which could theoretically occur if, in the host, one or two of the mbstring settings (that cannot be changed from within code) are set to a value that is adverse to UTF-8. The installer will identify this and advise on how to change the setting locally using .htaccess.
Considering that data will, in most cases, need to be converted to UTF-8, it will be recommended to migrate existing data to a freshly installed Joomla! 1.5 site and not to perform upgrades of existing Joomla! 1.0.x sites. Specific migration guidance will be provided with the release.
The Quantum Jump
Joomla! 1.5 will take a huge jump ahead of the rest of the CMS pack with its internationalisation features. UTF-8 is undoubtedly the big differentiator: all languages with one encoding. In addition, RTL support and the language packs for the backend, installer and help system make Joomla! 1.5 a complete package for use in any language or combination of languages. JoomFish will be the icing on the cake.
Language Codes - RFC 3066
Providing decent localisation support for Joomla!, the kind of support that will carry smoothly across to the work of all extension providers (from language packs to components, modules and plugins), requires a certain amount of attention to some nitty-gritty details. Some questions that pop up are: "How do we identify a language?", followed by "How do we provide a consistent naming convention?" and "How will everyone know about this?"
Simple. We need a convention, preferably public and hopefully without ambiguity. It should be kept current.
A little bit of digging unearthed RFC 3066 and a decision was made to use it as the convention for language identification in Joomla as of version 1.5.
This results in the following conventions for the language names:
- The delimiter between the subtags should be the HYPHEN / MINUS character and not the UNDERSCORE character.
- The first subtag of the language tag (the language code, based on ISO 639) should use the two-letter code for a language and not the three-letter code. The three-letter code may only be used if a two-letter code does not exist for the language.