How do you deal with UTF-8 issues in Joomla! 1.0.x series?

From Joomla! Documentation

Latest revision as of 12:13, 17 May 2010

Many posts raise the issue of using utf-8 in the Joomla! 1.0.x series, even though Joomla! 1.0.x does not support utf-8. The community clearly needs utf-8 in Joomla! 1.0.x; otherwise the issue would not come up so often. This article explains the utf-8 issues in the Joomla! 1.0.x series and provides a reasonable workaround for implementing utf-8.

To be fully utf-8 compatible, the following requirements must be met:

  • The database needs to be utf-8 compliant; otherwise there is a danger of data truncation. A 20-character string in utf-8 may be up to 60 bytes long (MySQL's utf8 character set uses at most 3 bytes per character). A varchar(20) field defined as utf-8 can safely store 20 utf-8 characters, because the field adapts to the byte length. In a non-utf-8 database the same varchar(20) field will truncate the string after 20 bytes.
  • The connection between the database and the PHP application needs to use utf-8 encoding; otherwise unwanted conversions will occur and corrupt the data.
  • Multibyte string functions need to be used when the data is encoded as utf-8. Unfortunately, PHP's native string functions are not utf-8 aware and can seriously corrupt data (see http://www.phpwact.org/php/i18n/utf-8). An extension for PHP 4 and 5, 'mbstring', provides utf-8 aware string functions. However, this extension is not always installed or loaded, and the PHP code needs to be modified to call the corresponding mb_ versions of the string functions. (PHP 6 was planned to be fully Unicode and utf-8 aware, but it was never released.)
  • The HTML page encoding needs to be set to utf-8 (by setting the charset in the language file).
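
The gap between character count and byte count behind the first requirement can be illustrated with a minimal sketch. It is written for current PHP (7 or later), not for the PHP 4 era of Joomla! 1.0.x, and the helper utf8CharCount() is a hypothetical name introduced only for this illustration. It counts characters without relying on the mbstring extension by skipping utf-8 continuation bytes.

```php
<?php
// Count utf-8 characters by counting every byte that is NOT a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Hypothetical helper for illustration; it assumes valid utf-8 input.
function utf8CharCount(string $s): int
{
    $count = 0;
    $len = strlen($s); // strlen() returns the number of bytes
    for ($i = 0; $i < $len; $i++) {
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

$euro = "\xE2\x82\xAC";          // U+20AC EURO SIGN: 3 bytes in utf-8
$text = str_repeat($euro, 20);   // a 20-character string

echo strlen($text), "\n";        // 60 (bytes)
echo utf8CharCount($text), "\n"; // 20 (characters)

// A field limited to 20 bytes has room for only 6 whole characters
// of this string (6 * 3 = 18 bytes), not 20.
echo intdiv(20, strlen($euro)), "\n"; // 6
```

When the mbstring extension is loaded, mb_strlen($text, 'UTF-8') reports the same character count as the helper above.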

Why does Joomla! 1.0.x seem to work fine when only the charset is set to utf-8?

This happens only if pure English is used. All English characters lie in the 7-bit ASCII range (0-127) and none of them is an extended (8-bit) character. In this special case utf-8 is byte-for-byte identical to iso-8859-1, because every character is a single byte. The problems begin with European languages that use Latin characters with diacritics (umlauts, accents, etc.) and with non-Latin languages. If you are only going to use English, you might as well stay with iso-8859-1. If you are going to use other languages, see the workaround below.
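
A small sketch can make the difference visible at the byte level. It assumes current PHP (7 or later), and isPureAscii() is a hypothetical helper named only for this example. The strings are written with byte escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// True when every byte is in the 7-bit ASCII range (0x00-0x7F).
// Hypothetical helper, named here for illustration only.
function isPureAscii(string $s): bool
{
    return preg_match('/^[\x00-\x7F]*\z/', $s) === 1;
}

$english = "Hello world";  // identical bytes in iso-8859-1 and utf-8
$uLatin1 = "\xFC";         // "ü" encoded as iso-8859-1: one byte
$uUtf8   = "\xC3\xBC";     // "ü" encoded as utf-8: two bytes

var_dump(isPureAscii($english));  // bool(true)
var_dump(isPureAscii($uLatin1));  // bool(false)
var_dump(isPureAscii($uUtf8));    // bool(false)

echo bin2hex($uLatin1), "\n";     // fc
echo bin2hex($uUtf8), "\n";       // c3bc
```

If the two-byte utf-8 form is sent to a page declared as iso-8859-1, the browser shows each byte as its own character ("Ã¼"); if the one-byte iso-8859-1 form is sent to a page declared as utf-8, the byte is not a valid sequence and typically appears as a replacement character.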

How is all this solved for Joomla! 1.5?

See: http://dev.joomla.org/component/option,com_jd-wp/Itemid,33/p,16/

Is there a workaround to apply utf-8 in Joomla! 1.0.x series?

Yes. Here is a quick guideline for getting Joomla! 1.0.x to work with utf-8:

  • Use MySQL version 4.1.2 or newer (older versions don't support utf-8).
  • Create an empty database manually before installing Joomla!. Set the character set to utf8 when creating it and choose a collation (utf8_general_ci is the default and should be OK).
  • Convert the language files to utf-8 (all language files, including those for editors, components, etc.). Make sure NOT to save them with the utf-8 BOM header option.
  • Install Joomla! using the pre-existing database. After installation, check that the database has utf8 encoding for all text fields (just in case Joomla! created a new database and is not working on the pre-created one).
  • Set 'charset=utf-8' in the _ISO define in the language file.
  • Uncomment one line of code in the includes/database.php file at about line 102 (the second line below):
$this->_table_prefix = $table_prefix; 
//@mysql_query("SET NAMES 'utf8'", $this->_resource); // THIS IS THE LINE TO UNCOMMENT 
$this->_ticker = 0; 
$this->_og = array();
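
The language file conversion step in the list above can be sketched in plain PHP. This is only an illustration under stated assumptions: it assumes the existing files are encoded as iso-8859-1, it is written for current PHP (7 or later), and latin1ToUtf8() and stripUtf8Bom() are hypothetical helper names, not part of Joomla!. Where the iconv or mbstring extension is available, iconv() or mb_convert_encoding() can do the conversion instead.

```php
<?php
// Convert iso-8859-1 bytes to utf-8.
// Bytes below 0x80 are plain ASCII and stay unchanged; bytes from
// 0x80 to 0xFF become the two-byte utf-8 form of the same code point.
function latin1ToUtf8(string $bytes): string
{
    $out = '';
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $code = ord($bytes[$i]);
        if ($code < 0x80) {
            $out .= $bytes[$i];
        } else {
            $out .= chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
        }
    }
    return $out;
}

// Remove a leading utf-8 BOM (EF BB BF) if an editor added one.
function stripUtf8Bom(string $bytes): string
{
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return substr($bytes, 3);
    }
    return $bytes;
}

$latin1 = "Gr\xFC\xDFe";                 // "Grüße" in iso-8859-1
echo bin2hex(latin1ToUtf8($latin1)), "\n"; // 4772c3bcc39f65

$withBom = "\xEF\xBB\xBF<?php";
echo stripUtf8Bom($withBom), "\n";         // <?php
```

The BOM matters because PHP sends any bytes that precede the opening tag to the browser as output, so a language file saved with a BOM emits three stray bytes every time it is included.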

Please note that the above does not make Joomla! 1.0.x fully utf-8 compatible. All string operations will still use single-byte string functions. This works well in most cases (no guarantees), but there will be some instances of garbage characters, especially with Latin characters that carry diacritics, and logical errors in the search and filtering features.
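
A minimal sketch shows how a single-byte string function produces such a garbage character. It assumes current PHP (7 or later) and is not taken from the Joomla! code base; the mb_substr() call is guarded because the mbstring extension may not be loaded.

```php
<?php
$word = "\xC3\x9Cber";  // "Über" in utf-8: 4 characters, 5 bytes

// substr() counts bytes, so asking for "1 character" returns only
// the first half of the two-byte "Ü".
$firstByte = substr($word, 0, 1);
echo bin2hex($firstByte), "\n";  // c3

// A lone lead byte is not valid utf-8; the /u modifier makes
// preg_match() reject subjects that are not valid utf-8.
var_dump(preg_match('//u', $firstByte) === 1);  // bool(false)

// The multibyte variant returns the whole character.
if (function_exists('mb_substr')) {
    echo bin2hex(mb_substr($word, 0, 1, 'UTF-8')), "\n";  // c39c
}

// Byte-oriented case-insensitive search only folds ASCII letters,
// so "über" is not found in "Über" under the default locale.
var_dump(stripos($word, "\xC3\xBCber"));
```

The same mechanism explains the search and filtering errors: any function that cuts, pads, or compares by byte position can split a multibyte character or miss a match that differs only in the case of a non-ASCII letter.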