0004084: [all lang] Use UTF-8 codepage

ID	Project	Category	View Status	Date Submitted	Last Update
0004084	mantisbt	localization	public	2004-07-14 02:31	2009-06-23 15:26

Reporter	astax	Assigned To	siebrand
Priority	normal	Severity	feature	Reproducibility	always
Status	closed	Resolution	fixed
Target Version	1.2.0rc1	Fixed in Version	1.2.0rc1

Summary	0004084: [all lang] Use UTF-8 codepage
Description	Currently, when the system is used in several languages, differing charsets cause problems for languages with non-Latin scripts. For example, if I type something in Russian with the Russian locale selected in Mantis, my input is saved in the Windows-1251 codepage. But when someone using the English locale in Mantis looks at my comments, they see them in the iso-8859-1 or windows-1252 codepage and obviously cannot read my text. The situation is even worse when I use the English locale and enter Russian text. Depending on the browser, either I won't be able to enter Russian characters at all, or they'll be put into the field as UTF-8 characters. Either way, entering Russian text is impossible. The only correct fix I see is to add an option for using the UTF-8 codepage globally. It probably should not be forced on everyone, but at least we need an option. Any comments?
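The mismatch described above can be reproduced outside Mantis. The following is only a minimal Python sketch, not Mantis code: the sample word and the variable names are made up for illustration, and it assumes the Russian page stores Windows-1251 bytes while the English page displays them as Windows-1252.

```python
# Illustration only: one byte sequence read under two different codepages.
text = "проверка"  # hypothetical sample input ("check" in Russian)

# Bytes as they would be stored when the page is served as Windows-1251.
stored = text.encode("cp1251")

# The same bytes interpreted by a page served as Windows-1252.
seen_as_cp1252 = stored.decode("cp1252")

# With UTF-8 used for both storing and displaying, the text round-trips.
round_trip = text.encode("utf-8").decode("utf-8")

print(seen_as_cp1252)      # accented Latin letters instead of Cyrillic
print(round_trip == text)  # True
```

The bytes themselves are never damaged here; only the label used to interpret them differs, which is why a single shared encoding removes the problem.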
Tags	No tags attached.

parent of	0008352	closed	grangeway	Upgrading from 1.1.0a3 to 1.1.0a4 generating non correct visualization for 'é, è, à, ù, ..' characters
parent of	0008230	closed	siebrand	Character encoding in Mantis 1.1.0a4 on the Bugtracker site
parent of	0007472	closed	siebrand	[th] Thai chars are broken on Excel export; column header doesn't show
parent of	0009118	closed	siebrand	Bugtracker does not handle UTF-8 formatted text
has duplicate	0006144	closed	achumakov	Czech: encoding problem (Iiso <-> utf8)
has duplicate	0005401	closed	jlatour	Cyrillic characters encoding problem
has duplicate	0006226	closed	ryandesign	Invalid encoding of pages
has duplicate	0007406	closed	ryandesign	Handing of accents not working using UTF8
related to	0004085	closed	siebrand	The system complains about missing required fields
related to	0004195	closed	siebrand	Mails do not show national characters correctly
related to	0003812	closed	siebrand	[jp] Mantis send broken email when lang=Japanese_euc
related to	0005767	closed	grangeway	Win32 MySQL 4.1.12a-nt default character set
related to	0006536	closed		Mantis display a error infomation when create a chinese project
related to	0007235	closed	~~Wanderer~~	strings_czech.txt sets wrong charset
related to	0007319	closed	grangeway	[all lang] Can not display CJK characters
related to	0006155	closed	siebrand	[all lang] Using different language can cause mantis posted issues not to be viewable.
related to	0006217	closed	vboctor	[all lang] Wrong fIlename on download
related to	0006441	closed	achumakov	I have created an english_utf8 locale
related to	0006505	closed	grangeway	[zh_TW] Chinese_Simplified_UTF8 cannot display correctly in phpMyAdmin
related to	0007400	closed	siebrand	[all lang] When using UTF8 for encoding all reports some fields' contents are incorrectly truncated.
related to	0004742	closed	siebrand	[all lang] Mix languages in messages
related to	0007433	closed	achumakov	Setting encoding works only on main page
related to	0005850	closed	siebrand	[CJK] Section titles are garbled under Japanese.
related to	0005104	closed	achumakov	[all lang] IE 6.0 and Page encoding ISO-8859-2, and special characters ÃµÃ»
related to	0007481	closed	grangeway	Problem with special caracters
child of	0004181	closed		Features in Mantis 1.1 release

astax 2004-07-14 02:34 reporter ~0006035	I'll try to put a few Russian symbols here (should become UTF-8 metasymbols): проверим, как оно работает.

astax 2004-07-14 02:36 reporter ~0006036	Doesn't work correctly, as everybody can see.

jlatour 2004-08-06 11:05 reporter ~0006710	I think it would be a good idea to convert everything to UTF-8. The problem is converting the old content (and figuring out what codepage that is in). Any ideas?

astax 2004-08-08 01:06 reporter ~0006791 Last edited: 2004-08-08 01:14	PHP has a very good function for converting between charsets, including UTF-8: iconv(). Sometimes it takes some work to get it working in PHP, as support for it is not always included, but I think using it is better than nothing. Probably the best approach is to make this an option. Converting everything to UTF-8 is not a real problem. That is UTF-8's nice property: even if I just put charset="UTF-8" in the English localization file, it won't affect ordinary 7-bit characters. The only problem is converting all previous input. But since Mantis initially was practically unable to work with many languages simultaneously, all 8-bit strings inside should be in the same codepage. At the very least, each project has strings in only one codepage. In that case conversion is quite simple: just run $string = iconv($source_codepage, 'UTF-8', $string) on all strings in the database. Further, as not all browsers (and mail clients) may support UTF-8, it would be very good to add a choice of display codepage (can be handled by ob_iconv_handler) and email codepage to the personal settings. Then the system will be usable even by those who use Lynx and Pine. (I have some experience converting a multilingual web-based system to work with UTF-8 output, and I can say that adding initial UTF-8 support is quite a simple task. You just need to carefully check all the places where the codepage matters; usually there are surprisingly few of them.)
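The conversion step and the "7-bit characters are unaffected" property can be sketched as follows. This is a Python analogue of the iconv() call quoted above, not the PHP code itself; the function name to_utf8 and the sample strings are hypothetical, and it assumes the source codepage is known and the same for every string.

```python
def to_utf8(raw, source_codepage):
    """Re-encode bytes from source_codepage into UTF-8.

    Rough analogue of iconv($source_codepage, 'UTF-8', $string).
    """
    return raw.decode(source_codepage).encode("utf-8")


# Pure 7-bit ASCII content comes out byte-for-byte identical.
ascii_only = b"Bug report"
ascii_converted = to_utf8(ascii_only, "cp1251")

# 8-bit content (here: Russian stored as Windows-1251) changes its bytes
# but keeps its meaning once read back as UTF-8.
russian_cp1251 = "ошибка".encode("cp1251")
russian_utf8 = to_utf8(russian_cp1251, "cp1251")

print(ascii_converted == ascii_only)   # True
print(russian_utf8.decode("utf-8"))    # ошибка
```

Running the same conversion twice on the same row would corrupt it, so a real migration would need to track which rows have already been converted.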

jlatour 2004-08-08 02:52 reporter ~0006794	Yes, I've been looking into the conversion features, and the problem is just figuring out what the old codepage is. I don't think we can take into account different codepages in one database, so I guess we'll just have to convert from the default codepage of the default language? Or alternatively, from the default codepage of the user's language (the user who posted that bug, bugnote, et cetera). What do you think?

astax 2004-08-08 03:26 reporter ~0006795	I don't think relying on users' preferences is reliable. As codepage conversion should be considered irreversible, it would be better to delegate the choice of source codepage to somebody more responsible and make this a controlled process. Moreover, somebody could have already switched to UTF-8 (see 0004195). So I suggest putting this somewhere in the admin area with a BIG warning asking to back up the database first. Then ask for the source codepage (probably for each project) and start the conversion. To make it a bit handier, we can extract a random piece of text with 8-bit characters and convert it as an example, so the admin can check whether the selected codepage is correct. I'm not sure how to get a list of supported codepages. Currently I have only one idea, and it works on Unix systems only: execute `iconv -l` and parse its output. Probably not a very good solution, though...
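The "convert a sample so the admin can check the codepage" idea might look roughly like this. It is a sketch under assumptions, written in Python rather than PHP: the helper names has_8bit and preview, the candidate list, and the sample text are all invented for illustration and do not come from Mantis.

```python
def has_8bit(raw):
    """True if the byte string contains any byte outside 7-bit ASCII."""
    return any(byte >= 0x80 for byte in raw)


def preview(sample, candidates):
    """Decode one stored sample under each candidate codepage.

    Returns a mapping of codepage name -> decoded text, so a human can
    pick the candidate that produces readable output.
    """
    previews = {}
    for codepage in candidates:
        try:
            previews[codepage] = sample.decode(codepage)
        except (UnicodeDecodeError, LookupError):
            # Undecodable bytes or a codepage unknown to this runtime.
            previews[codepage] = "<cannot decode>"
    return previews


# Hypothetical database value: Russian text stored as Windows-1251.
sample = "Привет".encode("cp1251")

if has_8bit(sample):
    result = preview(sample, ["cp1251", "cp1252", "no-such-codepage"])
    for codepage, shown in result.items():
        print(codepage, "->", shown)
```

Only the cp1251 row is readable for this sample, which is the signal the administrator would look for before committing to the irreversible conversion.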

jlatour 2004-08-08 04:05 reporter ~0006797	astax, with your extensive experience on the subject, do you think you could work on conversion routines? I've been working on the conversion towards gettext and when I do that, I also want to convert the language files to UTF-8 so that would be the time to convert the databases as well.

astax 2004-08-08 08:42 reporter ~0006802	I would be happy to do this, but sorry, I don't think I'll be able to. I just have no free time for this.

zerogan 2005-12-17 11:00 reporter ~0011802	Reminder sent to: zerogan

zerogan 2005-12-18 07:43 reporter ~0011808	Now I will input Chinese Simplified Characters. "大家好！你们看到了吗？" Can it be displayed correctly?

achumakov 2006-10-02 09:51 reporter ~0013575	jlatour, astax: I've converted all the langfiles to utf8 and made an appropriate patch to config: http://www.chumakov.ru/mantis/mantis-110a-utf8.zip
The Mantis now looks like http://www.chumakov.ru/images/Mantis_ML.gif
I'll keep this utf-8 sync to CVS HEAD.
All we need is:
- get rid of mb-string issues if any (string length, case conversion, search etc)
- create database with utf-8 charset and collation by default
- make a migration path for the existing installations:
  - convert each and every db string from user-specified encoding to utf-8
  - change DB encoding and collation to utf-8

cbradney 2006-10-29 19:30 reporter ~0013658	Vote 1 for this for 1.1 :) http://bugs.scribus.net/view.php?id=4454

achumakov 2006-11-08 08:43 reporter ~0013700	I have changed the Mantis langfiles to be utf-8 by default. All current strings-<language>.txt files are utf-8 now, and only they are shown in the language picker by default. Other encodings are preserved in the /lang/ directory until somebody helps us with a migration routine for existing databases. See lang/langreadme.txt. So, Mantis is now basically utf-8. Please test and report what else needs to be done :)

lifo2 2007-02-06 08:24 reporter ~0014005	I checked whether some languages were still not in UTF-8. Under Unix, you can use the following:
grep s_charset * | grep -v utf-8 | grep -v [0-9].txt
It returns some false positives, but it shouldn't miss any non-UTF-8 files. Only two languages haven't been converted:
strings_czech.txt:$s_charset = 'iso-8859-2';
strings_polish.txt:$s_charset = 'iso-8859-2';
I think it would be easy to convert them with vim (:set encoding=iso-8859-2 and :set fileencoding=utf-8 and save), but I can't check whether it works, as I don't have any font installed that supports the iso-8859-2 charset.
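The same check can be expressed without grep. The snippet below is a hedged Python sketch: it assumes each language file declares its encoding in a line of the form $s_charset = '...'; as in the output quoted above, and the helper names and the in-memory sample files are hypothetical.

```python
import re

# Matches declarations such as: $s_charset = 'iso-8859-2';
CHARSET_RE = re.compile(r"""\$s_charset\s*=\s*['"]([^'"]+)['"]""")


def declared_charset(langfile_text):
    """Return the lower-cased charset a language file declares, or None."""
    match = CHARSET_RE.search(langfile_text)
    return match.group(1).lower() if match else None


def non_utf8(files):
    """Given {filename: file text}, return {filename: charset} for every
    file that declares a charset other than utf-8."""
    result = {}
    for name, text in files.items():
        charset = declared_charset(text)
        if charset is not None and charset != "utf-8":
            result[name] = charset
    return result


# Hypothetical stand-ins for the contents of the lang/ directory.
sample_files = {
    "strings_czech.txt": "$s_charset = 'iso-8859-2';",
    "strings_english.txt": "$s_charset = 'utf-8';",
    "langreadme.txt": "no charset declaration here",
}
leftover = non_utf8(sample_files)
print(leftover)
```

Note that this only reads the declared charset; like the grep pipeline, it does not verify that the file's bytes actually match the declaration.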

astax 2007-03-04 23:08 reporter ~0014126	Now multi-byte UTF-8 strings are not correctly wrapped in email notifications. For most common two-byte symbols, line length is around 40 characters instead of 80.
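The halved line length is what one would expect if the wrapping counts bytes rather than characters, which is presumably what happens here (the thread does not show the Mantis wrapping code). A small Python sketch of the arithmetic, with made-up sample text and a hypothetical helper name:

```python
import textwrap

# Forty Cyrillic letters: 40 characters, but 80 bytes in UTF-8.
line = "я" * 40
char_count = len(line)
byte_count = len(line.encode("utf-8"))
print(char_count, byte_count)  # 40 80


def wrap_by_characters(text, width=80):
    """Wrap on character count (code points), not on encoded byte length."""
    return textwrap.wrap(text, width=width)


# Hypothetical notification body made of short Russian words.
body = " ".join(["слово"] * 30)
wrapped = wrap_by_characters(body)

for wrapped_line in wrapped:
    print(len(wrapped_line), len(wrapped_line.encode("utf-8")))
```

Each wrapped line stays within 80 characters even though its UTF-8 byte length is well above 80, which is the behavior a character-aware wrap should give.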

ave 2007-03-22 12:53 reporter ~0014233	I found that the body of an email is cut off when Mantis fails to wrap relationship summary. I had to modify my php.ini to avoid this:
mbstring.func_overload = 2
mbstring.internal_encoding = UTF-8
Is this intended for using utf-8 with Mantis?

siebrand 2009-04-27 16:55 reporter ~0021706	Update planned to 1.2.x

siebrand 2009-06-16 17:16 reporter ~0022180	All dependencies are resolved. Yay!
