[Fsf-friends] Indic language bugs in Unicode

Mon Nov 20 17:49:13 IST 2006

Recently, I started using UTF-8 enabled applications to read and write 
in Tamil, the local official language here. It appears that Indic 
languages have been incorrectly represented in Unicode. India sent fewer 
than 128 characters per language to the Unicode Consortium in the 1990s, 
far fewer than the full complement of characters in each. For example, 
among Tamil characters, only 31 (12 vowels, 18 consonants and 1 final, 
ஃ) have specific codes, and the chart omits almost 12 x 18 (216) 
vowel-consonant combinations, which now have to be encoded with three to 
nine bytes per character. To make things worse, the arrangement is not 
in any natural order, so sorting is difficult. It appears difficult to 
amend the charts now, as a number of applications have started using the 
Unicode code charts. Almost all Indic languages have the same problem.
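
To make the byte counts concrete, here is a minimal sketch, assuming 
Python 3 and its standard unicodedata module. The variable names are 
only illustrative and are not taken from any of the documents linked in 
this post. It prints the code points and the UTF-8 length of a bare 
consonant and of two consonant-vowel combinations:

```python
import unicodedata

def utf8_len(text):
    """Return the number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

ka = "\u0b95"          # bare consonant KA
ki = "\u0b95\u0bbf"    # KA + vowel sign I
ko = "\u0b95\u0bca"    # KA + vowel sign O (single precomposed sign)

# Canonical decomposition (NFD) splits vowel sign O into two signs.
ko_nfd = unicodedata.normalize("NFD", ko)

samples = [("ka", ka), ("ki", ki), ("ko", ko), ("ko, NFD", ko_nfd)]
for label, s in samples:
    code_points = " ".join(f"U+{ord(c):04X}" for c in s)
    print(f"{label:8} {code_points:22} {utf8_len(s)} bytes")
```

The bare consonant takes 3 bytes, each combination takes 6, and the 
decomposed form of the last one takes 9, which matches the three to nine 
bytes mentioned above.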

Some would now like to have a 16-bit encoded Tamil-New chart, with codes 
allocated for 250+ characters in the Private Use Area. I am not sure 
whether other Indic language groups are aware of these issues, or what 
their plans are to deal with them.

Padmakumar pointed out these issues to the fsf-friends mailing list 
in 2004:

http://mm.gnu.org.in/pipermail/fsf-friends/2004-December/002653.html
along with a link to the article at:
http://www.angelfire.com/empire/thamizh/2/
(Sadly, there was no response to it.)

A recent TVU conference document on these issues is available at:
http://tamilvu.org/coresite/html/cwwhatnw.htm

There are a number of things that need to be done:
[1] Add any missing characters and re-arrange the Tamil Unicode 
characters within the existing range of 128 so that sorting can be 
done.
[2] Examine the TVU document and offer suggestions to those concerned 
regarding Tamil 16-bit encoding.
[3] Almost all Indic languages are in the same boat here, and 
therefore the language groups ought to come up with workable plans to 
remove the problems.
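
As a rough illustration of the sorting problem behind item [1], the 
sketch below compares plain code point order with a hand-written 
ordering table. It assumes Python 3. The table and the tamil_key helper 
are hypothetical, written for this example only, and the letter order 
used (12 vowels, then ஃ, then 18 consonants ending in ற and ன) is the 
conventional alphabet order as I understand it. Vowel signs and the 
pulli are not handled, so this is not a complete collation.

```python
# Hypothetical ordering table: 12 vowels, aytham, then 18 consonants
# in the conventional alphabet order (an assumption of this sketch).
VOWELS = "\u0b85\u0b86\u0b87\u0b88\u0b89\u0b8a\u0b8e\u0b8f\u0b90\u0b92\u0b93\u0b94"
AYTHAM = "\u0b83"
CONSONANTS = (
    "\u0b95\u0b99\u0b9a\u0b9e\u0b9f\u0ba3\u0ba4\u0ba8\u0baa"
    "\u0bae\u0baf\u0bb0\u0bb2\u0bb5\u0bb4\u0bb3\u0bb1\u0ba9"
)
ORDER = {ch: rank for rank, ch in enumerate(VOWELS + AYTHAM + CONSONANTS)}

def tamil_key(text):
    """Sort key: table letters by rank, anything else after them by code point."""
    return [(0, ORDER[ch]) if ch in ORDER else (1, ord(ch)) for ch in text]

# Four consonants: NNNA (U+0BA9), PA (U+0BAA), RRA (U+0BB1), LA (U+0BB2).
letters = ["\u0ba9", "\u0baa", "\u0bb1", "\u0bb2"]

by_code_point = sorted(letters)               # order of the code chart
by_table = sorted(letters, key=tamil_key)     # order of the table above

print("code point order:", " ".join(by_code_point))
print("table order:     ", " ".join(by_table))
```

With code point order, ன comes first; with the table, it comes last, 
after ப, ல and ற. A workaround of this kind has to be carried by every 
application that sorts Tamil text, which is why a better arrangement in 
the chart itself is being asked for.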

-Ramanraj K