Tamil All Character Encoding

Tamil All Character Encoding (TACE16) is a scheme for encoding the Tamil script in the Private Use Area of Unicode, implementing a syllabary-based character model differing from the modified-ISCII model used by Unicode's existing Tamil implementation.^[1]^[2]

Keyboard drivers and fonts

The keyboard driver for this encoding scheme is available free of charge on the Tamil Virtual Academy website.^[3]^[4] It uses the Tamil 99 and Tamil Typewriter keyboard layouts, which are approved by the Government of Tamil Nadu, and maps input keystrokes to the corresponding characters of the TACE16 scheme.^[2] Unicode Tamil fonts for reading files created using TACE16 are available on the same website.^[3]^[4] These fonts provide glyphs not only for the TACE16 characters but also for the ASCII and Tamil characters at their standard Unicode code points, which gives backward compatibility with existing files created using the Tamil Unicode block.

Character set

All the characters of this encoding scheme are located in the private use area of the Basic Multilingual Plane of Unicode's Universal Coded Character Set.

Tamil All Character Encoding (TACE16) Character Set^[5]
Vowels→		∅	A	Ā	I	Ī	U	Ū	E	Ē	Ai	O	Ō	Au	(Miscellaneous)
Consonants ↓		_0	_1	_2	_3	_4	_5	_6	_7	_8	_9	_A	_B	_C	_D	_E	_F
(Symbols)	U+E10_	௳	௴	௵	௶	௷	௸	௹	௺	○	●	★	ராஜ	ௐ
(Numbers)	U+E18_	௦	௧	௨	௩	௪	௫	௬	௭	௮	௯	௰	௱	௲
(Fractions)	U+E1A_	𑿌	𑿐	𑿑	𑿓	𑿅	𑿉	𑿎	𑿄	𑿈	𑿋	𑿍	𑿏	𑿀	𑿁	𑿂	𑿆
∅	U+E1F_	்		ா	ி	ீ	ு	ூ	ெ	ே	ை	ொ	ோ	ௌ
∅	U+E20_		அ	ஆ	இ	ஈ	உ	ஊ	எ	ஏ	ஐ	ஒ	ஓ	ஔ	ஃ
K	U+E21_	க்	க	கா	கி	கீ	கு	கூ	கெ	கே	கை	கொ	கோ	கௌ
Ng	U+E22_	ங்	ங	ஙா	ஙி	ஙீ	ஙு	ஙூ	ஙெ	ஙே	ஙை	ஙொ	ஙோ	ஙௌ
C	U+E23_	ச்	ச	சா	சி	சீ	சு	சூ	செ	சே	சை	சொ	சோ	சௌ
Ñ	U+E24_	ஞ்	ஞ	ஞா	ஞி	ஞீ	ஞு	ஞூ	ஞெ	ஞே	ஞை	ஞொ	ஞோ	ஞௌ
Ṭ	U+E25_	ட்	ட	டா	டி	டீ	டு	டூ	டெ	டே	டை	டொ	டோ	டௌ
Ṇ	U+E26_	ண்	ண	ணா	ணி	ணீ	ணு	ணூ	ணெ	ணே	ணை	ணொ	ணோ	ணௌ
T	U+E27_	த்	த	தா	தி	தீ	து	தூ	தெ	தே	தை	தொ	தோ	தௌ
N	U+E28_	ந்	ந	நா	நி	நீ	நு	நூ	நெ	நே	நை	நொ	நோ	நௌ
P	U+E29_	ப்	ப	பா	பி	பீ	பு	பூ	பெ	பே	பை	பொ	போ	பௌ
M	U+E2A_	ம்	ம	மா	மி	மீ	மு	மூ	மெ	மே	மை	மொ	மோ	மௌ
Y	U+E2B_	ய்	ய	யா	யி	யீ	யு	யூ	யெ	யே	யை	யொ	யோ	யௌ
R	U+E2C_	ர்	ர	ரா	ரி	ரீ	ரு	ரூ	ரெ	ரே	ரை	ரொ	ரோ	ரௌ
L	U+E2D_	ல்	ல	லா	லி	லீ	லு	லூ	லெ	லே	லை	லொ	லோ	லௌ
V	U+E2E_	வ்	வ	வா	வி	வீ	வு	வூ	வெ	வே	வை	வொ	வோ	வௌ
Ḻ	U+E2F_	ழ்	ழ	ழா	ழி	ழீ	ழு	ழூ	ழெ	ழே	ழை	ழொ	ழோ	ழௌ
Ḷ	U+E30_	ள்	ள	ளா	ளி	ளீ	ளு	ளூ	ளெ	ளே	ளை	ளொ	ளோ	ளௌ
Ṟ	U+E31_	ற்	ற	றா	றி	றீ	று	றூ	றெ	றே	றை	றொ	றோ	றௌ
Ṉ	U+E32_	ன்	ன	னா	னி	னீ	னு	னூ	னெ	னே	னை	னொ	னோ	னௌ
Grantha characters
J	U+E33_	ஜ்	ஜ	ஜா	ஜி	ஜீ	ஜு	ஜூ	ஜெ	ஜே	ஜை	ஜொ	ஜோ	ஜௌ
Sh	U+E34_	ஶ்	ஶ	ஶா	ஶி	ஶீ	ஶு	ஶூ	ஶெ	ஶே	ஶை	ஶொ	ஶோ	ஶௌ
Ṣ	U+E35_	ஷ்	ஷ	ஷா	ஷி	ஷீ	ஷு	ஷூ	ஷெ	ஷே	ஷை	ஷொ	ஷோ	ஷௌ
S	U+E36_	ஸ்	ஸ	ஸா	ஸி	ஸீ	ஸு	ஸூ	ஸெ	ஸே	ஸை	ஸொ	ஸோ	ஸௌ
H	U+E37_	ஹ்	ஹ	ஹா	ஹி	ஹீ	ஹு	ஹூ	ஹெ	ஹே	ஹை	ஹொ	ஹோ	ஹௌ
Kṣ	U+E38_	க்ஷ்	க்ஷ	க்ஷா	க்ஷி	க்ஷீ	க்ஷு	க்ஷூ	க்ஷெ	க்ஷே	க்ஷை	க்ஷொ	க்ஷோ	க்ஷௌ	ஶ்ரீ

Legend:
	Syllabograms with irregular glyphs, which inherently need to be handled individually by a font.^[a]
	Newly added. Not present in Unicode version 6.3.
	Corresponds to a character in the Tamil Supplement block, added in Unicode version 12 (2019)
	Allocated for research (NLP)
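The regular layout of the table means that a syllable's code point can be computed from its row and column. The following is a minimal sketch of that arithmetic, assuming only the row and column labels shown in the table above; the function name `tace16_code_point` and the use of romanised labels as keys are illustrative choices, not part of the TACE16 specification.

```python
# Row labels copied from the table above, in order from U+E21_ to U+E38_.
CONSONANT_ROWS = ["K", "Ng", "C", "Ñ", "Ṭ", "Ṇ", "T", "N", "P", "M", "Y", "R",
                  "L", "V", "Ḻ", "Ḷ", "Ṟ", "Ṉ", "J", "Sh", "Ṣ", "S", "H", "Kṣ"]

# Column labels: index 0 is the pure consonant (no vowel), 1..12 are the vowels.
VOWEL_COLUMNS = ["", "A", "Ā", "I", "Ī", "U", "Ū", "E", "Ē", "Ai", "O", "Ō", "Au"]

VOWEL_ROW_BASE = 0xE200        # row of stand-alone vowels (U+E20_)
FIRST_CONSONANT_BASE = 0xE210  # row K (U+E21_)


def tace16_code_point(consonant: str, vowel: str = "") -> int:
    """Return the private-use code point for a consonant/vowel pair.

    An empty consonant selects the stand-alone vowel row; an empty vowel
    selects the pure-consonant column of a consonant row.
    """
    column = VOWEL_COLUMNS.index(vowel)
    if consonant == "":
        if vowel == "":
            raise ValueError("need a consonant, a vowel, or both")
        return VOWEL_ROW_BASE + column
    row = CONSONANT_ROWS.index(consonant)
    return FIRST_CONSONANT_BASE + 0x10 * row + column


print(hex(tace16_code_point("M", "I")))  # 0xe2a3, the cell for மி in the table
print(hex(tace16_code_point("K")))       # 0xe210, the cell for க்
print(hex(tace16_code_point("", "A")))   # 0xe201, the cell for அ
```

The sketch covers only the vowel and consonant rows; the symbol, number and fraction rows, and the extra cells in column _D, would need their own lookups.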

Comparison of TACE16 to present Tamil Unicode

Criticism of the standard Unicode character model for Tamil

Unicode's encoding models for Devanagari, Tamil, Kannada, Sinhala and emoji require use of the invisible zero-width joiner and zero-width non-joiner characters.

The existing Unicode character model for Tamil is, like most of Indic Unicode,^[b] an abugida-based model derived from ISCII. It has been criticized for several reasons.^[1]

Unicode represents only 31 Tamil base characters as single code points, out of 247 grapheme clusters. These include stand-alone vowels, and 23 basic consonant glyphs (which, due to not bearing a virama, nonetheless denote a syllable with both a consonant and a vowel when used on their own). The others are represented as sequences of code points, requiring software support for advanced typography features (such as Apple Advanced Typography, Graphite, or OpenType advanced typography) to render correctly. This also requires the use of invisible zero-width joiner and zero-width non-joiner characters in places where the desired grapheme cluster would otherwise be ambiguous. This complexity can result in security vulnerabilities and ambiguous combinations, can require the use of an exception table to forbid invalid combinations of code points, and can necessitate the use of string normalization to compare two strings for equality.
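The need for normalization can be illustrated with the syllable கொ, which Unicode Tamil allows to be spelled with either two or three code points. The sketch below uses Python's standard `unicodedata` module; the helper name `same_text` is an illustrative assumption.

```python
import unicodedata

KA = "\u0b95"                      # TAMIL LETTER KA
composed = KA + "\u0bca"           # KA + TAMIL VOWEL SIGN O
decomposed = KA + "\u0bc6\u0bbe"   # KA + VOWEL SIGN E + VOWEL SIGN AA


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to Normalization Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Both spellings render as the same syllable but differ code point by code point.
print(composed == decomposed)           # False
print(same_text(composed, decomposed))  # True
```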

Additionally, since syllables with both a consonant and a vowel form 64 to 70% of Tamil text, an abugida-based model which encodes the consonant and vowel parts as separate code points is inefficient, in terms of how long a string needs to be to contain a given piece of text, in comparison with a syllabary-based model.
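A rough way to see the length difference is to count code points in a Unicode Tamil string and compare that with the number of units a syllabary would need. The sketch below is an approximation under a stated assumption: every dependent vowel sign or virama is treated as merging with the preceding consonant, and the conjuncts க்ஷ and ஶ்ரீ (single cells in the TACE16 table) are not given special treatment. The function names are illustrative.

```python
def is_dependent_sign(ch: str) -> bool:
    """True for Tamil vowel signs, the virama (pulli), and the AU length mark."""
    cp = ord(ch)
    return 0x0BBE <= cp <= 0x0BCD or cp == 0x0BD7


def code_point_count(text: str) -> int:
    """Number of code points in the Unicode Tamil spelling."""
    return len(text)


def syllable_unit_count(text: str) -> int:
    """Approximate number of code points a syllabary-based encoding would use."""
    return sum(1 for ch in text if not is_dependent_sign(ch))


word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # தமிழ் ("Tamil")
print(code_point_count(word))     # 5
print(syllable_unit_count(word))  # 3
```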

Furthermore, ISCII is primarily an encoding of Devanagari, and the ISCII encodings of other Brahmic scripts (including Tamil) encode characters over the code points of the corresponding characters in Devanagari ISCII. Although Unicode encodes the Brahmic scripts separately from one another, the Tamil block mirrors the ISCII layout (with Devanagari-style character ordering, and reserved space in positions corresponding to Devanagari characters with no Tamil equivalent); consequently, the characters are not in the natural sequence order, and strings collated by code point (analogous to "ASCIIbetical" sorting of English text) will not produce the expected sorting order. It requires a complex collation algorithm for arranging them in the natural order.
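The ordering mismatch can be shown with a few consonants. In the sketch below, the traditional order is taken from the consonant rows of the character-set table (K through Ṉ); the names `TRADITIONAL_ORDER` and `RANK` are illustrative, and a real collation algorithm would also have to handle vowels, vowel signs and Grantha letters.

```python
# The 18 consonants in the order of the table rows K .. Ṉ.
TRADITIONAL_ORDER = "கஙசஞடணதநபமயரலவழளறன"
RANK = {ch: i for i, ch in enumerate(TRADITIONAL_ORDER)}

letters = ["ப", "ன", "ந"]  # U+0BAA, U+0BA9, U+0BA8

# Plain sorted() compares code points, so ன (U+0BA9) lands before ப (U+0BAA).
by_code_point = sorted(letters)

# Sorting by rank restores the traditional order, in which ன comes last.
by_tradition = sorted(letters, key=RANK.__getitem__)

print(by_code_point)  # ['ந', 'ன', 'ப']
print(by_tradition)   # ['ந', 'ப', 'ன']
```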

TACE16 in comparison

The following figures, reported for e-governance and browsing workloads, compare current Unicode Tamil with TACE16:^[1]

TACE16 is about 5.46 to 11.94 percent more efficient than Unicode Tamil for data storage.
TACE16 is about 18.69 to 22.99 percent more efficient than Unicode Tamil for sorting index data.
TACE16 is about 25.39 percent more efficient than Unicode Tamil when the data is entirely Tamil. The default (binary) collation sequence obtained from the TACE16 code-space values follows Tamil dictionary order.
TACE16 is about 0.31 to 16.96 percent faster than Unicode Tamil in sorting.
Index creation on TACE16 data is 36.7 percent faster than on Unicode Tamil data.
For full key search on indexed fields, TACE16 performs better than Unicode Tamil by up to 24.07 percent. On non-indexed fields, it performs better by up to 20.9 percent.
Rendering of static Tamil data works with TACE16.

TACE16 provides improvements in processing time and space. It covers all general Tamil text; it is sequential; and it is unambiguous, with each code point corresponding to exactly one character.^[1] The TACE16 system takes fewer instruction cycles than Unicode Tamil, and also allows programming based on Tamil grammar, which needs extra framework development in Unicode Tamil.

Responses by the Unicode Consortium

The Unicode Consortium publishes a dedicated FAQ page on the Tamil script which responds to some of the criticisms. In defence of the ISCII model, the Consortium notes that expert linguists, typographers and programmers were involved in its development, but acknowledges that compromises were made due to ISCII being constrained to single-byte extended ASCII. The Consortium points out that Unicode Tamil is now implemented by all major operating systems and web browsers, and maintains that it should be used in open interchange contexts, such as online, since tools such as search engines would not necessarily be able to identify or interpret a sequence of Unicode private-use code points as Tamil text. However, the Consortium does not object to the use of Private-Use Area schemes, including TACE16, internally to particular processes for which they are useful. In particular, it highlights that both markup schemes and alternative encoding schemes may be used by researchers for specialised purposes such as natural-language processing.^[6]
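The interchange concern can be seen from the character properties that generic software has available. The sketch below uses Python's standard `unicodedata` module to compare a Unicode Tamil letter with a code point from the TACE16 range; the helper name `describe` is an illustrative assumption.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str]:
    """Return the general category and character name known to Unicode."""
    return unicodedata.category(ch), unicodedata.name(ch, "<no name>")


# A letter from the Tamil block carries script-specific properties.
print(describe("\u0b95"))  # ('Lo', 'TAMIL LETTER KA')

# The TACE16 cell for க is a private-use code point: category Co, no name,
# and nothing that identifies it as Tamil to software unaware of TACE16.
print(describe("\ue211"))  # ('Co', '<no name>')
```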

Unicode defines normative named-sequences for all Tamil pure consonants and syllables which are represented with sequences of more than one code point, and a dedicated table is published as part of the Unicode Standard listing all of these sequences, in their traditional order, along with their correct glyphs. The Consortium points out that it has been open to accepting proposals for characters for which no existing Unicode representation exists: for example, adding several historical fractions and other symbols as the Tamil Supplement block in version 12.0 in 2019.^[6]
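Named sequences can be queried programmatically. The sketch below assumes a CPython version (3.3 or later) in which `unicodedata.lookup` resolves named sequences as well as single character names; the sequence names are those published in the Unicode named-sequences data, and the variable names are illustrative.

```python
import unicodedata

# Each lookup returns a multi-code-point string rather than a single character.
moo = unicodedata.lookup("TAMIL SYLLABLE MOO")
sai = unicodedata.lookup("TAMIL SYLLABLE SAI")

for name, seq in (("MOO", moo), ("SAI", sai)):
    # Show the sequence as U+XXXX values alongside its length.
    print(name, len(seq), " ".join(f"U+{ord(ch):04X}" for ch in seq))
```

Each result is a consonant followed by a vowel sign, which is the same sequence a user would type or store in ordinary Unicode Tamil text.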

Regarding collation, the Consortium argues that obtaining the correct result from sorting by code point is the exception rather than the rule, highlighting that, in unmodified ASCIIbetical ordering, the uppercase Latin letter Z sorts before the lowercase letter a, and also highlighting that collation rules often differ by language (see e.g. ö). Regarding space efficiency, the Consortium argues that storage space and bandwidth taken up by text is usually far overshadowed by other accompanying media such as images and video, and that text content performs well under general-purpose compression methods such as Deflate (originally from the ZIP file format, standardized in RFC 1951 and integrated in the HTTP protocol as a generic encoding scheme).^[6]
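Both points can be tried out with the Python standard library. The sketch below sorts two English words by code point and then compresses UTF-8 encoded Tamil text with `zlib`, which wraps a Deflate stream in the zlib container. The sample text is an artificial repetition chosen for illustration, so the size reduction it shows says nothing about real documents; the function names are assumptions.

```python
import zlib


def ascii_sorted(words: list[str]) -> list[str]:
    """Sort by code point, the 'ASCIIbetical' order."""
    return sorted(words)


def deflate(data: bytes) -> bytes:
    """Compress with zlib (Deflate inside a zlib wrapper)."""
    return zlib.compress(data)


# Uppercase letters have lower code points than lowercase ones.
print(ascii_sorted(["apple", "Zebra"]))  # ['Zebra', 'apple']

# Artificial sample: one Tamil word repeated many times, encoded as UTF-8.
raw = ("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd " * 200).encode("utf-8")
packed = deflate(raw)
print(len(raw), len(packed))
```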

Unicode Stability Policy

When first published (version 1.0.0), Unicode made only limited stability guarantees. As such, the original Tibetan block was deleted in version 1.0.1 (and its space has since been occupied by the Myanmar block), and the original block for Korean syllables was deleted in version 2.0 (and is now occupied by CJK Unified Ideographs Extension A). Both the current Hangul Syllables block for Korean syllables, and the current Tibetan block, date back to Unicode 2.0. This was done on the assumption that little or no existing content using Unicode for those writing systems existed,^[7] since it would break compatibility with all existing Unicode content in, and input methods for, those writing systems. After this so-dubbed "Korean mess", the responsible committees pledged not to make such a compatibility-breaking change ever again,^[7] which now forms part of the Unicode Stability Policy.^[8]

This stability policy has been upheld ever since, in spite of demands to re-encode or change the character model for both Tibetan and Korean a second time, made by China and North Korea respectively.^[9]^[10]^[11]^[12] Likewise in relation to Tamil, the Consortium emphasises the "crucial issue of maintaining the stability of the standard for existing implementations", and argues that "the resulting costs and impact of destabilizing the standard" would substantially outweigh any efficiency benefits in processing speed or storage space.^[6]

A 2012 proposal to re-encode Tamil^[13] was rejected by the Unicode Consortium, which stated that re-encoding would be damaging and that there was no convincing evidence that the Unicode Tamil encoding is deficient.^[14]

Alternatives

Open-Tamil

The Open-Tamil project^[15] provides many common Tamil text-processing operations for Python. It claims Level-1 compliance for Tamil text processing without using TACE16, but it relies on the extra programming logic that Unicode Tamil requires.

Footnotes

[a] Highlighted syllabograms in the U and Ū columns are those where the vowel portion of the glyph matches neither the simple subjoining forms shown for those combining vowel marks in the Unicode block chart, nor the right-joining Grantha forms (as used for those combining vowel marks in isolation by, for example, Noto fonts).
[b] Except for Tibetan, which uses a different model, and for Thai and related scripts, which use a model derived from TIS-620.

References

[1] Report on the Final Recommendations of the Task Force on TACE16 (PDF) (Report).
[2] "Tender Document for Development of Tamil Fonts and Tamil Keyboard driver for 16-bit encodings (Unicode and TACE16)" (PDF). Tamil Virtual Academy.
[3] "தமிழ் எழுத்துருக்கள்". தமிழ் இணையக் கல்விக்கழகம் Tamil Virtual Academy.
[4] Tamil Nadu Government's Order (G.O.), Keyboard Drivers and Fonts. Archived 27 December 2023 at archive.today.
[5] Tamil Virtual Academy. "Annexure 4: Typewriter Extended Keyboard Sequence for Unicode and TACE16" (PDF). Tender Document for Development of Tamil Fonts and Tamil Keyboard driver for 16-bit encodings (Unicode and TACE16). Chennai.
[6] "FAQ - Tamil Language and Script". Unicode Consortium.
[7] Yergeau, F. (1998). UTF-8, a transformation format of ISO 10646. IETF. doi:10.17487/rfc2279. RFC 2279.
[8] "Unicode Character Encoding Stability Policies". Unicode Consortium.
[9] West, Andrew (2006-09-14). "Precomposed Tibetan Part 1 : BrdaRten". BabelStone.
[10] China National Body (2003-10-20). "China's Statement of BrdaRten ad hoc". ISO/IEC JTC1/SC2/WG2 N2674.
[11] Karlsson, Kent (2000-03-02). "Comments on DPRK New Work Item proposal on Korean characters". ISO/IEC JTC1/SC2/WG2 N2167.
[12] Cho, Chun-Hui (2000-07-05). "DPRK letter on character names and ordering in 10646-1: 2000" (PDF). ISO/IEC JTC1/SC2/WG2 N2231.
[13] Anantham, A.R.Amaithi (2012-01-26). "Fresh Encoding Proposals" (PDF). Unicode.
[14] "Archive of Notices of Non-Approval". Unicode. 2012-03-05.
[15] Annamalai, M.; Arulalan, T., Open-Tamil: Tamil language text processing tools for Python v3, retrieved 2023-12-31.

