
Map Development Area (MDA)

 
unicode
is-adopted-for
- w3c [ adopter ]
- XHTML, Extensible HyperText Markup Language [ purpose ]
- w3c [ adopter ]
- XML [ purpose ]
- w3c [ adopter ]
- HTML [ purpose ]
is-encoded-with
- Unicode Transformation Format, UTF-8 [ encoding ]
- Universal Character Set, UCS-4 [ encoding ]
- Unicode Transformation Format, UTF-16 [ encoding ]
- Unicode Transformation Format, UTF-7 [ encoding ]
- Unicode Transformation Format, UTF-32 [ encoding ]
- Universal Character Set, UCS-2 [ encoding ]
is-identical-to
- Universal Character Set, ISO/IEC 10646 [ thing2 ]
is-standardized-by
- ISO, International Organization for Standardization [ body ]
- Unicode Consortium [ body ]
is-structured-into
- Unicode Planes [ structure ]
collection-of-character-sets
- Unicode
- Universal Character Set, ISO/IEC 10646
- Windows Code Pages
- ISO/IEC 8859
Types:
Comment:
is mainly used in document processing, XML and SGML applications
example
Euro symbol has code U+20AC
motivation
solve the document exchange problem -- include all characters of all languages (on this planet only), living or dead, natural or invented -- 250 writing systems and thousands of languages
syntax
U+XXXX (each X is a hexadecimal digit; four to six digits, e.g. U+20AC or U+10FFFF)
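The notation can be illustrated with a minimal Python sketch. The helper names `format_code_point` and `parse_code_point` are hypothetical, chosen only for this example; `ord` and `chr` are Python built-ins.

```python
def format_code_point(ch: str) -> str:
    """Return the U+XXXX notation for a single character."""
    # At least four hexadecimal digits; more are used when the value needs them.
    return f"U+{ord(ch):04X}"


def parse_code_point(notation: str) -> str:
    """Turn a string such as 'U+20AC' back into the character it names."""
    if not notation.upper().startswith("U+"):
        raise ValueError(f"not in U+XXXX form: {notation!r}")
    return chr(int(notation[2:], 16))


euro = parse_code_point("U+20AC")
print(euro, format_code_point(euro))
```

The round trip through both helpers should give back the Euro symbol and its code U+20AC from the example above.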
Comment:
code points below AND ABOVE 65535 (U+FFFF) are used (NOT 16 bits only!) -- Unicode now uses 21 bits (code points up to U+10FFFF); UCS-4 stores each code point in 32 bits
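A small Python sketch of the point about 16 bits, assuming three sample characters picked only for illustration (Latin A, the Euro sign, and the musical G clef at U+1D11E):

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point Unicode defines

# Sample characters are an assumption for illustration only.
samples = ["A", "\u20ac", "\U0001D11E"]

for ch in samples:
    cp = ord(ch)
    # bit_length() gives the number of bits needed to store the value.
    print(f"U+{cp:04X}: {cp.bit_length()} bits, above 65535: {cp > 0xFFFF}")

# The largest code point needs 21 bits, so 16 bits cannot hold every character.
print(MAX_CODE_POINT.bit_length())
```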
Comment:
classifies characters into letters, numbers, punctuation, accents, ... -- maps cases (a to A) -- defines how to display characters -- how to combine characters -- how to treat bidirectional text -- algorithms for sorting, case folding, regular expressions
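Several of these properties can be inspected from Python's standard `unicodedata` module and the built-in string methods. The following is only a sketch: the sample characters are assumptions made for illustration, and the results depend on the Unicode version shipped with the interpreter.

```python
import unicodedata

# Classification: general category codes (Lu = uppercase letter,
# Nd = decimal digit, Sc = currency symbol).
categories = {ch: unicodedata.category(ch) for ch in ["A", "7", "\u20ac"]}
print(categories)

# Case mapping (a to A).
upper_a = "a".upper()

# Combining characters: "e" followed by COMBINING ACUTE ACCENT (U+0301)
# composes to the single code point U+00E9 under NFC normalization.
composed = unicodedata.normalize("NFC", "e\u0301")

# Bidirectional class: Hebrew alef is right-to-left, Latin A is left-to-right.
bidi_alef = unicodedata.bidirectional("\u05d0")
bidi_a = unicodedata.bidirectional("A")

# Case folding for caseless comparison.
folded = "Stra\u00dfe".casefold()

print(upper_a, composed, bidi_alef, bidi_a, folded)
```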
history
v2.0: 38,885 assigned characters -- v3.0: 49,194 -- v3.2: 95,156 -- v4.0: 96,382
Comment:
ISO 8859-1 is embedded: its 256 characters are the first 256 Unicode code points (U+0000 to U+00FF)
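The embedding can be checked with a short Python sketch, assuming Python's built-in `latin-1` codec as the stand-in for ISO 8859-1:

```python
# Decode every possible byte value as ISO 8859-1 (Python codec name "latin-1").
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Each byte value should map to the Unicode code point with the same number.
code_points = [ord(ch) for ch in decoded]
print(code_points == list(range(256)))
```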
Comment:
coordinated code points: Unicode standard by the Unicode Consortium -- ISO/IEC 10646, Universal Multiple-Octet Coded Character Set (UCS) by ISO -- "synchronized standards"
encoding
every code point could be encoded with 4 bytes, but this is wasteful for most text -- hence different encodings: UCS-2, UCS-4, UTF-8, UTF-16, UTF-32
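The space trade-off can be seen by encoding a few characters with Python's built-in codecs. This is a sketch under assumptions: the helper name `encoded_lengths` and the sample characters are hypothetical. Python has no codecs named UCS-2 or UCS-4, so the big-endian UTF-16 and UTF-32 codecs are used here; for characters up to U+FFFF, UTF-16 produces the same two bytes as UCS-2, and UTF-32 matches UCS-4 for all Unicode code points.

```python
ENCODINGS = ["utf-8", "utf-16-be", "utf-32-be"]


def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes each encoding needs for one character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


# Sample characters are an assumption for illustration only:
# Latin A, the Euro sign (U+20AC), and the musical G clef (U+1D11E).
for ch in ["A", "\u20ac", "\U0001D11E"]:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

UTF-32 always spends four bytes, while UTF-8 and UTF-16 use fewer bytes for the lower code points and grow only when a character needs it.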