@@ -189,13 +189,15 @@ A **boolean** is either ``true`` or ``false``.
A **string** is a variable-length sequence of bytes, usually represented with
alphanumeric characters inside single quotes. In both Lua and MsgPack, strings
are treated as binary data, with no attempts to determine a string's
- character set or to perform any string conversion.
- So, string sorting and comparison are done byte-by-byte, without any special
+ character set or to perform any string conversion -- unless there is an optional
+ :ref:`collation <index-collation>`.
+ So, usually, string sorting and comparison are done byte-by-byte, without any special
collation rules applied.
(Example: numbers are ordered by their point on the number line, so 2345 is
greater than 500; meanwhile, strings are ordered by the encoding of the first
byte, then the encoding of the second byte, and so on, so '2345' is less than '500'.)

+
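The number-versus-string example above can be checked with a short sketch. It is written in Python rather than Tarantool's Lua, purely for illustration, and it assumes that comparing Python ``bytes`` objects mirrors the byte-by-byte comparison described here. It does not call Tarantool.

```python
# Numbers compare by their position on the number line.
numbers_ordered = 2345 > 500

# Byte strings compare one byte at a time: b"2" (0x32) is less than
# b"5" (0x35), so the result is decided at the very first byte.
strings_ordered = b"2345" < b"500"

print(numbers_ordered, strings_ordered)
```

Both comparisons hold, which is why a numeric field and a string field holding the "same" digits can sort differently.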
.. _index-box_number:

In Lua, a **number** is double-precision floating-point, but Tarantool allows both
@@ -321,6 +323,53 @@ Here's how Tarantool indexed field types correspond to MsgPack data types.
| | then strings. | | |
+----------------------------+----------------------------------+----------------------+--------------------+

+ .. _index-collation:
+
+ --------------------------------------------------------------------------------
+ Collations
+ --------------------------------------------------------------------------------
+
+ By default, when Tarantool compares strings, it uses what we call a
+ "binary" collation. The only consideration is the numeric value of
+ each byte in the string. Therefore, if the string is encoded with
+ ASCII or UTF-8, then 'A' < 'B' < 'a', because the encoding of 'A'
+ (what used to be called the "ASCII value") is 65, the encoding of
+ 'B' is 66, and the encoding of 'a' is 97. Binary collation is best
+ if you prefer fast, deterministic, simple maintenance and searching
+ with Tarantool indexes.
+
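As a quick check of the binary ordering described above, the following sketch prints the byte value of each letter and sorts the letters by their encoded bytes. It uses Python rather than Tarantool's Lua for illustration only, and sorting on UTF-8 bytes is an assumed stand-in for the binary collation, not a call into Tarantool.

```python
letters = ["a", "B", "A"]

# Numeric value of the single UTF-8 byte that encodes each letter.
byte_values = {ch: ch.encode("utf-8")[0] for ch in letters}

# Sorting on the encoded bytes imitates a binary collation.
binary_order = sorted(letters, key=lambda s: s.encode("utf-8"))

print(byte_values)
print(binary_order)
```

Because every upper-case ASCII letter has a smaller byte value than every lower-case one, all upper-case letters sort first under this rule.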
+ But if you want the order that you see in phone books and dictionaries,
+ then you want either 'a' < 'A' < 'B' or 'a' = 'A' < 'B'. These orderings
+ correspond to Tarantool's optional collations, 'unicode' and 'unicode_s1',
+ respectively. In fact, though,
+ good collation involves much more than these simple examples of
+ upper case / lower case equivalence in alphabets.
+ We also consider accent marks, non-alphabetic writing systems,
+ and special rules that apply for combinations of characters.
+
+ The optional collations always use the ordering according to the
+ `Default Unicode Collation Element Table <http://unicode.org/Public/UCA/latest/allkeys.txt>`_
+ and the rules described in
+ `Unicode® Technical Standard #10 Unicode Collation Algorithm <http://unicode.org/reports/tr10>`_.
+ The optional collations are best if you prefer a standard, multilingual,
+ end-user-oriented order in Tarantool indexes.
+
+ Example showing order of some Russian words with unicode_s1 collation:
+
+ .. code-block:: none
+
+     tarantool> box.space.T:create_index('I', {parts = {1,'str', collation='unicode_s1'}})
+     ...
+     tarantool> box.space.T.index.I:select()
+     ---
+     - - ['ЕЛЕ']
+       - ['елейный']
+       - ['ёлка']
+       - ['еловый']
+       - ['елозить']
+       - ['Ёлочка']
+       - ['ёлочный']
+       - ['ЕЛь']
+
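To see how much the choice of collation changes the result, the sketch below sorts the same eight words in two ways. It is written in Python rather than Tarantool's Lua and does not call Tarantool. ``binary_key`` compares UTF-8 bytes, as the default binary collation does. ``rough_s1_key`` is a hypothetical stand-in for ``unicode_s1``: it only folds case and treats 'ё' as 'е', which happens to be enough for this word list but is not the Unicode Collation Algorithm.

```python
import unicodedata

words = ["ЕЛЕ", "елейный", "ёлка", "еловый", "елозить", "Ёлочка", "ёлочный", "ЕЛь"]


def binary_key(word):
    # Binary collation: only the byte values of the encoded string matter.
    return word.encode("utf-8")


def rough_s1_key(word):
    # Hypothetical approximation, NOT the real UCA: compose the text,
    # ignore letter case, and treat 'ё' as equal to 'е'.
    composed = unicodedata.normalize("NFC", word)
    return composed.casefold().replace("ё", "е")


binary_order = sorted(words, key=binary_key)
rough_s1_order = sorted(words, key=rough_s1_key)

print(binary_order)
print(rough_s1_order)
```

With these definitions, the binary order starts with 'Ёлочка' and puts 'ёлка' and 'ёлочный' last, while the rough key reproduces the order listed in the example above.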
.. _index-box_sequence:

--------------------------------------------------------------------------------