Commit 999d4fe

Described collations but something is wrong
1 parent b0cdbe3 commit 999d4fe

2 files changed

+54
-3
lines changed

doc/1.7/book/box/box_space.rst

Lines changed: 3 additions & 1 deletion
@@ -253,7 +253,8 @@ A list of all ``box.space`` functions follows, then comes a list of all
253253
| parts | field-numbers + types | {field_no, 'unsigned' or | ``{1, 'unsigned'}`` |
254254
| | | 'string' or 'integer' or | |
255255
| | | 'number' or 'boolean' or | |
256-
| | | 'array' or 'scalar'} | |
256+
| | | 'array' or 'scalar', | |
257+
| | | and optional collation} | |
257258
+---------------------+-------------------------------------------------------+----------------------------------+-------------------------------+
258259
| dimension | affects :ref:`RTREE <box_index-rtree>` only | number | 2 |
259260
+---------------------+-------------------------------------------------------+----------------------------------+-------------------------------+
@@ -312,6 +313,7 @@ A list of all ``box.space`` functions follows, then comes a list of all
312313
* **string**: any set of octets, up to the :ref:`maximum length
313314
<limitations_bytes_in_index_key>`. May also be called 'str'. Legal in
314315
memtx TREE or HASH or BITSET indexes, and in vinyl TREE indexes.
316+
A string may have a :ref:`collation <index-collation>`.
315317
* **integer**: integers between -9223372036854775808 and 18446744073709551615.
316318
May also be called 'int'. Legal in memtx TREE or HASH indexes, and in
317319
vinyl TREE indexes.

doc/1.7/book/box/data_model.rst

Lines changed: 51 additions & 2 deletions
@@ -189,13 +189,15 @@ A **boolean** is either ``true`` or ``false``.
189189
A **string** is a variable-length sequence of bytes, usually represented with
190190
alphanumeric characters inside single quotes. In both Lua and MsgPack, strings
191191
are treated as binary data, with no attempts to determine a string's
192-
character set or to perform any string conversion.
193-
So, string sorting and comparison are done byte-by-byte, without any special
192+
character set or to perform any string conversion -- unless there is an optional
193+
:ref:`collation <index-collation>`.
194+
So, usually, string sorting and comparison are done byte-by-byte, without any special
194195
collation rules applied.
195196
(Example: numbers are ordered by their point on the number line, so 2345 is
196197
greater than 500; meanwhile, strings are ordered by the encoding of the first
197198
byte, then the encoding of the second byte, and so on, so '2345' is less than '500'.)
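
The difference between numeric and byte-by-byte ordering can be checked in plain Lua. This is a minimal sketch, not Tarantool-specific code; it assumes the default "C" locale, where Lua's string comparison is byte-wise, and the variable names are illustrative only.

```lua
-- Numbers compare by their position on the number line.
local as_numbers = 2345 > 500

-- Strings compare byte by byte: '2' (byte 50) sorts before '5' (byte 53),
-- so the comparison is decided at the first byte.
local as_strings = '2345' < '500'

print(as_numbers, as_strings)
```

Both comparisons evaluate to ``true``, which matches the ordering described in the example above.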
198199

200+
199201
.. _index-box_number:
200202

201203
In Lua, a **number** is double-precision floating-point, but Tarantool allows both
@@ -321,6 +323,53 @@ Here's how Tarantool indexed field types correspond to MsgPack data types.
321323
| | then strings. | | |
322324
+----------------------------+----------------------------------+----------------------+--------------------+
323325

326+
.. _index-collation:
327+
328+
--------------------------------------------------------------------------------
329+
Collations
330+
--------------------------------------------------------------------------------
331+
332+
By default, when Tarantool compares strings, it uses what we call a
333+
"binary" collation. The only consideration is the numeric value of
334+
each byte in the string. Therefore, if the string is encoded with
335+
ASCII or UTF-8, then 'A' < 'B' < 'a', because the encoding of 'A'
336+
(what used to be called the "ASCII value") is 65, the encoding of
337+
'B' is 66, and the encoding of 'a' is 97. Binary collation is best
338+
if you prefer fast, deterministic, simple maintenance and searching
339+
with Tarantool indexes.
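
A binary ordering of this kind can be reproduced in plain Lua by sorting single-character strings and printing their byte values. This is a sketch under the assumption that the process runs in the default "C" locale, so that Lua compares strings byte by byte; it does not use any Tarantool API, and the names ``letters`` and ``codes`` are illustrative.

```lua
-- Sort three letters with Lua's default string comparison.
local letters = {'a', 'B', 'A'}
table.sort(letters)

-- Collect the numeric value of each letter's single byte.
local codes = {}
for i, letter in ipairs(letters) do
    codes[i] = string.byte(letter)
end

print(table.concat(letters, ' '))
print(table.concat(codes, ' '))
```

The upper-case letters come first because their byte values (65 and 66) are smaller than the byte value of 'a' (97).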
340+
341+
But if you want the order that you see in phone books and dictionaries,
342+
then you need either 'A' < 'a' < 'B' or 'A' = 'a' < 'B'. These are Tarantool's
343+
optional collations, 'unicode' and 'unicode_s1', respectively. In fact, though,
344+
good collation involves much more than these simple examples of
345+
upper case / lower case equivalence in alphabets.
346+
We also consider accent marks, non-alphabetic writing systems,
347+
and special rules that apply for combinations of characters.
348+
349+
The optional collations always use the ordering according to the
350+
`Default Unicode Collation Element Table <http://unicode.org/Public/UCA/latest/allkeys.txt>`_
351+
and the rules described in
352+
`Unicode® Technical Standard #10 Unicode Collation Algorithm <http://unicode.org/reports/tr10>`_.
353+
The optional collations are best if you prefer a multilingual,
354+
standard, end-user-oriented order in Tarantool indexes.
355+
356+
Example showing order of some Russian words with unicode_s1 collation:
357+
358+
.. code-block:: none
359+
360+
tarantool> box.space.T:create_index('I', {parts = {1,'str', collation='unicode_s1'}})
361+
...
362+
tarantool> box.space.T.index.I:select()
363+
---
364+
- - ['ЕЛЕ']
365+
- ['елейный']
366+
- ['ёлка']
367+
- ['еловый']
368+
- ['елозить']
369+
- ['Ёлочка']
370+
- ['ёлочный']
371+
- ['ЕЛь']
372+
324373
.. _index-box_sequence:
325374

326375
--------------------------------------------------------------------------------
