Implement count() method #74

Closed
akudiyar opened this issue Oct 29, 2020 · 15 comments · Fixed by #230
Labels
customer feature A new functionality

Comments

@akudiyar
Contributor

akudiyar commented Oct 29, 2020

Use cases:
a) A customer wants to see the number of templates in message template catalogs in the application UI. The templates and catalogs are stored in vshard and operated on via CRUD through a connector.
b) Pagination: we need to know the total number of results to display it in the UI.

The count() method must accept conditions.

Proposed API variant:

    local count = crud.count({{'=', 'status', 'NEW'}})
@knazarov
Contributor

I would agree to add count() and bsize() methods for the whole space, but I won't support adding counts by a condition. In too many cases this will be a full scan.

In almost all cases I know, it is safe to not show the total count of whatever you return to the user.

@akudiyar
Contributor Author

akudiyar commented Oct 29, 2020

len() and bsize() don't help if the customer wants to count filtered tuples without actually loading them to the client (over the network, in the general case).

A full scan on a sharded space is not a big deal if it is not done often. We cannot avoid full scans completely in other cases anyway, so developers will always need to know how they work.

@knazarov
Contributor

It is a big deal because it stops other things from accessing a database. And yes, we definitely can avoid full scans. We just won't allow them through the API.

@akudiyar
Contributor Author

If we don't allow full scans, they will not disappear from the customers' tasks. The pain will just shift to another place.

We need some kind of support for such tasks to be able to implement these things in connectors.

@knazarov
Contributor

Your statement is demonstrably false. Aerospike and Redis can exist without such queries. They have the ability to iterate over the collection on the client the same way we propose to do with select or pairs.

To count the items of a large collection you can create a separate space with counters. With interactive transactions, you can atomically update both of those spaces from the client. This will not require you to write any additional code.

If you have only a few items to count, you can just select them all.
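
The separate space with counters suggested above can be sketched roughly as follows. This is a minimal illustration, not code from the crud module: plain Lua tables stand in for the two spaces, and the names `templates`, `status_counters`, `put_template`, `delete_template` and `count_by_status` are all hypothetical. In a real cluster the two writes would have to happen in one transaction.

```lua
-- Plain Lua tables stand in for two spaces (hypothetical names).
local templates = {}        -- id -> {status = ...}
local status_counters = {}  -- status -> number of templates with that status

-- Add `delta` to the counter kept for `status`.
local function adjust(status, delta)
    status_counters[status] = (status_counters[status] or 0) + delta
end

-- Insert or replace a template and keep the counters in step.
-- In a real setup both writes would belong to a single transaction.
local function put_template(id, status)
    local old = templates[id]
    if old ~= nil then
        adjust(old.status, -1)
    end
    templates[id] = {status = status}
    adjust(status, 1)
end

-- Delete a template and decrement the counter for its status.
local function delete_template(id)
    local old = templates[id]
    if old == nil then
        return
    end
    adjust(old.status, -1)
    templates[id] = nil
end

-- Counting is a single lookup, with no scan over `templates`.
local function count_by_status(status)
    return status_counters[status] or 0
end

put_template(1, 'NEW')
put_template(2, 'NEW')
put_template(3, 'SENT')
put_template(2, 'SENT')  -- a status change moves one unit between counters
delete_template(1)
print(count_by_status('NEW'), count_by_status('SENT'))
```

The trade-off is that every write path has to go through helpers like these, which is the boilerplate the following comments argue about.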

@akudiyar
Contributor Author

akudiyar commented Oct 30, 2020

But counters in special spaces look like an implementation detail. Why can't we have a count() method in the CRUD API that does all this boilerplate under the hood?

I see that for every simple task like count, which may involve scan complexity, we are going to push customers to reinvent the wheel. And connectors cannot help to avoid this because there is no DDL API for now.

UPD: There is a problem that CRUD API doesn't rely on any DDL API at the moment too.

@no1seman

no1seman commented Aug 6, 2021

It seems it's time to triage this once more, because we have the following use case:
the user needs a count by arbitrary conditions and accepts that the result will not be accurate.

So I suggest implementing count_async:

  • arguments and options like crud.select/crud.pairs;
  • implement storage_count_async as a loop over pairs that counts the rows in the space and yields every batch_size rows;
  • the router must call storage_count_async on all replicasets.

To avoid any locks and slowdowns, implement a mutex that guarantees storage_count_async runs no more than N times simultaneously on each storage.
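
A rough sketch of the storage-side part of this proposal, under stated assumptions: the function name `count_async`, the `max_running` option and the injected `yield` callback are hypothetical, and a plain Lua array stands in for the space. Inside Tarantool the callback would presumably be `require('fiber').yield`; here it is injected so the sketch runs in plain Lua.

```lua
-- Number of counts currently in flight on this (simulated) storage.
local running = 0

-- Count rows matching `predicate`, calling opts.yield every opts.batch_size
-- rows, and refuse to start if opts.max_running counts are already running.
local function count_async(rows, predicate, opts)
    opts = opts or {}
    local batch_size = opts.batch_size or 100
    local max_running = opts.max_running or 1
    local yield = opts.yield or function() end

    if running >= max_running then
        return nil, 'too many concurrent counts'
    end
    running = running + 1

    -- pcall makes sure the slot is released even if the predicate fails.
    local ok, result = pcall(function()
        local count, scanned = 0, 0
        for _, row in ipairs(rows) do
            if predicate(row) then
                count = count + 1
            end
            scanned = scanned + 1
            if scanned % batch_size == 0 then
                yield()  -- let other work run; the data may change meanwhile
            end
        end
        return count
    end)

    running = running - 1
    if not ok then
        return nil, result
    end
    return result
end

-- Usage: ten rows, every second one has status NEW.
local rows = {}
for i = 1, 10 do
    rows[i] = {status = (i % 2 == 0) and 'NEW' or 'SENT'}
end

local yields = 0
local n, err = count_async(
    rows,
    function(row) return row.status == 'NEW' end,
    {batch_size = 3, yield = function() yields = yields + 1 end}
)
print(n, err, yields)
```

Because other writers can run during each yield, the result is only approximate, which matches the use case stated in this comment.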

@artur-barsegyan

@no1seman
Here we are solving a special case of a general problem: a map-reduce call for crud.
I suggest thinking in the direction of sending a stored procedure with a special contract for the return value and calling this procedure from the router.

For example, another frequent task on a cluster is writing a set of data to a storage in a transaction, where that transaction needs to perform many different operations on the storage.

It is not necessary to send the procedure code through crud; we can simply teach it to call an already existing stored procedure.

@unera

unera commented Oct 1, 2021

    local count = crud.count({{'=', 'status', 'NEW'}})

Let's do something like

    crud.count({ '=', 'status', 'NEW' }, {options})

where options:

  • sec_scan, default value is false

Implementation: count looks through the space indexes and tries to find an index for status.

  • If the index is found, count iterates using it.
  • If the index is not found, count iterates using the pk, but only if options.sec_scan == true.

The same for bsize and pairs.
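
The index selection described in this comment could look roughly like the sketch below. It is only an illustration: `pick_index` and the shape of the `indexes` table (a list of `{name = ..., parts = {...}}` with the primary key first) are assumptions made for the example, not the crud module's internals.

```lua
-- Pick the index to iterate over when counting by `field`.
-- `indexes` is a list of {name = ..., parts = {field names}}, primary key first.
local function pick_index(indexes, field, opts)
    opts = opts or {}
    for _, index in ipairs(indexes) do
        -- An index is usable if the field is its first part.
        if index.parts[1] == field then
            return index.name
        end
    end
    if opts.sec_scan then
        -- No suitable index: fall back to scanning by the primary key.
        return indexes[1].name
    end
    return nil, 'no index for field "' .. field .. '" and sec_scan is not set'
end

local indexes = {
    {name = 'primary', parts = {'id'}},
    {name = 'status_idx', parts = {'status'}},
}

print(pick_index(indexes, 'status'))
print(pick_index(indexes, 'title'))
print(pick_index(indexes, 'title', {sec_scan = true}))
```

With `sec_scan` left at its default, a condition on an unindexed field is rejected instead of silently turning into a full scan.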

@unera unera removed the teamP label Oct 1, 2021
@no1seman

no1seman commented Oct 2, 2021

@unera Why not use the same API as select/pairs? The main difference from select/pairs: count does not fetch data and runs with yields. So it seems we need the following options:

  • batch_size: the number of pairs iterations after which to yield;
  • use_box_count: in some cases, for example when the space is not huge, we may want to count precisely by index, but if the space is huge we need to count approximately, with yields. This option may be automatic: we can get the len of the space on this particular instance, and if it is larger than COUNT_HARD_LIMIT, fall back to the approximate algorithm.
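
The automatic choice mentioned for use_box_count might be sketched like this. The threshold value, the function name `choose_strategy` and the returned strategy labels are hypothetical; only the option name and COUNT_HARD_LIMIT come from the comment above.

```lua
-- Hypothetical threshold; the comment above does not give a value.
local COUNT_HARD_LIMIT = 1000

-- Decide how to count on one instance, given the local space length.
-- An explicit opts.use_box_count wins; otherwise the length decides.
local function choose_strategy(space_len, opts)
    opts = opts or {}
    if opts.use_box_count ~= nil then
        return opts.use_box_count and 'exact' or 'approximate'
    end
    if space_len > COUNT_HARD_LIMIT then
        return 'approximate'  -- yielding scan, result may be stale
    end
    return 'exact'            -- precise count by index, no yields
end

print(choose_strategy(10))
print(choose_strategy(5000))
print(choose_strategy(5000, {use_box_count = true}))
```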

@AnaNek AnaNek self-assigned this Oct 6, 2021
@Mons

Mons commented Nov 2, 2021

One more way to kill the whole cluster with one wrong query.
Full scans and filters are pure evil for Tarantool.

@unera

unera commented Nov 2, 2021

@no1seman

Why not to use the same API as select/pairs?

I agree :)

I didn't think that the question and select/pairs are different.

So, let's do it like select/pairs. Disregard my comment from Oct 1.

@unera

unera commented Nov 2, 2021

Like this:

    local objects, err = crud.count(space_name, conditions, opts)

The syntax is the same, excluding these options:

  • first
  • after
  • batch_size
  • fields

@no1seman

no1seman commented Nov 2, 2021

@unera batch_size may be used as the number of pairs iterations between yields, or there may be some other option for that.

@R-omk

R-omk commented Nov 2, 2021

What about this case:
select count(field) from t1, where the field can be nullable?

Instead of inventing one more non-working 'killer feature', can we make a general map/reduce?

AnaNek added a commit that referenced this issue Dec 27, 2021
For `count` implementation with the support of the
query by conditions there is a need to use query plan
and condition filters that has been already written
for select. This commit separates common methods from
select module and moves them in common folders.

Part of #74
AnaNek added a commit that referenced this issue Dec 27, 2021
This commit introduces count method that:
* has arguments and options like `select()`/`pairs()`;
* counts number of rows in space with yield by `count_to_yield`;
* counts by any conditions.

Closes #74
AnaNek added a commit that referenced this issue Feb 18, 2022
This commit introduces count method that:
* has arguments and options like `select()`/`pairs()`;
* counts number of rows in space with yield by `yield_every`;
* counts by any conditions.

Closes #74
@Totktonada Totktonada added the 5sp label Feb 21, 2022
DifferentialOrange added a commit that referenced this issue Apr 8, 2022
Several "0.10.0" section changelog entries (see PRs #230 and #239) were
added after 0.10.0 release. This patch fixes the inconsistency by moving
them to "Unreleased" section.

Follows up #74, #237

10 participants