Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fullCount performance #4801

Open
ldap4life opened this issue Mar 9, 2018 · 6 comments
Open

fullCount performance #4801

ldap4life opened this issue Mar 9, 2018 · 6 comments

Comments

@ldap4life
Copy link

@ldap4life ldap4life commented Mar 9, 2018

my environment running ArangoDB

I'm using the latest ArangoDB of the respective release series:

  • 2.8
  • 3.0
  • 3.1
  • 3.2
  • 3.3
  • self-compiled devel branch

Mode:

  • Cluster
  • Single-Server

Storage-Engine:

  • mmfiles
  • rocksdb

On this operating system:

  • DCOS on
    • AWS
    • Azure
    • own infrastructure
  • Linux
    • Debian .deb
    • Ubuntu .deb
    • SUSE .rpm
    • RedHat .rpm
    • Fedora .rpm
    • Gentoo
    • docker - official docker library
    • other:
  • Windows, version:
  • MacOS, version:

this is an AQL-related issue:

[ ] I'm using graph features

I'm issuing AQL via:

  • web interface with this browser: running on this OS:
  • arangosh
  • this Driver: Java

Doing aql queries with a limit and the fullCount option perform very poorly in my testing. It turns out to be faster to do the query with full count off, and then a second query to get the full count manually.

@jsteemann
Copy link
Contributor

@jsteemann jsteemann commented Mar 12, 2018

Using fullCount will make a query run further even if the request number of documents have been reached. So it will naturally be slower than when omitting fullCount altogether.
Using fullCount may be slower or faster than when using a separate query just for counting. This depends on whether the fullCount query contains complex calculations that may be omitted in an extra counting query.
For example,

FOR doc IN collection
  LIMIT 0, 40
  RETURN doc

with fullCount will likely be slower than the same query with fullCount turned off and an extra couting query. The reason here is that counting the number of documents in a collection is fast and does not need to enumerate them individually.
Another example:

FOR doc IN collection
  LET subs = (FOR sub IN collection2 FILTER sub.attr == doc._key RETURN sub)
  LIMIT 0,40 
  RETURN { doc, subs }

In this example, the fullCount variant of the query will be much more expensive than the original query without fullCount and a separate counting query that can omit the subquery.

Can you provide an example of the query that is particularly slow with fullCount, plus the number of documents in the respective collections.

@dothebart
Copy link
Contributor

@dothebart dothebart commented Mar 12, 2018

@jsteemann
Copy link
Contributor

@jsteemann jsteemann commented Mar 12, 2018

@dothebart: go ahead!

@ldap4life
Copy link
Author

@ldap4life ldap4life commented Mar 12, 2018

my query is similar to

for log in logs sort log.date limit 0, 20 return log

there are about 500k docs. It runs in about 200ms with full count on, 2ms with full count off.

fwiw to get the count manually im adding this to the same query COLLECT WITH COUNT INTO length RETURN length

@jsteemann
Copy link
Contributor

@jsteemann jsteemann commented Mar 12, 2018

@ldap4life : ok, if there is a sorted index present on the date attribute in the logs collection, the non-fullCount query will execute almost instantly. It will only process 20 documents in index order and is done then.
The fullCount variant of the query will return 20 documents in index order, but it will iterate over the remaining 499980 documents in the index for counting them. So it will have to do a lot more work, especially in the RocksDB engine, where index iteration may (or may not) cause disk I/O.

If your original query does not contain any filters, the counting is probably most efficiently done by using just RETURN COUNT(logs). This will use an optimized code path that simply returns the number of documents in the collection without even looking at any document.

@matcho
Copy link

@matcho matcho commented Aug 13, 2018

Could this be detected by optimizer ? For example, if a query has no FILTER statement, return collection size as fullCount instead of iterating over every document ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants
You can’t perform that action at this time.