Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ArangoDB Fulltext fails with dashes in search #4996

Open
remingtonc opened this issue Mar 29, 2018 · 4 comments
Open

ArangoDB Fulltext fails with dashes in search #4996

remingtonc opened this issue Mar 29, 2018 · 4 comments

Comments

@remingtonc
Copy link

@remingtonc remingtonc commented Mar 29, 2018

It appears that ArangoDB Fulltext indexes fail to obtain results when a - is in the query parameter. This is not necessarily expected behavior based on the docs which do indicate that prefacing a search term with a - indicates to ignore that result, but not detailed as within being an issue.

Is there a way to "qualify" a search to indicate that dashes etc. should be part of the literal search?

Might be related to #2325

my environment running ArangoDB

I'm using the latest ArangoDB of the respective release series:

  • 2.8
  • 3.0
  • 3.1
  • 3.2
  • 3.3
  • self-compiled devel branch

Mode:

  • Cluster
  • Single-Server

Storage-Engine:

  • mmfiles
  • rocksdb

On this operating system:

  • DCOS on
    • AWS
    • Azure
    • own infrastructure
  • Linux
    • Debian .deb
    • Ubuntu .deb
    • SUSE .rpm
    • RedHat .rpm
    • Fedora .rpm
    • Gentoo
    • docker - official docker library
    • other:
  • Windows, version:
  • MacOS, version:

this is an AQL-related issue:

[ ] I'm using graph features

I'm issuing AQL via:

  • web interface with this browser: running on this OS:
  • arangosh
  • this Driver: pyarango

I've run db._explain("<my aql query>") and it didn't shed more light on this.
The AQL query in question is:

FOR dp IN FULLTEXT(DataPath, "human_id", "Cisco-IOS-XE", 10)
    RETURN dp.human_id
Query string:
 FOR dp IN FULLTEXT(DataPath, "human_id", "Cisco-IOS-XE", 10)
     RETURN dp.human_id

Execution plan:
 Id   NodeType            Est.   Comment
  1   SingletonNode          1   * ROOT
  6   IndexNode         400470     - FOR dp IN DataPath   /* fulltext index scan */
  7   LimitNode             10       - LIMIT 0, 10
  4   CalculationNode       10       - LET #3 = dp.`human_id`   /* attribute expression */   /* collections used: dp : DataPath */
  5   ReturnNode            10       - RETURN #3

Indexes used:
 By   Type       Collection   Unique   Sparse   Selectivity   Fields           Ranges
  6   fulltext   DataPath     false    true             n/a   [ `human_id` ]   FULLTEXT(DataPath   /* all collection documents */, "human_id", "Cisco-IOS-XE", 10)

Optimization rules applied:
 Id   RuleName
  1   fulltext-index-optimizer
  2   remove-unnecessary-calculations-2

Actual:

[]

Expected:

[
  "Cisco-IOS-XE-ip-sla-oper:ip-sla-stats",
  "Cisco-IOS-XE-ip-sla-oper:ip-sla-stats/sla-oper-entry",
  "Cisco-IOS-XE-ip-sla-oper:ip-sla-stats/sla-oper-entry/oper-id",
  "Cisco-IOS-XE-ip-sla-oper:ip-sla-stats/sla-oper-entry/oper-type",
  "Cisco-IOS-XE-ip-sla-oper:ip-sla-stats/sla-oper-entry/latest-return-code",
  "Cisco-IOS-XE-ip-sla-oper:ip-sla-stats/sla-oper-entry/success-count",
  "Cisco-IOS-XE-ip-sla-oper:ip-sla-stats/sla-oper-entry/failure-count",
  "Cisco-IOS-XE-ip-sla-oper:ip-sla-stats/sla-oper-entry/latest-oper-start-time",
  "Cisco-IOS-XE-ip-sla-oper:ip-sla-stats/sla-oper-entry/rtt-info",
  "Cisco-IOS-XE-ip-sla-oper:ip-sla-stats/sla-oper-entry/rtt-info/latest-rtt"
]

The issue can be reproduced using this dataset:
Not yet available. :( We can make this happen if necessary.

@remingtonc
Copy link
Author

@remingtonc remingtonc commented Mar 29, 2018

Looking at this closer, I might be using the Fulltext index erroneously... I could accomplish a similar use case using Contains or Like. Is there any significant performance difference in a Fulltext index versus using Contains or Like string functions, or benefit when using Fulltext and Contains/Like?

@Simran-B
Copy link
Contributor

@Simran-B Simran-B commented Mar 29, 2018

CONTAINS() and LIKE() can not utilize any indexes. Therefore, using a fulltext index is preferable unless your collection is small so that performance does not become an issue.

- is treated as word boundary, as can be seen by searching for e.g. "rtt,entry" - is does match the following strings:

  • Cisco-IOS-XE-ip-sla-oper:ip-sla-stats/sla-oper-entry/rtt-info/latest-rtt
  • Cisco-IOS-XE-ip-sla-oper:ip-sla-stats/sla-oper-entry/rtt-info

- is also used in the mini-language to define the fulltext search (logical not). Therefore, we need a little trick to use the fulltext index but only match documents that contain a certain phrase:

LET phrase = "Cisco-IOS-XE"
 FOR dp IN FULLTEXT(DataPath, "human_id", SUBSTITUTE(phrase, "-", ","), 10)
     FILTER CONTAINS(dp.human_id, phrase)
     RETURN dp.human_id

Instead of searching for "Cisco" without "IOS" and "XE", the hyphen is substituted by a comma to create a search expression that finds "Cisco" and "IOS" and "XE". The returned subset of documents is filtered to ensure that the initial search phrase is contained literally (would eliminate e.g. XE-foo-IOS-bar-Cisco).

Note that less than 10 results may be returned. You might want to remove the limit of 10 documents in the FULLTEXT() call and use LIMIT 10 above RETURN instead.

Since you seem to be searching for prefixes in this particular case, there is an alternative approach:

FOR dp IN DataPath
    FILTER dp.human_id >= "Cisco-IOS-XE" AND dp.human_id < "Cisco-IOS-XF"
    RETURN dp.human_id

Create a skiplist index for human_id to utilize an index. This is effectively a range search using a little trick (replace the last character in the search phrase with the "next" character, i.e. numerically +1). Also see https://docs.arangodb.com/3.3/Cookbook/UseCases/PopulatingAnAutocompleteTextbox.html

@remingtonc
Copy link
Author

@remingtonc remingtonc commented Mar 30, 2018

Wonderful, thank you for the advice @Simran-B. This would be great information to add to the AQL docs. Will try and pull request later, closing in the meantime. 👍

@remingtonc remingtonc closed this Mar 30, 2018
@dothebart
Copy link
Contributor

@dothebart dothebart commented Apr 9, 2018

we should leave that open, so it can be addressed in the documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants
You can’t perform that action at this time.