ArangoSearch STARTS_WITH Prefix Matches not working (+ nGrams?) #7825
Comments
Hello @JonDum, thank you for your interest in ArangoSearch. Regarding your issue, I suggest extending the view definition so that your data is indexed with multiple analyzers:
After this, you can use STARTS_WITH, taking the following into account:
In your original query there is no ANALYZER context, so the identity analyzer is used by default. Since your view has only the text_en analyzer for the makemodel field, nothing is returned.
As soon as you apply the text_en analyzer here, you will get all documents where at least one term (word) starts with pila (matching is case-insensitive for text_* analyzers), like:
However, if you want to find only those where PILA is at the beginning, you can use the identity analyzer as you did in your initial query, keeping in mind that it is case-sensitive:
Regarding enhancement requests and future development of ArangoSearch, please follow our blog and news.
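The difference between the two analyzer contexts can be illustrated with a toy model in Python. This is only a sketch, not ArangoSearch itself: `identity_terms`, `text_like_terms`, and `starts_with` are hypothetical helpers, the sample values are made up, and `text_like_terms` only lowercases and splits on non-alphanumerics, whereas the real text_en analyzer also applies stemming.

```python
import re

def identity_terms(value):
    """Model of the identity analyzer: the whole value is one term, unchanged."""
    return [value]

def text_like_terms(value):
    """Rough stand-in for text_en: lowercase and split into words (no stemming)."""
    return re.findall(r"[a-z0-9]+", value.lower())

def starts_with(terms, prefix):
    """True if any indexed term begins with the prefix, taken literally."""
    return any(term.startswith(prefix) for term in terms)

# Hypothetical makemodel values, not taken from the linked dataset.
leading = "PILATUS PC-12"
trailing = "SUPER PILATUS"

# Identity context with a lowercase prefix: the stored term is uppercase.
print(starts_with(identity_terms(leading), "pila"))    # False
# Text context: some lowercased word starts with the prefix.
print(starts_with(text_like_terms(leading), "pila"))   # True
print(starts_with(text_like_terms(trailing), "pila"))  # True
# Identity context with an uppercase prefix: only a leading match counts.
print(starts_with(identity_terms(leading), "PILA"))    # True
print(starts_with(identity_terms(trailing), "PILA"))   # False
```

In this model the text-style index matches a prefix of any word, while the identity index matches only a case-exact prefix of the whole value.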
@KVS85 Thanks for the response. I added an 'identity' analyzer, so now my full view config is:
And when I run this:
I get no results still. However if I run your other query:
I get some prefix results finally. What on earth. Hopefully you guys can see how confusing that is? Why do I need to wrap another Anyways, combining that with an
You can get around this by wrapping the bindvar in a Final working query for anyone coming across this:
As I said earlier, this is probably a decent "default" search that works with both single-word prefixes and multi-word terms. I would strongly suggest putting it in the docs somewhere. Perhaps in this section? A lot of these examples just show alternative AQL document syntax and are noise.
@JonDum glad that you found a solution for your query. Let me explain the following: a link with the identity and text_en analyzers will do the following:
Therefore, this query didn't return anything, because you implicitly used the identity analyzer context, and text indexed with this analyzer is stored "as is". So, since only PILA (uppercase) was in the input, this analyzer will not find pila.
Thus, your final query could be optimized:
Here value0 will be case-insensitive (whatever value is provided, it is processed by text_en TOKENS) and is checked against searchQuery within the text_en ANALYZER context. We are going to improve the ArangoSearch documentation based on the feedback provided and on interesting cases like this one. Thank you for your interest.
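The idea of normalising the search input with the same analyzer that built the index can be sketched in Python. This is a simplified model under stated assumptions: `tokens` is a hypothetical stand-in for what TOKENS with text_en would return (lowercasing and word splitting only, no stemming), `raw_prefix_match` and `prefix_search` are illustrative helpers, using only the first token is a choice made for this sketch, and the field value is invented.

```python
import re

def tokens(text):
    """Hypothetical stand-in for text_en tokenization (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def raw_prefix_match(field_value, prefix):
    """Prefix check against text-style terms, with the input used verbatim."""
    return any(term.startswith(prefix) for term in tokens(field_value))

def prefix_search(field_value, user_input):
    """Tokenize the input first, then prefix-match on its first token."""
    query = tokens(user_input)
    if not query:
        return False  # nothing left to search for after tokenization
    return raw_prefix_match(field_value, query[0])

value = "PILATUS PC-12"  # hypothetical field value

# Verbatim uppercase input misses, because the indexed terms are lowercase.
print(raw_prefix_match(value, "PILA"))  # False
# Normalising the input first makes the match independent of case.
print(prefix_search(value, "PILA"))     # True
print(prefix_search(value, "pila"))     # True
```

The point of the sketch is that the search input goes through the same normalisation as the indexed text before the prefix comparison happens.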
@KVS85 Thanks for the explanation and optimized query. Works well. Your explanation makes sense and I'm really excited about the future of ArangoSearch. It's super convenient having some search capabilities without having to set up and maintain ElasticSearch and sync documents to it. I'm sure other developers will love it too when it goes live and catches on. I can't wait for an ngram analyzer too! ;) The last thing that is unclear to me is the AQL side of it: if I already have
@JonDum thank you for such kind feedback. The need to define an analyzer explicitly is not an AQL limitation. One can have multiple fields indexed with multiple analyzers; that is the indexing context. When an AQL query is then executed, SEARCH-related functions cannot guess which of these analyzers should be used. To simplify this, the "identity" analyzer was chosen as the default. In general, remember that both the optional analyzer parameter of a SEARCH-related function and the ANALYZER function select which analyzer's index is used. The text passed to them is taken "as is" and is not analyzed automatically. Example:
@JonDum do you have any additional questions? If not, can we close this issue?
Nope. Thanks for the additional info. Hopefully it's useful for people arriving from Google as well.
@KVS85 I've been noticing some oddities with this setup: some queries return the results like I expect, but then I add an additional letter to my search and get zero results. For example:
returns a set of Boeing planes just fine. However, changing the search term to
Could this be something to do with the
My Environment
Component, Query & Data
Affected feature: ArangoSearch
Dataset: https://pastebin.com/4auWyt0v
Steps to reproduce
aircraft
Problem:
This query correctly returns matches with the full word pilatus in them:
As does this:
However, if I try to use STARTS_WITH(), either in conjunction with OR or even by itself:
It returns zero matches. What's going on?
I can get results with a simple non-search query:
But not with STARTS_WITH()?
Expected result:
Queries should return matches to prefixes as well as whole words. Ideally, it'd use ngrams as well.
Other notes on ArangoSearch in general:
I like how customizable it is, but 95% of people are going to want "contains" matches (à la edge n-grams) + prefix matches + exact matches by default, and I see no way of doing that at all, let alone easily. Fuzzy matches would be icing on the cake.