ArangoSearch STARTS_WITH Prefix Matches not working (+ nGrams?) #7825

JonDum · 2018-12-21T09:34:45Z

My Environment

ArangoDB Version: 3.4.0-RC.6
Storage Engine: RocksDB
Deployment Mode: Single Server
Deployment Strategy: Docker Swarm
Configuration: Single Server
Infrastructure: local
Operating System: arangodb/arangodb-preview:3.4.0-rc.6
Used Package: arangodb/arangodb-preview:3.4.0-rc.6

Component, Query & Data

Affected feature: ArangoSearch

Dataset: https://pastebin.com/4auWyt0v

Steps to reproduce

Import data to aircraft
Create an ArangoSearch view:

	const asv = db.arangoSearchView('aircraft_search')
	await asv.create()
	await asv.setProperties({
		links: {
			aircraft: {
				fields: {
					makemodel: {analyzers: ['text_en']}
				}
			}
		}
	})

Attempt queries

Problem:

This query correctly returns matches with the full word pilatus in them

FOR doc IN aircraft_search
SEARCH PHRASE(doc.makemodel, 'pilatus', 'text_en')
RETURN doc

As does this:

FOR doc IN aircraft_search 
SEARCH ANALYZER(doc.makemodel IN TOKENS('pilatus', 'text_en'), 'text_en')
SORT BM25(doc)
limit 50
RETURN doc

However, if I try to use STARTS_WITH() either in conjunction with OR or even by itself:

FOR doc IN aircraft_search 
SEARCH STARTS_WITH(doc.makemodel, 'pila')
limit 50
RETURN doc

It returns zero matches. What's going on?

I can get results with a simple non-search query:

for doc in aircraft
filter CONTAINS(lower(doc.makemodel), 'pila')
return doc

But not with STARTS_WITH()?

Expected result:

Queries should return matches to prefixes as well as whole words. Ideally, it'd use ngrams as well.

Other notes on ArangoSearch in general:

I like how customizable it is, but 95% of people are going to want "contains" matches (à la edge n-grams) + prefix matches + exact matches by default, and I see no way of doing that at all, let alone easily. Fuzzy matches would be icing on the cake.

KVS85 · 2018-12-21T10:53:08Z

Hello @JonDum,

Thank you for your interest in ArangoSearch. Regarding your issue, I suggest extending the view definition so that the field is indexed with multiple analyzers: makemodel: {analyzers: ['text_en', 'identity']}.

After this, you may utilize STARTS_WITH taking the following into account:

FOR doc IN aircraft_search 
SEARCH STARTS_WITH(doc.makemodel, 'pila')
limit 50
RETURN doc

In your original query there is no ANALYZER context, so the identity analyzer is used by default. Since your view only has the text_en analyzer for the makemodel field, nothing is returned.

FOR doc IN aircraft_search 
SEARCH ANALYZER(STARTS_WITH(doc.makemodel, 'pila'), 'text_en')
limit 50
RETURN doc

As soon as you apply the text_en analyzer here, you will get all documents where at least one term (word) starts with pila (text_* analyzers are case-insensitive), for example:

  {
    "_key": "786",
    "_id": "aircraft/786",
    "_rev": "_X7_R5Te--D",
    "makemodel": "OERLIKON ARMNTS CO PILATUS P-2/05"
  },
  {
    "_key": "787",
    "_id": "aircraft/787",
    "_rev": "_X7_R5Te--F",
    "makemodel": "PILATUS BRITTEN-NORMAN LTD. BN2B-20"
  }

However, if you want to find only those documents where PILA is at the very beginning of the value, you can use the identity analyzer, as in your initial query, keeping in mind that it is case-sensitive:

FOR doc IN aircraft_search 
SEARCH STARTS_WITH(doc.makemodel, 'PILA')
limit 50
RETURN doc
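
The three cases above can be mimicked with a toy model. This is only a sketch: the `analyzers` object and the `startsWith` helper are hypothetical stand-ins, and the `text_en` approximation just lowercases and splits into words (the real analyzer does more, such as stemming). It uses the two sample documents shown above.

```javascript
// Toy model only: real ArangoSearch analyzers and indexes are far more involved.
const docs = [
  { _key: '786', makemodel: 'OERLIKON ARMNTS CO PILATUS P-2/05' },
  { _key: '787', makemodel: 'PILATUS BRITTEN-NORMAN LTD. BN2B-20' },
];

// Hypothetical stand-ins for the two analyzers discussed above.
const analyzers = {
  // identity: the whole value is a single term, stored unchanged
  identity: (text) => [text],
  // text_en (approximation): lowercase and split into words, no stemming
  text_en: (text) => text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean),
};

// Emulates STARTS_WITH evaluated in a given analyzer context:
// a document matches if any of its indexed terms starts with the prefix.
function startsWith(analyzerName, prefix) {
  const analyze = analyzers[analyzerName];
  return docs
    .filter((d) => analyze(d.makemodel).some((term) => term.startsWith(prefix)))
    .map((d) => d._key);
}

console.log(startsWith('identity', 'pila')); // []
console.log(startsWith('text_en', 'pila'));  // [ '786', '787' ]
console.log(startsWith('identity', 'PILA')); // [ '787' ]
```

In this model the identity context only ever sees the whole, unchanged value, which is why a lowercase prefix finds nothing there.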

Regarding enhancement requests and future development of ArangoSearch, please follow our blog and news.

JonDum · 2018-12-21T18:23:52Z

@KVS85 Thanks for the response.

I added an 'identity' analyzer so now my full view config is:

{
  "writebufferSizeMax": 33554432,
  "consolidationPolicy": {
    "type": "bytes_accum",
    "threshold": 0.10000000149011612
  },
  "globallyUniqueId": "h75EDDA017957/4039166",
  "id": "4039166",
  "writebufferActive": 0,
  "consolidationIntervalMsec": 60000,
  "cleanupIntervalStep": 10,
  "links": {
    "aircraft": {
      "analyzers": [
        "identity"
      ],
      "fields": {
        "makemodel": {
          "analyzers": [
            "text_en",
            "identity"
          ]
        }
      },
      "includeAllFields": false,
      "storeValues": "none",
      "trackListPositions": false
    }
  },
  "type": "arangosearch",
  "writebufferIdle": 64
}

And when I run this:

FOR doc IN aircraft_search 
SEARCH STARTS_WITH(doc.makemodel, 'pila')
limit 50
RETURN doc

I get no results still.

However if I run your other query:

FOR doc IN aircraft_search 
SEARCH ANALYZER(STARTS_WITH(doc.makemodel, 'pilatus'), 'text_en')
limit 50
RETURN doc

I get some prefix results finally. What on earth. Hopefully you guys can see how confusing that is? Why do I need to wrap another ANALYZER function on top of the STARTS_WITH just to get prefix matching?

Anyways, combining that with an ANALYZER(... IN TOKENS()) seems to work decently except for one caveat: Mixing capitals returns no results again.

FOR doc IN aircraft_search 
SEARCH ANALYZER(STARTS_WITH(doc.makemodel, 'Pilat'), 'text_en') or
ANALYZER(doc.makemodel IN TOKENS('Pilat', 'text_en'), 'text_en')
SORT TFIDF(doc) DESC
limit 50
RETURN doc

You can get around this by wrapping the bindvar in a lower(). Perhaps this is a bug? text_en should be case-insensitive I thought.

Final working query for anyone coming across this:

FOR doc IN aircraft_search 
SEARCH ANALYZER(STARTS_WITH(doc.makemodel, lower(@value0)), 'text_en') or
ANALYZER(doc.makemodel IN TOKENS(lower(@value0), 'text_en'), 'text_en')
SORT TFIDF(doc) DESC
limit 50
RETURN doc

As I said earlier, this is probably a decent "default" search that works with both single-word prefixes and multi-word terms. I would highly suggest it be in the docs somewhere.

Perhaps in this section? A lot of these examples are just showing alternative AQL document syntax and are just noise.

KVS85 · 2018-12-22T15:04:21Z

@JonDum glad that you found a solution for your query.

Let me explain: a link with the identity and text_en analyzers produces the following tokens:

identity: TOKENS("iNPut tEXt", "identity") -> ["iNPut tEXt"]
text_en: TOKENS("iNPut tEXt", "text_en") -> ["input", "text"]

Therefore, the following query returned nothing: it implicitly uses the identity analyzer context, and text indexed with that analyzer is stored "as is". Since the indexed data only contains PILA (uppercase), a search for pila finds nothing.

FOR doc IN aircraft_search 
SEARCH STARTS_WITH(doc.makemodel, 'pila')
limit 50
RETURN doc

Thus, your final query could be optimized:

LET tokens = TOKENS(@value0, 'text_en')
FOR doc IN aircraft_search 
SEARCH ANALYZER(STARTS_WITH(doc.makemodel, tokens[0]) OR doc.makemodel IN tokens, 'text_en')
SORT TFIDF(doc) DESC
limit 50
RETURN doc

Here, @value0 is effectively case-insensitive: whatever value is provided, it is first processed by TOKENS with text_en, and the resulting tokens are then matched within the text_en ANALYZER context.
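
For anyone running the optimized query above from application code, here is a minimal sketch of passing the search term as the `@value0` bind variable. `buildPrefixSearch` is a hypothetical helper name, and the sketch assumes an arangojs `db` handle like the one used to create the view. An empty search term is not handled here.

```javascript
// Builds a { query, bindVars } object for the optimized query above.
// `buildPrefixSearch` is a hypothetical helper, not part of arangojs.
function buildPrefixSearch(term) {
  return {
    query: `
      LET tokens = TOKENS(@value0, 'text_en')
      FOR doc IN aircraft_search
        SEARCH ANALYZER(STARTS_WITH(doc.makemodel, tokens[0]) OR doc.makemodel IN tokens, 'text_en')
        SORT TFIDF(doc) DESC
        LIMIT 50
        RETURN doc
    `,
    // The raw user input goes in unchanged; TOKENS normalizes it inside the query.
    bindVars: { value0: term },
  };
}

const q = buildPrefixSearch('Pilat');
console.log(q.bindVars); // { value0: 'Pilat' }

// With a connected arangojs Database instance (assumed, not shown here):
//   const cursor = await db.query(q);
//   const results = await cursor.all();
```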

We are going to improve ArangoSearch documentation based on provided feedback and interesting cases. Thank you for your interest.

JonDum · 2018-12-22T21:54:34Z

@KVS85 Thanks for the explanation and optimized query. Works well.

Your explanation makes sense and I'm real excited about the future of ArangoSearch. It's super convenient having some search capabilities without having to set up/maintain ElasticSearch and syncing documents to it. I'm sure other developers will love it too when it goes live and catches on. I can't wait for an ngram analyzer too! ;)

The last thing that is unclear to me is the AQL side of it — if I already have text_en, identity (and potential future analyzers) specified in my view properties, why do I need to explicitly specify which analyzer to use? Can the AQL optimizer be smart enough to figure out which analyzer to use based off the fields in the link in the view properties if I don't specify one?

KVS85 · 2018-12-24T14:17:14Z

@JonDum thank you for such kind feedback.

The need to specify an analyzer explicitly is not an AQL limitation. A view can index multiple fields with multiple analyzers; that is the indexing context. When an AQL query runs, SEARCH-related functions can't guess which of these analyzers should be used. To keep things simple, the "identity" analyzer was chosen as the default.

In general, remember that both the optional analyzer parameter of a SEARCH-related function and the ANALYZER function only select the index built with the corresponding analyzer. The text passed to them is used "as is" and is not analyzed automatically.

Example:

FOR doc IN aircraft_search 
SEARCH ANALYZER(STARTS_WITH(doc.makemodel, 'PILA'), 'text_en')
// Here ANALYZER(..., 'text_en') means that the "text_en" index of doc.makemodel
// is used for searching, while STARTS_WITH(doc.makemodel, 'PILA') defines
// what is searched for. The 'PILA' argument is passed as is, without
// processing, so nothing is found: the "text_en" index of doc.makemodel
// can't contain uppercase values.
limit 50
RETURN doc
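
The same point, that the search argument is compared as given and is not analyzed, can be reproduced in a few lines. This is a toy sketch: `textEnTerms` is a hand-written stand-in for the text_en index of one document, and `tokensTextEn` is a hypothetical approximation of `TOKENS(value, 'text_en')` that only lowercases and splits.

```javascript
// Toy stand-in for the text_en index of doc.makemodel: lowercase terms only.
const textEnTerms = ['pilatus', 'britten', 'norman', 'ltd', 'bn2b', '20'];

// Hypothetical approximation of TOKENS(value, 'text_en'): lowercase + split, no stemming.
const tokensTextEn = (value) =>
  value.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

// The prefix is compared against the indexed terms exactly as given.
const anyTermStartsWith = (prefix) => textEnTerms.some((t) => t.startsWith(prefix));

console.log(anyTermStartsWith('PILA'));                  // false
console.log(anyTermStartsWith(tokensTextEn('PILA')[0])); // true
```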

KVS85 · 2018-12-28T10:20:57Z

@JonDum do you have any additional questions? If not, can we close this issue?

JonDum · 2018-12-28T21:59:45Z

Nope. Thanks for the additional info. Hopefully it's useful for people arriving from google as well.

JonDum · 2019-02-11T20:06:37Z

@KVS85 I've been noticing some oddities with this setup: some queries return the results like I expect, but then I add an additional letter to my search and get zero results.

For example:

LET tokens = TOKENS('Boe', 'text_en')
FOR a IN aircraft_search
SEARCH ANALYZER(STARTS_WITH(a.makemodel, tokens[0]) OR a.makemodel IN tokens, 'text_en')
SORT TFIDF(a) DESC
limit 50
RETURN {
	makemodel: a.makemodel
}

Returns a set of Boeing planes just fine.

However, changing the search term to Boei or Boein returns 0 results.

Search Term	Results
B	✅
Bo	✅
Boe	✅
Boei	🚫
Boein	🚫
Boeing	✅

Could this be something to do with the text_en stemmer/analyzer pulling off the -ing suffix? How would I get around that if that's the case?
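
One way to read the table above, assuming the stemming guess in the question is right: text_en would index BOEING as the stem boe, and TOKENS would reduce the search term Boeing to the same stem. The sketch below hard-codes that assumption. `indexedTerm` and `toyToken` are hypothetical, and stripping a trailing "ing" is only a crude imitation of a stemmer, not ArangoDB output.

```javascript
// Assumption: the term indexed for "BOEING" under text_en is the stem "boe".
const indexedTerm = 'boe';

// Hypothetical stand-in for TOKENS(term, 'text_en')[0]:
// lowercases and strips a trailing "ing". Not the real stemming algorithm.
function toyToken(term) {
  return term.toLowerCase().replace(/ing$/, '');
}

// Emulates STARTS_WITH(a.makemodel, tokens[0]) OR a.makemodel IN tokens
// against the single assumed index term.
function matches(term) {
  const token = toyToken(term);
  return indexedTerm.startsWith(token) || indexedTerm === token;
}

for (const term of ['B', 'Bo', 'Boe', 'Boei', 'Boein', 'Boeing']) {
  console.log(term, matches(term));
}
// B true, Bo true, Boe true, Boei false, Boein false, Boeing true
```

Under that assumption, boei and boein fail because the index holds no term beginning with them, while Boeing succeeds only because it is stemmed to boe before the lookup. Matching such partial words would need the field indexed by an analyzer that does not stem; whether one is available depends on the ArangoDB version.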

JonDum changed the title from "ArangoSearch nGrams" to "ArangoSearch STARTS_WITH Prefix Matches not working (+ nGrams?)" Dec 21, 2018

KVS85 added the "1 Question" and "3 ArangoSearch" labels Dec 21, 2018

JonDum closed this Dec 28, 2018

JonDum reopened this Feb 11, 2019

Simran-B added the "3 Documentation" label Feb 13, 2019

dothebart assigned Simran-B Sep 2, 2020

arangodb / arangodb
