ArangoSearch STARTS_WITH Prefix Matches not working (+ nGrams?) #7825

Open
JonDum opened this issue Dec 21, 2018 · 8 comments

JonDum commented Dec 21, 2018

My Environment

  • ArangoDB Version: 3.4.0-RC.6
  • Storage Engine: RocksDB
  • Deployment Mode: Single Server
  • Deployment Strategy: Docker Swarm
  • Configuration: Single Server
  • Infrastructure: local
  • Operating System: arangodb/arangodb-preview:3.4.0-rc.6
  • Used Package: arangodb/arangodb-preview:3.4.0-rc.6

Component, Query & Data

Affected feature: ArangoSearch

Dataset: https://pastebin.com/4auWyt0v

Steps to reproduce

  1. Import data to aircraft
  2. Create an ArangoSearch view:
	const asv = db.arangoSearchView('aircraft_search')
	await asv.create()
	await asv.setProperties({
		links: {
			aircraft: {
				fields: {
					makemodel: {analyzers: ['text_en']}
				}
			}
		}
	})
  3. Attempt queries

Problem:

This query correctly returns matches with the full word pilatus in them

FOR doc IN aircraft_search
SEARCH PHRASE(doc.makemodel, 'pilatus', 'text_en')
RETURN doc

As does this:

FOR doc IN aircraft_search 
SEARCH ANALYZER(doc.makemodel IN TOKENS('pilatus', 'text_en'), 'text_en')
SORT BM25(doc)
limit 50
RETURN doc

However, if I try to use STARTS_WITH() either in conjunction with OR or even by itself:

FOR doc IN aircraft_search 
SEARCH STARTS_WITH(doc.makemodel, 'pila')
limit 50
RETURN doc

It returns zero matches. What's going on?

I can get results with a simple non-search query:

for doc in aircraft
filter CONTAINS(lower(doc.makemodel), 'pila')
return doc

But not with STARTS_WITH()?

Expected result:

Queries should return matches to prefixes as well as whole words. Ideally, it'd use ngrams as well.

Other notes on ArangoSearch in general:

I like how it's so customizable, but 95% of people are going to want "contains" matches (à la edge n-grams) + prefix matches + exact matches by default, and I see no way of doing that at all, let alone easily. Fuzzy matches would be icing on the cake.

@JonDum JonDum changed the title ArangoSearch nGrams ArangoSearch STARTS_WITH Prefix Matches not working (+ nGrams?) Dec 21, 2018
KVS85 (Contributor) commented Dec 21, 2018

Hello @JonDum,

Thank you for your interest in ArangoSearch. Regarding your issue, I suggest extending the view definition so that your data is indexed with multiple analyzers: makemodel: {analyzers: ['text_en', 'identity']}.

After this, you may utilize STARTS_WITH taking the following into account:

FOR doc IN aircraft_search 
SEARCH STARTS_WITH(doc.makemodel, 'pila')
limit 50
RETURN doc

In your original query there is no ANALYZER context, so the identity analyzer is used by default. Since your view has only the text_en analyzer for the makemodel field, nothing is returned.

FOR doc IN aircraft_search 
SEARCH ANALYZER(STARTS_WITH(doc.makemodel, 'pila'), 'text_en')
limit 50
RETURN doc

As soon as you apply the text_en analyzer here, you will get all documents where at least one term (word) starts with pila (text_* analyzers are case-insensitive), such as:

  {
    "_key": "786",
    "_id": "aircraft/786",
    "_rev": "_X7_R5Te--D",
    "makemodel": "OERLIKON ARMNTS CO PILATUS P-2/05"
  },
  {
    "_key": "787",
    "_id": "aircraft/787",
    "_rev": "_X7_R5Te--F",
    "makemodel": "PILATUS BRITTEN-NORMAN LTD. BN2B-20"
  }

However, if you want to find only those where PILA is at the beginning of the value, you can use the identity analyzer as you did in your initial query, keeping in mind that it is case-sensitive:

FOR doc IN aircraft_search 
SEARCH STARTS_WITH(doc.makemodel, 'PILA')
limit 50
RETURN doc
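To make the difference between the two analyzer contexts concrete, here is a rough JavaScript model of the behaviour described above. It is not ArangoDB code: `identityTerms`, `textEnTermsApprox`, `startsWith` and `search` are hypothetical helpers, and the `text_en` stand-in only lower-cases and splits on non-alphanumeric characters (the real analyzer also stems words). The two sample values are the ones from the results shown earlier.

```javascript
// Toy model only; real ArangoSearch analyzers are more involved.

// identity: the whole value becomes a single term, unchanged.
function identityTerms(value) {
  return [value];
}

// Rough stand-in for text_en (assumption): lower-case, then split on
// anything that is not a letter or digit. Stemming is ignored here.
function textEnTermsApprox(value) {
  return value.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// A prefix search matches when any indexed term starts with the prefix
// exactly as given; the prefix itself is not analyzed.
function startsWith(terms, prefix) {
  return terms.some((term) => term.startsWith(prefix));
}

const makemodels = [
  "OERLIKON ARMNTS CO PILATUS P-2/05",
  "PILATUS BRITTEN-NORMAN LTD. BN2B-20",
];

// Apply one analyzer to every value and keep the values that match.
function search(analyze, prefix) {
  return makemodels.filter((value) => startsWith(analyze(value), prefix));
}

const identityLower = search(identityTerms, "pila");   // identity context, lower-case prefix
const textEnLower = search(textEnTermsApprox, "pila"); // text_en context, lower-case prefix
const identityUpper = search(identityTerms, "PILA");   // identity context, upper-case prefix

console.log(identityLower.length, textEnLower.length, identityUpper.length);
```

In this model the identity context finds nothing for `pila`, the approximated `text_en` context finds both values, and the identity context with `PILA` finds only the value that begins with PILATUS.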

Regarding enhancement requests and future development of ArangoSearch, please follow our blog and news.

JonDum (Author) commented Dec 21, 2018

@KVS85 Thanks for the response.

I added an 'identity' analyzer, so now my full view config is:

{
  "writebufferSizeMax": 33554432,
  "consolidationPolicy": {
    "type": "bytes_accum",
    "threshold": 0.10000000149011612
  },
  "globallyUniqueId": "h75EDDA017957/4039166",
  "id": "4039166",
  "writebufferActive": 0,
  "consolidationIntervalMsec": 60000,
  "cleanupIntervalStep": 10,
  "links": {
    "aircraft": {
      "analyzers": [
        "identity"
      ],
      "fields": {
        "makemodel": {
          "analyzers": [
            "text_en",
            "identity"
          ]
        }
      },
      "includeAllFields": false,
      "storeValues": "none",
      "trackListPositions": false
    }
  },
  "type": "arangosearch",
  "writebufferIdle": 64
}

And when I run this:

FOR doc IN aircraft_search 
SEARCH STARTS_WITH(doc.makemodel, 'pila')
limit 50
RETURN doc

I get no results still.

However if I run your other query:

FOR doc IN aircraft_search 
SEARCH ANALYZER(STARTS_WITH(doc.makemodel, 'pilatus'), 'text_en')
limit 50
RETURN doc

I get some prefix results finally. What on earth. Hopefully you guys can see how confusing that is? Why do I need to wrap another ANALYZER function on top of the STARTS_WITH just to get prefix matching?

Anyways, combining that with an ANALYZER(... IN TOKENS()) seems to work decently except for one caveat: Mixing capitals returns no results again.

FOR doc IN aircraft_search 
SEARCH ANALYZER(STARTS_WITH(doc.makemodel, 'Pilat'), 'text_en') or
ANALYZER(doc.makemodel IN TOKENS('Pilat', 'text_en'), 'text_en')
SORT TFIDF(doc) DESC
limit 50
RETURN doc

You can get around this by wrapping the bindvar in a lower(). Perhaps this is a bug? text_en should be case-insensitive I thought.

Final working query for anyone coming across this:

FOR doc IN aircraft_search 
SEARCH ANALYZER(STARTS_WITH(doc.makemodel, lower(@value0)), 'text_en') or
ANALYZER(doc.makemodel IN TOKENS(lower(@value0), 'text_en'), 'text_en')
SORT TFIDF(doc) DESC
limit 50
RETURN doc

As I said earlier, this is probably a decent "default" search that works with both single-word prefixes and multiple-word terms. I would highly suggest it be in the docs somewhere.

[Screenshot, 2018-12-21 11:22:36 AM]

Perhaps in this section? A lot of these examples are just showing alternative AQL document syntax and are just noise.

KVS85 (Contributor) commented Dec 22, 2018

@JonDum glad that you found a solution for your query.

Let me explain the following here: a link with identity and text_en analyzers will do the following:

  • identity: TOKENS("iNPut tEXt", "identity") -> ["iNPut tEXt"]
  • text_en: TOKENS("iNPut tEXt", "text_en") -> ["input", "text"]

Therefore, the following query didn't return anything: it implicitly uses the identity analyzer context, and text indexed with that analyzer is stored "as is". Since the indexed values contain only PILA (uppercase), searching for pila finds nothing.

FOR doc IN aircraft_search 
SEARCH STARTS_WITH(doc.makemodel, 'pila')
limit 50
RETURN doc

Thus, your final query could be optimized:

LET tokens = TOKENS(@value0, 'text_en')
FOR doc IN aircraft_search 
SEARCH ANALYZER(STARTS_WITH(doc.makemodel, tokens[0]) OR doc.makemodel IN tokens, 'text_en')
SORT TFIDF(doc) DESC
limit 50
RETURN doc

Here @value0 is handled case-insensitively: whatever value is provided, it is first processed by TOKENS with text_en, and the resulting tokens are then checked against the text_en index inside the ANALYZER context.
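The logic of the optimized query can be sketched in plain JavaScript as follows. This is a simplified model, not ArangoDB code: `textEnTokensApprox` is a hypothetical stand-in for `TOKENS(..., 'text_en')` that only lower-cases and splits (no stemming), and `matches` mirrors the `STARTS_WITH(doc.makemodel, tokens[0]) OR doc.makemodel IN tokens` condition under that assumption.

```javascript
// Rough stand-in for TOKENS(text, 'text_en') (assumption): lower-case and
// split on non-alphanumeric characters. The real analyzer also stems.
function textEnTokensApprox(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Mirrors: STARTS_WITH(field, tokens[0]) OR field IN tokens
// indexedTerms are the terms stored for one document's field.
function matches(indexedTerms, tokens) {
  if (tokens.length === 0) return false; // toy choice for empty input
  const prefixHit = indexedTerms.some((term) => term.startsWith(tokens[0]));
  const exactHit = tokens.some((token) => indexedTerms.includes(token));
  return prefixHit || exactHit;
}

const indexed = textEnTokensApprox("PILATUS BRITTEN-NORMAN LTD. BN2B-20");

// The mixed-case input is normalized by the tokenizer before matching.
const mixedCase = matches(indexed, textEnTokensApprox("Pilat"));
// Passing the raw string instead leaves the capital letter in place.
const rawMixedCase = matches(indexed, ["Pilat"]);
// Several full words: the first is used as a prefix, all are tried exactly.
const twoWords = matches(indexed, textEnTokensApprox("Britten Norman"));
// Only the first token gets prefix treatment, so a partial second word
// does not help when the first word is absent from the document.
const partialSecond = matches(indexed, textEnTokensApprox("Cessna Britt"));

console.log(mixedCase, rawMixedCase, twoWords, partialSecond);
```

One consequence visible in this model is that only the first token is matched as a prefix; later tokens have to match an indexed term exactly.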

We are going to improve ArangoSearch documentation based on provided feedback and interesting cases. Thank you for your interest.

JonDum (Author) commented Dec 22, 2018

@KVS85 Thanks for the explanation and optimized query. Works well.

Your explanation makes sense and I'm real excited about the future of ArangoSearch. It's super convenient having some search capabilities without having to set up/maintain ElasticSearch and syncing documents to it. I'm sure other developers will love it too when it goes live and catches on. I can't wait for an ngram analyzer too! ;)

The last thing that is unclear to me is the AQL side of it — if I already have text_en, identity (and potential future analyzers) specified in my view properties, why do I need to explicitly specify which analyzer to use? Can the AQL optimizer be smart enough to figure out which analyzer to use based off the fields in the link in the view properties if I don't specify one?

KVS85 (Contributor) commented Dec 24, 2018

@JonDum thank you for such kind feedback.

The need to define an analyzer explicitly is not an AQL limitation. One can have multiple fields indexed with multiple analyzers; that is the indexing context. When an AQL query is then executed, SEARCH-related functions can't guess which of these analyzers should be used. To simplify this, the "identity" analyzer was chosen as the default.

Generally, remember that both the optional analyzer parameter of a SEARCH-related function and the ANALYZER function select which analyzer's index is used. The text passed to them is taken "as is" and is not analyzed automatically.

Example:

FOR doc IN aircraft_search 
SEARCH ANALYZER(STARTS_WITH(doc.makemodel, 'PILA'), 'text_en')
// Here ANALYZER(..., 'text_en') means that the "text_en" index of
// doc.makemodel will be used for searching, while
// STARTS_WITH(doc.makemodel, 'PILA') defines what should be searched.
// The 'PILA' parameter is passed as is, without processing, so nothing
// is found: the "text_en" index can't contain uppercase values.
limit 50
RETURN doc

KVS85 (Contributor) commented Dec 28, 2018

@JonDum do you have any additional questions? If no, can we close this issue?

JonDum (Author) commented Dec 28, 2018

Nope. Thanks for the additional info. Hopefully it's useful for people arriving from google as well.

@JonDum JonDum closed this Dec 28, 2018
JonDum (Author) commented Feb 11, 2019

@KVS85 I've been noticing some oddities with this setup: some queries return the results like I expect, but then I add an additional letter to my search and get zero results.

For example:

LET tokens = TOKENS('Boe', 'text_en')
FOR a IN aircraft_search
SEARCH ANALYZER(STARTS_WITH(a.makemodel, tokens[0]) OR a.makemodel IN tokens, 'text_en')
SORT TFIDF(a) DESC
limit 50
RETURN {
	makemodel: a.makemodel
}

This returns a set of Boeing planes just fine.

However, changing the search term to Boei or Boein returns 0 results.

| Search Term | Results |
|-------------|---------|
| B           |         |
| Bo          |         |
| Boe         |         |
| Boei        | 🚫      |
| Boein       | 🚫      |
| Boeing      |         |

Could this be something to do with the text_en stemmer/analyzer pulling off the -ing suffix? How would I get around that if that's the case?
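The stemming hypothesis above can be illustrated with a small JavaScript model. Everything here is an assumption made for illustration: the stem `boe` for `boeing` has not been checked against ArangoDB's `text_en` analyzer, `BOEING 737-800` is a made-up sample value, and `textEnTokensApprox` and `matches` are hypothetical helpers that only approximate `TOKENS` and the query condition used in this thread.

```javascript
// ASSUMPTION: the stemmer reduces "boeing" to "boe". This mapping is
// hard-coded for illustration and not verified against ArangoDB.
const assumedStems = new Map([["boeing", "boe"]]);

// Rough stand-in for TOKENS(text, 'text_en'): lower-case, split on
// non-alphanumerics, then apply the assumed stem where one is listed.
function textEnTokensApprox(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean)
    .map((word) => assumedStems.get(word) || word);
}

// Mirrors: STARTS_WITH(field, tokens[0]) OR field IN tokens
function matches(indexedTerms, tokens) {
  if (tokens.length === 0) return false;
  return (
    indexedTerms.some((term) => term.startsWith(tokens[0])) ||
    tokens.some((token) => indexedTerms.includes(token))
  );
}

// Hypothetical indexed value; under the assumption its terms are
// "boe", "737" and "800".
const indexedTerms = textEnTokensApprox("BOEING 737-800");

// Evaluate each search term from the table above against the model.
const outcome = {};
for (const input of ["B", "Bo", "Boe", "Boei", "Boein", "Boeing"]) {
  outcome[input] = matches(indexedTerms, textEnTokensApprox(input));
}
console.log(outcome);
```

Under that assumption the model reproduces the pattern in the table: prefixes up to `Boe` match the stored term, `Boei` and `Boein` are longer than it and match nothing, and `Boeing` is itself reduced to `boe` and matches again.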
