Escape() returns 0 results, when the Native query does #329

Open
AaronSadlerUK opened this issue Feb 22, 2023 · 16 comments

@AaronSadlerUK

AaronSadlerUK commented Feb 22, 2023

When using Escape(), the query brings back no results, but if I extract the RAW query and run it, it returns the expected number of results.

This brings back 0 results:

if (!string.IsNullOrEmpty(criteria.Colour))
{
  query = query.And().Field("colour",  criteria.Colour.Escape());
}

But if I extract the RAW query and run it... It works:

// Pull the raw Lucene query out of the fluent query's ToString() output.
string stringToParse = query.ToString();
int indexOfPropertyValue = stringToParse.IndexOf("LuceneQuery:") + 12; // 12 == "LuceneQuery:".Length
string rawQuery = stringToParse.Substring(indexOfPropertyValue).TrimEnd('}');
var response = index.Searcher
    .CreateQuery("content")
    .NativeQuery(rawQuery)
    .Execute(QueryOptions.SkipTake((criteria.CurrentPage - 1) * criteria.PageSize, criteria.PageSize));
  
@callumbwhyte
Contributor

callumbwhyte commented Feb 24, 2023

@AaronSadlerUK .Field and .NativeQuery do fundamentally different things.

An escaped field search builds a "phrase query" and as such doesn't need to go through the query parser, whilst a native query will use the query parser to turn your string into an actual query and execute it "raw". The phrase query ensures an exact match for each term and not parts of each term.

What "type" is your "colour" field in the index, and what format are you expecting the value to be in? The likelihood here is that your value isn't indexed quite as you'd expect.

e.g. "light-blue" would actually be indexed as "light" and "blue" by default – searching for "light-blue".Escape() will do a phrase query for "light-blue" but that's not the value in the index. You'd need to index the value accordingly, such as changing the analyzer for that specific case.

Use a tool like Luke to inspect your index and find out what's happening.

@AaronSadlerUK
Author

AaronSadlerUK commented Feb 24, 2023

Thanks for that @callumbwhyte, I am trying to open the indexes with Luke but having a nightmare 😅

Once I manage to get in I'll see what's happening in terms of the Indexed value.
However it does work with the backoffice and as a raw query, so for example if I search with "Dark Blue" using the nativequery it works.

But if I do it using .Escape() it does not, I should know more if I can get Luke to work

@callumbwhyte
Contributor

The backoffice is a whole different beast again ;-)

@AaronSadlerUK
Author

@callumbwhyte Any ideas with this error?
image

I've tried rebuilding etc...

@AaronSadlerUK
Author

Finally found a version which can read the indexes... 5.2.0

Can be found here for anyone looking:
https://github.com/DmitryKey/luke/releases/tag/luke-5.2.0

@AaronSadlerUK
Author

Colour looks like this in the index:
image

It's also indexed as FullText

This field is used as an attribute, so the searching on it is always exact.
namedOptions.FieldDefinitions.AddOrUpdate(new FieldDefinition("colour", FieldDefinitionTypes.FullText));

Any other thoughts or pointers?

@callumbwhyte
Contributor

@AaronSadlerUK If you right click on the value you can view the tokens, you should see 2 tokens: "light" and "green".

If you're trying to match 'Light Green' to either of those terms it won't match.

Rather than indexing your field as FullText you could opt for FieldDefinitionTypes.Raw which shouldn't be analyzed and therefore the tokens for the value indexed will be "Light Green" exactly as you expect.

You could also modify the value at index time in the TransformingIndexValues event, perhaps removing the space entirely?
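
The difference between an analyzed field and a Raw one can be sketched without Lucene or Examine at all. The snippet below is only a toy model: `RawVsAnalyzedSketch`, `AnalyzeFullText`, `AnalyzeRaw` and `ExactMatch` are hypothetical names invented for illustration, and the lowercase-and-split step is a rough stand-in for what an analyzer does, not the real StandardAnalyzer.

```csharp
using System;
using System.Linq;

public static class RawVsAnalyzedSketch
{
    // Rough stand-in for an analyzed (FullText-style) field:
    // lowercase the value, then split on anything that is not a letter or digit.
    public static string[] AnalyzeFullText(string value)
    {
        var cleaned = new string(value.ToLowerInvariant()
            .Select(c => char.IsLetterOrDigit(c) ? c : ' ')
            .ToArray());
        return cleaned.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Rough stand-in for a Raw (keyword-style) field:
    // the whole value is kept as a single, unmodified token.
    public static string[] AnalyzeRaw(string value)
    {
        return new[] { value };
    }

    // Exact lookup: the search text is compared to the indexed tokens as-is,
    // with no analysis applied to the search text.
    public static bool ExactMatch(string[] indexedTokens, string searchText)
    {
        return Array.IndexOf(indexedTokens, searchText) >= 0;
    }
}
```

In this model, `AnalyzeFullText("Light Green")` yields the two tokens `light` and `green`, so an exact lookup for `Light Green` finds nothing, whereas `AnalyzeRaw("Light Green")` keeps the single token `Light Green` and the same lookup succeeds.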

@AaronSadlerUK
Author

I was thinking about removing the space, but then I would need to create a whole thing to remove all the other different characters which are used such as / and - in different places.

I'll try the RAW way.

Am I right in thinking what I'm trying to do here would normally be done as a faceted search?

@dealloc

dealloc commented Feb 24, 2023

This seems quite similar to #325, which has my current workaround, but I'd love to get rid of that one.
Another option is to use the KeywordAnalyzer, which doesn't split the text:

		options.IndexValueTypesFactory = new Dictionary<string, IFieldValueTypeFactory>
		{
			[FIELD_DEFINITION_KEYWORD] = new DelegateFieldValueTypeFactory(name =>
				new GenericAnalyzerFieldValueType(
					name,
					_loggerFactory,
					new KeywordAnalyzer(),
					false
				)
			)
		};

@mistyn8

mistyn8 commented Aug 2, 2023

@dealloc I'm also seeing that a native phrase query returns results while an escaped term does not (Examine 3.1):

if (_examineManager.TryGetIndex(Constants.UmbracoIndexes.ExternalIndexName, out var index))
{
    var searcher = index.Searcher;

    // Escaped fluent query: returns no results.
    //var query = searcher.CreateQuery(IndexTypes.Media).Field("folderDevelopmentCode", developmentCode.Escape());

    // Native phrase query: returns results.
    var query = searcher.CreateQuery(IndexTypes.Media).NativeQuery($"+folderDevelopmentCode:\"{developmentCode}\"");
    var results = query.Execute();

    if (!results.Any())
    {
        ...
    }
}

both result in Category: "media", LuceneQuery: {+folderDevelopmentCode:"E500117"}
(note the enclosing quotes for a phrase query)

but only the nativequery coded example has a result set?

@mistyn8

mistyn8 commented Aug 2, 2023

I did notice there are tests for this.. that pass...
image
https://github.com/Shazwazza/Examine/runs/13242245210#r0s6
image
https://github.com/Shazwazza/Examine/blob/dev/src/Examine.Test/Examine.Lucene/Search/FluentApiTests.cs#L923-L934C61

//now escape it
var exactcriteria = searcher.CreateQuery("content");
var exactfilter = exactcriteria.Field("__Path", "-1,123,456,789".Escape());
var results2 = exactfilter.Execute();
Assert.AreEqual(1, results2.TotalItemCount);

//now try with native
var nativeCriteria = searcher.CreateQuery();
var nativeFilter = nativeCriteria.NativeQuery("__Path:\\-1,123,456,789");
Console.WriteLine(nativeFilter);
var results5 = nativeFilter.Execute();
Assert.AreEqual(1, results5.TotalItemCount);

@mistyn8

mistyn8 commented Aug 2, 2023

Actually, it seems that new FieldDefinitionCollection(new FieldDefinition("__Path", "raw")) has some involvement here?

While trying to develop a failing test, it seems that the native query also manipulates the value.
E.g. var nativeFilter = nativeCriteria.NativeQuery("folderDevelopmentCode:\"E500117\""); results in +folderDevelopmentCode:e500117, so it seems to decide it's not a phrase and reverts to a standard search.
Also, var nativeFilter = nativeCriteria.NativeQuery("folderDevelopmentCode:\"E500 117\""); results in
+folderDevelopmentCode:"e500 117" (note the lowercasing).

So I think the issue with Escape() is that it doesn't lowercase?
With var exactFilter = exactCriteria.Field("folderDevelopmentCode", "E500 117".ToLower().Escape()); the escaped fluent API returns the same results as the native query for me.

 [Test]
 public void Phrase_Matching()
 {
     var analyzer = new StandardAnalyzer(LuceneInfo.CurrentVersion);

     using (var luceneDir = new RandomIdRAMDirectory())
     using (var indexer = GetTestIndex(
         luceneDir,
         analyzer
         //, new FieldDefinitionCollection(new FieldDefinition("folderDevelopmentCode", "raw"))
         ))
     {


         indexer.IndexItems(new[] {
             new ValueSet(1.ToString(), "media",
                 new Dictionary<string, object>
                 {
                     {"folderDevelopmentCode", "E500 117"}
                 }),
              new ValueSet(2.ToString(), "media",
                 new Dictionary<string, object>
                 {
                     {"folderDevelopmentCode", "E500 118"}
                 })
             });

         var searcher = indexer.Searcher;

         var criteria = searcher.CreateQuery("media");
         var filter = criteria.Field("folderDevelopmentCode", "E500 117");
         Console.WriteLine($"FILTER: {filter}");
         var results1 = filter.Execute();
         //expecting 2 as this results in E500 or 117 query (prob not what we want but that's lucene)
         Assert.AreEqual(2, results1.TotalItemCount);

         //native
         var nativeCriteria = searcher.CreateQuery("media");
         var nativeFilter = nativeCriteria.NativeQuery("folderDevelopmentCode:\"E500 117\"");
         Console.WriteLine($"NATIVE: {nativeFilter}");
         var results3 = nativeFilter.Execute();
         Assert.AreEqual(1, results3.TotalItemCount);

         //exact match
         var exactCriteria = searcher.CreateQuery("media");
         var exactFilter = exactCriteria.Field("folderDevelopmentCode", "E500 117".Escape());
         Console.WriteLine($"EXACT: {exactFilter}");
         var results2 = exactFilter.Execute();
         Assert.AreEqual(1, results2.TotalItemCount);
     }
 }

image

Shazwazza changed the title from "Escape() returns 0 results, when the RAW query does" to "Escape() returns 0 results, when the Native query does" on Jun 14, 2024
@Shazwazza
Owner

Hi all, I know this topic is old but will add some clarity:

As @callumbwhyte notes, Escape() creates a PhraseQuery, but this is not subject to the tokenizing and analysis done by the field's default analyzer, because it operates outside of the query parser and creates an exact match.

Here's an example:

    var analyzer = new StandardAnalyzer(LuceneInfo.CurrentVersion);
    using (var luceneDir = new RandomIdRAMDirectory())
    using (var indexer = GetTestIndex(luceneDir, analyzer))
    {
        indexer.IndexItems(new[] {
            ValueSet.FromObject(1.ToString(), "content",
                new { phrase = "If You Can't Stand the Heat, Get Out of the Kitchen" }),
            ValueSet.FromObject(2.ToString(), "content",
                new { phrase = "When the Rubber Hits the Road" }),
            ValueSet.FromObject(3.ToString(), "content",
                new { phrase = "A Fool and His Money Are Soon Parted" }),
            ValueSet.FromObject(4.ToString(), "content",
                new { spaphraseth = "A Hundred and Ten Percent" }),
        });

        var searcher = (BaseLuceneSearcher)indexer.Searcher;

        var query = searcher
            .CreateQuery(IndexTypes.Content)
            .NativeQuery("+phrase:\"Get Out of the Kitchen\"");

        Console.WriteLine(query);
        var results = query.Execute();

        Assert.AreEqual(1, results.TotalItemCount);

        query = searcher
            .CreateQuery(IndexTypes.Content)
            .Field("phrase", "Get Out of the Kitchen".Escape());

        Console.WriteLine(query);
        results = query.Execute();

        Assert.AreEqual(1, results.TotalItemCount);
    }

What does this yield?

  • The first assertion works - and the output looks like: { Category: content, LuceneQuery: +(+phrase:"get out ? ? kitchen") }
  • The second assertion fails - and the output looks like: { Category: content, LuceneQuery: +phrase:"Get Out of the Kitchen" }

Why?

  • The first query uses the Query Parser to parse a phrase since it is contained in quotes and it passes through the tokenizer/analyzer which in this case is the StandardAnalyzer which lowercases everything and strips out common words.
  • The second query uses PhraseQuery under the hood, this does not go through the tokenizer/analyzer and becomes an exact match.
    • It does not match because the data in the index has been tokenized/analyzed, the actual value in there is lowercased and has common words removed
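
The two explanations above can be mimicked with a toy model that is independent of Lucene and Examine. This is only a sketch under stated assumptions: `PhraseSketch` and all of its members are hypothetical names, the stop-word list is a small made-up subset, and splitting the un-analyzed phrase on spaces is an assumption about how an exact phrase is broken into terms, not a description of Examine's internals.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PhraseSketch
{
    // Marks the position of a removed stop word (printed as "?" in the query output).
    public const string Hole = "?";

    // Illustrative stop words only; an assumption, not Lucene's actual list.
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "of", "the", "and", "a" };

    // Stand-in for analysis: lowercase, split on non-alphanumerics,
    // and replace stop words with a hole so term positions are preserved.
    public static List<string> Analyze(string text)
    {
        var cleaned = new string(text.ToLowerInvariant()
            .Select(c => char.IsLetterOrDigit(c) ? c : ' ')
            .ToArray());
        return cleaned.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(t => StopWords.Contains(t) ? Hole : t)
            .ToList();
    }

    // No analysis: split on spaces only, keeping case and stop words.
    public static List<string> SplitOnly(string text)
    {
        return text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).ToList();
    }

    // True if the phrase terms occur at consecutive positions in the document.
    // A hole in the phrase matches whatever sits at that position.
    public static bool ContainsPhrase(List<string> doc, List<string> phrase)
    {
        for (int start = 0; start + phrase.Count <= doc.Count; start++)
        {
            bool ok = true;
            for (int i = 0; i < phrase.Count && ok; i++)
            {
                ok = phrase[i] == Hole || phrase[i] == doc[start + i];
            }
            if (ok) return true;
        }
        return false;
    }

    // Renders terms the way the query output above shows them.
    public static string Render(List<string> terms)
    {
        return string.Join(" ", terms);
    }
}
```

With the document text from the example indexed through `Analyze`, the analyzed phrase renders as `get out ? ? kitchen` and `ContainsPhrase` finds it, while the phrase passed through `SplitOnly` keeps `Get`, `Out`, `of`, `the`, `Kitchen` and finds nothing, because none of those terms exist in the analyzed document in that form.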

I understand the confusion around 'Escape()' but it does essentially mean 'exact match', not phrase. So if you are using Escape() then you would need to declare your field type as 'Raw', which equates to using the KeywordAnalyzer under the hood.

So where do we go from here?

  • We can introduce a .Phrase() extension method which will create a PhraseQuery based on the query parser and will respect the analyzer used and Escape() will remain an exact match.
  • We can change the .Escape() handling to use a PhraseQuery based on the query parser and will respect the analyzer which will make Escape() essentially become a PhraseQuery - but IMO this is a breaking/unbreaking change since some folks have worked around this behavior so far. This will also mean that .Escape() is no longer really an exact match, but it sort of would be based on the analyzer of that field. Although this might seem like a breaking change - if we look at the FluentApiTests for all usages of .Escape() these are solely based on RAW (KeywordAnalyzer) field types. So... Perhaps this is the correct fix because Escape() probably wouldn't work unless it is a raw field anyways. I'm a +1 for this change.

@callumbwhyte
Contributor

My vote goes to having a Phrase() method.

Perhaps the terminology of Escape() is also confusing here? Maybe Exact() is more precise, and could be added now + Escape() obsoleted without breaking anything and leaving space for future...

@mattbrailsford

My immediate thought was exactly the same as Callum, so that gets my vote. Makes things clearer and doesn’t break existing code, but prompts to move to the newer options.

@Shazwazza
Owner

@callumbwhyte + @mattbrailsford

Thanks! Escape was originally, a very long time ago, meant to ensure that reserved chars were escaped and didn't cause lucene issues, but then this evolved in various ways over the years because it turns out we didn't need to do that with phrase queries.

I'm really not sure what the use case would be moving forward for introducing an 'Exact' method that would be the same as the current 'Escape', because the value would need to exactly match what is stored in the index, which depends on the field's analyzer. I don't want to re-introduce more confusion and have to re-explain this all over again because we'll be back in the same boat.

I'm leaning towards creating "Phrase" and then just obsoleting "Escape"?
