Ways of evaluating result text content to create relevancy-likelihood scores for separate fields #1175
sarayourfriend started this conversation in Ideas
Replies: 1 comment, 1 reply
While working on the content report moderation flow project proposal, I was considering using Akismet to evaluate content report descriptions. That got me wondering what other applications language analysis APIs might have at the catalogue level. If we ran descriptions through Akismet, for example, would a "spammy" verdict be meaningful information we could use to reduce the score contribution of the description for a document?

Consider the Flickr "blog post" style descriptions we sometimes see, which cause odd results to show up because the description contains text that is unrelated to the actual image. Is there a way to detect these? That isn't quite "spamminess", but maybe we could evaluate the strength of the relationship between the title, tags, and description to identify descriptions that aren't really about the work itself.
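To make that last idea concrete, here is a minimal sketch of such a relatedness check, assuming scikit-learn is available; the function names and the 0.1 threshold are placeholders I made up for illustration, not anything that exists in the catalogue:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def description_relatedness(title: str, tags: list[str], description: str) -> float:
    """Score (0..1) how much vocabulary the description shares with the title and tags."""
    reference = " ".join([title, *tags])
    tfidf = TfidfVectorizer().fit_transform([reference, description])
    # cosine_similarity returns a 2x2 pairwise matrix; [0, 1] compares the two texts.
    return float(cosine_similarity(tfidf)[0, 1])


def looks_unrelated(title: str, tags: list[str], description: str) -> bool:
    # 0.1 is an arbitrary placeholder threshold that would need tuning on real data.
    return description_relatedness(title, tags, description) < 0.1
```

Plain token overlap like this is cheap enough to run during ingestion; an embedding model would also catch semantic relatedness that token overlap misses, at a much higher cost.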
If you search for "Netherlands" and filter by Flickr, you'll see this result on the first page of images (picture of a person speaking at a lectern): https://openverse.org/image/579a9509-f0a5-4e00-b938-03ec1e308309?q=Netherlands
If you go to the Flickr page for the result, you'll see that the description and the title are identical. With our current approach, every token from the title appears twice in the document, boosting the document's score for each individual term. That doesn't seem like a useful boost, and perhaps we could detect these cases by evaluating the raw similarity between a title and description and reducing the score of the description if they are too similar. Perhaps we could even ignore the description entirely in search if it matches the title (`title.lower() == description.lower()`).
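As a minimal sketch of that check at indexing time (the helper names are hypothetical, and it normalises whitespace as well as case so trivial restatements of the title still match):

```python
def normalise(text: str) -> str:
    # Collapse case and whitespace so "My  Title " still matches "my title".
    return " ".join(text.lower().split())


def searchable_description(title: str, description: str | None) -> str | None:
    """Return the description to index, or None when it merely repeats the title."""
    if not description or normalise(description) == normalise(title):
        return None
    return description
```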
What other approaches exist for evaluating the "usefulness" of a particular text field as it relates to each document? Should we generally prevent multiple instances of a term from boosting the score of a result? This result (picture of people sitting on the floor doing a leadership exercise) appears on the first page for the query "Buffalo" because "Buffalo" is repeated twice in the title, the description, and the tags! There doesn't seem to be an accurate correlation between the number of times a term is mentioned in a document and relevancy. Then again, if every document that mentions the word "buffalo" is analysed as having the same text score, will our results actually get better? The #392 project will enable easier iteration on index settings, so ideas like this would be easier to try out to see whether they are worth pursuing.
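On the repeated-term question, Elasticsearch does let an index opt out of term-frequency scoring: a BM25 similarity configured with k1=0 gives a term the same score whether it appears once or ten times in a field. A sketch of what those settings could look like, with illustrative field names rather than our actual mappings:

```python
# Index settings sketch: with k1=0, BM25 ignores term frequency, so
# "buffalo buffalo buffalo" scores the same as a single "buffalo".
# b=0 additionally disables field-length normalisation; both values are
# guesses that would need real testing (the kind #392 should make easy).
FLAT_TF_INDEX_SETTINGS = {
    "settings": {
        "index": {
            "similarity": {
                "flat_tf": {"type": "BM25", "k1": 0.0, "b": 0.0},
            },
        },
    },
    "mappings": {
        "properties": {
            "title": {"type": "text", "similarity": "flat_tf"},
            "description": {"type": "text", "similarity": "flat_tf"},
            "tags": {"type": "text", "similarity": "flat_tf"},
        },
    },
}
```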
Are any of these ideas worth turning into issues? Should we plan for a project sometime next year to look deeper into how the relationship between the various text fields on documents can be leveraged to get better document scores?
Reply:
I think all of the ideas you mention are worth converting into issues! Perhaps my perspective is limited, but I can't think of a case where search results would be positively impacted by results that fall into either of the cases above.