Ways of evaluating result text content to create relevancy-likelihood scores for separate fields #1175
sarayourfriend started this conversation in Ideas
Replies: 1 comment, 1 reply
While working on the content report moderation flow project proposal, I was considering using Akismet to evaluate content report descriptions. That got me wondering what other applications language analysis APIs might have at the catalogue level. If we ran descriptions through Akismet, for example, would a "spammy" verdict be meaningful information we could use to reduce the score contribution of the description for a document?

Consider the Flickr "blog post" style descriptions we sometimes see, which cause odd results to show up because the description contains text that is unrelated to the actual image. Is there a way to detect these? That isn't quite "spamminess", but maybe we could evaluate the strength of the relationship between the title, tags, and description to identify descriptions that aren't really about the work itself.
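To make that last idea concrete, here is a minimal sketch of such a relatedness check, assuming scikit-learn is available; the function names and the 0.1 threshold are placeholders I made up for illustration, not anything that exists in the catalogue:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def description_relatedness(title: str, tags: list[str], description: str) -> float:
    """Score (0..1) how much vocabulary the description shares with the title and tags."""
    reference = " ".join([title, *tags])
    tfidf = TfidfVectorizer().fit_transform([reference, description])
    # cosine_similarity returns a 2x2 pairwise matrix; [0, 1] compares the two texts.
    return float(cosine_similarity(tfidf)[0, 1])


def looks_unrelated(title: str, tags: list[str], description: str) -> bool:
    # 0.1 is an arbitrary placeholder threshold that would need tuning on real data.
    return description_relatedness(title, tags, description) < 0.1
```

Plain token overlap like this is cheap enough to run during ingestion; an embedding model would also catch semantic relatedness that token overlap misses, at a much higher cost.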
If you search for "Netherlands" and filter by Flickr, you'll see this result on the first page of images (picture of a person speaking at a lectern): https://openverse.org/image/579a9509-f0a5-4e00-b938-03ec1e308309?q=Netherlands
If you go to the Flickr page for the result, you'll see that the description and the title are identical. With our current approach, every token from the title appears twice in the document, boosting the document's score for each individual term. That doesn't seem like a useful boost, and perhaps we could detect these cases by evaluating the raw similarity between a title and description and reducing the score of the description if they are too similar. Perhaps we could even ignore the description entirely in search if it matches the title (`title.lower() == description.lower()`).
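As a minimal sketch of that check at indexing time (the helper names are hypothetical, and it normalises whitespace as well as case so trivial restatements of the title still match):

```python
def normalise(text: str) -> str:
    # Collapse case and whitespace so "My  Title " still matches "my title".
    return " ".join(text.lower().split())


def searchable_description(title: str, description: str | None) -> str | None:
    """Return the description to index, or None when it merely repeats the title."""
    if not description or normalise(description) == normalise(title):
        return None
    return description
```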
What other approaches exist for evaluating the "usefulness" of a particular text field as it relates to each document? Should we generally prevent multiple instances of a term from boosting the score of a result? This result (picture of people sitting on the floor doing a leadership exercise) appears on the first page for the query "Buffalo" because "Buffalo" is repeated twice in the title, the description, and the tags! There doesn't seem to be an accurate correlation between the number of times a term is mentioned in a document and relevancy. Then again, if every document that mentions the word "buffalo" is analysed as having the same text score, will our results actually get better? The #392 project will enable easier iteration on index settings, so ideas like this would be easier to try out to see whether they are worth pursuing.
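On the repeated-term question, Elasticsearch does let an index opt out of term-frequency scoring: a BM25 similarity configured with k1=0 gives a term the same score whether it appears once or ten times in a field. A sketch of what those settings could look like, with illustrative field names rather than our actual mappings:

```python
# Index settings sketch: with k1=0, BM25 ignores term frequency, so
# "buffalo buffalo buffalo" scores the same as a single "buffalo".
# b=0 additionally disables field-length normalisation; both values are
# guesses that would need real testing (the kind #392 should make easy).
FLAT_TF_INDEX_SETTINGS = {
    "settings": {
        "index": {
            "similarity": {
                "flat_tf": {"type": "BM25", "k1": 0.0, "b": 0.0},
            },
        },
    },
    "mappings": {
        "properties": {
            "title": {"type": "text", "similarity": "flat_tf"},
            "description": {"type": "text", "similarity": "flat_tf"},
            "tags": {"type": "text", "similarity": "flat_tf"},
        },
    },
}
```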
Are any of these ideas worth turning into issues? Should we plan for a project sometime next year to look deeper into how the relationship between the various text fields on documents can be leveraged to get better document scores?
Reply:
I think all of the ideas you mention are worth converting into issues! Perhaps my perspective is limited, but I can't think of a case where search results would be positively impacted by results that fall into either of the cases above.