Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PR #244 [searching] is prone to false positives [ALSO: database integrity questions about safely/synchronously deleting "videos" from both/all DB's] #246

Open
deldesir opened this issue Sep 5, 2024 · 8 comments
Assignees
Labels
bug Something isn't working question Further information is requested

Comments

@deldesir
Copy link
Collaborator

deldesir commented Sep 5, 2024

The queries made to xklb-metadata.db return titles to be used as terms to fetch videos records from metadata.db. What if different videos have the same title!

@deldesir deldesir added bug Something isn't working question Further information is requested labels Sep 5, 2024
@deldesir deldesir self-assigned this Sep 5, 2024
@holta holta changed the title PR #244 is prone to false positives PR #244 is prone to false positives [database integrity questions about deleting videos too!] Sep 5, 2024
@holta
Copy link
Member

holta commented Sep 5, 2024

The challenge will likely be establishing a primary key (much like a UUID / GUID) for every video-or-similar, cleanly guaranteeing integrity across both (all 3) databases:

CLARIFS:

  • Both above SQLite databases live in /library/calibre-webRECAP:
    • Remember that metadata.db (for "book" and "video" titles, authors, descriptions and not much else!) has a quite fixed schema that we cannot change, as it's a de facto global standard dating back to Calibre's original introduction in ~2006.
    • Whereas the 2nd database xklb-metadata.db, that we added to IIAB Calibre-Web earlier this year to handle videos (especially for scraping of videos + their transcripts from the Internet, ETC), is defined by our very good friend who is the designer xklbhe might have suggestions for us?
  • While likely not relevant to this design discussion, also just be aware of a 3rd SQLite database /library/calibre-web/config/app.db that is tangentially involved here. Though this 3rd DB is really only for (1) usernames/passwords (2) Calibre-Web settings, AND (3) bookshelves (that teachers use to organize books + videos!)

Related:

cc: @mabuelhagag, @avni, @codewiz

@holta holta changed the title PR #244 is prone to false positives [database integrity questions about deleting videos too!] PR #244 is prone to false positives [ALSO: database integrity questions about safely/synchronously deleting "videos" from both DB's] Sep 5, 2024
@holta holta pinned this issue Sep 5, 2024
@holta holta changed the title PR #244 is prone to false positives [ALSO: database integrity questions about safely/synchronously deleting "videos" from both DB's] PR #244 [searching] is prone to false positives [ALSO: database integrity questions about safely/synchronously deleting "videos" from both DB's] Sep 5, 2024
@holta
Copy link
Member

holta commented Sep 6, 2024

Related IDEA from @deldesir:

  • At the same time we should very likely transition to (unify our db approaches around) SQLAlchemy — rather than Python’s sqlite3 library that we're currently using to manipulate /library/calibre-web/xklb-metadata.db — so as to better align with @OzzieIsaacs’s upstream Calibre-Web!

@avni
Copy link
Member

avni commented Sep 13, 2024

For metadata.db, according to the calibredb manual, these are the available book fields:

Available fields: author_sort, authors, comments, cover, formats, identifiers, isbn, languages, last_modified, pubdate, publisher, rating, series, series_index, size, tags, template, timestamp, title, uuid

uuid may be a unique identifier for a given book? It may still be helpful to map the relevant columns across the three tables, metadata.db, xklb-metadata.db, app.db to support solving this (i.e., a mini-specification). Foreign keys can be used to map across tables if that is not already being done.

@holta
Copy link
Member

holta commented Sep 20, 2024

@holta holta changed the title PR #244 [searching] is prone to false positives [ALSO: database integrity questions about safely/synchronously deleting "videos" from both DB's] PR #244 [searching] is prone to false positives [ALSO: database integrity questions about safely/synchronously deleting "videos" from both/all DB's] Sep 20, 2024
@avni
Copy link
Member

avni commented Sep 20, 2024

The queries made to xklb-metadata.db return titles to be used as terms to fetch videos records from metadata.db. What if different videos have the same title!

I'm seeking clarification on the problem statement, i.e., what does false positive mean, what's a use case that shows the problem?

If different videos have the same title, I think that's okay right? Hypothetical use case: 2 different people in different accounts may have uploaded videos with the same title coincidentally. An IIAB user downloads those two videos into their IIAB Calibre-Web instance. When they search for a term that includes the videos' title, we want both of these videos to return. Not clear to me yet what false positive means.

@holta
Copy link
Member

holta commented Sep 20, 2024

what does false positive mean

The currently proposed PR #244 searches thru video transcripts yes, but it incorrectly returns every video that contains the same title. That's an obvious bug, and we should keep very focused on fixing this bug. Recall from 2-3 weeks ago (PR #244 originally from Aug 30) that poorly named variable term (should be termList or titleList ?) that awkwardly accumulates[*] both search query and video titles into the same list?

term = [term] + other_terms

That's Line 982 of cps/db.py in PR #244 here: https://github.com/iiab/calibre-web/pull/244/files#diff-204c1e4c10ba05516a4a6ed88fed4a34a133e781e4db0c80a4d96cd64f0b268aR982 (hope that helps?!)

[*] This hack of putting search string and video titles together into the very same list... is a fine proof-of-concept for now... but probably should not be merged! 😅

@avni
Copy link
Member

avni commented Sep 20, 2024

  1. The term variable including results from get_search_terms() (essentially search results) is definitely odd/incorrect.
  2. side note: consider renaming get_search_terms() to be more reflecting of what it does - right now, it only searches captions (I left a comment on line 49 of lb_search.py for this)
  3. This hack of putting search string and video titles together into the very same list... is a fine proof-of-concept for now... but probably should not be merged!

Agree

  1. returns every video that contains the same title.

From what I can tell there's a chance that the search results contain the same video multiple times, which is not desired. If there are multiple videos with the same title (coincidentally) and a user is searching for that term, every unique video should be returned. You could/perhaps should also make sure the videos returned are unique (using a PK if available), which is perhaps what this is trying to address? I may be missing something though!

@avni
Copy link
Member

avni commented Sep 24, 2024

returns every video that contains the same title.

After speaking with @holta further on Friday, separating out the search string and video titles into separate variables should resolve this so that false positives do not return! 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants