Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent [upcoming and truly live YouTube] videos from carrying over [to] subsequent download tasks [& discussion of UTC for code safety — avoid TZ code complexity wherever possible!?] #212

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

deldesir
Copy link
Collaborator

@deldesir deldesir commented Jul 5, 2024

Upcoming and actually live videos are not being downloaded due to their duration being unknown. Records of these videos are kept in xklb-metadata.db. Unfortunately, these videos urls are caught in subsequent downloads tasks. We choose not to delete them in support of an eventual "download retry" feature. This PR prevents them from being carried over by cutting the selection [threshold] to the time the task was created. So any live or failed videos prior to this timestamp is ignored.

Fixes #202

Tested on Ubuntu 24.04 (LRN2)

Before
image

After
image

@deldesir deldesir requested a review from holta July 5, 2024 15:51
@deldesir deldesir self-assigned this Jul 5, 2024
@deldesir deldesir added the bug Something isn't working label Jul 5, 2024
@holta holta changed the title Prevent live videos from carrying over subsequent download tasks Prevent live videos from carrying over [to] subsequent download tasks Jul 5, 2024
@holta
Copy link
Member

holta commented Jul 5, 2024

Is this PR safe in situations when the system clock (date & time) suddenly changes?

(e.g. during a country's time change — or when machine's sysadmin changes the system clock back in time?)

@deldesir
Copy link
Collaborator Author

deldesir commented Jul 5, 2024

Is this PR safe in situations when the system clock (date & time) suddenly changes?

(e.g. during a country's time change — or when machine's sysadmin changes the system clock back in time?)

Time changes generally affect a whole operating system. The views-per-day sorting, an important feature recently added to IIAB Calibre-Web, would be affected as well. So, the answer is No.

@holta
Copy link
Member

holta commented Jul 5, 2024

In the case of IIAB machines that are sometimes offline, system clocks can change VERY often:

  • Does this PR use (alleged) UTC for all times, or does it depend on individual IIAB machine's TZ (time zone) ?
  • Delineate specific risks.
  • Delineate specific protections (e.g. can a video fail to download right after some countries' time changes — but then become downloadable an hour later — without top much confusion?)

@holta
Copy link
Member

holta commented Jul 5, 2024

@deldesir thanks for answering the above questions — so that @codewiz can review this PR quickly & efficiently — helping us all understand if the specific underlying goals & approach are "safe enough" and why!

🙏

@holta holta changed the title Prevent live videos from carrying over [to] subsequent download tasks Prevent [upcoming and truly live YouTube] videos from carrying over [to] subsequent download tasks Jul 6, 2024
@deldesir
Copy link
Collaborator Author

deldesir commented Jul 6, 2024

  • Does this PR use (alleged) UTC for all times, or does it depend on individual IIAB machine's TZ (time zone) ?

The function compares time_created against self.start_time.timestamp(), both of which are Unix timestamps in UTC.

  • Delineate specific risks.

Machine-specific local times or time changes, accidental, voluntary or timezone change induced can cause inconsistencies.

  • Delineate specific protections (e.g. can a video fail to download right after some countries' time changes — but then become downloadable an hour later — without top much confusion?)

To make this possible, the task's time related operations should be timezone-aware. By explicitly using datetime.now(timezone.utc), all time comparisons should be consistent and based on UTC.

NB In Calibre-Web databases schema and other parts like in editbooks.py datetime.utcnow() is used to get a similar result, but in Python 3.12 datetime.utcnow() is deprecated in favor of [timezone-aware alternative] datetime.now(timezone.utc)

@holta
Copy link
Member

holta commented Jul 6, 2024

NB In Calibre-Web databases schema and other parts like in editbooks.py datetime.utcnow() is used to get a similar result, but in Python 3.12 datetime.utcnow() is deprecated in favor of datetime.now(timezone.utc)

Should we &/or upstream Calibre-Web begin to prepare, in a separate PR?

(Of course Debian 13 "Trixie" is likely a few weeks/months away from upgrading from Python 3.11 to 3.13, expected around 2024-10-01 !)

PS Mandating "perceived UTC" (from the local IIAB machine's standpoint) sounds like a great compromise — to at least increase code safety, and to avoid code bloat, by staying away from TZ complexity as much as humanly possible — i.e. to allow for crystal clear + maintainable + "Dead Simple" design! 🛤️

(After all, IIAB machines are shipped to other countries and time zones all the time, whether or not their system clocks inevitably drift, or suddenly jumping forward/back as they go offline/online intermittently and unpredictably ;)

@holta holta changed the title Prevent [upcoming and truly live YouTube] videos from carrying over [to] subsequent download tasks Prevent [upcoming and truly live YouTube] videos from carrying over [to] subsequent download tasks [& discussion of UTC for code safety — avoid TZ code complexity wherever possible!?] Jul 6, 2024
@holta holta added the documentation Improvements or additions to documentation label Jul 6, 2024
@deldesir
Copy link
Collaborator Author

deldesir commented Jul 6, 2024

Should we &/or upstream Calibre-Web begin to prepare, in a separate PR?

We begin in #217, we'll propose changes to upstream in a separate PR.

@holta
Copy link
Member

holta commented Jul 6, 2024

in Python 3.12 datetime.utcnow() is deprecated in favor of datetime.now(timezone.utc)

I didn't quite understand your above comment:

The missing piece is that datetime.datetime.now(datetime.timezone.utc) (what you've abbreviated above as datetime.now(timezone.utc)) is safe to use everywhere in 2024 — as it has worked since Python 3.2 (released February 2011!) — according to:

https://stackoverflow.com/questions/2331592/why-does-datetime-datetime-utcnow-not-contain-timezone-information/24666683#24666683

@holta
Copy link
Member

holta commented Jul 7, 2024

Upcoming and actually live videos are not being downloaded due to their duration being unknown. Records of these videos are kept in xklb-metadata.db. Unfortunately, these videos urls are caught in subsequent downloads tasks. We choose not to delete them in support of an eventual "download retry" feature.

Thanks much for explaining ✅

This PR prevents them from being carried over by cutting the selection [threshold] to the time the task was created. So any live or failed videos prior to this timestamp is ignored.

Question: Would it be feasible to distinguish between "upcoming / not yet live" and "truly live / actually live" YouTube video URLs versus "formerly live" YouTube video URLs more precisely — by avoiding the use of elapsed time calculations entirely?

@deldesir
Copy link
Collaborator Author

deldesir commented Jul 7, 2024

Question: Would it be feasible to distinguish between "upcoming / not yet live" and "truly live / actually live" YouTube video URLs versus "formerly live" YouTube video URLs more precisely — by avoiding the use of elapsed time calculations entirely?

@holta, no elapsed time calculations is used in this PR, neither in PR #217. The code only retrieves a list of videos added since a specific time, self.start_time.timestamp(). It ignores videos uploaded before this specific start time. To effectively achieve this, it checks the time_created timestamp in each video record in xklb-metada.db and compares it to the start time.

  • The lack of any duration value in xklb-metadata.db could be one hint.
  • Even better, could we possibly verify (reconfirm with YouTube itself!?) the following value of live_status in xklb-metadata.db ?

For every download request made, the value of live_status is reconfirmed with YouTube itself thanks to xklb metadata update feature which calls yt-dlp each time. This is how get the value of live_status always. The problem we're trying to solve here is the fact that some videos from prior download requests meet the same following criteria:

  • no duration
  • has one of the following values:
    • is_upcoming
    • is_live

@holta
Copy link
Member

holta commented Jul 8, 2024

For every download request made, the value of live_status is reconfirmed with YouTube itself thanks to xklb metadata update feature which calls yt-dlp each time. This is how get the value of live_status always.

Awesome news!

@deldesir can you please also explain specifically why time needs to be involved:

  • If we already know that a YouTube video URL was just moments ago reconfirmed / reverified (with YouTube itself!) to have live_status is_upcoming or is_live — isn't that all the info we need to prevent it from "carrying over" ?

  • Conversely, are there good reason(s) that time &/or time calculations also need to be involved — and if so what are those exact reasons?

Thanks!

@holta
Copy link
Member

holta commented Jul 23, 2024

@deldesir FYI merge conflict as a result of the just-merged:

@holta
Copy link
Member

holta commented Aug 26, 2024

If it helps us understand the original screenshots above:

Related:

@holta
Copy link
Member

holta commented Aug 26, 2024

Apologies I'm not yet up-to-speed here in understanding the whole design/architecture. But we'll get through this code review (design review) soon in coming days keeping our heads down here. As @deldesir began today to explain to me some critical above mechanics / assumptions relating to 2 above snippets (repasted below...)

For every download request made, the value of live_status is reconfirmed with YouTube itself thanks to xklb metadata update feature which calls yt-dlp each time. This is how get the value of live_status always. The problem we're trying to solve here is the fact that some videos from prior download requests meet the same following criteria:

no duration
has one of the following values:
is_upcoming
is_live

xklb-metadata.db live_status values for:
Not yet live: is_upcoming
Actually live: is_live
Formerly live: was_live

[CLARIF: xklb ADDS A 4TH VALUE not_live WHEN APPLICABLE!] (Above excerpt from...)

#199 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working documentation Improvements or additions to documentation
Projects
None yet
2 participants