Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use timezone-aware objects [mandate UTC in low-level vars for code safety, using Python 3.2's datetime.now(timezone.utc)] [should Python 3.3's time.monotonic() be used to approximate elapsed time?] #217

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

deldesir
Copy link
Collaborator

@deldesir deldesir commented Jul 6, 2024

This PR addresses modified the runtime property in CalibreTask class to ensure datetime objects are timezone-aware. It also updates start_time and end_time from the metadata fetch and download tasks accordingly.

Tested on Ubuntu 24.04 with python 3.12 (LRN2)
image

@deldesir deldesir marked this pull request as draft July 6, 2024 21:36
@holta
Copy link
Member

holta commented Jul 6, 2024

Underlying design discussion (likely still evolving, towards a KISS design decision we hope!) is here — to give others much more context: 🙏

@deldesir deldesir requested a review from holta July 6, 2024 22:04
@deldesir deldesir self-assigned this Jul 6, 2024
@deldesir deldesir added the enhancement New feature or request label Jul 6, 2024
@deldesir deldesir marked this pull request as ready for review July 6, 2024 22:07
@holta
Copy link
Member

holta commented Jul 6, 2024

  1. What is meant by "timezone-aware" here? Can we try to ignore all TZ settings as much as possible — when dealing with low-level operations that are not user-facing — i.e. can we mandate that time be recorded/handled in UTC as much as humanly possible?

    (Which really means UTC as perceived by the local machine, accepting that IIAB machines' system clocks drift and suddenly change all too regularly...)

  2. Further clarify the Python 3.12+ transition plan here? (Within this PR, or is a separate PR possibly needed?)

    NB In Calibre-Web databases schema and other parts like in editbooks.py datetime.utcnow() is used to get a similar result, but in Python 3.12 datetime.utcnow() is deprecated in favor of datetime.now(timezone.utc)

    Building on original discussion here, if possible:

@holta holta changed the title Use timezone-aware objects Use timezone-aware objects [ignore TZ and mandate UTC in low-level vars for code safety] Jul 6, 2024
@deldesir
Copy link
Collaborator Author

deldesir commented Jul 6, 2024

I am not sure I understand you fully. Does this explanation from Gemini fit for this discussion?
https://g.co/gemini/share/294eb065eb4c

Is the following snippet + result what you expected?

from datetime import datetime, timezone

# Get current UTC time
current_time_utc = datetime.now(timezone.utc)

# Print the current UTC date and time
print(f"Current UTC Time: {current_time_utc}")

Current UTC Time: 2024-07-06 23:36:38.923000+00:00

https://pym.dev/p/2sy96/

@holta
Copy link
Member

holta commented Jul 6, 2024

I am not sure I understand you fully. Does this explanation from Gemini fit for this discussion?

https://g.co/gemini/share/294eb065eb4c

Ok, I agree that storing tzinfo=datetime.timezone.utc in every involved variable doesn't hurt too much...

>>> from datetime import datetime, timezone
>>> datetime.now(timezone.utc)
datetime.datetime(2024, 7, 6, 23, 32, 25, 760775, tzinfo=datetime.timezone.utc)

...compared to recommendations from prior years, below, which (until Python 3.12!) would have worked fine...

>>> datetime.utcnow()
datetime.datetime(2024, 7, 6, 23, 34, 47, 308682)

RECAP: Sounds like we should retitle this PR... clarifying that another/core purpose of this PR is to modernize Python code to be compatible with Python 3.12+ ?

@holta holta changed the title Use timezone-aware objects [ignore TZ and mandate UTC in low-level vars for code safety] Use timezone-aware objects [mandate UTC in low-level vars for code safety, using Python 3.2's datetime.now(timezone.utc)] Jul 6, 2024
@@ -231,7 +231,8 @@ def error(self, x):

@property
def runtime(self):
return (self.end_time or datetime.now()) - self.start_time
now = datetime.now(timezone.utc)
return (self.end_time or now) - self.start_time
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are start_time and end_time exclusively used to compute the elapsed time in seconds?

If so, you don't need to use datetime objects, which are complex classes meant to represent calendar dates and localized time.

Internally, Linux and Windows represent time in (fractions of) seconds since a certain epoch. This system clock is the source for datetime objects, but you can also use it directly whenever you want to compute the delta:

>>> import time
>>> start_time = time.time()
>>> start_time
1720312378.525805
>>> type(start_time)
<class 'float'>
>>>
>>> end_time = time.time()
>>> end_time
1720312401.982123
>>> elapsed = end_time - start_time
>>> elapsed
23.456317901611328
>>> print(f"This example took exactly {elapsed} seconds to type, but we don't know when it started and ended in your timezone.")
This example took exactly 23.456317901611328 seconds to type, but we don't know when it started and ended in your timezone.

Copy link

@codewiz codewiz Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important caveat from the docs of time.time():

While this function normally returns non-decreasing values, it can return a lower value than a previous call if the system clock has been set back between the two calls.

This problem also affects datetime.datetime.now(datetime.timezone.utc) which you're using now, because the time can jump due to NTP, daylight saving or the system administrator simply setting the clock.

A more robust way to compute time differences is using time.monotonic(), which exists since Python 3.3:

>>> start_time = time.monotonic()
>>> end_time = time.monotonic()
>>> elapsed = end_time - start_time
>>> elapsed
2.559895193000557
>>> type(elapsed)
<class 'float'>

Copy link
Member

@holta holta Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Very useful knowledge: thanks @codewiz.

  2. Maybe everyone sleep on this overnight to consider whether datetime.now(timezone.utc) IS-OR-IS-NOT appropriate for these Prevent [upcoming and truly live YouTube] videos from carrying over [to] subsequent download tasks [& discussion of UTC for code safety — avoid TZ code complexity wherever possible!?] #212 use cases ?

Copy link
Member

@holta holta Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time can jump due to [1] NTP, [2] daylight saving or [3] the system administrator simply setting the clock

@deldesir above is a good reminder that mandating UTC is a "step forward" — but possibly might not be quite enough... ?

  • Mandating "UTC as perceived by the local machine" when recording time snapshots... does eliminate 1 of 3 ways that time/clock can go backwards (i.e. protecting against impromptu daylight savings time changes!) ✅
  • BUT, the other 2 ways that time/clock can spontaneously go backwards (NTP and sysadmin simply setting the clock) remain risks/considerations we need to at least think through — do we really want to live with these 2 risks — or is time.monotonic() (or similar) better for some of these uses cases? ⚠️

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are start_time and end_time exclusively used to compute the elapsed time in seconds?

No. They're mainly to render the localized task status based on the runtime per https://github.com/iiab/calibre-web/blob/master/cps/tasks_status.py

If so, you don't need to use datetime objects, which are complex classes meant to represent calendar dates and localized time.

Internally, Linux and Windows represent time in (fractions of) seconds since a certain epoch. This system clock is the source for datetime objects, but you can also use it directly whenever you want to compute the delta:

I agree elapsed time computation can and must be simplified, however we're dealing with a codebase in which mainly naive datetime objects are used. Just lookup datetime.nowutc and see for yourself.

Copy link
Member

@holta holta Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Mandating "UTC as perceived by the local machine" when recording time snapshots... does eliminate 1 of 3 ways that time/clock can go backwards (i.e. protecting against impromptu daylight savings time changes!) ✅

Also be aware of a 4th way IIAB system clock can change (go forward but not backwards!) rather automatically:

(e.g. when someone browses IIAB from a nearby phone / mobile device in a school!)

Summary of above PR from May 2019:

  1. the apache_allow_sudo flag must be true (set with local vars) and the menu config must have allow_server_time_update true (set through the admin console)

  2. if a device is mobile, as determined by the detectMobile function or has win64 or macintel in its device string then call updateServerTime

  3. updateServerTime calls a php script that checks to see if the time has been set with NTP and if so returns. It then checks to see if the UTC version of the system time is more than 2 minutes behind the browser's UTC time and if so it uses date to set the clock to the browser's UTC time and fake-hwclock to save it.

Copy link
Member

@holta holta Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree elapsed time computation can and must be simplified

We're all aware of the counterargument of course:

  • Curatorial use cases almost always require that IIAB be online — so relying on (what system clock claims to be) UTC might possibly be good enough (??)
  • So, taking a step back to articulate when + where + why we truly need "elapsed time" calculations might help!

Copy link
Collaborator Author

@deldesir deldesir Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't understand why "elapsed time" calculations became the central point of this discussion. This PR does only one thing, convert naive datetime objects into timezone-aware datetime objects, nothing else. The "elapsed time" calculations were always there and we can see their usefulness in the "tasks" page.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deeper Question: what should be the goal of this PR?

If the answer is safer time handling, that might be achieved in ~5 main ways:

  1. Disregard the local machine's TZ setting (time zone setting can and commonly does change during IIAB provisioning!)
  2. Disregard twice-annual daylight savings time changes (that occur in many time zones!)
  3. Disregard machine time changes triggered by NTP
  4. Disregard machine time changes triggered by nearby mobile devices whose time (according to their browser headers) is > 2min ahead of IIAB machine's time
  5. Disregard machine time changes triggered by sysadmin manual intervention

RECAP: So far this PR transitions to using the local machine's best guess at UTC time — which helps solve 1. and 2.

(But debatably this PR might not be quite enough to address considerations 3., 4. and 5. ?)

Copy link
Member

@holta holta Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick RECAP of @codewiz's suggestion from ~2 days ago:

Python 3.3's time.monotonic() should at least be considered for the "Run Time" column of "Tasks" view — and any other situations where "elapsed time" (or similar) are important... ?

image

@@ -151,7 +151,7 @@ def _add_download_tasks_to_worker(self, requested_urls):
def run(self, worker_thread):
self.worker_thread = worker_thread
log.info("Starting to fetch metadata for URL: %s", self.media_url)
self.start_time = self.end_time = datetime.now()
self.end_time = datetime.now(timezone.utc)
Copy link

@codewiz codewiz Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously this was also setting start_time. No longer needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deldesir take a look, Thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I made this change to make the elapsed time start immediately when the task is called instead of when the worker thread is started. Not significant though, but more realistic IMHO.

@@ -23,7 +23,7 @@ def __init__(self, task_message, media_url, original_url, current_user_name):
self.original_url = self._format_original_url(original_url)
self.is_playlist = None
self.current_user_name = current_user_name
self.start_time = self.end_time = datetime.now()
self.start_time = self.end_time = datetime.now(timezone.utc)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already being done in super(...).__init__(...), the constructor of CalibreTask.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deldesir do you agree? (Do revise if so!)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the reason I made the previous change to consider the time set in the constructor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the reason I made the previous change to consider the time set in the constructor.

@deldesir Plz clarify the exact line number to help me understand? Thx!

Copy link
Collaborator Author

@deldesir deldesir Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #26 for metadata_extract.py

Line #35 for download.py

@@ -110,7 +110,7 @@ def _update_metadata(self, requested_urls):
requested_urls = {url: requested_urls[url] for url in requested_urls.keys() if "shorts" not in url and url not in failed_urls}

def _calculate_views_per_day(self, requested_urls, conn):
now = datetime.now()
now = datetime.now(timezone.utc)
for requested_url in requested_urls.keys():
try:
view_count = conn.execute("SELECT view_count FROM media WHERE path = ?", (requested_url,)).fetchone()[0]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On line 117, what's the type of the time_uploaded field in the database?
Could you point me at the code writing this field, to ensure it's actually a POSIX timestamp (aka UNIX epoch)?

Also, datestamp.utcfromtimestamp() creates a naive datestamp, so if this is really a UNIX epoch, we should write:

time_uploaded_raw = conn.execute("...").fetchone()[0]
time_uploaded = datetime.fromtimestamp(time_uploaded_raw, timezone.utc)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the code writing time_uploaded at line 186:

conn.execute("UPDATE playlists SET path = ? WHERE path = ?", (f"{self.media_url}&timestamp={int(datetime.now().timestamp())}", self.media_url))

Since int(datetime.now().timestamp()) is the same of int(time.time()), this can be simplified.

Note that most SQL databases have types like TIMESTAMP and DATETIME, but SQLite by convention uses integer POSIX timestamps. So we're already storing the correct number here, but we could directly use int(time.time()) instead of converting the system time to datetime and then back to seconds.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree here, but will this be consistent across the code base which uses datetime objects to render tasks status. Some tasks are triggered by cron job and some are run upon uploading. For reporting purpose, these datetime objects are used in th Start Time column of the tasks page.

image

@@ -110,7 +110,7 @@ def _update_metadata(self, requested_urls):
requested_urls = {url: requested_urls[url] for url in requested_urls.keys() if "shorts" not in url and url not in failed_urls}

def _calculate_views_per_day(self, requested_urls, conn):
now = datetime.now()
now = datetime.now(timezone.utc)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The advice of using time.monotonic() does NOT apply here, because later time_uploaded is subtracted from it to compute a time difference.

Both time.monotonic() and time.time() are floats, but they are not based on t he same epoch:

>>> import time
>>> time.monotonic()
180162.46153954
>>> time.time()
1720317895.1978724

Thus, subtracting a monotonic clock from a system clock is a serious bug.

Copy link
Collaborator Author

@deldesir deldesir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @codewiz for your explanations. All theses changes were not made to enhance the runtime computation. They were motivated by

query = ("SELECT path, duration, live_status FROM media WHERE error IS NULL AND path LIKE 'http%' AND time_created > ?"
if "error" in self.columns
else "SELECT path, duration, live_status FROM media WHERE path LIKE 'http%' AND time_created > ?")
rows = conn.execute(query, (int(self.start_time.timestamp()),)).fetchall()
requested_urls = {}
from PR #212 which was deemed not safe enough due to eventual timezone-induced inconsistencies. I'll read your explanations again and check how they also relate to this logic.

query = ("SELECT path, duration, live_status FROM media WHERE error IS NULL AND path LIKE 'http%' AND time_created > ?"
                     if "error" in self.columns
                     else "SELECT path, duration, live_status FROM media WHERE path LIKE 'http%' AND time_created > ?")
            rows = conn.execute(query, (int(self.start_time.timestamp()),)).fetchall()

@holta holta changed the title Use timezone-aware objects [mandate UTC in low-level vars for code safety, using Python 3.2's datetime.now(timezone.utc)] Use timezone-aware objects [mandate UTC in low-level vars for code safety, using Python 3.2's datetime.now(timezone.utc)] [should time.monotonic() be used to approximate elapsed time?] Jul 7, 2024
@holta holta changed the title Use timezone-aware objects [mandate UTC in low-level vars for code safety, using Python 3.2's datetime.now(timezone.utc)] [should time.monotonic() be used to approximate elapsed time?] Use timezone-aware objects [mandate UTC in low-level vars for code safety, using Python 3.2's datetime.now(timezone.utc)] [should Python 3.3's time.monotonic() be used to approximate elapsed time?] Jul 7, 2024
@holta
Copy link
Member

holta commented Jul 23, 2024

@holta
Copy link
Member

holta commented Jul 23, 2024

Re: 6c9e200 (https://github.com/deldesir/calibre-web/blob/6c9e2000340aac615ab394ea12bb4a4877f156d6/cps/tasks/download.py#L56-L57)

@deldesir isn't a main point of this PR to change line 56 from... ?

last_progress_time = datetime.now()

...to...

last_progress_time = datetime.now(timezone.utc)

@deldesir
Copy link
Collaborator Author

Yes, I'll make the change.

@holta
Copy link
Member

holta commented Aug 22, 2024

@deldesir please confirm my resolving of the 2 merge conflicts in cps/tasks/download.py (2a12fbc#diff-8ae9358844e7383eaeb354c71c7eb20f7010536f7bad834364d731d93154f79a) when you can, Thanks!

(Arising from...)

@deldesir
Copy link
Collaborator Author

deldesir commented Aug 26, 2024

Your resolving of the 2 merge conflicts in cps/tasks/download.py (2a12fbc#diff-8ae9358844e7383eaeb354c71c7eb20f7010536f7bad834364d731d93154f79a) is on point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants