Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use timezone-aware objects [mandate UTC in low-level vars for code safety, using Python 3.2's datetime.now(timezone.utc)] [should Python 3.3's time.monotonic() be used to approximate elapsed time?] #217

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 6 additions & 5 deletions cps/services/worker.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@
import queue
except ImportError:
import Queue as queue
from datetime import datetime
from datetime import datetime, timezone
from collections import namedtuple

from cps import logger
Expand Down Expand Up @@ -92,7 +92,7 @@ def add(cls, user, task, hidden=False):
ins.queue.put(QueuedTask(
num=ins.num,
user=username,
added=datetime.now(),
added=datetime.now(timezone.utc),
task=task,
hidden=hidden
))
Expand Down Expand Up @@ -191,7 +191,7 @@ def is_cancellable(self):
raise NotImplementedError

def start(self, *args):
self.start_time = datetime.now()
self.start_time = datetime.now(timezone.utc)
self.stat = STAT_STARTED

# catch any unhandled exceptions in a task and automatically fail it
Expand All @@ -201,7 +201,7 @@ def start(self, *args):
self._handleError(str(ex))
log.error_or_exception(ex)

self.end_time = datetime.now()
self.end_time = datetime.now(timezone.utc)

@property
def stat(self):
Expand Down Expand Up @@ -231,7 +231,8 @@ def error(self, x):

@property
def runtime(self):
return (self.end_time or datetime.now()) - self.start_time
now = datetime.now(timezone.utc)
return (self.end_time or now) - self.start_time
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are start_time and end_time exclusively used to compute the elapsed time in seconds?

If so, you don't need to use datetime objects, which are complex classes meant to represent calendar dates and localized time.

Internally, Linux and Windows represent time in (fractions of) seconds since a certain epoch. This system clock is the source for datetime objects, but you can also use it directly whenever you want to compute the delta:

>>> import time
>>> start_time = time.time()
>>> start_time
1720312378.525805
>>> type(start_time)
<class 'float'>
>>>
>>> end_time = time.time()
>>> end_time
1720312401.982123
>>> elapsed = end_time - start_time
>>> elapsed
23.456317901611328
>>> print(f"This example took exactly {elapsed} seconds to type, but we don't know when it started and ended in your timezone.")
This example took exactly 23.456317901611328 seconds to type, but we don't know when it started and ended in your timezone.

Copy link

@codewiz codewiz Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important caveat from the docs of time.time():

While this function normally returns non-decreasing values, it can return a lower value than a previous call if the system clock has been set back between the two calls.

This problem also affects datetime.datetime.now(datetime.timezone.utc) which you're using now, because the time can jump due to NTP, daylight saving or the system administrator simply setting the clock.

A more robust way to compute time differences is using time.monotonic(), which exists since Python 3.3:

>>> start_time = time.monotonic()
>>> end_time = time.monotonic()
>>> elapsed = end_time - start_time
>>> elapsed
2.559895193000557
>>> type(elapsed)
<class 'float'>

Copy link
Member

@holta holta Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Very useful knowledge: thanks @codewiz.

  2. Maybe everyone sleep on this overnight to consider whether datetime.now(timezone.utc) IS-OR-IS-NOT appropriate for these Prevent [upcoming and truly live YouTube] videos from carrying over [to] subsequent download tasks [& discussion of UTC for code safety — avoid TZ code complexity wherever possible!?] #212 use cases ?

Copy link
Member

@holta holta Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time can jump due to [1] NTP, [2] daylight saving or [3] the system administrator simply setting the clock

@deldesir above is a good reminder that mandating UTC is a "step forward" — but possibly might not be quite enough... ?

  • Mandating "UTC as perceived by the local machine" when recording time snapshots... does eliminate 1 of 3 ways that time/clock can go backwards (i.e. protecting against impromptu daylight savings time changes!) ✅
  • BUT, the other 2 ways that time/clock can spontaneously go backwards (NTP and sysadmin simply setting the clock) remain risks/considerations we need to at least think through — do we really want to live with these 2 risks — or is time.monotonic() (or similar) better for some of these uses cases? ⚠️

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are start_time and end_time exclusively used to compute the elapsed time in seconds?

No. They're mainly to render the localized task status based on the runtime per https://github.com/iiab/calibre-web/blob/master/cps/tasks_status.py

If so, you don't need to use datetime objects, which are complex classes meant to represent calendar dates and localized time.

Internally, Linux and Windows represent time in (fractions of) seconds since a certain epoch. This system clock is the source for datetime objects, but you can also use it directly whenever you want to compute the delta:

I agree elapsed time computation can and must be simplified, however we're dealing with a codebase in which mainly naive datetime objects are used. Just lookup datetime.nowutc and see for yourself.

Copy link
Member

@holta holta Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Mandating "UTC as perceived by the local machine" when recording time snapshots... does eliminate 1 of 3 ways that time/clock can go backwards (i.e. protecting against impromptu daylight savings time changes!) ✅

Also be aware of a 4th way IIAB system clock can change (go forward but not backwards!) rather automatically:

(e.g. when someone browses IIAB from a nearby phone / mobile device in a school!)

Summary of above PR from May 2019:

  1. the apache_allow_sudo flag must be true (set with local vars) and the menu config must have allow_server_time_update true (set through the admin console)

  2. if a device is mobile, as determined by the detectMobile function or has win64 or macintel in its device string then call updateServerTime

  3. updateServerTime calls a php script that checks to see if the time has been set with NTP and if so returns. It then checks to see if the UTC version of the system time is more than 2 minutes behind the browser's UTC time and if so it uses date to set the clock to the browser's UTC time and fake-hwclock to save it.

Copy link
Member

@holta holta Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree elapsed time computation can and must be simplified

We're all aware of the counterargument of course:

  • Curatorial use cases almost always require that IIAB be online — so relying on (what system clock claims to be) UTC might possibly be good enough (??)
  • So, taking a step back to articulate when + where + why we truly need "elapsed time" calculations might help!

Copy link
Collaborator Author

@deldesir deldesir Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't understand why "elapsed time" calculations became the central point of this discussion. This PR does only one thing, convert naive datetime objects into timezone-aware datetime objects, nothing else. The "elapsed time" calculations were always there and we can see their usefulness in the "tasks" page.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deeper Question: what should be the goal of this PR?

If the answer is safer time handling, that might be achieved in ~5 main ways:

  1. Disregard the local machine's TZ setting (time zone setting can and commonly does change during IIAB provisioning!)
  2. Disregard twice-annual daylight savings time changes (that occur in many time zones!)
  3. Disregard machine time changes triggered by NTP
  4. Disregard machine time changes triggered by nearby mobile devices whose time (according to their browser headers) is > 2min ahead of IIAB machine's time
  5. Disregard machine time changes triggered by sysadmin manual intervention

RECAP: So far this PR transitions to using the local machine's best guess at UTC time — which helps solve 1. and 2.

(But debatably this PR might not be quite enough to address considerations 3., 4. and 5. ?)

Copy link
Member

@holta holta Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick RECAP of @codewiz's suggestion from ~2 days ago:

Python 3.3's time.monotonic() should at least be considered for the "Run Time" column of "Tasks" view — and any other situations where "elapsed time" (or similar) are important... ?

image


@property
def dead(self):
Expand Down
16 changes: 8 additions & 8 deletions cps/tasks/download.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
import requests
import select
import sqlite3
from datetime import datetime
from datetime import datetime, timezone
from flask_babel import lazy_gettext as N_, gettext as _

from cps.constants import XKLB_DB_FILE
Expand All @@ -25,14 +25,14 @@ def __init__(self, task_message, media_url, original_url, current_user_name, she
self.shelf_id = shelf_id
self.duration = datetime.utcfromtimestamp(int(duration)).strftime("%H:%M:%S") if duration else "unknown"
self.live_status = live_status
self.start_time = self.end_time = datetime.now()
self.start_time = self.end_time = datetime.now(timezone.utc)
self.stat = STAT_WAITING
self.progress = 0

def run(self, worker_thread):
"""Run the download task"""
self.worker_thread = worker_thread
self.start_time = self.end_time = datetime.now()
self.end_time = datetime.now(timezone.utc)
self.stat = STAT_STARTED
self.progress = 0

Expand All @@ -53,14 +53,14 @@ def run(self, worker_thread):

complete_progress_cycle = 0

last_progress_time = datetime.now()
last_progress_time = datetime.now(timezone.utc)
timeout = 120 # seconds

self.message = f"Downloading {self.media_url_link}..."
if self.live_status == "was_live":
self.message += f" (formerly live video, length/duration {self.duration})"
while p.poll() is None:
self.end_time = datetime.now()
self.end_time = datetime.now(timezone.utc)
# Check if there's data available to read
rlist, _, _ = select.select([p.stdout], [], [], 0.1)
if rlist:
Expand All @@ -77,9 +77,9 @@ def run(self, worker_thread):
self.progress = min(0.99, (complete_progress_cycle + (percentage / 100)) / 4)
if percentage == 100:
complete_progress_cycle += 1
last_progress_time = datetime.now()
last_progress_time = datetime.now(timezone.utc)
else:
elapsed_time = (datetime.now() - last_progress_time).total_seconds()
elapsed_time = (datetime.now(timezone.utc) - last_progress_time).total_seconds()
if elapsed_time >= timeout:
self.message = f"{self.media_url_link} is taking longer than expected. It could be a stuck download due to unavailable fragments (<a href='https://github.com/yt-dlp/yt-dlp/issues/2137' target='_blank'>yt-dlp/yt-dlp#2137</a>) and/or an error in xklb's media_check. Please wait as we keep trying. See <a href='https://github.com/iiab/calibre-web/pull/223' target='_blank'>#223</a> for more info."
sleep(0.1)
Expand Down Expand Up @@ -132,7 +132,7 @@ def run(self, worker_thread):
self.message = f"{self.media_url_link} failed to download: {self.read_error_from_database()}"

finally:
self.end_time = datetime.now()
self.end_time = datetime.now(timezone.utc)
if p.returncode == 0 or self.progress == 1.0:
self.stat = STAT_FINISH_SUCCESS
log.info("Download task for %s completed successfully", self.media_url)
Expand Down
8 changes: 4 additions & 4 deletions cps/tasks/metadata_extract.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
import re
import requests
import sqlite3
from datetime import datetime
from datetime import datetime, timezone
from flask_babel import lazy_gettext as N_, gettext as _

from cps.constants import XKLB_DB_FILE, MAX_VIDEOS_PER_DOWNLOAD
Expand All @@ -23,7 +23,7 @@ def __init__(self, task_message, media_url, original_url, current_user_name):
self.original_url = self._format_original_url(original_url)
self.is_playlist = None
self.current_user_name = current_user_name
self.start_time = self.end_time = datetime.now()
self.start_time = self.end_time = datetime.now(timezone.utc)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already being done in super(...).__init__(...), the constructor of CalibreTask.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deldesir do you agree? (Do revise if so!)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the reason I made the previous change to consider the time set in the constructor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the reason I made the previous change to consider the time set in the constructor.

@deldesir Plz clarify the exact line number to help me understand? Thx!

Copy link
Collaborator Author

@deldesir deldesir Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line #26 for metadata_extract.py

Line #35 for download.py

self.stat = STAT_WAITING
self.progress = 0
self.columns = None
Expand Down Expand Up @@ -112,7 +112,7 @@ def _update_metadata(self, requested_urls):
requested_urls = {url: requested_urls[url] for url in requested_urls.keys() if "shorts" not in url and url not in failed_urls}

def _calculate_views_per_day(self, requested_urls, conn):
now = datetime.now()
now = datetime.now(timezone.utc)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The advice of using time.monotonic() does NOT apply here, because later time_uploaded is subtracted from it to compute a time difference.

Both time.monotonic() and time.time() are floats, but they are not based on t he same epoch:

>>> import time
>>> time.monotonic()
180162.46153954
>>> time.time()
1720317895.1978724

Thus, subtracting a monotonic clock from a system clock is a serious bug.

for requested_url in requested_urls.keys():
try:
view_count = conn.execute("SELECT view_count FROM media WHERE path = ?", (requested_url,)).fetchone()[0]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On line 117, what's the type of the time_uploaded field in the database?
Could you point me at the code writing this field, to ensure it's actually a POSIX timestamp (aka UNIX epoch)?

Also, datestamp.utcfromtimestamp() creates a naive datestamp, so if this is really a UNIX epoch, we should write:

time_uploaded_raw = conn.execute("...").fetchone()[0]
time_uploaded = datetime.fromtimestamp(time_uploaded_raw, timezone.utc)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the code writing time_uploaded at line 186:

conn.execute("UPDATE playlists SET path = ? WHERE path = ?", (f"{self.media_url}&timestamp={int(datetime.now().timestamp())}", self.media_url))

Since int(datetime.now().timestamp()) is the same of int(time.time()), this can be simplified.

Note that most SQL databases have types like TIMESTAMP and DATETIME, but SQLite by convention uses integer POSIX timestamps. So we're already storing the correct number here, but we could directly use int(time.time()) instead of converting the system time to datetime and then back to seconds.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree here, but will this be consistent across the code base which uses datetime objects to render tasks status. Some tasks are triggered by cron job and some are run upon uploading. For reporting purpose, these datetime objects are used in th Start Time column of the tasks page.

image

Expand Down Expand Up @@ -153,7 +153,7 @@ def _add_download_tasks_to_worker(self, requested_urls):
def run(self, worker_thread):
self.worker_thread = worker_thread
log.info("Starting to fetch metadata for URL: %s", self.media_url)
self.start_time = self.end_time = datetime.now()
self.end_time = datetime.now(timezone.utc)
Copy link

@codewiz codewiz Jul 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously this was also setting start_time. No longer needed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deldesir take a look, Thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I made this change to make the elapsed time start immediately when the task is called instead of when the worker thread is started. Not significant though, but more realistic IMHO.

self.stat = STAT_STARTED
self.progress = 0

Expand Down