Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Time coordinates are sometimes integers, not datetime64 #2

Closed
gjoseph92 opened this issue Mar 13, 2021 · 9 comments
Closed

Time coordinates are sometimes integers, not datetime64 #2

gjoseph92 opened this issue Mar 13, 2021 · 9 comments
Labels
needs-future-test Add a test for this in the future, once tests exist (#26)

Comments

@gjoseph92
Copy link
Owner

Noticed this in https://gist.github.com/rmg55/b144cb273d9ccfdf979e9843fdf5e651, and I've had it happen before myself:

Coordinates:
  * time            (time) object 1594404158627000000 ... 1614276155393000000

Pretty sure stackstac is correctly making it into a pandas DatetimeIndex:

"time": pd.to_datetime(
[item["properties"]["datetime"] for item in items],
infer_datetime_format=True,
errors="coerce",
),

but something is going weird when xarray receives that, and it reinterprets it as an object array.

@scottyhq
Copy link
Contributor

@gjoseph92 I believe you're running into pydata/xarray#3291

I'm finally trying out stackstac and just ran into this. basically xarray doesn't like timezones, and since the ite spec has times always is UTC my strategy has been to either 1. rstrip('Z') on datestrings, or 2. use .tz_localize(None) on your pandas datetime index. If you'd like I can put in PR, should be an easy fix!

import numpy as np
import xarray as xr
print(xr.__version__) #0.17.0

np.random.seed(0)
temperature = 15 + 8 * np.random.randn(2, 2, 3)
precipitation = 10 * np.random.rand(2, 2, 3)
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]

# https://github.com/radiantearth/stac-spec/blob/master/item-spec/item-spec.md#properties-object
# STAC ITEMS have datetime as UTC strings like this:
dates = ['2019-12-30T00:00:01.000Z', '2020-01-05T00:00:01.000Z', '2020-01-11T00:00:01.000Z']
times = pd.to_datetime(dates)

da = xr.DataArray(
    data=temperature,
    dims=["x", "y", "time"],
    coords=dict(
        lon=(["x", "y"], lon),
        lat=(["x", "y"], lat),
        #time=times, #results in dtype 'object'
        time = times.tz_localize(None), #results in dype 'datetime64[ns]'
    )
)
da

image

@scottyhq
Copy link
Contributor

So, after I looked into this a bit, I couldn't understand why your example with S2 cogs does work as expected. It looks to me like there is an issue in pd.to_datetime() with infer_datetime_format=True if timestamps do not have decimal seconds:

# NASA CMR-STAC Dates 
dates = ['2019-12-30T00:00:01.000Z', '2020-01-05T00:00:01.000Z', '2020-01-11T00:00:01.000Z']
pd.to_datetime(dates, infer_datetime_format=True, errors="coerce")

DatetimeIndex(['2019-12-30 00:00:01+00:00', '2020-01-05 00:00:01+00:00',
'2020-01-11 00:00:01+00:00'],
dtype='datetime64[ns, UTC]', freq=None)

# S2 dates
dates = ['2020-05-01T18:04:03Z', '2020-04-28T17:54:06Z', '2020-04-26T18:04:09Z']
pd.to_datetime(dates, infer_datetime_format=True, errors="coerce")

DatetimeIndex(['2020-05-01 18:04:03', '2020-04-28 17:54:06',
'2020-04-26 18:04:09'],
dtype='datetime64[ns]', freq=None)

the UTC time zone info is there as expected if you don't use infer_datetime_format (default is False)

# S2 dates
dates = ['2020-05-01T18:04:03Z', '2020-04-28T17:54:06Z', '2020-04-26T18:04:09Z']
pd.to_datetime(dates, infer_datetime_format=False, errors="coerce")

DatetimeIndex(['2020-05-01 18:04:03+00:00', '2020-04-28 17:54:06+00:00',
'2020-04-26 18:04:09+00:00'],
dtype='datetime64[ns, UTC]', freq=None)

Seems like a pandas bug? cc @TomAugspurger. Should stackstac expose infer_datetime_format as a keyword? not sure how much of a speedup that gets. In any case, adding a .tz_localize(None) gives you datetime64 in xarray coords, so that solution i suggested would still be a viable fix.

@TomAugspurger
Copy link
Contributor

It's probably a bug. My understanding of infer_datetime_format is that it should be purely an optimization.

@gjoseph92
Copy link
Owner Author

@scottyhq thanks a ton for looking into this!

Based on pydata/xarray#3291 (comment), it sounds like dropping timezone information before going to xarray is currently the only option.

I don't see anything in the STAC spec that explicitly requires UTC; it just says datetimes must follow RFC 3339, which does include time-offset = "Z" / time-numoffset. So I think we should handle other timezones, even if they're not commonly used.

How about we add a tz="UTC" parameter to stackstac.stack, so you can optionally set the timezone that stackstac converts to? And then do .tz_convert(tz).tz_localize(None) (maybe with a fastpath for tz=="UTC" to just do tz_convert(None)).

As for the infer_datetime_format problem, since you've looked into it a bit, @scottyhq could you open an issue on pandas? Then I'd say:

  • If the bug only applies when time-offset == "Z", then why don't we say that stackstac assumes any tz-naive dates are in UTC, and call tz_localize("UTC") appropriately (or, if tz == "UTC" already, then do nothing).
  • If the bug would cause us to lose timezone information from non-UTC dates, then let's set infer_datetime_format=False until it's fixed.

@scottyhq
Copy link
Contributor

scottyhq commented Apr 20, 2021

https://github.com/radiantearth/stac-spec/blob/master/item-spec/item-spec.md#properties-object states that datetime requires UTC:

REQUIRED. The searchable date and time of the assets, in UTC.

@matthewhanson or @m-mohr would greatly appreciate if you could clarify if that is a typo. I'm used to seeing UTC times in metadata, but it's also convenient to have the timezone information somewhere to easily convert to local time without relying on location-based lookups.

As for the infer_datetime_format problem, since you've looked into it a bit, @scottyhq could you open an issue on pandas?

yep -> pandas-dev/pandas#41047 . Also might rekindle discussion in xarray for handling timezones since I know there is current effort underway for more flexible indexing, and i feel like passing a pandas index with timezone information should work. cc @dcherian if you have pointers to relevant discussion here.

@m-mohr
Copy link

m-mohr commented Apr 20, 2021

I'm not sure myself. I'd suggest opening an issue in stac-spec with a use case description so that we can discuss there. I could imagine that this is a legacy thing that we still want to change, although start_datetime and end_datetime also require UTC. On the other hand, created and updated don't require UTC.

@gjoseph92
Copy link
Owner Author

states that datetime requires UTC

Duh! Totally missed that, thanks for the careful read.

Let's see what comes up on radiantearth/stac-spec#1095, but for now I'd vote for not assuming dates are always UTC, and doing the .tz_convert(tz).tz_localize(None) approach in stackstac. Do you think you'll get to a PR for that? If not, I can probably do it next week.

@scottyhq
Copy link
Contributor

for not assuming dates are always UTC, and doing the .tz_convert(tz).tz_localize(None) approach in stackstac. Do you think you'll get to a PR for that? If not, I can probably do it next week.

Sounds good. I can start a PR. What's your preference for tests @gjoseph92 ?, It'd be nice to put in a simple framework, if you're ok for it I could copy what we've been using over in intake-stac https://github.com/intake/intake-stac/blob/main/.github/workflows/main.yaml

@gjoseph92
Copy link
Owner Author

Tests are my biggest to-do right now, since IMO it's the main thing blocking contribution. I have a lot of thoughts on how I'd like this to look in #26.

I'd prefer to not copy in the intake-stac yaml, since it seems to be conda-oriented to me, and I'd like to keep this only using poetry. I'd say, if you want to write a pytest-friendly test for this case and put it in //stackstac/tests/test_prepare.py, go for it and we'll set up CI in the near future, otherwise don't worry about it.

@gjoseph92 gjoseph92 added needs-future-test Add a test for this in the future, once tests exist (#26) and removed needs-future-test Add a test for this in the future, once tests exist (#26) labels Dec 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-future-test Add a test for this in the future, once tests exist (#26)
Projects
None yet
Development

No branches or pull requests

4 participants