Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Date and Datetime column formatting information not extracted #5

Open
Gagravarr opened this issue Aug 14, 2015 · 10 comments
Open

Date and Datetime column formatting information not extracted #5

Gagravarr opened this issue Aug 14, 2015 · 10 comments

Comments

@Gagravarr
Copy link

In this gist you can find simple test SAS file (date_dd_mm_yyyy_copy.sas7bdat) with date formatting applied to the single column. In SAS viewer, it looks like this:
date_dd_mm_yyyy_copy

However, when I process it with a SasReader, my SasReaderCallback is called with a SasColumnType of NUMERIC and data values like 14610.0 for the 1st of Jan 2000, 14611.0 for the 2nd of Jan 2000 etc. The date formatting information isn't being extracted, and the raw values aren't being converted to dates

If I process the same SAS file with the Python SAS7BDAT library, it is able to detect the column as a date one, and return the values as dates

With a datetime formatted column, we see the same behaviour

It seems that the org.eobjects.metamodel.sas code isn't correctly detecting / fetching the date related formatting/type information for the column, and hence isn't correctly returning the data values as date objects

@kaspersorensen
Copy link
Member

A late thank you for the bug report. Unfortunately I have no clue how to fix this tho'! Do you?

@Gagravarr
Copy link
Author

The Python SAS7BDAT library is able to extract the date formatting information, at first glance it looks like there's more/new subheader support needed to capture this

@tdanford
Copy link
Contributor

tdanford commented Oct 7, 2015

I have a working fix for this which I'd be happy to contribute -- is there an example sas7bdat file with both date and time fields, that I could use to write a test?

@Gagravarr
Copy link
Author

@tdanford You can find an example file with dates and times in the Gist I referenced in the first post, along with a screenshot of what it looks like in SAS

@tdanford
Copy link
Contributor

tdanford commented Oct 8, 2015

I'm not sure that the values you've displayed in your screenshot match what's in the file, in either number or value -- for example, the eighth entry in your spreadsheet is "16JUN15", but the 8th value in the file attached above appears to be (as a double-width floating point number) -21913.5, which corresponds to either 01-Jan-1900 or 02-Jan-1900 (depending on whether you round or floor the number). I can confirm that the Python sas7bdat parser sees the value in the same way --
image
-- which also sees a different number of rows than you've shown above.

Possibilities: either you had pre-processed the file (removing NULLs, negative values) before generating the screenshot above, or there is a problem in the (reverse-engineered) spec that both this code and the Python code are working from. Either way, I don't have access to an independent SAS parser, I only have what's in the file and the incomplete file specs.

Thoughts? Not a blocker at the moment, I'll make my test match the output of the Python module, I just wanted to point it out.

@tdanford
Copy link
Contributor

I'm getting ready to file the PR, I think I have the TIME parsing working as well as the DATE parsing.

I've tested both, however, I am unable to share (and therefore the PR won't include) the test file that I've used for the TIME testing. This will probably be more robust if someone is able to pass along a publicly-sharable (preferably small) sas7bdat file with TIME values that I can use for testing.

tdanford added a commit to tdanford/metamodel_extras that referenced this issue Oct 12, 2015
This commit adds some support for parsing fields with the DATE, TIME, or
DATETIME formats from .sas7bdat files.

This is a big change, involving a couple of different elements:
  * new dependencies, including to Joda Time (for date/time formatting)
    and a log4j-slfj4 dependency for logging.
  * Adding two new types (DATE, TIME) to SasColumnType,
  * carrying through those changes to the parsing in the SasReader
    class, including use of date time conversion methods,
  * which are in the new DateTimeConverter class, that uses the
    Joda Time library.

A couple of notes: first, there is some bounds-checking for date/time
values in SasReader, which are meant to replicate some of the
observations on date/time value parsing from the Python sas7bdat
library.  In particular, we replicate the bounds on values from the
Python datetime library, so that the results _should_ be replicable
across libraries. See the comments in SasReader and DateTimeConverter.

Also, we note that some "excessively large" values for date9 are
actually 'datetime' (i.e. seconds, not days, from Jan 1 1960), an
observation taken from reading the Python code itself.  I've actually
aimed, throughout the code in places where the bounds or behavior were
underdocumented, to match as much as possible the behavior of the Python
sas7bdat module (https://pypi.python.org/pypi/sas7bdat).

Probably we should recognize an additional, explicit DATETIME format
type in the future.

NULL values (represented as NaN in NUMERIC format) are now returned as
straight Java nulls, from SasReader, for Date/Time values.

Furthermore, the actual values returned are DateTime and Period values
(from the Joda Time library), so downstream libraries will need to
recognize those values (by RTTI?) and provide their own date/time
formatting.

Right now, testing only covers the DATE formatting, using the example
DATE-containing data file provided in the discussion/gist to Issue datacleaner#5.
However, at the moment, I don't have an example of TIME values.  I've
tested these against a privately-available file, but I don't have a
public test that I can share so this feature should still be considered
UNTESTED.
@kaspersorensen
Copy link
Member

Hi @tdanford,

I merged your initial PR, but it sounded like you have more coming? From DataCleaner's side we would like to release MM extras maybe within a week or so. So I am just curious if you had something already lying ready to post or if we should consider that for a next release?

@Gagravarr
Copy link
Author

@tdanford I've just added another test file to my gist, which contains a numeric column, a date column, a time column and a datetime column. Excel version and screen capture also in there too, to help you see what SAS shows for them. Please do use any of the test files you need for unit testing!

@tdanford
Copy link
Contributor

Hi @kaspersorensen I've got another PR coming, let me check out Gagravarr's TIME test file first

@Gagravarr thanks!!

tdanford added a commit to tdanford/metamodel_extras that referenced this issue Oct 14, 2015
This commit adds some support for parsing fields with the DATE, TIME, or
DATETIME formats from .sas7bdat files.

This is a big change, involving a couple of different elements:
  * new dependencies, including to Joda Time (for date/time formatting)
    and a log4j-slfj4 dependency for logging.
  * Adding two new types (DATE, TIME) to SasColumnType,
  * carrying through those changes to the parsing in the SasReader
    class, including use of date time conversion methods,
  * which are in the new DateTimeConverter class, that uses the
    Joda Time library.

A couple of notes: first, there is some bounds-checking for date/time
values in SasReader, which are meant to replicate some of the
observations on date/time value parsing from the Python sas7bdat
library.  In particular, we replicate the bounds on values from the
Python datetime library, so that the results _should_ be replicable
across libraries. See the comments in SasReader and DateTimeConverter.

Also, we note that some "excessively large" values for date9 are
actually 'datetime' (i.e. seconds, not days, from Jan 1 1960), an
observation taken from reading the Python code itself.  I've actually
aimed, throughout the code in places where the bounds or behavior were
underdocumented, to match as much as possible the behavior of the Python
sas7bdat module (https://pypi.python.org/pypi/sas7bdat).

Probably we should recognize an additional, explicit DATETIME format
type in the future.

NULL values (represented as NaN in NUMERIC format) are now returned as
straight Java nulls, from SasReader, for Date/Time values.

Furthermore, the actual values returned are DateTime and Period values
(from the Joda Time library), so downstream libraries will need to
recognize those values (by RTTI?) and provide their own date/time
formatting.

Right now, testing only covers the DATE formatting, using the example
DATE-containing data file provided in the discussion/gist to Issue datacleaner#5.
However, at the moment, I don't have an example of TIME values.  I've
tested these against a privately-available file, but I don't have a
public test that I can share so this feature should still be considered
UNTESTED.
@ghost
Copy link

ghost commented Nov 22, 2015

@tdanford I've been using this library and translating it over to D, and came across this issue too. I've found a couple files that don't seem to be able to use the format and label subheader for the correct format. They're just a single date9 or datetim223 depending on the file. Got any ideas what might be up?

Files:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants