-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Date and Datetime column formatting information not extracted #5
Comments
A late thank you for the bug report. Unfortunately I have no clue how to fix this tho'! Do you? |
The Python SAS7BDAT library is able to extract the date formatting information, at first glance it looks like there's more/new subheader support needed to capture this |
I have a working fix for this which I'd be happy to contribute -- is there an example sas7bdat file with both date and time fields, that I could use to write a test? |
@tdanford You can find an example file with dates and times in the Gist I referenced in the first post, along with a screenshot of what it looks like in SAS |
I'm getting ready to file the PR, I think I have the TIME parsing working as well as the DATE parsing. I've tested both, however, I am unable to share (and therefore the PR won't include) the test file that I've used for the TIME testing. This will probably be more robust if someone is able to pass along a publicly-sharable (preferably small) sas7bdat file with TIME values that I can use for testing. |
This commit adds some support for parsing fields with the DATE, TIME, or DATETIME formats from .sas7bdat files. This is a big change, involving a couple of different elements: * new dependencies, including to Joda Time (for date/time formatting) and a log4j-slfj4 dependency for logging. * Adding two new types (DATE, TIME) to SasColumnType, * carrying through those changes to the parsing in the SasReader class, including use of date time conversion methods, * which are in the new DateTimeConverter class, that uses the Joda Time library. A couple of notes: first, there is some bounds-checking for date/time values in SasReader, which are meant to replicate some of the observations on date/time value parsing from the Python sas7bdat library. In particular, we replicate the bounds on values from the Python datetime library, so that the results _should_ be replicable across libraries. See the comments in SasReader and DateTimeConverter. Also, we note that some "excessively large" values for date9 are actually 'datetime' (i.e. seconds, not days, from Jan 1 1960), an observation taken from reading the Python code itself. I've actually aimed, throughout the code in places where the bounds or behavior were underdocumented, to match as much as possible the behavior of the Python sas7bdat module (https://pypi.python.org/pypi/sas7bdat). Probably we should recognize an additional, explicit DATETIME format type in the future. NULL values (represented as NaN in NUMERIC format) are now returned as straight Java nulls, from SasReader, for Date/Time values. Furthermore, the actual values returned are DateTime and Period values (from the Joda Time library), so downstream libraries will need to recognize those values (by RTTI?) and provide their own date/time formatting. Right now, testing only covers the DATE formatting, using the example DATE-containing data file provided in the discussion/gist to Issue datacleaner#5. However, at the moment, I don't have an example of TIME values. I've tested these against a privately-available file, but I don't have a public test that I can share so this feature should still be considered UNTESTED.
Hi @tdanford, I merged your initial PR, but it sounded like you have more coming? From DataCleaner's side we would like to release MM extras maybe within a week or so. So I am just curious if you had something already lying ready to post or if we should consider that for a next release? |
@tdanford I've just added another test file to my gist, which contains a numeric column, a date column, a time column and a datetime column. Excel version and screen capture also in there too, to help you see what SAS shows for them. Please do use any of the test files you need for unit testing! |
Hi @kaspersorensen I've got another PR coming, let me check out Gagravarr's TIME test file first @Gagravarr thanks!! |
This commit adds some support for parsing fields with the DATE, TIME, or DATETIME formats from .sas7bdat files. This is a big change, involving a couple of different elements: * new dependencies, including to Joda Time (for date/time formatting) and a log4j-slfj4 dependency for logging. * Adding two new types (DATE, TIME) to SasColumnType, * carrying through those changes to the parsing in the SasReader class, including use of date time conversion methods, * which are in the new DateTimeConverter class, that uses the Joda Time library. A couple of notes: first, there is some bounds-checking for date/time values in SasReader, which are meant to replicate some of the observations on date/time value parsing from the Python sas7bdat library. In particular, we replicate the bounds on values from the Python datetime library, so that the results _should_ be replicable across libraries. See the comments in SasReader and DateTimeConverter. Also, we note that some "excessively large" values for date9 are actually 'datetime' (i.e. seconds, not days, from Jan 1 1960), an observation taken from reading the Python code itself. I've actually aimed, throughout the code in places where the bounds or behavior were underdocumented, to match as much as possible the behavior of the Python sas7bdat module (https://pypi.python.org/pypi/sas7bdat). Probably we should recognize an additional, explicit DATETIME format type in the future. NULL values (represented as NaN in NUMERIC format) are now returned as straight Java nulls, from SasReader, for Date/Time values. Furthermore, the actual values returned are DateTime and Period values (from the Joda Time library), so downstream libraries will need to recognize those values (by RTTI?) and provide their own date/time formatting. Right now, testing only covers the DATE formatting, using the example DATE-containing data file provided in the discussion/gist to Issue datacleaner#5. However, at the moment, I don't have an example of TIME values. I've tested these against a privately-available file, but I don't have a public test that I can share so this feature should still be considered UNTESTED.
@tdanford I've been using this library and translating it over to D, and came across this issue too. I've found a couple files that don't seem to be able to use the format and label subheader for the correct format. They're just a single date9 or datetim223 depending on the file. Got any ideas what might be up? Files: |
In this gist you can find simple test SAS file (date_dd_mm_yyyy_copy.sas7bdat) with date formatting applied to the single column. In SAS viewer, it looks like this:
However, when I process it with a
SasReader
, mySasReaderCallback
is called with aSasColumnType
ofNUMERIC
and data values like14610.0
for the 1st of Jan 2000,14611.0
for the 2nd of Jan 2000 etc. The date formatting information isn't being extracted, and the raw values aren't being converted to datesIf I process the same SAS file with the Python SAS7BDAT library, it is able to detect the column as a date one, and return the values as dates
With a datetime formatted column, we see the same behaviour
It seems that the
org.eobjects.metamodel.sas
code isn't correctly detecting / fetching the date related formatting/type information for the column, and hence isn't correctly returning the data values as date objectsThe text was updated successfully, but these errors were encountered: