Fix segfaults during CDX read #24

Open
wants to merge 5 commits into
base: v1.21.3-at

Conversation

the-blank-x

Fixes #23

@the-blank-x
Author

the-blank-x commented Jan 27, 2024

This writes invalid WARC-Refers-To-Date headers--oops.

This is because the code now reads the timestamps from the CDX file (which appear to be formatted as YYYYMMDDhhmmss, although that format is defined neither in the legend nor in the specifications from 2006 and 2015) and writes them into the WARC-Refers-To-Date header without converting them to conform to the WARC spec (which specifies that it is "a UTC timestamp formatted according to W3CDTF", i.e. in the form YYYY-MM-DDThh:mm:ssZ).

I can have the code rewrite the CDX timestamp into a W3CDTF UTC timestamp, but I am slightly hesitant to do so (unless given the okay), since I don't know whether all properly formed CDX files have their timestamps in the form YYYYMMDDhhmmss.
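
For illustration, a minimal sketch of what such a conversion could look like, assuming the CDX timestamp really is exactly 14 digits in YYYYMMDDhhmmss order. The function name `cdx_timestamp_to_w3cdtf` is hypothetical and this is not necessarily the code in this PR; it uses plain `malloc` rather than wget's allocation wrappers, and it does not range-check the fields (a month of 13 would pass through unchanged).

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the PR's code: convert a 14-digit CDX timestamp
   (assumed to be YYYYMMDDhhmmss) into W3CDTF form (YYYY-MM-DDThh:mm:ssZ).
   Returns a malloc'd string that the caller must free, or NULL if the
   input does not match the assumed format or allocation fails.  */
char *
cdx_timestamp_to_w3cdtf (const char *cdx)
{
  size_t i;
  char *out;

  if (cdx == NULL || strlen (cdx) != 14)
    return NULL;
  for (i = 0; i < 14; i++)
    if (!isdigit ((unsigned char) cdx[i]))
      return NULL;

  out = malloc (21);  /* 20 characters plus the terminating NUL */
  if (out == NULL)
    return NULL;

  /* Each %.Ns copies at most N characters starting at the given offset.  */
  snprintf (out, 21, "%.4s-%.2s-%.2sT%.2s:%.2s:%.2sZ",
            cdx, cdx + 4, cdx + 6, cdx + 8, cdx + 10, cdx + 12);
  return out;
}
```

Returning NULL for anything that is not exactly 14 digits lets the caller fall back (for example, by omitting the header) when a CDX file uses a timestamp form other than the assumed one.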

@the-blank-x the-blank-x marked this pull request as draft January 27, 2024 01:56
@the-blank-x
Author

The invalid WARC-Refers-To-Date header issue should be fixed now.

@the-blank-x the-blank-x marked this pull request as ready for review January 29, 2024 00:57
@Arkiver2
Member

Arkiver2 commented Jan 29, 2024

Thank you! It looks like warc_date = cdx_to_warc_timestamp(date); is not xfreed after use.

I will need to check this PR closer before merging it.
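
A rough sketch of the ownership pattern in question, under stated assumptions: `make_warc_date` and `format_refers_to_date` are hypothetical stand-ins (the former merely duplicates its input), and plain `free` is used here where wget code would use its own `xfree`. The point is only that a heap-allocated temporary is released once the header line has been produced.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a converter that returns heap memory.
   It only duplicates its input; a real converter would reformat it.  */
static char *
make_warc_date (const char *date)
{
  char *copy = malloc (strlen (date) + 1);
  if (copy != NULL)
    strcpy (copy, date);
  return copy;
}

/* Formats one header line into buf and releases the temporary string.
   Returns snprintf's result, or -1 if the temporary could not be made.  */
int
format_refers_to_date (char *buf, size_t size, const char *date)
{
  int n;
  char *warc_date = make_warc_date (date);

  if (warc_date == NULL)
    return -1;

  n = snprintf (buf, size, "WARC-Refers-To-Date: %s\r\n", warc_date);
  free (warc_date);  /* release the temporary once it has been used */
  return n;
}
```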

@the-blank-x
Author

Oopsie

(and you're welcome!)

If this is not done, wget segfaults if --warc-dedup is not passed (but
--warc-file and --warc-cdx are)
Successfully merging this pull request may close these issues.

Segmentation fault (null pointer read and/or write) when reading CDX files
2 participants