COKI - create script to fetch all outputs found so far #15
Need to set up a Lambda to generate this on a weekly basis. Will need to place the resulting dataset into an S3 bucket so that it can be transferred to Google Cloud. Use the JSON Lines format. Need to investigate which is more cost effective:
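For reference, a minimal sketch of writing records in JSON Lines format before uploading to S3 (the file name and record fields are illustrative, not from the issue):

```python
import json

# Illustrative records only; the real dataset comes from the DataCite harvester.
records = [
    {"doi": "10.1234/example-1", "source": "datacite"},
    {"doi": "10.1234/example-2", "source": "crossref"},
]

# JSON Lines: one JSON object per line, newline-terminated.
with open("outputs.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```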
We are going to explore using AWS pre-signed URLs for S3. We have other teams here that use this approach to allow external users/systems to fetch objects from (and push objects to) our S3 buckets. @jdddog, you mentioned that you are working within a Python environment. Would you be able to call an API endpoint from within your code to retrieve these URLs? I have an API that I can add an endpoint to that would allow you to fetch a pre-signed URL on demand. The pre-signed URL could then be used to write the DataCite, Crossref and OpenAlex files to. I could then also add an endpoint that would allow you to fetch the latest DMP metadata from us as well.
Had some conversations here and it sounds like the presigned URL route will work well. I will need to:
Example Lambda to generate a pre-signed URL on demand:
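A minimal sketch of such a Lambda using boto3's `generate_presigned_url` (the bucket name, event shape and expiry are assumptions, not the original example):

```python
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Bucket, key and expiry are illustrative; the real values would come
    # from the API request or the Lambda's configuration.
    bucket = event.get("bucket", "example-dmp-bucket")
    key = event["key"]

    # Pre-signed PUT URL the caller can use to upload a single object.
    url = s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=3600,  # valid for one hour
    )
    return {"statusCode": 200, "body": url}
```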
Python example:
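A minimal sketch of a Python caller, assuming it fetches a pre-signed URL from an API endpoint and then uploads a file to it; the endpoint URL, parameters, response shape and file names are all placeholders:

```python
import requests

# Placeholder endpoint; the real endpoint would be provided by the API above.
api_endpoint = "https://example.org/api/presigned-url"

# Ask the API for a pre-signed PUT URL for a given object key.
response = requests.get(api_endpoint, params={"key": "coki/datacite.jsonl"}, timeout=30)
response.raise_for_status()
presigned_url = response.json()["url"]

# Upload the JSON Lines file directly to S3 via the pre-signed URL.
with open("datacite.jsonl", "rb") as f:
    upload = requests.put(presigned_url, data=f, timeout=300)
upload.raise_for_status()
```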
Hey @briri, we can use pre-signed URLs if you would prefer; it's no problem to call an API from our Apache Airflow environment. Some thoughts:
I wonder if it would be simpler to use the boto3 S3 Python package directly to read and write files from the bucket. It would require less setup, would be more flexible when uploading an unknown number of files for each dataset, and is probably easier and more robust for uploading the files. The DMP export files could be stored on a read-only path and the dataset match files on a write-only path, each configured via an IAM role. It wouldn't offer as much fine-grained access control, though. Let me know what you think.
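A rough sketch of the boto3 alternative described above (the bucket name and prefixes are illustrative):

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-dmp-bucket"  # illustrative name

# Read a DMP export from the read-only prefix (access granted via an IAM role).
obj = s3.get_object(Bucket=bucket, Key="dmp-exports/dmps.jsonl")
dmps = obj["Body"].read()

# Write a dataset match file to the write-only prefix.
s3.upload_file("matches.jsonl", bucket, "dataset-matches/matches.jsonl")
```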
Hi @jdddog. From my conversations with my colleagues here, the pre-signed URL option can handle a file up to 5 GB in size before you need to break it apart, so we shouldn't have an issue there. I can add logic to process/combine the parts back together on my end. Direct read/write access to the S3 bucket is not a preferred approach for us here, so we will need to stick with the pre-signed URL route.
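Since JSON Lines files can simply be concatenated, recombining split parts could look like this sketch (file names are illustrative):

```python
# Concatenate JSON Lines parts back into a single file; with JSONL this is
# plain byte-level concatenation as long as each part ends with a newline.
parts = ["outputs-part-001.jsonl", "outputs-part-002.jsonl"]

with open("outputs.jsonl", "wb") as combined:
    for part in parts:
        with open(part, "rb") as f:
            combined.write(f.read())
```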
Jamie has requested that I pull the list of all research outputs we've found so far from the DataCite harvester.
I am going to create a script to do this as it seems like it will be useful long term.