COKI - create script to fetch all outputs found so far #15

Closed
briri opened this issue Sep 13, 2024 · 5 comments

briri commented Sep 13, 2024

Jamie has requested that I pull the list of all research outputs we've found so far from the DataCite harvester.

I am going to create a script to do this as it seems like it will be useful long term.

briri commented Sep 17, 2024

Need to set up a Lambda to generate this on a weekly basis.

Will need to place the resulting dataset into an S3 bucket so that it can be transferred to Google Cloud. Use the JSON lines format.
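
For reference, a rough sketch of the intended output format, assuming JSON Lines with one research output per line and gzip compression (the field names below are illustrative, not the harvester's actual schema):

# Illustrative only: write harvested records as gzip-compressed JSON Lines.
import gzip
import json

records = [
    {"dmp_id": "https://doi.org/10.48321/EXAMPLE", "output_doi": "10.1234/abcd"},  # placeholder record
]

with gzip.open("datacite_outputs.jsonl.gz", "wt", encoding="utf-8") as fh:
    for record in records:
        # One JSON object per line; no enclosing array.
        fh.write(json.dumps(record) + "\n")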

Need to investigate which is more cost-effective:

briri commented Sep 18, 2024

We are going to explore using AWS pre-signed URLs for S3. Other teams here use this approach to allow external users/systems to fetch objects from (and push objects to) our S3 buckets.

@jdddog you mentioned that you are working within a Python environment. Would you be able to call an API endpoint from within your code to retrieve these URLs?

I have an API that I can add an endpoint to that would allow you to fetch a pre-signed URL on demand. The pre-signed URL could then be used to write the DataCite, Crossref, and OpenAlex files.

I could also add an endpoint that would allow you to fetch the latest DMP metadata from us.

briri commented Sep 19, 2024

Had some conversations here and it sounds like the pre-signed URL route will work well.

I will need to:

Example Lambda to generate a pre-signed URL on demand:

# Sinatra-style handler: issues a pre-signed GET URL for the requested object
# and redirects the caller to it.
def get_file(key)
  @s3_client = Aws::S3::Client.new(region: ENV.fetch('AWS_REGION', nil))
  @presigner = Aws::S3::Presigner.new(client: @s3_client)
  bucket_name = ENV.fetch('BUCKET_NAME', nil)

  # Confirm the object exists before generating a pre-signed URL for it.
  begin
    @s3_client.head_object({ bucket: bucket_name, key: key })
  rescue Aws::S3::Errors::NotFound
    halt 404, "Object \"#{key}\" not found in S3 bucket \"#{bucket_name}\"\n"
  end

  # Generate the pre-signed GET URL and redirect the caller to it.
  url, _headers = @presigner.presigned_request(:get_object, bucket: bucket_name, key: key)
  if url
    response.headers['Location'] = url
    status 303
    'success: redirecting'
  else
    halt 500, "Unable to generate pre-signed URL for \"#{key}\"\n"
  end
end

Python example of calling that endpoint to download a file via the pre-signed URL:
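
(Sketch only: the host and endpoint path below are placeholders, not the real API. It assumes the Lambda above answers with a 303 redirect to the pre-signed URL, which requests follows automatically.)

# Illustrative only: download a file by following the redirect to the pre-signed URL.
import requests

API_BASE = "https://api.example.org"  # placeholder host

def fetch_file(key: str, local_path: str) -> None:
    # requests follows the 303 redirect to the pre-signed S3 URL by default.
    resp = requests.get(f"{API_BASE}/files/{key}", stream=True)
    resp.raise_for_status()
    with open(local_path, "wb") as fh:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            fh.write(chunk)

fetch_file("dmps_latest.jsonl.gz", "dmps_latest.jsonl.gz")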

jdddog commented Sep 20, 2024

Hey @briri, we can use pre-signed URLs if you would prefer; it's no problem to call an API from our Apache Airflow environment.

Some thoughts:

  • The maximum size for each file exported from BigQuery is 1 GB (before compression); beyond that, files are sharded into parts, e.g., crossref_000000000000.jsonl.gz, crossref_000000000001.jsonl.gz, etc. So as we get more DMPs, we will get multiple files per dataset.
    • Can we upload as many files as needed for each dataset using the pre-signed URLs? It seems a URL must be generated for each file (see the sketch after this list).
  • In BigQuery, we shard the tables by date (year, month, and day) to avoid conflicts between weekly runs. The date could be the date that the new DMP file is generated.
    • Could this date be included in the DMP export? E.g., in the file name or the data itself.
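
A sketch of the per-file flow we are discussing, assuming one pre-signed PUT URL is requested per exported shard (the endpoint path and response shape are placeholders, not an agreed API):

# Illustrative only: request one pre-signed PUT URL per shard and upload each.
import glob
import requests

API_BASE = "https://api.example.org"  # placeholder host

for shard in sorted(glob.glob("crossref_*.jsonl.gz")):
    # Ask the DMPTool API for a pre-signed PUT URL for this shard's object key.
    resp = requests.get(f"{API_BASE}/presigned-url", params={"key": shard})
    resp.raise_for_status()
    put_url = resp.json()["url"]

    # Upload the shard body directly to S3 via the pre-signed URL.
    with open(shard, "rb") as fh:
        requests.put(put_url, data=fh).raise_for_status()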

I wonder if it would be simpler to use the boto3 Python package to read and write files in the bucket directly. It would require less setup and would be more flexible for uploading an unknown number of files per dataset. The DMP export files could be stored on a read-only path and the dataset match files on a write-only path, each configured via an IAM role. It wouldn't offer as much fine-grained access control, though.
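
For example, a minimal sketch of that alternative, assuming IAM policies scope our credentials to a read-only dmps/ prefix and a write-only matches/ prefix (bucket name and prefixes are placeholders):

# Illustrative only: direct S3 access via boto3 with IAM-scoped prefixes.
import boto3

s3 = boto3.client("s3")
BUCKET = "dmptool-coki-exchange"  # placeholder bucket name

# Read the latest DMP export from the read-only prefix.
s3.download_file(BUCKET, "dmps/dmps_latest.jsonl.gz", "dmps_latest.jsonl.gz")

# Write a dataset match file to the write-only prefix.
s3.upload_file("crossref_000000000000.jsonl.gz",
               BUCKET, "matches/crossref_000000000000.jsonl.gz")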

Let me know what you think.

briri commented Sep 23, 2024

Hi @jdddog. From my conversations with colleagues here, the pre-signed URL option can handle a file of up to 5 GB before you need to break it apart, so we shouldn't have an issue there. I can add logic to process/combine the sharded files back together on my end.

Direct read/write access to the S3 bucket is not a preferred method for us here, so we will need to stick with the pre-signed URL route.
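
For reference, recombining the sharded parts can be as simple as concatenating them, since concatenated gzip members form a valid gzip stream (sketched in Python here for illustration; we would do the equivalent in Ruby on our side, and the file names are assumed):

# Illustrative only: recombine sharded JSON Lines exports by byte concatenation.
import glob
import shutil

with open("crossref.jsonl.gz", "wb") as out:
    for shard in sorted(glob.glob("crossref_*.jsonl.gz")):
        with open(shard, "rb") as fh:
            shutil.copyfileobj(fh, out)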
