COKI - create script to fetch all outputs found so far #15

Closed
briri opened this issue Sep 13, 2024 · 5 comments

briri commented Sep 13, 2024

Jamie has requested that I pull the list of all research outputs we've found so far from the DataCite harvester.

I am going to create a script to do this as it seems like it will be useful long term.

briri commented Sep 17, 2024

Need to set up a Lambda to generate this on a weekly basis.

Will need to place the resulting dataset into an S3 bucket so that it can be transferred to Google Cloud. Use the JSON lines format.
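
For reference, a rough sketch of the intended output format, assuming JSON Lines with one research output per line and gzip compression (the field names below are illustrative, not the harvester's actual schema):

# Illustrative only: write harvested records as gzip-compressed JSON Lines.
import gzip
import json

records = [
    {"dmp_id": "https://doi.org/10.48321/EXAMPLE", "output_doi": "10.1234/abcd"},  # placeholder record
]

with gzip.open("datacite_outputs.jsonl.gz", "wt", encoding="utf-8") as fh:
    for record in records:
        # One JSON object per line; no enclosing array.
        fh.write(json.dumps(record) + "\n")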

Need to investigate which is more cost-effective:

briri commented Sep 18, 2024

We are going to explore using AWS pre-signed URLs for S3. Other teams here use this approach to allow external users/systems to fetch objects from (and push objects to) our S3 buckets.

@jdddog you mentioned that you are working within a Python environment. Would you be able to call an API endpoint from within your code to retrieve these URLs?

I have an API that I can add an endpoint to that would allow you to fetch a pre-signed URL on demand. The pre-signed URL could then be used to write the DataCite, Crossref, and OpenAlex files.

I could also add an endpoint that would allow you to fetch the latest DMP metadata from us.

briri commented Sep 19, 2024

Had some conversations here and it sounds like the pre-signed URL route will work well.

I will need to:

Example Lambda to generate a pre-signed URL on demand:

# Sinatra-style handler: issues a pre-signed GET URL for the requested object
# and redirects the caller to it.
def get_file(key)
  @s3_client = Aws::S3::Client.new(region: ENV.fetch('AWS_REGION', nil))
  @presigner = Aws::S3::Presigner.new(client: @s3_client)
  bucket_name = ENV.fetch('BUCKET_NAME', nil)

  # Confirm the object exists before generating a pre-signed URL for it.
  begin
    @s3_client.head_object({ bucket: bucket_name, key: key })
  rescue Aws::S3::Errors::NotFound
    halt 404, "Object \"#{key}\" not found in S3 bucket \"#{bucket_name}\"\n"
  end

  # Generate the pre-signed GET URL and redirect the caller to it.
  url, _headers = @presigner.presigned_request(:get_object, bucket: bucket_name, key: key)
  if url
    response.headers['Location'] = url
    status 303
    'success: redirecting'
  else
    halt 500, "Unable to generate pre-signed URL for \"#{key}\"\n"
  end
end

Python example of calling that endpoint to download a file via the pre-signed URL:
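
(Sketch only: the host and endpoint path below are placeholders, not the real API. It assumes the Lambda above answers with a 303 redirect to the pre-signed URL, which requests follows automatically.)

# Illustrative only: download a file by following the redirect to the pre-signed URL.
import requests

API_BASE = "https://api.example.org"  # placeholder host

def fetch_file(key: str, local_path: str) -> None:
    # requests follows the 303 redirect to the pre-signed S3 URL by default.
    resp = requests.get(f"{API_BASE}/files/{key}", stream=True)
    resp.raise_for_status()
    with open(local_path, "wb") as fh:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            fh.write(chunk)

fetch_file("dmps_latest.jsonl.gz", "dmps_latest.jsonl.gz")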

jdddog commented Sep 20, 2024

Hey @briri, we can use pre-signed URLs if you would prefer; it's no problem to call an API from our Apache Airflow environment.

Some thoughts:

  • The maximum size for each file exported from BigQuery is 1 GB (before compression); beyond that, files are sharded into parts, e.g., crossref_000000000000.jsonl.gz, crossref_000000000001.jsonl.gz, etc. So as we get more DMPs, we will get multiple files per dataset.
    • Can we upload as many files as needed for each dataset using the pre-signed URLs? It seems a URL must be generated for each file (see the sketch after this list).
  • In BigQuery, we shard the tables by date (year, month, and day) to avoid conflicts between weekly runs. The date could be the date that the new DMP file is generated.
    • Could this date be included in the DMP export? E.g., in the file name or the data itself.
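
A sketch of the per-file flow we are discussing, assuming one pre-signed PUT URL is requested per exported shard (the endpoint path and response shape are placeholders, not an agreed API):

# Illustrative only: request one pre-signed PUT URL per shard and upload each.
import glob
import requests

API_BASE = "https://api.example.org"  # placeholder host

for shard in sorted(glob.glob("crossref_*.jsonl.gz")):
    # Ask the DMPTool API for a pre-signed PUT URL for this shard's object key.
    resp = requests.get(f"{API_BASE}/presigned-url", params={"key": shard})
    resp.raise_for_status()
    put_url = resp.json()["url"]

    # Upload the shard body directly to S3 via the pre-signed URL.
    with open(shard, "rb") as fh:
        requests.put(put_url, data=fh).raise_for_status()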

I wonder if it would be simpler to use the boto3 Python package to read and write files in the bucket directly. It would require less setup and would be more flexible for uploading an unknown number of files per dataset. The DMP export files could be stored on a read-only path and the dataset match files on a write-only path, each configured via an IAM role. It wouldn't offer as much fine-grained access control, though.
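
For example, a minimal sketch of that alternative, assuming IAM policies scope our credentials to a read-only dmps/ prefix and a write-only matches/ prefix (bucket name and prefixes are placeholders):

# Illustrative only: direct S3 access via boto3 with IAM-scoped prefixes.
import boto3

s3 = boto3.client("s3")
BUCKET = "dmptool-coki-exchange"  # placeholder bucket name

# Read the latest DMP export from the read-only prefix.
s3.download_file(BUCKET, "dmps/dmps_latest.jsonl.gz", "dmps_latest.jsonl.gz")

# Write a dataset match file to the write-only prefix.
s3.upload_file("crossref_000000000000.jsonl.gz",
               BUCKET, "matches/crossref_000000000000.jsonl.gz")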

Let me know what you think.

briri commented Sep 23, 2024

Hi @jdddog. From my conversations with colleagues here, the pre-signed URL option can handle a file of up to 5 GB before you need to break it apart, so we shouldn't have an issue there. I can add logic to process/combine the sharded files back together on my end.

Direct read/write access to the S3 bucket is not a preferred method for us here, so we will need to stick with the pre-signed URL route.
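
For reference, recombining the sharded parts can be as simple as concatenating them, since concatenated gzip members form a valid gzip stream (sketched in Python here for illustration; we would do the equivalent in Ruby on our side, and the file names are assumed):

# Illustrative only: recombine sharded JSON Lines exports by byte concatenation.
import glob
import shutil

with open("crossref.jsonl.gz", "wb") as out:
    for shard in sorted(glob.glob("crossref_*.jsonl.gz")):
        with open(shard, "rb") as fh:
            shutil.copyfileobj(fh, out)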
