Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upload size limit #43

Open
pdurbin opened this issue Mar 17, 2022 · 7 comments
Open

upload size limit #43

pdurbin opened this issue Mar 17, 2022 · 7 comments

Comments

@pdurbin
Copy link
Member

pdurbin commented Mar 17, 2022

For now this is a placeholder issue. Today @landreev @scolapasta and I were talking about how in practical terms, files are limited to 2 GB when uploading to Dataverse via SWORD because getMaxUploadSize returns an int.

Below is the code from https://github.com/IQSS/dataverse/blob/v5.9/src/main/java/edu/harvard/iq/dataverse/api/datadeposit/SwordConfigurationImpl.java#L123-L146

To summarize:

  • When the system limit is not set, the SWORD upload is unlimited. However, it's very unlikely that anyone runs Dataverse in production with no size limit.
  • When the system limit is over 2 GB, we set the SWORD upload limit to 2 GB. This is probably a common case.
  • When the system limit is set and less than 2 GB, we set the SWORD upload limit to that value.

@Override
public int getMaxUploadSize() {
    
    int unlimited = -1;
    /* It doesn't look like we can determine which store will be used here, so we'll go with the default
     * (It looks like the collection or study involved is available where this method is called, but the SwordConfiguration.getMaxUploadSize()
     * doesn't allow a parameter)
     */ 
    Long maxUploadInBytes = systemConfig.getMaxFileUploadSizeForStore("default");

    if (maxUploadInBytes == null){
        // (a) No setting, return unlimited           
        return unlimited;      
    
    }else if (maxUploadInBytes > Integer.MAX_VALUE){
        // (b) setting returns the limit of int, return max int value  (BUG)
        return Integer.MAX_VALUE;
        
    }else{            
        // (c) Return the setting as an int
        return maxUploadInBytes.intValue();

    }
}

@pdurbin pdurbin changed the title upload site limit upload size limit Mar 17, 2022
@landreev
Copy link

landreev commented Mar 18, 2022

The part about "if you don't set the limit explicitly, it is unlimited" - do you know offhand if anyone has actually tested it? As in, try to run it without the limit set, and upload, say, a 5GB file?
What I'm getting at, I would imagine it is possible that this limit being an int instead of a long may not be the only limiting bottleneck inside their code; the library is quite old, and was clearly not written for depositing large volumes of data. So I would not be surprised if there were another 2GB or 4GB limit elsewhere.
But if it were the only such bottleneck, then this would be quite trivial.

@pdurbin
Copy link
Member Author

pdurbin commented Mar 18, 2022

@landreev when SWORD was new in DVN 3.x I remember trying a large file locally and it working but I can't remember exactly how many gigabytes. I want to say 4 or 5 GB. Obviously, this was a long time ago. I haven't done any recent testing.

@poikilotherm
Copy link
Member

poikilotherm commented Mar 22, 2022

Aside of the limit not being associated with a Dataverse collection, there seems to be a misunderstanding.

The reported maximum upload size is in kilobyte, not byte:

The SWORD server MAY specify the sword:maxUploadSize (in kB) of content that can be uploaded in one request [SWORD003] as a child of the app:service element. If provided this MUST contain an integer.

(See 6.1. Retrieving a Service Document)

It's wrong in the code!

long fLength = file.length(); // in bytes

This means, max integer is capable of 2TB.

@poikilotherm
Copy link
Member

Reporting the gist of our Slack tech discussion yesterday:

  • Maybe we can make SWORD report "-1" as unlimited upload size
  • File size will be enforced by the underlying file storage procedures
  • SWORD accepts simple ZIP as upload only (no other profiles or upload types supported)
  • If the limit is exceeded, the temporary files by FileUtil will be deleted and an empty list is returned. When we remove the current ZIP file size limit check from SWORD (by using -1), the user would retrieve an error message "No files to add to dataset. Perhaps the zip file was empty." on size limit violation.
sequenceDiagram
  actor up as ZIP Upload
  participant servlet as dv.SWORDv2MediaResourceServlet
  participant mapi as sw2s.MediaResourceApi
  participant sae  as sw2s.SwordApiEndpoint
  participant mrmi as dv.MediaResourceManagerImpl
  participant fu as dv.FileUtil

  up ->> servlet: doPost()
  servlet ->> mapi: post()
  mapi ->> sae: addDepositPropertiesFromBinary()
  sae ->> sae: storeAndCheckBinary()
  note right of sae: Store ZIP upload as SWORD temp file

  mapi ->> mrmi: addResource()
  mrmi ->> mrmi:  replaceOrAddFiles()
  mrmi ->> mrmi: Identify target dataset
  mrmi ->> fu: createDataFiles()
  fu ->> fu: Retrieve target store size limit
  fu ->> fu: Copy SWORD temp file to another temp file
  fu ->> fu: Unzip
  fu ->> fu: For each file check limit and store
Loading

@poikilotherm
Copy link
Member

@landreev @pdurbin is this still a thing for you?

May I propose a rather simple change?

public interface SwordConfiguration {
...
-     int getMaxUploadSize();
+     long getMaxUploadSize();
...
}

The comparison happens in bytes. To be able to use more than max int (2,147,483,647), we would be limited to max long (9,223,372,036,854,775,807) instead and have only minimal code changes and a simple workaround for now.

@pdurbin
Copy link
Member Author

pdurbin commented May 16, 2022

@poikilotherm yes! I think this is exactly what we want. Obviously, we'll need to make the change on the Dataverse side as well. When we're ready, we should re-open this issue (or create a new one, no strong preference), and get it put into a sprint:

@poikilotherm
Copy link
Member

poikilotherm commented May 16, 2022

Thx @pdurbin

Do you want this on the main branch right away or for the jakarta branch only? (Which will become the new main once we need it for DV)

@pdurbin said on 2022-05-16 via Matrix: let's do this in a future branch, neither main nor jakarta.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants