Problem description

What are you trying to achieve?
Transfer an AMI binary between S3 buckets in us-east-2 and us-gov-east-1.
What is the expected result?
The transfer completes with performance on par with, or close to, the AWS CLI.
What are you seeing instead?
The transfer takes ~2-3 min per gigabyte, which is much slower than the AWS CLI.
Steps/code to reproduce the problem
To be clear, smart_open IS working. However, I will not be able to use it for my project because the speed is too slow. My largest file presently is ~27 GB. At ~2 min per gigabyte, that is ~54 min to transfer a single file. Am I utilizing this project correctly? If there are suggestions to increase performance, I would very much appreciate more info.
Did you try reading and writing buffer_size-byte chunks instead of reading and writing line by line? For multipart uploads you can go up to smart_open.s3.MAX_PART_SIZE (5 GiB). For example:
while chunk := fr.read(buffer_size):  # walrus operator requires Python 3.8+
    fw.write(chunk)
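For reference, here is a fuller sketch of the chunked copy. The bucket names, key, profile names, and buffer size are placeholders, and since us-gov-east-1 lives in a separate AWS partition, the sketch assumes each side gets its own boto3 client:

import boto3
from smart_open import open as s_open

SRC = 's3://commercial-bucket/ami.bin'   # hypothetical source
DST = 's3://govcloud-bucket/ami.bin'     # hypothetical destination

# GovCloud generally needs separate credentials
# (assumption: these named profiles exist)
src_client = boto3.Session(profile_name='commercial').client('s3')
dst_client = boto3.Session(profile_name='govcloud').client('s3')

buffer_size = 128 * 1024 * 1024  # 128 MiB per chunk; tune as needed

with s_open(SRC, 'rb', transport_params={'client': src_client}) as fr, \
     s_open(DST, 'wb', transport_params={'client': dst_client}) as fw:
    while chunk := fr.read(buffer_size):
        fw.write(chunk)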
The line iterator checks every character for carriage returns: there is a good chance your code is CPU bound and not IO bound.
If you have enough RAM/swap, you can save yourself some API charges by doing only a single GET (a single fr.read() with no size argument) and then a single PUT (a single fw.write() with the multipart_upload=False transport param).
Multiple chunk reads (GETs) and multiple part writes (PUTs, plus the multipart init and commit calls) are all billed by AWS.
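A minimal sketch of the single-GET/single-PUT variant, reusing the placeholder clients from the sketch above. It buffers the entire object in memory, and note that S3 caps a plain (non-multipart) PUT at 5 GB, so this path only fits objects below that limit:

from smart_open import open as s_open

# One GET: read the whole object into memory (needs enough RAM/swap)
with s_open(SRC, 'rb', transport_params={'client': src_client}) as fr:
    data = fr.read()

# multipart_upload=False makes smart_open issue one plain PUT
# (S3 limits a single PUT upload to 5 GB)
with s_open(DST, 'wb',
            transport_params={'client': dst_client,
                              'multipart_upload': False}) as fw:
    fw.write(data)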