wacz file system fix #151

Open
djhmateer opened this issue Oct 25, 2024 · 0 comments

I run the auto-archiver from source, using docker for the wacz_enricher.

I've found that if I have two consecutive items to archive, the second one will throw an exception when any filesystem call is made from Python after the first wacz_enricher run, e.g. when a Telethon archiver is called (as it reads a .session file).

# on the second item, any filesystem call throws an exception, e.g. this raises a "can't find file" error
os.getcwd()
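
For anyone trying to reproduce the symptom: on Linux, one way to get a "file not found" error from `os.getcwd()` is for the process's working directory to have been removed. I have not confirmed that this is what happens after the first wacz_enricher run, so treat the following as a sketch of the symptom only. The helper names `cwd_is_valid` and `demo_deleted_cwd` are made up for illustration and are not part of auto-archiver.

```python
import os
import shutil
import tempfile


def cwd_is_valid() -> bool:
    """Return True if the process's current working directory still exists."""
    try:
        os.getcwd()
        return True
    except FileNotFoundError:
        # On Linux, getcwd() fails like this once the cwd has been removed.
        return False


def demo_deleted_cwd() -> bool:
    """Enter a scratch directory, delete it, and report whether the cwd is still valid."""
    original = os.getcwd()
    scratch = tempfile.mkdtemp()
    try:
        os.chdir(scratch)
        shutil.rmtree(scratch)  # remove the directory we are standing in
        return cwd_is_valid()
    finally:
        # Always restore a working directory that exists.
        os.chdir(original)
        shutil.rmtree(scratch, ignore_errors=True)
```

Calling `cwd_is_valid()` just before the second item is archived would show whether the working directory is the thing that went missing.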

My solution is to give Docker a volume directory to write to that lives outside the directory the Python script is run from.

https://github.com/djhmateer/auto-archiver/blob/836fbd7733d46ea14fa9615fbda691ad6234f1f6/src/auto_archiver/enrichers/wacz_enricher.py#L105

# old way
# eg /home/dave/auto-archiver/tmpa22nvh69
tmp_dir = ArchivingContext.get_tmp_dir()

# new tmp directory
linux_tmp_dir = '/home/dave/aatmp'
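
A less hardcoded variant of the same idea might create a fresh directory per crawl under a base directory that is outside the project tree. This is only a sketch: `make_crawl_dir` and `is_inside` are hypothetical helpers, not existing auto-archiver functions, and the base directory is whatever you pass in (in my case it would be `/home/dave/aatmp`).

```python
import os
import tempfile


def is_inside(child: str, parent: str) -> bool:
    """Return True if `child` is `parent` itself or located somewhere below it."""
    parent_real = os.path.realpath(parent)
    child_real = os.path.realpath(child)
    return os.path.commonpath([parent_real, child_real]) == parent_real


def make_crawl_dir(base_dir: str) -> str:
    """Create and return a new, uniquely named directory under `base_dir`."""
    os.makedirs(base_dir, exist_ok=True)
    # mkdtemp picks a unique name, so consecutive crawls never share a directory.
    return tempfile.mkdtemp(prefix="wacz_", dir=base_dir)
```

Before mounting the result into Docker, `is_inside(crawl_dir, os.getcwd())` can be used to check that the crawl directory really is outside the directory the script runs from.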

so the docker command changes only in the host directory that gets mounted:

# old way
docker run --rm -v /home/dave/auto-archiver/tmpa22nvh69:/crawls/ webrecorder/browsertrix-crawler crawl --url https://t.me/baznews9/10690 --scopeType page --generateWACZ --text --screenshot fullPage --collection e4422338 --id e4422338 --saveState never --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 200 --timeout 200 --postLoadDelay 20 --profile /crawls/profile.tar.gz

# new way
docker run --rm -v /home/dave/aatmp:/crawls/ webrecorder/browsertrix-crawler crawl --url https://t.me/baznews9/10690 --scopeType page --generateWACZ --text --screenshot fullPage --collection e4422338 --id e4422338 --saveState never --behaviors autoscroll,autoplay,autofetch,siteSpecific --behaviorTimeout 200 --timeout 200 --postLoadDelay 20 --profile /crawls/profile.tar.gz
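
To make the difference explicit, the command can be assembled from Python with the host directory as the only thing that varies. This is a sketch and not the actual code in wacz_enricher.py; `build_crawl_cmd` is a hypothetical name, and every flag is copied unchanged from the commands above.

```python
def build_crawl_cmd(host_dir: str, url: str, collection: str) -> list[str]:
    """Build the browsertrix-crawler docker command as an argument list."""
    return [
        "docker", "run", "--rm",
        # Only this bind mount differs between the old and the new way.
        "-v", f"{host_dir}:/crawls/",
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--scopeType", "page",
        "--generateWACZ",
        "--text",
        "--screenshot", "fullPage",
        "--collection", collection,
        "--id", collection,
        "--saveState", "never",
        "--behaviors", "autoscroll,autoplay,autofetch,siteSpecific",
        "--behaviorTimeout", "200",
        "--timeout", "200",
        "--postLoadDelay", "20",
        "--profile", "/crawls/profile.tar.gz",
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids shell quoting problems with the URL.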

With this change I can run the wacz_enricher on every link that gets archived.
