Help us build a billion-scale image-caption dataset by filtering Common Crawl with OpenAI CLIP
- Colab setup:
  - Change the value for YOUR_NICKNAME_FOR_THE_LEADERBOARD (illustrated just after this section) and make sure you are connected to a GPU runtime to maximize efficiency.
  - Then just run all (Ctrl+F9) to install dependencies and start Crawling!
  - If you want to run a CPU-only worker, don't use a GPU runtime.
  - If you want to run a GPU-only worker, please use a GPU runtime.
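For reference, the nickname setting is just a notebook cell containing a Python assignment; the exact cell lives in the Colab, so this is only a hedged illustration using the variable name from the instructions above:

    # Hypothetical illustration of the notebook cell to edit before "Run all" (Ctrl+F9);
    # only the variable name is taken from the instructions above.
    YOUR_NICKNAME_FOR_THE_LEADERBOARD = "your_nickname_here"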
- Docker setup:
  - Hybrid CPU worker:
    - Get the docker image using
      docker pull arkseal/cah-worker:hybrid-cpu
    - Run the docker image using
      docker run --name cahworker --shm-size=4G -d arkseal/cah-worker:hybrid-cpu
      - Add -e NAME={nickname} to specify your display name. Ex:
        docker run --name cahworker -e NAME=ARKseal --shm-size=4G -d arkseal/cah-worker:hybrid-cpu
    - Or use this one-liner:
      docker pull arkseal/cah-worker:hybrid-cpu && docker run --name cahworker --shm-size=4G -d arkseal/cah-worker:hybrid-cpu
      - Add -e NAME={nickname} to specify your display name. Ex:
        docker pull arkseal/cah-worker:hybrid-cpu && docker run --name cahworker -e NAME=ARKseal --shm-size=4G -d arkseal/cah-worker:hybrid-cpu
  - GPU-enabled hybrid worker:
      docker pull arkseal/cah-worker:hybrid-gpu && docker run --name cahworker --gpus all --shm-size=4G -d arkseal/cah-worker:hybrid-gpu
    - Add -e NAME={nickname} to specify your display name. Ex:
      docker pull arkseal/cah-worker:hybrid-gpu && docker run --name cahworker --gpus all -e NAME=ARKseal --shm-size=4G -d arkseal/cah-worker:hybrid-gpu
    - This requires the NVIDIA Container Toolkit on the host device.
  - CPU-only worker:
      docker pull arkseal/cah-worker:cpu && docker run --name cahworker --shm-size=4G -d arkseal/cah-worker:cpu
    - Add -e NAME={nickname} to specify your display name. Ex:
      docker pull arkseal/cah-worker:cpu && docker run --name cahworker -e NAME=ARKseal --shm-size=4G -d arkseal/cah-worker:cpu
- Manual setup:
  - Hybrid worker:
    - Run
      wget https://raw.githubusercontent.com/ARKseal/crawlingathome-worker/master/setup/setup_hybrid.sh
      bash setup_hybrid.sh
      to install dependencies.
    - Run
      export PYTHONHASHSEED=0 && python3 crawlingathome.py
      to start Crawling!
      - Use --name {nickname} to specify your display name.
  - CPU-only worker:
    - Run
      wget https://raw.githubusercontent.com/ARKseal/crawlingathome-worker/master/setup/setup_cpu.sh
      bash setup_cpu.sh
      to install dependencies.
    - Run
      export PYTHONHASHSEED=0 && python3 crawlingathome.py --cpu
      to start Crawling!
      - Use --name {nickname} to specify your display name.
  - GPU-only worker:
    - Run
      wget https://raw.githubusercontent.com/ARKseal/crawlingathome-worker/master/setup/setup_gpu.sh
      bash setup_gpu.sh
      to install dependencies.
    - Run
      export PYTHONHASHSEED=0 && python3 crawlingathome.py --gpu
      to start Crawling!
      - Use --name {nickname} to specify your display name.
- Droplet setup:
  - Use the cloud-config.yaml script to init the droplet.
  - SSH in with this command:
    ssh -oIdentitiesOnly=yes -i ~/.ssh/id_cah crawl@{your-droplet-ip}
  - Check the script by running
    tail -f crawl.log
- TODO:
  - Save image embeddings
  - Convert images to tfrecords (a rough sketch follows this list)
  - Upload to Google Drive
  - Prevent corrupt images from being processed (a rough sketch follows this list)
  - Shard the chunk (it needs to read the whole WAT file, which is bad for low-RAM servers)
  - Crawling@Home integration
  - Verify output
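The tfrecords item above could look roughly like the sketch below. It uses TensorFlow's tf.train.Example; the feature names ("image", "caption") and the (image bytes, caption) input format are assumptions, not the project's actual schema.

    # Hypothetical sketch: pack (image_bytes, caption) pairs into a TFRecord file.
    # Feature names "image" and "caption" are assumptions, not the project's schema.
    import tensorflow as tf

    def _bytes_feature(value: bytes) -> tf.train.Feature:
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

    def write_tfrecord(samples, path: str) -> None:
        """samples: iterable of (image_bytes, caption_str) pairs."""
        with tf.io.TFRecordWriter(path) as writer:
            for image_bytes, caption in samples:
                example = tf.train.Example(features=tf.train.Features(feature={
                    "image": _bytes_feature(image_bytes),
                    "caption": _bytes_feature(caption.encode("utf-8")),
                }))
                writer.write(example.SerializeToString())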
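The corrupt-image item is commonly handled by attempting a full decode with Pillow and skipping anything that fails. The sketch below is one way to do that, not the project's actual implementation; the usage line and variable names are hypothetical.

    # Hypothetical sketch: skip images that Pillow cannot fully decode.
    from PIL import Image

    def is_valid_image(path: str) -> bool:
        try:
            with Image.open(path) as img:
                img.verify()   # quick structural check
            with Image.open(path) as img:
                img.load()     # full decode; catches truncated files verify() misses
            return True
        except Exception:
            return False

    # Usage (hypothetical): keep only files that decode cleanly before CLIP filtering.
    # valid_paths = [p for p in downloaded_paths if is_valid_image(p)]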