All tests are done remotely using 10-100 samples with one decentralized worker on the same host machine. We benchmark against a remote site over network IO instead of spinning up a local website in order to keep the results close to a real-world scenario.
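For reference, one sample amounts to timing a single full crawl of the target site. Below is a minimal sketch of how a sample could be measured with the `spider` crate; it assumes the async `Website::new` / `crawl` / `get_links` API of recent versions and a `tokio` dependency, so treat it as illustrative rather than the actual benchmark harness.

```rust
use spider::website::Website;
use std::time::Instant;

#[tokio::main]
async fn main() {
    // Assumed API: Website::new builds a crawler scoped to the target domain.
    let mut website = Website::new("https://rsseau.fr");

    let start = Instant::now();
    // Crawl every reachable page on the domain (network-IO bound).
    website.crawl().await;

    println!(
        "crawled {} pages in {:?}",
        website.get_links().len(),
        start.elapsed()
    );
}
```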
----------------------
mac Apple M1 Max
10-core CPU
64 GB of RAM
1 TB of SSD disk space
-----------------------
Test url: `https://rsseau.fr/`
185 pages
| libraries                           | time |
|-------------------------------------|------|
| Rust[spider]: crawl 10 samples      | 73ms |
| Go[colly]: crawl 10 samples         | 32s  |
| Node.js[crawler]: crawl 10 samples  | 15s  |
| C[wget]: crawl 10 samples           | 70s  |
----------------------
linux ubuntu-latest
2-core CPU
7 GB of RAM
14 GB of SSD disk space
-----------------------
Test url: `https://rsseau.fr/`
185 pages
| libraries                           | time |
|-------------------------------------|------|
| Rust[spider]: crawl 10 samples      | 50ms |
| Go[colly]: crawl 10 samples         | 30s  |
| Node.js[crawler]: crawl 10 samples  | 3.4s |
| C[wget]: crawl 10 samples           | 60s  |
The concurrent benchmarks are averaged across 10 individual runs of 10 concurrent crawls, each using 10 samples.
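A rough sketch of how one such concurrent run could be timed and averaged is shown below: it spawns 10 crawlers at once with `tokio` and averages the wall-clock time over 10 runs. The `concurrent_run` helper and the crate API usage are illustrative assumptions, not the project's actual benchmark code.

```rust
use spider::website::Website;
use std::time::{Duration, Instant};

// Hypothetical helper: time one run of `crawlers` simultaneous crawls of the same target.
async fn concurrent_run(url: &str, crawlers: usize) -> Duration {
    let start = Instant::now();
    let mut handles = Vec::with_capacity(crawlers);
    for _ in 0..crawlers {
        let url = url.to_string();
        handles.push(tokio::spawn(async move {
            let mut website = Website::new(&url);
            website.crawl().await;
        }));
    }
    for handle in handles {
        handle.await.expect("crawl task panicked");
    }
    start.elapsed()
}

#[tokio::main]
async fn main() {
    let runs: u32 = 10;
    let mut total = Duration::ZERO;
    for _ in 0..runs {
        total += concurrent_run("https://rsseau.fr", 10).await;
    }
    // Average wall-clock time per concurrent run.
    println!("average over {runs} runs: {:?}", total / runs);
}
```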
To get better metrics we need to test concurrency and simultaneous runs against a larger website, preferably one that can be spun up inside a local container to avoid latency issues. The multi-threaded crawling capabilities shine brighter the larger the website. Even with a small website, this package currently runs faster than the top OSS crawlers to date. Spider is capable of crawling over 100k pages in 1-10 minutes depending on the website and OS. When spider is used decentralized, it can handle the IO in a fraction of the time depending on the specs and the number of workers. IO handling on Linux performs drastically better than on macOS and Windows.
Note: Node.js concurrency heavily impacts each additional run. As soon as you add multiple crawlers with Node.js, performance degrades by more than 2x per additional crawler, while other languages that handle concurrency well scale effectively.
You need a dedicated machine to get non-flaky results. GitHub Actions results may differ across runs due to the shared environment and the fact that the crawler is built to scale across workloads.
The target URL for the large benchmark crawl can be adjusted using the env variable `SPIDER_BENCH_URL_LARGE`.
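A hedged sketch of how a benchmark harness might pick up that variable, falling back to the default test URL when it is unset (the fallback and the helper name are illustrative):

```rust
use std::env;

// Resolve the target for the large benchmark crawl.
fn bench_target() -> String {
    // SPIDER_BENCH_URL_LARGE overrides the default target when set.
    env::var("SPIDER_BENCH_URL_LARGE").unwrap_or_else(|_| "https://rsseau.fr/".to_string())
}

fn main() {
    println!("benchmarking against {}", bench_target());
}
```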