
spider

A microservice that crawls a set of sites by following links within the relevant domains. Only URLs belonging to the provided host(s) are followed; all other links are ignored. Newly discovered URLs are entered into a GraphQL database.
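The host restriction can be pictured as a simple filter applied to every discovered link. The following is a minimal, hypothetical Scala sketch of that idea, not the project's actual code; the object and function names are assumptions for illustration:

```scala
import java.net.URI
import scala.util.Try

object HostFilter {
  // Hypothetical filter: a discovered link is followed only if its host
  // matches one of the configured host domains, either exactly or as a
  // subdomain (e.g. "news.example.org" matches "example.org").
  def belongsToHosts(url: String, allowedHosts: Set[String]): Boolean =
    Try(new URI(url)).toOption              // malformed URLs are rejected
      .flatMap(uri => Option(uri.getHost))  // relative links have no host
      .exists(host => allowedHosts.exists(h => host == h || host.endsWith("." + h)))
}
```

For example, `HostFilter.belongsToHosts("https://news.example.org/a", Set("example.org"))` returns `true`, while a link to any other domain returns `false` and is not crawled.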

Used Frameworks / Libraries

(not comprehensive; only the most important ones are listed)

Configuration

Configuration is done via environment variables. The following parameters are available:

  • API_URL - URL of the GraphQL API (required)
  • AUTH_SECRET - GraphQL authentication secret (required)
  • SCRAPE_PARALLELISM - number of pages the crawler visits in parallel (default: 100)
  • SCRAPE_INTERVAL - time interval between page hits (default: 500ms)
  • SCRAPE_TIMEOUT - timeout for each page load attempt (default: 20000ms)
  • SHUTDOWN_TIMEOUT - time after which the spider exits if no new URLs have been found (default: 15000ms)
  • MAX_RETRIES - maximum number of retries after a failed page load (default: 0)
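
As a rough illustration of how these variables map to a typed configuration, here is a hedged Scala sketch; the `SpiderConfig` case class and `fromEnv` helper are assumptions for this example and may not match the project's actual implementation:

```scala
import scala.concurrent.duration._

// Hypothetical configuration holder; field names mirror the environment
// variables documented above.
final case class SpiderConfig(
    apiUrl: String,
    authSecret: String,
    scrapeParallelism: Int,
    scrapeInterval: FiniteDuration,
    scrapeTimeout: FiniteDuration,
    shutdownTimeout: FiniteDuration,
    maxRetries: Int
)

object SpiderConfig {
  // Fails fast if a required variable is missing.
  private def required(name: String): String =
    sys.env.getOrElse(name, sys.error(s"$name is required"))

  // Reads each variable from the environment, falling back to the
  // documented defaults when a value is not set.
  def fromEnv(): SpiderConfig = SpiderConfig(
    apiUrl = required("API_URL"),
    authSecret = required("AUTH_SECRET"),
    scrapeParallelism = sys.env.get("SCRAPE_PARALLELISM").map(_.toInt).getOrElse(100),
    scrapeInterval = sys.env.get("SCRAPE_INTERVAL").map(_.toLong.millis).getOrElse(500.millis),
    scrapeTimeout = sys.env.get("SCRAPE_TIMEOUT").map(_.toLong.millis).getOrElse(20000.millis),
    shutdownTimeout = sys.env.get("SHUTDOWN_TIMEOUT").map(_.toLong.millis).getOrElse(15000.millis),
    maxRetries = sys.env.get("MAX_RETRIES").map(_.toInt).getOrElse(0)
  )
}
```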
