Running the scraper

Build the scraper and API docker images, from the root of the project:

docker build -f docker/api/Dockerfile -t api
docker build -f docker/scraper/Dockerfile -t commoncrawl .

Run the API:

docker run --network=host api

Run the scraper:

docker run -e AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_HERE> -e AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_HERE> \
-v /mnt/x/Test Data/Insight/BigDataTest:/app/data commoncrawl

Note that you'll need to replace <YOUR_ACCESS_KEY_HERE> and <YOUR_SECRET_HERE> with your AWS access key ID and secret access key, respectively. Additionally, you'll need to replace /mnt/x/Test Data/Insight/BigDataTest with the path to the directory where you want to save the scraped data.

Specify the requested MB of each file type

You can specify the MB limit of each file type by editing the parser/parser.go file. In this file, you'll see a mimeLimits map that looks like this:

var mimeLimits = map[string]int64{
	"image/jpeg":      20000,
	"image/jpg":       0, // It seems nothing is being detected as jpg, use jpeg instead
	"image/png":       20000,
	"application/pdf": 20000,
	"video/mp4":       20000,
	"application/vnd.openxmlformats-officedocument.wordprocessingml.document":   10000,
	"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet":         5000,
	"application/vnd.openxmlformats-officedocument.presentationml.presentation": 5000,
}

To change the limit of a file type, simply update the corresponding value in this map. The value should be specified in MB.

Number of threads

You can change the number of threads that the scraper should use by editing the Dockerfile for the scraper container. In this file, you'll see a CMD instruction that looks like this:

CMD ["go", "run", "main.go", "1000"]

The last argument ("1000") specifies the number of threads that the container should use. You can change this value to any number that you like. Note that this will impact the number of open sockets and concurrently open files, so choose a value that works well for your system.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
docker		docker
nsfw-classification		nsfw-classification
scraper		scraper
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Running the scraper

Specify the requested MB of each file type

Number of threads

About

Releases 1

Packages

Languages

mico-boje/commoncrawl_parser

Folders and files

Latest commit

History

Repository files navigation

Running the scraper

Specify the requested MB of each file type

Number of threads

About

Resources

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages