crawlerr is a simple yet powerful web crawler for Node.js, based on Promises. It lets you crawl only the URLs that match wildcard patterns, uses a Bloom filter for caching, and gives you a browser-like feel through a server-side DOM.
- Simple: our crawler is simple to use;
- Elegant: provides a verbose, Express-like API;
- MIT Licensed: free for personal and commercial use;
- Server-side DOM: we use JSDOM so that working with a page feels like working in your browser;
- Configurable: pool size, retries, rate limit and more.
$ npm install crawlerr
crawlerr(base [, options])
You can find several examples in the examples/ directory. Here are some of the most important ones:
const spider = crawlerr("http://google.com/");
spider.get("/")
.then(({ req, res, uri }) => console.log(res.document.title))
.catch(error => console.log(error));
const spider = crawlerr("http://blog.npmjs.org/");
spider.when("/post/[digit:id]/[all:slug]", ({ req, res, uri }) => {
const post = req.param("id");
const slug = req.param("slug").split("?")[0];
console.log(`Found post with id: ${post} (${slug})`);
});
const spider = crawlerr("http://example.com/");
spider.get("/").then(({ req, res, uri }) => {
const document = res.document;
const elementA = document.getElementById("someElement");
const elementB = document.querySelector(".anotherForm");
console.log(elementA.innerHTML);
});
const url = "http://example.com/";
const spider = crawlerr(url);
spider.request.setCookie(spider.request.cookie("foobar=…"), url);
spider.request.setCookie(spider.request.cookie("session=…"), url);
spider.get("/profile").then(({ req, res, uri }) => {
//… spider.request.getCookieString(url);
//… spider.request.getCookies(url);
});
Creates a new Crawlerr instance for a specific website with custom options. All routes will be resolved to base.
Option | Default | Description |
---|---|---|
concurrent | 10 | How many requests can run simultaneously |
interval | 250 | How often a new request should be sent (in ms) |
… | null | See request defaults for more information |
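As a small illustration of the table above, the sketch below builds an options object that overrides one default. It assumes that any option you leave out falls back to the listed default; the override value is arbitrary.

```javascript
// Defaults as listed in the options table.
const defaults = { concurrent: 10, interval: 250 };

// Illustrative override; 500 is an arbitrary value.
const options = { ...defaults, interval: 500 };

console.log(options);
// With crawlerr installed, only the overrides would need to be passed:
// const spider = crawlerr("http://example.com/", { interval: 500 });
```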
Requests url. Returns a Promise which resolves with { req, res, uri }, where:
- req is the Request object;
- res is the Response object;
- uri is the absolute url (resolved from base).
Example:
spider
.get("/")
.then(({ res, req, uri }) => …);
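To make "resolved from base" concrete, here is a minimal sketch using Node's built-in URL class. It is an assumption that crawlerr resolves routes in an equivalent way, and resolveRoute is a hypothetical helper, not part of the crawlerr API.

```javascript
// Hypothetical helper: resolve a route against a base URL with the
// WHATWG URL class, which is a global in Node.js.
function resolveRoute(base, route) {
  return new URL(route, base).href;
}

console.log(resolveRoute("http://example.com/", "/profile"));
// => "http://example.com/profile"
console.log(resolveRoute("http://example.com/blog/", "post/1"));
// => "http://example.com/blog/post/1"
```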
Searches the entire website for urls which match the specified pattern. pattern can include named wildcards, which can then be retrieved in the callback via req.param.
Example:
spider
.when("/users/[digit:userId]/repos/[digit:repoId]", ({ res, req, uri }) => …);
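If you are curious how named wildcards could map to captured params, the sketch below compiles a pattern into a RegExp with named groups. This is not crawlerr's implementation: compilePattern and matchPath are hypothetical helpers, and the meaning of each wildcard type (digit as one or more digits, all as one or more non-slash characters) is an assumption.

```javascript
// Assumed wildcard types (not taken from crawlerr's source).
const WILDCARD_TYPES = new Map([
  ["digit", "\\d+"],  // one or more digits
  ["all", "[^/]+"],   // one or more non-slash characters
]);

// Hypothetical helper: turn "/users/[digit:userId]" into a RegExp
// with one named capture group per wildcard.
function compilePattern(pattern) {
  const source = pattern
    .split(/(\[[a-z]+:[A-Za-z_]\w*\])/)
    .map((part) => {
      const wildcard = /^\[([a-z]+):([A-Za-z_]\w*)\]$/.exec(part);
      if (!wildcard) {
        // Literal text: escape RegExp metacharacters.
        return part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
      }
      const [, type, name] = wildcard;
      if (!WILDCARD_TYPES.has(type)) {
        throw new Error(`Unknown wildcard type: ${type}`);
      }
      return `(?<${name}>${WILDCARD_TYPES.get(type)})`;
    })
    .join("");
  return new RegExp(`^${source}$`);
}

// Returns an object of captured params, or null when the path does not match.
function matchPath(pattern, path) {
  const match = compilePattern(pattern).exec(path);
  return match ? { ...match.groups } : null;
}

console.log(matchPath("/users/[digit:userId]/repos/[digit:repoId]", "/users/42/repos/7"));
// => { userId: "42", repoId: "7" }
```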
Executes a callback for a given event. For more information about which events are emitted, refer to queue-promise.
Example:
spider.on("error", …);
spider.on("resolve", …);
Starts/stops the crawler.
Example:
spider.start();
spider.stop();
A configured request object which is used by retry-request when crawling webpages. Extends from request.jar(). Can be configured when initializing a new crawler instance through options. See crawler options and request documentation for more information.
Example:
const url = "https://example.com";
const spider = crawlerr(url);
const request = spider.request;
request.post(`${url}/login`, (err, res, body) => {
request.setCookie(request.cookie("session=…"), url);
// Next requests will include this cookie
spider.get("/profile").then(…);
spider.get("/settings").then(…);
});
Extends the default Node.js incoming message.
Returns the value of an HTTP header. The Referrer header field is special-cased: Referrer and Referer are interchangeable.
Example:
req.get("Content-Type"); // => "text/plain"
req.get("content-type"); // => "text/plain"
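The case-insensitive lookup and the Referrer special case can be sketched as follows. getHeader is a hypothetical stand-alone helper written for illustration, not crawlerr's code; it relies on the fact that Node.js stores incoming header names in lower case.

```javascript
// Hypothetical helper mirroring the behaviour described above.
function getHeader(headers, name) {
  const key = name.toLowerCase();
  // Both spellings resolve to whichever one is actually present.
  if (key === "referer" || key === "referrer") {
    return headers.referrer || headers.referer;
  }
  return headers[key];
}

const headers = { "content-type": "text/plain", referer: "http://example.com/" };

console.log(getHeader(headers, "Content-Type")); // => "text/plain"
console.log(getHeader(headers, "Referrer"));     // => "http://example.com/"
```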
Checks whether the incoming request contains the "Content-Type" header field and whether it matches the given mime type. Based on type-is.
Example:
// Returns true with "Content-Type: text/html; charset=utf-8"
req.is("html");
req.is("text/html");
req.is("text/*");
Returns the value of param name when present, or defaultValue otherwise:
- checks route placeholders, ex: user/[all:username];
- checks body params, ex: id=12, {"id":12};
- checks query string params, ex: ?id=12;
Example:
// .when("/users/[all:username]/[digit:someID]")
req.param("username"); // /users/foobar/123456 => foobar
req.param("someID"); // /users/foobar/123456 => 123456
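The lookup described above can be sketched as a simple fallback chain. This assumes the three sources are checked in the order listed (route placeholders, then body, then query string); the param helper and the shape of its sources argument are hypothetical and only illustrate the idea.

```javascript
// Hypothetical helper: look a name up in route params, then body params,
// then query string params, falling back to defaultValue.
function param(sources, name, defaultValue) {
  const { params = {}, body = {}, query = {} } = sources;
  for (const source of [params, body, query]) {
    const isOwn = Object.prototype.hasOwnProperty.call(source, name);
    if (isOwn && source[name] != null) {
      return source[name];
    }
  }
  return defaultValue;
}

const sources = {
  params: { username: "foobar" },                      // from /users/[all:username]
  query: Object.fromEntries(new URLSearchParams("id=12")), // from ?id=12
};

console.log(param(sources, "username"));          // => "foobar"
console.log(param(sources, "id"));                // => "12"
console.log(param(sources, "page", "1"));         // => "1"
```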
Returns the JSDOM object.
Returns the DOM window for response content. Based on JSDOM.
Returns the DOM document for response content. Based on JSDOM.
Example:
res.document.getElementById(…);
res.document.getElementsByTagName(…);
// …
npm test