# Example of always running crawl app
Martin Angers edited this page Jun 12, 2013
The following example shows a gocrawl application that reads its seeds from a database, presumably also saving the harvested URLs back to that database (for example with a `nextCrawl` date field, or a `crawled` flag initialized to false), and runs gocrawl indefinitely, limiting the number of seeds per host sent in each cycle.
```go
package main

import (
	"log"
	"time"

	"github.com/PuerkitoBio/gocrawl"
)

const (
	// You probably want to limit the number of hits you will continually make on
	// the websites, so this allows some throttling. Adjust as required.
	SeedsLimitPerSource = 100
	ForeverLoopDelay    = 10 * time.Minute
)

type customExtender struct {
	*gocrawl.DefaultExtender
	// Possibly some additional fields, as required
}

// Omitted: overridden Extender methods, as required

var (
	ext = &customExtender{
		new(gocrawl.DefaultExtender),
	}
	crawler = gocrawl.NewCrawler(ext)
)

func main() {
	// Omitted: Open connection to the database, defer the close

	// Adjust the options as required, for example:
	crawler.Options.LogFlags = gocrawl.LogError
	loopForever()
}

func loopForever() {
	for {
		delay := time.After(ForeverLoopDelay)
		seeds := getNextSeeds()
		err := crawler.Run(seeds)
		if err != nil {
			log.Print("error crawling URLs: ", err)
		}
		<-delay
	}
}

func getNextSeeds() gocrawl.S {
	ret := make(gocrawl.S)
	/*
		Omitted: implementation of looping over a range of hosts, querying
		SeedsLimitPerSource number of URLs for each one. The use of gocrawl.S
		allows specifying each URL as an entry in a map[string]interface{} where
		the key is the URL, and the value is some state data associated with the URL
		(e.g. the ID of the URL in the database, or the whole struct representing
		the URL, whatever).
	*/
	return ret
}
```