Fast and Elegant Scraping Framework for Gophers

Colly provides a clean interface to write any kind of crawler/scraper/spider.
With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.
Install

    go get -u github.com/gocolly/colly/...
Getting started

    import "github.com/gocolly/colly"
Collector

The Collector is Colly's main entity. It manages network communication and is responsible for executing the attached callbacks while a collector job is running. To work with Colly, you first have to initialize a Collector:

    c := colly.NewCollector()
Callbacks

Attach different types of callback functions to a Collector to control a collecting job or retrieve information.
Add callbacks to a Collector

    // Called before a request is sent.
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    // Called if an error occurs during the request.
    c.OnError(func(_ *colly.Response, err error) {
        log.Println("Something went wrong:", err)
    })

    // Called after a response is received.
    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Visited", r.Request.URL)
    })

    // Called for every element matching the CSS selector: follow each link.
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.OnHTML("tr td:nth-of-type(1)", func(e *colly.HTMLElement) {
        fmt.Println("First column of a table row:", e.Text)
    })

    // Called for every element matching the XPath expression.
    c.OnXML("//h1", func(e *colly.XMLElement) {
        fmt.Println(e.Text)
    })

    // Called after all other callbacks for a page have run.
    c.OnScraped(func(r *colly.Response) {
        fmt.Println("Finished", r.Request.URL)
    })
Hands-on: Douban

    package main

    import (
        "fmt"
        "regexp"
        "strings"
        "time"

        "github.com/PuerkitoBio/goquery"
        "github.com/gocolly/colly"
        "github.com/gocolly/colly/extensions"
    )

    func main() {
        t := time.Now()
        // number is shared by all callbacks. In async mode callbacks may run
        // concurrently, so this counter is unsynchronized; guard it with a
        // sync.Mutex if exact numbering matters.
        number := 1

        c := colly.NewCollector(func(c *colly.Collector) {
            // Send a random User-Agent with each request.
            extensions.RandomUserAgent(c)
            // Fetch pages asynchronously.
            c.Async = true
        },
            // Only visit the Top 250 list pages.
            colly.URLFilters(
                regexp.MustCompile("^(https://movie\\.douban\\.com/top250)\\?start=[0-9].*&filter="),
            ),
        )

        // Follow every link; URLs rejected by the filter are not fetched.
        c.OnHTML("a[href]", func(e *colly.HTMLElement) {
            link := e.Attr("href")
            c.Visit(e.Request.AbsoluteURL(link))
        })

        // Extract title, director line, and quote from each movie entry.
        c.OnHTML("div.info", func(e *colly.HTMLElement) {
            e.DOM.Each(func(i int, selection *goquery.Selection) {
                movies := selection.Find("span.title").First().Text()
                director := strings.Join(strings.Fields(selection.Find("div.bd p").First().Text()), " ")
                quote := selection.Find("p.quote span.inq").Text()
                fmt.Printf("%d --> %s:%s %s\n", number, movies, director, quote)
                number += 1
            })
        })

        c.OnError(func(response *colly.Response, err error) {
            fmt.Println(err)
        })

        c.Visit("https://movie.douban.com/top250?start=0&filter=")
        // Block until all asynchronous requests have finished.
        c.Wait()
        fmt.Printf("Elapsed time: %s", time.Since(t))
    }
Step 1: Create the collector

    c := colly.NewCollector(func(c *colly.Collector) {
        // Send a random User-Agent with each request.
        extensions.RandomUserAgent(c)
        // Fetch pages asynchronously.
        c.Async = true
    },
        // Only visit the Top 250 list pages.
        colly.URLFilters(
            regexp.MustCompile("^(https://movie\\.douban\\.com/top250)\\?start=[0-9].*&filter="),
        ),
    )
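To see which URLs the filter in this step lets through, the pattern can be tried on its own with the standard regexp package. This is only a sketch: the helper name `allowed` is hypothetical, and it approximates the filter check by matching the pattern against the URL string rather than calling Colly itself.

```go
package main

import (
	"fmt"
	"regexp"
)

// top250Filter is the same pattern that is passed to colly.URLFilters above.
var top250Filter = regexp.MustCompile(`^(https://movie\.douban\.com/top250)\?start=[0-9].*&filter=`)

// allowed reports whether a URL matches the filter pattern.
// Hypothetical helper for illustration; it is not part of Colly.
func allowed(u string) bool {
	return top250Filter.MatchString(u)
}

func main() {
	urls := []string{
		"https://movie.douban.com/top250?start=0&filter=",
		"https://movie.douban.com/top250?start=25&filter=",
		"https://movie.douban.com/subject/1291546/",
	}
	for _, u := range urls {
		fmt.Printf("%-52s allowed=%v\n", u, allowed(u))
	}
}
```

The list pages match, while a movie detail page such as /subject/1291546/ does not, so following every `a[href]` link does not pull the crawler away from the Top 250 list.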
Step 2: HTML callbacks

    // Follow every link; URLs rejected by the filter are not fetched.
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        c.Visit(e.Request.AbsoluteURL(link))
    })

    // Extract title, director line, and quote from each movie entry.
    c.OnHTML("div.info", func(e *colly.HTMLElement) {
        e.DOM.Each(func(i int, selection *goquery.Selection) {
            movies := selection.Find("span.title").First().Text()
            // Collapse runs of whitespace in the director/cast paragraph.
            director := strings.Join(strings.Fields(selection.Find("div.bd p").First().Text()), " ")
            quote := selection.Find("p.quote span.inq").Text()
            fmt.Printf("%d --> %s:%s %s\n", number, movies, director, quote)
            // In async mode callbacks may run concurrently; this increment is unsynchronized.
            number += 1
        })
    })
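The first callback passes each href through `e.Request.AbsoluteURL` before visiting it, so relative links are resolved against the current page. The sketch below approximates that resolution with the standard net/url package; the helper `resolve` and the relative link `?start=25&filter=` are assumptions made for illustration, and Colly's own implementation may differ in details.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve turns href into an absolute URL relative to page.
// Hypothetical helper; a stdlib approximation of what AbsoluteURL is used for.
func resolve(page, href string) (string, error) {
	base, err := url.Parse(page)
	if err != nil {
		return "", err
	}
	abs, err := base.Parse(href) // resolves href relative to base
	if err != nil {
		return "", err
	}
	return abs.String(), nil
}

func main() {
	page := "https://movie.douban.com/top250?start=0&filter="
	hrefs := []string{
		"?start=25&filter=",                         // assumed relative pagination link
		"https://movie.douban.com/subject/1291546/", // already absolute
	}
	for _, href := range hrefs {
		abs, err := resolve(page, href)
		if err != nil {
			fmt.Println("cannot resolve", href, ":", err)
			continue
		}
		fmt.Println(href, "->", abs)
	}
}
```

A query-only link keeps the page's scheme, host, and path and replaces only the query string, which is what makes the resolved pagination URL match the URL filter.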
Each div.info element on the page has the following structure:

    <div class="info">
        <div class="hd">
            <a href="https://movie.douban.com/subject/1291546/" class="">
                <span class="title">霸王别姬</span>
                <span class="other">&nbsp;/&nbsp;再见,我的妾 / Farewell My Concubine</span>
            </a>
            <span class="playable">[可播放]</span>
        </div>
        <div class="bd">
            <p class="">
                导演: 陈凯歌 Kaige Chen&nbsp;主演: 张国荣 Leslie Cheung / 张丰毅 Fengyi Zha...<br>
                1993 / 中国大陆 中国香港 / 剧情 爱情 同性
            </p>
            <div class="star">
                <span class="rating5-t"></span>
                <span class="rating_num" property="v:average">9.6</span>
                <span property="v:best" content="10.0"></span>
                <span>2008403人评价</span>
            </div>
            <p class="quote">
                <span class="inq">风华绝代。</span>
            </p>
        </div>
    </div>
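The director paragraph in this markup spans several indented lines, which is why the callback wraps its text in `strings.Join(strings.Fields(...), " ")`. A minimal sketch of that normalization follows; the helper name `normalize` is hypothetical and the sample string is an abridged stand-in for the paragraph text, not the exact page content.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize collapses every run of whitespace (spaces, tabs, newlines,
// and non-breaking spaces) into a single space and trims both ends.
// Hypothetical helper wrapping the expression used in the callback.
func normalize(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	// Abridged sample: leading/trailing newlines, indentation, and a
	// non-breaking space (U+00A0, what &nbsp; decodes to).
	raw := "\n        导演: 陈凯歌 Kaige Chen\u00a0主演: 张国荣 Leslie Cheung\n        1993 / 中国大陆 中国香港 / 剧情 爱情 同性\n    "
	fmt.Printf("%q\n", normalize(raw))
}
```

`strings.Fields` splits on any Unicode whitespace, so the line break introduced by `<br>` and the surrounding indentation all end up as single spaces in the printed line.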
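Step 3: Error handling and starting the crawl

    // Print any error that occurs during a request.
    c.OnError(func(response *colly.Response, err error) {
        fmt.Println(err)
    })

    c.Visit("https://movie.douban.com/top250?start=0&filter=")
    // Block until all asynchronous requests have finished.
    c.Wait()
    fmt.Printf("Elapsed time: %s", time.Since(t))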