predator / 掠食者

⚠️ This project has moved to https://github.com/go-predator/predator

This repository is no longer maintained.

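A high-performance crawler framework built on fasthttp.

Usage

The example below covers essentially all of the features completed so far; the comments explain how to use them.

1 Create a Crawler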

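package main

import "github.com/thep0y/predator"

func main() {
	// cookie and ip are placeholders; substitute your own values.
	cookie := "your-session-id"
	ip := "http://127.0.0.1:8080"

	crawler := predator.NewCrawler(
		predator.WithUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0"),
		predator.WithCookies(map[string]string{"JSESSIONID": cookie}),
		predator.WithProxy(ip), // or use a proxy pool: predator.WithProxyPool([]string)
	)
	_ = crawler // used in the sections that follow
}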

Several options are available when creating a Crawler to extend its behavior. For the full list, see predator/options.go.

2 Send a GET request

crawler.Get("http://www.baidu.com")

The handling of requests and responses is modeled on colly, whose approach I find very comfortable to work with.

// BeforeRequest lets you patch the request before it is sent
crawler.BeforeRequest(func(r *predator.Request) {
	headers := map[string]string{
		"Accept":           "*/*",
		"Accept-Language":  "zh-CN",
		"Accept-Encoding":  "gzip, deflate",
		"X-Requested-With": "XMLHttpRequest",
		"Origin":           "http://example.com",
	}

	r.SetHeaders(headers)

	// Pass values from the request to the response through the context; see the context section below
	r.Ctx.Put("id", 10)
	r.Ctx.Put("name", "tom")
})

crawler.AfterResponse(func(r *predator.Response) {
	// Read the values that were put into the context when the request was sent
	id := r.Ctx.GetAny("id").(int)
	name := r.Ctx.Get("name")

	// For JSON responses, gjson is recommended
	body := gjson.ParseBytes(r.Body)
	amount := body.Get("amount").Int()
	types := body.Get("types").Array()

	fmt.Println(id, name, amount, types)
})

// The request call must come after BeforeRequest and AfterResponse
crawler.Get("http://www.baidu.com")
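
To make the ordering rule above concrete, here is a minimal standard-library sketch of a colly-style callback pattern. It is not predator's implementation: `miniCrawler`, `request`, `response`, and the injected `fetch` function are all hypothetical names chosen for illustration. It shows why a request only sees the callbacks that were registered before it was issued, and how a value placed in the context travels from the request to the response.

```go
package main

import "fmt"

// request and response are hypothetical stand-ins for predator's types.
type request struct {
	URL     string
	Headers map[string]string
	Ctx     map[string]interface{}
}

type response struct {
	Body []byte
	Ctx  map[string]interface{}
}

// miniCrawler stores callbacks and runs them around each request.
type miniCrawler struct {
	before []func(*request)
	after  []func(*response)
	// fetch is injected so the sketch runs without network access.
	fetch func(*request) []byte
}

func (c *miniCrawler) BeforeRequest(f func(*request)) { c.before = append(c.before, f) }

func (c *miniCrawler) AfterResponse(f func(*response)) { c.after = append(c.after, f) }

// Get runs only the callbacks that were registered before it was called.
func (c *miniCrawler) Get(url string) {
	req := &request{URL: url, Headers: map[string]string{}, Ctx: map[string]interface{}{}}
	for _, f := range c.before {
		f(req)
	}
	// The response shares the request's context.
	resp := &response{Body: c.fetch(req), Ctx: req.Ctx}
	for _, f := range c.after {
		f(resp)
	}
}

// run registers both hooks and then issues one request.
func run() (id int, body string) {
	c := &miniCrawler{fetch: func(r *request) []byte {
		return []byte("accept=" + r.Headers["Accept"])
	}}
	c.BeforeRequest(func(r *request) {
		r.Headers["Accept"] = "*/*"
		r.Ctx["id"] = 10
	})
	c.AfterResponse(func(r *response) {
		id = r.Ctx["id"].(int)
		body = string(r.Body)
	})
	c.Get("http://example.com")
	return id, body
}

func main() {
	id, body := run()
	fmt.Println(id, body)
}
```

If `c.Get` were called before the two registrations, both callback slices would still be empty and nothing would be patched or handled.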

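3 Send a POST request

Unlike GET requests, each POST request usually carries different parameters, and those parameters live in the request body. Re-parsing the request body inside BeforeRequest to recover the key parameters is possible, but it is far from the best option. For that reason, a context can be passed in directly when constructing a POST request, which carries the information through to the response handler.

3.1 Ordinary POST form (application/x-www-form-urlencoded)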

// BeforeRequest lets you patch the request before it is sent
crawler.BeforeRequest(func(r *predator.Request) {
	headers := map[string]string{
		"Accept":           "*/*",
		"Accept-Language":  "zh-CN",
		"Accept-Encoding":  "gzip, deflate",
		"X-Requested-With": "XMLHttpRequest",
		"Origin":           "http://example.com",
	}

	r.SetHeaders(headers)
})

crawler.AfterResponse(func(r *predator.Response) {
	// Read the values that were put into the context when the request was sent
	id := r.Ctx.GetAny("id").(int)
	name := r.Ctx.Get("name")

	// For JSON responses, gjson is recommended
	body := gjson.ParseBytes(r.Body)
	amount := body.Get("amount").Int()
	types := body.Get("types").Array()

	fmt.Println(id, name, amount, types)
})

body := map[string]string{"foo": "bar"}

// For POST requests, put the key parameters into the context this way
ctx, _ := context.AcquireCtx()
ctx.Put("id", 10)
ctx.Put("name", "tom")

crawler.Post("http://www.baidu.com", body, ctx)

If you do not need to pass a context, use nil instead:

crawler.Post("http://www.baidu.com", body, nil)

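3.2 Complex POST request (multipart/form-data)

multipart/form-data requests require the dedicated PostMultipart method. The example is fairly long, so it is not reproduced here.

For usage, see the example: https://github.com/thep0y/predator/blob/main/example/multipart/main.go

3.3 JSON request

JSON requests also have a dedicated method, PostJSON. PostJSON automatically adds Content-Type: application/json to the request headers, so there is no need to add it again. If you do set Content-Type yourself, the value you set is the one that is ultimately used.

Example: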

package main

import (
	"fmt"

	"github.com/thep0y/predator"
)

func main() {
	c := predator.NewCrawler()

	c.AfterResponse(func(r *predator.Response) {
		// Print the response
		fmt.Println(r)
	})

	type User struct {
		Name string `json:"name"`
		Age  int    `json:"age"`
	}

	body := map[string]interface{}{
		"time": 156546535,
		"cid":  "10_18772100220-1625540144276-302919",
		"args": []int{1, 2, 3, 4, 5},
		"dict": map[string]string{
			"mod": "1592215036_002", "extend1": "关注", "t": "1628346994", "eleTop": "778",
		},
		"user": User{"Tom", 13},
	}

	c.PostJSON("https://httpbin.org/post", body, nil)
}
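
The Content-Type behavior described for PostJSON (a default that a user-supplied value replaces) can be sketched with the standard library's `http.Header`. This is only an illustration of the precedence rule under that assumption: predator is built on fasthttp, and `jsonHeaders` is a hypothetical helper, not part of its API.

```go
package main

import (
	"fmt"
	"net/http"
)

// jsonHeaders applies the default Content-Type first and then lets
// caller-supplied headers override it. The function name is hypothetical.
func jsonHeaders(user map[string]string) http.Header {
	h := http.Header{}
	h.Set("Content-Type", "application/json")
	for k, v := range user {
		h.Set(k, v) // Set replaces any existing value for the key
	}
	return h
}

func main() {
	fmt.Println(jsonHeaders(nil).Get("Content-Type"))
	fmt.Println(jsonHeaders(map[string]string{
		"Content-Type": "application/json; charset=utf-8",
	}).Get("Content-Type"))
}
```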

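3.4 Other POST requests

Although the three methods above cover the requests of most websites, a small number of sites have special requirements. For those, use the PostRaw method: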

func (c *Crawler) PostRaw(URL string, body []byte, ctx pctx.Context) error

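You construct the request body yourself. The original body can take any form; once it is built, serialize it to []byte and pass that as the request body.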
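
As one example of building such a body, the sketch below serializes a struct to XML with the standard library and returns the resulting []byte. The `query` type, its fields, and the `rawBody` helper are hypothetical; any other format would work the same way, as long as the final result is a byte slice.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// query is a hypothetical payload for a site that expects an XML body.
type query struct {
	XMLName xml.Name `xml:"query"`
	ID      int      `xml:"id"`
	Name    string   `xml:"name"`
}

// rawBody serializes the payload into the []byte that a raw POST needs.
func rawBody(id int, name string) ([]byte, error) {
	return xml.Marshal(query{ID: id, Name: name})
}

func main() {
	body, err := rawBody(10, "tom")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(body))
}
```

The resulting slice is what would be handed to PostRaw as its body argument.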

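4 Allow redirects

For the sake of crawler efficiency, redirects are not allowed by default.

Redirects are nevertheless hard to avoid in real crawling work, so you can set a different maximum number of redirects for each request according to its circumstances.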

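crawler.BeforeRequest(func(r *predator.Request) {
	// For GET requests you can decide based on r.URL; for POST requests, based on the request body. This is only an example.
	// caseOne and caseTwo are placeholders for your own conditions.
	if r.URL == caseOne {
		// Allow 1 redirect
		r.AllowRedirect(1)
	} else if r.URL == caseTwo {
		// Allow 3 redirects
		r.AllowRedirect(3)
	}
})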

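A global redirect setting is not supported; redirects can only be configured per request.

If there is strong demand for a global redirect setting, adding it will be considered.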
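
For readers who want to see what a per-request redirect cap does, here is a sketch using the standard library's net/http client and a local test server. This is a swapped-in technique for illustration only: predator is built on fasthttp, and `getWithMaxRedirects` and `demo` are hypothetical helpers, not part of its API.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// getWithMaxRedirects follows at most limit redirects and returns the
// status code of the last response it received.
func getWithMaxRedirects(url string, limit int) (int, error) {
	client := &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			// via holds the requests made so far, so len(via) is the
			// number of redirects that would have been followed.
			if len(via) > limit {
				return http.ErrUseLastResponse
			}
			return nil
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// demo starts a local server where /a redirects to /b and /b to /final.
func demo() (blocked, followed int) {
	mux := http.NewServeMux()
	mux.HandleFunc("/a", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/b", http.StatusFound)
	})
	mux.HandleFunc("/b", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/final", http.StatusFound)
	})
	mux.HandleFunc("/final", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "done")
	})
	ts := httptest.NewServer(mux)
	defer ts.Close()

	blocked, _ = getWithMaxRedirects(ts.URL+"/a", 0)  // redirects disallowed
	followed, _ = getWithMaxRedirects(ts.URL+"/a", 2) // both hops allowed
	return blocked, followed
}

func main() {
	blocked, followed := demo()
	fmt.Println(blocked, followed)
}
```

With a limit of 0 the caller receives the redirect response itself; with a limit of 2 it reaches the final page.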

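5 Context

The context is an interface, and I have implemented two kinds of context:

ReadOp: based on sync.Map, suited to scenarios where the context is mostly read
WriteOp: implemented with a map, suited to scenarios where reads and writes are roughly balanced or writes outnumber reads; this is the default context

When reads far outnumber writes in a crawler, switch to ReadOp, as the following code shows:

ctx, err := context.AcquireCtx(context.ReadOp)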
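
The trade-off between the two context kinds can be sketched with the standard library alone. The code below is not predator's implementation: the `store` interface, `readHeavy`, `writeHeavy`, and `newStore` are hypothetical names, and guarding the plain map with a mutex is an assumption made so that the sketch is safe under concurrency.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical, reduced version of a context interface.
type store interface {
	Put(key string, value interface{})
	GetAny(key string) interface{}
}

// readHeavy uses sync.Map, which is optimized for keys that are
// written once and read many times.
type readHeavy struct{ m sync.Map }

func (c *readHeavy) Put(key string, value interface{}) { c.m.Store(key, value) }

func (c *readHeavy) GetAny(key string) interface{} {
	v, _ := c.m.Load(key)
	return v
}

// writeHeavy uses a plain map behind a mutex (the mutex is an assumption).
type writeHeavy struct {
	mu sync.Mutex
	m  map[string]interface{}
}

func (c *writeHeavy) Put(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key] = value
}

func (c *writeHeavy) GetAny(key string) interface{} {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.m[key]
}

// newStore picks the implementation, defaulting to the map-based one.
func newStore(readOptimized bool) store {
	if readOptimized {
		return &readHeavy{}
	}
	return &writeHeavy{m: map[string]interface{}{}}
}

func main() {
	s := newStore(true)
	s.Put("id", 10)
	fmt.Println(s.GetAny("id"))
}
```

Because both types satisfy the same interface, calling code does not change when the implementation is swapped.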

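6 Processing HTML

Crawler results fall broadly into two kinds: HTML responses and JSON responses.

Compared with JSON, HTML takes more code to process.

This framework wraps HTML processing in a set of functions that make it easy to find elements by CSS selector and to extract their attributes, text, and so on.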

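crawl := predator.NewCrawler()

crawl.ParseHTML("body", func(he *html.HTMLElement) {
	// Inner HTML of the element
	inner, err := he.InnerHTML()
	// Outer HTML of the element (the element itself included)
	outer, err := he.OuterHTML()
	_, _, _ = inner, outer, err
	// Text inside the element (including the text of child elements)
	he.Text()
	// An attribute of the element
	he.Attr("class")
	// The first matching child element
	he.FirstChild("p")
	// The last matching child element
	he.LastChild("p")
	// The 2nd matching child element
	he.Child("p", 2)
	// An attribute of the first matching child element
	he.ChildAttr("p", "class")
	// A slice of the attribute values of all matching child elements
	he.ChildrenAttr("p", "class")
})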

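7 Asynchronous / multi-goroutine requests

c := predator.NewCrawler(
	// With this option, requests are sent through a goroutine pool of the given size; without it, requests are sent synchronously by default
	// The size should be neither too small nor too large; test the efficiency of different sizes yourself
	predator.WithConcurrency(30),
)

c.AfterResponse(func(r *predator.Response) {
	// handle response
})

// ts is assumed to be a test server (such as an httptest.Server); replace ts.URL with your target
for i := 0; i < 10; i++ {
	c.Post(ts.URL+"/post", map[string]string{
		"id": fmt.Sprint(i + 1),
	}, nil)
}

c.Wait()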
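
The role of the final Wait call can be shown with a small bounded goroutine pool built from the standard library. This is a sketch of the general technique, not predator's pool: `pool`, `newPool`, `Go`, and `sumIDs` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// pool runs tasks on at most cap(sem) goroutines at a time.
type pool struct {
	wg  sync.WaitGroup
	sem chan struct{}
}

func newPool(size int) *pool { return &pool{sem: make(chan struct{}, size)} }

// Go schedules a task, blocking the caller while the pool is full.
func (p *pool) Go(task func()) {
	p.wg.Add(1)
	p.sem <- struct{}{}
	go func() {
		defer p.wg.Done()
		defer func() { <-p.sem }()
		task()
	}()
}

// Wait blocks until every scheduled task has finished.
func (p *pool) Wait() { p.wg.Wait() }

// sumIDs simulates n requests with ids 1..n and adds the ids up.
func sumIDs(n, workers int) int {
	p := newPool(workers)
	var mu sync.Mutex
	total := 0
	for i := 0; i < n; i++ {
		id := i + 1
		p.Go(func() {
			mu.Lock()
			total += id
			mu.Unlock()
		})
	}
	p.Wait() // without this, the function could return before tasks finish
	return total
}

func main() {
	fmt.Println(sumIDs(10, 3))
}
```

Omitting the Wait call would let the program move on, or exit, while some tasks are still in flight.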

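8 Using the cache

By default, the cache is disabled and every request is sent directly.

Caches already implemented:

MySQL
PostgreSQL
Redis
SQLite3

The cache interface has a method, Compressed(yes bool), for compressing responses. Responses are sometimes very long, and saving them to the database uncompressed hurts insert and query performance.

Usage examples for these four caches: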

// Each snippet below is an independent alternative; use only one.

// MySQL
c := predator.NewCrawler(
	predator.WithCache(&cache.MySQLCache{
		Host:     "127.0.0.1",
		Port:     "3306",
		Database: "predator",
		Username: "root",
		Password: "123456",
	}, false), // false disables compression, true enables it; the same applies below
)

// PostgreSQL
c := predator.NewCrawler(
	predator.WithCache(&cache.PostgreSQLCache{
		Host:     "127.0.0.1",
		Port:     "54322",
		Database: "predator",
		Username: "postgres",
		Password: "123456",
	}, false),
)

// Redis
c := predator.NewCrawler(
	predator.WithCache(&cache.RedisCache{
		Addr: "localhost:6379",
	}, true),
)

// SQLite3
c := predator.NewCrawler(
	predator.WithCache(&cache.SQLiteCache{
		URI: uri, // uri is where the database file is stored; preferably give it the .sqlite suffix
	}, true),
)
// You can also use the default. When the first argument to WithCache is nil,
// SQLite is used as the cache and the cache is saved to
// predator-cache.sqlite in the current directory.
c := predator.NewCrawler(predator.WithCache(nil, true))
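
To illustrate why compressing long responses before caching them can help, the sketch below round-trips a repetitive JSON body through gzip using only the standard library. Which compression algorithm predator uses is not stated here, so gzip is an assumption, and `compress`, `decompress`, and `sample` are hypothetical helpers.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// compress gzips a response body before it is stored.
func compress(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	// Close flushes the remaining data and writes the gzip footer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompress restores the original body when it is read from the cache.
func decompress(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// sample builds a long, repetitive body similar to a large JSON list.
func sample() []byte {
	return bytes.Repeat([]byte(`{"k":"v"},`), 1000)
}

func main() {
	orig := sample()
	packed, err := compress(orig)
	if err != nil {
		fmt.Println("compress failed:", err)
		return
	}
	fmt.Println(len(orig), len(packed))
}
```

The cost is extra CPU work on every cache write and read, which is why the option is left to the caller.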

9 Proxy

HTTP proxies and SOCKS5 proxies are supported.

When using a proxy, include the protocol, for example:

WithProxyPool([]string{"http://ip:port", "socks5://ip:port"})
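
Since the protocol prefix is required, it can be useful to validate proxy addresses before handing them to the crawler. The following standard-library sketch does that; `proxyScheme` is a hypothetical helper and does not reflect how predator itself parses proxy addresses.

```go
package main

import (
	"fmt"
	"net/url"
)

// proxyScheme returns the scheme of a proxy address, or an error when
// the scheme is missing or is neither http nor socks5.
func proxyScheme(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	switch u.Scheme {
	case "http", "socks5":
		if u.Host == "" {
			return "", fmt.Errorf("missing host in proxy address %q", raw)
		}
		return u.Scheme, nil
	}
	return "", fmt.Errorf("unsupported or missing proxy scheme in %q", raw)
}

func main() {
	for _, p := range []string{"http://127.0.0.1:8080", "socks5://127.0.0.1:1080", "127.0.0.1:8080"} {
		scheme, err := proxyScheme(p)
		fmt.Println(p, scheme, err)
	}
}
```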

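10 Logging

Logging uses the popular logging library zerolog.

By default, logging is disabled and must be enabled manually.

The WithLogger option takes one argument, a *predator.LogOp. When nil is passed, logs are pretty-printed to the terminal at INFO level by default.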

	crawler := predator.NewCrawler(
		predator.WithLogger(nil),
	)

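predator.LogOp exposes four public methods:

SetLevel: sets the log level. Available levels: DEBUG, INFO, WARNING, ERROR, FATAL
```
logOp := new(predator.LogOp)
// Set the level to INFO
logOp.SetLevel(log.INFO)
```
ToConsole: pretty-prints to the terminal.
ToFile: writes to a file in JSON format.
ToConsoleAndFile: pretty-prints to the terminal and also writes to a file in JSON format.

Complete logging example: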

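package main

import (
	"github.com/thep0y/predator"
	"github.com/thep0y/predator/log"
)

func main() {
	logOp := new(predator.LogOp)
	logOp.SetLevel(log.INFO)
	logOp.ToConsoleAndFile("test.log")

	crawler := predator.NewCrawler(
		predator.WithLogger(logOp),
	)
	_ = crawler // register callbacks and send requests from here
}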

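11 About JSON

I originally planned to wrap a JSON package for quickly processing JSON responses, but after a day or two of thinking I could not come up with a good approach, because everything I could think of is already solved by gjson.

For JSON responses, if gjson can handle the job, do not reach for deserialization. For a crawler, deserialization is an unwise choice.

Of course, if you really do need deserialization, do not use the standard library either; the serialization and deserialization methods in the bundled JSON package perform better than the standard library.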

import "github.com/thep0y/predator/json"

json.Marshal()
json.Unmarshal()
json.UnmarshalFromString()

For handling JSON responses, this is enough for now.
