- Prepare cookies: configure the cookie values and adjust the Redis settings.
- Enter the directory:
cd tutorial/tutorial
- Crawl a board's thread list (board ids: 624 = nostalgia board, 510428 = exploration board):
scrapy crawl quotes -a fid=510428
- Crawl thread comments:
scrapy crawl spider2 -a list_redis_key=quotes:20240212
list_redis_key is the Redis key of the thread list; the default key is quotes:&lt;today's date&gt;.
- Analyze the data:
cd tutorial
python app.py
Results are written to the data directory.
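The default Redis key is "quotes:" followed by the current date. A minimal sketch of how that default key could be derived (the function name is hypothetical, not part of the project):

```python
from datetime import date

def default_list_key(prefix: str = "quotes") -> str:
    # Hypothetical helper: builds the default thread-list key,
    # e.g. "quotes:20240212" for February 12, 2024.
    return f"{prefix}:{date.today().strftime('%Y%m%d')}"
```

Passing list_redis_key explicitly, as in the spider2 command above, overrides this default.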
conda create --name ngaspider python=3.10.9
conda activate ngaspider
pip install -r requirements.txt
# Check the pip configuration; the Tsinghua PyPI mirror is used:
pip config list -v
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
<!-- Crawler -->
pip install Scrapy
<!-- Documentation -->
https://docs.scrapy.org/en/latest/intro/tutorial.html
scrapy startproject tutorial
<!-- Run the spider -->
cd tutorial
scrapy crawl quotes
# Debugging
scrapy shell
# thread list page
url = 'https://bbs.nga.cn/thread.php?fid=510428&order_by=postdatedesc&page=1'
# thread detail page
url = "https://bbs.nga.cn/read.php?tid=39244916&page=1"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
# replace with your own NGA cookie values
cookies = {"key_1": "value_1", "key_2": "value_2", "key_3": "value_3"}
req = scrapy.Request(url, cookies=cookies, headers=headers)
fetch(req)
view(response)
response.css(".topicrow").get()
# page title
response.css("title::text").get()
# thread titles
response.css(".topic::text").getall()
# thread links
response.css(".topic::attr(href)").getall()
# thread authors
response.css(".author::text").getall()
# thread author ids
response.css(".author::attr(title)").getall()
# post dates
response.css("span.postdate::text").getall()
# last repliers
response.css(".replyer::text").getall()
# last reply date
response.css(".replydate").get()
# reply counts
response.css(".replies::text").getall()
# next-page link
response.xpath('//a[@title="下一页"]').xpath('./@href').get()
Done: word frequency statistics, word cloud image.
TODO: reply-time analysis (e.g. which hours replies cluster in); comment-count leaderboard (who talks the most); sentiment analysis; overall dataset size statistics.
The goal is to identify which accounts are spammers or shills.
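The word-frequency count and the comment-count leaderboard could be sketched with collections.Counter. The record layout below is an assumption about the scraped data, not the project's actual schema:

```python
from collections import Counter

# Hypothetical comment records; real data comes from Redis / the data directory.
comments = [
    {"author": "user_a", "text": "good good good"},
    {"author": "user_b", "text": "good point"},
    {"author": "user_a", "text": "more words here"},
]

# Word frequency across all comments (input for the word cloud)
word_freq = Counter(w for c in comments for w in c["text"].split())

# Comment-count leaderboard: who talks the most
leaderboard = Counter(c["author"] for c in comments).most_common()
```

For real Chinese text a tokenizer such as jieba would replace the naive split(); high word counts and top leaderboard positions are the starting signal for spotting spammer/shill accounts.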
conda info --envs
/Users/ft521/anaconda3/envs/ngaspider
/Users/ft521/anaconda3/envs/ngaspider/lib/python3.10/site-packages/scrapyd/default_scrapyd.conf
~/.scrapyd.conf
cp /Users/ft521/anaconda3/envs/ngaspider/lib/python3.10/site-packages/scrapyd/default_scrapyd.conf ~/.scrapyd.conf
Edit ~/.scrapyd.conf to add a username and password.
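The auth portion of ~/.scrapyd.conf might look like the fragment below. The values are placeholders matching the curl example further down; the rest of the copied default config stays as-is:

```ini
[scrapyd]
bind_address = 127.0.0.1
http_port    = 6800
username     = admin
password     = 123456
```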
cd tutorial
scrapyd-deploy -p tutorial
Packing version 1707725550
Deploying to project "tutorial" in http://127.0.0.1:6800/addversion.json
Server response (200):
{"node_name": "ft521deMacBook-Pro.local", "status": "ok", "project": "tutorial", "version": "1707725550", "spiders": 2}
curl -u admin:123456 http://127.0.0.1:6800/schedule.json -d project=tutorial -d spider=quotes
Package the release egg: scrapyd-deploy --build-egg tutorial.egg
Visualization: scrapydweb
docker-compose deployment is not recommended: in principle the scrapyd instances must be started first and scrapydweb afterwards, and a compose deployment causes access problems between them.
docker run reference: https://github.com/libaibuaidufu/scrapyd_web_log
docker build -t scrapyd_logparser:v1 .
docker run -d -p 6800:6800 --name scrapyd_1 scrapyd_logparser:v1
docker build -t scrapydweb:v1 .
docker run -d -p 5000:5000 --name scrapydweb scrapydweb:v1
docker restart scrapydweb
# Image downloads (already pushed to Docker Hub)
docker tag scrapydweb:v1 doudouchidou/scrapydweb:v1
docker push doudouchidou/scrapydweb:v1
docker tag scrapyd_logparser:v1 doudouchidou/scrapyd_logparser:v1
docker push doudouchidou/scrapyd_logparser:v1
docker pull doudouchidou/scrapydweb:v1
docker pull doudouchidou/scrapyd_logparser:v1