crawler

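Crawler module for a graduation project. It currently includes a 琉璃神社 (shenshe) crawler and a Sina Weibo crawler; later work will focus on gaming news.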

How to stay in sync with the team project (important)

On first setup, link your local repository to the team repository:

git remote add upstream https://github.com/ghost-of-fantasy/crawler.git

Run these commands before and after each work session to stay in sync with the team project:

git fetch upstream
git merge upstream/master

Install and run a standalone test

Install the dependencies

pip install --upgrade pip

pip install -r requirements.txt

Try running the program

scrapy crawl shenshe

Crawler design

News object

Key	Value
website	Name of the website
url	Link to the article
title	Article title
content	Article body
category	Article category
publish_time	Publication time
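
The fields above map naturally onto a small record type. The following is a minimal sketch, not the project's actual item definition: the class name `NewsItem`, the sample values, and the choice of a plain dataclass are assumptions (Scrapy 2.2 and later accept dataclass instances as items, but the repository may well use `scrapy.Item` instead).

```python
from dataclasses import dataclass, asdict


@dataclass
class NewsItem:
    """Hypothetical record mirroring the news object table."""
    website: str       # name of the source website
    url: str           # link to the article
    title: str         # article title
    content: str       # article body
    category: str      # article category
    publish_time: str  # publication time, kept as the raw string from the page


# Placeholder values for illustration only.
item = NewsItem(
    website="example-site",
    url="https://example.com/post/1",
    title="Sample title",
    content="Sample body",
    category="news",
    publish_time="2019-01-01 12:00",
)
record = asdict(item)  # plain dict, ready for a pipeline or JSON export
```

Keeping `publish_time` as a string avoids guessing the date format each site uses; parsing can happen later in a pipeline.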

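Sina Weibo crawler design

For a site like Sina Weibo, the more accounts, the better.

First crawl a user's profile information.
Then add the users that this person follows to the to-crawl queue.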
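
The two steps above amount to a breadth-first walk over the follow graph. Here is a minimal sketch of that idea, assuming a hypothetical `get_following` callable that returns the IDs a user follows; the function name `crawl_order`, the `max_users` limit, and the toy graph are illustrative and not taken from the project.

```python
from collections import deque


def crawl_order(seed_id, get_following, max_users=100):
    """Breadth-first walk over the follow graph, starting from seed_id.

    get_following is a hypothetical callable:
    user_id -> iterable of followed user IDs.
    """
    queue = deque([seed_id])
    seen = {seed_id}  # prevents re-queuing users already scheduled
    order = []
    while queue and len(order) < max_users:
        user_id = queue.popleft()
        order.append(user_id)  # a real spider would crawl the profile here
        for follow_id in get_following(user_id):
            if follow_id not in seen:
                seen.add(follow_id)
                queue.append(follow_id)
    return order


# Toy follow graph standing in for real Weibo data.
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl_order("a", lambda uid: graph.get(uid, []))
```

The `seen` set matters because follow relationships are often mutual; without it the queue would cycle between users who follow each other.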

Weibo user (stored in a Redis List)

Key	Value
user_id	User ID
nickname	User nickname
weibo_num	Number of posts
following	Number of accounts followed
followers	Number of followers
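
One plausible way to put such a record into a Redis List is to serialize it as JSON and append it with `RPUSH`. The sketch below is an assumption about the storage format, not the project's code: the key name `weibo:users` and the helper `push_user` are made up, and `FakeRedis` is an in-memory stand-in so the example runs without a server. A real `redis.Redis` client exposes the same `rpush(name, *values)` call.

```python
import json


def push_user(client, user, key="weibo:users"):
    """Append one user record to a Redis List as a JSON string.

    client is expected to behave like redis.Redis (only rpush is used);
    the key name "weibo:users" is an assumption for this sketch.
    """
    payload = json.dumps(user, ensure_ascii=False)  # keep Chinese nicknames readable
    return client.rpush(key, payload)


class FakeRedis:
    """In-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self):
        self.lists = {}

    def rpush(self, key, *values):
        self.lists.setdefault(key, []).extend(values)
        return len(self.lists[key])  # like Redis, return the new list length


client = FakeRedis()
length = push_user(client, {
    "user_id": "1234567890",
    "nickname": "example",
    "weibo_num": 10,
    "following": 20,
    "followers": 30,
})
```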

Relationship network (stored in a Redis List)

Key	Value
user_id	User ID
follow_id	ID of a user that this user follows
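
Read this way, each record is a single directed edge, so a user who follows N accounts produces N records. A small sketch of that expansion, with a hypothetical helper name `follow_edges` and made-up IDs:

```python
def follow_edges(user_id, follow_ids):
    """Expand one user's follow list into one record per followed user."""
    return [{"user_id": user_id, "follow_id": fid} for fid in follow_ids]


edges = follow_edges("1001", ["2001", "2002"])
```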

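Weibo content (stored in a Redis List)

Key	Value
user_id	User ID
weibo_content	The user's Weibo posts
weibo_place	Location the post was published from
publish_time	Time the post was published
up_num	Number of likes the post received
retweet_num	Number of reposts the post received
comment_num	Number of comments the post received
publish_tool	Tool (client) used to publish the post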
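
On the consuming side, a worker could pop records off the list and decode them. The sketch below assumes posts are stored as JSON strings under a hypothetical key `weibo:posts` and that the counters may arrive as strings scraped from the page; neither assumption comes from the project. `FakeRedis` is again a stand-in for a `redis.Redis` client, whose `lpop` returns `None` when the list is empty.

```python
import json


def pop_post(client, key="weibo:posts"):
    """Take the oldest post off the list and decode it; None if the list is empty."""
    raw = client.lpop(key)
    if raw is None:
        return None
    post = json.loads(raw)
    # Counters may arrive as strings from the page; normalise them to int.
    for field in ("up_num", "retweet_num", "comment_num"):
        post[field] = int(post.get(field, 0))
    return post


class FakeRedis:
    """In-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self, lists):
        self.lists = lists

    def lpop(self, key):
        items = self.lists.get(key, [])
        return items.pop(0) if items else None


# Placeholder post for illustration only.
raw = json.dumps({
    "user_id": "1234567890",
    "weibo_content": "hello",
    "weibo_place": "",
    "publish_time": "2019-01-01 12:00",
    "up_num": "12",
    "retweet_num": "3",
    "comment_num": "0",
    "publish_tool": "web",
})
client = FakeRedis({"weibo:posts": [raw]})
post = pop_post(client)
```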

Websites to crawl

Packaging command

$ cd ..
$ tar -czvf crawler.tar.gz  --exclude=crawler/venv --exclude=crawler/media --exclude=crawler/.git crawler
