Java爬虫02-webmagic

Webmagic

是什么

​ WebMagic是一款简单灵活的爬虫框架,它的的架构设计参照了Scrapy,目标是尽量的模块化,并体现爬虫的功能特点。

架构设计

WebMagic的四个组件

image

1.Downloader

​ Downloader负责从互联网上下载页面,以便后续处理。WebMagic默认使用了Apache HttpClient作为下载工具。

2.PageProcessor

​ PageProcessor负责解析页面,抽取有用信息,以及发现新的链接。WebMagic使用Jsoup作为HTML解析工具,并基于其开发了解析XPath的工具Xsoup

在这四个组件中,PageProcessor对于每个站点每个页面都不一样,是需要使用者定制的部分。

3.Scheduler

​ Scheduler负责管理待抓取的URL,以及一些去重的工作。WebMagic默认提供了JDK的内存队列来管理URL,并用集合来进行去重。也支持使用Redis进行分布式管理。

除非项目有一些特殊的分布式需求,否则无需自己定制Scheduler。

4.Pipeline

​ Pipeline负责抽取结果的处理,包括计算、持久化到文件、数据库等。WebMagic默认提供了“输出到控制台”和“保存到文件”两种结果处理方案。

Pipeline定义了结果保存的方式,如果你要保存到指定数据库,则需要编写对应的Pipeline。对于一类需求一般只需编写一个Pipeline

用于数据流转的对象

1. Request

Request是对URL地址的一层封装,一个Request对应一个URL地址。

它是PageProcessor与Downloader交互的载体,也是PageProcessor控制Downloader唯一方式。

除了URL本身外,它还包含一个Key-Value结构的字段extra。你可以在extra中保存一些特殊的属性,然后在其他地方读取,以完成不同的功能。例如附加上一个页面的一些信息等。

2. Page

Page代表了从Downloader下载到的一个页面——可能是HTML,也可能是JSON或者其他文本格式的内容。

Page是WebMagic抽取过程的核心对象,它提供一些方法可供抽取、结果保存等。在第四章的例子中,我们会详细介绍它的使用。

3. ResultItems

ResultItems相当于一个Map,它保存PageProcessor处理的结果,供Pipeline使用。它的API与Map很类似,值得注意的是它有一个字段skip,若设置为true,则不应被Pipeline处理

控制爬虫运转的引擎–Spider

​ Spider是WebMagic内部流程的核心。Downloader、PageProcessor、Scheduler、Pipeline都是Spider的一个属性,这些属性是可以自由设置的,通过设置这个属性可以实现不同的功能。Spider也是WebMagic操作的入口,它封装了爬虫的创建、启动、停止、多线程等功能

缺点

​ 不适合爬取动态或者动态加载页面,而现如今大部门大厂网站都做了动态渲染或者反扒。如果需要爬取,可以结合selenium +浏览器驱动访问页面。

Selenium

是什么

​ Selenium 通过使用 WebDriver 支持市场上所有主流浏览器的自动化。 WebDriver 是一个 API 和协议,它定义了一个语言中立的接口,用于控制 web 浏览器的行为。 每个浏览器都有一个特定的 WebDriver 实现,称为驱动程序。 驱动程序是负责委派给浏览器的组件,并处理与 Selenium 和浏览器之间的通信。总的来说,selenium是一种浏览器控制框架。

怎么用

Webmagic+Selenium+Chormedriver

Boss直聘网爬取

DDL

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
create table `job_info`
(
`id` bigint(20) not null auto_increment comment '主键id',
job_id VARCHAR(100) NOT NULL DEFAULT '' COMMENT 'jobid',
work_location VARCHAR(50) NOT NULL DEFAULT '' COMMENT 'workLocation',
job_benifit VARCHAR(200) NOT NULL DEFAULT '' COMMENT 'jobBenifit',
work_experience VARCHAR(200) NOT NULL DEFAULT '' COMMENT 'workExperience',
degree_experience VARCHAR(200) NOT NULL DEFAULT '' COMMENT 'degreeExperience',
job_desc VARCHAR(2000) NOT NULL DEFAULT '' COMMENT 'jobDesc',
`company_name` varchar(100) default null comment '公司名称',
`company_addr` varchar(200) default null comment '公司联系方式',
`company_info` text comment '公司信息',
`job_name` varchar(100) default null comment '职位名称',
`job_addr` varchar(200) default null comment '工作地点',
`job_info` text comment '职位信息',
`salary_min` float(10, 2) default null comment '薪资范围,最小',
`salary_max` float(10, 2) default null comment '薪资范围,最大',
salary_month int default 12 not null comment '薪资月数',
`url` varchar(1500) default null comment '招聘信息详情页',
`time` varchar(10) default null comment '职位最近发布时间',
`company_creat_time` varchar(50) default null comment '公司成立时间',
`company_fund` varchar(50) default null comment '公司注册资本',
`boss_active_time` varchar(50) default null comment '职位最近发布时间',
primary key (`id`)
) engine = InnoDB comment ='招聘信息';

application.properties

1
2
3
4
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://localhost:3306/selenium?characterEncoding=UTF-8&useUnicode=true&useSSL=false&tinyInt1isBit=false&allowPublicKeyRetrieval=true&serverTimezone=Asia/Shanghai
spring.datasource.username=root
spring.datasource.password=123456

pom.xml

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.6.6</version>
<relativePath/>
</parent>

<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

<dependencies>
<dependency>
<groupId>cn.hutool</groupId>
<artifactId>hutool-all</artifactId>
<version>5.8.27</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>

<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-core</artifactId>
<version>0.8.0</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>us.codecraft</groupId>
<artifactId>webmagic-extension</artifactId>
<version>0.8.0</version>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
</dependency>
<dependency>
<groupId>com.baomidou</groupId>
<artifactId>mybatis-plus-boot-starter</artifactId>
<version>3.1.1</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.47</version>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
<version>4.22.0</version>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.18.22</version>
</dependency>
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-java</artifactId>
<version>3.13.0</version>
</dependency>
</dependencies><!--selenium-->

webmagic3件套

MyBossDownloader:注入谷歌浏览器驱动

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
@Component
public class MyBossDownloader implements Downloader {

private RemoteWebDriver driver;

public MyBossDownloader() {
// 创建chorme的配置信息
System.setProperty("webdriver.chrome.driver","C:\\Users\\81566\\Downloads\\chromedriver-win64\\chromedriver-win64\\chromedriver.exe");
// 创建chorme对象
ChromeOptions chromeOptions = new ChromeOptions();
//关闭谷歌驱动上的 Chrome 正在受软件测试控制的 提示
chromeOptions.setExperimentalOption("excludeSwitches", Arrays.asList("enable-automation"));
// 设置为无头模式
//chromeOptions.addArguments("--headless");
// 设置浏览器打开窗口大小
//chromeOptions.addArguments("--window-size=100,100");
// 1. 基于配置信息,创建RemoteWebDriver对象
driver = new ChromeDriver(chromeOptions);
}
@SneakyThrows
@Override
public Page download(Request request, Task task) {
String url = request.getUrl();
System.out.println("url:"+url);
if(!url.contains("detail")){
//下一页
if(StrUtil.isNotBlank(request.getExtra("url"))){
url = request.getExtra("url");
if(!url.equals(driver.getCurrentUrl())){
driver.get(url);
Thread.sleep( RandomUtil.randomInt(8000,12000));
doScrollToEnd();
Thread.sleep(2000);
}

List<WebElement> nextPages = driver.findElementsByCssSelector("#wrap > div.page-job-wrapper > div.page-job-inner > div > div.job-list-wrapper > div.search-job-result > div > div > div > a");
WebElement nextPage = nextPages.get(nextPages.size() - 1);
if(!"disabled".equals(nextPage.getAttribute("class"))){
//等待元素可点击
((JavascriptExecutor) driver).executeScript("arguments[0].click();", nextPage);
Thread.sleep( RandomUtil.randomInt(8000,12000));

doScrollToEnd();
return createPage(driver.getCurrentUrl(), driver.getPageSource(), "page");
}
System.out.println("翻页失败");
}else{
//分页 https://www.zhipin.com/web/geek/job?query=Java&city=101280100&page=1
driver.get(url);
Thread.sleep( RandomUtil.randomInt(8000,12000));
doScrollToEnd();
return createPage(driver.getCurrentUrl(), driver.getPageSource(), "page");
}
}else {
driver.get(url);
Thread.sleep(RandomUtil.randomInt(2000,5000));
Page page = createPage(driver.getCurrentUrl(), driver.getPageSource(), "pageDetail");
String jobId = request.getExtra("jobId");
page.putField("jobId", jobId);
return page;
}

return null;
}

private void doScrollToEnd() throws InterruptedException {
long height = (long) driver.executeScript("return document.body.scrollHeight");
for (int i = 0; i < height - 1000; i = i + RandomUtil.randomInt(200,500)) {
driver.executeScript(StrUtil.format("window.scrollTo(0,{})",i));
Thread.sleep(RandomUtil.randomInt(200,1000));
}

}

private Page createPage(String url, String html, String pageName) throws InterruptedException {
Page page = new Page();
PlainText plainText = new PlainText(url);
page.setUrl(plainText);
page.setRawText(html);

int retry = 0;
while (html.contains("请稍候") && url.contains("detail") && retry < 1){
System.out.println("加载无效,重新加载");
driver.get(url);
Thread.sleep(5000);
html = driver.getPageSource();
page.setRawText(html);
retry++;
}

Request req = new Request(url);
req.putExtra("pageName",pageName);
page.setRequest(req);
return page;
}

@Override
public void setThread(int i) {

}
}

MyBossPageInterceptor:处理列表页、详情页、下一页

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
@Component
public class MyBossPageInterceptor implements PageProcessor {

private String domain = "https://www.zhipin.com";

private Random random = new Random();

@SneakyThrows
@Override
public void process(Page page) {
//1、列表页 2、详情页
Html html = page.getHtml();
Document document = html.getDocument();
String pageName = (String) page.getRequest().getExtras().get("pageName");
if("page".equals(pageName)){
//下一页
Request request = new Request();
request.setPriority(RandomUtil.randomInt(20,100));
request.setUrl("nextPage" + UUID.randomUUID().toString().replace("-", ""));
request.putExtra("url",page.getUrl().get());
page.addTargetRequest(request);

Elements selects = document.select("#wrap > div.page-job-wrapper > div.page-job-inner > div > div.job-list-wrapper > div.search-job-result > ul > li");
if(CollUtil.isNotEmpty(selects)){
//列表页
List<JobInfo> itemList = new ArrayList<>();
for (Element select : selects) {
String jobName = select.selectFirst("div.job-card-body.clearfix > a > div.job-title.clearfix > span.job-name").text();
System.out.println("jobName:" + jobName);
String companyName = select.selectFirst("div.job-card-body.clearfix > div > div.company-info > h3 > a").text();
System.out.println("companyName:" + companyName);
String workLocation = select.selectFirst("div.job-card-body.clearfix > a > div.job-title.clearfix > span.job-area-wrapper > span").text();
System.out.println("workLocation:" + workLocation);
String companyInfo = select.select("div.job-card-body.clearfix > div > div.company-info > ul > li").stream().map(Element::text).collect(Collectors.joining(","));
System.out.println("companyInfo:"+companyInfo);
String jobBenifit = select.select("div.job-card-footer.clearfix > div").stream().map(Element::text).collect(Collectors.joining(","));
System.out.println("jobBenifit:"+jobBenifit);
String salary = select.select("div.job-card-body.clearfix span.salary").text();

String[] split = salary.split("-");
Float min = Float.valueOf(split[0]);
Float max;
int month = 12;
String maxStr = split[1];
String[] split1 = maxStr.split("\\·");
if(split1.length == 1){
max = Float.valueOf(maxStr.substring(0,maxStr.length()-1));
}else{
max = Float.valueOf(split1[0].substring(0,split1[0].length()-1));
month = Integer.parseInt(split1[1].substring(0,split1[1].length()-1));
}
System.out.println("salary:"+min+"-"+max+"*"+month);
//把详情链接放到Request
String detailUrl = select.select("div.job-card-body.clearfix > a").attr("href");
String jobId = detailUrl.substring(detailUrl.lastIndexOf("/") + 1,detailUrl.lastIndexOf(".html?"));
System.out.println("jobId:"+jobId);

JobInfo jobInfo = new JobInfo();
jobInfo.setJobId(jobId);
jobInfo.setWorkLocation(workLocation);
jobInfo.setJobBenifit(jobBenifit);
jobInfo.setCompanyName(companyName);
jobInfo.setCompanyInfo(companyInfo);
jobInfo.setJobName(jobName);
jobInfo.setSalaryMonth(month);
jobInfo.setSalaryMin(min);
jobInfo.setSalaryMax(max);
jobInfo.setUrl(detailUrl);
itemList.add(jobInfo);
System.out.println();
detailUrl = domain + detailUrl;
Request detailRequest = new Request();
detailRequest.setPriority(RandomUtil.randomInt(1,10));
detailRequest.putExtra("jobId", jobId);
detailRequest.setUrl(detailUrl);
page.addTargetRequest(detailRequest);
}
page.putField("itemList",itemList);
}
}else if("pageDetail".equals(pageName)){
String jobId = page.getResultItems().get("jobId");
//详情页-获取岗位描述、要求 #main > div.job-banner > div > div > div.info-primary > p > span.text-desc.text-experiece
String workExperience = document.select("#main > div.job-banner > div > div > div.info-primary > p > span.text-desc.text-experiece").text();
System.out.println("workExperience:"+ workExperience);
String degreeExperience = document.select("#main > div.job-banner > div > div > div.info-primary > p > span.text-desc.text-degree").text();
System.out.println("degreeExperience:" + degreeExperience);
String jobDesc = document.select("#main > div.job-box > div > div.job-detail > div:nth-child(1) > div.job-sec-text").text();
System.out.println("jobDesc:" + jobDesc);
String bossActiveTime = document.select("#main > div.job-box > div > div.job-detail > div:nth-child(1) > div.job-boss-info > h2 > span").text();
System.out.println("bossActiveTime:" + bossActiveTime);
String lastUpdateTime = document.select("#main > div.job-box > div > div.job-detail > p").text();
lastUpdateTime = lastUpdateTime.substring(lastUpdateTime.indexOf(":") + 1);
System.out.println("lastUpdateTime:" + lastUpdateTime);
String workAddr = document.select("#main > div.job-box > div > div.job-detail > div.job-detail-section.job-detail-company > div.detail-section-item.company-address > div > div.location-address").text();
System.out.println("workAddr:" + workAddr);
String companyCreatTime = document.select("#main > div.job-box > div > div.job-detail > div.job-detail-section.job-detail-company > div.detail-section-item.business-info-box > div > ul > li.res-time").text();
System.out.println("companyCreatTime:" + companyCreatTime);
String companyFund = document.select("#main > div.job-box > div > div.job-detail > div.job-detail-section.job-detail-company > div.detail-section-item.business-info-box > div > ul > li.company-fund").text();
System.out.println("companyFund:" + companyFund);
System.out.println();

JobInfo jobInfo = new JobInfo();
jobInfo.setJobId(jobId);
jobInfo.setWorkExperience(workExperience);
jobInfo.setDegreeExperience(degreeExperience);
jobInfo.setJobDesc(jobDesc);
jobInfo.setJobAddr(workAddr);
jobInfo.setTime(lastUpdateTime);
jobInfo.setCompanyCreatTime(companyCreatTime);
jobInfo.setCompanyFund(companyFund);
jobInfo.setBossActiveTime(bossActiveTime);
page.putField("item",jobInfo);

}
}

@Override
public Site getSite() {
return Site.me().setSleepTime(1000).setTimeOut(2000);
}
}

MyBossPipeline

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
@Component
public class MyBossPipeline implements Pipeline {

@Resource
private JobInfoService jobInfoService;

@Override
public void process(ResultItems resultItems, Task task) {
JobInfo item = resultItems.get("item");
if(Objects.nonNull(item)){
JobInfo jobInfo = jobInfoService.lambdaQuery().eq(JobInfo::getJobId, item.getJobId()).one();
if(Objects.nonNull(jobInfo)){
item.setId(jobInfo.getId());
jobInfoService.updateById(item);
}else{
jobInfoService.save(item);
}
return;
}
List<JobInfo> itemList = resultItems.get("itemList");
if(CollUtil.isNotEmpty(itemList)){
for (JobInfo jobInfo : itemList) {
JobInfo selectJob = jobInfoService.lambdaQuery().eq(JobInfo::getJobId, jobInfo.getJobId()).one();
if(Objects.nonNull(selectJob)){
jobInfo.setId(selectJob.getId());
jobInfoService.updateById(jobInfo);
}else{
jobInfoService.save(jobInfo);
}
}
}
}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
@Component
public class BossSpiderStart {
@Resource
private MyBossPageInterceptor myBossPageInterceptor;
@Resource
private MyBossPipeline myBossPipeline;
@Resource
private MyBossDownloader myBossDownloader;

public void start(){
Spider.create(myBossPageInterceptor)
.setPipelines(Arrays.asList(myBossPipeline))
.setDownloader(myBossDownloader)
.setScheduler(new PriorityScheduler())
.addUrl("https://www.zhipin.com/web/geek/job?query=Java&city=101280100&jobType=1901")
.thread(1)
.start();
}
}

难题

1、被识别出ip存在异常访问行为需要人工验证滑动滑块。

image-20240928221420318

解决方案一

守在电脑前,人力手动验证。

解决方案二

在Downloader动态设置代理ip,按照特性规则每50个请求换一个ip

方案三

​ 滑动验证码识别技术。这种技术通常利用机器学习或深度学习算法来识别滑动验证码的图案和特征,并模拟用户进行滑动操作。然而,由于滑动验证码的复杂性和变化性,识别技术的准确性和稳定性仍然是一个挑战,目前常用的验证码平台是超级鹰:https://www.chaojiying.com/api-45.html,一般用来识别验证码,还不能解决滑块问题。

​ 如果要解决滑块问题,需要定位获取滑块区域,通过算法来解析凹槽的位置,计算得出滑块需要移动的距离,但是该方案难度很高,如果采用第三方解析成本极高,故尽量降低被反扒的可能