爬虫简介，bs4 解析，xpath

python > spider spider

发布时间 : 2024-11-17 10:27

阅读 :

爬虫的特殊性
beautifulsoup
xpath
1. 简介
2. xpath 语法

爬虫的特殊性

爬虫是一个更新极快的东西，可能上一个案例过两天就失效了，学习的是教招拆招的东西。思维不是固定的，达到目的即可，不要套公式。

如果程序写的不够完善，访问频率过高，可能对服务器造成毁灭性打击，所以不要顶着一个网站干。必须放慢节奏和访问频率，刑法里有一条是破坏计算机系统罪。

大厂的反爬手段很残忍，没有经验不得挑战。gov 网站不要爬，那是在找死。

beautifulsoup

基本项

安装

Beautiful Soup 是 python 的一个库，主要的功能是从网页中抓取数据。

安装 pip install beautifulsoup4
安装 lxml 解析器 pip install lxml

Beautiful Soup 支持 python 标准库中的 html 解析器，还支持一些第三方的解析器，如果不安装则会使用 python 默认的解析器，lxml 解析器更强大，速度更快。

快速使用

解析代码，并且得到一个 BeautifulSoup 对象，将其解析

下面的一段HTML代码将作为例子被多次用到.这是爱丽丝梦游仙境的一段内容(以后内容中简称为爱丽丝的文档):

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# <class 'bs4.BeautifulSoup'>
print(type(soup))

# 格式化
print(soup.prettify())

# 获取所有内容
print(soup.get_text())

# 获取当前标签，获取不到则返回 None。<title>The Dormouse's story</title>
print(soup.title)
# 获取当前标签的名称。title
print(soup.title.name)
# 获取当前标签中的内容。The Dormouse's story
print(soup.title.string)
# 获取当前标签中的内容。The Dormouse's story
print(soup.title.text)

# 获取第一个标签中的属性
print(soup.p['class'])
# 获取所有标签。[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.findAll('a'))
# 获取所有 id 为 link3 的标签。[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.findAll(id='link3'))

使用：将一段文档传入到 BeautifulSoup 的构造方法，就能得到一个对象：

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# 方式一
soup = BeautifulSoup(html_doc, 'lxml')
# 方式二
url_soup = BeautifulSoup(open('url'), 'lxml')

对象种类

BS 将 HTML 文档转换为复杂的树形结构，每个节点都是 python 对象，可以归类为四种：

Tag

就是 HTML 中的一个个标签，比如 a、b、p 标签等
```
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
# <class 'bs4.element.Tag'>
type(tag)
```
通过点取属性的方式只能获得当前名字的第一个 tag

如果想要得到所有的标签，或是通过名字得到比一个 tag 更多的内容的时候,就需要用到 Searching the tree 中描述的方法，比如: soup.find_all('a')

但是注意这些是查找的标签，而不是标签中的值，如果要使用标签中的名字和属性，可以直接使用 x[xx] 获取，也可以使用 attrs 获取所有属性。
```
soup = BeautifulSoup(html_doc, 'lxml')
# 获取所有属性
print(soup.a.attrs)
# 获取 href 属性
print(soup.a.attrs['href'])
# 获取 href 属性
print(soup.a['href'])
```

NavigableString 字符串

使用 xx.string 即可获取标签中的字符串

soup = BeautifulSoup(html_doc, 'lxml')
# Elsie
print(soup.a.string)
# <class 'bs4.element.NavigableString'>
print(type(soup.a.string))

BeautifulSoup

BeautifulSoup 对象表示的是一个文档的全部内容。大部分时候可以把它当作 Tag 对象，是一个特殊的 Tag，我们可以分别获取它的类型，名称，以及属性。
```
# <class 'str'>
print(type(soup.name))
# [document]
print(soup.name)
# {} 空字典
print(soup.attrs)
```

Comment

如果字符串内容为注释则为 Comment

html_doc='<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>'

soup = BeautifulSoup(html_doc, 'html.parser')
# Elsie
print(soup.a.string)
#  <class 'bs4.element.Comment'>
print(type(soup.a.string))

子节点

一个 Tag 可能包含多个字符串或其它的 Tag，这些都是这个 Tag 的子节点：

.contents、.children 可以拿到 tag 的直接子节点

soup = BeautifulSoup(html_doc, 'lxml')
# 返回子节点的列表形式，可以循环使用
print(soup.head.contents)
# 返回一个子节点的 list 迭代器
print(soup.head.children)

for tag in soup.head.children:
    print(tag)

.descendants 可以拿到 tag 的所有子孙节点，包含节点和节点中的文本内容

soup = BeautifulSoup(html_doc, 'lxml')

# <generator object Tag.descendants at 0x0000027C893E27A0>
print(soup.head.descendants)
'''
<title>The Dormouse's story</title>
The Dormouse's story
'''
for tag in soup.head.descendants:
    print(tag)

如果想要获取节点内容，也有方式：

.string

如果一个标签里面是文本内容或者有一个唯一的子节点，则会返回最终的文本内容

如果一个标签里面是多个子节点，那么就无法找到最终应该展示哪个子节点的内容，则会返回 None

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# The Dormouse's story
print(soup.head.string)
# None
print(soup.body.string)

.text

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# None
print(soup.body.string)
'''
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
'''
print(soup.body.text)

.strings

strings 和 text 都可以返回所有文本内容，区别是 text 返回内容为字符串类型 strings 为生成器 generator
.stripped_strings

输出的字符串中可能包含了很多空格或空行，使用 .stripped_strings 可以去除多余空白内容

父节点

.parent

通过 .parent 属性来获取某个元素的父节点

title_tag = soup.title

# <head><title>The Dormouse's story</title></head>
title_tag.parent

文档的顶层节点比如 <html> 的父节点是 BeautifulSoup 对象:

html_tag = soup.html
# <class 'bs4.BeautifulSoup'>
type(html_tag.parent)

.parents

通过元素的 .parents 属性可以递归得到元素的所有父辈节点

 ```python
 link = soup.a
 for parent in link.parents:
     if parent is None:
         print(parent)
     else:
         print(parent.name)
 ```

文档搜索树

find_all

find_all(name, attrs, recursive, string, **kwargs) 搜索当前 tag 的所有 tag 子节点,并判断是否符合过滤器的条件

name 参数：可以传递字符串、正则、列表、布尔

import re

# [<title>The Dormouse's story</title>]
soup.find_all("title")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all("a")

# [<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example. com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all(["a", "b"])

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.find_all(id="link2")

# 模糊查询 包含sisters的就可以
soup.find(string=re.compile("sisters"))

keyword 参数：如果一个指定名字的参数不是搜索内置的参数名，搜索时会把该参数当作指定名字 tag 的属性来搜索。

可以指定一些 html 存在的一些属性，比如 id、class、href 等

# [<p class="title"><b>The Dormouse's story</b></p>]
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]
soup.find_all('p', class_='title')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.find_all(class_='sister', id='link1')

attrs 参数：匹配 html 中内置的属性

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.find_all(attrs={'class': 'sister', 'id': 'link1'})

text 参数：可以搜索文档中的字符串内容。与 name 参数的可选值一样

import re

# ['Elsie']
print(soup.find_all(text="Elsie"))

# ['Elsie', 'Lacie', 'Tillie']
print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))

# 只要包含Dormouse就可以
# ["The Dormouse's story", "The Dormouse's story"]
print(soup.find_all(text=re.compile("Dormouse")))

limit 参数：返回全部的搜索结构，如果文档树很大那么搜索会很慢。如果我们不需要全部结果，可以使用 limit 参数限制返回结果的数量。

效果与SQL中的 limit 关键字类似，当搜索到的结果数量达到 limit 的限制时，就停止搜索返回结果。
```
print(soup.find_all("a",limit=2))
print(soup.find_all("a")[0:2])

'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
'''
```

find

find()：唯一的区别是 find_all() 方法的返回结果是值包含一个元素的列表，而 find() 方法直接返回结果。

find_all() 方法没有找到目标是返回空列表, find() 方法找不到目标时,返回 None。

find_parents() 和 find_parent()

a_string = soup.find(text="Lacie")
print(a_string)  # Lacie

print(a_string.find_parent())
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
print(a_string.find_parents())
print(a_string.find_parent("p"))
'''
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>

'''

beautifulsoup 的 css 选择器

我们在写 CSS 时，标签名不加任何修饰，类名前加点，id名前加 #，在这里我们也可以利用类似的方法来筛选元素，用到的方法是 **soup.select()，**返回类型是 list

通过标签名查找

#[<title>The Dormouse's story</title>]
print(soup.select("title"))
#[<b>The Dormouse's story</b>]
print(soup.select("b"))

通过类名查找

'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

'''
print(soup.select(".sister"))

id 名查找

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
print(soup.select("#link1"))

组合查找

组合查找即和写 class 文件时，标签名与类名、id名进行的组合原理是一样的，例如查找 p 标签中，id 等于 link2 的内容，二者需要用空格分开
```
#[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
print(soup.select("p #link2"))
```
直接子标签查找
```
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
print(soup.select("p > #link2"))
```
查找既有class也有id选择器的标签
```
a_string = soup.select(".story#test")
```
查找有多个class选择器的标签
```
a_string = soup.select(".story.test")
```
查找有多个class选择器和一个id选择器的标签
```
a_string = soup.select(".story.test#book")
```
属性查找

查找时还可以加入属性元素，属性需要用中括号括起来，注意属性和标签属于同一节点，所以中间不能加空格，否则会无法匹配到。
```
#[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select("a[href='http://example.com/tillie']"))
```
select 方法返回的结果都是列表形式，可以遍历形式输出，然后用 get_text() 方法来获取它的内容：
```
'''
Elsie
Lacie
Tillie
'''
for title in soup.select('a'):
    print (title.get_text())
```

xpath

简介

安装 lxml 解析器 pip install lxml
解析流程

实例化一个 etree 的对象，把即将被解析的页面源码加载到该对象

调用该对象的 xpath 方法结合着不同形式的 xpath 表达进行标签定位和数据提取
使用

导入 lxml.etree: from lxml import etree

解析本地 html 文件: html_tree = etree.parse('XX.html')
```
from lxml import etree

parser = etree.HTMLParser(encoding='UTF-8')
tree = etree.parse('./doc/豆瓣.html', parser=parser)
print(tree)
```
解析网络的html字符串:（推荐） html_tree = etree.HTML(html字符串)
```
from lxml import etree

# 解析本地文件或者网络动态字符串
file = open('./doc/豆瓣.html', 'r', encoding='utf-8')
tree = etree.HTML(file.read())
print(tree)
```
使用 xpath 路径查询信息，返回一个列表: html_tree.xpath()

注意：如果 lxml 解析本地 HTML 文件报错可以安装如下添加参数
```
parser = etree.HTMLParser(encoding="utf-8")
selector = etree.parse('./lol_1.html', parser=parser)
result=etree.tostring(selector)
```
注意，浏览器 copy 的 xpath 会补充一些内容，可能会有大坑，和实际写的不大一样

而且 xpath 在 url 上是动态的内容（requests 获取的内容），可能和静态内容不一样，所以也是一个坑

所以要么看页面的源代码，要么直接用 requests 中的内容，不要从浏览器中的控制台直接拿

xpath 语法

路径表达式

表达式	描述
/	从 html 节点开始选择直接子节点可以匹配的节点
//	从 html 节点开始选择所有子节点中可以匹配的节点
./	当前节点再次进行xpath
@	选取属性

from lxml import etree

# 解析本地文件或者网络动态字符串
file = open('./doc/豆瓣.html', 'r', encoding='utf-8')
tree = etree.HTML(file.read())

# 从根节点开始，获取三个 a 标签的文本
# ['登录', '注册', '下载豆瓣客户端']
list = tree.xpath('/html/body/div/div/div/a/text()')
# 从根节点开始，获取第一个 div 中的两个 a 标签的文本
# ['登录', '注册']
list = tree.xpath('/html/body/div/div/div[1]/a/text()')

# 使用 // 匹配 a 标签，// 代表无论在 html 文档哪个位置全都拿出来
# [<Element a at 0x1acb7b644c0>, ……]
list = tree.xpath('//a')

# 根据第一个节点再接着向下匹配
# ['登录']
list[0].xpath('./text()')

# 根据属性值选取 ul 节点，之后选择 ul 下面第一个 li 中的 text
result = tree.xpath('//ul[@class="cover-col-4 clearfix"]/li[1]//text()')
# 根据属性值选取 ul 节点，之后选择 ul 下面最后的 li 中的 text
result = tree.xpath('//ul[@class="cover-col-4 clearfix"]/li[last()]//text()')
# 根据属性值选取 ul 节点，之后选择 ul 下面倒数第二个的 li 中的 text
result = tree.xpath('//ul[@class="cover-col-4 clearfix"]/li[last()-1]//text()')
# 根据属性值选取 ul 节点，之后选择 ul 下面从第四个开始的 li 中的 text
result = tree.xpath('//ul[@class="cover-col-4 clearfix"]/li[position() > 3]//text()')

# 获取 a 标签的 src 属性的第一个属性值
# 注意此此时已经转为了 list，所以不能再使用 xxx/@src[1] 的形式获取第一个值了
# https://img3.doubanio.com/mpic/s29535271.jpg
tree.xpath('//a[@class="cover"]/img/@src')[0]

通配符

通配符	描述
*	匹配任何元素节点，一般用于浏览器copy xpath会出现
@*	匹配任何属性节点
node()	匹配任何类型的节点
\|	使用 \| 可以匹配节点 1 或者节点 2
and	使用 and 选取为 A 并且节点为 B 的节点

from lxml import etree

# 解析本地文件或者网络动态字符串
file = open('./doc/豆瓣.html', 'r', encoding='utf-8')
tree = etree.HTML(file.read())

# 匹配任何标签，只要带有 href 属性都拿到节点对象
# [<Element link at 0x149d11d44c0>, ……]
tree.xpath('//*[@href]')

# 匹配任何节点，只要带有属性即可
# [<Element div at 0x23e0ff44700>, ……]
tree.xpath('//div[@*]')

# 匹配任何类型的节点
tree.xpath('//node()')

# [<Element div at 0x2b2d1004880>, <Element div at 0x2b2d1004800>]
tree.xpath('//div[@id="test1"] | //div[@id="test2"]')

# [<Element div at 0x20c3c204800>]
tree.xpath('//div[@id="test1" and @class="top-nav-info"]')

属性查询

通配符	描述
contains(@xxx, “xxx”)	获取属性中 xxx 带有 xxx 的标签
starts-with(@xxx, “xxx”)	获取属性中 xxx 以 xxx 开头的标签
text()	获取节点的文本内容
@xxx	获取节点的属性内容

from lxml import etree

# 解析本地文件或者网络动态字符串
file = open('./doc/豆瓣.html', 'r', encoding='utf-8')
tree = etree.HTML(file.read())

# [<Element div at 0x206ea614440>]
tree.xpath('//div[contains(@class, "top-nav-info")]')

# [<Element div at 0x2b517904700>, <Element div at 0x2b5179047c0>]
tree.xpath('//*[starts-with(@class, "top")]')

tree.xpath('//*[starts-with(@class, "top")]//text()')

tree.xpath('//*[starts-with(@class, "top")]//@class')

转载请注明来源，欢迎对文章中的引用来源进行考证，欢迎指出任何有错误或不够清晰的表达。