from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)  # Hello
Example
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)
Output:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story
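First we declare the variable html, which holds an HTML string. Note that it is not a complete HTML document, because the body and html nodes are never closed. We pass it as the first argument to the BeautifulSoup constructor; the second argument is the parser type (lxml here). This completes the initialization of the BeautifulSoup object, which we assign to the variable soup.
The prettify() method outputs the parsed document with standard indentation. The output above includes the closing body and html tags: BeautifulSoup repairs incomplete markup when the object is initialized, not when prettify() is called.
Next we call soup.title.string, which outputs the text content of the title node. soup.title selects the title node in the HTML, and its string attribute returns the text inside it.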
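The repair of unclosed tags can be checked on a much smaller input. The following is a minimal sketch, not part of the original walkthrough: the variable names are illustrative, and it assumes the standard-library 'html.parser' backend instead of lxml so that it runs without an extra install. Other parsers may repair the same input differently.

```python
from bs4 import BeautifulSoup

# Deliberately truncated markup: <p>, <body> and <html> are never closed.
broken = "<html><body><p>Hello"

# Assumption: 'html.parser' (standard library) is used here instead of lxml.
soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree shows that the missing closing tags were added
# while the document was being parsed.
repaired = str(soup)
print(repaired)
```

The closing tags are already present in the parsed tree, so any later serialization, including prettify(), will show them.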
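One detail worth knowing about the string attribute is that it only returns text when a node has a single child. A short sketch on a made-up fragment (the markup and variable names are illustrative, and 'html.parser' is assumed so the example has no extra dependency):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b> and plain</p>", "html.parser")

# <b> has exactly one child (a string), so .string returns that text.
single = soup.b.string

# <p> has two children (<b> and a string), so .string is None.
mixed = soup.p.string

# get_text() concatenates all text inside the node instead.
full_text = soup.p.get_text()

print(single, mixed, full_text)
```

For nodes such as title that contain only text, string is the simplest choice; for nodes with mixed content, get_text() is usually what you want.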
Extracting information
Getting the name
The name attribute returns the name of a node. Using the same document as above, select the title node and read its name attribute:
print(soup.title.name)
Output:
title
Getting attributes
A node may have several attributes, such as id and class. After selecting a node, call attrs to get all of them:
print(soup.p.attrs)
print(soup.p.attrs['name'])
Output:
{'class': ['title'], 'name': 'dromouse'}
dromouse
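As you can see, attrs returns a dictionary that maps every attribute name of the selected node to its value. Note that the value of class is a list, because a node can carry several classes. Getting the name attribute is then an ordinary dictionary lookup: put the attribute name in square brackets, as in attrs['name'].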
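Beautiful Soup also lets you index the node directly, without going through attrs. The sketch below is an addition to the walkthrough; it reuses the attribute values from the example above on a shortened fragment and assumes 'html.parser' so it runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title" name="dromouse">Hi</p>', "html.parser")
p = soup.p

# Indexing the tag is a shortcut for indexing tag.attrs.
name_value = p["name"]

# class is a multi-valued attribute, so its value is a list.
class_value = p["class"]

# get() returns None for a missing attribute instead of raising KeyError.
missing = p.get("id")

print(name_value, class_value, missing)
```

Use get() when an attribute may be absent; square brackets raise KeyError in that case, just like a dictionary.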
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# contents returns the direct children of the first <p> as a list
print(soup.p.contents)
Output:
['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie</span> </a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']
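Alternatively, the children attribute yields the same nodes:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# children is an iterator, so loop over it to see the nodes
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
Output: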
<list_iterator object at 0x1064f7dd8>
0 
 Once upon a time there were three little sisters; and their names were
 
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 
 and
 
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
 and they lived at the bottom of a well.
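As the output shows, the direct children include whitespace-only text nodes as well as tags. If only the element children matter, one common approach is to filter by type. This is a sketch on a shortened version of the document; the filtering step and variable names are not from the original, and 'html.parser' is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<p class="story">
  Once upon a time
  <a id="link1"><span>Elsie</span></a>
  <a id="link2">Lacie</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# Keep only Tag children; text nodes (NavigableString) are skipped.
tag_children = [child for child in soup.p.children if isinstance(child, Tag)]
ids = [child["id"] for child in tag_children]

print(ids)
```

Note that the span inside the first link is not listed: children and contents only go one level deep.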
To get all descendant nodes, use the descendants attribute:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# descendants walks the whole subtree, not just the direct children
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
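The difference between children and descendants is easiest to see on a tiny tree. A minimal sketch with illustrative markup, assuming 'html.parser':

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a><span>Elsie</span></a></p>", "html.parser")

# Direct children of <p>: only the <a> tag.
children = list(soup.p.children)

# All descendants of <p>, in document order: <a>, <span>, then the text.
descendants = list(soup.p.descendants)

print(len(children), len(descendants))
```

Here descendants visits each nested node in turn, including the text node inside span, while children stops at the first level.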
Parent and ancestor nodes
To get the parent of a node, use the parent attribute:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# parent of the first <a> is the enclosing <p>
print(soup.a.parent)
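The parent attribute gives only the immediate parent. For the full chain of ancestors, Beautiful Soup provides the parents attribute, which is iterable. A small sketch on illustrative markup, assuming 'html.parser':

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a><span>Elsie</span></a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Immediate parent of <span>.
direct_parent = soup.span.parent.name

# Every ancestor, from the nearest one up to the BeautifulSoup object itself,
# whose name is reported as '[document]'.
ancestors = [tag.name for tag in soup.span.parents]

print(direct_parent, ancestors)
```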
html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            Hello
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# next_sibling / previous_sibling return one node; the plural forms are iterators
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))
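In the sibling example, the node right after a tag is often a text node rather than the next tag. The sketch below shows this on a one-line fragment and uses find_next_sibling to jump to the next tag of a given name; the markup and variable names are illustrative, and 'html.parser' is assumed.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a> Hello <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate next sibling is the text between the two links.
next_node = first.next_sibling

# find_next_sibling skips ahead to the next sibling that matches the name.
next_tag = first.find_next_sibling("a")

# The first link has nothing before it inside <p>.
prev_node = first.previous_sibling

print(repr(next_node), next_tag["id"], prev_node)
```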
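import re
html = '''
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Match text nodes against a regular expression.
# The argument is named string in Beautiful Soup 4.4.0 and later; older versions call it text.
print(soup.find_all(string=re.compile('link')))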
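Searching by text alone returns the matching text nodes, not the tags that contain them. Combining a tag name with the text filter returns tags instead. A minimal sketch, assuming Beautiful Soup 4.4.0 or later (where the argument is named string) and 'html.parser'; the shortened markup is illustrative.

```python
import re
from bs4 import BeautifulSoup

html = "<div><a>Hello, this is a link</a><a>Hello, this is a link, too</a></div>"
soup = BeautifulSoup(html, "html.parser")

# Text-only search: the results are the matching text nodes.
strings = soup.find_all(string=re.compile("link"))

# Name plus text search: the results are <a> tags whose text matches.
tags = soup.find_all("a", string=re.compile("too"))

print(strings)
print(tags)
```

From a matched text node you can still reach its tag through the parent attribute.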