python | Getting a feel for text parsing and file reading/writing through a simple web-scraper example
- Programming
- 2023-02-11
Python's requests module makes it very easy to fetch web text from a URL.
Python's re module provides powerful regular-expression facilities for processing text.
Python's file reading and writing is likewise simple and powerful.
1 Fetching web text from a URL with Python's requests module
1.1 Installing the requests module
Open a command prompt in the Python Scripts directory
Start menu → Run (Windows+R) → cmd → use the cd command to change into the Scripts folder under the Python installation directory, e.g.:
cd C:\Users\userName\AppData\Local\Programs\Python\Python36-32\Scripts
Then run: pip install requests
Alternatively, open the Python installation directory, enter the Scripts folder, hold Shift and right-click, then choose "Open command window here".
Or run the following command directly in a cmd window (here using the Douban mirror):
pip install requests -i http://pypi.douban.com/simple --trusted-host=pypi.douban.com
1.2 Fetching web text from a URL
```python
import requests

href = "https://www.3zmm.net/files/article/html/98709/98709808/"
html_response = requests.get(href)
# html_response.encoding = "utf-8"
html = html_response.text
print(html)
```

Output:
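The commented-out encoding line is there for a reason: if a server does not declare its charset, requests falls back to a guessed encoding, and Chinese text can come out garbled. A tiny offline sketch of the failure mode (the sample string is just a chapter title from this demo):

```python
# Bytes as they would arrive over the wire, UTF-8 encoded.
raw = "第1章 出门即是江湖".encode("utf-8")

# Decoding with the right codec recovers the text.
print(raw.decode("utf-8"))      # 第1章 出门即是江湖

# The wrong codec silently produces mojibake instead of raising an error,
# which is what a mis-detected response encoding looks like in practice.
print(raw.decode("latin-1"))
```

Setting `html_response.encoding = "utf-8"` before reading `.text` avoids this when you know the page is UTF-8.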
2 Building the catalog file
Create a catalog file, index.html, holding the page links to be scraped (it can be built by hand or extracted with code).
(Only an excerpt is shown, for demonstration):
```html
<a href ="https://www.3zmm.net/files/article/html/98709/98709808/13110286.html">第1章 出门即是江湖</a>
<a href ="https://www.3zmm.net/files/article/html/98709/98709808/13110285.html">第2章 麻将出千</a>
<a href ="https://www.3zmm.net/files/article/html/98709/98709808/13110284.html">第3章 移山卸岭</a>
<a href ="https://www.3zmm.net/files/article/html/98709/98709808/13110283.html">第4章 初次试探</a>
<a href ="https://www.3zmm.net/files/article/html/98709/98709808/13110282.html">第5章 炸金花</a>
```

Of course, you could also fetch the web text directly and build the list with a regular-expression search. For this demonstration a local index.html catalog file is used instead: it can be edited at any time, acts as a local snapshot of the site's table of contents, and makes the demo more flexible.
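The alternative mentioned above, building the list straight from the fetched page, amounts to running the same regular expression over `requests.get(...).text`. A minimal sketch, using an inline sample so it runs offline (in practice `html` would be the downloaded table-of-contents page):

```python
import re

# Inline stand-in for requests.get(...).text
html = '''<a href ="https://www.3zmm.net/files/article/html/98709/98709808/13110286.html">第1章 出门即是江湖</a>
<a href ="https://www.3zmm.net/files/article/html/98709/98709808/13110285.html">第2章 麻将出千</a>'''

# Group 1 captures the link, group 2 the title.
indexList = re.findall(r'<a href ="(.*?)">(.*?)</a>', html)
for link in indexList:
    print(link[0], link[1])
```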
3 Reading index.html and building a list of links and titles
```python
import re

with open("index.html", "r", encoding="utf-8") as strf:
    str = strf.read()

res = r'<a href ="(.*?)">(.*?)</a>'  # () splits the match into two groups
indexList = re.findall(res, str)
for link in indexList:
    print("href: ", link[0])
    print("title: ", link[1], "\n")
```

Output:
4 Reading the web text behind the links in index.html
Fetch the web text via each link.
```python
import re
import requests

with open("index.html", "r", encoding="utf-8") as strf:
    str = strf.read()

res = r'<a href ="(.*?)">(.*?)</a>'  # () splits the match into two groups
indexList = re.findall(res, str)
for link in indexList:
    chapter_response = requests.get(link[0])
    # chapter_response.encoding = "utf-8"
    chapter_html = chapter_response.text
    print(link[1], "\n\n")
    print(chapter_html, "\n\n")
```

Output:
5 Extracting the text
Extract the main body text from the page source.
```python
import re
import requests

with open("index.html", "r", encoding="utf-8") as strf:
    str = strf.read()

res = r'<a href ="(.*?)">(.*?)</a>'  # () splits the match into two groups
indexList = re.findall(res, str)
for link in indexList:
    chapter_response = requests.get(link[0])
    # chapter_response.encoding = "utf-8"
    chapter_html = chapter_response.text
    chapter_content = re.findall(r'<div id="content" class="showtxt">(.*?)</div>', chapter_html)[0]
    print(link[1], "\n\n")
    print(chapter_content, "\n\n")
```

Output:
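One caveat about the extraction pattern: `.` does not match newlines by default, so if the site ever serves the content div across several lines, `re.findall` returns an empty list and the `[0]` index raises an IndexError. Passing `re.S` (DOTALL) makes the pattern robust to that. A small demonstration:

```python
import re

# Stand-in for a chapter page whose content div spans two lines.
chapter_html = '<div id="content" class="showtxt">line one\nline two</div>'
pattern = r'<div id="content" class="showtxt">(.*?)</div>'

print(re.findall(pattern, chapter_html))        # [] -- '.' stops at the newline
print(re.findall(pattern, chapter_html, re.S))  # ['line one\nline two']
```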
6 Cleaning the text
Replace unwanted text with an empty string.
```python
import re
import requests

with open("index.html", "r", encoding="utf-8") as strf:
    str = strf.read()

res = r'<a href ="(.*?)">(.*?)</a>'  # () splits the match into two groups
indexList = re.findall(res, str)
for link in indexList:
    chapter_response = requests.get(link[0])
    # chapter_response.encoding = "utf-8"
    chapter_html = chapter_response.text
    chapter_content = re.findall(r'<div id="content" class="showtxt">(.*?)</div>', chapter_html)[0]
    str = '<script>chaptererror();</script><br /> 请记住本书首发域名:www.3zmm.net。三掌门手机版阅读网址:m.3zmm.net'
    chapter_content = chapter_content.replace(str, '')
    chapter_content = chapter_content.replace(link[0], '')
    print(link[1], "\n\n")
    print(chapter_content, "\n\n")
```

Output:
7 Processing the text (search and replace)
```python
import re
import requests

with open("index.html", "r", encoding="utf-8") as strf:
    str = strf.read()

res = r'<a href ="(.*?)">(.*?)</a>'  # () splits the match into two groups
indexList = re.findall(res, str)
for link in indexList:
    chapter_response = requests.get(link[0])
    # chapter_response.encoding = "utf-8"
    chapter_html = chapter_response.text
    chapter_content = re.findall(r'<div id="content" class="showtxt">(.*?)</div>', chapter_html)[0]
    chapter_content = chapter_content.replace('<script>app2();</script><br />', '<p>')
    chapter_content = chapter_content.replace('<br /><br />', '</p>\r\n<p>')
    str = '<script>chaptererror();</script><br /> 请记住本书首发域名:www.3zmm.net。三掌门手机版阅读网址:m.3zmm.net'
    chapter_content = chapter_content.replace(str, '')
    chapter_content = chapter_content.replace(link[0], '')
    print(link[1], "\n\n")
    print(chapter_content, "\n\n")
```

Output:
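The effect of the two substitutions is easiest to see on a small sample. Below, a hypothetical fragment shaped like the site's chapter markup (the text itself is made up) is re-paragraphed exactly as above:

```python
# Hypothetical fragment in the same shape as the site's chapter markup.
sample = '<script>app2();</script><br />first paragraph<br /><br />second paragraph'

# The leading script tag becomes an opening <p>;
# each double <br /> becomes a paragraph break.
sample = sample.replace('<script>app2();</script><br />', '<p>')
sample = sample.replace('<br /><br />', '</p>\r\n<p>')
print(sample)
```

Note that the last paragraph is left without a closing `</p>`; browsers tolerate this, but it is worth being aware of.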
8 Writing each text to its own file
```python
import re
import requests

# 1 Read the catalog file and extract a list of (link, title) pairs
with open("index.html", "r", encoding="utf-8") as strf:
    str = strf.read()

res = r'<a href ="(.*?)">(.*?)</a>'  # () splits the match into two groups
indexList = re.findall(res, str)
for link in indexList:
    # 2 Fetch the page text for each link
    chapter_response = requests.get(link[0])
    # chapter_response.encoding = "utf-8"
    chapter_html = chapter_response.text
    # 3 Extract (cut out) the body text
    chapter_content = re.findall(r'<div id="content" class="showtxt">(.*?)</div>', chapter_html)[0]
    # 4 Clean the text (remove unwanted fragments)
    str = '<script>chaptererror();</script><br /> 请记住本书首发域名:www.3zmm.net。三掌门手机版阅读网址:m.3zmm.net'
    chapter_content = chapter_content.replace(str, '')
    chapter_content = chapter_content.replace(link[0], '')
    # 5 Process the text (search and replace)
    chapter_content = chapter_content.replace('<script>app2();</script><br />', '<p>')
    chapter_content = chapter_content.replace('<br /><br />', '</p>\r\n<p>')
    print(link[1], "\n\n")
    # 6 Persist the data (write to file)
    fb = open('%s.html' % link[1], 'w', encoding='utf-8')  # %s is replaced by link[1]
    fb.write(chapter_content)
    fb.close()
```

9 Writing each text to its own file, adding some CSS and JS
```python
import re
import requests

# 1 Read the catalog file and extract a list of (link, title) pairs
with open("index.html", "r", encoding="utf-8") as strf:
    str = strf.read()

res = r'<a href ="(.*?)">(.*?)</a>'  # () splits the match into two groups
indexList = re.findall(res, str)
for link in indexList:
    # 2 Fetch the page text for each link
    chapter_response = requests.get(link[0])
    # chapter_response.encoding = "utf-8"
    chapter_html = chapter_response.text
    # 3 Extract (cut out) the body text
    chapter_content = re.findall(r'<div id="content" class="showtxt">(.*?)</div>', chapter_html)[0]
    # 4 Clean the text (remove unwanted fragments)
    str = '<script>chaptererror();</script><br /> 请记住本书首发域名:www.3zmm.net。三掌门手机版阅读网址:m.3zmm.net'
    chapter_content = chapter_content.replace(str, '')
    chapter_content = chapter_content.replace(link[0], '')
    # 5 Process the text (search and replace)
    chapter_content = chapter_content.replace('<script>app2();</script><br />', '<p>')
    chapter_content = chapter_content.replace('<br /><br />', '</p>\r\n<p>')
    print(link[1], "\n\n")
    # 6 Persist the data (write to file, adding CSS and JS)
    sn = re.findall(r'第(.*?)章', link[1])[0]          # chapter number
    fb = open('%s.html' % sn, 'w', encoding='utf-8')   # %s is replaced by sn
    fheader = open('header.html', 'r', encoding="UTF-8")
    fb.write(fheader.read())
    fheader.close()
    fb.write('\n<h4>')
    fb.write(sn)
    cha = link[1].replace(sn, '')
    cha = cha.replace('第章 ', '')
    fb.write(' ')  # separator between chapter number and title
    fb.write(cha)
    fb.write('</h4>\n')
    fb.write(chapter_content)
    ffooter = open('footer.html', 'r', encoding="UTF-8")
    fb.write(ffooter.read())
    ffooter.close()
    fb.close()
```

Alternatively, the header and footer can be written into the file directly:
```python
import re
import requests

# 1 Read the catalog file and extract a list of (link, title) pairs
with open("index.html", "r", encoding="utf-8") as strf:
    str = strf.read()

res = r'<a href ="(.*?)">(.*?)</a>'  # () splits the match into two groups
indexList = re.findall(res, str)
for link in indexList:
    # 2 Fetch the page text for each link
    chapter_response = requests.get(link[0])
    # chapter_response.encoding = "utf-8"
    chapter_html = chapter_response.text
    # 3 Extract (cut out) the body text
    chapter_content = re.findall(r'<div id="content" class="showtxt">(.*?)</div>', chapter_html)[0]
    # 4 Clean the text (remove unwanted fragments)
    str = '<script>chaptererror();</script><br /> 请记住本书首发域名:www.3zmm.net。三掌门手机版阅读网址:m.3zmm.net'
    chapter_content = chapter_content.replace(str, '')
    chapter_content = chapter_content.replace(link[0], '')
    # 5 Process the text (search and replace)
    chapter_content = chapter_content.replace('<script>app2();</script><br />', '<p>')
    chapter_content = chapter_content.replace('<br /><br />', '</p>\r\n<p>')
    print(link[1], "\n\n")
    # 6 Persist the data (write to file, adding CSS and JS)
    sn = re.findall(r'第(.*?)章', link[1])[0]          # chapter number
    fb = open('%s.html' % sn, 'w', encoding='utf-8')   # %s is replaced by sn
    # 6.1 Write the file header
    # fheader = open('header.html', 'r', encoding="UTF-8")
    # fb.write(fheader.read())
    # fheader.close()
    headertxt = '''<!DOCTYPE html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title></title>
<link ID="CSS" href="../cssjs/css.css" rel="stylesheet" type="text/css" />
<script charset="utf-8" language="JavaScript" type="text/javascript" src="../cssjs/js.js"></script>
<script>docWrite1();</script>
</head>
<body>
<div id="container">
'''
    fb.write(headertxt)
    # 6.2 Write the file body
    fb.write('\n<h4>')
    fb.write(sn)
    cha = link[1].replace(sn, '')
    cha = cha.replace('第章 ', '')
    fb.write(' ')  # separator between chapter number and title
    fb.write(cha)
    fb.write('</h4>\n')
    fb.write(chapter_content)
    # 6.3 Write the file footer
    # ffooter = open('footer.html', 'r', encoding="UTF-8")
    # fb.write(ffooter.read())
    # ffooter.close()
    footertxt = '''<div>
<script type=text/javascript>
docWrite2();
bootfunc();
window.onload = myfun;
</script>
</div>
</body>
</html>
'''
    fb.write(footertxt)
    fb.close()
```

-End-