The principle first
As a veteran web-novel reader, I can tell you most domestic novel sites are cut from the same cloth: the various 🖊*閣 clones and so on. They mostly serve plain HTML over GET requests, with the telltale <dl> and <dd> tags wrapping the chapter list.
So the rough idea is: send a GET request to the site, clean up the returned content, write it into a text file, and copy the resulting txt to your phone for comfortable reading.
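That pipeline can be sketched end to end in a few lines (a minimal sketch: the HTML string here is a made-up stand-in for what requests.get(url).text would return, not the real site's markup):

```python
import re

# Stand-in for html = requests.get(url).text
html = '<dl><dt>目錄</dt><dd><a href="/1.html">第1章</a></dd></dl>'

# Clean: drop the tags, keep the visible text
text = re.sub(r'<[^>]+>', ' ', html)
text = re.sub(r'\s+', ' ', text).strip()

# Write the cleaned text into a txt for the phone
with open('novel.txt', 'w', encoding='utf-8') as f:
    f.write(text)
```

The real scripts below do the same thing, just with site-specific cleaning rules.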
Putting it into practice
I hit a pitfall here earlier. The chapter URLs look like https://www.xxx.com/{novel id}/{chapter id}.html. Reading the first few chapters, the chapter IDs were consecutive, so my first plan was simply to remember the starting chapter ID and increment it in the loop. That turned out to be naive: somewhere past a hundred chapters or so the numbering jumps. Why, I honestly don't know; with a fixed novel ID, the chapters ought to be rows in one table, and the chapter ID ought to be a plain auto-increment key. If anyone knows, please enlighten me!
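The jump is easy to confirm once you have the real chapter IDs from the table of contents. A quick sketch (this ID list is made up for illustration):

```python
# Hypothetical chapter IDs pulled from an index page
ids = [3798160, 3798161, 3798162, 3798200, 3798201]

# List every spot where the numbering is not consecutive
gaps = [(a, b) for a, b in zip(ids, ids[1:]) if b - a != 1]
print(gaps)
```

Any non-empty result means blind incrementing will eventually request pages that don't exist.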
So the first step is to fetch the novel's table of contents and clean it into a list we can look things up in later. The getList.py file:
Define a method that requests the index page
# Request the table-of-contents page
def req():
    url = "https://www.24kwx.com/book/4/4020/"
    strHtml = requests.get(url)
    return strHtml.text
From the fetched HTML, extract each chapter into (id: the unique value from the URL), (name: the chapter title), and (key: the chapter number, i.e. the X in 第X章)
# Define a chapter object
class Xs(object):
    def __init__(self, id, key, name):
        self._id = id
        self._key = key
        self._name = name
    @property
    def id(self):
        return self._id
    @property
    def key(self):
        return self._key
    @property
    def name(self):
        return self._name
    def getString(self):
        return 'id:%s,name:%s,key:%s' % (self._id, self._name, self._key)
# Convert the page into a chapter list
def tranceList():
    key = 0
    name = ""
    xsList = []
    idrule = r'/4020/(.+?)\.html'
    keyrule = r'第(.+?)章'
    html = req()
    html = re.split("</dt>", html)[2]
    html = re.split("</dl>", html)[0]
    htmlList = re.split("</dd>", html)
    for i in htmlList:
        i = i.strip()
        if i:
            # Extract the id
            id = re.findall(idrule, i)[0]
            lsKeyList = re.findall(keyrule, i)
            # If the entry carries a 第X章 chapter number, use it
            if len(lsKeyList) > 0:
                key = int(lsKeyList[0])
            else:
                key = key + 1
            # Extract the title (an ASCII comma in it would break the line
            # format, so it could be replaced first):
            # lsname = re.findall(r'\.html">(.+?)</a>', i)[0]
            # name = re.sub(',', ' ', lsname)
            name = re.findall(r'\.html">(.+?)</a>', i)[0]
            xsobj = Xs(id, key, name)
            xsList.append(xsobj.getString())
    writeList(xsList)
A note here: if you're coming to Python from another language, writing your first object may be confusing. Python's object is a class; the object I build here is just {id, key, name}, but to write it into the txt you still have to call getString. In hindsight I could have skipped the class entirely and written an 'id:xxx,name:xxx,key:xxx' string directly. I kept the class in to give you folks something to look at.
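For what it's worth, the plain-string version hinted at above really is shorter. A sketch (the helper name chapter_line is mine, not from the original code):

```python
def chapter_line(id, key, name):
    # Produce the same 'id:...,name:...,key:...' line as Xs.getString()
    return 'id:%s,name:%s,key:%s' % (id, name, key)

print(chapter_line('3798160', 1, '第1章'))
```

One function call per chapter, no class needed; the output format is identical.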
Finally, write it into a txt file
# Write the list to a text file
def writeList(list):
    f = open("xsList.txt", 'w', encoding='utf-8')
    # You can't write the list directly; join it into a string first,
    # or you get: TypeError: write() argument must be str, not list
    f.write('\n'.join(list))
    f.close()
    print('List written')
The finished txt looks roughly like this:
id:3798160,name:第1章 孫子,我是你爺爺,key:1
id:3798161,name:第2章 孫子,等等我!,key:2
id:3798162,name:第3章 天上掉下個親爺爺,key:3
id:3798163,name:第4章 超級大客戶,key:4
id:3798164,name:第5章 一張退婚證明,key:5
OK! Last part.
With the table of contents saved, we now read the actual chapter content; same idea as before.
First, the request
# Request a chapter-content page
def req(id):
    url = "https://www.24kwx.com/book/4/4020/" + id + ".html"
    strHtml = requests.get(url)
    return strHtml.text
Read the table of contents we just saved
def getList():
    f = open("xsList.txt", 'r', encoding='utf-8')
    # Read line by line; readlines() returns a list
    line = f.readlines()
    f.close()
    return line
Define the rules for cleaning the data
contextRule = r'<div class="content">(.+?)<script>downByJs();</script>'
titleRule = r'<h1>(.+?)</h1>'
def getcontext(objstr):
    xsobj = re.split(",", objstr)
    id = re.split("id:", xsobj[0])[1]
    name = re.split("name:", xsobj[1])[1]
    html = req(id)
    lstitle = re.findall(titleRule, html)
    title = lstitle[0] if len(lstitle) > 0 else name
    context = re.split('<div id="content" class="showtxt">', html)[1]
    context = re.split('</div>', context)[0]
    context = re.sub('&nbsp;|\r|\n', '', context)
    textList = re.split('<br />', context)
    textList.insert(0, title)
    for item in textList:
        writeTxt(item)
    print('%s -- written' % (title))
Then write it to the file
def writeTxt(txt):
    if txt:
        f = open("nr.txt", 'a', encoding="utf-8")
        f.write(txt + '\n')
        f.close()
And finally, of course, tie it all together
def getTxt():
    # Default parameters
    startNum = 1261  # start chapter
    endNum = 1300  # end chapter
    # Main program: truncate the output file first
    f = open("nr.txt", 'w', encoding='utf-8')
    f.write("")
    f.close()
    if endNum < startNum:
        print('End chapter must be no smaller than start chapter')
        return
    allList = getList()
    needList = allList[startNum-1:endNum]
    for item in needList:
        getcontext(item)
        time.sleep(0.2)
    print("All chapters fetched")
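On a long run, a single dropped connection aborts the whole loop, so wrapping the request in a simple retry is worth considering. A sketch with the fetch injected as a function (with_retry is my own helper, not part of the original scripts; in practice fetch would be something like lambda: requests.get(url).text, and the retry count and delay are arbitrary):

```python
import time

def with_retry(fetch, tries=3, delay=0.0):
    # Call fetch() until it succeeds or the attempts run out
    for attempt in range(tries):
        try:
            return fetch()
        except Exception:
            if attempt == tries - 1:
                raise
            time.sleep(delay)
```

Keeping the fetch pluggable also makes the retry logic trivial to test without hitting the network.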
Complete code
getList.py
import requests
import re

# Request the table-of-contents page
def req():
    url = "https://www.24kwx.com/book/4/4020/"
    strHtml = requests.get(url)
    return strHtml.text

# Define a chapter object
class Xs(object):
    def __init__(self, id, key, name):
        self._id = id
        self._key = key
        self._name = name
    @property
    def id(self):
        return self._id
    @property
    def key(self):
        return self._key
    @property
    def name(self):
        return self._name
    def getString(self):
        return 'id:%s,name:%s,key:%s' % (self._id, self._name, self._key)

# Convert the page into a chapter list
def tranceList():
    key = 0
    name = ""
    xsList = []
    idrule = r'/4020/(.+?)\.html'
    keyrule = r'第(.+?)章'
    html = req()
    html = re.split("</dt>", html)[2]
    html = re.split("</dl>", html)[0]
    htmlList = re.split("</dd>", html)
    for i in htmlList:
        i = i.strip()
        if i:
            # Extract the id
            id = re.findall(idrule, i)[0]
            lsKeyList = re.findall(keyrule, i)
            # If the entry carries a 第X章 chapter number, use it
            if len(lsKeyList) > 0:
                key = int(lsKeyList[0])
            else:
                key = key + 1
            # Extract the title
            name = re.findall(r'\.html">(.+?)</a>', i)[0]
            xsobj = Xs(id, key, name)
            xsList.append(xsobj.getString())
    writeList(xsList)

# Write the list to a text file
def writeList(list):
    f = open("xsList.txt", 'w', encoding='utf-8')
    # Join into a string first, or: TypeError: write() argument must be str, not list
    f.write('\n'.join(list))
    f.close()
    print('List written')

def main():
    tranceList()

if __name__ == '__main__':
    main()
writeTxt.py
import requests
import re
import time

# Request a chapter-content page
def req(id):
    url = "https://www.24kwx.com/book/4/4020/" + id + ".html"
    strHtml = requests.get(url)
    return strHtml.text

# Read the saved table of contents
def getList():
    f = open("xsList.txt", 'r', encoding='utf-8')
    # Read line by line; readlines() returns a list
    line = f.readlines()
    f.close()
    return line

contextRule = r'<div class="content">(.+?)<script>downByJs();</script>'
titleRule = r'<h1>(.+?)</h1>'

def getcontext(objstr):
    xsobj = re.split(",", objstr)
    id = re.split("id:", xsobj[0])[1]
    name = re.split("name:", xsobj[1])[1]
    html = req(id)
    lstitle = re.findall(titleRule, html)
    title = lstitle[0] if len(lstitle) > 0 else name
    context = re.split('<div id="content" class="showtxt">', html)[1]
    context = re.split('</div>', context)[0]
    context = re.sub('&nbsp;|\r|\n', '', context)
    textList = re.split('<br />', context)
    textList.insert(0, title)
    for item in textList:
        writeTxt(item)
    print('%s -- written' % (title))

def writeTxt(txt):
    if txt:
        f = open("nr.txt", 'a', encoding="utf-8")
        f.write(txt + '\n')
        f.close()

def getTxt():
    # Default parameters
    startNum = 1261  # start chapter
    endNum = 1300  # end chapter
    # Main program: truncate the output file first
    f = open("nr.txt", 'w', encoding='utf-8')
    f.write("")
    f.close()
    if endNum < startNum:
        print('End chapter must be no smaller than start chapter')
        return
    allList = getList()
    needList = allList[startNum-1:endNum]
    for item in needList:
        getcontext(item)
        time.sleep(0.2)
    print("All chapters fetched")

def main():
    getTxt()

if __name__ == "__main__":
    main()
That's the full walkthrough of scraping a domestic novel site with Python; for more on the topic, see the other related articles on 腳本之家.