
How to scrape 51cto data with Python and store it in MySQL

Views: 24 Date: 2022-07-13 09:39:14

Environment

1. Install Python 3.7

2. Install the requests, bs4, and pymysql modules

Steps

1. Install the environment and modules

For reference, see https://www.jb51.net/article/194104.htm
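Before writing the scraper, it can help to confirm that the three modules are actually importable. The following is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical and not part of the original article.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == '__main__':
    # The three third-party modules this tutorial relies on.
    missing = missing_modules(['requests', 'bs4', 'pymysql'])
    if missing:
        print('Install these first: pip install ' + ' '.join(missing))
    else:
        print('All required modules are available.')
```

Note that the import name bs4 is provided by the beautifulsoup4 package on PyPI, so the pip command printed for it may need adjusting.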

2. Write the code

# Insert 51cto blog page data into a MySQL database
# Import modules
import re
import bs4
import pymysql
import requests

# Connect to the database (host, account, and password)
db = pymysql.connect(host='172.171.13.229', user='root', passwd='abc123', db='test', port=3306, charset='utf8')
# Get a cursor
cursor = db.cursor()


def open_url(url):
    # Send a browser-like User-Agent so the request looks like a normal page visit
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/57.0.2987.98 Safari/537.36'}
    res = requests.get(url, headers=headers)
    return res


# Scrape the page content
def find_text(res):
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Blog titles
    titles = []
    targets = soup.find_all('a', class_='tit')
    for each in targets:
        each = each.text.strip()
        if '置頂' in each:  # drop the "pinned" label that follows the title
            each = each.split(' ')[0]
        titles.append(each)

    # Read counts
    reads = []
    read1 = soup.find_all('p', class_='read fl on')
    read2 = soup.find_all('p', class_='read fl')
    for each in read1:
        reads.append(each.text)
    for each in read2:
        reads.append(each.text)

    # Comment counts
    comment = []
    targets = soup.find_all('p', class_='comment fl')
    for each in targets:
        comment.append(each.text)

    # Favorites
    collects = []
    targets = soup.find_all('p', class_='collect fl')
    for each in targets:
        collects.append(each.text)

    # Publication dates (the text after the full-width colon)
    dates = []
    targets = soup.find_all('a', class_='time fl')
    for each in targets:
        each = each.text.split('：')[1]
        dates.append(each)

    # Insert statement; pymysql fills in the %s placeholders and escapes the values,
    # so titles that contain quotes do not break the SQL
    sql = '''insert into blog (blog_title, read_number, comment_number, collect, dates)
             values (%s, %s, %s, %s, %s);'''

    # Strip whitespace (including \xa0 from the page) before inserting
    for title, read, comm, collect, date in zip(titles, reads, comment, collects, dates):
        read = re.sub(r'\s', '', read)
        comm = re.sub(r'\s', '', comm)
        collect = re.sub(r'\s', '', collect)
        cursor.execute(sql, (title, read, comm, collect, date))
    db.commit()


# Get the total number of pages
def find_depth(res):
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    depth = soup.find('li', class_='next').previous_sibling.previous_sibling.text
    return int(depth)


# Main function
def main():
    host = 'https://blog.51cto.com/13760351'
    res = open_url(host)  # open the home page
    depth = find_depth(res)  # get the total number of pages

    # Scrape every page
    for i in range(1, depth + 1):
        url = host + '/p' + str(i)  # full link
        res = open_url(url)  # open the page
        find_text(res)  # scrape the data

    # Close the cursor
    cursor.close()
    # Close the database connection
    db.close()


if __name__ == '__main__':
    main()
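The main function builds each page link by appending /p and the page number to the blog's base address. That step can be sketched on its own as follows; the helper name page_urls is hypothetical, and the URL pattern is taken from the code above rather than from any 51cto documentation.

```python
def page_urls(host, depth):
    """Build the list of page links host/p1 ... host/p<depth>."""
    return [f'{host}/p{i}' for i in range(1, depth + 1)]


if __name__ == '__main__':
    for url in page_urls('https://blog.51cto.com/13760351', 3):
        print(url)
```

Separating link generation from fetching makes it easy to check the links before sending any requests.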

3. Create the corresponding table in MySQL

CREATE TABLE `blog` (
  `row_id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `blog_title` varchar(52) DEFAULT NULL COMMENT 'blog title',
  `read_number` varchar(26) DEFAULT NULL COMMENT 'read count',
  `comment_number` varchar(16) DEFAULT NULL COMMENT 'comment count',
  `collect` varchar(16) DEFAULT NULL COMMENT 'favorite count',
  `dates` varchar(16) DEFAULT NULL COMMENT 'publication date',
  PRIMARY KEY (`row_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
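A blog title that contains a quote character can break an INSERT statement assembled with string formatting. The sketch below illustrates the parameterized style that avoids this. As an assumption for the sake of a runnable example, it uses the standard-library sqlite3 module with an in-memory database as a stand-in for MySQL, and a cut-down version of the table; the sample row is invented. sqlite3 uses ? placeholders, whereas pymysql uses %s with the values passed as the second argument to cursor.execute.

```python
import sqlite3

# In-memory stand-in for the MySQL table (simplified; not the real schema).
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute(
    'CREATE TABLE blog ('
    'row_id INTEGER PRIMARY KEY AUTOINCREMENT, '
    'blog_title TEXT, '
    'read_number TEXT)'
)

# An invented row whose title contains single quotes.
row = ("It's a title with 'quotes'", '120')

# The driver substitutes and escapes the values; no manual quoting is needed.
cur.execute('INSERT INTO blog (blog_title, read_number) VALUES (?, ?)', row)
conn.commit()

stored = cur.execute('SELECT blog_title, read_number FROM blog').fetchall()
print(stored)
conn.close()
```

With pymysql the equivalent call would pass the same tuple to cursor.execute, with %s in place of each ? and no quotes around the placeholders.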


4. Run the code and check the result


Improved version

Improvements:

1. Some database fields only need to keep the digits.

2. Scraped content is a string by default; some fields are better stored as integers, which makes later database operations easier.
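The two improvements amount to stripping every non-digit character and converting what remains to an integer. A minimal sketch follows; the helper name to_int, the fallback to 0 when no digits are present, and the sample strings are assumptions for illustration and do not come from the original code.

```python
import re


def to_int(text, default=0):
    """Keep only the digits in `text` and return them as an int.

    Returns `default` when the text contains no digits at all
    (an assumption; the tutorial's code would raise ValueError instead).
    """
    digits = re.sub(r'\D', '', text)  # \D matches any non-digit, including \xa0
    return int(digits) if digits else default


if __name__ == '__main__':
    print(to_int('Reads\xa0128'))
    print(to_int(' 7 '))
    print(to_int('none'))
```

Because \D also removes whitespace, a separate whitespace-stripping pass is not needed for the numeric fields.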

1. The code is as follows:

import re
import bs4
import pymysql
import requests

# Connect to the database
db = pymysql.connect(host='172.171.13.229', user='root', passwd='abc123', db='test', port=3306, charset='utf8')
# Get a cursor
cursor = db.cursor()


def open_url(url):
    # Send a browser-like User-Agent so the request looks like a normal page visit
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/57.0.2987.98 Safari/537.36'}
    res = requests.get(url, headers=headers)
    return res


# Scrape the page content
def find_text(res):
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Blog titles
    titles = []
    targets = soup.find_all('a', class_='tit')
    for each in targets:
        each = each.text.strip()
        if '置頂' in each:  # drop the "pinned" label that follows the title
            each = each.split(' ')[0]
        titles.append(each)

    # Read counts
    reads = []
    read1 = soup.find_all('p', class_='read fl on')
    read2 = soup.find_all('p', class_='read fl')
    for each in read1:
        reads.append(each.text)
    for each in read2:
        reads.append(each.text)

    # Comment counts
    comment = []
    targets = soup.find_all('p', class_='comment fl')
    for each in targets:
        comment.append(each.text)

    # Favorites
    collects = []
    targets = soup.find_all('p', class_='collect fl')
    for each in targets:
        collects.append(each.text)

    # Publication dates (the text after the full-width colon)
    dates = []
    targets = soup.find_all('a', class_='time fl')
    for each in targets:
        each = each.text.split('：')[1]
        dates.append(each)

    # Insert statement; pymysql fills in the %s placeholders and escapes the values
    sql = '''insert into blogs (blog_title, read_number, comment_number, collect, dates)
             values (%s, %s, %s, %s, %s);'''

    for title, read, comm, collect, date in zip(titles, reads, comment, collects, dates):
        # Keep only the digits (this also removes whitespace and \xa0), then convert to int
        read = int(re.sub(r'\D', '', read))
        comm = int(re.sub(r'\D', '', comm))
        collect = int(re.sub(r'\D', '', collect))
        # Strip whitespace from the date
        date = re.sub(r'\s', '', date)
        cursor.execute(sql, (title, read, comm, collect, date))
    db.commit()


# Get the total number of pages
def find_depth(res):
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    depth = soup.find('li', class_='next').previous_sibling.previous_sibling.text
    return int(depth)


# Main function
def main():
    host = 'https://blog.51cto.com/13760351'
    res = open_url(host)  # open the home page
    depth = find_depth(res)  # get the total number of pages

    # Scrape every page
    for i in range(1, depth + 1):
        url = host + '/p' + str(i)  # full link
        res = open_url(url)  # open the page
        find_text(res)  # scrape the data

    # Close the cursor
    cursor.close()
    # Close the database connection
    db.close()


# Program entry point
if __name__ == '__main__':
    main()

2. Create the corresponding table

CREATE TABLE `blogs` (
  `row_id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key',
  `blog_title` varchar(52) DEFAULT NULL COMMENT 'blog title',
  `read_number` int(26) DEFAULT NULL COMMENT 'read count',
  `comment_number` int(16) DEFAULT NULL COMMENT 'comment count',
  `collect` int(16) DEFAULT NULL COMMENT 'favorite count',
  `dates` varchar(16) DEFAULT NULL COMMENT 'publication date',
  PRIMARY KEY (`row_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;

3. Run the code and verify


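Upgraded version

So that beginners can use this program too, the project can be packaged as an exe file. Other people can then run it on their own computer without installing Python, which is very convenient.

1. Modify the code:

# Add "import time" to the imports at the top of the script, then change the end to:
if __name__ == '__main__':
    main()
    print('\n\t\tAll data has been stored in the database!\n')
    time.sleep(5)  # keep the console window open for 5 seconds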

2. Install the packaging module pyinstaller (from cmd)

pip install pyinstaller -i https://pypi.tuna.tsinghua.edu.cn/simple/

3. Package the Python code

1. Change to the directory that contains the code to be packaged

2. In the cmd window, run pyinstaller -F test03.py (test03 is the project name)


4. Check the exe package

After packaging, a dist directory appears; the packaged exe is in that directory.


5. Run the exe and check the result


Check the database


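Summary:

1. This post builds on the previous one. The steps are: scrape the information on the first page, then scrape the other pages, and finally refine the details and package the program as an exe file.

2. Most scraped web data ends up stored in a database, so this approach is very practical.

3. This post can still be improved on; the important thing is to master the method.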

