

Python Crawler: Scraping 我愛我家 (5i5j) Second-Hand Housing Data


1. Problem Description

First, run the following code to reproduce the problem:

# -*-coding:utf-8-*-
import re
import requests
from bs4 import BeautifulSoup


cookie = 'PHPSESSID=aivms4ufg15sbrj0qgboo3c6gj; HMF_CI=4d8ff20092e9832daed8fe5eb0475663812603504e007aca93e6630c00b84dc207; _ga=GA1.2.556271139.1620784679; gr_user_id=4c878c8f-406b-46a0-86ee-a9baf2267477; _dx_uzZo5y=68b673b0aaec1f296c34e36c9e9d378bdb2050ab4638a066872a36f781c888efa97af3b5; smidV2=20210512095758ff7656962db3adf41fa8fdc8ddc02ecb00bac57209becfaa0; yfx_c_g_u_id_10000001=_ck21051209583410015104784406594; __TD_deviceId=41HK9PMCSF7GOT8G; zufang_cookiekey=["%7B%22url%22%3A%22%2Fzufang%2F_%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%3Fzn%3D%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E9%95%BF%E6%98%A5%E6%A1%A5%22%2C%22total%22%3A%220%22%7D","%7B%22url%22%3A%22%2Fzufang%2F_%25E8%258B%258F%25E5%25B7%259E%25E8%25A1%2597%3Fzn%3D%25E8%258B%258F%25E5%25B7%259E%25E8%25A1%2597%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E8%8B%8F%E5%B7%9E%E8%A1%97%22%2C%22total%22%3A%220%22%7D","%7B%22url%22%3A%22%2Fzufang%2F_%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%3Fzn%3D%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E8%8B%8F%E5%B7%9E%E6%A1%A5%22%2C%22total%22%3A%220%22%7D"]; ershoufang_cookiekey=["%7B%22url%22%3A%22%2Fzufang%2F_%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%3Fzn%3D%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E9%95%BF%E6%98%A5%E6%A1%A5%22%2C%22total%22%3A%220%22%7D","%7B%22url%22%3A%22%2Fershoufang%2F_%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%3Fzn%3D%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E8%8B%8F%E5%B7%9E%E6%A1%A5%22%2C%22total%22%3A%220%22%7D"]; zufang_BROWSES=501465046,501446051,90241951,90178388,90056278,90187979,501390110,90164392,90168076,501472221,501434480,501480593,501438374,501456072,90194547,90223523,501476326,90245144; historyCity=["\u5317\u4eac"]; _gid=GA1.2.23153704.1621410645; Hm_lvt_94ed3d23572054a86ed341d64b267ec6=1620784715,1621410646; _Jo0OQK=4958FA78A5CC420C425C480565EB46670E81832D8173C5B3CFE61303A51DE43E320422D6C7A15892C5B8B66971ED1B97A7334F0B591B193EBECAAB0E446D805316B26107A0B847CA53375B268E06EC955BB75B268E06EC955BB9D992FB153179892GJ1Z1OA==; ershoufang_BROWSES=501129552; domain=bj; 8fcfcf2bd7c58141_gr_session_id=61676ce2-ea23-4f77-8165-12edcc9ed902; 8fcfcf2bd7c58141_gr_session_id_61676ce2-ea23-4f77-8165-12edcc9ed902=true; yfx_f_l_v_t_10000001=f_t_1620784714003__r_t_1621471673953__v_t_1621474304616__r_c_2; Hm_lpvt_94ed3d23572054a86ed341d64b267ec6=1621475617'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36',
    'Cookie': cookie.encode("utf-8").decode("latin1")  # requests sends headers as latin-1; re-encoding keeps the non-ASCII cookie intact
}


def run():
    base_url = 'https://bj.5i5j.com/ershoufang/xichengqu/n%d/'
    for page in range(1, 11):
        url = base_url % page
        print(url)
        html = requests.get(url, headers=headers).text
        soup = BeautifulSoup(html, 'lxml')
        try:
            for li in soup.find('div', class_='list-con-box').find('ul', class_='pList').find_all('li'):
                title = li.find('h3', class_='listTit').get_text()  # listing title
                # print(title)
        except Exception as e:
            print(e)
            print(html)
            break


if __name__ == '__main__':
    run()

After running it, you will find that when fetching https://bj.5i5j.com/ershoufang/xichengqu/n1/ (it may be a different page number), the script fails with: 'NoneType' object has no attribute 'find'. Looking at the printed html, the response body is not a listing page at all but a short JavaScript redirect stub of the form <HTML><HEAD><script>window.location.href="...";</script></HEAD><BODY>. The same link does show data when opened in a browser, because the browser executes the redirect; the redirected URL is exactly the href in this stub. It is therefore reasonable to infer that for some page links 我愛我家 does not return the data directly, but instead returns a page carrying the correct link, so extracting that link with a regular expression lets us fetch the data correctly.
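To make the failure concrete, here is a minimal sketch of pulling the real URL out of such a stub (the stub string below is invented for illustration; only the window.location.href pattern comes from the observed response). Note that requests cannot follow this redirect on its own: it happens in JavaScript rather than via an HTTP 3xx status, so allow_redirects has no effect.

# Minimal sketch: extract the real URL from a JS redirect stub.
# The stub below is a made-up example of the shape described above.
import re

stub = '<HTML><HEAD><script>window.location.href="https://bj.5i5j.com/ershoufang/xichengqu/n1/";</script></HEAD><BODY>'
match = re.search(r'window\.location\.href="(.+?)"', stub)
if match:
    print(match.group(1))  # fetch this URL to get the actual listing page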

2. Solution

The complete code below takes the following approach:

1. First, check whether the current html actually contains listing data.

2. If it does not, extract the correct link with a regular expression.

3. Re-fetch the html from that link, as the snippet below shows.

if '<HTML><HEAD><script>window.location.href=' in html:
    url = re.search(r'window\.location\.href="(.+?)"', html).group(1)
    html = requests.get(url, headers=headers).text
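A slightly more robust check (my suggestion, not from the original article) is to test for the listing container itself rather than the exact stub prefix, since pList is the class of the ul that holds the listings on a real data page:

if 'pList' not in html:  # no listing container, so this is a redirect stub
    match = re.search(r'window\.location\.href="(.+?)"', html)
    if match:
        html = requests.get(match.group(1), headers=headers).text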

3. Complete Code

# -*-coding:utf-8-*-
import os
import re
import requests
import csv
import time
from bs4 import BeautifulSoup

folder_path = os.path.split(os.path.abspath(__file__))[0] + os.sep  # directory containing this script
cookie = 'PHPSESSID=aivms4ufg15sbrj0qgboo3c6gj; HMF_CI=4d8ff20092e9832daed8fe5eb0475663812603504e007aca93e6630c00b84dc207; _ga=GA1.2.556271139.1620784679; gr_user_id=4c878c8f-406b-46a0-86ee-a9baf2267477; _dx_uzZo5y=68b673b0aaec1f296c34e36c9e9d378bdb2050ab4638a066872a36f781c888efa97af3b5; smidV2=20210512095758ff7656962db3adf41fa8fdc8ddc02ecb00bac57209becfaa0; yfx_c_g_u_id_10000001=_ck21051209583410015104784406594; __TD_deviceId=41HK9PMCSF7GOT8G; zufang_cookiekey=["%7B%22url%22%3A%22%2Fzufang%2F_%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%3Fzn%3D%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E9%95%BF%E6%98%A5%E6%A1%A5%22%2C%22total%22%3A%220%22%7D","%7B%22url%22%3A%22%2Fzufang%2F_%25E8%258B%258F%25E5%25B7%259E%25E8%25A1%2597%3Fzn%3D%25E8%258B%258F%25E5%25B7%259E%25E8%25A1%2597%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E8%8B%8F%E5%B7%9E%E8%A1%97%22%2C%22total%22%3A%220%22%7D","%7B%22url%22%3A%22%2Fzufang%2F_%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%3Fzn%3D%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E8%8B%8F%E5%B7%9E%E6%A1%A5%22%2C%22total%22%3A%220%22%7D"]; ershoufang_cookiekey=["%7B%22url%22%3A%22%2Fzufang%2F_%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%3Fzn%3D%25E9%2595%25BF%25E6%2598%25A5%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E9%95%BF%E6%98%A5%E6%A1%A5%22%2C%22total%22%3A%220%22%7D","%7B%22url%22%3A%22%2Fershoufang%2F_%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%3Fzn%3D%25E8%258B%258F%25E5%25B7%259E%25E6%25A1%25A5%22%2C%22x%22%3A%220%22%2C%22y%22%3A%220%22%2C%22name%22%3A%22%E8%8B%8F%E5%B7%9E%E6%A1%A5%22%2C%22total%22%3A%220%22%7D"]; zufang_BROWSES=501465046,501446051,90241951,90178388,90056278,90187979,501390110,90164392,90168076,501472221,501434480,501480593,501438374,501456072,90194547,90223523,501476326,90245144; historyCity=["\u5317\u4eac"]; _gid=GA1.2.23153704.1621410645; Hm_lvt_94ed3d23572054a86ed341d64b267ec6=1620784715,1621410646; _Jo0OQK=4958FA78A5CC420C425C480565EB46670E81832D8173C5B3CFE61303A51DE43E320422D6C7A15892C5B8B66971ED1B97A7334F0B591B193EBECAAB0E446D805316B26107A0B847CA53375B268E06EC955BB75B268E06EC955BB9D992FB153179892GJ1Z1OA==; ershoufang_BROWSES=501129552; domain=bj; 8fcfcf2bd7c58141_gr_session_id=61676ce2-ea23-4f77-8165-12edcc9ed902; 8fcfcf2bd7c58141_gr_session_id_61676ce2-ea23-4f77-8165-12edcc9ed902=true; yfx_f_l_v_t_10000001=f_t_1620784714003__r_t_1621471673953__v_t_1621474304616__r_c_2; Hm_lpvt_94ed3d23572054a86ed341d64b267ec6=1621475617'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36',
    'Cookie': cookie.encode("utf-8").decode("latin1")  # requests sends headers as latin-1; re-encoding keeps the non-ASCII cookie intact
}


def get_page(url):
    """Fetch the raw HTML of a page."""
    html = requests.get(url, headers=headers).text
    return html


def extract_info(html):
    """解析網頁數據,抽取出房源相關信息"""
    host = 'https://bj.5i5j.com'
    soup = BeautifulSoup(html, 'lxml')
    data = []
    box = soup.find('div', class_='list-con-box')
    if box is None:  # not a listing page; return no rows and let the caller decide
        return data
    for li in box.find('ul', class_='pList').find_all('li'):
        try:
            title = li.find('h3', class_='listTit').get_text()  # listing title
            url = host + li.find('h3', class_='listTit').a['href']  # listing link
            info_li = li.find('div', class_='listX')  # the core info for each listing lives here
            p1 = info_li.find_all('p')[0].get_text()  # first paragraph
            info1 = [i.strip() for i in p1.split('  ·  ')]
            # layout, area, orientation, floor, decoration, year built
            house_type, area, direction, floor, decoration, build_year = info1
            p2 = info_li.find_all('p')[1].get_text()  # second paragraph
            info2 = [i.replace(' ', '') for i in p2.split('·')]
            # residential compound, ring road, transport info
            if len(info2) == 2:
                residence, ring = info2
                transport = ''  # some listings have no transport info
            elif len(info2) == 3:
                residence, ring, transport = info2
            else:
                residence, ring, transport = ['', '', '']
            p3 = info_li.find_all('p')[2].get_text()  # third paragraph
            info3 = [i.replace(' ', '') for i in p3.split('·')]
            # followers, viewings, listing date
            try:
                watch, arrive, release_year = info3
            except Exception as e:
                print(info3, 'failed to parse followers/viewings/listing date')
                watch, arrive, release_year = ['', '', '']
            total_price = li.find('p', class_='redC').get_text().strip()  # total price
            univalence = li.find('div', class_='jia').find_all('p')[1].get_text().replace('单价', '')  # unit price; '单价' is the site's simplified-Chinese label
            else_info = li.find('div', class_='listTag').get_text()  # listing tags
            data.append([title, url, house_type, area, direction, floor, decoration, residence, ring,
                         transport, total_price, univalence, build_year, release_year, watch, arrive, else_info])
        except Exception as e:
            print('extract_info: ', e)
    return data


def crawl():
    esf_url = 'https://bj.5i5j.com/ershoufang/'  # listings home page
    # the header must match the 17 columns written for each row in extract_info
    fields = ['名稱', '鏈接', '戶型', '面積', '朝向', '樓層', '裝修', '小區', '環', '交通情況', '總價', '單價',
              '建成時間', '發布時間', '關注', '帶看', '其他信息']
    os.makedirs(folder_path + 'data', exist_ok=True)  # make sure the output directory exists
    f = open(folder_path + 'data' + os.sep + '北京二手房-我愛我家.csv', 'w', newline='', encoding='gb18030')
    writer = csv.writer(f, delimiter=',')  # comma-delimited
    writer.writerow(fields)
    page = 1
    regex = re.compile(r'window\.location\.href="(.+?)"')  # pulls the real URL out of a redirect stub
    while True:
        url = esf_url + 'n%s/' % page  # build the page link
        if page == 1:
            url = esf_url
        html = get_page(url)
        # some page links return a JS redirect stub instead of data; detect the stub,
        # pull the real link out of the html, and fetch it again
        if '<HTML><HEAD><script>window.location.href=' in html:
            url = regex.search(html).group(1)
            html = get_page(url)
        print(url)
        data = extract_info(html)
        if not data:
            break  # an empty page means we have walked past the last listing page
        writer.writerows(data)
        page += 1
        time.sleep(0.5)  # small delay between pages to be polite to the server
    f.close()


if __name__ == '__main__':
    crawl()  # start the crawler
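One practical caveat: get_page has no timeout or retry logic, and a crawl over thousands of pages will eventually hit a transient network error. Below is a hedged sketch of a hardened variant that could replace it; the retry count, timeout, and backoff values are illustrative assumptions, and it reuses the requests, time, and headers already defined in the script above.

def get_page_safe(url, retries=3, timeout=10):
    """Fetch a URL with a timeout, retrying on transient errors (sketch)."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=timeout)
            resp.raise_for_status()  # treat HTTP 4xx/5xx as failures too
            return resp.text
        except requests.RequestException as e:
            print('get_page_safe attempt %d failed: %s' % (attempt + 1, e))
            time.sleep(2 ** attempt)  # simple exponential backoff
    return ''  # empty string after all retries fail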

4. Results

As of May 23, 2021, the crawler had collected 62,943 records, essentially all of the second-hand housing listings for Beijing on the 我愛我家 site.

This concludes this article on scraping 我愛我家 second-hand housing data with a Python crawler. For more on scraping housing data with Python, search 腳本之家's earlier articles or browse the related articles below, and please continue to support 腳本之家!

You may also be interested in:
  • A hands-on tutorial on scraping Beike (貝殼) housing data with Python
  • Scraping Suzhou second-hand housing transaction data with Python Scrapy
  • Scraping Lianjia (鏈家) second-hand housing data with Python
  • Python crawler: scraping second-hand housing information
  • A code example for scraping Lianjia (鏈家) second-hand housing information with Python
  • Scraping rental listings for every city on 58.com with a Python crawler, explained
  • An introductory Python crawler case study: scraping second-hand housing data


  • 推薦文章