Python Crawler in Practice: Scraping Douban Images with Scrapy

Use Scrapy to scrape all of a film star's personal photos on Douban.

We'll take Monica Bellucci as the example.

1. First, open a command line in the directory where the project should live and run scrapy startproject banciyuan to create the Scrapy project.

The generated project structure is as follows.
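For reference, this is the default layout scrapy startproject produces (main.py and the spider file are added in the following steps):

banciyuan/
    scrapy.cfg
    banciyuan/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py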

2. To make it convenient to run the Scrapy project from PyCharm, create a new main.py:

from scrapy import cmdline

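# equivalent to running 'scrapy crawl banciyuan' from the project root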
cmdline.execute("scrapy crawl banciyuan".split())

Then open Edit Configurations in PyCharm and point the run configuration at main.py, with the working directory set to the project root. Once that is set up, you can run the Scrapy project simply by running main.py.

3. Analyze the HTML of the photos page and create the corresponding spider:

from scrapy import Spider
import scrapy

from banciyuan.items import BanciyuanItem


class BanciyuanSpider(Spider):
    name = 'banciyuan'
    allowed_domains = ['movie.douban.com']
    start_urls = ["https://movie.douban.com/celebrity/1025156/photos/"]
    url = "https://movie.douban.com/celebrity/1025156/photos/"

    def parse(self, response):
        # the last link in the paginator holds the number of listing pages
        num = response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
        print(num)
        for i in range(int(num)):
            # each listing page shows 30 photos, paged via the "start" query parameter
            suffix = '?type=C&start=' + str(i * 30) + '&sortby=like&size=a&subtype=a'
            yield scrapy.Request(url=self.url + suffix, callback=self.get_page)

    def get_page(self, response):
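        # each photo cover on the listing page links to its own detail page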
        href_list = response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()
        # print(href_list)
        for href in href_list:
            yield scrapy.Request(url=href, callback=self.get_info)

    def get_info(self, response):
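        # grab the main photo on the detail page, plus the page title (used later for the save path)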
        src = response.xpath(
            '//div[@class="article"]//div[@class="photo-show"]//div[@class="photo-wp"]/a[1]/img/@src').extract_first('')
        title = response.xpath('//div[@id="content"]/h1/text()').extract_first('')
        # print(response.body)
        item = BanciyuanItem()
        item['title'] = title
        item['src'] = [src]
        yield item
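Before running the full crawl, you can sanity-check the XPath expressions interactively with scrapy shell, using the same selectors as the spider above:

scrapy shell "https://movie.douban.com/celebrity/1025156/photos/"
>>> response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
>>> response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()

If the first expression returns the page count and the second a list of detail-page URLs, the spider's parsing logic should behave as expected.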

4. items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BanciyuanItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()
    title = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline
import scrapy

class BanciyuanPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
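        # request the image URL stored on the item, passing the item along via meta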
        yield scrapy.Request(url=item['src'][0], meta={'item': item})

    def file_path(self, request, response=None, info=None, *, item=None):
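        # save each image under a folder named after the first token of the photo page title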
        item = request.meta['item']
        image_name = item['src'][0].split('/')[-1]
        # image_name.replace('.webp', '.jpg')
        path = '%s/%s' % (item['title'].split(' ')[0], image_name)

        return path
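With IMAGES_STORE set to ./images (see settings.py below), an image whose detail-page title starts with the star's name and whose URL ends in, say, p2233.jpg (a hypothetical file name) is saved as ./images/<first word of the title>/p2233.jpg.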

settings.py

# Scrapy settings for banciyuan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'banciyuan'

SPIDER_MODULES = ['banciyuan.spiders']
NEWSPIDER_MODULE = 'banciyuan.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'


# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'banciyuan.pipelines.BanciyuanPipeline': 1,
}
IMAGES_STORE = './images'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
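Note that Scrapy's image pipeline depends on the Pillow library; install it with pip install Pillow beforehand, or the pipeline will be disabled at startup.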

5. Crawl results

Reference

Source code

This concludes this article on using Scrapy to scrape Douban images in Python. For more on scraping images with Scrapy, search 腳本之家's earlier articles or browse the related articles below, and we hope you'll continue to support 腳本之家!

Articles you may be interested in:
  • Python crawler: scraping images with Scrapy
  • An example implementation of scraping and saving website images in Python
  • Building a Weibo image-scraping tool in Python
  • Bypassing slider CAPTCHAs in Python to scrape every PTA problem (source code included)
  • How to batch-scrape Baidu images of any category with Python
  • Scraping images in Python with XPath
  • How Scrapy image scraping works, with code examples
  • A Python 3 example of scraping image URLs directly and saving them
  • Scraping a site's full-resolution images as wallpapers with Python
  • Making a Bilibili word-cloud dancing video with Python
