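This is the final result, shown up front. I'm sure most of you are familiar with the TV series 狂飙 (The Knockout). Today we'll use a Python crawler to download photos of its main characters. Without further ado, here's how to do it.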

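Open Baidu.

Enter the keyword for the images you want. Here I'll use 狂飙 as the example.

You'll see a results page like this. A programmer, though, cares about what sits underneath the page: right-click, choose Inspect, and capture the requests the page sends.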

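After looking around for a while, I found that one of the entries in the captured response holds the image link (the code below reads it as thumbURL).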
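To make that concrete, here is a minimal sketch of pulling the image links out of one response. The sample payload is made up for illustration; only the `data` list and the `thumbURL` field are taken from the scraper code in this post, and `extract_thumb_urls` is a hypothetical helper name.

```python
import json

# Hypothetical, trimmed-down response body; a real one carries many more fields.
sample_body = '''
{"data": [
    {"thumbURL": "https://example.com/a.jpg", "fromPageTitleEnc": "poster one"},
    {"fromPageTitleEnc": "entry without a link"},
    {}
]}
'''


def extract_thumb_urls(body):
    """Return every non-empty thumbURL found in the response's data list."""
    items = json.loads(body).get("data") or []
    return [item["thumbURL"] for item in items if item.get("thumbURL")]


thumb_urls = extract_thumb_urls(sample_body)
print(thumb_urls)
```

Entries without a link are skipped rather than treated as errors, which is the same choice the download loop makes later on.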

That's all the investigation we need. Now let's write the code.

import requests
import urllib.parse as up
import json
import time
import os

# Base URL of the Baidu image search endpoint; the query string is appended to it below.
major_url = 'https://image.baidu.com/search/index?'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'}


def pic_spider(kw, page=10, file_path=os.getcwd()):
    """Download `page` pages of image results for keyword `kw` into file_path/kw."""
    path = os.path.join(file_path, kw)
    if not os.path.exists(path):
        os.mkdir(path)
    if kw != '':
        for num in range(page):
            data = {
                "tn": "resultjson_com",
                "logid": "11587207680030063767",
                "ipn": "rj",
                "ct": "201326592",
                "is": "",
                "fp": "result",
                "queryWord": kw,
                "cl": "2",
                "lm": "-1",
                "ie": "utf-8",
                "oe": "utf-8",
                "adpicid": "",
                "st": "-1",
                "z": "",
                "ic": "0",
                "hd": "",
                "latest": "",
                "copyright": "",
                "word": kw,
                "s": "",
                "se": "",
                "tab": "",
                "width": "",
                "height": "",
                "face": "0",
                "istype": "2",
                "qc": "",
                "nc": "1",
                "fr": "",
                "expermode": "",
                "force": "",
                "pn": num*30,
                "rn": "30",
                "gsm": oct(num*30),
                "1602481599433": ""
            }
            url = major_url + up.urlencode(data)
            i = 0
            pic_list = []
            while i < 5:
                try:
                    pic_list = requests.get(url=url, headers=headers).json().get('data')
                    break
                except Exception:
                    print('Network problem, retrying...')
                    i += 1
                    time.sleep(1.3)
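The retry loop above can also be pulled out into a small helper so it is easier to reuse and to test. This is only a sketch: `fetch_with_retry` and `flaky` are hypothetical names, and a stand-in function simulates the failing request so the example runs without touching the network.

```python
import time


def fetch_with_retry(fetch, retries=5, delay=1.3):
    """Call fetch() until it succeeds or the retries run out; return None on failure."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc}")
            if attempt < retries:
                time.sleep(delay)  # pause before the next attempt
    return None


# Stand-in for the real request: fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network error")
    return ["ok"]


result = fetch_with_retry(flaky, delay=0)
print(result, calls["n"])
```

In the scraper, `fetch` would wrap the `requests.get(...).json().get('data')` call. Returning None on failure means the caller has to handle the empty case explicitly.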

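First settle on the URL, then on the request parameters in data. This set of parameters is not the only one that works; there are many variants. Here the search keyword is passed in through kw.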
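As a rough illustration of how the keyword and the page number end up in the URL, here is a sketch with a reduced parameter set. `build_page_url` is a hypothetical helper, and whether the server accepts this smaller set of parameters is an assumption that has not been verified; the point is only the encoding and the pagination arithmetic.

```python
import urllib.parse as up

BASE_URL = "https://image.baidu.com/search/index?"  # same value as major_url in the post


def build_page_url(kw, page_index, page_size=30):
    """Build the query URL for one page of results (reduced, illustrative parameters)."""
    params = {
        "tn": "resultjson_com",
        "ipn": "rj",
        "queryWord": kw,
        "word": kw,
        "ie": "utf-8",
        "oe": "utf-8",
        "pn": page_index * page_size,  # offset of the first result on this page
        "rn": page_size,               # number of results per page
    }
    # urlencode percent-encodes non-ASCII keywords as UTF-8 by default.
    return BASE_URL + up.urlencode(params)


url = build_page_url("狂飙", 2)
print(url)
```

Page 0 starts at offset 0, page 1 at 30, page 2 at 60, and so on, which matches the `num*30` expression used for `pn` in the full parameter dictionary.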

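            # Continues inside the `for num in range(page)` loop of pic_spider.
            for pic in pic_list or []:  # 'data' can be missing, or every retry may have failed
                url = pic.get('thumbURL', '')  # some entries have no image link; default to an empty string
                if url == '':
                    continue
                name = pic.get('fromPageTitleEnc') or 'untitled'  # fallback name added here in case the title is missing
                for char in ['?', '\\', '/', '*', '"', '|', ':', '<', '>']:
                    name = name.replace(char, '')  # strip characters that are not allowed in file names
                pic_type = pic.get('type', 'jpg')  # image type; defaults to jpg when missing
                pic_path = os.path.join(path, '%s.%s' % (name, pic_type))
                if not os.path.exists(pic_path):
                    with open(pic_path, 'wb') as f:
                        f.write(requests.get(url=url, headers=headers).content)
                    print(name, 'downloaded')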
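The file name clean-up can be isolated into a small function, which makes it easy to check without downloading anything. This is a sketch, not the code used above: `safe_pic_path`, the `untitled` fallback, and the regular expression are assumptions, although the set of stripped characters is the same as in the loop.

```python
import os
import re

# Same characters the loop above removes: ? \ / * " | : < >
INVALID_CHARS = re.compile(r'[?\\/*"|:<>]')


def safe_pic_path(folder, title, ext="jpg", fallback="untitled"):
    """Build a file path from a page title, dropping characters that are invalid in file names."""
    name = INVALID_CHARS.sub("", title or "").strip() or fallback
    return os.path.join(folder, f"{name}.{ext}")


print(safe_pic_path("out", 'poster: "final"?'))
print(safe_pic_path("out", None))
```

Note that two different titles can still clean up to the same name. The download loop then skips the second one because the file already exists, so some images may be silently dropped.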

That's basically it. I'll put the complete code below. If you found this useful, give it a like!

For more content, follow the 科象教育 WeChat official account. 科象教育: helping you become a chief digital officer!