Face Recognition Project: The Crawler

I have recently been planning to build a face detection and recognition system, so I decided to spend some time on it and consolidate my knowledge along the way.

Today's post documents the crawler, whose job is to build up an image store for face detection. It is a Scrapy-based crawler I wrote a while ago in my spare time; the original database was lost along the way, so I am writing it up again and filing it under machine learning this time. It crawls images from the whole site. For learning purposes only!

items.py  # fields to extract

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class FengniaoItem(scrapy.Item):
    imgurl = scrapy.Field()
feng.py  # crawling and parsing logic

# -*- coding: utf-8 -*-
import scrapy

from fengniao.items import FengniaoItem


class FengSpider(scrapy.Spider):
    name = 'feng'
    allowed_domains = ['pp.fengniao.com']
    start = 1
    # handle pages that come back as 404 or 500 instead of dropping them
    handle_httpstatus_list = [404, 500]
    url = "https://pp.fengniao.com/album_13_"
    end = ".html"
    start_urls = [url + str(start) + end]

    def parse_item(self, response):
        if response.status in self.handle_httpstatus_list:
            # the album sub-page does not exist: move on to the next list page
            if self.start <= 3009:
                self.start += 1
                yield scrapy.Request(self.url + str(self.start) + self.end,
                                     callback=self.parse)
        else:
            i = FengniaoItem()
            for x in range(0, 10):
                i['imgurl'] = response.xpath(
                    '//*[@id="contentBox"]/div[' + str(x) + ']/div[1]/img/@src').extract()
                yield i

    def parse(self, response):
        # walk the album links on the current list page
        for n in range(1, 40):
            links = response.xpath(
                "//*[@id='contentBox']/ul/li[" + str(n) + "]/a/@href").extract()
            for link in links:
                # each album is split into sub-pages _0.html .. _4.html
                for j in range(0, 5):
                    link0 = link.replace('.html', '_' + str(j) + '.html')
                    yield scrapy.Request('https://pp.fengniao.com' + link0,
                                         callback=self.parse_item)
        # then move on to the next list page, up to page 3009
        if self.start <= 3009:
            self.start += 1
            yield scrapy.Request(self.url + str(self.start) + self.end,
                                 callback=self.parse)
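To make the crawl structure easier to follow, here is a small standalone sketch (not part of the spider) of the URL patterns it walks: numbered list pages album_13_1.html through album_13_3009.html, and for every album link found on a list page, five sub-pages _0.html to _4.html. The album path used below is hypothetical.

# standalone illustration of the URL patterns the spider generates
base = "https://pp.fengniao.com/album_13_"
list_pages = [base + str(n) + ".html" for n in range(1, 3010)]      # list pages 1..3009

album_link = "/album/12345.html"                                     # hypothetical link from a list page
sub_pages = [album_link.replace(".html", "_%d.html" % j) for j in range(0, 5)]
# -> /album/12345_0.html ... /album/12345_4.html, each handled by parse_item()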


pipelines.py  # storage pipeline

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
import codecs
import json

import requests

from fengniao import settings


class FengniaoPipeline(object):
    def __init__(self):
        self.filename = codecs.open('feng.json', "w", encoding='utf-8')

    def process_item(self, item, spider):
        if 'imgurl' in item:
            images = []
            dir_path = '%s/%s' % (settings.IMAGES_STORE, spider.name)
            if not os.path.exists(dir_path):
                os.makedirs(dir_path)
            for image_url0 in item['imgurl']:
                # strip the thumbnail suffix to get the full-size image URL
                image_url = image_url0[:-42]
                us = image_url.split('/')[3:]
                image_file_name = '_'.join(us)
                file_path = '%s/%s' % (dir_path, image_file_name)
                images.append(file_path)
                if os.path.exists(file_path):
                    continue
                # stream the image to disk in 1 KB blocks
                with open(file_path, 'wb') as handle:
                    response = requests.get(image_url, stream=True)
                    for block in response.iter_content(1024):
                        if not block:
                            break
                        handle.write(block)
            # record the local file paths and append the item to feng.json
            item['imgurl'] = images
            content = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.filename.write(content)
        return item

    def close_spider(self, spider):
        # called by Scrapy when the spider finishes
        self.filename.close()
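The pipeline above downloads files by hand with requests. Scrapy also ships a built-in ImagesPipeline that takes care of downloading, deduplication, and storage under IMAGES_STORE. A minimal sketch of that alternative is shown below; it is not what this project uses, it requires Pillow, and the field names image_urls and images are the pipeline's defaults rather than the imgurl field defined earlier.

# settings.py (alternative): let Scrapy's built-in ImagesPipeline do the downloading
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = './images'

# items.py (alternative): ImagesPipeline expects these default field names
import scrapy

class FengniaoItem(scrapy.Item):
    image_urls = scrapy.Field()   # list of image URLs to download
    images = scrapy.Field()       # filled in by the pipeline with download results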
settings.py  # project configuration

# -*- coding: utf-8 -*-

BOT_NAME = 'fengniao'

SPIDER_MODULES = ['fengniao.spiders']
NEWSPIDER_MODULE = 'fengniao.spiders'

IMAGES_STORE = './images'

ITEM_PIPELINES = {
    'fengniao.pipelines.FengniaoPipeline': 300,
}

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3236.0 Safari/537.36',
}
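Since the spider walks an entire site, it is worth throttling it a little. The lines below are standard Scrapy settings that could be appended to settings.py; the values are only illustrative, not what this project ran with.

# optional throttling, appended to settings.py (illustrative values)
ROBOTSTXT_OBEY = True                # respect the site's robots.txt
DOWNLOAD_DELAY = 0.5                 # wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap parallel requests per domain
RETRY_TIMES = 2                      # retry transient failures a couple of times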
main.py  # entry point for running the crawl

from scrapy import cmdline

cmdline.execute('scrapy crawl feng'.split())
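main.py simply shells out to the scrapy CLI. An equivalent way to run the spider in-process, using Scrapy's CrawlerProcess API, is sketched below as an alternative.

# main.py (alternative): run the spider in-process instead of via the CLI
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('feng')   # spider name as defined in FengSpider.name
process.start()         # blocks until the crawl finishes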

Project structure

(screenshot of the project layout, not reproduced here)
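Assuming the standard Scrapy project template (with main.py placed next to scrapy.cfg), the files above would sit roughly in this layout:

fengniao/
    scrapy.cfg
    main.py
    fengniao/
        __init__.py
        items.py          # FengniaoItem
        pipelines.py      # FengniaoPipeline
        settings.py
        spiders/
            __init__.py
            feng.py       # FengSpider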

Run

(screenshot of the crawl running, not reproduced here)

Results

(screenshot of the downloaded images, not reproduced here)

Note: the project runs in an Anaconda + Scrapy environment and works on Linux and Windows, with both Python 2 and Python 3 (personally tested).

Due to time constraints I stopped the crawl early; the spider is able to fetch everything in this series (tested). It was interrupted ahead of time, leaving only the sample below for testing.

(screenshot of the partial crawl output, not reproduced here)


Original post: https://itarvin.com/detail-18.aspx
