selenium - Defer parts of a scrape in Scrapy
I have the parse method given below. It uses Selenium to load the first page, visit pages that cannot be reached by scraping directly from the spider, and collect individual URLs for a second parse method that extracts items from those pages. The problem is that this parse method blocks all other parsing until its pages have been visited, which chokes the system. I tried adding a sleep, but that stops the engine altogether, and my other parse method never runs.
Any pointers on how to optimize this, or at least on how to make the sleep work without stopping the engine?
def parse(self, response):
    '''Parse the first page and extract page links.'''
    item_link_xpath = "/html/body/form/div[@class='wrapper']//a[@title='view & apply']"
    pagination_xpath = "//div[@class='pagination']/input"
    page_xpath = pagination_xpath + "[@value=%d]"

    display = Display(visible=0, size=(800, 600))
    display.start()
    browser = webdriver.Firefox()
    browser.get(response.url)
    log.msg('Loaded search results', level=log.DEBUG)

    page_no = 1
    while True:
        log.msg('Scraping page: %d' % page_no, level=log.DEBUG)
        for link in [item_link.get_attribute('href')
                     for item_link in browser.find_elements_by_xpath(item_link_xpath)]:
            yield Request(link, callback=self.parse_item_page)

        page_no += 1
        log.msg('Using XPath: %s' % (page_xpath % page_no), level=log.DEBUG)
        page_element = browser.find_element_by_xpath(page_xpath % page_no)
        if not page_element or page_no > settings['PAGINATION_PAGES']:
            break
        page_element.click()
        if settings['PAGINATION_SLEEP_INTERVAL']:
            seconds = int(settings['PAGINATION_SLEEP_INTERVAL'])
            log.msg('Sleeping %d' % seconds, level=log.DEBUG)
            time.sleep(seconds)

    log.msg('Scraped listing pages, closing browser.', level=log.DEBUG)
    browser.close()
    display.stop()
This may help:
# delayspider.py
from scrapy.spider import BaseSpider
from twisted.internet import reactor, defer
from scrapy.http import Request

DELAY = 5  # seconds

class MySpider(BaseSpider):
    name = 'wikipedia'
    max_concurrent_requests = 1
    start_urls = ['http://www.wikipedia.org']

    def parse(self, response):
        nextreq = Request('http://en.wikipedia.org')
        dfd = defer.Deferred()
        reactor.callLater(DELAY, dfd.callback, nextreq)
        return dfd
Output:
$ scrapy runspider delayspider.py
2012-05-24 11:01:54-0300 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-05-24 11:01:54-0300 [scrapy] DEBUG: Enabled item pipelines:
2012-05-24 11:01:54-0300 [wikipedia] INFO: Spider opened
2012-05-24 11:01:54-0300 [wikipedia] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-05-24 11:01:54-0300 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-05-24 11:01:54-0300 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-05-24 11:01:56-0300 [wikipedia] DEBUG: Crawled (200) <GET http://www.wikipedia.org> (referer: None)
2012-05-24 11:02:04-0300 [wikipedia] DEBUG: Redirecting (301) to <GET http://en.wikipedia.org/wiki/Main_Page> from <GET http://en.wikipedia.org>
2012-05-24 11:02:06-0300 [wikipedia] DEBUG: Crawled (200) <GET http://en.wikipedia.org/wiki/Main_Page> (referer: http://www.wikipedia.org)
2012-05-24 11:02:11-0300 [wikipedia] INFO: Closing spider (finished)
2012-05-24 11:02:11-0300 [wikipedia] INFO: Dumping spider stats:
    {'downloader/request_bytes': 745,
     'downloader/request_count': 3,
     'downloader/request_method_count/GET': 3,
     'downloader/response_bytes': 29304,
     'downloader/response_count': 3,
     'downloader/response_status_count/200': 2,
     'downloader/response_status_count/301': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 5, 24, 14, 2, 11, 447498),
     'request_depth_max': 2,
     'scheduler/memory_enqueued': 3,
     'start_time': datetime.datetime(2012, 5, 24, 14, 1, 54, 408882)}
2012-05-24 11:02:11-0300 [wikipedia] INFO: Spider closed (finished)
2012-05-24 11:02:11-0300 [scrapy] INFO: Dumping global stats: {}
It uses Twisted's callLater to sleep: instead of blocking, parse returns a Deferred, and the reactor fires that Deferred with the next Request after the delay, so the engine keeps running in the meantime.
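The same non-blocking idea can be sketched with the standard library alone: schedule a callback to fire after the delay instead of sleeping in place. Here `threading.Timer` stands in for `reactor.callLater`; the names `schedule_later` and `handle_request` are illustrative only, not Scrapy or Twisted API:

```python
import threading
import time

fired = []

def schedule_later(delay, callback, arg):
    """Non-blocking analogue of reactor.callLater: start a timer that
    calls callback(arg) after `delay` seconds and return immediately."""
    timer = threading.Timer(delay, callback, args=(arg,))
    timer.start()
    return timer

def handle_request(url):
    # Stand-in for the work Scrapy does when the Deferred fires.
    fired.append(url)

start = time.time()
timer = schedule_later(0.2, handle_request, 'http://en.wikipedia.org')
returned_after = time.time() - start   # the caller was not blocked
timer.join()                           # wait here only so we can inspect the result
```

The Twisted version achieves the same effect without an extra thread: the reactor's own event loop invokes `dfd.callback(nextreq)` once the delay elapses, and Scrapy resumes the crawl when the Deferred fires.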