Scrapy
tispoint, 2016-12-18 01:00:44

How to log in with Scrapy?

Good afternoon.
I'm trying to write a script that is supposed to log in to a site and then parse material from it. I have never written scripts like this (with authorization) before, and I have a poor understanding of the mechanism. I used the example from https://doc.scrapy.org/en/latest/topics/logging.html, but something went wrong:

#! coding: utf-8
__author__ = 'iam'
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.item import Item, Field
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import TakeFirst
from scrapy.http import Request, FormRequest
from scrapy.selector import HtmlXPathSelector  # used in parse_item below

class ScrapyTestItem(Item):
    title = Field()
    url = Field()

class Test03Loader(XPathItemLoader):
    default_output_processor = TakeFirst()


class ScrapyTestSpider(CrawlSpider):
    name = "cr01"
    allowed_domains = ["ecom.elko.ru"]
    start_urls = ["https://ecom.elko.ru/Account/Login",
                  "https://ecom.elko.ru/Catalog/Category/SCO"
                  ]

    rules = (
        Rule(LinkExtractor(
            allow=('https://ecom.elko.ru/Catalog/Product/')),
             callback='parse_item', follow=False),

    )

    def parse(self, response):
        return [FormRequest.from_response(response,
                    formdata={'username': 'my_login', 'password': 'my_password'},
                    callback=self.after_login)]

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
            return

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        l = Test03Loader(ScrapyTestItem(), hxs)
        l.add_xpath('title', "//h1/text()")
        l.add_value('url', response.url)
        return l.load_item()

It gives the following output:
2016-12-17 23:46:21 [scrapy] INFO: Spider opened
2016-12-17 23:46:21 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-17 23:46:21 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-17 23:46:21 [scrapy] DEBUG: Redirecting (302) to <GET https://ecom.elko.ru/Account/Login?ReturnUrl=%2fCatalog%2fCategory%2fSCO> from <GET https://ecom.elko.ru/Catalog/Category/SCO>
2016-12-17 23:46:21 [scrapy] DEBUG: Crawled (200) <GET https://ecom.elko.ru/Account/Login> (referer: None)
2016-12-17 23:46:21 [scrapy] DEBUG: Crawled (200) <GET https://ecom.elko.ru/Account/Login?ReturnUrl=%2fCatalog%2fCategory%2fSCO> (referer: None)
2016-12-17 23:46:21 [scrapy] DEBUG: Crawled (200) <POST https://ecom.elko.ru/Account/Login> (referer: https://ecom.elko.ru/Account/Login)
2016-12-17 23:46:22 [scrapy] DEBUG: Crawled (200) <POST https://ecom.elko.ru/Account/Login?ReturnUrl=%2fCatalog%2fCategory%2fSCO> (referer: https://ecom.elko.ru/Account/Login?ReturnUrl=%2fCatalog%2fCategory%2fSCO)
2016-12-17 23:46:22 [scrapy] INFO: Closing spider (finished)
2016-12-17 23:46:22 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2365,
 'downloader/request_count': 5,
 'downloader/request_method_count/GET': 3,
 'downloader/request_method_count/POST': 2,
 'downloader/response_bytes': 19527,
 'downloader/response_count': 5,
 'downloader/response_status_count/200': 4,
 'downloader/response_status_count/302': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 12, 17, 20, 46, 22, 105000),
 'log_count/DEBUG': 6,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 4,
 'scheduler/dequeued': 5,
 'scheduler/dequeued/memory': 5,
 'scheduler/enqueued': 5,
 'scheduler/enqueued/memory': 5,
 'start_time': datetime.datetime(2016, 12, 17, 20, 46, 21, 433000)}
2016-12-17 23:46:22 [scrapy] INFO: Spider closed (finished)


2 answer(s)
DannyFork, 2016-12-18
@tispoint

The URLs listed in start_urls are requested asynchronously: you send the request to the login page and to the content page at the same time.
Your main problem is the wrong POST request. Instead of https://ecom.elko.ru/Account/Login it should be

https://ecom.elko.ru/Account/Login?ReturnUrl=%2fCatalog%2fCategory%2fSCO

As for the authorization itself, here is working code that passes it:
import scrapy
from scrapy.contrib.spiders import CrawlSpider
from scrapy.item import Field
from scrapy.http import FormRequest

class ScrapyTestItem(scrapy.Item):
    title = Field()
    url = Field()

class ScrapyTestSpider(CrawlSpider):
    name = "catalog"

    def start_requests(self):
        # The POST must go to the login URL that carries ReturnUrl.
        return [
            FormRequest(
                "https://ecom.elko.ru/Account/Login?ReturnUrl=%2fCatalog%2fCategory%2fSCO",
                formdata={"Username": "your_login", "Password": "your_password"}
            )]

    def parse(self, response):
        print(response.url)
        # Parse this page or send requests to other pages.

The login then redirects to the catalog page ecom.elko.ru/Catalog/Category/SCO:
2016-12-18 12:32:55 [scrapy] DEBUG: Redirecting (302) to <GET https://ecom.elko.ru/Catalog/Category/SCO> from <POST https://ecom.elko.ru/Account/Login?ReturnUrl=%2fCatalog%2fCategory%2fSCO>
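
A side note on the original attempt with FormRequest.from_response: fetching the login page first and building the POST from the actual form lets Scrapy copy any hidden inputs (ASP.NET sites often include an anti-forgery token). A minimal sketch, assuming the field names Username/Password as above; the spider name and log message are illustrative:

import scrapy
from scrapy.http import FormRequest

class LoginFromResponseSpider(scrapy.Spider):
    name = "login_from_response"  # hypothetical name
    start_urls = ["https://ecom.elko.ru/Account/Login?ReturnUrl=%2fCatalog%2fCategory%2fSCO"]

    def parse(self, response):
        # Build the POST from the real form so hidden fields are included.
        yield FormRequest.from_response(
            response,
            formdata={"Username": "your_login", "Password": "your_password"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Scrapy has already followed the 302; if login succeeded this is
        # the catalog page, and the session cookie is kept automatically.
        self.logger.info("Landed on %s", response.url)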

tispoint, 2016-12-18
@tispoint

Tell me, looking at this log:
2016-12-18 13:12:02 [scrapy] INFO: Spider opened
2016-12-18 13:12:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-18 13:12:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-18 13:12:03 [scrapy] DEBUG: Redirecting (302) to <GET https://ecom.elko.ru/Catalog/Category/SCO> from <POST https://ecom.elko.ru/Account/Login?ReturnUrl=%2fCatalog%2fCategory%2fSCO>
2016-12-18 13:12:05 [scrapy] DEBUG: Crawled (200) <GET https://ecom.elko.ru/Catalog/Category/SCO> (referer: None)
2016-12-18 13:12:05 [scrapy] INFO: Closing spider (finished)
Are these sequential operations of a single spider instance (I don't know what to call it exactly)?
That is, am I right in thinking that the spider finds no links on the page https://ecom.elko.ru/Catalog/Category/SCO and terminates?
Where is my error?

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.item import Field
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import TakeFirst
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector  # used in parse_item below

class ScrapyTestItem(scrapy.Item):
    title = Field()
    url = Field()

class Test03Loader(XPathItemLoader):
    default_output_processor = TakeFirst()

class ScrapyTestSpider(CrawlSpider):
    name = "catalog"

    rules = (
        Rule(LinkExtractor(
            allow=('https://ecom.elko.ru/Catalog/Product/')),
             callback='parse_item', follow=False),
        Rule(LinkExtractor(
            allow=('https://ecom.elko.ru/Catalog/Category/')),
             follow=True),
    )

    def start_requests(self):
        return [
            FormRequest(
                "https://ecom.elko.ru/Account/Login?ReturnUrl=%2fCatalog%2fCategory%2fSCO",
                formdata={"Username": "tiscom6", "Password": "6307860"}
            )]

    def parse(self, response):
        print(response.url)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        l = Test03Loader(ScrapyTestItem(), hxs)
        l.add_xpath('title', "//h1/text()")
        l.add_value('url', response.url)
        return l.load_item()
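
For reference, the likely culprit: CrawlSpider uses parse() internally to apply its rules, and the Scrapy docs warn against overriding parse in crawl spiders. With parse() overridden as above, the redirected catalog response never reaches the rule machinery, so no links are extracted. A minimal sketch without the override (credentials, spider name, and the h1 XPath are assumptions carried over from the code above):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import FormRequest

class CatalogFixedSpider(CrawlSpider):
    name = "catalog_fixed"  # hypothetical name

    rules = (
        Rule(LinkExtractor(allow=(r'/Catalog/Product/',)),
             callback='parse_item', follow=False),
        Rule(LinkExtractor(allow=(r'/Catalog/Category/',)),
             follow=True),
    )

    def start_requests(self):
        # No explicit callback: after the 302 redirect, the catalog page is
        # handled by CrawlSpider's own parse(), which applies the rules.
        return [FormRequest(
            "https://ecom.elko.ru/Account/Login?ReturnUrl=%2fCatalog%2fCategory%2fSCO",
            formdata={"Username": "your_login", "Password": "your_password"},
        )]

    def parse_item(self, response):
        # Plain XPath instead of the deprecated loader classes.
        yield {
            'title': response.xpath('//h1/text()').extract_first(),
            'url': response.url,
        }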
