Why is the data not coming to the parser?


Python

David5, 2021-03-23 08:28:56

I wrote a small parser. Data comes back for the second link, but the result for the first link is an empty list. The response status is 200, yet there is no data. Maybe this is because of the Russian (Cyrillic) domain, although I converted it to its ASCII (Punycode) form through an online service.

import lxml.html
import requests

# Each entry: [page URL, XPath for the title, XPath for the link, base URL]
checklist = {
    'лип' : ['https://xn--80aacoonefzg3am8b1fsb.xn--p1ai/news', '//*[@id="news__area-blocks"]/a[1]/div/div[3]/text()', '//*[@id="news__area-blocks"]/a[1]/@href', 'https://xn--80aacoonefzg3am8b1fsb.xn--p1ai/news'],
    'следственный комитет' : ['https://lipetsk.sledcom.ru', '//*[@id="news_tab-1"]/div[1]/div[1]/div[2]/div[3]/a/text()', '//*[@id="news_tab-1"]/div[1]/div[1]/div[2]/div[3]/a/@href', 'https://lipetsk.sledcom.ru']
}

USER_AGENT = ('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:50.0) '
              'Gecko/20100101 Firefox/50.0')


def get_titles(checklist):
    # Yield (titles, links) as matched by the two XPath expressions of each site
    for url, title_xpath, link_xpath, _base_url in checklist.values():
        html_text = requests.get(url, headers={'User-Agent': USER_AGENT}).text

        tree = lxml.html.document_fromstring(html_text)

        text_titles = tree.xpath(title_xpath)
        text_link = tree.xpath(link_xpath)

        yield text_titles, text_link


for i in get_titles(checklist):
    print(i)


1 answer(s)


nullnull, 2021-03-26
@David5

One possibility is that the first site loads its data through JavaScript (or whatever is fashionable there now), in which case a parser like this simply cannot receive it. Another is that the site serves you a page meant for robots, even though you send a User-Agent.
Put a breakpoint on the line "tree = lxml.html.document_fromstring(html_text)" and step through it in the debugger.
Look at what html_text contains: is everything okay with it?
Then open "View page source" in the browser and check whether what you are looking for is there.
If it is all there, the error is in your code or in the XPath expression. If it is not in html_text, you cannot get it this way :) You will have to change the approach or the tools for parsing.
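
If you would rather not use a debugger, the same check can be roughly sketched in code. This is only an illustration: the helper name inspect_html and the two sample pages are made up here, and the script-tag count is a hint, not proof that the content is rendered by JavaScript.

```python
import re


def inspect_html(html_text, element_id):
    """Summarize raw HTML so it can be compared with the browser's page source."""
    # Match id="..." or id='...' in the unparsed markup
    id_pattern = r'id\s*=\s*["\']' + re.escape(element_id) + r'["\']'
    return {
        "length": len(html_text),
        "has_element": re.search(id_pattern, html_text) is not None,
        "script_tags": len(re.findall(r"<script\b", html_text, flags=re.IGNORECASE)),
    }


# Hypothetical samples: a JavaScript shell page and a server-rendered page
js_page = '<html><body><div id="app"></div><script src="bundle.js"></script></body></html>'
static_page = '<html><body><div id="news__area-blocks"><a href="/news/1">Title</a></div></body></html>'

print(inspect_html(js_page, "news__area-blocks"))
print(inspect_html(static_page, "news__area-blocks"))
```

In the parser from the question you would call inspect_html(html_text, "news__area-blocks") right after requests.get(...). If has_element is False, the block is not in the response at all, so no XPath will find it.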