How to get data?

Python

lcd1232, 2015-11-08 15:01:56

import lxml.html as html
import requests
import time
url = "http://www.world-art.ru/animation/manga.php?id="
folder = ""
counter = 501
info = {}
page = html.parse(url+str(counter)).getroot()
info["name"] = page.xpath("html/body/table/tbody/tr[1]/td/center/table[7]/tbody/tr/td/table/tbody/tr/td[5]/table[2]/tbody/tr/td[3]/font[1]/b")[0].text
info["year"] = page.xpath("html/body/table/tbody/tr[1]/td/center/table[7]/tbody/tr/td/table/tbody/tr/td[5]/table[2]/tbody/tr/td[3]/b/font[1]")[0].text
info["name1"] = page.xpath("html/body/table/tbody/tr[1]/td/center/table[7]/tbody/tr/td/table/tbody/tr/td[5]/table[2]/tbody/tr/td[3]")[0].text
print(info["name1"])

As it stands, none of the elements is found. The path should be correct, because I copied it from FirePath. I don't know how else to get the data.

Update:

import lxml.html as html
import requests
import time
from lxml import etree
from lxml.html import HTMLParser
# url = "http://animanga.ru/default.aspx?a=book&id="
url = "http://www.world-art.ru/animation/manga.php?id="
folder = ""
counter = 520
info = {}
r = requests.get(url+str(counter))
if r.ok:
    page = etree.fromstring(r.text, parser=HTMLParser())

    name = page.xpath("//font/b")
    for element in name:
        if (element.text and element.text.find("манга")!=-1):
            string = element.text
            string = string[:string.find("(")-1]
            print(string)

    name_eng = page.xpath("//tr/td/text()")
    i = 1
    for element in name_eng:
        if (i==40):
            print(element)
        i += 1

    year = page.xpath("//font")
    for element in year:
        string = element.text
        if (string and string.isnumeric()):
            print(string)

I know the code is terrible and that it does not get name_eng, but at least it does something.

3 answer(s)

Dimonchik, 2015-11-08
@dimonchik2013

Download the page to a local file, then carefully read the article and watch the video; it will change your approach:
prostoitblog.ru/xpath-i-css/kak-sostavlyat-xpath-i ...
A path like /td/center/table[7]/tbody/tr/td/table/tbody/tr/td[5] is a very bad sign: one change to the page or to a table, and all the numbering goes off into the woods. Learn to compose expressions the way you compose regexps: build a path that matches on your site.
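As a rough illustration of that advice, here is a minimal sketch that selects elements by what they contain rather than by their position. The `SAMPLE` markup and the title in it are made up for the example and are much simpler than the real page; the only detail taken from the question is that the title contains the word "манга".

```python
from lxml import html

# Hypothetical, heavily simplified markup (not the real world-art.ru page).
SAMPLE = """
<html><body>
<table><tr><td>
  <table><tr>
    <td>menu</td>
    <td><font size="5"><b>Some Title (манга)</b></font><br>Some Title<br><font>1999</font></td>
  </tr></table>
</td></tr></table>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Match the title by its content, not by counting tables and cells.
title = tree.xpath('//font/b[contains(text(), "манга")]/text()')[0]

# Anchor on the cell that holds the title, then take its last <font>.
year = tree.xpath('//td[font/b[contains(text(), "манга")]]/font[last()]/text()')[0]

print(title)
print(year)
```

Neither expression depends on how many tables precede the cell, so an extra banner or menu row would not break them.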

baterson, 2015-11-08
@baterson

Use Beautiful Soup; it makes parsing very convenient.
Here's a guide. Even if you don't know English, everything is clear:
https://www.youtube.com/watch?v=3xQTJi2tqgk
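For comparison, a minimal sketch of the same kind of lookup with Beautiful Soup. It assumes the `bs4` package is installed and uses a made-up `SAMPLE` fragment rather than the real page; the "last font in the cell holds the year" rule is an assumption of this example.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup (not the real page).
SAMPLE = """
<table><tr>
  <td>menu</td>
  <td><font size="5"><b>Some Title (манга)</b></font><br>Some Title<br><font>1999</font></td>
</tr></table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Find the <b> whose text mentions "манга", wherever it sits in the tables.
title_tag = soup.find("b", string=lambda s: s and "манга" in s)
title = title_tag.get_text(strip=True)

# Assumption for this sample: the year is the last <font> in the same cell.
cell = title_tag.find_parent("td")
year = cell.find_all("font")[-1].get_text(strip=True)

print(title)
print(year)
```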

angru, 2015-11-08
@angru

I have never had to parse sites like this; I do not envy you.
As you have already been advised, open the page source: its structure is slightly different from what the browser tools show. For example, there is no tbody (the browser adds it to the DOM, which is why FirePath puts it in the path).
Also, the root element is html, so it does not need to be specified in the xpath.
My code turned out just as shitty:

import requests
from lxml import etree
from lxml.html import HTMLParser


info = {}
r = requests.get("http://www.world-art.ru/animation/manga.php?id=501")

if r.ok:
    tree = etree.fromstring(r.text, parser=HTMLParser())

    # Paths are relative to the root <html> element and contain no tbody.
    info["name"] = tree.xpath("body/table/tr[1]/td/center/table[7]/tr/td/table/tr/td[5]/table[2]/tr/td[3]/font[1]/b")[0].text
    info["year"] = tree.xpath("body/table/tr[1]/td/center/table[7]/tr/td/table/tr/td[5]/table[2]/tr/td[3]/font[2]")[0].text
    # encoding="unicode" makes tostring return str instead of bytes, so the split works on real text.
    info["name1"] = etree.tostring(tree.xpath("body/table/tr[1]/td/center/table[7]/tr/td/table/tr/td[5]/table[2]/tr/td[3]")[0], encoding="unicode").split('<br/>')[1]

    print(info)
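
The two points about tbody and the root element can be checked offline. The following is a small sketch on a made-up one-table document (`RAW` is hypothetical, not the real page): lxml parses the markup as written, so a path that includes tbody or a leading html step matches nothing, and indexing the empty result with [0] would raise IndexError.

```python
from lxml import etree
from lxml.html import HTMLParser

# Hypothetical raw markup: like the real page source, it has no <tbody>.
RAW = (
    "<html><body><table><tr><td>"
    "<font><b>Some Title</b></font>"
    "</td></tr></table></body></html>"
)

tree = etree.fromstring(RAW, parser=HTMLParser())

# The parsed root is already <html>, so paths are written relative to it.
root_tag = tree.tag

# Path as a browser tool would show it (tbody added by the browser's DOM).
with_tbody = tree.xpath("body/table/tbody/tr/td/font/b")

# Path that follows the raw source.
without_tbody = tree.xpath("body/table/tr/td/font/b")

# Starting the path with "html" looks for an <html> child of <html>.
with_html_prefix = tree.xpath("html/body/table/tr/td/font/b")

print(root_tag)
print(len(with_tbody), len(without_tbody), len(with_html_prefix))
```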