Python web scraping: how to export product characteristics to Excel columns?

Python

artds, 2015-12-19 10:33:49

How can I put `title` (the characteristic names) into the column headers and `bower` (the characteristic values) into the cells themselves (in Excel)? The characteristics differ from product to product, and the headers must not be repeated.
For example, given a voltage characteristic of 380 V, "voltage" should be the heading and "380 V" the value in the cell itself.
The problem is that the characteristics may come in any order, and different rows may have different characteristics.
I'm thinking of collecting them in a list (array). What algorithm is needed so that a characteristic that has already been seen goes under its existing heading, while a new one creates a new heading?
Here is the HTML I'm parsing:

<div class="tab-pane                 techdata product-collateral__techdata active" id="techdata">
    <ul>
        <li><span>Маркировка по взрывозащите</span><span>нет</span></li>
        <li><span>Тип контактов</span><span>НЗ</span></li>
        <li><span>Расстояние между магнитом и герконом, мм:</span><span></span></li>
        <li><span>- при размыкании контактов, более</span><span>45</span></li>
        <li><span>- при замыкании контактов, менее</span><span>12.7</span></li>
        <li><span>Максимально допустимые токи и напряжения:</span><span></span></li>
        <li><span>- максимальное коммутируемое напряжение, В</span><span>72</span></li>
        <li><span>- максимальный коммутируемый ток, А</span><span>0.3</span></li>
        <li><span>Степень защиты</span><span>-</span></li>
        <li><span>Диапазон рабочих температур, °С</span><span>-50…+50</span></li>
        <li><span>Габаритные размеры, мм:</span><span></span></li>
        <li><span>- корпус геркона</span><span>58х11х11</span></li>
        <li><span>- корпус магнита</span><span>58х11х11</span></li>
        <li><span>Масса, не более, кг</span><span>-</span></li>
    </ul>
</div>

The parser code:

from urllib.request import urlopen
from urllib.parse import urljoin

from lxml.etree import XMLSyntaxError
from lxml.html import fromstring


URL = 'http://www.tinko.ru/c-3.html?limit=100&no_cache=true&p=l'
ITEM_PATH = ' .info-block .product-name'
DESCR_PATH = '.breadcrumb .active'

HARET_PATH = '#techdata li'


def parse_courses():
    f = urlopen(URL)
    list_html = f.read().decode('utf-8')
    list_doc = fromstring(list_html)

    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        name = a.text

        course = {'name': name, 'href': href}

        details_html = urlopen(href).read().decode('utf-8')
        try:
            details_doc = fromstring(details_html)
        except XMLSyntaxError:
            continue

        descr_elem = details_doc.cssselect(DESCR_PATH)[0]
        descr = descr_elem.text_content()

        # Characteristic names: the first <span> of every list item
        for haret_elems in details_doc.cssselect(HARET_PATH):
            title = haret_elems.cssselect('span')[0]
            title = title.text_content()

        # Characteristic values: the second <span> of every list item
        for haret_elems in details_doc.cssselect(HARET_PATH):
            bower = haret_elems.cssselect('span')[1]
            bower = bower.text_content()
            print(bower)


def main():
    parse_courses()


if __name__ == '__main__':
    main()

1 answer(s)

nirvimel, 2015-12-19
@artds

from urllib.request import urlopen # for Python 3
# from urllib2 import urlopen  # for Python 2

from lxml.etree import XMLSyntaxError
from lxml.html import fromstring
from pandas import DataFrame, ExcelWriter

URL = 'http://www.tinko.ru/c-3.html?limit=100&no_cache=true&p=l'
ITEM_PATH = ' .info-block .product-name'
DESCR_PATH = '.breadcrumb .active'

HARET_PATH = '#techdata li'


def parse_courses():
    f = urlopen(URL)
    list_html = f.read().decode('utf-8')
    list_doc = fromstring(list_html)

    # One dict per product: keys become column headers, values fill the cells
    rows = []

    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        name = a.text

        details_html = urlopen(href).read().decode('utf-8')

        try:
            details_doc = fromstring(details_html)
        except XMLSyntaxError:
            continue

        description = details_doc.cssselect(DESCR_PATH)[0].text_content()

        haret_elems_list = [('name', name), ('description', description), ('href', href)]

        for haret_elems in details_doc.cssselect(HARET_PATH):
            spans = haret_elems.cssselect('span')
            title = spans[0].text_content()
            bower = spans[1].text_content()
            haret_elems_list.append((title, bower))

        rows.append(dict(haret_elems_list))

    # DataFrame uses the union of all keys as columns; a characteristic that
    # a product lacks is left empty in that product's row
    # (DataFrame.append and ExcelWriter.save no longer exist in pandas 2.x)
    df = DataFrame(rows)

    # The context manager saves and closes the workbook on exit
    with ExcelWriter('tinko_ru_price_list.xlsx', engine='xlsxwriter') as writer:
        df.to_excel(writer, sheet_name='tinko.ru price list', header=True, index=False)


def main():
    parse_courses()


if __name__ == '__main__':
    main()

Here is a ready-made tinko.ru parser with export to Excel (I only tested it offline, on a page saved to disk).
How are we going to split the fees? ;)
UPD: Corrected.
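
For anyone curious what the "seen before: reuse the column, new: add a column" idea from the question looks like without pandas, here is a minimal standard-library sketch. It is only an illustration, not the code above: `build_table` and `to_csv` are made-up helper names, the sample characteristics are invented rather than taken from the site, and it writes CSV (which Excel can open) instead of .xlsx.

```python
import csv
import io


def build_table(products):
    """Turn per-product (title, value) pairs into shared headers plus rows.

    `products` is a list; each item is the list of (title, value) pairs
    scraped for one product. Titles may differ and come in any order.
    """
    headers = []   # column titles in first-seen order, no repeats
    seen = set()   # the same titles, for fast membership tests
    rows = []      # one dict per product: title -> value

    for pairs in products:
        row = {}
        for title, value in pairs:
            title = title.strip()
            if title not in seen:       # new characteristic: new column
                seen.add(title)
                headers.append(title)
            row[title] = value.strip()  # known characteristic: same column
        rows.append(row)
    return headers, rows


def to_csv(headers, rows):
    """Render the table as CSV text; missing characteristics become ''."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=headers, restval='')
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample data, not taken from the site.
products = [
    [('voltage', '380 V'), ('contact type', 'NC')],
    [('weight', '0.2 kg'), ('voltage', '72 V')],
]
headers, rows = build_table(products)
print(to_csv(headers, rows))
```

Each product becomes one row, and a characteristic that a product does not have simply stays empty in its column. Passing the same list of dicts to `DataFrame` should give the equivalent table in pandas.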