Python
artds, 2015-12-19 10:33:49

Python web scraping: how do I turn product characteristics into Excel columns?

How can I put title (the characteristic names) in the column headers and bower (the characteristic values) in the cells of an Excel sheet? The characteristics differ from product to product, and each header should appear only once.
For example, for a voltage characteristic of 380 V, "voltage" should be the heading and "380 V" the value in the cell itself.
The problem is that the characteristics can come in any order, and different rows can have different characteristics.
I'm thinking of collecting them in a list (array). What algorithm do I need so that a characteristic that has already been seen is added under its existing header, while a new one creates a new header?
Here is the HTML I'm parsing:

<div class="tab-pane techdata product-collateral__techdata active" id="techdata">
    <ul>
        <li><span>Маркировка по взрывозащите</span><span>нет</span></li>
        <li><span>Тип контактов</span><span>НЗ</span></li>
        <li><span>Расстояние между магнитом и герконом, мм:</span><span></span></li>
        <li><span>- при размыкании контактов, более</span><span>45</span></li>
        <li><span>- при замыкании контактов, менее</span><span>12.7</span></li>
        <li><span>Максимально допустимые токи и напряжения:</span><span></span></li>
        <li><span>- максимальное коммутируемое напряжение, В</span><span>72</span></li>
        <li><span>- максимальный коммутируемый ток, А</span><span>0.3</span></li>
        <li><span>Степень защиты</span><span>-</span></li>
        <li><span>Диапазон рабочих температур, °С</span><span>-50…+50</span></li>
        <li><span>Габаритные размеры, мм:</span><span></span></li>
        <li><span>- корпус геркона</span><span>58х11х11</span></li>
        <li><span>- корпус магнита</span><span>58х11х11</span></li>
        <li><span>Масса, не более, кг</span><span>-</span></li>
    </ul>
</div>

The parser code:
from urllib.request import urlopen
from urllib.parse import urljoin

from lxml.etree import XMLSyntaxError
from lxml.html import fromstring


URL = 'http://www.tinko.ru/c-3.html?limit=100&no_cache=true&p=l'
ITEM_PATH = ' .info-block .product-name'
DESCR_PATH = '.breadcrumb .active'

HARET_PATH = '#techdata li'


def parse_courses():
    f = urlopen(URL)
    list_html = f.read().decode('utf-8')
    list_doc = fromstring(list_html)

    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        name = a.text

        course = {'name': name, 'href': href}

        details_html = urlopen(href).read().decode('utf-8')
        try:
            details_doc = fromstring(details_html)
        except XMLSyntaxError:
            # Skip product pages that cannot be parsed
            continue

        descr_elem = details_doc.cssselect(DESCR_PATH)[0]
        descr = descr_elem.text_content()

        # The first <span> of each <li> holds the characteristic name
        for haret_elems in details_doc.cssselect(HARET_PATH):
            title = haret_elems.cssselect('span')[0]
            title = title.text_content()

        # The second <span> of each <li> holds the characteristic value
        for haret_elems in details_doc.cssselect(HARET_PATH):
            bower = haret_elems.cssselect('span')[1]
            bower = bower.text_content()
            print(bower)


def main():
    parse_courses()


if __name__ == '__main__':
    main()

1 answer(s)
nirvimel, 2015-12-19
@artds

from urllib.request import urlopen # for Python 3
# from urllib2 import urlopen  # for Python 2

from lxml.etree import XMLSyntaxError
from lxml.html import fromstring
from pandas import DataFrame, ExcelWriter

URL = 'http://www.tinko.ru/c-3.html?limit=100&no_cache=true&p=l'
ITEM_PATH = ' .info-block .product-name'
DESCR_PATH = '.breadcrumb .active'

HARET_PATH = '#techdata li'


def parse_courses():
    f = urlopen(URL)
    list_html = f.read().decode('utf-8')
    list_doc = fromstring(list_html)

    # Collect one dict per product (DataFrame.append was removed in pandas 2.0).
    rows = []

    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        name = a.text

        details_html = urlopen(href).read().decode('utf-8')

        try:
            details_doc = fromstring(details_html)
        except XMLSyntaxError:
            continue

        description = details_doc.cssselect(DESCR_PATH)[0].text_content()

        haret_elems_list = [('name', name), ('description', description), ('href', href)]

        # Each <li> holds two <span>s: the characteristic name and its value.
        for haret_elems in details_doc.cssselect(HARET_PATH):
            spans = haret_elems.cssselect('span')
            title = spans[0].text_content()
            bower = spans[1].text_content()
            haret_elems_list.append((title, bower))

        rows.append(dict(haret_elems_list))

    # Columns are the union of all keys; missing characteristics become empty cells.
    df = DataFrame(rows)

    # The context manager saves the file on exit (writer.save() was removed in pandas 2.0).
    with ExcelWriter('tinko_ru_price_list.xlsx', engine='xlsxwriter') as writer:
        df.to_excel(writer, sheet_name='tinko.ru price list', header=True, index=False)


def main():
    parse_courses()


if __name__ == '__main__':
    main()

Here is a ready-made tinko.ru parser with export to Excel. I only tested it offline, against a page saved to disk.
How are we going to split the fees? ;)
UPD: Corrected.
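
The core idea also works without pandas: keep each product as a dict and build the header list from the keys in order of first appearance. Below is a minimal sketch using only the standard library. It writes CSV (which Excel can open) instead of .xlsx, and the sample products and the helper names collect_headers and rows_to_csv are made up for illustration.

```python
import csv
import io


def collect_headers(rows):
    """Return every key used in rows, in order of first appearance."""
    headers = []
    seen = set()
    for row in rows:
        for key in row:
            if key not in seen:
                # A new characteristic creates a new column
                seen.add(key)
                headers.append(key)
    return headers


def rows_to_csv(rows):
    """Render rows as CSV text; missing characteristics become empty cells."""
    headers = collect_headers(rows)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=headers, restval='',
                            lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample data: two products whose characteristics differ
# and arrive in a different order.
products = [
    {'name': 'Sensor A', 'Voltage': '380 V', 'Contact type': 'NC'},
    {'name': 'Sensor B', 'Weight': '0.2 kg', 'Voltage': '72 V'},
]

print(rows_to_csv(products))
```

Because each row is looked up by header name rather than by position, the order of the characteristics on a product page does not matter.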
