Python web scraping: how do I put product characteristics into Excel columns?
How can I put the characteristic names (`title` in my code) into the column headers and the characteristic values (`bower`) into the cells of an Excel sheet? The characteristics differ from product to product, and each header should appear only once.
For example, given the characteristic "voltage: 380 V", "voltage" should be the column header and "380 V" the value in the cell.
The problem is that the characteristics can come in a different order on each page, and different products can have different sets of characteristics.
I'm thinking of collecting them in a list (array). What algorithm do I need so that a characteristic that has already been seen goes into its existing column, and a new one creates a new column?
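The algorithm described above can be sketched in plain Python, assuming each product's characteristics have already been collected into a dict. The helper name `build_table` and the sample products are made up for illustration; they are not taken from the site.

```python
def build_table(products):
    """Turn a list of {characteristic: value} dicts into (headers, rows)."""
    headers = []   # column titles, in the order they were first seen
    seen = set()   # fast membership test for titles already in headers
    for product in products:
        for title in product:
            if title not in seen:      # new characteristic: new column
                seen.add(title)
                headers.append(title)
    # Repeated characteristics land in their existing column;
    # characteristics a product lacks become empty cells.
    rows = [[product.get(title, '') for title in headers]
            for product in products]
    return headers, rows


# Hypothetical sample data, for illustration only.
products = [
    {'name': 'Sensor A', 'Voltage': '380 V', 'Weight': '1 kg'},
    {'name': 'Sensor B', 'Weight': '2 kg', 'Current': '0.3 A'},
]
headers, rows = build_table(products)
print(headers)
for row in rows:
    print(row)
```

The order of characteristics on each page does not matter here, because every value is looked up by its title rather than by its position.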
Here is the HTML I'm parsing, followed by my code so far:
<div class="tab-pane techdata product-collateral__techdata active" id="techdata">
  <ul>
    <li><span>Маркировка по взрывозащите</span><span>нет</span></li>
    <li><span>Тип контактов</span><span>НЗ</span></li>
    <li><span>Расстояние между магнитом и герконом, мм:</span><span></span></li>
    <li><span>- при размыкании контактов, более</span><span>45</span></li>
    <li><span>- при замыкании контактов, менее</span><span>12.7</span></li>
    <li><span>Максимально допустимые токи и напряжения:</span><span></span></li>
    <li><span>- максимальное коммутируемое напряжение, В</span><span>72</span></li>
    <li><span>- максимальный коммутируемый ток, А</span><span>0.3</span></li>
    <li><span>Степень защиты</span><span>-</span></li>
    <li><span>Диапазон рабочих температур, °С</span><span>-50…+50</span></li>
    <li><span>Габаритные размеры, мм:</span><span></span></li>
    <li><span>- корпус геркона</span><span>58х11х11</span></li>
    <li><span>- корпус магнита</span><span>58х11х11</span></li>
    <li><span>Масса, не более, кг</span><span>-</span></li>
  </ul>
</div>
from urllib.request import urlopen
from urllib.parse import urljoin
from lxml.etree import XMLSyntaxError
from lxml.html import fromstring

URL = 'http://www.tinko.ru/c-3.html?limit=100&no_cache=true&p=l'
ITEM_PATH = ' .info-block .product-name'
DESCR_PATH = '.breadcrumb .active'
HARET_PATH = '#techdata li'


def parse_courses():
    f = urlopen(URL)
    list_html = f.read().decode('utf-8')
    list_doc = fromstring(list_html)
    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        name = a.text
        course = {'name': name, 'href': href}
        details_html = urlopen(href).read().decode('utf-8')
        try:
            details_doc = fromstring(details_html)
        except XMLSyntaxError:
            continue
        descr_elem = details_doc.cssselect(DESCR_PATH)[0]
        descr = descr_elem.text_content()
        # First span of each <li>: the characteristic name
        for haret_elems in details_doc.cssselect(HARET_PATH):
            title = haret_elems.cssselect('span')[0]
            title = title.text_content()
        # Second span of each <li>: the characteristic value
        for haret_elems in details_doc.cssselect(HARET_PATH):
            bower = haret_elems.cssselect('span')[1]
            bower = bower.text_content()
            print(bower)


def main():
    parse_courses()


if __name__ == '__main__':
    main()
Answer:
from urllib.request import urlopen  # for Python 3
# from urllib2 import urlopen  # for Python 2
from lxml.etree import XMLSyntaxError
from lxml.html import fromstring
from pandas import DataFrame, ExcelWriter

URL = 'http://www.tinko.ru/c-3.html?limit=100&no_cache=true&p=l'
ITEM_PATH = ' .info-block .product-name'
DESCR_PATH = '.breadcrumb .active'
HARET_PATH = '#techdata li'


def parse_courses():
    list_html = urlopen(URL).read().decode('utf-8')
    list_doc = fromstring(list_html)
    rows = []  # one dict per product; the keys become column headers
    for elem in list_doc.cssselect(ITEM_PATH):
        a = elem.cssselect('a')[0]
        href = a.get('href')
        name = a.text
        details_html = urlopen(href).read().decode('utf-8')
        try:
            details_doc = fromstring(details_html)
        except XMLSyntaxError:
            continue
        description = details_doc.cssselect(DESCR_PATH)[0].text_content()
        # Every row starts with the fixed columns, then gets its characteristics.
        row = {'name': name, 'description': description, 'href': href}
        for haret_elem in details_doc.cssselect(HARET_PATH):
            spans = haret_elem.cssselect('span')
            if len(spans) < 2:
                continue  # no title/value pair in this <li>
            title = spans[0].text_content()
            bower = spans[1].text_content()
            row[title] = bower
        rows.append(row)
    # A DataFrame built from a list of dicts uses the union of all keys as
    # columns; cells for characteristics a product lacks stay empty (NaN).
    df = DataFrame(rows)
    with ExcelWriter('tinko_ru_price_list.xlsx', engine='xlsxwriter') as writer:
        df.to_excel(writer, sheet_name='tinko.ru price list', header=True, index=False)


def main():
    parse_courses()


if __name__ == '__main__':
    main()
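One detail in the HTML from the question may need extra care. Some rows are group headings with an empty value (for example "Габаритные размеры, мм:"), and the rows after them start with "- ". Stored as they are, the headings become empty columns and the sub-items get headers that are ambiguous on their own. A possible refinement is sketched below. It assumes the title/value pairs have already been extracted from the spans, and `flatten_groups` is a hypothetical helper, not part of lxml or pandas.

```python
def flatten_groups(pairs):
    """Fold group headings into the names of the sub-items that follow them."""
    result = {}
    group = None  # current group heading, if any
    for title, value in pairs:
        title, value = title.strip(), value.strip()
        if not value and title.endswith(':'):
            # Heading row with an empty value: remember it, store nothing.
            group = title.rstrip(':')
        elif title.startswith('-') and group:
            # Sub-item: prefix its name with the group heading.
            result[f"{group}: {title.lstrip('- ')}"] = value
        else:
            # Ordinary characteristic: it also ends the current group.
            group = None
            result[title] = value
    return result


# Pairs copied from the HTML fragment in the question.
pairs = [
    ('Тип контактов', 'НЗ'),
    ('Габаритные размеры, мм:', ''),
    ('- корпус геркона', '58х11х11'),
    ('- корпус магнита', '58х11х11'),
    ('Масса, не более, кг', '-'),
]
flat = flatten_groups(pairs)
for title, value in flat.items():
    print(title, '=>', value)
```

The resulting dict can be used as a row in the same way as the per-product dict of titles and values. Whether the heading convention (trailing colon, leading dash) holds for every product page is an assumption that should be checked against the real data.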