Beautiful Soup: how can I select the right tags more reliably?

Zalim Lampezhev, 2020-03-23 08:31:45

Python

I am parsing full names from the site rusprofile.ru.
url = 'https://www.rusprofile.ru/codes/561010/'

fio = soup.select('.company-item > .company-item-info > dl > dd')

The problem is that the number of tags is not the same in every block: some blocks have one more, some have fewer.
As a result, the first page parses mostly fine, but a few results are wrong because of the differing tag counts.
When the program then opens the next page, there are far more errors: completely different tags come out instead of the ones I need, and the result is a mess.

I do the following for the first page:

names = []
for i in range(0, len(fio), 5):
    print(fio[i].text)
    names.append(fio[i].text)
    for j in range(5, len(fio), 6):
        print(fio[j].text)
        names.append(fio[j].text)

On the first page this gets about 98% of the names right, but on the following pages only about 2% are correct.

Is there a more reliable way to select the right tags?

Example of the site's markup:

<div class="company-item">
    <div class="company-item__title">
        <a href="/id/10612303">ООО "Восток"</a>
    </div>
    <div class="company-item-info">
        <dl>
            <dt>Генеральный директор</dt>
            <dd>Титаев Александр Витальевич</dd>
        </dl>
    </div>
    <address class="company-item__text">
        603005, Нижегородская область, город Нижний Новгород, улица Пискунова, дом 14/5, помещение 8
    </address>
    <div class="company-item-info">
        <dl>
            <dt>ИНН</dt>
            <dd>5260430967</dd>
        </dl>
        <dl>
            <dt>ОГРН</dt>
            <dd>1165275042349</dd>
        </dl>
        <dl>
            <dt>Дата регистрации</dt>
            <dd>1 сентября 2016 г.</dd>
        </dl>
        <dl>
            <dt>Уставный капитал</dt>
            <dd>10 000 руб.</dd>
        </dl>
    </div>
    <div class="company-item-info">
        <dl>
            <dt>Основной вид деятельности</dt>
            <dd>56.10.1 Деятельность ресторанов и кафе с полным ресторанным обслуживанием, кафетериев, ресторанов быстрого питания и самообслуживания</dd>
        </dl>
    </div>
</div>

To summarize: I need to extract the full name, but the number of tags varies from block to block, because some tags are sometimes added or missing. So if I access elements by index, as in my example, the extraction breaks.

1 answer(s)

Zalim Lampezhev, 2020-03-23
@sabolch

I solved the problem a different way.

fio = soup.select('.company-item-info')

# Step by 3: this assumes every company has exactly three .company-item-info
# blocks and that the first one holds the name.
for i in range(0, len(fio), 3):
    print(fio[i].dd.text)
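
A fixed step can still drift if some company has a different number of .company-item-info blocks. One possible alternative, shown here only as a sketch, is to walk each .company-item separately and look the value up by its dt label instead of by position. The markup below is a trimmed copy of the example from the question; the second company is hypothetical and was added only to show a block with no name. The label "Генеральный директор" comes from the example, and it is an assumption that the site uses the same label for every company. The helper name extract_fields is made up for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, plus a hypothetical second
# company without a director block, to imitate the varying tag count.
html = """
<div class="company-item">
  <div class="company-item__title"><a href="/id/10612303">ООО "Восток"</a></div>
  <div class="company-item-info">
    <dl><dt>Генеральный директор</dt><dd>Титаев Александр Витальевич</dd></dl>
  </div>
  <div class="company-item-info">
    <dl><dt>ИНН</dt><dd>5260430967</dd></dl>
    <dl><dt>ОГРН</dt><dd>1165275042349</dd></dl>
  </div>
</div>
<div class="company-item">
  <div class="company-item__title"><a href="/id/0">Hypothetical company</a></div>
  <div class="company-item-info">
    <dl><dt>ИНН</dt><dd>0000000000</dd></dl>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def extract_fields(item):
    """Map each <dt> label inside one company block to its <dd> text."""
    fields = {}
    for dl in item.select(".company-item-info dl"):
        dt, dd = dl.find("dt"), dl.find("dd")
        if dt and dd:
            fields[dt.get_text(strip=True)] = dd.get_text(strip=True)
    return fields


names = []
for item in soup.select(".company-item"):
    fields = extract_fields(item)
    # Assumption: the name sits under this label; None marks a company without it.
    names.append(fields.get("Генеральный директор"))

print(names)  # ['Титаев Александр Витальевич', None]
```

Because every company is handled on its own, an extra or missing dl in one block does not shift the results for the companies after it. If the site uses other labels for the head of a company, they would have to be added to the lookup.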