Python
AlexRAV, 2016-11-12 19:06:25

How to parse a block in Python?

Why does Beautiful Soup parse this block incorrectly?
Here is a simplified example that still reproduces the bug:

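from bs4 import BeautifulSoup

# No parser is named here, so Beautiful Soup picks the best one installed.
soupIndex = BeautifulSoup('''<div class="vk-comment">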
                    <div class="vk-avatar">
                        <img src="img.png">
                    </div>
                    <div class="vk-comment-name">
                        Имя автора
                    </div>
                    <div class="vk-comment-text">
                        <p>
                            Текст коммента
                        </p>
                    </div>
                    <div class="vk-comment-date">
                        17 минут назад
                    </div>
                </div>''')
template = soupIndex.select_one('.vk-comment')
print(template)

With this input, two extra divs appear in the output. If the comment is made several times longer, the vk-comment-date block starts to be duplicated as well. As far as I can tell, the longer the block's string representation, the more characters from its end are duplicated.
UPD: the default parser is html5lib and the OS is Windows 7. I also tried html.parser, but it produces some nonsense of its own; for example, a closing tag is added to the img tag.
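
For comparison, here is a minimal sketch of naming the parser explicitly and looking at the serialized string rather than at what the console shows. It assumes the bs4 package is installed and uses the standard library's html.parser; the shortened markup is a stand-in for the block above. Beautiful Soup serializes void elements such as img in self-closing form, which may be what looks like an added closing tag.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the avatar part of the block above.
html = '<div class="vk-avatar"><img src="img.png"></div>'

# Name the parser explicitly instead of relying on whichever one is installed.
soup = BeautifulSoup(html, "html.parser")
avatar = soup.select_one(".vk-avatar")

# str() gives the serialized markup independently of how a console renders it.
serialized = str(avatar)
print(serialized)
```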

3 answer(s)
AlexRAV, 2016-11-13
@AlexRAV

The problem turned out to be incorrect display in the console. I wrote the data to a file, and it is displayed correctly there.
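
A minimal sketch of that check, using only the standard library. The helper name dump_markup and the file name are hypothetical, and the short markup string stands in for str(template); the file is written as UTF-8, so the console is not involved at all.

```python
import os
import tempfile


def dump_markup(markup: str, path: str) -> None:
    """Write markup to a file as UTF-8, bypassing the console entirely."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(markup)


# Stand-in for str(template); it keeps Cyrillic text like the original block.
markup = '<div class="vk-comment-date">17 минут назад</div>'

# Hypothetical file name in the system temp directory.
path = os.path.join(tempfile.gettempdir(), "vk_comment_dump.html")
dump_markup(markup, path)

with open(path, encoding="utf-8") as fh:
    round_tripped = fh.read()

# If the text read back matches, any duplication seen earlier came from
# the console output, not from the parsed tree.
print(round_tripped == markup)
```

If the file contents look right while the console output does not, the parser is not at fault.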

Vadim kyklaed, 2016-11-12
@kyklaed

Do you need the data from all the divs?

# .get() returns None instead of raising KeyError for a div without a class
for i in soupIndex.find_all('div'):
    print(i.get('class'))

Dimonchik, 2016-11-12
@dimonchik2013

Preprocess the input. You are probably getting it from some kind of API, so check what data actually arrives.
P.S. I recommend lxml.
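
One way to check what arrives, as a rough sketch: summarize the raw input before it reaches the parser. The helper summarize_raw and the sample payload are hypothetical, and the UTF-8 default is an assumption about the source. The tail is printed through ascii(), so non-ASCII characters are escaped and the console cannot distort them.

```python
def summarize_raw(raw, marker, encoding="utf-8"):
    """Describe raw input before it is handed to a parser."""
    # Assumption: bytes payloads are UTF-8; adjust if the source says otherwise.
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    return {
        "type": type(raw).__name__,
        "length": len(text),
        "marker_count": text.count(marker),
        # ascii() escapes non-ASCII characters, so the console cannot garble them.
        "tail": ascii(text[-30:]),
    }


# Stand-in for data received from elsewhere, as bytes off the wire.
payload = '<div class="vk-comment-date">17 минут назад</div>'.encode("utf-8")

summary = summarize_raw(payload, 'class="vk-comment-date"')
for key, value in summary.items():
    print(key, value)
```

A marker count above the expected number would point at the incoming data; a correct count points back at the parsing or display step.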
