Python
Ivan Bogatyrev, 2019-08-15 09:54:15

How do I get properly decoded XML output for further processing?

When I load the XML for parsing, what I get looks like raw bytes, as far as I understand, and I can't do anything with it. At best, only the Latin characters are displayed.
The code:

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://www.fl.ru/rss/all.xml?category=2').read()
soup = bs.BeautifulSoup(source,'lxml')


table = soup.find('channel')
table_rows = table.find_all('item')

print(table_rows[1])

I get:
<item>
<title></title>
<link/>https://www.fl.ru/projects/4080170/fullstack---nodejs-razrabotchik.html
                  <description></description>
<guid>https://www.fl.ru/projects/4080170/fullstack---nodejs-razrabotchik.html</guid>
<category></category><category></category>
<pubdate>Sat, 06 Jul 2019 21:40:45 GMT</pubdate>
</item>

But I should get:
<item>
<title>
<![CDATA[Fullstack - Node.js разработчик (Бюджет: 130000  руб.)]]>
</title>
<link>https://www.fl.ru/projects/4080170/fullstack---nodejs-razrabotchik.html</link>
<description>
<![CDATA[В тематике криптовалют нужно закончить несколько самостоятельных модулей на отдельных поддоменах, которые собирают информацию через API с главного проекта, важно заложить...]]>
</description>
<guid>https://www.fl.ru/projects/4080170/fullstack---nodejs-razrabotchik.html</guid>
<category>
<![CDATA[Разработка сайтов / Веб-программирование]]>
</category>
<category>
<![CDATA[Программирование / Системное программирование]]>
</category>
<pubDate>Sat, 06 Jul 2019 21:40:45 GMT</pubDate>
</item>

When I force a conversion to UTF-8, everything turns into a jumble of unreadable characters...
This is driving me crazy. Please help, I don't understand what is going wrong.

3 answers
Dmitry Shitskov, 2019-08-15
@Zarom

soup = bs.BeautifulSoup(source.read().decode('cp1251'), 'lxml')

Ivan Bogatyrev, 2019-08-15
@crybabycry

That doesn't help...

Traceback (most recent call last):
  File "C:\feedscrape-master\25.py", line 6, in <module>
    soup = bs.BeautifulSoup(source.read().decode('cp1251'), 'lxml')
AttributeError: 'bytes' object has no attribute 'read'
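
The traceback points at `source.read()`: in the question's code `source` is already the result of `.read()`, so it is a `bytes` object and has no `read` method. A minimal sketch of decoding it directly, using a hypothetical stand-in string instead of the real feed and assuming the feed is cp1251 as suggested above (this has not been checked against the actual feed):

```python
# Stand-in for urllib.request.urlopen(url).read(), which already returns bytes.
# The cp1251 encoding is an assumption taken from the suggestion in this thread.
source = '<title>Node.js разработчик</title>'.encode('cp1251')

# bytes objects have no read() method, hence the AttributeError.
has_read = hasattr(source, 'read')

# Decode the bytes directly instead of calling read() a second time.
text = source.decode('cp1251')

# Decoding the same bytes as UTF-8 is one way to end up with unreadable
# characters: the cp1251 bytes for Cyrillic letters are not valid UTF-8.
garbled = source.decode('utf-8', errors='replace')

print(has_read)
print(text)
print(garbled)
```

If the feed really is cp1251, forcing UTF-8 would explain the jumble of characters described in the question.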

Vladimir, 2019-08-15
@vintello

Why do you need BeautifulSoup at all, when Python handles XML fine without any extra layers? There are plenty of examples.
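
As a rough illustration of that point, here is a sketch using the standard library's `xml.etree.ElementTree`. The `SAMPLE` bytes and the `parse_items` helper are hypothetical, and the `windows-1251` declaration is an assumption about the feed (based on the cp1251 suggestion in this thread). With the real feed you would pass the bytes returned by `urllib.request.urlopen(...).read()` instead of `SAMPLE`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the downloaded feed. The encoding
# declaration is an assumption; the parser reads it and decodes the bytes.
SAMPLE = (
    '<?xml version="1.0" encoding="windows-1251"?>'
    '<rss version="2.0"><channel>'
    '<item>'
    '<title><![CDATA[Fullstack - Node.js разработчик]]></title>'
    '<link>https://www.fl.ru/projects/4080170/fullstack---nodejs-razrabotchik.html</link>'
    '<category><![CDATA[Разработка сайтов / Веб-программирование]]></category>'
    '<category><![CDATA[Программирование / Системное программирование]]></category>'
    '<pubDate>Sat, 06 Jul 2019 21:40:45 GMT</pubDate>'
    '</item>'
    '</channel></rss>'
).encode('cp1251')


def parse_items(xml_bytes):
    """Return a list of dicts, one per <item> element in the feed."""
    root = ET.fromstring(xml_bytes)
    items = []
    for item in root.iter('item'):
        items.append({
            'title': item.findtext('title', default=''),
            'link': item.findtext('link', default=''),
            'categories': [c.text or '' for c in item.findall('category')],
            'pubDate': item.findtext('pubDate', default=''),
        })
    return items


items = parse_items(SAMPLE)
print(items[0]['title'])
print(items[0]['link'])
print(items[0]['categories'])
```

Because this is an XML parser, CDATA sections come back as ordinary element text and the tag name `pubDate` keeps its case, unlike the lowercased `<pubdate>` and empty `<title>` in the output shown in the question.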
