ruby
Demigodd, 2020-04-29 12:25:42

How to properly get all the content in the body using Nokogiri and convert it to text?

<body>
  <p>Content</p>
   ...Content...
</body>


This is how I get all the content as text:
new_content = nokogiri_content.at('body').children.text


But the whitespace characters remain.
Is this the right approach, and if so, how do I remove the whitespace characters?


1 answer(s)
Valentin V., 2020-04-29
@Demigodd

Unfortunately, Nokogiri has no built-in function for collapsing whitespace. You can remove the leading spaces on each line with a regular expression:
new_content.gsub(/^ +/, "")

In general, though, this is not a good approach: the text of the whole body contains not only extra whitespace but also content that is not normally treated as text, such as the contents of script and style tags. Processing HTML with Nokogiri usually involves more targeted actions, such as extracting the tags you need and taking the text from them:

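require 'open-uri'
require 'nokogiri'

url = 'https://ru.wikipedia.org/wiki/Ruby'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs
doc = Nokogiri::HTML(URI.open(url))

# Collect the text of paragraphs and top-level headings only
text = ''
doc.css('p,h1').each do |e|
  text << e.content
end

puts text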
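
For the whitespace itself, here is a minimal sketch in plain Ruby. It assumes you already have a string such as the one returned by at('body').text; the helper names normalize_whitespace and strip_lines are hypothetical and are not part of Nokogiri. The first collapses everything onto one line, the second keeps one line per non-empty source line.

```ruby
# Hypothetical helper: collapse every run of whitespace into a single space.
# [[:space:]] is used instead of \s because \s only matches ASCII whitespace.
def normalize_whitespace(text)
  text.gsub(/[[:space:]]+/, ' ').strip
end

# Hypothetical helper: trim each line and drop the empty ones,
# so the line structure of the original text is preserved.
def strip_lines(text)
  text.lines.map(&:strip).reject(&:empty?).join("\n")
end

# Sample string shaped like the body text from the question.
raw = "\n  Content\n   ...Content...\n"

puts normalize_whitespace(raw)
puts strip_lines(raw)
```

Which of the two fits depends on whether the line breaks in the extracted text mean anything to you. Both return a new string and leave the original untouched.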
