How to properly get all the content in the body using Nokogiri and convert it to text?

ruby

Demigodd, 2020-04-29 12:25:42

<body>
  <p>Content</p>
   ...Content...
</body>

This is how I get all the content as text:

new_content = nokogiri_content.at('body').children.text

But whitespace characters remain.
Is this the right way to do it, and if so, how do I remove the whitespace characters?

1 answer(s)

Valentin V., 2020-04-29
@Demigodd

Unfortunately, Nokogiri has no built-in function for collapsing whitespace. You can strip the leading spaces from each line with a regular expression:
new_content.gsub(/^ +/, "")

In general, though, this is not a great approach: the text of the whole body will contain not only extra whitespace but also content that is not normally treated as text, such as the contents of script and style elements. Processing HTML with Nokogiri usually involves more targeted actions, such as selecting the tags you need and extracting the text from them:

require 'open-uri'
require 'nokogiri'

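url = 'https://ru.wikipedia.org/wiki/Ruby'
# Use URI.open: since Ruby 3.0, Kernel#open no longer opens URLs
doc = Nokogiri::HTML(URI.open(url))

# Collect the text of paragraphs and top-level headings only
text = ''
doc.css('p,h1').each do |e|
  text << e.content
end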

puts text
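
As for the whitespace itself, here is a minimal sketch in plain Ruby that collapses every run of whitespace (not just leading spaces) into a single space. The helper name normalize_whitespace is made up for this illustration and is not part of Nokogiri, and the sample string is only an assumption about what the body text from the question roughly looks like.

```ruby
# Hypothetical helper (not a Nokogiri method): collapse each run of
# whitespace into one space and trim both ends of the string.
def normalize_whitespace(str)
  # [[:space:]] matches spaces, tabs and newlines, and unlike \s it also
  # matches Unicode whitespace such as the non-breaking space.
  str.gsub(/[[:space:]]+/, ' ').strip
end

# Assumed sample, shaped like the text extracted from the <body> in the question.
raw = "\n  Content\n   ...Content...\n"

puts normalize_whitespace(raw)
# prints "Content ...Content..."
```

It should work the same way on the text variable from the example above, for instance puts normalize_whitespace(text). Note that this joins everything into one line; if you want to keep paragraph breaks, you would probably normalize each element's text separately and join the results with newlines.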