ruby
Shaks, 2014-05-24 06:18:46

How to properly get rid of the Invalid byte sequence in utf-8 error?

Description of the problem:
For example, take two links:
1. xttp://www.ariatender.com/beta/
2. xttp://www.ariatender.com/beta/tender-by-category.php
Using the Mechanize gem, I fetch each of them in turn. I process the content of each page with regular expressions, strip the HTML tags and other junk, and keep the clean text.
The problem is that link 1 works fine, but for link 2 I get the error "invalid byte sequence in UTF-8" as soon as I touch the page content with gsub or split.
What I tried and what happened:

force_encoding('utf-8') # the error remains
encode('UTF-8', 'UTF-8', invalid: :replace, undef: :replace, replace: '') # the error remains
encode('UTF-8', invalid: :replace, undef: :replace, replace: '') # something I don't understand happens: all of the Arabic text disappears, although I expected only the problematic bytes to be removed so that I could keep working.
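
A minimal sketch of why these attempts may behave the way they do. It assumes the page body arrives as a binary (ASCII-8BIT) string, which is common for HTTP bodies but is not confirmed for this site, and it uses String#scrub, which needs Ruby 2.1 or later. The sample string `raw` is made up for illustration.

```ruby
# Hypothetical sample: valid UTF-8 Arabic text inside a tag, plus one stray
# invalid byte, held as a binary (ASCII-8BIT) string.
raw = "<p>\u0645\u0631\u062D\u0628\u0627</p>".b + "\xFF".b

# force_encoding only relabels the bytes. The invalid byte is still there,
# so gsub or split on this string would still raise.
labeled = raw.dup.force_encoding('UTF-8')
still_broken = !labeled.valid_encoding?

# encode from a binary string treats every byte >= 0x80 as undefined in
# UTF-8, so with undef: :replace all non-ASCII text is dropped, not just
# the bad byte.
stripped = raw.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')

# scrub works on the string as UTF-8 and removes only the invalid bytes.
cleaned = labeled.scrub('')
text = cleaned.gsub(/<[^>]+?>/m, ' ')
```

Here `stripped` loses the Arabic text entirely, while `text` keeps it and only the stray byte is gone. If the page is actually served in a different charset, scrubbing will not recover the text and the body would need to be transcoded from that charset instead.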

If you are willing to help sort this out, here is the code:
require 'mechanize'
require 'logger'

pages = ['http://www.ariatender.com/beta/', 'http://www.ariatender.com/beta/tender-by-category.php']
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
agent.log = Logger.new(File.join(File.dirname(__FILE__), 'log.txt'))


pages.each do |page|
  begin
    agent.log.debug("Page: #{page}")
    content = agent.get(page)
    enc = content.body.force_encoding('utf-8')
    agent.log.debug("Content: #{enc}")
    stripped_tags = enc.gsub(/<[^>]+?>/m, ' ')
    agent.log.debug("Modify content: #{stripped_tags}")
  rescue StandardError => e
    agent.log.error(e.message)
    agent.log.error(e.backtrace.join("\n"))
  end
end


2 answer(s)
Marat Amerov, 2014-05-24
@amerov

content = agent.page.root.serialize(encoding: "utf-8")
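
The line above asks the parsed document (Nokogiri, which Mechanize uses for HTML pages) to re-serialize itself as UTF-8, so the conversion starts from whatever encoding the parser detected. A rough sketch of the same idea without the parser, assuming the page is really in a legacy single-byte charset: the helper name `to_utf8` and the choice of Windows-1256 are illustrative guesses, not something confirmed for this site.

```ruby
# Hypothetical helper: interpret the raw bytes in the charset the page
# actually uses, then transcode to UTF-8. Bytes that cannot be converted
# are dropped.
def to_utf8(body, source_charset)
  body.dup
      .force_encoding(source_charset)
      .encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
end

# Simulate a body served in Windows-1256 (an assumption for illustration).
original = "\u0645\u0646\u0627\u0642\u0635\u0629"
legacy_bytes = original.encode('Windows-1256').b

converted = to_utf8(legacy_bytes, 'Windows-1256')
```

After this, `converted` is a valid UTF-8 string and gsub or split should work on it. The real charset would have to come from the response headers or the page's meta tag.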
