How do I properly get rid of the "invalid byte sequence in UTF-8" error?
Description of the problem:
For example, I take two links:
1. xttp://www.ariatender.com/beta/
2. xttp://www.ariatender.com/beta/tender-by-category.php
Using the Mechanize gem, I fetch each of them in turn. I process the content of each returned page with regular expressions: I strip the HTML tags and all the other garbage and keep only the clean text.
The problem is that with link #1 everything is fine, but with link #2 I get the error "invalid byte sequence in UTF-8" as soon as I touch the page content with gsub or split.
What I tried and what came of it:
force_encoding('utf-8') # the error remains
encode('UTF-8', 'UTF-8', invalid: :replace, undef: :replace, replace: '') # the error remains
encode('UTF-8', invalid: :replace, undef: :replace, replace: '') # something I don't understand happens: for some reason all the Arabic gibberish magically disappears, although I expected only the problematic bytes to disappear and let me keep working in peace.
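One plausible reading of these results, sketched on a made-up sample rather than on the real page. It assumes the body arrives tagged as binary (ASCII-8BIT), which is not confirmed here. Under that assumption, force_encoding only changes the label and leaves the bad bytes in place, while encode without a source encoding transcodes from binary and treats every byte above 0x7F as undefined, so the Arabic text goes away together with the bad byte. As far as I know, encode from UTF-8 to UTF-8 was a no-op before Ruby 2.1, which would explain the second attempt. The helper names sample_body, relabel, transcode_from_binary and regex_fails? are hypothetical.

```ruby
# Hypothetical stand-in for the page body: valid UTF-8 Arabic text plus one
# stray byte (0xFF) that can never occur in UTF-8. This is not the real page.
def sample_body
  "مرحبا".b + "\xFF".b + " tender".b
end

# Attempt 1: force_encoding changes only the encoding label, not the bytes.
def relabel(raw)
  raw.dup.force_encoding('UTF-8')
end

# Attempt 3: with no source encoding given, the string's current encoding is
# used. If that is binary, every byte >= 0x80 counts as undefined and is removed.
def transcode_from_binary(raw)
  raw.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
end

# Returns true when a regexp operation on the string raises ArgumentError.
def regex_fails?(str)
  str.gsub(/<[^>]+?>/m, ' ')
  false
rescue ArgumentError
  true
end

raw = sample_body
p relabel(raw).valid_encoding?   # => false
p regex_fails?(relabel(raw))     # => true
p transcode_from_binary(raw)     # => " tender"
```

Checking valid_encoding? and encoding on the real body before calling gsub would show whether this assumption holds for the second page.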
require 'mechanize'
require 'logger'

pages = ['http://www.ariatender.com/beta/', 'http://www.ariatender.com/beta/tender-by-category.php']

agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
agent.log = Logger.new(File.join(File.dirname(__FILE__), 'log.txt'))

pages.each do |page|
  begin
    agent.log.debug("Page: #{page}")
    content = agent.get(page)
    # Relabel the raw body as UTF-8 (the bytes themselves are not changed)
    enc = content.body.force_encoding('utf-8')
    agent.log.debug("Content: #{enc}")
    # Replace every HTML tag with a space; this is where the error is raised
    stripped_tags = enc.gsub(/<[^>]+?>/m, ' ')
    agent.log.debug("Modify content: #{stripped_tags}")
  rescue => e
    agent.log.error(e.message)
    agent.log.error(e.backtrace.join("\n"))
  end
end
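A minimal sketch of a cleaning step that could stand in for the force_encoding and gsub lines of the loop. It assumes Ruby 2.1 or newer (for String#scrub) and that the page is mostly UTF-8 with a few bad bytes. If the second page is actually served in another encoding, the right fix is to transcode from that encoding instead, which the optional parameter allows. The helper strip_tags_safely and its declared parameter are hypothetical and not part of Mechanize.

```ruby
# Hypothetical helper: returns tag-free UTF-8 text from a raw page body.
# `declared` is the encoding the page is ASSUMED to be in (default UTF-8).
def strip_tags_safely(body, declared = 'UTF-8')
  text = body.dup.force_encoding(declared)
  text = if declared.casecmp('UTF-8').zero?
           # Already UTF-8: drop only the byte sequences that are invalid.
           text.scrub('')
         else
           # Another encoding: transcode, dropping what cannot be converted.
           text.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
         end
  # Same tag-stripping regexp as in the question.
  text.gsub(/<[^>]+?>/m, ' ')
end

# Made-up sample body with one invalid byte, standing in for the real page.
page_body = "<p>مرحبا".b + "\xFF".b + "</p>".b
puts strip_tags_safely(page_body)
```

In the loop this would be called as strip_tags_safely(content.body), so the original body is left untouched and only the copy is relabelled and cleaned.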