ruby
DarkCoder, 2011-06-09 20:59:16

Encoding when reading files in Ruby

The task is to count the words in a file. Please tell me how to set the encoding correctly in File.read() so that Russian text is read properly.
The code:

def words_from_string(string)
  string.downcase.scan(/[\w']+/)
end

def count_frequency(word_list)
  count = Hash.new(0)
  word_list.each {|word| count[word] += 1 }
  count
end

raw_text = File.read("text.txt") #, encoding: Encoding::UTF_8) #, encoding: "cp1251")
p raw_text

word_list = words_from_string(raw_text)
p word_list

counts = count_frequency(word_list)
p counts

sorted = counts.sort_by { |word, count| -count }
p sorted

top_five = sorted.first(5) # sorted is in descending order of count, so the most frequent words come first
p top_five
top_five.each { |word, count| puts "#{word} #{count}" }


2 answer(s)
sl_bug, 2011-06-09
@DarkCoder

Which version of Ruby are you using?
You can specify the encoding in the mode string, e.g. open("data.txt", "w:UTF-16LE").
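For reading, the same mode-string syntax can name both the encoding of the file and the encoding the program wants to work in. The following is only a sketch, assuming the file is stored as Windows-1251 and the rest of the script expects UTF-8; the helper name read_as_utf8 and the sample file are hypothetical.

```ruby
# encoding: utf-8
require "tmpdir"

# Hypothetical helper: read a file stored in source_encoding and
# return its contents transcoded to UTF-8.
def read_as_utf8(path, source_encoding = "windows-1251")
  # Mode string format: "r:<external encoding>:<internal encoding>".
  File.open(path, "r:#{source_encoding}:utf-8") { |f| f.read }
end

# Demo: build a small Windows-1251 file (made-up content), then read it back.
path = File.join(Dir.tmpdir, "cp1251_sample.txt")
File.binwrite(path, "Привет, мир".encode("Windows-1251"))

text = read_as_utf8(path)
puts text.encoding   # UTF-8
puts text
File.delete(path)
```

File.read accepts the same pair through its encoding option, e.g. File.read(path, encoding: "windows-1251:utf-8").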

DarkCoder, 2011-06-10
@DarkCoder

Great, I'm one step closer! :)
There is some difference between p and puts, though:

File.open("text_ascii.txt", "r:windows-1251") do |f|
  raw_text = f.gets # read the first line; the block closes the file afterwards
  puts raw_text.encoding
  puts raw_text
end

Windows-1251
Здравствуйте, уважаемые читатели. Я продолжаю свою серию постов про распределенную систему контроля версий Mercurial.

The only remaining problem is that the regexp does not match Russian words :(
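One possible way around the regexp problem, sketched under the assumption that the text has already been transcoded to UTF-8 (for example by opening the file with "r:windows-1251:utf-8"): in Ruby 1.9.2 and later \w covers only ASCII word characters, while the Unicode property classes \p{L} and \p{N} match letters and digits of any script. The name words_from_utf8_string is hypothetical. Note also that String#downcase lowercases non-ASCII letters only from Ruby 2.4 on.

```ruby
# encoding: utf-8

# Hypothetical variant of words_from_string for UTF-8 text.
# \p{L} matches any Unicode letter, \p{N} any Unicode digit;
# "_" and "'" are kept to mirror the original [\w']+ pattern.
def words_from_utf8_string(string)
  string.downcase.scan(/[\p{L}\p{N}_']+/)
end

p words_from_utf8_string("привет, мир! it's привет")
```

The resulting list can be passed to count_frequency unchanged.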
