module CMess::GuessEncoding::Automatic::EncodingGuessers
Definition of guessing heuristics. Order matters!
Public Instance Methods
encoding_01_ASCII()
ASCII, if all bytes fall within the lower 128 values (0x00..0x7f). Unfortunately, the whole file has to be read to make that decision.
  # File lib/cmess/guess_encoding/automatic.rb, line 251
  def encoding_01_ASCII
    ASCII if eof? && byte_count_sum(0x00..0x7f) == byte_total
  end
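To illustrate the check outside the library, here is a minimal sketch that applies the same rule to an in-memory string. The helper name `ascii_only?` and the String input are assumptions made for this example; they are not part of CMess, which works on byte counts collected while reading the input.

```ruby
# Hypothetical standalone helper (not part of CMess).
# Returns true only if every byte lies in 0x00..0x7f, which means
# the complete input has to be scanned before answering.
def ascii_only?(data)
  data.each_byte.all? { |byte| byte <= 0x7f }
end

ascii_only?("plain text")                            # => true
ascii_only?([0x63, 0x61, 0x66, 0xe9].pack("C*"))     # => false
```

A single byte above 0x7f anywhere in the input rules ASCII out, which is why no early positive answer is possible.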
encoding_02_UTF_32_and_UTF_16BE_and_UTF_16LE_and_UTF_16()
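UTF-16 / UTF-32, if more than 25% of all bytes are NULL bytes. The first byte then decides the variant: 0x00 yields UTF-32, 0xfe yields UTF-16BE, 0xff yields UTF-16LE, and anything else yields plain UTF-16.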
  # File lib/cmess/guess_encoding/automatic.rb, line 258
  def encoding_02_UTF_32_and_UTF_16BE_and_UTF_16LE_and_UTF_16
    if relative_byte_count(byte_count[0]) > 0.25
      case first_byte
        when 0x00 then UTF_32
        when 0xfe then UTF_16BE
        when 0xff then UTF_16LE
        else           UTF_16
      end
    end
  end
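The same decision can be sketched as a standalone function. The name `guess_utf16_or_utf32`, the String input, and the Symbol return values are assumptions made for illustration; only the 25% NULL threshold and the first-byte dispatch are taken from the method above.

```ruby
# Hypothetical standalone helper mirroring the heuristic above (not part of CMess).
def guess_utf16_or_utf32(data, threshold = 0.25)
  bytes = data.bytes
  return nil if bytes.empty?

  # Share of NULL bytes among all bytes.
  null_ratio = bytes.count(0x00).to_f / bytes.size
  return nil unless null_ratio > threshold

  # The first byte selects the variant.
  case bytes.first
  when 0x00 then :utf_32
  when 0xfe then :utf_16be
  when 0xff then :utf_16le
  else           :utf_16
  end
end

with_be_bom = [0xfe, 0xff, 0x00, 0x68, 0x00, 0x69].pack("C*")
guess_utf16_or_utf32(with_be_bom)   # => :utf_16be
guess_utf16_or_utf32("hello")       # => nil
```

Note that the dispatch looks at the first byte only, so input without a byte order mark is classified by whatever its first byte happens to be.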
encoding_03_UTF_8()
UTF-8, if the number of following bytes announced by the escape bytes (0xc0..0xf7) matches the number of following bytes (0x80..0xbf) actually present.
  # File lib/cmess/guess_encoding/automatic.rb, line 271
  def encoding_03_UTF_8
    esc_bytes = byte_count_sum(0xc0..0xdf)     +  # => 110xxxxx 10xxxxxx
                byte_count_sum(0xe0..0xef) * 2 +  # => 1110xxxx 10xxxxxx 10xxxxxx
                byte_count_sum(0xf0..0xf7) * 3    # => 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    UTF_8 if esc_bytes > 0 && esc_bytes == byte_count_sum(0x80..0xbf)
  end
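The counting argument can be tried on small inputs with the following sketch. The helper name `looks_like_utf8?` and the in-memory histogram are assumptions made for this example; the byte ranges and multipliers are the ones used in the method above.

```ruby
# Hypothetical standalone helper (not part of CMess).
def looks_like_utf8?(data)
  # Histogram: how often each byte value occurs.
  counts = Array.new(256, 0)
  data.each_byte { |byte| counts[byte] += 1 }

  sum = ->(range) { range.sum { |byte| counts[byte] } }

  # Following bytes announced by the lead (escape) bytes.
  expected = sum.(0xc0..0xdf)     +  # 110xxxxx  announces 1
             sum.(0xe0..0xef) * 2 +  # 1110xxxx  announces 2
             sum.(0xf0..0xf7) * 3    # 11110xxx  announces 3

  # Following bytes actually present (10xxxxxx).
  expected > 0 && expected == sum.(0x80..0xbf)
end

looks_like_utf8?([0x63, 0x61, 0x66, 0xc3, 0xa9].pack("C*"))   # => true
looks_like_utf8?([0x63, 0x61, 0x66, 0xe9].pack("C*"))         # => false
looks_like_utf8?("plain")                                      # => false
```

The comparison works on totals only. It does not check that each following byte sits directly behind its lead byte, so it is cheaper than a full validation but can be fooled by input whose counts balance by coincidence.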
encoding_04_TEST_ENCODINGS()
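One of TEST_ENCODINGS, if the relative frequency of its TEST_CHARS reaches TEST_THRESHOLD_DIRECT (direct match). If no candidate matches directly, the candidate with the highest frequency is returned, provided that frequency reaches TEST_THRESHOLD_APPROX (approximate match).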
  # File lib/cmess/guess_encoding/automatic.rb, line 284
  def encoding_04_TEST_ENCODINGS
    ratios = {}

    TEST_ENCODINGS.find(lambda {
      ratio, encoding = ratios.sort.last
      encoding if ratio >= TEST_THRESHOLD_APPROX
    }) { |encoding|
      ratio = relative_byte_count(byte_count_sum(TEST_CHARS[encoding]))
      ratio >= TEST_THRESHOLD_DIRECT || (ratios[ratio] ||= encoding; false)
    }
  end
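The method relies on the optional `ifnone` argument of `Enumerable#find`: the block looks for a direct match, and the lambda is called only when no element satisfies the block. The sketch below shows the same two-stage pattern in isolation. All names, candidate encodings, marker bytes, and threshold values are hypothetical and chosen for illustration; they are not the values CMess uses.

```ruby
# Hypothetical illustrative values (not the CMess constants).
CANDIDATES       = %w[ISO-8859-1 CP850].freeze
MARKER_BYTES     = {
  'ISO-8859-1' => [0xe4, 0xf6, 0xfc],
  'CP850'      => [0x84, 0x94, 0x81]
}.freeze
DIRECT_THRESHOLD = 0.10
APPROX_THRESHOLD = 0.02

def guess_by_marker_bytes(data)
  bytes = data.bytes
  return nil if bytes.empty?

  ratios = {}

  # Fallback, called by find only if the block never returned true:
  # pick the best ratio seen so far, if it is good enough.
  fallback = lambda {
    ratio, encoding = ratios.max_by { |r, _| r }
    encoding if ratio && ratio >= APPROX_THRESHOLD
  }

  CANDIDATES.find(fallback) do |encoding|
    hits  = bytes.count { |byte| MARKER_BYTES[encoding].include?(byte) }
    ratio = hits.to_f / bytes.size

    next true if ratio >= DIRECT_THRESHOLD  # direct match, stop searching

    ratios[ratio] ||= encoding              # remember for the fallback
    false
  end
end

direct = ([0x61] * 8 + [0xe4] * 2).pack("C*")
approx = ([0x61] * 97 + [0x84] * 3).pack("C*")

guess_by_marker_bytes(direct)    # => "ISO-8859-1"
guess_by_marker_bytes(approx)    # => "CP850"
guess_by_marker_bytes("plain")   # => nil
```

Because `ratios` is keyed by ratio and filled with `||=`, the earlier candidate wins when two candidates produce the same ratio, which is one reason the order of the candidate list matters.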