module CMess::GuessEncoding::Automatic::EncodingGuessers

Definition of guessing heuristics. Order matters!

Public Instance Methods

encoding_01_ASCII()

ASCII, if all bytes fall within the lower 128 byte values (0x00..0x7f). Unfortunately, the whole file has to be read to make that decision.

# File lib/cmess/guess_encoding/automatic.rb, line 251
def encoding_01_ASCII
  ASCII if eof? && byte_count_sum(0x00..0x7f) == byte_total
end
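
The check above can be sketched outside of CMess as a standalone method. This is only an illustration: `byte_histogram` and `ascii_only?` are hypothetical names, and the input is assumed to be a String already read completely into memory.

```ruby
# Hypothetical helper, not part of CMess: counts how often each
# byte value (0..255) occurs in the given string.
def byte_histogram(data)
  counts = Array.new(256, 0)
  data.each_byte { |b| counts[b] += 1 }
  counts
end

# True if every byte lies in 0x00..0x7f, i.e. the number of bytes
# in that range equals the total number of bytes.
def ascii_only?(data)
  byte_histogram(data)[0x00..0x7f].sum == data.bytesize
end
```

A single byte above 0x7f anywhere in the input makes the two counts differ, which is why the decision cannot be made before the end of the input has been reached.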

encoding_02_UTF_32_and_UTF_16BE_and_UTF_16LE_and_UTF_16()

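UTF-16 / UTF-32, if NULL bytes make up more than 25% of all bytes. The first byte then selects the variant: 0x00 for UTF-32, 0xfe for UTF-16BE, 0xff for UTF-16LE, and plain UTF-16 otherwise.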

# File lib/cmess/guess_encoding/automatic.rb, line 258
def encoding_02_UTF_32_and_UTF_16BE_and_UTF_16LE_and_UTF_16
  if relative_byte_count(byte_count[0]) > 0.25
    case first_byte
      when 0x00 then UTF_32
      when 0xfe then UTF_16BE
      when 0xff then UTF_16LE
      else           UTF_16
    end
  end
end
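
A minimal standalone sketch of the same idea, assuming the whole input is available as a String. The method name `guess_utf16_or_utf32` and the returned labels are hypothetical; the 0.25 cutoff and the first-byte cases are taken from the source above.

```ruby
# Hypothetical standalone version of the NULL-byte heuristic.
# Returns an encoding label, or nil if the heuristic does not apply.
def guess_utf16_or_utf32(data)
  return nil if data.empty?

  # Share of NULL bytes among all bytes of the input.
  null_ratio = data.each_byte.count(0).to_f / data.bytesize
  return nil unless null_ratio > 0.25

  # The first byte hints at a byte order mark (or its absence).
  case data.getbyte(0)
  when 0x00 then 'UTF-32'
  when 0xfe then 'UTF-16BE'
  when 0xff then 'UTF-16LE'
  else           'UTF-16'
  end
end
```

For mostly Latin text, UTF-16 stores a NULL byte in every second position and UTF-32 in three of every four, so a share above one quarter is a plausible signal for either.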

encoding_03_UTF_8()

UTF-8, if the number of continuation bytes implied by the lead ("escape") bytes matches the number of continuation bytes (0x80..0xbf) actually present.

# File lib/cmess/guess_encoding/automatic.rb, line 271
def encoding_03_UTF_8
  esc_bytes = byte_count_sum(0xc0..0xdf)     +
              # => 110xxxxx 10xxxxxx
              byte_count_sum(0xe0..0xef) * 2 +
              # => 1110xxxx 10xxxxxx 10xxxxxx
              byte_count_sum(0xf0..0xf7) * 3
              # => 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

  UTF_8 if esc_bytes > 0 && esc_bytes == byte_count_sum(0x80..0xbf)
end
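
The counting rule can be tried on a plain String with the following sketch. `looks_like_utf8?` is a hypothetical name, and the histogram is built inline rather than through the module's own `byte_count_sum`.

```ruby
# Hypothetical standalone version of the lead/continuation balance check.
def looks_like_utf8?(data)
  counts = Array.new(256, 0)
  data.each_byte { |b| counts[b] += 1 }

  # Continuation bytes that the lead bytes announce:
  expected = counts[0xc0..0xdf].sum     + # 110xxxxx: one follows
             counts[0xe0..0xef].sum * 2 + # 1110xxxx: two follow
             counts[0xf0..0xf7].sum * 3   # 11110xxx: three follow

  # Continuation bytes (10xxxxxx) that are actually present.
  actual = counts[0x80..0xbf].sum

  expected > 0 && expected == actual
end
```

Note that this compares totals only and does not check that each lead byte is immediately followed by its continuation bytes. Where strict validation is needed, Ruby's `String#valid_encoding?` (after `force_encoding('UTF-8')`) checks the actual byte sequences.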

encoding_04_TEST_ENCODINGS()

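TEST_ENCODINGS, if the relative frequency of an encoding's TEST_CHARS reaches TEST_THRESHOLD_DIRECT (direct match: the first such encoding is returned immediately) or, failing that, if the highest frequency found reaches TEST_THRESHOLD_APPROX (approximate match).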

# File lib/cmess/guess_encoding/automatic.rb, line 284
def encoding_04_TEST_ENCODINGS
  ratios = {}

  TEST_ENCODINGS.find(lambda {
    ratio, encoding = ratios.sort.last
    encoding if ratio >= TEST_THRESHOLD_APPROX
  }) { |encoding|
    ratio = relative_byte_count(byte_count_sum(TEST_CHARS[encoding]))
    ratio >= TEST_THRESHOLD_DIRECT || (ratios[ratio] ||= encoding; false)
  }
end