Ruby: String#encode not fixing "invalid byte sequence in UTF-8" error
I know there are multiple similar questions about this error, and I've tried many of them without luck. The problem I'm having involves the byte \xA1, which is throwing
ArgumentError: invalid byte sequence in UTF-8
I've tried the following with no success:
"\xa1".encode('utf-8', :undef => :replace, :invalid => :replace, :replace => "").sub('', '')
"\xa1".encode('utf-8', :undef => :replace, :invalid => :replace, :replace => "").force_encoding('utf-8').sub('', '')
"\xa1".encode('utf-8', :undef => :replace, :invalid => :replace, :replace => "").encode('utf-8').sub('', '')
Each line throws the error for me. What am I doing wrong?
Update:
The above lines fail in irb. However, I modified my application to encode the lines of a CSV file using the same String#encode method and arguments, and I get the same error when reading the line from a file (note: it works if I perform the operations on the same string without using IO).
require 'tempfile'
require 'csv'

bad_line = "col1\tcol2\tbad\xa1"
bad_line.sub('', '') # does not fail
puts bad_line # => col1 col2 bad?

tmp = Tempfile.new 'foo' # write the line to a file to emulate the real problem
tmp.puts bad_line
tmp.close

tmp2 = Tempfile.new 'bar'
begin
  IO.foreach tmp.path do |line|
    line.encode!('utf-8', :undef => :replace, :invalid => :replace, :replace => "")
    line.sub('', '') # fail: invalid byte sequence in UTF-8
    tmp2.puts line
  end
  tmp2.close
  # this would fail if the above error didn't halt execution
  CSV.foreach(tmp2.path) do |row|
    puts row.inspect # fail: invalid byte sequence in UTF-8
  end
ensure
  tmp.unlink
  tmp2.close
  tmp2.unlink
end
It would seem that Ruby thinks the string's encoding is already UTF-8, so when you do
line.encode!('utf-8', :undef => :replace, :invalid => :replace, :replace => "")
it doesn't actually do anything, because the destination encoding is the same as the current encoding (at least that's my interpretation of the code in transcode.c).
The real question here is whether your starting data is valid in some encoding that isn't UTF-8, or whether the data is supposed to be UTF-8 but has a few warts in it that you want to discard.
In the first case, the correct thing to do is to tell Ruby what the encoding is. You can do that when you open the file:
File.open('somefile', 'r:iso-8859-1')
will open the file, interpreting its contents as ISO-8859-1.
You can even get Ruby to transcode for you:
File.open('somefile', 'r:iso-8859-1:utf-8')
will open the file as ISO-8859-1, but when you read data from it the bytes will be converted to UTF-8 for you.
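As a rough illustration of the two open modes, here is a minimal sketch. The sample bytes and the temp file name are made up for the demo; only the 'r:iso-8859-1' and 'r:iso-8859-1:utf-8' mode strings come from the answer.

```ruby
require 'tempfile'

# Hypothetical sample: "café" encoded as ISO-8859-1 (0xE9 is "é" there).
latin1_bytes = "caf\xE9\n".b

tmp = Tempfile.new('latin1_demo')
tmp.binmode
tmp.write(latin1_bytes)
tmp.close

# External encoding only: the bytes are left alone and tagged as ISO-8859-1.
tagged = File.open(tmp.path, 'r:iso-8859-1') { |f| f.read }

# External and internal encoding: the bytes are transcoded to UTF-8 on read.
transcoded = File.open(tmp.path, 'r:iso-8859-1:utf-8') { |f| f.read }

tmp.unlink

puts tagged.encoding            # ISO-8859-1
puts tagged.bytesize            # 5 ("é" is one byte here)
puts transcoded.encoding        # UTF-8
puts transcoded.bytesize        # 6 ("é" is two bytes in UTF-8)
puts transcoded.valid_encoding? # true
```

The second form is usually the more convenient one when the rest of the program expects UTF-8, since every string read from the file is already converted.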
You can also call force_encoding to tell Ruby what a string's encoding is (this doesn't modify the bytes at all, it just tells Ruby how to interpret them).
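A small sketch of that point, assuming (purely for illustration) that the stray \xA1 byte was meant as ISO-8859-1, where it is the character "¡". The variable names are hypothetical.

```ruby
# The same four bytes, first as a binary (ASCII-8BIT) copy.
raw = "bad\xA1".b

# force_encoding only changes the label on the string, never the bytes.
as_utf8   = raw.dup.force_encoding('UTF-8')
as_latin1 = raw.dup.force_encoding('ISO-8859-1')

puts as_utf8.valid_encoding?           # false: 0xA1 alone is not valid UTF-8
puts as_latin1.valid_encoding?         # true
puts as_utf8.bytes == as_latin1.bytes  # true: identical bytes, different label

# Once the label is right, a real transcode to UTF-8 succeeds.
converted = as_latin1.encode('UTF-8')
puts converted.valid_encoding?         # true
puts converted.bytesize                # 5: "¡" takes two bytes in UTF-8
```

Whether ISO-8859-1 is the right label depends entirely on where the data came from; force_encoding cannot detect that for you.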
In the second case, where you want to dump whatever nasty stuff has got into your UTF-8, you can't just call encode! as you have, because that's a no-op. In Ruby 2.1 and higher you can use String#scrub; in previous versions you can do this:
line.encode!('utf-16', :undef => :replace, :invalid => :replace, :replace => "")
line.encode!('utf-8')
We first convert to UTF-16. Since this is a different encoding, Ruby will actually replace our invalid sequences. We can then convert back to UTF-8. This won't lose any data, because UTF-8 and UTF-16 are just two different ways of encoding the same underlying character set.
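To compare the two approaches side by side, here is a minimal sketch using the sample line from the question. It assumes Ruby 2.1 or higher so that String#scrub is available; the variable names are illustrative.

```ruby
# The asker's sample line; the trailing 0xA1 byte makes it invalid UTF-8.
bad_line = "col1\tcol2\tbad\xA1"
puts bad_line.valid_encoding?   # false

# Ruby 2.1 and higher: drop invalid sequences directly.
scrubbed = bad_line.scrub('')

# Older Rubies: round-trip through UTF-16 so that a real transcode happens
# and the :invalid => :replace option gets a chance to apply.
round_trip = bad_line.dup
round_trip.encode!('UTF-16', :undef => :replace, :invalid => :replace, :replace => "")
round_trip.encode!('UTF-8')

puts scrubbed.inspect           # "col1\tcol2\tbad"
puts round_trip.inspect         # "col1\tcol2\tbad"
puts round_trip.valid_encoding? # true
```

Both results can then be passed to sub or written out for CSV parsing without the invalid byte.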