Ruby - String#encode not fixing "invalid byte sequence in UTF-8" error

I know there are multiple similar questions about this error, and I've tried many of the suggested fixes without luck. The problem I'm having involves the byte \xA1, which is throwing

ArgumentError: invalid byte sequence in UTF-8

I've tried the following with no success:

"\xa1".encode('utf-8', :undef => :replace, :invalid => :replace, :replace => "").sub('', '')
"\xa1".encode('utf-8', :undef => :replace, :invalid => :replace, :replace => "").force_encoding('utf-8').sub('', '')
"\xa1".encode('utf-8', :undef => :replace, :invalid => :replace, :replace => "").encode('utf-8').sub('', '')

Each line throws the error for me. What am I doing wrong?

Update:

The lines above fail in IRB. However, I modified my application to encode the lines of a CSV file using the same String#encode method and arguments, and I get the same error when reading a line from the file (note: it works if I perform the operations on the same string without using IO).

require 'tempfile'
require 'csv'
bad_line = "col1\tcol2\tbad\xa1"
bad_line.sub('', '') # does not fail
puts bad_line # => col1 col2 bad?
tmp = Tempfile.new 'foo' # write the line to a file to emulate the real problem
tmp.puts bad_line
tmp.close
tmp2 = Tempfile.new 'bar'
begin
  IO.foreach tmp.path do |line|
    line.encode!('utf-8', :undef => :replace, :invalid => :replace, :replace => "")
    line.sub('', '') # fails: invalid byte sequence in UTF-8
    tmp2.puts line
  end
  tmp2.close
  # this would fail if the error above didn't halt execution
  CSV.foreach(tmp2.path) do |row|
    puts row.inspect # fails: invalid byte sequence in UTF-8
  end
ensure
  tmp.unlink
  tmp2.close
  tmp2.unlink
end

It would seem that Ruby thinks the string's encoding is already UTF-8, so when you do

line.encode!('utf-8', :undef => :replace, :invalid => :replace, :replace => "")

it doesn't actually do anything, because the destination encoding is the same as the current encoding (at least that's my interpretation of the code in transcode.c).
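One way to check this diagnosis is to ask the string how Ruby has labeled it and whether its bytes are valid under that label. This is only a sketch using String#encoding and String#valid_encoding?; the helper name describe_encoding is made up for illustration and is not part of the original post.

```ruby
# Hypothetical helper (not from the original post): report how Ruby has
# labeled a string and whether its bytes are valid under that label.
def describe_encoding(str)
  { encoding: str.encoding.name, valid: str.valid_encoding? }
end

# "\xA1" on its own is not a complete UTF-8 sequence, so a string labeled
# UTF-8 that contains it is reported as invalid.
bad_line = "col1\tcol2\tbad\xA1".dup.force_encoding(Encoding::UTF_8)
p describe_encoding(bad_line)

# The same bytes labeled ISO-8859-1 are valid, because every byte value
# maps to a character in that encoding.
p describe_encoding(bad_line.dup.force_encoding(Encoding::ISO_8859_1))
```

If the first result reports UTF-8 and invalid, the string already carries the target encoding's label, which is the situation described above.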

The real question here is whether your starting data is valid in some encoding that isn't UTF-8, or whether it is data that is supposed to be UTF-8 but has a few warts in it that you want to discard.

In the first case, the correct thing to do is to tell Ruby what that encoding is. You can do so when you open the file:

File.open('somefile', 'r:iso-8859-1')

will open the file, interpreting its contents as ISO-8859-1.

You can even get Ruby to transcode for you:

File.open('somefile', 'r:iso-8859-1:utf-8')

will open the file as ISO-8859-1, but when you read data from it the bytes will be converted to UTF-8 for you.
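As a rough illustration of the transcoding open mode, the sketch below writes the problem bytes to a temporary file and reads them back through 'r:iso-8859-1:utf-8'. It assumes the data really is ISO-8859-1, which the question does not confirm; the helper name read_as_utf8 is hypothetical.

```ruby
require 'tempfile'

# Hypothetical helper: read a whole file, treating its bytes as
# `external` and transcoding them to UTF-8 while reading.
def read_as_utf8(path, external)
  File.open(path, "r:#{external}:utf-8") { |f| f.read }
end

Tempfile.create('latin1') do |tmp|
  tmp.binmode
  tmp.write("bad\xA1".b)   # raw bytes; 0xA1 is the character U+00A1 in ISO-8859-1
  tmp.flush

  text = read_as_utf8(tmp.path, 'iso-8859-1')
  p text.encoding          # the string is labeled UTF-8
  p text.valid_encoding?   # and its bytes are valid UTF-8
end
```

Because the conversion happens at read time, later calls such as sub or CSV parsing see a valid UTF-8 string.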

You can also call force_encoding to tell Ruby what a string's encoding is (this doesn't modify the bytes at all, it just tells Ruby how to interpret them).
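A small sketch of the difference between relabeling and transcoding, again assuming the bytes are really ISO-8859-1. The helper name relabel_and_convert is invented for this example.

```ruby
# Hypothetical helper: relabel a copy of the string as `actual` (no bytes
# change), then transcode that copy to UTF-8.
def relabel_and_convert(str, actual)
  str.dup.force_encoding(actual).encode('UTF-8')
end

raw = "bad\xA1".b                               # raw bytes, labeled as binary
relabeled = raw.dup.force_encoding('ISO-8859-1')
p raw.bytes == relabeled.bytes                  # relabeling leaves the bytes alone
p relabel_and_convert(raw, 'ISO-8859-1')        # transcoding rewrites them as UTF-8
```

force_encoding only changes the label; it is the following encode call that actually produces new bytes.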

In the second case, where you want to dump whatever nasty stuff has got into your UTF-8, you can't just call encode! as you have, because that's a no-op. In Ruby 2.1 and higher you can use String#scrub; in previous versions you can do this:

line.encode!('utf-16', :undef => :replace, :invalid => :replace, :replace => "")
line.encode!('utf-8')

We first convert to UTF-16. Since this is a different encoding, Ruby will actually replace our invalid sequences. We can then convert back to UTF-8. This won't lose any extra data, because UTF-8 and UTF-16 are just two different ways of encoding the same underlying character set.
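The two approaches can be wrapped in one helper so the same code runs on old and new Rubies. This is a sketch rather than a definitive implementation: the name strip_invalid_utf8 is hypothetical, and the fallback branch assumes the string is labeled UTF-8.

```ruby
# Hypothetical helper: drop byte sequences that are invalid in a string
# labeled UTF-8. Uses String#scrub where it exists (Ruby 2.1+) and
# otherwise falls back to the UTF-16 round trip described above.
def strip_invalid_utf8(str)
  if str.respond_to?(:scrub)
    str.scrub('')
  else
    str.encode('UTF-16', :undef => :replace, :invalid => :replace, :replace => '')
       .encode('UTF-8')
  end
end

dirty = "col1\tcol2\tbad\xA1".dup.force_encoding('UTF-8')
clean = strip_invalid_utf8(dirty)
p clean.valid_encoding?
p clean
```

Applied to each line inside the IO.foreach loop from the question, this would take the place of the encode! call that had no effect.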
