python - Scrapy retrieves text encoding incorrectly, hebrew as \u0d5 etc -


This is my first time working with this stuff. I checked out other SO questions on internationalization / text encoding.

I'm doing the Scrapy tutorial and got stuck at the part on extracting data: when I extract data, the text is displayed as a series of \uXXXX escapes instead of Hebrew.

You can check this by scraping this page, for example:

    scrapy shell "http://israblog.nana10.co.il/blogread.asp?blog=167524&blogcode=13348970"
    >>> hxs.select('//h2[@class="title"]/text()').extract()[0]

This retrieves:

u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'

(Unrelated:) if I try to print it in the console, I get:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "c:\python27\lib\encodings\cp437.py", line 12, in encode
        return codecs.charmap_encode(input,errors,encoding_map)
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>

I tried setting the encoding through settings, I tried converting manually, and I feel like I've tried everything.

(I've gone through 5 pomodoros trying to fix this!)

What can I do to get the Hebrew text that should be there: "מי אנס פוטנציאלי?"

(Disclaimer: I went to the first blog post I noticed on http://israblog.co.il. I'm in no way related to the blog or the blog owner; I just used it as an example.)


test.py:

    # coding: utf-8
    a = u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'
    b = 'מי אנס פוטנציאלי?'
    print a
    print b

Result:

    vic@wic:~/projects/snippets$ python test.py
    מי אנס פוטנציאלי?
    מי אנס פוטנציאלי?
    vic@wic:~/projects/snippets$

As you can see, they are the same. The \uXXXX form is just a different representation of the same unicode string, so don't worry: the text was scraped correctly.
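To make that point concrete, here is a minimal sketch that compares the two spellings directly and then derives the escaped form from the string itself. The names `escaped`, `literal`, `same` and `as_escapes` are mine, not from the question, and the `__future__` import is only there so the snippet behaves the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# The value as the Scrapy shell displayed it, and the same title typed directly.
escaped = '\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'
literal = 'מי אנס פוטנציאלי?'

# Both spellings produce the same sequence of code points.
same = (escaped == literal)
print(same)

# The \uXXXX text is only an ASCII-safe way of *displaying* the string.
# Encoding with 'unicode_escape' reproduces that display form on demand.
as_escapes = escaped.encode('unicode_escape').decode('ascii')
print(as_escapes)
```

The shell shows the `repr()` of the extracted value, which on Python 2 escapes every non-ASCII character. The data itself is unchanged.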

If you want to save it to a file:

    Python 2.7.3 (default, Apr 20 2012, 22:39:59)
    [GCC 4.6.3] on linux2
    >>> a = u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9'
    >>> a
    u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9'
    >>> f = open('test.txt', 'w')
    >>> f.write(a)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
    >>> f.write(a.encode('utf-8'))
    >>> f.close()
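An alternative to calling `.encode('utf-8')` by hand on every write is to let the file object do the encoding. This is a sketch using `io.open`, which is in the standard library on both Python 2.7 and Python 3; the temporary directory and the names `title`, `path`, `raw` and `restored` are assumptions for illustration only.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import io
import os
import tempfile

title = '\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9'

# Hebrew letters have no ASCII encoding, which is what the traceback above reports.
try:
    title.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Hypothetical output location: a fresh temp directory instead of the cwd.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# io.open with an explicit encoding accepts unicode text and encodes on write.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(title)

# Read the raw bytes back to see what actually landed on disk.
with io.open(path, 'rb') as f:
    raw = f.read()

# Read it back as text to confirm the round trip.
with io.open(path, 'r', encoding='utf-8') as f:
    restored = f.read()

print(ascii_failed, len(title), len(raw), restored == title)
```

The file ends up holding more bytes than the string has characters, because each Hebrew letter takes two bytes in UTF-8.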
