python - Scrapy retrieves text encoding incorrectly, Hebrew as \u0d5 etc.
This is my first time working with this kind of thing. I checked other SO questions about internationalization and text encoding.
I'm doing the Scrapy tutorial and got stuck at the "Extracting the data" part: when I extract data, the text is displayed as a series of \uXXXX escapes instead of Hebrew.
You can check this by scraping this page, for example:
scrapy shell "http://israblog.nana10.co.il/blogread.asp?blog=167524&blogcode=13348970"
and then, in the shell:
hxs.select('//h2[@class="title"]/text()').extract()[0]
This returns:
u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'
(Unrelated:) if I try to print it in the console, I get:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>
I tried setting the encoding through the settings and tried converting manually; I feel like I've tried everything.
(I've gone through 5 pomodoros trying to fix this!)
What can I do to get the Hebrew text that should be there: "מי אנס פוטנציאלי?"
(Disclaimer: I just went to the first blog post I noticed on http://israblog.co.il. I'm in no way related to the blog or the blog owner; I only used it as an example.)
test.py:
# coding: utf-8
a = u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'
b = 'מי אנס פוטנציאלי?'
print a
print b
Result:
vic@wic:~/projects/snippets$ python test.py
מי אנס פוטנציאלי?
מי אנס פוטנציאלי?
vic@wic:~/projects/snippets$
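As you can see, they are the same. The \uXXXX form is just a different representation of the same Unicode string, the one Python shows when it displays the repr of a unicode object. Don't worry, the text was scraped correctly.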
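To make the point concrete, here is a minimal sketch (not part of the original answer) that compares the two spellings and checks which codecs can represent the title. The variable names and the can_encode helper are hypothetical, introduced only for illustration; cp437 is taken from the traceback in the question.

```python
# -*- coding: utf-8 -*-
# Sketch: the \uXXXX form and the Hebrew glyphs are the same text.

escaped = u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'
literal = u'מי אנס פוטנציאלי?'

# Both spellings produce the same sequence of code points.
same_text = (escaped == literal)

# The escaped form is only a way of displaying the string;
# 'unicode_escape' reproduces that ASCII-only view explicitly.
ascii_view = escaped.encode('unicode_escape')


def can_encode(text, encoding):
    """Hypothetical helper: True if every character exists in the codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# cp437 (the console code page in the question's traceback) has no Hebrew
# letters, so printing fails there even though the data itself is intact.
fits_cp437 = can_encode(escaped, 'cp437')
fits_utf8 = can_encode(escaped, 'utf-8')

print(same_text)
print(fits_cp437)
print(fits_utf8)
```

If this reading is right, the console error in the question comes from the terminal's code page, not from Scrapy: the string only has to be encoded with a codec that contains Hebrew when it leaves Python.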
If you want to save it to a file:
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
>>> a = u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9'
>>> a
u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9'
>>> f = open('test.txt', 'w')
>>> f.write(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
>>> f.write(a.encode('utf-8'))
>>> f.close()
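An alternative to calling .encode('utf-8') by hand is to let the file object do the encoding. The sketch below is an assumption on my part rather than something from the original answer: it uses io.open, which accepts an encoding argument on both Python 2.7 and Python 3, and writes to a temporary directory chosen only for the example.

```python
import io
import os
import tempfile

title = u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9'

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# io.open encodes on write, so no manual .encode('utf-8') is needed.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(title)

# Reading back with the same encoding returns the original unicode string.
with io.open(path, 'r', encoding='utf-8') as f:
    restored = f.read()

print(restored == title)
```

Opening the file with an explicit encoding keeps the text as unicode everywhere inside the program and confines the byte conversion to the file boundary.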