Why does it throw an error when using find()?

Python

bugagashnik, 2018-04-06 12:29:58

Source:

# -*- coding: utf-8 -*-

import pymongo
from dictionary import dictionary
class DiplomaPipeline(object):
    collection_name = 'DiplomaItem'
    arr = ['да', 'нет']

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        ## initializing spider
        ## opening db connection
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        ## clean up when spider is closed
        self.client.close()

    def process_item(self, item, spider):
        ## how to handle each post
        print('~~~~~~~~~~!!!!', )
        for word in self.arr:
            print(word)
            print(item['Comment'])
            print(item['Comment'].find(word))
        # self.db[self.collection_name].insert(dict(item))
        # logging.debug("Post added to MongoDB")
        return item

Error log:

да
Вообще это не смешно, а практично, заменить двух мальеньких нигеров на два вместительных бака.
2018-04-06 15:27:06 [scrapy.core.scraper] ERROR: Error processing {'Comment': u'\u0412\u043e\u043e\u0431\u0449\u0435 \u044d\u0442\u043e \u043d\u0435 \u0441\u043c\u0435\u0448\u043d\u043e, \u0430 \u043f\u0440\u0430\u043a\u0442\u0438\u0447\u043d\u043e, \u0437\u0430\u043c\u0435\u043d\u0438\u0442\u044c \u0434\u0432\u0443\u0445 \u043c\u0430\u043b\u044c\u0435\u043d\u044c\u043a\u0438\u0445 \u043d\u0438\u0433\u0435\u0440\u043e\u0432 \u043d\u0430 \u0434\u0432\u0430 \u0432\u043c\u0435\u0441\u0442\u0438\u0442\u0435\u043b\u044c\u043d\u044b\u0445 \u0431\u0430\u043a\u0430.',
 'MainPageUrl': u'https://pikabu.ru/story/bezyiskhodnost_5826272'}
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/mymac/Work/crawler/diploma/diploma/pipelines.py", line 36, in process_item
    print(item['Comment'].find(word))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)

On the subject:
The error occurs in the process_item method.
I'm a complete beginner in Python, unfortunately. My crawler collects comments, and I need to filter each one. For now I have written a test that checks each comment against an array of words using the find() function, and it raises an error. I suspect something is wrong with the encoding, but I can't figure out how to fix it: if the array contains only Latin words, everything works.

2 answers

bugagashnik, 2018-04-06
@bugagashnik

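As is usually the case, errors like this come from a poor understanding of the language. The bottom line is that I was trying to compare data of two different types: a byte string (str) and a unicode string. The comment in item['Comment'] is unicode, while the words in arr are plain byte strings, so Python 2 first tries to decode each word as ASCII and fails on the first Cyrillic byte (0xd0). Once I understood this, all that remained was to work out how to convert a unicode string to a byte string. I did the following: item['Comment'].encode('utf-8')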
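
To make the type mismatch concrete, here is a minimal sketch, assuming Python 2.7 as in the traceback (it also runs on Python 3). The names comment, word_bytes, char_index and byte_index are made up for illustration and are not part of the pipeline above. It reproduces the failing ASCII decode explicitly and then applies the encode('utf-8') fix.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Scrapy hands the pipeline unicode text (see the u'...' values in the error log).
comment = u'Вообще это не смешно'

# In a UTF-8 source file under Python 2, a plain 'не' literal is this byte string.
word_bytes = u'не'.encode('utf-8')

# Mixing unicode and bytes makes Python 2 decode the bytes as ASCII first.
# Doing that step by hand shows why it fails on Cyrillic text.
try:
    word_bytes.decode('ascii')
    implicit_decode_ok = True
except UnicodeDecodeError:
    implicit_decode_ok = False

# Searching unicode in unicode works and returns a character index.
char_index = comment.find(u'не')

# The fix from the answer: encode the comment so both sides are byte strings.
# Note that the result is then a byte offset, not a character offset.
byte_index = comment.encode('utf-8').find(word_bytes)

print(implicit_decode_ok, char_index, byte_index)
```

Both searches find the word, but the byte offset is larger because each Cyrillic letter takes two bytes in UTF-8. If the position is only compared with -1, that difference does not matter.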

Ruslan., 2018-04-06
@LaRN

A similar error was discussed here:
https://stackoverflow.com/questions/21129020/how-t...
The solution suggested there is to change the default encoding (Python 2 only):
import sys
reload(sys)
sys.setdefaultencoding('utf8')
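
Changing the default encoding affects the whole process. As an alternative, here is a rough sketch that keeps everything as unicode and decodes byte strings explicitly. The helper names to_text and keyword_positions are hypothetical, and it assumes any byte strings are UTF-8 encoded. It should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Unicode literals, so no implicit ASCII decoding is ever needed.
KEYWORDS = [u'да', u'нет']


def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding byte strings explicitly."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


def keyword_positions(comment, keywords=KEYWORDS):
    """Map each keyword found in the comment to its first character index."""
    text = to_text(comment)
    positions = {}
    for word in keywords:
        index = text.find(to_text(word))
        if index != -1:
            positions[word] = index
    return positions


print(keyword_positions(u'это да'))
```

In process_item this could be called as keyword_positions(item['Comment']); an empty dict would mean that none of the words occur in the comment.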