A
A
alekssamos2020-08-07 08:56:57
Python
alekssamos, 2020-08-07 08:56:57

Fix fulltext get UnicodeDecodeError? Ignore doesn't help?

How to fix encoding? ignore doesn't help. It is also impossible to know the file encoding in advance.

Library
https://github.com/btimby/fulltext
My code:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-

import fulltext  as ft
import sys
debug = True

def gt(f):
  encodings = ("utf8", "cp1251")
  t = ""
  for encoding in encodings:
    e = None
    try:
      t = ft.get(f, encoding = encoding, encoding_errors = 'ignore')
      break
    except UnicodeDecodeError:
      e = sys.exc_info()
  if t == "" and e[0] == UnicodeDecodeError:
    raise e[1]
  return t

print(gt("book.epub"))

Mistake:
'utf-8' codec can't decode byte 0xfb in position 10: invalid start byte
Traceback (most recent call last):
  File "ft.py", line 31, in <module>
    print(gt(sys.argv[1]))
  File "ft.py", line 20, in gt
    raise e[1]
  File "ft.py", line 15, in gt
    t = ft.get(f, encoding = encoding, encoding_errors = 'ignore')
  File "/usr/local/lib/python3.7/site-packages/fulltext/__init__.py", line 154,                                                                                         in get
    backend, path_or_file, mime=mime, name=name, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/fulltext/__init__.py", line 90, i                                                                                        n _get_path
    return backend._get_file(f, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/fulltext/backends/__text.py", lin                                                                                        e 20, in _get_file
    text = f.read(BUFFER_MAX)
  File "/usr/local/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfb in position 10: invalid                                                                                         start byte
[email protected]:/var/www/html/stc# clear
[email protected]:/var/www/html/stc# python3 ft.py  Davydenko_Uchilka.508202.fb2.epub
'utf-8' codec can't decode byte 0xfb in position 10: invalid start byte
Traceback (most recent call last):
  File "ft.py", line 31, in <module>
    print(gt(sys.argv[1]))
  File "ft.py", line 20, in gt
    raise e[1]
  File "ft.py", line 15, in gt
    t = ft.get(f, encoding = encoding, encoding_errors = 'ignore')
  File "/usr/local/lib/python3.7/site-packages/fulltext/__init__.py", line 154,                                                                                         in get
    backend, path_or_file, mime=mime, name=name, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/fulltext/__init__.py", line 90, i                                                                                        n _get_path
    return backend._get_file(f, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/fulltext/backends/__text.py", lin                                                                                        e 20, in _get_file
    text = f.read(BUFFER_MAX)
  File "/usr/local/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfb in position 10: invalid                                                                                         start byte

Answer the question

In order to leave comments, you need to log in

1 answer(s)
A
alekssamos, 2020-08-07
@alekssamos

pip has an old version. New on Github.
To fix do:

pip install git+https://github.com/btimby/fulltext.git

and the new version will be installed.
If done simply pip install fulltextthen the old version will be installed.

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question