I want to parse a large number of pre-17th-century books in search of information about people who disappeared. How should I approach this?

Parsing

krakaka, 2020-09-28 22:43:44

With text PDFs I can parse using regular expressions, but these "books" will more often be scans, and frequently unformatted manuscripts, written in different languages and, moreover, in archaic forms of those languages. Computer vision will probably be needed. Which tool would you choose for a task like this?

2 answers

datka, 2020-09-29
@krakaka

Take a look at Tesseract OCR: https://github.com/tesseract-ocr/tesseract
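As a rough starting point, here is a minimal sketch of calling the `tesseract` command-line tool from Python on a single scanned page. It assumes the `tesseract` executable is installed and on your PATH, and that the language data you pass with `-l` (for example `lat` for Latin) is installed too. The helper names `build_tesseract_command` and `ocr_page` and the file names are made up for illustration. Tesseract is aimed mainly at printed text, so expect it to struggle with handwritten manuscripts.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, languages=("lat",), psm=3):
    """Build the argument list for the tesseract CLI (does not run it).

    languages: traineddata codes, joined with "+" so several can be used at once.
    psm: page segmentation mode (3 is tesseract's default, fully automatic).
    """
    return [
        "tesseract",
        str(image_path),
        str(output_base),
        "-l", "+".join(languages),
        "--psm", str(psm),
    ]


def ocr_page(image_path, output_base, languages=("lat",)):
    """Run tesseract on one scanned page and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    cmd = build_tesseract_command(image_path, output_base, languages)
    subprocess.run(cmd, check=True, capture_output=True)
    # By default tesseract writes plain text to "<output_base>.txt".
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


# Hypothetical usage (the paths are placeholders):
# text = ocr_page("scans/page_001.png", "ocr/page_001", languages=("lat",))
```

Once each page is plain text, you can run your regular expressions over the output, but check a sample of pages by hand first to see how many recognition errors you are dealing with.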

Developer, 2020-09-28
@samodum

This task cannot be solved automatically; it can only be done manually.