Is it possible to extract Russian characters from a PDF?

Python

Alexander, 2020-01-19 19:02:04

Is it possible to extract Russian (Cyrillic) characters from a PDF using this code?

import pdfminer.high_level

# Extract the PDF's text layer and write it to a UTF-8 text file.
with open('1.txt', 'w', encoding='utf8') as out_file:
    with open('1.pdf', 'rb') as file:
        pdfminer.high_level.extract_text_to_fp(file, out_file)

I get the following text in the file:

flpunoxenue J,,M .K r{3BerqeHrrro o [poBeAeHr,[r 3anpocaKornpoBoK n e:rerrponnofi 1|oplre cpe4ucy6texTon MaJroro u cpeAHeroupeArrpuErrMaTeJrbcTBaYrnepxqarc019r.OE 2,rZ7 "Texuuqecrcoe 3ananue J\b332-089-19Ha MexaHuqecxyrc o6pa6orrcy1. Hauueuoranrre roBapa (pa6orrr, yc.nyrn): Mexamrvecrcar o6pa6orra:lllnunrra M18x 69x45.23 IOCT 22033-76Illnumr<a Ml8x 69x50.23 IOCT 22033-76IIIrlrQr 8x22 IOCT 24296-93faftr<a uaKlr.urar lz-lzAlOCT 13957 -7 4fafira naxrAsar 16-12A fOCT 13957 -7 4llareu TH064-00.0032. Ko.rnqecrro (o6rdu):lllnrurrra Ml8x6g"45.23|OCT 22033-76 - 40 un.Illnunrra M18x69x50.23IOCT 22033-76 - 120 utr.lllrrr(fr 8x22 IOCT 24296-93 - 50 Itrr.fa"ftra narur4rar l2-l2A IOCT 13957 -7 4 - 20 mr.faftxa narumrag 16-124 |OCT 13957 -74 - 30 urr.Ilareq TH064-00.003 - 10 trt.3. (Dynxquona,'rbuble xapaKreprlcrrrnr!: Kpen€x.4. Texuflqecxue xaparTepucrxrli:HautrrenosaHrleMarepuar 3aroroBor rrepeAaBaeMbrx3axas.ruxoM IIcrroJrHrrreJrnlllnunrra Ml8x 6s.x45.23 |OCT 22033-76t4xt7H2 focr 5949-75Illnr.rmra M 1 8 x 69x50 .23 IOCT 22033-7 614Xr7H2 |OCT 5949-75llhubr 8 x22 | O CT 2429 6 -93Cra;rr 45 |OCT 1050-88faftra naxulnas 12-l2A |OCT 13957 -7 414X17I12 fOCT 5949-75faftra naxrAnas 16-12A fOCT 13957'7 4t4xtTH2rocr s949-75IIareu TH064-00.00340x focT 4543-71Teprrloo6pa6oma 3aroroBoK Brmorntercc cunaltu AO <<Typ6onacoc>. .5. Ka.recrneunue xaparTeprrcrrrxlr:- CranAaprnue Aera!'rrr B coorBercrswr c |OCT 22033-76,IOCT 24296-93,IOCT 13957 -7 4;- fla,reu TH064-00.003: flpe4emt repoxoBarocrlt aeranefi or ./FZTF' ao VRdT3, npeaenKBaJrr4reroB pa3Mepon - h13-H14, f9.6. Tpe6onanun K 6eonacHocr[: B coorBErcrBun c o6rqulruT pe6osaHl,Ifiul,I 6eonacnocru.7. Tpe6ooauur K pa3MepaM ToBapa:- Pasueprr roroBbD( crrurAaprHlrx AerarlEi cornacuo |OCT 22033-7 6,IOCT 24296-91 |OCT 13957-74;- Parrrleptr roroBbrx Aeraneii flaneq TH064-00.003 ,qonxnrr coorBercrBoBarr rpe6onaur.rrrr.rKoHcrp).KropcKofi .qoryvrenraquu, ra6apnrnue pa3Mephr cornacHo ecrusy (flpuloxenne).8. 
Tpe$orauur K orrpy3Ke roBapa: Orrpyara roBapa AoJIxHa noJlnocrlro o6ecnequnarr3arqr.rry or BHelrrwx tfusuuecrux tfarropon ao:4eficrnue na 4eranu.9. Tpe6orauur K yrraKoBKe roBapa: .{eraru 4orxnu 6rrrr ynaronaubl B LIHAI,IBtIAyuurbHyroyrraxoBKy, KOTOpaJT npeAoxparuIeT OT 3aCOpeHI,It, MexaHI{qecKI{x noBpexAeHlrfi, a TaKxenpeAorBparqaer ro:4eftcrnue Ira Aeranfi (faxropon orpyxarorqefi cpeAm10. Tpe6orauuff K p$yJr6Tararr pa6orr.r: foronrre Aeran[ Aon)I(Hbr coorBercrBoBarbrpe6oran1rxra crarr.qaproB (|OCT, OCT), Koncrp)'KropcKoft Aor)'rteHraqrEu u 6ylYr ucnoJrb3oBaflrl Bco6crreunorrr rrpor{3Bo.qcrBe 3aragq.rra. forosue AeTaJILI rloA;Iexar KoHTpoJIIo Lr npI{eMKe OTK sacooTBercrBr,Ie rpe6onauuru crall,qaproB I{ KoucTpyKTopcKofi AoKWeHraIIuu Gn). C rolvrnlerrou KIMoxrro o3uaxoMltrbct Ha reppllTopuu AO <Typ6ouacoc>,11. Cpor [ocraBrcrr, r$roroBJrer rfl [poryKrluu (oxarauux ycJryr, BbrrroJrlreuur pa6or)l BTeqenrre 30 (ryIa4qaru) Kanen.qapu6D( AEe[ c AaGI nepeaaru 3aroroBoK.Ilpu"roxeuue: 3crrc flaneq TH064-00.003,{uperrop uo rauecrny - uannrrfi KoHTponepHaqam,ur,rr orA. 323HaqaruHur orA. 332Hasamur.rr orA. 316O. H. KypuuuofiH. M. KouqpaquonoaA. A. 9afirunaA. IO. @part o4y'-/ l*v,,'

I tried converting the file with selenium and an online service, but every service I tried returned exactly the same text.
Then I decided to open the PDF in an online reader and parse the text from its HTML. The reader displays the text correctly in Russian, but the page source contains the same garbled text that I get. As I understand it, the problem is the encoding: using Python I found that the text in the file is detected as 'cp1252', but applying decode('cp1252') gives exactly the same text.
Do you have any suggestions?

1 answer(s)

U235, 2020-01-19
@AlexMine

It looks like the PDF's text layer contains only Latin characters.
Apparently the wrong language was selected when the document was OCR'd.
The fix is to run OCR again with the correct recognition language specified.
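
A minimal sketch of that suggestion, under some assumptions: it uses pdf2image (which needs Poppler installed) to render the pages and pytesseract (which needs Tesseract with the `rus` language data) to recognize them. Neither library appears in the question, and the function names `cyrillic_ratio` and `ocr_pdf_as_russian` are made up for illustration. The first helper only checks whether extracted text contains any Cyrillic letters at all, which is a quick way to confirm that the text layer itself is Latin-only and that no decode() call can recover it.

```python
def cyrillic_ratio(text: str) -> float:
    """Return the share of alphabetic characters that lie in the Cyrillic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(1 for ch in letters if '\u0400' <= ch <= '\u04FF')
    return cyrillic / len(letters)


def ocr_pdf_as_russian(pdf_path: str, dpi: int = 300) -> str:
    """Render each page to an image and run OCR with Russian as the language.

    Assumes pdf2image + Poppler and pytesseract + Tesseract ('rus' data)
    are installed; this is an illustration, not the only possible toolchain.
    """
    # Imported lazily so cyrillic_ratio() works without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)
    return "\n".join(
        pytesseract.image_to_string(page, lang="rus") for page in pages
    )
```

With the file from the question, one could first run `cyrillic_ratio(open('1.txt', encoding='utf8').read())`. A value near zero suggests the text layer is the problem, in which case `ocr_pdf_as_russian('1.pdf')` re-reads the page images instead of the stored text. Recognition quality will depend on the scan, so the result may still need manual checking.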