How to parse data from a PDF table?

Python

apiwi, 2021-11-12 18:57:55

A class schedule for student groups arrives every day as a PDF file laid out as a table.
I need to use Python to extract the classes and classrooms for one specific group, and I can't figure out how to implement it.
The file looks like this:
The result can be either text or a cropped image showing only that group's classes.
I tried several libraries: tabula, PyPDF2, and camelot. All I got was this:

I also got this variant:

I understand you might tell me to go to a freelance exchange, but no: I just need a nudge toward an approach for solving the task. Thank you.

3 answers

apiwi, 2021-11-17
@apiwi

Solved it with pdfminer, pdf2image, and PIL.
I found the coordinates of the desired text with pdfminer, converted the page to an image with pdf2image, and cropped the desired area with PIL (after adding offsets to the coordinates).
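
For anyone landing here later, the three steps described above could be sketched roughly as follows. This is not the exact code from the answer, only a minimal illustration under some assumptions: pdfminer.six and pdf2image (with poppler) are installed, the group label sits in a single text block, and the function names, `needle`, and `margin` are made up for the example. The main subtlety is that pdfminer reports coordinates in points with the origin at the bottom-left, while PIL crops in pixels with the origin at the top-left.

```python
def pdf_bbox_to_pixel_box(bbox, page_height, dpi=200, margin=0.0):
    """Convert a PDF bbox (points, origin bottom-left) to a PIL crop box
    (pixels, origin top-left), padded by `margin` points on every side."""
    x0, y0, x1, y1 = bbox
    scale = dpi / 72.0  # PDF user space is 72 points per inch
    left = max(0, round((x0 - margin) * scale))
    top = max(0, round((page_height - y1 - margin) * scale))
    right = round((x1 + margin) * scale)
    bottom = round((page_height - y0 + margin) * scale)
    return (left, top, right, bottom)


def crop_text_region(pdf_path, needle, out_path, dpi=200, margin=10.0):
    """Save a crop around the first text block containing `needle`.

    Returns True if a matching block was found, False otherwise.
    """
    # Third-party imports are local so the pure helper above works without them.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer
    from pdf2image import convert_from_path  # needs poppler on the system

    for page_number, layout in enumerate(extract_pages(pdf_path), start=1):
        for element in layout:
            if isinstance(element, LTTextContainer) and needle in element.get_text():
                # Render only the page that contains the match.
                page_image = convert_from_path(
                    pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
                )[0]
                box = pdf_bbox_to_pixel_box(element.bbox, layout.height, dpi, margin)
                page_image.crop(box).save(out_path)
                return True
    return False
```

A call might look like `crop_text_region("schedule.pdf", "Group 101", "group.png", margin=40)`, where the file name and group label are placeholders. To capture a whole column or row of the schedule rather than just the label, the padding would likely need to be larger or different on each side.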

Adamos, 2021-11-12
@Adamos

which is sent every day in a pdf file in the form of a table

By whom? Reptilians who refuse all contact and eat every carrier pigeon?
IMHO, you are heroically overcoming artificially created problems.
And perhaps you will get some results... but at the very first change on their side, those results will shatter to smithereens and you will have to start all over again.
Ask the source for the data in a different format and stop racking your brain.

Alexey Cheremisin, 2021-11-12
@leahch

Alas, this will not work reliably. (I answer this question here regularly.)
The reason is that PDF knows absolutely nothing about tables. It is a prepress (page-description) language: it contains nothing but text, fonts, blocks, and graphic primitives. Accordingly, the data in it is completely unstructured.
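
To make the point concrete: since a PDF only stores text fragments at positions, any "table" has to be rebuilt from coordinates by heuristics. Below is a minimal sketch of one such heuristic, assuming the fragments have already been extracted as `(x, y, text)` tuples (for example from pdfminer bounding boxes). The function name and the `y_tolerance` parameter are invented for this illustration, and the approach breaks down as soon as cells wrap onto several lines or rows are merged.

```python
def group_into_rows(fragments, y_tolerance=3.0):
    """Group (x, y, text) fragments into table-like rows.

    A fragment joins the current row when its y is within `y_tolerance`
    points of the y of that row's first fragment.
    """
    rows = []
    # Sort top to bottom (PDF y grows upward), then left to right.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order the cells by their x position.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Made-up sample fragments, deliberately out of order and slightly misaligned.
sample = [
    (200, 700.5, "Room 12"),
    (50, 700.0, "Math"),
    (50, 680.0, "Physics"),
    (200, 679.0, "Room 7"),
]
table = group_into_rows(sample)
```

Nothing in the file itself says that "Math" and "Room 12" belong together; only their nearly equal y coordinates suggest it, which is why such parsers tend to break when the layout changes.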