#2019-04-03 14:45:15
C++ / C#

How do I parse a complex RTF document, extract its tabular data, and split it into pages?

At first the task seemed simple, but neither a head-on approach nor a couple of parsers found on GitHub got me anywhere, which was somewhat disconcerting.
By the way, there is a critical constraint: all components must be legal and free.
I would be grateful for any tips!
Update, based on the answers and comments: the input is an auto-generated, multi-page report that contains several documents of the same type with tabular forms. It has to be cut into pages, and some of the information has to be pulled out selectively, say the document date and part of the tabular data. The fields do not have any tags.
... and the element tree built by https://github.com/sgolivernet/nrtftree is 620331 lines long ))


2 answer(s)
2019-04-04
@mindtester

1 - https://github.com/SourceCodeBackup/RtfDomParser is the best candidate for data extraction, and certainly for quick exploration.

I just had to learn how to use it.
Fortunately, the structure of the document is quite clear, so everything is solvable. But either it cannot save modified documents, or I still have not figured out how to use its built-in Writer.
2 - https://github.com/sgolivernet/nrtftree can save the current state, so it can be used for slicing, provided you learn to apply the knowledge of the structure obtained with RtfDomParser. It may even be possible to do the parsing with it, but the time available for the task is not unlimited. So the slicing will apparently have to be done by printing to PDF, which will obviously be faster (the assignment requires paginated PDFs as output).
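For quick exploration without either library, the table structure can also be read straight from the raw RTF. The sketch below is only an illustration under assumptions, not the API of RtfDomParser or nrtftree: it assumes the report generator writes explicit `\page` breaks and plain `\trowd` ... `\cell` ... `\row` tables, it ignores group nesting and skips `\'hh` escapes, and the name `extractTables` is hypothetical. Pages that exist only because text overflowed in the viewer are not marked in the file, so they would not be detected this way.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

using Row = std::vector<std::string>;  // cell texts of one table row
using Page = std::vector<Row>;         // table rows found between two \page breaks

// Hypothetical helper: one pass over raw RTF, grouping cell text by row and page.
// Assumes explicit \page breaks and simple, non-nested tables.
std::vector<Page> extractTables(const std::string& rtf) {
    std::vector<Page> pages(1);  // the document starts on its first page
    Row row;
    std::string text;
    auto isAlpha = [](char ch) { return std::isalpha(static_cast<unsigned char>(ch)) != 0; };
    auto isDigit = [](char ch) { return std::isdigit(static_cast<unsigned char>(ch)) != 0; };

    std::size_t i = 0;
    while (i < rtf.size()) {
        const char c = rtf[i];
        if (c == '\\' && i + 1 < rtf.size()) {
            const char next = rtf[i + 1];
            if (isAlpha(next)) {
                // Control word: letters, optional signed number, optional single space.
                std::size_t j = i + 1;
                while (j < rtf.size() && isAlpha(rtf[j])) ++j;
                const std::string word = rtf.substr(i + 1, j - i - 1);
                if (j + 1 < rtf.size() && rtf[j] == '-' && isDigit(rtf[j + 1])) ++j;
                while (j < rtf.size() && isDigit(rtf[j])) ++j;
                if (j < rtf.size() && rtf[j] == ' ') ++j;
                i = j;
                if (word == "trowd") {
                    text.clear();  // drop header or paragraph text seen before the row
                } else if (word == "cell") {
                    row.push_back(text);
                    text.clear();
                } else if (word == "row") {
                    pages.back().push_back(row);
                    row.clear();
                } else if (word == "page") {
                    pages.emplace_back();  // explicit page break starts a new page
                }
            } else if (next == '\'') {
                i = std::min(i + 4, rtf.size());  // skip \'hh (not decoded in this sketch)
            } else {
                // Control symbol: keep escaped \, { and } as literal text, ignore the rest.
                if (next == '\\' || next == '{' || next == '}') text += next;
                i += 2;
            }
        } else {
            // Plain character: group braces and raw line breaks carry no text.
            if (c != '{' && c != '}' && c != '\r' && c != '\n') text += c;
            ++i;
        }
    }
    assert(!pages.empty());  // invariant: there is always a current page
    return pages;
}
```

Since the fields carry no tags, the date and the needed table values would then have to be picked by position, for example `pages[p][r][c]` for page `p`, row `r`, cell `c`. Writing each page back out as a valid RTF file is a different problem, because every slice needs the document header and balanced groups, which is where a library that can save the tree or printing to PDF comes in.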

sergey kuzmin, 2019-04-03
@sergueik

Please attach your "complex RTF document". See whether https://poi.apache.org/components/document/ can handle it.
