How do I parse a .doc file in Python?

Vladislav, 2014-06-28 15:17:06

Python

The file consists of several consecutive records of the form:
paragraph 1: Name
paragraph 2: <picture>
paragraph 3: Description
All of this needs to go into a database, but the problem is that the fonts and line breaks are inconsistent (in one place the Description and the picture are adjacent, in another there is an empty line between them, elsewhere the last line of a description runs straight into the next name, and so on), and I struggle with regexes (I haven't really studied them yet).
Which modules should I use (links to manuals are welcome), and which regexes?

2 answers

Alexey Zenkov, 2014-07-11
@Lexxus31337

Studying regexes is mandatory. Also, any complex parsing has a non-zero chance of error.
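
As a rough illustration of the regex cleanup this kind of input tends to need, here is a minimal sketch that collapses runs of whitespace and drops empty lines. It assumes the text has already been extracted from the document. The function name `clean_lines` is made up for this example.

```python
import re


def clean_lines(text):
    """Collapse whitespace runs inside each line and drop empty lines."""
    lines = []
    for raw in text.splitlines():
        # Replace any run of spaces or tabs with a single space.
        line = re.sub(r"\s+", " ", raw).strip()
        if line:
            lines.append(line)
    return lines
```

This only normalizes spacing. It cannot tell a name from a description, so record boundaries still have to come from something else, such as the position of the picture.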

S

snowpiercer, 2014-08-12
@snowpiercer

Parse a doc file with regular expressions? Doubtful (in such cases it is customary to link to stackoverflow.com/a/1732454/2402125).
There are dedicated libraries for parsing doc files (docx, to be precise), for example https://github.com/mikemaccana/python-docx/
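
To show why a structured reader fits better than regexes here, below is a minimal sketch that uses only the standard library rather than the library linked above. It relies on the fact that a .docx file is a zip archive whose main text lives in `word/document.xml`. It does not apply to the legacy binary .doc format, which would need converting to .docx first. The functions `read_paragraphs` and `group_records` are hypothetical names. The grouping rule is an assumption based on the record layout in the question: the name is the single text paragraph right before a picture, and the description is whatever text follows the picture.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS


def read_paragraphs(docx_file):
    """Return a list of (text, has_picture) tuples, one per paragraph."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W + "p"):
        # A paragraph's text is split across <w:t> elements in its runs.
        text = "".join(t.text or "" for t in p.iter(W + "t")).strip()
        # Pictures appear as <w:drawing> (or <w:pict> in older files).
        has_picture = (p.find(".//" + W + "drawing") is not None
                       or p.find(".//" + W + "pict") is not None)
        paragraphs.append((text, has_picture))
    return paragraphs


def group_records(paragraphs):
    """Group paragraphs into name/description records around each picture."""
    # Empty paragraphs carry no information, so blank lines stop mattering.
    items = [(t, pic) for t, pic in paragraphs if t or pic]
    records = []
    for i, (_, has_picture) in enumerate(items):
        if not has_picture:
            continue
        # Assumption: the name is the text paragraph right before the picture.
        name = items[i - 1][0] if i > 0 and not items[i - 1][1] else ""
        description = []
        j = i + 1
        while j < len(items) and not items[j][1]:
            description.append(items[j][0])
            j += 1
        # If another picture follows, the last text paragraph is the next name.
        if j < len(items) and description:
            description.pop()
        records.append({"name": name, "description": " ".join(description)})
    return records
```

Called as `group_records(read_paragraphs("catalog.docx"))`, where the file name is a placeholder, this yields a list of dictionaries ready to insert into a database. Using the picture as the anchor means that missing or extra blank lines do not change the result, although a name that spans more than one paragraph would break the assumption.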