I have a .docx Microsoft Word file formatted roughly as follows:
TAG Lorem ipsum dolor sit amet, consectetur adipiscing
elit, sed do eiusmod tempor
TAG_2 Lorem ipsum dolor sit amet, consectetur adipiscing
elit, sed do eiusmod tempor incididunt ut labore
et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi
TAG Text text text text text text text text text text
The line breaks come from the word processor wrapping long lines automatically (if pasted into a plain-text editor, the text above would be 3 lines instead of 7).
My task is to automatically count the number of lines assigned to each tag, so that the above file would produce something like:
TAG 2
TAG_2 4
TAG 1
Right now I do it manually: I specify a font file, a font size, and an average line length, then divide the measured width of each unwrapped line (from PIL.ImageFont.getsize()) by that average line length. This approach is really error-prone and does not cover all possible situations (such as fonts changing mid-file).
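A minimal sketch of the estimate described above, for illustration only. The function names (`estimate_lines`, `count_tagged_lines`, `pil_measurer`) are hypothetical, and the code assumes that the tag is the first whitespace-delimited token of each paragraph and that a single font applies to the whole file. It uses `getlength()` rather than `getsize()`, on the assumption of a recent Pillow release where `getsize()` is no longer available. Because it divides the total width by a fixed line width and ignores where words actually break, it can undercount, which is part of why the approach is error-prone.

```python
import math
from typing import Callable, List, Tuple


def estimate_lines(text: str, measure: Callable[[str], float], line_width: float) -> int:
    """Estimate how many wrapped lines `text` occupies.

    `measure` returns the rendered width of a string, and `line_width` is the
    usable width of one line in the same unit. Word boundaries are ignored.
    """
    if not text:
        return 1
    return max(1, math.ceil(measure(text) / line_width))


def count_tagged_lines(
    paragraphs: List[str], measure: Callable[[str], float], line_width: float
) -> List[Tuple[str, int]]:
    """Return (tag, estimated line count) for each non-empty paragraph.

    Assumption: the tag is the first whitespace-delimited token.
    """
    results = []
    for paragraph in paragraphs:
        if not paragraph.strip():
            continue  # skip blank paragraphs
        tag = paragraph.split(None, 1)[0]
        results.append((tag, estimate_lines(paragraph, measure, line_width)))
    return results


def pil_measurer(font_path: str, font_size: int) -> Callable[[str], float]:
    """Build a width-measuring function from a TrueType font file via Pillow.

    Pillow is imported lazily so the functions above work without it.
    """
    from PIL import ImageFont

    font = ImageFont.truetype(font_path, font_size)
    return font.getlength
```

With a real font, this would be called as `count_tagged_lines(paragraphs, pil_measurer("some_font.ttf", 12), line_width)`, where the font path, size, and line width are still guessed by hand.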
Unfortunately I have no control over the file, so I cannot properly format it before counting lines (as reason would demand).
Is there a way to do this in Python? I've found the python-docx package, but it seems rather limited in its capabilities.
Also note that the .docx format is not strictly mandatory; I could convert the file to .odt if necessary.
I am attaching a screenshot of my setup (in LibreOffice) to make this clearer.