
I have a .docx Microsoft Word file formatted roughly as follows:

TAG    Lorem ipsum dolor sit amet, consectetur adipiscing 

       elit, sed do eiusmod tempor

TAG_2  Lorem ipsum dolor sit amet, consectetur adipiscing 

       elit, sed do eiusmod tempor incididunt ut labore 

       et dolore magna aliqua. Ut enim ad minim veniam, 

       quis nostrud exercitation ullamco laboris nisi 

TAG    Text text text text text text text text text text

Where indentation is achieved by wrapping long lines automatically (if copy-pasted in a simple txt editor, the above text would result in 3 lines instead of 7).

My task is to automatically count the number of lines assigned to a tag, s.t. the above file would result in something like:

TAG    2

TAG_2  4

TAG    1

Right now I do it manually, by specifying a font file, font size, and average line length, and dividing the length of a line (measured with PIL.ImageFont.getsize()) by that average, but this approach is really error-prone and does not cover all possible situations (like fonts changing mid-file).
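For reference, here is a minimal sketch of this kind of width-based estimate. It uses a greedy word wrap instead of the plain division described above, so it is a variation on the approach, not the exact code in use. The function name estimate_wrapped_lines and the measure parameter are hypothetical; measure is any callable that returns the rendered width of a string, for example the getlength method of a Pillow font object. The sketch assumes a single font and ignores the tag column, hyphenation, and justification, so it shares the weaknesses mentioned above.

```python
def estimate_wrapped_lines(text, measure, max_width):
    """Estimate how many lines `text` takes when wrapped at word boundaries.

    measure   -- callable returning the rendered width of a string
                 (hypothetical hook; e.g. a Pillow font's getlength)
    max_width -- available line width, in the same units as measure()
    """
    lines = 0
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and measure(candidate) > max_width:
            # The current line is full: close it and start a new one.
            lines += 1
            current = word
        else:
            # The word still fits (or it is the first word on the line).
            current = candidate
    # Count the last, partially filled line if there is one.
    return lines + (1 if current else 0)
```

With a real font, measure could be something like ImageFont.truetype(font_path, size).getlength, where font_path and size are whatever the document actually uses. For a quick check without any font, len can stand in as a fixed-width measure.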

Unfortunately I have no control over the file, so I cannot properly format it before counting lines (as reason would demand).

Is there a way to do this in Python? I've found the python-docx package, but it seems kinda limited in its capabilities.

Also note that the .docx format is not mandatory; I could also convert the file to .odt if necessary.

Attaching a screenshot of my setup (in LibreOffice) to make it more clear.

1 Answer


Use this to count the number of lines & words in all paragraphs in a Document with VBA:

Sub ParaStatsCount()
  ' Show the line count and word count of every paragraph in the active document.
  Dim Para As Paragraph
  For Each Para In ActiveDocument.Paragraphs
    With Para.Range
      MsgBox .Text & vbCr & "Line Count = " & .ComputeStatistics(wdStatisticLines) & vbCr _
        & "Word Count = " & .ComputeStatistics(wdStatisticWords)
    End With
  Next Para
End Sub
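Since the question asks for Python, the same ComputeStatistics call can probably be driven from Python through COM. The following is a rough sketch under several assumptions: it needs Windows, an installed copy of Word, and the pywin32 package; it assumes each tag starts its own paragraph and is the first whitespace-delimited token of that paragraph; and the helper names (split_tag, tag_line_counts, word_paragraph_stats) are made up for illustration. The constant 1 is the numeric value of wdStatisticLines.

```python
import os


def split_tag(text):
    """Return the leading tag of a paragraph (assumed: first token), or ''."""
    parts = text.split(None, 1)
    return parts[0] if parts else ""


def tag_line_counts(paragraph_stats):
    """Turn (paragraph_text, line_count) pairs into (tag, line_count) pairs.

    Empty paragraphs are skipped; document order is preserved.
    """
    result = []
    for text, lines in paragraph_stats:
        tag = split_tag(text)
        if tag:
            result.append((tag, lines))
    return result


def word_paragraph_stats(path):
    """Yield (text, line_count) for each paragraph, as laid out by Word.

    Assumes Windows + Word + pywin32; imported lazily so the helpers
    above stay usable without them.
    """
    import win32com.client

    WD_STATISTIC_LINES = 1  # numeric value of wdStatisticLines
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    doc = word.Documents.Open(os.path.abspath(path))
    try:
        for para in doc.Paragraphs:
            rng = para.Range
            yield rng.Text, rng.ComputeStatistics(WD_STATISTIC_LINES)
    finally:
        doc.Close(False)  # close without saving changes
        word.Quit()
```

Usage would look like tag_line_counts(word_paragraph_stats("file.docx")), where file.docx is a placeholder name. Because the line counts come from Word's own layout, font changes mid-file should be accounted for, but the result depends on the fonts installed on the machine doing the counting.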
