in Python by (16.4k points)

I'm looking for a PDF library that will let me extract the text from a PDF document. I've looked at PyPDF, and it extracts the text from a PDF document nicely. The problem is that if the document contains tables, the text in the tables is extracted inline with the rest of the document text. This is a problem because it produces chunks of text that aren't useful and look garbled (for example, lots of numbers run together).

I'd like to extract the text from a PDF document while excluding any tables and special formatting. Is there a library that can do this?

1 Answer

by (26.4k points)

You could also look into PDFMiner (separate PDFMiner releases exist for older versions of Python).

A feature of particular interest in PDFMiner is that you can control how it regroups pieces of text while extracting them. You do this by specifying the spacing between lines, words, characters, and so on. By tweaking these settings you may be able to get what you need, although that depends on how much your documents vary. PDFMiner can also give you the location of the text on the page, and it can extract data by object ID, among other things. So dig into PDFMiner and be creative!
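As a rough illustration of the two ideas above (tuning the spacing thresholds and using text positions), here is a minimal sketch. It assumes the `pdfminer.six` package, the maintained fork of PDFMiner, and uses its `LAParams`, `extract_pages`, and `LTTextContainer`. The helpers `digit_ratio`, `iter_text_boxes`, and `extract_prose`, the 0.5 threshold, and the margin values are all hypothetical choices made for this example. Dropping digit-heavy text boxes is only a crude heuristic for spotting table cells, not something PDFMiner does for you.

```python
from typing import Iterator, Tuple

# (x0, y0, x1, y1) in PDF points, as reported by pdfminer.six
BBox = Tuple[float, float, float, float]


def digit_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isdigit() for c in chars) / len(chars)


def iter_text_boxes(pdf_path: str) -> Iterator[Tuple[BBox, str]]:
    """Yield (bounding box, text) for each top-level text box on each page."""
    # Imported lazily so digit_ratio() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer

    # Spacing thresholds that control how characters, words and lines are
    # grouped. These values are illustrative; tune them for your documents.
    laparams = LAParams(char_margin=2.0, word_margin=0.1, line_margin=0.5)

    for page_layout in extract_pages(pdf_path, laparams=laparams):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                yield element.bbox, element.get_text()


def extract_prose(pdf_path: str, max_digit_ratio: float = 0.5) -> str:
    """Join text boxes, skipping those dominated by digits (a guess at table cells)."""
    kept = [
        text.strip()
        for _bbox, text in iter_text_boxes(pdf_path)
        if digit_ratio(text) <= max_digit_ratio
    ]
    return "\n\n".join(kept)
```

Calling `extract_prose("report.pdf")` (the file name is a placeholder) would return the remaining text as one string. The bounding boxes from `iter_text_boxes` are ignored here, but they could be used instead of, or alongside, the digit heuristic, for example to skip everything inside a known table region of the page.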

