How to analyse PDF documents with Amazon Textract in a Synchronous way?

Question

asked Feb 7, 2021 in Python by laddulakshana (16.4k points)

I need to extract tables from a lot of PDFs I have. To do this I am utilizing AWS Textract Python pipeline.

If it's not too much trouble, how might I do this without SNS and SQS? I need it to be synchronous: give my pipeline a PDF document, call AWS Textract and get the outcomes.

Here is the thing that I use then, if it's not too much trouble, Can anyone tell me what should I change:

import boto3
import time
def startJob(s3BucketName, objectName):
response = None
client = boto3.client('textract')
response = client.start_document_text_detection(
DocumentLocation={
'S3Object': {
'Bucket': s3BucketName,
'Name': objectName
}
})
return response["JobId"]
def isJobComplete(jobId):
# For production use cases, use SNS based notification
# Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
time.sleep(5)
client = boto3.client('textract')
response = client.get_document_text_detection(JobId=jobId)
status = response["JobStatus"]
print("Job status: {}".format(status))
while(status == "IN_PROGRESS"):
time.sleep(5)
response = client.get_document_text_detection(JobId=jobId)
status = response["JobStatus"]
print("Job status: {}".format(status))
return status
def getJobResults(jobId):
pages = []
client = boto3.client('textract')
response = client.get_document_text_detection(JobId=jobId)
pages.append(response)
print("Resultset page recieved: {}".format(len(pages)))
nextToken = None
if('NextToken' in response):
nextToken = response['NextToken']
while(nextToken):
response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
pages.append(response)
print("Resultset page recieved: {}".format(len(pages)))
nextToken = None
if('NextToken' in response):
nextToken = response['NextToken']
return pages
# Document
s3BucketName = "ki-textract-demo-docs"
documentName = "Amazon-Textract-Pdf.pdf"
jobId = startJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))
if(isJobComplete(jobId)):
response = getJobResults(jobId)
#print(response)
# Print detected text
for resultPage in response:
for item in resultPage["Blocks"]:
if item["BlockType"] == "LINE":
print ('\033[94m' + item["Text"] + '\033[0m')

1 Answer

hari_sh · Answer 1 · 2021-02-07T10:12:03+0000

You can't straightforwardly handle PDF documents simultaneously with Textract presently. From the Textract documentation:

Amazon Textract synchronous operations (DetectDocumentText and AnalyzeDocument) support the PNG and JPEG image formats. Asynchronous operations (StartDocumentTextDetection, StartDocumentAnalysis) also support the PDF file format.

A work-around is to convert the PDF report into pictures in your code and afterward utilize the synchronous API activities with these pictures to handle the documents.

Interested to learn the concepts of Python in detail? Come and join the python course to gain more knowledge in Python

Watch this video tutorial for more details.

How to analyse PDF documents with Amazon Textract in a Synchronous way?

1 Answer

Related questions

Browse Categories