
I need to extract tables from a large number of PDFs. To do this I am using the AWS Textract Python pipeline.

How can I do this without SNS and SQS? I need it to be synchronous: give my pipeline a PDF document, call AWS Textract, and get the results.

Here is what I currently use. Can anyone tell me what I should change?

import boto3
import time

def startJob(s3BucketName, objectName):
    # Start an asynchronous text-detection job for a document stored in S3.
    client = boto3.client('textract')
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': objectName
            }
        })
    return response["JobId"]

def isJobComplete(jobId):
    # For production use cases, use SNS based notification
    # Details at: https://docs.aws.amazon.com/textract/latest/dg/api-async.html
    client = boto3.client('textract')
    # Poll every 5 seconds until the job leaves the IN_PROGRESS state,
    # then return the final status string.
    status = "IN_PROGRESS"
    while status == "IN_PROGRESS":
        time.sleep(5)
        response = client.get_document_text_detection(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status

def getJobResults(jobId):
    # Collect every page of results, following NextToken until it is absent.
    pages = []
    client = boto3.client('textract')
    response = client.get_document_text_detection(JobId=jobId)
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    nextToken = response.get('NextToken')
    while nextToken:
        response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)
        pages.append(response)
        print("Resultset page received: {}".format(len(pages)))
        nextToken = response.get('NextToken')
    return pages

# Document
s3BucketName = "ki-textract-demo-docs"
documentName = "Amazon-Textract-Pdf.pdf"

jobId = startJob(s3BucketName, documentName)
print("Started job with id: {}".format(jobId))

# isJobComplete returns the final status string, and any non-empty string
# (including "FAILED") is truthy, so compare against "SUCCEEDED" explicitly.
if isJobComplete(jobId) == "SUCCEEDED":
    response = getJobResults(jobId)
    #print(response)

    # Print detected text
    for resultPage in response:
        for item in resultPage["Blocks"]:
            if item["BlockType"] == "LINE":
                print('\033[94m' + item["Text"] + '\033[0m')

1 Answer


You currently can't process PDF documents synchronously with Textract directly. From the Textract documentation:

Amazon Textract synchronous operations (DetectDocumentText and AnalyzeDocument) support the PNG and JPEG image formats. Asynchronous operations (StartDocumentTextDetection, StartDocumentAnalysis) also support the PDF file format.
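As a rough illustration of the synchronous operations quoted above, the sketch below calls DetectDocumentText on the raw bytes of a single PNG or JPEG image and returns the detected lines. The helper names `detect_lines_sync` and `detect_lines_from_file` are made up for this example, and it assumes boto3 is installed and AWS credentials are configured.

```python
def detect_lines_sync(image_bytes, client):
    """Run synchronous text detection on one image and return its LINE texts.

    `client` is expected to be a boto3 Textract client (or a stand-in with
    the same detect_document_text method).
    """
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]


def detect_lines_from_file(image_path):
    """Hypothetical convenience wrapper: read a local PNG/JPEG and detect lines."""
    # Imported here so detect_lines_sync can be tried with a stub client
    # on a machine that does not have boto3 installed.
    import boto3

    with open(image_path, "rb") as image_file:
        return detect_lines_sync(image_file.read(), boto3.client("textract"))
```

There is no job ID and no polling here: the call blocks until Textract returns the blocks for that one image.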

A workaround is to convert the PDF pages to images in your code and then call the synchronous API operations on those images.
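One way this workaround could look is sketched below. It is only an outline under several assumptions: the third-party pdf2image package (which needs poppler) renders the PDF pages, the PDF is a local file, and AnalyzeDocument is called with the TABLES feature because the question is about tables. All function names are invented for the example.

```python
import io


def page_to_png_bytes(page_image):
    """Serialize a PIL-style image (anything with .save(fp, format=...)) to PNG bytes."""
    buffer = io.BytesIO()
    page_image.save(buffer, format="PNG")
    return buffer.getvalue()


def analyze_pages_sync(page_images, client):
    """Call the synchronous AnalyzeDocument operation once per page image."""
    return [
        client.analyze_document(
            Document={"Bytes": page_to_png_bytes(page_image)},
            FeatureTypes=["TABLES"],
        )
        for page_image in page_images
    ]


def child_ids(block):
    """Return the IDs listed in a block's CHILD relationships (possibly none)."""
    return [
        child_id
        for relationship in block.get("Relationships", [])
        if relationship["Type"] == "CHILD"
        for child_id in relationship["Ids"]
    ]


def tables_from_blocks(blocks):
    """Rebuild every TABLE block as a list of rows, each a list of cell strings."""
    by_id = {block["Id"]: block for block in blocks}
    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[word_id]["Text"]
                for word_id in child_ids(cell)
                if by_id[word_id]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        row_count = max((row for row, _ in cells), default=0)
        column_count = max((column for _, column in cells), default=0)
        tables.append([
            [cells.get((row, column), "") for column in range(1, column_count + 1)]
            for row in range(1, row_count + 1)
        ])
    return tables


def extract_tables_from_pdf(pdf_path):
    """Hypothetical entry point: return a list of tables for each PDF page."""
    # Assumed dependencies, imported lazily so the helpers above can be
    # tried with stand-in objects: boto3 and pdf2image (requires poppler).
    import boto3
    from pdf2image import convert_from_path

    client = boto3.client("textract")
    page_images = convert_from_path(pdf_path)
    responses = analyze_pages_sync(page_images, client)
    return [tables_from_blocks(response["Blocks"]) for response in responses]
```

Calling `extract_tables_from_pdf("some-local-file.pdf")` (a placeholder path) would block until every page has been analyzed, with no SNS or SQS involved. Each page costs one API call, so large documents may still be better served by the asynchronous operations.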
