
I have an S3 bucket, and every 3 hours I get a file in the bucket with a timestamp attached to it. I'm using a Glue job to move the files from S3 to Redshift with some transformations. The Glue job uses the table created in the Data Catalog via a crawler as the input.

First, run: 

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test", table_name = "employee_623215", transformation_ctx = "datasource0")

After three hours, if I get one more file in the S3 bucket, should I crawl it again?

Is there any way to have a single table in the Data Catalog that gets updated with the latest S3 file and can be used by the Glue job for processing, or do I need to run the crawler every time to get the latest data?

1 Answer


As per my understanding, an alternative approach is to read directly from S3 instead of reading from the catalog, and process the data in the Glue job.

You can use the below method to read directly from S3 in the Glue job:

glueContext.create_dynamic_frame.from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx="")
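
For instance, here is a minimal sketch of reading the files directly from S3 (the bucket path s3://my-bucket/employee/ and the CSV format are assumptions; adjust them to match your data). With job bookmarks enabled on the job, the transformation_ctx lets Glue pick up only files it has not processed on previous runs:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read straight from S3, bypassing the Data Catalog table.
# NOTE: the path and format below are placeholders for illustration.
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/employee/"], "recurse": True},
    format="csv",
    format_options={"withHeader": True},
    # With job bookmarks enabled, this context key lets Glue skip
    # files that were already processed in earlier runs.
    transformation_ctx="datasource0",
)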

For more details, you can read the AWS Glue documentation on create_dynamic_frame.from_options.

