
I am considering Google Cloud Dataflow as an option for running a pipeline that involves steps like:

  1. Downloading images from the web;
  2. Processing images.

I like that Dataflow manages the lifetime of the VMs required to complete the job, so I don't need to start or stop them myself, but all the examples I have come across use it for data-mining-style tasks. I wonder whether it is a viable option for other batch tasks, such as image processing and crawling.

1 Answer


You can generate the crawler URLs, publish them to Pub/Sub, and write a Beam pipeline that does the following:

1. Read the URLs from Pub/Sub.

2. Download the website content in a ParDo.

3. Parse the image URLs out of each page in another ParDo.

4. Download and process each image in a ParDo.

5. Store the results in GCS or BigQuery, depending on what information you want from the images.
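Step 3 (pulling image URLs out of a downloaded page) can be done with nothing but the Python standard library inside the ParDo. A minimal sketch, assuming the page HTML is already in hand as a string:

```python
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)


def extract_image_urls(html):
    """Return the list of <img src=...> URLs found in an HTML string."""
    parser = ImgSrcParser()
    parser.feed(html)
    return parser.srcs
```

A real crawler would also resolve relative `src` values against the page URL (e.g. with `urllib.parse.urljoin`) before handing them to the next step.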

You can do the same thing as a batch job; you only need to change the source of the URLs.

I hope this helps.

