
While fetching data from SQL Server via a JDBC connection in Spark, I found that I can set some parallelization parameters: partitionColumn, lowerBound, upperBound, and numPartitions. I have gone through the Spark documentation but wasn't able to understand them.

Can anyone explain the meaning of these parameters?

1 Answer

  • partitionColumn is the column whose values are used to split the table into partitions. It must be a numeric, date, or timestamp column.

  • lowerBound and upperBound are used only to compute the partition stride. They do not filter the rows that are fetched: rows with values outside this range are still read and land in the first or last partition. The complete dataset therefore corresponds to the following query:

       SELECT * FROM table

  • numPartitions determines the number of partitions to be created. The range between lowerBound and upperBound is divided into numPartitions partitions, each with a stride equal to:

            upperBound / numPartitions - lowerBound / numPartitions
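To show where these four options go, here is a minimal PySpark sketch. The JDBC URL, the table name dbo.orders, the column order_id, and both helper functions are hypothetical names chosen for illustration. Credentials and driver settings are left out. The option keys themselves (url, dbtable, partitionColumn, lowerBound, upperBound, numPartitions) are the ones Spark's JDBC source reads, and the last four have to be given together.

```python
def jdbc_partition_options(url, table, column, lower_bound, upper_bound, num_partitions):
    """Build the option map for a partitioned JDBC read (hypothetical helper)."""
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": column,
        # Spark options are passed as strings
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


def read_partitioned(spark, options):
    """Run the read; `spark` is assumed to be an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical connection details, for illustration only
options = jdbc_partition_options(
    "jdbc:sqlserver://localhost:1433;databaseName=mydb",
    "dbo.orders",
    "order_id",
    0,
    1000,
    10,
)
```

Calling read_partitioned(spark, options) against a reachable database should return a DataFrame whose partitions are read through separate queries.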

For example, if:

  • lowerBound: 0

  • upperBound: 1000

  • numPartitions: 10

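The stride is equal to 100, and the partitions correspond to the following queries. The first and last partitions are open-ended, so rows below lowerBound, rows above upperBound, and NULL values are still read, and no row is fetched twice:

SELECT * FROM table WHERE partitionColumn < 100 OR partitionColumn IS NULL

SELECT * FROM table WHERE partitionColumn >= 100 AND partitionColumn < 200

...

SELECT * FROM table WHERE partitionColumn >= 900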
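The way the per-partition conditions follow from the stride can be sketched in plain Python. This is a simplified illustration, not Spark's own code: the function name partition_predicates is made up, it assumes integer, non-negative bounds and at least two partitions, and it ignores the adjustments Spark makes when the range is smaller than numPartitions.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE condition per partition (illustrative sketch only)."""
    if num_partitions < 2:
        raise ValueError("this sketch assumes at least two partitions")

    # Same formula as in the answer, using integer division
    stride = upper_bound // num_partitions - lower_bound // num_partitions

    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        # The last partition has no upper limit
        upper = f"{column} < {current}" if i != num_partitions - 1 else None

        if lower is None:
            # NULL values are assigned to the first partition
            predicates.append(f"{upper} OR {column} IS NULL")
        elif upper is None:
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


for predicate in partition_predicates("partitionColumn", 0, 1000, 10):
    print(f"SELECT * FROM table WHERE {predicate}")
```

With the example values this prints ten queries. Because each interior range includes its lower limit and excludes its upper limit, a value such as 100 falls into exactly one partition.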
