0 votes
1 view
in Big Data Hadoop & Spark by (11.5k points)

While fetching data from SQL Server via a JDBC connection in Spark, I found that I can set some parallelization parameters like partitionColumn, lowerBound, upperBound, and numPartitions. I have gone through spark documentation but wasn't able to understand it.

Can anyone explain me the meanings of these parameters?

1 Answer

0 votes
by (31.4k points)
  • partitionColumn is a column which should be used to determine partitions.

  • lowerBound and upperBound determine range of values to be fetched. The complete dataset will be using rows corresponding to the following query:

       SELECT * FROM table WHERE partitionColumn BETWEEN lowerBound AND upperBound

  • numPartitions determines number of partitions to be created. Range between lowerBound and  upperBound is divided into numPartitions each with stride equal to:

            upperBound / numPartitions - lowerBound / numPartitions

For example if:

  • lowerBound: 0

  • upperBound: 1000

  • numPartitions: 10

Stride is equal to 100 and partitions are will be corresponding to the following queries:

SELECT * FROM table WHERE partitionColumn < 100

 

SELECT * FROM table WHERE partitionColumn BETWEEN 100 AND 200  

...

SELECT * FROM table WHERE partitionColumn BETWEEN 900 AND 1000

Welcome to Intellipaat Community. Get your technical queries answered by top developers !


Categories

...