in SQL by (18.4k points)

I am trying to join three tables, say tableA, tableB, and tableC, which have roughly 74M, 3M, and 75M rows respectively. I have used this query:

SELECT A.*,
       C."needed_field"
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
    ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
    ON B."field_join_BC" = C."field_join_BC"

When I run this in Dataiku (Data Science Studio on top of Vertica), temporary data seems to be created to produce the output, and it is consuming 1 TB of server space.

In the worst case, the result would be the product of all three tables, that is 74M × 3M × 75M ≈ 1.7 × 10^22 rows. I am not able to explain this to my superior because he does not know much about SQL, and since I am new to the organization, I am not sure whether it will actually produce that many rows.
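
For reference, the actual blow-up depends on how often each join key repeats: an inner join only multiplies rows where a key matches more than once. A quick check like the sketch below (using the column names from the query above) gives the worst-case fan-out per key:

SELECT 'tableB.field_join_AB' AS key_name, MAX(cnt) AS max_repeats
FROM (SELECT "field_join_AB", COUNT(*) AS cnt
      FROM "tableB"
      GROUP BY "field_join_AB") t
UNION ALL
SELECT 'tableC.field_join_BC', MAX(cnt)
FROM (SELECT "field_join_BC", COUNT(*) AS cnt
      FROM "tableC"
      GROUP BY "field_join_BC") t;

If both maxima are 1, the join cannot produce more rows than tableA has.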

I have tried the query below to find out beforehand how many rows will be produced, using COUNT():

SELECT COUNT(*)
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
    ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
    ON B."field_join_BC" = C."field_join_BC"

Will the underlying engine produce the entire dataset just to count it?

Or is it possible that COUNT() will give me the result quickly, because it does not build the dataset but works it out some other way?

I have tried counting the rows with COUNT(), but it is taking a long time. Please guide me to a solution.
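
One way to see what the engine will do without running anything: Vertica supports EXPLAIN, which prints the query plan without executing the query. Running it on the count query (the derived tables are dropped here, since they only project columns and do not change the join) shows whether a hash join and a network broadcast/resegmentation are planned:

EXPLAIN
SELECT COUNT(*)
FROM "tableA" A
INNER JOIN "tableB" B ON A."field_join_AB" = B."field_join_AB"
INNER JOIN "tableC" C ON B."field_join_BC" = C."field_join_BC";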

1 Answer

by (36.8k points)

Vertica is mainly a column-oriented database: to resolve a query it only needs to read the columns required to produce the output.

Vertica can also query encoded data directly in many cases, and it avoids materializing rows until that is actually needed.

Vertica is good and fast at counts; you do not need to jump through hoops.

As I mentioned, Vertica only reads the columns it needs. But SELECT A.* forces it to reconstruct entire rows of tableA, and the optimizer cannot help you with that.
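
For example (a sketch; which tableA columns you actually need is an assumption here), naming the columns explicitly avoids full-row reconstruction, and the derived tables around tableB and tableC are unnecessary:

SELECT A."field_join_AB",          -- list only the tableA columns you truly need
       C."needed_field"
FROM "tableA" A
INNER JOIN "tableB" B ON A."field_join_AB" = B."field_join_AB"
INNER JOIN "tableC" C ON B."field_join_BC" = C."field_join_BC";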

What is probably happening here is a hash join with rebroadcasting. If your underlying projections do not line up and their sort orders differ, then joining several large tables can be expensive by itself: Vertica has to load one side into a hash table and rebroadcast a lot of data across the network so that the join can happen on the initiator node.
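
If that is what the plan shows, a projection that is sorted and segmented on the join key lets Vertica use a merge join and keep the work local to each node. A sketch for tableB (the projection name is made up; adapt the key and verify with EXPLAIN):

CREATE PROJECTION tableB_join_p AS
SELECT "field_join_AB", "field_join_BC"
FROM "tableB"
ORDER BY "field_join_AB"
SEGMENTED BY HASH("field_join_AB") ALL NODES;

SELECT REFRESH('tableB');  -- populate the new projection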

I recommend running the Database Designer (DBD) with these queries as input if the query patterns are common. If you are not using DBD and not using custom projections, your projections probably will not perform well, which leads to the situation I described above.
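
A rough outline of driving DBD from SQL is below. The meta-function names come from Vertica's Database Designer API, but their argument lists vary by version, so treat this as a sketch and verify against your documentation before running it:

SELECT DESIGNER_CREATE_DESIGN('join_design');
SELECT DESIGNER_ADD_DESIGN_TABLES('join_design', 'public."tableA",public."tableB",public."tableC"');
SELECT DESIGNER_ADD_DESIGN_QUERY('join_design', 'SELECT ... your join query ...');
-- then populate and deploy the design (see DESIGNER_RUN_POPULATE_DESIGN_AND_DEPLOY)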

