# Does count() produce the underlying table it needs to count?


I am trying to join three tables — table A, table B, and table C — with roughly 74M, 3M, and 75M rows respectively. I have used this query:

```sql
SELECT A.*,
       C."needed_field"
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
        ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
        ON B."field_join_BC" = C."field_join_BC"
```

When I run this in Dataiku (Data Science Studio + Vertica), temporary data seems to be created to produce the output, which is consuming 1 TB of server space.

In the worst case the join would produce the full Cartesian product of the three tables, that is 74M × 3M × 75M ≈ 1.7 × 10^22 rows. I am not able to explain this to my superior because he doesn't know much about SQL, and as I am new to the organization I am not sure whether it will actually produce that many rows.

I have tried the code below to find out beforehand how many rows will be produced, using count():

```sql
SELECT COUNT(*)
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
        ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
        ON B."field_join_BC" = C."field_join_BC"
```

Does the underlying engine produce the entire dataset and then count it?

Or is it possible that I will get the result from count() quickly, because it is not building the dataset but working the count out some other way?

I have tried counting the dataset with count(), but it is taking a long time. Please guide me to a solution.
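One cheap way to gauge how much a join fans out, without materializing it, is to count the matching rows per join key on each side and sum the products. This is a sketch for the A–B join, using the table and column names from the question; the B–C join can be checked the same way on `"field_join_BC"`:

```sql
-- Each key contributes (rows in A with that key) x (rows in B with that key)
-- to the join output, so the sum of those products is the exact row count
-- of the A-B join, computed from two cheap GROUP BY scans.
SELECT SUM(a.cnt * b.cnt) AS ab_join_rows
FROM (SELECT "field_join_AB", COUNT(*) AS cnt
      FROM "tableA" GROUP BY "field_join_AB") a
JOIN (SELECT "field_join_AB", COUNT(*) AS cnt
      FROM "tableB" GROUP BY "field_join_AB") b
  ON a."field_join_AB" = b."field_join_AB";
```

If this number is huge, duplicate join keys are blowing up the result, not the engine.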


Vertica is primarily a column-store database: to answer a query it only needs to read the columns the query references.

Vertica can also operate directly on encoded data in many cases, so it does not need to materialize values until that is actually required.

Vertica is good and fast at counts; you don't need to jump through hoops.

As already mentioned, Vertica will read only the columns that are needed; the optimizer will not reconstruct entire rows unless it has to.

What is probably happening here is a hash join with rebroadcasting. If your underlying projections don't line up and their sort orders differ, joining several large tables can be expensive on its own: the data has to be loaded into a hash table and a lot of it rebroadcast over the network so the join can be performed on the initiator node.
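You can check whether that is happening by looking at the query plan: in Vertica, EXPLAIN shows RESEGMENT or BROADCAST steps whenever data has to be moved between nodes. A sketch against the question's tables:

```sql
-- If the plan contains RESEGMENT or BROADCAST steps, the join keys do not
-- match the projections' segmentation and data is shipped between nodes.
EXPLAIN
SELECT COUNT(*)
FROM "tableA" A
INNER JOIN "tableB" B ON A."field_join_AB" = B."field_join_AB"
INNER JOIN "tableC" C ON B."field_join_BC" = C."field_join_BC";
```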

To my knowledge, you should run the Database Designer (DBD) with these queries as input if their pattern is common. If you are not using DBD and also not using custom projections, your projections probably won't perform well, which causes exactly the situation I described above.
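If you go the custom-projection route, a minimal sketch for tableB (assuming standard Vertica CREATE PROJECTION syntax and the question's names; tableA and tableC would need matching projections on their join keys) could look like:

```sql
-- Segmenting on the join key puts matching rows on the same node, and
-- sorting on it lets the optimizer pick a merge join over a hash join.
CREATE PROJECTION tableB_by_ab AS
SELECT "field_join_AB", "field_join_BC"
FROM "tableB"
ORDER BY "field_join_AB"
SEGMENTED BY HASH("field_join_AB") ALL NODES;

SELECT REFRESH('tableB');
```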