
I am currently using Pandas and Spark for data analysis. I found that Dask provides parallelized NumPy arrays and Pandas DataFrames.

Pandas is easy and intuitive for doing data analysis in Python. But I find it difficult to handle multiple large dataframes in Pandas due to limited system memory.

I have researched Dask and learned some facts about it. Overall, I understand that Dask is simpler to use than Spark, and that it is as flexible as Pandas while offering more compute power by using multiple CPUs in parallel.

So, I want to know roughly how much data (in terabytes) can be processed with Dask?

1 Answer


Generally, Dask is smaller and lighter weight than Spark. This means it has fewer features and is instead used in conjunction with other libraries, particularly those in the numeric Python ecosystem. It couples with libraries like Pandas or Scikit-Learn to achieve high-level functionality.

Reasons you might choose Spark

  • You prefer Scala or SQL

  • You have mostly JVM infrastructure and legacy systems

  • You are mostly doing business analytics with some lightweight machine learning

Reasons you might choose Dask

  • You prefer Python, or have large legacy code bases that you do not want to entirely rewrite.

  • You have a complex use case, or your use case does not cleanly fit the Spark computing model

  • You want a lighter-weight transition from local computing to cluster computing

  • You intend to interoperate with other technologies and have no issue installing multiple packages
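The lighter-weight transition from local to cluster computing can be sketched as follows (assuming the `dask.distributed` package is installed). The same code runs on one machine or a cluster; only the address passed to `Client` changes:

```python
from dask.distributed import Client, LocalCluster
import dask.array as da

# Start a scheduler and workers on the local machine (threads, not
# separate processes, to keep the sketch simple). To move to a real
# cluster, you would instead pass the remote scheduler's address to
# Client(); the computation code below stays identical.
cluster = LocalCluster(n_workers=2, processes=False)
client = Client(cluster)

# A million-element array split into chunks, reduced in parallel
# across the workers.
x = da.ones(1_000_000, chunks=100_000)
total = x.sum().compute()
print(total)  # 1000000.0

client.close()
cluster.close()
```

This incremental path, from a laptop to a distributed scheduler without rewriting the analysis code, is one of the main reasons to pick Dask for Python-centric workloads.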
