The problem with your approach is that you are trying to get an integer out of a Row object. collect() returns a list of Row objects, not a list of integers:
>>> mvv_list = mvv_count_df.select('mvv').collect()
Instead, index into that list and read the field from the Row:
>>> firstvalue = mvv_list[0].mvv
This gives you the mvv value of the first row.
Now, to get every value of the column as a list, iterate over the collected rows of the DataFrame:
>>> mvv_array = [int(row.mvv) for row in mvv_count_df.collect()]
But if you try the same for the other column:
>>> mvv_count = [int(row.count) for row in mvv_count_df.collect()]
You get an error:
Out: TypeError: int() argument must be a string or a number, not 'builtin_function_or_method'
This happens because Row is a subclass of tuple, and count is a built-in tuple method. Attribute access on the row finds that method before it ever looks for a column named count. One workaround is to rename the count column to _count:
>>> renamed_df = mvv_count_df.selectExpr("mvv as mvv", "count as _count")
>>> mvv_count = [int(row._count) for row in renamed_df.collect()]
But this workaround is not needed, as you can access the column by name using the dictionary syntax:
>>> mvv_array = [int(row['mvv']) for row in mvv_count_df.collect()]
>>> mvv_count = [int(row['count']) for row in mvv_count_df.collect()]
This works without any error, because key lookup goes straight to the column and never touches the tuple methods.
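To see why attribute access and key access behave differently, here is a minimal sketch that runs without Spark. FakeRow is a hypothetical stand-in written for this illustration, not PySpark's actual Row class. It only mimics the two relevant traits: it subclasses tuple, and it resolves field names in __getattr__, which Python calls only after normal attribute lookup has failed. The sample values are made up.

```python
class FakeRow(tuple):
    """Hypothetical, simplified stand-in for a Spark Row (assumption, not the real class)."""

    def __new__(cls, **fields):
        # Store the values in the tuple itself and remember the field names.
        row = super().__new__(cls, fields.values())
        row.__fields__ = list(fields)
        return row

    def __getattr__(self, name):
        # Reached only if normal lookup fails, so tuple.count and
        # tuple.index are found first and never get here.
        if name.startswith("__"):
            raise AttributeError(name)
        try:
            return tuple.__getitem__(self, self.__fields__.index(name))
        except ValueError:
            raise AttributeError(name) from None

    def __getitem__(self, key):
        # String keys are looked up by field name, bypassing attribute lookup.
        if isinstance(key, str):
            return tuple.__getitem__(self, self.__fields__.index(key))
        return tuple.__getitem__(self, key)


# Made-up sample rows with the same column names as in the question.
rows = [FakeRow(mvv=1, count=5), FakeRow(mvv=2, count=9)]

mvv_array = [int(row.mvv) for row in rows]         # attribute access is fine for mvv
mvv_count = [int(row["count"]) for row in rows]    # key access is safe for count

try:
    int(rows[0].count)                             # rows[0].count is the tuple method
    attribute_access_failed = False
except TypeError:
    attribute_access_failed = True
```

The same collision applies to any column whose name matches a tuple method, such as index, so the dictionary syntax is the safer habit when column names are not under your control.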