The problem with your approach is that you are trying to get an integer out of a Row type. collect() returns a list of Row objects, one per row, so select both columns up front:
>>> mvv_list = mvv_count_df.select('mvv', 'count').collect()
If you then take the first element of that list:
>>> firstvalue = mvv_list[0].mvv
you will get the mvv value of the first row.
Now, to get the whole column as a Python list, iterate over the already-collected rows (no second collect() is needed):
>>> mvv_array = [int(row.mvv) for row in mvv_list]
But if you try the same for the other column:
>>> mvv_count = [int(row.count) for row in mvv_list]
You get an error:
Out: TypeError: int() argument must be a string or a number, not 'builtin_function_or_method'
This happens because Row is a subclass of tuple, and tuple already defines a built-in method named count, so attribute access on a column called count returns that method instead of the column value. One workaround is to rename the column before collecting:
>>> mvv_list = mvv_count_df.selectExpr("mvv", "count as _count").collect()
>>> mvv_count = [int(row._count) for row in mvv_list]
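The name collision is easy to reproduce without Spark at all: since Row subclasses tuple, a plain tuple shows the same failure mode (a minimal sketch, no pyspark required):

```python
# Row inherits from tuple, and tuple defines a built-in method named
# count(); attribute access therefore returns that method, not a column.
row = (5, 9)  # stand-in for Row(mvv=5, count=9)

print(callable(row.count))  # True: row.count is tuple.count, a method

try:
    int(row.count)  # same failure mode as with a Spark Row
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
```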
But this workaround is not needed: you can access the column with dictionary syntax, which always resolves to the column value rather than a tuple method:
>>> mvv_array = [int(row['mvv']) for row in mvv_list]
>>> mvv_count = [int(row['count']) for row in mvv_list]
This works without any error.
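Putting it together, the whole flow can be sketched end to end. Since a live SparkSession is not needed to show the access pattern, the sketch below uses a minimal tuple-based stand-in for pyspark.sql.Row (the stand-in class is an assumption for illustration, not Spark's actual implementation):

```python
# Minimal stand-in for pyspark.sql.Row: a tuple subclass that supports
# row['name'] lookup by column name, like the real Row does.
class Row(tuple):
    def __new__(cls, **columns):
        row = super().__new__(cls, columns.values())
        row.__fields__ = list(columns)
        return row

    def __getitem__(self, key):
        # String keys are resolved by column name; ints fall back to tuple.
        if isinstance(key, str):
            key = self.__fields__.index(key)
        return tuple.__getitem__(self, key)

# What mvv_count_df.select('mvv', 'count').collect() would hand back:
mvv_list = [Row(mvv=1, count=5), Row(mvv=2, count=9), Row(mvv=3, count=3)]

# Dictionary syntax sidesteps the inherited tuple.count method entirely.
mvv_array = [int(row['mvv']) for row in mvv_list]
mvv_count = [int(row['count']) for row in mvv_list]

print(mvv_array)  # [1, 2, 3]
print(mvv_count)  # [5, 9, 3]
```

Note that row.count on these objects is still the tuple method; only the bracket syntax reaches the column named count.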