in Big Data Hadoop & Spark by (120 points)

I have a JSON file named Class.json and want to compute a couple of aggregates over it, subject to some conditions.

Class.json

{
  "class": [
    {
      "class_id": "1",
      "data": {
        "lesson3": {
          "id": 3,
          "schedule": [
            {
              "schedule_id": "1",
              "schedule_date": "2017-07-11",
              "lesson_price": "USD 25",
              "status": "ONGOING"
            },
            {
              "schedule_id": "2",
              "schedule_date": "2016-09-24",
              "lesson_price": "USD 15",
              "status": "OPEN REGISTRATION"
            }
          ]
        },
        "lesson4": {
          "id": 4,
          "schedule": [
            {
              "schedule_id": "1",
              "schedule_date": "2016-12-17",
              "lesson_price": "USD 19",
              "status": "ONGOING"
            },
            {
              "schedule_id": "2",
              "schedule_date": "2015-11-12",
              "lesson_price": "USD 29",
              "status": "ONGOING"
            },
            {
              "schedule_id": "3",
              "schedule_date": "2015-11-10",
              "lesson_price": "USD 14",
              "status": "ON SCHEDULE"
            }
          ]
        }
      }
    },
    {
      "class_id": "2",
      "data": {
        "lesson1": {
          "id": 1,
          "schedule": [
            {
              "schedule_id": "1",
              "schedule_date": "2017-05-21",
              "lesson_price": "USD 50",
              "status": "CANCELLED"
            }
          ]
        },
        "lesson2": {
          "id": 2,
          "schedule": [
            {
              "schedule_id": "1",
              "schedule_date": "2017-06-04",
              "lesson_price": "USD10",
              "status": "FINISHED"
            },
            {
              "schedule_id": "5",
              "schedule_date": "2018-03-01",
              "lesson_price": "USD12",
              "status": "CLOSED"
            }
          ]
        }
      }
    }
  ]
}

I've tried 

df = spark.read.json("class.json", multiLine=True)

df.show()

and it shows:

+--------------------+
|               class|
+--------------------+
|[[1, [,, [3, [[US...|
+--------------------+

Then, to access the array, I tried:

res = df.select("class").map(lambda s: s['data'])

but got AttributeError: 'DataFrame' object has no attribute 'map'.

Doing df['class'][0]['data'] instead just returns Column<b'class[0][data]'>.

Goal:

  • count the schedules with status "ONGOING" whose schedule_date is before 2017-01
  • compute the average lesson_price over schedules before 2017-01

How can I do this with PySpark?
