I'm going around in circles with this one. My JSON data looks like this (one complete line)

{
  "EnqueuedTimeUtc": "2021-03-18T18:26:25.7930000Z",
  "Properties": { },
  "SystemProperties": {
    "connectionDeviceId": "bbbb-aaaa-23-f4-78cb",
    "connectionAuthMethod": "{\"scope\":\"device\",\"type\":\"x509Certificate\",\"issuer\":\"external\",\"acceptingIpFilterRule\":null}",
    "connectionDeviceGenerationId": "123456",
    "contentType": "application/json",
    "contentEncoding": "utf-8",
    "enqueuedTime": "2021-03-18T18:26:25.7930000Z"
  },
  "Body": {
    "device_id": "abc-1234-1234-abc-abc",
    "event_name": "device_tab_list",
    "tabs": [
      {
        "active": false,
        "id": "chrome-7209",
        "incognito": false,
        "selected": false,
        "status": "complete",
        "title": "Classwork for 2020-2021",
        "url": "https://classroom.google.com/w/abcabc/t/all",
        "width": 1138,
        "height": 489,
        "browser": "chromeos"
      },
      { 'etc several more tabs in this array'}
    ]
  }
}

What I want is a data frame with columns

| EnqueuedTimeUtc | Body.device_id | Body.tabs.url |

Obviously with the first two columns repeated for each url in the 'tabs' list.

When I read this in with

sparkDF = spark.read.json('path')

I get

root
 |-- Body: string (nullable = true)
 |-- EnqueuedTimeUtc: string (nullable = true)
 |-- SystemProperties: struct (nullable = true)
 |    |-- connectionAuthMethod: string (nullable = true)
 |    |-- connectionDeviceGenerationId: string (nullable = true)
 |    |-- connectionDeviceId: string (nullable = true)
 |    |-- contentEncoding: string (nullable = true)
 |    |-- contentType: string (nullable = true)
 |    |-- enqueuedTime: string (nullable = true)

So Body is just a string. How do I extract the individual tab URLs from this string to build the complete dataset? Thanks in advance. (The answer needs to be in Python.)

5 Answers

0 votes
by (25.7k points)
Best answer
To extract the individual tab URLs from the Body string and create a complete dataset with the desired columns, you can follow these steps:

Convert the Spark DataFrame to a Pandas DataFrame to facilitate data manipulation:

pandasDF = sparkDF.toPandas()

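Body is still a JSON string at this point, so parse each value with json.loads first. Then use the json_normalize function from the pandas library to expand the tabs array into one row per tab, carrying device_id along as metadata:

import json
import pandas as pd

data = pd.json_normalize(pandasDF['Body'].map(json.loads).tolist(), record_path=['tabs'], meta=['device_id'])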

Select the desired columns from the extracted data:

result = data[['device_id', 'url']]

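The result DataFrame has two columns: device_id and url. Each row represents an individual tab URL, with the corresponding device_id repeated for each URL. Note that EnqueuedTimeUtc is not part of this result, because only the Body column is normalized; it has to be carried over from pandasDF separately.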

Please note that this solution requires the pandas library, so make sure it is installed in your environment.
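
Since the question also asks for EnqueuedTimeUtc in every row, the same flattening can be sketched with only the standard json module. This is a minimal illustration, not the answerer's method: the helper name flatten_tabs and the sample record are hypothetical, the second tab URL is made up for the example, and Body is assumed to arrive as a JSON-encoded string as in the schema from the question.

```python
import json


def flatten_tabs(record):
    """Yield one row per tab: EnqueuedTimeUtc, device_id and url."""
    body = record["Body"]
    # Assumption: Body is a JSON-encoded string, as the printed schema suggests.
    # If it is already a dict, it is used as-is.
    if isinstance(body, str):
        body = json.loads(body)
    for tab in body.get("tabs", []):
        yield {
            "EnqueuedTimeUtc": record["EnqueuedTimeUtc"],
            "device_id": body["device_id"],
            "url": tab.get("url"),
        }


# Hypothetical sample record; the second tab is invented for illustration.
sample = {
    "EnqueuedTimeUtc": "2021-03-18T18:26:25.7930000Z",
    "Body": json.dumps({
        "device_id": "abc-1234-1234-abc-abc",
        "event_name": "device_tab_list",
        "tabs": [
            {"id": "chrome-7209", "url": "https://classroom.google.com/w/abcabc/t/all"},
            {"id": "chrome-7210", "url": "https://example.com/second-tab"},
        ],
    }),
}

rows = list(flatten_tabs(sample))
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) if a pandas DataFrame is wanted; each row repeats EnqueuedTimeUtc and device_id for one tab URL.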
0 votes
by (15.4k points)
To extract the individual tab URLs from the Body string and create a complete dataset with the desired columns, you can follow these steps:

Convert the Spark DataFrame to a Pandas DataFrame for easier data manipulation:

pandasDF = sparkDF.toPandas()

Because Body is a JSON string, parse each value with json.loads, then use the json_normalize function from the pandas library to expand the tabs array into separate rows and columns:

import json
import pandas as pd

data = pd.json_normalize(pandasDF['Body'].map(json.loads).tolist(), record_path=['tabs'], meta=['device_id'])

Select the desired columns from the extracted data:

result = data[['device_id', 'url']]

The resulting result DataFrame will contain the columns device_id and url. Each row represents an individual tab URL, with the corresponding device_id repeated for each URL.

Please ensure that you have the pandas library installed in your environment for this solution to work effectively.
0 votes
by (19k points)
To extract the individual tab URLs from the Body string and create a complete dataset with the desired columns, follow these steps:

Convert the Spark DataFrame to a Pandas DataFrame:

pandasDF = sparkDF.toPandas()

Parse the Body strings with json.loads, then use pd.json_normalize to expand the tabs array into separate rows and columns:

import json
import pandas as pd

data = pd.json_normalize(pandasDF['Body'].map(json.loads).tolist(), record_path=['tabs'], meta=['device_id'])

Select the desired columns from the extracted data:

result = data[['device_id', 'url']]

The resulting result DataFrame will include the columns device_id and url. Each row corresponds to an individual tab URL, with the respective device_id repeated for each URL.

Make sure to have the pandas library installed for this solution to work smoothly.
0 votes
by (15.4k points)
To create a dataset with the desired columns by extracting individual tab URLs from the Body string, follow these concise steps:

Convert the Spark DataFrame to a Pandas DataFrame:

pandasDF = sparkDF.toPandas()

Parse the Body strings with json.loads, then use pd.json_normalize from the pandas library to expand the tabs array into separate rows and columns:

import json
import pandas as pd

data = pd.json_normalize(pandasDF['Body'].map(json.loads).tolist(), record_path=['tabs'], meta=['device_id'])

Select the desired columns from the extracted data:

result = data[['device_id', 'url']]

The resulting result DataFrame will contain the columns device_id and url. Each row represents an individual tab URL, with the corresponding device_id repeated for each URL.

Ensure that the pandas library is installed in your environment to successfully implement this solution.
0 votes
by (19k points)
To extract individual tab URLs from the Body string and create a dataset with the desired columns, follow these steps:

Convert the Spark DataFrame to a Pandas DataFrame:

pandasDF = sparkDF.toPandas()

Parse the Body strings with json.loads, then use pd.json_normalize from the pandas library to expand the tabs array into separate rows and columns:

import json
import pandas as pd

data = pd.json_normalize(pandasDF['Body'].map(json.loads).tolist(), record_path=['tabs'], meta=['device_id'])

Select the required columns from the extracted data:

result = data[['device_id', 'url']]

The resulting result DataFrame will include the device_id and url columns. Each row will represent an individual tab URL, with the corresponding device_id repeated for each URL.

Ensure that you have the pandas library installed in your environment to implement this solution successfully.
