How to extract data embedded in PySpark string

Question

asked Apr 14, 2021 in Python by Sawagurumi (150 points)
closed Jun 17, 2023 by Anamika Chakravarty

I'm going around in circles with this one. My JSON data looks like this (each record is one complete line):

{ "EnqueuedTimeUtc": "2021-03-18T18:26:25.7930000Z",
"Properties": { },
"SystemProperties": {
    "connectionDeviceId": "bbbb-aaaa-23-f4-78cb",
    "connectionAuthMethod": "{\"scope\":\"device\",\"type\":\"x509Certificate\",\"issuer\":\"external\",\"acceptingIpFilterRule\":null}",
    "connectionDeviceGenerationId": "123456",
    "contentType": "application/json",
    "contentEncoding": "utf-8",
    "enqueuedTime": "2021-03-18T18:26:25.7930000Z"
},
"Body": {
    "device_id": "abc-1234-1234-abc-abc",
    "event_name": "device_tab_list",
    "tabs": [
      {
        "active": false,
        "id": "chrome-7209",
        "incognito": false,
        "selected": false,
        "status": "complete",
        "title": "Classwork for 2020-2021",
        "url": "https://classroom.google.com/w/abcabc/t/all",
        "width": 1138,
        "height": 489,
        "browser": "chromeos"
      },
      { 'etc several more tabs in this array'}
    ]
}
}

What I want is a data frame with columns

| EnqueuedTimeUtc | Body.device_id | Body.tabs.url |

Obviously with the first two columns repeated for each url in the 'tabs' list.

When I read this in with

sparkDF = spark.read.json('path')

I get

root
|-- Body: string (nullable = true)
|-- EnqueuedTimeUtc: string (nullable = true)
|-- SystemProperties: struct (nullable = true)
| |-- connectionAuthMethod: string (nullable = true)
| |-- connectionDeviceGenerationId: string (nullable = true)
| |-- connectionDeviceId: string (nullable = true)
| |-- contentEncoding: string (nullable = true)
| |-- contentType: string (nullable = true)
| |-- enqueuedTime: string (nullable = true)

So Body is just a string. How do I extract the individual tab URLs from this string to make a complete dataset? Thanks in advance. (The answer needs to be in Python.)


5 Answers

answered Jun 17, 2023 by Balram111 (25.7k points)
selected Jun 17, 2023 by Anamika Chakravarty

Best answer

To extract the individual tab URLs from the Body string and create a complete dataset with the desired columns, you can follow these steps:

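Convert the Spark DataFrame to a Pandas DataFrame to facilitate data manipulation. This collects every row onto the driver, so it is only suitable when the data fits in memory:

pandasDF = sparkDF.toPandas()

Parse each Body string with json.loads, then use the json_normalize function from the pandas library to expand the nested tabs array into one row per tab:

import json
import pandas as pd

# Body holds JSON text, so parse each non-null value into a dict before normalizing
bodies = pandasDF['Body'].dropna().map(json.loads).tolist()
data = pd.json_normalize(bodies, record_path=['tabs'], meta=['device_id'])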

Select the desired columns from the extracted data:

result = data[['device_id', 'url']]

The resulting DataFrame, result, has two columns: device_id and url. Each row represents an individual tab URL, with the corresponding device_id repeated for each URL. EnqueuedTimeUtc is not included, because only device_id is listed in meta.
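To get all three columns the question asks for, the timestamp can be attached to each parsed Body before normalizing. The following is a minimal sketch, not the answerer's code: the inline pandasDF stands in for sparkDF.toPandas(), and the second tab and its URL are made-up sample values.

```python
import json

import pandas as pd

# Hypothetical stand-in for sparkDF.toPandas(): Body is a JSON string, as in the question.
pandasDF = pd.DataFrame({
    "EnqueuedTimeUtc": ["2021-03-18T18:26:25.7930000Z"],
    "Body": [json.dumps({
        "device_id": "abc-1234-1234-abc-abc",
        "event_name": "device_tab_list",
        "tabs": [
            {"id": "chrome-7209", "url": "https://classroom.google.com/w/abcabc/t/all"},
            {"id": "chrome-7210", "url": "https://example.com/second-tab"},  # made-up tab
        ],
    })],
})

# Parse each Body string and attach the row's timestamp to the parsed dict.
records = [
    {"EnqueuedTimeUtc": ts, **json.loads(body)}
    for ts, body in zip(pandasDF["EnqueuedTimeUtc"], pandasDF["Body"])
]

# One output row per element of "tabs"; the meta fields repeat on every row.
data = pd.json_normalize(
    records, record_path=["tabs"], meta=["EnqueuedTimeUtc", "device_id"]
)
result = data[["EnqueuedTimeUtc", "device_id", "url"]]
print(result)
```

Passing the timestamp through meta keeps the flattening in a single json_normalize call instead of a separate merge afterwards.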

Please note that this solution requires the pandas library, so make sure it is installed in your environment.

Similu · Answer 1 · 2023-06-17T10:36:29+0000

To extract the individual tab URLs from the Body string and create a complete dataset with the desired columns, you can follow these steps:

Convert the Spark DataFrame to a Pandas DataFrame for easier data manipulation:

pandasDF = sparkDF.toPandas()

Parse each Body string with json.loads, then use the json_normalize function from the pandas library to expand the nested tabs array into one row per tab:

import json
import pandas as pd

# Body holds JSON text, so parse each non-null value into a dict before normalizing
bodies = pandasDF['Body'].dropna().map(json.loads).tolist()
data = pd.json_normalize(bodies, record_path=['tabs'], meta=['device_id'])

Select the desired columns from the extracted data:

result = data[['device_id', 'url']]

The resulting DataFrame, result, contains the columns device_id and url. Each row represents an individual tab URL, with the corresponding device_id repeated for each URL.

Please ensure that you have the pandas library installed in your environment for this solution to work effectively.

Anamika Chakravarty · Answer 2 · 2023-06-17T10:37:31+0000

To extract the individual tab URLs from the Body string and create a complete dataset with the desired columns, follow these steps:

Convert the Spark DataFrame to a Pandas DataFrame:

pandasDF = sparkDF.toPandas()

Parse each Body string with json.loads, then use pd.json_normalize to expand the nested tabs array into one row per tab:

import json
import pandas as pd

# Body holds JSON text, so parse each non-null value into a dict before normalizing
bodies = pandasDF['Body'].dropna().map(json.loads).tolist()
data = pd.json_normalize(bodies, record_path=['tabs'], meta=['device_id'])

Select the desired columns from the extracted data:

result = data[['device_id', 'url']]

The resulting DataFrame, result, includes the columns device_id and url. Each row corresponds to an individual tab URL, with the respective device_id repeated for each URL.

Make sure to have the pandas library installed for this solution to work smoothly.

Similu · Answer 3 · 2023-06-17T10:39:57+0000

To create a dataset with the desired columns by extracting individual tab URLs from the Body string, follow these concise steps:

Convert the Spark DataFrame to a Pandas DataFrame:

pandasDF = sparkDF.toPandas()

Parse each Body string with json.loads, then use pd.json_normalize from the pandas library to expand the nested tabs array into one row per tab:

import json
import pandas as pd

# Body holds JSON text, so parse each non-null value into a dict before normalizing
bodies = pandasDF['Body'].dropna().map(json.loads).tolist()
data = pd.json_normalize(bodies, record_path=['tabs'], meta=['device_id'])

Select the desired columns from the extracted data:

result = data[['device_id', 'url']]

The resulting DataFrame, result, contains the columns device_id and url. Each row represents an individual tab URL, with the corresponding device_id repeated for each URL.

Ensure that the pandas library is installed in your environment to successfully implement this solution.

Anamika Chakravarty · Answer 4 · 2023-06-17T10:40:51+0000

To extract individual tab URLs from the Body string and create a dataset with the desired columns, follow these steps:

Convert the Spark DataFrame to a Pandas DataFrame:

pandasDF = sparkDF.toPandas()

Parse each Body string with json.loads, then use pd.json_normalize from the pandas library to expand the nested tabs array into one row per tab:

import json
import pandas as pd

# Body holds JSON text, so parse each non-null value into a dict before normalizing
bodies = pandasDF['Body'].dropna().map(json.loads).tolist()
data = pd.json_normalize(bodies, record_path=['tabs'], meta=['device_id'])

Select the required columns from the extracted data:

result = data[['device_id', 'url']]

The resulting DataFrame, result, includes the device_id and url columns. Each row represents an individual tab URL, with the corresponding device_id repeated for each URL.

Ensure that you have the pandas library installed in your environment to implement this solution successfully.
