
I am new to Python and have been trying to learn both Python and BeautifulSoup.

How would I add the href links as additional "key": "value" pairs in the JSON object?

from bs4 import BeautifulSoup
import json

html = """<table>
  <tbody>
      <tr>
        <td><a href="/page/some-page">Some Page Title</a></td>
        <td class="created-at">2020-08-01</td>
        <td><a href="/id/400">Text Description 1</a></td>
      </tr>
      <tr>
          <td><a href="/page/some-page-2">Some Page Title 2</a></td>
          <td class="created-at">2020-08-02</td>
          <td><a href="/id/400">Text Description 2</a></td>
      </tr>
      <tr>
          <td><a href="/page/some-page-3">Some Page Title 3</a></td>
          <td class="created-at">2020-08-03</td>
          <td><a href="/id/400">Text Description 3</a></td>
      </tr>
  </tbody>
</table>"""

data = []
soup = BeautifulSoup(html, 'html.parser')
rows = soup.select('table > tbody > tr')

for table in rows:
    keys = ["Name", "Date", "Description"]
    values = [td.get_text(strip=True) for td in table.find_all('td')]
    d = dict(zip(keys, values))
    data.append(d)

print(json.dumps(data, indent=4))

I basically need to add two more keys and fill them with the href values:

keys = ["Name","Date","Description","Url1","Url2"]

1 Answer


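Add the two new keys and append the href of every link in the row to the list of values. This relies on each row containing three cells and two links, in that order: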

for table in rows:
    keys = ["Name", "Date", "Description", "Url1", "Url2"]
    # Cell text first, then the href of each link in the row, in document order.
    values = [td.get_text(strip=True) for td in table.find_all('td')]
    values += [a['href'] for a in table.find_all('a', href=True)]
    d = dict(zip(keys, values))
    data.append(d)

print(json.dumps(data, indent=4))

Output:

[
    {
        "Name": "Some Page Title",
        "Date": "2020-08-01",
        "Description": "Text Description 1",
        "Url1": "/page/some-page",
        "Url2": "/id/400"
    },
    {
        "Name": "Some Page Title 2",
        "Date": "2020-08-02",
        "Description": "Text Description 2",
        "Url1": "/page/some-page-2",
        "Url2": "/id/400"
    },
    {
        "Name": "Some Page Title 3",
        "Date": "2020-08-03",
        "Description": "Text Description 3",
        "Url1": "/page/some-page-3",
        "Url2": "/id/400"
    }
]
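
The zip-based approach pairs keys and values purely by position, so a row with a missing link would shift "Url2" into "Url1" and silently drop the last key. If that can happen in your real data, one alternative is to read each cell on its own. The sketch below is only an illustration under that assumption: the helper name row_to_record, the sample row without a link, and the "/id/401" href are made up for the example and are not part of the original question.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical sample: the second row has no link in its first cell.
html = """<table><tbody>
<tr>
  <td><a href="/page/some-page">Some Page Title</a></td>
  <td class="created-at">2020-08-01</td>
  <td><a href="/id/400">Text Description 1</a></td>
</tr>
<tr>
  <td>Title Without Link</td>
  <td class="created-at">2020-08-02</td>
  <td><a href="/id/401">Text Description 2</a></td>
</tr>
</tbody></table>"""


def row_to_record(row):
    """Build one record from a <tr>, reading text and href cell by cell."""
    # Assumes exactly three <td> cells per row, as in the question.
    name_td, date_td, desc_td = row.find_all("td")

    def href_of(td):
        # Return the href of the first link in the cell, or None if absent.
        link = td.find("a", href=True)
        return link["href"] if link else None

    return {
        "Name": name_td.get_text(strip=True),
        "Date": date_td.get_text(strip=True),
        "Description": desc_td.get_text(strip=True),
        "Url1": href_of(name_td),
        "Url2": href_of(desc_td),
    }


soup = BeautifulSoup(html, "html.parser")
records = [row_to_record(tr) for tr in soup.select("table > tbody > tr")]
print(json.dumps(records, indent=4))
```

Here a missing link becomes None (null in the JSON output) instead of shifting the other values, at the cost of hard-coding which cell feeds which key.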
