
I am getting back into Python after 8 years. I am trying to write a program that uses BeautifulSoup and takes a collection of URLs as an argument. I pass the medios argument to the function count_words in place of a single URL, but it doesn't work. Is there a way to fix it, or to search for a word on multiple websites using BeautifulSoup?

import requests
from bs4 import BeautifulSoup

def count_words(url, the_word):
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    words = soup.find(text=lambda text: text and the_word in text)
    # print(words)
    return len(words)

def main():
    url = 'https://www.nytimes.com/'
    medios = {
        'Los Angeles Times': ['http://www.latimes.com/'],
        'New York Times': ['http://www.nytimes.com/']
    }
    word = 'Trump'
    # count = count_words(url, word)
    cuenta = count_words(medios, word)
    # print('\n El Sitio: {}\n Contiene {} occurrencias de la palabra: {}'.format(url, count, word))
    print('\n La palabra: {} aparece {} occurrencias en el New York Times'.format(word, cuenta))

if __name__ == '__main__':
    main()

1 Answer


There are three problems here:

  1. medios is a dict, but count_words accepts only a single URL string. Loop over the dict's keys and values and call the function once per URL.
  2. BeautifulSoup's find method returns only the first matching text node (or None when nothing matches), so len(words) gives the length of that one string, not the number of occurrences. To count occurrences of a word, use count on the page text.
  3. Send a browser-like User-Agent header with the request; otherwise some sites respond with 403 or 301. The question's code also passes allow_redirects=False, so a 301 from an http:// URL is never followed.
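Since the question asks about BeautifulSoup specifically, here is a minimal sketch of point 2 that counts a word in the visible text of a page rather than in one text node. The function name count_word_in_html and the sample HTML are made up for illustration, and it uses the built-in 'html.parser' instead of 'lxml' only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

def count_word_in_html(html, the_word):
    """Count case-insensitive occurrences of the_word in the visible text of html."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so their contents are not counted.
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    return text.lower().count(the_word.lower())

# Hypothetical sample page: the word appears once in a script and twice in visible text.
sample = (
    "<html><head><script>var trump = 1;</script></head>"
    "<body><h1>Trump speaks</h1><p>trump again</p></body></html>"
)
print(count_word_in_html(sample, "Trump"))  # prints 2
```

This is a plain substring count, so a word like "trumpet" would also match. The complete script below takes the simpler route of counting over the raw response text.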

import requests

# A browser-like User-Agent; without it some sites answer with 403 or 301.
headers = {'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"}

def count_words(url, the_word):
    r = requests.get(url, headers=headers)
    # Case-insensitive substring count over the raw response text (markup included).
    return r.text.lower().count(the_word.lower())

def main():
    # Site name -> list of URLs to check.
    medios = {
        'Los Angeles Times': ['http://www.latimes.com/'],
        'New York Times': ['http://www.nytimes.com/']
    }
    word = 'trump'

    for web_name, urls in medios.items():
        for url in urls:
            cuenta = count_words(url, word)
            # Spanish message: "The word: {} appears {} times in {}"
            print('La palabra: {} aparece {} occurrencias en el {}'.format(word, cuenta, web_name))

if __name__ == '__main__':
    main()

Output:

La palabra: trump aparece 47 occurrencias en el Los Angeles Times

La palabra: trump aparece 194 occurrencias en el New York Times
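If the script should keep going when one site fails, and should match whole words only, the loop could be sketched as below. This is an illustration under assumptions, not part of the answer above: the names fetch_text and count_by_site are hypothetical, the short User-Agent is a placeholder (some sites may need a fuller browser string), and the 10-second timeout is an arbitrary choice.

```python
import re
import requests

# Assumption: placeholder User-Agent; swap in a full browser string if a site rejects it.
HEADERS = {"user-agent": "Mozilla/5.0"}

def fetch_text(url):
    """Return the response body for url, or None if the request fails."""
    try:
        r = requests.get(url, headers=HEADERS, timeout=10)
        r.raise_for_status()
    except requests.RequestException:
        return None
    return r.text

def count_by_site(sites, the_word, fetch=fetch_text):
    """Map each site name to the whole-word count summed over its URLs."""
    # \b word boundaries keep "trump" from matching inside "trumpet".
    pattern = re.compile(r"\b" + re.escape(the_word) + r"\b", re.IGNORECASE)
    totals = {}
    for name, urls in sites.items():
        total = 0
        for url in urls:
            body = fetch(url)
            if body is not None:  # skip URLs that could not be fetched
                total += len(pattern.findall(body))
        totals[name] = total
    return totals
```

Calling count_by_site(medios, 'trump') with the medios dict from the script returns a dict of counts keyed by site name. The fetch parameter exists so the counting logic can be tried on canned text without touching the network.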
