
So I'm trying to make a data table of some information on a website. This is what I've done so far.


library(rvest)
library(dplyr)

url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)

# group names
name_nodes <- html_nodes(page, ".grpl-name a")
name_text <- html_text(name_nodes)
df <- data.frame(matrix(unlist(name_text)), stringsAsFactors = FALSE)
df <- df %>% mutate(id = row_number())

# descriptions, joined to the names by row number
desc_nodes <- html_nodes(page, ".grpl-purpose")
desc_text <- html_text(desc_nodes)
df <- left_join(df, data.frame(matrix(unlist(desc_text)),
                               stringsAsFactors = FALSE) %>%
                  mutate(id = row_number()))

# emails, joined to the names by row number
email_nodes <- html_nodes(page, ".grpl-contact a")
email_text <- html_text(email_nodes)
df <- left_join(df, data.frame(matrix(unlist(email_text)),
                               stringsAsFactors = FALSE) %>%
                  mutate(id = row_number()))

This has been working until I got to the emails part. A few of the entries do not have emails. In the data frame, instead of the appropriate rows showing the NA value for the email, the last three rows show an NA value.

How do I make it so the appropriate rows have the NA value instead of just the last 3 rows?

1 Answer


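The emails end up misaligned because html_nodes() returns only the elements that match, so groups without an email are skipped and the email vector is shorter than the list of names. Joining by row number then pushes the NA values to the last rows.

To solve this, first select the parent nodes, one per student group (20 on this page), since a parent node is known to exist for every group.

Then call html_node() on that list of parent nodes. Unlike html_nodes(), html_node() returns exactly one result per parent: the first match, or NA if the desired tag does not exist.

This technique works whenever the number of sub-nodes varies from one parent to the next.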



library(rvest)
library(dplyr)

url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
# note: html_session() was renamed session() in rvest 1.0.0
page <- html_session(url)

#find group names
name_text <- html_nodes(page, ".grpl-name a") %>% html_text()
df <- data.frame(name_text, stringsAsFactors = FALSE)
df <- df %>% mutate(id = row_number())

#find text description
desc_text <- html_nodes(page, ".grpl-purpose") %>% html_text()
df$desc_text <- trimws(desc_text)

#find emails
#  find the parent nodes with html_nodes
#  then find the contact information from each parent using html_node
#  (groups without a contact link get NA in the matching row)
email_text <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-contact a") %>% html_text()
df$email_text <- email_text
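If you want to see the difference between html_nodes() and html_node() without hitting the site, here is a small self-contained sketch. The HTML string, group names and email addresses are made up for illustration. Only the class names are taken from the code above, and the real page's markup may be nested differently.

```r
library(rvest)

# Hypothetical markup: three groups, the second one has no contact link
html <- '
<html><body>
<div class="grpl-grp">
  <div class="grpl-name"><a>Chess</a></div>
  <div class="grpl-contact"><a>chess@example.org</a></div>
</div>
<div class="grpl-grp">
  <div class="grpl-name"><a>Hiking</a></div>
</div>
<div class="grpl-grp">
  <div class="grpl-name"><a>Robotics</a></div>
  <div class="grpl-contact"><a>robots@example.org</a></div>
</div>
</body></html>'
page <- read_html(html)

# html_nodes() on the whole page keeps only the matches,
# so the group without an email silently disappears
flat <- html_nodes(page, ".grpl-contact a") %>% html_text()
length(flat)  # 2, although there are 3 groups

# html_node() on each parent returns one value per parent, NA when missing
groups <- html_nodes(page, "div.grpl-grp")
names  <- groups %>% html_node(".grpl-name a") %>% html_text()
emails <- groups %>% html_node(".grpl-contact a") %>% html_text()

df <- data.frame(name = names, email = emails, stringsAsFactors = FALSE)
df
```

Because both columns are built from the same list of parents, they always have the same length and stay aligned, so no join on a row number is needed.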

