Web Scraper | Python
For the Golang version please check here.
And here is a python version:
"""
Create an rss class that parses the RSS feed URL and gets each entry's title and link.
"""
import feedparser


class rss:
    def __init__(self, url):
        self.url = url

    def fetchDetails(self):
        """Return (etag, modified, rows); each row is [title, link]."""
        result = []
        try:
            feed = feedparser.parse(self.url)
            # Not every server sends ETag / Last-Modified headers,
            # so fall back to an empty string instead of raising.
            last_etag = feed.get('etag', '')
            last_modified = feed.get('modified', '')
            for entry in feed['entries']:
                title = entry['title']
                # Keep the last link listed for the entry.
                link = ''
                for item in entry['links']:
                    link = item['href']
                result.append([title, link])
            return last_etag, last_modified, result
        except Exception as e:
            print(e)
            # Keep the return shape stable so callers can still index it.
            return '', '', result
Now import the module and call the class that you just created.
import my_module.rss_feedparser

url = ''  # the RSS feed URL that you want to parse
rss = my_module.rss_feedparser.rss(url)
# fetchDetails() returns (etag, modified, rows); index 2 is the list of [title, link] rows
result = rss.fetchDetails()[2]
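fetchDetails also returns the feed's ETag and Last-Modified values, but the call above keeps only the rows. feedparser.parse accepts etag and modified arguments and reports HTTP status 304 when the server says nothing has changed, so those two values can save a full download on each poll. Here is a minimal sketch of that idea. The name fetch_if_changed and the injectable parse argument are hypothetical and are not part of the original module.

```python
def fetch_if_changed(url, etag=None, modified=None, parse=None):
    """Return (changed, etag, modified, entries) for one polling round.

    `parse` is a hypothetical hook so the helper can be exercised offline;
    by default it is feedparser.parse.
    """
    if parse is None:
        import feedparser  # imported lazily, only needed for real requests
        parse = feedparser.parse

    # Passing the previous values makes this a conditional GET.
    feed = parse(url, etag=etag, modified=modified)

    # 304 Not Modified: the server sent no body, so there is nothing new.
    if feed.get('status') == 304:
        return False, etag, modified, []

    # Otherwise remember the fresh validators (if any) for the next round.
    return (True,
            feed.get('etag', etag),
            feed.get('modified', modified),
            feed.get('entries', []))
```

Servers that send neither header will simply return the full feed every time, so this is an optimization and not a replacement for comparing titles.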
Now keep the first title as title_init, then poll in a while loop and print the feed whenever the top title changes:
import time

title_init = result[0][0]  # newest title seen so far
latest_title = title_init
i = 0
while latest_title == title_init:
    i += 1
    print(i, "no new feeds yet, sleep 60 seconds")
    time.sleep(60)
    result = rss.fetchDetails()[2]
    latest_title = result[0][0]
    if latest_title != title_init:
        print("new news:")
        for r in result:
            print(r[0])
        # Remember the new top title so the loop keeps polling.
        title_init = latest_title
Then you have a tiny web scraper:
5 no new feeds yet, sleep 60 seconds
6 no new feeds yet, sleep 60 seconds
7 no new feeds yet, sleep 60 seconds
8 no new feeds yet, sleep 60 seconds
9 no new feeds yet, sleep 60 seconds
10 no new feeds yet, sleep 60 seconds
11 no new feeds yet, sleep 60 seconds
new news:
Roche lifts 2019 sales view again as Chinese demand soars
Trump national security adviser jets to Turkey in bid to halt assault
Irish PM says Brexit issues remain, EU sources report 'standstill'
Turkey's Erdogan says to re-evaluate upcoming U.S. visit
UK watchdog opens formal probe of Amazon Deliveroo deal
Hong Kong leader forced to abandon address, offers no olive branch
Warming climate puts Austria's hip Gruener Veltliner wine at risk
U.S. security chief 'heaped pain' on grieving parents of UK teen - lawyer
Sterling, UK stocks fall on concerns Brexit talks have stalled
Brexit talks ongoing as PM Johnson seeks to reach a deal - spokesman
'Number of significant issues to resolve' in Brexit talks - EU's Avramopoulos
UK Brexit minister: Not aware of 'second letter' plan to thwart EU exit delay
Euro zone inflation drops more than foreseen, trade surplus widens
Trading in second Woodford fund suspended - administrator
UK stocks retreat as market waits for Brexit deal update
M&S CFO Humphrey Singer to step down at the end of the year
Sales disappoint at UK homebuilder Barratt
Clashes erupt in Barcelona as Catalan separatists protest sentences for leaders
UK minister: PM will abide by pledge to write Brexit delay letter
Brexit minister says would not countenance short 'technical' delay
12 no new feeds yet, sleep 60 seconds
13 no new feeds yet, sleep 60 seconds
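As the output shows, the loop reprints the whole feed every time the top headline changes. If only the unseen headlines are wanted, one option is to compare the new rows against the previous ones. This is a minimal sketch, assuming rows are [title, link] lists as returned by fetchDetails. The helper name new_rows is hypothetical.

```python
def new_rows(previous, current):
    """Return the rows in `current` whose title did not appear in `previous`.

    Both arguments are lists of [title, link] rows; feed order is preserved.
    """
    seen = {row[0] for row in previous}
    return [row for row in current if row[0] not in seen]
```

Inside the loop, this would mean keeping the previous result around and printing new_rows(previous_result, result) in place of the full list.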
Good! Now we have the RSS feed titles and links. Let’s grab the article content from each link by matching an HTML attribute. We will need BeautifulSoup.
'''
Let's get those python modules
'''
import requests
from bs4 import BeautifulSoup
Now let’s create a class that captures the article text by selecting on an HTML attribute, such as {'class': 'StandardArticleBody_container'}.
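class html_class:
    def __init__(self, url):
        self.url = url

    def parsedHTML(self):
        """Return the text nodes inside the article body, or [] on failure."""
        try:
            page = requests.get(self.url, timeout=10)
            soup = BeautifulSoup(page.content, 'html.parser')
            article_body = soup.find('div', attrs={'class': 'StandardArticleBody_container'})
            if article_body is None:
                # The page has no such container (layout changed or wrong URL).
                return []
            # string=True keeps only text nodes (older bs4 releases spell it text=True).
            return article_body.find_all(string=True)
        except Exception as e:
            print(e)
            return []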
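The list returned by parsedHTML holds every text node in the container, which usually includes whitespace-only strings between tags. A small cleanup pass makes the result easier to print or tokenize. This is only a sketch, and the helper name clean_text_nodes is hypothetical.

```python
def clean_text_nodes(nodes):
    """Strip surrounding whitespace from each text node and drop empty ones."""
    cleaned = []
    for node in nodes:
        # str() turns a BeautifulSoup NavigableString into a plain string.
        text = str(node).strip()
        if text:
            cleaned.append(text)
    return cleaned
```

It works on plain strings as well, so it can be tried without fetching a page.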
Let’s hit the button and give it a go 😎😎😎!
$ python3 nltk_token_tag.py
title_init: U.S. CDC reports 'breakthrough' in vaping lung injury probe as cases top 2,000
U.S. CDC reports 'breakthrough' in vaping lung injury probe as cases top 2,000 http://feeds.reuters.com/~r/reuters/topNews/~3/4TXTCbJEx_g/u-s-cdc-reports-breakthrough-in-vaping-lung-injury-probe-as-cases-top-2000-idUSKBN1XI26R
['CHICAGO (Reuters) - Tests of lung samples taken from 29 patients with vaping-related injuries suggest all contained Vitamin E acetate, a discovery U.S. officials described on Friday as a “breakthrough” in the investigation of the nationwide outbreak that has topped 2,000 cases.']
['The discovery of Vitamin E acetate in lung samples offers the first direct evidence of a link with the substance and vaping-related lung injuries.', 'The substance has also been identified in tests by U.S. and state officials of product samples collected from patients with the vaping injury.']
['In a telephone briefing on Friday, Dr. Anne Schuchat, principal deputy director of the U.S. Centers for Disease Control and Prevention (CDC), called Vitamin E acetate “a very strong culprit of concern” and referred to the discovery as “a breakthrough” in the investigation.']
['She cautioned that more work is needed to definitively declare it a cause, and said studies may identify other potential causes of the serious injuries as well.']
['Vitamin E acetate is believed to be used as a cutting agent in illicit vaping products containing THC - the component of marijuana that gets people high.']
['The substance was identified early in product testing done in the New York Health Department’s Wadsworth laboratory, but not every THC vaping pen the lab tested contained Vitamin E, a lab official told Reuters.']
['Schuchat said researchers must now establish a causal link between exposure and injury, adding that “many substances are still under investigation.”']
['On Thursday, the CDC reported there have been 2,051 confirmed and probable U.S. lung injury cases and 39 deaths associated with use of e-cigarettes, or vaping products.', 'Nearly 85 percent of lung injury patients in the nationwide outbreak have reported using products containing THC.']
['FILE PHOTO: A man uses a vape device in this illustration picture, September 19, 2019.', 'REUTERS/Adnan Abidi/Illustration']
['In the CDC analysis, THC was detected in 23 of 28 patient samples of lung cells, including from three patients who said they did not use THC products.', 'Nicotine was detected in 16 of 26 patient samples.']
['In a separate report, Illinois officials found that compared to vapers who did not get sick, those who had a lung injury were significantly more likely to use THC-containing vaping products exclusively or frequently, and were nine times more likely to have purchased products from illicit sources, such as from on-line or off the street.']
['Together, the findings reinforce public health officials’ recommendation that people avoid using e-cigarettes that contain THC or any products that come from illicit sources.']
['Reporting by Julie Steenhuysen; Editing by Bill Berkrot']
['Our Standards:']
['The Thomson Reuters Trust Principles.']
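The script that produced this output is not shown in the post. As a rough sketch of how the two pieces might be wired together, the helper below walks the feed rows and fetches the text for each link. The name scrape_feed, the fetch_text callable, and the limit parameter are all assumptions made for illustration.

```python
def scrape_feed(rows, fetch_text, limit=3):
    """Yield (title, link, text_nodes) for the first `limit` feed rows.

    `rows` is a list of [title, link] pairs; `fetch_text` is any callable
    that takes a link and returns the text extracted from that page.
    """
    for title, link in rows[:limit]:
        yield title, link, fetch_text(link)
```

With the classes from this post, the call could look like scrape_feed(result, lambda link: html_class(link).parsedHTML()), printing the title, link, and text for each item it yields.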