BeautifulSoup getText from between &lt;p&gt; does not pick up subsequent paragraphs

First off, I'm a complete novice when it comes to Python. However, I've written a piece of code to look at an RSS feed, open the links and extract the text from each article. This is what I have so far:

from BeautifulSoup import BeautifulSoup
import feedparser
import re
import urllib
# Dictionaries
links = {}
titles = {}
# Variables
n = 0
rss_url = "feed://www.gfsc.gg/_layouts/GFSC/GFSCRSSFeed.aspx?Division=ALL&Article=All&Title=News&Type=doc&List=%7b66fa9b18-776a-4e91-9f80-    30195001386c%7d%23%7b679e913e-6301-4bc4-9fd9-a788b926f565%7d%23%7b0e65f37f-1129-4c78-8f59-3db5f96409fd%7d%23%7bdd7c290d-5f17-43b7-b6fd-50089368e090%7d%23%7b4790a972-c55f-46a5-8020-396780eb8506%7d%23%7b6b67c085-7c25-458d-8a98-373e0ac71c52%7d%23%7be3b71b9c-30ce-47c0-8bfb-f3224e98b756%7d%23%7b25853d98-37d7-4ba2-83f9-78685f2070df%7d%23%7b14c41f90-c462-44cf-a773-878521aa007c%7d%23%7b7ceaf3bf-d501-4f60-a3e4-2af84d0e1528%7d%23%7baf17e955-96b7-49e9-ad8a-7ee0ac097f37%7d%23%7b3faca1d0-be40-445c-a577-c742c2d367a8%7d%23%7b6296a8d6-7cab-4609-b7f7-b6b7c3a264d6%7d%23%7b43e2b52d-e4f1-4628-84ad-0042d644deaf%7d"
# Parse the RSS feed
feed = feedparser.parse(rss_url)
# view the entire feed, one entry at a time
for post in feed.entries:
    # Create variables from posts
    link = post.link
    title = post.title
    # Add the link to the dictionary
    n += 1
    links[n] = link
for k,v in links.items():
    # Open each article link
    page = urllib.urlopen(v).read()
    page = str(page)
    soup = BeautifulSoup(page)
    # Find all of the text between paragraph tags and strip out the html
    page = soup.find('p').getText()
    # Strip ampersand codes and WATCH:
    page = re.sub('&\w+;','',page)
    page = re.sub('WATCH:','',page)
    # Print Page
    print(page)
    print(" ")
    # To stop after 3rd article, just whilst testing ** to be removed **
    if (k >= 3):
        break

This produces the following output:

>>> (executing lines 1 to 45 of "RSS_BeautifulSoup.py")
Total deposits held with Guernsey banks at the end of June 2012 increased 2.1% in sterling terms by £2.1 billion from the end of March 2012 level of £101 billion, up to £103.1 billion. This is 9.4% lower than the same time a year ago.  Total assets and liabilities increased by £2.9 billion to £131.2 billion representing a 2.3% increase over the quarter although this was 5.7% lower than the level a year ago.  The higher figures reflected the effects both of volume and exchange rate factors.
The net asset value of total funds under management and administration has increased over the quarter ended 30 June 2012 by £711 million (0.3%) to reach £270.8 billion. For the year since 30 June 2011, total net asset values decreased by £3.6 billion (1.3%).
The Commission has updated the warranties on the Form REG, Form QIF and Form FTL to take into account the Commission’s Guidance Notes on Personal Questionnaires and Personal Declarations.  In particular, the following warranty (varies slightly dependent on the application) has been inserted in the aforementioned forms,
>>> 

The problem is that this is only the first paragraph of each article, but I need to show the entire article.


You're getting closer!

# Find all of the text between paragraph tags and strip out the html
page = soup.find('p').getText()

Using find (as you've already noticed) stops after finding one result. You need find_all if you want all of the paragraphs. If the pages are formatted consistently (I only looked at one), you could also use something like

soup.find('div',{'id':'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})

to narrow your focus to the body of the article.
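
A minimal sketch of how the two suggestions combine, assuming the soup object from the question and BeautifulSoup 4 naming (find_all; with the older BeautifulSoup 3 import in the question, the equivalent call is findAll):

body = soup.find('div', {'id': 'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})
# Fall back to the whole page if that wrapper div is missing
paragraphs = body.find_all('p') if body else soup.find_all('p')
# Join the text of every paragraph instead of stopping at the first one
page = '\n'.join(p.getText() for p in paragraphs)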


This works for this particular article, where all of the text is wrapped in &lt;p&gt; tags. Since the web is an ugly place, that isn't always the case.

Often, websites will have text scattered around, wrapped in different types of tags (e.g. maybe in a &lt;span&gt;, a &lt;div&gt;, or an &lt;li&gt;).

To find all text nodes in the DOM, you can use soup.find_all(text=True).

This will return some undesired text, like the contents of &lt;script&gt; and &lt;style&gt; tags. You can filter those out by checking each text node's parent:

blocklist = [
  'style',
  'script',
  # other elements,
]
text_elements = [t for t in soup.find_all(text=True) if t.parent.name not in blocklist]

If you're working with a known set of tags, you can take the opposite approach and keep only the tags you want:

allowlist = [
  'p'
]
text_elements = [t for t in soup.find_all(text=True) if t.parent.name in allowlist]
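
A short usage sketch of the allowlist approach, assuming the soup and text_elements names from the snippet above:

# Collapse the surviving text nodes into one article string,
# dropping whitespace-only nodes along the way
article_text = ' '.join(t.strip() for t in text_elements if t.strip())
print(article_text)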
Another option is to fetch the page, parse it, and call get_text() on every &lt;p&gt; tag:
import requests
from bs4 import BeautifulSoup

# getdata() was not defined in the original snippet; a minimal fetch helper is assumed here
def getdata(url):
    return requests.get(url).text

htmldata = getdata("https://www.geeksforgeeks.org/how-to-automate-an-excel-sheet-in-python/?ref=feed")
soup = BeautifulSoup(htmldata, 'html.parser')
for data in soup.find_all("p"):
    print(data.get_text())
