BeautifulSoup with an Invalid HTML Document
I am trying to parse the document http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/5923en8.htm. I want to extract everything before Commission:. (I need Bea
Solution 1:
If you want everything before the tag containing "Commission:", you could do it without BeautifulSoup: treat the page as a plain string, search for the keyword, and drop the rest of the string.
But when I run your code I get the following:
[u'The Governments of the Member States and the European Commission were represe
nted as follows:', u'Commission', u'The Council held an orientation debate on ke
y economic policy issues with a view to giving guidance to the Commission on the
questions Ministers wish to be addressed in the broad economic policy guideline
s 1998/99 for which the Commission will present its recommandation later in the
Spring. It was noted that the forthcoming guidelines are of particular importanc
e given the start of stage 3 of EMU.', u'The debate was based on an assessment o
f the economic situation and outlook in the Community carried out by the Commiss
ion and the Economic Policy and Monetary Committees.', u"The Council held an ori
entation debate on the Commission's Communication setting out a possible Communi
ty framework allowing Member States to experiment with reduced VAT rates for lab
our-intensive services in order to boost employment in small businesses without
distorting international competition. ", u'This Communication was tabled by the
Commission as a follow-up to the Employment European Council of last November in
Luxembourg, which concluded that, in order to make the taxation system more emp
loyment-friendly, "Member States will examine, without obligation, the advisabil
ity of reducing the rate of VAT on labour-intensive services not exposed to cros
s-border competition".', u"In conclusion, the Council invited Coreper to examine
the technical questions arising from today's debate and to report back to it wi
th a view to deciding on a possible request to the Commission to submit a propos
al in this area. ", u"This technical examination should be carried out, taking i
nto account the criteria indicated in the Commission's Communication for a reduc
ed VAT rate, on the following questions :", u'An initial trial period running un
til the year 2002 should identify the best method for allocating FISIM. At the e
nd of this period, the Commission will assess the results of the trial period an
d decide, by means of a comitology procedure, on the final methodology to be app
lied. However, a unanimous decision by the Council would be needed in order to u
se the new methodology in budgetary calculations on other Community policies and
notably concerning "own resources".']
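The string-based approach suggested at the start of this answer could be sketched roughly as follows. This is only a minimal illustration, not code from the original post: the helper name text_before and the sample markup are made up, and the real page source would have to be downloaded first.

```python
def text_before(document, keyword):
    """Return the part of `document` that precedes the first `keyword`.

    If `keyword` does not occur, the whole document is returned unchanged.
    """
    head, _sep, _tail = document.partition(keyword)
    return head


# Hypothetical stand-in for the downloaded page source (not the real markup).
sample = (
    "<p>The Governments of the Member States and the European Commission "
    "were represented as follows:</p>"
    "<p>Belgium:</p>"
    "<p>Commission:</p>"
    "<p>(rest of the page)</p>"
)

# Searching for "Commission:" with the colon skips the earlier
# "European Commission" mention, which has no colon after it.
print(text_before(sample, "Commission:"))
```

Note that the result is raw markup and may end in an unclosed tag, so it would still need to be cleaned up or parsed if only the text is wanted.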
Solution 2:
Iterate over the p elements and stop when you find one whose text starts with Commission:
# Python 3: urlopen lives in urllib.request
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/5923en8.htm"
soup = BeautifulSoup(urlopen(url))

# Print each paragraph until one starts with "Commission", then stop.
for item in soup.find_all('p'):
    if item.text.startswith('Commission'):
        break
    else:
        print(item.text)
It prints everything before Commission:
The Governments of the Member States and the European Commission were represented as follows:
Belgium:
...
Ms Helen LIDDELL Economic Secretary to the Treasury
* * *