
BeautifulSoup Lost Nodes

I am using Python and BeautifulSoup to parse HTML data and extract the p tags from RSS feed articles. However, some URLs cause problems because the parsed soup object does not include all of the nodes in the document.

Solution 1:

The input HTML is not quite conformant, so you'll have to use a different parser here. The html5lib parser handles this page correctly:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm')
>>> soup = BeautifulSoup(r.text, 'lxml')      # lxml drops part of the malformed tree
>>> soup.find('div', id='story-body') is not None
False
>>> soup = BeautifulSoup(r.text, 'html5lib')  # requires the html5lib package to be installed
>>> soup.find('div', id='story-body') is not None
True
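
Once html5lib has built the full tree, the p tags the question is after can be pulled out. A minimal sketch continuing the session above; pulling them from the story-body div is an assumption about where the article text lives on this page, and the printed output is omitted here:

>>> body = soup.find('div', id='story-body')
>>> for p in body.find_all('p'):              # each paragraph of the article body
...     print(p.get_text(strip=True))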
