Skip to content Skip to sidebar Skip to footer

What’s The Most Forgiving Html Parser In Python?

I have some random HTML and I used BeautifulSoup to parse it, but in most of the cases (>70%) it chokes. I tried using Beautiful soup 3.0.8 and 3.2.0 (there were some problems w

Solution 1:

They all are. I have yet to come across any html page found in the wild that lxml.html couldn't parse. If lxml barfs on the pages you're trying to parse you can always preprocess them using some regexps to keep lxml happy.

lxml itself is fairly strict, but lxml.html is a different parser and can deal with very broken html. For extremely brokeh html, lxml also ships with lxml.html.soupparser which interfaces with the BeautifulSoup library.

Some approaches to parsing broken html using lxml.html are described here: http://lxml.de/elementsoup.html

Solution 2:

With pages that don't work with anything else (those that contain nested <form> elements come to mind) I've had success with MinimalSoup and ICantBelieveItsBeautifulSoup. Each can handle certain types of error that the other one can't so often you'll need to try both.

Solution 3:

I ended up using BeautifulSoup 4.0 with html5lib for parsing and is much more forgiving, with some modifications to my code it's now working considerabily well, thanks all for suggestions.

Post a Comment for "What’s The Most Forgiving Html Parser In Python?"