HTMLParser.HTMLParser().unescape() Doesn't Work
I would like to convert HTML entities back to their human-readable form, e.g. '&pound;' to '£', '&deg;' to '°', etc. I've read several posts regarding this question Conve
Solution 1:
Apparently HTMLParser.unescape was a bit more primitive before Python 2.6.
Python 2.5:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
'&copy;'
Python 2.6/2.7:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
u'\xa9'
See the 2.5 implementation vs the 2.6 implementation / 2.7 implementation
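For readers on a newer interpreter, here is a minimal sketch of a version-tolerant wrapper. It assumes Python 3.4+ where the standard library provides html.unescape, and falls back to the HTMLParser method shown above on Python 2.6/2.7. The name to_readable is hypothetical and only used for illustration.

```python
try:
    # Python 3.4+: module-level function in the standard library
    from html import unescape
except ImportError:
    # Python 2.6/2.7 fallback: the method discussed above
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape


def to_readable(text):
    """Hypothetical helper: decode named and numeric HTML entities."""
    return unescape(text)


print(to_readable('&pound;20 at 30&deg;C &copy; &#169;'))
```

On Python 2.5 neither branch decodes '&copy;', which matches the behaviour described above, so an upgrade or one of the workarounds below would still be needed there.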
Solution 2:
This site lists some solutions; here's one of them:
from xml.sax.saxutils import escape, unescape

# Maps each literal character to its HTML entity.
html_escape_table = {
    '"': "&quot;",
    "'": "&apos;",
    "©": "&copy;",
    # etc...
}
# Inverted table: entity -> literal character.
html_unescape_table = {v: k for k, v in html_escape_table.items()}

def html_unescape(text):
    return unescape(text, html_unescape_table)
Not the prettiest thing though, since you would have to list each escaped symbol manually.
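One way to avoid listing every symbol by hand is to generate the table from the standard library's entity definitions. The following is only a sketch, assuming Python 3 (where the module is html.entities; in Python 2 it is htmlentitydefs and chr would need to be unichr). It covers named HTML 4 entities only, not numeric references such as '&#169;'.

```python
from html.entities import name2codepoint  # Python 3 module name
from xml.sax.saxutils import unescape

# saxutils.unescape already handles &lt;, &gt; and &amp; itself
# (and replaces &amp; last), so those three stay out of the generated table.
_BUILTIN = {'lt', 'gt', 'amp'}

# Entity text -> literal character, e.g. '&copy;' -> '©'.
html_unescape_table = {
    '&%s;' % name: chr(codepoint)
    for name, codepoint in name2codepoint.items()
    if name not in _BUILTIN
}


def html_unescape(text):
    return unescape(text, html_unescape_table)


print(html_unescape('&pound;5 &amp; 20&deg; &copy;'))
```

Leaving '&amp;' to saxutils matters: if it were in the table, a string like '&amp;copy;' could be decoded twice and end up as '©' instead of '&copy;'.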
EDIT:
How about this?
import htmllib

def unescape(s):
    p = htmllib.HTMLParser(None)
    p.save_bgn()
    p.feed(s)
    return p.save_end()