Parsing Invalid Anchor Tag With Beautifulsoup Or Regex
I wanted to parse a raw document containing HTML anchor tags, but unfortunately it contains invalid tags such as: <a href="A 4"drive bay">some text here</a> I know
Solution 1:
I guess you could pre-filter your input text through a regular expression to correct this particular problem. Something like:
>>> import re
>>> text = '<a href="A 4"drive bay">some text here</a>'
>>> r = re.compile('''<a[^>]+href="([^>]+)">''')
>>> m = r.match(text)
>>> m.group(1)
'A 4"drive bay'
>>> r.sub('<a href="%s">' % m.group(1).replace('"', ' '), text)
'<a href="A 4 drive bay">some text here</a>'
This isn't a complete solution; just an idea of how to move forward.
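One way to take the idea a step further is to apply the same repair to every anchor in a document with a substitution callback. This is only a sketch: the function name fix_href_quotes is made up for illustration, and it assumes href is the last attribute in the tag, since a quote belonging to a later attribute would also be replaced.

```python
import re

# Same pattern idea as above; the text between "<a" and "href" is captured
# so that it can be written back unchanged.
# Assumption: href is the last attribute of the tag.
ANCHOR_RE = re.compile(r'<a([^>]+)href="([^>]+)">')


def fix_href_quotes(html):
    """Replace stray double quotes inside href values with spaces."""
    def repair(match):
        prefix, href = match.group(1), match.group(2)
        return '<a%shref="%s">' % (prefix, href.replace('"', ' '))
    return ANCHOR_RE.sub(repair, html)


text = '<a href="A 4"drive bay">some text here</a>'
fixed = fix_href_quotes(text)
print(fixed)
```

The cleaned string can then be handed to the HTML parser; anchors whose href value is already well formed pass through unchanged.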
Solution 2:
SELFHTML 8.1.2 (an HTML reference that is very widely used in Germany) recommends the following for anchor names:
- First character: a Latin letter (a-z, A-Z)
- Subsequent characters: Latin letters, digits (0-9), -, _ or .
I use the following regex to find names that violate the first requirement:
name="[^a-zA-Z]
(N.B. a leading space before name does not seem to matter; the pattern works in most regex implementations, e.g. the TextPad editor from Helios.)
To make the work easier, I also have a regex for the second requirement. It also catches one-character anchor names (which are valid), but it helps to identify possible problems:
name=".?[^a-zA-Z0-9_\.-][^"]*"
I find most other problems with a syntax checker.
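As a rough illustration, the two patterns can be run over a document line by line in Python. The helper name suspicious_anchor_lines and the sample input are invented for this sketch. Note that, as written, the second pattern only inspects the first two characters of a name, so an invalid character further in would go unnoticed.

```python
import re

# First character of the name is not a Latin letter.
BAD_FIRST_CHAR = re.compile(r'name="[^a-zA-Z]')
# The first or second character falls outside letters, digits, "-", "_" and ".".
# This pattern may also flag valid one-character names.
BAD_LATER_CHAR = re.compile(r'name=".?[^a-zA-Z0-9_\.-][^"]*"')


def suspicious_anchor_lines(html):
    """Return (line number, line) pairs for lines matching either pattern."""
    hits = []
    for number, line in enumerate(html.splitlines(), start=1):
        if BAD_FIRST_CHAR.search(line) or BAD_LATER_CHAR.search(line):
            hits.append((number, line))
    return hits


# Hypothetical sample input.
sample = '''<a name="intro">ok</a>
<a name="1st">bad first character</a>
<a name="a b">space inside</a>'''

for number, line in suspicious_anchor_lines(sample):
    print(number, line)
```

The flagged lines still need a manual look, since the patterns only point at candidates rather than proving a name is invalid.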