
Parsing Invalid Anchor Tag With Beautifulsoup Or Regex

I wanted to parse a raw document containing HTML anchor tags, but unfortunately it contains an invalid tag such as: <a href="A 4"drive bay">some text here</a> I know

Solution 1:

I guess you could pre-filter your input text through a regular expression to correct this particular problem. Something like:

>>> import re
>>> text = '<a href="A 4"drive bay">some text here</a>'
>>> r = re.compile('''<a[^>]+href="([^>]+)">''')
>>> m = r.match(text)
>>> m.group(1)
'A 4"drive bay'
>>> r.sub('<a href="%s">' % m.group(1).replace('"', ' '), text)
'<a href="A 4 drive bay">some text here</a>'

This isn't a complete solution; just an idea of how to move forward.
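
One way to take the idea a step further, sketched here under some assumptions: pass a callback to `re.sub` so that every anchor in the document is repaired from its own match, rather than from the first match only. The function name `fix_anchor_hrefs` and the sample string are made up for illustration. Like the pattern above, this assumes `href` is the last attribute in the tag, and it drops any attributes that come before `href`.

```python
import re

# Same pattern as in the answer above. Assumption: href is the last
# attribute in the tag, so everything up to the closing "> is its value.
ANCHOR_RE = re.compile(r'<a[^>]+href="([^>]+)">')


def fix_anchor_hrefs(text):
    """Replace stray double quotes inside each href value with a space."""
    def _repair(match):
        # Rebuild the opening tag from this match's own href value.
        return '<a href="%s">' % match.group(1).replace('"', ' ')
    return ANCHOR_RE.sub(_repair, text)


# Hypothetical input with one broken and one valid anchor.
raw = 'See <a href="A 4"drive bay">some text here</a> and <a href="ok.html">this</a>.'
print(fix_anchor_hrefs(raw))
```

The cleaned string could then be handed to BeautifulSoup or any other HTML parser.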

Solution 2:

SELFHTML 8.1.2 (HTML documentation that is used very frequently in Germany) recommends the following for anchor names:

  1. First character: a Latin letter (a-z, A-Z)
  2. Subsequent characters: Latin letters, digits (0-9), -, _ or .

I use the following regex to find names that violate the first requirement:

name="[^a-zA-Z]

(N.B. The first, leading space does not seem to matter much; the pattern works in most regex implementations, e.g. the TextPad editor from Helios.)
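
As a rough illustration, the same pattern can be run from Python to list the lines that contain a suspicious anchor name. The helper name `lines_with_bad_anchor_start` and the sample markup are assumptions made for this sketch.

```python
import re

# Pattern from the answer: a name attribute whose value does not
# start with a Latin letter.
BAD_FIRST_CHAR = re.compile(r'name="[^a-zA-Z]')


def lines_with_bad_anchor_start(html):
    """Return (line_number, line) pairs for lines the pattern matches."""
    return [
        (number, line)
        for number, line in enumerate(html.splitlines(), start=1)
        if BAD_FIRST_CHAR.search(line)
    ]


# Hypothetical sample: only the first anchor name starts with a letter.
sample = '<a name="intro">ok</a>\n<a name="1st">bad</a>\n<a name="_x">bad</a>'
for number, line in lines_with_bad_anchor_start(sample):
    print(number, line)
```

Note that the pattern also flags an empty `name=""`, because the closing quote is itself not a letter.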

To make the work easier, I also have a regex for the second requirement. It also catches one-character anchors (which are valid), but it helps to identify possible problems:

name=".?[^a-zA-Z0-9_\.-][^"]*"
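
As written, the pattern above only inspects the first two characters of the name. A complementary approach, sketched here as an assumption and not as the answerer's method, is to extract each name value and check it positively against both rules at once. The names `VALID_NAME`, `NAME_ATTR` and `invalid_anchor_names` are hypothetical, and the sketch only handles double-quoted `name` attributes.

```python
import re

# Assumed formalisation of the two rules: a Latin letter first,
# then any number of letters, digits, '-', '_' or '.'.
VALID_NAME = re.compile(r'[A-Za-z][A-Za-z0-9_.-]*')
# Extracts the value of every double-quoted name attribute.
NAME_ATTR = re.compile(r'name="([^"]*)"')


def invalid_anchor_names(html):
    """Return the name values that break either rule."""
    return [
        value
        for value in NAME_ATTR.findall(html)
        if not VALID_NAME.fullmatch(value)
    ]


# Hypothetical sample with valid and invalid anchor names.
sample = (
    '<a name="top"></a><a name="a"></a><a name="sec 2"></a>'
    '<a name="2nd"></a><a name="x.y-z_1"></a>'
)
print(invalid_anchor_names(sample))
```

Unlike the search pattern, this check does not report valid one-character names.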

I find most other problems with a syntax checker.
