HTMLParser and whitespace...

jens · Tuesday, 2 December 2008, 18:53

I'm fiddling around with HTMLParser at the moment... Basically everything works, I just have a problem with whitespace, i.e. the blank space between the tags and the actual text...

How do I get handle_data() to receive only the part of the text that is actually relevant?

I tried this, but it doesn't really make sense:

Code:

...
    def feed(self, data):
        ...

        # Strip every line and drop the ones that end up empty.
        lines = [l.strip() for l in data.split("\n")]
        lines = [l for l in lines if l]

        # Start with one space so that clean_data[-1] is always valid.
        clean_data = u" "
        for line in lines:
            if clean_data[-1] == u">" and line[0] == u"<":
                # A tag directly follows a tag: join without a space.
                clean_data += line
            elif clean_data.endswith("<br />"):
                clean_data += line
            else:
                print "[%r]" % line
                clean_data += " " + line

        clean_data = clean_data.strip()

        HTMLParser.feed(self, clean_data)

        return self.root

sma · Wednesday, 3 December 2008, 14:21

Strictly speaking, all whitespace is significant, so an (XHTML) parser must not simply drop it. AFAIK ElementTree and BeautifulSoup suppress it anyway. Which parser are you using? And why do the text nodes bother you? Can't you simply skip them?
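Skipping whitespace-only text chunks could look roughly like the sketch below. It is written for Python 3 (`html.parser`), unlike the Python 2 snippets in this thread, and `TextCollector` and its `chunks` attribute are hypothetical names chosen for the illustration. Note that it throws away the information about where spaces stood next to inline tags.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text chunks and ignores chunks that are whitespace only."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # hypothetical attribute, not part of HTMLParser

    def handle_data(self, data):
        # split() without arguments drops leading/trailing whitespace and
        # splits on runs of whitespace, so the join collapses each run.
        text = " ".join(data.split())
        if text:  # skip chunks that contained nothing but whitespace
            self.chunks.append(text)


parser = TextCollector()
parser.feed("<p>111\n    <strong>222</strong>\n    <i>333</i>\n</p>")
parser.close()
print(parser.chunks)
```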

Stefan

jens · Wednesday, 3 December 2008, 14:26

The approach above is rubbish... The following isn't really much better, but it halfway works:

Code:

import re

BLOCK_TAGS = (
    "address", "blockquote", "center", "del", "dir", "div", "dl", "fieldset",
    "form",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "hr", "ins", "isindex", "menu", "noframes", "noscript",
    "ul", "ol", "li", "table",
    "p", "pre",
    "br"
)

strip_html_regex = re.compile(
    r"""
        \s*
        <
            (?P<end>/{0,1})       # end tag e.g.: </end>
            (?P<tag>[^ >]+)       # tag name
            .*?
            (?P<startend>/{0,1})  # closed tag e.g.: <closed />
        >
        \s*
    """,
    re.VERBOSE | re.MULTILINE | re.UNICODE
)

def strip_html(html_code):
    """
    Delete whitespace from html code. Doesn't recognize preformatted blocks!
    
    >>> strip_html(u' <p>  one  \\n two  </p>')
    u'<p>one two</p>'
    
    >>> strip_html(u'<p><strong><i>bold italics</i></strong></p>')
    u'<p><strong><i>bold italics</i></strong></p>'
    
    >>> strip_html(u'<li>  Force  <br /> \\n linebreak </li>')
    u'<li>Force<br />linebreak</li>'
    
    >>> strip_html(u'one  <i>two \\n <strong>   \\n  three  \\n  </strong></i>')
    u'one <i>two <strong>three</strong> </i>'
    
    >>> strip_html(u'<p>a <unknown tag /> foobar  </p>')
    u'<p>a <unknown tag /> foobar</p>'
    """
    def strip_tag(match):
        block        = match.group(0)
        end_tag      = match.group("end") == "/"
        startend_tag = match.group("startend") == "/"
        tag          = match.group("tag")
        
#        print "_"*40
#        print match.groupdict()
#        print "block.......: %r" % block
#        print "end_tag.....:", end_tag
#        print "startend_tag:", startend_tag
#        print "tag.........: %r" % tag
        
        if tag in BLOCK_TAGS:
            return block.strip()
        
        space_start = block.startswith(" ")
        space_end = block.endswith(" ")
        
        result = block.strip()
        
        if end_tag:
            # It's a normal end tag e.g.: </strong>
            if space_start or space_end:
                result += " "      
        elif startend_tag:
            # It's a closed start tag e.g.: <br />
            
            if space_start: # there was space before the tag
                result = " " + result
                          
            if space_end: # there was space after the tag
                result += " "
        else:
            # a start tag e.g.: <strong>
            if space_start or space_end:
                result = " " + result
                
        return result

    data = html_code.strip()
    clean_data = " ".join([line.strip() for line in data.split("\n")])
    clean_data = strip_html_regex.sub(strip_tag, clean_data)
    return clean_data

I can't simply ignore all newlines/spaces. That causes problems with this:

Code:

<p>111
    <strong>222</strong>
    <i>333</i>
</p>

If I ignored all of them, the result would be: 111222333
The correct result would be: 111 222 333

Simply converting every newline into a space doesn't work well either. I would end up with markup containing lots of unnecessary spaces, roughly like this:

Code:

111        [b]222[/b]       [i]333[/i]

So you have to distinguish whether a tag is a block tag or an inline tag. For block tags you can cut away everything before and after the tag. For inline tags at least one space has to remain.
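The two naive strategies described above can be reproduced with a few lines. This is only a demonstration in Python 3; `visible_text` is a hypothetical helper that crudely removes everything that looks like a tag, so it is not a real HTML parser.

```python
import re

html = "<p>111\n    <strong>222</strong>\n    <i>333</i>\n</p>"


def visible_text(markup):
    # Crude helper for this demo only: drop everything that looks like a tag.
    return re.sub(r"<[^>]+>", "", markup)


# Strategy 1: strip every line and glue the lines together without a space.
dropped = "".join(line.strip() for line in html.split("\n"))

# Strategy 2: turn every newline into a space and keep the indentation.
spaced = html.replace("\n", " ")

print(repr(visible_text(dropped)))  # the words run together
print(repr(visible_text(spaced)))   # the words are separated by runs of spaces
```

Neither result is what a browser would render, which is why the block/inline distinction is needed.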

sma · Wednesday, 3 December 2008, 16:07

I thought I could quickly link to the rule in the HTML standard that says how and when whitespace is ignored, but that doesn't seem to be so easy. Before and after tags (except PRE and those redefined via the CSS white-space property) whitespace can be reduced to a single space, and between tags it can even be omitted. Several whitespace characters within text collapse into exactly one. The rules apparently stem from SGML originally, though, and are even more complicated.
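The "several become exactly one" part of that rule can be sketched in Python 3 as follows. The function name `collapse_whitespace` and its `preformatted` flag are assumptions made for this illustration; the character class lists space, tab, CR, LF and form feed explicitly so that a non-breaking space is left alone.

```python
import re

# Space, tab, carriage return, line feed and form feed.
_WHITESPACE_RUN = re.compile(r"[ \t\r\n\f]+")


def collapse_whitespace(text, preformatted=False):
    """Reduce every run of whitespace to one space, except in PRE content."""
    if preformatted:
        return text  # inside <pre> the whitespace is kept verbatim
    return _WHITESPACE_RUN.sub(" ", text)


print(repr(collapse_whitespace("  You\n  can\n  convert    from:   ")))
```

Whether the remaining single space at either end may be dropped as well depends on the neighbouring tags, which this sketch deliberately does not decide.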

Stefan

Leonidas · Wednesday, 3 December 2008, 16:27

sma wrote: The rules apparently stem from SGML originally, though, and are even more complicated.

Or every implementation tries to get it right somehow on its own. That was precisely the issue with HTML5: many things were never implemented the way SGML specifies, but ad hoc.

jens · Thursday, 4 December 2008, 10:52

I've now tried xml.dom.minidom. It has a Node.normalize():

http://docs.python.org/dev/library/xml. ... .normalize

I thought it might "delete" the whitespace. But that is evidently not its purpose (it only joins adjacent text nodes):

Code:

from xml.dom.minidom import parseString


document = """
<h2>
   simple    demo
</h2>

   <p>  You
  can
  convert    from:   </p>

"""

dom = parseString(u"<root>%s</root>" % document)

def display(node, level=0):
    # Use a separate loop variable instead of shadowing the "node" argument.
    for child in node.childNodes:
        indent = "    "*level

        print indent, "node: %r" % repr(child)

        try:
            child.normalize()
        except Exception:
            pass

        if child.nodeType == child.TEXT_NODE:
            print indent, "Text node: %r" % child.data

        print " -"*40
        if child.childNodes:
            display(child, level+1)


display(dom)

Code:

 node: '<DOM Element: root at 0x2639ee0>'
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     node: '<DOM Text node "\n">'
     Text node: u'\n'
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     node: '<DOM Element: h2 at 0x2639fa8>'
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
         node: '<DOM Text node "\n   simple...">'
         Text node: u'\n   simple    demo\n'
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     node: '<DOM Text node "\n\n   ">'
     Text node: u'\n\n   '
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     node: '<DOM Element: p at 0x263f0a8>'
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
         node: '<DOM Text node "  You\n  ca...">'
         Text node: u'  You\n  can\n  convert    from:   '
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
     node: '<DOM Text node "\n\n">'
     Text node: u'\n\n'
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

As you can see, all the whitespace is still there.
Still, for my project it probably makes more sense to use minidom instead of HTMLParser, because I'm building a tree anyway.
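If minidom is used, the whitespace could be cleaned up in a separate pass over the tree. The following is a minimal sketch in Python 3 under stated assumptions: `clean_whitespace`, `is_block` and the shortened `DEMO_BLOCK_TAGS` set are hypothetical names, and `root` is listed as a block tag only because of the wrapper element. It collapses runs of whitespace and trims the remaining space only where a text node borders a block element, so a space between inline elements survives.

```python
import re
from xml.dom.minidom import parseString

# Assumption for this sketch: a shortened block-tag list plus the wrapper.
DEMO_BLOCK_TAGS = {"root", "h2", "p", "div", "ul", "ol", "li"}


def is_block(node):
    return node.nodeType == node.ELEMENT_NODE and node.tagName in DEMO_BLOCK_TAGS


def clean_whitespace(parent):
    # Iterate over a copy because text nodes may be removed on the way.
    for child in list(parent.childNodes):
        if child.nodeType != child.TEXT_NODE:
            clean_whitespace(child)
            continue
        text = re.sub(r"[ \t\r\n\f]+", " ", child.data)
        # The neighbour is the adjacent sibling or, at the edges, the parent.
        if is_block(child.previousSibling or parent):
            text = text.lstrip(" ")
        if is_block(child.nextSibling or parent):
            text = text.rstrip(" ")
        if text:
            child.data = text
        else:
            parent.removeChild(child)


def cleaned(markup):
    dom = parseString("<root>%s</root>" % markup)
    clean_whitespace(dom.documentElement)
    return dom.documentElement.toxml()


print(cleaned("\n<h2>\n   simple    demo\n</h2>\n\n   <p>  You\n  can\n  convert    from:   </p>\n\n"))
print(cleaned("<p>111\n    <strong>222</strong>\n    <i>333</i>\n</p>"))
```

Preformatted content is not handled here, and whitespace just inside an inline element is left where it is.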

On the other hand, I've put this together:

Code:

# Python 2 only: htmllib was removed in Python 3.
from htmllib import HTMLParser
from formatter import AbstractFormatter


document = """
<h2>
   simple    demo
</h2>

   <p>  You 
  can
  convert    from:   </p>

"""


class DebugWriter(object):
    def _debug(self, *args):
        print repr(args)
    
    def __getattr__(self, name):
        print name,
        return self._debug


writer = DebugWriter()
formatter = AbstractFormatter(writer)
parser = HTMLParser(formatter)
parser.feed(document)

Output:

Code:

send_paragraph (1,)
new_font (('h2', 0, 1, 0),)
send_flowing_data ('simple demo',)
send_line_break ()
send_paragraph (1,)
new_font (None,)
send_flowing_data ('You can convert from:',)

That looks great. You get the text data reformatted. I just don't know now how to build a tree from it.
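One possible way to get a tree is to track the current node in the parser callbacks. The sketch below uses Python 3's `html.parser.HTMLParser` rather than the htmllib/formatter pair shown above; `Node`, `TreeBuilder` and `dump` are hypothetical names. It assumes well-formed input in which every start tag has a matching end tag (a bare `<br>` would nest wrongly), and it strips text aggressively, so the block/inline distinction discussed earlier would still have to go into `handle_data`.

```python
from html.parser import HTMLParser


class Node:
    """Hypothetical minimal tree node for this sketch."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # Node instances and plain strings


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Assumes well-formed input: the end tag closes the current node.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> become leaves.
        self.current.children.append(Node(tag, self.current))

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse and strip whitespace
        if text:
            self.current.children.append(text)


def dump(node, level=0):
    lines = []
    for child in node.children:
        if isinstance(child, Node):
            lines.append("    " * level + "<%s>" % child.tag)
            lines.extend(dump(child, level + 1))
        else:
            lines.append("    " * level + repr(child))
    return lines


document = "\n<h2>\n   simple    demo\n</h2>\n\n   <p>  You\n  can\n  convert    from:   </p>\n\n"
builder = TreeBuilder()
builder.feed(document)
builder.close()
print("\n".join(dump(builder.root)))
```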