Extracting links from HTML with BeautifulSoup

gerold · Tuesday, 13 December 2005, 23:45

Hi!

With BeautifulSoup, individual parts of an HTML structure can be parsed quite easily.

Here is an example that shows how simple it can be to pull A tags (links, also called anchor tags) out of an HTML text:

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from BeautifulSoup import BeautifulSoup

html = """<html>
<head>
  <title>Hallo Welt</title>
</head>
<body>
  <p>
    <a href="http://www.bcom.at">Bcom</a>
    <a href="http://gerold.bcom.at">Gerold</a>
  </p>
  <p>
    <A ID="sw3" href="http://sw3.at">SW3</A>
  </p>
</body>
</html>
"""

soup = BeautifulSoup(html)

# Find all links:
for anker in soup("a"):
    print "TEXT:", anker.string
    print "HREF:", dict(anker.attrs).get("href")
    print "ID:  ", dict(anker.attrs).get("id")

The tags are found even if the tag or attribute names in the HTML text are written in upper case.
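
The listing above is Python 2 (`print` statements and the old `BeautifulSoup` module). As a rough cross-check on Python 3, the same three links can be pulled out with the standard library's `html.parser` instead of BeautifulSoup. This is a swapped-in technique, not the library used in this post; `AnchorCollector` and `extract_anchors` are made-up names for the sketch, and it assumes that anchors are not nested.

```python
from html.parser import HTMLParser

HTML = """<html>
<body>
  <p>
    <a href="http://www.bcom.at">Bcom</a>
    <a href="http://gerold.bcom.at">Gerold</a>
  </p>
  <p>
    <A ID="sw3" href="http://sw3.at">SW3</A>
  </p>
</body>
</html>
"""


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collects text, href and id of every <a> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []      # finished anchors, one dict each
        self._current = None   # anchor currently being read, if any

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names already lower-cased,
        # so <A ID="..."> arrives here as tag "a" with attribute "id".
        if tag == "a":
            attrs = dict(attrs)
            self._current = {
                "text": "",
                "href": attrs.get("href"),
                "id": attrs.get("id"),
            }

    def handle_data(self, data):
        # Only text inside an open anchor is of interest.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.anchors.append(self._current)
            self._current = None


def extract_anchors(html):
    """Return a list of dicts with the keys text, href and id."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchors


for anchor in extract_anchors(HTML):
    print("TEXT:", anchor["text"])
    print("HREF:", anchor["href"])
    print("ID:  ", anchor["id"])
```

The third anchor is written as `<A ID=...>` in the input and still comes back with the keys `href` and `id`, which matches the behaviour described for BeautifulSoup.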
Regards,
Gerold

gerold · Thursday, 15 December 2005, 02:05

Hi @ all!

And here is the example from above, extended with a few irregularities: upper- and lower-case tag and attribute names, single and double quotes, an image as a link, and a tag that spans several lines. All of it is valid HTML.

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from BeautifulSoup import BeautifulSoup

html = """<html>
<head>
  <title>Allgemein uebliches HTML mit grossen und kleinen...</title>
</head>
<body>
  <p>
    <a href="http://www.bcom.at"
    ><strong>Bcom</strong> Werbeagentur</a>
    <a href="http://gerold.bcom.at"><img src="bild.gif" /></A>
  </p>
  <p>
    <A href="http://sw3.at">SW3 das Kassensystem</A>
    <A HREF="http://www.python-forum.de/" 
       id='forum' 
       Name="forum"
    >Das Python-Forum</a>
  </p>
</body>
</html>
"""
soup = BeautifulSoup(html)

for anker in soup("a"):
    attrs = dict(anker.attrs)
    print "tag     :", anker
    print "string  :", anker.string
    print "content :", "".join([str(item) for item in anker.contents])
    print "contents:", anker.contents
    print "href    :", attrs.get("href")
    print "id      :", attrs.get("id")
    print "name    :", attrs.get("name")
    print

And here is the result:

Code:

tag     : <a href="http://www.bcom.at"><strong>Bcom</strong> Werbeagentur</a>
string  : Null
content : <strong>Bcom</strong> Werbeagentur
contents: [<strong>Bcom</strong>, ' Werbeagentur']
href    : http://www.bcom.at
id      : None
name    : None

tag     : <a href="http://gerold.bcom.at"><img src="bild.gif" /></a>
string  : Null
content : <img src="bild.gif" />
contents: [<img src="bild.gif" />]
href    : http://gerold.bcom.at
id      : None
name    : None

tag     : <a href="http://sw3.at">SW3 das Kassensystem</a>
string  : SW3 das Kassensystem
content : SW3 das Kassensystem
contents: ['SW3 das Kassensystem']
href    : http://sw3.at
id      : None
name    : None

tag     : <a href="http://www.python-forum.de/" id="forum" name="forum">Das Python-Forum</a>
string  : Das Python-Forum
content : Das Python-Forum
contents: ['Das Python-Forum']
href    : http://www.python-forum.de/
id      : forum
name    : forum

Of course, all of this can also be parsed with *regular expressions* alone. But why not share the work? With *BeautifulSoup*, tags and their attributes can be pulled out of an HTML string quite easily. Depending on the parser class used, badly nested tags and faulty HTML code are defused to a greater or lesser degree along the way. With *RE*, this can only be achieved with considerably more effort.
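
To make the comparison a bit more concrete, here is a small Python 3 sketch that runs a deliberately naive regular expression and a parser over the anchors from the second example. The parser is the standard library's `html.parser`, used as a stand-in so that the sketch needs no extra package; it is not BeautifulSoup. The pattern `NAIVE_LINK` and the names `HrefCollector`, `hrefs_by_regex` and `hrefs_by_parser` are invented for this illustration.

```python
import re
from html.parser import HTMLParser

HTML = """<p>
  <a href="http://www.bcom.at"
  ><strong>Bcom</strong> Werbeagentur</a>
  <a href="http://gerold.bcom.at"><img src="bild.gif" /></A>
</p>
<p>
  <A href="http://sw3.at">SW3 das Kassensystem</A>
  <A HREF="http://www.python-forum.de/"
     id='forum'
     Name="forum"
  >Das Python-Forum</a>
</p>
"""

# Naive pattern (assumption for this sketch): lower-case tag and attribute
# name, double quotes, and the closing ">" directly after the href value.
NAIVE_LINK = re.compile(r'<a href="([^"]*)">')


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; quoting style and
        # line breaks inside the tag are handled by the parser.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def hrefs_by_regex(html):
    """Return all hrefs the naive pattern manages to match."""
    return NAIVE_LINK.findall(html)


def hrefs_by_parser(html):
    """Return all hrefs of <a> tags as seen by the parser."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


print("regex :", hrefs_by_regex(HTML))
print("parser:", hrefs_by_parser(HTML))
```

The naive pattern only catches the `gerold.bcom.at` link, while the parser returns all four. The pattern could be hardened with `re.IGNORECASE`, optional whitespace, both quote styles and so on, but each of those cases has to be added by hand, which is the extra effort mentioned above.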

Kind regards,
Gerold