Writing my own crawler: looking for tips

McAce · Wednesday, 21 April 2010, 15:23

Hi,

I'm fairly new to Python, hence my question: is there a tutorial somewhere that describes how to write a web crawler?

I have already looked around, but I only found example code with no explanation, or explanations that were far too brief.

It would be really nice if someone had a link, a tip, or anything else.

Many thanks

Hyperion · Wednesday, 21 April 2010, 15:36

IIRC someone here already wanted this once. Try the forum search.

In general, urllib / urllib2 (standard library) should be what you need. Add an HTML parser, such as the one from lxml (third-party library).

In the end you only need a "start" page, and then you work your way along via the <a href=""> links.

A breadth-first search is probably the better fit, so that you don't die a recursion death with a depth-first search.

The data structure is the next interesting question. A key-value store could be an option there.
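
A minimal sketch of the breadth-first idea described above, not a finished crawler. It assumes Python 3 and uses only the standard library: urllib.request as the successor of urllib2, and html.parser in place of lxml. The names LinkCollector, extract_links, fetch, crawl and max_pages are made up for this illustration, and a plain dict stands in for the key-value store.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute link targets found in html, without #fragments."""
    collector = LinkCollector()
    collector.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in collector.hrefs]


def fetch(url):
    """Download a page; assumes UTF-8 for simplicity (an assumption)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl; returns a dict {url: [outgoing links]}."""
    queue = deque([start_url])   # FIFO queue gives breadth-first order
    seen = {start_url}           # URLs already queued, to avoid loops
    link_graph = {}              # stand-in for a key-value store
    while queue and len(link_graph) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue  # unreachable or unsupported URL: skip it
        links = extract_links(url, html)
        link_graph[url] = links
        for link in links:
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return link_graph
```

Because the queue is processed iteratively, the crawl depth is not limited by Python's recursion limit. A call such as crawl("http://example.com/") would download real pages; passing a different fetch_page function allows trying it out without network access.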

ms4py · Wednesday, 21 April 2010, 15:44

McAce wrote: I'm fairly new to Python, hence my question: is there a tutorial somewhere that describes how to write a web crawler?

Once you are clear about what a crawler has to be able to do, you will find plenty of information on the individual topics.

1) Open a URL and read its content (e.g. urllib2)
2) Parse the content and filter out all links (e.g. lxml)
3) Process the content (this depends entirely on the use case, of course; a search engine will index the content)
4) Respect "robots.txt"
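
For step 4, a small sketch of how a robots.txt check could look. It assumes Python 3, where the standard library ships urllib.robotparser; the helper build_robot_rules, the user agent name "my-crawler" and the example rules are invented for this illustration.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content, used here instead of a real download.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(EXAMPLE_ROBOTS_TXT)
# can_fetch(useragent, url) tells whether the given agent may request the URL.
print(rules.can_fetch("my-crawler", "http://example.com/index.html"))
print(rules.can_fetch("my-crawler", "http://example.com/private/data.html"))
```

For a live site, RobotFileParser also offers set_url() and read() to download the file itself; a crawler would then call can_fetch() before every request.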

`mechanize` ( http://wwwsearch.sourceforge.net/mechanize/ ) covers many of these requirements, but if this is meant to be homework or something similar, that may be too much abstraction.

Edit: Here is something else; at first glance the translation looks quite usable.
http://translate.google.com/translate?h ... parte-i%2F

Edit2: Only the code is displayed incorrectly, so you will have to look at it in the (Portuguese) original.
http://herberthamaral.com/2010/02/crian ... n-parte-i/

McAce · Wednesday, 21 April 2010, 17:16

Many thanks for the pointers.

I'll work my way through them right away and keep searching.

Can you recommend a book that goes in this direction, i.e. web or network programming with Python?

Hyperion · Wednesday, 21 April 2010, 17:25

McAce wrote: Can you recommend a book that goes in this direction, i.e. web or network programming with Python?

You have to learn the basics anyway, and a topic-specific book won't help you with that.

I would look at the documentation, and if needed also at the documentation of the libraries you use. They often contain good examples. Searching here in the forum often helps too.

Herberth Amaral · Sunday, 2 May 2010, 04:14

ms4py wrote: Edit: Here is something else; at first glance the translation looks quite usable.
http://translate.google.com/translate?h ... parte-i%2F

Edit2: Only the code is displayed incorrectly, so you will have to look at it in the (Portuguese) original.
http://herberthamaral.com/2010/02/crian ... n-parte-i/

Hi, I like the German language but I don't speak German. I am the author of that post in Portuguese. I found the reference to it in your post while browsing the web, and I think it deserves a comment.

There are a few points I'd like to share about Python-based web crawlers:

- You can use urllib *and* urllib2 to download web pages. urllib2 is not necessarily the evolution of urllib: there are some things, like URL encoding, that are not present in urllib2.

- Downloading pages is easy, but parsing can be hard. In my examples, I use BeautifulSoup, a Python toolkit for parsing malformed web pages and extracting *any* information from them in a very, very simple way. I recommend version 3.0.9 because 3.1.0 has some drawbacks (see more at http://www.crummy.com/software/BeautifulSoup/)

- That was only the first part of my tutorial series on making web crawlers in Python. There are 2 other posts (I'll probably post the fourth one tomorrow). If you think it would be useful to you, don't hesitate to contact me by email: herberthamaral [at] gmail [dot] com. I'll be glad to help.
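
On the URL-encoding point in the first bullet: in Python 2 the quoting helpers live in urllib, not urllib2. The sketch below assumes Python 3, where they were moved to urllib.parse; the helper build_url and the example values are invented for this illustration.

```python
from urllib.parse import quote, urlencode, urljoin


def build_url(base, path, params):
    """Join base and path, percent-encoding the path and the query values."""
    encoded_path = quote(path)      # encodes spaces and non-ASCII, keeps "/"
    query = urlencode(params)       # builds key=value pairs joined by "&"
    return urljoin(base, encoded_path) + "?" + query


print(build_url("http://example.com", "/search/web crawler",
                {"q": "python & lxml", "page": 2}))
```

Encoding the path and the query separately matters because the two parts follow different escaping rules (for example, urlencode turns a space into "+", while quote turns it into "%20").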

lunar · Sunday, 2 May 2010, 09:22

@Herberth Amaral: You shouldn't use BeautifulSoup, at least not for parsing. It is not maintained anymore and can't easily be ported to Python 3 without losing quality.

Use html5lib or lxml.html; both handle malformed HTML. I recommend the latter. Being implemented on top of libxml2, it is the fastest available parser, very mature, and no less powerful or convenient than BeautifulSoup.

ms4py · Sunday, 2 May 2010, 11:29

lunar wrote: @Herberth Amaral: You shouldn't use BeautifulSoup, at least not for parsing. It is not maintained anymore and can't easily be ported to Python 3 without losing quality.

Use html5lib or lxml.html; both handle malformed HTML. I recommend the latter. Being implemented on top of libxml2, it is the fastest available parser, very mature, and no less powerful or convenient than BeautifulSoup.

I agree, see http://ms4py.org/2010/04/27/python-sear ... er-part-1/

Herberth Amaral · Sunday, 2 May 2010, 12:27

ms4py wrote:
lunar wrote: @Herberth Amaral: You shouldn't use BeautifulSoup, at least not for parsing. It is not maintained anymore and can't easily be ported to Python 3 without losing quality.

Use html5lib or lxml.html; both handle malformed HTML. I recommend the latter. Being implemented on top of libxml2, it is the fastest available parser, very mature, and no less powerful or convenient than BeautifulSoup.
I agree, see http://ms4py.org/2010/04/27/python-sear ... er-part-1/

Yes, I know BeautifulSoup is not maintained anymore. Too bad.

I was aware of lxml when I started my studies, but I found BeautifulSoup easier to learn and easier to use for manipulating elements (also, BeautifulSoup was still being maintained at that time). The BeautifulSoup documentation is also clearer than that of lxml or html5lib. I don't recommend it for any "serious" project, but it can be very useful for a crawler written in 20 minutes.

At first sight, lxml.html seems more interesting to me than html5lib, and I'll bet on it.