Removing HTML tags

SeriousRuffy · Wednesday, 1 July 2015, 16:20

Hi everyone,

I am trying to extract the hotel names from a page, but my problem is that I cannot get rid of the HTML tags:

Code:

import requests
from bs4 import BeautifulSoup
import time

user_agent = {'User-agent': 'Chrome/43.0.2357.124'}

schreibdatei= open("testo.csv", "w")

r = requests.get("http://www.beispiel.de")

soup = BeautifulSoup(r.content)

# Filter out the hotel names:

g_data2 = soup.find_all("div", {"id": "main_content"})
for item in g_data2:
    item = soup.find_all("span", {"class": "item_name"})
    print(item)

This is what I get back:

Code:

[<span class="item_name">Excelsior Ernst</span>, <span class="item_name">Gallery Loft Cologne</span>, <span class="item_name">Ibis Köln am Dom</span>, <span class="item_name">Mondial am Dom Cologne - MGallery Collection</span>, <span class="item_name">Eden Früh am Dom</span>, <span class="item_name">Callas Am Dom</span>, <span class="item_name">Central am Dom</span>, <span class="item_name">Lindner Dom Residence</span>, <span class="item_name">Bürgerhof</span>, <span class="item_name">Königshof</span>, <span class="item_name">An der Philharmonie</span>, <span class="item_name">Station</span>, <span class="item_name">Wyndham Köln</span>, <span class="item_name">CityClass Hotel Residence am Dom</span>, <span class="item_name">Hilton Köln</span>, <span class="item_name">Drei Koenige</span>, <span class="item_name">Senats</span>, <span class="item_name">Breslauer Hof</span>, <span class="item_name">Maria Suite</span>, <span class="item_name">Residenz am Dom</span>, <span class="item_name">Engelbertz</span>, <span class="item_name">Lint</span>, <span class="item_name">Berg</span>, <span class="item_name">Müller</span>, <span class="item_name">Tryp by Wyndham Koeln City Centre</span>]

Process finished with exit code 0

Next I tried to remove the HTML tags, which basically worked:

Code:

g_data2 = soup.find_all("div", {"id": "main_content"})
for item in g_data2:
    item = soup.find_all("span", {"class": "item_name"})
    print(item[0].text.strip())

Result:

Code:

Radisson Blu Köln

Since that worked for a single hotel name, I then tried to get all hotel names without HTML tags.

Code:

g_data2 = soup.find_all("div", {"id": "main_content"})
for item in g_data2:
    item = soup.find_all("span", {"class": "item_name"})
    print(item.text.strip())

But in this case I get an error message:
Result:

Code:

AttributeError: 'ResultSet' object has no attribute 'text'

Can any of you give me feedback or tips on the best way to handle this? I would be grateful for any help :)

cofi · Wednesday, 1 July 2015, 16:29

Code:

item = soup.find_all("span", {"class": "item_name"})

find_all() does not give you a single _item_, it gives you _itemS_.
What worked: you took the first element, but of course that does not work on an entire list (or rather, a ResultSet).

Code:

spans = soup.find_all("span", {"class": "item_name"})
names = [span.text.strip() for span in spans]
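
To make the difference between a single element and the ResultSet easier to try out, here is a minimal sketch that runs without the real page. The HTML string is a made-up stand-in using three names from the output above, and "html.parser" is just the parser I assume is available everywhere:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
html = """
<div id="main_content">
  <span class="item_name"> Excelsior Ernst </span>
  <span class="item_name">Gallery Loft Cologne</span>
  <span class="item_name">Ibis Köln am Dom</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet, which behaves like a list of Tag objects.
spans = soup.find_all("span", {"class": "item_name"})

# A single Tag has a .text attribute ...
first_name = spans[0].text.strip()

# ... but the ResultSet as a whole does not, so this raises AttributeError.
try:
    spans.text
    raised = False
except AttributeError:
    raised = True

# Read .text from every Tag in turn instead.
names = [span.text.strip() for span in spans]
print(names)
```

The list comprehension and a plain for loop do the same thing here; the comprehension is just handy if you want to keep the names around, for example to write them to a file later.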

SeriousRuffy · Wednesday, 1 July 2015, 16:49

Thanks for your feedback. In the meantime I also found another solution, which is similar to yours:

Code:

items = soup.find_all("span", {"class": "item_name"})
for item in items:
    print(item.text)
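
One more observation on the earlier loops: inside `for item in g_data2` the search was always run on `soup`, so the `main_content` div never actually limited anything. If the names should only come from that container, a sketch along these lines might be closer to the intent. The markup, including the `sidebar` div, is invented for illustration, and `get_text(strip=True)` is simply an alternative to `.text.strip()`:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same class also appears outside the main container.
html = """
<div id="main_content">
  <span class="item_name">Hilton Köln</span>
  <span class="item_name">Wyndham Köln</span>
</div>
<div id="sidebar">
  <span class="item_name">Sponsored Hotel</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching on soup looks at the whole document.
all_names = [span.get_text(strip=True)
             for span in soup.find_all("span", {"class": "item_name"})]

# Searching on the container Tag only looks inside that div.
container = soup.find("div", {"id": "main_content"})
hotel_names = [span.get_text(strip=True)
               for span in container.find_all("span", {"class": "item_name"})]

print(all_names)
print(hotel_names)
```

On a page where every `item_name` span sits inside `main_content` anyway, both variants return the same list, so this only matters if the class is reused elsewhere.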