Iterating over a series of different URLs and soup.find text

say_hello · Sunday, 24 April 2022, 14:11

Hello and good day, community,

I want to extract text from a page, and I'm running into two problems:
a. First, how do I iterate over a whole series of URLs?
b. I haven't managed, at the first attempt, to get back the text of a class.

I'll start with a few first steps, which work fine.

a. As a first test, I fetch the headings:

Code:

# Python program to print all heading tags
import requests
from bs4 import BeautifulSoup

# Fetch the page content
url_link = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
response = requests.get(url_link)

soup = BeautifulSoup(response.text, 'lxml')

# Find all common heading tags and print each tag name with its text
heading_tags = ["h1", "h2", "h3", "h4"]
for tag in soup.find_all(heading_tags):
    print(tag.name + ' -> ' + tag.text.strip())

That gives me a first overview:

h1 -> Smart Specialisation Platform
h1 -> Digital Innovation Hubs

Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies

So far, so good. What I still want to do is get the content of the h4 tags.
I'm relatively new to BS4 and pandas, and I want to iterate over a series of pages with BS4: https://s3platform.jrc.ec.europa.eu/dig ... -hubs-tool

There are about 700 different URLs like the following, so the question is how to query them all:

https://s3platform-legacy.jrc.ec.europa ... /1096/view
https://s3platform-legacy.jrc.ec.europa ... 17865/view
https://s3platform-legacy.jrc.ec.europa ... /1416/view
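One possible way to walk through many hub pages is to keep the numeric IDs in a list, build each URL from the pattern of the full URL shown earlier, and fetch the pages one by one. The sketch below assumes that every hub page follows the same `.../dih/<id>/view` pattern as the 1096 page. The names `build_url`, `fetch_html`, `scrape_hubs` and the `fetch` and `delay` parameters are hypothetical, and the full list of roughly 700 IDs is not shown in this thread, so it would have to come from somewhere else.

```python
import time

import requests
from bs4 import BeautifulSoup

# Assumption: every hub page follows the pattern of the one full URL shown earlier.
BASE_URL = (
    "https://s3platform-legacy.jrc.ec.europa.eu/"
    "digital-innovation-hubs-tool/-/dih/{hub_id}/view"
)


def build_url(hub_id):
    """Return the detail-page URL for one hub ID."""
    return BASE_URL.format(hub_id=hub_id)


def fetch_html(session, url):
    """Download one page and return its HTML; raises on HTTP errors."""
    response = session.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def scrape_hubs(hub_ids, fetch=fetch_html, delay=0.0):
    """Yield (hub_id, soup) for every hub page that could be fetched."""
    with requests.Session() as session:  # reuse one connection pool for all requests
        for hub_id in hub_ids:
            try:
                html = fetch(session, build_url(hub_id))
            except requests.RequestException as error:
                # Skip pages that fail instead of aborting the whole run
                print(f"skipping {hub_id}: {error}")
                continue
            yield hub_id, BeautifulSoup(html, "html.parser")
            time.sleep(delay)  # optional pause between requests
```

A call could look like `for hub_id, soup in scrape_hubs([1096, 17865, 1416], delay=1.0): ...`, using the three IDs visible in the links above. Passing the download function in as `fetch` keeps the loop easy to try out without network access.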

On each of these pages there are various elements that I need, for example:

Code:

div hubCardTitle
div hubCardContent
<div class="hubCardContent" id="yui_patched_v3_11_0_1_1650793691535_463">
				<p class="infoLabel">Description</p>
<p>&nbsp;</p><p></p>
			</div>

I could work with get_text here.

As far as I know, I don't need soup.find_all when I'm only looking for a single element; soup.find works as well. To get the text, one can use tag.string, tag.contents, or tag.text (see ²).

Code:

# Python program to get the text of a div by its class
import requests
from bs4 import BeautifulSoup

# Fetch the page and create a Beautiful Soup object
page = requests.get("https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view")
soup = BeautifulSoup(page.text, "lxml")

# The class name has to be a quoted string, otherwise Python raises a NameError
div = soup.find('div', {"class": "hubCardContent"})
# Note: .string is None whenever the tag has more than one child
text = div.string

But that doesn't work. I can't get at the text of the classes:

a. div hubCardTitle
b. div hubCardContent

I need to think about this some more.

² See: Web scraping - Get text from a class with BeautifulSoup and Python?
https://stackoverflow.com/questions/454 ... and-python
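
A small offline sketch of why `div.string` may come back empty and what could work instead. The HTML is a made-up sample modeled on the `hubCardContent` snippet above; the description text is a placeholder and not taken from the page. `.string` only returns text when a tag has exactly one child, so with several `<p>` children it is `None`, whereas `get_text()` joins the text of all descendants.

```python
from bs4 import BeautifulSoup

# Made-up sample modeled on the snippet above; the description text is a placeholder.
SAMPLE_HTML = """
<div class="hubCardContent">
    <p class="infoLabel">Description</p>
    <p>Placeholder description text.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The class name must be a quoted string; class_ avoids the reserved word "class".
div = soup.find("div", class_="hubCardContent")

# .string is None here because the div has more than one child node.
print(div.string)

# get_text() joins the text of all descendants; strip=True trims each piece
# and drops pieces that are only whitespace.
text = div.get_text(" ", strip=True)
print(text)
```

On the real page, `soup.find` returns `None` if no such div exists, so a check like `if div is not None:` before calling `get_text()` is probably a good idea.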

say_hello · Sunday, 24 April 2022, 21:36

Hello and good evening,

I've made some progress. As already described above, we have:

Code:

# Python program to print all heading tags
import requests
from bs4 import BeautifulSoup

# Fetch the page content
url_link = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
response = requests.get(url_link)

soup = BeautifulSoup(response.text, 'lxml')

# Find all common heading tags and print each tag name with its text
heading_tags = ["h1", "h2", "h3", "h4"]
for tag in soup.find_all(heading_tags):
    print(tag.name + ' -> ' + tag.text.strip())

h1 -> Smart Specialisation Platform
h1 -> Digital Innovation Hubs

Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies

So far, so good. What I still want to do is get the content of the h4 tags.
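
One way this might be approached is to start at each h4 and collect the paragraphs that follow it in document order, stopping at the next h4. The sketch below is only an illustration: the function name `text_by_heading` is hypothetical, the sample HTML is made up, and the nesting of the h4 tags inside `hubCardTitle` divs is an assumption about the real page.

```python
from bs4 import BeautifulSoup

# Made-up sample; the nesting of h4 inside hubCardTitle is an assumption.
SAMPLE_HTML = """
<div class="hubCardTitle"><h4>Contact Data</h4></div>
<div class="hubCardContent">
    <p class="infoLabel">Year Established</p>
    <p>2017</p>
</div>
<div class="hubCardTitle"><h4>Description</h4></div>
<div class="hubCardContent">
    <p>Placeholder description text.</p>
</div>
"""


def text_by_heading(soup, heading="h4"):
    """Map each heading's text to the list of paragraph texts that follow it."""
    sections = {}
    for heading_tag in soup.find_all(heading):
        paragraphs = []
        # find_all_next walks forward through the document, regardless of nesting
        for element in heading_tag.find_all_next([heading, "p"]):
            if element.name == heading:
                break  # the next heading starts a new section
            text = element.get_text(" ", strip=True)
            if text:  # skip empty paragraphs
                paragraphs.append(text)
        sections[heading_tag.get_text(" ", strip=True)] = paragraphs
    return sections


sections = text_by_heading(BeautifulSoup(SAMPLE_HTML, "html.parser"))
for title, paragraphs in sections.items():
    print(title, "->", paragraphs)
```

Because `find_all_next` follows document order rather than sibling relationships, this should keep working even if the headings and their content sit in different parent elements.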

I'm relatively new to BS4 and pandas, and I want to iterate over a series of pages with BS4: https://s3platform.jrc.ec.europa.eu/dig ... -hubs-tool

We have the following URLs, and I still need a technique that runs through them intelligently:

https://s3platform-legacy.jrc.ec.europa ... /1096/view
https://s3platform-legacy.jrc.ec.europa ... 17865/view
https://s3platform-legacy.jrc.ec.europa ... /1416/view

Parsing a single page already works:

https://s3platform-legacy.jrc.ec.europa ... /1096/view

Code:

import requests
from bs4 import BeautifulSoup

url = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1096/view'
r = requests.get(url)
html_as_string = r.text
soup = BeautifulSoup(html_as_string, 'html.parser')
# Print the text of every <p> element on the page
for paragraph in soup.find_all('p'):
    print(paragraph.text)

The result is still unsorted for now; I want to get it into tables with pandas.

Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies

Here are the results:

Click on the following link if you want to propose a change of this HUB

You need an EU Login account for request proposals for editions or creations of new hubs. If you already have an ECAS account, you don't have to create a new EU Login account.
In EU Login, your credentials and personal data remain unchanged. You can still access the same services and applications as before. You just need to use your e-mail address for logging in.
If you don't have an EU Login account please use the following link. you can create one by clicking on the Create an account hyperlink.
If you already have a user account for EU Login please login via https://webgate.ec.europa.eu/cas/login

Sign in
New user? Create an account
Coordinator (University)
Robotic Competence Center of Technical University of Munich, TUM CC
Coordinator website
http://www6.in.tum.de/en/home/
Year Established
2017
Location
Schleißheimer Str. 90a, 85748, Garching bei München (Germany)
Website
http://www.robot.bayern
Social Media

Contact information
Adam Schmidt
adam.schmidt@tum.de
+49 (0)89 289-18064

Year Established
2017
Location
Schleißheimer Str. 90a, 85748, Garching bei München (Germany)
Website
http://www.robot.bayern
Social Media

Contact information
Description
BaRoN is an initiative bringing together several actors in Bavaria: the TUM Robotics Competence Center founded within the HORSE project, Bavarian Research Alliance (BayFOR), ITZB (Projektträger Bayern) and Bayerische Patentallianz, the latter three being members of the Bavarian Research and Innovation Agency) in order to facilitate the process of robotizing Bavarian manufacturing sector. In its current form it is an informal alliance of established institutions with a vast experience in the field of bringing and facilitating innovation in Bavaria. The mission of the network is to make Bavaria the forerunner of the digitalized and robotized European industry. The mission is realized by offering services ranging from providing the technological expertise, access to the robotic equipment, IPR advice and management, and funding facilitation to various entities of the Bavarian manufacturing ecosystem – start-ups, SMEs, research institutes, universities and other institutions interested in embracing the Industry 4.0 revolution.
BaRoN verbindet mehrere Bayerische Akteure mit einem gemeinsamen Ziel – die Robotisierung des Bayerischen produzierenden Gewerbes voranzutreiben. Die Mitglieder des Netzwerks sind das Robotik-Kompetenzzentrum der TUM, gegründet im Rahmen des HORSE-Projektes, die Bayerische Forschungsallianz (BayFOR), das ITZB (Projektträger Bayern) und die Bayerische Patentallianz. Die letzteren drei sind Mitglieder der Bayerischen Forschungs- und Innovationsagentur, einer vom Bayerischen Staat geförderten Organisation. In seiner gegenwärtigen Form ist BaRoN eine informelle Allianz etablierter Institutionen, die über breite Erfahrung darin verfügen, Innovation in und für Bayern zu fördern. Das gemeinsame Anliegen des Netzwerks ist es, Bayern zum Vorreiter beim Thema digitale und robotisierte Industrie zu machen. Diese Mission realisiert das Netzwerk durch das Angebot verschiedener Dienstleistungen wie z.B. technische Beratung, Zugang zu Robotertechnologie, Beratung zum Schutz geistigen Eigentums und Management desselben sowie Zugang zu Fördermöglichkeiten. Die Zielgruppen sind dabei alle Akteure, die sich mit innovativer Produktion in Bayern beschäftigen – Start-ups, KMUs, Forschungsinstitute, Universitäten und alle Institutionen, die sich mit dem Thema Industrie 4.0 beschäftigen.
BaRoN brings added value the digitization of Bavarian industry by supporting activities ranging from the technology development and transfer, obtaining funding for innovative actions, protecting IPR and networking the actors of the regional and European manufacturing and robotics communities.
Via its members and through cooperation with ZD.B BaRoN is linked to the following regional and national iniatives:

Bavaria Digital – Bavarian State Strategy

Masterplan Bayern Digital 2.5B€ invested in digitization in 2017 and 2018
Masterplan Bayern Digital II – foreseen 3.5B€ to be invested between 2018 and 2022

OP Bayern ERDF 2014-2020

Enhancing the competitiveness of SMEs through the creation and the extension of advanced capacities for product and service developments and through internationalisation initiatives
Budget 1.4B€

I'll keep looking into how to handle this better with pandas, in other words, how to write the results into a table.

PS: the results should then be written into tables with pandas, using the headings:

Digital Innovation Hubs
h2 -> Bavarian Robotic Network (BaRoN) Bayerisches Robotik-Netzwerk, BaRoN
h4 -> Contact Data
h4 -> Description
h4 -> Link to national or regional initiatives for digitising industry
h4 -> Market and Services
h4 -> Organization
h4 -> Evolutionary Stage
h4 -> Geographical Scope
h4 -> Funding
h4 -> Partners
h4 -> Technologies
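
A minimal sketch of how the pandas step might look: each page becomes one dictionary (heading mapped to text), and the list of dictionaries becomes a DataFrame with one row per hub and one column per heading. The function name `hub_record`, the sample pages, and the hub IDs 1 and 2 are made up for illustration. It also assumes that every `hubCardTitle` div is followed by its `hubCardContent` div, which would need checking against the real pages.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up sample pages; IDs and texts are placeholders, not real data.
SAMPLE_PAGES = {
    1: """
        <div class="hubCardTitle">Description</div>
        <div class="hubCardContent"><p>First placeholder description.</p></div>
        <div class="hubCardTitle">Funding</div>
        <div class="hubCardContent"><p>First placeholder funding.</p></div>
    """,
    2: """
        <div class="hubCardTitle">Description</div>
        <div class="hubCardContent"><p>Second placeholder description.</p></div>
    """,
}


def hub_record(hub_id, html):
    """Turn one hub page into a dict: heading text -> content text."""
    soup = BeautifulSoup(html, "html.parser")
    record = {"hub_id": hub_id}
    for title in soup.find_all("div", class_="hubCardTitle"):
        # Assumption: the matching content div is the next one in document order
        content = title.find_next("div", class_="hubCardContent")
        if content is not None:
            record[title.get_text(" ", strip=True)] = content.get_text(" ", strip=True)
    return record


# One row per hub; headings missing on a page end up as NaN in that row
df = pd.DataFrame([hub_record(hub_id, html) for hub_id, html in SAMPLE_PAGES.items()])
print(df)
```

From there, `df.to_csv("hubs.csv", index=False)` would write the table to a file; the file name is just an example.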