Das deutsche Python-Forum

Hello users!
The following program already works passably in that it reads the table correctly into a list. However, I would also like to turn that into a DataFrame so that I can visualize the data nicely later. I have spent the last three hours googling for this, with no luck. I think the problem is that Python cannot know from a single linear list what the columns and rows are supposed to be. Tips appreciated.

Code:

# -*- coding: utf-8 -*-
"""
Created on Tue Jul 23 17:13:13 2019

@author: Admin
"""

from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome(executable_path=r'C:\Users\Karl Kraft\Documents\System_Dateien\chromedriver.exe')
driver.get('https://www.finanzen.net/index/dax/marktkapitalisierung')
tbl = driver.find_element_by_xpath("//table[@class='table']").get_attribute('outerHTML')
myList = pd.read_html(tbl)
print(myList)
df = pd.DataFrame(myList)
print(df)
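
A side note on the code above: `pd.read_html` is documented to return a list of DataFrames, one per table it finds, so the list is not a flat list of cells. The following is only a minimal sketch of that idea; the list is built by hand to stand in for the return value of `pd.read_html`, and the column names `Name` and `Kurs` are made up for illustration.

```python
import pandas as pd

# Hand-built stand-in for one parsed table (hypothetical columns and values).
table = pd.DataFrame({"Name": ["A", "B"], "Kurs": [1.0, 2.0]})

# pd.read_html returns a list with one DataFrame per matching table.
# Assumption: exactly one table matched, so the list has one element.
tables = [table]

# Indexing the list yields the DataFrame; no extra pd.DataFrame(...) call is needed.
df = tables[0]
print(type(tables).__name__, type(df).__name__)
print(df)
```

Under that assumption, `tables[0]` already has rows and columns, and wrapping the whole list in `pd.DataFrame(...)` is not required.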

Code:

# -*- coding: utf-8 -*-
"""
Created on Tue Jul 23 17:13:13 2019

@author: Admin
"""

from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome(executable_path=r'C:\Users\Karl Kraft\Documents\System_Dateien\chromedriver.exe')
driver.get('https://www.finanzen.net/index/dax/marktkapitalisierung')
tbl = driver.find_element_by_xpath("//table[@class='table']").get_attribute('outerHTML')

myList = pd.read_html(tbl)

df = myList[0].dropna(axis=0, thresh=8)

print(df)

print(type(df))

That is already better.

But what do axis and thresh mean?
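
As a rough illustration of the two `dropna` parameters: `axis=0` drops rows (`axis=1` drops columns), and `thresh=N` keeps only those rows or columns that have at least N non-missing values. The small frame below is invented for the sketch and is not the scraped table.

```python
import numpy as np
import pandas as pd

# Invented example data: row 0 has 3 values, row 1 has 2, row 2 has none.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, np.nan],
})

# axis=0: look at rows; keep a row only if it has at least 2 non-NaN values.
rows_kept = df.dropna(axis=0, thresh=2)

# axis=1: look at columns; keep a column only if it has at least 2 non-NaN values.
cols_kept = df.dropna(axis=1, thresh=2)

print(rows_kept)
print(cols_kept)
```

With `thresh=8` on a ten-column table, a row therefore survives only if at least eight of its cells are filled, which presumably removes the mostly empty filler rows.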

I don't understand the problem, since you do get a DataFrame with all the columns.
It is much simpler, though, with:

Code:

url = "https://www.finanzen.net/index/dax/marktkapitalisierung"
df = pd.read_html(url, attrs={'class': "table"})[0]

Hello users!

By now my code looks like this (many of your tips have gone into it):

Code:

# -*- coding: utf-8 -*-
"""
Created on Tue Jul 23 17:13:13 2019

@author: Admin
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def parse_html_table(url, selector, table_position):
    """
    parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    """
    df = pd.read_html(url, attrs={selector: "table"})[table_position]
    
    return df
    
def show_results(data_frame_from_table):
    """
    shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
    and save chart as png-image
    """
    print(data_frame_from_table)
    data_frame_from_table.to_excel("example.xlsx")
    data_frame_from_table.plot.bar(width=1.5)
    plt.savefig("scrapedTableBarChart.png")
    
def rectify_data(data_frame_from_table):
    """
    rectifies the data by putting the comma at the correct position
    """
    for i in range(10):
        if data_frame_from_table.iloc[:,i].dtype == np.int64:
            data_frame_from_table.iloc[:,i] = data_frame_from_table.iloc[:,i] / 100
    
    return data_frame_from_table

def main():
    data_frame_from_table = parse_html_table(url="https://www.finanzen.net/index/dax/marktkapitalisierung",
                                             selector="class",
                                             table_position=0)
    data_frame_from_table = rectify_data(data_frame_from_table)
    show_results(data_frame_from_table)

if __name__ == "__main__":
    main()
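
One possible variation on `rectify_data`, offered only as a sketch: instead of looping over a fixed `range(10)`, the integer columns can be selected by dtype, so the function does not depend on the table having exactly ten columns. The name `rectify_data_by_dtype` and the sample values are hypothetical, and the assumption that every `int64` column is off by a factor of 100 is carried over from the code above.

```python
import numpy as np
import pandas as pd

def rectify_data_by_dtype(data_frame):
    """Return a copy in which every int64 column is divided by 100."""
    result = data_frame.copy()
    # Pick the int64 columns by dtype instead of by position.
    int_columns = result.select_dtypes(include=["int64"]).columns
    result[int_columns] = result[int_columns] / 100
    return result

# Invented sample: "price" lost its decimal separator during parsing.
sample = pd.DataFrame({
    "name": ["A", "B"],
    "price": np.array([12345, 678], dtype="int64"),
    "ratio": [1.5, 2.5],
})
fixed = rectify_data_by_dtype(sample)
print(fixed)
```

Working on a copy leaves the DataFrame that was passed in unchanged, which makes the function easier to reason about.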

The problem now: I would like to read a table that is spread across several web pages: https://www.finanzen.net/index/s&p_500/ ... ierung?p=1

Fortunately, the site works with a PHP GET variable here, so one could run through the pages in a loop. But how can I get a single DataFrame from the 10 tables?

@Strawk: What makes you think PHP is used anywhere here?

Have you ever read anything about DataFrame.append?

I meant the website to be scraped: https://www.finanzen.net/index/s&p_500/ ... ierung?p=5

@Strawk: Why should it use PHP? If you mean the ``p=5``: the query part of a URL is not a PHP invention.
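
To illustrate that the query part is plain URL syntax, independent of whatever runs on the server, here is a small sketch using only the standard library. The helper name `page_url` and the `example.com` address are placeholders, not the real site.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def page_url(base_url, page):
    """Append the page number as the query parameter 'p' (hypothetical helper)."""
    return f"{base_url}?{urlencode({'p': page})}"

# Placeholder base URL; the real address from the thread is not used here.
url = page_url("https://example.com/index/marktkapitalisierung", 5)

# The query string can be taken apart again without any knowledge of the server.
query = parse_qs(urlsplit(url).query)

print(url)
print(query)
```

A loop over `range(1, 11)` with such a helper would produce the ten page URLs.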

Code:

#!/usr/bin/env python
# coding: utf-8 -*-
from __future__ import print_function, division
#-------------------------------------------------------------------------------
# Name:        scrapeHTMLTable04.py
# Purpose:     scrape html-table and show results by:
#              simple print out, Excel-file, bar-chart and image
#
# Author:      Martin Königs
#
# Created:     07/29/2019
# Licence:     n/a
#-------------------------------------------------------------------------------
"""
    scrape html-table and show results
    
    contains functions:
    parse_html_table - parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    show_results - shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
                   and save chart as png-image
    rectify_data - rectifies the data: putting the comma at the correct position by dividing by 100
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def parse_html_table(url, selector, table_position):
    """
    parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    """
    df = pd.read_html(url, attrs={selector: "table"})[table_position]
    
    return df
    
def parse_html_tables_different_sites(url, selector, table_position):
    dfds = pd.DataFrame()
    for subsite in range(1, 11):
        dfds.append = pd.read_html(url + "?p=" + str(subsite), attrs={selector: "table"})[table_position]
    
    return dfds

def show_results(data_frame_from_table):
    """
    shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
    and save chart as png-image
    """
    print(data_frame_from_table)
    # data_frame_from_table.to_excel("scrapedTable.xlsx")
    # data_frame_from_table.plot.bar(width=1.5)
    # plt.savefig("scrapedTableBarChart.png")
    
def rectify_data(data_frame_from_table):
    """
    rectifies the data: putting the comma at the correct position by dividing by 100
    """
    for i in range(10):
        if data_frame_from_table.iloc[:,i].dtype == np.int64:
            data_frame_from_table.iloc[:,i] = data_frame_from_table.iloc[:,i] / 100
    
    return data_frame_from_table

def main():
    data_frame_from_table = parse_html_table(url="https://www.finanzen.net/index/dax/marktkapitalisierung",
                                             selector="class",
                                             table_position=0)
    data_frame_from_diff_sites = parse_html_tables_different_sites(url="https://www.finanzen.net/index/s&p_500/marktkapitalisierung",
                                      selector="class",
                                      table_position=0)
    data_frame_from_table = rectify_data(data_frame_from_table)
    show_results(data_frame_from_table)
    show_results(data_frame_from_diff_sites)

if __name__ == "__main__":
    main()

Unfortunately, the function parse_html_tables_different_sites still returns an empty DataFrame.

@Strawk: Read up again on how functions are called.

Why can the selector be passed in, but not the actual value the selector is supposed to look for? Without the one parameter, the other makes no sense.

I tried to fix both, without success.

Code:

#!/usr/bin/env python
# coding: utf-8 -*-
from __future__ import print_function, division
#-------------------------------------------------------------------------------
# Name:        scrapeHTMLTable04.py
# Purpose:     scrape html-table and show results by:
#              simple print out, Excel-file, bar-chart and image
#
# Author:      Martin Königs
#
# Created:     07/29/2019
# Licence:     n/a
#-------------------------------------------------------------------------------
"""
    scrape html-table and show results
    
    contains functions:
    parse_html_table - parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    show_results - shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
                   and save chart as png-image
    rectify_data - rectifies the data: putting the comma at the correct position by dividing by 100
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def parse_html_table(url, selector, content, table_position):
    """
    parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    """
    df = pd.read_html(url, attrs={selector: content})[table_position]
    
    return df
    
def parse_html_tables_different_sites(url, selector, content, table_position):
    dfds = pd.DataFrame()
    for subsite in range(1, 11):
        dfds.append = pd.read_html(url + "?p=" + str(subsite), attrs={selector: content})[table_position]
    
    return dfds

def show_results(data_frame):
    """
    shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
    and save chart as png-image
    """
    print(data_frame)
    # data_frame_from_table.to_excel("scrapedTable.xlsx")
    # data_frame_from_table.plot.bar(width=1.5)
    # plt.savefig("scrapedTableBarChart.png")

def rectify_data(data_frame_from_table):
    """
    rectifies the data: putting the comma at the correct position by dividing by 100
    """
    for i in range(10):
        if data_frame_from_table.iloc[:,i].dtype == np.int64:
            data_frame_from_table.iloc[:,i] = data_frame_from_table.iloc[:,i] / 100
    
    return data_frame_from_table

def main():
    data_frame_from_table = parse_html_table(url="https://www.finanzen.net/index/dax/marktkapitalisierung",
                                             selector="class",
                                             content = "table",
                                             table_position=0)
    data_frame_from_diff_sites = parse_html_tables_different_sites(url="https://www.finanzen.net/index/s&p_500/marktkapitalisierung",
                                      selector="class",
                                      content = "table",
                                      table_position=0)
    data_frame_from_table = rectify_data(data_frame_from_table)
    show_results(data_frame_from_table)
    show_results(data_frame_from_diff_sites)

if __name__ == "__main__":
    main()

This

Code:

dfds.append = pd.read_html(url + "?p=" + str(subsite), attrs={selector: content})[table_position]

is not a method call.
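
The difference can be shown without pandas. In the sketch below, `Collector` is a made-up class with its own `append` method: assigning to `append` merely rebinds the attribute on that one object, while calling it with parentheses actually runs the method.

```python
class Collector:
    """Made-up container with an append method, used only for illustration."""

    def __init__(self):
        self.items = []

    def append(self, item):
        self.items.append(item)

# Assignment: the name 'append' on this instance now refers to the string.
# The method is never executed, so nothing is collected.
assigned = Collector()
assigned.append = "page 1"

# Call: the method runs and stores the item.
called = Collector()
called.append("page 1")

print(assigned.items, assigned.append)
print(called.items)
```

The same applies to `dfds.append = ...`: it stores the right-hand side under the name `append` and leaves the contents of `dfds` untouched.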

Is this one?

Code:

df_part = pd.read_html(url + "?p=" + str(subsite), attrs={selector: content})[table_position]
dfds.append(df_part)

Just try it.

Hello Sirius3, of course I did that; however, I get the same unsatisfactory result. The first DataFrame is correct, the second one is:

Empty DataFrame
Columns: []
Index: []

Does anything meaningful come back from the website at all?

Yes, with this code:

Code:

def parse_html_tables_different_sites(url, selector, content, table_position):
    """
    parses HTML-table spreaded on different sites
    """
    dfds = pd.DataFrame()
    dfds = pd.read_html(url + "?p=" + str(1), attrs={selector: content})[table_position]
    """
    for subsite in range(1, 11):
        df_part = pd.read_html(url + "?p=" + str(subsite), attrs={selector: content})[table_position]
        # dfds.append = pd.read_html(url + "?p=" + str(subsite), attrs={selector: content})[table_position]
        dfds.append(df_part)
    """
    return dfds

I get this result:

...
0 ...
1 ...
2 ...
3 ...
4 ...
5 ...
6 ...
7 ...
8 ...
9 ...
10 ...
11 ...
12 ...
13 ...
14 ...
15 ...
16 ...
17 ...
18 ...
19 ...
20 ...
21 ...
22 ...
23 ...
24 ...
25 ...
26 ...
27 ...
28 ...
29 ...
30 ...
31 ...
32 ...
33 ...
34 ...
35 ...
36 ...
37 ...
38 ...
39 ...
40 ...
41 ...
42 ...
43 ...
44 ...
45 ...
46 ...
47 ...
48 ...
49 ...

[50 rows x 10 columns]

And something like that is what I expect; the Excel sheet generated from it is correct as well! So the question remains: how do I append the DataFrames to one another?

Now you have commented out the append.
What does the table on page 2 look like, and what does the DataFrame look like then?

With the variable simply set to "2":

Code:

def parse_html_tables_different_sites(url, selector, content, table_position):
    """
    parses HTML-table spreaded on different sites
    """
    dfds = pd.DataFrame()
    dfds = pd.read_html(url + "?p=" + str(2), attrs={selector: content})[table_position]
    """
    for subsite in range(1, 11):
        df_part = pd.read_html(url + "?p=" + str(subsite), attrs={selector: content})[table_position]
        # dfds.append = pd.read_html(url + "?p=" + str(subsite), attrs={selector: content})[table_position]
        dfds.append(df_part)
    """
    
    dfds.to_excel("scrapedTable.xlsx")
    return dfds

Here too (page 2) the result is correct.

Damn, I fell for pandas again. `append` returns a new DataFrame instead of changing the existing one. So `concat` is the better choice after all.
I would solve it like this:

Code:

def parse_html_tables_different_sites(url, selector, content, table_position, pages=10):
    """
    parses an HTML table that is spread across several pages
    and returns all pages combined in one DataFrame
    """
    return pd.concat([
        pd.read_html(f"{url}?p={page}", attrs={selector: content})[table_position]
        for page in range(1, pages+1)
    ], ignore_index=True)
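
The same pattern can be tried offline. In this sketch, `fetch_page` is a hypothetical stand-in for the `pd.read_html(...)[table_position]` call and returns two invented rows per page. It also shows that `pd.concat` returns a new DataFrame rather than modifying its inputs, which is why a result that is not assigned is simply lost. As far as I know, `DataFrame.append` was removed in pandas 2.0, so `concat` is the option that works across versions.

```python
import pandas as pd

def fetch_page(page):
    """Hypothetical stand-in for reading one page of the table."""
    return pd.DataFrame({"page": [page, page],
                         "value": [page * 10, page * 10 + 1]})

# Collect the per-page frames in a list, then combine them once.
parts = [fetch_page(page) for page in range(1, 4)]
combined = pd.concat(parts, ignore_index=True)

# concat does not change its inputs: discarding the result has no effect.
first = parts[0]
pd.concat([first, parts[1]])

print(combined)
print(len(first))
```

`ignore_index=True` renumbers the rows from 0 upwards, so the per-page indices do not repeat in the combined frame.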

Web scraping - creating a DataFrame from a list
