Web scraping - creating a DataFrame from a list

Strawk · Saturday, 27 July 2019, 16:23

Hello everyone!
The following program already works reasonably well in that it reads the table correctly into a list. However, I would like to turn that into a DataFrame so that I can visualize it nicely later on. I spent the last three hours googling for this, without success. I think the problem is that Python cannot know from a single linear list what the columns and rows are supposed to be. Any tips would be appreciated.

Code:

# -*- coding: utf-8 -*-
"""
Created on Tue Jul 23 17:13:13 2019

@author: Admin
"""

from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome(executable_path=r'C:\Users\Karl Kraft\Documents\System_Dateien\chromedriver.exe')
driver.get('https://www.finanzen.net/index/dax/marktkapitalisierung')
tbl = driver.find_element_by_xpath("//table[@class='table']").get_attribute('outerHTML')
myList = pd.read_html(tbl)
print(myList)
df = pd.DataFrame(myList)
print(df)

Strawk · Saturday, 27 July 2019, 16:46

Code:

# -*- coding: utf-8 -*-
"""
Created on Tue Jul 23 17:13:13 2019

@author: Admin
"""

from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome(executable_path=r'C:\Users\Karl Kraft\Documents\System_Dateien\chromedriver.exe')
driver.get('https://www.finanzen.net/index/dax/marktkapitalisierung')
tbl = driver.find_element_by_xpath("//table[@class='table']").get_attribute('outerHTML')

myList = pd.read_html(tbl)

df = myList[0].dropna(axis=0, thresh=8)

print(df)

print(type(df))

That is already better.

But what do axis and thresh mean?

Sirius3 · Saturday, 27 July 2019, 16:59

I don't understand the problem, since you do get a DataFrame with all the columns.
This can be done much more simply, though:

Code:

url = "https://www.finanzen.net/index/dax/marktkapitalisierung"
df = pd.read_html(url, attrs={'class': "table"})[0]
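
On the question about `axis` and `thresh` in the `dropna` call: `axis=0` drops rows (`axis=1` would drop columns), and `thresh=8` keeps only those rows that have at least 8 non-missing values. A minimal sketch on made-up data, not on the real table; the column names and values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Invented example data: row 1 is completely empty, row 2 lacks one value.
frame = pd.DataFrame({
    "name": ["A", None, "C"],
    "price": [1.0, np.nan, 3.0],
    "volume": [10.0, np.nan, np.nan],
})

# axis=0: look at rows; keep a row only if it has at least 2 non-missing values.
rows_kept = frame.dropna(axis=0, thresh=2)

# axis=1: look at columns; keep a column only if it has at least 2 non-missing values.
cols_kept = frame.dropna(axis=1, thresh=2)

print(rows_kept)
print(cols_kept)
```

With `thresh=8` on the scraped table, separator or advertisement rows that are mostly empty are dropped, while the real data rows survive.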

Strawk · Sunday, 28 July 2019, 13:39

Hello everyone!

By now my code looks like this (many of your tips have gone into it):

Code:

# -*- coding: utf-8 -*-
"""
Created on Tue Jul 23 17:13:13 2019

@author: Admin
"""

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def parse_html_table(url, selector, table_position):
    """
    parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    """
    df = pd.read_html(url, attrs={selector: "table"})[table_position]
    
    return df
    
def show_results(data_frame_from_table):
    """
    shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
    and save chart as png-image
    """
    print(data_frame_from_table)
    data_frame_from_table.to_excel("example.xlsx")
    data_frame_from_table.plot.bar(width=1.5)
    plt.savefig("scrapedTableBarChart.png")
    
def rectify_data(data_frame_from_table):
    """
    rectifies the data by putting the comma at the correct position
    """
    for i in range(10):
        if data_frame_from_table.iloc[:,i].dtype == np.int64:
            data_frame_from_table.iloc[:,i] = data_frame_from_table.iloc[:,i] / 100
    
    return data_frame_from_table

def main():
    data_frame_from_table = parse_html_table(url="https://www.finanzen.net/index/dax/marktkapitalisierung",
                                             selector="class",
                                             table_position=0)
    data_frame_from_table = rectify_data(data_frame_from_table)
    show_results(data_frame_from_table)

if __name__ == "__main__":
    main()

The problem now: I would like to read a table that is spread across several web pages: https://www.finanzen.net/index/s&p_500/ ... ierung?p=1

Fortunately, a PHP GET variable is used here; one could run through it in a loop. But how do I get a single DataFrame out of the 10 tables?

Sirius3 · Sunday, 28 July 2019, 14:14

@Strawk: What makes you think that PHP is used anywhere here?

Have you ever read about DataFrame.append?

Strawk · Sunday, 28 July 2019, 15:52

I meant the website that is being scraped: https://www.finanzen.net/index/s&p_500/ ... ierung?p=5

__blackjack__ · Sunday, 28 July 2019, 16:24

@Strawk: Why should it use PHP? If you mean the ``p=5``: the query part of a URL is not a PHP invention.
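
To illustrate, a small sketch using only Python's standard library. `page_url` is a hypothetical helper name, and the base URL is an assumption about what the shortened link stands for:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed base URL of the paginated table (the link above is shortened).
BASE_URL = "https://www.finanzen.net/index/s&p_500/marktkapitalisierung"


def page_url(base_url, page):
    """Attach the page number as the query component ``?p=<page>``."""
    return f"{base_url}?{urlencode({'p': page})}"


url = page_url(BASE_URL, 5)
parts = urlsplit(url)          # splits scheme, host, path, query, fragment
query = parse_qs(parts.query)  # query string -> dict of value lists

print(url)
print(parts.path, query)
```

The query component is part of the generic URL syntax, so any server-side technology can read it.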

Strawk · Monday, 29 July 2019, 10:32

Code:

#!/usr/bin/env python
# coding: utf-8 -*-
from __future__ import print_function, division
#-------------------------------------------------------------------------------
# Name:        scrapeHTMLTable04.py
# Purpose:     scrape html-table and show results by:
#              simple print out, Excel-file, bar-chart and image
#
# Author:      Martin Königs
#
# Created:     07/29/2019
# Licence:     n/a
#-------------------------------------------------------------------------------
"""
    scrape html-table and show results
    
    contains functions:
    parse_html_table - parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    show_results - shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
                   and save chart as png-image
    rectify_data - rectifies the data: putting the comma at the correct position by dividing by 100
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def parse_html_table(url, selector, table_position):
    """
    parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    """
    df = pd.read_html(url, attrs={selector: "table"})[table_position]
    
    return df
    
def parse_html_tables_different_sites(url, selector, table_position):
    dfds = pd.DataFrame()
    for subsite in range(1, 11):
        dfds.append = pd.read_html(url + "?p=" + str(subsite), attrs={selector: "table"})[table_position]
    
    return dfds

def show_results(data_frame_from_table):
    """
    shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
    and save chart as png-image
    """
    print(data_frame_from_table)
    # data_frame_from_table.to_excel("scrapedTable.xlsx")
    # data_frame_from_table.plot.bar(width=1.5)
    # plt.savefig("scrapedTableBarChart.png")
    
def rectify_data(data_frame_from_table):
    """
    rectifies the data: putting the comma at the correct position by dividing by 100
    """
    for i in range(10):
        if data_frame_from_table.iloc[:,i].dtype == np.int64:
            data_frame_from_table.iloc[:,i] = data_frame_from_table.iloc[:,i] / 100
    
    return data_frame_from_table

def main():
    data_frame_from_table = parse_html_table(url="https://www.finanzen.net/index/dax/marktkapitalisierung",
                                             selector="class",
                                             table_position=0)
    data_frame_from_diff_sites = parse_html_tables_different_sites(url="https://www.finanzen.net/index/s&p_500/marktkapitalisierung",
                                      selector="class",
                                      table_position=0)
    data_frame_from_table = rectify_data(data_frame_from_table)
    show_results(data_frame_from_table)
    show_results(data_frame_from_diff_sites)

if __name__ == "__main__":
    main()

Unfortunately, the function parse_html_tables_different_sites still returns an empty DataFrame.

Sirius3 · Monday, 29 July 2019, 10:44

@Strawk: Read up again on how functions are used.

Why can the selector be passed in, but not the actual content that the selector is supposed to look for? Without the one parameter, the other makes no sense.

Strawk · Monday, 29 July 2019, 10:55

I have tried to fix both, without success.

Strawk · Monday, 29 July 2019, 10:56

Code:

#!/usr/bin/env python
# coding: utf-8 -*-
from __future__ import print_function, division
#-------------------------------------------------------------------------------
# Name:        scrapeHTMLTable04.py
# Purpose:     scrape html-table and show results by:
#              simple print out, Excel-file, bar-chart and image
#
# Author:      Martin Königs
#
# Created:     07/29/2019
# Licence:     n/a
#-------------------------------------------------------------------------------
"""
    scrape html-table and show results
    
    contains functions:
    parse_html_table - parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    show_results - shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
                   and save chart as png-image
    rectify_data - rectifies the data: putting the comma at the correct position by dividing by 100
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def parse_html_table(url, selector, content, table_position):
    """
    parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    """
    df = pd.read_html(url, attrs={selector: content})[table_position]
    
    return df
    
def parse_html_tables_different_sites(url, selector, content, table_position):
    dfds = pd.DataFrame()
    for subsite in range(1, 11):
        dfds.append = pd.read_html(url + "?p=" + str(subsite), attrs={selector: content})[table_position]
    
    return dfds

def show_results(data_frame):
    """
    shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
    and save chart as png-image
    """
    print(data_frame)
    # data_frame_from_table.to_excel("scrapedTable.xlsx")
    # data_frame_from_table.plot.bar(width=1.5)
    # plt.savefig("scrapedTableBarChart.png")

def rectify_data(data_frame_from_table):
    """
    rectifies the data: putting the comma at the correct position by dividing by 100
    """
    for i in range(10):
        if data_frame_from_table.iloc[:,i].dtype == np.int64:
            data_frame_from_table.iloc[:,i] = data_frame_from_table.iloc[:,i] / 100
    
    return data_frame_from_table

def main():
    data_frame_from_table = parse_html_table(url="https://www.finanzen.net/index/dax/marktkapitalisierung",
                                             selector="class",
                                             content = "table",
                                             table_position=0)
    data_frame_from_diff_sites = parse_html_tables_different_sites(url="https://www.finanzen.net/index/s&p_500/marktkapitalisierung",
                                      selector="class",
                                      content = "table",
                                      table_position=0)
    data_frame_from_table = rectify_data(data_frame_from_table)
    show_results(data_frame_from_table)
    show_results(data_frame_from_diff_sites)

if __name__ == "__main__":
    main()

Sirius3 · Monday, 29 July 2019, 12:04

This

Code:

dfds.append = pd.read_html(url + "?p=" + str(subsite), attrs={selector: content})[table_position]

is not a method call.
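
A minimal illustration of the difference, using a made-up `Collector` class rather than pandas itself (how a DataFrame reacts to such an assignment may depend on the pandas version):

```python
class Collector:
    """Tiny hypothetical stand-in for an object with an ``append`` method."""

    def __init__(self):
        self.items = []

    def append(self, item):
        self.items.append(item)


assigned = Collector()
# Assignment: binds the name "append" on this instance to the string.
# The method is never called, so nothing is collected.
assigned.append = "page 1"

called = Collector()
# Method call: the parentheses invoke append with the argument.
called.append("page 1")

print(assigned.items)
print(called.items)
```

An `=` after the attribute name replaces whatever the name referred to; only parentheses with arguments actually run the method.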

Strawk · Monday, 29 July 2019, 12:24

Is this one?

Code:

df_part = pd.read_html(url + "?p=" + str(subsite), attrs={selector: content})[table_position]
dfds.append(df_part)

Sirius3 · Monday, 29 July 2019, 12:43

Just try it out.

Strawk · Monday, 29 July 2019, 12:49

Hello Sirius3, of course I did that; however, I get the same unsatisfactory result: the first DataFrame is correct, the second one is:

Empty DataFrame
Columns: []
Index: []

Sirius3 · Monday, 29 July 2019, 13:14

Does the website actually return anything sensible?

Strawk · Monday, 29 July 2019, 14:56

Yes, with this code:

Code:

def parse_html_tables_different_sites(url, selector, content, table_position):
    """
    parses an HTML table spread across several pages
    """
    dfds = pd.DataFrame()
    dfds = pd.read_html(url + "?p=" + str(1), attrs={selector: content})[table_position]
    """
    for subsite in range(1, 11):
        df_part = pd.read_html(url + "?p=" + str(subsite), attrs={selector: content})[table_position]
        # dfds.append = pd.read_html(url + "?p=" + str(subsite), attrs={selector: content})[table_position]
        dfds.append(df_part)
    """
    return dfds

I get this result:

...
0 ...
1 ...
.. ...
48 ...
49 ...

[50 rows x 10 columns]

And something like that is what I expect; the Excel file generated from it is correct as well! So the remaining question is: how do I join the DataFrames together?

Sirius3 · Monday, 29 July 2019, 15:40

Now you have commented out the append.
What does the table on page 2 look like, and what does the DataFrame look like then?

Strawk · Monday, 29 July 2019, 18:22

With the variable simply set to "2"

Code:

def parse_html_tables_different_sites(url, selector, content, table_position):
    """
    parses an HTML table spread across several pages
    """
    dfds = pd.DataFrame()
    dfds = pd.read_html(url + "?p=" + str(2), attrs={selector: content})[table_position]
    """
    for subsite in range(1, 11):
        df_part = pd.read_html(url + "?p=" + str(subsite), attrs={selector: content})[table_position]
        # dfds.append = pd.read_html(url + "?p=" + str(subsite), attrs={selector: content})[table_position]
        dfds.append(df_part)
    """
    
    dfds.to_excel("scrapedTable.xlsx")
    return dfds

here too (page 2) the result is correct.

Sirius3 · Monday, 29 July 2019, 18:43

Oh damn, I fell for pandas again. `append` returns a new DataFrame instead of changing the existing one. `concat` is the better choice after all.
I would solve it like this:

Code:

def parse_html_tables_different_sites(url, selector, content, table_position, pages=10):
    """
    parses an HTML table spread across several pages
    """
    return pd.concat([
        pd.read_html(f"{url}?p={page}", attrs={selector: content})[table_position]
        for page in range(1, pages+1)
    ], ignore_index=True)
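
The reason the earlier loop stayed empty: `append` never changed `dfds` in place, it returned a new DataFrame that was thrown away. `DataFrame.append` was also removed in pandas 2.0, so `concat` is the approach that still works on current versions. A small sketch of the same pattern without network access; `concat_pages` and `fake_page` are hypothetical stand-ins for the `read_html` call:

```python
import pandas as pd


def concat_pages(fetch_page, pages):
    """Fetch pages 1..pages and stack them into one DataFrame."""
    return pd.concat(
        [fetch_page(page) for page in range(1, pages + 1)],
        ignore_index=True,  # renumber rows 0..n-1 instead of repeating 0, 1 per page
    )


def fake_page(page):
    """Hypothetical replacement for pd.read_html(...)[0]: two rows per page."""
    return pd.DataFrame({"page": [page, page], "row": [0, 1]})


combined = concat_pages(fake_page, pages=3)
print(combined)
```

Collecting all partial frames in a list and concatenating once also avoids copying the growing frame on every iteration.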