Hello everyone!
The following program already works reasonably well in that it reads the table correctly into a list. However, I would also like to turn that into a DataFrame so that I can visualize it nicely later on. I have spent the last three hours googling for this, without success. I think the problem is that Python cannot know from a single linear list what the columns and rows are supposed to be. Any tips would be appreciated.
# -*- coding: utf-8 -*-
"""
Created on Tue Jul 23 17:13:13 2019
@author: Admin
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


def parse_html_table(url, selector, table_position):
    """
    parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    """
    df = pd.read_html(url, attrs={selector: "table"})[table_position]
    return df


def show_results(data_frame_from_table):
    """
    shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
    and save chart as png-image
    """
    print(data_frame_from_table)
    data_frame_from_table.to_excel("example.xlsx")
    data_frame_from_table.plot.bar(width=1.5)
    plt.savefig("scrapedTableBarChart.png")


def rectify_data(data_frame_from_table):
    """
    rectifies the data by putting the comma at the correct position
    """
    for i in range(10):
        if data_frame_from_table.iloc[:, i].dtype == np.int64:
            data_frame_from_table.iloc[:, i] = data_frame_from_table.iloc[:, i] / 100
    return data_frame_from_table


def main():
    data_frame_from_table = parse_html_table(url="https://www.finanzen.net/index/dax/marktkapitalisierung",
                                             selector="class",
                                             table_position=0)
    data_frame_from_table = rectify_data(data_frame_from_table)
    show_results(data_frame_from_table)


if __name__ == "__main__":
    main()
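On the original question of how a single flat list can become rows and columns: if the number of columns is known, the list can be cut into rows of that length before it is handed to pandas. A minimal sketch, assuming the cells arrive row by row; `list_to_frame`, the column names and the sample values are invented for illustration and do not come from the scraped page.

```python
import pandas as pd


def list_to_frame(flat_values, column_names):
    """Cut a flat, row-ordered list into rows and build a DataFrame from them."""
    width = len(column_names)
    if width == 0 or len(flat_values) % width != 0:
        raise ValueError("list length must be a multiple of the column count")
    # Every slice of `width` consecutive cells becomes one row.
    rows = [flat_values[i:i + width] for i in range(0, len(flat_values), width)]
    return pd.DataFrame(rows, columns=column_names)


# Invented sample cells: name, value, name, value, ...
cells = ["alpha", 1.5, "beta", 2.5, "gamma", 3.5]
frame = list_to_frame(cells, ["name", "value"])
print(frame)
```

The column count is the one piece of information the flat list itself does not carry, so it has to be supplied from outside, here via `column_names`.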
Fortunately, the site works with a PHP GET variable here, which could be iterated over in a loop. But how can I get a single DataFrame out of the 10 tables?
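Looping over the GET variable can be sketched with the standard library alone. This is only an illustration: the helper `build_page_urls` is hypothetical, and the parameter name `p` and the page range 1 to 10 are assumptions taken from the code posted in this thread, not verified against the site.

```python
from urllib.parse import urlencode


def build_page_urls(base_url, first_page=1, last_page=10, param="p"):
    """Return one URL per page by appending the GET parameter to base_url."""
    return [
        base_url + "?" + urlencode({param: page})
        for page in range(first_page, last_page + 1)
    ]


urls = build_page_urls("https://www.finanzen.net/index/s&p_500/marktkapitalisierung")
for url in urls:
    print(url)
```

Each of these URLs could then be passed to `pd.read_html`, giving one table per page.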
#!/usr/bin/env python
# coding: utf-8 -*-
from __future__ import print_function, division
#-------------------------------------------------------------------------------
# Name:        scrapeHTMLTable04.py
# Purpose:     scrape html-table and show results by:
#              simple print out, Excel-file, bar-chart and image
#
# Author:      Martin Königs
#
# Created:     07/29/2019
# Licence:     n/a
#-------------------------------------------------------------------------------
"""
scrape html-table and show results
contains functions:
    parse_html_table - parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    show_results - shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
                   and save chart as png-image
    rectify_data - rectifies the data: putting the comma at the correct position by dividing by 100
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


def parse_html_table(url, selector, table_position):
    """
    parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    """
    df = pd.read_html(url, attrs={selector: "table"})[table_position]
    return df


def parse_html_tables_different_sites(url, selector, table_position):
    dfds = pd.DataFrame()
    for subsite in range(1, 11):
        dfds.append = pd.read_html(url + "?p=" + str(subsite), attrs={selector: "table"})[table_position]
    return dfds


def show_results(data_frame_from_table):
    """
    shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
    and save chart as png-image
    """
    print(data_frame_from_table)
    # data_frame_from_table.to_excel("scrapedTable.xlsx")
    # data_frame_from_table.plot.bar(width=1.5)
    # plt.savefig("scrapedTableBarChart.png")


def rectify_data(data_frame_from_table):
    """
    rectifies the data: putting the comma at the correct position by dividing by 100
    """
    for i in range(10):
        if data_frame_from_table.iloc[:, i].dtype == np.int64:
            data_frame_from_table.iloc[:, i] = data_frame_from_table.iloc[:, i] / 100
    return data_frame_from_table


def main():
    data_frame_from_table = parse_html_table(url="https://www.finanzen.net/index/dax/marktkapitalisierung",
                                             selector="class",
                                             table_position=0)
    data_frame_from_diff_sites = parse_html_tables_different_sites(url="https://www.finanzen.net/index/s&p_500/marktkapitalisierung",
                                                                   selector="class",
                                                                   table_position=0)
    data_frame_from_table = rectify_data(data_frame_from_table)
    show_results(data_frame_from_table)
    show_results(data_frame_from_diff_sites)


if __name__ == "__main__":
    main()
Unfortunately, the function parse_html_tables_different_sites still returns an empty DataFrame.
@Strawk: read up again on how functions are used.
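The hint presumably points at the line `dfds.append = ...`: it assigns the scraped table to an attribute named `append` instead of calling anything, so no rows are ever added. A small sketch with invented sample data (the names `page` and `collected` are hypothetical) that shows the effect:

```python
import warnings

import pandas as pd

# Invented stand-in for one scraped table.
page = pd.DataFrame({"name": ["alpha", "beta"], "value": [1, 2]})

collected = pd.DataFrame()
with warnings.catch_warnings():
    # Some pandas versions warn when a list-like is stored as a new attribute.
    warnings.simplefilter("ignore")
    # Assignment, not a call: this only stores `page` under the attribute
    # name "append" and leaves the rows and columns of `collected` untouched.
    collected.append = page

print(collected.empty)           # the frame itself still holds no data
print(collected.append is page)  # the attribute merely points at `page`
```

A call needs parentheses and, for anything that returns a new object, the result has to be bound to a name; neither happens in the assignment above.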
Why can the selector be passed in, but not the actual content that the selector is supposed to look for? Without the one parameter, the other makes no sense.
#!/usr/bin/env python
# coding: utf-8 -*-
from __future__ import print_function, division
#-------------------------------------------------------------------------------
# Name:        scrapeHTMLTable04.py
# Purpose:     scrape html-table and show results by:
#              simple print out, Excel-file, bar-chart and image
#
# Author:      Martin Königs
#
# Created:     07/29/2019
# Licence:     n/a
#-------------------------------------------------------------------------------
"""
scrape html-table and show results
contains functions:
    parse_html_table - parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    show_results - shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
                   and save chart as png-image
    rectify_data - rectifies the data: putting the comma at the correct position by dividing by 100
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


def parse_html_table(url, selector, content, table_position):
    """
    parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    """
    df = pd.read_html(url, attrs={selector: content})[table_position]
    return df


def parse_html_tables_different_sites(url, selector, content, table_position):
    dfds = pd.DataFrame()
    for subsite in range(1, 11):
        dfds.append = pd.read_html(url + "?p=" + str(subsite), attrs={selector: content})[table_position]
    return dfds


def show_results(data_frame):
    """
    shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
    and save chart as png-image
    """
    print(data_frame)
    # data_frame_from_table.to_excel("scrapedTable.xlsx")
    # data_frame_from_table.plot.bar(width=1.5)
    # plt.savefig("scrapedTableBarChart.png")


def rectify_data(data_frame_from_table):
    """
    rectifies the data: putting the comma at the correct position by dividing by 100
    """
    for i in range(10):
        if data_frame_from_table.iloc[:, i].dtype == np.int64:
            data_frame_from_table.iloc[:, i] = data_frame_from_table.iloc[:, i] / 100
    return data_frame_from_table


def main():
    data_frame_from_table = parse_html_table(url="https://www.finanzen.net/index/dax/marktkapitalisierung",
                                             selector="class",
                                             content="table",
                                             table_position=0)
    data_frame_from_diff_sites = parse_html_tables_different_sites(url="https://www.finanzen.net/index/s&p_500/marktkapitalisierung",
                                                                   selector="class",
                                                                   content="table",
                                                                   table_position=0)
    data_frame_from_table = rectify_data(data_frame_from_table)
    show_results(data_frame_from_table)
    show_results(data_frame_from_diff_sites)


if __name__ == "__main__":
    main()
And something along those lines is what I expect as well; the Excel table generated from it is correct! That leaves the question: how do I append the DataFrames to one another?
Oh damn, pandas tripped me up again. `append` returns a new DataFrame. `concat` is the better choice after all.
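To illustrate the `concat` route: a common pattern is to collect the per-page tables in an ordinary Python list and stack them once at the end. A minimal sketch with invented stand-in tables instead of real scraped data; `combine_pages` and the column names are hypothetical. Note also that `DataFrame.append` was removed in pandas 2.0, so `concat` is the only option on current versions.

```python
import pandas as pd


def combine_pages(frames):
    """Stack per-page DataFrames vertically and renumber the rows from 0."""
    return pd.concat(frames, ignore_index=True)


# Invented stand-ins for three scraped pages with three rows each.
pages = []
for page_number in range(1, 4):
    pages.append(pd.DataFrame({
        "name": ["item_%d_%d" % (page_number, row) for row in range(3)],
        "value": [page_number * 10 + row for row in range(3)],
    }))

combined = combine_pages(pages)
print(combined)
```

In the scraper, the list would be filled inside the page loop with the result of `pd.read_html(...)[table_position]` for each page, and `ignore_index=True` keeps the row labels from restarting at 0 on every page.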
I would then solve it like this: