Web scraping - creating a DataFrame from a list

Strawk · Monday, 29 July 2019, 19:01

Works.

Strawk · Tuesday, 30 July 2019, 14:46

Hello!
The code now looks like this:

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#-------------------------------------------------------------------------------
# Name:        scrapeHTMLTables_diffSites.py
# Purpose:     scrape html-table and show results
#
# Author:      Xxxxxx Xxxxxx
#
# Created:     07/29/2019
# Licence:     n/a
#-------------------------------------------------------------------------------
"""
scrape html-table and show results by:
simple print out, Excel-file, bar-chart and image

contains functions:
parse_html_table - parses HTML-table using the URL, the selector and the position of the table in the HTML-document
parse_html_tables_different_sites - parses HTML-table spread on different pages
show_results - shows the results by simple print, converting to Microsoft Excel, creating a bar-chart
                   and save chart as png-image
rectify_data - rectifies the data: putting the comma at the correct position by dividing by 100
"""
from __future__ import print_function, division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def parse_html_table(url, selector, content, table_position):
    """
    parses HTML-table using the URL, the selector and the position of the table in the HTML-document
    
    In:
    URL of page to be scraped
    html-selector of element to be scraped
    html-content of element to be scraped
    table-position is number of position of table in html-document
    
    Out:
    dataframe containing former html-table
    """
    df = pd.read_html(url, attrs={selector: content})[table_position]
    
    return df
    
def parse_html_tables_different_sites(url, selector, content, table_position, pages=10):
    """
    parses HTML-table spread over different pages
    In:
    URL of pages to be scraped
    html-selector of element to be scraped
    html-content of element to be scraped
    table-position is number of position of table in html-document
    number of pages
    
    Out:
    dataframe containing former html-tables
    """
    return pd.concat([
        pd.read_html(f"{url}?p={page}", attrs={selector: content})[table_position]
        for page in range(1, pages+1)
    ], ignore_index=True)

def show_results(data_frame_one_page,
                 data_frame_diff_pages,
                 output_path,
                 output_name_xlsx_one_page,
                 output_name_xlsx_diff_pages,
                 output_name_img_one_page,
                 output_name_img_diff_pages):
    """
    shows the results of both tables by simple print, converting to Microsoft Excel, creating a bar-chart
    and save chart as png-image
    
    In:
    dataframe containing former html-table
    dataframe containing former html-tables
    path for output-files
    name and format of Microsoft-Excel-file for one table
    name and format of Microsoft-Excel-file for several tables
    name and format of image-file for one table
    name and format of image-file for several tables
    
    Out:
    no return value
    """
    print(data_frame_one_page)
    print(data_frame_diff_pages)
    data_frame_one_page.to_excel(output_path + output_name_xlsx_one_page)
    data_frame_diff_pages.to_excel(output_path + output_name_xlsx_diff_pages)
    data_frame_one_page.plot.bar(width=1.5)
    plt.savefig(output_path + output_name_img_one_page)
    data_frame_diff_pages.plot.bar(width=1.5)
    plt.savefig(output_path + output_name_img_diff_pages)

def rectify_data(data_frame):
    """
    rectifies the data: putting the comma at the correct position by dividing by 100
    
    In:
    dataframe containing table or tables
    
    Out:
    dataframe containing table or tables with correct comma-positions
    """
    for i in range(data_frame.shape[1]):  # all columns, not a fixed count of 10
        if data_frame.iloc[:,i].dtype == np.int64:
            data_frame.iloc[:,i] = data_frame.iloc[:,i] / 100
    
    return data_frame

def main():
    data_frame_one_page = parse_html_table(
        url="https://www.finanzen.net/index/dax/marktkapitalisierung",
        selector="class",
        content="table",
        table_position=0)
    data_frame_diff_pages = parse_html_tables_different_sites(
        url="https://www.finanzen.net/index/s&p_500/marktkapitalisierung",
        selector="class",
        content="table",
        table_position=0)
    data_frame_one_page = rectify_data(data_frame_one_page)
    data_frame_diff_pages = rectify_data(data_frame_diff_pages)
    show_results(data_frame_one_page,
                 data_frame_diff_pages,
                 output_path="output/",
                 output_name_xlsx_one_page="scrapedTable.xlsx",
                 output_name_xlsx_diff_pages="scrapedTableDiffPages.xlsx",
                 output_name_img_one_page="scrapedTable.png",
                 output_name_img_diff_pages="scrapedTableDiffPages.png")
    

if __name__ == "__main__":
    main()

It is now supposed to be put into a class structure. That is where I'm stuck. This is the code I have written for it so far:

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# main class
class scrapeHTMLTablesFromWwwFinanzenNet():
    def __init__(self, url, selector, content, table_position, pages, output_path, file_name_excel, file_name_img):
        self.datafame = pd.concat([
            pd.read_html(f"{url}?p={page}", attrs={selector: content})[table_position]
            for page in range(1, pages+1)
        ], ignore_index=True)
    
    def show_result_print_out(self):
        print(self.datafame)
        
    def show_result_excel_file(self):
        self.datafame.to_excel(self.)

How can I now access the contents of the variable "output_path"? Also with "self."?
Regards
Strawk

Strawk · Tuesday, 30 July 2019, 16:18

That's settled for now:

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#-------------------------------------------------------------------------------
# Name:        scrapeHTMLTables_diffSites.py
# Purpose:     scrape html-table and show results
#
# Author:      Xxxxxx Xxxxxx
#
# Created:     07/29/2019
# Licence:     n/a
#-------------------------------------------------------------------------------
"""
scrape html-tables spread over several pages and show results by:
simple print out, Excel-file, bar-chart and image

contains class:
scrapeHTMLTablesFromWwwFinanzenNet - reads the HTML-table from all pages into one dataframe
    show_result_print_out - prints the dataframe
    show_result_excel_file - writes the dataframe to a Microsoft-Excel-file
    show_result_plot_barchart - creates a bar-chart of the dataframe
    show_result_img_file - saves the current chart as image-file
"""
from __future__ import print_function, division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# main class
class scrapeHTMLTablesFromWwwFinanzenNet():
    def __init__(self, url, selector, content, table_position, pages, output_path, file_name_excel, file_name_img):
        self.datafame = pd.concat([
            pd.read_html(f"{url}?p={page}", attrs={selector: content})[table_position]
            for page in range(1, pages+1)
        ], ignore_index=True, sort=False)
        self.url = url
        self.selector = selector
        self.content = content
        self.table_position = table_position
        self.pages = pages
        self.output_path = output_path
        self.file_name_excel = file_name_excel
        self.file_name_img = file_name_img
    
    def show_result_print_out(self):
        print(self.datafame)
        
    def show_result_excel_file(self):
        self.datafame.to_excel(self.output_path + self.file_name_excel)
        
    def show_result_plot_barchart(self):
        self.datafame.plot.bar(width=1.5)
        
    def show_result_img_file(self):
        plt.savefig(self.output_path + self.file_name_img) 
    
def main():
    table = scrapeHTMLTablesFromWwwFinanzenNet(url="https://www.finanzen.net/index/s&p_500/marktkapitalisierung",
                                               selector="class",
                                               content="table",
                                               table_position=0,
                                               pages=10,
                                               output_path="output/",
                                               file_name_excel="scrapedTable.xlsx",
                                               file_name_img="scrapedTable.png")
    
    table.show_result_print_out()
    table.show_result_excel_file()
    table.show_result_plot_barchart()
    table.show_result_img_file()

if __name__ == "__main__":
    main()

__blackjack__ · Tuesday, 30 July 2019, 17:10

@Strawk: Path components are not joined with ``+`` but with `os.path.join()`. If you then also drop the '/' from 'output/', it is platform-independent as well.
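
As a small illustration of the difference (only a sketch; the variable names mirror the constructor arguments from the post above and the values are the ones used there):

```python
import os

output_path = "output"                 # no trailing slash needed
file_name_excel = "scrapedTable.xlsx"

# os.path.join() inserts the separator of the current platform
joined = os.path.join(output_path, file_name_excel)

# plain concatenation silently glues the two names together
concatenated = output_path + file_name_excel

print(joined)
print(concatenated)  # outputscrapedTable.xlsx
```

The same `os.path.join()` call can be used for the image file name.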

The class doesn't seem sensible to me, though. The name is wrong to begin with, both in spelling (class names start with an uppercase letter) and in content. It describes an activity, i.e. a function or method, and not a "thing".

Does it make sense to bind all the arguments to the object?

The methods all start with `show_result_*()`, but only one actually "shows" something. The others save files in various formats.

`show_result_img_file()` seems to be incomplete. You would first have to plot something in order to save it. And inside a method you wouldn't use the global `plt.savefig()`, but the `savefig()` method on a `Figure` object.
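
A minimal sketch of plotting and saving in one step via the `Figure` object. The function name `save_bar_chart`, the sample data, the temporary directory and the "Agg" backend are assumptions made for this illustration; none of them come from the code above.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display required
import matplotlib.pyplot as plt
import pandas as pd


def save_bar_chart(data_frame, file_path):
    """Plot data_frame as a bar chart and save that chart to file_path."""
    axes = data_frame.plot.bar()  # returns the Axes the bars were drawn on
    figure = axes.get_figure()    # the Figure that owns these Axes
    figure.savefig(file_path)     # saves exactly this figure, not "the current one"
    plt.close(figure)             # release the figure once it is written
    return file_path


# Hypothetical sample data standing in for the scraped table
sample = pd.DataFrame({"Value": [1.5, 2.5]}, index=["A", "B"])
output_dir = tempfile.mkdtemp()
image_path = save_bar_chart(sample, os.path.join(output_dir, "scrapedTable.png"))
```

Because plotting and saving happen in the same function, the saved file does not depend on which figure happens to be "current" at the time.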

Strawk · Wednesday, 31 July 2019, 07:08

Good morning, blackjack!
os.path.join() works so far, but leaving out the slash results in names like "outputscrapedTable.xlsx" and "outputscrapedTable.png". If I keep the slash, the files end up in the "output" folder under the desired names. Did you make a mistake or explain it wrongly? Or did I misunderstand it?
Regards
Strawk

P.S.: Nonsense! It works just as you said. Sorry for the wrong post.

Strawk · Wednesday, 31 July 2019, 07:16

__blackjack__ wrote: Tuesday, 30 July 2019, 17:10 Does it make sense to bind all the arguments to the object?

I can't answer that question. From the way it is phrased I can only guess that my approach is questionable. Where else should the arguments go?
Regards, Strawk

Strawk · Wednesday, 31 July 2019, 07:37

None of this works (1 h of trial and error):

Code:

# plt.savefig(os.path.join(self.output_path, self.file_name_img))
# self.table_barchart.savefig(fname=os.path.join(self.output_path, self.file_name_img))
# fig_path = os.path.join(self.output_path, self.file_name_img)
# plt.savefig(self.table_barchart, fig_path)
# plt.savefig(os.path.join(self.output_path, self.file_name_img))

__blackjack__ · Wednesday, 31 July 2019, 08:21

@Strawk: The arguments shouldn't go anywhere; the class itself is questionable to begin with. Why is this in a class at all, why not simply a few functions that do something with a `DataFrame` object?
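
A rough sketch of that idea, not a drop-in replacement: two plain functions that each take a `DataFrame`. `rectify_data` is adapted from the first version in this thread, but here it returns a copy and uses `pd.api.types.is_integer_dtype` instead of comparing against `np.int64`. `save_excel` and the sample data are hypothetical names and values chosen for illustration.

```python
import os

import pandas as pd


def rectify_data(data_frame):
    """Return a copy in which every integer column is divided by 100."""
    result = data_frame.copy()
    for column in result.columns:
        if pd.api.types.is_integer_dtype(result[column]):
            result[column] = result[column] / 100
    return result


def save_excel(data_frame, output_path, file_name):
    """Write data_frame to an Excel file (needs an Excel writer such as openpyxl)."""
    data_frame.to_excel(os.path.join(output_path, file_name))


# Hypothetical sample data standing in for the scraped table
sample = pd.DataFrame({"Name": ["A", "B"], "Value": [12345, 6789]})
rectified = rectify_data(sample)
print(rectified)
```

Each function gets exactly the arguments it needs, so nothing has to be stored on an object beforehand.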

Strawk · Wednesday, 31 July 2019, 08:39

Because my teacher wants it that way. Please answer my post from 8:37. Thanks.

Sirius3 · Wednesday, 31 July 2019, 09:22

Feel free to send your teacher over here, then we can explain to him what is bad about the class.
"Trial and error" is a bad programmer. What exactly are you trying, and what exactly doesn't work?