String operation

Strawk · Tuesday, 13 August 2019, 12:27

Hello!

How can I extract from this:
<a hre... ... ... ...ap.org/?mlon=24.0000&mlat=56.0000&zoom=6' target='_blank'><img src='img/flags/lt.gif' class='flag' width='18px' title='//LT' alt='LT' border='0'></a>

the values for mlon and mlat?

Regards
Strawk

ThomasL · Tuesday, 13 August 2019, 12:33

I'd say with regular expressions, for example. This site is very helpful for that: https://regexr.com

__blackjack__ · Tuesday, 13 August 2019, 14:25

Well, first you would use an HTML parser to get the content of the `href` attribute, and then `urllib` & co. provide the functions needed to parse the query string of the URL and turn it into key/value pairs.

__blackjack__ · Tuesday, 13 August 2019, 14:52

Code:

#!/usr/bin/env python3
from urllib.parse import urlparse, parse_qsl

import bs4


HTML = (
    "<a href='... ... ... ...ap.org/?mlon=24.0000&mlat=56.0000&zoom=6'"
    " target='_blank'><img src='img/flags/lt.gif' class='flag' width='18px'"
    " title='//LT' alt='LT' border='0'></a>"
)


def main():
    document = bs4.BeautifulSoup(HTML, 'html.parser')
    query = dict(parse_qsl(urlparse(document.a['href']).query))
    latitude = float(query['mlat'])
    longitude = float(query['mlon'])
    print(latitude, longitude)


if __name__ == '__main__':
    main()

Strawk · Thursday, 15 August 2019, 09:15

Hello!
I followed the earlier tip and used a regular expression. That worked at first. Now I also wanted to use groups in the regex, and with those it no longer works. String:

Code:

<a href='http://www.openstreetmap.org/?mlon=2.3387&mlat=48.8582&zoom=6' target='_blank'><img src='img/flags/fr.gif' class='flag' width='18px' title='//FR' alt='FR' border='0'></a>

Regular expression:

Code:

result = re.search(r'mlon=([-]?[\d]{1,3}.[\d]{4,8})&mlat=([-]?[\d]{1,3}.[\d]{4,8})', my_string)

            try:
                result_lon = str(result.group(1))
                result_lat = str(result.group(2))
            except:
                pass
            
            res.append([result_lon, result_lat])

Result:

UnboundLocalError: local variable 'result_lon' referenced before assignment

sparrow · Thursday, 15 August 2019, 09:47

Why don't you use __blackjack__'s beautiful, working solution?
And bare excepts (which on top of that don't handle anything) are a no-go, because you never get to see the errors. If you can't handle the error sensibly in the program, you can drop the try block altogether.
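
The traceback above presumably comes about like this: `re.search` finds no match and returns `None`, `result.group(1)` raises an `AttributeError`, the bare `except` swallows it, and `result_lon` is never assigned. A minimal sketch of handling the no-match case explicitly follows. The name `extract_lon_lat` and the optional `amp;` in the pattern are assumptions for illustration, not part of the original code:

```python
import re

# The dots are escaped so they only match a literal ".", and "&amp;" is
# accepted as well as "&" because serialised HTML escapes the ampersand.
COORDINATES_RE = re.compile(
    r"mlon=(-?\d{1,3}\.\d{4,8})&(?:amp;)?mlat=(-?\d{1,3}\.\d{4,8})"
)


def extract_lon_lat(text):
    """Return (longitude, latitude) as floats, or None if text has no match."""
    match = COORDINATES_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


if __name__ == "__main__":
    sample = (
        "<a href='http://www.openstreetmap.org/"
        "?mlon=2.3387&mlat=48.8582&zoom=6' target='_blank'></a>"
    )
    print(extract_lon_lat(sample))
    print(extract_lon_lat("<a href='#'>no coordinates</a>"))
```

The caller can then decide what to do with `None`, for example skip the row, instead of appending undefined or stale values.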

ThomasL · Thursday, 15 August 2019, 12:23

I'm honoured that you followed my idea, but honestly, I find __blackjack__'s example code very elegant.

noisefloor · Thursday, 15 August 2019, 13:41

Hello,

"I followed the earlier tip and used a regular expression."

Searching HTML with regexes works, but it is fairly complex and therefore error-prone. That is why HTML parsers exist; they do this much better and more easily.
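
As a rough illustration of the parser route without third-party packages, the sketch below uses the standard library's `html.parser` instead of BeautifulSoup. The names `HrefCollector` and `coordinates_from_html` are made up for this example. The parser hands over attribute values already unescaped, so `&amp;` in the markup arrives as `&` and `parse_qs` can split the query string directly:

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def coordinates_from_html(html):
    """Return (longitude, latitude) tuples found in link query strings."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    coordinates = []
    for href in collector.hrefs:
        query = parse_qs(urlparse(href).query)
        # Links without both parameters are skipped rather than guessed at.
        if "mlon" in query and "mlat" in query:
            coordinates.append(
                (float(query["mlon"][0]), float(query["mlat"][0]))
            )
    return coordinates


SAMPLE = (
    "<a href='http://www.openstreetmap.org/"
    "?mlon=2.3387&amp;mlat=48.8582&amp;zoom=6'"
    " target='_blank'><img src='img/flags/fr.gif' alt='FR'></a>"
)

if __name__ == "__main__":
    print(coordinates_from_html(SAMPLE))
```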

Regards, noisefloor

Strawk · Thursday, 15 August 2019, 14:47

Right now I'm concerned with a different aspect of this: the code

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function, division
#-------------------------------------------------------------------------------
# Name:        scrapeTorSiteLocations.py
# Purpose:     scrape html-table from https://torstatus.blutmagie.de/
#              and retrieve locations (longitude, latitude)
#
# Author:      Xxxxxx Xxxxxx
#
# Created:     08/13/2019
# Licence:     n/a
#-------------------------------------------------------------------------------
"""
scrape html-table from https://torstatus.blutmagie.de/ and retrieve locations (longitude, latitude)
    
contains classes:
    1 main class that contains functions:
        constructor (__init__) - parses HTML-table, retrieves locations (longitude, latitude)
          
        public functions:
        to_excel (self explaining)
"""
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re
import geo_mk
import numpy as np
import time

# main class
class TorSiteLocations():
    def __init__(self, url):
        """
        constructor (__init__) - parses HTML-table
        retrieves locations (longitude, latitude)
        
        In:
        URL of page to be scraped
        """
        
        # site = requests.get(url)
        # soup = BeautifulSoup(site.content, 'html.parser')
        with open(url) as f:
            site = f.read()
        soup = BeautifulSoup(site, 'html.parser')
        table = soup.find('table', attrs={'class':'displayTable'})
        table_rows = table.find_all('tr', {'class':'r'})

        res = []
                
        for counter, tr in enumerate(table_rows):
            
            anchor = tr.findNext('td',{'class':'TRr'}).findAll('a')
            my_string = str(anchor)
            
            result = re.search(r'mlon=([-]?[\d]{1,3}.[\d]{4,8})&amp;mlat=([-]?[\d]{1,3}.[\d]{4,8})', my_string)

            try:
                result_lon = result.group(1)
                result_lat = result.group(2)
            except:
                pass
            
            res.append([result_lon, result_lat])
        
            # if counter == 300:
                # break
        print("\n\nLoop ran %10.0f times" % counter)
        self.dataframe = pd.DataFrame(res, columns=["longitude", "latitude"])
        
    # public methods
    def to_excel(self, outfile):
        """
        (self explaining)
        
        In:
        outfile - path and name of Microsoft Excel-file
        """
        self.dataframe.to_excel(outfile)
        
def main():
    tbegin = time.clock()
    
    table = TorSiteLocations(url="data/Tor.htm")
    table.to_excel(outfile="results/scrapedLocations.xlsx")
    
    lons = np.array(table.dataframe['longitude'].values, dtype=float)
    lats = np.array(table.dataframe['latitude'].values, dtype=float)
    geo_mk.geo_visualization.plot_points_on_map(lats, lons)
    
    computation_time = time.clock()-tbegin
    print("\n\nExecuting the script took %6.3f seconds" % (computation_time))
        
if __name__ == "__main__":
    main()

takes 107 seconds on my machine. On an acquaintance's slower machine it runs in 5 seconds.
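
When a run is unexpectedly slow, it can help to sort the profile by cumulative time so that the expensive call chain shows up first. A minimal sketch with the standard library's `cProfile` and `pstats` follows; `profile_top` and `busy_work` are hypothetical names, with `busy_work` standing in for the script's `main()`:

```python
import cProfile
import io
import pstats


def profile_top(func, limit=10):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # "cumulative" sorts by time spent in a function including its callees.
    stats.sort_stats("cumulative").print_stats(limit)
    return stream.getvalue()


def busy_work():
    # Hypothetical stand-in for the real main() of the script.
    return sum(i * i for i in range(100_000))


if __name__ == "__main__":
    print(profile_top(busy_work))
```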

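The profiling output looks like this (cProfile columns: ncalls, tottime, percall, cumtime, percall, filename:lineno(function)):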

1 0.000 0.000 0.000 0.000 internals.py:3265(__init__)
1 0.000 0.000 0.000 0.000 internals.py:3266(<listcomp>)
3 0.000 0.000 0.000 0.000 internals.py:3307(shape)
9 0.000 0.000 0.000 0.000 internals.py:3309(<genexpr>)
9 0.000 0.000 0.000 0.000 internals.py:3311(ndim)
1 0.000 0.000 0.000 0.000 internals.py:3363(_rebuild_blknos_and_blklocs)
7 0.000 0.000 0.000 0.000 internals.py:3384(_get_items)
2 0.000 0.000 0.000 0.000 internals.py:3473(__len__)
1 0.000 0.000 0.000 0.000 internals.py:348(shape)
1 0.000 0.000 0.000 0.000 internals.py:3488(_verify_integrity)
2 0.000 0.000 0.000 0.000 internals.py:3490(<genexpr>)
1 0.000 0.000 0.000 0.000 internals.py:352(dtype)
1 0.000 0.000 0.000 0.000 internals.py:356(ftype)
4 0.000 0.000 0.000 0.000 internals.py:372(iget)
1 0.000 0.000 0.000 0.000 internals.py:3776(is_consolidated)
1 0.000 0.000 0.000 0.000 internals.py:3784(_consolidate_check)
1 0.000 0.000 0.000 0.000 internals.py:3785(<listcomp>)
1 0.000 0.000 0.000 0.000 internals.py:4101(_consolidate_inplace)
2 0.000 0.000 0.000 0.000 internals.py:4108(get)
4 0.000 0.000 0.000 0.000 internals.py:4137(iget)
4 0.000 0.000 0.000 0.000 internals.py:4639(__init__)
8 0.000 0.000 0.000 0.000 internals.py:4684(_block)
2 0.000 0.000 0.000 0.000 internals.py:4742(external_values)
6 0.000 0.000 0.000 0.000 internals.py:4745(internal_values)
1 0.000 0.000 0.001 0.001 internals.py:4869(create_block_manager_from_arrays)
1 0.000 0.000 0.000 0.000 internals.py:4880(form_blocks)
1 0.000 0.000 0.000 0.000 internals.py:4972(_simple_blockify)
1 0.000 0.000 0.000 0.000 internals.py:5017(_stack_arrays)
2 0.000 0.000 0.000 0.000 internals.py:5020(_asarray_compat)
1 0.000 0.000 0.000 0.000 internals.py:5026(_shape_compat)
6 0.000 0.000 0.000 0.000 iostream.py:195(schedule)
4 0.000 0.000 0.000 0.000 iostream.py:300(_is_master_process)
4 0.000 0.000 0.000 0.000 iostream.py:313(_schedule_flush)
4 0.000 0.000 0.000 0.000 iostream.py:366(write)
6 0.000 0.000 0.000 0.000 iostream.py:93(_event_pipe)
19443 0.017 0.000 0.029 0.000 lexer.py:237(__new__)
6481 0.014 0.000 0.798 0.000 lexer.py:303(__init__)
12962 0.010 0.000 0.010 0.000 lexer.py:315(__bool__)
12962 0.021 0.000 0.837 0.000 lexer.py:349(__next__)
6481 0.007 0.000 0.017 0.000 lexer.py:364(close)
6481 0.009 0.000 0.024 0.000 lexer.py:391(get_lexer)
6481 0.004 0.000 0.041 0.000 lexer.py:548(_normalize_newlines)
6481 0.008 0.000 0.806 0.000 lexer.py:552(tokenize)
12962 0.021 0.000 0.793 0.000 lexer.py:558(wrap)
12962 0.041 0.000 0.723 0.000 lexer.py:599(tokeniter)
2 0.000 0.000 0.000 0.000 loaders.py:241(uptodate)
1 0.000 0.000 0.004 0.004 loaders.py:250(list_templates)
12/1 0.000 0.000 0.004 0.004 loaders.py:258(_walk)
6472 0.014 0.000 0.208 0.000 map.py:254(__init__)
2 0.000 0.000 0.000 0.000 map.py:38(__init__)
1 0.000 0.000 0.012 0.012 map.py:454(__init__)
1 0.000 0.000 0.000 0.000 map.py:465(<dictcomp>)
1 0.000 0.000 0.000 0.000 marker_cluster.py:63(__init__)
1 0.000 0.000 4.368 4.368 marker_cluster.py:91(render)
19420 0.008 0.000 0.015 0.000 missing.py:112(_isna_new)
19420 0.006 0.000 0.021 0.000 missing.py:32(isna)
19443 0.051 0.000 0.072 0.000 nodes.py:127(__init__)
246278 0.074 0.000 0.092 0.000 nodes.py:148(iter_fields)
200911 0.513 0.000 0.651 0.000 nodes.py:164(iter_child_nodes)
6481 0.005 0.000 0.079 0.000 nodes.py:177(find)
58329/19443 0.056 0.000 0.191 0.000 nodes.py:184(find_all)
6481 0.022 0.000 0.086 0.000 nodes.py:219(set_environment)
6481 0.005 0.000 0.006 0.000 nodes.py:519(as_const)
12982 0.010 0.000 0.011 0.000 nodes.py:81(__init__)
6481 0.001 0.000 0.001 0.000 nodes.py:97(get_eval_context)
248 0.001 0.000 0.001 0.000 ntpath.py:122(splitdrive)
2 0.000 0.000 0.000 0.000 ntpath.py:223(splitext)
21 0.000 0.000 0.001 0.000 ntpath.py:472(normpath)
11 0.000 0.000 0.000 0.000 ntpath.py:539(abspath)
56 0.000 0.000 0.001 0.000 ntpath.py:75(join)
4 0.000 0.000 0.000 0.000 numeric.py:110(is_all_dates)
4 0.000 0.000 0.000 0.000 numeric.py:424(asarray)
2 0.000 0.000 0.000 0.000 numeric.py:495(asanyarray)
1 0.000 0.000 0.000 0.000 numeric.py:621(require)
2 0.000 0.000 0.000 0.000 numeric.py:692(<genexpr>)
6481 0.002 0.000 0.002 0.000 optimizer.py:32(__init__)
1 0.000 0.000 0.000 0.000 packager.py:106(_set_tmpdir)
1 0.000 0.000 0.000 0.000 packager.py:110(_set_in_memory)
1 0.000 0.000 0.000 0.000 packager.py:114(_add_workbook)
1 0.000 0.000 0.505 0.505 packager.py:129(_create_package)
10 0.000 0.000 0.162 0.016 packager.py:156(_filename)
1 0.000 0.000 0.019 0.019 packager.py:169(_write_workbook_file)
1 0.000 0.000 0.276 0.276 packager.py:176(_write_worksheet_files)
1 0.000 0.000 0.000 0.000 packager.py:192(_write_chartsheet_files)
1 0.000 0.000 0.000 0.000 packager.py:204(_write_chart_files)
1 0.000 0.000 0.000 0.000 packager.py:222(_write_drawing_files)
1 0.000 0.000 0.000 0.000 packager.py:234(_write_vml_files)
1 0.000 0.000 0.000 0.000 packager.py:264(_write_comment_files)
1 0.000 0.000 0.067 0.067 packager.py:277(_write_shared_strings_file)
1 0.000 0.000 0.013 0.013 packager.py:288(_write_app_file)
1 0.000 0.000 0.017 0.017 packager.py:324(_write_core_file)
1 0.000 0.000 0.000 0.000 packager.py:333(_write_custom_file)
1 0.000 0.000 0.030 0.030 packager.py:345(_write_content_types_file)
1 0.000 0.000 0.021 0.021 packager.py:390(_write_styles_file)
1 0.000 0.000 0.026 0.026 packager.py:415(_write_theme_file)
1 0.000 0.000 0.000 0.000 packager.py:422(_write_table_files)
1 0.000 0.000 0.016 0.016 packager.py:440(_write_root_rels_file)
1 0.000 0.000 0.018 0.018 packager.py:460(_write_workbook_rels_file)
1 0.000 0.000 0.000 0.000 packager.py:496(_write_worksheet_rels_files)
1 0.000 0.000 0.000 0.000 packager.py:526(_write_chartsheet_rels_files)
1 0.000 0.000 0.000 0.000 packager.py:552(_write_drawing_rels_files)
1 0.000 0.000 0.000 0.000 packager.py:587(_add_image_files)
1 0.000 0.000 0.000 0.000 packager.py:632(_add_vba_project)
1 0.000 0.000 0.000 0.000 packager.py:79(__init__)
1 0.000 0.000 114.037 114.037 parser.py:104(feed)
199911 0.075 0.000 0.075 0.000 parser.py:127(clear_cdata_mode)
1 1.619 1.619 114.037 114.037 parser.py:134(goahead)
1 0.000 0.000 0.000 0.000 parser.py:256(parse_html_declaration)
240516 1.909 0.000 10.220 0.000 parser.py:301(parse_starttag)
240516 0.290 0.000 0.949 0.000 parser.py:352(check_for_whole_start_tag)
6481 0.015 0.000 0.923 0.000 parser.py:37(__init__)
199911 0.702 0.000 100.007 0.001 parser.py:386(parse_endtag)
6481 0.033 0.000 0.189 0.000 parser.py:851(subparse)
6481 0.017 0.000 0.042 0.000 parser.py:859(flush_data)
1 0.000 0.000 0.000 0.000 parser.py:87(__init__)
6481 0.018 0.000 0.313 0.000 parser.py:899(parse)
1 0.000 0.000 0.000 0.000 parser.py:96(reset)
80 0.000 0.000 0.000 0.000 random.py:223(_randbelow)
80 0.000 0.000 0.000 0.000 random.py:253(choice)
1 0.000 0.000 0.000 0.000 range.py:131(_simple_new)
1 0.000 0.000 0.000 0.000 range.py:158(_validate_dtype)
1 0.000 0.000 0.000 0.000 range.py:257(tolist)
9 0.000 0.000 0.000 0.000 range.py:481(__len__)
1 0.000 0.000 0.000 0.000 range.py:68(__new__)
2 0.000 0.000 0.000 0.000 range.py:84(_ensure_int)
23 0.000 0.000 0.000 0.000 raster_layers.py:115(<lambda>)
1 0.000 0.000 0.004 0.004 raster_layers.py:85(__init__)
38838 0.018 0.000 0.052 0.000 re.py:169(match)
16768 0.008 0.000 0.030 0.000 re.py:179(search)
13728 0.005 0.000 0.025 0.000 re.py:184(sub)
15 0.000 0.000 0.000 0.000 re.py:231(compile)
69349 0.032 0.000 0.032 0.000 re.py:286(_compile)
3432 0.002 0.000 0.002 0.000 re.py:324(_subx)
7 0.000 0.000 0.000 0.000 relationships.py:100(_write_relationship)
2 0.000 0.000 0.000 0.000 relationships.py:30(__init__)
2 0.000 0.000 0.002 0.001 relationships.py:47(_assemble_xml_file)
6 0.000 0.000 0.000 0.000 relationships.py:58(_add_document_relationship)
1 0.000 0.000 0.000 0.000 relationships.py:64(_add_package_relationship)
2 0.000 0.000 0.000 0.000 relationships.py:89(_write_relationships)
6527 0.002 0.000 0.002 0.000 runtime.py:125(resolve_or_missing)
6501 0.017 0.000 0.028 0.000 runtime.py:157(__init__)
6501 0.001 0.000 0.001 0.000 runtime.py:168(<genexpr>)
6527 0.003 0.000 0.005 0.000 runtime.py:208(resolve_or_missing)
19473/12962 0.045 0.000 0.445 0.000 runtime.py:234(call)
6478 0.020 0.000 0.439 0.000 runtime.py:501(__call__)
6478 0.007 0.000 0.412 0.000 runtime.py:577(_invoke)
6501 0.007 0.000 0.035 0.000 runtime.py:59(new_context)
1 0.000 0.000 0.000 0.000 runtime.py:609(__init__)
1 0.000 0.000 0.000 0.000 runtime.py:669(__nonzero__)
1 0.041 0.041 118.660 118.660 scrapeTorSiteLocations.py:34(__init__)
1 0.000 0.000 1.111 1.111 scrapeTorSiteLocations.py:74(to_excel)
1 0.002 0.002 124.890 124.890 scrapeTorSiteLocations.py:83(main)
4 0.000 0.000 0.000 0.000 series.py:165(__init__)
4 0.000 0.000 0.000 0.000 series.py:364(_set_axis)
4 0.000 0.000 0.000 0.000 series.py:390(_set_subtyp)
4 0.000 0.000 0.000 0.000 series.py:400(name)
2 0.000 0.000 0.000 0.000 series.py:4016(_sanitize_array)
2 0.000 0.000 0.000 0.000 series.py:4033(_try_cast)
4 0.000 0.000 0.000 0.000 series.py:404(name)
2 0.000 0.000 0.000 0.000 series.py:431(values)
6 0.000 0.000 0.000 0.000 series.py:464(_values)
1 0.000 0.000 0.000 0.000 sharedstrings.py:132(__init__)
12946 0.010 0.000 0.010 0.000 sharedstrings.py:138(_get_shared_string_index)
1 0.000 0.000 0.000 0.000 sharedstrings.py:157(_sort_string_data)
1 0.000 0.000 0.000 0.000 sharedstrings.py:163(_get_strings)
1 0.000 0.000 0.000 0.000 sharedstrings.py:28(__init__)
1 0.000 0.000 0.066 0.066 sharedstrings.py:44(_assemble_xml_file)
1 0.000 0.000 0.000 0.000 sharedstrings.py:68(_write_sst)
1 0.001 0.001 0.061 0.061 sharedstrings.py:80(_write_sst_strings)
3432 0.010 0.000 0.059 0.000 sharedstrings.py:86(_write_si)
10 0.000 0.000 0.037 0.004 shutil.py:76(copyfileobj)
3 0.000 0.000 0.000 0.000 six.py:184(find_module)
6 0.000 0.000 0.000 0.000 socket.py:333(send)
1 0.000 0.000 0.000 0.000 styles.py:120(_write_style_sheet)
1 0.000 0.000 0.000 0.000 styles.py:127(_write_num_fmts)
1 0.000 0.000 0.000 0.000 styles.py:198(_write_fonts)
2 0.000 0.000 0.000 0.000 styles.py:210(_write_font)
1 0.000 0.000 0.000 0.000 styles.py:25(__init__)
2 0.000 0.000 0.000 0.000 styles.py:305(_write_color)
1 0.000 0.000 0.000 0.000 styles.py:311(_write_fills)
2 0.000 0.000 0.000 0.000 styles.py:328(_write_default_fill)
1 0.000 0.000 0.000 0.000 styles.py:393(_write_borders)
2 0.000 0.000 0.000 0.000 styles.py:406(_write_border)
10 0.000 0.000 0.000 0.000 styles.py:460(_write_sub_border)
1 0.000 0.000 0.000 0.000 styles.py:497(_write_cell_style_xfs)
1 0.000 0.000 0.002 0.002 styles.py:50(_assemble_xml_file)
1 0.000 0.000 0.000 0.000 styles.py:514(_write_cell_xfs)
1 0.000 0.000 0.000 0.000 styles.py:533(_write_style_xf)
2 0.000 0.000 0.000 0.000 styles.py:561(_write_xf)
1 0.000 0.000 0.000 0.000 styles.py:625(_write_cell_styles)
1 0.000 0.000 0.000 0.000 styles.py:643(_write_cell_style)
1 0.000 0.000 0.000 0.000 styles.py:653(_write_dxfs)
1 0.000 0.000 0.000 0.000 styles.py:683(_write_table_styles)
1 0.000 0.000 0.000 0.000 styles.py:697(_write_colors)
1 0.000 0.000 0.000 0.000 styles.py:95(_set_style_properties)
10 0.000 0.000 0.000 0.000 tempfile.py:118(_sanitize_params)
10 0.000 0.000 0.000 0.000 tempfile.py:146(rng)
10 0.000 0.000 0.000 0.000 tempfile.py:157(__next__)
10 0.000 0.000 0.000 0.000 tempfile.py:160(<listcomp>)
10 0.000 0.000 0.000 0.000 tempfile.py:235(_get_candidate_names)
10 0.000 0.000 0.161 0.016 tempfile.py:249(_mkstemp_inner)
10 0.000 0.000 0.000 0.000 tempfile.py:289(gettempdir)
10 0.000 0.000 0.161 0.016 tempfile.py:305(mkstemp)
10 0.000 0.000 0.000 0.000 tempfile.py:97(_infer_return_type)
1 0.000 0.000 0.000 0.000 tests.py:62(test_none)
1 0.000 0.000 0.000 0.000 theme.py:29(__init__)
1 0.000 0.000 0.001 0.001 theme.py:44(_assemble_xml_file)
1 0.000 0.000 0.001 0.001 theme.py:50(_set_xml_writer)
1 0.000 0.000 0.000 0.000 theme.py:65(_write_theme_file)
6 0.000 0.000 0.000 0.000 threading.py:1062(_wait_for_tstate_lock)
6 0.000 0.000 0.000 0.000 threading.py:1104(is_alive)
6 0.000 0.000 0.000 0.000 threading.py:506(is_set)
1 0.000 0.000 0.000 0.000 threading.py:74(RLock)
2 0.000 0.000 0.000 0.000 tiles.txt:5(root)
12947 0.008 0.000 0.067 0.000 utilities.py:30(_is_sized_iterable)
25909 0.068 0.000 0.406 0.000 utilities.py:349(_camelify)
1 0.000 0.000 0.000 0.000 utilities.py:35(_validate_location)
25909 0.221 0.000 0.308 0.000 utilities.py:350(<listcomp>)
4 0.000 0.000 0.000 0.000 utilities.py:370(_parse_size)
6472 0.004 0.000 0.122 0.000 utilities.py:51(_validate_coordinates)
19419/6473 0.014 0.000 0.022 0.000 utilities.py:60(_iter_tolist)
19419 0.007 0.000 0.074 0.000 utilities.py:68(_flatten)
6473 0.006 0.000 0.095 0.000 utilities.py:77(_isnan)
19419 0.009 0.000 0.085 0.000 utilities.py:79(<genexpr>)
2 0.000 0.000 0.000 0.000 utility.py:15(xl_rowcol_to_cell)
19418 0.014 0.000 0.014 0.000 utility.py:37(xl_rowcol_to_cell_fast)
2 0.000 0.000 0.000 0.000 utility.py:58(xl_col_to_name)
19418 0.011 0.000 0.015 0.000 utility.py:604(supported_datetime)
12964 0.010 0.000 0.041 0.000 utils.py:348(get)
12964 0.025 0.000 0.031 0.000 utils.py:392(__getitem__)
12975 0.048 0.000 0.072 0.000 uuid.py:106(__init__)
12975 0.015 0.000 0.015 0.000 uuid.py:280(hex)
12975 0.019 0.000 0.117 0.000 uuid.py:621(uuid4)
58329 0.033 0.000 0.074 0.000 visitor.py:26(get_visitor)
58329/6481 0.070 0.000 1.434 0.000 visitor.py:34(visit)
38886/19443 0.040 0.000 0.542 0.000 visitor.py:41(generic_visit)
1 0.000 0.000 0.000 0.000 webbrowser.py:110(__init__)
1 0.000 0.000 0.000 0.000 webbrowser.py:27(get)
1 0.000 0.000 0.009 0.009 webbrowser.py:511(open)
1 0.000 0.000 0.009 0.009 webbrowser.py:57(open)
1 0.000 0.000 0.000 0.000 workbook.py:1008(_sort_defined_names)
1 0.000 0.000 0.000 0.000 workbook.py:1035(_prepare_drawings)
1 0.000 0.000 0.000 0.000 workbook.py:1298(_extract_named_ranges)
1 0.000 0.000 0.000 0.000 workbook.py:1337(_prepare_vml)
1 0.000 0.000 0.000 0.000 workbook.py:1398(_prepare_tables)
1 0.000 0.000 0.000 0.000 workbook.py:1412(_add_chart_data)
1 0.000 0.000 0.000 0.000 workbook.py:148(__del__)
1 0.000 0.000 0.000 0.000 workbook.py:1522(_prepare_sst_string_data)
1 0.000 0.000 0.000 0.000 workbook.py:1532(_write_workbook)
1 0.000 0.000 0.000 0.000 workbook.py:1546(_write_file_version)
1 0.000 0.000 0.000 0.000 workbook.py:1567(_write_workbook_pr)
1 0.000 0.000 0.000 0.000 workbook.py:1581(_write_book_views)
1 0.000 0.000 0.000 0.000 workbook.py:1587(_write_workbook_view)
1 0.000 0.000 0.000 0.000 workbook.py:1611(_write_sheets)
1 0.000 0.000 0.000 0.000 workbook.py:1622(_write_sheet)
1 0.000 0.000 0.000 0.000 workbook.py:1636(_write_calc_pr)
1 0.000 0.000 0.000 0.000 workbook.py:165(add_worksheet)
1 0.000 0.000 0.000 0.000 workbook.py:1651(_write_defined_names)
1 0.000 0.000 0.000 0.000 workbook.py:1688(__init__)
3 0.000 0.000 0.000 0.000 workbook.py:197(add_format)
1 0.000 0.000 0.562 0.562 workbook.py:298(close)
22 0.000 0.000 0.000 0.000 workbook.py:481(worksheets)
1 0.000 0.000 0.001 0.001 workbook.py:558(_assemble_xml_file)
1 0.000 0.000 0.000 0.000 workbook.py:56(__init__)
1 0.000 0.000 0.562 0.562 workbook.py:594(_store_workbook)
1 0.000 0.000 0.000 0.000 workbook.py:663(_add_sheet)
1 0.000 0.000 0.000 0.000 workbook.py:714(_check_sheetname)
1 0.000 0.000 0.000 0.000 workbook.py:757(_prepare_format_properties)
1 0.000 0.000 0.000 0.000 workbook.py:775(_prepare_formats)
1 0.000 0.000 0.000 0.000 workbook.py:819(_prepare_fonts)
1 0.000 0.000 0.000 0.000 workbook.py:848(_prepare_num_formats)
1 0.000 0.000 0.000 0.000 workbook.py:877(_prepare_borders)
1 0.000 0.000 0.000 0.000 workbook.py:908(_prepare_fills)
1 0.000 0.000 0.000 0.000 workbook.py:967(_prepare_defined_names)
1 0.000 0.000 0.000 0.000 worksheet.py:158(__init__)
19418 0.061 0.000 0.290 0.000 worksheet.py:353(write)
1 0.000 0.000 0.000 0.000 worksheet.py:3551(_initialize)
1 0.000 0.000 0.260 0.260 worksheet.py:3592(_assemble_xml_file)
19418 0.022 0.000 0.022 0.000 worksheet.py:3683(_check_dimensions)
2 0.000 0.000 0.000 0.000 worksheet.py:3762(_sort_pagebreaks)
12946 0.027 0.000 0.063 0.000 worksheet.py:443(write_string)
6472 0.001 0.000 0.001 0.000 worksheet.py:4782(_isnan)
6472 0.001 0.000 0.001 0.000 worksheet.py:4786(_isinf)
6472 0.014 0.000 0.028 0.000 worksheet.py:486(write_number)
1 0.000 0.000 0.000 0.000 worksheet.py:4879(_write_worksheet)
1 0.000 0.000 0.000 0.000 worksheet.py:4901(_write_dimension)
1 0.000 0.000 0.000 0.000 worksheet.py:4937(_write_sheet_views)
1 0.000 0.000 0.000 0.000 worksheet.py:4946(_write_sheet_view)
1 0.000 0.000 0.000 0.000 worksheet.py:4991(_write_sheet_format_pr)
1 0.000 0.000 0.000 0.000 worksheet.py:5015(_write_cols)
1 0.000 0.000 0.240 0.240 worksheet.py:5084(_write_sheet_data)
1 0.000 0.000 0.000 0.000 worksheet.py:5119(_write_page_margins)
1 0.000 0.000 0.000 0.000 worksheet.py:5131(_write_page_setup)
1 0.000 0.000 0.000 0.000 worksheet.py:5198(_write_print_options)
1 0.000 0.000 0.000 0.000 worksheet.py:5223(_write_header_footer)
1 0.033 0.033 0.240 0.240 worksheet.py:5251(_write_rows)
38836/19418 0.044 0.000 0.316 0.000 worksheet.py:53(cell_wrapper)
1 0.009 0.009 0.009 0.009 worksheet.py:5329(_calculate_spans)
6473 0.010 0.000 0.032 0.000 worksheet.py:5378(_write_row)
19418 0.035 0.000 0.154 0.000 worksheet.py:5433(_write_cell)
1 0.000 0.000 0.000 0.000 worksheet.py:5560(_write_sheet_pr)
1 0.000 0.000 0.000 0.000 worksheet.py:5625(_write_row_breaks)
1 0.000 0.000 0.000 0.000 worksheet.py:5646(_write_col_breaks)
1 0.000 0.000 0.000 0.000 worksheet.py:5676(_write_merge_cells)
1 0.000 0.000 0.000 0.000 worksheet.py:5708(_write_hyperlinks)
1 0.000 0.000 0.000 0.000 worksheet.py:5813(_write_auto_filter)
1 0.000 0.000 0.000 0.000 worksheet.py:5939(_write_sheet_protection)
1 0.000 0.000 0.000 0.000 worksheet.py:5987(_write_drawings)
1 0.000 0.000 0.000 0.000 worksheet.py:6003(_write_legacy_drawing)
1 0.000 0.000 0.000 0.000 worksheet.py:6016(_write_legacy_drawing_hf)
1 0.000 0.000 0.000 0.000 worksheet.py:6029(_write_data_validations)
1 0.000 0.000 0.000 0.000 worksheet.py:6157(_write_conditional_formats)
1 0.000 0.000 0.000 0.000 worksheet.py:6582(_write_table_parts)
1 0.000 0.000 0.000 0.000 worksheet.py:6612(_write_ext_list)
12946 0.023 0.000 0.055 0.000 xmlwriter.py:102(_xml_string_element)
3432 0.005 0.000 0.013 0.000 xmlwriter.py:112(_xml_si_element)
6472 0.021 0.000 0.048 0.000 xmlwriter.py:129(_xml_number_element)
26031 0.011 0.000 0.027 0.000 xmlwriter.py:180(_escape_attributes)
3447 0.001 0.000 0.002 0.000 xmlwriter.py:196(_escape_data)
13 0.000 0.000 0.000 0.000 xmlwriter.py:24(__init__)
9 0.000 0.000 0.004 0.000 xmlwriter.py:34(_set_xml_writer)
9 0.000 0.000 0.035 0.004 xmlwriter.py:43(_xml_close)
9 0.000 0.000 0.000 0.000 xmlwriter.py:48(_xml_declaration)
36 0.000 0.000 0.000 0.000 xmlwriter.py:53(_xml_start_tag)
6473 0.012 0.000 0.021 0.000 xmlwriter.py:61(_xml_start_tag_unencoded)
6509 0.004 0.000 0.013 0.000 xmlwriter.py:70(_xml_end_tag)
54 0.000 0.000 0.001 0.000 xmlwriter.py:74(_xml_empty_tag)
15 0.000 0.000 0.000 0.000 xmlwriter.py:91(_xml_data_element)
1 0.000 0.000 0.001 0.001 zipfile.py:1060(__init__)
10 0.000 0.000 0.001 0.000 zipfile.py:1317(open)
10 0.000 0.000 0.001 0.000 zipfile.py:1430(_open_to_write)
10 0.000 0.000 0.000 0.000 zipfile.py:1560(_writecheck)
10 0.000 0.000 0.041 0.004 zipfile.py:1583(write)
1 0.000 0.000 0.000 0.000 zipfile.py:1661(__del__)
2 0.000 0.000 0.005 0.003 zipfile.py:1665(close)
1 0.000 0.000 0.000 0.000 zipfile.py:1687(_write_end_record)
1 0.000 0.000 0.005 0.005 zipfile.py:1788(_fpclose)
10 0.000 0.000 0.000 0.000 zipfile.py:320(__init__)
20 0.000 0.000 0.000 0.000 zipfile.py:384(FileHeader)
30 0.000 0.000 0.000 0.000 zipfile.py:430(_encodeFilenameFlags)
10 0.000 0.000 0.001 0.000 zipfile.py:472(from_file)
20 0.000 0.000 0.000 0.000 zipfile.py:506(is_dir)
11 0.000 0.000 0.000 0.000 zipfile.py:643(_check_compression)
10 0.000 0.000 0.000 0.000 zipfile.py:662(_get_compressor)
10 0.000 0.000 0.000 0.000 zipfile.py:967(__init__)
172 0.000 0.000 0.000 0.000 zipfile.py:976(_fileobj)
122 0.001 0.000 0.034 0.000 zipfile.py:983(write)
10 0.000 0.000 0.002 0.000 zipfile.py:995(close)
149597 0.106 0.000 0.106 0.000 {built-in method __new__ of type object at 0x0000000059EBC3F0}
1 0.026 0.026 0.026 0.026 {built-in method _codecs.charmap_decode}
10 0.000 0.000 0.000 0.000 {built-in method _codecs.lookup}
35947 0.013 0.000 0.013 0.000 {built-in method _codecs.utf_8_encode}
6482 0.003 0.000 0.003 0.000 {built-in method _functools.reduce}
15 0.000 0.000 0.000 0.000 {built-in method _imp.acquire_lock}
1 0.000 0.000 0.000 0.000 {built-in method _imp.is_frozen}
15 0.000 0.000 0.000 0.000 {built-in method _imp.release_lock}
10 0.000 0.000 0.000 0.000 {built-in method _json.encode_basestring_ascii}
19418 0.052 0.000 0.052 0.000 {built-in method _libjson.dumps}
1 0.000 0.000 0.000 0.000 {built-in method _locale._getdefaultlocale}
10 0.000 0.000 0.000 0.000 {built-in method _stat.S_ISDIR}
31 0.000 0.000 0.000 0.000 {built-in method _struct.pack}
8 0.000 0.000 0.000 0.000 {built-in method _thread.allocate_lock}
8 0.000 0.000 0.000 0.000 {built-in method _thread.get_ident}
3/1 0.000 0.000 0.000 0.000 {built-in method builtins.__import__}
6476 0.005 0.000 0.089 0.000 {built-in method builtins.any}
149055 0.014 0.000 0.014 0.000 {built-in method builtins.callable}
24 0.000 0.000 0.000 0.000 {built-in method builtins.chr}
6481 0.727 0.000 0.727 0.000 {built-in method builtins.compile}
6482/1 0.011 0.000 124.891 124.891 {built-in method builtins.exec}
311302 0.096 0.000 0.096 0.000 {built-in method builtins.getattr}
999551/999549 0.315 0.000 0.315 0.000 {built-in method builtins.hasattr}
7 0.000 0.000 0.000 0.000 {built-in method builtins.hash}
44 0.000 0.000 0.000 0.000 {built-in method builtins.id}
4738194 0.770 0.000 1.685 0.000 {built-in method builtins.isinstance}
50 0.000 0.000 0.000 0.000 {built-in method builtins.issubclass}
64809 0.012 0.000 0.012 0.000 {built-in method builtins.iter}
1032444/1032430 0.097 0.000 0.098 0.000 {built-in method builtins.len}
71362 0.024 0.000 0.024 0.000 {built-in method builtins.max}
2 0.000 0.000 0.000 0.000 {built-in method builtins.min}
376422/363460 0.089 0.000 1.006 0.000 {built-in method builtins.next}
2 0.000 0.000 0.000 0.000 {built-in method builtins.ord}
2 0.000 0.000 0.000 0.000 {built-in method builtins.print}
6481 0.014 0.000 0.014 0.000 {built-in method builtins.repr}
58329 0.011 0.000 0.011 0.000 {built-in method builtins.setattr}
38864 0.049 0.000 0.049 0.000 {built-in method builtins.sorted}
1 0.000 0.000 0.000 0.000 {built-in method builtins.sum}
12975 0.015 0.000 0.015 0.000 {built-in method from_bytes}
23 0.006 0.000 0.006 0.000 {built-in method io.open}
12946 0.002 0.000 0.002 0.000 {built-in method math.isnan}
11 0.000 0.000 0.000 0.000 {built-in method nt._getfullpathname}
34 0.002 0.000 0.002 0.000 {built-in method nt._isdir}
10 0.000 0.000 0.000 0.000 {built-in method nt.close}
338 0.000 0.000 0.000 0.000 {built-in method nt.fspath}
14 0.000 0.000 0.000 0.000 {built-in method nt.getpid}
12 0.000 0.000 0.000 0.000 {built-in method nt.listdir}
10 0.160 0.016 0.160 0.016 {built-in method nt.open}
10 0.005 0.000 0.005 0.000 {built-in method nt.remove}
1 0.009 0.009 0.009 0.009 {built-in method nt.startfile}
13 0.001 0.000 0.001 0.000 {built-in method nt.stat}
12975 0.026 0.000 0.026 0.000 {built-in method nt.urandom}
10 0.004 0.000 0.004 0.000 {built-in method nt.utime}
2 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.arange}
15 0.003 0.000 0.003 0.000 {built-in method numpy.core.multiarray.array}
4 0.000 0.000 0.000 0.000 {built-in method numpy.core.multiarray.empty}
4 0.000 0.000 0.000 0.000 {built-in method pandas._libs.algos.ensure_object}
4 0.000 0.000 0.000 0.000 {built-in method pandas._libs.lib.infer_datetimelike_array}
12946 0.001 0.000 0.001 0.000 {built-in method pandas._libs.lib.is_bool}
32364 0.004 0.000 0.004 0.000 {built-in method pandas._libs.lib.is_float}
19427 0.002 0.000 0.002 0.000 {built-in method pandas._libs.lib.is_integer}
38846 0.005 0.000 0.005 0.000 {built-in method pandas._libs.lib.is_scalar}
19420 0.006 0.000 0.006 0.000 {built-in method pandas._libs.missing.checknull}
19443 0.005 0.000 0.005 0.000 {built-in method sys.intern}
2 0.000 0.000 0.000 0.000 {built-in method time.clock}
10 0.000 0.000 0.000 0.000 {built-in method time.localtime}
10 0.000 0.000 0.000 0.000 {built-in method time.mktime}
3 0.000 0.000 0.000 0.000 {built-in method utcnow}
10 0.000 0.000 0.000 0.000 {built-in method zlib.compressobj}
122 0.002 0.000 0.002 0.000 {built-in method zlib.crc32}
1 0.000 0.000 0.000 0.000 {function FrozenList.__getitem__ at 0x000000000520BE18}
10 0.000 0.000 0.000 0.000 {function _ZipWriteFile.close at 0x00000000054F2950}
12970 0.005 0.000 0.005 0.000 {method 'acquire' of '_thread.lock' objects}
1 0.000 0.000 0.000 0.000 {method 'any' of 'numpy.ndarray' objects}
8 0.000 0.000 0.000 0.000 {method 'append' of 'collections.deque' objects}
1274771 0.146 0.000 0.146 0.000 {method 'append' of 'list' objects}
80 0.000 0.000 0.000 0.000 {method 'bit_length' of 'int' objects}
1 0.005 0.005 0.005 0.005 {method 'close' of '_io.BufferedRandom' objects}
11 0.257 0.023 0.257 0.023 {method 'close' of '_io.BufferedWriter' objects}
122 0.031 0.000 0.031 0.000 {method 'compress' of 'zlib.Compress' objects}
4 0.000 0.000 0.000 0.000 {method 'copy' of 'dict' objects}
12975 0.005 0.000 0.005 0.000 {method 'count' of 'list' objects}
591316 0.337 0.000 0.337 0.000 {method 'count' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
31 0.001 0.000 0.001 0.000 {method 'encode' of 'str' objects}
1062356 0.139 0.000 0.139 0.000 {method 'end' of '_sre.SRE_Match' objects}
240517 0.077 0.000 0.077 0.000 {method 'endswith' of 'str' objects}
19443 0.008 0.000 0.062 0.000 {method 'extend' of 'collections.deque' objects}
12954 0.002 0.000 0.002 0.000 {method 'extend' of 'list' objects}
2 0.000 0.000 0.000 0.000 {method 'fill' of 'numpy.ndarray' objects}
11 0.000 0.000 0.000 0.000 {method 'find' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'flush' of '_io.BufferedRandom' objects}
10 0.001 0.000 0.001 0.000 {method 'flush' of 'zlib.Compress' objects}
3 0.000 0.000 0.000 0.000 {method 'format' of 'str' objects}
464007 0.125 0.000 0.125 0.000 {method 'get' of 'dict' objects}
4 0.000 0.000 0.000 0.000 {method 'get_loc' of 'pandas._libs.index.IndexEngine' objects}
126 0.000 0.000 0.000 0.000 {method 'getrandbits' of '_random.Random' objects}
6481 0.010 0.000 0.010 0.000 {method 'getvalue' of '_io.StringIO' objects}
860503 0.318 0.000 0.318 0.000 {method 'group' of '_sre.SRE_Match' objects}
4 0.000 0.000 0.000 0.000 {method 'index' of 'list' objects}
174859 0.016 0.000 0.016 0.000 {method 'islower' of 'str' objects}
317328 0.026 0.000 0.026 0.000 {method 'isupper' of 'str' objects}
6480 0.001 0.000 0.001 0.000 {method 'items' of 'collections.OrderedDict' objects}
77747 0.011 0.000 0.011 0.000 {method 'items' of 'dict' objects}
246807/240309 0.097 0.000 0.215 0.000 {method 'join' of 'str' objects}
176837 0.118 0.000 0.118 0.000 {method 'keys' of 'dict' objects}
1340926 0.160 0.000 0.160 0.000 {method 'lower' of 'str' objects}
25943 0.010 0.000 0.010 0.000 {method 'lstrip' of 'str' objects}
1801581 2.475 0.000 2.475 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
1 0.000 0.000 0.000 0.000 {method 'partition' of 'str' objects}
38887 0.007 0.000 0.007 0.000 {method 'pop' of 'dict' objects}
240517 0.081 0.000 0.081 0.000 {method 'pop' of 'list' objects}
19443 0.002 0.000 0.002 0.000 {method 'popleft' of 'collections.deque' objects}
132 0.002 0.000 0.002 0.000 {method 'read' of '_io.BufferedReader' objects}
1 0.044 0.044 0.071 0.071 {method 'read' of '_io.TextIOWrapper' objects}
3 0.000 0.000 0.000 0.000 {method 'reduce' of 'numpy.ufunc' objects}
12964 0.002 0.000 0.002 0.000 {method 'release' of '_thread.lock' objects}
2 0.000 0.000 0.000 0.000 {method 'remove' of 'collections.deque' objects}
26187 0.012 0.000 0.012 0.000 {method 'replace' of 'str' objects}
1 0.000 0.000 0.000 0.000 {method 'reverse' of 'list' objects}
6 0.000 0.000 0.000 0.000 {method 'rfind' of 'str' objects}
6894 0.006 0.000 0.006 0.000 {method 'rindex' of 'str' objects}
7 0.000 0.000 0.000 0.000 {method 'rpartition' of 'str' objects}
10 0.000 0.000 0.000 0.000 {method 'rstrip' of 'str' objects}
738541 0.615 0.000 0.615 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
32 0.000 0.000 0.000 0.000 {method 'seek' of '_io.BufferedRandom' objects}
2 0.000 0.000 0.000 0.000 {method 'sort' of 'list' objects}
123236 0.138 0.000 0.138 0.000 {method 'split' of '_sre.SRE_Pattern' objects}
70 0.000 0.000 0.000 0.000 {method 'split' of 'str' objects}
6481 0.009 0.000 0.009 0.000 {method 'splitlines' of 'str' objects}
492382 0.076 0.000 0.076 0.000 {method 'start' of '_sre.SRE_Match' objects}
861186 0.231 0.000 0.231 0.000 {method 'startswith' of 'str' objects}
2 0.000 0.000 0.000 0.000 {method 'strftime' of 'datetime.date' objects}
240522 0.035 0.000 0.035 0.000 {method 'strip' of 'str' objects}
98033 0.168 0.000 0.201 0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
22 0.000 0.000 0.000 0.000 {method 'tell' of '_io.BufferedRandom' objects}
3 0.000 0.000 0.000 0.000 {method 'tolist' of 'numpy.ndarray' objects}
4 0.000 0.000 0.000 0.000 {method 'update' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'upper' of 'str' objects}
19443 0.003 0.000 0.003 0.000 {method 'values' of 'dict' objects}
5 0.000 0.000 0.000 0.000 {method 'view' of 'numpy.ndarray' objects}
194 0.000 0.000 0.000 0.000 {method 'write' of '_io.BufferedRandom' objects}
35948 0.012 0.000 0.012 0.000 {method 'write' of '_io.BufferedWriter' objects}
207392 0.026 0.000 0.026 0.000 {method 'write' of '_io.StringIO' objects}
1 0.000 0.000 0.000 0.000 {pandas._libs.lib.clean_index_list}
1 0.000 0.000 0.000 0.000 {pandas._libs.lib.infer_dtype}
2 0.000 0.000 0.000 0.000 {pandas._libs.lib.maybe_convert_objects}
1 0.001 0.001 0.001 0.001 {pandas._libs.lib.to_object_array}

The nasty bottleneck seems to be the instantiation, but why?
Strawk

__blackjack__ · Thursday 15 August 2019, 15:49

@Strawk: Trying to parse HTML with regular expressions is still wrong. And so is the bare ``except``. If a problem occurs during the first loop iteration, you at least notice it, because the names defined there do not exist yet and a follow-up exception is raised afterwards. In later iterations, however, wrong data is produced without you noticing it directly, because the values from the previous iteration are simply repeated. That is rather broken "error handling" if it just produces further errors.
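
A minimal sketch of that effect, using made-up row strings and a hypothetical `parse_row()` helper (neither is taken from the original script). The first function reproduces the stale-value problem of a bare ``except``; the second catches only the expected exception and skips the row:

```python
def parse_row(row):
    """Hypothetical stand-in for the per-row parsing: 'lon,lat' -> floats."""
    lon_text, lat_text = row.split(",")
    return float(lon_text), float(lat_text)


ROWS = ["24.0,56.0", "broken", "2.3387,48.8582"]  # made-up sample data


def collect_with_bare_except(rows):
    result = []
    for row in rows:
        try:
            lon, lat = parse_row(row)
        except:  # swallows every error, even typos in our own code
            pass
        # After a failure, lon/lat still hold the previous row's values.
        result.append((lon, lat))
    return result


def collect_with_narrow_except(rows):
    result = []
    for row in rows:
        try:
            lon, lat = parse_row(row)
        except ValueError:
            continue  # skip only rows that really cannot be parsed
        result.append((lon, lat))
    return result


print(collect_with_bare_except(ROWS))    # second entry repeats the first
print(collect_with_narrow_except(ROWS))  # the broken row is left out
```

If `"broken"` were the very first row, the bare ``except`` variant would fail with a `NameError` at the `append()`, which is the follow-up exception mentioned above.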

`counter` should be initialised to 0 before the loop, because otherwise you run into a problem after the loop if no <tr> elements were found. Why do you print an integer with one decimal place?
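
A small sketch of the `counter` point, assuming the counter simply numbers the processed rows (the function name `count_rows()` is made up for illustration):

```python
def count_rows(rows):
    counter = 0  # defined even if the loop body never runs
    for counter, _row in enumerate(rows, start=1):
        pass  # the per-row work would go here
    return counter


print(count_rows([]))               # no rows: counter is still defined
print(count_rows(["a", "b", "c"]))
print(f"{count_rows(['a']):d} row(s) processed")  # integer without decimals
```

Without the `counter = 0` line, the first call would raise an `UnboundLocalError` at the `return`.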

Names in Python are written lowercase_with_underscores. That works with `bs4` too; you just must not use the old "unpythonic" names any more. Above all, the new names *are* in fact already used in places. Why do you use both `find_all()` and `findAll()` in the same function? What was the basis for that decision?

`my_*` is a prefix that should only be used when it makes sense, that is, when there is also an `our_*` or `their_*` or something similar. So basically never.

Once again, the class is not really a class semantically.

`__init__()` is not a constructor. The constructor in Python is `__new__()`.
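
The order of the two calls can be made visible with a tiny sketch; the `Point` class and the `calls` list are purely illustrative:

```python
calls = []


class Point:
    def __new__(cls, *args, **kwargs):
        # Creates and returns the new, still uninitialised object.
        calls.append("__new__")
        return super().__new__(cls)

    def __init__(self, x, y):
        # Receives the already existing object and fills in its state.
        calls.append("__init__")
        self.x = x
        self.y = y


point = Point(1, 2)
print(calls)
```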

It also makes little sense to repeat the name of the function or method in the docstring, because you query the docstring from the function/method itself, and that is also where you get the name, since it is likewise stored as an attribute on the object.
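
For illustration, both pieces of information are available as attributes of the function object; the function body here is just a placeholder:

```python
def read_locations(path):
    """Return the locations found in the HTML table stored at *path*."""


print(read_locations.__name__)  # the name comes from the object itself
print(read_locations.__doc__)   # so the docstring need not repeat it
```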

Nobody is likely to look through the big profiling dump. Besides, it does not even say what the numbers mean.

noisefloor · Thursday 15 August 2019, 19:38

Hello,

and when you rework the code, it is best to port it to Python 3 right away, since Python 2 will, as is well known, no longer be supported after the end of this year.

Regards, noisefloor

__blackjack__ · Thursday 15 August 2019, 20:21

And not only Python itself; several large projects are also dropping their Python 2 support at that point: https://python3statement.org/

A few even earlier. Pandas has already done so, and the current versions of IPython and Matplotlib are also already available for Python 3 only.

/me · Friday 16 August 2019, 07:40

__blackjack__ wrote: ↑Thursday 15 August 2019, 20:21 And not only Python itself; several large projects are also dropping their Python 2 support at that point: https://python3statement.org/

Oh, I didn't know that one yet. There are some real heavyweights among them.

__blackjack__ · Friday 16 August 2019, 20:25

@Strawk: I commented out the part with `geo_mk` because I don't have it installed, opened the HTML file in binary mode so that it also runs under Python 3, and then actually ran it: 135 seconds without the profiler.

Then I ran it with the profiler and looked at where the time goes: 96.28% of the time is taken up by creating the `BeautifulSoup` object, and that time is spent essentially entirely in the `html.parser` module from the standard library.
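
A sketch of how such a measurement could be narrowed down with `cProfile` and `pstats` from the standard library. The functions `slow_part()` and `fast_part()` are made-up placeholders, not the script from this thread. In the report, `ncalls` is the number of calls, `tottime` the time spent in the function itself without sub-calls, and `cumtime` the time including sub-calls:

```python
import cProfile
import io
import pstats


def slow_part():
    return sum(i * i for i in range(200_000))


def fast_part():
    return sum(range(1_000))


def main():
    slow_part()
    fast_part()


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Sort by cumulative time and show only the five most expensive entries
# instead of dumping every function that was ever called.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that account for most of the run time at the top of the report.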

It is perhaps not a good idea to try to parse an almost 9 megabyte HTML file with a little over 230,000 HTML tags using pure Python code.

If I replace the parser from the standard library with `lxml`, it only takes 6.13 seconds. I strongly suspect that is also what your acquaintance did.

When I profile that, creating the `BeautifulSoup` object accounts for only 57.37% of the total run time.

Sirius3 · Friday 16 August 2019, 21:34

@Strawk: apart from the superfluous class and the incorrect use of regular expressions, you surely don't want an Excel file in which all numbers are stored as strings either?

Code: Select all

import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qsl
import requests

def read_locations(url):
    """
    parses HTML-table
    retrieves locations (longitude, latitude)
    
    Arguments:
    - url: page to be scraped
    """
    # content = requests.get(url).content
    with open(url) as f:
        content = f.read()
    soup = BeautifulSoup(content, 'html.parser')
    table = soup.find('table', attrs={'class': 'displayTable'})
    table_rows = table.find_all('tr', {'class': 'r'})

    res = []
    for tr in table_rows:
        # find() returns a single tag; find_all() would return a list,
        # which cannot be indexed with 'href'.
        anchor = tr.find_next('td', {'class': 'TRr'}).find('a')
        query = dict(parse_qsl(urlparse(anchor['href']).query))
        latitude = float(query['mlat'])
        longitude = float(query['mlon'])
        # Order must match the column names given to the DataFrame below.
        res.append([longitude, latitude])
    return pd.DataFrame(res, columns=["longitude", "latitude"])

def main():
    locations = read_locations(url="data/Tor.htm")
    locations.to_excel("results/scrapedLocations.xlsx")
    lons = locations['longitude'].values
    lats = locations['latitude'].values
    geo_mk.geo_visualization.plot_points_on_map(lats, lons)

if __name__ == "__main__":
    main()

Strawk · Saturday 17 August 2019, 06:23

Hello _blackjack_,
how exactly did you replace the parser from the standard library with `lxml`?
Best regards, S.

Sirius3 · Saturday 17 August 2019, 06:58

›BeautifulSoup‹ takes a second argument:

Code: Select all

soup = BeautifulSoup(content, "lxml")

Strawk · Saturday 17 August 2019, 07:02

Hello _blackjack_ and Sirius3,
the script now takes only 19 seconds on average instead of 110. Very nice, and thank you.
Strawk

__blackjack__ · Monday 19 August 2019, 09:27

@Strawk: By the way, the exception handling when querying the groups of the match object is definitely faulty for the given data, because not every entry in the table actually contains a location. So you have to a) take that into account, and b) think about what to do in such a case: either simply ignore the exit node or record NaN for a missing "measurement".
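
One possible way to handle option b), sketched with the standard library only. The helper name `extract_location()` is made up, and it is an assumption that a missing location shows up as a missing link or as a link without `mlat`/`mlon` parameters:

```python
import math
from urllib.parse import parse_qs, urlparse


def extract_location(href):
    """Return (latitude, longitude); NaN for both if there is no location."""
    if href is None:  # assumption: the row has no map link at all
        return math.nan, math.nan
    query = parse_qs(urlparse(href).query)
    try:
        return float(query["mlat"][0]), float(query["mlon"][0])
    except (KeyError, ValueError):
        # Parameter missing or not a number: record a missing "measurement".
        return math.nan, math.nan


print(extract_location(
    "http://www.openstreetmap.org/?mlon=2.3387&mlat=48.8582&zoom=6"
))
print(extract_location(None))
```

Rows with NaN can later be dropped or kept as gaps, depending on what the evaluation needs.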

Strawk · Monday 19 August 2019, 16:48

Currently reading https://www.geeksforgeeks.org/working-w ... in-pandas/
Regards
Strawk