Regex Problem

Lachesis580
User
Posts: 3
Registered: Thursday, 20 June 2019, 23:24

Hi

I have a problem with a regex and just can't find a solution.
I fetch the HTML of a page, e.g. https://www.imdb.com/title/tt0773262/

I want to read out the title, but the problem is that everything after the capture group is returned as well.
It worked in AutoIt, but it doesn't in Python. This is what I tried:

Code:

Title = re.findall(r'<title>(.*) \(', str(html))
print(Title[0])

This is what came out:

Code:

Dexter (TV Series 2006\xe2\x80\x932013) - IMDb</title>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == \'function\') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</script>\n<script>\n    if (typeof uex == \'function\') {\n      uex("ld", "LoadTitle", {wb: 1});\n    }\n</script>\n\n        <link rel="canonical" href="https://www.imdb.com/title/tt0773262/" />\n        <meta property="og:url" content="http://www.imdb.com/title/tt0773262/" />\n        <link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.imdb.com/title/tt0773262/">\n\n<script>\n    if (typeof uet == \'function\') {\n      uet("bb", "LoadIcons", {wb: 1});\n    }\n</script>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_icon"] = new Date().getTime(); })(IMDbTimer);</script>\n        <link href="https://m.media-amazon.com/images/G/01/imdb/images/safari-favicon-517611381._CB483525257_.svg" mask rel="icon" sizes="any">\n        <link rel="icon" type="image/ico" href="https://m.media-amazon.com/images/G/01/imdb/images/favicon-2165806970._CB470047330_.ico" />\n        <meta name="theme-color" content="#000000" />\n        <link rel="shortcut icon" type="image/x-icon" href="https://m.media-amazon.com/images/G/01/imdb/images/desktop-favicon-2165806970._CB484110913_.ico" />\n        <link href="https://m.media-amazon.com/images/G/01/imdb/images/mobile/apple-touch-icon-web-4151659188._CB483525313_.png" rel="apple-touch-icon"> \n        <link href="https://m.media-amazon.com/images/G/01/imdb/images/mobile/apple-touch-icon-web-76x76-53536248._CB484146059_.png" rel="apple-touch-icon" sizes="76x76"> \n        <link href="https://m.media-amazon.com/images/G/01/imdb/images/mobile/apple-touch-icon-web-120x120-2442878471._CB483525250_.png" rel="apple-touch-icon" sizes="120x120"> \n        <link 
href="https://m.media-amazon.com/images/G/01/imdb/images/mobile/apple-touch-icon-web-152x152-1475823641._CB470042035_.png" rel="apple-touch-icon" sizes="152x152">            \n        <link rel="search" type="application/opensearchdescription+xml" href="https://m.media-amazon.com/images/G/01/imdb/images/imdbsearch-3349468880._CB470047351_.xml" title="IMDb" />\n  <script>(function(t){ (t.events = t.events || {})["csm_head_post_icon"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == \'function\') {\n      uet("be", "LoadIcons", {wb: 1});\n    }\n</script>\n<script>\n    if
(typeof uex == \'function\') {\n      uex("ld", "LoadIcons", {wb: 1});\n    }\n</script>\n\n        <meta property="pageId" content="tt0773262" />\n        <meta property="pageType" content="title" />\n        <meta property="subpageType" content="main"
/>\n\n\n        <link rel=\'image_src\' href="https://m.media-amazon.com/images/M/MV5BMTM5MjkwMTI0MV5BMl5BanBnXkFtZTcwODQwMTc0OQ@@._V1_UY1200_CR126,0,630,1200_AL_.jpg">\n        <meta property=\'og:image\' content="https://m.media-amazon.com/images/M/MV5BMTM5MjkwMTI0MV5BMl5BanBnXkFtZTcwODQwMTc0OQ@@._V1_UY1200_CR126,0,630,1200_AL_.jpg" />\n\n        <meta property=\'og:type\' content="video.tv_show" />\n    <meta property=\'fb:app_id\' content=\'115109575169727\' />\n\n      <meta property=\'og:title\' content="Dexter (TV Series 2006\xe2\x80\x932013) - IMDb" />\n    <meta property=\'og:site_name\' content=\'IMDb\' />\n    <meta name="title" content="Dexter (TV Series 2006\xe2\x80\x932013) - IMDb" />\n        <meta name="description" content="Created by
James Manos Jr..  With Michael C. Hall, Jennifer Carpenter, David Zayas, James Remar. By day, mild-mannered Dexter is a blood-spatter analyst for the Miami police. But at night, he is a serial killer who only targets other murderers." />\n        <meta property="og:description" content="Created by James Manos Jr..  With Michael C. Hall, Jennifer Carpenter, David Zayas, James Remar. By day, mild-mannered Dexter is a blood-spatter analyst for the Miami police. But at night, he is a serial killer who only
targets other murderers." />\n        <meta name="keywords" content="Reviews, Showtimes, DVDs, Photos, Message Boards, User Ratings, Synopsis, Trailers, Credits" />\n        <meta name="request_id" content="WP8B0B9VEER5XMC3EFE7" />\n<script type="application/ld+json">{\n  "@context": "http://schema.org",\n  "@type": "TVSeries",\n  "url": "/title/tt0773262/",\n  "name": "Dexter",\n  "image": "https://m.media-amazon.com/images/M/MV5BMTM5MjkwMTI0MV5BMl5BanBnXkFtZTcwODQwMTc0OQ@@._V1_.jpg",\n  "genre": [\n
  "Crime",\n    "Drama",\n    "Mystery",\n    "Thriller"\n  ],\n  "contentRating": "TV-MA",\n  "actor": [\n    {\n      "@type": "Person",\n      "url": "/name/nm0355910/",\n      "name": "Michael C. Hall"\n    },\n    {\n      "@type": "Person",\n
"url": "/name/nm1358539/",\n      "name": "Jennifer Carpenter"\n    },\n    {\n      "@type": "Person",\n      "url": "/name/nm0953882/",\n      "name": "David Zayas"\n    },\n    {\n      "@type": "Person",\n      "url": "/name/nm0001664/",\n      "name": "James Remar"\n    }\n  ],\n  "creator": [\n    {\n      "@type": "Person",\n      "url": "/name/nm0543612/",\n      "name": "James Manos Jr."\n    },\n    {\n      "@type": "Organization",\n      "url": "/company/co0052980/"\n    },\n    {\n      "@type": "Organization",\n      "url": "/company/co0177677/"\n    },\n    {\n      "@type": "Organization",\n      "url": "/company/co0176685/"\n    },\n    {\n      "@type": "Organization",\n      "url": "/company/co0020550/"\n    },\n    {\n      "@type": "Organization",\n      "url": "/company/co0360733/"\n    }\n  ],\n  "description": "Dexter is a TV series starring Michael C. Hall, Jennifer Carpenter, and David Zayas. By day, mild-mannered Dexter is a blood-spatter analyst for the Miami police. But at night, he is a serial killer who only targets other murderers.",\n  "datePublished": "2006-10-01",\n  "keywords": "double life,police department,dark secret,homicide,serial murder",\n  "aggregateRating": {\n    "@type": "AggregateRating",\n    "ratingCount": 600003,\n    "bestRating": "10.0",\n    "worstRating": "1.0",\n    "ratingValue": "8.7"\n  },\n  "review": {\n    "@type": "Review",\n    "itemReviewed": {\n      "@type": "CreativeWork",\n      "url": "/title/tt0773262/"\n    },\n    "author": {\n      "@type": "Person",\n      "name": "emvan"\n    },\n    "dateCreated": "2006-10-23",\n    "inLanguage": "English",\n    "name": "The Best Show on TV",\n    "reviewBody": "After four episodes, I\\u0027m ready to proclaim this the best show currently
on TV, one that may someday rank with the likes of _The Sopranos_ and the first season of _Twin Peaks_ as a contender for the second best TV show ever (after the incomparable _Buffy the Vampire Slayer_; one of the show\\u0027s producers and writers is former Buffy writer Drew Z. Greenberg, and the cast includes Buffy / Angel mainstay Julie Benz).\\n\\nDexter is a sociopath, someone with no human feelings and hence no natural, inner moral compass, and he has an unquenchable blood lust that drives him to
kill. But he had the great grace to have been the adopted child of a police officer, who (as we see in terrific flashbacks) successfully instilled in him a complete moral code, which he adheres to on a strictly intellectual level. This is an utterly brilliant concept (which I assume derives from the novels it\\u0027s based on), one that allows the writers to explore the nature of moral behavior and of what it means to be human (Dexter is, in a sense, an alien).\\n\\nAnother thing the show is doing brilliantly is moving at different speeds in parallel. There is a primary apparent season-long story arc (concerning a cat-and-mouse game between Dexter and a serial killer), and a a secondary arc involving Dexter\\u0027s sister\\u0027s police career. The first handful of episodes include a very powerful completed arc concerning one of Dexter\\u0027s police colleagues and a local crime lord, while two of the four episodes so far have also included a self-contained story spliced among (and playing off) the ongoing ones. I\\u0027ve seen the future of TV season structuring, and this is it.\\n\\nWhile the writing isn\\u0027t quite up to the brilliance of the best of _House_, it\\u0027s been excellent. The cast and production are terrific. The only reason you wouldn\\u0027t want to watch this utterly brilliant show is the frequent use of extremely graphic images: there have probably been more severed body parts shown in these first four episodes than in the first four episodes of every other TV show on the air
combined. If you can stomach that, tune it for a mesmerizing look at what makes us human -- or inhuman.",\n    "reviewRating": {\n      "@type": "Rating",\n      "worstRating": "1",\n      "bestRating": "10",\n      "ratingValue": "9"\n    }\n  },\n  "trailer": {\n    "@type": "VideoObject",\n    "name": "Truth",\n    "embedUrl": "/video/imdb/vi2270077465",\n    "thumbnail": {\n      "@type": "ImageObject",\n      "contentUrl": "https://m.media-amazon.com/images/M/MV5BNzYwMTk5MjE5OV5BMl5BanBnXkFtZTcwNjA2MD
I also tried it this way, with the same result:

Code:

Title = re.search(r'<title>(.*) \(', str(html))
print(Title.group(1))
As mentioned, it works in AutoIt, where I get just "Dexter".
I test the patterns with RegexBuddy. I also switched it to Python 3.7 there, and it shows the result I expect as well.

[Image]

A Google search unfortunately didn't lead me to any pages where someone had the same problem either.
Where is my mistake? I'm really at a loss, especially since, as I said, I can't find anything at all on Google, not even something similar where I could see "yeah, maybe I messed that up too".
__deets__
User
Posts: 14543
Registered: Wednesday, 14 October 2015, 14:29

The underlying mistake is using regular expressions. HTML is not a regular language; it is context-free. That is why you parse it with parsers built for the job, such as lxml or BeautifulSoup.
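
To illustrate the parser route without any third-party install, here is a minimal sketch using only the standard library's `html.parser`. The `TitleParser` class and the `sample_html` string are made up for this example (the string is a stand-in, not the real IMDb markup); lxml or BeautifulSoup would do the same job with less code.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only react to the first <title>; later ones are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


# Hypothetical stand-in for a downloaded page, not the real IMDb markup.
sample_html = """<html><head>
<title>Dexter (TV Series 2006–2013) - IMDb</title>
<script>if (typeof uet == 'function') { uet("be"); }</script>
</head><body></body></html>"""

parser = TitleParser()
parser.feed(sample_html)
parser.close()
page_title = parser.title
print(page_title)
```

The parser tracks element boundaries itself, so whatever follows `</title>` in the page cannot leak into the result.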
Sirius3
User
Posts: 17754
Registered: Sunday, 21 October 2012, 17:20

You don't process HTML with regular expressions; you use an HTML parser for that.
The thing is, the problem doesn't occur at all with your example.
What is your actual text, what result do you get, and what result do you expect?
Lachesis580
User
Posts: 3
Registered: Thursday, 20 June 2019, 23:24

OK, I'll have to take a closer look at the HTML parser tomorrow.

That is the actual text. I'm building my own series manager because, as a series junkie, I've almost completely lost track.
Here is the complete test run:

Code:

import re
from urllib.request import urlopen

def FNC_ADDSERIES(_ui, _IMDBID):
    with urlopen('https://www.imdb.com/title/tt0773262/') as h:
        html = h.read()
        Title = re.findall(r'<title>(.*) \(', str(html))
        print(Title[0])
And for RegexBuddy I got the HTML like this:

Code:

import urllib.request

urllib.request.urlretrieve('https://www.imdb.com/title/tt0773262/', 'web.html')
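
A likely explanation for why Python and RegexBuddy disagree: `h.read()` returns `bytes`, and `str()` on a `bytes` object yields its repr, in which every newline becomes the two characters `\` and `n`. The whole page then sits on a single line, so the greedy `.*` can run all the way to the last ` (` in the document. The saved file `web.html` still has real line breaks, which `.` does not cross by default. The sketch below reproduces this with a small made-up byte string (`raw` is a stand-in, not the real IMDb page).

```python
import re

# Hypothetical stand-in for the bytes returned by h.read().
raw = (
    b"<html><head>\n"
    b"<title>Dexter (TV Series 2006-2013) - IMDb</title>\n"
    b"<script>if (x) { f(); }</script>\n"
    b"</head></html>"
)

pattern = r"<title>(.*) \("

# str() on bytes gives the repr "b'...'": newlines turn into a literal
# backslash followed by "n", so the text becomes one long line.
as_repr = str(raw)
greedy_on_repr = re.findall(pattern, as_repr)

# Decoding gives real text with real newlines; "." stops at each of them.
text = raw.decode("utf-8")
greedy_on_text = re.findall(pattern, text)

# A non-greedy quantifier stops at the first " (" even on the one-line repr.
lazy_on_repr = re.findall(r"<title>(.*?) \(", as_repr)

print(greedy_on_repr)
print(greedy_on_text)
print(lazy_on_repr)
```

Decoding the response (for example `html.decode('utf-8')`, assuming the page is UTF-8) instead of calling `str(html)` addresses the cause; the non-greedy `.*?` only makes the pattern less fragile.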
__blackjack__
User
Posts: 13114
Registered: Saturday, 2 June 2018, 10:21
Location: 127.0.0.1

I wouldn't take the title from the HTML <title> element, because it contains additional information. The title appears again on its own *inside* the page.

Code:

In [13]: response = requests.get('https://www.imdb.com/title/tt0773262/')

In [14]: soup = bs4.BeautifulSoup(response.text, 'lxml')

In [15]: soup.title
Out[15]: <title>Dexter (TV Series 2006–2013) - IMDb</title>

In [16]: soup.title.text
Out[16]: 'Dexter (TV Series 2006–2013) - IMDb'

In [17]: soup.find('div', 'title_block').h1.text.strip()
Out[17]: 'Dexter'
snafu
User
Posts: 6741
Registered: Thursday, 21 February 2008, 17:31
Location: Gelsenkirchen

You could also use something ready-made. This one is quite comprehensive: https://pypi.org/project/IMDbPY/
Lachesis580
User
Posts: 3
Registered: Thursday, 20 June 2019, 23:24

I've now had a look at BeautifulSoup, and it worked right away.
Thanks to everyone. And @snafu:

Thanks, I'll bookmark it just in case.
At the moment I'm trying to build something of my own to get more familiar with Python. I've only been working with Python for 3 weeks; before that it was AutoIt all those years (which I still use when things need to be quick and simple).