Reading information about TV series from IMDb

madfrog · Monday, 19 January 2009, 02:11

I have a question about various string operations. I extract the following information from a web page and save it in "Seasons.txt":

View by:

Season 1

Season 1, Episode 1:
Pilot
Season 1, Episode 2:
Hell-A Woman
Season 1, Episode 3:
The Whore of Babylon
Season 1, Episode 4:
Fear and Loathing at the Fundraiser
Season 1, Episode 5:
LOL
Season 1, Episode 6:
Absinthe Makes the Heart Grow Fonder
Season 1, Episode 7:
Girls, Interrupted
Season 1, Episode 8:
California Son
Season 1, Episode 9:
Filthy Lucre
Season 1, Episode 10:
The Devil's Threesome
Season 1, Episode 11:
Turn the Page
Season 1, Episode 12:
The Last Waltz

Season 2

Season 2, Episode 1:
Slip of the Tongue
Season 2, Episode 2:
The Great Ashby
Season 2, Episode 3:
No Way to Treat a Lady
Season 2, Episode 4:
The Raw
the Cooked
Season 2, Episode 5:
Vaginatown
Season 2, Episode 6:
Coke Dick
First Kick
. . .

Now I would like to bring this information into a form that lets me use it directly for renaming files, roughly like this:

Californication - S01E01 - Pilot.avi
Californication - S01E02 - Hell-A Woman.avi
Californication - S01E03 - The Whore of Babylon.avi
Californication - S01E04 - Fear and Loathing at the Fundraiser.avi
Californication - S01E05 - LOL.avi
Californication - S01E06 - Absinthe Makes the Heart Grow Fonder.avi
Californication - S01E07 - Girls, Interrupted.avi
Californication - S01E08 - California Son.avi
Californication - S01E09 - Filthy Lucre.avi
Californication - S01E10 - The Devil's Threesome.avi
Californication - S01E11 - Turn the Page.avi
Californication - S01E12 - The Last Waltz.avi
Californication - S02E01 - Slip of the Tongue.avi
Californication - S02E02 - The Great Ashby.avi
Californication - S02E03 - No Way to Treat a Lady.avi
Californication - S02E04 - The Raw.avi
Californication - S02E05 - Vaginatown.avi
Californication - S02E06 - Coke Dick.avi
Californication - S02E07 - In a Lonely Place.avi
Californication - S02E08 - Going Down and Out in Beverly Hills.avi
Californication - S02E09 - La Ronde.avi
Californication - S02E10 - In Utero.avi
Californication - S02E11 - Blues from Laurel Canyon.avi
Californication - S02E12 - La Petite Mort.avi

For this I use the following code:

Code:

import re

f = open(r"D:\Filme\Californication\seasons.txt", "r")
content = f.read()
f.close()

#-----------------------------------------------------------------------------
# Find the relevant information and build a list of it.
content = re.findall(r"Season [0-9]+\, Episode [0-9]+\: \n.*", content)
content = [element.replace("\n", "- ") for element in content]
content = [element.replace(":", "") for element in content]

#-----------------------------------------------------------------------------
# Build the filenames.
content = [element.replace("Season ", "S") for element in content]
content = [element.replace(", Episode ", "E") for element in content]
for index, element in enumerate(content):
    if element[2] == "E":
        content[index] = element.replace("S", "S0", 1)
for index, element in enumerate(content):
    if element[5] == " ":
        content[index] = element.replace("E", "E0", 1)

#-----------------------------------------------------------------------------
# Write the content back.
f = open(r"D:\Filme\Californication\seasons2.txt", "w")
for element in content:
    f.write("Californication - " + element.strip() + ".avi\n")
f.close()

And now I wanted to ask whether this can be done better somehow. I don't really like the code, and it is also very inflexible. For example, I need two loops just to check whether the numbers are in the format S01 and E01 instead of S1 and E1. Is there a better way? And how can I filter out characters that are not allowed in a file name?

I am very grateful for any suggestions for improvement.

Hyperion · Monday, 19 January 2009, 09:20

You are saving data in a .txt file that you don't need at all! Perhaps you could build a more suitable data structure right while scraping, or even generate this list on the fly? To say more, one would of course have to see the code that does the scraping.

sma · Monday, 19 January 2009, 10:07

The following gives you a list of tuples. You still have to post-process the third element a bit, because it can contain a \n. Some titles in your example span two lines, which made the regular expression somewhat more complicated.

Code:

print re.compile(r"Season (\d+), Episode (\d+):\n(.*?)(?=\nSeason)", re.DOTALL).findall(s)

Stefan

PS: Surely you only need this to title the episodes you personally recorded from RTL2 ;)
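
A minimal sketch of how the one-liner above behaves on a small excerpt of the listing, written for Python 3 rather than the Python 2 used in this thread. The names `sample`, `PATTERN`, `PATTERN_WITH_END` and `clean` are made up for illustration. It also shows one thing to watch for: the lookahead requires a following "Season" line, so the very last episode in the text is not matched. The variant with `\Z` is only one possible way around that and is not part of the original suggestion.

```python
import re

# A short excerpt in the same layout as the listing in the first post.
sample = (
    "Season 2, Episode 3:\n"
    "No Way to Treat a Lady\n"
    "Season 2, Episode 4:\n"
    "The Raw\n"
    "the Cooked\n"
    "Season 2, Episode 5:\n"
    "Vaginatown\n"
    "Season 2, Episode 6:\n"
    "Coke Dick\n"
)

# The pattern from the post above.
PATTERN = re.compile(r"Season (\d+), Episode (\d+):\n(.*?)(?=\nSeason)", re.DOTALL)

# Assumed variant: also accept the end of the text as a terminator.
PATTERN_WITH_END = re.compile(
    r"Season (\d+), Episode (\d+):\n(.*?)(?=\nSeason|\Z)", re.DOTALL
)


def clean(matches):
    """Cast the numbers and join multi-line titles with single spaces."""
    return [(int(s), int(e), " ".join(t.split())) for s, e, t in matches]


episodes = clean(PATTERN.findall(sample))
episodes_with_last = clean(PATTERN_WITH_END.findall(sample))
```

With this sample, `episodes` ends at episode 5, while `episodes_with_last` also contains episode 6.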

Hyperion · Monday, 19 January 2009, 10:13

sma wrote: PS: Surely you only need this to title the episodes you personally recorded from RTL2

*fg* But of course

Although renaming by hand should be considerably faster here than writing the code around it

But OK, maybe it's just an exercise, or the code can then be applied almost unchanged to other series ...

rayo · Monday, 19 January 2009, 10:25

Hi

I would do it like this:

Code:

content = content.split('\n')
for x,line in enumerate(content):
    match = re.match(r"Season ([0-9]+)\, Episode ([0-9]+)\:", line)
    if match:
        season, episode = int(match.group(1)), int(match.group(2))
        
        # Scan title (because of multiline titles)
        title = []
        x += 1
        while x < len(content) and content[x] and not content[x].startswith('Season'):
            title.append(content[x].strip())
            x += 1
        title = ' '.join(title)
        
        print 'Californication - S%02dE%02d - %s.avi' % (season, episode, title)

I think your (madfrog's) code has too many list comprehensions over the list content. I would turn them into a single for loop in which you do all the replacements, instead of a separate loop for each replace.

Regards
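
The two fix-up loops from the original question become unnecessary once the season and episode numbers are integers, because `%02d` already pads to two digits. A minimal Python 3 sketch of just that step; the function name `episode_filename` and its `extension` parameter are hypothetical:

```python
def episode_filename(series, season, episode, title, extension="avi"):
    """Build a name such as 'Californication - S01E02 - Hell-A Woman.avi'."""
    # %02d pads an integer with a leading zero up to a width of two digits;
    # numbers that already have two or more digits are left unchanged.
    return "%s - S%02dE%02d - %s.%s" % (series, season, episode, title, extension)


name = episode_filename("Californication", 1, 2, "Hell-A Woman")
```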

madfrog · Monday, 19 January 2009, 10:43

sma wrote: PS: Surely you only need this to title the episodes you personally recorded from RTL2

Yes. Although this is still a fairly simple example. For other series, the number of seasons alone is already considerably higher. Besides, this is of course also part of a larger project that is supposed to get a UI and so on later. So it also has a learning effect for me.

Thanks for the answers so far. Here is the code for extracting the information from IMDb:

Code:

from urllib import urlopen
from HTMLParser import HTMLParser

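class Scraper(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_h3 = False
        # Per-instance list; a class-level list would be shared by all instances.
        self.chunks = []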

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True

    def handle_data(self, data):
        if self.in_h3:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

    def results(self):
        file = open(r"D:\Filme\Californication\seasons.txt", "w")
        file.write("\n".join(self.chunks))
        file.close()
        
html = urlopen(r"http://www.imdb.com/title/tt0904208/episodes").read()
parser = Scraper()
parser.feed(html)
parser.close()    # flush any buffered data before writing the results
parser.results()
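
Following Hyperion's remark about skipping the intermediate text file, here is a rough Python 3 sketch (the thread itself uses Python 2, where the module is called `HTMLParser`) that keeps the `<h3>` text in memory. The class name `H3Collector` and the sample markup are made up for illustration; the real IMDb page may be structured differently, and no network access is involved here.

```python
from html.parser import HTMLParser


class H3Collector(HTMLParser):
    """Collect every piece of text that appears inside <h3> elements."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

    def handle_data(self, data):
        if self.in_h3:
            self.chunks.append(data)


# Hypothetical markup, only loosely modelled on an episode list.
sample_html = (
    '<h3>Season 1, Episode 1: <a href="#">Pilot</a></h3>'
    "<p>ignored</p>"
    '<h3>Season 1, Episode 2: <a href="#">Hell-A Woman</a></h3>'
)

collector = H3Collector()
collector.feed(sample_html)
collector.close()  # flush buffered text before reading the result
chunks = collector.chunks
```

The list `chunks` can then be handed straight to the parsing step instead of being joined and written to "seasons.txt" first.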

Edit:

sma wrote:
Code:
print re.compile(r"Season (\d+), Episode (\d+): \n(.*?)(?=\nSeason)", re.DOTALL).findall(s)

Thanks ... that line does it all!

madfrog · Monday, 19 January 2009, 19:37

For the sake of completeness, here once more is the implementation that builds the file names correctly:

Code:

#-----------------------------------------------------------------------------
# Import auxiliary modules.
import re

#-----------------------------------------------------------------------------
# Define some constants as a dictionary.
# Is this the best way?!
const = {"INPUTFILE" : r"D:\Filme\Californication\seasons.txt",
         "OUTPUTFILE": r"D:\Filme\Californication\seasons2.txt",
         "SERIES"    : "Californication"}

#-----------------------------------------------------------------------------
# Read the file and buffer its content.
with open(const["INPUTFILE"], "r") as f:
    content = f.read()

#-----------------------------------------------------------------------------
# Build a list of the relevant information. List contains tuples (S, E, T):
# S = season number (as a string),
# E = episode number (as a string),
# T = title of episode (contains newline sequence for multiline titles).
# The pattern is a raw string so that "\n" reaches the regex engine as an
# escape; a literal newline character would be ignored in verbose mode.
pattern = \
    r"""                            # in a verbose pattern, whitespace must
                                    # be marked by \s or "\ "
    Season\s(\d+),\s                # subpattern(0) holds S
    Episode\s(\d+):\s\n             # subpattern(1) holds E
    (.*?)                           # subpattern(2) holds T
    (?=\nSeason|\nRelated)          # ... even if T spans several lines
    """
pattern = re.compile(pattern, re.DOTALL|re.VERBOSE)
content = pattern.findall(content)

#-----------------------------------------------------------------------------
# Fix the tuples ... cast the numbers, remove newline sequences
# TODO: Filter out all illegal characters ("\/:*?"<>|") for a filename.
for index, element in enumerate(content):
    content[index] = (int(element[0]),
                      int(element[1]),
                      element[2].replace("\n", ""))

#-----------------------------------------------------------------------------
# Build the filenames and write them to the file.
with open(const["OUTPUTFILE"], "w") as f:
    for element in content:
        f.write(const["SERIES"] + " - S%02dE%02d - %s.avi\n" % (element[0], \
                element[1], element[2].replace("  ", " ")))
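
The TODO about illegal characters is left open in the thread. One possible approach, sketched here in Python 3 under the assumption that simply dropping the characters is acceptable, is to strip them with `str.translate`. The names `ILLEGAL_CHARS` and `sanitize_title` are hypothetical; the character list is the one from the TODO comment.

```python
# Characters taken from the TODO comment above (not allowed in Windows file names).
ILLEGAL_CHARS = '\\/:*?"<>|'


def sanitize_title(title):
    """Drop the illegal characters and collapse the whitespace left behind."""
    # Mapping a code point to None makes str.translate delete that character.
    table = {ord(char): None for char in ILLEGAL_CHARS}
    return " ".join(title.translate(table).split())


cleaned = sanitize_title('Who? Me: "Yes"')
```

Replacing the characters with something like "-" instead of deleting them would be an equally reasonable choice, depending on how the names should read.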

derdon · Monday, 19 January 2009, 19:42

And for the sake of completeness, the question: do you know the with statement? See also the wiki page Tutorial/with and PEP 343.

madfrog · Monday, 19 January 2009, 19:49

Not yet

But thanks for the hint, I will improve it right away.
Question: does that let me do without the file.close() calls?

derdon · Monday, 19 January 2009, 19:56

Yes. Best have a look at the example on opening files on the wiki page Tutorial/with.
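
A small Python 3 sketch of what the with statement does for files: the file is closed automatically when the block is left, even if an exception occurs inside it. The temporary path and the variable names are only there to make the example self-contained.

```python
import os
import tempfile

# Use a throwaway directory so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "seasons.txt")

with open(path, "w") as f:
    f.write("Season 1, Episode 1:\nPilot\n")
    open_inside = not f.closed   # still open inside the block

closed_after = f.closed          # closed automatically, no f.close() needed

with open(path) as f:
    content = f.read()
```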

BlackJack · Tuesday, 20 January 2009, 20:36

@madfrog: You do know that there is a ready-made module (IMDbPY) for querying information from IMDb!?

madfrog · Tuesday, 20 January 2009, 22:25

Thanks for the info, BlackJack. I reimplemented it right away. Here is the result:

Code:

#-----------------------------------------------------------------------------
# Import auxiliary modules.
from imdb import IMDb

#-----------------------------------------------------------------------------
# Define some constants as a dictionary.
const = {"OUTPUTFILE": r"D:\Filme\Californication\seasons.txt",
         "SERIES"    : "Californication"}

#-----------------------------------------------------------------------------
# Search for the series and retrieve the current information on the episodes.
IMDBHelper = IMDb()
series = IMDBHelper.search_movie(const["SERIES"])[0]
IMDBHelper.update(series, "episodes")

#-----------------------------------------------------------------------------
# Build the filenames and write them to the file.
# TODO: Filter out all illegal characters ("\/:*?"<>|") for a filename.
with open(const["OUTPUTFILE"], "w") as f:
    for season in series["episodes"].iteritems():
        for episode in season[1].iteritems():
            f.write("%s - S%02dE%02d - %s.avi\n" % (const["SERIES"],
                    season[0], episode[0], episode[1]))

BlackJack · Tuesday, 20 January 2009, 22:35

The backslash in the second-to-last line is not necessary. As long as there are still open parentheses, the "logical" line continues without you having to say so explicitly.

And do you really have to walk the whole hierarchy of the data structure once more for the last value in the tuple? Or rather: if you always use the key only to get at the corresponding value, wouldn't it be possible to get both at once with `iteritems()`?
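
Both points can be illustrated with a plain nested dictionary. This is a Python 3 sketch, so it uses `items()` where the Python 2 code in this thread uses `iteritems()`. The sample data is a stand-in with plain strings as titles, which is an assumption; the values actually returned by IMDbPY may be richer objects.

```python
# Stand-in for the nested structure: {season: {episode: title}}.
episodes = {
    1: {1: "Pilot", 2: "Hell-A Woman"},
    2: {1: "Slip of the Tongue"},
}

lines = []
# items() yields key and value together, so no second lookup is needed.
for season, season_episodes in sorted(episodes.items()):
    for number, title in sorted(season_episodes.items()):
        # The open parenthesis keeps the logical line going; no backslash needed.
        lines.append("%s - S%02dE%02d - %s.avi" % (
            "Californication", season, number, title))
```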

madfrog · Tuesday, 20 January 2009, 23:46

Yes, you are right, it is simpler and probably easier to understand. I didn't know iteritems() until now. It is also an odd structure that IMDbPY returns there: a dictionary nested inside a dictionary, which in turn is the value for the key "episodes".
I have improved it.