Das deutsche Python-Forum

Here is an example:

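import codecs

encoding = "utf8"

# Look up the stateless decoder function for the codec once.
unicode_decoder = codecs.getdecoder(encoding)

help(unicode_decoder)

# The decoder function returns a tuple: (decoded string, length consumed).
print unicode_decoder("test1")
print unicode_decoder("BlaBlaBla")

# str.decode() returns only the decoded string.
print "test2".decode("utf8")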

Output:

Help on function decode in module encodings.utf_8:

decode(input, errors='strict')

(u'test1', 5)
(u'BlaBlaBla', 9)
test2

Why does unicode_decoder give me back a tuple with the string and the length? The normal decode() doesn't do that...
All I want is the decoded string... Sure, I can append a [0], but that's just silly...

OK, so this behavior is normal:
http://docs.python.org/lib/codec-objects.html

Decodes the object input and returns a tuple (output object, length consumed). In a Unicode context, decoding converts a plain string encoded using a particular character set encoding to a Unicode object.
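To make the quoted sentence concrete, here is a minimal sketch, assuming Python 3 (where the input is a bytes object rather than a Python 2 str). It shows that the second tuple element counts the input bytes consumed, not the characters produced. The helper decode_text is a hypothetical name, not something the codecs module provides.

```python
import codecs

decoder = codecs.getdecoder("utf8")

# "Mueller" with a u-umlaut: 6 characters, but 7 bytes in UTF-8.
text, consumed = decoder(b"M\xc3\xbcller")
print(len(text), consumed)


def decode_text(raw, _decoder=decoder):
    """Hypothetical helper: drop the length and return only the string."""
    return _decoder(raw)[0]


print(decode_text(b"test1"))
```

Tuple unpacking (text, consumed = ...) is an alternative to the [0] index if the length is of any use to you.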

I'm just wondering whether this is also the fastest way to do it... The docs also say something about efficiency:

The method may not store state in the Codec instance. Use StreamCodec for codecs which have to keep state in order to make encoding/decoding efficient.

But what is StreamCodec???
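As far as I can tell, the codecs module has no class literally named StreamCodec. The stateful counterparts it does offer are codecs.StreamReader / codecs.StreamWriter and, in newer Python versions, incremental decoders. The sketch below, assuming Python 3, shows why state matters: when a multi-byte character is split across two chunks, the stateless decoder function fails on the first chunk, while an incremental decoder remembers the dangling byte. The sample text and the chunk boundary are made up for illustration.

```python
import codecs

data = "Gr\u00fc\u00dfe".encode("utf-8")   # b'Gr\xc3\xbc\xc3\x9fe'
chunks = [data[:3], data[3:]]              # split in the middle of the u-umlaut

# Stateless decoder: the truncated first chunk cannot be decoded on its own.
stateless = codecs.getdecoder("utf-8")
try:
    stateless(chunks[0])
    stateless_failed = False
except UnicodeDecodeError:
    stateless_failed = True

# Incremental decoder: keeps the incomplete byte sequence between calls.
incremental = codecs.getincrementaldecoder("utf-8")()
parts = [incremental.decode(chunk) for chunk in chunks]
parts.append(incremental.decode(b"", final=True))  # flush, checks for leftovers
text = "".join(parts)

print(stateless_failed, parts)
```

For complete strings such as database fields, none of this is needed; the stateless decoder function is enough.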

I need all of this to convert every field read from MySQL to unicode. So it gets called fairly often and should be as efficient as possible...
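A rough sketch of what that conversion could look like, assuming Python 3 and assuming the rows arrive as tuples in which text fields are UTF-8 byte strings. The function name decode_row and the sample row are hypothetical; no real MySQL driver is involved here.

```python
import codecs

# Look the decoder up once at import time and reuse it for every field.
_utf8_decode = codecs.getdecoder("utf8")


def decode_row(row):
    """Return a copy of row with every byte-string field decoded to text.

    Non-string fields (numbers, None, ...) are passed through unchanged.
    """
    return tuple(
        _utf8_decode(value)[0] if isinstance(value, bytes) else value
        for value in row
    )


# Hypothetical row as it might come back from a database cursor.
sample_row = (1, b"M\xc3\xbcller", None, 3.5)
print(decode_row(sample_row))
```

Whether this is necessary at all depends on the database driver; some can be configured to return unicode strings directly.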

jens wrote: Title: codecs.getdecoder() returns a tulip?!?!?

Well, as long as you don't get a rose or even a carnation back...

SCNR

Oh well...

So here's a quick test:


import time, codecs

loops = 1000000
test_text = "Das ist ein doofer Test!"

# Variant 1: fetch the decoder function once, then call it directly.
print "getencodet method...",
encoding = "utf8"
unicode_decoder = codecs.getdecoder(encoding)

start_time = time.time()
for i in xrange(loops):
    tmp = unicode_decoder(test_text)[0]
duration = time.time() - start_time
print duration

# Variants 2-4: str.decode() with three different names for the same codec.
encoding = "utf_8"
print encoding,"...",
start_time = time.time()
for i in xrange(loops):
    tmp = test_text.decode(encoding)
duration = time.time() - start_time
print duration

encoding = "utf8"
print encoding,"...",
start_time = time.time()
for i in xrange(loops):
    tmp = test_text.decode(encoding)
duration = time.time() - start_time
print duration

encoding = "U8"
print encoding,"...",
start_time = time.time()
for i in xrange(loops):
    tmp = test_text.decode(encoding)
duration = time.time() - start_time
print duration

Output:

getencodet method... 1.81200003624
utf_8 ... 2.93799996376
utf8 ... 2.89100003242
U8 ... 2.85899996758

The "normal" variants are apparently noticeably slower (about 2.9 s versus 1.8 s for a million calls), presumably because the codec alias dict is consulted on every call... Next to that, the index access on the tuple hardly seems to matter...
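For anyone who wants to repeat the measurement, here is a sketch of the same comparison using the timeit module, assuming Python 3 (bytes instead of str, and fewer loops to keep the run short). The timings above come from an old Python 2 interpreter, so the ratio may well look different on a current one, and I would not rely on the result without measuring.

```python
import codecs
import timeit

test_bytes = b"Das ist ein doofer Test!"
decoder = codecs.getdecoder("utf8")  # looked up once, reused for every call


def with_cached_decoder():
    return decoder(test_bytes)[0]


def with_decode_method():
    return test_bytes.decode("utf8")  # encoding passed by name on each call


loops = 100000
timings = {
    "cached decoder": timeit.timeit(with_cached_decoder, number=loops),
    "bytes.decode": timeit.timeit(with_decode_method, number=loops),
}
for label, seconds in sorted(timings.items()):
    print("%-15s %.3f s" % (label, seconds))
```

timeit picks a suitable clock and runs the function in a tight loop, which tends to be a bit more reliable than taking time.time() differences by hand.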

Rebecca wrote:
jens wrote: Title: codecs.getdecoder() returns a tulip?!?!?
Well, as long as you don't get a rose or even a carnation back...

SCNR

I had been wondering about that too. You suspect nothing, and all of a sudden there's a tulip on your screen... Still better than having a python jump at you, though.