urllib urlencode not working as expected

da.dom · Sunday, 31 March 2013, 13:19

Hi there,

I'm trying to build a little script that reads a webpage, extracts movie titles, and queries imdb.com.

from urllib2 import urlopen
from lxml.html import parse
from lxml.etree import tostring
from lxml.html import HTMLParser
import socket
import re
import urllib
import os

# Read movie titles
parser = HTMLParser()
....
m.name = tableRow.find("td[@class='col-3']/span/strong/a").text
....

# Read rating
title = urllib.urlencode({'q': m.name.encode('utf-8')})

lxml returns a unicode string (?). urlencode didn't accept that, so I encode it to UTF-8 first.
But "title" then ends up as something like: q=Die+Land%C3%83%C2%A4rztin

A German "ä" is encoded as %C3%83%C2%A4
but should be: %C3%A4

So what's going wrong?

Thanks a lot
Christian N.

Hyperion · Sunday, 31 March 2013, 17:06

First of all, I would encourage you to use Requests; it has a far better API than the built-in urllib modules.

Are you sure that you can pass a UTF-8 byte string as (part of) a URL?

snafu · Sunday, 31 March 2013, 23:11

@da.dom: Why are you suddenly writing in English? You wrote German perfectly well in your previous posts.

BlackJack · Monday, 1 April 2013, 17:11

@da.dom: I can't reproduce this; for me `urlencode()` works as expected:

In [2]: s
Out[2]: u'Land\xe4rztin'

In [3]: print s
Landärztin

In [4]: s.encode('utf-8')
Out[4]: 'Land\xc3\xa4rztin'

In [5]: import urllib

In [6]: urllib.urlencode({'q': s.encode('utf-8')})
Out[6]: 'q=Land%C3%A4rztin'

Maybe your data doesn't look the way you expect or assume‽
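
One possible explanation, sketched under an assumption the thread does not confirm: the page's UTF-8 bytes were decoded as Latin-1 somewhere before the title reached `urlencode()`, so the unicode string already contains "Ã¤" instead of "ä". The variable names below are made up for illustration, and the import fallback is only there so the snippet runs on Python 2 and 3 alike. Checking `repr()` of the title is a quick way to see whether this is what happened.

```python
try:
    from urllib import urlencode        # Python 2, as used in this thread
except ImportError:
    from urllib.parse import urlencode  # Python 3

# The title as it should look after correct decoding.
good = u'Land\xe4rztin'

# ASSUMPTION: the UTF-8 bytes of the page were wrongly decoded as Latin-1,
# which turns the single character u'\xe4' into the two characters u'\xc3\xa4'.
broken = good.encode('utf-8').decode('latin-1')

# repr() reveals the difference even if a terminal would hide it.
print(repr(good))
print(repr(broken))

good_query = urlencode({'q': good.encode('utf-8')})
broken_query = urlencode({'q': broken.encode('utf-8')})
print(good_query)    # q=Land%C3%A4rztin
print(broken_query)  # q=Land%C3%83%C2%A4rztin

# If this is indeed the cause, reversing the wrong decoding repairs the text.
repaired = broken.encode('latin-1').decode('utf-8')
print(repaired == good)
```

If `repr(m.name)` shows `\xc3\xa4` where `\xe4` is expected, the actual fix belongs where the page is parsed (decoding it with the right encoding), not in the `urlencode()` call.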