Das deutsche Python-Forum

I'm trying to get the content of the page http://thepiratebay.se/top/201, but they encode the content.
I tried the code below:

import urllib2


def get_html(url):
    # Browser-like request headers.
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'
           }
    req = urllib2.Request(url, headers=hdr)

    try:
        page = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        # Print the server's error page; `page` is not defined in this case.
        print e.fp.read()
        return ''
    content = page.read()
    return content

Can anyone help me get the HTML of this page, http://thepiratebay.se/top/201?

I'm not quite sure what goes wrong, but if I use requests I get the desired result.

Code:

import requests

r = requests.get('http://thepiratebay.se/top/201')
print(r.status_code)
print(r.encoding)
print(r.text)

@fourseason46: What do you mean by "they encode the content"? I get the HTML one would expect when looking at the page in a browser. The HTML and text content is there and "readable": not encoded or obfuscated, and not just JavaScript that loads the actual data. Just grab an HTML parser and extract whatever information you need from this page.

The two usual suspects are `lxml.html` and BeautifulSoup4. The former can download the URL itself, a feature that should be used, because then the HTML parser also sees the headers of the web server's response, which might contain information about the character encoding.
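As a rough illustration of "grab an HTML parser and extract what you need", here is a minimal sketch. It deliberately swaps in the standard library's `html.parser` for `lxml.html` or BeautifulSoup4 so that it runs without third-party packages; the two recommended libraries are more forgiving with broken markup. The names `LinkCollector` and `extract_links` and the sample markup are made up for this example and do not reflect the real structure of the page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Invented sample markup, only to show the call.
sample = '<table><tr><td><a href="/torrent/1">First</a></td></tr></table>'
print(extract_links(sample))
```

With `lxml.html` or BeautifulSoup4 the same extraction is usually only a few lines, so this sketch is mainly useful for seeing what such a parser does under the hood.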

When fetching the page content with your own code, I'd recommend the external `requests` package instead of the `urllib` and `urllib2` modules from the standard library.

Edit: Depending on exactly which information you want to extract, switching the view from "double" to "single" might be a good idea. Then more of the data is in separate table columns instead of grouped together in one cell.

If I use this code (it worked previously):

Code:

import urllib2


def get_html(url):
    # Browser-like request headers.
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'
           }
    req = urllib2.Request(url, headers=hdr)

    try:
        page = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        # Print the server's error page; `page` is not defined in this case.
        print e.fp.read()
        return ''
    content = page.read()
    return content

this is the result I get:

Code:

‹ì[ksÛ6ýœü [... several kilobytes of unreadable binary output omitted ...]

Maybe they are blocking me, but if I use the requests function, the result is OK. @/me: thank you very much.
I think they are blocking me again.

@fourseason46: Ah, if you look at the `requests` response headers you'll find:

Code:

In [23]: r.headers['content-encoding']
Out[23]: 'gzip'

So the server sends the response compressed, and `requests` kindly and transparently decompresses it for you.
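If you want to stay with the standard library, the same decompression step can be done by hand. The following is a minimal sketch, assuming Python 3, where `urllib.request` replaces `urllib2`; `decode_body` is a helper name invented for this example, and only gzip is handled, since that is the only encoding seen in this thread.

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding):
    """Return the body bytes, undoing gzip if the server says it applied it."""
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def get_html(url):
    # Same role as get_html() in the question, but it checks the
    # Content-Encoding response header instead of assuming plain text.
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req) as page:
        raw = page.read()
        encoding = page.headers.get("Content-Encoding")
    return decode_body(raw, encoding)
```

Note that the result is still bytes; decoding it to text requires the character encoding, which is another thing `requests` works out for you via `r.encoding`.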

Problem with Get html from thepiratebay.se