Problem getting HTML from thepiratebay.se

Sockets, TCP/IP, (XML-)RPC and similar topics belong in this forum
fourseason46
User
Posts: 4
Registered: Wednesday 7 May 2014, 08:27

I'm trying to get the content of the page http://thepiratebay.se/top/201, but the content comes back encoded.
I'm using the code below:

Code: Select all

import urllib2

def get_html(url):
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
    req = urllib2.Request(url, headers=hdr)
    try:
        page = urllib2.urlopen(req)
    except urllib2.HTTPError as e:
        # Print the error body and bail out; without this return,
        # the read() below would raise a NameError.
        print e.fp.read()
        return ''
    return page.read()
Can anyone help me get the HTML of this page: http://thepiratebay.se/top/201?
/me
User
Posts: 3555
Registered: Thursday 25 June 2009, 14:40
Location: Bonn

I'm not quite sure what goes wrong, but if I use requests I get the desired result.

Code: Select all

import requests

r = requests.get('http://thepiratebay.se/top/201')
print(r.status_code)
print(r.encoding)
print(r.text)
BlackJack

@fourseason46: What do you mean by "they encode the content"? I get the HTML one would expect when looking at the page in a browser. HTML and text content is there, and "readable". Not encoded or obfuscated, not even just JavaScript that loads the actual data. Just grab an HTML parser and extract whatever information you need from this page.

The two usual suspects are `lxml.html` or BeautifulSoup4. The former can download the URL itself, a feature which should be used because then the HTML parser also sees the headers of the web server's response, which might contain information about the character encoding.
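For instance, a minimal sketch with `lxml.html` and XPath; note that the `searchResult`/`detLink` markup below is only an assumption about the page's structure, not verified against the real site:

```python
import lxml.html

# Hypothetical HTML standing in for the top-100 page. In practice you could
# call lxml.html.parse('http://thepiratebay.se/top/201') directly, so the
# parser also sees the server's response headers (and declared encoding).
snippet = """
<table id="searchResult">
  <tr><td><a class="detLink" href="/torrent/1">First entry</a></td></tr>
  <tr><td><a class="detLink" href="/torrent/2">Second entry</a></td></tr>
</table>
"""

document = lxml.html.fromstring(snippet)
# Extract the link text of every entry via XPath.
titles = document.xpath('//a[@class="detLink"]/text()')
print(titles)
```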

When fetching the page content with your own code, I'd recommend the external `requests` package instead of the `urllib` and `urllib2` modules from the standard library.

Edit: Depending on which information you want to extract exactly, trying to switch the view from "double" to "single" might be a good idea. Then there is more data in separate table columns instead of grouped together in one cell.
fourseason46
User
Posts: 4
Registered: Wednesday 7 May 2014, 08:27

If I use this (it worked previously):

Code: Select all

import urllib2

def get_html(url):
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}
    req = urllib2.Request(url, headers=hdr)
    try:
        page = urllib2.urlopen(req)
    except urllib2.HTTPError as e:
        # Print the error body and bail out; without this return,
        # the read() below would raise a NameError.
        print e.fp.read()
        return ''
    return page.read()
the result I get:

Code: Select all

[unreadable binary data: the gzip-compressed response body, rendered as mojibake]
Maybe they are blocking me.
But if I use the requests function, the result is OK. @/me: thank you very much.
I think they are blocking me again.
BlackJack

@fourseason46: Ah, if you look at the `requests` response headers you'll find:

Code: Select all

In [23]: r.headers['content-encoding']
Out[23]: 'gzip'
So the server sends the response compressed, and `requests` kindly decompresses it for you, transparently.
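You can do the same decompression by hand on the raw bytes the urllib2 version returned. A minimal sketch in Python 3 terms (the HTML string here is made up; under Python 2 you would wrap the bytes in `StringIO` and use `gzip.GzipFile` instead):

```python
import gzip

# Simulate what the server sends: gzip-compressed HTML bytes.
# (The page content here is made up for illustration.)
raw_body = gzip.compress(b'<html><body>Top 100 Video</body></html>')

# raw_body is what get_html() returned verbatim -- unreadable binary.
# Decompressing it by hand recovers the readable HTML:
html = gzip.decompress(raw_body).decode('utf-8')
print(html)
```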