Problem with Get html from thepiratebay.se
Posted: Wednesday, 7 May 2014, 08:35
by fourseason46
I am trying to get the content of the page
http://thepiratebay.se/top/201, but they encode the content.
I tried the code below:
Code:
import urllib2


def get_html(url):
    # Browser-like request headers.
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'
           }
    #req = urllib2.Request(url)
    req = urllib2.Request(url, headers=hdr)
    try:
        page = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        # Show the server's error page and stop; `page` is undefined here.
        print e.fp.read()
        return ''
    content = page.read()
    return content
Can anyone help me get the HTML of this page?
http://thepiratebay.se/top/201
Re: Problem with Get html from thepiratebay.se
Posted: Wednesday, 7 May 2014, 08:59
by /me
I'm not quite sure what goes wrong, but if I use
requests I get the desired result.
Code:
import requests
r = requests.get('http://thepiratebay.se/top/201')
print(r.status_code)
print(r.encoding)
print(r.text)
Re: Problem with Get html from thepiratebay.se
Posted: Wednesday, 7 May 2014, 09:09
by BlackJack
@fourseason46: What do you mean by "they encode the content"? I get the HTML one would expect when looking at the page in a browser. The HTML and text content is there, and "readable". It is not encoded or obfuscated, and it is not just JavaScript that loads the actual data. Just grab an HTML parser and extract whatever information you need from this page.
The two usual suspects are `lxml.html` and BeautifulSoup4. The former can download the URL itself, a feature worth using because the HTML parser then also sees the headers of the web server's response, which might contain information about the character encoding.
When fetching the page content with your own code, I'd recommend the external `requests` package instead of the `urllib` and `urllib2` modules from the standard library.
Edit: Depending on exactly which information you want to extract, switching the view from "double" to "single" might be a good idea. Then there is more data in separate table columns instead of grouped together in one cell.
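To illustrate the parsing step without any third-party dependency, here is a rough sketch that uses `html.parser` from the Python 3 standard library as a stand-in for `lxml.html` or BeautifulSoup4, which are more convenient for real work. The sample markup, the class name `CellCollector`, and the helper `extract_rows` are invented for illustration; the layout of the real page will differ.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html_text):
    """Return a list of rows, each a list of cell texts (hypothetical helper)."""
    parser = CellCollector()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Made-up markup, only loosely resembling a listing table.
SAMPLE = """
<table>
  <tr><td><a href="/item/1">First item</a></td><td>10</td><td>2</td></tr>
  <tr><td><a href="/item/2">Second item</a></td><td>7</td><td>1</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

With one value per table column, as in the "single" view mentioned above, each cell ends up as its own list element and needs no further splitting.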
Re: Problem with Get html from thepiratebay.se
Posted: Wednesday, 7 May 2014, 10:05
by fourseason46
If I use this (previously it worked):
Code:
import urllib2


def get_html(url):
    # Browser-like request headers.
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'
           }
    #req = urllib2.Request(url)
    req = urllib2.Request(url, headers=hdr)
    try:
        page = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        # Show the server's error page and stop; `page` is undefined here.
        print e.fp.read()
        return ''
    content = page.read()
    return content
this is the result I get:
Code:
‹ì[ksÛ6ýœü ”;iìáJ|S”"©ãWÒÌ:¶6vãd;HBjŠPIʶÚíß‹)ʱ-9ùîÔ™ŒM÷\‡—Iÿ»ÃÓƒóO£#4-g)ý´üöi-øpãðü}üñüÝ1²Ú&:ÏqVÐ’²§†qt¢!mZ–ó [... rest of the unreadable binary output omitted ...]
Maybe they are blocking me,
but if I use the requests function, the result is OK. @/me: thank you very much.
I think they are blocking me again.
Re: Problem with Get html from thepiratebay.se
Posted: Wednesday, 7 May 2014, 10:22
by BlackJack
@fourseason46: Ah, if you look at the `requests` response headers you'll find:
Code:
In [23]: r.headers['content-encoding']
Out[23]: 'gzip'
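So the server sends the response gzip-compressed, and `requests` kindly and transparently decompresses it for you. Your own function returns the compressed bytes unchanged, which is why the output looks like garbage.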
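For anyone who wants to stay with the standard library, the same decompression can be done by hand. The following is a minimal sketch, written for Python 3 (where `urllib.request` takes the place of `urllib2`); the names `decode_body`, `get_html`, and `GZIP_MAGIC` are made up for illustration and are not part of any library. It checks the `Content-Encoding` response header and, as a fallback, the two-byte magic number that starts every gzip stream.

```python
import gzip
import urllib.request

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, content_encoding=None):
    """Return the body bytes, decompressed if the server gzipped them."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip" or raw.startswith(GZIP_MAGIC):
        return gzip.decompress(raw)
    return raw


def get_html(url):
    """Fetch `url` and return the (decompressed) body as bytes."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as response:
        raw = response.read()
        return decode_body(raw, response.headers.get("Content-Encoding"))


if __name__ == "__main__":
    # Offline demonstration with a locally compressed snippet,
    # so no network access is needed to try the helper.
    sample = gzip.compress(b"<html><body>hello</body></html>")
    print(decode_body(sample, "gzip"))
```

The magic-number fallback is a design choice for servers that compress the body without announcing it in the headers; whether that ever happens for this site is an assumption, not something shown in this thread.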