Python: multithreaded web page fetching and parsing with BeautifulSoup
Posted: 24 Jun 2011 04:51 AM PDT
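I have been doing some web page analysis in Python lately and have not updated the blog in a long time, so I am catching up today. The code below uses:
1 Python multithreading
2 The web page parsing library BeautifulSoup. It is much more powerful than the Python SGMLParser parsing library I shared earlier, so take a look if you are interested.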
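Before the full listing, the queue-and-worker pattern it relies on can be sketched in isolation. This is a minimal sketch, not the author's code: it assumes Python 3 (where the module is named `queue` rather than `Queue`), and `run_workers`, `handle`, and `num_workers` are hypothetical names chosen for illustration.

```python
import queue
import threading


def run_workers(items, handle, num_workers=2):
    """Apply handle() to every item on worker threads and return the results.

    Hypothetical helper for illustration; handle() is assumed not to raise.
    """
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()  # protects the shared results list

    def worker():
        while True:
            item = tasks.get()      # blocks until an item is available
            try:
                value = handle(item)
                with lock:
                    results.append(value)
            finally:
                tasks.task_done()   # tell the queue this item is finished

    # Daemon threads do not keep the interpreter alive when the main thread exits.
    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()

    for item in items:
        tasks.put(item)

    tasks.join()                    # returns once every item has been task_done()
    return results
```

Results arrive in completion order, not input order. Daemon threads are used so that the workers, which block forever on `get()`, do not prevent the program from exiting once `join()` returns; the listing below calls `setDaemon(True)` for the same reason.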
#@description: spider that fetches page content.
import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup

hosts = ["http://www.baidu.com", "http://www.163.com"]  # pages to fetch
queue = Queue.Queue()
out_queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            # grabs host from queue
            host = self.queue.get()
            proxy_support = urllib2.ProxyHandler({'http': 'http://xxx.xxx.xxx.xxxx'})  # proxy IP
            opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
            urllib2.install_opener(opener)
            # grabs urls of hosts and then grabs chunk of webpage;
            # urllib2.urlopen (not urllib.urlopen) uses the opener installed above
            url = urllib2.urlopen(host)
            chunk = url.read()
            # place chunk into out queue
            self.out_queue.put(chunk)
            # signals to queue job is done
            self.queue.task_done()

class DatamineThread(threading.Thread):
    """Threaded page parser"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            # grabs chunk from out queue
            chunk = self.out_queue.get()
            # parse the chunk
            soup = BeautifulSoup(chunk)
            print soup.findAll(['title'])
            # signals to queue job is done
            self.out_queue.task_done()

start = time.time()

def main():
    # spawn a fetcher thread and pass it the queue instances
    t = ThreadUrl(queue, out_queue)
    t.setDaemon(True)
    t.start()
    # populate queue with data
    for host in hosts:
        queue.put(host)
    # spawn a parser thread that consumes the fetched pages
    dt = DatamineThread(out_queue)
    dt.setDaemon(True)
    dt.start()
    # wait on the queues until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)
To run the program above you need to install BeautifulSoup; the BeautifulSoup documentation is worth reading.
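The listing above targets Python 2 and BeautifulSoup 3. For readers on Python 3, the same two-stage pipeline might look roughly like the sketch below. It is an illustration under several assumptions, not a port endorsed by the original post: it uses only the standard library, with a small `html.parser` subclass standing in for BeautifulSoup, it leaves out the proxy setup, and `TitleParser`, `extract_title`, `fetch_page`, and `crawl_titles` are hypothetical names.

```python
import queue
import threading
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Stand-in for BeautifulSoup: collects the text of the first <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def fetch_page(url):
    """Download a page; undecodable bytes are replaced rather than raising."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_titles(urls, fetch=fetch_page):
    """Fetch pages on one thread, parse them on another, return {url: title}."""
    url_queue = queue.Queue()
    page_queue = queue.Queue()
    titles = {}

    def fetcher():
        while True:
            url = url_queue.get()
            try:
                html = fetch(url)
            except Exception:
                html = None                  # record the failure, keep going
            page_queue.put((url, html))      # hand off before task_done()
            url_queue.task_done()

    def parser():
        while True:
            url, html = page_queue.get()
            titles[url] = None if html is None else extract_title(html)
            page_queue.task_done()

    threading.Thread(target=fetcher, daemon=True).start()
    threading.Thread(target=parser, daemon=True).start()
    for url in urls:
        url_queue.put(url)
    url_queue.join()    # every page has been handed to page_queue
    page_queue.join()   # every page has been parsed
    return titles
```

Calling `crawl_titles(["http://www.baidu.com", "http://www.163.com"])` would fetch the two pages from the post. Passing a different `fetch` callable makes it possible to try the pipeline without network access.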
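That is all for today's post on multithreaded page fetching and parsing with Python and BeautifulSoup. If you have trouble running the code, post it in the comments below so we can discuss it.
Article link: http://www.cnpythoner.com/post/pythonduoxianchen.html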
12 May 2012 09:45
Nice, I learned something.
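27 Mar 2012 05:33
Hi, may I ask a question about BeautifulSoup? I want to extract the content of a web page that a browser would actually display, and I am not sure whether BeautifulSoup can do that. I am at a loss at the moment.
26 Mar 2012 14:24
Does the blog author take on PHP development? I have a small PHP project, and it is paid work.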