竹磬网 - 邵珠庆の日记: You only get one life, so use it to do something greater. Make the world a little better and easier.


Python: multithreaded web scraping and parsing with BeautifulSoup

Posted: 24 Jun 2011 04:51 AM PDT

I've been doing some web-page analysis work in Python lately and haven't updated the blog for a while, so here's a catch-up post. The code below uses:

1. Python multithreading

2. The HTML parsing library BeautifulSoup, which is much more powerful than the Python SGMLParser approach I shared earlier; it's worth a look if you're interested.
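Before the threaded version, here is what the parsing step alone looks like. This is a minimal sketch assuming Python 3 and the newer beautifulsoup4 package (the code in this post uses the older Python 2 BeautifulSoup 3 API); the HTML string is made up for illustration:

```python
# Minimal BeautifulSoup parsing sketch (Python 3, beautifulsoup4 package).
from bs4 import BeautifulSoup

html = "<html><head><title>Example Page</title></head><body><p>hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")  # use the stdlib parser backend

print(soup.title.string)                                # the <title> text
print([t.name for t in soup.find_all(["title", "p"])])  # tags matched, in document order
```

In bs4 the method is `find_all` rather than the older `findAll`, and you pass the parser backend explicitly.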

 

#encoding=utf-8
#@description: spider that fetches and parses web pages.

import Queue
import threading
import urllib,urllib2
import time
from BeautifulSoup import BeautifulSoup

hosts = ["http://www.baidu.com", "http://www.163.com"]  # pages to fetch

queue = Queue.Queue()
out_queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()
            proxy_support = urllib2.ProxyHandler({'http':'http://xxx.xxx.xxx.xxxx'})  # proxy IP (placeholder)
            opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
            urllib2.install_opener(opener)

            #fetch the page through the opener installed above
            url = urllib2.urlopen(host)
            chunk = url.read()

            #place chunk into out queue
            self.out_queue.put(chunk)

            #signals to queue job is done
            self.queue.task_done()

class DatamineThread(threading.Thread):
    """Threaded page parser"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs a page chunk from the out queue
            chunk = self.out_queue.get()

            #parse the chunk
            soup = BeautifulSoup(chunk)
            print soup.findAll(['title'])

            #signals to queue job is done
            self.out_queue.task_done()

start = time.time()
def main():

    #spawn a worker thread and pass it the queue instances

    t = ThreadUrl(queue, out_queue)
    t.setDaemon(True)
    t.start()

    #populate queue with data
    for host in hosts:
        queue.put(host)

    dt = DatamineThread(out_queue)
    dt.setDaemon(True)
    dt.start()

    #wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)
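For readers on Python 3, where `Queue` became `queue` and `urllib2` was folded into `urllib.request`, the same two-stage producer/consumer pipeline can be sketched like this. The network fetch is replaced by a local dict lookup so the sketch is self-contained; the page bodies are made up:

```python
# Python 3 sketch of the fetch-queue / parse-queue pipeline (stdlib only).
import queue
import threading

PAGES = {
    "http://www.baidu.com": "<title>baidu</title>",
    "http://www.163.com": "<title>163</title>",
}

in_q = queue.Queue()
out_q = queue.Queue()
results = []

def fetch_worker():
    while True:
        host = in_q.get()           # block until a host is available
        out_q.put(PAGES[host])      # stand-in for urlopen(host).read()
        in_q.task_done()

def parse_worker():
    while True:
        chunk = out_q.get()
        # stand-in for BeautifulSoup(chunk).findAll(['title'])
        results.append(chunk[len("<title>"):-len("</title>")])
        out_q.task_done()

# daemon threads exit with the main thread, like setDaemon(True) above
for target in (fetch_worker, parse_worker):
    threading.Thread(target=target, daemon=True).start()

for host in PAGES:
    in_q.put(host)

in_q.join()   # wait until every host has been fetched
out_q.join()  # wait until every chunk has been parsed
print(sorted(results))  # -> ['163', 'baidu']
```

The `task_done()`/`join()` pairing is what lets the main thread block until both stages have drained, without joining the worker threads themselves.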


To run the program above you need BeautifulSoup installed; the BeautifulSoup documentation is worth reading.

That's it for today's post on multithreaded scraping and parsing with BeautifulSoup in Python. If you hit any problems running it, leave a comment below and we can discuss.

Article link: http://www.cnpythoner.com/post/pythonduoxianchen.html


Posted in 邵珠庆

Comments (16) Trackbacks (0)
  3. Hi, may I ask a BeautifulSoup question? I want to scrape only the content of a page that a browser would actually display; can BeautifulSoup do that? I have no leads at the moment.

  4. Blog author, do you do PHP development? I have a small PHP project; paid work.




No trackbacks yet.