1. Python Multithreaded Crawler
When crawling data in bulk, working through pages one at a time is often inefficient; this is where multithreading helps. Python supports multithreading out of the box, mainly through the threading module (and the lower-level _thread module, which was called thread in Python 2).
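As a minimal illustration of the threading module before we apply it to crawling (a toy sketch; the worker function and its loop count are made up for this example):

import threading

def worker(n):
    # Toy task: report which thread handled which item
    print("thread %s handling item %d" % (threading.current_thread().name, n))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()   # launch the worker
for t in threads:
    t.join()    # wait for all workers before the main thread continues

start() launches each thread and join() makes the main thread wait for them; this is exactly the pattern the multithreaded crawler below relies on.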
A single-threaded crawler, by comparison, is much slower. For example:
import requests
from bs4 import BeautifulSoup
import time

start_time = time.time()

def main():
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
    for i in range(1, 6):
        # Build the search URL for result page i
        url = "https://so.csdn.net/so/search/s.do?p=" + str(i) + "&q=python"
        s = requests.session()
        html = s.get(url, headers=headers)
        html.encoding = "utf-8"
        r = html.text
        soup = BeautifulSoup(r, "html.parser")
        # Walk every link inside the search-result containers
        for div in soup.find_all('div', class_='limit_width'):
            for a in div.find_all('a'):
                text = a.get_text()
                href = a["href"]
                if "CSDN" not in text:
                    print(text)
                    print(href)

main()
end = time.time()
print(end - start_time)
# Run result:
# ......
# Time-Cost: 2.061112642288208
Next, we try a multithreaded approach to crawl the same content, as shown below:
# coding=utf-8
import threading
import queue
import time
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}

baseUrl = "https://so.csdn.net/so/search/s.do?p="
urlQueue = queue.Queue()
for i in range(1, 6):
    url = baseUrl + str(i) + "&q=python"
    urlQueue.put(url)

def fetchUrl(urlQueue):
    while True:
        try:
            # Read from the queue without blocking; raises queue.Empty once drained
            url = urlQueue.get_nowait()
            i = urlQueue.qsize()
        except queue.Empty:
            break
        try:
            s = requests.session()
            html = s.get(url, headers=headers)
            html.encoding = "utf-8"
            r = html.text
            soup = BeautifulSoup(r, "html.parser")
            # Walk every link inside the search-result containers
            for div in soup.find_all('div', class_='limit_width'):
                for a in div.find_all('a'):
                    text = a.get_text()
                    href = a["href"]
                    if "CSDN" not in text:
                        print(text)
                        print(href)
            print("Page crawled!")
        except Exception:
            pass
        # Processing of the scraped data could go here.
        # To make the timing effect more visible, add a delay:
        # time.sleep(1)

if __name__ == '__main__':
    startTime = time.time()
    print("Main thread:", threading.current_thread().name)
    threads = []
    # Adjust the number of threads to control the crawl speed
    threadNum = 5
    for i in range(0, threadNum):
        # Create one worker thread
        t = threading.Thread(target=fetchUrl, args=(urlQueue,))
        threads.append(t)
    print(threads)
    for t in threads:
        t.start()
    for t in threads:
        # Join each thread in turn so the main thread exits last,
        # without the workers blocking one another
        t.join()
    endTime = time.time()
    print("Main thread finished:", threading.current_thread().name)
    print('Done, Time cost: %s' % (endTime - startTime))
# Run result:
# Main thread: MainThread
# Python游戲開發入門
# https://edu.csdn.net/course/detail/5690
# Python, Python, Python
# https://blog.csdn.net/ww_great/article/details/3057071
# ......
# Page crawled!
# Main thread finished: MainThread
# Time cost: 0.7241780757904053
If we set threadNum = 2, i.e. only two worker threads consuming the same queue, the crawl slows down considerably: running it again gives Time cost: 1.3654978275299072.
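As a side note, the standard library's concurrent.futures offers a higher-level way to get the same effect without managing threads and a queue by hand. A minimal sketch, where fetch_page is a hypothetical stand-in for the fetch-and-parse logic above and max_workers plays the role of threadNum:

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_page(url):
    # Hypothetical stand-in for the fetch-and-parse logic above
    return requests.get(url, timeout=10).status_code

urls = ["https://so.csdn.net/so/search/s.do?p=%d&q=python" % i for i in range(1, 6)]
# max_workers plays the same role as threadNum above
with ThreadPoolExecutor(max_workers=5) as pool:
    for url, status in zip(urls, pool.map(fetch_page, urls)):
        print(url, status)

ThreadPoolExecutor distributes the URLs across the pool, and map returns the results in input order.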