
Scraping BOSS直聘 (BOSS Zhipin) job listings with Python


1. Preparing some shared helper methods

Get a database connection:

            
import pymysql

# Return a database connection object for the given database name
def getConnect(database):
    DATABASE = {
        'host': 'localhost',
        'database': database,
        'user': 'root',
        'password': '123456'
    }
    return pymysql.connect(**DATABASE)
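
A minimal connectivity check, assuming a local MySQL server and a database named reptile (the one used later in this article):

# Sketch: open a connection and confirm the server responds
db = getConnect('reptile')
cursor = db.cursor()
cursor.execute('SELECT VERSION()')
print(cursor.fetchone())
cursor.close()
db.close()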

            
          

Get a page soup object:

            
import requests
from bs4 import BeautifulSoup

# Convert an HTML string into a BeautifulSoup object
def to_soup(html):
    return BeautifulSoup(html, 'lxml')

# Fetch a page by url and header, and return its soup object
def get_soup(url, header):
    response = requests.get(url, headers=header)
    return to_soup(response.text)
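
A quick sketch of how these two helpers can be used; the URL and the trimmed user-agent below are placeholders for illustration, not the real crawl target:

test_header = {'user-agent': 'Mozilla/5.0'}
soup = get_soup('https://example.com', test_header)
print(soup.title.get_text())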

            
          

2. Scraping Python-related positions from BOSS Zhipin

Define the job-info object:

            
# Simple container for one job posting
class WorkInfo:
    def __init__(self, title, salary, site, experience, education, job_url, company, release_date, get_date):
        self.title = title
        self.salary = salary
        self.site = site
        self.experience = experience
        self.education = education
        self.job_url = job_url
        self.company = company
        self.release_date = release_date
        self.get_date = get_date

            
          

Collect job information into a list of the objects defined above:

            
import re
import time

# rep and tool are assumed to be the modules holding the get_soup and toContent helpers shown elsewhere in this article
# Collect the job information on one listing page into a list of WorkInfo objects
def getWorkInfos(url, header):
    # Get the page soup object
    htmlSoup = rep.get_soup(url, header)
    workInfos = []
    # Get the list of job blocks on the page
    job_infos = htmlSoup.find_all('div', class_='job-primary')
    if len(job_infos) == 0:
        print('Reached an empty page!')
        return workInfos
    # Walk each block and extract the details
    print('Start scraping page data!')
    for job_info_soup in job_infos:
        # Job title
        title = job_info_soup.find('div', class_='job-title').get_text()
        # Salary
        salary = job_info_soup.find('span', class_='red').get_text()
        infos = str(job_info_soup.find('p'))
        infosList = tool.toContent(infos)
        # Work location
        site = infosList[0]
        # Required experience
        experience = infosList[1]
        # Required education
        education = infosList[2]
        # Link to the job detail page
        job_url = job_info_soup.find('a').get('href')
        # Company name
        company = job_info_soup.find('div', class_='company-text').find('a').get_text()
        # Release date; the first three characters (the site's prefix) are stripped
        release_date = job_info_soup.find('div', class_='info-publis').find('p').get_text()[3:]
        # Normalise the release date into a YYYY-MM-DD string the database accepts
        if '昨' in release_date:
            # "yesterday": one day (86400 seconds) before now
            release_date = time.strftime("%Y-%m-%d", time.localtime(time.time() - 86400))
        elif ':' in release_date:
            # a clock time means the job was posted today
            release_date = time.strftime("%Y-%m-%d")
        else:
            # "X月Y日": replace 月/日 with '-', drop the trailing '-', and prepend the current year
            release_date = str(time.localtime().tm_year) + '-' + re.sub(r'[月,日]', '-', release_date)[:-1]
        # Time at which the data was scraped
        get_date = time.strftime("%Y-%m-%d  %H:%M:%S")
        workInfo = WorkInfo(title, salary, site, experience, education, job_url, company, release_date, get_date)
        workInfos.append(workInfo)
    print('Finished scraping page data!')
    return workInfos

            
          

Store the collected job information in the database:

            
# Store the scraped WorkInfo objects in the database
# (database is the module holding the getConnect helper from section 1)
def toDatabase(workInfos):
    print('Start writing to the database')
    db = database.getConnect('reptile')
    cursor = db.cursor()
    sql = "INSERT INTO `work_info` (`title`, `salary`, `site`, `experience`, `education`, `job_url`, `company`, `release_date`, `get_date`)" \
          " VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)"
    for workInfo in workInfos:
        # Let the driver quote the values; this avoids breakage (and SQL injection) when the scraped text contains quotes
        cursor.execute(sql, (workInfo.title, workInfo.salary, workInfo.site, workInfo.experience,
                             workInfo.education, workInfo.job_url, workInfo.company,
                             workInfo.release_date, workInfo.get_date))
    cursor.close()
    db.commit()
    db.close()
    print('Finished writing to the database!')
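
The article never shows the table itself. A minimal setup sketch for a `work_info` table, with column names taken from the INSERT above and column types guessed, might look like this:

# Hypothetical one-off setup: create the work_info table (column types are assumptions)
db = database.getConnect('reptile')
cursor = db.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS `work_info` (
    `id` INT AUTO_INCREMENT PRIMARY KEY,
    `title` VARCHAR(255),
    `salary` VARCHAR(64),
    `site` VARCHAR(255),
    `experience` VARCHAR(64),
    `education` VARCHAR(64),
    `job_url` VARCHAR(512),
    `company` VARCHAR(255),
    `release_date` DATE,
    `get_date` VARCHAR(32)
)
""")
db.commit()
cursor.close()
db.close()

Here get_date is kept as plain text because the scraper formats it with a double space between date and time; adjust the types to taste.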

            
          

The main crawl loop:

            
url = 'https://www.zhipin.com/c101270100/?'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'referer': 'https://www.zhipin.com/c101270100/?query=python&page=2&ka=page-2',
    'cookie':'lastCity=101270100; _uab_collina=155876824002955006866925; t=DPiicXvgrhx7xtms; wt=DPiicXvgrhx7xtms; sid=sem_pz_bdpc_dasou_title; __c=1559547631; __g=sem_pz_bdpc_dasou_title; __l=l=%2Fwww.zhipin.com%2F%3Fsid%3Dsem_pz_bdpc_dasou_title&r=https%3A%2F%2Fsp0.baidu.com%2F9q9JcDHa2gU2pMbgoY3K%2Fadrc.php%3Ft%3D06KL00c00fDIFkY0IWPB0KZEgsZb1OwT00000Kd7ZNC00000JqHYFm.THdBULP1doZA80K85yF9pywdpAqVuNqsusK15yF9m1DdmWfdnj0sm1PhrAf0IHYYnD7aPH9aPRckwjRLrjbsnYfYfWwaPYwDnHuDfHcdwfK95gTqFhdWpyfqn1czPjmsPjnYrausThqbpyfqnHm0uHdCIZwsT1CEQLILIz4lpA-spy38mvqVQ1q1pyfqTvNVgLKlgvFbTAPxuA71ULNxIA-YUAR0mLFW5HRvnH0s%26tpl%3Dtpl_11534_19713_15764%26l%3D1511867677%26attach%3Dlocation%253D%2526linkName%253D%2525E6%2525A0%252587%2525E5%252587%252586%2525E5%2525A4%2525B4%2525E9%252583%2525A8-%2525E6%2525A0%252587%2525E9%2525A2%252598-%2525E4%2525B8%2525BB%2525E6%2525A0%252587%2525E9%2525A2%252598%2526linkText%253DBoss%2525E7%25259B%2525B4%2525E8%252581%252598%2525E2%252580%252594%2525E2%252580%252594%2525E6%252589%2525BE%2525E5%2525B7%2525A5%2525E4%2525BD%25259C%2525EF%2525BC%25258C%2525E6%252588%252591%2525E8%2525A6%252581%2525E8%2525B7%25259F%2525E8%252580%252581%2525E6%25259D%2525BF%2525E8%2525B0%252588%2525EF%2525BC%252581%2526xp%253Did(%252522m3224604348_canvas%252522)%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FDIV%25255B1%25255D%25252FH2%25255B1%25255D%25252FA%25255B1%25255D%2526linkType%253D%2526checksum%253D8%26wd%3Dboss%25E7%259B%25B4%25E8%2581%2598%26issp%3D1%26f%3D3%26ie%3Dutf-8%26rqlang%3Dcn%26tn%3Dbaiduhome_pg%26oq%3D%2525E5%25258D%25259A%2525E5%2525AE%2525A2%2525E5%25259B%2525AD%26inputT%3D9649%26prefixsug%3Dboss%26rsp%3D0&g=%2Fwww.zhipin.com%2F%3Fsid%3Dsem_pz_bdpc_dasou_title; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1558768262,1558768331,1559458549,1559547631; JSESSIONID=A0FC9E1FD0F10E42EAB681A51AC459C7;'
             ' __a=86180698.1558768240.1559458549.1559547631.63.3.6.6; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1559551561'
}
query='python'
page=1
# Crawl successive result pages until an empty page is returned
while True:
    print("Starting page {}".format(page))
    purl = url + 'query=' + query + '&page=' + str(page) + '&ka=page-' + str(page)
    workInfos = getWorkInfos(purl, header)
    if len(workInfos) == 0:
        print('Scraping finished!')
        break
    toDatabase(workInfos)
    page = page + 1

            
          

3. Small techniques involved

A homemade helper that strips HTML tags and collects the text fragments between them into a list:

            
import re

# Use a regular expression to strip HTML tags and return the list of text fragments between them
def toContent(html):
    infos = re.split('<[^>]*>', html)
    # Drop the empty elements
    return list(filter(None, infos))
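
For example, applied to the kind of `<p>` block that holds location, experience and education (the markup here is illustrative, not copied from the site):

print(toContent('<p>成都<em class="vline"></em>3-5年<em class="vline"></em>本科</p>'))
# ['成都', '3-5年', '本科']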

            
          

Time-related operations

Replace '月' and '日' with '-':

re.sub(r'[月,日]', '-', release_date)

Get the previous day:

release_date=time.strftime("%Y-%m-%d",time.localtime(time.time()-86400))
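
A quick check of both operations; the sample date string is just for illustration:

import re
import time

# "5月20日" -> "5-20-"; the crawler then trims the trailing '-' and prepends the year
print(re.sub(r'[月,日]', '-', '5月20日'))
# Yesterday's date: subtract one day (86400 seconds) from the current time
print(time.strftime("%Y-%m-%d", time.localtime(time.time() - 86400)))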

