
Applying bs4

by Coarti 2024. 1. 31.

https://berlinstartupjobs.com/engineering

 

Jobs in IT & Software Development | Berlin Startup Jobs

Tech jobs in Berlin for frontend and backend developers specializing in Python, PHP, Ruby on Rails, or Java, iOS & Android Developers, JavaScript Developers, MySQL and HTML/CSS specialists.

berlinstartupjobs.com

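The scraper works in two passes: it first crawls the programming skill keywords, then crawls again using the keywords it collected.

The skill list can either be specified or left unspecified.

If it is specified, only those skills are crawled; otherwise every skill offered on the page is crawled and extracted.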
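
The choice between the two modes can be sketched in isolation. This is only an illustration, not the scraper itself: `resolve_skills` and `fake_fetch_all_skills` are hypothetical names, and the fetcher is a stub standing in for the real page request.

```python
from typing import Callable, List, Optional


def resolve_skills(
    requested: Optional[List[str]],
    fetch_all_skills: Callable[[], List[str]],
) -> List[str]:
    """Return the requested skills, or every skill the page lists if none were given."""
    if requested:
        # Mode 1: crawl only the skills the caller specified.
        return list(requested)
    # Mode 2: fall back to whatever the page itself lists.
    return fetch_all_skills()


def fake_fetch_all_skills() -> List[str]:
    # Stub for illustration; a real version would scrape the skill links.
    return ["python", "java", "php"]


print(resolve_skills(["rust"], fake_fetch_all_skills))
print(resolve_skills(None, fake_fetch_all_skills))
```

Passing the fetcher in as a function keeps the decision logic testable without a network call.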

 

The select_one method that bs4 provides feels more convenient to use than chained find calls.
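
As a rough comparison, the snippet below reaches the same element once with `find` and once with `select_one`. The markup is hand-written for illustration (only the class names are taken from this post), so it makes no claim about the real page structure.

```python
from bs4 import BeautifulSoup

# Hand-written markup for illustration; example.com URLs are placeholders.
html = """
<div class="bjs-jlid__wrapper">
  <h4 class="bjs-jlid__h"><a href="https://example.com/job/1">Backend Engineer</a></h4>
  <a class="bjs-jlid__b" href="https://example.com/company">Example GmbH</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find(): one call per level of nesting.
link_via_find = soup.find("h4", class_="bjs-jlid__h").find("a")

# select_one(): the same path written as a single CSS selector.
link_via_select = soup.select_one("h4.bjs-jlid__h > a")

print(link_via_find["href"] == link_via_select["href"])
print(link_via_select.text)
```

Both calls return the first matching tag, or `None` when nothing matches.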

 

 

import requests
from bs4 import BeautifulSoup

class Scraper:
    def __init__(self):
        self.skills = []

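    def set_skills(self, skills = None):
        # Use the caller-specified skills, if any
        if skills:
            self.skills = skills
            return
        
        # Otherwise collect every skill listed on the engineering page
        response = requests.get('https://berlinstartupjobs.com/engineering',
                                headers = {
                                    'User-Agent':
                                    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36'
                                },
                                timeout = 10  # avoid hanging forever on a stalled connection
                            )
        soup = BeautifulSoup(response.content, 'html.parser')
        # Relies on the skill links being in the fourth <ul class="links"> on the page
        links_box = soup.find_all('ul', class_ = 'links')[3]
        skills_links = links_box.find_all('a')

        for link in skills_links:
            # Strip first so leading whitespace cannot yield an empty first word
            self.skills.append(link.text.strip().split(' ')[0])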


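    def get_resource(self, skill):
        # Fetch the listing page for one skill; returns None if the request fails
        response = requests.get(f'https://berlinstartupjobs.com/skill-areas/{skill}/',
                        headers = {
                            'User-Agent':
                            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36'
                        },
                        timeout = 10  # avoid hanging forever on a stalled connection
                    )

        # 4xx/5xx: no usable page for this skill
        if response.status_code >= 400:
            return None

        return BeautifulSoup(response.content, 'html.parser')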

    def refine(self, data):
        # Extract link, title, company and description from one job card
        link = data.select_one('h4.bjs-jlid__h > a')
        title = link.text.strip()
        company = data.select_one('a.bjs-jlid__b').text.strip()
        description = data.select_one('div.bjs-jlid__description').text.strip()

        info = {
            'link': link['href'],
            'title': title,
            'company': company,
            'description': description
        }

        return info
        
    def get_jobs(self):
        result = {}
        for skill in self.skills:
            soup = self.get_resource(skill)

            # Skip skills whose page could not be fetched (status code >= 400)
            if not soup:
                continue
            jobs = soup.find_all('div', class_= 'bjs-jlid__wrapper')

            jobs_data = []
            for job in jobs:
                jobs_data.append(self.refine(job))
            result[skill] = jobs_data

        return result


skills = [
    "python",
    "typescript",
    "javascript",
    "rust"
]

s = Scraper()
s.set_skills(skills = skills)
result = s.get_jobs()
for key, value in result.items():
    print(key, ':', len(value))
    for job in value:
        print(job)
    print()
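
One thing to watch for: `select_one` returns `None` when nothing matches, so `refine` raises `AttributeError` on a job card that lacks one of the expected elements. A possible guard is sketched below. `text_or_default` is a hypothetical helper, and the card markup is hand-written for illustration.

```python
from bs4 import BeautifulSoup


def text_or_default(node, selector, default=""):
    """Return the stripped text of the first match, or `default` if nothing matches."""
    found = node.select_one(selector)
    return found.get_text(strip=True) if found is not None else default


# Illustrative card that is missing its description element.
card = BeautifulSoup(
    '<div class="bjs-jlid__wrapper">'
    '<h4 class="bjs-jlid__h"><a href="/job/1"> Data Engineer </a></h4>'
    "</div>",
    "html.parser",
)

title = text_or_default(card, "h4.bjs-jlid__h > a")
description = text_or_default(card, "div.bjs-jlid__description", default="(none)")
print(title, "|", description)
```

Whether a missing field should become a default value or cause the whole card to be skipped is a design choice; the sketch only shows the first option.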

bs4(BeautifulSoup)

 


Let's install the modules needed for crawling and get started: pip install requests pip install bs4 https://remoteok.com/ Remote Jobs in Programming, Design, Sales and more #OpenSalaries Looking for a remote job? Remote OK® is the #1 Remote Job Board and has 597

cloakinghost.tistory.com

 
