How do I parse both the title and the description of a vacancy from Habr in Python?

Python

Katerina92_lomova, 2021-10-13 08:25:29

The task is to parse vacancies from Habr and sort them by type.
The code has a list of keywords: if a vacancy title contains one of them, the vacancy goes into the corresponding list.
If it contains none of them, the vacancy goes into a separate list.

How can I implement the following?
If a vacancy contains none of the required words, I want to add to the list not only its description, for example, but also its title, that is, the text of a second tag from the same page.

<source lang="python">
import requests
from bs4 import BeautifulSoup as bs

num_of_page = 40
other_vacancies = []  # vacancies that match no pattern end up here
collected_data = [
    {'pattern': ['angular'], 'result': []},
    {'pattern': ['react'], 'result': []},
    {'pattern': ['vue', 'js'], 'result': []},
]

for i in range(num_of_page):
    URL = "https://career.habr.com/vacancies?divisions[]=frontend&page=" + str(i + 1) + "&type=all"
    page = requests.get(URL)
    soup = bs(page.text, "html.parser")
    vacancies_names = soup.find_all('a', class_='vacancy-card__title-link')

    for name in vacancies_names:
        title = name.get_text().lower()
        pattern_found = False  # reset once per vacancy, before checking the groups
        for data in collected_data:
            # the first group whose keyword occurs in the title wins
            if any(x in title for x in data['pattern']):
                data['result'].append(name.get_text())
                pattern_found = True
                break
        if not pattern_found:
            other_vacancies.append(name.get_text())
</source>
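
One way to keep the title and the description together is to walk the vacancy cards instead of the title links and read both tags from each card. This is only a minimal sketch: it assumes each vacancy sits in a `div` with class `vacancy-card` and that the description is in a `vacancy-card__skills` block. Both class names are guesses (only `vacancy-card__title-link` appears in the code above), the helper names `parse_cards` and `split_by_patterns` are hypothetical, and the sample HTML is made up so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; only 'vacancy-card__title-link' comes from
# the question, the other class names are assumptions about the page.
SAMPLE_HTML = """
<div class="vacancy-card">
  <a class="vacancy-card__title-link" href="/vacancies/1">Frontend developer (React)</a>
  <div class="vacancy-card__skills">JavaScript, React, Redux</div>
</div>
<div class="vacancy-card">
  <a class="vacancy-card__title-link" href="/vacancies/2">Layout designer</a>
  <div class="vacancy-card__skills">HTML, CSS</div>
</div>
"""


def parse_cards(html):
    """Return a list of {'title', 'description'} dicts, one per vacancy card."""
    soup = BeautifulSoup(html, "html.parser")
    cards = []
    for card in soup.find_all("div", class_="vacancy-card"):
        title_tag = card.find("a", class_="vacancy-card__title-link")
        descr_tag = card.find("div", class_="vacancy-card__skills")
        if title_tag is None:
            continue  # not a vacancy card after all
        cards.append({
            "title": title_tag.get_text(strip=True),
            "description": descr_tag.get_text(" ", strip=True) if descr_tag else "",
        })
    return cards


def split_by_patterns(cards, groups):
    """Put each card into the first group whose keyword is in its title.

    Cards that match no group are returned with both title and description.
    """
    others = []
    for card in cards:
        title = card["title"].lower()
        for group in groups:
            if any(word in title for word in group["pattern"]):
                group["result"].append(card)
                break
        else:  # no break: no group matched this card
            others.append(card)
    return others


groups = [
    {"pattern": ["angular"], "result": []},
    {"pattern": ["react"], "result": []},
]
other_vacancies = split_by_patterns(parse_cards(SAMPLE_HTML), groups)
```

With real pages, `parse_cards(page.text)` would replace the `find_all` call on title links, once the actual class names have been checked in the browser's inspector.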

1 answer(s)

Vladislav Orlov, 2021-10-13
@haveacess

It's easy. Why scrape the DOM when there is an API? Just open the Network tab in your browser's developer tools and click through the pages to see the requests it makes.
The only thing passed in the request headers is an X-CSRF-Token, and that is just as easy to pull out with a regular expression or an ordinary DOM lookup.
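
Pulling the token out with a regular expression could look roughly like the sketch below. It assumes the token is exposed in a `<meta name="csrf-token" content="...">` tag, which is a common convention but is not confirmed here for Habr, so check the page source first. The sample HTML and the helper names `extract_csrf_token` and `build_headers` are made up for illustration, and the API endpoint itself is not shown because it has to be looked up in the Network tab.

```python
import re

# Made-up page fragment; the real markup may differ.
SAMPLE_HTML = (
    '<head>'
    '<meta name="csrf-param" content="authenticity_token" />'
    '<meta name="csrf-token" content="abc123==" />'
    '</head>'
)

# Assumption: the token sits in a meta tag named "csrf-token".
CSRF_RE = re.compile(r'<meta\s+name="csrf-token"\s+content="([^"]+)"')


def extract_csrf_token(html):
    """Return the token string, or None if the meta tag is not found."""
    match = CSRF_RE.search(html)
    return match.group(1) if match else None


def build_headers(html):
    """Build the header dict to send along with the API request."""
    token = extract_csrf_token(html)
    return {"X-CSRF-Token": token} if token else {}


headers = build_headers(SAMPLE_HTML)
```

The resulting dict can then be passed as `headers=` to `requests.get` when calling whatever endpoint shows up in the Network tab.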