Slow parsing in Python with BS4?
I need to parse 2,000 pages per second. The site allows it and does not ban me. I am using Python + BS4.
But I have run into the problem that bs4 takes a very long time to execute soup = bs(result.content, 'html.parser') - around 250 ms per page.
Is there any way to shorten this time?
Or do I need a different parser?
What high-performance parsers are there for Python?
Scraping tasks are both I/O-bound and CPU-bound (as detailed in the previous answer). On the one hand this creates a lot of problems for beginners; on the other, knowing these features dictates the architecture for you - you simply have no choice :)
Let's now analyze your question in detail.
Everything that does network I/O should be asynchronous, except in cases where you absolutely do not care about execution time - and that is not your case :) The reasons are obvious: external network delays are beyond your control (and internal ones are not much better) - a request for a page can take anywhere from a few milliseconds to a minute (that is rare, but a few seconds is common enough), a spread of 4-5 orders of magnitude. With synchronous single-threaded code, the lion's share of your application's time is spent waiting, which, you must agree, is insulting. Roman Kitaev gave you a code example - there is no need to fixate on that particular implementation; the net is full of examples of modern asynchronous Python code that downloads pages.
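A minimal sketch of that idea, independent of any particular answer here - the example.com URL list and the concurrency limit of 20 are made up purely for illustration:

import asyncio

import aiohttp

# Hypothetical list of pages; replace with the URLs you actually need.
URLS = [f"https://example.com/page/{i}" for i in range(100)]


async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str) -> str:
    # The semaphore caps how many requests are in flight at the same time.
    async with sem, session.get(url) as resp:
        return await resp.text()


async def main() -> None:
    sem = asyncio.Semaphore(20)  # arbitrary limit for this sketch
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, sem, u) for u in URLS))
    print(f"Downloaded {len(pages)} pages")


if __name__ == "__main__":
    asyncio.run(main())

All the waiting on the network overlaps, so a hundred requests take roughly as long as the slowest one rather than the sum of them.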
The CPU-bound part is even simpler - you need to load all the cores evenly and, since we are talking about Python, neutralize the influence of the GIL along the way. This is usually solved with several worker threads or processes - one per core - and a queue (or several) from which the workers take data to process.
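A minimal sketch of that pattern using a process pool (the same approach the answer further down takes with ProcessPoolExecutor); the stand-in pages are generated on the spot just to have something to parse:

from concurrent.futures import ProcessPoolExecutor
import os

from lxml.html import fromstring


def count_links(html: str) -> int:
    # CPU-bound part: build the tree and pull out every <a> element.
    return len(fromstring(html).xpath("//a"))


if __name__ == "__main__":
    # Stand-in pages; in a real scraper these would come from the download side.
    pages = ["<html><body>" + '<a href="/x"></a>' * n + "</body></html>"
             for n in range(1, 9)]
    # One worker per core; executor.map plays the role of the work queue.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        print(list(pool.map(count_links, pages)))  # -> [1, 2, ..., 8]

Each page is parsed in a separate process, so the GIL never becomes a bottleneck.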
Low memory consumption is a sign of good code architecture.
Ordinary web pages weigh little - hundreds of kilobytes - and rarely cause problems, but there are also tasks that involve processing very large XML/HTML documents - tens or hundreds of gigabytes. Even a document of a hundred megabytes can bring plenty of joy. In such cases I resort to stream parsers - they work with a small buffer and call handlers for the content you are interested in. Again, not your case, which is good :)
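A minimal sketch of that stream-parsing approach with lxml's iterparse; the file name huge_dump.xml and the tag name item are hypothetical:

from lxml import etree


def count_items(path: str = "huge_dump.xml", tag: str = "item") -> int:
    # iterparse yields elements as their closing tags are read,
    # so only a small buffer is held in memory at any moment.
    count = 0
    for _, elem in etree.iterparse(path, events=("end",), tag=tag):
        count += 1
        elem.clear()  # drop the element's own content
        # Also drop already-processed siblings kept alive by the parent.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return count


if __name__ == "__main__":
    print(count_items())

Memory stays flat regardless of the document size, at the cost of giving up random access to the tree.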
Oh, how often one hears and reads the claim that "Python is slow"!
Unfortunately, this legend is readily repeated by beginners who have a poor grasp of the Python ecosystem and an even poorer grasp of its internals, although that knowledge comes easily after a couple of months of using the language. The distinctive feature of such critics is the lack of coherent argument and advice along the lines of "write in a normal language! and a normal language is [%random_lang%]". In this thread, as you can see, DarthWazer excelled at that.
Yes, there are plenty of tasks that "pure" Python handles very slowly, but for such cases there is almost always a "battery" - a wrapper around a library written in C - and you will not notice any particular slowdown.
Python is a glue language that allows you to quickly, easily and elegantly connect several low-level tools and get great results.
As an experiment, let's compare Python and Rust (DarthWazer, is that a fast enough language, or should I just have used C#?).
First, save the main page to the file "wiki_front.html" - there is no point in comparing download times.
from lxml.html import parse
import time

if __name__ == '__main__':
    with open('wiki_front.html', 'r') as contents:
        begin = time.monotonic()
        doc = parse(contents)
        links = doc.xpath("//a")
        time_total = time.monotonic() - begin
        print(f'Links counts: {len(links)}, time: {time_total:.9} sec')
pipenv run ./main.py
Links counts: 333, time: 0.00371371489 sec
use std::fs;
use std::time::Instant;

use scraper::{Html, Selector};

fn main() {
    let contents = fs::read_to_string("wiki_front.html")
        .expect("Something went wrong reading the file");

    let begin = Instant::now();
    let document = Html::parse_document(&contents);
    let selector = Selector::parse("a").unwrap();
    let links_count = document.select(&selector).count();
    println!(
        "Links counts: {}, time: {} sec",
        links_count,
        begin.elapsed().as_secs_f32()
    );
}
cargo run --release
Links counts: 333, time: 0.002836701 sec
1. BS is shit.
2. If you really, really want to keep eating the BS cactus, you can at least tell it to use the lxml parser (written in C) instead of html.parser (see the short sketch after the code below).
3. Parsing pages is easily parallelized across cores with a ProcessPoolExecutor.
4. Here is an example of how you can download whatever you want over HTTP without blocking and push the results into a queue that the ProcessPoolExecutor works through. True, the script has no way to stop the crawler, but I think that will not be hard to add. Fast, fashionable, efficient:
import asyncio
from concurrent.futures import ProcessPoolExecutor

import aiohttp
from loguru import logger as loguru
from lxml.html import fromstring

pool = ProcessPoolExecutor()
parser_sem = asyncio.Semaphore(pool._max_workers)
loguru.info(f"CPU workers: {pool._max_workers}")

host = "https://ru.wikipedia.org"
start_from = f"{host}/wiki/Заглавная_страница"

q_d = asyncio.Queue()
q_p = asyncio.Queue()
sem = asyncio.Semaphore(100)
downloaded_urls = set()


class O:
    downloaded = 0
    parsed = 0
    downloading = 0
    down_pending = 0
    waiting_for_download_q = 0


o = O()


async def log_printer(queue_d, queue_p):
    while True:
        loguru.debug(
            f"[PRINTER] to Download: {queue_d.qsize()}, to Parse: {queue_p.qsize()}"
            f" downloaded: {o.downloaded}, parsed: {o.parsed}"
            f" pending: {o.down_pending}, downloading: {o.downloading}"
            f" waiting Q: {o.waiting_for_download_q}"
            f" tasks: {len(asyncio.all_tasks())}"
        )
        await asyncio.sleep(0.33)


def lxml_parse(html):
    try:
        tree = fromstring(html)
        urls = tree.xpath("//a/@href")
        try:
            title = tree.find(".//title").text
        except AttributeError:
            title = "<UNKNOWN>"
        new_urls = []
        for url in urls:
            if url.startswith("/") and not url.startswith("//"):
                new_urls.append(f"{host}{url}")
            elif url.startswith("http"):
                new_urls.append(url)
        return new_urls, title
    except Exception as e:
        loguru.error(f"Parse error: {e}")
        return [], "<ERROR>"


async def parse(html):
    loop = asyncio.get_event_loop()
    urls, title = await loop.run_in_executor(pool, lxml_parse, html)
    o.parsed += 1
    return urls, title


async def start_parse_task(content, queue_d):
    async with parser_sem:
        urls, title = await parse(content)
        # loguru.debug(f"[PARSER]: Parse done {title}")
        o.waiting_for_download_q += 1
        for url in urls:
            if url not in downloaded_urls:
                await queue_d.put(url)
        o.waiting_for_download_q -= 1
        # loguru.debug(f"[PARSER]: Add {len(urls)} to download queue")


async def parser(queue_d, queue_p):
    while True:
        content = await queue_p.get()
        asyncio.create_task(start_parse_task(content, queue_d))


async def downloader(queue_d, queue_p, session):
    while True:
        url = await queue_d.get()
        if url in downloaded_urls:
            continue
        o.down_pending += 1
        async with sem:
            o.down_pending -= 1
            o.downloading += 1
            try:
                async with session.get(url) as resp:
                    downloaded_urls.add(url)
                    # loguru.debug(f"[DOWNLOADER]: got response for {url}")
                    try:
                        text = await resp.text()
                        await queue_p.put(text)
                    except UnicodeDecodeError:
                        pass
                    o.downloaded += 1
            except Exception as e:
                loguru.error(f"Download error: {e}")
            finally:
                o.downloading -= 1


async def main():
    await q_d.put(start_from)
    async with aiohttp.ClientSession() as session:
        ds = []
        for i in range(100):
            ds.append(asyncio.create_task(downloader(q_d, q_p, session)))
        p = asyncio.create_task(parser(q_d, q_p))
        printer = asyncio.create_task(log_printer(q_d, q_p))
        await asyncio.gather(*ds, p, printer)


if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
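On point 2 above: a minimal sketch of swapping BeautifulSoup's backend, reusing the wiki_front.html file from the Python/Rust comparison (exact numbers will differ per machine, but the lxml backend is normally noticeably faster than html.parser):

import time

from bs4 import BeautifulSoup

with open("wiki_front.html", "r", encoding="utf-8") as f:
    html = f.read()

for backend in ("html.parser", "lxml"):  # the second run requires `pip install lxml`
    begin = time.monotonic()
    soup = BeautifulSoup(html, backend)
    links = soup.find_all("a")
    print(f"{backend}: {len(links)} links, {time.monotonic() - begin:.4f} sec")

The same soup API keeps working; only the tree builder underneath changes.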
The site may allow it.
But can your hardware handle that much?
And does your network channel let you transfer that much data per second?
How are you parsing - synchronously, asynchronously, or with threads?
And does the site respond quickly?