How to find all URLs and URIs inside HTML using Python?


Sergey Eremin, 2021-12-09 19:07:04

Python


Strictly speaking, the URLs also need to be rewritten and wrapped (I'm trying to build something like a web proxy on Django), but that is the easy part. First I need to find every URL and URI. For simplicity, I will call all of them URLs.

I know about BeautifulSoup and its ability to parse (and modify) HTML. In practice, though, this task is too much for it.

For starters, a URL can occur not only in <a href="URL">something</a> or <img src="URL" />, but also in <link href="URL" />, in <script src="URL"></script>, in <iframe src="URL" ...>, in styles (for example, background-image:url(URL) or @import url(URL)), in inline SVG (for example, <a xlink:href="URL">), and so on.

In addition, URLs can appear in structures like

<object data="URL" type="image/svg+xml" ...></object>

...and sometimes BeautifulSoup finds something that looks like a URL but isn't one (for example, the embedded data in <img src='data:image/jpeg;base64,...'>).
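Candidates such as a data: URI can be filtered out by scheme before any rewriting. This is a minimal sketch, not part of the original post: the helper name `is_fetchable` and the `SKIP_SCHEMES` set are assumptions chosen for illustration, and the set is certainly not exhaustive.

```python
from urllib.parse import urlsplit

# Schemes that do not point at a resource a proxy would fetch (assumed list).
SKIP_SCHEMES = {"data", "javascript", "mailto", "about"}

def is_fetchable(candidate):
    """Return True if the attribute value looks like something a proxy should rewrite."""
    candidate = candidate.strip()
    # Empty values and pure fragments ("#top") never leave the current page.
    if not candidate or candidate.startswith("#"):
        return False
    try:
        scheme = urlsplit(candidate).scheme.lower()
    except ValueError:
        # urlsplit rejects some malformed inputs; treat them as non-URLs.
        return False
    # Relative URLs have an empty scheme and are kept.
    return scheme not in SKIP_SCHEMES
```

Relative paths and http(s) URLs pass the check, while `data:` and `javascript:` values are dropped.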

Finding all these cases with BeautifulSoup and then sorting them out is not easy, but still possible. But BeautifulSoup won't help if some HTML tags are not closed. For example, it may come across <link href="URL"> (without the closing />), and then BeautifulSoup will "swallow" all the HTML up to the next <link> (which may not exist on the page at all). And if the HTML is written in a "never close tags" style, BeautifulSoup is generally powerless.
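One possible way around unclosed tags, offered as a sketch rather than a tested solution: the standard library's `html.parser.HTMLParser` is event-based and builds no tree, so each start tag is reported on its own whether or not it is ever closed. The class name `URLCollector`, the `URL_ATTRS` set, and the `CSS_URL` pattern below are assumptions made for this example. The attribute list is incomplete, and the CSS pattern only covers `url(...)`, not `@import "file.css"`.

```python
import re
from html.parser import HTMLParser

# Attributes assumed to carry URLs; extend as needed (srcset, formaction, ...).
URL_ATTRS = {"href", "src", "data", "xlink:href", "action", "poster"}

# url(...) in CSS, with optional quotes and whitespace inside the parentheses.
CSS_URL = re.compile(r"""url\(\s*["']?([^"')]+?)["']?\s*\)""", re.IGNORECASE)

class URLCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        # Called for every start tag, closed or not; names arrive lowercased.
        if tag == "style":
            self._in_style = True
        for name, value in attrs:
            if value is None:  # attribute without a value
                continue
            if name in URL_ATTRS:
                self.urls.append(value)
            elif name == "style":  # inline style="..."
                self.urls.extend(CSS_URL.findall(value))

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        # Text inside <style>...</style> is delivered here as raw CSS.
        if self._in_style:
            self.urls.extend(CSS_URL.findall(data))

def collect_urls(html):
    parser = URLCollector()
    parser.feed(html)
    parser.close()
    return parser.urls
```

Calling `collect_urls(page_source)` returns the raw attribute values in document order, including unquoted ones, since the parser handles quoting itself.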

With regular expressions everything becomes very complicated, because a URL inside HTML does not have to be quoted (in which case it runs up to the nearest whitespace or >). There can also be arbitrary whitespace around the attribute, for example:

<a href
=
"URL"> ...
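A regular expression can tolerate both situations, at the cost of naming the attributes up front. The following is only a sketch: the attribute list in `ATTR_URL` and the helper `find_attr_urls` are assumptions for illustration, and the pattern will also match attribute-like text inside scripts or comments.

```python
import re

ATTR_URL = re.compile(
    r"""
    \b(?:href|src|data|action|poster)   # attribute name (an assumed, incomplete list)
    \s*=\s*                             # '=' with any whitespace, including newlines
    (?:"([^"]*)"|'([^']*)'|([^\s"'>]+)) # double-quoted, single-quoted, or bare value
    """,
    re.IGNORECASE | re.VERBOSE,
)

def find_attr_urls(html):
    """Return attribute values in document order, whatever the quoting style."""
    urls = []
    for m in ATTR_URL.finditer(html):
        # Exactly one of the three groups matches, depending on the quoting.
        urls.append(m.group(1) or m.group(2) or m.group(3) or "")
    return urls

sample = '<a href\n=\n"URL1"> <img src=URL2 alt=x> <link HREF = \'URL3\'>'
print(find_attr_urls(sample))
```

Because `xlink:href` ends in `href`, the same pattern picks it up as well.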

I think URLs can show up in many more ways. I'm not even talking about URLs inside embedded JavaScript (sometimes they are deliberately hidden from parsing; if they are hidden, so be it, but I would like to detect the "open" ones).

So how do I pick all of this apart? I am not able to write one universal regular expression for all cases, and BeautifulSoup, as I explained, does not always help. Are there any alternatives for URL discovery?


1 answer(s)


Vindicar, 2021-12-09
@Vindicar

I would still dig in the direction of regular expressions. There are too many places where URLs can occur.
So I would use something like this:
(["'])(https?://.+?)\1
I.e. "start with a double or single quote, then something that starts with http:// or https://, then any characters, but as few as possible, and then the same quote character as at the beginning."
I wrote a small script and ran it against the source of this page; as far as I can tell, it works well.
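The script itself is not shown in the answer. A minimal sketch of what it could look like follows; the names `URL_IN_QUOTES` and `find_quoted_urls` and the sample markup are made up for illustration.

```python
import re

# The pattern from the answer: a quote, an http(s) URL matched lazily, the same quote.
URL_IN_QUOTES = re.compile(r"""(["'])(https?://.+?)\1""")

def find_quoted_urls(html):
    """Return every absolute http(s) URL that appears between matching quotes."""
    return [m.group(2) for m in URL_IN_QUOTES.finditer(html)]

sample = '''<a href="https://example.com/a">x</a>
<img src='http://example.com/i.png'>
<link href=/relative/style.css>'''
print(find_quoted_urls(sample))
```

The pattern deliberately ignores tag structure, so it also finds URLs in inline scripts and styles, but it misses relative URLs and unquoted attribute values such as the last line of the sample.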