How to find all URLs and URIs inside HTML using Python?
Actually, the URLs also need to be rewritten and wrapped (I'm trying to build something like a web proxy on Django), but that part is a minor detail. First I need to find the URLs and URIs ... For simplicity, I will call everything a URL.
I am aware of BeautifulSoup and its ability to parse (and replace) things inside HTML. But in practice this task is too much for it.
For starters, a URL can occur not only in <a href="URL">something</a> or <img src="URL" />, but also in <link href="URL" />, in <script src="URL"></script>, in <iframe src="URL" ... and also in styles (for example, background-image:url(URL) or @import url(URL) ...), and in inline SVG (for example, <a xlink:href="URL") ... and so on.
In addition, URLs can appear in constructs like <object data="URL" type="image/svg+xml" ...></object>
...and sometimes BeautifulSoup finds something that looks like a URL but isn't (for example, a data URI, as in <img src='data:image/jpeg;base64 ).
It also happens that a tag is written as <link href="URL"> (without the closing />), and then BeautifulSoup "swallows" all the HTML up to the next <link> (which may not even be on the page ... by the way, if the HTML is written in the "do not close tags" style, BeautifulSoup is generally powerless).
An attribute can also be spread over several lines:
<a href
=
"URL"> ...
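The cases listed in the question come down to two textual shapes: a quoted attribute value and a CSS url(...). Below is a minimal sketch of matching both with Python's re module. It assumes that only the attributes named in the question matter (href, src, data, xlink:href) and that attribute values are always quoted. The names ATTR_URL, CSS_URL and find_urls are made up for this illustration, and none of this is meant as a complete solution.

```python
import re

# Attribute names taken from the examples in the question; not an exhaustive list.
# \s*=\s* also accepts an attribute whose "=" sits on its own line.
ATTR_URL = re.compile(
    r"""\b(?:href|src|data|xlink:href)\s*=\s*(["'])(.*?)\1""",
    re.IGNORECASE | re.DOTALL,
)

# CSS url(...) with optional quotes, as in background-image and @import.
CSS_URL = re.compile(r"""url\(\s*(["']?)(.*?)\1\s*\)""", re.IGNORECASE)


def find_urls(html):
    """Collect attribute and CSS URLs, skipping data: URIs (hypothetical helper)."""
    found = [m.group(2) for m in ATTR_URL.finditer(html)]
    found += [m.group(2) for m in CSS_URL.finditer(html)]
    # A data: URI carries inline content, so there is nothing to rewrite.
    return [u for u in found if not u.lower().startswith("data:")]


sample = '''<a href
=
"http://example.com/a">x</a>
<img src='data:image/jpeg;base64,AAAA'>
<link href="/s.css">
<div style="background-image:url(bg.png)"></div>'''

print(find_urls(sample))
```

Because the patterns look at the raw text rather than at a parsed tree, an unclosed <link> has no effect on what is found. The price is that commented-out markup and strings inside scripts are matched too.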
I would still look in the direction of regular expressions. There are too many places where URLs can occur.
So I would write something like this:
(["'])(https?://.+?)\1
I.e., "start with a quotation mark or an apostrophe, then something that begins with http:// or https://, then any characters, but as few as possible, and then the same character as at the start."
I wrote a small script and ran it against the source of this page; as far as I can tell, it works well.
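The script itself was not posted. A minimal sketch of how the pattern above could be applied with Python's re module might look like this (the names QUOTED_URL and find_quoted_urls and the sample markup are assumptions for illustration):

```python
import re

# The pattern from the answer: an opening quote or apostrophe, an http(s) URL
# matched lazily, then the same character that opened the string.
QUOTED_URL = re.compile(r"""(["'])(https?://.+?)\1""")


def find_quoted_urls(html):
    """Return every quoted absolute http(s) URL found in the text."""
    return [m.group(2) for m in QUOTED_URL.finditer(html)]


sample = '''<a href="http://example.com/a">x</a>
<link href='https://example.com/s.css'>
<img src='data:image/jpeg;base64,AAAA'>
<script src="/local.js"></script>'''

for url in find_quoted_urls(sample):
    print(url)
```

On this sample the data: URI is not matched, since it does not start with http:// or https://. For the same reason the relative path /local.js is skipped, and an unquoted url(...) in a stylesheet would be missed too, so relative and unquoted URLs need separate handling.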