JavaScript
Alexander, 2021-06-27 22:16:43

Regular expression to find all links in HTML markup?

I have been looking for a long time for a regular expression, or any other way, to find all links on a page.
By "page" I mean literally all of its content: document.querySelector('html').innerHTML. I need to pack all these links into an array of strings. Packing is not the problem; the regular expression is. Everyone on the internet seems to agree that "find all links on a page" means "find all <a href="...">...</a> tags in the HTML markup", which made searching quite difficult. Still, I found a few good expressions, but one does not work with query parameters, another fails on atypical characters, and a third does not find links of the form //google.com. If you add support for //google.com, then script lines of the form //document.querySelector() end up in the list too.

I tried to write a regular expression myself, and I tried combining several and checking them one by one, but neither worked out.
My knowledge of regular expressions only let me put together something like (http?s:\/\/|\.\/|\/\/).{0,}), which is very far from ideal.

Found on the internet:

/((http?s|ftp):\/\/|\.\/)[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]/gi
/(((http?s:)|)\/\/\w+\.\w{2,3})(\.\w{2})?(\/\S*)?/gi

I even found this monster:

/((?:(http|https|Http|Https|rtsp|Rtsp):\/\/(?:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,64}(?:\:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,25})?\@)?)?((?:(?:[a-zA-Z0-9][a-zA-Z0-9\-]{0,64}\.)+(?:(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])|(?:biz|b[abdefghijmnorstvwyz])|(?:cat|com|coop|c[acdfghiklmnoruvxyz])|d[ejkmoz]|(?:edu|e[cegrstu])|f[ijkmor]|(?:gov|g[abdefghilmnpqrstuwy])|h[kmnrtu]|(?:info|int|i[delmnoqrst])|(?:jobs|j[emop])|k[eghimnrwyz]|l[abcikrstuvy]|(?:mil|mobi|museum|m[acdghklmnopqrstuvwxyz])|(?:name|net|n[acefgilopruz])|(?:org|om)|(?:pro|p[aefghklmnrstwy])|qa|r[eouw]|s[abcdeghijklmnortuvyz]|(?:tel|travel|t[cdfghjklmnoprtvwz])|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]))|(?:(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])))(?:\:\d{1,5})?)(\/(?:(?:[a-zA-Z0-9\;\/\?\:\@\&\=\#\~\-\.\+\!\*\'\(\)\,\_])|(?:\%[a-fA-F0-9]{2}))*)?(?:\b|$)/gi




Maybe there is a ready-made solution, or another way that is simpler, easier to understand, or just works?
Thanks in advance. I have been searching for a couple of days, and so far none of these expressions, alone or combined, correctly processes even this test page:
testPage.html

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <meta 
    name="viewport"
    content="width=device-width, initial-scale=1.0"
  >
  <title>test 1</title>
</head>
<body>
  <h3>test 1</h3>
  <a href="https://google.com">google.com</a>
  <a href="//google.com"></a>
  <a href="//google.com/in/someelse/food.html">in/someelse/food</a>
  <a href="./testPage2.html">test 2</a>
  <a href="./weakPage.html?q=test">weak</a>
</body>
<script>
  //document.querySelectorAll( 'a' ).forEach( l => l.onclick = function () { return false; } );
  document.querySelector( 'h3' ).addEventListener( 'click', () => location.href = 'https://google.com' );
</script>
</html>
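
One possible way around the //document.querySelector() problem described above is to accept a candidate only when it sits between a matching pair of quotes. The sketch below is a heuristic, not a complete URL parser. The helper name extractLinks and the reduced sample string are made up for illustration, and it assumes links are always quoted, so it will miss unquoted attribute values and root-relative paths such as /page.

```javascript
// Hypothetical helper (not from the thread): pull link-like strings out of raw markup.
function extractLinks(markup) {
  // A candidate must sit between a matching pair of quotes and start with
  // http://, https://, // (protocol-relative), ./ or ../
  // Requiring the opening quote is what keeps a commented-out script line
  // such as //document.querySelectorAll(...) out of the result.
  const pattern = /(["'])((?:https?:)?\/\/[^\s"'<>]+|\.{1,2}\/[^\s"'<>]+)\1/gi;
  const found = Array.from(markup.matchAll(pattern), (match) => match[2]);
  return [...new Set(found)]; // drop duplicates, keep first-seen order
}

// Reduced copy of the test page from the question (an assumption for the demo).
const sample = `
  <a href="https://google.com">google.com</a>
  <a href="//google.com"></a>
  <a href="./weakPage.html?q=test">weak</a>
  <script>
    //document.querySelectorAll( 'a' ).forEach( l => l.onclick = function () { return false; } );
    document.querySelector( 'h3' ).addEventListener( 'click', () => location.href = 'https://google.com' );
  </script>
`;

console.log(extractLinks(sample));
// → [ 'https://google.com', '//google.com', './weakPage.html?q=test' ]
```

The URL inside the script string is found as well, while the commented-out line is skipped because nothing in it starts with a quote followed by // or ./.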


1 answer(s)
Simkav, 2021-06-27
@Simkav

Well, if you need to get all the links, why not just do this?

// `const` keeps `link` from leaking into the global scope
for (const link of document.getElementsByTagName('a')) {
  console.log(link.href);
}
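
Since the question asks for an array of strings rather than console output, the loop above could be adapted roughly as follows. This is only a sketch: collectLinks is a made-up helper name, and it takes the root as a parameter so the same code works with document or with any other object that has querySelectorAll. Note that link.href returns the absolute URL as resolved by the browser, while getAttribute('href') returns the value exactly as written in the markup (for example //google.com or ./testPage2.html).

```javascript
// Hypothetical helper (not from the thread): collect link targets as strings.
// `root` is anything that exposes querySelectorAll, e.g. `document` in a browser.
function collectLinks(root) {
  // 'a[href]' skips <a> elements that have no href attribute at all.
  const anchors = root.querySelectorAll('a[href]');
  // getAttribute returns the raw attribute text; use `a.href` instead
  // if resolved absolute URLs are wanted.
  return Array.from(anchors, (a) => a.getAttribute('href'));
}

// Only runs where a DOM is available (e.g. in the browser console).
if (typeof document !== 'undefined') {
  console.log(collectLinks(document));
}
```

This only sees <a> elements, so a URL that appears solely inside a script string, like the one in the click handler of the test page, will not be in the result.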
