linux
Printip, 2015-10-21 09:35:36

How do I extract website addresses from HTML on Linux?

There are a number of files with content like the following (the full page source of websites):

' Art Ever<em>&bull;</em></span></a></div>
<div class="sect clearfix"><a data="foo" class="logo" href="http://tasteofcountry.com" target="_blank" rel="nofollow"><img src="https://s3.amazonaws.com/tsm-images/logos/footer/204-light.png?id=78" alt="Logo for the website http://tasteofcountry.com"></a><a data-id="295354" class="article" href="http://tasteofcountry.com/shocking-country-music-splits/" title="Love Hurts: Most Shocking Country Music Splits" target="_blank" rel="nofollow"><span>Love Hurts: Most Shocking Country Music Splits<em>&bull;</em></span></a><a data-id="295275" class="article" href="http://tasteofcountry.com/reba-mcentire-narvel-blackstock-relationship-timeline/" title="Their Last 30 Years: A Look Back at Reba and Narvel's Relationship" target="_blank" rel="nofollow"><span>Their Last 30 Years: A Look Back at Reba and Narvel's Relationship<em>&bull;</em></span></a></div>
</div><hr><div class="row clearfix"><div class="sect clearfix"><a class="article img" href="http://screencrush.com/official-batman-vs-superman-plot-synopsis/?footer" title="The Real Reason Batman & Superman Are Fighting" target="_blank" rel="nofollow"><img width="180" height="120" src="http://wac.450f.edgecastcdn.net/80450F/screencrush.com/files/2015/07/batman-vs-superman-300.jpg?w=180&h=120&zc=1&s=0&a=t&q=89" alt="The Real Reason Batman & Superman Are Fighting"><span>The Real Reason Batman & Superman Are Fighting</span></a></div>
<div class="sect clearfix"><a class="article img" href="http://popcrush.com/stars-who-were-born-rich/?footer" title="15 Stars Who Were Born Filthy Rich" target="_blank" rel="nofollow"><img width="180" height="120" src="http://wac.450f.edgecastcdn.net/80450F/popcrush.com/files/2015/04/born-rich-300.jpg?w=180&h=120&zc=1&s=0&a=t&q=89" alt="15 Stars Who Were Born Filthy Rich"><span>15 Stars Who Were Born Filthy Rich</span></a></div>
<div class="sect clearfix"><a class="article img" href="http://diffuser.fm/offensive-band-names/?footer" title="27 Most Offensive Band Names Ever" target="_blank" rel="nofollow"><img width="180" height="120" src="http://wac.450f.edgecastcdn.net/80450F/diffuser.fm/files/2015/03/offensive-band-names.jpg?w=180&h=120&zc=1&s=0&a=t&q=89" alt="27 Most Offensive Band Names Ever"><span>27 Most Offensive Band Names Ever</span></a></div>
<div class="sect clearfix"><a class="article img" href="http://comicsalliance.com/comic-book-movie-behind-the-scenes-pictures/?footer" title="Spectacular Behind-the-Scenes Pics From Comic Book Movies" target="_blank" rel="nofollow"><img width="180" height="120" src="http://wac.450f.edgecastcdn.net/80450F/comicsalliance.com/files/2015/05/behind-the-scenes-300.jpg?w=180&h=120&zc=1&s=0&a=t&q=89" alt="Spectacular Behind-the-Scenes Pics From Comic Book Movies"><span>Spectacular Behind-the-Scenes Pics From Comic Book Movies</span></a></div>
<div class="sect clearfix"><a class="article img" href="http://tasteofcountry.com/you-think-you-know-country-taylor-swift/?footer" title="Surprising Taylor Swift Facts You Probably Didn't Know" target="_blank" rel="nofollow"><img width="180" height="120" src="http://wac.450f.edgecastcdn.net/80450F/tasteofcountry.com/files/2014/08/taylor-swift-sexy.jpg?w=180&h=120&zc=1&s=0&a=t&q=89" alt="Surprising Taylor Swift Facts You Probably Didn't Know"><span>Surprising Taylor Swift Facts You Probably Didn't Know</span></a></div>

How can I extract from this code only the addresses that start with "http:// or "https:// and end with the " character, and write them to a file, one URL per line?
The output should look like this:
http://wac.450f.edgecastcdn.net/80450F/tasteofcountry.com/files/2014/08/taylor-swift-sexy.jpg?w=180&h=120&zc=1&s=0&a=t&q=89
http://tasteofcountry.com/you-think-you-know-country-taylor-swift/?footer
http://wac.450f.edgecastcdn.net/80450F/comicsalliance.com/files/2015/05/behind-the-scenes-300.jpg
http://comicsalliance.com/comic-book-movie-behind-the-scenes-pictures/?footer


4 answer(s)
abcd0x00, 2015-10-21
@abcd0x00

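In two passes: the first sed moves every "http occurrence to the start of its own line, and the second prints only the URL from each of those lines (the \n in the replacement relies on GNU sed).
For the text above saved as file.html: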

$ cat "file.html" | sed 's/"http/\n&/g' | sed -n 's/^"\(http[^"]*\)".*/\1/p'
http://tasteofcountry.com
https://s3.amazonaws.com/tsm-images/logos/footer/204-light.png?id=78
http://tasteofcountry.com/shocking-country-music-splits/
http://tasteofcountry.com/reba-mcentire-narvel-blackstock-relationship-timeline/
http://screencrush.com/official-batman-vs-superman-plot-synopsis/?footer
http://wac.450f.edgecastcdn.net/80450F/screencrush.com/files/2015/07/batman-vs-superman-300.jpg?w=180&h=120&zc=1&s=0&a=t&q=89
http://popcrush.com/stars-who-were-born-rich/?footer
http://wac.450f.edgecastcdn.net/80450F/popcrush.com/files/2015/04/born-rich-300.jpg?w=180&h=120&zc=1&s=0&a=t&q=89
http://diffuser.fm/offensive-band-names/?footer
http://wac.450f.edgecastcdn.net/80450F/diffuser.fm/files/2015/03/offensive-band-names.jpg?w=180&h=120&zc=1&s=0&a=t&q=89
http://comicsalliance.com/comic-book-movie-behind-the-scenes-pictures/?footer
http://wac.450f.edgecastcdn.net/80450F/comicsalliance.com/files/2015/05/behind-the-scenes-300.jpg?w=180&h=120&zc=1&s=0&a=t&q=89
http://tasteofcountry.com/you-think-you-know-country-taylor-swift/?footer
http://wac.450f.edgecastcdn.net/80450F/tasteofcountry.com/files/2014/08/taylor-swift-sexy.jpg?w=180&h=120&zc=1&s=0&a=t&q=89
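
The same two-pass idea (prepare, then select) can also be sketched without the newline escape in the sed replacement, by splitting on the quote character with tr and then keeping the lines that start with a scheme. This is a swapped-in variant, not the command from the answer; the sample markup and the extract_urls helper name are made up for illustration.

```shell
# Hypothetical sample input; the real files hold full page source.
sample='<a href="http://example.com/a/?footer"><img src="https://example.com/i.png?w=1&h=2" alt="Logo for http://example.com"></a>'

# Pass 1: put every double-quoted value on its own line.
# Pass 2: keep only the lines that start with http:// or https://.
extract_urls() {
    tr '"' '\n' | grep -E '^https?://'
}

printf '%s\n' "$sample" | extract_urls
```

Because a value must begin with the scheme to be kept, the URL mentioned inside the alt text is skipped, which matches what the asker wants.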

Aves, 2015-10-21
@Aves

grep -oP '(?<=")https?://.+?(?=")' file
or

perl -ne 'while(m/"(https?:\/\/.+?)"/g){print "$1\n"}' file

Kostya Lakin, 2015-10-21
@laki9

echo -e 'asdfasdf ya.ru "asfaf google.com "adfadsf\n reddit.com "\n https://reddit.com/blabla "' | grep -E -o ' http://[^ "]+"| https://[^ "]+"' | sed 's/"//g'
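
The grep -E -o approach can be adapted to the question's markup, where each URL sits directly between double quotes, and extended to read several files and write the result to a file. The sketch below assumes a grep that supports -o and -h (GNU and BSD grep do); the temporary directory, the two sample pages, and the urls.txt name are hypothetical.

```shell
# Hypothetical working directory with two small sample pages.
workdir=$(mktemp -d)
printf '%s\n' '<a href="http://example.com/one" title="x">x</a>' > "$workdir/a.html"
printf '%s\n' '<img src="https://example.com/two.png?w=1&h=2" alt="pic">' > "$workdir/b.html"

# -o prints only the matched text, -h drops the file-name prefix,
# -E enables the ? quantifier; tr then strips the surrounding quotes.
grep -ohE '"https?://[^"]+"' "$workdir"/*.html | tr -d '"' > "$workdir/urls.txt"

cat "$workdir/urls.txt"
```

On real data the glob would point at the saved pages instead of the temporary directory, and urls.txt would end up with one URL per line.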
