How Can I Get Href Links From HTML Using Python?: 6 Answers

How can I get href links from HTML using Python?
import urllib2
website = "WEBSITE"
openwebsite = urllib2.urlopen(website)
html = openwebsite.read()
print html
So far so good.
But I want only href links from the plain text HTML. How can I solve this problem?
python html hyperlink beautifulsoup href
edited Jun 17 '17 at 11:05 asked Jun 19 '10 at 12:58

dreftymac user371012
14.2k 20 79 146 148 1 2 4
6 Answers
Try with Beautifulsoup:
from BeautifulSoup import BeautifulSoup

import urllib2
import re
html_page = urllib2.urlopen("http://www.yourwebsite.com")
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    print link.get('href')
In case you just want links starting with http:// , you should use:
soup.findAll('a', attrs={'href': re.compile("^http://")})
edited Jun 19 '10 at 13:13 answered Jun 19 '10 at 13:04

systempuntoout
39.5k 35 140 221
BeautifulSoup cannot automatically close meta tags, for example. The DOM model is invalid and
there is no guarantee that you'll find what you are looking for. – Antonio Dec 28 '13 at 16:16
Another problem with bsoup is that the format of the link changes from its original. So if you want to
change the original link to point to another resource, at the moment I still have no idea how to do this
with bsoup. Any suggestions? – swdev Oct 28 '14 at 0:54
Not all links contain http . E.g., if you code your site to remove the protocol, the links will start with
// . This means just use whatever protocol the site is loaded with (either http: or https: ). –
reubano Jan 15 '17 at 17:19
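As the comments above note, not every href starts with http://: some are protocol-relative (//host/path) and others are relative to the page. Here is a minimal sketch, assuming Python 3 and a known page URL, of resolving such values with urllib.parse.urljoin from the standard library. The base URL and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL and href values, used only for illustration.
base_url = "https://example.com/docs/index.html"
hrefs = ["//cdn.example.com/lib.js", "/about", "page2.html", "http://other.org/"]

# urljoin fills in whatever the href leaves out (scheme, host, or directory).
absolute = [urljoin(base_url, href) for href in hrefs]

for url in absolute:
    print(url)
```

Filtering on the resolved URLs rather than on the raw href text avoids dropping links that simply omit the scheme.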
You can use the HTMLParser module.
The code would probably look something like this:
from HTMLParser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
            # Check the list of defined attributes.
            for name, value in attrs:
                # If href is defined, print it.
                if name == "href":
                    print name, "=", value
parser = MyHTMLParser()
parser.feed(your_html_string)
Note: The HTMLParser module has been renamed to html.parser in Python 3.0. The 2to3 tool will
automatically adapt imports when converting your sources to 3.0.
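For reference, a rough Python 3 equivalent of this approach might look like the following. It is only a sketch: the class name LinkLister and the sample HTML string are placeholders, and it collects the href values in a list instead of printing them.

```python
from html.parser import HTMLParser


class LinkLister(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Placeholder markup standing in for a downloaded page.
sample_html = '<p><a href="http://example.com/">one</a> <A HREF="/two">two</A></p>'

lister = LinkLister()
lister.feed(sample_html)
lister.close()
print(lister.links)
```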
answered Jun 19 '10 at 13:02
Stephen
30.5k 6 44 59
I've come to realize that if a link contains a special HTML character such as &amp; , it gets converted
into its textual representation, such as & in this case. How do you preserve the original string? –
swdev Oct 28 '14 at 3:20
I like this solution best, since it doesn't need external dependencies – DomTomCat Apr 27 '16 at 6:09
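On the question of preserving the original string: the parser passes attribute values to handle_starttag with character references already decoded. One possible workaround, sketched here for Python 3, is to read the raw tag text from get_starttag_text() and pull the href out of it. The class name RawLinkParser and the regular expression are assumptions made for this example, and the pattern only handles double-quoted values.

```python
import re
from html.parser import HTMLParser

# Assumes a double-quoted href; other quoting styles need a broader pattern.
RAW_HREF = re.compile(r'(?<![\w-])href\s*=\s*"([^"]*)"', re.IGNORECASE)


class RawLinkParser(HTMLParser):
    """Record both the decoded and the raw href of each <a> tag."""

    def __init__(self):
        super().__init__()
        self.decoded = []
        self.raw = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href":
                self.decoded.append(value)
        # get_starttag_text() returns the tag exactly as it appeared in the input.
        match = RAW_HREF.search(self.get_starttag_text())
        if match:
            self.raw.append(match.group(1))


parser = RawLinkParser()
parser.feed('<a href="/search?q=1&amp;page=2">results</a>')
parser.close()
print(parser.decoded)
print(parser.raw)
```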
Look at using the beautiful soup html parsing library.
http://www.crummy.com/software/BeautifulSoup/
You will do something like this:
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(html)
for link in soup.findAll("a"):
    print link.get("href")
edited Mar 4 '14 at 15:27 answered Jun 19 '10 at 13:07

Peter Lyons
100k 21 200 210
Thanks! But use link instead a. – Evgeny Mar 4 '14 at 12:30
My answer probably sucks compared to the real gurus out there, but using some simple math,
string slicing, find and urllib, this little script will create a list containing link elements. I tested it on
Google and the output seems right. Hope it helps!
import urllib
test = urllib.urlopen("http://www.google.com").read()
sane = 0
needlestack = []
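while sane == 0:
    # Jump to the next occurrence of "href"; stop when there is none left.
    curpos = test.find("href")
    if curpos >= 0:
        testlen = len(test)
        test = test[curpos:testlen]
        # Skip past the opening quote of the attribute value.
        curpos = test.find('"')
        testlen = len(test)
        test = test[curpos+1:testlen]
        # The value runs up to the closing quote.
        curpos = test.find('"')
        needle = test[0:curpos]
        # A tuple tests both prefixes; "http" or "www" would only test "http".
        if needle.startswith(("http", "www")):
            needlestack.append(needle)
    else:
        sane = 1
for item in needlestack:
    print item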
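A note on the prefix test in this script: the expression "http" or "www" evaluates to just "http", so passing it to startswith never checks for "www". The short Python 3 sketch below compares that with passing a tuple of prefixes, using made-up sample strings.

```python
# The `or` is evaluated first and yields just "http".
prefix = "http" or "www"
print(prefix)

# Made-up candidates for illustration.
candidates = ["http://example.com", "www.example.com", "/relative/path"]

# Passing the `or` expression only tests for "http".
single = [c for c in candidates if c.startswith("http" or "www")]

# Passing a tuple tests for either prefix.
either = [c for c in candidates if c.startswith(("http", "www"))]

print(single)
print(either)
```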
answered Feb 15 '13 at 5:05

0xhughes
913 2 16 33
Here's a lazy version of @stephen's answer
from urllib.request import urlopen

from itertools import chain
from html.parser import HTMLParser
class LinkParser(HTMLParser):
    def reset(self):
        HTMLParser.reset(self)
        self.links = iter([])

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links = chain(self.links, [value])

def gen_links(f, parser):
    encoding = f.headers.get_content_charset() or 'UTF-8'
    for line in f:
        parser.feed(line.decode(encoding))
        yield from parser.links
Use it like so:
>>> parser = LinkParser()

>>> f = urlopen('http://stackoverflow.com/questions/3075550')
>>> links = gen_links(f, parser)
>>> next(links)
'//stackoverflow.com'
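To try the same lazy idea without a network connection, a variant along these lines could be fed any iterable of text chunks. This is a sketch only: LinkCollector, iter_links and the sample chunks are hypothetical names and data, and a plain list is used as the buffer instead of itertools.chain.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Buffer href values of <a> tags until the caller drains them."""

    def __init__(self):
        super().__init__()
        self.pending = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.pending.append(value)


def iter_links(chunks):
    """Yield links as soon as the chunk that completes their tag is fed."""
    parser = LinkCollector()
    for chunk in chunks:
        parser.feed(chunk)
        while parser.pending:
            yield parser.pending.pop(0)
    # Flush anything the parser was still holding back.
    parser.close()
    while parser.pending:
        yield parser.pending.pop(0)


# Placeholder chunks; note that the first tag is split across two chunks.
chunks = ['<a hre', 'f="/first">1</a><a href="/second">', '2</a>']
links = iter_links(chunks)
first = next(links)
rest = list(links)
print(first, rest)
```

Because the parser keeps an incomplete tag in its internal buffer, a tag split across two chunks is still recognized once the second chunk arrives.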
edited Jan 15 '17 at 17:58 answered Jan 15 '17 at 17:13

reubano
1,699 18 21
Using BS4 for this specific task seems overkill.
Try instead:
import re
import urllib2
website = urllib2.urlopen('http://10.123.123.5/foo_images/Repo/')
html = website.read()
# [^"]* keeps each match inside a single href value; the dots are escaped.
files = re.findall(r'href="([^"]*(?:\.tgz|\.tar\.gz))"', html)
print sorted(files)
I found this nifty piece of code on http://www.pythonforbeginners.com/code/regular-expression-re-findall
and it works quite well for me.
I tested it only on my scenario of extracting a list of files from a web folder that exposes the
files\folder in it, e.g.:
and I got a sorted list of the files\folders under the URL
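The same regular-expression idea can be tried offline on a small string. The sketch below assumes Python 3; the listing text and file names are invented stand-ins for the HTML that a web folder might return.

```python
import re

# Made-up directory listing standing in for the HTML of a web folder.
listing = (
    '<a href="alpha-1.0.tgz">alpha</a> <a href="beta-2.1.tar.gz">beta</a>\n'
    '<a href="notes.txt">notes</a> <a href="gamma.tgz">gamma</a>'
)

# [^"]* keeps each match inside one attribute value; the dots are escaped.
pattern = re.compile(r'href="([^"]*(?:\.tgz|\.tar\.gz))"')
files = sorted(pattern.findall(listing))
print(files)
```

A regular expression like this is only dependable when the markup is as regular as a generated directory listing; for arbitrary pages a real HTML parser is the safer choice.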
answered Sep 20 '17 at 11:09

RaamEE
694 1 7 17
