Python and HTML Processing

= Links =

* [http://docs.python.org/lib/markup.html Python Structured Markup Processing Tools]
* [http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52199]
* http://www.boddie.org.u...

= URLLIB Tutorial: urllib2 - The Missing Manual =


== HOWTO Fetch Internet Resources with Python ==


<b>urllib2 Tutorial</b>

* Introduction
* Fetching URLs
** Data
** Headers
* Handling Exceptions
** URLError
** HTTPError
*** Error Codes
** Wrapping it Up
*** Number 1
*** Number 2
* info and geturl
* Openers and Handlers
* Basic Authentication
* Proxies
* Sockets and Layers
* Footnotes
= Introduction =

<b>Related Articles</b>

You may also find useful the following articles on fetching web resources with Python:
* [http://www.voidspace.org.uk/python/articles/authentication.shtml Basic Authentication] - A tutorial on <i>Basic Authentication</i>, with examples in Python.
* [http://www.voidspace.org.uk/python/articles/cookielib.shtml cookielib and ClientCookie] - How to handle cookies when fetching web pages with <i>urllib2</i>.

<b>urllib2</b> is a [http://www.python.org Python] module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the <i>urlopen</i> function. This is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations - like basic authentication, cookies, proxies and so on. These are provided by objects called handlers and openers.
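
For a first taste of that more complex interface, here is a minimal sketch (not from the original text; the proxy URL is made up for illustration, while the handler classes named are the standard ones urllib2 provides):

<pre>
import cookielib
import urllib2

# Handlers add behaviour (proxy support, cookie handling) to every
# request made through the opener built from them.
cookies = cookielib.CookieJar()
opener = urllib2.build_opener(
    urllib2.ProxyHandler({'http': 'http://proxy.example.com:3128/'}),
    urllib2.HTTPCookieProcessor(cookies),
)

# Either use the opener directly...
response = opener.open('http://python.org/')

# ...or install it so that plain urllib2.urlopen() uses it too.
urllib2.install_opener(opener)
</pre>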
urllib2 supports fetching URLs for many "URL schemes" (identified by the string before the ":" in the URL - for example, "ftp" is the URL scheme of "ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP.
For straightforward situations <i>urlopen</i> is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference to HTTP is [http://www.faqs.org/rfcs/rfc2616.html RFC 2616]. This is a technical document and not intended to be easy to read. This HOWTO aims to illustrate using <i>urllib2</i>, with enough detail about HTTP to help you through. It is not intended to replace the [http://docs.python.org/lib/module-urllib2.html urllib2 docs], but is supplementary to them.
= Fetching URLs =
The simplest way to use urllib2 is as follows:
<div class="pysrc"><span class="pykeyword">import</span> <span class="pytext">urllib2</span><br /><span class="pytext">response</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">urllib2</span><span class="pyoperator">.</span><span class="pytext">urlopen</span><span class="pyoperator">(</span><span class="pystring">'http://python.org/'</span><span class="pyoperator">)</span><br /><span class="pytext">html</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">response</span><span class="pyoperator">.</span><span class="pytext">read</span><span class="pyoperator">(</span><span class="pyoperator">)</span><span class="pytext"></span></div>
Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the purpose of this tutorial to explain the more complicated cases, concentrating on HTTP.
HTTP is based on requests and responses - the client makes requests and servers send responses. urllib2 mirrors this with a <tt>Request</tt> object which represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling <tt>urlopen</tt> with this Request object returns a response object for the URL requested. This response is a file-like object, which means you can, for example, call .read() on the response:
          <SPAN class="pystring">'language'</SPAN> <SPAN class="pyoperator">:</SPAN> <SPAN class="pystring">'Python'</SPAN> <SPAN class="pyoperator">}</SPAN><BR></BR>
<SPAN class="pytext">headers</SPAN> <SPAN class="pyoperator">=</SPAN> <SPAN class="pyoperator">{</SPAN> <SPAN class="pystring">'User-Agent'</SPAN> <SPAN class="pyoperator">:</SPAN> <SPAN class="pytext">user_agent</SPAN> <SPAN class="pyoperator">}</SPAN><BR></BR>


<BR></BR>
<div class="pysrc"><span class="pykeyword">import</span> <span class="pytext">urllib2</span><br /><br /><span class="pytext">req</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">urllib2</span><span class="pyoperator">.</span><span class="pytext">Request</span><span class="pyoperator">(</span><span class="pystring">'http://www.voidspace.org.uk'</span><span class="pyoperator">)</span><br /><span class="pytext">response</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">urllib2</span><span class="pyoperator">.</span><span class="pytext">urlopen</span><span class="pyoperator">(</span><span class="pytext">req</span><span class="pyoperator">)</span><br /><span class="pytext">the_page</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">response</span><span class="pyoperator">.</span><span class="pytext">read</span><span class="pyoperator">(</span><span class="pyoperator">)</span><span class="pytext"></span></div>
Note that urllib2 makes use of the same Request interface to handle all URL schemes. For example, you can make an FTP request like so:
<div class="pysrc"><span class="pytext">req</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">urllib2</span><span class="pyoperator">.</span><span class="pytext">Request</span><span class="pyoperator">(</span><span class="pystring">'ftp://example.com/'</span><span class="pyoperator">)</span><span class="pytext"></span></div>
In the case of HTTP, there are two extra things that Request objects allow you to do: First, you can pass data to be sent to the server. Second, you can pass extra information ("metadata") <i>about</i> the data or about the request itself, to the server - this information is sent as HTTP "headers". Let's look at each of these in turn.
== Data ==
Sometimes you want to send data to a URL (often the URL will refer to a CGI (Common Gateway Interface) script [1] or other web application). With HTTP, this is often done using what's known as a <b>POST</b> request. This is often what your browser does when you submit an HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way, and then passed to the Request object as the <tt>data</tt> argument. The encoding is done using a function from the <tt>urllib</tt> library, <i>not</i> from <tt>urllib2</tt>.
<div class="pysrc"><span class="pykeyword">import</span> <span class="pytext">urllib</span><br /><span class="pykeyword">import</span> <span class="pytext">urllib2</span><br /><br /><span class="pytext">url</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pystring">'http://www.someserver.com/cgi-bin/register.cgi'</span><br /><span class="pytext">values</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pyoperator">{</span><span class="pystring">'name'</span> <span class="pyoperator"><nowiki>:</nowiki></span> <span class="pystring">'Michael Foord'</span><span class="pyoperator">,</span><br /><span class="pystring">'location'</span> <span class="pyoperator"><nowiki>:</nowiki></span> <span class="pystring">'Northampton'</span><span class="pyoperator">,</span><br /><span class="pystring">'language'</span> <span class="pyoperator"><nowiki>:</nowiki></span> <span class="pystring">'Python'</span> <span class="pyoperator">}</span><br /><br /><span class="pytext">data</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">urllib</span><span class="pyoperator">.</span><span class="pytext">urlencode</span><span class="pyoperator">(</span><span class="pytext">values</span><span class="pyoperator">)</span><br /><span class="pytext">req</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">urllib2</span><span class="pyoperator">.</span><span class="pytext">Request</span><span class="pyoperator">(</span><span class="pytext">url</span><span class="pyoperator">,</span> <span class="pytext">data</span><span class="pyoperator">)</span><br /><span class="pytext">response</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">urllib2</span><span class="pyoperator">.</span><span class="pytext">urlopen</span><span class="pyoperator">(</span><span class="pytext">req</span><span class="pyoperator">)</span><br /><span class="pytext">the_page</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">response</span><span class="pyoperator">.</span><span class="pytext">read</span><span class="pyoperator">(</span><span class="pyoperator">)</span><span class="pytext"></span></div>
Note that other encodings are sometimes required (e.g. for file upload from HTML forms - see [http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13 HTML Specification, Form Submission] for more details).
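
For instance, a file upload from a form needs a multipart/form-data body, which urllib.urlencode cannot produce; you have to assemble it yourself. A rough, hypothetical sketch (the boundary string, field names and URL are made up for illustration):

<pre>
import urllib2

# Build a multipart/form-data body by hand: each part is delimited
# by the boundary line and carries its own headers.
boundary = '----PythonFormBoundary7MA4YWxk'
body = '\r\n'.join([
    '--' + boundary,
    'Content-Disposition: form-data; name="name"',
    '',
    'Michael Foord',
    '--' + boundary,
    'Content-Disposition: form-data; name="upload"; filename="hello.txt"',
    'Content-Type: text/plain',
    '',
    'Hello, world!',
    '--' + boundary + '--',
    '',
])

req = urllib2.Request('http://www.example.com/upload.cgi', body)
req.add_header('Content-Type', 'multipart/form-data; boundary=' + boundary)
response = urllib2.urlopen(req)
</pre>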
If you do not pass the <tt>data</tt> argument, urllib2 uses a <b>GET</b> request. One way in which GET and POST requests differ is that POST requests often have "side-effects": they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door). Though the HTTP standard makes it clear that POSTs are intended to <i>always</i> cause side-effects, and GET requests <i>never</i> to cause side-effects, nothing prevents a GET request from having side-effects, nor a POST request from having no side-effects. Data can also be passed in an HTTP GET request by encoding it in the URL itself.
This is done as follows.
<pre>
>>> import urllib2
>>> import urllib

>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values

>>> response = urllib2.urlopen(full_url)
</pre>
Notice that the full URL is created by adding a <tt>?</tt> to the URL, followed by the encoded values.
== Headers ==
We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request.
Some websites [2] dislike being browsed by programs, or send different versions to different browsers [3]. By default urllib2 identifies itself as <tt>Python-urllib/x.y</tt> (where <tt>x</tt> and <tt>y</tt> are the major and minor version numbers of the Python release, e.g. <tt>Python-urllib/2.5</tt>), which may confuse the site, or just plain not work. The way a browser identifies itself is through the <tt>User-Agent</tt> header [4]. When you create a Request object you can pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [5].
<div class="pysrc"><span class="pykeyword">import</span> <span class="pytext">urllib</span><br /><span class="pykeyword">import</span> <span class="pytext">urllib2</span><br /><br /><span class="pytext">url</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pystring">'http://www.someserver.com/cgi-bin/register.cgi'</span><br /><span class="pytext">user_agent</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pystring">'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'</span><br /><span class="pytext">values</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pyoperator">{</span><span class="pystring">'name'</span> <span class="pyoperator"><nowiki>:</nowiki></span> <span class="pystring">'Michael Foord'</span><span class="pyoperator">,</span><br /><span class="pystring">'location'</span> <span class="pyoperator"><nowiki>:</nowiki></span> <span class="pystring">'Northampton'</span><span class="pyoperator">,</span><br /><span class="pystring">'language'</span> <span class="pyoperator"><nowiki>:</nowiki></span> <span class="pystring">'Python'</span> <span class="pyoperator">}</span><br /><span class="pytext">headers</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pyoperator">{</span> <span class="pystring">'User-Agent'</span> <span class="pyoperator"><nowiki>:</nowiki></span> <span class="pytext">user_agent</span> <span class="pyoperator">}</span><br /><br /><span class="pytext">data</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">urllib</span><span class="pyoperator">.</span><span class="pytext">urlencode</span><span class="pyoperator">(</span><span class="pytext">values</span><span class="pyoperator">)</span><br /><span class="pytext">req</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">urllib2</span><span class="pyoperator">.</span><span class="pytext">Request</span><span class="pyoperator">(</span><span class="pytext">url</span><span class="pyoperator">,</span> <span class="pytext">data</span><span class="pyoperator">,</span> <span class="pytext">headers</span><span class="pyoperator">)</span><br /><span class="pytext">response</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">urllib2</span><span class="pyoperator">.</span><span class="pytext">urlopen</span><span class="pyoperator">(</span><span class="pytext">req</span><span class="pyoperator">)</span><br /><span class="pytext">the_page</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">response</span><span class="pyoperator">.</span><span class="pytext">read</span><span class="pyoperator">(</span><span class="pyoperator">)</span><span class="pytext"></span></div>
The response also has two useful methods. See the section on info and geturl, which comes after we have a look at what happens when things go wrong.
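
As a quick preview of those methods (a minimal sketch, not from the original text): <i>info</i> returns the response headers and <i>geturl</i> returns the URL actually fetched.

<pre>
import urllib2

response = urllib2.urlopen('http://python.org/')

# geturl() returns the URL of the page actually fetched, which may
# differ from the one requested if redirects were followed.
print response.geturl()

# info() returns a message object holding the response headers.
print response.info()['Content-Type']
</pre>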
= Handling Exceptions =
<i>urlopen</i> raises <tt>URLError</tt> when it cannot handle a response (though as usual with Python APIs, builtin exceptions such as ValueError, TypeError etc. may also be raised).
<tt>HTTPError</tt> is the subclass of <tt>URLError</tt> raised in the specific case of HTTP URLs.
== URLError ==
Often, URLError is raised because there is no network connection (no route to the specified server), or the specified server doesn't exist. In this case, the exception raised will have a 'reason' attribute, which is a tuple containing an error code and a text error message.
e.g.
<pre>
>>> req = urllib2.Request('http://www.pretend_server.org')
>>> try: urllib2.urlopen(req)
... except URLError, e:
...    print e.reason
...
(4, 'getaddrinfo failed')
</pre>
== HTTPError ==
Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a "redirection" that requests the client fetch the document from a different URL, urllib2 will handle that for you). For those it can't handle, urlopen will raise an <tt>HTTPError</tt>. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required).
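
To make the transparent redirect handling concrete, here is a minimal sketch (not from the original article; it assumes http://python.org/ issues a redirect, as it commonly did to http://www.python.org/):

<pre>
import urllib2

# The default handlers follow the redirect for us, so no HTTPError is
# raised; geturl() then reveals the URL that was actually fetched.
response = urllib2.urlopen('http://python.org/')
print response.geturl()
</pre>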
See section 10 of RFC 2616 for a reference on all the HTTP error codes.
The <tt>HTTPError</tt> instance raised will have an integer 'code' attribute, which corresponds to the error sent by the server.
=== [http://esbinfo:8090/pages/editpage.action#id21 ]Error Codes ===
    <SPAN class="pykeyword">if</SPAN> <SPAN class="pytext">hasattr</SPAN><SPAN class="pyoperator">(</SPAN><SPAN class="pytext">e</SPAN><SPAN class="pyoperator">,</SPAN> <SPAN class="pystring">'reason'</SPAN><SPAN class="pyoperator">)</SPAN><SPAN class="pyoperator">:</SPAN><BR></BR>


        <SPAN class="pykeyword">print</SPAN> <SPAN class="pystring">'We failed to reach a server.'</SPAN><BR></BR>
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
        <SPAN class="pykeyword">print</SPAN> <SPAN class="pystring">'Reason: '</SPAN><SPAN class="pyoperator">,</SPAN> <SPAN class="pytext">e</SPAN><SPAN class="pyoperator">.</SPAN><SPAN class="pytext">reason</SPAN><BR></BR>
    <SPAN class="pykeyword">elif</SPAN> <SPAN class="pytext">hasattr</SPAN><SPAN class="pyoperator">(</SPAN><SPAN class="pytext">e</SPAN><SPAN class="pyoperator">,</SPAN> <SPAN class="pystring">'code'</SPAN><SPAN class="pyoperator">)</SPAN><SPAN class="pyoperator">:</SPAN><BR></BR>


        <SPAN class="pykeyword">print</SPAN> <SPAN class="pystring">'The server couldn\'t fulfill the request.'</SPAN><BR></BR>
<tt class="docutils literal"><span class="pre">BaseHTTPServer.BaseHTTPRequestHandler.responses</span></tt> is a useful dictionary of response codes in that shows all the response codes used by RFC 2616. The dictionary is reproduced here for convenience :
        <SPAN class="pykeyword">print</SPAN> <SPAN class="pystring">'Error code: '</SPAN><SPAN class="pyoperator">,</SPAN> <SPAN class="pytext">e</SPAN><SPAN class="pyoperator">.</SPAN><SPAN class="pytext">code</SPAN><BR></BR>
<SPAN class="pykeyword">else</SPAN><SPAN class="pyoperator">:</SPAN><BR></BR>
    <SPAN class="pycomment"># everything is fine</SPAN><SPAN class="pytext"></SPAN></DIV><DIV class="note">


<P class="first admonition-title">Note</P>
<div class="pysrc"><span class="pycomment"><nowiki># Table mapping response codes to messages; entries have the</nowiki><br /></span><span class="pycomment"><nowiki># form {code: (shortmessage, longmessage)}.</nowiki><br /></span><span class="pytext">responses</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pyoperator">{</span><br /><span class="pynumber">100</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Continue'</span><span class="pyoperator">,</span> <span class="pystring">'Request received, please continue'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">101</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Switching Protocols'</span><span class="pyoperator">,</span><br /><span class="pystring">'Switching to new protocol; obey Upgrade header'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><br /><span class="pynumber">200</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'OK'</span><span class="pyoperator">,</span> <span class="pystring">'Request fulfilled, document follows'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">201</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Created'</span><span class="pyoperator">,</span> <span class="pystring">'Document created, URL follows'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">202</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Accepted'</span><span class="pyoperator">,</span><br /><span class="pystring">'Request accepted, processing continues off-line'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">203</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Non-Authoritative Information'</span><span class="pyoperator">,</span> <span class="pystring">'Request fulfilled from cache'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">204</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'No Content'</span><span class="pyoperator">,</span> <span class="pystring">'Request fulfilled, nothing follows'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">205</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Reset Content'</span><span class="pyoperator">,</span> <span class="pystring">'Clear input form for further input.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">206</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Partial Content'</span><span class="pyoperator">,</span> <span class="pystring">'Partial content follows.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><br /><span class="pynumber">300</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Multiple 
Choices'</span><span class="pyoperator">,</span><br /><span class="pystring">'Object has several resources -- see URI list'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">301</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Moved Permanently'</span><span class="pyoperator">,</span> <span class="pystring">'Object moved permanently -- see URI list'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">302</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Found'</span><span class="pyoperator">,</span> <span class="pystring">'Object moved temporarily -- see URI list'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">303</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'See Other'</span><span class="pyoperator">,</span> <span class="pystring">'Object moved -- see Method and URL list'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">304</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Not Modified'</span><span class="pyoperator">,</span><br /><span class="pystring">'Document has not changed since given time'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">305</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Use Proxy'</span><span class="pyoperator">,</span><br /><span class="pystring">'You must use proxy specified in Location to access this '</span><br /><span class="pystring">'resource.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">307</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Temporary Redirect'</span><span class="pyoperator">,</span><br /><span class="pystring">'Object moved temporarily -- see URI list'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><br /><span class="pynumber">400</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Bad Request'</span><span class="pyoperator">,</span><br /><span class="pystring">'Bad request syntax or unsupported method'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">401</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Unauthorized'</span><span class="pyoperator">,</span><br /><span class="pystring">'No permission -- see authorization schemes'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">402</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Payment Required'</span><span class="pyoperator">,</span><br /><span class="pystring">'No payment -- see charging schemes'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">403</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Forbidden'</span><span class="pyoperator">,</span><br /><span 
class="pystring">'Request forbidden -- authorization will not help'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">404</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Not Found'</span><span class="pyoperator">,</span> <span class="pystring">'Nothing matches the given URI'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">405</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Method Not Allowed'</span><span class="pyoperator">,</span><br /><span class="pystring">'Specified method is invalid for this server.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">406</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Not Acceptable'</span><span class="pyoperator">,</span> <span class="pystring">'URI not available in preferred format.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">407</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Proxy Authentication Required'</span><span class="pyoperator">,</span> <span class="pystring">'You must authenticate with '</span><br /><span class="pystring">'this proxy before proceeding.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">408</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Request Timeout'</span><span class="pyoperator">,</span> <span class="pystring">'Request timed out; try again later.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">409</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Conflict'</span><span class="pyoperator">,</span> <span class="pystring">'Request conflict.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">410</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Gone'</span><span class="pyoperator">,</span><br /><span class="pystring">'URI no longer exists and has been permanently removed.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">411</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Length Required'</span><span class="pyoperator">,</span> <span class="pystring">'Client must specify Content-Length.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">412</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Precondition Failed'</span><span class="pyoperator">,</span> <span class="pystring">'Precondition in headers is false.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">413</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Request Entity Too Large'</span><span class="pyoperator">,</span> <span class="pystring">'Entity is too large.'</span><span class="pyoperator">)</span><span 
class="pyoperator">,</span><br /><span class="pynumber">414</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Request-URI Too Long'</span><span class="pyoperator">,</span> <span class="pystring">'URI is too long.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">415</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Unsupported Media Type'</span><span class="pyoperator">,</span> <span class="pystring">'Entity body in unsupported format.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">416</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Requested Range Not Satisfiable'</span><span class="pyoperator">,</span><br /><span class="pystring">'Cannot satisfy request range.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">417</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Expectation Failed'</span><span class="pyoperator">,</span><br /><span class="pystring">'Expect condition could not be satisfied.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><br /><span class="pynumber">500</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Internal Server Error'</span><span class="pyoperator">,</span> <span class="pystring">'Server got itself in trouble'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">501</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Not Implemented'</span><span class="pyoperator">,</span><br /><span class="pystring">'Server does not support this operation'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">502</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Bad Gateway'</span><span class="pyoperator">,</span> <span class="pystring">'Invalid responses from another server/proxy.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">503</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Service Unavailable'</span><span class="pyoperator">,</span><br /><span class="pystring">'The server cannot process the request due to a high load'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">504</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'Gateway Timeout'</span><span class="pyoperator">,</span><br /><span class="pystring">'The gateway server did not receive a timely response'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pynumber">505</span><span class="pyoperator"><nowiki>:</nowiki></span> <span class="pyoperator">(</span><span class="pystring">'HTTP Version Not Supported'</span><span class="pyoperator">,</span> <span class="pystring">'Cannot fulfill request.'</span><span class="pyoperator">)</span><span class="pyoperator">,</span><br /><span class="pyoperator">}</span><span 
class="pytext"></span></div>
<P><TT class="docutils literal"><SPAN class="pre">URLError</SPAN></TT> is a subclass of the built-in exception <TT class="docutils literal"><SPAN class="pre">IOError</SPAN></TT>.</P>
<P>This means that you can avoid importing <TT class="docutils literal"><SPAN class="pre">URLError</SPAN></TT> and use :</P>
<BLOCKQUOTE>
<DIV class="pysrc"><SPAN class="pykeyword">from</SPAN> <SPAN class="pytext">urllib2</SPAN> <SPAN class="pykeyword">import</SPAN> <SPAN class="pytext">Request</SPAN><SPAN class="pyoperator">,</SPAN> <SPAN class="pytext">urlopen</SPAN><BR></BR>


<SPAN class="pytext">req</SPAN> <SPAN class="pyoperator">=</SPAN> <SPAN class="pytext">Request</SPAN><SPAN class="pyoperator">(</SPAN><SPAN class="pytext">someurl</SPAN><SPAN class="pyoperator">)</SPAN><BR></BR>
When an error is raised the server responds by returning an HTTP error code <i>and</i> an error page. You can use the <tt class="docutils literal"><span class="pre">HTTPError</span></tt> instance as a response on the page returned. This means that as well as the code attribute, it also has read, geturl, and info, methods.
<SPAN class="pykeyword">try</SPAN><SPAN class="pyoperator">:</SPAN><BR></BR>
    <SPAN class="pytext">response</SPAN> <SPAN class="pyoperator">=</SPAN> <SPAN class="pytext">urlopen</SPAN><SPAN class="pyoperator">(</SPAN><SPAN class="pytext">req</SPAN><SPAN class="pyoperator">)</SPAN><BR></BR>


<SPAN class="pykeyword">except</SPAN> <SPAN class="pytext">IOError</SPAN><SPAN class="pyoperator">,</SPAN> <SPAN class="pytext">e</SPAN><SPAN class="pyoperator">:</SPAN><BR></BR>
&gt;&gt;&gt; req = urllib2.Request('http://www.python.org/fish.html')<br />&gt;&gt;&gt; try:<br />&gt;&gt;&gt;    urllib2.urlopen(req)<br />&gt;&gt;&gt; except URLError, e:<br /><br />&gt;&gt;&gt;    print e.code<br />&gt;&gt;&gt;    print e.read()<br />&gt;&gt;&gt;<br />404<br /><br />   "http://www.w3.org/TR/html4/loose.dtd"&gt;<br />   type="text/css"?&gt;<br /><br />
    <SPAN class="pykeyword">if</SPAN> <SPAN class="pytext">hasattr</SPAN><SPAN class="pyoperator">(</SPAN><SPAN class="pytext">e</SPAN><SPAN class="pyoperator">,</SPAN> <SPAN class="pystring">'reason'</SPAN><SPAN class="pyoperator">)</SPAN><SPAN class="pyoperator">:</SPAN><BR></BR>


        <SPAN class="pykeyword">print</SPAN> <SPAN class="pystring">'We failed to reach a server.'</SPAN><BR></BR>
</div></div></div>
        <SPAN class="pykeyword">print</SPAN> <SPAN class="pystring">'Reason: '</SPAN><SPAN class="pyoperator">,</SPAN> <SPAN class="pytext">e</SPAN><SPAN class="pyoperator">.</SPAN><SPAN class="pytext">reason</SPAN><BR></BR>
    <SPAN class="pykeyword">elif</SPAN> <SPAN class="pytext">hasattr</SPAN><SPAN class="pyoperator">(</SPAN><SPAN class="pytext">e</SPAN><SPAN class="pyoperator">,</SPAN> <SPAN class="pystring">'code'</SPAN><SPAN class="pyoperator">)</SPAN><SPAN class="pyoperator">:</SPAN><BR></BR>


        <SPAN class="pykeyword">print</SPAN> <SPAN class="pystring">'The server couldn\'t fulfill the request.'</SPAN><BR></BR>
<br />...... etc...<br />
        <SPAN class="pykeyword">print</SPAN> <SPAN class="pystring">'Error code: '</SPAN><SPAN class="pyoperator">,</SPAN> <SPAN class="pytext">e</SPAN><SPAN class="pyoperator">.</SPAN><SPAN class="pytext">code</SPAN><BR></BR>
<SPAN class="pykeyword">else</SPAN><SPAN class="pyoperator">:</SPAN><BR></BR>
    <SPAN class="pycomment"># everything is fine</SPAN><SPAN class="pytext"></SPAN></DIV></BLOCKQUOTE>


<P class="last">Under rare circumstances <TT class="docutils literal"><SPAN class="pre">urllib2</SPAN></TT> can raise <TT class="docutils literal"><SPAN class="pre">socket.error</SPAN></TT>.</P>
<div class="section">
</DIV>
</DIV>
</DIV>
</DIV>
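If you just need the human-readable messages for a given status code, the table above ships with the standard library, so you can look entries up directly rather than pasting it into your own code. A minimal sketch, using the Python 2 BaseHTTPServer module mentioned above:

<pre>
import BaseHTTPServer

# The 'responses' dictionary reproduced above is available directly
# from the standard library.
responses = BaseHTTPServer.BaseHTTPRequestHandler.responses

short_msg, long_msg = responses[404]
print short_msg   # 'Not Found'
print long_msg    # 'Nothing matches the given URI'
</pre>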
== Wrapping it Up ==

So if you want to be prepared for <tt>HTTPError</tt> <i>or</i> <tt>URLError</tt> there are two basic approaches. I prefer the second approach.

=== Number 1 ===

<pre>
from urllib2 import Request, urlopen, URLError, HTTPError

req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
</pre>

'''Note:''' The <tt>except HTTPError</tt> <i>must</i> come first, otherwise <tt>except URLError</tt> will <i>also</i> catch an <tt>HTTPError</tt>.

=== Number 2 ===

<pre>
from urllib2 import Request, urlopen, URLError

req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine
</pre>

'''Note:''' <tt>URLError</tt> is a subclass of the built-in exception <tt>IOError</tt>. This means that you can avoid importing <tt>URLError</tt> and use:

<pre>
from urllib2 import Request, urlopen

req = Request(someurl)
try:
    response = urlopen(req)
except IOError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine
</pre>

Under rare circumstances <tt>urllib2</tt> can raise <tt>socket.error</tt>.
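If you fetch many URLs it can be convenient to fold the second approach into a small helper. A minimal sketch - the fetch function and its None fallback return value are illustrative, not part of urllib2:

<pre>
import urllib2

def fetch(url):
    # Hypothetical helper wrapping approach Number 2: returns the page
    # body, or None if the URL could not be fetched for any reason.
    try:
        response = urllib2.urlopen(url)
    except IOError, e:
        if hasattr(e, 'reason'):
            print 'We failed to reach a server. Reason:', e.reason
        elif hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request. Code:', e.code
        return None
    return response.read()
</pre>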
= info and geturl =

The response returned by urlopen (or the <tt>HTTPError</tt> instance) has two useful methods, <tt>info</tt> and <tt>geturl</tt>.

<b>geturl</b> - this returns the real URL of the page fetched. This is useful because <tt>urlopen</tt> (or the opener object used) may have followed a redirect. The URL of the page fetched may not be the same as the URL requested.

<b>info</b> - this returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an <tt>httplib.HTTPMessage</tt> instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the [http://www.cs.tut.fi/%7Ejkorpela/http.html Quick Reference to HTTP Headers] for a useful listing of HTTP headers with brief explanations of their meaning and use.
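A minimal sketch of both methods - any URL will do, python.org is just an example:

<pre>
import urllib2

response = urllib2.urlopen('http://www.python.org')

print response.geturl()   # the URL actually fetched, after any redirects
info = response.info()    # httplib.HTTPMessage holding the response headers
print info.getheader('Content-Type')
</pre>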
= Openers and Handlers =

When you fetch a URL you use an opener (an instance of the perhaps confusingly-named <tt>urllib2.OpenerDirector</tt>). Normally we have been using the default opener - via <tt>urlopen</tt> - but you can create custom openers. Openers use handlers. All the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, etc.), or how to handle an aspect of URL opening, for example HTTP redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.

To create an opener, instantiate an OpenerDirector, and then call .add_handler(some_handler_instance) repeatedly.

Alternatively, you can use <tt>build_opener</tt>, which is a convenience function for creating opener objects with a single function call. <tt>build_opener</tt> adds several handlers by default, but provides a quick way to add more and/or override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication, and other common but slightly specialised situations.

<tt>install_opener</tt> can be used to make an <tt>opener</tt> object the (global) default opener. This means that calls to <tt>urlopen</tt> will use the opener you have installed.

Opener objects have an <tt>open</tt> method, which can be called directly to fetch urls in the same way as the <tt>urlopen</tt> function: there's no need to call <tt>install_opener</tt>, except as a convenience.
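As a brief illustration, here is a sketch of building an opener with cookie handling added. This assumes Python 2.4 or later, where the cookielib module and HTTPCookieProcessor are available; the URL is just an example:

<pre>
import cookielib
import urllib2

# build_opener keeps the default handlers and adds ours on top.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

response = opener.open('http://www.python.org')
print len(cookie_jar), 'cookie(s) received'
</pre>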
= Basic Authentication =

To illustrate creating and installing a handler we will use the <tt>HTTPBasicAuthHandler</tt>. For a more detailed discussion of this subject - including an explanation of how Basic Authentication works - see the [http://www.voidspace.org.uk/python/articles/authentication.shtml Basic Authentication Tutorial].

When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a 'realm'. The header looks like: <tt>Www-authenticate: SCHEME realm="REALM"</tt>.

e.g.

<pre>
Www-authenticate: Basic realm="cPanel Users"
</pre>

The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is 'basic authentication'. In order to simplify this process we can create an instance of <tt>HTTPBasicAuthHandler</tt> and an opener to use this handler.

The <tt>HTTPBasicAuthHandler</tt> uses an object called a password manager to handle the mapping of URLs and realms to passwords and usernames. If you know what the realm is (from the authentication header sent by the server), then you can use a <tt>HTTPPasswordMgr</tt>. Frequently one doesn't care what the realm is. In that case, it is convenient to use <tt>HTTPPasswordMgrWithDefaultRealm</tt>. This allows you to specify a default username and password for a URL. This will be supplied in the absence of you providing an alternative combination for a specific realm. We indicate this by providing <tt>None</tt> as the realm argument to the <tt>add_password</tt> method.

The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match.

<pre>
# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
# If we knew the realm, we could use it instead of ``None``.
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)

handler = urllib2.HTTPBasicAuthHandler(password_mgr)

# create "opener" (OpenerDirector instance)
opener = urllib2.build_opener(handler)

# use the opener to fetch a URL
opener.open(a_url)

# Install the opener.
# Now all calls to urllib2.urlopen use our opener.
urllib2.install_opener(opener)
</pre>

'''Note:''' In the above example we only supplied our <tt>HTTPBasicAuthHandler</tt> to <tt>build_opener</tt>. By default openers have the handlers for normal situations - <tt>ProxyHandler</tt>, <tt>UnknownHandler</tt>, <tt>HTTPHandler</tt>, <tt>HTTPDefaultErrorHandler</tt>, <tt>HTTPRedirectHandler</tt>, <tt>FTPHandler</tt>, <tt>FileHandler</tt>, <tt>HTTPErrorProcessor</tt>.

top_level_url is in fact <i>either</i> a full URL (including the 'http:' scheme component and the hostname and optionally the port number) e.g. "http://example.com/" <i>or</i> an "authority" (i.e. the hostname, optionally including the port number) e.g. "example.com" or "example.com:8080" (the latter example includes a port number). The authority, if present, must NOT contain the "userinfo" component - for example "joe:password@example.com" is not correct.
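If you want to see the realm for yourself before configuring the password manager, you can read it off the <tt>HTTPError</tt> raised by an unauthenticated request. A minimal sketch - the protected URL here is hypothetical:

<pre>
import urllib2

protected_url = 'http://example.com/private/'  # hypothetical protected page

try:
    urllib2.urlopen(protected_url)
except urllib2.HTTPError, e:
    if e.code == 401:
        # e.g. 'Basic realm="cPanel Users"'
        print e.info().getheader('www-authenticate')
</pre>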
= Proxies =

<b>urllib2</b> will auto-detect your proxy settings and use those. This is through the <tt>ProxyHandler</tt> which is part of the normal handler chain. Normally that's a good thing, but there are occasions when it may not be helpful [6]. One way to disable the proxy is to set up our own <tt>ProxyHandler</tt>, with no proxies defined. This is done using similar steps to setting up a [http://www.voidspace.org.uk/python/articles/authentication.shtml Basic Authentication] handler:

<pre>
>>> proxy_support = urllib2.ProxyHandler({})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)
</pre>

'''Note:''' Currently <tt>urllib2</tt> <i>does not</i> support fetching of <tt>https</tt> locations through a proxy. This can be a problem.
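Conversely, <tt>ProxyHandler</tt> can be pointed at an explicit proxy rather than the auto-detected one, by passing a dictionary mapping scheme to proxy URL. A minimal sketch - the proxy address below is an example:

<pre>
import urllib2

# Route all http requests through an explicit proxy (example address).
proxy_support = urllib2.ProxyHandler({'http': 'http://proxy.example.com:3128'})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
</pre>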
= Sockets and Layers =

The Python support for fetching resources from the web is layered. urllib2 uses the httplib library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has <i>no timeout</i> and can hang. Currently, the socket timeout is not exposed at the httplib or urllib2 levels. However, you can set the default timeout globally for all sockets using:

<pre>
import socket
import urllib2

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
</pre>
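A timed-out socket surfaces as an exception, and exactly how it is wrapped varies with the Python version, so a cautious fetch catches <tt>socket.error</tt> alongside the urllib2 errors (as noted above, urllib2 can raise it). A minimal sketch:

<pre>
import socket
import urllib2

socket.setdefaulttimeout(10)

try:
    response = urllib2.urlopen('http://www.voidspace.org.uk')
except urllib2.URLError, e:
    print 'Fetch failed:', e
except socket.error, e:
    # Depending on the Python version, a timeout may not be
    # wrapped in URLError and arrives here instead.
    print 'Socket level error:', e
</pre>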
----

= Footnotes =

This document was reviewed and revised by John Lee.

[1] For an introduction to the CGI protocol see [http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html Writing Web Applications in Python].

[2] Like Google for example. The <i>proper</i> way to use google from a program is to use [http://pygoogle.sourceforge.net PyGoogle] of course. See [http://www.voidspace.org.uk/python/recipebook.shtml#google Voidspace Google] for some examples of using the Google API.

[3] Browser sniffing is a very bad practice for website design - building sites using web standards is much more sensible. Unfortunately a lot of sites still send different versions to different browsers.

[4] The user agent for MSIE 6 is <i>'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'</i>.

[5] For details of more HTTP request headers, see [http://www.cs.tut.fi/%7Ejkorpela/http.html Quick Reference to HTTP Headers].

[6] In my case I have to use a proxy to access the internet at work. If you attempt to fetch <i>localhost</i> URLs through this proxy it blocks them. IE is set to use the proxy, which urllib2 picks up on. In order to test scripts with a localhost server, I have to prevent urllib2 from using the proxy.


h1. Article Extracted from [http://www.boddie.org.uk/python/HTML.html]


{html}


<H2><A name="Abstract"></A>Abstract</H2>


<P>Various Web surfing tasks that I regularly perform could be made much
easier, and less tedious, if I could only use <A href="http://www.python.org">Python</A> to fetch the HTML pages and to
process them, yielding the information I really need. In this document I
attempt to describe HTML processing in Python using readily available tools
and libraries.</P>


<P><B>NOTE:</B> This document is not quite finished. I aim to
include sections on using mxTidy to deal with broken HTML as well as some
tips on cleaning up text retrieved from HTML resources.</P>


<H2><A name="Prerequisites"></A>Prerequisites</H2>


<P>Depending on the methods you wish to follow in this tutorial, you need the
following things:</P>
<UL>
  <LI>For the "SGML parser" method, a recent release of Python is probably
    enough. You can find one at the Python <A href="http://www.python.org/download/">download</A> page.</LI>


  <LI>For
the "XML parser" method, a recent release of Python is required, along
with a capable XML processing library. I recommend using <A href="http://esbinfo:8090/pages/libxml2dom.html">libxml2dom</A>, since it can handle badly-formed HTML documents as well as well-formed XML or XHTML documents. However, <A href="http://pyxml.sourceforge.net/">PyXML</A> also provides support for such documents.</LI><LI>For fetching Web pages over secure connections, it is important that
    SSL support is enabled either when building Python from source, or in any
    packaged distribution of Python that you might acquire. Information about
    this is given in the source distribution of Python, but you can download
    replacement socket libraries with SSL support for older versions of Python for Windows from <A href="http://alldunn.com/python/">Robin Dunn's site</A>.</LI>
</UL>


<H2><A name="Activities"></A>Activities</H2>


<P>Accessing sites, downloading content, and processing such content, either
to extract useful information for archiving or to use such content to
navigate further into the site, require combinations of the following
activities. Some activities can be chosen according to preference: whether
the SGML parser or the XML parser (or parsing framework) is used depends on
which style of programming seems nicer to a given developer (although one
parser may seem to work better in some situations). However, technical
restrictions usually dictate whether certain libraries are to be used instead
of others: when handling HTTP redirects, it appears that certain Python
modules are easier to use, or even more suited to handling such
situations.</P>


<H3>Fetching Web Pages</H3>


<P>Fetching standard Web pages over HTTP is very easy with Python:</P>
<PRE>
import urllib
# Get a file-like object for the Python Web site's home page.
f = urllib.urlopen("http://www.python.org")
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()
</PRE>


<H4>Supplying Data</H4>


<P>Sometimes, it is necessary to pass information to the Web server, such as
information which would come from an HTML form. Of course, you need to know
which fields are available in a form, but assuming that you already know
this, you can supply such data in the <CODE>urlopen</CODE> function call:</P>
<PRE>
# Search the Vaults of Parnassus for "XMLForms".
# First, encode the data.
data = urllib.urlencode({"find" : "XMLForms", "findtype" : "t"})
# Now get that file-like object again, remembering to mention the data.
f = urllib.urlopen("http://www.vex.net/parnassus/apyllo.py", data)
# Read the results back.
s = f.read()
f.close()
</PRE>


<P>The above example passed data to the server as an HTTP <CODE>POST</CODE> request.
Fortunately, the <A href="http://www.vex.net/parnassus/apyllo.py">Vaults of
Parnassus</A> is happy about such requests, but this is not always the case
with Web services. We can instead choose to use a different kind of request,
however:</P>


<PRE>
# We have the encoded data. Now get the file-like object...
f = urllib.urlopen("http://www.vex.net/parnassus/apyllo.py?" + data)
# And the rest...
</PRE>


<P>The only difference is the use of a <CODE>?</CODE> (question mark) character and the
adding of <CODE>data</CODE> onto the end of the Vaults of Parnassus URL, but
this constitutes an HTTP <CODE>GET</CODE> request, where the query (our additional data)
is included in the URL itself.</P>


<H3>Fetching Secure Web Pages</H3>


Note

In the above example we only supplied our <tt class="docutils literal"><span class="pre">HTTPBasicAuthHandler</span></tt> to <tt class="docutils literal"><span class="pre">build_opener</span></tt>. By default openers have the handlers for normal situations - <tt class="docutils literal"><span class="pre">ProxyHandler</span></tt>, <tt class="docutils literal"><span class="pre">UnknownHandler</span></tt>, <tt class="docutils literal"><span class="pre">HTTPHandler</span></tt>, <tt class="docutils literal"><span class="pre">HTTPDefaultErrorHandler</span></tt>, <tt class="docutils literal"><span class="pre">HTTPRedirectHandler</span></tt>, <tt class="docutils literal"><span class="pre">FTPHandler</span></tt>, <tt class="docutils literal"><span class="pre">FileHandler</span></tt>, <tt class="docutils literal"><span class="pre">HTTPErrorProcessor</span></tt>.

</div>

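To see the default chain for yourself, you can build an opener with no arguments and list the classes of its handlers. This is only an illustrative interactive session: the <tt class="docutils literal"><span class="pre">handlers</span></tt> attribute is an internal detail of the opener, and the exact contents and order may vary between Python versions.

&gt;&gt;&gt; opener = urllib2.build_opener()<br />&gt;&gt;&gt; [h.__class__.__name__ for h in opener.handlers]<br />['ProxyHandler', 'UnknownHandler', 'HTTPHandler', ...]
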
top_level_url is in fact <i>either</i> a full URL (including the 'http:' scheme component and the hostname and optionally the port number) e.g. "http://example.com/" <i>or</i> an "authority" (i.e. the hostname, optionally including the port number) e.g. "example.com" or "example.com:8080" (the latter example includes a port number). The authority, if present, must NOT contain the "userinfo" component - for example "joe:password@example.com" is not correct.
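
Both forms can be passed to <tt class="docutils literal"><span class="pre">add_password</span></tt>. As a minimal sketch (reusing the <tt class="docutils literal"><span class="pre">password_mgr</span></tt>, <tt class="docutils literal"><span class="pre">username</span></tt> and <tt class="docutils literal"><span class="pre">password</span></tt> names from the example above), the authority form covers the whole server, while the full URL form covers only that URL and URLs "deeper" than it:

password_mgr.add_password(None, "example.com", username, password)<br />password_mgr.add_password(None, "http://example.com/foo/", username, password)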

</div><div class="section">

= Proxies =


<b>urllib2</b> will auto-detect your proxy settings and use those. This is through the <tt class="docutils literal"><span class="pre">ProxyHandler</span></tt> which is part of the normal handler chain. Normally that's a good thing, but there are occasions when it may not be helpful [6]. One way to disable proxy handling is to set up our own <tt class="docutils literal"><span class="pre">ProxyHandler</span></tt> with no proxies defined. This is done using similar steps to setting up a [http://www.voidspace.org.uk/python/articles/authentication.shtml Basic Authentication] handler:
&gt;&gt;&gt; proxy_support = urllib2.ProxyHandler({})<br />&gt;&gt;&gt; opener = urllib2.build_opener(proxy_support)<br />&gt;&gt;&gt; urllib2.install_opener(opener)<br />
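
The same mechanism can be used the other way around, to force the use of a particular proxy by mapping URL schemes to proxy URLs. The proxy address below is made up for illustration:

&gt;&gt;&gt; proxy_support = urllib2.ProxyHandler({"http" : "http://proxy.example.com:3128"})<br />&gt;&gt;&gt; opener = urllib2.build_opener(proxy_support)<br />&gt;&gt;&gt; urllib2.install_opener(opener)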
<div class="note">


Note

Currently <tt class="docutils literal"><span class="pre">urllib2</span></tt> <i>does not</i> support fetching of <tt class="docutils literal"><span class="pre">https</span></tt> locations through a proxy. This can be a problem.
</div></div><div class="section">


= Sockets and Layers =
The Python support for fetching resources from the web is layered. urllib2 uses the httplib library, which in turn uses the socket library.
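
To make the layering concrete, here is a minimal sketch (not part of the original tutorial) that performs the same kind of fetch one layer down with httplib, and again at the bottom with a raw socket:

import httplib, socket<br /><br /><nowiki># The httplib layer: urllib2 drives calls like these for each request.</nowiki><br />conn = httplib.HTTPConnection("www.voidspace.org.uk")<br />conn.request("GET", "/")<br />print conn.getresponse().status<br /><br /><nowiki># The socket layer: httplib in turn conducts a TCP conversation like this.</nowiki><br />s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)<br />s.connect(("www.voidspace.org.uk", 80))<br />s.send("GET / HTTP/1.0\r\nHost: www.voidspace.org.uk\r\n\r\n")<br />print s.recv(100)<br />s.close()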

As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has <i>no timeout</i> and can hang. Currently, the socket timeout is not exposed at the httplib or urllib2 levels. However, you can set the default timeout globally for all sockets using:
<div class="pysrc"><span class="pykeyword">import</span> <span class="pytext">socket</span><br /><span class="pykeyword">import</span> <span class="pytext">urllib2</span><br /><br /><span class="pycomment"><nowiki># timeout in seconds</nowiki><br /></span><span class="pytext">timeout</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pynumber">10</span><br /><span class="pytext">socket</span><span class="pyoperator">.</span><span class="pytext">setdefaulttimeout</span><span class="pyoperator">(</span><span class="pytext">timeout</span><span class="pyoperator">)</span><br /><br /><span class="pycomment"><nowiki># this call to urllib2.urlopen now uses the default timeout</nowiki><br /></span><span class="pycomment"><nowiki># we have set in the socket module</nowiki><br /></span><span class="pytext">req</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">urllib2</span><span class="pyoperator">.</span><span class="pytext">Request</span><span class="pyoperator">(</span><span class="pystring">'http://www.voidspace.org.uk'</span><span class="pyoperator">)</span><br /><span class="pytext">response</span> <span class="pyoperator"><nowiki>=</nowiki></span> <span class="pytext">urllib2</span><span class="pyoperator">.</span><span class="pytext">urlopen</span><span class="pyoperator">(</span><span class="pytext">req</span><span class="pyoperator">)</span><span class="pytext"></span></div></div>
----
<div class="section">


= Footnotes =
This document was reviewed and revised by John Lee.


{| id="id8" class="docutils footnote" frame="void" rules="none"
| class="label" |
[1]
|
For an introduction to the CGI protocol see [http://www.pyzine.com/Issue008/Section_Articles/article_CGIOne.html Writing Web Applications in Python].
|}


{| id="id9" class="docutils footnote" frame="void" rules="none"
| class="label" |
[2]
|
Like Google for example. The <i>proper</i> way to use Google from a program is to use [http://pygoogle.sourceforge.net PyGoogle] of course. See [http://www.voidspace.org.uk/python/recipebook.shtml#google Voidspace Google] for some examples of using the Google API.
|}


{| id="id10" class="docutils footnote" frame="void" rules="none"
| class="label" |
[3]
| Browser sniffing is a very bad practice for website design - building sites using web standards is much more sensible. Unfortunately a lot of sites still send different versions to different browsers.
|}


{| id="id11" class="docutils footnote" frame="void" rules="none"
| class="label" |
[4]
| The user agent for MSIE 6 is <i>'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'</i>
|}

{| id="id12" class="docutils footnote" frame="void" rules="none"
| class="label" |
[5]
|
For details of more HTTP request headers, see [http://www.cs.tut.fi/%7Ejkorpela/http.html Quick Reference to HTTP Headers].
|}

{| id="id13" class="docutils footnote" frame="void" rules="none"
| class="label" |
[6]
| In my case I have to use a proxy to access the internet at work. If you attempt to fetch <i>localhost</i> URLs through this proxy it blocks them. IE is set to use the proxy, which urllib2 picks up on. In order to test scripts with a localhost server, I have to prevent urllib2 from using the proxy.
|}

</div>

= Article Extracted from [http://www.boddie.org.uk/python/HTML.html] =


== Abstract ==

Various Web surfing tasks that I regularly perform could be made much easier, and less tedious, if I could only use [http://www.python.org Python] to fetch the HTML pages and to process them, yielding the information I really need. In this document I attempt to describe HTML processing in Python using readily available tools and libraries.


<b>NOTE:</b> This document is not quite finished. I aim to include sections on using mxTidy to deal with broken HTML as well as some tips on cleaning up text retrieved from HTML resources.

== Prerequisites ==


Depending on the methods you wish to follow in this tutorial, you need the following things:

* For the "SGML parser" method, a recent release of Python is probably enough. You can find one at the Python [http://www.python.org/download/ download] page.
* For the "XML parser" method, a recent release of Python is required, along with a capable XML processing library. I recommend using [http://esbinfo:8090/pages/libxml2dom.html libxml2dom], since it can handle badly-formed HTML documents as well as well-formed XML or XHTML documents. However, [http://pyxml.sourceforge.net/ PyXML] also provides support for such documents.
* For fetching Web pages over secure connections, it is important that SSL support is enabled either when building Python from source, or in any packaged distribution of Python that you might acquire. Information about this is given in the source distribution of Python, but you can download replacement socket libraries with SSL support for older versions of Python for Windows from [http://alldunn.com/python/ Robin Dunn's site].


== Activities ==
Accessing sites, downloading content, and processing such content, either to extract useful information for archiving or to use such content to navigate further into the site, require combinations of the following activities. Some activities can be chosen according to preference: whether the SGML parser or the XML parser (or parsing framework) is used depends on which style of programming seems nicer to a given developer (although one parser may seem to work better in some situations). However, technical restrictions usually dictate whether certain libraries are to be used instead of others: when handling HTTP redirects, it appears that certain Python modules are easier to use, or even more suited to handling such situations.
=== Fetching Web Pages ===


Fetching standard Web pages over HTTP is very easy with Python:

import urllib<br /><nowiki># Get a file-like object for the Python Web site's home page.</nowiki><br />f = urllib.urlopen("http://www.python.org")<br /><nowiki># Read from the object, storing the page's contents in 's'.</nowiki><br />s = f.read()<br />f.close()


==== Supplying Data ====

Sometimes, it is necessary to pass information to the Web server, such as information which would come from an HTML form. Of course, you need to know which fields are available in a form, but assuming that you already know this, you can supply such data in the <code>urlopen</code> function call:
<nowiki># Search the Vaults of Parnassus for "XMLForms".</nowiki><br /><nowiki># First, encode the data.</nowiki><br />data = urllib.urlencode({"find" : "XMLForms", "findtype" : "t"})<br /><nowiki># Now get that file-like object again, remembering to mention the data.</nowiki><br />f = urllib.urlopen("http://www.vex.net/parnassus/apyllo.py", data)<br /><nowiki># Read the results back.</nowiki><br />s = f.read()<br />f.close()


The above example passed data to the server as an HTTP <code>POST</code> request. Fortunately, the [http://www.vex.net/parnassus/apyllo.py Vaults of Parnassus] is happy about such requests, but this is not always the case with Web services. We can instead choose to use a different kind of request, however:


<nowiki># We have the encoded data. Now get the file-like object...</nowiki><br />f = urllib.urlopen("http://www.vex.net/parnassus/apyllo.py?" + data)<br /><nowiki># And the rest...</nowiki>


The only difference is the use of a <code>?</code> (question mark) character and the adding of <code>data</code> onto the end of the Vaults of Parnassus URL, but this constitutes an HTTP <code>GET</code> request, where the query (our additional data) is included in the URL itself.
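
If you are curious what such a URL looks like, you can display it directly. The parameter order may differ when you run this, since it follows the dictionary's internal ordering:

&gt;&gt;&gt; print "http://www.vex.net/parnassus/apyllo.py?" + data<br />http://www.vex.net/parnassus/apyllo.py?find=XMLForms&findtype=t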
=== Fetching Secure Web Pages ===
Fetching secure Web pages using HTTPS is also very easy, provided that your Python installation supports SSL:
import urllib<br /><nowiki># Get a file-like object for a site.</nowiki><br />f = urllib.urlopen("https://www.somesecuresite.com")<br /><nowiki># NOTE: At the interactive Python prompt, you may be prompted for a username</nowiki><br /><nowiki># NOTE: and password here.</nowiki><br /><nowiki># Read from the object, storing the page's contents in 's'.</nowiki><br />s = f.read()<br />f.close()
Including data which forms the basis of a query, as illustrated above, is also possible with URLs starting with <code>https</code>.
=== Handling Redirects ===


Many Web services use HTTP redirects for various straightforward or even bizarre purposes. For example, a fairly common technique employed on "high traffic" Web sites is the HTTP redirection load balancing strategy where the initial request to the publicised Web site (eg. <code>http://www.somesite.com</code>) is redirected to another server (eg. <code>http://www1.somesite.com</code>) where a user's session is handled.
Fortunately, <code>urlopen</code> handles redirects, at least in Python 2.1, and therefore any such redirection should be handled transparently by <code>urlopen</code> without your program needing to be aware that it is happening. It is possible to write code to deal with redirection yourself, and this can be done using the <code>httplib</code> module; however, the interfaces provided by that module are more complicated than those provided above, if somewhat more powerful.
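
If you do want to observe a redirect rather than follow it, a minimal sketch with <code>httplib</code> might look like this - the host is illustrative, and real code would still have to fetch the new location itself:

import httplib<br /><br />conn = httplib.HTTPConnection("www.somesite.com")<br />conn.request("GET", "/")<br />response = conn.getresponse()<br />if response.status in (301, 302):<br />    print "Redirected to:", response.getheader("Location")<br />conn.close()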
=== Using the SGML Parser ===
Given a character string from a Web service, such as the value held by <code>s</code> in the above examples, how can one understand the content provided by the service in such a way that an "intelligent" response can be made? One method is by using an SGML parser, since HTML is a relation of SGML, and HTML is probably the content type most likely to be experienced when interacting with a Web service.
In the standard Python library, the <code>sgmllib</code> module contains an appropriate parser class called <code>SGMLParser</code>. Unfortunately, it is of limited use to us unless we customise its activities somehow. Fortunately, Python's object-oriented features, combined with the design of the <code>SGMLParser</code> class, provide a means of customising it fairly easily.


==== Defining a Parser Class ====


First of all, let us define a new class inheriting from <code>SGMLParser</code> with a convenience method that I find very convenient indeed:


import sgmllib<br /><br />class MyParser(sgmllib.SGMLParser):<br />    "A simple parser class."<br /><br />    def parse(self, s):<br />        "Parse the given string 's'."<br />        self.feed(s)<br />        self.close()<br /><br />    # More to come...
What the <code>parse</code> method does is provide an easy way of passing some text (as a string) to the parser object. I find this nicer than having to remember calling the <code>feed</code> method, and since I always tend to have the entire document ready for parsing, I do not need to use <code>feed</code> many times - passing many pieces of text which comprise an entire document is an interesting feature of <code>SGMLParser</code> (and its derivatives) which could be used in other situations.
==== Deciding What to Remember ====
Of course, implementing our own customised parser is only of interest if we are looking to find things in a document. Therefore, we should aim to declare these things before we start parsing. We can do this in the <code>__init__</code> method of our class:
    # Continuing from above...<br /><br />    def __init__(self, verbose=0):<br />        "Initialise an object, passing 'verbose' to the superclass."<br /><br />        sgmllib.SGMLParser.__init__(self, verbose)<br />        self.hyperlinks = []<br /><br />    # More to come...
Here, we initialise new objects by passing information to the <code>__init__</code> method of the superclass (<code>SGMLParser</code>); this makes sure that the underlying parser is set up properly. We also initialise an attribute called <code>hyperlinks</code> which will be used to record the hyperlinks found in the document that any given object will parse.
Care should be taken when choosing attribute names, since use of names defined in the superclass could potentially cause problems when our parser object is used, because a badly chosen name would cause one of our attributes to override an attribute in the superclass and result in our attributes being manipulated for internal parsing purposes by the superclass. We might hope that the <code>SGMLParser</code> class uses attribute names with leading double underscores (<code>__</code>) since this isolates such attributes from access by subclasses such as our own <code>MyParser</code> class.
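
For example (this illustration relies on an internal detail of the standard library, not on anything guaranteed), <code>SGMLParser</code> keeps its own stack of open tags in an attribute called <code>stack</code>, so a subclass assigning to that name in its <code>__init__</code> method would corrupt the parsing machinery:

    # A badly chosen attribute name - SGMLParser already uses 'stack'<br />    # internally to track open tags, so this assignment would break parsing.<br />    self.stack = []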
==== Remembering Document Details ====


We now need to define a way of extracting data from the document, but <code>SGMLParser</code> provides a mechanism which notifies us when an interesting part of the document has been read. SGML and HTML are textual formats which are structured by the presence of so-called tags, and in HTML, hyperlinks may be represented in the following way:


<nowiki><a href="http://www.python.org">The Python Web site</a></nowiki>


===== How SGMLParser Operates =====
An <code>SGMLParser</code> object which is parsing a document recognises starting and ending tags for things such as hyperlinks, and it issues a method call on itself based on the name of the tag found and whether the tag is a starting or ending tag. So, as the above text is recognised by an <code>SGMLParser</code> object (or an object derived from <code>SGMLParser</code>, like <code>MyParser</code>), the following method calls are made internally:


self.start_a([("href", "http://www.python.org")])<br />self.handle_data("The Python Web site")<br />self.end_a()
Note that the text between the tags is considered as data, and that the ending tag does not provide any information. The starting tag, however, does provide information in the form of a sequence of attribute names and values, where each name/value pair is placed in a 2-tuple:


<nowiki># The form of attributes supplied to start tag methods:</nowiki><br /><nowiki># (name, value)</nowiki><br /><nowiki># Examples:</nowiki><br /><nowiki># ("href", "http://www.python.org")</nowiki><br /><nowiki># ("target", "python")</nowiki>
===== Why SGMLParser Works =====
Why does <code>SGMLParser</code> issue a method call on itself, effectively telling itself that a tag has been encountered? The basic <code>SGMLParser</code> class surely does not know what to do with such information. Well, if another class inherits from <code>SGMLParser</code>, then such calls are no longer confined to <code>SGMLParser</code> and instead act on methods in the subclass, such as <code>MyParser</code>, where such methods exist. Thus, a customised parser class (eg. <code>MyParser</code>) once instantiated (made into an object) acts like a stack of components, with the lowest level of the stack doing the hard parsing work and passing items of interest to the upper layers - it is a bit like a factory with components being made on the ground floor and inspection of those components taking place in the laboratories in the upper floors!
{| border="1"
! Class
! Activity
|-
| ...
| Listens to reports, records other interesting things
|-
| <code>MyParser</code>
| Listens to reports, records interesting things
|-
| <code>SGMLParser</code>
| Parses documents, issuing reports at each step
|}


===== Introducing Our Customisations =====
Now, if we want to record the hyperlinks in the document, all we need to do is to define a method called <code>start_a</code> which extracts the hyperlink from the attributes which are provided in the starting <code>a</code> tag. This can be defined as follows:
    # Continuing from above...<br /><br />    def start_a(self, attributes):<br />        "Process a hyperlink and its 'attributes'."<br /><br />        for name, value in attributes:<br />            if name == "href":<br />                self.hyperlinks.append(value)<br /><br />    # More to come...
All we need to do is traverse the <code>attributes</code> list, find appropriately named attributes, and record the value of those attributes.


==== Retrieving the Details ====
A nice way of providing access to the retrieved details is to define a method, although Python 2.2 provides additional features to make this more convenient. We shall use the old approach:
    # Continuing from above...<br /><br />    def get_hyperlinks(self):<br />        "Return the list of hyperlinks."<br /><br />        return self.hyperlinks
==== Trying it Out ====
Now that we have defined our class, we can instantiate it, making a new <code>MyParser</code> object. After that, it is just a matter of giving it a document to work with:
import urllib, sgmllib<br /><br /><nowiki># Get something to work with.</nowiki><br />f = urllib.urlopen("http://www.python.org")<br />s = f.read()<br /><br /><nowiki># Try and process the page.</nowiki><br /><nowiki># The class should have been defined first, remember.</nowiki><br />myparser = MyParser()<br />myparser.parse(s)<br /><br /><nowiki># Get the hyperlinks.</nowiki><br />print myparser.get_hyperlinks()
The <code>print</code> statement should cause a list to be displayed, containing various hyperlinks to locations on the Python home page and other sites.


===== The Example File =====
The above example code can be [http://esbinfo:8090/pages/downloads/HTML1.py downloaded] and executed to see the results.
==== Finding More Specific Content ====


Of course, if it is sufficient for you to extract information from a document without worrying about where in the document it came from, then the above level of complexity should suit you perfectly. However, one might want to extract information which only appears in certain places or constructs - a good example of this is the text between starting and ending tags of hyperlinks which we saw above. If we just acquired every piece of text using a <code>handle_data</code> method which recorded everything it saw, then we would not know which piece of text described a hyperlink and which piece of text appeared in any other place in a document.
    # An extension of the above class.<br />    # This is not very useful.<br /><br />    def handle_data(self, data):<br />        "Handle the textual 'data'."<br /><br />        self.descriptions.append(data)


Here, the <code>descriptions</code> attribute (which we would need to initialise in the <code>__init__</code> method) would be filled with lots of meaningless textual data. So how can we be more specific? The best approach is to remember not only the content that <code>SGMLParser</code> discovers, but also to remember what kind of content we have seen already.
===== Remembering Our Position =====
Let us add some new attributes to the <code>__init__</code> method.


        # At the end of the __init__ method...<br /><br />        self.descriptions = []<br />        self.inside_a_element = 0
The <code>descriptions</code> attribute is defined as we anticipated, but the <code>inside_a_element</code> attribute is used for something different: it will indicate whether or not <code>SGMLParser</code> is currently investigating the contents of an <code>a</code> element - that is, whether <code>SGMLParser</code> is between the starting <code>a</code> tag and the ending <code>a</code> tag.


Let us now add some "logic" to the <code>start_a</code> method, redefining it as follows:


    def start_a(self, attributes):<br />        "Process a hyperlink and its 'attributes'."<br /><br />        for name, value in attributes:<br />            if name == "href":<br />                self.hyperlinks.append(value)<br />                self.inside_a_element = 1
Now, we should know when a starting <code>a</code> tag has been seen, but to avoid confusion, we should also change the value of the new attribute when the parser sees an ending <code>a</code> tag. We do this by defining a new method for this case:
    def end_a(self):<br />        "Record the end of a hyperlink."<br /><br />        self.inside_a_element = 0


Fortunately, it is not permitted to "nest" hyperlinks, so it is not relevant to wonder what might happen if an ending tag were to be seen after more than one starting tag had been seen in succession.
===== Recording Relevant Data =====


Now, given that we can be sure of our position in a document and whether we should record the data that is being presented, we can define the "real" <code>handle_data</code> method as follows:
    def handle_data(self, data):<br />        "Handle the textual 'data'."<br /><br />        if self.inside_a_element:<br />            self.descriptions.append(data)


This method is not perfect, as we shall see, but it does at least avoid recording every last piece of text in the document.
We can now define a method to retrieve the description data:


    def get_descriptions(self):<br />        "Return a list of descriptions."<br /><br />        return self.descriptions
And we can add the following line to our test program in order to display the descriptions:


print myparser.get_descriptions()
 
===== The Example File =====
 
The example code with these modifications can be [http://esbinfo:8090/pages/downloads/HTML2.py downloaded] and executed to see the results.
 
==== Problems with Text ====
 
Upon running the modified example, one thing is apparent: there are a few descriptions which do not make sense. Moreover, the number of descriptions does not match the number of hyperlinks. The reason for this is the way that text is found and presented to us by the parser - we may be presented with more than one fragment of text for a particular region of text, so that more than one fragment of text may be signalled between a starting <code>a</code> tag and an ending <code>a</code> tag, even though it is logically one block of text.
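
As an illustration (this example is mine, not the original author's), an entity reference in the middle of the link text is enough to split the data, because the entity is reported through a separate method call:

<nowiki># <a href="/menu">Fish &amp; Chips</a> reaches the parser roughly as:</nowiki><br /><nowiki>#</nowiki><br /><nowiki># self.start_a([("href", "/menu")])</nowiki><br /><nowiki># self.handle_data("Fish ")       # first fragment</nowiki><br /><nowiki># self.handle_entityref("amp")    # the entity arrives separately</nowiki><br /><nowiki># self.handle_data(" Chips")      # second fragment</nowiki><br /><nowiki># self.end_a()</nowiki>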
 
We may modify our example by adding another attribute to indicate whether we are just beginning to process a region of text. If this new attribute is set, then we add a description to the list; if not, then we add any text found to the most recent description recorded.
 
The <code>__init__</code> method is modified still further:
 
        # At the end of the __init__ method...<br /><br />        self.starting_description = 0
 
Since we can only be sure that a description is being started immediately after a starting <code>a</code> tag has been seen, we redefine the <code>start_a</code> method as follows:
 
    def start_a(self, attributes):<br />        "Process a hyperlink and its 'attributes'."<br /><br />        for name, value in attributes:<br />            if name == "href":<br />                self.hyperlinks.append(value)<br />                self.inside_a_element = 1<br />                self.starting_description = 1
 
Now, the <code>handle_data</code> method needs redefining as follows:
 
    def handle_data(self, data):
        "Handle the textual 'data'."

        if self.inside_a_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data
 
Clearly, the method becomes more complicated. We need to detect whether the description is being started and act in the manner discussed above.
 
===== The Example File =====
 
The example code with these modifications can be [http://esbinfo:8090/pages/downloads/HTML3.py downloaded] and executed to see the results.
 
==== Conclusions ====
 
Although the final example file produces some reasonable results - there are still some strange descriptions, and we have not taken images used within hyperlinks into consideration - the modifications that were required illustrate that as more attention is paid to the structure of the document, the more effort is required to monitor the origins of information. As a result, we need to maintain state information within the <code>MyParser</code> object in a not-too-elegant way.
 
For application purposes, the <code>SGMLParser</code> class, its derivatives, and related approaches (such as SAX) are useful for casual access to information, but for certain kinds of querying, they can become more complicated to use than one would initially believe. However, these approaches can be used for another purpose: that of building structures which can be accessed in a more methodical fashion, as we shall see below.
 
=== Using XML Parsers ===
 
Given a character string <code>s</code>, containing an HTML document which may have been retrieved from a Web service (using an approach described in an earlier section of this document), let us now consider an alternative method of interpreting the contents of this document so that we do not have to manage the complexity of remembering explicitly the structure of the document that we have seen so far. One of the problems with <code>SGMLParser</code> was that access to information in a document happened "serially" - that is, information was presented to us in the order in which it was found - but it may have been more appropriate to access the document information according to the structure of the document, so that we could request all parts of the document corresponding to the hyperlink elements present in that document, before examining each document portion for the text within each hyperlink element.
 
In the XML world, a standard called the Document Object Model (DOM) has been devised to provide a means of access to document information which permits us to navigate the structure of a document, requesting different sections of that document, and giving us the ability to revisit such sections at any time; the use of Python with XML and the DOM is described in [http://esbinfo:8090/pages/XML_intro.html another document]. If all Web pages were well-formed XML - that is, they all complied with the expectations and standards set out by the XML specifications - then any XML parser would be sufficient to process any HTML document found on the Web. Unfortunately, many Web pages use less formal variants of HTML which are rejected by XML parsers. Thus, we need to employ particular tools and additional techniques to convert such pages to DOM representations.
 
Below, we describe how Web pages may be processed using the [http://pyxml.sourceforge.net/ PyXML] toolkit and with the [http://esbinfo:8090/pages/libxml2dom.html libxml2dom] package to obtain a top-level document object. Since both approaches yield an object which is broadly compatible with the DOM standard, the subsequent description of how we then inspect such documents applies regardless of whichever toolkit or package we have chosen.
 
==== Using PyXML ====
 
It is possible to use Python's XML framework with the kind of HTML found on the Web by employing a special "reader" class which builds a DOM representation from an HTML document, and the consequences of this are described below.
 
===== Creating the Reader =====
 
An appropriate class for reading HTML documents is found deep in the <code>xml</code> package, and we shall instantiate this class for subsequent use:
 
from xml.dom.ext.reader import HtmlLib
reader = HtmlLib.Reader()
 
Of course, there are many different ways of accessing the <code>Reader</code> class concerned, but I have chosen not to import <code>Reader</code> into the common namespace. One good reason for deciding this is that I may wish to import other <code>Reader</code> classes from other packages or modules, and we clearly need a way to distinguish between them. Therefore, I import the <code>HtmlLib</code> name and access the <code>Reader</code> class from within that module.
 
===== Loading a Document =====
 
Unlike <code>SGMLParser</code>, we do not need to customise any class before we load a document. Therefore, we can "postpone" any consideration of the contents of the document until after the document has been loaded, although it is very likely that you will have some idea of the nature of the contents in advance and will have written classes or functions to work on the DOM representation once it is available. After all, real programs extracting particular information from a certain kind of document do need to know something about the structure of the documents they process, whether that knowledge is put in a subclass of a parser (as in <code>SGMLParser</code>) or whether it is "encoded" in classes and functions which manipulate the DOM representation.
 
Anyway, let us load the document and obtain a <code>Document</code> object:
 
doc = reader.fromString(s)
 
Note that the "top level" of a DOM representation is always a <code>Document</code> node object, and this is what <code>doc</code> refers to immediately after the document is loaded.
 
==== Using libxml2dom ====
 
Obtaining documents using libxml2dom is slightly more straightforward:
 
import libxml2dom
doc = libxml2dom.parseString(s, html=1)
 
If the document text is well-formed XML, we could omit the <code>html</code> parameter or set it to have a false value. However, if we are not sure whether the text is well-formed, no significant issues will arise from setting the parameter in the above fashion.
 
==== Deciding What to Extract ====
 
Now, it is appropriate to decide which information is to be found and retrieved from the document, and this is where some tasks appear easier than with <code>SGMLParser</code> (and related frameworks). Let us consider the task of extracting all the hyperlinks from the document; we can certainly find all the hyperlink elements as follows:
 
a_elements = doc.getElementsByTagName("a")
 
Since hyperlink elements comprise the starting <code>a</code> tag, the ending <code>a</code> tag, and all data between them, the value of the <code>a_elements</code> variable should be a list of objects representing regions in the document which would appear like this:
 
 <a href="http://www.python.org">The Python Web site</a>
 
===== Querying Elements =====
 
To make the elements easier to deal with, each object in the list is not the textual representation of the element as given above. Instead, an object is created for each element which provides a more convenient level of access to the details. We can therefore obtain a reference to such an object and find out more about the element it represents:
 
# Get the first element in the list. We don't need to use a separate variable,
# but it makes it clearer.
first = a_elements[0]
# Now display the value of the "href" attribute.
print first.getAttribute("href")
 
What is happening here is that the <code>first</code> object (being the first <code>a</code> element in the list of those found) is being asked to return the value of the attribute whose name is <code>href</code>, and if such an attribute exists, a string is returned containing the contents of the attribute: in the case of the above example, this would be...
 
http://www.python.org
 
If the <code>href</code> attribute had not existed, such as in the following example element, then a value of <code>None</code> would have been returned.
 
 <a name="target">This is not a hyperlink. It is a target.</a>
 
===== Namespaces =====
 
Previously, this document recommended the usage of namespaces and the <code>getAttributeNS</code> method, rather than the <code>getAttribute</code> method. Whilst XML processing may involve extensive use of namespaces, some HTML parsers do not appear to expose them quite as one would expect: for example, not associating the XHTML namespace with XHTML elements in a document. Thus, it can be advisable to ignore namespaces unless their usage is unavoidable in order to distinguish between elements in mixed-content documents (XHTML combined with SVG, for example).
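To make the distinction concrete, here is a sketch (the namespace URI is the standard XHTML one; whether the first search finds anything depends entirely on the parser, as discussed above):

XHTML_NS = "http://www.w3.org/1999/xhtml"
# If the parser associated elements with the XHTML namespace, this would find
# the anchors...
a_elements = doc.getElementsByTagNameNS(XHTML_NS, "a")
# ...but with parsers that leave HTML elements outside any namespace, only the
# plain search finds them.
a_elements = doc.getElementsByTagName("a")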
 
==== Finding More Specific Content ====
 
We are already being fairly specific, in a sense, in the way that we have chosen to access the <code>a</code> elements within the document, since we start from a particular point in the document's structure and search for elements from there. In the <code>SGMLParser</code> examples, we decided to look for descriptions of hyperlinks in the text which is enclosed between the starting and ending tags associated with hyperlinks, and we were largely successful with that, although there were some issues that could have been handled better. Here, we shall attempt to find <i>everything</i> that is descriptive within hyperlink elements.
 
===== Elements, Nodes and Child Nodes =====
 
Each hyperlink element is represented by an object whose attributes can be queried, as we did above in order to get the <code>href</code> attribute's value. However, elements can also be queried about their contents, and such contents take the form of objects which represent "nodes" within the document. (The nature of XML documents is described in another [http://esbinfo:8090/pages/XML_intro.html introductory document] which discusses the DOM.) In this case, it is interesting for us to inspect the nodes which reside within (or under) each hyperlink element, and since these nodes are known generally as "child nodes", we access them through the <code>childNodes</code> attribute on each so-called <code>Node</code> object.
 
# Get the child nodes of the first "a" element.
nodes = first.childNodes
 
===== Node Types =====
 
Nodes are the basis of any particular piece of information found in an XML document, so any element found in a document is based on a node and can be explicitly identified as an element by checking its "node type":
 
print first.nodeType
# A number is returned which corresponds to one of the special values listed in
# the xml.dom.Node class. Since elements inherit from that class, we can access
# these values on 'first' itself!
print first.nodeType == first.ELEMENT_NODE
# If first is an element (it should be) then display the value 1.
 
One might wonder how this is useful, since the list of hyperlink elements, for example, is clearly a list of elements - that is, after all, what we asked for. However, if we ask an element for a list of "child nodes", we cannot immediately be sure which of these nodes are elements and which are, for example, pieces of textual data. Let us therefore examine the "child nodes" of <code>first</code> to see which of them are textual:
 
for node in first.childNodes:
    if node.nodeType == node.TEXT_NODE:
        print "Found a text node:", node.nodeValue
 
===== Navigating the Document Structure =====
 
If we wanted only to get the descriptive text within each hyperlink element, then we would need to visit all nodes within each element (the "child nodes") and record the value of the textual elements. However, this would not quite be enough - consider the following document region:
 
 <a href="http://www.python.org">A <em>really</em> important page.</a>
 
Within the <code>a</code> element, there are text nodes and an <code>em</code> element - the text within that element is not directly available as a "child node" of the <code>a</code> element. If we did not consider textual child nodes of each child node, then we would miss important information. Consequently, it becomes essential to recursively descend inside the <code>a</code> element collecting child node values. This is not as hard as it sounds, however:
 
def collect_text(node):
    "A function which collects text inside 'node', returning that text."

    s = ""
    for child_node in node.childNodes:
        if child_node.nodeType == child_node.TEXT_NODE:
            s += child_node.nodeValue
        else:
            s += collect_text(child_node)
    return s

# Call 'collect_text' on 'first', displaying the text found.
print collect_text(first)
 
To contrast this with the <code>SGMLParser</code> approach, we see that much of the work done in that example to extract textual information is distributed throughout the <code>MyParser</code> class, whereas the above function gathers the necessary operations into a single place, which is why it looks rather complicated.
 
===== Getting Document Regions as Text =====
 
Interestingly, it is easier to retrieve whole sections of the original document as text for each of the child nodes, thus collecting the complete contents of the <code>a</code> element as text. For this, we just need to make use of a function provided in the <code>xml.dom.ext</code> package:
 
from StringIO import StringIO
from xml.dom.ext import PrettyPrint
# In order to avoid getting the "a" starting and ending tags, prettyprint the
# child nodes into a stream (PrettyPrint writes to a stream rather than
# returning a string).
stream = StringIO()
for child_node in a_elements[0].childNodes:
    PrettyPrint(child_node, stream)
# Display the region of the original document between the tags.
print stream.getvalue()
 
 
Unfortunately, documents produced by libxml2dom do not work with <code>PrettyPrint</code>. However, we can use a method on each node object instead:
 
# In order to avoid getting the "a" starting and ending tags, prettyprint the
# child nodes.
s = ""
for child_node in a_elements[0].childNodes:
    s += child_node.toString(prettyprint=1)
# Display the region of the original document between the tags.
print s
 
It is envisaged that libxml2dom will eventually work better with such functions and tools.


h1. URLLIB Tutorial: urllib2 - The Missing Manual


HOWTO Fetch Internet Resources with Python

urllib2 Tutorial

Introduction

urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function. This is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations - like basic authentication, cookies, proxies and so on. These are provided by objects called handlers and openers.

urllib2 supports fetching URLs for many "URL schemes" (identified by the string before the ":" in the URL - for example "ftp" is the URL scheme of "ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP.
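As a side note, the standard urlparse module (a small sketch, nothing urllib2-specific) shows the scheme component directly:

>>> import urlparse
>>> urlparse.urlparse('ftp://python.org/')[0]
'ftp'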

For straightforward situations urlopen is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference to HTTP is RFC 2616. This is a technical document and not intended to be easy to read. This HOWTO aims to illustrate using urllib2, with enough detail about HTTP to help you through. It is not intended to replace the urllib2 docs, but is supplementary to them.

Fetching URLs

The simplest way to use urllib2 is as follows :

import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()

Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the purpose of this tutorial to explain the more complicated cases, concentrating on HTTP.

HTTP is based on requests and responses - the client makes requests and servers send responses. urllib2 mirrors this with a Request object which represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the URL requested. This response is a file-like object, which means you can for example call .read() on the response :

import urllib2

req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()

Note that urllib2 makes use of the same Request interface to handle all URL schemes. For example, you can make an FTP request like so :

req = urllib2.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you to do: First, you can pass data to be sent to the server. Second, you can pass extra information ("metadata") about the data or about the request itself, to the server - this information is sent as HTTP "headers". Let's look at each of these in turn.

Data

Sometimes you want to send data to a URL (often the URL will refer to a CGI (Common Gateway Interface) script [1] or other web application). With HTTP, this is often done using what's known as a POST request. This is often what your browser does when you submit an HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way, and then passed to the Request object as the data argument. The encoding is done using a function from the urllib library, not from urllib2.

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }

data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML forms - see HTML Specification, Form Submission for more details).

If you do not pass the data argument, urllib2 uses a GET request. One way in which GET and POST requests differ is that POST requests often have "side-effects": they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door). Though the HTTP standard makes it clear that POSTs are intended to always cause side-effects, and GET requests never to cause side-effects, nothing prevents a GET request from having side-effects, nor a POST request from having no side-effects. Data can also be passed in an HTTP GET request by encoding it in the URL itself.

This is done as follows.

>>> import urllib2
>>> import urllib

>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values

>>> data = urllib2.urlopen(full_url)

Notice that the full URL is created by adding a ? to the URL, followed by the encoded values.

Headers

We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request.

Some websites [2] dislike being browsed by programs, or send different versions to different browsers [3]. By default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header [4]. When you create a Request object you can pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [5].

import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()

The response also has two useful methods. See the section on info and geturl which comes after we have a look at what happens when things go wrong.

Handling Exceptions

urlopen raises URLError when it cannot handle a response (though as usual with Python APIs, builtin exceptions such as ValueError, TypeError etc. may also be raised).

HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.

URLError

Often, URLError is raised because there is no network connection (no route to the specified server), or the specified server doesn't exist. In this case, the exception raised will have a 'reason' attribute, which is a tuple containing an error code and a text error message.

e.g.

>>> req = urllib2.Request('http://www.pretend_server.org')
>>> try: urllib2.urlopen(req)
>>> except URLError, e:
>>>     print e.reason
>>>

(4, 'getaddrinfo failed')

HTTPError

Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a "redirection" that requests the client fetch the document from a different URL, urllib2 will handle that for you). For those it can't handle, urlopen will raise an HTTPError. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required).

See section 10 of RFC 2616 for a reference on all the HTTP error codes.

The HTTPError instance raised will have an integer 'code' attribute, which corresponds to the error sent by the server.

Error Codes

Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.

BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that shows all the response codes used by RFC 2616. The dictionary is reproduced here for convenience :

# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
    100: ('Continue', 'Request received, please continue'),
    101: ('Switching Protocols',
          'Switching to new protocol; obey Upgrade header'),

    200: ('OK', 'Request fulfilled, document follows'),
    201: ('Created', 'Document created, URL follows'),
    202: ('Accepted',
          'Request accepted, processing continues off-line'),
    203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
    204: ('No Content', 'Request fulfilled, nothing follows'),
    205: ('Reset Content', 'Clear input form for further input.'),
    206: ('Partial Content', 'Partial content follows.'),

    300: ('Multiple Choices',
          'Object has several resources -- see URI list'),
    301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
    302: ('Found', 'Object moved temporarily -- see URI list'),
    303: ('See Other', 'Object moved -- see Method and URL list'),
    304: ('Not Modified',
          'Document has not changed since given time'),
    305: ('Use Proxy',
          'You must use proxy specified in Location to access this '
          'resource.'),
    307: ('Temporary Redirect',
          'Object moved temporarily -- see URI list'),

    400: ('Bad Request',
          'Bad request syntax or unsupported method'),
    401: ('Unauthorized',
          'No permission -- see authorization schemes'),
    402: ('Payment Required',
          'No payment -- see charging schemes'),
    403: ('Forbidden',
          'Request forbidden -- authorization will not help'),
    404: ('Not Found', 'Nothing matches the given URI'),
    405: ('Method Not Allowed',
          'Specified method is invalid for this server.'),
    406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required', 'You must authenticate with '
          'this proxy before proceeding.'),
    408: ('Request Timeout', 'Request timed out; try again later.'),
    409: ('Conflict', 'Request conflict.'),
    410: ('Gone',
          'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable',
          'Cannot satisfy request range.'),
    417: ('Expectation Failed',
          'Expect condition could not be satisfied.'),

    500: ('Internal Server Error', 'Server got itself in trouble'),
    501: ('Not Implemented',
          'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service Unavailable',
          'The server cannot process the request due to a high load'),
    504: ('Gateway Timeout',
          'The gateway server did not receive a timely response'),
    505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
    }

When an error is raised the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response object for the page returned. This means that as well as the code attribute, it also has read, geturl, and info methods.

>>> req = urllib2.Request('http://www.python.org/fish.html')
>>> try:
>>>     urllib2.urlopen(req)
>>> except URLError, e:
>>>     print e.code
>>>     print e.read()
>>>
404

"http://www.w3.org/TR/html4/loose.dtd">
type="text/css"?>


...... etc...

Wrapping it Up

So if you want to be prepared for HTTPError or URLError there are two basic approaches. I prefer the second approach.

Number 1

from urllib2 import Request, urlopen, URLError, HTTPError
req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    pass  # everything is fine

Note

The except HTTPError must come first, otherwise except URLError will also catch an HTTPError.

Number 2

from urllib2 import Request, urlopen, URLError
req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    pass  # everything is fine

Note

URLError is a subclass of the built-in exception IOError.

This means that you can avoid importing URLError and use :

from urllib2 import Request, urlopen
req = Request(someurl)
try:
    response = urlopen(req)
except IOError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    pass  # everything is fine

Under rare circumstances urllib2 can raise socket.error.
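A defensive sketch (the URL is a placeholder) that allows for this case:

import socket
import urllib2

try:
    response = urllib2.urlopen('http://www.example.com/')
except socket.error, e:
    # Raised, for example, when a timeout set with
    # socket.setdefaulttimeout() expires mid-transfer.
    print 'Socket-level failure:', e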

info and geturl

The response returned by urlopen (or the HTTPError instance) has two useful methods info and geturl.

geturl - this returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect. The URL of the page fetched may not be the same as the URL requested.

info - this returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an httplib.HTTPMessage instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the Quick Reference to HTTP Headers for a useful listing of HTTP headers with brief explanations of their meaning and use.
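For example (a minimal sketch - the URL is arbitrary and the exact headers will vary from server to server):

import urllib2

response = urllib2.urlopen('http://www.python.org/')
print response.geturl()                # the URL actually fetched, after redirects
print response.info()                  # all the headers sent by the server
print response.info()['Content-Type']  # individual headers, dictionary-style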

Openers and Handlers

When you fetch a URL you use an opener (an instance of the perhaps confusingly-named urllib2.OpenerDirector). Normally we have been using the default opener - via urlopen - but you can create custom openers. Openers use handlers. All the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, etc.), or how to handle an aspect of URL opening, for example HTTP redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.

To create an opener, instantiate an OpenerDirector, and then call .add_handler(some_handler_instance) repeatedly.

Alternatively, you can use build_opener, which is a convenience function for creating opener objects with a single function call. build_opener adds several handlers by default, but provides a quick way to add more and/or override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication, and other common but slightly specialised situations.

install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.

Opener objects have an open method, which can be called directly to fetch urls in the same way as the urlopen function: there's no need to call install_opener, except as a convenience.
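Putting those pieces together, here is a sketch of a custom opener (the 'MyScript/1.0' User-Agent string is made up for the example):

import urllib2

# build_opener returns an OpenerDirector with the default handlers installed.
opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'MyScript/1.0')]

# Use the opener directly...
response = opener.open('http://www.python.org/')

# ...or install it so that plain urllib2.urlopen() uses it too.
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.python.org/')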

Basic Authentication

To illustrate creating and installing a handler we will use the HTTPBasicAuthHandler. For a more detailed discussion of this subject - including an explanation of how Basic Authentication works - see the Basic Authentication Tutorial.

When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a 'realm'. The header looks like : Www-authenticate: SCHEME realm="REALM".

e.g.

Www-authenticate: Basic realm="cPanel Users"

The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is 'basic authentication'. In order to simplify this process we can create an instance of HTTPBasicAuthHandler and an opener to use this handler.

The HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to passwords and usernames. If you know what the realm is (from the authentication header sent by the server), then you can use an HTTPPasswordMgr. Frequently one doesn't care what the realm is. In that case, it is convenient to use HTTPPasswordMgrWithDefaultRealm. This allows you to specify a default username and password for a URL. This will be supplied in the absence of you providing an alternative combination for a specific realm. We indicate this by providing None as the realm argument to the add_password method.

The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match.

# create a password manager
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
# If we knew the realm, we could use it instead of ``None``.
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)

handler = urllib2.HTTPBasicAuthHandler(password_mgr)

# create "opener" (OpenerDirector instance)
opener = urllib2.build_opener(handler)

# use the opener to fetch a URL
opener.open(a_url)

# Install the opener.
# Now all calls to urllib2.urlopen use our opener.
urllib2.install_opener(opener)

Note

In the above example we only supplied our HTTPBasicAuthHandler to build_opener. By default openers have the handlers for normal situations - ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.

top_level_url is in fact either a full URL (including the 'http:' scheme component and the hostname and optionally the port number) e.g. "http://example.com/" or an "authority" (i.e. the hostname, optionally including the port number) e.g. "example.com" or "example.com:8080" (the latter example includes a port number). The authority, if present, must NOT contain the "userinfo" component - for example "joe:example.com" is not correct.
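In other words, continuing the example above (with made-up credentials), all of the following forms would be acceptable:

# A full URL, including the scheme:
password_mgr.add_password(None, "http://example.com/foo/", "user", "secret")
# An authority - just the hostname...
password_mgr.add_password(None, "example.com", "user", "secret")
# ...or the hostname plus a port number:
password_mgr.add_password(None, "example.com:8080", "user", "secret")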

Proxies

urllib2 will auto-detect your proxy settings and use those. This is through the ProxyHandler which is part of the normal handler chain. Normally that's a good thing, but there are occasions when it may not be helpful [6]. One way to disable automatic proxy handling is to set up our own ProxyHandler, with no proxies defined. This is done using similar steps to setting up a Basic Authentication handler :

>>> proxy_support = urllib2.ProxyHandler({})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)

Note

Currently urllib2 does not support fetching of https locations through a proxy. This can be a problem.

Sockets and Layers

The Python support for fetching resources from the web is layered. urllib2 uses the httplib library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has no timeout and can hang. Currently, the socket timeout is not exposed at the httplib or urllib2 levels. However, you can set the default timeout globally for all sockets using :

import socket
import urllib2

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)

Footnotes

This document was reviewed and revised by John Lee.

[1]

For an introduction to the CGI protocol see Writing Web Applications in Python.

[2]

Like Google for example. The proper way to use google from a program is to use PyGoogle of course. See Voidspace Google for some examples of using the Google API.

[3]

Browser sniffing is a very bad practice for website design - building sites using web standards is much more sensible. Unfortunately a lot of sites still send different versions to different browsers.

[4]

The user agent for MSIE 6 is 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'

[5]

For details of more HTTP request headers, see Quick Reference to HTTP Headers.

[6]

In my case I have to use a proxy to access the internet at work. If you attempt to fetch localhost URLs through this proxy it blocks them. IE is set to use the proxy, which urllib2 picks up on. In order to test scripts with a localhost server, I have to prevent urllib2 from using the proxy.

Article extracted from [50]

Abstract

Various Web surfing tasks that I regularly perform could be made much easier, and less tedious, if I could only use Python to fetch the HTML pages and to process them, yielding the information I really need. In this document I attempt to describe HTML processing in Python using readily available tools and libraries.

NOTE: This document is not quite finished. I aim to include sections on using mxTidy to deal with broken HTML as well as some tips on cleaning up text retrieved from HTML resources.

Prerequisites

Depending on the methods you wish to follow in this tutorial, you need the following things:

  • For the "SGML parser" method, a recent release of Python is probably enough. You can find one at the Python download page.
  • For the "XML parser" method, a recent release of Python is required, along with a capable XML processing library. I recommend using libxml2dom, since it can handle badly-formed HTML documents as well as well-formed XML or XHTML documents. However, PyXML also provides support for such documents.
  • For fetching Web pages over secure connections, it is important that SSL support is enabled either when building Python from source, or in any packaged distribution of Python that you might acquire. Information about this is given in the source distribution of Python, but you can download replacement socket libraries with SSL support for older versions of Python for Windows from Robin Dunn's site.

Activities

Accessing sites, downloading content, and processing such content, either to extract useful information for archiving or to use such content to navigate further into the site, require combinations of the following activities. Some activities can be chosen according to preference: whether the SGML parser or the XML parser (or parsing framework) is used depends on which style of programming seems nicer to a given developer (although one parser may seem to work better in some situations). However, technical restrictions usually dictate whether certain libraries are to be used instead of others: when handling HTTP redirects, it appears that certain Python modules are easier to use, or even more suited to handling such situations.

Fetching Web Pages

Fetching standard Web pages over HTTP is very easy with Python:

import urllib
# Get a file-like object for the Python Web site's home page.
f = urllib.urlopen("http://www.python.org")
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()

Supplying Data

Sometimes, it is necessary to pass information to the Web server, such as information which would come from an HTML form. Of course, you need to know which fields are available in a form, but assuming that you already know this, you can supply such data in the urlopen function call:

# Search the Vaults of Parnassus for "XMLForms".
# First, encode the data.
data = urllib.urlencode({"find" : "XMLForms", "findtype" : "t"})
# Now get that file-like object again, remembering to mention the data.
f = urllib.urlopen("http://www.vex.net/parnassus/apyllo.py", data)
# Read the results back.
s = f.read()
f.close()

The above example passed data to the server as an HTTP POST request. Fortunately, the Vaults of Parnassus is happy about such requests, but this is not always the case with Web services. We can instead choose to use a different kind of request, however:

# We have the encoded data. Now get the file-like object...
f = urllib.urlopen("http://www.vex.net/parnassus/apyllo.py?" + data)
# And the rest...

The only difference is the use of a ? (question mark) character and the adding of data onto the end of the Vaults of Parnassus URL, but this constitutes an HTTP GET request, where the query (our additional data) is included in the URL itself.

Fetching Secure Web Pages

Fetching secure Web pages using HTTPS is also very easy, provided that your Python installation supports SSL:

import urllib
# Get a file-like object for a site.
f = urllib.urlopen("https://www.somesecuresite.com")
# NOTE: At the interactive Python prompt, you may be prompted for a username
# NOTE: and password here.
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()

Including data which forms the basis of a query, as illustrated above, is also possible with URLs starting with https.

Handling Redirects

Many Web services use HTTP redirects for various straightforward or even bizarre purposes. For example, a fairly common technique employed on "high traffic" Web sites is the HTTP redirection load balancing strategy where the initial request to the publicised Web site (eg. http://www.somesite.com) is redirected to another server (eg. http://www1.somesite.com) where a user's session is handled.

Fortunately, urlopen handles redirects, at least in Python 2.1, and therefore any such redirection should be handled transparently by urlopen without your program needing to be aware that it is happening. It is possible to write code to deal with redirection yourself, and this can be done using the httplib module; however, the interfaces provided by that module are more complicated than those provided above, if somewhat more powerful.
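For the curious, here is a rough sketch of what the manual approach looks like with httplib (assuming a host that answers with a redirect; the host name is the placeholder used above):

import httplib

conn = httplib.HTTPConnection("www.somesite.com")
conn.request("GET", "/")
response = conn.getresponse()
# A 301 or 302 status signals a redirect; the new location arrives as a header.
if response.status in (301, 302):
    print "Redirected to:", response.getheader("Location")
conn.close()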

Using the SGML Parser

Given a character string from a Web service, such as the value held by s in the above examples, how can one understand the content provided by the service in such a way that an "intelligent" response can be made? One method is by using an SGML parser, since HTML is a relation of SGML, and HTML is probably the content type most likely to be experienced when interacting with a Web service.

In the standard Python library, the sgmllib module contains an appropriate parser class called SGMLParser. Unfortunately, it is of limited use to us unless we customise its activities somehow. Fortunately, Python's object-oriented features, combined with the design of the SGMLParser class, provide a means of customising it fairly easily.

Defining a Parser Class

First of all, let us define a new class inheriting from SGMLParser with a convenience method that I find very convenient indeed:

import sgmllib

class MyParser(sgmllib.SGMLParser):
    "A simple parser class."

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    # More to come...

What the parse method does is provide an easy way of passing some text (as a string) to the parser object. I find this nicer than having to remember calling the feed method, and since I always tend to have the entire document ready for parsing, I do not need to use feed many times - passing many pieces of text which comprise an entire document is an interesting feature of SGMLParser (and its derivatives) which could be used in other situations.
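For instance (purely illustrative - the split point is arbitrary), the same document could be supplied across several feed calls:

myparser = MyParser()
myparser.feed('<html><body><a href="http://www.py')
myparser.feed('thon.org">The Python Web site</a></body></html>')
myparser.close()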

Deciding What to Remember

Of course, implementing our own customised parser is only of interest if we are looking to find things in a document. Therefore, we should aim to declare these things before we start parsing. We can do this in the __init__ method of our class:

    # Continuing from above...

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []

    # More to come...

Here, we initialise new objects by passing information to the __init__ method of the superclass (SGMLParser); this makes sure that the underlying parser is set up properly. We also initialise an attribute called hyperlinks which will be used to record the hyperlinks found in the document that any given object will parse.

Care should be taken when choosing attribute names, since use of names defined in the superclass could potentially cause problems when our parser object is used, because a badly chosen name would cause one of our attributes to override an attribute in the superclass and result in our attributes being manipulated for internal parsing purposes by the superclass. We might hope that the SGMLParser class uses attribute names with leading double underscores (__) since this isolates such attributes from access by subclasses such as our own MyParser class.
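The protection offered by leading double underscores comes from Python's name mangling, which this small standalone sketch (unrelated to parsing) demonstrates:

class Base:
    def __init__(self):
        self.__state = "internal"   # stored as _Base__state

class Child(Base):
    def __init__(self):
        Base.__init__(self)
        self.__state = "mine"       # stored as _Child__state - no clash

c = Child()
print c._Base__state, c._Child__state   # prints: internal mine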

Remembering Document Details

We now need to define a way of extracting data from the document, but SGMLParser provides a mechanism which notifies us when an interesting part of the document has been read. SGML and HTML are textual formats which are structured by the presence of so-called tags, and in HTML, hyperlinks may be represented in the following way:

<a href="http://www.python.org">The Python Web site</a>

How SGMLParser Operates

An SGMLParser object which is parsing a document recognises starting and ending tags for things such as hyperlinks, and it issues a method call on itself based on the name of the tag found and whether the tag is a starting or ending tag. So, as the above text is recognised by an SGMLParser object (or an object derived from SGMLParser, like MyParser), the following method calls are made internally:

self.start_a(("href", "http://www.python.org"))
self.handle_data("The Python Web site")
self.end_a()

Note that the text between the tags is considered as data, and that the ending tag does not provide any information. The starting tag, however, does provide information in the form of a sequence of attribute names and values, where each name/value pair is placed in a 2-tuple:

# The form of attributes supplied to start tag methods:
# (name, value)
# Examples:
# ("href", "http://www.python.org")
# ("target", "python")

Why SGMLParser Works

Why does SGMLParser issue a method call on itself, effectively telling itself that a tag has been encountered? The basic SGMLParser class surely does not know what to do with such information. Well, if another class inherits from SGMLParser, then such calls are no longer confined to SGMLParser and instead act on methods in the subclass, such as MyParser, where such methods exist. Thus, a customised parser class (eg. MyParser) once instantiated (made into an object) acts like a stack of components, with the lowest level of the stack doing the hard parsing work and passing items of interest to the upper layers - it is a bit like a factory with components being made on the ground floor and inspection of those components taking place in the laboratories in the upper floors!

Class        Activity
...          Listens to reports, records other interesting things
MyParser     Listens to reports, records interesting things
SGMLParser   Parses documents, issuing reports at each step

Introducing Our Customisations

Now, if we want to record the hyperlinks in the document, all we need to do is to define a method called start_a which extracts the hyperlink from the attributes which are provided in the starting a tag. This can be defined as follows:

    # Continuing from above...

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)

    # More to come...

All we need to do is traverse the attributes list, find appropriately named attributes, and record the value of those attributes.

Retrieving the Details

A nice way of providing access to the retrieved details is to define a method, although Python 2.2 provides additional features to make this more convenient. We shall use the old approach:

    # Continuing from above...

    def get_hyperlinks(self):
        "Return the list of hyperlinks."

        return self.hyperlinks

Trying it Out

Now that we have defined our class, we can instantiate it, making a new MyParser object. After that, it is just a matter of giving it a document to work with:

import urllib, sgmllib

# Get something to work with.
f = urllib.urlopen("http://www.python.org")
s = f.read()

# Try and process the page.
# The class should have been defined first, remember.
myparser = MyParser()
myparser.parse(s)

# Get the hyperlinks.
print myparser.get_hyperlinks()

The print statement should cause a list to be displayed, containing various hyperlinks to locations on the Python home page and other sites.

The Example File

The above example code can be downloaded and executed to see the results.

Finding More Specific Content

Of course, if it is sufficient for you to extract information from a document without worrying about where in the document it came from, then the above level of complexity should suit you perfectly. However, one might want to extract information which only appears in certain places or constructs - a good example of this is the text between starting and ending tags of hyperlinks which we saw above. If we just acquired every piece of text using a handle_data method which recorded everything it saw, then we would not know which piece of text described a hyperlink and which piece of text appeared in any other place in a document.

    # An extension of the above class.
    # This is not very useful.

    def handle_data(self, data):
        "Handle the textual 'data'."

        self.descriptions.append(data)

Here, the descriptions attribute (which we would need to initialise in the __init__ method) would be filled with lots of meaningless textual data. So how can we be more specific? The best approach is to remember not only the content that SGMLParser discovers, but also to remember what kind of content we have seen already.

Remembering Our Position

Let us add some new attributes to the __init__ method.

        # At the end of the __init__ method...

        self.descriptions = []
        self.inside_a_element = 0

The descriptions attribute is defined as we anticipated, but the inside_a_element attribute is used for something different: it will indicate whether or not SGMLParser is currently investigating the contents of an a element - that is, whether SGMLParser is between the starting a tag and the ending a tag.

Let us now add some "logic" to the start_a method, redefining it as follows:

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)
                self.inside_a_element = 1

Now, we should know when a starting a tag has been seen, but to avoid confusion, we should also change the value of the new attribute when the parser sees an ending a tag. We do this by defining a new method for this case:

    def end_a(self):
        "Record the end of a hyperlink."

        self.inside_a_element = 0

Fortunately, it is not permitted to "nest" hyperlinks, so it is not relevant to wonder what might happen if an ending tag were to be seen after more than one starting tag had been seen in succession.

Recording Relevant Data

Now, given that we can be sure of our position in a document and whether we should record the data that is being presented, we can define the "real" handle_data method as follows:

    def handle_data(self, data):
        "Handle the textual 'data'."

        if self.inside_a_element:
            self.descriptions.append(data)

This method is not perfect, as we shall see, but it does at least avoid recording every last piece of text in the document.

We can now define a method to retrieve the description data:

    def get_descriptions(self):
        "Return a list of descriptions."

        return self.descriptions

And we can add the following line to our test program in order to display the descriptions:

print myparser.get_descriptions()

The Example File

The example code with these modifications can be downloaded and executed to see the results.

Problems with Text

Upon running the modified example, one thing is apparent: there are a few descriptions which do not make sense. Moreover, the number of descriptions does not match the number of hyperlinks. The reason for this is the way that text is found and presented to us by the parser: more than one fragment of text may be signalled between a starting a tag and an ending a tag, even though it is logically one block of text.

We may modify our example by adding another attribute to indicate whether we are just beginning to process a region of text. If this new attribute is set, then we add a description to the list; if not, then we add any text found to the most recent description recorded.

The __init__ method is modified still further:

        # At the end of the __init__ method...

        self.starting_description = 0

Since we can only be sure that a description is being started immediately after a starting a tag has been seen, we redefine the start_a method as follows:

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)
                self.inside_a_element = 1
                self.starting_description = 1

Now, the handle_data method needs redefining as follows:

    def handle_data(self, data):
        "Handle the textual 'data'."

        if self.inside_a_element:
            if self.starting_description:
                self.descriptions.append(data)
                self.starting_description = 0
            else:
                self.descriptions[-1] += data

Clearly, the method becomes more complicated. We need to detect whether the description is being started and act in the manner discussed above.

The Example File

The example code with these modifications can be downloaded and executed to see the results.

Conclusions

Although the final example file produces some reasonable results - there are still some strange descriptions, and we have not taken images used within hyperlinks into consideration - the modifications that were required illustrate that as more attention is paid to the structure of the document, the more effort is required to monitor the origins of information. As a result, we need to maintain state information within the MyParser object in a not-too-elegant way.

For application purposes, the SGMLParser class, its derivatives, and related approaches (such as SAX) are useful for casual access to information, but for certain kinds of querying, they can become more complicated to use than one would initially believe. However, these approaches can be used for another purpose: that of building structures which can be accessed in a more methodical fashion, as we shall see below.

Using XML Parsers

Given a character string s, containing an HTML document which may have been retrieved from a Web service (using an approach described in an earlier section of this document), let us now consider an alternative method of interpreting the contents of this document so that we do not have to manage the complexity of remembering explicitly the structure of the document that we have seen so far. One of the problems with SGMLParser was that access to information in a document happened "serially" - that is, information was presented to us in the order in which it was found - but it may have been more appropriate to access the document information according to the structure of the document, so that we could request all parts of the document corresponding to the hyperlink elements present in that document, before examining each document portion for the text within each hyperlink element.

In the XML world, a standard called the Document Object Model (DOM) has been devised to provide a means of access to document information which permits us to navigate the structure of a document, requesting different sections of that document, and giving us the ability to revisit such sections at any time; the use of Python with XML and the DOM is described in another document. If all Web pages were well-formed XML - that is, they all complied with the expectations and standards set out by the XML specifications - then any XML parser would be sufficient to process any HTML document found on the Web. Unfortunately, many Web pages use less formal variants of HTML which are rejected by XML parsers. Thus, we need to employ particular tools and additional techniques to convert such pages to DOM representations.

Below, we describe how Web pages may be processed using the PyXML toolkit and with the libxml2dom package to obtain a top-level document object. Since both approaches yield an object which is broadly compatible with the DOM standard, the subsequent description of how we then inspect such documents applies regardless of whichever toolkit or package we have chosen.

Using PyXML

It is possible to use Python's XML framework with the kind of HTML found on the Web by employing a special "reader" class which builds a DOM representation from an HTML document, and the consequences of this are described below.

Creating the Reader

An appropriate class for reading HTML documents is found deep in the xml package, and we shall instantiate this class for subsequent use:

from xml.dom.ext.reader import HtmlLib
reader = HtmlLib.Reader()

Of course, there are many different ways of accessing the Reader class concerned, but I have chosen not to import Reader into the common namespace. One good reason for deciding this is that I may wish to import other Reader classes from other packages or modules, and we clearly need a way to distinguish between them. Therefore, I import the HtmlLib name and access the Reader class from within that module.

Loading a Document

Unlike SGMLParser, we do not need to customise any class before we load a document. Therefore, we can "postpone" any consideration of the contents of the document until after the document has been loaded, although it is very likely that you will have some idea of the nature of the contents in advance and will have written classes or functions to work on the DOM representation once it is available. After all, real programs extracting particular information from a certain kind of document do need to know something about the structure of the documents they process, whether that knowledge is put in a subclass of a parser (as in SGMLParser) or whether it is "encoded" in classes and functions which manipulate the DOM representation.

Anyway, let us load the document and obtain a Document object:

doc = reader.fromString(s)

Note that the "top level" of a DOM representation is always a Document node object, and this is what doc refers to immediately after the document is loaded.
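
Putting the pieces together, a complete fetch-and-parse sequence might look like the following sketch, where the URL is purely illustrative and the fetching technique is the urllib2 approach covered earlier:

import urllib2
from xml.dom.ext.reader import HtmlLib

# Fetch the page text.
f = urllib2.urlopen("http://www.python.org")
s = f.read()
f.close()

# Build the DOM representation; doc is the top-level Document node.
reader = HtmlLib.Reader()
doc = reader.fromString(s)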

Using libxml2dom

Obtaining documents using libxml2dom is slightly more straightforward:

import libxml2dom
doc = libxml2dom.parseString(s, html=1)

If the document text is well-formed XML, we could omit the html parameter or set it to have a false value. However, if we are not sure whether the text is well-formed, no significant issues will arise from setting the parameter in the above fashion.
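
For example, here is a small sketch using inline strings (both documents are invented for illustration):

import libxml2dom

# Well-formed XHTML parses without the html flag...
xhtml = "<html><body><p>Hello</p></body></html>"
doc = libxml2dom.parseString(xhtml)

# ...whereas typical "tag soup" needs html=1 in order to be tolerated.
soup = "<html><body><p>Unclosed paragraph<br></body></html>"
doc = libxml2dom.parseString(soup, html=1)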

Deciding What to Extract

Now, it is appropriate to decide which information is to be found and retrieved from the document, and this is where some tasks appear easier than with SGMLParser (and related frameworks). Let us consider the task of extracting all the hyperlinks from the document; we can certainly find all the hyperlink elements as follows:

a_elements = doc.getElementsByTagName("a")

Since hyperlink elements comprise the starting a tag, the ending a tag, and all data between them, the value of the a_elements variable should be a list of objects representing regions in the document which would appear like this:

<a href="http://www.python.org">The Python Web site</a>

Querying Elements

To make the elements easier to deal with, each object in the list is not the textual representation of the element as given above. Instead, an object is created for each element which provides a more convenient level of access to the details. We can therefore obtain a reference to such an object and find out more about the element it represents:

# Get the first element in the list. We don't need to use a separate variable,
# but it makes it clearer.
first = a_elements[0]
# Now display the value of the "href" attribute.
print first.getAttribute("href")

What is happening here is that the first object (being the first a element in the list of those found) is being asked to return the value of the attribute whose name is href, and if such an attribute exists, a string is returned containing the contents of the attribute: in the case of the above example, this would be...

http://www.python.org

If the href attribute had not existed, such as in the following example element, then a value of None would have been returned.

<a name="target">This is not a hyperlink. It is a target.</a>
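
This suggests a simple way of gathering only the genuine hyperlinks, skipping plain targets like the one above. A short sketch building on the objects defined earlier:

# Collect the address of every element which actually has an href attribute.
links = []
for a in a_elements:
    href = a.getAttribute("href")
    if href:
        links.append(href)
print links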

Namespaces

Previously, this document recommended the usage of namespaces and the getAttributeNS method, rather than the getAttribute method. Whilst XML processing may involve extensive use of namespaces, some HTML parsers do not appear to expose them quite as one would expect: for example, not associating the XHTML namespace with XHTML elements in a document. Thus, it can be advisable to ignore namespaces unless their usage is unavoidable in order to distinguish between elements in mixed-content documents (XHTML combined with SVG, for example).
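
For reference, the namespace-aware call looks like the following sketch; as noted above, plain getAttribute is usually sufficient for HTML:

from xml.dom import EMPTY_NAMESPACE

# The namespace-aware equivalent of first.getAttribute("href"). With HTML
# parsers which do not assign namespaces, EMPTY_NAMESPACE is the value to use.
print first.getAttributeNS(EMPTY_NAMESPACE, "href")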

Finding More Specific Content

We are already being fairly specific, in a sense, in the way that we have chosen to access the a elements within the document, since we start from a particular point in the document's structure and search for elements from there. In the SGMLParser examples, we decided to look for descriptions of hyperlinks in the text which is enclosed between the starting and ending tags associated with hyperlinks, and we were largely successful with that, although there were some issues that could have been handled better. Here, we shall attempt to find everything that is descriptive within hyperlink elements.

Elements, Nodes and Child Nodes

Each hyperlink element is represented by an object whose attributes can be queried, as we did above in order to get the href attribute's value. However, elements can also be queried about their contents, and such contents take the form of objects which represent "nodes" within the document. (The nature of XML documents is described in another introductory document which discusses the DOM.) In this case, it is interesting for us to inspect the nodes which reside within (or under) each hyperlink element, and since these nodes are known generally as "child nodes", we access them through the childNodes attribute on each so-called Node object.

# Get the child nodes of the first "a" element.
nodes = first.childNodes

Node Types

Nodes are the basis of any particular piece of information found in an XML document, so any element found in a document is based on a node and can be explicitly identified as an element by checking its "node type":

print first.nodeType
# A number is returned which corresponds to one of the special values listed in
# the xml.dom.Node class. Since elements inherit from that class, we can access
# these values on 'first' itself!
print first.nodeType == first.ELEMENT_NODE
# If first is an element (it should be), True is displayed.
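
The commonly encountered values are available on the xml.dom.Node class itself; the numbers are fixed by the DOM standard:

from xml.dom import Node

print Node.ELEMENT_NODE   # 1
print Node.TEXT_NODE      # 3
print Node.COMMENT_NODE   # 8
print Node.DOCUMENT_NODE  # 9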

One might wonder how this is useful, since the list of hyperlink elements, for example, is clearly a list of elements - that is, after all, what we asked for. However, if we ask an element for a list of "child nodes", we cannot immediately be sure which of these nodes are elements and which are, for example, pieces of textual data. Let us therefore examine the "child nodes" of first to see which of them are textual:

for node in first.childNodes:
    if node.nodeType == node.TEXT_NODE:
        print "Found a text node:", node.nodeValue

Navigating the Document Structure

If we wanted only to get the descriptive text within each hyperlink element, then we would need to visit all nodes within each element (the "child nodes") and record the value of the textual elements. However, this would not quite be enough - consider the following document region:

<a href="http://www.python.org">A <em>really</em> important page.</a>

Within the a element, there are text nodes and an em element - the text within that element is not directly available as a "child node" of the a element. If we did not consider textual child nodes of each child node, then we would miss important information. Consequently, it becomes essential to recursively descend inside the a element collecting child node values. This is not as hard as it sounds, however:

def collect_text(node):
    "A function which collects text inside 'node', returning that text."

    s = ""
    for child_node in node.childNodes:
        if child_node.nodeType == child_node.TEXT_NODE:
            s += child_node.nodeValue
        else:
            s += collect_text(child_node)
    return s

# Call 'collect_text' on 'first', displaying the text found.
print collect_text(first)
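
With this helper in place, the descriptive text of every hyperlink can be paired with its address. A short usage sketch:

# Pair each hyperlink's descriptive text with its address.
for a in a_elements:
    print collect_text(a), "->", a.getAttribute("href")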

To contrast this with the SGMLParser approach: in that example, the work done to extract textual information was distributed throughout the MyParser class, whereas the above function gathers the necessary operations into a single place. This concentration is what makes it look complicated, even though there is no more logic overall.

Getting Document Regions as Text

Interestingly, it is easier to retrieve whole sections of the original document as text for each of the child nodes, thus collecting the complete contents of the a element as text. For this, we just need to make use of a function provided in the xml.dom.ext package:

from xml.dom.ext import PrettyPrint
from StringIO import StringIO

# In order to avoid getting the "a" starting and ending tags, prettyprint the
# child nodes. Since PrettyPrint writes to a stream (standard output by
# default) rather than returning a string, we collect the output in a
# StringIO object.
stream = StringIO()
for child_node in a_elements[0].childNodes:
    PrettyPrint(child_node, stream)

# Display the region of the original document between the tags.
print stream.getvalue()

Unfortunately, documents produced by libxml2dom do not work with PrettyPrint. However, we can use a method on each node object instead:

# In order to avoid getting the "a" starting and ending tags, prettyprint the
# child nodes.
s = ""
for child_node in a_elements[0].childNodes:
    s += child_node.toString(prettyprint=1)

# Display the region of the original document between the tags.
print s

It is envisaged that libxml2dom will eventually work better with such functions and tools.