<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://performiq.com/kb/index.php?action=history&amp;feed=atom&amp;title=Scraper.py</id>
	<title>Scraper.py - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://performiq.com/kb/index.php?action=history&amp;feed=atom&amp;title=Scraper.py"/>
	<link rel="alternate" type="text/html" href="https://performiq.com/kb/index.php?title=Scraper.py&amp;action=history"/>
	<updated>2026-05-18T13:54:51Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.37.1</generator>
	<entry>
		<id>https://performiq.com/kb/index.php?title=Scraper.py&amp;diff=2709&amp;oldid=prev</id>
		<title>PeterHarding: New page: &lt;pre&gt; #06-09-04 #v1.3.0   # scraper.py # A general HTML &#039;parser&#039;.  # Copyright Michael Foord, 2004. # Released subject to the BSD License # Please see http://www.voidspace.org.uk/documents...</title>
		<link rel="alternate" type="text/html" href="https://performiq.com/kb/index.php?title=Scraper.py&amp;diff=2709&amp;oldid=prev"/>
		<updated>2008-11-12T06:11:53Z</updated>

		<summary type="html">&lt;p&gt;New page: &amp;lt;pre&amp;gt; #06-09-04 #v1.3.0   # scraper.py # A general HTML &amp;#039;parser&amp;#039;.  # Copyright Michael Foord, 2004. # Released subject to the BSD License # Please see http://www.voidspace.org.uk/documents...&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;lt;pre&amp;gt;&lt;br /&gt;
#06-09-04&lt;br /&gt;
#v1.3.0 &lt;br /&gt;
&lt;br /&gt;
# scraper.py&lt;br /&gt;
# A general HTML &amp;#039;parser&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
# Copyright Michael Foord, 2004.&lt;br /&gt;
# Released subject to the BSD License&lt;br /&gt;
# Please see http://www.voidspace.org.uk/documents/BSD-LICENSE.txt&lt;br /&gt;
&lt;br /&gt;
# For information about bugfixes, updates and support, please join the Pythonutils mailing list.&lt;br /&gt;
# http://voidspace.org.uk/mailman/listinfo/pythonutils_voidspace.org.uk&lt;br /&gt;
# Comments, suggestions and bug reports welcome.&lt;br /&gt;
# Scripts maintained at http://www.voidspace.org.uk/python/index.shtml&lt;br /&gt;
# E-mail fuzzyman@voidspace.org.uk&lt;br /&gt;
&lt;br /&gt;
import re&lt;br /&gt;
&lt;br /&gt;
#namefind is supposed to match a tag name and attributes into groups 1 and 2 respectively.&lt;br /&gt;
#the original version of this pattern:&lt;br /&gt;
# namefind = re.compile(r&amp;#039;(\S*)\s*(.+)&amp;#039;, re.DOTALL)&lt;br /&gt;
#insists that there must be attributes and if necessary will steal the last character&lt;br /&gt;
#of the tag name to make it so. this is annoying, so let us try:&lt;br /&gt;
namefind = re.compile(r&amp;#039;(\S+)\s*(.*)&amp;#039;, re.DOTALL)&lt;br /&gt;
&lt;br /&gt;
attrfind = re.compile(&lt;br /&gt;
    r&amp;#039;\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*&amp;#039;&lt;br /&gt;
    r&amp;#039;(\&amp;#039;[^\&amp;#039;]*\&amp;#039;|&amp;quot;[^&amp;quot;]*&amp;quot;|[-a-zA-Z0-9./,:;+*%?!&amp;amp;$\(\)_#=~\&amp;#039;&amp;quot;@]*))?&amp;#039;)            # this is taken from sgmllib&lt;br /&gt;
&lt;br /&gt;
class Scraper:&lt;br /&gt;
    def __init__(self):&lt;br /&gt;
        &amp;quot;&amp;quot;&amp;quot;Initialise a parser.&amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
        self.buffer = &amp;#039;&amp;#039;&lt;br /&gt;
        self.outfile = &amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
    def reset(self):&lt;br /&gt;
        &amp;quot;&amp;quot;&amp;quot;This method clears the input buffer and the output buffer.&amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
        self.buffer = &amp;#039;&amp;#039;&lt;br /&gt;
        self.outfile = &amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
    def push(self):&lt;br /&gt;
        &amp;quot;&amp;quot;&amp;quot;This returns all currently processed data and empties the output buffer.&amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
        data = self.outfile&lt;br /&gt;
        self.outfile = &amp;#039;&amp;#039;&lt;br /&gt;
        return data&lt;br /&gt;
&lt;br /&gt;
    def close(self):&lt;br /&gt;
        &amp;quot;&amp;quot;&amp;quot;Returns any unprocessed data (without processing it) and resets the parser.&lt;br /&gt;
        Should be used after all the data has been handled using feed and then collected with push.&lt;br /&gt;
        This returns any trailing data that can&amp;#039;t be processed.&lt;br /&gt;
&lt;br /&gt;
        If you are processing everything in one go you can safely use this method to return everything.&lt;br /&gt;
        &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
        data = self.push() + self.buffer&lt;br /&gt;
        self.buffer = &amp;#039;&amp;#039;&lt;br /&gt;
        return data&lt;br /&gt;
&lt;br /&gt;
    def feed(self, data):&lt;br /&gt;
        &amp;quot;&amp;quot;&amp;quot;Pass more data into the parser.&lt;br /&gt;
        As much as possible is processed - but nothing is returned from this method.&lt;br /&gt;
        &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
        self.index = -1&lt;br /&gt;
        self.tempindex = 0&lt;br /&gt;
        self.buffer = self.buffer + data&lt;br /&gt;
        outlist = []&lt;br /&gt;
        thischunk = []&lt;br /&gt;
        while self.index &amp;lt; len(self.buffer)-1:          # TODO: rewrite with a list of all the occurrences of &amp;#039;&amp;lt;&amp;#039; and jump between them; much faster than character by character (which is fast enough, to be fair)&lt;br /&gt;
            self.index += 1&lt;br /&gt;
            inchar = self.buffer[self.index]&lt;br /&gt;
            if inchar == &amp;#039;&amp;lt;&amp;#039;:&lt;br /&gt;
                outlist.append(self.pdata(&amp;#039;&amp;#039;.join(thischunk)))&lt;br /&gt;
                thischunk = []&lt;br /&gt;
                result = self.tagstart()&lt;br /&gt;
                if result: outlist.append(result)&lt;br /&gt;
                if self.tempindex: break&lt;br /&gt;
            else:&lt;br /&gt;
                thischunk.append(inchar) &lt;br /&gt;
        if self.tempindex:&lt;br /&gt;
            self.buffer = self.buffer[self.tempindex:]&lt;br /&gt;
        else:&lt;br /&gt;
            self.buffer = &amp;#039;&amp;#039;&lt;br /&gt;
            if thischunk: self.buffer = &amp;#039;&amp;#039;.join(thischunk)&lt;br /&gt;
        self.outfile = self.outfile + &amp;#039;&amp;#039;.join(outlist)&lt;br /&gt;
&lt;br /&gt;
    def tagstart(self):&lt;br /&gt;
        &amp;quot;&amp;quot;&amp;quot;We have reached the start of a tag.&lt;br /&gt;
        self.buffer is the data&lt;br /&gt;
        self.index is the point we have reached.&lt;br /&gt;
        This function should extract the tag name and all attributes - and then handle them.&amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
        test1 = self.buffer.find(&amp;#039;&amp;gt;&amp;#039;, self.index+1)&lt;br /&gt;
        test2 = self.buffer.find(&amp;#039;&amp;lt;&amp;#039;, self.index+1)         # will only happen for broken tags with a missing &amp;#039;&amp;gt;&amp;#039;&lt;br /&gt;
        test1 += 1&lt;br /&gt;
        test2 += 1&lt;br /&gt;
        if not test2 and not test1:                     &lt;br /&gt;
            self.tempindex = self.index                  # if we get this far the buffer is incomplete (the tag doesn&amp;#039;t close yet)&lt;br /&gt;
            self.index = len(self.buffer)               # this signals to feed that some of the buffer needs saving&lt;br /&gt;
            return&lt;br /&gt;
        if test1 and test2:&lt;br /&gt;
            test = min(test1, test2)&lt;br /&gt;
            if test == test2:           # if the closing &amp;#039;&amp;gt;&amp;#039; is missing and we&amp;#039;re working from the next starting tag, we need to be careful with our index position&lt;br /&gt;
                mod=1&lt;br /&gt;
            else:&lt;br /&gt;
                mod=0&lt;br /&gt;
        else:&lt;br /&gt;
            test = test1 or test2&lt;br /&gt;
            if test2:&lt;br /&gt;
                mod=1&lt;br /&gt;
            else:&lt;br /&gt;
                mod=0&lt;br /&gt;
        thetag = self.buffer[self.index+1:test-1].strip()&lt;br /&gt;
&lt;br /&gt;
        if thetag.startswith(&amp;#039;!&amp;#039;):               # is a declaration or comment&lt;br /&gt;
            return self.pdecl()&lt;br /&gt;
        if thetag.startswith(&amp;#039;?&amp;#039;):&lt;br /&gt;
            return self.ppi()                              # is a processing instruction &lt;br /&gt;
&lt;br /&gt;
        if mod:                   # as soon as we return, the index will have 1 added to it straight away&lt;br /&gt;
            self.index = test -2&lt;br /&gt;
        else:&lt;br /&gt;
            self.index = test -1&lt;br /&gt;
            &lt;br /&gt;
        if thetag.startswith(&amp;#039;/&amp;#039;):&lt;br /&gt;
            return self.endtag(thetag)              # is an endtag &lt;br /&gt;
&lt;br /&gt;
        nt = namefind.match(thetag)&lt;br /&gt;
        if not nt: return self.emptytag(thetag)                              # nothing inside the tag ?&lt;br /&gt;
        name, attributes = nt.group(1,2)&lt;br /&gt;
&lt;br /&gt;
        matchlist = attrfind.findall(attributes)&lt;br /&gt;
        attrs = []&lt;br /&gt;
        #the doc says a tag must be nameless to be &amp;quot;empty&amp;quot; so kill&lt;br /&gt;
        #next line that calls any tag with no attributes &amp;quot;empty&amp;quot;&lt;br /&gt;
        #if not matchlist: return self.emptytag(thetag)                              # nothing inside the tag ?&lt;br /&gt;
        for entry in matchlist:&lt;br /&gt;
            attrname, rest, attrvalue = entry               # this little chunk nicked from sgmllib - except findall is used to match all the attributes&lt;br /&gt;
            if not rest:&lt;br /&gt;
                attrvalue = attrname&lt;br /&gt;
            elif attrvalue[:1] == &amp;#039;\&amp;#039;&amp;#039; == attrvalue[-1:] or \&lt;br /&gt;
                 attrvalue[:1] == &amp;#039;&amp;quot;&amp;#039; == attrvalue[-1:]:&lt;br /&gt;
                attrvalue = attrvalue[1:-1]&lt;br /&gt;
            attrs.append((attrname.lower(), attrvalue))&lt;br /&gt;
        return self.handletag(name.lower(), attrs, thetag)              # deal with what we&amp;#039;ve found.&lt;br /&gt;
&lt;br /&gt;
################################################################################################&lt;br /&gt;
    # The following methods are called to handle the various HTML elements.&lt;br /&gt;
    # They are intended to be overridden in subclasses.&lt;br /&gt;
&lt;br /&gt;
    def pdata(self, inchunk):&lt;br /&gt;
        &amp;quot;&amp;quot;&amp;quot;Called when we encounter a new tag. All the unprocessed data since the last tag is passed to this method.&lt;br /&gt;
        Dummy method to override. Just returns the data unchanged.&amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
        return inchunk&lt;br /&gt;
&lt;br /&gt;
    def pdecl(self):&lt;br /&gt;
        &amp;quot;&amp;quot;&amp;quot;Called when we encounter the *start* of a declaration or comment. &amp;lt;!....&lt;br /&gt;
        It uses self.index and isn&amp;#039;t passed anything.&lt;br /&gt;
        Dummy method to override. Just returns &amp;#039;&amp;lt;&amp;#039;.&amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
        return &amp;#039;&amp;lt;&amp;#039;&lt;br /&gt;
    &lt;br /&gt;
    def ppi(self):&lt;br /&gt;
        &amp;quot;&amp;quot;&amp;quot;Called when we encounter the *start* of a processing instruction. &amp;lt;?....&lt;br /&gt;
        It uses self.index and isn&amp;#039;t passed anything.&lt;br /&gt;
        Dummy method to override. Just returns &amp;#039;&amp;lt;&amp;#039;.&amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
        return &amp;#039;&amp;lt;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
    def endtag(self, thetag):&lt;br /&gt;
        &amp;quot;&amp;quot;&amp;quot;Called when we encounter a close tag. &amp;lt;/....&lt;br /&gt;
        It is passed the tag contents (including leading &amp;#039;/&amp;#039;) and just returns it.&amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
        return &amp;#039;&amp;lt;&amp;#039; + thetag + &amp;#039;&amp;gt;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
    def emptytag(self, thetag):&lt;br /&gt;
        &amp;quot;&amp;quot;&amp;quot;Called when we encounter a tag that we can&amp;#039;t extract any valid name or attributes from.&lt;br /&gt;
        It is passed the tag contents and just returns it.&amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
        return &amp;#039;&amp;lt;&amp;#039; + thetag + &amp;#039;&amp;gt;&amp;#039;  &lt;br /&gt;
&lt;br /&gt;
    def handletag(self, name, attrs, thetag):&lt;br /&gt;
        &amp;quot;&amp;quot;&amp;quot;Called when we encounter a tag.&lt;br /&gt;
        Is passed the tag name and a list of (attrname, attrvalue) - and the original tag contents as a string.&amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
        return &amp;#039;&amp;lt;&amp;#039; + thetag + &amp;#039;&amp;gt;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
#################################################################&lt;br /&gt;
# The simple test script looks for a file called &amp;#039;index.html&amp;#039;&lt;br /&gt;
# It parses it, and saves it back out as &amp;#039;index2.html&amp;#039;&lt;br /&gt;
#&lt;br /&gt;
# See how all the parsed file can safely be returned using the close method.&lt;br /&gt;
# If Scraper works - the new file should be a pretty much unchanged copy of the first.&lt;br /&gt;
&lt;br /&gt;
if __name__ == &amp;#039;__main__&amp;#039;:&lt;br /&gt;
#    a = approxScraper(&amp;#039;http://www.pythonware.com/daily&amp;#039;, &amp;#039;approx.py&amp;#039;)&lt;br /&gt;
    a = Scraper()&lt;br /&gt;
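    with open(&amp;#039;index.html&amp;#039;) as infile:                  # read and feed&lt;br /&gt;
        a.feed(infile.read())&lt;br /&gt;
    with open(&amp;#039;index2.html&amp;#039;, &amp;#039;w&amp;#039;) as outfile:          # write the processed copy&lt;br /&gt;
        outfile.write(a.close())&lt;br /&gt;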
&lt;br /&gt;
#################################################################&lt;br /&gt;
    &lt;br /&gt;
__doc__ = &amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
Scraper is a class to parse HTML files.&lt;br /&gt;
It contains methods to process the &amp;#039;data portions&amp;#039; of an HTML document and its tags.&lt;br /&gt;
These can be overridden to implement your own HTML processing methods in a subclass.&lt;br /&gt;
This class does most of what HTMLParser.HTMLParser does - except without choking on bad HTML.&lt;br /&gt;
It uses a regular expression and a chunk of logic from sgmllib.py (standard Python distribution).&lt;br /&gt;
&lt;br /&gt;
The only badly formed HTML that will cause errors is where a tag is missing the closing &amp;#039;&amp;gt;&amp;#039;. (Unfortunately common)&lt;br /&gt;
In this case the tag will be automatically closed at the next &amp;#039;&amp;lt;&amp;#039; - so some data could be incorrectly put inside the tag.&lt;br /&gt;
&lt;br /&gt;
The useful methods of a Scraper instance are :&lt;br /&gt;
&lt;br /&gt;
feed(data)  -   Pass more data into the parser.&lt;br /&gt;
                As much as possible is processed - but nothing is returned from this method.  &lt;br /&gt;
push()      -   This returns all currently processed data and empties the output buffer.&lt;br /&gt;
close()     -   Returns any unprocessed data (without processing it) and resets the parser.&lt;br /&gt;
                Should be used after all the data has been handled using feed and then collected with push.&lt;br /&gt;
                This returns any trailing data that can&amp;#039;t be processed.&lt;br /&gt;
reset()     -   This method clears the input buffer and the output buffer.&lt;br /&gt;
&lt;br /&gt;
The following methods are the methods called to handle various parts of an HTML document.&lt;br /&gt;
In a normal Scraper instance they do nothing and are intended to be overridden.&lt;br /&gt;
Some of them rely on the self.index attribute of the instance, which records how far through self.buffer the parser has got.&lt;br /&gt;
Some of them are explicitly passed the tag they are working on - in which case, self.index will be set to the end of the tag.&lt;br /&gt;
After all these methods have returned self.index will be incremented to the next character.&lt;br /&gt;
If your methods do any further processing of the buffer themselves, they can manually modify self.index.&lt;br /&gt;
All these methods should return anything to include in the processed document.&lt;br /&gt;
&lt;br /&gt;
pdata(inchunk)&lt;br /&gt;
    Called when we encounter a new tag. All the unprocessed data since the last tag is passed to this method.&lt;br /&gt;
    Dummy method to override. Just returns the data unchanged.&lt;br /&gt;
&lt;br /&gt;
pdecl()&lt;br /&gt;
    Called when we encounter the *start* of a declaration or comment. &amp;lt;!.....&lt;br /&gt;
    It uses self.index and isn&amp;#039;t passed anything.&lt;br /&gt;
    Dummy method to override. Just returns &amp;#039;&amp;lt;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
ppi()&lt;br /&gt;
    Called when we encounter the *start* of a processing instruction. &amp;lt;?.....&lt;br /&gt;
    It uses self.index and isn&amp;#039;t passed anything.&lt;br /&gt;
    Dummy method to override. Just returns &amp;#039;&amp;lt;&amp;#039;.&lt;br /&gt;
&lt;br /&gt;
endtag(thetag)&lt;br /&gt;
    Called when we encounter a close tag.   &amp;lt;/...&lt;br /&gt;
    It is passed the tag contents (including leading &amp;#039;/&amp;#039;) and just returns it.&lt;br /&gt;
&lt;br /&gt;
emptytag(thetag)&lt;br /&gt;
    Called when we encounter a tag that we can&amp;#039;t extract any valid name or attributes from.&lt;br /&gt;
    It is passed the tag contents and just returns it.&lt;br /&gt;
&lt;br /&gt;
handletag(name, attrs, thetag)&lt;br /&gt;
    Called when we encounter a tag.&lt;br /&gt;
    Is passed the tag name and attrs (a list of (attrname, attrvalue) tuples) - and the original tag contents as a string.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Typical usage :&lt;br /&gt;
&lt;br /&gt;
filehandle = open(&amp;#039;file.html&amp;#039;, &amp;#039;r&amp;#039;)&lt;br /&gt;
parser = Scraper()&lt;br /&gt;
while True:&lt;br /&gt;
    data = filehandle.read(10000)               # read in the data in chunks&lt;br /&gt;
    if not data: break                      # we&amp;#039;ve reached the end of the file - python could do with a do:...while syntax...&lt;br /&gt;
    parser.feed(data)&lt;br /&gt;
##    print parser.push()                     # you can output data whilst processing using the push method&lt;br /&gt;
processedfile = parser.close()              # or all in one go using close  &lt;br /&gt;
## print parser.close()                       # Even if using push you will still need a final close&lt;br /&gt;
filehandle.close()&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
TODO/ISSUES&lt;br /&gt;
Could be sped up by jumping from &amp;#039;&amp;lt;&amp;#039; to &amp;#039;&amp;lt;&amp;#039; rather than a character by character search (which is still pretty quick).&lt;br /&gt;
Need to check I have all the right tags and attributes in the tagdict in approxScraper.&lt;br /&gt;
The only other modification this makes to HTML is to close tags that don&amp;#039;t have a closing &amp;#039;&amp;gt;&amp;#039;. Theoretically it could close them in the wrong place.&lt;br /&gt;
(This is very bad HTML anyway - but I need to watch for missing content that gets caught like this.)&lt;br /&gt;
Could check for character entities and named entities in HTML like HTMLParser.&lt;br /&gt;
Doesn&amp;#039;t do anything special for self-closing tags (e.g. &amp;lt;br /&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
CHANGELOG&lt;br /&gt;
06-09-04        Version 1.3.0&lt;br /&gt;
A couple of patches by Paul Perkins - mainly to prevent the namefind regular expression from grabbing the last character of the tag name when the tag has no attributes.&lt;br /&gt;
&lt;br /&gt;
28-07-04        Version 1.2.1&lt;br /&gt;
Was losing a bit of data with each new feed. Have sorted it now.&lt;br /&gt;
&lt;br /&gt;
24-07-04        Version 1.2.0&lt;br /&gt;
Refactored into Scraper and approxScraper classes.&lt;br /&gt;
Is now a general purpose, basic, HTML parser.&lt;br /&gt;
&lt;br /&gt;
19-07-04        Version 1.1.0&lt;br /&gt;
Modified to output URLs using the PATH_INFO method - see approx.py&lt;br /&gt;
Cleaned up tag handling - it now works properly when there is a missing closing tag (common - but see TODO - has to guess where to close it).&lt;br /&gt;
&lt;br /&gt;
11-07-04        Version 1.0.1&lt;br /&gt;
Added the close method.&lt;br /&gt;
&lt;br /&gt;
09-07-04        Version 1.0.0&lt;br /&gt;
First version designed to work with approx.py the CGI proxy.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;&amp;quot;&amp;quot;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
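The comment above namefind in the listing says the original pattern would steal the last character of a tag name when the tag had no attributes. A minimal sketch of that difference follows; the names old_namefind, new_namefind and split_tag are hypothetical and used only for illustration, while both patterns are copied from the listing.

```python
import re

# Original pattern: the trailing (.+) demands at least one attribute character.
old_namefind = re.compile(r'(\S*)\s*(.+)', re.DOTALL)
# Patched pattern from the listing: the name is mandatory, attributes optional.
new_namefind = re.compile(r'(\S+)\s*(.*)', re.DOTALL)


def split_tag(pattern, tag_contents):
    """Return (name, attributes) for the text between the angle brackets.

    Hypothetical helper; returns None when the pattern does not match.
    """
    match = pattern.match(tag_contents)
    return match.group(1, 2) if match else None


bare = split_tag(old_namefind, 'br')      # name loses its last character
fixed = split_tag(new_namefind, 'br')     # whole name, empty attribute string
with_attrs = split_tag(new_namefind, 'a href="index.html"')

print(bare)
print(fixed)
print(with_attrs)
```

With the old pattern the backtracking needed to satisfy (.+) is what splits the name; the patched pattern leaves tags that do carry attributes split exactly as before.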
&lt;br /&gt;
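The attribute loop in tagstart turns each regular expression match into a (name, value) pair: a bare attribute takes its own name as its value, and matching surrounding quotes are stripped. The sketch below reproduces that logic on its own. It assumes a simplified unquoted-value character class rather than the exact one taken from sgmllib, and parse_attrs is a hypothetical name, not part of scraper.py.

```python
import re

# Simplified stand-in for attrfind: the unquoted-value branch here accepts
# any run of characters other than whitespace and quotes (an assumption).
attrfind = re.compile(
    r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^\s\'"]*))?')


def parse_attrs(attributes):
    """Return a list of (lower-cased name, value) pairs, as tagstart builds."""
    attrs = []
    for attrname, rest, attrvalue in attrfind.findall(attributes):
        if not rest:
            # No "= value" part: the attribute name doubles as its value.
            attrvalue = attrname
        elif (attrvalue[:1] == "'" == attrvalue[-1:] or
              attrvalue[:1] == '"' == attrvalue[-1:]):
            # Strip one pair of matching quotes.
            attrvalue = attrvalue[1:-1]
        attrs.append((attrname.lower(), attrvalue))
    return attrs


parsed = parse_attrs('HREF="x.html" checked width=100')
print(parsed)
```

Only the attribute name is lower-cased; the value keeps its original case, which matters for URLs.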
[[Category:Internet]]&lt;br /&gt;
[[Category:Python]]&lt;/div&gt;</summary>
		<author><name>PeterHarding</name></author>
	</entry>
</feed>