"Linux Gazette...making Linux just a little more fun!"


Using Python to Generate HTML Pages

By Richie Bielak, richieb@netlabs.net


Introduction

I have waited for a long time to set up my own Web site, mostly because I didn't know what to put there that others may want to see. Then I got an idea. Since I'm an avid reader and an aviation enthusiast, I decided to create pages with a list of aviation books I have read. My initial intention was to write reviews for each book.

Setting up the pages was easy to start with, but as I added more books the maintenance became tedious. I had to update a couple of indices with the same data and sort them by hand, and alphabetizing was never my strong suit. I needed to find a better way.

Around the same time I became interested in the programming language Python and it seemed that Python would be a good tool to automatically generate the various HTML pages from a simple text file. This would greatly simplify the updates of my book pages, as I would only add one entry to one file and then create complete pages by running a Python script.

I was attracted to Python for two main reasons: it's very good at processing strings and it's object oriented. Of course the fact that the Python interpreter is free and that it runs on many different systems helped. At first I installed Python on my Win95 machine, but I just couldn't force myself to do any programming in the Windows environment, even in Python. Instead I installed Linux and moved all my Web projects there.

The Problem

The main goal of the program is to generate three different book indices, by author, by title and by subject, from a single input file. I started by defining the format of this file. Here is what a typical entry describing one book looks like:
	title: Zero Three Bravo
	author: Gosnell, Mariana
	subject: General Aviation
	url: 3zb.htm
	# this is a comment
Each line starts with a keyword (e.g. "title:" or "author:") and is followed by a value that will be shown in the final HTML page. The description of each book must start with the "title:" line, there must be at least one "author:" tag, and the "url:" entry points to a review of the book, if there is one.

Since Python is object-oriented we begin program design by looking for "objects". In a nutshell, object oriented (OO) programming is a way to structure your code around the things, that is "objects", that the program is working with. This rather simple idea of organizing software around what it works with (objects), rather than what it does (functions), turns out to be surprisingly powerful.

Within an OO program similar objects are grouped into "classes" and the code we write describes each class. Objects that belong to a given class are called "instances of the class".

I hope it is pretty obvious to you that since the program will manipulate "book" objects, we need a Python class that will represent a single book. Just knowing this is enough to let us suspend design and write some code.

The Book Class

Before we start looking at the code we need to consider briefly how Python programs are organized. Each program consists of a number of modules. Each module is contained in a file (usually named with the extension ".py"), and the name of the file (without the ".py") serves as the module name. A module can contain any number of routines or classes. Typically things that are related are kept in one module. For example, there is a string module that contains functions that operate on strings. To access functions or classes from another module we use the import statement. For example, the first line of the Book module is:
    from string import split, strip
which says that the routines split and strip are obtained from the string module.
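For readers trying this on a current Python 3 interpreter rather than the Python 1.4 used in this article: the string module there no longer provides split and strip as functions, because they became methods of string objects. The sketch below shows both import styles using string.capwords, which does still exist; the variable names are only illustrative.

```python
# Two ways to reach a name that lives in another module.
import string                # qualified access: string.capwords(...)
from string import capwords  # direct access: capwords(...)

line = "  gosnell, mariana  "

# In Python 3, split and strip are methods on the string itself.
names = [part.strip() for part in line.split(",")]

qualified = string.capwords(names[0])  # module name spelled out
direct = capwords(names[1])            # name imported directly
```

Both calls reach the same function; the "from ... import" form simply saves typing the module name.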

Next, I have to point out a few syntactic features of Python that are not immediately obvious from the code. The most important is the fact that in Python indentation is part of the syntax. To see which statements will be executed following an "if", all you need to look at is the indentation; there is no need for curly braces, BEGIN/END pairs or "fi" statements.

Here is a typical "if" statement extracted from the set_author routine in the Book class:

	if new_author:
	    names = split (new_author, ",")
	    self.last_name.append (strip (names[0]))
	    self.first_name.append (strip (names[1]))
	else:
	    self.last_name = []
	    self.first_name = []
The three statements following the "if" are executed if the "new_author" variable contains a non-empty value; otherwise the two statements after the "else" are executed. The amount of indentation is not important, but it must be consistent within a block. Also note the colon (":"), which is used to terminate the header of each compound statement.
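To try the indentation rule outside the class, here is a minimal stand-alone sketch in Python 3 syntax. The function name split_author is hypothetical, and the code assumes, as the input format does, that a non-empty value always contains a comma.

```python
def split_author(new_author):
    # An empty string (or None) counts as false, so the "else" branch runs.
    if new_author:
        names = new_author.split(",")
        last = names[0].strip()
        first = names[1].strip()
        return (last, first)
    else:
        return None


with_name = split_author("Gosnell, Mariana")
without_name = split_author("")
```

Moving the "return (last, first)" line one level to the left would take it out of the "if" block, which is exactly the kind of change that braces would express in other languages.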

The Book class turns out to be very simple. It consists of routines that set the values for author, title, subject and the URL for each book. For example, here is the set_title routine:

    def set_title (self, new_title):
	self.title = new_title
The first argument to the "set_title" method (that is, a routine which belongs to a class) is "self". This argument always refers to the instance to which the method is applied. Furthermore, the attributes (i.e. the data contained in each object) must be qualified with "self" when referenced within the body of a method. In the example above the attribute "title" of a "Book" object is set to the value of "new_title".

If in another part of the program we have a variable "b" that references an instance of the "Book" class, this call would set the book's title:

    b.set_title ("Fate is the Hunter")
Note that the "self" argument is not present in the call, instead the object to which the method is applied (i.e. the object before the ".", "b" above) becomes the "self" argument.

At this point a reasonable question to ask is "Where do the objects come from?" Each object is created by a special call that uses the class name as the name of a function. In addition a class can define a method with the name __init__ which will automatically be called to initialize the new object's attributes (in C++ such a routine is called a constructor).

Here is the __init__ routine for the Book class:

    def __init__ (self, t="", a="", s="", u=""):
	#
	# Create an instance of Book
	#
	self.title = t
	self.last_name = []
	self.first_name = []
	self.set_author (a)
	self.subject = s
	self.url = u
The main purpose of the above routine is to create all the attributes of the new "Book" object. Note that the arguments to "__init__" are specified with default values, so that the caller needs only to pass the arguments that differ from the default.

Here are some examples of calls to create "Book" objects:

    a = Book()
    b = Book ("Fate is the Hunter")
    c = Book ("Some book", "First, Author")

There is one small complication in the "Book" class. It is possible for a book to have more than one author. That's why the attributes "first_name" and "last_name" are actually lists. We'll look more at lists in the next section.

The complete Book class is shown in Listing #1. To test the class we add a little piece of code at the end of the file that checks whether the module is running as __main__, that is, whether execution started in this file. If so, the code to test the Book class will run.
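As a rough illustration, the fragments shown in this section can be assembled as follows. This is a sketch in Python 3 syntax, not Listing #1 itself: the bodies of set_subject and set_url and the self_test helper are assumptions, and set_author still expects a "Last, First" value.

```python
class Book:
    def __init__(self, t="", a="", s="", u=""):
        self.title = t
        self.last_name = []
        self.first_name = []
        self.set_author(a)
        self.subject = s
        self.url = u

    def set_title(self, new_title):
        self.title = new_title

    def set_author(self, new_author):
        # "Last, First" adds one author; an empty value clears both lists.
        if new_author:
            names = new_author.split(",")
            self.last_name.append(names[0].strip())
            self.first_name.append(names[1].strip())
        else:
            self.last_name = []
            self.first_name = []

    def set_subject(self, new_subject):  # assumed to mirror set_title
        self.subject = new_subject

    def set_url(self, new_url):          # assumed to mirror set_title
        self.url = new_url


def self_test():
    # Hypothetical helper: builds one book and checks its attributes.
    b = Book("Zero Three Bravo", "Gosnell, Mariana",
             "General Aviation", "3zb.htm")
    assert b.last_name == ["Gosnell"]
    assert b.first_name == ["Mariana"]
    b.set_title("Fate is the Hunter")
    assert b.title == "Fate is the Hunter"
    return b


# Runs only when this file is executed directly, not when it is imported.
if __name__ == "__main__":
    self_test()
    print("Book self-test passed")
```

When another module imports this file, __name__ holds the module name instead of "__main__", so the test is skipped.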

The Book_List Class

Once the Book class is tested we can go back to designing. The next obvious object is a list which will contain all the "book" objects. For the purposes of our program we have to be able to create the book list from the input file and we have to sort the books in the list by author, title or subject. The sorted list will then be used as input to the code that actually generates the HTML pages.

As it turns out one of Python's built-in data structures is a list. Here is a snippet of code showing creation of a list and addition of some items (this example was produced by running Python interactively):

 
Python 1.4 (Dec 18 1996)  [GCC 2.7.2.1]
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> s = []
>>> s.append ("a")
>>> s.append ("hello")
>>> s.append (1)
>>> print s
['a', 'hello', 1]
Above we create a list called "s" and add three items to it. Lists allow "slicing" operations, which let you pull out pieces of a list by specifying element numbers. These examples illustrate the idea:
>>> print s[1]
hello
>>> print s[1:]
['hello', 1]
>>> print s[:2]
['a', 'hello']
>>> print s[0]
a
s[1] denotes the second element of the list (indexing starts at zero), s[1:] is the slice from the second element to the end of the list, s[:2] is the slice from the start up to, but not including, the third element, and s[0] is the first item.

Finally, lists have a "sort" method, which can sort the elements according to a user-supplied comparison function.

Armed with the knowledge of Python lists, writing the Book_List class is easy. The class will have a single attribute, "contents", which will be a list of books.

The constructor for the Book_List class simply creates a "contents" attribute and initializes it to be an empty list. The routine that parses the input file and creates list elements is called "make_from_file" and it begins with the code:

   def make_from_file (self, file):
	#
	# Read the file and create a book list
	#
	lines = file.readlines ()
	self.contents = []
The "file" argument is a handle to an open text file that contains the descriptions of the books. The first step this routine performs is to read the entire file into a list of strings, each string representing one line of text. Next, using Python's "for" loop we step through this list and examine each line of text:
	#
	# Parse each line and create a list of Book objects
	#
	for one_line in lines:
	    # It's  not a comment or empty line 
	    if (len(one_line) > 0) and (one_line[0] != "#"):
		    # Split into tokens
		    tokens = string.split (one_line)
If the line is not empty and is not a comment (that is, the first character is not a "#") then we split the line into words, a word being a sequence of characters without spaces. The call "tokens = string.split (one_line)" uses the "split" routine from the "string" module. "split" returns the words it found in a list.
		    if len (tokens) > 0:
			if (tokens[0] == "title:"):
			    current_book = book.Book (string.join (tokens[1:]))
			    self.contents.append (current_book)
			elif (tokens[0] == "author:"):
			    current_book.set_author (string.join (tokens[1:]))
			elif (tokens[0] == "subject:"):
			    current_book.set_subject (string.join (tokens[1:]))
			elif (tokens[0] == "url:"):
			    current_book.set_url (string.join (tokens[1:]))

The first token (i.e. word) on the line is the keyword that tells us what to do. If it is "title:" then we create a new Book object and append it to the list of books; otherwise we just set the proper attribute on the current book. Note that the remaining tokens found on each line are joined together into a string (using the "string.join" routine). There is probably a more efficient way to code this, but for my purposes this code works fast enough.
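The parsing steps above can be tried end to end with a self-contained sketch like the one below. It is written for Python 3, so it uses string methods in place of string.split and string.join. The Book class here is a deliberately tiny stand-in that stores each author value unsplit, the SAMPLE text is the entry from the start of this article, and skipping fields that appear before any "title:" line is an added assumption.

```python
import io


class Book:
    """Tiny stand-in for the article's Book class (authors kept unsplit)."""

    def __init__(self, t=""):
        self.title = t
        self.authors = []
        self.subject = ""
        self.url = ""

    def set_author(self, a):
        self.authors.append(a)

    def set_subject(self, s):
        self.subject = s

    def set_url(self, u):
        self.url = u


class Book_List:
    def __init__(self):
        self.contents = []

    def make_from_file(self, file):
        self.contents = []
        current_book = None
        for one_line in file.readlines():
            tokens = one_line.split()
            # Skip blank lines and comments, including indented ones.
            if not tokens or tokens[0].startswith("#"):
                continue
            keyword, value = tokens[0], " ".join(tokens[1:])
            if keyword == "title:":
                current_book = Book(value)
                self.contents.append(current_book)
            elif current_book is None:
                continue  # assumption: ignore fields before the first title
            elif keyword == "author:":
                current_book.set_author(value)
            elif keyword == "subject:":
                current_book.set_subject(value)
            elif keyword == "url:":
                current_book.set_url(value)


SAMPLE = (
    "\ttitle: Zero Three Bravo\n"
    "\tauthor: Gosnell, Mariana\n"
    "\tsubject: General Aviation\n"
    "\turl: 3zb.htm\n"
    "\t# this is a comment\n"
)

books = Book_List()
books.make_from_file(io.StringIO(SAMPLE))  # StringIO stands in for a file
```

Because split with no arguments discards all leading and trailing whitespace, the tab at the start of each line and the newline at the end need no special handling.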

The other interesting parts of the Book_List class are the sort routines. Here is how the list is sorted by title:

    def sort_by_title (self):
	#
	# Sort book list by title
	#
	self.contents.sort (lambda x, y: cmp (x.title, y.title))

We simply call the "sort" routine on the list. To get proper ordering we need to supply a function that compares two Book objects. For sorting by title we supply an anonymous function, which is introduced with the keyword "lambda" (those of you familiar with Lisp or other functional languages should recognize this construct). The definition:
      lambda x, y: cmp (x.title, y.title)
simply says that this is a function of two arguments and that the function result comes from calling the Python built-in function "cmp" (i.e. compare) on the "title" attribute of the two objects.

The other sort routines are similar, except that in "sort_by_author" I used a local function instead of a "lambda", because the comparison was a little more complicated: I wanted to have all the books with the same author appear alphabetically by title.
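Here is one way such a local comparison function might look. This is a sketch, not the code from the listing, and it targets Python 3, where the built-in cmp is gone and sort no longer accepts a comparison function directly, so the sketch defines its own cmp and wraps the comparison with functools.cmp_to_key. Comparing on the first author's last name is an assumption, and the book data is made up purely for illustration.

```python
from functools import cmp_to_key


def cmp(a, b):
    # Stand-in for the old built-in: negative, zero or positive.
    return (a > b) - (a < b)


class Book:
    """Minimal stand-in with only the attributes the sorts need."""

    def __init__(self, title, last_name):
        self.title = title
        self.last_name = [last_name]


class Book_List:
    def __init__(self, books):
        self.contents = list(books)

    def sort_by_title(self):
        self.contents.sort(key=lambda b: b.title)

    def sort_by_author(self):
        def compare(x, y):
            # First author's last name decides; the title breaks ties.
            result = cmp(x.last_name[0], y.last_name[0])
            if result == 0:
                result = cmp(x.title, y.title)
            return result

        self.contents.sort(key=cmp_to_key(compare))


# Placeholder data, not real books.
shelf = Book_List([
    Book("Zebra", "Adams"),
    Book("Alpha", "Baker"),
    Book("Apple", "Adams"),
])

shelf.sort_by_author()
by_author = [(b.last_name[0], b.title) for b in shelf.contents]

shelf.sort_by_title()
by_title = [b.title for b in shelf.contents]
```

In Python 3 the same ordering can also be had without a comparison function, by sorting on the key (b.last_name[0], b.title).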

Generating Pages

Now that we have constructed a list of books, the next step is to create the HTML pages. We begin by creating a class, called Html_Page, that generates the basic outline of a page, and then we extend that class to create the titles, authors and subjects pages.

The idea that existing code can be extended yet not changed is the second most important idea of OO programming. The mechanism for doing this is called "inheritance". It allows the programmer to create a new class by adding new properties to an old class, and the old class does not have to change. A way to think about inheritance is as "programming by differences". In our program we will create three classes that inherit from Html_Page.

Html_Page is quite simple. It consists of routines that generate the header and the trailer tags for an HTML page. It also contains an empty routine for generating the body of the page. This routine will be defined in descendant classes. The __init__ routine lets the user of this class specify a title and a top level heading for the page.

When I first tested the output of the HTML generators I simply printed it to the screen and manually saved it into a file, so I could see the page in a browser. But once I was happy with the appearance, I had to change the code to save the data into a file. That's why in Html_Page you will see code like this:

	self.f.write ("<html>\n")
	self.f.write ("<head>\n")
for writing the output to a file referenced by the attribute "f".

However, since the actual output file will be different for each page, opening the file is deferred to a descendant class.
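To make the shape of such a base class concrete, here is a rough Python 3 sketch. It is not Listing #3: the method names generate_header and generate, the exact tags written, and the Demo_Page descendant are all assumptions; only generate_body, generate_trailer and the "f" attribute are named in this article. An in-memory io.StringIO stands in for the output file.

```python
import io


class Html_Page:
    def __init__(self, title, heading):
        self.title = title
        self.heading = heading
        self.f = None  # the output file is opened by a descendant class

    def generate_header(self):  # assumed name
        self.f.write("<html>\n")
        self.f.write("<head>\n")
        self.f.write("<title>" + self.title + "</title>\n")
        self.f.write("</head>\n")
        self.f.write("<body>\n")
        self.f.write("<h1>" + self.heading + "</h1>\n")

    def generate_body(self):
        # Intentionally empty; descendants redefine this routine.
        pass

    def generate_trailer(self):
        self.f.write("</body>\n")
        self.f.write("</html>\n")

    def generate(self):  # assumed driver routine
        self.generate_header()
        self.generate_body()
        self.generate_trailer()


class Demo_Page(Html_Page):
    """Hypothetical descendant used only to exercise the base class."""

    def __init__(self):
        Html_Page.__init__(self, "Demo", "Demo heading")
        self.f = io.StringIO()  # a real page would call open() here

    def generate_body(self):
        self.f.write("<p>Hello</p>\n")


page = Demo_Page()
page.generate()
html = page.f.getvalue()
```

The base class never needs to know which descendant it is working for; it calls generate_body and gets whatever the descendant defined.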

You can see the complete code for Html_Page in Listing #3. The three classes Authors_Page, Titles_Page and Subjects_Page are used to create the final HTML pages. Since these classes belong together I put them in one module, called books_pages. Because the code for these classes is very similar we will only look at the first one.

Here is how Authors_Page begins:

class Authors_Page (Html_Page):

    def __init__ (self):
	Html_Page.__init__ (self, "Aviation Books: by Author",
			    "<i>Aviation Books: indexed by Author</i>")
	self.f = open ("books_by_author.html", "w")
	print "Authors page in--> " + self.f.name
To start with, notice that the class heading lists the name of the class from which Authors_Page inherits, namely Html_Page. Next, notice that the constructor invokes the constructor of the parent class by calling the __init__ routine qualified by the class name. Finally, the constructor names and opens the output file. To keep things simple, I decided not to make the file name a parameter.

Since the book list is needed to generate the body of each page, I added a book_list attribute to each page class. This attribute is set before HTML generation starts.

The generate_body routine redefines the empty routine from the parent class. Although fairly long, the code is pretty easy to understand once you know that the book list is represented as an HTML table and the "+" is the concatenation operator for strings.
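As an illustration of the idea only, a body generator along these lines might look like the Python 3 sketch below. The table layout, the column headings and the use of html.escape are assumptions rather than the code from the listing, the class leaves out the parent class and the header and trailer routines so that the example stays short, and the Book stand-in carries just the attributes the table needs.

```python
import io
from html import escape


class Book:
    """Minimal stand-in for the article's Book class."""

    def __init__(self, title, last, first, url=""):
        self.title = title
        self.last_name = [last]
        self.first_name = [first]
        self.url = url


class Authors_Page:
    """Body generation only; header and trailer are omitted here."""

    def __init__(self, out):
        self.f = out
        self.book_list = []  # set before HTML generation starts

    def generate_body(self):
        self.f.write("<table>\n")
        self.f.write("<tr><th>Author</th><th>Title</th></tr>\n")
        for b in self.book_list:
            author = b.last_name[0] + ", " + b.first_name[0]
            title = escape(b.title)
            if b.url:
                # Link the title to the review when there is one.
                title = '<a href="' + escape(b.url) + '">' + title + "</a>"
            self.f.write("<tr><td>" + escape(author) + "</td><td>"
                         + title + "</td></tr>\n")
        self.f.write("</table>\n")


out = io.StringIO()
page = Authors_Page(out)
page.book_list = [Book("Zero Three Bravo", "Gosnell", "Mariana", "3zb.htm")]
page.generate_body()
body = out.getvalue()
```

Escaping the text guards against titles that contain characters such as "&" or "<", which would otherwise be read as markup.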

In addition to replacing the generate_body routine we also redefine the generate_trailer routine in order to put a back link to the book index at the bottom of each page:

    def generate_trailer (self):
	self.f.write ("<hr>\n")
	self.f.write ("<center><a href=books.html>Back to Aviation Books Top Page</a></center>\n")
	self.f.write ("<hr>\n")
	Html_Page.generate_trailer (self)
Notice how, right after we generate the back link, we include a call to the parent's generate_trailer routine to finish off the page with the correct terminating tags.

The complete listing for the three page-generating classes is found in Listing #4.

The main line of the entire program is shown in Listing #5. By now the code there should be self-explanatory.

Summary

As you can see, this particular program was not hard to write. Python is well suited for these types of tasks; you can quickly put together a useful program with minimal fuss.

After I got the program to work I realized that its design is not the best. For example, the HTML generating code could be more general; perhaps the Book class should generate its own HTML table entries. For now the program fits my purposes, but I will modify it if I need to create other HTML generating applications.

If you would like to see the results of this script, visit my book page.

To learn more about Python you should start with the Python Home Page, which will point you to many Python resources on the net. I also found the O'Reilly book Programming Python by Mark Lutz extremely helpful.

Finally, any mistakes in the description of Python features are my own fault, as I'm still a Python novice.


Copyright © 1997, Richie Bielak
Published in Issue 19 of the Linux Gazette, July 1997

