
Re: Editors, Searching, Geekness, and Gary Preckshot



"Edward C. Bailey" wrote:

> 
> We're *very* interested in searching content marked up in DocBook.  Do you
> have an example of DocBook that supports searching and retrieval versus one
> that doesn't?  I'm having a hard time getting my head around this aspect of
> SGML...
> 
OK

In marked-up text, tags are nested. For instance,
<author> can appear in this context (innermost tag first):

<author>
<artheader>
<article>

In this context it means the author of the howto.

However, <author> can also appear in the nesting

<author>
<BiblioEntry>
<Bibliography>
<article>

In this case it means the author of a cited reference. There
are many other instances of context-dependent meaning, and a
search engine has to take account of them. For content
producers, this means that tag nesting becomes an important
tool for making content findable by a search engine. Since
tags can be separated by a lot of intervening text and other
tags, it's useful a) to have a tool that makes the nesting
obvious, and b) to have a policy for what is going to be
searched for.

I'm going to give an example of a parser that does this, and
show how easy it is. For DocBook markup, a simple stack-based
parser in the LR style (technology that is decades old) is
sufficient. It parses from left to right, and every time it
encounters a start tag (<artheader> for example) it pushes a
token <artheader> on a stack. Every time it pushes a token,
it checks the stack, and if it recognizes a nesting it's
looking for, for example

<author>
<artheader>
<article>

then it has found the data it is looking for and can collect
text until the end tag </author>. When it recognizes
</author>, it also pops <author> off the stack.
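
The stack mechanism described above can be sketched in a few
lines of Python. This is only an illustration, not a real
DocBook tool: the names ContextFinder and find_in_context are
made up here, and it leans on the standard library's
html.parser (which folds tag names to lower case) rather than
a validating SGML or XML parser. The nesting is written
outermost tag first.

```python
from html.parser import HTMLParser


class ContextFinder(HTMLParser):
    """Collect the text of elements that appear in a given nesting.

    `target` lists tag names outermost first, e.g.
    ["article", "artheader", "author"].
    """

    def __init__(self, target):
        super().__init__()
        self.target = list(target)
        self.stack = []            # currently open tags, outermost first
        self.capture_depth = None  # stack depth of the element being captured
        self.buffer = []
        self.matches = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        # After every push, check whether the top of the stack is the
        # nesting we are looking for.
        if (self.capture_depth is None
                and self.stack[-len(self.target):] == self.target):
            self.capture_depth = len(self.stack)
            self.buffer = []

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag, ignore it
        # Pop until the matching start tag has been removed.
        while self.stack:
            depth = len(self.stack)
            popped = self.stack.pop()
            if self.capture_depth == depth:
                text = " ".join("".join(self.buffer).split())
                self.matches.append(text)
                self.capture_depth = None
            if popped == tag:
                break


def find_in_context(document, target):
    finder = ContextFinder(target)
    finder.feed(document)
    finder.close()
    return finder.matches


# Made-up sample document; the names are placeholders.
SAMPLE = """
<article>
<artheader>
<author><firstname>Ann</firstname> <surname>Example</surname></author>
</artheader>
<bibliography><biblioentry>
<Author><firstname>Bob</firstname> <surname>Sample</surname></Author>
</biblioentry></bibliography>
</article>
"""

howto_authors = find_in_context(SAMPLE, ["article", "artheader", "author"])
cited_authors = find_in_context(SAMPLE, ["bibliography", "biblioentry", "author"])
```

With the sample above, the first call picks up only the
howto's author and the second only the cited author, even
though both are marked up with the same tag.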

To make this work, authors have to be aware of the
importance of tag nesting to searching. Another kind of
searching is indexing a particular location by inserting an
index tag in the document. So we have two kinds of search
results. The first is a reference to the whole howto,
meaning that the howto meets some search criterion. The
second is a specific location in the howto that meets an
indexing criterion.

The harder issue is displaying or going to an indexed
location. One of the reasons for going to XML is that
browsers will have the capability to display XML directly,
so that search engines need only give the browser an
address. However, that capability is not yet widely
available.

One tool that seems like it would be quite useful is an
analog of lint. C can be used in many foolish ways, and lint
is a program that identifies bad practice. Similarly,
DocBook can be used in strange ways, and a DocLint program
could identify constructions that were unusual or search
criteria that were missing.
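
To make the idea concrete, here is a rough sketch of what a
DocLint might check. No such program exists as far as this
message is concerned; the rule tables EXPECTED_PARENTS and
REQUIRED_PATHS are invented for illustration and would really
come from whatever search policy is agreed on. It reuses the
same stack idea as the parser described earlier, built on
Python's standard html.parser.

```python
from html.parser import HTMLParser

# Hypothetical policy: where a tag is normally expected to appear ...
EXPECTED_PARENTS = {"author": {"artheader", "biblioentry"}}
# ... and which nestings (outermost first) every howto should contain.
REQUIRED_PATHS = [("article", "artheader", "author")]


class DocLint(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.seen = set()   # every full nesting path encountered
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        allowed = EXPECTED_PARENTS.get(tag)
        if allowed is not None and parent not in allowed:
            self.warnings.append(
                f"unusual nesting: <{tag}> inside <{parent}>")
        self.stack.append(tag)
        self.seen.add(tuple(self.stack))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop until the matching start tag has been removed.
            while self.stack.pop() != tag:
                pass


def lint(document):
    checker = DocLint()
    checker.feed(document)
    checker.close()
    for path in REQUIRED_PATHS:
        if path not in checker.seen:
            checker.warnings.append(
                "missing search criterion: " + " > ".join(path))
    return checker.warnings


# Made-up input: the author is buried in a paragraph, not the header.
BAD = "<article><para><author>Ann Example</author></para></article>"
GOOD = ("<article><artheader><author>Ann Example</author>"
        "</artheader></article>")

bad_report = lint(BAD)
good_report = lint(GOOD)
```

For the first document the report contains one warning about
the odd placement of <author> and one about the missing
article header author; the second document produces no
warnings.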

Gary

