Copyright O'Reilly, 2000. All rights reserved.
This is an early draft chapter from a forthcoming book on Zope, to be
published by O'Reilly & Associates. The material has not been through
O'Reilly's editorial process, nor has it been reviewed for technical
accuracy. O'Reilly & Associates disclaims responsibility for any
errors in this draft and advises readers to use the information
contained herein with caution.
O'Reilly & Associates grants readers the right to read this material
and to print copies or make electronic copies for their own
use. O'Reilly & Associates does not grant anyone the right to use this
material as part of a commercial product or to modify and distribute
it. When O'Reilly & Associates publishes the final draft of this book
in print form, the content will be made available under an open
content license, but this chapter is not open content.
If you have any comments on the material in this chapter, you should
send them to the authors, Michel Pelletier and Amos Latteier, at
docs@digicool.com.
Searching and Categorizing Content
Introduction
A ZCatalog is a catalog of objects for Zope. A Catalog is a collection
of indexes that store references to other Zope objects.
ZCatalog is a powerful tool, providing a number of compelling features:
Searches are fast. The data structures used by the indexes
provide quick searches.
Searches are robust. ZCatalog supports boolean search terms,
relevance ranking, synonyms and stopwords.
Indexing is flexible. A ZCatalog can catalog custom
properties and track unique values. Since ZCatalog catalogs
objects instead of file handles, you can index any content that can
have a Python object wrapped around it. This also lets objects
participate in how they are cataloged, e.g. de-HTML-ifying contents
or extracting PDF properties.
Transactional. An indexing operation is part of a Zope
transaction. If something goes wrong after content is indexed, the
index is restored to its previous condition. This also means that
Undo will restore an index to its previous condition. A ZCatalog
can be altered privately in a Version, meaning no one else can see
the changes to the index.
Cache-friendly. The index is internally broken into different
"buckets", with each bucket being a separate Zope database object.
Only the part of the index that is needed is loaded into memory.
Alternatively, an un-needed part of the index can be removed from
memory.
Results are lazy. A search that matches a tremendous number
of objects does not build a large result set. Only the part of the
results that is needed, such as the second batch of twenty, is returned.
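The lazy behavior described in the last item can be illustrated in plain Python. This is only a sketch under stated assumptions, not ZCatalog's implementation: `LazyResults` and the `load` callback are hypothetical names, and it assumes a search produces record ids that are turned into objects only when someone asks for them.

```python
class LazyResults:
    """A sequence that loads each result only when it is indexed."""

    def __init__(self, record_ids, load):
        self._ids = record_ids
        self._load = load

    def __len__(self):
        return len(self._ids)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self._load(rid) for rid in self._ids[index]]
        return self._load(self._ids[index])


loaded = []  # records which ids were actually loaded


def load(rid):
    # Hypothetical loader; a real catalog would fetch a stored record here.
    loaded.append(rid)
    return "object-%d" % rid


# One million matches, but only the second batch of twenty is loaded.
results = LazyResults(range(1000000), load)
second_batch = results[20:40]
```

After this runs, `loaded` holds only the twenty ids in the requested batch, while `len(results)` still reports the full number of matches.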
Indexing Concepts
Text Index
A text index is like an index in the back of a book:
ZCatalog: 59, 22, 15, 67, 88
This index shows the term ZCatalog occurring on five pages in a
book. Text indexes are good because they let you look up specific
words, or terms, in a document. This is how almost all searching
systems work: by mapping words to the locations of the documents or
pages that the words occur on.
ZCatalog text indexes actually map a word to a sequence of paths
to the objects that the word occurs in:
{ 'bob'   -> ['/Document1', '/Document2'],
  'uncle' -> ['/Document2', '/Document3'],
  'bobo'  -> ['/Document1', '/Document3'],
}
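The mapping above can be built with a few lines of Python. This is a minimal sketch of an inverted index, assuming whitespace-separated words and document texts made up for illustration; `build_text_index` is a hypothetical helper, not part of ZCatalog.

```python
from collections import defaultdict


def build_text_index(documents):
    """Map each lowercased word to the sorted paths of documents containing it."""
    index = defaultdict(set)
    for path, text in documents.items():
        for word in text.lower().split():
            index[word].add(path)
    return {word: sorted(paths) for word, paths in index.items()}


# Illustrative documents chosen to reproduce the mapping shown above.
documents = {
    "/Document1": "bob bobo",
    "/Document2": "bob uncle",
    "/Document3": "uncle bobo",
}
text_index = build_text_index(documents)
```

Looking up `text_index["bob"]` then gives the paths of the two documents that contain that word.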
Vocabularies
Vocabularies are used by text indexes. A vocabulary is basically a
language abstraction. In order for the ZCatalog to work with any
kind of language, it must understand certain behaviors of that
language. For example, different languages:
have a different concept of words. In English and many other
languages, words are defined by whitespace boundaries, but in
other languages, like Chinese and Japanese, words are defined by
their contextual usage.
have different concepts of stopwords. The French word nous
(we) would be extremely common in French text and should
probably be removed as a stopword, but in English text it might
make perfect sense to catalog this word because it is very
infrequent.
have different concepts of synonyms. The synonym pair
automobile -> car would not make sense in any language but
English.
have different concepts of stemming. In English, it is common
for text indexers to strip suffixes like ing from words, so
that bake and baking match the same word. These suffix
strippings only make sense for English; other languages
would want to provide their own stemming (or none at all).
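The four language-specific behaviors listed above can be gathered behind one small interface. The sketch below is an assumption about how such an abstraction could look, not the ZCatalog vocabulary API: `SimpleVocabulary`, its methods, and the word lists are all hypothetical, and the suffix stripping is deliberately naive.

```python
class SimpleVocabulary:
    """Hypothetical language abstraction: splitting, stopwords, synonyms, stemming."""

    def __init__(self, stopwords=(), synonyms=None, suffixes=()):
        self.stopwords = set(stopwords)
        self.synonyms = dict(synonyms or {})
        self.suffixes = tuple(suffixes)

    def split(self, text):
        # Whitespace splitting suits English; other languages would override this.
        return text.lower().split()

    def stem(self, word):
        # Naive suffix stripping; real stemmers handle many more cases.
        for suffix in self.suffixes:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def terms(self, text):
        """Return the normalized terms an index would store for this text."""
        result = []
        for word in self.split(text):
            if word in self.stopwords:
                continue
            word = self.synonyms.get(word, word)
            result.append(self.stem(word))
        return result


english = SimpleVocabulary(
    stopwords=["the", "a"],
    synonyms={"automobile": "car"},
    suffixes=["ing", "e"],
)
french = SimpleVocabulary(stopwords=["nous"])
```

With these settings `english.terms("bake")` and `english.terms("baking")` produce the same term, and the word nous is dropped by the French vocabulary but kept by the English one.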
Current Vocabularies
- Plain Vocabularies
Plain vocabularies are very simple and do
minimal English-language-specific processing.
- Globbing Vocabularies
Globbing vocabularies are more complex
vocabularies that allow wildcard searches on English text to be
performed.
- JVocabulary
JVocabulary is a ZCatalog vocabulary that
supports splitting and indexing Japanese text.
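To give a feel for what a wildcard search involves, here is a rough sketch that expands a pattern against the words of a text index and unions the matching paths. It uses the standard `fnmatch` module purely for illustration; how a Globbing vocabulary actually matches patterns is not described here, and `expand_glob` and `glob_search` are hypothetical names.

```python
from fnmatch import fnmatchcase


def expand_glob(pattern, vocabulary_words):
    """Return every vocabulary word matching a wildcard pattern such as 'bob*'."""
    return sorted(w for w in vocabulary_words if fnmatchcase(w, pattern))


def glob_search(pattern, text_index):
    """Union the paths of all words that match the pattern."""
    paths = set()
    for word in expand_glob(pattern, text_index):
        paths.update(text_index[word])
    return sorted(paths)


text_index = {
    "bob": ["/Document1", "/Document2"],
    "uncle": ["/Document2", "/Document3"],
    "bobo": ["/Document1", "/Document3"],
}
matches = glob_search("bob*", text_index)
```

A linear scan over the vocabulary is fine for a sketch; a real implementation would want a data structure that avoids testing every word.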
Field Index
A field index is an index that maps atomic values to sequences
of paths to the objects that have that value. An example would be
an index that keeps track of when objects were last modified.
uniqueValues
Field indexes have a uniqueValues() method that returns a list
of all unique values in the index mapping.
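A toy version of a field index makes the idea concrete. The class below is a sketch, not ZCatalog's field index: only the method name `uniqueValues` comes from the text, while `FieldIndexSketch`, `index_object`, and `search` are names assumed for illustration.

```python
from collections import defaultdict


class FieldIndexSketch:
    """Toy field index: one atomic value per object, mapped to object paths."""

    def __init__(self):
        self._by_value = defaultdict(set)

    def index_object(self, path, value):
        self._by_value[value].add(path)

    def search(self, value):
        return sorted(self._by_value.get(value, ()))

    def uniqueValues(self):
        """Return every distinct value stored in the index, sorted."""
        return sorted(self._by_value)


meta_type_index = FieldIndexSketch()
meta_type_index.index_object("/Document1", "DTML Document")
meta_type_index.index_object("/Document2", "DTML Document")
meta_type_index.index_object("/logo", "Image")
```

Here `uniqueValues()` reports two values even though three objects are indexed, which is what makes it handy for filling a selection list in a search form.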
Keyword Indexes
A keyword index indexes a sequence of keywords for objects and
can be queried for any objects that have one or more of those
keywords.
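The difference from a field index is that each object contributes several values, and a query matches when any of them is present. The following sketch assumes that "one or more" reading; `KeywordIndexSketch` and its methods are hypothetical names, not ZCatalog's API.

```python
from collections import defaultdict


class KeywordIndexSketch:
    """Toy keyword index: each object is filed under every one of its keywords."""

    def __init__(self):
        self._by_keyword = defaultdict(set)

    def index_object(self, path, keywords):
        for keyword in keywords:
            self._by_keyword[keyword].add(path)

    def search(self, keywords):
        """Return paths of objects having one or more of the given keywords."""
        paths = set()
        for keyword in keywords:
            paths |= self._by_keyword.get(keyword, set())
        return sorted(paths)


subject_index = KeywordIndexSketch()
subject_index.index_object("/Document1", ["zope", "python"])
subject_index.index_object("/Document2", ["python"])
subject_index.index_object("/Document3", ["dtml"])
```

Searching for `["zope", "dtml"]` returns the first and third documents, because each has at least one of the requested keywords.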
Indexing Patterns
Mass Indexing
Mass indexing is simple but has severe drawbacks. The total
amount of content you can index in one transaction is limited to
the amount of free virtual memory available to the Zope process,
plus the amount of temporary storage the system has. If you have
one gigabyte of virtual memory and 10 gigabytes of temporary storage,
then you could theoretically index 11 gigabytes of content.
But just indexing that much content would take a long, long time,
and as soon as virtual memory ran out, Zope would start doing a
lot of hard disk activity out to temporary storage.
So mass indexing is fine if you want to index up to a few thousand
objects, but beyond that, you want to use incremental indexing,
which is much more efficient.
Mass Indexing - Example
Index lots of default content.
Incremental Indexing
Incremental indexing is when a stream of content is indexed over
time. This technique is more complex than mass indexing but scales
much better and is more efficient:
- efficient
When new content is added, old content does not
have to be re-indexed in a new sweep.
- smaller footprint
Because less information is being indexed
per transaction, the memory requirements of the Catalog are reduced.
- no hot spot
Catalogs can become notorious hot spots in a
database, possibly causing lots of conflicts. By spreading out the
database writing, fewer hot spots occur.
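The smaller footprint can be seen in a toy model that commits after every small batch instead of once at the end. Everything here is a stand-in assumed for illustration: `ToyCatalog` only counts uncommitted work, and `index_incrementally` is a hypothetical helper, not a Zope function.

```python
class ToyCatalog:
    """Stand-in for a catalog; tracks uncommitted work per transaction."""

    def __init__(self):
        self.indexed = {}
        self.pending = 0              # objects indexed since the last commit
        self.largest_transaction = 0  # biggest batch ever held uncommitted

    def catalog_object(self, obj, uid):
        self.indexed[uid] = obj
        self.pending += 1

    def commit(self):
        self.largest_transaction = max(self.largest_transaction, self.pending)
        self.pending = 0


def index_incrementally(catalog, stream, batch_size):
    """Index (uid, obj) pairs as they arrive, committing every batch_size items."""
    for count, (uid, obj) in enumerate(stream, start=1):
        catalog.catalog_object(obj, uid)
        if count % batch_size == 0:
            catalog.commit()
    catalog.commit()  # commit any final partial batch


catalog = ToyCatalog()
stream = (("/doc%d" % n, "content %d" % n) for n in range(250))
index_incrementally(catalog, stream, batch_size=100)
```

All 250 objects end up indexed, but no single transaction ever holds more than `batch_size` of them.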
Incremental Indexing - Example
Catalog an RSS stream into Document objects?
Automatic Indexing
Automatic indexing is, as its name implies, the easiest of all.
Automatic indexing is a lot like incremental indexing because a stream
of content is indexed as it is created. However, automatically
indexed content can also be re-indexed when it changes or removed from
the indexes when it is destroyed. This is the most efficient usage
of Zope, but it requires your objects to know special things about
cataloging themselves, so basic Zope objects like DTML Documents and
DTML Methods do not yet support automatic indexing. This is an
advanced technique and will not be discussed until Chapter X.
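Until that chapter, the idea can be conveyed with a toy object that tells a catalog about itself at each point in its life. This is a sketch under loud assumptions: `ToyCatalog`, `CatalogAwareDocument`, and the `edit` and `destroy` methods are hypothetical and do not show how Zope objects really hook into a ZCatalog.

```python
class ToyCatalog:
    """Stand-in catalog that stores the indexed text for each path."""

    def __init__(self):
        self.entries = {}

    def catalog_object(self, obj, uid):
        self.entries[uid] = obj.text

    def uncatalog_object(self, uid):
        self.entries.pop(uid, None)


class CatalogAwareDocument:
    """Hypothetical object that keeps its own catalog entry up to date."""

    def __init__(self, catalog, path, text):
        self.catalog = catalog
        self.path = path
        self.text = text
        self.catalog.catalog_object(self, self.path)  # indexed on creation

    def edit(self, text):
        self.text = text
        self.catalog.catalog_object(self, self.path)  # re-indexed on change

    def destroy(self):
        self.catalog.uncatalog_object(self.path)      # removed on deletion


catalog = ToyCatalog()
doc = CatalogAwareDocument(catalog, "/Document1", "first draft")
after_create = dict(catalog.entries)
doc.edit("second draft")
after_edit = dict(catalog.entries)
doc.destroy()
```

The catalog never has to sweep for changes, because every change arrives as it happens.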
XXX So I guess we need an event model now No example cuzza ZClasses XXX
Using ZCatalog
Querying
Once you have some content in a catalog you can query the Catalog for
objects that match certain criteria.
Search for object by Type - Example
XXX Use Form Action and uniqueValuesFor
Text Search for a Word in Certain Types - Example
XXX Use Form/Action and uniqueValuesFor
Explicit Queries
A query object should be in the form of a Python mapping:
<dtml-in "Catalog({'index1' : term1, 'index2' : term2,
'text_indexN' : 'some words to look for'})">
...
</dtml-in>
The key of each mapping item should be the name of an index.
The value should be the term you want to query the index for.
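One way to picture how such a mapping is evaluated is to look up each term in its index and keep only the objects found by every lookup. The sketch below assumes that the terms are combined with AND, which the text does not state; `query_catalog` and the sample indexes are hypothetical.

```python
def query_catalog(indexes, query):
    """Return paths matching every (index name, term) pair in the query mapping."""
    result = None
    for index_name, term in query.items():
        matches = set(indexes[index_name].get(term, ()))
        # Assumption: results from different indexes are intersected.
        result = matches if result is None else result & matches
    return sorted(result or ())


indexes = {
    "meta_type": {
        "DTML Document": ["/Document1", "/Document2"],
        "Image": ["/logo"],
    },
    "author": {
        "amos": ["/Document2", "/logo"],
        "michel": ["/Document1"],
    },
}
found = query_catalog(indexes, {"meta_type": "DTML Document", "author": "amos"})
```

Only the document that is both a DTML Document and written by amos survives the intersection.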
Searching for a Certain Date - Example
Range Searching
You may want to search for a whole range of information, like all
the objects created after a certain date.
Date Range Search - Example
Range searches can be done easily with date fields:
<dtml-var standard_html_header>
<form action="search" method="get">
<TABLE>
<TR VALIGN="TOP">
<TD><p>containing the text:</p></TD>
<TD><input name="text_content" value=""></TD>
</TR>
<TR VALIGN="TOP">
<TD><p>with the type of:</p></TD>
<TD>
<select name="meta_type:list" size=6 MULTIPLE>
<dtml-in expr="uniqueValuesFor('meta_type')">
<option value="<dtml-var sequence-item>"><dtml-var sequence-item></option>
</dtml-in>
</select>
</TD>
</TR>
<TR>
<TD><p>modified since:</p></TD>
<TD>
<input type="hidden" name="date_usage" value="range:min">
<select name="date:date">
<option value="<dtml-var "ZopeTime(0)" >">Ever</option>
<option value="<dtml-var "ZopeTime() - 1" >">Yesterday</option>
<option value="<dtml-var "ZopeTime() - 7" >">Last Week</option>
<option value="<dtml-var "ZopeTime() - 30" >">Last Month</option>
<option value="<dtml-var "ZopeTime() - 365" >">Last Year</option>
<dtml-if "_.hasattr(AUTHENTICATED_USER,'prev_visit')">
<option value="<dtml-var "AUTHENTICATED_USER.prev_visit">">
Last Visit (<dtml-var "AUTHENTICATED_USER.prev_visit" fmt=Date>)
</option>
</dtml-if>
</select>
</TD>
</TR>
<tr><td></td>
<td><input type="submit" value=" Search "><input type="reset" value=" Clear ">
</td>
</tr>
</TABLE>
</form>
<dtml-var standard_html_footer>
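The form above submits a date together with date_usage set to range:min, which asks for everything at or after that date. A sorted index can answer such a lower-bound query with a binary search. The class below is a sketch of that idea, not ZCatalog's implementation; `SortedDateIndex` and `search_min` are hypothetical names, and the dates are made up.

```python
from bisect import bisect_left
from datetime import date, timedelta


class SortedDateIndex:
    """Toy field index that keeps (date, path) pairs sorted for range queries."""

    def __init__(self):
        self._entries = []

    def index_object(self, path, when):
        self._entries.append((when, path))
        self._entries.sort()  # simple, not efficient; fine for a sketch

    def search_min(self, minimum):
        """Return paths whose date is greater than or equal to minimum."""
        # A 1-tuple sorts before any (date, path) pair with the same date.
        start = bisect_left(self._entries, (minimum,))
        return [path for _, path in self._entries[start:]]


today = date(2000, 6, 15)  # fixed date so the example is repeatable
index = SortedDateIndex()
index.index_object("/old", today - timedelta(days=400))
index.index_object("/recent", today - timedelta(days=3))
index.index_object("/new", today)
last_week = index.search_min(today - timedelta(days=7))
```

The "Ever" option in the form corresponds to a minimum so early that every entry qualifies.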
Defining Record Objects with Meta-Data
Record objects work just like Brains from ZSQLMethods, with the
exception of data_record_id_.
The schema and values of a record object come from the Meta Data table.
This is useful when you want to present information on a report page.
You should only create the minimum amount of meta data you need for
your report, because storing a lot of meta data can consume excessive
resources.
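The relationship between the Meta Data table and record objects can be sketched with a named tuple whose fields are the table's columns. This is an illustration only: `make_record_class`, `records_for`, and the sample columns are hypothetical, and real record objects are not named tuples.

```python
from collections import namedtuple


def make_record_class(schema):
    """Build a lightweight record type whose fields are the meta data columns."""
    return namedtuple("Record", schema)


def records_for(paths, table, record_class):
    """Wrap the stored meta data row of each path in a record object."""
    return [record_class(*table[path]) for path in paths]


# Hypothetical schema: only the columns the report page needs.
Record = make_record_class(["title", "meta_type", "author"])
metadata_table = {
    "/Document1": ("Zope Intro", "DTML Document", "michel"),
    "/logo": ("Site Logo", "Image", "amos"),
}
records = records_for(["/Document1", "/logo"], metadata_table, Record)
```

A report can read `records[0].title` without loading the cataloged object itself, which is why keeping the schema small matters.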
Fancy Report Form - Example
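[Example listing not recoverable: the DTML tags were lost. The surviving
text shows a report that prints "Found ... items", adds 'matching "..."'
when REQUEST['text_content'] is set, lists each result in a table with
the columns Type, Title, Last modified, and Author, and prints "There
were no results." for an empty result set.]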
Truncated summary synopsis of content - Example
XXX