Guide

The end-user manual for PageSeeder

Understanding Search

Searching for terms or phrases in a collection of documents can be frustrating. When text that is easy to spot in a document does not show up in the search results, it can seem as though the system is broken.

However, as is often the case with computers, the system is rarely broken. This document explains how PageSeeder search works, which we hope will help with both productivity and expectations.

Examples of searches

While the following terms may seem straightforward, they are more complex to process than they appear.

509 BC

Romulus Augustus

5th century

476 AD

Source 5.75

Rome

Roman Empire

ID02.07

ID2.7

Fig 2.48

DNA

9/11

Binna Binna

Fire-breathers

37°C

grams (g)
kilograms (kg)
tonnes (t)

384–322

color

Dr. Heather Builth

Search Behaviour

Dictionary Search – the dictionary content essentially has stemming predefined. A word such as 'swimmer' should be treated as absolute: it is searched as is, not stemmed to 'swim'.

Fuzzy Search – uses stemming to find a term.

Phrase Search – uses a proximity value. A value of perhaps 10 or 15 should ensure close enough proximity without generating too many false positives. This requires confirmation.
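As a rough illustration of the proximity idea, the sketch below checks whether all the terms of a phrase occur within a given number of token positions of each other. The function name `within_proximity`, the default of 10 and the way distance is measured (the span between the first and last matched positions, ignoring word order) are assumptions made for this example, not a description of how PageSeeder measures proximity.

```python
from itertools import product


def within_proximity(tokens, terms, max_distance=10):
    """Return True if every term occurs and some set of occurrences
    spans at most max_distance token positions (order is ignored)."""
    # Positions at which each term occurs in the token list.
    positions = [[i for i, tok in enumerate(tokens) if tok == term] for term in terms]
    if any(not found for found in positions):
        return False  # at least one term is missing entirely
    # Try every combination of occurrences (fine for short texts).
    return any(max(combo) - min(combo) <= max_distance for combo in product(*positions))


tokens = "the roman empire in the west fell when romulus augustus was deposed".split()

print(within_proximity(tokens, ["roman", "empire"]))                   # adjacent terms
print(within_proximity(tokens, ["roman", "augustus"], max_distance=5))  # too far apart
```

A smaller proximity value returns fewer, tighter matches; a larger one finds more but increases the risk of false positives.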

String Preparation – transforms the text to lower case and strips special characters.

Tokenization – special characters must be stripped from the content being indexed in the same way that they are stripped from the text of a query. In other words, if the term 'Motörhead' is converted to 'motorhead' for indexing, the query text should be processed in the same way.
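The point about consistency can be shown with a small sketch in which one function prepares both the indexed content and the query text. The names `normalize` and `tokenize` are made up for this example, and the choice to treat every non-alphanumeric character as a separator is an assumption; terms containing a '.' are discussed under Special Considerations and would need separate handling.

```python
import unicodedata


def normalize(text):
    """Lower-case the text and remove accents (e.g. 'Motörhead' -> 'motorhead')."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.lower()


def tokenize(text):
    """Split normalized text into tokens, treating special characters as separators."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in normalize(text))
    return cleaned.split()


# The same function is applied at index time and at query time,
# so both sides agree on what a term looks like.
indexed = tokenize("Motörhead, live!")
query = tokenize("motorhead")
print(indexed, query)
```

Because both sides pass through the same steps, a query for 'motorhead' matches content that contains 'Motörhead'.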

Search Scenario

(this scenario is distinct from Dictionary lookup)

Dictionary / Glossary component

1 Word ABCD is searched

2 ABCD is searched for as a key word or as a word variation in the dictionary

3 If ABCD is found as either a key word or a variation, the complete definition is returned (this does not include cases where ABCD is part of a phrase with a definition)

4 If ABCD isn’t found as a key word or a variation, then the text “ABCD was not found in the dictionary” is returned
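Steps 1 to 4 can be sketched as follows. The dictionary structure, the sample entries and the function name `lookup` are hypothetical and exist only to illustrate the flow; they do not reflect how the dictionary is actually stored.

```python
# Hypothetical sample data: each key word has variations and a definition.
DICTIONARY = {
    "swim": {"variations": ["swims", "swimming"], "definition": "To move through water."},
    "roman empire": {"variations": [], "definition": "The state ruled from Rome."},
}


def lookup(word, dictionary):
    """Return the definition if the word is a key word or a variation,
    otherwise the 'not found' message."""
    key = word.lower()
    for key_word, entry in dictionary.items():
        # Whole-entry comparison only: a word that is merely part of a
        # phrase entry (e.g. 'empire' in 'roman empire') does not match.
        if key == key_word or key in entry["variations"]:
            return entry["definition"]
    return f"{word} was not found in the dictionary"


print(lookup("swimming", DICTIONARY))
print(lookup("empire", DICTIONARY))
```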

Content Component

The scope of the search is, in order of importance: chunk titles, chunk content, image captions, extra title links, question titles, question content and page numbers.

5 The items ABCD is found in are listed based on the following hierarchy:

  • If the word is in the title of a chunk, that chunk appears at the top of the list
  • Followed by the chunk with the most occurrences
  • Similarly for the remaining content, in the order specified above
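The ordering described in step 5 can be sketched for the first two levels of the hierarchy (chunk titles, then chunk content). The field names `title` and `content` and the function `rank_chunks` are assumptions for this example; the remaining fields would follow the same pattern.

```python
def rank_chunks(term, chunks):
    """List matching chunk titles: title matches first, then by occurrence count."""
    term = term.lower()
    scored = []
    for chunk in chunks:
        in_title = term in chunk["title"].lower().split()
        occurrences = chunk["content"].lower().split().count(term)
        if in_title or occurrences:
            scored.append((in_title, occurrences, chunk))
    # Title matches sort first (False < True, so negate), then most occurrences.
    scored.sort(key=lambda item: (not item[0], -item[1]))
    return [chunk["title"] for _, _, chunk in scored]


# Hypothetical sample chunks.
chunks = [
    {"title": "Trade routes", "content": "rome traded with rome and beyond"},
    {"title": "Rome", "content": "the city on the tiber"},
    {"title": "Gaul", "content": "conquered by rome"},
    {"title": "Egypt", "content": "the nile"},
]

print(rank_chunks("Rome", chunks))
```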

6 Where ABCD isn’t found, plural and non-plural variations are searched for (for some words, forming the plural is more complex than trimming or adding an s)

7 If neither ABCD nor its plural variations are found, nothing is listed from the content section

Alternative word searches

8 A list of the five closest words with their definitions is shown (using the Levenshtein distance, and not showing anything greater than the default distance), also including entries where ABCD may form part of a phrase with a definition.
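The Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one word into another. A minimal sketch of step 8 follows; the function names and the `max_distance=2` cut-off are placeholders for this example, since the actual default distance is not stated here, and matching against phrases is not covered.

```python
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_words(query, key_words, limit=5, max_distance=2):
    """Return up to `limit` key words within max_distance edits, closest first."""
    scored = [(levenshtein(query.lower(), word.lower()), word) for word in key_words]
    near = sorted(pair for pair in scored if pair[0] <= max_distance)
    return [word for _, word in near[:limit]]


# Hypothetical key words.
print(closest_words("rome", ["Rome", "Roman", "Romulus", "Dome", "Empire"]))
```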

Special Considerations

A: Where the exact phrase can’t be found, the instances within the default proximity are returned

B: Searches for items that include a ‘.’ need to favour exact matches, as these will be searches for figures or activities that have names like “ID02.07 : Thinkers Keys”, where it is very likely students will search for “ID02.07” or “ID2.7” looking for this item.
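One possible way to let “ID02.07” and “ID2.7” find the same item is to reduce both to a canonical form before comparing them. The sketch below is only an illustration of that idea: the function `canonical_id`, the pattern it accepts and the decision to drop leading zeros are all assumptions, not a description of PageSeeder behaviour.

```python
import re


def canonical_id(text):
    """Reduce a dotted identifier such as 'ID02.07' to a canonical form ('ID2.7').
    Returns None when the text does not look like a dotted identifier."""
    match = re.fullmatch(r"([A-Za-z]+)\s*(\d+)\.(\d+)", text.strip())
    if match is None:
        return None
    prefix, major, minor = match.groups()
    # int() drops leading zeros, so '02' and '2' compare as equal.
    return f"{prefix.upper()}{int(major)}.{int(minor)}"


print(canonical_id("ID02.07"), canonical_id("id2.7"), canonical_id("Rome"))
```

Note that under this assumption “ID2.70” and “ID2.7” remain distinct, because only leading zeros are dropped.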

Stop Words

Stop words are the same as the default Lucene stop words.

http://www.onjava.com/2003/01/15/lucene.html 
