The following form will search the text of Jane Austen's novels, and of
Lady Susan and The Watsons.
The text of the novels is stored in a file in which each sentence is put
on a separate line, and searching is done within each line. The result
returned by the search will be those lines on which the requested search
pattern is found. All searching is done in a case-insensitive way.
These are the details of the different search methods:
Phrase (keyword sequence) search:
This search finds any passage in which all the keywords
occur in the same order that they were entered into the form box.
Each individual keyword is matched as a whole word (in the same way as
explained for the next search method below), and all punctuation is
ignored, both in the search pattern and in the text being searched.
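As a rough illustration of the phrase search, here is a minimal Python sketch. It is not the code behind the form. It assumes that the keywords must appear as consecutive words (the description above does not say whether other words may come between them), and the function names `words` and `phrase_match` are invented for this example.

```python
import re

def words(text):
    """Return the lower-cased alphabetic words of text; punctuation is dropped."""
    return re.findall(r"[a-z]+", text.lower())

def phrase_match(query, sentence):
    """True if the query words occur in order as a consecutive run in sentence.

    Assumption: adjacency is required. Each keyword is compared as a whole word.
    """
    q, s = words(query), words(sentence)
    if not q:
        return False
    return any(s[i:i + len(q)] == q for i in range(len(s) - len(q) + 1))

sentence = "She knew French; Italian she did not."
print(phrase_match("French, Italian", sentence))   # punctuation is ignored on both sides
print(phrase_match("Italian French", sentence))    # wrong order, so no match
```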
Keyword search:
This type of search most closely resembles those of WWW search
engines (though it is not exactly the same). In this search, all punctuation
is ignored, each sequence of alphabetic characters is searched for as a whole
word, and every sentence that contains any of these keywords
is returned as a result of the search (in other words, this is a logical "or"
search). So searching for "hat" will return only the sentences that contain
the word "hat", and searching for "French, Italian" will return all sentences
which contain either the word "French" or the word "Italian" (the comma is
ignored here, except in its function of separating the two keywords).
When using this search, you need to search separately for noun plurals,
inflected forms of verbs, etc. (So searching for "hat" will not find
sentences which contain the word "hats", unless they also happen to contain
the word "hat".) However, you can find all occurrences of words beginning
with a certain sequence of letters by ending a keyword with the special "*"
wildcard character (so that using the search keyword "hat*" will find
sentences that contain any of the words "hat", "hats", "hatred", etc.).
A final refinement of this search is that a hyphen directly preceded and
followed by alphabetic characters forms a special composite keyword, which
will match against cases in the e-texts where the hyphen is present, absent,
or replaced by a space. So the keyword "mantel-piece" will return sentences
that contain "mantel-piece", "mantelpiece", or "mantel piece". (This feature
gets around inconsistencies of hyphenation.)
(Note that any hyphen which does not have letters on both sides will be
ignored, as will any alphabetic characters that occur after an asterisk
character in a search keyword.)
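The keyword rules described above (whole-word matching, logical "or", the trailing "*" wildcard, and composite hyphenated keywords) can be sketched in Python as follows. This is an illustration under assumptions, not the form's actual implementation: the names `KEYWORD`, `keyword_pattern` and `keyword_search` are invented here, and "whole word" is taken to mean "not adjacent to another letter".

```python
import re

# A keyword is a run of letters, optionally with hyphens between letters,
# optionally followed by "*" (any letters typed after the "*" are discarded).
KEYWORD = re.compile(r"([A-Za-z]+(?:-[A-Za-z]+)*)(\*[A-Za-z]*)?")

def keyword_pattern(query):
    """Build one case-insensitive regex matching any keyword in query."""
    alternatives = []
    for m in KEYWORD.finditer(query):
        # "mantel-piece" matches "mantel-piece", "mantelpiece", "mantel piece".
        body = "[- ]?".join(re.escape(part) for part in m.group(1).split("-"))
        # Without "*", the keyword must not be followed by another letter.
        tail = "" if m.group(2) else r"(?![A-Za-z])"
        alternatives.append(r"(?<![A-Za-z])" + body + tail)
    if not alternatives:
        return None
    return re.compile("|".join(alternatives), re.IGNORECASE)

def keyword_search(query, sentences):
    """Return every sentence containing at least one keyword (logical "or")."""
    pattern = keyword_pattern(query)
    if pattern is None:
        return []
    return [s for s in sentences if pattern.search(s)]

sample = [
    "He took off his hat.",
    "She wore two hats.",
    "Her hatred was plain.",
    "What is that?",
    "A clock stood on the mantelpiece.",
    "A clock stood on the mantel piece.",
    "He spoke French well.",
]
print(keyword_search("hat", sample))
print(keyword_search("hat*", sample))
print(keyword_search("mantel-piece", sample))
```

Letter lookarounds are used instead of `\b` because `\b` treats "_" as a word character, and some of the e-texts mark italics with "_".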
Exact String Search:
This search simply takes the exact string of characters that you
have typed in (except that any leading or trailing spaces are removed), and
returns all the sentences in the e-texts which contain this precise sequence
(including punctuation characters) -- even where the string matches against parts of
words, rather than whole words. So if you do an exact string search on "hat",
the search will return the rather unmanageably large list of all sentences
which happen to include words that contain the sequence of letters
h, a, t. And searching for "French, Italian" will only return the
sentences where the words "French" and "Italian" occur next to each other, in
this order, and are separated by a comma.
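For comparison, a minimal Python sketch of the exact string search is shown here. It is an illustration only; the function name `exact_search` and the sample sentences are invented for this example.

```python
def exact_search(query, sentences):
    """Return sentences containing the query as a raw substring.

    Leading and trailing spaces are removed from the query, punctuation is
    kept, and the comparison is case-insensitive, as described above.
    """
    needle = query.strip(" ").lower()
    if not needle:
        return []
    return [s for s in sentences if needle in s.lower()]

sample = [
    "What is that?",
    "He took off his hat.",
    "She read French, Italian and German.",
    "She read French and Italian.",
]
print(exact_search(" hat ", sample))            # also matches inside "What" and "that"
print(exact_search("French, Italian", sample))  # the comma must be present
```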
Regular Expression Search (egrep):
This type of search allows you to use the regular expression
wildcard language which is available with the Unix egrep command.
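The sketch below imitates this kind of search with Python's `re` module. Note that Python's regular expression dialect is not identical to egrep's, so this is only an approximation. The example patterns use just grouping, alternation and "?", which behave the same way in both. The function name `regex_search` and the sample sentences are invented for this example.

```python
import re

def regex_search(pattern, sentences):
    """Return sentences in which the pattern matches anywhere, ignoring case."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [s for s in sentences if compiled.search(s)]

sample = [
    "It was an honour to meet her.",
    "Her honor was safe.",
    "An hour passed.",
    "He spoke French and Italian.",
]
print(regex_search("honou?r", sample))            # optional "u"
print(regex_search("(French|German)", sample))    # either alternative
```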
Search results are separated by novel, but no indication of each
sentence's exact location within a novel is given, unless the "Show Chapter
and Volume Markers" box is clicked (in which case all
chapter headings will be shown). Also, the "Show Surrounding Context" option
in the search form above causes the three sentences which precede and follow
every matching sentence to be shown. (Selecting "Show Chapter and Volume
Markers" has no effect if the "Show Surrounding Context" option is also
selected.)
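The "Show Surrounding Context" behaviour can be pictured with the following Python sketch. It is an illustration under assumptions: the name `with_context` is invented, and merging overlapping context windows (so that no sentence is shown twice) is a choice made here, not something the description specifies.

```python
def with_context(sentences, is_match, width=3):
    """Return each matching sentence plus `width` sentences on either side.

    Overlapping windows are merged and the original order is preserved.
    """
    keep = set()
    for i, sentence in enumerate(sentences):
        if is_match(sentence):
            start = max(0, i - width)
            stop = min(len(sentences), i + width + 1)
            keep.update(range(start, stop))
    return [sentences[i] for i in sorted(keep)]

sentences = [f"Sentence {n}." for n in range(10)]
print(with_context(sentences, lambda s: s == "Sentence 5."))
```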
Caveats: Note that publicly-available (i.e. non-scholarly)
e-texts of the novels were used, which often have modernized punctuation and
spelling. (The e-text of Lady Susan is closer to the original
manuscript, and has some occasional idiosyncratic Jane Austen spellings, such
as "ei" for "ie".) Sentences from the middle of a letter, or a multi-sentence
quotation, do not have any punctuation that indicates they are not part of the
narrative. Paragraph breaks in the original texts have not been preserved.
Chapter numbering has not yet been harmonized between the e-texts of the
different novels (i.e. roman numerals vs. decimal numbers, and volume-relative
chapter numbering vs. whole-book chapter numbering).
Details of format of texts: Abbreviations such as "Mr",
"Mrs", "Col" and "St" do not have any period at the end. There is no sequence
of more than one space character in the search text. In the texts of
Mansfield Park, Pride and Prejudice, Lady Susan, and The Watsons, the
beginnings and ends of italics are marked by "_" characters; italicization is
not marked in the other novels. The e-texts of the novels are otherwise
entirely in plain ASCII form (no HTML or other markup), except that there is
a "<P>" tag at the end of each line.
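If you work with lines in this format yourself, the markup described above is easy to strip. The following Python sketch assumes a line ends with a literal "<P>" tag and uses "_" only for italics. The function names `clean_line` and `italic_spans` and the sample line are invented for this example.

```python
import re

def clean_line(line):
    """Remove the trailing <P> tag and the "_" italics markers from one line."""
    line = line.rstrip("\n")
    line = re.sub(r"\s*<P>\s*$", "", line)
    return line.replace("_", "")

def italic_spans(line):
    """Return the pieces of text that were marked as italic with "_"."""
    return re.findall(r"_([^_]+)_", line)

raw = "She was _not_ pleased with Mr Darcy.<P>\n"
print(clean_line(raw))
print(italic_spans(raw))
```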