Processing of source documents to generate data for indexing, and of queries to generate data for searching, is done in accordance with retrieved tokenization rules and, if desired, retrieved normalization rules. Tokenization rules are used to define exactly what characters (letters, numbers, punctuation...

Patent US7152056 - Apparatus and method for generating data useful in indexing and searching
http://www.google.co.uk/patents/US7152056?utm_source=gb-gplus-share
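
As an illustration of the idea in the abstract, the following is a minimal Python sketch, not the patent's actual method. It assumes that "retrieved" rules can be modelled as a small `Rules` object, that tokenization rules list which extra characters count as part of a token, and that normalization rules are named functions applied to each token. All names here (`Rules`, `process`, `build_index`, `search`) are hypothetical and chosen only for this example. The point it tries to show is that documents and queries pass through the same rule-driven processing, so that indexed terms and query terms are comparable.

```python
import unicodedata
from dataclasses import dataclass, field


@dataclass
class Rules:
    """Hypothetical container for retrieved rules (an assumption of this sketch)."""
    # Tokenization rule: characters, besides letters and digits, kept inside tokens.
    extra_token_chars: str = ""
    # Optional normalization rules, applied in order to every token.
    normalizers: list = field(default_factory=list)


def strip_accents(token):
    # Decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Registry of normalization rules known to this sketch.
NORMALIZERS = {"lowercase": str.lower, "strip_accents": strip_accents}


def tokenize(text, rules):
    """Split text into runs of characters that the rules allow in a token."""
    tokens, current = [], []
    for ch in text:
        if ch.isalnum() or ch in rules.extra_token_chars:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def process(text, rules):
    """Tokenize, then apply any normalization rules. Used for documents and queries."""
    tokens = tokenize(text, rules)
    for name in rules.normalizers:
        tokens = [NORMALIZERS[name](t) for t in tokens]
    return tokens


def build_index(docs, rules):
    """Map each processed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for tok in process(text, rules):
            index.setdefault(tok, set()).add(doc_id)
    return index


def search(index, query, rules):
    """Return ids of documents containing every processed query term."""
    terms = process(query, rules)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

With `Rules(normalizers=["lowercase", "strip_accents"])`, a document containing "Café" and a query for "CAFE" both reduce to the same term, which is why the same rules are applied on both sides. Changing `extra_token_chars` (for example to `"-"`) changes what counts as a single token without touching the rest of the pipeline.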