Typographic Neighborhood

Perigrams in The Readers Project

We define a perigram as the basic element in a special variety of word-based n-gram model (or Markov chain). In a standard word-based n-gram model, all possible word combinations in a text may be considered and ranked by frequency. Here we define the perigrams of a given text to be the subset of these phrases that takes typographic neighborhood into account. Our current algorithm collects only those combinations of n words that can be found within a 20-word reading window around any particular word in the text. This definition is intended to include the currently selected word and all the words that might possibly be set adjacent to it in standard prose typography. An n-gram sequence composed of perigrams will contain probabilistically assembled phrases whose vocabulary is constrained by the typographic neighborhood. Its language will thus tend to be more sensitive to the context of the particular passage from which it is assembled.
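The windowed collection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual implementation: the function name collect_perigrams is hypothetical, the window is assumed to be symmetric (half of the 20 words before the selected word and half after it), and words within a combination are assumed to keep their order in the text.

```python
from itertools import combinations


def collect_perigrams(words, n=3, window=20):
    """Collect every combination of n words found in a reading window.

    Assumptions (not stated in the text): the window is symmetric around
    the selected word, and each combination keeps the words in text order.
    """
    half = window // 2
    perigrams = set()
    for i in range(len(words)):
        # The selected word plus up to `half` words on either side.
        neighborhood = words[max(0, i - half): i + half + 1]
        # combinations() preserves the positional order of its input.
        perigrams.update(combinations(neighborhood, n))
    return perigrams


if __name__ == "__main__":
    text = "a b c d".split()
    # With a tiny window, distant pairs such as ("a", "d") are excluded.
    print(sorted(collect_perigrams(text, n=2, window=2)))
```

With a small window the effect of the constraint is easy to see: only words that fall close together in the text can ever be combined into a perigram.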

Note that the perigrams for a text are identified in a preprocessing step chiefly for reasons of efficiency. The frequencies of particular phrases are determined in advance rather than searched for in real time. Extracting perigrams also means that considerably fewer word combinations need to be considered and have data collected for them.
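The efficiency argument can be illustrated with a small precomputed lookup table. The text does not say how phrase frequency is measured, so this sketch makes an assumption: it counts how often each perigram occurs as a contiguous phrase in a reference word list. The names precompute_frequencies and corpus_words are hypothetical.

```python
from collections import Counter


def precompute_frequencies(perigrams, corpus_words, n=3):
    """Build a frequency table restricted to the given perigrams.

    Assumption: frequency means the number of times a perigram appears
    as a contiguous n-word phrase in `corpus_words`.
    """
    counts = Counter(
        tuple(corpus_words[i:i + n])
        for i in range(len(corpus_words) - n + 1)
    )
    # Keep data only for perigrams; all other phrases are discarded.
    return {p: counts[p] for p in perigrams}


if __name__ == "__main__":
    perigrams = {("the", "cat"), ("cat", "mat"), ("dog", "ran")}
    corpus = "the cat sat on the cat mat".split()
    table = precompute_frequencies(perigrams, corpus, n=2)
    # At reading time a frequency is a dictionary lookup, not a search.
    print(table[("the", "cat")])
```

Because the table holds entries only for perigrams, its size depends on the number of perigrams rather than on the number of all possible word combinations in the text.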