Computational tools for identifying related passages in large corpora

The term “intertextuality” describes a range of relationships among literary texts, including quotation, imitation, reference, and other types of connection. Such relationships vary in scope, from a single verbal echo of an earlier phrase to systemic and sustained engagement between two or more works. Understanding how such connections create meaning, whether through similarity or contrast, is a crucial element in the making and reading of literature. The study of these relationships has thus been a feature of literary scholarship since ancient Greek critics first read Homer’s Iliad over two millennia ago, and it has been a core component of modern humanities research since the term was coined by the literary theorist Julia Kristeva in the 1960s. The field of potential literary intertexts is vast, even within a small and tightly knit corpus composed in a single language, and hence far exceeds the knowledge of any individual researcher.
QCL develops computational methods to aid the detection and understanding of intertextual relationships. Although no computational method can replace the critical intuition required to make sense of the thematic implications of specific intertextual connections, computation holds tremendous promise for enabling the discovery and organization of intertextual relationships on a grand scale. This capability will provide at once a panoptic view of the literary tradition and a detailed picture of unfamiliar connections among individual works. QCL is currently working on a suite of tools for use on Latin and ancient Greek texts. In particular, we draw on bioinformatic techniques, such as sequence alignment, designed for the identification of homologous (similar) gene sequences.
A crucial problem in both biology and literary study is the computational identification of inexact similarities, such as those found when comparing the sequence of a particular human gene to its mouse homolog or when examining a passage that paraphrases an earlier author. Our first publicly available tool, Filum, uses sequence alignment to identify related phrases in Latin literature on the basis of a character-by-character similarity score. In a 2015 article in Dictynna, we reported an initial application of Filum to Vergilian intertextuality in Silius Italicus’ epic poem Punica. Current areas of focus include the validation of Filum as an effective method for high-throughput identification of intertexts, the development of computational tools for finding anagrams in Latin literature, and the development of tools for locating related phrases between ancient Greek and Latin texts.
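The idea of a character-by-character similarity score can be illustrated with a minimal sketch. This is not Filum's implementation, whose algorithm and scoring are not described here; it is a textbook Smith-Waterman local alignment, and the function name and the match, mismatch, and gap values are assumptions chosen purely for illustration.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2,
                          mismatch: int = -1,
                          gap: int = -1) -> int:
    """Return the best Smith-Waterman local alignment score of a and b.

    Hypothetical helper for illustration only; the scoring values are
    assumed, not taken from Filum.
    """
    # prev holds the previous row of the dynamic-programming table.
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]  # column 0 is always 0 in local alignment
        for j, ch_b in enumerate(b, start=1):
            # Extend the diagonal with a match or a mismatch.
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A cell never drops below 0, so an alignment can restart anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

Under these assumed parameters, two phrases that share most of their characters in the same order, such as "arma virumque cano" and "arma virosque cano", score close to an exact repetition, while phrases with no characters in common score zero. A practical tool would also need to normalize orthography and choose a score threshold, neither of which this sketch attempts.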