“That Was Then: Canonicity in the Trésor”

Susy C. Santos, University of Manitoba, umsant06@UManitoba.CA
Paul A. Fortier, Centre on Aging, University of Manitoba, Fortier@cc.umanitoba.ca

The Trésor de la Langue Française (TLF) corpus (http://www.lib.uchicago.edu/efts/ARTFL/databases/TLF/index.html) was set up almost half a century ago. When one reads the description of how this was done, the distance from present-day concerns becomes evident. Professor Imbs quite openly admits that the goal was to reflect "elite" usage of the French language; texts were chosen after consultation of histories of literature, some of which were quite dated even then (Imbs 1971, I, xv-xl). Considerations of inclusiveness and representativity, as discussed in Scholes (1992) or von Hallberg (1984), do not seem to have concerned the committee which finalized the corpus. One is entitled to wonder to what extent this corpus represents the interests of scholars of French literature half a century later.

Purpose

It is legitimate to ask to what extent the texts included in the TLF database represent important trends in French literature, judged both by what interested scholars at the time the corpus was constituted and by what has interested scholars since. More specifically, one can test whether the choices embodied in the TLF reflect what scholars of the time judged important by comparing, for a given genre (the novel), the number of texts by each author chosen for the TLF with the number of lines dedicated to that author in the Oxford Companion to French Literature (Harvey & Heseltine 1959). Similarly, the MLA Bibliography (http://www.mla.org/publications/bibliography) provides online data showing the number of publications in the modern languages and literatures for the periods 1963-90 and 1991 to the present. A comparison between the number of publications mentioning a novelist found in this bibliography and the number of texts by the same novelist in the TLF will show the extent to which the choices made by the TLF group have been confirmed by the interest of later scholars. Given the volume of data involved, these questions must be dealt with statistically.

Data

A subset of the TLF database was chosen for analysis: novels published between 1789 and 1954 (see Table 1). The name of the novelist (Author) and the number of novel texts included in the database for each writer (Texts) were recorded, along with the publication date of the text included in the database (Pub Date). When more than one novel by a given author is in the TLF, Pub Date records the date of the earliest one published. Authors better known for genres other than prose fiction were removed from the test data, because they would be a source of ambiguity. These numbers were compared to three series of test data. The column OxC in Table 1 records the number of lines devoted to the novelist, and to the included novels by that author, in the Oxford Companion to French Literature (Harvey & Heseltine 1959), a volume contemporary with the formation of the TLF database. Columns MLA 1 and MLA 2 record the number of articles mentioning the novelist or work(s) found in the MLA online bibliography of learned articles dealing with language and literature. MLA 1 covers the period 1963-1990 and MLA 2 the period 1991-2000. The analysis was first run on the entire set of 128 frequencies concerning novels. Subsequently, subsets of roughly equal numbers of authors were generated, covering the periods 1789-1859 (33), 1860-1907 (35), 1908-23 (25), and 1925-54 (35).

Table 1: Novelists in the TLF Subset and Test Data (first rows)
Author	Pub Date	Texts	OxC	MLA 1	MLA 2
Abellio	1946	1	0	9	0
About	1857	2	14	1	0
Adam	1902	1	25	1	4
Alain-Fournier	1913	1	93	29	4
Ambriere	1946	1	0	1	0
Aragon	1936	1	25	445	305
Arland	1929	1	0	37	4
Ayme	1933	1	7	38	9
Baillon	1927	1	0	3	6
Balzac	1824	16	577	1986	781
Barbusse	1916	1	16	52	13
Barres	1888	5	87	93	72
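
The division into period subsets described in the Data section can be illustrated with a minimal sketch. It uses only the twelve rows shown in Table 1 and the period boundaries given in the text; the names ROWS, PERIODS, period_of and group_by_period are hypothetical and are not taken from the original study, which was carried out on all 128 novelists.

```python
# Rows copied from the Table 1 rows shown above:
# (author, pub_date, texts, oxc, mla1, mla2)
ROWS = [
    ("Abellio", 1946, 1, 0, 9, 0),
    ("About", 1857, 2, 14, 1, 0),
    ("Adam", 1902, 1, 25, 1, 4),
    ("Alain-Fournier", 1913, 1, 93, 29, 4),
    ("Ambriere", 1946, 1, 0, 1, 0),
    ("Aragon", 1936, 1, 25, 445, 305),
    ("Arland", 1929, 1, 0, 37, 4),
    ("Ayme", 1933, 1, 7, 38, 9),
    ("Baillon", 1927, 1, 0, 3, 6),
    ("Balzac", 1824, 16, 577, 1986, 781),
    ("Barbusse", 1916, 1, 16, 52, 13),
    ("Barres", 1888, 5, 87, 93, 72),
]

# Period boundaries exactly as stated in the text (inclusive on both ends).
PERIODS = [(1789, 1859), (1860, 1907), (1908, 1923), (1925, 1954)]


def period_of(pub_date):
    """Return the (start, end) period containing pub_date, or None."""
    for start, end in PERIODS:
        if start <= pub_date <= end:
            return (start, end)
    return None


def group_by_period(rows):
    """Map each period to the list of authors whose Pub Date falls in it."""
    groups = {period: [] for period in PERIODS}
    for row in rows:
        period = period_of(row[1])
        if period is not None:
            groups[period].append(row[0])
    return groups


if __name__ == "__main__":
    for (start, end), authors in group_by_period(ROWS).items():
        print(f"{start}-{end}: {len(authors)} authors: {', '.join(authors)}")
```

Because the boundaries are taken literally from the text, a Pub Date of 1924 falls into none of the four periods in this sketch.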

Method

A glance at the frequencies of the texts recorded for individual authors shows a large number of authors with one text and a very small number of authors with ten or more, a distribution pattern quite familiar to people who work with word frequencies in natural languages. These data do not form the bell-shaped curve typical of the Gaussian or normal distribution. Since the data are not normally distributed, Pearson's product-moment correlation analysis cannot legitimately be used on them. Similarly, these data would produce a very high proportion of expected values smaller than five in a contingency table for a chi-squared analysis, so this method cannot be employed either. The usual way of handling such a problem (grouping the data) is not appropriate, since it is the treatment of individual authors which is of interest. Spearman's rank correlation analysis requires neither normally distributed data nor expected frequencies greater than five; it has been chosen as the primary analytic technique and applied in pairwise fashion to the full data set and to the four subsets. In addition, the jackknifed outlier analysis provided by JMP-IN (Sall & Lehman 1996) has been used to identify the authors whose values deviate most from the trends in the data.
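
As an illustration of the rank-based approach, the following sketch computes Spearman's rho as Pearson's coefficient applied to ranks, with tied values given the mean of the ranks they occupy. It is written in plain Python and is not the JMP-IN procedure used in the study; the function names are hypothetical, and the TEXTS and OXC values are only the twelve rows shown in Table 1, so the result does not reproduce the coefficients reported for the full set of 128 novelists.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either sequence is constant.
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Texts and OxC columns for the twelve rows shown in Table 1.
TEXTS = [1, 2, 1, 1, 1, 1, 1, 1, 1, 16, 1, 5]
OXC = [0, 14, 25, 93, 0, 25, 0, 7, 0, 577, 16, 87]

if __name__ == "__main__":
    print(f"rho(Texts, OxC) on the excerpt: {spearman_rho(TEXTS, OXC):.4f}")
```

Working on ranks means that an author such as Balzac, with 16 texts, counts only as the highest-ranked value and not as an extreme magnitude, which is why the method tolerates the skewed distribution described above.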

Results

Taken as a whole, the data show a high degree of correlation among the number of texts in the TLF database, the number of lines in the Oxford Companion, and the two sets of MLA Bibliography data (see Table 2). For every pair of variables, the probability that the correlation is the result of chance alone is below .0001.

Table 2: Nonparametric Measure of Association
Variable	by Variable	Spearman Rho	Prob>|Rho|
OxC	Texts	0.5528	<.0001
MLA_1	Texts	0.4475	<.0001
MLA_1	OxC	0.6101	<.0001
MLA_2	Texts	0.4047	<.0001
MLA_2	OxC	0.5918	<.0001
MLA_2	MLA_1	0.9084	<.0001
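
The order of magnitude of the probabilities in Table 2 can be checked with a rough sketch. It assumes n = 128 (the number of frequencies given in the Data section), converts each rho to the usual statistic t = rho * sqrt((n - 2) / (1 - rho^2)), and then uses a normal approximation to the t distribution for the two-sided probability. This is an approximation chosen for simplicity, not necessarily the computation performed by JMP-IN, and the names used are hypothetical.

```python
from math import erfc, sqrt

N = 128  # number of frequencies, as stated in the Data section

# Spearman coefficients copied from Table 2.
RHOS = {
    ("OxC", "Texts"): 0.5528,
    ("MLA_1", "Texts"): 0.4475,
    ("MLA_1", "OxC"): 0.6101,
    ("MLA_2", "Texts"): 0.4047,
    ("MLA_2", "OxC"): 0.5918,
    ("MLA_2", "MLA_1"): 0.9084,
}


def t_statistic(rho, n):
    """t statistic for testing rho = 0 with n - 2 degrees of freedom."""
    return rho * sqrt((n - 2) / (1 - rho * rho))


def approx_two_sided_p(rho, n):
    """Two-sided probability, approximating the t distribution by a normal."""
    t = abs(t_statistic(rho, n))
    return erfc(t / sqrt(2))  # equals 2 * (1 - Phi(t))


if __name__ == "__main__":
    for (a, b), rho in RHOS.items():
        t = t_statistic(rho, N)
        p = approx_two_sided_p(rho, N)
        print(f"{a} by {b}: rho={rho:.4f}  t={t:.2f}  approx p={p:.2e}")
```

Even the weakest correlation in the table (MLA_2 by Texts) yields a t statistic close to 5 under these assumptions, which is consistent with the reported probabilities below .0001.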

When the data are divided into the four periods, the correlations are higher in the earlier periods than in the later ones. Outliers in the two earlier periods tend to be the greats of French literature, such as Balzac, Stendhal and Zola, whereas in the later periods they are frequently novelists whose literary fortunes are less clear, such as Simenon or Giono.
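
The idea behind a jackknifed outlier analysis can be sketched as follows, on the assumption that the jackknifed distance of an observation is its Mahalanobis distance from the mean, with the mean and covariance estimated from all the other observations. The sketch is limited to two variables (Texts and OxC) and to the twelve rows shown in Table 1, and it works on raw values; these are simplifying assumptions, the function names are hypothetical, and the code does not reproduce the JMP-IN output.

```python
from math import sqrt

# Authors with (Texts, OxC) pairs for the twelve rows shown in Table 1.
AUTHORS = ["Abellio", "About", "Adam", "Alain-Fournier", "Ambriere", "Aragon",
           "Arland", "Ayme", "Baillon", "Balzac", "Barbusse", "Barres"]
POINTS = [(1, 0), (2, 14), (1, 25), (1, 93), (1, 0), (1, 25),
          (1, 0), (1, 7), (1, 0), (16, 577), (1, 16), (5, 87)]


def mean_and_cov(points):
    """Sample mean and 2x2 covariance, returned as (mean, (sxx, sxy, syy))."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    return (mean_x, mean_y), (sxx, sxy, syy)


def mahalanobis(point, mean, cov):
    """Mahalanobis distance in two dimensions, using the 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # zero only if the points are collinear
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    d_squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return sqrt(d_squared)


def jackknife_distances(points):
    """Distance of each point from estimates that leave that point out."""
    distances = []
    for i, point in enumerate(points):
        rest = points[:i] + points[i + 1:]
        mean, cov = mean_and_cov(rest)
        distances.append(mahalanobis(point, mean, cov))
    return distances


if __name__ == "__main__":
    ranked = sorted(zip(jackknife_distances(POINTS), AUTHORS), reverse=True)
    for distance, author in ranked:
        print(f"{author:15s} {distance:6.2f}")
```

Leaving each author out of the estimates prevents an extreme case from masking itself by inflating the covariance; on these twelve rows the largest distance belongs to Balzac, in line with the outliers named above for the earlier periods.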

Conclusion

The analysis carried out on the number of novel texts included in the TLF database shows that the selection is close to what a different team of scholars might have drawn up in the late 1950s. Similarly, the works included do correspond, particularly for the period up to 1908, to what scholars of our day find sufficiently interesting to discuss in their published studies. It is thus reasonable to conclude that the TLF database is a valid representation of important French literary texts for the period from 1789 to 1954. As more and more databases become commercially available, the method presented here for validating the representativity of a database using readily available online bibliographical information would seem to have a significance which goes beyond modern French literature.

Acknowledgements

The research reported here has been supported by the Social Sciences and Humanities Research Council of Canada (SSHRCC) under grant number 410-98-1348.

Bibliography

Paul Harvey and J. E. Heseltine. The Oxford Companion to French Literature. Oxford: Oxford UP, 1959.

Paul Imbs. Le Trésor de la Langue Française: Dictionnaire de la langue du XIXe et du XXe siècle. Paris: CNRS, 1971. 16 vols.

John Sall and Ann Lehman. JMP Start Statistics. Belmont, Ca.: SAS Institute, 1996.

Robert Scholes. “Canonicity and Textuality.” Introduction to Scholarship in Modern Languages and Literatures. Ed. Joseph Gibaldi. New York: MLA, 1992.

Robert von Hallberg, ed. Canons. Chicago: U of Chicago P, 1984.